mirror of
https://github.com/nesquena/hermes-webui.git
synced 2026-05-24 18:50:15 +00:00
f84b6a4e2f
Issue #1458 reports persistent-host crashes (≥1/day) when running the WebUI under launchd KeepAlive on macOS. Root cause: `bootstrap.py` calls `subprocess.Popen([python, "server.py"], start_new_session=True)`, probes /health, then exits 0. Under any process supervisor (launchd, systemd, supervisord, runit, s6), the supervisor sees its tracked PID exit, marks the program as "completed," and respawns it. The new bootstrap fails to bind port 8787 (orphaned server still has it), exits non-zero, supervisor respawns again — loop until the orphan crashes for some other reason and the next respawn finds the port free. This PR addresses Bug #1 of the three failure modes tracked in #1458: the `bootstrap.py` double-fork breaking process supervisors. Bug #2 (state.db FD leak) and Bug #3 (HTTP-unhealthy wedge) remain open under the same issue — they need diagnosis data before a fix can land. Changes ------- 1. `bootstrap.py`: - New `--foreground` argparse flag with help text mentioning launchd / systemd / supervisord. - New `_detect_supervisor()` that returns the env var name for any supervisor it detects: `INVOCATION_ID` / `JOURNAL_STREAM` / `NOTIFY_SOCKET` (systemd, s6), `XPC_SERVICE_NAME` (launchd), `SUPERVISOR_ENABLED` (supervisord), or `HERMES_WEBUI_FOREGROUND` for the explicit user opt-in. Truthy values for the explicit opt-in: `1` / `true` / `yes` / `on` (case-insensitive). - `main()` branches on `args.foreground or _detect_supervisor()`: - **Foreground path:** chdir to `agent_dir or REPO_ROOT`, then `os.execv(python, [python, server_path])` to replace the bootstrap process image with the server. The supervisor sees the long-lived server as the original child. No `wait_for_health` probe — the supervisor's KeepAlive / Restart=on-failure handles liveness. - **Default path:** unchanged. Spawn server as detached child via `Popen + start_new_session=True`, probe /health, return 0. This still works for interactive `bash start.sh` invocations. - Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are now mutated on `os.environ` directly instead of into a local `env` copy so they are inherited across `os.execv`. 2. `docs/supervisor.md` (new): runnable launchd plist, systemd .service, and supervisord conf examples + a diagnostic recipe (`lsof` + ppid chain) for catching the orphan-loop in production. 3. `.gitignore`: allowlist `docs/supervisor.md` (the directory uses an opt-in pattern; matches the existing `!docs/docker.md` precedent). 4. `tests/test_bootstrap_foreground.py` (new): 35 regression tests covering the argparse flag, `_detect_supervisor()` behavior across all five supervisor env vars, the explicit opt-in's truthy/falsy values, and `main()`'s execv-vs-Popen routing decision under each input combination. `os.execv` is monkeypatched in the routing tests — we pin the structural choice (which call is made, with which args, in which cwd, with which env) not the post-exec behavior. Why this scope and no more -------------------------- Bug #2 (state.db FD leak) lists 5 candidate paths and asks the reporter for `lsof -p <pid> | sort | uniq -c | sort -rn | head -20` output to disambiguate. Until that data lands, any "fix" would be speculative — explicitly out of scope per the contributor-pickup comment on the issue. Bug #3 (launchd-running, port-listening, HTTP-unhealthy) was added in @stefanpieter's reply comment. Diagnosis is in flight; no concrete fix shape yet. Also out of scope. Running locally end-to-end verifies the behavior: ``` [bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: --foreground) $ pgrep -af 'server.py' 2997632 /home/.../python /tmp/wt-fix-1458/server.py $ ps -o ppid -p 2997632 2997581 ← bash that ran bootstrap.py — same PID as the original bootstrap $ ps -p 2997581 -o cmd ... bootstrap.py ... ← but exec'd into server.py ``` The same PID that bash forked for `bootstrap.py` is now `server.py`. A supervisor watching that PID would correctly observe the long-lived server. No double-fork. Verification ------------ - 3811 tests pass (`pytest tests/` — full suite, +51 from this PR plus master-merge-in) - All 35 new bootstrap-foreground tests pass - `bash scripts/run-browser-tests.sh` PASS (HTTP API checks against worktree) - `bash scripts/webui_qa_agent.sh 8789` PASS (23/23 visual QA) - Live verified: server starts cleanly under both `--foreground` and `HERMES_WEBUI_FOREGROUND=1`; PID lineage confirms no double-fork Closes #1458 (Bug #1 only). Bugs #2 and #3 remain tracked under the issue.