Issue #1458 reports persistent-host crashes (≥1/day) when running the WebUI under launchd KeepAlive on macOS. Root cause: `bootstrap.py` calls `subprocess.Popen([python, "server.py"], start_new_session=True)`, probes /health, then exits 0. Under any process supervisor (launchd, systemd, supervisord, runit, s6), the supervisor sees its tracked PID exit, marks the program as "completed," and respawns it. The new bootstrap fails to bind port 8787 (the orphaned server still has it) and exits non-zero, so the supervisor respawns it again. The loop continues until the orphan crashes for some other reason and the next respawn finds the port free.

This PR addresses Bug #1 of the three failure modes tracked in #1458: the `bootstrap.py` double-fork breaking process supervisors. Bug #2 (state.db FD leak) and Bug #3 (HTTP-unhealthy wedge) remain open under the same issue; they need diagnosis data before a fix can land.

Changes
-------

1. `bootstrap.py`:
   - New `--foreground` argparse flag with help text mentioning launchd / systemd / supervisord.
   - New `_detect_supervisor()` that returns the env var name for any supervisor it detects: `INVOCATION_ID` / `JOURNAL_STREAM` / `NOTIFY_SOCKET` (systemd, s6), `XPC_SERVICE_NAME` (launchd), `SUPERVISOR_ENABLED` (supervisord), or `HERMES_WEBUI_FOREGROUND` for the explicit user opt-in. Truthy values for the explicit opt-in: `1` / `true` / `yes` / `on` (case-insensitive).
   - `main()` branches on `args.foreground or _detect_supervisor()`:
     - **Foreground path:** chdir to `agent_dir or REPO_ROOT`, then `os.execv(python, [python, server_path])` to replace the bootstrap process image with the server. The supervisor sees the long-lived server as the original child. No `wait_for_health` probe: the supervisor's KeepAlive / Restart=on-failure handles liveness.
     - **Default path:** unchanged. Spawn server as detached child via `Popen + start_new_session=True`, probe /health, return 0. This still works for interactive `bash start.sh` invocations.
   - Resolved env vars (HOST/PORT/STATE_DIR/AGENT_DIR) are now mutated on `os.environ` directly instead of into a local `env` copy so they are inherited across `os.execv`.
2. `docs/supervisor.md` (new): runnable launchd plist, systemd .service, and supervisord conf examples + a diagnostic recipe (`lsof` + ppid chain) for catching the orphan-loop in production.
3. `.gitignore`: allowlist `docs/supervisor.md` (the directory uses an opt-in pattern; matches the existing `!docs/docker.md` precedent).
4. `tests/test_bootstrap_foreground.py` (new): 35 regression tests covering the argparse flag, `_detect_supervisor()` behavior across all five supervisor env vars, the explicit opt-in's truthy/falsy values, and `main()`'s execv-vs-Popen routing decision under each input combination. `os.execv` is monkeypatched in the routing tests: we pin the structural choice (which call is made, with which args, in which cwd, with which env), not the post-exec behavior.

Why this scope and no more
--------------------------

Bug #2 (state.db FD leak) lists 5 candidate paths and asks the reporter for `lsof -p <pid> | sort | uniq -c | sort -rn | head -20` output to disambiguate. Until that data lands, any "fix" would be speculative, and it is explicitly out of scope per the contributor-pickup comment on the issue.

Bug #3 (launchd-running, port-listening, HTTP-unhealthy) was added in @stefanpieter's reply comment. Diagnosis is in flight; no concrete fix shape yet. Also out of scope.

Running locally end-to-end verifies the behavior:

```
[bootstrap] Starting Hermes Web UI on http://127.0.0.1:8789 (foreground mode: --foreground)
$ pgrep -af 'server.py'
2997632 /home/.../python /tmp/wt-fix-1458/server.py
$ ps -o ppid -p 2997632
2997581   ← bash that ran bootstrap.py; same PID as the original bootstrap
$ ps -p 2997581 -o cmd
... bootstrap.py ...   ← but exec'd into server.py
```

The same PID that bash forked for `bootstrap.py` is now `server.py`.
A supervisor watching that PID would correctly observe the long-lived server. No double-fork.

Verification
------------

- 3811 tests pass (`pytest tests/`, full suite; +51 from this PR plus master-merge-in)
- All 35 new bootstrap-foreground tests pass
- `bash scripts/run-browser-tests.sh` PASS (HTTP API checks against worktree)
- `bash scripts/webui_qa_agent.sh 8789` PASS (23/23 visual QA)
- Live verified: server starts cleanly under both `--foreground` and `HERMES_WEBUI_FOREGROUND=1`; PID lineage confirms no double-fork

Closes #1458 (Bug #1 only). Bugs #2 and #3 remain tracked under the issue.
Running Hermes Web UI under a process supervisor
Use a process supervisor (launchd, systemd, supervisord, runit, s6) when you want the Web UI to start at boot, restart on crash, or be managed alongside other services.
TL;DR
Pass --foreground to bootstrap.py (or bash start.sh):
bash start.sh --foreground
Or set HERMES_WEBUI_FOREGROUND=1 in the environment. The Web UI will
auto-detect launchd / systemd / supervisord even without the flag, but being
explicit is safer.
Why --foreground matters
Without it, bootstrap.py does this:
- Spawn server.py as a detached subprocess (start_new_session=True)
- Probe /health until the server is up
- Exit 0
That works for an interactive shell run (./start.sh returns to your
prompt with the server alive in the background). It is broken under any
process supervisor: the supervisor sees its tracked PID exit, marks the job
as completed, and respawns bootstrap.py. The respawn fails to bind port
8787 (the orphaned server still holds it) and exits non-zero, so the
supervisor respawns it again. The loop repeats until the orphan dies and a
respawn finds the port free.
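The failed respawn comes down to a second process trying to bind a port that a live listener already holds. The following is a minimal, self-contained sketch of that condition in Python. It is not code from bootstrap.py or server.py: `try_bind` is a hypothetical helper, and the sketch uses an OS-assigned port rather than 8787 so it can run anywhere.

```python
import errno
import socket


def try_bind(host: str, port: int) -> socket.socket:
    """Bind and listen on (host, port); raise OSError if the bind fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


# Stand-in for the orphaned server: still alive, still holding its port.
orphan = try_bind("127.0.0.1", 0)  # port 0 lets the OS pick a free port
port = orphan.getsockname()[1]

# Stand-in for the respawned bootstrap: asks for the same port and fails.
try:
    try_bind("127.0.0.1", port).close()
    respawn_errno = None
except OSError as exc:
    respawn_errno = exc.errno

print("respawn hit EADDRINUSE:", respawn_errno == errno.EADDRINUSE)
orphan.close()
```

Once the listener is closed (the orphan exits), the same bind succeeds, which is why the loop eventually breaks on its own.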
In foreground mode, bootstrap.py does its setup work and then calls
os.execv to replace its own process with server.py. The supervisor
sees the long-lived server as the original child. KeepAlive=true /
Restart=always work correctly.
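To make the difference between the two modes concrete, here is a rough Python sketch of the routing decision. It assumes a simplified shape: `launch_server` and its injectable `execv` / `popen` parameters are hypothetical names introduced for illustration, and the real bootstrap.py also resolves environment variables, changes directory, and probes /health on the default path.

```python
import os
import subprocess
import sys
from typing import Callable, Sequence


def launch_server(
    python: str,
    server_path: str,
    foreground: bool,
    execv: Callable[[str, Sequence[str]], object] = os.execv,
    popen: Callable[..., object] = subprocess.Popen,
) -> str:
    """Start the server in foreground (exec) or default (detached) mode."""
    argv = [python, server_path]
    if foreground:
        # Replace this process image with the server. The PID stays the
        # same, so a supervisor keeps tracking the right process.
        execv(python, argv)
        return "exec"  # reached only when execv is a stand-in (real execv never returns)
    # Default mode: detached child in its own session; the caller then exits.
    popen(argv, start_new_session=True)
    return "spawned"


# Exercise both branches with stand-ins so nothing is actually launched.
calls = []
fake_execv = lambda path, argv: calls.append(("execv", path, list(argv)))
fake_popen = lambda argv, **kwargs: calls.append(("popen", list(argv), kwargs))

launch_server(sys.executable, "server.py", True, execv=fake_execv, popen=fake_popen)
launch_server(sys.executable, "server.py", False, execv=fake_execv, popen=fake_popen)
print(calls)
```

Passing the two calls in as parameters is only a convenience for trying the sketch safely; it mirrors the idea of checking which call is made rather than what happens after the exec.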
launchd (macOS)
~/Library/LaunchAgents/com.example.hermes-webui.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.example.hermes-webui</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>/Users/yourname/hermes-webui/start.sh</string>
<string>--foreground</string>
</array>
<key>WorkingDirectory</key>
<string>/Users/yourname/hermes-webui</string>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/Users/yourname/.hermes/webui/launchd-stdout.log</string>
<key>StandardErrorPath</key>
<string>/Users/yourname/.hermes/webui/launchd-stderr.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>HOME</key>
<string>/Users/yourname</string>
<key>PATH</key>
<string>/usr/local/bin:/usr/bin:/bin</string>
</dict>
</dict>
</plist>
Load:
launchctl load ~/Library/LaunchAgents/com.example.hermes-webui.plist
launchctl print gui/$(id -u)/com.example.hermes-webui # check state
Reload after editing the plist:
launchctl unload ~/Library/LaunchAgents/com.example.hermes-webui.plist
launchctl load ~/Library/LaunchAgents/com.example.hermes-webui.plist
launchd sets XPC_SERVICE_NAME automatically, so even without the
--foreground argument the Web UI will auto-promote to foreground mode.
The flag is still recommended as documentation of intent.
systemd (Linux)
~/.config/systemd/user/hermes-webui.service:
[Unit]
Description=Hermes Web UI
After=network.target
[Service]
Type=simple
WorkingDirectory=%h/hermes-webui
ExecStart=/bin/bash %h/hermes-webui/start.sh --foreground
Restart=on-failure
RestartSec=5
# Optional: route stdout/stderr to journald instead of files
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=default.target
Enable + start:
systemctl --user daemon-reload
systemctl --user enable --now hermes-webui.service
journalctl --user -u hermes-webui.service -f
systemd sets INVOCATION_ID and JOURNAL_STREAM (when stdio is wired to
the journal), both of which auto-promote to foreground mode.
supervisord (cross-platform)
/etc/supervisor/conf.d/hermes-webui.conf:
[program:hermes-webui]
command=/bin/bash /home/youruser/hermes-webui/start.sh --foreground
directory=/home/youruser/hermes-webui
user=youruser
autostart=true
autorestart=true
stopsignal=TERM
stopwaitsecs=10
stdout_logfile=/var/log/hermes-webui.out.log
stderr_logfile=/var/log/hermes-webui.err.log
environment=HOME="/home/youruser",PATH="/usr/local/bin:/usr/bin:/bin"
Reload + start:
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl status hermes-webui
supervisord sets SUPERVISOR_ENABLED, which auto-promotes to foreground
mode.
Auto-detected env vars (full list)
These trigger --foreground behavior even when the flag is not passed:
| Env var | Set by | Notes |
|---|---|---|
| INVOCATION_ID | systemd | Set on every service activation |
| JOURNAL_STREAM | systemd | Set when stdio is wired to journald |
| NOTIFY_SOCKET | systemd Type=notify / s6 | sd_notify-style notification socket |
| XPC_SERVICE_NAME | launchd | Set to the plist Label |
| SUPERVISOR_ENABLED | supervisord | Always set under supervisord |
| HERMES_WEBUI_FOREGROUND | you | Explicit opt-in; accepts 1 / true / yes / on |
If you're running under a supervisor that is not in the list and your tracked
PID keeps exiting, set HERMES_WEBUI_FOREGROUND=1 in the service
environment.
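The detection rules in the table can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the actual function from bootstrap.py: `detect_supervisor`, the check order, the whitespace stripping, and the rule that any non-empty supervisor variable counts are all guesses made for illustration.

```python
import os
from typing import Mapping, Optional

# Variables from the table above that a supervisor sets on its own.
SUPERVISOR_ENV_VARS = (
    "INVOCATION_ID",
    "JOURNAL_STREAM",
    "NOTIFY_SOCKET",
    "XPC_SERVICE_NAME",
    "SUPERVISOR_ENABLED",
)
OPT_IN_VAR = "HERMES_WEBUI_FOREGROUND"
TRUTHY = {"1", "true", "yes", "on"}


def detect_supervisor(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the env var that triggers foreground mode, else None."""
    # Explicit opt-in: only the listed truthy values count, case-insensitively.
    if env.get(OPT_IN_VAR, "").strip().lower() in TRUTHY:
        return OPT_IN_VAR
    # Assumption: any non-empty supervisor variable is treated as a match.
    for name in SUPERVISOR_ENV_VARS:
        if env.get(name):
            return name
    return None


print(detect_supervisor({"HERMES_WEBUI_FOREGROUND": "Yes"}))
print(detect_supervisor({"SUPERVISOR_ENABLED": "1"}))
print(detect_supervisor({}))
```

Returning the variable name rather than a bare boolean makes it easy to log which signal switched the process into foreground mode.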
Diagnostic recipe
If the Web UI keeps getting respawned and you suspect the double-fork loop:
# Find the PID that is listening on the server port
lsof -iTCP:8787 -sTCP:LISTEN
# Get its parent. It should be the supervisor itself, not init (PID 1),
# unless the supervisor is PID 1 (launchd on macOS, system-level systemd).
PID=$(lsof -tiTCP:8787 -sTCP:LISTEN)
# "command" is accepted by both Linux and macOS ps; "cmd" is Linux-only
ps -p "$PID" -o pid,ppid,command
ps -p "$(ps -o ppid= -p "$PID" | tr -d ' ')" -o pid,ppid,command
A healthy foreground-mode setup looks like:
PID PPID COMMAND
12345 6789 /path/to/python /path/to/server.py
6789 1 /usr/lib/systemd/systemd --user   # or supervisord, etc.
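If PPID is 1 (init) when it should be the supervisor, the orphan-server
loop is happening. Re-check that --foreground (or one of the env vars)
is reaching the process. Under launchd, which is itself PID 1, a PPID of 1
is expected; compare the listening PID with the PID that launchctl print
reports for the job instead.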
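If you would rather script this check, the same ppid lookup can be wrapped in Python. This is only a sketch: `parse_ppid`, `parent_pid`, and `looks_orphaned` are hypothetical helpers, it assumes a `ps` binary on PATH, and you have to supply the supervisor's PID yourself.

```python
import subprocess


def parse_ppid(ps_output: str) -> int:
    """Parse the output of `ps -o ppid= -p PID` (one space-padded number)."""
    return int(ps_output.strip())


def parent_pid(pid: int) -> int:
    """Ask ps for the parent PID of `pid`."""
    result = subprocess.run(
        ["ps", "-o", "ppid=", "-p", str(pid)],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ppid(result.stdout)


def looks_orphaned(server_ppid: int, supervisor_pid: int) -> bool:
    """True when the server's parent is not the supervisor you expected."""
    return server_ppid != supervisor_pid


# Example with the PIDs from the sample output above (no ps call needed).
print(looks_orphaned(server_ppid=1, supervisor_pid=6789))
print(looks_orphaned(server_ppid=6789, supervisor_pid=6789))
```

When the supervisor is itself PID 1, this comparison cannot tell an orphan from a healthy child, so treat it as a first pass only.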