mirror of
https://github.com/nesquena/hermes-webui.git
synced 2026-05-25 03:00:23 +00:00
efcfff3d7f
The dashboard banner 'Hermes agent is not responding' fires on every multi-container deployment that doesn't set 'pid: "service:hermes-agent"' in compose, because get_running_pid() relies on fcntl.flock and os.kill(pid, 0), both of which are PID-namespace-scoped and invisible across container boundaries.

Fix: when get_running_pid() returns None, fall back to a freshness check on gateway_state.json. The gateway already writes that file on every tick with gateway_state == 'running' and an aware ISO-8601 updated_at timestamp, so a recent (<= 120s) timestamp is an equivalent live-process signal that needs only a shared volume: no PID namespace, no compose workaround, no extra HTTP probe URL.

Behavior preserved:
- In-namespace deployments still hit the PID-based path first; payload shape unchanged (no 'reason' key) so #716 contract holds.
- Cross-container alive path adds reason='cross_container_freshness' so support diagnostics can tell which signal succeeded.
- Stale updated_at, non-running gateway_state, malformed/naive/missing timestamps, and timestamps far in the future all still report 'down'; the fallback never produces a false positive.
- Same redaction rules: argv/command/executable/env/raw pid never leak.

Tests: 15 new cases in test_issue1879_cross_container_gateway_liveness.py covering the cross-container alive path, every refusal case, clock-skew tolerance, and backward compat with the #716 PID path. Existing #716 heartbeat tests (8) continue to pass.
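
The freshness fallback described above could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name gateway_looks_alive, the state_path parameter, and the 30-second MAX_FUTURE_SKEW tolerance are assumptions made for the example. Only the gateway_state and updated_at keys, the 'running' value, and the 120-second window come from the message.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

FRESHNESS_WINDOW = timedelta(seconds=120)  # "<= 120s" from the commit message
MAX_FUTURE_SKEW = timedelta(seconds=30)    # assumed clock-skew tolerance (hypothetical value)


def gateway_looks_alive(state_path, now=None):
    """Return True only if the state file shows a recently updated, running gateway.

    Hypothetical helper: every refusal case returns False, so a bad or
    missing file can never produce a false 'alive' result.
    """
    now = now or datetime.now(timezone.utc)
    try:
        state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file.
        return False
    if not isinstance(state, dict) or state.get("gateway_state") != "running":
        return False
    raw = state.get("updated_at")
    if not isinstance(raw, str):
        return False
    try:
        updated_at = datetime.fromisoformat(raw)
    except ValueError:
        # Not a parseable ISO-8601 timestamp.
        return False
    if updated_at.tzinfo is None or updated_at.utcoffset() is None:
        # Naive timestamps are refused rather than guessed at.
        return False
    age = now - updated_at
    # Reject stale timestamps and ones too far in the future.
    return -MAX_FUTURE_SKEW <= age <= FRESHNESS_WINDOW
```

A caller would try the PID-based path first and consult this check only when get_running_pid() returns None, attaching reason='cross_container_freshness' to the payload when the check succeeds.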