hermes-webui

mirror of https://github.com/nesquena/hermes-webui.git synced 2026-05-25 03:00:23 +00:00

Author	SHA1	Message	Date
nesquena-hermes	f6ce79185c	Merge PR #2059 into stage-339 feat: add crash-safe turn journal writer by @ai-ag2026	2026-05-11 17:43:58 +00:00
Frank Song	2cd10868aa	Fix session recovery polish	2026-05-11 16:30:25 +08:00
ai-ag2026	5cd001d545	feat: add crash-safe turn journal writer	2026-05-11 08:49:53 +02:00
nesquena-hermes	12cef733e3	fix(recovery): preserve worktree metadata + workspace + message_count on state.db sidecar rebuild PR #2053 added worktree-backed session creation. PR #2041 (shipped in v0.51.42) added state.db sidecar reconciliation that rebuilds a missing <sid>.json sidecar from the canonical state.db row when the JSON file is gone (failed save, manual rm, restore-from-backup with mismatched dirs). The two interact silently. `_state_db_row_to_sidecar()` was hard-coding `'workspace': ''` and never propagating the four worktree_* fields from the row to the rebuilt sidecar dict. So a worktree-backed session that loses its sidecar and gets rebuilt from state.db: - loses `worktree_path` → matches the empty-session sidebar filter at `api/models.py:1067/1107` (which spares worktree-backed empty sessions via `not s.get('worktree_path')`) → session disappears from the sidebar even though the worktree directory still exists on disk. - loses `workspace` → downstream tools (terminal panels, file pickers that use `s.workspace`) operate on empty string instead of the original worktree path. - always reports `message_count == 0` → contributes to the empty-session filter even for sessions that have messages in `state.db.messages`. Fix: 1. `_read_state_db_missing_sidecar_rows()` SELECT now includes `workspace, worktree_path, worktree_branch, worktree_repo_root, worktree_created_at, message_count` (each gated by `_sql_optional_col()` so older state.db schemas without those columns continue to work — recovery degrades gracefully rather than 500ing). 2. `_state_db_row_to_sidecar()` propagates each field. workspace comes from the row if it's a string, otherwise '' (matching pre-fix behavior for non-worktree sessions). message_count comes from the row if it's an int, otherwise falls back to `len(messages)` so the rebuilt sidecar always has a coherent count. 3 new regression tests in tests/test_state_db_worktree_recovery.py exercise: - worktree session with messages → all four worktree_* fields preserved. - non-worktree session → worktree_* fields all None (no spurious propagation), workspace=''. - empty worktree session (the worst case) → confirms the rebuilt sidecar does NOT match the empty-session-exempt filter, so it stays visible in the sidebar. Caught by Opus advisor during stage-337 review (the cross-PR interaction between #2053 and the previously-shipped #2041 wasn't exercised by either PR's individual test suite).	2026-05-11 06:00:13 +00:00
nesquena-hermes	9f3f8ea902	fix(recovery): close concurrency hazards in state.db sidecar reconciliation Two concrete data-corruption vectors flagged in Opus review of PR #2041, both fixed atomically so the new repair-safe endpoint is safe for production: 1. Shared tmp filename under concurrent calls `tmp = target.with_suffix('.json.reconcile.tmp')` produced a fixed path per session ID. Two simultaneous repair-safe POSTs would interleave bytes in the same tmp file, then both rename → corrupted JSON. Now matches the `Session.save()` convention at api/models.py:484 with a pid+tid suffix. 2. TOCTOU between target.exists() check and tmp.replace(target) `os.replace()` overwrites unconditionally. If a concurrent Session.save() for the same SID materialized the live sidecar in the microsecond window between the existence check and the rename, the reconciliation would silently overwrite a live sidecar with a (lossier) state.db reconstruction. Switched to `os.link()` + `unlink(tmp)` which is atomic create-or-fail — on FileExistsError we record `skipped: sidecar_appeared_during_reconcile` and keep the live sidecar untouched. Plus a round-trip schema-parity test: materialize a sidecar from state.db, then load it back through `Session.load()` and assert the messages survive. Catches future schema drift between `_state_db_row_to_sidecar()` and `Session.__init__()`. Also adds a guard test confirming the .reconcile.tmp suffix includes pid+tid (regression guard for hazard #1). Tests: 23 passing across the recovery suite (was 21; +2 new in this commit). Co-authored-by: ai-ag2026 <261867348+ai-ag2026@users.noreply.github.com>	2026-05-11 02:44:38 +00:00
ai-ag2026	a34ded8e99	feat: reconcile missing WebUI sidecars from state db	2026-05-11 02:43:00 +00:00
ai-ag2026	90c3611732	feat: expose session recovery audit and safe repair endpoints	2026-05-11 02:43:00 +00:00
ai-ag2026	7b6d91d490	feat: add read-only session recovery audit	2026-05-11 02:06:43 +02:00
ai-ag2026	663817570c	fix: recover orphaned session backups on startup	2026-05-11 02:03:37 +02:00
Hermes Bot	0c6c6b3bb1	fix: absorb Opus advisor doc-only SHOULD-FIX nits (1) api/session_recovery.py: removed misleading dated-format comment claim. YYYYMMDD_HHMMSS_*.json files don't start with '_' so the underscore- skip wouldn't apply to them anyway. Replaced with the truthful general statement: any future non-session JSON marked with the '_' convention is skipped automatically. (2) CHANGELOG.md: fixed self-referential typo. v0.50.284 obviously couldn't have said 'v0.50.285' inside its release notes — the quoted text was 'after deploying v0.50.284'. Pure documentation. No behavior change. Tests still pass (8/8 in tests/test_metadata_save_wipe_1558.py).	2026-05-03 20:54:02 +00:00
Hermes Bot	1a7eaf518f	fix(session-recovery): skip _index.json + harden _msg_count against non-dict JSON (v0.50.284 follow-up) v0.50.284 shipped startup self-heal in api/session_recovery.py that crashed on the very first JSON file it scanned in the production session directory. Verified live on the prod server immediately after the v0.50.284 deploy: [recovery] startup recovery failed: 'list' object has no attribute 'get' Root cause: the production session dir contains _index.json — a top-level LIST of session metadata dicts (not a dict). _msg_count() did data.get('messages') which raises AttributeError on a list. The broad except Exception in server.py's startup hook swallowed the error and the recovery silently no-op'd for every user — defeating the entire purpose of the v0.50.284 release. Fix is three small defensive changes: 1. _msg_count() — added isinstance(data, dict) guard. Non-dict-shaped JSON files now return -1 (the harmless 'unknown count' sentinel) instead of raising AttributeError. 2. recover_all_sessions_on_startup() — skips any file whose name starts with '_' (the existing project convention for non-session metadata files like _index.json). These are convention-marked as system files, not session payloads. 3. recover_all_sessions_on_startup() — wraps recover_session(path) in try/except Exception so a single malformed file can't break recovery for the rest. Logs and continues. 2 new regression tests: - test_recover_all_sessions_on_startup_skips_non_session_index_json - test_msg_count_returns_neg1_for_non_dict_top_level 4026 → 4028 tests passing (+2). Net effect: any user wiped between v0.50.279 and v0.50.284 deploys whose session has a .bak shadow will now get auto-recovered on first launch of v0.50.285, as v0.50.284's release notes promised. Closes #1558 (follow-up — the original P0 was closed by v0.50.284 but the recovery half didn't actually run in production).	2026-05-03 20:50:06 +00:00
Hermes Bot	166f439eeb	fix: correct issue references #1557 → #1558 (nesquena review feedback) The PR title and body correctly say 'Closes #1558' but every code comment, the test file name, error-message strings, docstrings, and the original commit body referenced #1557 instead. Independent reviewer flagged this: > The 17 wrong references won't auto-close issue #1558 from the commit > message — and the test file name will be misleading for future archeology. > Worth a one-pass s/#1557/#1558/g (and rename test file → > test_metadata_save_wipe_1558.py) before merge so the artifacts agree > with reality. This commit: - Renames tests/test_metadata_save_wipe_1557.py → test_metadata_save_wipe_1558.py - Replaces 17 #1557 references with #1558 across: - tests/test_metadata_save_wipe_1558.py (7 refs) - api/models.py (5 refs in Session.save guard + backup safeguard comments) - api/routes.py (2 refs in _clear_stale_stream_state docstring + log) - api/session_recovery.py (3 refs) - server.py (3 refs in startup self-heal block) Verified: 6/6 tests in tests/test_metadata_save_wipe_1558.py pass with the renamed file + updated references.	2026-05-03 19:55:14 +00:00
nesquena-hermes	1d9a0cbba1	fix(P0 #1557 ): metadata-only Session.save() was wiping conversation history v0.50.279 introduced api.routes._clear_stale_stream_state() (#1525) which calls session.save() to clear stale active_stream_id/pending_* fields. The helper is called from /api/session and /api/session/status — both of which load the session with metadata_only=True. Session.load_metadata_only() synthesizes a stub with messages=[] (its whole purpose: fast metadata read without parsing the 400KB+ messages array). Session.save() unconditionally writes self.messages to disk via os.replace(), so saving a metadata-only stub atomically overwrites the on-disk JSON with messages=[], wiping the entire conversation. Production trigger: every SSE reconnect cycle after a server restart polls /api/session/status, which fans out to _clear_stale_stream_state, which saves the metadata-only stub. The user reported losing 1000+ message conversations and seeing 'Reconnecting…' loops on every prompt — the reconnect loop kept the cycle running until the conversation was empty. Fix: three layers, defense in depth. (1) api/models.py: load_metadata_only() now sets _loaded_metadata_only=True on the returned stub. Session.save() raises RuntimeError if that flag is set — a hard guard so any future caller making the same mistake cannot wipe data, only crash visibly. (2) api/routes.py: _clear_stale_stream_state() now detects the metadata-only flag and re-loads the full session with metadata_only=False before mutating persisted state. The full-load path also runs _repair_stale_pending() which independently clears the stream flags, so the explicit clear becomes a no-op in most cases — but messages stay intact. (3) api/models.py + api/session_recovery.py: every save() that would SHRINK the messages array (the precise failure shape of #1557) first snapshots the previous file to <sid>.json.bak. Server.py runs recover_all_sessions_on_startup() at boot — any session whose live JSON has fewer messages than its .bak is restored automatically. Idempotent on clean state. Backup overhead is zero on the normal grow-the-conversation path. Reproducer (master): test_metadata_only_save_does_not_wipe_messages goes from 1000 messages to 0 in a single save() call. After the fix, 1000 messages survive. Tests: 6 new regression tests in tests/test_metadata_save_wipe_1557.py covering all three layers. Full pytest: 4019 → 4025 (+6, all green). Live verified on port 8789: write 1000-msg session with stale active_stream_id, hit /api/session/status, /api/session — file ends with 1002 messages (_repair_stale_pending injects an error-marker pair on full reload, harmless existing behavior), active_stream_id cleared, pending cleared, no Reconnecting loop. Closes #1557. Reported by AvidFuturist via user feedback on v0.50.282.	2026-05-03 19:45:10 +00:00

13 Commits