Files
hermes-webui/api
nesquena-hermes 9f3f8ea902 fix(recovery): close concurrency hazards in state.db sidecar reconciliation
Two concrete data-corruption vectors flagged in Opus review of PR #2041,
both fixed atomically so the new repair-safe endpoint is safe for production:

1. Shared tmp filename under concurrent calls
   `tmp = target.with_suffix('.json.reconcile.tmp')` produced a fixed path
   per session ID. Two simultaneous repair-safe POSTs would interleave bytes
   in the same tmp file, then both rename → corrupted JSON. Now matches the
   `Session.save()` convention at api/models.py:484 with a pid+tid suffix.

2. TOCTOU between target.exists() check and tmp.replace(target)
   `os.replace()` overwrites unconditionally. If a concurrent Session.save()
   for the same SID materialized the live sidecar in the microsecond window
   between the existence check and the rename, the reconciliation would
   silently overwrite a live sidecar with a (lossier) state.db reconstruction.
   Switched to `os.link()` + `unlink(tmp)` which is atomic create-or-fail —
   on FileExistsError we record `skipped: sidecar_appeared_during_reconcile`
   and keep the live sidecar untouched.

Plus a round-trip schema-parity test: materialize a sidecar from state.db,
then load it back through `Session.load()` and assert the messages survive.
Catches future schema drift between `_state_db_row_to_sidecar()` and
`Session.__init__()`. Also adds a guard test confirming the .reconcile.tmp
suffix includes pid+tid (regression guard for hazard #1).

Tests: 23 passing across the recovery suite (was 21; +2 new in this commit).

Co-authored-by: ai-ag2026 <261867348+ai-ag2026@users.noreply.github.com>
2026-05-11 02:44:38 +00:00
..
2026-05-11 00:25:07 +00:00
2026-04-29 19:54:07 -07:00
2026-05-11 07:33:52 +08:00
2026-05-10 23:38:05 +00:00
2026-05-11 00:25:07 +00:00