PR #2053 added worktree-backed session creation. PR #2041 (shipped in
v0.51.42) added state.db sidecar reconciliation that rebuilds a missing
<sid>.json sidecar from the canonical state.db row when the JSON file is
gone (failed save, manual rm, restore-from-backup with mismatched dirs).
The two interact silently. `_state_db_row_to_sidecar()` was hard-coding
`'workspace': ''` and never propagating the four worktree_* fields from
the row to the rebuilt sidecar dict. So a worktree-backed session that
loses its sidecar and gets rebuilt from state.db:
- loses `worktree_path` → matches the empty-session sidebar filter at
`api/models.py:1067/1107` (which spares worktree-backed empty sessions
via `not s.get('worktree_path')`) → session disappears from the
sidebar even though the worktree directory still exists on disk.
- loses `workspace` → downstream tools (terminal panels, file pickers
that use `s.workspace`) operate on empty string instead of the original
worktree path.
- always reports `message_count == 0` → contributes to the empty-session
filter even for sessions that have messages in `state.db.messages`.
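To make the first and third bullets concrete, here is a minimal sketch of the kind of predicate the sidebar filter applies. The real filter at `api/models.py:1067/1107` is not reproduced in this message; the function name and the exact emptiness test are assumptions, and only the `not s.get('worktree_path')` clause is taken from the text above.

```python
def hidden_by_empty_filter_sketch(s: dict) -> bool:
    """Hypothetical stand-in for the empty-session sidebar filter."""
    # Assumption: "empty" means a message_count of zero.
    is_empty = s.get("message_count", 0) == 0
    # From the commit text: worktree-backed empty sessions are spared.
    return is_empty and not s.get("worktree_path")


# Shape of the sidecar before it was lost (values are illustrative).
original = {"message_count": 0, "worktree_path": "/repo/.worktrees/feat"}
# Shape of the pre-fix rebuild: workspace hard-coded, worktree_* never copied.
rebuilt_pre_fix = {"message_count": 0, "workspace": ""}
```

Under these assumptions the original sidecar is kept and the pre-fix rebuild is hidden, which is the disappearance described above.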
Fix:
1. `_read_state_db_missing_sidecar_rows()` SELECT now includes
`workspace, worktree_path, worktree_branch, worktree_repo_root,
worktree_created_at, message_count` (each gated by
`_sql_optional_col()` so older state.db schemas without those columns
continue to work — recovery degrades gracefully rather than 500ing).
2. `_state_db_row_to_sidecar()` propagates each field. workspace comes
from the row if it's a string, otherwise '' (matching pre-fix behavior
for non-worktree sessions). message_count comes from the row if
it's an int, otherwise falls back to `len(messages)` so the rebuilt
sidecar always has a coherent count.
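A rough sketch of the propagation rules in step 2, for readers who do not have the diff open. `row_to_sidecar_sketch` is a hypothetical stand-in, not the real `_state_db_row_to_sidecar()`; it assumes the row behaves like a mapping and that columns missing from an older schema are simply absent.

```python
from typing import Any, Mapping

WORKTREE_FIELDS = (
    "worktree_path",
    "worktree_branch",
    "worktree_repo_root",
    "worktree_created_at",
)


def row_to_sidecar_sketch(row: Mapping[str, Any], messages: list) -> dict:
    """Illustrates only the field handling described in the Fix list."""
    workspace = row.get("workspace")
    count = row.get("message_count")
    sidecar = {
        # Non-string or missing workspace falls back to '' (pre-fix behavior).
        "workspace": workspace if isinstance(workspace, str) else "",
        "messages": messages,
        # Non-int or missing count falls back to the number of messages.
        "message_count": count if isinstance(count, int) else len(messages),
    }
    # Each worktree_* field is copied through; absent columns become None.
    for name in WORKTREE_FIELDS:
        sidecar[name] = row.get(name)
    return sidecar
```

With an empty row the result has `workspace == ''`, all four `worktree_*` fields set to `None`, and a `message_count` equal to `len(messages)`.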
3 new regression tests in tests/test_state_db_worktree_recovery.py
exercise:
- worktree session with messages → all four worktree_* fields preserved.
- non-worktree session → worktree_* fields all None (no spurious
propagation), workspace=''.
- empty worktree session (the worst case) → confirms the rebuilt sidecar
keeps `worktree_path` and so is spared by the empty-session filter,
staying visible in the sidebar.
Caught by Opus advisor during stage-337 review (the cross-PR interaction
between #2053 and the previously-shipped #2041 wasn't exercised by either
PR's individual test suite).
Two concrete data-corruption vectors flagged in Opus review of PR #2041,
both fixed atomically so the new repair-safe endpoint is safe for production:
1. Shared tmp filename under concurrent calls
`tmp = target.with_suffix('.json.reconcile.tmp')` produced a fixed path
per session ID. Two simultaneous repair-safe POSTs would interleave bytes
in the same tmp file, then both rename → corrupted JSON. Now matches the
`Session.save()` convention at api/models.py:484 with a pid+tid suffix.
2. TOCTOU between target.exists() check and tmp.replace(target)
`os.replace()` overwrites unconditionally. If a concurrent Session.save()
for the same SID materialized the live sidecar in the microsecond window
between the existence check and the rename, the reconciliation would
silently overwrite a live sidecar with a (lossier) state.db reconstruction.
Switched to `os.link()` + `unlink(tmp)` which is atomic create-or-fail —
on FileExistsError we record `skipped: sidecar_appeared_during_reconcile`
and keep the live sidecar untouched.
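The two fixes combine into a write path along these lines. This is a sketch under assumptions, not the code in the PR: the function name, the exact tmp-name layout, and the return strings other than the `skipped:` reason are illustrative.

```python
import json
import os
import threading
from pathlib import Path


def materialize_sidecar_sketch(target: Path, payload: dict) -> str:
    """Write payload to target only if target does not already exist."""
    # Hazard 1: a pid+tid suffix gives each concurrent caller its own tmp file.
    tmp = target.with_name(
        f"{target.name}.reconcile.tmp.{os.getpid()}.{threading.get_ident()}"
    )
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    try:
        # Hazard 2: os.link() raises FileExistsError if target already exists,
        # unlike os.replace(), which would overwrite it unconditionally.
        os.link(tmp, target)
    except FileExistsError:
        return "skipped: sidecar_appeared_during_reconcile"
    finally:
        # The tmp name is removed on both paths; on success the data
        # remains reachable through target.
        tmp.unlink()
    return "materialized"
```

Hard links require both names to live on the same filesystem, which holds here because the tmp file is created next to the target.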
Plus a round-trip schema-parity test: materialize a sidecar from state.db,
then load it back through `Session.load()` and assert the messages survive.
Catches future schema drift between `_state_db_row_to_sidecar()` and
`Session.__init__()`. Also adds a guard test confirming the .reconcile.tmp
suffix includes pid+tid (regression guard for hazard #1).
Tests: 23 passing across the recovery suite (was 21; +2 new in this commit).
Co-authored-by: ai-ag2026 <261867348+ai-ag2026@users.noreply.github.com>
(1) api/session_recovery.py: removed misleading dated-format comment claim.
YYYYMMDD_HHMMSS_*.json files don't start with '_' so the underscore-
skip wouldn't apply to them anyway. Replaced with the truthful general
statement: any future non-session JSON marked with the '_' convention
is skipped automatically.
(2) CHANGELOG.md: fixed self-referential typo. v0.50.284 obviously couldn't
have said 'v0.50.285' inside its release notes — the quoted text was
'after deploying v0.50.284'.
Pure documentation. No behavior change. Tests still pass (8/8 in
tests/test_metadata_save_wipe_1558.py).
v0.50.284 shipped startup self-heal in api/session_recovery.py that
crashed on the very first JSON file it scanned in the production
session directory. Verified live on the prod server immediately after
the v0.50.284 deploy:
[recovery] startup recovery failed: 'list' object has no attribute 'get'
Root cause: the production session dir contains _index.json — a
top-level LIST of session metadata dicts (not a dict). _msg_count()
did data.get('messages') which raises AttributeError on a list.
The broad except Exception in server.py's startup hook swallowed the
error and the recovery silently no-op'd for every user — defeating
the entire purpose of the v0.50.284 release.
Fix is three small defensive changes:
1. _msg_count() — added isinstance(data, dict) guard. Non-dict-shaped
JSON files now return -1 (the harmless 'unknown count' sentinel)
instead of raising AttributeError.
2. recover_all_sessions_on_startup() — skips any file whose name starts
with '_' (the existing project convention for non-session metadata
files like _index.json). These are convention-marked as system
files, not session payloads.
3. recover_all_sessions_on_startup() — wraps recover_session(path) in
try/except Exception so a single malformed file can't break recovery
for the rest. Logs and continues.
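The three changes can be sketched together as follows. The function names carry a `_sketch` suffix because they are illustrative stand-ins; the real `_msg_count()` and `recover_all_sessions_on_startup()` signatures are not shown in this message, and the `recover` callback parameter is an assumption made to keep the example self-contained.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("recovery")


def msg_count_sketch(path: Path) -> int:
    """Return the number of messages in a session file, or -1 if unknown."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return -1
    # Change 1: a top-level list (such as _index.json) has no .get().
    if not isinstance(data, dict):
        return -1
    messages = data.get("messages")
    return len(messages) if isinstance(messages, list) else -1


def scan_sessions_sketch(session_dir: Path, recover) -> int:
    """Call recover(path) per session file; return how many calls succeeded."""
    handled = 0
    for path in sorted(session_dir.glob("*.json")):
        # Change 2: names starting with '_' are metadata, not sessions.
        if path.name.startswith("_"):
            continue
        # Change 3: one malformed file must not abort the whole scan.
        try:
            recover(path)
            handled += 1
        except Exception:
            log.exception("[recovery] failed for %s; continuing", path.name)
    return handled
```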
2 new regression tests:
- test_recover_all_sessions_on_startup_skips_non_session_index_json
- test_msg_count_returns_neg1_for_non_dict_top_level
4026 → 4028 tests passing (+2).
Net effect: any session wiped between the v0.50.279 and v0.50.284 deploys
that has a .bak shadow will now be auto-recovered on first launch of
v0.50.285, as v0.50.284's release notes promised.
Closes #1558 (follow-up: the original P0 was closed by v0.50.284 but
the recovery half didn't actually run in production).
The PR title and body correctly say 'Closes #1558' but every code comment,
the test file name, error-message strings, docstrings, and the original
commit body referenced #1557 instead. Independent reviewer flagged this:
> The 17 wrong references won't auto-close issue #1558 from the commit
> message — and the test file name will be misleading for future archeology.
> Worth a one-pass s/#1557/#1558/g (and rename test file →
> test_metadata_save_wipe_1558.py) before merge so the artifacts agree
> with reality.
This commit:
- Renames tests/test_metadata_save_wipe_1557.py → test_metadata_save_wipe_1558.py
- Replaces 17 #1557 references with #1558 across:
- tests/test_metadata_save_wipe_1558.py (7 refs)
- api/models.py (5 refs in Session.save guard + backup safeguard comments)
- api/routes.py (2 refs in _clear_stale_stream_state docstring + log)
- api/session_recovery.py (3 refs)
- server.py (3 refs in startup self-heal block)
Verified: 6/6 tests in tests/test_metadata_save_wipe_1558.py pass
with the renamed file + updated references.
v0.50.279 introduced api.routes._clear_stale_stream_state() (#1525) which
calls session.save() to clear stale active_stream_id/pending_* fields. The
helper is called from /api/session and /api/session/status — both of which
load the session with metadata_only=True. Session.load_metadata_only()
synthesizes a stub with messages=[] (its whole purpose: fast metadata read
without parsing the 400KB+ messages array). Session.save() unconditionally
writes self.messages to disk via os.replace(), so saving a metadata-only
stub atomically overwrites the on-disk JSON with messages=[], wiping the
entire conversation.
Production trigger: every SSE reconnect cycle after a server restart polls
/api/session/status, which fans out to _clear_stale_stream_state, which
saves the metadata-only stub. The user reported losing 1000+ message
conversations and seeing 'Reconnecting…' loops on every prompt — the
reconnect loop kept the cycle running until the conversation was empty.
Fix: three layers, defense in depth.
(1) api/models.py: load_metadata_only() now sets _loaded_metadata_only=True
on the returned stub. Session.save() raises RuntimeError if that flag
is set — a hard guard so any future caller making the same mistake
cannot wipe data, only crash visibly.
(2) api/routes.py: _clear_stale_stream_state() now detects the metadata-only
flag and re-loads the full session with metadata_only=False before
mutating persisted state. The full-load path also runs
_repair_stale_pending() which independently clears the stream flags,
so the explicit clear becomes a no-op in most cases — but messages
stay intact.
(3) api/models.py + api/session_recovery.py: every save() that would
SHRINK the messages array (the precise failure shape of #1557) first
snapshots the previous file to <sid>.json.bak. server.py runs
recover_all_sessions_on_startup() at boot — any session whose live
JSON has fewer messages than its .bak is restored automatically.
Idempotent on clean state. Backup overhead is zero on the normal
grow-the-conversation path.
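Layers (1) and (3) can be illustrated with a small self-contained model. `SessionSketch` and `restore_if_shrunk_sketch` are hypothetical; the real `Session` class carries far more state, and the file layout here (a JSON object with a `messages` list) is a simplifying assumption.

```python
import json
import os
import shutil
from pathlib import Path


class SessionSketch:
    def __init__(self, path: Path, messages: list, metadata_only: bool = False):
        self.path = path
        self.messages = messages
        self._loaded_metadata_only = metadata_only

    def save(self) -> None:
        # Layer 1: a metadata-only stub has messages=[] by construction,
        # so saving it would wipe the conversation. Fail loudly instead.
        if self._loaded_metadata_only:
            raise RuntimeError("refusing to save a metadata-only session stub")
        # Layer 3: snapshot the old file only when the save would shrink it.
        if self.path.exists():
            previous = json.loads(self.path.read_text(encoding="utf-8"))
            if len(self.messages) < len(previous.get("messages", [])):
                shutil.copy2(self.path, self.path.with_name(self.path.name + ".bak"))
        tmp = self.path.with_name(self.path.name + ".tmp")
        tmp.write_text(json.dumps({"messages": self.messages}), encoding="utf-8")
        os.replace(tmp, self.path)


def restore_if_shrunk_sketch(path: Path) -> bool:
    """Restore path from its .bak if the live file has fewer messages."""
    bak = path.with_name(path.name + ".bak")
    if not bak.exists():
        return False
    live = len(json.loads(path.read_text(encoding="utf-8")).get("messages", []))
    saved = len(json.loads(bak.read_text(encoding="utf-8")).get("messages", []))
    if live < saved:
        shutil.copy2(bak, path)
        return True
    return False
```

A second call to `restore_if_shrunk_sketch` on the same file returns `False`, since the live file then matches the backup; that is the idempotence property mentioned above.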
Reproducer (master): test_metadata_only_save_does_not_wipe_messages goes
from 1000 messages to 0 in a single save() call. After the fix, 1000
messages survive.
Tests: 6 new regression tests in tests/test_metadata_save_wipe_1557.py
covering all three layers. Full pytest: 4019 → 4025 (+6, all green).
Live verified on port 8789: write 1000-msg session with stale active_stream_id,
hit /api/session/status, /api/session — file ends with 1002 messages
(_repair_stale_pending injects an error-marker pair on full reload, harmless
existing behavior), active_stream_id cleared, pending cleared, no Reconnecting
loop.
Closes #1557.
Reported by AvidFuturist via user feedback on v0.50.282.