When API server runs append messages directly to state.db, reconcile WebUI sidecar sessions with those canonical rows across API responses, model-facing streaming context, and active browser refresh.
Add append-only state.db merge helpers, metadata-only counts for refresh polling, and regression coverage for API visibility, context incorporation, and frontend refresh behavior.
Replace the earlier frontend-reset approach with a backend side-channel
approach that preserves the queue (event, data) tuple shape.
Problem (Opus catch):
- Live SSE frames emitted by _sse() in api/streaming.py:2296 carried no
'id:' field. Only journal-replay frames (via _sse_with_id) emitted IDs.
- Frontend's _lastRunJournalSeq cursor stayed at 0 during live streaming.
- Mid-stream error → reconnect-to-replay arrived with after_seq=0.
- Server replayed every journaled event from seq 1.
- assistantText (closure-scoped) had accumulated all live tokens already
→ double-rendered output.
Fix:
- api/config.py: STREAM_LAST_EVENT_ID: dict = {} module-level dict.
- api/streaming.py put(): capture journal event_id, write to
STREAM_LAST_EVENT_ID[stream_id]. Keep queue tuple as (event, data).
- api/routes.py _handle_sse_stream: read STREAM_LAST_EVENT_ID[stream_id]
at emit time, use _sse_with_id when set.
- api/streaming.py finally block: pop STREAM_LAST_EVENT_ID for cleanup.
Why side-channel instead of 3-tuple:
- Earlier attempt (queue tuple → (event, data, event_id)) broke 4 existing
tests: test_cancel_interrupt, test_sprint42, test_sprint51,
test_issue1857_usage_overwrite. These all unpack 'event, data = q.get()'.
- Frontend-reset approach (reset assistantText before replay) broke 3
other tests: test_smooth_text_fade, test_streaming_markdown,
test_streaming_race_fix. _wireSSE must NOT reset accumulators because
legacy reconnect doesn't replay events; only journal-replay does.
Side-channel preserves both invariants:
- Queue contract stays (event, data) — legacy consumers unbroken.
- Frontend accumulators stay alive on _wireSSE — legacy reconnect unbroken.
- Live SSE emits 'id:' so the journal cursor advances correctly.
6 regression tests added in test_stage364_opus_live_sse_event_id.py.
1 existing test (test_run_journal_streaming_static.test_streaming_journals_sse_events_before_queue_delivery) updated to be tuple-shape-agnostic.
Test results:
- Full pytest: 5713 passed, 10 skipped, 1 xfailed, 2 xpassed, 0 failed
- Previously-failing 5 tests: ALL PASS
- 6 new regression tests: ALL PASS
Replace the hardcoded Skyly cancellation wording with the configured bot_name from settings, falling back to Hermes when unset.
Keep the client-side fallback in sync by using window._botName if the session refresh after cancellation fails.
Co-authored-by: Obryn 🐉 <obryn-ai@dotbeeps.dev>
Opus stage-360 review caught that the docstring at api/streaming.py:40-43
said 'around the entire agent run' which is no longer accurate after the
narrow-lock refactor. The lock is now held only briefly for the env-mutation
critical section; the agent runs outside the lock and the finally block
re-acquires to atomically restore env vars.
Docstring now points to both narrow-lock implementations as references:
- _run_agent_streaming at line ~2719 (the original pattern)
- profile_env_for_background_worker at api/profiles.py:715 (added stage-360)
#2299 introduced profile_env_for_background_worker() in api/profiles.py and
changed _ENV_LOCK from threading.Lock() to threading.RLock(). Both changes
were incorrect:
1. RLock masked rather than fixed the underlying deadlock. The QA
test_env_lock_is_non_reentrant test exists precisely to enforce
non-reentrance — RLock would let a single thread hold _ENV_LOCK across
nested critical sections, which hides bugs while still allowing
different-thread races.
2. The original context manager held _ENV_LOCK for the ENTIRE 'yield'
duration, meaning the lock was held for the full background worker's
runtime (title generation, compression, update summary — possibly
many seconds). That blocked ALL other sessions on _ENV_LOCK, which
the QA test_third_message_completes runtime test caught as a timeout
on the third sequential message.
Fix: mirror the narrow-lock pattern from _run_agent_streaming:
- Acquire _ENV_LOCK only for env mutation (set runtime_env + patch
skill modules)
- Release immediately, yield to worker (no lock held)
- Reacquire in finally to restore env + skill modules
Restored _ENV_LOCK back to threading.Lock(). All 20 QA tests now pass,
including test_third_message_completes (was timing out, now 35s).