Opus stage-360 review caught that the docstring at api/streaming.py:40-43
said 'around the entire agent run' which is no longer accurate after the
narrow-lock refactor. The lock is now held only briefly for the env-mutation
critical section; the agent runs outside the lock and the finally block
re-acquires to atomically restore env vars.
Docstring now points to both narrow-lock implementations as references:
- _run_agent_streaming at line ~2719 (the original pattern)
- profile_env_for_background_worker at api/profiles.py:715 (added stage-360)
#2299 introduced profile_env_for_background_worker() in api/profiles.py and
changed _ENV_LOCK from threading.Lock() to threading.RLock(). Both changes
were incorrect:
1. RLock masked rather than fixed the underlying deadlock. The QA
test_env_lock_is_non_reentrant test exists precisely to enforce
non-reentrance — RLock would let a single thread hold _ENV_LOCK across
nested critical sections, which hides bugs while still allowing
different-thread races.
2. The original context manager held _ENV_LOCK for the ENTIRE 'yield'
duration, meaning the lock was held for the full background worker's
runtime (title generation, compression, update summary — possibly
many seconds). That blocked ALL other sessions on _ENV_LOCK, which
the QA test_third_message_completes runtime test caught as a timeout
on the third sequential message.
Fix: mirror the narrow-lock pattern from _run_agent_streaming:
- Acquire _ENV_LOCK only for env mutation (set runtime_env + patch
skill modules)
- Release immediately, yield to worker (no lock held)
- Reacquire in finally to restore env + skill modules
Restored _ENV_LOCK back to threading.Lock(). All 20 QA tests now pass,
including test_third_message_completes (was timing out, now 35s).
Keep distinct generated summary categories, route update-summary generation through the configured auxiliary model first, disclose capped large-range summary input, and constrain long summary panels.
The FakeAgent in test_issue1857_usage_overwrite returned only 2 messages
(user + assistant) without the conversation history. The real agent always
returns the full history plus new messages. This mismatch caused the new
_has_new_assistant_reply helper (which checks only messages beyond the
pre-turn offset) to see len(result)==len(prev) and incorrectly flag the
turn as a silent failure.
Fix: prepend conversation_history to the FakeAgent's response so the
message list mirrors production behavior.
When a provider error (401/429/rate-limit) causes the agent to return
without producing a new assistant reply, the WebUI should emit an
apperror event so the user sees an inline error. However, the detection
logic scanned ALL messages in result['messages'] — which includes the
full conversation history. If any prior turn had an assistant response,
_assistant_added would be True and the apperror would be silently
skipped, leaving the user staring at a blank response.
Extract a helper _has_new_assistant_reply(all_messages, prev_count)
that only inspects messages beyond the pre-turn history offset. Apply
it to both the main detection path and the self-heal/retry path.
Tests: 15 new cases covering history masking, empty content, whitespace,
edge-case shrinks, and multi-assistant scenarios.
Opus identified that PR #2227's preservation block had two related bugs in
the parent_session_id handling:
1. During preservation save: code did
_old_parent = s.parent_session_id
s.parent_session_id = None
s.save(touch_updated_at=False, skip_index=True)
s.parent_session_id = _old_parent
The save persisted parent=None to disk. The in-memory restoration didn't
reach the disk copy. Result: a /branch fork session that subsequently
compressed lost its 'Forked from X' badge on the preserved old snapshot.
2. Stamping the continuation: code did
if not s.parent_session_id:
s.parent_session_id = old_sid
The 'if not' guard skipped the stamp when the session already had a
parent_session_id from a prior fork. Result: fork-of-fork compression
broke lineage — the continuation jumped back to the original fork parent
instead of the just-preserved immediate predecessor snapshot.
Fix (matches Opus's recommendation):
- Remove the parent clearing during preservation save (preserve as-is)
- Drop the 'if not' guard; always stamp continuation to old_sid
This makes the lineage chain consistent: new → old → old.parent → ... root.
Traversal from the continuation always walks through the just-preserved
snapshot to get to its parent's parent, never jumping over the snapshot.
Two new regression tests pin both invariants:
- test_parent_session_id_stamped_unconditionally (no 'if not' guard)
- test_old_session_parent_preserved_during_archive_save (no parent=None)
Both pass against the fix. All 8 tests in the file pass.