hermes-webui/docs/rfcs/turn-journal.md
nesquena-hermes 7690e08e70 docs(rfcs): establish docs/rfcs/ convention and polish turn-journal RFC
Moves docs/turn-journal-rfc.md → docs/rfcs/turn-journal.md, establishing
the convention for future design documents on hermes-webui's data-at-rest
and recovery surfaces. Adds docs/rfcs/README.md describing when an RFC
applies (large changes, durability/recovery semantics, new infrastructure
primitives) and the simple status header convention.

Polish on turn-journal.md:
- Added 3-line status header (Status / Author / Created) at top.
- Light tone edits on two flourishes that read fine in a PR description
  but felt off in permanent repo documentation. Author's voice preserved
  throughout the rest of the document.

Co-authored-by: ai-ag2026 <261867348+ai-ag2026@users.noreply.github.com>
2026-05-11 02:45:38 +00:00


RFC: WebUI Turn Journal for Crash-Safe Chat Submissions

  • Status: Proposed
  • Author: @ai-ag2026
  • Created: 2026-05-11

Problem

A WebUI chat turn crosses several durability boundaries:

  1. browser submits a user message,
  2. WebUI creates or updates session runtime metadata,
  3. the agent worker starts streaming,
  4. assistant output is appended,
  5. the JSON sidecar and derived index are saved.

If the server crashes between submission and the final sidecar save, recovery has to infer what happened from pending_user_message, active_stream_id, .json.bak, _index.json, and state.db. Those safeguards are useful, but they are still reconstructing intent after the fact.

The missing primitive is a small write-ahead journal for turns: record the submitted user turn durably before the worker starts, then advance the journal as the turn progresses.

Goals

  • Preserve the exact user-submitted turn, including attachment metadata, before any provider or worker work starts.
  • Make crash recovery deterministic: a submitted-but-unfinished turn can be reported or reconstructed without guessing.
  • Keep the journal append/update format simple enough for startup recovery, CLI audit, and future API repair endpoints.
  • Avoid turning recovery into a background daemon. This is storage hygiene, not a long-running service.

Non-goals

  • Replacing state.db.sessions or WebUI JSON sidecars.
  • Journaling every token or every SSE event.
  • Replaying tool calls or provider streams.
  • Automatically inventing assistant messages after ambiguous crashes.

Proposed storage

Use one JSONL file per session under the existing WebUI state area:

<SESSION_DIR>/_turn_journal/<session_id>.jsonl

Each line is an immutable event. Recovery groups events by turn_id and treats the last event written for each turn as that turn's current status.
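The scan described here could look roughly like the sketch below. It assumes every line carries the turn_id and event fields used throughout this RFC; the function name latest_status_by_turn is hypothetical and not part of the proposal.

```python
import json


def latest_status_by_turn(lines):
    """Map each turn_id to the name of the last event seen for it.

    Lines are processed in file order, so a later event for the same
    turn_id overwrites the earlier one.
    """
    status = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        event = json.loads(line)
        status[event["turn_id"]] = event["event"]
    return status


# Hypothetical journal content for one session with two turns.
journal = [
    '{"version":1,"event":"submitted","turn_id":"t1","created_at":1.0}',
    '{"version":1,"event":"completed","turn_id":"t1","created_at":2.0}',
    '{"version":1,"event":"submitted","turn_id":"t2","created_at":3.0}',
]
statuses = latest_status_by_turn(journal)
```

In this example t1 resolves to completed and t2 stays at submitted, which makes t2 the only turn recovery would need to look at.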

Event shape

{
  "version": 1,
  "event": "submitted",
  "turn_id": "20260511T001122Z-abcdef",
  "session_id": "abc123",
  "stream_id": "stream-xyz",
  "created_at": 1778458282.123,
  "role": "user",
  "content": "...",
  "attachments": [],
  "workspace": "/workspace",
  "model": "openai/gpt-5",
  "model_provider": "openai"
}

Later events for the same turn_id:

{"version":1,"event":"worker_started","turn_id":"...","created_at":1778458283.0}
{"version":1,"event":"assistant_started","turn_id":"...","created_at":1778458284.0}
{"version":1,"event":"completed","turn_id":"...","created_at":1778458299.0,"assistant_message_index":12}
{"version":1,"event":"interrupted","turn_id":"...","created_at":1778458301.0,"reason":"server_startup_recovery"}

Turn state machine

submitted -> worker_started -> assistant_started -> completed
submitted -> interrupted
worker_started -> interrupted
assistant_started -> interrupted

completed is terminal. interrupted is terminal unless a later explicit repair creates a new turn. Recovery should not silently resume a provider call.
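One way to encode the state machine above is a small transition table. This is a sketch only: it treats the listed arrows as the complete set of legal transitions (so skipping a state is rejected), which is an assumption the RFC does not state explicitly. The names ALLOWED and validate_sequence are hypothetical.

```python
# Allowed transitions, copied from the arrows in the state machine.
# None stands for "no event recorded yet for this turn".
ALLOWED = {
    None: {"submitted"},
    "submitted": {"worker_started", "interrupted"},
    "worker_started": {"assistant_started", "interrupted"},
    "assistant_started": {"completed", "interrupted"},
    "completed": set(),    # terminal
    "interrupted": set(),  # terminal; repair creates a new turn instead
}


def is_valid_transition(current, new):
    """Return True if `new` may follow `current` for the same turn."""
    return new in ALLOWED.get(current, set())


def validate_sequence(event_names):
    """Return True if the event names for one turn form a legal path."""
    current = None
    for name in event_names:
        if not is_valid_transition(current, name):
            return False
        current = name
    return True
```

A table like this could back both the writer (refuse to append an illegal event) and the audit (flag journals whose history does not follow a legal path).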

Write rules

  1. On /api/chat/start or equivalent turn-submission path:
    • generate turn_id,
    • append submitted,
    • fsync the journal file,
    • only then start the worker.
  2. When the worker thread enters _run_agent_streaming, append worker_started.
  3. When assistant output is first persisted or clearly begins, append assistant_started.
  4. After the sidecar save that includes the assistant answer succeeds, append completed.
  5. On cancellation or known worker exception, append interrupted with a reason.
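The ordering in rule 1 (journal first, worker second) can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: the journal_dir parameter, the start_worker callback, and the new_turn_id helper are hypothetical, and the turn_id format is inferred from the example event in this RFC.

```python
import json
import os
import secrets
import tempfile
import time
from datetime import datetime, timezone


def new_turn_id():
    # UTC timestamp plus a short random suffix; the suffix length
    # (6 hex characters) is an assumption based on the example event.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{secrets.token_hex(3)}"


def append_turn_journal_event(journal_dir, session_id, event):
    """Append one JSON line and fsync it before returning."""
    os.makedirs(journal_dir, exist_ok=True)
    path = os.path.join(journal_dir, f"{session_id}.jsonl")
    record = {"version": 1, "created_at": time.time(), **event}
    data = (json.dumps(record) + "\n").encode("utf-8")
    with open(path, "ab") as f:  # append mode: never rewrites old lines
        f.write(data)
        f.flush()                # push Python's buffer to the OS
        os.fsync(f.fileno())     # ask the OS to push it to disk
    return path


def submit_turn(journal_dir, session_id, content, start_worker):
    """Record the submitted turn durably, then hand off to the worker."""
    turn_id = new_turn_id()
    append_turn_journal_event(journal_dir, session_id, {
        "event": "submitted",
        "turn_id": turn_id,
        "session_id": session_id,
        "role": "user",
        "content": content,
        "attachments": [],
    })
    # Only reached after the fsync above has returned.
    start_worker(turn_id)
    return turn_id
```

On POSIX systems a newly created journal file may also need an fsync of its parent directory before the file's existence is durable; the sketch leaves that out for brevity.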

Startup recovery semantics

On startup, for each journal file:

  • Latest event is completed: no action.
  • Latest event is submitted or worker_started and no matching user message exists in sidecar:
    • append/recover the user message into the session sidecar with a recovery marker.
  • Latest event is submitted, worker_started, or assistant_started and no completed assistant turn exists:
    • add a visible interruption marker, not a fake assistant answer.
  • Existing .json.bak and state.db recovery still run first so the sidecar is as complete as possible before journal reconciliation.
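The per-turn decision above can be expressed as a pure function, which keeps it easy to unit test. The sketch below is hypothetical: the action names and the plan_recovery function are illustrative, and it assumes the caller has already run the .json.bak and state.db recovery and inspected the sidecar. Turns whose latest event is interrupted are left alone here, which is an assumption, since the rules above do not cover that case.

```python
def plan_recovery(latest_event, user_message_in_sidecar, assistant_turn_completed):
    """Return the recovery actions for one turn, in the order to apply them.

    latest_event: name of the last journal event for the turn.
    user_message_in_sidecar: True if the sidecar already holds the user message.
    assistant_turn_completed: True if the sidecar holds a finished assistant turn.
    """
    actions = []
    if latest_event == "completed":
        return actions  # nothing to do

    # Rule: restore the user message if the crash happened before it
    # reached the sidecar.
    if latest_event in ("submitted", "worker_started") and not user_message_in_sidecar:
        actions.append("restore_user_message")

    # Rule: surface the interruption instead of inventing an answer.
    if (
        latest_event in ("submitted", "worker_started", "assistant_started")
        and not assistant_turn_completed
    ):
        actions.append("add_interruption_marker")

    return actions
```

Note that both rules can fire for the same turn: a turn stuck at submitted with no user message in the sidecar gets the message restored and then an interruption marker.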

Audit additions

audit_session_recovery() can report:

  • turn_journal_pending_turn — repairable if the user message is absent from sidecar.
  • turn_journal_interrupted_turn — ok/warn depending on whether a visible marker exists.
  • turn_journal_malformed_event — manual review.

Safe repair should only materialize submitted user messages and interruption markers when the journal event parses as valid JSON and the target message is absent from the sidecar.
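A rough sketch of how the three findings could be produced from raw journal lines follows. Everything beyond the finding names is an assumption: the function audit_turn_journal, the idea that the caller can supply the set of turn_ids already present in the sidecar and the set that already carry a visible marker, and the "ok" status for a pending turn whose user message is present.

```python
import json

PENDING = ("submitted", "worker_started", "assistant_started")


def audit_turn_journal(lines, sidecar_turn_ids, marked_turn_ids):
    """Return a list of (finding, subject, status) tuples.

    subject is a 1-based line number for malformed events and a
    turn_id for the other findings.
    """
    findings = []
    latest = {}
    for lineno, raw in enumerate(lines, start=1):
        try:
            event = json.loads(raw)
            latest[event["turn_id"]] = event["event"]
        except (ValueError, KeyError, TypeError):
            # Not JSON, not an object, or missing required fields.
            findings.append(("turn_journal_malformed_event", lineno, "manual_review"))

    for turn_id, name in latest.items():
        if name in PENDING:
            status = "ok" if turn_id in sidecar_turn_ids else "repairable"
            findings.append(("turn_journal_pending_turn", turn_id, status))
        elif name == "interrupted":
            status = "ok" if turn_id in marked_turn_ids else "warn"
            findings.append(("turn_journal_interrupted_turn", turn_id, status))
    return findings
```

Keeping the audit read-only, as here, means it can run on every startup without risk; repair would consume these findings in a separate step.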

API surface

The initial read-only endpoint can be folded into the existing recovery audit:

GET /api/session/recovery/audit

Later, if needed:

GET /api/session/turn-journal?session_id=<id>

The latter should be diagnostic-only and redact or omit large attachment payloads.
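Redaction for the diagnostic endpoint could be as simple as an allowlist over attachment fields. The RFC does not define the attachment schema, so the field names below (name, size, mime_type) and the function redact_event are purely hypothetical placeholders.

```python
# Hypothetical attachment metadata fields that are safe to expose.
SAFE_ATTACHMENT_KEYS = ("name", "size", "mime_type")


def redact_event(event):
    """Return a copy of a journal event with attachment payloads dropped.

    The input event is not modified. Attachments that are not JSON
    objects are omitted entirely.
    """
    redacted = dict(event)
    attachments = event.get("attachments") or []
    redacted["attachments"] = [
        {key: att[key] for key in SAFE_ATTACHMENT_KEYS if key in att}
        for att in attachments
        if isinstance(att, dict)
    ]
    return redacted
```

An allowlist fails closed: if a new payload-bearing field is added to attachments later, it stays out of the diagnostic response until someone explicitly permits it.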

Rollout plan

  1. Land backup/sidecar recovery and audit primitives.
  2. Add this journal writer in the turn-submission path with no config flag; it is local-only and append-only.
  3. Add read-only audit reporting for pending journal turns.
  4. Add safe repair for missing user messages and interruption markers.
  5. Once stable, consider pruning completed journal entries older than a retention window, but only after sidecar/index recovery has no findings.

Open questions

  • Exact place to define turn_id so browser retry and server retry do not duplicate the same user message.
  • Whether attachment files need their own durable manifest entry or whether metadata-only is enough for v1.
  • How much of the assistant partial output, if any, should be recoverable after assistant_started but before completed.
  • Whether completed journal entries should be compacted into a per-session checkpoint file.

Minimal implementation slice

The first implementation PR should be deliberately small:

  • helper: append_turn_journal_event(session_id, event)
  • helper: read_turn_journal(session_id)
  • unit tests for atomic append, malformed-line tolerance, and state derivation
  • one call site: append submitted before worker start
  • audit-only report of pending journal turns
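As a starting point for the reader helper and its malformed-line tolerance test, here is a minimal sketch. It takes a file path rather than a session_id, which is an assumption made to keep the example self-contained; the return shape (events plus malformed line numbers) is also hypothetical.

```python
import json
import os
import tempfile


def read_turn_journal(path):
    """Read a JSONL journal and return (events, malformed_line_numbers).

    A missing file yields two empty lists. Lines that are not valid
    JSON objects with turn_id and event fields are reported by their
    1-based line number instead of raising.
    """
    events, malformed = [], []
    if not os.path.exists(path):
        return events, malformed
    # errors="replace" keeps a torn, partially written final line from
    # raising UnicodeDecodeError during iteration.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for lineno, raw in enumerate(f, start=1):
            raw = raw.strip()
            if not raw:
                continue
            try:
                event = json.loads(raw)
            except ValueError:
                malformed.append(lineno)
                continue
            if isinstance(event, dict) and "turn_id" in event and "event" in event:
                events.append(event)
            else:
                malformed.append(lineno)
    return events, malformed
```

A crash in the middle of an append would typically leave a truncated last line; with this reader that shows up as one malformed line number while every earlier event remains usable.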

Do not combine the first implementation with replay/repair. Replay is where most of the bugs in write-ahead log (WAL) systems live; ship the writer and audit first, prove the format, then add repair.