When context compression fires, the agent rotates to a new session_id.
The compression migration block correctly migrates the session lock,
SESSION_AGENT_CACHE, SESSIONS dict, and the session file rename, but
does not ensure s.profile is set on the continuation session.
On the next request, _run_agent_streaming resolves the profile via:
get_hermes_home_for_profile(getattr(s, 'profile', None))
With s.profile == None this falls back to the default profile's
HERMES_HOME. Memory tool calls then read and write the wrong profile's
MEMORY.md — confirmed by investigation: session 0dfefb (continuation
after compression from a troubleshooting profile session) read memory
at 16% / 1,184 chars with 4 entries, while the troubleshooting profile's
actual state was 72-77% / 5,000+ chars. That reading could only come
from the default profile's bank. Subsequent replace operations failed
because the target entries existed only in the troubleshooting profile.
There are two failure paths:
1. In-memory: if s.profile was None from the start (legacy session or
one created before this fix), the continuation session object carries
null through the current request.
2. Persistence: s.save() persists "profile": null to the continuation
session's JSON file (profile is in METADATA_FIELDS, models.py ~408).
On the next request, Session.load(new_sid) reads it back as null and
get_hermes_home_for_profile(None) falls back to the default profile.
Fix: capture _resolved_profile_name at request entry (~line 2019),
immediately after profile home resolution. This is the only point where
profile context is reliable: s.profile if already set, otherwise
get_active_profile_name() — which at that point reads thread-local
storage (_tls.profile) correctly set by the HTTP handler thread via
set_request_profile(). Calling get_active_profile_name() at compression
time instead would be unsafe: the streaming thread is a separate
threading.Thread, does not inherit TLS, and the call would fall back to
the process-global _active_profile which may belong to a different
concurrent tab.
Stamp s.profile in the compression migration block immediately after
s.session_id = new_sid. Guarded by `if not s.profile` so sessions that
already have a profile set are unaffected. A logger.info line records
when the stamp fires, making future investigation straightforward.
Fixes: memory writes bleeding into default profile after compression
Reproduces: reliably on any long non-default profile session that hits
the compression threshold (default: 0.80 context fill)
Per-request profile switches (process_wide=False, introduced in #1700)
update os.environ['HERMES_HOME'] but skip _set_hermes_home(), which is
responsible for monkeypatching module-level caches.
Both tools/skills_tool.py and tools/skill_manager_tool.py set
HERMES_HOME and SKILLS_DIR once at import time. When a non-default
profile is active in the WebUI, os.environ['HERMES_HOME'] is correctly
updated per-turn in the _ENV_LOCK block, but the module-level
constants still point at the root profile. All agent-side skill
operations — skills_list(), skill_view(), skill_manage() — read and
write to the wrong directory.
Add the same monkeypatching that _set_hermes_home() already performs
(profiles.py line ~620) to the per-turn env setup block in
streaming.py, covering both skills_tool and skill_manager_tool.
The WebUI display half was already fixed in #1917 via
_active_skills_dir() in routes.py. This patch fixes the agent-side
half so the running agent resolves skills from the correct profile.
Issue #1968: switching to a non-default profile in the WebUI dropdown
had no effect on which MCP servers were available. Every chat session,
regardless of profile, only saw the default profile's mcp_servers from
~/.hermes/config.yaml. Non-default profile MCP servers (postgres, custom
stdio servers, anything in <profile>/config.yaml) never registered.
Root cause: api/streaming.py:1922 called discover_mcp_tools() at the
TOP of _run_agent_streaming(), about 100 lines BEFORE the per-session
'os.environ["HERMES_HOME"] = _profile_home' mutation at line 2053.
discover_mcp_tools() reads ~/.hermes/config.yaml via get_hermes_home(),
which uses os.environ['HERMES_HOME']. So at the call site, HERMES_HOME
was still whatever the WebUI server process had at startup — the default
profile, every time.
Fix: relocate the discover_mcp_tools() call past the _ENV_LOCK block so
get_hermes_home() resolves to the session's actual profile home. Same
try/except wrapping is preserved; same idempotency semantics on
already-connected servers; same lazy-import pattern.
Caveat (out of scope, agent-side): _servers in tools/mcp_tool.py is a
process-global Dict[str, MCPServerTask] keyed only by server name. So
once profile A registers a server named e.g. 'postgres', profile B's
discovery sees 'postgres' as already connected and skips it — even if
B's config points at a different binary or DB. Concurrent multi-profile
WebUI processes will still hit 'first profile wins per server name'.
Fully fixing that requires keying _servers by (profile_home, name)
upstream in hermes-agent. This PR ships layer 1 only — fixes the
single-non-default-profile case (the headline symptom).
Tests: tests/test_issue1968_mcp_profile_discovery.py — 4 static tests
pinning the lexical ordering invariants. Verified mutation-safety: a
proof-of-concept revert (re-adding a discover call before the
HERMES_HOME mutation) makes the 'only called once' test fail.
Test suite: 5047 passed, 4 skipped, 3 xpassed, 0 regressions.
Closes#1968
CRITICAL: #1951 PENDING_GOAL_CONTINUATION race
Removes `PENDING_GOAL_CONTINUATION.discard(session_id)` from the
streaming worker's `finally` cleanup block. The marker is set inside
the SAME function call (line ~3328 on `goal_continue`) and the discard
in the `finally` (line ~3553) almost always raced ahead of the
frontend's SSE-receive → POST /api/chat/start round-trip, erasing
the marker before the consumer in routes.py could read it. The
consumer (`_start_chat_stream_for_session` in routes.py:6522) already
discards atomically when consuming, so removing the streaming-side
discard preserves single-use semantics and unblocks the
goal-continuation chain.
Adds tests/test_stage326_pending_goal_continuation_race.py with 5
regression guards:
1. streaming.py's finally must NOT discard PENDING_GOAL_CONTINUATION
2. routes.py consumer must check + set + discard atomically
3. PENDING_GOAL_CONTINUATION must be a set (GIL-safe single-op)
4. STREAM_GOAL_RELATED.pop must be keyed by stream_id, not session_id
5. PENDING_GOAL_CONTINUATION.add must precede the goal_continue SSE
emission in source ordering
HARDENING: #1956 composer-draft input validation
Per Opus, the POST /api/session/draft handler accepted unbounded /
arbitrary-typed text and files inputs. With the 400ms debounced
auto-save firing on every keystroke, a misbehaving client could
persist multi-MB strings into the session JSON. Adds:
- text: coerced to str if not already; clamped to 50_000 chars
- files: coerced to list if not already; clamped to 50 entries
Validation runs BEFORE the session lock acquire / save.
Adds tests/test_stage326_composer_draft_validation.py with 5 guards.
Verdict from Opus advisor on stage-326: SHIP-WITH-FIXES.
This commit applies the required + recommended fixes; #1957 hardening
fixed in a prior stage commit.
_build_native_multimodal_message() unconditionally embedded images as
native image_url parts, bypassing the agent's image_input_mode config.
Add _resolve_image_input_mode(cfg) helper mirroring the agent's
decide_image_input_mode logic, and wire it into
_build_native_multimodal_message with a new cfg parameter.
When mode resolves to 'text' (explicit aux vision config, or
image_input_mode: text), returns plain string so the agent's
existing text-mode pipeline (vision_analyze) handles images.
Closes#1959
The goal evaluation hook was firing on every completed assistant turn
when a goal was active, even for unrelated messages like "what time is
it". This burned the goal budget, triggered continuation prompts that
interrupted unrelated conversations, and made /goal status numbers
misleading.
Add STREAM_GOAL_RELATED and PENDING_GOAL_CONTINUATION flags to gate
the evaluate_goal_after_turn() call in the streaming loop. Only streams
started from goal kickoff (/goal <text>) or goal continuation are
marked as goal-related. Normal user messages skip the hook entirely.
The two get_model_context_length() fallback callsites in api/streaming.py
(session save + SSE usage payload) were calling the resolver with only
model + base_url. When the agent's compressor reports 0 (fresh/cached/
transitioning agent), resolution fell through to the 256K DEFAULT_FALLBACK
even when users had set model.context_length: 1048576 in config.yaml.
For LCM users on 1M-context models, the wrong window cascaded into a
session-killing failure: auto-compression triggered at ~25% of the wrong
value, floods of compress requests, 429s, credential pool exhaustion,
fallback 429s, then 'API call failed after 3 retries'.
Reported by @AvidFuturist on Discord with deepseek-v4-flash. Reproduced 5x.
Both callsites now pass config_context_length, provider, and
custom_providers. The resolver consults these BEFORE probing, so the
config override wins. Both are wrapped in except TypeError blocks that
retry with the legacy 2-arg form for older hermes-agent builds whose
get_model_context_length signature pre-dates these kwargs.
Tests: 7 source-string regressions guarding both call shapes, the safe
config parse, the legacy fallback, and the per-profile config source.
Also bumped the line-distance assertion in test_pr1341 (the test
explicitly invites bumping when a new pre-save mutation block is added).
Closes#1896
Co-authored-by: Hermes Agent <agent@hermes.local>
Same-session profile switches reused cached AIAgent from previous profile,
silently leaking the old persona's SOUL.md / system prompt into the new
profile's turns. session_id stays stable across profile switches, and the
signature didn't include the active profile home, so every signature input
matched and the stale agent was returned from SESSION_AGENT_CACHE.
Append _profile_home to the signature blob so profile switches force a
cache miss and a fresh agent build under the new HERMES_HOME (which
triggers a fresh load_soul_md() call).
Tests: 3 source-string regressions guarding the signature contract,
ordering, and empty-home fallback.
Closes#1897
Co-authored-by: Hermes Agent <agent@hermes.local>
Read agent.max_turns when constructing streaming WebUI AIAgent instances, pass it as max_iterations when supported, and include it in the per-session agent cache signature so budget changes take effect.
Add regression coverage for the config read, constructor kwarg, and cache key.
Closes#1695.
@Patrick-81 reported the bare "AIAgent not available -- check that
hermes-agent is on sys.path" error on a symlinked install (~/Programmes/hermes-agent
linked to ~/hermes-agent). The maintainer's response — three diagnostic
commands plus `pip install -e .` in the agent dir — fixed it for them.
This PR captures both halves of that learning so the next user with the
same shape doesn't have to file a new issue:
1. **Error message diagnostic block.** New helper
`_aiagent_import_error_detail()` in api/streaming.py builds a multi-line
diagnostic when the import fails, including:
- the running Python interpreter
- HERMES_WEBUI_AGENT_DIR (set value, or "(not set)")
- sys.path entries that mention hermes/agent (or "no entries mention..."
— itself a strong diagnostic signal)
- the most-common fix (`pip install -e .` in the agent dir)
- a pointer to docs/troubleshooting.md
The original error message string is preserved as the FIRST line so
existing log scrapers and docs-search keep matching.
Helper is kept as a separate function so it stays out of the hot path
until we actually need to raise — building it on every successful import
would be wasted work.
2. **New docs/troubleshooting.md.** Symptom → Why → Diagnostic commands →
Fix → When-to-file-a-bug template, with one entry to start: the
"AIAgent not available" flow Patrick-81 walked through. Future
recurring failure modes follow the same template. Required a one-line
addition to .gitignore — docs/* is gitignored with an allowlist, and
the new file needed `!docs/troubleshooting.md` to be tracked.
3. **README link.** docs/troubleshooting.md added to the `## Docs` section
so users know where to look first.
13 regression tests in tests/test_1695_aiagent_import_error_detail.py:
9 for the helper output shape (preserves original message line, includes
running python, shows HERMES_WEBUI_AGENT_DIR set/unset both ways, includes
pip-install-e hint, points at troubleshooting doc, lists relevant sys.path
entries when present, says "no entries..." when absent, output is multi-line)
plus 4 for the docs-presence regression (file exists, has the AIAgent
section, includes pip install -e ., describes the diagnostic chain with
readlink + agent/__init__.py verification).
190 streaming/aiagent tests pass after the change. ast.parse on
api/streaming.py clean.
CI failure on prior push was due to the docs/* gitignore swallowing the
new troubleshooting.md file silently — this commit adds the allowlist
entry so the file is tracked.