mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-21 03:39:54 +00:00
7ccfb97feeac4e58d1404edc73fc0d211b43490d
1080 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
008860a23f |
fix(approval): close remaining prompt_toolkit deadlock vectors (#15216)
PR #13734 fixed the concurrent-tool-executor vector (ThreadPoolExecutor
workers didn't inherit the CLI's TLS approval callback). Two vectors
remained that could still land in the deadlocking input() fallback:
1. _spawn_background_review spawns a raw threading.Thread with no
approval callback installed, so any dangerous-command guard the
review agent trips falls back to input() -> deadlock against the
parent's prompt_toolkit TUI (same class as delegate_task subagents,
fixed in
|
||
|
|
75b460bc94 | fix(email): add required Date header to outbound mail | ||
|
|
ec671c4154 |
feat(image-input): native multimodal routing based on model vision capability (#16506)
* feat(image-input): native multimodal routing based on model vision capability
Attach user-sent images as OpenAI-style content parts on the user turn when
the active model supports native vision, so vision-capable models see real
pixels instead of a lossy text description from vision_analyze.
Routing decision (agent/image_routing.py::decide_image_input_mode):
agent.image_input_mode = auto | native | text (default: auto)
In auto mode:
- If auxiliary.vision.provider/model is explicitly configured, keep the
text pipeline (user paid for a dedicated vision backend).
- Else if models.dev reports supports_vision=True for the active
provider/model, attach natively.
- Else fall back to text (current behaviour).
Call sites updated: gateway/run.py (all messaging platforms), tui_gateway
(dashboard/Ink), cli.py (interactive /attach + drag-drop).
run_agent.py changes:
- _prepare_anthropic_messages_for_api now passes image parts through
unchanged when the model supports vision — the Anthropic adapter
translates them to native image blocks. Previous behaviour
(vision_analyze → text) only runs for non-vision Anthropic models.
- New _prepare_messages_for_non_vision_model mirrors the same contract
for chat.completions and codex_responses paths, so non-vision models
on any provider get text-fallback instead of failing at the provider.
- New _model_supports_vision() helper reads models.dev caps.
vision_analyze description rewritten: positions it as a tool for images
NOT already visible in the conversation (URLs, tool output, deeper
inspection). Prevents the model from redundantly calling it on images
already attached natively.
Config default: agent.image_input_mode = auto.
Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py),
all existing tests that reference _prepare_anthropic_messages_for_api
still pass (198 targeted + new tests green).
* feat(image-input): size-cap + resize oversized images, charge image tokens in compressor
Two follow-ups that make the native image routing safer for long / heavy
sessions:
1) Oversize handling in build_native_content_parts:
- 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES,
the most restrictive provider — Gemini inline data).
- Delegates to vision_tools._resize_image_for_vision (Pillow-based,
already battle-tested) to downscale to 5 MB first-try.
- If Pillow is missing or resize still overshoots, the image is
dropped and reported back in skipped[]; caller falls back to text
enrichment for that image.
2) Image-token accounting in context_compressor:
- New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant;
within the realistic range for Anthropic/GPT-4o/Gemini billing).
- _content_length_for_budget() helper: sums text-part lengths and
charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/
input_image part. Base64 payload inside image_url is NOT counted
as chars — dimensions don't matter, only image-presence.
- Both tail-cut sites (_prune_old_tool_results L527 and
_find_tail_cut_by_tokens L1126) now call the helper so multi-image
conversations don't slip past compression budget.
Tests: 9 new in test_image_routing.py (oversize triggers resize,
resize-fails-returns-None, oversize-skipped-reported), 11 new in
test_compressor_image_tokens.py (flat charge per image, multiple images,
Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on
raw base64, bounds-check on the constant, integration test that an
image-heavy tail actually gets trimmed).
* fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits
The previous commit imposed a hardcoded 20 MB base64 ceiling on all
providers, triggering auto-resize on anything larger. This was wrong in
both directions:
* Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400
'image exceeds 5 MB maximum' above that).
* Too strict for OpenAI / Codex / OpenRouter — accept 49 MB+ without
complaint (empirically verified April 2026 with progressive PNG
sizes).
New behaviour:
* _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a
ceiling (5 MB, since bedrock-on-Claude shares Anthropic's decoder).
* Providers NOT in the table get no ceiling — images attach at native
size and we trust the provider to return its own error if it
disagrees. A provider-specific 400 message is clearer than us
guessing wrong and silently degrading image quality.
* build_native_content_parts() gains a keyword-only provider arg;
gateway/CLI/TUI pass the active provider so Anthropic users get
auto-resize protection while OpenAI users don't pay it.
* Resize target dropped from 5 MB to 4 MB to slide safely under
Anthropic's boundary with header overhead.
Empirical measurements (direct API, no Hermes in the loop):
image b64 anthropic openrouter/gpt5.5 codex-oauth/gpt5.5
0.19 MB ✓ ✓ ✓
12.37 MB ✗ 400 5MB ✓ ✓
23.85 MB ✗ 400 5MB ✓ ✓
49.46 MB ✗ 413 ✓ ✓
Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through,
Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts
routes ceiling by provider, unknown provider gets no ceiling. All 52
targeted tests pass.
* refactor(image-input): attempt native, shrink-and-retry on provider reject
Replace proactive per-provider size ceilings with a reactive shrink path
on the provider's actual rejection. All providers now attempt native
full-size attachment first; if the provider returns an image-too-large
error, the agent silently shrinks and retries once.
Why the previous design was wrong: hardcoding provider ceilings
(anthropic=5MB, others=unlimited) meant OpenAI users on a 10MB image
paid no tax, but Anthropic users lost quality on anything >5MB even
though the empirical behaviour at provider-reject time is the same
(shrink + retry). Baking the table into the routing layer also
requires updating Hermes every time a provider's limit changes.
Reactive design:
- image_routing.py: _file_to_data_url encodes native size, no ceiling.
build_native_content_parts drops its provider kwarg.
- error_classifier.py: new FailoverReason.image_too_large + pattern
match ("image exceeds", "image too large", etc.) checked BEFORE
context_overflow so Anthropic's 5MB rejection lands in the right
bucket.
- run_agent.py: new _try_shrink_image_parts_in_messages walks api
messages in-place, re-encodes oversized data: URL image parts
through vision_tools._resize_image_for_vision to fit under 4MB,
handles both chat.completions (dict image_url) and Responses
(string image_url) shapes, ignores http URLs (provider-fetched).
New image_shrink_retry_attempted flag in the retry loop fires the
shrink exactly once per turn after credential-pool recovery but
before auth retries.
E2E verified live against Anthropic claude-sonnet-4-6:
- 17.9MB PNG (23.9MB b64) attached at native size
- Anthropic returns 400 "image exceeds 5 MB maximum"
- Agent logs '📐 Image(s) exceeded provider size limit — shrank and
retrying...'
- Retry succeeds, correct response delivered in 6.8s total.
Tests: 12 new (8 shrink-helper shapes + 4 classifier signals),
replaces 5 proactive-ceiling tests with 3 simpler 'native attach works'
tests. 181 targeted tests pass. test_enum_members_exist in
test_error_classifier.py updated for the new enum value.
|
||
|
|
2e6699b319 |
fix: strip leaked declare-x env dump from terminal output on macOS (#15459)
On macOS (bash 3.2 and some Homebrew bash builds) `source`ing a file that contains `declare -x` statements prints each declaration to stdout. The persistent-shell wrapper in tools/environments/base.py was only redirecting stderr when sourcing the session snapshot, so ~60 lines of env vars leaked into every terminal tool response — blowing out context and triggering HTTP 400s on context-limited providers. Fix: redirect both stdout and stderr when sourcing the snapshot. Linux bash is silent here, so the redirect is harmless there; macOS no longer leaks. Closes #15459 Co-authored-by: Sanjays2402 <51058514+Sanjays2402@users.noreply.github.com> |
||
|
|
a32d07529c |
fix(file-tools): escalate to BLOCKED on repeated read_file dedup stubs (#16382)
read_file's dedup path returned a lightweight stub on re-reads of an unchanged file, then returned early — so the consecutive-read loop guard (hard block at count>=4) at the bottom of read_file_tool never ran for stub-looped calls. Weaker tool-following models (local Qwen3.6 variants in the reported case) ignore the passive 'refer to earlier result' hint and hammer the same read_file call until iteration budget runs out. Track per-key stub returns in task_data['dedup_hits'] and, on the second stub for the same (path, offset, limit), return a hard BLOCKED error mirroring the wording the real-read path already uses. A real read, an intervening non-read tool call (notify_other_tool_call), or reset_file_dedup (on context compression) all clear the counter so the guard never stays engaged longer than the actual loop. Closes #15759 |
||
|
|
517f30b043 |
improve(agent): guidance for plain-text URLs, subagent language/verification, hermes-config routing (#16325)
Four small tool-description / skill-content tweaks addressing recurring model mistakes seen in @versun's docx feedback (Kimi 2.6, but the patterns apply to every model): 1. browser_navigate description: call out .md/.txt/.json/.yaml/.csv/.xml, raw.githubusercontent.com, and API endpoints as specifically preferring curl or web_extract. The generic "prefer web_search or web_extract" was too weak; models kept firing up the browser for plain-text URLs. 2. delegate_task description: two additions. (a) Pass user language / output-style preferences in 'context' when they differ from English — otherwise subagents default to English and their summaries contaminate the final reply (caused the bilingual digest bug). (b) Subagent summaries are self-reports, not verified facts. For operations with external side-effects (HTTP uploads, remote writes, file creation at shared paths), require a verifiable handle (URL, ID, path) and verify it yourself before claiming success. 3. agent/prompt_builder.py Skills-mandatory block: new explicit line "Whenever the user asks to configure / set up / modify / install / enable / disable / troubleshoot Hermes Agent itself, load the `hermes-agent` skill first." The generic "load what's relevant" didn't route Hermes-meta questions (like "how do I turn off redaction?") to the one skill that has the answer. 4. skills/autonomous-ai-agents/hermes-agent/SKILL.md: new "Security & Privacy Toggles" section covering security.redact_secrets (with the import-time-snapshot restart-required caveat), privacy.redact_pii, approvals.mode (manual/smart/off) + --yolo + HERMES_YOLO_MODE, shell hooks allowlist, and how to disable network/media tools entirely. Every command verified against the actual config keys — no invented knobs. Co-authored-by: teknium1 <teknium@noreply.github.com> |
||
|
|
9c416e20ab |
feat(skills): install skills from a direct HTTP(S) URL (#16323)
* feat(skills): install skills from a direct HTTP(S) URL Adds UrlSource adapter so `hermes skills install <url-to-SKILL.md>` and `/skills install <url>` work as first-class operations — no more improvising with curl + patch + cp. - Claims identifiers that start with http(s):// and end in .md - Skips /.well-known/skills/ URLs (WellKnownSkillSource handles those) - Skill name from YAML frontmatter, URL-slug fallback - Single-file SKILL.md only (v1 scope — multi-file skills need a manifest) - Trust level 'community'; full security scan still runs - Lock file stores the URL as identifier so `hermes skills update` re-fetches from the same URL cleanly Scope matches real user need from @versun's docx feedback where `https://sharethis.chat/SKILL.md` had no first-class install path. * feat(skills): interactive name/category for URL installs + --name override Follow-up to the UrlSource adapter. The previous commit fell back to weak heuristics when frontmatter had no ``name:`` and could produce garbage names like ``SKILL`` or ``unnamed-skill``. Now: tools/skills_hub.py - ``UrlSource._is_valid_skill_name()`` — strict identifier check (``^[a-z][a-z0-9_-]*$``), rejects sentinel values (``SKILL``, ``README``, ``INDEX``, ``unnamed-skill``, empty, non-strings). - ``_resolve_skill_name()`` returns ``Optional[str]`` — ``None`` when nothing valid is resolvable. Also ignores unsafe frontmatter names (``../evil``) and falls through to URL slug instead of returning None immediately, so a URL with a bad frontmatter but a good path still works. - ``fetch()``/``inspect()`` carry an ``awaiting_name=True`` marker in metadata/extra when resolution fails, letting ``do_install`` decide whether to prompt, apply an override, or error out. hermes_cli/skills_hub.py - ``do_install`` gains a ``name_override`` parameter. - On URL-sourced bundles with ``awaiting_name=True``: 1. If ``name_override`` is valid → use it. 2. If ``name_override`` is invalid → refuse with a clear error. 3. Else if ``skip_confirm=True`` (non-interactive: slash / TUI / gateway / scripts) → refuse with an actionable retry hint pointing at ``--name <your-name>`` on both CLI and slash forms. 4. Else (interactive TTY) → prompt for the name. - Interactive TTY also prompts for a category when none is given for a URL-sourced install, hinting existing category buckets so users can reuse ``productivity``, ``devops``, etc. Empty input → flat install. - ``_existing_categories()`` scans ``~/.hermes/skills/`` for subdirs that look like category buckets (contain nested SKILL.md files); skips top-level skills and hidden dirs. - ``_prompt_for_skill_name()`` / ``_prompt_for_category()`` helpers (EOF/Ctrl-C-safe, match the existing ``Confirm [y/N]`` prompt style). hermes_cli/main.py - ``hermes skills install`` argparse gains ``--name <name>``. hermes_cli/skills_hub.py (slash) - ``/skills install <url> --name <x>`` parsing added. Tests - tests/tools/test_skills_hub.py: updated ``UrlSource`` tests to assert the new ``awaiting_name`` metadata; added 4 new tests for ``_is_valid_skill_name`` rejection sets and the awaiting-name marker. - tests/hermes_cli/test_skills_hub.py: 8 new tests covering --name override accept/reject, non-interactive error, interactive name prompt, interactive category prompt, cancel-aborts-install, and ``_existing_categories`` scan behavior (buckets vs flat skills). - E2E verified all four paths (no-name/no-override → error; --name override → install; frontmatter name → install; invalid --name → rejection). --------- Co-authored-by: teknium1 <teknium@noreply.github.com> |
||
|
|
b288934dff |
fix(discord_tool): coerce limit parameter to int before min() call
_search_members() and _fetch_messages() call min(limit, 100) assuming limit is int. Models can pass limit as a string (e.g. "10"), causing TypeError: '<' not supported between instances of 'str' and 'int'. Add try/except int() coercion with safe defaults at the top of both functions, matching the pattern used in session_search fix (#10522). |
||
|
|
478444c262 |
feat(checkpoints): auto-prune orphan and stale shadow repos at startup (#16303)
Every working dir hermes ever touches gets its own shadow git repo under
~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/. The per-repo _prune is a
no-op (comment in CheckpointManager._prune says so), so abandoned repos
from deleted/moved projects or one-off tmp dirs pile up forever. Field
reports put the typical offender at 1000+ repos / ~12 GB on active
contributor machines.
Adds an opt-in startup sweep that mirrors the sessions.auto_prune
pattern from #13861 / #16286:
- tools/checkpoint_manager.py: new prune_checkpoints() and
maybe_auto_prune_checkpoints() helpers. Deletes shadow repos that
are orphan (HERMES_WORKDIR marker points to a path that no longer
exists) or stale (newest in-repo mtime older than retention_days).
Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only
runs once per min_interval_hours regardless of how many hermes
processes start up.
- hermes_cli/config.py: new checkpoints.auto_prune /
retention_days / delete_orphans / min_interval_hours knobs.
Default auto_prune: false so users who rely on /rollback against
long-ago sessions never lose data silently.
- cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune,
called right next to the existing state.db maintenance block.
- Docs updated with the new config knobs.
- 11 regression tests: orphan/stale deletion, precedence, byte-freed
tracking, non-shadow dir skip, interval gating, corrupt marker
recovery.
Refs #3015 (session-file disk growth was fixed in #16286; this covers
the checkpoint side noted out-of-scope there).
|
||
|
|
ced8f44cd2 |
fix(file-tools): broaden dedup-status write guard to cover small wrappers
The write_file guard added in #16223 used strict equality against the internal dedup status message. In practice, the model sometimes prepends a short note or appends a trailing comment before calling write_file, which slipped past the strict check. Broaden the heuristic: reject writes whose stripped content equals the status message OR contains it and is <=2x its length. Short, status-dominated writes are always corruption; legitimate docs that quote the message verbatim are always much longer. Adds two tests: one for the small-wrapper corruption shape, one confirming large legitimate files that quote the status still write. |
||
|
|
977d5f56c9 | fix(file-tools): keep read dedup status out of file content | ||
|
|
a32b325d06 |
fix(tools): invalidate read_file dedup cache on write_file and patch
write_file_tool and patch_tool both call _update_read_timestamp to refresh the staleness tracker after writing, but they never invalidate the dedup cache entries for the written path. The dedup cache keys are (resolved_path, offset, limit) → mtime tuples populated by read_file_tool. On filesystems where a read and write land in the same mtime second (or when mtime granularity is 1s), the cached and current mtime are equal, so the dedup check incorrectly returns a 'File unchanged since last read' stub — even though the file was just overwritten. The agent then sees stale content (or a stale 'File not found' error) and enters expensive error-recovery loops, burning API calls. Fix: add _invalidate_dedup_for_path(filepath, task_id) that removes all dedup entries whose resolved path matches the written file. Called from _update_read_timestamp so both write_file_tool and patch_tool benefit automatically. Scoped to the writing task_id — other tasks' caches are not affected. 6 regression tests added covering: - read→write→read within same mtime second (core #13144 scenario) - invalidation across all offset/limit combinations - isolation: writing file A does not invalidate file B's cache - isolation: writing in task A does not invalidate task B's cache - _invalidate_dedup_for_path safety on missing task / empty dedup All 25 tests pass (19 existing + 6 new). Fixes #13144 |
||
|
|
dbe5015566 | fix(session-search): exclude current lineage root deterministically in recent mode | ||
|
|
f66ebe64e8 | fix(cli): coerce use_gateway config flags in tool routing | ||
|
|
ab6879634e |
yuanbao platform (#16298)
Co-authored-by: loongzhao <loongzhao@tencent.com> |
||
|
|
fd474d0f00 | fix(gateway): avoid cross-user mirror writes in per-user group sessions | ||
|
|
7317d69f19 | fix(security): treat quoted false as false in browser SSRF guards | ||
|
|
930494d687 |
fix(cron): reap orphaned MCP stdio subprocesses after each tick
MCP stdio servers are spawned via the SDK's stdio_client, which on Linux uses start_new_session=True (setsid). When a cron job is cancelled mid-way (timeout, agent finish, exception), the subprocess often escapes the SDK's teardown and survives as a session leader. Because setsid() detaches the child from the gateway's process group / cgroup tree, systemd does not reap it on service restart either — so every cron tick that touches an MCP tool leaks a dangling server process. Fix: * tools/mcp_tool.py — _run_stdio now wraps the whole stdio+session context in try/finally. On any exit path (clean, exception, cancellation), PIDs still alive are moved from the active _stdio_pids set into a new _orphan_stdio_pids set. Orphan detection is done via os.kill(pid, 0) — a cheap liveness probe that never signals the target. * tools/mcp_tool.py — _kill_orphaned_mcp_children gains an include_active=False flag. Default behaviour now only reaps the orphan set so concurrent sessions (other parallel cron jobs or live user chats) are never disrupted. The existing shutdown path passes include_active=True to keep the previous "kill everything" semantics after the MCP loop is stopped. * cron/scheduler.py — the cleanup hook is moved from run_job()'s finally (which would race with parallel siblings after #13021) into tick() after the ThreadPoolExecutor has joined every future. At that point there are no in-flight sessions from this tick, so sweeping the orphan set is always safe. Net effect: zero regression for healthy sessions, and orphan MCP servers no longer accumulate between gateway restarts. Made-with: Cursor |
||
|
|
75d3eaa0e4 |
fix(slack): exclude U/W user IDs from explicit target regex
Slack's chat.postMessage API rejects user IDs (U...) and workspace IDs (W...) — they are not valid conversation IDs. Posting to them fails because the API requires a channel ID (C/G/D). To DM a user, the sender must first call conversations.open to obtain a D... ID. Tighten _SLACK_TARGET_RE from [CGDUW] to [CGD] so the send path rejects U/W values as explicit targets and instead falls through to channel- name resolution (where they'll fail with a clear 'could not resolve' error rather than silently getting stuck in a retry loop on the API). Flip the corresponding regression test to assert U/W values are not explicit. Matches the narrower regex briandevans proposed in #15939. Co-authored-by: briandevans <brian@bde.io> |
||
|
|
802c7acb81 |
fix(Slack): resolve Slack channels by raw ID and enumerate joined channels
send_message(target='slack:<channel_id>') failed with "Could not resolve" because _parse_target_ref had no Slack branch — Slack's uppercase alphanumeric IDs fell through to channel-name resolution, which only matched by name. As a fallback, the agent would retry with bare target='slack' and post to the home channel instead. Three fixes: - _parse_target_ref recognizes Slack IDs (C/G/D/U/W prefix) as explicit targets so the name-resolver is bypassed entirely. - resolve_channel_name tries a case-sensitive raw-ID match before the existing name match, so any platform's IDs resolve cleanly. - _build_slack now actually calls users.conversations against each workspace's AsyncWebClient (paginated), instead of only returning session-history entries. This populates the directory with public and private channels the bot has joined, so action='list' shows them and they can also be addressed by name. Errors from one workspace don't block others. build_channel_directory becomes async (Slack web calls require it). The two async-context callers in gateway/run.py are awaited; the cron ticker thread call bridges via asyncio.run_coroutine_threadsafe. Slack bot needs channels:read and groups:read scopes for full enumeration; missing scopes degrade gracefully per-workspace. addressing #15927 |
||
|
|
5b2c59559a |
feat(terminal): collapse subagent task_ids to shared container (#16177)
Before: delegate_task children each allocated their own terminal
sandbox keyed by child task_id. Starting extra containers (or Modal
sandboxes / Daytona workspaces) is expensive, and the subagent's work
is invisible to the parent — files written by the child in its
container don't exist in the parent's when the subagent returns.
After: a single `_resolve_container_task_id` helper maps any
tool-call task_id to "default" UNLESS an env override is registered
for it. The parent agent and all delegate_task children therefore
share one long-lived sandbox — installed packages, cwd, /workspace
files, and /tmp scratch carry over freely between them.
RL and benchmark environments (TerminalBench2, HermesSweEnv, ...)
opt in to isolation via `register_task_env_overrides(task_id, {...})`;
those task_ids survive the collapse and get their own sandbox,
preserving the per-task Docker image behavior these benchmarks rely on.
file_state / active-subagents registry / TUI events still key off the
original child task_id, so the 'subagent wrote a file the parent read'
warning and UI per-subagent panels keep working.
Tradeoff: parallel delegate_task children (tasks=[...]) now share one
bash/container. Concurrent cd, env-var mutations, and writes to the
same path will collide. If that bites a specific workflow, the
subagent can opt back into isolation via register_task_env_overrides.
Applied at four lookup sites:
- tools/terminal_tool.py terminal_tool() and get_active_env()
- tools/file_tools.py _get_file_ops() and _get_live_tracking_cwd()
- tools/code_execution_tool.py _get_or_create_environment()
Docs: website/docs/user-guide/configuration.md updated to reflect the
shared-container reality and document the RL/benchmark carve-out.
Tests: tests/tools/test_shared_container_task_id.py (9 cases).
|
||
|
|
42c076d349 |
feat(browser): auto-spawn local Chromium for LAN/localhost URLs in cloud mode (#16136)
When a cloud browser provider (Browserbase / Browser-Use / Firecrawl) is
configured, browser_navigate now transparently spawns a local Chromium
sidecar for URLs whose host resolves to a private/loopback/LAN address
(localhost, 127.0.0.1, 192.168.x.x, 10.x.x.x, *.local, *.lan, *.internal,
::1, 169.254.x.x). Public URLs continue to use the cloud provider in the
same conversation.
Previously, setting BROWSERBASE_API_KEY / cloud_provider: browserbase
pinned the whole tool to cloud for the process — localhost URLs were
either SSRF-blocked (default) or sent to Browserbase (where they 404'd
because the cloud can't reach your LAN). Users who wanted 'cloud for
public, local for localhost' had no way to express it short of toggling
providers mid-session.
Implementation uses a composite session key scheme: the bare task_id
serves the cloud session, and a '{task_id}::local' sidecar serves the
local Chromium. _last_active_session_key[task_id] tracks which of the
two served the most recent nav so snapshot/click/fill/etc. hit the
correct one. cleanup_browser(bare_task_id) reaps both.
Feature is on by default. Opt out via:
browser:
auto_local_for_private_urls: false
The cloud provider never sees private URLs. Post-redirect SSRF guard
is preserved: redirects from public onto private addresses still block.
|
||
|
|
20cb706e03 |
chore: extend [SYSTEM:→[IMPORTANT: rename + AUTHOR_MAP
Follow-up to #6616 covering the remaining user-injected prompt markers that the original PR did not touch (reporter's second comment on #6576 explicitly flagged these). Azure OpenAI Default/DefaultV2 content filters treat any bracketed [SYSTEM: ...] as prompt-injection and reject with HTTP 400. Remaining call sites renamed: - cli.py: background-process notifications (watch_disabled, watch_match, completion), MCP reload notice (4 live + 1 docstring) - gateway/run.py: same notification paths + auto-loaded skill banner + MCP reload notice (5 live + 1 docstring) - tools/process_registry.py: comment reference Not renamed: - environments/hermes_base_env.py '[SYSTEM]\n{content}' — RL training trajectory rendering only, never sent to Azure, part of a symmetric [USER]/[ASSISTANT]/[TOOL] scheme. AUTHOR_MAP: buraysandro9@gmail.com -> ygd58. |
||
|
|
d09ab8ff13 |
fix(mcp-oauth): preserve server_url path for protected-resource validation (#16031)
Stop pre-stripping the path from the configured MCP server URL before constructing OAuthClientProvider. The MCP SDK strips the path itself via OAuthContext.get_authorization_base_url() for authorization-server discovery, but uses the full server_url through resource_url_from_server_url() + check_resource_allowed() to validate against the server's RFC 9728 Protected Resource Metadata. For servers whose PRM advertises a path-scoped resource (e.g. Notion's https://mcp.notion.com/mcp), our _parse_base_url() collapsed the URL to the origin, so check_resource_allowed() saw requested='/' vs configured='/mcp/' and refused the token. Fixes OAuth against Notion MCP (and any other path-scoped resource). Closes #16015. |
||
|
|
eb28145f36 |
feat(approval): hardline blocklist for unrecoverable commands (#15878)
Adds a floor below --yolo: a tiny set of commands so catastrophic they should never run via the agent, regardless of --yolo, gateway /yolo, approvals.mode=off, or cron approve mode. Opting into yolo is trusting the agent with your files and services — not trusting it to wipe the disk or power the box off. The list is deliberately small (12 patterns), covering only unrecoverable ops: - rm -rf targeting /, /home, /etc, /usr, /var, /boot, /bin, /sbin, /lib, ~, $HOME - mkfs (any variant) - dd + redirection to raw block devices (/dev/sd*, /dev/nvme*, etc.) - fork bomb - kill -1 / kill -9 -1 - shutdown, reboot, halt, poweroff, init 0/6, telinit 0/6, systemctl poweroff/reboot/halt/kexec Recoverable-but-costly commands (git reset --hard, rm -rf /tmp/x, chmod -R 777, curl | sh) stay in DANGEROUS_PATTERNS where yolo can still pass them through — that's what yolo is for. Container backends (docker/singularity/modal/daytona) continue to bypass both hardline and dangerous checks, since nothing they do can touch the host. Inspired by Mercury Agent's permission-hardened blocklist. |
||
|
|
97d54f0e4d |
fix(terminal): three-layer defense against watch_patterns notification spam (#15642)
* fix(terminal): three-layer defense against watch_patterns notification spam Background processes that stack notify_on_complete=True with watch_patterns can flood the user with duplicate, delayed notifications — matches deliver asynchronously via the completion queue and continue arriving minutes after the process has exited. The docstring warning against this (PR #12113) has proven insufficient; agents still misuse the combination. Three layered defenses, each sufficient on its own: 1. Mutual exclusion (terminal_tool.py): When both flags are set on a background process, drop watch_patterns with a warning. notify_on_complete wins because 'let me know when it's done' is the more useful signal and fires exactly once. Extracted as _resolve_notification_flag_conflict() so the rule is testable in isolation. 2. Suppress-after-exit (process_registry.py): _check_watch_patterns() now bails the moment session.exited is True. Post-exit chunks (buffered reads draining after the process is gone) no longer produce notifications. This is the fix flagged as future work in session 20260418_020302_79881c. 3. Global circuit breaker (process_registry.py): Per-session rate limits don't catch the sibling-flood case — N concurrent processes can each stay under 8/10s and still collectively spam. New WATCH_GLOBAL_MAX_PER_WINDOW=15 cap trips a 30-second cooldown across ALL sessions, emits a single watch_overflow_tripped event, silently counts dropped events, and emits a watch_overflow_released summary when the cooldown ends. Also updates the tool schema + docstring to document the new behavior. Tests: 8 new tests covering all three fixes (suppress-after-exit x2, mutual-exclusion resolver x4, global breaker trip/cooldown/release x2). All 60 tests across test_watch_patterns.py, test_notify_on_complete.py, test_terminal_tool.py pass. Real-world trigger: self-inflicted in session 20260425_051924 — three concurrent hermes-sweeper review subprocesses each set watch_patterns= ['failed validation', 'errored'] AND notify_on_complete=True, then iterated over multiple items, producing enough matches per process to defeat the per-session cap while staying under the global cap that didn't yet exist. * fix(terminal): aggressive 1-per-15s watch_patterns rate limit + strike-3 promotion Per Teknium's direction, the watch_patterns rate limit is now much more aggressive and self-healing. ## New rule — per session - HARD cap: 1 watch-match notification per 15 seconds per process. - Any match arriving inside the cooldown window is dropped and counts as ONE strike for that window (many drops in the same window still = 1 strike). - After 3 consecutive strike windows, watch_patterns is permanently disabled for the session and the session is auto-promoted to notify_on_complete semantics — exactly one notification when the process actually exits. - A cooldown window that expires with zero drops resets the consecutive strike counter — healthy cadence is forgiven. ## Schema + docstring rewritten The tool schema description now gives the model explicit guidance: - notify_on_complete is 'the right choice for almost every long-running task' - watch_patterns is for RARE one-shot signals on LONG-LIVED processes - Do NOT use watch_patterns with loops/batch jobs — error patterns fire every iteration and will hit the strike limit fast - Mutual exclusion is stated on both parameter descriptions - 1/15s cooldown and 3-strike promotion are stated in the watch_patterns description so the model sees the contract every turn ## Removed - WATCH_MAX_PER_WINDOW (8/10s) and WATCH_OVERLOAD_KILL_SECONDS (45) — the new 1/15s limit subsumes both; keeping them would double-count. - _watch_window_hits / _watch_window_start / _watch_overload_since fields on ProcessSession. Replaced by _watch_last_emit_at / _watch_cooldown_until / _watch_strike_candidate / _watch_consecutive_strikes. ## Kept - Global circuit breaker across all sessions (15/10s → 30s cooldown) as a secondary safety net for concurrent siblings. Still valuable when 20 short-lived processes each fire once — none individually violates the per-session limit. - Suppress-after-exit guard. - Mutual exclusion resolver at the tool entry point. ## Tests - 6 new tests in TestPerSessionRateLimit covering: first match delivers, second in cooldown suppressed, multi-drop = single strike, 3 strikes disables + promotes, clean window resets counter, suppressed count carried to next emit. - Global circuit breaker tests rewritten to use fresh sessions instead of hacking removed per-window fields. - 50/50 watch_patterns + notify_on_complete tests pass. - 60/60 including test_terminal_tool.py pass. |
||
|
|
81987f0350 |
feat(discord): split discord_server into discord + discord_admin tools
Split the monolithic discord_server tool (14 actions) into two: - discord: core actions (fetch_messages, search_members, create_thread) that are useful for the agent's normal operation. Auto-enabled on the discord platform via the pipeline fix. - discord_admin: server management actions (list channels/roles, pins, role assignment) that require explicit opt-in via hermes tools. Added to CONFIGURABLE_TOOLSETS and _DEFAULT_OFF_TOOLSETS. |
||
|
|
0d548d1db9 |
fix(cron): wire context_from through the update action
The tool schema promised 'On update, pass an empty array to clear' but the update branch ignored the context_from kwarg entirely — users could set the field at create time and never modify or clear it afterward. - tools/cronjob_tools.py: handle context_from in the update branch the same way script/enabled_toolsets/workdir are handled: normalize str/list to refs, validate each referenced job exists (same check the create branch does), store as list-or-None to match create_job()'s shape. Empty string or empty list clears the field. - tests/cron/test_cron_context_from.py: 6 new tests covering add/change/ clear (both shapes)/bad-ref/preserve-across-unrelated-update. |
||
|
|
5ac5365923 | feat(cron): add context_from field for cron job output chaining | ||
|
|
023b1bff11 |
fix(delegate): resolve subagent approval prompts without deadlocking parent TUI (#15491)
Subagents run inside a ThreadPoolExecutor. The CLI's interactive approval callback lives in tools/terminal_tool.py's threading.local(), which worker threads do not inherit. When a subagent hits a dangerous-command guard, prompt_dangerous_approval() falls back to input() from the worker thread, deadlocking against the parent's prompt_toolkit TUI that owns stdin. Fix: install a non-interactive callback into every subagent worker thread via ThreadPoolExecutor(initializer=set_approval_callback, initargs=(cb,)). The callback is config-gated by delegation.subagent_auto_approve: false (default) -> _subagent_auto_deny (safe; matches leaf tool blocklist) true -> _subagent_auto_approve (opt-in YOLO for cron/batch) Both emit a logger.warning audit line. Gateway sessions are unaffected because they resolve approvals via tools/approval.py's per-session queue, not through these TLS callbacks. Diagnosis credit: @MorAlekss (#14685). - hermes_cli/config.py: DEFAULT_CONFIG.delegation.subagent_auto_approve: False - cli-config.yaml.example: documented, commented (default) - tools/delegate_tool.py: _subagent_auto_deny, _subagent_auto_approve, _get_subagent_approval_callback, wired into the child timeout executor - tests/tools/test_delegate.py: 7 tests covering defaults, truthy coercion, and TLS scoping in the worker thread |
||
|
|
0ed37c0ca4 | docs(delegate): document max_concurrent_children and max_spawn_depth + cost warning | ||
|
|
fd10463069 | fix(env): safely quote ~/ subpaths in wrapped cd commands | ||
|
|
2de8a7a229 |
fix(skills): drop raw_content to avoid doubling skill payload
skill_view response went to the model verbatim; duplicating the SKILL.md body as raw_content on every tool call added token cost with no agent-facing benefit. Remove the field and update tests to assert on content only. The slash/preload caller (agent/skill_commands.py) already falls back to content when raw_content is absent, and it calls skill_view(preprocess=False) anyway, so content is already unrendered on that path. |
||
|
|
ead66f0c92 | fix(skills): apply inline shell in skill_view | ||
|
|
fcc05284fc |
fix(delegate): tool-activity-aware heartbeat stale detection (#13041) (#15183)
A child running a legitimately long-running tool (terminal command, browser fetch, big file read) holds current_tool set and keeps api_call_count frozen while the tool runs. The previous stale check treated that as idle after 5 heartbeat cycles (~150s), stopped touching the parent, and let the gateway kill the session. Split the threshold in two: - _HEARTBEAT_STALE_CYCLES_IDLE=5 (~150s) — applied only when current_tool is None (child wedged between turns) - _HEARTBEAT_STALE_CYCLES_IN_TOOL=20 (~600s) — applied when the child is inside a tool call Stale counter also resets when current_tool changes (new tool = progress). The hard child_timeout_seconds (default 600s) is still the final cap, so genuinely stuck tools don't get to block forever. |
||
|
|
8d12fb1e6b |
refactor(spotify): convert to built-in bundled plugin under plugins/spotify (#15174)
Moves the Spotify integration from tools/ into plugins/spotify/,
matching the existing pattern established by plugins/image_gen/ for
third-party service integrations.
Why:
- tools/ should be reserved for foundational capabilities (terminal,
read_file, web_search, etc.). tools/providers/ was a one-off
directory created solely for spotify_client.py.
- plugins/ is already the home for image_gen backends, memory
providers, context engines, and standalone hook-based plugins.
Spotify is a third-party service integration and belongs alongside
those, not in tools/.
- Future service integrations (eventually: Deezer, Apple Music, etc.)
now have a pattern to copy.
Changes:
- tools/spotify_tool.py → plugins/spotify/tools.py (handlers + schemas)
- tools/providers/spotify_client.py → plugins/spotify/client.py
- tools/providers/ removed (was only used for Spotify)
- New plugins/spotify/__init__.py with register(ctx) calling
ctx.register_tool() × 7. The handler/check_fn wiring is unchanged.
- New plugins/spotify/plugin.yaml (kind: backend, bundled, auto-load).
- tests/tools/test_spotify_client.py: import paths updated.
tools_config fix — _DEFAULT_OFF_TOOLSETS now wins over plugin auto-enable:
- _get_platform_tools() previously auto-enabled unknown plugin
toolsets for new platforms. That was fine for image_gen (which has
no toolset of its own) but bad for Spotify, which explicitly
requires opt-in (don't ship 7 tool schemas to users who don't use
it). Added a check: if a plugin toolset is in _DEFAULT_OFF_TOOLSETS,
it stays off until the user picks it in 'hermes tools'.
Pre-existing test bug fix:
- tests/hermes_cli/test_plugins.py::test_list_returns_sorted
asserted names were sorted, but list_plugins() sorts by key
(path-derived, e.g. image_gen/openai). With only image_gen plugins
bundled, name and key order happened to agree. Adding plugins/spotify
broke that coincidence (spotify sorts between openai-codex and xai
by name but after xai by key). Updated test to assert key order,
which is what the code actually documents.
Validation:
- scripts/run_tests.sh tests/hermes_cli/test_plugins.py \
tests/hermes_cli/test_tools_config.py \
tests/hermes_cli/test_spotify_auth.py \
tests/tools/test_spotify_client.py \
tests/tools/test_registry.py
→ 143 passed
- E2E plugin load: 'spotify' appears in loaded plugins, all 7 tools
register into the spotify toolset, check_fn gating intact.
|
||
|
|
e5d41f05d4 |
feat(spotify): consolidate tools (9→7), add spotify skill, surface in hermes setup (#15154)
Three quality improvements on top of #15121 / #15130 / #15135: 1. Tool consolidation (9 → 7) - spotify_saved_tracks + spotify_saved_albums → spotify_library with kind='tracks'|'albums'. Handler code was ~90 percent identical across the two old tools; the merge is a behavioral no-op. - spotify_activity dropped. Its 'now_playing' action was a duplicate of spotify_playback.get_currently_playing (both return identical 204/empty payloads). Its 'recently_played' action moves onto spotify_playback as a new action — history belongs adjacent to live state. - Net: each API call ships 2 fewer tool schemas when the Spotify toolset is enabled, and the action surface is more discoverable (everything playback-related is on one tool). 2. Spotify skill (skills/media/spotify/SKILL.md) Teaches the agent canonical usage patterns so common requests don't balloon into 4+ tool calls: - 'play X' = one search, then play by URI (not search + scan + describe + play) - 'what's playing' = single get_currently_playing (no preflight get_state chain) - Don't retry on '403 Premium required' or '403 No active device' — both require user action - URI/URL/bare-ID format normalization - Full failure-mode reference for 204/401/403/429 3. Surfaced in 'hermes setup' tool status Adds 'Spotify (PKCE OAuth)' to the tool status list when auth.json has a Spotify access/refresh token. Matches the homeassistant pattern but reads from auth.json (OAuth-based) rather than env vars. Docs updated to reflect the new 7-tool surface, and mention the companion skill in the 'Using it' section. Tests: 54 passing (client 22, auth 15, tools_config 35 — 18 = 54 after renaming/replacing the spotify_activity tests with library + recently_played coverage). Docusaurus build clean. |
||
|
|
e87a2100f6 |
fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)
Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path, so **every subsequent call on the affected server failed until the gateway was manually restarted**. Reporter: #13383. The OAuth-based recovery path (``_handle_auth_error_and_retry``) already exists for 401s, but it only fires on auth errors. Session expiry slipped through because the access token is still valid — nothing 401'd, so the existing recovery branch was skipped. Fix --- Add a sibling function ``_handle_session_expired_and_retry`` that detects MCP session-expiry via ``_is_session_expired_error`` (a narrow allow-list of known-stable substrings: ``"invalid or expired session"``, ``"session expired"``, ``"session not found"``, ``"unknown session"``, etc.) and then uses the existing transport reconnect mechanism: * Sets ``MCPServerTask._reconnect_event`` — the server task's lifecycle loop already interprets this as "tear down the current ``streamablehttp_client`` + ``ClientSession`` and rebuild them, reusing the existing OAuth provider instance". * Waits up to 15 s for the new session to come back ready. * Retries the original call once. If the retry succeeds, returns its result and resets the circuit-breaker error count. If the retry raises, or if the reconnect doesn't ready in time, falls through to the caller's generic error path. Unlike the 401 path, this does **not** call ``handle_401`` — the access token is already valid and running an OAuth refresh would be a pointless round-trip. All 5 MCP handlers (``call_tool``, ``list_resources``, ``read_resource``, ``list_prompts``, ``get_prompt``) now consult both recovery paths before falling through: recovered = _handle_auth_error_and_retry(...) # 401 path if recovered is not None: return recovered recovered = _handle_session_expired_and_retry(...) # new if recovered is not None: return recovered # generic error response Narrow scope — explicitly not changed ------------------------------------- * **Detection is string-based on a 5-entry allow-list.** The MCP SDK wraps JSON-RPC errors in ``McpError`` whose exception type + attributes vary across SDK versions, so matching on message substrings is the durable path. Kept narrow to avoid false positives — a regular ``RuntimeError("Tool failed")`` will NOT trigger spurious reconnects (pinned by ``test_is_session_expired_rejects_unrelated_errors``). * **No change to the existing 401 recovery flow.** The new path is consulted only after the auth path declines (returns ``None``). * **Retry count stays at 1.** If the reconnect-then-retry also fails, we don't loop — the error surfaces normally so the model sees a failed tool call rather than a hang. * **``InterruptedError`` is explicitly excluded** from session-expired detection so user-cancel signals always short-circuit the same way they did before (pinned by ``test_is_session_expired_rejects_interrupted_error``). Regression coverage ------------------- ``tests/tools/test_mcp_tool_session_expired.py`` (new, 16 cases): Unit tests for ``_is_session_expired_error``: * ``test_is_session_expired_detects_invalid_or_expired_session`` — reporter's exact wpcom-mcp text. * ``test_is_session_expired_detects_expired_session_variant`` — "Session expired" / "expired session" variants. * ``test_is_session_expired_detects_session_not_found`` — server GC variant ("session not found", "unknown session"). * ``test_is_session_expired_is_case_insensitive``. * ``test_is_session_expired_rejects_unrelated_errors`` — narrow-scope canary: random RuntimeError / ValueError / 401 don't trigger. * ``test_is_session_expired_rejects_interrupted_error`` — user cancel must never route through reconnect. * ``test_is_session_expired_rejects_empty_message``. Handler integration tests: * ``test_call_tool_handler_reconnects_on_session_expired`` — reporter's full repro: first call raises "Invalid or expired session", handler signals ``_reconnect_event``, retries once, returns the retry's success result with no ``error`` key. * ``test_call_tool_handler_non_session_expired_error_falls_through`` — preserved-behaviour canary: random tool failures do NOT trigger reconnect. * ``test_session_expired_handler_returns_none_without_loop`` — defensive: cold-start / shutdown race. * ``test_session_expired_handler_returns_none_without_server_record`` — torn-down server falls through cleanly. * ``test_session_expired_handler_returns_none_when_retry_also_fails`` — no retry loop on repeated failure. Parametrised across all 4 non-``tools/call`` handlers: * ``test_non_tool_handlers_also_reconnect_on_session_expired`` [list_resources / read_resource / list_prompts / get_prompt]. **15 of 16 fail on clean ``origin/main`` (``6fb69229``)** with ``ImportError: cannot import name '_is_session_expired_error'`` — the fix's surface symbols don't exist there yet. The 1 passing test is an ordering artefact of pytest-xdist worker collection. Validation ---------- ``source venv/bin/activate && python -m pytest tests/tools/test_mcp_tool_session_expired.py -q`` → **16 passed**. Broader MCP suite (5 files: ``test_mcp_tool.py``, ``test_mcp_tool_401_handling.py``, ``test_mcp_tool_session_expired.py``, ``test_mcp_reconnect_signal.py``, ``test_mcp_oauth.py``) → **230 passed, 0 regressions**. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
8c2732a9f9 |
fix(security): strip MCP auth on cross-origin redirect
Add event hook to httpx.AsyncClient in MCP HTTP transport that strips Authorization headers when a redirect targets a different origin, preventing credential leakage to third-party servers. |
||
|
|
15050fd965 |
fix(mcp_oauth): raise RuntimeError instead of asserting OAuth port is set
``tools/mcp_oauth.py`` relied on ``assert _oauth_port is not None`` to guard the module-level port set by ``build_oauth_auth``. Python's ``-O`` / ``-OO`` optimization flags strip ``assert`` statements entirely, so a deployment that runs ``python -O -m hermes ...`` silently loses the check: ``_oauth_port`` stays ``None`` and the failure surfaces much later as an obscure ``int()`` or ``http.server.HTTPServer((host, None))`` TypeError rather than the intended "OAuth callback port not set" signal. Replace with an explicit ``if … raise RuntimeError(...)`` so the invariant is preserved regardless of the interpreter's optimization level. Docstring updated to document the new exception. Found during a proactive audit of ``assert`` statements in non-test code paths. |
||
|
|
5fa2f4258a |
fix: serialize Pydantic AnyUrl fields when persisting MCP OAuth state
OAuth client information and token responses from the MCP SDK contain Pydantic AnyUrl fields (client_uri, redirect_uris, etc.). The previous model_dump() call returned a dict with these AnyUrl objects still as their native Python type, which then crashed json.dumps with: TypeError: Object of type AnyUrl is not JSON serializable This caused any OAuth-based MCP server (e.g. alphaxiv) to fail registration with an "OAuth flow error" traceback during startup. Adding mode="json" tells Pydantic to serialize all fields to JSON-compatible primitives (AnyUrl -> str, datetime -> ISO string, etc.) before returning the dict, so the standard json.dumps can handle it. Three call sites fixed: - HermesTokenStorage.set_tokens - HermesTokenStorage.set_client_info - build_oauth_auth pre-registration write |
||
|
|
7e9dd9ca45 | Add native Spotify tools with PKCE auth | ||
|
|
852c7f3be3 |
feat(cron): per-job workdir for project-aware cron runs (#15110)
Cron jobs can now specify a per-job working directory. When set, the job runs as if launched from that directory: AGENTS.md / CLAUDE.md / .cursorrules from that dir are injected into the system prompt, and the terminal / file / code-exec tools use it as their cwd (via TERMINAL_CWD). When unset, old behaviour is preserved (no project context files, tools use the scheduler's cwd). Requested by @bluthcy. ## Mechanism - cron/jobs.py: create_job / update_job accept 'workdir'; validated to be an absolute existing directory at create/update time. - cron/scheduler.py run_job: if job.workdir is set, point TERMINAL_CWD at it and flip skip_context_files to False before building the agent. Restored in finally on every exit path. - cron/scheduler.py tick: workdir jobs run sequentially (outside the thread pool) because TERMINAL_CWD is process-global. Workdir-less jobs still run in the parallel pool unchanged. - tools/cronjob_tools.py + hermes_cli/cron.py + hermes_cli/main.py: expose 'workdir' via the cronjob tool and 'hermes cron create/edit --workdir ...'. Empty string on edit clears the field. ## Validation - tests/cron/test_cron_workdir.py (21 tests): normalize, create, update, JSON round-trip via cronjob tool, tick partition (workdir jobs run on the main thread, not the pool), run_job env toggle + restore in finally. - Full targeted suite (tests/cron/, test_cronjob_tools.py, test_cron.py, test_config_cwd_bridge.py, test_worktree.py): 314/314 passed. - Live smoke: hermes cron create --workdir $(pwd) works; relative path rejected; list shows 'Workdir:'; edit --workdir '' clears. |
||
|
|
7634c1386f |
feat(delegate): diagnostic dump when a subagent times out with 0 API calls (#15105)
When a subagent in delegate_task times out before making its first LLM
request, write a structured diagnostic file under
~/.hermes/logs/subagent-timeout-<sid>-<ts>.log capturing enough state
for the user (and us) to debug the hang. The old error message —
'Subagent timed out after Ns with no response. The child may be stuck
on a slow API call or unresponsive network request.' — gave no
observability for the 0-API-call case, which is the hardest to reason
about remotely.
The diagnostic captures:
- timeout config vs actual duration
- goal (truncated to 1000 chars)
- child config: model, provider, api_mode, base_url, max_iterations,
quiet_mode, platform, _delegate_role, _delegate_depth
- enabled_toolsets + loaded tool names
- system prompt byte/char count (catches oversized prompts that
providers silently choke on)
- tool schema count + byte size
- child's get_activity_summary() snapshot
- Python stack of the worker thread at the moment of timeout
(reveals whether the hang is in credential resolution, transport,
prompt construction, etc.)
Wiring:
- _run_single_child captures the worker thread via a small wrapper
around child.run_conversation so we can look up its stack at
timeout.
- After a FuturesTimeoutError, we pull child.get_activity_summary()
to read api_call_count. If 0 AND it was a timeout (not a raise),
_dump_subagent_timeout_diagnostic() is invoked.
- The returned path is surfaced in the error string so the parent
agent (and therefore the user / gateway) sees exactly where to look.
- api_calls > 0 timeouts keep the old 'stuck on slow API call'
phrasing since that's the correct diagnosis for those.
This does NOT change any behavior for successful subagent runs,
non-timeout errors, or subagents that made at least one API call
before hanging.
Tests: 7 cases (tests/tools/test_delegate_subagent_timeout_diagnostic.py)
- output format + required sections + field values
- long-goal truncation with [truncated] marker
- missing / already-exited worker thread branches
- unwritable HERMES_HOME/logs/ returns None without raising
- _run_single_child wiring: 0 API calls → dump + diagnostic_path in error
- _run_single_child wiring: N>0 API calls → no dump, old message
Refs: #14726
|
||
|
|
18f3fc8a6f |
fix(tests): resolve 17 persistent CI test failures (#15084)
Make the main-branch test suite pass again. Most failures were tests
still asserting old shapes after recent refactors; two were real source
bugs.
Source fixes:
- tools/mcp_tool.py: _kill_orphaned_mcp_children() slept 2s on every
shutdown even when no tracked PIDs existed, making test_shutdown_is_parallel
measure ~3s for 3 parallel 1s shutdowns. Early-return when pids is empty.
- hermes_cli/tips.py: tip 105 was 157 chars; corpus max is 150.
Test fixes (mostly stale mock targets / missing fixture fields):
- test_zombie_process_cleanup, test_agent_cache: patch run_agent.cleanup_vm
(the local name bound at import), not tools.terminal_tool.cleanup_vm.
- test_browser_camofox: patch tools.browser_camofox.load_config, not
hermes_cli.config.load_config (the source module, not the resolved one).
- test_flush_memories_codex._chat_response_with_memory_call: add
finish_reason, tool_call.id, tool_call.type so the chat_completions
transport normalizer doesn't AttributeError.
- test_concurrent_interrupt: polling_tool signature now accepts
messages= kwarg that _invoke_tool() passes through.
- test_minimax_provider: add _fallback_chain=[] to the __new__'d agent
so switch_model() doesn't AttributeError.
- test_skills_config: SKILLS_DIR MagicMock + .rglob stopped working
after the scanner switched to agent.skill_utils.iter_skill_index_files
(os.walk-based). Point SKILLS_DIR at a real tmp_path and patch
agent.skill_utils.get_external_skills_dirs.
- test_browser_cdp_tool: browser_cdp toolset was intentionally split into
'browser-cdp' (commit
|
||
|
|
4350668ae4 |
fix(transcription): fall back to CPU when CUDA runtime libs are missing
faster-whisper's device="auto" picks CUDA when ctranslate2's wheel ships CUDA shared libs, even on hosts without the NVIDIA runtime (libcublas.so.12 / libcudnn*). On those hosts the model often loads fine but transcribe() fails at first dlopen, and the broken model stays cached in the module-global — every subsequent voice message in the gateway process fails identically until restart. - Add _load_local_whisper_model() wrapper: try auto, catch missing-lib errors, retry on device=cpu compute_type=int8. - Wrap transcribe() with the same fallback: evict cached model, reload on CPU, retry once. Required because the dlopen failure only surfaces at first kernel launch, not at model construction. - Narrow marker list (libcublas, libcudnn, libcudart, 'cannot be loaded', 'no kernel image is available', 'no CUDA-capable device', driver mismatch). Deliberately excludes 'CUDA out of memory' and similar — those are real runtime failures that should surface, not be silently retried on CPU. - Tests for load-time fallback, runtime fallback (with cached-model eviction verified), and the OOM non-fallback path. Reported via Telegram voice-message dumps on WSL2 hosts where libcublas isn't installed by default. |
||
|
|
34c3e67109 |
fix: sanitize tool schemas for llama.cpp backends; restore MCP in TUI (#15032)
Local llama.cpp servers (e.g. ggml-org/llama.cpp:full-cuda) fail the entire
request with HTTP 400 'Unable to generate parser for this template. ...
Unrecognized schema: "object"' when any tool schema contains shapes its
json-schema-to-grammar converter can't handle:
* 'type': 'object' without 'properties'
* bare string schema values ('additionalProperties: "object"')
* 'type': ['X', 'null'] arrays (nullable form)
Cloud providers accept these silently, so they ship from external MCP
servers (Atlassian, GCloud, Datadog) and from a couple of our own tools.
Changes
- tools/schema_sanitizer.py: walks the finalized tool list right before it
leaves get_tool_definitions() and repairs the hostile shapes in a deep
copy. No-op on well-formed schemas. Recurses into properties, items,
additionalProperties, anyOf/oneOf/allOf, and $defs.
- model_tools.get_tool_definitions(): invoke the sanitizer as the last
step so all paths (built-in, MCP, plugin, dynamically-rebuilt) get
covered uniformly.
- tools/browser_cdp_tool.py, tools/mcp_tool.py: fix our own bare-object
schemas so sanitization isn't load-bearing for in-repo tools.
- tui_gateway/server.py: _load_enabled_toolsets() was passing
include_default_mcp_servers=False at runtime. That's the config-editing
variant (see PR #3252) — it silently drops every default MCP server
from the TUI's enabled_toolsets, which is why the TUI didn't hit the
llama.cpp crash (no MCP tools sent at all). Switch to True so TUI
matches CLI behavior.
Tests
tests/tools/test_schema_sanitizer.py (17 tests) covers the individual
failure modes, well-formed pass-through, deep-copy isolation, and
required-field pruning.
E2E: loaded the default 'hermes-cli' toolset with MCP discovery and
confirmed all 27 resolved tool schemas pass a llama.cpp-compatibility
walk (no 'object' node missing 'properties', no bare-string schema
values).
|
||
|
|
5a1c599412 |
feat(browser): CDP supervisor — dialog detection + response + cross-origin iframe eval (#14540)
* docs: browser CDP supervisor design (for upcoming PR) Design doc ahead of implementation — dialog + iframe detection/interaction via a persistent CDP supervisor. Covers backend capability matrix (verified live 2026-04-23), architecture, lifecycle, policy, agent surface, PR split, non-goals, and test plan. Supersedes #12550. No code changes in this commit. * feat(browser): add persistent CDP supervisor for dialog + frame detection Single persistent CDP WebSocket per Hermes task_id that subscribes to Page/Runtime/Target events and maintains thread-safe state for pending dialogs, frame tree, and console errors. Supervisor lives in its own daemon thread running an asyncio loop; external callers use sync API (snapshot(), respond_to_dialog()) that bridges onto the loop. Auto-attaches to OOPIF child targets via Target.setAutoAttach{flatten:true} and enables Page+Runtime on each so iframe-origin dialogs surface through the same supervisor. Dialog policies: must_respond (default, 300s safety timeout), auto_dismiss, auto_accept. Frame tree capped at 30 entries + OOPIF depth 2 to keep snapshot payloads bounded on ad-heavy pages. E2E verified against real Chrome via smoke test — detects + responds to main-frame alerts, iframe-contentWindow alerts, preserves frame tree, graceful no-dialog error path, clean shutdown. No agent-facing tool wiring in this commit (comes next). * feat(browser): add browser_dialog tool wired to CDP supervisor Agent-facing response-only tool. Schema: action: 'accept' | 'dismiss' (required) prompt_text: response for prompt() dialogs (optional) dialog_id: disambiguate when multiple dialogs queued (optional) Handler: SUPERVISOR_REGISTRY.get(task_id).respond_to_dialog(...) check_fn shares _browser_cdp_check with browser_cdp so both surface and hide together. When no supervisor is attached (Camofox, default Playwright, or no browser session started yet), tool is hidden; if somehow invoked it returns a clear error pointing the agent to browser_navigate / /browser connect. Registered in _HERMES_CORE_TOOLS and the browser / hermes-acp / hermes-api-server toolsets alongside browser_cdp. * feat(browser): wire CDP supervisor into session lifecycle + browser_snapshot Supervisor lifecycle: * _get_session_info lazy-starts the supervisor after a session row is materialized — covers every backend code path (Browserbase, cdp_url override, /browser connect, future providers) with one hook. * cleanup_browser(task_id) stops the supervisor for that task first (before the backend tears down CDP). * cleanup_all_browsers() calls SUPERVISOR_REGISTRY.stop_all(). * /browser connect eagerly starts the supervisor for task 'default' so the first snapshot already shows pending_dialogs. * /browser disconnect stops the supervisor. CDP URL resolution for the supervisor: 1. BROWSER_CDP_URL / browser.cdp_url override. 2. Fallback: session_info['cdp_url'] from cloud providers (Browserbase). browser_snapshot merges supervisor state (pending_dialogs + frame_tree) into its JSON output when a supervisor is active — the agent reads pending_dialogs from the snapshot it already requests, then calls browser_dialog to respond. No extra tool surface. Config defaults: * browser.dialog_policy: 'must_respond' (new) * browser.dialog_timeout_s: 300 (new) No version bump — new keys deep-merge into existing browser section. Deadlock fix in supervisor event dispatch: * _on_dialog_opening and _on_target_attached used to await CDP calls while the reader was still processing an event — but only the reader can set the response Future, so the call timed out. * Both now fire asyncio.create_task(...) so the reader stays pumping. * auto_dismiss/auto_accept now actually close the dialog immediately. Tests (tests/tools/test_browser_supervisor.py, 11 tests, real Chrome): * supervisor start/snapshot * main-frame alert detection + dismiss * iframe.contentWindow alert * prompt() with prompt_text reply * respond with no pending dialog -> clean error * auto_dismiss clears on event * registry idempotency * registry stop -> snapshot reports inactive * browser_dialog tool no-supervisor error * browser_dialog invalid action * browser_dialog end-to-end via tool handler xdist-safe: chrome_cdp fixture uses a per-worker port. Skipped when google-chrome/chromium isn't installed. * docs(browser): document browser_dialog tool + CDP supervisor - user-guide/features/browser.md: new browser_dialog section with workflow, availability gate, and dialog_policy table - reference/tools-reference.md: row for browser_dialog, tool count bumped 53 -> 54, browser tools count 11 -> 12 - reference/toolsets-reference.md: browser_dialog added to browser toolset row with note on pending_dialogs / frame_tree snapshot fields Full design doc lives at developer-guide/browser-supervisor.md (committed earlier). * fix(browser): reconnect loop + recent_dialogs for Browserbase visibility Found via Browserbase E2E test that revealed two production-critical issues: 1. **Supervisor WebSocket drops when other clients disconnect.** Browserbase's CDP proxy tears down our long-lived WebSocket whenever a short-lived client (e.g. agent-browser CLI's per-command CDP connection) disconnects. Fixed with a reconnecting _run loop that re-attaches with exponential backoff on drops. _page_session_id and _child_sessions are reset on each reconnect; pending_dialogs and frames are preserved across reconnects. 2. **Browserbase auto-dismisses dialogs server-side within ~10ms.** Their Playwright-based CDP proxy dismisses alert/confirm/prompt before our Page.handleJavaScriptDialog call can respond. So pending_dialogs is empty by the time the agent reads a snapshot on Browserbase. Added a recent_dialogs ring buffer (capacity 20) that retains a DialogRecord for every dialog that opened, with a closed_by tag: * 'agent' — agent called browser_dialog * 'auto_policy' — local auto_dismiss/auto_accept fired * 'watchdog' — must_respond timeout auto-dismissed (300s default) * 'remote' — browser/backend closed it on us (Browserbase) Agents on Browserbase now see the dialog history with closed_by='remote' so they at least know a dialog fired, even though they couldn't respond. 3. **Page.javascriptDialogClosed matching bug.** The event doesn't include a 'message' field (CDP spec has only 'result' and 'userInput') but our _on_dialog_closed was matching on message. Fixed to match by session_id + oldest-first, with a safety assumption that only one dialog is in flight per session (the JS thread is blocked while a dialog is up). Docs + tests updated: * browser.md: new availability matrix showing the three backends and which mode (pending / recent / response) each supports * developer-guide/browser-supervisor.md: three-field snapshot schema with closed_by semantics * test_browser_supervisor.py: +test_recent_dialogs_ring_buffer (12/12 passing against real Chrome) E2E verified both backends: * Local Chrome via /browser connect: detect + respond full workflow (smoke_supervisor.py all 7 scenarios pass) * Browserbase: detect via recent_dialogs with closed_by='remote' (smoke_supervisor_browserbase_v2.py passes) Camofox remains out of scope (REST-only, no CDP) — tracked for upstream PR 3. * feat(browser): XHR bridge for dialog response on Browserbase (FIXED) Browserbase's CDP proxy auto-dismisses native JS dialogs within ~10ms, so Page.handleJavaScriptDialog calls lose the race. Solution: bypass native dialogs entirely. The supervisor now injects Page.addScriptToEvaluateOnNewDocument with a JavaScript override for window.alert/confirm/prompt. Those overrides perform a synchronous XMLHttpRequest to a magic host ('hermes-dialog-bridge.invalid'). We intercept those XHRs via Fetch.enable with a requestStage=Request pattern. Flow when a page calls alert('hi'): 1. window.alert override intercepts, builds XHR GET to http://hermes-dialog-bridge.invalid/?kind=alert&message=hi 2. Sync XHR blocks the page's JS thread (mirrors real dialog semantics) 3. Fetch.requestPaused fires on our WebSocket; supervisor surfaces it as a pending dialog with bridge_request_id set 4. Agent reads pending_dialogs from browser_snapshot, calls browser_dialog 5. Supervisor calls Fetch.fulfillRequest with JSON body: {accept: true|false, prompt_text: '...', dialog_id: 'd-N'} 6. The injected script parses the body, returns the appropriate value from the override (undefined for alert, bool for confirm, string|null for prompt) This works identically on Browserbase AND local Chrome — no native dialog ever fires, so Browserbase's auto-dismiss has nothing to race. Dialog policies (must_respond / auto_dismiss / auto_accept) all still work. Bridge is installed on every attached session (main page + OOPIF child sessions) so iframe dialogs are captured too. Native-dialog path kept as a fallback for backends that don't auto-dismiss (so a page that somehow bypasses our override — e.g. iframes that load after Fetch.enable but before the init-script runs — still gets observed via Page.javascriptDialogOpening). E2E VERIFIED: * Local Chrome: 13/13 pytest tests green (12 original + new test_bridge_captures_prompt_and_returns_reply_text that asserts window.__ret === 'AGENT-SUPPLIED-REPLY' after agent responds) * Browserbase: smoke_bb_bridge_v2.py runs 4/4 PASS: - alert('BB-ALERT-MSG') dismiss → page.alert_ret = undefined ✓ - prompt('BB-PROMPT-MSG', 'default-xyz') accept with 'AGENT-REPLY' → page.prompt_ret === 'AGENT-REPLY' ✓ - confirm('BB-CONFIRM-MSG') accept → page.confirm_ret === true ✓ - confirm('BB-CONFIRM-MSG') dismiss → page.confirm_ret === false ✓ Docs updated in browser.md and developer-guide/browser-supervisor.md — availability matrix now shows Browserbase at full parity with local Chrome for both detection and response. * feat(browser): cross-origin iframe interaction via browser_cdp(frame_id=...) Adds iframe interaction to the CDP supervisor PR (was queued as PR 2). Design: browser_cdp gets an optional frame_id parameter. When set, the tool looks up the frame in the supervisor's frame_tree, grabs its child cdp_session_id (OOPIF session), and dispatches the CDP call through the supervisor's already-connected WebSocket via run_coroutine_threadsafe. Why not stateless: on Browserbase, each fresh browser_cdp WebSocket must re-negotiate against a signed connectUrl. The session info carries a specific URL that can expire while the supervisor's long-lived connection stays valid. Routing via the supervisor sidesteps this. Agent workflow: 1. browser_snapshot → frame_tree.children[] shows OOPIFs with is_oopif=true 2. browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF frame_id>, params={'expression': 'document.title', 'returnByValue': True}) 3. Supervisor dispatches the call on the OOPIF's child session Supervisor state fixes needed along the way: * _on_frame_detached now skips reason='swap' (frame migrating processes) * _on_frame_detached also skips when the frame is an OOPIF with a live child session — Browserbase fires spurious remove events when a same-origin iframe gets promoted to OOPIF * _on_target_detached clears cdp_session_id but KEEPS the frame record so the agent still sees the OOPIF in frame_tree during transient session flaps E2E VERIFIED on Browserbase (smoke_bb_iframe_agent_path.py): browser_cdp(method='Runtime.evaluate', params={'expression': 'document.title', 'returnByValue': True}, frame_id=<OOPIF>) → {'success': True, 'result': {'value': 'Example Domain'}} The iframe is <iframe src='https://example.com/'> inside a top-level data: URL page on a real Browserbase session. The agent Runtime.evaluates INSIDE the cross-origin iframe and gets example.com's title back. Tests (tests/tools/test_browser_supervisor.py — 16 pass total): * test_browser_cdp_frame_id_routes_via_supervisor — injects fake OOPIF, verifies routing via supervisor, Runtime.evaluate returns 1+1=2 * test_browser_cdp_frame_id_missing_supervisor — clean error when no supervisor attached * test_browser_cdp_frame_id_not_in_frame_tree — clean error on bad frame_id Docs (browser.md and developer-guide/browser-supervisor.md) updated with the iframe workflow, availability matrix now shows OOPIF eval as shipped for local Chrome + Browserbase. * test(browser): real-OOPIF E2E verified manually + chrome_cdp uses --site-per-process When asked 'did you test the iframe stuff' I had only done a mocked pytest (fake injected OOPIF) plus a Browserbase E2E. Closed the local-Chrome real-OOPIF gap by writing /tmp/dialog-iframe-test/ smoke_local_oopif.py: * 2 http servers on different hostnames (localhost:18905 + 127.0.0.1:18906) * Chrome with --site-per-process so the cross-origin iframe becomes a real OOPIF in its own process * Navigate, find OOPIF in supervisor.frame_tree, call browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF>) which routes through the supervisor's child session * Asserts iframe document.title === 'INNER-FRAME-XYZ' (from the inner page, retrieved via OOPIF eval) PASSED on 2026-04-23. Tried to embed this as a pytest but hit an asyncio version quirk between venv (3.11) and the system python (3.13) — Page.navigate hangs in the pytest harness but works in standalone. Left a self-documenting skip test that points to the smoke script + describes the verification. chrome_cdp fixture now passes --site-per-process so future iframe tests can rely on OOPIF behavior. Result: 16 pass + 1 documented-skip = 17 tests in tests/tools/test_browser_supervisor.py. * docs(browser): add dialog_policy + dialog_timeout_s to configuration.md, fix tool count Pre-merge docs audit revealed two gaps: 1. user-guide/configuration.md browser config example was missing the two new dialog_* knobs. Added with a short table explaining must_respond / auto_dismiss / auto_accept semantics and a link to the feature page for the full workflow. 2. reference/tools-reference.md header said '54 built-in tools' — real count on main is 54, this branch adds browser_dialog so it's 55. Fixed the header. (browser count was already correctly bumped 11 -> 12 in the earlier docs commit.) No code changes. |
||
|
|
3ccda2aa05 | fix(mcp): seed protocol header before HTTP initialize | ||
|
|
983bbe2d40 |
feat(skills): add design-md skill for Google's DESIGN.md spec (#14876)
* feat(config): make tool output truncation limits configurable Port from anomalyco/opencode#23770: expose a new `tool_output` config section so users can tune the hardcoded truncation caps that apply to terminal output and read_file pagination. Three knobs under `tool_output`: - max_bytes (default 50_000) — terminal stdout/stderr cap - max_lines (default 2000) — read_file pagination cap - max_line_length (default 2000) — per-line cap in line-numbered view All three keep their existing hardcoded values as defaults, so behaviour is unchanged when the section is absent. Power users on big-context models can raise them; small-context local models can lower them. Implementation: - New `tools/tool_output_limits.py` reads the section with defensive fallback (missing/invalid values → defaults, never raises). - `tools/terminal_tool.py` MAX_OUTPUT_CHARS now comes from get_max_bytes(). - `tools/file_operations.py` normalize_read_pagination() and _add_line_numbers() now pull the limits at call time. - `hermes_cli/config.py` DEFAULT_CONFIG gains the `tool_output` section so `hermes setup` writes defaults into fresh configs. - Docs page `user-guide/configuration.md` gains a "Tool Output Truncation Limits" section with large-context and small-context example configs. Tests (18 new in tests/tools/test_tool_output_limits.py): - Default resolution with missing / malformed / non-dict config. - Full and partial user overrides. - Coercion of bad values (None, negative, wrong type, str int). - Shortcut accessors delegate correctly. - DEFAULT_CONFIG exposes the section with the right defaults. - Integration: normalize_read_pagination clamps to the configured max_lines. * feat(skills): add design-md skill for Google's DESIGN.md spec Built-in skill under skills/creative/ that teaches the agent to author, lint, diff, and export DESIGN.md files — Google's open-source (Apache-2.0) format for describing a visual identity to coding agents. Covers: - YAML front matter + markdown body anatomy - Full token schema (colors, typography, rounded, spacing, components) - Canonical section order + duplicate-heading rejection - Component property whitelist + variants-as-siblings pattern - CLI workflow via 'npx @google/design.md' (lint/diff/export/spec) - Lint rule reference including WCAG contrast checks - Common YAML pitfalls (quoted hex, negative dimensions, dotted refs) - Starter template at templates/starter.md Package verified live on npm (@google/design.md@0.1.1). |