mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-21 03:39:54 +00:00

Files

T

Teknium eacce70a35 docs: comprehensive 2-week sweep of feature/PR coverage gaps (#28497 )

Catch the website docs up to two weeks of merged work (May 4 – May 18, 2026,
roughly 1,080 PRs). The audit found ~50 user-visible features that had landed
in code with no docs footprint, plus a handful of stale pages. This PR closes
every gap the scan turned up.

New pages
- user-guide/features/deliverable-mode.md — extension list, agent triggers,
  kanban_complete artifacts pattern, [[as_document]] override (PR #27813).
- developer-guide/web-search-provider-plugin.md — authoring guide modeled on
  image-gen-provider-plugin, covering brave_free / ddgs / etc. (PR #25448).

Providers / auth
- Rename "Alibaba Cloud" → "Qwen Cloud (Alibaba DashScope)" everywhere the
  display label shows up; provider id stays `alibaba` (PR #24835).
- Document OAuth refresh-token quarantine for xAI / MiniMax / Codex (PRs
  #28116 / #28118 / #28119).
- Document Nous JWT minting from refresh token + invalid-refresh quarantine
  + cross-profile shared token store (PRs #27663 / #19712).
- Add `## Microsoft Entra ID authentication (keyless)` section to
  azure-foundry guide — DefaultAzureCredential, RBAC, OpenAI + Anthropic
  routing details (PR #28101 / #9df9816da).
- Custom providers `api_mode` is now prompted-and-persisted, not just URL
  autodetected (PR #25068).
- Delegation honours `api_mode` + auto-detects anthropic_messages base URLs
  (PR #26824).
- `x_search` auto-enables when xAI credentials are present (PR #27376).
- Add `xAI Grok OAuth (SuperGrok)` row to providers headline table (PR
  #26534).
- NVIDIA NIM billing-origin header is set automatically (PR #26585).

Windows / installer
- `install.ps1`: document `-Commit <sha>` and `-Tag <v>` pin params plus
  the BOM-strip / git-retry hardening (PR #28169).
- Document Hermes Desktop thin installer + first-launch bootstrap (PR
  #27822).
- Document `dep_ensure` Windows bootstrap (PR #27845).
- Document install-method auto-detection (pip / git / homebrew / nixos) and
  the matching update command (PR #27843).

Gateway / messaging
- `/platform list|pause|resume` full description + circuit-breaker
  semantics (PR #26600).
- Slack / Matrix / Mattermost get parallel `allowed_channels` /
  `allowed_rooms` allowlist sections matching Telegram/Discord/DingTalk
  (PR #21251).
- Discord `allow_any_attachment` + `max_attachment_bytes` (config and env
  vars) (PR #27245).
- Discord clarify-choice button rendering (PR #25485).
- Telegram `guest_mode` @mention bypass for allowlisted groups (PR
  #22759).
- Telegram `notifications` mode (`important` vs `all`) (PR #22793).
- `[[as_document]]` skill / response directive for forcing
  document-style media delivery (PR #21210).

CLI / TUI
- `/new [name]` argument (PR #19637).
- `/subgoal` user-supplied criteria appended to `/goal` (PR #25449).
- `/exit --delete` flag confirmation prompts for destructive slash
  commands (PR #22687).
- Status-bar additions: ▶ N background indicator (PR #27175), context
  compression count (PR #21218), YOLO mode banner+statusbar warning (PR
  #26238).
- `display.timestamps` + `docker_extra_args` config keys (PR #23599).
- TUI collapsible startup banner sections (PR #20625).
- `HERMES_SESSION_ID` exported to tool subprocesses (PR #23847).

i18n
- Refresh display.language locale list from 8 → 16 (en, zh, zh-hant, ja,
  de, es, fr, tr, uk, af, ko, it, ga, pt, ru, hu) — matches
  `agent/i18n.py:SUPPORTED_LANGUAGES`.

Tools / features
- `vision_analyze` native-pixel passthrough for vision-capable callers,
  with auxiliary text-describer fallback (PR #22955).
- `session_search` rewrite to the single-shape tool (discovery / scroll /
  browse modes) (PRs #27590 / #27840).
- Clarify MCP transport scope: client supports stdio + SSE; embedded
  `hermes mcp serve` is stdio-only (PR #21227).
- Web search backends table: add Brave Search (free tier) and DDGS rows
  (PR #21337).
- ACP session-scoped edit auto-approval modes (PR #27862).
- Curator rename map in the user-visible per-run summary (PR #22910).
- Prompt caching feature page reference in features/overview.md — Claude
  cross-session 1-hour prefix cache on native Anthropic / OpenRouter /
  Nous Portal (PR #23828).
- Cron per-job profile parameter (PR #28124).
- `--no-skills` flag for `hermes profile create` (PR #20986).

Build
- Verified with `npm run build` in `website/`; both `en` and `zh-Hans`
  locales compile. Remaining broken-link/anchor warnings are pre-existing
  (`rl-training.md` from learning-path / overview; the
  zh-Hans translation lag the docs skill already calls out).

2026-05-18 23:55:25 -07:00

10 KiB

Raw Permalink Blame History

title, description, sidebar_label, sidebar_position

title	description	sidebar_label	sidebar_position
Vision & Image Paste	Paste images from your clipboard into the Hermes CLI for multimodal vision analysis.	Vision & Image Paste	7

Vision & Image Paste

Hermes Agent supports multimodal vision — you can paste images from your clipboard directly into the CLI and ask the agent to analyze, describe, or work with them. Images are sent to the model as base64-encoded content blocks, so any vision-capable model can process them.

How It Works

Copy an image to your clipboard (screenshot, browser image, etc.)
Attach it using one of the methods below
Type your question and press Enter
The image appears as a [📎 Image #1] badge above the input
On submit, the image is sent to the model as a vision content block

You can attach multiple images before sending — each gets its own badge. Press Ctrl+C to clear all attached images.

Images are saved to ~/.hermes/images/ as PNG files with timestamped filenames.

Paste Methods

How you attach an image depends on your terminal environment. Not all methods work everywhere — here's the full breakdown:

`/paste` Command

The most reliable explicit image-attach fallback.

/paste

Type /paste and press Enter. Hermes checks your clipboard for an image and attaches it. This is the safest option when your terminal rewrites Cmd+V/Ctrl+V, or when you copied only an image and there is no bracketed-paste text payload to inspect.

Ctrl+V / Cmd+V

Hermes now treats paste as a layered flow:

normal text paste first
native clipboard / OSC52 text fallback if the terminal did not deliver text cleanly
image attach when the clipboard or pasted payload resolves to an image or image path

This means pasted macOS screenshot temp paths and file://... image URIs can attach immediately instead of sitting in the composer as raw text.

:::warning If your clipboard has only an image (no text), terminals still cannot send binary image bytes directly. Use /paste as the explicit image-attach fallback. :::

`/terminal-setup` for VS Code / Cursor / Windsurf

If you run the TUI inside a local VS Code-family integrated terminal on macOS, Hermes can install the recommended workbench.action.terminal.sendSequence bindings for better multiline and undo/redo parity:

/terminal-setup

This is especially useful when Cmd+Enter, Cmd+Z, or Shift+Cmd+Z are being intercepted by the IDE. Run it on the local machine only — not inside an SSH session.

Platform Compatibility

Environment	`/paste`	Cmd/Ctrl+V	`/terminal-setup`	Notes
macOS Terminal / iTerm2	✅	✅	n/a	Best experience — native clipboard + screenshot-path recovery
Apple Terminal	✅	✅	n/a	If Cmd+←/→/⌫ gets rewritten, use Ctrl+A / Ctrl+E / Ctrl+U fallbacks
Linux X11 desktop	✅	✅	n/a	Requires `xclip` (`apt install xclip`)
Linux Wayland desktop	✅	✅	n/a	Requires `wl-paste` (`apt install wl-clipboard`)
WSL2 (Windows Terminal)	✅	✅	n/a	Uses `powershell.exe` — no extra install needed
VS Code / Cursor / Windsurf (local)	✅	✅	✅	Recommended for better Cmd+Enter / undo / redo parity
VS Code / Cursor / Windsurf (SSH)	❌²	❌²	❌³	Run `/terminal-setup` on the local machine instead
SSH terminal (any)	❌²	❌²	n/a	Remote clipboard not accessible

² See SSH & Remote Sessions below ³ The command writes local IDE keybindings and should not be run from the remote host

Platform-Specific Setup

macOS

No setup required. Hermes uses osascript (built into macOS) to read the clipboard. For faster performance, optionally install pngpaste:

brew install pngpaste

Linux (X11)

Install xclip:

# Ubuntu/Debian
sudo apt install xclip

# Fedora
sudo dnf install xclip

# Arch
sudo pacman -S xclip

Linux (Wayland)

Modern Linux desktops (Ubuntu 22.04+, Fedora 34+) often use Wayland by default. Install wl-clipboard:

# Ubuntu/Debian
sudo apt install wl-clipboard

# Fedora
sudo dnf install wl-clipboard

# Arch
sudo pacman -S wl-clipboard

:::tip How to check if you're on Wayland

echo $XDG_SESSION_TYPE
# "wayland" = Wayland, "x11" = X11, "tty" = no display server

:::

WSL2

No extra setup required. Hermes detects WSL2 automatically (via /proc/version) and uses powershell.exe to access the Windows clipboard through .NET's System.Windows.Forms.Clipboard. This is built into WSL2's Windows interop — powershell.exe is available by default.

The clipboard data is transferred as base64-encoded PNG over stdout, so no file path conversion or temp files are needed.

:::info WSLg Note If you're running WSLg (WSL2 with GUI support), Hermes tries the PowerShell path first, then falls back to wl-paste. WSLg's clipboard bridge only supports BMP format for images — Hermes auto-converts BMP to PNG using Pillow (if installed) or ImageMagick's convert command. :::

Verify WSL2 clipboard access

# 1. Check WSL detection
grep -i microsoft /proc/version

# 2. Check PowerShell is accessible
which powershell.exe

# 3. Copy an image, then check
powershell.exe -NoProfile -Command "Add-Type -AssemblyName System.Windows.Forms; [System.Windows.Forms.Clipboard]::ContainsImage()"
# Should print "True"

SSH & Remote Sessions

Clipboard image paste does not fully work over SSH. When you SSH into a remote machine, the Hermes CLI runs on the remote host. Clipboard tools (xclip, wl-paste, powershell.exe, osascript) read the clipboard of the machine they run on — which is the remote server, not your local machine. Your local clipboard image is therefore inaccessible from the remote side.

Text can sometimes still bridge through terminal paste or OSC52, but image clipboard access and local screenshot temp paths remain tied to the machine running Hermes.

Workarounds for SSH

Upload the image file — Save the image locally, upload it to the remote server via scp, VSCode's file explorer (drag-and-drop), or any file transfer method. Then reference it by path. (A /attach <filepath> command is planned for a future release.)
Use a URL — If the image is accessible online, just paste the URL in your message. The agent can use vision_analyze to look at any image URL directly.
X11 forwarding — Connect with ssh -X to forward X11. This lets xclip on the remote machine access your local X11 clipboard. Requires an X server running locally (XQuartz on macOS, built-in on Linux X11 desktops). Slow for large images.
Use a messaging platform — Send images to Hermes via Telegram, Discord, Slack, or WhatsApp. These platforms handle image upload natively and are not affected by clipboard/terminal limitations.

Why Terminals Can't Paste Images

This is a common source of confusion, so here's the technical explanation:

Terminals are text-based interfaces. When you press Ctrl+V (or Cmd+V), the terminal emulator:

Reads the clipboard for text content
Wraps it in bracketed paste escape sequences
Sends it to the application through the terminal's text stream

If the clipboard contains only an image (no text), the terminal has nothing to send. There is no standard terminal escape sequence for binary image data. The terminal simply does nothing.

This is why Hermes uses a separate clipboard check — instead of receiving image data through the terminal paste event, it calls OS-level tools (osascript, powershell.exe, xclip, wl-paste) directly via subprocess to read the clipboard independently.

Supported Models

Image paste works with any vision-capable model. The image is sent as a base64-encoded data URL in the OpenAI vision content format:

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/png;base64,..."
  }
}

Most modern models support this format, including GPT-4 Vision, Claude (with vision), Gemini, and open-source multimodal models served through OpenRouter.

Image Routing (Vision-Capable vs Text-Only Models)

When a user attaches an image — from the CLI clipboard, the gateway (Telegram/Discord photo), or any other entry point — Hermes routes it based on whether your current model actually supports vision:

Your model	What happens to the image
Vision-capable (GPT-4V, Claude with vision, Gemini, Qwen-VL, MiMo-VL, etc.)	Sent as real pixels using the provider's native image content format above. No text summary layer.
Text-only (DeepSeek V3, smaller open-source models, older chat-only endpoints)	Routed through the `vision_analyze` auxiliary tool — an auxiliary vision model describes the image, and the text description is injected into the conversation.

You don't configure this — Hermes looks up your current model's capability in the provider metadata and picks the right path automatically. The practical effect: you can switch between vision and non-vision models mid-session and image handling "just works" without changing your workflow. Text-only models get coherent context about the image rather than a broken multimodal payload they'd have to reject.

Which auxiliary model handles the text-description path is configurable under auxiliary.vision — see Auxiliary Models.

`vision_analyze` has the same dual behavior

The vision_analyze tool itself follows the same routing. When the active main model is vision-capable and its provider supports image content inside tool results (currently the Anthropic, OpenAI, Azure-OpenAI, and Gemini 3.x stacks), vision_analyze short-circuits the auxiliary describer and returns the raw image pixels as a multimodal tool-result envelope. The main model sees the image natively on its next turn — no aux call, no text-summary information loss, no extra latency.

For text-only main models (or providers whose tool-result channel doesn't carry images), vision_analyze falls back to the legacy path: it asks the configured auxiliary vision model to describe the image and returns the description as plain text. Either way the calling tool signature is the same — the tool decides which path to take at runtime based on the active model.

10 KiB Raw Permalink Blame History