nesquena-hermes 5b6f69c884 ci(docker): runtime smoke gate for Docker init logic
Closes the source-only-test gap that let v0.51.84's startup regression
(a :ro mount combined with 'chown -h {} +') reach review with 5800+
green pytests. Adds a
new GitHub Actions workflow .github/workflows/docker-smoke.yml that
actually runs 'docker compose up' against each compose variant.

Triggers
--------
Path-filtered on pull_request + push to master:
  Dockerfile, docker_init.bash, docker-compose*.yml, .dockerignore,
  .env.docker.example, .github/workflows/docker-smoke.yml itself.
Also workflow_dispatch for manual runs.

Jobs
----
1. compose-config -- preflight that 'docker compose config' parses each
   of the three compose files. Cheap, fast, catches schema/interpolation
   drift in parallel before any container starts.

2. smoke (matrix: single / two-container / three-container) -- for each
   variant:
   a. Reap any leftover hermes-smoke-* containers/volumes/networks from
      prior runs (defence-in-depth on self-hosted runners; hosted runners
      are fresh).
   b. docker build -t ghcr.io/nesquena/hermes-webui:latest .
      Critical: the multi-container compose files reference the GHCR
      image. Without this retag, multi-container smoke would test the
      previously-released image, NOT the PR's docker_init.bash / Dockerfile
      changes. With the retag, Compose's default pull_policy=missing keeps
      the local build in place and the PR is genuinely exercised.
   c. mktemp -d for ephemeral HERMES_HOME + HERMES_WORKSPACE so the
      runner's host filesystem is never touched.
   d. docker compose up -d --wait --wait-timeout 120 (Dockerfile carries a
      HEALTHCHECK so --wait blocks on 'healthy', not just 'running').
   e. curl /health probe with a 30-attempt x 2s poll loop as headroom for
      the multi-container variants' Python dep install phase.
   f. grep startup logs for known-bad signatures:
        EROFS | Read-only file system | Traceback | PermissionError |
        error_exit | groupmod: cannot | usermod: cannot |
        Failed to set (UID|GID|owner|permissions|ownership)
      These are the exact patterns that would have flagged #2470 in real
      time. Failed-to-set is anchored to specific objects to avoid false
      positives on benign locale/library bootstrap warnings.
   g. trap on EXIT: docker compose down -v --remove-orphans + rm -rf the
      ephemeral host paths, regardless of how the job exited.
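
Steps (c), (e), (f) and (g) above can be sketched in bash roughly as
follows. This is a minimal illustration, not the workflow's actual
run-blocks: the function names with_ephemeral_dirs, poll_until_ok and
scan_startup_logs are hypothetical, and HEALTH_URL is a placeholder.
Only the 30 x 2s poll budget and the signature list are taken from the
description above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from docker-smoke.yml.

# Known-bad startup signatures, copied from the list in step (f).
BAD_LOG_PATTERN='EROFS|Read-only file system|Traceback|PermissionError|error_exit|groupmod: cannot|usermod: cannot|Failed to set (UID|GID|owner|permissions|ownership)'

# Step (f): read log text on stdin. Returns 1 if any signature matches
# (matching lines are echoed to stderr), 0 otherwise.
scan_startup_logs() {
  if grep -E -- "$BAD_LOG_PATTERN" >&2; then
    return 1
  fi
  return 0
}

# Step (e): run a probe command up to $1 times, sleeping $2 seconds
# between failed attempts. Returns 0 on the first success, 1 otherwise.
poll_until_ok() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Steps (c) + (g): run a command in a subshell with throwaway HERMES_HOME
# and HERMES_WORKSPACE directories that are removed when the subshell
# exits, whether the command succeeded or not. A real cleanup trap would
# also run 'docker compose down -v --remove-orphans'.
with_ephemeral_dirs() (
  HERMES_HOME=$(mktemp -d) || exit 1
  HERMES_WORKSPACE=$(mktemp -d) || exit 1
  export HERMES_HOME HERMES_WORKSPACE
  trap 'rm -rf -- "$HERMES_HOME" "$HERMES_WORKSPACE"' EXIT
  "$@"
  status=$?
  exit "$status"
)

# Example wiring (commented out: needs Docker, and HEALTH_URL is a placeholder):
#   run_smoke() {
#     docker compose up -d --wait --wait-timeout 120 &&
#       poll_until_ok 30 2 curl -fsS "$HEALTH_URL" &&
#       docker compose logs --no-color 2>&1 | scan_startup_logs
#   }
#   with_ephemeral_dirs run_smoke
```

Keeping the log scan as a filter on stdin means the same pattern can be
checked against canned log text without starting a container.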

Safety
------
- permissions: contents: read only -- no GITHUB_TOKEN write scope.
- Fork PRs run with no secrets (standard pull_request, not
  pull_request_target).
- No bind mounts of pre-existing host paths (only the ephemeral mktemp
  directories); no ~/.hermes exposure; no network egress beyond what
  compose itself needs to pull the agent image.
- timeout-minutes: 15 on the smoke job as a hard ceiling against a
  hung docker build.
- Per-run COMPOSE_PROJECT name (hermes-smoke-VARIANT-RUNID-ATTEMPT)
  so concurrent runs or reruns can't clobber each other.
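
The per-run project name could be derived along these lines. This is a
sketch under assumptions: smoke_project_name and VARIANT are hypothetical
names, and exporting Compose's standard COMPOSE_PROJECT_NAME variable is
one plausible way to apply the name (the workflow may pass it
differently). GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT are the variables the
Actions runner sets for each run and rerun.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a project name of the form
# hermes-smoke-VARIANT-RUNID-ATTEMPT.
smoke_project_name() {
  local variant=$1 run_id=$2 attempt=$3
  printf 'hermes-smoke-%s-%s-%s\n' "$variant" "$run_id" "$attempt"
}

# On a GitHub Actions runner GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT are set;
# the fallbacks below only keep this sketch runnable outside CI.
COMPOSE_PROJECT_NAME=$(smoke_project_name \
  "${VARIANT:-single}" "${GITHUB_RUN_ID:-0}" "${GITHUB_RUN_ATTEMPT:-1}")
export COMPOSE_PROJECT_NAME
echo "$COMPOSE_PROJECT_NAME"
```

Because the attempt number is part of the name, a rerun of the same
workflow run gets its own containers, volumes and networks.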

Out of scope for v1 (per design review)
---------------------------------------
- HERMES_WEBUI_SMOKE_TEST env flag in docker_init.bash -- production-code
  footgun that would let any leaked env var silently exit before
  serving traffic.
- --user 60000:60000 -- incompatible with the image's root-init phase
  and would skip the very chown branch we are guarding against.
- Local-runnable scripts/docker-smoke-test.sh -- defer until CI gating
  ships and we see what contributors actually trip over.
- Hadolint / yamllint -- separate lint workflow, follow-up PR.
- Podman runtime smoke -- defer until a podman-specific bug ships.

Pre-merge verification
----------------------
- actionlint: clean
- YAML parse: clean (3 triggers, 2 jobs, 3-variant matrix)
- bash -n on all 6 run-blocks: clean
- pytest tests/ -q --timeout=60: 5889 passed, 6 skipped (no test impact;
  workflow-only change)
- Opus design review on the brief (REVISE -> minimum scope adopted)
- Opus implementation review on this workflow (APPROVE)
2026-05-18 00:09:41 +00:00