wiki-browser continuous delivery — Design
Draft

2026-05-21 · Daniel · wiki-browser · sub-project #11

Problem

With sub-project #10 (deployment & git-sync) the deployed wiki-browser participates in a shared master: it pulls document commits to serve them and pushes incorporations back. But the binary itself is still updated by hand — make build-arm64, copy to the Pi, systemctl restart (wrapped as make deploy). Every change to wiki-browser's own Go source needs a human to remember to cut a release. This sub-project removes that step: when a test-passing change to wiki-browser's code lands on master, the Pi rebuilds and redeploys itself.

There is a second, quieter gap. The two of us work trunk-based — commits go straight to master, no pull requests, and no CI. Nothing runs the test suite before code is effectively live. CD is the natural place to introduce that gate, so this spec also stands up the project's first automated test run.

Supersedes — sub-project #10

#10 wired doc-sync to a native GitHub Settings → Webhooks hook posting to /api/webhook/github. This spec retires that native webhook. A native webhook cannot run tests, and CD needs a test gate — so a single GitHub Action (the WB Action) becomes the only caller of the Pi, handling both doc-sync and deploy in one request per push. The #10 Task-9 handler (handleGitHubWebhook) is rewritten here: new route name, new payload schema. Implementers of #10's remaining tasks and of this spec must treat the native-webhook provisioning step as cancelled.

Goals & non-goals

Goals

Non-goals

Approach

CD turns on three questions: where tests run, where the binary is built, and how the Pi is triggered.

| Approach | Tests run | Pi build | Trigger | Verdict |
| --- | --- | --- | --- | --- |
| B. All-on-Pi (timer poll, Pi-side tests, Pi build) | Pi | Native arm64 | Timer poll | Rejected |
| C. External build + artifact pull | GitHub Actions | Cross-compiled in CI | Artifact poll | Rejected |

The test gate and the pure-Go build decide it.

B is rejected: running the full suite on the Pi spends a couple of minutes of CPU on the serving host for every deploy, gives no real CI surface (no run history, no clean environment), and forces the Pi to diff master itself to decide what changed. C is rejected: shipping a cross-compiled artifact means artifact storage, the Pi authenticating to GitHub to fetch it, and version-matching an artifact back to a commit — too much machinery when the project is pure-Go (CGO_ENABLED=0 already) and a Raspberry Pi 5 builds it natively in seconds.

A is chosen. Tests run free, off-device, in GitHub Actions — which doubles as the CI the project never had. The Pi only ever hears "this tested commit is ready," then rebuilds natively and redeploys. Within A, four sub-decisions, all settled during discovery:

Design

End-to-end flow

[Figure: end-to-end flow. GitHub (getorcha/orcha): the WB Action (path-filter, go test) sends an HMAC POST /api/webhook/ci {deploy, commit}. Raspberry Pi 5: the wiki-browser webhook endpoint (git-sync Sync, drain) writes the commit to the cd-trigger file; wb-cd.path (watch) activates wb-cd (oneshot), which runs git archive → build, an atomic swap of bin/, then restart and /healthz.]

Every push runs the WB Action. A tracked document change or deploy-worthy code change sends the one CI webhook. Only a test-passing wiki-browser code change sets deploy:true. The endpoint pulls via git-sync; deploy:true additionally writes the trigger file, which the .path unit turns into a wb-cd run.

Step by step:

  1. A push lands on master. The WB Action runs (every push).
  2. An internal classifier — a Go helper in the wiki-browser module — diffs github.event.before..github.sha. It checks two sets: deploy build inputs, and tracked document paths — the latter via the same internal/walker matching code the running wiki uses, fed the extensions/exclude values from a committed policy file.
  3. If deploy build inputs changed, the Action runs go test ./.... If only tracked documents changed, tests are skipped.
  4. If neither set changed, the Action exits without calling the Pi. Otherwise it makes one HMAC-signed POST /api/webhook/ci with {deploy, commit, ref}, where deploy = code-changed && tests-passed and commit is the full 40-character SHA.
  5. The endpoint handler verifies the HMAC, triggers a git-sync Sync (doc pull), and — if deploy — writes commit to the trigger file. Responds 204.
  6. A systemd .path unit sees the trigger file change and activates wb-cd.service.
  7. wb-cd builds the exact commit, swaps the binaries, restarts wiki-browser, health-checks, records success or rolls back, then re-reads the trigger file before exiting and repeats if a newer commit arrived mid-run.

The WB Action

One workflow, committed at the monorepo root — .github/workflows/wiki-browser.yml, not under wiki-browser/, since GitHub only reads workflows from the repo root. It is created as part of this implementation. It fires on every push to master (it must, so doc-only pushes can still reach the Pi), and discriminates internally with a path classifier rather than a workflow-level paths: trigger.

yaml — .github/workflows/wiki-browser.yml
name: wiki-browser
on: { push: { branches: [master] } }

jobs:
  wiki-browser:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }      # github.event.before must exist
      - uses: actions/setup-go@v5         # before classify — it is a go program
        with: { go-version-file: wiki-browser/go.mod }
      - id: classify
        working-directory: wiki-browser
        run: go run ./cmd/classify-ci-change
          --base "${{ github.event.before }}"
          --head "${{ github.sha }}"
          --policy ci-tracked-paths.yaml
      - if: steps.classify.outputs.code_changed == 'true'
        working-directory: wiki-browser
        run: go test ./...
      - name: notify the pi          # one request if tracked docs or code changed
        if: always() && steps.classify.outputs.notify == 'true'
        run: # HMAC-sign {deploy,commit,ref} and POST it; deploy = code && success

The classifier must use the previous push range, not only HEAD~1..HEAD: base = github.event.before, head = github.sha. For the all-zero first-push sentinel, fall back to the root commit range. This catches multi-commit pushes correctly.
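A minimal sketch of that range selection in Go. It assumes the first-push fallback is implemented by diffing against Git's well-known empty-tree object, so every path present at head counts as changed; the spec does not say how the root commit range is computed. The helper names isZeroSHA and diffArgs are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// emptyTree is the SHA-1 of Git's empty tree object. Diffing against it
// lists every path present at head. Using it for the first-push case is
// an assumption of this sketch, not something the spec prescribes.
const emptyTree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

// isZeroSHA reports whether base is the all-zero sentinel that
// github.event.before carries when there is no previous commit.
func isZeroSHA(base string) bool {
	return base != "" && strings.Trim(base, "0") == ""
}

// diffArgs (hypothetical helper) builds the git arguments that list the
// paths changed across the whole push, not just HEAD~1..HEAD.
func diffArgs(base, head string) []string {
	if isZeroSHA(base) {
		base = emptyTree
	}
	return []string{"diff", "--name-only", base, head}
}

func main() {
	fmt.Println(diffArgs("0000000000000000000000000000000000000000", "a1b2c3d"))
	fmt.Println(diffArgs("1111111", "a1b2c3d"))
}
```

The returned slice is meant to be passed to exec.Command("git", args...); a non-zero exit from that command would be the "failed git diff" case that fails the Action.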

It emits three booleans: code_changed, tracked_changed, and notify. code_changed covers deploy build inputs: wiki-browser/** minus docs/, .claude/, and *.md, plus the workflow and classifier files that control deployment. Templates and static assets (internal/server/templates/, static/) are baked in via embed.FS and are build inputs. tracked_changed covers any changed file the running wiki would serve: repo-relative path under root, extension in extensions, and not matched by either baked-in walker excludes or configured exclude patterns.
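The code_changed rule can be sketched as a pure predicate over repo-relative paths. This is an illustration under assumptions: the function names are hypothetical, docs/ and .claude/ are treated as excluded at any depth below wiki-browser/, and the classifier sources count as build inputs simply because they live under wiki-browser/.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// workflowFile is the workflow path named in this spec.
const workflowFile = ".github/workflows/wiki-browser.yml"

// isBuildInput (hypothetical) reports whether one repo-relative path is a
// deploy build input: wiki-browser/** minus docs/, .claude/ and *.md,
// plus the workflow file.
func isBuildInput(p string) bool {
	if p == workflowFile {
		return true
	}
	rest, ok := strings.CutPrefix(p, "wiki-browser/")
	if !ok {
		return false
	}
	if path.Ext(rest) == ".md" {
		return false
	}
	// Assumption: docs/ and .claude/ are excluded at any depth.
	for _, seg := range strings.Split(path.Dir(rest), "/") {
		if seg == "docs" || seg == ".claude" {
			return false
		}
	}
	return true
}

// codeChanged reports whether any changed path is a build input.
func codeChanged(paths []string) bool {
	for _, p := range paths {
		if isBuildInput(p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(codeChanged([]string{"wiki-browser/README.md", "marketing/index.html"}))
	fmt.Println(codeChanged([]string{"wiki-browser/static/app.css"}))
}
```

Embedded templates and static assets fall out as build inputs without a special case, since they are neither Markdown nor under an excluded directory.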

tracked_changed must answer exactly what the running wiki would serve, so the classifier does not reimplement that rule. The implementation extracts the path-matching predicate from internal/walker into a small reusable matcher — (repo-relative path, extensions, excludes) → tracked, with the baked-in excludes (.git, node_modules, .worktrees, .obsidian, .claude, tmp-*) living inside that shared code — and both the walker and classify-ci-change call it. The matching logic is thus one implementation and cannot drift. The classifier is a Go helper in the wiki-browser module (cmd/classify-ci-change), run via go run; it needs no filesystem walk, only the git diff --name-only path list.
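A sketch of what the shared matcher could look like, under stated assumptions: the name Tracked and its signature are hypothetical, the check against root is omitted, baked-in excludes are applied to directory segments only, and configured excludes support just two shapes (a "dir/**" prefix and a plain path.Match glob). The real internal/walker rules may differ on each of these points.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// bakedIn lists the built-in excludes named in the text above.
var bakedIn = []string{".git", "node_modules", ".worktrees", ".obsidian", ".claude", "tmp-*"}

// matchExclude handles two pattern shapes only (an assumption of this
// sketch): "dir/**" as a directory prefix, anything else via path.Match.
func matchExclude(pat, rel string) bool {
	if prefix, ok := strings.CutSuffix(pat, "/**"); ok {
		return rel == prefix || strings.HasPrefix(rel, prefix+"/")
	}
	m, _ := path.Match(pat, rel)
	return m
}

// Tracked (hypothetical) maps (repo-relative path, extensions, excludes)
// to whether the running wiki would serve the file.
func Tracked(rel string, extensions, excludes []string) bool {
	ext, okExt := path.Ext(rel), false
	for _, e := range extensions {
		if e == ext {
			okExt = true
		}
	}
	if !okExt {
		return false
	}
	// Baked-in excludes are checked against each directory segment.
	for _, seg := range strings.Split(path.Dir(rel), "/") {
		for _, pat := range bakedIn {
			if m, _ := path.Match(pat, seg); m {
				return false
			}
		}
	}
	for _, pat := range excludes {
		if matchExclude(pat, rel) {
			return false
		}
	}
	return true
}

func main() {
	exts := []string{".md", ".html"}
	excl := []string{"www/**", "marketing/**"}
	for _, p := range []string{"docs/guide.md", "www/index.html", "node_modules/x/readme.md"} {
		fmt.Println(p, Tracked(p, exts, excl))
	}
}
```

Because the predicate takes only strings, the walker can call it per visited file and the classifier can call it per line of git diff --name-only output.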

The one thing still mirrored is the data. The production wiki-browser.yaml holds host paths and secret references and is not committed, so the extensions/exclude values are kept in a committed non-secret file — wiki-browser/ci-tracked-paths.yaml — read by the classifier. Only those two values must stay in step with production config; provisioning verifies that by code review. A classifier error (unreadable policy file, a failed git diff) exits non-zero and fails the Action — surfaced as a red run, never a silently dropped push.

Note

On a code+doc push the docs sync only after the test run finishes (~2 min), because the single request is sent at the end of the job. Doc-only pushes wait just for runner spin-up and classification (~30 s). This latency was explicitly accepted in discovery; if a missed Action ever leaves docs stale, git-sync's existing opt-in poll_interval is the safety net.

The single webhook endpoint

The #10 Task-9 handler is rewritten. The route is renamed POST /api/webhook/ci and the payload is no longer GitHub's native push event but our own schema:

json — request body
{ "deploy": true, "commit": "a1b2c3d4e5f6789012345678901234567890abcd",
  "ref": "refs/heads/master", "delivery_id": "github-run-123456789" }

It stays a public route — mounted outside the OAuth session middleware, like /auth/* — protected by HMAC: read the body under a size cap, compute HMAC-SHA256(body, secret), constant-time compare against X-Hub-Signature-256: sha256=<hex>. The secret file is the existing git.webhook_secret_file from #10 — unchanged on the Pi; only its counterpart moves (from the native-webhook config to a GitHub Actions secret). Handler behavior:

wiki-browser needs no privilege for this — it only writes a file it owns. The systemd .path unit (running as part of systemd) does the activation.
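The signature check and payload decoding described above can be sketched as follows. The 64 KiB body cap, the status codes on failure, the helper names and the strict 40-hex validation of commit are assumptions of this sketch; only the header name, the sha256=<hex> format and the payload fields come from the spec.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"regexp"
)

const maxBody = 64 << 10 // assumed size cap; the spec only says "a size cap"

// CIPayload mirrors the request body shown above.
type CIPayload struct {
	Deploy     bool   `json:"deploy"`
	Commit     string `json:"commit"`
	Ref        string `json:"ref"`
	DeliveryID string `json:"delivery_id"`
}

var fullSHA = regexp.MustCompile(`^[0-9a-f]{40}$`)

// sign returns the header value for body: "sha256=" + hex(HMAC-SHA256).
func sign(body, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// validSignature compares in constant time via hmac.Equal.
func validSignature(header string, body, secret []byte) bool {
	return hmac.Equal([]byte(header), []byte(sign(body, secret)))
}

// parsePayload decodes the body and rejects anything but a full SHA,
// since commit is later written to a file and handed to git.
func parsePayload(body []byte) (CIPayload, error) {
	var p CIPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return p, fmt.Errorf("decode payload: %w", err)
	}
	if !fullSHA.MatchString(p.Commit) {
		return p, errors.New("commit must be a full 40-character hex SHA")
	}
	return p, nil
}

// readVerified reads the capped body and checks X-Hub-Signature-256.
func readVerified(w http.ResponseWriter, r *http.Request, secret []byte) ([]byte, bool) {
	body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, maxBody))
	if err != nil {
		http.Error(w, "body too large or unreadable", http.StatusRequestEntityTooLarge)
		return nil, false
	}
	if !validSignature(r.Header.Get("X-Hub-Signature-256"), body, secret) {
		http.Error(w, "bad signature", http.StatusUnauthorized)
		return nil, false
	}
	return body, true
}

func main() {
	body := []byte(`{"deploy":false,"commit":"a1b2c3d4e5f6789012345678901234567890abcd","ref":"refs/heads/master"}`)
	fmt.Println(validSignature(sign(body, []byte("s3cret")), body, []byte("s3cret")))
}
```

The signature is verified over the raw bytes before any JSON decoding, so an unauthenticated caller never reaches the parser.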

wb-cd — the deploy oneshot

A new binary, cmd/wb-cd, built by make build alongside wiki-browser and wb-agent. It is a Go binary rather than a shell script for two reasons: the rollback logic is correctness-critical and deserves unit tests, and it reuses internal/alert for Slack notifications. It runs once per activation, but it drains trigger state before exit. Each target-commit loop runs as follows:

  1. Read the target full commit from the trigger file. If it equals the recorded deployed commit → no target is pending. If it equals the recorded poisoned commit → skip (a known-bad commit; do not retry until a new commit arrives).
  2. Ensure the clone contains commit (git cat-file -e commit^{commit}), polled with a bounded wait. git-sync's Sync, kicked by the same webhook, delivers it within seconds.
  3. git archive commit -- wiki-browser into a temp dir — only the one self-contained Go module, not the rest of the monorepo — and run make -C wiki-browser build COMMIT=<full-sha> VERSION=<short-sha> BUILD_TIME=<ts> there. This builds the exact approved commit, not the live working tree (git-sync keeps that at master HEAD, which may have moved past commit). git archive is a pure read of the object database — no mutation of the live clone, so the sole-mutator invariant holds.
  4. Snapshot the current live binaries to bin/prev/.
  5. Atomically rename() the three freshly built binaries over the live paths.
  6. sudo systemctl restart wiki-browser — this blocks through the graceful drain.
  7. Health-check (see below). Healthy → record commit as deployed. Restart failure or unhealthy → roll back.
  8. Before exit, re-read cd.trigger_file. If it now names a different un-deployed, non-poisoned commit, start the loop again. This is what makes .path activation safe when a push lands while wb-cd.service is already active.
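
The skip rules in step 1 and the re-read in step 8 can be sketched as a small decision function plus a loop. The State fields follow the cd-state.json keys used in this spec; pending, drain and the callback shapes are hypothetical, and the build, swap and health-check work is collapsed into the deploy callback.

```go
package main

import "fmt"

// State mirrors the JSON state file at cd.state_file.
type State struct {
	DeployedCommit string `json:"deployed_commit"`
	DeployedAt     string `json:"deployed_at"`
	PoisonedCommit string `json:"poisoned_commit"`
}

// pending reports whether target still needs a deploy attempt: it must
// be non-empty, not already deployed, and not the known-bad commit.
func pending(target string, st State) bool {
	return target != "" && target != st.DeployedCommit && target != st.PoisonedCommit
}

// drain (hypothetical) runs target loops until the trigger file names
// nothing new. deploy must record the commit as deployed or poisoned;
// otherwise the same target would be retried forever.
func drain(readTrigger func() string, loadState func() State, deploy func(commit string)) int {
	runs := 0
	for {
		target := readTrigger()
		if !pending(target, loadState()) {
			return runs
		}
		deploy(target)
		runs++
	}
}

func main() {
	st := State{DeployedCommit: "aaa"}
	trigger := "bbb"
	n := drain(
		func() string { return trigger },
		func() State { return st },
		func(c string) { st.DeployedCommit = c },
	)
	fmt.Println("deploys run:", n)
}
```

Re-reading the trigger on every iteration is what lets a single activation absorb a push that arrives mid-run.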

Note

Every wb-cd target loop is idempotent: re-running re-builds and re-swaps from scratch. A run interrupted between the three rename() calls leaves a mixed binary set, but the next activation (or a manual one) re-runs the whole sequence cleanly. Each rename() itself is atomic, so no individual binary is ever torn.

CD self-update

cmd/wb-cd lives under wiki-browser/, so a change to its own source is a build input like any other. Because wb-cd is a oneshot, keeping it current needs no special machinery: make -C wiki-browser build (step 3 above) compiles all three binaries, and step 5 rename()s the new wb-cd over its own path. The running process keeps its old inode and finishes cleanly; the next trigger runs the new wb-cd. There is no "hot-swap a running process" problem because between triggers nothing is running — the atomic rename() is the one hard requirement.

The residual risk is a new wb-cd that compiles but is behaviorally broken: the old (working) CD will build and install it, and the next trigger runs broken CD. Three things contain this — wb-cd carries unit tests; a systemd OnFailure= hook alerts on any non-zero CD exit; and the failure degrades gracefully — broken CD never touches wiki-browser's health, it just stops deploying, and make deploy remains as the break-glass path.

Health check & rollback

After the restart returns, wb-cd polls GET /healthz for a bounded window (default ~90 s). The check passes only when the response is 200 and the reported commit equals the full commit just built — proving not only that the process came up but that the swap actually took.

The state file is small JSON at cd.state_file:

json — cd-state.json
{ "deployed_commit": "a1b2c3d4e5f6789012345678901234567890abcd",
  "deployed_at": "2026-05-21T14:30:00Z",
  "poisoned_commit": "" }

Warning

Rollback restores the binary, not the database. If a deploy carries a non-backward-compatible collab-DB schema migration, the new binary migrates the DB and a binary-only rollback then runs the old code against the new schema. CD cannot detect or undo this. The discipline this imposes: schema migrations for wiki-browser must be backward-compatible (expand-contract) — the old binary must tolerate the new schema. This is good practice regardless, but CD makes it load-bearing.

Graceful drain

This is a change to wiki-browser, not wb-cd. main.go today gives srv.Shutdown a 5 s budget — both too short and aimed at the wrong thing. srv.Shutdown waits for active HTTP handlers; but an Agent job runs in a background goroutine in the agent service — the POST /api/agent/jobs handler that queued it has already returned. So srv.Shutdown finishing says nothing about whether a 2–3 min job is still running. (Incorporation is fine: it runs synchronously inside its handler, so srv.Shutdown's existing wait already covers it; it is seconds anyway.)

The new SIGTERM sequence:

  1. Enter process-wide draining mode. Read-only routes keep serving. New mutating HTTP requests get 503 Service Unavailable with Retry-After unless they are part of the shutdown path itself.
  2. Tell the agent service to stop dequeuing new jobs. Queued jobs simply stay in the DB; the next instance drains them on boot.
  3. Wait — on a WaitGroup over in-flight job goroutines — for running jobs to finish. HTTP stays up serving reads throughout, so the wiki is browsable during the drain.
  4. Once jobs are drained, srv.Shutdown, close the DBs, exit.
  5. A 10 min cap on the in-flight job wait — a genuinely hung job cannot strand a redeploy forever; past the cap the drain gives up, that job dies and is swept by collab.Recover on next boot.

Mutating routes include POST /api/agent/jobs, proposal/incorporation routes, edit/write routes, and manual sync/write endpoints. The CI webhook itself is allowed only before shutdown begins; once draining, a concurrent CI request should receive 503 and be retried by the Action rather than extending the old process. wiki-browser.service gets TimeoutStopSec=11m — above the cap, so systemd never SIGKILLs mid-drain. wb-cd's own systemctl restart call blocks for the whole drain; its unit timeout accommodates that.

Versioning & observability

The goal is twofold: a human can tell at a glance whether a new version is live, and rollback can compare exact identity. Three values are stamped into the binary at build time via -ldflags -X: the full commit (main.commit), the display short sha (main.version), and a build timestamp (main.buildTime). Build time, not process-start time, is the "deploy time": it is baked into the binary, so it does not drift when the Pi reboots or the service restarts for an unrelated reason.

makefile
COMMIT     ?= $(shell git rev-parse HEAD 2>/dev/null || echo dev)
VERSION    ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo dev)
BUILD_TIME ?= $(shell date -u +%Y-%m-%dT%H:%MZ)
LDFLAGS     = -s -w -X main.commit=$(COMMIT) -X main.version=$(VERSION) -X main.buildTime=$(BUILD_TIME)

The temp dir from git archive has no .git, so wb-cd passes the values explicitly: make -C wiki-browser build COMMIT=<full-sha> VERSION=<short-sha> BUILD_TIME=<ts>. The defaults above keep a bare make build on a dev machine working.
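On the Go side, the linker's -X flag can only set package-level string variables, so the three values would be declared roughly as below. The in-source defaults and the versionLine helper are hypothetical; they only take effect when the binary is built without the ldflags.

```go
package main

import "fmt"

// Overridden at link time by:
//   -ldflags "-X main.commit=... -X main.version=... -X main.buildTime=..."
// -X works only on string variables, hence no time.Time here.
var (
	commit    = "dev"     // full 40-character SHA in a stamped build
	version   = "dev"     // short SHA for display
	buildTime = "unknown" // UTC timestamp set by the Makefile
)

// versionLine (hypothetical) renders the stamped values for a log line
// or a version flag.
func versionLine() string {
	return fmt.Sprintf("wiki-browser %s (commit %s, built %s)", version, commit, buildTime)
}

func main() {
	fmt.Println(versionLine())
}
```

Keeping commit and version separate lets the health check compare the unambiguous full SHA while humans read the short one.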

Two surfaces expose it:

systemd units

| Unit | Type | Role |
| --- | --- | --- |
| wb-cd.path | .path | PathChanged= the trigger file → activates wb-cd.service. Coalescing is acceptable because wb-cd re-reads the trigger file before exit. |
| wb-cd.service | oneshot | User=karn; TimeoutStartSec=20m for build + up-to-11 min drain + health poll; OnFailure=wb-cd-alert.service. |
| wb-cd-alert.service | oneshot | Backstop: POSTs a "wb-cd run failed unexpectedly" line to Slack when a CD run exits non-zero. wb-cd self-alerts for its known failure paths; this catches a crash. |
| wiki-browser.service | (edit) | TimeoutStopSec=11m for the drain. |

systemd — wb-cd.path
[Path]
PathChanged=/srv/wiki-browser/cd-trigger
Unit=wb-cd.service

[Install]
WantedBy=multi-user.target

systemd — wb-cd.service
[Service]
Type=oneshot
User=karn
ExecStart=/srv/wiki-browser/bin/wb-cd -config=/srv/wiki-browser/wiki-browser.yaml
TimeoutStartSec=20m
OnFailure=wb-cd-alert.service

wb-cd runs as the unprivileged karn account and needs exactly one root action — restarting wiki-browser. A narrowly scoped sudoers rule grants only that:

sudoers — /etc/sudoers.d/wb-cd
karn ALL=(root) NOPASSWD: /usr/bin/systemctl restart wiki-browser

Configuration

One new optional block in wiki-browser.yaml. Both binaries read the same config file: wiki-browser needs trigger_file (the handler writes it); wb-cd needs all of it, plus it reuses the existing root, listen and alert values.

yaml — wiki-browser.yaml
cd:
  bin_dir: "/srv/wiki-browser/bin"          # live binaries + bin/prev/
  trigger_file: "/srv/wiki-browser/cd-trigger"
  state_file: "/srv/wiki-browser/cd-state.json"
  health_poll_timeout: "90s"             # default

Absent a cd: block, wiki-browser behaves exactly as today — the /api/webhook/ci route still mounts for doc-sync if a webhook secret is configured, but a deploy:true payload has nowhere to write and is logged and ignored. The .path unit's watched path is fixed in the unit file; provisioning must keep it in step with cd.trigger_file.

The Action also gets a committed non-secret path-policy file that mirrors the document-tracking part of production config:

yaml — wiki-browser/ci-tracked-paths.yaml
extensions: [".md", ".html"]
exclude:
  - "www/**"
  - "marketing/**"

Provisioning

One-time operator runbook, split by where each step runs. The Pi steps need host access the implementation does not have. The GitHub steps are gh-driven and the operator runs them.

| Step | Where | How |
| --- | --- | --- |
| Install the Go toolchain (native arm64 build) | Pi | operator |
| Install wb-cd.path, wb-cd.service, wb-cd-alert.service; systemctl enable --now wb-cd.path | Pi | operator |
| Add TimeoutStopSec=11m to wiki-browser.service | Pi | operator |
| Install the sudoers rule for karn | Pi | operator |
| Create bin/prev/; confirm the webhook secret file exists | Pi | operator |
| Confirm wiki-browser/ci-tracked-paths.yaml mirrors production extensions and exclude | GitHub | code review |
| Set the GitHub Actions secret with the same HMAC value | GitHub | gh secret set WB_DEPLOY_HMAC_SECRET |
| Set the webhook URL (Actions variable or secret) | GitHub | gh variable set WB_WEBHOOK_URL |
| Delete the native Settings→Webhooks hook, if #10 already created it | GitHub | gh api -X DELETE repos/getorcha/orcha/hooks/<id> |

The WB Action workflow file is committed by the implementation; the secret/variable/hook steps are run by the operator.

Failure modes

| Failure | Detection | Behavior |
| --- | --- | --- |
| Tests fail in the WB Action | red Action run | deploy:false is POSTed if tracked docs also changed; otherwise no Pi call. The red Action is the signal. |
| make build fails on the Pi | non-zero exit | Abort before any binary is touched; live binary untouched; Slack alert; record the commit as poisoned so it is not retried until a new commit lands. |
| systemctl restart fails or new binary unhealthy | restart exit / /healthz poll | Roll back to bin/prev/; restart; alert; commit marked poisoned. |
| Rollback itself fails | second /healthz poll | Loud alert; stop — manual intervention. No thrash loop. |
| wb-cd crashes mid-run | systemd OnFailure= | Alert. Swaps are atomic; a re-run is idempotent and cleans up. |
| Target commit not yet pulled | git cat-file -e | Bounded wait for git-sync's pull; on timeout, alert and retry on the next trigger. |
| Drain exceeds the 10 min cap | cap timer | Job abandoned, swept by collab.Recover on boot. |
| GitHub Actions outage | | No sync, no deploy. git-sync's opt-in poll_interval covers doc-sync; make deploy covers deploy. |
| Non-backward-compatible DB migration in a rolled-back deploy | — (not detected) | Out of scope — see the migration warning. Mitigated by expand-contract migration discipline. |

Security

Resolved decisions

Settled during discovery; recorded so the rationale is not relitigated:

References