wiki-browser continuous delivery — Design
Problem
With sub-project #10 (deployment & git-sync) the deployed wiki-browser participates in a shared master: it pulls document commits to serve them and pushes incorporations back. But the binary itself is still updated by hand — make build-arm64, copy to the Pi, systemctl restart (wrapped as make deploy). Every change to wiki-browser's own Go source needs a human to remember to cut a release. This sub-project removes that step: when a test-passing change to wiki-browser's code lands on master, the Pi rebuilds and redeploys itself.
There is a second, quieter gap. The two of us work trunk-based — commits go straight to master, no pull requests, and no CI. Nothing runs the test suite before code is effectively live. CD is the natural place to introduce that gate, so this spec also stands up the project's first automated test run.
#10 wired doc-sync to a native GitHub Settings → Webhooks hook posting to /api/webhook/github. This spec retires that native webhook. A native webhook cannot run tests, and CD needs a test gate — so a single GitHub Action (the WB Action) becomes the only caller of the Pi, handling both doc-sync and deploy in one request per push. The #10 Task-9 handler (handleGitHubWebhook) is rewritten here: new route name, new payload schema. Implementers of #10's remaining tasks and of this spec must treat the native-webhook provisioning step as cancelled.
Goals & non-goals
Goals
- A test-passing change to wiki-browser's own Go source on master is rebuilt and redeployed to the Pi with no human action.
- Deploys are crash-safe: an in-flight Agent job (2–3 min) drains to completion rather than being killed.
- A bad deploy never takes the wiki down: a failed build is caught before any binary is touched; a binary that will not come up healthy is auto-rolled-back to the last-known-good.
- The monorepo gets its first automated test gate for wiki-browser code — on every push, deploy or not.
- The running version (full commit + display version + build time) is visible in the UI and at /healthz.
- CD only ever reads the clone (git archive, git cat-file); git-sync stays the sole mutator of the clone's git state.
Non-goals
- High availability. One Pi. The sub-second binary swap plus the restart is the only downtime; that is acceptable.
- Database rollback. CD rolls back the binary only — see the migration warning.
- A staging environment. Tests are the only gate; a passing deploy goes straight to the one production Pi.
- A PR / branch workflow. The collaboration policy stays trunk-based.
- Multi-step rollback history. Exactly one step back (bin/prev/).
- Slack alerts on test failure. A red Action run plus GitHub's own notifications are the test-failure signal. CD's alerts cover deploy-side failures only.
Approach
CD has three moving questions: where tests run, where the binary is built, and how the Pi is triggered.
| Approach | Tests run | Pi build | Trigger | Verdict |
|---|---|---|---|---|
| A. WB Action + single webhook + Pi-native build | GitHub Actions | Native arm64 | Action → webhook | Recommended |
| B. All-on-Pi (timer poll, Pi-side tests, Pi build) | Pi | Native arm64 | Timer poll | Rejected |
| C. External build + artifact pull | GitHub Actions | Cross-compiled in CI | Artifact poll | Rejected |
B is rejected: running the full suite on the Pi spends a couple of minutes of CPU on the serving host for every deploy, gives no real CI surface (no run history, no clean environment), and forces the Pi to diff master itself to decide what changed. C is rejected: shipping a cross-compiled artifact means artifact storage, the Pi authenticating to GitHub to fetch it, and version-matching an artifact back to a commit — too much machinery when the project is pure-Go (CGO_ENABLED=0 already) and a Raspberry Pi 5 builds it natively in seconds.
A is chosen. Tests run free, off-device, in GitHub Actions — which doubles as the CI the project never had. The Pi only ever hears "this tested commit is ready," then rebuilds natively and redeploys. Within A, four sub-decisions, all settled during discovery:
- One GitHub Action, not a native webhook. The WB Action runs on every push, diffs the exact previous-push range, classifies changed paths with the same document-tracking rules as the deployed wiki, and makes at most one request to the Pi — so a code+doc push is one webhook, not two.
- A oneshot, not a daemon. CD has no state worth keeping in memory; between deploys nothing runs, so there is nothing to crash-recover, and a oneshot updates itself for free (see CD self-update).
- systemd .path activation, not a poll. The webhook handler writes a trigger file; a .path unit turns that into a wb-cd run. wb-cd then drains the trigger state before exiting, so a second push during a long build or drain is not lost.
- Auto-rollback, not fail-forward. The bar for shipping CD is "CD cannot take the wiki down"; rollback to last-known-good is the binary-level equivalent of the crash-safety #10 insisted on.
Design
End-to-end flow
[Figure: end-to-end flow. The endpoint pulls via git-sync; deploy:true additionally writes the trigger file, which the .path unit turns into a wb-cd run.]

Step by step:
1. A push lands on master. The WB Action runs (every push).
2. An internal classifier (a Go helper in the wiki-browser module) diffs github.event.before..github.sha. It checks two sets: deploy build inputs, and tracked document paths. The latter uses the same internal/walker matching code the running wiki uses, fed the extensions/exclude values from a committed policy file.
3. If deploy build inputs changed, the Action runs go test ./.... If only tracked documents changed, tests are skipped.
4. If neither set changed, the Action exits without calling the Pi. Otherwise it makes one HMAC-signed POST /api/webhook/ci with {deploy, commit, ref}, where deploy = code-changed && tests-passed and commit is the full 40-character SHA.
5. The endpoint handler verifies the HMAC, triggers a git-sync Sync (doc pull), and, if deploy is set, writes commit to the trigger file. Responds 204.
6. A systemd .path unit sees the trigger file change and activates wb-cd.service.
7. wb-cd builds the exact commit, swaps the binaries, restarts wiki-browser, health-checks, records success or rolls back, then re-reads the trigger file before exiting and repeats if a newer commit arrived mid-run.
The WB Action
One workflow, committed at the monorepo root — .github/workflows/wiki-browser.yml, not under wiki-browser/, since GitHub only reads workflows from the repo root. It is created as part of this implementation. It fires on every push to master (it must, so doc-only pushes can still reach the Pi), and discriminates internally with a path classifier rather than a workflow-level paths: trigger.
```yaml
# .github/workflows/wiki-browser.yml
name: wiki-browser
on: { push: { branches: [master] } }
jobs:
  wiki-browser:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # github.event.before must exist
      - uses: actions/setup-go@v5  # before classify; it is a go program
        with: { go-version-file: wiki-browser/go.mod }
      - id: classify
        working-directory: wiki-browser
        run: go run ./cmd/classify-ci-change --base "${{ github.event.before }}" --head "${{ github.sha }}" --policy ci-tracked-paths.yaml
      - if: steps.classify.outputs.code_changed == 'true'
        working-directory: wiki-browser
        run: go test ./...
      - name: notify the pi        # one request if tracked docs or code changed
        if: always() && steps.classify.outputs.notify == 'true'
        run: # HMAC-sign {deploy,commit,ref} and POST it; deploy = code && success
```
The classifier must use the previous push range, not only HEAD~1..HEAD: base = github.event.before, head = github.sha. For the all-zero first-push sentinel, fall back to the root commit range. This catches multi-commit pushes correctly.
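The range rule above can be sketched as a small helper that builds the git diff arguments. This is only an illustration: the function name diffArgs is hypothetical, and reading "root commit range" as a diff from git's empty tree to head is an assumption. The spec does not say how the fallback is computed.

```go
package main

import (
	"fmt"
	"strings"
)

// emptyTree is the well-known SHA-1 of git's empty tree object.
const emptyTree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

// diffArgs returns the arguments for `git diff --name-only` covering one push.
// before and head correspond to github.event.before and github.sha.
func diffArgs(before, head string) []string {
	base := before
	// GitHub reports forty zeros as "before" when there is no previous commit.
	if strings.Trim(before, "0") == "" {
		// Assumption: treat every file present at head as changed.
		base = emptyTree
	}
	return []string{"diff", "--name-only", base, head}
}

func main() {
	fmt.Println(diffArgs(strings.Repeat("0", 40), "a1b2c3d"))
	fmt.Println(diffArgs("1111111", "a1b2c3d"))
}
```

Because the whole before..head range is diffed in one call, a push carrying several commits yields one combined path list.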
It emits three booleans: code_changed, tracked_changed, and notify. code_changed covers deploy build inputs: wiki-browser/** minus docs/, .claude/, and *.md, plus the workflow and classifier files that control deployment. Templates and static assets (internal/server/templates/, static/) are baked in via embed.FS and are build inputs. tracked_changed covers any changed file the running wiki would serve: repo-relative path under root, extension in extensions, and not matched by either baked-in walker excludes or configured exclude patterns.
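As a rough sketch of the code_changed rule just described, the predicate below classifies one repo-relative path. The names isBuildInput and codeChanged are hypothetical, and it assumes that docs/ and .claude/ mean directories directly under wiki-browser/; the real classifier may scope them differently.

```go
package main

import (
	"fmt"
	"strings"
)

// workflowFile is the workflow path named in this spec.
const workflowFile = ".github/workflows/wiki-browser.yml"

// isBuildInput reports whether a repo-relative path counts toward code_changed.
// Assumption: docs/ and .claude/ are directories directly under wiki-browser/.
func isBuildInput(p string) bool {
	if p == workflowFile {
		return true
	}
	if !strings.HasPrefix(p, "wiki-browser/") {
		return false
	}
	rel := strings.TrimPrefix(p, "wiki-browser/")
	if strings.HasPrefix(rel, "docs/") || strings.HasPrefix(rel, ".claude/") {
		return false
	}
	// Markdown is documentation, not a build input.
	return !strings.HasSuffix(rel, ".md")
}

// codeChanged is true when any changed path is a build input.
func codeChanged(paths []string) bool {
	for _, p := range paths {
		if isBuildInput(p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(codeChanged([]string{"wiki-browser/README.md", "notes/a.md"}))
	fmt.Println(codeChanged([]string{"wiki-browser/internal/server/templates/shell.html"}))
}
```

Templates and static assets fall through to the final return, so they count as build inputs without a special case.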
tracked_changed must answer exactly what the running wiki would serve, so the classifier does not reimplement that rule. The implementation extracts the path-matching predicate from internal/walker into a small reusable matcher — (repo-relative path, extensions, excludes) → tracked, with the baked-in excludes (.git, node_modules, .worktrees, .obsidian, .claude, tmp-*) living inside that shared code — and both the walker and classify-ci-change call it. The matching logic is thus one implementation and cannot drift. The classifier is a Go helper in the wiki-browser module (cmd/classify-ci-change), run via go run; it needs no filesystem walk, only the git diff --name-only path list.
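A minimal sketch of what such a shared matcher might look like follows. The names isTracked and matchExclude are hypothetical, and the glob handling is an assumption: a trailing /** is treated as a directory prefix, and a baked-in exclude is taken to match any path segment. The actual internal/walker rules may differ.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// bakedIn lists the built-in excludes named in the spec.
var bakedIn = []string{".git", "node_modules", ".worktrees", ".obsidian", ".claude", "tmp-*"}

// matchExclude handles "dir/**" as a prefix and anything else as a path.Match glob.
func matchExclude(pat, p string) bool {
	if strings.HasSuffix(pat, "/**") {
		return strings.HasPrefix(p, strings.TrimSuffix(pat, "**"))
	}
	ok, _ := path.Match(pat, p)
	return ok
}

// isTracked maps (repo-relative path, extensions, excludes) to tracked.
func isTracked(p string, extensions, excludes []string) bool {
	extOK := false
	for _, e := range extensions {
		if path.Ext(p) == e {
			extOK = true
		}
	}
	if !extOK {
		return false
	}
	// Assumption: a baked-in exclude matches any single path segment.
	for _, seg := range strings.Split(p, "/") {
		for _, pat := range bakedIn {
			if ok, _ := path.Match(pat, seg); ok {
				return false
			}
		}
	}
	for _, pat := range excludes {
		if matchExclude(pat, p) {
			return false
		}
	}
	return true
}

func main() {
	exts := []string{".md", ".html"}
	excl := []string{"www/**", "marketing/**"}
	for _, p := range []string{"notes/a.md", "www/index.html", "node_modules/x/readme.md"} {
		fmt.Println(p, isTracked(p, exts, excl))
	}
}
```

The function takes only a path string, which fits the point above that the classifier needs the diff's path list and no filesystem walk.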
The one thing still mirrored is the data. The production wiki-browser.yaml holds host paths and secret references and is not committed, so the extensions/exclude values are kept in a committed non-secret file — wiki-browser/ci-tracked-paths.yaml — read by the classifier. Only those two values must stay in step with production config; provisioning verifies that by code review. A classifier error (unreadable policy file, a failed git diff) exits non-zero and fails the Action — surfaced as a red run, never a silently dropped push.
On a code+doc push the docs sync only after the test run finishes (~2 min), because the single request is sent at the end of the job. Doc-only pushes wait just for runner spin-up and classification (~30 s). This latency was explicitly accepted in discovery; if a missed Action ever leaves docs stale, git-sync's existing opt-in poll_interval is the safety net.
The single webhook endpoint
The #10 Task-9 handler is rewritten. The route is renamed POST /api/webhook/ci and the payload is no longer GitHub's native push event but our own schema:
Request body:

```json
{
  "deploy": true,
  "commit": "a1b2c3d4e5f6789012345678901234567890abcd",
  "ref": "refs/heads/master",
  "delivery_id": "github-run-123456789"
}
```
It stays a public route — mounted outside the OAuth session middleware, like /auth/* — protected by HMAC: read the body under a size cap, compute HMAC-SHA256(body, secret), constant-time compare against X-Hub-Signature-256: sha256=<hex>. The secret file is the existing git.webhook_secret_file from #10 — unchanged on the Pi; only its counterpart moves (from the native-webhook config to a GitHub Actions secret). Handler behavior:
- Bad / missing signature → 401, logged.
- Bad payload (non-master ref, malformed non-40-hex commit, or oversized body) → 400, logged.
- delivery_id (the GitHub Actions run id) is recorded in this delivery's log line: correlation only, no behavioral effect.
- Always: call gitSync.RequestSync(), the doc pull. This is the job the native webhook used to do.
- If deploy is true: write the full commit to cd.trigger_file (atomic write: temp file + rename).
- Respond 204 immediately; both the sync and the CD run proceed out of band.
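The signature and payload checks above could look roughly like this, using only the Go standard library. The helper names (sign, verify, parse) and the payload struct are illustrative, not the project's actual handler, and accepting only lowercase hex for the commit is an assumption.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// payload mirrors the request schema in this spec.
type payload struct {
	Deploy     bool   `json:"deploy"`
	Commit     string `json:"commit"`
	Ref        string `json:"ref"`
	DeliveryID string `json:"delivery_id"`
}

// fullSHA accepts exactly 40 lowercase hex characters (an assumption).
var fullSHA = regexp.MustCompile(`^[0-9a-f]{40}$`)

// sign returns the X-Hub-Signature-256 header value for body.
func sign(body, secret []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares it in constant time.
func verify(body, secret []byte, header string) bool {
	if !strings.HasPrefix(header, "sha256=") {
		return false
	}
	return hmac.Equal([]byte(sign(body, secret)), []byte(header))
}

// parse returns the payload, or an error that would map to a 400 response.
func parse(body []byte) (payload, error) {
	var p payload
	if err := json.Unmarshal(body, &p); err != nil {
		return p, fmt.Errorf("malformed JSON: %w", err)
	}
	if p.Ref != "refs/heads/master" {
		return p, fmt.Errorf("unexpected ref %q", p.Ref)
	}
	if !fullSHA.MatchString(p.Commit) {
		return p, fmt.Errorf("commit is not a full 40-hex SHA")
	}
	return p, nil
}

func main() {
	body := []byte(`{"deploy":true,"commit":"a1b2c3d4e5f6789012345678901234567890abcd","ref":"refs/heads/master"}`)
	secret := []byte("example-secret")
	fmt.Println(verify(body, secret, sign(body, secret)))
	fmt.Println(parse(body))
}
```

The same sign function shows what the Action side has to produce: the HMAC is computed over the exact bytes sent, so the body must be signed after it is serialized.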
wiki-browser needs no privilege for this — it only writes a file it owns. The systemd .path unit (running as part of systemd) does the activation.
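The atomic trigger write can be sketched as a temp file created next to the target, followed by a rename. The function names here are hypothetical, and roundTrip exists only to exercise the sketch in a scratch directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// writeTrigger atomically replaces path with commit: it writes a temp file in
// the same directory, then renames it, so a reader never sees a partial write.
func writeTrigger(path, commit string) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".cd-trigger-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless after a successful rename
	if _, err := tmp.WriteString(commit + "\n"); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// readTrigger returns the commit currently named by the trigger file.
func readTrigger(path string) (string, error) {
	b, err := os.ReadFile(path)
	return strings.TrimSpace(string(b)), err
}

// roundTrip writes two commits in turn to a scratch file and reads back the last.
func roundTrip(first, second string) (string, error) {
	dir, err := os.MkdirTemp("", "cd-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	p := filepath.Join(dir, "cd-trigger")
	if err := writeTrigger(p, first); err != nil {
		return "", err
	}
	if err := writeTrigger(p, second); err != nil {
		return "", err
	}
	return readTrigger(p)
}

func main() {
	fmt.Println(roundTrip("aaaa", "bbbb"))
}
```

Creating the temp file in the target's own directory keeps the rename on one filesystem, which is what makes it atomic.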
wb-cd — the deploy oneshot
A new binary, cmd/wb-cd, built by make build alongside wiki-browser and wb-agent. It is a Go binary rather than a shell script for two reasons: the rollback logic is correctness-critical and deserves unit tests, and it reuses internal/alert for Slack notifications. It runs once per activation, but it drains trigger state before exit. Each pass of the target-commit loop runs these steps:
1. Read the target full commit from the trigger file. If it equals the recorded deployed commit → no target is pending. If it equals the recorded poisoned commit → skip (a known-bad commit; do not retry until a new commit arrives).
2. Ensure the clone contains commit: git cat-file -e commit^{commit}, polled with a bounded wait. git-sync's Sync, kicked by the same webhook, delivers it within seconds.
3. git archive commit -- wiki-browser into a temp dir (only the one self-contained Go module, not the rest of the monorepo) and run make -C wiki-browser build COMMIT=<full-sha> VERSION=<short-sha> BUILD_TIME=<ts> there. This builds the exact approved commit, not the live working tree (git-sync keeps that at master HEAD, which may have moved past commit). git archive is a pure read of the object database with no mutation of the live clone, so the sole-mutator invariant holds.
4. Snapshot the current live binaries to bin/prev/.
5. Atomically rename() the three freshly built binaries over the live paths.
6. sudo systemctl restart wiki-browser. This blocks through the graceful drain.
7. Health-check (see below). Healthy → record commit as deployed. Restart failure or unhealthy → roll back.
8. Before exit, re-read cd.trigger_file. If it now names a different un-deployed, non-poisoned commit, start the loop again. This is what makes .path activation safe when a push lands while wb-cd.service is already active.
Every wb-cd target loop is idempotent: re-running re-builds and re-swaps from scratch. A run interrupted between the three rename() calls leaves a mixed binary set, but the next activation (or a manual one) re-runs the whole sequence cleanly. Each rename() itself is atomic, so no individual binary is ever torn.
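The skip rules in the first step and the re-read in the last step can be sketched as a small decision function plus a drain loop. The names nextTarget and drain are hypothetical, and the deploy callback stands in for the build, swap, restart and health-check steps.

```go
package main

import (
	"fmt"
	"strings"
)

// state mirrors the JSON fields of the CD state file.
type state struct {
	DeployedCommit string `json:"deployed_commit"`
	DeployedAt     string `json:"deployed_at"`
	PoisonedCommit string `json:"poisoned_commit"`
}

// nextTarget returns the commit to deploy, or ok=false when nothing is pending.
func nextTarget(trigger string, st state) (commit string, ok bool) {
	commit = strings.TrimSpace(trigger)
	if commit == "" || commit == st.DeployedCommit || commit == st.PoisonedCommit {
		return "", false
	}
	return commit, true
}

// drain re-reads the trigger after every attempt and stops once it names
// nothing new. deploy stands in for the build, swap, restart and health check.
func drain(read func() string, st *state, deploy func(string) bool) int {
	runs := 0
	for {
		commit, ok := nextTarget(read(), *st)
		if !ok {
			return runs
		}
		if deploy(commit) {
			st.DeployedCommit = commit
		} else {
			st.PoisonedCommit = commit
		}
		runs++
	}
}

func main() {
	st := state{DeployedCommit: "aaa"}
	triggers := []string{"bbb", "ccc"} // "ccc" arrives while "bbb" is deploying
	i := 0
	read := func() string {
		s := triggers[i]
		if i < len(triggers)-1 {
			i++
		}
		return s
	}
	runs := drain(read, &st, func(string) bool { return true })
	fmt.Println(runs, st.DeployedCommit)
}
```

Every attempt ends with the commit recorded as deployed or poisoned, so the loop cannot spin on the same trigger value.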
CD self-update
cmd/wb-cd lives under wiki-browser/, so a change to its own source is a build input like any other. Because wb-cd is a oneshot, keeping it current needs no special machinery: make -C wiki-browser build (step 3 above) compiles all three binaries, and step 5 rename()s the new wb-cd over its own path. The running process keeps its old inode and finishes cleanly; the next trigger runs the new wb-cd. There is no "hot-swap a running process" problem because between triggers nothing is running — the atomic rename() is the one hard requirement.
The residual risk is a new wb-cd that compiles but is behaviorally broken: the old (working) CD will build and install it, and the next trigger runs broken CD. Three things contain this — wb-cd carries unit tests; a systemd OnFailure= hook alerts on any non-zero CD exit; and the failure degrades gracefully — broken CD never touches wiki-browser's health, it just stops deploying, and make deploy remains as the break-glass path.
Health check & rollback
After the restart returns, wb-cd polls GET /healthz for a bounded window (default ~90 s). The check passes only when the response is 200 and the reported commit equals the full commit just built — proving not only that the process came up but that the swap actually took.
- Healthy → write {deployed_commit, deployed_at} to the state file. Done.
- Restart failed or unhealthy → roll back: rename() the bin/prev/ binaries back into place, restart again, re-verify /healthz. Fire a Slack alert (condition, commit, last error). Record commit as poisoned in the state file so a re-fired trigger for the same broken commit is skipped. CD retries only when a new commit lands, i.e. someone pushed a fix.
- Rollback itself fails (both new and previous binary unhealthy) → alert loudly and stop. No thrash loop; a human intervenes.
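The pass condition for the health check, a 200 response whose commit equals the commit just built, can be sketched as a pure function plus a bounded retry loop. The names healthy and pollHealthy are illustrative; a real probe would issue the GET to /healthz and hand the status and body to healthy.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// health holds the one /healthz field this check needs.
type health struct {
	Commit string `json:"commit"`
}

// healthy reports whether one /healthz response proves the swap took:
// HTTP 200 and a reported commit equal to the full commit just built.
func healthy(statusCode int, body []byte, want string) bool {
	if statusCode != http.StatusOK {
		return false
	}
	var h health
	if err := json.Unmarshal(body, &h); err != nil {
		return false // e.g. an old binary still answering plain "ok"
	}
	return h.Commit == want
}

// pollHealthy retries probe until it succeeds or the attempts run out.
func pollHealthy(probe func() bool, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if probe() {
			return true
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	body := []byte(`{"status":"ok","commit":"abc","version":"abc","built":"2026-05-21T14:30Z"}`)
	fmt.Println(healthy(200, body, "abc"), healthy(200, body, "def"))
}
```

Comparing the commit rather than only the status code is what separates "the new binary is up" from "the old binary is still answering".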
The state file is small JSON at cd.state_file:
cd-state.json:

```json
{
  "deployed_commit": "a1b2c3d4e5f6789012345678901234567890abcd",
  "deployed_at": "2026-05-21T14:30:00Z",
  "poisoned_commit": ""
}
```
Rollback restores the binary, not the database. If a deploy carries a non-backward-compatible collab-DB schema migration, the new binary migrates the DB and a binary-only rollback then runs the old code against the new schema. CD cannot detect or undo this. The discipline this imposes: schema migrations for wiki-browser must be backward-compatible (expand-contract) — the old binary must tolerate the new schema. This is good practice regardless, but CD makes it load-bearing.
Graceful drain
This is a change to wiki-browser, not wb-cd. main.go today gives srv.Shutdown a 5 s budget — both too short and aimed at the wrong thing. srv.Shutdown waits for active HTTP handlers; but an Agent job runs in a background goroutine in the agent service — the POST /api/agent/jobs handler that queued it has already returned. So srv.Shutdown finishing says nothing about whether a 2–3 min job is still running. (Incorporation is fine: it runs synchronously inside its handler, so srv.Shutdown's existing wait already covers it; it is seconds anyway.)
The new SIGTERM sequence:
- Enter process-wide draining mode. Read-only routes keep serving. New mutating HTTP requests get 503 Service Unavailable with Retry-After unless they are part of the shutdown path itself.
- Tell the agent service to stop dequeuing new jobs. Queued jobs simply stay in the DB; the next instance drains them on boot.
- Wait, on a WaitGroup over in-flight job goroutines, for running jobs to finish. HTTP stays up serving reads throughout, so the wiki is browsable during the drain.
- Once jobs are drained, srv.Shutdown, close the DBs, exit.
- A 10 min cap on the in-flight job wait: a genuinely hung job cannot strand a redeploy forever; past the cap the drain gives up, that job dies and is swept by collab.Recover on next boot.
Mutating routes include POST /api/agent/jobs, proposal/incorporation routes, edit/write routes, and manual sync/write endpoints. The CI webhook itself is allowed only before shutdown begins; once draining, a concurrent CI request should receive 503 and be retried by the Action rather than extending the old process. wiki-browser.service gets TimeoutStopSec=11m — above the cap, so systemd never SIGKILLs mid-drain. wb-cd's own systemctl restart call blocks for the whole drain; its unit timeout accommodates that.
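Two pieces of the drain lend themselves to a short sketch: the gate that answers 503 to mutating requests, and the capped wait on the job WaitGroup. Everything named here (gate, waitCapped, statusFor, drainDemo) is hypothetical. Treating a request as mutating based on its HTTP method is an assumption made for brevity, where the spec enumerates routes, and the Retry-After value is arbitrary.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
	"time"
)

// gate rejects mutating requests once draining has begun.
type gate struct{ draining atomic.Bool }

// wrap answers 503 with Retry-After for non-read requests while draining.
// Assumption: "mutating" is approximated by HTTP method.
func (g *gate) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		read := r.Method == http.MethodGet || r.Method == http.MethodHead
		if g.draining.Load() && !read {
			w.Header().Set("Retry-After", "30")
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// waitCapped waits for in-flight jobs and gives up after limit.
// It reports whether the jobs finished in time.
func waitCapped(wg *sync.WaitGroup, limit time.Duration) bool {
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		return true
	case <-time.After(limit):
		return false
	}
}

// statusFor is a demo helper: the status the gate gives one request.
func statusFor(g *gate, method string) int {
	h := g.wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, "/api/agent/jobs", nil))
	return rec.Code
}

// drainDemo runs waitCapped against one job lasting jobMillis milliseconds.
func drainDemo(jobMillis, capMillis int) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(jobMillis) * time.Millisecond)
	}()
	return waitCapped(&wg, time.Duration(capMillis)*time.Millisecond)
}

func main() {
	g := &gate{}
	g.draining.Store(true)
	fmt.Println(statusFor(g, "GET"), statusFor(g, "POST"))
	fmt.Println(drainDemo(0, 1000), drainDemo(500, 10))
}
```

In the sequence above the cap would be 10 minutes, and a false return from the wait is the point at which the drain gives up on the hung job.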
Versioning & observability
Versioning lets a human tell at a glance whether a new version is live, while letting rollback compare exact identity. Three values are stamped into the binary at build time via -ldflags -X: the full commit (main.commit), the display short sha (main.version), and a build timestamp (main.buildTime). Build time, not process-start time, is the "deploy time": it is baked into the binary, so it does not drift when the Pi reboots or the service restarts for an unrelated reason.
```makefile
COMMIT ?= $(shell git rev-parse HEAD 2>/dev/null || echo dev)
VERSION ?= $(shell git rev-parse --short HEAD 2>/dev/null || echo dev)
BUILD_TIME ?= $(shell date -u +%Y-%m-%dT%H:%MZ)
LDFLAGS = -s -w -X main.commit=$(COMMIT) -X main.version=$(VERSION) -X main.buildTime=$(BUILD_TIME)
```
The temp dir from git archive has no .git, so wb-cd passes the values explicitly: make -C wiki-browser build COMMIT=<full-sha> VERSION=<short-sha> BUILD_TIME=<ts>. The defaults above keep a bare make build on a dev machine working.
Two surfaces expose it:
- /healthz: upgraded from plain ok to JSON: {"status":"ok","commit":"a1b2c3d4e5f6789012345678901234567890abcd","version":"a1b2c3d","built":"2026-05-21T14:30Z"}. Still unauthenticated, still registered ahead of the auth middleware. wb-cd compares the full commit, never the short display version.
- UI chrome footer: a subtle, small line in shell.html (a1b2c3d · 2026-05-21 14:30), the display version linked to the GitHub commit.
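On the Go side, the stamped values and the JSON /healthz response might be wired up roughly as follows. The three variable names match the -X targets named in this section; the handler, the probe helper and the fallback values for an unstamped build are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Overwritten at build time via
// -ldflags "-X main.commit=... -X main.version=... -X main.buildTime=...".
// The values below are what an unstamped build reports (an assumption).
var (
	commit    = "dev"
	version   = "dev"
	buildTime = "unknown"
)

// healthz writes the JSON shape described in this section.
func healthz(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{
		"status":  "ok",
		"commit":  commit,
		"version": version,
		"built":   buildTime,
	})
}

// probe calls the handler in-process and returns the decoded body.
func probe() (map[string]string, error) {
	rec := httptest.NewRecorder()
	healthz(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	var out map[string]string
	err := json.NewDecoder(rec.Body).Decode(&out)
	return out, err
}

func main() {
	out, err := probe()
	fmt.Println(out, err)
}
```

The -X flag can only set package-level string variables, which is why these are declared with var rather than const.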
systemd units
| Unit | Type | Role |
|---|---|---|
| wb-cd.path | .path | PathChanged= the trigger file → activates wb-cd.service. Coalescing is acceptable because wb-cd re-reads the trigger file before exit. |
| wb-cd.service | oneshot | User=karn; TimeoutStartSec=20m for build + up-to-11 min drain + health poll; OnFailure=wb-cd-alert.service. |
| wb-cd-alert.service | oneshot | Backstop: POSTs a "wb-cd run failed unexpectedly" line to Slack when a CD run exits non-zero. wb-cd self-alerts for its known failure paths; this catches a crash. |
| wiki-browser.service | (edit) | TimeoutStopSec=11m for the drain. |
```ini
# wb-cd.path
[Path]
PathChanged=/srv/wiki-browser/cd-trigger
Unit=wb-cd.service

[Install]
WantedBy=multi-user.target
```
```ini
# wb-cd.service
[Service]
Type=oneshot
User=karn
ExecStart=/srv/wiki-browser/bin/wb-cd -config=/srv/wiki-browser/wiki-browser.yaml
TimeoutStartSec=20m
OnFailure=wb-cd-alert.service
```
wb-cd runs as the unprivileged karn account and needs exactly one root action — restarting wiki-browser. A narrowly scoped sudoers rule grants only that:
```sudoers
# /etc/sudoers.d/wb-cd
karn ALL=(root) NOPASSWD: /usr/bin/systemctl restart wiki-browser
```
Configuration
One new optional block in wiki-browser.yaml. Both binaries read the same config file: wiki-browser needs trigger_file (the handler writes it); wb-cd needs all of it, plus it reuses the existing root, listen and alert values.
```yaml
# wiki-browser.yaml
cd:
  bin_dir: "/srv/wiki-browser/bin"            # live binaries + bin/prev/
  trigger_file: "/srv/wiki-browser/cd-trigger"
  state_file: "/srv/wiki-browser/cd-state.json"
  health_poll_timeout: "90s"                  # default
```
Absent a cd: block, wiki-browser behaves exactly as today — the /api/webhook/ci route still mounts for doc-sync if a webhook secret is configured, but a deploy:true payload has nowhere to write and is logged and ignored. The .path unit's watched path is fixed in the unit file; provisioning must keep it in step with cd.trigger_file.
The Action also gets a committed non-secret path-policy file that mirrors the document-tracking part of production config:
```yaml
# wiki-browser/ci-tracked-paths.yaml
extensions: [".md", ".html"]
exclude:
  - "www/**"
  - "marketing/**"
```
Provisioning
One-time operator runbook, split by where each step runs. The Pi steps need host access the implementation does not have. The GitHub steps are gh-driven and the operator runs them.
| Step | Where | How |
|---|---|---|
| Install the Go toolchain (native arm64 build) | Pi | operator |
| Install wb-cd.path, wb-cd.service, wb-cd-alert.service; systemctl enable --now wb-cd.path | Pi | operator |
| Add TimeoutStopSec=11m to wiki-browser.service | Pi | operator |
| Install the sudoers rule for karn | Pi | operator |
| Create bin/prev/; confirm the webhook secret file exists | Pi | operator |
| Confirm wiki-browser/ci-tracked-paths.yaml mirrors production extensions and exclude | GitHub | code review |
| Set the GitHub Actions secret with the same HMAC value | GitHub | gh secret set WB_DEPLOY_HMAC_SECRET |
| Set the webhook URL (Actions variable or secret) | GitHub | gh variable set WB_WEBHOOK_URL |
| Delete the native Settings→Webhooks hook, if #10 already created it | GitHub | gh api -X DELETE repos/getorcha/orcha/hooks/<id> |
Failure modes
| Failure | Detection | Behavior |
|---|---|---|
| Tests fail in the WB Action | red Action run | deploy:false is POSTed if tracked docs also changed; otherwise no Pi call. The red Action is the signal. |
| make build fails on the Pi | non-zero exit | Abort before any binary is touched; live binary untouched; Slack alert; record the commit as poisoned so it is not retried until a new commit lands. |
| systemctl restart fails or new binary unhealthy | restart exit / /healthz poll | Roll back to bin/prev/; restart; alert; commit marked poisoned. |
| Rollback itself fails | second /healthz poll | Loud alert; stop and wait for manual intervention. No thrash loop. |
| wb-cd crashes mid-run | systemd OnFailure= | Alert. Swaps are atomic; a re-run is idempotent and cleans up. |
| Target commit not yet pulled | git cat-file -e | Bounded wait for git-sync's pull; on timeout, alert and retry on the next trigger. |
| Drain exceeds the 10 min cap | cap timer | Job abandoned, swept by collab.Recover on boot. |
| GitHub Actions outage | — | No sync, no deploy. git-sync's opt-in poll_interval covers doc-sync; make deploy covers deploy. |
| Non-backward-compatible DB migration in a rolled-back deploy | — (not detected) | Out of scope — see the migration warning. Mitigated by expand-contract migration discipline. |
Security
- /api/webhook/ci is the one unauthenticated route that can trigger work on the Pi: constant-time HMAC compare, reject a missing signature, cap the body size, require ref == refs/heads/master, and require a full 40-hex commit. It triggers only a fetch and (conditionally) a trigger-file write. It carries no user-controlled path or command, only a boolean, branch ref, delivery id, and commit.
- wb-cd runs as the unprivileged karn account; the sudoers grant is one exact command (systemctl restart wiki-browser), nothing more.
- The HMAC secret has two halves: a file on the Pi (mode 0600, owned by karn, not in git) and a GitHub Actions secret. Actions secrets are write-only in the GitHub UI and masked in logs.
- The Pi runs make build, and through it go build, on repo code on every deploy. Apart from the Makefile's own recipes, this executes nothing beyond the compiler and module downloads (go.sum-verified), and the monorepo is already trusted: the team writes master directly and the Agent runtime already executes against repo content. Building it is within the existing trust boundary. Tests, which do run arbitrary repo code, run in GitHub's isolated runners, not on the Pi.
Resolved decisions
Settled during discovery; recorded so the rationale is not relitigated:
- Build on the Pi, natively. Pure-Go (CGO_ENABLED=0), a Pi 5 builds in seconds; external cross-compile + artifact transport is not worth its machinery.
- Tests run in GitHub Actions, not on the Pi. Free off-device compute, a real CI surface, and the project's first automated test gate.
- A single GitHub Action, native webhook retired. One request per push; a native webhook cannot run the test gate CD needs.
- A oneshot activated by a .path unit, not a daemon or a timer poll. Nothing to crash-recover; self-update is free.
- Graceful drain, not kill-and-recover. A 2–3 min Agent job is finished, not thrown away. Deploy latency was explicitly accepted as a non-problem; the drain cap is 10 min.
- Auto-rollback to one previous version. CD must not be able to take the wiki down.
- git hooks rejected as the trigger. They would couple CD to git-sync's internal git invocations; the Action's webhook is decoupled and precise.
References
- Pi deployment & git-sync — design (#10): the spec this one revises; source of the webhook handler, the alert package, and the startup sequence.
- Agent runtime — design (#3): the job model whose in-flight goroutines the drain waits on.
- internal/server/handler_webhook.go: the #10 Task-9 handler rewritten here.
- internal/agent/service.go: gains stop-dequeue + in-flight WaitGroup for the drain.
- cmd/wiki-browser/main.go: the SIGTERM path and the 5 s srv.Shutdown budget replaced here.
- deploy/wiki-browser.service, Makefile: extended for TimeoutStopSec, version-ldflags, and the wb-cd build.