Pi deployment & git-sync — Design
Problem
wiki-browser runs fine on a developer laptop against a local checkout, but the collaboration loop it was built for — view a document, open Topics, discuss, incorporate the resolution — only pays off when both collaborators point at the same always-on instance. Today there is no such instance.
The two of us (Daniel, Max) work trunk-based: documents are committed straight to master of getorcha/orcha and pushed to GitHub, no pull requests. The intended loop is: a document lands on master → the wiki serves it → we discuss Topics on it in the UI → the Agent rewrites the Source → the resolution is committed and pushed back to master → both of us git pull the updated copy. For that loop to close, a deployed wiki-browser must do two things it cannot do today:
- Pull to serve. When a commit lands on master, the deployed instance must fetch it and serve the new content. Today the server only sees a filesystem; it has no notion of a remote.
- Push what it incorporates. Incorporation already writes the rewritten Source and runs git commit with Topic:/Proposal: trailers (internal/collab/gitops.go), but it never pushes. The commit sits on the deployed clone, invisible to everyone else. There is no fetch, pull, or push anywhere in the codebase.
The target host is a Raspberry Pi 5 on the home LAN, already reachable from outside through an existing Nginx Proxy Manager that terminates TLS and routes a domain to a host:port. So this spec is two things at once: an operational runbook for standing the Pi up, and the design of a new git-sync engine that turns the local-only server into a participant on a shared master.
A related want — continuously redeploying the wiki-browser binary itself when its source changes — is explicitly a later sub-project. It is addressed here only to the extent of leaving clean seams (see Design § CD seams).
Goals & non-goals
Goals
- A deployed wiki-browser on the Pi serves the live master of getorcha/orcha; a teammate's push is reflected within seconds.
- Incorporation commits are pushed to master automatically; both collaborators get them with a plain git pull.
- The process is crash- and restart-safe at every step: killing it at any instant leaves recoverable, consistent state, with no lost incorporation, no orphaned commit, and no DB/git divergence.
- Reproducible provisioning (OS deps, secrets, systemd, proxy host, GitHub webhook) and a simple manual binary-update procedure.
- Leave clean, documented seams for a future continuous-delivery sub-project — without building it now.
Non-goals
- Continuous delivery of the wiki-browser binary. Its own sub-project. This spec only avoids foreclosing it.
- A PR / branch workflow for documents. The collaboration policy is trunk-based; incorporations go straight to master.
- Polling as the primary sync trigger. A GitHub webhook is primary. A poll exists only as an opt-in, disabled-by-default safety net.
- High availability / failover. One Pi. Occasional downtime is acceptable; the design degrades gracefully and self-heals on restart rather than avoiding downtime.
- Automatic resolution of genuine content conflicts. If the Pi and a teammate edit the same document in the same instant, a human resolves it. The design makes this rare and makes it loud, not invisible.
- Multi-repo or multi-instance support. One clone, one wiki, one collab DB.
Approach
The new behavior is a git-sync engine: pull-to-serve on a webhook, push-after-incorporate, and a startup catch-up. The question is where it lives.
| Approach | New Go code | Webhook-native | Race-safe | Verdict |
|---|---|---|---|---|
| A. In-binary engine + one shared git lock | Moderate | Yes | Yes | Recommended |
| B. External pull (cron / systemd timer) + in-binary push | Low | No | No | Rejected |
| C. Bare mirror repo + separate served working tree | High | Partial | Yes | Rejected |
B is ruled out by the webhook: a GitHub webhook is an HTTP request, and it lands in the wiki-browser process — so the pull must be in-binary regardless. Adding a second process that also mutates the same clone reintroduces exactly the race the lock exists to prevent: an external git merge interleaving with the in-binary git add/git commit. C separates "receive" from "serve" cleanly but is far too much machinery for two users and one host, and the wiki-browser already commits straight into the served tree — C would mean re-plumbing that.
A is chosen. The webhook already arrives in the Go process; incorporation, committing, and startup recovery already live there. A new internal/gitsync package adds one mutex that every git mutation acquires, and the whole system stays single-process, single-writer, single-lock.
Pros — Approach A
- One process, one lock — pulls and incorporation commits cannot interleave by construction.
- Composes with the webhook (already in-process) and with the existing commit / recover code.
- Sync status is in-process, so it can be surfaced in the UI and read by a future CD watcher.
Cons — Approach A
- New Go code in the binary (a package plus an incorporation seam).
- A large fetch can briefly block an incorporation behind the lock — acceptable; fetches are small and infrequent.
Design
Topology & Pi layout
On-disk layout. The config lives inside the clone, so the .claude/skills/ the Agent runtime requires (wb-incorporate, wb-perspective — validated at startup by validateAgentRuntimeRoot) arrive and stay current with every pull. Databases live outside the clone on a stable path:
Filesystem (Raspberry Pi):

    /srv/orcha/                        # clone of getorcha/orcha; this IS cfg.root
      wiki-browser/
        .claude/skills/{wb-incorporate,wb-perspective}/   # ship via git
        wiki-browser.yaml              # config; gitignored; agent-runtime root
    /srv/wiki-browser/
      bin/{wiki-browser,wb-agent}      # arm64 binaries
      data/{collab.db,index.db}        # absolute paths in config
      secrets/{google-client-secret,github-webhook-secret,slack-webhook-url}
      agent-logs/
The collab DB holds every Topic, proposal, discussion message and session — none of it is in git. It is the one irreplaceable artifact on the Pi and is treated accordingly (see Operations). The index DB is disposable: it is rebuilt from the repo.
The git-sync engine (internal/gitsync)
A new package wrapping the clone. It owns one mutex that every git mutation acquires — webhook fetch, incorporation commit + push, startup catch-up, background push. Reads that do not mutate (the existing git log recovery scan, content hashing) are unaffected. Sketch of the surface:
Go (internal/gitsync):

    type SyncResult struct {
        OldHead, NewHead string
        ChangedPaths     []string // repo-relative, ext/exclude-filtered
        Rebased          bool
    }

    type State string // "synced" | "syncing" | "push-pending" | "diverged"

    type Status struct {
        State      State
        Head       string    // current HEAD sha; also the CD watcher's signal
        Ahead      int       // local commits not yet on origin
        LastSyncAt time.Time
        LastError  string
    }

    func New(cfg Config) (*Repo, error)                          // validates: git repo, on branch, remote present
    func (r *Repo) Sync(ctx context.Context) (SyncResult, error) // lock → reconcile
    func (r *Repo) Push(ctx context.Context) error               // lock → reconcile → git push
    func (r *Repo) Incorporate(ctx context.Context, fn func() (string, error)) (string, error)
    func (r *Repo) Status() Status
The shared primitive is reconcile — never exported, always under the lock: git fetch <remote> <branch>, then git merge --ff-only; if the local branch has diverged (it carries unpushed incorporation commits and origin also moved), fall back to git rebase <remote>/<branch>. Every public operation is built on it:
- Sync = reconcile. Used by the webhook and by startup catch-up.
- Push = reconcile, then git push. Used by the background pusher.
- Incorporate = reconcile, then run fn (the existing collab.Incorporate), then git push, all under one lock hold.
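The reconcile primitive and the two simplest operations built on it can be sketched as follows. This is a minimal illustration, not the package's implementation: the gitRunner seam, the record helper, and the errDiverged value are hypothetical names introduced here so the ordering can run without a real clone, and the sketch treats any failed fast-forward as divergence.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// gitRunner runs one git subcommand in the clone. It is a hypothetical seam,
// injected so this sketch runs without a real repository.
type gitRunner func(args ...string) error

var errDiverged = errors.New("gitsync: diverged, manual resolution needed")

// Repo stands in for the engine: one mutex guards every git mutation.
type Repo struct {
	mu             sync.Mutex
	git            gitRunner
	remote, branch string
}

// reconcile must be called with r.mu held: fetch, try a fast-forward, fall
// back to a rebase, and abort the rebase if it conflicts.
func (r *Repo) reconcile() error {
	if err := r.git("fetch", r.remote, r.branch); err != nil {
		return fmt.Errorf("fetch: %w", err)
	}
	upstream := r.remote + "/" + r.branch
	if err := r.git("merge", "--ff-only", upstream); err == nil {
		return nil
	}
	// Assumption: a failed fast-forward means the branch has diverged.
	if err := r.git("rebase", upstream); err != nil {
		_ = r.git("rebase", "--abort")
		return errDiverged
	}
	return nil
}

// Sync = lock, then reconcile.
func (r *Repo) Sync() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.reconcile()
}

// Push = lock, then reconcile, then git push.
func (r *Repo) Push() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if err := r.reconcile(); err != nil {
		return err
	}
	return r.git("push", r.remote, r.branch)
}

// record returns a fake runner that logs each command and fails any command
// whose first argument is listed in failOn.
func record(log *[]string, failOn ...string) gitRunner {
	return func(args ...string) error {
		*log = append(*log, strings.Join(args, " "))
		for _, f := range failOn {
			if args[0] == f {
				return errors.New(f + " failed")
			}
		}
		return nil
	}
}

func main() {
	var log []string
	r := &Repo{git: record(&log, "merge"), remote: "origin", branch: "master"}
	fmt.Println(r.Push(), log)
}
```

Because reconcile is unexported and only reachable through methods that take the lock, a webhook-driven Sync and a background Push cannot interleave their git commands.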
After any reconcile that changed files, the engine refreshes the served view deterministically rather than relying on filesystem-watch timing: it calls a new walker.Rescan() (re-runs the existing scan(), atomically swapping the file-set map), then for each path in ChangedPaths calls index.Reindex if the walker still has it or index.Remove otherwise. ChangedPaths comes from git diff --name-only OldHead NewHead. The fsnotify watcher stays in place for local edits and as a backstop, but correctness of a git-driven update does not depend on it.
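The per-path decision described above is small enough to sketch directly. The function names parseNameOnly and planRefresh, and the has callback standing in for the rescanned walker's file set, are hypothetical; the real code would call index.Reindex and index.Remove instead of returning two slices.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNameOnly splits the output of `git diff --name-only OLD NEW` into
// repo-relative paths, dropping blank lines.
func parseNameOnly(out string) []string {
	var paths []string
	for _, line := range strings.Split(out, "\n") {
		if p := strings.TrimSpace(line); p != "" {
			paths = append(paths, p)
		}
	}
	return paths
}

// planRefresh decides, per changed path, whether the index entry should be
// rebuilt or dropped. has reports whether the rescanned walker still serves
// the path (a hypothetical callback for this sketch).
func planRefresh(changed []string, has func(string) bool) (reindex, remove []string) {
	for _, p := range changed {
		if has(p) {
			reindex = append(reindex, p)
		} else {
			remove = append(remove, p)
		}
	}
	return reindex, remove
}

func main() {
	served := map[string]bool{"docs/a.md": true}
	changed := parseNameOnly("docs/a.md\ndocs/deleted.md\n")
	reindex, remove := planRefresh(changed, func(p string) bool { return served[p] })
	fmt.Println("reindex:", reindex, "remove:", remove)
}
```

Running the rescan first matters: a path deleted upstream is absent from the new file set, so it falls through to the remove list instead of being reindexed from a file that no longer exists.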
Webhook endpoint
POST /api/webhook/github is a public route — mounted alongside /auth/*, outside the OAuth session middleware, since GitHub cannot authenticate. It is protected by HMAC instead:
- Read the body with a size cap (e.g. 5 MiB). Compute HMAC-SHA256(body, secret) and compare it to the X-Hub-Signature-256 header with a constant-time compare. Missing or mismatched signature → 401, logged. The secret is read from the file at git.webhook_secret_file.
- Act only on push events whose ref is refs/heads/<branch>. Everything else → 204, ignored.
- Respond 204 immediately; run the Sync asynchronously so GitHub's delivery never waits on git I/O.
- Coalesce. If a sync is already running, set a single "re-run pending" flag rather than queueing N goroutines. Bursts collapse to at most one follow-up sync.
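The signature check and the event filter from the steps above might look like the sketch below. GitHub sends the digest as "sha256=" followed by hex in X-Hub-Signature-256 and the event name in X-GitHub-Event; the function names validSignature, shouldSync, and sign are hypothetical, and the body size cap and HTTP plumbing are left out.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// validSignature checks an X-Hub-Signature-256 header value
// ("sha256=<hex digest>") against the raw request body.
func validSignature(secret, body []byte, header string) bool {
	const prefix = "sha256="
	if !strings.HasPrefix(header, prefix) {
		return false // missing or malformed header
	}
	got, err := hex.DecodeString(strings.TrimPrefix(header, prefix))
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil)) // constant-time compare
}

// shouldSync reports whether a delivery is a push to the configured branch.
func shouldSync(event, ref, branch string) bool {
	return event == "push" && ref == "refs/heads/"+branch
}

// sign is a demo helper that produces the header value a sender would attach.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	secret, body := []byte("s3cret"), []byte(`{"ref":"refs/heads/master"}`)
	fmt.Println(validSignature(secret, body, sign(secret, body)))
	fmt.Println(shouldSync("push", "refs/heads/master", "master"))
}
```

The digest must be computed over the raw bytes as received, before any JSON decoding, since re-serializing the payload would change the bytes and break the comparison.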
The webhook also fires for the Pi's own incorporation pushes. That is harmless: the follow-up Sync fetches, finds origin/<branch> already equal to local HEAD, and the fast-forward is a no-op. No de-duplication needed.
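The coalescing rule, which also absorbs the echo delivery for the Pi's own push, can be sketched with two booleans under a mutex. The coalescer type and its run field are hypothetical names for this illustration; in the handler, Trigger would be started in its own goroutine after the 204 is written.

```go
package main

import (
	"fmt"
	"sync"
)

// coalescer runs at most one sync at a time and remembers at most one
// follow-up, however many triggers arrive while a sync is in flight.
type coalescer struct {
	mu      sync.Mutex
	running bool
	pending bool
	run     func() // the actual work, e.g. a call into the engine's Sync
}

// Trigger is safe to call from any goroutine. If a sync is already running it
// only sets the pending flag; otherwise it runs syncs until none is pending.
func (c *coalescer) Trigger() {
	c.mu.Lock()
	if c.running {
		c.pending = true
		c.mu.Unlock()
		return
	}
	c.running = true
	c.mu.Unlock()

	for {
		c.run() // called without the mutex held
		c.mu.Lock()
		if !c.pending {
			c.running = false
			c.mu.Unlock()
			return
		}
		c.pending = false
		c.mu.Unlock()
	}
}

func main() {
	runs := 0
	c := &coalescer{}
	c.run = func() {
		runs++
		if runs == 1 {
			// A burst of three deliveries arrives while the first sync runs.
			c.Trigger()
			c.Trigger()
			c.Trigger()
		}
	}
	c.Trigger()
	fmt.Println("syncs run:", runs)
}
```

In the demo the burst of three triggers collapses into one follow-up, so the work function runs twice in total.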
Incorporation: push integration
Incorporation today (internal/collab/incorporate.go) loads the proposal, stale-checks base_source_sha against the Source on disk, writes the recovery marker, commits via CommitSourceRewrite, then completes the DB transaction. The change is to run that unchanged sequence inside the engine's lock, with a reconcile before and a push after. The incorporate HTTP handler calls:
Go (incorporate handler):

    sha, err := gitSync.Incorporate(ctx, func() (string, error) {
        return collab.Incorporate(store, in) // existing call, unchanged
    })
Ordering inside one lock hold: (1) reconcile — the working tree is now equal to origin/<branch>. (2) collab.Incorporate runs; its stale-check now compares the proposal's base_source_sha against the freshly-pulled Source — so an upstream edit to the same document cleanly fails the check as ErrStaleProposal, and the user regenerates. (3) git push.
The push is the last, best-effort step. The commit and the DB transaction are already done and mutually consistent before the push is attempted — so a push failure does not fail the incorporation. The handler reports success; the engine sets state push-pending; the background pusher retries. The local commit is authoritative, the push is replication. This is the property that makes a restart mid-incorporation safe (see CD seams). Batched incorporation (#9) flows through the identical seam — it is the same collab.Incorporate call with ChildTopicIDs populated. Perspective regeneration commits nothing and never pushes.
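The "local commit is authoritative, the push is replication" rule comes down to which errors are allowed to fail the call. The sketch below uses hypothetical injected reconcile and push functions in place of git, and it assumes that a reconcile failure aborts the incorporation, which the text above does not state explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type State string

const (
	Synced      State = "synced"
	PushPending State = "push-pending"
)

// engine stands in for the sync engine. reconcile and push are hypothetical
// seams so the ordering can be exercised without git.
type engine struct {
	mu        sync.Mutex
	reconcile func() error
	push      func() error
	state     State
}

// Incorporate holds the lock across reconcile, fn, and push. A reconcile or
// fn error fails the call; a push error only marks the state push-pending.
func (e *engine) Incorporate(fn func() (string, error)) (string, error) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if err := e.reconcile(); err != nil {
		return "", fmt.Errorf("reconcile: %w", err) // assumption: fatal here
	}
	sha, err := fn() // commit + DB transaction; authoritative once it returns
	if err != nil {
		return "", err
	}
	if err := e.push(); err != nil {
		e.state = PushPending // replication is left to the background pusher
		return sha, nil
	}
	e.state = Synced
	return sha, nil
}

func okStep() error   { return nil }
func failStep() error { return errors.New("step failed") }

func main() {
	e := &engine{reconcile: okStep, push: failStep}
	sha, err := e.Incorporate(func() (string, error) { return "abc123", nil })
	fmt.Println(sha, err, e.state)
}
```

The handler sees a sha and a nil error even though the push failed; the only trace of the failure is the push-pending state.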
Conflict & drift handling
| Situation | Resolution |
|---|---|
| Upstream edits a document that has an open Topic / generated proposal | The proposal's base_source_sha no longer matches the pulled Source → existing ErrStaleProposal / freshness machinery fires → the user regenerates the proposal. No new code. |
| Push rejected — someone pushed in the window between our fetch and our push | reconcile rebases the local incorporation commit onto the new origin/<branch>; retry the push. Bounded attempts (e.g. 3). An incorporation commit is a single-file pathspec commit, so the rebase is clean unless the same document moved upstream. |
| Genuine rebase conflict — the Pi and a teammate edited the same document concurrently | git rebase --abort; set state diverged; log loudly; surface the UI banner and fire an alert (see Alerting). The incorporation already succeeded locally and the DB is consistent — only the push is blocked. A human resolves the merge on the Pi. The background pusher skips while diverged so it does not thrash. |
| Network down — fetch or push fails | Transient. Sync errors are non-fatal and logged; the background pusher and the next webhook retry. Local state stays consistent throughout. |
Before deployment, the watched tree only changed via incorporation or a local editor. After deployment, arbitrary upstream pushes rewrite watched documents far more often. That exercises the existing Topic-anchor / freshness / recover paths (character-offset anchors drifting under an upstream rewrite) harder than they have been. This is existing behavior, not new design — but it is the most likely place for a latent bug to surface, and the implementation plan must include explicit tests for "open Topic on a document that is rewritten by an upstream pull."
Startup sequence
Catch-up runs early in run() — after config.Load, before walker.New — so the initial filesystem scan and collab.Recover both see the latest tree:
1. config.Load.
2. gitsync.New: validate the clone (is a git repo, on the configured branch, remote present); fail loudly with a clear message if not (provisioning did the initial git clone).
3. Sync: fetch + fast-forward. This is what self-heals webhooks missed while the Pi was offline.
4. Push: flush any incorporation commit that a previous run committed but did not push.
5. walker.New, index.Open, collab.Open, RevokeSessions, SweepIncompleteJobs, collab.Recover: unchanged, now operating on the up-to-date tree.
Steps 3–4 are non-fatal. If the Pi boots with no network, gitsync.New still succeeds (the clone is valid), the Sync/Push errors are logged, and the server starts and serves whatever the clone currently holds. The next webhook or the background pusher reconciles once the network returns. Contrast collab.Recover, which stays fatal — a corrupt collab DB is not something to serve through.
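The fatal versus non-fatal split can be made explicit in how the boot steps are run. The step type and runStartup function below are hypothetical; the step names mirror the sequence above, and the bodies are stubs.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// step is one boot action. fatal marks steps the server must not start without.
type step struct {
	name  string
	fatal bool
	run   func() error
}

// runStartup executes steps in order. A failing non-fatal step is logged and
// skipped; a failing fatal step stops the boot and is returned.
func runStartup(steps []step) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			if s.fatal {
				return fmt.Errorf("startup: %s: %w", s.name, err)
			}
			log.Printf("startup: %s failed (non-fatal): %v", s.name, err)
		}
	}
	return nil
}

func ok() error      { return nil }
func offline() error { return errors.New("network unreachable") }

func main() {
	// Booting with no network: the clone validates, Sync and Push fail softly.
	steps := []step{
		{"config.Load", true, ok},
		{"gitsync.New", true, ok},
		{"Sync", false, offline},
		{"Push", false, offline},
		{"collab.Recover", true, ok},
	}
	fmt.Println("boot error:", runStartup(steps))
}
```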
Configuration
Two new top-level blocks. git: configures the sync engine; alert: configures the Slack notifier. The commit-author identity is the existing agent.author_name/author_email; SSH transport for fetch/push uses the service user's deploy key and needs no config entry.
YAML (wiki-browser.yaml):

    git:
      remote: "origin"            # default
      branch: "master"            # default
      webhook_secret_file: "/srv/wiki-browser/secrets/github-webhook-secret"
      poll_interval: "0"          # 0 = webhook-only. Set e.g. "10m" to
                                  # enable a safety-net poll.
    alert:
      slack_webhook_url_file: "/srv/wiki-browser/secrets/slack-webhook-url"
      fail_threshold: "15m"       # alert if sync/push fails continuously this long
Both new secrets follow the existing google_client_secret_file convention — the config holds a path, not the value. The parsed config struct then carries only paths, never secret material, so a stray slog of the config or a config-wrapped error cannot leak a secret; the config file itself stays non-sensitive; and each secret rotates independently. Config load validates each path with an os.Stat existence check, exactly as it already does for google_client_secret_file.
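A sketch of the path-only convention, assuming a GitConfig struct shaped like the git: block. The struct, the readSecret helper, and the whitespace trimming are illustrative assumptions, not the existing loader's behavior.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// GitConfig mirrors the git: block. It carries a path, never the secret.
type GitConfig struct {
	Remote            string
	Branch            string
	WebhookSecretFile string
}

// Validate checks only that the secret file exists, so config loading never
// touches the secret's contents.
func (c GitConfig) Validate() error {
	if _, err := os.Stat(c.WebhookSecretFile); err != nil {
		return fmt.Errorf("git.webhook_secret_file: %w", err)
	}
	return nil
}

// readSecret loads the secret at the point of use. Trimming surrounding
// whitespace is an assumption made for this sketch.
func readSecret(path string) ([]byte, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return []byte(strings.TrimSpace(string(b))), nil
}

// writeTemp is a demo helper that creates a mode-0600 secret file.
func writeTemp(content string) (string, error) {
	dir, err := os.MkdirTemp("", "wb-secret")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "github-webhook-secret")
	return path, os.WriteFile(path, []byte(content), 0o600)
}

func main() {
	path, err := writeTemp("s3cret\n")
	if err != nil {
		panic(err)
	}
	cfg := GitConfig{Remote: "origin", Branch: "master", WebhookSecretFile: path}
	fmt.Println("validate:", cfg.Validate())
	fmt.Printf("%+v\n", cfg) // logs the path only; the secret is not in the struct
}
```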
Webhook-only has one gap: a webhook missed while the Pi is online (a GitHub delivery failure, a momentary proxy hiccup) leaves the Pi stale until the next push. Offline misses self-heal at startup (step 3 above). poll_interval closes the online gap and is disabled by default — the safety net is one config line away without changing the primary mechanism.
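Parsing poll_interval so that "0" means disabled is straightforward with time.ParseDuration, which accepts a bare "0". The pollInterval function is a hypothetical helper; rejecting negative values is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// pollInterval parses git.poll_interval. A zero duration means the
// safety-net poll is disabled and the webhook is the only trigger.
func pollInterval(raw string) (time.Duration, bool, error) {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, false, fmt.Errorf("git.poll_interval %q: %w", raw, err)
	}
	if d < 0 {
		return 0, false, fmt.Errorf("git.poll_interval %q: must not be negative", raw)
	}
	return d, d > 0, nil
}

func main() {
	for _, raw := range []string{"0", "10m", "soon"} {
		d, enabled, err := pollInterval(raw)
		fmt.Println(raw, d, enabled, err)
	}
}
```

When the poll is enabled, each tick would simply call the same Sync the webhook uses, so the safety net adds no second code path.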
Provisioning checklist
- OS & deps: 64-bit Raspberry Pi OS; git; Node.js LTS plus the claude CLI (arm64 is supported); the cross-compiled wiki-browser and wb-agent binaries from make build-arm64.
- Service user: the service runs as the unprivileged karn account (not root). The clone, /srv/wiki-browser, the SSH deploy key, and wiki-browser.yaml are owned by it.
- Clone: git clone git@github.com:getorcha/orcha.git /srv/orcha, checked out on master.
- GitHub deploy key: as the karn user, generate an SSH key, add its public half to the getorcha/orcha repo as a deploy key with write access, and run ssh-keyscan github.com >> ~/.ssh/known_hosts. Doing every step as karn is what lands the key and known_hosts in its home, where non-interactive fetch/push looks for them.
- Agent auth: the claude CLI must be authenticated as the karn user. Default: a one-time interactive subscription login; credentials persist in karn's home, with no per-token API billing. Alternative: export ANTHROPIC_API_KEY via the systemd EnvironmentFile. wiki-browser itself never reads the key; it only spawns claude, which inherits the process environment. An expired subscription login is caught by the sustained-Agent-failure alert.
- Secrets: three files under /srv/wiki-browser/secrets/ (the Google OAuth client secret, the GitHub webhook HMAC secret, the Slack webhook URL), each mode 0600, owned by karn, none in git. The config references them by path only.
- Config: wiki-browser.yaml at /srv/orcha/wiki-browser/, with dev_mode off; listen: ":8080"; public_base_url: https://wiki.<domain>; absolute index_db/collab_db paths under /srv/wiki-browser/data/; wb_agent_bin pointing at the deployed wb-agent.
- systemd unit: extend deploy/wiki-browser.service with User=karn, WorkingDirectory=, EnvironmentFile= (a PATH that includes node and claude; plus ANTHROPIC_API_KEY only if the API-key route is chosen), and Wants=/After=network-online.target, because startup catch-up needs the network.
- Nginx Proxy Manager: a proxy host wiki.<domain> → the Pi's wiki-browser port; enable Websockets Support and disable proxy buffering so the realtime stream is not buffered.
- GitHub webhook: repo Settings → Webhooks → payload URL https://wiki.<domain>/api/webhook/github, content-type JSON, the HMAC secret, "Just the push event".
- Google OAuth: register https://wiki.<domain>/auth/callback as an authorized redirect URI on the OAuth client.
Operations
- Binary updates (manual, until CD): make build-arm64 → copy both binaries to /srv/wiki-browser/bin/ → systemctl restart wiki-browser. Templates and static assets are embedded (embed.FS), so a restart is mandatory; there is no hot reload. A make deploy target wraps the copy + restart.
- Backups: a systemd timer runs sqlite3 collab.db ".backup" to off-device storage on a schedule. The collab DB is the only irreplaceable artifact; the index DB is rebuilt from the repo and is not backed up.
- Observability: gitsync logs every sync (old→new HEAD, changed-path count, rebased?) and every push outcome. GET /api/sync-status (session-gated) returns Status as JSON. The UI chrome shows a banner when state is diverged, since that state needs a human; the engine broadcasts state transitions over the existing realtime hub so the banner updates live. The active notification path is covered separately; see Alerting.
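A status endpoint of this kind could be as small as the handler below. The JSON field names are assumptions (the design fixes the Go fields, not the wire names), statusHandler and fetchStatus are hypothetical helpers, and the session gate is assumed to be applied by middleware outside the sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Status mirrors the engine's status; the JSON names are assumed.
type Status struct {
	State      string    `json:"state"`
	Head       string    `json:"head"`
	Ahead      int       `json:"ahead"`
	LastSyncAt time.Time `json:"last_sync_at"`
	LastError  string    `json:"last_error,omitempty"`
}

// statusHandler serves the current Status as JSON. current is a callback
// standing in for the engine's Status method.
func statusHandler(current func() Status) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(current()); err != nil {
			http.Error(w, "encode failed", http.StatusInternalServerError)
		}
	}
}

// fetchStatus is a demo helper: it calls the handler in-process and decodes
// the reply, the way a banner script or a CD watcher would over HTTP.
func fetchStatus(h http.Handler) (Status, error) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/sync-status", nil))
	var s Status
	err := json.NewDecoder(rec.Body).Decode(&s)
	return s, err
}

func main() {
	h := statusHandler(func() Status {
		return Status{State: "push-pending", Head: "abc123", Ahead: 1, LastError: "network down"}
	})
	s, err := fetchStatus(h)
	fmt.Printf("%+v %v\n", s, err)
}
```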
Alerting
A UI banner and log lines are passive — they assume someone is looking. A diverged state, a silently broken deploy key, or an Agent runtime that has stopped working all have to reach the operator. A small general-purpose notifier — internal/alert — POSTs a message to a Slack incoming webhook ({"text": ...}), the URL read from the file at alert.slack_webhook_url_file. Both the gitsync engine and the Agent service hold a reference to it.
Three conditions raise an alert:
- diverged: immediately, on the transition into the state. A genuine same-document conflict always needs a human. The message includes a deep link to the affected document (public_base_url + /doc/<source-path>, the path taken from the conflicting incorporation commit) so the operator opens it straight from the notification.
- Sustained sync/push failure: on a timer, not a transition. The engine stamps a failingSince time on the first failure. The background pusher's retry tick evaluates it; once now − failingSince exceeds alert.fail_threshold (default 15 min) it emits the alert exactly once. A state that simply stays push-pending is not a transition, so this check has to be time-driven rather than transition-driven. It catches a network outage, a revoked deploy key, or GitHub being unreachable.
- Sustained Agent-job failure. The Agent service counts consecutive job failures; three in a row raises an alert. A single failed job is normal and already visible in the UI; three in a row means the runtime itself is broken: an expired claude login, a missing binary, or the API down.
Each alert carries the condition, the current HEAD, and the last error. When a condition clears — the rebase is resolved, the network returns, an Agent job succeeds — a single "recovered" message fires and the relevant counter resets. Alerts are edge-triggered: a long outage produces one alert and one recovery, never a stream.
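The time-driven, edge-triggered behavior for sustained sync/push failure can be sketched as a small state machine. The failureTracker type, its Observe method, and the message strings are hypothetical; the clock is passed in so the logic is testable without waiting.

```go
package main

import (
	"fmt"
	"time"
)

// failureTracker turns a stream of success/failure observations into at most
// one alert per outage and one recovery message when the outage clears.
type failureTracker struct {
	threshold    time.Duration
	failingSince time.Time // zero while healthy
	alerted      bool
}

// Observe records one outcome at time now and returns the message to send,
// or "" when nothing should be sent.
func (f *failureTracker) Observe(ok bool, now time.Time) string {
	if ok {
		wasAlerted := f.alerted
		f.failingSince, f.alerted = time.Time{}, false
		if wasAlerted {
			return "recovered"
		}
		return ""
	}
	if f.failingSince.IsZero() {
		f.failingSince = now // first failure of this outage
	}
	if !f.alerted && now.Sub(f.failingSince) > f.threshold {
		f.alerted = true
		return "sustained failure"
	}
	return ""
}

// newTracker and at are demo helpers working in whole minutes.
func newTracker(minutes int) *failureTracker {
	return &failureTracker{threshold: time.Duration(minutes) * time.Minute}
}

func at(minutes int) time.Time {
	return time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).Add(time.Duration(minutes) * time.Minute)
}

func main() {
	tr := newTracker(15)
	for _, m := range []int{0, 10, 16, 30} {
		fmt.Printf("t+%dm fail -> %q\n", m, tr.Observe(false, at(m)))
	}
	fmt.Printf("t+31m ok -> %q\n", tr.Observe(true, at(31)))
}
```

A short blip that clears before the threshold produces neither an alert nor a recovery message, because the recovery is only sent when an alert was actually raised.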
The notifier is best-effort and must never block or fail the operation that triggered it — a failed POST is logged and dropped. If alerting itself is down, the UI banner and logs remain as the passive fallback.
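A best-effort notifier in this spirit might look like the sketch below. The Notifier type, its boolean return (present only so the demo can observe the outcome), and the two-second timeout are assumptions; the {"text": ...} body is the Slack incoming-webhook shape named above. The demo posts to a local stand-in server, not to Slack.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

// payload builds the incoming-webhook body: {"text": "..."}.
func payload(msg string) ([]byte, error) {
	return json.Marshal(map[string]string{"text": msg})
}

// Notifier posts alerts and never returns an error to its caller.
type Notifier struct {
	URL    string
	Client *http.Client
}

// newNotifier bounds every POST with a short timeout (an assumed value).
func newNotifier(url string) *Notifier {
	return &Notifier{URL: url, Client: &http.Client{Timeout: 2 * time.Second}}
}

// Notify is best-effort: any failure is logged and dropped. It reports
// whether the POST was accepted.
func (n *Notifier) Notify(msg string) bool {
	body, err := payload(msg)
	if err != nil {
		log.Printf("alert: encode failed, dropping: %v", err)
		return false
	}
	resp, err := n.Client.Post(n.URL, "application/json", bytes.NewReader(body))
	if err != nil {
		log.Printf("alert: post failed, dropping: %v", err)
		return false
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		log.Printf("alert: unexpected status %s, dropping", resp.Status)
		return false
	}
	return true
}

func main() {
	// Local stand-in for the Slack endpoint.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		fmt.Println("received:", string(b))
	}))
	defer srv.Close()
	fmt.Println("delivered:", newNotifier(srv.URL).Notify("diverged at abc123"))
}
```

Callers that must not wait even for the timeout can invoke Notify in its own goroutine; the method has no result they need.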
Failure modes
| Failure | Detection | Behavior / recovery |
|---|---|---|
| Pi offline / rebooted | — | Startup Sync catches up; webhooks missed while down are subsumed by the fetch. |
| Process killed mid-incorporation, after commit, before DB completion | collab.Recover on next boot | Recovery reconciles the DB against git log trailers — existing machinery. |
| Process killed after DB completion, before push | Startup Push (local branch ahead of origin) | The pending commit is pushed on next boot. Local commit was authoritative throughout. |
| Push fails (network) | git push error | State push-pending; background pusher retries; no state corruption. An alert fires if failures persist past fail_threshold. |
| Rebase conflict (same document edited both sides) | git rebase exit status | State diverged; surfaced in UI + logs; alert fires immediately; human resolves on the Pi. |
| Agent job crash | SweepIncompleteJobs on boot | Existing — incomplete jobs swept before Recover. |
| Agent runtime broken (expired claude login, missing binary, API down) | 3 consecutive job failures | Jobs surface as failed in the UI; an alert fires so the operator can re-authenticate or fix the runtime. |
| collab DB corruption / SD-card loss | DB open / integrity error | Restore from the latest off-device backup. |
| Bad / missing webhook signature | HMAC verify | 401, logged, no sync. Repeated failures indicate a secret mismatch. |
| Deploy-key auth failure | fetch / push error | Sync stalls, logged; sync-status reflects LastError; an alert fires after fail_threshold. |
Security
- The webhook is the single unauthenticated route: constant-time HMAC compare, reject a missing signature, cap the body size. It only ever triggers a fetch — it carries no user-controlled path or command.
- dev_mode must be off; the config validator already refuses dev_mode together with an https public_base_url, and the production URL is https, so dev mode cannot be enabled by accident.
- The process runs as the unprivileged karn account, never root, so a compromise is contained to that user's footprint.
- The Pi holds four secrets: the OAuth client secret, the GitHub webhook secret, the SSH deploy key, and the Slack webhook URL (plus an ANTHROPIC_API_KEY only if the API-key route is chosen for Agent auth). Each is a separate file, mode 0600, owned by karn, none in git; the config references them by path, so the parsed config never holds secret material.
- The deploy key has write access to the whole monorepo: that is the blast radius if the Pi is compromised. Acceptable under a trunk-based policy where master is writable anyway; noted so the trade-off is explicit.
- Anyone with UI access (Daniel, Max) can trigger Agent jobs and therefore Claude usage: subscription quota or, on the API-key route, API spend. Acceptable for two trusted users; noted.
Continuous-delivery seams
CD of the wiki-browser binary is a later sub-project. This design does not build it, but three additive seams keep it from fighting the architecture later:
- Sync returns a result, not void. SyncResult carries {OldHead, NewHead, ChangedPaths}. Useful now for logging; later it is how CD answers "did wiki-browser/ source change in this push?" without re-querying git.
- Status includes the current HEAD sha. The same /api/sync-status surface that shows synced/diverged is the thing an external CD watcher reads to detect "new code landed."
- Invariant: the wiki-browser process is the sole mutator of the clone's git state. CD, whenever it is built, observes (reads sync-status, reacts) and never runs its own fetch/pull/commit/push on /srv/orcha. Violating this re-introduces the Approach B two-writer race. CD therefore lives as a separate systemd unit, not inside the binary.
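How a future CD step might use the first seam is sketched below. The touches function is hypothetical, and it assumes the paths CD cares about survive the ext/exclude filter applied to ChangedPaths; if they do not, CD would need an unfiltered list.

```go
package main

import (
	"fmt"
	"strings"
)

// SyncResult repeats the fields this check needs.
type SyncResult struct {
	OldHead, NewHead string
	ChangedPaths     []string
}

// touches reports whether the sync moved HEAD and changed any path under
// prefix, for example "wiki-browser/".
func touches(res SyncResult, prefix string) bool {
	if res.OldHead == res.NewHead {
		return false // no-op sync, e.g. the echo of the Pi's own push
	}
	for _, p := range res.ChangedPaths {
		if strings.HasPrefix(p, prefix) {
			return true
		}
	}
	return false
}

func main() {
	res := SyncResult{
		OldHead:      "abc123",
		NewHead:      "def456",
		ChangedPaths: []string{"docs/plan.md", "wiki-browser/internal/gitsync/repo.go"},
	}
	fmt.Println("redeploy candidate:", touches(res, "wiki-browser/"))
}
```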
The hard part of CD — surviving a restart at any instant — is already delivered by this design: a CD restart is just another crash, and the startup sequence (collab.Recover + startup Push of pending commits + SweepIncompleteJobs) already makes every step recoverable. The "local commit authoritative, push replayable" property specifically covers a restart mid-incorporation.
One CD-era refinement is deliberately not done here: graceful-shutdown drain. main.go currently gives srv.Shutdown a 5 s budget, which can cut off a slow incorporation. Recovery still catches it, so this is polish, not a correctness gap — but the CD sub-project should revisit it so routine redeploys are clean rather than merely safe.
Resolved decisions
The draft's open questions were all resolved during review:
- Proxy buffering for the realtime stream: operator-managed. The NPM proxy host (Websockets Support, buffering) is configured by the operator as part of provisioning; no work for the implementation.
- Service user: the unprivileged karn account on the Pi.
- listen bind address: :8080 on the Pi; the public edge stays NPM-only.
- diverged recovery: resolved manually on the Pi; no in-UI retry/abort affordance. Discoverability is handled by active Slack alerting (see Alerting).
- walker.Rescan() cost: accepted as designed. A full re-walk per sync is fine at the current repo size; it will be revisited only if it becomes a measured problem.
References
- wiki-browser — original design
- Agent runtime — design (#3)
- Topic resolution & incorporation — design (#4)
- Realtime collaboration — design (#6)
- Batched incorporation — design (#9)
- internal/collab/gitops.go: the only place that runs git commit today; the push integrates here.
- internal/collab/incorporate.go: the incorporation protocol wrapped by gitsync.Incorporate.
- cmd/wiki-browser/main.go: the startup sequence the catch-up step slots into.
- deploy/wiki-browser.service: the systemd unit extended during provisioning.