Draft

Pi deployment & git-sync — Design

2026-05-21 · Daniel · wiki-browser · sub-project #10

Problem

wiki-browser runs fine on a developer laptop against a local checkout, but the collaboration loop it was built for — view a document, open Topics, discuss, incorporate the resolution — only pays off when both collaborators point at the same always-on instance. Today there is no such instance.

The two of us (Daniel, Max) work trunk-based: documents are committed straight to master of getorcha/orcha and pushed to GitHub, no pull requests. The intended loop is: a document lands on master → the wiki serves it → we discuss Topics on it in the UI → the Agent rewrites the Source → the resolution is committed and pushed back to master → both of us git pull the updated copy. For that loop to close, a deployed wiki-browser must do two things it cannot do today:

Pull to serve. When a commit lands on master, the deployed instance must fetch it and serve the new content. Today the server only sees a filesystem; it has no notion of a remote.
Push what it incorporates. Incorporation already writes the rewritten Source and runs git commit with Topic:/Proposal: trailers (internal/collab/gitops.go) — but it never pushes. The commit sits on the deployed clone, invisible to everyone else. There is no fetch, pull, or push anywhere in the codebase.

The target host is a Raspberry Pi 5 on the home LAN, already reachable from outside through an existing Nginx Proxy Manager that terminates TLS and routes a domain to a host:port. So this spec is two things at once: an operational runbook for standing the Pi up, and the design of a new git-sync engine that turns the local-only server into a participant on a shared master.

A related want — continuously redeploying the wiki-browser binary itself when its source changes — is explicitly a later sub-project. It is addressed here only to the extent of leaving clean seams (see Design § CD seams).

Goals & non-goals

Goals

A deployed wiki-browser on the Pi serves the live master of getorcha/orcha; a teammate's push is reflected within seconds.
Incorporation commits are pushed to master automatically; both collaborators get them with a plain git pull.
The process is crash- and restart-safe at every step: killing it at any instant leaves recoverable, consistent state — no lost incorporation, no orphaned commit, no DB/git divergence.
Reproducible provisioning (OS deps, secrets, systemd, proxy host, GitHub webhook) and a simple manual binary-update procedure.
Leave clean, documented seams for a future continuous-delivery sub-project — without building it now.

Non-goals

Continuous delivery of the wiki-browser binary. Its own sub-project. This spec only avoids foreclosing it.
A PR / branch workflow for documents. The collaboration policy is trunk-based; incorporations go straight to master.
Polling as the primary sync trigger. A GitHub webhook is primary. A poll exists only as an opt-in, disabled-by-default safety net.
High availability / failover. One Pi. Occasional downtime is acceptable; the design degrades gracefully and self-heals on restart rather than avoiding downtime.
Automatic resolution of genuine content conflicts. If the Pi and a teammate edit the same document in the same instant, a human resolves it. The design makes this rare and makes it loud, not invisible.
Multi-repo or multi-instance support. One clone, one wiki, one collab DB.

Approach

The new behavior is a git-sync engine: pull-to-serve on a webhook, push-after-incorporate, and a startup catch-up. The question is where it lives.

Sync-engine placement. The webhook choice and the existing in-binary commit path decide it.
Approach	New Go code	Webhook-native	Race-safe	Verdict
A. In-binary engine + one shared git lock	Moderate	Yes	Yes	Recommended
B. External pull (cron / systemd timer) + in-binary push	Low	No	No	Rejected
C. Bare mirror repo + separate served working tree	High	Partial	Yes	Rejected

B is ruled out by the webhook: a GitHub webhook is an HTTP request, and it lands in the wiki-browser process — so the pull must be in-binary regardless. Adding a second process that also mutates the same clone reintroduces exactly the race the lock exists to prevent: an external git merge interleaving with the in-binary git add/git commit. C separates "receive" from "serve" cleanly but is far too much machinery for two users and one host, and the wiki-browser already commits straight into the served tree — C would mean re-plumbing that.

A is chosen. The webhook already arrives in the Go process; incorporation, committing, and startup recovery already live there. A new internal/gitsync package adds one mutex that every git mutation acquires, and the whole system stays single-process, single-writer, single-lock.

Pros — Approach A

One process, one lock — pulls and incorporation commits cannot interleave by construction.
Composes with the webhook (already in-process) and with the existing commit / recover code.
Sync status is in-process, so it can be surfaced in the UI and read by a future CD watcher.

Cons — Approach A

New Go code in the binary (a package plus an incorporation seam).
A large fetch can briefly block an incorporation behind the lock — acceptable; fetches are small and infrequent.

Design

Topology & Pi layout

The public edge is Nginx Proxy Manager. wiki-browser binds a localhost / LAN port only. Git traffic to GitHub is outbound SSH from the wiki-browser process; the inbound webhook rides the same HTTPS path as browsers.

On-disk layout. The config lives inside the clone, so the .claude/skills/ the Agent runtime requires (wb-incorporate, wb-perspective — validated at startup by validateAgentRuntimeRoot) arrive and stay current with every pull. Databases live outside the clone on a stable path:

filesystem · raspberry pi
/srv/orcha/                         # clone of getorcha/orcha; this IS cfg.root
  wiki-browser/
    .claude/skills/{wb-incorporate,wb-perspective}/   # ship via git
    wiki-browser.yaml               # config; gitignored; agent-runtime root
/srv/wiki-browser/
  bin/{wiki-browser,wb-agent}       # arm64 binaries
  data/{collab.db,index.db}         # absolute paths in config
  secrets/{google-client-secret,github-webhook-secret,slack-webhook-url}
  agent-logs/

The collab DB holds every Topic, proposal, discussion message and session — none of it is in git. It is the one irreplaceable artifact on the Pi and is treated accordingly (see Operations). The index DB is disposable: it is rebuilt from the repo.

The git-sync engine (`internal/gitsync`)

A new package wrapping the clone. It owns one mutex that every git mutation acquires — webhook fetch, incorporation commit + push, startup catch-up, background push. Reads that do not mutate (the existing git log recovery scan, content hashing) are unaffected. Sketch of the surface:

go · internal/gitsync
type SyncResult struct {
    OldHead, NewHead string
    ChangedPaths     []string   // repo-relative, ext/exclude-filtered
    Rebased          bool
}

type State string   // "synced" | "syncing" | "push-pending" | "diverged"

type Status struct {
    State      State
    Head       string   // current HEAD sha — also the CD watcher's signal
    Ahead      int      // local commits not yet on origin
    LastSyncAt time.Time
    LastError  string
}

func New(cfg Config) (*Repo, error)                           // validates: git repo, on branch, remote present
func (r *Repo) Sync(ctx context.Context) (SyncResult, error)  // lock → reconcile
func (r *Repo) Push(ctx context.Context) error                // lock → reconcile → git push
func (r *Repo) Incorporate(ctx context.Context, fn func() (string, error)) (string, error) // lock → reconcile → fn → git push
func (r *Repo) Status() Status

The shared primitive is reconcile, which is never exported and always runs under the lock: git fetch <remote> <branch>, then git merge --ff-only <remote>/<branch>; if the local branch has diverged (it carries unpushed incorporation commits and origin has also moved), fall back to git rebase <remote>/<branch>. Every public operation is built on it:

Sync = reconcile. Used by the webhook and by startup catch-up.
Push = reconcile then git push. Used by the background pusher.
Incorporate = reconcile, then run fn (the existing collab.Incorporate), then git push — all under one lock hold.
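The fallback order inside reconcile can be read as a small decision table. The sketch below is illustrative only: the `action` type and the `decide` helper are hypothetical names, and it assumes the engine derives ahead/behind counts after the fetch (for example from `git rev-list --left-right --count HEAD...origin/master`).

```go
package main

import "fmt"

// action is the git step reconcile would take after a fetch.
// The type and its values are hypothetical, for illustration only.
type action string

const (
	actNone        action = "none"         // local and origin are equal
	actKeepLocal   action = "keep-local"   // only unpushed local commits; nothing to pull
	actFastForward action = "fast-forward" // git merge --ff-only <remote>/<branch>
	actRebase      action = "rebase"       // git rebase <remote>/<branch>
)

// decide maps post-fetch counts to an action.
// ahead = local commits not on origin; behind = origin commits not local.
func decide(ahead, behind int) action {
	switch {
	case ahead == 0 && behind == 0:
		return actNone
	case behind == 0:
		return actKeepLocal
	case ahead == 0:
		return actFastForward
	default:
		return actRebase // both sides moved: the diverged case
	}
}

func main() {
	fmt.Println(decide(0, 2)) // fast-forward
	fmt.Println(decide(1, 2)) // rebase
}
```

Keeping the decision separate from the git invocation would let the ordering be unit-tested without a real clone.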

After any reconcile that changed files, the engine refreshes the served view deterministically rather than relying on filesystem-watch timing: it calls a new walker.Rescan() (re-runs the existing scan(), atomically swapping the file-set map), then for each path in ChangedPaths calls index.Reindex if the walker still has it or index.Remove otherwise. ChangedPaths comes from git diff --name-only OldHead NewHead. The fsnotify watcher stays in place for local edits and as a backstop, but correctness of a git-driven update does not depend on it.
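A minimal sketch of that refresh step follows. The `walker` and `indexer` interfaces and the `Has` lookup are assumptions standing in for the real packages; only `Rescan`, `Reindex`, and `Remove` are named in this design. The in-memory fakes exist solely so the sketch runs on its own.

```go
package main

import "fmt"

// walker and indexer are hypothetical stand-ins for the real walker and
// index packages; only the calls named in the design are modeled.
type walker interface {
	Rescan() error        // re-run the scan and swap the file-set map
	Has(path string) bool // assumed lookup: is the path still served?
}

type indexer interface {
	Reindex(path string) error
	Remove(path string) error
}

// refresh updates the served view after a reconcile that changed files.
func refresh(w walker, idx indexer, changedPaths []string) error {
	if err := w.Rescan(); err != nil {
		return fmt.Errorf("rescan: %w", err)
	}
	for _, p := range changedPaths {
		var err error
		if w.Has(p) {
			err = idx.Reindex(p)
		} else {
			err = idx.Remove(p) // deleted or no longer matched by the filters
		}
		if err != nil {
			return fmt.Errorf("refresh %s: %w", p, err)
		}
	}
	return nil
}

// In-memory fakes so the sketch is self-contained.
type fakeWalker map[string]bool

func (w fakeWalker) Rescan() error        { return nil }
func (w fakeWalker) Has(path string) bool { return w[path] }

type fakeIndex struct{ reindexed, removed []string }

func (i *fakeIndex) Reindex(p string) error { i.reindexed = append(i.reindexed, p); return nil }
func (i *fakeIndex) Remove(p string) error  { i.removed = append(i.removed, p); return nil }

func main() {
	idx := &fakeIndex{}
	err := refresh(fakeWalker{"docs/a.md": true}, idx, []string{"docs/a.md", "docs/gone.md"})
	fmt.Println(idx.reindexed, idx.removed, err) // [docs/a.md] [docs/gone.md] <nil>
}
```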

Webhook endpoint

POST /api/webhook/github is a public route — mounted alongside /auth/*, outside the OAuth session middleware, since GitHub cannot authenticate. It is protected by HMAC instead:

Read the body with a size cap (e.g. 5 MiB). Compute HMAC-SHA256 over the raw body, keyed with the secret, and compare the hex digest to the X-Hub-Signature-256 header (sent as sha256=<hex digest>) using a constant-time compare. A missing or mismatched signature → 401, logged. The secret is read from the file at git.webhook_secret_file.
Act only on push events whose ref is refs/heads/<branch>. Everything else → 204, ignored.
Respond 204 immediately; run the Sync asynchronously so GitHub's delivery never waits on git I/O.
Coalesce. If a sync is already running, set a single "re-run pending" flag rather than queueing N goroutines. Bursts collapse to at most one follow-up sync.
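The signature check in the first step could look roughly like this. It is a sketch using only the Go standard library; the function names `validSignature` and `sign` are hypothetical, and `sign` exists only to produce a header for the demo.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// validSignature reports whether header (the X-Hub-Signature-256 value,
// formatted "sha256=<hex digest>") matches the HMAC-SHA256 of body.
func validSignature(secret, body []byte, header string) bool {
	const prefix = "sha256="
	if !strings.HasPrefix(header, prefix) {
		return false // missing or malformed header
	}
	got, err := hex.DecodeString(strings.TrimPrefix(header, prefix))
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil)) // constant-time compare
}

// sign builds the header value a sender would attach; demo helper only.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret")
	body := []byte(`{"ref":"refs/heads/master"}`)
	fmt.Println(validSignature(secret, body, sign(secret, body))) // true
	fmt.Println(validSignature(secret, body, "sha256=deadbeef"))  // false
}
```

The handler would call this on the raw, size-capped body before parsing any JSON, and answer 401 when it returns false.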

Note

The webhook also fires for the Pi's own incorporation pushes. That is harmless: the follow-up Sync fetches, finds origin/<branch> already equal to local HEAD, and the fast-forward is a no-op. No de-duplication needed.
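The coalescing rule from the steps above (one running sync plus at most one pending re-run) fits in a two-flag state machine. The `coalescer` type and its method names below are hypothetical; this is one way to do it, not a description of existing code.

```go
package main

import (
	"fmt"
	"sync"
)

// coalescer collapses bursts of sync requests into at most one follow-up run.
type coalescer struct {
	mu      sync.Mutex
	running bool
	pending bool
}

// request records a sync request. It returns true when the caller should
// start a run, and false when a run is active and a re-run was flagged.
func (c *coalescer) request() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.running {
		c.pending = true
		return false
	}
	c.running = true
	return true
}

// done is called when a run finishes. It returns true when a re-run is
// pending, in which case the runner loops instead of exiting.
func (c *coalescer) done() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.pending {
		c.pending = false
		return true
	}
	c.running = false
	return false
}

// trigger is what the webhook handler would call after responding 204.
func (c *coalescer) trigger(run func()) {
	if !c.request() {
		return
	}
	go func() {
		for {
			run()
			if !c.done() {
				return
			}
		}
	}()
}

func main() {
	var c coalescer
	fmt.Println(c.request(), c.request(), c.request()) // true false false
	fmt.Println(c.done(), c.done())                    // true false
}
```

Because the Pi's own pushes also produce a webhook, the follow-up run after an incorporation is expected and cheap: it fetches and finds nothing to do.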

Incorporation: push integration

Incorporation today (internal/collab/incorporate.go) loads the proposal, stale-checks base_source_sha against the Source on disk, writes the recovery marker, commits via CommitSourceRewrite, then completes the DB transaction. The change is to run that unchanged sequence inside the engine's lock, with a reconcile before and a push after. The incorporate HTTP handler calls:

go · incorporate handler
sha, err := gitSync.Incorporate(ctx, func() (string, error) {
    return collab.Incorporate(store, in)   // existing call, unchanged
})

Ordering inside one lock hold: (1) reconcile: the working tree now contains everything on origin/<branch>, with any unpushed local commits rebased on top. (2) collab.Incorporate runs; its stale-check now compares the proposal's base_source_sha against the freshly pulled Source, so an upstream edit to the same document cleanly fails the check as ErrStaleProposal, and the user regenerates. (3) git push.

The push is the last, best-effort step. The commit and the DB transaction are already done and mutually consistent before the push is attempted — so a push failure does not fail the incorporation. The handler reports success; the engine sets state push-pending; the background pusher retries. The local commit is authoritative, the push is replication. This is the property that makes a restart mid-incorporation safe (see CD seams). Batched incorporation (#9) flows through the identical seam — it is the same collab.Incorporate call with ChildTopicIDs populated. Perspective regeneration commits nothing and never pushes.
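The "commit is authoritative, push is replication" ordering can be sketched as follows. The `repo` struct here is a hypothetical stand-in for gitsync.Repo with the git steps injected as functions, so the control flow is visible without running git. Treating a failed reconcile as aborting the incorporation is an assumption; the design text does not specify that case.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNetwork = errors.New("network down") // demo error only

// repo is a hypothetical stand-in for gitsync.Repo.
type repo struct {
	mu        sync.Mutex
	reconcile func() error
	push      func() error
	state     string // "synced" | "push-pending"
}

// incorporate runs reconcile, fn, and push under one lock hold.
// A push failure is recorded but does not fail the incorporation.
func (r *repo) incorporate(fn func() (string, error)) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	// Assumption: a failed reconcile aborts before anything is committed.
	if err := r.reconcile(); err != nil {
		return "", fmt.Errorf("reconcile: %w", err)
	}
	sha, err := fn() // commit + DB transaction; authoritative once it returns
	if err != nil {
		return "", err
	}
	if err := r.push(); err != nil {
		r.state = "push-pending" // the background pusher retries later
		return sha, nil
	}
	r.state = "synced"
	return sha, nil
}

func main() {
	r := &repo{
		reconcile: func() error { return nil },
		push:      func() error { return errNetwork },
	}
	sha, err := r.incorporate(func() (string, error) { return "abc123", nil })
	fmt.Println(sha, err, r.state) // abc123 <nil> push-pending
}
```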

Conflict & drift handling

The reconcile-before-commit ordering shrinks the push-rejection window to milliseconds; the residual genuine-conflict case is made loud, not silent.
Situation	Resolution
Upstream edits a document that has an open Topic / generated proposal	The proposal's `base_source_sha` no longer matches the pulled Source → existing `ErrStaleProposal` / freshness machinery fires → the user regenerates the proposal. No new code.
Push rejected — someone pushed in the window between our fetch and our push	`reconcile` rebases the local incorporation commit onto the new `origin/<branch>`; retry the push. Bounded attempts (e.g. 3). An incorporation commit is a single-file pathspec commit, so the rebase is clean unless the same document moved upstream.
Genuine rebase conflict — the Pi and a teammate edited the same document concurrently	`git rebase --abort`; set state `diverged`; log loudly; surface the UI banner and fire an alert (see Alerting). The incorporation already succeeded locally and the DB is consistent — only the push is blocked. A human resolves the merge on the Pi. The background pusher skips while `diverged` so it does not thrash.
Network down — fetch or push fails	Transient. `Sync` errors are non-fatal and logged; the background pusher and the next webhook retry. Local state stays consistent throughout.
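The push-rejection and rebase-conflict rows combine into one bounded retry loop. The sketch below uses hypothetical sentinel errors (`errRejected`, `errConflict`) and injected functions in place of real git calls; how the engine actually classifies git's exit status is not specified here.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors; the real engine would derive these from
// the outcome of git push and git rebase.
var (
	errRejected = errors.New("push rejected: remote moved")
	errConflict = errors.New("rebase conflict")
)

const maxPushAttempts = 3 // the "bounded attempts" example from the table

// pushWithRetry pushes and, on a rejection, reconciles (fetch + rebase)
// and tries again. It returns the resulting state name from the design.
func pushWithRetry(push, reconcile func() error) (string, error) {
	for attempt := 1; attempt <= maxPushAttempts; attempt++ {
		err := push()
		if err == nil {
			return "synced", nil
		}
		if !errors.Is(err, errRejected) {
			return "push-pending", err // transient, e.g. network down
		}
		if rerr := reconcile(); rerr != nil {
			if errors.Is(rerr, errConflict) {
				return "diverged", rerr // caller aborts the rebase and alerts
			}
			return "push-pending", rerr
		}
	}
	return "push-pending", errRejected // attempts exhausted
}

func main() {
	calls := 0
	push := func() error {
		calls++
		if calls == 1 {
			return errRejected
		}
		return nil
	}
	state, err := pushWithRetry(push, func() error { return nil })
	fmt.Println(state, err, calls) // synced <nil> 2
}
```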

Warning

Before deployment, the watched tree only changed via incorporation or a local editor. After deployment, arbitrary upstream pushes rewrite watched documents far more often. That exercises the existing Topic-anchor / freshness / recover paths (character-offset anchors drifting under an upstream rewrite) harder than they have been. This is existing behavior, not new design — but it is the most likely place for a latent bug to surface, and the implementation plan must include explicit tests for "open Topic on a document that is rewritten by an upstream pull."

Startup sequence

Catch-up runs early in run() — after config.Load, before walker.New — so the initial filesystem scan and collab.Recover both see the latest tree:

config.Load.
gitsync.New — validate the clone (is a git repo, on the configured branch, remote present); fail loudly with a clear message if not (provisioning did the initial git clone).
Sync — fetch + fast-forward. This is what self-heals webhooks missed while the Pi was offline.
Push — flush any incorporation commit that a previous run committed but did not push.
walker.New, index.Open, collab.Open, RevokeSessions, SweepIncompleteJobs, collab.Recover — unchanged, now operating on the up-to-date tree.

Decision

Steps 3–4 are non-fatal. If the Pi boots with no network, gitsync.New still succeeds (the clone is valid), the Sync/Push errors are logged, and the server starts and serves whatever the clone currently holds. The next webhook or the background pusher reconciles once the network returns. Contrast collab.Recover, which stays fatal — a corrupt collab DB is not something to serve through.
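The fatal versus non-fatal split can be made explicit in code. The following is a sketch under assumptions: the `step` struct and `runStartup` are hypothetical, and the real run() may simply inline these calls.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one entry in the startup sequence. The names mirror the list
// above; the struct itself is a hypothetical illustration.
type step struct {
	name  string
	fatal bool
	run   func() error
}

// runStartup executes steps in order. A failing fatal step aborts startup;
// a failing non-fatal step is collected as a warning and startup continues.
func runStartup(steps []step) (warnings []string, err error) {
	for _, s := range steps {
		if e := s.run(); e != nil {
			if s.fatal {
				return warnings, fmt.Errorf("%s: %w", s.name, e)
			}
			warnings = append(warnings, s.name+": "+e.Error())
		}
	}
	return warnings, nil
}

var errNoNetwork = errors.New("no network") // demo error only

func ok() error      { return nil }
func offline() error { return errNoNetwork }

func main() {
	// Booting with no network: Sync and Push fail, the server still starts.
	warnings, err := runStartup([]step{
		{"config.Load", true, ok},
		{"gitsync.New", true, ok},
		{"Sync", false, offline},
		{"Push", false, offline},
		{"collab.Recover", true, ok},
	})
	fmt.Println(len(warnings), err) // 2 <nil>
}
```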

Configuration

Two new top-level blocks. git: configures the sync engine; alert: configures the Slack notifier. The commit-author identity is the existing agent.author_name/author_email; SSH transport for fetch/push uses the service user's deploy key and needs no config entry.

yaml · wiki-browser.yaml
git:
  remote: "origin"              # default
  branch: "master"              # default
  webhook_secret_file: "/srv/wiki-browser/secrets/github-webhook-secret"
  poll_interval: "0"             # 0 = webhook-only. Set e.g. "10m" to
                               # enable a safety-net poll.
alert:
  slack_webhook_url_file: "/srv/wiki-browser/secrets/slack-webhook-url"
  fail_threshold: "15m"     # alert if sync/push fails continuously this long

Note

Both new secrets follow the existing google_client_secret_file convention — the config holds a path, not the value. The parsed config struct then carries only paths, never secret material, so a stray slog of the config or a config-wrapped error cannot leak a secret; the config file itself stays non-sensitive; and each secret rotates independently. Config load validates each path with an os.Stat existence check, exactly as it already does for google_client_secret_file.

Note

Webhook-only has one gap: a webhook missed while the Pi is online (a GitHub delivery failure, a momentary proxy hiccup) leaves the Pi stale until the next push. Offline misses self-heal at startup (step 3 above). poll_interval closes the online gap and is disabled by default — the safety net is one config line away without changing the primary mechanism.
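As an illustration of how the git: block might be validated at load time, here is a sketch. The `GitConfig` struct, its field names, and the `pollInterval` helper are assumptions; the real config struct and its YAML tags may differ. It relies on time.ParseDuration, which accepts "0" as a zero duration.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// GitConfig mirrors the git: block. Field names are assumptions.
type GitConfig struct {
	Remote            string
	Branch            string
	WebhookSecretFile string
	PollInterval      string // "0" disables the safety-net poll
}

// pollInterval parses the configured interval; zero means webhook-only.
func pollInterval(s string) (time.Duration, error) {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("git.poll_interval: %w", err)
	}
	if d < 0 {
		return 0, fmt.Errorf("git.poll_interval: must not be negative")
	}
	return d, nil
}

// Validate applies the documented defaults and checks that the secret
// file exists. It never reads the secret, so the struct holds only a path.
func (c *GitConfig) Validate() error {
	if c.Remote == "" {
		c.Remote = "origin"
	}
	if c.Branch == "" {
		c.Branch = "master"
	}
	if c.PollInterval == "" {
		c.PollInterval = "0"
	}
	if _, err := pollInterval(c.PollInterval); err != nil {
		return err
	}
	if _, err := os.Stat(c.WebhookSecretFile); err != nil {
		return fmt.Errorf("git.webhook_secret_file: %w", err)
	}
	return nil
}

func main() {
	d, _ := pollInterval("10m")
	fmt.Println(d) // 10m0s

	cfg := GitConfig{WebhookSecretFile: "/nonexistent/github-webhook-secret"}
	fmt.Println(cfg.Validate() != nil, cfg.Remote, cfg.Branch) // true origin master
}
```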

Provisioning checklist

OS & deps: 64-bit Raspberry Pi OS; git; Node.js LTS plus the claude CLI (arm64 — supported); the cross-compiled wiki-browser and wb-agent binaries from make build-arm64.
Service user: the service runs as the unprivileged karn account (not root). The clone, /srv/wiki-browser, the SSH deploy key, and wiki-browser.yaml are owned by it.
Clone: git clone git@github.com:getorcha/orcha.git /srv/orcha, checked out on master.
GitHub deploy key: as the karn user — generate an SSH key, add its public half to the getorcha/orcha repo as a deploy key with write access, and run ssh-keyscan github.com >> ~/.ssh/known_hosts. Doing every step as karn is what lands the key and known_hosts in its home, where non-interactive fetch/push looks for them.
Agent auth: the claude CLI must be authenticated as the karn user. Default: a one-time interactive subscription login — credentials persist in karn's home, no per-token API billing. Alternative: export ANTHROPIC_API_KEY via the systemd EnvironmentFile. wiki-browser itself never reads the key — it only spawns claude, which inherits the process environment. An expired subscription login is caught by the sustained-Agent-failure alert.
Secrets: three files under /srv/wiki-browser/secrets/ — the Google OAuth client secret, the GitHub webhook HMAC secret, the Slack webhook URL — each mode 0600, owned by karn, none in git. The config references them by path only.
Config: wiki-browser.yaml at /srv/orcha/wiki-browser/ — dev_mode off; listen: ":8080"; public_base_url: https://wiki.<domain>; absolute index_db / collab_db paths under /srv/wiki-browser/data/; wb_agent_bin pointing at the deployed wb-agent.
systemd unit: extend deploy/wiki-browser.service with User=karn, WorkingDirectory=, EnvironmentFile= (a PATH that includes node and claude; plus ANTHROPIC_API_KEY only if the API-key route is chosen), and Wants=/After=network-online.target — startup catch-up needs the network.
Nginx Proxy Manager: a proxy host wiki.<domain> → the Pi's wiki-browser port; enable Websockets Support and disable proxy buffering so realtime events are delivered as they are emitted rather than held in the proxy.
GitHub webhook: repo Settings → Webhooks → payload URL https://wiki.<domain>/api/webhook/github, content-type JSON, the HMAC secret, "Just the push event".
Google OAuth: register https://wiki.<domain>/auth/callback as an authorized redirect URI on the OAuth client.

Operations

Binary updates (manual, until CD): make build-arm64 → copy both binaries to /srv/wiki-browser/bin/ → systemctl restart wiki-browser. Templates and static assets are embedded (embed.FS), so a restart is mandatory — there is no hot reload. A make deploy target wraps the copy + restart.
Backups: a systemd timer runs sqlite3 collab.db ".backup <snapshot-file>" on a schedule and the snapshot is written to off-device storage. The collab DB is the only irreplaceable artifact; the index DB is rebuilt from the repo and is not backed up.
Observability: gitsync logs every sync (old→new HEAD, changed-path count, rebased?) and every push outcome. GET /api/sync-status (session-gated) returns Status as JSON. The UI chrome shows a banner when state is diverged, since that state needs a human; the engine broadcasts state transitions over the existing realtime hub so the banner updates live. The active notification path is covered separately — see Alerting.

Alerting

A UI banner and log lines are passive — they assume someone is looking. A diverged state, a silently broken deploy key, or an Agent runtime that has stopped working all have to reach the operator. A small general-purpose notifier — internal/alert — POSTs a message to a Slack incoming webhook ({"text": ...}), the URL read from the file at alert.slack_webhook_url_file. Both the gitsync engine and the Agent service hold a reference to it.

Three conditions raise an alert:

diverged — immediately, on the transition into the state. A genuine same-document conflict always needs a human. The message includes a deep link to the affected document — public_base_url + /doc/<source-path>, the path taken from the conflicting incorporation commit — so the operator opens it straight from the notification.
Sustained sync/push failure — on a timer, not a transition. The engine stamps a failingSince time on the first failure. The background pusher's retry tick evaluates it; once now − failingSince exceeds alert.fail_threshold (default 15 min) it emits the alert exactly once. A state that simply stays push-pending is not a transition, so this check has to be time-driven rather than transition-driven. It catches a network outage, a revoked deploy key, or GitHub being unreachable.
Sustained Agent-job failure. The Agent service counts consecutive job failures; three in a row raises an alert. A single failed job is normal and already visible in the UI — three in a row means the runtime itself is broken: an expired claude login, a missing binary, or the API down.

Each alert carries the condition, the current HEAD, and the last error. When a condition clears — the rebase is resolved, the network returns, an Agent job succeeds — a single "recovered" message fires and the relevant counter resets. Alerts are edge-triggered: a long outage produces one alert and one recovery, never a stream.
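The time-driven, edge-triggered behavior described for sustained failures can be sketched as a small tracker. All names here (`failureTracker`, `observe`, the `event` values, `demoStart`) are hypothetical. The sketch also assumes a "recovered" message is sent only if an alert was actually raised for that outage.

```go
package main

import (
	"fmt"
	"time"
)

const defaultThreshold = 15 * time.Minute // alert.fail_threshold default

// demoStart is an arbitrary fixed instant used by the demo.
var demoStart = time.Date(2026, 5, 21, 12, 0, 0, 0, time.UTC)

// event is what the tracker asks the notifier to send, if anything.
type event string

const (
	evNone      event = ""
	evAlert     event = "alert"
	evRecovered event = "recovered"
)

// failureTracker turns success/failure observations into edge-triggered
// events: one alert per outage, one recovery after it.
type failureTracker struct {
	threshold    time.Duration
	failing      bool
	failingSince time.Time
	alerted      bool
}

// observe is called on each retry tick with the latest outcome.
func (t *failureTracker) observe(now time.Time, failed bool) event {
	if !failed {
		wasAlerted := t.alerted
		t.failing, t.alerted = false, false
		if wasAlerted {
			return evRecovered // assumption: only after a raised alert
		}
		return evNone
	}
	if !t.failing {
		t.failing, t.failingSince = true, now // stamp the first failure
	}
	if !t.alerted && now.Sub(t.failingSince) > t.threshold {
		t.alerted = true
		return evAlert
	}
	return evNone
}

func main() {
	t := &failureTracker{threshold: defaultThreshold}
	for _, m := range []int{0, 10, 16, 30} {
		fmt.Printf("%q ", t.observe(demoStart.Add(time.Duration(m)*time.Minute), true))
	}
	fmt.Printf("%q\n", t.observe(demoStart.Add(31*time.Minute), false))
	// "" "" "alert" "" "recovered"
}
```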

Note

The notifier is best-effort and must never block or fail the operation that triggered it — a failed POST is logged and dropped. If alerting itself is down, the UI banner and logs remain as the passive fallback.
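A best-effort notifier along these lines might look as follows. The function names, the message wording, and the 5-second timeout are assumptions; only the {"text": ...} body shape comes from this design.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// payload builds the Slack incoming-webhook body: {"text": ...}.
// The message wording is illustrative.
func payload(condition, head, lastErr string) ([]byte, error) {
	text := fmt.Sprintf("wiki-browser %s (HEAD %s): %s", condition, head, lastErr)
	return json.Marshal(map[string]string{"text": text})
}

// notify POSTs the alert and returns nothing to the caller:
// any failure is logged and dropped.
func notify(ctx context.Context, url, condition, head, lastErr string) {
	body, err := payload(condition, head, lastErr)
	if err != nil {
		log.Printf("alert: encode: %v", err)
		return
	}
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second) // assumed timeout
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		log.Printf("alert: build request: %v", err)
		return
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Printf("alert: post: %v", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		log.Printf("alert: slack returned %s", resp.Status)
	}
}

func main() {
	b, _ := payload("diverged", "abc123", "rebase conflict")
	fmt.Println(string(b))
	// {"text":"wiki-browser diverged (HEAD abc123): rebase conflict"}
}
```

Calling it as `go notify(...)` from the engine would keep the triggering operation from waiting on the POST.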

Failure modes

Failure	Detection	Behavior / recovery
Pi offline / rebooted	—	Startup `Sync` catches up; webhooks missed while down are subsumed by the fetch.
Process killed mid-incorporation, after commit, before DB completion	`collab.Recover` on next boot	Recovery reconciles the DB against `git log` trailers — existing machinery.
Process killed after DB completion, before push	Startup `Push` (local branch ahead of origin)	The pending commit is pushed on next boot. Local commit was authoritative throughout.
Push fails (network)	`git push` error	State `push-pending`; background pusher retries; no state corruption. An alert fires if failures persist past `fail_threshold`.
Rebase conflict (same document edited both sides)	`git rebase` exit status	State `diverged`; surfaced in UI + logs; alert fires immediately; human resolves on the Pi.
Agent job crash	`SweepIncompleteJobs` on boot	Existing — incomplete jobs swept before `Recover`.
Agent runtime broken (expired `claude` login, missing binary, API down)	3 consecutive job failures	Jobs surface as failed in the UI; an alert fires so the operator can re-authenticate or fix the runtime.
collab DB corruption / SD-card loss	DB open / integrity error	Restore from the latest off-device backup.
Bad / missing webhook signature	HMAC verify	`401`, logged, no sync. Repeated failures indicate a secret mismatch.
Deploy-key auth failure	fetch / push error	Sync stalls, logged; `sync-status` reflects `LastError`; an alert fires after `fail_threshold`.

Security

The webhook is the single unauthenticated route: constant-time HMAC compare, reject a missing signature, cap the body size. It only ever triggers a fetch — it carries no user-controlled path or command.
dev_mode must be off; the config validator already refuses dev_mode together with an https public_base_url, and the production URL is https — so dev mode cannot be enabled by accident.
The process runs as the unprivileged karn account, never root — a compromise is contained to that user's footprint.
The Pi holds four secrets — the OAuth client secret, the GitHub webhook secret, the SSH deploy key, and the Slack webhook URL (plus an ANTHROPIC_API_KEY only if the API-key route is chosen for Agent auth). Each is a separate file, mode 0600, owned by karn, none in git; the config references them by path, so the parsed config never holds secret material.
The deploy key has write access to the whole monorepo: that is the blast radius if the Pi is compromised. Acceptable under a trunk-based policy where master is writable anyway; noted so the trade-off is explicit.
Anyone with UI access (Daniel, Max) can trigger Agent jobs and therefore Claude usage — subscription quota or, on the API-key route, API spend. Acceptable for two trusted users; noted.

Continuous-delivery seams

CD of the wiki-browser binary is a later sub-project. This design does not build it, but three additive seams keep it from fighting the architecture later:

Sync returns a result, not void. SyncResult carries {OldHead, NewHead, ChangedPaths}. Useful now for logging; later it is how CD answers "did wiki-browser/ source change in this push?" without re-querying git.
Status includes the current HEAD sha. The same /api/sync-status surface that shows synced/diverged is the thing an external CD watcher reads to detect "new code landed."
Invariant — the wiki-browser process is the sole mutator of the clone's git state. CD, whenever it is built, observes (reads sync-status, reacts) and never runs its own fetch/pull/commit/push on /srv/orcha. Violating this re-introduces the Approach B two-writer race. CD therefore lives as a separate systemd unit, not inside the binary.

The hard part of CD — surviving a restart at any instant — is already delivered by this design: a CD restart is just another crash, and the startup sequence (collab.Recover + startup Push of pending commits + SweepIncompleteJobs) already makes every step recoverable. The "local commit authoritative, push replayable" property specifically covers a restart mid-incorporation.

Note

One CD-era refinement is deliberately not done here: graceful-shutdown drain. main.go currently gives srv.Shutdown a 5 s budget, which can cut off a slow incorporation. Recovery still catches it, so this is polish, not a correctness gap — but the CD sub-project should revisit it so routine redeploys are clean rather than merely safe.

Resolved decisions

The draft's open questions were all resolved during review:

Proxy buffering for the realtime stream — operator-managed. The NPM proxy host (Websockets Support, buffering) is configured by the operator as part of provisioning; no work for the implementation.
Service user — the unprivileged karn account on the Pi.
listen bind address — :8080 on the Pi; the public edge stays NPM-only.
diverged recovery — resolved manually on the Pi; no in-UI retry/abort affordance. Discoverability is handled by active Slack alerting (see Alerting).
walker.Rescan() cost — accepted as designed. A full re-walk per sync is fine at the current repo size; it will be revisited only if it becomes a measured problem.

References

wiki-browser — original design
Agent runtime — design (#3)
Topic resolution & incorporation — design (#4)
Realtime collaboration — design (#6)
Batched incorporation — design (#9)
internal/collab/gitops.go — the only place that runs git commit today; the push integrates here.
internal/collab/incorporate.go — the incorporation protocol wrapped by gitsync.Incorporate.
cmd/wiki-browser/main.go — the startup sequence the catch-up step slots into.
deploy/wiki-browser.service — the systemd unit extended during provisioning.
