Real-time collaboration mechanics — Design
Draft

Real-time collaboration mechanics — Design

2026-05-13Danielwiki-browser · sub-project #6

Problem

Sub-projects #1–#4 and #7 settle the data model for Topics, proposals, agent jobs, and authenticated collaborators. The UI today reads that data through REST and refreshes only on iframe navigation or after the calling user's own mutations. With two collaborators acting on the same Documents — one drafting Topics while the other reviews proposals — staleness is now visible in normal use: Max posts a reply, Daniel does not see it; Daniel discards a Topic, Max keeps clicking on a Resolve view that no longer maps to an open Topic; a proposal stales mid-review and the reviewer has no way to know until they refresh.

The domain model assigns this to sub-project #6 — Real-time collaboration mechanics. Its job is to make the second collaborator's view feel within ~1 s of the first's actions, surface presence ("Max is also reading this Document, currently on Topic X"), and codify how the existing data-layer races are reflected in the UI before the losing client's next click. The mechanics are cross-cutting: #5 (Perspectives) and #8 (UI integration) will both ride them.

References: domain model, decisions log.

Goals & non-goals

Goals

Non-goals

Approach

Three transport options were considered.

OptionEffortFit for #6's loadSurvives Nginx Proxy Manager
Keep polling, expand scopeLowMisses the ~1 s targetTrivially
WebSocket per sessionMediumOverkill; nothing needs duplex in v1Needs Upgrade wiring
Transport comparison — recommended row is SSE.

SSE is one HTTP request, ships in Go's stdlib via net/http + http.Flusher, and has a documented client (EventSource) with built-in reconnect semantics. For two collaborators viewing 1–2 Documents at a time, the server holds 2–4 long-lived connections per process — well below any meaningful limit on the Pi. The transport is thin enough that swapping to WebSocket later, if push-from-client ever becomes load-bearing, is a localized change inside internal/realtime.

Decision — transport

SSE behind a process-local pub/sub hub, one connection per source_path per signed-in client. Hub publishes are advisory; REST endpoints remain the authoritative source of every record.

Decision — stream scope

Per-Document only for v1. The client opens an EventSource(/api/stream?source_path=X) on Document load and closes it on navigation. Activity on Documents the user is not currently viewing surfaces only when they next visit them. Session-wide / cross-Document streams remain parking-lot for a future sub-project.

Decision — REST is authoritative; events advise

Hub events carry IDs and previews; clients refetch the canonical row through the existing REST endpoints. Reconnect after a drop = "refetch the three relevant endpoints for the focused Document." No event replay, no Last-Event-ID, no per-subscriber ring buffer.

Design

Architecture overview

A new internal/realtime package holds the hub. The hub is in-process, in-memory, keyed by source_path. The HTTP layer adds two endpoints — one SSE stream, one tiny focus-update POST. Existing server-process mutation handlers call Hub.Publish after their SQLite write commits; the agent runner publishes after it observes rows committed by the wb-agent subprocess.

chrome.js (Daniel) chrome.js (Max) GET /api/stream SSE · source-scoped internal/realtime Hub · per source_path Topic / message handlers Discard / Incorporate (#4) Agent runner status subscribe Publish() EventSource per Document: · opened on /doc/X load · closed on navigation · auto-reconnects on drop SQLite (collab DB) authoritative state publish after commit
Hub is process-local. Clients subscribe per Document. Mutation handlers and the agent runner publish after their SQLite writes commit.

The internal/realtime package

One package, one type. Internal storage is map[string]map[*subscription]struct{} per source guarded by a sync.RWMutex (standard pattern; sync.Map is the wrong shape here — we iterate the set on every publish). Each subscription owns a buffered channel and per-session metadata.

go// internal/realtime/hub.go (sketch)

type Event struct {
    Type    string          // e.g. "topic.created"
    Payload any             // JSON-marshalable
}

type Principal struct {
    UserID      string
    DisplayName string
    SessionID   string
}

type Hub interface {
    // Subscribe allocates a Subscriber but does NOT yet register it for fanout.
    // The SSE handler writes the `subscribed` handshake frame, then calls
    // Activate. This guarantees `subscribed` is the first frame on the wire.
    Subscribe(sourcePath string, p Principal) (*Subscriber, error)
    Activate(s *Subscriber)
    Publish(sourcePath string, evt Event)
    Close()
}

type Subscriber interface {
    ID() string          // the subscriber_id
    Events() <-chan Event
    SetFocus(topicID string)
    Close()
}

Error vocabulary for Subscribe: ErrHubClosed only (returned when Hub.Close has been called — shutdown path). No per-source subscriber cap in v1; the auth allowlist (two users) is the de-facto cap. sourcePath is assumed pre-validated by the SSE handler; the hub does not re-validate.

Subscribers are created by the SSE handler when a client connects, and closed when the handler returns. The 64-event buffer is per subscriber; overflow closes the subscriber's channel and the SSE handler returns, prompting the client to reconnect and refetch. (See Backpressure below for buffer-size rationale.)

Decision — events are hints, not an ordered log

Server-process mutation handlers call Hub.Publish after their SQLite write commits, inside the write-funnel path where possible. Agent-produced rows are different: wb-agent is a separate process, writes directly to SQLite, and cannot call the server's in-memory hub from inside its transaction. For those rows the server runner publishes only after the subprocess exits and the runner observes the committed DB state.

No event sequence number or replay log ships in v1. Clients must treat every event as a hint to refetch canonical REST state and then sort/render by durable fields such as topic-message sequence, proposal revision_number, and job status timestamps. The transport promises fast convergence, not a complete ordered event history.

HTTP surface

Method · PathAuthBody / QueryResponses
GET /api/stream RequireCollaborator ?source_path=<path> 200 text/event-stream · 400 bad_source_path · 401 unauthenticated · 403 forbidden · 404 unknown_source · 503 collab_unavailable
POST /api/stream/focus RequireCollaborator + RequireCSRF {subscriber_id, topic_id?} (JSON) 204 no content · 400 bad_request · 401 unauthenticated · 403 forbidden · 404 unknown_subscriber / unknown_topic · 429 too_many_focus_calls

Both endpoints mount in internal/server/server.go alongside the existing collab routes, behind the same session middleware. The SSE handler does not require CSRF — it is a read-only GET, same treatment as GET /api/topics. The focus POST does require CSRF, same as every other mutating endpoint.

SSE handler responsibilities

  1. Validate source_path via collab.ValidateSourcePath; reject 400 on failure. Verify the file exists on disk (mirrors the existing content handler) and reject 404 if not.
  2. Set headers: Content-Type: text/event-stream, Cache-Control: no-store, Connection: keep-alive, X-Accel-Buffering: no. The header is a best-effort hint; deployment still requires proxy_buffering off in Nginx Proxy Manager.
  3. Call Hub.Subscribe(source_path, principal) to allocate a Subscriber (not yet in the fanout set). Write the subscribed handshake frame carrying the server-generated subscriber_id — the only field — and Flusher.Flush(). Only then call Hub.Activate(subscriber), which adds the subscriber to the source's fanout set and triggers the presence recompute. This split guarantees subscribed is the first frame on the wire; concurrent publishes from other sources cannot race in ahead of it. The client renders presence from the presence.updated event that follows, not from subscribed.
  4. Loop: select on the subscriber's event channel, the request context, and a 15 s keepalive ticker. Write each event as one SSE frame — event: and data: lines only, never id: (v1 has no replay log, so honouring a Last-Event-ID reconnect header is impossible):
    textevent: topic.created
    data: {"topic_id":"…","source_path":"…",…}
    
    
    On each keepalive tick, read the session row. If revoked_at is set or expires_at has passed, fall through to step 5 (subscriber close). Otherwise, if now − last_seen_at ≥ auth.SessionTouchInterval (10 min), call store.TouchSession to extend the sliding expiry — without this the existing middleware semantics would let a passive SSE-only observer time out under their feet. Then write a single :keepalive comment line. After every write, call Flusher.Flush().
  5. On context cancellation, channel close, write error, or session-revalidation failure (step 4), call Subscriber.Close() and return. The hub recomputes presence and republishes. This is the single termination path.

Focus endpoint responsibilities

  1. Decode {subscriber_id, topic_id?}. topic_id empty/absent means "no Topic focused, just viewing the Document."
  2. Apply a 1-call/sec limiter (in-memory token bucket keyed by (session_id, subscriber_id) so two tabs in one session get independent quotas). Reject 429 too_many_focus_calls on burst.
  3. Resolve the subscriber by subscriber_id. Verify it belongs to the calling session (defense against a hijacked id from another principal). If absent or session-mismatched, return 404 unknown_subscriber — recoverable by reconnecting. This handle disambiguates two tabs on the same source: each tab focuses its own subscription independently.
  4. If topic_id is non-empty, load the Topic and require topic.source_path to match the subscriber's source. Unknown / cross-source IDs return 404 unknown_topic. Terminal Topics (incorporated or discarded) silently coerce to topic_id = "" — focus on a closed Topic is treated as "no Topic focused, just viewing the Document," because under normal contention the topic.discarded / topic.incorporated event is in flight and the chrome will repaint anyway. Avoids forcing chrome to hand-roll the "swallow 409 silently" path on every focus call.
  5. Call Subscriber.SetFocus(topic_id); the hub fans out the recomputed presence snapshot to every subscriber on the same source as the resolved subscription.
  6. Return 204.

Event surface

Every SSE frame's data payload is a JSON object. Field shape is fixed by this spec; previews are short (≤160 chars) to keep event frames small. Clients re-fetch canonical bodies through REST.

TypePayloadPublished bySubscribers do
subscribed {subscriber_id} SSE handler on connect (before joining hub fanout) Store subscriber_id for use on POST /api/stream/focus. Presence arrives separately via the presence.updated that follows immediately as a side-effect of subscribe.
topic.created {topic_id, source_path, anchor_kind, first_message_preview, created_by, created_at} handleCreateTopic Append to Topic list; refetch /api/topics?source_path if rendering depends on more than the preview; refetch /api/topics/{id}/proposals for any open Resolve view to surface stale banners caused by the new Topic
topic.message_appended {topic_id, message_id, sequence, kind, body_preview, author_user_id?, proposal_id?, created_at}proposal_id is non-null only when kind = 'agent-proposal'; author_user_id is null for the same case (per #1's "Agent is not a user") handleAppendTopicMessage (human messages) and the agent runner after wb-agent insert-proposal commits (agent-proposal messages) If the focused thread matches, refetch /api/topics/{id}/messages; otherwise badge the topic card
topic.discarded {topic_id, discarded_by, discarded_at} Discard handler (ships with #4) Mark Topic closed; remove from open list; if focused, close the thread view
topic.incorporated {topic_id, source_path, commit_sha, incorporated_by, incorporated_at} Incorporate handler (ships with #4) Remove the incorporated Topic from the list; reload the current Document iframe so the rendered Source, the wb-source-sha meta tag, and anchors reflect commit_sha; refetch /api/topics?source_path (other Topics' anchors flipped to marker); for every other Topic on this Source whose Resolve drawer is open, refetch /api/topics/{id}/proposals AND reload both diff iframes — the proposal's base_source_sha is now superseded, so both the left "current Source" iframe and the right /content/preview/proposals/{id} iframe must re-render against the new state to keep #4's stale banner accurate
proposal.created {proposal_id, topic_id, source_path, revision_number, agent_job_id} Agent runner after post-invariant check succeeds (#4) Refetch /api/topics/{id}/proposals; if a job-running spinner is up, swap to a "review proposal" affordance
job.updated {job_id, kind, status, topic_id?, persona_name?} Agent runner on every state transition Replaces the legacy GET /api/agent/jobs/{id} poll; refetch the job for error_tail if status is failed / timed_out
presence.updated {subscriptions: [{subscriber_id, user_id, display_name, focused_topic_id?}]} — one entry per live subscription (so two tabs from the same user appear as two entries with the same user_id) Hub on subscribe / unsubscribe / focus change Re-render the presence chip (dedup display by user_id if showing a chip-per-person; or one badge per subscription if showing a chip-per-tab) and any per-Topic "Max is reading this" indicator (derive from any subscription whose focused_topic_id matches the Topic)
Event surface. Payloads carry IDs + minimal previews; canonical state stays in REST. Exception: subscribed is a handshake event, not a state-bearing event — its payload is the session-bound handle the client uses to authenticate POST /api/stream/focus.
Decision — no proposal.staled event

Stale-proposal status (#4) is derived on read from the comparison of the proposal's bytes against current open Topics. The events that can make a proposal stale (topic.created, topic.incorporated) already cause the client to refetch /api/topics/{id}/proposals, where the existing #4 stale-check resurfaces the banner. A dedicated proposal.staled event would duplicate state without adding information. Reserve the name in case a future case requires distinct payload, but do not emit it in v1.

Decision — publisher receives its own events

The hub does not filter the publisher out of fanout. The publishing client receives its own topic.created / topic.message_appended / etc. Clients dedupe by record ID; the result is that the publisher's UI converges through the same render path as every other subscriber. Optimistic-update divergence is one bug class we avoid by spending a few extra bytes on self-events.

Decision — publish-after-commit, with a pinned order for Agent incorporate jobs

Every publisher calls Hub.Publish only after its SQLite write commits. For in-process handlers (handleCreateTopic, handleAppendTopicMessage, handleDiscardTopic) this means inside the write-funnel callback after the row commits.

For Agent incorporate jobs, the runner publishes only after collab.CompleteJob succeeds (so consumers cannot observe job.updated{status=succeeded} while the row is still running), and emits events in this exact order so the proposer's tab transitions correctly:

  1. topic.message_appended for the agent-proposal row (carries proposal_id).
  2. proposal.created for the proposal row.
  3. job.updated with the terminal status.

For failed / timed-out incorporate jobs only job.updated is published. For perspective jobs (#5), the analogous order is settled there. Why: job.updated{succeeded} arriving before proposal.created would briefly transition the proposer's UI to "no proposal" before bouncing back to "review proposal." Pinning the order eliminates the flicker without needing a synthetic "generating-finished" state.

Presence model

Hub state per source:

gotype subscription struct {
    SubscriberID    string
    Principal       Principal
    FocusedTopicID  string // "" = viewing the Document, no Topic open
    sender          chan Event
    // ...
}

On every subscribe, unsubscribe, or SetFocus, the hub:

  1. Recomputes the snapshot for the affected source: one entry per live subscription, {subscriber_id, user_id, display_name, focused_topic_id?}. No collapse by user_id — two tabs from the same user produce two entries with the same user_id, each carrying its own focused_topic_id. This matches the focus-POST design (each tab focuses its own subscription independently) and avoids the "tab B subscribed last with empty focus overwrites tab A's focus on Topic X" bug that any collapse policy is forced to invent a tiebreak for.
  2. Publishes presence.updated to every subscriber on that source — including the user whose state changed (so the publisher's UI also re-renders).

The full snapshot is published, not a delta. With two collaborators × small tab counts the payload is trivially small; deltas would be premature optimisation. Clients decide whether the rendered presence chip dedups by user_id (one chip per person, "Max is reading…" with any focus from any of their tabs) or surfaces per-tab entries — the event surface supports both without server-side state.

Auth wiring

Both endpoints mount behind auth.RequireCollaborator. Anonymous requests get 401; signed-in non-collaborators (an unlikely state in v1 since the allowlist is the whole user set) get 403. Initial SSE auth failures, like other protected APIs, return 401/403 with JSON bodies and no HTML redirect for tests and non-browser clients. Browser EventSource does not expose those bodies or status codes, so chrome must classify stream failures with a normal GET /auth/me probe.

Open streams revalidate auth on the 15 s keepalive tick. If the session expires or is revoked while a stream is open, the handler closes the subscriber and returns; the browser reconnect path then probes /auth/me. Anonymous probe result ⇒ location.reload() so the page routes through /auth/login. Still-authenticated probe result ⇒ keep the normal EventSource reconnect running and treat the error as a transient transport failure.

Client wiring (contract — implemented in #8)

This subsection is the canonical contract chrome.js must satisfy. The visual polish — chip placement, toast styling, breakpoint behaviour, choice between per-user and per-tab presence rendering — is owned by #8 and listed under cross-cutting items below. The wiring itself is fixed here.

  1. On iframe load: (a) call loadTopics() as today, (b) fetch GET /api/agent/jobs?source_path=… once to bootstrap in-flight job state (the hub does not replay job state on subscribe), (c) open new EventSource('/api/stream?source_path=' + encodeURIComponent(sourcePath)).
  2. On every subscribed event (initial connect and every EventSource auto-reconnect), replace the per-source subscriber_id in the local map — the previous one is dead. Presence renders from the presence.updated event that follows.
  3. For each event type, dispatch to a handler that updates UI state and triggers REST refetches per the table above. Dedup events by record ID so self-events are idempotent.
  4. On Topic focus (open thread / close thread), POST /api/stream/focus with the stored subscriber_id and topic_id? (null when closing). Debounce locally at 1/sec to match the server limiter. On 404 unknown_subscriber, drop the stale ID and re-issue focus once the next subscribed event arrives.
  5. On iframe navigation, close the EventSource and clear the per-source subscriber_id.
  6. On EventSource.onerror, let the built-in reconnect run and issue a normal GET /auth/me probe. If the probe returns anonymous / 401, force a full page reload; otherwise keep reconnecting without disrupting the page. A 404 unknown_source on the initial connect is terminal — EventSource does not retry on 4xx, and chrome should not either; surface a "this document no longer exists" state and stop.

Concurrent edits and conflict surfacing

The data layer already prevents bad writes. The stream surfaces them faster.

RaceData layerWhat #6 adds
Append race in the same Topic Per-topic monotonic sequence via the write funnel (#1) Both clients receive both topic.message_appended events; both render in sequence order; views converge
Discard race First commit wins; second POST gets 409 topic_closed Losing client receives topic.discarded before its click; Discard button greys; if click slipped through, the 409 is the safety net
Approval race on the same Topic Per-source agent slot (#3); stale-proposal check (#4) returns 409 stale_proposal Loser receives topic.incorporated; refetches /api/topics/{id}/proposals; stale banner appears immediately
New Topic opened mid-proposal Stale-check resurfaces missing-marker reason (#4) Proposer's tab receives topic.created; refetches proposals; Resolve view shows the stale banner before approval is attempted
Two collaborators creating Topics on the same Source No conflict — both writes succeed Both clients see both topic.created events; both Topic lists converge

Backpressure and slow consumers

Per-subscriber buffer is 64 events. A single Agent incorporation produces a burst of 3 events (topic.message_appended for agent-proposal, proposal.created, job.updated{succeeded}) plus prior job.updated{running}; combined with presence churn from focus changes and concurrent collaborator activity, 16 is borderline under realistic bursts. 64 events × ~256 bytes per Event struct ≈ 16 KiB per subscriber; even at 64 subscribers (well beyond v1's expected ~4) total channel memory stays under 1 MiB. The reconnect-and-refetch recovery path is the expected behaviour on overflow, not pathological.

If Publish finds the channel full, the hub closes the subscriber and removes it from the source's set. The SSE handler's next select sees the channel closed, returns, and the client reconnects. This drops queued events but the reconnect refetches authoritative state via REST, so no record is lost — only events that were never meaningful (the client was not consuming them) are.

The hub never blocks publishers waiting on slow subscribers. Publish is non-blocking: it tries each subscriber's channel; on full, it triggers async close.

Process lifecycle

Operational notes

Testing strategy

Migration considerations

Open questions

None blocking. Items below are deliberate parking-lot, owned elsewhere.

Cross-cutting items for sister sub-projects

For #5 — Perspectives

For #8 — Wiki-browser UI integration

The wire-level contract (when to open the EventSource, what to call on focus, how to handle reconnect / auth-expiry, where to bootstrap job state) is fixed in the Client wiring subsection above and is not restated here. The items below are the genuinely-open UI decisions #8 owns.

References