Draft

Document model & persistence — Design

2026-05-11Danielwiki-browsersub-project #1

Problem

Sub-project #1 of the collaborative-annotations initiative must settle what a Document is, where its parts live in storage, and how they're written. The domain model defines vocabulary and invariants; this spec turns those abstractions into bytes-on-disk, schemas, and write paths.

Without this layer settled, sub-projects #2 (Topic core), #4 (Incorporation), and #5 (Perspectives) all have nowhere to land their concrete behavior. The full trail of decisions and trade-offs that produced this design lives in the decisions log.

Goals

A storage layout for Source, Perspectives, Topics, proposals, and supporting metadata.
Clear ownership of every write: which writes go to git vs. SQLite, who initiates each, what integrity rules apply.
A users data model that downstream features can reference — without designing authentication.
A commit format for Agent-driven Source rewrites that preserves human auditability.

Non-goals

Auth/authz. How users authenticate, what permissions exist, how the users table gets populated. Orthogonal initiative.
Topic data model details. The exact shape of topics.anchor, the exhaustive list of topic_messages.kind values, the orphaned-anchor representation — owned by sub-project #2.
Incorporation state machine semantics. Rework prompts, conflict handling, the Agent's per-Incorporation prompt — owned by #4.
Diff UI implementation. The decision to use Tier 1 + Tier 2 is here; how they render is in #4 / #8.
Perspective generation algorithm. The cache shape is here; what the Agent does to produce a body is in #5.

Approach

Split storage along a clean axis: content that benefits from human-readable diff and blame goes in git; everything else goes in one SQLite database.

Two stores; one direction of write between them.

Source files live where they already do — in the orcha repo, served by the existing wiki-browser pipeline. The harness adds a single SQLite database wiki-browser-collab.db. Writes are asymmetric:

Source writes = git add + git commit invoked by the Agent on Incorporation, after a human approves the diff.
All other writes = SQL through a single funnel goroutine, mirroring internal/index/index.go.

Reads from SQLite are concurrent. Reads of Source content use the working tree — the same files wiki-browser already serves.

Design

What a Document is

A Document is not a stored object. It is the logical identity source_path — the relative path of a file under the wiki-browser root. The Document is what you get by joining:

The Source: file content at <root>/<source_path> in the working tree.
Its Perspectives: rows in perspectives where source_path matches.
Its Topics: rows in topics where source_path matches.

A Document springs into existence the moment any of these accrete — there is no documents table and no opt-in registration. Files served by wiki-browser that have no Topics and no Perspective defs behave exactly as they do today.

Decision — no opt-in for collaborative Documents

Every file served by wiki-browser is eligible. A file becomes a "live" Document by virtue of having a Topic opened on it or Perspective defs configured for it. This minimises ceremony and keeps the existing reader's experience identical for non-collaborative content.

Out of scope (v1) — external Source rename / delete

Per invariant, the Agent is the only writer of Source — renames and deletes do not happen through the system. If a Source file is renamed or deleted by an external tool (or by hand), Topics and Perspectives keyed by the old source_path become orphaned with no automatic recovery in v1. Reconciliation behaviour (rename-tracking, tombstoning, soft delete) is deferred. Operators who rename a doc through the wiki-browser surface should first discard any open Topics on it.

Schema — full

All tables live in wiki-browser-collab.db. Foreign keys are declared and enforced (PRAGMA foreign_keys = ON at connection time). Identifiers are TEXT (UUIDs or path-shaped natural keys); timestamps are INTEGER (unix seconds).

sql-- Users — auth/authz orthogonal; stubbed here for FK targets.
-- A bootstrap row is created from configuration at first run (see "Users").
CREATE TABLE users (
  id           TEXT PRIMARY KEY,
  display_name TEXT NOT NULL,
  created_at   INTEGER NOT NULL
);

-- Topics — one row per discussion thread on a Document.
-- Outcome columns (commit_sha, incorporated_*, discarded_*) are null while open.
-- The Topic's state (open / incorporated / discarded) is derived from these columns.
-- CHECK constraints enforce that columns remain mutually consistent at all times.
CREATE TABLE topics (
  id                       TEXT PRIMARY KEY,
  source_path              TEXT NOT NULL,
  anchor                   TEXT,                     -- JSON; schema defined by #2
  commit_sha               TEXT,                     -- set on incorporation
  incorporated_proposal_id TEXT,                     -- FK to incorporation_proposals(id)
  incorporated_by          TEXT,                     -- FK to users(id)
  incorporated_at          INTEGER,
  discarded_by             TEXT,                     -- FK to users(id)
  discarded_at             INTEGER,
  created_at               INTEGER NOT NULL,
  created_by               TEXT NOT NULL,            -- FK to users(id)
  updated_at               INTEGER NOT NULL,
  -- All-or-nothing for incorporation outcome columns:
  CHECK ((commit_sha               IS NULL) = (incorporated_at IS NULL)),
  CHECK ((incorporated_proposal_id IS NULL) = (incorporated_at IS NULL)),
  CHECK ((incorporated_by          IS NULL) = (incorporated_at IS NULL)),
  -- All-or-nothing for discard outcome columns:
  CHECK ((discarded_by              IS NULL) = (discarded_at    IS NULL)),
  -- Mutually exclusive outcomes — a Topic cannot be both incorporated and discarded:
  CHECK (incorporated_at IS NULL OR discarded_at IS NULL),
  FOREIGN KEY (created_by)               REFERENCES users(id),
  FOREIGN KEY (incorporated_by)          REFERENCES users(id),
  FOREIGN KEY (discarded_by)             REFERENCES users(id),
  -- Composite FK: the proposal pointed at must belong to THIS topic.
  FOREIGN KEY (incorporated_proposal_id, id)
    REFERENCES incorporation_proposals(id, topic_id)
);
CREATE INDEX topics_by_source_path ON topics(source_path);

-- Topic messages — narrative of the conversation.
-- author_user_id is nullable: null for system / agent-generated messages.
-- kind values are #2's call; at minimum includes 'human' and 'agent-proposal'.
CREATE TABLE topic_messages (
  id              TEXT PRIMARY KEY,
  topic_id        TEXT NOT NULL,
  kind            TEXT NOT NULL,
  body            TEXT NOT NULL,
  author_user_id  TEXT,                     -- null for non-human messages
  proposal_id     TEXT,                     -- non-null when kind='agent-proposal'
  sequence        INTEGER NOT NULL,          -- monotonic per topic; allocated by write funnel
  created_at      INTEGER NOT NULL,
  FOREIGN KEY (topic_id)       REFERENCES topics(id),
  FOREIGN KEY (author_user_id) REFERENCES users(id),
  FOREIGN KEY (proposal_id)    REFERENCES incorporation_proposals(id)
);
CREATE UNIQUE INDEX topic_messages_topic_sequence
  ON topic_messages(topic_id, sequence);

-- Incorporation proposals — append-only event log.
-- One row per Agent proposal (including reworks). State is fully derived:
--   incorporated = topics.incorporated_proposal_id points at this row
--   superseded   = a higher revision_number exists for the same topic_id
--   proposed     = otherwise
-- base_source_sha pins the Source content the Agent generated against. On
-- approval, the harness verifies it still matches the current Source's git
-- blob hash; mismatch means the proposal is stale and must be regenerated.
CREATE TABLE incorporation_proposals (
  id              TEXT PRIMARY KEY,
  topic_id        TEXT NOT NULL,
  revision_number INTEGER NOT NULL,
  proposed_source TEXT NOT NULL,
  base_source_sha TEXT NOT NULL,            -- git blob SHA at generation time
  proposed_by     TEXT NOT NULL,            -- FK to users(id)
  created_at      INTEGER NOT NULL,
  FOREIGN KEY (topic_id)    REFERENCES topics(id),
  FOREIGN KEY (proposed_by) REFERENCES users(id)
);
CREATE UNIQUE INDEX incorporation_proposals_topic_rev
  ON incorporation_proposals(topic_id, revision_number);
-- Required so the topics → incorporation_proposals composite FK above resolves:
CREATE UNIQUE INDEX incorporation_proposals_id_topic
  ON incorporation_proposals(id, topic_id);

-- Incorporation attempts — recovery marker inserted before Source is written.
-- A row exists from human approval until the commit + Topic update finish.
-- Incomplete rows let startup distinguish "approved but not committed yet" from
-- unrelated dirty working-tree edits.
CREATE TABLE incorporation_attempts (
  id              TEXT PRIMARY KEY,
  proposal_id     TEXT NOT NULL UNIQUE,
  topic_id        TEXT NOT NULL,
  source_path     TEXT NOT NULL,
  base_source_sha TEXT NOT NULL,
  approved_by     TEXT NOT NULL,
  approved_at     INTEGER NOT NULL,
  committed_sha   TEXT,
  completed_at    INTEGER,
  created_at      INTEGER NOT NULL,
  CHECK ((committed_sha IS NULL) = (completed_at IS NULL)),
  FOREIGN KEY (proposal_id, topic_id)
    REFERENCES incorporation_proposals(id, topic_id),
  FOREIGN KEY (approved_by) REFERENCES users(id)
);
CREATE INDEX incorporation_attempts_incomplete
  ON incorporation_attempts(completed_at);

-- Perspective definitions — per-Document personas.
-- persona_sha keys the cache; bumping it invalidates cached generations.
CREATE TABLE perspective_defs (
  source_path  TEXT NOT NULL,
  id           TEXT NOT NULL,             -- e.g. 'cfo', 'engineer'
  label        TEXT NOT NULL,
  persona      TEXT NOT NULL,             -- prompt text
  persona_sha  TEXT NOT NULL,             -- SHA-256 of persona text (hex)
  created_at   INTEGER NOT NULL,
  updated_at   INTEGER NOT NULL,
  PRIMARY KEY (source_path, id)
);

-- Perspectives — generated content, sha-keyed on both Source and persona.
-- Lookup: WHERE source_path=? AND perspective_id=? AND source_sha=? AND persona_sha=?
-- Miss → regenerate, insert, serve. Hit → serve.
CREATE TABLE perspectives (
  source_path     TEXT NOT NULL,
  perspective_id  TEXT NOT NULL,
  source_sha      TEXT NOT NULL,           -- git blob SHA of the Source
  persona_sha     TEXT NOT NULL,           -- SHA-256 of persona text (hex)
  body            TEXT NOT NULL,
  generated_at    INTEGER NOT NULL,
  PRIMARY KEY (source_path, perspective_id, source_sha, persona_sha)
);

Note — circular foreign key

topics.incorporated_proposal_id points at incorporation_proposals(id), and incorporation_proposals.topic_id points back at topics(id). SQLite resolves this naturally given insertion order — a Topic is always inserted first, then proposals; the back-reference is set only on incorporation. No deferred-constraints work needed.

Database location and lifecycle

The collab DB is authoritative for everything it stores — Topics, messages, proposals, perspective definitions, perspective cache, and users. The cache rows could in principle be regenerated; everything else is irreplaceable. Treat the file like any other production data store.

Location: a stable path configured in wiki-browser.yaml (default ./wiki-browser-collab.db, sibling to the index DB). Unlike the index DB, the collab DB must not be treated as disposable.
Gitignored, but deployment is responsible for backups. The harness does not implement backup automation in v1.
WAL mode is enabled (PRAGMA journal_mode = WAL) so the single write goroutine can run concurrently with read connections. This produces two sidecar files (-wal, -shm); backups must either checkpoint first (PRAGMA wal_checkpoint(TRUNCATE)) or copy all three files atomically.
Foreign-key enforcement is per-connection in SQLite. Every connection (read or write) must have foreign_keys = ON. The project already uses modernc.org/sqlite; with that driver the harness opens the DB using a DSN that includes _pragma=foreign_keys(1), and applies the same DSN to all connections in the pool.

Schema migrations

The collab DB carries user data that mustn't be discarded; it never recreates itself from scratch. We need a real migration tool from day one.

Approach: a minimal internal/migrate package — a numbered list of forward-only SQL migrations applied at process start, with a schema_migrations table recording which have been run. No down-migrations in v1. This is a tiny amount of code and saves us from a panicked retrofit on first schema change.

Hashes — `source_sha` and `persona_sha`

Two content hashes parameterise the system; both are defined explicitly here so callers do not have to guess.

source_sha = the git blob SHA of the Source file content (40-char SHA-1 hex). Computed via git hash-object <path> against the working tree, or read from git ls-tree HEAD -- <path> for the committed version. Content-addressed; changes whenever the file content changes by even one byte.
persona_sha = SHA-256 of the persona text, hex-encoded. Recomputed by the harness on every write to perspective_defs.persona.

Both hashes are stored alongside the content they're derived from and used as cache keys; mismatches invalidate cached rows implicitly.

Source writes — protocol and recovery

Each Incorporation produces exactly one commit on the working branch, but the path from "human clicks Approve" to "SQLite reflects the commit" crosses two stores. The protocol below specifies the write order and the recovery procedure that guarantees no crash can leave the system in an unrecoverable state.

Write order

Stale-check. Compute the current Source content's source_sha via git hash-object. Compare to incorporation_proposals.base_source_sha. If they differ, the proposal is stale: surface "Source has changed; regenerate the proposal" in the UI and stop.
Record the approved attempt in incorporation_attempts. This SQLite row is inserted before touching the working tree, so startup can recover if the process dies after the file write but before the commit.
Write the proposed Source to disk at <root>/<source_path>.
Commit. git add && git commit as the Agent identity, with trailers Topic:, Proposal:, and Approved-by:. Capture the resulting commit sha from git rev-parse HEAD.
Update topics and complete the attempt in one SQLite transaction: set topics.commit_sha, topics.incorporated_proposal_id, topics.incorporated_by, topics.incorporated_at, plus incorporation_attempts.committed_sha and completed_at.

Crash windows and recovery

Startup recovery handles every durable partial state from the write order above. The recovery loop runs before accepting new Incorporation work.

Commit landed, SQLite incomplete. Enumerate commits with Topic: trailers via git log --grep='^Topic:' --format='%H%n%aI%n%B%x00' (or equivalent). For any commit whose Topic still has commit_sha IS NULL, parse Topic:, Proposal:, and Approved-by:, then apply the missing Topic update and complete the matching incorporation_attempts row.
Attempt exists, no commit landed. For each incomplete incorporation_attempts row, load the proposal and compute the current working-tree source_sha. If it still equals base_source_sha, the file write never happened; resume from the write step. If it equals the git blob hash of proposed_source (computed with git hash-object --stdin), the file write happened but the commit did not; resume from the commit step.
Attempt exists, but working tree is neither base nor proposal. Treat this as external or partial modification. Do not auto-commit or overwrite. Mark Source writes disabled for that Document and surface operator intervention: inspect the working tree, then either restore the base Source and retry, or discard/rework the proposal.

The reconciliation is idempotent and runs at every startup. Cost is bounded by the number of commits with Topic: trailers plus incomplete attempts — small in practice.

Commit message format

git commit message<subject — ≤72 chars, derived from Topic title or summary>

<optional body paragraph(s) — typically the Topic's resolution rationale>

Approved-by: <display name> <user-id>
Topic: <topic_id>
Proposal: <revision_number>

Author and committer: Orcha Agent <agent@orcha.local> (configured via git -c user.name=… -c user.email=… at commit time, owned by sub-project #3). git blame then reflects the truth — the Agent really wrote the change — while trailers keep who-approved and which-conversation searchable from git log --grep.

State derivation rules

No stored state columns for Topic or proposal "state." Both are computed from outcome columns:

Entity	Derived state	Predicate
Topic	incorporated	`commit_sha IS NOT NULL`
	discarded	`discarded_at IS NOT NULL`
	open	neither
Proposal	incorporated	`topics.incorporated_proposal_id = this.id`
	superseded	exists newer `revision_number` for same `topic_id`
	proposed	otherwise

The Topic's orphaned sub-state (anchor invalid after a Source rewrite) is a property of the anchor JSON, not the Topic; owned by #2.

Perspective cache invalidation

Cache lookups join on both source_sha and persona_sha. Either changing invalidates the cache:

Source advances on Incorporation → next read of any Perspective for that source_path is a miss, triggers regeneration of that Perspective specifically.
Persona text edited (a row update in perspective_defs, with persona_sha recomputed) → next read of that Perspective is a miss, triggers regeneration.

The cache is keyed for natural staleness; there is no explicit invalidate step. Stale rows remain in perspectives until evicted by a future sweep job — out of scope for v1.

UX note — passed to #5 / #8

Lazy refresh means the first reader of a stale Perspective waits for regeneration. The UI must signal "regenerating…" so the wait is comprehensible. Decided here; rendered there.

Write paths and concurrency

SQLite writes funnel through a single goroutine reading a buffered channel of mutation requests, mirroring internal/index/index.go. Reads use a separate pool / direct connection — SQLite handles concurrent reads natively. This avoids SQLITE_BUSY errors and serialises writes without explicit application-level locking.

Git writes (Source commits) are similarly serialised: a single Agent goroutine receives Incorporation requests; the Agent runs the rewrite, captures the proposed source into incorporation_proposals, awaits human approval, then commits. The harness is the only thing that ever writes git. Sub-project #3 owns this loop's mechanics.

Users — stub data model plus bootstrap

The users table holds the minimum needed for foreign-key targets and UI rendering. The auth/authz project later owns:

How rows enter this table at scale (signup, SSO, manual provisioning).
How HTTP requests authenticate (cookies, tokens, headers).
What permissions exist (who can resolve a Topic, etc.).

None of those decisions affect the schema in this spec.

Bootstrap for v1 (single operator)

To make the NOT-NULL FK constraints satisfiable from day one without dragging auth into scope, the harness reads two values from wiki-browser.yaml and inserts a single row into users at startup:

yaml# wiki-browser.yaml
operator:
  user_id:      "daniel"
  display_name: "Daniel Barreto"

The startup routine runs INSERT OR IGNORE INTO users(id, display_name, created_at) VALUES (?, ?, ?). Every Topic / message / proposal action in v1 attributes to this single operator. When the real auth project lands, the bootstrap row remains valid; new users join via that project's flow.

Decision — the Agent is not a user

The Agent never appears in users. It writes git as a configured identity (commit author = "Orcha Agent"), not as an authenticated principal. Its actions inside SQLite (creating proposal rows, generating perspectives) are attributed to the human who triggered them via the relevant *_by columns. The harness is trusted; the Agent is part of the harness.

Open questions

Small items, none blocking implementation. Resolve during the writing-plans phase or first PR.

Stale Perspective sweep. Old rows in perspectives accumulate as Source advances. A simple job ("keep only the row matching the current source_sha + persona_sha") suffices. Not in v1.
Multi-line subject derivation. When a Topic's title or summary spans multiple lines, the commit subject needs a deterministic truncation. Cosmetic; defer.

References

Domain model — vocabulary, invariants, sub-project decomposition.
Decisions & parking lot — full trail of choices made during this brainstorm and items deferred to other sub-projects.
wiki-browser/internal/index/index.go — existing single-funnel SQLite write pattern; the collab DB mirrors it.
wiki-browser/internal/walker/walker.go — existing fsnotify-driven file cache; the source-files-as-truth story reuses this pipeline as-is.