Document model & persistence — Design
Draft

Document model & persistence — Design

2026-05-11Danielwiki-browsersub-project #1

Problem

Sub-project #1 of the collaborative-annotations initiative must settle what a Document is, where its parts live in storage, and how they're written. The domain model defines vocabulary and invariants; this spec turns those abstractions into bytes-on-disk, schemas, and write paths.

Without this layer settled, sub-projects #2 (Topic core), #4 (Incorporation), and #5 (Perspectives) all have nowhere to land their concrete behavior. The full trail of decisions and trade-offs that produced this design lives in the decisions log.

Goals

Non-goals

Approach

Split storage along a clean axis: content that benefits from human-readable diff and blame goes in git; everything else goes in one SQLite database.

GIT WORKING TREE Source files .md and .html Versioned by git history. One commit per Incorporation, authored by the Agent. Writes: Agent only. SQLITE — WIKI-BROWSER-COLLAB.DB Topics, messages, proposals, perspectives, perspective_defs, users Single write funnel (goroutine); concurrent reads. No versioning of derived content. Writes: harness, on user actions. Agent commits on Incorporation
Two stores; one direction of write between them.

Source files live where they already do — in the orcha repo, served by the existing wiki-browser pipeline. The harness adds a single SQLite database wiki-browser-collab.db. Writes are asymmetric:

Reads from SQLite are concurrent. Reads of Source content use the working tree — the same files wiki-browser already serves.

Design

What a Document is

A Document is not a stored object. It is the logical identity source_path — the relative path of a file under the wiki-browser root. The Document is what you get by joining:

A Document springs into existence the moment any of these accrete — there is no documents table and no opt-in registration. Files served by wiki-browser that have no Topics and no Perspective defs behave exactly as they do today.

Decision — no opt-in for collaborative Documents

Every file served by wiki-browser is eligible. A file becomes a "live" Document by virtue of having a Topic opened on it or Perspective defs configured for it. This minimises ceremony and keeps the existing reader's experience identical for non-collaborative content.

Out of scope (v1) — external Source rename / delete

Per invariant, the Agent is the only writer of Source — renames and deletes do not happen through the system. If a Source file is renamed or deleted by an external tool (or by hand), Topics and Perspectives keyed by the old source_path become orphaned with no automatic recovery in v1. Reconciliation behaviour (rename-tracking, tombstoning, soft delete) is deferred. Operators who rename a doc through the wiki-browser surface should first discard any open Topics on it.

Schema — full

All tables live in wiki-browser-collab.db. Foreign keys are declared and enforced (PRAGMA foreign_keys = ON at connection time). Identifiers are TEXT (UUIDs or path-shaped natural keys); timestamps are INTEGER (unix seconds).

sql-- Users — auth/authz orthogonal; stubbed here for FK targets.
-- A bootstrap row is created from configuration at first run (see "Users").
CREATE TABLE users (
  id           TEXT PRIMARY KEY,
  display_name TEXT NOT NULL,
  created_at   INTEGER NOT NULL
);

-- Topics — one row per discussion thread on a Document.
-- Outcome columns (commit_sha, incorporated_*, discarded_*) are null while open.
-- The Topic's state (open / incorporated / discarded) is derived from these columns.
-- CHECK constraints enforce that columns remain mutually consistent at all times.
CREATE TABLE topics (
  id                       TEXT PRIMARY KEY,
  source_path              TEXT NOT NULL,
  anchor                   TEXT,                     -- JSON; schema defined by #2
  commit_sha               TEXT,                     -- set on incorporation
  incorporated_proposal_id TEXT,                     -- FK to incorporation_proposals(id)
  incorporated_by          TEXT,                     -- FK to users(id)
  incorporated_at          INTEGER,
  discarded_by             TEXT,                     -- FK to users(id)
  discarded_at             INTEGER,
  created_at               INTEGER NOT NULL,
  created_by               TEXT NOT NULL,            -- FK to users(id)
  updated_at               INTEGER NOT NULL,
  -- All-or-nothing for incorporation outcome columns:
  CHECK ((commit_sha               IS NULL) = (incorporated_at IS NULL)),
  CHECK ((incorporated_proposal_id IS NULL) = (incorporated_at IS NULL)),
  CHECK ((incorporated_by          IS NULL) = (incorporated_at IS NULL)),
  -- All-or-nothing for discard outcome columns:
  CHECK ((discarded_by              IS NULL) = (discarded_at    IS NULL)),
  -- Mutually exclusive outcomes — a Topic cannot be both incorporated and discarded:
  CHECK (incorporated_at IS NULL OR discarded_at IS NULL),
  FOREIGN KEY (created_by)               REFERENCES users(id),
  FOREIGN KEY (incorporated_by)          REFERENCES users(id),
  FOREIGN KEY (discarded_by)             REFERENCES users(id),
  -- Composite FK: the proposal pointed at must belong to THIS topic.
  FOREIGN KEY (incorporated_proposal_id, id)
    REFERENCES incorporation_proposals(id, topic_id)
);
CREATE INDEX topics_by_source_path ON topics(source_path);

-- Topic messages — narrative of the conversation.
-- author_user_id is nullable: null for system / agent-generated messages.
-- kind values are #2's call; at minimum includes 'human' and 'agent-proposal'.
CREATE TABLE topic_messages (
  id              TEXT PRIMARY KEY,
  topic_id        TEXT NOT NULL,
  kind            TEXT NOT NULL,
  body            TEXT NOT NULL,
  author_user_id  TEXT,                     -- null for non-human messages
  proposal_id     TEXT,                     -- non-null when kind='agent-proposal'
  sequence        INTEGER NOT NULL,          -- monotonic per topic; allocated by write funnel
  created_at      INTEGER NOT NULL,
  FOREIGN KEY (topic_id)       REFERENCES topics(id),
  FOREIGN KEY (author_user_id) REFERENCES users(id),
  FOREIGN KEY (proposal_id)    REFERENCES incorporation_proposals(id)
);
CREATE UNIQUE INDEX topic_messages_topic_sequence
  ON topic_messages(topic_id, sequence);

-- Incorporation proposals — append-only event log.
-- One row per Agent proposal (including reworks). State is fully derived:
--   incorporated = topics.incorporated_proposal_id points at this row
--   superseded   = a higher revision_number exists for the same topic_id
--   proposed     = otherwise
-- base_source_sha pins the Source content the Agent generated against. On
-- approval, the harness verifies it still matches the current Source's git
-- blob hash; mismatch means the proposal is stale and must be regenerated.
CREATE TABLE incorporation_proposals (
  id              TEXT PRIMARY KEY,
  topic_id        TEXT NOT NULL,
  revision_number INTEGER NOT NULL,
  proposed_source TEXT NOT NULL,
  base_source_sha TEXT NOT NULL,            -- git blob SHA at generation time
  proposed_by     TEXT NOT NULL,            -- FK to users(id)
  created_at      INTEGER NOT NULL,
  FOREIGN KEY (topic_id)    REFERENCES topics(id),
  FOREIGN KEY (proposed_by) REFERENCES users(id)
);
CREATE UNIQUE INDEX incorporation_proposals_topic_rev
  ON incorporation_proposals(topic_id, revision_number);
-- Required so the topics → incorporation_proposals composite FK above resolves:
CREATE UNIQUE INDEX incorporation_proposals_id_topic
  ON incorporation_proposals(id, topic_id);

-- Incorporation attempts — recovery marker inserted before Source is written.
-- A row exists from human approval until the commit + Topic update finish.
-- Incomplete rows let startup distinguish "approved but not committed yet" from
-- unrelated dirty working-tree edits.
CREATE TABLE incorporation_attempts (
  id              TEXT PRIMARY KEY,
  proposal_id     TEXT NOT NULL UNIQUE,
  topic_id        TEXT NOT NULL,
  source_path     TEXT NOT NULL,
  base_source_sha TEXT NOT NULL,
  approved_by     TEXT NOT NULL,
  approved_at     INTEGER NOT NULL,
  committed_sha   TEXT,
  completed_at    INTEGER,
  created_at      INTEGER NOT NULL,
  CHECK ((committed_sha IS NULL) = (completed_at IS NULL)),
  FOREIGN KEY (proposal_id, topic_id)
    REFERENCES incorporation_proposals(id, topic_id),
  FOREIGN KEY (approved_by) REFERENCES users(id)
);
CREATE INDEX incorporation_attempts_incomplete
  ON incorporation_attempts(completed_at);

-- Perspective definitions — per-Document personas.
-- persona_sha keys the cache; bumping it invalidates cached generations.
CREATE TABLE perspective_defs (
  source_path  TEXT NOT NULL,
  id           TEXT NOT NULL,             -- e.g. 'cfo', 'engineer'
  label        TEXT NOT NULL,
  persona      TEXT NOT NULL,             -- prompt text
  persona_sha  TEXT NOT NULL,             -- SHA-256 of persona text (hex)
  created_at   INTEGER NOT NULL,
  updated_at   INTEGER NOT NULL,
  PRIMARY KEY (source_path, id)
);

-- Perspectives — generated content, sha-keyed on both Source and persona.
-- Lookup: WHERE source_path=? AND perspective_id=? AND source_sha=? AND persona_sha=?
-- Miss → regenerate, insert, serve. Hit → serve.
CREATE TABLE perspectives (
  source_path     TEXT NOT NULL,
  perspective_id  TEXT NOT NULL,
  source_sha      TEXT NOT NULL,           -- git blob SHA of the Source
  persona_sha     TEXT NOT NULL,           -- SHA-256 of persona text (hex)
  body            TEXT NOT NULL,
  generated_at    INTEGER NOT NULL,
  PRIMARY KEY (source_path, perspective_id, source_sha, persona_sha)
);
Note — circular foreign key

topics.incorporated_proposal_id points at incorporation_proposals(id), and incorporation_proposals.topic_id points back at topics(id). SQLite resolves this naturally given insertion order — a Topic is always inserted first, then proposals; the back-reference is set only on incorporation. No deferred-constraints work needed.

Database location and lifecycle

The collab DB is authoritative for everything it stores — Topics, messages, proposals, perspective definitions, perspective cache, and users. The cache rows could in principle be regenerated; everything else is irreplaceable. Treat the file like any other production data store.

Schema migrations

The collab DB carries user data that mustn't be discarded; it never recreates itself from scratch. We need a real migration tool from day one.

Approach: a minimal internal/migrate package — a numbered list of forward-only SQL migrations applied at process start, with a schema_migrations table recording which have been run. No down-migrations in v1. This is a tiny amount of code and saves us from a panicked retrofit on first schema change.

Hashes — source_sha and persona_sha

Two content hashes parameterise the system; both are defined explicitly here so callers do not have to guess.

Both hashes are stored alongside the content they're derived from and used as cache keys; mismatches invalidate cached rows implicitly.

Source writes — protocol and recovery

Each Incorporation produces exactly one commit on the working branch, but the path from "human clicks Approve" to "SQLite reflects the commit" crosses two stores. The protocol below specifies the write order and the recovery procedure that guarantees no crash can leave the system in an unrecoverable state.

Write order

  1. Stale-check. Compute the current Source content's source_sha via git hash-object. Compare to incorporation_proposals.base_source_sha. If they differ, the proposal is stale: surface "Source has changed; regenerate the proposal" in the UI and stop.
  2. Record the approved attempt in incorporation_attempts. This SQLite row is inserted before touching the working tree, so startup can recover if the process dies after the file write but before the commit.
  3. Write the proposed Source to disk at <root>/<source_path>.
  4. Commit. git add && git commit as the Agent identity, with trailers Topic:, Proposal:, and Approved-by:. Capture the resulting commit sha from git rev-parse HEAD.
  5. Update topics and complete the attempt in one SQLite transaction: set topics.commit_sha, topics.incorporated_proposal_id, topics.incorporated_by, topics.incorporated_at, plus incorporation_attempts.committed_sha and completed_at.

Crash windows and recovery

Startup recovery handles every durable partial state from the write order above. The recovery loop runs before accepting new Incorporation work.

  1. Commit landed, SQLite incomplete. Enumerate commits with Topic: trailers via git log --grep='^Topic:' --format='%H%n%aI%n%B%x00' (or equivalent). For any commit whose Topic still has commit_sha IS NULL, parse Topic:, Proposal:, and Approved-by:, then apply the missing Topic update and complete the matching incorporation_attempts row.
  2. Attempt exists, no commit landed. For each incomplete incorporation_attempts row, load the proposal and compute the current working-tree source_sha. If it still equals base_source_sha, the file write never happened; resume from the write step. If it equals the git blob hash of proposed_source (computed with git hash-object --stdin), the file write happened but the commit did not; resume from the commit step.
  3. Attempt exists, but working tree is neither base nor proposal. Treat this as external or partial modification. Do not auto-commit or overwrite. Mark Source writes disabled for that Document and surface operator intervention: inspect the working tree, then either restore the base Source and retry, or discard/rework the proposal.

The reconciliation is idempotent and runs at every startup. Cost is bounded by the number of commits with Topic: trailers plus incomplete attempts — small in practice.

Commit message format

git commit message<subject — ≤72 chars, derived from Topic title or summary>

<optional body paragraph(s) — typically the Topic's resolution rationale>

Approved-by: <display name> <user-id>
Topic: <topic_id>
Proposal: <revision_number>

Author and committer: Orcha Agent <agent@orcha.local> (configured via git -c user.name=… -c user.email=… at commit time, owned by sub-project #3). git blame then reflects the truth — the Agent really wrote the change — while trailers keep who-approved and which-conversation searchable from git log --grep.

State derivation rules

No stored state columns for Topic or proposal "state." Both are computed from outcome columns:

EntityDerived statePredicate
Topicincorporatedcommit_sha IS NOT NULL
discardeddiscarded_at IS NOT NULL
openneither
Proposalincorporatedtopics.incorporated_proposal_id = this.id
supersededexists newer revision_number for same topic_id
proposedotherwise

The Topic's orphaned sub-state (anchor invalid after a Source rewrite) is a property of the anchor JSON, not the Topic; owned by #2.

Perspective cache invalidation

Cache lookups join on both source_sha and persona_sha. Either changing invalidates the cache:

The cache is keyed for natural staleness; there is no explicit invalidate step. Stale rows remain in perspectives until evicted by a future sweep job — out of scope for v1.

UX note — passed to #5 / #8

Lazy refresh means the first reader of a stale Perspective waits for regeneration. The UI must signal "regenerating…" so the wait is comprehensible. Decided here; rendered there.

Write paths and concurrency

SQLite writes funnel through a single goroutine reading a buffered channel of mutation requests, mirroring internal/index/index.go. Reads use a separate pool / direct connection — SQLite handles concurrent reads natively. This avoids SQLITE_BUSY errors and serialises writes without explicit application-level locking.

Git writes (Source commits) are similarly serialised: a single Agent goroutine receives Incorporation requests; the Agent runs the rewrite, captures the proposed source into incorporation_proposals, awaits human approval, then commits. The harness is the only thing that ever writes git. Sub-project #3 owns this loop's mechanics.

Users — stub data model plus bootstrap

The users table holds the minimum needed for foreign-key targets and UI rendering. The auth/authz project later owns:

None of those decisions affect the schema in this spec.

Bootstrap for v1 (single operator)

To make the NOT-NULL FK constraints satisfiable from day one without dragging auth into scope, the harness reads two values from wiki-browser.yaml and inserts a single row into users at startup:

yaml# wiki-browser.yaml
operator:
  user_id:      "daniel"
  display_name: "Daniel Barreto"

The startup routine runs INSERT OR IGNORE INTO users(id, display_name, created_at) VALUES (?, ?, ?). Every Topic / message / proposal action in v1 attributes to this single operator. When the real auth project lands, the bootstrap row remains valid; new users join via that project's flow.

Decision — the Agent is not a user

The Agent never appears in users. It writes git as a configured identity (commit author = "Orcha Agent"), not as an authenticated principal. Its actions inside SQLite (creating proposal rows, generating perspectives) are attributed to the human who triggered them via the relevant *_by columns. The harness is trusted; the Agent is part of the harness.

Open questions

Small items, none blocking implementation. Resolve during the writing-plans phase or first PR.

References