Document model & persistence — Design
Problem
Sub-project #1 of the collaborative-annotations initiative must settle what a Document is, where its parts live in storage, and how they're written. The domain model defines vocabulary and invariants; this spec turns those abstractions into bytes-on-disk, schemas, and write paths.
Without this layer settled, sub-projects #2 (Topic core), #4 (Incorporation), and #5 (Perspectives) all have nowhere to land their concrete behavior. The full trail of decisions and trade-offs that produced this design lives in the decisions log.
Goals
- A storage layout for Source, Perspectives, Topics, proposals, and supporting metadata.
- Clear ownership of every write: which writes go to git vs. SQLite, who initiates each, what integrity rules apply.
- A users data model that downstream features can reference — without designing authentication.
- A commit format for Agent-driven Source rewrites that preserves human auditability.
Non-goals
- Auth/authz. How users authenticate, what permissions exist, how the users table gets populated. Orthogonal initiative.
- Topic data model details. The exact shape of
topics.anchor, the exhaustive list oftopic_messages.kindvalues, the orphaned-anchor representation — owned by sub-project #2. - Incorporation state machine semantics. Rework prompts, conflict handling, the Agent's per-Incorporation prompt — owned by #4.
- Diff UI implementation. The decision to use Tier 1 + Tier 2 is here; how they render is in #4 / #8.
- Perspective generation algorithm. The cache shape is here; what the Agent does to produce a body is in #5.
Approach
Split storage along a clean axis: content that benefits from human-readable diff and blame goes in git; everything else goes in one SQLite database.
Source files live where they already do — in the orcha repo, served by the existing wiki-browser pipeline. The harness adds a single SQLite database wiki-browser-collab.db. Writes are asymmetric:
- Source writes =
git add+git commitinvoked by the Agent on Incorporation, after a human approves the diff. - All other writes = SQL through a single funnel goroutine, mirroring
internal/index/index.go.
Reads from SQLite are concurrent. Reads of Source content use the working tree — the same files wiki-browser already serves.
Design
What a Document is
A Document is not a stored object. It is the logical identity source_path — the relative path of a file under the wiki-browser root. The Document is what you get by joining:
- The Source: file content at
<root>/<source_path>in the working tree. - Its Perspectives: rows in
perspectiveswheresource_pathmatches. - Its Topics: rows in
topicswheresource_pathmatches.
A Document springs into existence the moment any of these accrete — there is no documents table and no opt-in registration. Files served by wiki-browser that have no Topics and no Perspective defs behave exactly as they do today.
Every file served by wiki-browser is eligible. A file becomes a "live" Document by virtue of having a Topic opened on it or Perspective defs configured for it. This minimises ceremony and keeps the existing reader's experience identical for non-collaborative content.
Per invariant, the Agent is the only writer of Source — renames and deletes do not happen through the system. If a Source file is renamed or deleted by an external tool (or by hand), Topics and Perspectives keyed by the old source_path become orphaned with no automatic recovery in v1. Reconciliation behaviour (rename-tracking, tombstoning, soft delete) is deferred. Operators who rename a doc through the wiki-browser surface should first discard any open Topics on it.
Schema — full
All tables live in wiki-browser-collab.db. Foreign keys are declared and enforced (PRAGMA foreign_keys = ON at connection time). Identifiers are TEXT (UUIDs or path-shaped natural keys); timestamps are INTEGER (unix seconds).
sql-- Users — auth/authz orthogonal; stubbed here for FK targets. -- A bootstrap row is created from configuration at first run (see "Users"). CREATE TABLE users ( id TEXT PRIMARY KEY, display_name TEXT NOT NULL, created_at INTEGER NOT NULL ); -- Topics — one row per discussion thread on a Document. -- Outcome columns (commit_sha, incorporated_*, discarded_*) are null while open. -- The Topic's state (open / incorporated / discarded) is derived from these columns. -- CHECK constraints enforce that columns remain mutually consistent at all times. CREATE TABLE topics ( id TEXT PRIMARY KEY, source_path TEXT NOT NULL, anchor TEXT, -- JSON; schema defined by #2 commit_sha TEXT, -- set on incorporation incorporated_proposal_id TEXT, -- FK to incorporation_proposals(id) incorporated_by TEXT, -- FK to users(id) incorporated_at INTEGER, discarded_by TEXT, -- FK to users(id) discarded_at INTEGER, created_at INTEGER NOT NULL, created_by TEXT NOT NULL, -- FK to users(id) updated_at INTEGER NOT NULL, -- All-or-nothing for incorporation outcome columns: CHECK ((commit_sha IS NULL) = (incorporated_at IS NULL)), CHECK ((incorporated_proposal_id IS NULL) = (incorporated_at IS NULL)), CHECK ((incorporated_by IS NULL) = (incorporated_at IS NULL)), -- All-or-nothing for discard outcome columns: CHECK ((discarded_by IS NULL) = (discarded_at IS NULL)), -- Mutually exclusive outcomes — a Topic cannot be both incorporated and discarded: CHECK (incorporated_at IS NULL OR discarded_at IS NULL), FOREIGN KEY (created_by) REFERENCES users(id), FOREIGN KEY (incorporated_by) REFERENCES users(id), FOREIGN KEY (discarded_by) REFERENCES users(id), -- Composite FK: the proposal pointed at must belong to THIS topic. FOREIGN KEY (incorporated_proposal_id, id) REFERENCES incorporation_proposals(id, topic_id) ); CREATE INDEX topics_by_source_path ON topics(source_path); -- Topic messages — narrative of the conversation. -- author_user_id is nullable: null for system / agent-generated messages. -- kind values are #2's call; at minimum includes 'human' and 'agent-proposal'. CREATE TABLE topic_messages ( id TEXT PRIMARY KEY, topic_id TEXT NOT NULL, kind TEXT NOT NULL, body TEXT NOT NULL, author_user_id TEXT, -- null for non-human messages proposal_id TEXT, -- non-null when kind='agent-proposal' sequence INTEGER NOT NULL, -- monotonic per topic; allocated by write funnel created_at INTEGER NOT NULL, FOREIGN KEY (topic_id) REFERENCES topics(id), FOREIGN KEY (author_user_id) REFERENCES users(id), FOREIGN KEY (proposal_id) REFERENCES incorporation_proposals(id) ); CREATE UNIQUE INDEX topic_messages_topic_sequence ON topic_messages(topic_id, sequence); -- Incorporation proposals — append-only event log. -- One row per Agent proposal (including reworks). State is fully derived: -- incorporated = topics.incorporated_proposal_id points at this row -- superseded = a higher revision_number exists for the same topic_id -- proposed = otherwise -- base_source_sha pins the Source content the Agent generated against. On -- approval, the harness verifies it still matches the current Source's git -- blob hash; mismatch means the proposal is stale and must be regenerated. CREATE TABLE incorporation_proposals ( id TEXT PRIMARY KEY, topic_id TEXT NOT NULL, revision_number INTEGER NOT NULL, proposed_source TEXT NOT NULL, base_source_sha TEXT NOT NULL, -- git blob SHA at generation time proposed_by TEXT NOT NULL, -- FK to users(id) created_at INTEGER NOT NULL, FOREIGN KEY (topic_id) REFERENCES topics(id), FOREIGN KEY (proposed_by) REFERENCES users(id) ); CREATE UNIQUE INDEX incorporation_proposals_topic_rev ON incorporation_proposals(topic_id, revision_number); -- Required so the topics → incorporation_proposals composite FK above resolves: CREATE UNIQUE INDEX incorporation_proposals_id_topic ON incorporation_proposals(id, topic_id); -- Incorporation attempts — recovery marker inserted before Source is written. -- A row exists from human approval until the commit + Topic update finish. -- Incomplete rows let startup distinguish "approved but not committed yet" from -- unrelated dirty working-tree edits. CREATE TABLE incorporation_attempts ( id TEXT PRIMARY KEY, proposal_id TEXT NOT NULL UNIQUE, topic_id TEXT NOT NULL, source_path TEXT NOT NULL, base_source_sha TEXT NOT NULL, approved_by TEXT NOT NULL, approved_at INTEGER NOT NULL, committed_sha TEXT, completed_at INTEGER, created_at INTEGER NOT NULL, CHECK ((committed_sha IS NULL) = (completed_at IS NULL)), FOREIGN KEY (proposal_id, topic_id) REFERENCES incorporation_proposals(id, topic_id), FOREIGN KEY (approved_by) REFERENCES users(id) ); CREATE INDEX incorporation_attempts_incomplete ON incorporation_attempts(completed_at); -- Perspective definitions — per-Document personas. -- persona_sha keys the cache; bumping it invalidates cached generations. CREATE TABLE perspective_defs ( source_path TEXT NOT NULL, id TEXT NOT NULL, -- e.g. 'cfo', 'engineer' label TEXT NOT NULL, persona TEXT NOT NULL, -- prompt text persona_sha TEXT NOT NULL, -- SHA-256 of persona text (hex) created_at INTEGER NOT NULL, updated_at INTEGER NOT NULL, PRIMARY KEY (source_path, id) ); -- Perspectives — generated content, sha-keyed on both Source and persona. -- Lookup: WHERE source_path=? AND perspective_id=? AND source_sha=? AND persona_sha=? -- Miss → regenerate, insert, serve. Hit → serve. CREATE TABLE perspectives ( source_path TEXT NOT NULL, perspective_id TEXT NOT NULL, source_sha TEXT NOT NULL, -- git blob SHA of the Source persona_sha TEXT NOT NULL, -- SHA-256 of persona text (hex) body TEXT NOT NULL, generated_at INTEGER NOT NULL, PRIMARY KEY (source_path, perspective_id, source_sha, persona_sha) );
topics.incorporated_proposal_id points at incorporation_proposals(id), and incorporation_proposals.topic_id points back at topics(id). SQLite resolves this naturally given insertion order — a Topic is always inserted first, then proposals; the back-reference is set only on incorporation. No deferred-constraints work needed.
Database location and lifecycle
The collab DB is authoritative for everything it stores — Topics, messages, proposals, perspective definitions, perspective cache, and users. The cache rows could in principle be regenerated; everything else is irreplaceable. Treat the file like any other production data store.
- Location: a stable path configured in
wiki-browser.yaml(default./wiki-browser-collab.db, sibling to the index DB). Unlike the index DB, the collab DB must not be treated as disposable. - Gitignored, but deployment is responsible for backups. The harness does not implement backup automation in v1.
- WAL mode is enabled (
PRAGMA journal_mode = WAL) so the single write goroutine can run concurrently with read connections. This produces two sidecar files (-wal,-shm); backups must either checkpoint first (PRAGMA wal_checkpoint(TRUNCATE)) or copy all three files atomically. - Foreign-key enforcement is per-connection in SQLite. Every connection (read or write) must have
foreign_keys = ON. The project already usesmodernc.org/sqlite; with that driver the harness opens the DB using a DSN that includes_pragma=foreign_keys(1), and applies the same DSN to all connections in the pool.
Schema migrations
The collab DB carries user data that mustn't be discarded; it never recreates itself from scratch. We need a real migration tool from day one.
Approach: a minimal internal/migrate package — a numbered list of forward-only SQL migrations applied at process start, with a schema_migrations table recording which have been run. No down-migrations in v1. This is a tiny amount of code and saves us from a panicked retrofit on first schema change.
Hashes — source_sha and persona_sha
Two content hashes parameterise the system; both are defined explicitly here so callers do not have to guess.
source_sha= the git blob SHA of the Source file content (40-char SHA-1 hex). Computed viagit hash-object <path>against the working tree, or read fromgit ls-tree HEAD -- <path>for the committed version. Content-addressed; changes whenever the file content changes by even one byte.persona_sha= SHA-256 of the persona text, hex-encoded. Recomputed by the harness on every write toperspective_defs.persona.
Both hashes are stored alongside the content they're derived from and used as cache keys; mismatches invalidate cached rows implicitly.
Source writes — protocol and recovery
Each Incorporation produces exactly one commit on the working branch, but the path from "human clicks Approve" to "SQLite reflects the commit" crosses two stores. The protocol below specifies the write order and the recovery procedure that guarantees no crash can leave the system in an unrecoverable state.
Write order
- Stale-check. Compute the current Source content's
source_shaviagit hash-object. Compare toincorporation_proposals.base_source_sha. If they differ, the proposal is stale: surface "Source has changed; regenerate the proposal" in the UI and stop. - Record the approved attempt in
incorporation_attempts. This SQLite row is inserted before touching the working tree, so startup can recover if the process dies after the file write but before the commit. - Write the proposed Source to disk at
<root>/<source_path>. - Commit.
git add && git commitas the Agent identity, with trailersTopic:,Proposal:, andApproved-by:. Capture the resulting commit sha fromgit rev-parse HEAD. - Update
topicsand complete the attempt in one SQLite transaction: settopics.commit_sha,topics.incorporated_proposal_id,topics.incorporated_by,topics.incorporated_at, plusincorporation_attempts.committed_shaandcompleted_at.
Crash windows and recovery
Startup recovery handles every durable partial state from the write order above. The recovery loop runs before accepting new Incorporation work.
- Commit landed, SQLite incomplete. Enumerate commits with
Topic:trailers viagit log --grep='^Topic:' --format='%H%n%aI%n%B%x00'(or equivalent). For any commit whose Topic still hascommit_sha IS NULL, parseTopic:,Proposal:, andApproved-by:, then apply the missing Topic update and complete the matchingincorporation_attemptsrow. - Attempt exists, no commit landed. For each incomplete
incorporation_attemptsrow, load the proposal and compute the current working-treesource_sha. If it still equalsbase_source_sha, the file write never happened; resume from the write step. If it equals the git blob hash ofproposed_source(computed withgit hash-object --stdin), the file write happened but the commit did not; resume from the commit step. - Attempt exists, but working tree is neither base nor proposal. Treat this as external or partial modification. Do not auto-commit or overwrite. Mark Source writes disabled for that Document and surface operator intervention: inspect the working tree, then either restore the base Source and retry, or discard/rework the proposal.
The reconciliation is idempotent and runs at every startup. Cost is bounded by the number of commits with Topic: trailers plus incomplete attempts — small in practice.
Commit message format
git commit message<subject — ≤72 chars, derived from Topic title or summary>
<optional body paragraph(s) — typically the Topic's resolution rationale>
Approved-by: <display name> <user-id>
Topic: <topic_id>
Proposal: <revision_number>
Author and committer: Orcha Agent <agent@orcha.local> (configured via git -c user.name=… -c user.email=… at commit time, owned by sub-project #3). git blame then reflects the truth — the Agent really wrote the change — while trailers keep who-approved and which-conversation searchable from git log --grep.
State derivation rules
No stored state columns for Topic or proposal "state." Both are computed from outcome columns:
| Entity | Derived state | Predicate |
|---|---|---|
| Topic | incorporated | commit_sha IS NOT NULL |
| discarded | discarded_at IS NOT NULL | |
| open | neither | |
| Proposal | incorporated | topics.incorporated_proposal_id = this.id |
| superseded | exists newer revision_number for same topic_id | |
| proposed | otherwise |
The Topic's orphaned sub-state (anchor invalid after a Source rewrite) is a property of the anchor JSON, not the Topic; owned by #2.
Perspective cache invalidation
Cache lookups join on both source_sha and persona_sha. Either changing invalidates the cache:
- Source advances on Incorporation → next read of any Perspective for that
source_pathis a miss, triggers regeneration of that Perspective specifically. - Persona text edited (a row update in
perspective_defs, withpersona_sharecomputed) → next read of that Perspective is a miss, triggers regeneration.
The cache is keyed for natural staleness; there is no explicit invalidate step. Stale rows remain in perspectives until evicted by a future sweep job — out of scope for v1.
Lazy refresh means the first reader of a stale Perspective waits for regeneration. The UI must signal "regenerating…" so the wait is comprehensible. Decided here; rendered there.
Write paths and concurrency
SQLite writes funnel through a single goroutine reading a buffered channel of mutation requests, mirroring internal/index/index.go. Reads use a separate pool / direct connection — SQLite handles concurrent reads natively. This avoids SQLITE_BUSY errors and serialises writes without explicit application-level locking.
Git writes (Source commits) are similarly serialised: a single Agent goroutine receives Incorporation requests; the Agent runs the rewrite, captures the proposed source into incorporation_proposals, awaits human approval, then commits. The harness is the only thing that ever writes git. Sub-project #3 owns this loop's mechanics.
Users — stub data model plus bootstrap
The users table holds the minimum needed for foreign-key targets and UI rendering. The auth/authz project later owns:
- How rows enter this table at scale (signup, SSO, manual provisioning).
- How HTTP requests authenticate (cookies, tokens, headers).
- What permissions exist (who can resolve a Topic, etc.).
None of those decisions affect the schema in this spec.
Bootstrap for v1 (single operator)
To make the NOT-NULL FK constraints satisfiable from day one without dragging auth into scope, the harness reads two values from wiki-browser.yaml and inserts a single row into users at startup:
yaml# wiki-browser.yaml
operator:
user_id: "daniel"
display_name: "Daniel Barreto"
The startup routine runs INSERT OR IGNORE INTO users(id, display_name, created_at) VALUES (?, ?, ?). Every Topic / message / proposal action in v1 attributes to this single operator. When the real auth project lands, the bootstrap row remains valid; new users join via that project's flow.
The Agent never appears in users. It writes git as a configured identity (commit author = "Orcha Agent"), not as an authenticated principal. Its actions inside SQLite (creating proposal rows, generating perspectives) are attributed to the human who triggered them via the relevant *_by columns. The harness is trusted; the Agent is part of the harness.
Open questions
Small items, none blocking implementation. Resolve during the writing-plans phase or first PR.
- Stale Perspective sweep. Old rows in
perspectivesaccumulate as Source advances. A simple job ("keep only the row matching the currentsource_sha+persona_sha") suffices. Not in v1. - Multi-line subject derivation. When a Topic's title or summary spans multiple lines, the commit subject needs a deterministic truncation. Cosmetic; defer.
References
- Domain model — vocabulary, invariants, sub-project decomposition.
- Decisions & parking lot — full trail of choices made during this brainstorm and items deferred to other sub-projects.
wiki-browser/internal/index/index.go— existing single-funnel SQLite write pattern; the collab DB mirrors it.wiki-browser/internal/walker/walker.go— existing fsnotify-driven file cache; the source-files-as-truth story reuses this pipeline as-is.