Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Document Diagnostics — Design

1. Scope & goals

Problem

Once inline editing ships (edit-history plan), every user edit puts derived values out of date with no path to refresh them. The derived values — validation results, fraud flags, tax-compliance issues, per-line-item VAT checks, matching, reconciliation — today live in three places: embedded in document.structured_data, scattered across per-subsystem columns on document, and split across tables (document_match, ap_document_cluster). There is no unified audit of derivations and no recompute lifecycle other than full re-ingestion.

Approach

Introduce diagnostics as a first-class concept. One dedicated JSONB column on document holds the materialized latest snapshot. A new unified document_processor_run table records every execution of every processor (both ingestion-time and edit-triggered), replacing ap_ingestion_post_process_stat and absorbing per-document matching/reconciliation status tracking. A workers-service-hosted pipeline, triggered 60 seconds after the last edit (SQS-delayed idempotent handler), rebuilds diagnostics for the document's current version. The detail view marks stale and in-progress sections inline (grayed + badge) and auto-refreshes each one as its run completes, reusing the existing document-events SSE stream.

In scope

Out of scope, explicitly deferred

Depends on

The edit-history plan must be merged before this one. This design assumes:

2. Terminology

3. Storage model

3.1 New column: document.diagnostics

ALTER TABLE document ADD COLUMN diagnostics JSONB;

Materialized latest snapshot of all diagnostic outputs. Read directly by the detail view. Top-level keys are absent when the corresponding processor has never successfully run for this document.

Shape:

{
  "validations":    { "<check-name>": { "status": "pass|warning|error|uncertain",
                                        "field": "...", "message": "...",
                                        "details": {...}, "resolved-value": ...,
                                        "confidence": 0.9, "reasoning": "..." },
                      ... },
  "fraud-flags":    [ { "rule-id": "ef1-01", "type": "bank-account-mismatch",
                        "severity": "warning",
                        "message": "...", "details": {...}, "suggestion": "..." },
                      ... ],
  "tax-issues":     [ { "type": "missing-vat-id", "severity": "error",
                        "message": "...", "suggestion": "..." },
                      ... ],
  "line-items":     { "<line-item-id>": { "vat-validation": { "status": "valid|invalid|warning|skipped",
                                                              "expected-rate": 19,
                                                              "reasoning": "...",
                                                              "suggestion": "..." } },
                      ... },
  "matching":       { "matches": [ { "document-id": "<uuid>",
                                     "blended-score": 0.95,
                                     "llm-confidence": "high",
                                     "match-method": "llm" },
                                   ... ] },
  "reconciliation": { "status": "reconciled|incomplete|error", "details": {...} }
}

line-items is keyed by the per-line id assigned in the edit-history plan, so per-line diagnostics stay correctly associated across reorderings and deletions.

matching is a summary snapshot. The relational document_match table remains the primary query source for "which other documents match this one" lookups — its pairwise, cross-document shape is not something JSONB should represent. The snapshot exists in diagnostics.matching so the run row's audit is self-contained and the UI has one source of truth to render.

3.2 New table: document_processor_run

CREATE TYPE processor_run_status  AS ENUM ('pending', 'running', 'completed', 'failed');
CREATE TYPE processor_run_trigger AS ENUM ('ingestion', 'edit', 'manual');

CREATE TABLE document_processor_run (
    id                      UUID PRIMARY KEY DEFAULT uuidv7(),
    document_id             UUID NOT NULL REFERENCES document(id) ON DELETE CASCADE,
    processor_id            TEXT NOT NULL,
    trigger_kind            processor_run_trigger NOT NULL,
    ingestion_id            UUID REFERENCES ap_ingestion(id) ON DELETE CASCADE,
    triggered_by_history_id UUID REFERENCES document_history(id),
    document_version        INT  NOT NULL,
    started_at              TIMESTAMPTZ NOT NULL DEFAULT now(),
    ended_at                TIMESTAMPTZ,
    input_tokens            INTEGER,
    output_tokens           INTEGER,
    model                   TEXT,
    commit_sha              TEXT,
    status                  processor_run_status NOT NULL DEFAULT 'running',
    result                  JSONB,
    error                   TEXT,

    CONSTRAINT doc_processor_run_trigger_xor CHECK (
        (trigger_kind = 'ingestion' AND ingestion_id IS NOT NULL AND triggered_by_history_id IS NULL)
        OR (trigger_kind = 'edit'   AND ingestion_id IS NULL     AND triggered_by_history_id IS NOT NULL)
        OR (trigger_kind = 'manual' AND ingestion_id IS NULL     AND triggered_by_history_id IS NULL)
    )
);

CREATE INDEX idx_doc_processor_run_doc_proc_version
    ON document_processor_run(document_id, processor_id, document_version DESC);

CREATE INDEX idx_doc_processor_run_ingestion
    ON document_processor_run(ingestion_id) WHERE ingestion_id IS NOT NULL;

One row per run. processor_id is free-form text (not an ENUM) so that adding new processors in the future does not require a migration. Known values used by this release:

The trigger-XOR constraint keeps each row's provenance unambiguous.

3.3 Removed from structured_data

Per-document-type schemas (schema/invoice/structured_data.clj, schema/purchase_order/structured_data.clj, schema/goods_received_note/structured_data.clj, schema/contract/structured_data.clj) lose these top-level keys:

The LineItem schemas lose :vat-validation (moves to diagnostics.line-items[<id>].vat-validation).

3.4 Removed from document

Five columns become derivable from document_processor_run rows and are queued for removal in PENDING-CLEANUPS.md (see §9):

Latest state per subsystem is read via a SELECT … ORDER BY document_version DESC, started_at DESC LIMIT 1 (or DISTINCT ON (processor_id)) on the covering index. Attempt counts use COUNT(*).

The existing matching_status ENUM type is also queued for removal (§9).

3.5 Unchanged

4. Write paths

4.1 Processor runs: two-phase write

Every processor run writes to document_processor_run in two phases, regardless of trigger kind:

  1. At dispatch: INSERT a row with status='running', started_at = now(), ended_at = NULL, result = NULL, document_version = <doc version at dispatch time>.
  2. At completion: UPDATE the row. On success: status='completed', ended_at = now(), result = <subsystem output>, token counts and model populated (for LLM processors). On failure: status='failed', ended_at = now(), error = <message>, result may be NULL.

The UI reads the current state off this row — no separate "is it recomputing?" flag is needed on the document. The pg_notify trigger (§6.3) fires on every row transition so the SSE stream delivers both diagnostic-run-started and diagnostic-run-completed events.

4.2 Ingestion-time

The ingestion-completion handler (introduced by the edit-history plan) now additionally writes run rows. The pipeline:

  1. Transcription, classification, extraction (pre-existing; write to ap_ingestion columns as today).
  2. Phase 1 processors (parallel): validations, accounts-matcher, cost-center-matcher, accruals-matcher, supplier-matcher, supplier-verifier, tax-compliance-analyzer, financial-validation-resolver. Each INSERTs its run row at start (status='running') and UPDATEs on completion.
  3. Phase 2 processors (parallel, see phase-1 results): fraud-detector, uncertain-validations-resolver. Same two-phase write pattern.
  4. vat-validation runs as its own named processor, producing one run row per execution whose result is a map keyed by line-item id: {"<line-item-id>": {"status": ..., "expected-rate": ..., ...}, ...}. The per-line VAT check was previously inline in validation.clj; promoting it to a standalone processor gives it its own audit row and lets it recompute independently on edits.
  5. Ingestion-completion transaction (same transaction as the edit-history plan's document_history ingestion row + document.structured_data write):

Diagnostic-only refreshes do not bump document.version. Version tracks changes to editable state. Matching, reconciliation, and every edit-triggered recompute update document.diagnostics and document.needs_human_review in place, gated by the run's own document_version equaling document.version at commit time (see §4.4). If the document has been edited since the run started, the run's output is captured in document_processor_run.result for audit but is not merged into document.diagnostics.

Matching and reconciliation run on their existing post-ingestion worker schedule. Each creates its own run row (with trigger_kind = 'ingestion' and the same ingestion_id) and updates document.diagnostics.matching / .reconciliation on its terminal transition — subject to the version gate above.

4.3 Edit-triggered recompute

  1. User edit lands in an HTMX endpoint (edit-history plan). The edit handler, inside its existing transaction, enqueues an SQS message (§5.2) carrying document-id, the new document.version, the edit's history-id, and enqueued-at.
  2. After 60 s (SQS DelaySeconds), the consumer wakes up in the workers service.
  3. Idempotency check: fetch document.version and the max created_at on document_history for this document. If there is any document_history row newer than the message's enqueued-at, a later edit has already scheduled its own message — this one is stale, skip and ack.
  4. Dispatch: for each diagnostic processor, INSERT a run row with trigger_kind='edit', triggered_by_history_id = <message's history-id>, document_version = <current doc version>, status='running'. Dispatch work (same invocation shape as ingestion-time).
  5. Per-processor completion: the processor updates its own run row and, in the same transaction, merges its output slice into document.diagnostics and recomputes document.needs_human_reviewonly if the run's document_version equals document.version at commit time. If the document has been edited again in the meantime, the run's output is captured in the run row but not applied to the materialized document.diagnostics (§4.4).

Phase ordering for edit-triggered cycles mirrors ingestion: phase-2 processors (fraud, uncertain-validations) start only after phase-1 processors commit. Orchestration detail is deferred to the implementation plan.

4.4 Concurrency: edit during in-flight recompute

An edit arriving mid-cycle does not cancel anything running. Two outcomes:

4.5 Matching & reconciliation specifics

Matching continues to write document_match rows (the relational, pairwise shape) on every successful matching run. Additionally the matching processor:

Reconciliation writes its run row and its summary into document.diagnostics.reconciliation. No more document.reconciliation_status column.

5. Throttle & dispatch

5.1 SQS queue

New queue created in both production and local environments:

The consumer runs inside the workers service (the same JVM process that hosts the ingestion worker). It shares DB pool, LLM clients, and processor implementations with the ingestion pipeline — no new codepaths for LLM work, only a new entrypoint that invokes the existing processors against an already-ingested document.

5.2 Message shape & idempotency

{:document-id      "<uuid>"
 :history-id       "<uuid-of-triggering-edit>"
 :document-version <int>         ;; version at enqueue time
 :enqueued-at      "<iso-instant>"}

Enqueue is part of the edit handler's transaction (SQS send is deferred to commit via an outbox-style pattern or a post-commit hook; exact mechanism is an implementation detail left to the plan).

Idempotency logic at the consumer:

(defn should-run? [db-pool {:keys [document-id enqueued-at]}]
  (let [{:document/keys [version]} (fetch-document db-pool document-id)
        newest-edit                (latest-history-at db-pool document-id)]
    ;; If any edit occurred after this message was enqueued,
    ;; a later (and more up-to-date) message exists in the queue.
    (not (and newest-edit
              (pos? (compare (:created-at newest-edit) enqueued-at))))))

When should-run? is false, ack the message and do nothing. When true, dispatch for the current document version.

5.3 Consumer logic

fetch document + version
if should-run?
  for each diagnostic processor:
    insert run row status='running', document_version=doc.version
    enqueue processor work
orchestrate phase ordering (phase-2 waits for phase-1 commits)
each processor on completion:
  update own run row to completed/failed with result/error
  if run.document_version == current doc.version:
    merge result slice into document.diagnostics
    recompute document.needs_human_review
  NOTIFY document_events (via pg_notify trigger)

6. Read path & UI

6.1 Rendering with section states

The detail-view handler fetches:

{:document/keys [diagnostics version]} ...
latest-runs = (latest-run-per-processor db-pool document-id)

latest-runs is a single query:

SELECT DISTINCT ON (processor_id) processor_id, status, document_version,
                                  result, error, started_at, ended_at
  FROM document_processor_run
 WHERE document_id = $1
 ORDER BY processor_id, document_version DESC, started_at DESC;

Per diagnostic section, the renderer classifies state. Content in all non-empty cases comes from document.diagnostics.<subsystem> — which always carries the most recently applied snapshot (written by the last run whose document_version equaled document.version at commit). No second lookup needed.

Condition State Rendering
no row AND no diagnostics slice never-run Empty-state placeholder, "no analysis yet".
status='running' recomputing Gray opacity + "Recomputing…" pill in section header. Content from document.diagnostics (empty if first-ever run).
status='completed' AND document_version = document.version current Normal rendering from document.diagnostics.<subsystem>.
status='completed' AND document_version < document.version stale Gray opacity + "Stale (version behind)" pill. Content from document.diagnostics.
status='failed' error Red pill + error message text. Content from document.diagnostics (last-successful snapshot) if any.

Both stale and recomputing states use the grayed-out pattern (opacity ~0.55) plus a distinctive header badge; the visual language is shared so the user reads "don't trust these yet" either way.

The editable-value helper introduced by the edit-history plan is unaffected — it wraps editable leaves, which are all outside of diagnostics.

6.2 SSE integration

No new SSE endpoint. The detail-view SSE handler (src/com/getorcha/app/http/documents/view/shared.clj:1018) already subscribes to document-events keyed by tenant-id and dispatches on :event/type. The case branch is extended:

:diagnostic-run-started
;; Re-render the section with the 'recomputing' state.
(let [processor-id (:processor/id event)]
  {:event "diagnostic-run-started"
   :data  (hiccup/html (render-section-recomputing document-id processor-id ...))})

:diagnostic-run-completed
;; Re-render the section with the completed content (or error state).
(let [processor-id (:processor/id event)]
  {:event "diagnostic-run-completed"
   :data  (hiccup/html (render-section document-id processor-id ...))})

HTMX on the client listens with hx-ext="sse" + per-section sse-swap="diagnostic-run-completed-<processor>" attributes (or a single sse-swap that carries processor-id in the event name). The exact HTMX attribute shape is an implementation detail; the server side is a pure extension of the existing exec-fn.

6.3 pg_notify bridge

Today, trigger_document_matching_event (resources/migrations/20260302194439-add-matching-event-trigger.up.sql) fires pg_notify('document_events', …) on document.matching_status transitions. That column is going away.

A new trigger on document_processor_run replaces it:

CREATE OR REPLACE FUNCTION notify_processor_run_event()
RETURNS TRIGGER AS $$
DECLARE
    payload jsonb;
    le_tenant_id uuid;
    legal_entity_id uuid;
BEGIN
    -- Only notify on state transitions that matter to the UI
    IF NEW.status IS NULL OR NEW.status NOT IN ('running', 'completed', 'failed') THEN
        RETURN NEW;
    END IF;

    SELECT d.legal_entity_id, le.tenant_id INTO legal_entity_id, le_tenant_id
      FROM document d JOIN legal_entity le ON le.id = d.legal_entity_id
     WHERE d.id = NEW.document_id;

    payload := jsonb_build_object(
        'event/type',      CASE WHEN NEW.status = 'running'
                                THEN 'diagnostic-run-started'
                                ELSE 'diagnostic-run-completed' END,
        'document/id',     NEW.document_id::text,
        'processor/id',    NEW.processor_id,
        'document-version', NEW.document_version,
        'run-status',      NEW.status::text,
        'legal-entity/id', legal_entity_id::text,
        'tenant/id',       le_tenant_id::text
    );
    PERFORM pg_notify('document_events', payload::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_processor_run_event
    AFTER INSERT OR UPDATE OF status ON document_processor_run
    FOR EACH ROW
    WHEN (NEW.status IN ('running', 'completed', 'failed'))
    EXECUTE FUNCTION notify_processor_run_event();

Drop the old trigger_document_matching_event and its notify_matching_event function in the same migration.

7. needs_human_review

The column stays on document. Criterion is unchanged — true iff:

Only the source location changes: from structured_data.validation-results / structured_data.fraud-flags to diagnostics.validations / diagnostics.fraud-flags. The logic moves to an application-level helper (the trigger-based derivation was already scheduled for removal by the edit-history plan).

When does it update?

Queue semantics (list views filtering WHERE needs_human_review = true) are unchanged. NULL continues to mean "initial analysis not yet complete"; such documents are excluded from the queue.

8. missing-fields removal

structured_data.missing-fields is LLM-reported extraction metadata (the prompt asks the LLM to list fields it could not extract). A codebase search confirms no reader of the top-level key exists. The same information is already derivable — and is in fact derived — by the required-fields check inside the validations processor (validation.clj:548), which reads current structured_data against per-document-type required-field sets and stores the list under validation-results[:required-fields][:details][:missing-fields].

This plan removes the top-level field entirely rather than letting it decay into a stale LLM opinion after edits:

No PENDING-CLEANUPS.md entry — this is a full removal, not a deprecation.

The derivational answer to "what's currently missing?" lives where it already lives, inside diagnostics.validations.required-fields.details.missing-fields, and recomputes naturally with every validations run.

9. Migration plan

9.1 Migration file

One file, ships with the release. Ordering matters — the schema changes land first, the data backfill runs against the new schema, and the triggers flip last.

-- up

-- 1) New types + table
CREATE TYPE processor_run_status  AS ENUM ('pending', 'running', 'completed', 'failed');
CREATE TYPE processor_run_trigger AS ENUM ('ingestion', 'edit', 'manual');

CREATE TABLE document_processor_run ( ... );  -- full schema per §3.2
CREATE INDEX idx_doc_processor_run_doc_proc_version ...;
CREATE INDEX idx_doc_processor_run_ingestion ...;

-- 2) New diagnostics column
ALTER TABLE document ADD COLUMN diagnostics JSONB;

-- 3) Backfill run rows from ap_ingestion_post_process_stat
INSERT INTO document_processor_run
    (document_id, processor_id, trigger_kind, ingestion_id, document_version,
     started_at, ended_at, input_tokens, output_tokens, model, status, commit_sha)
SELECT i.document_id, s.processor_id, 'ingestion', s.ingestion_id, 1,
       s.started_at, s.ended_at, s.input_tokens, s.output_tokens, s.model,
       'completed', i.commit_sha
  FROM ap_ingestion_post_process_stat s
  JOIN ap_ingestion i ON i.id = s.ingestion_id
 WHERE i.document_id IS NOT NULL;

-- 4) Backfill one matching-run row per document with non-null matching_status
INSERT INTO document_processor_run
    (document_id, processor_id, trigger_kind, ingestion_id, document_version,
     started_at, ended_at, status, error)
SELECT d.id, 'matching', 'ingestion',
       (SELECT id FROM ap_ingestion WHERE document_id = d.id
         ORDER BY completed_at DESC NULLS LAST LIMIT 1),
       1,
       COALESCE(d.matching_failed_at, d.updated_at),
       CASE WHEN d.matching_status IN ('succeeded','failed','skipped')
            THEN COALESCE(d.matching_failed_at, d.updated_at) END,
       CASE d.matching_status
            WHEN 'pending'     THEN 'pending'
            WHEN 'in-progress' THEN 'running'
            WHEN 'succeeded'   THEN 'completed'
            WHEN 'failed'      THEN 'failed'
            WHEN 'skipped'     THEN 'completed'
       END::processor_run_status,
       d.matching_error
  FROM document d
 WHERE d.matching_status IS NOT NULL;

-- 5) Backfill one reconciliation-run row per document with non-null reconciliation_status
INSERT INTO document_processor_run
    (document_id, processor_id, trigger_kind, ingestion_id, document_version,
     started_at, ended_at, status, result)
SELECT d.id, 'reconciliation', 'ingestion',
       (SELECT id FROM ap_ingestion WHERE document_id = d.id
         ORDER BY completed_at DESC NULLS LAST LIMIT 1),
       1, d.updated_at, d.updated_at, 'completed',
       jsonb_build_object('status', d.reconciliation_status)
  FROM document d
 WHERE d.reconciliation_status IS NOT NULL;

-- 6) Backfill a synthetic 'validations' run per document with validation-results
INSERT INTO document_processor_run
    (document_id, processor_id, trigger_kind, ingestion_id, document_version,
     started_at, ended_at, status, result)
SELECT d.id, 'validations', 'ingestion',
       (SELECT id FROM ap_ingestion WHERE document_id = d.id
         ORDER BY completed_at DESC NULLS LAST LIMIT 1),
       1, d.updated_at, d.updated_at, 'completed',
       d.structured_data -> 'validation-results'
  FROM document d
 WHERE d.structured_data ? 'validation-results';

-- 7) Seed document.diagnostics from the keys currently in structured_data
UPDATE document SET diagnostics = jsonb_strip_nulls(jsonb_build_object(
    'validations',    structured_data -> 'validation-results',
    'fraud-flags',    structured_data -> 'fraud-flags',
    'tax-issues',     structured_data -> 'tax-issues',
    'line-items',     (SELECT jsonb_object_agg(li ->> 'id',
                                               jsonb_build_object('vat-validation',
                                                                  li -> 'vat-validation'))
                         FROM jsonb_array_elements(structured_data -> 'line-items') li
                        WHERE li ? 'vat-validation'),
    'reconciliation', CASE WHEN reconciliation_status IS NOT NULL
                           THEN jsonb_build_object('status', reconciliation_status) END
))
WHERE structured_data IS NOT NULL;

-- 8) Strip those keys from structured_data (including missing-fields per §8)
UPDATE document SET
    structured_data = structured_data
                    - 'validation-results'
                    - 'fraud-flags'
                    - 'tax-issues'
                    - 'missing-fields'
WHERE structured_data IS NOT NULL;

-- 8b) Strip vat-validation from each line item
UPDATE document SET structured_data = jsonb_set(
    structured_data, '{line-items}',
    COALESCE(
        (SELECT jsonb_agg(li - 'vat-validation' ORDER BY (li ->> 'order')::int)
           FROM jsonb_array_elements(structured_data -> 'line-items') li),
        '[]'::jsonb))
WHERE structured_data ? 'line-items'
  AND jsonb_typeof(structured_data -> 'line-items') = 'array';

-- 9) Replace matching pg_notify trigger with a processor-run trigger
DROP TRIGGER IF EXISTS trigger_document_matching_event ON document;
DROP FUNCTION IF EXISTS notify_matching_event();

CREATE OR REPLACE FUNCTION notify_processor_run_event() ...;
CREATE TRIGGER trigger_processor_run_event ... ;  -- full body per §6.3

Down migration recreates the dropped columns/tables with NULL values (data is not recoverable bit-for-bit; this is a one-way migration in practice).

9.2 PENDING-CLEANUPS.md additions

Appended to the file created by the edit-history plan:

## `ap_ingestion_post_process_stat` (entire table)
- **Replaced by:** `document_processor_run` (unified per-processor
  run history, any trigger kind).
- **Stopped being written:** <DATE>, when the diagnostics
  recompute pipeline shipped and post-process handlers started
  writing to `document_processor_run`.
- **Gate to drop:** backfill verified, no open queries against the
  old table.

## `document.matching_status`, `matching_error`, `matching_attempts`, `matching_failed_at`
- **Replaced by:** `document_processor_run` rows where `processor_id
  = 'matching'`. Latest status via `DISTINCT ON (processor_id) …
  ORDER BY document_version DESC`; attempts via `COUNT(*)`.
- **Stopped being written:** <DATE>.
- **Gate to drop:** all readers migrated to the new table; the
  pg_notify trigger replaced by `trigger_processor_run_event`.

## `matching_status` ENUM type
- **Replaced by:** `processor_run_status` ENUM.
- **Gate to drop:** after the columns above are dropped.

## `document.reconciliation_status`
- **Replaced by:** `document.diagnostics.reconciliation.status`
  (materialized) and `document_processor_run` rows where
  `processor_id = 'reconciliation'`.
- **Stopped being written:** <DATE>.
- **Gate to drop:** reconciliation UI reads from `document.diagnostics`.

## `structured_data.{validation-results, fraud-flags, tax-issues}`
## `structured_data.line-items[*].vat-validation`
- **Replaced by:** `document.diagnostics` + `document_processor_run`.
- **Stopped being written:** <DATE>.
- **Gate to drop:** migration verified, all readers moved to
  `document.diagnostics`.

9.3 Infra & ops touchpoints

Not schema, but must ship in the same release:

9.4 Downstream updates (non-schema)

10. Risk & rollout