Note (2026-04-24): After this document was written,
legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Once inline editing ships (edit-history plan), every user edit
puts derived values out of date with no path to refresh them. The
derived values — validation results, fraud flags, tax-compliance
issues, per-line-item VAT checks, matching, reconciliation — today
live in three places: embedded in document.structured_data,
scattered across per-subsystem columns on document, and split across
tables (document_match, ap_document_cluster). There is no unified
audit of derivations and no recompute lifecycle other than full
re-ingestion.
Introduce diagnostics as a first-class concept. One dedicated
JSONB column on document holds the materialized latest snapshot. A
new unified document_processor_run table records every execution of
every processor (both ingestion-time and edit-triggered), replacing
ap_ingestion_post_process_stat and absorbing per-document
matching/reconciliation status tracking. A workers-service-hosted
pipeline, triggered 60 seconds after the last edit (SQS-delayed
idempotent handler), rebuilds diagnostics for the document's current
version. The detail view marks stale and in-progress sections
inline (grayed + badge) and auto-refreshes each one as its run
completes, reusing the existing document-events SSE stream.

- document.diagnostics (JSONB materialized snapshot).
- document_processor_run table; dropped ap_ingestion_post_process_stat.
- structured_data: validation-results, fraud-flags, tax-issues, per-line-item vat-validation.
- structured_data.missing-fields (dead LLM metadata; see §8).
- document: matching_status, matching_error, matching_attempts, matching_failed_at, reconciliation_status (all derivable from the run table; see §9).
- running → completed/failed) for every processor run, regardless of trigger (ingestion / edit / manual).
- needs_human_review re-sourced from document.diagnostics, preserving today's criterion exactly.
- PENDING-CLEANUPS.md additions for the columns/tables made redundant.
- init_aws script, test fixtures).

- triggered_by_history_id and per-processor rows, so a future "which subsystems does this path affect?" decision layer slots in cleanly. For this release every edit-triggered cycle runs all diagnostic processors.
- 'manual' value is added to processor_run_trigger so the schema is ready for it.

The edit-history plan must be merged before this one. This design assumes:

- document.version exists and is bumped on every edit.
- document_history table exists with its change_type enum.
- trg_update_document_from_ingestion trigger is gone).
- resources/migrations/PENDING-CLEANUPS.md exists; this plan appends to it.

- validations, fraud-detector, matching, accounts-matcher, …).
- document_processor_run.
- 'ingestion' (pipeline), 'edit' (debounced recompute), 'manual' (future; user-initiated).
- document_version vs document.version and status. See §6.

document.diagnostics

ALTER TABLE document ADD COLUMN diagnostics JSONB;
Materialized latest snapshot of all diagnostic outputs. Read directly by the detail view. Top-level keys are absent when the corresponding processor has never successfully run for this document.
Shape:
{
  "validations": {
    "<check-name>": { "status": "pass|warning|error|uncertain",
                      "field": "...", "message": "...",
                      "details": {...}, "resolved-value": ...,
                      "confidence": 0.9, "reasoning": "..." },
    ...
  },
  "fraud-flags": [
    { "rule-id": "ef1-01", "type": "bank-account-mismatch",
      "severity": "warning",
      "message": "...", "details": {...}, "suggestion": "..." },
    ...
  ],
  "tax-issues": [
    { "type": "missing-vat-id", "severity": "error",
      "message": "...", "suggestion": "..." },
    ...
  ],
  "line-items": {
    "<line-item-id>": { "vat-validation": { "status": "valid|invalid|warning|skipped",
                                            "expected-rate": 19,
                                            "reasoning": "...",
                                            "suggestion": "..." } },
    ...
  },
  "matching": {
    "matches": [ { "document-id": "<uuid>",
                   "blended-score": 0.95,
                   "llm-confidence": "high",
                   "match-method": "llm" },
                 ... ]
  },
  "reconciliation": { "status": "reconciled|incomplete|error", "details": {...} }
}
line-items is keyed by the per-line id assigned in the
edit-history plan, so per-line diagnostics stay correctly associated
across reorderings and deletions.
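
To make the keyed-by-id point concrete, here is a minimal sketch of how the line-items slice could be assembled from a vector of line items. The function name line-item-diagnostics and the sample ids are hypothetical; the sketch assumes each line item is a map carrying its "id" and, optionally, a "vat-validation" entry.

```clojure
(defn line-item-diagnostics
  "Builds the diagnostics line-items map, keyed by line-item id.
  Items without a \"vat-validation\" entry are left out."
  [line-items]
  (into {}
        (for [li line-items
              :when (contains? li "vat-validation")]
          [(get li "id") {"vat-validation" (get li "vat-validation")}])))

;; Hypothetical sample data; ids and rates are illustrative only.
(def sample-items
  [{"id" "li-1" "vat-validation" {"status" "valid" "expected-rate" 19}}
   {"id" "li-2"}
   {"id" "li-3" "vat-validation" {"status" "warning" "expected-rate" 7}}])

;; Position is never consulted, so reordering the items yields the same map.
(def same-after-reorder?
  (= (line-item-diagnostics sample-items)
     (line-item-diagnostics (reverse sample-items))))
```

Because the map is keyed by id rather than by array index, deleting "li-2" or moving "li-3" to the front leaves the remaining entries attached to the right lines.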
matching is a summary snapshot. The relational document_match
table remains the primary query source for "which other documents
match this one" lookups — its pairwise, cross-document shape is not
something JSONB should represent. The snapshot exists in
diagnostics.matching so the run row's audit is self-contained and
the UI has one source of truth to render.
document_processor_run

CREATE TYPE processor_run_status AS ENUM ('pending', 'running', 'completed', 'failed');
CREATE TYPE processor_run_trigger AS ENUM ('ingestion', 'edit', 'manual');
CREATE TABLE document_processor_run (
id UUID PRIMARY KEY DEFAULT uuidv7(),
document_id UUID NOT NULL REFERENCES document(id) ON DELETE CASCADE,
processor_id TEXT NOT NULL,
trigger_kind processor_run_trigger NOT NULL,
ingestion_id UUID REFERENCES ap_ingestion(id) ON DELETE CASCADE,
triggered_by_history_id UUID REFERENCES document_history(id),
document_version INT NOT NULL,
started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
ended_at TIMESTAMPTZ,
input_tokens INTEGER,
output_tokens INTEGER,
model TEXT,
commit_sha TEXT,
status processor_run_status NOT NULL DEFAULT 'running',
result JSONB,
error TEXT,
CONSTRAINT doc_processor_run_trigger_xor CHECK (
(trigger_kind = 'ingestion' AND ingestion_id IS NOT NULL AND triggered_by_history_id IS NULL)
OR (trigger_kind = 'edit' AND ingestion_id IS NULL AND triggered_by_history_id IS NOT NULL)
OR (trigger_kind = 'manual' AND ingestion_id IS NULL AND triggered_by_history_id IS NULL)
)
);
CREATE INDEX idx_doc_processor_run_doc_proc_version
ON document_processor_run(document_id, processor_id, document_version DESC);
CREATE INDEX idx_doc_processor_run_ingestion
ON document_processor_run(ingestion_id) WHERE ingestion_id IS NOT NULL;
One row per run. processor_id is free-form text (not an ENUM) so
that adding new processors in the future does not require a migration.
Known values used by this release:

- validations, fraud-detector, tax-compliance-analyzer, vat-validation, matching, reconciliation.
- accounts-matcher, cost-center-matcher, accruals-matcher, supplier-matcher, supplier-verifier, service-category, bu-code, financial-validation-resolver, uncertain-validations-resolver.

The trigger-XOR constraint keeps each row's provenance unambiguous.
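
The same provenance rule can be checked in application code before a row is inserted. The predicate below is a sketch only: valid-provenance? and its kebab-case keys are hypothetical names that mirror the columns used by doc_processor_run_trigger_xor, and the database CHECK remains the authority.

```clojure
(defn valid-provenance?
  "Mirrors the doc_processor_run_trigger_xor CHECK for a candidate row map.
  Keys are hypothetical kebab-case versions of the column names."
  [{:keys [trigger-kind ingestion-id triggered-by-history-id]}]
  (case trigger-kind
    "ingestion" (and (some? ingestion-id) (nil? triggered-by-history-id))
    "edit"      (and (nil? ingestion-id) (some? triggered-by-history-id))
    "manual"    (and (nil? ingestion-id) (nil? triggered-by-history-id))
    ;; Any other trigger kind is rejected.
    false))
```

An edit-triggered row that also carries an ingestion_id, for example, fails the predicate, just as the constraint would reject it.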

structured_data

Per-document-type schemas
(schema/invoice/structured_data.clj,
schema/purchase_order/structured_data.clj,
schema/goods_received_note/structured_data.clj,
schema/contract/structured_data.clj) lose these top-level keys:

- :validation-results → moves to diagnostics.validations.
- :fraud-flags → moves to diagnostics.fraud-flags.
- :tax-issues → moves to diagnostics.tax-issues.
- :missing-fields → removed entirely (§8).

The LineItem schemas lose :vat-validation (moves to
diagnostics.line-items[<id>].vat-validation).

document

Five columns become derivable from document_processor_run rows and
are queued for removal in PENDING-CLEANUPS.md (see §9):

- matching_status
- matching_error
- matching_attempts
- matching_failed_at
- reconciliation_status

Latest state per subsystem is read via a SELECT … ORDER BY document_version DESC, started_at DESC LIMIT 1 (or DISTINCT ON (processor_id)), served by the idx_doc_processor_run_doc_proc_version index. Attempt counts use
COUNT(*).
The existing matching_status ENUM type is also queued for removal
(§9).

- ap_ingestion per-phase timings (transcription / classification / extraction). These are ingestion-only and never recompute on edit; moving them would add churn without benefit.
- document_match pairwise table, ap_document_cluster table, and document.cluster_id: the relational shape is load-bearing for cross-document queries and matching is still an active writer.
- document.needs_human_review (a boolean summary column): kept, re-sourced from diagnostics; same criterion as today (§7).

Every processor run writes to document_processor_run in two phases,
regardless of trigger kind:

1. At start, INSERT the run row with status='running', started_at = now(), ended_at = NULL, result = NULL, document_version = <doc version at dispatch time>.
2. On completion, UPDATE it to status='completed', ended_at = now(), result = <subsystem output>, token counts and model populated (for LLM processors). On failure: status='failed', ended_at = now(), error = <message>, result may be NULL.

The UI reads the current state off this row, so no separate "is it
recomputing?" flag is needed on the document. The pg_notify trigger
(§6.3) fires on every row transition so the SSE stream delivers both
diagnostic-run-started and diagnostic-run-completed events.
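
The two-phase lifecycle can be sketched as a pair of pure functions over a run map. This is an illustration under assumptions, not the worker's actual code: start-run, finish-run, and the kebab-case keys are hypothetical, and a real implementation would issue the INSERT and UPDATE against document_processor_run instead of returning maps.

```clojure
(defn start-run
  "Phase 1: the row as it would be INSERTed when the processor is dispatched."
  [{:keys [document-id processor-id trigger-kind document-version]} now]
  {:document-id      document-id
   :processor-id     processor-id
   :trigger-kind     trigger-kind
   :document-version document-version
   :status           "running"
   :started-at       now
   :ended-at         nil
   :result           nil})

(defn finish-run
  "Phase 2: the terminal UPDATE. An :error in the outcome marks the run failed;
  otherwise the run is completed and carries the processor's result."
  [run now {:keys [result error]}]
  (if error
    (assoc run :status "failed" :ended-at now :error error)
    (assoc run :status "completed" :ended-at now :result result)))
```

Keeping the terminal transition in one place makes it harder for a processor to finish without recording either a result or an error.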
The ingestion-completion handler (introduced by the edit-history plan) now additionally writes run rows. The pipeline:

- ap_ingestion columns as today).
- validations, accounts-matcher, cost-center-matcher, accruals-matcher, supplier-matcher, supplier-verifier, tax-compliance-analyzer, financial-validation-resolver. Each INSERTs its run row at start (status='running') and UPDATEs on completion.
- fraud-detector, uncertain-validations-resolver. Same two-phase write pattern.
- vat-validation runs as its own named processor, producing one run row per execution whose result is a map keyed by line-item id:
  {"<line-item-id>": {"status": ..., "expected-rate": ..., ...}, ...}.
  The per-line VAT check was previously inline in validation.clj; promoting it to a standalone processor gives it its own audit row and lets it recompute independently on edits.
- document_history ingestion row + document.structured_data write):
  - document.structured_data is set to the editable-only portion of the extraction output (i.e. without validation-results, fraud-flags, tax-issues, missing-fields, or per-line-item vat-validation).
  - document.diagnostics is set to the assembled snapshot (validations, fraud-flags, tax-issues, per-line VAT validations keyed by id, whichever are present at this point; matching and reconciliation may still be pending and their keys absent until their workers finish).
  - document.needs_human_review is computed from the new document.diagnostics using the existing criterion (§7).
  - document.version is bumped (as in edit-history).

Diagnostic-only refreshes do not bump document.version. Version
tracks changes to editable state. Matching, reconciliation, and
every edit-triggered recompute update document.diagnostics and
document.needs_human_review in place, gated by the run's own
document_version equaling document.version at commit time
(see §4.4). If the document has been edited since the run started,
the run's output is captured in document_processor_run.result for
audit but is not merged into document.diagnostics.
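
The version gate can be illustrated with a small pure function. Everything named here is an assumption made for the sketch: apply-run-result, the slice-key mapping from processor id to diagnostics key, and the in-memory document map stand in for what would really be a guarded UPDATE inside the commit transaction. The per-line vat-validation slice is left out for brevity.

```clojure
(def slice-key
  "Hypothetical mapping from processor id to its top-level diagnostics key."
  {"validations"             "validations"
   "fraud-detector"          "fraud-flags"
   "tax-compliance-analyzer" "tax-issues"
   "matching"                "matching"
   "reconciliation"          "reconciliation"})

(defn apply-run-result
  "Merges a completed run's result into the document's diagnostics only when
  the run was computed against the document's current version. Otherwise the
  document is returned unchanged; the result still lives on the run row."
  [document {:keys [processor-id document-version status result]}]
  (let [k (slice-key processor-id)]
    (if (and k
             (= status "completed")
             (= document-version (:version document)))
      (assoc-in document [:diagnostics k] result)
      document)))
```

A run started at version 2 that finishes after the document reached version 3 therefore leaves document.diagnostics untouched, which is the behaviour described above.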
Matching and reconciliation run on their existing post-ingestion
worker schedule. Each creates its own run row (with trigger_kind = 'ingestion' and the same ingestion_id) and updates
document.diagnostics.matching / .reconciliation on its terminal
transition — subject to the version gate above.

1. document-id, the new document.version, the edit's history-id, and enqueued-at.
2. DelaySeconds), the consumer wakes up in the workers service.
3. document.version and the max created_at on document_history for this document. If there is any document_history row newer than the message's enqueued-at, a later edit has already scheduled its own message: this one is stale, skip and ack.
4. trigger_kind='edit', triggered_by_history_id = <message's history-id>, document_version = <current doc version>, status='running'. Dispatch work (same invocation shape as ingestion-time).
5. document.diagnostics and recomputes document.needs_human_review, but only if the run's document_version equals document.version at commit time. If the document has been edited again in the meantime, the run's output is captured in the run row but not applied to the materialized document.diagnostics (§4.4).

Phase ordering for edit-triggered cycles mirrors ingestion: phase-2 processors (fraud, uncertain-validations) start only after phase-1 processors commit. Orchestration detail is deferred to the implementation plan.

An edit arriving mid-cycle does not cancel anything running. Two outcomes:

- document_version is now behind document.version. When each completes, the guard at §4.3 step 5 prevents its slice from being applied to document.diagnostics; the UI sees them as stale (run.document_version < document.version) and a freshly-enqueued cycle for the new version will eventually supersede them.

Matching continues to write document_match rows (the relational,
pairwise shape) on every successful matching run. Additionally the
matching processor:

- document_processor_run row (with processor_id = 'matching').
- document.diagnostics.matching (array of match refs: document-id, score, confidence, method).
- document-events pub with a :diagnostic-run-completed event whose :processor/id is "matching", replacing today's trigger_document_matching_event trigger on document.matching_status (see §9).

Reconciliation writes its run row and its summary into
document.diagnostics.reconciliation. No more
document.reconciliation_status column.

New queue created in both production and local environments:

- v1-orcha-global-diagnostics-recompute
- v1-orcha-global-diagnostics-recompute-dlq
- ingest_queue's 10-minute budget to let long-running processors finish).

The consumer runs inside the workers service (the same JVM process that hosts the ingestion worker). It shares DB pool, LLM clients, and processor implementations with the ingestion pipeline: no new codepaths for LLM work, only a new entrypoint that invokes the existing processors against an already-ingested document.

{:document-id "<uuid>"
:history-id "<uuid-of-triggering-edit>"
:document-version <int> ;; version at enqueue time
:enqueued-at "<iso-instant>"}
Enqueue is part of the edit handler's transaction (SQS send is deferred to commit via an outbox-style pattern or a post-commit hook; exact mechanism is an implementation detail left to the plan).
Idempotency logic at the consumer:
(defn should-run? [db-pool {:keys [document-id enqueued-at]}]
  (let [{:document/keys [version]} (fetch-document db-pool document-id)
        newest-edit (latest-history-at db-pool document-id)]
    ;; If any edit occurred after this message was enqueued,
    ;; a later (and more up-to-date) message exists in the queue.
    ;; Note: compare needs both values to be the same comparable type; the
    ;; message carries :enqueued-at as an ISO string, so parse it first.
    (not (and newest-edit
              (pos? (compare (:created-at newest-edit) enqueued-at))))))
When should-run? is false, ack the message and do nothing. When
true, dispatch for the current document version.
fetch document + version
if should-run?
  for each diagnostic processor:
    insert run row status='running', document_version=doc.version
    enqueue processor work
  orchestrate phase ordering (phase-2 waits for phase-1 commits)
  each processor on completion:
    update own run row to completed/failed with result/error
    if run.document_version == current doc.version:
      merge result slice into document.diagnostics
      recompute document.needs_human_review
    NOTIFY document_events (via pg_notify trigger)
The detail-view handler fetches:
{:document/keys [diagnostics version]} ...
latest-runs = (latest-run-per-processor db-pool document-id)
latest-runs is a single query:
SELECT DISTINCT ON (processor_id) processor_id, status, document_version,
result, error, started_at, ended_at
FROM document_processor_run
WHERE document_id = $1
ORDER BY processor_id, document_version DESC, started_at DESC;
Per diagnostic section, the renderer classifies state. Content in
all non-empty cases comes from document.diagnostics.<subsystem> —
which always carries the most recently applied snapshot (written
by the last run whose document_version equaled document.version
at commit). No second lookup needed.
| Condition | State | Rendering |
|---|---|---|
| no row AND no diagnostics slice | never-run | Empty-state placeholder, "no analysis yet". |
| status='running' | recomputing | Gray opacity + "Recomputing…" pill in section header. Content from document.diagnostics (empty if first-ever run). |
| status='completed' AND document_version = document.version | current | Normal rendering from document.diagnostics.<subsystem>. |
| status='completed' AND document_version < document.version | stale | Gray opacity + "Stale (version behind)" pill. Content from document.diagnostics. |
| status='failed' | error | Red pill + error message text. Content from document.diagnostics (last-successful snapshot) if any. |

Both stale and recomputing states use the grayed-out pattern (opacity ~0.55) plus a distinctive header badge; the visual language is shared so the user reads "don't trust these yet" either way.
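
The classification rules in the table can be written as one function of the document version, the latest run row for the processor, and whether a diagnostics slice exists. This is a sketch with hypothetical names (classify-section, has-slice?). Two cases that the table does not cover are handled by assumption and flagged in comments: a slice with no run row is treated as current, and a 'pending' run is treated as recomputing.

```clojure
(defn classify-section
  "Returns the display state for one diagnostic section.
  latest-run is the newest document_processor_run row for the processor
  (or nil); has-slice? says whether document.diagnostics holds content for it."
  [doc-version latest-run has-slice?]
  (let [{:keys [status document-version]} latest-run]
    (cond
      ;; Assumption: a slice without any run row renders as current.
      (nil? latest-run)                      (if has-slice? :current :never-run)
      (= status "running")                   :recomputing
      (= status "failed")                    :error
      (and (= status "completed")
           (< document-version doc-version)) :stale
      (= status "completed")                 :current
      ;; Assumption: 'pending' (not in the table) is shown as recomputing.
      :else                                  :recomputing)))
```

For example, a run completed at version 2 on a document now at version 3 classifies as :stale, and the same run at version 3 classifies as :current.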
The editable-value helper introduced by the edit-history plan is
unaffected — it wraps editable leaves, which are all outside of
diagnostics.
No new SSE endpoint. The detail-view SSE handler
(src/com/getorcha/app/http/documents/view/shared.clj:1018) already
subscribes to document-events keyed by tenant-id and dispatches
on :event/type. The case branch is extended:
:diagnostic-run-started
;; Re-render the section with the 'recomputing' state.
(let [processor-id (:processor/id event)]
  {:event "diagnostic-run-started"
   :data  (hiccup/html (render-section-recomputing document-id processor-id ...))})

:diagnostic-run-completed
;; Re-render the section with the completed content (or error state).
(let [processor-id (:processor/id event)]
  {:event "diagnostic-run-completed"
   :data  (hiccup/html (render-section document-id processor-id ...))})
HTMX on the client listens with hx-ext="sse" + per-section
sse-swap="diagnostic-run-completed-<processor>" attributes (or a
single sse-swap that carries processor-id in the event name). The
exact HTMX attribute shape is an implementation detail; the server
side is a pure extension of the existing exec-fn.
Today, trigger_document_matching_event
(resources/migrations/20260302194439-add-matching-event-trigger.up.sql)
fires pg_notify('document_events', …) on document.matching_status
transitions. That column is going away.
A new trigger on document_processor_run replaces it:
CREATE OR REPLACE FUNCTION notify_processor_run_event()
RETURNS TRIGGER AS $$
DECLARE
payload jsonb;
le_tenant_id uuid;
legal_entity_id uuid;
BEGIN
-- Only notify on state transitions that matter to the UI
IF NEW.status IS NULL OR NEW.status NOT IN ('running', 'completed', 'failed') THEN
RETURN NEW;
END IF;
SELECT d.legal_entity_id, le.tenant_id INTO legal_entity_id, le_tenant_id
FROM document d JOIN legal_entity le ON le.id = d.legal_entity_id
WHERE d.id = NEW.document_id;
payload := jsonb_build_object(
'event/type', CASE WHEN NEW.status = 'running'
THEN 'diagnostic-run-started'
ELSE 'diagnostic-run-completed' END,
'document/id', NEW.document_id::text,
'processor/id', NEW.processor_id,
'document-version', NEW.document_version,
'run-status', NEW.status::text,
'legal-entity/id', legal_entity_id::text,
'tenant/id', le_tenant_id::text
);
PERFORM pg_notify('document_events', payload::text);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_processor_run_event
AFTER INSERT OR UPDATE OF status ON document_processor_run
FOR EACH ROW
WHEN (NEW.status IN ('running', 'completed', 'failed'))
EXECUTE FUNCTION notify_processor_run_event();
Drop the old trigger_document_matching_event and its
notify_matching_event function in the same migration.

needs_human_review

The column stays on document. Criterion is unchanged: true iff:

- diagnostics.validations has status = 'error', OR
- diagnostics.fraud-flags has severity = 'critical'.

Only the source location changes: from structured_data.validation-results
/ structured_data.fraud-flags to diagnostics.validations /
diagnostics.fraud-flags. The logic moves to an application-level
helper (the trigger-based derivation was already scheduled for removal
by the edit-history plan).

When does it update?

- document.diagnostics.
- document.diagnostics, when the run's document_version equals document.version.

Queue semantics (list views filtering
WHERE needs_human_review = true) are unchanged. NULL continues to
mean "initial analysis not yet complete"; such documents are excluded
from the queue.
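
A minimal sketch of the application-level helper, assuming the diagnostics JSONB is read into a Clojure map with string keys and reading the criterion as "any validation check with status error, or any fraud flag with severity critical". The name needs-human-review? is hypothetical. Returning nil for missing diagnostics is one way to keep the NULL meaning described above.

```clojure
(defn needs-human-review?
  "nil when diagnostics are absent (initial analysis not yet complete);
  otherwise true iff any validation has status \"error\" or any fraud flag
  has severity \"critical\"."
  [diagnostics]
  (when (some? diagnostics)
    (boolean
     (or (some #(= "error" (get % "status"))
               (vals (get diagnostics "validations")))
         (some #(= "critical" (get % "severity"))
               (get diagnostics "fraud-flags"))))))
```

Warnings and non-critical fraud flags do not put a document in the review queue under this reading, which matches the unchanged criterion.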

missing-fields removal

structured_data.missing-fields is LLM-reported extraction metadata
(the prompt asks the LLM to list fields it could not extract). A
codebase search confirms no reader of the top-level key exists. The
same information is already derivable — and is in fact derived — by
the required-fields check inside the validations processor
(validation.clj:548), which reads current structured_data against
per-document-type required-field sets and stores the list under
validation-results[:required-fields][:details][:missing-fields].
This plan removes the top-level field entirely rather than letting it decay into a stale LLM opinion after edits:

- :missing-fields from schema/invoice/structured_data.clj:273, schema/purchase_order/structured_data.clj:24, schema/goods_received_note/structured_data.clj:29, schema/contract/structured_data.clj:96.
- "missing-fields": [...] line from the four extraction prompt templates in workers/ap/ingestion/extraction.clj (at approximately lines 394, 950, 1207, 1481, one per document type).
- UPDATE document SET structured_data = structured_data - 'missing-fields' WHERE structured_data ? 'missing-fields';.

No PENDING-CLEANUPS.md entry: this is a full removal, not a
deprecation.
The derivational answer to "what's currently missing?" lives where
it already lives, inside
diagnostics.validations.required-fields.details.missing-fields, and
recomputes naturally with every validations run.
One file, ships with the release. Ordering matters — the schema changes land first, the data backfill runs against the new schema, and the triggers flip last.
-- up
-- 1) New types + table
CREATE TYPE processor_run_status AS ENUM ('pending', 'running', 'completed', 'failed');
CREATE TYPE processor_run_trigger AS ENUM ('ingestion', 'edit', 'manual');
CREATE TABLE document_processor_run ( ... ); -- full schema per §3.2
CREATE INDEX idx_doc_processor_run_doc_proc_version ...;
CREATE INDEX idx_doc_processor_run_ingestion ...;
-- 2) New diagnostics column
ALTER TABLE document ADD COLUMN diagnostics JSONB;
-- 3) Backfill run rows from ap_ingestion_post_process_stat
INSERT INTO document_processor_run
(document_id, processor_id, trigger_kind, ingestion_id, document_version,
started_at, ended_at, input_tokens, output_tokens, model, status, commit_sha)
SELECT i.document_id, s.processor_id, 'ingestion', s.ingestion_id, 1,
s.started_at, s.ended_at, s.input_tokens, s.output_tokens, s.model,
'completed', i.commit_sha
FROM ap_ingestion_post_process_stat s
JOIN ap_ingestion i ON i.id = s.ingestion_id
WHERE i.document_id IS NOT NULL;
-- 4) Backfill one matching-run row per document with non-null matching_status
INSERT INTO document_processor_run
(document_id, processor_id, trigger_kind, ingestion_id, document_version,
started_at, ended_at, status, error)
SELECT d.id, 'matching', 'ingestion',
(SELECT id FROM ap_ingestion WHERE document_id = d.id
ORDER BY completed_at DESC NULLS LAST LIMIT 1),
1,
COALESCE(d.matching_failed_at, d.updated_at),
CASE WHEN d.matching_status IN ('succeeded','failed','skipped')
THEN COALESCE(d.matching_failed_at, d.updated_at) END,
CASE d.matching_status
WHEN 'pending' THEN 'pending'
WHEN 'in-progress' THEN 'running'
WHEN 'succeeded' THEN 'completed'
WHEN 'failed' THEN 'failed'
WHEN 'skipped' THEN 'completed'
END::processor_run_status,
d.matching_error
FROM document d
WHERE d.matching_status IS NOT NULL;
-- 5) Backfill one reconciliation-run row per document with non-null reconciliation_status
INSERT INTO document_processor_run
(document_id, processor_id, trigger_kind, ingestion_id, document_version,
started_at, ended_at, status, result)
SELECT d.id, 'reconciliation', 'ingestion',
(SELECT id FROM ap_ingestion WHERE document_id = d.id
ORDER BY completed_at DESC NULLS LAST LIMIT 1),
1, d.updated_at, d.updated_at, 'completed',
jsonb_build_object('status', d.reconciliation_status)
FROM document d
WHERE d.reconciliation_status IS NOT NULL;
-- 6) Backfill a synthetic 'validations' run per document with validation-results
INSERT INTO document_processor_run
(document_id, processor_id, trigger_kind, ingestion_id, document_version,
started_at, ended_at, status, result)
SELECT d.id, 'validations', 'ingestion',
(SELECT id FROM ap_ingestion WHERE document_id = d.id
ORDER BY completed_at DESC NULLS LAST LIMIT 1),
1, d.updated_at, d.updated_at, 'completed',
d.structured_data -> 'validation-results'
FROM document d
WHERE d.structured_data ? 'validation-results';
-- 7) Seed document.diagnostics from the keys currently in structured_data
UPDATE document SET diagnostics = jsonb_strip_nulls(jsonb_build_object(
'validations', structured_data -> 'validation-results',
'fraud-flags', structured_data -> 'fraud-flags',
'tax-issues', structured_data -> 'tax-issues',
'line-items', (SELECT jsonb_object_agg(li ->> 'id',
jsonb_build_object('vat-validation',
li -> 'vat-validation'))
FROM jsonb_array_elements(structured_data -> 'line-items') li
WHERE li ? 'vat-validation'),
'reconciliation', CASE WHEN reconciliation_status IS NOT NULL
THEN jsonb_build_object('status', reconciliation_status) END
))
WHERE structured_data IS NOT NULL;
-- 8) Strip those keys from structured_data (including missing-fields per §8)
UPDATE document SET
structured_data = structured_data
- 'validation-results'
- 'fraud-flags'
- 'tax-issues'
- 'missing-fields'
WHERE structured_data IS NOT NULL;
-- 8b) Strip vat-validation from each line item
UPDATE document SET structured_data = jsonb_set(
structured_data, '{line-items}',
COALESCE(
(SELECT jsonb_agg(li - 'vat-validation' ORDER BY (li ->> 'order')::int)
FROM jsonb_array_elements(structured_data -> 'line-items') li),
'[]'::jsonb))
WHERE structured_data ? 'line-items'
AND jsonb_typeof(structured_data -> 'line-items') = 'array';
-- 9) Replace matching pg_notify trigger with a processor-run trigger
DROP TRIGGER IF EXISTS trigger_document_matching_event ON document;
DROP FUNCTION IF EXISTS notify_matching_event();
CREATE OR REPLACE FUNCTION notify_processor_run_event() ...;
CREATE TRIGGER trigger_processor_run_event ... ; -- full body per §6.3
Down migration recreates the dropped columns/tables with NULL values (data is not recoverable bit-for-bit; this is a one-way migration in practice).

PENDING-CLEANUPS.md additions

Appended to the file created by the edit-history plan:
## `ap_ingestion_post_process_stat` (entire table)
- **Replaced by:** `document_processor_run` (unified per-processor
run history, any trigger kind).
- **Stopped being written:** <DATE>, when the diagnostics
recompute pipeline shipped and post-process handlers started
writing to `document_processor_run`.
- **Gate to drop:** backfill verified, no open queries against the
old table.
## `document.matching_status`, `matching_error`, `matching_attempts`, `matching_failed_at`
- **Replaced by:** `document_processor_run` rows where `processor_id
= 'matching'`. Latest status via `DISTINCT ON (processor_id) …
ORDER BY document_version DESC`; attempts via `COUNT(*)`.
- **Stopped being written:** <DATE>.
- **Gate to drop:** all readers migrated to the new table; the
pg_notify trigger replaced by `trigger_processor_run_event`.
## `matching_status` ENUM type
- **Replaced by:** `processor_run_status` ENUM.
- **Gate to drop:** after the columns above are dropped.
## `document.reconciliation_status`
- **Replaced by:** `document.diagnostics.reconciliation.status`
(materialized) and `document_processor_run` rows where
`processor_id = 'reconciliation'`.
- **Stopped being written:** <DATE>.
- **Gate to drop:** reconciliation UI reads from `document.diagnostics`.
## `structured_data.{validation-results, fraud-flags, tax-issues}`
## `structured_data.line-items[*].vat-validation`
- **Replaced by:** `document.diagnostics` + `document_processor_run`.
- **Stopped being written:** <DATE>.
- **Gate to drop:** migration verified, all readers moved to
`document.diagnostics`.
Not schema, but must ship in the same release:

- infra/stacks/foundation_stack.py: add diagnostics_recompute_queue + diagnostics_recompute_dlq, modeled on ingest_queue (visibility 600 s, DLQ max-receive 3).
- scripts/init_aws.clj: add an sqs-diagnostics-recompute-queue config var and a (create-queue-with-dlq! sqs-diagnostics-recompute-queue) call in the local init flow (around line 236, alongside sqs-ingestion-queue / sqs-acquisition-queue / sqs-matching-queue).
- test/com/getorcha/test/fixtures.clj: ensure the new queue is created at test-system startup (tests that exercise the recompute pipeline will enqueue and consume from it).
- resources/com/getorcha/config.edn: add the queue URL entry (or equivalent config surface) so the workers service and the edit handler share the reference.

- src/com/getorcha/system.clj and related integrant configs): register the diagnostics-recompute consumer component alongside the ingestion worker. Share the DB pool, LLM client, and processor namespaces.
- DelaySeconds = 60.
- document.diagnostics as part of the completion transaction; update document.needs_human_review from the new location.
- src/com/getorcha/workers/ap/matching/worker.clj): write a processor_id='matching' run row (two-phase); write document.diagnostics.matching on success; stop writing document.matching_status et al. (they still exist in the DB until their cleanup gate, but readers move off).
- src/com/getorcha/workers/ap/matching/reconciliation.clj): same pattern; writes document.diagnostics.reconciliation; stops writing document.reconciliation_status.
- src/com/getorcha/app/http/documents/view/*.clj and src/com/getorcha/app/ui/components.clj): source all diagnostic reads from document.diagnostics instead of document.structured_data; add the per-section state classifier and stale/recomputing rendering; extend the SSE exec-fn in shared.clj:1018 with the two new event types.
- extraction.clj, four templates): remove missing-fields from the required output JSON.
- schema/invoice/structured_data.clj, schema/purchase_order/structured_data.clj, schema/goods_received_note/structured_data.clj, schema/contract/structured_data.clj): remove :validation-results, :fraud-flags, :tax-issues, :missing-fields from the root shape, and :vat-validation from LineItem.
- src/com/getorcha/schema/diagnostics.clj defining the document.diagnostics JSONB shape (Validations, FraudFlag, TaxIssue, VatValidation, MatchingSummary, ReconciliationSummary), reusing the existing schema definitions moved from structured_data.clj.
- structured_data.validation-results or ap_ingestion_post_process_stat (e.g. debug-doc, ingestion-regression-test) must point at document.diagnostics and document_processor_run. The edit-history plan already calls out similar changes for its own moves; these extend that list.
- scripts/debug_fetch_document.clj must copy document_processor_run rows when pulling a prod document into local; the jsonb-keys registry in scripts/debug_common.clj must include document.diagnostics and document_processor_run.result so they round-trip.

- ap_ingestion_post_process_stat backfills 1:1. Matching rows with matching_attempts > 1 lose per-attempt timing (only the latest state is represented). The synthetic validations row is a single completed snapshot per document; pre-release there was no per-run audit for validations at all, so this is strictly additive. Editable-enrichment processors (accounts, cost-center, accruals, supplier, bu-code, service-category) get backfilled run rows with result = NULL because their outputs were folded into structured_data and cannot be cleanly recovered per-processor. This is documented in the migration notes.
- trigger_document_matching_event and installs trigger_processor_run_event in the same file. The window between the two is a single transaction, so no SSE events are missed at migration time.
- event/type values, so old clients remain functional during a rolling deploy. New diagnostics events only reach clients that have the updated exec-fn and client-side HTMX extensions.
- document.version, and the queued message's idempotency check will skip (a newer edit / ingestion event exists). The re-ingestion itself writes a full set of run rows for the new version.