Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Unified Processors — Design

1. Scope & goals

Problem

Three mechanisms in the codebase today run "derive something from document state, record the run, write outputs":

Post-processor pipeline (IProcessor protocol + post-process/run) — the 9 in-ingestion processors (accounts, cost-center, accruals, supplier-matcher, supplier-verifier, tax-compliance-analyzer, financial-validation-resolver, fraud-detector, uncertain-validations-resolver).
Matching SQS worker (matching.worker/process-document!) — bespoke orchestration that runs matching, writes diagnostics, and drives reconciliation per cluster.
Diagnostics-recompute SQS worker (just landed in document-diagnostics) — placeholder stubs for edit-triggered recomputes.

Each one inserts document_processor_run rows, writes diagnostic slices, and conditionally mutates structured_data. They diverge on protocol shape, side-effect discipline, and trigger handling. Adding edit-triggered recompute to the mix (the reason this work started) would introduce a fourth implementation and compound the drift.

Separately, the diagnostics work shipped without a story for conditional recomputation: today an edit-triggered recompute would blindly re-run every processor, wasting LLM spend and risking the engine overwriting user-entered values.

Approach

Unify under one Processor protocol and one run-processors! engine. Every current processor (including matching and reconciliation) becomes an implementation. Two callers — ingestion and edit-recompute — parameterise the same engine with different phase lists and modes.

Introduce declarative :reads / :writes metadata on each processor so the engine can (a) skip processors whose declared reads weren't touched by the triggering edit, and (b) refuse to overwrite any field a user has manually edited.

Reorganise namespaces so matching lives alongside the other processors (workers.ap.processors.*), and rename -slice → -diagnostic to match the terminology in the rest of the domain.

In scope

One IProcessor protocol (v2) with ops-based apply; replaces today's IProcessor contract (the name is kept, see §3).
One engine: run-processors! handles run rows, apply-filtering, diagnostic writes, phase ordering.
Conditional recomputation based on leaf-only read declarations + malli seq-regex matching.
Write-protection against historical user edits via a document-provenance lookup shared with the UI.
Non-leaf edit-op expansion (add/remove line-item → virtual leaf ops).
Namespace reorg: workers.ap.processors.*, matching + reconciliation move under it.
publish-document-ready! → processors.matching.queue/enqueue!.
Migration of all current post-processors + matching + reconciliation to the unified protocol.
Consumer ctx expansion: diagnostics-recompute/consumer gets llm-config, search-config, notifications refs.
Notifications on edit-triggered failures (same path as ingestion).

Out of scope

Any change to document_processor_run schema.
Any change to document.diagnostics schema (still JSONB, same slice keys).
Any change to the edit handler / SQS enqueue path.
New processors. Migration only.
Changing how matching computes matches or how reconciliation compares line items.

2. Concepts

2.1 Processor

A unit of work that derives something from document state. Every processor:

Has a stable :id (e.g. :accounts, :matching, :fraud-detector).
Declares the set of input leaves it reads (:reads).
Declares the set of output leaves it may mutate in structured-data (:writes) — empty if the processor only writes a diagnostic slice.
Declares its diagnostic slice (:diagnostic, was :slice) — the top-level key in document.diagnostics it owns, or nil.
Declares which triggers it runs under (:modes ⊆ #{:ingestion :edit}).
Implements -compute: (fn [ctx state]) → {:result _ :stats _}.
Implements -apply-ops: (fn [state result]) → [patch-ops] — may be empty. Patch ops are JSON Patch maps with the project's [id=X] line-item extension.

2.2 State

Replaces the ad-hoc ingestion map threaded through IProcessors. Single shape passed through the engine:

{:document        <document-row>              ;; includes :structured-data, :version, :legal-entity-id
 :legal-entity    <legal-entity-row>
 :file            {:contents <bytes-or-nil> :mime-type <string>}
 :structured-data <map>                       ;; mirror of :document/structured-data for convenience
 :commit-sha      <string-or-nil>
 :ingestion-id    <uuid-or-nil>               ;; set in :ingestion mode only
 :history-id      <uuid-or-nil>}              ;; set in :edit mode only

:file is populated on demand by the engine only for processors that need PDF bytes (currently tax-compliance's vision fallback). Lazy to avoid gratuitous S3 fetches.

2.3 Leaves

Every :reads entry and every :writes entry is a leaf path — a path whose tail addresses an atomic field (scalar, or a primitive inside a [:vector <primitive>]), per the structured-data malli schema. Subtree declarations ([:cat [:= :issuer]]) are forbidden.

Leaves are expressed as malli seq-regex patterns over clj-path segments:

[:cat [:= :issuer] [:= :name]]
[:cat [:= :line-items] :any [:= :description]]
[:cat [:= :line-items] :any [:= :debit-account]]

:any matches one segment of any shape — array index, {:id X} map, or keyword. No other wildcards are needed.

An authoring helper lowers a terse list of path segments to seq-regex, with :* standing in for :any:

(read-leaf :line-items :* :description)
;; => [:cat [:= :line-items] :any [:= :description]]
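A minimal sketch of how such a helper could be written, assuming :* is the only wildcard token and every other segment lowers to an equality check. The body is an illustration, not the project's actual implementation.

```clojure
(defn read-leaf
  "Lowers terse path segments to a malli seq-regex pattern.
  :* becomes :any; every other segment becomes [:= segment]."
  [& segments]
  (into [:cat]
        (map (fn [seg] (if (= :* seg) :any [:= seg])))
        segments))

(read-leaf :line-items :* :description)
;; => [:cat [:= :line-items] :any [:= :description]]
```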

2.4 Diagnostic (renamed)

-slice → -diagnostic everywhere (protocol method, engine param, helper fn names). Semantics unchanged: one top-level key in document.diagnostics owned by one processor. Processors whose output lives only in structured-data return nil.

3. IProcessor protocol (v2)

Keep the existing name. Modernise the contract:

(defprotocol IProcessor
  (-id         [this])                   ;; :keyword
  (-reads      [this state])             ;; [seq-regex-pattern...] — state-aware for per-type dispatch
  (-writes     [this state])             ;; [seq-regex-pattern...] — may be empty
  (-diagnostic [this state])             ;; DiagnosticSpec or vector thereof, or nil
  (-modes      [this])                   ;; #{:ingestion :edit}
  (-always?    [this])                   ;; boolean — bypasses conditional filter
  (-compute    [this ctx state])         ;; -> {:result _ :stats _}
  (-apply-ops  [this state result]))     ;; -> [json-patch-op ...]

-reads, -writes, and -diagnostic take state so processors can dispatch on document type. matching's reads differ for invoice/purchase-order/contract/GRN; validations similarly dispatches its sub-path ownership per document type (see §13). State is guaranteed to contain :document with :document/structured-data including :document-type before these methods are called.

-compute replaces today's -compute [this] (which closed over context and ingestion at construction). The ctx/state switch lets the engine reuse a single processor instance per run and supply the canonical state map.

-apply-ops replaces today's -apply [this sd result] → new-sd. Ops form is required so the engine can filter against user-edited paths before applying. Returned ops use string keys ({"op" "replace" "path" "..." "value" ...}) matching the JSON-Patch convention used by json-patch/apply-patch and the edit handlers.

-always? is an escape hatch for processors that read too broadly to enumerate (validations — cheap deterministic checks over the whole document). When true, the conditional filter never skips this processor in :edit mode; reads declaration is ignored. Defaults to false.
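To make the contract concrete, here is a hedged sketch of a toy implementation. The protocol is repeated so the snippet stands alone. The processor itself (cost-center-example, the constant "CC-100", and the shape of its result) is hypothetical and only illustrates how -compute output feeds -apply-ops.

```clojure
(defprotocol IProcessor
  (-id         [this])
  (-reads      [this state])
  (-writes     [this state])
  (-diagnostic [this state])
  (-modes      [this])
  (-always?    [this])
  (-compute    [this ctx state])
  (-apply-ops  [this state result]))

;; Hypothetical processor: assigns the same cost center to every line item.
(def cost-center-example
  (reify IProcessor
    (-id [_] :cost-center)
    (-reads [_ _state] [[:cat [:= :line-items] :any [:= :description]]])
    (-writes [_ _state] [[:cat [:= :line-items] :any [:= :cost-center]]])
    (-diagnostic [_ _state] nil) ; output lives only in structured-data
    (-modes [_] #{:ingestion :edit})
    (-always? [_] false)
    (-compute [_ _ctx state]
      (let [items (get-in state [:structured-data :line-items])]
        ;; :result maps line-item id -> cost center (assumed shape)
        {:result (into {} (map (fn [li] [(:id li) "CC-100"])) items)
         :stats  {:line-items (count items)}}))
    (-apply-ops [_ _state result]
      ;; String-keyed JSON Patch ops using the [id=X] line-item extension.
      (mapv (fn [[id cc]]
              {"op"    "replace"
               "path"  (str "/line-items[id=" id "]/cost-center")
               "value" cc})
            result))))
```

Because -apply-ops returns ops rather than a new structured-data map, the engine can inspect each path before deciding whether to apply it.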

3.1 DiagnosticSpec

-diagnostic returns one of:

nil                                        ;; no diagnostic write
{:slice :kw}                               ;; replace whole slice
{:slice :kw :sub-path [k1 k2 ...]}         ;; replace at sub-path
{:slice :kw :sub-paths [[k1] [k2] ...]}    ;; multiple sub-paths, each written
                                           ;; separately with the result map's
                                           ;; top-level keys
[spec1 spec2 ...]                          ;; multi-slice: write to several slices
                                           ;; from one processor's result

Multi-slice (the vector form) is used by tax-compliance-analyzer, which writes both :tax-issues and :line-items from a single LLM call. When the vector form is used, -compute's result is a map with top-level keys matching each slice:

{:tax-issues [...]
 :line-items {"li-abc" {:vat-validation {...}} ...}}

Engine routes each top-level key to its slice.
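The routing can be sketched as a pure function from a DiagnosticSpec and a result to a list of writes. This is one possible reading of the shapes in §3.1, not the engine's code: the function name diagnostic-writes and the {:path :value} write shape are assumptions, as is the interpretation that :sub-paths entries are looked up in the result by the same key path.

```clojure
(defn diagnostic-writes
  "Turns a DiagnosticSpec plus a -compute result into a vector of
  {:path [slice & sub-path] :value v} writes (hypothetical shape)."
  [spec result]
  (cond
    ;; No diagnostic declared: nothing to write.
    (nil? spec) []

    ;; Multi-slice: each spec receives the result's value under its slice key.
    (vector? spec)
    (vec (mapcat (fn [s] (diagnostic-writes s (get result (:slice s)))) spec))

    ;; Several sub-paths: one write per sub-path, value looked up in result.
    (:sub-paths spec)
    (mapv (fn [p] {:path (into [(:slice spec)] p) :value (get-in result p)})
          (:sub-paths spec))

    ;; Single sub-path: the whole result lands at that sub-path.
    (:sub-path spec)
    [{:path (into [(:slice spec)] (:sub-path spec)) :value result}]

    ;; Whole-slice replace.
    :else
    [{:path [(:slice spec)] :value result}]))
```

With the tax-compliance-analyzer spec, only the :tax-issues and :line-items keys of the result produce writes; any other top-level keys are ignored by the routing.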

4. Engine: `run-processors!`

(run-processors! ctx state phases)

ctx — db-pool, llm-config, search-config, aws, notifications, etc.
state — see §2.2. Carries :mode, :trigger-kind, :ingestion-id / :history-id, :edited-by, the document/legal-entity/structured-data/file/diagnostics, and (in :edit mode) a precomputed :changed-leaves set.
phases — [[Processor ...] [Processor ...]]. Each sub-vector is a concurrency group; groups run sequentially.

No separate processors-filter argument. Filtering happens inside the engine per phase.

Flow

For each phase, in order:

Refresh: refetch document.diagnostics and populate state.diagnostics. Phase-1 processors see empty/stale diagnostics; phase-2 sees phase-1's writes (e.g. tax-compliance-analyzer reads state.diagnostics.validations.tax-id-format that validations wrote in phase 1).
Schedule: in :edit mode, drop processors whose -always? is false AND whose -reads (evaluated against current state) don't intersect :changed-leaves AND which have a :completed run at the current document version. In :ingestion mode, no filtering.
Execute in parallel (virtual-thread-per-processor):
- Insert document_processor_run row (:running, with trigger-kind from state).
- Call -compute with ctx + state.
- If -diagnostic (evaluated against state) is non-nil, write the slice(s) via db.diagnostics/update-diagnostic! (§9). Multi-slice specs route per-slice values from the result map.
- If -apply-ops returns non-empty ops: filter against the user-edit set (§7) in :edit mode (no filter in :ingestion mode — see §5.4), then merge remaining ops into state's in-memory :structured-data. The engine does NOT persist structured-data mutations to the document row here — persistence is batched at end-of-engine-run (§4.1).
- Mark run :completed with :result + :stats.
- On exception: mark :failed, fire notification hook (§8).
Fold: after each phase, the engine returns an updated state reflecting applied ops and refreshed diagnostics. Subsequent phases see prior phases' mutations.
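The execute-then-fold shape of the flow above can be illustrated with a heavily simplified, in-memory sketch. Everything here is an assumption made for illustration: processors are plain maps with :compute and :apply-ops functions, ctx is dropped, ops are {:path clj-path :value v} maps applied with assoc-in instead of JSON Patch, future stands in for virtual threads, and run rows, diagnostics, filtering, and failure handling are omitted.

```clojure
(defn run-phase
  "Runs every processor of one concurrency group against the SAME input
  state, then folds the resulting ops into :structured-data."
  [state processors]
  (let [;; Start all computations first so they overlap, then collect.
        pending (mapv (fn [{:keys [compute apply-ops]}]
                        (future (apply-ops state (compute state))))
                      processors)
        ops     (mapcat deref pending)]
    (reduce (fn [st {:keys [path value]}]
              (assoc-in st (into [:structured-data] path) value))
            state
            ops)))

(defn run-phases
  "Groups run sequentially; each group sees the previous group's mutations."
  [state phases]
  (reduce run-phase state phases))
```

The property this is meant to show is the one the phase lists rely on: two processors in the same group both read the pre-phase state, while a processor in a later group reads the folded result of every earlier group.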

4.1 Persistence

The engine mutates state.structured-data in memory across phases. When the engine run completes, if state.structured-data differs from state.initial-structured-data:

:ingestion mode: the engine does NOT persist. The caller (ingestion worker) persists the final state via complete-ingestion! — single UPDATE + document_history row with change_type = 'ingestion' + version++.
:edit mode: the engine computes the aggregate JSON-Patch (diff from initial to final structured-data), writes a single UPDATE on document + one document_history row with change_type = 'derivation' + version++. The aggregate patch records the net effect of all processor ops across all phases.
Matching worker: matching's -apply-ops returns [] (matching side-effects land in document_match + cluster_id, not structured-data). No aggregate patch, no version bump.

A new change_type = 'derivation' enum value is added to document_history_change_type. Derivation rows have edited_by = NULL and ingestion_id = NULL; only patch is populated. The constraint on document_history is relaxed to allow this combination.

Return value

The updated state. Callers wire this back into their end-of-pipeline logic (e.g., ingestion's complete-ingestion! writes the final document_history row + document update in the same transaction).

5. Modes

5.1 `:ingestion`

-apply-ops results are applied with no filter — re-ingestion establishes truth from scratch, overwriting any prior user edits.
All declared processors run regardless of reads (no conditional filter).
state.trigger-kind = :ingestion; state.ingestion-id required.

5.2 `:edit`

-apply-ops results are filtered through the user-edit set (§7) before being applied. User-edited paths block processor writes at the same or descendant path.
Conditional filter: processors whose -always? is false AND reads don't intersect :changed-leaves AND which already have a completed run at the current version are skipped. Cold-start exemption: never-run processors always run.
state.trigger-kind = :edit; state.history-id and state.edited-by required.

5.3 `:manual` (future)

Not part of this work, but reserved at the protocol level for ops scripts that need to force a recompute without an ingestion or edit anchor.

6. Conditional recomputation

6.1 Changed leaves

Given the current edit's patch (from document_history.patch), the engine produces the set of changed leaves:

Parse each op's path string to a clj-path via json-patch.path/pointer->clj-path.
If the op's path resolves to a leaf (per structured-data malli schema), keep it.
Otherwise (non-leaf ops: add/remove/replace subtree), expand by walking the op's value (or the pre-patch value for remove) and emitting one virtual leaf path per atomic descendant.

The expansion uses the StructuredData schema to decide "is this atomic?" — primitive schemas and [:vector <primitive>] are atomic; maps and vectors of maps are not.
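The expansion of a non-leaf op can be sketched as a recursive walk. As a stand-in for the schema lookup, this sketch decides atomicity from the shape of the value itself (scalars and vectors of scalars are atomic), and it addresses vector children by index rather than by {:id X}. Both are simplifying assumptions, and the function names are hypothetical.

```clojure
(defn atomic?
  "Value-shape stand-in for the schema check: scalars and vectors of
  scalars are atomic; maps and vectors of maps are not."
  [v]
  (or (not (coll? v))
      (and (sequential? v) (every? (complement coll?) v))))

(defn expand-leaves
  "Returns one clj-path per atomic descendant of v, rooted at prefix."
  [prefix v]
  (cond
    (atomic? v)     [prefix]
    (map? v)        (mapcat (fn [[k child]] (expand-leaves (conj prefix k) child)) v)
    (sequential? v) (apply concat
                           (map-indexed (fn [i child] (expand-leaves (conj prefix i) child)) v))))
```

For an op that adds a whole line item, the walk starts at the op's path and emits one virtual leaf per field of the added value, which is what lets a read pattern such as [:cat [:= :line-items] :any [:= :description]] match the edit.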

6.2 Scheduling filter

For each processor in the phase:

If processor's -always? is true → run.
Else if processor has never had a :completed run at the document's current version → run (cold start).
Else if any of the processor's -reads patterns matches (via m/validate) any leaf in the changed-leaves set → run.
Else → skip (no run row written; no diagnostic update).

The filter runs per phase, immediately before execution. Later phases see earlier phases' mutations, which may themselves touch leaves relevant to later-phase processors; the filter recomputes between phases.
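The decision order above can be sketched without a malli dependency. The matcher below handles only the restricted pattern subset used in this document ([:cat ...] of [:= x] and :any) and stands in for m/validate. The processor is represented as a plain map with hypothetical keys, where :completed-at-version? stands for the run-row lookup.

```clojure
(defn matches-leaf?
  "Restricted stand-in for m/validate: supports [:cat ...] patterns whose
  segments are either :any or [:= x]."
  [pattern leaf]
  (let [segs (rest pattern)]
    (and (= (count segs) (count leaf))
         (every? true?
                 (map (fn [p s] (or (= :any p) (= (second p) s)))
                      segs leaf)))))

(defn should-run?
  "Applies the scheduling rules in order: always-run, cold start,
  read/changed-leaf intersection, otherwise skip."
  [{:keys [always? completed-at-version? reads]} changed-leaves]
  (cond
    always?                     true
    (not completed-at-version?) true
    :else (boolean
           (some (fn [pattern]
                   (some (fn [leaf] (matches-leaf? pattern leaf)) changed-leaves))
                 reads))))
```

For example, an edit to a line item's description schedules any processor whose reads include the line-item description pattern, while a processor that reads only issuer fields and already has a completed run is skipped.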

6.3 Edit-context feed

The state receives a computed :changed-leaves set on entry to :edit mode so processors can consume it too (e.g., an efficient processor might compute only the deltas). Optional — the filter handles most cases; this is for processors that want finer control.

7. Write-protection

7.1 User-edited path set

Reuses the shared logic from view/provenance.clj, extracted to a new namespace com.getorcha.document.provenance:

(provenance/user-edited-paths db-pool document-id)
;; => #{"/issuer/name" "/line-items[id=abc]/debit-account" ...}

Internally walks document_history newest → oldest, collecting op paths up to (but not including) the most recent :ingestion row. Same behaviour that the UI relies on.

The extracted ns exports both:

document-provenance (path → {:edited-by :edited-at}, used by UI)
user-edited-paths (just the key set, used by the engine)
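The history walk can be sketched as a pure function over rows that are already loaded and ordered newest first. The row shape ({:change-type ... :patch [...]}) is an assumption for illustration; the real function takes db-pool and document-id as shown above.

```clojure
(defn user-edited-paths
  "Collects op paths from history rows, newest first, stopping at (and
  excluding) the most recent :ingestion row."
  [history-newest-first]
  (->> history-newest-first
       (take-while (fn [row] (not= :ingestion (:change-type row))))
       (mapcat :patch)
       (map (fn [op] (get op "path")))
       (into #{})))
```

The sketch collects paths from every row it sees before the ingestion boundary. Whether the new derivation rows (§4.1) should be skipped during this walk is not specified above, so that choice is left open here.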

7.2 Op filter

The filter runs in :edit mode only. Ingestion mode applies every op unconditionally (re-ingestion wipes state).

For each op the engine is about to apply:

Parse op.path to a clj-path.
Check whether any prefix of that clj-path (including itself) is in the user-edited set (converted to clj-paths for the comparison).
If yes → drop the op, log at :info level with processor id + op path + blocking user-edited path.
Else → apply.

Prefix matching is important: user edited /line-items[id=abc] (wholesale replace) must block processor writes to /line-items[id=abc]/debit-account.
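A minimal sketch of the prefix check, assuming paths have already been parsed to clj-path vectors and each op carries its parsed path under a hypothetical :clj-path key. Logging of dropped ops is omitted.

```clojure
(defn blocked-by
  "Returns the user-edited path that blocks op-path (any prefix of it,
  including itself), or nil when the op may be applied."
  [user-edited op-path]
  (some (fn [n]
          (let [prefix (subvec op-path 0 n)]
            (when (contains? user-edited prefix) prefix)))
        (range 1 (inc (count op-path)))))

(defn filter-ops
  "Drops blocked ops in :edit mode; applies no filter in :ingestion mode."
  [mode user-edited ops]
  (if (= :ingestion mode)
    (vec ops)
    (vec (remove (fn [op] (blocked-by user-edited (:clj-path op))) ops))))
```

Returning the blocking path from blocked-by, rather than a boolean, gives the caller what it needs for the :info log line described above.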

7.3 Test coverage

Unit-test the filter with representative scenarios:

Scalar user edit blocks a scalar processor write at the same path.
Subtree user edit (e.g. /line-items[id=abc]) blocks all descendant processor writes.
Processor write on a DIFFERENT line-item id is unaffected.
Ingestion mode does NOT filter (re-ingestion overwrites user edits).

8. Notifications

Failed processor runs fire admin notifications in both modes. Notification payload includes:

{:kind       :processor/failure
 :processor  :matching
 :trigger    {:kind :edit :history-id #uuid ... :edited-by #uuid ...}
 :document   {:id ... :legal-entity-id ... :file-original-name ...}
 :error      <message>}

For :ingestion triggers :edited-by is nil. For :edit triggers the admin payload includes which user performed the edit (useful when an edit pattern is destabilising a processor). Today's :matching/permanent-failure and :reconciliation/failure kinds are folded into the single :processor/failure kind, with the processor id in the payload.

9. Renames

Old	New
`IProcessor` (protocol)	`IProcessor` (kept; contract extended)
`db.diagnostics/update-slice!`	`db.diagnostics/update-diagnostic!` (no alias — all callers updated)
`publish-document-ready!`	`processors.matching.queue/enqueue!`
`:matching/permanent-failure`	`:processor/failure` with `:processor :matching`
`:reconciliation/failure`	`:processor/failure` with `:processor :reconciliation`
`compute!` (post-process.clj)	absorbed into engine
`with-run-row!` (post-process.clj)	absorbed into engine
`run-processor-phases`	absorbed into engine
`run-phase`	absorbed into engine
`tax-compliance/run-vat-validation`	deleted (TCA writes `:line-items` diagnostic slice directly; §13.2)
`validation/validate` (multimethod)	deleted (callsites moved to `validations`, FVR, UVR processors)
`with-validations` (ingestion.clj)	deleted (validations is phase 1 of the engine)

New database:

document_history_change_type enum value 'derivation' (§4.1).
document_history CHECK constraint relaxed to allow change_type='derivation' AND ingestion_id IS NULL AND edited_by IS NULL.

10. Namespace reorganisation

src/com/getorcha/workers/ap/
  ingestion.clj                       ;; shrinks; runs extraction + calls engine
  ingestion/
    classification.clj
    extraction.clj
    transcription.clj
    vat_rules.clj
    validation.clj                    ;; pure rules (check-* functions)
    post_process/                     ;; [DELETED — see processors/]
  processors/
    engine.clj                        ;; IProcessor protocol + run-processors!
    reads.clj                         ;; seq-regex helpers, leaf expansion
    accounts.clj                      ;; (moved from post_process/)
    accruals.clj
    cost_center.clj
    financial_validation.clj
    fraud.clj
    supplier.clj
    tax_compliance.clj
    uncertain_validations.clj
    validations.clj                   ;; NEW — wraps ingestion/validation.clj
    matching.clj                      ;; NEW — wraps match-document! + reconcile-cluster!
    matching/
      queue.clj                       ;; NEW — enqueue! (was publish-document-ready!)
      core.clj                        ;; unchanged internals
      candidates.clj
      evidence.clj
      llm_decision.clj
      normalize.clj
      reconciliation.clj              ;; moved — still internal, no longer a separate processor
      searchable_text.clj

Note: the provenance logic moves to com.getorcha.document.provenance (new top-level ns, neutral between UI + workers). The UI's existing com.getorcha.app.http.documents.view.provenance ns becomes a thin shim that re-exports document-provenance (or is deleted with the UI callers updated to the new ns, whichever is cheaper at implementation time).

11. Phase lists

Ingestion pipeline change

Today the ingestion pipeline in workers.ap.ingestion runs these stages sequentially:

transcribe → classify → extract → validate (with-validations) → post-process → complete

validate mutates structured-data with :validation-results; post-process runs the 9 post-processors through the old IProcessor protocol and mutates structured-data again.

Under the unified model, validate and post-process collapse into a single engine call with THREE phases. The pipeline becomes:

transcribe → classify → extract → run-processors! [validations] [post-procs…] [fraud] → complete

validations (always-run, deterministic-only) runs alone in phase 1 because phase-2 processors read its output. The nine existing post-processors plus the new validations processor distribute across the three phases as listed under "Ingestion (invoice)" below. (There is no separate vat-validation processor: tax-compliance-analyzer writes the per-line vat-validation diagnostic directly; see §13.)

Ingestion (invoice)

The engine replaces both with-validations and post-process/run. Ingestion calls:

(engine/run-processors!
  ctx state-in-ingestion-mode
  [;; Phase 1 — deterministic validations (fast; produces the validation
   ;; statuses that downstream LLM processors consult)
   [validations]
   ;; Phase 2 — enrichment, analysis, resolvers
   [accounts cost-center accruals
    supplier-matcher supplier-verifier
    tax-compliance-analyzer
    financial-validation-resolver
    uncertain-validations-resolver]
   ;; Phase 3 — sees phase-2 mutations (e.g. tax-id correction)
   [fraud-detector]])

Ingestion then calls processors.matching.queue/enqueue! to hand off to the matching SQS worker. Matching stays async for latency isolation.

Phase rationale:

Phase 1 (validations) must run first: tax-compliance-analyzer reads :validations.tax-id-format to decide whether to enter its vision-correction branch; financial-validation-resolver and uncertain-validations-resolver consume uncertain statuses that validations produces. Today this is achieved by with-validations running before post-process/run; under the unified model it's just phase 1.
Phase 2 is the bulk of post-processors. financial-validation-resolver and uncertain-validations-resolver are here rather than a later phase because they read only validations (phase 1) output, not phase-2 peer output.
Phase 3 (fraud) sees phase-2 corrections (e.g. tax-id corrected by tax-compliance-analyzer). This preserves today's fraud-after-corrections sequencing.

Matching worker

The matching SQS worker handles only the post-ingestion continuation (edit mode runs matching inline):

(engine/run-processors!
  ctx state-in-ingestion-mode
  [[matching]])

matching's -compute runs match-document! then invokes reconcile-cluster! for each affected cluster. Reconciliation is NOT a separate processor — it's an internal step of matching. matching writes both :matching and :reconciliation diagnostic slices (see §13).

Edit recompute

(engine/run-processors!
  ctx state-in-edit-mode
  [;; Phase 1 — validations (always runs; produces statuses downstream reads)
   [validations]
   ;; Phase 2 — everything else, conditionally
   [tax-compliance-analyzer
    fraud-detector matching
    accounts cost-center accruals
    supplier-matcher supplier-verifier
    financial-validation-resolver
    uncertain-validations-resolver]])

Phase 1 is the same as ingestion's phase 1: validations runs first because phase-2 processors (tax-compliance-analyzer, FVR, UVR) read its output.

Phase 2 collapses ingestion's phases 2 and 3. In :edit mode -apply-ops mutations are filtered by the user-edit set, so phase-2 corrections typically don't propagate to phase-3 readers in a meaningful way; when they would, the reader will recompute on a subsequent edit anyway. Running fraud-detector alongside tax-compliance-analyzer in phase 2 is a small latency win. It is acceptable because fraud-detector's output is a diagnostic (not a correction), so a slightly stale tax-id-type at fraud time only produces a slightly stale fraud flag.

Every phase-2 processor declares both modes (#{:ingestion :edit}); the conditional filter (§6) decides which actually run based on :changed-leaves. validations runs unconditionally (-always? true).

Matching's reconciliation sub-step is sequenced inside matching's -compute, not at the engine's phase level. reconcile-cluster! still inserts its own document_processor_run rows for cluster-peer documents and writes their :reconciliation slices — these are side effects of the matching processor on OTHER documents, outside the engine's current-document scope. The engine itself only tracks runs/slices for the document that triggered the run.

12. Migration table

Every current processor gets a v2 profile. Values below are illustrative for the spec; exact reads/writes are locked in during implementation from source inspection.

Processor	Reads (leaves, terse)	Writes (structured-data)	Diagnostic	Modes	Always?
accounts	`issuer.name`, `issuer.vat-id`, `issuer.country`, `line-items.*.description`, `line-items.*.amount`	`line-items.*.debit-account`, `line-items.*.credit-account`	—	`{:ingestion :edit}`	no
cost-center	`issuer.name`, `line-items.*.description`, `line-items.*.amount`	`line-items.*.cost-center`	—	`{:ingestion :edit}`	no
accruals	`invoice-date`, `line-items.*.description`	`line-items.*.accrual`	—	`{:ingestion :edit}`	no
supplier-matcher	`issuer.name`, `issuer.vat-id`, `issuer.iban`	`supplier-match`	—	`{:ingestion :edit}`	no
supplier-verifier	`issuer.name`, `issuer.vat-id`, `issuer.country`, `issuer.address`	`supplier-verification-id`	—	`{:ingestion :edit}`	no
tax-compliance-analyzer	`issuer.country`, `issuer.tax-id`, `issuer.tax-id-type`, `recipient.country`, `recipient.tax-id-type`, `shipping-country`, `line-items.*.tax-rate`, `line-items.*.description`, `delivery-terms-raw`, `incoterm-code`, `compliance-statements.*.text`	`service-category`, `line-items.*.bu-code`; tax-id-correction branch (vision PDF) is `:ingestion`-only — see §14	multi-slice: `:tax-issues` + `:line-items` (see §13)	`{:ingestion :edit}`	no
financial-validation-resolver	`subtotal`, `total`, `tax-amount`, `line-items.*.amount`, `line-items.*.quantity`, `line-items.*.unit-price`	—	`:validations.financial-math` (sub-path — see §13)	`{:ingestion :edit}`	no
fraud-detector	`issuer.name`, `issuer.country`, `issuer.vat-id`, `issuer.tax-id`, `issuer.iban`, `issuer.account-number`, `issuer.sort-code`, `issuer.routing-number`, `issuer.bsb`, `recipient.country`, `invoice-date`, `line-items.*.description`	—	`:fraud-flags`	`{:ingestion :edit}`	no
uncertain-validations-resolver	`issuer.name`, `issuer.address`, `recipient.name`, `recipient.address`, `invoice-date`, `invoice-number`	—	`:validations.{required-fields,date-reasonableness,recipient-identity}` (sub-paths — see §13)	`{:ingestion :edit}`	no
validations	(see §12.1 — per-doc-type dispatch)	—	`:validations` (per-doc-type sub-paths — see §13)	`{:ingestion :edit}`	yes
matching	(per-doc-type — see §12.2)	matches rows in `document_match`, cluster-id on `document`, cluster reconciliation state on `ap_document_cluster`; also triggers `reconcile-cluster!` which writes `:reconciliation` slice for each cluster peer (§13.3)	`:matching`	`{:ingestion :edit}`	no

12.1 `validations` — per-doc-type dispatch, always-run

validations is the only -always? true processor. It runs cheap deterministic checks over the whole document. Enumerating every leaf it touches would produce a brittle declaration, and the cost profile (a few hundred microseconds, no LLM calls) doesn't justify the filtering overhead.

Its reads/writes/diagnostic vary by document type (via -reads [this state] dispatch on state.document.structured-data.document-type):

Doc type	Sub-paths owned by `validations`
invoice	`:tax-id-format`, `:iban-format`, `:issuer-country`, `:recipient-country`, `:large-document-summary-only` (invoice-specific checks; `:financial-math` owned by FVR, `:required-fields`/`:date-reasonableness`/`:recipient-identity` owned by UVR)
purchase-order	`:required-fields` (the whole validations slice for POs)
contract	`:signature-presence`, `:required-fields`, `:date-validity`, `:party-identification`, `:financial-consistency`, `:termination-clause`
goods-received-note	`:required-fields`

Contract/PO/GRN have no LLM validation resolvers; validations owns their entire :validations slice. Invoice has FVR and UVR (§13) owning specific sub-paths.

12.2 Matching — per-doc-type reads

Matching's internal code (normalize.clj, searchable_text.clj, evidence.clj) dispatches on :document/type when extracting counterparty names, references, and scoring fields. The processor's -reads mirrors this dispatch:

Doc type	Read leaves
invoice	`issuer.name`, `issuer.vat-id`, `issuer.iban`, `invoice-number`, `total`, `currency`, `line-items.*.description`, `line-items.*.quantity`, `line-items.*.unit`, `po-references.*`, `gr-references.*`, `service-period.start`, `service-period.end`
purchase-order	`supplier.name`, `supplier.vat-id`, `po-number`, `total-value`, `currency`, `line-items.*.description`, `line-items.*.quantity`, `line-items.*.unit`, `contract-references.*`, `requisition-numbers.*`
contract	`counterparty.name`, `counterparty.tax-id`, `contract-number`, `total-value`, `currency`, `deliverables.*`
goods-received-note	`supplier.name`, `supplier.vat-id`, `grn-number`, `line-items.*.description`, `line-items.*.quantity`, `line-items.*.unit`, `po-references.*`, `delivery-note-numbers.*`

12.3 Reconciliation — internal to matching

Reconciliation is not a separate processor. It's a sub-step of matching's -compute: after match-document! writes matches and assigns/merges clusters, matching calls reconcile-cluster! for each affected cluster with ≥ 2 documents, and writes the :reconciliation diagnostic slice for the edited document. For cluster peers, the existing peer-cluster run-row + slice-writing in reconcile-cluster! remains unchanged (the engine only tracks the current document's runs).

13. Diagnostic slice co-ownership

:validations is written by THREE processors: validations (base deterministic checks), financial-validation-resolver (resolves the financial-math sub-check), uncertain-validations-resolver (resolves required-fields, date-reasonableness, recipient-identity sub-checks).

Under the old pipeline these flowed through a shared structured-data.validation-results map that processors merged into. Under the new model the slice is a single JSONB object; co-ownership requires merge-not-replace semantics with non-overlapping ownership.

Resolution: -diagnostic returns either a single spec (slice + optional sub-path) or a vector of specs (multi-slice). See §3.1 for the full shape definition. The engine routes per-processor slice writes via jsonb_set for atomicity per sub-path.

13.1 Ownership of `:validations` slice — invoice

Sub-path	Owner
`[:financial-math]`	`financial-validation-resolver`
`[:required-fields]`	`uncertain-validations-resolver`
`[:date-reasonableness]`	`uncertain-validations-resolver`
`[:recipient-identity]`	`uncertain-validations-resolver`
`[:tax-id-format]`	`validations`
`[:iban-format]`	`validations`
`[:issuer-country]`	`validations`
`[:recipient-country]`	`validations`
`[:large-document-summary-only]`	`validations`

For contract, PO, GRN: validations owns the whole :validations slice (no resolvers exist for those doc types). See §12.1.

Each sub-path has exactly ONE owner per document type. No two processors write to the same sub-path. Small refactor of existing code: today's check-financial-math, check-required-fields, check-date-reasonableness, check-recipient-identity stay in ingestion/validation.clj as pure functions but the composition moves out of validation/validate into the respective resolver processors (which do the deterministic part + the LLM refinement in one -compute). For invoice, validations' -compute stops emitting those four sub-paths. For contract/PO/GRN, validations still runs all deterministic checks for that type.

13.2 Ownership of `:tax-issues` and `:line-items` slices

tax-compliance-analyzer owns BOTH diagnostic slices (:tax-issues invoice-level, :line-items per-line :vat-validation) via multi-slice -diagnostic return (§3.1).

TCA's -diagnostic returns:

[{:slice :tax-issues} {:slice :line-items}]

TCA's -compute result has corresponding top-level keys (plus whatever the processor wants for its own -apply-ops):

{:tax-issues [{:type :missing-vat-id :severity "warning" ...} ...]
 :line-items {"li-abc" {:vat-validation {...}}
              "li-def" {:vat-validation {...}}}
 ;; processor-internal: used by -apply-ops to build structured-data ops
 :service-category   {...}
 :bu-codes           {"li-abc" {...} "li-def" {...}}
 :tax-id-correction  {:status "corrected" :tax-id "..." :tax-id-type "..."}}

Engine reads top-level keys matching declared slice names (:tax-issues, :line-items) and writes them via update-diagnostic!. There is no separate vat-validation processor — the previous transitional function tax-compliance/run-vat-validation (which only extracted data TCA's LLM had stuffed onto structured-data) is deleted, along with the "extract then strip" dance at ingestion-completion.
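
A sketch of the routing step described above, assuming the engine simply picks the declared slice keys out of the -compute result. diagnostic-writes is a hypothetical name, the sample values are made-up placeholders, and skipping slices that are absent from the result is an assumption rather than confirmed engine behaviour.

```clojure
(defn diagnostic-writes
  "Given the -diagnostic declaration (one spec map or a vector of specs) and a
   -compute result, returns [slice value] pairs for declared slices present in
   the result. Processor-internal keys are ignored."
  [specs result]
  (let [specs (if (map? specs) [specs] specs)] ; normalise single spec to a vector
    (for [{:keys [slice]} specs
          :when (contains? result slice)]
      [slice (get result slice)])))

(def sample-result
  ;; Placeholder data shaped like the TCA result above.
  {:tax-issues       [{:type :missing-vat-id :severity "warning"}]
   :line-items       {"li-abc" {:vat-validation {:status "ok"}}}
   :service-category {:code "placeholder"}})

;; Only the two declared slices are selected; :service-category is left
;; for -apply-ops.
(diagnostic-writes [{:slice :tax-issues} {:slice :line-items}] sample-result)
```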

TCA's -apply-ops emits structured-data mutations from the processor-internal keys in result:

:service-category → replace /service-category
:bu-codes → per-line replaces at /line-items[id=X]/bu-code (first-class structured-data field, displayed + editable in UI, exported to DATEV — see schema/invoice/structured_data.clj:95)
In :ingestion mode only: :tax-id-correction → replace /issuer/tax-id and /issuer/tax-id-type (vision-mode tax-id correction, §14)
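
The three mappings above might translate into an -apply-ops body roughly like the following. The op map shape ({:op :replace :path ... :value ...}), the function name tca-apply-ops, and the check on :status "corrected" are all assumptions for illustration; only the paths and the ingestion-only rule come from this document.

```clojure
(defn tca-apply-ops
  "Builds structured-data ops from the processor-internal keys of a TCA result.
   mode is :ingestion or :edit."
  [mode {:keys [service-category bu-codes tax-id-correction]}]
  (vec
   (concat
    ;; :service-category -> replace /service-category
    (when service-category
      [{:op :replace :path "/service-category" :value service-category}])
    ;; :bu-codes -> one replace per line item
    (for [[line-id bu-code] bu-codes]
      {:op    :replace
       :path  (str "/line-items[id=" line-id "]/bu-code")
       :value bu-code})
    ;; :tax-id-correction -> only in :ingestion mode (assumed: only when corrected)
    (when (and (= mode :ingestion)
               (= "corrected" (:status tax-id-correction)))
      [{:op :replace :path "/issuer/tax-id"      :value (:tax-id tax-id-correction)}
       {:op :replace :path "/issuer/tax-id-type" :value (:tax-id-type tax-id-correction)}]))))
```

With the same result map, :edit mode yields only the service-category and bu-code ops, while :ingestion mode additionally yields the two issuer tax-id ops.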

13.3 `:matching` and `:reconciliation` slice ownership

matching processor's -diagnostic returns {:slice :matching}. The :reconciliation slice is written by reconcile-cluster! internally during matching's -compute — it iterates every cluster-peer document and writes each doc's per-doc :reconciliation slice (summaries are filtered per document). This is outside the engine's current-document scope (§12.3). No co-ownership concerns: matching's -diagnostic writes only :matching; reconcile-cluster! writes :reconciliation for all cluster docs (including the current one).

13.4 UI impact — none

The UI reads the final merged slices regardless of which processor wrote which sub-path. No rendering changes. Existing per-section states (not-yet-run, in-progress, completed) from the diagnostics feature already handle the case where individual sub-paths have differing run statuses.

14. Tax-compliance vision mode

The existing tax-compliance analyser has a vision fallback for tax-id correction — when the prior :tax-id-format validation failed, it fetches the PDF and asks a vision LLM to read the correct tax-id off the invoice image.

Policy: vision mode is :ingestion-only. In :edit mode, if the user edited an invalid tax-id and it's still invalid, we treat that as user intent and only flag it (the validations processor will emit the format warning). We don't second-guess the user with a vision LLM. This also simplifies edit-mode plumbing: no S3 fetch is needed.

Implementation: tax-compliance-analyzer's -compute inspects state.mode. When :edit, the no-vat tax-id correction branch and the tax-id-warn vision extension are both skipped.

15. Test strategy

Unit tests for the engine — phase ordering, run-row writes, apply filtering, conditional scheduling, op expansion.
Unit tests per migrated processor — verify -reads, -writes, -compute, -apply-ops produce the expected values from fixture state.
Regression tests for the post-process pipeline — existing tests for accounts/cost-center/accruals/etc. pass without modification (processor outputs unchanged).
Integration test: edit → recompute → diagnostic update — seed a doc, edit a scalar, invoke run-processors! in :edit mode, assert the relevant slice updates and irrelevant processors skipped.
Integration test: write-protection — seed a doc, edit a field, run a processor whose writes would touch that field, assert the op was blocked.
Integration test: non-leaf edit expansion — seed a doc, add a line item via the handler, invoke the engine, assert accounts / cost-center ran for the new item.
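
As a unit-level stand-in for the write-protection test, the blocking decision could be exercised as a pure function. This is a sketch under stated assumptions: partition-ops is a hypothetical helper, ops are assumed to be maps with a :path string, and user-edited paths are matched by exact equality, whereas the real engine may match by pattern or prefix.

```clojure
(require '[clojure.test :refer [deftest is]])

(defn partition-ops
  "Splits ops into :applied and :blocked depending on whether the op's path
   is in the set of user-edited paths."
  [user-edited-paths ops]
  (let [{blocked true applied false}
        (group-by #(contains? user-edited-paths (:path %)) ops)]
    {:applied (vec applied)
     :blocked (vec blocked)}))

(deftest write-protection-blocks-user-edited-fields
  (let [ops    [{:op :replace :path "/service-category" :value "new"}
                {:op :replace :path "/issuer/tax-id"    :value "placeholder"}]
        result (partition-ops #{"/issuer/tax-id"} ops)]
    ;; The untouched field is still written by the processor.
    (is (= ["/service-category"] (mapv :path (:applied result))))
    ;; The field the user edited is protected.
    (is (= ["/issuer/tax-id"] (mapv :path (:blocked result))))))
```

The integration test would then cover the same behaviour end to end: seed a document, record a user edit, run the engine, and assert that the stored value is unchanged.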

Existing tests under test/com/getorcha/workers/ap/ingestion/post_process/ move to test/com/getorcha/workers/ap/processors/, with their ns declarations updated to match.

16. Rollout

Since D3 settled on "do everything in one branch," the migration ships atomically:

Migration: add 'derivation' to document_history_change_type enum; relax the CHECK constraint. Down-migration drops the value (if unused) or keeps it (if any rows exist).
Introduce IProcessor protocol (v2) + engine (no callers yet).
Extract provenance to shared ns (both UI and engine consume it).
Implement reads helpers (leaf expansion, pattern matching).
Extend db.document-processor-run/count-runs to accept :document-version kwarg (needed by the engine's conditional filter in §6.2).
Extend db.diagnostics/update-diagnostic! for sub-path and multi-slice writes via jsonb_set.
Migrate each post-processor to the v2 protocol (one commit per processor). Old record arities coexist briefly via a deprecated shim; shim removed once all callers switch.
Migrate matching internals to processors/matching/* and introduce the processors/matching.clj wrapper (handles reconciliation inside -compute).
Rewrite post-process/run + with-validations to call run-processors!.
Rewrite matching.worker/process-document! to call run-processors!.
Rewrite diagnostics-recompute/orchestrator stubs to call run-processors! with the edit-mode phase list + filter.
Delete old IProcessor shim, validate multimethod, with-validations, tax-compliance/run-vat-validation, with-run-row!, compute!, run-processor-phases, run-phase, build-diagnostics, publish-document-ready!.
Unify the notification payload.

Tests gate each commit. If a commit breaks a regression test, it gets fixed or reverted before the next step.

17. Risks & mitigations

Reads declarations rot. Authors modify a processor's internals without updating its :reads. Mitigation: each processor's -reads is tested via a small audit that asserts the processor doesn't read structured-data paths outside its declared set (honour-system for now; can be strengthened later by instrumenting reads in test mode).
Op expansion schema drift. If the structured-data schema grows a new nested map type, the expander needs to recognise it. Mitigation: the expander uses m.util/get against the live schema, so adding new fields is a no-op; adding new MAP-of-map types is a schema change that would also need an expander update, and that case gets dedicated test coverage.
Diagnostic sub-path co-writes race. Two phase-1 processors both writing to :validations with different sub-paths race at the DB level. Mitigation: the engine writes diagnostics serially per phase (takes a per-document advisory lock during slice merge), OR uses jsonb_set at the sub-path level (atomic, no race). Prefer the latter.
Edit-recompute overwhelms LLM quota. User rapidly edits 50 fields; engine schedules 50 * N processors. Mitigation: the 60s SQS debounce already collapses bursts to one recompute. Conditional filter further trims. No new mitigation needed.
Edit-set query cost. The user-edit set is queried once per engine invocation via provenance/user-edited-paths and cached in ctx. Bounded by edits-per-document.
Derivation history rows flood the table. Every engine run in edit mode with any applied op writes a derivation history row. If processors churn on transient fields (e.g. cost-center suggestions changing between edits), the table grows. Mitigation: engine only writes the history row if the aggregate patch is non-empty (no ops applied ⇒ no row). If this becomes an issue, add a squash job later that collapses adjacent derivation rows for the same document.
OCC conflicts on rapid edit after recompute. A user submits an edit with expected-version=N; between their page load and submit, the engine applied a derivation and bumped to N+1. The submit gets rejected as a conflict. Same behavior as two users editing concurrently — the UI already handles conflicts (409 with retarget fragment). Mitigation: none; this is correct behavior.
Split ownership of :validations sub-paths requires careful refactor. Moving the COMPOSITION of check-financial-math out of validation/validate into financial-validation-resolver means contract/PO/GRN (with no LLM resolvers) still need to run these checks for their own validation sub-paths. Mitigation: check-* functions stay pure in ingestion/validation.clj. The validations processor runs whichever checks the document type needs (per §12.1 table); resolver processors (FVR, UVR) run their own subset and own their sub-paths on invoices only.
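
For the "reads declarations rot" risk, the stronger form of the audit (instrumenting reads in test mode) could look roughly like this. It is a sketch only: it assumes processors can be handed an accessor function instead of reading the structured-data map directly, it compares exact paths rather than the pattern-based :reads described earlier, and tracking-reader, undeclared-reads and sample-compute are hypothetical names.

```clojure
(require '[clojure.set :as set])

(defn tracking-reader
  "Wraps structured-data so that every path read through :read-path is recorded.
   Returns {:read-path fn, :seen atom-of-path-set}."
  [structured-data]
  (let [seen (atom #{})]
    {:seen      seen
     :read-path (fn [path]
                  (swap! seen conj path)
                  (get-in structured-data path))}))

(defn undeclared-reads
  "Paths that were read but are missing from the declared set."
  [declared-reads seen-paths]
  (set/difference seen-paths declared-reads))

(defn sample-compute
  "Stand-in for a processor's -compute that reads via the accessor."
  [read-path]
  {:total    (read-path [:totals :gross])
   :currency (read-path [:currency])})

(let [{:keys [read-path seen]} (tracking-reader {:totals {:gross 100} :currency "EUR"})]
  (sample-compute read-path)
  ;; Only [:totals :gross] is declared, so the :currency read is reported.
  (undeclared-reads #{[:totals :gross]} @seen))
;; => #{[:currency]}
```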

18. Deferred decisions

Exact name of the seq-regex authoring helper (read-leaf, rd, …) — cosmetic, picked during implementation.
Whether the UI's view.provenance becomes a shim or is deleted with call-site updates (§10) — picked at implementation time based on caller count.