Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
Three mechanisms in the codebase today run "derive something from document state, record the run, write outputs":

- Post-processors (IProcessor protocol + post-process/run): the 9 in-ingestion processors (accounts, cost-center, accruals, supplier-matcher, supplier-verifier, tax-compliance-analyzer, financial-validation-resolver, fraud-detector, uncertain-validations-resolver).
- Matching worker (matching.worker/process-document!): bespoke orchestration that runs matching, writes diagnostics, and drives reconciliation per cluster.

Each one inserts document_processor_run rows, writes diagnostic
slices, and conditionally mutates structured_data. They diverge on
protocol shape, side-effect discipline, and trigger handling. Adding
edit-triggered recompute to the mix (the reason this work started) would
introduce a fourth implementation and compound the drift.
Separately, the diagnostics work shipped without a story for conditional recomputation: today an edit-triggered recompute would blindly re-run every processor, wasting LLM spend and risking the engine overwriting user-entered values.
Unify under one Processor protocol and one run-processors! engine.
Every current processor (including matching and reconciliation) becomes
an implementation. Two callers — ingestion and edit-recompute —
parameterise the same engine with different phase lists and modes.
Introduce declarative :reads / :writes metadata on each processor
so the engine can (a) skip processors whose declared reads weren't
touched by the triggering edit, and (b) refuse to overwrite any field a
user has manually edited.
Reorganise namespaces so matching lives alongside the other processors
(workers.ap.processors.*), and rename -slice →
-diagnostic to match the terminology in the rest of the domain.

- Processor protocol (v2) with ops-based apply; kills IProcessor.
- run-processors! handles run rows, apply-filtering, diagnostic writes, phase ordering.
- document-provenance lookup shared with the UI.
- workers.ap.processors.*, matching + reconciliation move under it.
- publish-document-ready! → processors.matching.queue/enqueue!.
- diagnostics-recompute/consumer gets llm-config, search-config, notifications refs.
- document_processor_run schema.
- document.diagnostics schema (still JSONB, same slice keys).

A unit of work that derives something from document state. Every processor:

- Has an :id (e.g. :accounts, :matching, :fraud-detector).
- Declares the leaves it reads (:reads).
- Declares the leaves it writes (:writes); empty if the processor only writes a diagnostic slice.
- May own a diagnostic (:diagnostic, was :slice): the top-level key in document.diagnostics it owns, or nil.
- Declares the modes it runs in (:modes ⊆ #{:ingestion :edit}).
- Implements -compute: (fn [ctx state]) → {:result _ :stats _}.
- Implements -apply-ops: (fn [state result]) → [patch-ops], which may be empty. Patch ops are JSON Patch maps with the project's [id=X] line-item extension.

Replaces the ad-hoc ingestion map threaded through IProcessors.
Single shape passed through the engine:
{:document <document-row> ;; includes :structured-data, :version, :legal-entity-id
:legal-entity <legal-entity-row>
:file {:contents <bytes-or-nil> :mime-type <string>}
:structured-data <map> ;; mirror of :document/structured-data for convenience
:commit-sha <string-or-nil>
:ingestion-id <uuid-or-nil> ;; set in :ingestion mode only
:history-id <uuid-or-nil>} ;; set in :edit mode only
:file is populated on demand by the engine only for processors that
need PDF bytes (currently tax-compliance's vision fallback). Lazy to
avoid gratuitous S3 fetches.
Every :reads entry and every :writes entry is a leaf path —
a path whose tail addresses an atomic field (scalar, or a primitive
inside a [:vector <primitive>]), per the structured-data malli
schema. Subtree declarations ([:cat [:= :issuer]]) are forbidden.
Leaves are expressed as malli seq-regex patterns over clj-path segments:
[:cat [:= :issuer] [:= :name]]
[:cat [:= :line-items] :any [:= :description]]
[:cat [:= :line-items] :any [:= :debit-account]]
:any matches one segment of any shape — array index, {:id X} map,
or keyword. No other wildcards are needed.
An authoring helper lowers a terse vector form to seq-regex:
(read-leaf :line-items :* :description)
;; => [:cat [:= :line-items] :any [:= :description]]
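
A minimal sketch of how such an authoring helper could work, using only clojure.core. The name read-leaf comes from the text above; matches-leaf? is a hypothetical stand-in for malli validation, hand-rolled here so the example runs without dependencies, and it only understands the [:cat ...] / [:= x] / :any subset shown in this section.

```clojure
(defn read-leaf
  "Lowers a terse leaf description to a seq-regex pattern.
   :* becomes :any; every other segment becomes [:= segment]."
  [& segments]
  (into [:cat]
        (map (fn [seg] (if (= :* seg) :any [:= seg])))
        segments))

(defn matches-leaf?
  "Hypothetical matcher for the [:cat ...] subset above. Assumption: the
   real engine would validate with malli instead of this function."
  [pattern path]
  (let [elems (rest pattern)]
    (and (= (count elems) (count path))
         (every? true?
                 (map (fn [elem seg]
                        (or (= :any elem) (= (second elem) seg)))
                      elems path)))))
```

Because :any matches exactly one segment, a pattern only matches paths of the same length, which is what keeps declarations at leaf granularity.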
-slice → -diagnostic everywhere (protocol method, engine param,
helper fn names). Semantics unchanged: one top-level key in
document.diagnostics owned by one processor. Processors whose
output lives only in structured-data return nil.
Keep the existing name. Modernise the contract:
(defprotocol IProcessor
(-id [this]) ;; :keyword
(-reads [this state]) ;; [seq-regex-pattern...] — state-aware for per-type dispatch
(-writes [this state]) ;; [seq-regex-pattern...] — may be empty
(-diagnostic [this state]) ;; DiagnosticSpec or vector thereof, or nil
(-modes [this]) ;; #{:ingestion :edit}
(-always? [this]) ;; boolean — bypasses conditional filter
(-compute [this ctx state]) ;; -> {:result _ :stats _}
(-apply-ops [this state result])) ;; -> [json-patch-op ...]
-reads, -writes, and -diagnostic take state so processors
can dispatch on document type. matching's reads differ for
invoice/purchase-order/contract/GRN; validations similarly
dispatches its sub-path ownership per document type (see §13). State
is guaranteed to contain :document with :document/structured-data
including :document-type before these methods are called.
-compute replaces today's -compute [this] (which closed over
context and ingestion at construction). The ctx/state switch lets the
engine reuse a single processor instance per run and supply the
canonical state map.
-apply-ops replaces today's -apply [this sd result] → new-sd. Ops
form is required so the engine can filter against user-edited paths
before applying. Returned ops use string keys ({"op" "replace" "path" "..." "value" ...}) matching the JSON-Patch convention used by
json-patch/apply-patch and the edit handlers.
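
As an illustration of the ops form, the sketch below builds string-keyed replace ops for per-line bu-code values. The helper name bu-code-ops and the shape of the :bu-codes values (plain strings keyed by line-item id) are assumptions made for the example; only the path format and the string-key convention are taken from this document.

```clojure
(defn bu-code-ops
  "Builds one JSON-Patch replace op per line item from a result map shaped
   like {:bu-codes {\"li-abc\" \"BU-1\"}}. Hypothetical helper."
  [result]
  (vec
   (for [[line-id bu-code] (sort-by key (:bu-codes result))]
     {"op"    "replace"
      "path"  (str "/line-items[id=" line-id "]/bu-code")
      "value" bu-code})))
```

Returning data rather than a mutated structured-data map is what lets the engine drop individual ops before anything is applied.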
-always? is an escape hatch for processors that read too broadly to
enumerate (validations — cheap deterministic checks over the whole
document). When true, the conditional filter never skips this
processor in :edit mode; reads declaration is ignored. Defaults to
false.
-diagnostic returns one of:
nil ;; no diagnostic write
{:slice :kw} ;; replace whole slice
{:slice :kw :sub-path [k1 k2 ...]} ;; replace at sub-path
{:slice :kw :sub-paths [[k1] [k2] ...]} ;; multiple sub-paths, each written
;; separately with the result map's
;; top-level keys
[spec1 spec2 ...] ;; multi-slice: write to several slices
;; from one processor's result
Multi-slice (the vector form) is used by tax-compliance-analyzer,
which writes both :tax-issues and :line-items from a single LLM
call. When the vector form is used, -compute's result is a map with
top-level keys matching each slice:
{:tax-issues [...]
:line-items {"li-abc" {:vat-validation {...}} ...}}
Engine routes each top-level key to its slice.
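
The routing rules above could be expressed as one pure function that turns a -diagnostic return value and a -compute result into a flat list of writes. This is a sketch under assumptions: diagnostic-writes and the {:slice :sub-path :value} write shape are hypothetical, and the single-spec forms are assumed to take the whole result as their value.

```clojure
(defn diagnostic-writes
  "Expands a -diagnostic return value plus a -compute result into a flat
   vector of {:slice :sub-path :value} writes. Returns [] for nil."
  [spec result]
  (cond
    (nil? spec) []
    ;; vector form: each spec takes the result value under its slice key
    (vector? spec)
    (vec (mapcat (fn [s] (diagnostic-writes s (get result (:slice s)))) spec))
    ;; :sub-paths form: one write per sub-path, value looked up in the result
    (:sub-paths spec)
    (mapv (fn [p] {:slice (:slice spec) :sub-path p :value (get-in result p)})
          (:sub-paths spec))
    ;; single slice, optionally at one sub-path: the whole result is the value
    :else
    [{:slice (:slice spec) :sub-path (:sub-path spec []) :value result}]))
```

Keys in the result that no spec names (such as processor-internal data) are simply never routed to a slice.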
run-processors!

(run-processors! ctx state phases)

- ctx: db-pool, llm-config, search-config, aws, notifications, etc.
- state: see §2.2. Carries :mode, :trigger-kind, :ingestion-id / :history-id, :edited-by, the document/legal-entity/structured-data/file/diagnostics, and (in :edit mode) a precomputed :changed-leaves set.
- phases: [[Processor ...] [Processor ...]]. Each sub-vector is a concurrency group; groups run sequentially.

No separate processors-filter argument. Filtering happens inside
the engine per phase.
For each phase, in order:

1. Reload document.diagnostics and populate state.diagnostics. Phase-1 processors see empty/stale diagnostics; phase-2 sees phase-1's writes (e.g. tax-compliance-analyzer reads state.diagnostics.validations.tax-id-format that validations wrote in phase 1).
2. Filter: in :edit mode, drop processors whose -always? is false AND whose -reads (evaluated against current state) don't intersect :changed-leaves AND which have a :completed run at the current document version. In :ingestion mode, no filtering.
3. For each remaining processor, insert a document_processor_run row (:running, with trigger-kind from state).
4. Call -compute with ctx + state.
5. If -diagnostic (evaluated against state) is non-nil, write the slice(s) via db.diagnostics/update-diagnostic! (§9). Multi-slice specs route per-slice values from the result map.
6. If -apply-ops returns non-empty ops: filter against the user-edit set (§7) in :edit mode (no filter in :ingestion mode; see §5.4), then merge remaining ops into state's in-memory :structured-data. The engine does NOT persist structured-data mutations to the document row here; persistence is batched at end-of-engine-run (§4.1).
7. On success, mark the run :completed with :result + :stats.
8. On failure, mark the run :failed and fire the notification hook (§8).
9. Carry forward a state reflecting applied ops and refreshed diagnostics. Subsequent phases see prior phases' mutations.

The engine mutates state.structured-data in memory across phases.
When the engine run completes, if state.structured-data differs
from state.initial-structured-data:

- :ingestion mode: the engine does NOT persist. The caller (ingestion worker) persists the final state via complete-ingestion!: a single UPDATE + document_history row with change_type = 'ingestion' + version++.
- :edit mode: the engine computes the aggregate JSON-Patch (diff from initial to final structured-data), writes a single UPDATE on document + one document_history row with change_type = 'derivation' + version++. The aggregate patch records the net effect of all processor ops across all phases.
- Matching worker: -apply-ops returns [] (matching side-effects land in document_match + cluster_id, not structured-data). No aggregate patch, no version bump.

A new change_type = 'derivation' enum value is added to document_history_change_type. Derivation rows have edited_by = NULL and ingestion_id = NULL; only patch is populated. The constraint on document_history is relaxed to allow this combination.
run-processors! returns the updated state. Callers wire this back into their end-of-pipeline
logic (e.g., ingestion's complete-ingestion! writes the final
document_history row + document update in the same transaction).
:ingestion

- -apply-ops results are applied with no filter: re-ingestion establishes truth from scratch, overwriting any prior user edits.
- state.trigger-kind = :ingestion; state.ingestion-id required.

:edit

- -apply-ops results are filtered through the user-edit set (§7) before being applied. User-edited paths block processor writes at the same or descendant path.
- Processors whose -always? is false AND whose reads don't intersect :changed-leaves AND which already have a completed run at the current version are skipped. Cold-start exemption: never-run processors always run.
- state.trigger-kind = :edit; state.history-id and state.edited-by required.

:manual (future)

Not part of this work, but reserved at the protocol level. For ops scripts that want to force a recompute without an ingestion or edit anchor.
Given the current edit's patch (from
document_history.patch), the engine produces the set of changed
leaves:

1. Converting each op's path string to a clj-path via json-patch.path/pointer->clj-path.
2. Expanding any op that targets a subtree (including remove) and emitting one virtual leaf path per atomic descendant.

The expansion uses the StructuredData schema to decide "is this atomic?": primitive schemas and [:vector <primitive>] are atomic; maps and vectors of maps are not.
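
The expansion step can be sketched as a recursive walk. Note the assumption: this version decides atomicity by inspecting the value itself rather than the StructuredData malli schema, and it addresses vector elements by index rather than by {:id X} segments, purely to keep the example self-contained. The names atomic? and leaf-paths are hypothetical.

```clojure
(defn atomic?
  "Scalars and vectors of scalars count as atomic, mirroring the rule above."
  [v]
  (or (not (coll? v))
      (and (vector? v) (every? (complement coll?) v))))

(defn leaf-paths
  "Returns the set of leaf paths under value, each prefixed with path
   (a vector). Maps recurse by key; other collections recurse by index."
  [path value]
  (cond
    (atomic? value) #{path}
    (map? value)    (into #{}
                          (mapcat (fn [[k v]] (leaf-paths (conj path k) v)))
                          value)
    :else           (into #{}
                          (mapcat (fn [[i v]] (leaf-paths (conj path i) v)))
                          (map-indexed vector value))))
```

A replace op at [:issuer] carrying {:name "Acme" :tags ["a" "b"]} would expand to the two leaves [:issuer :name] and [:issuer :tags].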
For each processor in the phase:

1. -always? is true → run.
2. No :completed run at the document's current version → run (cold start).
3. Any of its -reads patterns m/validates any leaf in the changed-leaves set → run.
4. Otherwise → skip.

The filter runs per phase, immediately before execution. Later phases see earlier phases' mutations, which may themselves touch leaves relevant to later-phase processors; the filter recomputes between phases.
The state receives a computed :changed-leaves set on entry to
:edit mode so processors can consume it too (e.g., an efficient
processor might compute only the deltas). Optional — the filter
handles most cases; this is for processors that want finer control.
Reuses the shared logic from view/provenance.clj, extracted to a
new namespace com.getorcha.document.provenance:
(provenance/user-edited-paths db-pool document-id)
;; => #{"/issuer/name" "/line-items[id=abc]/debit-account" ...}
Internally walks document_history newest → oldest, collecting op
paths up to (but not including) the most recent :ingestion row.
Same behaviour that the UI relies on.
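
The history walk can be sketched as a pure function over rows that are already loaded and ordered newest first. The row shape ({:change-type ... :patch [...]}) is an assumption for the example; the real function takes db-pool and document-id as shown above. The sketch collects paths from every row before the ingestion boundary, as the text describes, and does not special-case other row kinds.

```clojure
(defn user-edited-paths
  "history: rows ordered newest first, each assumed to look like
   {:change-type :edit, :patch [{\"op\" \"replace\" \"path\" \"/x\"} ...]}.
   Returns the set of op paths recorded since the most recent ingestion."
  [history]
  (->> history
       (take-while #(not= :ingestion (:change-type %)))
       (mapcat :patch)
       (map #(get % "path"))
       set))
```

Stopping at the ingestion row is what makes re-ingestion reset the set: edits older than the latest ingestion no longer block processor writes.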
The extracted ns exports both:

- document-provenance (path → {:edited-by :edited-at}, used by UI)
- user-edited-paths (just the key set, used by the engine)

The filter runs in :edit mode only. Ingestion mode applies every
op unconditionally (re-ingestion wipes state).
For each op the engine is about to apply:

1. Convert op.path to a clj-path.
2. If a user-edited path equals it or is a prefix of it, drop the op and log at :info level with processor id + op path + blocking user-edited path.

Prefix matching is important: user edited /line-items[id=abc]
(wholesale replace) must block processor writes to
/line-items[id=abc]/debit-account.
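
A minimal sketch of the prefix check, operating on clj-paths (vectors of segments) so that prefix comparison is segment-wise rather than string-based. The function names and the :clj-path key on each op are hypothetical; the real engine would convert op paths itself and log each blocked op.

```clojure
(defn blocked-by?
  "True when edited-path equals op-path or is an ancestor of it.
   Both arguments are vectors of path segments."
  [edited-path op-path]
  (and (<= (count edited-path) (count op-path))
       (= edited-path (subvec op-path 0 (count edited-path)))))

(defn filter-ops
  "Splits ops into {:applied [...] :blocked [...]}. Each op is assumed to
   carry its already-converted path under :clj-path."
  [user-edited-paths ops]
  (let [blocked? (fn [op]
                   (some #(blocked-by? % (:clj-path op)) user-edited-paths))]
    {:applied (vec (remove blocked? ops))
     :blocked (vec (filter blocked? ops))}))
```

Note the direction of the check: an edit at a parent path blocks writes to its descendants, but an edit at a leaf does not block a processor write to an unrelated sibling or to a shorter path.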
Unit-test the filter with representative scenarios:

- A wholesale line-item edit (/line-items[id=abc]) blocks all descendant processor writes.

Failed processor runs fire admin notifications in both modes. Notification payload includes:
{:kind :processor/failure
:processor :matching
:trigger {:kind :edit :history-id #uuid ... :edited-by #uuid ...}
:document {:id ... :legal-entity-id ... :file-original-name ...}
:error <message>}
For :ingestion triggers :edited-by is nil. For :edit
triggers the admin payload includes which user performed the edit
(useful when an edit pattern is destabilising a processor). Renames
today's :matching/permanent-failure, :reconciliation/failure
into the single :processor/failure kind with the processor id in
the payload.
| Old | New |
|---|---|
| IProcessor (protocol) | IProcessor (kept; contract extended) |
| db.diagnostics/update-slice! | db.diagnostics/update-diagnostic! (no alias; all callers updated) |
| publish-document-ready! | processors.matching.queue/enqueue! |
| :matching/permanent-failure | :processor/failure with :processor :matching |
| :reconciliation/failure | :processor/failure with :processor :reconciliation |
| compute! (post-process.clj) | absorbed into engine |
| with-run-row! (post-process.clj) | absorbed into engine |
| run-processor-phases | absorbed into engine |
| run-phase | absorbed into engine |
| tax-compliance/run-vat-validation | deleted (TCA writes :line-items diagnostic slice directly; §13.2) |
| validation/validate (multimethod) | deleted (callsites moved to validations, FVR, UVR processors) |
| with-validations (ingestion.clj) | deleted (validations is phase 1 of the engine) |
New database:

- document_history_change_type enum value 'derivation' (§4.1).
- document_history CHECK constraint relaxed to allow change_type='derivation' AND ingestion_id IS NULL AND edited_by IS NULL.

src/com/getorcha/workers/ap/
  ingestion.clj                  ;; shrinks; runs extraction + calls engine
  ingestion/
    classification.clj
    extraction.clj
    transcription.clj
    vat_rules.clj
    validation.clj               ;; pure rules (check-* functions)
  post_process/                  ;; [DELETED; see processors/]
  processors/
    engine.clj                   ;; IProcessor protocol + run-processors!
    reads.clj                    ;; seq-regex helpers, leaf expansion
    accounts.clj                 ;; (moved from post_process/)
    accruals.clj
    cost_center.clj
    financial_validation.clj
    fraud.clj
    supplier.clj
    tax_compliance.clj
    uncertain_validations.clj
    validations.clj              ;; NEW: wraps ingestion/validation.clj
    matching.clj                 ;; NEW: wraps match-document! + reconcile-cluster!
    matching/
      queue.clj                  ;; NEW: enqueue! (was publish-document-ready!)
      core.clj                   ;; unchanged internals
      candidates.clj
      evidence.clj
      llm_decision.clj
      normalize.clj
      reconciliation.clj         ;; moved; still internal, no longer a separate processor
      searchable_text.clj

Note: the provenance logic moves to com.getorcha.document.provenance
(new top-level ns, neutral between UI + workers). The UI's existing
com.getorcha.app.http.documents.view.provenance ns becomes a thin
shim that re-exports document-provenance (or is deleted with the
UI callers updated to the new ns, whichever is cheaper at
implementation time).
Today the ingestion pipeline in workers.ap.ingestion runs these
stages sequentially:
transcribe → classify → extract → validate (with-validations) → post-process → complete
validate mutates structured-data with :validation-results;
post-process runs the 9 post-processors through the old
IProcessor protocol and mutates structured-data again.
Under the unified model, validate and post-process collapse into
a single engine call with THREE phases. The pipeline becomes:
transcribe → classify → extract → run-processors! [validations] [post-procs…] [fraud] → complete
validations (always-run, deterministic-only) runs alone in phase 1
because phase-2 processors read its output. The nine existing
post-processors plus the new validations processor distribute
across the three phases per §11 below. (There is no separate
vat-validation processor — tax-compliance-analyzer writes the
per-line vat-validation diagnostic directly; see §13.)
The engine replaces both with-validations and post-process/run.
Ingestion calls:
(engine/run-processors!
ctx state-in-ingestion-mode
[;; Phase 1 — deterministic validations (fast; produces the validation
;; statuses that downstream LLM processors consult)
[validations]
;; Phase 2 — enrichment, analysis, resolvers
[accounts cost-center accruals
supplier-matcher supplier-verifier
tax-compliance-analyzer
financial-validation-resolver
uncertain-validations-resolver]
;; Phase 3 — sees phase-2 mutations (e.g. tax-id correction)
[fraud-detector]])
Ingestion then calls processors.matching.queue/enqueue! to hand
off to the matching SQS worker. Matching stays async for latency
isolation.
Phase rationale:

- Phase 1 (validations): tax-compliance-analyzer reads :validations.tax-id-format to decide whether to enter its vision-correction branch; financial-validation-resolver and uncertain-validations-resolver consume uncertain statuses that validations produces. Today this is achieved by with-validations running before post-process/run; under the unified model it's just phase 1.
- Phase 2: financial-validation-resolver and uncertain-validations-resolver are here rather than a later phase because they read only validations (phase 1) output, not phase-2 peer output.
- Phase 3: fraud-detector sees phase-2 corrections (e.g. the tax-id correction from tax-compliance-analyzer). This preserves today's fraud-after-corrections sequencing.

The matching SQS worker handles only the post-ingestion continuation (edit mode runs matching inline):
(engine/run-processors!
ctx state-in-ingestion-mode
[[matching]])
matching's -compute runs match-document! then invokes
reconcile-cluster! for each affected cluster. Reconciliation is
NOT a separate processor — it's an internal step of matching.
matching writes both :matching and :reconciliation diagnostic
slices (see §13).
(engine/run-processors!
ctx state-in-edit-mode
[;; Phase 1 — validations (always runs; produces statuses downstream reads)
[validations]
;; Phase 2 — everything else, conditionally
[tax-compliance-analyzer
fraud-detector matching
accounts cost-center accruals
supplier-matcher supplier-verifier
financial-validation-resolver
uncertain-validations-resolver]])
Phase 1 is the same as ingestion's phase 1: validations runs first because phase-2 processors (tax-compliance-analyzer, FVR, UVR) read its output.
Phase 2 collapses ingestion's phases 2 and 3 because in :edit mode
-apply-ops mutations are filtered by the user-edit set, so
phase-2 corrections typically don't propagate to phase-3 readers in
a meaningful way — and when they would, the reader will recompute
on a subsequent edit anyway. Fraud running alongside tax-compliance
in phase 2 is a small latency win acceptable because fraud-detector's
output is a diagnostic (not a correction) and a slightly-stale
tax-id-type at fraud-time only produces a slightly-stale fraud-flag.
Every phase-2 processor is #{:ingestion :edit}; the conditional
filter (§6) decides which actually run based on :changed-leaves.
validations runs unconditionally (-always? true).
Matching's reconciliation sub-step is sequenced inside matching's
-compute, not at the engine's phase level. reconcile-cluster!
still inserts its own document_processor_run rows for cluster-peer
documents and writes their :reconciliation slices — these are
side effects of the matching processor on OTHER documents, outside
the engine's current-document scope. The engine itself only tracks
runs/slices for the document that triggered the run.
Every current processor gets a v2 profile. Values below are illustrative for the spec; exact reads/writes are locked in during implementation from source inspection.
| Processor | Reads (leaves, terse) | Writes (structured-data) | Diagnostic | Modes | Always? |
|---|---|---|---|---|---|
| accounts | issuer.name, issuer.vat-id, issuer.country, line-items.*.description, line-items.*.amount | line-items.*.debit-account, line-items.*.credit-account | - | #{:ingestion :edit} | no |
| cost-center | issuer.name, line-items.*.description, line-items.*.amount | line-items.*.cost-center | - | #{:ingestion :edit} | no |
| accruals | invoice-date, line-items.*.description | line-items.*.accrual | - | #{:ingestion :edit} | no |
| supplier-matcher | issuer.name, issuer.vat-id, issuer.iban | supplier-match | - | #{:ingestion :edit} | no |
| supplier-verifier | issuer.name, issuer.vat-id, issuer.country, issuer.address | supplier-verification-id | - | #{:ingestion :edit} | no |
| tax-compliance-analyzer | issuer.country, issuer.tax-id, issuer.tax-id-type, recipient.country, recipient.tax-id-type, shipping-country, line-items.*.tax-rate, line-items.*.description, delivery-terms-raw, incoterm-code, compliance-statements.*.text | service-category, line-items.*.bu-code; tax-id-correction branch (vision PDF) is :ingestion-only (see §14) | multi-slice: :tax-issues + :line-items (see §13) | #{:ingestion :edit} | no |
| financial-validation-resolver | subtotal, total, tax-amount, line-items.*.amount, line-items.*.quantity, line-items.*.unit-price | - | :validations.financial-math (sub-path; see §13) | #{:ingestion :edit} | no |
| fraud-detector | issuer.name, issuer.country, issuer.vat-id, issuer.tax-id, issuer.iban, issuer.account-number, issuer.sort-code, issuer.routing-number, issuer.bsb, recipient.country, invoice-date, line-items.*.description | - | :fraud-flags | #{:ingestion :edit} | no |
| uncertain-validations-resolver | issuer.name, issuer.address, recipient.name, recipient.address, invoice-date, invoice-number | - | :validations.{required-fields,date-reasonableness,recipient-identity} (sub-paths; see §13) | #{:ingestion :edit} | no |
| validations | (see §12.1; per-doc-type dispatch) | - | :validations (per-doc-type sub-paths; see §13) | #{:ingestion :edit} | yes |
| matching | (per-doc-type; see §12.2) | matches rows in document_match, cluster-id on document, cluster reconciliation state on ap_document_cluster; also triggers reconcile-cluster! which writes :reconciliation slice for each cluster peer (§13.3) | :matching | #{:ingestion :edit} | no |

validations: per-doc-type dispatch, always-run

validations is the only -always? true processor. It runs cheap
deterministic checks over the whole document. Enumerating every leaf
it touches would produce a brittle declaration, and the cost profile
(a few hundred microseconds, no LLM calls) doesn't justify the
filtering overhead.
Its reads/writes/diagnostic vary by document type (via
-reads [this state] dispatch on state.document.structured-data.document-type):
| Doc type | Sub-paths owned by validations |
|---|---|
| invoice | :tax-id-format, :iban-format, :issuer-country, :recipient-country, :large-document-summary-only (invoice-specific checks; :financial-math owned by FVR, :required-fields/:date-reasonableness/:recipient-identity owned by UVR) |
| purchase-order | :required-fields (the whole validations slice for POs) |
| contract | :signature-presence, :required-fields, :date-validity, :party-identification, :financial-consistency, :termination-clause |
| goods-received-note | :required-fields |
Contract/PO/GRN have no LLM validation resolvers; validations owns
their entire :validations slice. Invoice has FVR and UVR (§13)
owning specific sub-paths.
Matching's internal code (normalize.clj, searchable-text.clj,
evidence.clj) dispatches on :document/type when extracting
counterparty names, references, and scoring fields. The processor's
-reads mirrors this dispatch:
| Doc type | Read leaves |
|---|---|
| invoice | issuer.name, issuer.vat-id, issuer.iban, invoice-number, total, currency, line-items.*.description, line-items.*.quantity, line-items.*.unit, po-references.*, gr-references.*, service-period.start, service-period.end |
| purchase-order | supplier.name, supplier.vat-id, po-number, total-value, currency, line-items.*.description, line-items.*.quantity, line-items.*.unit, contract-references.*, requisition-numbers.* |
| contract | counterparty.name, counterparty.tax-id, contract-number, total-value, currency, deliverables.* |
| goods-received-note | supplier.name, supplier.vat-id, grn-number, line-items.*.description, line-items.*.quantity, line-items.*.unit, po-references.*, delivery-note-numbers.* |
Reconciliation is not a separate processor. It's a sub-step of
matching's -compute: after match-document! writes matches and
assigns/merges clusters, matching calls reconcile-cluster! for
each affected cluster with ≥ 2 documents, and writes the
:reconciliation diagnostic slice for the edited document. For
cluster peers, the existing peer-cluster run-row + slice-writing in
reconcile-cluster! remains unchanged (the engine only tracks the
current document's runs).
:validations is written by THREE processors: validations (base
deterministic checks), financial-validation-resolver (resolves the
financial-math sub-check), uncertain-validations-resolver (resolves
required-fields, date-reasonableness, recipient-identity sub-checks).
Under the old pipeline these flowed through a shared
structured-data.validation-results map that processors merged into.
Under the new model the slice is a single JSONB object; co-ownership
requires merge-not-replace semantics with non-overlapping ownership.
Resolution: -diagnostic returns either a single spec (slice +
optional sub-path) or a vector of specs (multi-slice). See §3.1 for
the full shape definition. The engine routes per-processor slice
writes via jsonb_set for atomicity per sub-path.
:validations slice (invoice)

| Sub-path | Owner |
|---|---|
| [:financial-math] | financial-validation-resolver |
| [:required-fields] | uncertain-validations-resolver |
| [:date-reasonableness] | uncertain-validations-resolver |
| [:recipient-identity] | uncertain-validations-resolver |
| [:tax-id-format] | validations |
| [:iban-format] | validations |
| [:issuer-country] | validations |
| [:recipient-country] | validations |
| [:large-document-summary-only] | validations |

For contract, PO, GRN: validations owns the whole :validations
slice (no resolvers exist for those doc types). See §12.1.
Each sub-path has exactly ONE owner per document type. No two
processors write to the same sub-path. Small refactor of existing
code: today's check-financial-math, check-required-fields,
check-date-reasonableness, check-recipient-identity stay in
ingestion/validation.clj as pure functions but the composition
moves out of validation/validate into the respective resolver
processors (which do the deterministic part + the LLM refinement in
one -compute). For invoice, validations' -compute stops
emitting those four sub-paths. For contract/PO/GRN, validations
still runs all deterministic checks for that type.
:tax-issues and :line-items slices

tax-compliance-analyzer owns BOTH diagnostic slices
(:tax-issues invoice-level, :line-items per-line
:vat-validation) via multi-slice -diagnostic return (§3.1).
TCA's -diagnostic returns:
[{:slice :tax-issues} {:slice :line-items}]
TCA's -compute result has corresponding top-level keys (plus
whatever the processor wants for its own -apply-ops):
{:tax-issues [{:type :missing-vat-id :severity "warning" ...} ...]
:line-items {"li-abc" {:vat-validation {...}}
"li-def" {:vat-validation {...}}}
;; processor-internal: used by -apply-ops to build structured-data ops
:service-category {...}
:bu-codes {"li-abc" {...} "li-def" {...}}
:tax-id-correction {:status "corrected" :tax-id "..." :tax-id-type "..."}}
Engine reads top-level keys matching declared slice names (:tax-issues,
:line-items) and writes them via update-diagnostic!. There is no
separate vat-validation processor — the previous transitional
function tax-compliance/run-vat-validation (which only extracted
data TCA's LLM had stuffed onto structured-data) is deleted, along
with the "extract then strip" dance at ingestion-completion.
TCA's -apply-ops emits structured-data mutations from the
processor-internal keys in result:

- :service-category → replace /service-category
- :bu-codes → per-line replaces at /line-items[id=X]/bu-code (first-class structured-data field, displayed + editable in UI, exported to DATEV; see schema/invoice/structured_data.clj:95)
- :ingestion mode only: :tax-id-correction → replace /issuer/tax-id and /issuer/tax-id-type (vision-mode tax-id correction, §14)

:matching and :reconciliation slice ownership

matching processor's -diagnostic returns {:slice :matching}.
The :reconciliation slice is written by reconcile-cluster!
internally during matching's -compute — it iterates every
cluster-peer document and writes each doc's per-doc :reconciliation
slice (summaries are filtered per document). This is outside the
engine's current-document scope (§12.3). No co-ownership concerns:
matching's -diagnostic writes only :matching; reconcile-cluster!
writes :reconciliation for all cluster docs (including the current
one).
The UI reads the final merged slices regardless of which processor wrote which sub-path. No rendering changes. Existing per-section states (not-yet-run, in-progress, completed) from the diagnostics feature already handle the case where individual sub-paths have differing run statuses.
The existing tax-compliance analyser has a vision fallback for tax-id
correction — when the prior :tax-id-format validation failed, it
fetches the PDF and asks a vision LLM to read the correct tax-id off
the invoice image.
Policy: vision mode is :ingestion-only. In :edit mode, if the
user edited an invalid tax-id and it's still invalid, that's user
intent to flag (the validations processor will emit the format
warning). We don't second-guess the user with a vision LLM. This
simplifies edit-mode plumbing too — no S3 fetch needed.
Implementation: tax-compliance-analyzer's -compute inspects
state.mode. When :edit, the no-vat tax-id correction branch and
the tax-id-warn vision extension are both skipped.

- Per-processor unit tests: -reads, -writes, -compute, -apply-ops produce the expected values from fixture state.
- Engine tests: call run-processors! in :edit mode, assert that the relevant slices update and that irrelevant processors are skipped.

Existing tests under test/com/getorcha/workers/ap/ingestion/post_process/
move under test/com/getorcha/workers/ap/processors/ with their
ns updates.
Since D3 settled on "do everything in one branch," the migration ships atomically:

1. Migration: add 'derivation' to the document_history_change_type enum; relax the CHECK constraint. Down-migration drops the value (if unused) or keeps it (if any rows exist).
2. Add the IProcessor protocol (v2) + engine (no callers yet).
3. Extract provenance to shared ns (both UI and engine consume it).
4. Add reads helpers (leaf expansion, pattern matching).
5. Extend db.document-processor-run/count-runs to accept :document-version kwarg (needed by the engine's conditional filter in §6.2).
6. Extend db.diagnostics/update-diagnostic! for sub-path and multi-slice writes via jsonb_set.
7. Move matching internals to processors/matching/* and introduce the processors/matching.clj wrapper (handles reconciliation inside -compute).
8. Rewire post-process/run + with-validations to call run-processors!.
9. Rewire matching.worker/process-document! to call run-processors!.
10. Wire the diagnostics-recompute/orchestrator stubs to call run-processors! with the edit-mode phase list + filter.
11. Delete the IProcessor shim, validate multimethod, with-validations, tax-compliance/run-vat-validation, with-run-row!, compute!, run-processor-phases, run-phase, build-diagnostics, publish-document-ready!.

Tests gate each commit. If a commit breaks a regression test, it gets fixed or reverted before the next step.

- Processors under-declare :reads. Mitigation: each processor's -reads is tested via a small audit that asserts the processor doesn't read structured-data paths outside its declared set (honour-system for now; can be strengthened later by instrumenting reads in test mode).
- Leaf expansion drifts from the schema. Mitigation: the expander uses m.util/get against the live schema; adding new fields is a no-op, adding new MAP-of-map types is a schema change that would also need an expander update, so add test coverage for that case.
- Processors writing :validations with different sub-paths race at the DB level. Mitigation: the engine writes diagnostics serially per phase (takes a per-document advisory lock during slice merge), OR uses jsonb_set at the sub-path level (atomic, no race). Prefer the latter.
- Cost of the user-edit set. The set is computed via provenance/user-edited-paths and cached in ctx. Bounded by edits-per-document.
- History growth. Every :edit run that applies ops writes a derivation history row. If processors churn on transient fields (e.g. cost-center suggestions changing between edits), the table grows. Mitigation: the engine only writes the history row if the aggregate patch is non-empty (no ops applied ⇒ no row). If this becomes an issue, add a squash job later that collapses adjacent derivation rows for the same document.
- Version conflicts. A user submits an edit with expected-version=N; between their page load and submit, the engine applied a derivation and bumped to N+1. The submit gets rejected as a conflict. Same behavior as two users editing concurrently: the UI already handles conflicts (409 with retarget fragment). Mitigation: none; this is correct behavior.
- Splitting :validations sub-paths requires careful refactor. Moving the COMPOSITION of check-financial-math out of validation/validate into financial-validation-resolver means contract/PO/GRN (with no LLM resolvers) still need to run these checks for their own validation sub-paths. Mitigation: check-* functions stay pure in ingestion/validation.clj. The validations processor runs whichever checks the document type needs (per §12.1 table); resolver processors (FVR, UVR) run their own subset and own their sub-paths on invoices only.

- Naming of the authoring helper (read-leaf, rd, …): cosmetic, picked during implementation.
- Whether view.provenance becomes a shim or is deleted with call-site updates (§10): picked at implementation time based on caller count.