Author: Product Team | Status: Draft | Last Updated: 2026-02-04
When an invoice is processed through Orcha's ingestion pipeline, the initial LLM extraction is followed by deterministic validation checks that may flag errors or uncertainties. Post-processors such as the UncertainValidationsResolver then re-examine the original PDF with an LLM to correct these issues: fixing math errors, removing misextracted line items, and correcting tax IDs.
Currently, these corrections are applied silently to the structured data. The LLM's reasoning and confidence are stored in validation-results, but there is no record of what specific fields were modified, what the original values were, or which processor made the change.
This creates a blind spot:
Add a lightweight correction log that records every modification made by corrective post-processors, enabling developers to:
| Metric | Target |
|---|---|
| Corrective modifications logged | 100% — every field change by in-scope processors is recorded |
| Pre-correction state reconstructable | Any corrected field's original value can be retrieved via a single query |
| Correction frequency queryable | Simple SQL can answer "top corrected fields in last 30 days" |
| Pipeline latency impact | No measurable increase in ingestion processing time |
Not all post-processors modify existing data. Some add entirely new fields (enrichments). Only processors that change values originally set by extraction are in scope.
| Processor | Type | What it does |
|---|---|---|
| UncertainValidationsResolver | Corrective | Modifies field values, updates line-item fields, removes misextracted line items (see full field inventory below) |
| TaxComplianceAnalyzer | Corrective | Corrects tax ID value (issuer.tax-id) and type classification (issuer.tax-id-type) when analysis disagrees with extraction |
| AccountsMatcher | Enrichment | Adds debit-account / credit-account to line items |
| CostCenterMatcher | Enrichment | Adds cost-center to line items |
| AccrualsMatcher | Enrichment | Adds accrual to line items |
| SupplierMatcher | Enrichment | Adds matched-account-number / match-confidence / match-reasoning to issuer |
Verified: No other pipeline stage mutates extracted fields. with-validations only adds :validation-results, with-fraud-detection only adds :fraud-flags, and no transformations occur between extraction and post-processing or between post-processing and persistence.
Every field that can be corrected by a post-processor, verified against the codebase:
UncertainValidationsResolver — field corrections (via allowlist):
| Field path | Triggered by check |
|---|---|
| subtotal | financial-math |
| tax-amount | financial-math |
| tax-rate | financial-math |
| total | financial-math |
| discount | financial-math |
| shipping | financial-math |
| amount-due | financial-math |
| line-items-include-tax | financial-math |
| issuer.country | issuer-country |
| issuer.iban | iban-format |
| issuer.tax-id | tax-id-format |
| issuer.tax-id-type | tax-id-format |
| issuer.vat-id | tax-id-format (legacy field) |
| recipient.country | recipient-country |
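
The allowlist above can be read as a lookup from field path to the single check that is permitted to trigger a correction. The sketch below is illustrative only: the field paths and check names come from the table, while `FIELD_CORRECTION_ALLOWLIST` and `is_allowed_correction` are hypothetical names and do not describe how the resolver is actually implemented.

```python
# Hypothetical allowlist: corrected field path -> validation check that may trigger it.
# Paths and check names are copied from the table; the dict shape is an assumption.
FIELD_CORRECTION_ALLOWLIST = {
    "subtotal": "financial-math",
    "tax-amount": "financial-math",
    "tax-rate": "financial-math",
    "total": "financial-math",
    "discount": "financial-math",
    "shipping": "financial-math",
    "amount-due": "financial-math",
    "line-items-include-tax": "financial-math",
    "issuer.country": "issuer-country",
    "issuer.iban": "iban-format",
    "issuer.tax-id": "tax-id-format",
    "issuer.tax-id-type": "tax-id-format",
    "issuer.vat-id": "tax-id-format",  # legacy field
    "recipient.country": "recipient-country",
}


def is_allowed_correction(field_path: str, check_name: str) -> bool:
    """Return True only if this check is allowed to correct this field."""
    return FIELD_CORRECTION_ALLOWLIST.get(field_path) == check_name


print(is_allowed_correction("subtotal", "financial-math"))  # True
print(is_allowed_correction("subtotal", "iban-format"))     # False
```

A lookup like this would also give the correction log its `check_name` value for free, since the check that authorised the change is known at the point the change is made.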
UncertainValidationsResolver — line-item corrections:
| Action | Triggered by check | Details |
|---|---|---|
| line-item-update | financial-math | Any line-item field (amount, quantity, unit-price, tax-rate, description, etc.) updated by index |
| line-item-removal | financial-math | Line items removed by index (e.g., section subtotals misextracted as line items) |
TaxComplianceAnalyzer — tax ID corrections:
| Field path | Condition |
|---|---|
| issuer.tax-id | Tax compliance analysis returns status: "corrected" |
| issuer.tax-id-type | Tax compliance analysis returns status: "corrected" |
Each logged correction is typed by the kind of change it represents:
| Action Type | Description | Example |
|---|---|---|
| field-update | A top-level or nested field value was changed | Subtotal corrected from 100.0 to 105.0 |
| line-item-update | A field on a specific line item was changed | Line item 0's amount changed from 150.0 to 125.0 |
| line-item-removal | A line item was removed entirely | Section subtotal misextracted as line item, removed |
Every correction logs the old value (what extraction produced) and the new value (what the post-processor set), along with the LLM's reasoning and confidence score. This enables full reconstruction of the pre-correction state without needing pipeline snapshots.
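
To make the reconstruction claim concrete, here is a minimal sketch of undoing logged `field-update` corrections by writing each old value back. The `Correction` dataclass and `restore_original` function are hypothetical names for illustration. The sketch assumes dotted paths through nested maps only; line-item updates and removals would need index handling that is not shown here.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Any


@dataclass
class Correction:
    """One logged field-update (hypothetical in-memory shape)."""
    field_path: str   # dotted path, e.g. "issuer.country"
    old_value: Any    # what extraction produced
    new_value: Any    # what the post-processor set
    reasoning: str
    confidence: float


def restore_original(corrected: dict, corrections: list[Correction]) -> dict:
    """Undo field-update corrections by writing each old_value back."""
    restored = deepcopy(corrected)
    # Walk the log newest-first so a field corrected twice ends at its first value.
    for c in reversed(corrections):
        *parents, leaf = c.field_path.split(".")
        node = restored
        for key in parents:
            node = node[key]
        node[leaf] = c.old_value
    return restored


corrected = {"subtotal": 105.0, "issuer": {"country": "DE"}}
log = [Correction("subtotal", 100.0, 105.0,
                  "Found shipping fee included in subtotal on PDF", 0.85)]
print(restore_original(corrected, log))
# {'subtotal': 100.0, 'issuer': {'country': 'DE'}}
```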
Correction logs are stored in a dedicated table, separate from the structured data itself. This keeps business data clean while enabling cross-ingestion analytics (e.g., "which fields are corrected most often across all tenants?").
This feature is developer-only. No UI changes are introduced.
Scenario: Invoice #12345 shows subtotal = 105.0, but the PDF clearly states 100.0.
Developer queries: SELECT * FROM ingestion_correction_log
WHERE ingestion_id = '<id>' AND field_path = 'subtotal'
Result: Shows that UncertainValidationsResolver changed subtotal from 100.0 to 105.0
with reasoning "Found shipping fee included in subtotal on PDF"
Action: Developer investigates whether the correction was appropriate or a model error
Scenario: Developer wants to know if extraction prompts should be improved.
Developer queries: SELECT field_path, action, count(*) as frequency
FROM ingestion_correction_log
WHERE tenant_id = '<id>' AND created_at > now() - interval '30 days'
GROUP BY field_path, action
ORDER BY frequency DESC
Result: line-item removals account for 40% of all corrections
Action: Improve extraction prompt to avoid extracting section subtotals as line items
Scenario: Developer wants to measure correction rate over time.
Developer queries: SELECT
count(DISTINCT icl.ingestion_id)::float / count(DISTINCT i.id) as correction_rate
FROM ingestion i
LEFT JOIN ingestion_correction_log icl ON icl.ingestion_id = i.id
WHERE i.status = 'completed'
AND i.completed_at > now() - interval '30 days'
Result: 12% of ingestions required at least one correction
Action: Track this metric monthly to measure extraction quality improvements
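
The correction-rate query relies on two details: the LEFT JOIN keeps ingestions that have no corrections in the denominator, and DISTINCT stops an ingestion with several corrections from being counted more than once. The sketch below checks that behaviour against toy data. It uses SQLite from the Python standard library as a stand-in for the production database, so the cast is written as `CAST(... AS REAL)` instead of `::float`, and the 30-day filter is omitted. The tables contain only the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ingestion (id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE ingestion_correction_log (ingestion_id TEXT, field_path TEXT);
INSERT INTO ingestion VALUES ('a', 'completed'), ('b', 'completed'),
                             ('c', 'completed'), ('d', 'failed');
-- Ingestion 'a' has two corrections; it must still count once.
INSERT INTO ingestion_correction_log VALUES ('a', 'subtotal'), ('a', 'total');
""")

(correction_rate,) = conn.execute("""
    SELECT CAST(count(DISTINCT icl.ingestion_id) AS REAL) / count(DISTINCT i.id)
    FROM ingestion i
    LEFT JOIN ingestion_correction_log icl ON icl.ingestion_id = i.id
    WHERE i.status = 'completed'
""").fetchone()

# One of three completed ingestions was corrected.
print(round(correction_rate, 3))  # 0.333
```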
Scenario: A supplier's tax ID type was reclassified from "vat" to "ein".
Developer queries: SELECT * FROM ingestion_correction_log
WHERE ingestion_id = '<id>' AND field_path LIKE 'issuer.tax-id%'
Result: Two corrections logged:
- field_path: 'issuer.tax-id-type', old: "vat", new: "ein"
- field_path: 'issuer.tax-id', old: "DE12345678", new: "12-3456789"
Processor: tax-compliance-analyzer
Action: Developer verifies the reclassification was correct
Table: ingestion_correction_log
| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| ingestion_id | UUID (FK) | Which ingestion run produced this correction |
| tenant_id | UUID (FK) | Tenant for RLS and partitioning |
| processor_id | TEXT | Which processor made the correction (e.g., uncertain-validations-resolver, tax-compliance-analyzer) |
| action | TEXT | One of: field-update, line-item-update, line-item-removal |
| field_path | TEXT | Dotted path to the corrected field (e.g., subtotal, issuer.country, line-items.0.amount) |
| check_name | TEXT (nullable) | Which validation check triggered this correction (e.g., financial-math). Null for non-validation corrections (e.g., TaxComplianceAnalyzer). |
| old_value | JSONB | The value before correction |
| new_value | JSONB (nullable) | The value after correction. Null for line-item-removal. |
| reasoning | TEXT | LLM's explanation for why the correction was made |
| confidence | NUMERIC | LLM's confidence score (0.0 - 1.0) |
| created_at | TIMESTAMPTZ | When the correction was logged |
| Index | Purpose |
|---|---|
| (ingestion_id) | Look up all corrections for a specific ingestion |
| (tenant_id, processor_id) | Analytics: which processors produce the most corrections per tenant |
| (tenant_id, field_path) | Analytics: which fields are corrected most often per tenant |
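
The table and indexes above could be prototyped as follows. This is a sketch under stated assumptions, not the migration itself: it uses SQLite via the Python standard library, so UUID, JSONB and TIMESTAMPTZ columns are approximated with TEXT holding serialized values. Foreign keys and RLS are omitted, and the index names and the CHECK and NOT NULL constraints are illustrative choices the schema table does not specify.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ingestion_correction_log (
    id            TEXT PRIMARY KEY,           -- UUID in the real schema
    ingestion_id  TEXT NOT NULL,
    tenant_id     TEXT NOT NULL,
    processor_id  TEXT NOT NULL,
    action        TEXT NOT NULL CHECK (action IN
                    ('field-update', 'line-item-update', 'line-item-removal')),
    field_path    TEXT NOT NULL,
    check_name    TEXT,                       -- nullable
    old_value     TEXT,                       -- JSONB in the real schema
    new_value     TEXT,                       -- nullable; NULL for removals
    reasoning     TEXT,
    confidence    REAL CHECK (confidence BETWEEN 0.0 AND 1.0),
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_icl_ingestion ON ingestion_correction_log (ingestion_id);
CREATE INDEX idx_icl_tenant_processor ON ingestion_correction_log (tenant_id, processor_id);
CREATE INDEX idx_icl_tenant_field ON ingestion_correction_log (tenant_id, field_path);
""")

ingestion_id, tenant_id = str(uuid.uuid4()), str(uuid.uuid4())
conn.execute(
    "INSERT INTO ingestion_correction_log "
    "(id, ingestion_id, tenant_id, processor_id, action, field_path, "
    " check_name, old_value, new_value, reasoning, confidence) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (str(uuid.uuid4()), ingestion_id, tenant_id,
     "uncertain-validations-resolver", "field-update", "subtotal",
     "financial-math", json.dumps(100.0), json.dumps(105.0),
     "Found shipping fee of 5.0 included in subtotal on PDF page 1", 0.85),
)

rows = conn.execute(
    "SELECT field_path, old_value, new_value FROM ingestion_correction_log "
    "WHERE ingestion_id = ?", (ingestion_id,),
).fetchall()
print(rows)  # [('subtotal', '100.0', '105.0')]
```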
Field correction (subtotal):
| Column | Value |
|---|---|
| processor_id | uncertain-validations-resolver |
| action | field-update |
| field_path | subtotal |
| check_name | financial-math |
| old_value | 100.0 |
| new_value | 105.0 |
| reasoning | "Found shipping fee of 5.0 included in subtotal on PDF page 1" |
| confidence | 0.85 |
Line item removal:
| Column | Value |
|---|---|
| processor_id | uncertain-validations-resolver |
| action | line-item-removal |
| field_path | line-items.2 |
| check_name | financial-math |
| old_value | {"description": "Subtotal Section A", "amount": 500.0, ...} |
| new_value | null |
| reasoning | "This is a section subtotal row, not a billable line item" |
| confidence | 0.90 |
Tax ID type reclassification:
| Column | Value |
|---|---|
| processor_id | tax-compliance-analyzer |
| action | field-update |
| field_path | issuer.tax-id-type |
| check_name | null |
| old_value | "vat" |
| new_value | "ein" |
| reasoning | "US-based supplier, EIN format matches, not a VAT ID" |
| confidence | 0.92 |
| Edge Case | Behavior |
|---|---|
| Ingestion retried after failure (attempt_count > 1) | Delete existing correction logs for the ingestion before re-processing. The log always reflects the final successful run. |
| Post-processor runs but makes no corrections | No rows inserted. Absence of rows = no corrections applied. |
| Multiple corrections in a single processor run | Each correction is a separate row. The ingestion_id ties them together. |
| Correction sets a field to null | new_value is stored as JSON null. Distinct from line-item-removal where new_value column itself is SQL NULL. |
| Line item removal changes subsequent item indexes | Log the index at the time of removal (before any reindexing). Multiple removals are logged with their original indexes. |
| Processor crashes mid-correction | If the ingestion fails and retries, correction logs are deleted at the start of the next attempt (see retry behavior above). |
| Enrichment processor overwrites a field that already had a value | Out of scope for v1. Only corrective processors are tracked. If enrichment processors start overwriting extracted values, they should be added to scope. |
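
Two of the edge cases above are easy to get subtly wrong: logging removals with their original indexes, and clearing earlier log rows on retry. The sketch below shows one way to handle both in plain Python. `remove_line_items` and `start_attempt` are hypothetical helper names, and the in-memory list of rows stands in for the `ingestion_correction_log` table.

```python
def remove_line_items(line_items: list[dict], indexes: list[int]) -> tuple[list[dict], list[dict]]:
    """Remove items by original index; return (remaining items, log entries)."""
    targets = sorted(set(indexes))
    # Build log entries first, against the untouched list, so every
    # field_path uses the index the item had before any removal.
    entries = [
        {
            "action": "line-item-removal",
            "field_path": f"line-items.{i}",
            "old_value": line_items[i],
            "new_value": None,  # stored as SQL NULL for removals
        }
        for i in targets
    ]
    removed = set(targets)
    remaining = [item for i, item in enumerate(line_items) if i not in removed]
    return remaining, entries


def start_attempt(log_rows: list[dict], ingestion_id: str) -> list[dict]:
    """On retry, drop correction rows left behind by earlier attempts."""
    return [row for row in log_rows if row["ingestion_id"] != ingestion_id]


items = [{"description": "Widget"}, {"description": "Subtotal Section A"},
         {"description": "Gadget"}, {"description": "Subtotal Section B"}]
remaining, entries = remove_line_items(items, [1, 3])
print([e["field_path"] for e in entries])     # ['line-items.1', 'line-items.3']
print([i["description"] for i in remaining])  # ['Widget', 'Gadget']
```

Filtering in one pass, instead of deleting index by index, avoids the shifting-index problem entirely: the second removal is logged as `line-items.3` even though that item would sit at index 2 after the first removal.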
Should field_path use dot notation (issuer.country) or vector notation ([:issuer :country])? Dot notation is more SQL-friendly; vector notation is more Clojure-idiomatic. Recommendation: dot notation for queryability.
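
If dot notation is chosen for storage, converting to and from a path vector is mechanical, so the choice need not constrain application code. The sketch below is written in Python for illustration, although the pipeline itself appears to be Clojure. The function names are hypothetical, and it assumes no field name contains a dot or consists only of digits.

```python
def dotted_to_vector(field_path: str) -> list:
    """Split a dotted path; all-digit segments become list indexes."""
    return [int(part) if part.isdigit() else part for part in field_path.split(".")]


def vector_to_dotted(path: list) -> str:
    """Join a path vector back into the dotted form stored in field_path."""
    return ".".join(str(part) for part in path)


def get_in(data, path: list):
    """Follow a path through nested dicts and lists; raises if a key is missing."""
    for key in path:
        data = data[key]
    return data


invoice = {"issuer": {"country": "DE"}, "line-items": [{"amount": 125.0}]}
path = dotted_to_vector("line-items.0.amount")
print(path)                    # ['line-items', 0, 'amount']
print(get_in(invoice, path))   # 125.0
print(vector_to_dotted(path))  # line-items.0.amount
```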