Pipeline Correction Tracking - Product Specification

Author: Product Team
Status: Draft
Last Updated: 2026-02-04


1. Problem Statement

When an invoice is processed through Orcha's ingestion pipeline, the initial LLM extraction is followed by deterministic validation checks that may flag errors or uncertainties. Post-processors like the UncertainValidationsResolver then re-examine the original PDF with an LLM to correct these issues — fixing math errors, removing misextracted line items, correcting tax IDs.

Currently, these corrections are applied silently to the structured data. The LLM's reasoning and confidence are stored in validation-results, but there is no record of what specific fields were modified, what the original values were, or which processor made the change.

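This creates a blind spot: once a correction has been applied, nothing in the stored data shows whether a given value came from extraction or from a post-processor.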

2. Goal

Add a lightweight correction log that records every modification made by corrective post-processors, enabling developers to:

  1. Trace data lineage — for any field, know whether it came from extraction or was corrected, and by whom
  2. Identify correction patterns — query which fields are corrected most often, by which processors, to guide extraction prompt improvements
  3. Monitor pipeline health — measure what percentage of ingestions require corrections over time

3. Success Metrics

| Metric | Target |
|---|---|
| Corrective modifications logged | 100%: every field change by in-scope processors is recorded |
| Pre-correction state reconstructable | Any corrected field's original value can be retrieved via a single query |
| Correction frequency queryable | Simple SQL can answer "top corrected fields in last 30 days" |
| Pipeline latency impact | No measurable increase in ingestion processing time |

4. Core Concepts

4.1 Corrective vs. Enrichment Post-Processors

Not all post-processors modify existing data. Some add entirely new fields (enrichments). Only processors that change values originally set by extraction are in scope.

| Processor | Type | What it does |
|---|---|---|
| UncertainValidationsResolver | Corrective | Modifies field values, updates line-item fields, removes misextracted line items (see full field inventory below) |
| TaxComplianceAnalyzer | Corrective | Corrects tax ID value (issuer.tax-id) and type classification (issuer.tax-id-type) when analysis disagrees with extraction |
| AccountsMatcher | Enrichment | Adds debit-account / credit-account to line items |
| CostCenterMatcher | Enrichment | Adds cost-center to line items |
| AccrualsMatcher | Enrichment | Adds accrual to line items |
| SupplierMatcher | Enrichment | Adds matched-account-number / match-confidence / match-reasoning to issuer |

Verified: No other pipeline stage mutates extracted fields. with-validations only adds :validation-results, with-fraud-detection only adds :fraud-flags, and no transformations occur between extraction and post-processing or between post-processing and persistence.

4.2 Complete Field Inventory

Every field that can be corrected by a post-processor, verified against the codebase:

UncertainValidationsResolver — field corrections (via allowlist):

| Field path | Triggered by check |
|---|---|
| subtotal | financial-math |
| tax-amount | financial-math |
| tax-rate | financial-math |
| total | financial-math |
| discount | financial-math |
| shipping | financial-math |
| amount-due | financial-math |
| line-items-include-tax | financial-math |
| issuer.country | issuer-country |
| issuer.iban | iban-format |
| issuer.tax-id | tax-id-format |
| issuer.tax-id-type | tax-id-format |
| issuer.vat-id | tax-id-format (legacy field) |
| recipient.country | recipient-country |

UncertainValidationsResolver — line-item corrections:

| Action | Triggered by check | Details |
|---|---|---|
| line-item-update | financial-math | Any line-item field (amount, quantity, unit-price, tax-rate, description, etc.) updated by index |
| line-item-removal | financial-math | Line items removed by index (e.g., section subtotals misextracted as line items) |

TaxComplianceAnalyzer — tax ID corrections:

| Field path | Condition |
|---|---|
| issuer.tax-id | Tax compliance analysis returns status: "corrected" |
| issuer.tax-id-type | Tax compliance analysis returns status: "corrected" |
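
The allowlist for field corrections could be enforced with a small lookup before a proposed correction is applied. The sketch below is illustrative only: the names `FIELD_ALLOWLIST` and `is_allowed_correction` are hypothetical and not taken from the codebase, and the check-to-field mapping is copied from the UncertainValidationsResolver field inventory in this section.

```python
# Hypothetical sketch: validation check -> field paths that check may correct.
# The mapping mirrors the field inventory; all names here are assumptions.
FINANCIAL_FIELDS = {
    "subtotal", "tax-amount", "tax-rate", "total", "discount",
    "shipping", "amount-due", "line-items-include-tax",
}

FIELD_ALLOWLIST: dict[str, set[str]] = {
    "financial-math": FINANCIAL_FIELDS,
    "issuer-country": {"issuer.country"},
    "iban-format": {"issuer.iban"},
    "tax-id-format": {"issuer.tax-id", "issuer.tax-id-type", "issuer.vat-id"},
    "recipient-country": {"recipient.country"},
}


def is_allowed_correction(check_name: str, field_path: str) -> bool:
    """Return True if `check_name` is permitted to correct `field_path`."""
    return field_path in FIELD_ALLOWLIST.get(check_name, set())
```

A correction proposed for a field outside its check's allowlist (for example, iban-format trying to change subtotal) would be rejected before it reaches the structured data or the log.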

4.3 Correction Action Types

Each logged correction is typed by the kind of change it represents:

| Action Type | Description | Example |
|---|---|---|
| field-update | A top-level or nested field value was changed | Subtotal corrected from 100.0 to 105.0 |
| line-item-update | A field on a specific line item was changed | Line item 0's amount changed from 150.0 to 125.0 |
| line-item-removal | A line item was removed entirely | Section subtotal misextracted as line item, removed |

4.4 Before/After Capture

Every correction logs the old value (what extraction produced) and the new value (what the post-processor set), along with the LLM's reasoning and confidence score. This enables full reconstruction of the pre-correction state without needing pipeline snapshots.

4.5 Separate Storage

Correction logs are stored in a dedicated table, separate from the structured data itself. This keeps business data clean while enabling cross-ingestion analytics (e.g., "which fields are corrected most often across all tenants?").
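
Because each log row carries both the old and the new value, the pre-correction document can be rebuilt by writing old values back along each field path. The following is a minimal sketch, assuming dot-notation paths (still an open question in this spec), plain in-memory dicts, and only field-update and line-item-update entries. The helpers `get_path`, `set_path`, `capture_update` and `restore_original` are hypothetical names, not existing pipeline functions.

```python
import copy
from typing import Any


def _step(node: Any, key: str) -> Any:
    """Descend one path segment; segments index into lists numerically."""
    return node[int(key)] if isinstance(node, list) else node[key]


def get_path(doc: dict, path: str) -> Any:
    """Read a value at a dotted path such as 'line-items.0.amount'."""
    node: Any = doc
    for key in path.split("."):
        node = _step(node, key)
    return node


def set_path(doc: dict, path: str, value: Any) -> None:
    """Write a value at a dotted path; intermediate nodes must already exist."""
    *parents, last = path.split(".")
    node: Any = doc
    for key in parents:
        node = _step(node, key)
    if isinstance(node, list):
        node[int(last)] = value
    else:
        node[last] = value


def capture_update(doc: dict, path: str, new_value: Any) -> dict:
    """Apply one correction in place and return its before/after log entry."""
    entry = {
        "field_path": path,
        "old_value": get_path(doc, path),
        "new_value": new_value,
    }
    set_path(doc, path, new_value)
    return entry


def restore_original(corrected: dict, entries: list[dict]) -> dict:
    """Rebuild the pre-correction document by undoing entries newest-first."""
    original = copy.deepcopy(corrected)
    for entry in reversed(entries):
        set_path(original, entry["field_path"], entry["old_value"])
    return original
```

Undoing entries in reverse order matters when the same field is corrected more than once, since the earliest old value is the one extraction produced. Line-item removals are left out of this sketch because restoring them depends on how indexes are recorded.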


5. Audience

This feature is developer-only. No UI changes are introduced.


6. Use Cases

6.1 Debugging an Unexpected Field Value

Scenario: Invoice #12345 shows subtotal = 105.0, but the PDF clearly states 100.0.
Developer queries: SELECT * FROM ingestion_correction_log
                   WHERE ingestion_id = '<id>' AND field_path = 'subtotal'
Result: Shows that UncertainValidationsResolver changed subtotal from 100.0 to 105.0
        with reasoning "Found shipping fee included in subtotal on PDF"
Action: Developer investigates whether the correction was appropriate or a model error

6.2 Identifying Extraction Improvement Opportunities

Scenario: Developer wants to know if extraction prompts should be improved.
Developer queries: SELECT field_path, action, count(*) as frequency
                   FROM ingestion_correction_log
                   WHERE tenant_id = '<id>' AND created_at > now() - interval '30 days'
                   GROUP BY field_path, action
                   ORDER BY frequency DESC
Result: line-item removals account for 40% of all corrections
Action: Improve extraction prompt to avoid extracting section subtotals as line items

6.3 Pipeline Health Monitoring

Scenario: Developer wants to measure correction rate over time.
Developer queries: SELECT
                     count(DISTINCT icl.ingestion_id)::float / count(DISTINCT i.id) as correction_rate
                   FROM ingestion i
                   LEFT JOIN ingestion_correction_log icl ON icl.ingestion_id = i.id
                   WHERE i.status = 'completed'
                     AND i.completed_at > now() - interval '30 days'
Result: 12% of ingestions required at least one correction
Action: Track this metric monthly to measure extraction quality improvements

6.4 Tracing a Tax ID Correction

Scenario: A supplier's tax ID type was reclassified from "vat" to "ein".
Developer queries: SELECT * FROM ingestion_correction_log
                   WHERE ingestion_id = '<id>' AND field_path LIKE 'issuer.tax-id%'
Result: Two corrections logged:
        - field_path: 'issuer.tax-id-type', old: "vat", new: "ein"
        - field_path: 'issuer.tax-id', old: "DE12345678", new: "12-3456789"
        Processor: tax-compliance-analyzer
Action: Developer verifies the reclassification was correct

7. Data Model

7.1 Correction Log Table

Table: ingestion_correction_log

| Column | Type | Description |
|---|---|---|
| id | UUID | Primary key |
| ingestion_id | UUID (FK) | Which ingestion run produced this correction |
| tenant_id | UUID (FK) | Tenant for RLS and partitioning |
| processor_id | TEXT | Which processor made the correction (e.g., uncertain-validations-resolver, tax-compliance-analyzer) |
| action | TEXT | One of: field-update, line-item-update, line-item-removal |
| field_path | TEXT | Dotted path to the corrected field (e.g., subtotal, issuer.country, line-items.0.amount) |
| check_name | TEXT (nullable) | Which validation check triggered this correction (e.g., financial-math). Null for non-validation corrections (e.g., TaxComplianceAnalyzer). |
| old_value | JSONB | The value before correction |
| new_value | JSONB (nullable) | The value after correction. Null for line-item-removal. |
| reasoning | TEXT | LLM's explanation for why the correction was made |
| confidence | NUMERIC | LLM's confidence score (0.0 - 1.0) |
| created_at | TIMESTAMPTZ | When the correction was logged |
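
One way to keep rows consistent with these column rules is to validate them before insertion. The sketch below assumes a Python-side representation; the class name `CorrectionLogRow` is hypothetical, `id` and `created_at` are assumed to be generated by the database, and JSON values are carried as encoded text so that a JSON null ("null") stays distinguishable from a missing value (None).

```python
import json
from dataclasses import dataclass
from typing import Optional

ACTIONS = {"field-update", "line-item-update", "line-item-removal"}


@dataclass(frozen=True)
class CorrectionLogRow:
    """Hypothetical in-memory shape of one ingestion_correction_log row."""
    ingestion_id: str
    tenant_id: str
    processor_id: str
    action: str
    field_path: str
    old_value: str              # JSON-encoded text, e.g. "100.0"
    new_value: Optional[str]    # JSON-encoded text; None maps to SQL NULL
    reasoning: str
    confidence: float
    check_name: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")
        # new_value is SQL NULL for removals and only for removals.
        if (self.action == "line-item-removal") != (self.new_value is None):
            raise ValueError("new_value must be None only for line-item-removal")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0.0 and 1.0")
        json.loads(self.old_value)  # raises if old_value is not valid JSON
```

A field corrected to null would pass `new_value="null"`, which is accepted for field-update, while `new_value=None` is accepted only for line-item-removal.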

7.2 Indexes

| Index | Purpose |
|---|---|
| (ingestion_id) | Look up all corrections for a specific ingestion |
| (tenant_id, processor_id) | Analytics: which processors produce the most corrections per tenant |
| (tenant_id, field_path) | Analytics: which fields are corrected most often per tenant |
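
To try the table, its indexes, and a "top corrected fields" query locally, an in-memory SQLite database can stand in for the production database. This is only an approximation under stated assumptions: SQLite has no JSONB, UUID, or TIMESTAMPTZ types, so the sketch stores them as TEXT; the NOT NULL constraints are inferred from which columns the spec marks nullable; the index names and sample rows are made up for illustration; and the 30-day time filter is omitted.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ingestion_correction_log (
    id           TEXT PRIMARY KEY,
    ingestion_id TEXT NOT NULL,
    tenant_id    TEXT NOT NULL,
    processor_id TEXT NOT NULL,
    action       TEXT NOT NULL CHECK (action IN
                   ('field-update', 'line-item-update', 'line-item-removal')),
    field_path   TEXT NOT NULL,
    check_name   TEXT,
    old_value    TEXT NOT NULL,   -- JSONB in the spec; JSON text here
    new_value    TEXT,            -- SQL NULL for line-item-removal
    reasoning    TEXT NOT NULL,
    confidence   REAL NOT NULL,
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_icl_ingestion ON ingestion_correction_log (ingestion_id);
CREATE INDEX idx_icl_tenant_processor ON ingestion_correction_log (tenant_id, processor_id);
CREATE INDEX idx_icl_tenant_field ON ingestion_correction_log (tenant_id, field_path);
"""

TOP_FIELDS = """
SELECT field_path, action, COUNT(*) AS frequency
FROM ingestion_correction_log
WHERE tenant_id = ?
GROUP BY field_path, action
ORDER BY frequency DESC, field_path
"""


def top_corrected_fields(conn: sqlite3.Connection, tenant_id: str) -> list[tuple]:
    """Return (field_path, action, frequency) rows, most frequent first."""
    return conn.execute(TOP_FIELDS, (tenant_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sample_rows = [  # illustrative data only
    ("1", "ing-1", "t-1", "uncertain-validations-resolver", "field-update",
     "subtotal", "financial-math", "100.0", "105.0", "shipping fee", 0.85),
    ("2", "ing-2", "t-1", "uncertain-validations-resolver", "field-update",
     "subtotal", "financial-math", "40.0", "42.0", "rounding", 0.80),
    ("3", "ing-2", "t-1", "tax-compliance-analyzer", "field-update",
     "issuer.tax-id-type", None, '"vat"', '"ein"', "EIN format", 0.92),
]
conn.executemany(
    "INSERT INTO ingestion_correction_log (id, ingestion_id, tenant_id, processor_id,"
    " action, field_path, check_name, old_value, new_value, reasoning, confidence)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)
```

The (tenant_id, field_path) index is the one this query would lean on, since it filters by tenant and groups by field path.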

7.3 Example Rows

Field correction (subtotal):

| Column | Value |
|---|---|
| processor_id | uncertain-validations-resolver |
| action | field-update |
| field_path | subtotal |
| check_name | financial-math |
| old_value | 100.0 |
| new_value | 105.0 |
| reasoning | "Found shipping fee of 5.0 included in subtotal on PDF page 1" |
| confidence | 0.85 |

Line item removal:

| Column | Value |
|---|---|
| processor_id | uncertain-validations-resolver |
| action | line-item-removal |
| field_path | line-items.2 |
| check_name | financial-math |
| old_value | {"description": "Subtotal Section A", "amount": 500.0, ...} |
| new_value | null |
| reasoning | "This is a section subtotal row, not a billable line item" |
| confidence | 0.90 |

Tax ID type reclassification:

| Column | Value |
|---|---|
| processor_id | tax-compliance-analyzer |
| action | field-update |
| field_path | issuer.tax-id-type |
| check_name | null |
| old_value | "vat" |
| new_value | "ein" |
| reasoning | "US-based supplier, EIN format matches, not a VAT ID" |
| confidence | 0.92 |

8. Edge Cases & Business Rules

| Edge Case | Behavior |
|---|---|
| Ingestion retried after failure (attempt_count > 1) | Delete existing correction logs for the ingestion before re-processing. The log always reflects the final successful run. |
| Post-processor runs but makes no corrections | No rows inserted. Absence of rows = no corrections applied. |
| Multiple corrections in a single processor run | Each correction is a separate row. The ingestion_id ties them together. |
| Correction sets a field to null | new_value is stored as JSON null. Distinct from line-item-removal where new_value column itself is SQL NULL. |
| Line item removal changes subsequent item indexes | Log the index at the time of removal (before any reindexing). Multiple removals are logged with their original indexes. |
| Processor crashes mid-correction | If the ingestion fails and retries, correction logs are deleted at the start of the next attempt (see retry behavior above). |
| Enrichment processor overwrites a field that already had a value | Out of scope for v1. Only corrective processors are tracked. If enrichment processors start overwriting extracted values, they should be added to scope. |
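
The index rule for line-item removals can be made concrete: every removal is logged with the index the item had before any removal took place, and the list is compacted only afterwards. A minimal sketch, with the hypothetical helper `remove_line_items`; the `line-items.N` path format follows the example rows in section 7.3.

```python
from typing import Any


def remove_line_items(
    line_items: list[dict[str, Any]], indexes: list[int]
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Remove items by original index; return (remaining items, log entries).

    Entries record the pre-removal index, so removals at original positions
    1 and 3 are logged as line-items.1 and line-items.3 even though the
    second item sits at position 2 once the first one is gone.
    """
    to_remove = set(indexes)
    entries = [
        {
            "action": "line-item-removal",
            "field_path": f"line-items.{i}",
            "old_value": line_items[i],
            "new_value": None,  # maps to SQL NULL for removals
        }
        for i in sorted(to_remove)
    ]
    remaining = [item for i, item in enumerate(line_items) if i not in to_remove]
    return remaining, entries
```

Building all log entries before compacting the list avoids the off-by-one drift that would occur if items were deleted one at a time and logged with their shifted positions.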

9. Data Retention


10. Out of Scope (v1)


11. Open Questions

  1. Field path format: Should field paths use dot notation (issuer.country) or vector notation ([:issuer :country])? Dot notation is more SQL-friendly; vector notation is more Clojure-idiomatic. Recommendation: dot notation for queryability.
  2. Confidence threshold logging: Should we also log cases where the LLM analyzed a field but decided NOT to correct it (confidence too low or value confirmed correct)? This data could be useful for understanding false-positive rates. Recommendation: defer to v2.
  3. Cross-processor conflicts: If both UncertainValidationsResolver and TaxComplianceAnalyzer try to correct the same field (e.g., tax-id), who wins? Currently they run in parallel and results are applied sequentially. The last writer wins, but both corrections would be logged. Recommendation: document this behavior and accept it for v1.
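
The last-writer-wins behavior in question 3 can be illustrated with a small simulation. This is a sketch of the described semantics only, not of the actual pipeline code: `apply_in_order` is a hypothetical helper, the document is simplified to top-level keys, and the choice to read each old_value at apply time (so the second row's old_value is the first processor's new_value) is an assumption rather than something this spec states.

```python
from typing import Any


def apply_in_order(
    doc: dict[str, Any], proposals: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Apply top-level field corrections sequentially; log every one of them."""
    log = []
    for proposal in proposals:
        path = proposal["field_path"]
        log.append({
            "processor_id": proposal["processor_id"],
            "field_path": path,
            "old_value": doc[path],  # value at apply time (assumption)
            "new_value": proposal["new_value"],
        })
        doc[path] = proposal["new_value"]  # later proposals overwrite earlier ones
    return log
```

Under this assumption, only the first row for a field holds the value extraction produced, so a reader reconstructing the original state would need the earliest row per field rather than the latest.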