Ingestion Regression Testing — Design

Problem

When making changes to the ingestion pipeline (transcription, classification, extraction, post-processing), we need a way to verify that existing documents don't regress in quality. Currently there's no automated way to compare ingestion results before and after a code change.

Solution

A Claude Code skill (/regression-test) that takes a list of document IDs (or natural language description), dispatches parallel inspector agents — one per document — that each:

Ensure the document exists locally (fetching from prod if needed)
Capture the baseline structured data from the existing (prod) ingestion
Trigger a reingestion with the current local code
Wait for completion
Compare the full structured data (old vs new)
If the comparison is ambiguous, visually inspect the PDF to determine ground truth
Write an individual report to disk

Architecture

User: "/regression-test <document IDs or description>"
         │
         ▼
┌──────────────────────────┐
│   Orchestrator (skill)   │
│                          │
│  1. Parse document IDs   │
│  2. Dispatch inspectors  │──── one Agent per document, parallel
│  3. Wait for completion  │
│  4. Print summary table  │──── terminal only, not persisted
└──────────────────────────┘
         │
         ▼  (parallel, one per document)
┌──────────────────────────┐
│   Inspector Agent        │
│                          │
│  1. Check local DB       │──── psql
│  2. If missing, fetch    │──── bb debug:fetch-document --force <id>
│  3. Capture prod baseline│──── psql (structured_data + commit_sha
│     from latest completed│     from latest completed ingestion)
│     ingestion            │
│  4. Trigger reingestion  │──── clojure-eval REPL
│  5. Poll until complete  │──── clojure-eval (SQL on ingestion table)
│  6. Fetch new structured │──── clojure-eval (SQL)
│     data                 │
│  7. Compare old vs new   │
│  8. If ambiguous, read   │──── Read tool on PDF from local S3
│     the PDF visually     │
│  9. Write report to disk │──── docs/regression-reports/<doc-id>-<commit>.md
└──────────────────────────┘

Inspector Details

Document Fetching

Follows the /debug-doc pattern:

Check local DB: SELECT id FROM document WHERE id = '<doc-id>'
If missing: bb debug:fetch-document --force <doc-id>
- Connects to prod via SSM port forwarding
- Downloads document + all ingestion history + S3 files
- Inserts into local DB under dev-seed tenant

Baseline Capture

Before reingesting, capture from the latest completed ingestion:

SELECT structured_data, commit_sha
FROM ingestion
WHERE document_id = '<doc-id>' AND status = 'completed'
ORDER BY created_at DESC
LIMIT 1

Reingestion

Trigger via clojure-eval REPL:

(require '[com.getorcha.erp.ingestion :as erp.ingestion])
(erp.ingestion/queue-for-ingestion! (repl/db-pool) (repl/aws) #uuid "<doc-id>" nil)

Polling for Completion

Poll the ingestion status via clojure-eval until it leaves in-progress:

(db.sql/execute-one! (repl/db-pool)
  {:select [:status]
   :from   [:ingestion]
   :where  [:= :id #uuid "<ingestion-id>"]})

Poll interval: a few seconds. Timeout: reasonable upper bound (e.g., 2-3 minutes).

Comparison Logic

Compare the entire structured_data JSON object — old vs new. This includes all extraction outputs and all post-processing outputs (account matching, fraud detection, validation, tax issues, etc.). This is a regression test on ingestion quality, not just extraction quality.

Excluded from comparison: document matching / clusters. The inspector prompt explicitly calls this out to avoid false regression reports.

Deep-diff the full objects. Categorize differences by severity. Only report fields that differ — identical fields are omitted from the report.

Verdicts

Identical — no meaningful differences
Improved — net positive changes (fixed errors, added missing data, better accuracy)
Regressed — net negative changes (lost data, introduced errors, worse accuracy)
Mixed — some improvements, some regressions
Unclear — significant discrepancies, can't determine winner → triggers PDF visual inspection

PDF Visual Inspection

Only triggered when the verdict is "Unclear" — significant discrepancies between old and new structured data with no clear winner.

The inspector downloads the PDF from local S3 and reads it visually (Claude vision capabilities) to determine ground truth, then revises the verdict.

bb dev:aws-cli s3 cp s3://v1-orcha-global-storage-local-stack/documents/<doc-id>.pdf /tmp/<doc-id>.pdf

Then uses the Read tool on the PDF file.

Report Format

File naming

docs/regression-reports/<doc-id>-<new-commit-sha-short>.md

Examples:

docs/regression-reports/019c0fdc-1ea9-709f-a196-ceac943ae631-abc1234.md

Glob by document: docs/regression-reports/019c0fdc-* Glob by commit: docs/regression-reports/*-abc1234.md

Identical result (minimal)

# Regression: <doc-id> — <new-commit-short>
- **Verdict**: Identical
- **Prod commit**: `<sha>`
- **New commit**: `<sha>`

No meaningful differences in structured data.

Non-identical result (full narrative)

# Regression: <doc-id> — <new-commit-short>
- **Filename**: <original filename>
- **Verdict**: Improved / Regressed / Mixed / Unclear
- **Prod commit**: `<sha>`
- **New commit**: `<sha>`

## Changes
<narrative description of differences only, organized by significance>

## PDF Inspection
<only present if verdict required visual inspection>
<what the agent saw and how it resolved the ambiguity>

Orchestrator terminal summary

Not persisted. Printed after all inspectors complete:

| Document | Filename | Verdict   |
|----------|----------|-----------|
| 019c...  | inv.pdf  | Regressed |
| 019d...  | bill.pdf | Improved  |

X documents tested. Y improved, Z regressed, W identical, ...

Scope

Invoice documents only — the only document type in scope for regression testing
Local reingestion — tests run against the locally running system (REPL + workers)
Comparison excludes matching/clusters — explicitly noted in inspector instructions

Implementation Artifacts

Skill: .claude/skills/regression-test/SKILL.md — orchestrator instructions
Inspector prompt: .claude/skills/regression-test/inspector-prompt.md — template for each subagent
Report directory: docs/regression-reports/ (gitignored)