When making changes to the ingestion pipeline (transcription, classification, extraction, post-processing), we need a way to verify that existing documents don't regress in quality. Currently there's no automated way to compare ingestion results before and after a code change.
A Claude Code skill (/regression-test) that takes a list of document IDs (or natural language description), dispatches parallel inspector agents — one per document — that each:
User: "/regression-test <document IDs or description>"
│
▼
┌──────────────────────────┐
│ Orchestrator (skill) │
│ │
│ 1. Parse document IDs │
│ 2. Dispatch inspectors │──── one Agent per document, parallel
│ 3. Wait for completion │
│ 4. Print summary table │──── terminal only, not persisted
└──────────────────────────┘
│
▼ (parallel, one per document)
┌──────────────────────────┐
│ Inspector Agent │
│ │
│ 1. Check local DB │──── psql
│ 2. If missing, fetch │──── bb debug:fetch-document --force <id>
│ 3. Capture prod baseline│──── psql (structured_data + commit_sha
│ from latest completed│ from latest completed ingestion)
│ ingestion │
│ 4. Trigger reingestion │──── clojure-eval REPL
│ 5. Poll until complete │──── clojure-eval (SQL on ingestion table)
│ 6. Fetch new structured │──── clojure-eval (SQL)
│ data │
│ 7. Compare old vs new │
│ 8. If ambiguous, read │──── Read tool on PDF from local S3
│ the PDF visually │
│ 9. Write report to disk │──── docs/regression-reports/<doc-id>-<commit>.md
└──────────────────────────┘
Follows the /debug-doc pattern:
SELECT id FROM document WHERE id = '<doc-id>'bb debug:fetch-document --force <doc-id>
Before reingesting, capture from the latest completed ingestion:
SELECT structured_data, commit_sha
FROM ingestion
WHERE document_id = '<doc-id>' AND status = 'completed'
ORDER BY created_at DESC
LIMIT 1
Trigger via clojure-eval REPL:
(require '[com.getorcha.erp.ingestion :as erp.ingestion])
(erp.ingestion/queue-for-ingestion! (repl/db-pool) (repl/aws) #uuid "<doc-id>" nil)
Poll the ingestion status via clojure-eval until it leaves in-progress:
(db.sql/execute-one! (repl/db-pool)
{:select [:status]
:from [:ingestion]
:where [:= :id #uuid "<ingestion-id>"]})
Poll interval: a few seconds. Timeout: reasonable upper bound (e.g., 2-3 minutes).
Compare the entire structured_data JSON object — old vs new. This includes all extraction outputs and all post-processing outputs (account matching, fraud detection, validation, tax issues, etc.). This is a regression test on ingestion quality, not just extraction quality.
Excluded from comparison: document matching / clusters. The inspector prompt explicitly calls this out to avoid false regression reports.
Deep-diff the full objects. Categorize differences by severity. Only report fields that differ — identical fields are omitted from the report.
Only triggered when the verdict is "Unclear" — significant discrepancies between old and new structured data with no clear winner.
The inspector downloads the PDF from local S3 and reads it visually (Claude vision capabilities) to determine ground truth, then revises the verdict.
bb dev:aws-cli s3 cp s3://v1-orcha-global-storage-local-stack/documents/<doc-id>.pdf /tmp/<doc-id>.pdf
Then uses the Read tool on the PDF file.
docs/regression-reports/<doc-id>-<new-commit-sha-short>.md
Examples:
docs/regression-reports/019c0fdc-1ea9-709f-a196-ceac943ae631-abc1234.mdGlob by document: docs/regression-reports/019c0fdc-*
Glob by commit: docs/regression-reports/*-abc1234.md
# Regression: <doc-id> — <new-commit-short>
- **Verdict**: Identical
- **Prod commit**: `<sha>`
- **New commit**: `<sha>`
No meaningful differences in structured data.
# Regression: <doc-id> — <new-commit-short>
- **Filename**: <original filename>
- **Verdict**: Improved / Regressed / Mixed / Unclear
- **Prod commit**: `<sha>`
- **New commit**: `<sha>`
## Changes
<narrative description of differences only, organized by significance>
## PDF Inspection
<only present if verdict required visual inspection>
<what the agent saw and how it resolved the ambiguity>
Not persisted. Printed after all inspectors complete:
| Document | Filename | Verdict |
|----------|----------|-----------|
| 019c... | inv.pdf | Regressed |
| 019d... | bill.pdf | Improved |
X documents tested. Y improved, Z regressed, W identical, ...
.claude/skills/regression-test/SKILL.md — orchestrator instructions.claude/skills/regression-test/inspector-prompt.md — template for each subagentdocs/regression-reports/ (gitignored)