Proof of concept: match a set of invoice PDFs against a credit card statement PDF.
Demonstrate that LLM-based transcription + extraction + matching works well enough
for CC reconciliation to be worth productizing. Tune for correct results on the
test dataset (~/code/orcha/drive/Kroeger Tax/Testdata CC).
../spikes/cc-reconciliation/
├── .env # API keys (gitignored)
├── .env.example
├── .gitignore
├── requirements.txt
├── docs/plans/
├── cache/ # Cached transcription/extraction results (gitignored)
├── output/ # Generated HTML reports
└── src/
    ├── main.py        # Entry point: full pipeline, CLI flags
    ├── preprocess.py  # OpenCV: denoise, CLAHE, deskew, border cleanup
    ├── transcribe.py  # Google Doc AI primary, Gemini vision fallback
    ├── extract.py     # Claude: CC statement + invoice extraction
    ├── match.py       # Claude: single-call matching with reasoning
    └── report.py      # Jinja2 HTML report, print-friendly CSS
1. Discover PDFs
   Classify by filename: "Kreditkartenabrechnung" → statement, rest → invoices
2. For each PDF (cached per-file):
   a. Render pages to images (pymupdf)
   b. Preprocess each page (denoise + CLAHE + deskew + border crop)
   c. Transcribe (Google Doc AI, Gemini vision fallback if low quality)
   d. Extract structured data (Claude Sonnet 4.5)
   e. Write cache/{filename}.json
3. Match (single Claude call)
   All CC lines + all invoice summaries → matches with reasoning
4. Generate HTML report → output/report.html
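Steps 1 and 2e of the pipeline above can be sketched in plain Python. This is a minimal sketch, not the spike's actual code: the function names (`classify_pdfs`, `cache_path`, `load_or_compute`) are hypothetical, the case-insensitive filename check is an assumption, and `{filename}` is assumed to mean the full PDF name including its extension.

```python
import json
from pathlib import Path

# Marker taken from the plan: files containing it are CC statements.
STATEMENT_MARKER = "Kreditkartenabrechnung"


def classify_pdfs(paths):
    """Split PDF paths into (statements, invoices) by filename."""
    statements, invoices = [], []
    for path in sorted(Path(p) for p in paths):
        if path.suffix.lower() != ".pdf":
            continue  # ignore non-PDF files in the dataset folder
        # Assumption: the match is case-insensitive.
        is_statement = STATEMENT_MARKER.lower() in path.name.lower()
        (statements if is_statement else invoices).append(path)
    return statements, invoices


def cache_path(cache_dir, pdf_path):
    """Location of the per-file cache entry: cache/{filename}.json."""
    return Path(cache_dir) / f"{Path(pdf_path).name}.json"


def load_or_compute(pdf_path, cache_dir, compute, use_cache=True):
    """Return the cached result for pdf_path, or run compute() and cache it."""
    target = cache_path(cache_dir, pdf_path)
    if use_cache and target.exists():
        return json.loads(target.read_text(encoding="utf-8"))
    result = compute(pdf_path)  # render, preprocess, transcribe, extract
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return result
```

Passing `use_cache=False` forces recomputation but still rewrites the cache file, so a later cached run picks up the fresh result.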
--no-cache: force full re-run, ignore cached results
--match-only: skip transcription/extraction, re-run matching from cache
--no-preprocess: skip image preprocessing (transcribe raw page images)

Preprocessing is applied to every page image by default and can be switched off via --no-preprocess.
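The CLI flags described above could be wired up with `argparse` roughly as follows. The helper names are hypothetical, and rejecting `--no-cache` together with `--match-only` is an assumption of this sketch (the plan does not say how the two interact).

```python
import argparse


def build_parser():
    """Parser for the three flags named in the plan."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="CC reconciliation spike"
    )
    parser.add_argument(
        "--no-cache", action="store_true",
        help="force full re-run, ignore cached results",
    )
    parser.add_argument(
        "--match-only", action="store_true",
        help="skip transcription/extraction, re-run matching from cache",
    )
    parser.add_argument(
        "--no-preprocess", action="store_true",
        help="skip image preprocessing (transcribe raw page images)",
    )
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: --match-only needs cached extractions, so combining it
    # with --no-cache is treated as a usage error.
    if args.no_cache and args.match_only:
        parser.error("--match-only reads from the cache; drop --no-cache")
    return args
```

For example, `parse_args(["--match-only"])` yields `match_only=True` with the other two flags left at `False`.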
Order:
Order matters: borders removed before skew detection, CLAHE before denoise (CLAHE can amplify noise).
Quality metrics always computed (even when preprocessing is off):
The dataset has three quality tiers:
Main issues are grain/noise and low contrast, not blur. Blur detection (Laplacian) is kept as a quality metric for debugging, not as a preprocessing gate.
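To make the Laplacian blur metric concrete, here is the arithmetic in pure Python on a tiny grayscale grid. This is for illustration only: the spike would presumably compute it with OpenCV (for example the variance of `cv2.Laplacian`), whose border handling differs from this interior-only version, and the function name is hypothetical.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    gray is a list of equal-length rows of pixel intensities.
    Low values suggest few sharp edges, i.e. a blurrier page.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses) if responses else 0.0


# A hard vertical edge scores higher than a softened one.
sharp = [[0, 0, 255, 255] for _ in range(4)]
soft = [[0, 64, 191, 255] for _ in range(4)]
flat = [[5, 5, 5, 5] for _ in range(4)]
```

Because the value is only logged and never used as a gate, no threshold needs tuning; it simply helps explain a poor transcription after the fact.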
Mirrors the Orcha pattern:
Models:
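Google Doc AI processor: 2ce14f950a811b13 (project: getorcha-dev, location: eu)
Gemini fallback model: gemini-3-pro-preview

Two prompts, same model (claude-sonnet-4-5-20250929), one for the CC statement and one for invoices. Example outputs, statement first: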
{
"statement_period": {"from": "2025-12-12", "to": "2026-01-14"},
"cardholder": "Henning Olinski",
"card_number": "5310 00XX XXXX 1593",
"total": 5461.56,
"currency": "EUR",
"lines": [
{
"index": 1,
"purchase_date": "2025-12-12",
"booking_date": "2025-12-15",
"merchant": "UZR*Coffee Unlimited, Hamburg",
"amount": 42.40,
"currency": "EUR",
"miles": 42,
"redacted": false
},
{
"index": 4,
"merchant": null,
"amount": 46.75,
"redacted": true
}
]
}
{
"vendor": "Zirkonzahn Deutschland GmbH",
"invoice_number": "2025DE23264",
"invoice_date": "2025-11-27",
"total": 309.40,
"currency": "EUR",
"payment_method": "Kreditkarte",
"brief_description": "Dental CAD/CAM components (scananalog, titanit)"
}
Minimal fields — only what's needed for matching.
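The invoice summary above maps naturally onto a small typed record. The sketch below is one possible shape, not the spike's code: the class and function names are hypothetical, and treating `vendor`, `total` and `currency` as required while the rest are optional is an assumption. Amounts are parsed into `Decimal` so that later amount comparisons are exact.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class InvoiceSummary:
    vendor: str
    total: Decimal
    currency: str
    invoice_number: Optional[str] = None
    invoice_date: Optional[date] = None
    payment_method: Optional[str] = None
    brief_description: Optional[str] = None


def parse_invoice(raw: dict) -> InvoiceSummary:
    """Build an InvoiceSummary from the extraction JSON.

    Raises KeyError if a field assumed to be required is missing.
    """
    raw_date = raw.get("invoice_date")
    return InvoiceSummary(
        vendor=raw["vendor"],
        # Go through str() so 309.40 becomes Decimal("309.4"), not a
        # long binary-float expansion.
        total=Decimal(str(raw["total"])),
        currency=raw["currency"],
        invoice_number=raw.get("invoice_number"),
        invoice_date=date.fromisoformat(raw_date) if raw_date else None,
        payment_method=raw.get("payment_method"),
        brief_description=raw.get("brief_description"),
    )
```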
Single Claude call (claude-sonnet-4-5-20250929). All extracted data in one
context window.
Input: formatted list of CC lines + invoice summaries with filenames. Output:
{
"matches": [
{
"cc_line_index": 1,
"invoice_file": "2025-12-04_073706.pdf",
"confidence": "high",
"reasoning": "Amount 42.40€ exact match, merchant 'Coffee Unlimited' matches, dates align"
}
],
"unmatched_cc_lines": [
{
"cc_line_index": 2,
"amount": 27.01,
"reasoning": "Redacted entry, no invoice with matching amount found"
}
],
"unmatched_invoices": [
{
"file": "2026-03-05_125059.pdf",
"vendor": "...",
"amount": 123.45,
"reasoning": "Date falls outside statement period"
}
]
}
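Since all matching happens in one model call, a cheap structural check on the returned JSON can catch dropped or invented entries before the report is generated. The following is a sketch under assumptions: the function name is hypothetical, and it assumes every CC line should appear exactly once (matched or unmatched) and every invoice file at least once.

```python
from collections import Counter


def check_match_result(result, cc_line_indices, invoice_files):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []

    # Every CC line should be accounted for exactly once.
    line_counts = Counter(
        [m["cc_line_index"] for m in result.get("matches", [])]
        + [u["cc_line_index"] for u in result.get("unmatched_cc_lines", [])]
    )
    for index in cc_line_indices:
        if line_counts[index] != 1:
            problems.append(f"CC line {index} appears {line_counts[index]} times")
    for index in sorted(set(line_counts) - set(cc_line_indices)):
        problems.append(f"unknown CC line index {index}")

    # Every invoice should be either matched or listed as unmatched.
    file_counts = Counter(
        [m["invoice_file"] for m in result.get("matches", [])]
        + [u["file"] for u in result.get("unmatched_invoices", [])]
    )
    for name in invoice_files:
        if file_counts[name] == 0:
            problems.append(f"invoice {name} is not accounted for")
    for name in sorted(set(file_counts) - set(invoice_files)):
        problems.append(f"unknown invoice file {name}")

    return problems
```

A non-empty result could be shown as a warning in the report or used to trigger a retry of the matching call.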
Single file, inline CSS, no external dependencies. Jinja2 template as a string
constant in report.py.
Sections:
Print CSS:
@media print with proper page breaks

Dependencies:
pymupdf: PDF page rendering
opencv-python: image preprocessing
numpy: array ops for OpenCV
anthropic: Claude API
google-cloud-documentai: OCR
google-generativeai: Gemini vision fallback
jinja2: HTML templating
python-dotenv: .env loading

Secrets and config, taken from Orcha's LocalStack SSM parameters and stored in .env:
ANTHROPIC_API_KEY
GOOGLE_GENAI_API_KEY
GOOGLE_CLOUD_PROJECT
GOOGLE_DOCAI_PROCESSOR_ID
GOOGLE_APPLICATION_CREDENTIALS (path to credentials file)
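A fail-fast check for the environment variables listed above can save a wasted run. This is a minimal sketch with a hypothetical helper name; it assumes the values have already been loaded into the process environment (for example by python-dotenv's `load_dotenv()`) and that all five are required.

```python
import os

# Names taken from the plan; all are assumed to be required.
REQUIRED_ENV_VARS = (
    "ANTHROPIC_API_KEY",
    "GOOGLE_GENAI_API_KEY",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_DOCAI_PROCESSOR_ID",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling `missing_env_vars()` at the top of `main.py` and exiting with the returned names in the error message keeps the failure close to its cause.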