CC Reconciliation Spike — Design

Goal

Proof of concept: match a set of invoice PDFs against a credit card statement PDF. Demonstrate that LLM-based transcription + extraction + matching works well enough for CC reconciliation to be worth productizing. Tune for correct results on the test dataset (~/code/orcha/drive/Kroeger Tax/Testdata CC).

Dataset

1 DKB Miles & More credit card statement (5 pages, period 12.12.2025–14.01.2026)
23 invoice PDFs (scans: thermal receipts, professional invoices, Amazon digital prints)
Some CC entries are physically redacted (blacked out, handwritten "Privat")
Some invoices may not match any CC entry (outside statement period, different payment method)
Redacted entries should still be matched if possible (by amount/date)

Project Structure

../spikes/cc-reconciliation/
├── .env                     # API keys (gitignored)
├── .env.example
├── .gitignore
├── requirements.txt
├── docs/plans/
├── cache/                   # Cached transcription/extraction results (gitignored)
├── output/                  # Generated HTML reports
└── src/
    ├── main.py              # Entry point — full pipeline, CLI flags
    ├── preprocess.py         # OpenCV: denoise, CLAHE, deskew, border cleanup
    ├── transcribe.py         # Google Doc AI primary, Gemini vision fallback
    ├── extract.py            # Claude: CC statement + invoice extraction
    ├── match.py              # Claude: single-call matching with reasoning
    └── report.py             # Jinja2 HTML report, print-friendly CSS

Pipeline

1. Discover PDFs
   Classify by filename: "Kreditkartenabrechnung" → statement, rest → invoices

2. For each PDF (cached per-file):
   a. Render pages to images (pymupdf)
   b. Preprocess each page (denoise + CLAHE + deskew + border crop)
   c. Transcribe (Google Doc AI, Gemini vision fallback if low quality)
   d. Extract structured data (Claude Sonnet 4.5)
   e. Write cache/{filename}.json

3. Match (single Claude call)
   All CC lines + all invoice summaries → matches with reasoning

4. Generate HTML report → output/report.html

CLI Flags

--no-cache — force full re-run, ignore cached results
--match-only — skip transcription/extraction, re-run matching from cache
--no-preprocess — skip image preprocessing (transcribe raw page images)

Preprocessing

Applied to every page image by default. Togglable via --no-preprocess.

Order:

Border cleanup — detect dark scanner edges, crop to content
Deskew — Hough transform, correct if |angle| > 0.5°
CLAHE — clipLimit=2.0, tileGridSize=(8,8)
Denoise — cv2.fastNlMeansDenoising(h=10)

Order matters: borders removed before skew detection, CLAHE before denoise (CLAHE can amplify noise).

Quality metrics always computed (even when preprocessing is off):

Laplacian variance (edge sharpness)
Detected skew angle
Whether borders were cropped
Original dimensions

Document Quality Assessment

The dataset has three quality tiers:

Clean digital prints (~60%): Amazon, Zirkonzahn, Porsche, Condent. No preprocessing needed but it won't hurt.
Moderate scans (~25%): Some Zirkonzahn, EasyPark (dark header gradient). CLAHE + denoise helps.
Receipts/annotated (~15%): Thermal receipts, handwritten annotations. Narrow format, lower resolution.

Main issues are grain/noise and low contrast, not blur. Blur detection (Laplacian) is kept as a quality metric for debugging, not as a preprocessing gate.

Transcription

Mirrors the Orcha pattern:

Primary: Google Document AI — preprocessed page images, returns text + token confidence
Quality check — if >5% tokens have confidence <0.8, flag as low quality
Fallback: Gemini vision (gemini-3-pro-preview) — page images with transcription prompt
Output — concatenated text with page markers, per-page quality metrics

Models:

Google Document AI processor: 2ce14f950a811b13 (project: getorcha-dev, location: eu)
Gemini vision fallback: gemini-3-pro-preview

Extraction

Two prompts, same model (claude-sonnet-4-5-20250929):

CC Statement → structured data

{
  "statement_period": {"from": "2025-12-12", "to": "2026-01-14"},
  "cardholder": "Henning Olinski",
  "card_number": "5310 00XX XXXX 1593",
  "total": 5461.56,
  "currency": "EUR",
  "lines": [
    {
      "index": 1,
      "purchase_date": "2025-12-12",
      "booking_date": "2025-12-15",
      "merchant": "UZR*Coffee Unlimited, Hamburg",
      "amount": 42.40,
      "currency": "EUR",
      "miles": 42,
      "redacted": false
    },
    {
      "index": 4,
      "merchant": null,
      "amount": 46.75,
      "redacted": true
    }
  ]
}

Invoice → structured data

{
  "vendor": "Zirkonzahn Deutschland GmbH",
  "invoice_number": "2025DE23264",
  "invoice_date": "2025-11-27",
  "total": 309.40,
  "currency": "EUR",
  "payment_method": "Kreditkarte",
  "brief_description": "Dental CAD/CAM components (scananalog, titanit)"
}

Minimal fields — only what's needed for matching.

Matching

Single Claude call (claude-sonnet-4-5-20250929). All extracted data in one context window.

Input: formatted list of CC lines + invoice summaries with filenames. Output:

{
  "matches": [
    {
      "cc_line_index": 1,
      "invoice_file": "2025-12-04_073706.pdf",
      "confidence": "high",
      "reasoning": "Amount 42.40€ exact match, merchant 'Coffee Unlimited' matches, dates align"
    }
  ],
  "unmatched_cc_lines": [
    {
      "cc_line_index": 2,
      "amount": 27.01,
      "reasoning": "Redacted entry, no invoice with matching amount found"
    }
  ],
  "unmatched_invoices": [
    {
      "file": "2026-03-05_125059.pdf",
      "vendor": "...",
      "amount": 123.45,
      "reasoning": "Date falls outside statement period"
    }
  ]
}

HTML Report

Single file, inline CSS, no external dependencies. Jinja2 template as a string constant in report.py.

Sections:

Header — statement period, cardholder, card number, statement total
Summary bar — matched / unmatched CC lines / unmatched invoices / total matched amount
Matched entries table — CC date | Merchant | Amount | Invoice file | Confidence | Reasoning
Unmatched CC lines table — Date | Merchant or "[REDACTED]" | Amount | Reasoning
Unmatched invoices table — File | Vendor | Date | Amount | Reasoning

Print CSS:

@media print with proper page breaks
A4 landscape for match table
~10pt font, table borders in print mode
No background colors

Tech Stack

Python 3.11+
pymupdf — PDF page rendering
opencv-python — image preprocessing
numpy — array ops for OpenCV
anthropic — Claude API
google-cloud-documentai — OCR
google-generativeai — Gemini vision fallback
jinja2 — HTML templating
python-dotenv — .env loading

API Keys

From Orcha's LocalStack SSM parameters, stored in .env:

ANTHROPIC_API_KEY
GOOGLE_GENAI_API_KEY
GOOGLE_CLOUD_PROJECT
GOOGLE_DOCAI_PROCESSOR_ID
GOOGLE_APPLICATION_CREDENTIALS (path to credentials file)