CC Reconciliation Spike — Design

Goal

Proof of concept: match a set of invoice PDFs against a credit card statement PDF. Demonstrate that LLM-based transcription + extraction + matching works well enough for CC reconciliation to be worth productizing. Tune for correct results on the test dataset (~/code/orcha/drive/Kroeger Tax/Testdata CC).

Dataset

Project Structure

../spikes/cc-reconciliation/
├── .env                     # API keys (gitignored)
├── .env.example
├── .gitignore
├── requirements.txt
├── docs/plans/
├── cache/                   # Cached transcription/extraction results (gitignored)
├── output/                  # Generated HTML reports
└── src/
    ├── main.py              # Entry point — full pipeline, CLI flags
    ├── preprocess.py         # OpenCV: denoise, CLAHE, deskew, border cleanup
    ├── transcribe.py         # Google Doc AI primary, Gemini vision fallback
    ├── extract.py            # Claude: CC statement + invoice extraction
    ├── match.py              # Claude: single-call matching with reasoning
    └── report.py             # Jinja2 HTML report, print-friendly CSS

Pipeline

1. Discover PDFs
   Classify by filename: "Kreditkartenabrechnung" → statement, rest → invoices

2. For each PDF (cached per-file):
   a. Render pages to images (pymupdf)
   b. Preprocess each page (denoise + CLAHE + deskew + border crop)
   c. Transcribe (Google Doc AI, Gemini vision fallback if low quality)
   d. Extract structured data (Claude Sonnet 4.5)
   e. Write cache/{filename}.json

3. Match (single Claude call)
   All CC lines + all invoice summaries → matches with reasoning

4. Generate HTML report → output/report.html

CLI Flags

Preprocessing

Applied to every page image by default. Togglable via --no-preprocess.

Order:

  1. Border cleanup — detect dark scanner edges, crop to content
  2. Deskew — Hough transform, correct if |angle| > 0.5°
  3. CLAHE — clipLimit=2.0, tileGridSize=(8,8)
  4. Denoise — cv2.fastNlMeansDenoising(h=10)

Order matters: borders removed before skew detection, CLAHE before denoise (CLAHE can amplify noise).

Quality metrics always computed (even when preprocessing is off):

Document Quality Assessment

The dataset has three quality tiers:

Main issues are grain/noise and low contrast, not blur. Blur detection (Laplacian) is kept as a quality metric for debugging, not as a preprocessing gate.

Transcription

Mirrors the Orcha pattern:

  1. Primary: Google Document AI — preprocessed page images, returns text + token confidence
  2. Quality check — if >5% tokens have confidence <0.8, flag as low quality
  3. Fallback: Gemini vision (gemini-3-pro-preview) — page images with transcription prompt
  4. Output — concatenated text with page markers, per-page quality metrics

Models:

Extraction

Two prompts, same model (claude-sonnet-4-5-20250929):

CC Statement → structured data

{
  "statement_period": {"from": "2025-12-12", "to": "2026-01-14"},
  "cardholder": "Henning Olinski",
  "card_number": "5310 00XX XXXX 1593",
  "total": 5461.56,
  "currency": "EUR",
  "lines": [
    {
      "index": 1,
      "purchase_date": "2025-12-12",
      "booking_date": "2025-12-15",
      "merchant": "UZR*Coffee Unlimited, Hamburg",
      "amount": 42.40,
      "currency": "EUR",
      "miles": 42,
      "redacted": false
    },
    {
      "index": 4,
      "merchant": null,
      "amount": 46.75,
      "redacted": true
    }
  ]
}

Invoice → structured data

{
  "vendor": "Zirkonzahn Deutschland GmbH",
  "invoice_number": "2025DE23264",
  "invoice_date": "2025-11-27",
  "total": 309.40,
  "currency": "EUR",
  "payment_method": "Kreditkarte",
  "brief_description": "Dental CAD/CAM components (scananalog, titanit)"
}

Minimal fields — only what's needed for matching.

Matching

Single Claude call (claude-sonnet-4-5-20250929). All extracted data in one context window.

Input: formatted list of CC lines + invoice summaries with filenames. Output:

{
  "matches": [
    {
      "cc_line_index": 1,
      "invoice_file": "2025-12-04_073706.pdf",
      "confidence": "high",
      "reasoning": "Amount 42.40€ exact match, merchant 'Coffee Unlimited' matches, dates align"
    }
  ],
  "unmatched_cc_lines": [
    {
      "cc_line_index": 2,
      "amount": 27.01,
      "reasoning": "Redacted entry, no invoice with matching amount found"
    }
  ],
  "unmatched_invoices": [
    {
      "file": "2026-03-05_125059.pdf",
      "vendor": "...",
      "amount": 123.45,
      "reasoning": "Date falls outside statement period"
    }
  ]
}

HTML Report

Single file, inline CSS, no external dependencies. Jinja2 template as a string constant in report.py.

Sections:

  1. Header — statement period, cardholder, card number, statement total
  2. Summary bar — matched / unmatched CC lines / unmatched invoices / total matched amount
  3. Matched entries table — CC date | Merchant | Amount | Invoice file | Confidence | Reasoning
  4. Unmatched CC lines table — Date | Merchant or "[REDACTED]" | Amount | Reasoning
  5. Unmatched invoices table — File | Vendor | Date | Amount | Reasoning

Print CSS:

Tech Stack

API Keys

From Orcha's LocalStack SSM parameters, stored in .env: