CC Reconciliation Spike — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Build a standalone Python spike that matches invoice PDFs against a credit card statement using OCR + LLM extraction + LLM matching, producing a print-friendly HTML report.

Architecture: Pipeline of 5 stages (preprocess → transcribe → extract → match → report), each in its own module. Per-file caching between stages so you can iterate on prompts without re-running OCR. Single entry point with CLI flags.

Tech Stack: Python 3.11+, pymupdf, opencv-python, google-cloud-documentai, google-generativeai, anthropic, jinja2, python-dotenv

Task 1: Project Scaffolding

Files:

Create: ../spikes/cc-reconciliation/.gitignore
Create: ../spikes/cc-reconciliation/.env.example
Create: ../spikes/cc-reconciliation/.env
Create: ../spikes/cc-reconciliation/requirements.txt
Create: ../spikes/cc-reconciliation/src/__init__.py

Step 1: Create .gitignore

.env
cache/
output/
__pycache__/
*.pyc
.venv/

Step 2: Create .env.example

ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENAI_API_KEY=AIza...
GOOGLE_CLOUD_PROJECT=getorcha-dev
GOOGLE_DOCAI_PROCESSOR_ID=2ce14f950a811b13
GOOGLE_DOCAI_LOCATION=eu
GOOGLE_APPLICATION_CREDENTIALS=/home/volrath/code/orcha/orcha/credentials/google-docai-dev.json

Step 3: Create .env with actual values

Copy from .env.example and fill in values from Orcha's scripts/init_aws.clj:

ANTHROPIC_API_KEY = the value from /v1-orcha/anthropic-api-key
GOOGLE_GENAI_API_KEY = the value from /v1-orcha/google-genai-api-key
GOOGLE_CLOUD_PROJECT = getorcha-dev
GOOGLE_DOCAI_PROCESSOR_ID = 2ce14f950a811b13
GOOGLE_DOCAI_LOCATION = eu
GOOGLE_APPLICATION_CREDENTIALS = /home/volrath/code/orcha/orcha/credentials/google-docai-dev.json

Step 4: Create requirements.txt

pymupdf>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0
anthropic>=0.40.0
google-cloud-documentai>=2.20.0
google-generativeai>=0.3.0
jinja2>=3.1.0
python-dotenv>=1.0.0

Step 5: Create empty src/__init__.py

Empty file.

Step 6: Create venv and install dependencies

Run:

python3 -m venv ../spikes/cc-reconciliation/.venv
../spikes/cc-reconciliation/.venv/bin/pip install -r ../spikes/cc-reconciliation/requirements.txt

Expected: All packages install successfully.

Step 7: Create cache/ and output/ directories

mkdir -p ../spikes/cc-reconciliation/cache ../spikes/cc-reconciliation/output

Step 8: Commit

git add ../spikes/cc-reconciliation/.gitignore ../spikes/cc-reconciliation/.env.example ../spikes/cc-reconciliation/requirements.txt ../spikes/cc-reconciliation/src/__init__.py
git commit -m "spike(cc-recon): project scaffolding with deps and env config"

Task 2: Image Preprocessing (`preprocess.py`)

Files:

Create: ../spikes/cc-reconciliation/src/preprocess.py

Reference code: /home/volrath/code/orcha/spikes/guebelin/split.py lines 66-107

Step 1: Implement preprocess.py

This module needs 5 functions:

compute_quality_metrics(image) -> dict — Laplacian variance, original dimensions. Always runs.
crop_borders(image) -> (image, bool) — Detect dark edges via row/column mean intensity. Crop if edge strips are significantly darker than center. Return cropped image and whether cropping occurred.
deskew(image) -> (image, float) — Convert to binary (Otsu threshold), find contours or use cv2.minAreaRect on non-zero pixels, rotate if |angle| > 0.5°. Return corrected image and detected angle.
enhance(image) -> image — CLAHE (clipLimit=2.0, tileGridSize=(8,8)) then fastNlMeansDenoising(h=10, templateWindowSize=7, searchWindowSize=21). Ported from Gübelin enhance_image().
preprocess_page(image, enable=True) -> (image, dict) — Orchestrates the pipeline. If enable=False, only computes quality metrics and returns the original image unchanged. If enable=True, runs: crop_borders → deskew → enhance.

import cv2
import numpy as np


def compute_quality_metrics(image: np.ndarray) -> dict:
    """Compute quality metrics for a page image. Always runs regardless of preprocessing flag."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image
    laplacian_var = cv2.Laplacian(gray, cv2.CV_64F).var()
    h, w = image.shape[:2]
    return {
        "laplacian_variance": round(float(laplacian_var), 2),
        "original_size": [h, w],
    }


def crop_borders(image: np.ndarray, threshold: int = 40, margin_pct: float = 0.02) -> tuple[np.ndarray, bool]:
    """Remove dark scanner borders. Returns (cropped_image, was_cropped)."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image
    h, w = gray.shape

    margin_h = int(h * margin_pct)
    margin_w = int(w * margin_pct)

    # Find rows/cols where mean intensity is above threshold (not dark border)
    row_means = np.mean(gray, axis=1)
    col_means = np.mean(gray, axis=0)

    content_rows = np.where(row_means > threshold)[0]
    content_cols = np.where(col_means > threshold)[0]

    if len(content_rows) == 0 or len(content_cols) == 0:
        return image, False

    top, bottom = content_rows[0], content_rows[-1]
    left, right = content_cols[0], content_cols[-1]

    # Only crop if we'd remove meaningful border (> margin_pct of dimension)
    if top < margin_h and bottom > h - margin_h and left < margin_w and right > w - margin_w:
        return image, False

    cropped = image[top:bottom + 1, left:right + 1]
    return cropped, True


def deskew(image: np.ndarray, angle_threshold: float = 0.5) -> tuple[np.ndarray, float]:
    """Detect and correct skew. Returns (corrected_image, detected_angle_degrees)."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image

    # Binarize
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Find coordinates of all non-zero pixels
    coords = np.column_stack(np.where(binary > 0))
    if len(coords) < 100:
        return image, 0.0

    # Get minimum area rectangle
    angle = cv2.minAreaRect(coords)[-1]

    # Normalize angle
    if angle < -45:
        angle = 90 + angle
    elif angle > 45:
        angle = angle - 90

    if abs(angle) < angle_threshold:
        return image, round(float(angle), 2)

    # Rotate
    h, w = image.shape[:2]
    center = (w // 2, h // 2)
    matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
    corrected = cv2.warpAffine(image, matrix, (w, h), flags=cv2.INTER_CUBIC,
                               borderMode=cv2.BORDER_REPLICATE)
    return corrected, round(float(angle), 2)


def enhance(image: np.ndarray) -> np.ndarray:
    """Apply CLAHE contrast enhancement and denoising."""
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image

    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    denoised = cv2.fastNlMeansDenoising(enhanced, None, h=10,
                                        templateWindowSize=7, searchWindowSize=21)
    return denoised


def preprocess_page(image: np.ndarray, enable: bool = True) -> tuple[np.ndarray, dict]:
    """
    Full preprocessing pipeline for a single page image.

    Args:
        image: Page image as numpy array (RGB or grayscale)
        enable: If False, only computes quality metrics without modifying the image

    Returns:
        (processed_image, metrics_dict)
    """
    metrics = compute_quality_metrics(image)
    metrics["preprocessed"] = False

    if not enable:
        return image, metrics

    result, border_cropped = crop_borders(image)
    metrics["border_cropped"] = border_cropped

    result, skew_angle = deskew(result)
    metrics["skew_angle"] = skew_angle

    result = enhance(result)
    metrics["preprocessed"] = True

    return result, metrics

Step 2: Quick smoke test

Run from the spike directory:

cd ../spikes/cc-reconciliation
.venv/bin/python -c "
import fitz, numpy as np
from src.preprocess import preprocess_page
doc = fitz.open('/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC/2025-12-04_073706.pdf')
page = doc[0]
zoom = 200 / 72
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
processed, metrics = preprocess_page(img)
print('Metrics:', metrics)
print('Output shape:', processed.shape)
"

Expected: Prints metrics dict with laplacian_variance, skew_angle, etc. and output shape.

Step 3: Commit

git add ../spikes/cc-reconciliation/src/preprocess.py
git commit -m "spike(cc-recon): image preprocessing with CLAHE, denoise, deskew, border crop"

Task 3: Transcription (`transcribe.py`)

Files:

Create: ../spikes/cc-reconciliation/src/transcribe.py

Reference code:

Orcha transcription: /home/volrath/code/orcha/orcha/src/com/getorcha/workers/ap/ingestion/transcription.clj
Google Doc AI endpoint: https://{location}-documentai.googleapis.com/v1/projects/{project}/locations/{location}/processors/{processor}:process
Gemini endpoint: https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={api_key}

Step 1: Implement transcribe.py

This module needs:

_get_access_token() -> str — Load Google credentials from GOOGLE_APPLICATION_CREDENTIALS, get OAuth2 access token.
transcribe_docai(image_bytes, mime_type) -> dict — Call Document AI, return {"text": ..., "token_quality": {...}, "quality_score": ...}.
_calculate_token_quality(response) -> dict — Extract token confidence stats from Doc AI response. Low-confidence threshold: 0.8.
_needs_vision_fallback(token_quality) -> bool — True if low_confidence_ratio > 0.05.
transcribe_gemini_vision(page_images) -> str — Call Gemini vision API with page images, return text.
transcribe_pdf(pdf_path, preprocess_enabled) -> dict — Main entry point. Renders pages, preprocesses, runs Doc AI, falls back to Gemini if needed. Returns {"text": ..., "pages": [...], "quality_metrics": [...], "method": "docai"|"gemini_vision"}.

import base64
import io
import json
import os

import fitz
import numpy as np
import requests
from google.oauth2 import service_account

from src.preprocess import preprocess_page


DPI = 200
GEMINI_DPI = 150
GEMINI_JPEG_QUALITY = 95
LOW_CONFIDENCE_THRESHOLD = 0.8
LOW_CONFIDENCE_RATIO = 0.05
GEMINI_MODEL = "gemini-3-pro-preview"


def _get_access_token() -> str:
    """Get OAuth2 access token from service account credentials."""
    creds_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
    credentials = service_account.Credentials.from_service_account_file(
        creds_path,
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )
    credentials.refresh(requests.Request())
    return credentials.token


def _render_page(page: fitz.Page, dpi: int = DPI) -> np.ndarray:
    """Render a PDF page to numpy array at given DPI."""
    zoom = dpi / 72
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat)
    img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
    if pix.n == 4:
        import cv2
        img = cv2.cvtColor(img, cv2.COLOR_RGBA2RGB)
    return img


def _image_to_png_bytes(image: np.ndarray) -> bytes:
    """Encode numpy array as PNG bytes."""
    import cv2
    success, buf = cv2.imencode(".png", image)
    if not success:
        raise RuntimeError("Failed to encode image to PNG")
    return buf.tobytes()


def _image_to_jpeg_bytes(image: np.ndarray, quality: int = GEMINI_JPEG_QUALITY) -> bytes:
    """Encode numpy array as JPEG bytes."""
    import cv2
    success, buf = cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, quality])
    if not success:
        raise RuntimeError("Failed to encode image to JPEG")
    return buf.tobytes()


def _calculate_token_quality(response: dict) -> dict:
    """Extract token confidence stats from Document AI response."""
    pages = response.get("document", {}).get("pages", [])
    confidences = []
    for page in pages:
        for token in page.get("tokens", []):
            conf = token.get("layout", {}).get("confidence")
            if conf is not None:
                confidences.append(conf)

    if not confidences:
        return {"token_count": 0, "low_confidence_count": 0,
                "low_confidence_ratio": 0.0, "mean_confidence": 0.0,
                "min_confidence": 0.0}

    low_count = sum(1 for c in confidences if c < LOW_CONFIDENCE_THRESHOLD)
    return {
        "token_count": len(confidences),
        "low_confidence_count": low_count,
        "low_confidence_ratio": round(low_count / len(confidences), 4),
        "mean_confidence": round(sum(confidences) / len(confidences), 4),
        "min_confidence": round(min(confidences), 4),
    }


def _needs_vision_fallback(token_quality: dict) -> bool:
    """Check if OCR quality is too low and vision fallback is needed."""
    return token_quality["low_confidence_ratio"] > LOW_CONFIDENCE_RATIO


def transcribe_docai(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Transcribe a single page image using Google Document AI."""
    project = os.environ["GOOGLE_CLOUD_PROJECT"]
    processor = os.environ["GOOGLE_DOCAI_PROCESSOR_ID"]
    location = os.environ.get("GOOGLE_DOCAI_LOCATION", "eu")
    token = _get_access_token()

    url = (f"https://{location}-documentai.googleapis.com/v1/"
           f"projects/{project}/locations/{location}/processors/{processor}:process")

    body = {
        "rawDocument": {
            "content": base64.b64encode(image_bytes).decode(),
            "mimeType": mime_type,
        },
        "fieldMask": "text,pages.tokens,pages.imageQualityScores",
        "processOptions": {
            "ocrConfig": {
                "enableImageQualityScores": True,
            }
        }
    }

    resp = requests.post(url, json=body,
                         headers={"Authorization": f"Bearer {token}"},
                         timeout=120)
    resp.raise_for_status()
    result = resp.json()

    text = result.get("document", {}).get("text", "")
    token_quality = _calculate_token_quality(result)

    pages = result.get("document", {}).get("pages", [])
    quality_score = 0.0
    if pages:
        scores = [p.get("imageQualityScores", {}).get("qualityScore", 0.0) for p in pages]
        quality_score = sum(scores) / len(scores)

    return {
        "text": text,
        "token_quality": token_quality,
        "quality_score": round(quality_score, 4),
    }


def transcribe_gemini_vision(page_images: list[np.ndarray]) -> str:
    """Transcribe page images using Gemini vision as fallback."""
    api_key = os.environ["GOOGLE_GENAI_API_KEY"]
    url = (f"https://generativelanguage.googleapis.com/v1beta/"
           f"models/{GEMINI_MODEL}:generateContent?key={api_key}")

    parts = [{"text": "Transcribe this document exactly as written. "
                       "Preserve all text, numbers, dates, and formatting. "
                       "Output only the transcribed text, nothing else."}]

    for img in page_images:
        jpeg_bytes = _image_to_jpeg_bytes(img)
        parts.append({
            "inline_data": {
                "mime_type": "image/jpeg",
                "data": base64.b64encode(jpeg_bytes).decode(),
            }
        })

    body = {"contents": [{"parts": parts}]}
    resp = requests.post(url, json=body, timeout=180)
    resp.raise_for_status()
    result = resp.json()

    return result["candidates"][0]["content"]["parts"][0]["text"]


def transcribe_pdf(pdf_path: str, preprocess_enabled: bool = True) -> dict:
    """
    Transcribe a PDF file. Renders pages, preprocesses, runs Doc AI with
    Gemini vision fallback.

    Returns:
        {
            "text": str,               # full concatenated text
            "method": str,             # "docai" or "gemini_vision"
            "page_count": int,
            "quality_metrics": [dict],  # per-page preprocessing metrics
            "token_quality": dict,      # Doc AI token confidence stats
            "quality_score": float,     # Doc AI image quality score
        }
    """
    doc = fitz.open(pdf_path)
    page_texts = []
    quality_metrics = []
    page_images = []  # keep for potential Gemini fallback

    # Aggregate token quality across pages
    all_token_quality = {"token_count": 0, "low_confidence_count": 0,
                         "confidences_sum": 0.0, "min_confidence": 1.0}
    all_quality_scores = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        img = _render_page(page, DPI)
        processed, metrics = preprocess_page(img, enable=preprocess_enabled)
        quality_metrics.append({"page": page_num + 1, **metrics})

        # Keep lower-res version for potential Gemini fallback
        gemini_img = _render_page(page, GEMINI_DPI)
        page_images.append(gemini_img)

        # OCR with Document AI
        png_bytes = _image_to_png_bytes(processed)
        result = transcribe_docai(png_bytes)

        page_texts.append(result["text"])
        all_quality_scores.append(result["quality_score"])

        tq = result["token_quality"]
        all_token_quality["token_count"] += tq["token_count"]
        all_token_quality["low_confidence_count"] += tq["low_confidence_count"]
        all_token_quality["confidences_sum"] += tq["mean_confidence"] * tq["token_count"]
        all_token_quality["min_confidence"] = min(all_token_quality["min_confidence"],
                                                  tq["min_confidence"])

    doc.close()

    # Aggregate token quality
    tc = all_token_quality["token_count"]
    aggregated_tq = {
        "token_count": tc,
        "low_confidence_count": all_token_quality["low_confidence_count"],
        "low_confidence_ratio": round(all_token_quality["low_confidence_count"] / tc, 4) if tc else 0.0,
        "mean_confidence": round(all_token_quality["confidences_sum"] / tc, 4) if tc else 0.0,
        "min_confidence": round(all_token_quality["min_confidence"], 4),
    }

    avg_quality = round(sum(all_quality_scores) / len(all_quality_scores), 4) if all_quality_scores else 0.0

    # Check if we need Gemini vision fallback
    method = "docai"
    if _needs_vision_fallback(aggregated_tq):
        print(f"  Low OCR quality (ratio={aggregated_tq['low_confidence_ratio']}), "
              f"falling back to Gemini vision...")
        method = "gemini_vision"
        gemini_text = transcribe_gemini_vision(page_images)
        full_text = gemini_text
    else:
        # Concatenate with page markers
        parts = []
        for i, text in enumerate(page_texts):
            if len(page_texts) > 1:
                parts.append(f"--- Page {i + 1} ---")
            parts.append(text)
        full_text = "\n".join(parts)

    return {
        "text": full_text,
        "method": method,
        "page_count": len(page_texts),
        "quality_metrics": quality_metrics,
        "token_quality": aggregated_tq,
        "quality_score": avg_quality,
    }

Step 2: Smoke test with one invoice

cd ../spikes/cc-reconciliation
.venv/bin/python -c "
import os
from dotenv import load_dotenv
load_dotenv()
from src.transcribe import transcribe_pdf
result = transcribe_pdf('/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC/2025-12-04_073706.pdf')
print('Method:', result['method'])
print('Pages:', result['page_count'])
print('Token quality:', result['token_quality'])
print('Text (first 200 chars):', result['text'][:200])
"

Expected: Prints transcription method, page count, quality stats, and first 200 chars of text.

Step 3: Commit

git add ../spikes/cc-reconciliation/src/transcribe.py
git commit -m "spike(cc-recon): transcription with Doc AI primary and Gemini vision fallback"

Task 4: Extraction (`extract.py`)

Files:

Create: ../spikes/cc-reconciliation/src/extract.py

Step 1: Implement extract.py

Two extraction functions, both using Claude Sonnet 4.5. Uses the anthropic SDK directly.

import json
import os

import anthropic


MODEL = "claude-sonnet-4-5-20250929"


def _get_client() -> anthropic.Anthropic:
    return anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


def extract_cc_statement(transcribed_text: str) -> dict:
    """Extract structured data from a credit card statement transcription."""
    client = _get_client()

    prompt = """Extract all line items from this credit card statement.

RULES:
1. Extract every transaction line, including redacted/blacked-out ones.
2. For redacted entries: set "redacted" to true. The merchant name may be blacked out
   but the amount and dates should still be visible. Extract what you can.
3. Amounts: use numbers with dot decimal (42.40, not "42,40"). The statement uses
   German format (comma = decimal), convert to dot format.
4. Dates: use ISO format YYYY-MM-DD.
5. The "S" suffix on amounts means "Soll" (debit). "H" means "Haben" (credit).
   Report amounts as positive numbers. Mark credits with "is_credit": true.
6. Extract Prämienmeilen (bonus miles) as the "miles" field for each entry.
7. Index lines sequentially starting from 1.

Return JSON with this exact structure:
{
  "statement_period": {"from": "YYYY-MM-DD", "to": "YYYY-MM-DD"},
  "cardholder": "...",
  "card_number": "...",
  "total": 0.00,
  "currency": "EUR",
  "lines": [
    {
      "index": 1,
      "purchase_date": "YYYY-MM-DD",
      "booking_date": "YYYY-MM-DD",
      "merchant": "merchant name or null if redacted",
      "amount": 0.00,
      "currency": "EUR",
      "miles": 0,
      "redacted": false,
      "is_credit": false
    }
  ]
}

STATEMENT TEXT:
""" + transcribed_text

    response = client.messages.create(
        model=MODEL,
        max_tokens=16000,
        temperature=0,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "{"},
        ],
    )

    raw = "{" + response.content[0].text
    # Strip markdown code fences if present
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]

    return json.loads(raw)


def extract_invoice(transcribed_text: str) -> dict:
    """Extract structured data from an invoice transcription."""
    client = _get_client()

    prompt = """Extract key fields from this invoice/receipt.

RULES:
1. Amounts: use numbers with dot decimal (309.40, not "309,40").
2. Dates: use ISO format YYYY-MM-DD.
3. Extract the total amount actually charged (including tax).
4. For payment_method: look for "Kreditkarte", "EC-Cash", "PayPal", "Überweisung", etc.
   Use null if not stated.
5. Keep brief_description under 100 characters — just enough to identify what was purchased.
6. Copy vendor name exactly as written. Do not correct spelling.

Return JSON with this exact structure:
{
  "vendor": "Company Name",
  "invoice_number": "...",
  "invoice_date": "YYYY-MM-DD",
  "total": 0.00,
  "currency": "EUR",
  "payment_method": "Kreditkarte",
  "brief_description": "short description of items"
}

INVOICE TEXT:
""" + transcribed_text

    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        temperature=0,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "{"},
        ],
    )

    raw = "{" + response.content[0].text
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]

    return json.loads(raw)

Step 2: Smoke test with transcription output

cd ../spikes/cc-reconciliation
.venv/bin/python -c "
import os, json
from dotenv import load_dotenv
load_dotenv()
from src.transcribe import transcribe_pdf
from src.extract import extract_invoice
result = transcribe_pdf('/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC/2025-12-04_073706.pdf')
extracted = extract_invoice(result['text'])
print(json.dumps(extracted, indent=2))
"

Expected: Prints extracted invoice data (Coffee Unlimited, 42.40 EUR, etc.)

Step 3: Commit

git add ../spikes/cc-reconciliation/src/extract.py
git commit -m "spike(cc-recon): LLM extraction for CC statements and invoices"

Task 5: Matching (`match.py`)

Files:

Create: ../spikes/cc-reconciliation/src/match.py

Step 1: Implement match.py

Single Claude call with all data.

import json
import os

import anthropic


MODEL = "claude-sonnet-4-5-20250929"


def match_all(statement_data: dict, invoices: dict[str, dict]) -> dict:
    """
    Match CC statement lines against extracted invoice data.

    Args:
        statement_data: Extracted CC statement (from extract_cc_statement)
        invoices: Dict mapping filename -> extracted invoice data

    Returns:
        Match results with reasoning
    """
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    # Format CC lines
    cc_lines_text = []
    for line in statement_data["lines"]:
        merchant = line.get("merchant") or "[REDACTED]"
        redacted_tag = " [REDACTED]" if line.get("redacted") else ""
        credit_tag = " [CREDIT]" if line.get("is_credit") else ""
        cc_lines_text.append(
            f"  [{line['index']}] {line.get('purchase_date', '?')} | "
            f"{merchant}{redacted_tag}{credit_tag} | "
            f"{line['amount']} {line.get('currency', 'EUR')}"
        )

    # Format invoices
    invoice_lines_text = []
    for filename, inv in sorted(invoices.items()):
        invoice_lines_text.append(
            f"  [{filename}] {inv.get('vendor', '?')} | "
            f"{inv.get('total', '?')} {inv.get('currency', 'EUR')} | "
            f"Date: {inv.get('invoice_date', '?')} | "
            f"Payment: {inv.get('payment_method', '?')} | "
            f"{inv.get('brief_description', '')}"
        )

    prompt = f"""You are matching credit card statement line items against invoices.

CREDIT CARD STATEMENT:
  Period: {statement_data['statement_period']['from']} to {statement_data['statement_period']['to']}
  Cardholder: {statement_data['cardholder']}
  Total: {statement_data['total']} {statement_data['currency']}

CC LINE ITEMS:
{chr(10).join(cc_lines_text)}

INVOICES (filename | vendor | total | date | payment method | description):
{chr(10).join(invoice_lines_text)}

MATCHING RULES:
1. Match by amount (exact or very close — within a few cents for rounding).
2. Use date proximity: the CC purchase date should be close to the invoice date
   (same day or within a few days, considering processing delays).
3. Use merchant/vendor name similarity: CC merchant names are often abbreviated
   or formatted differently than invoice vendor names (e.g., "HAMBURG MARRIOTT HO"
   matches "Hamburg Marriott"). Be flexible with name matching.
4. Redacted CC entries: match by amount and date only, since merchant is hidden.
5. Credit entries (marked [CREDIT]) are refunds — they can match invoices too.
6. An invoice can match at most one CC line, and vice versa.
7. Some CC lines may have no matching invoice (private expenses, small purchases
   without receipt). Some invoices may not match any CC line (different payment
   method, outside statement period).
8. When multiple invoices could match a CC line (e.g., same amount), use date
   and merchant to disambiguate. If still ambiguous, note it in reasoning.

Return JSON:
{{
  "matches": [
    {{
      "cc_line_index": 1,
      "invoice_file": "filename.pdf",
      "confidence": "high|medium|low",
      "reasoning": "Brief explanation of why these match"
    }}
  ],
  "unmatched_cc_lines": [
    {{
      "cc_line_index": 2,
      "amount": 27.01,
      "merchant": "merchant or null",
      "reasoning": "Why no match was found"
    }}
  ],
  "unmatched_invoices": [
    {{
      "file": "filename.pdf",
      "vendor": "...",
      "amount": 0.00,
      "reasoning": "Why no match was found"
    }}
  ]
}}"""

    response = client.messages.create(
        model=MODEL,
        max_tokens=16000,
        temperature=0,
        messages=[
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": "{"},
        ],
    )

    raw = "{" + response.content[0].text
    if raw.startswith("```"):
        raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]

    return json.loads(raw)

Step 2: Commit (no smoke test — needs full pipeline data, will test via main.py)

git add ../spikes/cc-reconciliation/src/match.py
git commit -m "spike(cc-recon): LLM-based matching of CC lines to invoices"

Task 6: HTML Report (`report.py`)

Files:

Create: ../spikes/cc-reconciliation/src/report.py

Step 1: Implement report.py

Jinja2 template as string constant. Print-friendly CSS.

import os

from jinja2 import Template


REPORT_TEMPLATE = """<!DOCTYPE html>
<html lang="de">
<head>
<meta charset="utf-8">
<title>CC Reconciliation Report</title>
<style>
  * { margin: 0; padding: 0; box-sizing: border-box; }
  body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
         font-size: 11pt; color: #333; padding: 20px; max-width: 1400px; margin: 0 auto; }
  h1 { font-size: 18pt; margin-bottom: 8px; }
  h2 { font-size: 14pt; margin: 24px 0 8px 0; border-bottom: 2px solid #333; padding-bottom: 4px; }
  .header-info { font-size: 10pt; color: #666; margin-bottom: 16px; }
  .header-info span { margin-right: 24px; }
  .summary { display: flex; gap: 16px; margin: 16px 0; flex-wrap: wrap; }
  .summary-card { background: #f5f5f5; border: 1px solid #ddd; border-radius: 4px;
                   padding: 12px 16px; min-width: 150px; }
  .summary-card .label { font-size: 9pt; color: #666; text-transform: uppercase; }
  .summary-card .value { font-size: 16pt; font-weight: bold; margin-top: 4px; }
  .match-high { color: #2e7d32; }
  .match-medium { color: #f57f17; }
  .match-low { color: #c62828; }
  table { width: 100%; border-collapse: collapse; margin: 8px 0 24px 0; font-size: 10pt; }
  th { background: #f5f5f5; text-align: left; padding: 6px 8px; border: 1px solid #ddd;
       font-weight: 600; white-space: nowrap; }
  td { padding: 6px 8px; border: 1px solid #ddd; vertical-align: top; }
  tr:nth-child(even) { background: #fafafa; }
  .amount { text-align: right; font-variant-numeric: tabular-nums; white-space: nowrap; }
  .redacted { color: #999; font-style: italic; }
  .reasoning { font-size: 9pt; color: #555; max-width: 400px; }
  .confidence { font-weight: 600; text-transform: uppercase; font-size: 9pt; }

  @media print {
    body { padding: 0; font-size: 9pt; }
    .summary-card { background: none; border: 1px solid #999; }
    tr:nth-child(even) { background: none; }
    h2 { page-break-before: auto; }
    table { page-break-inside: auto; }
    tr { page-break-inside: avoid; }
    @page { size: A4 landscape; margin: 10mm; }
  }
</style>
</head>
<body>

<h1>Kreditkartenabrechnung — Belegzuordnung</h1>
<div class="header-info">
  <span><b>Zeitraum:</b> {{ statement.statement_period.from }} bis {{ statement.statement_period.to }}</span>
  <span><b>Karteninhaber:</b> {{ statement.cardholder }}</span>
  <span><b>Karte:</b> {{ statement.card_number }}</span>
  <span><b>Gesamtbetrag:</b> {{ "%.2f"|format(statement.total) }} {{ statement.currency }}</span>
</div>

<div class="summary">
  <div class="summary-card">
    <div class="label">Zugeordnet</div>
    <div class="value match-high">{{ matches|length }}</div>
  </div>
  <div class="summary-card">
    <div class="label">CC ohne Beleg</div>
    <div class="value">{{ unmatched_cc|length }}</div>
  </div>
  <div class="summary-card">
    <div class="label">Belege ohne CC</div>
    <div class="value">{{ unmatched_inv|length }}</div>
  </div>
  <div class="summary-card">
    <div class="label">Zugeordneter Betrag</div>
    <div class="value">{{ "%.2f"|format(matched_total) }} €</div>
  </div>
</div>

<h2>Zugeordnete Buchungen ({{ matches|length }})</h2>
<table>
<tr>
  <th>#</th><th>Kaufdatum</th><th>Händler (CC)</th><th class="amount">Betrag</th>
  <th>Beleg (Datei)</th><th>Lieferant</th><th class="amount">Rechnungsbetrag</th>
  <th>Konfidenz</th><th>Begründung</th>
</tr>
{% for m in matches %}
<tr>
  <td>{{ m.cc_line_index }}</td>
  <td>{{ m.cc_line.purchase_date }}</td>
  <td>{% if m.cc_line.redacted %}<span class="redacted">[GESCHWÄRZT]</span>{% else %}{{ m.cc_line.merchant }}{% endif %}</td>
  <td class="amount">{{ "%.2f"|format(m.cc_line.amount) }} €</td>
  <td>{{ m.invoice_file }}</td>
  <td>{{ m.invoice.vendor }}</td>
  <td class="amount">{{ "%.2f"|format(m.invoice.total) }} €</td>
  <td><span class="confidence match-{{ m.confidence }}">{{ m.confidence }}</span></td>
  <td class="reasoning">{{ m.reasoning }}</td>
</tr>
{% endfor %}
</table>

{% if unmatched_cc %}
<h2>CC-Buchungen ohne Beleg ({{ unmatched_cc|length }})</h2>
<table>
<tr><th>#</th><th>Kaufdatum</th><th>Händler</th><th class="amount">Betrag</th><th>Begründung</th></tr>
{% for u in unmatched_cc %}
<tr>
  <td>{{ u.cc_line_index }}</td>
  <td>{{ u.cc_line.purchase_date }}</td>
  <td>{% if u.cc_line.redacted %}<span class="redacted">[GESCHWÄRZT]</span>{% else %}{{ u.cc_line.merchant or '-' }}{% endif %}</td>
  <td class="amount">{{ "%.2f"|format(u.amount) }} €</td>
  <td class="reasoning">{{ u.reasoning }}</td>
</tr>
{% endfor %}
</table>
{% endif %}

{% if unmatched_inv %}
<h2>Belege ohne CC-Buchung ({{ unmatched_inv|length }})</h2>
<table>
<tr><th>Datei</th><th>Lieferant</th><th>Datum</th><th class="amount">Betrag</th><th>Begründung</th></tr>
{% for u in unmatched_inv %}
<tr>
  <td>{{ u.file }}</td>
  <td>{{ u.vendor }}</td>
  <td>{{ u.invoice.invoice_date or '-' }}</td>
  <td class="amount">{{ "%.2f"|format(u.amount) }} €</td>
  <td class="reasoning">{{ u.reasoning }}</td>
</tr>
{% endfor %}
</table>
{% endif %}

</body>
</html>"""


def generate_report(statement_data: dict, invoices: dict[str, dict],
                    match_results: dict, output_path: str) -> str:
    """
    Generate HTML report from match results.

    Args:
        statement_data: Extracted CC statement data
        invoices: Dict mapping filename -> extracted invoice data
        match_results: Output from match.match_all()
        output_path: Path to write HTML file

    Returns:
        Path to generated report
    """
    # Build CC line lookup
    cc_lines_by_index = {line["index"]: line for line in statement_data["lines"]}

    # Enrich matches with source data
    enriched_matches = []
    matched_total = 0.0
    for m in match_results.get("matches", []):
        cc_line = cc_lines_by_index.get(m["cc_line_index"], {})
        invoice = invoices.get(m["invoice_file"], {})
        enriched_matches.append({
            **m,
            "cc_line": cc_line,
            "invoice": invoice,
        })
        matched_total += cc_line.get("amount", 0.0)

    # Enrich unmatched CC lines
    enriched_unmatched_cc = []
    for u in match_results.get("unmatched_cc_lines", []):
        cc_line = cc_lines_by_index.get(u["cc_line_index"], {})
        enriched_unmatched_cc.append({**u, "cc_line": cc_line})

    # Enrich unmatched invoices
    enriched_unmatched_inv = []
    for u in match_results.get("unmatched_invoices", []):
        invoice = invoices.get(u["file"], {})
        enriched_unmatched_inv.append({**u, "invoice": invoice})

    template = Template(REPORT_TEMPLATE)
    html = template.render(
        statement=statement_data,
        matches=enriched_matches,
        unmatched_cc=enriched_unmatched_cc,
        unmatched_inv=enriched_unmatched_inv,
        matched_total=matched_total,
    )

    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    with open(output_path, "w") as f:
        f.write(html)

    return output_path

Step 2: Commit

git add ../spikes/cc-reconciliation/src/report.py
git commit -m "spike(cc-recon): HTML report generation with print-friendly CSS"

Task 7: Main Entry Point (`main.py`)

Files:

Create: ../spikes/cc-reconciliation/src/main.py

Step 1: Implement main.py

Orchestrates the full pipeline with caching and CLI flags.

import argparse
import json
import os
import sys
from pathlib import Path

from dotenv import load_dotenv


STATEMENT_KEYWORD = "Kreditkartenabrechnung"


def discover_pdfs(input_folder: str) -> tuple[str | None, list[str]]:
    """Classify PDFs in folder. Returns (statement_path, [invoice_paths])."""
    folder = Path(input_folder)
    statement = None
    invoices = []

    for pdf in sorted(folder.glob("*.pdf")):
        if STATEMENT_KEYWORD in pdf.name:
            statement = str(pdf)
        else:
            invoices.append(str(pdf))

    return statement, invoices


def load_cache(cache_dir: str, filename: str) -> dict | None:
    """Load cached result for a file, or return None."""
    cache_path = Path(cache_dir) / f"{filename}.json"
    if cache_path.exists():
        with open(cache_path) as f:
            return json.load(f)
    return None


def save_cache(cache_dir: str, filename: str, data: dict):
    """Save result to cache."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = Path(cache_dir) / f"{filename}.json"
    with open(cache_path, "w") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


def process_pdf(pdf_path: str, cache_dir: str, is_statement: bool,
                no_cache: bool, preprocess: bool) -> dict:
    """Transcribe and extract a single PDF. Uses cache if available."""
    from src.transcribe import transcribe_pdf
    from src.extract import extract_cc_statement, extract_invoice

    filename = Path(pdf_path).name
    doc_type = "statement" if is_statement else "invoice"

    # Check cache
    if not no_cache:
        cached = load_cache(cache_dir, filename)
        if cached:
            print(f"  [cached] {filename}")
            return cached

    # Transcribe
    print(f"  [transcribe] {filename}...")
    transcription = transcribe_pdf(pdf_path, preprocess_enabled=preprocess)

    # Extract
    print(f"  [extract] {filename} ({doc_type})...")
    if is_statement:
        extracted = extract_cc_statement(transcription["text"])
    else:
        extracted = extract_invoice(transcription["text"])

    result = {
        "filename": filename,
        "type": doc_type,
        "transcription": {
            "method": transcription["method"],
            "page_count": transcription["page_count"],
            "quality_score": transcription["quality_score"],
            "token_quality": transcription["token_quality"],
            "quality_metrics": transcription["quality_metrics"],
            "text": transcription["text"],
        },
        "extracted": extracted,
    }

    save_cache(cache_dir, filename, result)
    return result


def run(input_folder: str, output_dir: str = "output", cache_dir: str = "cache",
        no_cache: bool = False, match_only: bool = False, preprocess: bool = True):
    """Run the full CC reconciliation pipeline."""
    from src.match import match_all
    from src.report import generate_report

    print(f"Input folder: {input_folder}")
    statement_path, invoice_paths = discover_pdfs(input_folder)

    if not statement_path:
        print(f"ERROR: No file containing '{STATEMENT_KEYWORD}' found in {input_folder}")
        sys.exit(1)

    print(f"Found statement: {Path(statement_path).name}")
    print(f"Found {len(invoice_paths)} invoices")
    print()

    # Step 1+2: Transcribe and extract
    if not match_only:
        print("=== Transcription & Extraction ===")
        statement_result = process_pdf(statement_path, cache_dir, is_statement=True,
                                       no_cache=no_cache, preprocess=preprocess)
        invoice_results = {}
        for inv_path in invoice_paths:
            result = process_pdf(inv_path, cache_dir, is_statement=False,
                                 no_cache=no_cache, preprocess=preprocess)
            invoice_results[Path(inv_path).name] = result
    else:
        # Load from cache
        print("=== Loading from cache (--match-only) ===")
        statement_cache = load_cache(cache_dir, Path(statement_path).name)
        if not statement_cache:
            print("ERROR: No cached statement data. Run without --match-only first.")
            sys.exit(1)
        statement_result = statement_cache
        invoice_results = {}
        for inv_path in invoice_paths:
            cached = load_cache(cache_dir, Path(inv_path).name)
            if cached:
                invoice_results[Path(inv_path).name] = cached
            else:
                print(f"  WARNING: No cache for {Path(inv_path).name}, skipping")

    statement_data = statement_result["extracted"]
    invoices_data = {name: r["extracted"] for name, r in invoice_results.items()}

    # Step 3: Match
    print()
    print("=== Matching ===")
    print(f"  Matching {len(statement_data.get('lines', []))} CC lines against "
          f"{len(invoices_data)} invoices...")
    match_results = match_all(statement_data, invoices_data)

    print(f"  Matched: {len(match_results.get('matches', []))}")
    print(f"  Unmatched CC lines: {len(match_results.get('unmatched_cc_lines', []))}")
    print(f"  Unmatched invoices: {len(match_results.get('unmatched_invoices', []))}")

    # Save match results to cache
    save_cache(cache_dir, "_match_results", match_results)

    # Step 4: Report
    print()
    print("=== Report ===")
    report_path = os.path.join(output_dir, "report.html")
    generate_report(statement_data, invoices_data, match_results, report_path)
    print(f"  Report written to: {report_path}")


def main():
    parser = argparse.ArgumentParser(description="CC Reconciliation Spike")
    parser.add_argument("input_folder",
                        help="Folder containing CC statement PDF and invoice PDFs")
    parser.add_argument("--output-dir", default="output",
                        help="Directory for HTML report (default: output)")
    parser.add_argument("--cache-dir", default="cache",
                        help="Directory for cached results (default: cache)")
    parser.add_argument("--no-cache", action="store_true",
                        help="Force re-run, ignore cached results")
    parser.add_argument("--match-only", action="store_true",
                        help="Skip transcription/extraction, re-run matching from cache")
    parser.add_argument("--no-preprocess", action="store_true",
                        help="Skip image preprocessing")
    args = parser.parse_args()

    # Load environment
    load_dotenv()

    # Change to spike directory so relative paths work
    spike_dir = Path(__file__).parent.parent
    os.chdir(spike_dir)

    run(
        input_folder=args.input_folder,
        output_dir=args.output_dir,
        cache_dir=args.cache_dir,
        no_cache=args.no_cache,
        match_only=args.match_only,
        preprocess=not args.no_preprocess,
    )


if __name__ == "__main__":
    main()

Step 2: Full pipeline test

cd ../spikes/cc-reconciliation
.venv/bin/python -m src.main "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"

Expected: Runs full pipeline, prints progress, generates output/report.html.

Step 3: Open report and verify

xdg-open ../spikes/cc-reconciliation/output/report.html

Review the report for:

Correct statement header info
Reasonable matches (Coffee Unlimited ↔ 42.40€ line, Zirkonzahn invoices, etc.)
Redacted entries handled
Unmatched sections populated

Step 4: Commit

git add ../spikes/cc-reconciliation/src/main.py
git commit -m "spike(cc-recon): main entry point with full pipeline orchestration"

Task 8: End-to-End Tuning

This task is iterative. Run the pipeline, review results, adjust prompts.

Step 1: Run full pipeline and review report

cd ../spikes/cc-reconciliation
.venv/bin/python -m src.main "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"

Step 2: Check extraction accuracy

Inspect cache/*.json files. For each invoice, verify:

Vendor name correct
Total amount correct
Date correct
Payment method detected

For the CC statement, verify:

All line items extracted (including redacted ones)
Amounts correct (German comma → dot conversion)
Redacted flag set correctly

Step 3: Check matching accuracy

Cross-reference cache/_match_results.json against manual inspection:

Coffee Unlimited receipt (42.40€) should match CC line for UZR*Coffee Unlimited
Zirkonzahn invoices should match WWW.ZIRKONZAHN.COM entries
Amazon invoices should match AMZN Mktp entries
EasyPark should match PAYPAL *EASYPARK entries
Restaurant receipt should match Mama Restaurant entry
Marriott invoices should match HAMBURG MARRIOTT entries

Step 4: Iterate on prompts if needed

Use --match-only to re-run matching without re-transcribing:

.venv/bin/python -m src.main --match-only "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"

Use --no-cache to force re-extraction if extraction prompts change:

.venv/bin/python -m src.main --no-cache "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"

Step 5: Compare with/without preprocessing

# Clear cache and run without preprocessing
rm -rf cache/
.venv/bin/python -m src.main --no-preprocess "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
# Save report
cp output/report.html output/report-no-preprocess.html

# Clear cache and run with preprocessing
rm -rf cache/
.venv/bin/python -m src.main "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
cp output/report.html output/report-with-preprocess.html

Compare the two reports for differences in match quality.

Step 6: Commit final tuned version

git add ../spikes/cc-reconciliation/src/
git commit -m "spike(cc-recon): tuned prompts for correct matching on test dataset"

CC Reconciliation Spike — Implementation Plan

Task 1: Project Scaffolding

Task 2: Image Preprocessing (preprocess.py)

Task 3: Transcription (transcribe.py)

Task 4: Extraction (extract.py)

Task 5: Matching (match.py)

Task 6: HTML Report (report.py)

Task 7: Main Entry Point (main.py)

Task 8: End-to-End Tuning

Task 2: Image Preprocessing (`preprocess.py`)

Task 3: Transcription (`transcribe.py`)

Task 4: Extraction (`extract.py`)

Task 5: Matching (`match.py`)

Task 6: HTML Report (`report.py`)

Task 7: Main Entry Point (`main.py`)