For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Build a standalone Python spike that matches invoice PDFs against a credit card statement using OCR + LLM extraction + LLM matching, producing a print-friendly HTML report.
Architecture: Pipeline of 5 stages (preprocess → transcribe → extract → match → report), each in its own module. Per-file caching between stages so you can iterate on prompts without re-running OCR. Single entry point with CLI flags.
Tech Stack: Python 3.11+, pymupdf, opencv-python, google-cloud-documentai, google-generativeai, anthropic, jinja2, python-dotenv
Files:
../spikes/cc-reconciliation/.gitignore../spikes/cc-reconciliation/.env.example../spikes/cc-reconciliation/.env../spikes/cc-reconciliation/requirements.txt../spikes/cc-reconciliation/src/__init__.pyStep 1: Create .gitignore
.env
cache/
output/
__pycache__/
*.pyc
.venv/
Step 2: Create .env.example
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENAI_API_KEY=AIza...
GOOGLE_CLOUD_PROJECT=getorcha-dev
GOOGLE_DOCAI_PROCESSOR_ID=2ce14f950a811b13
GOOGLE_DOCAI_LOCATION=eu
GOOGLE_APPLICATION_CREDENTIALS=/home/volrath/code/orcha/orcha/credentials/google-docai-dev.json
Step 3: Create .env with actual values
Copy from .env.example and fill in values from Orcha's scripts/init_aws.clj:
ANTHROPIC_API_KEY = the value from /v1-orcha/anthropic-api-keyGOOGLE_GENAI_API_KEY = the value from /v1-orcha/google-genai-api-keyGOOGLE_CLOUD_PROJECT = getorcha-devGOOGLE_DOCAI_PROCESSOR_ID = 2ce14f950a811b13GOOGLE_DOCAI_LOCATION = euGOOGLE_APPLICATION_CREDENTIALS = /home/volrath/code/orcha/orcha/credentials/google-docai-dev.jsonStep 4: Create requirements.txt
pymupdf>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0
anthropic>=0.40.0
google-cloud-documentai>=2.20.0
google-generativeai>=0.3.0
jinja2>=3.1.0
python-dotenv>=1.0.0
Step 5: Create empty src/__init__.py
Empty file.
Step 6: Create venv and install dependencies
Run:
python3 -m venv ../spikes/cc-reconciliation/.venv
../spikes/cc-reconciliation/.venv/bin/pip install -r ../spikes/cc-reconciliation/requirements.txt
Expected: All packages install successfully.
Step 7: Create cache/ and output/ directories
mkdir -p ../spikes/cc-reconciliation/cache ../spikes/cc-reconciliation/output
Step 8: Commit
git add ../spikes/cc-reconciliation/.gitignore ../spikes/cc-reconciliation/.env.example ../spikes/cc-reconciliation/requirements.txt ../spikes/cc-reconciliation/src/__init__.py
git commit -m "spike(cc-recon): project scaffolding with deps and env config"
preprocess.py)Files:
../spikes/cc-reconciliation/src/preprocess.pyReference code: /home/volrath/code/orcha/spikes/guebelin/split.py lines 66-107
Step 1: Implement preprocess.py
This module needs 5 functions:
compute_quality_metrics(image) -> dict — Laplacian variance, original dimensions. Always runs.crop_borders(image) -> (image, bool) — Detect dark edges via row/column mean intensity. Crop if edge strips are significantly darker than center. Return cropped image and whether cropping occurred.deskew(image) -> (image, float) — Convert to binary (Otsu threshold), find contours or use cv2.minAreaRect on non-zero pixels, rotate if |angle| > 0.5°. Return corrected image and detected angle.enhance(image) -> image — CLAHE (clipLimit=2.0, tileGridSize=(8,8)) then fastNlMeansDenoising(h=10, templateWindowSize=7, searchWindowSize=21). Ported from Gübelin enhance_image().preprocess_page(image, enable=True) -> (image, dict) — Orchestrates the pipeline. If enable=False, only computes quality metrics and returns the original image unchanged. If enable=True, runs: crop_borders → deskew → enhance.import cv2
import numpy as np
def compute_quality_metrics(image: np.ndarray) -> dict:
"""Compute quality metrics for a page image. Always runs regardless of preprocessing flag."""
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image
laplacian_var = cv2.Laplacian(gray, cv2.CV_64F).var()
h, w = image.shape[:2]
return {
"laplacian_variance": round(float(laplacian_var), 2),
"original_size": [h, w],
}
def crop_borders(image: np.ndarray, threshold: int = 40, margin_pct: float = 0.02) -> tuple[np.ndarray, bool]:
"""Remove dark scanner borders. Returns (cropped_image, was_cropped)."""
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image
h, w = gray.shape
margin_h = int(h * margin_pct)
margin_w = int(w * margin_pct)
# Find rows/cols where mean intensity is above threshold (not dark border)
row_means = np.mean(gray, axis=1)
col_means = np.mean(gray, axis=0)
content_rows = np.where(row_means > threshold)[0]
content_cols = np.where(col_means > threshold)[0]
if len(content_rows) == 0 or len(content_cols) == 0:
return image, False
top, bottom = content_rows[0], content_rows[-1]
left, right = content_cols[0], content_cols[-1]
# Only crop if we'd remove meaningful border (> margin_pct of dimension)
if top < margin_h and bottom > h - margin_h and left < margin_w and right > w - margin_w:
return image, False
cropped = image[top:bottom + 1, left:right + 1]
return cropped, True
def deskew(image: np.ndarray, angle_threshold: float = 0.5) -> tuple[np.ndarray, float]:
"""Detect and correct skew. Returns (corrected_image, detected_angle_degrees)."""
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image
# Binarize
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# Find coordinates of all non-zero pixels
coords = np.column_stack(np.where(binary > 0))
if len(coords) < 100:
return image, 0.0
# Get minimum area rectangle
angle = cv2.minAreaRect(coords)[-1]
# Normalize angle
if angle < -45:
angle = 90 + angle
elif angle > 45:
angle = angle - 90
if abs(angle) < angle_threshold:
return image, round(float(angle), 2)
# Rotate
h, w = image.shape[:2]
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
corrected = cv2.warpAffine(image, matrix, (w, h), flags=cv2.INTER_CUBIC,
borderMode=cv2.BORDER_REPLICATE)
return corrected, round(float(angle), 2)
def enhance(image: np.ndarray) -> np.ndarray:
"""Apply CLAHE contrast enhancement and denoising."""
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) if len(image.shape) == 3 else image
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(gray)
denoised = cv2.fastNlMeansDenoising(enhanced, None, h=10,
templateWindowSize=7, searchWindowSize=21)
return denoised
def preprocess_page(image: np.ndarray, enable: bool = True) -> tuple[np.ndarray, dict]:
"""
Full preprocessing pipeline for a single page image.
Args:
image: Page image as numpy array (RGB or grayscale)
enable: If False, only computes quality metrics without modifying the image
Returns:
(processed_image, metrics_dict)
"""
metrics = compute_quality_metrics(image)
metrics["preprocessed"] = False
if not enable:
return image, metrics
result, border_cropped = crop_borders(image)
metrics["border_cropped"] = border_cropped
result, skew_angle = deskew(result)
metrics["skew_angle"] = skew_angle
result = enhance(result)
metrics["preprocessed"] = True
return result, metrics
Step 2: Quick smoke test
Run from the spike directory:
cd ../spikes/cc-reconciliation
.venv/bin/python -c "
import fitz, numpy as np
from src.preprocess import preprocess_page
doc = fitz.open('/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC/2025-12-04_073706.pdf')
page = doc[0]
zoom = 200 / 72
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
processed, metrics = preprocess_page(img)
print('Metrics:', metrics)
print('Output shape:', processed.shape)
"
Expected: Prints metrics dict with laplacian_variance, skew_angle, etc. and output shape.
Step 3: Commit
git add ../spikes/cc-reconciliation/src/preprocess.py
git commit -m "spike(cc-recon): image preprocessing with CLAHE, denoise, deskew, border crop"
transcribe.py)Files:
../spikes/cc-reconciliation/src/transcribe.pyReference code:
/home/volrath/code/orcha/orcha/src/com/getorcha/workers/ap/ingestion/transcription.cljhttps://{location}-documentai.googleapis.com/v1/projects/{project}/locations/{location}/processors/{processor}:processhttps://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent?key={api_key}Step 1: Implement transcribe.py
This module needs:
_get_access_token() -> str — Load Google credentials from GOOGLE_APPLICATION_CREDENTIALS, get OAuth2 access token.transcribe_docai(image_bytes, mime_type) -> dict — Call Document AI, return {"text": ..., "token_quality": {...}, "quality_score": ...}._calculate_token_quality(response) -> dict — Extract token confidence stats from Doc AI response. Low-confidence threshold: 0.8._needs_vision_fallback(token_quality) -> bool — True if low_confidence_ratio > 0.05.transcribe_gemini_vision(page_images) -> str — Call Gemini vision API with page images, return text.transcribe_pdf(pdf_path, preprocess_enabled) -> dict — Main entry point. Renders pages, preprocesses, runs Doc AI, falls back to Gemini if needed. Returns {"text": ..., "pages": [...], "quality_metrics": [...], "method": "docai"|"gemini_vision"}.import base64
import io
import json
import os
import fitz
import numpy as np
import requests
from google.oauth2 import service_account
from src.preprocess import preprocess_page
DPI = 200
GEMINI_DPI = 150
GEMINI_JPEG_QUALITY = 95
LOW_CONFIDENCE_THRESHOLD = 0.8
LOW_CONFIDENCE_RATIO = 0.05
GEMINI_MODEL = "gemini-3-pro-preview"
def _get_access_token() -> str:
"""Get OAuth2 access token from service account credentials."""
creds_path = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = service_account.Credentials.from_service_account_file(
creds_path,
scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
credentials.refresh(requests.Request())
return credentials.token
def _render_page(page: fitz.Page, dpi: int = DPI) -> np.ndarray:
"""Render a PDF page to numpy array at given DPI."""
zoom = dpi / 72
mat = fitz.Matrix(zoom, zoom)
pix = page.get_pixmap(matrix=mat)
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
if pix.n == 4:
import cv2
img = cv2.cvtColor(img, cv2.COLOR_RGBA2RGB)
return img
def _image_to_png_bytes(image: np.ndarray) -> bytes:
"""Encode numpy array as PNG bytes."""
import cv2
success, buf = cv2.imencode(".png", image)
if not success:
raise RuntimeError("Failed to encode image to PNG")
return buf.tobytes()
def _image_to_jpeg_bytes(image: np.ndarray, quality: int = GEMINI_JPEG_QUALITY) -> bytes:
"""Encode numpy array as JPEG bytes."""
import cv2
success, buf = cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, quality])
if not success:
raise RuntimeError("Failed to encode image to JPEG")
return buf.tobytes()
def _calculate_token_quality(response: dict) -> dict:
"""Extract token confidence stats from Document AI response."""
pages = response.get("document", {}).get("pages", [])
confidences = []
for page in pages:
for token in page.get("tokens", []):
conf = token.get("layout", {}).get("confidence")
if conf is not None:
confidences.append(conf)
if not confidences:
return {"token_count": 0, "low_confidence_count": 0,
"low_confidence_ratio": 0.0, "mean_confidence": 0.0,
"min_confidence": 0.0}
low_count = sum(1 for c in confidences if c < LOW_CONFIDENCE_THRESHOLD)
return {
"token_count": len(confidences),
"low_confidence_count": low_count,
"low_confidence_ratio": round(low_count / len(confidences), 4),
"mean_confidence": round(sum(confidences) / len(confidences), 4),
"min_confidence": round(min(confidences), 4),
}
def _needs_vision_fallback(token_quality: dict) -> bool:
"""Check if OCR quality is too low and vision fallback is needed."""
return token_quality["low_confidence_ratio"] > LOW_CONFIDENCE_RATIO
def transcribe_docai(image_bytes: bytes, mime_type: str = "image/png") -> dict:
"""Transcribe a single page image using Google Document AI."""
project = os.environ["GOOGLE_CLOUD_PROJECT"]
processor = os.environ["GOOGLE_DOCAI_PROCESSOR_ID"]
location = os.environ.get("GOOGLE_DOCAI_LOCATION", "eu")
token = _get_access_token()
url = (f"https://{location}-documentai.googleapis.com/v1/"
f"projects/{project}/locations/{location}/processors/{processor}:process")
body = {
"rawDocument": {
"content": base64.b64encode(image_bytes).decode(),
"mimeType": mime_type,
},
"fieldMask": "text,pages.tokens,pages.imageQualityScores",
"processOptions": {
"ocrConfig": {
"enableImageQualityScores": True,
}
}
}
resp = requests.post(url, json=body,
headers={"Authorization": f"Bearer {token}"},
timeout=120)
resp.raise_for_status()
result = resp.json()
text = result.get("document", {}).get("text", "")
token_quality = _calculate_token_quality(result)
pages = result.get("document", {}).get("pages", [])
quality_score = 0.0
if pages:
scores = [p.get("imageQualityScores", {}).get("qualityScore", 0.0) for p in pages]
quality_score = sum(scores) / len(scores)
return {
"text": text,
"token_quality": token_quality,
"quality_score": round(quality_score, 4),
}
def transcribe_gemini_vision(page_images: list[np.ndarray]) -> str:
"""Transcribe page images using Gemini vision as fallback."""
api_key = os.environ["GOOGLE_GENAI_API_KEY"]
url = (f"https://generativelanguage.googleapis.com/v1beta/"
f"models/{GEMINI_MODEL}:generateContent?key={api_key}")
parts = [{"text": "Transcribe this document exactly as written. "
"Preserve all text, numbers, dates, and formatting. "
"Output only the transcribed text, nothing else."}]
for img in page_images:
jpeg_bytes = _image_to_jpeg_bytes(img)
parts.append({
"inline_data": {
"mime_type": "image/jpeg",
"data": base64.b64encode(jpeg_bytes).decode(),
}
})
body = {"contents": [{"parts": parts}]}
resp = requests.post(url, json=body, timeout=180)
resp.raise_for_status()
result = resp.json()
return result["candidates"][0]["content"]["parts"][0]["text"]
def transcribe_pdf(pdf_path: str, preprocess_enabled: bool = True) -> dict:
"""
Transcribe a PDF file. Renders pages, preprocesses, runs Doc AI with
Gemini vision fallback.
Returns:
{
"text": str, # full concatenated text
"method": str, # "docai" or "gemini_vision"
"page_count": int,
"quality_metrics": [dict], # per-page preprocessing metrics
"token_quality": dict, # Doc AI token confidence stats
"quality_score": float, # Doc AI image quality score
}
"""
doc = fitz.open(pdf_path)
page_texts = []
quality_metrics = []
page_images = [] # keep for potential Gemini fallback
# Aggregate token quality across pages
all_token_quality = {"token_count": 0, "low_confidence_count": 0,
"confidences_sum": 0.0, "min_confidence": 1.0}
all_quality_scores = []
for page_num in range(len(doc)):
page = doc[page_num]
img = _render_page(page, DPI)
processed, metrics = preprocess_page(img, enable=preprocess_enabled)
quality_metrics.append({"page": page_num + 1, **metrics})
# Keep lower-res version for potential Gemini fallback
gemini_img = _render_page(page, GEMINI_DPI)
page_images.append(gemini_img)
# OCR with Document AI
png_bytes = _image_to_png_bytes(processed)
result = transcribe_docai(png_bytes)
page_texts.append(result["text"])
all_quality_scores.append(result["quality_score"])
tq = result["token_quality"]
all_token_quality["token_count"] += tq["token_count"]
all_token_quality["low_confidence_count"] += tq["low_confidence_count"]
all_token_quality["confidences_sum"] += tq["mean_confidence"] * tq["token_count"]
all_token_quality["min_confidence"] = min(all_token_quality["min_confidence"],
tq["min_confidence"])
doc.close()
# Aggregate token quality
tc = all_token_quality["token_count"]
aggregated_tq = {
"token_count": tc,
"low_confidence_count": all_token_quality["low_confidence_count"],
"low_confidence_ratio": round(all_token_quality["low_confidence_count"] / tc, 4) if tc else 0.0,
"mean_confidence": round(all_token_quality["confidences_sum"] / tc, 4) if tc else 0.0,
"min_confidence": round(all_token_quality["min_confidence"], 4),
}
avg_quality = round(sum(all_quality_scores) / len(all_quality_scores), 4) if all_quality_scores else 0.0
# Check if we need Gemini vision fallback
method = "docai"
if _needs_vision_fallback(aggregated_tq):
print(f" Low OCR quality (ratio={aggregated_tq['low_confidence_ratio']}), "
f"falling back to Gemini vision...")
method = "gemini_vision"
gemini_text = transcribe_gemini_vision(page_images)
full_text = gemini_text
else:
# Concatenate with page markers
parts = []
for i, text in enumerate(page_texts):
if len(page_texts) > 1:
parts.append(f"--- Page {i + 1} ---")
parts.append(text)
full_text = "\n".join(parts)
return {
"text": full_text,
"method": method,
"page_count": len(page_texts),
"quality_metrics": quality_metrics,
"token_quality": aggregated_tq,
"quality_score": avg_quality,
}
Step 2: Smoke test with one invoice
cd ../spikes/cc-reconciliation
.venv/bin/python -c "
import os
from dotenv import load_dotenv
load_dotenv()
from src.transcribe import transcribe_pdf
result = transcribe_pdf('/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC/2025-12-04_073706.pdf')
print('Method:', result['method'])
print('Pages:', result['page_count'])
print('Token quality:', result['token_quality'])
print('Text (first 200 chars):', result['text'][:200])
"
Expected: Prints transcription method, page count, quality stats, and first 200 chars of text.
Step 3: Commit
git add ../spikes/cc-reconciliation/src/transcribe.py
git commit -m "spike(cc-recon): transcription with Doc AI primary and Gemini vision fallback"
extract.py)Files:
../spikes/cc-reconciliation/src/extract.pyStep 1: Implement extract.py
Two extraction functions, both using Claude Sonnet 4.5. Uses the anthropic SDK directly.
import json
import os
import anthropic
MODEL = "claude-sonnet-4-5-20250929"
def _get_client() -> anthropic.Anthropic:
return anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
def extract_cc_statement(transcribed_text: str) -> dict:
"""Extract structured data from a credit card statement transcription."""
client = _get_client()
prompt = """Extract all line items from this credit card statement.
RULES:
1. Extract every transaction line, including redacted/blacked-out ones.
2. For redacted entries: set "redacted" to true. The merchant name may be blacked out
but the amount and dates should still be visible. Extract what you can.
3. Amounts: use numbers with dot decimal (42.40, not "42,40"). The statement uses
German format (comma = decimal), convert to dot format.
4. Dates: use ISO format YYYY-MM-DD.
5. The "S" suffix on amounts means "Soll" (debit). "H" means "Haben" (credit).
Report amounts as positive numbers. Mark credits with "is_credit": true.
6. Extract Prämienmeilen (bonus miles) as the "miles" field for each entry.
7. Index lines sequentially starting from 1.
Return JSON with this exact structure:
{
"statement_period": {"from": "YYYY-MM-DD", "to": "YYYY-MM-DD"},
"cardholder": "...",
"card_number": "...",
"total": 0.00,
"currency": "EUR",
"lines": [
{
"index": 1,
"purchase_date": "YYYY-MM-DD",
"booking_date": "YYYY-MM-DD",
"merchant": "merchant name or null if redacted",
"amount": 0.00,
"currency": "EUR",
"miles": 0,
"redacted": false,
"is_credit": false
}
]
}
STATEMENT TEXT:
""" + transcribed_text
response = client.messages.create(
model=MODEL,
max_tokens=16000,
temperature=0,
messages=[
{"role": "user", "content": prompt},
{"role": "assistant", "content": "{"},
],
)
raw = "{" + response.content[0].text
# Strip markdown code fences if present
if raw.startswith("```"):
raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
return json.loads(raw)
def extract_invoice(transcribed_text: str) -> dict:
"""Extract structured data from an invoice transcription."""
client = _get_client()
prompt = """Extract key fields from this invoice/receipt.
RULES:
1. Amounts: use numbers with dot decimal (309.40, not "309,40").
2. Dates: use ISO format YYYY-MM-DD.
3. Extract the total amount actually charged (including tax).
4. For payment_method: look for "Kreditkarte", "EC-Cash", "PayPal", "Überweisung", etc.
Use null if not stated.
5. Keep brief_description under 100 characters — just enough to identify what was purchased.
6. Copy vendor name exactly as written. Do not correct spelling.
Return JSON with this exact structure:
{
"vendor": "Company Name",
"invoice_number": "...",
"invoice_date": "YYYY-MM-DD",
"total": 0.00,
"currency": "EUR",
"payment_method": "Kreditkarte",
"brief_description": "short description of items"
}
INVOICE TEXT:
""" + transcribed_text
response = client.messages.create(
model=MODEL,
max_tokens=4000,
temperature=0,
messages=[
{"role": "user", "content": prompt},
{"role": "assistant", "content": "{"},
],
)
raw = "{" + response.content[0].text
if raw.startswith("```"):
raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
return json.loads(raw)
Step 2: Smoke test with transcription output
cd ../spikes/cc-reconciliation
.venv/bin/python -c "
import os, json
from dotenv import load_dotenv
load_dotenv()
from src.transcribe import transcribe_pdf
from src.extract import extract_invoice
result = transcribe_pdf('/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC/2025-12-04_073706.pdf')
extracted = extract_invoice(result['text'])
print(json.dumps(extracted, indent=2))
"
Expected: Prints extracted invoice data (Coffee Unlimited, 42.40 EUR, etc.)
Step 3: Commit
git add ../spikes/cc-reconciliation/src/extract.py
git commit -m "spike(cc-recon): LLM extraction for CC statements and invoices"
match.py)Files:
../spikes/cc-reconciliation/src/match.pyStep 1: Implement match.py
Single Claude call with all data.
import json
import os
import anthropic
MODEL = "claude-sonnet-4-5-20250929"
def match_all(statement_data: dict, invoices: dict[str, dict]) -> dict:
"""
Match CC statement lines against extracted invoice data.
Args:
statement_data: Extracted CC statement (from extract_cc_statement)
invoices: Dict mapping filename -> extracted invoice data
Returns:
Match results with reasoning
"""
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
# Format CC lines
cc_lines_text = []
for line in statement_data["lines"]:
merchant = line.get("merchant") or "[REDACTED]"
redacted_tag = " [REDACTED]" if line.get("redacted") else ""
credit_tag = " [CREDIT]" if line.get("is_credit") else ""
cc_lines_text.append(
f" [{line['index']}] {line.get('purchase_date', '?')} | "
f"{merchant}{redacted_tag}{credit_tag} | "
f"{line['amount']} {line.get('currency', 'EUR')}"
)
# Format invoices
invoice_lines_text = []
for filename, inv in sorted(invoices.items()):
invoice_lines_text.append(
f" [{filename}] {inv.get('vendor', '?')} | "
f"{inv.get('total', '?')} {inv.get('currency', 'EUR')} | "
f"Date: {inv.get('invoice_date', '?')} | "
f"Payment: {inv.get('payment_method', '?')} | "
f"{inv.get('brief_description', '')}"
)
prompt = f"""You are matching credit card statement line items against invoices.
CREDIT CARD STATEMENT:
Period: {statement_data['statement_period']['from']} to {statement_data['statement_period']['to']}
Cardholder: {statement_data['cardholder']}
Total: {statement_data['total']} {statement_data['currency']}
CC LINE ITEMS:
{chr(10).join(cc_lines_text)}
INVOICES (filename | vendor | total | date | payment method | description):
{chr(10).join(invoice_lines_text)}
MATCHING RULES:
1. Match by amount (exact or very close — within a few cents for rounding).
2. Use date proximity: the CC purchase date should be close to the invoice date
(same day or within a few days, considering processing delays).
3. Use merchant/vendor name similarity: CC merchant names are often abbreviated
or formatted differently than invoice vendor names (e.g., "HAMBURG MARRIOTT HO"
matches "Hamburg Marriott"). Be flexible with name matching.
4. Redacted CC entries: match by amount and date only, since merchant is hidden.
5. Credit entries (marked [CREDIT]) are refunds — they can match invoices too.
6. An invoice can match at most one CC line, and vice versa.
7. Some CC lines may have no matching invoice (private expenses, small purchases
without receipt). Some invoices may not match any CC line (different payment
method, outside statement period).
8. When multiple invoices could match a CC line (e.g., same amount), use date
and merchant to disambiguate. If still ambiguous, note it in reasoning.
Return JSON:
{{
"matches": [
{{
"cc_line_index": 1,
"invoice_file": "filename.pdf",
"confidence": "high|medium|low",
"reasoning": "Brief explanation of why these match"
}}
],
"unmatched_cc_lines": [
{{
"cc_line_index": 2,
"amount": 27.01,
"merchant": "merchant or null",
"reasoning": "Why no match was found"
}}
],
"unmatched_invoices": [
{{
"file": "filename.pdf",
"vendor": "...",
"amount": 0.00,
"reasoning": "Why no match was found"
}}
]
}}"""
response = client.messages.create(
model=MODEL,
max_tokens=16000,
temperature=0,
messages=[
{"role": "user", "content": prompt},
{"role": "assistant", "content": "{"},
],
)
raw = "{" + response.content[0].text
if raw.startswith("```"):
raw = raw.split("\n", 1)[1].rsplit("```", 1)[0]
return json.loads(raw)
Step 2: Commit (no smoke test — needs full pipeline data, will test via main.py)
git add ../spikes/cc-reconciliation/src/match.py
git commit -m "spike(cc-recon): LLM-based matching of CC lines to invoices"
report.py)Files:
../spikes/cc-reconciliation/src/report.pyStep 1: Implement report.py
Jinja2 template as string constant. Print-friendly CSS.
import os
from jinja2 import Template
REPORT_TEMPLATE = """<!DOCTYPE html>
<html lang="de">
<head>
<meta charset="utf-8">
<title>CC Reconciliation Report</title>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
font-size: 11pt; color: #333; padding: 20px; max-width: 1400px; margin: 0 auto; }
h1 { font-size: 18pt; margin-bottom: 8px; }
h2 { font-size: 14pt; margin: 24px 0 8px 0; border-bottom: 2px solid #333; padding-bottom: 4px; }
.header-info { font-size: 10pt; color: #666; margin-bottom: 16px; }
.header-info span { margin-right: 24px; }
.summary { display: flex; gap: 16px; margin: 16px 0; flex-wrap: wrap; }
.summary-card { background: #f5f5f5; border: 1px solid #ddd; border-radius: 4px;
padding: 12px 16px; min-width: 150px; }
.summary-card .label { font-size: 9pt; color: #666; text-transform: uppercase; }
.summary-card .value { font-size: 16pt; font-weight: bold; margin-top: 4px; }
.match-high { color: #2e7d32; }
.match-medium { color: #f57f17; }
.match-low { color: #c62828; }
table { width: 100%; border-collapse: collapse; margin: 8px 0 24px 0; font-size: 10pt; }
th { background: #f5f5f5; text-align: left; padding: 6px 8px; border: 1px solid #ddd;
font-weight: 600; white-space: nowrap; }
td { padding: 6px 8px; border: 1px solid #ddd; vertical-align: top; }
tr:nth-child(even) { background: #fafafa; }
.amount { text-align: right; font-variant-numeric: tabular-nums; white-space: nowrap; }
.redacted { color: #999; font-style: italic; }
.reasoning { font-size: 9pt; color: #555; max-width: 400px; }
.confidence { font-weight: 600; text-transform: uppercase; font-size: 9pt; }
@media print {
body { padding: 0; font-size: 9pt; }
.summary-card { background: none; border: 1px solid #999; }
tr:nth-child(even) { background: none; }
h2 { page-break-before: auto; }
table { page-break-inside: auto; }
tr { page-break-inside: avoid; }
@page { size: A4 landscape; margin: 10mm; }
}
</style>
</head>
<body>
<h1>Kreditkartenabrechnung — Belegzuordnung</h1>
<div class="header-info">
<span><b>Zeitraum:</b> {{ statement.statement_period.from }} bis {{ statement.statement_period.to }}</span>
<span><b>Karteninhaber:</b> {{ statement.cardholder }}</span>
<span><b>Karte:</b> {{ statement.card_number }}</span>
<span><b>Gesamtbetrag:</b> {{ "%.2f"|format(statement.total) }} {{ statement.currency }}</span>
</div>
<div class="summary">
<div class="summary-card">
<div class="label">Zugeordnet</div>
<div class="value match-high">{{ matches|length }}</div>
</div>
<div class="summary-card">
<div class="label">CC ohne Beleg</div>
<div class="value">{{ unmatched_cc|length }}</div>
</div>
<div class="summary-card">
<div class="label">Belege ohne CC</div>
<div class="value">{{ unmatched_inv|length }}</div>
</div>
<div class="summary-card">
<div class="label">Zugeordneter Betrag</div>
<div class="value">{{ "%.2f"|format(matched_total) }} €</div>
</div>
</div>
<h2>Zugeordnete Buchungen ({{ matches|length }})</h2>
<table>
<tr>
<th>#</th><th>Kaufdatum</th><th>Händler (CC)</th><th class="amount">Betrag</th>
<th>Beleg (Datei)</th><th>Lieferant</th><th class="amount">Rechnungsbetrag</th>
<th>Konfidenz</th><th>Begründung</th>
</tr>
{% for m in matches %}
<tr>
<td>{{ m.cc_line_index }}</td>
<td>{{ m.cc_line.purchase_date }}</td>
<td>{% if m.cc_line.redacted %}<span class="redacted">[GESCHWÄRZT]</span>{% else %}{{ m.cc_line.merchant }}{% endif %}</td>
<td class="amount">{{ "%.2f"|format(m.cc_line.amount) }} €</td>
<td>{{ m.invoice_file }}</td>
<td>{{ m.invoice.vendor }}</td>
<td class="amount">{{ "%.2f"|format(m.invoice.total) }} €</td>
<td><span class="confidence match-{{ m.confidence }}">{{ m.confidence }}</span></td>
<td class="reasoning">{{ m.reasoning }}</td>
</tr>
{% endfor %}
</table>
{% if unmatched_cc %}
<h2>CC-Buchungen ohne Beleg ({{ unmatched_cc|length }})</h2>
<table>
<tr><th>#</th><th>Kaufdatum</th><th>Händler</th><th class="amount">Betrag</th><th>Begründung</th></tr>
{% for u in unmatched_cc %}
<tr>
<td>{{ u.cc_line_index }}</td>
<td>{{ u.cc_line.purchase_date }}</td>
<td>{% if u.cc_line.redacted %}<span class="redacted">[GESCHWÄRZT]</span>{% else %}{{ u.cc_line.merchant or '-' }}{% endif %}</td>
<td class="amount">{{ "%.2f"|format(u.amount) }} €</td>
<td class="reasoning">{{ u.reasoning }}</td>
</tr>
{% endfor %}
</table>
{% endif %}
{% if unmatched_inv %}
<h2>Belege ohne CC-Buchung ({{ unmatched_inv|length }})</h2>
<table>
<tr><th>Datei</th><th>Lieferant</th><th>Datum</th><th class="amount">Betrag</th><th>Begründung</th></tr>
{% for u in unmatched_inv %}
<tr>
<td>{{ u.file }}</td>
<td>{{ u.vendor }}</td>
<td>{{ u.invoice.invoice_date or '-' }}</td>
<td class="amount">{{ "%.2f"|format(u.amount) }} €</td>
<td class="reasoning">{{ u.reasoning }}</td>
</tr>
{% endfor %}
</table>
{% endif %}
</body>
</html>"""
def generate_report(statement_data: dict, invoices: dict[str, dict],
match_results: dict, output_path: str) -> str:
"""
Generate HTML report from match results.
Args:
statement_data: Extracted CC statement data
invoices: Dict mapping filename -> extracted invoice data
match_results: Output from match.match_all()
output_path: Path to write HTML file
Returns:
Path to generated report
"""
# Build CC line lookup
cc_lines_by_index = {line["index"]: line for line in statement_data["lines"]}
# Enrich matches with source data
enriched_matches = []
matched_total = 0.0
for m in match_results.get("matches", []):
cc_line = cc_lines_by_index.get(m["cc_line_index"], {})
invoice = invoices.get(m["invoice_file"], {})
enriched_matches.append({
**m,
"cc_line": cc_line,
"invoice": invoice,
})
matched_total += cc_line.get("amount", 0.0)
# Enrich unmatched CC lines
enriched_unmatched_cc = []
for u in match_results.get("unmatched_cc_lines", []):
cc_line = cc_lines_by_index.get(u["cc_line_index"], {})
enriched_unmatched_cc.append({**u, "cc_line": cc_line})
# Enrich unmatched invoices
enriched_unmatched_inv = []
for u in match_results.get("unmatched_invoices", []):
invoice = invoices.get(u["file"], {})
enriched_unmatched_inv.append({**u, "invoice": invoice})
template = Template(REPORT_TEMPLATE)
html = template.render(
statement=statement_data,
matches=enriched_matches,
unmatched_cc=enriched_unmatched_cc,
unmatched_inv=enriched_unmatched_inv,
matched_total=matched_total,
)
os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
with open(output_path, "w") as f:
f.write(html)
return output_path
Step 2: Commit
git add ../spikes/cc-reconciliation/src/report.py
git commit -m "spike(cc-recon): HTML report generation with print-friendly CSS"
main.py)Files:
../spikes/cc-reconciliation/src/main.pyStep 1: Implement main.py
Orchestrates the full pipeline with caching and CLI flags.
import argparse
import json
import os
import sys
from pathlib import Path
from dotenv import load_dotenv
STATEMENT_KEYWORD = "Kreditkartenabrechnung"
def discover_pdfs(input_folder: str) -> tuple[str | None, list[str]]:
"""Classify PDFs in folder. Returns (statement_path, [invoice_paths])."""
folder = Path(input_folder)
statement = None
invoices = []
for pdf in sorted(folder.glob("*.pdf")):
if STATEMENT_KEYWORD in pdf.name:
statement = str(pdf)
else:
invoices.append(str(pdf))
return statement, invoices
def load_cache(cache_dir: str, filename: str) -> dict | None:
"""Load cached result for a file, or return None."""
cache_path = Path(cache_dir) / f"{filename}.json"
if cache_path.exists():
with open(cache_path) as f:
return json.load(f)
return None
def save_cache(cache_dir: str, filename: str, data: dict):
"""Save result to cache."""
os.makedirs(cache_dir, exist_ok=True)
cache_path = Path(cache_dir) / f"{filename}.json"
with open(cache_path, "w") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
def process_pdf(pdf_path: str, cache_dir: str, is_statement: bool,
no_cache: bool, preprocess: bool) -> dict:
"""Transcribe and extract a single PDF. Uses cache if available."""
from src.transcribe import transcribe_pdf
from src.extract import extract_cc_statement, extract_invoice
filename = Path(pdf_path).name
doc_type = "statement" if is_statement else "invoice"
# Check cache
if not no_cache:
cached = load_cache(cache_dir, filename)
if cached:
print(f" [cached] {filename}")
return cached
# Transcribe
print(f" [transcribe] {filename}...")
transcription = transcribe_pdf(pdf_path, preprocess_enabled=preprocess)
# Extract
print(f" [extract] {filename} ({doc_type})...")
if is_statement:
extracted = extract_cc_statement(transcription["text"])
else:
extracted = extract_invoice(transcription["text"])
result = {
"filename": filename,
"type": doc_type,
"transcription": {
"method": transcription["method"],
"page_count": transcription["page_count"],
"quality_score": transcription["quality_score"],
"token_quality": transcription["token_quality"],
"quality_metrics": transcription["quality_metrics"],
"text": transcription["text"],
},
"extracted": extracted,
}
save_cache(cache_dir, filename, result)
return result
def run(input_folder: str, output_dir: str = "output", cache_dir: str = "cache",
no_cache: bool = False, match_only: bool = False, preprocess: bool = True):
"""Run the full CC reconciliation pipeline."""
from src.match import match_all
from src.report import generate_report
print(f"Input folder: {input_folder}")
statement_path, invoice_paths = discover_pdfs(input_folder)
if not statement_path:
print(f"ERROR: No file containing '{STATEMENT_KEYWORD}' found in {input_folder}")
sys.exit(1)
print(f"Found statement: {Path(statement_path).name}")
print(f"Found {len(invoice_paths)} invoices")
print()
# Step 1+2: Transcribe and extract
if not match_only:
print("=== Transcription & Extraction ===")
statement_result = process_pdf(statement_path, cache_dir, is_statement=True,
no_cache=no_cache, preprocess=preprocess)
invoice_results = {}
for inv_path in invoice_paths:
result = process_pdf(inv_path, cache_dir, is_statement=False,
no_cache=no_cache, preprocess=preprocess)
invoice_results[Path(inv_path).name] = result
else:
# Load from cache
print("=== Loading from cache (--match-only) ===")
statement_cache = load_cache(cache_dir, Path(statement_path).name)
if not statement_cache:
print("ERROR: No cached statement data. Run without --match-only first.")
sys.exit(1)
statement_result = statement_cache
invoice_results = {}
for inv_path in invoice_paths:
cached = load_cache(cache_dir, Path(inv_path).name)
if cached:
invoice_results[Path(inv_path).name] = cached
else:
print(f" WARNING: No cache for {Path(inv_path).name}, skipping")
statement_data = statement_result["extracted"]
invoices_data = {name: r["extracted"] for name, r in invoice_results.items()}
# Step 3: Match
print()
print("=== Matching ===")
print(f" Matching {len(statement_data.get('lines', []))} CC lines against "
f"{len(invoices_data)} invoices...")
match_results = match_all(statement_data, invoices_data)
print(f" Matched: {len(match_results.get('matches', []))}")
print(f" Unmatched CC lines: {len(match_results.get('unmatched_cc_lines', []))}")
print(f" Unmatched invoices: {len(match_results.get('unmatched_invoices', []))}")
# Save match results to cache
save_cache(cache_dir, "_match_results", match_results)
# Step 4: Report
print()
print("=== Report ===")
report_path = os.path.join(output_dir, "report.html")
generate_report(statement_data, invoices_data, match_results, report_path)
print(f" Report written to: {report_path}")
def main():
parser = argparse.ArgumentParser(description="CC Reconciliation Spike")
parser.add_argument("input_folder",
help="Folder containing CC statement PDF and invoice PDFs")
parser.add_argument("--output-dir", default="output",
help="Directory for HTML report (default: output)")
parser.add_argument("--cache-dir", default="cache",
help="Directory for cached results (default: cache)")
parser.add_argument("--no-cache", action="store_true",
help="Force re-run, ignore cached results")
parser.add_argument("--match-only", action="store_true",
help="Skip transcription/extraction, re-run matching from cache")
parser.add_argument("--no-preprocess", action="store_true",
help="Skip image preprocessing")
args = parser.parse_args()
# Load environment
load_dotenv()
# Change to spike directory so relative paths work
spike_dir = Path(__file__).parent.parent
os.chdir(spike_dir)
run(
input_folder=args.input_folder,
output_dir=args.output_dir,
cache_dir=args.cache_dir,
no_cache=args.no_cache,
match_only=args.match_only,
preprocess=not args.no_preprocess,
)
if __name__ == "__main__":
main()
Step 2: Full pipeline test
cd ../spikes/cc-reconciliation
.venv/bin/python -m src.main "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
Expected: Runs full pipeline, prints progress, generates output/report.html.
Step 3: Open report and verify
xdg-open ../spikes/cc-reconciliation/output/report.html
Review the report for:
Step 4: Commit
git add ../spikes/cc-reconciliation/src/main.py
git commit -m "spike(cc-recon): main entry point with full pipeline orchestration"
This task is iterative. Run the pipeline, review results, adjust prompts.
Step 1: Run full pipeline and review report
cd ../spikes/cc-reconciliation
.venv/bin/python -m src.main "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
Step 2: Check extraction accuracy
Inspect cache/*.json files. For each invoice, verify:
For the CC statement, verify:
Step 3: Check matching accuracy
Cross-reference cache/_match_results.json against manual inspection:
Step 4: Iterate on prompts if needed
Use --match-only to re-run matching without re-transcribing:
.venv/bin/python -m src.main --match-only "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
Use --no-cache to force re-extraction if extraction prompts change:
.venv/bin/python -m src.main --no-cache "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
Step 5: Compare with/without preprocessing
# Clear cache and run without preprocessing
rm -rf cache/
.venv/bin/python -m src.main --no-preprocess "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
# Save report
cp output/report.html output/report-no-preprocess.html
# Clear cache and run with preprocessing
rm -rf cache/
.venv/bin/python -m src.main "/home/volrath/code/orcha/drive/Kroeger Tax/Testdata CC"
cp output/report.html output/report-with-preprocess.html
Compare the two reports for differences in match quality.
Step 6: Commit final tuned version
git add ../spikes/cc-reconciliation/src/
git commit -m "spike(cc-recon): tuned prompts for correct matching on test dataset"