Pipeline Comparison Tool

Compares old (Python/DSPy) and new (Clojure) invoice extraction pipelines, generating detailed HTML reports.

Directory Structure

spikes/old-impl-comparison/
├── compare.clj          # Main comparison script
├── README.md            # This file
└── invoices/
    ├── index.html            # Index of all reports
    ├── 01/
    │   ├── invoice.pdf       # Source invoice
    │   ├── report.html       # Generated comparison report
    │   ├── new-document.json # New pipeline output
    │   ├── new-ext-text.txt  # Text sent to LLM
    │   ├── old-document.json # Old pipeline output
    │   └── old-ext-text.txt  # Old pipeline extracted text
    ├── 02/
    │   └── ...
    └── ...

Usage

# Run comparison for invoice 01
bb spikes/old-impl-comparison/compare.clj 01

# Run comparison for all invoices with complete data
bb spikes/old-impl-comparison/compare.clj --all

# Force fresh run (delete cached results)
bb spikes/old-impl-comparison/compare.clj --clean 01

Prerequisites

- New pipeline server: running on port 8888 (clj -M:dev -m user)
- Old pipeline: Python venv available at /home/volrath/code/old-orcha/backend
- Invoice file: invoice.pdf (or .jpg/.png) placed in invoices/<num>/

How It Works

1. Checks whether cached results exist for each pipeline
2. Runs any missing pipelines in parallel (10-minute timeout each)
3. Compares the outputs field by field using canonical mappings
4. Generates an HTML report (report.html) with:
   - Field comparison table (with priority levels)
   - Line items comparison
   - Performance statistics
   - Cost estimates
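
The cache check and parallel run described above can be sketched roughly as follows. This is a Python illustration, not the code in compare.clj: `run_missing` and the `runners` callables are hypothetical names, and only the `<name>-document.json` cache files come from the directory layout above.

```python
import json
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from pathlib import Path

TIMEOUT_SECONDS = 10 * 60  # the 10-minute limit per pipeline


def run_missing(invoice_dir, runners, timeout=TIMEOUT_SECONDS):
    """Return {name: document}, reusing cached JSON and running the rest in parallel.

    `runners` maps a pipeline name ("old" / "new") to a hypothetical callable
    that takes the invoice directory and returns a JSON-serialisable dict.
    """
    invoice_dir = Path(invoice_dir)
    results, pending = {}, {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(runners)))
    try:
        for name, runner in runners.items():
            cache = invoice_dir / f"{name}-document.json"
            if cache.exists():
                # Cached result found: skip the pipeline run entirely.
                results[name] = json.loads(cache.read_text(encoding="utf-8"))
            else:
                pending[name] = pool.submit(runner, invoice_dir)
        for name, future in pending.items():
            try:
                doc = future.result(timeout=timeout)
            except FutureTimeout:
                results[name] = None  # timed out; nothing is cached
                continue
            cache = invoice_dir / f"{name}-document.json"
            cache.write_text(json.dumps(doc), encoding="utf-8")
            results[name] = doc
    finally:
        # Do not block on runners that are still going after a timeout.
        pool.shutdown(wait=False)
    return results
```

Threads cannot be killed from outside, so a timed-out runner in this sketch keeps running in the background; a real implementation would more likely launch each pipeline as a subprocess and terminate it.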

Reading the Report

Field Comparison Table

| Field | Pri | Old Key | Old Value | New Key | New Value | Match |
|-------|-----|---------|-----------|---------|-----------|-------|
| Invoice Number | P1 | invoice_number | INV-001 | invoice-number | INV-001 | ✅ |
| Issuer Name | P1 | issuer.name | Müller GmbH | issuer.name | Mueller GmbH | ❌ |

Pri: Priority level (P1 = critical, P4 = nice-to-have)
Old/New Key: JSON path in each pipeline's output
Match: ✅ values match, ❌ values differ

Mismatches are sorted to the top for easy identification.
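
To make the table concrete, here is a minimal Python sketch of a field-by-field comparison that puts mismatches first. The `FIELD_MAP` entries mirror the two sample rows above; the function names and the use of plain equality (no normalisation of umlauts, whitespace, or number formats) are assumptions, not the behaviour of compare.clj.

```python
from functools import reduce

# Hypothetical canonical mapping: (label, priority, old path, new path).
FIELD_MAP = [
    ("Invoice Number", "P1", "invoice_number", "invoice-number"),
    ("Issuer Name", "P1", "issuer.name", "issuer.name"),
]


def get_path(doc, path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    return reduce(
        lambda node, key: node.get(key) if isinstance(node, dict) else None,
        path.split("."),
        doc,
    )


def compare_fields(old_doc, new_doc, field_map=FIELD_MAP):
    """Build one row per mapped field and sort mismatches to the top."""
    rows = []
    for label, pri, old_key, new_key in field_map:
        old_val = get_path(old_doc, old_key)
        new_val = get_path(new_doc, new_key)
        rows.append({"field": label, "pri": pri, "old": old_val,
                     "new": new_val, "match": old_val == new_val})
    # sorted() is stable and False < True, so mismatches come first
    # while the original field order is otherwise preserved.
    return sorted(rows, key=lambda row: row["match"])
```

With the sample values from the table, the Issuer Name row (Müller GmbH vs. Mueller GmbH) sorts ahead of the matching Invoice Number row.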

Line Items Comparison

| # | Old Description | New Description | Qty | Price | Total |
|---|-----------------|-----------------|-----|-------|-------|
| 1 | Widget A | Widget A | 10.0 ✅ | 50.00 ✅ | 500.00 ✅ |
| 2 | Service B | Service B | 1.0 ✅ | 200.00→199.99 ❌ | 200.00→199.99 ❌ |

Format: value ✅ for matches, old_value→new_value ❌ for mismatches
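
The cell format can be illustrated with a short Python sketch. `format_cell` and `format_row` are hypothetical helpers, and the item keys (`description`, `qty`, `price`, `total`) are assumed for the example; the actual JSON keys may differ.

```python
def format_cell(old, new):
    """Render one cell: 'value ✅' on a match, 'old→new ❌' otherwise."""
    if old == new:
        return f"{old} ✅"
    return f"{old}→{new} ❌"


def format_row(index, old_item, new_item):
    """Render one line-item row in the pipe-table form shown above."""
    cells = [str(index), old_item["description"], new_item["description"]]
    cells += [format_cell(old_item[key], new_item[key])
              for key in ("qty", "price", "total")]
    return "| " + " | ".join(cells) + " |"
```

Values are passed as already-formatted strings here, so `"200.00"` vs. `"199.99"` is reported as a mismatch without any numeric tolerance.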

Statistics Table

| Metric | Old Pipeline | New Pipeline |
|--------|--------------|---------------|
| Total Time | 127.48s | 40.0s |
| Input Tokens | - | 4181 |
| Output Tokens | - | 2561 |
| Est. LLM Cost | - | $0.0510 |
| Est. Total Cost | - | $0.0525 |
| OCR Quality | - | 0.986 |

Total Time: Wall-clock time for complete pipeline
Est. LLM Cost: Claude API cost (input + output tokens)
Est. Total Cost: LLM + Document AI OCR ($0.0015/page)
OCR Quality: Document AI confidence score (0-1)

Cost Calculation

;; Claude Sonnet pricing
claude-input-price-per-m   = $3.00   ; per million input tokens
claude-output-price-per-m  = $15.00  ; per million output tokens

;; Document AI pricing (only when OCR is used)
docai-price-per-page = $0.0015  ; ~$1.50 per 1000 pages
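
As a worked example, the following Python sketch applies these prices to the token counts from the statistics table. The function name is hypothetical, and the page count of 1 is an assumption (the table does not state it), chosen because it reproduces the reported totals.

```python
CLAUDE_INPUT_PRICE_PER_M = 3.00    # USD per million input tokens
CLAUDE_OUTPUT_PRICE_PER_M = 15.00  # USD per million output tokens
DOCAI_PRICE_PER_PAGE = 0.0015      # USD per page, only when OCR is used


def estimate_costs(input_tokens, output_tokens, pages, ocr_used):
    """Return (llm_cost, total_cost) in USD."""
    llm = (input_tokens * CLAUDE_INPUT_PRICE_PER_M
           + output_tokens * CLAUDE_OUTPUT_PRICE_PER_M) / 1_000_000
    ocr = pages * DOCAI_PRICE_PER_PAGE if ocr_used else 0.0
    return llm, llm + ocr


llm, total = estimate_costs(4181, 2561, pages=1, ocr_used=True)
print(f"Est. LLM Cost:   ${llm:.4f}")    # $0.0510
print(f"Est. Total Cost: ${total:.4f}")  # $0.0525
```

The LLM part is 4181 × $3.00 / 1M + 2561 × $15.00 / 1M = $0.012543 + $0.038415 = $0.050958, which rounds to the $0.0510 shown in the table.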

Extraction Methods

The new pipeline uses a two-tier extraction strategy:

1. PDFBox: Free, fast text extraction for PDFs with embedded text (≥100 chars/page)
2. Document AI: OCR for scanned PDFs or images ($0.0015/page)

The report shows which method was used and includes OCR costs only when Document AI was used. The page count is read directly from the PDF file using pdfinfo.
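
A rough Python sketch of the tier decision and the page-count lookup follows. Both functions are hypothetical. In particular, requiring every page to reach the 100-character threshold is an assumption: the text above does not say whether the check is per page or an average. `parse_page_count` expects the text printed by `pdfinfo <file>`, which contains a `Pages:` line.

```python
import re

MIN_CHARS_PER_PAGE = 100  # threshold for "has embedded text"


def choose_method(filename, page_texts):
    """Pick 'pdfbox' for text PDFs, 'document-ai' for scans and images.

    `page_texts` holds the embedded text of each page (empty for images).
    """
    is_pdf = filename.lower().endswith(".pdf")
    has_text = bool(page_texts) and all(
        len(text.strip()) >= MIN_CHARS_PER_PAGE for text in page_texts
    )
    return "pdfbox" if is_pdf and has_text else "document-ai"


def parse_page_count(pdfinfo_output):
    """Extract the page count from pdfinfo's output, or None if absent."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None
```

Under this sketch a PDF with one near-empty page falls back to Document AI as a whole, which errs on the side of paying for OCR rather than sending thin text to the LLM.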

Priority Levels

| Priority | Meaning |
|----------|---------|
| P1 | Critical for DATEV/ERP export |
| P2 | Important for validation/matching |
| P3 | Useful for business logic |
| P4 | Optional but nice to have |