Pipeline Comparison Tool

Compares old (Python/DSPy) and new (Clojure) invoice extraction pipelines, generating detailed HTML reports.

Directory Structure

spikes/old-impl-comparison/
├── compare.clj          # Main comparison script
├── README.md            # This file
└── invoices/
    ├── index.html            # Index of all reports
    ├── 01/
    │   ├── invoice.pdf       # Source invoice
    │   ├── report.html       # Generated comparison report
    │   ├── new-document.json # New pipeline output
    │   ├── new-ext-text.txt  # Text sent to LLM
    │   ├── old-document.json # Old pipeline output
    │   └── old-ext-text.txt  # Old pipeline extracted text
    ├── 02/
    │   └── ...
    └── ...

Usage

# Run comparison for invoice 01
bb spikes/old-impl-comparison/compare.clj 01

# Run comparison for all invoices with complete data
bb spikes/old-impl-comparison/compare.clj --all

# Force fresh run (delete cached results)
bb spikes/old-impl-comparison/compare.clj --clean 01

Prerequisites

  1. New pipeline server running: clj -M:dev -m user on port 8888
  2. Old pipeline available: Python venv at /home/volrath/code/old-orcha/backend
  3. Invoice file: Place invoice.pdf (or .jpg/.png) in invoices/<num>/

How It Works

  1. Checks if cached results exist for each pipeline
  2. Runs missing pipelines in parallel (10-minute timeout each)
  3. Compares field-by-field using canonical mappings
  4. Generates the HTML report (report.html) with a field comparison table, a line items comparison, and a statistics table
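Steps 1 and 2 could be sketched roughly as follows. This is an illustration, not the code in compare.clj: `cached?`, `run-missing!`, and the `runners` map of zero-argument functions are hypothetical names, and the cache check assumes that the presence of `<pipeline>-document.json` is what marks a result as cached.

```clojure
(require '[clojure.java.io :as io])

;; 10-minute timeout per pipeline, as stated above
(def timeout-ms (* 10 60 1000))

(defn cached?
  "True when <dir>/<pipeline>-document.json already exists (assumed cache marker)."
  [dir pipeline]
  (.exists (io/file dir (str (name pipeline) "-document.json"))))

(defn run-missing!
  "runners maps :old / :new to zero-arg functions (hypothetical).
  Starts every uncached pipeline on its own thread, then waits for each.
  Returns {pipeline result-or-::timeout}; cached pipelines are omitted."
  [dir runners]
  (let [pending (into {}
                      (for [[k run-fn] runners
                            :when (not (cached? dir k))]
                        [k (future (run-fn))]))]
    ;; All futures are already running at this point, so the waits overlap.
    (into {}
          (for [[k fut] pending]
            [k (deref fut timeout-ms ::timeout)]))))
```

Two caveats with this sketch: a timed-out future keeps running unless it is cancelled with `future-cancel`, and because the waits happen one after another, the second pipeline can effectively get more than 10 minutes if the first one is slow.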

Reading the Report

Field Comparison Table

| Field | Pri | Old Key | Old Value | New Key | New Value | Match |
|-------|-----|---------|-----------|---------|-----------|-------|
| Issuer Name | P1 | issuer.name | Müller GmbH | issuer.name | Mueller GmbH | ❌ |
| Invoice Number | P1 | invoice_number | INV-001 | invoice-number | INV-001 | ✅ |

Mismatches are sorted to the top for easy identification.
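A minimal sketch of how such rows might be produced, assuming both documents are parsed into string-keyed maps and that dotted keys like `issuer.name` denote nested paths. `canonical-fields`, `lookup`, and `compare-fields` are hypothetical names, and strict equality is used here as a stand-in for whatever comparison rules compare.clj actually applies.

```clojure
(require '[clojure.string :as str])

;; Hypothetical canonical mapping: one entry per report row
(def canonical-fields
  [{:field "Invoice Number" :pri "P1"
    :old-key "invoice_number" :new-key "invoice-number"}
   {:field "Issuer Name" :pri "P1"
    :old-key "issuer.name" :new-key "issuer.name"}])

(defn lookup
  "Resolves a dotted key such as \"issuer.name\" against a nested map."
  [doc dotted-key]
  (get-in doc (str/split dotted-key #"\.")))

(defn compare-fields
  "Returns one row per canonical field, mismatches first."
  [old-doc new-doc]
  (->> canonical-fields
       (map (fn [{:keys [old-key new-key] :as row}]
              (let [old-val (lookup old-doc old-key)
                    new-val (lookup new-doc new-key)]
                (assoc row
                       :old-value old-val
                       :new-value new-val
                       :match? (= old-val new-val)))))
       ;; false sorts before true, so mismatches come first;
       ;; sort-by is stable, so the mapping order is kept within each group
       (sort-by :match?)))
```

Applied to the two example rows above, this puts Issuer Name (mismatch) ahead of Invoice Number (match).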

Line Items Comparison

| # | Old Description | New Description | Qty | Price | Total |
|---|-----------------|-----------------|-----|-------|-------|
| 1 | Widget A | Widget A | 10.0 ✅ | 50.00 ✅ | 500.00 ✅ |
| 2 | Service B | Service B | 1.0 ✅ | 200.00→199.99 ❌ | 200.00→199.99 ❌ |

Statistics Table

| Metric | Old Pipeline | New Pipeline |
|--------|--------------|---------------|
| Total Time | 127.48s | 40.0s |
| Input Tokens | - | 4181 |
| Output Tokens | - | 2561 |
| Est. LLM Cost | - | $0.0510 |
| Est. Total Cost | - | $0.0525 |
| OCR Quality | - | 0.986 |

Cost Calculation

;; Claude Sonnet pricing (USD)
(def claude-input-price-per-m 3.00)    ; per million input tokens
(def claude-output-price-per-m 15.00)  ; per million output tokens

;; Document AI pricing (USD, only when OCR is used)
(def docai-price-per-page 0.0015)      ; ~$1.50 per 1000 pages
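As a worked example, the figures in the Statistics Table can be reproduced from these prices. The sketch below is illustrative only: `pricing`, `llm-cost`, and `total-cost` are hypothetical names, and the single OCR page is an assumption (it is the page count consistent with the $0.0015 difference between the two cost rows).

```clojure
(def pricing
  {:input-per-m  3.00     ; USD per million input tokens
   :output-per-m 15.00    ; USD per million output tokens
   :docai-page   0.0015}) ; USD per OCR page

(defn llm-cost
  "Estimated LLM cost in USD for the given token counts."
  [input-tokens output-tokens]
  (+ (* (/ input-tokens 1e6) (:input-per-m pricing))
     (* (/ output-tokens 1e6) (:output-per-m pricing))))

(defn total-cost
  "LLM cost plus OCR cost. Pass 0 for ocr-pages when Document AI was not used."
  [input-tokens output-tokens ocr-pages]
  (+ (llm-cost input-tokens output-tokens)
     (* ocr-pages (:docai-page pricing))))

(llm-cost 4181 2561)      ; ≈ 0.050958, reported as $0.0510
(total-cost 4181 2561 1)  ; ≈ 0.052458, reported as $0.0525 (1 page assumed)
```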

Extraction Methods

The new pipeline uses a 2-tier extraction strategy. The report shows which method was used, and OCR costs are charged only when Document AI was used. Page count is read directly from the PDF file using pdfinfo.
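Reading the page count from pdfinfo might look like the sketch below. It assumes `pdfinfo` (from Poppler) is on the PATH and prints a `Pages: N` line; `parse-page-count` and `page-count` are hypothetical names, not functions taken from compare.clj.

```clojure
(require '[clojure.java.shell :refer [sh]])

(defn parse-page-count
  "Extracts N from the 'Pages: N' line of pdfinfo output, or nil if absent."
  [pdfinfo-output]
  (some->> pdfinfo-output
           (re-find #"(?m)^Pages:\s+(\d+)")
           second
           Long/parseLong))

(defn page-count
  "Runs pdfinfo on pdf-path; returns the page count, or nil on failure."
  [pdf-path]
  (let [{:keys [exit out]} (sh "pdfinfo" pdf-path)]
    (when (zero? exit)
      (parse-page-count out))))
```

Keeping the parsing separate from the shell call makes it possible to check the regex against sample output without a PDF at hand.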

Priority Levels

| Priority | Meaning |
|----------|---------|
| P1 | Critical for DATEV/ERP export |
| P2 | Important for validation/matching |
| P3 | Useful for business logic |
| P4 | Optional but nice to have |