Compares old (Python/DSPy) and new (Clojure) invoice extraction pipelines, generating detailed HTML reports.
spikes/old-impl-comparison/
├── compare.clj # Main comparison script
├── README.md # This file
└── invoices/
    ├── index.html # Index of all reports
    ├── 01/
    │   ├── invoice.pdf # Source invoice
    │   ├── report.html # Generated comparison report
    │   ├── new-document.json # New pipeline output
    │   ├── new-ext-text.txt # Text sent to LLM
    │   ├── old-document.json # Old pipeline output
    │   └── old-ext-text.txt # Old pipeline extracted text
    ├── 02/
    │   └── ...
    └── ...
# Run comparison for invoice 01
bb spikes/old-impl-comparison/compare.clj 01
# Run comparison for all invoices with complete data
bb spikes/old-impl-comparison/compare.clj --all
# Force fresh run (delete cached results)
bb spikes/old-impl-comparison/compare.clj --clean 01
- New pipeline server: `clj -M:dev -m user` on port 8888
- Old pipeline: `/home/volrath/code/old-orcha/backend`
- `invoice.pdf` (or `.jpg`/`.png`) in `invoices/<num>/`

| Field | Pri | Old Key | Old Value | New Key | New Value | Match |
|-------|-----|---------|-----------|---------|-----------|-------|
| Invoice Number | P1 | invoice_number | INV-001 | invoice-number | INV-001 | ✅ |
| Issuer Name | P1 | issuer.name | Müller GmbH | issuer.name | Mueller GmbH | ❌ |
Mismatches are sorted to the top for easy identification.
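The mismatch-first ordering can be sketched in a few lines of Clojure. This is a minimal illustration, not the code in `compare.clj`: the function names `compare-field` and `mismatches-first` and the row keys are hypothetical, and strict equality is assumed as the match rule (consistent with `Müller GmbH` vs `Mueller GmbH` being flagged above).

```clojure
;; Hypothetical helpers; the real compare.clj may normalize values before comparing.
(defn compare-field
  "Builds one report row and marks it as a match only on strict equality."
  [{:keys [field priority old-value new-value]}]
  {:field    field
   :priority priority
   :old      old-value
   :new      new-value
   :match?   (= old-value new-value)})

(defn mismatches-first
  "Sorts rows so that mismatches (false) come before matches (true).
  sort-by is stable, so the original order is kept within each group."
  [rows]
  (sort-by :match? rows))

(def sample-rows
  (map compare-field
       [{:field "Invoice Number" :priority "P1"
         :old-value "INV-001" :new-value "INV-001"}
        {:field "Issuer Name" :priority "P1"
         :old-value "Müller GmbH" :new-value "Mueller GmbH"}]))

(map :field (mismatches-first sample-rows))
```

With the two sample rows, `Issuer Name` moves ahead of `Invoice Number` because it is the only mismatch.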
| # | Old Description | New Description | Qty | Price | Total |
|---|-----------------|-----------------|-----|-------|-------|
| 1 | Widget A | Widget A | 10.0 ✅ | 50.00 ✅ | 500.00 ✅ |
| 2 | Service B | Service B | 1.0 ✅ | 200.00→199.99 ❌ | 200.00→199.99 ❌ |
- `old_value→new_value ❌` for mismatches
- `value ✅` for matches

| Metric | Old Pipeline | New Pipeline |
|--------|--------------|---------------|
| Total Time | 127.48s | 40.0s |
| Input Tokens | - | 4181 |
| Output Tokens | - | 2561 |
| Est. LLM Cost | - | $0.0510 |
| Est. Total Cost | - | $0.0525 |
| OCR Quality | - | 0.986 |
;; Claude Sonnet pricing
claude-input-price-per-m = $3.00 ; per million input tokens
claude-output-price-per-m = $15.00 ; per million output tokens
;; Document AI pricing (only when OCR is used)
docai-price-per-page = $0.0015 ; ~$1.50 per 1000 pages
The new pipeline uses a 2-tier extraction strategy. The report shows which method was used and charges OCR costs only when Document AI was used.
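As a rough sketch of how the estimates in the metrics table follow from these prices: the function name `estimate-cost` and its input keys are hypothetical, and the sample call assumes a one-page invoice that went through Document AI, which the table does not state but which matches the $0.0015 gap between the two cost rows.

```clojure
(def claude-input-price-per-m 3.00)   ; USD per million input tokens
(def claude-output-price-per-m 15.00) ; USD per million output tokens
(def docai-price-per-page 0.0015)     ; USD per page, only when OCR is used

(defn estimate-cost
  "Returns estimated LLM, OCR, and total cost in USD.
  The OCR cost is zero unless ocr-used? is truthy."
  [{:keys [input-tokens output-tokens pages ocr-used?]}]
  (let [llm (+ (* input-tokens (/ claude-input-price-per-m 1e6))
               (* output-tokens (/ claude-output-price-per-m 1e6)))
        ocr (if ocr-used? (* pages docai-price-per-page) 0.0)]
    {:llm-cost llm
     :ocr-cost ocr
     :total-cost (+ llm ocr)}))

;; Token counts taken from the sample metrics table; pages and ocr-used? are assumed.
(estimate-cost {:input-tokens 4181
                :output-tokens 2561
                :pages 1
                :ocr-used? true})
```

Rounded to four decimals, this gives about $0.0510 for the LLM and $0.0525 in total, in line with the sample report. With `:ocr-used? false` the total equals the LLM cost.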
Page count is determined directly from the PDF file using pdfinfo.
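One way to read the page count from `pdfinfo` in babashka or Clojure is to shell out and parse the `Pages:` line of its output. The helper names `parse-page-count` and `pdf-page-count` are hypothetical, and this is a sketch rather than the script's actual implementation.

```clojure
(require '[clojure.java.shell :refer [sh]])

(defn parse-page-count
  "Extracts the page count from pdfinfo's text output, or nil if no Pages line is found."
  [pdfinfo-output]
  (some-> (re-find #"(?m)^Pages:\s+(\d+)" pdfinfo-output)
          second
          Long/parseLong))

(defn pdf-page-count
  "Runs pdfinfo on pdf-path and returns the page count, or nil if the command fails."
  [pdf-path]
  (let [{:keys [exit out]} (sh "pdfinfo" pdf-path)]
    (when (zero? exit)
      (parse-page-count out))))
```

Keeping the parsing separate from the shell call makes it possible to check the parser against captured `pdfinfo` output without having a PDF at hand.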
| Priority | Meaning |
|---|---|
| P1 | Critical for DATEV/ERP export |
| P2 | Important for validation/matching |
| P3 | Useful for business logic |
| P4 | Optional but nice to have |