Date: 2026-03-17

Problem: Invoices with 400+ pages and 2000+ line items overwhelm post-processing: tax compliance and uncertain validations exceed context windows, accounts/cost-center matching becomes prohibitively expensive, and the financial validation resolver triggers on math mismatches between summary and detail items. Extraction itself was solved by chunking (2026-03-16), but post-processing remains broken for these documents.
Trigger case: Doka Österreich monthly rental invoice (466 pages, ~2000 line items) for STRABAG BMTI GmbH. Structure: page 1 is a complete financial summary with one aggregated line item ("Miete = 124,306.63"), pages 2-147 are per-project rental summaries, and pages 148-466 are an equipment-level appendix.
Extract everything from all pages (we need the full data for CSV download), but only post-process line items from the summary page(s). Classification becomes an agent loop that can probe the document structure for large invoices.
Classification switches from a single llm/generate call to an ai.agent/run-with-model loop. This applies to all documents, not just large ones.
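For documents with ≤50 pages: classification only, in a single iteration with no tool calls.

For documents with >50 pages: classification plus summary detection using the tools listed below; the output additionally carries summary-page-range.

Summary detection prompt addition (>50 pages only):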
This document has N pages. Determine whether it has a summary page (or pages) that contain the complete financial picture: invoice-level totals (subtotal, tax, total) and aggregated line items. Per-project breakdowns, transaction detail, and equipment/article lists are NOT summary pages, even if they contain the word "summary" in their header. Use the provided tools to examine the document structure. Return summary-page-range as [start, end] (1-indexed, inclusive) or null if no clear summary exists.
Tools (only registered for >50 page documents):
| Tool | Input | Output | Purpose |
|---|---|---|---|
| read_pages | start, end (max 10 pages) | Full transcription text for those pages | Detailed examination of specific pages |
| search_text | query string | List of page numbers containing the query + total count | Find section boundaries and recurring headers |
| page_headers | start, end (max 50 pages) | First 2-3 lines of each page in range | Scan document structure without loading full content |
Tools are closures over the transcription text — built per-invocation, not from the global tool registry. Uses ai.agent/run-with-model with a hand-built tool map.
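A minimal sketch of how such per-invocation tools might be built, assuming the transcription is available as a vector of per-page strings. The name `make-summary-tools`, the shape of the returned tool map, the argument keys, and the case-insensitive matching are all assumptions made for illustration; the map format that ai.agent/run-with-model actually expects is not shown in this document.

```clojure
(require '[clojure.string :as str])

(defn- page-slice
  "Returns [page-number text] pairs for the 1-indexed inclusive range
   [start end], clamped to the document and capped at max-pages."
  [pages start end max-pages]
  (let [start (max 1 start)
        end   (min end (count pages) (+ start (dec max-pages)))]
    (for [n (range start (inc end))]
      [n (nth pages (dec n))])))

(defn make-summary-tools
  "Builds a hypothetical tool map for one classification run.
   Every tool closes over pages, so nothing is registered globally."
  [pages]
  {"read_pages"
   (fn [{:keys [start end]}]
     ;; Full text of at most 10 pages.
     (into {} (page-slice pages start end 10)))

   "search_text"
   (fn [{:keys [query]}]
     ;; Page numbers whose text contains the query (case-insensitive,
     ;; an assumption), plus the number of matching pages.
     (let [q    (str/lower-case query)
           hits (vec (keep-indexed
                      (fn [i text]
                        (when (str/includes? (str/lower-case text) q)
                          (inc i)))
                      pages))]
       {:pages hits :count (count hits)}))

   "page_headers"
   (fn [{:keys [start end]}]
     ;; First 3 lines of each page, for at most 50 pages.
     (into {}
           (for [[n text] (page-slice pages start end 50)]
             [n (vec (take 3 (str/split-lines text)))])))})
```

Because the map is rebuilt for each document, two classifications running concurrently never see each other's transcription text.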
Expected agent behavior for the Doka invoice:
1. search_text("summary") or search_text("invoice summary") → finds recurring section headers on pages 2-147
2. page_headers(1, 5) → pages 2+ are per-project breakdowns, not invoice-level summary
3. Returns summary-page-range: [1, 1]

Model: For ≤50 pages, classification uses the :classification config (Gemini Flash), same as today. For >50 pages, it upgrades to the :classification-large config (Sonnet): summary detection is a harder task, and the cost is negligible compared to the post-processing savings. The classify! function selects the config based on page count. The two configs vary independently.
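The config selection by page count could look roughly like the sketch below. The function names and the constant are hypothetical; only the 50-page threshold and the two config keys come from this document, and the real classify! may be structured differently.

```clojure
(def large-document-page-threshold
  "Documents with more pages than this use the large-document config."
  50)

(defn classification-config-key
  "Picks the LLM config key for classification from the page count."
  [page-count]
  (if (> page-count large-document-page-threshold)
    :classification-large
    :classification))

(defn select-classification-config
  "Looks up the chosen config in the orchestrator's llm-config map."
  [llm-config page-count]
  (get llm-config (classification-config-key page-count)))
```

Keeping the key selection separate from the lookup lets either config be changed in config.edn without touching the selection logic.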
Max iterations: 5 (if the agent can't determine the summary in 5 tool calls, return summary-page-range: null — treat as a normal invoice).
Classification output: Same as today (document-type, invoice-subtype, confidence, etc.) plus the new optional summary-page-range field.
Extraction processes all pages using the existing chunked approach. Every line item retains its page-location: [start-page, end-page].
After extraction, in extraction.clj: If summary-page-range is present in the classification result, extraction splits items before returning:
- Items whose page-location overlaps the summary range → line-items
- All other items → breakdown-items

If summary-page-range is null: All items remain in line-items, and breakdown-items is absent. Current behavior, no change.
Overlap logic: An item overlaps if its page-location range intersects the summary range. For summary-page-range: [1, 1] and item page-location: [1, 1] → overlap. Item page-location: [2, 3] → no overlap.
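The overlap test and the split can be sketched as follows. This is an illustration under assumptions, not the code in extraction.clj: the function names are hypothetical, and items are assumed to be maps carrying a `:page-location` key holding `[start-page end-page]`.

```clojure
(defn overlaps?
  "True when two 1-indexed inclusive page ranges share at least one page."
  [[a-start a-end] [b-start b-end]]
  (and (<= a-start b-end) (<= b-start a-end)))

(defn split-by-summary-range
  "Moves items outside the summary range into :breakdown-items.
   With a nil range, structured-data is returned unchanged."
  [structured-data summary-page-range]
  (if (nil? summary-page-range)
    structured-data
    (let [{summary true breakdown false}
          (group-by #(overlaps? (:page-location %) summary-page-range)
                    (:line-items structured-data))]
      (assoc structured-data
             :line-items         (vec summary)
             :breakdown-items    (vec breakdown)
             :summary-page-range summary-page-range))))
```

An item that spans pages 1-2 would count as overlapping a `[1, 1]` summary range under this test, since the two ranges share page 1.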
The orchestrator and post-processing see the same structured-data shape as today — line-items just contains fewer items when a summary range is active.
The chunking parameters from the 2026-03-16 implementation remain:
- chars-per-token = 2 (conservative for number-heavy content)
- prompt-overhead-tokens = 15000
- max-chunk-chars = 150000 (~120 pages per chunk)

Post-processors operate on structured-data.line-items as they do today. With the item split, they receive only summary items:
No batching issues, no context window overflow, no cost explosion.
Add a large-document-summary-only check in validation.clj. When summary-page-range is present:
- Reported as "warning" (not error, so it doesn't block processing)

This uses the existing validation results mechanism that already renders in the UI.
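A rough sketch of what such a check might return. The result keys (`:check`, `:status`, `:message`) and the message wording are assumptions for illustration; the actual shape of a validation result in validation.clj is not specified in this document.

```clojure
(defn large-document-summary-only-check
  "Returns a warning result when only summary items were post-processed,
   or nil when no summary range is set. Result keys are hypothetical."
  [{:keys [summary-page-range line-items breakdown-items]}]
  (when summary-page-range
    {:check   :large-document-summary-only
     :status  "warning"
     :message (format (str "Post-processing covered %d summary line item(s) "
                           "from pages %d-%d; %d breakdown item(s) were "
                           "extracted but not post-processed.")
                      (count line-items)
                      (first summary-page-range)
                      (second summary-page-range)
                      (count breakdown-items))}))
```

Returning nil for documents without a summary range keeps the check a no-op for normal invoices.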
Add :classification-large to the LLM config in config.edn, pointing to Sonnet. This is referenced by the orchestrator's llm-config map alongside the existing keys.
;; In :com.getorcha/llm:
:classification-large {:provider :anthropic
                       :api-key  #orcha/param "/v1-orcha/anthropic-api-key"
                       :model    "claude-sonnet-4-5-20250929"}
;; In the orchestrator's :llm-config:
:classification-large #ref [:com.getorcha/llm :classification-large]
structured-data additions:
;; In the StructuredData Malli schema:
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]
[:breakdown-items {:optional true} [:maybe [:vector LineItem]]]
breakdown-items has the same shape as line-items — same LineItem schema — but without the post-processing enrichment fields (no debit-account, credit-account, cost-center, etc.).
classification additions:
;; In the classification output:
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]
Document arrives
│
▼
Transcription (unchanged)
│
▼
Classification (agent loop)
├─ ≤50 pages: classify only (1 iteration, 0 tool calls)
└─ >50 pages: classify + summary detection (2-4 iterations with tools)
│
▼
Extraction (all pages, chunked if needed)
├─ If summary-page-range set: split into line-items + breakdown-items
└─ If null: all items in line-items (current behavior)
│
▼
Post-processing (on line-items only)
│
▼
Validation (adds summary-only warning if applicable)
│
▼
Store structured-data with line-items, breakdown-items, summary-page-range