Large Invoice Summary Extraction Design

Date: 2026-03-17

Problem: Invoices with 400+ pages and 2000+ line items overwhelm post-processing: the tax compliance and uncertain-validations steps exceed context windows, accounts and cost-center matching becomes prohibitively expensive, and the financial validation resolver triggers on math mismatches between summary and detail items. Extraction itself was solved by chunking (2026-03-16), but post-processing remains broken for these documents.

Trigger case: Doka Österreich monthly rental invoice (466 pages, ~2000 line items) for STRABAG BMTI GmbH. Structure: page 1 is a complete financial summary with one aggregated line item ("Miete = 124,306.63"), pages 2-147 are per-project rental summaries, and pages 148-466 are an equipment-level appendix.

Approach

Extract everything from all pages (we need the full data for CSV download), but only post-process line items from the summary page(s). Classification becomes an agent loop that can probe the document structure for large invoices.

Design

1. Classification Becomes an Agent Loop

Classification switches from a single llm/generate call to an ai.agent/run-with-model loop. This applies to all documents, not just large ones.

For documents with ≤50 pages:

Agent receives the same prompt as today (first-page text, classify the document type)
No summary detection instructions, no tools registered
Agent returns classification in 1 iteration, 0 tool calls — functionally identical to today

For documents with >50 pages:

Prompt is augmented with the page count and the summary detection instructions
Tools are registered, giving the agent access to the transcription text
Agent classifies the document AND determines summary-page-range

Summary detection prompt addition (>50 pages only):

This document has N pages. Determine whether it has a summary page (or pages) that contain the complete financial picture — invoice-level totals (subtotal, tax, total) and aggregated line items. Per-project breakdowns, transaction detail, and equipment/article lists are NOT summary pages, even if they contain the word "summary" in their header. Use the provided tools to examine the document structure. Return summary-page-range as [start, end] (1-indexed, inclusive) or null if no clear summary exists.

Tools (only registered for >50 page documents):

Tool	Input	Output	Purpose
`read_pages`	start, end (max 10 pages)	Full transcription text for those pages	Detailed examination of specific pages
`search_text`	query string	List of page numbers containing the query + total count	Find section boundaries and recurring headers
`page_headers`	start, end (max 50 pages)	First 2-3 lines of each page in range	Scan document structure without loading full content

Tools are closures over the transcription text — built per-invocation, not from the global tool registry. Uses ai.agent/run-with-model with a hand-built tool map.
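
A minimal sketch of how such a per-invocation tool map could be built. Everything here is an assumption for illustration: the name `build-summary-tools`, the representation of the transcription as a vector of per-page strings, the string tool keys, the argument maps, and the return shapes. The format that ai.agent/run-with-model actually expects is not specified in this document. The search is case-insensitive and counts matching pages, which is one possible reading of "total count".

```clojure
(require '[clojure.string :as str])

;; Assumption: `pages` is a vector of per-page transcription strings,
;; where index 0 holds page 1.
(defn- page-slice
  "Returns a seq of [page-number text] for the 1-indexed inclusive range,
   clamped to the document length and to at most `max-pages` pages."
  [pages start end max-pages]
  (let [start (max 1 start)
        end   (min end (count pages) (+ start (dec max-pages)))]
    (for [n (range start (inc end))]
      [n (nth pages (dec n))])))

(defn build-summary-tools
  "Builds a hypothetical tool map whose functions close over `pages`."
  [pages]
  {"read_pages"
   (fn [{:keys [start end]}]
     ;; Full text, capped at 10 pages per call.
     (into {} (page-slice pages start end 10)))

   "search_text"
   (fn [{:keys [query]}]
     (let [q    (str/lower-case query)
           hits (keep-indexed
                  (fn [i text]
                    (when (str/includes? (str/lower-case text) q)
                      (inc i)))
                  pages)]
       {:pages (vec hits) :total (count hits)}))

   "page_headers"
   (fn [{:keys [start end]}]
     ;; First 3 lines of each page, capped at 50 pages per call.
     (into {}
           (for [[n text] (page-slice pages start end 50)]
             [n (vec (take 3 (str/split-lines text)))])))})
```

Because each function closes over `pages`, nothing document-specific has to be placed in the global tool registry, and the map can be discarded once classification returns.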

Expected agent behavior for the Doka invoice:

Sees page 1: financial totals, one aggregated rental line item, 466 pages
search_text("summary") or search_text("invoice summary") → finds recurring section headers on pages 2-147
page_headers(1, 5) → pages 2+ are per-project breakdowns, not invoice-level summary
Conclusion: summary is page 1 only → summary-page-range: [1, 1]

Model: For ≤50 pages, uses :classification config (Gemini Flash) — same as today. For >50 pages, upgrades to :classification-large config (Sonnet) — summary detection is a harder task and the cost is negligible compared to the post-processing savings. The classify! function selects the config based on page count. The two configs vary independently.
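
The config selection by page count might look roughly like the sketch below. The function names and the `large-document-page-threshold` var are hypothetical; only the two config keys and the 50-page boundary come from this design.

```clojure
;; Assumption: the threshold lives in a single var so that the prompt
;; augmentation and the config selection cannot drift apart.
(def large-document-page-threshold 50)

(defn classification-config-key
  "Returns the llm-config key to use for a document with `page-count` pages."
  [page-count]
  (if (> page-count large-document-page-threshold)
    :classification-large
    :classification))

(defn select-classification-config
  "Looks up the selected classification config in the orchestrator's llm-config map."
  [llm-config page-count]
  (get llm-config (classification-config-key page-count)))
```

A document with exactly 50 pages stays on :classification; the upgrade starts at 51 pages.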

Max iterations: 5. If the agent can't determine the summary range within 5 iterations, return summary-page-range: null and treat the document as a normal invoice.

Classification output: Same as today (document-type, invoice-subtype, confidence, etc.) plus the new optional summary-page-range field.

2. Extraction — Item Split

Extraction processes all pages using the existing chunked approach. Every line item retains its page-location: [start-page, end-page].

After extraction (in extraction.clj), if summary-page-range is present in the classification result, the items are split before returning:

Items whose page-location overlaps the summary range → line-items
All other items → breakdown-items

If summary-page-range is null: All items remain in line-items, breakdown-items is absent. Current behavior, no change.

Overlap logic: An item overlaps if its page-location range intersects the summary range. For summary-page-range: [1, 1] and item page-location: [1, 1] → overlap. Item page-location: [2, 3] → no overlap.
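
The overlap test and the split can be sketched as follows. The function names are hypothetical, and the keyword keys (:page-location, :line-items, :breakdown-items) are assumed from the field names used in this document. The sketch also assumes every item carries a page-location, as stated above.

```clojure
(defn overlaps?
  "True when two 1-indexed inclusive page ranges share at least one page."
  [[a-start a-end] [b-start b-end]]
  (and (<= a-start b-end) (<= b-start a-end)))

(defn split-items
  "Splits extracted items into summary and breakdown items.
   With a nil summary range, all items stay in :line-items and
   :breakdown-items is absent."
  [line-items summary-page-range]
  (if (nil? summary-page-range)
    {:line-items (vec line-items)}
    (let [{summary true breakdown false}
          (group-by #(overlaps? (:page-location %) summary-page-range)
                    line-items)]
      {:line-items      (vec summary)
       :breakdown-items (vec breakdown)})))
```

Note that an item spanning pages [1, 2] counts as a summary item for a summary range of [1, 1], because intersection rather than containment is the criterion.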

The orchestrator and post-processing see the same structured-data shape as today — line-items just contains fewer items when a summary range is active.

The chunking parameters from the 2026-03-16 implementation remain:

chars-per-token = 2 (conservative for number-heavy content)
prompt-overhead-tokens = 15000
max-chunk-chars = 150000 (~120 pages per chunk)
Continuation prompt for non-first chunks
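
As a worked example of what these parameters imply per request, the sketch below estimates the token count of a full chunk. It assumes the prompt overhead is simply added to the content estimate; how the existing implementation combines the values is not stated here, and the function name is hypothetical.

```clojure
(def chars-per-token 2)
(def prompt-overhead-tokens 15000)
(def max-chunk-chars 150000)

(defn chunk-token-estimate
  "Estimated request size in tokens for a chunk of `chunk-chars` characters:
   content tokens (rounded up) plus the fixed prompt overhead."
  [chunk-chars]
  (+ prompt-overhead-tokens
     (long (Math/ceil (/ chunk-chars (double chars-per-token))))))

;; A full chunk: 150000 / 2 = 75000 content tokens + 15000 overhead.
(chunk-token-estimate max-chunk-chars) ;; => 90000
```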

3. Post-Processing — No Code Changes

Post-processors operate on structured-data.line-items as they do today. With the item split, they receive only summary items:

Accounts matcher: 1 item instead of 2000 → 1 LLM call
Cost center matcher: 1 item → 1 LLM call
Tax compliance: small JSON → fits in context
Uncertain validations: small JSON → fits in context
Financial validation: math checks against summary totals → consistent

No batching issues, no context window overflow, no cost explosion.

4. Deterministic Validation Warning

Add a large-document-summary-only check in validation.clj. When summary-page-range is present:

Status: "warning" (not error — doesn't block processing)
Message: "Invoice has N pages — only the summary page was processed for account and cost center matching. Detailed breakdown available for download. Manual review recommended."

This uses the existing validation results mechanism that already renders in the UI.
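
A rough sketch of the check, assuming a validation result is a plain map with :check, :status and :message keys. The real result shape in validation.clj is not shown in this document, so these keys are hypothetical, and the message punctuation is lightly adapted from the wording above.

```clojure
(defn large-document-summary-only-check
  "Returns a warning result when summary-page-range is present, else nil."
  [{:keys [summary-page-range]} page-count]
  (when summary-page-range
    {:check   :large-document-summary-only
     :status  "warning" ; a warning does not block processing
     :message (str "Invoice has " page-count " pages; only the summary page "
                   "was processed for account and cost center matching. "
                   "Detailed breakdown available for download. "
                   "Manual review recommended.")}))
```

The check is deterministic: it depends only on the presence of summary-page-range and needs no LLM call.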

5. Config Changes

Add :classification-large to the LLM config in config.edn, pointing to Sonnet. This is referenced by the orchestrator's llm-config map alongside the existing keys.

;; In :com.getorcha/llm:
:classification-large {:provider :anthropic
                       :api-key  #orcha/param "/v1-orcha/anthropic-api-key"
                       :model    "claude-sonnet-4-5-20250929"}

;; In the orchestrator's :llm-config:
:classification-large #ref [:com.getorcha/llm :classification-large]

6. Schema Changes

structured-data additions:

;; In the StructuredData Malli schema:
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]
[:breakdown-items    {:optional true} [:maybe [:vector LineItem]]]

breakdown-items has the same shape as line-items — same LineItem schema — but without the post-processing enrichment fields (no debit-account, credit-account, cost-center, etc.).

classification additions:

;; In the classification output:
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]

7. UI Changes

Summary items displayed in the normal line items table (with account/cost-center matches)
Breakdown items available via CSV download button
Warning banner from the validation check: "Large invoice — needs manual review"

8. Pipeline Flow

Document arrives
    │
    ▼
Transcription (unchanged)
    │
    ▼
Classification (agent loop)
    ├─ ≤50 pages: classify only (1 iteration, 0 tool calls)
    └─ >50 pages: classify + summary detection (2-4 iterations with tools)
    │
    ▼
Extraction (all pages, chunked if needed)
    ├─ If summary-page-range set: split into line-items + breakdown-items
    └─ If null: all items in line-items (current behavior)
    │
    ▼
Post-processing (on line-items only)
    │
    ▼
Validation (adds summary-only warning if applicable)
    │
    ▼
Store structured-data with line-items, breakdown-items, summary-page-range

What This Does NOT Solve

Extraction cost: Chunked extraction still makes 5 API calls for a 466-page document. This is acceptable — extraction happens once, and we need the full data.
Extraction quality: LLM extraction of 120 pages per chunk with hundreds of line items per chunk may miss items or produce errors. Acceptable for breakdown items — they're reference data, not enriched.
Breakdown enrichment: Breakdown items get no account/cost-center/tax matching. Users must review manually. This is the intended behavior.
Dynamic summary detection for ≤50 page documents: These go through the normal pipeline. If a 30-page invoice has post-processing issues, that's a separate problem to solve (likely by batching post-processors better).