Large Invoice Summary Extraction Design

Date: 2026-03-17

Problem: Invoices with 400+ pages and 2000+ line items overwhelm post-processing: tax compliance and uncertain validations exceed context windows, accounts/cost-center matching becomes prohibitively expensive, and the financial validation resolver triggers on math mismatches between summary and detail items. The extraction itself was solved by chunking (2026-03-16), but post-processing remains broken for these documents.

Trigger case: Doka Österreich monthly rental invoice (466 pages, ~2000 line items) for STRABAG BMTI GmbH. Structure: page 1 is a complete financial summary with one aggregated line item ("Miete = 124,306.63"), pages 2-147 are per-project rental summaries, and pages 148-466 are an equipment-level appendix.

Approach

Extract everything from all pages (we need the full data for CSV download), but only post-process line items from the summary page(s). Classification becomes an agent loop that can probe the document structure for large invoices.

Design

1. Classification Becomes an Agent Loop

Classification switches from a single llm/generate call to an ai.agent/run-with-model loop. This applies to all documents, not just large ones.

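For documents with ≤50 pages: the agent classifies only. It finishes in 1 iteration with 0 tool calls, because no tools are registered.

For documents with >50 pages: the agent classifies and also runs summary detection, typically in 2-4 iterations using the tools below.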

Summary detection prompt addition (>50 pages only):

This document has N pages. Determine whether it has a summary page (or pages) that contain the complete financial picture — invoice-level totals (subtotal, tax, total) and aggregated line items. Per-project breakdowns, transaction detail, and equipment/article lists are NOT summary pages, even if they contain the word "summary" in their header. Use the provided tools to examine the document structure. Return summary-page-range as [start, end] (1-indexed, inclusive) or null if no clear summary exists.

Tools (only registered for >50 page documents):

| Tool | Input | Output | Purpose |
|---|---|---|---|
| read_pages | start, end (max 10 pages) | Full transcription text for those pages | Detailed examination of specific pages |
| search_text | query string | List of page numbers containing the query + total count | Find section boundaries and recurring headers |
| page_headers | start, end (max 50 pages) | First 2-3 lines of each page in range | Scan document structure without loading full content |

Tools are closures over the transcription text, built per invocation rather than taken from the global tool registry. Classification passes them to ai.agent/run-with-model as a hand-built tool map.
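A minimal sketch of how such closures might be built. It assumes the transcription is available as a vector of per-page strings, which this document does not specify. The name build-tools, the string keys, and the result map shapes are hypothetical, and the tool-map format that ai.agent/run-with-model actually expects may differ. Two further assumptions: search_text is case-insensitive and its count is the number of matching pages, and page_headers returns the first 3 non-blank lines.

```clojure
(require '[clojure.string :as str])

(defn- clamp-range
  "Clamps a 1-indexed inclusive [start end] to the document length and to
   at most max-span pages."
  [start end page-count max-span]
  (let [s (max 1 start)
        e (min end page-count (+ s (dec max-span)))]
    [s e]))

(defn build-tools
  "Returns a map of tool name -> function. Each function closes over `pages`,
   a vector of per-page transcription strings (page 1 is index 0)."
  [pages]
  (let [n         (count pages)
        page-text (fn [p] (nth pages (dec p)))
        in-range  (fn [start end max-span]
                    (let [[s e] (clamp-range start end n max-span)]
                      (range s (inc e))))]
    {"read_pages"
     (fn [{:keys [start end]}]
       (mapv (fn [p] {:page p :text (page-text p)})
             (in-range start end 10)))

     "search_text"
     (fn [{:keys [query]}]
       (let [q    (str/lower-case query)
             hits (vec (keep-indexed
                        (fn [i text]
                          (when (str/includes? (str/lower-case text) q)
                            (inc i)))
                        pages))]
         {:pages hits :count (count hits)}))

     "page_headers"
     (fn [{:keys [start end]}]
       (mapv (fn [p]
               {:page   p
                :header (->> (str/split-lines (page-text p))
                             (remove str/blank?)
                             (take 3)
                             (str/join "\n"))})
             (in-range start end 50)))}))

;; Usage with a three-page toy transcription:
(def tools
  (build-tools ["Invoice\nTotal 124,306.63"
                "Project A summary\nItem 1"
                "Project B summary\nItem 2"]))

((get tools "search_text") {:query "summary"})
;; => {:pages [2 3], :count 2}
```

Because every tool closes over the same pages vector, nothing has to be registered globally and the tools disappear when the classification call returns.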

Expected agent behavior for the Doka invoice:

  1. Sees page 1: financial totals, one aggregated rental line item, 466 pages
  2. search_text("summary") or search_text("invoice summary") → finds recurring section headers on pages 2-147
  3. page_headers(1, 5) → pages 2+ are per-project breakdowns, not invoice-level summary
  4. Conclusion: summary is page 1 only → summary-page-range: [1, 1]

Model: For ≤50 pages, uses :classification config (Gemini Flash) — same as today. For >50 pages, upgrades to :classification-large config (Sonnet) — summary detection is a harder task and the cost is negligible compared to the post-processing savings. The classify! function selects the config based on page count. The two configs vary independently.
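The selection described above could look roughly like the following. This is a sketch, not the actual classify! implementation: the function names and the threshold var are hypothetical, and the :model value for :classification is a placeholder because this document does not give the Gemini Flash model identifier. Only the :classification and :classification-large keys come from the design.

```clojure
(def large-document-threshold
  "Documents with more pages than this use the large-document config."
  50)

(defn large-document?
  [page-count]
  (> page-count large-document-threshold))

(defn classification-config
  "Picks the classification LLM config from the orchestrator's llm-config map."
  [llm-config page-count]
  (if (large-document? page-count)
    (:classification-large llm-config)
    (:classification llm-config)))

;; Usage with a stand-in llm-config map:
(def example-llm-config
  {:classification       {:model "gemini-flash-placeholder"}
   :classification-large {:model "claude-sonnet-4-5-20250929"}})

(classification-config example-llm-config 50)
;; => {:model "gemini-flash-placeholder"}

(classification-config example-llm-config 466)
;; => {:model "claude-sonnet-4-5-20250929"}
```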

Max iterations: 5. If the agent can't determine the summary within 5 iterations, return summary-page-range: null and treat the document as a normal invoice.

Classification output: Same as today (document-type, invoice-subtype, confidence, etc.) plus the new optional summary-page-range field.
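Since summary-page-range comes from model output, it may be worth guarding before anything downstream relies on it. The sketch below is an assumption rather than part of the design: normalize-summary-range is a hypothetical helper that accepts only a well-formed 1-indexed inclusive range inside the document and otherwise returns nil, which matches the "no clear summary" case.

```clojure
(defn normalize-summary-range
  "Returns [start end] when `r` is a pair of integers forming a 1-indexed
   inclusive range within a document of `page-count` pages, otherwise nil."
  [r page-count]
  (when (and (sequential? r)
             (= 2 (count r))
             (every? integer? r))
    (let [[start end] r]
      (when (<= 1 start end page-count)
        [start end]))))

(normalize-summary-range [1 1] 466)     ;; => [1 1]
(normalize-summary-range nil 466)       ;; => nil
(normalize-summary-range [5 2] 466)     ;; => nil (start after end)
(normalize-summary-range [460 470] 466) ;; => nil (past the last page)
```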

2. Extraction — Item Split

Extraction processes all pages using the existing chunked approach. Every line item retains its page-location: [start-page, end-page].

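After extraction, in extraction.clj: if summary-page-range is present in the classification result, extraction splits items before returning. Items whose page-location overlaps the summary range stay in line-items; all other items move to breakdown-items.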

If summary-page-range is null: All items remain in line-items, breakdown-items is absent. Current behavior, no change.

Overlap logic: An item overlaps if its page-location range intersects the summary range. For summary-page-range: [1, 1] and item page-location: [1, 1] → overlap. Item page-location: [2, 3] → no overlap.
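The overlap test and the resulting split can be sketched as follows. The function names are hypothetical, and the sketch assumes items are maps carrying :page-location as a [start-page end-page] vector; the real code in extraction.clj may be organized differently.

```clojure
(defn overlaps?
  "True when two 1-indexed inclusive page ranges share at least one page."
  [[a-start a-end] [b-start b-end]]
  (and (<= a-start b-end)
       (<= b-start a-end)))

(defn split-items
  "With a summary range, keeps overlapping items in :line-items and moves the
   rest to :breakdown-items. With nil, returns structured-data unchanged."
  [structured-data summary-page-range]
  (if (nil? summary-page-range)
    structured-data
    (let [in-summary? #(overlaps? (:page-location %) summary-page-range)
          items       (:line-items structured-data)]
      (assoc structured-data
             :summary-page-range summary-page-range
             :line-items         (filterv in-summary? items)
             :breakdown-items    (into [] (remove in-summary?) items)))))

;; Usage:
(def example-data
  {:line-items [{:description "Miete"   :page-location [1 1]}
                {:description "Project" :page-location [2 3]}]})

(split-items example-data [1 1])
;; => {:line-items         [{:description "Miete" :page-location [1 1]}]
;;     :breakdown-items    [{:description "Project" :page-location [2 3]}]
;;     :summary-page-range [1 1]}
```

An item that spans the boundary, such as page-location [1 2] against a summary range of [1 1], counts as overlapping and therefore stays in line-items.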

The orchestrator and post-processing see the same structured-data shape as today — line-items just contains fewer items when a summary range is active.

The chunking parameters from the 2026-03-16 implementation remain:

3. Post-Processing — No Code Changes

Post-processors operate on structured-data.line-items as they do today. With the item split, they receive only summary items:

No batching issues, no context window overflow, no cost explosion.

4. Deterministic Validation Warning

Add a large-document-summary-only check in validation.clj. When summary-page-range is present:

This uses the existing validation results mechanism that already renders in the UI.

5. Config Changes

Add :classification-large to the LLM config in config.edn, pointing to Sonnet. This is referenced by the orchestrator's llm-config map alongside the existing keys.

;; In :com.getorcha/llm:
:classification-large {:provider :anthropic
                       :api-key  #orcha/param "/v1-orcha/anthropic-api-key"
                       :model    "claude-sonnet-4-5-20250929"}
;; In the orchestrator's :llm-config:
:classification-large #ref [:com.getorcha/llm :classification-large]

6. Schema Changes

structured-data additions:

;; In the StructuredData Malli schema:
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]
[:breakdown-items    {:optional true} [:maybe [:vector LineItem]]]

breakdown-items has the same shape as line-items — same LineItem schema — but without the post-processing enrichment fields (no debit-account, credit-account, cost-center, etc.).

classification additions:

;; In the classification output:
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]

7. UI Changes

8. Pipeline Flow

Document arrives
    │
    ▼
Transcription (unchanged)
    │
    ▼
Classification (agent loop)
    ├─ ≤50 pages: classify only (1 iteration, 0 tool calls)
    └─ >50 pages: classify + summary detection (2-4 iterations with tools)
    │
    ▼
Extraction (all pages, chunked if needed)
    ├─ If summary-page-range set: split into line-items + breakdown-items
    └─ If null: all items in line-items (current behavior)
    │
    ▼
Post-processing (on line-items only)
    │
    ▼
Validation (adds summary-only warning if applicable)
    │
    ▼
Store structured-data with line-items, breakdown-items, summary-page-range

What This Does NOT Solve