Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Large Invoice Handling Design

Date: 2026-03-16

Problem: Invoices with 400+ pages / 2000+ line items fail extraction because the transcribed text exceeds the LLM's 200K token context limit. The UI also can't efficiently render thousands of line items.

Trigger case: Strabag invoice ATP1LDPE.pdf — 466 pages, 621K chars of transcribed text, 223K tokens total prompt (vs 200K limit).

Problem 1: Chunked Extraction

Current State

Extraction sends the entire transcribed text in a single LLM call. The prompt has ~68K tokens of overhead (instructions, schema, legal-entity context), leaving ~130K tokens for text. Documents exceeding this budget fail with an API error.

Design

Before calling the LLM, estimate whether the prompt will exceed the model's context limit. If it does, split the transcribed text by page ranges and run multiple extraction calls, then merge results.

Chunk sizing:

Prompt overhead: ~68K tokens (instructions + schema + legal-entity context).
Usable budget per chunk: (model_limit - overhead) × 0.80 (a 20% safety margin), i.e. (200K - 68K) × 0.80 ≈ 105K tokens.
At ~4 chars/token, this gives ~420K chars of text per chunk.
Split on page boundaries using the page markers already present in pdf-lib transcription output.
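The chunk-sizing arithmetic above can be written out as a short sketch. This is illustrative Python, not the Clojure in extraction.clj; the constant and function names are hypothetical, and the numbers are the approximate figures quoted in this document.

```python
import math

# Figures taken from this document; all are approximations.
MODEL_LIMIT_TOKENS = 200_000  # LLM context limit
OVERHEAD_TOKENS = 68_000      # instructions + schema + legal-entity context
SAFETY_MARGIN = 0.80          # use only 80% of the remaining budget
CHARS_PER_TOKEN = 4           # rough chars-per-token heuristic


def usable_tokens_per_chunk(limit=MODEL_LIMIT_TOKENS,
                            overhead=OVERHEAD_TOKENS,
                            margin=SAFETY_MARGIN):
    """Tokens available for transcribed text in one extraction call."""
    return round((limit - overhead) * margin)


def chunk_char_budget(chars_per_token=CHARS_PER_TOKEN):
    """Same budget expressed in characters of transcribed text."""
    return usable_tokens_per_chunk() * chars_per_token


# Rough chunk count for the trigger case (621K chars), ignoring overlap pages.
trigger_chunks = math.ceil(621_000 / chunk_char_budget())
```

With these inputs the budget is 105,600 tokens, or 422,400 characters, so the trigger invoice needs two chunks before overlap pages are counted.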

Overlap pages:

Include the last 3 pages of chunk K at the start of chunk K+1.
This handles line items that span a page boundary between chunks.
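A minimal sketch of page-boundary splitting with overlap, in illustrative Python rather than the project's Clojure. The page-marker format is not specified here, so the sketch assumes the transcription has already been split into a list of per-page strings; split_pages and its parameters are hypothetical names.

```python
def split_pages(pages, char_budget, overlap=3):
    """Return (start, end) page-index ranges, end exclusive.

    Each chunk after the first starts `overlap` pages before the end of
    the previous chunk. Every chunk is forced to contain at least one
    page the previous chunk did not cover, so the loop always advances,
    even if that pushes a chunk over budget for unusually long pages.
    """
    ranges = []
    start = 0
    prev_end = 0  # first page index not yet covered by any chunk
    n = len(pages)
    while prev_end < n:
        end = start
        size = 0
        # Take the overlap pages plus one new page unconditionally,
        # then keep adding pages while they fit in the budget.
        while end < n and (end <= prev_end
                           or size + len(pages[end]) <= char_budget):
            size += len(pages[end])
            end += 1
        ranges.append((start, end))
        prev_end = end
        start = max(end - overlap, 0)
    return ranges
```

Note that the overlap pages count against the next chunk's budget, so each chunk after the first contributes fewer new pages than the first one does.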

Merge logic:

Collect line items from all chunks.
For items from overlap pages, deduplicate by (page-location, description, amount). Keep the version from the chunk where the item is not at a boundary, since the copy at a chunk edge may be truncated.
Header-level fields (invoice-number, issuer, total, dates, etc.) come from the first chunk only — they're on page 1.
Sort final line items by page-location.
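The merge rules above could look roughly like the following sketch (illustrative Python, not the project's Clojure). The result shape, the field names and merge_chunks itself are assumptions made for the example. "Not at a boundary" is interpreted here as "farther from the edge of the chunk's page range", which is one possible reading of the rule. An occurrence counter is added to the key so that identical rows repeated within one chunk are not collapsed.

```python
from collections import Counter


def merge_chunks(chunk_results):
    """Merge per-chunk extraction results into one invoice.

    Assumed shape of each result (hypothetical):
      {"pages": (start, end),   # page-index range, end exclusive
       "header": {...},         # header-level fields
       "items": [{"page_location": int, "description": str,
                  "amount": str, ...}, ...]}
    """
    best = {}  # dedup key -> (distance from chunk edge, item)
    for result in chunk_results:
        start, end = result["pages"]
        seen = Counter()  # occurrences of each base key in this chunk
        for item in result["items"]:
            base = (item["page_location"], item["description"],
                    item["amount"])
            key = base + (seen[base],)
            seen[base] += 1
            page = item["page_location"]
            edge_distance = min(page - start, end - 1 - page)
            # Prefer the copy that sits deeper inside its chunk.
            if key not in best or edge_distance > best[key][0]:
                best[key] = (edge_distance, item)
    items = sorted((item for _, item in best.values()),
                   key=lambda item: item["page_location"])
    # Header-level fields come from the first chunk only.
    header = chunk_results[0]["header"] if chunk_results else {}
    return {"header": header, "items": items}
```

On a tie in edge distance the sketch keeps the copy from the earlier chunk.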

Where this fits:

Chunking decision happens inside the structured-data "invoice" method in extraction.clj.
If estimated tokens ≤ limit: single call (current behavior, unchanged).
If estimated tokens > limit: chunk → extract each → merge.
Token estimation uses chars/4 + measured prompt overhead. No tokenizer library needed.
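As a worked example of the estimate, here is a minimal sketch in illustrative Python (not the Clojure in extraction.clj). The function names are hypothetical; the constants are the approximate figures from this document.

```python
MODEL_LIMIT_TOKENS = 200_000  # context limit quoted in this document
OVERHEAD_TOKENS = 68_000      # measured prompt overhead (approximate)


def estimate_prompt_tokens(text, overhead=OVERHEAD_TOKENS):
    """chars/4 heuristic plus the measured prompt overhead."""
    return len(text) // 4 + overhead


def needs_chunking(text, limit=MODEL_LIMIT_TOKENS):
    """True when the single-call prompt is estimated to exceed the limit."""
    return estimate_prompt_tokens(text) > limit


# Stand-in for the trigger case: 621K chars of transcribed text.
trigger_text = "x" * 621_000
trigger_estimate = estimate_prompt_tokens(trigger_text)
```

For the 621K-char stand-in the estimate is 223,250 tokens, in line with the 223K total reported for the Strabag invoice, so the chunked path would be taken.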

Post-processing already handles batching: the accounts and cost-center matchers already chunk at 25 items per batch, so 2000 line items (80 batches) will work through existing infrastructure.

Problem 2: UI Rendering Performance

Current State

Both line-items-table and enhanced-line-items-table render all items as HTML in a single server response. With 2000+ items (especially enhanced cards with toggles/badges), the DOM becomes heavy.

Design

Apply content-visibility: auto CSS to line item containers.

Each line item row/card gets content-visibility: auto and a contain-intrinsic-size close to its rendered height, so the scrollbar stays stable while off-screen items are skipped.
The browser skips layout and paint for off-screen items.
No JavaScript changes, no new endpoints, no HTMX modifications.
All HTML is still in the DOM — initial payload is ~1-2MB for 2000 items, acceptable.
Works in Chrome 85+, Edge 85+, Firefox 125+, Safari 18+.

If this turns out to be insufficient, the upgrade path is HTMX lazy-loaded batches with hx-trigger="revealed".