Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
Date: 2026-03-16
Problem: Invoices with 400+ pages / 2000+ line items fail extraction because the transcribed text exceeds the LLM's 200K token context limit. The UI also can't efficiently render thousands of line items.
Trigger case: Strabag invoice ATP1LDPE.pdf, with 466 pages, 621K chars of transcribed text, and a 223K-token total prompt (vs. the 200K limit).
Extraction sends the entire transcribed text in a single LLM call. The prompt has ~68K tokens of overhead (instructions, schema, legal-entity context), leaving ~130K tokens for text. Documents exceeding this budget fail with an API error.
Before calling the LLM, estimate whether the prompt will exceed the model's context limit. If it does, split the transcribed text by page ranges and run multiple extraction calls, then merge results.
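The pre-flight check and page-range split described above can be sketched as follows. This is an illustrative Python sketch, not the code in extraction.clj. The 4-characters-per-token ratio is an assumption inferred from the trigger case (621K chars for roughly 155K text tokens), and the function names, the one-page overlap default, and the greedy packing strategy are all hypothetical.

```python
from typing import List, Tuple

CONTEXT_LIMIT = 200_000    # model context limit, from the problem statement
PROMPT_OVERHEAD = 68_000   # instructions, schema, legal-entity context
CHARS_PER_TOKEN = 4        # assumption: rough ratio inferred from the trigger case


def estimate_tokens(text: str) -> int:
    """Cheap character-based estimate; a real tokenizer would be more accurate."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def needs_chunking(pages: List[str]) -> bool:
    """True if overhead plus all page text would exceed the context limit."""
    total = PROMPT_OVERHEAD + sum(estimate_tokens(p) for p in pages)
    return total > CONTEXT_LIMIT


def page_ranges(
    pages: List[str],
    overlap: int = 1,  # hypothetical default
    budget: int = CONTEXT_LIMIT - PROMPT_OVERHEAD,
) -> List[Tuple[int, int]]:
    """Greedily pack pages into half-open [start, end) ranges within `budget`.

    Every range after the first begins `overlap` pages before the previous
    range ended, so items spanning a page break appear in both chunks.
    """
    costs = [estimate_tokens(p) for p in pages]
    ranges: List[Tuple[int, int]] = []
    start, n = 0, len(pages)
    while start < n:
        used, end = 0, start
        # Always take at least one page, even if it alone exceeds the budget.
        while end < n and (end == start or used + costs[end] <= budget):
            used += costs[end]
            end += 1
        ranges.append((start, end))
        if end >= n:
            break
        # Step back by the overlap, but always move forward by at least one page.
        start = max(end - overlap, start + 1)
    return ranges
```

With 466 pages of about 1,333 characters each (roughly the trigger case), needs_chunking returns True and page_ranges yields two overlapping ranges. The results of the per-range extraction calls would still need to be merged.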
Chunk sizing:
Overlap pages:
Merge logic:
Where this fits:
structured-data "invoice" method in extraction.clj.
Post-processing already handles batching: The accounts and cost-center matchers already chunk at 25 items per batch, so 2000 line items will work through existing infrastructure.
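To make the batching arithmetic concrete, here is a minimal Python sketch of fixed-size batching. It is not the matchers' actual code. Only the batch size of 25 comes from this document; the helper name and the shape of the line items are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

BATCH_SIZE = 25  # batch size stated for the accounts and cost-center matchers


def batches(items: Sequence[T], size: int = BATCH_SIZE) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield list(items[i:i + size])


# Hypothetical line items; only the count matters here.
line_items = [{"position": n} for n in range(1, 2001)]
batch_list = list(batches(line_items))
print(len(batch_list))  # 80
```

A 2000-item invoice therefore becomes 80 matcher batches rather than one oversized call.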
Both line-items-table and enhanced-line-items-table render all items as HTML in a single server response. With 2000+ items (especially enhanced cards with toggles/badges), the DOM becomes heavy.
Apply content-visibility: auto CSS to line item containers.
content-visibility: auto and an appropriate contain-intrinsic-size.
If this turns out to be insufficient, the upgrade path is HTMX lazy-loaded batches with hx-trigger="revealed".
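The lazy-loading upgrade path could look roughly like the sketch below: the server renders one batch of items plus a sentinel element that fetches the next batch when it scrolls into view. This is a Python illustration only; the real rendering happens elsewhere, and the URL, the batch size of 100, the markup, and the function name are assumptions. The hx-get, hx-trigger="revealed", and hx-swap attributes are standard HTMX.

```python
from html import escape
from typing import Dict, List

BATCH_SIZE = 100  # assumption: not specified in this document


def render_batch(
    items: List[Dict[str, str]],
    offset: int,
    base_url: str = "/invoices/123/line-items",  # hypothetical endpoint
) -> str:
    """Render one batch of line items, plus a sentinel if more items remain."""
    batch = items[offset:offset + BATCH_SIZE]
    rows = [
        f'<div class="line-item">{escape(item["description"])}</div>'
        for item in batch
    ]
    next_offset = offset + len(batch)
    if next_offset < len(items):
        # When the sentinel becomes visible, HTMX requests the next batch and
        # replaces the sentinel with the response (more rows, next sentinel).
        rows.append(
            f'<div hx-get="{escape(base_url)}?offset={next_offset}" '
            f'hx-trigger="revealed" hx-swap="outerHTML"></div>'
        )
    return "\n".join(rows)
```

The initial page would include render_batch(items, 0). Each later response is the same fragment for the next offset, so the DOM grows only as the user scrolls.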