Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Large Invoice Handling Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Handle invoices that exceed the LLM context window by chunking extraction, and improve UI rendering performance for thousands of line items.

Architecture: Split oversized transcriptions at page boundaries with overlap, extract each chunk independently, and merge the results with deduplication. For UI performance, apply CSS content-visibility: auto so the browser skips rendering of off-screen line items.

Tech Stack: Clojure, Anthropic Claude API, HTMX/Hiccup, CSS

Design doc: docs/plans/2026-03-16-large-invoice-handling-design.md

Task 1: Text splitting by page boundaries

Files:

Modify: src/com/getorcha/workers/ap/ingestion/extraction.clj
Test: test/com/getorcha/workers/ap/ingestion/extraction_test.clj

This task adds a pure function that splits transcription text on === PAGE N === markers into page-indexed segments, then groups them into chunks that fit within a character budget, with configurable overlap.
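
As a small illustration of the marker-based split described above, the following sketch shows how a lookahead regex keeps each marker attached to its page. The sample text and the names sample, segments and page-numbers are made up for this example; parse-long assumes Clojure 1.11 or later.

```clojure
(require '[clojure.string :as string])

;; Hypothetical two-page transcription.
(def sample
  (str "=== PAGE 1 ===\nfirst\n"
       "=== PAGE 2 ===\nsecond\n"))

;; The lookahead (?=...) matches at each marker without consuming it,
;; so every segment keeps its own "=== PAGE N ===" header.
(def segments
  (string/split sample #"(?==== PAGE \d+ ===)"))

;; Parse the page number out of each segment's marker.
(def page-numbers
  (mapv #(parse-long (second (re-find #"=== PAGE (\d+) ===" %)))
        segments))
```

On JDK 8 and later, a zero-width match at position 0 produces no empty leading segment, so when the text starts with a marker the first element of segments is already page 1.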

Step 1: Write the failing test

Add to extraction_test.clj:

(deftest test-split-text-into-page-chunks
  (let [text (str "=== PAGE 1 ===\nPage 1 content\n"
                   "=== PAGE 2 ===\nPage 2 content\n"
                   "=== PAGE 3 ===\nPage 3 content\n"
                   "=== PAGE 4 ===\nPage 4 content\n"
                   "=== PAGE 5 ===\nPage 5 content\n")]

    (testing "Returns single chunk when text fits budget"
      (let [chunks (workers.extraction/split-text-into-page-chunks text 10000 3)]
        (is (= 1 (count chunks)))
        (is (= 1 (:start-page (first chunks))))
        (is (= 5 (:end-page (first chunks))))
        (is (= text (:text (first chunks))))))

    (testing "Splits into multiple chunks with overlap when text exceeds budget"
      ;; Each page is 30 chars, so a budget of 70 fits two pages per chunk,
      ;; which forces splitting while leaving room for a one-page overlap
      (let [chunks (workers.extraction/split-text-into-page-chunks text 70 1)]
        (is (> (count chunks) 1))
        ;; First chunk starts at page 1
        (is (= 1 (:start-page (first chunks))))
        ;; Last chunk ends at page 5
        (is (= 5 (:end-page (last chunks))))
        ;; Overlap: second chunk's start-page <= first chunk's end-page
        (when (> (count chunks) 1)
          (is (<= (:start-page (second chunks))
                  (:end-page (first chunks)))))))

    (testing "Handles single page"
      (let [chunks (workers.extraction/split-text-into-page-chunks "=== PAGE 1 ===\nContent\n" 10000 1)]
        (is (= 1 (count chunks)))
        (is (= 1 (:start-page (first chunks))))
        (is (= 1 (:end-page (first chunks))))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Expected: FAIL, because split-text-into-page-chunks does not exist yet.

Step 3: Write implementation

Add to extraction.clj, before the structured-data multimethod:

(defn split-text-into-page-chunks
  "Splits transcription text into chunks that fit within a character budget.

   Text is split on `=== PAGE N ===` markers. Pages are grouped into chunks
   such that each chunk's text length stays within `char-budget`. Adjacent
   chunks share `overlap-pages` pages to handle line items spanning boundaries.

   Returns a vector of maps:
     {:text       \"...\"    ;; The text for this chunk
      :start-page N         ;; First page number in this chunk
      :end-page   M}        ;; Last page number in this chunk"
  [text char-budget overlap-pages]
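  (let [;; Split before each page marker so the marker stays with its content.
        ;; Java's split emits no empty leading segment for a zero-width match
        ;; at position 0, so dropping the first element would discard page 1.
        ;; Filter on the marker instead; this also skips any preamble text.
        pages     (filterv #(re-find #"^=== PAGE \d+ ===" %)
                           (string/split text #"(?==== PAGE \d+ ===)"))
        page-nums (mapv (fn [page-text]
                          (let [[_ n] (re-find #"=== PAGE (\d+) ===" page-text)]
                            (parse-long n)))
                        pages)
        n-pages   (count pages)]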
    (if (<= (count text) char-budget)
      ;; Fits in one chunk
      [{:text       text
        :start-page (first page-nums)
        :end-page   (last page-nums)}]
      ;; Build chunks greedily
      (loop [i      0
             chunks []]
        (if (>= i n-pages)
          chunks
          (let [;; Find how many pages fit in this chunk
                end-i (loop [j i]
                        (if (>= j n-pages)
                          n-pages
                          (let [chunk-text (apply str (subvec (vec pages) i (inc j)))]
                            (if (> (count chunk-text) char-budget)
                              (max (inc i) j) ;; At least one page per chunk
                              (recur (inc j))))))
                chunk-text  (apply str (subvec (vec pages) i end-i))
                start-page  (nth page-nums i)
                end-page    (nth page-nums (dec end-i))
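                chunks'     (conj chunks {:text       chunk-text
                                          :start-page start-page
                                          :end-page   end-page})]
            (if (>= end-i n-pages)
              ;; The final page is covered. Stepping back for overlap here
              ;; would only emit chunks already contained in this one.
              chunks'
              ;; Next chunk starts overlap-pages back from end, but always
              ;; advances by at least one page.
              (recur (max (inc i) (- end-i overlap-pages)) chunks'))))))))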
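
The greedy grouping with overlap can be illustrated on page lengths alone. This is a minimal sketch, not the plan's implementation: plan-chunks is a hypothetical helper that returns half-open [start end) page index pairs, and it assumes grouping should stop as soon as a chunk reaches the last page.

```clojure
(defn plan-chunks
  "Hypothetical helper: given a vector of page lengths, returns [start end)
   index pairs so that each chunk stays within char-budget where possible
   and consecutive chunks share overlap-pages pages."
  [page-lengths char-budget overlap-pages]
  (let [n (count page-lengths)]
    (loop [i 0, plan []]
      (if (>= i n)
        plan
        (let [;; Extend the chunk while the next page still fits.
              ;; A chunk always holds at least one page, even an oversized one.
              end   (loop [j (inc i), size (nth page-lengths i)]
                      (if (and (< j n)
                               (<= (+ size (nth page-lengths j)) char-budget))
                        (recur (inc j) (+ size (nth page-lengths j)))
                        j))
              plan' (conj plan [i end])]
          (if (>= end n)
            ;; Last page reached, stop.
            plan'
            ;; Step back by the overlap, but always advance by one page.
            (recur (max (inc i) (- end overlap-pages)) plan')))))))

;; Five pages of 30 chars, budget 70, one page of overlap.
(plan-chunks [30 30 30 30 30] 70 1)
;; => [[0 2] [1 3] [2 4] [3 5]]
```

With a budget of 40 only one page fits per chunk, so the same call yields [[0 1] [1 2] [2 3] [3 4] [4 5]] and no pages are shared: overlap is only possible when a chunk can hold more pages than the overlap.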

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Expected: All tests pass.

Step 5: Commit

git add src/com/getorcha/workers/ap/ingestion/extraction.clj test/com/getorcha/workers/ap/ingestion/extraction_test.clj
git commit -m "feat: add split-text-into-page-chunks for large invoice extraction"

Task 2: Merge chunked extraction results

Files:

Modify: src/com/getorcha/workers/ap/ingestion/extraction.clj
Test: test/com/getorcha/workers/ap/ingestion/extraction_test.clj

Add a function that merges extraction results from multiple chunks: takes header fields from chunk 1, collects all line items, deduplicates items from overlap pages.
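
Sorting merged items by :page-location relies on how Clojure compares vectors. The sketch below uses made-up items with two-element page locations, as in the tests, to show the resulting order and two edge cases worth knowing about.

```clojure
;; Hypothetical items, deliberately out of order.
(def items
  [{:description "Item C" :page-location [4 4]}
   {:description "Item A" :page-location [1 1]}
   {:description "Item D" :page-location [10 10]}
   {:description "Item B" :page-location [2 3]}])

;; Vectors of equal length compare element by element, numerically,
;; so page 10 sorts after page 4 (unlike a string comparison).
(def sorted-descriptions
  (mapv :description (sort-by :page-location items)))
;; => ["Item A" "Item B" "Item C" "Item D"]

;; Edge cases: a shorter vector sorts before a longer one regardless of
;; its contents, and nil sorts before any vector.
(def shorter-first? (neg? (compare [9] [1 1])))
(def nil-first?     (neg? (compare nil [1 1])))
```

If the model ever returns a :page-location of a different shape or omits it, those items end up at the front of the list rather than causing an error, which may be worth checking during the integration run.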

Step 1: Write the failing test

(deftest test-merge-chunked-extractions
  (let [chunk-1 {:data  {:invoice-number "INV-001"
                          :issuer         {:name "Acme"}
                          :total          1000.0
                          :line-items     [{:description "Item A" :amount 100.0 :page-location [1 1]}
                                           {:description "Item B" :amount 200.0 :page-location [2 3]}]}
                  :stats {:input-tokens 5000 :output-tokens 1000 :model "claude"
                          :started-at (java.time.Instant/parse "2026-01-01T00:00:00Z")
                          :ended-at (java.time.Instant/parse "2026-01-01T00:00:10Z")}}
        chunk-2 {:data  {:invoice-number "INV-001"
                          :issuer         {:name "Acme"}
                          :total          1000.0
                          ;; Overlap: Item B appears again from overlap pages
                          :line-items     [{:description "Item B" :amount 200.0 :page-location [2 3]}
                                           {:description "Item C" :amount 300.0 :page-location [4 4]}]}
                  :stats {:input-tokens 4000 :output-tokens 800 :model "claude"
                          :started-at (java.time.Instant/parse "2026-01-01T00:00:11Z")
                          :ended-at (java.time.Instant/parse "2026-01-01T00:00:20Z")}}]

    (testing "Merges line items from multiple chunks"
      (let [merged (workers.extraction/merge-chunked-extractions [chunk-1 chunk-2])]
        ;; Header fields come from first chunk
        (is (= "INV-001" (get-in merged [:data :invoice-number])))
        (is (= {:name "Acme"} (get-in merged [:data :issuer])))
        (is (= 1000.0 (get-in merged [:data :total])))
        ;; Deduplication: Item B appears only once
        (is (= 3 (count (get-in merged [:data :line-items]))))
        (is (= ["Item A" "Item B" "Item C"]
               (mapv :description (get-in merged [:data :line-items]))))
        ;; Line items sorted by page-location
        (is (= [[1 1] [2 3] [4 4]]
               (mapv :page-location (get-in merged [:data :line-items]))))
        ;; Stats aggregated
        (is (= 9000 (get-in merged [:stats :input-tokens])))
        (is (= 1800 (get-in merged [:stats :output-tokens])))
        (is (= 2 (get-in merged [:stats :chunks])))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Step 3: Write implementation

Add to extraction.clj:

(defn merge-chunked-extractions
  "Merges extraction results from multiple chunks into a single result.

   - Header fields (invoice-number, issuer, total, etc.) come from the first chunk.
   - Line items are collected from all chunks and deduplicated by
     (page-location, description, amount) tuple.
   - Final line items are sorted by page-location.
   - Stats are aggregated (summed tokens, first chunk's start, last chunk's end;
     chunks run sequentially, so these are the earliest start and latest end)."
  [chunk-results]
  (if (= 1 (count chunk-results))
    (first chunk-results)
    (let [first-data    (:data (first chunk-results))
          all-items     (mapcat #(get-in % [:data :line-items]) chunk-results)
          ;; Deduplicate by (page-location, description, amount)
          deduped-items (->> all-items
                             (reduce (fn [seen item]
                                       (let [k [(:page-location item)
                                                (:description item)
                                                (:amount item)]]
                                         (if (contains? seen k)
                                           seen
                                           (assoc seen k item))))
                                     (linked/map))
                             vals
                             (sort-by :page-location)
                             vec)
          all-stats     (mapv :stats chunk-results)]
      {:data  (assoc first-data :line-items deduped-items)
       :stats {:input-tokens  (reduce + (keep :input-tokens all-stats))
               :output-tokens (reduce + (keep :output-tokens all-stats))
               :model         (:model (first all-stats))
               :started-at    (:started-at (first all-stats))
               :ended-at      (:ended-at (last all-stats))
               :chunks        (count chunk-results)}})))

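Note: This uses linked/map (from linked.core, required as linked) as an insertion-ordered map so that items keep their first-seen order during dedup. Because sort-by is stable, that order only decides ties between items with the same page-location. Check whether this dependency exists; if not, use a vector-based dedup approach instead: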

;; Alternative without linked dependency:
deduped-items (->> all-items
                   (reduce (fn [[seen items] item]
                             (let [k [(:page-location item)
                                      (:description item)
                                      (:amount item)]]
                               (if (contains? seen k)
                                 [seen items]
                                 [(conj seen k) (conj items item)])))
                           [#{} []])
                   second
                   (sort-by :page-location)
                   vec)

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Step 5: Commit

git add src/com/getorcha/workers/ap/ingestion/extraction.clj test/com/getorcha/workers/ap/ingestion/extraction_test.clj
git commit -m "feat: add merge-chunked-extractions for combining chunk results"

Task 3: Wire chunked extraction into the invoice extraction method

Files:

Modify: src/com/getorcha/workers/ap/ingestion/extraction.clj
Test: test/com/getorcha/workers/ap/ingestion/extraction_test.clj

Modify (defmethod structured-data "invoice" ...) to estimate token count and use chunked extraction when the text exceeds the model's context limit.

Step 1: Write the failing test

(deftest test-structured-data-chunked-extraction
  (testing "Chunks extraction when text exceeds token budget"
    ;; Build a large text that would exceed a small token limit
    (let [;; 10 pages of content
          large-text   (apply str (for [i (range 1 11)]
                                    (str "=== PAGE " i " ===\n"
                                         (apply str (repeat 200 "x")) "\n")))
          call-count   (atom 0)
          llm-spy      (fn [_config _prompt]
                         (swap! call-count inc)
                         {:text          (json/generate-string
                                          (assoc sample-extraction-json
                                                 :line-items [{:description (str "Item from call " @call-count)
                                                               :amount 100.0
                                                               :article-code nil
                                                               :quantity nil
                                                               :unit nil
                                                               :unit-price nil
                                                               :price-per nil
                                                               :discount nil
                                                               :discount-type nil
                                                               :tax-rate nil
                                                               :page-location [1 1]}]))
                          :input-tokens  500
                          :output-tokens 150
                          :model         "test-model"
                          :raw-response  {}})
          context      {:db-pool    fixtures/*db*
                        :llm-config (test-llm-config)}
          ingestion    {:transcription-result {:text large-text}
                        :classification       {:document-type "invoice"}}]
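      ;; Override the context limit to force chunking. 2000 is below
      ;; prompt-overhead-tokens (68000), so the computed budget is negative
      ;; and the splitter falls back to one page per chunk.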
      (with-redefs [llm/generate                          llm-spy
                    workers.extraction/model-context-limit 2000
                    workers.extraction/chars-per-token     4]
        (let [{:keys [data stats]} (workers.extraction/structured-data context ingestion)]
          ;; Multiple LLM calls should have been made
          (is (> @call-count 1))
          ;; Result should have merged data
          (is (some? (:line-items data)))
          ;; Stats should show chunk count
          (is (= @call-count (:chunks stats))))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Step 3: Write implementation

Modify extraction.clj. Add config constants and update the structured-data "invoice" method:

;; Near top of file, after requires
(def model-context-limit
  "Maximum input tokens for the extraction model (Anthropic Claude)."
  200000)

(def chars-per-token
  "Approximate characters per token for budget estimation."
  4)

(def ^:private prompt-overhead-tokens
  "Estimated token overhead for the extraction prompt (instructions + schema + legal entity).
   Measured from real prompts: ~68K tokens."
  68000)

(def ^:private overlap-pages
  "Number of overlap pages between chunks to handle line items at boundaries."
  3)

(def ^:private safety-margin
  "Safety margin applied to token budget to avoid hitting limits."
  0.80)
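
As a worked check of the budget these constants imply, the arithmetic can be run at the REPL. The values below are copied from the constants above; the 621418-character length is the transcription size quoted in the Task 6 log example.

```clojure
;; Values copied from the plan's constants.
(def model-context-limit 200000)
(def prompt-overhead-tokens 68000)
(def safety-margin 0.80)
(def chars-per-token 4)

;; Tokens left for transcription text after the prompt overhead,
;; scaled down by the safety margin: (200000 - 68000) * 0.8,
;; roughly 105600 tokens.
(def budget-tokens
  (* (- model-context-limit prompt-overhead-tokens) safety-margin))

;; The same budget expressed in characters.
(def char-budget
  (long (* budget-tokens chars-per-token)))
;; => 422400

;; The large invoice from Task 6 exceeds the budget, so it is chunked.
(def needs-chunking? (> 621418 char-budget))
;; => true
```

Since 621418 characters is more than one budget but less than two, that invoice needs at least two chunks before overlap is taken into account.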

Replace the structured-data "invoice" method:

(defn ^:private extract-chunk
  "Run extraction on a single text chunk. Returns {:data ... :stats ...}."
  [db-pool extraction-cfg legal-entity-id email-context text]
  (let [started-at     (java.time.Instant/now)
        prompt         (ai.prompts/legal-entity-prompt db-pool legal-entity-id :extraction
                                                       {:text text :email-context email-context})
        llm-generation (llm/generate extraction-cfg prompt)
        ended-at       (java.time.Instant/now)]
    {:data  (-> (:text llm-generation)
                llm/parse-json-response
                cleanup-spurious-discount
                normalize-issuer-iban)
     :stats (-> llm-generation
                (dissoc :text)
                (assoc :started-at started-at
                       :ended-at ended-at))}))


(defmethod structured-data "invoice"
  [{:keys [db-pool llm-config] :as _context}
   {{:keys [text]} :transcription-result :keys [document] :as _ingestion}]
  (let [extraction-cfg  (:extraction llm-config)
        legal-entity-id (:document/legal-entity-id document)
        email-context   (if-let [t (not-empty (get-in document [:document/source-metadata :email-body-text]))]
                          (str "EMAIL CONTEXT (supplementary — the invoice/document is the source of truth for all extracted values):\n" t)
                          "")
        ;; Estimate whether text fits in a single call
        text-tokens     (/ (count text) chars-per-token)
        budget-tokens   (* (- model-context-limit prompt-overhead-tokens) safety-margin)
        char-budget     (long (* budget-tokens chars-per-token))]
    (if (<= text-tokens budget-tokens)
      ;; Single extraction call (current behavior)
      (extract-chunk db-pool extraction-cfg legal-entity-id email-context text)
      ;; Chunked extraction
      (let [chunks       (split-text-into-page-chunks text char-budget overlap-pages)
            _            (log/info "Chunked extraction" {:chunks (count chunks)
                                                         :text-length (count text)
                                                         :char-budget char-budget})
            results      (mapv (fn [{:keys [text] :as _chunk}]
                                 (extract-chunk db-pool extraction-cfg legal-entity-id email-context text))
                               chunks)]
        (merge-chunked-extractions results)))))

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Step 5: Commit

git add src/com/getorcha/workers/ap/ingestion/extraction.clj test/com/getorcha/workers/ap/ingestion/extraction_test.clj
git commit -m "feat: wire chunked extraction into invoice extraction method"

Task 4: Chunked extraction prompt adjustment

Files:

Modify: src/com/getorcha/workers/ap/ingestion/extraction.clj
Test: test/com/getorcha/workers/ap/ingestion/extraction_test.clj

For non-first chunks, the extraction prompt needs a small addition telling the LLM that this is a continuation — it should extract line items only (not header fields which were already extracted from chunk 1). This avoids conflicting header extractions from later chunks that may not contain the header page.

Step 1: Write the failing test

(deftest test-chunked-extraction-continuation-prompt
  (testing "Non-first chunks get continuation instructions appended to text"
    (let [prompts (atom [])
          llm-spy (fn [_config prompt]
                    (swap! prompts conj prompt)
                    {:text          (json/generate-string sample-extraction-json)
                     :input-tokens  500
                     :output-tokens 150
                     :model         "test-model"
                     :raw-response  {}})
          text    (apply str (for [i (range 1 11)]
                               (str "=== PAGE " i " ===\n"
                                    (apply str (repeat 200 "x")) "\n")))
          context {:db-pool    fixtures/*db*
                   :llm-config (test-llm-config)}]
      (with-redefs [llm/generate                          llm-spy
                    workers.extraction/model-context-limit 2000
                    workers.extraction/chars-per-token     4]
        (workers.extraction/structured-data context {:transcription-result {:text text}
                                                     :classification       {:document-type "invoice"}})
        ;; First prompt should NOT contain continuation marker
        (is (not (clojure.string/includes? (first @prompts) "CONTINUATION CHUNK")))
        ;; Subsequent prompts SHOULD contain continuation marker
        (doseq [p (rest @prompts)]
          (is (clojure.string/includes? p "CONTINUATION CHUNK")))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Step 3: Write implementation

Modify extract-chunk to accept a continuation? flag, and update the caller in structured-data "invoice":

(def ^:private continuation-instruction
  "\n\nCONTINUATION CHUNK: This is a continuation of a large invoice that was split into parts.
You are seeing pages ${start-page} through ${end-page}.
- Extract ONLY line items from this text.
- For header fields (invoice-number, issuer, recipient, totals, dates, etc.), return null values.
- Extract ALL line items you see, including any that overlap with previous chunks.")

(defn ^:private extract-chunk
  "Run extraction on a single text chunk. Returns {:data ... :stats ...}.
   When `chunk-info` is provided (for non-first chunks), appends continuation
   instructions to the text."
  [db-pool extraction-cfg legal-entity-id email-context text chunk-info]
  (let [started-at     (java.time.Instant/now)
        text'          (if chunk-info
                         (str text
                              (-> continuation-instruction
                                  (string/replace "${start-page}" (str (:start-page chunk-info)))
                                  (string/replace "${end-page}" (str (:end-page chunk-info)))))
                         text)
        prompt         (ai.prompts/legal-entity-prompt db-pool legal-entity-id :extraction
                                                       {:text text' :email-context email-context})
        llm-generation (llm/generate extraction-cfg prompt)
        ended-at       (java.time.Instant/now)]
    {:data  (-> (:text llm-generation)
                llm/parse-json-response
                cleanup-spurious-discount
                normalize-issuer-iban)
     :stats (-> llm-generation
                (dissoc :text)
                (assoc :started-at started-at
                       :ended-at ended-at))}))

Update the single-call path and chunked path in structured-data "invoice":

;; Single call:
(extract-chunk db-pool extraction-cfg legal-entity-id email-context text nil)

;; Chunked:
(let [results (map-indexed
                (fn [idx {:keys [text start-page end-page]}]
                  (extract-chunk db-pool extraction-cfg legal-entity-id email-context text
                                 (when (pos? idx)
                                   {:start-page start-page :end-page end-page})))
                chunks)]
  (merge-chunked-extractions (vec results)))

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran)"

Step 5: Commit

git add src/com/getorcha/workers/ap/ingestion/extraction.clj test/com/getorcha/workers/ap/ingestion/extraction_test.clj
git commit -m "feat: add continuation prompt for non-first extraction chunks"

Task 5: CSS content-visibility for line items

Files:

Modify: resources/app/public/css/style.css

Add content-visibility: auto to line item rows and cards so the browser skips layout/paint for off-screen items.

Step 1: Add CSS rules

In resources/app/public/css/style.css, add after the existing .line-items-table tbody tr:hover rule:

.line-items-table tbody tr {
  content-visibility: auto;
  contain-intrinsic-size: auto 45px;
}

And after the existing .line-item-card rule:

.line-item-card {
  content-visibility: auto;
  contain-intrinsic-size: auto 120px;
}

Step 2: Verify visually

Start the dev server, navigate to an invoice with line items, and verify:

Line items still render correctly
Scrolling is smooth
No layout jumps (adjust contain-intrinsic-size values if needed)
The table row rule actually takes effect: content-visibility applies only to elements that support size containment, and the CSS Containment spec excludes internal table boxes such as table rows. Check in DevTools, and if the tr rule is ignored, rely on the card layout or restructure the rows

Step 3: Commit

git add resources/app/public/css/style.css
git commit -m "feat: add content-visibility for line items rendering performance"

Task 6: Integration test with real Strabag invoice

Files:

No new files; uses existing test document in local DB/S3

Re-ingest the Strabag invoice (019cf832-1890-703f-9ab0-8a6b814cdebb) locally to verify end-to-end:

Step 1: Re-queue the document for ingestion

In the REPL:

(require '[com.getorcha.erp.ingestion :as erp.ingestion])
(erp.ingestion/requeue-document! (repl/db-pool) (repl/aws) #uuid "019cf832-1890-703f-9ab0-8a6b814cdebb")

Step 2: Monitor logs for chunked extraction

Watch for log lines like:

Chunked extraction {:chunks N :text-length 621418 :char-budget ...}
Multiple "Calling Anthropic Claude" log entries
No "prompt is too long" errors

Step 3: Verify results

psql -h localhost -U postgres -d orcha -c "
  SELECT status, error_type, error_message,
         extraction_input_tokens, extraction_output_tokens
  FROM ap_ingestion
  WHERE document_id = '019cf832-1890-703f-9ab0-8a6b814cdebb'
  ORDER BY created_at DESC LIMIT 1" -x

Check that:

status = completed (not failed)
The line item count is reasonable (close to the expected ~2000):

psql -h localhost -U postgres -d orcha -c "
  SELECT jsonb_array_length(structured_data->'line-items') as line_item_count
  FROM document
  WHERE id = '019cf832-1890-703f-9ab0-8a6b814cdebb'" -x

Step 4: View in UI

Open the invoice in the browser and verify:

Line items render without browser hang
Scrolling is smooth (content-visibility working)
All line items appear correct

Step 5: Commit any adjustments

If any tuning was needed (overlap pages, budget, CSS sizes), commit those changes.

Task 7: Lint and final cleanup

Step 1: Run linter

clj-kondo --lint src test dev

Fix any issues.

Step 2: Run full test suite

clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"

Step 3: Final commit if needed

git add <any-fixed-files>
git commit -m "fix: lint and cleanup for large invoice handling"