Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Large Invoice Summary Extraction Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Large invoices (>50 pages) get their summary page range detected by an agentic classifier, and extraction splits items into line-items (summary, post-processed) and breakdown-items (detail, stored as-is for CSV download).

Architecture: Classification becomes an agent loop (LangChain4j) for all documents. For >50 pages, the agent gets tools to probe the transcription and a prompt augmentation to detect summary page ranges. Extraction splits items based on the detected range. Post-processing runs on summary items only. A validation warning flags these documents for manual review.

Tech Stack: Clojure, LangChain4j (agent loop), Malli (schemas), Integrant (config), clojure.test

Design doc: docs/plans/2026-03-17-large-invoice-summary-extraction-design.md


Task 1: Config — Add :classification-large

Files:

Step 1: Add the new LLM config key

In :com.getorcha/llm (line ~62), add after :main:

:classification-large {:provider :anthropic
                       :api-key  #orcha/param "/v1-orcha/anthropic-api-key"
                       :model    "claude-sonnet-4-5-20250929"}

In the orchestrator's :llm-config (line ~233), add:

:classification-large #ref [:com.getorcha/llm :classification-large]

Step 2: Verify config loads

Start the REPL, (reset), then:

(get-in (:com.getorcha.workers.ap.ingestion/orchestrator integrant.repl.state/config)
        [:llm-config :classification-large :model])
;; => "claude-sonnet-4-5-20250929"

Step 3: Commit

git add resources/com/getorcha/config.edn
git commit -m "feat: add :classification-large LLM config for summary detection"

Task 2: Schema — Add summary-page-range and breakdown-items

Files:

Step 1: Write a failing test

In the test file (create if needed), add a test that validates an InvoiceData map with the new fields:

(deftest test-invoice-data-with-summary-fields
  (let [base-invoice (valid-invoice-fixture) ;; existing helper or inline a minimal valid map
        with-summary (assoc base-invoice
                            :summary-page-range [1 1]
                            :breakdown-items [{:description "Detail item"
                                               :amount 100.0
                                               :page-location [5 5]}])]
    (is (m/validate InvoiceData with-summary))
    (testing "summary-page-range is optional"
      (is (m/validate InvoiceData (dissoc with-summary :summary-page-range))))
    (testing "breakdown-items is optional"
      (is (m/validate InvoiceData (dissoc with-summary :breakdown-items))))))

Step 2: Run test — should fail (fields not in schema yet)

clj -X:test:silent :nses '[com.getorcha.schema.invoice.structured-data-test]'

Step 3: Add fields to InvoiceData schema

In src/com/getorcha/schema/invoice/structured_data.clj, in the InvoiceData [:map ...] form (around line 291), add:

;; Summary extraction (for large documents with >50 pages)
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]
[:breakdown-items    {:optional true} [:maybe [:vector LineItem]]]

Place these after :validation-results and before the closing ]].
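
To make the intended semantics of the two new entries concrete, here is a minimal standalone sketch. LineItemSketch and SummaryFieldsSketch are stand-ins invented for illustration (the real LineItem and InvoiceData schemas are not shown in this plan); only the two field definitions are copied from above.

```clojure
(require '[malli.core :as m])

;; Stand-in for the real LineItem schema (assumption, not the project's schema).
(def LineItemSketch
  [:map
   [:description :string]
   [:amount :double]
   [:page-location {:optional true} [:tuple :int :int]]])

;; Only the two new fields; the real InvoiceData map has many more entries.
(def SummaryFieldsSketch
  [:map
   [:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]
   [:breakdown-items    {:optional true} [:maybe [:vector LineItemSketch]]]])

(m/validate SummaryFieldsSketch {})                            ; key absent
(m/validate SummaryFieldsSketch {:summary-page-range nil})     ; explicit nil
(m/validate SummaryFieldsSketch {:summary-page-range [1 3]})   ; start/end pair
(m/validate SummaryFieldsSketch {:summary-page-range [1]})     ; rejected: tuple needs 2 ints
```

The first three calls should validate and the last should not: {:optional true} covers a missing key, [:maybe ...] covers an explicit nil, and [:tuple :int :int] requires a two-element vector.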

Step 4: Run test — should pass

clj -X:test:silent :nses '[com.getorcha.schema.invoice.structured-data-test]'

Step 5: Commit

git add src/com/getorcha/schema/invoice/structured_data.clj test/com/getorcha/schema/invoice/structured_data_test.clj
git commit -m "feat: add summary-page-range and breakdown-items to InvoiceData schema"

Task 3: Classification Tools — read_pages, search_text, page_headers

These are functions that take the transcription text and return a LangChain4j tool map for the agent. They live in classification.clj since they're only used there.

Files:

Step 1: Write failing tests for tool helper functions

The tools need pure helper functions that operate on text. Test those independently before wiring into LC4j.

(deftest test-read-pages
  (let [text "=== PAGE 1 ===\nPage one content\n=== PAGE 2 ===\nPage two content\n=== PAGE 3 ===\nPage three\n"]
    (testing "reads a single page"
      (is (= "=== PAGE 2 ===\nPage two content\n"
             (classification/read-pages-from-text text 2 2))))
    (testing "reads a range"
      (let [result (classification/read-pages-from-text text 1 2)]
        (is (string/includes? result "Page one"))
        (is (string/includes? result "Page two"))
        (is (not (string/includes? result "Page three")))))
    (testing "clamps to available pages"
      (is (some? (classification/read-pages-from-text text 1 999))))))

(deftest test-search-text-in-pages
  (let [text "=== PAGE 1 ===\nInvoice Summary\n=== PAGE 2 ===\nProject Summary\n=== PAGE 3 ===\nAppendix detail\n"]
    (testing "finds pages containing query"
      (let [result (classification/search-text-in-pages text "Summary")]
        (is (= 2 (:total-matches result)))
        (is (= [1 2] (:pages result)))))
    (testing "returns zero for no match"
      (let [result (classification/search-text-in-pages text "nonexistent")]
        (is (= 0 (:total-matches result)))
        (is (empty? (:pages result)))))))

(deftest test-page-headers
  (let [text "=== PAGE 1 ===\nFirst header line\nSecond line\nThird line\nFourth\n=== PAGE 2 ===\nAnother header\nMore content\n"]
    (testing "returns first 2 lines per page"
      (let [result (classification/page-headers-from-text text 1 2)]
        (is (= 2 (count result)))
        (is (= 1 (:page (first result))))
        (is (string/includes? (:header (first result)) "First header line"))))))

Step 2: Run tests — should fail

clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'

Step 3: Implement the helper functions

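In classification.clj, add these helper functions. They parse the === PAGE N === markers in the transcription text. The tests above call them by plain namespace-qualified name, so leave them public; if they are made private, the tests must go through the var instead, e.g. (#'classification/read-pages-from-text text 2 2).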

read-pages-from-text: splits text on page markers and returns the joined text for the requested range, clamped to the pages that exist.

search-text-in-pages: splits text into pages, checks each page for the query string (case-insensitive), and returns {:pages [1 3 5] :total-matches 3}.

page-headers-from-text: splits text into pages and returns the first 2 lines of each page in the range as [{:page 1 :header "..."} ...].

All three reuse the same page-splitting logic from split-text-into-page-chunks (the === PAGE N === regex). Extract a shared split-into-pages helper that returns [{:page-num N :text "..."}].
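
A minimal sketch of the three helpers on top of a shared split-into-pages, written to satisfy the tests in Step 1. Assumptions: the marker regex below is a guess at what split-text-into-page-chunks uses, every marker sits on its own line followed by a newline, and :total-matches counts matching pages rather than individual occurrences.

```clojure
(require '[clojure.string :as string])

;; Assumed marker format: "=== PAGE N ===" followed by a newline.
(def page-re #"(?s)=== PAGE (\d+) ===\n(.*?)(?==== PAGE \d+ ===|\z)")

(defn split-into-pages
  "Returns [{:page-num N :text \"...\"}], where :text is the page body without its marker."
  [text]
  (mapv (fn [[_ n body]] {:page-num (Long/parseLong n) :text body})
        (re-seq page-re text)))

(defn read-pages-from-text
  "Joined text (markers included) for pages start..end, inclusive.
   Out-of-range bounds simply select fewer pages."
  [text start end]
  (->> (split-into-pages text)
       (filter #(<= start (:page-num %) end))
       (map (fn [{:keys [page-num text]}]
              (str "=== PAGE " page-num " ===\n" text)))
       (string/join)))

(defn search-text-in-pages
  "Case-insensitive substring search over page bodies."
  [text query]
  (let [needle (string/lower-case query)
        pages  (->> (split-into-pages text)
                    (filter #(string/includes? (string/lower-case (:text %)) needle))
                    (mapv :page-num))]
    {:pages pages :total-matches (count pages)}))

(defn page-headers-from-text
  "First 2 lines of each page in start..end as [{:page N :header \"...\"}]."
  [text start end]
  (->> (split-into-pages text)
       (filter #(<= start (:page-num %) end))
       (mapv (fn [{:keys [page-num text]}]
               {:page   page-num
                :header (->> (string/split-lines text)
                             (take 2)
                             (string/join "\n"))}))))
```

Keeping the marker out of :text means a search for "page" does not match every page through its own marker; read-pages-from-text adds the marker back so the agent still sees page boundaries.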

Step 4: Run tests — should pass

clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'

Step 5: Commit

git add src/com/getorcha/workers/ap/ingestion/classification.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git commit -m "feat: add transcription text tool helpers for agentic classification"

Task 4: Classification — Convert to Agent Loop

This is the largest task. classify! switches from llm/generate to ai.agent/run-with-model.

Files:

Step 1: Write failing tests

Two test cases: small doc (no tools, same behavior as today) and large doc (tools registered, returns summary-page-range).

(deftest test-classify-small-document-agent
  (testing "≤50 pages: classifies normally, no summary detection"
    ;; Mock the agent to return a classification JSON
    (let [agent-calls (atom [])
          fake-agent  (fn [_model _tools _ctx config]
                        (swap! agent-calls conj config)
                        {:text (json/generate-string
                                 {:document-type "invoice"
                                  :invoice-subtype "standard-invoice"
                                  :description "Test invoice"
                                  :confidence "high"})
                         :iterations 0
                         :tool-calls []
                         :usage {:input-tokens 100 :output-tokens 50}})]
      (with-redefs [ai.agent/run-with-model fake-agent
                    ai.agent/make-chat-model (fn [_] :mock-model)]
        (let [result (classification/classify!
                       {:db-pool fixtures/*db*
                        :llm-config (test-llm-config)}
                       {:transcription-result {:text "=== PAGE 1 ===\nShort invoice\n"
                                               :page-count 3}
                        :document {:document/legal-entity-id test-le-id}})]
          ;; Classification works as before
          (is (= "invoice" (get-in result [:classification :document-type])))
          ;; No summary-page-range for small docs
          (is (nil? (get-in result [:classification :summary-page-range]))))))))

(deftest test-classify-large-document-summary-detection
  (testing ">50 pages: includes summary detection, returns summary-page-range"
    (let [fake-agent (fn [_model _tools _ctx config]
                       ;; Agent returns classification + summary-page-range
                       {:text (json/generate-string
                                {:document-type "invoice"
                                 :invoice-subtype "standard-invoice"
                                 :description "Large rental invoice"
                                 :confidence "high"
                                 :summary-page-range [1 1]})
                        :iterations 2
                        :tool-calls [{:name "search_text" :args "{}" :result "{}"}]
                        :usage {:input-tokens 500 :output-tokens 100}})]
      (with-redefs [ai.agent/run-with-model fake-agent
                    ai.agent/make-chat-model (fn [_] :mock-model)]
        (let [text (apply str (for [i (range 1 101)]
                                (str "=== PAGE " i " ===\nContent page " i "\n")))
              result (classification/classify!
                       {:db-pool fixtures/*db*
                        :llm-config (test-llm-config)}
                       {:transcription-result {:text text :page-count 100}
                        :document {:document/legal-entity-id test-le-id}})]
          ;; Classification works
          (is (= "invoice" (get-in result [:classification :document-type])))
          ;; Summary page range detected
          (is (= [1 1] (get-in result [:classification :summary-page-range]))))))))

Step 2: Run tests — should fail

Step 3: Implement the refactored classify!

Key changes to classify!:

  1. Read :page-count from the transcription result (already available from the transcription step).
  2. Determine if this is a large document (page-count > large-document-threshold, default 50).
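  3. Build the prompt: render the existing :classification template; for large documents, append the summary detection instructions.
  4. Build tools: for large documents, register read_pages, search_text, and page_headers over the transcription text; for small documents, pass no tools.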
  5. Select model config: :classification for small docs, :classification-large for large docs.
  6. Build ChatModel via ai.agent/make-chat-model (currently ^:private in ai/agent.clj; either make it public or inline the model construction).
  7. Call ai.agent/run-with-model with the model, tool map, and {:prompt rendered-prompt :max-iterations max-iter}.
  8. Parse the response text with parse-classification-result (existing function).
  9. For large documents: also extract :summary-page-range from the parsed JSON and assoc it into the classification result.
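
The branching in steps 2, 4, 5, and 9 can be sketched as two pure functions. Both names (classification-plan, sane-summary-page-range) are hypothetical, and the range sanity check is an assumption added for illustration: the plan itself only says to extract the value and assoc it.

```clojure
;; Assumed constant name; the plan only states a default threshold of 50.
(def large-document-threshold 50)

(defn classification-plan
  "Hypothetical helper: picks the model config key and tool names from the page count."
  [page-count]
  (let [large? (> (or page-count 0) large-document-threshold)]
    {:large?     large?
     :model-key  (if large? :classification-large :classification)
     :tool-names (if large? ["read_pages" "search_text" "page_headers"] [])}))

(defn sane-summary-page-range
  "Hypothetical guard for model output: returns [start end] only when it is a
   1-indexed, inclusive pair of integers inside the document, else nil."
  [page-range page-count]
  (when (and (sequential? page-range)
             (= 2 (count page-range))
             (every? integer? page-range))
    (let [[start end] page-range]
      (when (<= 1 start end page-count)
        (vec page-range)))))

(classification-plan 50)   ; exactly 50 pages still counts as small
(classification-plan 51)   ; first page count treated as large
(sane-summary-page-range [1 1] 100)
(sane-summary-page-range [0 3] 100)
```

Whether a malformed range should be dropped silently, as here, or surfaced as a classification error is a design decision this plan leaves open.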

Important: make-chat-model in ai/agent.clj is ^:private. Either change it to public or duplicate the construction logic in classification. Prefer making it public — it's a pure factory function with no side effects.

Regarding the prompt: The existing :classification prompt template uses ${text} substitution. For the agent, pass the fully-rendered prompt (with text substituted) as the :prompt parameter. No prompt template changes needed — just append the summary detection instructions after rendering for large docs.

Summary detection prompt addition (appended only for >50 pages):

SUMMARY DETECTION (document has N pages):
This is a large document. Determine whether it has a summary page (or pages) containing
the complete financial picture — invoice-level totals (subtotal, tax, total) and aggregated
line items that add up to those totals.

Per-project breakdowns, transaction detail, and equipment/article lists are NOT summary pages,
even if they contain the word "summary" in their header. A summary page is the one you would
use to book this invoice in an accounting system — it has the final numbers.

Use the provided tools to examine the document structure. Add to your JSON response:
  "summary-page-range": [start, end]  (1-indexed, inclusive)
  or "summary-page-range": null  if no clear summary exists.
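
One possible way to append this text after rendering, filling in the page count for N. The helper name with-summary-detection and the literal "(document has N pages)" placeholder handling are assumptions; the instructions string is passed in rather than repeated here.

```clojure
(require '[clojure.string :as string])

(defn with-summary-detection
  "Hypothetical helper: appends the summary detection instructions to an
   already-rendered classification prompt and substitutes the page count."
  [rendered-prompt instructions page-count]
  (str rendered-prompt
       "\n\n"
       (string/replace instructions
                       "(document has N pages)"
                       (str "(document has " page-count " pages)"))))

;; Shortened stand-in for the full instructions text above.
(def instructions-head
  "SUMMARY DETECTION (document has N pages):\nThis is a large document.")

(with-summary-detection "<rendered classification prompt>" instructions-head 120)
```

Because the addition happens after template rendering, the existing ${text} substitution and the small-document prompt stay untouched.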

Step 4: Run tests — should pass

clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'

Step 5: Run all existing classification tests to check for regressions

clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'

Step 6: Commit

git add src/com/getorcha/workers/ap/ingestion/classification.clj src/com/getorcha/ai/agent.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git commit -m "feat: convert classification to agent loop with summary detection for large documents"

Task 5: Extraction — Item Split Based on summary-page-range

After extraction merges all chunks, split items into line-items and breakdown-items based on the classification's summary-page-range.

Files:

Step 1: Write failing test

(deftest test-item-split-by-summary-page-range
  (let [items [{:description "Summary item" :amount 1000.0 :page-location [1 1]}
               {:description "Detail A"     :amount 100.0  :page-location [2 3]}
               {:description "Detail B"     :amount 200.0  :page-location [5 5]}
               {:description "Boundary"     :amount 50.0   :page-location [1 2]}]]

    (testing "with summary-page-range, splits items"
      ;; split-items-by-summary is private, so call it through its var
      (let [result (#'workers.extraction/split-items-by-summary items [1 1])]
        ;; "Summary item" [1 1] and "Boundary" [1 2] both overlap [1 1]
        (is (= 2 (count (:line-items result))))
        (is (= "Summary item" (:description (first (:line-items result)))))
        (is (= 2 (count (:breakdown-items result))))))

    (testing "boundary item: page-location [1 2] overlaps [1 1]"
      (let [result (#'workers.extraction/split-items-by-summary items [1 1])]
        ;; Item spanning summary and non-summary pages goes to line-items
        (is (some #(= "Boundary" (:description %)) (:line-items result)))))

    (testing "without summary-page-range, returns all in line-items"
      (let [result (#'workers.extraction/split-items-by-summary items nil)]
        (is (= 4 (count (:line-items result))))
        (is (nil? (:breakdown-items result)))))))

Step 2: Run test — should fail

Step 3: Implement split-items-by-summary

In extraction.clj, add a private function:

(defn ^:private split-items-by-summary
  "Splits line items into summary items and breakdown items based on page range.
   Items whose page-location overlaps the summary range go to :line-items.
   Returns {:line-items [...] :breakdown-items [...] or nil}."
  [items summary-page-range]
  (if (nil? summary-page-range)
    {:line-items items}
    (let [[s-start s-end] summary-page-range
          overlaps? (fn [{:keys [page-location]}]
                      (let [[p-start p-end] page-location]
                        (and (<= p-start s-end) (>= p-end s-start))))
          {summary true breakdown false} (group-by overlaps? items)]
      {:line-items      (vec (or summary []))
       :breakdown-items (when (seq breakdown) (vec breakdown))})))
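
A quick REPL check of the overlap rule, using the same items as the test. The function body is copied from above and made public only so the snippet runs standalone. Note that an item without :page-location would make the comparison throw, so this sketch assumes every item carries one.

```clojure
;; Same logic as the private function above, public here for a standalone run.
(defn split-items-by-summary
  [items summary-page-range]
  (if (nil? summary-page-range)
    {:line-items items}
    (let [[s-start s-end] summary-page-range
          overlaps? (fn [{:keys [page-location]}]
                      (let [[p-start p-end] page-location]
                        (and (<= p-start s-end) (>= p-end s-start))))
          {summary true breakdown false} (group-by overlaps? items)]
      {:line-items      (vec (or summary []))
       :breakdown-items (when (seq breakdown) (vec breakdown))})))

(def items
  [{:description "Summary item" :amount 1000.0 :page-location [1 1]}
   {:description "Detail A"     :amount 100.0  :page-location [2 3]}
   {:description "Detail B"     :amount 200.0  :page-location [5 5]}
   {:description "Boundary"     :amount 50.0   :page-location [1 2]}])

(let [{:keys [line-items breakdown-items]} (split-items-by-summary items [1 1])]
  [(mapv :description line-items) (mapv :description breakdown-items)])
;; => [["Summary item" "Boundary"] ["Detail A" "Detail B"]]
```

The boundary item lands in :line-items because its range [1 2] touches the summary range [1 1], so a summary range of [1 1] yields two line items and two breakdown items for this input.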

Then modify structured-data "invoice" (line ~761) to apply the split. The ingestion map has :classification with :summary-page-range. After the extraction result is computed (either single call or chunked), apply the split:

;; At the end of the method, after getting the extraction result:
(let [result          (if (<= text-tokens budget-tokens)
                        (extract-chunk ...)
                        (let [chunks ...] (merge-chunked-extractions results)))
      summary-range   (get-in ingestion [:classification :summary-page-range])
      {:keys [line-items breakdown-items]} (split-items-by-summary
                                             (:line-items (:data result))
                                             summary-range)]
  (cond-> (assoc-in result [:data :line-items] line-items)
    breakdown-items (assoc-in [:data :breakdown-items] breakdown-items)))

Note: the defmethod already receives the ingestion map, but the parameter is currently named _ingestion because it is unused. Rename it to ingestion and read :classification from it.

Step 4: Run test — should pass

clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]'

Step 5: Run all extraction tests for regressions

Step 6: Commit

git add src/com/getorcha/workers/ap/ingestion/extraction.clj test/com/getorcha/workers/ap/ingestion/extraction_test.clj
git commit -m "feat: split extraction items into line-items and breakdown-items by summary page range"

Task 6: Validation — Add large-document-summary-only Warning

Files:

Step 1: Write failing test

(deftest test-large-document-summary-only-warning
  (testing "adds warning when summary-page-range is present"
    (let [data {:summary-page-range [1 1]
                :line-items [{:description "Miete" :amount 1000.0 :page-location [1 1]}]
                :breakdown-items [{:description "Detail" :amount 100.0 :page-location [5 5]}]}
          result (#'validation/check-large-document-summary-only data)]
      (is (= "warning" (:status result)))
      (is (string/includes? (:message result) "summary"))))

  (testing "passes when no summary-page-range"
    (let [result (#'validation/check-large-document-summary-only {:line-items []})]
      (is (= "pass" (:status result))))))

Step 2: Run test — should fail

Step 3: Implement the check

In validation.clj, add:

(defn ^:private check-large-document-summary-only
  "Warns when a large document was processed using summary-only extraction."
  [{:keys [summary-page-range breakdown-items]}]
  (if summary-page-range
    (let [[start end] summary-page-range
          breakdown-count (count breakdown-items)]
      {:status  "warning"
       :message (format "Large document — only summary pages %d-%d were processed for account and cost center matching. %d breakdown items available for download. Manual review recommended."
                        start end breakdown-count)
       :details {:summary-page-range summary-page-range
                 :breakdown-item-count breakdown-count}})
    {:status "pass"}))

Register it in the validate "invoice" defmethod (line ~831):

(defmethod validate "invoice"
  [structured-data]
  (assoc structured-data
         :validation-results
         {:financial-math                (check-financial-math structured-data)
          :required-fields               (check-required-fields structured-data)
          :tax-id-format                 (check-tax-id-format structured-data)
          :iban-format                   (check-iban structured-data)
          :date-reasonableness           (check-date-reasonableness structured-data)
          :issuer-country                (check-issuer-country structured-data)
          :recipient-country             (check-recipient-country structured-data)
          :large-document-summary-only   (check-large-document-summary-only structured-data)}))

Step 4: Run test — should pass

clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.validation-test]'

Step 5: Run all validation tests for regressions

Step 6: Commit

git add src/com/getorcha/workers/ap/ingestion/validation.clj test/com/getorcha/workers/ap/ingestion/validation_test.clj
git commit -m "feat: add large-document-summary-only validation warning"

Task 7: Integration Test — End-to-End Large Document

Files:

Write an integration test that exercises the full pipeline with a >50 page document. Mock the LLM calls but verify:

  1. Classification agent receives tools and summary detection instructions
  2. Extraction returns both line-items and breakdown-items
  3. Post-processing only processes line-items
  4. Validation includes large-document-summary-only warning
  5. Final structured-data has all expected fields

Use with-redefs on ai.agent/run-with-model and llm/generate to control responses.
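
The mocking pattern can be tried in isolation before wiring it to the real pipeline. Everything below is a stand-in: run-with-model and classify-sketch are local placeholders, not the real ai.agent/run-with-model or classify!, and their argument lists are assumptions modelled on the fakes used in Task 4.

```clojure
(require '[clojure.test :refer [deftest is run-tests successful?]])

;; Placeholder for the real agent entry point; must never run in tests.
(defn run-with-model [_model _tools _ctx _config]
  (throw (ex-info "real LLM call; stub this in tests" {})))

;; Placeholder classify step: registers tools only above 50 pages.
(defn classify-sketch [page-count]
  (let [tools (if (> page-count 50) {:search_text identity} {})]
    (run-with-model :model tools {} {:prompt "..."})))

(deftest large-document-passes-tools
  (let [calls (atom [])]
    (with-redefs [run-with-model (fn [_model tools _ctx config]
                                   ;; Record what the agent was given.
                                   (swap! calls conj {:tools tools :config config})
                                   {:text "{}"})]
      (classify-sketch 100)
      (is (= 1 (count @calls)))
      (is (contains? (:tools (first @calls)) :search_text)))))

(deftest small-document-passes-no-tools
  (let [calls (atom [])]
    (with-redefs [run-with-model (fn [_model tools _ctx _config]
                                   (swap! calls conj tools)
                                   {:text "{}"})]
      (classify-sketch 3)
      (is (= [{}] @calls)))))
```

Capturing the arguments in an atom lets the test assert on what the agent received (tools, prompt) as well as on what the pipeline did with the canned response.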

git commit -m "test: add integration test for large document summary extraction pipeline"

Task 8: Manual Smoke Test with Doka Invoice

Re-ingest the Doka invoice (document-id 019cf832-1890-703f-9ab0-8a6b814cdebb) and verify:

  1. Classification logs show agent tool calls (search_text, page_headers)
  2. Classification result includes summary-page-range: [1, 1]
  3. Extraction completes with line-items (summary) and breakdown-items (detail)
  4. Post-processing completes without context window overflow
  5. Validation shows large-document-summary-only warning
  6. Document displays correctly in UI with warning banner

This is manual — no automated test. Use /reingest-doc skill if available.