Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Large invoices (>50 pages) get their summary page range detected by an agentic classifier, and extraction splits items into line-items (summary, post-processed) and breakdown-items (detail, stored as-is for CSV download).
Architecture: Classification becomes an agent loop (LangChain4j) for all documents. For >50 pages, the agent gets tools to probe the transcription and a prompt augmentation to detect summary page ranges. Extraction splits items based on the detected range. Post-processing runs on summary items only. A validation warning flags these documents for manual review.
Tech Stack: Clojure, LangChain4j (agent loop), Malli (schemas), Integrant (config), clojure.test
Design doc: docs/plans/2026-03-17-large-invoice-summary-extraction-design.md
:classification-large
Files:
resources/com/getorcha/config.edn
Step 1: Add the new LLM config key
In :com.getorcha/llm (line ~62), add after :main:
:classification-large {:provider :anthropic
                       :api-key  #orcha/param "/v1-orcha/anthropic-api-key"
                       :model    "claude-sonnet-4-5-20250929"}
In the orchestrator's :llm-config (line ~233), add:
:classification-large #ref [:com.getorcha/llm :classification-large]
Step 2: Verify config loads
Start the REPL, (reset), then:
(get-in (:com.getorcha.workers.ap.ingestion/orchestrator integrant.repl.state/config)
        [:llm-config :classification-large :model])
;; => "claude-sonnet-4-5-20250929"
Step 3: Commit
git add resources/com/getorcha/config.edn
git commit -m "feat: add :classification-large LLM config for summary detection"
summary-page-range and breakdown-items
Files:
src/com/getorcha/schema/invoice/structured_data.clj
test/com/getorcha/schema/invoice/structured_data_test.clj (or inline REPL verification)
Step 1: Write a failing test
In the test file (create if needed), add a test that validates an InvoiceData map with the new fields:
(deftest test-invoice-data-with-summary-fields
  (let [base-invoice (valid-invoice-fixture) ;; existing helper or inline a minimal valid map
        with-summary (assoc base-invoice
                            :summary-page-range [1 1]
                            :breakdown-items [{:description "Detail item"
                                               :amount 100.0
                                               :page-location [5 5]}])]
    (is (m/validate InvoiceData with-summary))
    (testing "summary-page-range is optional"
      (is (m/validate InvoiceData (dissoc with-summary :summary-page-range))))
    (testing "breakdown-items is optional"
      (is (m/validate InvoiceData (dissoc with-summary :breakdown-items))))))
Step 2: Run test — should fail (fields not in schema yet)
clj -X:test:silent :nses '[com.getorcha.schema.invoice.structured-data-test]'
Note: Malli :map schemas are open by default, so this test fails at this point only if InvoiceData is declared with {:closed true}. If the schema is open, the test already passes and this step only confirms that the fixture is valid.
Step 3: Add fields to InvoiceData schema
In src/com/getorcha/schema/invoice/structured_data.clj, in the InvoiceData [:map ...] form (around line 291, before the closing ]), add:
;; Summary extraction (for large documents with >50 pages)
[:summary-page-range {:optional true} [:maybe [:tuple :int :int]]]
[:breakdown-items {:optional true} [:maybe [:vector LineItem]]]
Place these after :validation-results and before the closing ]].
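As a quick REPL check of what the new entries accept, the sketch below validates the summary-page-range shape on its own. The namespace name, SummaryPageRange, and InvoiceFragment are hypothetical names used only for illustration. The real InvoiceData schema has many more keys.

```clojure
(ns example.summary-schema
  (:require [malli.core :as m]))

;; Hypothetical standalone copy of the schema used for :summary-page-range.
(def SummaryPageRange [:maybe [:tuple :int :int]])

;; Hypothetical cut-down map with only the new optional key.
(def InvoiceFragment
  [:map [:summary-page-range {:optional true} SummaryPageRange]])

(m/validate SummaryPageRange [1 1])   ;; => true  (two ints)
(m/validate SummaryPageRange nil)     ;; => true  (:maybe allows nil)
(m/validate SummaryPageRange [1])     ;; => false (tuple needs exactly two elements)
(m/validate SummaryPageRange [1 "2"]) ;; => false (second element is not an int)
(m/validate InvoiceFragment {})       ;; => true  (key is optional)
```

Because the key is both optional and wrapped in :maybe, a small document can omit it entirely and a large document without a clear summary can carry an explicit nil.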
Step 4: Run test — should pass
clj -X:test:silent :nses '[com.getorcha.schema.invoice.structured-data-test]'
Step 5: Commit
git add src/com/getorcha/schema/invoice/structured_data.clj test/com/getorcha/schema/invoice/structured_data_test.clj
git commit -m "feat: add summary-page-range and breakdown-items to InvoiceData schema"
read_pages, search_text, page_headers
These are functions that take the transcription text and return a LangChain4j tool map for the agent. They live in classification.clj since they're only used there.
Files:
src/com/getorcha/workers/ap/ingestion/classification.clj
test/com/getorcha/workers/ap/ingestion/classification_test.clj
Step 1: Write failing tests for tool helper functions
The tools need pure helper functions that operate on text. Test those independently before wiring into LC4j.
;; The helpers are private (see Step 3), so the tests call them through their vars.
(deftest test-read-pages
  (let [text "=== PAGE 1 ===\nPage one content\n=== PAGE 2 ===\nPage two content\n=== PAGE 3 ===\nPage three\n"]
    (testing "reads a single page"
      (is (= "=== PAGE 2 ===\nPage two content\n"
             (#'classification/read-pages-from-text text 2 2))))
    (testing "reads a range"
      (let [result (#'classification/read-pages-from-text text 1 2)]
        (is (string/includes? result "Page one"))
        (is (string/includes? result "Page two"))
        (is (not (string/includes? result "Page three")))))
    (testing "clamps to available pages"
      (is (some? (#'classification/read-pages-from-text text 1 999))))))
(deftest test-search-text-in-pages
  (let [text "=== PAGE 1 ===\nInvoice Summary\n=== PAGE 2 ===\nProject Summary\n=== PAGE 3 ===\nAppendix detail\n"]
    (testing "finds pages containing query"
      (let [result (#'classification/search-text-in-pages text "Summary")]
        (is (= 2 (:total-matches result)))
        (is (= [1 2] (:pages result)))))
    (testing "returns zero for no match"
      (let [result (#'classification/search-text-in-pages text "nonexistent")]
        (is (= 0 (:total-matches result)))
        (is (empty? (:pages result)))))))
(deftest test-page-headers
  (let [text "=== PAGE 1 ===\nFirst header line\nSecond line\nThird line\nFourth\n=== PAGE 2 ===\nAnother header\nMore content\n"]
    (testing "returns first 2 lines per page"
      (let [result (#'classification/page-headers-from-text text 1 2)]
        (is (= 2 (count result)))
        (is (= 1 (:page (first result))))
        (is (string/includes? (:header (first result)) "First header line"))))))
Step 2: Run tests — should fail
clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'
Step 3: Implement the helper functions
In classification.clj, add these private functions. They parse the === PAGE N === markers in the transcription text.
read-pages-from-text — splits text on page markers, returns joined text for the requested range.
search-text-in-pages — splits text into pages, checks each page for the query string (case-insensitive), returns {:pages [1 3 5] :total-matches 3}.
page-headers-from-text — splits text into pages, returns first 2-3 lines of each page in the range as [{:page 1 :header "..."} ...].
All three reuse the same page-splitting logic from split-text-into-page-chunks (the === PAGE N === regex). Extract a shared split-into-pages helper that returns [{:page-num N :text "..."}].
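The three helpers and the shared splitter described above could look roughly like the sketch below. It is written against the test fixtures in Step 1 and makes several assumptions: the page marker is always a full line of the form === PAGE N ===, headers are the first 2 lines after the marker, and the functions are public here only so the snippet runs standalone (the plan keeps them private in classification.clj). The example.page-tools namespace is hypothetical, and the existing regex in split-text-into-page-chunks may differ.

```clojure
(ns example.page-tools
  (:require [clojure.string :as string]))

(defn split-into-pages
  "Splits transcription text on === PAGE N === marker lines.
  Returns [{:page-num N :text \"...\"}], each :text including its marker line.
  Any text before the first marker is dropped."
  [text]
  (->> (string/split text #"(?m)(?=^=== PAGE \d+ ===$)")
       (keep (fn [chunk]
               (when-let [[_ n] (re-find #"^=== PAGE (\d+) ===" chunk)]
                 {:page-num (Long/parseLong n) :text chunk})))
       vec))

(defn- pages-in-range
  "Pages whose number lies in [start, end]; out-of-range bounds simply match nothing extra."
  [text start end]
  (filter #(<= start (:page-num %) end) (split-into-pages text)))

(defn read-pages-from-text
  "Concatenated text of pages start..end (1-indexed, inclusive)."
  [text start end]
  (apply str (map :text (pages-in-range text start end))))

(defn search-text-in-pages
  "Case-insensitive substring search. :total-matches counts matching pages,
  not individual occurrences."
  [text query]
  (let [q     (string/lower-case query)
        pages (->> (split-into-pages text)
                   (filter #(string/includes? (string/lower-case (:text %)) q))
                   (mapv :page-num))]
    {:pages pages :total-matches (count pages)}))

(defn page-headers-from-text
  "First 2 content lines (marker line excluded) of each page in the range."
  [text start end]
  (mapv (fn [{:keys [page-num] page-text :text}]
          {:page   page-num
           :header (->> (string/split-lines page-text)
                        rest
                        (take 2)
                        (string/join "\n"))})
        (pages-in-range text start end)))

(search-text-in-pages "=== PAGE 1 ===\nInvoice Summary\n=== PAGE 2 ===\nDetail\n" "summary")
;; => {:pages [1], :total-matches 1}
```

One consequence of keeping the marker line inside each page's :text is that a query such as "page" matches every page. Stripping the marker before searching would avoid that if it turns out to matter.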
Step 4: Run tests — should pass
clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'
Step 5: Commit
git add src/com/getorcha/workers/ap/ingestion/classification.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git commit -m "feat: add transcription text tool helpers for agentic classification"
This is the largest task. classify! switches from llm/generate to ai.agent/run-with-model.
Files:
src/com/getorcha/workers/ap/ingestion/classification.clj
test/com/getorcha/workers/ap/ingestion/classification_test.clj
Step 1: Write failing tests
Two test cases: small doc (no tools, same behavior as today) and large doc (tools registered, returns summary-page-range).
(deftest test-classify-small-document-agent
  (testing "≤50 pages: classifies normally, no summary detection"
    ;; Mock the agent to return a classification JSON
    (let [agent-calls (atom [])
          fake-agent (fn [_model _tools _ctx config]
                       (swap! agent-calls conj config)
                       {:text (json/generate-string
                               {:document-type "invoice"
                                :invoice-subtype "standard-invoice"
                                :description "Test invoice"
                                :confidence "high"})
                        :iterations 0
                        :tool-calls []
                        :usage {:input-tokens 100 :output-tokens 50}})]
      (with-redefs [ai.agent/run-with-model fake-agent
                    ai.agent/make-chat-model (fn [_] :mock-model)]
        (let [result (classification/classify!
                      {:db-pool fixtures/*db*
                       :llm-config (test-llm-config)}
                      {:transcription-result {:text "=== PAGE 1 ===\nShort invoice\n"
                                              :page-count 3}
                       :document {:document/legal-entity-id test-le-id}})]
          ;; Classification works as before
          (is (= "invoice" (get-in result [:classification :document-type])))
          ;; No summary-page-range for small docs
          (is (nil? (get-in result [:classification :summary-page-range]))))))))
(deftest test-classify-large-document-summary-detection
  (testing ">50 pages: includes summary detection, returns summary-page-range"
    (let [fake-agent (fn [_model _tools _ctx config]
                       ;; Agent returns classification + summary-page-range
                       {:text (json/generate-string
                               {:document-type "invoice"
                                :invoice-subtype "standard-invoice"
                                :description "Large rental invoice"
                                :confidence "high"
                                :summary-page-range [1 1]})
                        :iterations 2
                        :tool-calls [{:name "search_text" :args "{}" :result "{}"}]
                        :usage {:input-tokens 500 :output-tokens 100}})]
      (with-redefs [ai.agent/run-with-model fake-agent
                    ai.agent/make-chat-model (fn [_] :mock-model)]
        (let [text (apply str (for [i (range 1 101)]
                                (str "=== PAGE " i " ===\nContent page " i "\n")))
              result (classification/classify!
                      {:db-pool fixtures/*db*
                       :llm-config (test-llm-config)}
                      {:transcription-result {:text text :page-count 100}
                       :document {:document/legal-entity-id test-le-id}})]
          ;; Classification works
          (is (= "invoice" (get-in result [:classification :document-type])))
          ;; Summary page range detected
          (is (= [1 1] (get-in result [:classification :summary-page-range]))))))))
Step 2: Run tests — should fail
Step 3: Implement the refactored classify!
Key changes to classify!:
:page-count from the transcription result (already available from the transcription step).
large-document-threshold, default 50).
legal-entity-prompt.
read_pages, search_text, page_headers: closures over the transcription text. Use LangChain4j ToolSpecification/builder and reify ToolExecutor per the pattern in ai/agent/interop.clj:82-99.
:classification for small docs, :classification-large for large docs.
ChatModel via ai.agent/make-chat-model (note: this is currently private; either make it package-accessible or inline the model construction).
ai.agent/run-with-model with the model, tool map, and {:prompt rendered-prompt :max-iterations max-iter}.
parse-classification-result (existing function).
:summary-page-range from the parsed JSON and assoc it into the classification result.
Important: make-chat-model in ai/agent.clj is ^:private. Either change it to public or duplicate the construction logic in classification. Prefer making it public, since it is a pure factory function with no side effects.
Regarding the prompt: The existing :classification prompt template uses ${text} substitution. For the agent, pass the fully-rendered prompt (with text substituted) as the :prompt parameter. No prompt template changes needed — just append the summary detection instructions after rendering for large docs.
Summary detection prompt addition (appended only for >50 pages):
SUMMARY DETECTION (document has N pages):
This is a large document. Determine whether it has a summary page (or pages) containing
the complete financial picture — invoice-level totals (subtotal, tax, total) and aggregated
line items that add up to those totals.
Per-project breakdowns, transaction detail, and equipment/article lists are NOT summary pages,
even if they contain the word "summary" in their header. A summary page is the one you would
use to book this invoice in an accounting system — it has the final numbers.
Use the provided tools to examine the document structure. Add to your JSON response:
"summary-page-range": [start, end] (1-indexed, inclusive)
or "summary-page-range": null if no clear summary exists.
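The branching described in Step 3 (threshold check, LLM config key, and prompt augmentation) can be sketched as a pure function, which keeps it testable without the agent. Everything named here is an assumption for illustration: the example.classification-prompt namespace, agent-settings, the :tools? flag, and the abbreviated template, which stands in for the full SUMMARY DETECTION text above.

```clojure
(ns example.classification-prompt
  (:require [clojure.string :as string]))

;; Assumption: mirrors the plan's large-document-threshold default of 50.
(def large-document-threshold 50)

;; Abbreviated stand-in for the full SUMMARY DETECTION prompt text.
(def summary-detection-template
  (str "SUMMARY DETECTION (document has %d pages):\n"
       "This is a large document. ..."))

(defn large-document?
  "True only for documents with more than large-document-threshold pages."
  [page-count]
  (> page-count large-document-threshold))

(defn agent-settings
  "Returns the LLM config key, the final prompt, and whether the probing
  tools should be registered, given the already-rendered classification prompt."
  [rendered-prompt page-count]
  (if (large-document? page-count)
    {:llm-key :classification-large
     :prompt  (str rendered-prompt "\n\n"
                   (format summary-detection-template page-count))
     :tools?  true}
    {:llm-key :classification
     :prompt  rendered-prompt
     :tools?  false}))

(agent-settings "PROMPT" 3)
;; => {:llm-key :classification, :prompt "PROMPT", :tools? false}
(:llm-key (agent-settings "PROMPT" 51))
;; => :classification-large
```

A document with exactly 50 pages takes the small-document path, matching the ">50 pages" wording used throughout the plan.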
Step 4: Run tests — should pass
clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'
Step 5: Run all existing classification tests to check for regressions
clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]'
Step 6: Commit
git add src/com/getorcha/workers/ap/ingestion/classification.clj src/com/getorcha/ai/agent.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git commit -m "feat: convert classification to agent loop with summary detection for large documents"
summary-page-range
After extraction merges all chunks, split items into line-items and breakdown-items based on the classification's summary-page-range.
Files:
src/com/getorcha/workers/ap/ingestion/extraction.clj
test/com/getorcha/workers/ap/ingestion/extraction_test.clj
Step 1: Write failing test
(deftest test-item-split-by-summary-page-range
  (let [items [{:description "Summary item" :amount 1000.0 :page-location [1 1]}
               {:description "Detail A" :amount 100.0 :page-location [2 3]}
               {:description "Detail B" :amount 200.0 :page-location [5 5]}
               {:description "Boundary" :amount 50.0 :page-location [1 2]}]
        ;; split-items-by-summary is private, so call it through its var
        split-items #'workers.extraction/split-items-by-summary]
    (testing "with summary-page-range, splits items"
      (let [result (split-items items [1 1])]
        ;; "Summary item" and "Boundary" both overlap page 1
        (is (= 2 (count (:line-items result))))
        (is (= "Summary item" (:description (first (:line-items result)))))
        (is (= 2 (count (:breakdown-items result))))))
    (testing "boundary item: page-location [1 2] overlaps [1 1]"
      (let [result (split-items items [1 1])]
        ;; Item spanning summary and non-summary pages goes to line-items
        (is (some #(= "Boundary" (:description %)) (:line-items result)))))
    (testing "without summary-page-range, returns all in line-items"
      (let [result (split-items items nil)]
        (is (= 4 (count (:line-items result))))
        (is (nil? (:breakdown-items result)))))))
Step 2: Run test — should fail
Step 3: Implement split-items-by-summary
In extraction.clj, add a private function:
(defn ^:private split-items-by-summary
  "Splits line items into summary items and breakdown items based on page range.
  Items whose page-location overlaps the summary range go to :line-items.
  Returns {:line-items [...] :breakdown-items [...] or nil}."
  [items summary-page-range]
  (if (nil? summary-page-range)
    {:line-items items}
    (let [[s-start s-end] summary-page-range
          overlaps? (fn [{:keys [page-location]}]
                      (let [[p-start p-end] page-location]
                        (and (<= p-start s-end) (>= p-end s-start))))
          {summary true breakdown false} (group-by overlaps? items)]
      {:line-items (vec (or summary []))
       :breakdown-items (when (seq breakdown) (vec breakdown))})))
Then modify structured-data "invoice" (line ~761) to apply the split. The ingestion map has :classification with :summary-page-range. After the extraction result is computed (either single call or chunked), apply the split:
;; At the end of the method, after getting the extraction result.
;; `ingestion` is the defmethod parameter currently named `_ingestion` (see note below).
(let [result (if (<= text-tokens budget-tokens)
               (extract-chunk ...)
               (let [chunks ...] (merge-chunked-extractions results)))
      summary-range (get-in ingestion [:classification :summary-page-range])
      {:keys [line-items breakdown-items]} (split-items-by-summary
                                            (:line-items (:data result))
                                            summary-range)]
  (cond-> (assoc-in result [:data :line-items] line-items)
    breakdown-items (assoc-in [:data :breakdown-items] breakdown-items)))
Note: the defmethod already receives the ingestion map, but the parameter is currently named _ingestion because it is unused. Rename it to ingestion (or destructure :classification from it) so the summary-page-range can be read.
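The overlap rule used by split-items-by-summary in this step is the standard test for two inclusive ranges, and it can be checked in isolation. The sketch below uses a hypothetical ranges-overlap? helper and namespace that are not part of the plan, applied to the page locations from the Step 1 fixture.

```clojure
(ns example.range-overlap)

(defn ranges-overlap?
  "True when inclusive ranges [a-start a-end] and [b-start b-end] share at least one page."
  [[a-start a-end] [b-start b-end]]
  (and (<= a-start b-end) (>= a-end b-start)))

;; Page locations from the fixture, checked against summary range [1 1]:
(mapv #(ranges-overlap? % [1 1]) [[1 1] [2 3] [5 5] [1 2]])
;; => [true false false true]
```

The last entry is the boundary case: an item located on pages 1 to 2 touches the summary page, so it is kept with the summary line-items rather than moved to breakdown-items.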
Step 4: Run test — should pass
clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.extraction-test]'
Step 5: Run all extraction tests for regressions
Step 6: Commit
git add src/com/getorcha/workers/ap/ingestion/extraction.clj test/com/getorcha/workers/ap/ingestion/extraction_test.clj
git commit -m "feat: split extraction items into line-items and breakdown-items by summary page range"
large-document-summary-only Warning
Files:
src/com/getorcha/workers/ap/ingestion/validation.clj
test/com/getorcha/workers/ap/ingestion/validation_test.clj
Step 1: Write failing test
(deftest test-large-document-summary-only-warning
  ;; check-large-document-summary-only is private, so call it through its var
  (testing "adds warning when summary-page-range is present"
    (let [data {:summary-page-range [1 1]
                :line-items [{:description "Miete" :amount 1000.0 :page-location [1 1]}]
                :breakdown-items [{:description "Detail" :amount 100.0 :page-location [5 5]}]}
          result (#'validation/check-large-document-summary-only data)]
      (is (= "warning" (:status result)))
      (is (string/includes? (:message result) "summary"))))
  (testing "passes when no summary-page-range"
    (let [result (#'validation/check-large-document-summary-only {:line-items []})]
      (is (= "pass" (:status result))))))
Step 2: Run test — should fail
Step 3: Implement the check
In validation.clj, add:
(defn ^:private check-large-document-summary-only
  "Warns when a large document was processed using summary-only extraction."
  [{:keys [summary-page-range breakdown-items]}]
  (if summary-page-range
    (let [[start end] summary-page-range
          breakdown-count (count breakdown-items)]
      {:status "warning"
       :message (format "Large document — only summary pages %d-%d were processed for account and cost center matching. %d breakdown items available for download. Manual review recommended."
                        start end breakdown-count)
       :details {:summary-page-range summary-page-range
                 :breakdown-item-count breakdown-count}})
    {:status "pass"}))
Register it in the validate "invoice" defmethod (line ~831):
(defmethod validate "invoice"
  [structured-data]
  (assoc structured-data
         :validation-results
         {:financial-math (check-financial-math structured-data)
          :required-fields (check-required-fields structured-data)
          :tax-id-format (check-tax-id-format structured-data)
          :iban-format (check-iban structured-data)
          :date-reasonableness (check-date-reasonableness structured-data)
          :issuer-country (check-issuer-country structured-data)
          :recipient-country (check-recipient-country structured-data)
          :large-document-summary-only (check-large-document-summary-only structured-data)}))
Step 4: Run test — should pass
clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.validation-test]'
Step 5: Run all validation tests for regressions
Step 6: Commit
git add src/com/getorcha/workers/ap/ingestion/validation.clj test/com/getorcha/workers/ap/ingestion/validation_test.clj
git commit -m "feat: add large-document-summary-only validation warning"
Files:
test/com/getorcha/workers/ap/ingestion_test.clj
Write an integration test that exercises the full pipeline with a >50 page document. Mock the LLM calls but verify:
line-items and breakdown-items
line-items
large-document-summary-only warning
Use with-redefs on ai.agent/run-with-model and llm/generate to control responses.
git commit -m "test: add integration test for large document summary extraction pipeline"
Re-ingest the Doka invoice (document-id 019cf832-1890-703f-9ab0-8a6b814cdebb) and verify:
summary-page-range: [1, 1]
line-items (summary) and breakdown-items (detail)
large-document-summary-only warning
This is manual; there is no automated test. Use the /reingest-doc skill if available.