Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Prevent the LLM from misclassifying invoice appendix pages (delivery note summaries) as separate documents, and fix the random content hash for segmented children.
Architecture: Two fixes: (1) strengthen the classification prompt so the LLM recognizes delivery note listings/summaries as invoice attachments rather than standalone GRNs, (2) replace random UUID content hash for segmented children with SHA-256 of actual content.
Tech Stack: Clojure, clojure.test
When a 2-page invoice PDF (ATP1E4CT_invoice.pdf) is re-ingested, the LLM classification non-deterministically returns segments: page 1 as "invoice", page 2 as "goods-received-note". Page 2 is actually a "Lieferschein - Aufstellung" (delivery note listing) — an attachment summarizing the delivery notes referenced by the invoice. It's part of the invoice, not a standalone GRN.
The classification prompt already has rule 6: "Delivery notes/Lieferscheine are goods-received-note, NOT invoices" — which triggers the LLM to see "Lieferschein" on page 2 and split it out. The "same document" signals section mentions "Appendix pages" but doesn't specifically call out delivery note summaries as invoice attachments.
The child document created from page 2 then matches with the parent invoice (score 0.9055) because they share counterparty, VAT ID, quantities, etc.
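To make the misclassification described above concrete, the two outcomes can be written out as data. This is only a sketch: the key names are taken from the tests later in this plan, while the second segment's :confidence value and the segmented? helper are assumptions added for illustration, and the real result maps may carry more keys.

```clojure
;; Unwanted outcome: the classifier splits the 2-page file into two documents.
(def segmented-result
  {:segments [{:document-type   "invoice"
               :invoice-subtype "standard-invoice"
               :pages           [1 1]
               :confidence      "high"}
              {:document-type "goods-received-note"
               :pages         [2 2]
               :confidence    "high"}]}) ; confidence value is assumed

;; Desired outcome: one classification for the whole file, no :segments key.
(def single-result
  {:document-type   "invoice"
   :invoice-subtype "standard-invoice"})

(defn segmented?
  "Hypothetical helper: true when the classifier split the file."
  [result]
  (boolean (seq (:segments result))))

(segmented? segmented-result) ; => true
(segmented? single-result)    ; => false
```

Only the first shape creates a child document from page 2, which is what later matches against its own parent invoice.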
| File | Action | Responsibility |
|---|---|---|
| src/com/getorcha/workers/ap/ingestion/classification.clj | Modify (lines 267-298) | Strengthen multi-doc detection prompt |
| src/com/getorcha/workers/ap/ingestion.clj | Modify (line 728) | Use SHA-256 content hash |
| test/com/getorcha/workers/ap/ingestion_test.clj | Modify | Add content hash test |
| test/com/getorcha/workers/ap/ingestion/classification_test.clj | Modify | Add classification test for invoice w/ delivery summary |
Add explicit guidance that delivery note summaries/listings appearing as appendix pages of an invoice are part of the invoice, not separate GRN documents.
Files:
Modify: src/com/getorcha/workers/ap/ingestion/classification.clj:267-298
Test: test/com/getorcha/workers/ap/ingestion/classification_test.clj
Step 1: Write a classification test for invoice with delivery summary
Read the existing classification tests first and match their structure. Then add a test that classifies the transcribed text of ATP1E4CT_invoice.pdf (or a representative sample) and asserts that it returns a single classification, not segments.
Using the actual transcribed text ensures the LLM sees the "Lieferschein - Aufstellung" page and still classifies the file as a single invoice:
(deftest classify-invoice-with-delivery-summary-appendix
  (testing "invoice with 'Lieferschein - Aufstellung' page is classified as single invoice, not segmented"
    ;; Page 2 of some invoices contains a delivery note listing (Lieferschein-Aufstellung)
    ;; that summarizes deliveries referenced by the invoice. This is an invoice attachment,
    ;; not a standalone goods-received-note.
    (let [text   (str "=== PAGE 1 ===\n"
                      "SÜD BETON\nLIEFERBETON GmbH & Co KG\n"
                      "STRABAG AG\nRechnung\n"
                      "Rechnungsnummer: 2501395\nDatum: 24.06.2025\n"
                      "C25/30 XC2 GK16 F59 CEM II/B 42,5N 5,00 m³ 95,00 475,00\n"
                      "C12/15 X0 GK22 F52 CEM II 42,5N 6,50 m³ 82,50 536,25\n"
                      "Nettobetrag: 2.117,05\nMwSt 20%: 423,41\nBrutto: 2.540,46\n"
                      "UID-Nr: ATU 29643605\n"
                      "=== PAGE 2 ===\n"
                      "Lieferschein - Aufstellung\n"
                      "Datum LS-Nr. Artikel Menge\n"
                      "12.06.2025 2057422 C25/30 XC2 GK16 F59 CEM II/B 42,5N 5,00\n"
                      "17.06.2025 2057543 C12/15 X0 GK22 F52 CEM II 42,5N 6,50\n"
                      "18.06.2025 50653 C25/30 B2 GK22 F52 CEM II 42,5N 9,50\n"
                      "23.06.2025 50761 C25/30 B2 GK22 F52 CEM II 42,5N 4,50\n"
                      "UID-Nr: ATU 29643605\n")
          result (classify-text text)]
      (is (nil? (:segments result))
          "Should NOT segment: delivery note listing is an invoice attachment")
      (is (= "invoice" (:document-type result)))
      (is (= "standard-invoice" (:invoice-subtype result))))))
Note: classify-text is a test helper — check how existing classification tests invoke the LLM and adapt accordingly.
Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Because the LLM output is non-deterministic, the test may pass on some runs even before the prompt is changed. Run it 2-3 times to confirm that it can fail.
In src/com/getorcha/workers/ap/ingestion/classification.clj, modify the multi-document detection section. Add to the "Signals of the SAME document continuing" list (after line 283):
- Summary or detail pages that reference, itemize, or break down content from the main
document (e.g., a page listing delivery references for an invoice, a schedule of
payments for a contract). These are attachments/appendices, not standalone documents.
The key principle: a supporting page that exists to detail or substantiate items in the primary document is part of that document, even if it resembles another document type when viewed in isolation.
Run 3 times:
clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' :vars '[com.getorcha.workers.ap.ingestion.classification-test/classify-invoice-with-delivery-summary-appendix]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: PASS on all 3 runs.
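A single green run proves little for a non-deterministic check, so it can help to count passes over several runs. The following is a minimal sketch, assuming the check can be wrapped in a zero-argument function. repeat-check is a hypothetical helper, not part of the codebase, and the stand-in check below replaces the real LLM call.

```clojure
(defn repeat-check
  "Calls `check` (a zero-arg fn returning truthy on success) `n` times
  and returns a summary of passes and failures."
  [n check]
  (let [results (doall (repeatedly n #(boolean (check))))]
    {:runs   n
     :passed (count (filter true? results))
     :failed (count (filter false? results))}))

;; Stand-in check that always succeeds; in practice this would wrap the
;; classification call plus the single-invoice assertion.
(repeat-check 3 (constantly true))
;; => {:runs 3, :passed 3, :failed 0}
```

Any non-zero :failed count after the prompt change means the fix is not yet reliable.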
Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: All pass.
Run: clj-kondo --lint src/com/getorcha/workers/ap/ingestion/classification.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git add src/com/getorcha/workers/ap/ingestion/classification.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git commit -m "fix: prevent LLM from splitting invoice delivery-summary pages as separate GRNs"
Replace the random UUID content hash with an actual SHA-256 of the child PDF bytes. A random value never repeats, so the unique constraint (legal_entity_id, content_hash) can never fire for segmented children. A content-derived hash makes that constraint meaningful and enables proper dedup on retry.
Files:
Modify: src/com/getorcha/workers/ap/ingestion.clj:728
Test: test/com/getorcha/workers/ap/ingestion_test.clj
Step 1: Write failing test
Add to ingestion_test.clj:
(deftest segmentation-gate-uses-content-hash
  (testing "child documents get a deterministic content hash based on PDF bytes"
    (let [s3-client ^S3Client (get-in fixtures/*system* [::aws/state :clients :s3])
          s3-bucket (get-in (fixtures/config) [::aws/state :s3-buckets :storage])
          {:keys [ingestion-id document-id]}
          (create-test-document+ingestion!
           {:s3-client s3-client
            :s3-bucket s3-bucket
            :db-pool fixtures/*db*}
           {:upload-to-s3? true})
          orchestrator (::workers.ingestion/orchestrator fixtures/*system*)
          aws-state (::aws/state fixtures/*system*)
          context (merge orchestrator {:db-pool fixtures/*db* :aws aws-state})
          parent-document (db.sql/execute-one!
                           fixtures/*db*
                           {:select [:*] :from [:document] :where [:= :id document-id]})
          segments [{:document-type "invoice"
                     :invoice-subtype "standard-invoice"
                     :pages [1 1]
                     :confidence "high"}
                    {:document-type "invoice"
                     :invoice-subtype "standard-invoice"
                     :pages [2 2]
                     :confidence "high"}]
          ingestion {:id ingestion-id
                     :document {:document/id document-id
                                :document/legal-entity-id (:document/legal-entity-id parent-document)
                                :document/file-path (:document/file-path parent-document)
                                :document/file-original-name (:document/file-original-name parent-document)
                                :document/source-metadata (or (:document/source-metadata parent-document) {})}
                     :transcription-result {:text (str "=== PAGE 1 ===\nInvoice 1 content\n"
                                                       "=== PAGE 2 ===\nInvoice 2 content")
                                            :page-count 2}
                     :file {:contents (.getBytes "fake pdf content")
                            :mime-type "application/pdf"}}]
      (let [sent-messages (atom [])]
        (with-redefs [workers.transcription/extract-pages (fn [_pdf-bytes _page-indices]
                                                            (.getBytes "deterministic child content"))
                      aws/send-message! (fn [_client _queue-url body]
                                          (swap! sent-messages conj body))]
          (#'workers.ingestion/segmentation-gate! context ingestion segments "test-sha")))
      ;; Verify child's content hash is deterministic (SHA-256 of the child PDF bytes)
      (let [children (db.sql/execute!
                      fixtures/*db*
                      {:select [:content-hash]
                       :from [:document]
                       :where [:= :source-document-id document-id]})]
        (is (= 1 (count children)))
        ;; Hash should NOT be a UUID format (36 chars with dashes)
        (is (not (re-matches #"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
                             (:document/content-hash (first children))))
            "Content hash should be a SHA-256 hex string, not a random UUID")
        ;; Should be a 64-char hex string (SHA-256)
        (is (re-matches #"[0-9a-f]{64}" (:document/content-hash (first children))))))))
Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' :vars '[com.getorcha.workers.ap.ingestion-test/segmentation-gate-uses-content-hash]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: FAIL — content hash matches UUID pattern.
In src/com/getorcha/workers/ap/ingestion.clj, replace line 728:
content-hash (str (random-uuid))]
with:
content-hash (let [digest (java.security.MessageDigest/getInstance "SHA-256")
                   hash-bytes (.digest digest child-pdf-bytes)]
               (apply str (map #(format "%02x" %) hash-bytes)))]
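No additional imports are needed: Clojure auto-imports only java.lang, but java.security.MessageDigest is referenced here by its fully qualified name, which works without an :import.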
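The replacement expression and the retry behaviour it is meant to enable can be sanity-checked in isolation at a REPL. The sketch below is an illustration under stated assumptions, not the production code: sha256-hex wraps the same digest expression in a function, and insert-child is a hypothetical stand-in that models the (legal_entity_id, content_hash) unique constraint as a set of pairs. A real database would reject the duplicate row rather than silently merge it.

```clojure
(defn sha256-hex
  "Lowercase hex SHA-256 of a byte array, mirroring the replacement
  expression in this step."
  [^bytes bs]
  (let [digest (java.security.MessageDigest/getInstance "SHA-256")]
    (apply str (map #(format "%02x" %) (.digest digest bs)))))

(defn insert-child
  "Hypothetical stand-in for the unique constraint: `rows` is a set of
  [legal-entity-id content-hash] pairs, so a repeated pair is stored once."
  [rows legal-entity-id content-hash]
  (conj rows [legal-entity-id content-hash]))

(def child-bytes (.getBytes "deterministic child content" "UTF-8"))

;; Two attempts with random hashes leave two rows behind.
(def with-random
  (-> #{}
      (insert-child 1 (str (random-uuid)))
      (insert-child 1 (str (random-uuid)))))

;; Two attempts with the content hash collapse into one row.
(def with-sha
  (-> #{}
      (insert-child 1 (sha256-hex child-bytes))
      (insert-child 1 (sha256-hex child-bytes))))

(count with-random) ; => 2
(count with-sha)    ; => 1
```

As a check on the hex formatting (including bytes with the high bit set), (sha256-hex (.getBytes "abc" "UTF-8")) should return the published test vector ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad.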
Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' :vars '[com.getorcha.workers.ap.ingestion-test/segmentation-gate-uses-content-hash]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: PASS.
Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: All pass.
Run: clj-kondo --lint src/com/getorcha/workers/ap/ingestion.clj
git add src/com/getorcha/workers/ap/ingestion.clj test/com/getorcha/workers/ap/ingestion_test.clj
git commit -m "fix: use SHA-256 content hash for segmented child documents"