Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Fix Segmentation Self-Match Bug

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Prevent the LLM from misclassifying invoice appendix pages (delivery note summaries) as separate documents, and fix the random content hash for segmented children.

Architecture: Two fixes: (1) strengthen the classification prompt so the LLM recognizes delivery note listings/summaries as invoice attachments rather than standalone GRNs, (2) replace the random UUID content hash for segmented children with the SHA-256 of the child's actual content.

Tech Stack: Clojure, clojure.test


Context

When a 2-page invoice PDF (ATP1E4CT_invoice.pdf) is re-ingested, the LLM classification non-deterministically returns segments: page 1 as "invoice", page 2 as "goods-received-note". Page 2 is actually a "Lieferschein - Aufstellung" (delivery note listing) — an attachment summarizing the delivery notes referenced by the invoice. It's part of the invoice, not a standalone GRN.

The classification prompt already has rule 6: "Delivery notes/Lieferscheine are goods-received-note, NOT invoices". When the LLM sees "Lieferschein" on page 2, this rule leads it to split the page out as a separate document. The "same document" signals section mentions "Appendix pages" but does not specifically call out delivery note summaries as invoice attachments.

The child document created from page 2 is then matched against its own parent invoice (score 0.9055), because the two share counterparty, VAT ID, quantities, and other fields. This is the self-match the plan title refers to.

File Structure

| File | Action | Responsibility |
|---|---|---|
| src/com/getorcha/workers/ap/ingestion/classification.clj | Modify (lines 267-298) | Strengthen multi-doc detection prompt |
| src/com/getorcha/workers/ap/ingestion.clj | Modify (line 728) | Use SHA-256 content hash |
| test/com/getorcha/workers/ap/ingestion_test.clj | Modify | Add content hash test |
| test/com/getorcha/workers/ap/ingestion/classification_test.clj | Modify | Add classification test for invoice w/ delivery summary |

Task 1: Strengthen classification prompt for invoice attachments

Add explicit guidance that delivery note summaries/listings appearing as appendix pages of an invoice are part of the invoice, not separate GRN documents.

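Files: src/com/getorcha/workers/ap/ingestion/classification.clj (modify, lines 267-298), test/com/getorcha/workers/ap/ingestion/classification_test.clj (modify)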

First read the existing classification tests to learn their structure. Then add a test that classifies the transcribed text of ATP1E4CT_invoice.pdf (or a representative sample) and asserts that the result is a single classification, not segments.

Using the actual transcribed text ensures the LLM sees the "Lieferschein - Aufstellung" heading and still classifies the document as a single invoice.

Following the pattern of the existing tests, add:

(deftest classify-invoice-with-delivery-summary-appendix
  (testing "invoice with 'Lieferschein - Aufstellung' page is classified as single invoice, not segmented"
    ;; Page 2 of some invoices contains a delivery note listing (Lieferschein-Aufstellung)
    ;; that summarizes deliveries referenced by the invoice. This is an invoice attachment,
    ;; not a standalone goods-received-note.
    (let [text (str "=== PAGE 1 ===\n"
                    "SÜD BETON\nLIEFERBETON GmbH & Co KG\n"
                    "STRABAG AG\nRechnung\n"
                    "Rechnungsnummer: 2501395\nDatum: 24.06.2025\n"
                    "C25/30 XC2 GK16 F59 CEM II/B 42,5N  5,00 m³  95,00  475,00\n"
                    "C12/15 X0 GK22 F52 CEM II 42,5N  6,50 m³  82,50  536,25\n"
                    "Nettobetrag: 2.117,05\nMwSt 20%: 423,41\nBrutto: 2.540,46\n"
                    "UID-Nr: ATU 29643605\n"
                    "=== PAGE 2 ===\n"
                    "Lieferschein - Aufstellung\n"
                    "Datum LS-Nr. Artikel Menge\n"
                    "12.06.2025 2057422 C25/30 XC2 GK16 F59 CEM II/B 42,5N 5,00\n"
                    "17.06.2025 2057543 C12/15 X0 GK22 F52 CEM II 42,5N 6,50\n"
                    "18.06.2025 50653 C25/30 B2 GK22 F52 CEM II 42,5N 9,50\n"
                    "23.06.2025 50761 C25/30 B2 GK22 F52 CEM II 42,5N 4,50\n"
                    "UID-Nr: ATU 29643605\n")]
      (let [result (classify-text text)]
        (is (nil? (:segments result))
            "Should NOT segment — delivery note listing is an invoice attachment")
        (is (= "invoice" (:document-type result)))
        (is (= "standard-invoice" (:invoice-subtype result)))))))

Note: classify-text is a test helper — check how existing classification tests invoke the LLM and adapt accordingly.
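The two result shapes the test distinguishes can be sketched as plain maps. The keys below are inferred from the tests in this plan; the real return value may carry more keys, and `segmented?` is a hypothetical helper, not something in the codebase.

```clojure
;; A single classification: no :segments key (shape inferred from the test above).
(def single-result
  {:document-type   "invoice"
   :invoice-subtype "standard-invoice"})

;; A segmented classification: one map per detected document, with a page range.
(def segmented-result
  {:segments [{:document-type "invoice"
               :invoice-subtype "standard-invoice"
               :pages [1 1]
               :confidence "high"}
              {:document-type "goods-received-note"
               :pages [2 2]
               :confidence "high"}]})

(defn segmented?
  "Hypothetical predicate: true when the classifier split the input into segments."
  [result]
  (boolean (seq (:segments result))))

(segmented? single-result)    ;; => false
(segmented? segmented-result) ;; => true
```

The bug described in Context corresponds to the second shape being returned for an input that should produce the first.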

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"

Expected: FAIL on at least some runs. Because the LLM output is non-deterministic, the test may pass occasionally, so run it 2-3 times to confirm.

In src/com/getorcha/workers/ap/ingestion/classification.clj, modify the multi-document detection section. Add to the "Signals of the SAME document continuing" list (after line 283):

- Summary or detail pages that reference, itemize, or break down content from the main
  document (e.g., a page listing delivery references for an invoice, a schedule of
  payments for a contract). These are attachments/appendices, not standalone documents.

The key principle: a supporting page that exists to detail or substantiate items in the primary document is part of that document, even if it resembles another document type when viewed in isolation.

Run 3 times:

clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' :vars '[com.getorcha.workers.ap.ingestion.classification-test/classify-invoice-with-delivery-summary-appendix]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"

Expected: PASS on all 3 runs.
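If repeating the command by hand becomes tedious, the "must hold on every run" idea can be expressed as a small helper. This is a sketch under assumptions: `stable-across-runs?` and the stub classifiers are hypothetical, and `classify` stands in for however the real tests invoke the LLM.

```clojure
(defn stable-across-runs?
  "Calls the zero-arg fn `classify` n times and returns true only if
  every result satisfies `pred`. Stops at the first failing run."
  [n classify pred]
  (every? pred (repeatedly n classify)))

(defn flaky-stub
  "Hypothetical stand-in for a non-deterministic classifier:
  the second call returns segments, every other call a single invoice."
  []
  (let [calls (atom 0)]
    (fn []
      (if (= 2 (swap! calls inc))
        {:segments [{:pages [1 1]} {:pages [2 2]}]}
        {:document-type "invoice"}))))

(def not-segmented? #(nil? (:segments %)))

(stable-across-runs? 3 (flaky-stub) not-segmented?)                          ;; => false
(stable-across-runs? 3 (constantly {:document-type "invoice"}) not-segmented?) ;; => true
```

Each run against the real LLM costs a request, so a small n is a trade-off between confidence and cost.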

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"

Expected: All pass.

Run: clj-kondo --lint src/com/getorcha/workers/ap/ingestion/classification.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj

git add src/com/getorcha/workers/ap/ingestion/classification.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git commit -m "fix: prevent LLM from splitting invoice delivery-summary pages as separate GRNs"

Task 2: Use content-based hash for segmented children

Replace the random UUID content hash with an actual SHA-256 of the child PDF bytes. This enables proper dedup on retry and makes the unique constraint (legal_entity_id, content_hash) meaningful for segmented children: with a random hash, every retry produces a new value, so the constraint can never reject a duplicate child.
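The effect on retries can be simulated in memory. This is a sketch only: `insert-child!` and `rows-after-retry` are hypothetical names, and a set of [legal-entity-id content-hash] pairs stands in for the database's unique constraint.

```clojure
(defn sha256-hex
  "Lowercase hex SHA-256 of a byte array."
  [^bytes data]
  (let [digest (.digest (java.security.MessageDigest/getInstance "SHA-256") data)]
    (apply str (map #(format "%02x" %) digest))))

(defn insert-child!
  "Hypothetical stand-in for the DB insert. `table` is an atom holding a set,
  so a repeated [legal-entity-id content-hash] pair is silently deduplicated."
  [table legal-entity-id content-hash]
  (swap! table conj [legal-entity-id content-hash]))

(defn rows-after-retry
  "Inserts the same child bytes twice (first attempt plus one retry),
  hashing with `hash-fn`, and returns the number of distinct rows."
  [hash-fn]
  (let [table       (atom #{})
        child-bytes (.getBytes "deterministic child content" "UTF-8")]
    (dotimes [_ 2]
      (insert-child! table "le-1" (hash-fn child-bytes)))
    (count @table)))

;; Random hash: the retry creates a second row.
(rows-after-retry (fn [_] (str (java.util.UUID/randomUUID)))) ;; => 2
;; Content hash: the retry maps to the same row.
(rows-after-retry sha256-hex)                                 ;; => 1
```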

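Files: src/com/getorcha/workers/ap/ingestion.clj (modify, line 728), test/com/getorcha/workers/ap/ingestion_test.clj (modify)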

Add to ingestion_test.clj:

(deftest segmentation-gate-uses-content-hash
  (testing "child documents get a deterministic content hash based on PDF bytes"
    (let [s3-client  ^S3Client (get-in fixtures/*system* [::aws/state :clients :s3])
          s3-bucket  (get-in (fixtures/config) [::aws/state :s3-buckets :storage])
          {:keys [ingestion-id document-id]}
          (create-test-document+ingestion!
           {:s3-client s3-client
            :s3-bucket s3-bucket
            :db-pool   fixtures/*db*}
           {:upload-to-s3? true})
          orchestrator (::workers.ingestion/orchestrator fixtures/*system*)
          aws-state    (::aws/state fixtures/*system*)
          context      (merge orchestrator {:db-pool fixtures/*db* :aws aws-state})
          parent-document (db.sql/execute-one!
                           fixtures/*db*
                           {:select [:*] :from [:document] :where [:= :id document-id]})
          segments [{:document-type   "invoice"
                     :invoice-subtype "standard-invoice"
                     :pages           [1 1]
                     :confidence      "high"}
                    {:document-type   "invoice"
                     :invoice-subtype "standard-invoice"
                     :pages           [2 2]
                     :confidence      "high"}]
          ingestion {:id                    ingestion-id
                     :document              {:document/id               document-id
                                             :document/legal-entity-id  (:document/legal-entity-id parent-document)
                                             :document/file-path        (:document/file-path parent-document)
                                             :document/file-original-name (:document/file-original-name parent-document)
                                             :document/source-metadata  (or (:document/source-metadata parent-document) {})}
                     :transcription-result  {:text       (str "=== PAGE 1 ===\nInvoice 1 content\n"
                                                              "=== PAGE 2 ===\nInvoice 2 content")
                                             :page-count 2}
                     :file                  {:contents (.getBytes "fake pdf content")
                                             :mime-type "application/pdf"}}]
      (let [sent-messages (atom [])]
        (with-redefs [workers.transcription/extract-pages (fn [_pdf-bytes _page-indices]
                                                            (.getBytes "deterministic child content"))
                      aws/send-message!                   (fn [_client _queue-url body]
                                                            (swap! sent-messages conj body))]
          (#'workers.ingestion/segmentation-gate! context ingestion segments "test-sha")))
      ;; Verify child's content hash is deterministic (SHA-256 of the child PDF bytes)
      (let [children (db.sql/execute!
                      fixtures/*db*
                      {:select [:content-hash]
                       :from   [:document]
                       :where  [:= :source-document-id document-id]})]
        (is (= 1 (count children)))
        ;; Hash should NOT be a UUID format (36 chars with dashes)
        (is (not (re-matches #"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
                             (:document/content-hash (first children))))
            "Content hash should be a SHA-256 hex string, not a random UUID")
        ;; Should be a 64-char hex string (SHA-256)
        (is (re-matches #"[0-9a-f]{64}" (:document/content-hash (first children))))))))

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' :vars '[com.getorcha.workers.ap.ingestion-test/segmentation-gate-uses-content-hash]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"

Expected: FAIL — content hash matches UUID pattern.

In src/com/getorcha/workers/ap/ingestion.clj, replace line 728:

content-hash     (str (random-uuid))]

with:

content-hash     (let [digest (java.security.MessageDigest/getInstance "SHA-256")
                       hash-bytes (.digest digest child-pdf-bytes)]
                   (apply str (map #(format "%02x" %) hash-bytes)))]

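No additional imports needed: java.security.MessageDigest is referenced by its fully qualified class name, which Clojure resolves without an :import clause. (Only java.lang classes are imported automatically.)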
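The same expression can be pulled into a standalone function and checked against a published SHA-256 test vector before wiring it into the ingestion code. A minimal sketch, assuming a hypothetical helper name `sha256-hex` that does not exist in the codebase:

```clojure
(defn sha256-hex
  "Returns the SHA-256 of `data` as a 64-character lowercase hex string.
  JVM bytes are signed, but %02x formats a negative Byte as its unsigned
  two-digit value, so no masking is needed."
  [^bytes data]
  (let [digest     (java.security.MessageDigest/getInstance "SHA-256")
        hash-bytes (.digest digest data)]
    (apply str (map #(format "%02x" %) hash-bytes))))

;; Standard test vector for the ASCII string "abc".
(sha256-hex (.getBytes "abc" "UTF-8"))
;; => "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
```

Whether to extract a helper or keep the expression inline at line 728 is a judgment call; the inline form in this plan keeps the change to a single binding.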

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' :vars '[com.getorcha.workers.ap.ingestion-test/segmentation-gate-uses-content-hash]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"

Expected: PASS.

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"

Expected: All pass.

Run: clj-kondo --lint src/com/getorcha/workers/ap/ingestion.clj

git add src/com/getorcha/workers/ap/ingestion.clj test/com/getorcha/workers/ap/ingestion_test.clj
git commit -m "fix: use SHA-256 content hash for segmented child documents"