Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Multi-Document PDF Splitting Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Detect and split multi-document PDFs during ingestion so each document is processed independently.

Architecture: Classification is enhanced to return a vector of segments with page ranges. A new segmentation gate in the orchestrator splits the PDF, creates child documents with cached transcription and pre-filled classification, and routes them for independent processing. A new classification_result JSONB column on ap_ingestion enables classification caching/skipping.
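
As a rough illustration of the data this design passes around, the sketch below shows what a segments vector might look like and how the decision to run the gate reduces to a count check. The field names mirror the ones used later in this plan; `multi-document?` is a hypothetical helper introduced only for this example.

```clojure
;; Illustrative only: the shape of a classification result once it
;; carries segments. `multi-document?` is a hypothetical helper name.
(def example-segments
  [{:document-type "invoice"          :pages [1 1] :confidence "high"}
   {:document-type "invoice"          :pages [2 3] :confidence "high"}
   {:document-type "financial-notice" :pages [4 4] :confidence "high"}])

(defn multi-document?
  "True when classification found more than one document in the PDF."
  [segments]
  (> (count segments) 1))

(multi-document? example-segments)           ;; => true
(multi-document? [(first example-segments)]) ;; => false
(multi-document? nil)                        ;; => false
```

A single-document PDF yields a one-element vector, so the rest of the pipeline can treat segment 0 as "the" classification in both cases.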

Tech Stack: Clojure, PostgreSQL, PDFBox, S3, SQS, Claude LLM

Spec: docs/superpowers/specs/2026-03-27-multi-document-pdf-splitting-design.md


Task 1: Database Migration — Add classification_result and source_document_id

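Files:
- Create: resources/migrations/20260327120000-add-segmentation-columns.up.sql
- Create: resources/migrations/20260327120000-add-segmentation-columns.down.sql
- Modify: src/com/getorcha/schema/ingestion.clj
- Modify: src/com/getorcha/schema/document.clj

Up migration (resources/migrations/20260327120000-add-segmentation-columns.up.sql):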

ALTER TABLE ap_ingestion ADD COLUMN classification_result JSONB;
--;;
COMMENT ON COLUMN ap_ingestion.classification_result IS
    'Cached classification result. Set after successful LLM classification (retry idempotency)
     or pre-filled for split child documents (skip classification). NULL = run classification.';
--;;
ALTER TABLE document ADD COLUMN source_document_id UUID REFERENCES document(id) ON DELETE SET NULL;
--;;
COMMENT ON COLUMN document.source_document_id IS
    'Parent document ID when this document was created by splitting a multi-document PDF.
     Used for lineage/debugging only, not pipeline logic.';
--;;
CREATE INDEX idx_document_source_document_id ON document(source_document_id)
    WHERE source_document_id IS NOT NULL;

Down migration (resources/migrations/20260327120000-add-segmentation-columns.down.sql):

DROP INDEX IF EXISTS idx_document_source_document_id;
--;;
ALTER TABLE document DROP COLUMN IF EXISTS source_document_id;
--;;
ALTER TABLE ap_ingestion DROP COLUMN IF EXISTS classification_result;

Run: psql -h localhost -U postgres -d orcha -f resources/migrations/20260327120000-add-segmentation-columns.up.sql
Expected: No errors.

Run: psql -h localhost -U postgres -d orcha -c "\d ap_ingestion" | grep classification_result
Expected: classification_result | jsonb |

Run: psql -h localhost -U postgres -d orcha -c "\d document" | grep source_document_id
Expected: source_document_id | uuid |

Modify: src/com/getorcha/schema/ingestion.clj — add [:classification-result {:optional true} [:maybe :map]] to the Ingestion schema.

Modify: src/com/getorcha/schema/document.clj — add [:source-document-id {:optional true} [:maybe :uuid]] to the Document schema.

git add resources/migrations/20260327120000-add-segmentation-columns.up.sql resources/migrations/20260327120000-add-segmentation-columns.down.sql src/com/getorcha/schema/ingestion.clj src/com/getorcha/schema/document.clj
git commit -m "feat: add classification_result and source_document_id columns for multi-doc splitting"

Task 2: Classification Caching in Orchestrator

Make classify! in the orchestrator check for a pre-filled classification_result before calling the LLM, and persist the result after a successful LLM call.
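
The intended control flow is a plain cache-then-compute pattern. The sketch below shows it in isolation under loud assumptions: an atom stands in for the ap_ingestion.classification_result column, `classify-fn` stands in for the LLM call, and `classify-with-cache!` is a hypothetical name that does not exist in the codebase.

```clojure
;; Sketch of the cache-then-compute flow. The atom `store` stands in for
;; ap_ingestion.classification_result and `classify-fn` for the LLM call;
;; both are assumptions made for illustration.
(defn classify-with-cache!
  [store ingestion-id classify-fn]
  (if-let [cached (get @store ingestion-id)]
    {:classification cached :source :cache}
    (let [result (classify-fn)]
      ;; Persist so a retry of the same ingestion skips the LLM.
      (swap! store assoc ingestion-id result)
      {:classification result :source :llm})))

(def store (atom {}))
(def calls (atom 0))

(defn fake-llm []
  (swap! calls inc)
  {:document-type "invoice"})

(classify-with-cache! store 1 fake-llm) ;; first attempt: :source is :llm
(classify-with-cache! store 1 fake-llm) ;; retry: :source is :cache
@calls                                  ;; => 1
```

The same lookup serves both purposes named in the migration comment: a retry finds its own earlier result, and a split child finds the result its parent pre-filled.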

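Files:
- Modify: src/com/getorcha/workers/ap/ingestion.clj
- Test: test/com/getorcha/workers/ap/ingestion_test.clj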

Add to test/com/getorcha/workers/ap/ingestion_test.clj:

(deftest classify-uses-cached-classification-result
  (testing "classify! skips LLM when classification_result is pre-filled"
    ;; Create a document+ingestion with classification_result pre-set
    (let [{:keys [ingestion-id]} (create-test-document+ingestion!
                                  fixtures/*db* fixtures/*system*
                                  {:upload-to-s3? true})
          classification {:document-type "invoice"
                          :invoice-subtype "standard-invoice"
                          :confidence "high"
                          :document-description "Test invoice"}]
      ;; Pre-fill classification_result on the ingestion
      (db.sql/execute-one!
       fixtures/*db*
       {:update :ap-ingestion
        :set    {:classification-result [:lift classification]}
        :where  [:= :id ingestion-id]})
      ;; Run classify! — should NOT call classification/classify!
      (let [classify-called? (atom false)]
        (with-redefs [classification/classify! (fn [& _]
                                                (reset! classify-called? true)
                                                (throw (ex-info "Should not be called" {})))]
          (let [context   (select-keys (::ingestion/orchestrator fixtures/*system*)
                                       [:db-pool :llm-config])
                ingestion {:id ingestion-id
                           :document {:document/id (random-uuid)
                                      :document/legal-entity-id (random-uuid)}
                           :transcription-result {:text "test" :page-count 1}}
                result    (#'ingestion/classify! context ingestion)]
            (is (false? @classify-called?))
            (is (= "invoice" (get-in result [:classification :document-type])))
            (is (= classification (:classification result)))))))))

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in)"
Expected: FAIL, because classify! doesn't check classification_result yet.

Modify src/com/getorcha/workers/ap/ingestion.clj, replace the classify! function (lines 233-261):

(defn ^:private classify!
  "Classifies document type. Returns ingestion with :classification added.
   If classification_result is already set on the ingestion record (pre-filled
   for split documents or cached from a previous attempt), uses it directly.
   Otherwise runs LLM classification and persists the result for retry idempotency.
   Updates document.type immediately so the document appears on the correct page
   (AP for invoices, DM for contracts/POs/GRNs) while still processing."
  [{:keys [db-pool] :as context} {:keys [id] :as ingestion}]
  (let [cached-result (:ap-ingestion/classification-result
                        (db.sql/execute-one!
                         db-pool
                         {:select [:classification-result]
                          :from   [:ap-ingestion]
                          :where  [:= :id id]}))]
    (if cached-result
      ;; Use pre-filled or previously cached classification
      (do
        (log/info "Using cached classification result"
                  {:document-type (:document-type cached-result)})
        (db.sql/execute-one!
         db-pool
         {:update :document
          :set    {:type (db.sql/->cast (:document-type cached-result) :document-type)}
          :where  [:= :id (get-in ingestion [:document :document/id])]})
        (cond-> (assoc ingestion
                       :classification cached-result
                       :classification-stats {:input-tokens 0 :output-tokens 0
                                             :started-at (java.time.Instant/now)
                                             :ended-at (java.time.Instant/now)
                                             :model "cached"})
          (= "financial-notice" (:document-type cached-result))
          (assoc :short-circuit? true)))
      ;; Run LLM classification
      (do
        (log/info "Starting classification")
        (let [{:keys [stats classification short-circuit?]} (classification/classify! context ingestion)]
          (log/info "Classification completed"
                    {:document-type  (:document-type classification)
                     :subtype        (case (:document-type classification)
                                       "invoice"          (:invoice-subtype classification)
                                       "contract"         (:contract-type classification)
                                       "financial-notice" (:notice-type classification)
                                       nil)
                     :confidence     (:confidence classification)
                     :short-circuit? (boolean short-circuit?)
                     :duration-ms    (- (.toEpochMilli ^java.time.Instant (:ended-at stats))
                                        (.toEpochMilli ^java.time.Instant (:started-at stats)))})
          ;; Persist classification result for retry idempotency
          (db.sql/execute-one!
           db-pool
           {:update :ap-ingestion
            :set    {:classification-result [:lift classification]}
            :where  [:= :id id]})
          ;; Set document type early
          (db.sql/execute-one!
           db-pool
           {:update :document
            :set    {:type (db.sql/->cast (:document-type classification) :document-type)}
            :where  [:= :id (get-in ingestion [:document :document/id])]})
          (cond-> (assoc ingestion
                         :classification classification
                         :classification-stats stats)
            short-circuit? (assoc :short-circuit? true)))))))

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: All pass.

git add src/com/getorcha/workers/ap/ingestion.clj test/com/getorcha/workers/ap/ingestion_test.clj
git commit -m "feat: classify! checks classification_result before running LLM"

Task 3: Transcription Text Slicing Utilities

Add public functions for slicing transcription text by page range and re-numbering the page markers. transcription.clj already has a private split-text-by-pages; this task makes it public and builds slice-text-by-page-range on top of it so the orchestrator can slice text per segment.

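Files:
- Modify: src/com/getorcha/workers/ap/ingestion/transcription.clj
- Test: test/com/getorcha/workers/ap/ingestion/transcription_test.clj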

Add to test/com/getorcha/workers/ap/ingestion/transcription_test.clj (create if needed):

(ns com.getorcha.workers.ap.ingestion.transcription-test
  (:require [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.ap.ingestion.transcription :as transcription]))


(deftest slice-text-by-page-range-test
  (let [text (str "=== PAGE 1 ===\nInvoice for Franz Dorfer\nTotal: 439.20\n"
                  "=== PAGE 2 ===\nInvoice for Christian Müller\nTotal: 564.60\n"
                  "=== PAGE 3 ===\nInvoice for Matthias Rysavy\nTotal: 439.20\n"
                  "=== PAGE 4 ===\nInvoice for Slamanig\nTotal: 333.80")]
    (testing "extracts single page and re-indexes to page 1"
      (let [result (transcription/slice-text-by-page-range text [2 2])]
        (is (= "=== PAGE 1 ===\nInvoice for Christian Müller\nTotal: 564.60" result))))

    (testing "extracts multi-page range and re-indexes"
      (let [result (transcription/slice-text-by-page-range text [2 3])]
        (is (= (str "=== PAGE 1 ===\nInvoice for Christian Müller\nTotal: 564.60\n"
                    "=== PAGE 2 ===\nInvoice for Matthias Rysavy\nTotal: 439.20")
               result))))

    (testing "extracts first page"
      (let [result (transcription/slice-text-by-page-range text [1 1])]
        (is (= "=== PAGE 1 ===\nInvoice for Franz Dorfer\nTotal: 439.20" result))))

    (testing "extracts last page"
      (let [result (transcription/slice-text-by-page-range text [4 4])]
        (is (= "=== PAGE 1 ===\nInvoice for Slamanig\nTotal: 333.80" result))))))

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.transcription-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in)"
Expected: FAIL, because slice-text-by-page-range doesn't exist.

Make the existing split-text-by-pages public (remove ^:private), and add a new public function after it in src/com/getorcha/workers/ap/ingestion/transcription.clj:

(defn split-text-by-pages
  "Split text containing === PAGE N === markers into a map of {page-num page-text}.
   Page text does not include the marker itself."
  [text]
  (let [markers (vec (re-seq #"=== PAGE (\d+) ===" text))
        parts   (str/split text #"=== PAGE \d+ ===\n?")]
    (into {} (map-indexed
              (fn [idx [_ num-str]]
                [(parse-long num-str) (str/trim (nth parts (inc idx) ""))])
              markers))))


(defn slice-text-by-page-range
  "Extract pages from transcription text and re-index to start at page 1.
   page-range is [start end] (1-indexed, inclusive).
   Returns text with === PAGE N === markers re-numbered from 1."
  [text [start-page end-page]]
  (let [pages-map (split-text-by-pages text)]
    (->> (range start-page (inc end-page))
         ;; Maps are functions of their keys; pages missing from the text are skipped.
         (keep pages-map)
         (map-indexed (fn [idx page-text]
                        (str "=== PAGE " (inc idx) " ===\n" page-text)))
         (str/join "\n"))))
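
The `(inc idx)` offset in split-text-by-pages relies on how clojure.string/split treats a marker at the very start of the text: the result begins with an empty string, so the text for the Nth marker sits one position later. A small REPL check using only clojure.string illustrates this; the sample text is made up.

```clojure
(require '[clojure.string :as str])

(def sample "=== PAGE 1 ===\nA\n=== PAGE 2 ===\nB")

;; The marker at position 0 yields a leading empty string.
(def parts (str/split sample #"=== PAGE \d+ ===\n?"))
parts ;; => ["" "A\n" "B"]

;; re-seq with a capture group returns [whole-match group] pairs.
(def markers (re-seq #"=== PAGE (\d+) ===" sample))
(map second markers) ;; => ("1" "2")
```

Any text that precedes the first marker would land in that same leading slot and is therefore ignored by the page map.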

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.transcription-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: All pass.

git add src/com/getorcha/workers/ap/ingestion/transcription.clj test/com/getorcha/workers/ap/ingestion/transcription_test.clj
git commit -m "feat: add slice-text-by-page-range for multi-doc text splitting"

Task 4: PDF Splitting Utility

Make the existing extract-pages function in transcription.clj public so the orchestrator can use it for splitting PDFs.

Files:
- Modify: src/com/getorcha/workers/ap/ingestion/transcription.clj

Remove ^:private from extract-pages in src/com/getorcha/workers/ap/ingestion/transcription.clj (line 84) to make it public:

(defn extract-pages
  "Extract specific pages (0-indexed) from a PDF into a new PDF document.
   Returns byte array of the new PDF."
  ^bytes [^bytes pdf-bytes page-indices]
  (with-open [source-doc (Loader/loadPDF pdf-bytes)
              new-doc    (PDDocument.)]
    (doseq [i page-indices]
      (.importPage new-doc (.getPage source-doc i)))
    (let [baos (ByteArrayOutputStream.)]
      (.save new-doc baos)
      (.toByteArray baos))))

git add src/com/getorcha/workers/ap/ingestion/transcription.clj
git commit -m "feat: make extract-pages public for PDF splitting"
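
One detail worth keeping in view: extract-pages takes 0-indexed page indices, while segment page ranges elsewhere in this plan are 1-indexed and inclusive. The conversion can be sketched as follows; `page-range->indices` and `page-range-count` are hypothetical helper names, and the gate code later in the plan inlines the same arithmetic.

```clojure
;; Hypothetical helpers: convert a 1-indexed inclusive [start end] page
;; range into the 0-indexed page indices that extract-pages expects, and
;; count the pages the range covers.
(defn page-range->indices
  [[start end]]
  (range (dec start) end))

(defn page-range-count
  [[start end]]
  (- (inc end) start))

(page-range->indices [2 3]) ;; => (1 2)
(page-range->indices [4 4]) ;; => (3)
(page-range-count [2 3])    ;; => 2
```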

Task 5: Classification Prompt — Multi-Document Detection

Enhance the classification prompt to detect multiple documents in multi-page PDFs and return a vector of segments.

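Files:
- Modify: src/com/getorcha/workers/ap/ingestion/classification.clj
- Test: test/com/getorcha/workers/ap/ingestion/classification_test.clj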

Add to test/com/getorcha/workers/ap/ingestion/classification_test.clj:

(deftest parse-multi-segment-classification-test
  (testing "single document returns one-element vector"
    (let [response {"document-type" "invoice"
                    "invoice-subtype" "standard-invoice"
                    "description" "Hotel invoice"
                    "confidence" "high"
                    "segments" nil}
          result (classification/parse-classification-response response)]
      (is (= 1 (count result)))
      (is (= "invoice" (:document-type (first result))))))

  (testing "multi-document returns vector of segments"
    (let [response {"segments" [{"document-type" "invoice"
                                 "invoice-subtype" "standard-invoice"
                                 "pages" [1 1]
                                 "description" "Invoice for Franz Dorfer"
                                 "confidence" "high"}
                                {"document-type" "invoice"
                                 "invoice-subtype" "standard-invoice"
                                 "pages" [2 2]
                                 "description" "Invoice for Christian Müller"
                                 "confidence" "high"}
                                {"document-type" "financial-notice"
                                 "notice-type" "payment-reminder"
                                 "notice-metadata" {"reference-number" "12345"
                                                    "amount" 100.0}
                                 "pages" [3 3]
                                 "description" "Payment reminder"
                                 "confidence" "high"}]}
          result (classification/parse-classification-response response)]
      (is (= 3 (count result)))
      (is (= "invoice" (:document-type (first result))))
      (is (= [1 1] (:pages (first result))))
      (is (= "financial-notice" (:document-type (nth result 2))))
      (is (= "payment-reminder" (:notice-type (nth result 2))))
      (is (some? (:notice-metadata (nth result 2)))))))

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in)"
Expected: FAIL, because parse-classification-response doesn't exist yet.

Add a new public function to src/com/getorcha/workers/ap/ingestion/classification.clj:

(defn parse-classification-response
  "Parse classification LLM response into a vector of segment maps.
   If the response contains a 'segments' array, each segment is parsed independently.
   If no segments array, wraps the single classification as a one-element vector.
   Each segment has :document-type, :pages, :confidence, and type-specific fields.
   The raw response is keyed by strings (as in the test fixtures), so 'segments'
   and 'pages' are looked up with string keys."
  [response]
  (if-let [segments (seq (get response "segments"))]
    ;; Multi-document: parse each segment
    (mapv (fn [segment]
            (let [parsed (parse-classification-result segment)]
              (assoc parsed :pages (get segment "pages"))))
          segments)
    ;; Single document: wrap existing classification
    [(parse-classification-result response)]))

Modify the classification prompt (the defmethod ai.prompts/-prompt :classification form). Add multi-document detection instructions. After the existing rules, add:

MULTI-DOCUMENT DETECTION (for multi-page documents only):
If this document contains MULTIPLE DISTINCT documents (e.g., separate invoices for different
guests, different invoice numbers, different issuers), return a "segments" array instead
of a single classification. Each segment must include a "pages" field with [start, end]
(1-indexed, inclusive).

Signals of SEPARATE documents:
- Different invoice numbers or document identifiers
- Different issuer/header blocks appearing on separate pages
- Separate totals sections for each document
- New document titles (e.g., "Rechnung", "Invoice") appearing on a new page

Signals of the SAME document continuing:
- Same invoice number across pages
- "Page 2 of 3" or similar continuation markers
- Line items continuing without a new header
- Appendix pages (terms & conditions, etc.) following the main document

For multi-document response, return:
{
  "segments": [
    {"document-type": "invoice", "invoice-subtype": "standard-invoice",
     "pages": [1, 1], "description": "...", "confidence": "high"},
    {"document-type": "invoice", "invoice-subtype": "standard-invoice",
     "pages": [2, 3], "description": "...", "confidence": "high"},
    ...
  ]
}

For financial-notice segments, include notice-type and notice-metadata in the segment.

If all pages belong to one document, do NOT use the segments format.
Return the normal single-classification JSON.

Also update the classify! function in classification.clj to send the full text (not just first page) for multi-page documents:

;; In classify!, change the text selection:
;; For single-page docs: use first-page-text (as today)
;; For multi-page docs: use full text so the LLM can detect boundaries
first-page-text (extract-first-page-text text)
prompt-text     (if (and (some? page-count) (> page-count 1))
                  text
                  first-page-text)

Modify the return value of classification/classify! to return segments:

;; At the end of classify!, replace parse-classification-result with parse-classification-response:
classification-segments (parse-classification-response raw-json)
;; ...
(cond-> {:segments       classification-segments
         :classification (first classification-segments)  ;; segment[0] for backward compat
         :stats          stats}
  (= "financial-notice" (:document-type (first classification-segments)))
  (assoc :short-circuit? true))
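
Since the page ranges in segments come from LLM output, it may be worth checking them before any PDF is split. The following is a minimal sketch of such a check and is not part of the plan's code: `contiguous-segments?` is a hypothetical name, and the rule that segments must cover every page in order without gaps or overlaps is an assumption this plan does not state.

```clojure
;; Hypothetical sanity check: segments must be ordered, non-overlapping,
;; and cover pages 1..page-count exactly.
(defn contiguous-segments?
  [segments page-count]
  (let [ranges (map :pages segments)]
    (boolean
     (and (seq ranges)
          ;; each range is [start end] with 1 <= start <= end
          (every? (fn [[s e]] (and (int? s) (int? e) (<= 1 s e))) ranges)
          (= 1 (ffirst ranges))
          (= page-count (second (last ranges)))
          ;; each range starts right after the previous one ends
          (every? (fn [[[_ e] [s _]]] (= s (inc e)))
                  (partition 2 1 ranges))))))

(contiguous-segments? [{:pages [1 1]} {:pages [2 3]}] 3) ;; => true
(contiguous-segments? [{:pages [1 2]} {:pages [2 3]}] 3) ;; => false (overlap)
(contiguous-segments? [{:pages [1 1]} {:pages [3 3]}] 3) ;; => false (gap)
```

If a check like this fails, one conservative option would be to fall back to treating the PDF as a single document.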

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion.classification-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)" Expected: All pass.

git add src/com/getorcha/workers/ap/ingestion/classification.clj test/com/getorcha/workers/ap/ingestion/classification_test.clj
git commit -m "feat: classification detects multi-document PDFs and returns segments"

Task 6: Segmentation Gate

Implement the segmentation gate in the orchestrator. This is the core of the feature.

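Files:
- Modify: src/com/getorcha/workers/ap/ingestion.clj
- Test: test/com/getorcha/workers/ap/ingestion_test.clj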

Add to test/com/getorcha/workers/ap/ingestion_test.clj:

(deftest segmentation-gate-splits-multi-document-pdf
  (testing "gate creates child documents for multi-segment classification"
    (let [{:keys [ingestion-id document-id]}
          (create-test-document+ingestion!
           fixtures/*db* fixtures/*system*
           {:upload-to-s3? true
            :content (test-multi-invoice-pdf-bytes)})  ;; helper producing 4-page PDF
          context (::ingestion/orchestrator fixtures/*system*)
          segments [{:document-type "invoice"
                     :invoice-subtype "standard-invoice"
                     :pages [1 1]
                     :confidence "high"
                     :document-description "Invoice 1"}
                    {:document-type "invoice"
                     :invoice-subtype "standard-invoice"
                     :pages [2 2]
                     :confidence "high"
                     :document-description "Invoice 2"}]
          ingestion {:id ingestion-id
                     :document {:document/id document-id
                                :document/legal-entity-id test-legal-entity-id
                                :document/file-path (str "documents/" document-id ".pdf")
                                :document/source-metadata {:from "test@example.com"}}
                     :transcription-result {:text (str "=== PAGE 1 ===\nInvoice 1\n"
                                                       "=== PAGE 2 ===\nInvoice 2")
                                            :page-count 2}
                     :file {:contents (test-multi-invoice-pdf-bytes)
                            :mime-type "application/pdf"}}]
      ;; Run the gate
      (#'ingestion/segmentation-gate! context ingestion segments "test-commit-sha")
      ;; Verify child document was created
      (let [children (db.sql/execute!
                      fixtures/*db*
                      {:select [:*]
                       :from   [:document]
                       :where  [:= :source-document-id document-id]})]
        (is (= 1 (count children)))
        (is (= "invoice" (name (:document/type (first children)))))
        ;; Verify child ingestion has cached transcription and classification
        (let [child-ingestion (db.sql/execute-one!
                               fixtures/*db*
                               {:select [:*]
                                :from   [:ap-ingestion]
                                :where  [:= :document-id (:document/id (first children))]})]
          (is (some? (:ap-ingestion/transcription-file-path child-ingestion)))
          (is (some? (:ap-ingestion/classification-result child-ingestion))))))))

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in)"
Expected: FAIL, because segmentation-gate! doesn't exist.

Add to src/com/getorcha/workers/ap/ingestion.clj, before the job-handler function:

(defn ^:private segmentation-gate!
  "Splits a multi-document PDF into separate documents.
   For segments beyond segment[0]:
   - Creates child document + ap_ingestion records
   - Splits PDF and uploads chunks to S3
   - Uploads sliced transcription EDN per child
   - Routes by type: notice/other completed inline, extractable types queued via SQS
   For segment[0]: archives original PDF, replaces with trimmed version, updates transcription.
   All sub-steps are idempotent for retry safety."
  [{:keys [db-pool ^S3Client s3-client ^SqsClient sqs-client]
    {:keys [s3-buckets queue-urls]} :aws
    :as context}
   {:keys [id document file transcription-result] :as ingestion}
   segments
   commit-sha]
  (let [storage-bucket   (:storage s3-buckets)
        ingestion-queue  (:ingestion queue-urls)
        document-id      (:document/id document)
        legal-entity-id  (:document/legal-entity-id document)
        source-metadata  (:document/source-metadata document)
        pdf-bytes        (:contents file)
        full-text        (:text transcription-result)
        ;; Check for existing children (idempotency)
        existing-children (db.sql/execute!
                           db-pool
                           {:select [:id]
                            :from   [:document]
                            :where  [:= :source-document-id document-id]})
        {:ap-ingestion/keys [doc-source-id uploaded-by]}
                         (db.sql/execute-one!
                          db-pool
                          {:select [:doc-source-id :uploaded-by]
                           :from   [:ap-ingestion]
                           :where  [:= :id id]})]
    (when (empty? existing-children)
      (log/info "Segmentation gate: splitting document"
                {:document-id document-id
                 :segment-count (count segments)})
      ;; Process segments 1..N (children)
      (doseq [segment (rest segments)]
        (let [child-doc-id    (random-uuid)
              child-ing-id    (random-uuid)
              [start end]     (:pages segment)
              ;; 0-indexed page indices for PDFBox
              page-indices    (range (dec start) end)
              child-pdf-bytes (transcription/extract-pages pdf-bytes page-indices)
              child-file-path (str "documents/" child-doc-id ".pdf")
              child-text      (transcription/slice-text-by-page-range full-text [start end])
              child-trans-edn (pr-str (assoc transcription-result
                                             :text child-text
                                             :page-count (- (inc end) start)))
              child-trans-path (str "ingestions/" child-ing-id "/transcription-output.edn")
              content-hash    (str (java.util.UUID/randomUUID))]  ;; unique hash for split chunk
          ;; 1. Create child document
          (db.sql/execute-one!
           db-pool
           {:insert-into :document
            :values      [{:id                 child-doc-id
                           :legal-entity-id    legal-entity-id
                           :type               (db.sql/->cast (:document-type segment) :document-type)
                           :content-hash       content-hash
                           :file-path          child-file-path
                           :file-original-name (:document/file-original-name document)
                           :source-metadata    [:lift (or source-metadata {})]
                           :source-document-id document-id}]})
          ;; 2. Create child ingestion (preserve XOR: doc-source-id OR uploaded-by)
          (db.sql/execute-one!
           db-pool
           {:insert-into :ap-ingestion
            :values      [(cond-> {:id                      child-ing-id
                                   :document-id              child-doc-id
                                   :transcription-file-path  child-trans-path
                                   :classification-result    [:lift (dissoc segment :pages)]}
                            doc-source-id (assoc :doc-source-id doc-source-id)
                            uploaded-by   (assoc :uploaded-by uploaded-by))]})
          ;; 3. Upload split PDF and transcription to S3
          (aws/put-object! s3-client storage-bucket child-file-path
                           child-pdf-bytes "application/pdf")
          (aws/put-object! s3-client storage-bucket child-trans-path
                           child-trans-edn "text/plain; charset=utf-8")
          ;; 4. Route by type
          (cond
            (= "financial-notice" (:document-type segment))
            (complete-notice! context commit-sha
                              {:id child-ing-id
                               :document {:document/id child-doc-id}
                               :classification (dissoc segment :pages)
                               :classification-stats {:input-tokens 0 :output-tokens 0
                                                      :started-at (java.time.Instant/now)
                                                      :ended-at (java.time.Instant/now)
                                                      :model "inherited"}})

            (= "other" (:document-type segment))
            (complete-other! context commit-sha
                             {:id child-ing-id
                              :document {:document/id child-doc-id
                                         :document/file-path child-file-path
                                         :document/source-metadata source-metadata}
                              :classification (dissoc segment :pages)
                              :classification-stats {:input-tokens 0 :output-tokens 0
                                                     :started-at (java.time.Instant/now)
                                                     :ended-at (java.time.Instant/now)
                                                     :model "inherited"}})

            :else
            (aws/send-message! sqs-client ingestion-queue (str child-ing-id))))))
    ;; Process segment[0]: trim document 1
    (let [[start end] (:pages (first segments))
          page-indices (range (dec start) end)]
      ;; Archive original PDF
      (aws/put-object! s3-client storage-bucket
                       (str "documents/" document-id "/original.pdf")
                       pdf-bytes "application/pdf")
      ;; Replace with trimmed PDF
      (let [trimmed-pdf (transcription/extract-pages pdf-bytes page-indices)]
        (aws/put-object! s3-client storage-bucket
                         (:document/file-path document)
                         trimmed-pdf "application/pdf"))
      ;; Update transcription with sliced text
      (let [sliced-text      (transcription/slice-text-by-page-range full-text [start end])
            sliced-result    (assoc transcription-result
                                   :text sliced-text
                                   :page-count (- (inc end) start))
            transcription-path (str "ingestions/" id "/transcription-output.edn")]
        (aws/put-object! s3-client storage-bucket transcription-path
                         (pr-str sliced-result) "text/plain; charset=utf-8")))))
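
The routing in step 4 of the gate can be read as a small pure decision on the segment's document type. The sketch below restates it in isolation; `child-route` and the keyword names it returns are made up for this example and do not exist in the codebase.

```clojure
;; Illustrative pure function mirroring the cond in step 4 of the gate.
;; The returned keywords are hypothetical labels for the three branches.
(defn child-route
  [{:keys [document-type]}]
  (case document-type
    "financial-notice" :complete-notice-inline
    "other"            :complete-other-inline
    :enqueue-for-extraction))

(child-route {:document-type "financial-notice"}) ;; => :complete-notice-inline
(child-route {:document-type "other"})            ;; => :complete-other-inline
(child-route {:document-type "invoice"})          ;; => :enqueue-for-extraction
```

Only the last branch goes through SQS; the first two finish the child ingestion inside the parent's job.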

Run: clj -X:test:silent :nses '[com.getorcha.workers.ap.ingestion-test]' 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: All pass.

git add src/com/getorcha/workers/ap/ingestion.clj test/com/getorcha/workers/ap/ingestion_test.clj
git commit -m "feat: implement segmentation gate for multi-document PDF splitting"

Task 7: Wire Segmentation Gate into the Orchestrator

Connect the gate to the classify! return value and the job-handler pipeline.

Files:
- Modify: src/com/getorcha/workers/ap/ingestion.clj

The classify! function needs to also return the full segments vector from classification/classify!. Modify the non-cached branch to capture and return segments:

;; In the LLM classification branch, after calling classification/classify!:
(let [{:keys [stats classification segments short-circuit?]} (classification/classify! context ingestion)]
  ;; Persist segment[0] to classification_result
  (db.sql/execute-one!
   db-pool
   {:update :ap-ingestion
    :set    {:classification-result [:lift classification]}
    :where  [:= :id id]})
  ;; Set document type early
  (db.sql/execute-one!
   db-pool
   {:update :document
    :set    {:type (db.sql/->cast (:document-type classification) :document-type)}
    :where  [:= :id (get-in ingestion [:document :document/id])]})
  (cond-> (assoc ingestion
                 :classification classification
                 :classification-stats stats
                 :segments segments)  ;; <-- NEW: pass segments through
    short-circuit? (assoc :short-circuit? true)))

Modify the job-handler function (lines 683-713). After classification, check for multiple segments and run the gate before routing:

;; Replace lines 683-713 with:
(let [classified (->> pipeline-state
                      (fetch-files-from-s3! context)
                      (transcribe! context)
                      (classify! context))
      segments   (:segments classified)
      ;; Run segmentation gate if multiple documents detected
      _          (when (and segments (> (count segments) 1))
                   (segmentation-gate! context classified segments commit-sha))
      doc-type   (get-in classified [:classification :document-type])]
  (cond
    ;; Financial notices: short-circuit with notification
    (:short-circuit? classified)
    (let [document (complete-notice! context commit-sha classified)]
      (publish-document-ready! context (:document/id document)))

    ;; "Other" documents: skip extraction, no notification
    (= "other" doc-type)
    (complete-other! context commit-sha classified)

    ;; All other types: full extraction pipeline
    :else
    (let [document (->> classified
                        (extract! context)
                        (with-validations context)
                        (post-process! context)
                        (complete-ingestion! db-pool commit-sha :completed))]
      ;; ... rest unchanged

Run: clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: All pass.

git add src/com/getorcha/workers/ap/ingestion.clj
git commit -m "feat: wire segmentation gate into ingestion pipeline"

Task 8: Lint and Fix

Run: clj-kondo --lint src test dev
Expected: No new warnings.

Run: clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Ran .* tests)"
Expected: All pass.

git add -A
git commit -m "fix: lint cleanup for multi-doc splitting"

Task 9: Manual Integration Test

Test with the real multi-invoice document from the problem statement.

Use the /reingest-doc skill to re-process document 019d2a63-49a2-70b5-a345-d70ce0b7bf15.

Run: psql -h localhost -U postgres -d orcha -c "SELECT id, type, source_document_id, file_path FROM document WHERE source_document_id = '019d2a63-49a2-70b5-a345-d70ce0b7bf15'"
Expected: 3 child documents (pages 2, 3, 4).

Run: psql -h localhost -U postgres -d orcha -c "SELECT i.id, i.status, i.classification_result->>'document-type' as doc_type, d.source_document_id FROM ap_ingestion i JOIN document d ON d.id = i.document_id WHERE d.source_document_id = '019d2a63-49a2-70b5-a345-d70ce0b7bf15'"
Expected: 3 ingestions, all completed, all type "invoice".

Run: bb dev:aws-cli s3 ls s3://v1-orcha-global-storage-local-stack/documents/019d2a63-49a2-70b5-a345-d70ce0b7bf15
Expected: Both 019d2a63-49a2-70b5-a345-d70ce0b7bf15.pdf (trimmed) and 019d2a63-49a2-70b5-a345-d70ce0b7bf15/original.pdf (archived).

Run: psql -h localhost -U postgres -d orcha -c "SELECT d.id, d.structured_data->>'invoice-number' as inv_num, d.structured_data->'issuer'->>'name' as issuer FROM document d WHERE d.id = '019d2a63-49a2-70b5-a345-d70ce0b7bf15' OR d.source_document_id = '019d2a63-49a2-70b5-a345-d70ce0b7bf15' ORDER BY d.created_at"
Expected: 4 documents with different invoice numbers (315002605076, 315002605079, 315002605080, 315002605081).