Multi-Document PDF Splitting

Problem

The ingestion pipeline assumes one document per PDF. When a single PDF contains multiple documents (e.g., a hotel sending 4 separate invoices for 4 guests in one file), only the first document is extracted. The rest are silently ignored.

Example: document 019d2a63-49a2-70b5-a345-d70ce0b7bf15 — a 4-page PDF from Hotel Schillerpark with 4 separate invoices (different invoice numbers, guests, rooms, totals). The pipeline extracted only invoice #1 (Franz Dorfer, 439.20 EUR) and ignored invoices #2-#4.

This is not an invoice-specific problem. Any document type can appear in a multi-document PDF.

Design

Pipeline Overview

Current:

fetch-from-s3 → transcribe → classify → [route by type] → extract/notice/other → complete

New:

fetch-from-s3 → transcribe → classify → [segmentation gate] → [route by type] → extract/notice/other → complete

The segmentation gate is a new step in the orchestrator that runs after classification. If classification detects multiple documents, the gate splits the PDF and creates independent child documents.

Classification Changes

Classification is enhanced to detect document boundaries within multi-page PDFs.

Single-page documents: Classified as today. Returns a one-element vector.

Multi-page documents: The prompt reads all pages and detects boundaries between distinct documents. Signals of a new document: different invoice number/header, different issuer block, separate totals section, new document title. Signals of continuation: same invoice number, "Page 2 of 3", continuation of line items without new header.

Classification returns a vector of segment maps:

[{:document-type "invoice"
  :invoice-subtype "standard-invoice"
  :pages [1 1]
  :confidence "high"
  :document-description "Hotel invoice for Franz Dorfer"}
 {:document-type "invoice"
  :invoice-subtype "standard-invoice"
  :pages [2 2]
  :confidence "high"
  :document-description "Hotel invoice for Christian Müller"}
 {:document-type "financial-notice"
  :notice-type "payment-reminder"
  :notice-metadata {:reference-number "12345" :amount 100.00 :currency "EUR"}
  :pages [3 3]
  :confidence "high"
  :document-description "Payment reminder from supplier X"}]

Each segment includes full classification for its type — including notice metadata for financial notices, since financial notices short-circuit (classification IS the structured data, no extraction runs).
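Because the gate relies on each segment's :pages range, a sanity check on the returned vector may be useful before any splitting happens. The following is a minimal sketch, not part of the design: it assumes :pages is an inclusive [start end] pair, and it assumes segments should be contiguous and cover the whole PDF. The function name valid-segments? and the page-count argument are hypothetical.

```clojure
(defn valid-segments?
  "Returns true when every :pages range is a well-formed [start end] pair,
  the first range starts at page 1, consecutive ranges are contiguous, and
  the last range ends at page-count. The contiguity and full-coverage rules
  are assumptions of this sketch."
  [segments page-count]
  (let [ranges (map :pages segments)]
    (boolean
     (and (seq ranges)
          ;; each range must satisfy 1 <= start <= end
          (every? (fn [[start end]] (<= 1 start end)) ranges)
          (= 1 (ffirst ranges))
          (= page-count (second (last ranges)))
          ;; each range must begin right after the previous one ends
          (every? (fn [[[_ prev-end] [next-start _]]]
                    (= next-start (inc prev-end)))
                  (partition 2 1 ranges))))))
```

A vector that fails this check could be treated as a single segment, so the pipeline falls back to today's behaviour rather than splitting on bad boundaries. That fallback is a suggestion only.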

Segmentation Gate

The gate lives in the orchestrator and runs after classify!. It checks (> (count segments) 1).

If single segment: Continue as today.

If multiple segments, for each segment beyond segment[0]:

  1. Split PDF via PDFBox — extract segment's pages, upload to S3
  2. Create ap_document with:
  3. Create ap_doc_ingestion with:
  4. Route by segment type:

For document 1 (segment[0]):

  1. Archive original PDF to documents/{document-id}/original.pdf
  2. Replace main S3 file with trimmed PDF (segment[0]'s pages only)
  3. Update cached transcription — upload sliced EDN, update transcription_file_path
  4. Continue pipeline with segment[0]'s classification
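The gate's decision can be sketched as a pure planning function that separates segment[0] from the children and tags each child with how it would be completed. This is an illustrative sketch only: plan-gate, child-action, and the extractable-types set are hypothetical names, and the assumption that only "invoice" needs an extraction run is made up for the example. The PDFBox, S3, and database side effects are left out.

```clojure
(def extractable-types
  "Assumed for illustration: document types that need an extraction run."
  #{"invoice"})

(defn child-action
  "Extractable children are queued; notice/other children complete inline."
  [segment]
  (if (contains? extractable-types (:document-type segment))
    :queue-via-sqs
    :complete-inline))

(defn plan-gate
  "Splits the classification vector into the primary segment (segment[0])
  and the child segments, without performing any side effects."
  [segments]
  (let [[primary & children] segments]
    {:split?   (> (count segments) 1)
     :primary  primary
     :children (mapv (fn [seg] {:segment seg :action (child-action seg)})
                     children)}))
```

Keeping the plan separate from the side effects makes the branching easy to unit test; the orchestrator would then walk :children and perform the real work for each entry.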

Ordering and Idempotency

Gate operation order:

  1. Create all child documents and ingestions (DB)
  2. Upload split PDFs and sliced transcription EDNs (S3)
  3. Complete notice/other children inline
  4. Queue extractable children via SQS
  5. Archive original PDF
  6. Replace document 1's S3 file with trimmed pages
  7. Update document 1's cached transcription

Idempotency: Each sub-step is independently idempotent:

On retry, the gate runs all steps unconditionally. No partial-progress tracking needed.

Database Changes

New column on ap_ingestion:

New column on ap_document:

classify! Orchestrator Changes

  1. Query ap_ingestion.classification_result for this ingestion
  2. If not null → use as classification (single segment), skip LLM
  3. If null → run classification/classify!, which returns a vector of segments
  4. Persist segment[0] to ap_ingestion.classification_result for this ingestion (retry idempotency)
  5. Segment[0] is the classification for the current document; remaining segments feed the gate
  6. document.type update uses segment[0]'s document-type
  7. short-circuit? derived from segment[0]
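The lookup-then-classify flow in steps 1 to 5 can be sketched as follows. This is a simplified illustration, not the orchestrator's code: an atom stands in for the ap_ingestion table, run-llm! stands in for classification/classify!, and the function name classify-with-cache! and the returned map keys are hypothetical.

```clojure
(defn classify-with-cache!
  "store is an atom of ingestion-id -> row, standing in for ap_ingestion.
  run-llm! is a zero-argument function returning a vector of segments.
  Both are stand-ins invented for this sketch."
  [store ingestion-id run-llm!]
  (if-let [cached (get-in @store [ingestion-id :classification-result])]
    ;; Retry path: reuse the persisted segment[0] and skip the LLM.
    {:classification     cached
     :remaining-segments []
     :from-cache?        true}
    ;; First run: classify, persist segment[0], hand the rest to the gate.
    (let [segments (run-llm!)
          primary  (first segments)]
      (swap! store assoc-in [ingestion-id :classification-result] primary)
      {:classification     primary
       :remaining-segments (vec (rest segments))
       :from-cache?        false})))
```

Note that in this sketch the cached path yields no remaining segments, because only segment[0] is persisted.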

Transcription Slicing

The full OCR text uses === PAGE N === markers. The gate slices the text at these markers and re-indexes page numbers so that each child's transcription starts at page 1. Each slice is uploaded as EDN to ingestions/{ingestion-id}/transcription-output.edn, following the existing convention.
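A minimal sketch of the slicing and re-indexing, assuming each === PAGE N === marker sits on its own line and that :pages is an inclusive [start end] pair. The function name slice-transcription is hypothetical, and wrapping the result in EDN and uploading it are left out.

```clojure
(require '[clojure.string :as str])

(defn slice-transcription
  "Returns the text of pages start..end (inclusive) with the page markers
  renumbered so that the slice begins at page 1."
  [text [start end]]
  (let [marker #"=== PAGE (\d+) ==="]
    (->> (str/split-lines text)
         (reduce
          (fn [{:keys [keep?] :as acc} line]
            (if-let [[_ n] (re-matches marker line)]
              ;; Marker line: decide whether the page that follows is kept.
              (let [page (Long/parseLong n)]
                (if (<= start page end)
                  (-> acc
                      (assoc :keep? true)
                      (update :out conj
                              (str "=== PAGE " (inc (- page start)) " ===")))
                  (assoc acc :keep? false)))
              ;; Content line: keep it only while inside the requested range.
              (cond-> acc keep? (update :out conj line))))
          {:keep? false :out []})
         :out
         (str/join "\n"))))
```

For example, slicing pages [2 3] out of a three-page transcription yields a two-page text whose markers read === PAGE 1 === and === PAGE 2 ===. Lines are re-joined with "\n", so a trailing newline in the input is not preserved.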

PDF Splitting

A PDFBox PDDocument is loaded once from the original S3 file. The pages for each segment are extracted and saved as a separate PDF: one load, N writes.

Original PDF Preservation

Before modifying document 1's S3 file, the original multi-document PDF is copied to documents/{document-id}/original.pdf. Because the path follows a fixed convention, the original can be located later for debugging.

Email-Document Linkage

Child documents preserve the email source linkage:

What Doesn't Change

Edge Cases