Multi-Document PDF Splitting

Problem

The ingestion pipeline assumes one document per PDF. When a single PDF contains multiple documents (e.g., a hotel sending 4 separate invoices for 4 guests in one file), only the first document is extracted. The rest are silently ignored.

Example: document 019d2a63-49a2-70b5-a345-d70ce0b7bf15 — a 4-page PDF from Hotel Schillerpark with 4 separate invoices (different invoice numbers, guests, rooms, totals). The pipeline extracted only invoice #1 (Franz Dorfer, 439.20 EUR) and ignored invoices #2-#4.

This is not an invoice-specific problem. Any document type can appear in a multi-document PDF.

Design

Pipeline Overview

Current:

fetch-from-s3 → transcribe → classify → [route by type] → extract/notice/other → complete

New:

fetch-from-s3 → transcribe → classify → [segmentation gate] → [route by type] → extract/notice/other → complete

The segmentation gate is a new step in the orchestrator that runs after classification. If classification detects multiple documents, the gate splits the PDF and creates independent child documents.

Classification Changes

Classification is enhanced to detect document boundaries within multi-page PDFs.

Single-page documents: Classified as today. Returns a one-element vector.

Multi-page documents: The prompt reads all pages and detects boundaries between distinct documents. Signals of a new document: different invoice number/header, different issuer block, separate totals section, new document title. Signals of continuation: same invoice number, "Page 2 of 3", continuation of line items without new header.

Classification returns a vector of segment maps:

[{:document-type "invoice"
  :invoice-subtype "standard-invoice"
  :pages [1 1]
  :confidence "high"
  :document-description "Hotel invoice for Franz Dorfer"}
 {:document-type "invoice"
  :invoice-subtype "standard-invoice"
  :pages [2 2]
  :confidence "high"
  :document-description "Hotel invoice for Christian Müller"}
 {:document-type "financial-notice"
  :notice-type "payment-reminder"
  :notice-metadata {:reference-number "12345" :amount 100.00 :currency "EUR"}
  :pages [3 3]
  :confidence "high"
  :document-description "Payment reminder from supplier X"}]

Each segment includes full classification for its type — including notice metadata for financial notices, since financial notices short-circuit (classification IS the structured data, no extraction runs).
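
Because the gate trusts the `:pages` ranges when splitting, it may be worth sanity-checking the vector before acting on it. The following is a minimal sketch under that assumption, not part of the design: `contiguous-cover?` is a hypothetical helper that checks that the ranges are in order, do not overlap, and cover pages 1 through `page-count`.

```clojure
(defn contiguous-cover?
  "True when the segments' [first last] page ranges tile pages 1..page-count
  in order, with no gaps or overlaps. Hypothetical helper for illustration."
  [segments page-count]
  (let [ranges (map :pages segments)]
    (boolean
     (and (seq ranges)
          ;; every range is well-formed and 1-based
          (every? (fn [[start end]] (<= 1 start end)) ranges)
          ;; the first range starts at page 1, the last ends at the final page
          (= 1 (ffirst ranges))
          (= page-count (second (last ranges)))
          ;; each range starts right after the previous one ends
          (every? (fn [[[_ prev-end] [next-start _]]]
                    (= next-start (inc prev-end)))
                  (partition 2 1 ranges))))))
```

For the three-segment example above, `(contiguous-cover? segments 3)` holds; a vector that skips a page or leaves trailing pages unassigned does not pass.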

Segmentation Gate

The gate lives in the orchestrator and runs after classify!. It checks whether classification returned more than one segment: (> (count segments) 1).

If single segment: Continue as today.

If multiple segments, for each segment beyond segment[0]:

- Split PDF via PDFBox: extract the segment's pages, upload to S3
- Create ap_document with:
  - type pre-set from segment classification
  - source_metadata copied from parent document
  - source_document_id set to parent document's ID (lineage)
- Create ap_doc_ingestion with:
  - doc_source_id from parent ingestion (preserves email linkage)
  - transcription_file_path → pre-uploaded sliced transcription EDN (pages re-indexed to start at 1)
  - classification_result → pre-filled with segment's classification
- Route by segment type:
  - Financial notice → complete-notice! inline (no SQS queue needed)
  - "Other" → complete-other! inline (no SQS queue needed)
  - Invoice/contract/PO/GRN → queue SQS message for ingestion (worker picks up, skips transcription + classification via cached values, proceeds to extraction)
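
The routing rule for child segments can be sketched as a small lookup. This is an illustration only: the returned keywords are hypothetical labels standing in for complete-notice!, complete-other! and the SQS queueing step, and the `"other"` type string is an assumption (only `"invoice"` and `"financial-notice"` appear in the example classification).

```clojure
(def inline-types
  "Document types the gate completes inline, mapped to a label for the step.
  The \"other\" string is assumed; the labels are hypothetical."
  {"financial-notice" :complete-notice
   "other"            :complete-other})

(defn route-child
  "Returns the action label for one child segment. Anything that is not
  completed inline is queued for extraction."
  [segment]
  (get inline-types (:document-type segment) :queue-for-extraction))

(defn child-actions
  "Pairs every segment after segment[0] with its action label.
  Returns an empty vector when there is only one segment."
  [segments]
  (mapv (fn [segment] [(route-child segment) segment])
        (rest segments)))
```

With a single-segment vector `child-actions` returns `[]`, which matches the gate being a no-op in that case.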

For document 1 (segment[0]):

Archive original PDF to documents/{document-id}/original.pdf
Replace main S3 file with trimmed PDF (segment[0]'s pages only)
Update cached transcription — upload sliced EDN, update transcription_file_path
Continue pipeline with segment[0]'s classification

Ordering and Idempotency

Gate operation order:

1. Create all child documents and ingestions (DB)
2. Upload split PDFs and sliced transcription EDNs (S3)
3. Complete notice/other children inline
4. Queue extractable children via SQS
5. Archive original PDF
6. Replace document 1's S3 file with trimmed pages
7. Update document 1's cached transcription

Idempotency: Each sub-step is independently idempotent:

DB inserts use INSERT ... ON CONFLICT DO NOTHING or existence checks
S3 uploads are idempotent (PUT overwrites)
SQS duplicate messages handled by existing claim-ingestion! dedup
complete-notice! / complete-other! check ingestion status before running, skip if already completed

On retry, the gate runs all steps unconditionally. No partial-progress tracking needed.
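
A minimal sketch of "run everything unconditionally", assuming each step is supplied as a zero-argument function that is itself idempotent. The step keywords and `run-gate!` are hypothetical names for the seven operations listed above, not existing code.

```clojure
(def gate-step-order
  "Hypothetical labels for the gate operations, in the order given above."
  [:create-children
   :upload-child-files
   :complete-inline-children
   :queue-extractable-children
   :archive-original
   :trim-document-1
   :update-document-1-transcription])

(defn run-gate!
  "Runs every step in order, with no partial-progress checks.
  `steps` maps each label in gate-step-order to a zero-argument function;
  correctness on retry relies on each function being idempotent."
  [steps]
  (doseq [step-name gate-step-order]
    (let [step (get steps step-name)]
      (step))))
```

Calling `run-gate!` a second time with the same idempotent steps should leave DB and S3 state unchanged, which is why no progress marker is stored.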

Database Changes

New column on ap_ingestion:

classification_result — JSONB, nullable, default null
classify! checks this before running the LLM: if set, uses it, skips LLM. If null, runs classification normally.
After successful LLM classification, classify! persists the result to this column. This makes classification idempotent on retries for all documents (general improvement, not just for splits).

New column on ap_document:

source_document_id — UUID, nullable, FK to document(id), default null
Set on child documents created by the segmentation gate
Lineage/traceability for debugging and audit. Not used for pipeline logic.

`classify!` Orchestrator Changes

Query ap_ingestion.classification_result for this ingestion
If not null → use as classification (single segment), skip LLM
If null → run classification/classify!, which returns a vector of segments
Persist segment[0] to ap_ingestion.classification_result for this ingestion (retry idempotency)
Segment[0] is the classification for the current document; remaining segments feed the gate
document.type update uses segment[0]'s document-type
short-circuit? derived from segment[0]
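
The lookup-then-classify flow above might look roughly like the sketch below. All three dependencies are injected, hypothetical functions: `load-cached` reads classification_result for the ingestion, `run-llm` stands in for classification/classify!, and `persist!` writes the column.

```clojure
(defn classify-with-cache
  "Returns the segment vector for an ingestion.
  A cached classification_result is treated as a single segment and the LLM
  is skipped; otherwise the LLM runs and segment[0] is persisted."
  [{:keys [load-cached run-llm persist!]} ingestion-id]
  (if-some [cached (load-cached ingestion-id)]
    [cached]
    (let [segments (run-llm ingestion-id)]
      ;; only segment[0] is stored; the remaining segments feed the gate
      (persist! ingestion-id (first segments))
      segments)))
```

The first element of the returned vector is the classification for the current document in both branches, so the callers that derive document type and short-circuit? do not need to know which branch ran.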

Transcription Slicing

The full OCR text uses === PAGE N === markers. The gate slices by these markers and re-indexes page numbers so each child's transcription starts at page 1. Uploaded as EDN to ingestions/{ingestion-id}/transcription-output.edn following the existing convention.
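
A sketch of the slicing step, assuming the OCR text is a single string in which every page starts with its own `=== PAGE N ===` line and pages are numbered consecutively from 1. The structure of the surrounding EDN file is not specified here, so the sketch handles only the text; `slice-transcription` is a hypothetical name.

```clojure
(require '[clojure.string :as str])

(def page-marker
  "Matches a full marker line, including its trailing newline."
  #"(?m)^=== PAGE \d+ ===[ \t]*\r?\n?")

(defn slice-transcription
  "Returns the text for the 1-based inclusive page range [first-page last-page],
  with markers renumbered so the slice starts at page 1."
  [text [first-page last-page]]
  (let [;; limit -1 keeps trailing empty pages; the first element is whatever
        ;; precedes the first marker (normally the empty string) and is dropped
        pages    (vec (rest (str/split text page-marker -1)))
        selected (subvec pages (dec first-page) last-page)]
    (str/join
     (map-indexed (fn [i body] (str "=== PAGE " (inc i) " ===\n" body))
                  selected))))
```

For a three-page transcription, slicing `[2 3]` yields a two-page text whose markers read `=== PAGE 1 ===` and `=== PAGE 2 ===`.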

PDF Splitting

The original S3 file is loaded once into a PDFBox PDDocument. Each segment's pages are then extracted and saved as a separate PDF: one load, N writes.
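
One way to get "one load, N writes" is PDFBox's `PageExtractor`, sketched below. This assumes PDFBox is on the classpath and that the caller has already loaded the document (the load call differs by version: `PDDocument/load` in 2.x, `Loader/loadPDF` in 3.x). `split-pdf` is a hypothetical name, and whether the real gate uses `PageExtractor` is not stated in this design.

```clojure
(import '(org.apache.pdfbox.pdmodel PDDocument)
        '(org.apache.pdfbox.multipdf PageExtractor)
        '(java.io ByteArrayOutputStream))

(defn split-pdf
  "Given an already loaded PDDocument and a seq of 1-based inclusive
  [first last] page ranges, returns one PDF byte array per range.
  The source document must stay open until this function returns."
  [^PDDocument source page-ranges]
  (mapv (fn [[first-page last-page]]
          (with-open [part (.extract (PageExtractor. source
                                                     (int first-page)
                                                     (int last-page)))
                      out  (ByteArrayOutputStream.)]
            ;; serialize the extracted pages before `part` is closed
            (.save part out)
            (.toByteArray out)))
        page-ranges))
```

Passing the `:pages` vectors of all segments, for example `(split-pdf doc (map :pages segments))`, produces the byte arrays to upload to S3, with the source parsed only once.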

Original PDF Preservation

Before modifying document 1's S3 file, the original multi-document PDF is copied to documents/{document-id}/original.pdf. The original is discoverable by convention for debugging.

Email-Document Linkage

Child documents preserve the email source linkage:

ap_doc_ingestion.doc_source_id set to the same doc source as the parent ingestion
ap_document.source_metadata copied from the parent document (from, subject, message-id, forwarding chain, etc.)
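
As an illustration, the child rows can be built as plain maps keyed by the column names used in this document. The function names, the `:id` key on the parent, and the map-per-row representation are assumptions; how the rows are actually inserted is outside this sketch.

```clojure
(defn child-document-row
  "Column values for a child ap_document, derived from the parent document
  and the child's segment classification."
  [parent-document segment]
  {:type               (:document-type segment)
   :source_metadata    (:source_metadata parent-document)
   :source_document_id (:id parent-document)})

(defn child-ingestion-row
  "Column values for a child ap_doc_ingestion. `transcription-path` is the
  S3 key of the sliced transcription EDN."
  [parent-ingestion segment transcription-path]
  {:doc_source_id           (:doc_source_id parent-ingestion)
   :transcription_file_path transcription-path
   :classification_result   segment})
```

Both functions are pure, so building the rows again on retry yields the same values.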

What Doesn't Change

Extraction — receives single-document classification and transcription, no changes
Validation — operates on single-document structured_data, no changes
Fraud detection — no changes
Post-processing (cost center, account matching) — no changes
complete-ingestion! — no changes
DB trigger (copies structured_data from ingestion to document on completion) — no changes
ERP UI — split documents are normal ap_document records with their own PDFs
Email acquisition / triage — no changes; multi-doc detection happens at ingestion time
SQS message format — still {:ingestion-id "uuid"}; classification travels via DB column

Edge Cases

Multi-page single document (e.g., one invoice spanning 10 pages) → classification returns one segment with pages [1 10], no split
Mixed document types (invoice + delivery note + financial notice in one PDF) → separate segments with different types, each routed appropriately
Non-contiguous pages → not handled for now; each segment is assumed to be a contiguous page range, and the classification prompt is designed accordingly
Single-page documents → classification returns one-element vector, gate is a no-op
Document 1 failure after split → on retry, document 1 is already trimmed to segment[0]'s pages, so it classifies as a single document and is not re-split
Gate failure mid-execution → all sub-steps idempotent, retry runs everything unconditionally