The ingestion pipeline assumes one document per PDF. When a single PDF contains multiple documents (e.g., a hotel sending 4 separate invoices for 4 guests in one file), only the first document is extracted. The rest are silently ignored.
Example: document 019d2a63-49a2-70b5-a345-d70ce0b7bf15 — a 4-page PDF from Hotel Schillerpark with 4 separate invoices (different invoice numbers, guests, rooms, totals). The pipeline extracted only invoice #1 (Franz Dorfer, 439.20 EUR) and ignored invoices #2-#4.
This is not an invoice-specific problem. Any document type can appear in a multi-document PDF.
Current:
fetch-from-s3 → transcribe → classify → [route by type] → extract/notice/other → complete
New:
fetch-from-s3 → transcribe → classify → [segmentation gate] → [route by type] → extract/notice/other → complete
The segmentation gate is a new step in the orchestrator that runs after classification. If classification detects multiple documents, the gate splits the PDF and creates independent child documents.
Classification is enhanced to detect document boundaries within multi-page PDFs.
Single-page documents: Classified as today. Returns a one-element vector.
Multi-page documents: The prompt reads all pages and detects boundaries between distinct documents. Signals of a new document: different invoice number/header, different issuer block, separate totals section, new document title. Signals of continuation: same invoice number, "Page 2 of 3", continuation of line items without new header.
Classification returns a vector of segment maps:
[{:document-type "invoice"
  :invoice-subtype "standard-invoice"
  :pages [1 1]
  :confidence "high"
  :document-description "Hotel invoice for Franz Dorfer"}
 {:document-type "invoice"
  :invoice-subtype "standard-invoice"
  :pages [2 2]
  :confidence "high"
  :document-description "Hotel invoice for Christian Müller"}
 {:document-type "financial-notice"
  :notice-type "payment-reminder"
  :notice-metadata {:reference-number "12345" :amount 100.00 :currency "EUR"}
  :pages [3 3]
  :confidence "high"
  :document-description "Payment reminder from supplier X"}]
Each segment includes full classification for its type — including notice metadata for financial notices, since financial notices short-circuit (classification IS the structured data, no extraction runs).
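
The segment maps above can be read with a few small helpers. This is a minimal sketch, not the pipeline's actual code: the function names are hypothetical, `:pages` is assumed to be an inclusive `[start end]` pair (as the `[1 1]` and `[3 3]` examples suggest), and the mapping from document type to pipeline step is an assumption based on the extract/notice/other routing.

```clojure
(defn segment-page-count
  "Number of pages in a segment; :pages is treated as an inclusive [start end] pair."
  [{[start end] :pages}]
  (inc (- end start)))

(defn short-circuit?
  "Assumption: financial notices short-circuit, because their classification
   already carries the structured data and no extraction runs."
  [segment]
  (= "financial-notice" (:document-type segment)))

(defn route-for
  "Hypothetical routing: invoices go to extraction, notices to the notice step,
   everything else to the 'other' step."
  [segment]
  (case (:document-type segment)
    "invoice"          :extract
    "financial-notice" :notice
    :other))
```

For the third segment in the example vector, `(segment-page-count segment)` gives 1, `(short-circuit? segment)` is true, and `(route-for segment)` is `:notice`.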
The gate lives in the orchestrator and runs after classify!. It checks (> (count segments) 1).
If single segment: Continue as today.
If multiple segments, for each segment beyond segment[0]:
- Create an ap_document with:
  - type pre-set from segment classification
  - source_metadata copied from parent document
  - source_document_id set to parent document's ID (lineage)
- Create an ap_doc_ingestion with:
  - doc_source_id from parent ingestion (preserves email linkage)
  - transcription_file_path → pre-uploaded sliced transcription EDN (pages re-indexed to start at 1)
  - classification_result → pre-filled with segment's classification
- For notice segments, run complete-notice! inline (no SQS queue needed)
- For other segments, run complete-other! inline (no SQS queue needed)
For document 1 (segment[0]):
- Copy the original PDF to documents/{document-id}/original.pdf
- Update transcription_file_path to the sliced transcription
Gate operation order:
Idempotency: Each sub-step is independently idempotent:
- Inserts use INSERT ... ON CONFLICT DO NOTHING or existence checks
- claim-ingestion! dedup
- complete-notice! / complete-other! check ingestion status before running and skip if already completed
On retry, the gate runs all steps unconditionally. No partial-progress tracking needed.
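
The retry behavior can be illustrated with an in-memory stand-in for the database. This is a sketch under stated assumptions, not the real gate: the atom replaces the tables, `ensure-child!` and `run-gate!` are hypothetical names, and keying children by parent ID plus segment index is an assumed conflict key that the design does not specify.

```clojure
(defn child-key
  "Hypothetical natural key: parent document ID plus segment index."
  [parent-id idx]
  [parent-id idx])

(defn ensure-child!
  "Inserts a child record unless one with the same key already exists.
   Stands in for INSERT ... ON CONFLICT DO NOTHING."
  [db parent idx segment]
  (swap! db update (child-key (:id parent) idx)
         (fn [existing]
           (or existing
               ;; keys mirror the ap_document columns named in the design
               {:type               (:document-type segment)
                :source-metadata    (:source-metadata parent)
                :source-document-id (:id parent)}))))

(defn run-gate!
  "Creates a child for every segment beyond the first and returns the store.
   A single segment is a no-op. Safe to call again on retry."
  [db parent segments]
  (when (> (count segments) 1)
    (doseq [[idx segment] (map-indexed vector segments)
            :when (pos? idx)]
      (ensure-child! db parent idx segment)))
  @db)
```

Calling `run-gate!` twice with the same parent and a three-segment vector leaves exactly two child records in the store both times, which is why no partial-progress tracking is needed in this model.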
New column on ap_ingestion:
- classification_result: JSONB, nullable, default null
- classify! checks this before running the LLM: if set, it uses the stored value and skips the LLM. If null, it runs classification normally.
- classify! persists the result to this column. This makes classification idempotent on retries for all documents (a general improvement, not just for splits).
New column on ap_document:
- source_document_id: UUID, nullable, FK to document(id), default null
classify! orchestrator changes:
- Check ap_ingestion.classification_result for this ingestion
- If null, call classification/classify!, which returns a vector of segments
- Persist the result to ap_ingestion.classification_result for this ingestion (retry idempotency)
- document.type update uses segment[0]'s document-type
- short-circuit? derived from segment[0]
The full OCR text uses === PAGE N === markers. The gate slices by these markers and re-indexes page numbers so each child's transcription starts at page 1. The sliced transcription is uploaded as EDN to ingestions/{ingestion-id}/transcription-output.edn, following the existing convention.
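
Slicing by page markers and re-indexing can be sketched as a pure function over the OCR text. This is a minimal illustration that assumes markers appear exactly as `=== PAGE N ===` and that a segment's `:pages` is an inclusive `[start end]` pair; the function names are hypothetical, and the structure of the surrounding transcription EDN is not modeled.

```clojure
(require '[clojure.string :as str])

(def page-marker #"=== PAGE (\d+) ===")

(defn parse-pages
  "Splits OCR text into a map of original page number -> page body."
  [ocr-text]
  (let [numbers (map #(Long/parseLong (second %)) (re-seq page-marker ocr-text))
        ;; element 0 is whatever precedes the first marker, so drop it;
        ;; limit -1 keeps a trailing empty page instead of discarding it
        bodies  (rest (str/split ocr-text page-marker -1))]
    (zipmap numbers (map str/trim bodies))))

(defn slice-transcription
  "Returns the text for the inclusive page range [start end],
   with page markers renumbered to start at 1."
  [ocr-text [start end]]
  (let [pages (parse-pages ocr-text)]
    (->> (range start (inc end))
         (map-indexed (fn [i n]
                        (str "=== PAGE " (inc i) " ===\n" (get pages n ""))))
         (str/join "\n"))))
```

Given a three-page text, `(slice-transcription ocr [2 3])` returns pages 2 and 3 of the original under the markers `=== PAGE 1 ===` and `=== PAGE 2 ===`.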
The PDFBox PDDocument is loaded once from the original S3 file. Pages are extracted per segment, and each segment is saved as a separate PDF: one load, N writes.
Before modifying document 1's S3 file, the original multi-document PDF is copied to documents/{document-id}/original.pdf. The original is discoverable by convention for debugging.
Child documents preserve the email source linkage:
- ap_doc_ingestion.doc_source_id set to the same doc source as the parent ingestion
- ap_document.source_metadata copied from the parent document (from, subject, message-id, forwarding chain, etc.)
complete-ingestion!: no changes.
Child documents are regular ap_document records with their own PDFs.
The queue message stays {:ingestion-id "uuid"}; classification travels via DB column.
A 10-page PDF holding a single document yields one segment with pages [1 10], no split.