Document AI's native PDF parsing clips text at layout block boundaries. For programmatic PDFs with form-like layouts (multi-column headers, key-value fields), characters at the right edge of detected blocks are silently dropped. This produces truncated field values (e.g., "WP26" instead of "WP261041", "Armin Sa" instead of "Armin Saeidiani") with confidence=1.0 — making the errors undetectable by quality heuristics.
PDFBox reads the same embedded text objects without layout segmentation, so it returns complete text. Currently PDFBox is only attempted for PDFs >15 pages, meaning most invoices (1-3 pages) go straight to Document AI.
Replace the current all-or-nothing tier system with per-page evaluation:
PDF arrives
  → Load with PDFBox
  → For each page:
      extract positioned text elements (TextPosition objects)
      evaluate character density
  → Pages passing quality gate:
      layout reconstruction from PDFBox coordinates → done
  → Pages failing quality gate:
      split into sub-PDF
      send to Document AI (enableNativePdfParsing: false)
      → Pages with acceptable token confidence:
          layout reconstruction from Document AI coordinates → done
      → Pages with low confidence (>5% tokens below 0.8):
          render to images, send to Vision → done
  → Assemble all pages in order with === PAGE N === markers
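The low-confidence branch in the flow above could be expressed as a small predicate. This is a sketch only: the function name and the decision to treat a page with no tokens as "not low confidence" are assumptions, while the 0.8 and 5% defaults come from the flow.

```clojure
(defn low-confidence-page?
  "True when more than `max-fraction` of the token confidences on a page
  fall below `min-conf`. Defaults mirror the flow: >5% of tokens below 0.8.
  Assumption: a page with no tokens is not flagged here."
  ([confidences] (low-confidence-page? confidences 0.8 0.05))
  ([confidences min-conf max-fraction]
   (let [total (count confidences)
         low   (count (filter #(< % min-conf) confidences))]
     (and (pos? total)
          (> (/ low total) max-fraction)))))

;; One weak token out of four is 25%, so this page would go to Vision.
(low-confidence-page? [0.99 0.95 0.5 0.97])
```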
Subclass PDFTextStripper and override
writeString(String text, List<TextPosition> textPositions). This method is
called once per word group — PDFBox handles character-to-word grouping
internally. Capture each group as a positioned element:
{:text   "WP261041"
 :x      0.21   ;; normalized to 0.0-1.0 (x / page-width)
 :y      0.42   ;; normalized (y / page-height)
 :width  0.05   ;; normalized
 :height 0.01   ;; normalized
 :page   1}
Normalization to 0.0-1.0 makes the elements identical in shape to what
ocr_layout.clj gets from Document AI's normalizedVertices. This enables
reuse of the same row-grouping and column-separator algorithm.
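A minimal sketch of the normalization step, assuming the raw element already carries top-left-origin coordinates in PDF points. The `normalize-element` name and the shape of the raw input map are illustrative and not taken from the codebase.

```clojure
(defn normalize-element
  "Converts a raw positioned element (coordinates in PDF points, top-left
  origin assumed) into the normalized 0.0-1.0 shape shown above."
  [{:keys [text x y width height]} page-width page-height page-number]
  {:text   text
   :x      (double (/ x page-width))
   :y      (double (/ y page-height))
   :width  (double (/ width page-width))
   :height (double (/ height page-height))
   :page   page-number})

;; Hypothetical 100x200pt page, chosen so the arithmetic is easy to follow.
(normalize-element {:text "WP261041" :x 25 :y 50 :width 10 :height 20}
                   100 200 1)
```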
Extract the core layout algorithm from ocr_layout.clj into a shared module:
- layout.clj (new): group-into-rows, row->text, page->structured-text. Takes a seq of positioned elements (the {:text :x :y :width :height} maps above), returns formatted text with | column separators.
- pdfbox_layout.clj (new): PDFBox TextPosition → positioned elements → calls layout.clj
- ocr_layout.clj (modified): Document AI response → positioned elements → calls layout.clj

Output format is unchanged: === PAGE N ===\n headers with |-separated
columns within rows.
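To make the shared module concrete, here is a rough sketch of what the three functions might look like. The function names come from the list above; the argument lists, the row tolerance (0.005), the column gap (0.03), and the exact " | " spacing are assumptions for illustration and may differ from the real ocr_layout.clj logic.

```clojure
(require '[clojure.string :as str])

(defn group-into-rows
  "Groups elements into rows: an element joins the current row when its :y
  is within `tolerance` of that row's first element."
  [elements tolerance]
  (reduce (fn [rows el]
            (let [row (peek rows)]
              (if (and row
                       (<= (Math/abs (double (- (:y el) (:y (first row)))))
                           tolerance))
                (conj (pop rows) (conj row el))
                (conj rows [el]))))
          []
          (sort-by :y elements)))

(defn row->text
  "Joins a row left to right. A horizontal gap wider than `col-gap` becomes
  a column separator; smaller gaps become a single space."
  [row col-gap]
  (let [sorted (sort-by :x row)]
    (reduce (fn [s [prev cur]]
              (let [gap (- (:x cur) (+ (:x prev) (:width prev)))]
                (str s (if (> gap col-gap) " | " " ") (:text cur))))
            (:text (first sorted))
            (partition 2 1 sorted))))

(defn page->structured-text
  "Formats one page with its === PAGE N === header."
  [page-number elements]
  (str "=== PAGE " page-number " ===\n"
       (str/join "\n" (map #(row->text % 0.03)
                           (group-into-rows elements 0.005)))))

(page->structured-text
 1
 [{:text "Invoice"  :x 0.10 :y 0.100 :width 0.08}
  {:text "No."      :x 0.19 :y 0.101 :width 0.03}
  {:text "WP261041" :x 0.60 :y 0.100 :width 0.10}
  {:text "Total"    :x 0.10 :y 0.200 :width 0.05}])
```

Because the functions only look at :text, :x, :y and :width, they work the same way on elements from PDFBox or from Document AI.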
Character density threshold: 50 chars/page (configurable via config.edn). Optimized for correctness — if a page looks slightly suspicious, fall through to Document AI.
Evaluation is per-page using PDFTextStripper.setStartPage/setEndPage (or
extracting per-page char counts from the full positioned elements). Pages are
independent — a 4-page document might have pages 1-3 handled by PDFBox and
page 4 by Document AI.
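The per-page gate can be sketched as a split over a page → char-count map. The function name and the choice of `>=` for "passing" are assumptions; the threshold of 50 is the default stated above.

```clojure
(defn partition-pages-by-density
  "Splits pages into those meeting the character-density threshold and
  those that fall through to Document AI.
  `page-char-counts` is a map of 1-indexed page number -> char count."
  [page-char-counts min-chars]
  (let [passing? (fn [[_ n]] (>= n min-chars))]
    {:passing-pages (into (sorted-set)
                          (comp (filter passing?) (map first))
                          page-char-counts)
     :failing-pages (into (sorted-set)
                          (comp (remove passing?) (map first))
                          page-char-counts)}))

;; The 4-page case described above: pages 1-3 stay with PDFBox, page 4 is a scan.
(partition-pages-by-density {1 500, 2 200, 3 320, 4 0} 50)
```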
PDFBox only extracts text from text rendering operators (Tj, TJ, etc.).
Some PDFs contain text rendered as rasterized images — e.g., footers with legal
details (VAT ID, IBAN, registered address) baked as bitmaps by large issuers
like Vodafone. PDFBox sees the image XObject but can't read its content. The
page passes the character density gate (plenty of native text in the body), so
OCR is never triggered, and the image-rendered text is silently lost.
Observed impact: Vodafone invoice footer (USt-IdNr DE813113094, IBAN, BIC,
registered address) rendered as a 640x58px image at Y=14.92. PDFBox produced
complete body text but zero footer content. This triggered a critical
missing-vat-id fraud flag and needs_human_review.
After PDFBox text extraction, also detect XObject images on each page. For pages that pass the text quality gate but contain significant images, run Document AI OCR on those pages and spatially merge the results.
Updated flow:
PDF arrives
  → Load with PDFBox
  → For each page:
      extract positioned text elements
      detect XObject images with bounding boxes
      evaluate character density
  → Pages failing quality gate → existing fallback (OCR/Vision)
  → Pages passing quality gate:
      if no significant images → PDFBox layout only (done)
      if significant images present → also run Document AI OCR (image mode)
        → spatial merge: PDFBox elements + OCR-only elements → done
  → Assemble all pages in order with === PAGE N === markers
Add detect-page-images to pdfbox_layout.clj. Subclass
PDFGraphicsStreamEngine and override drawImage(PDImage), which is called
whenever a Do operator renders an XObject image. At that point, the current
transformation matrix (CTM) provides the image's rendered position and
dimensions:
;; Needs these classes imported in the ns form:
;;   org.apache.pdfbox.pdmodel.PDDocument
;;   org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine
;;   org.apache.pdfbox.pdmodel.graphics.image.PDImage
(defn ^:private detect-page-images
  "Detect XObject images on a PDF page with their bounding boxes.
  Returns seq of `{:x :y :width :height}` maps, normalized to 0.0-1.0."
  [^PDDocument doc page-index]
  (let [page   (.getPage doc page-index)
        page-w (.. page getMediaBox getWidth)
        page-h (.. page getMediaBox getHeight)
        images (atom [])
        engine (proxy [PDFGraphicsStreamEngine] [page]
                 (drawImage [^PDImage pd-image]
                   (let [ctm    (.. this getGraphicsState
                                    getCurrentTransformationMatrix)
                         ;; CTM gives rendered position/size in PDF points.
                         ;; Scale values equal width/height only for
                         ;; unrotated images.
                         ;; PDF Y-axis is bottom-up; normalize to top-down.
                         width  (.getScaleX ctm)
                         height (.getScaleY ctm)
                         x      (.getTranslateX ctm)
                         y      (- page-h (.getTranslateY ctm) height)]
                     (swap! images conj
                            {:x      (/ x page-w)
                             :y      (/ y page-h)
                             :width  (/ width page-w)
                             :height (/ height page-h)})))
                 ;; Required abstract methods: no-ops
                 (appendRectangle [_ _ _ _])
                 (clip [_])
                 (moveTo [_ _])
                 (lineTo [_ _])
                 (curveTo [_ _ _ _ _ _])
                 (getCurrentPoint [])
                 (closePath [])
                 (endPath [])
                 (strokePath [])
                 (fillPath [_])
                 (fillAndStrokePath [_])
                 (shadingFill [_]))]
    (.processPage engine page)
    @images))
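The coordinate flip is the easiest part to get wrong, so here is the same arithmetic as a pure function that can be checked without PDFBox. This is only a sketch: the function name and the plain-map stand-in for the CTM are hypothetical, the translate-x value is made up, and it assumes the "Y=14.92" in the observed Vodafone case is the CTM's bottom-up translate-y on an A4 page.

```clojure
(defn ctm->normalized-bbox
  "Converts CTM scale/translate values (PDF points, bottom-up Y) into a
  top-down bounding box normalized to 0.0-1.0. Unrotated images only."
  [{:keys [scale-x scale-y translate-x translate-y]} page-w page-h]
  (let [y-top (- page-h translate-y scale-y)] ; distance from top edge to image top
    {:x      (/ translate-x page-w)
     :y      (/ y-top page-h)
     :width  (/ scale-x page-w)
     :height (/ scale-y page-h)}))

;; Footer-like image near the bottom of an A4 page (595 x 842 pt).
;; translate-x 57.5 is an invented value for illustration.
(ctm->normalized-bbox {:scale-x 480.0 :scale-y 43.5
                       :translate-x 57.5 :translate-y 14.92}
                      595.0 842.0)
```

An image sitting 14.92pt above the bottom edge ends up with a normalized :y of roughly 0.93, i.e. near the bottom in top-down coordinates.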
Filter detected images to those likely to contain text:
(defn ^:private significant-image?
  "An image is significant if it's wide enough to contain a line of text
  and tall enough to hold at least one text line. Filters out logos,
  icons, and decorative elements."
  [{:keys [width height]}]
  (and (> width 0.3)      ;; > 30% page width (~180pt on A4)
       (> height 0.02)))  ;; > 2% page height (~17pt, ~1 line of text)
The Vodafone footer (a 640x58px image rendered at 480x43.5pt) is 480/595 = 0.81 width, 43.5/842 = 0.052 height, so it passes easily. A small logo (e.g., 50x50pt) at 0.08 width would be filtered out.
This is intentionally loose. False positives (sending a page to OCR that only had decorative images) cost ~$0.0015 and a couple seconds. False negatives (missing image-rendered text) cause extraction failures. Err toward triggering.
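As a quick check of the thresholds against the numbers above, the filter can be exercised on the footer, a small logo, and a thin horizontal rule. The predicate is repeated (without `^:private`) so the snippet stands alone; the rule's dimensions are invented for illustration.

```clojure
(defn significant-image?
  [{:keys [width height]}]
  (and (> width 0.3)
       (> height 0.02)))

(def footer {:name :footer :width (/ 480.0 595) :height (/ 43.5 842)})
(def logo   {:name :logo   :width (/ 50.0 595)  :height (/ 50.0 842)})
(def rule   {:name :rule   :width 0.9           :height 0.002}) ; hypothetical divider line

;; Only the footer is both wide and tall enough to hold a line of text.
(mapv :name (filter significant-image? [footer logo rule]))
```

The logo fails on width and the rule fails on height, so neither would trigger an OCR supplement on its own.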
For pages with significant images, run both PDFBox and Document AI OCR (image
mode, enableNativePdfParsing: false), then merge at the positioned-element
level:
(defn ^:private merge-pdfbox-ocr-elements
  "Merge PDFBox and OCR positioned elements for a single page.
  PDFBox elements are authoritative for native text (no clipping).
  OCR elements are only included if they don't overlap with any PDFBox
  element, meaning they come from image regions PDFBox couldn't read."
  [pdfbox-elements ocr-elements]
  (let [overlaps? (fn [ocr-elem]
                    (some (fn [pb-elem]
                            (and (< (:x ocr-elem)
                                    (+ (:x pb-elem) (:width pb-elem)))
                                 (< (:x pb-elem)
                                    (+ (:x ocr-elem) (:width ocr-elem)))
                                 (< (:y ocr-elem)
                                    (+ (:y pb-elem) (:height pb-elem)))
                                 (< (:y pb-elem)
                                    (+ (:y ocr-elem) (:height ocr-elem)))))
                          pdfbox-elements))
        ocr-only  (remove overlaps? ocr-elements)]
    (into (vec pdfbox-elements) ocr-only)))
The merged elements go through the same layout/elements->structured-text
pipeline. Since both sources use normalized 0.0-1.0 coordinates, the layout
algorithm handles row grouping and column detection uniformly.
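A small worked example may help show what the merge keeps and drops. The snippet below re-states the same rectangle-overlap rule with a standalone helper so it can run on its own; `rects-overlap?`, `merge-elements` and the sample elements are illustrative names and data, not the production code.

```clojure
(defn rects-overlap?
  "Axis-aligned overlap test on {:x :y :width :height} maps."
  [a b]
  (and (< (:x a) (+ (:x b) (:width b)))
       (< (:x b) (+ (:x a) (:width a)))
       (< (:y a) (+ (:y b) (:height b)))
       (< (:y b) (+ (:y a) (:height a)))))

(defn merge-elements
  "Keeps every PDFBox element; adds only OCR elements that overlap none of them."
  [pdfbox-elements ocr-elements]
  (into (vec pdfbox-elements)
        (remove (fn [ocr] (some #(rects-overlap? ocr %) pdfbox-elements)))
        ocr-elements))

(def pdfbox-elements
  [{:text "Invoice" :x 0.10 :y 0.10 :width 0.20 :height 0.010}])

(def ocr-elements
  [;; OCR re-read of the same native text: overlaps, so it is dropped.
   {:text "Invoice"  :x 0.101 :y 0.099 :width 0.20 :height 0.012}
   ;; Text baked into a footer image: no PDFBox counterpart, so it is kept.
   {:text "USt-IdNr" :x 0.12  :y 0.95  :width 0.10 :height 0.010}])

(mapv :text (merge-elements pdfbox-elements ocr-elements))
```

One consequence of this rule is that an OCR element that only partially overlaps native text is dropped entirely, which matches the "PDFBox is authoritative" intent.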
In extract-text in transcription.clj, after the existing PDFBox quality gate:
;; Existing: pages pass or fail the char-count threshold
;; New: among passing pages, identify those with significant images
(let [pages-with-images (into (sorted-set)
                              (comp (filter (fn [[_ imgs]] (seq imgs)))
                                    (map first))
                              page-images)
      ;; Only pages that PASSED text gate but have images need OCR supplement
      supplement-pages (set/difference pages-with-images failing-pages)]
  ;; If supplement-pages is non-empty, extract those pages to sub-PDF,
  ;; send to Document AI OCR (image mode), get positioned elements per page,
  ;; merge with PDFBox elements, reconstruct layout.
  ...)
The OCR for supplement pages reuses the existing ocr-with-vision-fallback
path (or just the Document AI call without vision fallback, since we only need
OCR elements for the merge — vision text isn't positioned).
Important: For supplemented pages, we need positioned elements from OCR,
not just text. This means calling ocr-layout/page->elements on the Document
AI response before the merge, then running layout/elements->structured-text
on the merged result. The current reconstruct-layout produces text directly —
we need the intermediate elements.
This requires exposing page->elements from ocr_layout.clj (currently
private) or adding a function that returns elements instead of text.
Supplemented pages use a new method keyword: :pdf-lib+ocr
{:page-methods {1 :pdf-lib+ocr, 2 :pdf-lib}}
Document-level :method remains :pdf-lib (the primary source). The
:page-methods map captures the detail for diagnostics.
extract-with-layout returns (updated)

Add :page-images and :page-elements to the return map:
{:text "..."
 :page-count 2
 :page-char-counts {1 500, 2 200}
 :page-texts {1 "=== PAGE 1 ===\n...", 2 "..."}
 :page-elements {1 [{:text "..." :x 0.1 :y 0.2 ...} ...], 2 [...]}
 :page-images {1 [{:x 0.12 :y 0.95 :width 0.81 :height 0.05}], 2 []}
 :quality-score 1.0
 :raw-response []
 :method :pdf-lib}
:page-elements is needed so the spatial merge can operate on positioned
elements rather than reconstructed text. Currently extract-with-layout runs
layout reconstruction inline and only returns text — the elements are lost.
Preserving them avoids re-extracting.
:page-images is filtered to significant images only.
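For pages that fail the PDFBox quality gate, send to Document AI with
enableNativePdfParsing: false. Rationale: native PDF parsing is the mode that
clips text at layout block boundaries, so fallback pages are processed in
image mode instead.
The existing chunking logic (max 15 pages per request) still applies for fallback pages.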
Each tier must emit === PAGE N === markers with globally correct page
numbers (1-indexed relative to the original document):
- PDFBox: pdfbox_layout.clj receives the real page number.
- Document AI: ocr_layout.clj currently 1-indexes relative to the chunk. Pass the real page offset so it emits correct global indices.
- Vision: === PAGE N === is added with the correct global index during assembly. The Vision prompt already receives page-start/page-end parameters; adjust these to reflect the real page numbers of the specific failing pages.

Assembly: sort all page blocks by page number, concatenate with \n\n.
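The page-numbering and assembly rules could be sketched as follows. All three function names are hypothetical. One assumption worth flagging: because failing pages need not be contiguous, this sketch keeps the list of real page numbers per chunk and looks up by chunk-relative index, rather than adding a single offset.

```clojure
(require '[clojure.string :as str])

(defn chunk-pages
  "Splits real (global, 1-indexed) page numbers into request-sized chunks."
  [pages max-per-request]
  (mapv vec (partition-all max-per-request (sort pages))))

(defn chunk-index->page
  "Maps a 1-indexed position within a chunk back to the real page number."
  [chunk chunk-index]
  (nth chunk (dec chunk-index)))

(defn assemble-pages
  "Sorts page blocks by real page number and joins them with a blank line.
  `page-texts` maps page number -> text that already starts with its marker."
  [page-texts]
  (str/join "\n\n" (map val (sort-by key page-texts))))

;; Pages 2, 4 and 9 failed the gate; with a limit of 2 pages per request,
;; the second page of the first chunk is really page 4.
(chunk-index->page (first (chunk-pages [4 2 9] 2)) 2)

(assemble-pages {2 "=== PAGE 2 ===\nb"
                 1 "=== PAGE 1 ===\na"})
```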
- :method (document-level): the "highest" tier used. If any page needed Vision → :vision. If any needed OCR but none needed Vision → :ocr. If all passed PDFBox → :pdf-lib.
- :page-methods (new, optional): map of page number → method keyword. Stored in the S3 transcription output for diagnostics. Not persisted to DB.

{:page-methods {1 :pdf-lib, 2 :pdf-lib, 3 :pdf-lib, 4 :ocr}}
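One way to derive the document-level :method from :page-methods is a rank lookup. This is a sketch under assumptions: the names `tier-rank`, `rank->method` and `document-method` are invented here, and :pdf-lib+ocr is ranked with :pdf-lib because the document-level method is described as remaining :pdf-lib for supplemented pages.

```clojure
(def tier-rank
  "Higher rank = further down the fallback chain."
  {:pdf-lib 0, :pdf-lib+ocr 0, :ocr 1, :vision 2})

(def rank->method
  {0 :pdf-lib, 1 :ocr, 2 :vision})

(defn document-method
  "Returns the highest tier used by any page; :pdf-lib for an empty map."
  [page-methods]
  (rank->method (apply max 0 (map tier-rank (vals page-methods)))))

(document-method {1 :pdf-lib, 2 :pdf-lib, 3 :pdf-lib, 4 :ocr})
(document-method {1 :pdf-lib+ocr, 2 :pdf-lib})
```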
Add to transcription config in config.edn:
:pdf-lib {:min-chars-per-page 50} ;; quality gate threshold
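Reading the threshold with a fallback default might look like this. The enclosing :transcription key and the helper name are assumptions about how config.edn is laid out; only the :pdf-lib {:min-chars-per-page ...} entry and the default of 50 come from the text.

```clojure
(require '[clojure.edn :as edn])

(def default-min-chars-per-page 50)

(defn min-chars-per-page
  "Looks up the quality gate threshold, falling back to the default when
  the key is absent. The [:transcription :pdf-lib ...] path is assumed."
  [config]
  (get-in config
          [:transcription :pdf-lib :min-chars-per-page]
          default-min-chars-per-page))

(def example-config
  (edn/read-string "{:transcription {:pdf-lib {:min-chars-per-page 80}}}"))

(min-chars-per-page example-config)
(min-chars-per-page {})
```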
- :text string
- TranscriptionResult schema shape for downstream consumers

| Scenario | PDFBox | Fallback |
|---|---|---|
| Modern programmatic PDF | Complete text, layout reconstructed | N/A |
| Scanned document (images) | 0 chars → fails gate | Document AI image OCR |
| Bad font encoding (mojibake) | Low char density or garbage → fails gate | Document AI image OCR |
| Scanned with bad OCR layer | Extracts bad OCR text, may pass gate | Risk: bad text used. Mitigated by low threshold. |
| Mixed content (some pages scanned) | Per-page: text pages pass, scan pages fail | Only scan pages go to Document AI |
| Hidden/duplicate text objects | May extract garbage, could pass gate | Rare for invoices |