PDFBox-First Transcription Pipeline

Problem

Document AI's native PDF parsing clips text at layout block boundaries. For programmatic PDFs with form-like layouts (multi-column headers, key-value fields), characters at the right edge of detected blocks are silently dropped. This produces truncated field values (e.g., "WP26" instead of "WP261041", "Armin Sa" instead of "Armin Saeidiani") with confidence=1.0 — making the errors undetectable by quality heuristics.

PDFBox reads the same embedded text objects without layout segmentation, so it returns complete text. Currently PDFBox is only attempted for PDFs >15 pages, meaning most invoices (1-3 pages) go straight to Document AI.

Design

Tier architecture

Replace the current all-or-nothing tier system with per-page evaluation:

PDF arrives
  → Load with PDFBox
  → For each page:
      extract positioned text elements (TextPosition objects)
      evaluate character density
  → Pages passing quality gate:
      layout reconstruction from PDFBox coordinates → done
  → Pages failing quality gate:
      split into sub-PDF
      send to Document AI (enableNativePdfParsing: false)
      → Pages with acceptable token confidence:
          layout reconstruction from Document AI coordinates → done
      → Pages with low confidence (>5% tokens below 0.8):
          render to images, send to Vision → done
  → Assemble all pages in order with === PAGE N === markers
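
The escalation rule from Document AI to Vision ("more than 5% of tokens below 0.8") can be sketched as a small predicate. This is a minimal sketch, not the pipeline's actual code: the name low-confidence-page? is hypothetical, and treating a page with no tokens as low confidence is an assumption.

```clojure
(defn low-confidence-page?
  "True when more than `max-low-fraction` of a page's token confidences
   fall below `min-confidence`. Defaults mirror the flow above
   (more than 5% of tokens below 0.8).
   Assumption: a page with no tokens counts as low confidence."
  ([confidences]
   (low-confidence-page? confidences 0.8 0.05))
  ([confidences min-confidence max-low-fraction]
   (if (empty? confidences)
     true
     (let [low (count (filter #(< % min-confidence) confidences))]
       (> (/ low (double (count confidences))) max-low-fraction)))))

;; 2 weak tokens out of 100 stays on the Document AI result;
;; 10 weak tokens out of 100 escalates the page to Vision.
(low-confidence-page? (concat (repeat 98 0.99) (repeat 2 0.5)))
(low-confidence-page? (concat (repeat 90 0.99) (repeat 10 0.5)))
```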

PDFBox layout extraction

Subclass PDFTextStripper and override writeString(String text, List<TextPosition> textPositions). This method is called once per word group — PDFBox handles character-to-word grouping internally. Capture each group as a positioned element:

{:text "WP261041"
 :x 0.21          ;; normalized to 0.0-1.0 (x / page-width)
 :y 0.42          ;; normalized (y / page-height)
 :width 0.05      ;; normalized
 :height 0.01     ;; normalized
 :page 1}

Normalization to 0.0-1.0 makes the elements identical in shape to what ocr_layout.clj gets from Document AI's normalizedVertices. This enables reuse of the same row-grouping and column-separator algorithm.
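
As an illustration of the normalization step, the sketch below converts a word box measured in PDF points into the element shape shown above. The function name normalize-element and the input map are hypothetical, and the box is assumed to already use a top-left origin.

```clojure
(defn normalize-element
  "Convert a word box in PDF points into a normalized element.
   Assumes `x`/`y` are the top-left corner in a top-down coordinate
   system; `page-w` and `page-h` are the page size in points."
  [{:keys [text x y width height]} page-w page-h page-number]
  {:text   text
   :x      (/ x (double page-w))
   :y      (/ y (double page-h))
   :width  (/ width (double page-w))
   :height (/ height (double page-h))
   :page   page-number})

;; A word box on an A4 page (595 x 842 pt).
(normalize-element {:text "WP261041" :x 119.0 :y 421.0 :width 59.5 :height 8.42}
                   595 842 1)
```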

Shared layout module

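Extract the core layout algorithm (row grouping and column-separator detection) from ocr_layout.clj into a shared module, so that both the PDFBox path and the Document AI path feed it the same normalized elements.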

Output format is unchanged: === PAGE N ===\n headers with |-separated columns within rows.
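
To make the output format concrete, here is a deliberately simplified sketch of turning positioned elements into page text. It is not the algorithm in ocr_layout.clj: group-rows, render-page, and the row tolerance are hypothetical, and every element is treated as its own column, whereas the real module detects column separators.

```clojure
(require '[clojure.string :as str])

(defn group-rows
  "Sort elements by :y and start a new row whenever the vertical gap to
   the previous element exceeds `tol` (a hypothetical tolerance)."
  [elements tol]
  (reduce (fn [rows el]
            (let [row (peek rows)]
              (if (and row (<= (- (:y el) (:y (peek row))) tol))
                (conj (pop rows) (conj row el))
                (conj rows [el]))))
          []
          (sort-by :y elements)))

(defn render-page
  "Render one page: a === PAGE N === header, then one line per row with
   elements ordered left to right and joined by ` | `."
  [page-number elements tol]
  (str "=== PAGE " page-number " ===\n"
       (str/join "\n"
                 (for [row (group-rows elements tol)]
                   (str/join " | " (map :text (sort-by :x row)))))))

(render-page 1
             [{:text "WP261041" :x 0.6 :y 0.101}
              {:text "Invoice"  :x 0.1 :y 0.100}
              {:text "Total"    :x 0.1 :y 0.200}]
             0.005)
```

Because the input is just normalized elements, the same function works regardless of whether they came from PDFBox or Document AI.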

Per-page quality gate

Character density threshold: 50 chars/page (configurable via config.edn). Optimized for correctness — if a page looks slightly suspicious, fall through to Document AI.

Evaluation is per-page using PDFTextStripper.setStartPage/setEndPage (or extracting per-page char counts from the full positioned elements). Pages are independent — a 4-page document might have pages 1-3 handled by PDFBox and page 4 by Document AI.
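
A minimal sketch of the gate, operating on a map of per-page character counts. The name split-by-quality-gate is hypothetical, and treating a count equal to the threshold as passing is an assumption.

```clojure
(defn split-by-quality-gate
  "Partition page numbers into :passing and :failing sets by comparing
   each page's character count against `min-chars`.
   Assumption: a count equal to the threshold passes."
  [page-char-counts min-chars]
  (let [passes? (fn [[_ n]] (>= n min-chars))]
    {:passing (into (sorted-set) (comp (filter passes?) (map key)) page-char-counts)
     :failing (into (sorted-set) (comp (remove passes?) (map key)) page-char-counts)}))

;; Pages 1-3 stay with PDFBox, page 4 falls through to Document AI.
(split-by-quality-gate {1 500, 2 200, 3 640, 4 12} 50)
```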

Image detection and spatial merge

Problem

PDFBox only extracts text from text rendering operators (Tj, TJ, etc.). Some PDFs contain text rendered as rasterized images — e.g., footers with legal details (VAT ID, IBAN, registered address) baked as bitmaps by large issuers like Vodafone. PDFBox sees the image XObject but can't read its content. The page passes the character density gate (plenty of native text in the body), so OCR is never triggered, and the image-rendered text is silently lost.

Observed impact: Vodafone invoice footer (USt-IdNr DE813113094, IBAN, BIC, registered address) rendered as a 640x58px image at Y=14.92. PDFBox produced complete body text but zero footer content. This triggered a critical missing-vat-id fraud flag and needs_human_review.

Design

After PDFBox text extraction, also detect XObject images on each page. For pages that pass the text quality gate but contain significant images, run Document AI OCR on those pages and spatially merge the results.

Updated flow:

PDF arrives
  → Load with PDFBox
  → For each page:
      extract positioned text elements
      detect XObject images with bounding boxes
      evaluate character density
  → Pages failing quality gate → existing fallback (OCR/Vision)
  → Pages passing quality gate:
      if no significant images → PDFBox layout only (done)
      if significant images present → also run Document AI OCR (image mode)
          → spatial merge: PDFBox elements + OCR-only elements → done
  → Assemble all pages in order with === PAGE N === markers

Image detection via PDFGraphicsStreamEngine

Add detect-page-images to pdfbox_layout.clj. Subclass PDFGraphicsStreamEngine and override drawImage(PDImage) — this is called whenever a Do operator renders an XObject image. At that point, the current transformation matrix (CTM) provides the image's rendered position and dimensions:

;; Relies on these ns :import entries:
;;   (org.apache.pdfbox.contentstream PDFGraphicsStreamEngine)
;;   (org.apache.pdfbox.pdmodel PDDocument)
;;   (org.apache.pdfbox.pdmodel.graphics.image PDImage)
;;   (java.awt.geom Point2D$Double)
(defn ^:private detect-page-images
  "Detect XObject images on a PDF page with their bounding boxes.

   Returns seq of `{:x :y :width :height}` maps, normalized to 0.0-1.0."
  [^PDDocument doc page-index]
  (let [page     (.getPage doc page-index)
        page-w   (.. page getMediaBox getWidth)
        page-h   (.. page getMediaBox getHeight)
        images   (atom [])
        engine   (proxy [PDFGraphicsStreamEngine] [page]
                   (drawImage [^PDImage pd-image]
                     (let [ctm    (.. this getGraphicsState
                                       getCurrentTransformationMatrix)
                           ;; CTM gives rendered position/size in PDF points.
                           ;; Assumes the image is placed unrotated and unflipped,
                           ;; so the scale components are its width and height.
                           ;; PDF Y-axis is bottom-up; normalize to top-down.
                           width  (.getScaleX ctm)
                           height (.getScaleY ctm)
                           x      (.getTranslateX ctm)
                           y      (- page-h (.getTranslateY ctm) height)]
                       (swap! images conj
                              {:x      (/ x page-w)
                               :y      (/ y page-h)
                               :width  (/ width page-w)
                               :height (/ height page-h)})))
                   ;; Required abstract methods: no-ops, since only images matter here
                   (appendRectangle [_ _ _ _])
                   (clip [_])
                   (moveTo [_ _])
                   (lineTo [_ _])
                   (curveTo [_ _ _ _ _ _])
                   ;; Return a dummy point rather than nil so path operators
                   ;; that query the current point keep working quietly.
                   (getCurrentPoint [] (Point2D$Double. 0.0 0.0))
                   (closePath [])
                   (endPath [])
                   (strokePath [])
                   (fillPath [_])
                   (fillAndStrokePath [_])
                   (shadingFill [_]))]
    (.processPage engine page)
    @images))

Significant image heuristic

Filter detected images to those likely to contain text:

(defn ^:private significant-image?
  "An image is significant if it's wide enough to contain a line of text
   and tall enough to hold at least one text line. Filters out logos,
   icons, and decorative elements."
  [{:keys [width height]}]
  (and (> width 0.3)       ;; > 30% page width (~180pt on A4)
       (> height 0.02)))   ;; > 2% page height (~17pt, ~1 line of text)

The Vodafone footer (640x58px) renders at 480x43.5pt on an A4 page (595x842pt): 480/595 = 0.81 width and 43.5/842 = 0.052 height, so it passes easily. A small logo (e.g., 50x50pt) at 50/595 = 0.08 width would be filtered out.

This is intentionally loose. A false positive (sending a page to OCR that only had decorative images) costs ~$0.0015 and a couple of seconds. A false negative (missing image-rendered text) causes an extraction failure. Err toward triggering.

Spatial merge

For pages with significant images, run both PDFBox and Document AI OCR (image mode, enableNativePdfParsing: false), then merge at the positioned-element level:

(defn ^:private merge-pdfbox-ocr-elements
  "Merge PDFBox and OCR positioned elements for a single page.

   PDFBox elements are authoritative for native text (no clipping).
   OCR elements are only included if they don't overlap with any PDFBox
   element — meaning they come from image regions PDFBox couldn't read."
  [pdfbox-elements ocr-elements]
  (let [overlaps? (fn [ocr-elem]
                    (some (fn [pb-elem]
                            (and (< (:x ocr-elem)
                                    (+ (:x pb-elem) (:width pb-elem)))
                                 (< (:x pb-elem)
                                    (+ (:x ocr-elem) (:width ocr-elem)))
                                 (< (:y ocr-elem)
                                    (+ (:y pb-elem) (:height pb-elem)))
                                 (< (:y pb-elem)
                                    (+ (:y ocr-elem) (:height ocr-elem)))))
                          pdfbox-elements))
        ocr-only  (remove overlaps? ocr-elements)]
    (into (vec pdfbox-elements) ocr-only)))

The merged elements go through the same layout/elements->structured-text pipeline. Since both sources use normalized 0.0-1.0 coordinates, the layout algorithm handles row grouping and column detection uniformly.
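
The following self-contained sketch walks the merge rule through one example. It restates the overlap test in compact form for illustration only; the names boxes-overlap? and merge-elements and the sample coordinates are hypothetical.

```clojure
(defn boxes-overlap?
  "Axis-aligned overlap test on normalized {:x :y :width :height} boxes."
  [a b]
  (and (< (:x a) (+ (:x b) (:width b)))
       (< (:x b) (+ (:x a) (:width a)))
       (< (:y a) (+ (:y b) (:height b)))
       (< (:y b) (+ (:y a) (:height a)))))

(defn merge-elements
  "Keep every PDFBox element; add only OCR elements that overlap none of them."
  [pdfbox-elements ocr-elements]
  (into (vec pdfbox-elements)
        (remove (fn [o] (some #(boxes-overlap? o %) pdfbox-elements)))
        ocr-elements))

(def body-native {:text "WP261041" :x 0.21  :y 0.42  :width 0.05  :height 0.01})
(def body-ocr    {:text "WP261041" :x 0.208 :y 0.419 :width 0.055 :height 0.012})
(def footer-ocr  {:text "USt-IdNr DE813113094" :x 0.12 :y 0.94 :width 0.30 :height 0.01})

;; body-ocr duplicates native text and is dropped; footer-ocr sits in an
;; image region that PDFBox could not read, so it is kept.
(merge-elements [body-native] [body-ocr footer-ocr])
```

Note that any overlap, however small, discards the OCR element. That keeps duplicates out of the body text at the cost of dropping OCR text whose box grazes a native element.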

Integration with extract-text

In transcription.clj, after the existing PDFBox quality gate:

;; Assumes [clojure.set :as set] is required in the ns form.
;; Existing: pages pass or fail the char-count threshold
;; New: among passing pages, identify those with significant images
(let [pages-with-images (into (sorted-set)
                               (comp (filter (fn [[_ imgs]] (seq imgs)))
                                     (map first))
                               page-images)
      ;; Only pages that PASSED text gate but have images need OCR supplement
      supplement-pages  (set/difference pages-with-images failing-pages)]
  ;; If supplement-pages is non-empty, extract those pages to sub-PDF,
  ;; send to Document AI OCR (image mode), get positioned elements per page,
  ;; merge with PDFBox elements, reconstruct layout.
  ...)

The OCR for supplement pages can reuse the existing ocr-with-vision-fallback path, or call Document AI directly without the vision fallback. Only positioned OCR elements are useful for the merge, and vision text isn't positioned.

Important: For supplemented pages, we need positioned elements from OCR, not just text. This means calling ocr-layout/page->elements on the Document AI response before the merge, then running layout/elements->structured-text on the merged result. The current reconstruct-layout produces text directly — we need the intermediate elements.

This requires exposing page->elements from ocr_layout.clj (currently private) or adding a function that returns elements instead of text.

Method reporting for supplemented pages

Supplemented pages use a new method keyword: :pdf-lib+ocr

{:page-methods {1 :pdf-lib+ocr, 2 :pdf-lib}}

Document-level :method remains :pdf-lib (the primary source). The :page-methods map captures the detail for diagnostics.

What extract-with-layout returns (updated)

Add :page-images and :page-elements to the return map:

{:text             "..."
 :page-count       2
 :page-char-counts {1 500, 2 200}
 :page-texts       {1 "=== PAGE 1 ===\n...", 2 "..."}
 :page-elements    {1 [{:text "..." :x 0.1 :y 0.2 ...} ...], 2 [...]}
 :page-images      {1 [{:x 0.12 :y 0.95 :width 0.81 :height 0.05}], 2 []}
 :quality-score    1.0
 :raw-response     []
 :method           :pdf-lib}

:page-elements is needed so the spatial merge can operate on positioned elements rather than reconstructed text. Currently extract-with-layout runs layout reconstruction inline and only returns text — the elements are lost. Preserving them avoids re-extracting.

:page-images is filtered to significant images only.

Document AI changes

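For pages that fail the PDFBox quality gate, send to Document AI with enableNativePdfParsing: false. Rationale: native PDF parsing is what clips text at layout block boundaries (see Problem), and a page that fails the gate has little or no usable embedded text anyway, so image-mode OCR is the right path.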

The existing chunking logic (max 15 pages per request) still applies for fallback pages.
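
For illustration, chunking a set of fallback pages into requests of at most 15 pages could look like the sketch below. The name chunk-pages is hypothetical and this is not the existing chunking code.

```clojure
(defn chunk-pages
  "Split page numbers into ordered chunks of at most `max-pages` pages."
  [pages max-pages]
  (mapv vec (partition-all max-pages (sort pages))))

;; 32 fallback pages become three requests of 15, 15 and 2 pages.
(mapv count (chunk-pages (range 1 33) 15))
```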

Page number management

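Each tier must emit === PAGE N === markers with globally correct page numbers (1-indexed relative to the original document). Pages processed as a sub-PDF are numbered from 1 within that sub-PDF, so they must be mapped back to their original page numbers before the marker is written.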

Assembly: sort all page blocks by page number, concatenate with \n\n.
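
A minimal sketch of page labelling and assembly, assuming each tier hands back page bodies in the order the pages were sent. The names page-block, sub-pdf-blocks and assemble-pages are hypothetical.

```clojure
(require '[clojure.string :as str])

(defn page-block
  "Prefix a page body with its marker, using the ORIGINAL page number."
  [page-number body]
  (str "=== PAGE " page-number " ===\n" body))

(defn sub-pdf-blocks
  "Label bodies returned for a sub-PDF. `original-pages` lists the original
   page numbers in the order they were copied into the sub-PDF."
  [original-pages bodies]
  (into {} (map (fn [p body] [p (page-block p body)]) original-pages bodies)))

(defn assemble-pages
  "Sort page blocks by page number and join them with a blank line."
  [page-blocks]
  (str/join "\n\n" (map val (sort-by key page-blocks))))

;; Pages 1-2 came from PDFBox; page 4 was page 1 of a fallback sub-PDF.
(assemble-pages
 (merge {1 (page-block 1 "a"), 2 (page-block 2 "b")}
        (sub-pdf-blocks [4] ["d"])))
```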

Method reporting

{:page-methods {1 :pdf-lib, 2 :pdf-lib, 3 :pdf-lib, 4 :ocr}}
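
One way to derive the :page-methods map from the per-page decisions is sketched below. The function name and its three set arguments are hypothetical; only the keywords that appear in this document are used, so pages that escalate to Vision are not distinguished here.

```clojure
(defn page-methods
  "Build the :page-methods map from sets of page numbers.
   `passing`      pages that passed the PDFBox quality gate
   `supplemented` subset of `passing` that also got OCR merged in
   `failing`      pages that fell back to Document AI"
  [passing supplemented failing]
  (into (sorted-map)
        (concat
         (map (fn [p] [p (if (contains? supplemented p) :pdf-lib+ocr :pdf-lib)])
              passing)
         (map (fn [p] [p :ocr]) failing))))

(page-methods #{1 2 3} #{1} #{4})
```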

Configuration

Add to transcription config in config.edn:

:pdf-lib {:min-chars-per-page 50}  ;; quality gate threshold

What doesn't change

Failure modes

| Scenario | PDFBox | Fallback |
|---|---|---|
| Modern programmatic PDF | Complete text, layout reconstructed | N/A |
| Scanned document (images) | 0 chars → fails gate | Document AI image OCR |
| Bad font encoding (mojibake) | Low char density or garbage → fails gate | Document AI image OCR |
| Scanned with bad OCR layer | Extracts bad OCR text, may pass gate | Risk: bad text used. Mitigated by low threshold. |
| Mixed content (some pages scanned) | Per-page: text pages pass, scan pages fail | Only scan pages go to Document AI |
| Hidden/duplicate text objects | May extract garbage, could pass gate | Rare for invoices |