For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Detect image-rendered text in PDFs that pass the PDFBox quality gate and supplement with targeted OCR, so content like Vodafone's footer (VAT ID, IBAN, address baked as a bitmap) is not silently lost.
Architecture: After PDFBox text extraction, scan each page for XObject images using PDFGraphicsStreamEngine. Pages that pass the text quality gate but contain significant images get a supplementary Document AI OCR call. The OCR positioned elements are spatially merged with PDFBox elements — PDFBox wins for overlapping regions, OCR fills in image-only regions. The merged elements go through the existing layout/elements->structured-text pipeline.
Design Reference: See "Image detection and spatial merge" section in docs/plans/2026-03-09-pdfbox-first-transcription-design.md.
Tech Stack: PDFBox (PDFGraphicsStreamEngine, PDImage), existing layout.clj positioned element model, existing Document AI OCR pipeline.
pdfbox_layout.clj
Files:
src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj
test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj (create)
Step 1: Write the failing test
Create the test file. Build a PDF with an embedded image using PDFBox, then assert detect-page-images returns its bounding box.
(ns com.getorcha.workers.ingestion.transcription.pdfbox-layout-test
(:require [clojure.test :refer [deftest is testing]]
[com.getorcha.workers.ingestion.transcription.pdfbox-layout :as pdfbox-layout])
(:import (java.io ByteArrayOutputStream)
(org.apache.pdfbox Loader)
(org.apache.pdfbox.pdmodel PDDocument PDPage PDPageContentStream)
(org.apache.pdfbox.pdmodel.font PDType1Font Standard14Fonts$FontName)
(org.apache.pdfbox.pdmodel.graphics.image PDImageXObject)))
(set! *warn-on-reflection* true)
(defn ^:private create-pdf-with-text-and-image
"Create a PDF with text content and an embedded image.
Returns byte array of the PDF."
^bytes []
(let [baos (ByteArrayOutputStream.)]
(with-open [doc (PDDocument.)]
(let [page (PDPage.)]
(.addPage doc page)
(with-open [cs (PDPageContentStream. doc page)]
;; Add text content in upper portion
(.beginText cs)
(.setFont cs (PDType1Font. Standard14Fonts$FontName/HELVETICA) 12)
(.newLineAtOffset cs 50 700)
(.showText cs "Invoice INV-001 from Supplier Corp for 12345.67 EUR total amount")
(.endText cs)
;; Add a wide image near the bottom (simulating a footer)
;; Create a minimal 480x58 pixel image
(let [img (java.awt.image.BufferedImage. 480 58 java.awt.image.BufferedImage/TYPE_INT_RGB)
;; Encode the image as PNG bytes for PDImageXObject/createFromByteArray
img-baos (ByteArrayOutputStream.)
_ (javax.imageio.ImageIO/write img "png" ^java.io.OutputStream img-baos)
pd-img (PDImageXObject/createFromByteArray
doc (.toByteArray img-baos) "footer.png")]
;; Draw image at bottom of page: x=72, y=15, width=480, height=44
(.drawImage cs pd-img 72.0 15.0 480.0 44.0))))
(.save doc baos))
(.toByteArray baos)))
(defn ^:private create-pdf-text-only
"Create a PDF with only text content, no images."
^bytes []
(let [baos (ByteArrayOutputStream.)]
(with-open [doc (PDDocument.)]
(let [page (PDPage.)]
(.addPage doc page)
(with-open [cs (PDPageContentStream. doc page)]
(.beginText cs)
(.setFont cs (PDType1Font. Standard14Fonts$FontName/HELVETICA) 12)
(.newLineAtOffset cs 50 700)
(.showText cs "Invoice INV-001 just text no images")
(.endText cs)))
(.save doc baos))
(.toByteArray baos)))
(deftest test-detect-page-images
(testing "Detects image with bounding box on a page"
(let [pdf-bytes (create-pdf-with-text-and-image)]
(with-open [doc (Loader/loadPDF pdf-bytes)]
(let [images (pdfbox-layout/detect-page-images doc 0)]
(is (= 1 (count images)))
(let [{:keys [x y width height]} (first images)]
;; Image is at x=72, width=480 on a 612pt-wide page
(is (< 0.1 x 0.13) "x should be ~72/612 = 0.118")
;; Image is near bottom of 792pt-high page
(is (> y 0.9) "y should be near bottom")
;; Width ~480/612 = 0.78
(is (< 0.7 width 0.85) "width should be ~0.78")
;; Height ~44/792 = 0.056
(is (< 0.04 height 0.07) "height should be ~0.056"))))))
(testing "Returns empty seq for pages without images"
(let [pdf-bytes (create-pdf-text-only)]
(with-open [doc (Loader/loadPDF pdf-bytes)]
(is (empty? (pdfbox-layout/detect-page-images doc 0)))))))
(deftest test-significant-image?
(testing "Wide, tall-enough images are significant"
(is (pdfbox-layout/significant-image?
{:x 0.12 :y 0.93 :width 0.78 :height 0.056})))
(testing "Small images (logos, icons) are not significant"
(is (not (pdfbox-layout/significant-image?
{:x 0.4 :y 0.05 :width 0.08 :height 0.09}))))
(testing "Narrow images are not significant"
(is (not (pdfbox-layout/significant-image?
{:x 0.1 :y 0.5 :width 0.2 :height 0.05}))))
(testing "Very short images are not significant"
(is (not (pdfbox-layout/significant-image?
{:x 0.1 :y 0.5 :width 0.5 :height 0.01})))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: Compilation error — detect-page-images and significant-image? don't exist.
Step 3: Implement detect-page-images and significant-image?
In pdfbox_layout.clj, add the import and two functions.
Add imports:
(:import (org.apache.pdfbox Loader)
(org.apache.pdfbox.contentstream PDFGraphicsStreamEngine)
(org.apache.pdfbox.pdmodel PDDocument PDPage)
(org.apache.pdfbox.pdmodel.graphics.image PDImage)
(org.apache.pdfbox.text PDFTextStripper TextPosition)))
Note: the existing import of PDDocument stays, but add PDPage, PDFGraphicsStreamEngine, and PDImage.
Add functions between the set! and extract-page-elements:
(defn detect-page-images
"Detect XObject images on a PDF page with their bounding boxes.
Subclasses PDFGraphicsStreamEngine to intercept drawImage calls.
Returns seq of `{:x :y :width :height}` maps, normalized to 0.0-1.0.
Coordinates are top-down (Y=0 at page top)."
[^PDDocument doc page-index]
(let [^PDPage page (.getPage doc page-index)
page-w (.. page getMediaBox getWidth)
page-h (.. page getMediaBox getHeight)
images (atom [])
engine (proxy [PDFGraphicsStreamEngine] [page]
(drawImage [^PDImage _pd-image]
(let [ctm (.. this getGraphicsState
getCurrentTransformationMatrix)
width (Math/abs (.getScaleX ctm))
height (Math/abs (.getScaleY ctm))
tx (.getTranslateX ctm)
ty (.getTranslateY ctm)
;; PDF Y-axis is bottom-up; convert to top-down
x tx
y (- page-h ty height)]
(swap! images conj
{:x (/ x page-w)
:y (/ y page-h)
:width (/ width page-w)
:height (/ height page-h)})))
;; Required abstract methods — no-op
(appendRectangle [_ _ _ _])
(clip [_])
(moveTo [_ _])
(lineTo [_ _])
(curveTo [_ _ _ _ _ _])
(getCurrentPoint [] nil)
(closePath [])
(endPath [])
(strokePath [])
(fillPath [_])
(fillAndStrokePath [_])
(shadingFill [_]))]
(.processPage engine page)
@images))
(defn significant-image?
"An image is significant if it's wide enough to contain a line of text
and tall enough to hold at least one text line.
Thresholds: >30% page width and >2% page height.
Intentionally loose — false positives cost ~$0.0015, false negatives
lose text."
[{:keys [width height]}]
(and (> width 0.3)
(> height 0.02)))
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "feat: add image detection to pdfbox_layout"
page-elements and page-images from extract-with-layout
Files:
src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj:54-90 (extract-with-layout)
test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
Step 1: Write the failing test
Add to the test file:
(deftest test-extract-with-layout-returns-elements-and-images
(testing "Returns page-elements and page-images in result"
(let [pdf-bytes (create-pdf-with-text-and-image)
result (pdfbox-layout/extract-with-layout pdf-bytes)]
;; page-elements contains positioned element seqs per page
(is (map? (:page-elements result)))
(is (seq (get-in result [:page-elements 1])) "Page 1 should have elements")
(is (every? (fn [e] (and (:text e) (:x e) (:y e) (:width e) (:height e)))
(get-in result [:page-elements 1])))
;; page-images contains detected images per page
(is (map? (:page-images result)))
(is (= 1 (count (get-in result [:page-images 1]))) "Page 1 has one image")))
(testing "page-images lists no images for text-only PDFs"
(let [pdf-bytes (create-pdf-text-only)
result (pdfbox-layout/extract-with-layout pdf-bytes)]
(is (every? empty? (vals (:page-images result)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: FAIL — :page-elements and :page-images not in result.
Step 3: Update extract-with-layout
Modify the function to also detect images and preserve elements:
(defn extract-with-layout
"Extract text from PDF with layout reconstruction.
Processes each page independently, capturing positioned text elements
and reconstructing tabular structure with column separators.
Returns:
:text - Layout-reconstructed text with === PAGE N === markers
:page-count - Number of pages
:page-char-counts - Map of page-number (1-indexed) to character count
:page-texts - Map of page-number to page text (non-blank pages only)
:page-elements - Map of page-number to seq of positioned elements
:page-images - Map of page-number to seq of significant image bounding boxes
:quality-score - Always 1.0 (native PDF text)
:raw-response - Empty vector (no external API calls)
:method - :pdf-lib"
[^bytes pdf-bytes]
(with-open [doc (Loader/loadPDF pdf-bytes)]
(let [page-count (.getNumberOfPages doc)
opts {:column-gap-threshold 0.05}
page-results (mapv (fn [i]
(let [elements (extract-page-elements doc i)
images (filterv significant-image?
(detect-page-images doc i))
page-text (layout/elements->structured-text elements opts)
page-num (inc i)]
{:page-num page-num
:elements elements
:images images
:text (when-not (str/blank? page-text)
(str "=== PAGE " page-num " ===\n" page-text))
:char-count (reduce + 0 (map (comp count :text) elements))}))
(range page-count))]
{:text (->> page-results
(keep :text)
(str/join "\n\n"))
:page-count page-count
:page-char-counts (into {} (map (juxt :page-num :char-count) page-results))
:page-texts (into {} (keep (fn [{:keys [page-num text]}]
(when text [page-num text]))
page-results))
:page-elements (into {} (map (juxt :page-num :elements) page-results))
:page-images (into {} (map (juxt :page-num :images) page-results))
:quality-score 1.0
:raw-response []
:method :pdf-lib})))
Note: filterv is in clojure.core, so no new require is needed.
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: PASS
Step 5: Run existing tests to ensure no regressions
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'
Expected: PASS — existing tests don't check for absence of :page-elements/:page-images, and the existing fields (:text, :page-char-counts, :page-texts) are unchanged. The only risk is that test-pdfbox-first-skips-ocr-for-text-rich-pdfs checks (is (nil? (:page-char-counts result))) — but that nil check is on the final result after dissoc in transcription.clj, not on the pdfbox result directly.
Step 6: Commit
git add src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "feat: return page-elements and page-images from extract-with-layout"
page->elements from ocr_layout.clj
Files:
src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj:50 (remove ^:private)
test/com/getorcha/workers/ingestion/transcription/ocr_layout_test.clj
The spatial merge needs OCR positioned elements, not final text. Currently page->elements is private. Make it public.
Step 1: Write the failing test
Add to the existing ocr_layout_test.clj:
(deftest test-page->elements-returns-positioned-elements
(testing "Returns positioned element maps from a Document AI page"
(let [document-text "Invoice\nTotal: 100.00"
page {:lines [{:layout (make-layout 0 7 0.1 0.1 0.3 0.02)}
{:layout (make-layout 8 21 0.1 0.15 0.4 0.02)}]}
elements (ocr-layout/page->elements document-text page)]
(is (= 2 (count elements)))
(is (= "Invoice" (:text (first elements))))
(is (= "Total: 100.00" (:text (second elements))))
(is (every? #(and (:x %) (:y %) (:width %) (:height %)) elements)))))
Note: check how make-layout is defined in the existing test file. It likely builds the {:textAnchor {:textSegments ...} :boundingPoly {:normalizedVertices ...}} structure. Adapt the test to use the existing helper.
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.ocr-layout-test]'
Expected: FAIL. page->elements is private, so referencing ocr-layout/page->elements from the test namespace fails to compile with a "var ... is not public" error.
Step 3: Remove ^:private from page->elements
In ocr_layout.clj line 50, change:
(defn ^:private page->elements
to:
(defn page->elements
Update the docstring to note it's public for use by the spatial merge:
(defn page->elements
"Extract all positioned text elements from a Document AI page.
Uses :lines for finer granularity, falls back to :blocks.
Returns seq of `{:text :x :y :width :height}` maps with
normalized coordinates (0.0-1.0).
Public because the spatial merge consumes these elements directly."
[document-text {:keys [lines blocks] :as _page}]
...)
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.ocr-layout-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj test/com/getorcha/workers/ingestion/transcription/ocr_layout_test.clj
git commit -m "feat: make page->elements public for spatial merge"
layout.clj
Files:
src/com/getorcha/workers/ingestion/transcription/layout.clj
test/com/getorcha/workers/ingestion/transcription/layout_test.clj
Step 1: Write the failing test
Add to layout_test.clj:
(deftest test-merge-elements
(testing "Non-overlapping OCR elements are included"
(let [pdfbox-elements [{:text "Invoice body" :x 0.1 :y 0.1 :width 0.3 :height 0.02}]
ocr-elements [{:text "Invoice body" :x 0.1 :y 0.1 :width 0.3 :height 0.02}
{:text "Footer: VAT ID DE123" :x 0.1 :y 0.95 :width 0.8 :height 0.04}]
merged (layout/merge-elements pdfbox-elements ocr-elements)]
(is (= 2 (count merged)))
(is (some #(= "Footer: VAT ID DE123" (:text %)) merged))
(is (some #(= "Invoice body" (:text %)) merged))))
(testing "Overlapping OCR elements are excluded (PDFBox wins)"
(let [pdfbox-elements [{:text "Total: 100.00" :x 0.5 :y 0.3 :width 0.2 :height 0.02}]
ocr-elements [{:text "Total: 100.0" :x 0.5 :y 0.3 :width 0.19 :height 0.02}]
merged (layout/merge-elements pdfbox-elements ocr-elements)]
(is (= 1 (count merged)))
(is (= "Total: 100.00" (:text (first merged))) "PDFBox text wins")))
(testing "Empty PDFBox elements — all OCR elements included"
(let [ocr-elements [{:text "Scanned text" :x 0.1 :y 0.1 :width 0.3 :height 0.02}]
merged (layout/merge-elements [] ocr-elements)]
(is (= 1 (count merged)))))
(testing "Empty OCR elements — PDFBox elements unchanged"
(let [pdfbox-elements [{:text "Native text" :x 0.1 :y 0.1 :width 0.3 :height 0.02}]
merged (layout/merge-elements pdfbox-elements [])]
(is (= 1 (count merged))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'
Expected: FAIL — merge-elements doesn't exist.
Step 3: Implement merge-elements
Add to layout.clj after elements->structured-text:
(defn ^:private overlaps?
"Check if two bounding boxes overlap (2D axis-aligned intersection)."
[a b]
(and (< (:x a) (+ (:x b) (:width b)))
(< (:x b) (+ (:x a) (:width a)))
(< (:y a) (+ (:y b) (:height b)))
(< (:y b) (+ (:y a) (:height a)))))
(defn merge-elements
"Merge PDFBox and OCR positioned elements for a single page.
PDFBox elements are authoritative for native text (no clipping issues).
OCR elements are only included when they don't overlap with any PDFBox
element — meaning they come from image regions that PDFBox couldn't read."
[pdfbox-elements ocr-elements]
(let [ocr-only (remove (fn [ocr-elem]
(some #(overlaps? ocr-elem %) pdfbox-elements))
ocr-elements)]
(into (vec pdfbox-elements) ocr-only)))
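The overlap test uses strict inequalities, which has two consequences worth checking at the REPL. The sketch below restates overlaps? so it runs standalone; native-line and the coordinates are made-up values chosen to be exactly representable as doubles, not data from a real invoice.

```clojure
(defn overlaps?
  "Same strict-inequality test as in layout.clj, restated so this runs alone."
  [a b]
  (and (< (:x a) (+ (:x b) (:width b)))
       (< (:x b) (+ (:x a) (:width a)))
       (< (:y a) (+ (:y b) (:height b)))
       (< (:y b) (+ (:y a) (:height a)))))

;; Hypothetical PDFBox element occupying y = 0.25 .. 0.375
(def native-line {:x 0.125 :y 0.25 :width 0.5 :height 0.125})

;; OCR box that starts exactly where native-line ends: edges touch, no overlap,
;; so merge-elements would keep it.
(overlaps? {:x 0.125 :y 0.375 :width 0.5 :height 0.125} native-line) ; => false

;; A large OCR box that only partly covers native-line still counts as
;; overlapping, so merge-elements would drop the whole box.
(overlaps? {:x 0.0 :y 0.25 :width 1.0 :height 0.5} native-line)      ; => true
```

The second case matters mainly if page->elements falls back to :blocks, where a single OCR element can span both native text and an image region.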
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/ingestion/transcription/layout.clj test/com/getorcha/workers/ingestion/transcription/layout_test.clj
git commit -m "feat: add spatial merge for PDFBox + OCR elements"
extract-text
Files:
src/com/getorcha/workers/ingestion/transcription.clj:706-754 (the PDF branch of extract-text)
test/com/getorcha/workers/ingestion/transcription_test.clj
This is the integration point. The existing flow in extract-text (lines 706-754) is:
PDFBox → check page char counts → all pass / all fail / mixed
We add a fourth case: pages that pass the text gate but have significant images.
Step 1: Write the failing test
Add to transcription_test.clj. Build a PDF where page 1 has both text and an embedded image. Assert that OCR is called to supplement.
(deftest test-pdfbox-supplements-ocr-for-pages-with-images
(testing "Pages with text + images trigger supplementary OCR and merge"
(let [ocr-call-count (atom 0)
;; OCR response with a "footer" element positioned at the bottom
ocr-response {:document
{:text "Invoice INV-001\nFooter: USt-IdNr DE813113094"
:pages [{:imageQualityScores {:qualityScore 0.95}
:tokens [{:layout {:confidence 0.98}}
{:layout {:confidence 0.97}}]
:lines [{:layout {:textAnchor {:textSegments [{:startIndex "0" :endIndex "15"}]}
:boundingPoly {:normalizedVertices [{:x 0.1 :y 0.1}
{:x 0.6 :y 0.1}
{:x 0.6 :y 0.13}
{:x 0.1 :y 0.13}]}}}
;; :text above is 44 characters long and endIndex is exclusive
{:layout {:textAnchor {:textSegments [{:startIndex "16" :endIndex "44"}]}
:boundingPoly {:normalizedVertices [{:x 0.1 :y 0.93}
{:x 0.9 :y 0.93}
{:x 0.9 :y 0.97}
{:x 0.1 :y 0.97}]}}}]}]}}
ocr-spy (fn [& _args]
(swap! ocr-call-count inc)
{:status 200 :body ocr-response})
;; Build PDF with text + image
pdf-bytes (let [baos (java.io.ByteArrayOutputStream.)]
(with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
(let [page (org.apache.pdfbox.pdmodel.PDPage.)]
(.addPage doc page)
(with-open [cs (org.apache.pdfbox.pdmodel.PDPageContentStream. doc page)]
;; Text content
(.beginText cs)
(.setFont cs (org.apache.pdfbox.pdmodel.font.PDType1Font.
org.apache.pdfbox.pdmodel.font.Standard14Fonts$FontName/HELVETICA) 12)
(.newLineAtOffset cs 50 700)
(.showText cs "Invoice INV-001 from Supplier Corp for 12345.67 EUR total amount")
(.endText cs)
;; Image at bottom (simulating footer)
(let [img (java.awt.image.BufferedImage. 480 58 java.awt.image.BufferedImage/TYPE_INT_RGB)
img-baos (java.io.ByteArrayOutputStream.)
_ (javax.imageio.ImageIO/write img "png" ^java.io.OutputStream img-baos)
pd-img (org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject/createFromByteArray
doc (.toByteArray img-baos) "footer.png")]
(.drawImage cs pd-img 72.0 15.0 480.0 44.0))))
(.save doc baos))
(.toByteArray baos))
context {:transcription {:pdf-lib {:min-chars-per-page 50}
:ocr {:provider :ocr
:project-id "test-project"
:location "eu"
:processor-id "test-processor"}
:vision {}}
:llm-config {}
:worker-pools {}}
ingestion {:file {:contents pdf-bytes :mime-type "application/pdf"}}]
(with-redefs-fn {#'hato/post ocr-spy
#'workers.transcription/get-access-token (constantly "test-token")}
(fn []
(let [result (workers.transcription/extract-text context ingestion)]
;; Method is pdf-lib (primary source)
(is (= :pdf-lib (:method result)))
;; OCR was called once (for the page with images)
(is (= 1 @ocr-call-count) "OCR should be called once for supplementation")
;; Result should contain native text
(is (string/includes? (:text result) "Invoice"))
;; Result should contain OCR-supplemented footer text
(is (string/includes? (:text result) "USt-IdNr DE813113094")
"Footer text from image should be present via OCR supplement")
;; Page method should indicate supplementation
(is (= {1 :pdf-lib+ocr} (:page-methods result)))))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]' :vars '[com.getorcha.workers.ingestion.transcription-test/test-pdfbox-supplements-ocr-for-pages-with-images]'
Expected: FAIL — no supplementation logic exists yet, footer text won't appear.
Step 3: Implement the supplementation logic
In transcription.clj, update the require to include layout and ocr-layout:
[com.getorcha.workers.ingestion.transcription.layout :as layout]
[com.getorcha.workers.ingestion.transcription.ocr-layout :as ocr-layout]
Then in the extract-text function, modify the (empty? failing-pages) branch (line 732-737). Currently:
;; All pages pass — use PDFBox result
(empty? failing-pages)
(-> (dissoc pdfbox-result :page-char-counts :page-texts)
(assoc :page-methods
(into {} (map (fn [p] [p :pdf-lib])
(range 1 (inc total-pages))))))
Replace with logic that checks for pages needing image supplementation:
;; All pages pass text gate
(empty? failing-pages)
(let [supplement-pages (into (sorted-set)
(comp (filter (fn [[_ imgs]] (seq imgs)))
(map first))
(:page-images pdfbox-result))]
(if (empty? supplement-pages)
;; No images — pure PDFBox result
(-> (dissoc pdfbox-result :page-char-counts :page-texts
:page-elements :page-images)
(assoc :page-methods
(into {} (map (fn [p] [p :pdf-lib])
(range 1 (inc total-pages))))))
;; Some pages have images — supplement with OCR
(let [supp-sorted (vec (sort supplement-pages))
supp-indices (mapv dec supp-sorted)
supp-pdf (extract-pages contents supp-indices)
ocr-result (ocr-transcribe! ocr-context
{:file {:contents supp-pdf
:mime-type "application/pdf"}})
ocr-responses (:raw-response ocr-result)
;; Build OCR elements per supplement page
ocr-elements-by-page
(into {}
(mapcat
;; Each item is a [response page-offset] pair, so destructure a single argument
(fn [[response page-offset]]
(let [doc-text (get-in response [:document :text])
pages (get-in response [:document :pages])]
(map-indexed
(fn [idx page]
(let [ocr-page-num (nth supp-sorted (+ page-offset idx))]
[ocr-page-num (ocr-layout/page->elements doc-text page)]))
pages))))
(let [page-counts (reductions + 0 (map #(count (get-in % [:document :pages]))
ocr-responses))]
(map vector ocr-responses page-counts)))
;; Merge elements and reconstruct layout per supplemented page
opts {:column-gap-threshold 0.05}
supplemented-texts
(into {}
(map (fn [p]
(let [pb-elems (get-in pdfbox-result [:page-elements p])
ocr-elems (get ocr-elements-by-page p [])
merged (layout/merge-elements pb-elems ocr-elems)
page-text (layout/elements->structured-text merged opts)]
[p (str "=== PAGE " p " ===\n" page-text)])))
supp-sorted)
;; Replace page texts for supplemented pages
final-text (->> (range 1 (inc total-pages))
(keep (fn [p]
(or (get supplemented-texts p)
(get (:page-texts pdfbox-result) p))))
(str/join "\n\n"))]
{:text final-text
:page-count total-pages
:quality-score 1.0
:raw-response ocr-responses
:method :pdf-lib
:page-methods (into {} (map (fn [p]
[p (if (supplement-pages p)
:pdf-lib+ocr
:pdf-lib)])
(range 1 (inc total-pages))))})))
Note: ocr-transcribe! is the multimethod that calls Document AI. It's called directly here instead of ocr-with-vision-fallback because we need the raw response with bounding boxes. Supplement pages need only positioned elements, not layout reconstruction or vision fallback. Make sure :native-pdf-parsing? false is passed through the context.
Important implementation detail: Review how ocr-transcribe! is called. It currently takes (ocr-transcribe! context ingestion). The context must include :native-pdf-parsing? false (already set in ocr-context at line 719). The multimethod returns a map with :raw-response containing the Document AI responses.
Adapt the above code to match the exact calling convention of ocr-transcribe! in the codebase. Read its signature and return shape before implementing. The sketch above shows the data flow — the exact function calls may need adjustment.
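The offset bookkeeping in the sketch above is the easiest part to get wrong, so it may help to isolate it as a pure function first. The following is a minimal sketch, assuming :raw-response is a seq of Document AI-shaped maps (possibly one per batch) whose pages arrive in the order they were sent. ocr-pages-by-number is a hypothetical name, not an existing function in the codebase.

```clojure
(defn ocr-pages-by-number
  "Hypothetical helper: pair each OCR page with its original 1-indexed
   page number. `supp-sorted` is the ascending vector of page numbers sent
   to OCR; `responses` is a seq of maps shaped like {:document {:pages [...]}}."
  [supp-sorted responses]
  (let [page-lists (map #(get-in % [:document :pages]) responses)
        ;; Pages seen before each response: 0, n1, n1+n2, ...
        offsets    (reductions + 0 (map count page-lists))]
    (into {}
          (mapcat (fn [pages offset]
                    (map-indexed (fn [idx page]
                                   [(nth supp-sorted (+ offset idx)) page])
                                 pages))
                  page-lists offsets))))

;; Pages 2, 5 and 7 were sent to OCR and came back in two batches.
(ocr-pages-by-number
 [2 5 7]
 [{:document {:pages [{:id :a} {:id :b}]}}
  {:document {:pages [{:id :c}]}}])
;; => {2 {:id :a}, 5 {:id :b}, 7 {:id :c}}
```

Passing the two collections to mapcat separately avoids building intermediate pairs. Whether ocr-transcribe! ever returns more than one response is something to confirm against the codebase.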
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]' :vars '[com.getorcha.workers.ingestion.transcription-test/test-pdfbox-supplements-ocr-for-pages-with-images]'
Expected: PASS
Step 5: Run all transcription tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'
Expected: PASS — existing tests should not be affected because their test PDFs have no embedded images, so supplement-pages will be empty and they'll hit the existing pure-PDFBox path.
Step 6: Commit
git add src/com/getorcha/workers/ingestion/transcription.clj test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "feat: supplement PDFBox pages with OCR when images detected"
Files:
src/com/getorcha/workers/ingestion/transcription.clj
Step 1: Verify internal fields don't leak
The existing test test-pdfbox-first-skips-ocr-for-text-rich-pdfs already checks:
(is (nil? (:page-char-counts result)))
(is (nil? (:page-texts result)))
Add checks for the new fields:
(is (nil? (:page-elements result)))
(is (nil? (:page-images result)))
Step 2: Run to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]' :vars '[com.getorcha.workers.ingestion.transcription-test/test-pdfbox-first-skips-ocr-for-text-rich-pdfs]'
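Expected: FAIL if :page-elements and :page-images still leak into the final result. If the pure-PDFBox dissoc was already extended as part of the extract-text change above, this passes immediately and Step 3 reduces to a verification pass.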
Step 3: Update dissoc calls
In transcription.clj, wherever pdfbox-result is dissoc'd before return, add :page-elements and :page-images. Check each of the three return paths:
The pure PDFBox path (no failing pages, no supplement pages):
(dissoc pdfbox-result :page-char-counts :page-texts :page-elements :page-images)
The supplement path already builds a fresh map without these keys (no change needed).
The merge-mixed-pages function receives pdfbox-result — check that it doesn't pass through :page-elements/:page-images. It builds a fresh map, so it's fine.
Step 4: Run to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/workers/ingestion/transcription.clj test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "fix: strip page-elements and page-images from final result"
Step 1: Lint
Run: clj-kondo --lint src test dev
Fix any issues.
Step 2: Run all transcription-related tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test com.getorcha.workers.ingestion.transcription.layout-test com.getorcha.workers.ingestion.transcription.ocr-layout-test com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: ALL PASS
Step 3: Commit any lint fixes
PDFGraphicsStreamEngine proxy: The drawImage method receives a PDImage, which is the abstract image interface. The CTM at that point is the cumulative transformation matrix. For simple image placements (cm ... /I0 Do), getScaleX/getScaleY give width/height and getTranslateX/getTranslateY give position. For rotated or skewed images the scale components no longer equal the rendered size, and the bounding box would have to be derived from the transformed unit square. Invoice footers are typically axis-aligned, so the simple form is sufficient here.
Y-axis conversion: PDF coordinates have Y=0 at the bottom of the page. PDFBox TextPosition.getY() already converts to top-down coordinates, but CTM.getTranslateY() does not. The image detection code must convert: y_topdown = page_height - translateY - height.
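As a worked check of that formula, here is a minimal sketch using the footer image from the test PDF (drawn at x=72, y=15, 480x44 pt on a 612x792 pt page). ctm->top-down-box is a hypothetical helper name used only for illustration. It assumes an axis-aligned placement where the CTM translation is the image's lower-left corner.

```clojure
(defn ctm->top-down-box
  "Hypothetical helper: normalize an axis-aligned image placement.
   tx/ty are the CTM translation (lower-left corner, in points);
   w/h are the absolute CTM scale values (image size in points)."
  [page-w page-h tx ty w h]
  {:x      (/ tx page-w)
   ;; PDF origin is bottom-left, so the image's top edge sits at ty + h.
   ;; Measuring that edge from the top of the page gives page-h - ty - h.
   :y      (/ (- page-h ty h) page-h)
   :width  (/ w page-w)
   :height (/ h page-h)})

;; Footer image from create-pdf-with-text-and-image
(ctm->top-down-box 612.0 792.0 72.0 15.0 480.0 44.0)
;; :y is 733/792 (about 0.93), so the box sits in the bottom 8% of the page
```

The same numbers explain the test bounds: :x is 72/612, :width is 480/612 and :height is 44/792.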
ocr-transcribe! calling convention: Read the multimethod definition carefully. It may need the full context map with :transcription config. The key thing is that native-pdf-parsing? is false so Document AI does image-based OCR (otherwise it would re-read the same native text and miss the image content).
Cost/latency impact: Only pages with significant images trigger OCR. For the vast majority of programmatic PDFs (no image-rendered text), the only added cost is the PDFGraphicsStreamEngine.processPage() call per page, which is ~1ms.
The mixed-pages path (some pages fail text gate, some pass with images): Both failing-pages and supplement-pages could be non-empty. The current plan handles them in separate branches (empty? failing-pages vs not). If a page fails the text gate, it goes through the existing OCR fallback path. If a page passes the text gate but has images, it goes through the new supplement path. These are mutually exclusive per page by definition (a page can't both fail and pass the threshold). But the implementation needs to handle the case where some pages fail and other pages pass-with-images in the same document — this requires both the existing mixed-page OCR and the supplement OCR. Handle this by collecting both sets, running OCR for (set/union failing-pages supplement-pages), then dispatching each page's result through the appropriate path (full OCR text vs spatial merge).
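One way to sketch that dispatch is to classify every page up front and derive the OCR page set from the classification. This is an illustration under assumptions, not the codebase's implementation: classify-pages is a hypothetical helper, and the :ocr method keyword for fallback pages is a placeholder for whatever the existing mixed-page path reports.

```clojure
(require '[clojure.set :as set])

(defn classify-pages
  "Hypothetical helper: decide how each page should be processed.
   `failing-pages` and `supplement-pages` are sets of 1-indexed page numbers."
  [total-pages failing-pages supplement-pages]
  (let [;; A failing page gets full OCR anyway, so it never needs the supplement.
        supplement (set/difference supplement-pages failing-pages)]
    {:ocr-pages    (into (sorted-set) (set/union failing-pages supplement))
     :page-methods (into (sorted-map)
                         (map (fn [p]
                                [p (cond
                                     (failing-pages p) :ocr          ; placeholder keyword
                                     (supplement p)    :pdf-lib+ocr
                                     :else             :pdf-lib)]))
                         (range 1 (inc total-pages)))}))

;; Page 2 fails the text gate (and has an image); page 3 passes but has an image.
(classify-pages 4 #{2} #{2 3})
;; => {:ocr-pages #{2 3}, :page-methods {1 :pdf-lib, 2 :ocr, 3 :pdf-lib+ocr, 4 :pdf-lib}}
```

With this shape, a single OCR call covers :ocr-pages, and each page's response is then routed by its entry in :page-methods: full OCR text for fallback pages, spatial merge for supplemented ones.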