Image Text Supplementation Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Detect image-rendered text in PDFs that pass the PDFBox quality gate and supplement with targeted OCR, so content like Vodafone's footer (VAT ID, IBAN, address baked as a bitmap) is not silently lost.

Architecture: After PDFBox text extraction, scan each page for XObject images using PDFGraphicsStreamEngine. Pages that pass the text quality gate but contain significant images get a supplementary Document AI OCR call. The OCR positioned elements are spatially merged with PDFBox elements — PDFBox wins for overlapping regions, OCR fills in image-only regions. The merged elements go through the existing layout/elements->structured-text pipeline.
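
The page-selection rule in the architecture above can be sketched as a pure function. This is only an illustration under stated assumptions: `pages-to-supplement` and the sample data are hypothetical names, and the map shapes mirror the `:page-char-counts` and `:page-images` keys introduced later in this plan.

```clojure
(defn pages-to-supplement
  "Page numbers that pass the text gate (>= min-chars) and carry at least
   one significant image. Hypothetical helper, for illustration only."
  [page-char-counts page-images min-chars]
  (into (sorted-set)
        (comp (filter (fn [[page n]]
                        (and (>= n min-chars)
                             (seq (get page-images page)))))
              (map first))
        page-char-counts))

;; Page 1: text-rich with a footer image.
;; Page 2: text only.
;; Page 3: too little text, so it belongs to the existing OCR fallback.
(def example
  (pages-to-supplement {1 640, 2 512, 3 12}
                       {1 [{:x 0.12 :y 0.93 :width 0.78 :height 0.056}]
                        2 []
                        3 [{:x 0.0 :y 0.0 :width 1.0 :height 1.0}]}
                       50))
```

Only page 1 qualifies here: it is the one page that both passes the gate and has a significant image.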

Design Reference: See "Image detection and spatial merge" section in docs/plans/2026-03-09-pdfbox-first-transcription-design.md.

Tech Stack: PDFBox (PDFGraphicsStreamEngine, PDImage), existing layout.clj positioned element model, existing Document AI OCR pipeline.


Task 1: Image detection in pdfbox_layout.clj

Files:

Step 1: Write the failing test

Create the test file. Build a PDF with an embedded image using PDFBox, then assert detect-page-images returns its bounding box.

(ns com.getorcha.workers.ingestion.transcription.pdfbox-layout-test
  (:require [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.ingestion.transcription.pdfbox-layout :as pdfbox-layout])
  (:import (java.io ByteArrayOutputStream)
           (org.apache.pdfbox Loader)
           (org.apache.pdfbox.pdmodel PDDocument PDPage PDPageContentStream)
           (org.apache.pdfbox.pdmodel.font PDType1Font Standard14Fonts$FontName)
           (org.apache.pdfbox.pdmodel.graphics.image PDImageXObject)))


(set! *warn-on-reflection* true)


(defn ^:private create-pdf-with-text-and-image
  "Create a PDF with text content and an embedded image.
   Returns byte array of the PDF."
  ^bytes []
  (let [baos (ByteArrayOutputStream.)]
    (with-open [doc (PDDocument.)]
      (let [page (PDPage.)]
        (.addPage doc page)
        (with-open [cs (PDPageContentStream. doc page)]
          ;; Add text content in upper portion
          (.beginText cs)
          (.setFont cs (PDType1Font. Standard14Fonts$FontName/HELVETICA) 12)
          (.newLineAtOffset cs 50 700)
          (.showText cs "Invoice INV-001 from Supplier Corp for 12345.67 EUR total amount")
          (.endText cs)
          ;; Add a wide image near the bottom (simulating a footer)
          ;; Create a minimal 480x58 pixel image
          (let [img      (java.awt.image.BufferedImage. 480 58 java.awt.image.BufferedImage/TYPE_INT_RGB)
                img-baos (ByteArrayOutputStream.)
                ;; Encode the blank image as PNG bytes for PDImageXObject
                _        (javax.imageio.ImageIO/write img "png" ^java.io.OutputStream img-baos)
                pd-img   (PDImageXObject/createFromByteArray
                           doc (.toByteArray img-baos) "footer.png")]
            ;; Draw image at bottom of page: x=72, y=15, width=480, height=44
            (.drawImage cs pd-img 72.0 15.0 480.0 44.0))))
      (.save doc baos))
    (.toByteArray baos)))


(defn ^:private create-pdf-text-only
  "Create a PDF with only text content, no images."
  ^bytes []
  (let [baos (ByteArrayOutputStream.)]
    (with-open [doc (PDDocument.)]
      (let [page (PDPage.)]
        (.addPage doc page)
        (with-open [cs (PDPageContentStream. doc page)]
          (.beginText cs)
          (.setFont cs (PDType1Font. Standard14Fonts$FontName/HELVETICA) 12)
          (.newLineAtOffset cs 50 700)
          (.showText cs "Invoice INV-001 just text no images")
          (.endText cs)))
      (.save doc baos))
    (.toByteArray baos)))


(deftest test-detect-page-images
  (testing "Detects image with bounding box on a page"
    (let [pdf-bytes (create-pdf-with-text-and-image)]
      (with-open [doc (Loader/loadPDF pdf-bytes)]
        (let [images (pdfbox-layout/detect-page-images doc 0)]
          (is (= 1 (count images)))
          (let [{:keys [x y width height]} (first images)]
            ;; Image is at x=72, width=480 on a 612pt-wide page
            (is (< 0.1 x 0.13) "x should be ~72/612 = 0.118")
            ;; Image is near bottom of 792pt-high page
            (is (> y 0.9) "y should be near bottom")
            ;; Width ~480/612 = 0.78
            (is (< 0.7 width 0.85) "width should be ~0.78")
            ;; Height ~44/792 = 0.056
            (is (< 0.04 height 0.07) "height should be ~0.056"))))))

  (testing "Returns empty seq for pages without images"
    (let [pdf-bytes (create-pdf-text-only)]
      (with-open [doc (Loader/loadPDF pdf-bytes)]
        (is (empty? (pdfbox-layout/detect-page-images doc 0)))))))


(deftest test-significant-image?
  (testing "Wide, tall-enough images are significant"
    (is (pdfbox-layout/significant-image?
          {:x 0.12 :y 0.93 :width 0.78 :height 0.056})))

  (testing "Small images (logos, icons) are not significant"
    (is (not (pdfbox-layout/significant-image?
               {:x 0.4 :y 0.05 :width 0.08 :height 0.09}))))

  (testing "Narrow images are not significant"
    (is (not (pdfbox-layout/significant-image?
               {:x 0.1 :y 0.5 :width 0.2 :height 0.05}))))

  (testing "Very short images are not significant"
    (is (not (pdfbox-layout/significant-image?
               {:x 0.1 :y 0.5 :width 0.5 :height 0.01})))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: Compilation error — detect-page-images and significant-image? don't exist.

Step 3: Implement detect-page-images and significant-image?

In pdfbox_layout.clj, add the imports and the two functions.

Add imports:

(:import (org.apache.pdfbox Loader)
         (org.apache.pdfbox.contentstream PDFGraphicsStreamEngine)
         (org.apache.pdfbox.pdmodel PDDocument PDPage)
         (org.apache.pdfbox.pdmodel.graphics.image PDImage)
         (org.apache.pdfbox.text PDFTextStripper TextPosition)))

Note: the existing import of PDDocument stays, but add PDPage, PDFGraphicsStreamEngine, and PDImage.

Add the functions between the (set! *warn-on-reflection* true) form and extract-page-elements:

(defn detect-page-images
  "Detect XObject images on a PDF page with their bounding boxes.

   Subclasses PDFGraphicsStreamEngine to intercept drawImage calls.
   Returns seq of `{:x :y :width :height}` maps, normalized to 0.0-1.0.
   Coordinates are top-down (Y=0 at page top)."
  [^PDDocument doc page-index]
  (let [^PDPage page (.getPage doc page-index)
        page-w       (.. page getMediaBox getWidth)
        page-h       (.. page getMediaBox getHeight)
        images       (atom [])
        engine       (proxy [PDFGraphicsStreamEngine] [page]
                       (drawImage [^PDImage _pd-image]
                         ;; Hint `this` so the calls resolve without reflection
                         (let [ctm    (.. ^PDFGraphicsStreamEngine this
                                          getGraphicsState
                                          getCurrentTransformationMatrix)
                               width  (Math/abs (.getScaleX ctm))
                               height (Math/abs (.getScaleY ctm))
                               tx     (.getTranslateX ctm)
                               ty     (.getTranslateY ctm)
                               ;; PDF Y-axis is bottom-up; convert to top-down
                               x      tx
                               y      (- page-h ty height)]
                           (swap! images conj
                                  {:x      (/ x page-w)
                                   :y      (/ y page-h)
                                   :width  (/ width page-w)
                                   :height (/ height page-h)})))
                       ;; Required abstract methods — no-op
                       (appendRectangle [_ _ _ _])
                       (clip [_])
                       (moveTo [_ _])
                       (lineTo [_ _])
                       (curveTo [_ _ _ _ _ _])
                       (getCurrentPoint [] nil)
                       (closePath [])
                       (endPath [])
                       (strokePath [])
                       (fillPath [_])
                       (fillAndStrokePath [_])
                       (shadingFill [_]))]
    (.processPage engine page)
    @images))


(defn significant-image?
  "An image is significant if it's wide enough to contain a line of text
   and tall enough to hold at least one text line.

   Thresholds: >30% page width and >2% page height.
   Intentionally loose — false positives cost ~$0.0015, false negatives
   lose text."
  [{:keys [width height]}]
  (and (> width 0.3)
       (> height 0.02)))
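
The coordinate arithmetic in detect-page-images can be checked by hand with a small pure function. This is a sketch, not part of the plan's code: `normalize-box` is a hypothetical helper that repeats the same bottom-up to top-down conversion, applied to the footer image used in the test (drawn at x=72, y=15, 480 x 44 points on the 612 x 792 point page that the test's comments assume).

```clojure
(defn normalize-box
  "Convert a bottom-up PDF box (in points) into a top-down box normalized
   to 0.0-1.0. Hypothetical helper mirroring detect-page-images."
  [{:keys [x y width height]} page-w page-h]
  {:x      (/ x page-w)
   ;; PDF y is the distance of the box's bottom edge from the page bottom;
   ;; the top-down y is the distance of its top edge from the page top.
   :y      (/ (- page-h y height) page-h)
   :width  (/ width page-w)
   :height (/ height page-h)})

(def footer-box
  (normalize-box {:x 72.0 :y 15.0 :width 480.0 :height 44.0} 612.0 792.0))
;; :x is 72/612, :y is 733/792, :width is 480/612, :height is 44/792
```

These are the values the test ranges are built around: x near 0.118, y above 0.9, width near 0.78, and height near 0.056, which also clears both significant-image? thresholds.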

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "feat: add image detection to pdfbox_layout"

Task 2: Return page-elements and page-images from extract-with-layout

Files:

Step 1: Write the failing test

Add to the test file:

(deftest test-extract-with-layout-returns-elements-and-images
  (testing "Returns page-elements and page-images in result"
    (let [pdf-bytes (create-pdf-with-text-and-image)
          result    (pdfbox-layout/extract-with-layout pdf-bytes)]
      ;; page-elements contains positioned element seqs per page
      (is (map? (:page-elements result)))
      (is (seq (get-in result [:page-elements 1])) "Page 1 should have elements")
      (is (every? (fn [e] (and (:text e) (:x e) (:y e) (:width e) (:height e)))
                  (get-in result [:page-elements 1])))
      ;; page-images contains detected images per page
      (is (map? (:page-images result)))
      (is (= 1 (count (get-in result [:page-images 1]))) "Page 1 has one image")))

  (testing "page-images has no images on any page for text-only PDFs"
    (let [pdf-bytes (create-pdf-text-only)
          result    (pdfbox-layout/extract-with-layout pdf-bytes)]
      (is (every? empty? (vals (:page-images result)))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: FAIL — :page-elements and :page-images not in result.

Step 3: Update extract-with-layout

Modify the function to also detect images and preserve elements:

(defn extract-with-layout
  "Extract text from PDF with layout reconstruction.

   Processes each page independently, capturing positioned text elements
   and reconstructing tabular structure with column separators.

   Returns:
     :text             - Layout-reconstructed text with === PAGE N === markers
     :page-count       - Number of pages
     :page-char-counts - Map of page-number (1-indexed) to character count
     :page-texts       - Map of page-number to page text (blank pages omitted)
     :page-elements    - Map of page-number to seq of positioned elements
     :page-images      - Map of page-number to seq of significant image bounding boxes
     :quality-score    - Always 1.0 (native PDF text)
     :raw-response     - Empty vector (no external API calls)
     :method           - :pdf-lib"
  [^bytes pdf-bytes]
  (with-open [doc (Loader/loadPDF pdf-bytes)]
    (let [page-count   (.getNumberOfPages doc)
          opts         {:column-gap-threshold 0.05}
          page-results (mapv (fn [i]
                               (let [elements   (extract-page-elements doc i)
                                     images     (filterv significant-image?
                                                         (detect-page-images doc i))
                                     page-text  (layout/elements->structured-text elements opts)
                                     page-num   (inc i)]
                                 {:page-num   page-num
                                  :elements   elements
                                  :images     images
                                  :text       (when-not (str/blank? page-text)
                                                (str "=== PAGE " page-num " ===\n" page-text))
                                  :char-count (reduce + 0 (map (comp count :text) elements))}))
                             (range page-count))]
      {:text             (->> page-results
                              (keep :text)
                              (str/join "\n\n"))
       :page-count       page-count
       :page-char-counts (into {} (map (juxt :page-num :char-count) page-results))
       :page-texts       (into {} (keep (fn [{:keys [page-num text]}]
                                          (when text [page-num text]))
                                        page-results))
       :page-elements    (into {} (map (juxt :page-num :elements) page-results))
       :page-images      (into {} (map (juxt :page-num :images) page-results))
       :quality-score    1.0
       :raw-response     []
       :method           :pdf-lib})))

Note: filterv is in clojure.core, so no additional require is needed.

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: PASS

Step 5: Run existing tests to ensure no regressions

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'

Expected: PASS. Existing tests don't assert the absence of :page-elements/:page-images, and the existing fields (:text, :page-char-counts, :page-texts) are unchanged. One thing to double-check: test-pdfbox-first-skips-ocr-for-text-rich-pdfs asserts (is (nil? (:page-char-counts result))), but that assertion runs on the final result after the dissoc in transcription.clj, not on the pdfbox result directly, so it is unaffected.

Step 6: Commit

git add src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "feat: return page-elements and page-images from extract-with-layout"

Task 3: Expose page->elements from ocr_layout.clj

Files:

The spatial merge needs OCR positioned elements, not final text. Currently page->elements is private. Make it public.

Step 1: Write the failing test

Add to the existing ocr_layout_test.clj:

(deftest test-page->elements-returns-positioned-elements
  (testing "Returns positioned element maps from a Document AI page"
    (let [document-text "Invoice\nTotal: 100.00"
          page {:lines [{:layout (make-layout 0 7 0.1 0.1 0.3 0.02)}
                        {:layout (make-layout 8 21 0.1 0.15 0.4 0.02)}]}
          elements (ocr-layout/page->elements document-text page)]
      (is (= 2 (count elements)))
      (is (= "Invoice" (:text (first elements))))
      (is (= "Total: 100.00" (:text (second elements))))
      (is (every? #(and (:x %) (:y %) (:width %) (:height %)) elements)))))

Note: check how make-layout is defined in the existing test file. It likely builds the {:textAnchor {:textSegments ...} :boundingPoly {:normalizedVertices ...}} structure. Adapt the test to use the existing helper.
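
For orientation, a helper of this kind might look like the sketch below. It is an assumption, not the real helper: `example-make-layout` is a hypothetical name, the argument order (start end x y width height) is inferred from the call in the test above, and the string-valued indices follow the OCR fixture used in Task 5.

```clojure
(defn example-make-layout
  "Build a Document AI-style layout map that covers
   document-text[start, end) with a box at (x, y) of the given size,
   all in normalized coordinates.
   Hypothetical: the real make-layout in ocr_layout_test.clj may differ."
  [start end x y width height]
  {:textAnchor   {:textSegments [{:startIndex (str start)
                                  :endIndex   (str end)}]}
   ;; Vertices run clockwise from the top-left corner.
   :boundingPoly {:normalizedVertices [{:x x           :y y}
                                       {:x (+ x width) :y y}
                                       {:x (+ x width) :y (+ y height)}
                                       {:x x           :y (+ y height)}]}})

(def invoice-line (example-make-layout 0 7 0.1 0.1 0.3 0.02))
```

If the existing helper takes different arguments (for example corner coordinates instead of width and height), adjust the test call rather than the helper.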

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.ocr-layout-test]'

Expected: FAIL — page->elements is private, calling it as ocr-layout/page->elements fails.

Step 3: Remove ^:private from page->elements

In ocr_layout.clj line 50, change:

(defn ^:private page->elements

to:

(defn page->elements

Update the docstring to note it's public for use by the spatial merge:

(defn page->elements
  "Extract all positioned text elements from a Document AI page.

   Uses :lines for finer granularity, falls back to :blocks.
   Returns seq of `{:text :x :y :width :height}` maps with
   normalized coordinates (0.0-1.0).

   Public because the spatial merge in transcription.clj consumes these
   elements directly."
  [document-text {:keys [lines blocks] :as _page}]
  ...)

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.ocr-layout-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj test/com/getorcha/workers/ingestion/transcription/ocr_layout_test.clj
git commit -m "feat: make page->elements public for spatial merge"

Task 4: Spatial merge in layout.clj

Files:

Step 1: Write the failing test

Add to layout_test.clj:

(deftest test-merge-elements
  (testing "Non-overlapping OCR elements are included"
    (let [pdfbox-elements [{:text "Invoice body" :x 0.1 :y 0.1 :width 0.3 :height 0.02}]
          ocr-elements    [{:text "Invoice body" :x 0.1 :y 0.1 :width 0.3 :height 0.02}
                           {:text "Footer: VAT ID DE123" :x 0.1 :y 0.95 :width 0.8 :height 0.04}]
          merged (layout/merge-elements pdfbox-elements ocr-elements)]
      (is (= 2 (count merged)))
      (is (some #(= "Footer: VAT ID DE123" (:text %)) merged))
      (is (some #(= "Invoice body" (:text %)) merged))))

  (testing "Overlapping OCR elements are excluded (PDFBox wins)"
    (let [pdfbox-elements [{:text "Total: 100.00" :x 0.5 :y 0.3 :width 0.2 :height 0.02}]
          ocr-elements    [{:text "Total: 100.0" :x 0.5 :y 0.3 :width 0.19 :height 0.02}]
          merged (layout/merge-elements pdfbox-elements ocr-elements)]
      (is (= 1 (count merged)))
      (is (= "Total: 100.00" (:text (first merged))) "PDFBox text wins")))

  (testing "Empty PDFBox elements — all OCR elements included"
    (let [ocr-elements [{:text "Scanned text" :x 0.1 :y 0.1 :width 0.3 :height 0.02}]
          merged (layout/merge-elements [] ocr-elements)]
      (is (= 1 (count merged)))))

  (testing "Empty OCR elements — PDFBox elements unchanged"
    (let [pdfbox-elements [{:text "Native text" :x 0.1 :y 0.1 :width 0.3 :height 0.02}]
          merged (layout/merge-elements pdfbox-elements [])]
      (is (= 1 (count merged))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'

Expected: FAIL — merge-elements doesn't exist.

Step 3: Implement merge-elements

Add to layout.clj after elements->structured-text:

(defn ^:private overlaps?
  "Check if two bounding boxes overlap (2D axis-aligned intersection)."
  [a b]
  (and (< (:x a) (+ (:x b) (:width b)))
       (< (:x b) (+ (:x a) (:width a)))
       (< (:y a) (+ (:y b) (:height b)))
       (< (:y b) (+ (:y a) (:height a)))))


(defn merge-elements
  "Merge PDFBox and OCR positioned elements for a single page.

   PDFBox elements are authoritative for native text (no clipping issues).
   OCR elements are only included when they don't overlap with any PDFBox
   element — meaning they come from image regions that PDFBox couldn't read."
  [pdfbox-elements ocr-elements]
  (let [ocr-only (remove (fn [ocr-elem]
                           (some #(overlaps? ocr-elem %) pdfbox-elements))
                         ocr-elements)]
    (into (vec pdfbox-elements) ocr-only)))
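
Two consequences of this overlap rule are worth seeing on concrete boxes. The sketch below copies the two definitions so it runs on its own; the sample elements are made up, with coordinates chosen as exact binary fractions so the edge-touching case is not blurred by floating-point rounding.

```clojure
;; Copied from the definitions above so this snippet is standalone.
(defn overlaps? [a b]
  (and (< (:x a) (+ (:x b) (:width b)))
       (< (:x b) (+ (:x a) (:width a)))
       (< (:y a) (+ (:y b) (:height b)))
       (< (:y b) (+ (:y a) (:height a)))))

(defn merge-elements [pdfbox-elements ocr-elements]
  (into (vec pdfbox-elements)
        (remove (fn [ocr-elem]
                  (some #(overlaps? ocr-elem %) pdfbox-elements))
                ocr-elements)))

(def pdfbox
  [{:text "Total: 100.00" :x 0.5 :y 0.25 :width 0.25 :height 0.125}])

(def ocr
  [;; Partially overlaps the PDFBox box: dropped as a whole.
   {:text "Total: 100.0" :x 0.625 :y 0.3125 :width 0.25 :height 0.125}
   ;; Starts exactly at the PDFBox box's bottom edge (y = 0.375):
   ;; the strict < comparison treats touching edges as non-overlapping.
   {:text "Next line" :x 0.5 :y 0.375 :width 0.25 :height 0.125}
   ;; Far away in the footer region: kept.
   {:text "Footer: VAT ID DE123" :x 0.125 :y 0.875 :width 0.75 :height 0.0625}])

(def merged (merge-elements pdfbox ocr))
```

Note the trade-off: an OCR line that only partly overlaps native text is discarded entirely, including any portion that lies over an image. That fits the footer use case in this plan, but it is a property to keep in mind if image text sits on the same line as native text.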

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/ingestion/transcription/layout.clj test/com/getorcha/workers/ingestion/transcription/layout_test.clj
git commit -m "feat: add spatial merge for PDFBox + OCR elements"

Task 5: Wire image supplementation into extract-text

Files:

This is the integration point. The existing flow in extract-text (lines 706-754) is:

PDFBox → check page char counts → all pass / all fail / mixed

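We add a fourth case: pages that pass the text gate but contain significant images. As Step 3 shows, this case is handled inside the all-pass branch; the mixed branch is left unchanged by this plan.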

Step 1: Write the failing test

Add to transcription_test.clj. Build a PDF where page 1 has both text and an embedded image. Assert that OCR is called to supplement.

(deftest test-pdfbox-supplements-ocr-for-pages-with-images
  (testing "Pages with text + images trigger supplementary OCR and merge"
    (let [ocr-call-count (atom 0)
          ;; OCR response with a "footer" element positioned at the bottom
          ocr-response   {:document
                          {:text "Invoice INV-001\nFooter: USt-IdNr DE813113094"
                           :pages [{:imageQualityScores {:qualityScore 0.95}
                                    :tokens [{:layout {:confidence 0.98}}
                                             {:layout {:confidence 0.97}}]
                                    :lines [{:layout {:textAnchor {:textSegments [{:startIndex "0" :endIndex "15"}]}
                                                      :boundingPoly {:normalizedVertices [{:x 0.1 :y 0.1}
                                                                                          {:x 0.6 :y 0.1}
                                                                                          {:x 0.6 :y 0.13}
                                                                                          {:x 0.1 :y 0.13}]}}}
                                            {:layout {:textAnchor {:textSegments [{:startIndex "16" :endIndex "44"}]}
                                                      :boundingPoly {:normalizedVertices [{:x 0.1 :y 0.93}
                                                                                          {:x 0.9 :y 0.93}
                                                                                          {:x 0.9 :y 0.97}
                                                                                          {:x 0.1 :y 0.97}]}}}]}]}}
          ocr-spy        (fn [& _args]
                           (swap! ocr-call-count inc)
                           {:status 200 :body ocr-response})
          ;; Build PDF with text + image
          pdf-bytes      (let [baos (java.io.ByteArrayOutputStream.)]
                           (with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
                             (let [page (org.apache.pdfbox.pdmodel.PDPage.)]
                               (.addPage doc page)
                               (with-open [cs (org.apache.pdfbox.pdmodel.PDPageContentStream. doc page)]
                                 ;; Text content
                                 (.beginText cs)
                                 (.setFont cs (org.apache.pdfbox.pdmodel.font.PDType1Font.
                                                org.apache.pdfbox.pdmodel.font.Standard14Fonts$FontName/HELVETICA) 12)
                                 (.newLineAtOffset cs 50 700)
                                 (.showText cs "Invoice INV-001 from Supplier Corp for 12345.67 EUR total amount")
                                 (.endText cs)
                                 ;; Image at bottom (simulating footer)
                                 (let [img (java.awt.image.BufferedImage. 480 58 java.awt.image.BufferedImage/TYPE_INT_RGB)
                                       img-baos (java.io.ByteArrayOutputStream.)
                                       _        (javax.imageio.ImageIO/write img "png" ^java.io.OutputStream img-baos)
                                       pd-img   (org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject/createFromByteArray
                                                  doc (.toByteArray img-baos) "footer.png")]
                                   (.drawImage cs pd-img 72.0 15.0 480.0 44.0))))
                             (.save doc baos))
                           (.toByteArray baos))
          context        {:transcription {:pdf-lib {:min-chars-per-page 50}
                                          :ocr    {:provider     :ocr
                                                   :project-id   "test-project"
                                                   :location     "eu"
                                                   :processor-id "test-processor"}
                                          :vision {}}
                          :llm-config    {}
                          :worker-pools  {}}
          ingestion      {:file {:contents pdf-bytes :mime-type "application/pdf"}}]
      (with-redefs-fn {#'hato/post                             ocr-spy
                       #'workers.transcription/get-access-token (constantly "test-token")}
        (fn []
          (let [result (workers.transcription/extract-text context ingestion)]
            ;; Method is pdf-lib (primary source)
            (is (= :pdf-lib (:method result)))
            ;; OCR was called once (for the page with images)
            (is (= 1 @ocr-call-count) "OCR should be called once for supplementation")
            ;; Result should contain native text
            (is (string/includes? (:text result) "Invoice"))
            ;; Result should contain OCR-supplemented footer text
            (is (string/includes? (:text result) "USt-IdNr DE813113094")
                "Footer text from image should be present via OCR supplement")
            ;; Page method should indicate supplementation
            (is (= {1 :pdf-lib+ocr} (:page-methods result)))))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]' :vars '[com.getorcha.workers.ingestion.transcription-test/test-pdfbox-supplements-ocr-for-pages-with-images]'

Expected: FAIL — no supplementation logic exists yet, footer text won't appear.

Step 3: Implement the supplementation logic

In transcription.clj, update the require to include layout and ocr-layout:

[com.getorcha.workers.ingestion.transcription.layout :as layout]
[com.getorcha.workers.ingestion.transcription.ocr-layout :as ocr-layout]

Then, in the extract-text function, modify the (empty? failing-pages) branch (lines 732-737). Currently:

;; All pages pass — use PDFBox result
(empty? failing-pages)
(-> (dissoc pdfbox-result :page-char-counts :page-texts)
    (assoc :page-methods
           (into {} (map (fn [p] [p :pdf-lib])
                         (range 1 (inc total-pages))))))

Replace with logic that checks for pages needing image supplementation:

;; All pages pass text gate
(empty? failing-pages)
(let [supplement-pages (into (sorted-set)
                              (comp (filter (fn [[_ imgs]] (seq imgs)))
                                    (map first))
                              (:page-images pdfbox-result))]
  (if (empty? supplement-pages)
    ;; No images — pure PDFBox result
    (-> (dissoc pdfbox-result :page-char-counts :page-texts
                              :page-elements :page-images)
        (assoc :page-methods
               (into {} (map (fn [p] [p :pdf-lib])
                             (range 1 (inc total-pages))))))
    ;; Some pages have images — supplement with OCR
    (let [supp-sorted   (vec supplement-pages) ; already ordered (sorted-set)
          supp-indices  (mapv dec supp-sorted)
          supp-pdf      (extract-pages contents supp-indices)
          ocr-result    (ocr-transcribe! ocr-context
                                         {:file {:contents  supp-pdf
                                                 :mime-type "application/pdf"}})
          ocr-responses (:raw-response ocr-result)
          ;; Build OCR elements per supplement page
          ocr-elements-by-page
          (into {}
                (mapcat
                  (fn [[response page-offset]]
                    (let [doc-text (get-in response [:document :text])
                          pages    (get-in response [:document :pages])]
                      (map-indexed
                        (fn [idx page]
                          (let [ocr-page-num (nth supp-sorted (+ page-offset idx))]
                            [ocr-page-num (ocr-layout/page->elements doc-text page)]))
                        pages))))
                (let [page-counts (reductions + 0 (map #(count (get-in % [:document :pages]))
                                                       ocr-responses))]
                  (map vector ocr-responses page-counts)))
          ;; Merge elements and reconstruct layout per supplemented page
          opts {:column-gap-threshold 0.05}
          supplemented-texts
          (into {}
                (map (fn [p]
                       (let [pb-elems  (get-in pdfbox-result [:page-elements p])
                             ocr-elems (get ocr-elements-by-page p [])
                             merged    (layout/merge-elements pb-elems ocr-elems)
                             page-text (layout/elements->structured-text merged opts)]
                         [p (str "=== PAGE " p " ===\n" page-text)])))
                supp-sorted)
          ;; Replace page texts for supplemented pages
          final-text (->> (range 1 (inc total-pages))
                          (keep (fn [p]
                                  (or (get supplemented-texts p)
                                      (get (:page-texts pdfbox-result) p))))
                          (str/join "\n\n"))]
      {:text          final-text
       :page-count    total-pages
       :quality-score 1.0
       :raw-response  ocr-responses
       :method        :pdf-lib
       :page-methods  (into {} (map (fn [p]
                                      [p (if (supplement-pages p)
                                           :pdf-lib+ocr
                                           :pdf-lib)])
                                    (range 1 (inc total-pages))))})))

Note: ocr-transcribe! is the multimethod that calls Document AI. It is called directly here instead of ocr-with-vision-fallback because we need the raw response with bounding boxes; supplement pages need only positioned elements, not layout reconstruction or vision fallback. Make sure :native-pdf-parsing? false is passed through the context.

Important implementation detail: Review how ocr-transcribe! is called. It currently takes (ocr-transcribe! context ingestion). The context must include :native-pdf-parsing? false (already set in ocr-context at line 719). The multimethod returns a map with :raw-response containing the Document AI responses.

Adapt the above code to match the exact calling convention of ocr-transcribe! in the codebase. Read its signature and return shape before implementing. The sketch above shows the data flow — the exact function calls may need adjustment.
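
The page-number bookkeeping in that sketch is the easiest part to get wrong, so here it is in isolation. This is a minimal illustration under assumptions: `ocr-pages->original` is a hypothetical helper, the keyword placeholders stand in for real Document AI page maps, and it assumes :raw-response holds one or more responses in request order.

```clojure
(defn ocr-pages->original
  "Pair each OCR page with its original 1-indexed page number.
   `responses` are Document AI response maps in request order;
   `supp-sorted` is the sorted vector of original page numbers that were
   sent for OCR. Hypothetical helper, for illustration only."
  [responses supp-sorted]
  (let [counts  (map #(count (get-in % [:document :pages])) responses)
        ;; Running total of pages seen before each response: 0, n1, n1+n2, ...
        offsets (reductions + 0 counts)]
    (vec (mapcat (fn [response offset]
                   (map-indexed (fn [idx page]
                                  [(nth supp-sorted (+ offset idx)) page])
                                (get-in response [:document :pages])))
                 responses
                 offsets))))

;; Original pages 2, 5 and 9 were extracted and sent; OCR answered in
;; two responses holding two pages and one page respectively.
(def mapping
  (ocr-pages->original
    [{:document {:pages [:ocr-page-a :ocr-page-b]}}
     {:document {:pages [:ocr-page-c]}}]
    [2 5 9]))
```

Passing the responses and offsets to mapcat as two separate collections keeps the mapping function at two arguments; zipping them into pairs first requires destructuring the pair instead.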

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]' :vars '[com.getorcha.workers.ingestion.transcription-test/test-pdfbox-supplements-ocr-for-pages-with-images]'

Expected: PASS

Step 5: Run all transcription tests

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'

Expected: PASS — existing tests should not be affected because their test PDFs have no embedded images, so supplement-pages will be empty and they'll hit the existing pure-PDFBox path.

Step 6: Commit

git add src/com/getorcha/workers/ingestion/transcription.clj test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "feat: supplement PDFBox pages with OCR when images detected"

Task 6: Clean up — strip internal fields from final result

Files:

Step 1: Verify internal fields don't leak

The existing test test-pdfbox-first-skips-ocr-for-text-rich-pdfs already checks:

(is (nil? (:page-char-counts result)))
(is (nil? (:page-texts result)))

Add checks for the new fields:

(is (nil? (:page-elements result)))
(is (nil? (:page-images result)))

Step 2: Run to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]' :vars '[com.getorcha.workers.ingestion.transcription-test/test-pdfbox-first-skips-ocr-for-text-rich-pdfs]'

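Expected: FAIL if :page-elements and :page-images still leak into the final result. If the pure-PDFBox branch from Task 5 was implemented exactly as sketched (it already dissocs both keys), the test passes immediately; in that case the new assertions serve as a regression guard.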

Step 3: Update dissoc calls

In transcription.clj, wherever pdfbox-result is dissoc'd before return, add :page-elements and :page-images. There are three places to check; only the first needs a change:

  1. The pure PDFBox path (no failing pages, no supplement pages):

    (dissoc pdfbox-result :page-char-counts :page-texts :page-elements :page-images)
    
  2. The supplement path already builds a fresh map without these keys (no change needed).

  3. The merge-mixed-pages function receives pdfbox-result — check that it doesn't pass through :page-elements/:page-images. It builds a fresh map, so it's fine.
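
One way to keep these call sites from drifting apart is to name the internal keys once. The following is a sketch only: `internal-keys` and `strip-internal` are hypothetical names that do not exist in the codebase, and whether a shared helper is worth adding is a judgment call for the implementer.

```clojure
(def internal-keys
  "Keys that extract-with-layout returns for use inside extract-text only.
   Hypothetical constant, for illustration."
  [:page-char-counts :page-texts :page-elements :page-images])

(defn strip-internal
  "Remove internal bookkeeping keys from a transcription result."
  [result]
  (apply dissoc result internal-keys))

(def cleaned
  (strip-internal {:text          "=== PAGE 1 ===\nInvoice"
                   :method        :pdf-lib
                   :page-texts    {1 "=== PAGE 1 ===\nInvoice"}
                   :page-elements {1 []}
                   :page-images   {1 []}}))
```

With a helper like this, a future internal field needs to be added in one place only.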

Step 4: Run to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/workers/ingestion/transcription.clj test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "fix: strip page-elements and page-images from final result"

Task 7: Lint and final verification

Step 1: Lint

Run: clj-kondo --lint src test dev

Fix any issues.

Step 2: Run all transcription-related tests

clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test com.getorcha.workers.ingestion.transcription.layout-test com.getorcha.workers.ingestion.transcription.ocr-layout-test com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: ALL PASS

Step 3: Commit any lint fixes


Notes for implementer