PDFBox-First Transcription Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Replace the all-or-nothing OCR tier system with per-page PDFBox-first extraction, with Document AI and Vision as per-page fallbacks.

Architecture: PDFBox extracts positioned text elements from each page. Pages that pass a character density gate get layout reconstruction from PDFBox coordinates. Failing pages fall through to Document AI (image OCR only), then to Vision. All pages are assembled with === PAGE N === markers.

Tech Stack: Apache PDFBox 3.0.3 (already in deps), Google Document AI, Gemini Vision, Clojure multimethods.

Design doc: docs/plans/2026-03-09-pdfbox-first-transcription-design.md

Task 1: Extract shared layout algorithm from ocr_layout.clj

Extract the provider-agnostic row-grouping and column-separator logic into a shared module that both PDFBox and Document AI layout can use.

Files:

Create: src/com/getorcha/workers/ingestion/transcription/layout.clj
Modify: src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj
Test: test/com/getorcha/workers/ingestion/transcription/layout_test.clj

Step 1: Write failing tests for the shared layout module

Create test/com/getorcha/workers/ingestion/transcription/layout_test.clj:

(ns com.getorcha.workers.ingestion.transcription.layout-test
  (:require [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.ingestion.transcription.layout :as layout]))

(deftest test-group-into-rows
  (testing "Groups vertically overlapping elements into same row"
    (let [elements [{:text "Name" :x 0.1 :y 0.5 :width 0.1 :height 0.02}
                    {:text "Value" :x 0.5 :y 0.5 :width 0.1 :height 0.02}]
          rows (layout/group-into-rows elements)]
      (is (= 1 (count rows)))
      (is (= 2 (count (first rows))))))

  (testing "Separates non-overlapping elements into different rows"
    (let [elements [{:text "Row1" :x 0.1 :y 0.1 :width 0.1 :height 0.02}
                    {:text "Row2" :x 0.1 :y 0.5 :width 0.1 :height 0.02}]
          rows (layout/group-into-rows elements)]
      (is (= 2 (count rows))))))


(deftest test-row->text
  (testing "Inserts column separator for large gaps"
    (let [row [{:text "Description" :x 0.1 :y 0.5 :width 0.15 :height 0.02}
               {:text "100.00" :x 0.7 :y 0.5 :width 0.08 :height 0.02}]]
      (is (= "Description | 100.00" (layout/row->text row 0.05)))))

  (testing "No separator for small gaps"
    (let [row [{:text "First" :x 0.1 :y 0.5 :width 0.08 :height 0.02}
               {:text "Second" :x 0.19 :y 0.5 :width 0.08 :height 0.02}]]
      (is (= "First Second" (layout/row->text row 0.05))))))


(deftest test-elements->structured-text
  (testing "Full pipeline: elements to structured text with rows and columns"
    (let [elements [{:text "Pos" :x 0.05 :y 0.1 :width 0.05 :height 0.02}
                    {:text "Description" :x 0.15 :y 0.1 :width 0.2 :height 0.02}
                    {:text "Amount" :x 0.7 :y 0.1 :width 0.1 :height 0.02}
                    {:text "10" :x 0.05 :y 0.15 :width 0.03 :height 0.02}
                    {:text "Widget" :x 0.15 :y 0.15 :width 0.1 :height 0.02}
                    {:text "99.00" :x 0.7 :y 0.15 :width 0.08 :height 0.02}]
          text (layout/elements->structured-text elements {:column-gap-threshold 0.05})]
      (is (string? text))
      (is (re-find #"Pos" text))
      (is (re-find #"Widget" text))
      (is (re-find #"\|" text) "Should have column separators"))))

Step 2: Run tests to verify they fail

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'

Expected: FAIL, because the layout namespace does not exist yet.

Step 3: Create the shared layout module

Create src/com/getorcha/workers/ingestion/transcription/layout.clj. Move these functions from ocr_layout.clj:

vertical-overlap?
row-overlaps?
group-into-rows
sort-row-by-x (inlined into row->text as (sort-by :x row), so it does not survive as a separate function)
row->text

Add a new public entry point elements->structured-text that takes a seq of {:text :x :y :width :height} maps and options, returns formatted text string.

(ns com.getorcha.workers.ingestion.transcription.layout
  "Shared layout reconstruction for positioned text elements.

   Converts sequences of positioned text elements (from any source — PDFBox,
   Document AI, etc.) into structured row-based text with column separators.

   Elements are maps with:
     :text   - the text content
     :x      - left edge (0.0-1.0 normalized)
     :y      - top edge (0.0-1.0 normalized)
     :width  - element width (normalized)
     :height - element height (normalized)"
  (:require [clojure.string :as string]))


(set! *warn-on-reflection* true)


(defn ^:private vertical-overlap?
  "Check if two elements overlap vertically (same visual row).
   Elements overlap if their Y-ranges intersect."
  [elem1 elem2]
  (let [y1-start (:y elem1)
        y1-end   (+ y1-start (:height elem1))
        y2-start (:y elem2)
        y2-end   (+ y2-start (:height elem2))]
    (and (< y1-start y2-end)
         (< y2-start y1-end))))


(defn ^:private row-overlaps?
  "Check if an element overlaps vertically with any element in the row."
  [row element]
  (some #(vertical-overlap? % element) row))


(defn group-into-rows
  "Group elements by vertical overlap (bounding box intersection).
   Elements on the same visual row will have overlapping Y-ranges.
   Returns seq of rows, each row being a seq of elements."
  [elements]
  (->> elements
       (sort-by :y)
       (reduce
        (fn [rows element]
          (if (empty? rows)
            [[element]]
            (let [current-row (peek rows)]
              (if (row-overlaps? current-row element)
                (conj (pop rows) (conj current-row element))
                (conj rows [element])))))
        [])))


(defn row->text
  "Convert a row of elements to text, inserting `|` separators for column gaps.
   Elements are sorted left-to-right by X-coordinate."
  [row column-gap-threshold]
  (let [sorted-row (sort-by :x row)]
    (->> sorted-row
         (partition 2 1 nil)
         (mapcat (fn [[elem next-elem]]
                   (if (and next-elem
                            (> (- (:x next-elem) (+ (:x elem) (:width elem)))
                               column-gap-threshold))
                     [(:text elem) " | "]
                     [(:text elem) " "])))
         (apply str)
         string/trim)))


(defn elements->structured-text
  "Convert positioned text elements into structured row-based text.

   Takes a seq of `{:text :x :y :width :height}` maps and options.
   Returns a string with rows separated by newlines and columns by `|`."
  [elements {:keys [column-gap-threshold]
             :or   {column-gap-threshold 0.05}
             :as   _opts}]
  (->> (group-into-rows elements)
       (map #(row->text % column-gap-threshold))
       (filter (complement string/blank?))
       (string/join "\n")))
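The column rule in row->text comes down to one comparison per adjacent pair of elements. As a rough, self-contained restatement (the helper names gap, column-break? and threshold->points are illustrative and not part of the module), the two row->text test cases work out like this:

```clojure
;; Illustrative helpers only; the shared module keeps this logic inside row->text.

(defn gap
  "Horizontal distance between the right edge of `left` and the left edge of `right`,
   in normalized page units."
  [left right]
  (- (:x right) (+ (:x left) (:width left))))

(defn column-break?
  "True when the gap between two neighbours is wide enough to count as a column boundary."
  [left right threshold]
  (> (gap left right) threshold))

(defn threshold->points
  "Translate a normalized threshold into PDF points for a given page width."
  [threshold page-width-pt]
  (* threshold page-width-pt))

;; Elements taken from the row->text tests.
(def description {:text "Description" :x 0.1 :width 0.15})
(def amount      {:text "100.00" :x 0.7 :width 0.08})
(def first-word  {:text "First" :x 0.1 :width 0.08})
(def second-word {:text "Second" :x 0.19 :width 0.08})

(column-break? description amount 0.05)      ;; gap is about 0.45, so a separator is inserted
(column-break? first-word second-word 0.05)  ;; gap is about 0.01, so a plain space is used
```

On an A4 page (about 595 points wide), a threshold of 0.05 corresponds to roughly 30 points, which may help when tuning the value against real invoices.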

Step 4: Run tests to verify they pass

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'

Expected: PASS

Step 5: Refactor ocr_layout.clj to use shared layout module

Modify src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj:

Remove the duplicated functions: vertical-overlap?, row-overlaps?, group-into-rows, sort-row-by-x, row->text, page->structured-text
Require the new layout namespace
Keep layout->element and page->elements (Document AI-specific element extraction)
Replace page->structured-text with a call to layout/elements->structured-text

The resulting ocr_layout.clj should only contain:

layout->element — converts Document AI layout map → {:text :x :y :width :height}
page->elements — extracts elements from a Document AI page
reconstruct-single-document — uses layout/elements->structured-text per page
reconstruct-layout — public API, iterates chunks
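For orientation, the Document AI side of the element contract could look roughly like the sketch below. It assumes the layout map carries a :boundingPoly with :normalizedVertices, and that a zero coordinate may be missing from a vertex. The existing layout->element may differ; layout->element-sketch is a hypothetical name used only here.

```clojure
;; Hypothetical stand-in for layout->element; not the existing implementation.

(defn layout->element-sketch
  "Build a {:text :x :y :width :height} element from already-resolved `text`
   and a Document AI style layout map. Missing vertex coordinates default to 0.0.
   Expects at least one vertex."
  [text layout]
  (let [vertices (get-in layout [:boundingPoly :normalizedVertices])
        xs       (map #(get % :x 0.0) vertices)
        ys       (map #(get % :y 0.0) vertices)
        min-x    (apply min xs)
        min-y    (apply min ys)]
    {:text   text
     :x      min-x
     :y      min-y
     :width  (- (apply max xs) min-x)
     :height (- (apply max ys) min-y)}))

(def sample-layout
  {:boundingPoly {:normalizedVertices [{:x 0.25 :y 0.5}
                                       {:x 0.75 :y 0.5}
                                       {:x 0.75 :y 0.625}
                                       {:x 0.25 :y 0.625}]}})

(layout->element-sketch "Total" sample-layout)
```

Because the output has the same shape as the PDFBox elements, both sources can be passed to layout/elements->structured-text unchanged.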

Step 6: Run existing tests to verify no regressions

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'

Expected: PASS; all existing OCR tests still work.

Step 7: Lint

Run: clj-kondo --lint src/com/getorcha/workers/ingestion/transcription/layout.clj src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj

Expected: No errors.

Step 8: Commit

git add src/com/getorcha/workers/ingestion/transcription/layout.clj \
        src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj \
        test/com/getorcha/workers/ingestion/transcription/layout_test.clj
git commit -m "refactor: extract shared layout algorithm from ocr_layout"

Task 2: Build PDFBox layout extraction

Create the PDFBox-specific module that subclasses PDFTextStripper to capture positioned text elements per page, then feeds them into the shared layout algorithm.

Files:

Create: src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj
Test: test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj

Context: PDFBox 3.0.3 PDFTextStripper has a protected method writeString(String text, List<TextPosition> textPositions) called once per word group. Override it to capture each group as a positioned element with normalized coordinates. The TextPosition class provides .getX(), .getY(), .getWidth(), .getHeight(), .getPageWidth(), .getPageHeight().

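Important: TextPosition.getX() and .getY() are page-rotation-adjusted with the origin at the upper-left corner, so Y is expected to increase downward. Verify this at implementation time by printing values: if Y increases downward, no transform is needed; if Y increases upward, subtract from page height. Also check whether .getY() marks the text baseline rather than the top of the glyphs; a constant offset of that kind shifts every element on a line by a similar amount and should not affect row grouping.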
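The orientation check and the possible flip can be sketched without PDFBox. This is a minimal illustration: y-increases-downward? and top-origin-y are hypothetical helpers, and raw-y is taken to be the element's top edge in whichever coordinate system the printed values reveal.

```clojure
;; Hypothetical helpers for the Y-direction check; plain numbers stand in for TextPosition values.

(defn y-increases-downward?
  "Compare the raw Y of a line known to sit near the top of the page (for example the
   letterhead) with one known to sit near the bottom (for example the footer)."
  [header-y footer-y]
  (< header-y footer-y))

(defn top-origin-y
  "Normalize `raw-y` so that 0.0 is the top of the page and 1.0 the bottom.
   When `y-up?` is true the raw value grows upward and is subtracted from the page height."
  [raw-y page-height y-up?]
  (/ (if y-up? (- page-height raw-y) raw-y)
     page-height))

;; The same visual position expressed in both systems on an 800 pt tall page.
(top-origin-y 100.0 800.0 false) ;; measured from the top
(top-origin-y 700.0 800.0 true)  ;; measured from the bottom
```

Both calls describe a point 100 points below the top edge, so they should normalize to the same value.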

Step 1: Write failing test using the Viessmann invoice

Create test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj. Use the actual Viessmann PDF from dump/ as a test fixture — this is the document that motivated the entire refactor, so it's the perfect regression test.

(ns com.getorcha.workers.ingestion.transcription.pdfbox-layout-test
  (:require [clojure.java.io :as io]
            [clojure.string :as string]
            [clojure.test :refer [deftest is testing]]
            [com.getorcha.workers.ingestion.transcription.pdfbox-layout :as pdfbox-layout])
  (:import (org.apache.pdfbox Loader)))


(def ^:private viessmann-pdf-bytes
  (delay
    (let [f (io/file "dump/019cbfb0-1e37-709c-a2e2-43ec943db97c.pdf")]
      (when (.exists f)
        ;; Files/readAllBytes opens and closes the file itself, so no stream is leaked
        (java.nio.file.Files/readAllBytes (.toPath f))))))


(deftest test-extract-page-elements
  (testing "Extracts positioned elements from a PDF page"
    (when-let [pdf-bytes @viessmann-pdf-bytes]
      (with-open [doc (Loader/loadPDF pdf-bytes)]
        (let [elements (pdfbox-layout/extract-page-elements doc 0)]
          (is (seq elements) "Should extract elements from page 1")
          (is (every? #(contains? % :text) elements))
          (is (every? #(contains? % :x) elements))
          (is (every? #(contains? % :y) elements))
          (is (every? #(contains? % :width) elements))
          (is (every? #(contains? % :height) elements))
          ;; All coordinates should be normalized 0.0-1.0
          (is (every? #(<= 0.0 (:x %) 1.0) elements))
          (is (every? #(<= 0.0 (:y %) 1.0) elements)))))))


(deftest test-viessmann-complete-text
  (testing "PDFBox extracts complete text that Document AI clipped"
    (when-let [pdf-bytes @viessmann-pdf-bytes]
      (let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
        ;; These are the exact fields Document AI truncated
        (is (string/includes? (:text result) "WP261041")
            "Bestellnummer should be complete (was WP26)")
        (is (string/includes? (:text result) "Saeidiani")
            "Warenempfänger name should be complete (was Sa)")
        (is (string/includes? (:text result) "Schanzenweg")
            "Street should be complete (was Schanzenw)")
        (is (string/includes? (:text result) "Gerasdorf bei Wien")
            "City should be complete (was Gerasdorf be)")
        (is (string/includes? (:text result) "ÖSTERREICH")
            "Country should be complete (was ÖSTERR)")))))


(deftest test-page-markers
  (testing "Output contains page markers with correct numbering"
    (when-let [pdf-bytes @viessmann-pdf-bytes]
      (let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
        (is (string/includes? (:text result) "=== PAGE 1 ==="))
        (is (string/includes? (:text result) "=== PAGE 2 ==="))
        (is (string/includes? (:text result) "=== PAGE 3 ==="))
        (is (= 3 (:page-count result)))))))


(deftest test-column-separators
  (testing "Table rows have column separators"
    (when-let [pdf-bytes @viessmann-pdf-bytes]
      (let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
        ;; Line items should have separators between columns
        (is (string/includes? (:text result) "|")
            "Should have column separators in tabular data")))))


(deftest test-per-page-char-counts
  (testing "Returns per-page character counts for quality gating"
    (when-let [pdf-bytes @viessmann-pdf-bytes]
      (let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
        (is (map? (:page-char-counts result)))
        (is (= 3 (count (:page-char-counts result))))
        ;; All pages of this invoice should have substantial text
        (is (every? #(> % 50) (vals (:page-char-counts result))))))))

Step 2: Run tests to verify they fail

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: FAIL, because the pdfbox-layout namespace does not exist yet.

Step 3: Implement pdfbox_layout.clj

Create src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj.

Key implementation details:

Subclass PDFTextStripper using proxy or gen-class. Override writeString(String, List<TextPosition>). In the override, compute a bounding box from the list of TextPosition objects (min x, min y, max x+width, max y+height), normalize by page dimensions, and store as a {:text :x :y :width :height} element.
Per-page extraction: Use setStartPage / setEndPage to process one page at a time. For each page, collect all positioned elements, then call layout/elements->structured-text.

Public API: extract-with-layout takes PDF bytes, returns:

{:text             "=== PAGE 1 ===\n...\n\n=== PAGE 2 ===\n..."
 :page-count       3
 :page-char-counts {1 1234, 2 567, 3 890}  ;; for quality gating
 :quality-score     1.0
 :raw-response      []
 :method            :pdf-lib}

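Page dimensions: Get from PDDocument.getPage(i).getMediaBox(), which returns a PDRectangle with .getWidth() and .getHeight() in PDF points. Use these to normalize TextPosition coordinates. PDFTextStripper appears to position text relative to the page's crop box, which defaults to the media box; if a document sets a smaller crop box, TextPosition.getPageWidth() / .getPageHeight() are likely the safer divisors.
Coordinate normalization: TextPosition.getX() / mediaBox.getWidth() for x, TextPosition.getY() / mediaBox.getHeight() for y (verify the Y direction and the crop-box behavior at implementation time).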
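The bounding-box union and normalization described above can be exercised on plain data before any PDFBox wiring exists. In this sketch, maps with :x :y :width :height in points stand in for TextPosition objects, and positions->element is a hypothetical helper name:

```clojure
;; Pure-data version of the bounding-box step; no PDFBox classes involved.

(defn positions->element
  "Union the glyph boxes in `positions` (values in PDF points) and normalize
   the result by the page dimensions."
  [text positions page-width page-height]
  (let [min-x (apply min (map :x positions))
        min-y (apply min (map :y positions))
        max-x (apply max (map #(+ (:x %) (:width %)) positions))
        max-y (apply max (map #(+ (:y %) (:height %)) positions))]
    {:text   text
     :x      (/ min-x page-width)
     :y      (/ min-y page-height)
     :width  (/ (- max-x min-x) page-width)
     :height (/ (- max-y min-y) page-height)}))

;; Two adjacent glyphs on a 600 x 800 pt page.
(def glyphs [{:x 60.0 :y 80.0 :width 6.0 :height 10.0}
             {:x 66.0 :y 80.0 :width 6.0 :height 10.0}])

(positions->element "Nr" glyphs 600.0 800.0)
```

Keeping this step free of PDFBox types makes it easy to unit test; the proxy override then only has to convert each TextPosition into such a map.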

(ns com.getorcha.workers.ingestion.transcription.pdfbox-layout
  "PDFBox-based layout extraction for PDF documents with embedded text.

   Subclasses PDFTextStripper to capture positioned word groups, then
   reconstructs tabular layout using the shared layout algorithm.

   Produces the same output format as ocr_layout.clj (=== PAGE N === markers
   with | column separators) but from PDFBox coordinates instead of
   Document AI bounding boxes."
  (:require [clojure.string :as string]
            [com.getorcha.workers.ingestion.transcription.layout :as layout])
  (:import (org.apache.pdfbox Loader)
           (org.apache.pdfbox.pdmodel PDDocument)
           (org.apache.pdfbox.text PDFTextStripper TextPosition)))


(set! *warn-on-reflection* true)


(defn extract-page-elements
  "Extract positioned text elements from a single PDF page.

   Returns seq of {:text :x :y :width :height} maps with coordinates
   normalized to 0.0-1.0 range."
  [^PDDocument doc page-index]
  (let [page       (.getPage doc page-index)
        media-box  (.getMediaBox page)
        page-width (.getWidth media-box)
        page-height (.getHeight media-box)
        elements   (atom [])
        stripper   (proxy [PDFTextStripper] []
                     (writeString [^String text ^java.util.List text-positions]
                       (when (and text (not (string/blank? text)) (seq text-positions))
                         (let [positions (vec text-positions)
                               xs        (mapv #(.getX ^TextPosition %) positions)
                               ys        (mapv #(.getY ^TextPosition %) positions)
                               widths    (mapv #(.getWidth ^TextPosition %) positions)
                               heights   (mapv #(.getHeight ^TextPosition %) positions)
                               min-x     (apply min xs)
                               min-y     (apply min ys)
                               max-x     (apply max (map + xs widths))
                               max-y     (apply max (map + ys heights))]
                           (swap! elements conj
                                  {:text   (string/trim text)
                                   :x      (/ min-x page-width)
                                   :y      (/ min-y page-height)
                                   :width  (/ (- max-x min-x) page-width)
                                   :height (/ (- max-y min-y) page-height)})))))]
    ;; PDFTextStripper pages are 1-indexed
    (.setStartPage stripper (inc page-index))
    (.setEndPage stripper (inc page-index))
    ;; getText drives the parsing — writeString is called as a side effect
    (.getText stripper doc)
    (filterv #(not (string/blank? (:text %))) @elements)))


(defn extract-with-layout
  "Extract text from PDF with layout reconstruction.

   Processes each page independently, capturing positioned text elements
   and reconstructing tabular structure with column separators.

   Returns:
     :text             - Layout-reconstructed text with === PAGE N === markers
     :page-count       - Number of pages
     :page-char-counts - Map of page-number (1-indexed) to character count
     :quality-score    - Always 1.0 (native PDF text)
     :raw-response     - Empty vector (no external API calls)
     :method           - :pdf-lib"
  [^bytes pdf-bytes]
  (with-open [doc (Loader/loadPDF pdf-bytes)]
    (let [page-count  (.getNumberOfPages doc)
          opts        {:column-gap-threshold 0.05}
          page-results
          (for [i (range page-count)]
            (let [elements  (extract-page-elements doc i)
                  page-text (layout/elements->structured-text elements opts)
                  page-num  (inc i)]
              {:page-num   page-num
               :text       (when-not (string/blank? page-text)
                             (str "=== PAGE " page-num " ===\n" page-text))
               :char-count (reduce + 0 (map (comp count :text) elements))}))]
      {:text             (->> page-results
                              (keep :text)
                              (string/join "\n\n"))
       :page-count       page-count
       :page-char-counts (into {} (map (juxt :page-num :char-count) page-results))
       :quality-score    1.0
       :raw-response     []
       :method           :pdf-lib})))

Step 4: Run tests to verify they pass

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: PASS. If the Viessmann PDF test fixture isn't available in CI, the when-let guards skip the assertions, so those tests pass without checking anything.

Step 5: Lint

Run: clj-kondo --lint src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj

Expected: No errors.

Step 6: Commit

git add src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj \
        test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "feat: PDFBox layout extraction with positioned text elements"

Task 3: Add config for PDFBox quality gate threshold

Files:

Modify: resources/com/getorcha/config.edn

Step 1: Add pdf-lib config

In resources/com/getorcha/config.edn, inside the :com.getorcha.workers.ingestion/orchestrator map, add :pdf-lib to the :transcription map (alongside :ocr and :vision):

:transcription {:pdf-lib {:min-chars-per-page 50}
                :ocr    {:provider         :ocr
                         ...existing...}
                :vision {...existing...}}
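One way the orchestrator might read this threshold is sketched below. The fallback default of 50 and the inclusive comparison (a page with exactly the minimum passes) are assumptions, and min-chars-per-page and page-passes? are hypothetical helper names:

```clojure
;; Hypothetical helpers for the quality gate; the comparison operator is an assumption.

(def default-min-chars-per-page 50)

(defn min-chars-per-page
  "Look up the per-page threshold in the transcription config, with a fallback default."
  [context]
  (get-in context
          [:transcription :pdf-lib :min-chars-per-page]
          default-min-chars-per-page))

(defn page-passes?
  "True when a page has enough extracted characters to trust the PDFBox text."
  [context char-count]
  (>= char-count (min-chars-per-page context)))

(def sample-context {:transcription {:pdf-lib {:min-chars-per-page 50}}})

(page-passes? sample-context 1234) ;; text-rich page stays on the PDFBox path
(page-passes? sample-context 0)    ;; empty (scanned) page falls through to OCR
```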

Step 2: Commit

git add resources/com/getorcha/config.edn
git commit -m "config: add pdf-lib min-chars-per-page threshold"

Task 4: Rewire extract-text to per-page PDFBox-first pipeline

This is the main orchestration change. Replace the current all-or-nothing tier logic in extract-text with per-page evaluation.

Files:

Modify: src/com/getorcha/workers/ingestion/transcription.clj
Modify: test/com/getorcha/workers/ingestion/transcription_test.clj

Step 1: Write tests for the new per-page pipeline

Add tests to test/com/getorcha/workers/ingestion/transcription_test.clj:

(deftest test-pdfbox-first-skips-ocr-for-text-rich-pdfs
  (testing "PDFBox extraction succeeds for PDFs with embedded text, no OCR called"
    ;; Create a PDF with actual text content
    (let [ocr-call-count (atom 0)
          ocr-spy        (fn [& _args]
                           (swap! ocr-call-count inc)
                           {:status 200 :body sample-docai-response})
          pdf-bytes      (let [baos (java.io.ByteArrayOutputStream.)]
                           (with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
                             (let [page (org.apache.pdfbox.pdmodel.PDPage.)]
                               (.addPage doc page)
                               ;; PDPage is not Closeable, so it stays out of with-open.
                               ;; The content stream is closed before saving so that
                               ;; its content is flushed into the page.
                               (with-open [cs (org.apache.pdfbox.pdmodel.PDPageContentStream. doc page)]
                                 (.beginText cs)
                                 (.setFont cs (org.apache.pdfbox.pdmodel.font.PDType1Font. org.apache.pdfbox.pdmodel.font.Standard14Fonts$FontName/HELVETICA) 12)
                                 (.newLineAtOffset cs 50 700)
                                 ;; Write enough text to pass quality gate (>50 chars)
                                 (.showText cs "Invoice INV-001 from Supplier Corp for 12345.67 EUR total amount")
                                 (.endText cs))
                               (.save doc baos)))
                           (.toByteArray baos))
          context        {:transcription {:pdf-lib {:min-chars-per-page 50}
                                          :ocr     {:provider     :ocr
                                                    :project-id   "test-project"
                                                    :location     "eu"
                                                    :processor-id "test-processor"}
                                          :vision  {}}
                          :llm-config    {}
                          :worker-pools  {}}
          ingestion      {:file     {:contents pdf-bytes :mime-type "application/pdf"}
                          :document {}}]
      (with-redefs-fn {#'hato/post                             ocr-spy
                       #'workers.transcription/get-access-token (constantly "test-token")}
        (fn []
          (let [result (workers.transcription/extract-text context ingestion)]
            (is (= :pdf-lib (:method result)))
            (is (= 0 @ocr-call-count) "OCR should not be called")
            (is (string/includes? (:text result) "Invoice"))
            (is (string/includes? (:text result) "=== PAGE 1 ==="))))))))


(deftest test-pdfbox-falls-through-to-ocr-for-scanned-pdfs
  (testing "Empty PDF pages fall through to Document AI"
    (let [ocr-call-count (atom 0)
          ocr-spy        (fn [& _args]
                           (swap! ocr-call-count inc)
                           {:status 200 :body sample-docai-response})
          ;; Create PDF with no text (simulates scanned document)
          pdf-bytes      (let [baos (java.io.ByteArrayOutputStream.)]
                           (with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
                             (.addPage doc (org.apache.pdfbox.pdmodel.PDPage.))
                             (.save doc baos))
                           (.toByteArray baos))
          context        {:transcription {:pdf-lib {:min-chars-per-page 50}
                                          :ocr     {:provider     :ocr
                                                    :project-id   "test-project"
                                                    :location     "eu"
                                                    :processor-id "test-processor"}
                                          :vision  {}}
                          :llm-config    {}
                          :worker-pools  {}}
          ingestion      {:file     {:contents pdf-bytes :mime-type "application/pdf"}
                          :document {}}]
      (with-redefs-fn {#'hato/post                             ocr-spy
                       #'workers.transcription/get-access-token (constantly "test-token")}
        (fn []
          (let [result (workers.transcription/extract-text context ingestion)]
            (is (= :ocr (:method result)))
            (is (pos? @ocr-call-count) "OCR should be called for empty pages")))))))

Step 2: Run new tests to verify they fail

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'

Expected: New tests FAIL (old tests may also fail since we haven't changed extract-text yet).

Step 3: Rewrite extract-text PDF/image branch

In transcription.clj, replace the current PDF/image :else branch (lines 614-662) with the per-page pipeline:

The new logic (pseudocode):

;; 1. If it's a PDF, try PDFBox first
;; 2. Extract per-page elements and char counts
;; 3. Partition pages into passing/failing
;; 4. For failing pages: split PDF, send to Document AI
;; 5. For Document AI pages with low confidence: send to Vision
;; 6. Assemble all pages in order

Key changes to extract-text:

Remove the (> page-count 15) gate on extract-pdf-text
Replace extract-pdf-text with pdfbox-layout/extract-with-layout which returns per-page char counts
Use :page-char-counts to identify failing pages
For failing pages, split PDF and send only those pages to Document AI with enableNativePdfParsing: false
For Document AI pages needing vision fallback, send only those specific pages
Assemble final text from all page results, sorted by page number
Compute document-level :method as the "highest" tier used by any page (assuming the tiers are ordered :pdf-lib, then :ocr, then :vision)
Add :page-methods to result
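The bookkeeping in these steps can be sketched on plain data. This is only an outline under assumptions: the tier order :pdf-lib < :ocr < :vision is inferred from the fallback chain, a page fails when its count is below the threshold, and failing-pages, document-method and assemble-pages are hypothetical helper names.

```clojure
(require '[clojure.string :as string])

;; Assumed tier order, lowest to highest.
(def tier-rank {:pdf-lib 0 :ocr 1 :vision 2})

(defn failing-pages
  "Page numbers whose character count is below `min-chars`, in ascending order."
  [page-char-counts min-chars]
  (->> page-char-counts
       (filter (fn [[_page-num char-count]] (< char-count min-chars)))
       (map key)
       sort
       vec))

(defn document-method
  "Highest tier used by any page. Expects a non-empty page-methods map."
  [page-methods]
  (apply max-key tier-rank (vals page-methods)))

(defn assemble-pages
  "Join per-page texts in page order, each under its === PAGE N === marker."
  [page-texts]
  (->> (sort-by key page-texts)
       (map (fn [[page-num text]] (str "=== PAGE " page-num " ===\n" text)))
       (string/join "\n\n")))

(failing-pages {1 1234, 2 0, 3 890} 50)
(document-method {1 :pdf-lib, 2 :ocr, 3 :pdf-lib})
(assemble-pages {2 "second page" 1 "first page"})
```

Keeping these three pieces as pure functions would let the orchestration in extract-text stay focused on the side effects: splitting the PDF and calling Document AI or Vision for the failing pages.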

Also modify process-docai-chunk to accept a native-pdf-parsing? parameter (default true for backward compat with non-PDF mime types like images, false when called as PDFBox fallback).
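A minimal sketch of how the flag might reach the request body follows. It assumes the chunk is sent as a base64 rawDocument and that the option lives under processOptions.ocrConfig, which is the path the Task 5 test inspects. docai-request-body is a hypothetical helper; the real process-docai-chunk signature may differ.

```clojure
;; Hypothetical request-body builder; only the enableNativePdfParsing path is taken from the plan.

(defn docai-request-body
  "Build a Document AI process request for one chunk.
   The two-argument arity keeps native PDF parsing on (existing behaviour);
   the PDFBox fallback path passes false to force image-based OCR."
  ([base64-content mime-type]
   (docai-request-body base64-content mime-type true))
  ([base64-content mime-type native-pdf-parsing?]
   {:rawDocument    {:content  base64-content
                     :mimeType mime-type}
    :processOptions {:ocrConfig {:enableNativePdfParsing native-pdf-parsing?}}}))

;; Existing callers keep the default.
(docai-request-body "aGVsbG8=" "image/png")

;; PDFBox fallback: pages that failed the quality gate.
(docai-request-body "aGVsbG8=" "application/pdf" false)
```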

Step 4: Update existing tests

Existing tests use image/png mime type which skips the PDFBox path entirely — they should continue to work. But update the context maps to include the new :pdf-lib config key:

:transcription {:pdf-lib {:min-chars-per-page 50}
                :ocr    {...}
                :vision {...}}

The vision fallback tests that use application/pdf now go through PDFBox first. Their empty-page PDFs have 0 chars, so they fail the quality gate and fall through to OCR as before; adjust them only where their assertions depend on the old tier logic.

Step 5: Run all tests

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'

Expected: PASS for both old and new tests.

Step 6: Lint

Run: clj-kondo --lint src/com/getorcha/workers/ingestion/transcription.clj

Expected: No errors.

Step 7: Commit

git add src/com/getorcha/workers/ingestion/transcription.clj \
        test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "feat: per-page PDFBox-first transcription with OCR/vision fallback"

Task 5: Disable native PDF parsing in Document AI fallback

Small change but important for correctness: when Document AI is called as a fallback for pages that failed PDFBox, disable native PDF parsing to force image-based OCR.

Files:

Modify: src/com/getorcha/workers/ingestion/transcription.clj

Step 1: This should already be done in Task 4

Verify that process-docai-chunk uses enableNativePdfParsing: false when called from the PDFBox fallback path. If Task 4 already handled this, this task is just verification.

Step 2: Write a targeted test

Add to transcription_test.clj:

(deftest test-docai-fallback-disables-native-pdf-parsing
  (testing "Document AI fallback sends enableNativePdfParsing=false"
    (let [captured-body (atom nil)
          http-spy      (fn [_url opts]
                          (let [body (cheshire.core/parse-string (:body opts) true)]
                            (reset! captured-body body))
                          {:status 200 :body sample-docai-response})
          ;; Empty PDF — will fail quality gate
          pdf-bytes     (let [baos (java.io.ByteArrayOutputStream.)]
                          (with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
                            (.addPage doc (org.apache.pdfbox.pdmodel.PDPage.))
                            (.save doc baos))
                          (.toByteArray baos))
          context       {:transcription {:pdf-lib {:min-chars-per-page 50}
                                         :ocr     {:provider     :ocr
                                                   :project-id   "test-project"
                                                   :location     "eu"
                                                   :processor-id "test-processor"}
                                         :vision  {}}
                         :llm-config    {}
                         :worker-pools  {}}
          ingestion     {:file     {:contents pdf-bytes :mime-type "application/pdf"}
                         :document {}}]
      (with-redefs-fn {#'hato/post                             http-spy
                       #'workers.transcription/get-access-token (constantly "test-token")}
        (fn []
          (workers.transcription/extract-text context ingestion)
          (is (some? @captured-body))
          (is (false? (get-in @captured-body [:processOptions :ocrConfig :enableNativePdfParsing]))
              "Should disable native PDF parsing for fallback pages"))))))

Step 3: Run tests

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'

Expected: PASS

Step 4: Commit

git add src/com/getorcha/workers/ingestion/transcription.clj \
        test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "feat: disable native PDF parsing in Document AI fallback"

Task 6: Update TranscriptionResult schema

Add the optional :page-methods field to the schema so it validates correctly when present.

Files:

Modify: src/com/getorcha/schema/ingestion.clj

Step 1: Add page-methods to all PDF-relevant variants

In TranscriptionResult, add [:page-methods {:optional true} [:map-of :int :keyword]] to the :pdf-lib, :ocr, and :vision variants of the multi schema:

[:pdf-lib
 (m.util/merge
  TranscriptionResultBase
  [:map
   [:method [:= :pdf-lib]]
   [:page-methods {:optional true} [:map-of :int :keyword]]])]

Same for :ocr and :vision.

Step 2: Verify with comment-block examples

Update the (comment ...) block in the schema file to include a :page-methods example.

Step 3: Lint

Run: clj-kondo --lint src/com/getorcha/schema/ingestion.clj

Step 4: Commit

git add src/com/getorcha/schema/ingestion.clj
git commit -m "schema: add optional page-methods to TranscriptionResult"

Task 7: Integration test with Viessmann PDF

End-to-end test using the actual Viessmann invoice to verify the full pipeline produces correct output.

Files:

Modify: test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj

Step 1: Add integration test

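This test calls extract-text (the public API) with the Viessmann PDF and verifies the output contains the previously-truncated fields. It should NOT call Document AI (all 3 pages should pass the quality gate). Add [com.getorcha.workers.ingestion.transcription :as workers.transcription] to the :require of the pdfbox-layout-test namespace, since the test below refers to that alias.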

(deftest test-integration-viessmann-invoice
  (testing "Full pipeline produces complete text for Viessmann invoice"
    (when-let [pdf-bytes @viessmann-pdf-bytes]
      (let [context   {:transcription {:pdf-lib {:min-chars-per-page 50}
                                       :ocr     {:provider :ocr}
                                       :vision  {}}
                       :llm-config    {}
                       :worker-pools  {}}
            ingestion {:file     {:contents pdf-bytes :mime-type "application/pdf"}
                       :document {}}
            result    (workers.transcription/extract-text context ingestion)]
        ;; Should use PDFBox for all pages
        (is (= :pdf-lib (:method result)))
        (is (= {1 :pdf-lib 2 :pdf-lib 3 :pdf-lib} (:page-methods result)))
        ;; Previously truncated fields should be complete
        (is (string/includes? (:text result) "WP261041"))
        (is (string/includes? (:text result) "Saeidiani"))
        (is (string/includes? (:text result) "Schanzenweg"))
        (is (string/includes? (:text result) "ATU80760339"))))))

Step 2: Run the test

Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: PASS

Step 3: Commit

git add test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "test: integration test for Viessmann invoice with PDFBox pipeline"

Task 8: Run full test suite and lint

Step 1: Run all transcription-related tests

clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test com.getorcha.workers.ingestion.transcription.layout-test com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'

Expected: All PASS.

Step 2: Run full lint

clj-kondo --lint src test dev

Expected: No errors.

Step 3: Run the full test suite

clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Execution error|failed because|Ran .* tests)"

Expected: All tests pass, no regressions.

Step 4: Commit any fixes

If any issues are found, fix them and commit.