For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Replace all-or-nothing OCR tier system with per-page PDFBox-first extraction, with Document AI and Vision as per-page fallbacks.
Architecture: PDFBox extracts positioned text elements from each page. Pages passing a character density gate get layout reconstruction from PDFBox coordinates. Failing pages fall through to Document AI (image OCR only) then Vision. All pages assembled with === PAGE N === markers.
Tech Stack: Apache PDFBox 3.0.3 (already in deps), Google Document AI, Gemini Vision, Clojure multimethods.
Design doc: docs/plans/2026-03-09-pdfbox-first-transcription-design.md
Extract the provider-agnostic row-grouping and column-separator logic into a shared module that both PDFBox and Document AI layout can use.
Files:
src/com/getorcha/workers/ingestion/transcription/layout.clj
src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj
test/com/getorcha/workers/ingestion/transcription/layout_test.clj
Step 1: Write failing tests for the shared layout module
Create test/com/getorcha/workers/ingestion/transcription/layout_test.clj:
(ns com.getorcha.workers.ingestion.transcription.layout-test
(:require [clojure.test :refer [deftest is testing]]
[com.getorcha.workers.ingestion.transcription.layout :as layout]))
(deftest test-group-into-rows
(testing "Groups vertically overlapping elements into same row"
(let [elements [{:text "Name" :x 0.1 :y 0.5 :width 0.1 :height 0.02}
{:text "Value" :x 0.5 :y 0.5 :width 0.1 :height 0.02}]
rows (layout/group-into-rows elements)]
(is (= 1 (count rows)))
(is (= 2 (count (first rows))))))
(testing "Separates non-overlapping elements into different rows"
(let [elements [{:text "Row1" :x 0.1 :y 0.1 :width 0.1 :height 0.02}
{:text "Row2" :x 0.1 :y 0.5 :width 0.1 :height 0.02}]
rows (layout/group-into-rows elements)]
(is (= 2 (count rows))))))
(deftest test-row->text
(testing "Inserts column separator for large gaps"
(let [row [{:text "Description" :x 0.1 :y 0.5 :width 0.15 :height 0.02}
{:text "100.00" :x 0.7 :y 0.5 :width 0.08 :height 0.02}]]
(is (= "Description | 100.00" (layout/row->text row 0.05)))))
(testing "No separator for small gaps"
(let [row [{:text "First" :x 0.1 :y 0.5 :width 0.08 :height 0.02}
{:text "Second" :x 0.19 :y 0.5 :width 0.08 :height 0.02}]]
(is (= "First Second" (layout/row->text row 0.05))))))
(deftest test-elements->structured-text
(testing "Full pipeline: elements to structured text with rows and columns"
(let [elements [{:text "Pos" :x 0.05 :y 0.1 :width 0.05 :height 0.02}
{:text "Description" :x 0.15 :y 0.1 :width 0.2 :height 0.02}
{:text "Amount" :x 0.7 :y 0.1 :width 0.1 :height 0.02}
{:text "10" :x 0.05 :y 0.15 :width 0.03 :height 0.02}
{:text "Widget" :x 0.15 :y 0.15 :width 0.1 :height 0.02}
{:text "99.00" :x 0.7 :y 0.15 :width 0.08 :height 0.02}]
text (layout/elements->structured-text elements {:column-gap-threshold 0.05})]
(is (string? text))
(is (re-find #"Pos" text))
(is (re-find #"Widget" text))
(is (re-find #"\|" text) "Should have column separators"))))
Step 2: Run tests to verify they fail
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'
Expected: FAIL — namespace layout does not exist.
Step 3: Create the shared layout module
Create src/com/getorcha/workers/ingestion/transcription/layout.clj. Move these functions from ocr_layout.clj:
vertical-overlap?
row-overlaps?
group-into-rows
sort-row-by-x
row->text
Add a new public entry point elements->structured-text that takes a seq of {:text :x :y :width :height} maps and options, and returns a formatted text string.
(ns com.getorcha.workers.ingestion.transcription.layout
"Shared layout reconstruction for positioned text elements.
Converts sequences of positioned text elements (from any source — PDFBox,
Document AI, etc.) into structured row-based text with column separators.
Elements are maps with:
:text - the text content
:x - left edge (0.0-1.0 normalized)
:y - top edge (0.0-1.0 normalized)
:width - element width (normalized)
:height - element height (normalized)"
(:require [clojure.string :as string]))
(set! *warn-on-reflection* true)
(defn ^:private vertical-overlap?
"Check if two elements overlap vertically (same visual row).
Elements overlap if their Y-ranges intersect."
[elem1 elem2]
(let [y1-start (:y elem1)
y1-end (+ y1-start (:height elem1))
y2-start (:y elem2)
y2-end (+ y2-start (:height elem2))]
(and (< y1-start y2-end)
(< y2-start y1-end))))
(defn ^:private row-overlaps?
"Check if an element overlaps vertically with any element in the row."
[row element]
(some #(vertical-overlap? % element) row))
(defn group-into-rows
"Group elements by vertical overlap (bounding box intersection).
Elements on the same visual row will have overlapping Y-ranges.
Returns seq of rows, each row being a seq of elements."
[elements]
(->> elements
(sort-by :y)
(reduce
(fn [rows element]
(if (empty? rows)
[[element]]
(let [current-row (peek rows)]
(if (row-overlaps? current-row element)
(conj (pop rows) (conj current-row element))
(conj rows [element])))))
[])))
(defn row->text
"Convert a row of elements to text, inserting `|` separators for column gaps.
Elements are sorted left-to-right by X-coordinate."
[row column-gap-threshold]
(let [sorted-row (sort-by :x row)]
(->> sorted-row
(partition 2 1 nil)
(mapcat (fn [[elem next-elem]]
(if (and next-elem
(> (- (:x next-elem) (+ (:x elem) (:width elem)))
column-gap-threshold))
[(:text elem) " | "]
[(:text elem) " "])))
(apply str)
string/trim)))
(defn elements->structured-text
"Convert positioned text elements into structured row-based text.
Takes a seq of `{:text :x :y :width :height}` maps and options.
Returns a string with rows separated by newlines and columns by `|`."
[elements {:keys [column-gap-threshold]
:or {column-gap-threshold 0.05}
:as _opts}]
(->> (group-into-rows elements)
(map #(row->text % column-gap-threshold))
(filter (complement string/blank?))
(string/join "\n")))
Step 4: Run tests to verify they pass
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.layout-test]'
Expected: PASS
Step 5: Refactor ocr_layout.clj to use shared layout module
Modify src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj:
Remove vertical-overlap?, row-overlaps?, group-into-rows, sort-row-by-x, row->text, page->structured-text
Require the layout namespace
Keep layout->element and page->elements (Document AI-specific element extraction)
Replace page->structured-text with a call to layout/elements->structured-text
The resulting ocr_layout.clj should only contain:
layout->element: converts Document AI layout map → {:text :x :y :width :height}
page->elements: extracts elements from a Document AI page
reconstruct-single-document: uses layout/elements->structured-text per page
reconstruct-layout: public API, iterates chunks
Step 6: Run existing tests to verify no regressions
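The existing layout->element is not reproduced in this plan. As a rough sketch of the conversion it has to perform, assuming the parsed Document AI layout carries :boundingPoly :normalizedVertices (a seq of {:x :y} maps, where a zero coordinate may be omitted from the JSON) and assuming the text has already been resolved by the caller, it could look like this. The name layout->element-sketch and the text argument are hypothetical.

```clojure
(defn layout->element-sketch
  "Hypothetical stand-in for layout->element. Builds a
  {:text :x :y :width :height} map from the bounding polygon of a
  Document AI layout. Missing :x/:y keys are treated as 0.0."
  [text layout]
  (let [vertices (get-in layout [:boundingPoly :normalizedVertices])
        xs       (map #(double (get % :x 0.0)) vertices)
        ys       (map #(double (get % :y 0.0)) vertices)]
    (when (seq vertices)
      (let [min-x (apply min xs)
            min-y (apply min ys)]
        {:text   text
         :x      min-x
         :y      min-y
         :width  (- (apply max xs) min-x)
         :height (- (apply max ys) min-y)}))))
```

Elements in this shape can be passed directly to layout/elements->structured-text, which is the point of the refactor.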
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'
Expected: PASS — all existing OCR tests still work.
Step 7: Lint
Run: clj-kondo --lint src/com/getorcha/workers/ingestion/transcription/layout.clj src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj
Expected: No errors.
Step 8: Commit
git add src/com/getorcha/workers/ingestion/transcription/layout.clj \
src/com/getorcha/workers/ingestion/transcription/ocr_layout.clj \
test/com/getorcha/workers/ingestion/transcription/layout_test.clj
git commit -m "refactor: extract shared layout algorithm from ocr_layout"
Create the PDFBox-specific module that subclasses PDFTextStripper to capture positioned text elements per page, then feeds them into the shared layout algorithm.
Files:
src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj
test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
Context: PDFBox 3.0.3 PDFTextStripper has a protected method writeString(String text, List<TextPosition> textPositions) called once per word group. Override it to capture each group as a positioned element with normalized coordinates. The TextPosition class provides .getX(), .getY(), .getWidth(), .getHeight(), .getPageWidth(), .getPageHeight().
Important: .getY() is expected to return the distance from the top of the page in PDFBox 3.x. Verify this at implementation time by printing values: if Y increases downward, no transform is needed; if Y increases upward, subtract from the page height.
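If the check shows that Y increases upward, the transform could be isolated in a small helper like the sketch below. The function name and the y-up? flag are hypothetical, and the sketch assumes that in the upward case the raw Y value marks the lower edge of the text box.

```clojure
(defn top-origin-y
  "Return the normalized top edge (0.0 = top of page) of a text box.
  y-up? is a hypothetical flag: true when raw-y grows upward from the
  bottom of the page (raw-y assumed to be the box's lower edge), false
  when raw-y already grows downward from the top."
  [raw-y box-height page-height y-up?]
  (let [top (if y-up?
              (- page-height raw-y box-height)
              raw-y)]
    (/ (double top) (double page-height))))

;; Both calls describe the same box, 100 points below the top of an
;; 800-point page, and both normalize to 0.125.
(top-origin-y 100 20 800 false)
(top-origin-y 680 20 800 true)
```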
Step 1: Write failing test using the Viessmann invoice
Create test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj. Use the actual Viessmann PDF from dump/ as a test fixture — this is the document that motivated the entire refactor, so it's the perfect regression test.
(ns com.getorcha.workers.ingestion.transcription.pdfbox-layout-test
(:require [clojure.java.io :as io]
[clojure.string :as string]
[clojure.test :refer [deftest is testing]]
[com.getorcha.workers.ingestion.transcription.pdfbox-layout :as pdfbox-layout])
(:import (org.apache.pdfbox Loader)))
(def ^:private viessmann-pdf-bytes
(delay
(let [f (io/file "dump/019cbfb0-1e37-709c-a2e2-43ec943db97c.pdf")]
(when (.exists f)
;; Files/readAllBytes avoids leaving an unclosed FileInputStream behind
(java.nio.file.Files/readAllBytes (.toPath f))))))
(deftest test-extract-page-elements
(testing "Extracts positioned elements from a PDF page"
(when-let [pdf-bytes @viessmann-pdf-bytes]
(with-open [doc (Loader/loadPDF pdf-bytes)]
(let [elements (pdfbox-layout/extract-page-elements doc 0)]
(is (seq elements) "Should extract elements from page 1")
(is (every? #(contains? % :text) elements))
(is (every? #(contains? % :x) elements))
(is (every? #(contains? % :y) elements))
(is (every? #(contains? % :width) elements))
(is (every? #(contains? % :height) elements))
;; All coordinates should be normalized 0.0-1.0
(is (every? #(<= 0.0 (:x %) 1.0) elements))
(is (every? #(<= 0.0 (:y %) 1.0) elements)))))))
(deftest test-viessmann-complete-text
(testing "PDFBox extracts complete text that Document AI clipped"
(when-let [pdf-bytes @viessmann-pdf-bytes]
(let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
;; These are the exact fields Document AI truncated
(is (string/includes? (:text result) "WP261041")
"Bestellnummer should be complete (was WP26)")
(is (string/includes? (:text result) "Saeidiani")
"Warenempfänger name should be complete (was Sa)")
(is (string/includes? (:text result) "Schanzenweg")
"Street should be complete (was Schanzenw)")
(is (string/includes? (:text result) "Gerasdorf bei Wien")
"City should be complete (was Gerasdorf be)")
(is (string/includes? (:text result) "ÖSTERREICH")
"Country should be complete (was ÖSTERR)")))))
(deftest test-page-markers
(testing "Output contains page markers with correct numbering"
(when-let [pdf-bytes @viessmann-pdf-bytes]
(let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
(is (string/includes? (:text result) "=== PAGE 1 ==="))
(is (string/includes? (:text result) "=== PAGE 2 ==="))
(is (string/includes? (:text result) "=== PAGE 3 ==="))
(is (= 3 (:page-count result)))))))
(deftest test-column-separators
(testing "Table rows have column separators"
(when-let [pdf-bytes @viessmann-pdf-bytes]
(let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
;; Line items should have separators between columns
(is (string/includes? (:text result) "|")
"Should have column separators in tabular data")))))
(deftest test-per-page-char-counts
(testing "Returns per-page character counts for quality gating"
(when-let [pdf-bytes @viessmann-pdf-bytes]
(let [result (pdfbox-layout/extract-with-layout pdf-bytes)]
(is (map? (:page-char-counts result)))
(is (= 3 (count (:page-char-counts result))))
;; All pages of this invoice should have substantial text
(is (every? #(> % 50) (vals (:page-char-counts result))))))))
Step 2: Run tests to verify they fail
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: FAIL — namespace pdfbox-layout does not exist.
Step 3: Implement pdfbox_layout.clj
Create src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj.
Key implementation details:
Subclass PDFTextStripper using proxy or gen-class. Override writeString(String, List<TextPosition>). In the override, compute a bounding box from the list of TextPosition objects (min x, min y, max x+width, max y+height), normalize by page dimensions, and store as a {:text :x :y :width :height} element.
Per-page extraction: Use setStartPage / setEndPage to process one page at a time. For each page, collect all positioned elements, then call layout/elements->structured-text.
Public API: extract-with-layout takes PDF bytes, returns:
{:text "=== PAGE 1 ===\n...\n\n=== PAGE 2 ===\n..."
:page-count 3
:page-char-counts {1 1234, 2 567, 3 890} ;; for quality gating
:quality-score 1.0
:raw-response []
:method :pdf-lib}
Page dimensions: Get from PDDocument.getPage(i).getMediaBox() — returns PDRectangle with .getWidth() and .getHeight() in PDF points. Use these to normalize TextPosition coordinates.
Coordinate normalization: TextPosition.getX() / mediaBox.getWidth() for x, TextPosition.getY() / mediaBox.getHeight() for y (verify Y direction at implementation time).
(ns com.getorcha.workers.ingestion.transcription.pdfbox-layout
"PDFBox-based layout extraction for PDF documents with embedded text.
Subclasses PDFTextStripper to capture positioned word groups, then
reconstructs tabular layout using the shared layout algorithm.
Produces the same output format as ocr_layout.clj (=== PAGE N === markers
with | column separators) but from PDFBox coordinates instead of
Document AI bounding boxes."
(:require [clojure.string :as string]
[com.getorcha.workers.ingestion.transcription.layout :as layout])
(:import (org.apache.pdfbox Loader)
(org.apache.pdfbox.pdmodel PDDocument)
(org.apache.pdfbox.text PDFTextStripper TextPosition)))
(set! *warn-on-reflection* true)
(defn extract-page-elements
"Extract positioned text elements from a single PDF page.
Returns seq of {:text :x :y :width :height} maps with coordinates
normalized to 0.0-1.0 range."
[^PDDocument doc page-index]
(let [page (.getPage doc page-index)
media-box (.getMediaBox page)
page-width (.getWidth media-box)
page-height (.getHeight media-box)
elements (atom [])
stripper (proxy [PDFTextStripper] []
(writeString [^String text ^java.util.List text-positions]
(when (and text (not (string/blank? text)) (seq text-positions))
(let [positions (vec text-positions)
xs (mapv #(.getX ^TextPosition %) positions)
ys (mapv #(.getY ^TextPosition %) positions)
widths (mapv #(.getWidth ^TextPosition %) positions)
heights (mapv #(.getHeight ^TextPosition %) positions)
min-x (apply min xs)
min-y (apply min ys)
max-x (apply max (map + xs widths))
max-y (apply max (map + ys heights))]
(swap! elements conj
{:text (string/trim text)
:x (/ min-x page-width)
:y (/ min-y page-height)
:width (/ (- max-x min-x) page-width)
:height (/ (- max-y min-y) page-height)})))))]
;; PDFTextStripper pages are 1-indexed
(.setStartPage stripper (inc page-index))
(.setEndPage stripper (inc page-index))
;; getText drives the parsing — writeString is called as a side effect
(.getText stripper doc)
(filterv #(not (string/blank? (:text %))) @elements)))
(defn extract-with-layout
"Extract text from PDF with layout reconstruction.
Processes each page independently, capturing positioned text elements
and reconstructing tabular structure with column separators.
Returns:
:text - Layout-reconstructed text with === PAGE N === markers
:page-count - Number of pages
:page-char-counts - Map of page-number (1-indexed) to character count
:quality-score - Always 1.0 (native PDF text)
:raw-response - Empty vector (no external API calls)
:method - :pdf-lib"
[^bytes pdf-bytes]
(with-open [doc (Loader/loadPDF pdf-bytes)]
(let [page-count (.getNumberOfPages doc)
opts {:column-gap-threshold 0.05}
page-results
(for [i (range page-count)]
(let [elements (extract-page-elements doc i)
page-text (layout/elements->structured-text elements opts)
page-num (inc i)]
{:page-num page-num
:text (when-not (string/blank? page-text)
(str "=== PAGE " page-num " ===\n" page-text))
:char-count (reduce + 0 (map (comp count :text) elements))}))]
{:text (->> page-results
(keep :text)
(string/join "\n\n"))
:page-count page-count
:page-char-counts (into {} (map (juxt :page-num :char-count) page-results))
:quality-score 1.0
:raw-response []
:method :pdf-lib})))
Step 4: Run tests to verify they pass
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: PASS. If the Viessmann PDF test fixture isn't available in CI, the when-let guards will skip those tests gracefully.
Step 5: Lint
Run: clj-kondo --lint src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj
Expected: No errors.
Step 6: Commit
git add src/com/getorcha/workers/ingestion/transcription/pdfbox_layout.clj \
test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "feat: PDFBox layout extraction with positioned text elements"
Files:
resources/com/getorcha/config.edn
Step 1: Add pdf-lib config
In resources/com/getorcha/config.edn, inside the :com.getorcha.workers.ingestion/orchestrator map, add :pdf-lib to the :transcription map (alongside :ocr and :vision):
:transcription {:pdf-lib {:min-chars-per-page 50}
:ocr {:provider :ocr
...existing...}
:vision {...existing...}}
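To illustrate how this threshold is meant to be used, here is a minimal sketch of the character density gate. The function name failing-pages is hypothetical, and treating a page with exactly :min-chars-per-page characters as passing is an assumption; the plan does not pin down the boundary case.

```clojure
(defn failing-pages
  "Return a sorted vector of 1-indexed page numbers whose character
  count is below :min-chars-per-page (assumed inclusive minimum).
  pdf-lib-config is the :pdf-lib map from the :transcription config;
  page-char-counts is the map returned by extract-with-layout."
  [{:keys [min-chars-per-page] :or {min-chars-per-page 50}} page-char-counts]
  (->> page-char-counts
       (filter (fn [[_page n]] (< n min-chars-per-page)))
       (map key)
       sort
       vec))

;; Pages 2 and 4 fall below the threshold and would go to Document AI.
(failing-pages {:min-chars-per-page 50} {1 1234, 2 12, 3 890, 4 0})
```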
Step 2: Commit
git add resources/com/getorcha/config.edn
git commit -m "config: add pdf-lib min-chars-per-page threshold"
This is the main orchestration change. Replace the current all-or-nothing tier logic in extract-text with per-page evaluation.
Files:
src/com/getorcha/workers/ingestion/transcription.clj
test/com/getorcha/workers/ingestion/transcription_test.clj
Step 1: Write tests for the new per-page pipeline
Add tests to test/com/getorcha/workers/ingestion/transcription_test.clj:
(deftest test-pdfbox-first-skips-ocr-for-text-rich-pdfs
(testing "PDFBox extraction succeeds for PDFs with embedded text, no OCR called"
;; Create a PDF with actual text content
(let [ocr-call-count (atom 0)
ocr-spy (fn [& _args]
(swap! ocr-call-count inc)
{:status 200 :body sample-docai-response})
pdf-bytes (let [baos (java.io.ByteArrayOutputStream.)]
(with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
(let [page (org.apache.pdfbox.pdmodel.PDPage.)]
(.addPage doc page)
;; PDPage has no close method, so it must not be bound in with-open.
;; The content stream is closed before saving so the text is flushed.
(with-open [cs (org.apache.pdfbox.pdmodel.PDPageContentStream. doc page)]
(.beginText cs)
(.setFont cs (org.apache.pdfbox.pdmodel.font.PDType1Font. org.apache.pdfbox.pdmodel.font.Standard14Fonts$FontName/HELVETICA) 12)
(.newLineAtOffset cs 50 700)
;; Write enough text to pass quality gate (>50 chars)
(.showText cs "Invoice INV-001 from Supplier Corp for 12345.67 EUR total amount")
(.endText cs))
(.save doc baos)))
(.toByteArray baos))
context {:transcription {:pdf-lib {:min-chars-per-page 50}
:ocr {:provider :ocr
:project-id "test-project"
:location "eu"
:processor-id "test-processor"}
:vision {}}
:llm-config {}
:worker-pools {}}
ingestion {:file {:contents pdf-bytes :mime-type "application/pdf"}
:document {}}]
(with-redefs-fn {#'hato/post ocr-spy
#'workers.transcription/get-access-token (constantly "test-token")}
(fn []
(let [result (workers.transcription/extract-text context ingestion)]
(is (= :pdf-lib (:method result)))
(is (= 0 @ocr-call-count) "OCR should not be called")
(is (string/includes? (:text result) "Invoice"))
(is (string/includes? (:text result) "=== PAGE 1 ==="))))))))
(deftest test-pdfbox-falls-through-to-ocr-for-scanned-pdfs
(testing "Empty PDF pages fall through to Document AI"
(let [ocr-call-count (atom 0)
ocr-spy (fn [& _args]
(swap! ocr-call-count inc)
{:status 200 :body sample-docai-response})
;; Create PDF with no text (simulates scanned document)
pdf-bytes (let [baos (java.io.ByteArrayOutputStream.)]
(with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
(.addPage doc (org.apache.pdfbox.pdmodel.PDPage.))
(.save doc baos))
(.toByteArray baos))
context {:transcription {:pdf-lib {:min-chars-per-page 50}
:ocr {:provider :ocr
:project-id "test-project"
:location "eu"
:processor-id "test-processor"}
:vision {}}
:llm-config {}
:worker-pools {}}
ingestion {:file {:contents pdf-bytes :mime-type "application/pdf"}
:document {}}]
(with-redefs-fn {#'hato/post ocr-spy
#'workers.transcription/get-access-token (constantly "test-token")}
(fn []
(let [result (workers.transcription/extract-text context ingestion)]
(is (= :ocr (:method result)))
(is (pos? @ocr-call-count) "OCR should be called for empty pages")))))))
Step 2: Run new tests to verify they fail
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'
Expected: New tests FAIL (old tests may also fail since we haven't changed extract-text yet).
Step 3: Rewrite extract-text PDF/image branch
In transcription.clj, replace the current PDF/image :else branch (lines 614-662) with the per-page pipeline:
The new logic (pseudocode):
;; 1. If it's a PDF, try PDFBox first
;; 2. Extract per-page elements and char counts
;; 3. Partition pages into passing/failing
;; 4. For failing pages: split PDF, send to Document AI
;; 5. For Document AI pages with low confidence: send to Vision
;; 6. Assemble all pages in order
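Step 6 can be sketched as a pure function over a map of page number to text. The name assemble-pages is hypothetical. The marker format and the blank-page handling mirror what extract-with-layout does in Task 2, and the assumption here is that each tier returns its pages keyed by 1-indexed page number so they can be merged before assembly.

```clojure
(require '[clojure.string :as string])

(defn assemble-pages
  "Join per-page texts in page order, prefixing each with a
  === PAGE N === marker. page-texts maps 1-indexed page numbers to
  text; pages with blank or nil text are skipped."
  [page-texts]
  (->> (sort-by key page-texts)
       (remove (fn [[_page text]] (string/blank? text)))
       (map (fn [[page text]] (str "=== PAGE " page " ===\n" text)))
       (string/join "\n\n")))

;; PDFBox supplied pages 1 and 3; page 2 came back from the OCR fallback.
(assemble-pages (merge {1 "Invoice | 100.00" 3 "Terms"}
                       {2 "Scanned delivery note"}))
```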
Key changes to extract-text:
Remove the (> page-count 15) gate on extract-pdf-text
Replace extract-pdf-text with pdfbox-layout/extract-with-layout, which returns per-page char counts
Use :page-char-counts to identify failing pages
Send failing pages to Document AI with enableNativePdfParsing: false
Set :method as the "highest" tier used
Add :page-methods to result
Also modify process-docai-chunk to accept a native-pdf-parsing? parameter (default true for backward compatibility with non-PDF mime types like images, false when called as the PDFBox fallback).
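One way to derive the top-level :method from :page-methods is to rank the tiers and take the maximum. The sketch below assumes the ordering :pdf-lib < :ocr < :vision, which follows the fallback order in this plan; tier-rank and highest-method are hypothetical names.

```clojure
;; Assumed tier ordering: later fallbacks rank higher.
(def tier-rank {:pdf-lib 0 :ocr 1 :vision 2})

(defn highest-method
  "Return the highest-ranked method in page-methods, a map of
  1-indexed page number to method keyword. Returns nil for an empty map."
  [page-methods]
  (when (seq page-methods)
    (apply max-key tier-rank (vals page-methods))))

;; A document where only page 2 needed OCR reports :ocr overall.
(highest-method {1 :pdf-lib 2 :ocr 3 :pdf-lib})
```

With this rule a document that PDFBox handles entirely keeps :method :pdf-lib, while a single scanned page is enough to report :ocr, which matches the expectations in the tests from Step 1.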
Step 4: Update existing tests
Existing tests use image/png mime type which skips the PDFBox path entirely — they should continue to work. But update the context maps to include the new :pdf-lib config key:
:transcription {:pdf-lib {:min-chars-per-page 50}
:ocr {...}
:vision {...}}
The vision fallback tests that use application/pdf will need adjustment since the pipeline now tries PDFBox first. The empty-page PDFs in those tests have 0 chars, so they'll fail the quality gate and fall through to OCR as before.
Step 5: Run all tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'
Expected: PASS — both old and new tests.
Step 6: Lint
Run: clj-kondo --lint src/com/getorcha/workers/ingestion/transcription.clj
Expected: No errors.
Step 7: Commit
git add src/com/getorcha/workers/ingestion/transcription.clj \
test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "feat: per-page PDFBox-first transcription with OCR/vision fallback"
Small change but important for correctness: when Document AI is called as a fallback for pages that failed PDFBox, disable native PDF parsing to force image-based OCR.
Files:
src/com/getorcha/workers/ingestion/transcription.clj
Step 1: This should already be done in Task 4
Verify that process-docai-chunk uses enableNativePdfParsing: false when called from the PDFBox fallback path. If Task 4 already handled this, this task is just verification.
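For reference while verifying, the request body is expected to carry the flag under processOptions.ocrConfig, which is the path the test in Step 2 asserts on. The sketch below is not the real process-docai-chunk; docai-request-body is a hypothetical helper that only shows the shape, assuming the document is sent inline as a base64 rawDocument.

```clojure
(import '(java.util Base64))

(defn docai-request-body
  "Hypothetical helper: build a Document AI process request body as a
  Clojure map (to be JSON-encoded by the caller). native-pdf-parsing?
  should be false when called as the PDFBox fallback."
  [^bytes content mime-type native-pdf-parsing?]
  {:rawDocument    {:content  (.encodeToString (Base64/getEncoder) content)
                    :mimeType mime-type}
   :processOptions {:ocrConfig {:enableNativePdfParsing (boolean native-pdf-parsing?)}}})

;; Fallback path: force image-based OCR.
(docai-request-body (.getBytes "hi") "application/pdf" false)
```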
Step 2: Write a targeted test
Add to transcription_test.clj:
(deftest test-docai-fallback-disables-native-pdf-parsing
(testing "Document AI fallback sends enableNativePdfParsing=false"
(let [captured-body (atom nil)
http-spy (fn [_url opts]
(let [body (cheshire.core/parse-string (:body opts) true)]
(reset! captured-body body))
{:status 200 :body sample-docai-response})
;; Empty PDF — will fail quality gate
pdf-bytes (let [baos (java.io.ByteArrayOutputStream.)]
(with-open [doc (org.apache.pdfbox.pdmodel.PDDocument.)]
(.addPage doc (org.apache.pdfbox.pdmodel.PDPage.))
(.save doc baos))
(.toByteArray baos))
context {:transcription {:pdf-lib {:min-chars-per-page 50}
:ocr {:provider :ocr
:project-id "test-project"
:location "eu"
:processor-id "test-processor"}
:vision {}}
:llm-config {}
:worker-pools {}}
ingestion {:file {:contents pdf-bytes :mime-type "application/pdf"}
:document {}}]
(with-redefs-fn {#'hato/post http-spy
#'workers.transcription/get-access-token (constantly "test-token")}
(fn []
(workers.transcription/extract-text context ingestion)
(is (some? @captured-body))
(is (false? (get-in @captured-body [:processOptions :ocrConfig :enableNativePdfParsing]))
"Should disable native PDF parsing for fallback pages"))))))
Step 3: Run tests
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test]'
Expected: PASS
Step 4: Commit
git add src/com/getorcha/workers/ingestion/transcription.clj \
test/com/getorcha/workers/ingestion/transcription_test.clj
git commit -m "feat: disable native PDF parsing in Document AI fallback"
Add the optional :page-methods field to the schema so it validates correctly when present.
Files:
src/com/getorcha/schema/ingestion.clj
Step 1: Add page-methods to all PDF-relevant variants
In TranscriptionResult, add [:page-methods {:optional true} [:map-of :int :keyword]] to the :pdf-lib, :ocr, and :vision variants of the multi schema:
[:pdf-lib
(m.util/merge
TranscriptionResultBase
[:map
[:method [:= :pdf-lib]]
[:page-methods {:optional true} [:map-of :int :keyword]]])]
Same for :ocr and :vision.
Step 2: Verify with comment-block examples
Update the (comment ...) block in the schema file to include a :page-methods example.
Step 3: Lint
Run: clj-kondo --lint src/com/getorcha/schema/ingestion.clj
Step 4: Commit
git add src/com/getorcha/schema/ingestion.clj
git commit -m "schema: add optional page-methods to TranscriptionResult"
End-to-end test using the actual Viessmann invoice to verify the full pipeline produces correct output.
Files:
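test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
Step 1: Add integration test
This test calls extract-text (the public API) with the Viessmann PDF and verifies that the output contains the previously truncated fields. It should NOT call Document AI (all 3 pages should pass the quality gate). Because the test lives in pdfbox_layout_test.clj, add [com.getorcha.workers.ingestion.transcription :as workers.transcription] to that namespace's :require so the workers.transcription alias resolves.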
(deftest test-integration-viessmann-invoice
(testing "Full pipeline produces complete text for Viessmann invoice"
(when-let [pdf-bytes @viessmann-pdf-bytes]
(let [context {:transcription {:pdf-lib {:min-chars-per-page 50}
:ocr {:provider :ocr}
:vision {}}
:llm-config {}
:worker-pools {}}
ingestion {:file {:contents pdf-bytes :mime-type "application/pdf"}
:document {}}
result (workers.transcription/extract-text context ingestion)]
;; Should use PDFBox for all pages
(is (= :pdf-lib (:method result)))
(is (= {1 :pdf-lib 2 :pdf-lib 3 :pdf-lib} (:page-methods result)))
;; Previously truncated fields should be complete
(is (string/includes? (:text result) "WP261041"))
(is (string/includes? (:text result) "Saeidiani"))
(is (string/includes? (:text result) "Schanzenweg"))
(is (string/includes? (:text result) "ATU80760339"))))))
Step 2: Run the test
Run: clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: PASS
Step 3: Commit
git add test/com/getorcha/workers/ingestion/transcription/pdfbox_layout_test.clj
git commit -m "test: integration test for Viessmann invoice with PDFBox pipeline"
Step 1: Run all transcription-related tests
clj -X:test:silent :nses '[com.getorcha.workers.ingestion.transcription-test com.getorcha.workers.ingestion.transcription.layout-test com.getorcha.workers.ingestion.transcription.pdfbox-layout-test]'
Expected: All PASS.
Step 2: Run full lint
clj-kondo --lint src test dev
Expected: No errors.
Step 3: Run the full test suite
clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Execution error|failed because|Ran .* tests)"
Expected: All tests pass, no regressions.
Step 4: Commit any fixes
If any issues are found, fix them and commit.