Document AI correctly identifies individual text elements with accurate bounding boxes, but our layout reconstruction (layout.clj) groups elements into rows using Y-midpoint proximity. On dense documents like receipts, the vertical gaps between different logical items (e.g., 0.001 normalized) are smaller than the grouping tolerance (~0.01), causing elements from adjacent items to bleed into each other's rows.
This produces garbled text where prices shift by one line item. The LLM extraction faithfully parses the garbled text, producing incorrect structured data that passes validation (addressed separately in the gross-total validation fix).
Example: IKEA receipt 019d2e52-e9c9-70e0-88f6-353601d75aa8 — 184 lines on page 1, Y-gaps as small as 0.0013 between items.
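Two complementary changes: an adaptive row-grouping tolerance (level 1) and a vision fallback for dense layouts (level 2).
Level 1 always runs, regardless of document source.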
Before row grouping, compute per-page Y-gap statistics from the positioned elements and derive a density ratio from them.
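The exact statistics are not spelled out here, so the following is only a minimal sketch under an assumed definition: the density ratio is taken to be the median positive gap between consecutive Y-midpoints divided by the median element height. The helpers median and y-mid are hypothetical, and elements are assumed to be {:x :y :width :height} maps in normalized coordinates.

```clojure
(defn median
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (nth sorted mid)
      (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2.0))))

(defn y-mid
  "Vertical midpoint of an element (hypothetical helper)."
  [{:keys [y height]}]
  (+ y (/ height 2.0)))

(defn page-density-ratio
  "ASSUMED definition: median positive gap between consecutive Y-midpoints
   divided by the median element height. Returns nil when the page has too
   few elements to produce a gap."
  [elements]
  (let [mids    (sort (map y-mid elements))
        gaps    (->> (partition 2 1 mids)
                     (map (fn [[a b]] (- b a)))
                     (filter pos?))
        heights (filter pos? (map :height elements))]
    (when (and (seq gaps) (seq heights))
      (/ (median gaps) (median heights)))))
```

Under this definition a page whose midpoints sit much closer together than the typical element height yields a small ratio. One known simplification: elements that share a row but have slightly different midpoints also contribute small gaps, so a real implementation may need to filter or cluster them first.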
When the density ratio falls below a threshold (0.7), switch same-row? to use tighter tolerance:
- Normal: 0.75 × max(anchor_h, element_h) (current behavior)
- Dense: 0.5 × min(anchor_h, element_h)

This prevents large elements (like Artikel 90455086, h=0.0151) from pulling small nearby elements into the wrong row when gaps are tiny.
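A minimal sketch of how the two tolerance formulas could be wired into same-row?. The :normal/:dense keywords, the row-tolerance and y-mid helpers, the argument order, and the midpoint comparison are assumptions for illustration, not the actual layout.clj code. Only the anchor height 0.0151 comes from the receipt above; the other coordinates are made up.

```clojure
(defn y-mid
  "Vertical midpoint of an element (hypothetical helper)."
  [{:keys [y height]}]
  (+ y (/ height 2.0)))

(defn row-tolerance
  "Maximum Y-midpoint distance for two elements to share a row.
   density-mode is assumed to be :normal or :dense."
  [density-mode anchor-h element-h]
  (case density-mode
    :normal (* 0.75 (max anchor-h element-h))   ; current behavior
    :dense  (* 0.5  (min anchor-h element-h)))) ; tighter, driven by the smaller element

(defn same-row?
  "True when element's Y-midpoint lies within the tolerance of anchor's."
  [density-mode anchor element]
  (<= (Math/abs (double (- (y-mid anchor) (y-mid element))))
      (row-tolerance density-mode (:height anchor) (:height element))))

;; A tall anchor (h=0.0151) next to a small element on the following item.
(def anchor {:y 0.100 :height 0.0151})
(def small  {:y 0.112 :height 0.008})

(same-row? :normal anchor small) ; => true  (tolerance 0.011325, distance 0.00845)
(same-row? :dense  anchor small) ; => false (tolerance 0.004)
```

Switching from max to min is what stops the tall anchor from dominating: in dense mode the tolerance can never exceed half the height of the smaller element.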
File: src/com/getorcha/workers/ap/ingestion/transcription/layout.clj
Changes:
- Add page-density-ratio, which computes the ratio from a seq of elements
- Update elements->structured-text to compute the density ratio and pass it through to group-into-rows
- Update same-row? to accept a density mode parameter and use the appropriate tolerance formula

Level 2 (vision fallback) only runs for non-PDFBox documents (pages where OCR is the sole transcription source).
After OCR, compute per-page Y-gap statistics from the Document AI response (bounding boxes are already available in raw-response). If any page has a density ratio below threshold, fall back to vision transcription for the whole document.
This is a safety net: the adaptive tolerance (level 1) fixes many cases, but vision models understand spatial layout natively and handle edge cases the tolerance heuristic might miss.
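The level 2 decision could be sketched as follows. This assumes the per-page density ratios have already been computed and that the caller has already restricted the check to OCR-only documents. The signature of needs-dense-layout-fallback? shown here, the choose-transcription function, and the :low-confidence? flag are hypothetical stand-ins for whatever ocr-with-vision-fallback actually uses.

```clojure
(def dense-threshold
  "Density ratio below which a page counts as dense (0.7, as in level 1)."
  0.7)

(defn needs-dense-layout-fallback?
  "True when any page's density ratio is below the threshold.
   page-ratios is a seq of per-page ratios; nil entries (pages with too
   few elements to measure) are ignored."
  [page-ratios]
  (boolean (some #(and (some? %) (< % dense-threshold)) page-ratios)))

(defn choose-transcription
  "Hypothetical decision step: fall back to vision for the whole document
   when either the low-confidence check or the dense-layout check fires."
  [{:keys [low-confidence? page-ratios]}]
  (if (or low-confidence? (needs-dense-layout-fallback? page-ratios))
    :vision
    :ocr))

(choose-transcription {:low-confidence? false :page-ratios [1.4 0.3]}) ; => :vision
(choose-transcription {:low-confidence? false :page-ratios [1.4 0.9]}) ; => :ocr
```

A single dense page is enough to send the whole document to vision, matching the per-document fallback described above.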
File: src/com/getorcha/workers/ap/ingestion/transcription.clj
Changes:
- Add needs-dense-layout-fallback?, which analyzes the raw Document AI response for gap statistics per page
- Update ocr-with-vision-fallback to check dense layout in addition to the existing low-confidence-ratio check
- [...] low-confidence-ratio)

The Document AI response already contains per-line bounding boxes (pages[].lines[].layout.boundingPoly.normalizedVertices). The ocr_layout.clj module already extracts these into {:x :y :width :height} elements. We compute gap statistics from these elements before layout reconstruction.
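For illustration, turning normalizedVertices into {:x :y :width :height} elements might look like the sketch below. It is not the ocr_layout.clj code. It assumes the response has been parsed into Clojure maps with keyword keys, and it defaults missing coordinates to 0.0 on the assumption that zero-valued fields may be omitted from the JSON. Both function names are hypothetical.

```clojure
(defn vertices->element
  "Axis-aligned bounding box of a seq of {:x :y} normalized vertices.
   Missing coordinates are treated as 0.0 (assumption, see above)."
  [vertices]
  (let [xs (map #(double (get % :x 0.0)) vertices)
        ys (map #(double (get % :y 0.0)) vertices)
        x0 (apply min xs)
        y0 (apply min ys)]
    {:x      x0
     :y      y0
     :width  (- (apply max xs) x0)
     :height (- (apply max ys) y0)}))

(defn page-elements
  "Elements for every line on a page that carries a bounding polygon."
  [page]
  (for [line  (:lines page)
        :let  [vs (get-in line [:layout :boundingPoly :normalizedVertices])]
        :when (seq vs)]
    (vertices->element vs)))
```

Because this works on a single parsed page, the same extraction can feed both the level 1 statistics and the level 2 check on the raw response.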
For level 2, we need to compute the statistics from the raw response directly (before layout reconstruction), since the decision to fall back happens before reconstruction. This can reuse the same extraction logic in ocr_layout.clj.
Tests:
- page-density-ratio with normal and dense element sets
- same-row? behavior in both density modes