Dense Layout Handling

Problem

Document AI correctly identifies individual text elements with accurate bounding boxes, but our layout reconstruction (layout.clj) groups elements into rows using Y-midpoint proximity. On dense documents like receipts, the vertical gaps between different logical items (around 0.001 in normalized page coordinates) are an order of magnitude smaller than the grouping tolerance (~0.01), so elements from adjacent items bleed into each other's rows.

This produces garbled text where prices shift by one line item. The LLM extraction faithfully parses the garbled text, producing incorrect structured data that passes validation (addressed separately in the gross-total validation fix).

Example: IKEA receipt 019d2e52-e9c9-70e0-88f6-353601d75aa8 — 184 lines on page 1, Y-gaps as small as 0.0013 between items.

Solution

Two complementary changes:

1. Adaptive Tolerance in Layout Reconstruction

Always runs, regardless of document source.

Before row grouping, compute per-page statistics from the positioned elements:

Median element height
Median Y-gap between consecutive elements (sorted by Y)
Density ratio = median_gap / median_height
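A minimal sketch of the statistics above, not the final implementation. It assumes elements are maps with :y (top edge) and :height in normalized coordinates, and that the "Y-gap" is the difference between consecutive Y-midpoints. Both are assumptions, and the helper names median and y-mid are hypothetical.

```clojure
(defn median
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (nth sorted mid)
      (/ (+ (nth sorted (dec mid)) (nth sorted mid)) 2.0))))

(defn y-mid
  "Vertical midpoint of an element; assumes :y is the top edge."
  [{:keys [y height]}]
  (+ y (/ height 2.0)))

(defn page-density-ratio
  "Median Y-gap divided by median element height for one page.
  Returns nil when there are fewer than two elements or the median
  height is not positive, so callers can treat the page as not dense."
  [elements]
  (when (>= (count elements) 2)
    (let [mids  (sort (map y-mid elements))
          gaps  (map - (rest mids) mids)
          med-h (median (map :height elements))]
      (when (pos? med-h)
        (/ (median gaps) med-h)))))
```

Returning nil for degenerate pages keeps the "is this page dense?" decision with the caller instead of baking in a default ratio.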

When the density ratio falls below a threshold (0.7), switch same-row? to a tighter tolerance:

Normal: 0.75 × max(anchor_h, element_h) (current behavior)
Dense: 0.5 × min(anchor_h, element_h)

This prevents large elements (like Artikel 90455086, h=0.0151) from pulling small nearby elements into the wrong row when gaps are tiny.
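The two tolerance formulas can be sketched as follows. This is an illustration under assumptions: the argument order of same-row?, the :normal / :dense keywords, and the helper names are hypothetical, and the comparison is assumed to be on Y-midpoints as described in the Problem section.

```clojure
(def dense-threshold 0.7)

(defn density-mode
  "Maps a density ratio (possibly nil) to :dense or :normal."
  [ratio]
  (if (and ratio (< ratio dense-threshold)) :dense :normal))

(defn row-tolerance
  "Maximum Y-midpoint distance for two elements to share a row."
  [mode anchor-h element-h]
  (case mode
    :dense  (* 0.5 (min anchor-h element-h))
    :normal (* 0.75 (max anchor-h element-h))))

(defn y-mid
  "Vertical midpoint of an element; assumes :y is the top edge."
  [{:keys [y height]}]
  (+ y (/ height 2.0)))

(defn same-row?
  "True when the element's midpoint lies within tolerance of the anchor's."
  [mode anchor element]
  (let [tol  (row-tolerance mode (:height anchor) (:height element))
        dist (Math/abs (double (- (y-mid anchor) (y-mid element))))]
    (<= dist tol)))
```

As a worked case, take an anchor with h=0.0151 and a hypothetical neighbor with h=0.008 whose midpoint is 0.009 away. The normal tolerance is 0.75 × 0.0151 ≈ 0.0113, so the neighbor joins the anchor's row. The dense tolerance is 0.5 × 0.008 = 0.004, so it does not.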

File: src/com/getorcha/workers/ap/ingestion/transcription/layout.clj

Changes:

New function page-density-ratio that computes the ratio from a seq of elements
Modify elements->structured-text to compute density ratio and pass it through to group-into-rows
Modify same-row? to accept a density mode parameter and use the appropriate tolerance formula
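One possible way to thread the density mode through row grouping is sketched below. The greedy strategy (sort by :y, compare each element against the first element of the current row) and the argument order are assumptions for illustration and may not match the existing group-into-rows. The row predicate is passed in so the sketch stays independent of the tolerance formula.

```clojure
(defn group-into-rows
  "Groups elements into rows, top to bottom.
  `same-row?` is a predicate of [mode anchor element]; the anchor is
  assumed to be the first element of the row currently being built."
  [same-row? mode elements]
  (reduce (fn [rows el]
            (let [row (peek rows)]
              (if (and row (same-row? mode (first row) el))
                ;; Extend the current row.
                (conj (pop rows) (conj row el))
                ;; Start a new row anchored at this element.
                (conj rows [el]))))
          []
          (sort-by :y elements)))
```

Under this shape, elements->structured-text would compute the mode once per page and pass it down, so same-row? never has to recompute page statistics per comparison.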

2. Vision Fallback for Dense OCR Pages

Only runs for non-PDFBox documents (those where OCR is the sole transcription source).

After OCR, compute per-page Y-gap statistics from the Document AI response (bounding boxes are already available in raw-response). If any page has a density ratio below the threshold, fall back to vision transcription for the whole document.

This is a safety net: the adaptive tolerance (level 1) fixes many cases, but vision models understand spatial layout natively and handle edge cases the tolerance heuristic might miss.

File: src/com/getorcha/workers/ap/ingestion/transcription.clj

Changes:

New function needs-dense-layout-fallback? that analyzes the raw Document AI response for gap statistics per page
Modify ocr-with-vision-fallback to check for dense layout in addition to the existing low-confidence-ratio check
Same fallback pattern: try vision, fall back to OCR result on failure
Config: density ratio threshold in transcription config (alongside existing low-confidence-ratio)
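A rough sketch of the combined fallback decision follows. The config key :dense-layout-ratio, the 0.3 low-confidence value, the shape of the OCR result map, and the zero-argument vision-fn are all hypothetical. Only the 0.7 density threshold and the "try vision, keep the OCR result on failure" pattern come from this document.

```clojure
(def transcription-config
  {:low-confidence-ratio 0.3    ; hypothetical value
   :dense-layout-ratio   0.7})  ; density threshold from this design

(defn should-fall-back?
  "True when OCR confidence is too low or any page is dense.
  nil page ratios (pages with too few elements) are ignored."
  [config {:keys [low-confidence-ratio page-density-ratios]}]
  (or (> low-confidence-ratio (:low-confidence-ratio config))
      (boolean (some #(< % (:dense-layout-ratio config))
                     (remove nil? page-density-ratios)))))

(defn ocr-with-vision-fallback
  "Returns the vision transcription when a fallback is warranted and
  vision succeeds; otherwise returns the OCR result unchanged."
  [config ocr-result vision-fn]
  (if (should-fall-back? config ocr-result)
    (try
      (vision-fn)
      (catch Exception _
        ;; Vision failed: keep the OCR transcription.
        ocr-result))
    ocr-result))
```

Keeping both triggers in one predicate means the vision call and its failure handling exist in exactly one place.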

Where the Gap Statistics Come From

The Document AI response already contains per-line bounding boxes (pages[].lines[].layout.boundingPoly.normalizedVertices). The ocr_layout.clj module already extracts these into {:x :y :width :height} elements. We compute gap statistics from these elements before layout reconstruction.

For level 2, we need to compute the statistics from the raw response directly (before layout reconstruction), since the decision to fall back happens before reconstruction. This can reuse the same extraction logic in ocr_layout.clj.
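As an illustration of the raw-response path, the sketch below turns normalizedVertices into {:x :y :width :height} elements and checks each page. It assumes the response has been parsed into maps with keyword keys and that a missing :x or :y on a vertex means 0.0. The density function is passed in as an argument, and the signatures shown are hypothetical rather than those in ocr_layout.clj.

```clojure
(defn vertices->element
  "Axis-aligned bounding box of a line's normalized vertices."
  [vertices]
  (let [xs (map #(double (:x % 0.0)) vertices)
        ys (map #(double (:y % 0.0)) vertices)
        x0 (apply min xs)
        y0 (apply min ys)]
    {:x      x0
     :y      y0
     :width  (- (apply max xs) x0)
     :height (- (apply max ys) y0)}))

(defn page-elements
  "Elements for every line on a page that has a bounding polygon."
  [page]
  (for [line  (:lines page)
        :let  [vs (get-in line [:layout :boundingPoly :normalizedVertices])]
        :when (seq vs)]
    (vertices->element vs)))

(defn needs-dense-layout-fallback?
  "True when any page's density ratio is below `threshold`.
  `density-ratio-fn` takes a seq of elements and may return nil."
  [density-ratio-fn threshold raw-response]
  (boolean
   (some (fn [page]
           (when-let [ratio (density-ratio-fn (page-elements page))]
             (< ratio threshold)))
         (:pages raw-response))))
```

Using the min and max of the vertices rather than a fixed vertex order keeps the box correct for slightly rotated lines.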

What's NOT in Scope

Image preprocessing (CLAHE, denoising, deskew from the cc-reconciliation spike) — addresses image quality, not layout interpretation. Separate concern.
Per-page vision fallback — when triggered, vision re-transcribes the whole document. The per-page extraction infrastructure exists but adds complexity for marginal benefit on 1-2 page receipts.
Document AI table detection — tested on the IKEA receipt, returns 0 tables. Not a reliable signal.

Testing

Unit tests for page-density-ratio with normal and dense element sets
Unit tests for same-row? behavior in both density modes
Integration: re-ingest the IKEA receipt locally and verify line items match the PDF