Dense Layout Handling

Problem

Document AI correctly identifies individual text elements with accurate bounding boxes, but our layout reconstruction (layout.clj) groups elements into rows using Y-midpoint proximity. On dense documents like receipts, the vertical gaps between different logical items (e.g., about 0.001 in normalized page coordinates) are smaller than the grouping tolerance (~0.01), so elements from adjacent items bleed into each other's rows.

This produces garbled text where prices shift by one line item. The LLM extraction faithfully parses the garbled text, producing incorrect structured data that still passes validation (the validation gap is addressed separately in the gross-total validation fix).

Example: IKEA receipt 019d2e52-e9c9-70e0-88f6-353601d75aa8 — 184 lines on page 1, Y-gaps as small as 0.0013 between items.
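
To make the failure mode concrete, here is a minimal sketch of fixed-tolerance row grouping. It is not the actual code in layout.clj: group-rows, y-mid, and the sample elements are hypothetical, and the greedy "compare against the first element of the current row" strategy is an assumption made for illustration. Only the ~0.01 tolerance and the 0.0013 gap come from the description above.

```clojure
(defn y-mid
  "Vertical midpoint of an element with normalized :y (top edge) and :height."
  [{:keys [y height]}]
  (+ y (/ height 2.0)))

(defn group-rows
  "Illustrative greedy grouping: walk elements top to bottom and start a new
  row when the Y-midpoint is more than `tolerance` away from the midpoint of
  the first element in the current row."
  [tolerance elements]
  (reduce (fn [rows el]
            (let [row (peek rows)]
              (if (and row
                       (<= (Math/abs (double (- (y-mid el) (y-mid (first row)))))
                           tolerance))
                (conj (pop rows) (conj row el))
                (conj rows [el]))))
          []
          (sort-by y-mid elements)))

;; Two receipt items separated by a 0.0013 gap (hypothetical sample data).
(def receipt-elements
  [{:text "Item A" :x 0.05 :y 0.1000 :height 0.008}
   {:text "4.99"   :x 0.80 :y 0.1000 :height 0.008}
   {:text "Item B" :x 0.05 :y 0.1093 :height 0.008}
   {:text "12.00"  :x 0.80 :y 0.1093 :height 0.008}])

(defn row-texts [tolerance]
  (mapv #(mapv :text %) (group-rows tolerance receipt-elements)))

(row-texts 0.01)  ;; => [["Item A" "4.99" "Item B" "12.00"]]  (rows merged)
(row-texts 0.004) ;; => [["Item A" "4.99"] ["Item B" "12.00"]]
```

With the fixed 0.01 tolerance, the midpoints of the two items are about 0.0093 apart and collapse into one row, which is the bleed described above.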

Solution

Two complementary changes:

1. Adaptive Tolerance in Layout Reconstruction

Always runs, regardless of document source.

Before row grouping, compute per-page statistics from the positioned elements:
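
A minimal sketch of what such per-page statistics could look like follows. The exact definition of the density ratio is not spelled out here, so this sketch assumes it is the median positive Y-gap between vertically consecutive elements divided by the median element height. The names page-stats and median are hypothetical, and elements are assumed to be maps with normalized :y (top edge) and :height.

```clojure
(defn median
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (sorted mid)
      (/ (+ (sorted (dec mid)) (sorted mid)) 2.0))))

(defn page-stats
  "Returns {:median-height :median-gap :density-ratio} for one page, or nil
  when the page has too few elements or no positive gaps to measure.
  ASSUMPTION: density ratio = median gap / median height."
  [elements]
  (when (>= (count elements) 2)
    (let [sorted  (sort-by :y elements)
          heights (map :height sorted)
          ;; Gap = next element's top minus this element's bottom. Elements
          ;; that overlap vertically (same visual row) yield a non-positive
          ;; gap and are skipped.
          gaps    (->> (partition 2 1 sorted)
                       (map (fn [[a b]] (- (:y b) (+ (:y a) (:height a)))))
                       (filter pos?))]
      (when (seq gaps)
        (let [median-height (median heights)
              median-gap    (median gaps)]
          {:median-height median-height
           :median-gap    median-gap
           :density-ratio (/ median-gap median-height)})))))

;; Three lines of height 0.008 spaced 0.01 apart: gaps of ~0.002, ratio ~0.25.
(page-stats [{:y 0.10 :height 0.008}
             {:y 0.11 :height 0.008}
             {:y 0.12 :height 0.008}])
```

Under this assumed definition, a page whose lines are separated by roughly one line height scores near 1.0, while a receipt with 0.0013 gaps between 0.008-high lines scores well below the threshold.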

When the density ratio falls below a threshold (0.7), switch same-row? to use tighter tolerance:

This prevents large elements (like Artikel 90455086, h=0.0151) from pulling small nearby elements into the wrong row when gaps are tiny.
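
The tolerance switch could be sketched as below. The actual tightened tolerance is not specified in this document, so the rule used here (half the height of the smaller of the two elements) is purely an assumption, chosen because it keeps a tall element from widening the match window. The 3-arity same-row? signature and the stats map with :density-ratio are also hypothetical. The 0.7 threshold and ~0.01 default come from the text.

```clojure
(def default-tolerance 0.01) ; fixed grouping tolerance mentioned above
(def dense-threshold 0.7)    ; density ratio below this counts as dense

(defn y-mid
  "Vertical midpoint of an element with normalized :y (top edge) and :height."
  [{:keys [y height]}]
  (+ y (/ height 2.0)))

(defn row-tolerance
  "Default tolerance on normal pages. On dense pages, ASSUMED to be half the
  height of the smaller element so a tall neighbour cannot pull it in."
  [density-ratio a b]
  (if (and density-ratio (< density-ratio dense-threshold))
    (* 0.5 (min (:height a) (:height b)))
    default-tolerance))

(defn same-row?
  "True when the Y-midpoints of a and b are within the page's tolerance."
  [{:keys [density-ratio]} a b]
  (<= (Math/abs (double (- (y-mid a) (y-mid b))))
      (row-tolerance density-ratio a b)))

;; A tall element next to a small one whose midpoint is ~0.006 lower.
(def tall  {:y 0.1000 :height 0.0151})
(def small {:y 0.1105 :height 0.0060})

(same-row? {:density-ratio 1.0} tall small) ;; => true  (default tolerance)
(same-row? {:density-ratio 0.2} tall small) ;; => false (tight tolerance)
```

Missing statistics (a nil density ratio) fall through to the default tolerance, so pages with too few elements behave as they do today.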

File: src/com/getorcha/workers/ap/ingestion/transcription/layout.clj

Changes:

2. Vision Fallback for Dense OCR Pages

Only runs for non-PDFBox documents (pages where OCR is the sole transcription source).

After OCR, compute per-page Y-gap statistics from the Document AI response (bounding boxes are already available in raw-response). If any page has a density ratio below the threshold, fall back to vision transcription for the whole document.
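
The decision itself is small. A sketch, assuming each page has already been reduced to a stats map with a :density-ratio key (or nil when nothing could be measured), and assuming the same 0.7 threshold as in change 1; choose-transcription, dense-page?, and the :vision / :ocr return values are hypothetical names:

```clojure
(def dense-threshold 0.7) ; ASSUMED to match the threshold used in change 1

(defn dense-page?
  "True when a page's stats report a density ratio below the threshold.
  Pages without stats (nil) are treated as not dense."
  [{:keys [density-ratio]}]
  (boolean (and density-ratio (< density-ratio dense-threshold))))

(defn choose-transcription
  "Returns :vision when any page is dense, otherwise :ocr. One dense page is
  enough to switch the whole document."
  [per-page-stats]
  (if (some dense-page? per-page-stats) :vision :ocr))

(choose-transcription [{:density-ratio 1.1} {:density-ratio 0.2}]) ;; => :vision
(choose-transcription [{:density-ratio 1.1} nil])                  ;; => :ocr
```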

This is a safety net: the adaptive tolerance (level 1) fixes many cases, but vision models understand spatial layout natively and handle edge cases the tolerance heuristic might miss.

File: src/com/getorcha/workers/ap/ingestion/transcription.clj

Changes:

Where the Gap Statistics Come From

The Document AI response already contains per-line bounding boxes (pages[].lines[].layout.boundingPoly.normalizedVertices). The ocr_layout.clj module already extracts these into {:x :y :width :height} elements. We compute gap statistics from these elements before layout reconstruction.

For level 2, we need to compute the statistics from the raw response directly (before layout reconstruction), since the decision to fall back happens before reconstruction. This can reuse the same extraction logic in ocr_layout.clj.
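
For illustration, the extraction from the raw response could look roughly like the sketch below. This is not the code in ocr_layout.clj: the function names are hypothetical, and it assumes raw-response is the Document AI JSON parsed into Clojure maps with keyword keys. Vertices with a missing :x or :y are treated as 0.0, on the assumption that zero-valued coordinates may be omitted from the JSON.

```clojure
(defn vertices->element
  "Converts a seq of normalized vertices ({:x .. :y ..}) into an
  {:x :y :width :height} bounding box, or nil when there are no vertices."
  [vertices]
  (when (seq vertices)
    (let [xs (map #(:x % 0.0) vertices)
          ys (map #(:y % 0.0) vertices)
          x0 (apply min xs)
          y0 (apply min ys)]
      {:x      x0
       :y      y0
       :width  (- (apply max xs) x0)
       :height (- (apply max ys) y0)})))

(defn page-elements
  "Bounding-box elements for every line on one page, skipping lines that
  carry no bounding polygon."
  [page]
  (->> (:lines page)
       (keep #(vertices->element
               (get-in % [:layout :boundingPoly :normalizedVertices])))
       vec))

(defn raw-response->elements-by-page
  "One vector of elements per page, in page order."
  [raw-response]
  (mapv page-elements (:pages raw-response)))

;; Hypothetical single-line response.
(raw-response->elements-by-page
 {:pages [{:lines [{:layout {:boundingPoly
                             {:normalizedVertices [{:x 0.25 :y 0.25}
                                                   {:x 0.75 :y 0.25}
                                                   {:x 0.75 :y 0.5}
                                                   {:x 0.25 :y 0.5}]}}}]}]})
;; => [[{:x 0.25, :y 0.25, :width 0.5, :height 0.25}]]
```

The resulting per-page element vectors have the same shape the statistics step expects, so the same gap computation can serve both changes.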

What's NOT in Scope

Testing