Ideas for future enhancements to the ingestion pipeline.
Problem: The LLM cannot determine whether an invoice is "incoming" (Accounts Payable) or "outgoing" (Accounts Receivable) without knowing who the user/tenant is. The field is currently removed from extraction.
Solution: Post-extraction step that matches issuer and recipient against the tenant's registered legal entities:
(defn determine-invoice-type [tenant-id structured-data]
  (let [legal-entities  (db/get-legal-entities tenant-id)
        issuer-name     (get-in structured-data [:issuer :name])
        recipient-name  (get-in structured-data [:recipient :name])
        issuer-match    (some #(fuzzy-name-match? (:name %) issuer-name) legal-entities)
        recipient-match (some #(fuzzy-name-match? (:name %) recipient-name) legal-entities)]
    (cond
      issuer-match    :outgoing ; We are the issuer, sending invoice
      recipient-match :incoming ; We are the recipient, receiving invoice
      :else           :unknown)))
Prerequisite: Tenant authentication and legal entity management must be implemented first.
When to implement: After authentication is complete and tenants can register their legal entities.
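determine-invoice-type relies on fuzzy-name-match?, which is not defined in these notes. A minimal sketch of what it might look like, assuming that comparing normalized names (case, punctuation, and legal-form suffixes ignored) is fuzzy enough to start with. The legal-suffixes set and the normalize-name helper are hypothetical, and a real implementation may want an edit-distance comparison instead.

```clojure
(require '[clojure.string :as str])

;; Hypothetical list of legal-form suffixes to ignore; extend as needed.
(def legal-suffixes #{"gmbh" "ag" "kg" "ug" "ltd" "inc" "llc" "co"})

(defn normalize-name
  "Lowercases, splits on anything that is not a letter or digit,
  and drops legal-form suffixes."
  [s]
  (->> (str/split (str/lower-case (or s "")) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       (remove legal-suffixes)
       (str/join " ")))

(defn fuzzy-name-match?
  "True when both names normalize to the same non-empty string."
  [a b]
  (let [na (normalize-name a)
        nb (normalize-name b)]
    (and (not (str/blank? na)) (= na nb))))

(fuzzy-name-match? "Acme GmbH" "ACME gmbh.") ; same name after normalization
```

Requiring a non-empty normalized name keeps a missing issuer or recipient (nil) from matching every legal entity.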
Problem: When invoices have complex multi-column tables, the LLM may struggle to associate line item descriptions with their corresponding prices, since Document AI outputs text in reading order without explicit table structure.
Solution: Preprocess the OCR output to reconstruct table rows by grouping text blocks with similar Y-coordinates. This creates explicit row associations:
ROW:
[left] 1 1 St SOCIAL AREA 01
[right] 24.860,00
ROW:
[left] 2 1 St SOCIAL AREA 02
[right] 15.870,00
When to implement: If extraction accuracy drops on invoices with dense tables where prices are misattributed to the wrong line items.
Token cost: ~1.5x vs plain text.
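The row grouping described above could be sketched as follows. This assumes each text block has already been reduced to a map with :x, :y, and :text keys, which is a hypothetical shape and not Document AI's actual output. The tolerance value is likewise an assumption to be tuned per document resolution.

```clojure
(defn group-into-rows
  "Groups text blocks into rows. A block joins the current row when its :y
  is within `tolerance` of the first block in that row. Rows are returned
  top to bottom, each sorted left to right by :x."
  [tolerance blocks]
  (->> (sort-by :y blocks)
       (reduce (fn [rows block]
                 (let [row (peek rows)]
                   (if (and row
                            (<= (- (:y block) (:y (first row))) tolerance))
                     (conj (pop rows) (conj row block))
                     (conj rows [block]))))
               [])
       (mapv #(vec (sort-by :x %)))))

;; Illustrative coordinates only.
(def sample-blocks
  [{:x 400 :y 101 :text "24.860,00"}
   {:x 10  :y 100 :text "1 1 St SOCIAL AREA 01"}
   {:x 10  :y 130 :text "2 1 St SOCIAL AREA 02"}
   {:x 400 :y 131 :text "15.870,00"}])

(group-into-rows 5 sample-blocks)
```

Comparing against the first block of the row, not the previous block, prevents a slowly drifting sequence of blocks from being chained into one overly tall row.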
Status: Token quality stats are now captured in OCR results (:ocr-token-quality-stats in TranscriptionResult).
Problem: Page-level OCR quality scores (0.95) can mask localized errors. Individual tokens may have low confidence even when overall page quality appears good. Analysis of invoice 04 showed:
Current Implementation: The OCR worker now extracts token quality statistics:
{:token-count 316
 :low-confidence-count 21
 :low-confidence-ratio 0.066 ; 6.6% problematic
 :mean-confidence 0.937
 :min-confidence 0.37}
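For reference, statistics of this shape could be computed from per-token confidences roughly as follows. This is a sketch, not the OCR worker's actual code: the function name token-quality-stats and the threshold argument are assumptions, since the notes do not state which cutoff counts as low confidence.

```clojure
(defn token-quality-stats
  "Summarizes a collection of per-token confidence values (0.0 to 1.0).
  Tokens strictly below `threshold` count as low confidence."
  [confidences threshold]
  (let [n   (count confidences)
        low (count (filter #(< % threshold) confidences))]
    {:token-count          n
     :low-confidence-count low
     :low-confidence-ratio (if (pos? n) (double (/ low n)) 0.0)
     :mean-confidence      (if (pos? n) (/ (reduce + 0.0 confidences) n) 0.0)
     :min-confidence       (when (pos? n) (apply min confidences))}))

(token-quality-stats [1.0 0.5 0.25 0.25] 0.5)
```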
Idea: Mark low-confidence tokens in the text sent to the LLM, helping it understand where OCR may have failed:
Input: "Conathan Schilling"
Output: "[?Conathan?] Schilling"
The LLM could then:
Implementation approach:
Extend layout->element in ocr/layout.clj to include confidence.
Update row->text to optionally wrap low-confidence tokens.
Add options :annotate-low-confidence? and :low-confidence-threshold.
When to implement: When extraction errors from OCR quality become a significant issue and the token quality stats show a high low-confidence-ratio.
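The marking step could look roughly like this. It is a standalone sketch, not the real row->text: it assumes tokens arrive as maps with :text and :confidence keys, and the default threshold of 0.7 is a placeholder. The function name annotate-tokens is hypothetical.

```clojure
(require '[clojure.string :as str])

(defn annotate-tokens
  "Joins token texts with spaces, wrapping tokens whose confidence is
  below the threshold in [?...?] when annotation is enabled."
  [tokens {:keys [annotate-low-confidence? low-confidence-threshold]
           :or   {annotate-low-confidence? true
                  low-confidence-threshold 0.7}}]
  (->> tokens
       (map (fn [{:keys [text confidence]}]
              (if (and annotate-low-confidence?
                       (< confidence low-confidence-threshold))
                (str "[?" text "?]")
                text)))
       (str/join " ")))

;; Confidence values here are illustrative.
(annotate-tokens [{:text "Conathan" :confidence 0.4}
                  {:text "Schilling" :confidence 0.98}]
                 {})
```

Keeping the annotation behind a flag makes it possible to compare extraction accuracy with and without the markers before enabling them everywhere.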
Idea: Current layout reconstruction uses "lines" from Document AI. Using "tokens" instead would provide:
Trade-off: More data to process, potentially different row grouping behavior.
When to implement: After evaluating whether line-based reconstruction is sufficient for production use cases.
Status: Stub endpoint exists at POST /document/new/gmail
Implementation:
Call queue-for-ingestion! for each PDF/image attachment.
Considerations:
Problem: Online Document AI processing handles one document at a time. Bulk uploads (e.g., a monthly batch from a supplier) could be processed more efficiently.
Solution: Use Document AI's batch API, which processes up to 50 documents per request with 5 concurrent batches.
When to implement: When bulk upload becomes a common use case.
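Splitting a bulk upload into request-sized groups is the only part that can be sketched without touching the Document AI client. The example below takes the batch size from the figure quoted above and keeps it as a parameter in case the limit differs. The function name batch-documents is hypothetical, and submitting each batch is left out.

```clojure
(defn batch-documents
  "Splits a collection of documents into batches of at most `batch-size`."
  ([docs] (batch-documents 50 docs))
  ([batch-size docs] (partition-all batch-size docs)))

;; 120 documents become batches of 50, 50, and 20.
(map count (batch-documents (range 120)))
```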
Problem: Document AI has quotas (600 requests/min, 120 pages/min). Bursts of uploads could hit rate limits.
Solution: Implement a semaphore or token bucket in front of OCR calls to smooth out bursts.
When to implement: When rate limit errors appear in production logs.
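A minimal sketch of the semaphore variant, using java.util.concurrent.Semaphore. Note that a semaphore caps the number of OCR calls in flight, not the number of requests per minute, so it only approximates a rate limit. A token bucket would be needed to enforce the quota precisely. The permit count of 4 and the name with-permit are assumptions.

```clojure
(import 'java.util.concurrent.Semaphore)

(defn with-permit
  "Runs the zero-argument function f while holding one permit from sem.
  Blocks until a permit is free and always releases it afterwards."
  [^Semaphore sem f]
  (.acquire sem)
  (try
    (f)
    (finally
      (.release sem))))

;; Hypothetical limit: at most 4 OCR calls in flight at once.
(def ocr-permits (Semaphore. 4))

;; The function passed here stands in for the real OCR call.
(with-permit ocr-permits (fn [] :ocr-result))
```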
Problem: Currently, failed messages retry only after the visibility timeout expires (a gap of about 5 minutes). Rate-limited responses (429s) would benefit from longer backoff.
Solution: On failure, use ChangeMessageVisibility to set visibility timeout to the backoff delay (30s, 60s, 120s...) instead of deleting the message. It reappears after the delay with ApproximateReceiveCount already incremented.
When to implement: When 429 errors become frequent in production.
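The delay schedule can be derived from ApproximateReceiveCount with a small pure function. This sketch only computes the number of seconds and leaves the actual ChangeMessageVisibility call to whichever SQS client the pipeline uses. The function name and option keys are hypothetical. The cap of 43200 seconds reflects the 12-hour maximum visibility timeout that SQS allows.

```clojure
(defn backoff-seconds
  "Returns the visibility timeout for a message that has been received
  `receive-count` times (1-based): base, 2x base, 4x base, and so on,
  capped at max-seconds."
  ([receive-count] (backoff-seconds receive-count {}))
  ([receive-count {:keys [base-seconds max-seconds]
                   :or   {base-seconds 30 max-seconds 43200}}]
   (let [exponent (min 20 (max 0 (dec receive-count)))]
     (min max-seconds (* base-seconds (bit-shift-left 1 exponent))))))

(map backoff-seconds [1 2 3]) ; the 30s, 60s, 120s schedule
```

Clamping the exponent keeps the multiplication well within long range even if a message has been received an unexpectedly large number of times.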
Problem: Processed documents accumulate in S3 indefinitely.
Solution: Configure S3 lifecycle rules to archive documents to Glacier after a retention period (e.g., 90 days post-ingestion).
When to implement: When storage costs become significant or compliance requires archival.
Problem: Different document types (invoices vs purchase orders vs contracts) may need different LLM prompts or validation rules.
Solution: Create a registry of pipeline configurations keyed by document type, with type-specific prompts and field validation.
When to implement: When supporting document types beyond invoices.
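One possible shape for such a registry is a plain map keyed by document type. Everything in this sketch is illustrative: the prompts, the required-field lists, and the function names pipeline-for and missing-fields are assumptions, not existing code.

```clojure
;; Hypothetical registry: one entry per supported document type.
(def pipeline-registry
  {:invoice        {:prompt          "Extract issuer, recipient, line items, and totals."
                    :required-fields [:issuer :recipient :total]}
   :purchase-order {:prompt          "Extract buyer, supplier, and ordered items."
                    :required-fields [:buyer :supplier :order-number]}})

(defn pipeline-for
  "Looks up the configuration for a document type, failing loudly
  when the type has not been registered."
  [doc-type]
  (or (get pipeline-registry doc-type)
      (throw (ex-info "No pipeline registered for document type"
                      {:document-type doc-type}))))

(defn missing-fields
  "Returns the required fields absent from the extracted data."
  [doc-type data]
  (remove #(contains? data %) (:required-fields (pipeline-for doc-type))))

(missing-fields :invoice {:issuer {:name "Acme"} :recipient {:name "Other"}})
```

Throwing on an unregistered type surfaces a missing configuration immediately, so a new document type cannot silently fall back to the invoice prompt.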
Problem: Single poller throughput may become a bottleneck under high load.
Solution: Run multiple poller threads. SQS handles concurrent receives safely.
When to implement: When the poller becomes the throughput bottleneck (monitor queue depth vs processing rate).
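Running several pollers could be as simple as starting the existing poll loop on multiple threads. In this sketch, poll-once! stands in for whatever function receives and processes one batch of messages. It is a hypothetical name, as is start-pollers, and error handling inside the loop is omitted.

```clojure
(defn start-pollers
  "Starts n worker threads that call poll-once! in a loop.
  Returns a function that stops the workers and waits for them to finish."
  [n poll-once!]
  (let [running? (atom true)
        workers  (doall (repeatedly n #(future (while @running? (poll-once!)))))]
    (fn stop! []
      (reset! running? false)
      (run! deref workers))))
```

A caller would keep the returned function and invoke it on shutdown, for example (def stop-pollers! (start-pollers 4 poll-once!)) followed later by (stop-pollers!). An exception thrown by poll-once! ends that worker, so a production version would want a try/catch inside the loop.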