Potential Improvements

Ideas for future enhancements to the ingestion pipeline.


Invoice Type Detection (High Priority)

Problem: The LLM cannot determine whether an invoice is "incoming" (Accounts Payable) or "outgoing" (Accounts Receivable) without knowing who the user/tenant is. The invoice type has therefore been removed from extraction for now.

Solution: Post-extraction step that matches issuer and recipient against the tenant's registered legal entities:

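(defn determine-invoice-type
  "Classifies an invoice as :outgoing, :incoming, or :unknown by matching the
  issuer and recipient names against the tenant's registered legal entities."
  [tenant-id structured-data]
  (let [legal-entities  (db/get-legal-entities tenant-id)
        issuer-name     (get-in structured-data [:issuer :name])
        recipient-name  (get-in structured-data [:recipient :name])
        ;; `some` returns the first truthy match, or nil when nothing matches
        issuer-match    (some #(fuzzy-name-match? (:name %) issuer-name) legal-entities)
        recipient-match (some #(fuzzy-name-match? (:name %) recipient-name) legal-entities)]
    ;; If both sides match (e.g., an intra-group invoice), the issuer check wins
    (cond
      issuer-match    :outgoing  ; We are the issuer, sending the invoice
      recipient-match :incoming  ; We are the recipient, receiving the invoice
      :else           :unknown)))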
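
The function above calls fuzzy-name-match?, which is not defined in this document. As a rough sketch only, one conservative way to approach it is to normalize both names (case, punctuation, legal-form suffixes) and compare the results. The suffix list and the normalize-name helper are assumptions made for illustration, not existing code; a real implementation might add an edit-distance threshold on top.

```clojure
(require '[clojure.string :as str])

(def legal-suffixes
  "Hypothetical, incomplete set of legal-form suffixes ignored during comparison."
  #{"gmbh" "ag" "kg" "ug" "inc" "llc" "ltd" "sa" "bv"})

(defn normalize-name
  "Lower-cases the name, splits on non-alphanumeric characters, and drops
  legal-form suffixes. Returns an empty string for nil or blank input."
  [s]
  (->> (str/split (str/lower-case (or s "")) #"[^\p{L}\p{N}]+")
       (remove str/blank?)
       (remove legal-suffixes)
       (str/join " ")))

(defn fuzzy-name-match?
  "True when both names normalize to the same non-blank string."
  [a b]
  (let [na (normalize-name a)
        nb (normalize-name b)]
    (and (not (str/blank? na))
         (= na nb))))
```

Returning false for blank names matters here: a document with a missing issuer or recipient name should fall through to :unknown rather than match an entity by accident.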

Prerequisite: Tenant authentication and legal entity management must be implemented first.

When to implement: After authentication is complete and tenants can register their legal entities.


OCR Layout Preprocessing for Complex Tables

Problem: When invoices have complex multi-column tables, the LLM may struggle to associate line item descriptions with their corresponding prices, since Document AI outputs text in reading order without explicit table structure.

Solution: Preprocess the OCR output to reconstruct table rows by grouping text blocks with similar Y-coordinates. This creates explicit row associations:

ROW:
  [left]  1 1 St  SOCIAL AREA 01
  [right] 24.860,00
ROW:
  [left]  2 1 St  SOCIAL AREA 02
  [right] 15.870,00

When to implement: If extraction accuracy drops on invoices with dense tables where prices are misattributed to wrong line items.

Token cost: roughly 1.5x the token count of the plain-text representation.
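
The Y-coordinate grouping could be sketched roughly as follows. The block shape (maps with :x, :y, and :text, where :y might be the vertical midpoint of the bounding box) and the tolerance parameter are assumptions for illustration, not the structure actually returned by the OCR worker.

```clojure
(defn group-into-rows
  "Groups text blocks into rows. A block joins the current row when its :y
  is within `tolerance` of the first block in that row; otherwise it starts
  a new row. Blocks inside each row are ordered left to right by :x."
  [tolerance blocks]
  (->> (sort-by :y blocks)
       (reduce (fn [rows block]
                 (let [row (peek rows)]
                   (if (and row
                            (<= (- (:y block) (:y (first row))) tolerance))
                     (conj (pop rows) (conj row block))
                     (conj rows [block]))))
               [])
       (mapv #(vec (sort-by :x %)))))
```

Comparing against the first block of the row (instead of the previous block) keeps a slowly drifting sequence of blocks from being chained into one oversized row.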


Token-Level OCR Quality Enhancements

Status: Token quality stats are now captured in OCR results (:ocr-token-quality-stats in TranscriptionResult).

Problem: Page-level OCR quality scores (e.g., 0.95) can mask localized errors. Individual tokens may have low confidence even when overall page quality appears good. Analysis of invoice 04 showed:

Current Implementation: The OCR worker now extracts token quality statistics:

{:token-count          316
 :low-confidence-count 21
 :low-confidence-ratio 0.066  ; 6.6% problematic
 :mean-confidence      0.937
 :min-confidence       0.37}
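
For orientation, statistics of this shape can be derived from a flat sequence of per-token confidence values. The sketch below is not the OCR worker's actual code; the function name and the threshold parameter are assumptions, and the threshold the worker really uses is not stated here.

```clojure
(defn token-quality-stats
  "Summarizes a sequence of per-token confidences (numbers in [0, 1]).
  Tokens strictly below `threshold` count as low-confidence.
  Returns nil for an empty sequence."
  [threshold confidences]
  (let [n   (count confidences)
        low (count (filter #(< % threshold) confidences))]
    (when (pos? n)
      {:token-count          n
       :low-confidence-count low
       :low-confidence-ratio (double (/ low n))
       :mean-confidence      (/ (reduce + 0.0 confidences) n)
       :min-confidence       (apply min confidences)})))
```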

Future Enhancement: Low-Confidence Token Annotation

Idea: Mark low-confidence tokens in the text sent to the LLM, helping it understand where OCR may have failed:

Input:  "Conathan Schilling"
Output: "[?Conathan?] Schilling"

The LLM could then:

  1. Be more cautious about low-confidence values
  2. Attempt inference from context (e.g., "Conathan" is likely "Jonathan")
  3. Flag fields containing low-confidence tokens in its confidence assessment

Implementation approach:

  1. Update layout->element in ocr/layout.clj to include confidence
  2. Update row->text to optionally wrap low-confidence tokens
  3. Add config options: :annotate-low-confidence?, :low-confidence-threshold
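
A minimal sketch of the wrapping step, assuming each token is a map with :text and :confidence (a hypothetical shape; the real row->text input may differ). The function name annotate-tokens is illustrative; the two config keys are the ones proposed in the list above.

```clojure
(require '[clojure.string :as str])

(defn annotate-tokens
  "Joins tokens into a single string. When annotation is enabled, tokens whose
  confidence is below the threshold are wrapped as [?token?]. Tokens without
  a confidence value are left unmarked."
  [{:keys [annotate-low-confidence? low-confidence-threshold]} tokens]
  (->> tokens
       (map (fn [{:keys [text confidence]}]
              (if (and annotate-low-confidence?
                       confidence
                       (< confidence low-confidence-threshold))
                (str "[?" text "?]")
                text)))
       (str/join " ")))
```

Keeping the annotation behind a flag makes it straightforward to compare extraction accuracy with and without the markers before enabling them by default.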

When to implement: When extraction errors caused by OCR quality become a significant issue and the token quality stats show a high :low-confidence-ratio.

Future Enhancement: Token-Based Layout Reconstruction

Idea: Current layout reconstruction uses "lines" from Document AI. Using "tokens" instead would provide:

Trade-off: More data to process, potentially different row grouping behavior.

When to implement: After evaluating whether line-based reconstruction is sufficient for production use cases.


Gmail Integration

Status: Stub endpoint exists at POST /document/new/gmail

Implementation:

  1. Set up a Gmail API watch with a Pub/Sub push subscription for the billing inbox
  2. Receive notifications at the webhook endpoint
  3. Fetch full message and attachments via Gmail API
  4. Call queue-for-ingestion! for each PDF/image attachment

Considerations:


Batch Processing for Bulk Uploads

Problem: Online Document AI processing handles one document at a time. Bulk uploads (e.g., a monthly batch from a supplier) could be processed more efficiently.

Solution: Use Document AI's batch API, which processes up to 50 documents per request with up to 5 concurrent batches.
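
Splitting a bulk upload into request-sized groups is the simple part and might look like the sketch below. The function name is hypothetical, the batch size is passed in so it can follow whatever limit currently applies, and the actual submission of each batch to Document AI is deliberately left out.

```clojure
(defn request-batches
  "Splits document URIs into groups of at most `batch-size`, one group per
  batch request. The last group may be smaller."
  [batch-size uris]
  (partition-all batch-size uris))
```

With a limit of 50, an upload of 120 documents would yield three batches (50, 50, and 20), which stays within the 5 concurrent batches mentioned above.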

When to implement: When bulk upload becomes a common use case.


Rate Limiting for External APIs

Problem: Document AI has quotas (600 requests/min, 120 pages/min). Bursts of uploads could hit rate limits.

Solution: Implement a semaphore or token bucket in front of OCR calls to smooth out bursts.
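
One way to sketch the token-bucket variant is as a pure function over an immutable bucket map, with the clock passed in explicitly so the behavior is easy to test. All names here (make-bucket, try-acquire, the map keys) are assumptions for illustration.

```clojure
(defn make-bucket
  "Creates a full bucket holding `capacity` tokens that refills at
  `refill-per-sec` tokens per second."
  [capacity refill-per-sec now-ms]
  {:capacity       capacity
   :refill-per-sec refill-per-sec
   :tokens         (double capacity)
   :updated-ms     now-ms})

(defn try-acquire
  "Refills the bucket for the time elapsed since the last update, then tries
  to take one token. Returns [updated-bucket acquired?]."
  [{:keys [capacity refill-per-sec tokens updated-ms] :as bucket} now-ms]
  (let [elapsed-s (/ (max 0 (- now-ms updated-ms)) 1000.0)
        available (min (double capacity)
                       (+ tokens (* elapsed-s refill-per-sec)))]
    (if (>= available 1.0)
      [(assoc bucket :tokens (- available 1.0) :updated-ms now-ms) true]
      [(assoc bucket :tokens available :updated-ms now-ms) false])))
```

In a running worker the bucket would probably live in an atom, with callers sleeping and retrying when acquired? is false. For the 600 requests/min quota above, a refill rate of 10 tokens per second with a small capacity would be a plausible starting point.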

When to implement: When rate limit errors appear in production logs.


Exponential Backoff for Retries

Problem: Currently, failed messages are retried only after the visibility timeout expires (a gap of about 5 minutes). Rate-limited responses (429s) would benefit from longer backoff.

Solution: On failure, do not delete the message; instead call ChangeMessageVisibility to set its visibility timeout to the backoff delay (30s, 60s, 120s...). The message reappears after that delay. Because ApproximateReceiveCount is incremented on every receive, it can be used to compute the next delay.
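
The delay schedule could be computed from the receive count along these lines. The base delay matches the 30s, 60s, 120s sequence above; the 15-minute cap is an assumed value (SQS itself allows a visibility timeout of up to 12 hours), and the function name is hypothetical. Note that SQS delivers ApproximateReceiveCount as a string attribute, so it would need parsing before being passed in.

```clojure
(def base-delay-s 30)

(def max-delay-s
  "Assumed upper bound for a single backoff step (15 minutes)."
  900)

(defn backoff-delay-s
  "Returns the visibility timeout in seconds for a message that has been
  received `receive-count` times: 30, 60, 120, ... capped at max-delay-s."
  [receive-count]
  (let [exponent (min 20 (max 0 (dec receive-count)))]
    (min max-delay-s
         (* base-delay-s (bit-shift-left 1 exponent)))))
```

The returned value would then be passed as the VisibilityTimeout parameter of the ChangeMessageVisibility call for the failed message's receipt handle.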

When to implement: When 429 errors become frequent in production.


S3 Lifecycle Policy

Problem: Processed documents accumulate in S3 indefinitely.

Solution: Configure S3 lifecycle rules to archive documents to Glacier after a retention period (e.g., 90 days post-ingestion).

When to implement: When storage costs become significant or compliance requires archival.


Multiple Pipeline Variants

Problem: Different document types (invoices vs purchase orders vs contracts) may need different LLM prompts or validation rules.

Solution: Create a registry of pipeline configurations keyed by document type, with type-specific prompts and field validation.
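
As a rough illustration, such a registry could be a plain map from document type to configuration. Every key, prompt identifier, and field name below is a placeholder invented for the sketch, not part of the current pipeline.

```clojure
(def pipeline-registry
  "Hypothetical registry; prompt ids and required fields are placeholders."
  {:invoice        {:prompt-id       :invoice-extraction
                    :required-fields [:invoice-number :issuer :recipient :total]}
   :purchase-order {:prompt-id       :purchase-order-extraction
                    :required-fields [:order-number :buyer :supplier]}})

(defn pipeline-for
  "Looks up the pipeline configuration for a document type, throwing an
  ex-info when the type has not been registered."
  [doc-type]
  (or (get pipeline-registry doc-type)
      (throw (ex-info "No pipeline registered for document type"
                      {:doc-type doc-type}))))

(defn missing-fields
  "Returns the required fields that are absent from the extracted data."
  [doc-type structured-data]
  (->> (:required-fields (pipeline-for doc-type))
       (remove #(contains? structured-data %))
       vec))
```

Failing loudly on an unregistered type keeps an unsupported document from being silently processed with the invoice prompt.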

When to implement: When supporting document types beyond invoices.


Multiple SQS Pollers

Problem: Single poller throughput may become a bottleneck under high load.

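Solution: Run multiple poller threads. SQS supports concurrent receives: a received message is hidden from other consumers for the duration of its visibility timeout. Standard queues deliver at least once, so message handling should remain idempotent.

When to implement: When the poller becomes the throughput bottleneck (monitor queue depth vs processing rate).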