Email Acquisition

Email-based document acquisition for automatic invoice extraction from customer inboxes.

Overview

The email acquisition system enables automatic extraction of invoices and financial documents from connected email accounts. It supports multiple email providers through a unified protocol-based architecture.

Supported Providers: Outlook and Gmail.

Setup: See infra/runbooks/new-environment/ for environment setup.

Architecture

┌─────────────────┐      ┌─────────────────┐
│  Email Provider │      │  Email Provider │
│  (Outlook)      │      │  (Gmail)        │
└────────┬────────┘      └────────┬────────┘
         │                        │
         │ Webhook/Push           │ Pub/Sub Push
         ▼                        ▼
┌─────────────────────────────────────────────┐
│              ERP Service                     │
│  /webhooks/outlook    /webhooks/gmail        │
│                                              │
│  Validate → Lookup ap_doc_source → Queue SQS  │
└─────────────────────┬───────────────────────┘
                      │
                      │ SQS: acquisition queue
                      ▼
┌─────────────────────────────────────────────┐
│           Workers Service                    │
│                                              │
│  Acquisition Orchestrator                    │
│  └── Email Sync Worker                       │
│      ├── Refresh OAuth token                 │
│      ├── Delta sync (fetch new messages)     │
│      ├── Relevancy filter                    │
│      ├── LLM triage                          │
│      └── Queue for ingestion                 │
└─────────────────────────────────────────────┘

Key Concepts

Provider Protocols

Providers implement three Clojure protocols:

Protocol             Purpose
EmailSyncer          Token refresh, message sync, message fetch
SubscriptionManager  Webhook subscription lifecycle
OAuthProvider        OAuth 2.0 authorization flow
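
As a rough illustration of how the three protocols might be laid out, here is a minimal sketch. Only the protocol names come from the table above; every method name, arity, and return shape is an assumption made for the example, and StubProvider is a hypothetical in-memory stand-in rather than a real provider.

```clojure
;; Hypothetical method names and arities; only the protocol names are from the docs.
(defprotocol EmailSyncer
  (refresh-token [this doc-source] "Return doc-source with a fresh access token.")
  (sync-messages [this doc-source sync-token] "Return {:message-ids [...] :next-sync-token ...}.")
  (fetch-message [this doc-source message-id] "Return the full message for message-id."))

(defprotocol SubscriptionManager
  (create-subscription [this doc-source] "Start webhook delivery for the mailbox.")
  (renew-subscription [this doc-source] "Extend the subscription before it expires.")
  (delete-subscription [this doc-source] "Stop webhook delivery for the mailbox."))

(defprotocol OAuthProvider
  (authorization-url [this state] "URL the user visits to grant access.")
  (exchange-code [this code] "Trade an authorization code for tokens."))

;; In-memory stub: the sync token is simply an offset into `messages`.
(defrecord StubProvider [messages]
  EmailSyncer
  (refresh-token [_ doc-source]
    (assoc doc-source :access-token "stub-token"))
  (sync-messages [_ _doc-source sync-token]
    {:message-ids     (mapv :id (drop (or sync-token 0) messages))
     :next-sync-token (count messages)})
  (fetch-message [_ _doc-source message-id]
    (first (filter #(= message-id (:id %)) messages))))
```

A stub like this can stand in for a real provider in worker tests, since callers only depend on the protocol functions.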

Delta Sync

Both providers support incremental sync to fetch only new messages:

Provider  Mechanism    Token
Outlook   Delta query  @odata.deltaLink URL
Gmail     History API  historyId integer
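
The difference between the two token types can be sketched as a function that turns the stored token into the next request. The endpoints below are the public Microsoft Graph and Gmail API paths; whether the workers call exactly these URLs is an assumption, and sync-request is a hypothetical name.

```clojure
(defmulti sync-request
  "Describe the HTTP request for the next sync, given the stored sync token."
  (fn [provider _last-sync-token] provider))

;; Outlook: the token is a complete URL, so it is requested as-is.
;; With no token, start a fresh delta round on the inbox (assumed folder).
(defmethod sync-request :outlook [_ delta-link]
  {:method :get
   :url    (or delta-link
               "https://graph.microsoft.com/v1.0/me/mailFolders/inbox/messages/delta")})

;; Gmail: the token is an integer passed as startHistoryId.
;; With no token, fall back to listing messages (a full sync).
(defmethod sync-request :gmail [_ history-id]
  (if history-id
    {:method       :get
     :url          "https://gmail.googleapis.com/gmail/v1/users/me/history"
     :query-params {:startHistoryId (str history-id)}}
    {:method :get
     :url    "https://gmail.googleapis.com/gmail/v1/users/me/messages"}))
```

Because the Outlook token is an opaque URL and the Gmail token is a number, a single text column such as last_sync_token can hold either one.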

Webhook Notifications

Both providers deliver real-time push notifications when new emails arrive:

Provider  Method               Identification
Outlook   Direct webhook POST  clientState contains doc_source_id
Gmail     Pub/Sub push         emailAddress in payload
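
The identification column can be illustrated with a small normalizer. This sketch assumes the request body has already been JSON-parsed into keyword-keyed maps and, for Gmail, that the base64 Pub/Sub data field has already been decoded. The function name lookup-keys and the returned map shape are hypothetical, and the internal format of clientState is left opaque because the document does not specify it.

```clojure
(defmulti lookup-keys
  "Return the values a webhook handler could use to find the doc source."
  (fn [provider _body] provider))

;; Outlook: a Graph change notification batches items under :value,
;; each carrying the clientState set when the subscription was created.
(defmethod lookup-keys :outlook [_ body]
  (->> (:value body)
       (map :clientState)
       (remove nil?)
       distinct
       (mapv (fn [client-state]
               {:provider :outlook :client-state client-state}))))

;; Gmail: the decoded notification names the mailbox by address.
(defmethod lookup-keys :gmail [_ notification]
  (if-let [email (:emailAddress notification)]
    [{:provider :gmail :email-address email}]
    []))
```

Returning a vector in both cases lets the handler treat a batched Outlook notification and a single Gmail notification the same way.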

Token Storage

OAuth tokens are stored in AWS SSM Parameter Store.

Setup Guides

Environment setup runbooks are in the infra repo:

  1. Outlook Setup - Microsoft Entra app registration
  2. GCP Setup - Document AI, Gmail Pub/Sub, Workload Identity

Configuration

Providers are configured in config.edn under :com.getorcha/oauth-providers:

:com.getorcha/oauth-providers
{:outlook {:client-id     "..."
           :client-secret "..."
           ;; ... see config.edn for full structure
           }
 :gmail   {:client-id     "..."
           :client-secret "..."
           ;; ... see config.edn for full structure
           }}
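
A startup check on this section might look like the sketch below. It assumes :client-id and :client-secret are the only keys worth checking, since those are the only ones shown here; missing-provider-keys is a hypothetical helper, not part of the codebase.

```clojure
;; Assumed required keys: only the two shown in the snippet above.
(def required-provider-keys #{:client-id :client-secret})

(defn missing-provider-keys
  "Map each provider to the required keys that are absent or blank."
  [config]
  (into {}
        (for [[provider cfg] (:com.getorcha/oauth-providers config)
              :let [missing (set (remove #(seq (str (get cfg %)))
                                         required-provider-keys))]
              :when (seq missing)]
          [provider missing])))
```

An empty map means every configured provider has both credentials; anything else names the provider and the keys to fix.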

Database Schema

Core Tables

-- Email-specific ap_doc_source extension
ap_doc_source_email (
  doc_source_id             -- FK to ap_doc_source
  email_address             -- Connected email
  provider                  -- 'outlook' | 'gmail'
  connection_status         -- 'active' | 'token_refresh_failed' | 'subscription_expired'
  provider_subscription_id  -- Provider's subscription/watch ID
  subscription_expires_at   -- When subscription needs renewal
  webhook_client_state      -- For Outlook validation
  last_sync_token           -- Delta link or history ID
)

-- Message deduplication
ap_doc_source_email_processed_messages (
  doc_source_id
  message_id                -- Provider's message ID
  status                    -- 'processed' | 'rejected' | 'errored'
  error_type
  error_reason
)

Processing Pipeline

  1. Webhook receives notification - Validates, looks up doc_source, queues to SQS
  2. Acquisition orchestrator - Polls SQS, dispatches to worker pool
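  3. Email sync worker - Refreshes OAuth token, runs delta sync, applies relevancy filter and LLM triage, queues for ingestion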
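
The worker stages shown in the architecture diagram (token refresh, delta sync, relevancy filter, LLM triage, queue for ingestion) can be sketched as one pass over injected step functions. All names here are hypothetical, and the steps are passed in as plain functions so the flow can be exercised without a mailbox or an LLM.

```clojure
(defn run-email-sync
  "Run one sync pass for `source`. `steps` is a map of hypothetical functions:
  :refresh-token, :delta-sync, :relevant?, :triage and :enqueue!."
  [{:keys [refresh-token delta-sync relevant? triage enqueue!]} source]
  (let [source   (refresh-token source)
        {:keys [messages next-sync-token]} (delta-sync source)
        accepted (->> messages
                      (filter relevant?) ; relevancy filter runs first
                      (filter triage)    ; LLM triage sees only what passed
                      vec)]
    (run! enqueue! accepted)
    ;; Persisting the new token is what makes the next pass incremental.
    {:source   (assoc source :last_sync_token next-sync-token)
     :enqueued (count accepted)}))
```

Running the relevancy filter ahead of triage follows the order in the diagram and keeps the number of LLM calls down to messages that already look relevant.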

Error Handling

Message-Level Errors

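Individual message failures don't block the sync. A failed message is recorded in ap_doc_source_email_processed_messages with status 'errored' (plus error_type and error_reason), and the remaining messages are still processed.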
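
One way to keep a single bad message from stopping the batch is to catch per message and turn the outcome into a row shaped like ap_doc_source_email_processed_messages. This is a sketch under assumptions: handler is a hypothetical function that returns truthy for a relevant message and falsy for a rejected one, and using the exception class name as error_type is a choice made for the example.

```clojure
(defn process-message
  "Run `handler` on one message and return a processed-messages row."
  [doc-source-id handler {:keys [id] :as message}]
  (try
    {:doc_source_id doc-source-id
     :message_id    id
     :status        (if (handler message) "processed" "rejected")}
    (catch Exception e
      ;; The failure is captured as data, so the caller keeps going.
      {:doc_source_id doc-source-id
       :message_id    id
       :status        "errored"
       :error_type    (.getName (class e))
       :error_reason  (ex-message e)})))

(defn process-batch
  "Skip message IDs already recorded, then process the rest independently."
  [doc-source-id handler already-processed messages]
  (->> messages
       (remove #(contains? already-processed (:id %)))
       (mapv #(process-message doc-source-id handler %))))
```

The already-processed set plays the deduplication role of the table: a message ID that has a row is never handled twice.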

Sync-Level Errors

Error                 Response
Token refresh failed  Increment failure counter, deactivate after 3 failures
Delta token expired   Clear sync token, retry (will do full resync)
API rate limit        Exponential backoff and retry
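
The three responses above can be sketched as a single dispatch function. Several details are assumptions made for illustration: the :token-failure-count field, the mapping of "deactivate" to the 'token_refresh_failed' connection status, the :action values, and the backoff base and cap.

```clojure
(def max-token-failures 3)

(defn backoff-ms
  "Delay before retry `attempt` (0-based): base-ms * 2^attempt, capped at max-ms."
  [attempt {:keys [base-ms max-ms] :or {base-ms 1000 max-ms 60000}}]
  ;; The exponent is clamped so the shift cannot overflow a long.
  (min max-ms (* base-ms (bit-shift-left 1 (min attempt 20)))))

(defn record-token-failure
  "Increment the failure counter; mark the source inactive at the limit."
  [source]
  (let [failures (inc (:token-failure-count source 0))]
    (cond-> (assoc source :token-failure-count failures)
      (>= failures max-token-failures)
      (assoc :connection_status "token_refresh_failed"))))

(defn handle-sync-error
  "Return the next action and updated source for a sync-level error."
  [source error-kind attempt]
  (case error-kind
    :token-refresh-failed {:action :abort
                           :source (record-token-failure source)}
    ;; Clearing the token makes the retry a full resync.
    :delta-token-expired  {:action :retry
                           :source (assoc source :last_sync_token nil)}
    :rate-limited         {:action   :retry
                           :source   source
                           :delay-ms (backoff-ms attempt {})}))
```

With the assumed defaults, the delay doubles from one second per attempt and stops growing at one minute.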

Subscription Renewal

Subscriptions/watches expire (~7 days). The renewal orchestrator:

Configuration in config.edn:

:com.getorcha.workers.ap.acquisition.email.subscription/renewal
{:check-interval-minutes 60
 :hours-until-expiry     48}
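
One plausible reading of these two settings is that a check runs every :check-interval-minutes and renews any subscription that expires within :hours-until-expiry. The sketch below implements only that selection step, under that assumed interpretation; needs-renewal? and due-for-renewal are hypothetical names.

```clojure
(import '(java.time Duration Instant))

(def renewal-config
  {:check-interval-minutes 60
   :hours-until-expiry     48})

(defn needs-renewal?
  "True when the subscription expires before now + hours-until-expiry.
  Already-expired subscriptions also qualify."
  [{:keys [hours-until-expiry]} ^Instant now {:keys [subscription_expires_at]}]
  (let [cutoff (.plus now (Duration/ofHours hours-until-expiry))]
    (.isBefore ^Instant subscription_expires_at cutoff)))

(defn due-for-renewal
  "Select the sources whose subscriptions should be renewed on this check."
  [config now sources]
  (filterv #(needs-renewal? config now %) sources))
```

With a 48-hour window and an hourly check, a subscription that lasts about 7 days gets many renewal attempts before it actually lapses.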

Adding a New Provider

To add support for a new email provider:

  1. Create src/com/getorcha/workers/ap/acquisition/email/{provider}.clj
  2. Implement three protocols: EmailSyncer, SubscriptionManager, and OAuthProvider
  3. Add provider to build-providers in email.clj
  4. Add webhook handler in webhooks.clj
  5. Add configuration section to config.edn
  6. Add setup runbook in infra/runbooks/new-environment/{provider}-setup.md