Email-based document acquisition for automatic invoice extraction from customer inboxes.
The email acquisition system enables automatic extraction of invoices and financial documents from connected email accounts. It supports multiple email providers through a unified protocol-based architecture.
Supported Providers: Outlook, Gmail
Setup: See infra/runbooks/new-environment/ for environment setup.
┌─────────────────┐         ┌─────────────────┐
│ Email Provider  │         │ Email Provider  │
│    (Outlook)    │         │     (Gmail)     │
└────────┬────────┘         └────────┬────────┘
         │                           │
         │ Webhook/Push              │ Pub/Sub Push
         ▼                           ▼
┌─────────────────────────────────────────────┐
│                 ERP Service                 │
│   /webhooks/outlook     /webhooks/gmail     │
│                                             │
│ Validate → Lookup ap_doc_source → Queue SQS │
└─────────────────────┬───────────────────────┘
                      │
                      │ SQS: acquisition queue
                      ▼
┌─────────────────────────────────────────────┐
│               Workers Service               │
│                                             │
│ Acquisition Orchestrator                    │
│ └── Email Sync Worker                       │
│     ├── Refresh OAuth token                 │
│     ├── Delta sync (fetch new messages)     │
│     ├── Relevancy filter                    │
│     ├── LLM triage                          │
│     └── Queue for ingestion                 │
└─────────────────────────────────────────────┘
Providers implement three Clojure protocols:
| Protocol | Purpose |
|---|---|
| EmailSyncer | Token refresh, message sync, message fetch |
| SubscriptionManager | Webhook subscription lifecycle |
| OAuthProvider | OAuth 2.0 authorization flow |
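As a rough illustration of how the three protocols could be shaped, here is a minimal sketch. Only the protocol names come from the table above; every method name, argument list, and return shape is hypothetical, and `StubProvider` is an in-memory stand-in that implements only `EmailSyncer`.

```clojure
;; Hypothetical method names: only the protocol names are from the docs.
(defprotocol EmailSyncer
  (refresh-token [this doc-source]
    "Return fresh OAuth credentials for the doc source.")
  (sync-messages [this doc-source sync-token]
    "Return {:message-ids [...] :next-sync-token ...} for messages after sync-token.")
  (fetch-message [this doc-source message-id]
    "Return the full message for message-id."))

(defprotocol SubscriptionManager
  (create-subscription [this doc-source])
  (renew-subscription [this doc-source])
  (delete-subscription [this doc-source]))

(defprotocol OAuthProvider
  (authorization-url [this state])
  (exchange-code [this code]))

;; In-memory stub: `messages` is a vector of {:id ...} maps and the
;; sync token is simply the number of messages already seen.
(defrecord StubProvider [messages]
  EmailSyncer
  (refresh-token [_this _doc-source]
    {:access-token "stub-token"})
  (sync-messages [_this _doc-source sync-token]
    {:message-ids     (mapv :id (drop (or sync-token 0) messages))
     :next-sync-token (count messages)})
  (fetch-message [_this _doc-source message-id]
    (first (filter #(= message-id (:id %)) messages))))
```

A stub like this makes it possible to exercise sync logic without calling a real provider, for example `(sync-messages (->StubProvider [{:id "m1"} {:id "m2"}]) {} 1)`.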
Both providers support incremental sync to fetch only new messages:
| Provider | Mechanism | Token |
|---|---|---|
| Outlook | Delta query | @odata.deltaLink URL |
| Gmail | History API | historyId integer |
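The difference between the two tokens can be sketched as a function that builds the next sync request. This is an illustration, not the project's implementation: the `sync-request` name and request-map shape are hypothetical, and the use of the inbox folder for the first Outlook sync is an assumption. The URLs are the public Microsoft Graph and Gmail REST endpoints.

```clojure
(defmulti sync-request
  "Build a request map for the next sync, given the stored sync token
  (nil when no token is stored yet)."
  (fn [provider _last-sync-token] provider))

(defmethod sync-request :outlook [_ last-sync-token]
  ;; A stored @odata.deltaLink is already a complete URL, so it is
  ;; requested as is. Without one, start a new delta round.
  ;; Assumption: only the inbox folder is synced.
  {:method :get
   :url    (or last-sync-token
               "https://graph.microsoft.com/v1.0/me/mailFolders/inbox/messages/delta")})

(defmethod sync-request :gmail [_ last-sync-token]
  (if last-sync-token
    ;; Incremental: ask the History API for changes after historyId.
    {:method       :get
     :url          "https://gmail.googleapis.com/gmail/v1/users/me/history"
     :query-params {:startHistoryId (str last-sync-token)}}
    ;; No token yet: fall back to listing messages.
    {:method :get
     :url    "https://gmail.googleapis.com/gmail/v1/users/me/messages"}))
```

The Outlook token is opaque and self-describing (a URL), while the Gmail token is a plain number that must be attached to a fixed endpoint, which is why the two branches differ in shape.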
Real-time push notifications when new emails arrive:
| Provider | Method | Identification |
|---|---|---|
| Outlook | Direct webhook POST | clientState contains doc_source_id |
| Gmail | Pub/Sub push | emailAddress in payload |
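The two identification schemes lead to different lookups on the receiving side. The sketch below assumes the payloads have already been parsed into maps with keyword keys (and, for Gmail, that the base64 Pub/Sub `data` field has already been decoded). It also assumes that `clientState` is exactly the doc_source_id, which the table does not confirm. The function names are hypothetical.

```clojure
(defn outlook-lookups
  "An Outlook notification body carries a :value vector of notifications.
  Return one lookup per notification, keyed by doc source id.
  Assumption: clientState is the doc_source_id itself."
  [payload]
  (mapv (fn [notification]
          {:lookup-by :doc-source-id
           :value     (:clientState notification)})
        (:value payload)))

(defn gmail-lookup
  "A decoded Gmail push message carries :emailAddress and :historyId.
  The doc source has to be found by email address."
  [decoded-data]
  {:lookup-by :email-address
   :value     (:emailAddress decoded-data)})
```

For example, `(gmail-lookup {:emailAddress "ap@example.com" :historyId "991"})` yields a lookup by email address, while an Outlook payload with two notifications yields two lookups by doc source id.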
OAuth tokens are stored in AWS SSM Parameter Store:
/orcha/{environment}/email-tokens/{doc-source-id}
Stored fields: :access-token, :refresh-token, :expires-at
Environment setup runbooks are in the infra repo under infra/runbooks/new-environment/.
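The token parameter naming and an expiry check can be sketched as two pure functions. This is a sketch under assumptions: the function names are hypothetical, `:expires-at` is assumed to be an ISO-8601 instant string, and the safety margin (`skew`) is an illustrative parameter rather than a documented setting.

```clojure
(import '(java.time Instant Duration))

(defn token-parameter-name
  "Build the SSM parameter name for a doc source's tokens."
  [environment doc-source-id]
  (str "/orcha/" environment "/email-tokens/" doc-source-id))

(defn token-expired?
  "True when the token expires at or before now + skew.
  Assumption: :expires-at is an ISO-8601 instant string."
  [{:keys [expires-at]} ^Instant now ^Duration skew]
  (not (.isAfter (Instant/parse expires-at) (.plus now skew))))
```

Treating a token as expired slightly early (the skew) avoids sending a request with a token that lapses in flight.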
Providers are configured in config.edn under :com.getorcha/oauth-providers:
:com.getorcha/oauth-providers
{:outlook {:client-id     "..."
           :client-secret "..."
           ;; ... see config.edn for full structure
           }
 :gmail   {:client-id     "..."
           :client-secret "..."
           ;; ... see config.edn for full structure
           }}
-- Email-specific ap_doc_source extension
ap_doc_source_email (
  doc_source_id             -- FK to ap_doc_source
  email_address             -- Connected email
  provider                  -- 'outlook' | 'gmail'
  connection_status         -- 'active' | 'token_refresh_failed' | 'subscription_expired'
  provider_subscription_id  -- Provider's subscription/watch ID
  subscription_expires_at   -- When subscription needs renewal
  webhook_client_state      -- For Outlook validation
  last_sync_token           -- Delta link or history ID
)

-- Message deduplication
ap_doc_source_email_processed_messages (
  doc_source_id
  message_id    -- Provider's message ID
  status        -- 'processed' | 'rejected' | 'errored'
  error_type
  error_reason
)
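The deduplication table suggests a filter step before messages are queued. The sketch below is illustrative only: `new-message-ids` is a hypothetical name, rows are assumed to arrive as maps keyed by column name, and whether 'errored' messages are skipped or retried is not stated in the schema, so the set of statuses to skip is left as a parameter.

```clojure
(defn new-message-ids
  "Return the incoming message IDs that have no recorded row with a
  status in skip-statuses. By default every recorded status is skipped."
  ([recorded-rows incoming-ids]
   (new-message-ids recorded-rows incoming-ids
                    #{"processed" "rejected" "errored"}))
  ([recorded-rows incoming-ids skip-statuses]
   (let [seen (into #{}
                    (comp (filter #(contains? skip-statuses (:status %)))
                          (map :message_id))
                    recorded-rows)]
     ;; A set is a function of its members, so it works as a predicate.
     (into [] (remove seen) incoming-ids))))
```

Passing `#{"processed" "rejected"}` as the third argument would let previously errored messages through for another attempt.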
Individual message failures don't block the sync. Failed messages are recorded in ap_doc_source_email_processed_messages with error details.

| Error | Response |
|---|---|
| Token refresh failed | Increment failure counter, deactivate after 3 failures |
| Delta token expired | Clear sync token, retry (will do full resync) |
| API rate limit | Exponential backoff and retry |
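The three responses in the table can be sketched as small pure functions over a doc source map. The key names (`:token-failure-count`, `:connection-status`, `:last-sync-token`), the choice of 'token_refresh_failed' as the deactivated status, and the backoff parameters are all assumptions made for illustration.

```clojure
(def max-token-failures 3)

(defn record-token-failure
  "Increment the failure counter and mark the source as failed once the
  counter reaches max-token-failures.
  Assumption: deactivation means setting 'token_refresh_failed'."
  [source]
  (let [failures (inc (:token-failure-count source 0))]
    (cond-> (assoc source :token-failure-count failures)
      (>= failures max-token-failures)
      (assoc :connection-status "token_refresh_failed"))))

(defn reset-sync-token
  "Clear the stored sync token so the next sync is a full resync."
  [source]
  (assoc source :last-sync-token nil))

(defn backoff-ms
  "Exponential backoff: base-ms doubled per attempt (attempt starts at 0),
  capped at max-ms."
  [attempt base-ms max-ms]
  (min max-ms (* base-ms (bit-shift-left 1 (min attempt 20)))))
```

With a 1000 ms base and a 60000 ms cap, `backoff-ms` yields 1000, 2000, 4000, 8000 for attempts 0 through 3 and stays at 60000 once the cap is reached.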
Subscriptions/watches expire (~7 days). The renewal orchestrator checks on a fixed interval and renews subscriptions that are close to expiry.
Configuration in config.edn:
:com.getorcha.workers.ap.acquisition.email.subscription/renewal
{:check-interval-minutes 60
 :hours-until-expiry     48}
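One plausible reading of these two settings is that a check runs every `:check-interval-minutes` and renews any subscription that expires within the next `:hours-until-expiry` hours. The sketch below encodes that reading. It is an assumption, not confirmed behaviour, and the function names and the `:subscription-expires-at` key are hypothetical.

```clojure
(import '(java.time Instant Duration))

(def renewal-config
  {:check-interval-minutes 60
   :hours-until-expiry     48})

(defn needs-renewal?
  "True when expires-at falls before now + :hours-until-expiry."
  [{:keys [hours-until-expiry]} ^Instant now ^Instant expires-at]
  (let [threshold (.plus now (Duration/ofHours (long hours-until-expiry)))]
    (.isBefore expires-at threshold)))

(defn due-for-renewal
  "Select the sources whose subscription should be renewed on this check."
  [config now sources]
  (filterv #(needs-renewal? config now (:subscription-expires-at %)) sources))
```

With subscriptions lasting about 7 days and a 48-hour window, an hourly check leaves many chances to renew before a subscription actually lapses.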
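To add support for a new email provider:

1. Create src/com/getorcha/workers/ap/acquisition/email/{provider}.clj
2. Implement the three protocols:
   - EmailSyncer - Token management and message sync
   - SubscriptionManager - Webhook subscription lifecycle
   - OAuthProvider - OAuth flow
3. Register the provider in build-providers in email.clj
4. Add the webhook handler in webhooks.clj
5. Add the provider's settings to config.edn
6. Document setup in infra/runbooks/new-environment/{provider}-setup.md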