SES Email Acquisition Architecture

Forward-based email acquisition using AWS SES for automatic invoice extraction.


Overview

SES email acquisition provides a simpler alternative to OAuth-based email integration. Instead of managing OAuth tokens, webhook subscriptions, and provider-specific APIs, customers simply forward emails to a dedicated SES receiving address.

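Benefits over the OAuth approach: there is no token refresh or subscription renewal to maintain, and it works with any email system that can auto-forward, not only Outlook and Gmail.

Trade-offs: forwarding adds a delay (about one minute) compared with real-time webhooks, and only the From address, subject, and attachments are available rather than the full message, folders, and read status.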


Service Architecture

┌─────────────────────┐
│  Customer Email     │
│  (M365, Gmail, etc) │
└──────────┬──────────┘
           │
           │ Mail rule (auto-forward)
           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        AWS SES                                   │
│  documents@mail.{env}.getorcha.com                               │
│                                                                  │
│  Receipt Rule:                                                   │
│    ├── Verify domain identity (mail.{env}.getorcha.com)          │
│    └── Store to S3 bucket (v1-orcha-ses-emails-{account})        │
└──────────────────────┬──────────────────────────────────────────┘
                       │
                       │ S3 Event Notification
                       ▼
┌─────────────────────────────────────────────────────────────────┐
│                     SQS: email-acquire                           │
│                                                                  │
│  Message: {bucket, key, event: "ObjectCreated"}                  │
└──────────────────────┬──────────────────────────────────────────┘
                       │
                       │ Poll
                       ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Workers Service                              │
│                                                                  │
│  Acquisition Orchestrator                                        │
│  └── SES Handler (multi/handle-queue-message :ses)               │
│      1. Fetch .eml from S3                                       │
│      2. Parse MIME (headers, body, attachments)                  │
│      3. Lookup ap_doc_source_ses by sender email                 │
│      4. If known sender: triage → upload → queue ingestion       │
│      5. Record status in ap_doc_source_ses_processed             │
│      6. Delete .eml from S3                                      │
└──────────────────────┬──────────────────────────────────────────┘
                       │
                       │ SQS: ingest
                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  Ingestion Pipeline (transcription → extraction → validation)   │
└─────────────────────────────────────────────────────────────────┘

Data Flow

1. Email Arrives at SES

The customer forwards an invoice email. SES receives it at documents@mail.{env}.getorcha.com.

2. SES Stores to S3

The receipt rule saves the raw .eml file to the S3 bucket v1-orcha-ses-emails-{account}. The object key is the SES message ID.

3. S3 Triggers SQS

The S3 event notification sends a message to the email-acquire queue:

{
  "Records": [{
    "s3": {
      "bucket": {"name": "v1-orcha-ses-emails-123456789"},
      "object": {"key": "abc123def456"}
    }
  }]
}
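
As a rough illustration of how a consumer might pull the bucket and key out of this message, here is a minimal Python sketch. The function name extract_s3_objects is hypothetical and not part of the worker code. It assumes the queue receives the S3 notification body directly, and it accounts for two documented S3 behaviours: object keys arrive URL-encoded, and S3 sends a test message without a "Records" field when the notification is first configured.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification body.

    Hypothetical helper for illustration; not the worker's actual code.
    """
    event = json.loads(sqs_body)
    objects = []
    # The initial "s3:TestEvent" message has no "Records" field,
    # so fall back to an empty list instead of raising KeyError.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

SES message IDs normally contain no characters that need decoding, so the unquote_plus call is defensive rather than strictly required here.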

4. Worker Processes Email

;; SES handler entry point
(defmethod multi/handle-queue-message :ses
  [{:keys [aws] :as context}
   {:keys [bucket key] :as _message}]

  ;; 1. Check deduplication
  (when-not (already-processed? context key)

    ;; 2. Fetch .eml from S3
    (let [s3-client (:s3-client aws)
          eml-bytes (aws/get-object s3-client bucket key)

          ;; 3. Parse MIME
          {:keys [from subject attachments] :as email} (parse-eml eml-bytes)]

      ;; 4. Lookup tenant by sender; keep the whole row for triage
      (if-let [{:ap-doc-source-ses/keys [doc-source-id] :as doc-source-ses}
               (lookup-doc-source-by-sender context from)]

        ;; Known sender - process
        (let [result (triage/queue-extractable-items! context doc-source-ses email)]
          (record-processed! context {:status "processed" ...})
          (aws/delete-object! s3-client bucket key))

        ;; Unknown sender - reject
        (do
          (record-processed! context {:status "rejected" :error-reason "unknown-sender"})
          (aws/delete-object! s3-client bucket key))))))
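
The handler above calls parse-eml without showing it. The following Python sketch illustrates what the MIME parsing step could look like using only the standard library. It is not the project's implementation: the function name parse_eml, the returned dictionary shape, and the choice to lower-case the sender address are all assumptions made for the example.

```python
from email import message_from_bytes, policy
from email.utils import parseaddr


def parse_eml(eml_bytes: bytes) -> dict:
    """Parse raw .eml bytes into sender, subject, and attachments.

    Illustrative sketch only; the return shape is an assumption.
    """
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    # parseaddr splits "Name <addr>" into (name, addr).
    _, sender = parseaddr(str(msg.get("From", "")))
    attachments = [
        {
            "filename": part.get_filename(),
            "content_type": part.get_content_type(),
            # decode=True undoes base64 / quoted-printable encoding.
            "data": part.get_payload(decode=True),
        }
        for part in msg.iter_attachments()
    ]
    return {
        # Lower-cased so it can be compared with registered senders.
        "from": sender.lower(),
        "subject": str(msg.get("Subject", "")),
        "attachments": attachments,
    }
```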

5. Triage & Queue

The shared triage module (triage/queue-extractable-items!) processes the email identically to OAuth-based acquisition:

  1. Apply relevancy filter (spam detection, invoice keywords)
  2. LLM triage (identify extractable documents)
  3. Upload attachments/body to S3
  4. Queue ingestion messages
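
The ordering of these four steps can be sketched as a small pipeline. This is an illustration only, not the triage module's code: the function name and the four callables (is_relevant, triage, upload, enqueue) are hypothetical stand-ins for the relevancy filter, the LLM triage, the S3 upload, and the SQS send.

```python
from typing import Callable, Iterable


def queue_extractable_items(
    email: dict,
    is_relevant: Callable[[dict], bool],
    triage: Callable[[dict], Iterable[dict]],
    upload: Callable[[dict], str],
    enqueue: Callable[[str], None],
) -> int:
    """Run the four triage steps in order; return the number of queued items."""
    # 1. Relevancy filter runs first, so filtered-out mail skips triage.
    if not is_relevant(email):
        return 0
    queued = 0
    # 2. Triage yields the extractable documents (attachments or body).
    for item in triage(email):
        # 3. Upload the document and get back its storage location.
        location = upload(item)
        # 4. Queue one ingestion message per uploaded document.
        enqueue(location)
        queued += 1
    return queued
```

Passing the steps in as callables keeps the sketch independent of how acquisition happened, which mirrors the point that SES and OAuth acquisition share the same triage path.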

Database Schema

ap_doc_source_ses

Maps sender email addresses to doc sources (tenants).

CREATE TABLE ap_doc_source_ses (
    doc_source_id         UUID PRIMARY KEY REFERENCES ap_doc_source(id) ON DELETE CASCADE,
    sender_email          TEXT NOT NULL UNIQUE,
    ses_receiving_address TEXT NOT NULL,  -- e.g., documents@mail.prod.getorcha.com
    created_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);

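-- Note: in PostgreSQL the UNIQUE constraint on sender_email already creates an index,
-- so this explicit index is redundant.
CREATE INDEX idx_ap_doc_source_ses_sender_email ON ap_doc_source_ses(sender_email);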

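Usage: an admin registers a sender by inserting a row with the sender's email address. The SES handler then looks up the doc source by the From address of each incoming email.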
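
A minimal Python sketch of the sender lookup follows. It uses an in-memory SQLite table as a stand-in for the real database so that it runs anywhere. The helper names and the convention of storing and comparing addresses in lower case are assumptions, not confirmed behaviour of the worker.

```python
import sqlite3
from typing import Optional

# In-memory stand-in for ap_doc_source_ses (UUID columns become TEXT here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ap_doc_source_ses ("
    " doc_source_id TEXT PRIMARY KEY,"
    " sender_email TEXT NOT NULL UNIQUE,"
    " ses_receiving_address TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO ap_doc_source_ses VALUES (?, ?, ?)",
    ("ds-1", "ap@customer.example", "documents@mail.prod.getorcha.com"),
)


def normalize_sender(address: str) -> str:
    """Assumed convention: registered addresses are stored lower-cased."""
    return address.strip().lower()


def lookup_doc_source_by_sender(
    conn: sqlite3.Connection, sender: str
) -> Optional[str]:
    """Return the doc_source_id for a registered sender, or None."""
    row = conn.execute(
        "SELECT doc_source_id FROM ap_doc_source_ses WHERE sender_email = ?",
        (normalize_sender(sender),),
    ).fetchone()
    return row[0] if row else None
```

Without some normalization, a sender registered as ap@customer.example would not match a From header written as AP@Customer.Example, because TEXT comparison is case-sensitive.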

ap_doc_source_ses_processed

Audit/deduplication table for processed emails.

CREATE TABLE ap_doc_source_ses_processed (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    doc_source_id  UUID REFERENCES ap_doc_source(id) ON DELETE CASCADE,
    s3_object_key  TEXT NOT NULL UNIQUE,  -- SES message ID (S3 key)
    ses_message_id TEXT,                  -- Message-ID header
    sender_email   TEXT NOT NULL,
    processed_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    status         TEXT NOT NULL,         -- 'processed' | 'rejected' | 'error'
    error_reason   TEXT
);

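Status values:

  processed  Known sender; the email was triaged and its items were queued for ingestion.
  rejected   Unknown sender; the email was deleted from S3.
  error      Parsing or processing failed; the email is retained in S3.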


Error Handling

Unknown Sender

When an email arrives from an unregistered sender:

  1. Log warning with sender address and S3 key
  2. Record in ap_doc_source_ses_processed with status rejected
  3. Delete email from S3 (no data retention for unknown senders)

Parse Errors

If MIME parsing fails:

  1. Log error with exception details
  2. Record in ap_doc_source_ses_processed with status error
  3. Retain email in S3 for manual investigation
  4. Re-throw the exception so the SQS message is eventually routed to the DLQ
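
The four steps above can be sketched in Python as follows. This is a hedged illustration rather than the handler's code: parse_or_record_error, its parameters, and the "parse-error" reason string are hypothetical names chosen for the example.

```python
import logging
from typing import Callable

log = logging.getLogger("ses-handler")


def parse_or_record_error(
    key: str,
    eml_bytes: bytes,
    parse: Callable[[bytes], dict],
    record_processed: Callable[[dict], None],
) -> dict:
    """Parse an email; on failure, record the error and re-raise."""
    try:
        return parse(eml_bytes)
    except Exception:
        # 1. Log the error with the full traceback.
        log.exception("Failed to parse SES email, key=%s", key)
        # 2. Record the failure in the audit table.
        record_processed(
            {"s3_object_key": key, "status": "error", "error_reason": "parse-error"}
        )
        # 3. Nothing is deleted here, so the .eml stays in S3.
        # 4. Re-raise so the message is not acknowledged and can reach the DLQ.
        raise
```

Re-raising instead of swallowing the exception matters: if the function returned normally, the consumer would treat the message as handled and it would never reach the dead-letter queue.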

Size Limit

SES enforces a maximum inbound message size (30 MB in this design; confirm against the current SES receiving quotas). Emails exceeding the limit are rejected by SES before they reach our infrastructure.

Deduplication

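The already-processed? check prevents reprocessing, for example when the same S3 event is delivered more than once. It looks up the S3 object key in ap_doc_source_ses_processed, where s3_object_key is UNIQUE.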
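
A minimal sketch of key-based deduplication in Python, using an in-memory SQLite table as a stand-in for ap_doc_source_ses_processed. The function names mirror the Clojure ones but are otherwise hypothetical, and INSERT OR IGNORE is SQLite syntax standing in for PostgreSQL's ON CONFLICT DO NOTHING.

```python
import sqlite3

# Reduced stand-in for ap_doc_source_ses_processed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ap_doc_source_ses_processed ("
    " s3_object_key TEXT NOT NULL UNIQUE,"
    " sender_email TEXT NOT NULL,"
    " status TEXT NOT NULL)"
)


def already_processed(conn: sqlite3.Connection, s3_object_key: str) -> bool:
    """True if a row for this S3 object key already exists."""
    row = conn.execute(
        "SELECT 1 FROM ap_doc_source_ses_processed WHERE s3_object_key = ?",
        (s3_object_key,),
    ).fetchone()
    return row is not None


def record_processed(
    conn: sqlite3.Connection, s3_object_key: str, sender_email: str, status: str
) -> bool:
    """Insert the audit row; return False if the key was already recorded."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO ap_doc_source_ses_processed "
        "(s3_object_key, sender_email, status) VALUES (?, ?, ?)",
        (s3_object_key, sender_email, status),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be skipped.
    return cur.rowcount == 1
```

The UNIQUE constraint acts as a backstop: even if two workers both pass the already_processed check for the same key, only one insert takes effect.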


Configuration

config.edn

:com.getorcha/aws
{:s3-buckets {:ses-emails #join ["v1-orcha-ses-emails-"
                                  #profile {:local-dev "local-stack"
                                            :test      "test"
                                            :default   #orcha/param "/v1-orcha/account-id"}]}}

CDK Infrastructure

See infra/stacks/foundation_stack.py:

# SES emails bucket
self.ses_emails_bucket = s3.Bucket(
    self, "SesEmailsBucket",
    bucket_name=f"v1-orcha-ses-emails-{self.account}",
    ...
)

# S3 → SQS event notification
self.ses_emails_bucket.add_event_notification(
    s3.EventType.OBJECT_CREATED,
    s3n.SqsDestination(self.email_acquire_queue),
)

# SES Receipt Rule
ses.ReceiptRule(
    self, "StoreToS3Rule",
    rule_set=receipt_rule_set,
    recipients=[f"documents@{mail_domain}"],
    actions=[ses_actions.S3(bucket=self.ses_emails_bucket)],
)

Monitoring

CloudWatch Alarms

Alarm                                  Threshold      Description
-------------------------------------  -------------  -------------------
v1-orcha-email-acquire-dlq-not-empty   > 0 messages   Processing failures
v1-orcha-email-acquire-latency         > 60 seconds   Queue backup

Key Metrics

Metric                               Source  Description
-----------------------------------  ------  ------------------
ApproximateNumberOfMessagesVisible   SQS     Queue depth
ApproximateAgeOfOldestMessage        SQS     Processing latency
NumberOfMessagesReceived             SQS     Throughput

Logs

Application logs are written to CloudWatch under /v1-orcha/application:

;; Success path
(log/info "Processing SES email" {:bucket bucket :key key})
(log/info "Parsed SES email" {:from from :subject subject :attachment-count n})
(log/info "SES email processed" {:doc-source-id id :queued-items n})

;; Rejection
(log/warn "Unknown sender for SES email, rejecting" {:from from :key key})

;; Error
(log/error e "Failed to parse/process SES email" {:key key})

Security

Email Validation

S3 Bucket Access

Sender Registration

Only registered sender emails are processed:

  1. Admin adds ap_doc_source_ses row with sender email
  2. Sender must forward from exactly that email address
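  3. A From address that does not match a registered sender fails the lookup and the email is rejected. The lookup alone cannot distinguish a spoofed From header that matches a registered address; detecting that requires separate authentication checks such as SPF, DKIM, or DMARC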

Comparison with OAuth Acquisition

Aspect            SES (Forward)                  OAuth (Direct)
----------------  -----------------------------  -----------------------------------
Setup             Customer configures mail rule  Customer authorizes OAuth
Maintenance       None                           Token refresh, subscription renewal
Latency           Forwarding delay (~1 min)      Real-time webhooks
Reliability       Email forwarding               OAuth tokens, webhooks
Metadata          From, Subject, Attachments     Full message, folders, read status
Provider support  Any email system               Outlook, Gmail only