SES Email Troubleshooting Runbook

Operational procedures for diagnosing and resolving SES email acquisition issues.


Quick Diagnosis Commands

Set environment variables:

export AWS_PROFILE=orcha-prod
export AWS_REGION=eu-central-1
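
The commands in this runbook repeat the account lookup and build resource names inline. As a convenience, the names can be produced by small helpers instead. This is a minimal sketch that assumes the naming patterns visible in the commands in this runbook; `queue_url` and `ses_bucket` are hypothetical names, not existing tooling.

```shell
# Hypothetical helpers; names are illustrative, not existing tooling.

# Build an SQS queue URL from an account ID and a queue name.
queue_url() {
  printf 'https://sqs.%s.amazonaws.com/%s/%s\n' \
    "${AWS_REGION:-eu-central-1}" "$1" "$2"
}

# Build the SES email bucket name from an account ID.
ses_bucket() {
  printf 'v1-orcha-ses-emails-%s\n' "$1"
}

# Usage (requires AWS credentials):
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
#   aws sqs get-queue-attributes \
#     --queue-url "$(queue_url "$ACCOUNT_ID" v1-orcha-global-email-acquire)" \
#     --attribute-names ApproximateNumberOfMessages
```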

Queue Status

# Check queue depth
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-central-1.amazonaws.com/$(aws sts get-caller-identity --query Account --output text)/v1-orcha-global-email-acquire \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

# Check DLQ
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-central-1.amazonaws.com/$(aws sts get-caller-identity --query Account --output text)/v1-orcha-global-email-acquire-dlq \
  --attribute-names ApproximateNumberOfMessages

S3 Bucket Status

# List recent emails in bucket
aws s3 ls s3://v1-orcha-ses-emails-$(aws sts get-caller-identity --query Account --output text)/ --recursive | tail -20

# Check if specific email exists
aws s3 ls s3://v1-orcha-ses-emails-$(aws sts get-caller-identity --query Account --output text)/{message-id}

Recent Logs

# Last 30 minutes of SES processing logs
aws logs filter-log-events \
  --log-group-name /v1-orcha/application \
  --start-time $(($(date +%s) - 1800))000 \
  --filter-pattern '"SES email"'
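
The --start-time option takes milliseconds since the Unix epoch, which is why the command above appends 000 to a value in seconds. A small helper can make that conversion explicit. This is only a sketch; `ms_ago` is a hypothetical name.

```shell
# Hypothetical helper: print the epoch timestamp, in milliseconds,
# for a point N seconds in the past.
ms_ago() {
  echo "$(( ($(date +%s) - $1) * 1000 ))"
}

# Usage (last 30 minutes):
#   aws logs filter-log-events \
#     --log-group-name /v1-orcha/application \
#     --start-time "$(ms_ago 1800)" \
#     --filter-pattern '"SES email"'
```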

Problem: DLQ Has Messages

Alarm: v1-orcha-email-acquire-dlq-not-empty

Step 1: Inspect DLQ Messages

# Receive messages without deleting them (each received message stays hidden until its visibility timeout expires)
aws sqs receive-message \
  --queue-url https://sqs.eu-central-1.amazonaws.com/$(aws sts get-caller-identity --query Account --output text)/v1-orcha-global-email-acquire-dlq \
  --max-number-of-messages 5
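
The next step needs the S3 key carried by each DLQ message. Assuming the message body is a standard S3 event notification (suggested by the S3 Event Notification check in this runbook, but not confirmed here), the key sits at Records[].s3.object.key. The following is a rough sketch using grep and sed. `s3_keys_from_body` and `$DLQ_URL` are hypothetical names, and a JSON-aware tool would be more robust where one is installed.

```shell
# Hypothetical helper: read message bodies on stdin, print each S3 object key.
# Assumes the S3 event notification layout, in which the object key is the
# only field named "key". Keys in S3 events are URL-encoded.
s3_keys_from_body() {
  grep -o '"key"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed -E 's/^"key"[[:space:]]*:[[:space:]]*"//; s/"$//'
}

# Usage (DLQ_URL is a placeholder for the DLQ queue URL):
#   aws sqs receive-message --queue-url "$DLQ_URL" \
#     --max-number-of-messages 5 \
#     --query 'Messages[].Body' --output text | s3_keys_from_body
```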

Step 2: Check Corresponding Logs

Search for the S3 key from the DLQ message:

aws logs filter-log-events \
  --log-group-name /v1-orcha/application \
  --start-time $(($(date +%s) - 86400))000 \
  --filter-pattern '"{s3-key-from-message}"'

Step 3: Common Causes

Error                      Cause               Resolution
MIME parse exception       Malformed email     Check raw .eml in S3, may need manual processing
Database connection error  Transient DB issue  Reprocess by moving from DLQ to main queue
Unknown sender rejection   Expected behavior   Verify sender should be registered

Step 4: Reprocess or Discard

To reprocess: Move message back to main queue

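# Redrive all DLQ messages to the main queue (the SQS console offers the same redrive)
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:eu-central-1:$(aws sts get-caller-identity --query Account --output text):v1-orcha-global-email-acquire-dlq \
  --destination-arn arn:aws:sqs:eu-central-1:$(aws sts get-caller-identity --query Account --output text):v1-orcha-global-email-acquire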

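To discard: Purge the DLQ after investigation. This deletes every message in the queue and cannot be undone.

aws sqs purge-queue \
  --queue-url https://sqs.eu-central-1.amazonaws.com/$(aws sts get-caller-identity --query Account --output text)/v1-orcha-global-email-acquire-dlq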
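
Because a purge removes every message, it can help to put a typed confirmation in front of it. This is a minimal sketch; `confirm_queue_name` is a hypothetical helper and not part of the AWS CLI.

```shell
# Hypothetical guard: succeed only if the operator types the queue name exactly.
confirm_queue_name() {
  expected="$1"
  printf 'Type the queue name to confirm purge (%s): ' "$expected" >&2
  IFS= read -r typed || return 1
  [ "$typed" = "$expected" ]
}

# Usage (DLQ_URL is a placeholder for the DLQ queue URL):
#   confirm_queue_name v1-orcha-global-email-acquire-dlq &&
#     aws sqs purge-queue --queue-url "$DLQ_URL"
```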

Problem: High Queue Latency

Alarm: v1-orcha-email-acquire-latency (> 60 seconds)

Step 1: Check Queue Depth

aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-central-1.amazonaws.com/$(aws sts get-caller-identity --query Account --output text)/v1-orcha-global-email-acquire \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible

Step 2: Check Worker Health

# Via SSM
aws ssm start-session --target i-{instance-id}

# On instance
docker logs orcha-app 2>&1 | grep -i "acquisition\|email" | tail -50

Step 3: Check for Stuck Processing

Query database for processing records:

SELECT s3_object_key, status, processed_at, error_reason
FROM ap_doc_source_ses_processed
WHERE processed_at > now() - interval '1 hour'
ORDER BY processed_at DESC
LIMIT 20;

Step 4: Resolution

Cause                 Resolution
Worker not running    Restart container: docker restart orcha-app
DB connection issues  Check RDS status, connection pool
High volume spike     Scale up or wait for processing to catch up
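
The two queue attributes from Step 1 already narrow down the cause. ApproximateNumberOfMessages counts messages waiting to be received, and ApproximateNumberOfMessagesNotVisible counts messages a consumer has received but not yet deleted. The following is a rough triage sketch. `triage_backlog` is a hypothetical name, and the mapping from counts to causes is a heuristic, not something taken from the alarm definition.

```shell
# Hypothetical triage helper.
# $1 = ApproximateNumberOfMessages (waiting)
# $2 = ApproximateNumberOfMessagesNotVisible (in flight)
triage_backlog() {
  waiting="$1"
  in_flight="$2"
  if [ "$waiting" -eq 0 ] && [ "$in_flight" -eq 0 ]; then
    echo "queue empty: no backlog"
  elif [ "$in_flight" -eq 0 ]; then
    echo "messages waiting, none in flight: check that the worker is running"
  elif [ "$waiting" -eq 0 ]; then
    echo "only in-flight messages: worker is consuming, check for slow or stuck processing"
  else
    echo "messages waiting and in flight: worker is consuming, likely volume spike or slow processing"
  fi
}
```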

Problem: Customer Reports Missing Email

Step 1: Gather Information

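Ask the customer for the details the following steps search on: the sender's email address and the approximate time the email was sent.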

Step 2: Check SES Receipt

Search CloudWatch for SES logs:

aws logs filter-log-events \
  --log-group-name /aws/ses/mail.prod.getorcha.com \
  --start-time $(($(date +%s) - 86400))000 \
  --filter-pattern '"sender@example.com"'

Step 3: Check S3 Bucket

# List emails from time window
aws s3api list-objects-v2 \
  --bucket v1-orcha-ses-emails-$(aws sts get-caller-identity --query Account --output text) \
  --query "Contents[?LastModified>='2026-01-20T10:00:00']"

Step 4: Check Processing Table

SELECT *
FROM ap_doc_source_ses_processed
WHERE sender_email = 'sender@example.com'
AND processed_at > now() - interval '7 days'
ORDER BY processed_at DESC;

Step 5: Diagnosis Tree

Email not in SES logs?
└── Customer forwarding not configured or failed
    → Check customer's mail rules

Email in S3 but not processed?
└── SQS message failed or pending
    → Check queue depth, DLQ

Email processed with status "rejected"?
└── Unknown sender
    → Register sender in ap_doc_source_ses

Email processed with status "processed" but no documents?
└── Triage rejected as non-invoice
    → Check triage result in ingestion table
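
The tree above can also be written as a small function, which is handy when recording findings in a ticket. This is only a sketch; `diagnose_missing_email` and its yes/no argument convention are hypothetical, and the messages restate the branches above.

```shell
# Hypothetical encoding of the diagnosis tree.
# $1 = email found in SES logs (yes/no)
# $2 = object present in S3 (yes/no)
# $3 = status in ap_doc_source_ses_processed (rejected/processed/none)
# $4 = documents created (yes/no)
diagnose_missing_email() {
  if [ "$1" = "no" ]; then
    echo "Forwarding not configured or failed: check customer's mail rules"
  elif [ "$2" = "yes" ] && [ "$3" = "none" ]; then
    echo "SQS message failed or pending: check queue depth and DLQ"
  elif [ "$3" = "rejected" ]; then
    echo "Unknown sender: register sender in ap_doc_source_ses"
  elif [ "$3" = "processed" ] && [ "$4" = "no" ]; then
    echo "Triage rejected as non-invoice: check triage result in ingestion table"
  else
    echo "No branch of the tree matched"
  fi
}
```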

Problem: Unknown Sender Rejections

Step 1: Find Rejection Logs

aws logs filter-log-events \
  --log-group-name /v1-orcha/application \
  --start-time $(($(date +%s) - 86400))000 \
  --filter-pattern '"Unknown sender for SES email"'

Step 2: Register Sender

-- 1. Find or create ap_doc_source for tenant
-- 2. Insert ap_doc_source_ses mapping
INSERT INTO ap_doc_source_ses (doc_source_id, sender_email, ses_receiving_address)
VALUES (
    'doc-source-uuid',
    'newsender@supplier.com',
    'documents@mail.prod.getorcha.com'
);

Step 3: Reprocess if Needed

If the email is still in S3 and its processing record has status 'error':

-- Delete error record to allow reprocessing
DELETE FROM ap_doc_source_ses_processed
WHERE sender_email = 'newsender@supplier.com'
AND status = 'error';

Then trigger reprocessing by re-sending the S3 event, or wait for the next email from that sender.
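
One possible way to re-send the event is to post a minimal S3 event body to the main queue. This sketch rests on two unconfirmed assumptions: that the worker consumes S3 event notifications from the queue, and that it reads only the bucket name and object key. `s3_event_body`, `$QUEUE_URL`, `$BUCKET`, and `$KEY` are hypothetical names. Check the worker's actual message handling before relying on it.

```shell
# Hypothetical helper: print a minimal S3 "ObjectCreated:Put" event body.
# Assumption: the worker reads only Records[].s3.bucket.name and
# Records[].s3.object.key; real S3 events carry more fields.
# Bucket and key must not contain double quotes or backslashes.
s3_event_body() {
  printf '{"Records":[{"eventSource":"aws:s3","eventName":"ObjectCreated:Put","s3":{"bucket":{"name":"%s"},"object":{"key":"%s"}}}]}\n' "$1" "$2"
}

# Usage (QUEUE_URL, BUCKET and KEY are placeholders):
#   aws sqs send-message --queue-url "$QUEUE_URL" \
#     --message-body "$(s3_event_body "$BUCKET" "$KEY")"
```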


Verification Commands

Verify Alarm Status

aws cloudwatch describe-alarms \
  --alarm-names "v1-orcha-email-acquire-dlq-not-empty" "v1-orcha-email-acquire-latency" \
  --query 'MetricAlarms[*].[AlarmName,StateValue]'
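
With --output text added, the query above prints one tab-separated line per alarm (name, then state). A small filter can then list only the alarms that are not in the OK state. This is a sketch; `alarms_not_ok` is a hypothetical name.

```shell
# Hypothetical filter: read "AlarmName<TAB>StateValue" lines on stdin and
# print the names of alarms whose state is not OK.
alarms_not_ok() {
  awk -F '\t' '$2 != "OK" { print $1 }'
}

# Usage:
#   aws cloudwatch describe-alarms \
#     --alarm-names "v1-orcha-email-acquire-dlq-not-empty" "v1-orcha-email-acquire-latency" \
#     --query 'MetricAlarms[*].[AlarmName,StateValue]' --output text | alarms_not_ok
```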

Verify SES Receipt Rules

aws ses describe-active-receipt-rule-set
aws ses describe-receipt-rule-set --rule-set-name v1-orcha-prod-inbound

Verify S3 Event Notification

aws s3api get-bucket-notification-configuration \
  --bucket v1-orcha-ses-emails-$(aws sts get-caller-identity --query Account --output text)

Useful Queries

Processing Statistics (Last 24h)

SELECT
    status,
    COUNT(*) as count,
    COUNT(DISTINCT sender_email) as unique_senders
FROM ap_doc_source_ses_processed
WHERE processed_at > now() - interval '24 hours'
GROUP BY status;

Error Distribution

SELECT
    error_reason,
    COUNT(*) as count
FROM ap_doc_source_ses_processed
WHERE status IN ('rejected', 'error')
AND processed_at > now() - interval '7 days'
GROUP BY error_reason
ORDER BY count DESC;

Top Senders

SELECT
    sender_email,
    COUNT(*) as email_count
FROM ap_doc_source_ses_processed
WHERE processed_at > now() - interval '30 days'
GROUP BY sender_email
ORDER BY email_count DESC
LIMIT 10;