Orcha AWS Infrastructure

AWS CDK infrastructure for the Orcha document processing system.

Architecture

                                    +---------------------------+
                                    |      ALB (HTTPS:443)      |
Internet -------------------------->|   app.prod.getorcha.com   |
                                    +-------------+-------------+
                                                  |
                                                  | Port 8888
                                                  v
+---------------------------------------------------------------------------------+
|                         Public Subnet (AZ-A, AZ-B)                              |
|                                                                                 |
|    +---------------------------------------------------------------------+      |
|    |                    Auto Scaling Group (min=max=1)                   |      |
|    |                         EC2 t4g.medium                              |      |
|    |                                                                     |      |
|    |   docker-compose                                                    |      |
|    |   +---------------------+    +---------------------+                |      |
|    |   |   App Container     |    |  Worker Container   |                |      |
|    |   |   (port 8888)       |    |  (polls SQS)        |                |      |
|    |   +---------------------+    +---------------------+                |      |
|    +---------------------------------------------------------------------+      |
|                                           |                                     |
+-------------------------------------------+-------------------------------------+
                                            |
+-------------------------------------------+-------------------------------------+
|                         Private Subnet (AZ-A, AZ-B)                             |
|                                           |                                     |
|                                           v                                     |
|                                  +----------------+                             |
|                                  |      RDS       |                             |
|                                  |   PostgreSQL   |                             |
|                                  +----------------+                             |
+---------------------------------------------------------------------------------+

CI/CD Pipeline:
+----------+    +-------------+    +-----------+    +---------+    +------------+
|  GitHub  |--->| CodePipeline|--->| CodeBuild |--->| Approve |--->| CodeDeploy |
|  (push)  |    |             |    | (test +   |    | (manual)|    | (to ASG)   |
+----------+    +-------------+    |  build)   |    +---------+    +------------+
                                   +-----------+
                                        |
                                        v
                                   +---------+
                                   |   ECR   |
                                   | (images)|
                                   +---------+

Stacks

Infrastructure is organized into four CDK stacks, deployed in dependency order:

Stack            Resources                                                         Dependencies
---------------  ----------------------------------------------------------------  -------------------------
FoundationStack  VPC, subnets, security groups, S3, SQS, ECR, Route53 hosted zone  None
DataStack        RDS PostgreSQL, Secrets Manager                                   Foundation
ComputeStack     ALB, ASG, ACM certificate, Route53 record, IAM roles              Foundation, Data
OpsStack         CloudWatch, SNS, CodePipeline, CodeBuild, CodeDeploy, Budget      Foundation, Data, Compute
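
The dependency column determines the deploy order. As a minimal illustration only (this is not code from the repo, and `cdk deploy --all` works out the order itself), the same ordering can be derived with Python's standard `graphlib`. The `DEPENDENCIES` mapping and both function names are made up for this sketch:

```python
from graphlib import TopologicalSorter

# Stack -> stacks it depends on, copied from the Stacks table.
DEPENDENCIES = {
    "FoundationStack": set(),
    "DataStack": {"FoundationStack"},
    "ComputeStack": {"FoundationStack", "DataStack"},
    "OpsStack": {"FoundationStack", "DataStack", "ComputeStack"},
}


def deploy_order(dependencies):
    """Return stack names so that every stack comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def destroy_order(dependencies):
    """Teardown runs in the opposite direction: dependents first."""
    return list(reversed(deploy_order(dependencies)))


if __name__ == "__main__":
    print(" -> ".join(deploy_order(DEPENDENCIES)))
```

Because each stack depends on all earlier ones, there is exactly one valid order, which is why a single stack can only be deployed on its own once the stacks it depends on exist.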

Deploy all stacks:

source .venv/bin/activate
AWS_PROFILE=orcha-prod cdk deploy --all --context env_name=prod

Resources

Networking

Resource            Name/Value
------------------  ----------------------------
VPC CIDR            10.0.0.0/16
Region              eu-central-1
Availability Zones  eu-central-1a, eu-central-1b
NAT Gateway         None (EC2 in public subnet)

Security Groups:

Storage

Resource               Name
---------------------  ----------------------------------------
S3 Bucket (documents)  v1-orcha-global-storage-{account_id}
S3 Bucket (pipeline)   v1-orcha-pipeline-artifacts-{account_id}
ECR Repository         v1-orcha (keeps last 5 images)
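
The "keeps last 5 images" rule corresponds to an ECR lifecycle policy. The sketch below builds such a policy document in Python, assuming the standard ECR lifecycle policy JSON shape. The function name is hypothetical, and the policy that CDK actually generates for `v1-orcha` may differ in priority or description:

```python
import json


def keep_last_images_policy(count=5):
    """Build an ECR lifecycle policy that expires all but the newest `count` images."""
    return {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep only the last {count} images",
                "selection": {
                    "tagStatus": "any",  # applies to tagged and untagged images
                    "countType": "imageCountMoreThan",
                    "countNumber": count,
                },
                "action": {"type": "expire"},
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(keep_last_images_policy(), indent=2))
```

Raising retention to 10 images, as listed under Future Improvements, would then be `keep_last_images_policy(10)`.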

Queues

Queue                              Purpose              Visibility Timeout / Retention
---------------------------------  -------------------  ------------------------------
v1-orcha-global-ingest             Document processing  600s (10 min)
v1-orcha-global-ingest-dlq         Failed documents     14-day retention
v1-orcha-global-email-acquire      Email acquisition    300s (5 min)
v1-orcha-global-email-acquire-dlq  Failed emails        14-day retention
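
The visibility timeout matters because of how a polling worker acknowledges messages. The real worker is part of the application, not this repo, so the following is only a Python sketch of the pattern. `poll_once` and `handler` are hypothetical names, and `sqs` is assumed to behave like a boto3 SQS client:

```python
def poll_once(sqs, queue_url, handler, wait_seconds=20, max_messages=10):
    """Long-poll the queue once; return how many messages were handled successfully.

    `sqs` is expected to expose receive_message/delete_message like a boto3
    SQS client. `handler` receives the message body and raises on failure.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows 1 to 10
        WaitTimeSeconds=wait_seconds,      # long polling, at most 20 seconds
    )
    handled = 0
    # The "Messages" key is absent when the poll returns nothing.
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Do not delete: the message becomes visible again once the
            # visibility timeout expires and SQS redelivers it.
            continue
        # Deleting is the acknowledgement; it must happen before the timeout.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

Under this pattern, processing one document has to finish within the 600s timeout or the message is delivered a second time. Messages that keep failing end up in the DLQ only after the queue's redrive limit is reached; that limit is not stated in this README.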

Database

Attribute            Value
-------------------  --------------------------------
Instance             v1-orcha-db
Engine               PostgreSQL 18.1
Instance type        db.t4g.medium
Storage              30 GB gp3 (autoscales to 100 GB)
Multi-AZ             No
Backup retention     14 days
Deletion protection  Yes

Credentials stored in Secrets Manager: /v1-orcha/db-credentials
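
As a hedged sketch of how a client might turn that secret into a connection URL: the code below assumes the secret is a JSON document with `username`, `password`, `host`, `port` and `dbname` keys, which is the usual layout for RDS-generated secrets but is not confirmed by this README. `database_url` is a hypothetical helper, and `secrets_client` is assumed to behave like a boto3 Secrets Manager client:

```python
import json
from urllib.parse import quote


def database_url(secrets_client, secret_id="/v1-orcha/db-credentials"):
    """Fetch the DB secret and format it as a PostgreSQL connection URL."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])  # assumed key layout, see above
    # Percent-encode credentials so characters like '@' or '/' cannot break the URL.
    user = quote(secret["username"], safe="")
    password = quote(secret["password"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{secret['host']}:{secret['port']}/{secret['dbname']}"
    )
```

With boto3 installed this could be called as `database_url(boto3.client("secretsmanager", region_name="eu-central-1"))`. The database sits in a private subnet, so the resulting URL is only reachable from inside the VPC.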

Compute

Attribute      Value
-------------  ----------------------
Instance type  t4g.medium (ARM64)
AMI            Amazon Linux 2023
ASG capacity   min=1, max=1
Health check   ELB, 300s grace period

IAM Role (v1-orcha-service-role):

DNS & SSL

Resource     Value
-----------  -------------------------------
Hosted zone  prod.getorcha.com
A record     app.prod.getorcha.com → ALB
Certificate  ACM, DNS-validated, auto-renews

CI/CD Pipeline

Pipeline: v1-orcha-deploy

Stage    Action
-------  ------------------------------------------------------------
Source   GitHub (CodeConnections) on master branch push
Build    CodeBuild: run tests, build uberjar, build/push Docker image
Approve  Manual approval (SNS notification)
Deploy   CodeDeploy to ASG (AllAtOnce, auto-rollback on failure)

CodeBuild (v1-orcha-build):

CodeDeploy (v1-orcha / v1-orcha-production):

Monitoring

CloudWatch Alarms (10 total):

Alarm                                 Condition                    Severity
------------------------------------  ---------------------------  -----------
v1-orcha-alb-unhealthy                HealthyHostCount < 1         Critical
v1-orcha-ec2-status-check             GroupInServiceInstances < 1  Critical
v1-orcha-rds-no-connections           DatabaseConnections = 0      Critical
v1-orcha-ingest-dlq-not-empty         DLQ messages > 0             Critical
v1-orcha-email-acquire-dlq-not-empty  DLQ messages > 0             Critical
v1-orcha-ec2-high-cpu                 CPU > 80%                    Operational
v1-orcha-rds-high-cpu                 CPU > 80%                    Operational
v1-orcha-rds-low-storage              Free storage < 5 GB          Operational
v1-orcha-rds-high-connections         Connections > 100            Operational
v1-orcha-cert-expiring                Days to expiry < 14          Operational

SNS Topic: v1-orcha-alerts (email subscription)

Cost Controls:

Log Groups (30-day retention):

Deployment

First-Time Setup

  1. Bootstrap CDK:

    AWS_PROFILE=orcha-prod cdk bootstrap aws://700558745280/eu-central-1 --context env_name=prod
    
  2. Deploy all stacks:

    AWS_PROFILE=orcha-prod cdk deploy --all --context env_name=prod
    
  3. Delegate subdomain (in management account):

    ./scripts/delegate-subdomain.sh prod <HOSTED_ZONE_ID>
    
  4. Complete GitHub connection:

  5. Confirm SNS subscription (check email)

  6. Update SSM parameters:

    ./scripts/update-secrets.sh --from-file secrets
    

Subsequent Deployments

A push to the master branch triggers the pipeline automatically; the Deploy stage still waits for the manual approval step.

Manual deployment:

AWS_PROFILE=orcha-prod cdk deploy --all --context env_name=prod

Instance Access

No SSH. Use SSM Session Manager:

# Get the ID of the running instance (skips instances recently terminated by the ASG)
INSTANCE_ID=$(AWS_PROFILE=orcha-prod aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=v1-orcha-app" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].InstanceId" --output text)

# Start session
AWS_PROFILE=orcha-prod aws ssm start-session --target "$INSTANCE_ID"

# Port forward REPL (9878)
AWS_PROFILE=orcha-prod aws ssm start-session --target "$INSTANCE_ID" \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["9878"],"localPortNumber":["9878"]}'
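
The `--parameters` JSON is easy to mis-quote in a shell. As an optional sketch, a small Python helper can build the same port-forwarding command as an argument list, so that no shell quoting is involved. `port_forward_command` is a hypothetical name that does not exist in this repo; the profile and port defaults are taken from the commands above:

```python
import json


def port_forward_command(instance_id, remote_port=9878, local_port=None, profile="orcha-prod"):
    """Build the argv for an SSM port-forwarding session to `instance_id`."""
    local_port = remote_port if local_port is None else local_port
    # The SSM document expects every parameter value as a list of strings.
    parameters = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "--profile", profile,
        "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(parameters),
    ]
```

It can be run with `subprocess.run(port_forward_command(instance_id), check=True)`, where `instance_id` is the value looked up above. Passing `local_port=19878` forwards the REPL to a different local port if 9878 is already in use.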

Tags

Applied to all resources:

Key          Value
-----------  -----
Project      orcha
Environment  prod
ManagedBy    cdk

Future Improvements

Items identified but not yet implemented:

From Original Specification

Item                    Priority  Description
----------------------  --------  ------------------------------------------------------------------------------------------------------
Migration Test          Medium    CodeBuild v1-orcha-migration-test project to test schema changes against RDS snapshot before deploying
SQS Backlog Alarm       Low       v1-orcha-ingest-backlog alarm when queue > 500 messages
Pipeline Notifications  Low       SNS notifications for pipeline success/failure events (beyond manual approval)
Developer SSM Policy    Low       v1-orcha-developer-ssm-access IAM managed policy for team access

Infrastructure Enhancements

Item                       Cost/Month  Priority  When / Why
-------------------------  ----------  --------  ---------------------------------------------
S3 Gateway Endpoint        Free        Medium    Security improvement, keeps S3 traffic in AWS
ECR retention → 10 images  +$1         Medium    Better rollback capability
WAF                        +$15        Medium    When handling sensitive customer data
VPC Interface Endpoints    +$122       Low       Security, compliance requirements
HA (min=2 instances)       +$30        Low       When SLA commitments needed
NAT Gateway                +$38        Low       Compliance (move EC2 to private subnet)
RDS Multi-AZ               +$60        Low       When SLA commitments needed

Operational Improvements

Item                         Description
---------------------------  ----------------------------------------------------------
Auto DB Credential Rotation  EventBridge + Lambda to rotate credentials and restart app
Per-Tenant Resource Scaling  Dedicated SQS queues and workers for large tenants
Multi-Environment Support    Config-driven stack parameters for dev/staging/prod

Current Trade-offs (Accepted for MVP)


File Structure

./
├── app.py                    # CDK app entry point
├── cdk.json                  # CDK configuration
├── requirements.txt          # Python dependencies
├── stacks/
│   ├── foundation_stack.py   # VPC, S3, SQS, ECR, Route53
│   ├── data_stack.py         # RDS, Secrets Manager
│   ├── compute_stack.py      # ALB, ASG, ACM, Route53 record
│   └── ops_stack.py          # CI/CD, monitoring, alerts
├── runbooks/
│   ├── deploy.md             # Deployment instructions
│   ├── bootstrap-cdk.md      # CDK bootstrap details
│   └── update-secrets.md     # SSM parameter updates
└── scripts/
    ├── delegate-subdomain.sh # NS delegation helper
    └── update-secrets.sh     # SSM parameter updates