Future Infrastructure Improvements

This document tracks infrastructure improvements to implement as the project scales and generates revenue.


Current Trade-offs (Accepted for MVP)

Single Instance / Deployment Downtime

Public Subnet for EC2

See Risk Assessment below.


Completed Improvements

None yet


Planned Improvements

1. High Availability (When Revenue Justifies)

Current Cost: ~$30/month (1x t4g.medium)
HA Cost: ~$60/month (2x t4g.medium)
Additional Cost: +$30/month

Changes required:

Total HA upgrade: +$90/month (+$30 for the second instance, +$60 for RDS Multi-AZ; see Cost Summary)


2. VPC Endpoints

Priority: Medium (security + minor latency improvement)

Endpoint                  | Type              | Cost/month
S3                        | Gateway           | FREE
SQS                       | Interface (2 AZs) | ~$17.50
Secrets Manager           | Interface (2 AZs) | ~$17.50
ECR API                   | Interface (2 AZs) | ~$17.50
ECR Docker                | Interface (2 AZs) | ~$17.50
SSM (for Session Manager) | Interface (2 AZs) | ~$17.50
SSM Messages              | Interface (2 AZs) | ~$17.50
EC2 Messages              | Interface (2 AZs) | ~$17.50

Recommended first step: Add S3 Gateway Endpoint (free) immediately.

Full implementation: ~$122/month for all seven interface endpoints (7 x ~$17.50)

Note: Data processing adds $0.01/GB through interface endpoints, but this is typically less than NAT Gateway costs for the same traffic.
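
The trade-off in the note above can be checked with a little arithmetic. This is a rough sketch using only the prices quoted in this document; the function names are illustrative, and the comparison only matters once the instance sits in a private subnet, where a NAT Gateway would otherwise carry this traffic.

```python
# Prices copied from this document; adjust for your region before relying on them.
INTERFACE_ENDPOINT_MONTHLY = 17.50  # per interface endpoint, across 2 AZs
INTERFACE_ENDPOINTS = [
    "SQS", "Secrets Manager", "ECR API", "ECR Docker",
    "SSM", "SSM Messages", "EC2 Messages",
]
ENDPOINT_PER_GB = 0.01   # interface endpoint data processing
NAT_MONTHLY = 38.00      # NAT Gateway fixed cost
NAT_PER_GB = 0.052       # NAT Gateway data processing


def endpoints_monthly(gb: float) -> float:
    """Monthly cost of all interface endpoints for `gb` of AWS-service traffic."""
    fixed = len(INTERFACE_ENDPOINTS) * INTERFACE_ENDPOINT_MONTHLY
    return fixed + gb * ENDPOINT_PER_GB


def nat_monthly(gb: float) -> float:
    """Monthly cost of sending the same traffic through a NAT Gateway."""
    return NAT_MONTHLY + gb * NAT_PER_GB


def break_even_gb() -> float:
    """Traffic volume at which endpoints and NAT cost the same per month."""
    fixed_gap = len(INTERFACE_ENDPOINTS) * INTERFACE_ENDPOINT_MONTHLY - NAT_MONTHLY
    return fixed_gap / (NAT_PER_GB - ENDPOINT_PER_GB)


if __name__ == "__main__":
    for gb in (100, 1000, 5000):
        print(gb, round(endpoints_monthly(gb), 2), round(nat_monthly(gb), 2))
    print("break-even GB/month:", round(break_even_gb()))
```

Under these figures the per-GB saving only outweighs the higher fixed cost at roughly 2 TB/month of AWS-service traffic, which is consistent with rating the interface endpoints as a security improvement rather than a cost saving.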


3. WAF (Web Application Firewall)

Priority: Medium (once handling sensitive customer data)

Component                              | Cost/month
Web ACL                                | $5.00
AWS Managed Rules - Common (baseline)  | $1.00
AWS Managed Rules - SQLi               | $1.00
AWS Managed Rules - Known Bad Inputs   | $1.00
Request charges (5M requests)          | $3.00

Minimum WAF setup: ~$11/month
Recommended setup (4-5 rule groups): ~$15/month

Consider adding when:


4. ECR Retention Policy

Current: 5 images retained
Consideration: Increase to 10-15 for better rollback capability

Cost impact:

Recommendation: Increase to 10 images. Negligible cost, better rollback safety.
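
For reference, a retention rule of this kind is expressed as an ECR lifecycle policy document. The sketch below builds one for a configurable image count. The helper name and description text are illustrative, and it assumes the rule should apply to all images regardless of tag status, which this document does not specify.

```python
import json


def retention_policy(keep: int) -> str:
    """Return an ECR lifecycle policy (JSON text) that expires all but the newest `keep` images."""
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep only the {keep} most recent images",
                "selection": {
                    "tagStatus": "any",  # assumption: tagged and untagged images alike
                    "countType": "imageCountMoreThan",
                    "countNumber": keep,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(retention_policy(10))
```

The resulting JSON can be attached to the repository through whichever tool manages it (CDK, CLI, or console).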


5. NAT Gateway

Current: No NAT Gateway (EC2 in public subnet)
Cost if added: ~$38/month + $0.052/GB processed

See Risk Assessment for why this is acceptable for now.


Public Subnet Risk Assessment

Current Architecture

EC2 instance in public subnet with:

Actual Risks

Risk                       | Severity | Mitigation                                    | Residual Risk
Direct attack on EC2       | Low      | SG blocks all ports except 8888 from ALB only | Minimal - no exposed ports
Instance metadata exposure | Low      | IMDSv2 required (token-based)                 | Minimal
Outbound data exfiltration | Medium   | Would need to compromise app first            | Acceptable
AWS API credential theft   | Low      | Instance role with scoped permissions         | Acceptable

What Public Subnet Actually Exposes

  1. Network path exists from internet to instance (but SG blocks it)
  2. Public IP is visible (but no services listening except via ALB)
  3. Outbound traffic goes directly to internet (vs through NAT)

What It Does NOT Expose

Comparison: Public Subnet vs NAT Gateway

Aspect                 | Public Subnet       | Private + NAT
Monthly cost           | $0                  | ~$38
Attack surface         | SG-protected        | Identical (SG still primary defense)
Compliance             | May fail some audits | Preferred for SOC2/HIPAA
Operational complexity | Lower               | Higher (NAT is SPOF unless HA)

Recommendation

Keep public subnet for MVP. The security group is the primary defense in both architectures. Move to private subnet + NAT when:


Migration Testing Strategy

Current Plan

Run migration tests only on schema-changing merges, not every deployment.

Detection

Add a CodeBuild step that:

  1. Checks whether any resources/migrations/*.sql files changed
  2. If yes, triggers the migration test pipeline
  3. If no, skips directly to deploy
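
The detection step above could be sketched as follows. It is a minimal illustration, not the project's actual build script: it assumes the CodeBuild step has a git checkout available, and the function names and the "migration-test" / "deploy" labels are hypothetical.

```python
import fnmatch
import subprocess

MIGRATION_GLOB = "resources/migrations/*.sql"


def changed_files(base_ref: str, head_ref: str = "HEAD") -> list[str]:
    """List files changed between two refs (requires a git checkout)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...{head_ref}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def has_migration_changes(paths: list[str]) -> bool:
    """True if any changed path is a migration file.

    Note: fnmatch's '*' also matches '/', so files in subdirectories of
    resources/migrations/ count as migration changes too.
    """
    return any(fnmatch.fnmatchcase(path, MIGRATION_GLOB) for path in paths)


def next_stage(paths: list[str]) -> str:
    """Decide which pipeline stage to run next."""
    return "migration-test" if has_migration_changes(paths) else "deploy"
```

Keeping the decision in a pure function (`next_stage`) separate from the git call makes it easy to unit test without a repository.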

Cost Impact


Stack Consolidation Proposal

Current: 9 Stacks

NetworkStack → StorageStack → QueueStack → EcrStack → DatabaseStack
                    ↓
               DnsStack → ComputeStack → MonitoringStack → CicdStack

Proposed: 4 Stacks

1. FoundationStack (replaces Network + Storage + Queue + ECR)

2. DataStack (replaces Database)

3. ComputeStack (replaces DNS + Compute)

4. OpsStack (replaces Monitoring + CICD)
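
Before starting the migration, it may help to write the mapping down in a form that can be checked mechanically. The sketch below encodes the proposal above and verifies that each of the nine current stacks has exactly one target; the helper names are illustrative.

```python
CURRENT_STACKS = [
    "NetworkStack", "StorageStack", "QueueStack", "EcrStack", "DatabaseStack",
    "DnsStack", "ComputeStack", "MonitoringStack", "CicdStack",
]

# Proposed stack -> current stacks it replaces (taken from the list above).
CONSOLIDATION = {
    "FoundationStack": ["NetworkStack", "StorageStack", "QueueStack", "EcrStack"],
    "DataStack": ["DatabaseStack"],
    "ComputeStack": ["DnsStack", "ComputeStack"],
    "OpsStack": ["MonitoringStack", "CicdStack"],
}


def target_for(old_stack: str) -> str:
    """Return the proposed stack that absorbs `old_stack`."""
    targets = [new for new, olds in CONSOLIDATION.items() if old_stack in olds]
    if len(targets) != 1:
        raise ValueError(f"{old_stack} maps to {len(targets)} targets")
    return targets[0]


def unmapped() -> list[str]:
    """Current stacks that the proposal does not cover."""
    covered = {old for olds in CONSOLIDATION.values() for old in olds}
    return [stack for stack in CURRENT_STACKS if stack not in covered]
```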

Benefits

Migration Path

  1. Create new consolidated stacks
  2. Import existing resources (CDK resource importing)
  3. Delete old stacks

Effort: ~4-6 hours
Risk: Medium (resource importing can be tricky)
Recommendation: Do this before first production deployment, not after


Cost Summary

Improvement             | Monthly Cost | Priority | When
S3 Gateway Endpoint     | FREE         | High     | Now
ECR retention 10 images | +$1.00       | Medium   | Now
Stack consolidation     | $0           | Medium   | Before prod
WAF                     | +$15         | Medium   | With customers
VPC Interface Endpoints | +$122        | Low      | With revenue
HA (2 instances)        | +$30         | Low      | With SLA needs
NAT Gateway             | +$38         | Low      | With compliance
RDS Multi-AZ            | +$60         | Low      | With SLA needs
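
To see how the bill grows as each trigger is reached, the table above can be totalled per "When" column. This is a small sketch using only the figures in the table; the names are illustrative.

```python
from collections import defaultdict

# (improvement, added monthly cost in USD, trigger) copied from the table above.
ROWS = [
    ("S3 Gateway Endpoint", 0, "Now"),
    ("ECR retention 10 images", 1, "Now"),
    ("Stack consolidation", 0, "Before prod"),
    ("WAF", 15, "With customers"),
    ("VPC Interface Endpoints", 122, "With revenue"),
    ("HA (2 instances)", 30, "With SLA needs"),
    ("NAT Gateway", 38, "With compliance"),
    ("RDS Multi-AZ", 60, "With SLA needs"),
]


def cost_by_trigger(rows):
    """Sum the added monthly cost for each trigger condition."""
    totals = defaultdict(int)
    for _name, cost, when in rows:
        totals[when] += cost
    return dict(totals)


def total_cost(rows):
    """Added monthly cost if every improvement is implemented."""
    return sum(cost for _name, cost, _when in rows)


if __name__ == "__main__":
    for when, cost in cost_by_trigger(ROWS).items():
        print(f"{when}: +${cost}/month")
    print(f"All improvements: +${total_cost(ROWS)}/month")
```

The "With SLA needs" total combines the second instance and RDS Multi-AZ, which is where the +$90/month HA figure comes from.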

Automatic Database Credential Rotation

Current: Manual rotation only. Database credentials are stored in Secrets Manager (/v1-orcha/db-credentials). After manual rotation, restart the application to pick up new credentials.

Future: Enable automatic rotation with application restart.

Implementation

  1. Enable RDS Secrets Manager rotation on the secret
  2. Add an EventBridge rule to catch the Secrets Manager rotation-success event (RotationSucceeded, delivered via CloudTrail)
  3. Trigger application restart via one of:
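
As an illustration of step 2, the rule's event pattern might look like the sketch below. It assumes rotation success arrives as the CloudTrail-delivered RotationSucceeded service event; confirm the exact shape against a real event in the account before relying on it. The `matches` helper is a deliberately tiny local stand-in for EventBridge matching (exact values and nested objects only), included so the pattern can be sanity-checked offline.

```python
import json

# Assumed pattern for "rotation succeeded"; verify against a captured event.
ROTATION_SUCCEEDED_PATTERN = {
    "source": ["aws.secretsmanager"],
    "detail-type": ["AWS Service Event via CloudTrail"],
    "detail": {
        "eventSource": ["secretsmanager.amazonaws.com"],
        "eventName": ["RotationSucceeded"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Minimal subset of EventBridge matching: lists of allowed values, nested dicts."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    # JSON form, suitable for pasting into a rule definition.
    print(json.dumps(ROTATION_SUCCEEDED_PATTERN, indent=2))
```

The pattern as written matches rotation of any secret in the account; narrowing it to the database secret is left open here because the relevant event field is not confirmed in this document.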

Alternative: Split Credentials

If you need rotation without restart:

  1. Change to separate DB_HOST env var + v1-orcha/db-credentials secret (username/password only)
  2. Use the AWS JDBC Driver, which has built-in Secrets Manager integration and auto-refresh
  3. Or implement credential refresh logic in the app (catch auth failures, refetch, rebuild pool)
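
The third option (catch auth failures, refetch, rebuild pool) could take roughly the following shape. This is a language-neutral sketch written in Python, not the application's actual code: `AuthFailure`, `RefreshingPool`, and the two injected callables are hypothetical, standing in for the real driver error, the Secrets Manager lookup, and the connection pool factory.

```python
import threading


class AuthFailure(Exception):
    """Stand-in for the database driver's authentication error (hypothetical)."""


class RefreshingPool:
    """Wraps a connection pool and rebuilds it once when authentication fails."""

    def __init__(self, fetch_credentials, build_pool):
        self._fetch = fetch_credentials  # e.g. reads the secret; injected
        self._build = build_pool         # creates a pool from credentials; injected
        self._lock = threading.Lock()
        self._pool = build_pool(fetch_credentials())

    def _rebuild(self, stale_pool):
        # Only the first thread to notice the stale pool rebuilds it;
        # later threads reuse the pool that thread created.
        with self._lock:
            if self._pool is stale_pool:
                self._pool = self._build(self._fetch())
            return self._pool

    def run(self, operation):
        """Run `operation(pool)`, retrying once with fresh credentials on AuthFailure."""
        pool = self._pool
        try:
            return operation(pool)
        except AuthFailure:
            # A second failure propagates: fresh credentials did not help.
            return operation(self._rebuild(pool))
```

Retrying exactly once keeps a genuinely wrong secret from turning into a retry loop, and a real implementation would also need to close the stale pool.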

Cost


Per-Tenant Resource Scaling

When a large tenant requires dedicated resources to avoid noisy-neighbor issues:

Resources to Create Per Tenant

  1. SQS Queues:

  2. S3 Bucket:

  3. Dedicated Worker Instance(s):

  4. Optional: Dedicated API Keys:

Implementation Steps

  1. Create tenant-specific AWS resources (queues, bucket)
  2. Store resource configuration in database (tenant table or config table):
  3. Update ERP to route work to tenant-specific queue based on tenant config
  4. Deploy dedicated worker instance(s) with tenant-specific config
  5. Monitor and scale worker instances independently
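
Steps 2 and 3 amount to a lookup with a fallback to the shared resources. A minimal sketch, assuming a per-tenant config record with optional overrides; the field names, URLs, and bucket names are placeholders, not the project's real schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TenantResources:
    """Per-tenant overrides; None means 'use the shared resource'."""
    work_queue_url: Optional[str] = None
    bucket: Optional[str] = None


# Placeholder values for the shared (global) resources.
GLOBAL_DEFAULTS = TenantResources(
    work_queue_url="https://sqs.example.invalid/global-work",
    bucket="global-bucket",
)


def resolve(tenant_id: str, overrides: Dict[str, TenantResources]) -> TenantResources:
    """Return the resources to use for a tenant, falling back field by field."""
    tenant = overrides.get(tenant_id, TenantResources())
    return TenantResources(
        work_queue_url=tenant.work_queue_url or GLOBAL_DEFAULTS.work_queue_url,
        bucket=tenant.bucket or GLOBAL_DEFAULTS.bucket,
    )
```

Falling back per field lets a tenant start with only a dedicated queue, or only a dedicated bucket, and move to full isolation later without a routing change.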

Cost Considerations

Resource                      | Per-Tenant Cost
SQS queues (4x)               | ~$0 (pay per message)
S3 bucket                     | ~$0 (pay per storage/request)
Worker instance (t4g.medium)  | ~$30/month
Dedicated API keys            | Depends on usage

Recommendation: Only create dedicated resources when tenant volume justifies the operational overhead. Start with dedicated workers consuming from global queues, escalate to full isolation only if needed.


Revision History

Date       | Change
2026-01-10 | Added automatic database credential rotation section
2026-01-10 | Added per-tenant resource scaling section
2026-01-09 | Initial document created