Production Resilience Hardening — Phase 1
Problem
On 2026-05-16 production served 504 Gateway Timeout to all users for ~2.5 hours. A 23-page PDF uploaded by a user was split into 23 invoices, which the document pipeline processed concurrently and unbounded on the single 4 GiB t4g.medium instance. Memory exhausted; every outbound call (Cognito, Google, SES, Slack, SQS, STS) failed simultaneously at 13:34:47 UTC; the SSM agent dropped at 13:36; the app went silent at 13:37:46.
The failure was then amplified and made non-recoverable by three independent weaknesses:
- Unbounded fan-out. Every worker submits to Executors/newVirtualThreadPerTaskExecutor: no concurrency cap, no backpressure. N invoices = N concurrent heavy tasks (PDF render, OCR, Gemini vision, LLM, matching) regardless of memory budget.
- No-backoff retry loops. The 5 worker poll loops catch errors and immediately re-loop. When the SDK fails fast (resolution failure), the SQS long-poll wait is bypassed, so the loop spins at CPU speed and floods logs (~1 MB/s), pegging CPU at 99% and draining the burstable instance's CPU credits. The resulting throttle then left the box permanently wedged.
- No auto-recovery. The ASG uses HealthChecks.ec2(); EC2 reachability kept passing, so the ASG never replaced the dead app. min=max=desired=1 means no redundancy. The v1-orcha-alb-unhealthy alarm fired (alerting works) but nothing acts on it. Service stayed down until the instance was manually replaced.
We were also blind: the CloudWatch agent ships logs only — there are no memory, swap, or disk metrics, so the memory exhaustion was invisible until inferred from the failure pattern.
The uuid_array PSQLException seen at 13:24:25 is a separate latent bug (the DB was healthy throughout) and is explicitly out of scope here — tracked separately.
Goals & non-goals
Goals
- A burst of N heavy documents (any N) cannot exhaust instance memory.
- A failing dependency degrades gracefully — no CPU-pegging spin, no log flood, no credit drain.
- If the app does die, it dies fast and clean and is automatically recovered within minutes, not hours.
- Memory / swap / disk are observable and alarmed.
- Recurring cost delta held to ~$2–3/month; no speculative instance-size increase.
Non-goals (deferred to Phase 2)
- Multi-instance HA for the web tier.
- Splitting the ERP API and ingestion workers into separate processes/ASGs (already the codebase's stated future direction, see system.clj:127-130).
- Switching the ASG to ELB health checks and the associated CodeDeploy-deadlock rework.
- Fixing the uuid_array SQL bug.
- Increasing the instance size (kept as a metrics-gated lever, see Open questions).
Approach
The core architectural choice is how to bound the document-pipeline fan-out — the precise thing that exhausted memory.
| Option | Bounds aggregate memory? | Change size | Cost |
|---|---|---|---|
| A. Global heavy-work semaphore | Yes — one cap across all workers | Small, localized | $0 |
| B. Per-worker bounded executors | No: 5 workers × N still sum | Medium | $0 |
| C. SQS prefetch / visibility throttling | No — split is internal, post-dequeue | Small | $0 |
Approach A — pros
- Only option that bounds aggregate peak memory — the exact failure mode
- One env-tunable knob, sized to the memory budget
- Smallest, most localized change; $0
- Independent of fan-out source or size (23 or 230 invoices)
Approach A — cons
- A shared component threaded through 5 workers
- Permit count needs a sensible default (measured: see §1.1) + in-prod alarm tuning
Adopt Approach A (global heavy-work semaphore). B is a partial fix (does not bound aggregate memory); C cannot see the internal post-dequeue split that caused the incident.
Design
1 · Bounded fan-out concurrency — the fix
A new shared Integrant component exposes a single fair java.util.concurrent.Semaphore ("heavy-work gate"). Every memory-heavy document task acquires a permit before heavy work and releases it in a finally. Applies to the heavy phases of ap.ingestion, ap.processors.matching.worker, document-output, diagnostics-recompute, and the attachment-download / PDF-split phase of ap.acquisition. Virtual-thread executors are unchanged — they stay cheap for I/O waiting; the gate only caps how many tasks do heavy work at once. A blocked acquire is the backpressure (waiting virtual threads are nearly free).
```clojure
;; shape only; exact namespace decided in the plan
;; assumes java.util.concurrent.Semaphore is imported in the ns form
(defn with-heavy-permit [gate f]
  (.acquire ^Semaphore (:sem gate))
  (try
    (f)
    (finally
      (.release ^Semaphore (:sem gate)))))
```
Permit count: default 3, read from config/env (ORCHA_HEAVY_CONCURRENCY), tunable against the memory alarm without a code change (restart only). This default is empirically grounded, not estimated — see the measurement below.
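A slightly fuller sketch of the gate, for illustration only. It assumes a plain map in place of the Integrant component; `default-permits`, `parse-permits`, and `make-gate` are hypothetical names, and the fallback-to-default behavior on a bad env value is an assumption rather than a decided design.

```clojure
(import '(java.util.concurrent Semaphore))

;; Hypothetical constant mirroring the default stated above.
(def default-permits 3)

(defn parse-permits
  "Parses an ORCHA_HEAVY_CONCURRENCY value. Falls back to the default when
  the value is missing, non-numeric, or not positive (assumed behavior)."
  [s]
  (let [n (try
            (Integer/parseInt (str s))
            (catch NumberFormatException _ nil))]
    (if (and n (pos? n)) n default-permits)))

(defn make-gate
  "Builds the gate. The second Semaphore argument requests fair (FIFO) ordering."
  [permits]
  {:sem (Semaphore. (int permits) true)})

(defn with-heavy-permit
  "Runs f while holding one permit; the permit is released even if f throws."
  [gate f]
  (let [^Semaphore sem (:sem gate)]
    (.acquire sem)
    (try
      (f)
      (finally
        (.release sem)))))

;; Read once at startup, so a changed value takes effect on restart only.
(def gate (make-gate (parse-permits (System/getenv "ORCHA_HEAVY_CONCURRENCY"))))
```

A worker would wrap only its heavy phase, e.g. `(with-heavy-permit gate #(render-and-ocr invoice))`, where `render-and-ocr` stands in for whatever the real heavy function is called.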
1.1 · Empirical basis for the permit default
The real heavy path (extract-pages → Loader/loadPDF → render-page-to-jpeg @ 150 DPI → JPEG → base64 → request build) was measured against the actual incident document (the 23-page PDF, processed as 23 single-page invoices) over the project nREPL — serially, one page at a time, GC between iterations, with a machine-memory guard. Transient heap high-water per single-page invoice task:
| min | median | mean | p90 | max (n=23) |
|---|---|---|---|---|
| 144 MB | 168 MB | 173 MB | 202 MB | 208 MB |
Production memory budget after the §3 config changes (4 GiB box): heap ≈ 0.60 × ~3.2 GB ≈ ~1.9 GB; conservative steady-state app live set ≈ ~600 MB; GC-health ceiling at ~70% heap ≈ ~1.33 GB ⇒ usable transient budget for concurrent heavy work ≈ ~730 MB. Under concurrency, G1 reclaims slower than N threads allocate, so budget ~230 MB per concurrent task (worst-case page + GC lag). 730 ÷ 230 ≈ 3.2 → default 3 (raise to 4 only if the new memory alarm shows sustained slack).
Default permit count = 3, env-tunable via ORCHA_HEAVY_CONCURRENCY. Derived from a measured ~170 MB/task, budgeted at ~230 MB/task under concurrency, against a ~730 MB usable budget; the memory alarm + env knob remain the in-prod tuning mechanism.
This also reproduces the outage arithmetically: pre-fix heap was ≈3.0 GB (MaxRAMPercentage=75 on 4 GB, no container limit); unbounded fan-out of 23 invoices × ~170 MB ≈ ~3.9 GB of simultaneous transient demand against a ~3.0 GB heap. Demand exceeds the heap by ~0.9 GB before the steady-state live set is even counted, so exhaustion and a GC death spiral follow. The root cause is now supported by measurement, not only inferred from the failure pattern.
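The budget and outage arithmetic above can be re-run as a quick nREPL check. This is a sketch using only the figures quoted in this section (in MB); `pct` and the var names are illustrative, not existing code.

```clojure
;; Figures are the ones quoted in this section, in MB.
(defn pct
  "Integer percentage of x, rounded down."
  [x percent]
  (quot (* x percent) 100))

;; Post-fix budget: 60% heap on the ~3.2 GB container limit.
(def post-fix-heap-mb (pct 3200 60))
;; GC-health ceiling at ~70% of heap.
(def gc-ceiling-mb (pct post-fix-heap-mb 70))
;; Minus the assumed ~600 MB steady-state live set.
(def usable-budget-mb (- gc-ceiling-mb 600))
;; ~230 MB budgeted per concurrent heavy task.
(def permits (quot usable-budget-mb 230))

;; Pre-fix incident: MaxRAMPercentage=75 on ~4 GB, 23 tasks at ~170 MB each.
(def pre-fix-heap-mb (pct 4000 75))
(def incident-demand-mb (* 23 170))
(def incident-shortfall-mb (- incident-demand-mb pre-fix-heap-mb))

(println {:post-fix-heap-mb post-fix-heap-mb
          :usable-budget-mb usable-budget-mb
          :permits permits
          :incident-demand-mb incident-demand-mb
          :incident-shortfall-mb incident-shortfall-mb})
```

Computed without intermediate rounding, the usable budget comes out at 744 MB rather than the ~730 MB quoted (the text rounds the heap to ~1.9 GB first); both give 3 permits. The incident side gives 3910 MB of demand against a 3000 MB heap.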
2 · Backoff + jitter on poll loops — amplifier fix
Introduce one shared poll-loop helper (none exists today; the 5 loops are duplicated). On a caught exception: exponential backoff with full jitter (base 1s, ×2, cap 30s) before re-looping; reset to base after a successful poll; continues to honor the polling? stop atom. The normal path keeps SQS long-poll untouched. This removes the tight spin, the log flood, and the credit drain on a fast-failing dependency.
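A minimal sketch of what such a helper could look like, assuming the parameters stated above (base 1s, ×2, cap 30s). The function names, the option map shape, and the injectable `sleep-fn` are hypothetical; the real helper's signature is for the plan to decide.

```clojure
(defn backoff-ceiling-ms
  "Upper bound of the delay after `failures` consecutive failures (1-based):
  base * 2^(failures - 1), capped at cap-ms."
  [failures {:keys [base-ms cap-ms]}]
  (let [exp (-> failures dec (max 0) (min 20))] ; clamp to avoid overflow
    (min cap-ms (* base-ms (bit-shift-left 1 exp)))))

(defn full-jitter-ms
  "Full jitter: a uniformly random delay between 0 and the ceiling, inclusive."
  [failures opts]
  (rand-int (inc (backoff-ceiling-ms failures opts))))

(defn run-poll-loop
  "Calls `poll-once` until the `polling?` atom is false. A successful poll
  resets the failure count; a thrown exception sleeps with backoff first.
  In this sketch the stop atom is checked between iterations, so a stop
  request takes effect after the current sleep finishes."
  [{:keys [polling? poll-once sleep-fn opts]
    :or {sleep-fn #(Thread/sleep (long %))
         opts {:base-ms 1000 :cap-ms 30000}}}]
  (loop [failures 0]
    (when @polling?
      (let [ok? (try
                  (poll-once)
                  true
                  (catch Exception _ false))]
        (if ok?
          (recur 0)
          (let [n (inc failures)]
            (sleep-fn (full-jitter-ms n opts))
            (recur n)))))))
```

With these defaults the delay ceilings for consecutive failures are 1s, 2s, 4s, 8s, 16s, then 30s from the sixth failure onward. Passing `sleep-fn` explicitly lets the unit tests record delays instead of sleeping.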
3 · JVM / memory safety — config
- -XX:MaxRAMPercentage 75 → 60, plus an explicit container memory limit (~3.2 GB, leaving ~0.8 GB for OS + SSM/CW agents).
- Add a 2 GB swapfile on the existing 30 GB volume, so a transient spike swaps instead of OOM-killing.
- -XX:+ExitOnOutOfMemoryError + -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/orcha/: on heap exhaustion the JVM dumps and exits cleanly so it restarts in seconds, never again a multi-hour GC-death-spiral wedge.
- Verify and, if absent, set an auto-restart policy on the process (systemd Restart=always / container restart) so a clean exit self-heals without instance replacement. This is a checked requirement in the plan.
4 · CPU credits → unlimited
Set credit_specification = unlimited on the launch template via CDK. Prevents the credit-exhaustion throttle that turned a transient spike into a permanent wedge. Idle baseline CPU is ~1.7%, workload is bursty — normal-operation cost ≈ $0.
5 · Monitoring — close the blind spot
Extend the existing CLOUDWATCH_AGENT_CONFIG (logs-only today) to emit mem_used_percent, swap_used_percent, disk_used_percent. New alarms in ops_stack.py → existing v1-orcha-alerts SNS topic: high memory (>85% sustained), high swap, low disk (<15% free). JVM-internal heap metric is noted as optional/deferred — mem_used_percent is a sufficient Phase 1 proxy.
6 · Automated recovery
Attach guarded auto-remediation to the existing v1-orcha-alb-unhealthy alarm (email-only today). Sustained alarm → SSM Automation runbook terminates the unhealthy ASG instance → ASG relaunches a fresh one (today's manual fix, automated). The ASG health check stays ec2() — no CodeDeploy-deadlock risk; the ELB-health rework is Phase 2.
Safeguards:
- Fire only after HealthyHostCount < 1 sustained for ≥ 15 min, longer than any normal CodeDeploy deployment window.
- Abort if a CodeDeploy deployment for the app is in progress.
- Cooldown between remediations to prevent flap loops.
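The three safeguards combine into one go/no-go decision. The sketch below shows that logic only, as a pure function; it is not the SSM runbook itself. The map keys are hypothetical, and the 60-minute cooldown is an assumed placeholder, since no cooldown value is specified here.

```clojure
;; Thresholds: 15 min comes from the safeguard above;
;; the 60 min cooldown is an assumed placeholder value.
(def remediation-policy
  {:min-unhealthy-minutes 15
   :cooldown-minutes 60})

(defn should-remediate?
  "True only when all three safeguards allow terminating the instance:
  the alarm has been sustained long enough, no CodeDeploy deployment is
  running, and the cooldown since the last remediation has elapsed
  (nil means no remediation has happened yet)."
  [{:keys [unhealthy-minutes
           deployment-in-progress?
           minutes-since-last-remediation]}
   {:keys [min-unhealthy-minutes cooldown-minutes]}]
  (and (>= unhealthy-minutes min-unhealthy-minutes)
       (not deployment-in-progress?)
       (or (nil? minutes-since-last-remediation)
           (>= minutes-since-last-remediation cooldown-minutes))))

;; Example: 20 min unhealthy, no deployment, never remediated before.
(should-remediate? {:unhealthy-minutes 20
                    :deployment-in-progress? false
                    :minutes-since-last-remediation nil}
                   remediation-policy)
```

Keeping the decision separate from the terminate action makes the dry-run in the testing section straightforward: the guard can be exercised with fabricated states without touching the ASG.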
Testing
- Unit: semaphore helper (permit enforcement, release-on-throw, blocking); backoff helper (sequence, jitter bounds, reset-on-success, respects stop atom).
- Behavioral: submit a fan-out of N > permits; assert max concurrent heavy tasks ≤ permits.
- Memory: replay the 23-invoice scenario in staging; assert memory stays under the alarm threshold and any OOM exits + restarts cleanly.
- Auto-recovery: SSM runbook dry-run including the CodeDeploy-in-progress guard and cooldown.
- Regression: existing worker suite + ingestion-regression-test stay green; the shared poll-loop helper must preserve current behavior.
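The behavioral check (N > permits, peak concurrency ≤ permits) could be sketched as follows. The helper shape from §1 is repeated so the block stands alone; `peak-concurrency` is a hypothetical test helper, plain threads stand in for the virtual-thread executors, and a short sleep stands in for heavy work.

```clojure
(import '(java.util.concurrent Semaphore))

(defn with-heavy-permit
  "Same shape as the §1 helper: hold one permit while f runs."
  [gate f]
  (let [^Semaphore sem (:sem gate)]
    (.acquire sem)
    (try
      (f)
      (finally
        (.release sem)))))

(defn peak-concurrency
  "Pushes `n-tasks` tasks through a gate with `permits` permits and returns
  the highest number of tasks observed inside the gate at the same time."
  [permits n-tasks]
  (let [gate {:sem (Semaphore. (int permits) true)}
        active (atom 0)
        peak (atom 0)
        task (fn []
               (with-heavy-permit gate
                 (fn []
                   (let [now (swap! active inc)]
                     (swap! peak max now)
                     (Thread/sleep 20) ; stand-in for heavy work
                     (swap! active dec)))))
        threads (mapv (fn [_] (doto (Thread. ^Runnable task) (.start)))
                      (range n-tasks))]
    (run! (fn [^Thread t] (.join t)) threads)
    @peak))

;; 23 tasks through 3 permits, mirroring the incident fan-out.
(println "peak concurrent heavy tasks:" (peak-concurrency 3 23))
```

The assertion for the suite is then `(<= (peak-concurrency 3 23) 3)`. The counter is incremented only after a permit is acquired and decremented before it is released, so the observed peak cannot exceed the permit count.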
Cost
| Change | Recurring cost |
|---|---|
| Bounded concurrency, backoff, heap/swap config, app-code | $0 |
| CPU credits → unlimited | ~$0 normal; small on rare burst |
| CW-agent custom metrics (mem/swap/disk) + alarms | ~$2–3/mo |
| Alarm-driven SSM auto-replacement | ~$0 (free tier) |
| Delete forensic snapshot after post-mortem | −~$1.5/mo |
Rollout & safety
All infrastructure changes go through CDK (no CloudFormation drift). App changes ship behind config defaults; concurrency limit and backoff caps are env/config-tunable. Deploy via the existing CodeDeploy pipeline.
Open questions
- Heavy-work permit default. Resolved: measured at ~170 MB/task on the real document (§1.1); default 3, env-tunable. Residual unknown is the prod steady-state live set (assumed ~600 MB); the memory alarm validates and the env knob tunes post-deploy.
- Container memory-limit mechanism. The exact place the limit is applied (CodeDeploy run script vs. systemd unit vs. docker run --memory) is an implementation detail to pin down in the plan.
- Instance-size lever. Bumping t4g.medium → large (~+$22/mo) stays deferred and is pulled only if the new memory alarm shows the instance is still tight after the concurrency cap is in place.
References
- Incident: 2026-05-16, instance i-0bfb8b8304ed5a5a9; forensic snapshot snap-057455f69dd56c893 (delete after post-mortem volume read).
- infra/stacks/compute_stack.py (ASG, launch template, health checks, CodeDeploy agent, CloudWatch-agent config).
- infra/stacks/ops_stack.py (SNS v1-orcha-alerts + 12 existing alarms).
- src/com/getorcha/system.clj:127-130 (single-JVM is explicitly temporary; web/worker split is the stated future, Phase 2).
- Worker poll loops: workers/ap/acquisition.clj, ap/ingestion.clj, ap/processors/matching/worker.clj, document_output.clj, diagnostics_recompute.clj.
- Dockerfile:10 (current JAVA_OPTS with MaxRAMPercentage=75).