In review

Production Resilience Hardening — Phase 1

2026-05-16 · Daniel · spec · Post-incident: 2026-05-16 prod outage

Problem

On 2026-05-16 production served 504 Gateway Timeout to all users for ~2.5 hours. A user uploaded a 23-page PDF, which was split into 23 invoices; the document pipeline processed all of them concurrently, with no concurrency limit, on the single 4 GiB t4g.medium instance. Memory was exhausted; every outbound call (Cognito, Google, SES, Slack, SQS, STS) failed simultaneously at 13:34:47 UTC; the SSM agent dropped at 13:36; the app went silent at 13:37:46.

The failure was then amplified and made non-recoverable by three independent weaknesses:

Unbounded fan-out. Every worker submits to Executors/newVirtualThreadPerTaskExecutor — no concurrency cap, no backpressure. N invoices = N concurrent heavy tasks (PDF render, OCR, Gemini vision, LLM, matching) regardless of memory budget.
No-backoff retry loops. The 5 worker poll loops catch errors and immediately re-loop. When the SDK fails fast (resolution failure), the SQS long-poll wait is bypassed, so the loop spins at CPU speed and floods the logs (~1 MB/s). That pegged CPU at 99% and drained the burstable instance's CPU credits; once the credits ran out, the box was throttled into a permanent wedge.
No auto-recovery. The ASG uses HealthChecks.ec2(); EC2 reachability kept passing, so the ASG never replaced the dead app. min=max=desired=1 means no redundancy. The v1-orcha-alb-unhealthy alarm fired (alerting works) but nothing acts on it. Service stayed down until the instance was manually replaced.

We were also blind: the CloudWatch agent ships logs only — there are no memory, swap, or disk metrics, so the memory exhaustion was invisible until inferred from the failure pattern.

Warning

The uuid_array PSQLException seen at 13:24:25 is a separate latent bug (the DB was healthy throughout) and is explicitly out of scope here — tracked separately.

Goals & non-goals

Goals

A burst of N heavy documents (any N) cannot exhaust instance memory.
A failing dependency degrades gracefully — no CPU-pegging spin, no log flood, no credit drain.
If the app does die, it dies fast and clean and is automatically recovered within minutes, not hours.
Memory / swap / disk are observable and alarmed.
Recurring cost delta held to ~$2–3/month; no speculative instance-size increase.

Non-goals (deferred to Phase 2)

Multi-instance HA for the web tier.
Splitting the ERP API and ingestion workers into separate processes/ASGs (already the codebase's stated future direction — system.clj:127-130).
Switching the ASG to ELB health checks and the associated CodeDeploy-deadlock rework.
Fixing the uuid_array SQL bug.
Increasing the instance size (kept as a metrics-gated lever, see Open questions).

Approach

The core architectural choice is how to bound the document-pipeline fan-out — the precise thing that exhausted memory.

Decision matrix — recommended row is Approach A.
Option	Bounds aggregate memory?	Change size	Cost
A. Global heavy-work semaphore	Yes: one cap across all workers	Small, localized	$0
B. Per-worker bounded executors	No: 5 workers × N still sum	Medium	$0
C. SQS prefetch / visibility throttling	No: split is internal, post-dequeue	Small	$0

Approach A — pros

Only option that bounds aggregate peak memory — the exact failure mode
One env-tunable knob, sized to the memory budget
Smallest, most localized change; $0
Independent of fan-out source or size (23 or 230 invoices)

Approach A — cons

A shared component threaded through 5 workers
Permit count needs a sensible default (measured: see §1.1) + in-prod alarm tuning

Decision

Adopt Approach A (global heavy-work semaphore). B is a partial fix (does not bound aggregate memory); C cannot see the internal post-dequeue split that caused the incident.

Design

1 · Bounded fan-out concurrency — the fix

A new shared Integrant component exposes a single fair java.util.concurrent.Semaphore ("heavy-work gate"). Every memory-heavy document task acquires a permit before heavy work and releases it in a finally. Applies to the heavy phases of ap.ingestion, ap.processors.matching.worker, document-output, diagnostics-recompute, and the attachment-download / PDF-split phase of ap.acquisition. Virtual-thread executors are unchanged — they stay cheap for I/O waiting; the gate only caps how many tasks do heavy work at once. A blocked acquire is the backpressure (waiting virtual threads are nearly free).

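```clojure
;; Shape only; exact namespace decided in the plan.
(import 'java.util.concurrent.Semaphore)

(defn with-heavy-permit [gate f]
  (let [^Semaphore sem (:sem gate)]
    (.acquire sem) ; blocks until a permit is free; this wait is the backpressure
    (try (f) (finally (.release sem)))))
```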

Permit count: default 3, read from config/env (ORCHA_HEAVY_CONCURRENCY), tunable against the memory alarm without a code change (restart only). This default is empirically grounded, not estimated — see the measurement below.
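As a minimal sketch of how the gate could be built from the environment, the following leaves out the Integrant wiring and shows only a plain constructor. The names `make-gate`, `parse-permits`, and `default-permits` are hypothetical, and falling back to the default on a missing or invalid value is an assumption rather than something the spec states.

```clojure
(ns heavy-gate-sketch
  (:require [clojure.string :as str])
  (:import (java.util.concurrent Semaphore)))

;; Default from this spec; everything else in this block is illustrative.
(def default-permits 3)

(defn parse-permits
  "Parses the raw env value. Falls back to the default when the value is
  missing, non-numeric, or less than 1 (assumed behaviour)."
  [raw]
  (let [n (try (Long/parseLong (str/trim (str raw)))
               (catch NumberFormatException _ nil))]
    (if (and n (pos? n)) n default-permits)))

(defn make-gate
  "Returns a map holding one fair Semaphore, matching the (:sem gate)
  access used by with-heavy-permit."
  ([] (make-gate (System/getenv "ORCHA_HEAVY_CONCURRENCY")))
  ([raw] {:sem (Semaphore. (int (parse-permits raw)) true)}))
```

Because the permit count is read once at construction, changing `ORCHA_HEAVY_CONCURRENCY` takes effect only after a restart, which matches the "restart only" note above.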

1.1 · Empirical basis for the permit default

The real heavy path (extract-pages → Loader/loadPDF → render-page-to-jpeg @ 150 DPI → JPEG → base64 → request build) was measured against the actual incident document (the 23-page PDF, processed as 23 single-page invoices) over the project nREPL — serially, one page at a time, GC between iterations, with a machine-memory guard. Transient heap high-water per single-page invoice task:

Per-invoice transient memory high-water; JPEG 313–561 KB/page. Dominated by the 150-DPI `BufferedImage` raster + encode buffers.
min	median	mean	p90	max (n=23)
144 MB	168 MB	173 MB	202 MB	208 MB

Production memory budget after the §3 config changes (4 GiB box): heap ≈ 0.60 × 3.2 GB ≈ 1.9 GB; conservative steady-state app live set ≈ 600 MB; GC-health ceiling at 70% of heap ≈ 1.33 GB, which leaves a usable transient budget for concurrent heavy work of ≈ 730 MB. Under concurrency, G1 reclaims more slowly than N threads allocate, so budget ~230 MB per concurrent task (worst-case page of 208 MB plus GC lag). 730 ÷ 230 ≈ 3.2, so the default is 3 (raise to 4 only if the new memory alarm shows sustained slack).
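The budget arithmetic can be re-run as a small function, which makes it easy to see how the permit count moves when an input changes. This is a sketch only: `permit-budget` and its keys are hypothetical names, sizes are in MB, and percentages are integers so the arithmetic stays exact. It does not round the heap to 1.9 GB first, so it reports 744 MB usable rather than the ~730 MB quoted above; the resulting permit count is the same.

```clojure
(defn permit-budget
  "All sizes in MB. Returns the intermediate figures and the permit count
  (usable budget divided by the per-task budget, rounded down)."
  [{:keys [container-mb heap-pct gc-ceiling-pct live-set-mb per-task-mb]}]
  (let [heap-mb    (quot (* container-mb heap-pct) 100)
        ceiling-mb (quot (* heap-mb gc-ceiling-pct) 100)
        usable-mb  (- ceiling-mb live-set-mb)]
    {:heap-mb    heap-mb
     :ceiling-mb ceiling-mb
     :usable-mb  usable-mb
     :permits    (quot usable-mb per-task-mb)}))

(permit-budget {:container-mb   3200   ; ~3.2 GB container limit
                :heap-pct       60     ; MaxRAMPercentage after the change
                :gc-ceiling-pct 70     ; GC-health ceiling
                :live-set-mb    600    ; assumed steady-state live set
                :per-task-mb    230})  ; worst-case page plus GC lag
;; => {:heap-mb 1920, :ceiling-mb 1344, :usable-mb 744, :permits 3}
```

Under the same inputs, a per-task budget of 170 MB (the measured mean, with no GC-lag margin) would give 4 permits, which is why 4 is treated as the next step up rather than the default.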

Decision

Default permit count = 3, env-tunable via ORCHA_HEAVY_CONCURRENCY. Derived from the measured ~170 MB/task (budgeted at ~230 MB/task to cover the worst-case page plus GC lag) against a ~730 MB usable budget; the memory alarm and the env knob remain the in-prod tuning mechanism.

Warning

This also reproduces the outage arithmetically: pre-fix heap was ≈3.0 GB (MaxRAMPercentage=75 on 4 GB, no container limit); unbounded fan-out of 23 invoices × ~170 MB ≈ ~3.9 GB of simultaneous transient demand against a ~3.0 GB heap → guaranteed exhaustion + GC death spiral. The root cause is now confirmed by measurement, not inference.
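The same reproduction, written out as a sketch. `fan-out-shortfall` is a hypothetical helper, and it assumes the JVM sees the full 4096 MB when applying MaxRAMPercentage, which slightly overstates the real heap.

```clojure
(defn fan-out-shortfall
  "All sizes in MB. Compares simultaneous transient demand from an
  unbounded fan-out against the maximum heap."
  [{:keys [ram-mb heap-pct tasks per-task-mb]}]
  (let [heap-mb   (quot (* ram-mb heap-pct) 100)
        demand-mb (* tasks per-task-mb)]
    {:heap-mb      heap-mb
     :demand-mb    demand-mb
     :shortfall-mb (- demand-mb heap-mb)}))

(fan-out-shortfall {:ram-mb 4096 :heap-pct 75 :tasks 23 :per-task-mb 170})
;; => {:heap-mb 3072, :demand-mb 3910, :shortfall-mb 838}
```

The demand figure excludes the steady-state live set, so by this arithmetic the real gap would have been larger than the shortfall shown.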

2 · Backoff + jitter on poll loops — amplifier fix

Introduce one shared poll-loop helper (none exists today; the 5 loops are duplicated). On a caught exception: exponential backoff with full jitter (base 1s, ×2, cap 30s) before re-looping; reset to base after a successful poll; continues to honor the polling? stop atom. The normal path keeps SQS long-poll untouched. This removes the tight spin, the log flood, and the credit drain on a fast-failing dependency.
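One possible shape for the shared helper is sketched below, under stated assumptions: `run-poll-loop`, `backoff-ceiling-ms`, `full-jitter-ms`, and the `on-error` hook are hypothetical names, and `poll-fn` stands in for one iteration of a worker's existing SQS long-poll. Full jitter here means a uniform random delay between 0 and the capped exponential ceiling.

```clojure
(ns poll-loop-sketch)

(def base-ms 1000)   ; base 1s
(def cap-ms 30000)   ; cap 30s

(defn backoff-ceiling-ms
  "Upper bound for the nth consecutive failure (n starts at 0):
  base * 2^n, capped at cap-ms. The exponent is clamped to avoid overflow."
  [n]
  (min cap-ms (* base-ms (bit-shift-left 1 (min n 20)))))

(defn full-jitter-ms
  "Uniform random delay in [0, ceiling] for the nth consecutive failure."
  [n]
  (rand-int (inc (backoff-ceiling-ms n))))

(defn run-poll-loop
  "Calls poll-fn while @polling? is true. A successful poll resets the
  failure count; a thrown Exception sleeps for a jittered backoff first.
  sleep-fn is injectable so tests do not have to wait."
  [{:keys [polling? poll-fn sleep-fn on-error]
    :or   {sleep-fn #(Thread/sleep (long %))
           on-error (fn [_ex _failures])}}]
  (loop [failures 0]
    (when @polling?
      (let [ok? (try (poll-fn) true
                     (catch Exception e
                       (on-error e failures)
                       false))]
        (if ok?
          (recur 0)
          (do (sleep-fn (full-jitter-ms failures))
              (recur (inc failures))))))))
```

In this sketch the stop atom is checked only between iterations, so a stop request can wait out a backoff sleep of up to 30 s; an implementation that needs faster shutdown would have to make the sleep interruptible.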

3 · JVM / memory safety — config

-XX:MaxRAMPercentage 75 → 60, plus an explicit container memory limit (~3.2 GB, leaving ~0.8 GB for OS + SSM/CW agents).
Add a 2 GB swapfile on the existing 30 GB volume — a transient spike swaps instead of OOM-killing.
-XX:+ExitOnOutOfMemoryError + -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/orcha/: on heap exhaustion the JVM dumps and exits cleanly so it restarts in seconds — never again a multi-hour GC-death-spiral wedge.
Verify and, if absent, set an auto-restart policy on the process (systemd Restart=always / container restart) so a clean exit self-heals without instance replacement. This is a checked requirement in the plan.

4 · CPU credits → unlimited

Set credit_specification = unlimited on the launch template via CDK. Prevents the credit-exhaustion throttle that turned a transient spike into a permanent wedge. Idle baseline CPU is ~1.7%, workload is bursty — normal-operation cost ≈ $0.

5 · Monitoring — close the blind spot

Extend the existing CLOUDWATCH_AGENT_CONFIG (logs-only today) to emit mem_used_percent, swap_used_percent, disk_used_percent. New alarms in ops_stack.py → existing v1-orcha-alerts SNS topic: high memory (>85% sustained), high swap, low disk (<15% free). JVM-internal heap metric is noted as optional/deferred — mem_used_percent is a sufficient Phase 1 proxy.

6 · Automated recovery

Attach guarded auto-remediation to the existing v1-orcha-alb-unhealthy alarm (email-only today). Sustained alarm → SSM Automation runbook terminates the unhealthy ASG instance → ASG relaunches a fresh one (today's manual fix, automated). The ASG health check stays ec2() — no CodeDeploy-deadlock risk; the ELB-health rework is Phase 2.

Safeguards:

Fire only after HealthyHostCount < 1 sustained for ≥ 15 min — longer than any normal CodeDeploy deployment window.
Abort if a CodeDeploy deployment for the app is in progress.
Cooldown between remediations to prevent flap loops.
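The three safeguards combine into a single decision, sketched here as a pure predicate. This illustrates only the guard logic, not the SSM Automation runbook itself; `remediate?`, its keys, and the 60-minute cooldown are hypothetical, since this spec does not fix a cooldown value.

```clojure
(def example-policy
  {:min-unhealthy-minutes 15    ; from the safeguard above
   :cooldown-minutes      60})  ; hypothetical value, not set by this spec

(defn remediate?
  "True only when every safeguard passes: the alarm has been sustained long
  enough, no deployment is in progress, and the cooldown has elapsed
  (or there has been no previous remediation)."
  [{:keys [unhealthy-minutes deploy-in-progress? minutes-since-last-remediation]}
   {:keys [min-unhealthy-minutes cooldown-minutes]}]
  (and (>= unhealthy-minutes min-unhealthy-minutes)
       (not deploy-in-progress?)
       (or (nil? minutes-since-last-remediation)
           (>= minutes-since-last-remediation cooldown-minutes))))

;; Sustained outage, no deploy running, never remediated before:
(remediate? {:unhealthy-minutes 20
             :deploy-in-progress? false
             :minutes-since-last-remediation nil}
            example-policy)
;; => true
```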

[Figure: bounded data flow. Memory is capped by the gate regardless of fan-out size.]

Testing

Unit: semaphore helper (permit enforcement, release-on-throw, blocking); backoff helper (sequence, jitter bounds, reset-on-success, respects stop atom).
Behavioral: submit a fan-out of N > permits; assert max concurrent heavy tasks ≤ permits.
Memory: replay the 23-invoice scenario in staging; assert memory stays under the alarm threshold and any OOM exits + restarts cleanly.
Auto-recovery: SSM runbook dry-run including the CodeDeploy-in-progress guard and cooldown.
Regression: existing worker suite + ingestion-regression-test stay green; the shared poll-loop helper must preserve current behavior.
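The behavioral check could look roughly like the sketch below. It is self-contained, so it repeats the helper from §1; `max-concurrency` is a hypothetical test utility, the heavy task is faked with a short sleep, and Clojure futures are used in place of the virtual-thread executors so the sketch runs on any recent JVM.

```clojure
(ns gate-behaviour-sketch
  (:import (java.util.concurrent Semaphore)))

(defn with-heavy-permit [gate f]
  (let [^Semaphore sem (:sem gate)]
    (.acquire sem)
    (try (f) (finally (.release sem)))))

(defn max-concurrency
  "Pushes n-tasks fake heavy tasks through a gate with `permits` permits and
  returns the highest number seen running at the same time."
  [permits n-tasks]
  (let [gate    {:sem (Semaphore. (int permits) true)}
        running (atom 0)
        peak    (atom 0)
        task    (fn []
                  (with-heavy-permit gate
                    (fn []
                      ;; Counted only while a permit is held.
                      (swap! peak max (swap! running inc))
                      (try (Thread/sleep 20)          ; stand-in for heavy work
                           (finally (swap! running dec))))))
        futures (doall (repeatedly n-tasks #(future (task))))]
    (run! deref futures)   ; wait for every task to finish
    @peak))

;; 23 tasks (the incident fan-out) through 3 permits:
(assert (<= (max-concurrency 3 23) 3))
```

The assertion uses `<=` rather than `=` because reaching exactly the permit count depends on thread scheduling; the invariant being tested is only that the cap is never exceeded.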

Cost

Net Phase 1 recurring delta ≈ $2–3/month. Instance-size bump excluded by decision.
Change	Recurring cost
Bounded concurrency, backoff, heap/swap config, app-code	$0
CPU credits → unlimited	~$0 normal; small on rare burst
CW-agent custom metrics (mem/swap/disk) + alarms	~$2–3/mo
Alarm-driven SSM auto-replacement	~$0 (free tier)
Delete forensic snapshot after post-mortem	−~$1.5/mo

Rollout & safety

All infrastructure changes go through CDK (no CloudFormation drift). App changes ship behind config defaults; concurrency limit and backoff caps are env/config-tunable. Deploy via the existing CodeDeploy pipeline.

Open questions

Heavy-work permit default. Resolved — measured at ~170 MB/task on the real document (§1.1); default 3, env-tunable. Residual unknown is the prod steady-state live set (assumed ~600 MB); the memory alarm validates and the env knob tunes post-deploy.
Container memory-limit mechanism. The exact place the limit is applied (CodeDeploy run script vs. systemd unit vs. docker run --memory) is an implementation detail to pin down in the plan.
Instance-size lever. Bumping t4g.medium → t4g.large (~+$22/mo) stays deferred and is pulled only if the new memory alarm shows the instance is still tight after the concurrency cap is in place.

References

Incident: 2026-05-16, instance i-0bfb8b8304ed5a5a9; forensic snapshot snap-057455f69dd56c893 (delete after post-mortem volume read).
infra/stacks/compute_stack.py — ASG, launch template, health checks, CodeDeploy agent, CloudWatch-agent config.
infra/stacks/ops_stack.py — SNS v1-orcha-alerts + 12 existing alarms.
src/com/getorcha/system.clj:127-130 — single-JVM is explicitly temporary; web/worker split is the stated future (Phase 2).
Worker poll loops: workers/ap/acquisition.clj, ap/ingestion.clj, ap/processors/matching/worker.clj, document_output.clj, diagnostics_recompute.clj.
Dockerfile:10 — current JAVA_OPTS with MaxRAMPercentage=75.