Production Resilience Hardening (Phase 1) — Orcha
In review

Production Resilience Hardening — Phase 1

2026-05-16 · Daniel · spec · Post-incident: 2026-05-16 prod outage

Problem

On 2026-05-16 production served 504 Gateway Timeout to all users for ~2.5 hours. A 23-page PDF uploaded by a user was split into 23 invoices, which the document pipeline processed concurrently, with no concurrency limit, on the single 4 GiB t4g.medium instance. Memory was exhausted; every outbound call (Cognito, Google, SES, Slack, SQS, STS) failed simultaneously at 13:34:47 UTC; the SSM agent dropped at 13:36; the app went silent at 13:37:46.

The failure was then amplified and made non-recoverable by three independent weaknesses:

  1. Unbounded fan-out. Every worker submits to Executors/newVirtualThreadPerTaskExecutor — no concurrency cap, no backpressure. N invoices = N concurrent heavy tasks (PDF render, OCR, Gemini vision, LLM, matching) regardless of memory budget.
  2. No-backoff retry loops. The 5 worker poll loops catch errors and immediately re-loop. When the SDK fails fast (resolution failure), the SQS long-poll wait is bypassed, so the loop spins at CPU speed and floods logs (~1 MB/s). This pegged CPU at 99% and drained the burstable instance's CPU credits; the resulting throttle left the box permanently wedged.
  3. No auto-recovery. The ASG uses HealthChecks.ec2(); EC2 reachability kept passing, so the ASG never replaced the dead app. min=max=desired=1 means no redundancy. The v1-orcha-alb-unhealthy alarm fired (alerting works) but nothing acts on it. Service stayed down until the instance was manually replaced.

We were also blind: the CloudWatch agent ships logs only — there are no memory, swap, or disk metrics, so the memory exhaustion was invisible until inferred from the failure pattern.

Warning

The uuid_array PSQLException seen at 13:24:25 is a separate latent bug (the DB was healthy throughout) and is explicitly out of scope here — tracked separately.

Goals & non-goals

Goals

Non-goals (deferred to Phase 2)

Approach

The core architectural choice is how to bound the document-pipeline fan-out — the precise thing that exhausted memory.

Option | Bounds aggregate memory? | Change size | Cost
B. Per-worker bounded executors | No: 4 workers × N still sum | Medium | $0
C. SQS prefetch / visibility throttling | No: split is internal, post-dequeue | Small | $0

Decision matrix: recommended row is Approach A.

Approach A — pros

  • Only option that bounds aggregate peak memory — the exact failure mode
  • One env-tunable knob, sized to the memory budget
  • Smallest, most localized change; $0
  • Independent of fan-out source or size (23 or 230 invoices)

Approach A — cons

  • A shared component threaded through 5 workers
  • Permit count needs a sensible default (measured: see §1.1) + in-prod alarm tuning
Decision

Adopt Approach A (global heavy-work semaphore). B is a partial fix (does not bound aggregate memory); C cannot see the internal post-dequeue split that caused the incident.

Design

1 · Bounded fan-out concurrency — the fix

A new shared Integrant component exposes a single fair java.util.concurrent.Semaphore ("heavy-work gate"). Every memory-heavy document task acquires a permit before heavy work and releases it in a finally. Applies to the heavy phases of ap.ingestion, ap.processors.matching.worker, document-output, diagnostics-recompute, and the attachment-download / PDF-split phase of ap.acquisition. Virtual-thread executors are unchanged — they stay cheap for I/O waiting; the gate only caps how many tasks do heavy work at once. A blocked acquire is the backpressure (waiting virtual threads are nearly free).

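```clojure
;; Shape only; exact namespace decided in the plan.
(import '(java.util.concurrent Semaphore))

(defn with-heavy-permit [gate f]
  (let [^Semaphore sem (:sem gate)]
    (.acquire sem) ; blocks until a permit is free; this wait is the backpressure
    (try (f) (finally (.release sem)))))
```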

Permit count: default 3, read from config/env (ORCHA_HEAVY_CONCURRENCY), tunable against the memory alarm without a code change (restart only). This default is empirically grounded, not estimated — see the measurement below.
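As a minimal sketch of how the gate could be built and sized, assuming the component is a plain map holding the semaphore under :sem (as the shape above suggests). The names parse-permits, make-gate, and max-observed-concurrency are hypothetical and not taken from the codebase, and the Integrant wiring is omitted. The permit helper is repeated so the snippet runs on its own.

```clojure
(import '(java.util.concurrent Semaphore))

(defn parse-permits
  "Parses a permit count from a string (e.g. an env var value). Falls back to
  default-permits when the value is missing, non-numeric, or less than 1."
  [s default-permits]
  (let [n (try (Integer/parseInt (str s))
               (catch NumberFormatException _ nil))]
    (if (and n (pos? n)) n default-permits)))

(defn make-gate
  "Hypothetical constructor: one fair Semaphore, so waiters are served FIFO."
  [permits]
  {:sem (Semaphore. (int permits) true)})

(defn with-heavy-permit [gate f]
  (let [^Semaphore sem (:sem gate)]
    (.acquire sem)
    (try (f) (finally (.release sem)))))

(defn max-observed-concurrency
  "Pushes n-tasks through the gate on separate threads and returns the highest
  number of tasks that were inside the gated section at the same time."
  [gate n-tasks]
  (let [in-flight (atom 0)
        peak      (atom 0)
        task      (fn []
                    (with-heavy-permit gate
                      (fn []
                        (swap! peak max (swap! in-flight inc))
                        (Thread/sleep 10) ; stand-in for the heavy phase
                        (swap! in-flight dec))))
        threads   (doall (repeatedly n-tasks
                                     #(doto (Thread. ^Runnable task) (.start))))]
    (run! (fn [^Thread t] (.join t)) threads)
    @peak))

(comment
  ;; Size the gate from the environment, defaulting to 3 permits.
  (def gate (make-gate (parse-permits (System/getenv "ORCHA_HEAVY_CONCURRENCY") 3)))
  ;; 23 tasks, as in the incident: the peak never exceeds the permit count.
  (max-observed-concurrency gate 23))
```

Falling back to the default on a bad value, rather than failing startup, is one possible choice here; failing fast on a malformed ORCHA_HEAVY_CONCURRENCY would be equally defensible.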

1.1 · Empirical basis for the permit default

The real heavy path (extract-pages → Loader/loadPDF → render-page-to-jpeg @ 150 DPI → JPEG → base64 → request build) was measured against the actual incident document (the 23-page PDF, processed as 23 single-page invoices) over the project nREPL — serially, one page at a time, GC between iterations, with a machine-memory guard. Transient heap high-water per single-page invoice task:

min | median | mean | p90 | max
144 MB | 168 MB | 173 MB | 202 MB | 208 MB

Per-invoice transient memory high-water (n=23); JPEG 313–561 KB/page. Dominated by the 150-DPI BufferedImage raster + encode buffers.

Production memory budget after the §3 config changes (4 GiB box): heap ≈ 0.60 × ~3.2 GB ≈ ~1.9 GB; conservative steady-state app live set ≈ ~600 MB; GC-health ceiling at ~70% heap ≈ ~1.33 GB ⇒ usable transient budget for concurrent heavy work ≈ ~730 MB. Under concurrency, G1 reclaims slower than N threads allocate, so budget ~230 MB per concurrent task (worst-case page + GC lag). 730 ÷ 230 ≈ 3.2 → default 3 (raise to 4 only if the new memory alarm shows sustained slack).
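The budget arithmetic above can be written out as a small function, which makes it easy to re-run if the heap size or the measured per-task figure changes. This is an illustrative sketch only: permit-budget and its keys are hypothetical names, and the inputs are the rounded figures from the paragraph above, in whole megabytes.

```clojure
(defn permit-budget
  "Derives a permit count from a memory budget. All sizes are whole MB and the
  GC-health ceiling is a whole percentage of the heap."
  [{:keys [heap-mb gc-ceiling-pct live-set-mb per-task-mb]}]
  (let [ceiling-mb (quot (* heap-mb gc-ceiling-pct) 100) ; heap usable before GC degrades
        usable-mb  (- ceiling-mb live-set-mb)]           ; left for concurrent heavy work
    {:ceiling-mb ceiling-mb
     :usable-mb  usable-mb
     :permits    (max 1 (quot usable-mb per-task-mb))}))

(permit-budget {:heap-mb 1900 :gc-ceiling-pct 70 :live-set-mb 600 :per-task-mb 230})
;; => {:ceiling-mb 1330, :usable-mb 730, :permits 3}
```

Dividing the same 730 MB by the raw ~170 MB median instead of the 230 MB worst-case figure would yield 4, which is why the GC-lag margin decides between 3 and 4 permits.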

Decision

Default permit count = 3, env-tunable via ORCHA_HEAVY_CONCURRENCY. Derived from the measured ~170 MB/task, budgeted at ~230 MB per concurrent task to cover the worst-case page plus GC lag, against a ~730 MB usable budget; the memory alarm + env knob remain the in-prod tuning mechanism.

Warning

This also reproduces the outage arithmetically: pre-fix heap was ≈3.0 GB (MaxRAMPercentage=75 on 4 GB, no container limit); unbounded fan-out of 23 invoices × ~170 MB ≈ ~3.9 GB of simultaneous transient demand against a ~3.0 GB heap → guaranteed exhaustion + GC death spiral. The root cause is now confirmed by measurement, not inference.
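The same comparison can be checked mechanically. The sketch below uses the rounded figures from this section in whole megabytes; transient-demand-mb and fits? are hypothetical helper names, and the pre-fix case generously treats the entire ~3.0 GB heap as available, ignoring the steady-state live set.

```clojure
(defn transient-demand-mb
  "Simultaneous transient memory demand for `tasks` concurrent heavy tasks."
  [tasks per-task-mb]
  (* tasks per-task-mb))

(defn fits?
  "True when the simultaneous demand stays within the given budget."
  [tasks per-task-mb budget-mb]
  (<= (transient-demand-mb tasks per-task-mb) budget-mb))

;; Pre-fix: 23 unbounded tasks at ~170 MB each against a ~3000 MB heap.
(transient-demand-mb 23 170) ; => 3910
(fits? 23 170 3000)          ; => false

;; Post-fix: 3 permits at the 230 MB budgeted figure against ~730 MB usable.
(transient-demand-mb 3 230)  ; => 690
(fits? 3 230 730)            ; => true
```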

2 · Backoff + jitter on poll loops — amplifier fix

Introduce one shared poll-loop helper (none exists today; the 5 loops are duplicated). On a caught exception: exponential backoff with full jitter (base 1s, ×2, cap 30s) before re-looping; reset to base after a successful poll; continues to honor the polling? stop atom. The normal path keeps SQS long-poll untouched. This removes the tight spin, the log flood, and the credit drain on a fast-failing dependency.
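A minimal sketch of what such a helper might look like, under stated assumptions: every name here (backoff-ceiling-ms, full-jitter-ms, run-poll-loop, and the option keys) is hypothetical, the poll and handler functions are passed in rather than tied to any SQS client, and sleeping is injected so the loop can be exercised without real delays. "Full jitter" is taken to mean a uniformly random delay between 0 and the capped exponential ceiling.

```clojure
(def default-backoff {:base-ms 1000 :factor 2 :cap-ms 30000})

(defn backoff-ceiling-ms
  "Upper bound of the delay for the given consecutive-failure count
  (0 for the first failure): base * factor^attempt, capped at cap-ms."
  [attempt {:keys [base-ms factor cap-ms]}]
  (long (min cap-ms (* base-ms (Math/pow factor attempt)))))

(defn full-jitter-ms
  "Uniformly random delay in [0, ceiling], inclusive."
  [attempt opts]
  (long (rand (inc (backoff-ceiling-ms attempt opts)))))

(defn run-poll-loop
  "Polls until the polling? atom turns false. On an exception, sleeps for a
  jittered backoff before re-looping; a successful iteration resets the
  failure count, so the next failure starts again from the base delay."
  [{:keys [polling? poll! handle! sleep! opts]
    :or   {sleep! (fn [ms] (Thread/sleep (long ms)))
           opts   default-backoff}}]
  (loop [attempt 0]
    (when @polling?
      (let [ok? (try
                  (handle! (poll!)) ; normal path: the long-poll itself is untouched
                  true
                  (catch Exception _
                    (sleep! (full-jitter-ms attempt opts))
                    false))]
        (recur (if ok? 0 (inc attempt)))))))

(comment
  ;; Ceilings for the first seven consecutive failures:
  (map #(backoff-ceiling-ms % default-backoff) (range 7))
  ;; => (1000 2000 4000 8000 16000 30000 30000)
  )
```

Catching Exception rather than Throwable is a deliberate choice in this sketch, so that errors such as OutOfMemoryError are not swallowed by the retry loop; whether the real helper should do the same is left to the plan.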

3 · JVM / memory safety — config

4 · CPU credits → unlimited

Set credit_specification = unlimited on the launch template via CDK. Prevents the credit-exhaustion throttle that turned a transient spike into a permanent wedge. Idle baseline CPU is ~1.7%, workload is bursty — normal-operation cost ≈ $0.

5 · Monitoring — close the blind spot

Extend the existing CLOUDWATCH_AGENT_CONFIG (logs-only today) to emit mem_used_percent, swap_used_percent, disk_used_percent. New alarms in ops_stack.py → existing v1-orcha-alerts SNS topic: high memory (>85% sustained), high swap, low disk (<15% free). JVM-internal heap metric is noted as optional/deferred — mem_used_percent is a sufficient Phase 1 proxy.

6 · Automated recovery

Attach guarded auto-remediation to the existing v1-orcha-alb-unhealthy alarm (email-only today). Sustained alarm → SSM Automation runbook terminates the unhealthy ASG instance → ASG relaunches a fresh one (today's manual fix, automated). The ASG health check stays ec2() — no CodeDeploy-deadlock risk; the ELB-health rework is Phase 2.

Safeguards:

[Diagram: SQS message → poll loop (on error: backoff + jitter) → heavy-work gate (≤ N permits) → heavy work (OCR / vision / LLM)]
Bounded data flow: memory is capped by the gate regardless of fan-out size.

Testing

Cost

Change | Recurring cost
Bounded concurrency, backoff, heap/swap config, app-code | $0
CPU credits → unlimited | ~$0 normal; small on rare burst
CW-agent custom metrics (mem/swap/disk) + alarms | ~$2–3/mo
Alarm-driven SSM auto-replacement | ~$0 (free tier)
Delete forensic snapshot after post-mortem | −~$1.5/mo

Net Phase 1 recurring delta ≈ $2–3/month. Instance-size bump excluded by decision.

Rollout & safety

All infrastructure changes go through CDK (no CloudFormation drift). App changes ship behind config defaults; concurrency limit and backoff caps are env/config-tunable. Deploy via the existing CodeDeploy pipeline.

Open questions

References