Phase 1 Resilience — Plan B: Instance Memory & CPU Safety Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax. Infra changes are CDK-only; this plan stops at cdk synth/cdk diff verification. Actual cdk deploy is an operator-gated step the user runs explicitly — the plan never deploys.

Goal: Give the JVM real memory headroom on the 4 GiB box and stop a burst from throttling the instance into a wedge.

Architecture: Lower the JVM heap fraction and add a hard container memory limit so the JVM cannot starve the OS/agents; add an OS swapfile as a cushion; make the JVM fail fast on OOM (clean exit → docker-compose restarts it); switch the burstable instance to unlimited CPU credits so a spike can't throttle recovery. The JVM flags live in the Dockerfile and the memory limit in docker-compose.yml; the swapfile and CPU credits are CDK changes.

Tech Stack: Dockerfile, docker-compose (Compose Spec), AWS CDK (Python), cfn-init.

Measured basis: ~170 MB/single-page invoice (spec §1.1). Heap target ≈ 0.60 × ~3.2 GB ≈ 1.9 GB.


File Structure

These are independent files; order is Dockerfile → compose → CDK.


Task 1: Fail-fast JVM heap settings

Files:

Current line 10:

ENV JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -Dlog4j2.configurationFile=classpath:log4j2-prod.xml"

Replace with:

ENV JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=60.0 -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs/ -Dlog4j2.configurationFile=classpath:log4j2-prod.xml"

(/app/logs is the in-container mount of host /var/log/orcha per docker-compose.yml:33, so a heap dump survives container restart and is shipped/inspectable. ExitOnOutOfMemoryError turns an OOM into an immediate clean process exit instead of a GC-death-spiral wedge.)

Run: docker build --build-arg COMMIT_SHA=plan-b-test -t orcha:plan-b-test . Expected: build succeeds; docker run --rm --entrypoint sh orcha:plan-b-test -c 'echo $JAVA_OPTS' prints the new flags.

git add Dockerfile
git commit -m "feat(jvm): MaxRAMPercentage 75->60, fail-fast + heap dump on OOM"

Task 2: Hard container memory limit + confirm restart policy

Files:

In the orcha: service block (lines 16-45), add these two keys at the same indent level as restart: (i.e. under orcha:), immediately after line 19 (restart: unless-stopped):

    mem_limit: 3200m
    memswap_limit: 3200m

mem_limit caps the container at 3.2 GB so the JVM's -XX:+UseContainerSupport sizes the heap from 3.2 GB (≈1.9 GB at 60%), leaving ~0.8 GB on the box for OS + SSM/CloudWatch agents + xray-daemon. memswap_limit: 3200m (equal to mem_limit) forbids the container from consuming host swap — host swap (Task 3) is an OS-level cushion, not a way to mask a container overrun.

Plan A registers :permits #or [#env ORCHA_HEAVY_CONCURRENCY 3], but the orcha service declares no env_file and a fixed environment: list — Compose passes only declared vars, so a host ORCHA_HEAVY_CONCURRENCY is silently dropped and permits stay pinned at 3. Without this step Plan A's documented "restart-only, no code change" rollback/tuning is inert.

Add this entry to the existing orcha: environment: list (the block currently ending with GOOGLE_APPLICATION_CREDENTIALS=…, line ~31), same - KEY=VALUE style:

      - ORCHA_HEAVY_CONCURRENCY=${ORCHA_HEAVY_CONCURRENCY:-3}

${ORCHA_HEAVY_CONCURRENCY:-3} interpolates at docker compose up time: default 3, overridable without editing this file. Operator tuning step (document in the PR body): to change the cap, set ORCHA_HEAVY_CONCURRENCY in the host environment (or a .env file in the compose working dir) on the instance, then restart the container via the normal CodeDeploy lifecycle — no image rebuild. Invalid values fail startup fast (Plan A gate-init-rejects-non-positive-permits), they do not silently deadlock.
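The default-with-override behaviour of ${ORCHA_HEAVY_CONCURRENCY:-3} can be illustrated with a minimal Python sketch. It only mimics the Compose rule that the `:-` form falls back to the default when the variable is unset or empty; the function name is hypothetical and nothing here is part of Orcha or Compose.

```python
def interpolate_with_default(env, name, default):
    """Mimic Compose's ${NAME:-default}: use default when NAME is unset or empty."""
    value = env.get(name)
    return value if value else default


# Unset on the host: the compose default applies.
print(interpolate_with_default({}, "ORCHA_HEAVY_CONCURRENCY", "3"))
# Set on the host (or in .env): the operator's value wins.
print(interpolate_with_default({"ORCHA_HEAVY_CONCURRENCY": "1"}, "ORCHA_HEAVY_CONCURRENCY", "3"))
# Set but empty: ':-' still falls back to the default.
print(interpolate_with_default({"ORCHA_HEAVY_CONCURRENCY": ""}, "ORCHA_HEAVY_CONCURRENCY", "3"))
```

An empty value resolves to the default and does not reach the app as an empty string, so the startup rejection of invalid values only triggers on an explicit non-empty bad value.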

Verify line 19 reads restart: unless-stopped. This already restarts the container when the JVM exits (including the fail-fast OOM exit from Task 1). No change needed — this step is a confirmation; record it in the commit message.

§Design 3 calls the auto-restart a checked requirement — verify the fail-fast premise, don't assume it. The self-heal loop is two independent links; prove both:

(a) JVM exits non-zero + dumps on OOM (deterministic, image-agnostic — exercises the Task 1 flags, not Orcha internals; match the JDK tag to the app image's Java major version):

mkdir -p /tmp/orcha-oom-logs
docker run --rm -i -v /tmp/orcha-oom-logs:/dump eclipse-temurin:21-jdk \
  jshell -R-Xmx32m -R-XX:+ExitOnOutOfMemoryError \
         -R-XX:+HeapDumpOnOutOfMemoryError -R-XX:HeapDumpPath=/dump/ -s - <<'EOF'
var l = new java.util.ArrayList<byte[]>();
while (true) l.add(new byte[1 << 20]);
EOF
echo "exit=$?"
ls -1 /tmp/orcha-oom-logs/*.hprof

Expected: exit= is non-zero within seconds (an immediate clean exit, not a multi-minute GC-spiral hang) and a *.hprof is present in the mounted dir. This confirms -XX:+ExitOnOutOfMemoryError + -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath= behave as Task 1 relies on, and that the dump survives on a host-mounted volume (the same kind docker-compose.yml:33 provides).

(b) Compose restarts a non-zero exit: Step 2 confirmed restart: unless-stopped on the orcha service; by the Compose spec unless-stopped restarts a container that exits for any reason except an explicit docker stop. (a) + (b) ⇒ a real OOM self-heals in seconds instead of wedging for hours. Record both observed results (the exit code and the .hprof filename) in the commit message.

Run: docker compose -f deploy/docker-compose.yml config Expected: exit 0, no errors; resolved config shows mem_limit: "3200m" (or 3355443200) under orcha, and ORCHA_HEAVY_CONCURRENCY: "3" in its environment (the :-3 default). Re-run with ORCHA_HEAVY_CONCURRENCY=1 docker compose -f deploy/docker-compose.yml config and confirm it resolves to "1" — proving the knob is actually wired.
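If the check above needs to be scripted and not just eyeballed, one option is to parse the resolved config as JSON. The sketch below assumes the output of docker compose -f deploy/docker-compose.yml config --format json has been loaded into a dict; the helper name and the sample document are hypothetical, and the accepted mem_limit spellings are the two this plan mentions.

```python
import json

# Accepted spellings of the 3200 MiB cap: the literal from the compose file,
# or the byte count as a string or a number.
ACCEPTED_MEM_LIMITS = ("3200m", "3355443200", 3355443200)


def check_orcha_service(resolved, expected_permits="3"):
    """Return a list of problems in a resolved Compose config (empty list = OK)."""
    service = resolved.get("services", {}).get("orcha")
    if service is None:
        return ["no 'orcha' service in resolved config"]
    problems = []
    mem_limit = service.get("mem_limit")
    if mem_limit not in ACCEPTED_MEM_LIMITS:
        problems.append(f"unexpected mem_limit: {mem_limit!r}")
    env = service.get("environment") or {}
    if isinstance(env, list):  # tolerate the "- KEY=VALUE" list form
        env = dict(item.split("=", 1) for item in env if "=" in item)
    actual = env.get("ORCHA_HEAVY_CONCURRENCY")
    if str(actual) != expected_permits:
        problems.append(f"ORCHA_HEAVY_CONCURRENCY is {actual!r}, expected {expected_permits!r}")
    return problems


# Hypothetical resolved config, shaped like the expected output described above.
sample = json.loads(
    '{"services": {"orcha": {"mem_limit": "3355443200",'
    ' "environment": {"ORCHA_HEAVY_CONCURRENCY": "3"}}}}'
)
print(check_orcha_service(sample))  # prints []
```

Running it a second time with expected_permits="1" against the config resolved under ORCHA_HEAVY_CONCURRENCY=1 covers the "knob is wired" half of the check.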

git add deploy/docker-compose.yml
git commit -m "feat(deploy): cap orcha at 3200m, plumb ORCHA_HEAVY_CONCURRENCY; verify OOM restart loop"

Task 3: OS swapfile via cfn-init

Files:

Inside the "configure": ec2.InitConfig([ ... ]) list, add this element immediately after the "04-ecr-login" command (after line 475, before the closing ] on line 476):

                        # 2 GB swapfile — OS-level cushion so a transient
                        # memory spike swaps instead of OOM-killing.
                        ec2.InitCommand.shell_command(
                            "test -f /swapfile || ("
                            "fallocate -l 2G /swapfile && chmod 600 /swapfile && "
                            "mkswap /swapfile && swapon /swapfile && "
                            "echo '/swapfile none swap sw 0 0' >> /etc/fstab)",
                            key="05-create-swapfile",
                        ),

Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdCompute > /tmp/compute-synth-b.yaml Expected: synth succeeds; grep -c 05-create-swapfile /tmp/compute-synth-b.yaml ≥ 1.
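The synth check only proves the command is in the template. As an optional follow-up that this plan does not itself run, an operator could confirm after the instance refresh that the swapfile is active by reading /proc/swaps. The sketch below parses that file's text; the helper name and the sample contents are assumptions for illustration.

```python
def active_swap_kib(proc_swaps_text, path="/swapfile"):
    """Return the size in KiB of the swap entry at `path`, or 0 if it is not active."""
    for line in proc_swaps_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        # Columns: Filename, Type, Size (KiB), Used (KiB), Priority
        if len(fields) >= 3 and fields[0] == path:
            return int(fields[2])
    return 0


# Illustrative /proc/swaps contents for a 2G swapfile (sample data, not captured output).
sample = (
    "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n"
    "/swapfile                               file\t\t2097148\t\t0\t\t-2\n"
)
print(active_swap_kib(sample))
```

On the instance the input would come from open("/proc/swaps").read(); a result of 0 means the cfn-init command did not run or swapon failed.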

git add infra/stacks/compute_stack.py
git commit -m "feat(infra): add 2G swapfile via cfn-init configure set"

Task 4: Unlimited CPU credits on the launch template

Files:

In the launch_template = ec2.LaunchTemplate(self, "LaunchTemplate", ...) constructor, add this keyword argument immediately after launch_template_name="v1-orcha-lt", (line 515):

            cpu_credits=ec2.CpuCredits.UNLIMITED,

(The installed aws-cdk-lib requires the enum ec2.CpuCredits.UNLIMITED: cpu_credits is typed Optional[CpuCredits], not a bare str, so a string "unlimited" is rejected. It synthesizes to CloudFormation CreditSpecification: { CpuCredits: unlimited }. Idle baseline CPU is ~1.7% and the workload is bursty, so unlimited bills ≈ $0 in normal operation while preventing the credit-exhaustion throttle that made the incident unrecoverable.)

Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdCompute > /tmp/compute-synth-b2.yaml Expected: synth succeeds; grep -i creditspecification /tmp/compute-synth-b2.yaml shows CpuCredits: unlimited on the launch template.
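The grep can miss a CreditSpecification that landed on the wrong resource. A stricter variant is sketched below: it walks a synthesized template, already loaded as a dict, and reports the CpuCredits value per launch template. It assumes the JSON template that cdk synth writes under cdk.out/ by default (for example cdk.out/V1OrchaProdCompute.template.json, a path not confirmed by this plan); the helper name and the logical ID in the sample are hypothetical.

```python
def launch_template_cpu_credits(template):
    """Map each AWS::EC2::LaunchTemplate logical ID to its CpuCredits value (or None)."""
    result = {}
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::EC2::LaunchTemplate":
            continue
        data = resource.get("Properties", {}).get("LaunchTemplateData", {})
        result[logical_id] = data.get("CreditSpecification", {}).get("CpuCredits")
    return result


# Hypothetical fragment shaped like the expected synth output.
sample_template = {
    "Resources": {
        "LaunchTemplateExample": {
            "Type": "AWS::EC2::LaunchTemplate",
            "Properties": {
                "LaunchTemplateName": "v1-orcha-lt",
                "LaunchTemplateData": {"CreditSpecification": {"CpuCredits": "unlimited"}},
            },
        },
        "SomeBucket": {"Type": "AWS::S3::Bucket", "Properties": {}},
    }
}
print(launch_template_cpu_credits(sample_template))
```

A value of None for any launch template means the cpu_credits argument did not make it into the synthesized output.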

Run: cd infra && . .venv/bin/activate && cdk diff V1OrchaProdCompute Expected: the diff shows ONLY the launch-template CreditSpecification addition + the Task 3 cfn-init metadata change (no ASG capacity/health-check/security-group changes — confirming no drift risk beyond intent).

git add infra/stacks/compute_stack.py
git commit -m "feat(infra): set t4g launch template to unlimited CPU credits"

Task 5: Local 23-invoice memory replay — pre-deploy acceptance gate (running nREPL)

Files: none (verification). This task IS executed by the plan — unlike the CDK tasks it is local, not operator-gated. It is the spec §Testing "memory replay". The spec says "in staging", but Orcha has only prod and local (no staging — infra/app.py:38 rejects any non-prod env_name), and locally there is no docker and no jar — the system runs in a long-lived nREPL (port 33905, full Integrant system up via integrant.repl, local Postgres + MiniStack S3/SQS available). So this replay drives the incident document through that running nREPL system and asserts the JVM-heap property that the prod container limit constrains. It validates Plan A's ORCHA_HEAVY_CONCURRENCY default against this plan's lowered heap budget; it depends only on Plan A (the gate) + this plan's Tasks 1-2 (the prod image flags / mem_limit define the budget this asserts against — they are not run locally). It does not need Tasks 3-4 (instance-level CDK) or Plan C — the prod v1-orcha-mem-high alarm is just the production monitor of this same property.

Why JVM heap, not container RSS: the prod failure is used heap → max heap → ExitOnOutOfMemoryError. Prod max heap = mem_limit 3200 MiB × MaxRAMPercentage 60.0% = 1920 MiB. Container RSS can't be reproduced without docker, but JVM used heap is measurable in-process and is exactly the quantity that limit caps. The replay therefore asserts against the prod heap budget, applying the same 85% ratio as the prod RSS alarm: peak used heap < ~1.6 GiB (≈0.85 × 1920 MiB).
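The budget arithmetic above, written out as a small Python sketch. The constants come from Tasks 1-2 and the 85% alarm ratio; the figures are nominal, since the JVM's actual max heap may differ slightly from an exact 60% of the container limit.

```python
MEM_LIMIT_MIB = 3200          # docker-compose mem_limit (Task 2)
MAX_RAM_PERCENTAGE = 60.0     # -XX:MaxRAMPercentage (Task 1)
ALARM_RATIO = 0.85            # same ratio as the prod RSS alarm


def heap_budget_mib(mem_limit_mib, max_ram_percentage):
    """Nominal JVM max heap implied by the container limit and MaxRAMPercentage."""
    return mem_limit_mib * max_ram_percentage / 100.0


def pass_threshold_mib(budget_mib, ratio):
    """Peak used heap must stay below this for the replay to pass."""
    return budget_mib * ratio


budget = heap_budget_mib(MEM_LIMIT_MIB, MAX_RAM_PERCENTAGE)
threshold = pass_threshold_mib(budget, ALARM_RATIO)
print(f"heap budget ~ {budget:.0f} MiB")
print(f"pass threshold ~ {threshold:.0f} MiB ({threshold / 1024:.2f} GiB)")
```

This gives a budget of about 1920 MiB and a threshold of about 1632 MiB, which is the "~1.6 GiB" figure used as the acceptance bound.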

Artifact: the actual 2026-05-16 incident document — a 23-page PDF that splits into 23 invoices. Operator-local, not a repo fixture (not committed): /home/volrath/Downloads/CircularsOfficeOrders-06319749027d334-12351246.pdf

Execution constraint (hard): all nREPL interaction is synchronous evals only (clj-nrepl-eval -p 33905). Never background the eval, never run interrupt-semantics or detached-thread evals on the nREPL eval thread (a prior session wedged the REPL doing this). The single replay eval is itself synchronous and bounded by an internal timeout; the only thread it starts is a short-lived in-JVM daemon sampler that it joins before returning — that is not "backgrounding the eval".

  1. Baseline the REPL heap. Synchronous eval: (.maxMemory (Runtime/getRuntime)). Record it. The measurement is only meaningful if the nREPL JVM's max heap ≥ ~1.9 GiB (so it isn't artificially capped below the prod budget). If it is smaller than ~1.9 GiB, a clean pass is still conservative (safe — the gate held under a tighter heap than prod) but an OOM is inconclusive (could be the smaller REPL heap, not the gate) — note this explicitly in the PR body. If it is much larger, instantaneous "used" overstates the true live set (GC runs less often); mitigate by also forcing (System/gc) at peak and reading used heap then (the authoritative live-set number).
  2. Get the incident PDF into the local datastore as a document so the normal ingestion path can run: object in local MiniStack S3 + a document row + an ap-ingestion row, then a message to the local ingestion SQS — i.e. exactly the enqueue shape of reingest (src/com/getorcha/app/http/documents/view/shared.clj:1035: insert ap-ingestion, aws/send-message! to (get-in aws [:queue-urls :ingestion]) with the ingestion id), minus the HTTP wrapper. The exact upload/insert mechanics are an execution detail the verifier resolves by reading the upload + ingestion code; the invariant is that the document enters via the real ingestion SQS so the live system's poll loops drive split → 23-child re-enqueue → each child re-entering through with-permit (the gate that bounds the incident fan-out).
  3. One synchronous replay eval that: (a) starts a daemon sampler thread recording max of (- (.totalMemory rt) (.freeMemory rt)) into an atom every ~250 ms; (b) performs the enqueue from (2); (c) polls the DB until all 23 child documents reach a terminal/processed state or a generous timeout (e.g. 15 min); (d) stops + joins the sampler; (e) forces (System/gc) and reads used heap (live set); (f) returns {:repl-max-heap-mib … :peak-used-mib … :post-gc-used-mib … :children-terminal <n>/23 :oom? <bool>}. Resolve the system map from the running integrant.repl/system (the verifier knows the accessor — same as Task 6's verifier used). No *.hprof should appear in the project dir / JVM HeapDumpPath.
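The sampler shape in step 3 (start a daemon sampler, do the work synchronously, then stop and join before returning) is sketched below in Python purely to show the control flow. The real replay is a Clojure eval against the running nREPL; read_used and work are hypothetical stand-ins for the heap reading and the enqueue-and-poll body.

```python
import threading
import time


def run_with_peak_sampler(read_used, work, interval_s=0.25):
    """Run work() on the calling thread while a daemon thread tracks max(read_used())."""
    peak = [read_used()]
    stop = threading.Event()

    def sample():
        while not stop.is_set():
            peak[0] = max(peak[0], read_used())
            stop.wait(interval_s)  # sleeps, but wakes immediately when stopped

    sampler = threading.Thread(target=sample, daemon=True)
    sampler.start()
    try:
        result = work()            # synchronous: the caller blocks here
    finally:
        stop.set()
        sampler.join()             # sampler is gone before we return
    return result, peak[0]


# Toy stand-ins: "usage" spikes to 500 during the work, then drops back.
usage = {"mib": 100}


def fake_work():
    usage["mib"] = 500
    time.sleep(0.2)
    usage["mib"] = 120
    return "done"


print(run_with_peak_sampler(lambda: usage["mib"], fake_work, interval_s=0.01))
```

The join in the finally block is what keeps this within the execution constraint: no thread outlives the eval, even if the work raises or times out.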

All must hold:

Tuning: ORCHA_HEAVY_CONCURRENCY is plumbed (Task 2 Step 1b) and the running system reads it via the #or [#env ORCHA_HEAVY_CONCURRENCY 3] config default. To re-test at a different permit count without a full restart, the cleanest path is to bounce the gate component with an overridden permit (or restart the nREPL with the env var set) and re-run the single replay eval. If peak used heap breaches ~1.6 GiB or the JVM OOMs (with an adequately-sized REPL heap), lower the permit (e.g. 2) and re-run. If peak stays well under (<60% of the 1920 MiB budget) and throughput matters, 4 may be tried the same way. Record the chosen value, the observed :peak-used-mib/:post-gc-used-mib, and the REPL :repl-max-heap-mib in the PR body. Phase 1 is not complete, and the operator must not deploy (Task 6), until this local replay passes.
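The tuning rules above can be read as a small decision function. The sketch below encodes them under two assumptions that are interpretations and not statements from the spec: an "adequately-sized" REPL heap is taken to mean at least the 1920 MiB prod budget, and the function and outcome names are hypothetical.

```python
def next_permit_action(peak_used_mib, oom, repl_max_heap_mib, budget_mib=1920.0):
    """Classify one replay result into the next tuning step."""
    threshold = 0.85 * budget_mib            # ~1.6 GiB pass bound
    if oom and repl_max_heap_mib < budget_mib:
        # OOM under a REPL heap smaller than prod's budget proves nothing about the gate.
        return "inconclusive"
    if oom or peak_used_mib >= threshold:
        return "lower"                       # e.g. retry with 2 permits
    if peak_used_mib < 0.60 * budget_mib:
        return "may-raise"                   # only if throughput matters; e.g. try 4
    return "keep"


print(next_permit_action(peak_used_mib=1700, oom=False, repl_max_heap_mib=2048))
print(next_permit_action(peak_used_mib=1000, oom=False, repl_max_heap_mib=2048))
print(next_permit_action(peak_used_mib=1400, oom=False, repl_max_heap_mib=2048))
print(next_permit_action(peak_used_mib=900, oom=True, repl_max_heap_mib=1024))
```

Whatever the outcome, the value recorded in the PR body is the permit count of the last run that returned "keep" or "may-raise".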


Task 6: Deployment handoff note (no deploy performed)

Files: none.

The plan does NOT run cdk deploy. Precondition: Task 5 (local memory replay) has passed and the chosen ORCHA_HEAVY_CONCURRENCY is recorded. When the user chooses to roll out:

  1. Deploy the app image (Plan A + Task 1 here) via the normal CodeDeploy pipeline.
  2. cdk deploy V1OrchaProdCompute (Tasks 3-4). This updates the launch template and cfn-init, but the existing instance is not auto-replaced (ASG uses EC2 health checks), so a one-time instance refresh or manual replace is required for swap and CPU credits to take effect. Note this explicitly for the operator.
  3. mem_limit/JVM flags take effect on the next app container start.

Document this in the PR body. No code change in this task.


Self-Review