For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax. CDK-only; the plan stops at cdk synth/cdk diff. cdk deploy and the SSM dry-run are operator-gated steps the user runs explicitly.
Goal: Make memory/swap/disk observable with alarms, and automatically replace a wedged instance the way it was replaced by hand during the incident — guarded so it never fights a deploy.
Architecture: The CloudWatch agent already emits mem_used_percent/disk_used_percent (namespace V1Orcha); add swap_used_percent and ASG-aggregation dimensions, then add three alarms → existing SNS v1-orcha-alerts. Separately, a sustained (≥15 min) ALB-unhealthy alarm drives an SSM Automation document via two EventBridge rules — an edge rule (on the OK→ALARM transition, for fast first response) and a 5-minute scheduled retry rule (so an aborted remediation is retried while still down). The document guards in order: (1) abort unless the sustained alarm is currently ALARM (makes the periodic rule safe and closes the edge-trigger gap), (2) abort if a CodeDeploy deployment is in progress, (3) abort if any ASG instance launched within the cooldown window, else (4) terminate the unhealthy ASG instance so the ASG relaunches a fresh one. The AwsApi event target is Lambda-backed (a CDK-managed singleton Lambda — see Task 4).
Tech Stack: AWS CDK (Python): cloudwatch, ssm, events, events_targets, iam; CloudWatch agent JSON.
infra/stacks/compute_stack.py, CLOUDWATCH_AGENT_CONFIG (lines 74-85): add swap + append/aggregation dimensions.
infra/stacks/ops_stack.py: add 3 metric alarms; add the sustained alarm + SSM Automation document + two EventBridge rules + IAM roles.
Files:
Modify: infra/stacks/compute_stack.py (the "metrics" block, lines 74-85)
Step 1: Replace the "metrics" block
Current (lines 74-85):
"metrics": {
    "namespace": "V1Orcha",
    "metrics_collected": {
        "mem": {
            "measurement": ["mem_used_percent"]
        },
        "disk": {
            "measurement": ["disk_used_percent"],
            "resources": ["/"]
        }
    }
}
Replace with:
"metrics": {
    "namespace": "V1Orcha",
    "append_dimensions": {
        "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "aggregation_dimensions": [["AutoScalingGroupName"]],
    "metrics_collected": {
        "mem": {
            "measurement": ["mem_used_percent"]
        },
        "swap": {
            "measurement": ["swap_used_percent"]
        },
        "disk": {
            "measurement": ["disk_used_percent"],
            "resources": ["/"]
        }
    }
}
This adds swap_used_percent and makes the agent publish an AutoScalingGroupName-aggregated series so static CDK alarms can target it (the same dimension the existing v1-orcha-ec2-high-cpu alarm uses).
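As a quick sanity check on the replaced block, the sketch below verifies that a metrics dict carries all three measurements plus the ASG dimension and aggregation. It is an illustrative helper, not code from the repo: the name check_agent_metrics and the inlined METRICS copy are assumptions made for the example.

```python
# Hypothetical helper (not in the repo): checks the shape of the "metrics"
# block described above. METRICS is an inlined copy of the replacement block.
METRICS = {
    "namespace": "V1Orcha",
    "append_dimensions": {"AutoScalingGroupName": "${aws:AutoScalingGroupName}"},
    "aggregation_dimensions": [["AutoScalingGroupName"]],
    "metrics_collected": {
        "mem": {"measurement": ["mem_used_percent"]},
        "swap": {"measurement": ["swap_used_percent"]},
        "disk": {"measurement": ["disk_used_percent"], "resources": ["/"]},
    },
}


def check_agent_metrics(metrics: dict) -> list[str]:
    """Return a list of problems; an empty list means the block looks right."""
    problems = []
    collected = metrics.get("metrics_collected", {})
    for section, measurement in (
        ("mem", "mem_used_percent"),
        ("swap", "swap_used_percent"),
        ("disk", "disk_used_percent"),
    ):
        if measurement not in collected.get(section, {}).get("measurement", []):
            problems.append(f"missing {section}.{measurement}")
    # The Task 2 alarms key on AutoScalingGroupName alone, so the agent must
    # both attach that dimension and publish a series aggregated on it.
    if "AutoScalingGroupName" not in metrics.get("append_dimensions", {}):
        problems.append("append_dimensions lacks AutoScalingGroupName")
    if ["AutoScalingGroupName"] not in metrics.get("aggregation_dimensions", []):
        problems.append("aggregation_dimensions lacks [AutoScalingGroupName]")
    return problems
```

Run against the current (pre-change) block, the helper reports the missing swap measurement and both missing dimension settings, which is the gap this task closes.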
Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdCompute > /tmp/c1.yaml
Expected: synth succeeds; grep -c swap_used_percent /tmp/c1.yaml ≥ 1 and grep -c aggregation_dimensions /tmp/c1.yaml ≥ 1.
git add infra/stacks/compute_stack.py
git commit -m "feat(infra): CW agent emits swap + ASG-aggregated mem/swap/disk"
Files:
Modify: infra/stacks/ops_stack.py (alongside the existing alarms; pattern = the ec2_high_cpu_alarm block which already uses asg.auto_scaling_group_name)
Step 1: Add three alarms
Immediately after the existing ec2_high_cpu_alarm.add_alarm_action(...) line, add:
# 6b. Instance memory high (V1Orcha custom metric, ASG-aggregated)
mem_high_alarm = cloudwatch.Alarm(
    self,
    "MemHighAlarm",
    alarm_name="v1-orcha-mem-high",
    alarm_description="Instance memory usage above 85% (pre-OOM warning)",
    metric=cloudwatch.Metric(
        namespace="V1Orcha",
        metric_name="mem_used_percent",
        dimensions_map={"AutoScalingGroupName": asg.auto_scaling_group_name},
        statistic="Average",
        period=Duration.seconds(300),
    ),
    threshold=85,
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    evaluation_periods=2,
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)
mem_high_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))
# 6c. Instance swap high (cushion being consumed)
swap_high_alarm = cloudwatch.Alarm(
    self,
    "SwapHighAlarm",
    alarm_name="v1-orcha-swap-high",
    alarm_description="Instance swap usage above 50% - memory pressure",
    metric=cloudwatch.Metric(
        namespace="V1Orcha",
        metric_name="swap_used_percent",
        dimensions_map={"AutoScalingGroupName": asg.auto_scaling_group_name},
        statistic="Average",
        period=Duration.seconds(300),
    ),
    threshold=50,
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    evaluation_periods=2,
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)
swap_high_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))
# 6d. Instance disk low (root volume; log flood / heap dumps)
disk_low_alarm = cloudwatch.Alarm(
    self,
    "DiskLowAlarm",
    alarm_name="v1-orcha-disk-low",
    alarm_description="Root volume above 85% used",
    metric=cloudwatch.Metric(
        namespace="V1Orcha",
        metric_name="disk_used_percent",
        dimensions_map={"AutoScalingGroupName": asg.auto_scaling_group_name},
        statistic="Average",
        period=Duration.seconds(300),
    ),
    threshold=85,
    comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
    evaluation_periods=1,
    treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
)
disk_low_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))
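To make the threshold/evaluation_periods settings above concrete, here is a deliberately simplified model of when such an alarm would breach. It is a sketch only: real CloudWatch evaluation has more nuance (INSUFFICIENT_DATA, evaluation ranges around missing data), and the function name and sample values are invented for illustration.

```python
from typing import Optional, Sequence


def breaches(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    evaluation_periods: int,
) -> bool:
    """Simplified model: breach when each of the last `evaluation_periods`
    datapoints is strictly greater than `threshold`.

    None stands for a missing datapoint and counts as NOT breaching,
    mirroring TreatMissingData.NOT_BREACHING in the alarms above.
    """
    window = list(datapoints)[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False  # not enough periods observed yet
    return all(dp is not None and dp > threshold for dp in window)


# mem/swap alarms need two consecutive 5-minute periods over the threshold.
mem_alarm = breaches([80.0, 90.0, 91.0], threshold=85, evaluation_periods=2)
# The disk alarm fires on a single period over 85%.
disk_alarm = breaches([70.0, 86.0], threshold=85, evaluation_periods=1)
```

Under this model a single memory spike followed by a normal or missing datapoint does not page, while one period of a full root volume does, which matches the intent of evaluation_periods=2 versus evaluation_periods=1.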
Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdOps > /tmp/c2.yaml
Expected: synth succeeds; grep -E "v1-orcha-(mem|swap|disk)-(high|low)" /tmp/c2.yaml | sort -u shows all three alarm names.
git add infra/stacks/ops_stack.py
git commit -m "feat(infra): add mem/swap/disk CloudWatch alarms -> v1-orcha-alerts"
Files:
Modify: infra/stacks/ops_stack.py (reuse the exact metric= block from the existing alb_unhealthy_alarm, the Tier-1 alarm named v1-orcha-alb-unhealthy)
Step 1: Add a separate, slower alarm
Locate the existing alb_unhealthy_alarm = cloudwatch.Alarm(... alarm_name="v1-orcha-alb-unhealthy" ...) (Tier-1, ops_stack.py:152-172). Directly after its .add_alarm_action(...) line — same method scope, so target_group and alb are in scope — add a second alarm. Copy only the series identity from the source alarm's metric= block so both alarms watch the same data: namespace, metric_name, and the literal dimensions_map dict (ops_stack.py:160-163). Do not copy statistic, and do not reference alb_unhealthy_alarm.metric.dimensions (CDK cloudwatch.Metric exposes no such reusable attribute, and the dimension values are CDK tokens, not literals). statistic, period, evaluation_periods, and datapoints_to_alarm are deliberately different so this alarm fires only on a sustained outage:
# Auto-recovery trigger: ALB has no healthy targets for >= 15 min.
# Deliberately slower than any normal CodeDeploy deployment window.
alb_unhealthy_sustained_alarm = cloudwatch.Alarm(
    self,
    "AlbUnhealthySustainedAlarm",
    alarm_name="v1-orcha-alb-unhealthy-sustained",
    alarm_description="No healthy ALB targets for >=15 min - auto-replace instance",
    metric=cloudwatch.Metric(
        # Same series as v1-orcha-alb-unhealthy (ops_stack.py:157-166):
        # identical namespace/metric_name/dimensions_map so both alarms
        # watch the SAME data. dimensions_map is the LITERAL dict from
        # the source alarm (target_group/alb are in scope here); never
        # alb_unhealthy_alarm.metric.dimensions (no such attribute).
        namespace="AWS/ApplicationELB",
        metric_name="HealthyHostCount",
        dimensions_map={
            "TargetGroup": target_group.target_group_full_name,
            "LoadBalancer": alb.load_balancer_full_name,
        },
        # Deliberately Maximum, NOT the source alarm's Minimum.
        # With LESS_THAN_THRESHOLD(1), Maximum<1 means the healthy
        # count was 0 across the ENTIRE period (every sample), a true
        # sustained-zero signal. Minimum<1 would trip on any single
        # transient dip (e.g. normal target dereg on deploy), which
        # must never auto-terminate a prod instance.
        statistic="Maximum",
        period=Duration.seconds(60),
    ),
    threshold=1,
    comparison_operator=cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
    evaluation_periods=15,
    datapoints_to_alarm=15,
    treat_missing_data=cloudwatch.TreatMissingData.BREACHING,
)
alb_unhealthy_sustained_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))
Implementation note: the dimensions_map literal above is copied from ops_stack.py:160-163 (the source alb_unhealthy_alarm); target_group and alb are the same in-scope locals that alarm uses, since the new alarm is added in the same method. If the source dimensions_map ever changes, change it here too: both alarms must watch the identical series. The statistic difference (Maximum here vs Minimum there) is intentional and load-bearing for an auto-terminate trigger; see the inline comment.
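The Maximum-versus-Minimum choice can be illustrated with a toy model of the alarm's 15 one-minute periods. This is a sketch under assumptions: the per-period sample lists are invented, and the functions are hypothetical stand-ins for CloudWatch's evaluation, not its actual implementation.

```python
from typing import Sequence

DWELL_SECONDS = 15 * 60  # evaluation_periods (15) x period (60 s) = 900 s


def period_breaches(samples: Sequence[int], statistic: str, threshold: int = 1) -> bool:
    """True when the period's statistic is below `threshold` (LESS_THAN_THRESHOLD)."""
    value = max(samples) if statistic == "Maximum" else min(samples)
    return value < threshold


def sustained_alarm(
    periods: Sequence[Sequence[int]],
    statistic: str,
    evaluation_periods: int = 15,
) -> bool:
    """ALARM only if the last `evaluation_periods` periods all breach
    (datapoints_to_alarm == evaluation_periods == 15 in the alarm above)."""
    recent = list(periods)[-evaluation_periods:]
    return len(recent) == evaluation_periods and all(
        period_breaches(p, statistic) for p in recent
    )


# Invented data: a target that dips to 0 once per minute but is otherwise healthy.
flapping = [[1, 0, 1]] * 15
# Invented data: a genuinely wedged instance, zero healthy hosts throughout.
outage = [[0, 0, 0]] * 15
```

With these inputs, Minimum would alarm on the flapping series even though a healthy host was present every minute, while Maximum alarms only on the true outage. That is the property an auto-terminate trigger needs.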
Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdOps > /tmp/c3.yaml
Expected: grep -c v1-orcha-alb-unhealthy-sustained /tmp/c3.yaml ≥ 1; alarm shows EvaluationPeriods: 15, Period: 60.
git add infra/stacks/ops_stack.py
git commit -m "feat(infra): add v1-orcha-alb-unhealthy-sustained (15-min) alarm"
Files:
Modify: infra/stacks/ops_stack.py (imports + new constructs)
Step 1: Ensure imports
In the top-of-file from aws_cdk import (...) block, ensure these aliases exist (add any missing — aws_codedeploy, aws_sns, cloudwatch, cw_actions are already imported):
aws_ssm as ssm,
aws_events as events,
aws_events_targets as events_targets,
aws_iam as iam,
After the Task 3 alarm, add the automation document. It takes the ASG name and the CodeDeploy application + deployment-group names as parameters (wired in Step 4 from the existing CodeDeploy constructs in this same stack):
replace_doc = ssm.CfnDocument(
    self,
    "ReplaceWedgedInstanceDoc",
    name="v1-orcha-replace-wedged-instance",
    document_type="Automation",
    document_format="YAML",
    content={
        "schemaVersion": "0.3",
        "description": "Replace a wedged ASG instance if no deploy is in progress and not within cooldown.",
        "assumeRole": "{{ AutomationAssumeRole }}",
        "parameters": {
            "AutomationAssumeRole": {"type": "String"},
            "AsgName": {"type": "String"},
            "SustainedAlarmName": {"type": "String"},
            "CodeDeployApp": {"type": "String"},
            "CodeDeployGroup": {"type": "String"},
            "CooldownMinutes": {"type": "String", "default": "20"},
        },
        "mainSteps": [
            {
                "name": "GuardAndRemediate",
                "action": "aws:executeScript",
                "inputs": {
                    "Runtime": "python3.11",
                    "Handler": "handler",
                    "InputPayload": {
                        "asg": "{{ AsgName }}",
                        "alarm": "{{ SustainedAlarmName }}",
                        "cd_app": "{{ CodeDeployApp }}",
                        "cd_group": "{{ CodeDeployGroup }}",
                        "cooldown_minutes": "{{ CooldownMinutes }}",
                    },
                    # The embedded script's indentation is significant:
                    # each string keeps its leading spaces.
                    "Script": (
                        "import boto3, datetime\n"
                        "def handler(event, context):\n"
                        "    asg_name = event['asg']\n"
                        "    # Only act while the sustained alarm is actually ALARM.\n"
                        "    # This makes the periodic retry rule safe AND closes the\n"
                        "    # edge-trigger gap: after a codedeploy-in-progress abort the\n"
                        "    # alarm stays ALARM (no new state-change event), so the\n"
                        "    # periodic rule re-invokes and remediates once the deploy\n"
                        "    # clears -- instead of never recovering.\n"
                        "    cw = boto3.client('cloudwatch')\n"
                        "    al = cw.describe_alarms(\n"
                        "        AlarmNames=[event['alarm']]).get('MetricAlarms', [])\n"
                        "    if not al or al[0]['StateValue'] != 'ALARM':\n"
                        "        return {'action':'aborted','reason':'alarm-not-in-alarm'}\n"
                        "    cd = boto3.client('codedeploy')\n"
                        "    deps = cd.list_deployments(applicationName=event['cd_app'],\n"
                        "        deploymentGroupName=event['cd_group'],\n"
                        "        includeOnlyStatuses=['Created','Queued','InProgress','Baking','Ready'])\n"
                        "    if deps.get('deployments'):\n"
                        "        return {'action':'aborted','reason':'codedeploy-in-progress'}\n"
                        "    asg = boto3.client('autoscaling')\n"
                        "    g = asg.describe_auto_scaling_groups(\n"
                        "        AutoScalingGroupNames=[asg_name])['AutoScalingGroups'][0]\n"
                        "    # Exclude instances already being torn down. A slow ASG\n"
                        "    # terminate still lists the old instance (LifecycleState\n"
                        "    # Terminating*) when the 5-min retry fires; without this\n"
                        "    # filter the retry would re-terminate that doomed box.\n"
                        "    # Phase 1 ASG is min=max=desired=1, so this yields the one\n"
                        "    # live instance -- the safety of picking iids[0] below\n"
                        "    # depends on that single-instance invariant.\n"
                        "    iids = [i['InstanceId'] for i in g['Instances']\n"
                        "            if not i['LifecycleState'].startswith('Terminating')]\n"
                        "    if not iids:\n"
                        "        return {'action':'aborted','reason':'no-instances'}\n"
                        "    # Cooldown: abort if ANY current ASG instance was launched\n"
                        "    # within the window. Reads each instance's real LaunchTime\n"
                        "    # (depth-independent) instead of scanning the last N scaling\n"
                        "    # activities, which can page the launch record off the list\n"
                        "    # under churn and double-terminate.\n"
                        "    cutoff = datetime.datetime.now(datetime.timezone.utc) - \\\n"
                        "        datetime.timedelta(minutes=int(event['cooldown_minutes']))\n"
                        "    ec2 = boto3.client('ec2')\n"
                        "    res = ec2.describe_instances(InstanceIds=iids)\n"
                        "    launched = [inst['LaunchTime']\n"
                        "                for r in res['Reservations'] for inst in r['Instances']]\n"
                        "    if launched and max(launched) > cutoff:\n"
                        "        return {'action':'aborted','reason':'cooldown'}\n"
                        "    iid = iids[0]\n"
                        "    asg.terminate_instance_in_auto_scaling_group(\n"
                        "        InstanceId=iid, ShouldDecrementDesiredCapacity=False)\n"
                        "    return {'action':'terminated','instance':iid}\n"
                    ),
                },
            }
        ],
    },
)
automation_role = iam.Role(
    self,
    "ReplaceWedgedInstanceRole",
    assumed_by=iam.ServicePrincipal("ssm.amazonaws.com"),
)
automation_role.add_to_policy(
    iam.PolicyStatement(
        actions=[
            # Gate everything on the sustained alarm actually
            # being ALARM (makes the periodic retry rule safe).
            "cloudwatch:DescribeAlarms",
            "autoscaling:DescribeAutoScalingGroups",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
            "codedeploy:ListDeployments",
            # Load-bearing: the cooldown guard reads each ASG
            # instance's LaunchTime (depth-independent; no
            # DescribeScalingActivities paging).
            "ec2:DescribeInstances",
        ],
        resources=["*"],
    )
)
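The guard order in the automation script can be checked without AWS access by expressing the decision as a pure function. The sketch below is a hypothetical refactoring for local testing, not the deployed script: the name decide, its parameters, and the instance ids used with it are assumptions, and the caller is assumed to have already filtered out Terminating* instances.

```python
import datetime
from typing import Sequence


def decide(
    alarm_state: str,
    active_deployments: Sequence[str],
    launch_times: dict[str, datetime.datetime],
    now: datetime.datetime,
    cooldown_minutes: int = 20,
) -> dict:
    """Return the same action dicts as the automation script's guards.

    launch_times maps instance id -> LaunchTime for instances that are NOT
    in a Terminating* lifecycle state (the caller filters those out).
    """
    if alarm_state != "ALARM":  # guard 1: alarm must currently be ALARM
        return {"action": "aborted", "reason": "alarm-not-in-alarm"}
    if active_deployments:  # guard 2: never fight a deploy
        return {"action": "aborted", "reason": "codedeploy-in-progress"}
    if not launch_times:
        return {"action": "aborted", "reason": "no-instances"}
    cutoff = now - datetime.timedelta(minutes=cooldown_minutes)
    if max(launch_times.values()) > cutoff:  # guard 3: cooldown window
        return {"action": "aborted", "reason": "cooldown"}
    # Single-instance invariant (min=max=desired=1): the first id is the target.
    return {"action": "terminated", "instance": next(iter(launch_times))}
```

Keeping the decision separate from the boto3 calls would let each abort reason be asserted in a unit test before the operator-gated dry-runs; whether to adopt that split in the real document is a design choice this plan does not require.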
Pinned to this repo's CDK (aws-cdk-lib 2.238.0, verified): aws_events_targets.SsmAutomation does not exist; aws_events_targets.AwsApi does. AwsApi is Lambda-backed — its class doc is literally "Use an AWS Lambda function that makes API calls as an event rule target" (aws_events_targets/__init__.py:1280). It synthesizes a CDK-managed singleton Lambda (one AWS<account>...AwsApi function shared by all AwsApi targets in the stack) plus that Lambda's execution role and log group, and the policy_statement is attached to the Lambda's role (not an EventBridge role). This is acceptable for Phase 1 — invoked only on a sustained outage / at most once per retry tick, so cost ≈ $0 — but the synth/diff will include those Lambda resources; the earlier "no Lambda shim" framing was wrong (do not expect a Lambda-free diff).
Two rules, one target: the alarm-state-change rule is edge-triggered (fires once on the OK→ALARM transition). On its own, a codedeploy-in-progress abort while the alarm stays ALARM would never be retried — the exact deploy-adjacent wedge Phase 1 must cover. So add (1) the edge rule for fast first response and (2) a 5-minute scheduled rule as the retry net. Both invoke the same document; the script's first guard (alarm must be ALARM) makes the periodic rule a no-op on a healthy box, and the CodeDeploy + cooldown guards prevent fighting a deploy or flapping. CodeDeploy names are the source-of-truth literals from ops_stack.py: application v1-orcha (line 829), deployment group v1-orcha-production (line 836); ASG is v1-orcha-asg; sustained alarm v1-orcha-alb-unhealthy-sustained (Task 3).
# One target, reused by both rules (AwsApi's singleton Lambda is
# shared, the cooldown guard prevents double-terminate if both fire).
replace_target = events_targets.AwsApi(
    service="SSM",
    action="startAutomationExecution",
    parameters={
        "DocumentName": replace_doc.name,
        "Parameters": {
            "AutomationAssumeRole": [automation_role.role_arn],
            "AsgName": ["v1-orcha-asg"],
            "SustainedAlarmName": ["v1-orcha-alb-unhealthy-sustained"],
            "CodeDeployApp": ["v1-orcha"],
            "CodeDeployGroup": ["v1-orcha-production"],
            "CooldownMinutes": ["20"],
        },
    },
    policy_statement=iam.PolicyStatement(
        actions=["ssm:StartAutomationExecution", "iam:PassRole"],
        resources=[
            f"arn:aws:ssm:{self.region}:{self.account}:automation-definition/{replace_doc.name}:*",
            automation_role.role_arn,
        ],
    ),
)
# (1) Edge-triggered: fast first response on the OK->ALARM transition.
events.Rule(
    self,
    "AlbUnhealthySustainedToSsm",
    rule_name="v1-orcha-auto-replace-on-sustained-unhealthy",
    event_pattern=events.EventPattern(
        source=["aws.cloudwatch"],
        detail_type=["CloudWatch Alarm State Change"],
        resources=[alb_unhealthy_sustained_alarm.alarm_arn],
        detail={"state": {"value": ["ALARM"]}},
    ),
    targets=[replace_target],
)
# (2) Retry net: re-invoke every 5 min so a remediation that aborted
# (codedeploy-in-progress / cooldown) is retried while the alarm
# stays ALARM. The script's alarm-state guard makes this a no-op
# whenever the alarm is not ALARM (i.e. the box is healthy).
events.Rule(
    self,
    "AlbUnhealthySustainedRetry",
    rule_name="v1-orcha-auto-replace-retry",
    schedule=events.Schedule.rate(Duration.minutes(5)),
    targets=[replace_target],
)
(AwsApi synthesizes a CDK-managed singleton Lambda + its execution role + log group; the policy_statement above is attached to that Lambda's role. automation_role (Step 3) is separate — it is what the SSM document assumes, passed as AutomationAssumeRole. Both rules share the one replace_target, so only one singleton Lambda is created.)
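Why both rules are needed can be seen in a small timeline simulation. This is an illustrative model only: the minute-by-minute timeline is invented, the cooldown guard is left out for brevity, and first_termination is a hypothetical name rather than anything in the stack.

```python
from typing import Optional, Sequence, Tuple

# Each entry is one minute: (alarm_state, deployment_in_progress).
Timeline = Sequence[Tuple[str, bool]]


def first_termination(timeline: Timeline, retry_every: Optional[int]) -> Optional[int]:
    """Return the minute at which the guarded script would terminate, or None.

    The edge rule fires only on a transition into ALARM; the optional retry
    rule fires every `retry_every` minutes regardless of alarm state.
    """
    previous = "OK"
    for minute, (state, deploying) in enumerate(timeline):
        edge_fired = state == "ALARM" and previous != "ALARM"
        retry_fired = retry_every is not None and minute % retry_every == 0
        previous = state
        if not (edge_fired or retry_fired):
            continue
        if state != "ALARM":  # guard 1: alarm must currently be ALARM
            continue
        if deploying:  # guard 2: abort while a deployment is in progress
            continue
        return minute
    return None


# Invented scenario: alarm fires at minute 2 while a deploy runs until minute 7.
wedged_during_deploy = (
    [("OK", False)] * 2 + [("ALARM", True)] * 6 + [("ALARM", False)] * 12
)
edge_only = first_termination(wedged_during_deploy, retry_every=None)
edge_plus_retry = first_termination(wedged_during_deploy, retry_every=5)
```

In this scenario the edge rule alone never remediates, because its single invocation aborts on the in-progress deployment and no further state-change event arrives. Adding the 5-minute retry recovers at the first tick after the deployment clears, and on an all-OK timeline the retry rule never terminates anything.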
Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdOps > /tmp/c4.yaml
Expected: synth succeeds; /tmp/c4.yaml contains v1-orcha-replace-wedged-instance, both events rules (v1-orcha-auto-replace-on-sustained-unhealthy and v1-orcha-auto-replace-retry), automation_role, and the AwsApi singleton Lambda + its execution role. Note: current CDK emits the AwsApi Lambda with a hashed logical id (AWS<hash>), NOT one literally containing "AwsApi" — do not grep for the string "AwsApi". Verify the Lambda by: there are ≥2 AWS::Lambda::Function resources, and the AwsApi one is identifiable via its aws:cdk:path metadata (…AlbUnhealthySustainedToSsmTarget…Handler) and its inline default policy granting ssm:StartAutomationExecution/iam:PassRole. The Lambda is expected — see Step 4.
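Because the AwsApi Lambda cannot be found by name, one way to locate it is to follow the IAM policy instead. The sketch below is an assumption-laden helper, not part of the plan's required steps: it expects the JSON template that cdk synth normally writes under cdk.out (assumed here to be cdk.out/V1OrchaProdOps.template.json), and it assumes the usual CDK layout in which the policy is an AWS::IAM::Policy resource referencing the Lambda's role.

```python
import json
from pathlib import Path


def load_template(path: str) -> dict:
    """Read a synthesized CloudFormation template (JSON) from disk."""
    return json.loads(Path(path).read_text())


def _as_list(value):
    return value if isinstance(value, list) else [value]


def lambdas_granting(template: dict, action: str) -> list[str]:
    """Logical ids of Lambda functions whose role is referenced by an
    AWS::IAM::Policy resource that lists `action`."""
    resources = template.get("Resources", {})
    # Step 1: collect the roles attached to any policy listing the action.
    roles = set()
    for res in resources.values():
        if res.get("Type") != "AWS::IAM::Policy":
            continue
        props = res.get("Properties", {})
        statements = _as_list(props.get("PolicyDocument", {}).get("Statement", []))
        if any(action in _as_list(s.get("Action", [])) for s in statements):
            roles.update(
                r["Ref"]
                for r in props.get("Roles", [])
                if isinstance(r, dict) and "Ref" in r
            )
    # Step 2: find Lambda functions whose Role is Fn::GetAtt of such a role.
    found = []
    for logical_id, res in resources.items():
        if res.get("Type") != "AWS::Lambda::Function":
            continue
        role = res.get("Properties", {}).get("Role", {})
        getatt = role.get("Fn::GetAtt", []) if isinstance(role, dict) else []
        if getatt and getatt[0] in roles:
            found.append(logical_id)
    return found
```

Calling lambdas_granting(load_template("cdk.out/V1OrchaProdOps.template.json"), "ssm:StartAutomationExecution") should then return the hashed logical id of the AwsApi singleton, provided the template path and policy layout match the assumptions above.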
cdk diff: confirm no unintended changes
Run: cd infra && . .venv/bin/activate && cdk diff V1OrchaProdOps
Expected additions only: the new alarms (Tasks 2-3), the SSM document, two EventBridge rules, automation_role, and the CDK-managed AwsApi singleton Lambda + its execution role + log group (per Step 4 — these are expected, not drift). No modifications to existing alarms, the pipeline, the SNS topic, or CodeDeploy.
git add infra/stacks/ops_stack.py
git commit -m "feat(infra): guarded SSM auto-replace (edge + 5m retry, alarm-state gated)"
Files: none. Run after cdk deploy V1OrchaProdOps. These need the deployed SSM document + ASG + CodeDeploy, so they are inherently prod/operator actions (Orcha has only prod and local — there is no staging). The memory acceptance gate that validates Plan A's permit default is a separate local, pre-deploy test owned by Plan B Task 5 (it depends only on the gate + lowered heap, not on this plan's alarm).
After cdk deploy V1OrchaProdOps. Note: aws ssm start-automation-execution returns only an AutomationExecutionId — it does not return the aws:executeScript payload. For each dry-run, capture the id and read the result:
EID=$(aws ssm start-automation-execution \
--document-name v1-orcha-replace-wedged-instance \
--parameters AutomationAssumeRole=<role-arn>,AsgName=v1-orcha-asg,SustainedAlarmName=v1-orcha-alb-unhealthy-sustained,CodeDeployApp=v1-orcha,CodeDeployGroup=v1-orcha-production \
--query AutomationExecutionId --output text)
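# The document declares no top-level outputs, so AutomationExecution.Outputs
# is expected to be empty; read the GuardAndRemediate step's outputs instead.
aws ssm get-automation-execution --automation-execution-id "$EID" \
--query "AutomationExecution.StepExecutions[?StepName=='GuardAndRemediate'].Outputs" --output json
# (or: aws ssm describe-automation-step-executions --automation-execution-id "$EID")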
{"action":"aborted","reason":"alarm-not-in-alarm"}: proves the periodic-retry safety guard; no instance terminated.
{"action":"aborted","reason":"codedeploy-in-progress"}.
{"action":"aborted","reason":"cooldown"} (validates the LaunchTime-based, depth-independent cooldown).
A codedeploy-in-progress abort is retried by the 5-minute rule and succeeds once the deployment clears (the F3 gap this closes).
These are operator actions; the plan does not execute them.
ec2(); ≥15-min dwell; alarm-state + CodeDeploy-in-progress + cooldown guards; edge rule + 5-min retry rule so a deploy-adjacent abort still recovers) → Tasks 3-4.
§Testing: auto-recovery alarm-state/CodeDeploy/cooldown dry-runs (with get-automation-execution to read the script payload) → Task 5. The spec's "23-invoice memory replay" says "in staging", but Orcha has no staging (infra/app.py:38 only allows env_name=prod); it is reworked as a local, pre-deploy gate owned by Plan B Task 5, since it depends only on Plan A's gate + Plan B's heap, not on this plan's alarm. The prod v1-orcha-mem-high alarm is the production monitor of that same property (validated by Tasks 1-2 synth, not by a staging run).
§Cost (~$2-3/mo; +1 swap metric; the AwsApi singleton Lambda is invoked only on sustained outage / per retry tick ≈ $0) → satisfied.
aws-cdk-lib 2.238.0); events_targets.AwsApi is Lambda-backed (verified at aws_events_targets/__init__.py:1280); synth/diff expectations in Steps 5-6 account for the CDK-managed singleton Lambda + role + log group. CodeDeploy/ASG/alarm names are source-of-truth literals from ops_stack.py:829,836 + Task 3.
Task 3's sustained-alarm metric uses the literal dimensions_map dict from ops_stack.py:160-163 (no reliance on a non-existent metric.dimensions attribute).
replace_doc.name reused in the IAM policy ARN; SustainedAlarmName param threaded document↔InputPayload↔replace_target and matches the Task 3 alarm name; one replace_target shared by both rules (one singleton Lambda; cooldown guard prevents double-terminate if both fire); automation_role is what the document assumes (AutomationAssumeRole), separate from the AwsApi Lambda's own execution role; alarm v1-orcha-alb-unhealthy-sustained ARN feeds the edge rule's resources; the sustained alarm shares the source alarm's series (same namespace/metric_name/literal dimensions_map) but deliberately uses statistic="Maximum" (sustained-zero) vs the source's Minimum; the cooldown guard is depth-independent (ec2:DescribeInstances LaunchTime, not DescribeScalingActivities paging); IAM grants cloudwatch:DescribeAlarms for the alarm-state guard; namespace V1Orcha + AutoScalingGroupName dimension consistent between Task 1 (agent emits) and Task 2 (alarms consume).