Phase 1 Resilience — Plan C: Observability & Auto-Recovery Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Steps use checkbox (- [ ]) syntax. CDK-only; the plan stops at cdk synth/cdk diff. cdk deploy and the SSM dry-run are operator-gated steps the user runs explicitly.

Goal: Make memory/swap/disk observable with alarms, and automatically replace a wedged instance the way it was replaced by hand during the incident — guarded so it never fights a deploy.

Architecture: The CloudWatch agent already emits mem_used_percent/disk_used_percent (namespace V1Orcha); add swap_used_percent and ASG-aggregation dimensions, then add three alarms → existing SNS v1-orcha-alerts. Separately, a sustained (≥15 min) ALB-unhealthy alarm drives an SSM Automation document via two EventBridge rules — an edge rule (on the OK→ALARM transition, for fast first response) and a 5-minute scheduled retry rule (so an aborted remediation is retried while still down). The document guards in order: (1) abort unless the sustained alarm is currently ALARM (makes the periodic rule safe and closes the edge-trigger gap), (2) abort if a CodeDeploy deployment is in progress, (3) abort if any ASG instance launched within the cooldown window, else (4) terminate the unhealthy ASG instance so the ASG relaunches a fresh one. The AwsApi event target is Lambda-backed (a CDK-managed singleton Lambda — see Task 4).
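The guard order described above can be sketched as a pure function, separate from boto3 and from the SSM document. This is a minimal illustration under stated assumptions, not the runbook script: the function name `decide` and its argument shapes are hypothetical, and only the returned action/reason values mirror the ones this plan uses.

```python
from datetime import datetime, timedelta, timezone


def decide(alarm_state, active_deployments, launch_times, now, cooldown_minutes=20):
    """Return the decision for one invocation (hypothetical helper).

    Guards run in the plan's order: alarm state, CodeDeploy, cooldown,
    and only then terminate.
    """
    if alarm_state != "ALARM":
        return {"action": "aborted", "reason": "alarm-not-in-alarm"}
    if active_deployments:
        return {"action": "aborted", "reason": "codedeploy-in-progress"}
    if not launch_times:
        return {"action": "aborted", "reason": "no-instances"}
    cutoff = now - timedelta(minutes=cooldown_minutes)
    if max(launch_times) > cutoff:
        return {"action": "aborted", "reason": "cooldown"}
    return {"action": "terminated"}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # fixed clock for the demo
old_launch = now - timedelta(hours=3)
print(decide("OK", [], [old_launch], now)["reason"])          # alarm-not-in-alarm
print(decide("ALARM", ["d-EXAMPLE"], [old_launch], now)["reason"])  # codedeploy-in-progress
print(decide("ALARM", [], [old_launch], now)["action"])       # terminated
```

Because the alarm-state check comes first, an invocation against a healthy box returns before any CodeDeploy or ASG state is consulted, which is what makes a periodic trigger safe.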

Tech Stack: AWS CDK (Python): cloudwatch, ssm, events, events_targets, iam; CloudWatch agent JSON.


File Structure


Task 1: Emit swap + ASG-aggregated dimensions from the CloudWatch agent

Files:

Current (lines 74-85):

  "metrics": {
    "namespace": "V1Orcha",
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"]
      }
    }
  }

Replace with:

  "metrics": {
    "namespace": "V1Orcha",
    "append_dimensions": {
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "aggregation_dimensions": [["AutoScalingGroupName"]],
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "swap": {
        "measurement": ["swap_used_percent"]
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"]
      }
    }
  }

This adds swap_used_percent and makes the agent publish an AutoScalingGroupName-aggregated series so static CDK alarms can target it (the same dimension the existing v1-orcha-ec2-high-cpu alarm uses).
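As a sanity check on the edited agent config, the three additions can be asserted on the parsed JSON before running synth. This is a minimal sketch; the helper name `missing_agent_settings` is hypothetical, and it assumes the "metrics" object has been loaded into a Python dict.

```python
REQUIRED_MEASUREMENTS = {
    "mem": "mem_used_percent",
    "swap": "swap_used_percent",
    "disk": "disk_used_percent",
}


def missing_agent_settings(agent_config):
    """List the settings this task expects that are absent from the config."""
    metrics = agent_config.get("metrics", {})
    problems = []
    appended = metrics.get("append_dimensions", {})
    if appended.get("AutoScalingGroupName") != "${aws:AutoScalingGroupName}":
        problems.append("append_dimensions.AutoScalingGroupName")
    if ["AutoScalingGroupName"] not in metrics.get("aggregation_dimensions", []):
        problems.append("aggregation_dimensions")
    collected = metrics.get("metrics_collected", {})
    for section, measurement in REQUIRED_MEASUREMENTS.items():
        if measurement not in collected.get(section, {}).get("measurement", []):
            problems.append(f"metrics_collected.{section}")
    return problems


# The "Current" config from this task, before the replacement.
current = {
    "metrics": {
        "namespace": "V1Orcha",
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["disk_used_percent"], "resources": ["/"]},
        },
    }
}
print(missing_agent_settings(current))
# ['append_dimensions.AutoScalingGroupName', 'aggregation_dimensions', 'metrics_collected.swap']
```

Run against the replacement block, the same helper should return an empty list.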

Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdCompute > /tmp/c1.yaml

Expected: synth succeeds; grep -c swap_used_percent /tmp/c1.yaml ≥ 1 and grep -c aggregation_dimensions /tmp/c1.yaml ≥ 1.

git add infra/stacks/compute_stack.py
git commit -m "feat(infra): CW agent emits swap + ASG-aggregated mem/swap/disk"

Task 2: Memory / swap / disk alarms

Files:

Immediately after the existing ec2_high_cpu_alarm.add_alarm_action(...) line, add:

        # 6b. Instance memory high (V1Orcha custom metric, ASG-aggregated)
        mem_high_alarm = cloudwatch.Alarm(
            self,
            "MemHighAlarm",
            alarm_name="v1-orcha-mem-high",
            alarm_description="Instance memory usage above 85% (pre-OOM warning)",
            metric=cloudwatch.Metric(
                namespace="V1Orcha",
                metric_name="mem_used_percent",
                dimensions_map={"AutoScalingGroupName": asg.auto_scaling_group_name},
                statistic="Average",
                period=Duration.seconds(300),
            ),
            threshold=85,
            comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
            evaluation_periods=2,
            treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
        )
        mem_high_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))

        # 6c. Instance swap high (cushion being consumed)
        swap_high_alarm = cloudwatch.Alarm(
            self,
            "SwapHighAlarm",
            alarm_name="v1-orcha-swap-high",
            alarm_description="Instance swap usage above 50% - memory pressure",
            metric=cloudwatch.Metric(
                namespace="V1Orcha",
                metric_name="swap_used_percent",
                dimensions_map={"AutoScalingGroupName": asg.auto_scaling_group_name},
                statistic="Average",
                period=Duration.seconds(300),
            ),
            threshold=50,
            comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
            evaluation_periods=2,
            treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
        )
        swap_high_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))

        # 6d. Instance disk low (root volume; log flood / heap dumps)
        disk_low_alarm = cloudwatch.Alarm(
            self,
            "DiskLowAlarm",
            alarm_name="v1-orcha-disk-low",
            alarm_description="Root volume above 85% used",
            metric=cloudwatch.Metric(
                namespace="V1Orcha",
                metric_name="disk_used_percent",
                dimensions_map={"AutoScalingGroupName": asg.auto_scaling_group_name},
                statistic="Average",
                period=Duration.seconds(300),
            ),
            threshold=85,
            comparison_operator=cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
            evaluation_periods=1,
            treat_missing_data=cloudwatch.TreatMissingData.NOT_BREACHING,
        )
        disk_low_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))

Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdOps > /tmp/c2.yaml

Expected: synth succeeds; grep -E "v1-orcha-(mem|swap|disk)-(high|low)" /tmp/c2.yaml | sort -u shows all three alarm names.

git add infra/stacks/ops_stack.py
git commit -m "feat(infra): add mem/swap/disk CloudWatch alarms -> v1-orcha-alerts"

Task 3: Sustained ALB-unhealthy alarm (auto-recovery trigger)

Files:

Locate the existing alb_unhealthy_alarm = cloudwatch.Alarm(... alarm_name="v1-orcha-alb-unhealthy" ...) (Tier-1, ops_stack.py:152-172). Directly after its .add_alarm_action(...) line — same method scope, so target_group and alb are in scope — add a second alarm. Copy only the series identity from the source alarm's metric= block so both alarms watch the same data: namespace, metric_name, and the literal dimensions_map dict (ops_stack.py:160-163). Do not copy statistic, and do not reference alb_unhealthy_alarm.metric.dimensions (CDK cloudwatch.Metric exposes no such reusable attribute, and the dimension values are CDK tokens, not literals). statistic, period, evaluation_periods, and datapoints_to_alarm are deliberately different so this alarm fires only on a sustained outage:

        # Auto-recovery trigger: ALB has no healthy targets for >= 15 min.
        # Deliberately slower than any normal CodeDeploy deployment window.
        alb_unhealthy_sustained_alarm = cloudwatch.Alarm(
            self,
            "AlbUnhealthySustainedAlarm",
            alarm_name="v1-orcha-alb-unhealthy-sustained",
            alarm_description="No healthy ALB targets for >=15 min - auto-replace instance",
            metric=cloudwatch.Metric(
                # Same series as v1-orcha-alb-unhealthy (ops_stack.py:157-166):
                # identical namespace/metric_name/dimensions_map so both alarms
                # watch the SAME data. dimensions_map is the LITERAL dict from
                # the source alarm (target_group/alb are in scope here) — never
                # alb_unhealthy_alarm.metric.dimensions (no such attribute).
                namespace="AWS/ApplicationELB",
                metric_name="HealthyHostCount",
                dimensions_map={
                    "TargetGroup": target_group.target_group_full_name,
                    "LoadBalancer": alb.load_balancer_full_name,
                },
                # Deliberately Maximum, NOT the source alarm's Minimum.
                # With LESS_THAN_THRESHOLD(1), Maximum<1 means the healthy
                # count was 0 across the ENTIRE period (every sample) — a true
                # sustained-zero signal. Minimum<1 would mark a period as
                # breaching on any single transient dip (e.g. normal target
                # dereg on deploy), and such dips must never count toward
                # auto-terminating a prod instance.
                statistic="Maximum",
                period=Duration.seconds(60),
            ),
            threshold=1,
            comparison_operator=cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
            evaluation_periods=15,
            datapoints_to_alarm=15,
            treat_missing_data=cloudwatch.TreatMissingData.BREACHING,
        )
        alb_unhealthy_sustained_alarm.add_alarm_action(cw_actions.SnsAction(self.alert_topic))

Implementation note: the dimensions_map literal above is copied from ops_stack.py:160-163 (the source alb_unhealthy_alarm); target_group and alb are the same in-scope locals that alarm uses, since the new alarm is added in the same method. If the source dimensions_map ever changes, change it here too — both alarms must watch the identical series. The statistic difference (Maximum here vs Minimum there) is intentional and load-bearing for an auto-terminate trigger; see the inline comment.
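To see why the statistic matters, the sketch below models the 15-of-15 evaluation over 60-second periods. It is a deliberately simplified model, not CloudWatch's evaluator: the function names are hypothetical, each period is just a list of HealthyHostCount samples, and missing data is reduced to "an empty period counts as breaching".

```python
def period_breaches(samples, statistic, threshold=1):
    """True if one 60-second period counts as breaching (value < threshold)."""
    if not samples:
        return True  # mirrors treat_missing_data=BREACHING in this simplified model
    value = max(samples) if statistic == "Maximum" else min(samples)
    return value < threshold


def alarm_fires(periods, statistic, evaluation_periods=15, datapoints_to_alarm=15):
    """True if the most recent window has enough breaching periods."""
    window = periods[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False
    breaching = sum(period_breaches(p, statistic) for p in window)
    return breaching >= datapoints_to_alarm


sustained_zero = [[0, 0, 0]] * 15  # no healthy target in any sample
flapping = [[0, 1, 0]] * 15        # healthy for part of every minute

print(alarm_fires(sustained_zero, "Maximum"))  # True
print(alarm_fires(flapping, "Maximum"))        # False
print(alarm_fires(flapping, "Minimum"))        # True
```

In this model a target that flaps for 15 minutes trips a Minimum-based alarm but not the Maximum-based one, which only fires when every sample in every period is zero.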

Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdOps > /tmp/c3.yaml

Expected: grep -c v1-orcha-alb-unhealthy-sustained /tmp/c3.yaml ≥ 1; alarm shows EvaluationPeriods: 15, Period: 60.

git add infra/stacks/ops_stack.py
git commit -m "feat(infra): add v1-orcha-alb-unhealthy-sustained (15-min) alarm"

Task 4: Guarded SSM Automation runbook + EventBridge wiring

Files:

In the top-of-file from aws_cdk import (...) block, ensure these aliases exist (add any missing — aws_codedeploy, aws_sns, cloudwatch, cw_actions are already imported):

    aws_ssm as ssm,
    aws_events as events,
    aws_events_targets as events_targets,
    aws_iam as iam,

After the Task 3 alarm, add the automation document. It takes the ASG name and the CodeDeploy application + deployment-group names as parameters (wired in Step 4 from the existing CodeDeploy constructs in this same stack):

        replace_doc = ssm.CfnDocument(
            self,
            "ReplaceWedgedInstanceDoc",
            name="v1-orcha-replace-wedged-instance",
            document_type="Automation",
            document_format="YAML",
            content={
                "schemaVersion": "0.3",
                "description": "Replace a wedged ASG instance if no deploy is in progress and not within cooldown.",
                "assumeRole": "{{ AutomationAssumeRole }}",
                "parameters": {
                    "AutomationAssumeRole": {"type": "String"},
                    "AsgName": {"type": "String"},
                    "SustainedAlarmName": {"type": "String"},
                    "CodeDeployApp": {"type": "String"},
                    "CodeDeployGroup": {"type": "String"},
                    "CooldownMinutes": {"type": "String", "default": "20"},
                },
                "mainSteps": [
                    {
                        "name": "GuardAndRemediate",
                        "action": "aws:executeScript",
                        "inputs": {
                            "Runtime": "python3.11",
                            "Handler": "handler",
                            "InputPayload": {
                                "asg": "{{ AsgName }}",
                                "alarm": "{{ SustainedAlarmName }}",
                                "cd_app": "{{ CodeDeployApp }}",
                                "cd_group": "{{ CodeDeployGroup }}",
                                "cooldown_minutes": "{{ CooldownMinutes }}",
                            },
                            "Script": (
                                "import boto3, datetime\n"
                                "def handler(event, context):\n"
                                "    asg_name = event['asg']\n"
                                "    # Only act while the sustained alarm is actually ALARM.\n"
                                "    # This makes the periodic retry rule safe AND closes the\n"
                                "    # edge-trigger gap: after a codedeploy-in-progress abort the\n"
                                "    # alarm stays ALARM (no new state-change event), so the\n"
                                "    # periodic rule re-invokes and remediates once the deploy\n"
                                "    # clears -- instead of never recovering.\n"
                                "    cw = boto3.client('cloudwatch')\n"
                                "    al = cw.describe_alarms(\n"
                                "        AlarmNames=[event['alarm']]).get('MetricAlarms', [])\n"
                                "    if not al or al[0]['StateValue'] != 'ALARM':\n"
                                "        return {'action':'aborted','reason':'alarm-not-in-alarm'}\n"
                                "    cd = boto3.client('codedeploy')\n"
                                "    deps = cd.list_deployments(applicationName=event['cd_app'],\n"
                                "        deploymentGroupName=event['cd_group'],\n"
                                "        includeOnlyStatuses=['Created','Queued','InProgress','Baking','Ready'])\n"
                                "    if deps.get('deployments'):\n"
                                "        return {'action':'aborted','reason':'codedeploy-in-progress'}\n"
                                "    asg = boto3.client('autoscaling')\n"
                                "    g = asg.describe_auto_scaling_groups(\n"
                                "        AutoScalingGroupNames=[asg_name])['AutoScalingGroups'][0]\n"
                                "    # Exclude instances already being torn down. A slow ASG\n"
                                "    # terminate still lists the old instance (LifecycleState\n"
                                "    # Terminating*) when the 5-min retry fires; without this\n"
                                "    # filter the retry would re-terminate that doomed box.\n"
                                "    # Phase 1 ASG is min=max=desired=1, so this yields the one\n"
                                "    # live instance -- the safety of picking iids[0] below\n"
                                "    # depends on that single-instance invariant.\n"
                                "    iids = [i['InstanceId'] for i in g['Instances']\n"
                                "        if not i['LifecycleState'].startswith('Terminating')]\n"
                                "    if not iids:\n"
                                "        return {'action':'aborted','reason':'no-instances'}\n"
                                "    # Cooldown: abort if ANY current ASG instance was launched\n"
                                "    # within the window. Reads each instance's real LaunchTime\n"
                                "    # (depth-independent) instead of scanning the last N scaling\n"
                                "    # activities, which can page the launch record off the list\n"
                                "    # under churn and double-terminate.\n"
                                "    cutoff = datetime.datetime.now(datetime.timezone.utc) - \\\n"
                                "        datetime.timedelta(minutes=int(event['cooldown_minutes']))\n"
                                "    ec2 = boto3.client('ec2')\n"
                                "    res = ec2.describe_instances(InstanceIds=iids)\n"
                                "    launched = [inst['LaunchTime']\n"
                                "        for r in res['Reservations'] for inst in r['Instances']]\n"
                                "    if launched and max(launched) > cutoff:\n"
                                "        return {'action':'aborted','reason':'cooldown'}\n"
                                "    iid = iids[0]\n"
                                "    asg.terminate_instance_in_auto_scaling_group(\n"
                                "        InstanceId=iid, ShouldDecrementDesiredCapacity=False)\n"
                                "    return {'action':'terminated','instance':iid}\n"
                            ),
                        },
                    }
                ],
            },
        )
        automation_role = iam.Role(
            self,
            "ReplaceWedgedInstanceRole",
            assumed_by=iam.ServicePrincipal("ssm.amazonaws.com"),
        )
        automation_role.add_to_policy(
            iam.PolicyStatement(
                actions=[
                    # Gate everything on the sustained alarm actually
                    # being ALARM (makes the periodic retry rule safe).
                    "cloudwatch:DescribeAlarms",
                    "autoscaling:DescribeAutoScalingGroups",
                    "autoscaling:TerminateInstanceInAutoScalingGroup",
                    "codedeploy:ListDeployments",
                    # Load-bearing: the cooldown guard reads each ASG
                    # instance's LaunchTime (depth-independent — no
                    # DescribeScalingActivities paging).
                    "ec2:DescribeInstances",
                ],
                resources=["*"],
            )
        )
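The lifecycle filter inside the script can be exercised locally against a dict shaped like one entry of a describe_auto_scaling_groups response. This is a minimal sketch: the helper name `live_instance_ids` and the sample instance ids are hypothetical.

```python
def live_instance_ids(group):
    """Instance ids in the group that are not already being terminated."""
    return [
        inst["InstanceId"]
        for inst in group["Instances"]
        if not inst["LifecycleState"].startswith("Terminating")
    ]


# Hypothetical snapshot taken when the 5-minute retry fires during a slow
# terminate: the old instance is still listed next to its replacement.
group = {
    "Instances": [
        {"InstanceId": "i-old", "LifecycleState": "Terminating:Wait"},
        {"InstanceId": "i-new", "LifecycleState": "Pending"},
    ]
}
print(live_instance_ids(group))  # ['i-new']
```

In this snapshot only the replacement survives the filter, and because it was just launched the cooldown guard would then abort the run. Note that the prefix test matches Terminating and its sub-states only; a state string of Terminated would pass the filter.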

Pinned to this repo's CDK (aws-cdk-lib 2.238.0, verified): aws_events_targets.SsmAutomation does not exist; aws_events_targets.AwsApi does. AwsApi is Lambda-backed: its class doc is literally "Use an AWS Lambda function that makes API calls as an event rule target" (aws_events_targets/__init__.py:1280). It synthesizes a CDK-managed singleton Lambda (one function with a hashed AWS<hash> logical id, shared by all AwsApi targets in the stack) plus that Lambda's execution role and log group, and the policy_statement is attached to the Lambda's role (not an EventBridge role). This is acceptable for Phase 1 because the target is invoked only on a sustained outage or at most once per retry tick, so cost ≈ $0. The synth/diff will include those Lambda resources; the earlier "no Lambda shim" framing was wrong (do not expect a Lambda-free diff).

Two rules, one target: the alarm-state-change rule is edge-triggered (fires once on the OK→ALARM transition). On its own, a codedeploy-in-progress abort while the alarm stays ALARM would never be retried — the exact deploy-adjacent wedge Phase 1 must cover. So add (1) the edge rule for fast first response and (2) a 5-minute scheduled rule as the retry net. Both invoke the same document; the script's first guard (alarm must be ALARM) makes the periodic rule a no-op on a healthy box, and the CodeDeploy + cooldown guards prevent fighting a deploy or flapping. CodeDeploy names are the source-of-truth literals from ops_stack.py: application v1-orcha (line 829), deployment group v1-orcha-production (line 836); ASG is v1-orcha-asg; sustained alarm v1-orcha-alb-unhealthy-sustained (Task 3).
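The case for the second rule can be illustrated with a toy timeline. This is a sketch under simplifying assumptions: the function `run_timeline` is hypothetical, time is in whole minutes since the OK→ALARM transition, the alarm is assumed to stay in ALARM throughout, the schedule ticks are assumed to line up with minute 0 (real schedule ticks will not), and only the CodeDeploy guard is modelled.

```python
def run_timeline(deploy_clears_at, retry_every=None, horizon=30):
    """Return (minute, outcome) for each invocation while the alarm stays ALARM."""
    fire_times = [0]  # edge rule: fires once, on the OK->ALARM transition
    if retry_every:
        # scheduled rule: fires on every tick up to the horizon
        fire_times += list(range(retry_every, horizon + 1, retry_every))
    log = []
    for minute in sorted(fire_times):
        if minute < deploy_clears_at:
            log.append((minute, "codedeploy-in-progress"))
        else:
            log.append((minute, "terminated"))
            break  # instance replaced; the simulation stops here
    return log


print(run_timeline(deploy_clears_at=7))
# [(0, 'codedeploy-in-progress')]
print(run_timeline(deploy_clears_at=7, retry_every=5))
# [(0, 'codedeploy-in-progress'), (5, 'codedeploy-in-progress'), (10, 'terminated')]
```

With the edge rule alone the only invocation aborts and nothing follows; with the 5-minute rule the first tick after the deployment clears performs the replacement.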

        # One target, reused by both rules. AwsApi's singleton Lambda is
        # shared; the cooldown guard prevents a double-terminate if both fire.
        replace_target = events_targets.AwsApi(
            service="SSM",
            action="startAutomationExecution",
            parameters={
                "DocumentName": replace_doc.name,
                "Parameters": {
                    "AutomationAssumeRole": [automation_role.role_arn],
                    "AsgName": ["v1-orcha-asg"],
                    "SustainedAlarmName": ["v1-orcha-alb-unhealthy-sustained"],
                    "CodeDeployApp": ["v1-orcha"],
                    "CodeDeployGroup": ["v1-orcha-production"],
                    "CooldownMinutes": ["20"],
                },
            },
            policy_statement=iam.PolicyStatement(
                actions=["ssm:StartAutomationExecution", "iam:PassRole"],
                resources=[
                    f"arn:aws:ssm:{self.region}:{self.account}:automation-definition/{replace_doc.name}:*",
                    automation_role.role_arn,
                ],
            ),
        )

        # (1) Edge-triggered: fast first response on the OK->ALARM transition.
        events.Rule(
            self,
            "AlbUnhealthySustainedToSsm",
            rule_name="v1-orcha-auto-replace-on-sustained-unhealthy",
            event_pattern=events.EventPattern(
                source=["aws.cloudwatch"],
                detail_type=["CloudWatch Alarm State Change"],
                resources=[alb_unhealthy_sustained_alarm.alarm_arn],
                detail={"state": {"value": ["ALARM"]}},
            ),
            targets=[replace_target],
        )

        # (2) Retry net: re-invoke every 5 min so a remediation that aborted
        # (codedeploy-in-progress / cooldown) is retried while the alarm
        # stays ALARM. The script's alarm-state guard makes this a no-op
        # whenever the alarm is not ALARM (i.e. the box is healthy).
        events.Rule(
            self,
            "AlbUnhealthySustainedRetry",
            rule_name="v1-orcha-auto-replace-retry",
            schedule=events.Schedule.rate(Duration.minutes(5)),
            targets=[replace_target],
        )

(AwsApi synthesizes a CDK-managed singleton Lambda + its execution role + log group; the policy_statement above is attached to that Lambda's role. automation_role (Step 3) is separate — it is what the SSM document assumes, passed as AutomationAssumeRole. Both rules share the one replace_target, so only one singleton Lambda is created.)

Run: cd infra && . .venv/bin/activate && cdk synth V1OrchaProdOps > /tmp/c4.yaml

Expected: synth succeeds; /tmp/c4.yaml contains v1-orcha-replace-wedged-instance, both events rules (v1-orcha-auto-replace-on-sustained-unhealthy and v1-orcha-auto-replace-retry), automation_role, and the AwsApi singleton Lambda + its execution role.

Note: current CDK emits the AwsApi Lambda with a hashed logical id (AWS<hash>), NOT one literally containing "AwsApi", so do not grep for the string "AwsApi". Verify the Lambda by: there are ≥2 AWS::Lambda::Function resources, and the AwsApi one is identifiable via its aws:cdk:path metadata (…AlbUnhealthySustainedToSsmTarget…Handler) and its inline default policy granting ssm:StartAutomationExecution/iam:PassRole. The Lambda is expected; see Step 4.
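The same expectations can be checked programmatically against a template loaded as a dict. This is a minimal sketch: the helper `missing_from_template` is hypothetical, and it assumes a JSON copy of the synthesized template is available (for example the file CDK normally writes under cdk.out/, a path assumed here rather than stated in this plan).

```python
from collections import Counter

# Physical names taken from this plan.
EXPECTED_NAMES = {
    "AWS::SSM::Document": {"v1-orcha-replace-wedged-instance"},
    "AWS::Events::Rule": {
        "v1-orcha-auto-replace-on-sustained-unhealthy",
        "v1-orcha-auto-replace-retry",
    },
    "AWS::CloudWatch::Alarm": {"v1-orcha-alb-unhealthy-sustained"},
}
# CloudFormation property that carries the physical name for each type.
NAME_PROPERTY = {
    "AWS::SSM::Document": "Name",
    "AWS::Events::Rule": "Name",
    "AWS::CloudWatch::Alarm": "AlarmName",
}


def missing_from_template(template, min_lambdas=2):
    """List expected resources that are absent from a CloudFormation template dict."""
    resources = list(template.get("Resources", {}).values())
    problems = []
    for rtype, wanted in EXPECTED_NAMES.items():
        found = {
            r.get("Properties", {}).get(NAME_PROPERTY[rtype])
            for r in resources
            if r.get("Type") == rtype
        }
        problems += sorted(f"{rtype}: {name}" for name in wanted - found)
    lambdas = Counter(r.get("Type") for r in resources)["AWS::Lambda::Function"]
    if lambdas < min_lambdas:
        problems.append(
            f"AWS::Lambda::Function: found {lambdas}, expected >= {min_lambdas}"
        )
    return problems


# Usage (path is an assumption):
#   import json
#   template = json.load(open("cdk.out/V1OrchaProdOps.template.json"))
#   print(missing_from_template(template))
```

An empty list means every named resource and the minimum Lambda count were found; identifying which Lambda belongs to AwsApi still needs the metadata inspection described above.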

Run: cd infra && . .venv/bin/activate && cdk diff V1OrchaProdOps

Expected additions only: the new alarms (Tasks 2-3), the SSM document, two EventBridge rules, automation_role, and the CDK-managed AwsApi singleton Lambda + its execution role + log group (per Step 4; these are expected, not drift). No modifications to existing alarms, the pipeline, the SNS topic, or CodeDeploy.

git add infra/stacks/ops_stack.py
git commit -m "feat(infra): guarded SSM auto-replace (edge + 5m retry, alarm-state gated)"

Task 5: SSM runbook guard dry-runs (operator-gated; documented, not executed by the plan)

Files: none. Run after cdk deploy V1OrchaProdOps. These need the deployed SSM document + ASG + CodeDeploy, so they are inherently prod/operator actions (Orcha has only prod and local — there is no staging). The memory acceptance gate that validates Plan A's permit default is a separate local, pre-deploy test owned by Plan B Task 5 (it depends only on the gate + lowered heap, not on this plan's alarm).

After cdk deploy V1OrchaProdOps, note that aws ssm start-automation-execution returns only an AutomationExecutionId; it does not return the aws:executeScript payload. For each dry-run, capture the id and read the result:

EID=$(aws ssm start-automation-execution \
  --document-name v1-orcha-replace-wedged-instance \
  --parameters AutomationAssumeRole=<role-arn>,AsgName=v1-orcha-asg,SustainedAlarmName=v1-orcha-alb-unhealthy-sustained,CodeDeployApp=v1-orcha,CodeDeployGroup=v1-orcha-production \
  --query AutomationExecutionId --output text)
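# The document declares no top-level outputs, so read the step's own output.
# Re-run this if the execution is still InProgress.
aws ssm get-automation-execution --automation-execution-id "$EID" \
  --query 'AutomationExecution.StepExecutions[0].Outputs' --output json
# (or: aws ssm describe-automation-step-executions --automation-execution-id "$EID")

Cases:
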
  1. With the alarm not in ALARM (healthy box) → step output {"action":"aborted","reason":"alarm-not-in-alarm"} — proves the periodic-retry safety guard; no instance terminated.
  2. While a CodeDeploy deployment is intentionally in progress (and the sustained alarm forced ALARM) → {"action":"aborted","reason":"codedeploy-in-progress"}.
  3. Right after a fresh instance launch → {"action":"aborted","reason":"cooldown"} (validates the LaunchTime-based, depth-independent cooldown).
  4. Only in a maintenance window: force the main target unhealthy ≥15 min and confirm the edge rule fires within ~1 min of the sustained alarm, the document terminates the instance, and the ASG relaunches; then confirm that a simulated codedeploy-in-progress abort is retried by the 5-minute rule and succeeds once the deployment clears (the F3 gap this closes).

These are operator actions; the plan does not execute them.
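An operator who wants a mechanical check of the first three cases could compare each result with the expected value. This is a minimal sketch with loud assumptions: the helper `check_dry_run` is hypothetical, the case numbers follow the list above, and `payload_text` is assumed to be the script's return value already extracted from the step output as JSON text (the exact nesting of that output is not shown in this plan).

```python
import json

# Expected script results for dry-run cases 1-3, copied from the list above.
EXPECTED = {
    1: {"action": "aborted", "reason": "alarm-not-in-alarm"},
    2: {"action": "aborted", "reason": "codedeploy-in-progress"},
    3: {"action": "aborted", "reason": "cooldown"},
}


def check_dry_run(case, payload_text):
    """Compare one dry-run's script result with the plan's expectation."""
    actual = json.loads(payload_text)
    if actual.get("action") == "terminated":
        # A guard dry-run must never terminate anything.
        return f"case {case}: UNEXPECTED termination of {actual.get('instance')}"
    if actual != EXPECTED[case]:
        return f"case {case}: expected {EXPECTED[case]}, got {actual}"
    return f"case {case}: ok"


print(check_dry_run(1, '{"action": "aborted", "reason": "alarm-not-in-alarm"}'))
# case 1: ok
```

Case 4 is left out on purpose: it is the one run where a termination is the expected outcome.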


Self-Review