Prod-Clone Refactor Testing — Design Spec

Status: Draft, awaiting author review
Date: 2026-04-24
Author: Daniel Barreto (with Claude)

Purpose

Establish a standard procedure for validating large, schema-touching refactors against a clone of production data before deploying to prod. The first concrete use case is the legal_entity → tenant rename (see docs/superpowers/plans/2026-04-24-rename-legal-entity-to-tenant.md), but the procedure is intended to be reusable for any future migration whose blast radius warrants more than a local-schema pass.

Scope

In scope:

Out of scope:

Architecture

Three logical pieces:

  1. One-shot cloning pipeline: prod RDS → manual snapshot → throwaway RDS instance → pg_dump over SSM port-forward → local dump file → throwaway instance deleted.
  2. Local testing loop: dump file → pg_restore into an orcha_prod_clone database inside the existing docker-compose Postgres → apply migrations → schema diff → ingestion → UI smoke. Resetting from the dump takes minutes, so iteration is cheap.
  3. Runbook: the documented procedure that humans and agents follow, tying the scripts together.

┌─────────────────────────────────────────────────────────────────┐
│ Cloning (bb db:clone-prod) — ~30–45 min, unattended             │
│                                                                 │
│   RDS prod ──snapshot──▶ throwaway RDS ──pg_dump──▶ dump file  │
│                                  │                              │
│                                  ▼                              │
│                             (deleted)                           │
└─────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Local testing loop (iterable from dump file)                    │
│                                                                 │
│   dump ──load──▶ orcha_prod_clone DB (local docker Postgres)    │
│                              │                                  │
│                              ├── bb migrate migrate             │
│                              ├── bb db:schema-diff → pass/fail  │
│                              ├── bb ingest <test docs>          │
│                              └── manual UI smoke checklist      │
│                                                                 │
│   On failure: fix code, bb db:load-clone, re-run (~3 min)       │
└─────────────────────────────────────────────────────────────────┘
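
As an illustration of the cloning box above, the pipeline could be expressed as an ordered list of CLI invocations. This is a sketch in Python, not the Babashka script the spec plans. The names ClonePlan and clone_commands, the refactor- identifier prefixes, the bastion target, and local port 15432 are all hypothetical, and the custom dump format is an assumption based on the later use of pg_restore. The function only builds the commands; it does not run them.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ClonePlan:
    prod_instance: str      # hypothetical: identifier of the prod RDS instance
    run_id: str             # hypothetical: unique suffix, e.g. a timestamp
    bastion_instance: str   # hypothetical: EC2 instance ID used as the SSM target
    clone_host: str         # endpoint of the throwaway instance (known once it is up)
    db_name: str
    db_user: str
    local_port: int = 15432  # assumption: any free local port works


def clone_commands(p: ClonePlan) -> list[list[str]]:
    """Build the snapshot -> restore -> tunnel -> dump -> delete sequence."""
    snapshot = f"refactor-test-{p.run_id}"
    clone = f"refactor-clone-{p.run_id}"
    tunnel_params = json.dumps({
        "host": [p.clone_host],
        "portNumber": ["5432"],
        "localPortNumber": [str(p.local_port)],
    })
    return [
        # 1. Manual snapshot of prod, then wait until it is usable.
        ["aws", "rds", "create-db-snapshot",
         "--db-instance-identifier", p.prod_instance,
         "--db-snapshot-identifier", snapshot],
        ["aws", "rds", "wait", "db-snapshot-available",
         "--db-snapshot-identifier", snapshot],
        # 2. Restore the snapshot into a small throwaway instance.
        ["aws", "rds", "restore-db-instance-from-db-snapshot",
         "--db-instance-identifier", clone,
         "--db-snapshot-identifier", snapshot,
         "--db-instance-class", "db.t4g.small"],
        ["aws", "rds", "wait", "db-instance-available",
         "--db-instance-identifier", clone],
        # 3. SSM port-forward to the clone. This command blocks, so a real
        #    script would run it in the background while pg_dump executes.
        ["aws", "ssm", "start-session",
         "--target", p.bastion_instance,
         "--document-name", "AWS-StartPortForwardingSessionToRemoteHost",
         "--parameters", tunnel_params],
        # 4. Dump through the tunnel (custom format is an assumption).
        ["pg_dump", "--format=custom",
         "--host", "localhost", "--port", str(p.local_port),
         "--username", p.db_user, "--dbname", p.db_name,
         "--file", f"dump/prod-{p.run_id}.dump"],
        # 5. Delete the throwaway instance without a final snapshot.
        ["aws", "rds", "delete-db-instance",
         "--db-instance-identifier", clone,
         "--skip-final-snapshot"],
    ]
```

Keeping the plan as data makes it easy to print for review before anything touches AWS. Whether the manual snapshot is also deleted afterwards is not stated in this spec, so the sketch leaves it out.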

Why these choices

Throwaway RDS instead of dumping from prod directly

A pg_dump against prod holds a long REPEATABLE READ transaction open (~45 min for a 30 GB DB over a single SSM-tunneled connection). That adds I/O load to a single-AZ db.t4g.medium instance and delays vacuum cleanup for the duration. Restoring a manual snapshot to a db.t4g.small throwaway instance costs roughly $0.03 per run and ~15 min of extra wall-clock time, and keeps prod untouched. For a procedure that runs before every large refactor deploy, that's cheap insurance.
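
The figures in this paragraph can be sanity-checked with a few lines of arithmetic. The sketch below, in Python, uses hypothetical helper names and only numbers quoted in this spec (30 GB, ~45 min, $0.03 per run, and the ~30 min clone lifetime mentioned in the next subsection). It ignores storage and snapshot charges, so the implied hourly rate is a rough figure to compare against current pricing, not a quoted AWS price.

```python
def implied_throughput_mib_s(size_gb: float, minutes: float) -> float:
    """Average transfer rate needed to move size_gb in the given time."""
    return size_gb * 1024 / (minutes * 60)


def implied_hourly_rate_usd(cost_per_run_usd: float, lifetime_minutes: float) -> float:
    """Hourly rate that would produce the quoted per-run cost."""
    return cost_per_run_usd / (lifetime_minutes / 60)


# 30 GB dumped in ~45 min over the tunnel.
throughput = implied_throughput_mib_s(30, 45)

# $0.03 per run for an instance that lives ~30 min.
rate = implied_hourly_rate_usd(0.03, 30)

print(f"{throughput:.1f} MiB/s, ${rate:.2f}/h")
```

A sustained rate of around 11 MiB/s is plausible for a single tunneled connection, which is consistent with the dump time being dominated by the tunnel and not by the database.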

Local dump instead of testing against the throwaway RDS

Keeping a long-lived RDS clone and pointing local tests at it (what we called "full Option C" in brainstorming) performs worse than a local copy for this use case:

The dump file on disk is the reusable artifact. The remote clone lives ~30 min and dies.
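
Because the dump file is the reusable artifact, resetting the local clone reduces to drop, create, restore. The following Python sketch builds those three commands. It is an illustration, not the planned bb db:load-clone task: the connection defaults (localhost, 5432, postgres), the parallel job count, and the name guard are all assumptions.

```python
from typing import List


def load_clone_commands(
    dump_path: str,
    db_name: str = "orcha_prod_clone",
    host: str = "localhost",  # assumption: docker-compose publishes Postgres here
    port: int = 5432,         # assumption
    user: str = "postgres",   # assumption
    jobs: int = 4,            # parallel workers; needs a custom or directory dump
) -> List[List[str]]:
    """Build the commands that reset the local clone from a cached dump."""
    # Hypothetical guard: refuse to drop anything that is not named as a clone.
    if "clone" not in db_name:
        raise ValueError(f"refusing to drop non-clone database: {db_name}")
    conn = ["--host", host, "--port", str(port), "--username", user]
    return [
        ["dropdb", "--if-exists", *conn, db_name],
        ["createdb", *conn, db_name],
        # Skip ownership and grants so prod roles need not exist locally.
        ["pg_restore", "--no-owner", "--no-privileges",
         "--jobs", str(jobs), *conn, "--dbname", db_name, dump_path],
    ]
```

For example, load_clone_commands("dump/prod-<timestamp>.dump") yields a dropdb, a createdb, and a pg_restore invocation, in that order, which a wrapper could pass to subprocess.run one at a time.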

Schema diff as the primary correctness gate

Diffing pg_dump -s of (prod clone + new migration) against pg_dump -s of (fresh local with all migrations applied) catches every structural mistake: a missed constraint rename, a wrong default expression, a stale index, a forgotten trigger body, a missed enum rename. The check is reusable across future refactors with no bespoke code, and a zero-diff exit is an unambiguous pass signal. Alternatives considered:

Local clone as a distinct database (orcha_prod_clone)

The clone lives inside the same docker-compose Postgres container, not on a separate volume, which avoids docker-compose edits and volume juggling. The app switches databases via the ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone environment variable, set per command: there is no ambient mode flag, so no risk of forgetting which DB you're pointed at.
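
The override behaviour described here amounts to a small lookup with a fallback. A minimal sketch in Python, assuming a hypothetical default name of orcha_dev (the spec does not state the default) and treating an empty or whitespace-only value as unset:

```python
import os
from typing import Mapping

OVERRIDE_VAR = "ORCHA_LOCAL_DB_NAME_OVERRIDE"
DEFAULT_DB_NAME = "orcha_dev"  # hypothetical; the spec does not name the default DB


def resolve_db_name(
    env: Mapping[str, str] = os.environ,
    default: str = DEFAULT_DB_NAME,
) -> str:
    """Return the override when set and non-blank, otherwise the default."""
    override = env.get(OVERRIDE_VAR, "").strip()
    return override or default
```

Because the variable is read on each process start and is not stored as a mode, a shell without the variable always falls back to the default database.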

Components

New scripts (all under scripts/)

  1. scripts/clone_prod_db.clj → bb task db:clone-prod

  2. scripts/schema_diff.clj → bb task db:schema-diff --a <db> --b <db>

  3. scripts/db_clone_helpers.clj — small helpers exposed as bb tasks:

Runbook

docs/runbooks/prod-clone-refactor-testing.md — the canonical procedure. Structured as an ordered checklist so both humans and agents can follow it verbatim.

Data flow (runbook steps)

Step 1 — Clone prod to a dump file (~30–45 min, unattended)
  bb db:clone-prod
  → dump/prod-<timestamp>.dump

Step 2 — Load into local Postgres (~2–5 min)
  bb db:load-clone
  → database "orcha_prod_clone" with pre-migration schema

Step 3 — Sanity check: current master boots against pre-migration clone
  ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev
  → (integrant.repl/go) succeeds. Confirms the clone is usable and
    current master is healthy against prod schema.

Step 4 — Apply the pending migration(s)
  ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate

Step 5 — Schema assertion (the gate)
  bb db:fresh
  bb db:schema-diff --a orcha_prod_clone --b orcha_fresh
  → must exit 0. Any diff is a migration bug; iterate.

Step 6 — Ingestion smoke (programmatic)
  ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb dev          # start app
  bb ingest test/fixtures/<invoice-1>.pdf
  bb ingest test/fixtures/<invoice-2>.pdf
  → assert documents reach terminal state without errors.
    Watch logs for: queries against renamed columns, Malli decode
    errors, trigger-function payload key mismatches, SES consumer
    key mismatches, FK violations.

Step 7 — UI smoke (manual)
  Minimum checklist:
    - Document list loads and paginates
    - Open a document → view renders without errors
    - Matching screen loads a cluster
    - /tenants admin panel loads and shows renamed entities
    - /organizations admin panel loads
  → watch browser devtools + app logs for 500s.

Step 8 — Iterate on failure
  Fix code or migration, then:
    bb db:load-clone   # reset from cached dump (~3 min)
    (repeat from Step 4)

Step 9 — Cleanup
  bb db:drop-clone
  bb db:list-clones    # verify no stray RDS clones remain
  rm dump/prod-<timestamp>.dump   # manual; contains PII
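
The schema assertion in Step 5 can be sketched as a schema-only dump of each database followed by a text diff. The Python below is an illustration, not the planned scripts/schema_diff.clj. The function names are hypothetical, and the normalization rule (dropping blank lines, SQL comment lines, and psql meta-command lines, which some pg_dump versions emit with per-run tokens) is an assumption about what counts as noise.

```python
import difflib
import subprocess
from typing import List


def dump_schema(db_name: str) -> str:
    """Return the schema-only dump of a database reachable with default settings."""
    result = subprocess.run(
        ["pg_dump", "--schema-only", "--no-owner", "--no-privileges",
         "--dbname", db_name],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def normalize(dump: str) -> List[str]:
    """Drop lines that can differ between dumps without a schema change."""
    kept = []
    for raw in dump.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("--") or line.startswith("\\"):
            continue
        kept.append(line)
    return kept


def schema_diff(dump_a: str, dump_b: str,
                name_a: str = "a", name_b: str = "b") -> List[str]:
    """Unified diff of two normalized schema dumps; empty means identical."""
    return list(difflib.unified_diff(
        normalize(dump_a), normalize(dump_b),
        fromfile=name_a, tofile=name_b, lineterm="",
    ))


def gate(db_a: str, db_b: str) -> int:
    """Print any diff and return a process exit code: 0 on pass, 1 on fail."""
    diff = schema_diff(dump_schema(db_a), dump_schema(db_b), db_a, db_b)
    if diff:
        print("\n".join(diff))
    return 1 if diff else 0
```

A wrapper would call sys.exit(gate("orcha_prod_clone", "orcha_fresh")) to match the "must exit 0" rule. Since the comparison is textual, anything pg_dump renders differently shows up as a diff, which matches the intent that any difference be investigated.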

Error handling & safety

What this procedure does NOT cover

Explicit limits, copied to the runbook:

Future extensions

Called out so they're visible but explicitly not built now: