Status: Draft — awaiting author review Date: 2026-04-24 Author: Daniel Barreto (with Claude)
Establish a standard procedure for validating large, schema-touching refactors against a clone of production data before deploying to prod. First concrete use case is the legal_entity → tenant rename (see docs/superpowers/plans/2026-04-24-rename-legal-entity-to-tenant.md), but the procedure is intended to be reusable for any future migration whose blast radius warrants more than a local-schema pass.
In scope:
Out of scope:
Three logical pieces:
pg_dump over SSM port-forward → local dump file → throwaway instance deleted.pg_restore into a orcha_prod_clone database inside the existing docker-compose Postgres → apply migrations → schema diff → ingestion → UI smoke. Reset-from-dump is minutes, so iteration is cheap.┌─────────────────────────────────────────────────────────────────┐
│ Cloning (bb db:clone-prod) — ~30–45 min, unattended │
│ │
│ RDS prod ──snapshot──▶ throwaway RDS ──pg_dump──▶ dump file │
│ │ │
│ ▼ │
│ (deleted) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Local testing loop (iterable from dump file) │
│ │
│ dump ──load──▶ orcha_prod_clone DB (local docker Postgres) │
│ │ │
│ ├── bb migrate migrate │
│ ├── bb db:schema-diff → pass/fail │
│ ├── bb ingest <test docs> │
│ └── manual UI smoke checklist │
│ │
│ On failure: fix code, bb db:load-clone, re-run (~3 min) │
└─────────────────────────────────────────────────────────────────┘
A pg_dump against prod holds a long REPEATABLE READ transaction (~45 min for a 30 GB DB over SSM-tunneled single-connection). That adds IO load to a single-AZ t4g.medium instance and delays vacuum for the duration. Restoring a manual snapshot to a db.t4g.small throwaway instance costs roughly $0.03 per run and ~15 min of extra wall-clock, and keeps prod untouched. For a procedure that's run before every large refactor deploy, that's cheap insurance.
Keeping a long-lived RDS clone and pointing local tests at it (what we called "full Option C" in brainstorming) performs worse than a local copy for this use case:
pg_restore resets in ~3 min. Re-restoring from snapshot to RDS is ~15 min.localhost Postgres + local MiniStack. Remote DB + local everything-else is awkward.The dump file on disk is the reusable artifact. The remote clone lives ~30 min and dies.
pg_dump -s of (prod clone + new migration) vs (fresh local with all migrations applied) catches every structural mistake: missed constraint rename, wrong default expression, stale index, forgotten trigger body, enum rename miss. It's reusable across future refactors with no bespoke code, and a zero-diff exit is an unambiguous pass signal. Alternatives considered:
information_schema Clojure test — narrower coverage and we'd write bespoke expected-names per refactor.migrate up exit code — under-protects; plenty of structural mistakes don't cause runtime errors until weeks later.orcha_prod_clone)Living inside the same docker-compose Postgres container, not a separate volume, avoids docker-compose edits and volume-juggling. The app switches databases via ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone env var — no ambient mode flag, no risk of forgetting which DB you're pointed at.
scripts/)scripts/clone_prod_db.clj → bb task db:clone-prod
v1-orcha-db (automated snapshots run nightly, so the newest is usually <24h old). Create a fresh manual snapshot only if the newest is older than a freshness threshold (default 24h) or if --fresh-snapshot is passed.aws rds restore-db-instance-from-db-snapshot to v1-orcha-db-clone-<timestamp>:
db.t4g.small, storage gp3, single-AZbackup_retention=0, deletion_protection=false, publicly_accessible=falsev1-orcha-app EC2 can reach it over SSM port-forward)CloneOf=v1-orcha-db, CreatedBy=clone-prod-db, CreatedAt=<timestamp>DBInstanceStatus=available (~15 min).v1-orcha-app using AWS-StartPortForwardingSessionToRemoteHost with host=<clone-endpoint>, portNumber=5432, localPortNumber=25432.pg_dump -Fc -Z1 -h localhost -p 25432 -U <master> orcha -f dump/prod-<timestamp>.dump. Password passed via PGPASSWORD env in the subprocess only, never written to disk.aws rds delete-db-instance --skip-final-snapshot --delete-automated-backups.bb db:list-clones) lists all instances tagged CreatedBy=clone-prod-db — run it periodically to catch strays.finally block.scripts/schema_diff.clj → bb task db:schema-diff --a <db> --b <db>
pg_dump -s --no-owner --no-privileges --schema=public <db> for both inputs.-- line comments and /* ... */ block commentsSET and SELECT pg_catalog.set_config(...) statementsSELECT pg_catalog.setval(...))CREATE/ALTER statements by (object-type, object-name)dump/schema-<db>.sql, run diff --color=always -u.scripts/db_clone_helpers.clj — small helpers exposed as bb tasks:
db:load-clone — drops orcha_prod_clone if it exists, creates it, pg_restore --no-owner --no-privileges -d orcha_prod_clone dump/<latest-or-specified>.dump.db:fresh — drops and recreates orcha_fresh, runs init.sql, runs all migrations via migratus.db:drop-clone — drops orcha_prod_clone.db:list-clones — aws rds describe-db-instances filtered by Tag:CreatedBy=clone-prod-db, prints identifiers and ages. Zero-cost safety net.docs/runbooks/prod-clone-refactor-testing.md — the canonical procedure. Structured as an ordered checklist so both humans and agents can follow it verbatim.
Step 1 — Clone prod to a dump file (~30–45 min, unattended)
bb db:clone-prod
→ dump/prod-<timestamp>.dump
Step 2 — Load into local Postgres (~2–5 min)
bb db:load-clone
→ database "orcha_prod_clone" with pre-migration schema
Step 3 — Sanity check: current master boots against pre-migration clone
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev
→ (integrant.repl/go) succeeds. Confirms the clone is usable and
current master is healthy against prod schema.
Step 4 — Apply the pending migration(s)
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate
Step 5 — Schema assertion (the gate)
bb db:fresh
bb db:schema-diff --a orcha_prod_clone --b orcha_fresh
→ must exit 0. Any diff is a migration bug; iterate.
Step 6 — Ingestion smoke (programmatic)
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb dev # start app
bb ingest test/fixtures/<invoice-1>.pdf
bb ingest test/fixtures/<invoice-2>.pdf
→ assert documents reach terminal state without errors.
Watch logs for: queries against renamed columns, Malli decode
errors, trigger-function payload key mismatches, SES consumer
key mismatches, FK violations.
Step 7 — UI smoke (manual)
Minimum checklist:
- Document list loads and paginates
- Open a document → view renders without errors
- Matching screen loads a cluster
- /tenants admin panel loads and shows renamed entities
- /organizations admin panel loads
→ watch browser devtools + app logs for 500s.
Step 8 — Iterate on failure
Fix code or migration, then:
bb db:load-clone # reset from cached dump (~3 min)
(repeat from Step 4)
Step 9 — Cleanup
bb db:drop-clone
bb db:list-clones # verify no stray RDS clones remain
rm dump/prod-<timestamp>.dump # manual; contains PII
db:list-clones check. Three independent layers; no single failure leaves a running clone.25432, not 5432. Script refuses to run if bound; prevents accidentally pointing bb migrate at the SSM tunnel.ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone prefix. No ambient mode flag.pg_dump via PGPASSWORD env of the subprocess only, never written to disk, never logged.dump/ (gitignored). Runbook explicitly instructs manual deletion at cleanup. No auto-expiry; short-term retention for iteration is intentional.up migration is already wrapped in a single transaction (per the rename plan). Failure leaves the clone DB untouched; bb db:load-clone resets cleanly.pg_dump. Single long-running dump is the biggest failure mode. Mitigation: dump is re-runnable — on failure, re-run bb db:clone-prod (snapshot is already taken, script detects and reuses a recent snapshot to skip that step).Explicit limits, copied to the runbook:
db.t4g.small clone + untuned local Postgres are not perf-representative.up is fully isolated. Deadlocks from concurrent prod writers during the real deploy are not surfaced here. Mitigation: continue deploying in low-traffic windows.Called out so they're visible but explicitly not built now:
clj -X:test pointed at orcha_prod_clone would catch code paths that need real rows to execute. Not yet; adds complexity (test isolation against non-transactional data) without clear payoff for a rename-only refactor.pg_dump and local pg_restore. Not a default — anonymization usually breaks something and is a maintenance burden.--keep-clone flag that skips teardown and emits the connection info. Would flip us closer to "full Option C" for the few cases where it pays.