Prod-Clone Refactor Testing — Design Spec

Status: Draft (awaiting author review)
Date: 2026-04-24
Author: Daniel Barreto (with Claude)

Purpose

Establish a standard procedure for validating large, schema-touching refactors against a clone of production data before deploying to prod. First concrete use case is the legal_entity → tenant rename (see docs/superpowers/plans/2026-04-24-rename-legal-entity-to-tenant.md), but the procedure is intended to be reusable for any future migration whose blast radius warrants more than a local-schema pass.

Scope

In scope:

Pull a consistent, point-in-time copy of the prod Postgres DB to the local dev machine.
Apply pending migrations to the copy.
Prove the resulting schema is identical, after canonicalization, to that of a fresh local DB with the same migrations applied.
Exercise the app's ingestion path programmatically against the migrated copy.
Run a manual UI smoke test against the migrated copy.

Out of scope:

Down-migration verification. Migrations are treated as forward-only; breaking changes are avoided across releases and cleaned up in follow-up migrations once old clients are gone.
Prod S3 data. The clone is DB-only; ingestion exercises use fresh local test documents routed through MiniStack.
Automated UI testing (Playwright or otherwise). Deferred; UI smoke stays a human-driven checklist for now.
Load/perf testing. The throwaway clone instance is cost-optimized, not perf-representative.
Cross-service integration testing (DATEV/SAP/etc.) beyond whatever incidentally fires during a test ingestion.

Architecture

Three logical pieces:

One-shot cloning pipeline: prod RDS → snapshot (recent automated, or fresh manual) → throwaway RDS instance → pg_dump over SSM port-forward → local dump file → throwaway instance deleted.
Local testing loop: dump file → pg_restore into an orcha_prod_clone database inside the existing docker-compose Postgres → apply migrations → schema diff → ingestion → UI smoke. A reset from the dump takes minutes, so iteration is cheap.
Runbook: the documented procedure that humans and agents follow, tying the scripts together.

┌─────────────────────────────────────────────────────────────────┐
│ Cloning (bb db:clone-prod) — ~30–45 min, unattended             │
│                                                                 │
│   RDS prod ──snapshot──▶ throwaway RDS ──pg_dump──▶ dump file  │
│                                  │                              │
│                                  ▼                              │
│                             (deleted)                           │
└─────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────┐
│ Local testing loop (iterable from dump file)                    │
│                                                                 │
│   dump ──load──▶ orcha_prod_clone DB (local docker Postgres)    │
│                              │                                  │
│                              ├── bb migrate migrate             │
│                              ├── bb db:schema-diff → pass/fail  │
│                              ├── bb ingest <test docs>          │
│                              └── manual UI smoke checklist      │
│                                                                 │
│   On failure: fix code, bb db:load-clone, re-run (~3 min)       │
└─────────────────────────────────────────────────────────────────┘

Why these choices

Throwaway RDS instead of dumping from prod directly

A pg_dump against prod holds a long REPEATABLE READ transaction open (~45 min for a 30 GB DB over a single SSM-tunneled connection). That adds I/O load to a single-AZ db.t4g.medium instance and holds back vacuum cleanup for the duration. Restoring a snapshot to a db.t4g.small throwaway instance costs roughly $0.03 per run and ~15 min of extra wall-clock, and keeps prod untouched. For a procedure that's run before every large refactor deploy, that's cheap insurance.

Local dump instead of testing against the throwaway RDS

Keeping a long-lived RDS clone and pointing local tests at it (what we called "full Option C" in brainstorming) performs worse than a local copy for this use case:

Iteration speed. Local pg_restore resets in ~3 min. Re-restoring from snapshot to RDS is ~15 min.
Network. ~30 ms SSM round-trip per query; noticeable during test runs and flaky over long sessions.
Tooling friction. docker-compose app already expects localhost Postgres + local MiniStack. Remote DB + local everything-else is awkward.
Leak risk. A clone that lives "for days" reliably invites being forgotten.

The dump file on disk is the reusable artifact. The remote clone lives ~30 min and dies.

Schema diff as the primary correctness gate

pg_dump -s of (prod clone + new migration) vs (fresh local with all migrations applied) catches the structural mistakes that a successful migration run would otherwise hide: missed constraint rename, wrong default expression, stale index, forgotten trigger body, enum rename miss. Coverage is limited to what the dump contains, which here is the public schema. It's reusable across future refactors with no bespoke code, and a zero-diff exit is an unambiguous pass signal. Alternatives considered:

information_schema Clojure test — narrower coverage and we'd write bespoke expected-names per refactor.
Trust migrate up exit code — under-protects; plenty of structural mistakes don't cause runtime errors until weeks later.

Local clone as a distinct database (`orcha_prod_clone`)

The clone lives as a separate database inside the existing docker-compose Postgres container rather than on a separate volume, which avoids docker-compose edits and volume juggling. The app switches databases via the ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone env var, set per command. There is no ambient mode flag, so the target DB is always visible in the command itself.

Components

New scripts (all under `scripts/`)

scripts/clone_prod_db.clj → bb task db:clone-prod
- Find the most recent snapshot of v1-orcha-db (automated snapshots run nightly, so the newest is usually <24h old). Create a fresh manual snapshot only if the newest is older than a freshness threshold (default 24h) or if --fresh-snapshot is passed.
- aws rds restore-db-instance-from-db-snapshot to v1-orcha-db-clone-<timestamp>:
  - instance class db.t4g.small, storage gp3, single-AZ
  - backup_retention=0, deletion_protection=false, publicly_accessible=false
  - same VPC subnet group and security group as prod (so the existing v1-orcha-app EC2 can reach it over SSM port-forward)
  - tags: CloneOf=v1-orcha-db, CreatedBy=clone-prod-db, CreatedAt=<timestamp>
- Poll until DBInstanceStatus=available (~15 min).
- Start SSM port-forward via v1-orcha-app using AWS-StartPortForwardingSessionToRemoteHost with host=<clone-endpoint>, portNumber=5432, localPortNumber=25432.
- Fetch the master password from Secrets Manager. The restored instance inherits prod's master credentials from the snapshot, so the secret is the one at the same Secrets Manager path as prod.
- pg_dump -Fc -Z1 -h localhost -p 25432 -U <master> orcha -f dump/prod-<timestamp>.dump. Password passed via PGPASSWORD env in the subprocess only, never written to disk.
- Cleanup is layered, because a shutdown hook alone does not fire on SIGKILL or power loss:
  - JVM shutdown hook issues aws rds delete-db-instance --skip-final-snapshot --delete-automated-backups.
  - Clone identifier is printed prominently at startup so manual deletion is one line if the hook ever fails.
  - A separate tiny helper (bb db:list-clones) lists all instances tagged CreatedBy=clone-prod-db — run it periodically to catch strays.
- The port forward is killed in the same cleanup path as the clone deletion.
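
The snapshot-freshness decision, the port-forward invocation, and the teardown ordering described above can be sketched as follows. This is an illustrative Python sketch of the logic only, since the actual script is babashka Clojure. The function names, the `target_instance_id` parameter, and the assumption that `SnapshotCreateTime` has already been parsed into a `datetime` are all hypothetical. The AWS CLI flags and SSM document parameters are the ones named in the list above.

```python
import json
from contextlib import ExitStack
from datetime import datetime, timedelta


def pick_snapshot(snapshots, now, max_age=timedelta(hours=24), force_fresh=False):
    """Return the newest reusable snapshot id, or None if a fresh one is needed.

    `snapshots` is assumed to be a list of dicts shaped like the entries of
    `aws rds describe-db-snapshots`, with SnapshotCreateTime parsed to datetime.
    """
    if force_fresh:  # mirrors --fresh-snapshot
        return None
    usable = [s for s in snapshots if s.get("Status") == "available"]
    if not usable:
        return None
    newest = max(usable, key=lambda s: s["SnapshotCreateTime"])
    if now - newest["SnapshotCreateTime"] > max_age:
        return None
    return newest["DBSnapshotIdentifier"]


def port_forward_argv(target_instance_id, clone_endpoint, local_port=25432):
    """Build the SSM port-forward command (target id is a placeholder)."""
    params = {
        "host": [clone_endpoint],
        "portNumber": ["5432"],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", target_instance_id,
        "--document-name", "AWS-StartPortForwardingSessionToRemoteHost",
        "--parameters", json.dumps(params),
    ]


def delete_clone_argv(clone_id):
    """Build the teardown command for the throwaway instance."""
    return [
        "aws", "rds", "delete-db-instance",
        "--db-instance-identifier", clone_id,
        "--skip-final-snapshot", "--delete-automated-backups",
    ]


def with_teardown(clone_id, body, run, stop_tunnel):
    """Run `body` with teardown registered first, so it fires on any exit path.

    ExitStack unwinds in LIFO order: the tunnel stops, then the clone is deleted.
    """
    with ExitStack() as stack:
        stack.callback(run, delete_clone_argv(clone_id))
        stack.callback(stop_tunnel)
        return body()
```

Registering the teardown before the dump starts is the point of the sketch: a failure in `body` still triggers both callbacks, and the exception then propagates to the caller. It does not replace the manual `db:list-clones` check, since nothing in-process survives a hard kill.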

scripts/schema_diff.clj → bb task db:schema-diff --a <db> --b <db>
- pg_dump -s --no-owner --no-privileges --schema=public <db> for both inputs.
- Canonicalize each dump:
  - strip -- line comments and /* ... */ block comments
  - drop SET and SELECT pg_catalog.set_config(...) statements
  - drop sequence restart values (regex on SELECT pg_catalog.setval(...))
  - drop extension-installed-version lines that embed OIDs
  - sort top-level CREATE/ALTER statements by (object-type, object-name)
- Write both canonicalized dumps to dump/schema-<db>.sql, run diff --color=always -u.
- Exit 0 on empty diff. Exit 1 with printed diff otherwise.
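
A minimal sketch of the canonicalize-then-diff step, written in Python for illustration (the real script is Clojure). It makes simplifying assumptions that should be read as such: top-level statements are taken to be separated by blank lines, which breaks for function bodies that contain blank lines, and sorting is by statement text, which only approximates the (object-type, object-name) ordering. The extension-version rule is omitted. The names `canonicalize` and `schema_diff` are hypothetical.

```python
import difflib
import re

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"^[ \t]*--.*$", re.MULTILINE)
# Only matches at column 0, so an indented "SET search_path" clause inside
# a CREATE FUNCTION statement is left alone.
_DROP_LINE = re.compile(r"(SET\s|SELECT pg_catalog\.(set_config|setval)\()")


def canonicalize(dump_sql: str) -> str:
    """Strip comments and session noise, then sort top-level statements."""
    text = _BLOCK_COMMENT.sub("", dump_sql)
    text = _LINE_COMMENT.sub("", text)
    lines = [ln for ln in text.splitlines() if not _DROP_LINE.match(ln)]
    # Assumption: blank lines separate top-level statements.
    chunks = re.split(r"\n\s*\n", "\n".join(lines))
    statements = sorted(c.strip() for c in chunks if c.strip())
    return "\n\n".join(statements) + "\n"


def schema_diff(a_sql: str, b_sql: str, a_name: str = "a", b_name: str = "b") -> str:
    """Return a unified diff of the canonical forms; empty string means pass."""
    return "".join(
        difflib.unified_diff(
            canonicalize(a_sql).splitlines(keepends=True),
            canonicalize(b_sql).splitlines(keepends=True),
            fromfile=a_name,
            tofile=b_name,
        )
    )
```

An empty return value corresponds to exit 0. Anything else is printed and maps to exit 1, which keeps noise visible rather than silently filtered.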

scripts/db_clone_helpers.clj — small helpers exposed as bb tasks:
- db:load-clone: drops orcha_prod_clone if it exists, recreates it, then runs pg_restore --no-owner --no-privileges -d orcha_prod_clone dump/<latest-or-specified>.dump.
- db:fresh: drops and recreates orcha_fresh, runs init.sql, then runs all migrations via migratus.
- db:drop-clone: drops orcha_prod_clone.
- db:list-clones: runs aws rds describe-db-instances and keeps the instances whose tags include CreatedBy=clone-prod-db, printing identifiers and ages. The match is done client-side on each instance's TagList, since describe-db-instances offers no tag filter. Zero-cost safety net.
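
The stray-clone check can be sketched as a client-side filter over the JSON that `aws rds describe-db-instances` prints. This is a hedged Python illustration, not the Clojure helper itself. The function name `stray_clones` is hypothetical, and the sketch assumes the response carries `DBInstanceIdentifier`, `InstanceCreateTime` (ISO 8601), and `TagList` for each instance.

```python
import json
from datetime import datetime, timezone


def stray_clones(describe_json, now=None, tag_key="CreatedBy", tag_value="clone-prod-db"):
    """Return (identifier, age) pairs for instances carrying the clone tag.

    `age` is a timedelta, or None when the instance has no creation time yet.
    """
    now = now or datetime.now(timezone.utc)
    found = []
    for inst in json.loads(describe_json).get("DBInstances", []):
        tags = {t["Key"]: t["Value"] for t in inst.get("TagList", [])}
        if tags.get(tag_key) != tag_value:
            continue
        created = inst.get("InstanceCreateTime")
        age = None
        if created:
            # Normalize a trailing "Z" so older Pythons can parse it.
            age = now - datetime.fromisoformat(created.replace("Z", "+00:00"))
        found.append((inst["DBInstanceIdentifier"], age))
    return found
```

Given that a clone is expected to live about 30 minutes, any entry whose age is measured in hours is worth deleting by hand.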

Runbook

docs/runbooks/prod-clone-refactor-testing.md — the canonical procedure. Structured as an ordered checklist so both humans and agents can follow it verbatim.

Data flow (runbook steps)

Step 1 — Clone prod to a dump file (~30–45 min, unattended)
  bb db:clone-prod
  → dump/prod-<timestamp>.dump

Step 2 — Load into local Postgres (~2–5 min)
  bb db:load-clone
  → database "orcha_prod_clone" with pre-migration schema

Step 3 — Sanity check: current master boots against pre-migration clone
  ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev
  → (integrant.repl/go) succeeds. Confirms the clone is usable and
    current master is healthy against prod schema.

Step 4 — Apply the pending migration(s)
  ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate

Step 5 — Schema assertion (the gate)
  bb db:fresh
  bb db:schema-diff --a orcha_prod_clone --b orcha_fresh
  → must exit 0. Any diff is a migration bug; iterate.

Step 6 — Ingestion smoke (programmatic)
  ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb dev          # start app
  bb ingest test/fixtures/<invoice-1>.pdf
  bb ingest test/fixtures/<invoice-2>.pdf
  → assert documents reach terminal state without errors.
    Watch logs for: queries against renamed columns, Malli decode
    errors, trigger-function payload key mismatches, SES consumer
    key mismatches, FK violations.

Step 7 — UI smoke (manual)
  Minimum checklist:
    - Document list loads and paginates
    - Open a document → view renders without errors
    - Matching screen loads a cluster
    - /tenants admin panel loads and shows renamed entities
    - /organizations admin panel loads
  → watch browser devtools + app logs for 500s.

Step 8 — Iterate on failure
  Fix code or migration, then:
    bb db:load-clone   # reset from cached dump (~3 min)
    (repeat from Step 4)

Step 9 — Cleanup
  bb db:drop-clone
  bb db:list-clones    # verify no stray RDS clones remain
  rm dump/prod-<timestamp>.dump   # manual; contains PII
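
The reset, migrate, and schema-gate portion of the loop (Step 8's reset followed by Steps 4 and 5) is mechanical enough to chain. The following Python sketch shows one way to do that, stopping at the first non-zero exit. The `run_gate` name and the injectable `run` parameter are hypothetical. The commands and the env-var prefix are taken from the steps above, with the override applied only to the command that carries it in the runbook.

```python
import os
import subprocess

CLONE_DB = "orcha_prod_clone"

# (argv, needs_override): the override is set per command, never exported.
GATE_STEPS = [
    (["bb", "db:load-clone"], False),
    (["bb", "migrate", "migrate"], True),
    (["bb", "db:fresh"], False),
    (["bb", "db:schema-diff", "--a", CLONE_DB, "--b", "orcha_fresh"], False),
]


def run_gate(run=subprocess.run, base_env=None):
    """Run the gate steps in order; return the failing argv, or None on pass."""
    base_env = dict(os.environ if base_env is None else base_env)
    for argv, needs_override in GATE_STEPS:
        env = dict(base_env)  # fresh copy so nothing leaks between steps
        if needs_override:
            env["ORCHA_LOCAL_DB_NAME_OVERRIDE"] = CLONE_DB
        result = run(argv, env=env)
        if result.returncode != 0:
            return argv
    return None
```

Steps 3, 6, and 7 are left out on purpose: they involve a running app and human judgment, which is why they stay in the checklist.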

Error handling & safety

Throwaway RDS leak. Shutdown hook + startup-printed identifier + periodic db:list-clones check. Three independent layers; no single failure leaves a running clone.
Port collision. The forwarded port is 25432, not 5432, so a default-port bb migrate cannot hit the SSM tunnel by accident. The script refuses to start if 25432 is already bound.
Wrong-DB confusion. Every runbook command that starts the app or applies migrations carries an explicit ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone prefix, and the db:* helpers operate on fixed or explicitly named databases. No ambient mode flag.
Secrets handling. Master password is fetched on demand from Secrets Manager, passed to pg_dump via PGPASSWORD env of the subprocess only, never written to disk, never logged.
PII on laptop. Dump file lives in dump/ (gitignored). Runbook explicitly instructs manual deletion at cleanup. No auto-expiry; short-term retention for iteration is intentional.
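Migration failure mid-run. The up migration is wrapped in a single transaction (per the rename plan), so a failure rolls back and leaves the clone DB in its pre-migration state. If several migrations are pending, any that already completed stay applied. Either way, bb db:load-clone resets cleanly.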
Schema diff false positives. Canonicalizer is pragmatic — residual noise from new pg_dump versions is expected occasionally. Runbook documents how to extend the ignore list; diff output is deliberately human-readable so noise is visible, not hidden.
SSM session drop during pg_dump. The single long-running dump is the most likely point of failure. Mitigation: the dump is re-runnable. On failure, re-run bb db:clone-prod; the script detects and reuses the recent snapshot, so only the restore and dump steps repeat.
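
Two of the safeguards above, the port-collision check and the subprocess-only password, are small enough to sketch. The following Python is illustrative only (the scripts themselves are Clojure), and the function names are hypothetical. The pg_dump flags, database name, and port are the ones given in the Components section.

```python
import os
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is bound to host:port, by attempting a bind."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def pg_dump_invocation(user, password, out_path, port=25432, base_env=None):
    """Build argv and env for pg_dump; the password appears only in env."""
    argv = [
        "pg_dump", "-Fc", "-Z1",
        "-h", "localhost", "-p", str(port),
        "-U", user,
        "-f", out_path,
        "orcha",
    ]
    # Copy the environment so PGPASSWORD never lands in the parent process.
    env = dict(os.environ if base_env is None else base_env)
    env["PGPASSWORD"] = password
    return argv, env
```

The returned pair is meant to be handed straight to a subprocess call. Keeping the password out of `argv` matters because command lines are visible in process listings, while the child's environment is not written to disk or logged by the script.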

What this procedure does NOT cover

Explicit limits, copied to the runbook:

Data-path bugs requiring prod S3 objects. Can't reprocess a specific prod document against the clone; ingestion uses fresh local test documents only.
Load/perf testing. db.t4g.small clone + untuned local Postgres are not perf-representative.
Production deploy race conditions. Local up is fully isolated. Deadlocks from concurrent prod writers during the real deploy are not surfaced here. Mitigation: continue deploying in low-traffic windows.
Cross-service integrations. DATEV, SAP, Outlook, etc. are exercised only incidentally via ingestion. Full integration smoke requires separate tooling.

Future extensions

Called out so they're visible but explicitly not built now:

Playwright-driven UI smoke. Once the procedure has proven itself manually a couple of times, scripting the Step 7 checklist in Playwright makes it agent-drivable.
Test suite against the clone. clj -X:test pointed at orcha_prod_clone would catch code paths that need real rows to execute. Not yet; adds complexity (test isolation against non-transactional data) without clear payoff for a rename-only refactor.
Anonymized dump. If PII-on-laptop becomes a compliance concern, a post-dump anonymization step (mask email addresses, clear OCR text, etc.) can slot in between pg_dump and local pg_restore. Not a default — anonymization usually breaks something and is a maintenance burden.
Reuse the clone instance for iteration. If dump-and-restore ever becomes the bottleneck, the script could grow a --keep-clone flag that skips teardown and emits the connection info. Would flip us closer to "full Option C" for the few cases where it pays.