Prod-Clone Refactor Testing

Pre-deploy validation procedure for large, schema-touching refactors. Restores a clone of prod Postgres locally, applies pending migrations, verifies the resulting schema matches a fresh-from-init baseline, and exercises ingestion and UI against the migrated clone.

Design spec: docs/superpowers/specs/2026-04-24-prod-clone-refactor-testing-design.md

When to use

Run before deploying any migration that:

Renames tables, columns, constraints, indexes, or types.
Modifies trigger function bodies or JSONB path consumers.
Makes a previously nullable column NOT NULL.
Changes FK topology.
Could fail on data-shape issues that a successful bb migrate migrate against init.sql would not catch.

Not needed for additive migrations (new columns/tables) that are covered by unit/integration tests and don't touch live data layout.
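
As a rough first pass, the criteria above can be approximated by grepping a migration file for risky statements. This is a heuristic sketch only, assuming migrations are plain SQL files; the function name needs_clone_test and the pattern list are illustrative and not part of the repo tooling.

```shell
# Heuristic sketch: flag SQL migrations that probably need a prod-clone run.
# The pattern list is an assumption and is not exhaustive.
needs_clone_test() {
  # $1: path to a SQL migration file
  grep -Eiq 'RENAME|SET[[:space:]]+NOT[[:space:]]+NULL|(ADD|DROP)[[:space:]]+CONSTRAINT|CREATE[[:space:]]+OR[[:space:]]+REPLACE[[:space:]]+FUNCTION' "$1"
}

# Usage:
#   needs_clone_test path/to/migration.sql && echo "run the prod-clone procedure"
```

A match only suggests that the procedure is warranted; a miss does not prove the migration is safe, so the written criteria remain the authority.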

Prerequisites

AWS SSO login: aws sso login --profile orcha-prod
docker-compose stack running: bb dev:up && bb dev:seed
Postgres 18+ client tools on PATH (pg_dump, pg_restore, psql). The host client major version MUST be ≥ the docker-compose Postgres server version (currently pgvector/pgvector:pg18 per docker-compose.yml). Verify both:
```
pg_dump --version                                                                # → 18.x
docker-compose exec -T postgres psql -U postgres -tAc "SHOW server_version;"     # → 18.x
```
A client older than the server will cause pg_restore to fail while loading the prod-shaped dump.
AWS Session Manager plugin installed (session-manager-plugin --version must succeed). Without it, the SSM port-forward step silently hangs.
Branch with the migration under test checked out
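
The client/server version requirement can be checked mechanically. The sketch below is one possible way to do it, assuming the version strings look like the output of the two commands shown earlier; the helper names pg_major and client_ok are made up for illustration.

```shell
# Sketch: compare the host client's major version with the server's.
pg_major() {
  # Extract the leading major version from strings such as
  # "pg_dump (PostgreSQL) 18.1" or a bare "18.1".
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*([0-9]+).*/\1/'
}

client_ok() {
  # $1: client version string, $2: server version string
  [ "$(pg_major "$1")" -ge "$(pg_major "$2")" ]
}

# Usage against the real tools:
#   client_ok "$(pg_dump --version)" \
#     "$(docker-compose exec -T postgres psql -U postgres -tAc 'SHOW server_version;')" \
#     || echo "pg client is older than the server" >&2
```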

Procedure

Step 1 — Clone prod to a dump file

Unattended once confirmed, ~30–45 min (longer if a fresh snapshot needs to be taken).

bb db:clone-prod

The script first calls aws sts get-caller-identity and aborts unless the profile resolves to the prod account. It then prints the session header and prompts Continue? [y/N] before any AWS mutation. Confirm with y to proceed, or pass --yes/-y to skip the prompt for unattended runs.
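
For readers who want to reproduce the identity check by hand, the following sketch shows the general shape of such a guard. It is not the script's actual code: EXPECTED_ACCOUNT is a placeholder value, and the function names are hypothetical.

```shell
# Sketch of an account guard. EXPECTED_ACCOUNT is a placeholder, not the real ID.
EXPECTED_ACCOUNT="${EXPECTED_ACCOUNT:-000000000000}"

current_account() {
  # Ask STS which account the orcha-prod profile resolves to.
  aws sts get-caller-identity --profile orcha-prod --query Account --output text
}

guard_account() {
  # $1: account ID the active credentials resolve to
  if [ "$1" != "$EXPECTED_ACCOUNT" ]; then
    echo "refusing to continue: account $1 is not $EXPECTED_ACCOUNT" >&2
    return 1
  fi
}

# Usage:
#   guard_account "$(current_account)" || exit 1
```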

Produces dump/prod-<timestamp>.dump. Watch the output for the throwaway clone identifier: if the script dies unexpectedly, that is the instance you must delete manually (Step 9 lists leftovers; Troubleshooting has the delete command).

If the script creates a fresh manual source snapshot, it tags it as temporary clone-test data and deletes it during cleanup. If cleanup fails, Step 9 shows how to list any leftover snapshot so that it can be removed.

Flags:

--fresh-snapshot — skip snapshot reuse; always create a new one.
--freshness-hours N — reuse snapshots up to N hours old (default 24).
--yes, -y — skip the interactive confirmation prompt.
--skip-restore — dry-run; print plan and exit before any AWS calls.

Step 2 — Load the dump into local Postgres

~2–5 min.

bb db:load-clone

Creates orcha_prod_clone in local docker-compose Postgres, restored from the newest dump/*.dump file.
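
To see which file would count as newest before loading it, a small helper along these lines can be used. This is a sketch that assumes "newest" means most recently modified; the task itself may instead sort by the timestamp in the file name. The name newest_dump is illustrative.

```shell
# Sketch: print the most recently modified *.dump file in a directory.
newest_dump() {
  # $1: directory holding the dump files (dump/ in this procedure)
  ls -t "$1"/*.dump 2>/dev/null | head -n 1
}

# Usage:
#   newest_dump dump
```

An empty result means no dump file exists yet, in which case Step 1 has to run first.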

Step 3 — Sanity check: current master boots against pre-migration clone

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev

At the REPL: (integrant.repl/go) — must succeed. Confirms the clone is usable and that current master is healthy against prod schema. Exit the REPL.

Step 4 — Apply pending migrations on the clone

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate

All pending migrations must apply cleanly against prod data. Any error here counts as a failure; fix it and iterate as described in Step 8.

Step 5 — Schema assertion (the gate)

bb db:fresh
bb db:schema-diff --a orcha_prod_clone --b orcha_fresh

bb db:fresh (re)creates orcha_fresh at HEAD schema. bb db:schema-diff exits 0 on empty diff. Any diff means the migration produces a schema that diverges from what init.sql + migrations produce — i.e., a migration bug. Iterate until exit 0.
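
Because the gate is expressed through the exit code, it is easy to wrap in a script. The sketch below relies only on the documented behaviour (exit 0 on an empty diff); the function name schema_gate and the messages are illustrative.

```shell
# Sketch: treat a non-zero exit from the schema diff as a hard stop.
schema_gate() {
  if bb db:schema-diff --a orcha_prod_clone --b orcha_fresh; then
    echo "schema gate: PASS"
  else
    echo "schema gate: FAIL (migrated clone diverges from orcha_fresh)" >&2
    return 1
  fi
}

# Usage:
#   bb db:fresh && schema_gate || exit 1
```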

Step 6 — Ingestion smoke (programmatic)

Start the app pointed at the clone:

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev

At the REPL: (integrant.repl/go). In another terminal, list available fixtures and ingest a couple. Don't reference prod document IDs — these should be fresh test PDFs, since prod S3 objects aren't local.

ls test/fixtures/        # discover what's available
bb ingest test/fixtures/<picked-invoice-1>.pdf
bb ingest test/fixtures/<picked-invoice-2>.pdf

Assert each document reaches a terminal state (processed/failed) without errors. Watch the REPL logs for:

Queries referencing renamed columns (legal_entity_id, etc.)
Malli decode errors
Trigger-function payload key mismatches (:legal-entity-id vs :tenant-id)
SES/notification consumer key mismatches
FK violations
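
If the REPL output is captured to a file, the signatures above can be searched for instead of eyeballed. The following sketch assumes such a capture exists (for example via tee) and uses standard Postgres error phrases plus the word "malli"; the application's own log messages may differ, so treat the pattern list as a starting point.

```shell
# Sketch: scan captured REPL output for common failure signatures.
# The regexes are assumptions about log text, not the app's exact messages.
scan_repl_log() {
  # $1: path to a file holding captured REPL output
  grep -Ein 'column .* does not exist|relation .* does not exist|violates foreign key constraint|violates not-null constraint|malli' "$1"
}

# Usage:
#   if scan_repl_log repl.log; then echo "suspicious lines found" >&2; fi
```

The function prints matching lines with their line numbers and succeeds only when at least one match is found.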

Step 7 — UI smoke (manual)

With the app still running against the clone, walk through:

Document list — loads and paginates
Open a document → view renders without errors
Matching screen — loads a cluster
Tenants admin panel (/tenants) — loads, shows renamed entities
Organizations admin panel (/organizations) — loads

Watch browser devtools and REPL logs for 500s.
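
A quick scripted probe can complement the manual walk-through by catching outright 500s on the two admin routes. This is a sketch under assumptions: BASE_URL is a guess at the local address, and the routes may require an authenticated session, in which case a redirect or 401 is expected and only 5xx responses are treated as failures.

```shell
# Sketch: probe the admin routes for HTTP 5xx. BASE_URL is an assumption.
BASE_URL="${BASE_URL:-http://localhost:3000}"

is_server_error() {
  # $1: HTTP status code
  case "$1" in
    5[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

probe_routes() {
  for route in /tenants /organizations; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL$route")
    if is_server_error "$code"; then
      echo "FAIL $route -> $code" >&2
      return 1
    fi
    echo "ok   $route -> $code"
  done
}
```

It does not replace the manual checks, since rendering errors and broken pagination do not necessarily surface as 5xx responses.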

Step 8 — Iterate on failure

If any step fails:

Fix code or migration in the repo.
bb db:load-clone — resets orcha_prod_clone from the cached dump file (~3 min, no re-clone of prod needed).
Repeat from Step 4.
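
The reset-and-retry loop can be chained into a single function so that each iteration is one command. The sketch below only strings together commands already used in this procedure; the function name retest_clone is illustrative, and Steps 6 and 7 still have to be run by hand afterwards.

```shell
# Sketch: reset the clone, then rerun Step 4 and Step 5.
# Stops at the first failing command.
retest_clone() {
  bb db:load-clone &&
    ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate &&
    bb db:fresh &&
    bb db:schema-diff --a orcha_prod_clone --b orcha_fresh
}

# Usage:
#   retest_clone && echo "ready for ingestion and UI smoke"
```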

Step 9 — Cleanup

bb db:drop-clone
bb db:list-clones   # must be empty
bb db:list-clone-snapshots   # must be empty
rm dump/prod-<timestamp>.dump   # manually; contains PII

bb db:list-clones surfaces any leaked throwaway RDS instances (should not happen under normal conditions — the script has a shutdown hook — but run this as a belt-and-braces check). bb db:list-clone-snapshots surfaces any leaked manual source snapshots created by this workflow.
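
The "must be empty" checks can be enforced rather than read by eye. This sketch assumes both list tasks print nothing at all when there are no leftovers; if they print a header or a "none found" message, the comparison needs adjusting. The helper names are illustrative.

```shell
# Sketch: fail loudly if a listing command printed anything.
# Assumes the list tasks produce no output when nothing is left over.
require_empty() {
  # $1: label, $2: captured command output
  if [ -n "$2" ]; then
    echo "leftover $1 found:" >&2
    printf '%s\n' "$2" >&2
    return 1
  fi
}

verify_cleanup() {
  require_empty "RDS clones" "$(bb db:list-clones)" &&
    require_empty "clone snapshots" "$(bb db:list-clone-snapshots)"
}

# Usage:
#   bb db:drop-clone && verify_cleanup
```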

What this does NOT cover

Data-path bugs that require specific prod S3 objects. Ingestion uses fresh local test documents only.
Load or performance testing. The throwaway clone is cost-optimized, so timings measured against it are not representative of prod.
Prod deploy race conditions (concurrent writers during migration). Mitigation: deploy in low-traffic windows.
Integration side effects (DATEV/SAP/Outlook) beyond what ingestion incidentally triggers.

Troubleshooting

Clone restore hangs past 25 minutes. The AWS wait timeout may have been exceeded. Run bb db:list-clones: if the instance exists and is available, restart the script; if it is still creating, keep waiting (larger snapshots take longer).
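
The instance status can also be queried directly from the AWS CLI. The sketch below maps the two statuses named above to their recommended action; the profile and region are the ones used elsewhere in this document, the function names are hypothetical, and any other status is simply flagged for investigation.

```shell
# Sketch: look up a clone's status and map it to the advice above.
clone_status() {
  # $1: DB instance identifier printed by the clone script
  aws rds describe-db-instances --profile orcha-prod --region eu-central-1 \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].DBInstanceStatus' --output text
}

next_action() {
  # $1: RDS instance status string
  case "$1" in
    available) echo "restart the script" ;;
    creating) echo "keep waiting" ;;
    *) echo "investigate: unexpected status '$1'" ;;
  esac
}

# Usage:
#   next_action "$(clone_status <ID>)"
```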

SSM port-forward fails with "target not connected". Run aws sso login --profile orcha-prod and retry.

pg_dump fails with password authentication failed. The script fetches the master password from SSM (/v1-orcha/db-credentials), and restored snapshots inherit the master password as it was at snapshot time. If the prod password was rotated between the snapshot and the restore, the fetched password will not match the clone's. Resolutions:

Use --fresh-snapshot so the snapshot reflects the current password.
Or pass --master-user-password <known> to restore-db-instance-from-db-snapshot and use that known password (script enhancement; not currently supported).

pg_restore warnings about missing roles. Expected; --no-owner and --no-privileges skip restoring ownership and grants. Exit code 1 with warnings is treated as success by bb db:load-clone.

Schema diff shows differences you believe are harmless. Review the diff carefully first, because apparent "noise" is often a subtle bug. If it really is noise, add a canonicalizer rule in scripts/schema_diff.clj.

bb db:list-clones shows a stray instance you don't recognize. Delete it:

aws rds delete-db-instance --profile orcha-prod --region eu-central-1 \
  --db-instance-identifier <ID> --skip-final-snapshot --delete-automated-backups