For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (
- [ ]) syntax for tracking.⚠️ Execution boundary: Tasks 1–6 are agent-executable; they perform no AWS mutations (one read-only
rds describe-db-instancescall at most, in Task 4). Task 7 is USER-ONLY — it is the first real run of the procedure and is the only task that mutates AWS resources inorcha-prod(snapshot create, RDS restore, RDS delete) and pulls prod credentials + PII to the local machine. After completing Task 6, STOP and hand control back to the user. See the warning block at the top of Task 7.
Goal: Build a repeatable procedure (three bb scripts + a runbook) for restoring a prod Postgres clone into local docker-compose, applying pending migrations, and proving the resulting schema matches a fresh-from-init baseline — so we can validate large refactors (like the legal_entity → tenant rename) against real data before deploying.
Architecture: A one-shot cloning pipeline (bb db:clone-prod) snapshots prod RDS, restores it to a short-lived throwaway RDS, pg_dumps from it via SSM port-forward to a local dump file, then deletes the throwaway. A schema diff tool (bb db:schema-diff) canonicalizes pg_dump -s output and compares two local DBs. Small helper tasks (bb db:load-clone, bb db:fresh, bb db:drop-clone, bb db:list-clones) manage local and remote state. The app points at the clone via an env-driven dbname override. A runbook ties it all together.
Tech Stack: Babashka scripts (Clojure via bb), AWS CLI (rds/ssm/secretsmanager), pg_dump/pg_restore/psql, docker-compose Postgres, migratus.
Authoritative source: docs/superpowers/specs/2026-04-24-prod-clone-refactor-testing-design.md. Review it before starting — this plan implements exactly what's in there.
scripts/clone_prod_db.clj — snapshot → throwaway RDS → pg_dump → teardown. Exposed as bb db:clone-prod.scripts/schema_diff.clj — pg_dump -s canonicalization + diff. Exposed as bb db:schema-diff.scripts/db_clone_helpers.clj — db:load-clone, db:fresh, db:drop-clone, db:list-clones. Four small thin helpers.docs/runbooks/prod-clone-refactor-testing.md — the end-user-facing checklist.bb.edn — register the six new bb tasks..gitignore — add /dump/ (currently untracked, not ignored).src/com/getorcha/db.clj — tiny change to parse-credentials to honor an ORCHA_LOCAL_DB_NAME_OVERRIDE env var.test/com/getorcha/db_test.clj (create if missing) — one unit test for the override path.docker-compose.yml — no changes. The clone lives as a separate database inside the existing Postgres container.resources/com/getorcha/config.edn — no changes. Override happens in code.One logical unit per commit. Suggested prefixes (match existing git log style):
feat(dev): env override for local DB namefeat(dev): add schema diff scriptfeat(dev): add db clone helper tasksfeat(dev): add prod DB clone scriptdocs(runbooks): add prod-clone refactor testing runbookchore: gitignore dump directoryPurpose: The procedure writes prod database dumps (containing PII) to dump/. Must not be accidentally committed.
Files:
Modify: .gitignore
Step 1: Verify current state
Run: grep -n dump .gitignore
Expected: no match, or only references to unrelated entries.
Run: git status --short dump/ | head -3
Expected: lists untracked dump/ contents (already has misc files from prior work).
/dump/ to .gitignoreOpen .gitignore and add the following line (group it near other tool/output exclusions, e.g., near the existing /tmp/ entry):
/dump/
dump/ is currently trackedRun: git ls-files dump/ | head
Expected: no output. If output appears, stop and flag — existing tracked files shouldn't be silently ignored.
Run: git status --short | grep '^?? dump'
Expected: no match (the dump/ line that was there at the start of the session is now hidden).
git add .gitignore
git commit -m "chore: gitignore dump directory"
ORCHA_LOCAL_DB_NAME_OVERRIDE env overridePurpose: Allow the app and migrations to target orcha_prod_clone without modifying SSM parameters or config files. The override lives in parse-credentials so both the runtime pool (com.getorcha.db/::pool) and the migrations path (com.getorcha.db.migrations/->derive-db) pick it up from a single code point.
Files:
Modify: src/com/getorcha/db.clj:193-219 (parse-credentials)
Modify: test/com/getorcha/db_test.clj (create if missing)
Step 1: Locate the current implementation
Open src/com/getorcha/db.clj at parse-credentials (~line 193). Confirm the body matches what Task 1's grep showed: it json-parses, then returns a map with :url, :user, :password, :host, :port, :dbname.
Run: test -f test/com/getorcha/db_test.clj && echo exists || echo missing
If missing, create it with this skeleton:
(ns com.getorcha.db-test
(:require [clojure.test :refer [deftest testing is]]
[com.getorcha.db :as db]))
If exists, open it and append the new test at the bottom.
Append to test/com/getorcha/db_test.clj:
(def ^:private sample-credentials-json
(str "{\"username\":\"u\",\"password\":\"p\","
"\"host\":\"h\",\"port\":5432,\"dbname\":\"orcha\"}"))
(deftest parse-credentials-honors-local-db-name-override
(testing "without override, dbname matches the JSON"
(let [parsed (db/parse-credentials sample-credentials-json)]
(is (= "orcha" (:dbname parsed)))
(is (= "jdbc:postgresql://h:5432/orcha" (:url parsed)))))
(testing "with override env var set, dbname is replaced and URL reflects it"
(with-redefs [db/env-db-name-override (constantly "orcha_prod_clone")]
(let [parsed (db/parse-credentials sample-credentials-json)]
(is (= "orcha_prod_clone" (:dbname parsed)))
(is (= "jdbc:postgresql://h:5432/orcha_prod_clone" (:url parsed))))))
(testing "blank override is ignored (treated as absent)"
(with-redefs [db/env-db-name-override (constantly "")]
(let [parsed (db/parse-credentials sample-credentials-json)]
(is (= "orcha" (:dbname parsed)))))))
Run: clj -X:test:silent :nses '[com.getorcha.db-test]' 2>&1 | grep -E "FAIL|ERROR|Ran"
Expected: ERROR (because db/env-db-name-override doesn't exist yet) or FAIL (because the override is not applied).
Edit src/com/getorcha/db.clj. Immediately above parse-credentials, add:
(defn ^:private env-db-name-override
"Returns the value of ORCHA_LOCAL_DB_NAME_OVERRIDE or nil.
Indirected through a defn (rather than called inline) so tests can
stub it via with-redefs without touching process env. Intended for
local development only — production task definitions MUST NOT set
this env var."
[]
(System/getenv "ORCHA_LOCAL_DB_NAME_OVERRIDE"))
Then modify the body of parse-credentials to apply the override. Replace:
[credentials-json]
(let [{:keys [username password host port dbname]}
(json/parse-string credentials-json true)]
{:url (format "jdbc:postgresql://%s:%d/%s" host port dbname)
:user username
:password (or password "")
:host host
:port port
:dbname dbname}))
with:
[credentials-json]
(let [{:keys [username password host port dbname]}
(json/parse-string credentials-json true)
override (env-db-name-override)
effective-dbname (if (and override (not (string/blank? override)))
override
dbname)]
{:url (format "jdbc:postgresql://%s:%d/%s" host port effective-dbname)
:user username
:password (or password "")
:host host
:port port
:dbname effective-dbname}))
Required ns edit: verify the :require block of src/com/getorcha/db.clj includes [clojure.string :as string]. If absent (it currently is — check first with head -25 src/com/getorcha/db.clj), add it in alphabetical order. Without this require, string/blank? in the new code will fail to compile.
Run: clj -X:test:silent :nses '[com.getorcha.db-test]' 2>&1 | grep -E "FAIL|ERROR|Ran"
Expected: Ran 1 tests containing 5 assertions. 0 failures, 0 errors. (or similar; numbers depend on pre-existing tests in the file).
Run: clj-kondo --lint src/com/getorcha/db.clj test/com/getorcha/db_test.clj
Expected: no warnings or errors.
git add src/com/getorcha/db.clj test/com/getorcha/db_test.clj
git commit -m "feat(dev): env override for local DB name"
bb db:schema-diffPurpose: Diff the schema of two local databases after running pg_dump -s, with canonicalization to eliminate non-semantic differences (comments, SET statements, sequence setvals, extension OIDs, statement order noise). Exit 0 on empty diff; non-zero with printed diff otherwise.
Files:
Create: scripts/schema_diff.clj
Modify: bb.edn (register db:schema-diff)
Create: scripts/test_schema_diff.clj (bb-runnable sanity checks for the canonicalizer)
Step 1: Scaffold the script
Create scripts/schema_diff.clj:
(ns schema-diff
"Compare the schema of two local PostgreSQL databases.
Runs pg_dump -s against each, canonicalizes the output (strips comments,
SET statements, sequence setvals, and sorts top-level statements by
object type + name), and diffs the result. Exits 0 on empty diff;
exits 1 with the diff printed on any difference.
Usage:
bb db:schema-diff --a <db-a> --b <db-b> [--host HOST] [--port PORT] [--user USER]
Example:
bb db:schema-diff --a orcha_prod_clone --b orcha_fresh"
(:require [babashka.process :as p]
[clojure.java.io :as io]
[clojure.string :as str])
(:import [java.io File]))
(set! *warn-on-reflection* true)
(def ^:private default-host "localhost")
(def ^:private default-port "5432")
(def ^:private default-user "postgres")
(defn ^:private parse-args
"Parse command-line args into an options map."
[args]
(loop [args args
opts {:host default-host
:port default-port
:user default-user}]
(if (empty? args)
opts
(let [[arg & rest-args] args]
(case arg
"--a" (recur (rest rest-args) (assoc opts :db-a (first rest-args)))
"--b" (recur (rest rest-args) (assoc opts :db-b (first rest-args)))
"--host" (recur (rest rest-args) (assoc opts :host (first rest-args)))
"--port" (recur (rest rest-args) (assoc opts :port (first rest-args)))
"--user" (recur (rest rest-args) (assoc opts :user (first rest-args)))
"--help" (assoc opts :help true)
(recur rest-args opts))))))
(defn canonicalize
"Canonicalize pg_dump -s output to eliminate non-semantic differences.
Rules:
- Strip `--` line comments (including the section-header decorators
pg_dump emits before each object).
- Drop `SET` statements and `SELECT pg_catalog.set_config(...)` lines.
- Drop sequence `setval` calls (carry no schema meaning).
- Drop extension version lines (carry build-time OIDs).
- Collapse runs of blank lines and trim trailing whitespace.
Does NOT reorder statements: pg_dump emits objects in a deterministic
topological order (namespace → kind → name). Two databases with the
same final schema produce the same ordering. Reordering would require
a $$-aware splitter, which we don't need."
[sql-text]
(let [filtered (->> (str/split-lines sql-text)
(remove #(str/starts-with? (str/triml %) "--"))
(remove #(re-find #"^\s*SET\s" %))
(remove #(re-find #"^\s*SELECT\s+pg_catalog\.set_config" %))
(remove #(re-find #"^\s*SELECT\s+pg_catalog\.setval" %))
(remove #(re-find #"COMMENT ON EXTENSION" %))
(map str/trimr))]
(-> (str/join "\n" filtered)
(str/replace #"\n{3,}" "\n\n")
str/trim)))
(defn ^:private pg-dump-schema
"Run pg_dump -s against `dbname`, return stdout string.
Throws ex-info on non-zero exit."
[{:keys [host port user] :as _opts} dbname]
(let [result (p/shell {:out :string
:err :string
:continue true}
"pg_dump" "-s"
"--no-owner" "--no-privileges"
"--schema=public"
"-h" host "-p" port "-U" user
dbname)]
(when-not (zero? (:exit result))
(throw (ex-info (format "pg_dump failed for %s: %s" dbname (:err result))
{:dbname dbname :exit (:exit result)})))
(:out result)))
(defn ^:private write-file!
[^File file content]
(.mkdirs (.getParentFile file))
(spit file content))
(defn ^:private print-usage
[]
(println "Usage: bb db:schema-diff --a <db-a> --b <db-b>")
(println " [--host HOST] [--port PORT] [--user USER]")
(println)
(println "Compares the schemas of two local PostgreSQL databases.")
(println "Exits 0 on empty diff; 1 with the diff printed on any difference."))
(defn -main
[& args]
(let [{:keys [db-a db-b help] :as opts} (parse-args args)]
(cond
help
(print-usage)
(or (nil? db-a) (nil? db-b))
(do (println "Error: --a and --b are required")
(println)
(print-usage)
(System/exit 2))
:else
(let [raw-a (pg-dump-schema opts db-a)
raw-b (pg-dump-schema opts db-b)
canon-a (canonicalize raw-a)
canon-b (canonicalize raw-b)
file-a (io/file (format "dump/schema-%s.sql" db-a))
file-b (io/file (format "dump/schema-%s.sql" db-b))]
(write-file! file-a canon-a)
(write-file! file-b canon-b)
(if (= canon-a canon-b)
(do (println (format "OK: schemas match (%s vs %s)" db-a db-b))
(System/exit 0))
(let [diff (p/shell {:out :string :err :string :continue true}
"diff" "-u"
(.getPath file-a) (.getPath file-b))]
(println (format "DIFF: schemas differ (%s vs %s)" db-a db-b))
(println (:out diff))
(System/exit 1)))))))
canonicalizeCreate scripts/test_schema_diff.clj:
(ns test-schema-diff
"Sanity tests for the schema-diff canonicalizer. Run via:
bb scripts/test_schema_diff.clj"
(:require [clojure.test :as t :refer [deftest is run-tests testing]]
[schema-diff :as sd]))
(deftest canonicalize-strips-comments
(let [raw "-- this is a comment\nCREATE TABLE foo (id int);\n-- another\n"]
(is (not (re-find #"--" (sd/canonicalize raw))))
(is (re-find #"CREATE TABLE foo" (sd/canonicalize raw)))))
(deftest canonicalize-drops-set-statements
(let [raw "SET statement_timeout = 0;\nSET client_encoding = 'UTF8';\nCREATE TABLE foo (id int);\n"]
(is (not (re-find #"(?i)SET\b" (sd/canonicalize raw))))
(is (re-find #"CREATE TABLE foo" (sd/canonicalize raw)))))
(deftest canonicalize-drops-setval
(let [raw "CREATE TABLE foo (id int);\nSELECT pg_catalog.setval('foo_id_seq', 42, true);\n"]
(is (not (re-find #"setval" (sd/canonicalize raw))))
(is (re-find #"CREATE TABLE foo" (sd/canonicalize raw)))))
(deftest canonicalize-preserves-function-bodies
(let [raw (str "CREATE FUNCTION foo() RETURNS trigger AS $$\n"
"BEGIN\n"
" PERFORM 1;\n"
" RETURN NEW;\n"
"END;\n"
"$$ LANGUAGE plpgsql;\n")
out (sd/canonicalize raw)]
(is (re-find #"BEGIN" out))
(is (re-find #"RETURN NEW" out))
(is (re-find #"END" out))
(is (re-find #"\$\$ LANGUAGE plpgsql" out))))
(deftest canonicalize-detects-real-differences
(let [a "CREATE TABLE foo (id int);\n"
b "CREATE TABLE foo (id bigint);\n"]
(is (not= (sd/canonicalize a) (sd/canonicalize b)))))
(let [{:keys [fail error]} (run-tests 'test-schema-diff)]
(System/exit (if (zero? (+ fail error)) 0 1)))
Run: bb scripts/test_schema_diff.clj
Expected: Ran 5 tests containing 11 assertions. 0 failures, 0 errors. (exit 0; assertion count varies if you add tests).
If any fail, fix canonicalize until green.
Edit bb.edn. Locate the database migrations section (around migrate task). Add, after the existing db:logs task:
db:schema-diff
{:doc "Diff the schema of two local databases: bb db:schema-diff --a DB-A --b DB-B"
:requires ([schema-diff])
:task (apply schema-diff/-main *command-line-args*)}
Run:
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE IF EXISTS orcha_sdtest_a;"
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE IF EXISTS orcha_sdtest_b;"
docker-compose exec -T postgres psql -U postgres -c "CREATE DATABASE orcha_sdtest_a;"
docker-compose exec -T postgres psql -U postgres -c "CREATE DATABASE orcha_sdtest_b;"
docker-compose exec -T postgres psql -U postgres -d orcha_sdtest_a -c "CREATE TABLE foo (id int, name text);"
docker-compose exec -T postgres psql -U postgres -d orcha_sdtest_b -c "CREATE TABLE foo (id int, name text);"
bb db:schema-diff --a orcha_sdtest_a --b orcha_sdtest_b
Expected: OK: schemas match (orcha_sdtest_a vs orcha_sdtest_b) with exit 0.
Run:
docker-compose exec -T postgres psql -U postgres -d orcha_sdtest_b -c "ALTER TABLE foo ALTER COLUMN id TYPE bigint;"
bb db:schema-diff --a orcha_sdtest_a --b orcha_sdtest_b
echo "Exit: $?"
Expected: DIFF: schemas differ (...) followed by a unified diff showing id int vs id bigint, exit code 1.
pg_dump -s outputThe two-table sanity test from Steps 5–6 uses trivial schemas. The real
schema (functions with $$ bodies, triggers, ~80 tables) may surface
canonicalizer gaps. Run a fresh-vs-fresh comparison:
This step depends on Task 4's bb db:fresh, so run it AFTER completing
Task 4 (or skip this step now and come back). With bb db:fresh available:
bb db:fresh
docker-compose exec -T postgres psql -U postgres -c "ALTER DATABASE orcha_fresh RENAME TO orcha_fresh_a;"
bb db:fresh
docker-compose exec -T postgres psql -U postgres -c "ALTER DATABASE orcha_fresh RENAME TO orcha_fresh_b;"
bb db:schema-diff --a orcha_fresh_a --b orcha_fresh_b
Expected: OK: schemas match. Two independently-built copies of the same
schema must canonicalize identically. If the diff is non-empty, the
canonicalizer is missing a rule (sequence ownership, generated identity
expressions, function body whitespace, etc.) — extend canonicalize with
the offending pattern stripped.
Cleanup:
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_fresh_a;"
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_fresh_b;"
Run:
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_sdtest_a;"
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_sdtest_b;"
Run: clj-kondo --lint scripts/schema_diff.clj scripts/test_schema_diff.clj
Expected: no warnings or errors.
git add scripts/schema_diff.clj scripts/test_schema_diff.clj bb.edn
git commit -m "feat(dev): add schema diff script"
Purpose: Four thin helpers that manage local DB state and surface RDS clones tagged by our script.
db:load-clone — drops and recreates orcha_prod_clone, runs pg_restore from a dump file (newest in dump/ by default, or one specified).db:fresh — drops and recreates orcha_fresh, runs init.sql, then migrates to head. Used as the baseline for db:schema-diff.db:drop-clone — drops orcha_prod_clone.db:list-clones — lists AWS RDS instances tagged CreatedBy=clone-prod-db (safety net for leaked throwaway instances).Files:
Create: scripts/db_clone_helpers.clj
Modify: bb.edn (register four tasks)
Step 1: Write the script
Create scripts/db_clone_helpers.clj:
(ns db-clone-helpers
"Helper tasks for managing the prod-clone local database and throwaway
AWS RDS instances used by `bb db:clone-prod`.
Exposed as bb tasks:
bb db:load-clone [dump-file] Restore dump into orcha_prod_clone.
bb db:fresh Recreate orcha_fresh at HEAD schema.
bb db:drop-clone Drop orcha_prod_clone.
bb db:list-clones List RDS throwaway clones still alive."
(:require [babashka.process :as p]
[clojure.java.io :as io]
[clojure.string :as str])
(:import [java.io File]))
(set! *warn-on-reflection* true)
(def ^:private clone-db-name "orcha_prod_clone")
(def ^:private fresh-db-name "orcha_fresh")
(def ^:private default-host "localhost")
(def ^:private default-port "5432")
(def ^:private default-user "postgres")
(def ^:private aws-profile "orcha-prod")
(def ^:private aws-region "eu-central-1")
(defn ^:private psql-admin!
"Run a psql command against the postgres administrative database inside
the docker-compose container. Throws ex-info on non-zero exit."
[sql]
(let [result (p/shell {:out :string
:err :string
:continue true}
"docker-compose" "exec" "-T" "postgres"
"psql" "-U" default-user "-d" "postgres" "-v" "ON_ERROR_STOP=1"
"-c" sql)]
(when-not (zero? (:exit result))
(throw (ex-info (format "psql failed: %s" (:err result))
{:sql sql :exit (:exit result)})))
result))
(defn ^:private latest-dump-file
"Return the newest .dump file in ./dump/ as a File, or nil if none exist."
^File []
(let [dir (io/file "dump")]
(when (.isDirectory dir)
(->> (.listFiles dir)
(filter #(and (.isFile ^File %)
(str/ends-with? (.getName ^File %) ".dump")))
(sort-by #(.lastModified ^File %))
last))))
(defn load-clone!
"Drop and recreate orcha_prod_clone, then pg_restore from `dump-file`
(defaults to the newest .dump in ./dump/).
Expects `pg_restore` available on PATH locally (Postgres 18+ client tools)."
[& [dump-file-arg]]
(let [dump-file (if dump-file-arg
(io/file dump-file-arg)
(latest-dump-file))]
(when-not (and dump-file (.exists ^File dump-file))
(println "Error: no dump file found."
(if dump-file-arg
(format "File %s does not exist." dump-file-arg)
"Place a .dump file under ./dump/ or pass a path."))
(System/exit 1))
(println (format "Dropping %s..." clone-db-name))
(psql-admin! (format "DROP DATABASE IF EXISTS %s;" clone-db-name))
(println (format "Creating %s..." clone-db-name))
(psql-admin! (format "CREATE DATABASE %s;" clone-db-name))
(println (format "Restoring from %s..." (.getPath ^File dump-file)))
(let [result (p/shell {:out :string
:err :string
:continue true}
"pg_restore"
"--no-owner" "--no-privileges"
"-h" default-host "-p" default-port "-U" default-user
"-d" clone-db-name
(.getPath ^File dump-file))
stderr (:err result)
;; pg_restore prints "WARNING:" lines for non-fatal issues (missing
;; roles, etc.) and "pg_restore: error:" for fatal ones. Exit 1 is
;; ambiguous; treat fatal-error lines as the real signal.
fatal? (re-find #"(?m)^pg_restore: error:" (or stderr ""))]
(when (or fatal? (> (:exit result) 1))
(println stderr)
(throw (ex-info "pg_restore failed" {:exit (:exit result)})))
(when (seq stderr)
(println "pg_restore warnings (non-fatal):")
(println stderr)))
(println (format "Done. %s is ready. App env: ORCHA_LOCAL_DB_NAME_OVERRIDE=%s"
clone-db-name clone-db-name))))
(defn drop-clone!
"Drop orcha_prod_clone (no-op if missing)."
[& _]
(println (format "Dropping %s..." clone-db-name))
(psql-admin! (format "DROP DATABASE IF EXISTS %s;" clone-db-name))
(println "Done."))
(defn fresh!
"Drop, recreate, and migrate-to-HEAD the orcha_fresh database.
init.sql is applied implicitly by migratus when the database is empty;
migratus picks up `resources/migrations/init.sql` and all timestamped
migrations from the classpath."
[& _]
(println (format "Dropping %s..." fresh-db-name))
(psql-admin! (format "DROP DATABASE IF EXISTS %s;" fresh-db-name))
(println (format "Creating %s..." fresh-db-name))
(psql-admin! (format "CREATE DATABASE %s;" fresh-db-name))
(println "Running migrations (init.sql runs automatically via migratus :init-script)...")
(let [result (p/shell {:extra-env {"ORCHA_LOCAL_DB_NAME_OVERRIDE" fresh-db-name}
:continue true}
"clojure" "-X:dev" "user/migratus"
":command" "migrate"
":args" "[]")]
(when-not (zero? (:exit result))
(throw (ex-info "migrate failed" {:exit (:exit result)}))))
(println (format "Done. %s is at HEAD schema." fresh-db-name)))
(defn list-clones
"List AWS RDS instances tagged CreatedBy=clone-prod-db. Prints identifier,
status, and CreatedAt tag value."
[& _]
(let [result (p/shell {:out :string
:err :string
:continue true
:extra-env {"AWS_PROFILE" aws-profile}}
"aws" "rds" "describe-db-instances"
"--region" aws-region
"--query"
(str "DBInstances[?contains(TagList[?Key==`CreatedBy`].Value, `clone-prod-db`)]"
".{ID:DBInstanceIdentifier,Status:DBInstanceStatus,Endpoint:Endpoint.Address}")
"--output" "table")]
(if (zero? (:exit result))
(do (println (:out result))
(println "To delete a stray clone:")
(println " aws rds delete-db-instance --profile orcha-prod --region eu-central-1 \\")
(println " --db-instance-identifier <ID> --skip-final-snapshot --delete-automated-backups"))
(do (println "Error:" (:err result))
(System/exit 1)))))
Edit bb.edn, adding these four entries adjacent to the existing db:psql/db:logs/db:schema-diff tasks:
db:load-clone
{:doc "Restore a prod dump into orcha_prod_clone: bb db:load-clone [dump-file]"
:requires ([db-clone-helpers])
:task (apply db-clone-helpers/load-clone! *command-line-args*)}
db:fresh
{:doc "Recreate orcha_fresh at HEAD schema (used as schema-diff baseline)"
:requires ([db-clone-helpers])
:task (apply db-clone-helpers/fresh! *command-line-args*)}
db:drop-clone
{:doc "Drop the orcha_prod_clone database"
:requires ([db-clone-helpers])
:task (apply db-clone-helpers/drop-clone! *command-line-args*)}
db:list-clones
{:doc "List AWS RDS throwaway clones tagged by clone-prod-db"
:requires ([db-clone-helpers])
:task (apply db-clone-helpers/list-clones *command-line-args*)}
db:freshRun: bb db:fresh
Expected: output ending in Done. orcha_fresh is at HEAD schema., no errors.
Verify the DB exists and has tables:
docker-compose exec -T postgres psql -U postgres -d orcha_fresh -c "\dt"
Expected: lists tables (ap_*, document, tenant, etc.).
db:load-clone with a synthetic dumpCreate a tiny synthetic dump to exercise the helper without needing prod data:
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE IF EXISTS orcha_loadtest_src;"
docker-compose exec -T postgres psql -U postgres -c "CREATE DATABASE orcha_loadtest_src;"
docker-compose exec -T postgres psql -U postgres -d orcha_loadtest_src -c "CREATE TABLE foo (id int); INSERT INTO foo VALUES (1), (2), (3);"
mkdir -p dump
pg_dump -Fc -h localhost -p 5432 -U postgres orcha_loadtest_src -f dump/loadtest.dump
bb db:load-clone dump/loadtest.dump
docker-compose exec -T postgres psql -U postgres -d orcha_prod_clone -c "SELECT count(*) FROM foo;"
Expected: output shows count = 3.
db:drop-cloneRun: bb db:drop-clone
Expected: Done.
Verify it's gone:
docker-compose exec -T postgres psql -U postgres -c "SELECT 1 FROM pg_database WHERE datname = 'orcha_prod_clone';" | head
Expected: (0 rows).
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_loadtest_src;"
rm dump/loadtest.dump
db:list-clonesRun: bb db:list-clones
Expected: either prints an empty table (no active clones — normal) and the delete hint, or lists any currently-alive throwaway clones. No error.
Run: clj-kondo --lint scripts/db_clone_helpers.clj
Expected: no warnings or errors.
git add scripts/db_clone_helpers.clj bb.edn
git commit -m "feat(dev): add db clone helper tasks"
bb db:clone-prodPurpose: The main cloning script. Finds the most recent RDS snapshot of v1-orcha-db (or creates a fresh one if stale), restores it to a throwaway db.t4g.small instance, port-forwards via SSM through v1-orcha-app, pg_dumps into dump/prod-<timestamp>.dump, and deletes the throwaway — with a shutdown hook so teardown happens even on crash/SIGINT.
Files:
Create: scripts/clone_prod_db.clj
Modify: bb.edn (register db:clone-prod)
Create: scripts/test_clone_prod_db.clj (bb-runnable tests for pure helpers)
Step 1: Write the script
Create scripts/clone_prod_db.clj:
(ns clone-prod-db
"Clone the prod RDS database to a local dump file via a throwaway RDS
instance restored from a recent snapshot.
Flow:
1. Find the most recent snapshot of v1-orcha-db (create one if the
newest is older than the freshness threshold).
2. Restore to v1-orcha-db-clone-<timestamp> (db.t4g.small, no backups,
same VPC/SG so v1-orcha-app can reach it).
3. Wait until available.
4. Start SSM port-forward to the clone's endpoint via v1-orcha-app.
5. pg_dump -Fc -Z1 into dump/prod-<timestamp>.dump.
6. Delete the throwaway instance (also runs on abnormal exit).
Usage:
bb db:clone-prod [--fresh-snapshot]
[--freshness-hours N]
[--instance-class db.t4g.small]
[--skip-restore]"
(:require [babashka.process :as p]
[cheshire.core :as json]
[clojure.java.io :as io]
[clojure.string :as str])
(:import [java.time Instant]
[java.time.temporal ChronoUnit]))
(set! *warn-on-reflection* true)
(def ^:private aws-profile "orcha-prod")
(def ^:private aws-region "eu-central-1")
(def ^:private prod-db-identifier "v1-orcha-db")
(def ^:private app-instance-name "v1-orcha-app")
(def ^:private local-forward-port "25432")
(def ^:private default-instance-class "db.t4g.small")
(def ^:private default-freshness-hours 24)
;; ---------------------------------------------------------------------------
;; Pure helpers
;; ---------------------------------------------------------------------------
(defn parse-args
"Parse command-line args into an options map with defaults."
[args]
(loop [args args
opts {:fresh-snapshot? false
:freshness-hours default-freshness-hours
:instance-class default-instance-class
:skip-restore? false}]
(if (empty? args)
opts
(let [[arg & rest-args] args]
(case arg
"--fresh-snapshot" (recur rest-args (assoc opts :fresh-snapshot? true))
"--freshness-hours" (recur (rest rest-args)
(assoc opts :freshness-hours
(Long/parseLong (first rest-args))))
"--instance-class" (recur (rest rest-args)
(assoc opts :instance-class (first rest-args)))
"--skip-restore" (recur rest-args (assoc opts :skip-restore? true))
"--help" (assoc opts :help true)
(recur rest-args opts))))))
(defn clone-identifier
"Deterministic throwaway identifier: v1-orcha-db-clone-<yyyyMMddHHmmss>."
[^Instant now]
(let [fmt (-> (java.time.format.DateTimeFormatter/ofPattern "yyyyMMddHHmmss")
(.withZone (java.time.ZoneId/of "UTC")))]
(str prod-db-identifier "-clone-" (.format fmt now))))
(defn snapshot-fresh-enough?
"True if `snapshot-create-time` (ISO-8601 string) is within `max-age-hours`
of `now`."
[snapshot-create-time ^Instant now max-age-hours]
(let [created (Instant/parse snapshot-create-time)
age-hours (.between ChronoUnit/HOURS created now)]
(<= age-hours max-age-hours)))
;; ---------------------------------------------------------------------------
;; AWS primitives (side-effectful)
;; ---------------------------------------------------------------------------
(defn ^:private aws-cmd
"Run an aws CLI command with the prod profile and region.
Returns {:out ... :err ... :exit N} without throwing."
[& args]
(apply p/shell
{:out :string
:err :string
:continue true
:extra-env {"AWS_PROFILE" aws-profile}}
"aws"
(concat args ["--region" aws-region])))
(defn ^:private find-latest-snapshot
"Return {:identifier ... :create-time ...} for the newest available
snapshot of `prod-db-identifier`, or nil if none."
[]
(let [result (aws-cmd "rds" "describe-db-snapshots"
"--db-instance-identifier" prod-db-identifier
"--query" "reverse(sort_by(DBSnapshots[?Status==`available`], &SnapshotCreateTime))[0].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Type:SnapshotType}"
"--output" "json")]
(when (zero? (:exit result))
(let [parsed (json/parse-string (:out result) true)]
(when (:ID parsed)
{:identifier (:ID parsed)
:create-time (:Time parsed)
:type (:Type parsed)})))))
(defn ^:private create-manual-snapshot!
"Create and wait for a fresh manual snapshot. Returns the identifier."
[^Instant now]
(let [id (str prod-db-identifier "-clone-src-"
(.format (.withZone (java.time.format.DateTimeFormatter/ofPattern
"yyyyMMddHHmmss")
(java.time.ZoneId/of "UTC"))
now))]
(println (format "Creating fresh manual snapshot: %s" id))
(let [r (aws-cmd "rds" "create-db-snapshot"
"--db-instance-identifier" prod-db-identifier
"--db-snapshot-identifier" id)]
(when-not (zero? (:exit r))
(throw (ex-info (str "create-db-snapshot failed: " (:err r)) {}))))
(println "Waiting for snapshot to become available (this can take 5-15 minutes)...")
(let [r (aws-cmd "rds" "wait" "db-snapshot-available"
"--db-snapshot-identifier" id)]
(when-not (zero? (:exit r))
(throw (ex-info (str "snapshot wait failed: " (:err r)) {}))))
id))
(defn ^:private subnet-group-name
"Look up the subnet group name of the prod instance (needed to restore
into the same subnets)."
[]
(let [r (aws-cmd "rds" "describe-db-instances"
"--db-instance-identifier" prod-db-identifier
"--query" "DBInstances[0].DBSubnetGroup.DBSubnetGroupName"
"--output" "text")]
(when-not (zero? (:exit r))
(throw (ex-info "failed to look up subnet group" {:err (:err r)})))
(str/trim (:out r))))
(defn ^:private security-group-ids
"Look up the VPC SG IDs attached to the prod instance."
[]
(let [r (aws-cmd "rds" "describe-db-instances"
"--db-instance-identifier" prod-db-identifier
"--query" "DBInstances[0].VpcSecurityGroups[].VpcSecurityGroupId"
"--output" "text")]
(when-not (zero? (:exit r))
(throw (ex-info "failed to look up security groups" {:err (:err r)})))
(str/split (str/trim (:out r)) #"\s+")))
(defn ^:private restore-snapshot!
"Restore `snapshot-id` to `clone-id`. Blocks until available."
[snapshot-id clone-id instance-class timestamp-str]
(let [sg-ids (security-group-ids)
sg-args (mapcat (fn [sg] ["--vpc-security-group-ids" sg]) sg-ids)
args (concat ["rds" "restore-db-instance-from-db-snapshot"
"--db-snapshot-identifier" snapshot-id
"--db-instance-identifier" clone-id
"--db-instance-class" instance-class
"--db-subnet-group-name" (subnet-group-name)
"--no-publicly-accessible"
"--no-multi-az"
"--no-auto-minor-version-upgrade"
"--storage-type" "gp3"
"--no-deletion-protection"
"--backup-retention-period" "0"
"--tags"
(format "Key=CloneOf,Value=%s" prod-db-identifier)
"Key=CreatedBy,Value=clone-prod-db"
(format "Key=CreatedAt,Value=%s" timestamp-str)]
sg-args)]
(println (format "Restoring snapshot %s to %s (%s)..." snapshot-id clone-id instance-class))
(let [r (apply aws-cmd args)]
(when-not (zero? (:exit r))
(throw (ex-info (str "restore failed: " (:err r)) {}))))
(println "Waiting for clone to become available (~10-20 minutes)...")
(let [r (aws-cmd "rds" "wait" "db-instance-available"
"--db-instance-identifier" clone-id)]
(when-not (zero? (:exit r))
(throw (ex-info (str "wait failed: " (:err r)) {}))))))
(defn ^:private clone-endpoint
[clone-id]
(let [r (aws-cmd "rds" "describe-db-instances"
"--db-instance-identifier" clone-id
"--query" "DBInstances[0].Endpoint.Address"
"--output" "text")]
(when-not (zero? (:exit r))
(throw (ex-info "endpoint lookup failed" {:err (:err r)})))
(str/trim (:out r))))
(defn ^:private master-secret-json
"Fetch the prod DB credentials JSON from SSM (same source config uses).
Restored clones inherit the master user/password from the snapshot, so
the prod credentials work against the clone."
[]
(let [r (aws-cmd "ssm" "get-parameter"
"--name" "/v1-orcha/db-credentials"
"--with-decryption"
"--query" "Parameter.Value"
"--output" "text")]
(when-not (zero? (:exit r))
(throw (ex-info "credentials fetch failed" {:err (:err r)})))
(str/trim (:out r))))
(defn ^:private ec2-instance-id
"Look up the running v1-orcha-app EC2 instance ID."
[]
(let [r (aws-cmd "ec2" "describe-instances"
"--filters"
(format "Name=tag:Name,Values=%s" app-instance-name)
"Name=instance-state-name,Values=running"
"--query" "Reservations[0].Instances[0].InstanceId"
"--output" "text")]
(when-not (zero? (:exit r))
(throw (ex-info "ec2 lookup failed" {:err (:err r)})))
(let [id (str/trim (:out r))]
(when (or (str/blank? id) (= id "None"))
(throw (ex-info (format "No running %s instance found" app-instance-name) {})))
id)))
(defn ^:private start-port-forward!
"Start an SSM session that tunnels local-forward-port → rds-endpoint:5432."
[ec2-id rds-endpoint]
(println (format "Starting SSM port-forward: localhost:%s → %s:5432" local-forward-port rds-endpoint))
(p/process {:extra-env {"AWS_PROFILE" aws-profile}
:out :string
:err :string}
"aws" "ssm" "start-session"
"--target" ec2-id
"--document-name" "AWS-StartPortForwardingSessionToRemoteHost"
"--parameters" (format "{\"host\":[\"%s\"],\"portNumber\":[\"5432\"],\"localPortNumber\":[\"%s\"]}"
rds-endpoint local-forward-port)
"--region" aws-region))
(defn ^:private wait-for-port
[port timeout-ms]
(let [start (System/currentTimeMillis)]
(loop []
(if (> (- (System/currentTimeMillis) start) timeout-ms)
false
(if (try
(with-open [_ (java.net.Socket. "localhost" (Integer/parseInt port))]
true)
(catch java.net.ConnectException _ false))
true
(do (Thread/sleep 1000) (recur)))))))
(defn ^:private pg-dump!
[creds-json dump-path]
(let [{:keys [username password dbname]}
(json/parse-string creds-json true)]
(println (format "Running pg_dump to %s..." dump-path))
(let [r (p/shell {:extra-env {"PGPASSWORD" password}
:continue true}
"pg_dump" "-Fc" "-Z1"
"--no-owner" "--no-privileges"
"-h" "localhost"
"-p" local-forward-port
"-U" username
"-f" dump-path
dbname)]
(when-not (zero? (:exit r))
(throw (ex-info "pg_dump failed" {:exit (:exit r)}))))))
(defn ^:private delete-clone!
"Delete the throwaway RDS clone. Retries while the instance is still in
`creating` state, which AWS rejects with InvalidDBInstanceState."
[clone-id]
(println (format "Deleting throwaway clone %s..." clone-id))
(loop [attempt 1]
(let [r (aws-cmd "rds" "delete-db-instance"
"--db-instance-identifier" clone-id
"--skip-final-snapshot"
"--delete-automated-backups")]
(cond
(zero? (:exit r))
(println (format "Deletion initiated for %s." clone-id))
(and (re-find #"InvalidDBInstanceState" (or (:err r) ""))
(< attempt 30))
(do (println (format "Clone still in transient state (attempt %d/30); waiting 30s..." attempt))
(Thread/sleep 30000)
(recur (inc attempt)))
:else
(do (println (format "WARNING: delete-db-instance failed for %s: %s"
clone-id (:err r)))
(println (format "Manual cleanup required: bb db:list-clones && \\"))
(println (format " aws rds delete-db-instance --profile orcha-prod \\"))
(println (format " --region eu-central-1 --db-instance-identifier %s \\" clone-id))
(println " --skip-final-snapshot --delete-automated-backups"))))))
(defn ^:private ensure-cleanup-hook!
"Register a JVM shutdown hook that attempts to delete the clone. Safe
against double-invocation via an atom guard."
[clone-id deleted?-atom]
(.addShutdownHook (Runtime/getRuntime)
(Thread.
(fn []
(when (compare-and-set! deleted?-atom false true)
(println)
(println "Shutdown hook: cleaning up throwaway clone...")
(delete-clone! clone-id))))))
;; ---------------------------------------------------------------------------
;; Orchestration
;; ---------------------------------------------------------------------------
(defn print-usage
[]
(println "Usage: bb db:clone-prod [options]")
(println)
(println "Options:")
(println " --fresh-snapshot Always take a new manual snapshot (skip reuse check).")
(println " --freshness-hours N Reuse newest snapshot if younger than N hours (default 24).")
(println " --instance-class C Throwaway instance class (default db.t4g.small).")
(println " --skip-restore Debug: print what would run, don't create anything.")
(println " --help"))
(defn -main
[& args]
(let [opts (parse-args args)]
(when (:help opts) (print-usage) (System/exit 0))
(let [now (Instant/now)
ts (.format (.withZone (java.time.format.DateTimeFormatter/ofPattern
"yyyyMMdd-HHmmss")
(java.time.ZoneId/of "UTC"))
now)
clone-id (clone-identifier now)
dump-path (format "dump/prod-%s.dump" ts)]
(println "Prod DB clone — session" ts)
(println "Throwaway clone identifier:" clone-id)
(println "Target dump file: " dump-path)
(println)
(when (:skip-restore? opts)
(println "[--skip-restore] exiting before any AWS side-effects.")
(System/exit 0))
;; Step 1: pick or create a snapshot
(let [existing (when-not (:fresh-snapshot? opts) (find-latest-snapshot))
snapshot-id (if (and existing
(snapshot-fresh-enough?
(:create-time existing) now (:freshness-hours opts)))
(do (println (format "Reusing snapshot %s (%s, type=%s)"
(:identifier existing)
(:create-time existing)
(:type existing)))
(:identifier existing))
(create-manual-snapshot! now))]
;; Step 2: restore + register cleanup
;; Hook is registered BEFORE the restore call so that a SIGINT during
;; the restore wait still triggers cleanup. delete-db-instance against
;; an instance that didn't quite materialize is a no-op + log line.
(let [deleted? (atom false)]
(ensure-cleanup-hook! clone-id deleted?)
(try
(restore-snapshot! snapshot-id clone-id (:instance-class opts) ts)
;; Step 3: port-forward
(let [ec2-id (ec2-instance-id)
endpoint (clone-endpoint clone-id)
forward (start-port-forward! ec2-id endpoint)]
(try
(if (wait-for-port local-forward-port 30000)
(println "Port forward established.")
(do
;; Surface SSM stderr — a missing session-manager-plugin
;; or expired SSO token is the usual cause and the
;; default 30s timeout otherwise gives no signal.
(when-let [err (some-> @forward :err)]
(println "SSM stderr:")
(println err))
(throw (ex-info "port-forward did not come up in 30s"
{:ec2-id ec2-id :endpoint endpoint}))))
;; Step 4: dump
(let [creds (master-secret-json)]
(io/make-parents (io/file dump-path))
(pg-dump! creds dump-path)
(println (format "Dump complete: %s" dump-path)))
(finally
(p/destroy forward))))
(finally
;; Step 5: teardown (also covered by shutdown hook for abnormal exit)
(when (compare-and-set! deleted? false true)
(delete-clone! clone-id)))))))))
Create scripts/test_clone_prod_db.clj:
(ns test-clone-prod-db
"Unit tests for pure helpers in clone-prod-db. Run via:
bb scripts/test_clone_prod_db.clj"
(:require [clojure.test :as t :refer [deftest is run-tests testing]]
[clone-prod-db :as cpd])
(:import [java.time Instant]))
(deftest parse-args-defaults
(let [o (cpd/parse-args [])]
(is (false? (:fresh-snapshot? o)))
(is (= 24 (:freshness-hours o)))
(is (= "db.t4g.small" (:instance-class o)))
(is (false? (:skip-restore? o)))))
(deftest parse-args-all-flags
(let [o (cpd/parse-args ["--fresh-snapshot"
"--freshness-hours" "6"
"--instance-class" "db.t4g.medium"
"--skip-restore"])]
(is (true? (:fresh-snapshot? o)))
(is (= 6 (:freshness-hours o)))
(is (= "db.t4g.medium" (:instance-class o)))
(is (true? (:skip-restore? o)))))
(deftest clone-identifier-is-deterministic-from-now
(let [now (Instant/parse "2026-04-24T09:38:39Z")
id (cpd/clone-identifier now)]
(is (= "v1-orcha-db-clone-20260424093839" id))))
(deftest snapshot-fresh-enough?
(let [now (Instant/parse "2026-04-24T12:00:00Z")]
(testing "1h old is fresh (default threshold 24h)"
(is (true? (cpd/snapshot-fresh-enough? "2026-04-24T11:00:00Z" now 24))))
(testing "23h old is fresh"
(is (true? (cpd/snapshot-fresh-enough? "2026-04-23T13:00:00Z" now 24))))
(testing "25h old is stale"
(is (false? (cpd/snapshot-fresh-enough? "2026-04-23T10:59:00Z" now 24))))
(testing "custom threshold 6h — 5h fresh"
(is (true? (cpd/snapshot-fresh-enough? "2026-04-24T07:00:00Z" now 6))))
(testing "custom threshold 6h — 7h stale"
(is (false? (cpd/snapshot-fresh-enough? "2026-04-24T05:00:00Z" now 6))))))
(let [{:keys [fail error]} (run-tests 'test-clone-prod-db)]
(System/exit (if (zero? (+ fail error)) 0 1)))
Run: bb scripts/test_clone_prod_db.clj
Expected: all tests pass, exit 0.
--help and --skip-restoreRegister the bb task. Edit bb.edn:
db:clone-prod
{:doc "Clone prod RDS to a local dump file via a throwaway RDS instance"
:requires ([clone-prod-db])
:task (apply clone-prod-db/-main *command-line-args*)}
Run: bb db:clone-prod --help
Expected: prints the usage block, exits 0.
Run: bb db:clone-prod --skip-restore
Expected: prints the session header + throwaway identifier + dump path, then exits with [--skip-restore] exiting before any AWS side-effects.
Run: clj-kondo --lint scripts/clone_prod_db.clj scripts/test_clone_prod_db.clj
Expected: no warnings or errors.
git add scripts/clone_prod_db.clj scripts/test_clone_prod_db.clj bb.edn
git commit -m "feat(dev): add prod DB clone script"
The real AWS execution — take snapshot, restore, dump, teardown — is validated in Task 7 against the rename migration. Don't run it here; it's slow and costs money.
Purpose: A single page a human or agent can follow top-to-bottom before a big migration deploy. References the scripts we just built.
Files:
Create: docs/runbooks/prod-clone-refactor-testing.md
Step 1: Write the runbook
Create docs/runbooks/prod-clone-refactor-testing.md:
# Prod-Clone Refactor Testing
Pre-deploy validation procedure for large, schema-touching refactors. Restores
a clone of prod Postgres locally, applies pending migrations, verifies the
resulting schema matches a fresh-from-init baseline, and exercises ingestion
and UI against the migrated clone.
**Design spec:** `docs/superpowers/specs/2026-04-24-prod-clone-refactor-testing-design.md`
## When to use
Run before deploying any migration that:
- Renames tables, columns, constraints, indexes, or types.
- Modifies trigger function bodies or JSONB path consumers.
- Makes `NOT NULL` a previously-nullable column.
- Changes FK topology.
- Anything where a successful `bb migrate migrate` against init.sql wouldn't
catch data-shape issues.
**Not** needed for additive migrations (new columns/tables) that are covered
by unit/integration tests and don't touch live data layout.
## Prerequisites
- AWS SSO login: `aws sso login --profile orcha-prod`
- docker-compose stack running: `bb dev:up && bb dev:seed`
- Postgres 18+ client tools on PATH (`pg_dump`, `pg_restore`, `psql`).
The host client major version MUST be ≥ the docker-compose Postgres
server version (currently `pgvector/pgvector:pg18` per `docker-compose.yml`).
Verify both:
pg_dump --version # → 18.x docker-compose exec -T postgres psql -U postgres -tAc "SHOW server_version;" # → 18.x
Mismatch will cause `pg_restore` to fail loading the prod-shaped dump.
- AWS Session Manager plugin installed (`session-manager-plugin --version`
must succeed). Without it, the SSM port-forward step silently hangs.
- Branch with the migration under test checked out
## Procedure
### Step 1 — Clone prod to a dump file
Unattended, ~30–45 min (longer if a fresh snapshot needs to be taken).
bb db:clone-prod
Produces `dump/prod-<timestamp>.dump`. Watch output for the throwaway clone
identifier — if the script dies unexpectedly, that's what you need to delete
manually (see Step 9).
**Flags:**
- `--fresh-snapshot` — skip snapshot reuse; always create a new one.
- `--freshness-hours N` — reuse snapshots up to N hours old (default 24).
### Step 2 — Load the dump into local Postgres
~2–5 min.
bb db:load-clone
Creates `orcha_prod_clone` in local docker-compose Postgres, restored from
the newest `dump/*.dump` file.
### Step 3 — Sanity check: current master boots against pre-migration clone
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev
At the REPL: `(integrant.repl/go)` — must succeed. Confirms the clone is
usable and that current master is healthy against prod schema. Exit the REPL.
### Step 4 — Apply pending migrations on the clone
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate
All pending migrations apply cleanly against prod data.
### Step 5 — Schema assertion (the gate)
bb db:fresh bb db:schema-diff --a orcha_prod_clone --b orcha_fresh
`bb db:fresh` (re)creates `orcha_fresh` at HEAD schema. `bb db:schema-diff`
exits 0 on empty diff. Any diff means the migration produces a schema that
diverges from what init.sql + migrations produce — i.e., a migration bug.
Iterate until exit 0.
### Step 6 — Ingestion smoke (programmatic)
Start the app pointed at the clone:
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev
At the REPL: `(integrant.repl/go)`. In another terminal, list available
fixtures and ingest a couple. Don't reference prod document IDs — these
should be fresh test PDFs, since prod S3 objects aren't local.
ls test/fixtures/ # discover what's available
bb ingest test/fixtures/
Assert each document reaches a terminal state (processed/failed) without
errors. Watch the REPL logs for:
- Queries referencing renamed columns (`legal_entity_id`, etc.)
- Malli decode errors
- Trigger-function payload key mismatches (`:legal-entity-id` vs `:tenant-id`)
- SES/notification consumer key mismatches
- FK violations
### Step 7 — UI smoke (manual)
With the app still running against the clone, walk through:
- Document list — loads and paginates
- Open a document → view renders without errors
- Matching screen — loads a cluster
- Tenants admin panel (`/tenants`) — loads, shows renamed entities
- Organizations admin panel (`/organizations`) — loads
Watch browser devtools and REPL logs for 500s.
### Step 8 — Iterate on failure
If any step fails:
1. Fix code or migration in the repo.
2. `bb db:load-clone` — resets `orcha_prod_clone` from the cached dump
file (~3 min, no re-clone of prod needed).
3. Repeat from Step 4.
### Step 9 — Cleanup
bb db:drop-clone
bb db:list-clones # must be empty
rm dump/prod-
`bb db:list-clones` surfaces any leaked throwaway RDS instances (should not
happen under normal conditions — the script has a shutdown hook — but run
this as a belt-and-braces check).
## What this does NOT cover
- Data-path bugs that require specific prod S3 objects. Ingestion uses
fresh local test documents only.
- Load or performance testing — throwaway clone is cost-optimized.
- Prod deploy race conditions (concurrent writers during migration).
Mitigation: deploy in low-traffic windows.
- Integration side effects (DATEV/SAP/Outlook) beyond what ingestion
incidentally triggers.
## Troubleshooting
**Clone restore hangs past 25 minutes.** The AWS wait timeout may have been
exceeded. `bb db:list-clones` — if the instance exists and is `available`,
restart the script; if it's `creating`, keep waiting (larger snapshots take
longer).
**SSM port-forward fails with "target not connected".** Run
`aws sso login --profile orcha-prod` and retry.
**`pg_dump` fails with `password authentication failed`.** The script
fetches the master password from SSM (`/v1-orcha/db-credentials`), and
restored snapshots inherit the master password as it was AT SNAPSHOT TIME.
If the prod password rotated between the snapshot and the restore, those
won't match. Resolutions:
- Use `--fresh-snapshot` so the snapshot reflects the current password.
- Or pass `--master-user-password <known>` to
`restore-db-instance-from-db-snapshot` and use that known password
(script enhancement; not currently supported).
**`pg_restore` warnings about missing roles.** Expected; `--no-owner
--no-privileges` skips ownership. Exit code 1 with warnings is treated as
success by `bb db:load-clone`.
**Schema diff shows expected vs actual differences you think are fine.**
Add a canonicalizer rule in `scripts/schema_diff.clj`. Review the diff
carefully — "noise" is often a subtle bug.
**`bb db:list-clones` shows a stray instance you don't recognize.**
Delete it:
aws rds delete-db-instance --profile orcha-prod --region eu-central-1
--db-instance-identifier
git add docs/runbooks/prod-clone-refactor-testing.md
git commit -m "docs(runbooks): add prod-clone refactor testing runbook"
⚠️ USER-ONLY — DO NOT AUTO-EXECUTE.
This task is the first real use of the procedure and is the only task in this plan that performs AWS mutations against the
orcha-prodaccount:rds create-db-snapshot,rds restore-db-instance-from-db-snapshot,ssm start-sessiontov1-orcha-app, andrds delete-db-instance --skip-final-snapshot --delete-automated-backups. It also pulls prod master DB credentials to the local machine viassm get-parameter --with-decryptionand writes a dump file containing production PII todump/.Agents executing this plan must STOP after Task 6 and hand control back to the user. The user runs Task 7 interactively, end-to-end, and confirms each step's outcome. Do not invoke
bb db:clone-prod(without--skip-restore),bb db:load-clone,bb db:fresh,bb db:schema-diff,bb db:drop-clone, or any of the AWS CLI commands listed below on the user's behalf.
Purpose: Actually run the full procedure against the legal_entity → tenant rename migration and confirm it catches real issues (or gives a clean pass). This is both the first real use and the acceptance test for the procedure itself.
Files: none modified — just running scripts and documenting findings.
This task is run interactively by the user and its results inform whether the procedure is ready. Not a clean TDD loop; more of a dress rehearsal.
aws sso login --profile orcha-prod
bb dev:up
bb dev:seed
docker-compose ps
pg_dump --version # must be 18+
Expected: aws sso succeeds, containers running, pg_dump (PostgreSQL) 18.x.
git log --oneline -1
Expected: current HEAD is the tip of the rename work (the latest fix(tenant-rename): ... commit or later).
bb db:clone-prod
Runs unattended. When complete, note the dump path printed at the end.
bb db:list-clones
Expected: empty table (or only shows the one being deleted, status deleting).
bb db:load-clone
Expected: succeeds with "Done. orcha_prod_clone is ready."
This is the canary: the clone contains prod data pre-rename (i.e., still has legal_entity tables). But our current master expects the renamed tables. So this step is expected to FAIL if current HEAD is post-rename code.
If HEAD is post-rename code, skip this step. If you want this canary, check out a pre-rename commit and try it.
For this validation run, note that this step is N/A (current HEAD is post-rename).
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate
Expected: the rename migration applies cleanly. Exit 0.
bb db:fresh
bb db:schema-diff --a orcha_prod_clone --b orcha_fresh
Expected: exit 0, "schemas match". If any diff appears, stop and investigate — it represents a real defect in either the migration or init.sql.
ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev
At the REPL:
(integrant.repl/go)
In a second terminal:
bb ingest test/fixtures/01-invoice.pdf
(Use whatever invoice fixtures actually exist under test/fixtures/ — list them first with ls test/fixtures/.)
Watch REPL logs for any legal-entity-id residual query, Malli decode error, or 500-level response.
Expected: ingestion completes; document moves to a terminal state; no errors referencing old-schema identifiers.
With the app running, open http://localhost:3000 (or whatever the app's local URL is — check bb.edn / README) and click through the checklist in docs/runbooks/prod-clone-refactor-testing.md Step 7.
bb db:drop-clone
bb db:list-clones # empty
rm dump/prod-*.dump # after cleanup confirmed
If anything surprising came up (schema diff issues, ingestion errors, UI 500s, throwaway-instance cleanup problems), file them as issues on the rename branch and add canonicalizer rules / plan steps as needed. If nothing surprising came up, the procedure has passed its first real test.
No commit required for this task unless script fixes were made.
Before handing this plan to execution, run through:
TBD / TODO / handle edge cases placeholders. Grep the plan for those strings and fix any hits....s?ORCHA_LOCAL_DB_NAME_OVERRIDE; the clone DB is always orcha_prod_clone; the fresh DB is always orcha_fresh; the task names always match between bb.edn and the doc.docs/superpowers/specs/2026-04-24-prod-clone-refactor-testing-design.md.