Prod-Clone Refactor Testing Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

⚠️ Execution boundary: Tasks 1–6 are agent-executable; they perform no AWS mutations (one read-only rds describe-db-instances call at most, in Task 4). Task 7 is USER-ONLY — it is the first real run of the procedure and is the only task that mutates AWS resources in orcha-prod (snapshot create, RDS restore, RDS delete) and pulls prod credentials + PII to the local machine. After completing Task 6, STOP and hand control back to the user. See the warning block at the top of Task 7.

Goal: Build a repeatable procedure (three bb scripts + a runbook) for restoring a prod Postgres clone into local docker-compose, applying pending migrations, and proving the resulting schema matches a fresh-from-init baseline — so we can validate large refactors (like the legal_entitytenant rename) against real data before deploying.

Architecture: A one-shot cloning pipeline (bb db:clone-prod) snapshots prod RDS, restores it to a short-lived throwaway RDS, pg_dumps from it via SSM port-forward to a local dump file, then deletes the throwaway. A schema diff tool (bb db:schema-diff) canonicalizes pg_dump -s output and compares two local DBs. Small helper tasks (bb db:load-clone, bb db:fresh, bb db:drop-clone, bb db:list-clones) manage local and remote state. The app points at the clone via an env-driven dbname override. A runbook ties it all together.

Tech Stack: Babashka scripts (Clojure via bb), AWS CLI (rds/ssm/secretsmanager), pg_dump/pg_restore/psql, docker-compose Postgres, migratus.


Design spec

Authoritative source: docs/superpowers/specs/2026-04-24-prod-clone-refactor-testing-design.md. Review it before starting — this plan implements exactly what's in there.


File Structure

Created

Modified

Not touched


Commit convention

One logical unit per commit. Suggested prefixes (match existing git log style):


Task 1: Gitignore the dump directory

Purpose: The procedure writes prod database dumps (containing PII) to dump/. Must not be accidentally committed.

Files:

Run: grep -n dump .gitignore Expected: no match, or only references to unrelated entries.

Run: git status --short dump/ | head -3 Expected: lists untracked dump/ contents (already has misc files from prior work).

Open .gitignore and add the following line (group it near other tool/output exclusions, e.g., near the existing /tmp/ entry):

/dump/

Run: git ls-files dump/ | head Expected: no output. If output appears, stop and flag — existing tracked files shouldn't be silently ignored.

Run: git status --short | grep '^?? dump' Expected: no match (the dump/ line that was there at the start of the session is now hidden).

git add .gitignore
git commit -m "chore: gitignore dump directory"

Task 2: Add ORCHA_LOCAL_DB_NAME_OVERRIDE env override

Purpose: Allow the app and migrations to target orcha_prod_clone without modifying SSM parameters or config files. The override lives in parse-credentials so both the runtime pool (com.getorcha.db/::pool) and the migrations path (com.getorcha.db.migrations/->derive-db) pick it up from a single code point.

Files:

Open src/com/getorcha/db.clj at parse-credentials (~line 193). Confirm the body matches what Task 1's grep showed: it json-parses, then returns a map with :url, :user, :password, :host, :port, :dbname.

Run: test -f test/com/getorcha/db_test.clj && echo exists || echo missing

If missing, create it with this skeleton:

(ns com.getorcha.db-test
  (:require [clojure.test :refer [deftest testing is]]
            [com.getorcha.db :as db]))

If exists, open it and append the new test at the bottom.

Append to test/com/getorcha/db_test.clj:

(def ^:private sample-credentials-json
  (str "{\"username\":\"u\",\"password\":\"p\","
       "\"host\":\"h\",\"port\":5432,\"dbname\":\"orcha\"}"))


(deftest parse-credentials-honors-local-db-name-override
  (testing "without override, dbname matches the JSON"
    (let [parsed (db/parse-credentials sample-credentials-json)]
      (is (= "orcha" (:dbname parsed)))
      (is (= "jdbc:postgresql://h:5432/orcha" (:url parsed)))))

  (testing "with override env var set, dbname is replaced and URL reflects it"
    (with-redefs [db/env-db-name-override (constantly "orcha_prod_clone")]
      (let [parsed (db/parse-credentials sample-credentials-json)]
        (is (= "orcha_prod_clone" (:dbname parsed)))
        (is (= "jdbc:postgresql://h:5432/orcha_prod_clone" (:url parsed))))))

  (testing "blank override is ignored (treated as absent)"
    (with-redefs [db/env-db-name-override (constantly "")]
      (let [parsed (db/parse-credentials sample-credentials-json)]
        (is (= "orcha" (:dbname parsed)))))))

Run: clj -X:test:silent :nses '[com.getorcha.db-test]' 2>&1 | grep -E "FAIL|ERROR|Ran" Expected: ERROR (because db/env-db-name-override doesn't exist yet) or FAIL (because the override is not applied).

Edit src/com/getorcha/db.clj. Immediately above parse-credentials, add:

(defn ^:private env-db-name-override
  "Returns the value of ORCHA_LOCAL_DB_NAME_OVERRIDE or nil.

   Indirected through a defn (rather than called inline) so tests can
   stub it via with-redefs without touching process env. Intended for
   local development only — production task definitions MUST NOT set
   this env var."
  []
  (System/getenv "ORCHA_LOCAL_DB_NAME_OVERRIDE"))

Then modify the body of parse-credentials to apply the override. Replace:

  [credentials-json]
  (let [{:keys [username password host port dbname]}
        (json/parse-string credentials-json true)]
    {:url      (format "jdbc:postgresql://%s:%d/%s" host port dbname)
     :user     username
     :password (or password "")
     :host     host
     :port     port
     :dbname   dbname}))

with:

  [credentials-json]
  (let [{:keys [username password host port dbname]}
        (json/parse-string credentials-json true)
        override         (env-db-name-override)
        effective-dbname (if (and override (not (string/blank? override)))
                           override
                           dbname)]
    {:url      (format "jdbc:postgresql://%s:%d/%s" host port effective-dbname)
     :user     username
     :password (or password "")
     :host     host
     :port     port
     :dbname   effective-dbname}))

Required ns edit: verify the :require block of src/com/getorcha/db.clj includes [clojure.string :as string]. If absent (it currently is — check first with head -25 src/com/getorcha/db.clj), add it in alphabetical order. Without this require, string/blank? in the new code will fail to compile.

Run: clj -X:test:silent :nses '[com.getorcha.db-test]' 2>&1 | grep -E "FAIL|ERROR|Ran" Expected: Ran 1 tests containing 5 assertions. 0 failures, 0 errors. (or similar; numbers depend on pre-existing tests in the file).

Run: clj-kondo --lint src/com/getorcha/db.clj test/com/getorcha/db_test.clj Expected: no warnings or errors.

git add src/com/getorcha/db.clj test/com/getorcha/db_test.clj
git commit -m "feat(dev): env override for local DB name"

Task 3: Implement bb db:schema-diff

Purpose: Diff the schema of two local databases after running pg_dump -s, with canonicalization to eliminate non-semantic differences (comments, SET statements, sequence setvals, extension OIDs, statement order noise). Exit 0 on empty diff; non-zero with printed diff otherwise.

Files:

Create scripts/schema_diff.clj:

(ns schema-diff
  "Compare the schema of two local PostgreSQL databases.

   Runs pg_dump -s against each, canonicalizes the output (strips comments,
   SET statements, sequence setvals, and sorts top-level statements by
   object type + name), and diffs the result. Exits 0 on empty diff;
   exits 1 with the diff printed on any difference.

   Usage:
     bb db:schema-diff --a <db-a> --b <db-b> [--host HOST] [--port PORT] [--user USER]

   Example:
     bb db:schema-diff --a orcha_prod_clone --b orcha_fresh"
  (:require [babashka.process :as p]
            [clojure.java.io :as io]
            [clojure.string :as str])
  (:import [java.io File]))


(set! *warn-on-reflection* true)


(def ^:private default-host "localhost")
(def ^:private default-port "5432")
(def ^:private default-user "postgres")


(defn ^:private parse-args
  "Parse command-line args into an options map."
  [args]
  (loop [args args
         opts {:host default-host
               :port default-port
               :user default-user}]
    (if (empty? args)
      opts
      (let [[arg & rest-args] args]
        (case arg
          "--a"    (recur (rest rest-args) (assoc opts :db-a (first rest-args)))
          "--b"    (recur (rest rest-args) (assoc opts :db-b (first rest-args)))
          "--host" (recur (rest rest-args) (assoc opts :host (first rest-args)))
          "--port" (recur (rest rest-args) (assoc opts :port (first rest-args)))
          "--user" (recur (rest rest-args) (assoc opts :user (first rest-args)))
          "--help" (assoc opts :help true)
          (recur rest-args opts))))))


(defn canonicalize
  "Canonicalize pg_dump -s output to eliminate non-semantic differences.

   Rules:
   - Strip `--` line comments (including the section-header decorators
     pg_dump emits before each object).
   - Drop `SET` statements and `SELECT pg_catalog.set_config(...)` lines.
   - Drop sequence `setval` calls (carry no schema meaning).
   - Drop extension version lines (carry build-time OIDs).
   - Collapse runs of blank lines and trim trailing whitespace.

   Does NOT reorder statements: pg_dump emits objects in a deterministic
   topological order (namespace → kind → name). Two databases with the
   same final schema produce the same ordering. Reordering would require
   a $$-aware splitter, which we don't need."
  [sql-text]
  (let [filtered (->> (str/split-lines sql-text)
                      (remove #(str/starts-with? (str/triml %) "--"))
                      (remove #(re-find #"^\s*SET\s" %))
                      (remove #(re-find #"^\s*SELECT\s+pg_catalog\.set_config" %))
                      (remove #(re-find #"^\s*SELECT\s+pg_catalog\.setval" %))
                      (remove #(re-find #"COMMENT ON EXTENSION" %))
                      (map str/trimr))]
    (-> (str/join "\n" filtered)
        (str/replace #"\n{3,}" "\n\n")
        str/trim)))


(defn ^:private pg-dump-schema
  "Run pg_dump -s against `dbname`, return stdout string.
   Throws ex-info on non-zero exit."
  [{:keys [host port user] :as _opts} dbname]
  (let [result (p/shell {:out :string
                         :err :string
                         :continue true}
                        "pg_dump" "-s"
                        "--no-owner" "--no-privileges"
                        "--schema=public"
                        "-h" host "-p" port "-U" user
                        dbname)]
    (when-not (zero? (:exit result))
      (throw (ex-info (format "pg_dump failed for %s: %s" dbname (:err result))
                      {:dbname dbname :exit (:exit result)})))
    (:out result)))


(defn ^:private write-file!
  [^File file content]
  (.mkdirs (.getParentFile file))
  (spit file content))


(defn ^:private print-usage
  []
  (println "Usage: bb db:schema-diff --a <db-a> --b <db-b>")
  (println "                        [--host HOST] [--port PORT] [--user USER]")
  (println)
  (println "Compares the schemas of two local PostgreSQL databases.")
  (println "Exits 0 on empty diff; 1 with the diff printed on any difference."))


(defn -main
  [& args]
  (let [{:keys [db-a db-b help] :as opts} (parse-args args)]
    (cond
      help
      (print-usage)

      (or (nil? db-a) (nil? db-b))
      (do (println "Error: --a and --b are required")
          (println)
          (print-usage)
          (System/exit 2))

      :else
      (let [raw-a (pg-dump-schema opts db-a)
            raw-b (pg-dump-schema opts db-b)
            canon-a (canonicalize raw-a)
            canon-b (canonicalize raw-b)
            file-a (io/file (format "dump/schema-%s.sql" db-a))
            file-b (io/file (format "dump/schema-%s.sql" db-b))]
        (write-file! file-a canon-a)
        (write-file! file-b canon-b)
        (if (= canon-a canon-b)
          (do (println (format "OK: schemas match (%s vs %s)" db-a db-b))
              (System/exit 0))
          (let [diff (p/shell {:out :string :err :string :continue true}
                              "diff" "-u"
                              (.getPath file-a) (.getPath file-b))]
            (println (format "DIFF: schemas differ (%s vs %s)" db-a db-b))
            (println (:out diff))
            (System/exit 1)))))))

Create scripts/test_schema_diff.clj:

(ns test-schema-diff
  "Sanity tests for the schema-diff canonicalizer. Run via:
     bb scripts/test_schema_diff.clj"
  (:require [clojure.test :as t :refer [deftest is run-tests testing]]
            [schema-diff :as sd]))


(deftest canonicalize-strips-comments
  (let [raw "-- this is a comment\nCREATE TABLE foo (id int);\n-- another\n"]
    (is (not (re-find #"--" (sd/canonicalize raw))))
    (is (re-find #"CREATE TABLE foo" (sd/canonicalize raw)))))


(deftest canonicalize-drops-set-statements
  (let [raw "SET statement_timeout = 0;\nSET client_encoding = 'UTF8';\nCREATE TABLE foo (id int);\n"]
    (is (not (re-find #"(?i)SET\b" (sd/canonicalize raw))))
    (is (re-find #"CREATE TABLE foo" (sd/canonicalize raw)))))


(deftest canonicalize-drops-setval
  (let [raw "CREATE TABLE foo (id int);\nSELECT pg_catalog.setval('foo_id_seq', 42, true);\n"]
    (is (not (re-find #"setval" (sd/canonicalize raw))))
    (is (re-find #"CREATE TABLE foo" (sd/canonicalize raw)))))


(deftest canonicalize-preserves-function-bodies
  (let [raw (str "CREATE FUNCTION foo() RETURNS trigger AS $$\n"
                 "BEGIN\n"
                 "  PERFORM 1;\n"
                 "  RETURN NEW;\n"
                 "END;\n"
                 "$$ LANGUAGE plpgsql;\n")
        out (sd/canonicalize raw)]
    (is (re-find #"BEGIN" out))
    (is (re-find #"RETURN NEW" out))
    (is (re-find #"END" out))
    (is (re-find #"\$\$ LANGUAGE plpgsql" out))))


(deftest canonicalize-detects-real-differences
  (let [a "CREATE TABLE foo (id int);\n"
        b "CREATE TABLE foo (id bigint);\n"]
    (is (not= (sd/canonicalize a) (sd/canonicalize b)))))


(let [{:keys [fail error]} (run-tests 'test-schema-diff)]
  (System/exit (if (zero? (+ fail error)) 0 1)))

Run: bb scripts/test_schema_diff.clj Expected: Ran 5 tests containing 11 assertions. 0 failures, 0 errors. (exit 0; assertion count varies if you add tests).

If any fail, fix canonicalize until green.

Edit bb.edn. Locate the database migrations section (around migrate task). Add, after the existing db:logs task:

  db:schema-diff
  {:doc      "Diff the schema of two local databases: bb db:schema-diff --a DB-A --b DB-B"
   :requires ([schema-diff])
   :task     (apply schema-diff/-main *command-line-args*)}

Run:

docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE IF EXISTS orcha_sdtest_a;"
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE IF EXISTS orcha_sdtest_b;"
docker-compose exec -T postgres psql -U postgres -c "CREATE DATABASE orcha_sdtest_a;"
docker-compose exec -T postgres psql -U postgres -c "CREATE DATABASE orcha_sdtest_b;"
docker-compose exec -T postgres psql -U postgres -d orcha_sdtest_a -c "CREATE TABLE foo (id int, name text);"
docker-compose exec -T postgres psql -U postgres -d orcha_sdtest_b -c "CREATE TABLE foo (id int, name text);"
bb db:schema-diff --a orcha_sdtest_a --b orcha_sdtest_b

Expected: OK: schemas match (orcha_sdtest_a vs orcha_sdtest_b) with exit 0.

Run:

docker-compose exec -T postgres psql -U postgres -d orcha_sdtest_b -c "ALTER TABLE foo ALTER COLUMN id TYPE bigint;"
bb db:schema-diff --a orcha_sdtest_a --b orcha_sdtest_b
echo "Exit: $?"

Expected: DIFF: schemas differ (...) followed by a unified diff showing id int vs id bigint, exit code 1.

The two-table sanity test from Steps 5–6 uses trivial schemas. The real schema (functions with $$ bodies, triggers, ~80 tables) may surface canonicalizer gaps. Run a fresh-vs-fresh comparison:

This step depends on Task 4's bb db:fresh, so run it AFTER completing Task 4 (or skip this step now and come back). With bb db:fresh available:

bb db:fresh
docker-compose exec -T postgres psql -U postgres -c "ALTER DATABASE orcha_fresh RENAME TO orcha_fresh_a;"
bb db:fresh
docker-compose exec -T postgres psql -U postgres -c "ALTER DATABASE orcha_fresh RENAME TO orcha_fresh_b;"
bb db:schema-diff --a orcha_fresh_a --b orcha_fresh_b

Expected: OK: schemas match. Two independently-built copies of the same schema must canonicalize identically. If the diff is non-empty, the canonicalizer is missing a rule (sequence ownership, generated identity expressions, function body whitespace, etc.) — extend canonicalize with the offending pattern stripped.

Cleanup:

docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_fresh_a;"
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_fresh_b;"

Run:

docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_sdtest_a;"
docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_sdtest_b;"

Run: clj-kondo --lint scripts/schema_diff.clj scripts/test_schema_diff.clj Expected: no warnings or errors.

git add scripts/schema_diff.clj scripts/test_schema_diff.clj bb.edn
git commit -m "feat(dev): add schema diff script"

Task 4: Implement db clone helpers

Purpose: Four thin helpers that manage local DB state and surface RDS clones tagged by our script.

Files:

Create scripts/db_clone_helpers.clj:

(ns db-clone-helpers
  "Helper tasks for managing the prod-clone local database and throwaway
   AWS RDS instances used by `bb db:clone-prod`.

   Exposed as bb tasks:
     bb db:load-clone [dump-file]      Restore dump into orcha_prod_clone.
     bb db:fresh                       Recreate orcha_fresh at HEAD schema.
     bb db:drop-clone                  Drop orcha_prod_clone.
     bb db:list-clones                 List RDS throwaway clones still alive."
  (:require [babashka.process :as p]
            [clojure.java.io :as io]
            [clojure.string :as str])
  (:import [java.io File]))


(set! *warn-on-reflection* true)


(def ^:private clone-db-name "orcha_prod_clone")
(def ^:private fresh-db-name "orcha_fresh")
(def ^:private default-host "localhost")
(def ^:private default-port "5432")
(def ^:private default-user "postgres")
(def ^:private aws-profile "orcha-prod")
(def ^:private aws-region "eu-central-1")


(defn ^:private psql-admin!
  "Run a psql command against the postgres administrative database inside
   the docker-compose container. Throws ex-info on non-zero exit."
  [sql]
  (let [result (p/shell {:out :string
                         :err :string
                         :continue true}
                        "docker-compose" "exec" "-T" "postgres"
                        "psql" "-U" default-user "-d" "postgres" "-v" "ON_ERROR_STOP=1"
                        "-c" sql)]
    (when-not (zero? (:exit result))
      (throw (ex-info (format "psql failed: %s" (:err result))
                      {:sql sql :exit (:exit result)})))
    result))


(defn ^:private latest-dump-file
  "Return the newest .dump file in ./dump/ as a File, or nil if none exist."
  ^File []
  (let [dir (io/file "dump")]
    (when (.isDirectory dir)
      (->> (.listFiles dir)
           (filter #(and (.isFile ^File %)
                         (str/ends-with? (.getName ^File %) ".dump")))
           (sort-by #(.lastModified ^File %))
           last))))


(defn load-clone!
  "Drop and recreate orcha_prod_clone, then pg_restore from `dump-file`
   (defaults to the newest .dump in ./dump/).

   Expects `pg_restore` available on PATH locally (Postgres 18+ client tools)."
  [& [dump-file-arg]]
  (let [dump-file (if dump-file-arg
                    (io/file dump-file-arg)
                    (latest-dump-file))]
    (when-not (and dump-file (.exists ^File dump-file))
      (println "Error: no dump file found."
               (if dump-file-arg
                 (format "File %s does not exist." dump-file-arg)
                 "Place a .dump file under ./dump/ or pass a path."))
      (System/exit 1))
    (println (format "Dropping %s..." clone-db-name))
    (psql-admin! (format "DROP DATABASE IF EXISTS %s;" clone-db-name))
    (println (format "Creating %s..." clone-db-name))
    (psql-admin! (format "CREATE DATABASE %s;" clone-db-name))
    (println (format "Restoring from %s..." (.getPath ^File dump-file)))
    (let [result (p/shell {:out :string
                           :err :string
                           :continue true}
                          "pg_restore"
                          "--no-owner" "--no-privileges"
                          "-h" default-host "-p" default-port "-U" default-user
                          "-d" clone-db-name
                          (.getPath ^File dump-file))
          stderr (:err result)
          ;; pg_restore prints "WARNING:" lines for non-fatal issues (missing
          ;; roles, etc.) and "pg_restore: error:" for fatal ones. Exit 1 is
          ;; ambiguous; treat fatal-error lines as the real signal.
          fatal? (re-find #"(?m)^pg_restore: error:" (or stderr ""))]
      (when (or fatal? (> (:exit result) 1))
        (println stderr)
        (throw (ex-info "pg_restore failed" {:exit (:exit result)})))
      (when (seq stderr)
        (println "pg_restore warnings (non-fatal):")
        (println stderr)))
    (println (format "Done. %s is ready. App env: ORCHA_LOCAL_DB_NAME_OVERRIDE=%s"
                     clone-db-name clone-db-name))))


(defn drop-clone!
  "Drop orcha_prod_clone (no-op if missing)."
  [& _]
  (println (format "Dropping %s..." clone-db-name))
  (psql-admin! (format "DROP DATABASE IF EXISTS %s;" clone-db-name))
  (println "Done."))


(defn fresh!
  "Drop, recreate, and migrate-to-HEAD the orcha_fresh database.

   init.sql is applied implicitly by migratus when the database is empty;
   migratus picks up `resources/migrations/init.sql` and all timestamped
   migrations from the classpath."
  [& _]
  (println (format "Dropping %s..." fresh-db-name))
  (psql-admin! (format "DROP DATABASE IF EXISTS %s;" fresh-db-name))
  (println (format "Creating %s..." fresh-db-name))
  (psql-admin! (format "CREATE DATABASE %s;" fresh-db-name))
  (println "Running migrations (init.sql runs automatically via migratus :init-script)...")
  (let [result (p/shell {:extra-env {"ORCHA_LOCAL_DB_NAME_OVERRIDE" fresh-db-name}
                         :continue true}
                        "clojure" "-X:dev" "user/migratus"
                        ":command" "migrate"
                        ":args" "[]")]
    (when-not (zero? (:exit result))
      (throw (ex-info "migrate failed" {:exit (:exit result)}))))
  (println (format "Done. %s is at HEAD schema." fresh-db-name)))


(defn list-clones
  "List AWS RDS instances tagged CreatedBy=clone-prod-db. Prints identifier,
   status, and CreatedAt tag value."
  [& _]
  (let [result (p/shell {:out :string
                         :err :string
                         :continue true
                         :extra-env {"AWS_PROFILE" aws-profile}}
                        "aws" "rds" "describe-db-instances"
                        "--region" aws-region
                        "--query"
                        (str "DBInstances[?contains(TagList[?Key==`CreatedBy`].Value, `clone-prod-db`)]"
                             ".{ID:DBInstanceIdentifier,Status:DBInstanceStatus,Endpoint:Endpoint.Address}")
                        "--output" "table")]
    (if (zero? (:exit result))
      (do (println (:out result))
          (println "To delete a stray clone:")
          (println "  aws rds delete-db-instance --profile orcha-prod --region eu-central-1 \\")
          (println "    --db-instance-identifier <ID> --skip-final-snapshot --delete-automated-backups"))
      (do (println "Error:" (:err result))
          (System/exit 1)))))

Edit bb.edn, adding these four entries adjacent to the existing db:psql/db:logs/db:schema-diff tasks:

  db:load-clone
  {:doc      "Restore a prod dump into orcha_prod_clone: bb db:load-clone [dump-file]"
   :requires ([db-clone-helpers])
   :task     (apply db-clone-helpers/load-clone! *command-line-args*)}

  db:fresh
  {:doc      "Recreate orcha_fresh at HEAD schema (used as schema-diff baseline)"
   :requires ([db-clone-helpers])
   :task     (apply db-clone-helpers/fresh! *command-line-args*)}

  db:drop-clone
  {:doc      "Drop the orcha_prod_clone database"
   :requires ([db-clone-helpers])
   :task     (apply db-clone-helpers/drop-clone! *command-line-args*)}

  db:list-clones
  {:doc      "List AWS RDS throwaway clones tagged by clone-prod-db"
   :requires ([db-clone-helpers])
   :task     (apply db-clone-helpers/list-clones *command-line-args*)}

Run: bb db:fresh Expected: output ending in Done. orcha_fresh is at HEAD schema., no errors.

Verify the DB exists and has tables:

docker-compose exec -T postgres psql -U postgres -d orcha_fresh -c "\dt"

Expected: lists tables (ap_*, document, tenant, etc.).

Create a tiny synthetic dump to exercise the helper without needing prod data:

docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE IF EXISTS orcha_loadtest_src;"
docker-compose exec -T postgres psql -U postgres -c "CREATE DATABASE orcha_loadtest_src;"
docker-compose exec -T postgres psql -U postgres -d orcha_loadtest_src -c "CREATE TABLE foo (id int); INSERT INTO foo VALUES (1), (2), (3);"
mkdir -p dump
pg_dump -Fc -h localhost -p 5432 -U postgres orcha_loadtest_src -f dump/loadtest.dump
bb db:load-clone dump/loadtest.dump
docker-compose exec -T postgres psql -U postgres -d orcha_prod_clone -c "SELECT count(*) FROM foo;"

Expected: output shows count = 3.

Run: bb db:drop-clone Expected: Done.

Verify it's gone:

docker-compose exec -T postgres psql -U postgres -c "SELECT 1 FROM pg_database WHERE datname = 'orcha_prod_clone';" | head

Expected: (0 rows).

docker-compose exec -T postgres psql -U postgres -c "DROP DATABASE orcha_loadtest_src;"
rm dump/loadtest.dump

Run: bb db:list-clones Expected: either prints an empty table (no active clones — normal) and the delete hint, or lists any currently-alive throwaway clones. No error.

Run: clj-kondo --lint scripts/db_clone_helpers.clj Expected: no warnings or errors.

git add scripts/db_clone_helpers.clj bb.edn
git commit -m "feat(dev): add db clone helper tasks"

Task 5: Implement bb db:clone-prod

Purpose: The main cloning script. Finds the most recent RDS snapshot of v1-orcha-db (or creates a fresh one if stale), restores it to a throwaway db.t4g.small instance, port-forwards via SSM through v1-orcha-app, pg_dumps into dump/prod-<timestamp>.dump, and deletes the throwaway — with a shutdown hook so teardown happens even on crash/SIGINT.

Files:

Create scripts/clone_prod_db.clj:

(ns clone-prod-db
  "Clone the prod RDS database to a local dump file via a throwaway RDS
   instance restored from a recent snapshot.

   Flow:
     1. Find the most recent snapshot of v1-orcha-db (create one if the
        newest is older than the freshness threshold).
     2. Restore to v1-orcha-db-clone-<timestamp> (db.t4g.small, no backups,
        same VPC/SG so v1-orcha-app can reach it).
     3. Wait until available.
     4. Start SSM port-forward to the clone's endpoint via v1-orcha-app.
     5. pg_dump -Fc -Z1 into dump/prod-<timestamp>.dump.
     6. Delete the throwaway instance (also runs on abnormal exit).

   Usage:
     bb db:clone-prod [--fresh-snapshot]
                      [--freshness-hours N]
                      [--instance-class db.t4g.small]
                      [--skip-restore]"
  (:require [babashka.process :as p]
            [cheshire.core :as json]
            [clojure.java.io :as io]
            [clojure.string :as str])
  (:import [java.time Instant]
           [java.time.temporal ChronoUnit]))


(set! *warn-on-reflection* true)


(def ^:private aws-profile "orcha-prod")
(def ^:private aws-region "eu-central-1")
(def ^:private prod-db-identifier "v1-orcha-db")
(def ^:private app-instance-name "v1-orcha-app")
(def ^:private local-forward-port "25432")
(def ^:private default-instance-class "db.t4g.small")
(def ^:private default-freshness-hours 24)


;; ---------------------------------------------------------------------------
;; Pure helpers
;; ---------------------------------------------------------------------------

(defn parse-args
  "Parse command-line args into an options map with defaults."
  [args]
  (loop [args args
         opts {:fresh-snapshot?  false
               :freshness-hours  default-freshness-hours
               :instance-class   default-instance-class
               :skip-restore?    false}]
    (if (empty? args)
      opts
      (let [[arg & rest-args] args]
        (case arg
          "--fresh-snapshot"  (recur rest-args (assoc opts :fresh-snapshot? true))
          "--freshness-hours" (recur (rest rest-args)
                                     (assoc opts :freshness-hours
                                            (Long/parseLong (first rest-args))))
          "--instance-class"  (recur (rest rest-args)
                                     (assoc opts :instance-class (first rest-args)))
          "--skip-restore"    (recur rest-args (assoc opts :skip-restore? true))
          "--help"            (assoc opts :help true)
          (recur rest-args opts))))))


(defn clone-identifier
  "Deterministic throwaway identifier: v1-orcha-db-clone-<yyyyMMddHHmmss>."
  [^Instant now]
  (let [fmt (-> (java.time.format.DateTimeFormatter/ofPattern "yyyyMMddHHmmss")
                (.withZone (java.time.ZoneId/of "UTC")))]
    (str prod-db-identifier "-clone-" (.format fmt now))))


(defn snapshot-fresh-enough?
  "True if `snapshot-create-time` (ISO-8601 string) is within `max-age-hours`
   of `now`."
  [snapshot-create-time ^Instant now max-age-hours]
  (let [created (Instant/parse snapshot-create-time)
        age-hours (.between ChronoUnit/HOURS created now)]
    (<= age-hours max-age-hours)))


;; ---------------------------------------------------------------------------
;; AWS primitives (side-effectful)
;; ---------------------------------------------------------------------------

(defn ^:private aws-cmd
  "Run an aws CLI command with the prod profile and region.
   Returns {:out ... :err ... :exit N} without throwing."
  [& args]
  (apply p/shell
         {:out :string
          :err :string
          :continue true
          :extra-env {"AWS_PROFILE" aws-profile}}
         "aws"
         (concat args ["--region" aws-region])))


(defn ^:private find-latest-snapshot
  "Return {:identifier ... :create-time ...} for the newest available
   snapshot of `prod-db-identifier`, or nil if none."
  []
  (let [result (aws-cmd "rds" "describe-db-snapshots"
                        "--db-instance-identifier" prod-db-identifier
                        "--query" "reverse(sort_by(DBSnapshots[?Status==`available`], &SnapshotCreateTime))[0].{ID:DBSnapshotIdentifier,Time:SnapshotCreateTime,Type:SnapshotType}"
                        "--output" "json")]
    (when (zero? (:exit result))
      (let [parsed (json/parse-string (:out result) true)]
        (when (:ID parsed)
          {:identifier (:ID parsed)
           :create-time (:Time parsed)
           :type (:Type parsed)})))))


(defn ^:private create-manual-snapshot!
  "Create and wait for a fresh manual snapshot. Returns the identifier."
  [^Instant now]
  (let [id (str prod-db-identifier "-clone-src-"
                (.format (.withZone (java.time.format.DateTimeFormatter/ofPattern
                                     "yyyyMMddHHmmss")
                                    (java.time.ZoneId/of "UTC"))
                         now))]
    (println (format "Creating fresh manual snapshot: %s" id))
    (let [r (aws-cmd "rds" "create-db-snapshot"
                     "--db-instance-identifier" prod-db-identifier
                     "--db-snapshot-identifier" id)]
      (when-not (zero? (:exit r))
        (throw (ex-info (str "create-db-snapshot failed: " (:err r)) {}))))
    (println "Waiting for snapshot to become available (this can take 5-15 minutes)...")
    (let [r (aws-cmd "rds" "wait" "db-snapshot-available"
                     "--db-snapshot-identifier" id)]
      (when-not (zero? (:exit r))
        (throw (ex-info (str "snapshot wait failed: " (:err r)) {}))))
    id))


(defn ^:private subnet-group-name
  "Look up the subnet group name of the prod instance (needed to restore
   into the same subnets)."
  []
  (let [r (aws-cmd "rds" "describe-db-instances"
                   "--db-instance-identifier" prod-db-identifier
                   "--query" "DBInstances[0].DBSubnetGroup.DBSubnetGroupName"
                   "--output" "text")]
    (when-not (zero? (:exit r))
      (throw (ex-info "failed to look up subnet group" {:err (:err r)})))
    (str/trim (:out r))))


(defn ^:private security-group-ids
  "Look up the VPC SG IDs attached to the prod instance."
  []
  (let [r (aws-cmd "rds" "describe-db-instances"
                   "--db-instance-identifier" prod-db-identifier
                   "--query" "DBInstances[0].VpcSecurityGroups[].VpcSecurityGroupId"
                   "--output" "text")]
    (when-not (zero? (:exit r))
      (throw (ex-info "failed to look up security groups" {:err (:err r)})))
    (str/split (str/trim (:out r)) #"\s+")))


(defn ^:private restore-snapshot!
  "Restore `snapshot-id` to `clone-id`. Blocks until available."
  [snapshot-id clone-id instance-class timestamp-str]
  (let [sg-ids (security-group-ids)
        sg-args (mapcat (fn [sg] ["--vpc-security-group-ids" sg]) sg-ids)
        args (concat ["rds" "restore-db-instance-from-db-snapshot"
                      "--db-snapshot-identifier" snapshot-id
                      "--db-instance-identifier" clone-id
                      "--db-instance-class" instance-class
                      "--db-subnet-group-name" (subnet-group-name)
                      "--no-publicly-accessible"
                      "--no-multi-az"
                      "--no-auto-minor-version-upgrade"
                      "--storage-type" "gp3"
                      "--no-deletion-protection"
                      "--backup-retention-period" "0"
                      "--tags"
                      (format "Key=CloneOf,Value=%s" prod-db-identifier)
                      "Key=CreatedBy,Value=clone-prod-db"
                      (format "Key=CreatedAt,Value=%s" timestamp-str)]
                     sg-args)]
    (println (format "Restoring snapshot %s to %s (%s)..." snapshot-id clone-id instance-class))
    (let [r (apply aws-cmd args)]
      (when-not (zero? (:exit r))
        (throw (ex-info (str "restore failed: " (:err r)) {}))))
    (println "Waiting for clone to become available (~10-20 minutes)...")
    (let [r (aws-cmd "rds" "wait" "db-instance-available"
                     "--db-instance-identifier" clone-id)]
      (when-not (zero? (:exit r))
        (throw (ex-info (str "wait failed: " (:err r)) {}))))))


(defn ^:private clone-endpoint
  [clone-id]
  (let [r (aws-cmd "rds" "describe-db-instances"
                   "--db-instance-identifier" clone-id
                   "--query" "DBInstances[0].Endpoint.Address"
                   "--output" "text")]
    (when-not (zero? (:exit r))
      (throw (ex-info "endpoint lookup failed" {:err (:err r)})))
    (str/trim (:out r))))


(defn ^:private master-secret-json
  "Fetch the prod DB credentials JSON from SSM (same source config uses).

   Restored clones inherit the master user/password from the snapshot, so
   the prod credentials work against the clone."
  []
  (let [r (aws-cmd "ssm" "get-parameter"
                   "--name" "/v1-orcha/db-credentials"
                   "--with-decryption"
                   "--query" "Parameter.Value"
                   "--output" "text")]
    (when-not (zero? (:exit r))
      (throw (ex-info "credentials fetch failed" {:err (:err r)})))
    (str/trim (:out r))))


(defn ^:private ec2-instance-id
  "Look up the running v1-orcha-app EC2 instance ID."
  []
  (let [r (aws-cmd "ec2" "describe-instances"
                   "--filters"
                   (format "Name=tag:Name,Values=%s" app-instance-name)
                   "Name=instance-state-name,Values=running"
                   "--query" "Reservations[0].Instances[0].InstanceId"
                   "--output" "text")]
    (when-not (zero? (:exit r))
      (throw (ex-info "ec2 lookup failed" {:err (:err r)})))
    (let [id (str/trim (:out r))]
      (when (or (str/blank? id) (= id "None"))
        (throw (ex-info (format "No running %s instance found" app-instance-name) {})))
      id)))


(defn ^:private start-port-forward!
  "Start an SSM session that tunnels local-forward-port → rds-endpoint:5432."
  [ec2-id rds-endpoint]
  (println (format "Starting SSM port-forward: localhost:%s → %s:5432" local-forward-port rds-endpoint))
  (p/process {:extra-env {"AWS_PROFILE" aws-profile}
              :out :string
              :err :string}
             "aws" "ssm" "start-session"
             "--target" ec2-id
             "--document-name" "AWS-StartPortForwardingSessionToRemoteHost"
             "--parameters" (format "{\"host\":[\"%s\"],\"portNumber\":[\"5432\"],\"localPortNumber\":[\"%s\"]}"
                                    rds-endpoint local-forward-port)
             "--region" aws-region))


(defn ^:private wait-for-port
  [port timeout-ms]
  (let [start (System/currentTimeMillis)]
    (loop []
      (if (> (- (System/currentTimeMillis) start) timeout-ms)
        false
        (if (try
              (with-open [_ (java.net.Socket. "localhost" (Integer/parseInt port))]
                true)
              (catch java.net.ConnectException _ false))
          true
          (do (Thread/sleep 1000) (recur)))))))


(defn ^:private pg-dump!
  [creds-json dump-path]
  (let [{:keys [username password dbname]}
        (json/parse-string creds-json true)]
    (println (format "Running pg_dump to %s..." dump-path))
    (let [r (p/shell {:extra-env {"PGPASSWORD" password}
                      :continue true}
                     "pg_dump" "-Fc" "-Z1"
                     "--no-owner" "--no-privileges"
                     "-h" "localhost"
                     "-p" local-forward-port
                     "-U" username
                     "-f" dump-path
                     dbname)]
      (when-not (zero? (:exit r))
        (throw (ex-info "pg_dump failed" {:exit (:exit r)}))))))


(defn ^:private delete-clone!
  "Delete the throwaway RDS clone. Retries while the instance is still in
   `creating` state, which AWS rejects with InvalidDBInstanceState."
  [clone-id]
  (println (format "Deleting throwaway clone %s..." clone-id))
  (loop [attempt 1]
    (let [r (aws-cmd "rds" "delete-db-instance"
                     "--db-instance-identifier" clone-id
                     "--skip-final-snapshot"
                     "--delete-automated-backups")]
      (cond
        (zero? (:exit r))
        (println (format "Deletion initiated for %s." clone-id))

        (and (re-find #"InvalidDBInstanceState" (or (:err r) ""))
             (< attempt 30))
        (do (println (format "Clone still in transient state (attempt %d/30); waiting 30s..." attempt))
            (Thread/sleep 30000)
            (recur (inc attempt)))

        :else
        (do (println (format "WARNING: delete-db-instance failed for %s: %s"
                             clone-id (:err r)))
            (println (format "Manual cleanup required: bb db:list-clones && \\"))
            (println (format "  aws rds delete-db-instance --profile orcha-prod \\"))
            (println (format "    --region eu-central-1 --db-instance-identifier %s \\" clone-id))
            (println "    --skip-final-snapshot --delete-automated-backups"))))))


(defn ^:private ensure-cleanup-hook!
  "Register a JVM shutdown hook that attempts to delete the clone. Safe
   against double-invocation via an atom guard."
  [clone-id deleted?-atom]
  (.addShutdownHook (Runtime/getRuntime)
                    (Thread.
                     (fn []
                       (when (compare-and-set! deleted?-atom false true)
                         (println)
                         (println "Shutdown hook: cleaning up throwaway clone...")
                         (delete-clone! clone-id))))))


;; ---------------------------------------------------------------------------
;; Orchestration
;; ---------------------------------------------------------------------------

(defn print-usage
  []
  (println "Usage: bb db:clone-prod [options]")
  (println)
  (println "Options:")
  (println "  --fresh-snapshot       Always take a new manual snapshot (skip reuse check).")
  (println "  --freshness-hours N    Reuse newest snapshot if younger than N hours (default 24).")
  (println "  --instance-class C     Throwaway instance class (default db.t4g.small).")
  (println "  --skip-restore         Debug: print what would run, don't create anything.")
  (println "  --help"))


(defn -main
  [& args]
  (let [opts (parse-args args)]
    (when (:help opts) (print-usage) (System/exit 0))
    (let [now          (Instant/now)
          ts           (.format (.withZone (java.time.format.DateTimeFormatter/ofPattern
                                            "yyyyMMdd-HHmmss")
                                           (java.time.ZoneId/of "UTC"))
                                now)
          clone-id     (clone-identifier now)
          dump-path    (format "dump/prod-%s.dump" ts)]

      (println "Prod DB clone — session" ts)
      (println "Throwaway clone identifier:" clone-id)
      (println "Target dump file:         " dump-path)
      (println)

      (when (:skip-restore? opts)
        (println "[--skip-restore] exiting before any AWS side-effects.")
        (System/exit 0))

      ;; Step 1: pick or create a snapshot
      (let [existing (when-not (:fresh-snapshot? opts) (find-latest-snapshot))
            snapshot-id (if (and existing
                                 (snapshot-fresh-enough?
                                   (:create-time existing) now (:freshness-hours opts)))
                          (do (println (format "Reusing snapshot %s (%s, type=%s)"
                                               (:identifier existing)
                                               (:create-time existing)
                                               (:type existing)))
                              (:identifier existing))
                          (create-manual-snapshot! now))]

        ;; Step 2: restore + register cleanup
        ;; Hook is registered BEFORE the restore call so that a SIGINT during
        ;; the restore wait still triggers cleanup. delete-db-instance against
        ;; an instance that didn't quite materialize is a no-op + log line.
        (let [deleted? (atom false)]
          (ensure-cleanup-hook! clone-id deleted?)
          (try
            (restore-snapshot! snapshot-id clone-id (:instance-class opts) ts)

            ;; Step 3: port-forward
            (let [ec2-id     (ec2-instance-id)
                  endpoint   (clone-endpoint clone-id)
                  forward    (start-port-forward! ec2-id endpoint)]
              (try
                (if (wait-for-port local-forward-port 30000)
                  (println "Port forward established.")
                  (do
                    ;; Surface SSM stderr — a missing session-manager-plugin
                    ;; or expired SSO token is the usual cause and the
                    ;; default 30s timeout otherwise gives no signal.
                    (when-let [err (some-> @forward :err)]
                      (println "SSM stderr:")
                      (println err))
                    (throw (ex-info "port-forward did not come up in 30s"
                                    {:ec2-id ec2-id :endpoint endpoint}))))
                ;; Step 4: dump
                (let [creds (master-secret-json)]
                  (io/make-parents (io/file dump-path))
                  (pg-dump! creds dump-path)
                  (println (format "Dump complete: %s" dump-path)))
                (finally
                  (p/destroy forward))))

            (finally
              ;; Step 5: teardown (also covered by shutdown hook for abnormal exit)
              (when (compare-and-set! deleted? false true)
                (delete-clone! clone-id)))))))))

Create scripts/test_clone_prod_db.clj:

(ns test-clone-prod-db
  "Unit tests for pure helpers in clone-prod-db. Run via:
     bb scripts/test_clone_prod_db.clj"
  (:require [clojure.test :as t :refer [deftest is run-tests testing]]
            [clone-prod-db :as cpd])
  (:import [java.time Instant]))


(deftest parse-args-defaults
  (let [o (cpd/parse-args [])]
    (is (false? (:fresh-snapshot? o)))
    (is (= 24 (:freshness-hours o)))
    (is (= "db.t4g.small" (:instance-class o)))
    (is (false? (:skip-restore? o)))))


(deftest parse-args-all-flags
  (let [o (cpd/parse-args ["--fresh-snapshot"
                            "--freshness-hours" "6"
                            "--instance-class" "db.t4g.medium"
                            "--skip-restore"])]
    (is (true? (:fresh-snapshot? o)))
    (is (= 6 (:freshness-hours o)))
    (is (= "db.t4g.medium" (:instance-class o)))
    (is (true? (:skip-restore? o)))))


(deftest clone-identifier-is-deterministic-from-now
  (let [now (Instant/parse "2026-04-24T09:38:39Z")
        id (cpd/clone-identifier now)]
    (is (= "v1-orcha-db-clone-20260424093839" id))))


(deftest snapshot-fresh-enough?
  (let [now (Instant/parse "2026-04-24T12:00:00Z")]
    (testing "1h old is fresh (default threshold 24h)"
      (is (true? (cpd/snapshot-fresh-enough? "2026-04-24T11:00:00Z" now 24))))
    (testing "23h old is fresh"
      (is (true? (cpd/snapshot-fresh-enough? "2026-04-23T13:00:00Z" now 24))))
    (testing "25h old is stale"
      (is (false? (cpd/snapshot-fresh-enough? "2026-04-23T10:59:00Z" now 24))))
    (testing "custom threshold 6h — 5h fresh"
      (is (true? (cpd/snapshot-fresh-enough? "2026-04-24T07:00:00Z" now 6))))
    (testing "custom threshold 6h — 7h stale"
      (is (false? (cpd/snapshot-fresh-enough? "2026-04-24T05:00:00Z" now 6))))))


(let [{:keys [fail error]} (run-tests 'test-clone-prod-db)]
  (System/exit (if (zero? (+ fail error)) 0 1)))

Run: bb scripts/test_clone_prod_db.clj Expected: all tests pass, exit 0.

Register the bb task. Edit bb.edn:

  db:clone-prod
  {:doc      "Clone prod RDS to a local dump file via a throwaway RDS instance"
   :requires ([clone-prod-db])
   :task     (apply clone-prod-db/-main *command-line-args*)}

Run: bb db:clone-prod --help Expected: prints the usage block, exits 0.

Run: bb db:clone-prod --skip-restore Expected: prints the session header + throwaway identifier + dump path, then exits with [--skip-restore] exiting before any AWS side-effects.

Run: clj-kondo --lint scripts/clone_prod_db.clj scripts/test_clone_prod_db.clj Expected: no warnings or errors.

git add scripts/clone_prod_db.clj scripts/test_clone_prod_db.clj bb.edn
git commit -m "feat(dev): add prod DB clone script"

The real AWS execution — take snapshot, restore, dump, teardown — is validated in Task 7 against the rename migration. Don't run it here; it's slow and costs money.


Task 6: Write the runbook

Purpose: A single page a human or agent can follow top-to-bottom before a big migration deploy. References the scripts we just built.

Files:

Create docs/runbooks/prod-clone-refactor-testing.md:

# Prod-Clone Refactor Testing

Pre-deploy validation procedure for large, schema-touching refactors. Restores
a clone of prod Postgres locally, applies pending migrations, verifies the
resulting schema matches a fresh-from-init baseline, and exercises ingestion
and UI against the migrated clone.

**Design spec:** `docs/superpowers/specs/2026-04-24-prod-clone-refactor-testing-design.md`

## When to use

Run before deploying any migration that:

- Renames tables, columns, constraints, indexes, or types.
- Modifies trigger function bodies or JSONB path consumers.
- Makes `NOT NULL` a previously-nullable column.
- Changes FK topology.
- Anything where a successful `bb migrate migrate` against init.sql wouldn't
  catch data-shape issues.

**Not** needed for additive migrations (new columns/tables) that are covered
by unit/integration tests and don't touch live data layout.

## Prerequisites

- AWS SSO login: `aws sso login --profile orcha-prod`
- docker-compose stack running: `bb dev:up && bb dev:seed`
- Postgres 18+ client tools on PATH (`pg_dump`, `pg_restore`, `psql`).
  The host client major version MUST be ≥ the docker-compose Postgres
  server version (currently `pgvector/pgvector:pg18` per `docker-compose.yml`).
  Verify both:

pg_dump --version # → 18.x docker-compose exec -T postgres psql -U postgres -tAc "SHOW server_version;" # → 18.x

Mismatch will cause `pg_restore` to fail loading the prod-shaped dump.
- AWS Session Manager plugin installed (`session-manager-plugin --version`
must succeed). Without it, the SSM port-forward step silently hangs.
- Branch with the migration under test checked out

## Procedure

### Step 1 — Clone prod to a dump file

Unattended, ~30–45 min (longer if a fresh snapshot needs to be taken).

bb db:clone-prod


Produces `dump/prod-<timestamp>.dump`. Watch output for the throwaway clone
identifier — if the script dies unexpectedly, that's what you need to delete
manually (see Step 9).

**Flags:**
- `--fresh-snapshot` — skip snapshot reuse; always create a new one.
- `--freshness-hours N` — reuse snapshots up to N hours old (default 24).

### Step 2 — Load the dump into local Postgres

~2–5 min.

bb db:load-clone


Creates `orcha_prod_clone` in local docker-compose Postgres, restored from
the newest `dump/*.dump` file.

### Step 3 — Sanity check: current master boots against pre-migration clone

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev


At the REPL: `(integrant.repl/go)` — must succeed. Confirms the clone is
usable and that current master is healthy against prod schema. Exit the REPL.

### Step 4 — Apply pending migrations on the clone

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate


All pending migrations apply cleanly against prod data.

### Step 5 — Schema assertion (the gate)

bb db:fresh bb db:schema-diff --a orcha_prod_clone --b orcha_fresh


`bb db:fresh` (re)creates `orcha_fresh` at HEAD schema. `bb db:schema-diff`
exits 0 on empty diff. Any diff means the migration produces a schema that
diverges from what init.sql + migrations produce — i.e., a migration bug.
Iterate until exit 0.

### Step 6 — Ingestion smoke (programmatic)

Start the app pointed at the clone:

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev


At the REPL: `(integrant.repl/go)`. In another terminal, list available
fixtures and ingest a couple. Don't reference prod document IDs — these
should be fresh test PDFs, since prod S3 objects aren't local.

ls test/fixtures/ # discover what's available bb ingest test/fixtures/.pdf bb ingest test/fixtures/.pdf


Assert each document reaches a terminal state (processed/failed) without
errors. Watch the REPL logs for:

- Queries referencing renamed columns (`legal_entity_id`, etc.)
- Malli decode errors
- Trigger-function payload key mismatches (`:legal-entity-id` vs `:tenant-id`)
- SES/notification consumer key mismatches
- FK violations

### Step 7 — UI smoke (manual)

With the app still running against the clone, walk through:

- Document list — loads and paginates
- Open a document → view renders without errors
- Matching screen — loads a cluster
- Tenants admin panel (`/tenants`) — loads, shows renamed entities
- Organizations admin panel (`/organizations`) — loads

Watch browser devtools and REPL logs for 500s.

### Step 8 — Iterate on failure

If any step fails:

1. Fix code or migration in the repo.
2. `bb db:load-clone` — resets `orcha_prod_clone` from the cached dump
   file (~3 min, no re-clone of prod needed).
3. Repeat from Step 4.

### Step 9 — Cleanup

bb db:drop-clone bb db:list-clones # must be empty rm dump/prod-.dump # manually; contains PII


`bb db:list-clones` surfaces any leaked throwaway RDS instances (should not
happen under normal conditions — the script has a shutdown hook — but run
this as a belt-and-braces check).

## What this does NOT cover

- Data-path bugs that require specific prod S3 objects. Ingestion uses
  fresh local test documents only.
- Load or performance testing — throwaway clone is cost-optimized.
- Prod deploy race conditions (concurrent writers during migration).
  Mitigation: deploy in low-traffic windows.
- Integration side effects (DATEV/SAP/Outlook) beyond what ingestion
  incidentally triggers.

## Troubleshooting

**Clone restore hangs past 25 minutes.** The AWS wait timeout may have been
exceeded. `bb db:list-clones` — if the instance exists and is `available`,
restart the script; if it's `creating`, keep waiting (larger snapshots take
longer).

**SSM port-forward fails with "target not connected".** Run
`aws sso login --profile orcha-prod` and retry.

**`pg_dump` fails with `password authentication failed`.** The script
fetches the master password from SSM (`/v1-orcha/db-credentials`), and
restored snapshots inherit the master password as it was AT SNAPSHOT TIME.
If the prod password rotated between the snapshot and the restore, those
won't match. Resolutions:
- Use `--fresh-snapshot` so the snapshot reflects the current password.
- Or pass `--master-user-password <known>` to
  `restore-db-instance-from-db-snapshot` and use that known password
  (script enhancement; not currently supported).

**`pg_restore` warnings about missing roles.** Expected; `--no-owner
--no-privileges` skips ownership. Exit code 1 with warnings is treated as
success by `bb db:load-clone`.

**Schema diff shows expected vs actual differences you think are fine.**
Add a canonicalizer rule in `scripts/schema_diff.clj`. Review the diff
carefully — "noise" is often a subtle bug.

**`bb db:list-clones` shows a stray instance you don't recognize.**
Delete it:

aws rds delete-db-instance --profile orcha-prod --region eu-central-1
--db-instance-identifier --skip-final-snapshot --delete-automated-backups

git add docs/runbooks/prod-clone-refactor-testing.md
git commit -m "docs(runbooks): add prod-clone refactor testing runbook"

Task 7: End-to-end validation against the rename migration

⚠️ USER-ONLY — DO NOT AUTO-EXECUTE.

This task is the first real use of the procedure and is the only task in this plan that performs AWS mutations against the orcha-prod account: rds create-db-snapshot, rds restore-db-instance-from-db-snapshot, ssm start-session to v1-orcha-app, and rds delete-db-instance --skip-final-snapshot --delete-automated-backups. It also pulls prod master DB credentials to the local machine via ssm get-parameter --with-decryption and writes a dump file containing production PII to dump/.

Agents executing this plan must STOP after Task 6 and hand control back to the user. The user runs Task 7 interactively, end-to-end, and confirms each step's outcome. Do not invoke bb db:clone-prod (without --skip-restore), bb db:load-clone, bb db:fresh, bb db:schema-diff, bb db:drop-clone, or any of the AWS CLI commands listed below on the user's behalf.

Purpose: Actually run the full procedure against the legal_entitytenant rename migration and confirm it catches real issues (or gives a clean pass). This is both the first real use and the acceptance test for the procedure itself.

Files: none modified — just running scripts and documenting findings.

This task is run interactively by the user and its results inform whether the procedure is ready. Not a clean TDD loop; more of a dress rehearsal.

aws sso login --profile orcha-prod
bb dev:up
bb dev:seed
docker-compose ps
pg_dump --version   # must be 18+

Expected: aws sso succeeds, containers running, pg_dump (PostgreSQL) 18.x.

git log --oneline -1

Expected: current HEAD is the tip of the rename work (the latest fix(tenant-rename): ... commit or later).

bb db:clone-prod

Runs unattended. When complete, note the dump path printed at the end.

bb db:list-clones

Expected: empty table (or only shows the one being deleted, status deleting).

bb db:load-clone

Expected: succeeds with "Done. orcha_prod_clone is ready."

This is the canary: the clone contains prod data pre-rename (i.e., still has legal_entity tables). But our current master expects the renamed tables. So this step is expected to FAIL if current HEAD is post-rename code.

If HEAD is post-rename code, skip this step. If you want this canary, check out a pre-rename commit and try it.

For this validation run, note that this step is N/A (current HEAD is post-rename).

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone bb migrate migrate

Expected: the rename migration applies cleanly. Exit 0.

bb db:fresh
bb db:schema-diff --a orcha_prod_clone --b orcha_fresh

Expected: exit 0, "schemas match". If any diff appears, stop and investigate — it represents a real defect in either the migration or init.sql.

ORCHA_LOCAL_DB_NAME_OVERRIDE=orcha_prod_clone clj -M:dev

At the REPL:

(integrant.repl/go)

In a second terminal:

bb ingest test/fixtures/01-invoice.pdf

(Use whatever invoice fixtures actually exist under test/fixtures/ — list them first with ls test/fixtures/.)

Watch REPL logs for any legal-entity-id residual query, Malli decode error, or 500-level response.

Expected: ingestion completes; document moves to a terminal state; no errors referencing old-schema identifiers.

With the app running, open http://localhost:3000 (or whatever the app's local URL is — check bb.edn / README) and click through the checklist in docs/runbooks/prod-clone-refactor-testing.md Step 7.

bb db:drop-clone
bb db:list-clones       # empty
rm dump/prod-*.dump      # after cleanup confirmed

If anything surprising came up (schema diff issues, ingestion errors, UI 500s, throwaway-instance cleanup problems), file them as issues on the rename branch and add canonicalizer rules / plan steps as needed. If nothing surprising came up, the procedure has passed its first real test.

No commit required for this task unless script fixes were made.


Self-review checklist (post-plan)

Before handing this plan to execution, run through:

  1. Every task has a Purpose, Files, and at least one verification step.
  2. No TBD / TODO / handle edge cases placeholders. Grep the plan for those strings and fix any hits.
  3. Every code block shows runnable code, not sketches. Scroll through — any ...s?
  4. Function / task / flag names are consistent across tasks. The env var is always ORCHA_LOCAL_DB_NAME_OVERRIDE; the clone DB is always orcha_prod_clone; the fresh DB is always orcha_fresh; the task names always match between bb.edn and the doc.
  5. The procedure the plan builds matches the spec's stated flow. Cross-check Step-by-step in Task 6 runbook against the "Data flow" section of docs/superpowers/specs/2026-04-24-prod-clone-refactor-testing-design.md.