Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Search Integration Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

CRITICAL: Read orcha/CLAUDE.md at the start of every task. Follow all conventions strictly.

Goal: Replace pg_trgm booking history lookup with hybrid semantic search (BM25 + Google embeddings + RRF).

Architecture: New com.getorcha.search module with embed and search public functions. ERP calls embed after booking history import, Workers call search during post-processing. Curation stays in Workers.

Tech Stack: pgvector, Google Vertex AI (text-multilingual-embedding-002), core.async 1.9.x, PostgreSQL full-text search.


Task 1: Upgrade core.async and Add pgvector Dependency

Files:

Step 1: Update deps.edn

Add pgvector Java client and upgrade core.async:

;; In :deps map, update:
org.clojure/core.async {:mvn/version "1.9.829"}

;; Add:
com.pgvector/pgvector {:mvn/version "0.1.6"}

Step 2: Verify deps resolve

Run: clj -Stree | grep -E "(core.async|pgvector)"

Expected:

org.clojure/core.async 1.9.829
com.pgvector/pgvector 0.1.6

Step 3: Commit

git add deps.edn
git commit -m "deps: upgrade core.async to 1.9.x, add pgvector"

Task 2: Database Migration - pgvector Extension and Columns

Files:

Step 1: Create migration

Run: bb migrate create "add-search-columns"

Step 2: Write up migration

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Add searchable text (generated column for BM25)
ALTER TABLE booking_history_item
  ADD COLUMN searchable_text TEXT
    GENERATED ALWAYS AS (supplier_name || ' | ' || description) STORED;

-- Add embedding column (768 dimensions for Google's text-multilingual-embedding-002)
ALTER TABLE booking_history_item
  ADD COLUMN embedding vector(768);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX booking_history_item_embedding_idx
  ON booking_history_item
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- GIN index for BM25 full-text search
CREATE INDEX booking_history_item_searchable_text_idx
  ON booking_history_item
  USING gin (to_tsvector('simple', searchable_text));

Step 3: Write down migration

DROP INDEX IF EXISTS booking_history_item_searchable_text_idx;
DROP INDEX IF EXISTS booking_history_item_embedding_idx;
ALTER TABLE booking_history_item DROP COLUMN IF EXISTS embedding;
ALTER TABLE booking_history_item DROP COLUMN IF EXISTS searchable_text;
-- Note: Don't drop the vector extension as other tables may use it

Step 4: Run migration locally

Run: clj -M:dev -m com.getorcha.db.migrate

Or start the system: clj -M:dev and migrations run automatically.

Step 5: Verify columns exist

Run: psql -h localhost -U postgres -d orcha -c "\d booking_history_item"

Expected: See searchable_text and embedding columns.

Step 6: Commit

git add resources/migrations/*-add-search-columns.*
git commit -m "feat(db): add searchable_text and embedding columns for semantic search"

Task 3: Create search.rrf Namespace

Files:

Step 1: Write the failing test

(ns com.getorcha.search.rrf-test
  (:require [clojure.test :refer [deftest is testing]]
            [com.getorcha.search.rrf :as rrf]))


(deftest fuse-test
  (testing "document in both lists gets boosted"
    (let [bm25-results [{:id 1 :rank 1} {:id 2 :rank 2}]
          semantic-results [{:id 2 :rank 1} {:id 3 :rank 2}]
          fused (rrf/fuse 60 bm25-results semantic-results)]
      ;; id=2 appears in both lists: 1/(60+2) + 1/(60+1) = higher score
      ;; id=1 appears only in bm25: 1/(60+1)
      ;; id=3 appears only in semantic: 1/(60+2)
      (is (= [2 1 3] (mapv :id fused)))))

  (testing "preserves row data from first occurrence"
    (let [bm25-results [{:id 1 :rank 1 :extra "bm25-data"}]
          semantic-results [{:id 1 :rank 1 :extra "semantic-data"}]
          fused (rrf/fuse 60 bm25-results semantic-results)]
      (is (= "bm25-data" (:extra (first fused))))))

  (testing "empty lists return empty"
    (is (empty? (rrf/fuse 60 [] [])))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.search.rrf-test]'

Expected: Fail - namespace not found.

Step 3: Write implementation

(ns com.getorcha.search.rrf
  "Reciprocal Rank Fusion for combining multiple ranked result lists.

   RRF score for document d = Σ 1/(k + rank_in_list)

   Reference: Cormack et al. 2009")


(defn fuse
  "Combines multiple ranked result lists using Reciprocal Rank Fusion.

   rrf-k        - constant from RRF paper, typically 60
   result-lists - varargs of result sequences, each with :id and :rank keys

   Returns results sorted by RRF score descending, with :rrf-score added."
  [rrf-k & result-lists]
  (let [id->entries (reduce
                      (fn [acc results]
                        (reduce
                          (fn [acc {:keys [id rank] :as row}]
                            (update acc id (fnil conj []) {:rank rank :row row}))
                          acc
                          results))
                      {}
                      result-lists)]
    (->> id->entries
         (map (fn [[id entries]]
                (let [rrf-score (reduce + (map #(/ 1.0 (+ rrf-k (:rank %))) entries))
                      row (:row (first entries))]
                  (assoc row :rrf-score rrf-score))))
         (sort-by :rrf-score >))))

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.search.rrf-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/search/rrf.clj test/com/getorcha/search/rrf_test.clj
git commit -m "feat(search): add RRF (Reciprocal Rank Fusion) implementation"

Task 4: Create search.bm25 Namespace

Files:

Step 1: Write the failing test

(ns com.getorcha.search.bm25-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.search.bm25 :as bm25]
            [com.getorcha.test.fixtures :as fixtures]
            [honey.sql :as sql]))


(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)


(deftest search-test
  (testing "returns results ranked by text relevance"
    ;; Insert test data using HoneySQL
    (let [gen-uuid [:raw "gen_random_uuid()"]]
      (fixtures/execute!
        (sql/format
          {:insert-into :booking-history-item
           :columns     [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center]
           :values      [[gen-uuid gen-uuid gen-uuid "ACME Corp" "software license" "6300" "IT"]
                         [gen-uuid gen-uuid gen-uuid "ACME Corp" "hardware purchase" "6400" "IT"]
                         [gen-uuid gen-uuid gen-uuid "Other Inc" "consulting services" "6500" "HR"]]})))

    (let [search-spec {:table            :booking-history-item
                       :id-column        :id
                       :text-column      :searchable-text
                       :filters          {}}
          results (bm25/search fixtures/*db* search-spec "ACME software" 10)]
      (is (= 1 (count results)))
      (is (= "6300" (:booking-history-item/debit-account (first results))))
      (is (contains? (first results) :rank))
      (is (= 1 (:rank (first results)))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.search.bm25-test]'

Expected: Fail - namespace not found.

Step 3: Write implementation

(ns com.getorcha.search.bm25
  "PostgreSQL full-text search using ts_rank for BM25-style relevance."
  (:require [honey.sql :as sql]
            [honey.sql.pg-ops :as pg-ops]
            [next.jdbc :as jdbc]))


(defn ^:private build-where-clause
  "Builds WHERE clause for BM25 search with filters."
  [text-column query-text filters]
  (let [tsquery [:plainto_tsquery [:inline "simple"] query-text]
        tsvector [:to_tsvector [:inline "simple"] text-column]
        fts-match [pg-ops/atat tsvector tsquery]
        filter-clauses (mapv (fn [[col val]]
                               (if (nil? val)
                                 [:= col nil]  ; HoneySQL converts to IS NULL
                                 [:= col val]))
                             filters)]
    (if (seq filter-clauses)
      (into [:and fts-match] filter-clauses)
      fts-match)))


(defn search
  "Full-text search using PostgreSQL ts_rank.

   Returns results ranked by BM25-style relevance with :rank (1-indexed position)."
  [conn {:keys [table text-column filters] :as _search-spec} query-text k]
  (let [tsquery [:plainto_tsquery [:inline "simple"] query-text]
        tsvector [:to_tsvector [:inline "simple"] text-column]
        query {:select [:* [[[:ts_rank tsvector tsquery]] :bm25_score]]
               :from   [table]
               :where  (build-where-clause text-column query-text filters)
               :order-by [[:bm25_score :desc]]
               :limit  k}
        results (jdbc/execute! conn (sql/format query))]
    (map-indexed (fn [idx row]
                   (assoc row :rank (inc idx)))
                 results)))

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.search.bm25-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/search/bm25.clj test/com/getorcha/search/bm25_test.clj
git commit -m "feat(search): add BM25 full-text search"

Task 5: Create search.vector Namespace

Files:

Step 1: Write the failing test

(ns com.getorcha.search.vector-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.search.vector :as vector]
            [com.getorcha.test.fixtures :as fixtures]
            [honey.sql :as sql]))


(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)


(defn- embedding->sql
  "Converts embedding vector to HoneySQL ::vector cast."
  [emb]
  [:raw (str "'" (str emb) "'::vector")])


(deftest search-test
  (testing "returns results ranked by cosine similarity"
    ;; Insert test data with embeddings (768-dim, using simple patterns)
    (let [embedding-a (vec (concat [1.0 0.0 0.0] (repeat 765 0.0)))
          embedding-b (vec (concat [0.9 0.1 0.0] (repeat 765 0.0)))
          embedding-c (vec (concat [0.0 1.0 0.0] (repeat 765 0.0)))
          gen-uuid [:raw "gen_random_uuid()"]]
      (fixtures/execute!
        (sql/format
          {:insert-into :booking-history-item
           :columns     [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center :embedding]
           :values      [[gen-uuid gen-uuid gen-uuid "A" "desc a" "6300" "IT" (embedding->sql embedding-a)]
                         [gen-uuid gen-uuid gen-uuid "B" "desc b" "6400" "IT" (embedding->sql embedding-b)]
                         [gen-uuid gen-uuid gen-uuid "C" "desc c" "6500" "HR" (embedding->sql embedding-c)]]}))

      (let [query-embedding embedding-a  ; Should match A best, then B, then C
            search-spec {:table            :booking-history-item
                         :id-column        :id
                         :embedding-column :embedding
                         :filters          {}}
            results (vector/search fixtures/*db* search-spec query-embedding 10)]
        (is (= 3 (count results)))
        (is (= "6300" (:booking-history-item/debit-account (first results))))
        (is (= "6400" (:booking-history-item/debit-account (second results))))
        (is (contains? (first results) :rank))
        (is (contains? (first results) :score))
        (is (> (:score (first results)) (:score (second results))))))))


(deftest search-filters-null-embeddings-test
  (testing "excludes rows with null embeddings"
    (let [gen-uuid [:raw "gen_random_uuid()"]]
      (fixtures/execute!
        (sql/format
          {:insert-into :booking-history-item
           :columns     [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center :embedding]
           :values      [[gen-uuid gen-uuid gen-uuid "A" "has embedding" "6300" "IT" (embedding->sql [1 0 0])]
                         [gen-uuid gen-uuid gen-uuid "B" "no embedding" "6400" "IT" nil]]})))

    (let [search-spec {:table            :booking-history-item
                       :id-column        :id
                       :embedding-column :embedding
                       :filters          {}}
          results (vector/search fixtures/*db* search-spec [1.0 0.0 0.0] 10)]
      (is (= 1 (count results)))
      (is (= "6300" (:booking-history-item/debit-account (first results)))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.search.vector-test]'

Expected: Fail - namespace not found.

Step 3: Write implementation

(ns com.getorcha.search.vector
  "Semantic similarity search using pgvector HNSW index."
  (:require [honey.sql :as sql]
            [next.jdbc :as jdbc]))


;; Register pgvector cosine distance operator
(sql/register-op! :<=>)


(defn ^:private build-where-clause
  "Builds WHERE clause for vector search with filters."
  [embedding-column filters]
  (let [not-null [:is-not embedding-column nil]
        filter-clauses (mapv (fn [[col val]]
                               (if (nil? val)
                                 [:= col nil]  ; HoneySQL converts to IS NULL
                                 [:= col val]))
                             filters)]
    (if (seq filter-clauses)
      (into [:and not-null] filter-clauses)
      not-null)))


(defn search
  "Semantic similarity search using pgvector.

   Sets ef_search=100 for good recall, returns results with :rank and :score."
  [conn {:keys [table embedding-column filters] :as _search-spec} query-embedding k]
  ;; Set HNSW search accuracy (raw SQL - HoneySQL :set is for UPDATE, not config)
  (jdbc/execute! conn ["SET hnsw.ef_search = 100"])

  (let [embedding-str (str (vec query-embedding))
        embedding-cast [:cast [:lift embedding-str] :vector]
        distance [:<=> embedding-column embedding-cast]
        query {:select [:* [[[:- [:inline 1] distance]] :score]]
               :from   [table]
               :where  (build-where-clause embedding-column filters)
               :order-by [[distance]]
               :limit  k}
        results (jdbc/execute! conn (sql/format query))]
    (map-indexed (fn [idx row]
                   (assoc row :rank (inc idx)))
                 results)))

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.search.vector-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/search/vector.clj test/com/getorcha/search/vector_test.clj
git commit -m "feat(search): add pgvector similarity search"

Task 6: Create search.embedding Namespace

Files:

Step 1: Write the failing test

(ns com.getorcha.search.embedding-test
  (:require [clojure.core.async :refer [<!! close!]]
            [clojure.test :refer [deftest is testing]]
            [com.getorcha.search.embedding :as embedding]))


(deftest embed-query-test
  (testing "returns 768-dimension vector"
    ;; Note: This test requires GCP credentials and network access
    ;; In CI, mock with with-redefs
    (let [embedding (embedding/embed-query "test query")]
      (is (= 768 (count embedding)))
      (is (every? number? embedding)))))


(deftest embed-test
  (testing "returns channel with [index embedding] pairs"
    (let [texts ["text one" "text two" "text three"]
          ch (embedding/embed texts)
          results (loop [acc []]
                    (if-let [result (<!! ch)]
                      (recur (conj acc result))
                      acc))]
      (is (= 3 (count results)))
      (is (= #{0 1 2} (set (map first results))))
      (is (every? #(= 768 (count (second %))) results)))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.search.embedding-test]'

Expected: Fail - namespace not found.

Step 3: Write implementation

Reference the spike's src/embeddings/google_embed.py for API details.

(ns com.getorcha.search.embedding
  "Google Vertex AI embedding client using text-multilingual-embedding-002."
  (:require [clojure.core.async :refer [chan close! >!! io-thread]]
            [clojure.data.json :as json]
            [com.getorcha.config :as config])
  (:import [com.google.auth.oauth2 GoogleCredentials]
           [java.net URI]
           [java.net.http HttpClient HttpRequest HttpRequest$BodyPublishers HttpResponse$BodyHandlers]))


(def ^:private model "text-multilingual-embedding-002")
(def ^:private batch-size 100)
(def ^:private batch-delay-ms 200)
(def ^:private dimensions 768)


(defn ^:private get-access-token
  []
  (-> (GoogleCredentials/getApplicationDefault)
      (.createScoped ["https://www.googleapis.com/auth/cloud-platform"])
      (.refreshAccessToken)
      (.getTokenValue)))


(defn ^:private call-google-api
  "Calls Google embedding API with given texts and task type.
   task-type: 'RETRIEVAL_DOCUMENT' for indexing, 'RETRIEVAL_QUERY' for search"
  [texts task-type]
  (let [project-id  (config/gcp-project-id)
        location    "us-central1"
        endpoint    (format "https://%s-aiplatform.googleapis.com/v1/projects/%s/locations/%s/publishers/google/models/%s:predict"
                            location project-id location model)
        token       (get-access-token)
        body        {:instances (mapv (fn [text] {:content text :task_type task-type}) texts)}
        request     (-> (HttpRequest/newBuilder)
                        (.uri (URI/create endpoint))
                        (.header "Authorization" (str "Bearer " token))
                        (.header "Content-Type" "application/json")
                        (.POST (HttpRequest$BodyPublishers/ofString (json/write-str body)))
                        (.build))
        client      (HttpClient/newHttpClient)
        response    (.send client request (HttpResponse$BodyHandlers/ofString))
        parsed      (json/read-str (.body response) :key-fn keyword)]
    (if (= 200 (.statusCode response))
      (mapv #(get-in % [:embeddings :values]) (:predictions parsed))
      (throw (ex-info "Google embedding API error" {:status (.statusCode response)
                                                     :body   (.body response)})))))


(defn embed-query
  "Synchronous single-text embedding for search queries.
   Returns 768-dimension vector."
  [text]
  (first (call-google-api [text] "RETRIEVAL_QUERY")))


(defn embed
  "Batch embedding for indexing. Returns channel of [index embedding] pairs.

   Processes in batches of 100 with rate limiting. Caller consumes channel
   and persists embeddings to their domain table."
  [texts]
  (let [out (chan 100)
        texts-vec (vec texts)]
    (io-thread
      (try
        (doseq [[batch-idx batch] (map-indexed vector (partition-all batch-size texts-vec))]
          (let [start-idx (* batch-idx batch-size)
                embeddings (call-google-api (vec batch) "RETRIEVAL_DOCUMENT")]
            (doseq [[i emb] (map-indexed vector embeddings)]
              (>!! out [(+ start-idx i) emb])))
          (Thread/sleep batch-delay-ms))
        (finally
          (close! out))))
    out))

Step 4: Add config function for GCP project ID

In src/com/getorcha/config.clj, add if not present:

(defn gcp-project-id []
  (get-in (config) [:gcp :project-id]))

Step 5: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.search.embedding-test]'

Expected: PASS (requires GCP credentials)

Step 6: Commit

git add src/com/getorcha/search/embedding.clj test/com/getorcha/search/embedding_test.clj
git commit -m "feat(search): add Google embedding client"

Task 7: Create search.core Namespace (Public API)

Files:

Step 1: Write the failing test

(ns com.getorcha.search.core-test
  (:require [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.search.core :as search]
            [com.getorcha.test.fixtures :as fixtures]
            [honey.sql :as sql]))


(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)


(defn- embedding->sql
  "Converts embedding vector to HoneySQL ::vector cast."
  [emb]
  [:raw (str "'" (str emb) "'::vector")])


(deftest search-test
  (testing "hybrid search combines BM25 and semantic results with RRF"
    ;; Insert test data with embeddings
    (let [embedding-a (vec (concat [1.0 0.0 0.0] (repeat 765 0.0)))
          embedding-b (vec (concat [0.8 0.2 0.0] (repeat 765 0.0)))
          gen-uuid [:raw "gen_random_uuid()"]]
      (fixtures/execute!
        (sql/format
          {:insert-into :booking-history-item
           :columns     [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center :embedding]
           :values      [[gen-uuid gen-uuid gen-uuid "ACME" "software license" "6300" "IT" (embedding->sql embedding-a)]
                         [gen-uuid gen-uuid gen-uuid "ACME" "hardware purchase" "6400" "IT" (embedding->sql embedding-b)]]}))

      ;; Mock embed-query to return a known embedding
      (with-redefs [com.getorcha.search.embedding/embed-query (constantly embedding-a)]
        (let [search-spec {:table            :booking-history-item
                           :id-column        :id
                           :embedding-column :embedding
                           :text-column      :searchable-text
                           :filters          {}}
              results (search/search fixtures/*db* search-spec "ACME software" {:k 10})]
          ;; "software license" should rank higher (matches BM25 and semantic)
          (is (pos? (count results)))
          (is (= "6300" (:booking-history-item/debit-account (first results))))
          (is (contains? (first results) :rrf-score)))))))

Step 2: Run test to verify it fails

Run: clj -X:test:silent :nses '[com.getorcha.search.core-test]'

Expected: Fail - namespace not found.

Step 3: Write implementation

(ns com.getorcha.search.core
  "Public API for hybrid semantic search.

   Two entrypoints:
   - `embed` - batch embedding for indexing (returns core.async channel)
   - `search` - hybrid search combining BM25 + semantic + RRF"
  (:require [com.getorcha.search.bm25 :as bm25]
            [com.getorcha.search.embedding :as embedding]
            [com.getorcha.search.rrf :as rrf]
            [com.getorcha.search.vector :as vector]))


(defn embed
  "Generates embeddings for a batch of texts via Google API.
   Returns a core.async channel of [index embedding] pairs.

   Caller is responsible for:
   - Filtering texts that already have embeddings (caching)
   - Persisting returned embeddings to their domain table"
  [texts]
  (embedding/embed texts))


(defn search
  "Hybrid search combining BM25 full-text and semantic similarity.
   Returns top-k results ranked by Reciprocal Rank Fusion.

   conn        - database connection
   search-spec - {:table keyword
                  :id-column keyword
                  :embedding-column keyword
                  :text-column keyword
                  :filters map}
   query-text  - search query string
   opts        - {:k 10, :semantic-k 20, :bm25-k 20, :rrf-k 60}"
  [conn search-spec query-text opts]
  (let [{:keys [k semantic-k bm25-k rrf-k]
         :or   {k 10 semantic-k 20 bm25-k 20 rrf-k 60}} opts

        query-embedding (embedding/embed-query query-text)

        bm25-results    (bm25/search conn
                                     (select-keys search-spec [:table :id-column :text-column :filters])
                                     query-text
                                     bm25-k)

        semantic-results (vector/search conn
                                        (select-keys search-spec [:table :id-column :embedding-column :filters])
                                        query-embedding
                                        semantic-k)]
    (->> (rrf/fuse rrf-k bm25-results semantic-results)
         (take k))))

Step 4: Run test to verify it passes

Run: clj -X:test:silent :nses '[com.getorcha.search.core-test]'

Expected: PASS

Step 5: Commit

git add src/com/getorcha/search/core.clj test/com/getorcha/search/core_test.clj
git commit -m "feat(search): add public API with hybrid search"

Task 8: Integrate Embedding in ERP Booking History Upload

Files:

Step 1: Read current implementation

Read the existing upload handler to understand where to hook embedding.

Step 2: Add embed-new-items! function

After the existing upload logic, add:

(defn ^:private embed-new-items!
  "Generates embeddings for booking history items that don't have them yet.
   Called after upload or DATEV import completes."
  [pool legal-entity-id]
  (let [select-query {:select [:id :searchable-text]
                      :from   [:booking-history-item]
                      :where  [:and
                               [:= :legal-entity-id legal-entity-id]
                               [:= :embedding nil]      ; HoneySQL converts to IS NULL
                               [:= :deleted-at nil]]}
        items (jdbc/execute! pool (sql/format select-query))
        texts (mapv :booking-history-item/searchable-text items)
        ids   (mapv :booking-history-item/id items)]
    (when (seq texts)
      (let [embed-ch (search/embed texts)]
        (loop []
          (when-let [[idx embedding] (<!! embed-ch)]
            (let [update-query {:update :booking-history-item
                                :set    {:embedding [:cast (str (vec embedding)) :vector]}
                                :where  [:= :id (nth ids idx)]}]
              (jdbc/execute! pool (sql/format update-query)))
            (recur)))))))

Step 3: Call embed-new-items! after upload

In the upload handler, after inserting rows, add:

;; After successful insert, trigger embedding in background
(future (embed-new-items! pool legal-entity-id))

Step 4: Add requires

(:require [clojure.core.async :refer [<!!]]
          [com.getorcha.search.core :as search]
          [honey.sql :as sql]
          ...)

Step 5: Test manually

  1. Start system: clj -M:dev
  2. Upload a booking history CSV via the UI
  3. Check embeddings: psql -c "SELECT id, embedding IS NOT NULL FROM booking_history_item LIMIT 5"

Step 6: Commit

git add src/com/getorcha/erp/http/settings/booking_history.clj
git commit -m "feat(erp): trigger embedding generation after booking history upload"

Task 9: Integrate Search in Workers Post-Processing

Files:

Step 1: Read current fetch-supplier-booking-history

Locate the existing function that uses pg_trgm.

Step 2: Create new fetch-booking-history-candidates

Replace or add alongside:

(defn ^:private fetch-booking-history-candidates
  "Fetches semantically similar booking history for invoice line items."
  [pool legal-entity-id supplier-name line-items opts]
  (let [{:keys [k] :or {k 10}} opts
        search-spec {:table            :booking_history_item
                     :id-column        :id
                     :embedding-column :embedding
                     :text-column      :searchable_text
                     :filters          {:legal_entity_id legal-entity-id
                                        :deleted_at      nil}}

        all-results (mapcat
                      (fn [line-item]
                        (let [query-text (str supplier-name " | " line-item)]
                          (search/search pool search-spec query-text {:k k})))
                      line-items)

        ;; Deduplicate by id, keeping highest score
        deduped     (vals
                      (reduce
                        (fn [acc {:booking-history-item/keys [id] :keys [rrf-score] :as row}]
                          (if (or (not (contains? acc id))
                                  (> rrf-score (:rrf-score (get acc id))))
                            (assoc acc id row)
                            acc))
                        {}
                        all-results))]
    deduped))

Step 3: Add curation function

(defn ^:private curate-candidates
  "Clusters candidates by (debit-account, cost-center) and deduplicates within clusters."
  [candidates {:keys [dedup-threshold] :or {dedup-threshold 0.9} :as _opts}]
  (let [clusters (group-by (juxt :booking-history-item/debit-account
                                  :booking-history-item/cost-center)
                           candidates)]
    (mapcat
      (fn [[[debit-account cost-center] items]]
        (let [sorted-items (sort-by :rrf-score > items)
              ;; Keep first item per cluster (highest score), skip near-duplicates
              ;; For now, simple approach: just take top item per cluster
              representative (first sorted-items)]
          [(assoc representative
             :cluster-key   [debit-account cost-center]
             :cluster-count (count items))]))
      clusters)))

Step 4: Update call site

Replace fetch-supplier-booking-history with new functions in the invoice processing flow.

Step 5: Add requires

(:require [com.getorcha.search.core :as search]
          ...)

Step 6: Test with existing invoice

  1. Start system
  2. Process an invoice that has booking history
  3. Verify the LLM receives curated candidates in the prompt

Step 7: Commit

git add src/com/getorcha/workers/ingestion/post_process.clj
git commit -m "feat(workers): use semantic search for booking history candidates"

Task 10: Remove pg_trgm Booking History Lookup (Cleanup)

Files:

Step 1: Identify old code

Find fetch-supplier-booking-history and any pg_trgm references.

Step 2: Remove old function

Delete the pg_trgm-based lookup function.

Step 3: Update any remaining references

Ensure all call sites use the new fetch-booking-history-candidates.

Step 4: Run tests

Run: clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Execution error|failed because|Ran .* tests)"

Expected: All tests pass.

Step 5: Commit

git add src/com/getorcha/workers/ingestion/post_process.clj
git commit -m "refactor(workers): remove pg_trgm booking history lookup"

Task 11: Integration Test - End to End

Files:

Step 1: Write integration test

(ns com.getorcha.search.integration-test
  (:require [clojure.core.async :refer [<!!]]
            [clojure.test :refer [deftest is testing use-fixtures]]
            [com.getorcha.search.core :as search]
            [com.getorcha.test.fixtures :as fixtures]
            [honey.sql :as sql]
            [next.jdbc :as jdbc]))


(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)


(deftest full-pipeline-test
  (testing "embed -> search -> retrieve works end-to-end"
    (let [legal-entity-id (random-uuid)
          upload-id       (random-uuid)
          id-a            (random-uuid)
          id-b            (random-uuid)
          id-c            (random-uuid)]

      ;; 1. Insert booking history items without embeddings
      (jdbc/execute! fixtures/*db*
        (sql/format
          {:insert-into :booking-history-item
           :columns     [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center]
           :values      [[id-a upload-id legal-entity-id "ACME Corp" "annual software license renewal" "6300" "IT-001"]
                         [id-b upload-id legal-entity-id "ACME Corp" "hardware maintenance contract" "6400" "IT-001"]
                         [id-c upload-id legal-entity-id "Other Inc" "consulting services" "6500" "HR-001"]]}))

      ;; 2. Generate embeddings
      (let [select-query {:select [:id :searchable-text]
                          :from   [:booking-history-item]
                          :where  [:and
                                   [:= :legal-entity-id legal-entity-id]
                                   [:= :embedding nil]]}
            items (jdbc/execute! fixtures/*db* (sql/format select-query))
            texts (mapv :booking-history-item/searchable-text items)
            ids   (mapv :booking-history-item/id items)
            ch    (search/embed texts)]
        (loop []
          (when-let [[idx embedding] (<!! ch)]
            (let [update-query {:update :booking-history-item
                                :set    {:embedding [:cast (str (vec embedding)) :vector]}
                                :where  [:= :id (nth ids idx)]}]
              (jdbc/execute! fixtures/*db* (sql/format update-query)))
            (recur))))

      ;; 3. Search for similar bookings
      (let [search-spec {:table            :booking-history-item
                         :id-column        :id
                         :embedding-column :embedding
                         :text-column      :searchable-text
                         :filters          {:legal-entity-id legal-entity-id}}
            results     (search/search fixtures/*db* search-spec
                                       "ACME Corp | software license" {:k 5})]

        ;; Should find ACME software license first
        (is (pos? (count results)))
        (is (= "6300" (:booking-history-item/debit-account (first results))))))))

Step 2: Run test

Run: clj -X:test:silent :nses '[com.getorcha.search.integration-test]'

Expected: PASS (requires GCP credentials)

Step 3: Commit

git add test/com/getorcha/search/integration_test.clj
git commit -m "test(search): add end-to-end integration test"

Summary

Task Description Key Files
1 Upgrade deps deps.edn
2 DB migration resources/migrations/*-add-search-columns.*
3 RRF module search/rrf.clj
4 BM25 module search/bm25.clj
5 Vector module search/vector.clj
6 Embedding module search/embedding.clj
7 Core API search/core.clj
8 ERP integration booking_history.clj
9 Workers integration post_process.clj
10 Cleanup pg_trgm post_process.clj
11 Integration test search/integration_test.clj