Note (2026-04-24): After this document was written,
legal_entitywas renamed totenantand the oldtenantwas renamed toorganization. Read references to these terms with the pre-rename meaning.
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
CRITICAL: Read
orcha/CLAUDE.mdat the start of every task. Follow all conventions strictly.
Goal: Replace pg_trgm booking history lookup with hybrid semantic search (BM25 + Google embeddings + RRF).
Architecture: New com.getorcha.search module with embed and search public functions. ERP calls embed after booking history import, Workers call search during post-processing. Curation stays in Workers.
Tech Stack: pgvector, Google Vertex AI (text-multilingual-embedding-002), core.async 1.9.x, PostgreSQL full-text search.
Files:
deps.ednStep 1: Update deps.edn
Add pgvector Java client and upgrade core.async:
;; In :deps map, update:
org.clojure/core.async {:mvn/version "1.9.829"}
;; Add:
com.pgvector/pgvector {:mvn/version "0.1.6"}
Step 2: Verify deps resolve
Run: clj -Stree | grep -E "(core.async|pgvector)"
Expected:
org.clojure/core.async 1.9.829
com.pgvector/pgvector 0.1.6
Step 3: Commit
git add deps.edn
git commit -m "deps: upgrade core.async to 1.9.x, add pgvector"
Files:
resources/migrations/YYYYMMDDHHMMSS-add-search-columns.up.sqlresources/migrations/YYYYMMDDHHMMSS-add-search-columns.down.sqlStep 1: Create migration
Run: bb migrate create "add-search-columns"
Step 2: Write up migration
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Add searchable text (generated column for BM25)
ALTER TABLE booking_history_item
ADD COLUMN searchable_text TEXT
GENERATED ALWAYS AS (supplier_name || ' | ' || description) STORED;
-- Add embedding column (768 dimensions for Google's text-multilingual-embedding-002)
ALTER TABLE booking_history_item
ADD COLUMN embedding vector(768);
-- HNSW index for approximate nearest neighbor search
CREATE INDEX booking_history_item_embedding_idx
ON booking_history_item
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- GIN index for BM25 full-text search
CREATE INDEX booking_history_item_searchable_text_idx
ON booking_history_item
USING gin (to_tsvector('simple', searchable_text));
Step 3: Write down migration
DROP INDEX IF EXISTS booking_history_item_searchable_text_idx;
DROP INDEX IF EXISTS booking_history_item_embedding_idx;
ALTER TABLE booking_history_item DROP COLUMN IF EXISTS embedding;
ALTER TABLE booking_history_item DROP COLUMN IF EXISTS searchable_text;
-- Note: Don't drop the vector extension as other tables may use it
Step 4: Run migration locally
Run: clj -M:dev -m com.getorcha.db.migrate
Or start the system: clj -M:dev and migrations run automatically.
Step 5: Verify columns exist
Run: psql -h localhost -U postgres -d orcha -c "\d booking_history_item"
Expected: See searchable_text and embedding columns.
Step 6: Commit
git add resources/migrations/*-add-search-columns.*
git commit -m "feat(db): add searchable_text and embedding columns for semantic search"
Files:
src/com/getorcha/search/rrf.cljtest/com/getorcha/search/rrf_test.cljStep 1: Write the failing test
(ns com.getorcha.search.rrf-test
(:require [clojure.test :refer [deftest is testing]]
[com.getorcha.search.rrf :as rrf]))
(deftest fuse-test
(testing "document in both lists gets boosted"
(let [bm25-results [{:id 1 :rank 1} {:id 2 :rank 2}]
semantic-results [{:id 2 :rank 1} {:id 3 :rank 2}]
fused (rrf/fuse 60 bm25-results semantic-results)]
;; id=2 appears in both lists: 1/(60+2) + 1/(60+1) = higher score
;; id=1 appears only in bm25: 1/(60+1)
;; id=3 appears only in semantic: 1/(60+2)
(is (= [2 1 3] (mapv :id fused)))))
(testing "preserves row data from first occurrence"
(let [bm25-results [{:id 1 :rank 1 :extra "bm25-data"}]
semantic-results [{:id 1 :rank 1 :extra "semantic-data"}]
fused (rrf/fuse 60 bm25-results semantic-results)]
(is (= "bm25-data" (:extra (first fused))))))
(testing "empty lists return empty"
(is (empty? (rrf/fuse 60 [] [])))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.search.rrf-test]'
Expected: Fail - namespace not found.
Step 3: Write implementation
(ns com.getorcha.search.rrf
"Reciprocal Rank Fusion for combining multiple ranked result lists.
RRF score for document d = Σ 1/(k + rank_in_list)
Reference: Cormack et al. 2009")
(defn fuse
"Combines multiple ranked result lists using Reciprocal Rank Fusion.
rrf-k - constant from RRF paper, typically 60
result-lists - varargs of result sequences, each with :id and :rank keys
Returns results sorted by RRF score descending, with :rrf-score added."
[rrf-k & result-lists]
(let [id->entries (reduce
(fn [acc results]
(reduce
(fn [acc {:keys [id rank] :as row}]
(update acc id (fnil conj []) {:rank rank :row row}))
acc
results))
{}
result-lists)]
(->> id->entries
(map (fn [[id entries]]
(let [rrf-score (reduce + (map #(/ 1.0 (+ rrf-k (:rank %))) entries))
row (:row (first entries))]
(assoc row :rrf-score rrf-score))))
(sort-by :rrf-score >))))
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.search.rrf-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/search/rrf.clj test/com/getorcha/search/rrf_test.clj
git commit -m "feat(search): add RRF (Reciprocal Rank Fusion) implementation"
Files:
src/com/getorcha/search/bm25.cljtest/com/getorcha/search/bm25_test.cljStep 1: Write the failing test
(ns com.getorcha.search.bm25-test
(:require [clojure.test :refer [deftest is testing use-fixtures]]
[com.getorcha.search.bm25 :as bm25]
[com.getorcha.test.fixtures :as fixtures]
[honey.sql :as sql]))
(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)
(deftest search-test
(testing "returns results ranked by text relevance"
;; Insert test data using HoneySQL
(let [gen-uuid [:raw "gen_random_uuid()"]]
(fixtures/execute!
(sql/format
{:insert-into :booking-history-item
:columns [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center]
:values [[gen-uuid gen-uuid gen-uuid "ACME Corp" "software license" "6300" "IT"]
[gen-uuid gen-uuid gen-uuid "ACME Corp" "hardware purchase" "6400" "IT"]
[gen-uuid gen-uuid gen-uuid "Other Inc" "consulting services" "6500" "HR"]]})))
(let [search-spec {:table :booking-history-item
:id-column :id
:text-column :searchable-text
:filters {}}
results (bm25/search fixtures/*db* search-spec "ACME software" 10)]
(is (= 1 (count results)))
(is (= "6300" (:booking-history-item/debit-account (first results))))
(is (contains? (first results) :rank))
(is (= 1 (:rank (first results)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.search.bm25-test]'
Expected: Fail - namespace not found.
Step 3: Write implementation
(ns com.getorcha.search.bm25
"PostgreSQL full-text search using ts_rank for BM25-style relevance."
(:require [honey.sql :as sql]
[honey.sql.pg-ops :as pg-ops]
[next.jdbc :as jdbc]))
(defn ^:private build-where-clause
"Builds WHERE clause for BM25 search with filters."
[text-column query-text filters]
(let [tsquery [:plainto_tsquery [:inline "simple"] query-text]
tsvector [:to_tsvector [:inline "simple"] text-column]
fts-match [pg-ops/atat tsvector tsquery]
filter-clauses (mapv (fn [[col val]]
(if (nil? val)
[:= col nil] ; HoneySQL converts to IS NULL
[:= col val]))
filters)]
(if (seq filter-clauses)
(into [:and fts-match] filter-clauses)
fts-match)))
(defn search
"Full-text search using PostgreSQL ts_rank.
Returns results ranked by BM25-style relevance with :rank (1-indexed position)."
[conn {:keys [table text-column filters] :as _search-spec} query-text k]
(let [tsquery [:plainto_tsquery [:inline "simple"] query-text]
tsvector [:to_tsvector [:inline "simple"] text-column]
query {:select [:* [[[:ts_rank tsvector tsquery]] :bm25_score]]
:from [table]
:where (build-where-clause text-column query-text filters)
:order-by [[:bm25_score :desc]]
:limit k}
results (jdbc/execute! conn (sql/format query))]
(map-indexed (fn [idx row]
(assoc row :rank (inc idx)))
results)))
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.search.bm25-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/search/bm25.clj test/com/getorcha/search/bm25_test.clj
git commit -m "feat(search): add BM25 full-text search"
Files:
src/com/getorcha/search/vector.cljtest/com/getorcha/search/vector_test.cljStep 1: Write the failing test
(ns com.getorcha.search.vector-test
(:require [clojure.test :refer [deftest is testing use-fixtures]]
[com.getorcha.search.vector :as vector]
[com.getorcha.test.fixtures :as fixtures]
[honey.sql :as sql]))
(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)
(defn- embedding->sql
"Converts embedding vector to HoneySQL ::vector cast."
[emb]
[:raw (str "'" (str emb) "'::vector")])
(deftest search-test
(testing "returns results ranked by cosine similarity"
;; Insert test data with embeddings (768-dim, using simple patterns)
(let [embedding-a (vec (concat [1.0 0.0 0.0] (repeat 765 0.0)))
embedding-b (vec (concat [0.9 0.1 0.0] (repeat 765 0.0)))
embedding-c (vec (concat [0.0 1.0 0.0] (repeat 765 0.0)))
gen-uuid [:raw "gen_random_uuid()"]]
(fixtures/execute!
(sql/format
{:insert-into :booking-history-item
:columns [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center :embedding]
:values [[gen-uuid gen-uuid gen-uuid "A" "desc a" "6300" "IT" (embedding->sql embedding-a)]
[gen-uuid gen-uuid gen-uuid "B" "desc b" "6400" "IT" (embedding->sql embedding-b)]
[gen-uuid gen-uuid gen-uuid "C" "desc c" "6500" "HR" (embedding->sql embedding-c)]]}))
(let [query-embedding embedding-a ; Should match A best, then B, then C
search-spec {:table :booking-history-item
:id-column :id
:embedding-column :embedding
:filters {}}
results (vector/search fixtures/*db* search-spec query-embedding 10)]
(is (= 3 (count results)))
(is (= "6300" (:booking-history-item/debit-account (first results))))
(is (= "6400" (:booking-history-item/debit-account (second results))))
(is (contains? (first results) :rank))
(is (contains? (first results) :score))
(is (> (:score (first results)) (:score (second results))))))))
(deftest search-filters-null-embeddings-test
(testing "excludes rows with null embeddings"
(let [gen-uuid [:raw "gen_random_uuid()"]]
(fixtures/execute!
(sql/format
{:insert-into :booking-history-item
:columns [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center :embedding]
:values [[gen-uuid gen-uuid gen-uuid "A" "has embedding" "6300" "IT" (embedding->sql [1 0 0])]
[gen-uuid gen-uuid gen-uuid "B" "no embedding" "6400" "IT" nil]]})))
(let [search-spec {:table :booking-history-item
:id-column :id
:embedding-column :embedding
:filters {}}
results (vector/search fixtures/*db* search-spec [1.0 0.0 0.0] 10)]
(is (= 1 (count results)))
(is (= "6300" (:booking-history-item/debit-account (first results)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.search.vector-test]'
Expected: Fail - namespace not found.
Step 3: Write implementation
(ns com.getorcha.search.vector
"Semantic similarity search using pgvector HNSW index."
(:require [honey.sql :as sql]
[next.jdbc :as jdbc]))
;; Register pgvector cosine distance operator
(sql/register-op! :<=>)
(defn ^:private build-where-clause
"Builds WHERE clause for vector search with filters."
[embedding-column filters]
(let [not-null [:is-not embedding-column nil]
filter-clauses (mapv (fn [[col val]]
(if (nil? val)
[:= col nil] ; HoneySQL converts to IS NULL
[:= col val]))
filters)]
(if (seq filter-clauses)
(into [:and not-null] filter-clauses)
not-null)))
(defn search
"Semantic similarity search using pgvector.
Sets ef_search=100 for good recall, returns results with :rank and :score."
[conn {:keys [table embedding-column filters] :as _search-spec} query-embedding k]
;; Set HNSW search accuracy (raw SQL - HoneySQL :set is for UPDATE, not config)
(jdbc/execute! conn ["SET hnsw.ef_search = 100"])
(let [embedding-str (str (vec query-embedding))
embedding-cast [:cast [:lift embedding-str] :vector]
distance [:<=> embedding-column embedding-cast]
query {:select [:* [[[:- [:inline 1] distance]] :score]]
:from [table]
:where (build-where-clause embedding-column filters)
:order-by [[distance]]
:limit k}
results (jdbc/execute! conn (sql/format query))]
(map-indexed (fn [idx row]
(assoc row :rank (inc idx)))
results)))
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.search.vector-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/search/vector.clj test/com/getorcha/search/vector_test.clj
git commit -m "feat(search): add pgvector similarity search"
Files:
src/com/getorcha/search/embedding.cljtest/com/getorcha/search/embedding_test.cljStep 1: Write the failing test
(ns com.getorcha.search.embedding-test
(:require [clojure.core.async :refer [<!! close!]]
[clojure.test :refer [deftest is testing]]
[com.getorcha.search.embedding :as embedding]))
(deftest embed-query-test
(testing "returns 768-dimension vector"
;; Note: This test requires GCP credentials and network access
;; In CI, mock with with-redefs
(let [embedding (embedding/embed-query "test query")]
(is (= 768 (count embedding)))
(is (every? number? embedding)))))
(deftest embed-test
(testing "returns channel with [index embedding] pairs"
(let [texts ["text one" "text two" "text three"]
ch (embedding/embed texts)
results (loop [acc []]
(if-let [result (<!! ch)]
(recur (conj acc result))
acc))]
(is (= 3 (count results)))
(is (= #{0 1 2} (set (map first results))))
(is (every? #(= 768 (count (second %))) results)))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.search.embedding-test]'
Expected: Fail - namespace not found.
Step 3: Write implementation
Reference the spike's src/embeddings/google_embed.py for API details.
(ns com.getorcha.search.embedding
"Google Vertex AI embedding client using text-multilingual-embedding-002."
(:require [clojure.core.async :refer [chan close! >!! io-thread]]
[clojure.data.json :as json]
[com.getorcha.config :as config])
(:import [com.google.auth.oauth2 GoogleCredentials]
[java.net URI]
[java.net.http HttpClient HttpRequest HttpRequest$BodyPublishers HttpResponse$BodyHandlers]))
(def ^:private model "text-multilingual-embedding-002")
(def ^:private batch-size 100)
(def ^:private batch-delay-ms 200)
(def ^:private dimensions 768)
(defn ^:private get-access-token
[]
(-> (GoogleCredentials/getApplicationDefault)
(.createScoped ["https://www.googleapis.com/auth/cloud-platform"])
(.refreshAccessToken)
(.getTokenValue)))
(defn ^:private call-google-api
"Calls Google embedding API with given texts and task type.
task-type: 'RETRIEVAL_DOCUMENT' for indexing, 'RETRIEVAL_QUERY' for search"
[texts task-type]
(let [project-id (config/gcp-project-id)
location "us-central1"
endpoint (format "https://%s-aiplatform.googleapis.com/v1/projects/%s/locations/%s/publishers/google/models/%s:predict"
location project-id location model)
token (get-access-token)
body {:instances (mapv (fn [text] {:content text :task_type task-type}) texts)}
request (-> (HttpRequest/newBuilder)
(.uri (URI/create endpoint))
(.header "Authorization" (str "Bearer " token))
(.header "Content-Type" "application/json")
(.POST (HttpRequest$BodyPublishers/ofString (json/write-str body)))
(.build))
client (HttpClient/newHttpClient)
response (.send client request (HttpResponse$BodyHandlers/ofString))
parsed (json/read-str (.body response) :key-fn keyword)]
(if (= 200 (.statusCode response))
(mapv #(get-in % [:embeddings :values]) (:predictions parsed))
(throw (ex-info "Google embedding API error" {:status (.statusCode response)
:body (.body response)})))))
(defn embed-query
"Synchronous single-text embedding for search queries.
Returns 768-dimension vector."
[text]
(first (call-google-api [text] "RETRIEVAL_QUERY")))
(defn embed
"Batch embedding for indexing. Returns channel of [index embedding] pairs.
Processes in batches of 100 with rate limiting. Caller consumes channel
and persists embeddings to their domain table."
[texts]
(let [out (chan 100)
texts-vec (vec texts)]
(io-thread
(try
(doseq [[batch-idx batch] (map-indexed vector (partition-all batch-size texts-vec))]
(let [start-idx (* batch-idx batch-size)
embeddings (call-google-api (vec batch) "RETRIEVAL_DOCUMENT")]
(doseq [[i emb] (map-indexed vector embeddings)]
(>!! out [(+ start-idx i) emb])))
(Thread/sleep batch-delay-ms))
(finally
(close! out))))
out))
Step 4: Add config function for GCP project ID
In src/com/getorcha/config.clj, add if not present:
(defn gcp-project-id []
(get-in (config) [:gcp :project-id]))
Step 5: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.search.embedding-test]'
Expected: PASS (requires GCP credentials)
Step 6: Commit
git add src/com/getorcha/search/embedding.clj test/com/getorcha/search/embedding_test.clj
git commit -m "feat(search): add Google embedding client"
Files:
src/com/getorcha/search/core.cljtest/com/getorcha/search/core_test.cljStep 1: Write the failing test
(ns com.getorcha.search.core-test
(:require [clojure.test :refer [deftest is testing use-fixtures]]
[com.getorcha.search.core :as search]
[com.getorcha.test.fixtures :as fixtures]
[honey.sql :as sql]))
(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)
(defn- embedding->sql
"Converts embedding vector to HoneySQL ::vector cast."
[emb]
[:raw (str "'" (str emb) "'::vector")])
(deftest search-test
(testing "hybrid search combines BM25 and semantic results with RRF"
;; Insert test data with embeddings
(let [embedding-a (vec (concat [1.0 0.0 0.0] (repeat 765 0.0)))
embedding-b (vec (concat [0.8 0.2 0.0] (repeat 765 0.0)))
gen-uuid [:raw "gen_random_uuid()"]]
(fixtures/execute!
(sql/format
{:insert-into :booking-history-item
:columns [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center :embedding]
:values [[gen-uuid gen-uuid gen-uuid "ACME" "software license" "6300" "IT" (embedding->sql embedding-a)]
[gen-uuid gen-uuid gen-uuid "ACME" "hardware purchase" "6400" "IT" (embedding->sql embedding-b)]]}))
;; Mock embed-query to return a known embedding
(with-redefs [com.getorcha.search.embedding/embed-query (constantly embedding-a)]
(let [search-spec {:table :booking-history-item
:id-column :id
:embedding-column :embedding
:text-column :searchable-text
:filters {}}
results (search/search fixtures/*db* search-spec "ACME software" {:k 10})]
;; "software license" should rank higher (matches BM25 and semantic)
(is (pos? (count results)))
(is (= "6300" (:booking-history-item/debit-account (first results))))
(is (contains? (first results) :rrf-score)))))))
Step 2: Run test to verify it fails
Run: clj -X:test:silent :nses '[com.getorcha.search.core-test]'
Expected: Fail - namespace not found.
Step 3: Write implementation
(ns com.getorcha.search.core
"Public API for hybrid semantic search.
Two entrypoints:
- `embed` - batch embedding for indexing (returns core.async channel)
- `search` - hybrid search combining BM25 + semantic + RRF"
(:require [com.getorcha.search.bm25 :as bm25]
[com.getorcha.search.embedding :as embedding]
[com.getorcha.search.rrf :as rrf]
[com.getorcha.search.vector :as vector]))
(defn embed
"Generates embeddings for a batch of texts via Google API.
Returns a core.async channel of [index embedding] pairs.
Caller is responsible for:
- Filtering texts that already have embeddings (caching)
- Persisting returned embeddings to their domain table"
[texts]
(embedding/embed texts))
(defn search
"Hybrid search combining BM25 full-text and semantic similarity.
Returns top-k results ranked by Reciprocal Rank Fusion.
conn - database connection
search-spec - {:table keyword
:id-column keyword
:embedding-column keyword
:text-column keyword
:filters map}
query-text - search query string
opts - {:k 10, :semantic-k 20, :bm25-k 20, :rrf-k 60}"
[conn search-spec query-text opts]
(let [{:keys [k semantic-k bm25-k rrf-k]
:or {k 10 semantic-k 20 bm25-k 20 rrf-k 60}} opts
query-embedding (embedding/embed-query query-text)
bm25-results (bm25/search conn
(select-keys search-spec [:table :id-column :text-column :filters])
query-text
bm25-k)
semantic-results (vector/search conn
(select-keys search-spec [:table :id-column :embedding-column :filters])
query-embedding
semantic-k)]
(->> (rrf/fuse rrf-k bm25-results semantic-results)
(take k))))
Step 4: Run test to verify it passes
Run: clj -X:test:silent :nses '[com.getorcha.search.core-test]'
Expected: PASS
Step 5: Commit
git add src/com/getorcha/search/core.clj test/com/getorcha/search/core_test.clj
git commit -m "feat(search): add public API with hybrid search"
Files:
src/com/getorcha/erp/http/settings/booking_history.cljStep 1: Read current implementation
Read the existing upload handler to understand where to hook embedding.
Step 2: Add embed-new-items! function
After the existing upload logic, add:
(defn ^:private embed-new-items!
"Generates embeddings for booking history items that don't have them yet.
Called after upload or DATEV import completes."
[pool legal-entity-id]
(let [select-query {:select [:id :searchable-text]
:from [:booking-history-item]
:where [:and
[:= :legal-entity-id legal-entity-id]
[:= :embedding nil] ; HoneySQL converts to IS NULL
[:= :deleted-at nil]]}
items (jdbc/execute! pool (sql/format select-query))
texts (mapv :booking-history-item/searchable-text items)
ids (mapv :booking-history-item/id items)]
(when (seq texts)
(let [embed-ch (search/embed texts)]
(loop []
(when-let [[idx embedding] (<!! embed-ch)]
(let [update-query {:update :booking-history-item
:set {:embedding [:cast (str (vec embedding)) :vector]}
:where [:= :id (nth ids idx)]}]
(jdbc/execute! pool (sql/format update-query)))
(recur)))))))
Step 3: Call embed-new-items! after upload
In the upload handler, after inserting rows, add:
;; After successful insert, trigger embedding in background
(future (embed-new-items! pool legal-entity-id))
Step 4: Add requires
(:require [clojure.core.async :refer [<!!]]
[com.getorcha.search.core :as search]
[honey.sql :as sql]
...)
Step 5: Test manually
clj -M:devpsql -c "SELECT id, embedding IS NOT NULL FROM booking_history_item LIMIT 5"Step 6: Commit
git add src/com/getorcha/erp/http/settings/booking_history.clj
git commit -m "feat(erp): trigger embedding generation after booking history upload"
Files:
src/com/getorcha/workers/ingestion/post_process.cljStep 1: Read current fetch-supplier-booking-history
Locate the existing function that uses pg_trgm.
Step 2: Create new fetch-booking-history-candidates
Replace or add alongside:
(defn ^:private fetch-booking-history-candidates
"Fetches semantically similar booking history for invoice line items."
[pool legal-entity-id supplier-name line-items opts]
(let [{:keys [k] :or {k 10}} opts
search-spec {:table :booking_history_item
:id-column :id
:embedding-column :embedding
:text-column :searchable_text
:filters {:legal_entity_id legal-entity-id
:deleted_at nil}}
all-results (mapcat
(fn [line-item]
(let [query-text (str supplier-name " | " line-item)]
(search/search pool search-spec query-text {:k k})))
line-items)
;; Deduplicate by id, keeping highest score
deduped (vals
(reduce
(fn [acc {:booking-history-item/keys [id] :keys [rrf-score] :as row}]
(if (or (not (contains? acc id))
(> rrf-score (:rrf-score (get acc id))))
(assoc acc id row)
acc))
{}
all-results))]
deduped))
Step 3: Add curation function
(defn ^:private curate-candidates
"Clusters candidates by (debit-account, cost-center) and deduplicates within clusters."
[candidates {:keys [dedup-threshold] :or {dedup-threshold 0.9} :as _opts}]
(let [clusters (group-by (juxt :booking-history-item/debit-account
:booking-history-item/cost-center)
candidates)]
(mapcat
(fn [[[debit-account cost-center] items]]
(let [sorted-items (sort-by :rrf-score > items)
;; Keep first item per cluster (highest score), skip near-duplicates
;; For now, simple approach: just take top item per cluster
representative (first sorted-items)]
[(assoc representative
:cluster-key [debit-account cost-center]
:cluster-count (count items))]))
clusters)))
Step 4: Update call site
Replace fetch-supplier-booking-history with new functions in the invoice processing flow.
Step 5: Add requires
(:require [com.getorcha.search.core :as search]
...)
Step 6: Test with existing invoice
Step 7: Commit
git add src/com/getorcha/workers/ingestion/post_process.clj
git commit -m "feat(workers): use semantic search for booking history candidates"
Files:
src/com/getorcha/workers/ingestion/post_process.cljStep 1: Identify old code
Find fetch-supplier-booking-history and any pg_trgm references.
Step 2: Remove old function
Delete the pg_trgm-based lookup function.
Step 3: Update any remaining references
Ensure all call sites use the new fetch-booking-history-candidates.
Step 4: Run tests
Run: clj -X:test:silent 2>&1 | grep -A 5 -E "(FAIL in|ERROR in|Execution error|failed because|Ran .* tests)"
Expected: All tests pass.
Step 5: Commit
git add src/com/getorcha/workers/ingestion/post_process.clj
git commit -m "refactor(workers): remove pg_trgm booking history lookup"
Files:
test/com/getorcha/search/integration_test.cljStep 1: Write integration test
(ns com.getorcha.search.integration-test
(:require [clojure.core.async :refer [<!!]]
[clojure.test :refer [deftest is testing use-fixtures]]
[com.getorcha.search.core :as search]
[com.getorcha.test.fixtures :as fixtures]
[honey.sql :as sql]
[next.jdbc :as jdbc]))
(use-fixtures :once fixtures/with-system)
(use-fixtures :each fixtures/with-rollback)
(deftest full-pipeline-test
(testing "embed -> search -> retrieve works end-to-end"
(let [legal-entity-id (random-uuid)
upload-id (random-uuid)
id-a (random-uuid)
id-b (random-uuid)
id-c (random-uuid)]
;; 1. Insert booking history items without embeddings
(jdbc/execute! fixtures/*db*
(sql/format
{:insert-into :booking-history-item
:columns [:id :upload-id :legal-entity-id :supplier-name :description :debit-account :cost-center]
:values [[id-a upload-id legal-entity-id "ACME Corp" "annual software license renewal" "6300" "IT-001"]
[id-b upload-id legal-entity-id "ACME Corp" "hardware maintenance contract" "6400" "IT-001"]
[id-c upload-id legal-entity-id "Other Inc" "consulting services" "6500" "HR-001"]]}))
;; 2. Generate embeddings
(let [select-query {:select [:id :searchable-text]
:from [:booking-history-item]
:where [:and
[:= :legal-entity-id legal-entity-id]
[:= :embedding nil]]}
items (jdbc/execute! fixtures/*db* (sql/format select-query))
texts (mapv :booking-history-item/searchable-text items)
ids (mapv :booking-history-item/id items)
ch (search/embed texts)]
(loop []
(when-let [[idx embedding] (<!! ch)]
(let [update-query {:update :booking-history-item
:set {:embedding [:cast (str (vec embedding)) :vector]}
:where [:= :id (nth ids idx)]}]
(jdbc/execute! fixtures/*db* (sql/format update-query)))
(recur))))
;; 3. Search for similar bookings
(let [search-spec {:table :booking-history-item
:id-column :id
:embedding-column :embedding
:text-column :searchable-text
:filters {:legal-entity-id legal-entity-id}}
results (search/search fixtures/*db* search-spec
"ACME Corp | software license" {:k 5})]
;; Should find ACME software license first
(is (pos? (count results)))
(is (= "6300" (:booking-history-item/debit-account (first results))))))))
Step 2: Run test
Run: clj -X:test:silent :nses '[com.getorcha.search.integration-test]'
Expected: PASS (requires GCP credentials)
Step 3: Commit
git add test/com/getorcha/search/integration_test.clj
git commit -m "test(search): add end-to-end integration test"
| Task | Description | Key Files |
|---|---|---|
| 1 | Upgrade deps | deps.edn |
| 2 | DB migration | resources/migrations/*-add-search-columns.* |
| 3 | RRF module | search/rrf.clj |
| 4 | BM25 module | search/bm25.clj |
| 5 | Vector module | search/vector.clj |
| 6 | Embedding module | search/embedding.clj |
| 7 | Core API | search/core.clj |
| 8 | ERP integration | booking_history.clj |
| 9 | Workers integration | post_process.clj |
| 10 | Cleanup pg_trgm | post_process.clj |
| 11 | Integration test | search/integration_test.clj |