Note (2026-04-24): After this document was written,
legal_entitywas renamed totenantand the oldtenantwas renamed toorganization. Read references to these terms with the pre-rename meaning.
Integrate semantic search from the spikes/semantic-search spike into Orcha, replacing the pg_trgm-based booking history lookup with hybrid search (BM25 + semantic + RRF).
AccountsMatcher and CostCenterMatcher currently use fetch-supplier-booking-history which matches on supplier name using pg_trgm similarity (threshold 0.7). This misses semantically similar bookings with different wording.
The spike validated that hybrid search (BM25 + Google embeddings + Reciprocal Rank Fusion) provides better retrieval quality for historical booking candidates.
com.getorcha.search/
├── core.clj ; Public API: embed, search
├── embedding.clj ; Google embedding client
├── bm25.clj ; PostgreSQL full-text search
├── vector.clj ; pgvector similarity search
└── rrf.clj ; Reciprocal Rank Fusion
Root-level module (com.getorcha.search) since it's used by multiple services (ERP for indexing, Workers for querying).
search/embedGenerates embeddings for a batch of texts via Google API. Returns a core.async channel of [index embedding] pairs.
(defn embed
"Generates embeddings for a batch of texts via Google API.
Returns a core.async channel of [index embedding] pairs.
Caller is responsible for:
- Filtering texts that already have embeddings (caching)
- Persisting returned embeddings to their domain table"
[texts]
...)
Uses io-thread for blocking HTTP calls, batches of 100, 200ms delay between batches for rate limiting.
search/searchHybrid search combining BM25 full-text and semantic similarity.
(defn search
"Hybrid search combining BM25 full-text and semantic similarity.
Returns top-k results ranked by Reciprocal Rank Fusion.
conn - database connection
search-spec - {:table keyword
:id-column keyword
:embedding-column keyword
:text-column keyword
:filters map}
query - query text (embedding generated internally)
opts - {:k 10, :semantic-k 20, :bm25-k 20, :rrf-k 60}"
[conn search-spec query opts]
...)
Migration for booking_history_item:
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Add searchable text (generated column for consistency)
ALTER TABLE booking_history_item
ADD COLUMN searchable_text TEXT
GENERATED ALWAYS AS (supplier_name || ' | ' || description) STORED;
-- Add embedding column (768 dimensions for Google's text-multilingual-embedding-002)
ALTER TABLE booking_history_item
ADD COLUMN embedding vector(768);
-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX booking_history_item_embedding_idx
ON booking_history_item
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- GIN index for BM25 full-text search
CREATE INDEX booking_history_item_searchable_text_idx
ON booking_history_item
USING gin (to_tsvector('simple', searchable_text));
| Parameter | Value | When Used | Purpose |
|---|---|---|---|
| m | 16 | Index build | Max connections per node. Higher = better recall, larger index. |
| ef_construction | 64 | Index build | Search width when building. Higher = better quality, slower build. |
| ef_search | 100 | Query time | Search width when querying. Higher = better recall, slower queries. |
Google API client using text-multilingual-embedding-002:
RETRIEVAL_DOCUMENT for indexing, RETRIEVAL_QUERY for searchStreaming results via core.async channel:
(defn embed [texts]
(let [out (chan 100)]
(io-thread
(doseq [[batch-idx batch] (map-indexed vector (partition-all 100 texts))]
(let [start-idx (* batch-idx batch-size)
embeddings (call-google-api batch "RETRIEVAL_DOCUMENT")]
(doseq [[i emb] (map-indexed vector embeddings)]
(>!! out [(+ start-idx i) emb])))
(Thread/sleep 200))
(close! out))
out))
PostgreSQL full-text search using ts_rank:
'simple' text search config (no stemming) - works for German + English mixed contentplainto_tsquery handles user input safely:rank (1-indexed position) for RRF fusionpgvector similarity search:
ef_search = 100 for good recall<=> operator for cosine distance (range [0, 2])1 - distance gives similarity in [0, 1]:rank for RRF fusionCombines BM25 and semantic rankings:
RRF score for document d = Σ 1/(k + rank_in_list)
In com.getorcha.erp.http.settings/booking_history.clj:
(defn ^:private embed-new-items!
"Generates embeddings for booking history items that don't have them yet.
Called as background job after upload or DATEV import completes."
[pool legal-entity-id]
(let [items (jdbc/execute! pool
["SELECT id, searchable_text
FROM booking_history_item
WHERE legal_entity_id = ?
AND embedding IS NULL
AND deleted_at IS NULL"
legal-entity-id])
texts (mapv :booking-history-item/searchable-text items)
ids (mapv :booking-history-item/id items)]
(when (seq texts)
(let [embed-ch (search/embed texts)]
(loop []
(when-let [[idx embedding] (<!! embed-ch)]
(jdbc/execute! pool
["UPDATE booking_history_item
SET embedding = ?::vector
WHERE id = ?"
(str embedding) (nth ids idx)])
(recur)))))))
Replace fetch-supplier-booking-history in post_process.clj:
(defn ^:private fetch-booking-history-candidates
"Fetches semantically similar booking history for invoice line items."
[pool legal-entity-id supplier-name line-items opts]
(let [{:keys [k] :or {k 10}} opts
search-spec {:table :booking_history_item
:id-column :id
:embedding-column :embedding
:text-column :searchable_text
:filters {:legal_entity_id legal-entity-id
:deleted_at nil}}
;; Search for each line item
all-results (mapcat
(fn [line-item]
(let [query-text (str supplier-name " | " line-item)]
(search/search pool search-spec query-text {:k k})))
line-items)
;; Deduplicate by id, keeping highest score
deduped (vals (reduce
(fn [acc {:keys [id rrf-score] :as row}]
(if (or (not (contains? acc id))
(> rrf-score (:rrf-score (get acc id))))
(assoc acc id row)
acc))
{}
all-results))]
deduped))
Stays in Workers (post_process.clj), not in the search module:
(defn ^:private curate-candidates
"Clusters candidates by (debit-account, cost-center) and deduplicates
similar descriptions within each cluster."
[candidates {:keys [dedup-threshold] :or {dedup-threshold 0.9}}]
...)
Attaches :cluster-count showing how many original matches had this booking pattern.
┌─────────────────────────────────────────────────────────────────────────────┐
│ INDEXING (ERP) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CSV/XLSX Upload ──┐ │
│ ├──► booking_history_item rows inserted │
│ DATEV Import ─────┘ (searchable_text auto-generated) │
│ │ │
│ ▼ │
│ Background job: embed-new-items! │
│ │ │
│ ▼ │
│ search/embed (Google API, batched) │
│ │ │
│ ▼ │
│ UPDATE embedding column │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ QUERYING (Workers) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Invoice arrives for post-processing │
│ │ │
│ ▼ │
│ For each line item: │
│ query = "{supplier_name} | {description}" │
│ │ │
│ ▼ │
│ search/search (hybrid: BM25 + semantic + RRF) │
│ ├── BM25: ts_rank on searchable_text │
│ ├── Semantic: embedding <=> query_embedding │
│ └── RRF: fuse rankings (k=60) │
│ │ │
│ ▼ │
│ Deduplicate across line items │
│ │ │
│ ▼ │
│ Curate: cluster by (debit_account, cost_center), dedupe within │
│ │ │
│ ▼ │
│ Format as CSV ──► AccountsMatcher / CostCenterMatcher prompts │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
io-threadspikes/semantic-search/text-multilingual-embedding-002 (768 dimensions)