Note (2026-04-24): After this document was written, legal_entity was renamed to tenant and the old tenant was renamed to organization. Read references to these terms with the pre-rename meaning.

Search Integration Design

Integrate semantic search from the spikes/semantic-search spike into Orcha, replacing the pg_trgm-based booking history lookup with hybrid search (BM25 + semantic + RRF).

Context

AccountsMatcher and CostCenterMatcher currently use fetch-supplier-booking-history which matches on supplier name using pg_trgm similarity (threshold 0.7). This misses semantically similar bookings with different wording.

The spike validated that hybrid search (BM25 + Google embeddings + Reciprocal Rank Fusion) provides better retrieval quality for historical booking candidates.

Goals

Replace pg_trgm lookup with hybrid semantic search
Focus on Google embeddings only (not MiniLM)
Abstract the search module for future reuse (invoice <-> PO <-> Contract matching)
Keep curation logic (clustering, deduplication) in Workers, not in the search module

Non-Goals

MiniLM embeddings (spike supported both, we only need Google)
Changing the LLM prompts for AccountsMatcher/CostCenterMatcher
Document-level semantic search (future work)

Module Structure

com.getorcha.search/
├── core.clj          ; Public API: embed, search
├── embedding.clj     ; Google embedding client
├── bm25.clj          ; PostgreSQL full-text search
├── vector.clj        ; pgvector similarity search
└── rrf.clj           ; Reciprocal Rank Fusion

Root-level module (com.getorcha.search) since it's used by multiple services (ERP for indexing, Workers for querying).

Public API

`search/embed`

Generates embeddings for a batch of texts via Google API. Returns a core.async channel of [index embedding] pairs.

(defn embed
  "Generates embeddings for a batch of texts via Google API.
   Returns a core.async channel of [index embedding] pairs.

   Caller is responsible for:
   - Filtering texts that already have embeddings (caching)
   - Persisting returned embeddings to their domain table"
  [texts]
  ...)

Uses io-thread for blocking HTTP calls, batches of 100, 200ms delay between batches for rate limiting.

`search/search`

Hybrid search combining BM25 full-text and semantic similarity.

(defn search
  "Hybrid search combining BM25 full-text and semantic similarity.
   Returns top-k results ranked by Reciprocal Rank Fusion.

   conn        - database connection
   search-spec - {:table keyword
                  :id-column keyword
                  :embedding-column keyword
                  :text-column keyword
                  :filters map}
   query       - query text (embedding generated internally)
   opts        - {:k 10, :semantic-k 20, :bm25-k 20, :rrf-k 60}"
  [conn search-spec query opts]
  ...)

Schema Changes

Migration for booking_history_item:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Add searchable text (generated column for consistency)
ALTER TABLE booking_history_item
  ADD COLUMN searchable_text TEXT
    GENERATED ALWAYS AS (supplier_name || ' | ' || description) STORED;

-- Add embedding column (768 dimensions for Google's text-multilingual-embedding-002)
ALTER TABLE booking_history_item
  ADD COLUMN embedding vector(768);

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX booking_history_item_embedding_idx
  ON booking_history_item
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- GIN index for BM25 full-text search
CREATE INDEX booking_history_item_searchable_text_idx
  ON booking_history_item
  USING gin (to_tsvector('simple', searchable_text));

HNSW Parameters

Parameter	Value	When Used	Purpose
m	16	Index build	Max connections per node. Higher = better recall, larger index.
ef_construction	64	Index build	Search width when building. Higher = better quality, slower build.
ef_search	100	Query time	Search width when querying. Higher = better recall, slower queries.

Embedding Implementation

Google API client using text-multilingual-embedding-002:

Task type distinction: RETRIEVAL_DOCUMENT for indexing, RETRIEVAL_QUERY for search
Batch size: 100 (API max is 250, conservative for rate limits)
Rate limiting: 200ms delay between batches
Dimensions: 768
Auth: Reuse existing GCP credentials (same as Document AI)

Streaming results via core.async channel:

(defn embed [texts]
  (let [out (chan 100)]
    (io-thread
      (doseq [[batch-idx batch] (map-indexed vector (partition-all 100 texts))]
        (let [start-idx (* batch-idx batch-size)
              embeddings (call-google-api batch "RETRIEVAL_DOCUMENT")]
          (doseq [[i emb] (map-indexed vector embeddings)]
            (>!! out [(+ start-idx i) emb])))
        (Thread/sleep 200))
      (close! out))
    out))

BM25 Search

PostgreSQL full-text search using ts_rank:

Uses 'simple' text search config (no stemming) - works for German + English mixed content
plainto_tsquery handles user input safely
Returns results with :rank (1-indexed position) for RRF fusion

Vector Search

pgvector similarity search:

Sets ef_search = 100 for good recall
<=> operator for cosine distance (range [0, 2])
Score = 1 - distance gives similarity in [0, 1]
Filters out rows where embedding is NULL
Returns results with :rank for RRF fusion

Reciprocal Rank Fusion

Combines BM25 and semantic rankings:

RRF score for document d = Σ 1/(k + rank_in_list)

k=60 from original RRF paper (Cormack et al. 2009)
Documents appearing in both lists get boosted
Higher k reduces impact of any single high ranking

Integration Points

ERP: Indexing

In com.getorcha.erp.http.settings/booking_history.clj:

(defn ^:private embed-new-items!
  "Generates embeddings for booking history items that don't have them yet.
   Called as background job after upload or DATEV import completes."
  [pool legal-entity-id]
  (let [items (jdbc/execute! pool
                ["SELECT id, searchable_text
                  FROM booking_history_item
                  WHERE legal_entity_id = ?
                    AND embedding IS NULL
                    AND deleted_at IS NULL"
                 legal-entity-id])
        texts (mapv :booking-history-item/searchable-text items)
        ids (mapv :booking-history-item/id items)]

    (when (seq texts)
      (let [embed-ch (search/embed texts)]
        (loop []
          (when-let [[idx embedding] (<!! embed-ch)]
            (jdbc/execute! pool
              ["UPDATE booking_history_item
                SET embedding = ?::vector
                WHERE id = ?"
               (str embedding) (nth ids idx)])
            (recur)))))))

Workers: Querying

Replace fetch-supplier-booking-history in post_process.clj:

(defn ^:private fetch-booking-history-candidates
  "Fetches semantically similar booking history for invoice line items."
  [pool legal-entity-id supplier-name line-items opts]
  (let [{:keys [k] :or {k 10}} opts
        search-spec {:table :booking_history_item
                     :id-column :id
                     :embedding-column :embedding
                     :text-column :searchable_text
                     :filters {:legal_entity_id legal-entity-id
                               :deleted_at nil}}

        ;; Search for each line item
        all-results (mapcat
                      (fn [line-item]
                        (let [query-text (str supplier-name " | " line-item)]
                          (search/search pool search-spec query-text {:k k})))
                      line-items)

        ;; Deduplicate by id, keeping highest score
        deduped (vals (reduce
                        (fn [acc {:keys [id rrf-score] :as row}]
                          (if (or (not (contains? acc id))
                                  (> rrf-score (:rrf-score (get acc id))))
                            (assoc acc id row)
                            acc))
                        {}
                        all-results))]

    deduped))

Curation

Stays in Workers (post_process.clj), not in the search module:

(defn ^:private curate-candidates
  "Clusters candidates by (debit-account, cost-center) and deduplicates
   similar descriptions within each cluster."
  [candidates {:keys [dedup-threshold] :or {dedup-threshold 0.9}}]
  ...)

Attaches :cluster-count showing how many original matches had this booking pattern.

Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                              INDEXING (ERP)                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  CSV/XLSX Upload ──┐                                                        │
│                    ├──► booking_history_item rows inserted                  │
│  DATEV Import ─────┘    (searchable_text auto-generated)                    │
│                                │                                            │
│                                ▼                                            │
│                    Background job: embed-new-items!                         │
│                                │                                            │
│                                ▼                                            │
│                    search/embed (Google API, batched)                       │
│                                │                                            │
│                                ▼                                            │
│                    UPDATE embedding column                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                            QUERYING (Workers)                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Invoice arrives for post-processing                                       │
│         │                                                                   │
│         ▼                                                                   │
│  For each line item:                                                        │
│    query = "{supplier_name} | {description}"                                │
│         │                                                                   │
│         ▼                                                                   │
│  search/search (hybrid: BM25 + semantic + RRF)                              │
│    ├── BM25: ts_rank on searchable_text                                     │
│    ├── Semantic: embedding <=> query_embedding                              │
│    └── RRF: fuse rankings (k=60)                                            │
│         │                                                                   │
│         ▼                                                                   │
│  Deduplicate across line items                                              │
│         │                                                                   │
│         ▼                                                                   │
│  Curate: cluster by (debit_account, cost_center), dedupe within             │
│         │                                                                   │
│         ▼                                                                   │
│  Format as CSV ──► AccountsMatcher / CostCenterMatcher prompts              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Dependencies

New

pgvector extension - PostgreSQL vector similarity search
Google Vertex AI SDK (or HTTP client) - for embedding API

Upgrades

core.async 1.8.741 → 1.9.x - for virtual thread support in io-thread

Reference

Spike implementation: spikes/semantic-search/
Google embedding model: text-multilingual-embedding-002 (768 dimensions)
RRF paper: Cormack et al. 2009, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"