Architecture Research

Domain: Semantic Search Comparison/Benchmark System Researched: 2026-02-20 Confidence: HIGH

System Overview

+===========================================================================+
|                           PRESENTATION LAYER                               |
|  +---------------------------+  +---------------------------+              |
|  | Interactive HTML Dashboard |  | Static Report Generator  |              |
|  | (Flask + HTMX)             |  | (HTML Export)            |              |
|  +-------------+-------------+  +-------------+-------------+              |
|                |                              |                            |
+================|==============================|============================+
                 |                              |
+================|==============================|============================+
|                |      SEARCH/QUERY LAYER      |                            |
|  +-------------v--------------+  +------------v------------+               |
|  |   pgvector Search Engine   |  | LLM Context Matcher     |               |
|  |   (3 embedding models)     |  | (Gemini 2.5 Flash)      |               |
|  +-------------+--------------+  +------------+------------+               |
|                |                              |                            |
+================|==============================|============================+
                 |                              |
+================|==============================|============================+
|                |      EVALUATION LAYER        |                            |
|  +-------------v--------------+  +------------v------------+               |
|  |   Metrics Calculator       |  | LLM-as-Judge Evaluator  |               |
|  |   (Exact, Top-K, MRR)      |  | (Quality Assessment)    |               |
|  +-------------+--------------+  +------------+------------+               |
|                |                              |                            |
+================|==============================|============================+
                 |                              |
+================|==============================|============================+
|                         DATA LAYER                                         |
|  +-------------------+  +-------------------+  +-------------------+        |
|  | Ground Truth      |  | Embeddings Store  |  | Results Store     |        |
|  | (Test Queries +   |  | (pgvector,        |  | (Metrics,         |        |
|  |  Expected Results)|  |  3 model columns) |  |  Comparisons)     |        |
|  +-------------------+  +-------------------+  +-------------------+        |
|                                                                            |
|  +----------------------------------------------------------------------+  |
|  |                 PostgreSQL 18 (Docker)                                |  |
|  +----------------------------------------------------------------------+  |
+============================================================================+
                 ^
                 |
+================|==========================================================+
|                |      INGESTION PIPELINE                                   |
|  +-------------+-------------+                                             |
|  | CSV Loader                |                                             |
|  | (historical.csv)         |                                             |
|  +-------------+-------------+                                             |
|                |                                                           |
|  +-------------v-------------+                                             |
|  | Embedding Generator       |   +--> Google text-multilingual-002         |
|  | (Batch Processing)        |   +--> Jina AI Embeddings                   |
|  +-------------+-------------+   +--> all-MiniLM-L6-v2 (local)             |
|                |                                                           |
|  +-------------v-------------+                                             |
|  | Synthetic Query Generator |                                             |
|  | (LLM-based variations)    |                                             |
|  +---------------------------+                                             |
+============================================================================+

Component Responsibilities

Component Responsibility Implementation
CSV Loader Parse historical.csv, normalize data, validate schema Python pandas, basic ETL
Embedding Generator Generate embeddings for all 3 models, handle rate limits, batch efficiently Python with model-specific SDKs
Synthetic Query Generator Create test variations of existing descriptions LLM (Gemini) for paraphrasing
pgvector Search Engine Execute similarity searches, return top-K results psycopg3 + pgvector extension
LLM Context Matcher Replicate current Orcha approach (CSV to LLM) Gemini 2.5 Flash API
Metrics Calculator Compute exact match, top-K accuracy, MRR, latency Pure Python computation
LLM-as-Judge Evaluator Assess quality when exact match fails Gemini/Claude API for scoring
HTML Dashboard Interactive search, side-by-side comparison Flask + HTMX + Tailwind
Report Generator Export static HTML comparison reports Jinja2 templates
src/
|-- ingestion/              # Data pipeline
|   |-- csv_loader.py       # Parse and validate historical.csv
|   |-- embeddings.py       # Generate embeddings (all 3 models)
|   |-- synthetic.py        # Generate test query variations
|   `-- db_writer.py        # Batch insert to pgvector
|
|-- search/                 # Search implementations
|   |-- pgvector_search.py  # Vector similarity search
|   |-- llm_matcher.py      # LLM context matching (Orcha approach)
|   `-- models.py           # Shared data structures
|
|-- evaluation/             # Benchmark framework
|   |-- ground_truth.py     # Test dataset management
|   |-- metrics.py          # Precision, recall, MRR, latency
|   |-- llm_judge.py        # LLM-as-judge evaluator
|   `-- runner.py           # Orchestrate evaluation runs
|
|-- dashboard/              # Web UI
|   |-- app.py              # Flask application
|   |-- routes.py           # API endpoints
|   `-- templates/          # Jinja2 + HTMX templates
|       |-- index.html      # Main search interface
|       |-- compare.html    # Side-by-side comparison
|       `-- report.html     # Export template
|
|-- config/                 # Configuration
|   |-- settings.py         # Environment-based config
|   `-- models.py           # Embedding model definitions
|
|-- db/                     # Database
|   |-- schema.sql          # pgvector schema
|   `-- connection.py       # Connection pooling
|
`-- scripts/                # CLI entry points
    |-- ingest.py           # Run ingestion pipeline
    |-- evaluate.py         # Run benchmark suite
    `-- serve.py            # Start dashboard

Structure Rationale

Architectural Patterns

Pattern 1: Pipeline Stages with Checkpoints

What: Each pipeline stage writes to database, enabling restart from any point. When to use: Data processing that may fail (API rate limits, network issues). Trade-offs: More database writes, but eliminates re-work on failure.

Example:

# Each stage is idempotent and resumable
class EmbeddingPipeline:
    def run(self, resume_from: str = None):
        stages = [
            ("load_csv", self.load_csv),
            ("embed_google", lambda: self.embed("google")),
            ("embed_jina", lambda: self.embed("jina")),
            ("embed_minilm", lambda: self.embed("minilm")),
            ("generate_synthetic", self.generate_synthetic),
        ]

        # Skip completed stages on resume
        start_idx = 0
        if resume_from:
            start_idx = [s[0] for s in stages].index(resume_from)

        for name, fn in stages[start_idx:]:
            logger.info(f"Running stage: {name}")
            fn()
            self.checkpoint(name)

Pattern 2: Strategy Pattern for Search Backends

What: Common interface for different search implementations. When to use: Comparing multiple approaches with same evaluation framework. Trade-offs: Slight abstraction overhead, but enables fair comparison.

Example:

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class SearchResult:
    item_id: str
    score: float
    gl_account: str
    cost_center: str

class SearchBackend(ABC):
    @abstractmethod
    def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
        pass

    @abstractmethod
    def name(self) -> str:
        pass

class PgvectorSearch(SearchBackend):
    def __init__(self, model: str):  # "google", "jina", "minilm"
        self.model = model

    def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
        embedding = self.embed_query(query)
        # Vector similarity search
        return self.db.query_similar(embedding, self.model, top_k)

    def name(self) -> str:
        return f"pgvector-{self.model}"

class LLMContextSearch(SearchBackend):
    def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
        # Get supplier context, send to LLM
        context_csv = self.get_supplier_history(query)
        return self.llm.match(query, context_csv, top_k)

    def name(self) -> str:
        return "llm-context"

Pattern 3: Ground Truth Dataset Structure

What: Test queries paired with expected results for automated evaluation. When to use: Repeatable benchmarking across approaches. Trade-offs: Requires upfront dataset creation, but enables objective comparison.

Example:

@dataclass
class TestCase:
    query: str                    # Modified description
    source_item_id: str           # Original item this derived from
    expected_gl_account: str      # Ground truth
    expected_cost_center: str     # Ground truth
    variation_type: str           # "synonym", "abbreviation", "typo", etc.

class GroundTruth:
    def __init__(self, test_cases: list[TestCase]):
        self.test_cases = test_cases

    def evaluate(self, backend: SearchBackend) -> EvaluationResult:
        results = []
        for tc in self.test_cases:
            start = time.perf_counter()
            search_results = backend.search(tc.query)
            latency = time.perf_counter() - start

            results.append(SingleResult(
                test_case=tc,
                search_results=search_results,
                latency_ms=latency * 1000
            ))
        return EvaluationResult(backend.name(), results)

Data Flow

Ingestion Flow

historical.csv
    |
    v
[CSV Loader] --> Validate schema, normalize text
    |
    v
[Embedding Generator]
    |
    +-- Google API --> embedding_google (768 dims)
    +-- Jina API   --> embedding_jina (768 dims)
    +-- Local ML   --> embedding_minilm (384 dims)
    |
    v
[DB Writer] --> COPY to line_items table
    |
    v
[Synthetic Generator] --> LLM creates query variations
    |
    v
[DB Writer] --> INSERT to test_queries table

Query Flow

User Query (from dashboard or test case)
    |
    +------------------------+------------------------+
    |                        |                        |
    v                        v                        v
[pgvector Google]    [pgvector Jina]    [pgvector MiniLM]
    |                        |                        |
    +------------------------+------------------------+
                             |
                             v
                    [LLM Context Matcher]
                             |
                             v
                    [Results Aggregator]
                             |
                             v
                    [Metrics Calculator]
                             |
                             v
                    [Dashboard Display]

Evaluation Flow

Test Dataset (synthetic queries + ground truth)
    |
    v
[Evaluation Runner]
    |
    +-- For each backend:
    |       |
    |       v
    |   [Search Backend] --> Results
    |       |
    |       v
    |   [Metrics Calculator]
    |       |
    |       +-- Exact match rate
    |       +-- Top-K accuracy (K=3,5,10)
    |       +-- MRR (Mean Reciprocal Rank)
    |       +-- Latency (p50, p95, p99)
    |       +-- Cost per query
    |
    +-- For mismatches:
            |
            v
        [LLM-as-Judge]
            |
            v
        Quality score (when exact match fails)
    |
    v
[Report Generator] --> HTML comparison report

Database Schema

-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Main data table
CREATE TABLE line_items (
    id SERIAL PRIMARY KEY,
    supplier_name TEXT NOT NULL,
    description TEXT NOT NULL,
    net_amount NUMERIC(12,2),
    debit_account TEXT,
    credit_account TEXT,
    cost_center TEXT,

    -- Embeddings (one column per model)
    embedding_google vector(768),
    embedding_jina vector(768),
    embedding_minilm vector(384),

    -- Metadata
    source_row INT,  -- Row number in CSV for traceability
    created_at TIMESTAMP DEFAULT NOW()
);

-- Indexes for each embedding type (HNSW for speed)
CREATE INDEX idx_embedding_google ON line_items
    USING hnsw (embedding_google vector_cosine_ops);
CREATE INDEX idx_embedding_jina ON line_items
    USING hnsw (embedding_jina vector_cosine_ops);
CREATE INDEX idx_embedding_minilm ON line_items
    USING hnsw (embedding_minilm vector_cosine_ops);

-- Index on supplier for LLM context approach
CREATE INDEX idx_supplier ON line_items (supplier_name);

-- Test queries with ground truth
CREATE TABLE test_queries (
    id SERIAL PRIMARY KEY,
    query_text TEXT NOT NULL,
    source_item_id INT REFERENCES line_items(id),
    expected_gl_account TEXT NOT NULL,
    expected_cost_center TEXT,
    variation_type TEXT,  -- "synonym", "abbreviation", "typo", "reorder"

    -- Pre-computed embeddings for test queries
    embedding_google vector(768),
    embedding_jina vector(768),
    embedding_minilm vector(384),

    created_at TIMESTAMP DEFAULT NOW()
);

-- Evaluation results
CREATE TABLE evaluation_runs (
    id SERIAL PRIMARY KEY,
    run_name TEXT,
    backend_name TEXT NOT NULL,
    run_at TIMESTAMP DEFAULT NOW(),

    -- Aggregate metrics
    exact_match_rate NUMERIC(5,4),
    top_3_accuracy NUMERIC(5,4),
    top_5_accuracy NUMERIC(5,4),
    top_10_accuracy NUMERIC(5,4),
    mrr NUMERIC(5,4),

    -- Latency stats (ms)
    latency_p50 NUMERIC(8,2),
    latency_p95 NUMERIC(8,2),
    latency_p99 NUMERIC(8,2),

    -- Cost tracking
    total_queries INT,
    total_cost_usd NUMERIC(10,6)
);

-- Individual query results for detailed analysis
CREATE TABLE query_results (
    id SERIAL PRIMARY KEY,
    run_id INT REFERENCES evaluation_runs(id),
    test_query_id INT REFERENCES test_queries(id),

    -- What was returned
    returned_items JSONB,  -- Array of {item_id, rank, score}

    -- Evaluation
    exact_match BOOLEAN,
    rank_of_correct INT,  -- NULL if not in results
    latency_ms NUMERIC(8,2),

    -- LLM judge score (when exact match fails)
    llm_judge_score NUMERIC(3,2),
    llm_judge_reasoning TEXT
);

Anti-Patterns to Avoid

Anti-Pattern 1: Embedding at Query Time

What people do: Generate embeddings for queries during search. Why it's wrong: Adds 100-500ms latency per query; unfair comparison with pre-indexed LLM approach. Do this instead: Pre-compute test query embeddings during dataset creation. Store in test_queries table.

Anti-Pattern 2: Mixing Embedding Models in Comparison

What people do: Compare Google embeddings vs LLM approach without controlling variables. Why it's wrong: Unclear if differences come from embedding quality or search approach. Do this instead: Compare each embedding model separately. Report results per model AND vs LLM baseline.

Anti-Pattern 3: Evaluating on Training Data

What people do: Test with exact strings from historical.csv. Why it's wrong: 100% accuracy guaranteed for vector search (finding itself). Not realistic. Do this instead: Use synthetic variations that test semantic understanding, not exact matching.

Anti-Pattern 4: Ignoring Edge Cases in Ground Truth

What people do: Only generate "easy" synthetic queries. Why it's wrong: Over-estimates real-world performance. Do this instead: Include multiple variation types: synonyms, abbreviations, typos, word reordering, multilingual (German descriptions).

Integration Points

External Services

Service Integration Pattern Notes
Google Embedding API REST via Python SDK Rate limit: 1500 req/min. Batch requests of 100.
Jina AI Embedding API REST API May need API key setup. Check free tier limits.
Gemini 2.5 Flash REST via Python SDK For LLM matching + synthetic generation + LLM judge
all-MiniLM-L6-v2 Local sentence-transformers No API, runs in-process. GPU optional.

Internal Boundaries

Boundary Communication Notes
ingestion <-> db Direct psycopg3 Batch COPY for performance
search <-> db Connection pool psycopg_pool for dashboard
evaluation <-> search Python function calls Strategy pattern interface
dashboard <-> evaluation In-process Flask calls evaluation.runner

Scaling Considerations

Concern This Spike (6K rows) If Scaled to 100K If Scaled to 1M
Embedding storage ~50MB total ~800MB ~8GB (consider quantization)
Index build time <1 min ~5 min ~30 min (build offline)
Query latency <50ms <100ms with HNSW <200ms (tune ef_search)
Ingestion time ~30 min (API rate limits) ~5 hours Batch across days

Scaling Priorities

  1. First bottleneck: Embedding API rate limits during ingestion. Mitigate with batching, retry logic, and local model (MiniLM) for testing.
  2. Second bottleneck: HNSW index memory. At 1M vectors, consider IVFFlat or vector quantization.

For this spike: None of these are concerns. 6K rows is well within single-machine, single-run territory.

Build Order (Dependencies)

Recommended implementation order based on component dependencies:

Phase 1: Foundation
   [db/schema.sql]           # Database schema first
          |
          v
   [config/settings.py]      # Configuration
          |
          v
   [db/connection.py]        # Connection management

Phase 2: Ingestion Pipeline
   [ingestion/csv_loader.py] # Parse CSV
          |
          v
   [ingestion/embeddings.py] # Generate embeddings (depends on config)
          |
          v
   [ingestion/db_writer.py]  # Store in pgvector (depends on schema)
          |
          v
   [ingestion/synthetic.py]  # Generate test queries (depends on data)

Phase 3: Search Layer
   [search/models.py]        # Shared data structures
          |
          +--> [search/pgvector_search.py]  # Vector search
          |
          +--> [search/llm_matcher.py]      # LLM approach

Phase 4: Evaluation Framework
   [evaluation/ground_truth.py]  # Test dataset management
          |
          v
   [evaluation/metrics.py]       # Metric calculations
          |
          v
   [evaluation/llm_judge.py]     # LLM-as-judge (optional quality)
          |
          v
   [evaluation/runner.py]        # Orchestration

Phase 5: Dashboard
   [dashboard/templates/*]    # HTML templates
          |
          v
   [dashboard/routes.py]      # API endpoints
          |
          v
   [dashboard/app.py]         # Flask application

Phase 6: Integration
   [scripts/ingest.py]        # CLI for ingestion
   [scripts/evaluate.py]      # CLI for benchmarking
   [scripts/serve.py]         # CLI for dashboard

Rationale for Build Order

  1. Database first: Everything depends on schema. Cannot store embeddings without table structure.
  2. Ingestion before search: Need data in database before querying it.
  3. Search before evaluation: Evaluation calls search backends.
  4. Dashboard last: Thin layer over search and evaluation; needs both complete.
  5. Scripts as integration: Entry points wire everything together.

Sources


Architecture research for: Semantic Search Comparison System Researched: 2026-02-20