Domain: Semantic Search Comparison/Benchmark System Researched: 2026-02-20 Confidence: HIGH
+===========================================================================+
| PRESENTATION LAYER |
| +---------------------------+ +---------------------------+ |
| | Interactive HTML Dashboard | | Static Report Generator | |
| | (Flask + HTMX) | | (HTML Export) | |
| +-------------+-------------+ +-------------+-------------+ |
| | | |
+================|==============================|============================+
| |
+================|==============================|============================+
| | SEARCH/QUERY LAYER | |
| +-------------v--------------+ +------------v------------+ |
| | pgvector Search Engine | | LLM Context Matcher | |
| | (3 embedding models) | | (Gemini 2.5 Flash) | |
| +-------------+--------------+ +------------+------------+ |
| | | |
+================|==============================|============================+
| |
+================|==============================|============================+
| | EVALUATION LAYER | |
| +-------------v--------------+ +------------v------------+ |
| | Metrics Calculator | | LLM-as-Judge Evaluator | |
| | (Exact, Top-K, MRR) | | (Quality Assessment) | |
| +-------------+--------------+ +------------+------------+ |
| | | |
+================|==============================|============================+
| |
+================|==============================|============================+
| DATA LAYER |
| +-------------------+ +-------------------+ +-------------------+ |
| | Ground Truth | | Embeddings Store | | Results Store | |
| | (Test Queries + | | (pgvector, | | (Metrics, | |
| | Expected Results)| | 3 model columns) | | Comparisons) | |
| +-------------------+ +-------------------+ +-------------------+ |
| |
| +----------------------------------------------------------------------+ |
| | PostgreSQL 18 (Docker) | |
| +----------------------------------------------------------------------+ |
+============================================================================+
^
|
+================|==========================================================+
| | INGESTION PIPELINE |
| +-------------+-------------+ |
| | CSV Loader | |
| | (historical.csv) | |
| +-------------+-------------+ |
| | |
| +-------------v-------------+ |
| | Embedding Generator | +--> Google text-multilingual-002 |
| | (Batch Processing) | +--> Jina AI Embeddings |
| +-------------+-------------+ +--> all-MiniLM-L6-v2 (local) |
| | |
| +-------------v-------------+ |
| | Synthetic Query Generator | |
| | (LLM-based variations) | |
| +---------------------------+ |
+============================================================================+
| Component | Responsibility | Implementation |
|---|---|---|
| CSV Loader | Parse historical.csv, normalize data, validate schema | Python pandas, basic ETL |
| Embedding Generator | Generate embeddings for all 3 models, handle rate limits, batch efficiently | Python with model-specific SDKs |
| Synthetic Query Generator | Create test variations of existing descriptions | LLM (Gemini) for paraphrasing |
| pgvector Search Engine | Execute similarity searches, return top-K results | psycopg3 + pgvector extension |
| LLM Context Matcher | Replicate current Orcha approach (CSV to LLM) | Gemini 2.5 Flash API |
| Metrics Calculator | Compute exact match, top-K accuracy, MRR, latency | Pure Python computation |
| LLM-as-Judge Evaluator | Assess quality when exact match fails | Gemini/Claude API for scoring |
| HTML Dashboard | Interactive search, side-by-side comparison | Flask + HTMX + Tailwind |
| Report Generator | Export static HTML comparison reports | Jinja2 templates |
src/
|-- ingestion/ # Data pipeline
| |-- csv_loader.py # Parse and validate historical.csv
| |-- embeddings.py # Generate embeddings (all 3 models)
| |-- synthetic.py # Generate test query variations
| `-- db_writer.py # Batch insert to pgvector
|
|-- search/ # Search implementations
| |-- pgvector_search.py # Vector similarity search
| |-- llm_matcher.py # LLM context matching (Orcha approach)
| `-- models.py # Shared data structures
|
|-- evaluation/ # Benchmark framework
| |-- ground_truth.py # Test dataset management
| |-- metrics.py # Precision, recall, MRR, latency
| |-- llm_judge.py # LLM-as-judge evaluator
| `-- runner.py # Orchestrate evaluation runs
|
|-- dashboard/ # Web UI
| |-- app.py # Flask application
| |-- routes.py # API endpoints
| `-- templates/ # Jinja2 + HTMX templates
| |-- index.html # Main search interface
| |-- compare.html # Side-by-side comparison
| `-- report.html # Export template
|
|-- config/ # Configuration
| |-- settings.py # Environment-based config
| `-- models.py # Embedding model definitions
|
|-- db/ # Database
| |-- schema.sql # pgvector schema
| `-- connection.py # Connection pooling
|
`-- scripts/ # CLI entry points
|-- ingest.py # Run ingestion pipeline
|-- evaluate.py # Run benchmark suite
`-- serve.py # Start dashboard
What: Each pipeline stage writes to database, enabling restart from any point. When to use: Data processing that may fail (API rate limits, network issues). Trade-offs: More database writes, but eliminates re-work on failure.
Example:
# Each stage is idempotent and resumable
class EmbeddingPipeline:
def run(self, resume_from: str = None):
stages = [
("load_csv", self.load_csv),
("embed_google", lambda: self.embed("google")),
("embed_jina", lambda: self.embed("jina")),
("embed_minilm", lambda: self.embed("minilm")),
("generate_synthetic", self.generate_synthetic),
]
# Skip completed stages on resume
start_idx = 0
if resume_from:
start_idx = [s[0] for s in stages].index(resume_from)
for name, fn in stages[start_idx:]:
logger.info(f"Running stage: {name}")
fn()
self.checkpoint(name)
What: Common interface for different search implementations. When to use: Comparing multiple approaches with same evaluation framework. Trade-offs: Slight abstraction overhead, but enables fair comparison.
Example:
from abc import ABC, abstractmethod
from dataclasses import dataclass
@dataclass
class SearchResult:
item_id: str
score: float
gl_account: str
cost_center: str
class SearchBackend(ABC):
@abstractmethod
def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
pass
@abstractmethod
def name(self) -> str:
pass
class PgvectorSearch(SearchBackend):
def __init__(self, model: str): # "google", "jina", "minilm"
self.model = model
def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
embedding = self.embed_query(query)
# Vector similarity search
return self.db.query_similar(embedding, self.model, top_k)
def name(self) -> str:
return f"pgvector-{self.model}"
class LLMContextSearch(SearchBackend):
def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
# Get supplier context, send to LLM
context_csv = self.get_supplier_history(query)
return self.llm.match(query, context_csv, top_k)
def name(self) -> str:
return "llm-context"
What: Test queries paired with expected results for automated evaluation. When to use: Repeatable benchmarking across approaches. Trade-offs: Requires upfront dataset creation, but enables objective comparison.
Example:
@dataclass
class TestCase:
query: str # Modified description
source_item_id: str # Original item this derived from
expected_gl_account: str # Ground truth
expected_cost_center: str # Ground truth
variation_type: str # "synonym", "abbreviation", "typo", etc.
class GroundTruth:
def __init__(self, test_cases: list[TestCase]):
self.test_cases = test_cases
def evaluate(self, backend: SearchBackend) -> EvaluationResult:
results = []
for tc in self.test_cases:
start = time.perf_counter()
search_results = backend.search(tc.query)
latency = time.perf_counter() - start
results.append(SingleResult(
test_case=tc,
search_results=search_results,
latency_ms=latency * 1000
))
return EvaluationResult(backend.name(), results)
historical.csv
|
v
[CSV Loader] --> Validate schema, normalize text
|
v
[Embedding Generator]
|
+-- Google API --> embedding_google (768 dims)
+-- Jina API --> embedding_jina (768 dims)
+-- Local ML --> embedding_minilm (384 dims)
|
v
[DB Writer] --> COPY to line_items table
|
v
[Synthetic Generator] --> LLM creates query variations
|
v
[DB Writer] --> INSERT to test_queries table
User Query (from dashboard or test case)
|
+------------------------+------------------------+
| | |
v v v
[pgvector Google] [pgvector Jina] [pgvector MiniLM]
| | |
+------------------------+------------------------+
|
v
[LLM Context Matcher]
|
v
[Results Aggregator]
|
v
[Metrics Calculator]
|
v
[Dashboard Display]
Test Dataset (synthetic queries + ground truth)
|
v
[Evaluation Runner]
|
+-- For each backend:
| |
| v
| [Search Backend] --> Results
| |
| v
| [Metrics Calculator]
| |
| +-- Exact match rate
| +-- Top-K accuracy (K=3,5,10)
| +-- MRR (Mean Reciprocal Rank)
| +-- Latency (p50, p95, p99)
| +-- Cost per query
|
+-- For mismatches:
|
v
[LLM-as-Judge]
|
v
Quality score (when exact match fails)
|
v
[Report Generator] --> HTML comparison report
-- Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;
-- Main data table
CREATE TABLE line_items (
id SERIAL PRIMARY KEY,
supplier_name TEXT NOT NULL,
description TEXT NOT NULL,
net_amount NUMERIC(12,2),
debit_account TEXT,
credit_account TEXT,
cost_center TEXT,
-- Embeddings (one column per model)
embedding_google vector(768),
embedding_jina vector(768),
embedding_minilm vector(384),
-- Metadata
source_row INT, -- Row number in CSV for traceability
created_at TIMESTAMP DEFAULT NOW()
);
-- Indexes for each embedding type (HNSW for speed)
CREATE INDEX idx_embedding_google ON line_items
USING hnsw (embedding_google vector_cosine_ops);
CREATE INDEX idx_embedding_jina ON line_items
USING hnsw (embedding_jina vector_cosine_ops);
CREATE INDEX idx_embedding_minilm ON line_items
USING hnsw (embedding_minilm vector_cosine_ops);
-- Index on supplier for LLM context approach
CREATE INDEX idx_supplier ON line_items (supplier_name);
-- Test queries with ground truth
CREATE TABLE test_queries (
id SERIAL PRIMARY KEY,
query_text TEXT NOT NULL,
source_item_id INT REFERENCES line_items(id),
expected_gl_account TEXT NOT NULL,
expected_cost_center TEXT,
variation_type TEXT, -- "synonym", "abbreviation", "typo", "reorder"
-- Pre-computed embeddings for test queries
embedding_google vector(768),
embedding_jina vector(768),
embedding_minilm vector(384),
created_at TIMESTAMP DEFAULT NOW()
);
-- Evaluation results
CREATE TABLE evaluation_runs (
id SERIAL PRIMARY KEY,
run_name TEXT,
backend_name TEXT NOT NULL,
run_at TIMESTAMP DEFAULT NOW(),
-- Aggregate metrics
exact_match_rate NUMERIC(5,4),
top_3_accuracy NUMERIC(5,4),
top_5_accuracy NUMERIC(5,4),
top_10_accuracy NUMERIC(5,4),
mrr NUMERIC(5,4),
-- Latency stats (ms)
latency_p50 NUMERIC(8,2),
latency_p95 NUMERIC(8,2),
latency_p99 NUMERIC(8,2),
-- Cost tracking
total_queries INT,
total_cost_usd NUMERIC(10,6)
);
-- Individual query results for detailed analysis
CREATE TABLE query_results (
id SERIAL PRIMARY KEY,
run_id INT REFERENCES evaluation_runs(id),
test_query_id INT REFERENCES test_queries(id),
-- What was returned
returned_items JSONB, -- Array of {item_id, rank, score}
-- Evaluation
exact_match BOOLEAN,
rank_of_correct INT, -- NULL if not in results
latency_ms NUMERIC(8,2),
-- LLM judge score (when exact match fails)
llm_judge_score NUMERIC(3,2),
llm_judge_reasoning TEXT
);
What people do: Generate embeddings for queries during search. Why it's wrong: Adds 100-500ms latency per query; unfair comparison with pre-indexed LLM approach. Do this instead: Pre-compute test query embeddings during dataset creation. Store in test_queries table.
What people do: Compare Google embeddings vs LLM approach without controlling variables. Why it's wrong: Unclear if differences come from embedding quality or search approach. Do this instead: Compare each embedding model separately. Report results per model AND vs LLM baseline.
What people do: Test with exact strings from historical.csv. Why it's wrong: 100% accuracy guaranteed for vector search (finding itself). Not realistic. Do this instead: Use synthetic variations that test semantic understanding, not exact matching.
What people do: Only generate "easy" synthetic queries. Why it's wrong: Over-estimates real-world performance. Do this instead: Include multiple variation types: synonyms, abbreviations, typos, word reordering, multilingual (German descriptions).
| Service | Integration Pattern | Notes |
|---|---|---|
| Google Embedding API | REST via Python SDK | Rate limit: 1500 req/min. Batch requests of 100. |
| Jina AI Embedding API | REST API | May need API key setup. Check free tier limits. |
| Gemini 2.5 Flash | REST via Python SDK | For LLM matching + synthetic generation + LLM judge |
| all-MiniLM-L6-v2 | Local sentence-transformers | No API, runs in-process. GPU optional. |
| Boundary | Communication | Notes |
|---|---|---|
| ingestion <-> db | Direct psycopg3 | Batch COPY for performance |
| search <-> db | Connection pool | psycopg_pool for dashboard |
| evaluation <-> search | Python function calls | Strategy pattern interface |
| dashboard <-> evaluation | In-process | Flask calls evaluation.runner |
| Concern | This Spike (6K rows) | If Scaled to 100K | If Scaled to 1M |
|---|---|---|---|
| Embedding storage | ~50MB total | ~800MB | ~8GB (consider quantization) |
| Index build time | <1 min | ~5 min | ~30 min (build offline) |
| Query latency | <50ms | <100ms with HNSW | <200ms (tune ef_search) |
| Ingestion time | ~30 min (API rate limits) | ~5 hours | Batch across days |
For this spike: None of these are concerns. 6K rows is well within single-machine, single-run territory.
Recommended implementation order based on component dependencies:
Phase 1: Foundation
[db/schema.sql] # Database schema first
|
v
[config/settings.py] # Configuration
|
v
[db/connection.py] # Connection management
Phase 2: Ingestion Pipeline
[ingestion/csv_loader.py] # Parse CSV
|
v
[ingestion/embeddings.py] # Generate embeddings (depends on config)
|
v
[ingestion/db_writer.py] # Store in pgvector (depends on schema)
|
v
[ingestion/synthetic.py] # Generate test queries (depends on data)
Phase 3: Search Layer
[search/models.py] # Shared data structures
|
+--> [search/pgvector_search.py] # Vector search
|
+--> [search/llm_matcher.py] # LLM approach
Phase 4: Evaluation Framework
[evaluation/ground_truth.py] # Test dataset management
|
v
[evaluation/metrics.py] # Metric calculations
|
v
[evaluation/llm_judge.py] # LLM-as-judge (optional quality)
|
v
[evaluation/runner.py] # Orchestration
Phase 5: Dashboard
[dashboard/templates/*] # HTML templates
|
v
[dashboard/routes.py] # API endpoints
|
v
[dashboard/app.py] # Flask application
Phase 6: Integration
[scripts/ingest.py] # CLI for ingestion
[scripts/evaluate.py] # CLI for benchmarking
[scripts/serve.py] # CLI for dashboard
Architecture research for: Semantic Search Comparison System Researched: 2026-02-20