Researched: 2026-02-20 Domain: Vector similarity search, Flask web framework, LLM API integration Confidence: HIGH
Phase 3 implements the search interface for querying the semantic search system. Users submit a query via a minimal web interface and receive results from four sources in parallel: pgvector nearest neighbor search using three embedding models (Google, Jina, MiniLM), plus LLM context matching replicating Orcha's approach. The existing infrastructure from Phases 1-2 provides ~6K line items with pre-computed embeddings in HNSW-indexed columns.
The pgvector search component is straightforward: embed the query text using the same models, then use the <=> cosine distance operator to find top-K nearest neighbors. Flask provides a minimal web framework suitable for this spike's single-user evaluation use case. The LLM context matching requires replicating Orcha's approach: fetching similar historical bookings based on supplier name similarity (pg_trgm), formatting as CSV context, and prompting Gemini Flash to suggest GL account and cost center.
Running all four searches in parallel via concurrent.futures.ThreadPoolExecutor ensures responsive UI. Each search is I/O-bound (database query or API call), making threading ideal.
Primary recommendation: Use Flask with a single route for search, ThreadPoolExecutor for parallel execution of all 4 search methods, and replicate Orcha's LLM prompt structure exactly for accurate comparison.
<user_constraints>
None - discussion stayed within phase scope </user_constraints>
<phase_requirements>
| ID | Description | Research Support |
|---|---|---|
| SRCH-01 | Single query text input interface | Flask route with simple HTML form, single text input field plus K dropdown. See Flask Web Interface section. |
| SRCH-02 | Top-K retrieval with adjustable K (3, 5, 10) | pgvector ORDER BY embedding <=> query_vector LIMIT K with K from dropdown. See pgvector Search section. |
| SRCH-03 | Display similarity scores per result | pgvector returns distance; convert to similarity with 1 - distance for cosine. Include in result template. |
| SRCH-05 | LLM context matching (replicate Orcha approach with Gemini 2.5 Flash) | Use google-genai SDK, replicate Orcha's booking history lookup and prompt structure. See LLM Context Matching section. |
| SRCH-06 | Display results with supplier, description, GL accounts, cost center | Include all fields in search results, format in side-by-side columns per embedding model. |
| </phase_requirements> |
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| Flask | 3.1+ | Web framework | Minimal, well-documented, perfect for spikes. Official 2026 docs recommend simple routes for prototypes. |
| google-genai | 1.64+ | Gemini API access | Already in project. Official unified SDK for all Google generative AI. |
| psycopg | 3.3+ | PostgreSQL with pgvector | Already in project. Modern async-capable driver with native pgvector support. |
| concurrent.futures | stdlib | Parallel search execution | Built-in ThreadPoolExecutor for I/O-bound parallel tasks. No external dependency. |
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| sentence-transformers | 5.2+ | MiniLM query embedding | Already in project. Embed query text locally before pgvector search. |
| pgvector | 0.4+ | Vector type handling | Already in project. register_vector() enables proper type coercion. |
| Jinja2 | (via Flask) | HTML templates | Flask's built-in templating for results display. |
| Instead of | Could Use | Tradeoff |
|---|---|---|
| Flask | FastAPI | FastAPI has native async, but Flask is simpler for this spike size. Async not needed with ThreadPoolExecutor. |
| ThreadPoolExecutor | asyncio/aiohttp | async is more efficient at scale, but ThreadPool is simpler and sufficient for 4 parallel calls. |
| Server-side rendering | React/Vue SPA | SPA adds complexity. Server-side HTML is fastest for this single-user evaluation tool. |
Installation:
uv add flask
# All other dependencies already present from Phases 1-2
src/
├── app.py # Flask application
├── search/
│ ├── __init__.py
│ ├── pgvector_search.py # Semantic search functions
│ └── llm_matching.py # Orcha-style LLM context matching
├── templates/
│ ├── base.html # Common layout
│ └── search.html # Search form + results
├── db.py # Existing
├── embeddings/ # Existing
│ ├── google_embed.py # Reuse for query embedding
│ ├── jina_embed.py # Reuse for query embedding
│ └── minilm_embed.py # Reuse for query embedding
└── normalize.py # Existing
What: Query embeddings using cosine distance operator When to use: All semantic search queries Example:
# Source: pgvector GitHub documentation
def search_pgvector(
conn,
query_embedding: list[float],
embedding_column: str,
k: int = 5
) -> list[dict]:
"""
Find K nearest neighbors using pgvector cosine distance.
Returns results with similarity score (1 - distance).
"""
with conn.cursor() as cur:
cur.execute(f"""
SELECT
id,
supplier_name,
description,
debit_account,
credit_account,
cost_center,
net_amount,
1 - ({embedding_column} <=> %s::vector) as similarity
FROM line_item
WHERE {embedding_column} IS NOT NULL
ORDER BY {embedding_column} <=> %s::vector
LIMIT %s
""", (query_embedding, query_embedding, k))
columns = [desc[0] for desc in cur.description]
return [dict(zip(columns, row)) for row in cur.fetchall()]
What: Run all 4 search methods concurrently using ThreadPoolExecutor When to use: Every search request Example:
# Source: Python concurrent.futures documentation
from concurrent.futures import ThreadPoolExecutor, as_completed
def search_all_models(
conn,
query_text: str,
k: int = 5
) -> dict:
"""Execute all searches in parallel."""
def search_google():
embedding = embed_google([query_text])[0]
return ('google', search_pgvector(conn, embedding, 'embedding_google', k))
def search_jina():
embedding = embed_jina([query_text])[0]
return ('jina', search_pgvector(conn, embedding, 'embedding_jina', k))
def search_minilm():
embedding = embed_minilm([query_text])[0]
return ('minilm', search_pgvector(conn, embedding, 'embedding_minilm', k))
def search_llm():
return ('llm', llm_context_match(conn, query_text))
results = {}
with ThreadPoolExecutor(max_workers=4) as executor:
futures = [
executor.submit(search_google),
executor.submit(search_jina),
executor.submit(search_minilm),
executor.submit(search_llm),
]
for future in as_completed(futures):
key, result = future.result()
results[key] = result
return results
What: Fetch similar historical bookings, format as prompt context, call Gemini
When to use: LLM prediction column
Based on: Orcha's post_process.clj fetch-supplier-booking-history and accounts-match prompt
Example:
# Source: Orcha post_process.clj analysis
from google import genai
def llm_context_match(conn, query_text: str) -> dict:
"""
Replicate Orcha's LLM context matching approach.
1. Parse supplier name from query (use first part before |)
2. Fetch similar historical bookings via pg_trgm
3. Format as CSV context
4. Prompt Gemini Flash for GL account + cost center
"""
# Extract supplier hint from query
# Query format: "supplier_name | description"
parts = query_text.split('|', 1)
supplier_hint = parts[0].strip() if parts else query_text
# Fetch similar bookings (Orcha uses pg_trgm similarity >= 0.7)
historical = fetch_booking_history(conn, supplier_hint)
# Format prompt matching Orcha structure
prompt = build_matching_prompt(query_text, historical)
# Call Gemini Flash
client = genai.Client()
response = client.models.generate_content(
model='gemini-2.5-flash',
contents=prompt,
config={
'response_mime_type': 'application/json',
'temperature': 0,
}
)
# Parse JSON response
return parse_llm_response(response.text)
def fetch_booking_history(conn, supplier_name: str, limit: int = 50) -> list[dict]:
"""
Fetch historical bookings similar to supplier name.
Replicates Orcha's pg_trgm similarity search.
"""
from src.normalize import normalize_supplier_name
supplier_norm = normalize_supplier_name(supplier_name)
if not supplier_norm:
return []
with conn.cursor() as cur:
cur.execute("""
SELECT
supplier_name,
description,
net_amount,
debit_account,
cost_center
FROM line_item
WHERE supplier_name_normalized % %s
ORDER BY similarity(supplier_name_normalized, %s) DESC
LIMIT %s
""", (supplier_norm, supplier_norm, limit))
columns = [desc[0] for desc in cur.description]
return [dict(zip(columns, row)) for row in cur.fetchall()]
What: Single route handling both GET (form) and POST (search) When to use: Minimal web interface Example:
# Source: Flask documentation
from flask import Flask, render_template, request
app = Flask(__name__)
@app.route('/', methods=['GET', 'POST'])
def search():
results = None
query = ''
k = 5
if request.method == 'POST':
query = request.form.get('query', '')
k = int(request.form.get('k', 5))
if query.strip():
conn = get_connection()
try:
results = search_all_models(conn, query, k)
finally:
conn.close()
return render_template('search.html',
query=query,
k=k,
results=results)
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Web framework | Custom HTTP server | Flask | Routing, templating, form handling all built-in |
| Parallel execution | Manual threading | ThreadPoolExecutor | Proper cleanup, exception handling, result collection |
| Vector similarity | Manual distance calc | pgvector <=> operator |
Hardware-optimized, uses HNSW index |
| JSON response parsing | Regex extraction | json.loads() |
Handles edge cases, proper error messages |
| HTML escaping | String concatenation | Jinja2 templates | XSS prevention, cleaner code |
Key insight: This phase combines existing well-solved problems (web framework, parallel execution, vector search) with one domain-specific task (replicating Orcha's LLM prompt structure). Focus implementation effort on the LLM matching accuracy.
What goes wrong: Using wrong task_type for query embedding produces poor results
Why it happens: Embedding models have asymmetric retrieval modes (query vs document)
How to avoid: Use task='RETRIEVAL_QUERY' for Google, task='retrieval.query' for Jina when embedding the search query
Warning signs: Good embeddings but poor search relevance
What goes wrong: Database connections leak, pool exhausted Why it happens: Exception in search code before conn.close() How to avoid: Use try/finally or context manager for connection handling Warning signs: "too many connections" errors after multiple searches
What goes wrong: operator does not exist: text % text error
Why it happens: pg_trgm not installed for similarity search
How to avoid: Add CREATE EXTENSION IF NOT EXISTS pg_trgm; to init.sql
Warning signs: SQL error on first LLM context match attempt
What goes wrong: LLM returns markdown code blocks or malformed JSON
Why it happens: Gemini wraps JSON in ```json blocks or adds commentary
How to avoid: Use response_mime_type: 'application/json' config, implement strip_markdown_json fallback
Warning signs: JSONDecodeError on LLM responses
What goes wrong: One search fails but UI shows partial results without error Why it happens: Not calling future.result() or not handling exceptions How to avoid: Use as_completed() and call result() with try/except per future Warning signs: Missing columns in results, no error messages
What goes wrong: Search is slow (seconds instead of milliseconds)
Why it happens: Query plan doesn't use index due to missing probes setting
How to avoid: Set SET hnsw.ef_search = 40; for query sessions
Warning signs: EXPLAIN shows Seq Scan instead of Index Scan
# Source: google-genai documentation, adapted for query task
from google import genai
from google.genai.types import EmbedContentConfig
def embed_query_google(query_text: str) -> list[float]:
"""Embed query text for retrieval (not document storage)."""
client = genai.Client()
response = client.models.embed_content(
model='text-multilingual-embedding-002',
contents=[query_text],
config=EmbedContentConfig(
task_type='RETRIEVAL_QUERY', # Different from RETRIEVAL_DOCUMENT!
),
)
return response.embeddings[0].values
# Source: Jina API documentation
def embed_query_jina(query_text: str) -> list[float]:
"""Embed query text for retrieval."""
response = requests.post(
'https://api.jina.ai/v1/embeddings',
headers={
'Content-Type': 'application/json',
'Authorization': f'Bearer {JINA_API_KEY}',
},
json={
'input': [query_text],
'model': 'jina-embeddings-v3',
'dimensions': 1024,
'task': 'retrieval.query', # Different from retrieval.passage!
},
)
response.raise_for_status()
return response.json()['data'][0]['embedding']
# Source: google-genai documentation, Orcha config.edn analysis
from google import genai
def call_gemini_flash(prompt: str) -> str:
"""Call Gemini Flash for JSON response."""
client = genai.Client() # Uses GOOGLE_API_KEY env var
response = client.models.generate_content(
model='gemini-2.5-flash',
contents=prompt,
config={
'response_mime_type': 'application/json',
'temperature': 0, # Deterministic output
}
)
return response.text
# Source: Analysis of Orcha post_process.clj accounts-match prompt
def build_matching_prompt(query_text: str, historical_bookings: list[dict]) -> str:
"""
Build prompt matching Orcha's accounts-match structure.
Orcha provides:
- Historical bookings as CSV
- Current invoice/query as JSON
- Instructions to match GL account and cost center
"""
# Format historical bookings as CSV (matches Orcha's dataset->csv)
booking_csv = format_as_csv(historical_bookings) if historical_bookings else ""
prompt = f"""You are matching a line item to GL accounts and cost centers.
Based on the query and historical bookings for similar suppliers, suggest:
1. The most likely debit GL account
2. The most likely cost center (if applicable)
--- HISTORICAL BOOKINGS FOR SIMILAR SUPPLIERS (CSV) ---
{booking_csv if booking_csv else "No historical data available."}
--- QUERY LINE ITEM ---
{query_text}
Respond with JSON:
{{"debit_account": "<account or null>", "cost_center": "<center or null>"}}
If you cannot determine with reasonable confidence, use null."""
return prompt
def format_as_csv(data: list[dict]) -> str:
"""Format list of dicts as CSV string (matches Orcha's dataset->csv)."""
if not data:
return ""
# Get keys that have non-null values
keys = list(data[0].keys())
header = ",".join(keys)
rows = []
for item in data:
row = ",".join(str(item.get(k, "")) for k in keys)
rows.append(row)
return header + "\n" + "\n".join(rows)
<!-- templates/search.html -->
<!DOCTYPE html>
<html>
<head>
<title>Semantic Search Comparison</title>
<style>
.results-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 20px; }
.column { border: 1px solid #ccc; padding: 10px; }
.result-item { border-bottom: 1px solid #eee; padding: 10px 0; }
.similarity { color: green; font-weight: bold; }
</style>
</head>
<body>
<h1>Semantic Search Comparison</h1>
<form method="post">
<input type="text" name="query" value="{{ query }}" size="80"
placeholder="supplier name | description">
<select name="k">
<option value="3" {{ 'selected' if k == 3 }}>Top 3</option>
<option value="5" {{ 'selected' if k == 5 }}>Top 5</option>
<option value="10" {{ 'selected' if k == 10 }}>Top 10</option>
</select>
<button type="submit">Search</button>
</form>
{% if results %}
<div class="results-grid">
<div class="column">
<h2>Google (768d)</h2>
{% for r in results.google %}
<div class="result-item">
<div><strong>{{ r.supplier_name }}</strong></div>
<div>{{ r.description }}</div>
<div>GL: {{ r.debit_account }} | CC: {{ r.cost_center or '-' }}</div>
<div class="similarity">{{ "%.3f"|format(r.similarity) }}</div>
</div>
{% endfor %}
</div>
<!-- Similar columns for Jina, MiniLM, LLM -->
</div>
{% endif %}
</body>
</html>
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| google-generativeai SDK | google-genai SDK | Nov 2025 | Unified API for Gemini, Vertex AI, embeddings |
| Flask sync-only | Flask 2.x async support | 2021 | Can use async routes, but ThreadPool still simpler for this case |
| IVFFlat indexes | HNSW indexes | pgvector 0.5+ (2024) | Better recall, no training phase, faster builds |
| Sequential API calls | Parallel with ThreadPool | Standard | 3-4x faster response time for multi-model search |
Deprecated/outdated:
google-generativeai package: Deprecated Nov 2025, use google-genai insteadpg_trgm Threshold for Historical Matching
similarity >= 0.7 thresholdQuery Text Format for LLM
Connection Pooling
post_process.clj - Actual LLM matching implementation (local file analysis)config.edn - LLM configuration (Gemini Flash for post-processing)Confidence breakdown:
Research date: 2026-02-20 Valid until: 2026-03-20 (30 days - stable technologies)