Phase 3: Search Implementation - Research

Researched: 2026-02-20 Domain: Vector similarity search, Flask web framework, LLM API integration Confidence: HIGH

Summary

Phase 3 implements the search interface for querying the semantic search system. Users submit a query via a minimal web interface and receive results from four sources in parallel: pgvector nearest neighbor search using three embedding models (Google, Jina, MiniLM), plus LLM context matching replicating Orcha's approach. The existing infrastructure from Phases 1-2 provides ~6K line items with pre-computed embeddings in HNSW-indexed columns.

The pgvector search component is straightforward: embed the query text using the same models, then use the <=> cosine distance operator to find top-K nearest neighbors. Flask provides a minimal web framework suitable for this spike's single-user evaluation use case. The LLM context matching requires replicating Orcha's approach: fetching similar historical bookings based on supplier name similarity (pg_trgm), formatting as CSV context, and prompting Gemini Flash to suggest GL account and cost center.

Running all four searches in parallel via concurrent.futures.ThreadPoolExecutor ensures responsive UI. Each search is I/O-bound (database query or API call), making threading ideal.

Primary recommendation: Use Flask with a single route for search, ThreadPoolExecutor for parallel execution of all 4 search methods, and replicate Orcha's LLM prompt structure exactly for accurate comparison.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

Simple web interface for submitting queries
Single text input for query, dropdown for K (3, 5, 10)
Results always show all 3 embedding models + LLM in one view
No model selector needed - all models run on every query
Side-by-side columns: Google | Jina | MiniLM | LLM
Each pgvector column shows top-K results
LLM column shows single prediction (GL account + cost center)
Full details per result row: supplier name, description, GL account, cost center, amounts, similarity score, embedding model used
Replicate Orcha's actual approach - investigate config.edn, ingestion.clj, post_process.clj
Use Gemini Flash (same as Orcha)
Copy credentials from Orcha config
Match Orcha's prompt/context structure exactly
Return GL account + cost center only (no confidence score)
Run all 4 searches (3 embedding models + LLM) in parallel

Claude's Discretion

Results page UX (inline vs navigation) - pick simplest approach
LLM API error handling - appropriate error display
Flask app structure and routing

Deferred Ideas (OUT OF SCOPE)

None - discussion stayed within phase scope </user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
SRCH-01	Single query text input interface	Flask route with simple HTML form, single text input field plus K dropdown. See Flask Web Interface section.
SRCH-02	Top-K retrieval with adjustable K (3, 5, 10)	pgvector `ORDER BY embedding <=> query_vector LIMIT K` with K from dropdown. See pgvector Search section.
SRCH-03	Display similarity scores per result	pgvector returns distance; convert to similarity with `1 - distance` for cosine. Include in result template.
SRCH-05	LLM context matching (replicate Orcha approach with Gemini 2.5 Flash)	Use google-genai SDK, replicate Orcha's booking history lookup and prompt structure. See LLM Context Matching section.
SRCH-06	Display results with supplier, description, GL accounts, cost center	Include all fields in search results, format in side-by-side columns per embedding model.
</phase_requirements>

Standard Stack

Core

Library	Version	Purpose	Why Standard
Flask	3.1+	Web framework	Minimal, well-documented, perfect for spikes. Official 2026 docs recommend simple routes for prototypes.
google-genai	1.64+	Gemini API access	Already in project. Official unified SDK for all Google generative AI.
psycopg	3.3+	PostgreSQL with pgvector	Already in project. Modern async-capable driver with native pgvector support.
concurrent.futures	stdlib	Parallel search execution	Built-in ThreadPoolExecutor for I/O-bound parallel tasks. No external dependency.

Supporting

Library	Version	Purpose	When to Use
sentence-transformers	5.2+	MiniLM query embedding	Already in project. Embed query text locally before pgvector search.
pgvector	0.4+	Vector type handling	Already in project. `register_vector()` enables proper type coercion.
Jinja2	(via Flask)	HTML templates	Flask's built-in templating for results display.

Alternatives Considered

Instead of	Could Use	Tradeoff
Flask	FastAPI	FastAPI has native async, but Flask is simpler for this spike size. Async not needed with ThreadPoolExecutor.
ThreadPoolExecutor	asyncio/aiohttp	async is more efficient at scale, but ThreadPool is simpler and sufficient for 4 parallel calls.
Server-side rendering	React/Vue SPA	SPA adds complexity. Server-side HTML is fastest for this single-user evaluation tool.

Installation:

uv add flask
# All other dependencies already present from Phases 1-2

Architecture Patterns

Recommended Project Structure

src/
├── app.py                  # Flask application
├── search/
│   ├── __init__.py
│   ├── pgvector_search.py  # Semantic search functions
│   └── llm_matching.py     # Orcha-style LLM context matching
├── templates/
│   ├── base.html           # Common layout
│   └── search.html         # Search form + results
├── db.py                   # Existing
├── embeddings/             # Existing
│   ├── google_embed.py     # Reuse for query embedding
│   ├── jina_embed.py       # Reuse for query embedding
│   └── minilm_embed.py     # Reuse for query embedding
└── normalize.py            # Existing

Pattern 1: pgvector Nearest Neighbor Search

What: Query embeddings using cosine distance operator When to use: All semantic search queries Example:

# Source: pgvector GitHub documentation
def search_pgvector(
    conn,
    query_embedding: list[float],
    embedding_column: str,
    k: int = 5
) -> list[dict]:
    """
    Find K nearest neighbors using pgvector cosine distance.

    Returns results with similarity score (1 - distance).
    """
    with conn.cursor() as cur:
        cur.execute(f"""
            SELECT
                id,
                supplier_name,
                description,
                debit_account,
                credit_account,
                cost_center,
                net_amount,
                1 - ({embedding_column} <=> %s::vector) as similarity
            FROM line_item
            WHERE {embedding_column} IS NOT NULL
            ORDER BY {embedding_column} <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, k))

        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]

Pattern 2: Parallel Search Execution

What: Run all 4 search methods concurrently using ThreadPoolExecutor When to use: Every search request Example:

# Source: Python concurrent.futures documentation
from concurrent.futures import ThreadPoolExecutor, as_completed

def search_all_models(
    conn,
    query_text: str,
    k: int = 5
) -> dict:
    """Execute all searches in parallel."""

    def search_google():
        embedding = embed_google([query_text])[0]
        return ('google', search_pgvector(conn, embedding, 'embedding_google', k))

    def search_jina():
        embedding = embed_jina([query_text])[0]
        return ('jina', search_pgvector(conn, embedding, 'embedding_jina', k))

    def search_minilm():
        embedding = embed_minilm([query_text])[0]
        return ('minilm', search_pgvector(conn, embedding, 'embedding_minilm', k))

    def search_llm():
        return ('llm', llm_context_match(conn, query_text))

    results = {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(search_google),
            executor.submit(search_jina),
            executor.submit(search_minilm),
            executor.submit(search_llm),
        ]
        for future in as_completed(futures):
            key, result = future.result()
            results[key] = result

    return results

Pattern 3: Orcha-Style LLM Context Matching

What: Fetch similar historical bookings, format as prompt context, call Gemini When to use: LLM prediction column Based on: Orcha's post_process.clj fetch-supplier-booking-history and accounts-match prompt Example:

# Source: Orcha post_process.clj analysis
from google import genai

def llm_context_match(conn, query_text: str) -> dict:
    """
    Replicate Orcha's LLM context matching approach.

    1. Parse supplier name from query (use first part before |)
    2. Fetch similar historical bookings via pg_trgm
    3. Format as CSV context
    4. Prompt Gemini Flash for GL account + cost center
    """
    # Extract supplier hint from query
    # Query format: "supplier_name | description"
    parts = query_text.split('|', 1)
    supplier_hint = parts[0].strip() if parts else query_text

    # Fetch similar bookings (Orcha uses pg_trgm similarity >= 0.7)
    historical = fetch_booking_history(conn, supplier_hint)

    # Format prompt matching Orcha structure
    prompt = build_matching_prompt(query_text, historical)

    # Call Gemini Flash
    client = genai.Client()
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=prompt,
        config={
            'response_mime_type': 'application/json',
            'temperature': 0,
        }
    )

    # Parse JSON response
    return parse_llm_response(response.text)


def fetch_booking_history(conn, supplier_name: str, limit: int = 50) -> list[dict]:
    """
    Fetch historical bookings similar to supplier name.
    Replicates Orcha's pg_trgm similarity search.
    """
    from src.normalize import normalize_supplier_name

    supplier_norm = normalize_supplier_name(supplier_name)
    if not supplier_norm:
        return []

    with conn.cursor() as cur:
        cur.execute("""
            SELECT
                supplier_name,
                description,
                net_amount,
                debit_account,
                cost_center
            FROM line_item
            WHERE supplier_name_normalized % %s
            ORDER BY similarity(supplier_name_normalized, %s) DESC
            LIMIT %s
        """, (supplier_norm, supplier_norm, limit))

        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]

Pattern 4: Flask Simple Route

What: Single route handling both GET (form) and POST (search) When to use: Minimal web interface Example:

# Source: Flask documentation
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def search():
    results = None
    query = ''
    k = 5

    if request.method == 'POST':
        query = request.form.get('query', '')
        k = int(request.form.get('k', 5))

        if query.strip():
            conn = get_connection()
            try:
                results = search_all_models(conn, query, k)
            finally:
                conn.close()

    return render_template('search.html',
                          query=query,
                          k=k,
                          results=results)

Anti-Patterns to Avoid

Sequential search execution: Run all 4 searches in parallel, not sequentially
Creating new connections per search: Share connection or use connection pool
Embedding query multiple times: Cache query embeddings if same query hits multiple models
Complex Flask blueprints: Single route is sufficient for this spike
Client-side rendering: Server-side HTML is simpler and faster for this use case

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Web framework	Custom HTTP server	Flask	Routing, templating, form handling all built-in
Parallel execution	Manual threading	ThreadPoolExecutor	Proper cleanup, exception handling, result collection
Vector similarity	Manual distance calc	pgvector `<=>` operator	Hardware-optimized, uses HNSW index
JSON response parsing	Regex extraction	`json.loads()`	Handles edge cases, proper error messages
HTML escaping	String concatenation	Jinja2 templates	XSS prevention, cleaner code

Key insight: This phase combines existing well-solved problems (web framework, parallel execution, vector search) with one domain-specific task (replicating Orcha's LLM prompt structure). Focus implementation effort on the LLM matching accuracy.

Common Pitfalls

Pitfall 1: Query Embedding vs Document Embedding

What goes wrong: Using wrong task_type for query embedding produces poor results Why it happens: Embedding models have asymmetric retrieval modes (query vs document) How to avoid: Use task='RETRIEVAL_QUERY' for Google, task='retrieval.query' for Jina when embedding the search query Warning signs: Good embeddings but poor search relevance

Pitfall 2: Connection Not Closed on Error

What goes wrong: Database connections leak, pool exhausted Why it happens: Exception in search code before conn.close() How to avoid: Use try/finally or context manager for connection handling Warning signs: "too many connections" errors after multiple searches

Pitfall 3: pg_trgm Extension Not Enabled

What goes wrong: operator does not exist: text % text error Why it happens: pg_trgm not installed for similarity search How to avoid: Add CREATE EXTENSION IF NOT EXISTS pg_trgm; to init.sql Warning signs: SQL error on first LLM context match attempt

Pitfall 4: Gemini JSON Response Not Parseable

What goes wrong: LLM returns markdown code blocks or malformed JSON Why it happens: Gemini wraps JSON in ```json blocks or adds commentary How to avoid: Use response_mime_type: 'application/json' config, implement strip_markdown_json fallback Warning signs: JSONDecodeError on LLM responses

Pitfall 5: ThreadPoolExecutor Exceptions Silently Swallowed

What goes wrong: One search fails but UI shows partial results without error Why it happens: Not calling future.result() or not handling exceptions How to avoid: Use as_completed() and call result() with try/except per future Warning signs: Missing columns in results, no error messages

Pitfall 6: HNSW Index Not Used

What goes wrong: Search is slow (seconds instead of milliseconds) Why it happens: Query plan doesn't use index due to missing probes setting How to avoid: Set SET hnsw.ef_search = 40; for query sessions Warning signs: EXPLAIN shows Seq Scan instead of Index Scan

Code Examples

pgvector Query Embedding (Google)

# Source: google-genai documentation, adapted for query task
from google import genai
from google.genai.types import EmbedContentConfig

def embed_query_google(query_text: str) -> list[float]:
    """Embed query text for retrieval (not document storage)."""
    client = genai.Client()
    response = client.models.embed_content(
        model='text-multilingual-embedding-002',
        contents=[query_text],
        config=EmbedContentConfig(
            task_type='RETRIEVAL_QUERY',  # Different from RETRIEVAL_DOCUMENT!
        ),
    )
    return response.embeddings[0].values

pgvector Query Embedding (Jina)

# Source: Jina API documentation
def embed_query_jina(query_text: str) -> list[float]:
    """Embed query text for retrieval."""
    response = requests.post(
        'https://api.jina.ai/v1/embeddings',
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {JINA_API_KEY}',
        },
        json={
            'input': [query_text],
            'model': 'jina-embeddings-v3',
            'dimensions': 1024,
            'task': 'retrieval.query',  # Different from retrieval.passage!
        },
    )
    response.raise_for_status()
    return response.json()['data'][0]['embedding']

Gemini Flash Text Generation

# Source: google-genai documentation, Orcha config.edn analysis
from google import genai

def call_gemini_flash(prompt: str) -> str:
    """Call Gemini Flash for JSON response."""
    client = genai.Client()  # Uses GOOGLE_API_KEY env var

    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=prompt,
        config={
            'response_mime_type': 'application/json',
            'temperature': 0,  # Deterministic output
        }
    )
    return response.text

Orcha-Style Prompt Structure

# Source: Analysis of Orcha post_process.clj accounts-match prompt
def build_matching_prompt(query_text: str, historical_bookings: list[dict]) -> str:
    """
    Build prompt matching Orcha's accounts-match structure.

    Orcha provides:
    - Historical bookings as CSV
    - Current invoice/query as JSON
    - Instructions to match GL account and cost center
    """
    # Format historical bookings as CSV (matches Orcha's dataset->csv)
    booking_csv = format_as_csv(historical_bookings) if historical_bookings else ""

    prompt = f"""You are matching a line item to GL accounts and cost centers.

Based on the query and historical bookings for similar suppliers, suggest:
1. The most likely debit GL account
2. The most likely cost center (if applicable)

--- HISTORICAL BOOKINGS FOR SIMILAR SUPPLIERS (CSV) ---
{booking_csv if booking_csv else "No historical data available."}

--- QUERY LINE ITEM ---
{query_text}

Respond with JSON:
{{"debit_account": "<account or null>", "cost_center": "<center or null>"}}

If you cannot determine with reasonable confidence, use null."""

    return prompt


def format_as_csv(data: list[dict]) -> str:
    """Format list of dicts as CSV string (matches Orcha's dataset->csv)."""
    if not data:
        return ""

    # Get keys that have non-null values
    keys = list(data[0].keys())
    header = ",".join(keys)
    rows = []
    for item in data:
        row = ",".join(str(item.get(k, "")) for k in keys)
        rows.append(row)

    return header + "\n" + "\n".join(rows)

Flask HTML Template

<!-- templates/search.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Semantic Search Comparison</title>
    <style>
        .results-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 20px; }
        .column { border: 1px solid #ccc; padding: 10px; }
        .result-item { border-bottom: 1px solid #eee; padding: 10px 0; }
        .similarity { color: green; font-weight: bold; }
    </style>
</head>
<body>
    <h1>Semantic Search Comparison</h1>

    <form method="post">
        <input type="text" name="query" value="{{ query }}" size="80"
               placeholder="supplier name | description">
        <select name="k">
            <option value="3" {{ 'selected' if k == 3 }}>Top 3</option>
            <option value="5" {{ 'selected' if k == 5 }}>Top 5</option>
            <option value="10" {{ 'selected' if k == 10 }}>Top 10</option>
        </select>
        <button type="submit">Search</button>
    </form>

    {% if results %}
    <div class="results-grid">
        <div class="column">
            <h2>Google (768d)</h2>
            {% for r in results.google %}
            <div class="result-item">
                <div><strong>{{ r.supplier_name }}</strong></div>
                <div>{{ r.description }}</div>
                <div>GL: {{ r.debit_account }} | CC: {{ r.cost_center or '-' }}</div>
                <div class="similarity">{{ "%.3f"|format(r.similarity) }}</div>
            </div>
            {% endfor %}
        </div>
        <!-- Similar columns for Jina, MiniLM, LLM -->
    </div>
    {% endif %}
</body>
</html>

State of the Art

Old Approach	Current Approach	When Changed	Impact
google-generativeai SDK	google-genai SDK	Nov 2025	Unified API for Gemini, Vertex AI, embeddings
Flask sync-only	Flask 2.x async support	2021	Can use async routes, but ThreadPool still simpler for this case
IVFFlat indexes	HNSW indexes	pgvector 0.5+ (2024)	Better recall, no training phase, faster builds
Sequential API calls	Parallel with ThreadPool	Standard	3-4x faster response time for multi-model search

Deprecated/outdated:

google-generativeai package: Deprecated Nov 2025, use google-genai instead
Flask <2.0: No async support, still works but misses modern features
pgvector <0.5: IVFFlat only, HNSW provides better performance

Open Questions

pg_trgm Threshold for Historical Matching
- What we know: Orcha uses similarity >= 0.7 threshold
- What's unclear: Is 0.7 optimal for our dataset? Should we tune?
- Recommendation: Start with 0.7 (matches Orcha), can adjust in Phase 4 evaluation
Query Text Format for LLM
- What we know: Our embeddings use "supplier | description" format
- What's unclear: Should LLM receive same format or parsed components?
- Recommendation: Parse into components for clearer prompt structure
Connection Pooling
- What we know: Creating connection per request is simple but may be slow
- What's unclear: Will single-user usage cause issues?
- Recommendation: Start simple (connection per request), add pooling if latency is problematic

Sources

Primary (HIGH confidence)

pgvector GitHub - Cosine distance operators, HNSW index configuration
google-genai SDK - Gemini API, embedding task types
Flask Documentation - Async support, route patterns
Python concurrent.futures - ThreadPoolExecutor usage
Orcha post_process.clj - Actual LLM matching implementation (local file analysis)
Orcha config.edn - LLM configuration (Gemini Flash for post-processing)

Secondary (MEDIUM confidence)

pgvector 2026 Guide - Current best practices
Flask Best Practices 2026 - Project structure recommendations

Tertiary (LOW confidence)

None - all critical claims verified with official sources

Metadata

Confidence breakdown:

Standard stack: HIGH - All libraries already in project or well-documented
Architecture: HIGH - Patterns verified against official documentation
Pitfalls: HIGH - Based on pgvector GitHub issues and Flask documentation
LLM Matching: MEDIUM - Based on Orcha code analysis, may need adjustment

Research date: 2026-02-20 Valid until: 2026-03-20 (30 days - stable technologies)