Phase 3: Search Implementation - Research

Researched: 2026-02-20 Domain: Vector similarity search, Flask web framework, LLM API integration Confidence: HIGH

Summary

Phase 3 implements the search interface for querying the semantic search system. Users submit a query via a minimal web interface and receive results from four sources in parallel: pgvector nearest neighbor search using three embedding models (Google, Jina, MiniLM), plus LLM context matching replicating Orcha's approach. The existing infrastructure from Phases 1-2 provides ~6K line items with pre-computed embeddings in HNSW-indexed columns.

The pgvector search component is straightforward: embed the query text using the same models, then use the <=> cosine distance operator to find top-K nearest neighbors. Flask provides a minimal web framework suitable for this spike's single-user evaluation use case. The LLM context matching requires replicating Orcha's approach: fetching similar historical bookings based on supplier name similarity (pg_trgm), formatting as CSV context, and prompting Gemini Flash to suggest GL account and cost center.

Running all four searches in parallel via concurrent.futures.ThreadPoolExecutor ensures responsive UI. Each search is I/O-bound (database query or API call), making threading ideal.

Primary recommendation: Use Flask with a single route for search, ThreadPoolExecutor for parallel execution of all 4 search methods, and replicate Orcha's LLM prompt structure exactly for accurate comparison.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

Claude's Discretion

Deferred Ideas (OUT OF SCOPE)

None - discussion stayed within phase scope </user_constraints>

<phase_requirements>

Phase Requirements

ID Description Research Support
SRCH-01 Single query text input interface Flask route with simple HTML form, single text input field plus K dropdown. See Flask Web Interface section.
SRCH-02 Top-K retrieval with adjustable K (3, 5, 10) pgvector ORDER BY embedding <=> query_vector LIMIT K with K from dropdown. See pgvector Search section.
SRCH-03 Display similarity scores per result pgvector returns distance; convert to similarity with 1 - distance for cosine. Include in result template.
SRCH-05 LLM context matching (replicate Orcha approach with Gemini 2.5 Flash) Use google-genai SDK, replicate Orcha's booking history lookup and prompt structure. See LLM Context Matching section.
SRCH-06 Display results with supplier, description, GL accounts, cost center Include all fields in search results, format in side-by-side columns per embedding model.
</phase_requirements>

Standard Stack

Core

Library Version Purpose Why Standard
Flask 3.1+ Web framework Minimal, well-documented, perfect for spikes. Official 2026 docs recommend simple routes for prototypes.
google-genai 1.64+ Gemini API access Already in project. Official unified SDK for all Google generative AI.
psycopg 3.3+ PostgreSQL with pgvector Already in project. Modern async-capable driver with native pgvector support.
concurrent.futures stdlib Parallel search execution Built-in ThreadPoolExecutor for I/O-bound parallel tasks. No external dependency.

Supporting

Library Version Purpose When to Use
sentence-transformers 5.2+ MiniLM query embedding Already in project. Embed query text locally before pgvector search.
pgvector 0.4+ Vector type handling Already in project. register_vector() enables proper type coercion.
Jinja2 (via Flask) HTML templates Flask's built-in templating for results display.

Alternatives Considered

Instead of Could Use Tradeoff
Flask FastAPI FastAPI has native async, but Flask is simpler for this spike size. Async not needed with ThreadPoolExecutor.
ThreadPoolExecutor asyncio/aiohttp async is more efficient at scale, but ThreadPool is simpler and sufficient for 4 parallel calls.
Server-side rendering React/Vue SPA SPA adds complexity. Server-side HTML is fastest for this single-user evaluation tool.

Installation:

uv add flask
# All other dependencies already present from Phases 1-2

Architecture Patterns

src/
├── app.py                  # Flask application
├── search/
│   ├── __init__.py
│   ├── pgvector_search.py  # Semantic search functions
│   └── llm_matching.py     # Orcha-style LLM context matching
├── templates/
│   ├── base.html           # Common layout
│   └── search.html         # Search form + results
├── db.py                   # Existing
├── embeddings/             # Existing
│   ├── google_embed.py     # Reuse for query embedding
│   ├── jina_embed.py       # Reuse for query embedding
│   └── minilm_embed.py     # Reuse for query embedding
└── normalize.py            # Existing

What: Query embeddings using cosine distance operator When to use: All semantic search queries Example:

# Source: pgvector GitHub documentation
def search_pgvector(
    conn,
    query_embedding: list[float],
    embedding_column: str,
    k: int = 5
) -> list[dict]:
    """
    Find K nearest neighbors using pgvector cosine distance.

    Returns results with similarity score (1 - distance).
    """
    with conn.cursor() as cur:
        cur.execute(f"""
            SELECT
                id,
                supplier_name,
                description,
                debit_account,
                credit_account,
                cost_center,
                net_amount,
                1 - ({embedding_column} <=> %s::vector) as similarity
            FROM line_item
            WHERE {embedding_column} IS NOT NULL
            ORDER BY {embedding_column} <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, k))

        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]

Pattern 2: Parallel Search Execution

What: Run all 4 search methods concurrently using ThreadPoolExecutor When to use: Every search request Example:

# Source: Python concurrent.futures documentation
from concurrent.futures import ThreadPoolExecutor, as_completed

def search_all_models(
    conn,
    query_text: str,
    k: int = 5
) -> dict:
    """Execute all searches in parallel."""

    def search_google():
        embedding = embed_google([query_text])[0]
        return ('google', search_pgvector(conn, embedding, 'embedding_google', k))

    def search_jina():
        embedding = embed_jina([query_text])[0]
        return ('jina', search_pgvector(conn, embedding, 'embedding_jina', k))

    def search_minilm():
        embedding = embed_minilm([query_text])[0]
        return ('minilm', search_pgvector(conn, embedding, 'embedding_minilm', k))

    def search_llm():
        return ('llm', llm_context_match(conn, query_text))

    results = {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(search_google),
            executor.submit(search_jina),
            executor.submit(search_minilm),
            executor.submit(search_llm),
        ]
        for future in as_completed(futures):
            key, result = future.result()
            results[key] = result

    return results

Pattern 3: Orcha-Style LLM Context Matching

What: Fetch similar historical bookings, format as prompt context, call Gemini When to use: LLM prediction column Based on: Orcha's post_process.clj fetch-supplier-booking-history and accounts-match prompt Example:

# Source: Orcha post_process.clj analysis
from google import genai

def llm_context_match(conn, query_text: str) -> dict:
    """
    Replicate Orcha's LLM context matching approach.

    1. Parse supplier name from query (use first part before |)
    2. Fetch similar historical bookings via pg_trgm
    3. Format as CSV context
    4. Prompt Gemini Flash for GL account + cost center
    """
    # Extract supplier hint from query
    # Query format: "supplier_name | description"
    parts = query_text.split('|', 1)
    supplier_hint = parts[0].strip() if parts else query_text

    # Fetch similar bookings (Orcha uses pg_trgm similarity >= 0.7)
    historical = fetch_booking_history(conn, supplier_hint)

    # Format prompt matching Orcha structure
    prompt = build_matching_prompt(query_text, historical)

    # Call Gemini Flash
    client = genai.Client()
    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=prompt,
        config={
            'response_mime_type': 'application/json',
            'temperature': 0,
        }
    )

    # Parse JSON response
    return parse_llm_response(response.text)


def fetch_booking_history(conn, supplier_name: str, limit: int = 50) -> list[dict]:
    """
    Fetch historical bookings similar to supplier name.
    Replicates Orcha's pg_trgm similarity search.
    """
    from src.normalize import normalize_supplier_name

    supplier_norm = normalize_supplier_name(supplier_name)
    if not supplier_norm:
        return []

    with conn.cursor() as cur:
        cur.execute("""
            SELECT
                supplier_name,
                description,
                net_amount,
                debit_account,
                cost_center
            FROM line_item
            WHERE supplier_name_normalized % %s
            ORDER BY similarity(supplier_name_normalized, %s) DESC
            LIMIT %s
        """, (supplier_norm, supplier_norm, limit))

        columns = [desc[0] for desc in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]

Pattern 4: Flask Simple Route

What: Single route handling both GET (form) and POST (search) When to use: Minimal web interface Example:

# Source: Flask documentation
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def search():
    results = None
    query = ''
    k = 5

    if request.method == 'POST':
        query = request.form.get('query', '')
        k = int(request.form.get('k', 5))

        if query.strip():
            conn = get_connection()
            try:
                results = search_all_models(conn, query, k)
            finally:
                conn.close()

    return render_template('search.html',
                          query=query,
                          k=k,
                          results=results)

Anti-Patterns to Avoid

Don't Hand-Roll

Problem Don't Build Use Instead Why
Web framework Custom HTTP server Flask Routing, templating, form handling all built-in
Parallel execution Manual threading ThreadPoolExecutor Proper cleanup, exception handling, result collection
Vector similarity Manual distance calc pgvector <=> operator Hardware-optimized, uses HNSW index
JSON response parsing Regex extraction json.loads() Handles edge cases, proper error messages
HTML escaping String concatenation Jinja2 templates XSS prevention, cleaner code

Key insight: This phase combines existing well-solved problems (web framework, parallel execution, vector search) with one domain-specific task (replicating Orcha's LLM prompt structure). Focus implementation effort on the LLM matching accuracy.

Common Pitfalls

Pitfall 1: Query Embedding vs Document Embedding

What goes wrong: Using wrong task_type for query embedding produces poor results Why it happens: Embedding models have asymmetric retrieval modes (query vs document) How to avoid: Use task='RETRIEVAL_QUERY' for Google, task='retrieval.query' for Jina when embedding the search query Warning signs: Good embeddings but poor search relevance

Pitfall 2: Connection Not Closed on Error

What goes wrong: Database connections leak, pool exhausted Why it happens: Exception in search code before conn.close() How to avoid: Use try/finally or context manager for connection handling Warning signs: "too many connections" errors after multiple searches

Pitfall 3: pg_trgm Extension Not Enabled

What goes wrong: operator does not exist: text % text error Why it happens: pg_trgm not installed for similarity search How to avoid: Add CREATE EXTENSION IF NOT EXISTS pg_trgm; to init.sql Warning signs: SQL error on first LLM context match attempt

Pitfall 4: Gemini JSON Response Not Parseable

What goes wrong: LLM returns markdown code blocks or malformed JSON Why it happens: Gemini wraps JSON in ```json blocks or adds commentary How to avoid: Use response_mime_type: 'application/json' config, implement strip_markdown_json fallback Warning signs: JSONDecodeError on LLM responses

Pitfall 5: ThreadPoolExecutor Exceptions Silently Swallowed

What goes wrong: One search fails but UI shows partial results without error Why it happens: Not calling future.result() or not handling exceptions How to avoid: Use as_completed() and call result() with try/except per future Warning signs: Missing columns in results, no error messages

Pitfall 6: HNSW Index Not Used

What goes wrong: Search is slow (seconds instead of milliseconds) Why it happens: Query plan doesn't use index due to missing probes setting How to avoid: Set SET hnsw.ef_search = 40; for query sessions Warning signs: EXPLAIN shows Seq Scan instead of Index Scan

Code Examples

pgvector Query Embedding (Google)

# Source: google-genai documentation, adapted for query task
from google import genai
from google.genai.types import EmbedContentConfig

def embed_query_google(query_text: str) -> list[float]:
    """Embed query text for retrieval (not document storage)."""
    client = genai.Client()
    response = client.models.embed_content(
        model='text-multilingual-embedding-002',
        contents=[query_text],
        config=EmbedContentConfig(
            task_type='RETRIEVAL_QUERY',  # Different from RETRIEVAL_DOCUMENT!
        ),
    )
    return response.embeddings[0].values

pgvector Query Embedding (Jina)

# Source: Jina API documentation
def embed_query_jina(query_text: str) -> list[float]:
    """Embed query text for retrieval."""
    response = requests.post(
        'https://api.jina.ai/v1/embeddings',
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {JINA_API_KEY}',
        },
        json={
            'input': [query_text],
            'model': 'jina-embeddings-v3',
            'dimensions': 1024,
            'task': 'retrieval.query',  # Different from retrieval.passage!
        },
    )
    response.raise_for_status()
    return response.json()['data'][0]['embedding']

Gemini Flash Text Generation

# Source: google-genai documentation, Orcha config.edn analysis
from google import genai

def call_gemini_flash(prompt: str) -> str:
    """Call Gemini Flash for JSON response."""
    client = genai.Client()  # Uses GOOGLE_API_KEY env var

    response = client.models.generate_content(
        model='gemini-2.5-flash',
        contents=prompt,
        config={
            'response_mime_type': 'application/json',
            'temperature': 0,  # Deterministic output
        }
    )
    return response.text

Orcha-Style Prompt Structure

# Source: Analysis of Orcha post_process.clj accounts-match prompt
def build_matching_prompt(query_text: str, historical_bookings: list[dict]) -> str:
    """
    Build prompt matching Orcha's accounts-match structure.

    Orcha provides:
    - Historical bookings as CSV
    - Current invoice/query as JSON
    - Instructions to match GL account and cost center
    """
    # Format historical bookings as CSV (matches Orcha's dataset->csv)
    booking_csv = format_as_csv(historical_bookings) if historical_bookings else ""

    prompt = f"""You are matching a line item to GL accounts and cost centers.

Based on the query and historical bookings for similar suppliers, suggest:
1. The most likely debit GL account
2. The most likely cost center (if applicable)

--- HISTORICAL BOOKINGS FOR SIMILAR SUPPLIERS (CSV) ---
{booking_csv if booking_csv else "No historical data available."}

--- QUERY LINE ITEM ---
{query_text}

Respond with JSON:
{{"debit_account": "<account or null>", "cost_center": "<center or null>"}}

If you cannot determine with reasonable confidence, use null."""

    return prompt


def format_as_csv(data: list[dict]) -> str:
    """Format list of dicts as CSV string (matches Orcha's dataset->csv)."""
    if not data:
        return ""

    # Get keys that have non-null values
    keys = list(data[0].keys())
    header = ",".join(keys)
    rows = []
    for item in data:
        row = ",".join(str(item.get(k, "")) for k in keys)
        rows.append(row)

    return header + "\n" + "\n".join(rows)

Flask HTML Template

<!-- templates/search.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Semantic Search Comparison</title>
    <style>
        .results-grid { display: grid; grid-template-columns: repeat(4, 1fr); gap: 20px; }
        .column { border: 1px solid #ccc; padding: 10px; }
        .result-item { border-bottom: 1px solid #eee; padding: 10px 0; }
        .similarity { color: green; font-weight: bold; }
    </style>
</head>
<body>
    <h1>Semantic Search Comparison</h1>

    <form method="post">
        <input type="text" name="query" value="{{ query }}" size="80"
               placeholder="supplier name | description">
        <select name="k">
            <option value="3" {{ 'selected' if k == 3 }}>Top 3</option>
            <option value="5" {{ 'selected' if k == 5 }}>Top 5</option>
            <option value="10" {{ 'selected' if k == 10 }}>Top 10</option>
        </select>
        <button type="submit">Search</button>
    </form>

    {% if results %}
    <div class="results-grid">
        <div class="column">
            <h2>Google (768d)</h2>
            {% for r in results.google %}
            <div class="result-item">
                <div><strong>{{ r.supplier_name }}</strong></div>
                <div>{{ r.description }}</div>
                <div>GL: {{ r.debit_account }} | CC: {{ r.cost_center or '-' }}</div>
                <div class="similarity">{{ "%.3f"|format(r.similarity) }}</div>
            </div>
            {% endfor %}
        </div>
        <!-- Similar columns for Jina, MiniLM, LLM -->
    </div>
    {% endif %}
</body>
</html>

State of the Art

Old Approach Current Approach When Changed Impact
google-generativeai SDK google-genai SDK Nov 2025 Unified API for Gemini, Vertex AI, embeddings
Flask sync-only Flask 2.x async support 2021 Can use async routes, but ThreadPool still simpler for this case
IVFFlat indexes HNSW indexes pgvector 0.5+ (2024) Better recall, no training phase, faster builds
Sequential API calls Parallel with ThreadPool Standard 3-4x faster response time for multi-model search

Deprecated/outdated:

Open Questions

  1. pg_trgm Threshold for Historical Matching

  2. Query Text Format for LLM

  3. Connection Pooling

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

Metadata

Confidence breakdown:

Research date: 2026-02-20 Valid until: 2026-03-20 (30 days - stable technologies)