Phase 1: Foundation - Research

Researched: 2026-02-20 Domain: Docker/PostgreSQL infrastructure, Python environment setup, data import Confidence: HIGH

Summary

Phase 1 establishes the development foundation: a Docker Compose stack running Postgres 18 with pgvector 0.8.1, a Python environment managed by uv with required packages, and ~6K historical booking line items imported from CSV.

The stack is straightforward: pgvector provides an official Docker image (pgvector/pgvector:pg18) that bundles Postgres 18 with the vector extension pre-installed. The Python environment uses uv (the modern, fast package manager) with psycopg3 + pgvector-python for database access. Data import uses psycopg3's efficient COPY protocol.

Primary recommendation: Use the official pgvector Docker image, uv for Python package management, and psycopg3's copy protocol for bulk CSV import. Pre-create embedding columns with correct dimensions for each model to avoid schema migrations in Phase 2.

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

Claude's Discretion

Deferred Ideas (OUT OF SCOPE)

None - discussion stayed within phase scope </user_constraints>

<phase_requirements>

Phase Requirements

ID Description Research Support
INFRA-01 Docker Compose setup with Postgres 18 and pgvector 0.8.1 Official pgvector/pgvector:pg18 image provides this out-of-box. Use init script to CREATE EXTENSION.
INFRA-02 Python environment with google-genai, sentence-transformers, ragas uv manages packages efficiently. All packages available on PyPI.
INFRA-03 Import ~6K line items from historical.csv into pgvector table psycopg3 COPY protocol for efficient bulk import. Port Orcha's normalize-supplier-name to Python.
</phase_requirements>

Standard Stack

Core

Library Version Purpose Why Standard
pgvector/pgvector:pg18 0.8.1-pg18-trixie Docker image with Postgres 18 + pgvector Official image, 10M+ downloads, maintained by pgvector team
psycopg 3.x PostgreSQL adapter for Python Modern async-capable driver, required for pgvector-python
pgvector-python 0.4.x pgvector types for Python Official pgvector Python bindings, supports psycopg3
uv latest Python package/project manager 10-100x faster than pip, replaces pip+venv+poetry, 2026 standard

Supporting

Library Version Purpose When to Use
google-genai latest Google AI embeddings API Phase 2: text-multilingual-embedding-002 embeddings
sentence-transformers 4.x Local embedding models Phase 2: all-MiniLM-L6-v2 and jina-embeddings-v3
ragas 0.4.x RAG evaluation framework Phase 5: LLM-as-judge evaluation
pandas 2.x Data manipulation CSV loading, data transformation

Alternatives Considered

Instead of Could Use Tradeoff
uv poetry Poetry is slower but more mature; uv is 10-100x faster
uv pip+venv pip+venv works but requires manual coordination; uv is simpler
psycopg3 psycopg2 psycopg2 is legacy; psycopg3 has better async support
pandas csv module csv module is lower-level; pandas handles normalization easily

Installation:

# Install uv (if not present)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Initialize project
uv init semantic-search-spike
cd semantic-search-spike
uv venv

# Add dependencies
uv add psycopg pgvector pandas
uv add google-genai sentence-transformers ragas  # For later phases

Architecture Patterns

semantic-search/
├── docker-compose.yml       # Postgres 18 + pgvector
├── init.sql                 # CREATE EXTENSION, CREATE TABLE
├── pyproject.toml           # uv project configuration
├── .venv/                   # Virtual environment (created by uv)
├── src/
│   ├── db.py               # Database connection utilities
│   ├── normalize.py        # Text normalization (port of Orcha logic)
│   └── import_csv.py       # CSV import script
├── data/
│   └── historical.csv      # Source data (symlink or copy)
└── .planning/              # GSD planning files

Pattern 1: Docker Compose with Init Script

What: Auto-initialize database with extension and schema on first startup When to use: Always for development environments Example:

# docker-compose.yml
services:
  db:
    image: pgvector/pgvector:pg18
    container_name: semantic-search-db
    ports:
      - "5433:5432"  # Non-standard port to avoid conflicts
    environment:
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: semantic_search
    volumes:
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
      - pgvector_data:/var/lib/postgresql/18/docker  # Postgres 18 PGDATA path
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U dev -d semantic_search"]
      interval: 5s
      timeout: 5s
      retries: 5

volumes:
  pgvector_data:

Pattern 2: Schema with Pre-allocated Embedding Columns

What: Define vector columns with correct dimensions upfront When to use: When embedding dimensions are known ahead of time Example:

-- init.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE line_item (
    id BIGSERIAL PRIMARY KEY,
    -- Original values
    supplier_name TEXT NOT NULL,
    description TEXT NOT NULL,
    debit_account TEXT NOT NULL,
    credit_account TEXT,
    cost_center TEXT,
    net_amount NUMERIC,
    -- Normalized values (for matching)
    supplier_name_normalized TEXT NOT NULL,
    description_normalized TEXT NOT NULL,
    -- Embedding columns (populated in Phase 2)
    embedding_google vector(768),      -- text-multilingual-embedding-002
    embedding_jina vector(1024),       -- jina-embeddings-v3
    embedding_minilm vector(384),      -- all-MiniLM-L6-v2
    -- Metadata
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Indexes for embedding search (created in Phase 2 after data load)
-- CREATE INDEX idx_line_item_google_cosine ON line_item
--     USING hnsw (embedding_google vector_cosine_ops);

Pattern 3: Efficient Bulk Import with COPY

What: Use PostgreSQL COPY protocol for fast data loading When to use: Importing >1000 rows Example:

# Source: pgvector-python examples
from pgvector.psycopg import register_vector
import psycopg
from io import StringIO
import csv

def bulk_import(conn, rows):
    """Import rows using COPY protocol (10-100x faster than INSERT)."""
    register_vector(conn)

    # Prepare CSV buffer
    buffer = StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([
            row['supplier_name'],
            row['description'],
            row['debit_account'],
            row.get('credit_account'),
            row.get('cost_center'),
            row.get('net_amount'),
            row['supplier_name_normalized'],
            row['description_normalized'],
        ])
    buffer.seek(0)

    with conn.cursor() as cur:
        with cur.copy("""
            COPY line_item (
                supplier_name, description, debit_account,
                credit_account, cost_center, net_amount,
                supplier_name_normalized, description_normalized
            ) FROM STDIN WITH CSV
        """) as copy:
            while data := buffer.read(8192):
                copy.write(data)

Anti-Patterns to Avoid

Don't Hand-Roll

Problem Don't Build Use Instead Why
Text normalization Custom regex Port Orcha's normalize-supplier-name exactly Tested logic, handles German umlauts, company suffixes
CSV parsing Manual file reading pandas.read_csv Handles encoding, quotes, missing values
Bulk import INSERT loops psycopg3 COPY protocol 10-100x performance difference
Vector storage Custom binary format pgvector native vector type Optimized storage, native operators
Package management pip+venv scripts uv Automatic lockfiles, faster resolution

Key insight: The Python ecosystem has mature, well-tested tools for every aspect of this phase. Custom solutions add maintenance burden without benefits.

Common Pitfalls

Pitfall 1: Wrong Postgres Data Directory Path

What goes wrong: Volume mount fails or data disappears on restart Why it happens: Postgres 18 changed PGDATA to /var/lib/postgresql/18/docker (version-specific) How to avoid: Use the correct path in docker-compose volumes Warning signs: Empty database after container restart

Pitfall 2: Vector Dimension Mismatch

What goes wrong: ERROR: expected N dimensions, not M when inserting embeddings Why it happens: Schema defines one dimension, embedding model produces another How to avoid: Document and verify dimensions upfront:

Pitfall 3: Missing psycopg3 Registration

What goes wrong: can't adapt type 'numpy.ndarray' errors Why it happens: pgvector types not registered with connection How to avoid: Call register_vector(conn) after connecting Warning signs: Vector operations fail with type errors

Pitfall 4: Encoding Issues in CSV Import

What goes wrong: UnicodeDecodeError or garbled text Why it happens: CSV may have Windows encoding, BOM, or mixed encodings How to avoid: Explicitly specify encoding in pandas.read_csv: encoding='utf-8-sig' Warning signs: German characters (umlauts) appear corrupted

Pitfall 5: Creating Indexes Before Bulk Import

What goes wrong: Import takes hours instead of seconds Why it happens: Index maintenance on every INSERT/COPY row How to avoid: Import all data first, then CREATE INDEX Warning signs: Import progress slows dramatically as data grows

Code Examples

Text Normalization (Port of Orcha's Logic)

# Source: Ported from orcha/src/com/getorcha/util/text.clj
import re

UMLAUT_MAP = {
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
}

ACCENT_MAP = {
    'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'å': 'a',
    'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
    'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
    'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ø': 'o',
    'ù': 'u', 'ú': 'u', 'û': 'u',
    'ñ': 'n', 'ç': 'c',
}

PUNCTUATION_PATTERN = re.compile(r'[.,;:!?"\'`\-_/\\@#$%^&*()=+\[\]{}|<>]')
COMPANY_SUFFIX_PATTERN = re.compile(
    r'(?i)\s*(gmbh|ag|kg|ohg|ug|se|gbr|co|inc|ltd|llc|plc|sa|srl|bv|nv|ek)\s*$'
)
CO_PARTNER_PATTERN = re.compile(r'(?i)\s*&\s*(co\.?|partner|söhne|sohn)\s*')
THE_PREFIX_PATTERN = re.compile(r'(?i)^the\s+')


def normalize_text(s: str | None) -> str | None:
    """Normalize text for fuzzy matching."""
    if s is None:
        return None
    result = s.lower()
    for char, replacement in UMLAUT_MAP.items():
        result = result.replace(char.lower(), replacement.lower())
    for char, replacement in ACCENT_MAP.items():
        result = result.replace(char, replacement)
    result = PUNCTUATION_PATTERN.sub('', result)
    result = re.sub(r'\s+', ' ', result)
    return result.strip()


def normalize_supplier_name(s: str | None) -> str | None:
    """Normalize supplier name with company suffix stripping."""
    if s is None:
        return None
    result = s.lower()
    # Strip & Co patterns before punctuation removal
    result = CO_PARTNER_PATTERN.sub(' ', result)
    # Apply standard normalization
    for char, replacement in UMLAUT_MAP.items():
        result = result.replace(char.lower(), replacement.lower())
    for char, replacement in ACCENT_MAP.items():
        result = result.replace(char, replacement)
    result = PUNCTUATION_PATTERN.sub('', result)
    # Strip company suffixes after punctuation removal
    result = COMPANY_SUFFIX_PATTERN.sub('', result)
    # Strip "the" prefix
    result = THE_PREFIX_PATTERN.sub('', result)
    # Collapse spaces and trim
    result = re.sub(r'\s+', ' ', result)
    return result.strip()

Database Connection with pgvector

# Source: pgvector-python documentation
import psycopg
from pgvector.psycopg import register_vector

def get_connection(autocommit: bool = True) -> psycopg.Connection:
    """Get database connection with pgvector support."""
    conn = psycopg.connect(
        host='localhost',
        port=5433,
        dbname='semantic_search',
        user='dev',
        password='dev',
        autocommit=autocommit,
    )
    register_vector(conn)
    return conn

CSV Import Script Structure

# Minimal import script structure
import pandas as pd
from normalize import normalize_supplier_name, normalize_text

def import_historical_csv(csv_path: str, conn) -> int:
    """Import historical.csv into line_item table."""
    df = pd.read_csv(csv_path, encoding='utf-8-sig')

    # Map CSV columns to schema
    df = df.rename(columns={
        'Supplier Name': 'supplier_name',
        'Line Item Description': 'description',
        'Debit Account': 'debit_account',
        'Credit Account': 'credit_account',
        'Cost Center': 'cost_center',
        'Net Amount': 'net_amount',
    })

    # Skip rows with missing required fields
    df = df.dropna(subset=['supplier_name', 'description', 'debit_account'])
    df = df[df['supplier_name'].str.strip() != '']
    df = df[df['description'].str.strip() != '']
    df = df[df['debit_account'].str.strip() != '']

    # Add normalized columns
    df['supplier_name_normalized'] = df['supplier_name'].apply(normalize_supplier_name)
    df['description_normalized'] = df['description'].apply(normalize_text)

    # Bulk import using COPY
    # ... (see Pattern 3 above)

    return len(df)

State of the Art

Old Approach Current Approach When Changed Impact
pip + venv + requirements.txt uv (single tool) 2024-2025 10-100x faster installs, automatic lockfiles
psycopg2 psycopg3 (psycopg) 2021+ Better async, typing, pgvector support
text-embedding-004 gemini-embedding-001 Jan 2026 text-embedding-004 deprecated, but text-multilingual-embedding-002 still supported for multilingual
Postgres 17 Postgres 18 2025 PGDATA path changed to /var/lib/postgresql/18/docker

Deprecated/outdated:

Open Questions

  1. CSV file location

  2. CSV column names

  3. Row count validation

Sources

Primary (HIGH confidence)

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

Metadata

Confidence breakdown:

Research date: 2026-02-20 Valid until: 2026-03-20 (30 days - stable technologies)