Researched: 2026-02-20 Domain: Docker/PostgreSQL infrastructure, Python environment setup, data import Confidence: HIGH
Phase 1 establishes the development foundation: a Docker Compose stack running Postgres 18 with pgvector 0.8.1, a Python environment managed by uv with required packages, and ~6K historical booking line items imported from CSV.
The stack is straightforward: pgvector provides an official Docker image (pgvector/pgvector:pg18) that bundles Postgres 18 with the vector extension pre-installed. The Python environment uses uv (the modern, fast package manager) with psycopg3 + pgvector-python for database access. Data import uses psycopg3's efficient COPY protocol.
Primary recommendation: Use the official pgvector Docker image, uv for Python package management, and psycopg3's copy protocol for bulk CSV import. Pre-create embedding columns with correct dimensions for each model to avoid schema migrations in Phase 2.
<user_constraints>
booking_history_item structure with supplier_name_normalized, description_normalized, supplier_name, description, debit_account, credit_account, cost_center, net_amount./orcha/dump/regnology/historical.csvutil.text/normalize-supplier-name logic to PythonNone - discussion stayed within phase scope </user_constraints>
<phase_requirements>
| ID | Description | Research Support |
|---|---|---|
| INFRA-01 | Docker Compose setup with Postgres 18 and pgvector 0.8.1 | Official pgvector/pgvector:pg18 image provides this out-of-box. Use init script to CREATE EXTENSION. |
| INFRA-02 | Python environment with google-genai, sentence-transformers, ragas | uv manages packages efficiently. All packages available on PyPI. |
| INFRA-03 | Import ~6K line items from historical.csv into pgvector table | psycopg3 COPY protocol for efficient bulk import. Port Orcha's normalize-supplier-name to Python. |
| </phase_requirements> |
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| pgvector/pgvector:pg18 | 0.8.1-pg18-trixie | Docker image with Postgres 18 + pgvector | Official image, 10M+ downloads, maintained by pgvector team |
| psycopg | 3.x | PostgreSQL adapter for Python | Modern async-capable driver, required for pgvector-python |
| pgvector-python | 0.4.x | pgvector types for Python | Official pgvector Python bindings, supports psycopg3 |
| uv | latest | Python package/project manager | 10-100x faster than pip, replaces pip+venv+poetry, 2026 standard |
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| google-genai | latest | Google AI embeddings API | Phase 2: text-multilingual-embedding-002 embeddings |
| sentence-transformers | 4.x | Local embedding models | Phase 2: all-MiniLM-L6-v2 and jina-embeddings-v3 |
| ragas | 0.4.x | RAG evaluation framework | Phase 5: LLM-as-judge evaluation |
| pandas | 2.x | Data manipulation | CSV loading, data transformation |
| Instead of | Could Use | Tradeoff |
|---|---|---|
| uv | poetry | Poetry is slower but more mature; uv is 10-100x faster |
| uv | pip+venv | pip+venv works but requires manual coordination; uv is simpler |
| psycopg3 | psycopg2 | psycopg2 is legacy; psycopg3 has better async support |
| pandas | csv module | csv module is lower-level; pandas handles normalization easily |
Installation:
# Install uv (if not present)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Initialize project
uv init semantic-search-spike
cd semantic-search-spike
uv venv
# Add dependencies
uv add psycopg pgvector pandas
uv add google-genai sentence-transformers ragas # For later phases
semantic-search/
├── docker-compose.yml # Postgres 18 + pgvector
├── init.sql # CREATE EXTENSION, CREATE TABLE
├── pyproject.toml # uv project configuration
├── .venv/ # Virtual environment (created by uv)
├── src/
│ ├── db.py # Database connection utilities
│ ├── normalize.py # Text normalization (port of Orcha logic)
│ └── import_csv.py # CSV import script
├── data/
│ └── historical.csv # Source data (symlink or copy)
└── .planning/ # GSD planning files
What: Auto-initialize database with extension and schema on first startup When to use: Always for development environments Example:
# docker-compose.yml
services:
db:
image: pgvector/pgvector:pg18
container_name: semantic-search-db
ports:
- "5433:5432" # Non-standard port to avoid conflicts
environment:
POSTGRES_USER: dev
POSTGRES_PASSWORD: dev
POSTGRES_DB: semantic_search
volumes:
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
- pgvector_data:/var/lib/postgresql/18/docker # Postgres 18 PGDATA path
healthcheck:
test: ["CMD-SHELL", "pg_isready -U dev -d semantic_search"]
interval: 5s
timeout: 5s
retries: 5
volumes:
pgvector_data:
What: Define vector columns with correct dimensions upfront When to use: When embedding dimensions are known ahead of time Example:
-- init.sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE line_item (
id BIGSERIAL PRIMARY KEY,
-- Original values
supplier_name TEXT NOT NULL,
description TEXT NOT NULL,
debit_account TEXT NOT NULL,
credit_account TEXT,
cost_center TEXT,
net_amount NUMERIC,
-- Normalized values (for matching)
supplier_name_normalized TEXT NOT NULL,
description_normalized TEXT NOT NULL,
-- Embedding columns (populated in Phase 2)
embedding_google vector(768), -- text-multilingual-embedding-002
embedding_jina vector(1024), -- jina-embeddings-v3
embedding_minilm vector(384), -- all-MiniLM-L6-v2
-- Metadata
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Indexes for embedding search (created in Phase 2 after data load)
-- CREATE INDEX idx_line_item_google_cosine ON line_item
-- USING hnsw (embedding_google vector_cosine_ops);
What: Use PostgreSQL COPY protocol for fast data loading When to use: Importing >1000 rows Example:
# Source: pgvector-python examples
from pgvector.psycopg import register_vector
import psycopg
from io import StringIO
import csv
def bulk_import(conn, rows):
"""Import rows using COPY protocol (10-100x faster than INSERT)."""
register_vector(conn)
# Prepare CSV buffer
buffer = StringIO()
writer = csv.writer(buffer)
for row in rows:
writer.writerow([
row['supplier_name'],
row['description'],
row['debit_account'],
row.get('credit_account'),
row.get('cost_center'),
row.get('net_amount'),
row['supplier_name_normalized'],
row['description_normalized'],
])
buffer.seek(0)
with conn.cursor() as cur:
with cur.copy("""
COPY line_item (
supplier_name, description, debit_account,
credit_account, cost_center, net_amount,
supplier_name_normalized, description_normalized
) FROM STDIN WITH CSV
""") as copy:
while data := buffer.read(8192):
copy.write(data)
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Text normalization | Custom regex | Port Orcha's normalize-supplier-name exactly | Tested logic, handles German umlauts, company suffixes |
| CSV parsing | Manual file reading | pandas.read_csv | Handles encoding, quotes, missing values |
| Bulk import | INSERT loops | psycopg3 COPY protocol | 10-100x performance difference |
| Vector storage | Custom binary format | pgvector native vector type | Optimized storage, native operators |
| Package management | pip+venv scripts | uv | Automatic lockfiles, faster resolution |
Key insight: The Python ecosystem has mature, well-tested tools for every aspect of this phase. Custom solutions add maintenance burden without benefits.
What goes wrong: Volume mount fails or data disappears on restart
Why it happens: Postgres 18 changed PGDATA to /var/lib/postgresql/18/docker (version-specific)
How to avoid: Use the correct path in docker-compose volumes
Warning signs: Empty database after container restart
What goes wrong: ERROR: expected N dimensions, not M when inserting embeddings
Why it happens: Schema defines one dimension, embedding model produces another
How to avoid: Document and verify dimensions upfront:
vector without dimension specificationWhat goes wrong: can't adapt type 'numpy.ndarray' errors
Why it happens: pgvector types not registered with connection
How to avoid: Call register_vector(conn) after connecting
Warning signs: Vector operations fail with type errors
What goes wrong: UnicodeDecodeError or garbled text
Why it happens: CSV may have Windows encoding, BOM, or mixed encodings
How to avoid: Explicitly specify encoding in pandas.read_csv: encoding='utf-8-sig'
Warning signs: German characters (umlauts) appear corrupted
What goes wrong: Import takes hours instead of seconds Why it happens: Index maintenance on every INSERT/COPY row How to avoid: Import all data first, then CREATE INDEX Warning signs: Import progress slows dramatically as data grows
# Source: Ported from orcha/src/com/getorcha/util/text.clj
import re
UMLAUT_MAP = {
'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
}
ACCENT_MAP = {
'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'å': 'a',
'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
'ì': 'i', 'í': 'i', 'î': 'i', 'ï': 'i',
'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ø': 'o',
'ù': 'u', 'ú': 'u', 'û': 'u',
'ñ': 'n', 'ç': 'c',
}
PUNCTUATION_PATTERN = re.compile(r'[.,;:!?"\'`\-_/\\@#$%^&*()=+\[\]{}|<>]')
COMPANY_SUFFIX_PATTERN = re.compile(
r'(?i)\s*(gmbh|ag|kg|ohg|ug|se|gbr|co|inc|ltd|llc|plc|sa|srl|bv|nv|ek)\s*$'
)
CO_PARTNER_PATTERN = re.compile(r'(?i)\s*&\s*(co\.?|partner|söhne|sohn)\s*')
THE_PREFIX_PATTERN = re.compile(r'(?i)^the\s+')
def normalize_text(s: str | None) -> str | None:
"""Normalize text for fuzzy matching."""
if s is None:
return None
result = s.lower()
for char, replacement in UMLAUT_MAP.items():
result = result.replace(char.lower(), replacement.lower())
for char, replacement in ACCENT_MAP.items():
result = result.replace(char, replacement)
result = PUNCTUATION_PATTERN.sub('', result)
result = re.sub(r'\s+', ' ', result)
return result.strip()
def normalize_supplier_name(s: str | None) -> str | None:
"""Normalize supplier name with company suffix stripping."""
if s is None:
return None
result = s.lower()
# Strip & Co patterns before punctuation removal
result = CO_PARTNER_PATTERN.sub(' ', result)
# Apply standard normalization
for char, replacement in UMLAUT_MAP.items():
result = result.replace(char.lower(), replacement.lower())
for char, replacement in ACCENT_MAP.items():
result = result.replace(char, replacement)
result = PUNCTUATION_PATTERN.sub('', result)
# Strip company suffixes after punctuation removal
result = COMPANY_SUFFIX_PATTERN.sub('', result)
# Strip "the" prefix
result = THE_PREFIX_PATTERN.sub('', result)
# Collapse spaces and trim
result = re.sub(r'\s+', ' ', result)
return result.strip()
# Source: pgvector-python documentation
import psycopg
from pgvector.psycopg import register_vector
def get_connection(autocommit: bool = True) -> psycopg.Connection:
"""Get database connection with pgvector support."""
conn = psycopg.connect(
host='localhost',
port=5433,
dbname='semantic_search',
user='dev',
password='dev',
autocommit=autocommit,
)
register_vector(conn)
return conn
# Minimal import script structure
import pandas as pd
from normalize import normalize_supplier_name, normalize_text
def import_historical_csv(csv_path: str, conn) -> int:
"""Import historical.csv into line_item table."""
df = pd.read_csv(csv_path, encoding='utf-8-sig')
# Map CSV columns to schema
df = df.rename(columns={
'Supplier Name': 'supplier_name',
'Line Item Description': 'description',
'Debit Account': 'debit_account',
'Credit Account': 'credit_account',
'Cost Center': 'cost_center',
'Net Amount': 'net_amount',
})
# Skip rows with missing required fields
df = df.dropna(subset=['supplier_name', 'description', 'debit_account'])
df = df[df['supplier_name'].str.strip() != '']
df = df[df['description'].str.strip() != '']
df = df[df['debit_account'].str.strip() != '']
# Add normalized columns
df['supplier_name_normalized'] = df['supplier_name'].apply(normalize_supplier_name)
df['description_normalized'] = df['description'].apply(normalize_text)
# Bulk import using COPY
# ... (see Pattern 3 above)
return len(df)
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| pip + venv + requirements.txt | uv (single tool) | 2024-2025 | 10-100x faster installs, automatic lockfiles |
| psycopg2 | psycopg3 (psycopg) | 2021+ | Better async, typing, pgvector support |
| text-embedding-004 | gemini-embedding-001 | Jan 2026 | text-embedding-004 deprecated, but text-multilingual-embedding-002 still supported for multilingual |
| Postgres 17 | Postgres 18 | 2025 | PGDATA path changed to /var/lib/postgresql/18/docker |
Deprecated/outdated:
CSV file location
./orcha/dump/regnology/historical.csvCSV column names
Row count validation
Confidence breakdown:
Research date: 2026-02-20 Valid until: 2026-03-20 (30 days - stable technologies)