Phase 1: Foundation - Context

Gathered: 2026-02-20 Status: Ready for planning

## Phase Boundary

Working development environment with pgvector-enabled database and imported historical data. Docker Compose starts Postgres 18 + pgvector, Python environment has required packages, ~6K line items from historical.csv are queryable.

## Implementation Decisions

Database schema

Single flat table (no upload tracking — irrelevant for spike)
Replicate Orcha's booking_history_item structure:
- supplier_name_normalized, description_normalized (for matching)
- supplier_name, description (original values)
- debit_account, credit_account, cost_center
- net_amount
Include nullable embedding columns from the start (avoid migrations):
- embedding_google (vector)
- embedding_jina (vector)
- embedding_minilm (vector)
Embeddings populated in Phase 2, but columns exist in Phase 1

Data import

Source: ./orcha/dump/regnology/historical.csv
Import columns: Supplier Name, Line Item Description, Debit Account, Credit Account, Cost Center, Net Amount
Normalization: Port Orcha's util.text/normalize-supplier-name logic to Python
Skip rows with missing required fields (empty debit_account, supplier_name, or description)
Both original and normalized values stored

Claude's Discretion

Docker Compose structure and port choices
Python package manager (pip/poetry/uv)
Virtual environment approach
Import script error handling details

## Specific Ideas

Replicate Orcha's approach: the spike compares semantic embedding search (pgvector + 3 models) vs LLM context matching (current Orcha approach with historical bookings + Gemini)
Reference implementation: ./orcha/resources/migrations/20260217164646-add-booking-history-tables.up.sql for schema
Reference implementation: ./orcha/src/com/getorcha/workers/ingestion/post_process.clj for LLM matching approach

## Deferred Ideas

None — discussion stayed within phase scope

Phase: 01-foundation Context gathered: 2026-02-20