Phase 1: Foundation - Context
Gathered: 2026-02-20
Status: Ready for planning
## Phase Boundary
Working development environment with pgvector-enabled database and imported historical data. Docker Compose starts Postgres 18 + pgvector, Python environment has required packages, ~6K line items from historical.csv are queryable.
## Implementation Decisions
Database schema
- Single flat table (no upload tracking — irrelevant for spike)
- Replicate Orcha's
booking_history_item structure:
supplier_name_normalized, description_normalized (for matching)
supplier_name, description (original values)
debit_account, credit_account, cost_center
net_amount
- Include nullable embedding columns from the start (avoid migrations):
embedding_google (vector)
embedding_jina (vector)
embedding_minilm (vector)
- Embeddings populated in Phase 2, but columns exist in Phase 1
Data import
- Source:
./orcha/dump/regnology/historical.csv
- Import columns: Supplier Name, Line Item Description, Debit Account, Credit Account, Cost Center, Net Amount
- Normalization: Port Orcha's
util.text/normalize-supplier-name logic to Python
- Skip rows with missing required fields (empty debit_account, supplier_name, or description)
- Both original and normalized values stored
Claude's Discretion
- Docker Compose structure and port choices
- Python package manager (pip/poetry/uv)
- Virtual environment approach
- Import script error handling details
## Specific Ideas
- Replicate Orcha's approach: the spike compares semantic embedding search (pgvector + 3 models) vs LLM context matching (current Orcha approach with historical bookings + Gemini)
- Reference implementation:
./orcha/resources/migrations/20260217164646-add-booking-history-tables.up.sql for schema
- Reference implementation:
./orcha/src/com/getorcha/workers/ingestion/post_process.clj for LLM matching approach
## Deferred Ideas
None — discussion stayed within phase scope
Phase: 01-foundation
Context gathered: 2026-02-20