Phase 2: Embedding Generation - Context
Gathered: 2026-02-20
Status: Ready for planning
## Phase Boundary
Pre-compute vector embeddings for all ~6K line items with three models (Google text-multilingual-embedding-002, Jina embeddings-v3, and MiniLM all-MiniLM-L6-v2), and create a clean train/test separation for evaluation. Search implementation and evaluation are out of scope here; they belong to separate phases.
## Implementation Decisions
Train/Test Split
- Hold out 20% of the line items as the test set, as specified in the requirements
- Split strategy: Claude's discretion (random or stratified)
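One way the 20% hold-out could look is sketched below, covering both the random and the stratified option. The function name, the `supplier` field, and the fixed seed are assumptions made for illustration; nothing here is prescribed by the requirements.

```python
import random
from collections import defaultdict


def split_train_test(items, test_fraction=0.2, stratify_key=None, seed=42):
    """Return (train, test) lists.

    stratify_key is an optional function mapping an item to a group label
    (hypothetical example: supplier or account). Without it, the split is
    a plain seeded random split over all items.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group items; with no stratify_key everything lands in one group.
    groups = defaultdict(list)
    for item in items:
        groups[stratify_key(item) if stratify_key else None].append(item)

    train, test = [], []
    for key in sorted(groups, key=str):  # stable group order across runs
        members = groups[key][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```

With stratification, very small groups can round to zero test items, so the overall test share may drift slightly from 20%. If that matters, the random variant (no `stratify_key`) hits the target fraction more precisely.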
Embedding Text Preparation
- What text to embed and how to combine fields: Claude's discretion
- Handle German text appropriately (umlauts already normalized in Phase 1)
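A minimal sketch of the text preparation step follows, assuming each line item is a dict. The field names (`supplier`, `description`, `account_name`) and the ` | ` separator are hypothetical; the actual fields to combine are still open.

```python
import re
import unicodedata


def build_embedding_text(item, fields=("supplier", "description", "account_name")):
    """Join the non-empty fields of a line item into one string to embed.

    The field names are placeholders. Umlaut handling is assumed to have
    happened in Phase 1, so this only normalizes Unicode form and whitespace.
    """
    parts = []
    for field in fields:
        value = item.get(field)
        if not value:
            continue  # skip missing or empty fields
        text = unicodedata.normalize("NFC", str(value))
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs
        if text:
            parts.append(text)
    return " | ".join(parts)
```

For example, `build_embedding_text({"supplier": "Mueller GmbH", "description": "Bueromaterial  Papier"})` yields `"Mueller GmbH | Bueromaterial Papier"`. Using the same function for all three models keeps their embeddings comparable.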
Query Variation Generation
- Generate synthetic query variations for test set entries
- Approach: Claude's discretion — be smart about it
- Consider: synonyms, typos, word reordering as appropriate for German accounting context
Embedding Metadata
- Keep it generic and simple
- Store at table/config level rather than per-embedding row
- Track: model name, dimensions, distance metric
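The table-level metadata could be as small as the sketch below. The table names are hypothetical, cosine distance is an assumption, and the dimension values are the commonly documented defaults for these models; they should be checked against the vectors actually produced.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingTableConfig:
    model_name: str
    dimensions: int
    distance_metric: str  # e.g. "cosine", "l2", "dot"


# One entry per embedding table (table names are placeholders), so the
# metadata is not repeated on every embedding row.
EMBEDDING_TABLES = {
    "embeddings_google": EmbeddingTableConfig("text-multilingual-embedding-002", 768, "cosine"),
    "embeddings_jina": EmbeddingTableConfig("jina-embeddings-v3", 1024, "cosine"),
    "embeddings_minilm": EmbeddingTableConfig("all-MiniLM-L6-v2", 384, "cosine"),
}


def validate_vector(table, vector):
    """Raise ValueError if a vector does not match the table's declared size."""
    expected = EMBEDDING_TABLES[table].dimensions
    if len(vector) != expected:
        raise ValueError(f"{table}: expected {expected} dimensions, got {len(vector)}")


def dump_config():
    """Serialize the metadata, e.g. for a config file or a metadata table."""
    return json.dumps({name: asdict(cfg) for name, cfg in EMBEDDING_TABLES.items()}, indent=2)
```

Checking each vector against the declared dimensions at write time catches a wrong model or a truncated output early, before the evaluation phase depends on the data.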
Claude's Discretion
- Train/test split strategy (random vs stratified by supplier/account)
- Which fields to combine for embedding text
- Specific synthetic query generation approach
- Metadata schema design
## Specific Ideas
- "Be smart about synthetic queries" — use domain knowledge of German accounting terminology
- Keep metadata tracking simple and generic
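For the synthetic-query idea, a simple generator covering typos, word reordering, and synonyms might look like the sketch below. The synonym table is a tiny illustrative sample rather than a vetted accounting glossary, and it assumes umlauts appear in transliterated form (`ue` for `ü`), which depends on what Phase 1 actually produced.

```python
import random

# Illustrative sample only; a real list would come from domain knowledge.
SYNONYMS = {
    "rechnung": "faktura",
    "bueromaterial": "buerobedarf",
    "kfz": "fahrzeug",
}


def swap_typo(text, rng):
    """Swap one pair of adjacent, different letters to simulate a typo."""
    positions = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha() and text[i] != text[i + 1]
    ]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def reorder_words(text, rng):
    """Return the same words in a different order, if that is possible."""
    words = text.split()
    if len(set(words)) < 2:
        return text  # nothing to reorder
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def replace_synonyms(text):
    """Replace known words with a synonym, keeping initial capitalization."""
    out = []
    for word in text.split():
        synonym = SYNONYMS.get(word.lower())
        if synonym is None:
            out.append(word)
        else:
            out.append(synonym.capitalize() if word[:1].isupper() else synonym)
    return " ".join(out)


def generate_variations(text, seed=0):
    """Return {variation_kind: query}, dropping variations equal to the input."""
    rng = random.Random(seed)
    candidates = {
        "typo": swap_typo(text, rng),
        "reorder": reorder_words(text, rng),
        "synonym": replace_synonyms(text),
    }
    return {kind: query for kind, query in candidates.items() if query != text}
```

Keeping the variation kind as the dictionary key would let the later evaluation phase report results per kind, for example how well each model tolerates typos compared with reordering.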
## Deferred Ideas
None — discussion stayed within phase scope
Phase: 02-embedding-generation
Context gathered: 2026-02-20