# Phase 2: Embedding Generation - Context

Gathered: 2026-02-20 Status: Ready for planning

## Phase Boundary

Pre-compute vector embeddings from three models (Google text-multilingual-embedding-002, Jina embeddings-v3, and MiniLM all-MiniLM-L6-v2) for all ~6K line items, and create a clean train/test separation for later evaluation. Search implementation and evaluation are out of scope here; they are separate phases.

## Implementation Decisions

### Train/Test Split

- 20% test set, as specified in requirements
- Split strategy: Claude's discretion (random or stratified)
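
One possible shape for the split, sketched under assumptions: the record layout, the `supplier` field, and the `stratified_split` helper are hypothetical, and stratifying by supplier is only one of the options left open above. The sketch holds out roughly 20% within each group and uses a fixed seed so the split is reproducible.

```python
import random
from collections import defaultdict


def stratified_split(items, key, test_fraction=0.2, seed=42):
    """Return (train, test), holding out about test_fraction of each group."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Bucket items by the stratification key (e.g. supplier or account).
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    train, test = [], []
    for group in sorted(groups):  # sorted for a deterministic group order
        members = list(groups[group])
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical line items: 30 rows spread evenly over 3 suppliers.
items = [{"id": i, "supplier": f"S{i % 3}"} for i in range(30)]
train, test = stratified_split(items, key=lambda row: row["supplier"])
```

Very small groups round down to zero test items in this sketch, so rare suppliers would end up entirely in the training set. A plain random split (shuffle once, slice off 20%) avoids that at the cost of uneven group coverage.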

### Embedding Text Preparation

- What text to embed and how to combine fields: Claude's discretion
- Handle German text appropriately (umlauts already normalized in Phase 1)
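
As an illustration of combining fields, here is a minimal sketch. The field names (`description`, `supplier`, `account_name`), the `" | "` separator, and the `build_embedding_text` helper are all assumptions, not decisions recorded here. It only collapses whitespace and skips empty fields; it does not touch umlauts, since that normalization is stated to have happened in Phase 1.

```python
def build_embedding_text(item, fields=("description", "supplier", "account_name")):
    """Join the selected fields of one line item into a single string to embed."""
    parts = []
    for field in fields:
        value = item.get(field)
        if value is None:
            continue  # skip missing fields instead of embedding the word "None"
        text = " ".join(str(value).split())  # collapse runs of whitespace
        if text:
            parts.append(text)
    return " | ".join(parts)


# Hypothetical line item with messy whitespace and one missing field.
example = {
    "description": "  Druckerpapier   A4 ",
    "supplier": "Muster GmbH",
    "account_name": None,
}
embedding_text = build_embedding_text(example)
```

Using the same function for all three models keeps the comparison fair, because each model then sees identical input text.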

### Query Variation Generation

- Generate synthetic query variations for test set entries
- Approach: Claude's discretion; be smart about it
- Consider synonyms, typos, and word reordering, as appropriate for the German accounting context
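
The three variation types listed above could be sketched roughly as follows. Everything here is illustrative: the tiny `SYNONYMS` map, the adjacent-character-swap typo model, and the function names are assumptions, and a real synonym list would need to come from actual domain knowledge of German accounting terms.

```python
import random

# Hypothetical, illustrative synonym map (lowercase keys, umlauts already normalized).
SYNONYMS = {
    "rechnung": "faktura",
    "bueromaterial": "buerobedarf",
    "kfz": "fahrzeug",
}


def swap_synonyms(words):
    """Replace each word that has a known synonym (replacement is lowercase)."""
    return [SYNONYMS.get(word.lower(), word) for word in words]


def add_typo(words, rng):
    """Swap two adjacent characters in one randomly chosen word of length >= 4."""
    candidates = [i for i, word in enumerate(words) if len(word) >= 4]
    if not candidates:
        return list(words)
    i = rng.choice(candidates)
    word = words[i]
    pos = rng.randrange(len(word) - 1)
    typo = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    return words[:i] + [typo] + words[i + 1:]


def reorder(words, rng):
    """Return the words in a shuffled order."""
    shuffled = list(words)
    rng.shuffle(shuffled)
    return shuffled


def query_variations(text, seed=0):
    """Return distinct variations of text, excluding the original itself."""
    rng = random.Random(seed)  # seeded so the test set is reproducible
    words = text.split()
    candidates = [swap_synonyms(words), add_typo(words, rng), reorder(words, rng)]
    seen = {" ".join(words)}
    variations = []
    for candidate in candidates:
        query = " ".join(candidate)
        if query not in seen:  # drop no-op results and duplicates
            seen.add(query)
            variations.append(query)
    return variations


variations = query_variations("Rechnung Bueromaterial Papier")
```

Each variation keeps a link to its source entry implicitly through the call site, so the original line item remains the expected search result for every generated query.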

### Model Metadata Tracking

- Keep it generic and simple
- Store at table/config level rather than per-embedding row
- Track: model name, dimensions, distance metric
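
A config-level registry along these lines would cover the three tracked attributes without adding columns to each embedding row. This is a sketch, not a settled schema: the registry keys, the `EmbeddingModelInfo` class, and the `validate_vector` helper are hypothetical. The dimensions are the models' default output sizes as commonly documented and should be verified against the model documentation; `cosine` is an assumed metric, since no distance metric has been chosen here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingModelInfo:
    name: str
    dimensions: int
    distance_metric: str


# One entry per model, kept at config level (not stored per embedding row).
# Dimensions and metric are assumptions to verify before use.
MODEL_REGISTRY = {
    "google": EmbeddingModelInfo("text-multilingual-embedding-002", 768, "cosine"),
    "jina": EmbeddingModelInfo("jina-embeddings-v3", 1024, "cosine"),
    "minilm": EmbeddingModelInfo("all-MiniLM-L6-v2", 384, "cosine"),
}


def validate_vector(model_key, vector):
    """Raise ValueError if the vector length does not match the registry."""
    info = MODEL_REGISTRY[model_key]
    if len(vector) != info.dimensions:
        raise ValueError(
            f"{info.name}: expected {info.dimensions} dimensions, got {len(vector)}"
        )


# The registry serializes to plain dicts, e.g. for a JSON config file.
registry_as_dict = {key: asdict(info) for key, info in MODEL_REGISTRY.items()}
```

Checking vector length against the registry at write time catches a mismatched model or truncated output before it reaches storage.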

### Claude's Discretion

- Train/test split strategy (random vs. stratified by supplier/account)
- Which fields to combine for embedding text
- Specific synthetic query generation approach
- Metadata schema design

## Specific Ideas

- "Be smart about synthetic queries": use domain knowledge of German accounting terminology
- Keep metadata tracking simple and generic

## Deferred Ideas

None. Discussion stayed within phase scope.

Phase: 02-embedding-generation Context gathered: 2026-02-20