Project State

Project Reference

See: .planning/PROJECT.md (updated 2026-02-20)

Core value: Determine whether pgvector semantic search can match or exceed LLM-based matching for GL account/cost center assignment
Current focus: Phase 5 - Reporting & LLM Judge

Current Position

Phase: 5 of 5 (Reporting & LLM Judge)
Plan: 2 of 2 in current phase
Status: Complete
Last activity: 2026-02-20 - Completed 05-02-PLAN.md

Progress: [##########] 100%

Performance Metrics

Velocity:

Total plans completed: 11
Average duration: 6.3 min
Total execution time: 69 min

By Phase:

Phase	Plans	Total	Avg/Plan
01-foundation	2	5 min	2.5 min
02-embedding-generation	3	34 min	11.3 min
03-search-implementation	2	10 min	5.0 min
04-evaluation-dashboard	2	12 min	6.0 min
05-reporting-llm-judge	2	8 min	4.0 min

Recent Trend:

Last 4 plans: 04-01, 04-02, 05-01, 05-02
Trend: Stabilizing at ~4-5 min avg

Updated after each plan completion

Accumulated Context

Decisions

Decisions are logged in the Key Decisions table of PROJECT.md.
Recent decisions affecting current work:

Postgres 18 in Docker (isolated environment, easy pgvector setup)
Synthetic test queries (test robustness to description variations)
Three embedding models (compare local vs API, different architectures)
Port 5433 to avoid conflicts with local Postgres installations
Pre-created embedding columns with exact dimensions: Google (768), Jina (1024), MiniLM (384)
uv for Python environment management (faster than pip/poetry)
Port Orcha's normalize_supplier_name exactly (German umlauts, company suffix stripping)
Store both original and normalized text values
COPY protocol for 10-100x faster bulk import
Use sklearn train_test_split with stratify for proportional debit account representation
Handle sparse classes (1 member) with random assignment at test_size probability
Convert numpy int64 to native Python int for psycopg3 compatibility
Use getorcha-dev GCP project for Vertex AI (has billing enabled)
Consistent embedding text format: supplier | description
Conservative batch sizes for API rate limits
MiniLM embeddings normalized for cosine similarity at encode time
HNSW indexes created after data population for efficiency (m=16, ef_construction=64)
German QWERTZ keyboard layout for realistic typo generation
Query embeddings use RETRIEVAL_QUERY task type (Google, Jina) for optimal retrieval
HNSW ef_search=40 for balanced search performance
Similarity = 1 - distance for intuitive [0,1] scoring
pg_trgm similarity threshold 0.7 for historical booking lookup (matching Orcha exactly)
Escape % operator as %% for psycopg3 placeholder compatibility
Gemini Flash via API key (GOOGLE_API_KEY), not Vertex AI for LLM matching
Normalize account values to handle 6801 vs 6801.0 comparison
NULL == NULL counts as correct match for cost center
2 warmup queries excluded from timing to avoid cold start bias
Use consensus['values'] in Jinja2 to avoid dict method reference
Track per-model latency separately using search_single_model for accurate timing
skip_llm parameter in benchmark when GOOGLE_API_KEY not available
Few-shot examples in JUDGE_PROMPT for consistent YES/NO verdicts
Temperature=0 for deterministic LLM judge responses
Four showcase categories: best_cases, worst_cases, edge_cases, llm_saves
Top-15 GL accounts with 'Other' grouping for confusion matrix clarity
Base64 embedded images for self-contained HTML reports
Inline CSS styling for maximum browser compatibility
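
Several of the evaluation decisions above (account normalization for 6801 vs 6801.0, NULL == NULL as a correct cost center match, similarity = 1 - distance, and the %% escape for psycopg3) are small enough to sketch directly. The following is a minimal illustration only, not the project's actual code: the function names, the table name historical_bookings, and the column names are hypothetical. Note that pgvector's cosine distance ranges over [0, 2], so 1 - distance stays within [0, 1] only when no pair of vectors has negative cosine similarity.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_account(value) -> Optional[str]:
    """Map 6801, 6801.0 and "6801.0" to the same string "6801".

    None (SQL NULL) is passed through unchanged. Hypothetical helper.
    """
    if value is None:
        return None
    text = str(value).strip()
    try:
        number = Decimal(text)
    except InvalidOperation:
        return text  # non-numeric codes are compared as written
    # Whole numbers lose their trailing ".0"; anything else is left alone.
    if number.is_finite() and number == number.to_integral_value():
        return str(int(number))
    return text


def values_match(expected, predicted) -> bool:
    """Compare two account or cost center values after normalization.

    Both sides being None (NULL == NULL) counts as a correct match.
    """
    return normalize_account(expected) == normalize_account(predicted)


def similarity_from_distance(distance: float) -> float:
    """Convert a cosine distance into a similarity score."""
    return 1.0 - distance


# pg_trgm's % operator must be written as %% when the query is executed
# with parameters through psycopg3, because a single % starts a placeholder.
# Table and column names here are assumptions for illustration.
TRGM_LOOKUP_SQL = """
SELECT debit_account, cost_center
FROM historical_bookings
WHERE supplier_normalized %% %(supplier)s
"""
```

With these definitions, values_match("6801", 6801.0) and values_match(None, None) both evaluate to True, while values_match(None, "4711") evaluates to False.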

Pending Todos

None yet.

Blockers/Concerns

None yet.

Session Continuity

Last session: 2026-02-20
Stopped at: Completed 05-02-PLAN.md (Static HTML Report) - PROJECT COMPLETE
Resume file: None