Project State
Project Reference
See: .planning/PROJECT.md (updated 2026-02-20)
Core value: Determine whether pgvector semantic search can match or exceed LLM-based matching for GL account/cost center assignment
Current focus: Phase 5 - Reporting & LLM Judge
Current Position
Phase: 5 of 5 (Reporting & LLM Judge)
Plan: 2 of 2 in current phase
Status: Complete
Last activity: 2026-02-20 — Completed 05-02-PLAN.md
Progress: [##########] 100%
Velocity:
- Total plans completed: 11
- Average duration: 6.3 min
- Total execution time: 69 min
By Phase:
| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| 01-foundation | 2 | 5 min | 2.5 min |
| 02-embedding-generation | 3 | 34 min | 11.3 min |
| 03-search-implementation | 2 | 10 min | 5.0 min |
| 04-evaluation-dashboard | 2 | 12 min | 6.0 min |
| 05-reporting-llm-judge | 2 | 8 min | 4.0 min |
Recent Trend:
- Last 4 plans: 04-01, 04-02, 05-01, 05-02
- Trend: Stabilizing at ~4-5 min avg
Updated after each plan completion
Accumulated Context
Decisions
Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- Postgres 18 in Docker (isolated environment, easy pgvector setup)
- Synthetic test queries (test robustness to description variations)
- Three embedding models (compare local vs API, different architectures)
- Port 5433 to avoid conflicts with local Postgres installations
- Pre-created embedding columns with exact dimensions: Google (768), Jina (1024), MiniLM (384)
- uv for Python environment management (faster than pip/poetry)
- Port Orcha's normalize_supplier_name exactly (German umlauts, company suffix stripping)
- Store both original and normalized text values
- COPY protocol for 10-100x faster bulk import
- Use sklearn train_test_split with stratify for proportional debit account representation
- Handle sparse classes (1 member) with random assignment at test_size probability
- Convert numpy int64 to native Python int for psycopg3 compatibility
- Use getorcha-dev GCP project for Vertex AI (has billing enabled)
- Consistent embedding text format: supplier | description
- Conservative batch sizes for API rate limits
- MiniLM embeddings normalized for cosine similarity at encode time
- HNSW indexes created after data population for efficiency (m=16, ef_construction=64)
- German QWERTZ keyboard layout for realistic typo generation
- Query embeddings use RETRIEVAL_QUERY task type (Google, Jina) for optimal retrieval
- HNSW ef_search=40 for balanced search performance
- Similarity = 1 - distance for intuitive [0,1] scoring
- pg_trgm similarity threshold 0.7 for historical booking lookup (matching Orcha exactly)
- Escape % operator as %% for psycopg3 placeholder compatibility
- Gemini Flash via API key (GOOGLE_API_KEY) rather than Vertex AI for LLM matching
- Normalize account values to handle 6801 vs 6801.0 comparison
- NULL == NULL counts as correct match for cost center
- 2 warmup queries excluded from timing to avoid cold start bias
- Use consensus['values'] in Jinja2 to avoid dict method reference
- Track per-model latency separately using search_single_model for accurate timing
- skip_llm parameter in benchmark when GOOGLE_API_KEY not available
- Few-shot examples in JUDGE_PROMPT for consistent YES/NO verdicts
- Temperature=0 for deterministic LLM judge responses
- Four showcase categories: best_cases, worst_cases, edge_cases, llm_saves
- Top-15 GL accounts with 'Other' grouping for confusion matrix clarity
- Base64 embedded images for self-contained HTML reports
- Inline CSS styling for maximum browser compatibility
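
Three of the scoring decisions above (account normalization, NULL == NULL for cost centers, and similarity = 1 - distance) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `normalize_account`, `values_match`, and `similarity_from_distance` are hypothetical, and treating an empty string as missing is an assumption.

```python
from typing import Optional, Union

# Accounts may arrive as strings, ints, floats (e.g. from pandas), or NULL.
AccountValue = Union[str, int, float, None]


def normalize_account(value: AccountValue) -> Optional[str]:
    """Return a canonical string for an account value, or None if missing."""
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        # Assumption: an empty string is treated like a missing value.
        return None
    try:
        number = float(text)
    except ValueError:
        return text  # non-numeric codes are compared as-is
    if number.is_integer():
        return str(int(number))  # "6801.0" and 6801 both become "6801"
    return text


def values_match(predicted: AccountValue, expected: AccountValue) -> bool:
    """Compare two values after normalization; None == None counts as a match."""
    return normalize_account(predicted) == normalize_account(expected)


def similarity_from_distance(distance: float) -> float:
    """Convert a distance returned by the vector search into a similarity score."""
    return 1.0 - distance
```

For example, `values_match("6801", 6801.0)` and `values_match(None, None)` are both true under this sketch, while `values_match(None, "4000")` is false.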
Pending Todos
None yet.
Blockers/Concerns
None yet.
Session Continuity
Last session: 2026-02-20
Stopped at: Completed 05-02-PLAN.md (Static HTML Report) - PROJECT COMPLETE
Resume file: None