Domain: Semantic Search Comparison (Embedding vs LLM-based Matching)
Researched: 2026-02-20
Confidence: MEDIUM-HIGH (multiple authoritative sources, verified patterns)
What goes wrong: Vectors stored in pgvector have different dimensions than query vectors, causing silent failures or crashes. This happens when switching embedding models mid-project or using different models for indexing vs querying.
Why it happens: Developers forget to track which model generated which vectors, or accidentally mix models in the same table.
How to avoid:
- Add an `embedding_model` column to vector tables

Warning signs:
Phase to address: Phase 1 (Data Import) - Define schema with model metadata; validate dimensions on insert
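The insert-time dimension check could look roughly like the sketch below. The registry `EXPECTED_DIMS` and the function `validate_embedding` are hypothetical names introduced for illustration; the value 384 is the output size of all-MiniLM-L6-v2, and entries for any other model would need to be taken from that model's documentation.

```python
# Hypothetical registry: model name -> expected embedding dimension.
# all-MiniLM-L6-v2 produces 384-dimensional vectors; add other models as they are adopted.
EXPECTED_DIMS = {
    "all-MiniLM-L6-v2": 384,
}


def validate_embedding(model: str, vector: list[float]) -> None:
    """Raise ValueError if the vector does not match the model's registered dimension."""
    if model not in EXPECTED_DIMS:
        raise ValueError(f"unknown embedding model: {model!r}")
    expected = EXPECTED_DIMS[model]
    if len(vector) != expected:
        raise ValueError(
            f"{model} expects {expected} dimensions, got {len(vector)}"
        )


# Intended use: call before every INSERT and before every query, passing the
# same model name that is stored in the embedding_model column.
validate_embedding("all-MiniLM-L6-v2", [0.0] * 384)  # passes without raising
```

Running the same check on query vectors catches the case where indexing and querying drift onto different models.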
What goes wrong: The distance operator used at search time does not match how the embedding model was trained. Typical cases are ranking by inner product or L2 distance on unnormalized vectors from a cosine-trained model, or switching between inner product and cosine on normalized vectors and expecting different results (on unit-length vectors they produce the same ranking).
Why it happens: all-MiniLM-L6-v2 was trained using cosine similarity. Inner product or L2 distance on unnormalized vectors lets vector magnitude influence the ranking, so results can degrade significantly if the operator is chosen without checking this. Some models output normalized vectors, others do not.
How to avoid:
- Use the `<=>` (cosine) operator for cosine-trained models

Warning signs:
Phase to address: Phase 2 (Embedding Generation) - Research and document each model's training metric; normalize consistently
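To make the metric mismatch concrete, here is a small pure-Python sketch with made-up two-dimensional vectors (not real embeddings). It shows raw inner product preferring a long, off-axis vector, cosine preferring the aligned one, and inner product agreeing with cosine once everything is normalized. In pgvector terms, `<=>` is cosine distance, `<#>` is negative inner product, and `<->` is L2 distance.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: inner product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Illustrative vectors only.
query = [1.0, 0.0]
short_aligned = [1.0, 0.1]  # almost the same direction as the query, small magnitude
long_offaxis = [5.0, 5.0]   # 45 degrees off the query, large magnitude

# Raw inner product rewards magnitude: the off-axis vector scores higher.
raw_prefers_offaxis = dot(query, long_offaxis) > dot(query, short_aligned)

# Cosine ignores magnitude: the aligned vector scores higher.
cosine_prefers_aligned = cosine(query, short_aligned) > cosine(query, long_offaxis)

# After normalization, inner product ranks the same way cosine does.
normalized_agrees = (
    dot(normalize(query), normalize(short_aligned))
    > dot(normalize(query), normalize(long_offaxis))
)
```

All three flags evaluate to `True`, which is the whole point: the operator only stops mattering once every stored and query vector is unit length.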
What goes wrong: Creating an IVFFlat index before loading data, or on a table with too few rows. IVFFlat clustering relies on data distribution - indexing empty or sparse tables produces poor clusters that never improve.
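Why it happens: Developers create indexes during schema setup before data import. Unlike HNSW, IVFFlat computes its cluster centroids once, at build time; rows inserted later are assigned to those existing centroids, so the clustering does not adapt until the index is rebuilt.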
How to avoid:
- Use `lists = rows / 1000` as the starting point for IVFFlat

Warning signs:
Phase to address: Phase 1 (Data Import) - Create indexes AFTER full data load; prefer HNSW for comparison spike
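A sketch of how the post-load index step might be scripted. It applies the `rows / 1000` starting point (pgvector's guidance switches to `sqrt(rows)` above roughly one million rows) and emits the `CREATE INDEX` statement as a string. The helper names and the table and column names in the usage note are assumptions for illustration; the DDL follows pgvector's documented `hnsw` and `ivfflat` syntax with `vector_cosine_ops`.

```python
import math


def ivfflat_lists(row_count: int) -> int:
    """Starting value for IVFFlat `lists`: rows / 1000 up to 1M rows, sqrt(rows) above."""
    if row_count <= 0:
        raise ValueError("build the IVFFlat index only after data is loaded")
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))


def index_ddl(table: str, column: str, row_count: int, kind: str = "hnsw") -> str:
    """Return CREATE INDEX DDL for a cosine-distance index, to run after the bulk load.

    Identifiers are interpolated directly, so pass only trusted, hardcoded names.
    """
    if kind == "hnsw":
        return f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);"
    if kind == "ivfflat":
        lists = ivfflat_lists(row_count)
        return (
            f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
            f"WITH (lists = {lists});"
        )
    raise ValueError(f"unsupported index kind: {kind!r}")
```

For a hypothetical `bookings` table with 6,000 rows, `index_ddl("bookings", "embedding", 6000, "ivfflat")` yields `lists = 6`, which illustrates why HNSW is the simpler choice at this scale.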
What goes wrong: Test queries are variations of training data that the embedding model has already "seen" during its own training, or the synthetic variations are too similar to source data, making the benchmark artificially easy.
Why it happens: When generating synthetic test queries from historical.csv descriptions, the variations may not be truly novel. The embedding model may have been trained on similar financial/accounting text, creating implicit leakage.
How to avoid:
Warning signs:
Phase to address: Phase 3 (Evaluation Design) - Define train/test split BEFORE any embedding; document leakage prevention
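One way to fix the split before any embedding is computed is to derive it from a hash of each record's ID, and then flag synthetic queries that nearly duplicate an indexed description. The sketch below uses only the standard library; the function names, the 20% test fraction, and the 0.8 Jaccard threshold are illustrative assumptions, and token-set overlap is a crude stand-in for whatever similarity check the evaluation settles on. It does not address leakage from the embedding model's own pretraining data.

```python
import hashlib


def assign_split(record_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a record to 'train' or 'test' from a hash of its ID."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def leaked_queries(queries: list[str], indexed: list[str], threshold: float = 0.8) -> list[str]:
    """Return test queries that nearly duplicate some indexed description."""
    return [q for q in queries if any(jaccard(q, d) >= threshold for d in indexed)]
```

Because the assignment depends only on the ID, re-running the import cannot silently move a record from test to train.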
What goes wrong: Passing too many historical bookings to the LLM causes the model to lose important information in the middle of the context (the "lost in the middle" effect), or to hit a token limit that silently truncates the context.
Why it happens: The current Orcha approach passes up to 50 historical bookings as CSV. At ~100 tokens per booking, that is 50 × 100 = 5K tokens for context alone. Adding the system prompt, the query, and room for the response can push the total past the model's limit.
How to avoid:
Warning signs:
Phase to address: Phase 4 (LLM Implementation) - Implement token counting; optimize context placement
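A rough sketch of token budgeting and context placement. The 4-characters-per-token estimate is an assumption and not a real tokenizer, so actual counts should come from the provider's token counting; all function names here are hypothetical. `fit_to_budget` stops adding bookings before the budget is exceeded, and `edge_order` puts the most relevant rows at the start and end of the context.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token. An assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(bookings: list[str], budget_tokens: int) -> list[str]:
    """Keep bookings in ranked order (most relevant first) until the budget is spent."""
    kept: list[str] = []
    used = 0
    for row in bookings:
        cost = estimate_tokens(row)
        if used + cost > budget_tokens:
            break
        kept.append(row)
        used += cost
    return kept


def edge_order(ranked: list[str]) -> list[str]:
    """Reorder so the most relevant rows sit at the start and end of the context
    and the least relevant rows fall in the middle."""
    front = ranked[0::2]       # ranks 1, 3, 5, ...
    back = ranked[1::2][::-1]  # ..., 6, 4, 2
    return front + back
```

With five ranked rows `r1` to `r5`, `edge_order` returns `r1, r3, r5, r4, r2`: the top two matches occupy the first and last positions and the weakest match sits in the middle.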
What goes wrong: Small changes to prompt wording cause 10-70% accuracy swings. The same semantic query gets different results with different prompt formatting. Model switches (even same family) require prompt re-tuning.
Why it happens: LLMs are "notoriously sensitive to subtle variations in prompt phrasing and structure." Performance diverges based on formatting, information order, and sentiment. A prompt that works well on one model may fail on another.
How to avoid:
Warning signs:
Phase to address: Phase 4 (LLM Implementation) - Define and freeze prompts early; measure sensitivity explicitly
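Measuring sensitivity explicitly can be as simple as scoring every frozen prompt variant on the same test cases and reporting the range. In this sketch, `predict` is a hypothetical callable standing in for the LLM call (it takes a prompt and a query and returns a label); nothing here depends on a particular provider.

```python
from typing import Callable


def accuracy(
    predict: Callable[[str, str], str],
    prompt: str,
    cases: list[tuple[str, str]],
) -> float:
    """Fraction of (query, expected) cases where predict(prompt, query) equals expected."""
    hits = sum(1 for query, expected in cases if predict(prompt, query) == expected)
    return hits / len(cases)


def sensitivity_report(
    predict: Callable[[str, str], str],
    prompts: dict[str, str],
    cases: list[tuple[str, str]],
) -> dict:
    """Score every prompt variant on the same cases and report the range, not one number."""
    scores = {name: accuracy(predict, text, cases) for name, text in prompts.items()}
    low, high = min(scores.values()), max(scores.values())
    return {"scores": scores, "min": low, "max": high, "spread": high - low}
```

Reporting `min`, `max`, and `spread` together keeps a single lucky prompt from being presented as the method's accuracy.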
What goes wrong: Using an LLM to evaluate match quality introduces systematic biases: position bias (favoring first/last options), verbosity bias (favoring longer explanations), self-preference bias (favoring outputs similar to its own style).
Why it happens: LLM judges assign higher scores to outputs with lower perplexity (more "familiar" to the model). Position in the comparison affects scores by 10%+. Research shows human-LLM agreement is only 64-68% for domain-specific tasks.
How to avoid:
Warning signs:
Phase to address: Phase 5 (Evaluation) - Implement multi-model judging; randomize order; human validation sample
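Order randomization and a position-bias check might be sketched as follows. The `Judge` type is a hypothetical wrapper around whatever LLM is used as judge: it receives a query and two options and answers `"A"` or `"B"`. Asking twice with the options swapped exposes verdicts that depend on position instead of content.

```python
import random
from typing import Callable, Optional

# Hypothetical judge interface: (query, option_a, option_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, query: str, first: str, second: str) -> Optional[str]:
    """Ask the judge twice with the options swapped.

    Return the winning option if both orderings agree, else None (position-sensitive).
    """
    forward = first if judge(query, first, second) == "A" else second
    swapped = second if judge(query, second, first) == "A" else first
    return forward if forward == swapped else None


def position_bias_rate(judge: Judge, pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of (query, first, second) pairs where the verdict flips with the order."""
    flips = sum(1 for q, a, b in pairs if judge_both_orders(judge, q, a, b) is None)
    return flips / len(pairs)


def shuffled_pair(first: str, second: str, rng: random.Random) -> tuple[str, str]:
    """Randomize presentation order when only one judge call per pair is affordable."""
    return (first, second) if rng.random() < 0.5 else (second, first)
```

A judge that always answers `"A"` scores a bias rate of 1.0 under this check, while a judge that decides on content alone scores 0.0.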
What goes wrong: Comparing embedding models without controlling for confounding factors: different preprocessing, tokenization, batch sizes, or hyperparameters. The "winning" model may just have better defaults.
Why it happens: Each embedding model has its own recommended settings. Applying one model's optimal settings to all of them gives that model an unfair advantage. Generic benchmark scores don't translate to specific domains.
How to avoid:
Warning signs:
Phase to address: Phase 2 (Embedding Generation) - Document per-model configuration; standardize preprocessing
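Per-model configuration can be documented in code so that differences between runs are explicit. The sketch below is one possible shape: the `EmbeddingRunConfig` fields, the `preprocess` rule, and `config_diff` are all assumptions for illustration, and the fields that matter will depend on the models actually compared.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingRunConfig:
    """Settings recorded for one embedding model in the comparison (hypothetical fields)."""
    model: str
    dimensions: int
    distance_metric: str  # e.g. "cosine"
    normalize: bool
    batch_size: int


def preprocess(text: str) -> str:
    """Shared preprocessing applied identically before every model:
    trim and collapse runs of whitespace."""
    return " ".join(text.split())


def config_diff(a: EmbeddingRunConfig, b: EmbeddingRunConfig) -> dict:
    """Fields that differ between two runs, mapped to their (a, b) values."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}
```

Storing the output of `config_diff` alongside the results makes it clear which differences were intended (model, dimensions) and which crept in by accident (batch size, normalization).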
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Single embedding model column | Simpler schema | Can't compare models side-by-side | Never for comparison spike |
| Hardcoded embedding dimensions | Less code | Crashes when adding new models | Never |
| No token counting | Faster implementation | Silent context truncation | Only for prototypes with small data |
| Single prompt template | Faster development | Unknown sensitivity, unreproducible | Never for benchmarking |
| No train/test split | Use all data | Meaningless accuracy metrics | Never for evaluation |
| Integration | Common Mistake | Correct Approach |
|---|---|---|
| Google Embedding API | Not batching requests | Batch up to 2048 inputs per request |
| Jina AI API | Ignoring rate limits | Implement exponential backoff; use batch endpoint |
| Gemini 2.5 Flash | Passing context in middle | Put critical info at start/end of context |
| pgvector | Creating index before data | Load all data, then CREATE INDEX |
| all-MiniLM-L6-v2 | Using L2 distance | Use cosine similarity (model training metric) |
| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| No pgvector index | Queries slow on 6K rows | Add HNSW index after data load | >1K rows |
| Embedding API per-row | Import takes hours | Batch requests (100-500 per call) | >100 rows |
| HNSW default parameters | Good recall, slow build | Tune m and ef_construction | >10K vectors |
| Full table scan with filter | Filter + vector search slow | Use partial indexes or pre-filter | >5K rows with filter |
`embedding_model` column populated

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| Dimension mismatch | LOW | Re-embed affected data with correct model; update schema |
| Wrong distance metric | MEDIUM | Re-run evaluation with correct metric; may need re-index |
| IVFFlat on empty table | LOW | Drop index, reload data, recreate index (or switch to HNSW) |
| Data leakage | HIGH | Redesign test set; re-run all evaluations |
| Context overflow | MEDIUM | Reduce context size; re-run LLM comparisons |
| Prompt sensitivity | MEDIUM | Document variants; report range not point estimate |
| LLM-as-judge bias | MEDIUM | Add randomization, multi-model judging; re-evaluate |
| Unfair comparison | HIGH | Standardize configs; may need re-embedding |
| Pitfall | Prevention Phase | Verification |
|---|---|---|
| Dimension mismatch | Phase 1: Data Import | Schema includes model metadata; dimension validated on insert |
| Normalization mismatch | Phase 2: Embedding Generation | Documentation of each model's metric; normalization verified |
| IVFFlat on empty | Phase 1: Data Import | Index creation after data load (use HNSW instead) |
| Data leakage | Phase 3: Evaluation Design | Train/test split documented; test data not embedded until eval |
| Context overflow | Phase 4: LLM Implementation | Token counting implemented; context limits documented |
| Prompt sensitivity | Phase 4: LLM Implementation | Multiple prompt variants tested; sensitivity reported |
| LLM-as-judge bias | Phase 5: Evaluation | Multi-model judging; order randomization; human sample |
| Unfair comparison | Phase 2: Embedding Generation | Per-model configs documented; preprocessing standardized |
Pitfalls research for: Semantic Search Comparison Spike
Researched: 2026-02-20