Pitfalls Research

Domain: Semantic Search Comparison (Embedding vs LLM-based Matching)
Researched: 2026-02-20
Confidence: MEDIUM-HIGH (multiple authoritative sources, verified patterns)

Critical Pitfalls

Pitfall 1: Embedding Dimension Mismatch

What goes wrong: Vectors stored in pgvector have different dimensions than query vectors, causing silent failures or crashes. This happens when switching embedding models mid-project or using different models for indexing vs querying.

Why it happens: Developers forget to track which model generated which vectors, or accidentally mix models in the same table.

How to avoid:

Warning signs:

Phase to address: Phase 1 (Data Import) - Define schema with model metadata; validate dimensions on insert
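
A minimal sketch of validating dimensions on insert, assuming each stored vector is tagged with the name of the model that produced it. The registry, the EmbeddingRow type, and validate_row are hypothetical names for illustration; only the 384-dimension figure for all-MiniLM-L6-v2 is a known fact, and the second registry entry is a placeholder.

```python
from dataclasses import dataclass

# Hypothetical registry: model name -> expected vector dimension.
# all-MiniLM-L6-v2 produces 384-dimensional vectors; the second entry
# is a placeholder, not a real model.
EXPECTED_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "example-model-768": 768,
}


@dataclass(frozen=True)
class EmbeddingRow:
    booking_id: int
    model_name: str
    vector: list


def validate_row(row, expected_dims=EXPECTED_DIMS):
    """Reject a row whose vector length does not match its model's dimension."""
    if row.model_name not in expected_dims:
        raise ValueError(f"unknown embedding model: {row.model_name!r}")
    expected = expected_dims[row.model_name]
    if len(row.vector) != expected:
        raise ValueError(
            f"{row.model_name} expects {expected} dimensions, "
            f"got {len(row.vector)} for booking {row.booking_id}"
        )
    return row
```

Calling validate_row immediately before each insert surfaces a mixed-model batch at import time rather than at query time.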


Pitfall 2: Normalization and Distance Metric Mismatch

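What goes wrong: Using inner product or L2 distance on unnormalized vectors when the model was trained for cosine similarity, or assuming that dot product and cosine similarity rank differently on normalized vectors (on unit-length vectors they are identical). Cosine similarity itself ignores vector length, so the mismatch appears only with length-sensitive metrics. The similarity metric used during search must match how the embedding model was trained.

Why it happens: all-MiniLM-L6-v2 was trained with a cosine-similarity objective. If you use L2 distance or inner product on unnormalized vectors without understanding the implications, rankings change and results can degrade significantly. Some models output normalized vectors, others do not.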

How to avoid:

Warning signs:

Phase to address: Phase 2 (Embedding Generation) - Research and document each model's training metric; normalize consistently
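
The relationship between the metrics can be checked with a few lines of arithmetic. This is an illustrative sketch using plain Python lists and made-up 2-D vectors, not output from any real embedding model: on raw vectors the dot product rewards length, while on unit vectors the dot product equals cosine similarity and squared L2 distance equals 2 - 2·cos.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in a]


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up vectors: long_vec is longer but points in a different direction.
query = [1.0, 1.0]
long_vec = [3.0, 0.0]
aligned_vec = [1.0, 1.0]

# Raw dot product prefers the long vector; cosine prefers the aligned one.
raw_dot_prefers_long = dot(query, long_vec) > dot(query, aligned_vec)
cosine_prefers_aligned = (
    cosine_similarity(query, aligned_vec) > cosine_similarity(query, long_vec)
)

# After normalization the dot product agrees with cosine.
q, a, b = normalize(query), normalize(long_vec), normalize(aligned_vec)
unit_dot_prefers_aligned = dot(q, b) > dot(q, a)
```

If every stored and query vector is normalized once at write time, the choice between cosine, inner product, and L2 no longer changes the ranking.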


Pitfall 3: IVFFlat Index on Insufficient Data

What goes wrong: Creating an IVFFlat index before loading data, or on a table with too few rows. IVFFlat clustering relies on data distribution - indexing empty or sparse tables produces poor clusters that never improve.

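Why it happens: Developers create indexes during schema setup before data import. Unlike HNSW, IVFFlat computes its cluster centroids once at build time. Rows inserted later are assigned to the existing lists, and the clusters are not recomputed until the index is rebuilt.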

How to avoid:

Warning signs:

Phase to address: Phase 1 (Data Import) - Create indexes AFTER full data load; prefer HNSW for comparison spike
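
A sketch of generating the index statement only after the row count is known. The table and column names, the min_ivfflat_rows threshold, and the helper names are assumptions for illustration. The lists heuristic follows the starting point suggested in the pgvector README (rows / 1000 up to 1M rows, sqrt(rows) above that); treat it as a starting value to tune, not a guarantee.

```python
import math


def ivfflat_lists(row_count):
    """Starting value for IVFFlat `lists`, per the pgvector README heuristic."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


def index_sql(table, column, row_count, method="hnsw", min_ivfflat_rows=1000):
    """Build a CREATE INDEX statement for an already-loaded table.

    `table` and `column` must come from trusted constants, since they are
    interpolated directly into the SQL text.
    """
    if row_count == 0:
        # Follows the load-then-index rule from this section.
        raise ValueError("load data before creating a vector index")
    if method == "hnsw":
        return f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);"
    if method == "ivfflat":
        if row_count < min_ivfflat_rows:  # assumed threshold, tune per dataset
            raise ValueError(
                f"only {row_count} rows: too few for useful IVFFlat clusters"
            )
        lists = ivfflat_lists(row_count)
        return (
            f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
            f"WITH (lists = {lists});"
        )
    raise ValueError(f"unsupported index method: {method!r}")
```

For a table of about 6K rows, index_sql("bookings", "embedding", 6000, method="ivfflat") yields lists = 6, which illustrates why HNSW is the simpler choice at this scale.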


Pitfall 4: Data Leakage in Evaluation

What goes wrong: Test queries are variations of training data that the embedding model has already "seen" during its own training, or the synthetic variations are too similar to source data, making the benchmark artificially easy.

Why it happens: When generating synthetic test queries from historical.csv descriptions, the variations may not be truly novel. The embedding model may have been trained on similar financial/accounting text, creating implicit leakage.

How to avoid:

Warning signs:

Phase to address: Phase 3 (Evaluation Design) - Define train/test split BEFORE any embedding; document leakage prevention
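
One way to fix the split before any embedding is to assign each source record to a partition by hashing its identifier, so that every synthetic variation of a booking lands on the same side. The sketch below also flags synthetic queries that are near-copies of a source description using token-level Jaccard overlap. The function names, the 20% test fraction, and the 0.8 threshold are illustrative assumptions.

```python
import hashlib


def assign_split(source_id, test_fraction=0.2):
    """Deterministically map a source record id to 'train' or 'test'."""
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def token_jaccard(a, b):
    """Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def flag_leaky_queries(queries, sources, threshold=0.8):
    """Return queries that nearly duplicate any source description."""
    return [
        q for q in queries
        if any(token_jaccard(q, s) >= threshold for s in sources)
    ]
```

Token overlap only catches surface-level copies. It does not detect the implicit leakage from a model pre-trained on similar text, which still needs a separate check.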


Pitfall 5: LLM Context Window Overflow

What goes wrong: Passing too many historical bookings to the LLM causes the model to lose important information in the middle of the context (the "lost in the middle" effect), or hits token limits and silently truncates the context.

Why it happens: The current Orcha approach passes up to 50 historical bookings as CSV. At ~100 tokens per booking, that is ~5K tokens for context alone (50 × 100 = 5,000). Adding the system prompt, the query, and space reserved for the response can push the total past the model's limit or the configured input budget.

How to avoid:

Warning signs:

Phase to address: Phase 4 (LLM Implementation) - Implement token counting; optimize context placement
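
A sketch of the token budget arithmetic and of edge placement. The 4-characters-per-token estimate is a rough stand-in, not a real tokenizer; in practice count_tokens should be replaced with the target model's own counter. All function names and the example limits are assumptions.

```python
def estimate_tokens(text):
    """Crude estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_bookings(bookings, context_limit, system_tokens, query_tokens,
                 reserved_output_tokens, count_tokens=estimate_tokens):
    """Keep bookings (assumed sorted most relevant first) within the budget."""
    budget = context_limit - system_tokens - query_tokens - reserved_output_tokens
    if budget <= 0:
        raise ValueError("no room left for historical bookings")
    kept, used = [], 0
    for booking in bookings:
        cost = count_tokens(booking)
        if used + cost > budget:
            break  # explicit cut-off instead of silent truncation
        kept.append(booking)
        used += cost
    return kept, used


def edge_order(ranked):
    """Place the most relevant items at the start and end of the context.

    Ranks 1, 3, 5, ... fill from the front; ranks 2, 4, ... fill from the
    back, leaving the least relevant items in the middle.
    """
    front = ranked[0::2]
    back = ranked[1::2][::-1]
    return front + back
```

With a 4,096-token limit, 500 tokens of system prompt, 100 for the query, and 500 reserved for output, only 29 of 50 bookings at 100 tokens each fit, so the drop is visible and can be logged.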


Pitfall 6: Prompt Sensitivity in LLM-based Matching

What goes wrong: Small changes to prompt wording cause 10-70% accuracy swings. The same semantic query gets different results with different prompt formatting. Switching models, even within the same family, requires re-tuning the prompt.

Why it happens: LLMs are "notoriously sensitive to subtle variations in prompt phrasing and structure." Performance diverges based on formatting, information order, and sentiment. A prompt that works well on one model may fail on another.

How to avoid:

Warning signs:

Phase to address: Phase 4 (LLM Implementation) - Define and freeze prompts early; measure sensitivity explicitly
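
Measuring sensitivity explicitly can be as simple as scoring the same test set under each frozen prompt variant and reporting the range. The sketch below assumes the per-query correctness flags have already been collected; the function names and the report layout are illustrative.

```python
def accuracy(outcomes):
    """Fraction of True values in a list of per-query correctness flags."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def sensitivity_report(outcomes_by_prompt):
    """Summarize accuracy per prompt variant and the spread between them."""
    if not outcomes_by_prompt:
        raise ValueError("at least one prompt variant is required")
    per_prompt = {
        name: accuracy(outcomes)
        for name, outcomes in outcomes_by_prompt.items()
    }
    lowest = min(per_prompt.values())
    highest = max(per_prompt.values())
    return {
        "per_prompt": per_prompt,
        "min": lowest,
        "max": highest,
        "spread": highest - lowest,
    }
```

Reporting min, max, and spread alongside any headline number makes it clear how much of a result depends on prompt wording.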


Pitfall 7: LLM-as-Judge Bias in Evaluation

What goes wrong: Using an LLM to evaluate match quality introduces systematic biases: position bias (favoring first/last options), verbosity bias (favoring longer explanations), self-preference bias (favoring outputs similar to its own style).

Why it happens: LLM judges assign higher scores to outputs with lower perplexity (more "familiar" to the model). Position in the comparison affects scores by 10%+. Research shows human-LLM agreement is only 64-68% for domain-specific tasks.

How to avoid:

Warning signs:

Phase to address: Phase 5 (Evaluation) - Implement multi-model judging; randomize order; human validation sample
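
A sketch of one way to expose position bias: ask the judge about each pair in both orders and count how often the verdict flips. This is a stricter variant of order randomization, swapped in here because it measures the bias directly. The judge callable is hypothetical; it is assumed to take (query, first, second) and return the string "first" or "second".

```python
def judge_pair(judge, query, candidate_a, candidate_b):
    """Return 'a', 'b', or 'inconsistent' after judging both orders."""
    forward = judge(query, candidate_a, candidate_b)
    backward = judge(query, candidate_b, candidate_a)
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    forward_winner = "a" if forward == "first" else "b"
    backward_winner = "b" if backward == "first" else "a"
    if forward_winner == backward_winner:
        return forward_winner
    return "inconsistent"


def inconsistency_rate(judge, pairs):
    """Share of (query, a, b) pairs whose verdict flips with the order."""
    if not pairs:
        raise ValueError("no pairs to judge")
    verdicts = [judge_pair(judge, q, a, b) for q, a, b in pairs]
    return verdicts.count("inconsistent") / len(verdicts)
```

A high inconsistency rate suggests that position, not content, is driving the scores; those pairs are good candidates for the human validation sample.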


Pitfall 8: Unfair Embedding Model Comparison

What goes wrong: Comparing embedding models without controlling for confounding factors: different preprocessing, tokenization, batch sizes, or hyperparameters. The "winning" model may just have better defaults.

Why it happens: Each embedding model has its own recommended settings. Applying one model's optimal settings to all models gives that model an unfair advantage. Generic benchmark scores don't translate to specific domains.

How to avoid:

Warning signs:

Phase to address: Phase 2 (Embedding Generation) - Document per-model configuration; standardize preprocessing
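
A sketch of keeping per-model settings explicit while sharing one preprocessing step. The ModelConfig fields, the prefix values, and the preprocessing rules (Unicode NFKC plus whitespace collapsing) are assumptions chosen for illustration; the actual recommended settings must come from each model's documentation.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str
    dimensions: int
    distance_metric: str        # e.g. "cosine", "l2", "ip"
    normalize_output: bool
    query_prefix: str = ""      # some models expect task prefixes
    document_prefix: str = ""


def preprocess(text):
    """Shared preprocessing applied identically for every model."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def prepare_input(text, config, is_query):
    """Apply shared preprocessing, then the model-specific prefix."""
    prefix = config.query_prefix if is_query else config.document_prefix
    return prefix + preprocess(text)


# Illustrative configs: only the MiniLM dimension is a known fact.
MINILM = ModelConfig("all-MiniLM-L6-v2", 384, "cosine", normalize_output=True)
PREFIXED = ModelConfig(
    "example-prefixed-model", 768, "cosine", normalize_output=True,
    query_prefix="query: ", document_prefix="passage: ",
)
```

Recording these configs next to the results makes it possible to tell a model difference from a settings difference later.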


Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|---|---|---|---|
| Single embedding model column | Simpler schema | Can't compare models side-by-side | Never for comparison spike |
| Hardcoded embedding dimensions | Less code | Crashes when adding new models | Never |
| No token counting | Faster implementation | Silent context truncation | Only for prototypes with small data |
| Single prompt template | Faster development | Unknown sensitivity, unreproducible | Never for benchmarking |
| No train/test split | Use all data | Meaningless accuracy metrics | Never for evaluation |

Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|---|---|---|
| Google Embedding API | Not batching requests | Batch up to 2048 inputs per request |
| Jina AI API | Ignoring rate limits | Implement exponential backoff; use batch endpoint |
| Gemini 2.5 Flash | Passing context in middle | Put critical info at start/end of context |
| pgvector | Creating index before data | Load all data, then CREATE INDEX |
| all-MiniLM-L6-v2 | Using L2 distance | Use cosine similarity (model training metric) |
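
The exponential backoff mentioned for rate-limited APIs can be sketched as a small retry wrapper. RateLimitError is a hypothetical exception standing in for whatever a given client library raises on HTTP 429; the delay values and the jitter range are assumptions.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client library's rate-limit error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, retry_on=(RateLimitError,)):
    """Call fn(), retrying on rate-limit errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 50% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Passing sleep as a parameter keeps the wrapper testable without real waiting.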

Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|---|---|---|---|
| No pgvector index | Queries slow on 6K rows | Add HNSW index after data load | >1K rows |
| Embedding API per-row | Import takes hours | Batch requests (100-500 per call) | >100 rows |
| HNSW default parameters | Good recall, slow build | Tune m and ef_construction | >10K vectors |
| Full table scan with filter | Filter + vector search slow | Use partial indexes or pre-filter | >5K rows with filter |
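
Batching instead of calling the embedding API per row can be sketched as follows. embed_batch is a hypothetical callable that takes a list of texts and returns one vector per text; the default batch size of 100 is an assumption to adjust to the provider's documented limit.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(texts, embed_batch, batch_size=100):
    """Embed texts in batches, checking that each call returns one vector per input."""
    vectors = []
    for chunk in batched(texts, batch_size):
        result = embed_batch(chunk)
        if len(result) != len(chunk):
            raise RuntimeError(
                f"expected {len(chunk)} vectors, got {len(result)}"
            )
        vectors.extend(result)
    return vectors
```

For 6,000 rows and a batch size of 250 this makes 24 API calls instead of 6,000.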

"Looks Done But Isn't" Checklist

Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---|---|---|
| Dimension mismatch | LOW | Re-embed affected data with correct model; update schema |
| Wrong distance metric | MEDIUM | Re-run evaluation with correct metric; may need re-index |
| IVFFlat on empty table | LOW | Drop index, reload data, recreate index (or switch to HNSW) |
| Data leakage | HIGH | Redesign test set; re-run all evaluations |
| Context overflow | MEDIUM | Reduce context size; re-run LLM comparisons |
| Prompt sensitivity | MEDIUM | Document variants; report range not point estimate |
| LLM-as-judge bias | MEDIUM | Add randomization, multi-model judging; re-evaluate |
| Unfair comparison | HIGH | Standardize configs; may need re-embedding |

Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---|---|---|
| Dimension mismatch | Phase 1: Data Import | Schema includes model metadata; dimension validated on insert |
| Normalization mismatch | Phase 2: Embedding Generation | Documentation of each model's metric; normalization verified |
| IVFFlat on empty | Phase 1: Data Import | Index creation after data load (use HNSW instead) |
| Data leakage | Phase 3: Evaluation Design | Train/test split documented; test data not embedded until eval |
| Context overflow | Phase 4: LLM Implementation | Token counting implemented; context limits documented |
| Prompt sensitivity | Phase 4: LLM Implementation | Multiple prompt variants tested; sensitivity reported |
| LLM-as-judge bias | Phase 5: Evaluation | Multi-model judging; order randomization; human sample |
| Unfair comparison | Phase 2: Embedding Generation | Per-model configs documented; preprocessing standardized |

Pitfalls research for: Semantic Search Comparison Spike Researched: 2026-02-20