Pitfalls Research

Domain: Semantic Search Comparison (Embedding vs LLM-based Matching)
Researched: 2026-02-20
Confidence: MEDIUM-HIGH (multiple authoritative sources, verified patterns)

Critical Pitfalls

Pitfall 1: Embedding Dimension Mismatch

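What goes wrong: Vectors stored in pgvector have a different dimension than the query vectors. pgvector raises an error when the dimensions differ, which at least makes the problem visible. The harder case is two different models that share the same dimension: nothing fails, and the similarity scores are silently meaningless. Both cases arise when switching embedding models mid-project or using different models for indexing and querying.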

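Why it happens: The candidate models produce vectors of different sizes:

Google text-multilingual-embedding-002: 768 dimensions
Jina AI embeddings: varies by model (for example, 768 for jina-embeddings-v2-base, 1024 for jina-embeddings-v3)
all-MiniLM-L6-v2: 384 dimensions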

Developers forget to track which model generated which vectors, or accidentally mix models in the same table.

How to avoid:

Store embedding model name and version alongside vectors in the database
Add an embedding_model column to vector tables
Validate dimensions match before insert/query
Use separate tables or columns per embedding model during comparison
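
The checks above can be sketched as a small guard that runs before every insert or query. This is a minimal sketch under stated assumptions, not the project's implementation: the EXPECTED_DIMS registry and the validate_embedding function are hypothetical names, and only the two models whose dimensions are fixed in the list above are registered.

```python
# Hypothetical registry: model name -> expected vector length.
# Values are taken from the list above; add other models only after
# checking the provider's documentation.
EXPECTED_DIMS = {
    "text-multilingual-embedding-002": 768,
    "all-MiniLM-L6-v2": 384,
}


def validate_embedding(model_name: str, vector: list[float]) -> None:
    """Raise ValueError if the vector does not match the model's dimension."""
    if model_name not in EXPECTED_DIMS:
        raise ValueError(f"unknown embedding model: {model_name!r}")
    expected = EXPECTED_DIMS[model_name]
    if len(vector) != expected:
        raise ValueError(
            f"{model_name} expects {expected} dimensions, got {len(vector)}"
        )
```

Calling the guard with the same model name that is written to the embedding_model column keeps the stored metadata and the validation in step.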

Warning signs:

pgvector errors about dimension mismatch
Unusually low similarity scores across the board
Queries returning no results when data exists

Phase to address: Phase 1 (Data Import) - Define schema with model metadata; validate dimensions on insert

Pitfall 2: Normalization and Distance Metric Mismatch

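What goes wrong: The distance operator used at query time does not match how the vectors were produced. Cosine similarity ignores vector length, but inner product and L2 distance do not, so running them on unnormalized vectors ranks results partly by magnitude rather than by direction. The metric used during search should match the one the embedding model was trained with.

Why it happens: all-MiniLM-L6-v2 was trained using cosine similarity. Some models output unit-length vectors, others do not. On unit-length vectors, cosine, inner product, and L2 distance all produce the same ranking; on unnormalized vectors they can diverge, and results degrade if the wrong one is chosen without understanding the implications.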

How to avoid:

Check embedding model documentation for recommended distance metric
Pre-normalize vectors at indexing time if using cosine similarity (then dot product = cosine)
Use pgvector's <=> operator (cosine distance, so lower means more similar) for cosine-trained models
Document the distance metric choice and rationale for each model
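
The claim that dot product equals cosine similarity after normalization can be checked with a few lines of arithmetic. The sketch below uses only the standard library; the helper names and the two example vectors are illustrative assumptions.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm; the zero vector is rejected."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b, independent of their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
raw_dot = dot(a, b)                                # depends on vector length
unit_dot = dot(l2_normalize(a), l2_normalize(b))   # matches the cosine value
cosine = cosine_similarity(a, b)
```

Here raw_dot changes if either vector is rescaled, while unit_dot and cosine do not, which is why normalizing once at indexing time makes the cheaper inner-product operator safe to use.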

Warning signs:

High similarity scores for obviously unrelated items
Inconsistent ranking compared to expected results
Different models producing wildly different result orderings

Phase to address: Phase 2 (Embedding Generation) - Research and document each model's training metric; normalize consistently

Pitfall 3: IVFFlat Index on Insufficient Data

What goes wrong: Creating an IVFFlat index before loading data, or on a table with too few rows. IVFFlat computes its cluster centroids from the rows present when the index is built, so indexing an empty or sparse table produces poor clusters that never improve.

Why it happens: Developers create indexes during schema setup before data import. Unlike HNSW, IVFFlat does not adapt as data arrives: rows inserted later are assigned to the existing lists, but the centroids stay fixed until the index is rebuilt.

How to avoid:

Use HNSW for this spike (works without data, better for dynamic datasets)
If using IVFFlat: load ALL data first, then create index
pgvector docs recommend starting with lists = rows / 1000 for IVFFlat on up to 1M rows, and sqrt(rows) above that
For ~6K rows: either skip index (sequential scan is fast enough) or use HNSW
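
The sizing rule above can be turned into a helper that produces the index statements to run once the import has finished. This is a sketch, not a definitive setup script: the function names are hypothetical, the table and column names are supplied by the caller, and the statements assume cosine distance (vector_cosine_ops) with pgvector's default HNSW parameters.

```python
import math


def suggest_ivfflat_lists(row_count: int) -> int:
    """Starting value for lists: rows / 1000 up to 1M rows, sqrt(rows) above."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))


def index_statements(table: str, column: str, row_count: int) -> dict[str, str]:
    """Build CREATE INDEX statements to run AFTER the data load.

    Identifiers are interpolated directly, so pass only trusted names.
    """
    lists = suggest_ivfflat_lists(row_count)
    return {
        "hnsw": f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);",
        "ivfflat": (
            f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
            f"WITH (lists = {lists});"
        ),
    }
```

For a table of about 6,000 rows the rule yields lists = 6, which illustrates why skipping the index or using HNSW is the simpler choice at this scale.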

Warning signs:

Query recall much lower than expected
Index created before data import
IVFFlat with very low list count on large dataset

Phase to address: Phase 1 (Data Import) - Create indexes AFTER full data load; prefer HNSW for comparison spike

Pitfall 4: Data Leakage in Evaluation

What goes wrong: Test queries are variations of training data that the embedding model has already "seen" during its own training, or the synthetic variations are too similar to source data, making the benchmark artificially easy.

Why it happens: When generating synthetic test queries from historical.csv descriptions, the variations may not be truly novel. The embedding model may have been trained on similar financial/accounting text, creating implicit leakage.

How to avoid:

Create test set with HELD-OUT data (not used for embedding/indexing)
Generate diverse variations: typos, abbreviations, synonyms, paraphrases
Include genuinely novel queries (related to the domain but not derived from any indexed row)
Split historical.csv: 80% for indexing, 20% for evaluation
Do NOT embed test data before evaluation
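
The 80/20 split can be made reproducible with a seeded shuffle, so the same rows are held out on every run. A minimal sketch, assuming the rows of historical.csv have already been loaded into a list; the function name and the default seed are arbitrary choices.

```python
import random


def split_rows(rows: list, eval_fraction: float = 0.2, seed: int = 42):
    """Shuffle deterministically, then return (index_rows, eval_rows)."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # private RNG, reproducible order
    eval_size = round(len(shuffled) * eval_fraction)
    return shuffled[eval_size:], shuffled[:eval_size]
```

Only the first returned list should be embedded and indexed. Synthetic query variations are then generated from the second list, which stays out of the database until evaluation.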

Warning signs:

Suspiciously high accuracy (>95%) on all models
Models performing identically despite different architectures
Top-1 accuracy equals Top-K accuracy (perfect retrieval)
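
Comparing Top-1 with Top-K only works if both are computed by the same routine. The sketch below is one straightforward definition; the function name and the ID-based result format are assumptions made for illustration.

```python
def top_k_accuracy(
    ranked_results: list[list[str]], expected: list[str], k: int
) -> float:
    """Fraction of queries whose expected ID appears in the first k results."""
    if len(ranked_results) != len(expected):
        raise ValueError("one expected ID is required per query")
    if not expected:
        return 0.0
    hits = sum(
        1
        for ranked, target in zip(ranked_results, expected)
        if target in ranked[:k]
    )
    return hits / len(expected)
```

If the value for k = 1 equals the value for a larger k on every model, the test queries are probably too close to the indexed rows.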

Phase to address: Phase 3 (Evaluation Design) - Define train/test split BEFORE any embedding; document leakage prevention

Pitfall 5: LLM Context Window Overflow

What goes wrong: Passing too many historical bookings to the LLM causes the model to lose important information in the middle ("lost in the middle" effect), or hit token limits silently truncating context.

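Why it happens: The current Orcha approach passes up to 50 historical bookings as CSV. At ~100 tokens per booking, that is about 5K tokens for context alone (50 x 100 = 5,000). The system prompt, the query, and the reserved response space come on top, and the total grows with every increase to the booking cap, so it can exceed the limit of smaller-context models or the range in which a model attends reliably.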

How to avoid:

Calculate actual token count before API call
Limit to 20-30 most relevant bookings (not arbitrary 50)
Place query and critical context at START and END (not middle)
Use Gemini 2.5 Flash's longer context window strategically
Log token usage to detect near-limit situations
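
The budgeting and placement steps above can be sketched as follows. The 4-characters-per-token ratio is a rough assumption used only to keep the example self-contained; real counts should come from the provider's token-counting facility. All function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate only; replace with the provider's token counter."""
    return max(1, round(len(text) / chars_per_token))


def fit_to_budget(bookings: list[str], budget_tokens: int) -> list[str]:
    """Keep bookings (already sorted by relevance) until the budget is spent."""
    kept, used = [], 0
    for booking in bookings:
        cost = estimate_tokens(booking)
        if used + cost > budget_tokens:
            break
        kept.append(booking)
        used += cost
    return kept


def build_prompt(instructions: str, query: str, bookings: list[str]) -> str:
    """Place the query at the start and repeat it at the end of the context."""
    parts = [instructions, f"Query: {query}", *bookings, f"Query (repeated): {query}"]
    return "\n".join(parts)
```

With bookings of roughly 100 tokens each, a 3,000-token budget keeps the 30 most relevant rows, which matches the 20-30 range suggested above.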

Warning signs:

LLM responses ignore relevant bookings
Inconsistent results with same input
API errors about context length
Quality degrades as historical data grows

Phase to address: Phase 4 (LLM Implementation) - Implement token counting; optimize context placement

Pitfall 6: Prompt Sensitivity in LLM-based Matching

What goes wrong: Small changes to prompt wording can cause accuracy swings of 10-70%. The same semantic query gets different results with different prompt formatting. Switching models, even within the same family, requires prompt re-tuning.

Why it happens: LLMs are "notoriously sensitive to subtle variations in prompt phrasing and structure." Performance diverges based on formatting, information order, and sentiment. A prompt that works well on one model may fail on another.

How to avoid:

Test 3-5 prompt variations and report variance, not just best result
Document exact prompt templates used
Use structured output (JSON) to reduce interpretation variance
Include clear rubric in prompt for what constitutes a "match"
Report prompt sensitivity as a metric in evaluation
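
Reporting variance rather than the best result needs only a small summary step once each prompt variant has an accuracy score. A minimal sketch using the standard library; the function name and the dictionary layout (variant label to accuracy) are assumptions.

```python
import statistics


def sensitivity_report(accuracy_by_prompt: dict[str, float]) -> dict[str, float]:
    """Summarize accuracy across prompt variants instead of the best one."""
    scores = list(accuracy_by_prompt.values())
    if len(scores) < 2:
        raise ValueError("need at least two prompt variants")
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),   # sample standard deviation
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
    }
```

The range and standard deviation from this report can be stored next to the exact prompt templates, so a later reader can tell a real difference between approaches from prompt noise.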

Warning signs:

Results vary significantly between runs
Minor prompt edits cause major accuracy changes
Different team members get different results

Phase to address: Phase 4 (LLM Implementation) - Define and freeze prompts early; measure sensitivity explicitly

Pitfall 7: LLM-as-Judge Bias in Evaluation

What goes wrong: Using an LLM to evaluate match quality introduces systematic biases: position bias (favoring first/last options), verbosity bias (favoring longer explanations), self-preference bias (favoring outputs similar to its own style).

Why it happens: LLM judges assign higher scores to outputs with lower perplexity (more "familiar" to the model). Position in the comparison affects scores by 10%+. Research shows human-LLM agreement is only 64-68% for domain-specific tasks.

How to avoid:

Use multiple LLMs as judges (GPT-4, Claude, Gemini) and report variance
Randomize presentation order in pairwise comparisons
Require explanation before score (improves alignment)
Supplement with human evaluation on sample set
Define explicit rubric in evaluation prompt
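
Order randomization and a position-bias check can be sketched without committing to a particular judge model. In this sketch the judge is any callable that receives the query and two candidates and answers "first" or "second"; in practice it would wrap an LLM call. The type alias and function names are hypothetical.

```python
import random
from typing import Callable

# (query, first_candidate, second_candidate) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def judge_both_orders(judge: Judge, query: str, a: str, b: str) -> str:
    """Return "a", "b", or "inconsistent" after judging in both orders."""
    forward = judge(query, a, b)    # a is shown first
    reverse = judge(query, b, a)    # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_reverse = "b" if reverse == "first" else "a"
    if winner_forward == winner_reverse:
        return winner_forward
    return "inconsistent"


def randomized_pair(a: str, b: str, rng: random.Random) -> tuple[str, str, bool]:
    """Randomize presentation order; the flag records whether a and b were swapped."""
    swapped = rng.random() < 0.5
    return (b, a, True) if swapped else (a, b, False)
```

A judge that always prefers whichever candidate is shown first yields "inconsistent" for every pair, which makes position bias measurable as a rate rather than a suspicion.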

Warning signs:

One approach consistently wins in all LLM-as-judge comparisons
Results flip when presentation order changes
High disagreement between different LLM judges

Phase to address: Phase 5 (Evaluation) - Implement multi-model judging; randomize order; human validation sample

Pitfall 8: Unfair Embedding Model Comparison

What goes wrong: Comparing embedding models without controlling for confounding factors: different preprocessing, tokenization, batch sizes, or hyperparameters. The "winning" model may just have better defaults.

Why it happens: Each embedding model has its own recommended settings. Using one model's optimal settings for all models gives that model an unfair advantage. Generic benchmark scores don't translate to specific domains.

How to avoid:

Use same preprocessing pipeline for all models
Research and use each model's recommended settings
Report per-model configuration explicitly
Test on YOUR domain data (financial/accounting), not just generic benchmarks
Tune hyperparameters per model if resources allow
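
One way to keep preprocessing shared and per-model settings explicit is to route all text through a single function and record each model's configuration as data. This is a sketch: the preprocessing steps, the class and field names, and the metric, normalization, and batch-size values are placeholders to be confirmed against each model's documentation. Only the dimensions come from the notes above.

```python
from dataclasses import asdict, dataclass


def preprocess(text: str) -> str:
    """Shared preprocessing, applied identically before every model."""
    return " ".join(text.strip().lower().split())


@dataclass(frozen=True)
class EmbeddingRunConfig:
    model_name: str
    dimensions: int
    distance_metric: str   # placeholder until confirmed per model
    normalize: bool        # whether vectors are L2-normalized before storage
    batch_size: int        # placeholder; cap at the API's documented limit


CONFIGS = [
    EmbeddingRunConfig("all-MiniLM-L6-v2", 384, "cosine", True, 64),
    EmbeddingRunConfig("text-multilingual-embedding-002", 768, "cosine", True, 100),
]


def config_report(configs: list[EmbeddingRunConfig]) -> list[dict]:
    """Plain dictionaries, ready to be saved alongside the evaluation results."""
    return [asdict(config) for config in configs]
```

Saving the output of config_report with every evaluation run makes unexplained configuration differences between models visible.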

Warning signs:

Results contradict published benchmarks dramatically
One model performs much worse than expected
Configuration differs between models without justification

Phase to address: Phase 2 (Embedding Generation) - Document per-model configuration; standardize preprocessing

Technical Debt Patterns

Shortcut	Immediate Benefit	Long-term Cost	When Acceptable
Single embedding model column	Simpler schema	Can't compare models side-by-side	Never for comparison spike
Hardcoded embedding dimensions	Less code	Crashes when adding new models	Never
No token counting	Faster implementation	Silent context truncation	Only for prototypes with small data
Single prompt template	Faster development	Unknown sensitivity, unreproducible	Never for benchmarking
No train/test split	Use all data	Meaningless accuracy metrics	Never for evaluation

Integration Gotchas

Integration	Common Mistake	Correct Approach
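Google Embedding API	Not batching requests	Batch inputs up to the documented per-request limit (Vertex AI lists 250 texts per request for text embedding models; 2,048 is the per-input token limit, not the batch size; verify current quotas)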
Jina AI API	Ignoring rate limits	Implement exponential backoff; use batch endpoint
Gemini 2.5 Flash	Passing context in middle	Put critical info at start/end of context
pgvector	Creating index before data	Load all data, then CREATE INDEX
all-MiniLM-L6-v2	Using L2 distance	Use cosine similarity (model training metric)
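
The exponential backoff recommended for rate-limited APIs can be sketched independently of any client library. The function below retries an arbitrary callable with exponential delays and full jitter; the name, the defaults, and the injectable sleep parameter are assumptions, and a real client should catch only the provider's rate-limit or transient errors rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry request() with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            # Broad catch for illustration only; narrow this in real code.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing a custom sleep function makes the retry logic testable without waiting for real delays.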

Performance Traps

Trap	Symptoms	Prevention	When It Breaks
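No pgvector index	Queries slow down as the table grows (not expected at ~6K rows, where a sequential scan is fast enough)	Add HNSW index after data load	Well above the ~6K rows of this spike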
Embedding API per-row	Import takes hours	Batch requests (100-500 per call, within each API's per-request limit)	>100 rows
HNSW default parameters	Good recall, slow build	Tune m and ef_construction	>10K vectors
Full table scan with filter	Filter + vector search slow	Use partial indexes or pre-filter	>5K rows with filter
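
Batching embedding requests reduces the number of API calls from one per row to one per batch. A minimal sketch of the chunking step; the function name is hypothetical and the batch size is left to the caller, since each API documents its own limit.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With 6,000 rows and a batch size of 250, the import needs 24 requests instead of 6,000.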

"Looks Done But Isn't" Checklist

Embedding import: Often missing model version metadata - verify embedding_model column populated
Normalization: Often assumed - verify vectors are actually normalized (check that the L2 norm is 1.0 within floating-point tolerance)
Index creation: Often done too early - verify index created AFTER full data load
Test data: Often leaked - verify test queries are not in the indexed set
LLM evaluation: Often single run - verify multiple runs with variance reported
Prompt templates: Often undocumented - verify exact prompts saved with results
Token counting: Often skipped - verify context fits in model limit

Recovery Strategies

Pitfall	Recovery Cost	Recovery Steps
Dimension mismatch	LOW	Re-embed affected data with correct model; update schema
Wrong distance metric	MEDIUM	Re-run evaluation with correct metric; may need re-index
IVFFlat on empty table	LOW	Drop index, reload data, recreate index (or switch to HNSW)
Data leakage	HIGH	Redesign test set; re-run all evaluations
Context overflow	MEDIUM	Reduce context size; re-run LLM comparisons
Prompt sensitivity	MEDIUM	Document variants; report range not point estimate
LLM-as-judge bias	MEDIUM	Add randomization, multi-model judging; re-evaluate
Unfair comparison	HIGH	Standardize configs; may need re-embedding

Pitfall-to-Phase Mapping

Pitfall	Prevention Phase	Verification
Dimension mismatch	Phase 1: Data Import	Schema includes model metadata; dimension validated on insert
Normalization mismatch	Phase 2: Embedding Generation	Documentation of each model's metric; normalization verified
IVFFlat on empty	Phase 1: Data Import	Index creation after data load (use HNSW instead)
Data leakage	Phase 3: Evaluation Design	Train/test split documented; test data not embedded until eval
Context overflow	Phase 4: LLM Implementation	Token counting implemented; context limits documented
Prompt sensitivity	Phase 4: LLM Implementation	Multiple prompt variants tested; sensitivity reported
LLM-as-judge bias	Phase 5: Evaluation	Multi-model judging; order randomization; human sample
Unfair comparison	Phase 2: Embedding Generation	Per-model configs documented; preprocessing standardized

Sources

Pitfalls research for: Semantic Search Comparison Spike Researched: 2026-02-20

pgvector Indexing

Embedding and Normalization

LLM Context and RAG

Prompt Sensitivity

LLM-as-Judge Bias

Embedding Benchmarking

Data Leakage

Synthetic Data