Researched: 2026-02-20
Domain: Text embeddings, vector databases, ML evaluation methodology
Confidence: HIGH

Phase 2 generates embeddings for ~6K line items using three models: Google text-multilingual-embedding-002 (768 dimensions, via Vertex AI SDK), Jina embeddings-v3 (1024 dimensions, via REST API), and all-MiniLM-L6-v2 (384 dimensions, local via sentence-transformers). The existing database schema already has pre-allocated vector columns with correct dimensions. Before embedding, the data must be split into 80% train and 20% test sets with a persistent is_test_set column to prevent data leakage.
Synthetic query variations for the test set require German-aware augmentation. Since the data is German accounting terminology (supplier names, line item descriptions), we need strategies that handle German compound words and accounting jargon. Options include: nlpaug with back-translation, German keyboard typo simulation, and word reordering for descriptions. Synonym replacement is challenging for German accounting terms, so LLM-based paraphrasing (using Gemini) is recommended for high-quality variations.
Primary recommendation: Add is_test_set boolean column before any embedding work, use stratified split by debit_account for balanced evaluation, store query variations in a separate test_query_variation table, and create HNSW indexes only after all embeddings are populated.
<user_constraints>
None - discussion stayed within phase scope
</user_constraints>

<phase_requirements>
| ID | Description | Research Support |
|---|---|---|
| INFRA-04 | Pre-compute embeddings with Google text-multilingual-embedding-002 | Use google-genai SDK with Vertex AI. Model outputs 768 dimensions. Batch up to 250 texts per request. See Code Examples section. |
| INFRA-05 | Pre-compute embeddings with Jina embeddings | Use Jina REST API at api.jina.ai/v1/embeddings. embeddings-v3 outputs 1024 dimensions by default. No batch size limit but rate limited by tier. |
| INFRA-06 | Pre-compute embeddings with all-MiniLM-L6-v2 (local) | Use sentence-transformers SentenceTransformer class. Model outputs 384 dimensions. Batch encode is native and fast (~1000 texts/sec on CPU). |
| INFRA-07 | Store embedding model metadata alongside vectors | Create embedding_model_config table with model name, dimensions, distance metric. Store at config level per user decision. |
| EVAL-01 | Train/test split (80/20) with held-out test set | Add is_test_set boolean column to line_item. Use stratified split by debit_account for balanced evaluation. Split BEFORE embedding. |
| EVAL-02 | Generate synthetic query variations (synonyms, reordering, typos) | Create test_query_variation table. Use keyboard typos (nlpaug KeyboardAug), word reordering, and LLM paraphrasing for German accounting context. |

</phase_requirements>


Core libraries:

| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| google-genai | 1.64+ | Google/Vertex AI embeddings | Official unified SDK for Gemini/Vertex AI, replaces deprecated google-generativeai |
| sentence-transformers | 5.2+ | Local embedding models | Standard library for transformer embeddings, supports Jina v3 with trust_remote_code |
| scikit-learn (imported as sklearn) | (not specified) | Train/test split | train_test_split with stratify parameter for balanced splits |
| requests | (not specified) | Jina API calls | Simple HTTP client for the REST API; a third-party package, not part of the standard library |

Supporting libraries:

| Library | Version | Purpose | When to Use |
|---|---|---|---|
| numpy | 2.x | Vector operations | Embedding array manipulation before DB storage |
| tqdm | 4.x | Progress bars | Visual feedback during batch embedding (~6K items) |
| nlpaug | 1.1+ | Text augmentation | Keyboard typos, back-translation for German query variations |

Alternatives considered:

| Instead of | Could Use | Tradeoff |
|---|---|---|
| Jina REST API | sentence-transformers local | REST is simpler to set up but is rate limited; running locally requires downloading the 570M-parameter model |
| nlpaug keyboard | random character substitution | nlpaug simulates realistic QWERTZ keyboard typos |
| Gemini paraphrasing | nlpaug back-translation | Gemini is higher quality but costs API calls; back-translation is free |

Installation:

```bash
# Already in pyproject.toml from Phase 1
uv add tqdm nlpaug  # Add for Phase 2
```

```text
src/
├── embeddings/
│   ├── __init__.py
│   ├── google_embed.py      # Vertex AI embedding functions
│   ├── jina_embed.py        # Jina API embedding functions
│   ├── minilm_embed.py      # Local sentence-transformers
│   └── batch_processor.py   # Common batching logic with progress
├── evaluation/
│   ├── __init__.py
│   ├── train_test_split.py  # Stratified splitting
│   └── query_variations.py  # Synthetic query generation
├── db.py                    # Existing - add update functions
├── normalize.py             # Existing
└── import_csv.py            # Existing
```

What: Split the data so the test set has proportional representation of each debit_account
When to use: When class distribution matters for evaluation (it does for GL account prediction)
Example:
```python
# Source: sklearn documentation
import pandas as pd
from sklearn.model_selection import train_test_split


def create_train_test_split(conn, test_size: float = 0.20, random_state: int = 42):
    """Populate is_test_set using a stratified split by debit_account.

    The is_test_set column must already exist on line_item.
    """
    # Fetch all IDs and their debit_accounts
    df = pd.read_sql("SELECT id, debit_account FROM line_item", conn)

    # Stratified split - ensures proportional debit_account distribution.
    # Note: stratify raises ValueError if any debit_account has only one row.
    train_ids, test_ids = train_test_split(
        df["id"],
        test_size=test_size,
        random_state=random_state,
        stratify=df["debit_account"],
    )

    # Update database; tolist() converts numpy integers to plain Python ints
    with conn.cursor() as cur:
        cur.execute("UPDATE line_item SET is_test_set = FALSE")
        cur.execute(
            "UPDATE line_item SET is_test_set = TRUE WHERE id = ANY(%s)",
            (test_ids.tolist(),),
        )
    conn.commit()
    return len(train_ids), len(test_ids)
```

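One practical wrinkle with stratifying by debit_account is that scikit-learn's train_test_split raises a ValueError when any class has only a single member. The sketch below shows one possible way to separate such rows before splitting. The helper name `partition_for_stratify` and the choice to keep rare rows in the training set are assumptions for illustration, not something the source prescribes.

```python
from collections import Counter


def partition_for_stratify(labels, min_count=2):
    """Return (stratifiable, too_rare) index lists based on label frequency.

    Hypothetical helper: rows whose label occurs fewer than min_count times
    cannot be stratified and are returned separately.
    """
    counts = Counter(labels)
    stratifiable, too_rare = [], []
    for idx, label in enumerate(labels):
        if counts[label] >= min_count:
            stratifiable.append(idx)
        else:
            too_rare.append(idx)
    return stratifiable, too_rare


# Illustrative account numbers, not taken from the real data
accounts = ["4400", "4400", "4400", "6300", "6300", "4980"]
ok_idx, rare_idx = partition_for_stratify(accounts)
print(ok_idx, rare_idx)
```

Only the rows in `ok_idx` would then be passed to the stratified split. The rows in `rare_idx` could stay in the training set (is_test_set = FALSE), since a single example cannot be both indexed and held out.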
What: Process embeddings in batches with rate limiting and progress tracking
When to use: Any embedding operation on 1000+ texts
Example:
```python
import time

from tqdm import tqdm


def batch_embed_with_progress(
    texts: list[str],
    embed_fn,
    batch_size: int = 100,
    delay_between_batches: float = 0.1,
) -> list[list[float]]:
    """Generic batch embedding with progress bar."""
    all_embeddings = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding"):
        batch = texts[i:i + batch_size]
        embeddings = embed_fn(batch)
        all_embeddings.extend(embeddings)
        time.sleep(delay_between_batches)  # Rate limit respect
    return all_embeddings
```

What: Combine relevant fields into embedding-ready text
When to use: Before calling any embedding model
Recommendation: Combine supplier_name_normalized and description_normalized
Example:
```python
def prepare_embedding_text(supplier_name: str, description: str) -> str:
    """Combine fields for embedding. Use normalized versions."""
    # Keep it simple: concatenate with separator
    # The embedding models handle context internally
    return f"{supplier_name} | {description}"
```

| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Train/test split | Random selection | sklearn train_test_split with stratify | Handles edge cases, reproducible with random_state |
| Batch progress | Print statements | tqdm | Accurate ETA, handles terminals, minimal overhead |
| German keyboard typos | Character substitution | nlpaug KeyboardAug | QWERTZ layout aware, realistic typo patterns |
| Rate limiting | sleep() calls | Built-in retry with exponential backoff | Handles 429 responses properly |
| Vector serialization | Manual string formatting | pgvector register_vector | Type safety, proper encoding |
Key insight: Embedding generation involves API rate limits, batch processing, and progress tracking - all well-solved problems. Focus on the domain-specific query variation generation where custom logic adds value.
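To make the "retry with exponential backoff" row concrete, here is a minimal sketch of a wrapper that could sit around any of the embed functions. `RateLimitError` is a hypothetical stand-in for however a 429 surfaces in the real client (for example an HTTP error with status 429), and the delay values are illustrative defaults rather than documented limits.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical marker for an HTTP 429 response."""


def with_backoff(fn, *, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Wrap fn so RateLimitError triggers a retry with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except RateLimitError:
                if attempt == max_attempts - 1:
                    raise  # give up after the last attempt
                delay = min(max_delay, base_delay * 2 ** attempt)
                # Add jitter so parallel workers do not retry in lockstep
                sleep(delay + random.uniform(0, delay / 2))
    return wrapper


# Demo with a fake embedder that is rate limited twice, then succeeds
calls = {"n": 0}


def flaky_embed(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429")
    return [[0.0] for _ in batch]


waits = []  # record requested sleeps instead of actually sleeping
safe_embed = with_backoff(flaky_embed, sleep=waits.append)
result = safe_embed(["a", "b"])
print(calls["n"], len(waits))
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays. In production the default `time.sleep` applies.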
What goes wrong: Model sees test data during training, inflating accuracy metrics
Why it happens: Splitting after embedding, or using test set for model selection
How to avoid: Add is_test_set column FIRST, before any embedding. Filter on this column for all training operations.
Warning signs: Suspiciously high accuracy (>90% on semantic search)

What goes wrong: 429 errors from Google/Jina APIs, incomplete embedding runs
Why it happens: Sending requests too fast, no backoff strategy
How to avoid: Batch appropriately (250 for Google, no limit for Jina but respect RPM), add delays between batches, implement retry with exponential backoff
Warning signs: Intermittent failures, partial data

What goes wrong: ERROR: expected N dimensions, not M when inserting
Why it happens: Model configuration differs from schema definition
How to avoid: Schema already correct from Phase 1 (768, 1024, 384). Verify model output dimensions match before batch insert.
Warning signs: First insert fails

What goes wrong: Different models get different input text, invalid comparison
Why it happens: Ad-hoc text preparation in each embedding script
How to avoid: Single prepare_embedding_text() function used by all models
Warning signs: Different embedding column NULL patterns

What goes wrong: Each embedding insert triggers index update, 100x slower
Why it happens: Creating index before populating data
How to avoid: Comment out index creation in init.sql (already done). Create indexes AFTER all embeddings populated.
Warning signs: Embedding insertion takes hours instead of minutes

What goes wrong: Query variations don't represent real user typos/variations
Why it happens: Using only programmatic augmentation without domain knowledge
How to avoid: Mix approaches: keyboard typos for realism, word reordering for variation, LLM paraphrasing for semantic equivalents
Warning signs: All synthetic queries look similar, no realistic misspellings

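Two of the pitfalls above (dimension mismatch and inconsistent input text) can be guarded against in one place. The sketch below feeds the same text list to every embedder and checks the output shape before anything is written to the database. The column names and dimensions come from the schema described in this document; the function name `embed_with_all_models` and the fake embedders are hypothetical and only serve the demonstration.

```python
# Expected dimensions per vector column, as stated for the Phase 1 schema
EXPECTED_DIMS = {
    "embedding_google": 768,
    "embedding_jina": 1024,
    "embedding_minilm": 384,
}


def embed_with_all_models(texts, embedders):
    """Run every embedder on the same text list and verify output shapes."""
    results = {}
    for column, embed_fn in embedders.items():
        vectors = embed_fn(texts)
        if len(vectors) != len(texts):
            raise ValueError(
                f"{column}: got {len(vectors)} vectors for {len(texts)} texts"
            )
        unexpected = {len(v) for v in vectors} - {EXPECTED_DIMS[column]}
        if unexpected:
            raise ValueError(
                f"{column}: expected {EXPECTED_DIMS[column]} dimensions, "
                f"got {sorted(unexpected)}"
            )
        results[column] = vectors
    return results


def fake_embedder(dim):
    """Stand-in for a real embed function; returns zero vectors."""
    return lambda texts: [[0.0] * dim for _ in texts]


# Illustrative texts in the "supplier | description" format
texts = ["Telekom | Mobilfunk Rechnung", "Bauhaus | Schrauben"]
out = embed_with_all_models(
    texts, {col: fake_embedder(dim) for col, dim in EXPECTED_DIMS.items()}
)
print({col: len(vecs[0]) for col, vecs in out.items()})
```

In a real run the dictionary would map each column to embed_google, embed_jina, and embed_minilm, with `texts` produced once by prepare_embedding_text().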
```python
# Source: Google Cloud Vertex AI documentation
import os

from google import genai
from google.genai.types import EmbedContentConfig

# Set up for Vertex AI
os.environ['GOOGLE_GENAI_USE_VERTEXAI'] = 'True'
os.environ['GOOGLE_CLOUD_PROJECT'] = 'your-project-id'
os.environ['GOOGLE_CLOUD_LOCATION'] = 'us-central1'  # or 'europe-west1'

client = genai.Client()


def embed_google(texts: list[str]) -> list[list[float]]:
    """Embed texts using text-multilingual-embedding-002 via Vertex AI."""
    response = client.models.embed_content(
        model='text-multilingual-embedding-002',
        contents=texts,
        config=EmbedContentConfig(
            task_type='RETRIEVAL_DOCUMENT',  # For stored documents
            # output_dimensionality=768,  # Default for this model
        ),
    )
    return [embedding.values for embedding in response.embeddings]

# Batch size: max 250 texts, 20,000 tokens per request
```

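The per-request limits noted above (250 texts and 20,000 tokens) suggest batching on both counts rather than on item count alone. The following is a rough sketch under a loud assumption: tokens are estimated at about 4 characters each, which is not the model's real tokenizer and may under- or overestimate for German compound words. The function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token. Not the real tokenizer."""
    return max(1, len(text) // 4)


def token_aware_batches(texts, max_items=250, max_tokens=20_000):
    """Yield batches that stay under the item limit and the estimated token limit.

    A single text whose estimate exceeds max_tokens is still emitted,
    alone in its own batch; it is not truncated here.
    """
    batch, batch_tokens = [], 0
    for text in texts:
        tokens = estimate_tokens(text)
        if batch and (len(batch) >= max_items or batch_tokens + tokens > max_tokens):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(text)
        batch_tokens += tokens
    if batch:
        yield batch


# Short line-item texts hit the 250-item limit long before the token limit
texts = ["Büromaterial Kopierpapier A4"] * 600
batches = list(token_aware_batches(texts))
print([len(b) for b in batches])
```

For short invoice line items the item limit will usually be the binding one, so a conservative token estimate costs little.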
```python
# Source: Jina AI Embeddings API documentation
import os

import requests

JINA_API_KEY = os.environ['JINA_API_KEY']  # Fail fast if the key is missing
JINA_API_URL = 'https://api.jina.ai/v1/embeddings'


def embed_jina(texts: list[str]) -> list[list[float]]:
    """Embed texts using Jina embeddings-v3 API."""
    response = requests.post(
        JINA_API_URL,
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {JINA_API_KEY}',
        },
        json={
            'input': texts,
            'model': 'jina-embeddings-v3',
            'dimensions': 1024,  # Default, matches schema
            'task': 'retrieval.passage',  # For stored documents
        },
        timeout=60,  # Avoid hanging forever on a stalled connection
    )
    response.raise_for_status()
    return [d['embedding'] for d in response.json()['data']]

# Rate limits: Free tier 100 RPM, Paid 500 RPM
# No batch size limit - batches internally
```

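Since the Jina limits above are expressed in requests per minute, a fixed pause between batches can be derived from the RPM figure instead of guessed. The sketch below is one possible pacing helper; the class name `RequestPacer` is hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without waiting.

```python
import time


class RequestPacer:
    """Space out calls so that at most `rpm` requests start per minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request may start, then reserve that slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Demo with a fake clock: three requests at the 100 RPM free-tier figure
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


pacer = RequestPacer(rpm=100, clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    pacer.wait()  # in real code, call embed_jina(batch) right after this
print(len(slept), pacer.min_interval)
```

Pacing reduces the chance of hitting 429 responses in the first place; it complements retry logic rather than replacing it.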
```python
# Source: sentence-transformers documentation
from sentence_transformers import SentenceTransformer

# Load model once, reuse for all embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


def embed_minilm(texts: list[str]) -> list[list[float]]:
    """Embed texts using local MiniLM model."""
    # Returns numpy array, convert to list for DB storage
    embeddings = model.encode(
        texts,
        batch_size=128,  # Adjust based on GPU/CPU memory
        show_progress_bar=False,  # We use our own tqdm
        normalize_embeddings=True,  # For cosine similarity
    )
    return embeddings.tolist()

# Performance: ~1000 texts/sec on CPU, ~5000/sec on GPU (rough estimates)
# Model size: ~22M parameters (not 22 MB on disk), 384 dimensions
```

```python
# Source: psycopg3 documentation + pgvector-python
import numpy as np
from pgvector.psycopg import register_vector
from psycopg import sql


def update_embeddings_batch(
    conn,
    column_name: str,
    id_embedding_pairs: list[tuple[int, list[float]]],
):
    """Batch update embedding column for multiple rows."""
    register_vector(conn)
    # Quote the column name as an identifier instead of formatting it
    # directly into the SQL string
    query = sql.SQL("UPDATE line_item SET {} = %s WHERE id = %s").format(
        sql.Identifier(column_name)
    )
    with conn.cursor() as cur:
        # Use executemany for batch updates
        cur.executemany(
            query,
            [(np.array(emb), id_) for id_, emb in id_embedding_pairs],
        )
    conn.commit()
```

```python
# Source: nlpaug documentation + custom
import random

import nlpaug.augmenter.char as nac

# German QWERTZ keyboard typo augmenter
typo_aug = nac.KeyboardAug(
    aug_char_p=0.1,  # 10% of characters
    aug_word_p=0.3,  # 30% of words
    include_numeric=False,
    lang='de',  # German QWERTZ layout
)


def generate_typo_variation(text: str) -> str:
    """Generate realistic keyboard typo variation."""
    return typo_aug.augment(text)[0]


def generate_word_reorder_variation(text: str) -> str:
    """Reorder words for description-like text."""
    words = text.split()
    if len(words) <= 2:
        return text
    # Shuffle middle words, keep first and last
    middle = words[1:-1]
    random.shuffle(middle)
    return ' '.join([words[0]] + middle + [words[-1]])


def generate_llm_paraphrase(text: str, client) -> str:
    """Use Gemini to paraphrase German accounting text."""
    response = client.models.generate_content(
        model='gemini-2.0-flash',
        contents=f"""Paraphrase this German invoice description in a different way
while keeping the same meaning. Only output the paraphrase, nothing else.
Original: {text}""",
    )
    return response.text.strip()
```

```sql
-- Add to init.sql or run as migration
CREATE TABLE embedding_model_config (
    id SERIAL PRIMARY KEY,
    model_name TEXT NOT NULL UNIQUE,
    column_name TEXT NOT NULL,       -- e.g., 'embedding_google'
    dimensions INT NOT NULL,
    distance_metric TEXT NOT NULL,   -- 'cosine', 'l2', 'inner_product'
    task_type TEXT,                  -- e.g., 'RETRIEVAL_DOCUMENT'
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Populate with our 3 models
INSERT INTO embedding_model_config (model_name, column_name, dimensions, distance_metric, task_type) VALUES
    ('text-multilingual-embedding-002', 'embedding_google', 768, 'cosine', 'RETRIEVAL_DOCUMENT'),
    ('jina-embeddings-v3', 'embedding_jina', 1024, 'cosine', 'retrieval.passage'),
    ('all-MiniLM-L6-v2', 'embedding_minilm', 384, 'cosine', NULL);
```

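Because the recommendation is to create HNSW indexes only after all embeddings are populated, the config rows above can double as the source for the index statements. The sketch below generates one pgvector CREATE INDEX statement per configured column. The index naming scheme and the function name are assumptions for illustration; the operator class names are the ones pgvector defines for each distance metric.

```python
# pgvector operator classes for the distance_metric values used in the config table
OPCLASS_BY_METRIC = {
    "cosine": "vector_cosine_ops",
    "l2": "vector_l2_ops",
    "inner_product": "vector_ip_ops",
}


def hnsw_index_statements(configs, table="line_item"):
    """Build one CREATE INDEX statement per (column_name, distance_metric) pair.

    Column names are assumed to come from embedding_model_config,
    not from user input, so they are formatted in directly.
    """
    statements = []
    for column_name, distance_metric in configs:
        opclass = OPCLASS_BY_METRIC[distance_metric]
        statements.append(
            f"CREATE INDEX IF NOT EXISTS idx_{table}_{column_name}_hnsw "
            f"ON {table} USING hnsw ({column_name} {opclass});"
        )
    return statements


# Rows as they would be read from embedding_model_config
configs = [
    ("embedding_google", "cosine"),
    ("embedding_jina", "cosine"),
    ("embedding_minilm", "cosine"),
]
statements = hnsw_index_statements(configs)
for stmt in statements:
    print(stmt)
```

Running these statements as a final step, after the last embedding column is filled, keeps index maintenance out of the bulk update path.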
```sql
-- Stores synthetic query variations for test set evaluation
CREATE TABLE test_query_variation (
    id BIGSERIAL PRIMARY KEY,
    line_item_id BIGINT NOT NULL REFERENCES line_item(id),
    variation_type TEXT NOT NULL,   -- 'typo', 'reorder', 'paraphrase'
    original_text TEXT NOT NULL,    -- The embedding text that was varied
    varied_text TEXT NOT NULL,      -- The synthetic query
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_test_query_variation_line_item ON test_query_variation(line_item_id);
```

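To connect the variation generators to this table, the sketch below turns one test item into insert-ready rows, skipping variations that come back unchanged or duplicated (for instance, a reorder of a two-word text). The function name `build_variation_rows` and the stand-in generators are hypothetical. In practice the dictionary would hold generate_typo_variation, generate_word_reorder_variation, and a wrapper around generate_llm_paraphrase.

```python
def build_variation_rows(line_item_id, original_text, generators):
    """Return (line_item_id, variation_type, original_text, varied_text) tuples."""
    rows = []
    seen = {original_text}
    for variation_type, generate in generators.items():
        varied = generate(original_text).strip()
        if not varied or varied in seen:
            continue  # skip empty, no-op, or duplicate variations
        seen.add(varied)
        rows.append((line_item_id, variation_type, original_text, varied))
    return rows


# Deterministic stand-ins so the example is reproducible
generators = {
    "typo": lambda t: t.replace("Rechnung", "Rechnnug"),
    "reorder": lambda t: " ".join(reversed(t.split())),
    "paraphrase": lambda t: t,  # a no-op, to show that it gets dropped
}
rows = build_variation_rows(42, "Telekom Mobilfunk Rechnung", generators)
for row in rows:
    print(row)
```

The tuples match the column order (line_item_id, variation_type, original_text, varied_text), so they could be passed to a cursor's executemany with a four-placeholder INSERT.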
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| google-generativeai library | google-genai SDK | 2025 | Unified SDK for Gemini API and Vertex AI |
| text-embedding-004 | gemini-embedding-001 | Jan 2026 | text-embedding-004 deprecated, but text-multilingual-embedding-002 still supported |
| Fixed embedding dimensions | Matryoshka dimensions | 2024-2025 | Jina v3 and Gemini support flexible dimensions |
| IVFFlat indexes | HNSW indexes | 2023 | HNSW has better recall at same query speed |

Deprecated/outdated:
- google-generativeai package: Use google-genai instead
- text-embedding-004: Deprecated Jan 2026, use gemini-embedding-001 or text-multilingual-embedding-002

Open questions:
- Vertex AI vs Gemini API for embeddings
  - GOOGLE_GENAI_USE_VERTEXAI=True. Requires GCP project and service account or ADC.
- Jina API tier and rate limits
- Number of synthetic variations per test item

Confidence breakdown:
Research date: 2026-02-20
Valid until: 2026-03-20 (30 days - stable technologies, though Google may deprecate text-multilingual-embedding-002 soon)