Vector Search
Plain Explanation
Teams need search that understands meaning, not just exact words. Keyword engines excel at IDs and exact phrases but stumble when a question is phrased differently from the source text, which weakens RAG pipelines that must fetch the right passages before a model answers. Vector search addresses this by retrieving content that is semantically close to the question, improving grounding quality without forcing users to guess the exact wording.

Picture a library shelved by topic similarity instead of alphabetically. When someone asks a question, the librarian pulls the few shelves whose books “live closest” to that theme.

Concretely, your pipeline splits documents into chunks, converts each chunk into a numeric vector (an embedding), and stores those vectors in a vector store. At query time you embed the query with the same model, compute similarity (cosine or dot product), retrieve the Top‑K nearest chunks, then optionally re‑rank them and pass the resulting context to the generator. Some implementations distinguish document and query embedding modes, but the safer general rule is to keep indexing and query embeddings in the same compatible vector space. Re‑ranking can improve relevance, but it adds another model call, so it must fit within the latency budget.
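The index-then-query flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production setup: `ToyEmbedder` is a hypothetical stand-in that counts words over a fixed vocabulary, so it only matches overlapping terms and does not capture meaning the way a real embedding model would. The in-memory `store` list stands in for a vector store, and the sample chunks are invented.

```python
import math
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyEmbedder:
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""

    def __init__(self, corpus):
        vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})
        self.index = {tok: i for i, tok in enumerate(vocab)}

    def embed(self, text):
        vec = [0.0] * len(self.index)
        for tok in tokenize(text):
            if tok in self.index:  # unknown words are ignored
                vec[self.index[tok]] += 1.0
        return vec


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, store, k):
    """Score every stored chunk against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Index time: embed each chunk once and keep (chunk, vector) pairs.
chunks = [
    "Employees may carry over five vacation days per year.",
    "Expense reports are due within thirty days of travel.",
    "Vacation requests need manager approval two weeks ahead.",
]
embedder = ToyEmbedder(chunks)
store = [(chunk, embedder.embed(chunk)) for chunk in chunks]

# Query time: embed the question with the SAME embedder, then rank.
results = top_k(embedder.embed("how many vacation days carry over"), store, k=2)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

The carry-over chunk ranks first because it shares the most terms with the query. With a real embedding model the structure stays the same; only `embed` changes, and a vector store typically replaces the linear scan in `top_k` with an approximate nearest-neighbor index.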
Examples & Analogies
- Policy helpdesk with embeddings: An internal help center indexes policy manuals by chunking and embedding with a document task, storing vectors in a collection configured for cosine similarity. At runtime it embeds each question with a query task and returns the top‑5 chunks to feed the answer model.
- Freshness‑aware newsroom search: An editorial team tags each chunk with date metadata and uses time‑based indexing. Vector search finds semantically related articles, while a date filter keeps results within the latest coverage so answers stay grounded in current material.
- Product catalog hybrid retrieval: An e‑commerce team combines BM25 keyword search with vector search to cover exact SKUs and IDs alongside paraphrased descriptions. When precision matters, they re‑rank the candidates with a cross‑encoder, trading extra latency for better top results.
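As a rough illustration of the newsroom example above, the sketch below applies a date filter before ranking by similarity. The `Chunk` type, the two-dimensional vectors, and the sample articles are all hypothetical; real vector stores usually expose metadata filtering through their own query options rather than a Python list comprehension.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    vector: list[float]  # hand-made toy vectors, not real embeddings
    published: date      # date metadata attached at index time


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_recent(query_vec, chunks, since, k):
    """Keep only chunks published on or after `since`, then rank by dot product."""
    fresh = [c for c in chunks if c.published >= since]
    fresh.sort(key=lambda c: dot(query_vec, c.vector), reverse=True)
    return fresh[:k]


chunks = [
    Chunk("Election results, final count", [0.9, 0.1], date(2023, 11, 8)),
    Chunk("Election preview and polling", [0.8, 0.2], date(2024, 10, 30)),
    Chunk("Stadium renovation approved", [0.1, 0.9], date(2024, 11, 2)),
]

hits = search_recent([1.0, 0.0], chunks, since=date(2024, 1, 1), k=2)
for hit in hits:
    print(hit.published, hit.text)
```

The 2023 article is the closest match by similarity alone, but the filter removes it, so the results stay within current coverage.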
At a Glance
| | Vector search | Keyword search (BM25) | Hybrid search |
|---|---|---|---|
| Matching signal | Embedding similarity (cosine/dot) | Exact-term scoring (BM25) | Combines both signals |
| Excels at | Synonyms, paraphrases, semantic intent | IDs, codes, exact phrases | Precision + semantic recall |
| Common misses | Proper nouns/IDs without filters | Rephrasings and synonyms | Fewer misses, more tuning knobs |
| Reranking | Cross-encoders often used (+50–200 ms) | Can be applied post-retrieval | Common after merging results |
| Retrieval flow | Embed query → similarity Top‑K → optional re‑rank | Term match → rank | BM25 + vectors → merge → re‑rank |
Pick vector search for meaning, keyword for exact tokens and IDs, and hybrid when you need both without sacrificing relevance.
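The "merge" step in the hybrid column can be done in several ways. One widely used option is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the method any particular engine uses by default. The document IDs are invented, and `rank_constant=60` is the value conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Merge ranked ID lists: each list adds 1 / (rank_constant + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists, best match first.
bm25_hits = ["sku-4417", "doc-12", "doc-7"]   # keyword side catches the exact SKU
vector_hits = ["doc-7", "doc-12", "doc-31"]   # vector side catches paraphrases

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents found by both retrievers rise to the top, while the exact-match SKU that only BM25 found still survives into the candidate list for a later re-ranking step.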
Where and Why It Matters
- Core retrieval layer for RAG: better grounding and answer reliability when paired with metadata filters and re‑ranking.
- Practical impact of re‑ranking: higher precision at added latency; manage the trade‑off against SLA and prompt budget.
- Co‑optimization need: chunking, embedding model, index type, K, re‑ranking, and serving choices jointly determine latency, QPS, and accuracy.
- Ongoing maintenance: prefer partial re‑indexing when content changes (re‑chunk, re‑embed, and re‑write only the affected documents) to control cost and downtime.
- Strategy diversity: hybrid retrieval, metadata gating (date/source/type), and hierarchical indexing are common in production.
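One way to approach the partial re-index mentioned above is to store a content fingerprint alongside each chunk and re-embed only the chunks whose fingerprint changed. The sketch below is a minimal illustration; the chunk IDs, the `plan_reindex` helper, and the choice of SHA-256 are assumptions, not a feature of any specific vector store.

```python
import hashlib


def fingerprint(text):
    """Stable content hash used to detect changed chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed, current):
    """Compare stored fingerprints with current text.

    indexed: chunk_id -> fingerprint saved at the last indexing run
    current: chunk_id -> chunk text as it exists now
    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = [cid for cid, text in current.items()
                if indexed.get(cid) != fingerprint(text)]
    to_delete = [cid for cid in indexed if cid not in current]
    return to_embed, to_delete


# Hypothetical state: one chunk edited, one unchanged, one removed, one new.
indexed = {
    "policy#1": fingerprint("Old travel policy"),
    "policy#2": fingerprint("Leave policy"),
    "policy#3": fingerprint("Retired policy"),
}
current = {
    "policy#1": "New travel policy",
    "policy#2": "Leave policy",
    "policy#4": "Remote work policy",
}

to_embed, to_delete = plan_reindex(indexed, current)
print("embed:", to_embed)
print("delete:", to_delete)
```

Only the edited and new chunks are sent to the embedding model, and the unchanged chunk is left alone, which is where the cost and downtime savings come from.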
Common Misconceptions
- ❌ Myth: Vector search makes keyword search obsolete. → ✅ Reality: Hybrid (BM25 + vectors) often improves coverage, especially for rare IDs and exact phrases.
- ❌ Myth: Any two good embedding models will work together. → ✅ Reality: Use the same model for indexing and querying; mixing models places vectors in different spaces.
- ❌ Myth: A well‑tuned vector DB alone guarantees great RAG answers. → ✅ Reality: Retrieval is one stage; co‑optimize end‑to‑end quality and performance across the stack.
How It Sounds in Conversation
- "Let's set Top‑K=5 with cosine similarity for the FAQ bot, then add a re‑ranker to tighten precision before the LLM."
- "Switch to document/query task types where supported; that setup improved alignment in our trials."
- "Search is missing product IDs — try hybrid search (BM25 + vectors) and add a date metadata filter on the index."
- "We should tune both retrieval and serving; latency/QPS depend on the whole pipeline, not just the vector DB."
- "If we bump K from 5 to 15, recall should rise but tokens and cost go up; the re‑rank adds ~120 ms, which seems acceptable for this SLA."
References
- Qdrant Documentation: Similarity search
Official explanation of nearest-vector search, Query API, filtering, and hybrid query options.
- Weaviate Concepts: Search
Official overview of keyword, vector, hybrid search, RAG, and reranking.
- Weaviate Documentation: Hybrid search
How vector and keyword signals are combined and weighted in hybrid search.
- pgvector
Official repository for PostgreSQL vector type, distance operators, and HNSW/IVFFlat indexing.