Vector Databases, Explained for Backend Engineers
What embeddings are, why approximate nearest-neighbor search needs special indexes, and when you actually need a vector database versus a pgvector column.
The Problem
You want to build semantic search or retrieval-augmented generation: given a user's
question, find the most relevant documents by meaning, not keywords. A LIKE
query won't cut it — "how do I reset my password" should match a doc titled
"account recovery steps." This is similarity search over embeddings, and it needs
different machinery.
Why It Matters
RAG and semantic search are now standard features. The retrieval layer underneath
them is a vector similarity problem, and doing it naively, by comparing a query
against every stored vector, costs O(n) distance computations per query, each over
the full vector dimension. At millions of vectors, that is typically too slow for
interactive latency. Understanding the index is what separates a demo from production.
Core Concepts
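An embedding is a fixed-length array of floats, produced by an embedding model, that represents the meaning of a piece of text. Texts with similar meanings produce vectors that point in similar directions, and closeness is usually measured by cosine similarity: the cosine of the angle between two vectors, which is 1 when they point the same way and 0 when they are orthogonal.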
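To make the measure concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, which have hundreds or thousands of dimensions; the variable names are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings"; the numbers are invented for illustration.
reset_password = [0.9, 0.1, 0.0]
account_recovery = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

close = cosine_similarity(reset_password, account_recovery)
far = cosine_similarity(reset_password, pizza_recipe)
```

With these toy values, `close` comes out much higher than `far`, which is the property semantic search relies on.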
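Finding the closest vectors exactly means comparing the query against every stored vector, so vector databases use approximate nearest neighbor (ANN) search instead. The most common index is HNSW (Hierarchical Navigable Small World), a layered graph: a search enters at a sparse top layer and greedily follows edges toward the query, dropping into denser layers as it goes. That finds close neighbors in roughly logarithmic time, trading a little recall for a lot of speed.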
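For contrast, this is roughly what the exact search that ANN replaces looks like: score every stored vector, then keep the k best. It is a sketch over an in-memory dict with invented ids and values, and it assumes no vector is all zeros. It is not how HNSW works; it is the linear-scan baseline HNSW approximates.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; smaller means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def exact_top_k(query, vectors, k):
    """Return the ids of the k closest vectors by scanning all of them: O(n)."""
    scored = ((cosine_distance(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [doc_id for _, doc_id in heapq.nsmallest(k, scored)]


# Hypothetical corpus: document id -> toy embedding.
corpus = {
    "account-recovery": [0.8, 0.2, 0.1],
    "billing-faq": [0.3, 0.9, 0.1],
    "pizza-recipe": [0.0, 0.1, 0.9],
}

nearest = exact_top_k([0.9, 0.1, 0.0], corpus, k=2)
```

Every query touches every entry in `corpus`, which is the cost an ANN index avoids.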
Implementation
You don't always need a dedicated database. Postgres with pgvector is often
enough:
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(1536) -- dimension matches your embedding model
);

CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops);
Querying is an ordering by distance:
SELECT id, content
FROM documents
ORDER BY embedding <=> $1 -- <=> is cosine distance
LIMIT 5;
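The <=> operator computes cosine distance (1 minus cosine similarity), so ordering by it ascending and taking the first five rows returns the five nearest neighbors. The HNSW index lets Postgres answer that without scanning every row.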
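On the application side, the query embedding has to reach Postgres as the $1 parameter. pgvector accepts a vector in text form as a bracketed, comma-separated list such as [0.1,0.2,0.3]. The helper below is a hypothetical sketch that builds that string; how you bind it, and whether you need a cast to vector, depends on your driver, and drivers with a pgvector adapter may let you pass a list directly.

```python
def to_pgvector_literal(embedding):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Made-up 3-dimensional query vector; a real one would match the column's dimension.
query_param = to_pgvector_literal([1, 2.5, -3])
```

The resulting string is what you would bind as the query parameter.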
Common Mistakes
- Mismatched dimensions. The column dimension must equal the embedding model's output. Switching models usually means re-embedding everything.
- Comparing across models. Embeddings from different models live in different spaces and aren't comparable. Pick one and stay consistent.
- Forgetting the index. Without an ANN index, every query is a full scan. It works on a thousand rows and falls over at a million.
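The first two mistakes can be caught before a row is ever written. The sketch below assumes you record which model produced each embedding; the Embedding class, the model name, and the constants are hypothetical, with 1536 chosen to match the vector(1536) column above.

```python
from dataclasses import dataclass

EXPECTED_MODEL = "example-embedding-model"  # hypothetical model name
EXPECTED_DIM = 1536  # must match the vector(1536) column


@dataclass(frozen=True)
class Embedding:
    model: str
    values: list


def check_embedding(e: Embedding) -> Embedding:
    """Reject embeddings from another model or with the wrong dimension."""
    if e.model != EXPECTED_MODEL:
        raise ValueError(f"embedding came from {e.model!r}, expected {EXPECTED_MODEL!r}")
    if len(e.values) != EXPECTED_DIM:
        raise ValueError(f"embedding has {len(e.values)} dimensions, expected {EXPECTED_DIM}")
    return e
```

Calling check_embedding on both the write path and the query path keeps a model switch from silently mixing vector spaces.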
Production Considerations
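ANN is a recall/speed trade-off, and the HNSW parameters are where you tune it. m
(maximum connections per node in each layer) and ef_construction (candidate list size
during the build) are set when the index is created; raising them improves recall at
the cost of build time and index size. ef_search (hnsw.ef_search in pgvector) is set
at query time; raising it searches more of the graph, which improves recall and adds
latency. Measure recall against an exact brute-force baseline on a sample so you know
what you're trading away.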
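Measuring recall comes down to comparing two id lists per sample query: what the index returned and what an exact search returned. A minimal sketch, with hypothetical function names, assuming you have already collected both lists (in Postgres, one way to get the exact list is to run the same query with index scans disabled):

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def mean_recall(pairs):
    """Average recall over (approx_ids, exact_ids) pairs, one pair per sample query."""
    scores = [recall_at_k(approx, exact) for approx, exact in pairs]
    return sum(scores) / len(scores)


# Invented ids: the index missed document 4 and returned 9 instead.
sample = recall_at_k([1, 2, 3, 9, 5], [1, 2, 3, 4, 5])
```

Tracking mean_recall while you change ef_search shows how much recall each latency gain costs.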
Security
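Vector search will happily return a chunk the current user isn't allowed to see: relevance is not permission. Either filter within the query, with a WHERE clause on tenant or document ownership next to the similarity ordering, or apply your normal authorization to the results after retrieval. If you filter afterwards, fetch more candidates than you need, because dropping unauthorized rows can leave you with fewer than k results.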
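The after-retrieval variant might look like the sketch below. The Chunk class and its tenant_id field are hypothetical; the candidates are assumed to arrive ordered by similarity and to have been over-fetched, so that something is left after filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    id: int
    tenant_id: str
    content: str


def authorized_top_k(candidates, tenant_id, k):
    """Keep only chunks owned by tenant_id, preserving similarity order."""
    allowed = [c for c in candidates if c.tenant_id == tenant_id]
    return allowed[:k]


# Invented results, already sorted nearest-first by the vector search.
candidates = [
    Chunk(1, "acme", "account recovery steps"),
    Chunk(2, "globex", "internal password policy"),
    Chunk(3, "acme", "resetting two-factor auth"),
]
visible = authorized_top_k(candidates, tenant_id="acme", k=2)
```

Here the second-most-relevant chunk belongs to another tenant and is dropped, even though it ranked above a chunk that is returned.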
Performance
Reach for a dedicated vector database (Qdrant, Weaviate, Milvus) when you're past
tens of millions of vectors, need high write throughput, or want native metadata
filtering at scale. Below that, pgvector keeps your vectors next to your relational
data and leaves you with one fewer system to operate.
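A quick way to see where the scale threshold bites is to estimate raw vector storage. The sketch below assumes 4-byte single-precision floats per dimension and counts only the vector payload; index size and per-row overhead come on top, so treat the result as a lower bound. The function names are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: one single-precision float per dimension


def raw_vector_bytes(num_vectors, dimensions):
    """Bytes needed for the vector values alone, excluding index and row overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


def to_gib(num_bytes):
    return num_bytes / 2**30


# Ten million 1536-dimensional vectors: about 57 GiB before any index is built.
size_gib = to_gib(raw_vector_bytes(10_000_000, 1536))
```

Numbers in this range still fit on a single well-provisioned Postgres host, which is part of why pgvector is a reasonable default below that scale.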
Summary
Vector search powers semantic retrieval by comparing embeddings with approximate
nearest-neighbor indexes like HNSW. Start with pgvector if you're already on
Postgres, match your dimensions to your model, always build the index, and enforce
authorization in the query or after retrieval. Graduate to a dedicated vector
database only when scale genuinely demands it.