ANN algorithms, indexing strategies, and performance optimization
Vector databases have become critical infrastructure for AI applications, powering everything from semantic search to recommendation systems. At their core, these systems solve a seemingly simple problem: given a query vector, find the K most similar vectors from a database of billions.
This article dives deep into the algorithms and engineering decisions that make vector search tractable at scale. We'll explore the approximate nearest neighbor (ANN) algorithms that power modern systems, the distance metrics that define similarity, and the benchmarking methodologies that help you choose the right system for your use case.
A vector database stores collections of embeddings—dense vector representations of data (text, images, audio) that capture semantic meaning. Similar items cluster together in high-dimensional space.
The naive approach to search is linear: compute the distance from your query vector to every vector in the database, sort, and return the top K. This brute-force search guarantees perfect recall but doesn't scale: scanning 1 billion 1536-dimensional float32 embeddings means reading roughly 6 TB of vector data (10^9 × 1536 × 4 bytes) for every query, which is impossible at interactive latencies.
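For reference, a minimal NumPy sketch of brute-force top-K (the function name is illustrative):

```python
import numpy as np

def brute_force_topk(db: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by full scan: one distance per database vector."""
    d2 = np.sum((db - query) ** 2, axis=1)   # squared L2 distance to every row
    idx = np.argpartition(d2, k)[:k]         # unordered top-k in O(n)
    return idx[np.argsort(d2[idx])]          # order the k results by distance

db = np.random.rand(100_000, 1536).astype(np.float32)
query = np.random.rand(1536).astype(np.float32)
print(brute_force_topk(db, query, k=10))
```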
Approximate Nearest Neighbor (ANN) algorithms sacrifice perfect accuracy for dramatic speedups. By preprocessing the index to understand the data distribution, ANN algorithms can prune vast regions of the search space while still returning highly relevant results.
How you measure similarity determines what "nearest" means. The choice of metric should align with your embedding model's training objective.
Cosine similarity measures the angle between two vectors, ignoring magnitude. It ranges from -1 (opposite) to 1 (identical) and is the most common metric for text embeddings:
cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)
Cosine similarity is orientation-based. If your embeddings are L2-normalized (as most sentence-transformer models produce), ranking by cosine similarity is equivalent to ranking by L2 distance, since ||a - b||² = 2 - 2 × cosine_similarity(a, b). Many systems pre-normalize embeddings to simplify the computation.
L2 (Euclidean) distance measures straight-line distance in Euclidean space. It is more intuitive for applications where magnitude matters:
L2_distance(a, b) = √(Σ(a_i - b_i)²)
L2 is preferred for image embeddings and some recommendation systems where absolute differences in feature values carry meaning.
The dot product measures the projection of one vector onto another, scaled by magnitude. It can be negative for vectors pointing in opposite directions:
dot_product(a, b) = Σ(a_i × b_i)
OpenAI's text-embedding-3-large returns unit-length vectors, so dot product and cosine similarity produce identical scores for its embeddings. In general, for L2-normalized vectors the dot product equals cosine similarity exactly.
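A minimal NumPy sketch illustrating all three metrics and their relationships on unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=1536), rng.normal(size=1536)

# L2-normalize, as most sentence-transformer models do.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
l2 = np.linalg.norm(a - b)

print(f"cosine:      {cosine:.6f}")
print(f"dot product: {dot:.6f}")            # equals cosine for unit vectors
print(f"l2 squared:  {l2**2:.6f}")          # equals 2 - 2*cosine
print(f"2 - 2*cos:   {2 - 2 * cosine:.6f}")
```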
HNSW (Malkov and Yashunin, 2016) is the dominant ANN algorithm in production systems. It builds a multi-layer graph where each layer is a sparse graph connecting nearby points, with exponentially varying connection densities.
During indexing, each vector is assigned a random top level and inserted into that layer and every layer below it. Higher layers have sparser connections (longer "highways"), while lower layers have dense local connections. The insertion level follows an exponentially decaying distribution:

P(level ≥ l) = exp(-l / m_L), where m_L = 1/ln(M)

so each successive layer holds roughly a factor of M fewer elements.
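A sketch of the corresponding insertion-level draw (the function name is illustrative):

```python
import math
import random

def random_level(M: int) -> int:
    """Draw an insertion level with P(level >= l) = M^(-l)."""
    m_L = 1.0 / math.log(M)                  # level normalization factor
    return int(-math.log(random.random()) * m_L)

# With M=32, roughly 97% of elements land on level 0, ~3% reach level 1, etc.
levels = [random_level(32) for _ in range(100_000)]
print(sum(l == 0 for l in levels) / len(levels))
```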
This creates a skip-list-like structure where searching starts at the top layer and greedily traverses the longest edges, then descends to progressively finer layers.
Search(query):
    current = entry_point   # start in the top layer
    for each layer from top down to layer 1:
        # Greedy navigation: hop to whichever neighbor is closer to the query
        while exists_unvisited_neighbor(current, layer):
            next = closest_unvisited_neighbor(query, current, layer)
            if distance(query, next) < distance(query, current):
                current = next
            else:
                break
    # Bottom layer: best-first search over a candidate list of size efSearch
    candidates = best_first_search(query, start=current, layer=0, ef=efSearch)
    return the k closest vectors among candidates
The search complexity is O(log n) because the long-range edges in upper layers let the algorithm "jump" across the graph, while the dense lower layers provide precise local search.
| Parameter | Typical Values | Effect |
|---|---|---|
| M (connections per node) | 16-64 | Higher = better recall, more memory, slower indexing |
| efConstruction | 100-400 | Higher = better recall, slower indexing |
| efSearch | 50-200 | Higher = better recall, slower search |
| max_elements | Set at init | Cannot be increased without rebuild |
For a million vectors with a 95% recall target, typical settings are M=32, efConstruction=200, efSearch=100. Memory is approximately 4 bytes per dimension for the stored float32 vectors plus 4 bytes per graph connection per element.
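A sketch of those settings using the hnswlib library (random data stands in for real embeddings):

```python
import hnswlib
import numpy as np

dim, n = 768, 1_000_000
data = np.random.rand(n, dim).astype(np.float32)

# Build the index with the settings discussed above.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=32, ef_construction=200)
index.add_items(data, np.arange(n))

index.set_ef(100)  # efSearch: raise for better recall, lower for speed

labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10)
```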
IVF partitions the vector space into clusters using k-means. Each vector belongs to one cluster (its nearest centroid), and the inverted index maps centroids to their member vectors.
Indexing:
1. Run k-means to find N centroids (N typically 4096-65536)
2. For each vector, assign to nearest centroid
3. Store vectors in posting lists per centroid
Search:
1. Find K nearest centroids to query
2. Search posting lists of those centroids
3. Return k-nearest across searched lists
IVF alone is relatively slow for high recall, but it pairs well with other techniques. The nprobe parameter controls how many clusters to search—higher nprobe means better recall but slower search.
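A minimal FAISS sketch of an uncompressed IVF index, showing the nprobe knob:

```python
import faiss
import numpy as np

d, n = 128, 100_000
xb = np.random.rand(n, d).astype(np.float32)
xq = np.random.rand(5, d).astype(np.float32)

nlist = 1024                              # number of k-means clusters
quantizer = faiss.IndexFlatL2(d)          # assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                           # k-means over the data
index.add(xb)

index.nprobe = 16                         # clusters searched per query
D, I = index.search(xq, 10)               # distances and ids, shape (5, 10)
```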
Product Quantization compresses vectors by splitting them into subvectors and quantizing each independently. This dramatically reduces memory usage at the cost of recall.
For a 128-dimensional vector with PQ split into 8 segments of 16 dimensions each:
Original vector: [d1, d2, d3, ..., d128] (128 floats = 512 bytes)
Split into 8 segments:
Segment 1: [d1, d2, ..., d16]
Segment 2: [d17, d18, ..., d32]
...
Segment 8: [d113, d114, ..., d128]
Each segment is quantized to a centroid ID (0-255)
Final representation: [c1, c2, c3, c4, c5, c6, c7, c8] (8 bytes!)
A 512-byte vector becomes 8 bytes—a 64x compression ratio. During search, query vectors are also split and quantized, and distances are computed between quantized segments using precomputed lookup tables.
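A toy sketch of the encoding step, assuming scikit-learn's KMeans for the per-segment codebooks (real libraries train these far more efficiently):

```python
import numpy as np
from sklearn.cluster import KMeans

d, m, k = 128, 8, 256                # dims, segments, centroids per segment
seg = d // m                         # 16 dimensions per segment
X = np.random.rand(10_000, d).astype(np.float32)

codebooks = []
codes = np.empty((len(X), m), dtype=np.uint8)
for i in range(m):
    sub = X[:, i * seg:(i + 1) * seg]          # one 16-dim slice
    km = KMeans(n_clusters=k, n_init=1).fit(sub)
    codebooks.append(km.cluster_centers_)      # 256 x 16 table per segment
    codes[:, i] = km.labels_                   # each segment -> one byte

print(codes.nbytes // len(X))  # 8 bytes per vector, down from 512
```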
PQ is almost always combined with IVF. The typical "IVF-PQ" approach:
1. Partition the vectors into clusters with IVF
2. For each vector, PQ-encode its residual (the vector minus its assigned centroid)
3. At query time, visit the nprobe nearest clusters and score their PQ codes with precomputed lookup tables
FAISS implements a particularly effective combination: an IVFADC (inverted file with asymmetric distance computation) search optimized for the compressed representations, where the uncompressed query is compared directly against PQ codes via lookup tables. This enables billion-scale search on a single machine.
FAISS IndexIVFPQ configuration:
nlist: 4096 (number of clusters)
m: 16 (number of PQ subsegments)
nbits: 8 (2^8 = 256 centroids per subsegment)
Memory calculation for 1M vectors × 768 dimensions:
- Original: 1M × 768 × 4 bytes = 3 GB
- IndexIVFPQ: ~50 MB (with compression ratio ~60x)
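The same configuration expressed with FAISS (a sketch on random data; real deployments train on a representative sample):

```python
import faiss
import numpy as np

d, n = 768, 1_000_000
nlist, m, nbits = 4096, 16, 8             # as configured above

xb = np.random.rand(n, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb[:200_000])                 # k-means + PQ codebooks on a sample
index.add(xb)

index.nprobe = 32                         # clusters visited per query
D, I = index.search(xb[:5], 10)
```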
Benchmarking requires standardized datasets and metrics. Representative results from one such benchmark, on a 1M-vector workload:
| Database | Algorithm | 1M vectors recall@10 | QPS | P99 Latency |
|---|---|---|---|---|
| Qdrant | HNSW | 0.97 | 1500 | 42ms |
| Weaviate | HNSW | 0.95 | 800 | 65ms |
| Milvus | HNSW | 0.96 | 2000 | 35ms |
| Pinecone | HNSW (proprietary) | 0.97 | 1000 | 50ms |
These numbers are from Vector DB Bench on standardized hardware (8-core, 32GB RAM). Your mileage will vary significantly based on data characteristics, vector dimensionality, and hardware.
The choice of embedding model affects what the vector database must handle:
Higher dimensions capture more nuanced relationships but require more memory and computation. Modern models vary widely, from 384 dimensions (all-MiniLM-L6-v2) through 768 (typical BERT-based models) to 1,536 and 3,072 (OpenAI's text-embedding-3-small and text-embedding-3-large).
Most modern embedding models output L2-normalized vectors. For normalized embeddings, cosine similarity and L2 distance produce the same ranking, an optimization many systems exploit. Always check whether your embedding model normalizes; using L2 distance on unnormalized vectors produces different results than cosine similarity.
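A quick sanity check (the helper name is illustrative):

```python
import numpy as np

def is_l2_normalized(embeddings: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row has unit L2 norm, within tolerance."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))
```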
For databases exceeding single-machine capacity, vectors can be sharded across multiple nodes. Common sharding strategies:
1. Hash-based sharding: vectors are distributed uniformly, every query fans out to all shards, and per-shard results are merged
2. Partition-based sharding: vectors are grouped by a tenant or metadata key, so queries carrying that key touch only one shard
Production vector databases typically run replicas for fault tolerance and read scaling. Synchronous replication ensures consistency; asynchronous replication improves write throughput at the cost of eventual consistency.
Most production applications require metadata filtering alongside vector search. This is surprisingly difficult: filtering before the ANN search can leave the graph too sparse to traverse reliably, while filtering afterward wastes computation on results that are discarded and may return fewer than k hits.
Common solutions:
1. Pre-filtering: evaluate the filter first (for example, as a bitmap over the collection) and restrict the ANN search to matching vectors
2. Post-filtering with over-fetching: retrieve k × oversample candidates, then apply the filter (see the sketch below)
3. Filter-aware traversal: check the filter during graph traversal itself, skipping non-matching nodes without abandoning the search
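A sketch of option 2 against an hnswlib-style index (`knn_query` follows hnswlib's API; `metadata` and `predicate` are hypothetical stand-ins):

```python
def filtered_search(index, query, metadata, predicate, k=10, oversample=4):
    """Post-filtering: over-fetch candidates, then drop those failing the filter."""
    labels, dists = index.knn_query(query, k=k * oversample)
    hits = [(lbl, d) for lbl, d in zip(labels[0], dists[0])
            if predicate(metadata[lbl])]
    return hits[:k]  # may return fewer than k if the filter is very selective
```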
Vector databases combine sophisticated ANN algorithms with engineering discipline to enable fast, scalable similarity search. HNSW remains the dominant algorithm for most use cases, with PQ and IVF providing additional optimization for specific scenarios.
The choice of vector database matters less than the choices around embedding models, indexing parameters, and retrieval strategy. A poorly tuned system on the "fastest" database will underperform a well-tuned system on any major option. Focus on benchmarking with your actual data and queries rather than published synthetic benchmarks.