ANN algorithms, indexing strategies, and performance optimization
Vector databases have become critical infrastructure for AI applications, powering everything from semantic search to recommendation systems. At their core, these systems solve a seemingly simple problem: given a query vector, find the K most similar vectors from a database of billions.
This article dives deep into the algorithms and engineering decisions that make vector search tractable at scale. We'll explore the approximate nearest neighbor (ANN) algorithms that power modern systems, the distance metrics that define similarity, and the benchmarking methodologies that help you choose the right system for your use case.
A vector database stores collections of embeddings—dense vector representations of data (text, images, audio) that capture semantic meaning. Similar items cluster together in high-dimensional space.
The naive approach to search is linear: compute the distance from your query vector to every vector in the database, sort, and return the top K. This brute-force search guarantees perfect recall but doesn't scale: scanning 1 billion 1536-dimensional float32 embeddings means reading roughly 6 TB of vector data (10^9 × 1536 × 4 bytes) for every query, which is impossible at interactive latencies.
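For reference, a minimal NumPy sketch of brute-force top-K (the function name is illustrative):

```python
import numpy as np

def brute_force_topk(db: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by full scan: one distance per database vector."""
    d2 = np.sum((db - query) ** 2, axis=1)   # squared L2 distance to every row
    idx = np.argpartition(d2, k)[:k]         # unordered top-k in O(n)
    return idx[np.argsort(d2[idx])]          # order the k results by distance

db = np.random.rand(100_000, 1536).astype(np.float32)
query = np.random.rand(1536).astype(np.float32)
print(brute_force_topk(db, query, k=10))
```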
Approximate Nearest Neighbor (ANN) algorithms sacrifice perfect accuracy for dramatic speedups. By preprocessing the index to understand the data distribution, ANN algorithms can prune vast regions of the search space while still returning highly relevant results.
How you measure similarity determines what "nearest" means. The choice of metric should align with your embedding model's training objective.
Cosine similarity measures the angle between two vectors, ignoring magnitude. It ranges from -1 (opposite) to 1 (identical) and is the most common metric for text embeddings:
cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)
Cosine similarity is orientation-based. If your embeddings are L2-normalized (as most sentence-transformer models produce), ranking by cosine similarity is equivalent to ranking by L2 distance, since ||a - b||² = 2 - 2 × cosine_similarity(a, b). Many systems pre-normalize embeddings to simplify the computation.
L2 (Euclidean) distance measures straight-line distance in Euclidean space. It is more intuitive for applications where magnitude matters:
L2_distance(a, b) = √(Σ(a_i - b_i)²)
L2 is preferred for image embeddings and some recommendation systems where absolute differences in feature values carry meaning.
The dot product measures the projection of one vector onto another, scaled by magnitude. It can be negative for vectors pointing in opposite directions:
dot_product(a, b) = Σ(a_i × b_i)
OpenAI's text-embedding-3-large returns unit-length vectors, so dot product and cosine similarity produce identical scores for its embeddings. In general, for L2-normalized vectors the dot product equals cosine similarity exactly.
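A minimal NumPy sketch illustrating all three metrics and their relationships on unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=1536), rng.normal(size=1536)

# L2-normalize, as most sentence-transformer models do.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
l2 = np.linalg.norm(a - b)

print(f"cosine:      {cosine:.6f}")
print(f"dot product: {dot:.6f}")            # equals cosine for unit vectors
print(f"l2 squared:  {l2**2:.6f}")          # equals 2 - 2*cosine
print(f"2 - 2*cos:   {2 - 2 * cosine:.6f}")
```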
HNSW (Malkov and Yashunin, 2016) is the dominant ANN algorithm in production systems. It builds a multi-layer graph where each layer is a sparse graph connecting nearby points, with exponentially varying connection densities.
During indexing, each vector is assigned a random top level and inserted into that layer and every layer below it. Higher layers have sparser connections (longer "highways"), while lower layers have dense local connections. The insertion level follows an exponentially decaying distribution:

P(level ≥ l) = exp(-l / m_L), where m_L = 1/ln(M)

so each successive layer holds roughly a factor of M fewer elements.
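A sketch of the corresponding insertion-level draw (the function name is illustrative):

```python
import math
import random

def random_level(M: int) -> int:
    """Draw an insertion level with P(level >= l) = M^(-l)."""
    m_L = 1.0 / math.log(M)                  # level normalization factor
    return int(-math.log(random.random()) * m_L)

# With M=32, roughly 97% of elements land on level 0, ~3% reach level 1, etc.
levels = [random_level(32) for _ in range(100_000)]
print(sum(l == 0 for l in levels) / len(levels))
```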
This creates a skip-list-like structure where searching starts at the top layer and greedily traverses the longest edges, then descends to progressively finer layers.
Search(query):
    current = entry_point   # start in the top layer
    for each layer from top down to layer 1:
        # Greedy navigation: hop to whichever neighbor is closer to the query
        while exists_unvisited_neighbor(current, layer):
            next = closest_unvisited_neighbor(query, current, layer)
            if distance(query, next) < distance(query, current):
                current = next
            else:
                break
    # Bottom layer: best-first search over a candidate list of size efSearch
    candidates = best_first_search(query, start=current, layer=0, ef=efSearch)
    return the k closest vectors among candidates
The search complexity is O(log n) because the long-range edges in upper layers let the algorithm "jump" across the graph, while the dense lower layers provide precise local search.
| Parameter | Typical Values | Effect |
|---|---|---|
| M (connections per node) | 16-64 | Higher = better recall, more memory, slower indexing |
| efConstruction | 100-400 | Higher = better recall, slower indexing |
| efSearch | 50-200 | Higher = better recall, slower search |
| max_elements | Set at init | Cannot be increased without rebuild |
For a million vectors with a 95% recall target, typical settings are M=32, efConstruction=200, efSearch=100. Memory is approximately 4 bytes per dimension for the stored float32 vectors plus 4 bytes per graph connection per element.
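A sketch of those settings using the hnswlib library (random data stands in for real embeddings):

```python
import hnswlib
import numpy as np

dim, n = 768, 1_000_000
data = np.random.rand(n, dim).astype(np.float32)

# Build the index with the settings discussed above.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=32, ef_construction=200)
index.add_items(data, np.arange(n))

index.set_ef(100)  # efSearch: raise for better recall, lower for speed

labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10)
```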
IVF partitions the vector space into clusters using k-means. Each vector belongs to one cluster (its nearest centroid), and the inverted index maps centroids to their member vectors.
Indexing:
1. Run k-means to find N centroids (N typically 4096-65536)
2. For each vector, assign to nearest centroid
3. Store vectors in posting lists per centroid
Search:
1. Find K nearest centroids to query
2. Search posting lists of those centroids
3. Return k-nearest across searched lists
IVF alone is relatively slow for high recall, but it pairs well with other techniques. The nprobe parameter controls how many clusters to search—higher nprobe means better recall but slower search.
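A minimal FAISS sketch of an uncompressed IVF index, showing the nprobe knob:

```python
import faiss
import numpy as np

d, n = 128, 100_000
xb = np.random.rand(n, d).astype(np.float32)
xq = np.random.rand(5, d).astype(np.float32)

nlist = 1024                              # number of k-means clusters
quantizer = faiss.IndexFlatL2(d)          # assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                           # k-means over the data
index.add(xb)

index.nprobe = 16                         # clusters searched per query
D, I = index.search(xq, 10)               # distances and ids, shape (5, 10)
```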
Product Quantization compresses vectors by splitting them into subvectors and quantizing each independently. This dramatically reduces memory usage at the cost of recall.
For a 128-dimensional vector with PQ split into 8 segments of 16 dimensions each:
Original vector: [d1, d2, d3, ..., d128] (128 floats = 512 bytes)
Split into 8 segments:
Segment 1: [d1, d2, ..., d16]
Segment 2: [d17, d18, ..., d32]
...
Segment 8: [d113, d114, ..., d128]
Each segment is quantized to a centroid ID (0-255)
Final representation: [c1, c2, c3, c4, c5, c6, c7, c8] (8 bytes!)
A 512-byte vector becomes 8 bytes—a 64x compression ratio. During search, query vectors are also split and quantized, and distances are computed between quantized segments using precomputed lookup tables.
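A toy sketch of the encoding step, assuming scikit-learn's KMeans for the per-segment codebooks (real libraries train these far more efficiently):

```python
import numpy as np
from sklearn.cluster import KMeans

d, m, k = 128, 8, 256                # dims, segments, centroids per segment
seg = d // m                         # 16 dimensions per segment
X = np.random.rand(10_000, d).astype(np.float32)

codebooks = []
codes = np.empty((len(X), m), dtype=np.uint8)
for i in range(m):
    sub = X[:, i * seg:(i + 1) * seg]          # one 16-dim slice
    km = KMeans(n_clusters=k, n_init=1).fit(sub)
    codebooks.append(km.cluster_centers_)      # 256 x 16 table per segment
    codes[:, i] = km.labels_                   # each segment -> one byte

print(codes.nbytes // len(X))  # 8 bytes per vector, down from 512
```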
PQ is almost always combined with IVF. The typical "IVF-PQ" approach:
1. Partition the vectors into clusters with IVF
2. For each vector, PQ-encode its residual (the vector minus its assigned centroid)
3. At query time, visit the nprobe nearest clusters and score their PQ codes with precomputed lookup tables
FAISS implements a particularly effective combination: an IVFADC (inverted file with asymmetric distance computation) search optimized for the compressed representations, where the uncompressed query is compared directly against PQ codes via lookup tables. This enables billion-scale search on a single machine.
FAISS IndexIVFPQ configuration:
nlist: 4096 (number of clusters)
m: 16 (number of PQ subsegments)
nbits: 8 (2^8 = 256 centroids per subsegment)
Memory calculation for 1M vectors × 768 dimensions:
- Original: 1M × 768 × 4 bytes = 3 GB
- IndexIVFPQ: ~50 MB (with compression ratio ~60x)
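The same configuration expressed with FAISS (a sketch on random data; real deployments train on a representative sample):

```python
import faiss
import numpy as np

d, n = 768, 1_000_000
nlist, m, nbits = 4096, 16, 8             # as configured above

xb = np.random.rand(n, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb[:200_000])                 # k-means + PQ codebooks on a sample
index.add(xb)

index.nprobe = 32                         # clusters visited per query
D, I = index.search(xb[:5], 10)
```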
Benchmarking requires standardized datasets and metrics. Representative results from one such benchmark, on a 1M-vector workload:
| Database | Algorithm | 1M vectors recall@10 | QPS | P99 Latency |
|---|---|---|---|---|
| Qdrant | HNSW | 0.97 | 1500 | 42ms |
| Weaviate | HNSW | 0.95 | 800 | 65ms |
| Milvus | HNSW | 0.96 | 2000 | 35ms |
| Pinecone | HNSW (proprietary) | 0.97 | 1000 | 50ms |
These numbers are from Vector DB Bench on standardized hardware (8-core, 32GB RAM). Your mileage will vary significantly based on data characteristics, vector dimensionality, and hardware.
The choice of embedding model affects what the vector database must handle:
Higher dimensions capture more nuanced relationships but require more memory and computation. Modern models vary widely, from 384 dimensions (all-MiniLM-L6-v2) through 768 (typical BERT-based models) to 1,536 and 3,072 (OpenAI's text-embedding-3-small and text-embedding-3-large).
Most modern embedding models output L2-normalized vectors. For normalized embeddings, cosine similarity and L2 distance produce the same ranking, an optimization many systems exploit. Always check whether your embedding model normalizes; using L2 distance on unnormalized vectors produces different results than cosine similarity.
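A quick sanity check (the helper name is illustrative):

```python
import numpy as np

def is_l2_normalized(embeddings: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row has unit L2 norm, within tolerance."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))
```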
For databases exceeding single-machine capacity, vectors can be sharded across multiple nodes. Common sharding strategies:
1. Hash-based sharding: vectors are distributed uniformly, every query fans out to all shards, and per-shard results are merged
2. Partition-based sharding: vectors are grouped by a tenant or metadata key, so queries carrying that key touch only one shard
Production vector databases typically run replicas for fault tolerance and read scaling. Synchronous replication ensures consistency; asynchronous replication improves write throughput at the cost of eventual consistency.
Most production applications require metadata filtering alongside vector search. This is surprisingly difficult: filtering before the ANN search can leave the graph too sparse to traverse reliably, while filtering afterward wastes computation on results that are discarded and may return fewer than k hits.
Common solutions:
1. Pre-filtering: evaluate the filter first (for example, as a bitmap over the collection) and restrict the ANN search to matching vectors
2. Post-filtering with over-fetching: retrieve k × oversample candidates, then apply the filter (see the sketch below)
3. Filter-aware traversal: check the filter during graph traversal itself, skipping non-matching nodes without abandoning the search
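A sketch of option 2 against an hnswlib-style index (`knn_query` follows hnswlib's API; `metadata` and `predicate` are hypothetical stand-ins):

```python
def filtered_search(index, query, metadata, predicate, k=10, oversample=4):
    """Post-filtering: over-fetch candidates, then drop those failing the filter."""
    labels, dists = index.knn_query(query, k=k * oversample)
    hits = [(lbl, d) for lbl, d in zip(labels[0], dists[0])
            if predicate(metadata[lbl])]
    return hits[:k]  # may return fewer than k if the filter is very selective
```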
Vector databases combine sophisticated ANN algorithms with engineering discipline to enable fast, scalable similarity search. HNSW remains the dominant algorithm for most use cases, with PQ and IVF providing additional optimization for specific scenarios.
The choice of vector database matters less than the choices around embedding models, indexing parameters, and retrieval strategy. A poorly tuned system on the "fastest" database will underperform a well-tuned system on any major option. Focus on benchmarking with your actual data and queries rather than published synthetic benchmarks.