Real benchmarks from our production infrastructure: 100% recall through exhaustive search and intelligent distribution, with sub-10ms P50 latency on collections up to 1M vectors.
Measured on warm-tier collections with 512-dimensional vectors. Latency includes full round-trip: coordinator fanout, worker search, and result aggregation.
| Collection Size | Dimensions | P50 Latency | P95 Latency | P99 Latency |
|---|---|---|---|---|
| 10K vectors | 512 | 2.1ms | 3.4ms | 5.2ms |
| 100K vectors | 512 | 4.7ms | 7.8ms | 11.3ms |
| 1M vectors | 512 | 8.3ms | 14.2ms | 19.6ms |
| 10M vectors | 512 | 23ms | 38ms | 52ms |
| 100M vectors | 512 | 65ms | 95ms | 140ms |
Hot-tier collections are pinned in memory with pre-computed PCA projections, reducing per-vector distance computation cost.
| Collection Size | Warm Tier P50 | Hot Tier P50 | Speedup |
|---|---|---|---|
| 10K vectors | 2.1ms | 0.4ms | 5.3x |
| 100K vectors | 4.7ms | 0.8ms | 5.9x |
| 1M vectors | 8.3ms | 1.2ms | 6.9x |
| 10M vectors | 23ms | 3.8ms | 6.1x |
Our architecture is designed for low-latency exhaustive search. No approximate indexes, no recall tradeoffs.
Collections are sharded across workers. The coordinator fans out each query to all workers holding shards in parallel, then merges partial top-K results. Adding workers reduces per-worker scan size linearly.
Inner product and L2 distance calculations use AVX2/NEON SIMD instructions, processing 8 float32 lanes per 256-bit AVX2 operation (4 per 128-bit NEON operation). Combined with a cache-friendly memory layout, this saturates memory bandwidth on modern CPUs.
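The lane structure above can be sketched in portable Rust: eight independent accumulators, one per SIMD lane, written so the compiler can autovectorize the inner loop. This is a layout sketch, not the production kernel, and `dot` is an illustrative name.

```rust
/// Inner product over f32 slices, shaped for autovectorization:
/// 8 independent accumulators map onto the 8 float32 lanes of an
/// AVX2 register (or two NEON registers). Illustrative sketch only.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 8]; // one accumulator per lane
    let chunks = a.len() / 8;
    for c in 0..chunks {
        for lane in 0..8 {
            let i = c * 8 + lane;
            acc[lane] += a[i] * b[i];
        }
    }
    // Horizontal reduction of the lanes, then the scalar tail.
    let mut sum: f32 = acc.iter().sum();
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```

Keeping the accumulators independent avoids a loop-carried dependency on a single sum, which is what lets the compiler emit packed multiply-add instructions.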
For hot-tier collections, we pre-compute PCA projections that reduce dimensionality while preserving distance ordering, so a 1536-dimension OpenAI embedding can be searched with a fraction of the compute at full recall.
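The projection step itself is a matrix-vector product against the pre-computed principal axes. A minimal sketch, assuming the components are stored row-per-axis (the function name and storage layout are illustrative):

```rust
/// Project a d-dimensional vector into k-dimensional PCA space.
/// `components` holds one principal axis per row: k rows of d floats.
fn project(components: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    components
        .iter()
        .map(|axis| axis.iter().zip(v).map(|(a, x)| a * x).sum::<f32>())
        .collect()
}
```

Distances are then computed between projected vectors, so per-vector cost scales with k rather than the original dimensionality d.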
Vector data is stored in flat binary files and memory-mapped. The OS page cache handles warm-up automatically. Frequently-accessed collections stay hot without explicit caching logic.
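A flat layout makes addressing trivial: the byte offset of any vector is pure arithmetic, so a memory-mapped read needs no index structure. A sketch under assumed parameters (the header size and names are illustrative, not the actual file format):

```rust
/// Assumed fixed-size file header; illustrative value only.
const HEADER_BYTES: u64 = 64;

/// Byte offset of vector `i` in a flat file of float32 vectors.
fn vector_offset(i: u64, dim: u64) -> u64 {
    HEADER_BYTES + i * dim * 4 // 4 bytes per float32
}
```

Because offsets are computed rather than looked up, a scan touches the file strictly sequentially, which is the access pattern the OS page cache and readahead handle best.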
Coordinator-to-worker communication uses WebSocket with bincode serialization. Query vectors and results are transmitted as raw bytes, avoiding the parsing overhead of text formats like JSON.
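The idea behind the raw-byte wire format can be illustrated with a little-endian round-trip of a query vector (function names are illustrative; production uses bincode over WebSocket):

```rust
/// Encode a query vector as raw little-endian float32 bytes.
fn encode(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

/// Decode raw bytes back into a float32 vector.
fn decode(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

A 512-dimension query is exactly 2048 bytes on the wire, and decoding is a straight memory reinterpretation rather than a character-by-character parse.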
Data mutations (appends, deletes, reshards) bump an epoch counter. Workers transition atomically, so queries always see a consistent snapshot without per-query locking.
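One way to sketch this epoch discipline is a seqlock-style guard: the writer bumps the epoch after installing new data, and a reader retries its scan if the epoch moved underneath it. This is a minimal illustration of the idea, not the actual coordinator code; all names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of epoch-based snapshot reads (names illustrative).
struct EpochGuard {
    epoch: AtomicU64,
}

impl EpochGuard {
    fn new() -> Self {
        Self { epoch: AtomicU64::new(0) }
    }

    /// Writer: called once the new shard layout is fully in place.
    fn bump(&self) {
        self.epoch.fetch_add(1, Ordering::Release);
    }

    /// Reader: run `f`, retrying if a mutation landed mid-scan,
    /// so the result always reflects a single consistent epoch.
    fn read<T>(&self, mut f: impl FnMut() -> T) -> T {
        loop {
            let before = self.epoch.load(Ordering::Acquire);
            let out = f();
            if self.epoch.load(Ordering::Acquire) == before {
                return out;
            }
        }
    }
}
```

Readers pay two atomic loads per query instead of acquiring a lock, so the common no-mutation path stays contention-free.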
All benchmarks measured on our production hardware: AMD EPYC 7763 (64 cores), 256GB DDR4-3200, NVMe storage. Latency is measured end-to-end from HTTP request to response, including network overhead. Vectors are 512-dimensional float32 unless otherwise noted. Percentiles are computed over 10,000 queries with randomized query vectors.