Performance

Real benchmarks from our production infrastructure: 100% recall with sub-10ms latency on million-vector collections, achieved through exhaustive search and intelligent distribution.

100%
Recall rate
8.3ms
Median query latency (warm)
1.2ms
Median query latency (hot)
0
Missed results

Query latency by collection size

Measured on warm-tier collections with 512-dimensional vectors. Latency includes full round-trip: coordinator fanout, worker search, and result aggregation.

Collection Size   Dimensions   P50 Latency   P95 Latency   P99 Latency
10K vectors       512          2.1ms         3.4ms         5.2ms
100K vectors      512          4.7ms         7.8ms         11.3ms
1M vectors        512          8.3ms         14.2ms        19.6ms
10M vectors       512          23ms          38ms          52ms
100M vectors      512          65ms          95ms          140ms

Hot tier acceleration

Hot-tier collections are pinned in memory with pre-computed PCA projections, reducing per-vector distance computation cost.

Collection Size   Warm Tier   Hot Tier   Speedup
10K vectors       2.1ms       0.4ms      5.3x
100K vectors      4.7ms       0.8ms      5.9x
1M vectors        8.3ms       1.2ms      6.9x
10M vectors       23ms        3.8ms      6.1x

Why it's fast

Our architecture is designed for low-latency exhaustive search. No approximate indexes, no recall tradeoffs.

Distributed fanout

Collections are sharded across workers. The coordinator fans out each query in parallel to every worker holding a shard, then merges the partial top-K results. Adding workers reduces per-worker scan size linearly.
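
The merge step can be sketched in a few lines. This is an illustrative model, not the coordinator's actual implementation: each worker is assumed to return its local top-K as a list of (distance, id) pairs sorted ascending by distance.

```python
import heapq

def merge_topk(partials, k):
    """Merge per-worker top-K lists into a global top-K.

    Each partial list is assumed sorted ascending by distance, so a
    streaming k-way merge suffices; no worker result is re-scanned.
    """
    return heapq.nsmallest(k, heapq.merge(*partials))

worker_a = [(0.12, "v7"), (0.35, "v2")]
worker_b = [(0.08, "v9"), (0.41, "v5")]
print(merge_topk([worker_a, worker_b], 3))
# [(0.08, 'v9'), (0.12, 'v7'), (0.35, 'v2')]
```

Because each partial list is already sorted, the merge costs O(K log W) for W workers rather than re-sorting all candidates.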

SIMD distance computation

Inner product and L2 distance calculations use AVX2/NEON SIMD instructions, processing eight float32 lanes per instruction on AVX2 (four on NEON). Combined with a cache-friendly memory layout, this saturates memory bandwidth on modern CPUs.
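
To make the lane structure concrete, here is a scalar Python sketch of squared-L2 distance accumulated in eight independent lanes, mirroring how an AVX2 fused multiply-add updates eight float32 accumulators at once. The real kernels use compiler intrinsics; this is purely illustrative.

```python
def l2_sq_lanes(a, b, lanes=8):
    """Squared L2 distance with 8-wide lane accumulation.

    Vector length is assumed to be a multiple of `lanes` (true for
    512-dim vectors). Each inner iteration models one SIMD FMA lane;
    the final sum() models the horizontal reduction at the end.
    """
    acc = [0.0] * lanes
    for i in range(0, len(a), lanes):
        for j in range(lanes):
            d = a[i + j] - b[i + j]
            acc[j] += d * d          # one FMA lane
    return sum(acc)                  # horizontal reduction
```

Keeping eight independent accumulators also breaks the dependency chain between iterations, which is what lets the hardware pipeline the FMAs.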

PCA dimensionality reduction

For hot-tier collections, we pre-compute PCA projections that reduce dimensionality while preserving distance ordering. A 1536-dimensional OpenAI embedding can be searched with a fraction of the compute while maintaining full recall.
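
The projection itself is just a matrix-vector product against the pre-computed component matrix. A minimal sketch, assuming `components` holds the principal-component rows fitted offline (the fitting step is omitted here):

```python
def project(vec, components):
    """Project a vector onto pre-computed PCA components.

    `components` is a list of component rows (each the same length as
    `vec`), produced ahead of time from the collection's data. The
    result has one coordinate per component, i.e. the reduced dimension.
    """
    return [sum(c * x for c, x in zip(row, vec)) for row in components]

# Toy components that keep only the first two coordinates:
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(project([3.0, 4.0, 5.0], components))  # [3.0, 4.0]
```

Distances are then computed between projected vectors, so per-vector cost scales with the reduced dimension instead of the original one.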

Memory-mapped I/O

Vector data is stored in flat binary files and memory-mapped. The OS page cache handles warm-up automatically. Frequently-accessed collections stay hot without explicit caching logic.
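
Reading a vector from such a file reduces to an offset computation over the mapping. A minimal sketch, assuming a headerless little-endian float32 layout (the actual on-disk format is not documented here):

```python
import mmap
import struct

def read_vector(path, idx, dim=512):
    """Read vector `idx` from a flat binary file of float32 vectors.

    Assumed layout (hypothetical): vectors stored back-to-back, no
    header, little-endian float32, 4 * dim bytes each. The OS page
    cache keeps recently touched pages resident across calls, which is
    what makes repeated reads of a warm collection cheap.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        off = idx * dim * 4  # 4 bytes per float32
        return struct.unpack(f"<{dim}f", mm[off:off + dim * 4])
```

Because the mapping is read-only and page-aligned, many worker threads can scan the same file concurrently without copies or locks.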

Binary protocol

Coordinator-to-worker communication uses WebSocket with bincode serialization. Query vectors and results are transmitted as raw bytes, avoiding the parsing overhead of JSON.
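
The idea of a raw-bytes query frame can be shown with a hypothetical layout (the real protocol uses bincode's own framing, which differs in detail): a little-endian u32 for K, a u32 for the dimension, then the float32 coordinates.

```python
import struct

def encode_query(vec, k):
    """Pack a query frame: u32 k, u32 dim, then dim float32s (LE).

    Hypothetical wire layout for illustration -- decoding is a fixed-
    offset read, with no tokenizing or number parsing as in JSON.
    """
    return struct.pack(f"<II{len(vec)}f", k, len(vec), *vec)

def decode_query(buf):
    k, dim = struct.unpack_from("<II", buf)
    vec = struct.unpack_from(f"<{dim}f", buf, 8)
    return k, list(vec)
```

A 512-dim query is 2056 bytes either way; the win is that the receiver casts bytes to floats directly instead of parsing text.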

Epoch-based consistency

Data mutations (appends, deletes, reshards) bump an epoch counter. Workers transition atomically, so queries always see a consistent snapshot without per-query locking.
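
The pattern can be sketched as a single published reference that mutations replace wholesale, so readers never observe a half-applied change. This is a simplified model of the idea, not the workers' actual data structures:

```python
class EpochStore:
    """Epoch-based snapshot sketch.

    Mutations build a new immutable snapshot and publish it by swapping
    one reference; the epoch counter identifies which snapshot a reader
    saw. Readers take no locks -- they just dereference the current
    snapshot and work with it for the whole query.
    """

    def __init__(self, data):
        self._snapshot = (0, tuple(data))  # (epoch, immutable data)

    def read(self):
        return self._snapshot              # single atomic reference read

    def append(self, items):
        epoch, data = self._snapshot
        self._snapshot = (epoch + 1, data + tuple(items))
```

A query that grabbed epoch N keeps scanning epoch N's data even if an append publishes epoch N+1 mid-flight, which is exactly the consistent-snapshot guarantee described above.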

Benchmark methodology

All benchmarks measured on our production hardware: AMD EPYC 7763 (64 cores), 256GB DDR4-3200, NVMe storage. Latency measured end-to-end from HTTP request to response, including network overhead. Vectors are 512-dimensional float32 unless otherwise noted. Results are medians over 10,000 queries with randomized query vectors.