Performance

Real benchmarks from our production infrastructure: 100% recall with sub-10ms latency on million-vector collections, achieved through exhaustive search and intelligent distribution.

100%
Recall rate
8.3ms
Median query latency (warm)
1.2ms
Median query latency (hot)
0
Missed results

Query latency by collection size

Measured on warm-tier collections with 512-dimensional vectors. Latency includes full round-trip: coordinator fanout, worker search, and result aggregation.

Collection Size   Dimensions   P50 Latency   P95 Latency   P99 Latency
10K vectors       512          2.1ms         3.4ms         5.2ms
100K vectors      512          4.7ms         7.8ms         11.3ms
1M vectors        512          8.3ms         14.2ms        19.6ms
10M vectors       512          23ms          38ms          52ms
100M vectors      512          65ms          95ms          140ms

Hot tier acceleration

Hot-tier collections are pinned in memory with pre-computed PCA projections, reducing per-vector distance computation cost.

Collection Size   Warm Tier   Hot Tier   Speedup
10K vectors       2.1ms       0.4ms      5.3x
100K vectors      4.7ms       0.8ms      5.9x
1M vectors        8.3ms       1.2ms      6.9x
10M vectors       23ms        3.8ms      6.1x

Why it's fast

Our architecture is designed for low-latency exhaustive search. No approximate indexes, no recall tradeoffs.

Distributed fanout

Collections are sharded across workers. The coordinator fans out each query in parallel to every worker holding a shard, then merges the partial top-K results. Adding workers reduces per-worker scan size linearly.
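
The merge step can be sketched in a few lines. This is an illustrative model, not the coordinator's actual implementation: each worker is assumed to return its local top-K as a list of (distance, id) pairs sorted ascending by distance.

```python
import heapq

def merge_topk(partials, k):
    """Merge per-worker top-K lists into a global top-K.

    Each partial list is assumed sorted ascending by distance, so a
    streaming k-way merge suffices; no worker result is re-scanned.
    """
    return heapq.nsmallest(k, heapq.merge(*partials))

worker_a = [(0.12, "v7"), (0.35, "v2")]
worker_b = [(0.08, "v9"), (0.41, "v5")]
print(merge_topk([worker_a, worker_b], 3))
# [(0.08, 'v9'), (0.12, 'v7'), (0.35, 'v2')]
```

Because each partial list is already sorted, the merge costs O(K log W) for W workers rather than re-sorting all candidates.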

SIMD distance computation

Inner product and L2 distance calculations use AVX2/NEON SIMD instructions, processing eight float32 lanes per instruction on AVX2 (four on NEON). Combined with a cache-friendly memory layout, this saturates memory bandwidth on modern CPUs.
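
To make the lane structure concrete, here is a scalar Python sketch of squared-L2 distance accumulated in eight independent lanes, mirroring how an AVX2 fused multiply-add updates eight float32 accumulators at once. The real kernels use compiler intrinsics; this is purely illustrative.

```python
def l2_sq_lanes(a, b, lanes=8):
    """Squared L2 distance with 8-wide lane accumulation.

    Vector length is assumed to be a multiple of `lanes` (true for
    512-dim vectors). Each inner iteration models one SIMD FMA lane;
    the final sum() models the horizontal reduction at the end.
    """
    acc = [0.0] * lanes
    for i in range(0, len(a), lanes):
        for j in range(lanes):
            d = a[i + j] - b[i + j]
            acc[j] += d * d          # one FMA lane
    return sum(acc)                  # horizontal reduction
```

Keeping eight independent accumulators also breaks the dependency chain between iterations, which is what lets the hardware pipeline the FMAs.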

PCA dimensionality reduction

For hot-tier collections, we pre-compute PCA projections that reduce dimensionality while preserving distance ordering. A 1536-dimensional OpenAI embedding can be searched with a fraction of the compute while maintaining full recall.
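
The projection itself is just a matrix-vector product against the pre-computed component matrix. A minimal sketch, assuming `components` holds the principal-component rows fitted offline (the fitting step is omitted here):

```python
def project(vec, components):
    """Project a vector onto pre-computed PCA components.

    `components` is a list of component rows (each the same length as
    `vec`), produced ahead of time from the collection's data. The
    result has one coordinate per component, i.e. the reduced dimension.
    """
    return [sum(c * x for c, x in zip(row, vec)) for row in components]

# Toy components that keep only the first two coordinates:
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(project([3.0, 4.0, 5.0], components))  # [3.0, 4.0]
```

Distances are then computed between projected vectors, so per-vector cost scales with the reduced dimension instead of the original one.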

Memory-mapped I/O

Vector data is stored in flat binary files and memory-mapped. The OS page cache handles warm-up automatically. Frequently-accessed collections stay hot without explicit caching logic.
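
Reading a vector from such a file reduces to an offset computation over the mapping. A minimal sketch, assuming a headerless little-endian float32 layout (the actual on-disk format is not documented here):

```python
import mmap
import struct

def read_vector(path, idx, dim=512):
    """Read vector `idx` from a flat binary file of float32 vectors.

    Assumed layout (hypothetical): vectors stored back-to-back, no
    header, little-endian float32, 4 * dim bytes each. The OS page
    cache keeps recently touched pages resident across calls, which is
    what makes repeated reads of a warm collection cheap.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        off = idx * dim * 4  # 4 bytes per float32
        return struct.unpack(f"<{dim}f", mm[off:off + dim * 4])
```

Because the mapping is read-only and page-aligned, many worker threads can scan the same file concurrently without copies or locks.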

Binary protocol

Coordinator-to-worker communication uses WebSocket with bincode serialization. Query vectors and results are transmitted as raw bytes, avoiding the parsing overhead of JSON.
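
The idea of a raw-bytes query frame can be shown with a hypothetical layout (the real protocol uses bincode's own framing, which differs in detail): a little-endian u32 for K, a u32 for the dimension, then the float32 coordinates.

```python
import struct

def encode_query(vec, k):
    """Pack a query frame: u32 k, u32 dim, then dim float32s (LE).

    Hypothetical wire layout for illustration -- decoding is a fixed-
    offset read, with no tokenizing or number parsing as in JSON.
    """
    return struct.pack(f"<II{len(vec)}f", k, len(vec), *vec)

def decode_query(buf):
    k, dim = struct.unpack_from("<II", buf)
    vec = struct.unpack_from(f"<{dim}f", buf, 8)
    return k, list(vec)
```

A 512-dim query is 2056 bytes either way; the win is that the receiver casts bytes to floats directly instead of parsing text.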

Epoch-based consistency

Data mutations (appends, deletes, reshards) bump an epoch counter. Workers transition atomically, so queries always see a consistent snapshot without per-query locking.
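
The pattern can be sketched as a single published reference that mutations replace wholesale, so readers never observe a half-applied change. This is a simplified model of the idea, not the workers' actual data structures:

```python
class EpochStore:
    """Epoch-based snapshot sketch.

    Mutations build a new immutable snapshot and publish it by swapping
    one reference; the epoch counter identifies which snapshot a reader
    saw. Readers take no locks -- they just dereference the current
    snapshot and work with it for the whole query.
    """

    def __init__(self, data):
        self._snapshot = (0, tuple(data))  # (epoch, immutable data)

    def read(self):
        return self._snapshot              # single atomic reference read

    def append(self, items):
        epoch, data = self._snapshot
        self._snapshot = (epoch + 1, data + tuple(items))
```

A query that grabbed epoch N keeps scanning epoch N's data even if an append publishes epoch N+1 mid-flight, which is exactly the consistent-snapshot guarantee described above.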

Benchmark methodology

All benchmarks measured on our production hardware: AMD EPYC 7763 (64 cores), 256GB DDR4-3200, NVMe storage. Latency measured end-to-end from HTTP request to response, including network overhead. Vectors are 512-dimensional float32 unless otherwise noted. Results are medians over 10,000 queries with randomized query vectors.