
Vector Database Benchmarks Are Lying to You — Here's How to Actually Evaluate Them
The benchmark theater problem
Here's the uncomfortable truth about vector database evaluation in 2026: a significant portion of the benchmarks you'll find on GitHub were built by vendors, for vendors. Polished dashboards, carefully chosen hardware configurations, suspiciously clean recall numbers — and almost never a concurrent write in sight.
This isn't necessarily malicious. It's just that optimizing your benchmark to make your own product shine is rational behavior. The problem is that developers are making real infrastructure decisions based on this synthetic theater.
The question you should be asking isn't "which vector database is fastest?" It's "fastest under which conditions, on whose hardware, doing what workload?"
Let's fix how you approach this.
Why synthetic benchmarks fail in production
The classic benchmark flaw stack looks like this:
- Frozen dataset, no concurrent writes: the index is built once, then only queried. Real production systems ingest continuously.
- Hardware asymmetry: one system gets a compute-optimized cloud instance, another runs on defaults.
- Suspiciously selective filters: tests use metadata filters that eliminate 99% of the data. This hides the cost of moderately selective filters, where a large share of the vectors still has to be searched.
- Legacy embedding dimensions: many benchmarks still use 128-dimensional vectors. Current embedding models from OpenAI, Cohere, or Gemini typically produce 768 to 3,072 dimensions, and performance characteristics change dramatically at that size.
- Single-query latency focus: a database can look blazing fast on isolated queries while completely falling apart at 50 concurrent users.
The result? You pick a database that dominates the benchmark, build around it for three months, and then discover your p99 latency triples when your ingestion pipeline is running alongside your query traffic.
What a credible evaluation actually looks like
Before touching a single database, define your workload in concrete terms. This isn't optional ceremony — it's the entire foundation.
```python
# Workload definition template: fill this out before any evaluation
workload = {
    "data_size_now": "5M vectors",
    "data_size_12mo": "50M vectors",
    "embedding_dimensions": 1536,          # match your actual model
    "peak_query_concurrency": 100,         # simultaneous requests
    "ingestion_rate": "10k vectors/hour",  # ongoing writes
    "filter_selectivity": "medium",        # ~20-40% of data passes filter
    "metadata_fields": ["tenant_id", "doc_type", "created_at", "region"],
    "hybrid_search_needed": True,          # BM25 + vector?
    "compliance": "GDPR",                  # affects deployment model
    "existing_stack": "PostgreSQL 16",
}
```

Once you have this, your benchmark writes itself. You're not benchmarking the database in the abstract; you're benchmarking your database under your conditions.
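To make those workload numbers concrete, here's a minimal sizing sketch. It assumes vectors are stored as float32 (4 bytes per dimension) and counts only the raw vectors, so index structures, metadata, and replication come on top. The function names `raw_vector_bytes` and `sizing_summary` are illustrative, not part of any library.

```python
def raw_vector_bytes(n_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone, assuming float32 (4 bytes) per dimension."""
    return n_vectors * dimensions * bytes_per_value


def sizing_summary(n_now: int, n_12mo: int, dimensions: int, ingest_per_hour: int) -> dict:
    """Turn the headline workload numbers into figures a benchmark can target."""
    return {
        "raw_gb_now": raw_vector_bytes(n_now, dimensions) / 1e9,
        "raw_gb_12mo": raw_vector_bytes(n_12mo, dimensions) / 1e9,
        "writes_per_second": ingest_per_hour / 3600,
    }


# Numbers taken from the workload template above.
summary = sizing_summary(5_000_000, 50_000_000, 1536, 10_000)
print(summary)
```

For the template above this works out to roughly 31 GB of raw vectors today and about 307 GB at 50M vectors, before any index overhead. That gap is a good reason to benchmark at the 12-month size, not the current one.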
The metrics that actually matter
Stop optimizing for mean latency on a single query. That's the least useful number you can track.
Here's the measurement checklist:
- Recall@10 — retrieval quality, not just speed. A fast database returning wrong results is useless.
- p95 / p99 latency — what happens at the tail? This is what your users experience during load.
- Concurrent throughput — run 50–100 simultaneous queries, not sequential ones.
- Write throughput + read-while-write — ingest vectors while querying. Watch recall degrade. Watch latency spike.
- Index rebuild time — what happens after a crash or major schema change?
- Storage footprint — at 50M × 1536-dim vectors, storage costs are real money.
- Monthly cost at projected volume — some managed services look cheap at demo scale and become existentially expensive at production scale.
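Two of the checklist metrics, Recall@10 and tail latency, are easy to compute yourself once you have raw results. The sketch below assumes you already have exact ground-truth neighbors for each query (for example from a brute-force search over the same data) and a list of per-query latencies. The helper names are illustrative, and the percentile uses the simple nearest-rank definition; other definitions interpolate and give slightly different values.

```python
import math


def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the true top-k neighbors that appear in the retrieved top-k."""
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    hits = len(truth & set(retrieved_ids[:k]))
    return hits / len(truth)


def percentile(latencies, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 2 of the 4 true neighbors were retrieved.
print(recall_at_k([1, 2, 3, 4], [1, 2, 5, 6], k=4))
# Illustrative latencies of 1..100 ms.
print(percentile(list(range(1, 101)), 99))
```

Average recall over a few hundred queries rather than trusting a single one, and report p95 and p99 from the same run that produced the recall number so the two can't be tuned separately.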
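```python
import concurrent.futures
import time


def benchmark_concurrent_queries(client, queries, n_concurrent=50):
    """Run queries from n_concurrent threads; return sorted per-query latencies in seconds."""

    def single_query(q):
        start = time.perf_counter()
        client.search(q)  # assumption: swap in your client's actual query call
        return time.perf_counter() - start

    with concurrent.futures.ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        latencies = list(pool.map(single_query, queries))
    return sorted(latencies)
```

Run this benchmark while a separate thread is continuously ingesting. That's your production number.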
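One way to set up that read-while-write run is sketched below. It is a hedged example, not a finished harness: `query_fn` and `ingest_fn` are placeholders for whatever calls your database client exposes, and the in-memory `fake_query` and `fake_ingest` exist only so the sketch runs without a database.

```python
import concurrent.futures
import threading
import time


def read_while_write(query_fn, ingest_fn, queries, write_batches, n_concurrent=50):
    """Time queries from n_concurrent threads while one writer thread keeps ingesting."""
    stop = threading.Event()
    writes_done = 0

    def writer():
        nonlocal writes_done
        for batch in write_batches:
            if stop.is_set():  # stop ingesting once the query phase is over
                break
            ingest_fn(batch)
            writes_done += 1

    def timed_query(q):
        start = time.perf_counter()
        query_fn(q)
        return time.perf_counter() - start

    writer_thread = threading.Thread(target=writer, daemon=True)
    writer_thread.start()
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=n_concurrent) as pool:
            latencies = list(pool.map(timed_query, queries))
    finally:
        stop.set()
        writer_thread.join()
    return sorted(latencies), writes_done


# In-memory stand-in (hypothetical) so the sketch is runnable as-is.
store = []
lock = threading.Lock()


def fake_ingest(batch):
    with lock:
        store.extend(batch)


def fake_query(q):
    with lock:
        return len(store)


latencies, writes = read_while_write(
    fake_query,
    fake_ingest,
    queries=list(range(200)),
    write_batches=([i] * 10 for i in range(1000)),
    n_concurrent=8,
)
print(f"{len(latencies)} queries timed, {writes} write batches applied")
```

Run the same query set once with the writer and once without. The difference in p99 between the two runs is the number most vendor benchmarks leave out.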
The architectural decision tree vendors don't want you to use
Here's the part that gets buried in benchmark drama: for most teams, the choice of vector database has more to do with constraints than raw performance.
Start with pgvector if:
- You already run PostgreSQL
- Your dataset is under ~5M vectors
- You need ACID guarantees alongside vector search
- Your team doesn't want to operate another distributed system
Move to a dedicated engine (Qdrant, Weaviate, Milvus) when:
- pgvector's recall or latency under load becomes measurably insufficient
- You need hybrid search with strong BM25 integration
- Your vector workload dominates and deserves dedicated resources
Be very careful with usage-based managed services when:
- Your query volume is unpredictable or spiky
- You're modeling costs for 12 months out — run the numbers concretely
Premature vector database specialization is one of the most common and expensive mistakes in RAG system architecture. Many teams that reached for Pinecone or Milvus on day one would have been better served by pgvector for their first 18 months.
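If it helps to see the decision tree as something you can argue with in code review, here's one possible encoding. It is a simplification: the `Constraints` fields, the ordering of the checks, and the hard 5M cutoff are one reading of the lists above, not a rule, and it leaves out the ACID and team-capacity questions entirely.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    runs_postgres: bool
    vector_count: int
    pgvector_measured_insufficient: bool  # from your own load test, not a vendor chart
    needs_strong_bm25_hybrid: bool
    vector_workload_dominates: bool
    spiky_query_volume: bool


def recommend(c: Constraints) -> list[str]:
    """Return advice strings derived from the constraint lists in this section."""
    advice = []
    dedicated_signals = (
        c.pgvector_measured_insufficient
        or c.needs_strong_bm25_hybrid
        or c.vector_workload_dominates
    )
    if dedicated_signals:
        advice.append("evaluate a dedicated engine (Qdrant, Weaviate, Milvus)")
    elif c.runs_postgres and c.vector_count < 5_000_000:
        advice.append("start with pgvector")
    else:
        advice.append("benchmark pgvector and one dedicated engine on your workload")
    if c.spiky_query_volume:
        advice.append("model 12-month cost before choosing a usage-based managed service")
    return advice


team = Constraints(
    runs_postgres=True,
    vector_count=2_000_000,
    pgvector_measured_insufficient=False,
    needs_strong_bm25_hybrid=False,
    vector_workload_dominates=False,
    spiky_query_volume=True,
)
print(recommend(team))
```

Note that none of the inputs is a benchmark score. The only performance input is `pgvector_measured_insufficient`, and that one comes from your own test.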
The hybrid search blind spot
One thing the benchmark ecosystem dramatically under-tests: hybrid search quality under realistic conditions.
Vector-only search fails systematically on:
- Exact product codes, error codes, legal identifiers
- Version numbers and precise terminology
- Names that don't have clean semantic neighbors
If your application involves any of these — and most production RAG applications do — then benchmarking vector latency alone is measuring the wrong thing entirely. You need to evaluate the combined BM25 + vector retrieval pipeline, including how the system merges and reranks results.
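To make "merges and reranks" less abstract, here's a small sketch of reciprocal rank fusion (RRF), one common way to combine a keyword ranking with a vector ranking. This is an illustration of the general technique, not a claim about how any particular database does it; engines differ, so test the fusion your candidate actually ships. The document IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Merge ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], str(d)))
    return ordered[:top_n]


bm25_hits = ["ERR-4012", "doc-7", "doc-3"]    # keyword ranking (illustrative IDs)
vector_hits = ["doc-3", "ERR-4012", "doc-9"]  # embedding ranking (illustrative IDs)

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

The exact error code ranks first here because both retrievers found it and the keyword side ranked it at the top. When you evaluate hybrid search, build your ground truth from queries like this one, where the right answer is an identifier and not a semantic neighbor.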
Why this matters
The stakes here aren't abstract. A poor vector database choice means:
- Retrieval quality problems that get misdiagnosed as "AI hallucination"
- Latency degradation that surfaces only in production, after you've built against the API
- Cost surprises that require emergency migrations at the worst possible time
- Operational burden that your team wasn't staffed to handle
The good news is that the methodology for avoiding all of this is straightforward — it just requires rejecting the benchmarks that were handed to you and building tests around your actual workload.
The databases themselves are genuinely good in 2026. The evaluation culture around them still needs work.
Actionable next steps:
1. Write down your workload spec before opening any benchmark repository
2. If you're on Postgres already, deploy pgvector this week and measure it against your real data
3. For any vendor benchmark you reference, check: same hardware? concurrent writes? realistic filter selectivity?
4. Build a 30-minute read-while-write test. If a candidate database can't survive it gracefully, you've learned something critical before it's expensive to learn.

