Vector Database Benchmarks Are Lying to You — Here's How to Actually Evaluate Them

Author: HERALD

May 19, 2026 | 5 min read

The benchmark theater problem

Here's the uncomfortable truth about vector database evaluation in 2026: a significant portion of the benchmarks you'll find on GitHub were built by vendors, for vendors. Polished dashboards, carefully chosen hardware configurations, suspiciously clean recall numbers — and almost never a concurrent write in sight.

This isn't necessarily malicious. It's just that optimizing your benchmark to make your own product shine is rational behavior. The problem is that developers are making real infrastructure decisions based on this synthetic theater.

> The question you should be asking isn't "which vector database is fastest?" but "fastest under which conditions, on whose hardware, doing what workload?"

Let's fix how you approach this.

Why synthetic benchmarks fail in production

The classic benchmark flaw stack looks like this:

Frozen dataset, no concurrent writes: the index is built once, then only queried. Real production systems ingest continuously.
Hardware asymmetry: one system gets a compute-optimized cloud instance, another runs on defaults.
Suspiciously selective filters: tests use metadata filters that eliminate 99% of the data, which hides the cost of filtered search when the filter is only moderately selective.
Legacy embedding dimensions: many benchmarks still use 128-dimensional vectors. Modern embeddings from OpenAI, Cohere, or Gemini typically run from roughly 768 to 3,072 dimensions (OpenAI's text-embedding-3-small and text-embedding-3-large default to 1,536 and 3,072). Performance characteristics change dramatically at that width.
Single-query latency focus: a database can look blazing fast on isolated queries while completely falling apart at 50 concurrent users.

The result? You pick a database that dominates the benchmark, build around it for three months, and then discover your p99 latency triples when your ingestion pipeline is running alongside your query traffic.
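The five flaws above can be turned into a quick screening check for any published benchmark. The sketch below is illustrative only: the `BenchmarkReport` fields and both numeric thresholds are assumptions made for this example, not an established standard.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkReport:
    """Hypothetical summary of a published benchmark (field names are illustrative)."""
    concurrent_writes: bool        # were vectors ingested during the query phase?
    same_hardware: bool            # did every system run on identical instances?
    filter_pass_fraction: float    # fraction of data that passes the metadata filter
    embedding_dimensions: int
    max_query_concurrency: int


def audit(report: BenchmarkReport, your_dimensions: int, your_concurrency: int) -> list[str]:
    """Return red flags; an empty list means none of the five flaws was detected."""
    flags = []
    if not report.concurrent_writes:
        flags.append("frozen dataset: no writes during queries")
    if not report.same_hardware:
        flags.append("hardware asymmetry between systems")
    if report.filter_pass_fraction < 0.05:  # assumed threshold
        flags.append("filters far more selective than a typical workload")
    if report.embedding_dimensions < your_dimensions / 2:  # assumed threshold
        flags.append("embedding dimensions much smaller than yours")
    if report.max_query_concurrency < your_concurrency:
        flags.append("tested below your peak concurrency")
    return flags
```

A benchmark that trips several of these flags is not necessarily wrong, but its numbers say little about how the system will behave under your workload.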

What a credible evaluation actually looks like

Before touching a single database, define your workload in concrete terms. This isn't optional ceremony — it's the entire foundation.

python

# Workload definition template: fill this out before any evaluation
workload = {
    "data_size_now": "5M vectors",
    "data_size_12mo": "50M vectors",
    "embedding_dimensions": 1536,  # match your actual model
    "peak_query_concurrency": 100,  # simultaneous requests
    "ingestion_rate": "10k vectors/hour",  # ongoing writes
    "filter_selectivity": "medium",  # ~20-40% of data passes filter
    "metadata_fields": ["tenant_id", "doc_type", "created_at", "region"],
    "hybrid_search_needed": True,  # BM25 + vector?
    "compliance": "GDPR",  # affects deployment model
    "existing_stack": "PostgreSQL 16"
}

Once you have this, your benchmark writes itself. You're not benchmarking the database in the abstract — you're benchmarking your database under your conditions.
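As a rough illustration of how a spec like this turns into test parameters, the sketch below derives a few numbers from it. The helper names `parse_count` and `derive_test_plan` are hypothetical, the parser assumes the ingestion rate is expressed per hour, and the storage figure is raw float32 data only, before any index overhead.

```python
def parse_count(text: str) -> int:
    """Parse strings like '5M vectors' or '10k vectors/hour' into an integer count."""
    token = text.split()[0].lower()
    multipliers = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}
    if token[-1] in multipliers:
        return int(float(token[:-1]) * multipliers[token[-1]])
    return int(token)


def derive_test_plan(workload: dict) -> dict:
    """Translate a workload spec into concrete benchmark parameters."""
    vectors_12mo = parse_count(workload["data_size_12mo"])
    dims = workload["embedding_dimensions"]
    return {
        "load_vectors": vectors_12mo,  # test at projected size, not today's
        "dimensions": dims,
        "query_threads": workload["peak_query_concurrency"],
        # Assumes the rate string is per hour, as in the template above.
        "ingest_per_second": parse_count(workload["ingestion_rate"]) / 3600,
        # 4 bytes per float32 component; index structures add to this.
        "raw_float32_gb": vectors_12mo * dims * 4 / 1e9,
    }


example = {
    "data_size_12mo": "50M vectors",
    "embedding_dimensions": 1536,
    "peak_query_concurrency": 100,
    "ingestion_rate": "10k vectors/hour",
}
plan = derive_test_plan(example)
```

For the example spec, the plan calls for 100 query threads, just under 3 writes per second, and about 307 GB of raw vector data at the 12-month size.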

The metrics that actually matter

Stop optimizing for mean latency on a single query. That's the least useful number you can track.

Here's the measurement checklist:

Recall@10: retrieval quality, not just speed. A fast database returning wrong results is useless.
p95 / p99 latency: what happens at the tail? This is what your users experience during load.
Concurrent throughput: run 50–100 simultaneous queries, not sequential ones.
Write throughput + read-while-write: ingest vectors while querying. Watch recall degrade. Watch latency spike.
Index rebuild time: what happens after a crash or major schema change?
Storage footprint: 50M × 1536-dim float32 vectors come to about 307 GB of raw data before any index overhead, so storage costs are real money.
Monthly cost at projected volume: some managed services look cheap at demo scale and become existentially expensive at production scale.
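The first two items on the checklist are easy to compute by hand. The sketch below assumes you already have ground-truth neighbors from an exact (brute-force) search, and it uses the nearest-rank definition of a percentile, which is one of several common conventions.

```python
import math


def recall_at_k(retrieved_ids, ground_truth_ids, k=10):
    """Fraction of the true top-k neighbors that appear in the retrieved top-k."""
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(retrieved_ids[:k])) / len(truth)


def percentile(latencies, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Average `recall_at_k` over a few hundred queries drawn from your real data, and report `percentile(latencies, 95)` and `percentile(latencies, 99)` rather than the mean.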

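python

import concurrent.futures
import time

def benchmark_concurrent_queries(client, queries, n_concurrent=50):
    latencies = []

    def single_query(q):
        start = time.perf_counter()
        client.search(q)  # assumed method name; substitute your client's query call
        latencies.append(time.perf_counter() - start)

    with concurrent.futures.ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        list(pool.map(single_query, queries))  # consume the iterator to surface errors
    return sorted(latencies)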

Run this benchmark while a separate thread is continuously ingesting. That's your production number.
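One way to set that up is sketched below. `InMemoryClient` is a stand-in so the example runs on its own; its `upsert` and `search` methods are hypothetical names, not any particular vendor's API, and you would swap in your real client.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InMemoryClient:
    """Stand-in for a real database client (method names are assumptions)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._vectors = []

    def upsert(self, vector):
        with self._lock:
            self._vectors.append(vector)

    def search(self, query):
        with self._lock:
            return len(self._vectors)  # placeholder result, not a real search


def read_while_write(client, queries, new_vectors, n_concurrent=50):
    """Time queries while a background thread ingests; returns sorted latencies in seconds."""
    stop = threading.Event()

    def ingest():
        for vector in new_vectors:
            if stop.is_set():
                break
            client.upsert(vector)

    def timed_query(q):
        start = time.perf_counter()
        client.search(q)
        return time.perf_counter() - start

    writer = threading.Thread(target=ingest, daemon=True)
    writer.start()
    try:
        with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
            latencies = sorted(pool.map(timed_query, queries))
    finally:
        stop.set()      # stop ingesting once the query phase ends
        writer.join()
    return latencies
```

This writer ingests as fast as it can. For a realistic run, pace it to your actual ingestion rate and compare the resulting tail latencies against a run with no writer at all.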

The architectural decision tree vendors don't want you to use

Here's the part that gets buried in benchmark drama: for most teams, the choice of vector database has more to do with constraints than raw performance.

Start with pgvector if:

You already run PostgreSQL
Your dataset is under ~5M vectors
You need ACID guarantees alongside vector search
Your team doesn't want to operate another distributed system

Move to a dedicated engine (Qdrant, Weaviate, Milvus) when:

pgvector's recall or latency under load becomes measurably insufficient
You need hybrid search with strong BM25 integration
Your vector workload dominates and deserves dedicated resources

Be very careful with usage-based managed services when:

Your query volume is unpredictable or spiky
You're modeling costs for 12 months out — run the numbers concretely
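Running the numbers concretely can be as simple as the projection below. It is a sketch under stated assumptions: the two-part pricing model (per million queries plus per GB-month of storage) and every price passed in are placeholders, so substitute the figures from your vendor's actual price sheet.

```python
def monthly_cost(queries, stored_gb, price_per_million_queries, price_per_gb_month):
    """Bill for one month under a simple usage-based pricing model (assumed shape)."""
    return queries / 1_000_000 * price_per_million_queries + stored_gb * price_per_gb_month


def project_costs(start_queries, monthly_growth, months, stored_gb_start, stored_gb_end, **prices):
    """Project monthly bills with compounding query growth and linear storage growth."""
    costs = []
    for month in range(months):
        queries = start_queries * (1 + monthly_growth) ** month
        progress = month / max(1, months - 1)
        stored = stored_gb_start + (stored_gb_end - stored_gb_start) * progress
        costs.append(monthly_cost(queries, stored, **prices))
    return costs


# Placeholder prices, chosen only to make the example run.
bills = project_costs(
    start_queries=10_000_000,
    monthly_growth=0.20,
    months=12,
    stored_gb_start=30.72,   # 5M x 1536-dim float32 vectors
    stored_gb_end=307.2,     # 50M x 1536-dim float32 vectors
    price_per_million_queries=1.0,
    price_per_gb_month=0.5,
)
```

Compare the first and last entries of `bills`, then rerun with a spike month at several times the expected query volume to see how exposed the budget is.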

> Premature vector database specialization is one of the most common and expensive mistakes in RAG system architecture. Many teams that reached for Pinecone or Milvus on day one would have been better served by pgvector for their first 18 months.

The hybrid search blind spot

One thing the benchmark ecosystem dramatically under-tests: hybrid search quality under realistic conditions.

Vector-only search fails systematically on:

Exact product codes, error codes, legal identifiers
Version numbers and precise terminology
Names that don't have clean semantic neighbors

If your application involves any of these — and most production RAG applications do — then benchmarking vector latency alone is measuring the wrong thing entirely. You need to evaluate the combined BM25 + vector retrieval pipeline, including how the system merges and reranks results.
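To make the merge step concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to combine a BM25 ranking with a vector ranking. The article does not prescribe a fusion method, so treat RRF as an illustrative choice. The constant `k=60` is the conventional default, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Merge ranked ID lists; each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))[:top_n]


bm25_hits = ["err-4012", "doc-a", "doc-b"]    # exact-match ranking (made-up IDs)
vector_hits = ["doc-a", "doc-c", "err-4012"]  # semantic ranking (made-up IDs)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Here the exact identifier `err-4012` stays near the top of the fused list even though the vector ranking placed it last, which is the behavior a vector-only benchmark never exercises.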

Why this matters

The stakes here aren't abstract. A poor vector database choice means:

Retrieval quality problems that get misdiagnosed as "AI hallucination"
Latency degradation that surfaces only in production, after you've built against the API
Cost surprises that require emergency migrations at the worst possible time
Operational burden that your team wasn't staffed to handle

The good news is that the methodology for avoiding all of this is straightforward — it just requires rejecting the benchmarks that were handed to you and building tests around your actual workload.

The databases themselves are genuinely good in 2026. The evaluation culture around them still needs work.

Actionable next steps:

1. Write down your workload spec before opening any benchmark repository

2. If you're on Postgres already, deploy pgvector this week and measure it against your real data

3. For any vendor benchmark you reference, check: same hardware? concurrent writes? realistic filter selectivity?

4. Build a 30-minute read-while-write test. If a candidate database can't survive it gracefully, you've learned something critical before it's expensive to learn.
