OpenAI Kills Its Own Benchmark After 60% Data Contamination
Last week I watched something I've never seen before: a company publicly executing its own benchmark.
OpenAI announced they're ditching SWE-bench Verified – the very evaluation they co-authored in August 2024. Why? It's completely contaminated with training data leaks and broken tests that make AI coding agents look way better than they actually are.
The numbers are brutal:
- 60.83% solution leakage – answers literally hidden in GitHub issue comments
- 47.93% of "successful" patches pass only weak tests
- Top agent scores collapsed from 51.7% to 25.9% after proper filtering
- Recent leaderboard claims like 87.1% by AIR + Gemini 2.5 are likely meaningless
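The "solution leakage" number is the easiest to build intuition for. Here's a rough sketch of the kind of check the SWE-Bench+ authors describe – scanning the issue thread for verbatim lines of the gold patch. The function names, the 10-character threshold, and the toy patch are my own illustration, not the paper's actual code:

```python
def added_lines(patch: str) -> set[str]:
    """Collect substantive lines added by a unified diff."""
    return {
        line[1:].strip()
        for line in patch.splitlines()
        if line.startswith("+")
        and not line.startswith("+++")
        and len(line[1:].strip()) > 10  # skip trivial additions like "+}"
    }

def leakage_ratio(patch: str, issue_text: str) -> float:
    """Fraction of a gold patch's added lines appearing verbatim in the issue thread."""
    lines = added_lines(patch)
    if not lines:
        return 0.0
    return sum(line in issue_text for line in lines) / len(lines)

gold_patch = """\
--- a/calc.py
+++ b/calc.py
@@ -1,2 +1,2 @@
-    return a - b
+    return a + b  # use addition, not subtraction
"""
issue_thread = (
    "Maintainer: the one-line fix is "
    "`return a + b  # use addition, not subtraction`"
)
print(leakage_ratio(gold_patch, issue_thread))  # 1.0 -> the entire fix is in the thread
```

If a model trained on GitHub has seen that thread, "solving" the task requires zero reasoning.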
The Benchmark That Fooled Everyone
SWE-bench Verified was supposed to be the good version. OpenAI handpicked 500 samples from the original ~2,000 GitHub issues, with human validation showing 77.8% of tasks should take expert engineers under an hour.
But here's the kicker – it was already broken at launch.
> The SWE-Bench+ paper researchers manually reviewed top leaderboard models and found that filtering leakage and weak tests dropped resolution rates by over 50%.
Think about that. Half of what we thought was AI progress was just sophisticated memorization.
How Bad Data Ruins Everything
I've been tracking these coding benchmarks obsessively, and the pattern is always the same:
1. New benchmark launches with reasonable scores (~20-30%)
2. Scores inflate rapidly as models get trained on leaked data
3. Someone notices the benchmark is measuring memorization, not reasoning
4. Real scores turn out to be half what everyone claimed
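Step 4 is just multiplication. With made-up fractions (this is not the SWE-Bench+ paper's actual methodology or its numbers): discard resolved instances whose fix leaked into the issue thread, then discard those that only pass weak tests, and a headline score roughly halves:

```python
reported = 0.50    # headline resolution rate (hypothetical)
leak_frac = 0.35   # resolved instances whose fix appeared in the issue thread (hypothetical)
weak_frac = 0.25   # of the remainder, instances that only pass weak tests (hypothetical)

# Each filter removes its fraction of the surviving "resolved" set.
adjusted = reported * (1 - leak_frac) * (1 - weak_frac)
print(round(adjusted, 4))
```

Two independent filters in the 25-35% range are enough to cut a score in half, which is exactly the shape of the 51.7% → 25.9% collapse.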
The Claude Sonnet 4.6 and Gemini 3.1 Pro models hitting ~80% on Verified in February 2026? Those numbers are basically fiction now.
The Real Damage
This isn't just academic benchmark drama. Developers are making real decisions based on these inflated scores. Companies are buying AI coding tools thinking they'll autonomously fix 80% of GitHub issues.
The reality? More like 26%.
I've tested some of these "SOTA" agents on my own repos. They confidently generate patches that compile but break edge cases the original tests never covered. Classic plausible-but-incorrect behavior.
What's Next: SWE-bench Pro
OpenAI is pushing everyone toward SWE-bench Pro as the new standard. Smart move – they get to reset the narrative while their GPT-5.3 Codex leads the fresh leaderboard.
But I expect this cycle to repeat. The fundamental problem isn't the benchmark design – it's that any public benchmark will eventually get contaminated once it becomes important enough.
The AI companies have massive crawling operations hoovering up GitHub data. They're not going to stop just because we ask nicely.
The Uncomfortable Truth
Here's what nobody wants to admit: we might need private, rotating benchmarks that get refreshed faster than training cycles. Yeah, it's expensive. Yeah, it's annoying.
But the alternative is this endless cycle of hype and disappointment.
I respect OpenAI for killing their own benchmark instead of quietly ignoring the problems. That takes guts in a market where everyone's chasing the highest leaderboard numbers.
My Bet: SWE-bench Pro stays clean for 6-12 months max before the same contamination patterns emerge. The real winners will be whoever figures out evaluation methods that can't be gamed by training data inclusion.
