A 7-Layer Copy-Paste Trick Conquered AI's Biggest Leaderboard
The most embarrassing moment in AI benchmarking history happened on two gaming GPUs. A developer named dnhkng discovered that copy-pasting 7 specific middle layers in Alibaba's Qwen2-72B model—without modifying a single weight—could top the Hugging Face Open LLM Leaderboard across every benchmark.
No fine-tuning. No retraining. Just architectural surgery.
The results were immediate and devastating. Performance improved on MMLU, ARC, TruthfulQA, and GSM8K. The modified model claimed the #1 spot. As of early 2026, the top 4 models on that leaderboard are still descendants of this simple trick.
This isn't just a clever hack—it's a mirror reflecting our industry's deepest insecurities.
The Real Story
The Hugging Face Open LLM Leaderboard isn't some academic exercise. It's the de facto ranking system for open-source LLMs, attracting over 2 million unique visitors and serving as the primary decision-making tool for developers choosing models. When its v2 upgrade launched in June 2024, it introduced normalized scoring and shifted to harder benchmarks, rescaling each result against its random-chance baseline.
> "The v2 normalization creates fairer rankings by weighting harder benchmarks more heavily," explained Hugging Face maintainer Alina Lozovskaia.
But fairer for whom? The normalization system that was supposed to prevent gaming actually made it easier to exploit: once scores are rescaled against a chance baseline, small raw-score improvements on hard benchmarks turn into outsized normalized gains.
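The amplification effect is easy to see in a simplified model of chance-baseline normalization (the baseline value below is illustrative, not HF's exact configuration):

```python
def normalize(raw_score: float, chance_baseline: float) -> float:
    """Rescale a raw accuracy (0-100) so the random-chance baseline maps to 0.

    Scores at or below chance clamp to 0; a perfect score maps to 100.
    """
    rescaled = (raw_score - chance_baseline) / (100.0 - chance_baseline) * 100.0
    return max(0.0, rescaled)

# On a 4-way multiple-choice benchmark (25% chance baseline), a 2-point
# raw gain becomes a larger normalized gain: small edges are amplified.
before = normalize(40.0, 25.0)  # 20.0
after = normalize(42.0, 25.0)   # ~22.67
```

The compression of the usable score range between "chance" and "perfect" is exactly what makes marginal tricks pay off in the rankings.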
Here's what dnhkng discovered: Transformer models contain massive redundancy in their middle layers. Those attention-heavy sections provide outsized performance boosts when duplicated. Even single-layer duplication yielded measurable benefits.
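In framework-agnostic terms the trick is a "passthrough" self-merge: slice the layer stack, repeat a middle block, and reassemble. A minimal sketch with stand-in layer objects (real merges on Qwen2-72B are typically done with tooling such as mergekit, and the indices below are illustrative):

```python
import copy

class DecoderLayer:
    """Stand-in for one transformer decoder block."""
    def __init__(self, index: int):
        self.index = index  # in a real model, the weights would live here

def duplicate_block(layers: list, start: int, count: int) -> list:
    """Return a new stack with layers[start:start+count] repeated once.

    The block is deep-copied, so the two copies are independent objects
    with identical parameters -- no weight is ever modified.
    """
    block = [copy.deepcopy(layer) for layer in layers[start : start + count]]
    return layers[: start + count] + block + layers[start + count :]

# An 80-layer stack (Qwen2-72B's depth) with 7 middle layers repeated
# yields an 87-layer model.
stack = [DecoderLayer(i) for i in range(80)]
merged = duplicate_block(stack, start=40, count=7)
```

Because the copied block simply reprocesses its own output, the duplicated layers act like extra refinement passes rather than corrupting the computation.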
The technical implications are staggering:
- Zero training costs: Skip the million-dollar GPU clusters
- Gaming GPU accessibility: Two consumer cards can run a leaderboard champion
- 14% compute overhead: Minimal price for maximum ranking gain
- Quantization compatibility: Works within existing leaderboard filters
This democratizes top-tier performance in ways that terrify the establishment. Startups can now compete with Google and OpenAI using modified open models instead of expensive API calls.
When Benchmarks Become Meaningless
The dirty secret? Leaderboard optimization has nothing to do with real-world performance.
Critics rightfully point out that pure benchmarks ignore latency, deployment costs, and compliance requirements. Success on gaming GPUs doesn't magically scale to production environments serving millions of users.
By 2026, top models like DeepSeek V3.2 with 685 billion parameters score 85.0 on MMLU-Pro through traditional scaling. Meanwhile, a surgically modified 72B model from 2024 continues dominating through architectural tricks.
The market has responded predictably. Procurement teams now favor cheap, modifiable open models over proprietary solutions. This forces companies like Alibaba to release more "tweakable" base models, accelerating the entire open-source ecosystem.
But it also exposes fundamental flaws in how we measure AI progress.
The Uncomfortable Truth
This hack reveals that our most trusted evaluation systems are trivially gameable. If duplicating layers without changing weights can achieve leaderboard dominance, what does that say about the benchmarks themselves?
The technique works because:
1. Layer redundancy: middle layers learn overlapping functions, so repeating them behaves like extra refinement passes rather than noise
2. Benchmark blind spots: tests score final answers, not parameter efficiency or architectural honesty
3. Score amplification: normalized systems turn any marginal raw improvement into a visible ranking gain
We're witnessing "benchmark pollution" in real-time. Communities chase optimization tricks instead of genuine innovation. The result is a leaderboard arms race that prioritizes gaming over capability.
The emperor has no clothes. And apparently, he doesn't need them to top the charts.
Proprietary AI companies are already responding by moving toward private evaluations like Scale AI's SEAL leaderboards. They've learned what the open-source community is slowly discovering: public benchmarks create perverse incentives.
The real question isn't whether this hack works—it's whether we're measuring the right things at all.

