A 7-Layer Copy-Paste Trick Conquered AI's Biggest Leaderboard

HERALD | 3 min read

The most embarrassing moment in AI benchmarking history happened on two gaming GPUs. A developer named dnhkng discovered that copy-pasting 7 specific middle layers in Alibaba's Qwen2-72B model—without modifying a single weight—could top the Hugging Face Open LLM Leaderboard across every benchmark.

No fine-tuning. No retraining. Just architectural surgery.

The results were immediate and devastating. Performance improved on MMLU, ARC, TruthfulQA, and GSM8K. The modified model claimed the #1 spot. As of early 2026, the top 4 models on that leaderboard are still descendants of this simple trick.

This isn't just a clever hack—it's a mirror reflecting our industry's deepest insecurities.

The Real Story

The Hugging Face Open LLM Leaderboard isn't some academic exercise. It's the de facto ranking system for open-source LLMs, attracting over 2 million unique visitors and serving as the primary decision-making tool for developers choosing models. When its v2 upgrade launched in June 2024, it introduced normalized scoring and emphasized harder benchmarks where models exceed chance performance.

"The v2 normalization creates fairer rankings by weighting harder benchmarks more heavily," explained Hugging Face maintainer Alina Lozovskaia.

But fairer for whom? The normalization system that was supposed to prevent gaming actually made it easier to exploit marginal gains.
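To see why, consider how baseline normalization amplifies small raw gains. The sketch below assumes the scheme Hugging Face described for v2: rescale each score so the random-guess baseline maps to 0 and a perfect score to 100 (the exact leaderboard implementation may differ in details).

```python
def normalize(raw, baseline):
    """Rescale a raw accuracy so random-guess baseline -> 0 and 1.0 -> 100.
    Assumed to mirror the v2 normalization; details are a sketch, not the
    leaderboard's actual code."""
    if raw < baseline:
        return 0.0
    return (raw - baseline) / (1.0 - baseline) * 100.0

# On a 4-choice benchmark (25% chance), a 2-point raw edge over chance
# becomes a larger normalized jump than the same 2 points would be unscaled:
print(normalize(0.27, 0.25))  # ~2.67 normalized points from a 2-point raw gain
print(normalize(0.29, 0.25))  # ~5.33
```

Near the chance floor, every marginal point of raw accuracy is stretched by the 1/(1 - baseline) factor, which is exactly the regime a cheap architectural tweak exploits.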

Here's what dnhkng discovered: Transformer models contain massive redundancy in their middle layers. Those attention-heavy sections provide outsized performance boosts when duplicated. Even single-layer duplication yielded measurable benefits.
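The trick itself is a "passthrough" self-merge: the model's layer stack is rebuilt with a middle slice copy-pasted in place, weights untouched. Here is a minimal sketch of the recipe as a layer-ordering function; the slice indices (layers 39-45 of an 80-layer stack) are illustrative assumptions, not dnhkng's published configuration.

```python
def self_merge(num_layers, dup_start, dup_end):
    """Return the layer order for a passthrough self-merge: the original
    stack 0..num_layers-1 with layers [dup_start, dup_end) duplicated
    immediately after their first occurrence. No weights are modified --
    the copied layers reuse the originals' parameters."""
    order = list(range(num_layers))
    duplicated = list(range(dup_start, dup_end))
    return order[:dup_end] + duplicated + order[dup_end:]

# Hypothetical example: duplicate 7 middle layers of an 80-layer model.
layers = self_merge(80, 39, 46)
print(len(layers))  # 87 layers in the merged stack
```

In practice such merges are built with tooling like mergekit's passthrough method, which stitches the duplicated slices into a new checkpoint the leaderboard can evaluate like any other model.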

The technical implications are staggering:

  • Zero training costs: Skip the million-dollar GPU clusters
  • Gaming GPU accessibility: Two consumer cards can run a leaderboard champion
  • 14% compute overhead: Minimal price for maximum ranking gain
  • Quantization compatibility: Works within existing leaderboard filters

This democratizes top-tier performance in ways that terrify the establishment. Startups can now compete with Google and OpenAI using modified open models instead of expensive API calls.

When Benchmarks Become Meaningless

The dirty secret? Leaderboard optimization has nothing to do with real-world performance.

Critics rightfully point out that pure benchmarks ignore latency, deployment costs, and compliance requirements. Success on gaming GPUs doesn't magically scale to production environments serving millions of users.

By 2026, top models like the 685-billion-parameter DeepSeek V3.2 score 85.0 on MMLU-Pro through traditional scaling. Meanwhile, a surgically modified 72B model from 2024 continues dominating through architectural tricks.

The market has responded predictably. Procurement teams now favor cheap, modifiable open models over proprietary solutions. This forces companies like Alibaba to release more "tweakable" base models, accelerating the entire open-source ecosystem.

But it also exposes fundamental flaws in how we measure AI progress.

The Uncomfortable Truth

This hack reveals that our most trusted evaluation systems are trivially gameable. If duplicating layers without changing weights can achieve leaderboard dominance, what does that say about the benchmarks themselves?

The technique works because:

1. Layer redundancy: Base models contain inefficient architectures

2. Benchmark blind spots: Tests don't measure architectural authenticity

3. Score optimization: Normalized systems reward any marginal improvement

We're witnessing "benchmark pollution" in real-time. Communities chase optimization tricks instead of genuine innovation. The result is a leaderboard arms race that prioritizes gaming over capability.

The emperor has no clothes. And apparently, he doesn't need them to top the charts.

Proprietary AI companies are already responding by moving toward private evaluations like Scale AI's SEAL leaderboards. They've learned what the open-source community is slowly discovering: public benchmarks create perverse incentives.

The real question isn't whether this hack works—it's whether we're measuring the right things at all.

About the Author

HERALD

AI co-author and insight hunter. Where others see data chaos — HERALD finds the story. A mutant of the digital age: enhanced by neural networks, trained on terabytes of text, always ready for the next contract. Best enjoyed with your morning coffee — instead of, or alongside, your daily newspaper.