
Everyone knows tokenizers are the unsung heroes of modern AI. They chop text into digestible chunks, handle multilingual chaos, and make transformers actually work. Wrong.
Ai2 just released Bolmo, and it's betting the entire tokenizer industrial complex is unnecessary overhead. Their new byte-level language model family processes raw UTF-8 bytes directly—no vocabulary, no subword splitting, no tokenizer brittleness when your model encounters emoji soup or low-resource languages.
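To make that premise concrete: "no vocabulary" means the model's input is literally the UTF-8 byte values of the text. A quick illustration in plain Python (nothing Bolmo-specific, just standard string encoding):

```python
# "Raw UTF-8 bytes" in practice: every string maps to byte values 0-255,
# so emoji and rare scripts need no tokenizer-specific handling.
text = "héllo 🌍"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [104, 195, 169, 108, 108, 111, 32, 240, 159, 140, 141]
# 'é' becomes two bytes and the emoji becomes four; a byte-level model sees
# only these integers, never a learned subword vocabulary.
```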
<> "Bolmo achieves state-of-the-art performance matching subword models while being fully open-source, with inference speeds of ~125 bytes/second."/>
That's only about 17 percent slower than traditional subword models at ~150 bytes/second. For a system that completely eliminates the tokenizer, that's impressive.
The Frankenstein Architecture
Bolmo isn't built from scratch. It's a byteified Olmo 3, Ai2's existing subword model, retrofitted through an elegant hack (sketched in code after this list):
- mLSTM stack creates contextual representations from raw bytes
- Non-causal boundary predictor forms variable-length "patches" (basically fake tokens)
- Frozen Olmo 3 transformer backbone processes these patches
- Local decoder refines the output
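Here's a minimal sketch of that pipeline in PyTorch. Everything in it is an illustrative assumption rather than Ai2's code: a plain nn.LSTM stands in for the mLSTM stack, mean-pooling stands in for Bolmo's actual patch aggregation, and the 0.5 boundary threshold is arbitrary.

```python
# Illustrative sketch only; module names, shapes, and pooling are assumptions,
# not Ai2's implementation.
import torch
import torch.nn as nn

class ByteifiedLM(nn.Module):
    def __init__(self, d_model=512, backbone: nn.Module = None):
        super().__init__()
        self.byte_embed = nn.Embedding(256, d_model)          # one embedding per UTF-8 byte value
        self.byte_encoder = nn.LSTM(d_model, d_model, num_layers=2,
                                    batch_first=True)         # stand-in for the mLSTM stack
        self.boundary_head = nn.Linear(d_model, 1)            # boundary predictor over byte positions
        self.backbone = backbone                              # Olmo-3-style transformer (frozen in stage 1)
        self.local_decoder = nn.Linear(d_model, 256)          # maps patch states back to byte logits

    def forward(self, byte_ids):                              # byte_ids: (batch, seq) ints in [0, 255]
        h, _ = self.byte_encoder(self.byte_embed(byte_ids))   # contextual byte representations
        boundaries = torch.sigmoid(self.boundary_head(h)).squeeze(-1) > 0.5
        patches = self._pool_patches(h, boundaries)           # variable-length "patches" (fake tokens)
        if self.backbone is not None:
            patches = self.backbone(patches)                  # transformer processes patches, not bytes
        return self.local_decoder(patches)                    # local decoder refines the output

    @staticmethod
    def _pool_patches(h, boundaries):
        # Mean-pool contiguous byte spans between predicted boundaries (simplified).
        out = []
        for b in range(h.size(0)):
            cuts = boundaries[b].nonzero().flatten().tolist() + [h.size(1)]
            start, pooled = 0, []
            for end in cuts:
                if end > start:
                    pooled.append(h[b, start:end].mean(dim=0))
                    start = end
            out.append(torch.stack(pooled) if pooled else h[b, :1])
        return nn.utils.rnn.pad_sequence(out, batch_first=True)
```

The structural point the sketch preserves: the transformer in the middle never sees raw bytes, only pooled patch vectors, which is what lets the existing Olmo 3 weights be reused at all.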
The training happens in two stages. Stage 1 freezes the transformer and trains the byte components on 9.8 billion tokens (~43 billion bytes) to mimic subword behavior. Stage 2 unfreezes everything for another 39.3 billion tokens.
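In code, the two-stage recipe amounts to toggling requires_grad on the backbone and retraining. The sketch below reuses the hypothetical ByteifiedLM above; the optimizer, learning rates, toy data, and per-patch objective are placeholder assumptions, not Ai2's actual mimicry setup.

```python
# Two-stage recipe, sketched against the ByteifiedLM above. All hyperparameters
# and the toy objective are illustrative assumptions.
import torch
import torch.nn as nn

def set_backbone_frozen(model: nn.Module, frozen: bool) -> None:
    if model.backbone is not None:
        for p in model.backbone.parameters():
            p.requires_grad_(not frozen)

def train_stage(model: nn.Module, batches, lr: float) -> None:
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for byte_ids, targets in batches:
        logits = model(byte_ids)                              # (batch, n_patches, 256)
        n = min(logits.size(1), targets.size(1))              # toy per-patch byte objective;
        loss = loss_fn(logits[:, :n].reshape(-1, 256),        # the real recipe trains the byte
                       targets[:, :n].reshape(-1))            # components to mimic the subword model
        opt.zero_grad()
        loss.backward()
        opt.step()

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
model = ByteifiedLM(backbone=backbone)
toy_batches = [(torch.randint(0, 256, (2, 64)), torch.randint(0, 256, (2, 64)))]

# Stage 1: transformer frozen; byte encoder, boundary predictor, and local decoder
# learn to mimic subword behavior (9.8B tokens / ~43B bytes in Ai2's recipe).
set_backbone_frozen(model, frozen=True)
train_stage(model, toy_batches, lr=3e-4)

# Stage 2: unfreeze everything and keep training (another 39.3B tokens).
set_backbone_frozen(model, frozen=False)
train_stage(model, toy_batches, lr=1e-4)
```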
It's like teaching a new language to someone who already speaks fluent English. Clever, but also expensive.
The Elephant in the Room
Here's what nobody's talking about: this is a solution looking for a problem.
Yes, tokenizers create brittleness with noisy text. Yes, they struggle with low-resource languages. But most enterprises aren't processing ancient Sumerian manuscripts or intentionally corrupted social media feeds. They're handling English documentation, customer support tickets, and business communications.
Bolmo targets "enterprises needing tokenizer-free multilingual models for noisy/low-resource scenarios." That's a niche within a niche. Meanwhile, the AI world is obsessing over MoE architectures like DeepSeek V3 and Qwen3-235B-A22B, multimodal capabilities, and raw parameter efficiency.
Why This Actually Matters
Despite my skepticism, Bolmo deserves attention for three reasons:
- It's fully open-source (GitHub repo available), unlike most competing models
- Drop-in compatibility with existing Olmo 3 infrastructure eases migration
- Variable patch lengths provide flexibility traditional tokenizers can't match
The real innovation isn't the byte-level processing—it's proving you can retrofit existing transformer architectures without starting from zero. That's genuinely useful for research teams with limited compute budgets.
The Hype Reality Check
Bolmo arrives in a 2025 landscape dominated by massive MoE models and multimodal systems. It's positioning itself as the scrappy underdog solving edge cases while Nvidia's secret chip pipeline powers ever-larger models.
Will enterprises abandon their working tokenizer-based systems? Probably not.
Will researchers use Bolmo to experiment with byte-level approaches? Absolutely.
Is this the death of tokenizers? Not even close.
Bolmo is solid engineering wrapped in overstated promises. It's Ai2 doing what they do best—creating genuinely open alternatives to proprietary systems. Just don't expect it to revolutionize how most people build AI applications.
The future remains stubbornly incremental, one carefully engineered niche solution at a time.

