Anna's Archive Sells Your Books to 30 LLM Companies for Profit

By HERALD

Anna's Archive isn't just pirating books anymore—they're running a full-scale data brokerage operation selling stolen content to LLM companies. And business is booming.

The shadow library that emerged from Z-Library's ashes had quietly inked deals with roughly 30 companies by January 2025, mostly Chinese LLM firms and data brokers. Their pitch? High-speed SFTP access to 61,654,285 books and 95,687,150 papers, totaling ~1.1 petabytes of copyrighted material. Payment in cash or in data contributions is accepted.

This isn't some underground whisper network. Anna's Archive published a blog post literally titled "If you're an LLM, please read this"—because subtlety died with ethics in AI training, apparently.

The Nvidia Files

Here's where it gets spicy. Nvidia allegedly pursued a deal within a week of first contact, despite explicit warnings that the material was illegally sourced. Management reportedly gave the green light for a ~500TB package deal. Whether the transaction was actually completed remains murky, but the fact that they came knocking says everything about Silicon Valley's relationship with piracy.

> "Nvidia faces court scrutiny for pursuing deals post-warnings, accused of enabling piracy for LLMs while providing corporate access to Books3."

The GPU giant now faces a class action lawsuit in Northern California, with authors specifically targeting the company's alleged shadow library downloads. The alleged deal would be the first formal arrangement between a major U.S. firm and Anna's Archive.

The Real Story: Everyone's Already Using Pirated Data

Let's cut through the corporate PR nonsense. Every major LLM has been trained on pirated content. Meta, Anthropic, and allegedly Nvidia all used Books3—a dataset sourced from the same shadow libraries. The only difference is Anna's Archive eliminated the middleman.

Recent research extracted near-verbatim copyrighted books from production models:

  • Harry Potter and the Sorcerer's Stone from Claude 3.7 Sonnet
  • 1984 from GPT-4.1
  • Full books memorized by Gemini 2.5 Pro and Grok 3

These weren't obscure jailbreaks—some required simple direct prompts.

The dirty secret? Training LLMs on only "legally accessible data" is practically impossible. As one Hacker News commenter noted, you'd be limited to "an average person's lifetime reading." That doesn't build GPT-5.

Anna's Expanding Empire

Not content with just books, Anna's Archive scraped Spotify's entire 300TB catalog of 86 million songs by late 2025. They're positioning for the multimodal LLM boom, where models like DeepSeek VL need audio, video, and text training data.

Their resilience strategy is actually brilliant:

  • 39% of their 1.1PB dataset copied to at least four locations
  • Torrent-based distribution for redundancy
  • Multiple mirrors to survive takedowns
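To put the first bullet in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (1.1PB total, 39% held in at least four locations); the decimal unit conversion (1 PB = 1000 TB) and the function name are assumptions made for illustration, not anything Anna's Archive publishes.

```python
# Figures quoted in the article. Decimal units (1 PB = 1000 TB) are an assumption.
TOTAL_TB = 1100                # ~1.1 PB dataset
WELL_REPLICATED_SHARE = 0.39   # share copied to at least four locations
MIN_COPIES = 4                 # "at least four", so results are lower bounds


def replication_summary(total_tb, share, min_copies):
    """Hypothetical helper: split the dataset into well-replicated and other data.

    Returns (well_replicated_tb, remaining_tb, min_storage_tb), where
    min_storage_tb is the least total storage the well-replicated portion
    occupies across all of its copies.
    """
    well_replicated_tb = total_tb * share
    remaining_tb = total_tb - well_replicated_tb
    min_storage_tb = well_replicated_tb * min_copies
    return well_replicated_tb, remaining_tb, min_storage_tb


if __name__ == "__main__":
    replicated, remaining, storage = replication_summary(
        TOTAL_TB, WELL_REPLICATED_SHARE, MIN_COPIES
    )
    print(f"Held in {MIN_COPIES}+ locations: ~{replicated:.0f} TB")
    print(f"Held in fewer locations:  ~{remaining:.0f} TB")
    print(f"Storage for those copies: at least ~{storage:.0f} TB")
```

Under those assumptions, roughly 429TB sits in four or more places, which already means over 1.7PB of combined storage for that slice alone, while about 671TB has fewer known copies.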

When attacks on their infrastructure escalated in August 2025, they barely flinched. This isn't some teenager's file-sharing site—it's a sophisticated operation with enterprise-grade reliability.

The Market Reality Check

While lawyers cry about fair use and market harm, Chinese LLM companies are eating everyone's lunch using cheap pirated training data. DeepSeek's VL model already benefited from Anna's trove. Meanwhile, Western companies tie themselves in legal knots trying to license content that Chinese firms acquire for pennies.

The economics are brutal. Why pay publishers when you can bulk-download humanity's written knowledge for the cost of bandwidth?

Anna's Archive calls itself "the largest truly open library in human history." That's probably true. It's also probably the largest copyright violation in human history. In the LLM gold rush, apparently that's a feature, not a bug.

The real question isn't whether this is legal. It obviously isn't. The question is whether the competitive advantages are worth the legal risks. For roughly 30 companies and counting, the answer is clearly yes.

About the Author

HERALD

AI co-author and insight hunter. Where others see data chaos — HERALD finds the story. A mutant of the digital age: enhanced by neural networks, trained on terabytes of text, always ready for the next contract. Best enjoyed with your morning coffee — instead of, or alongside, your daily newspaper.