
I was scrolling through Hacker News when I saw it: 152 points and 100+ comments on a post about fighting AI scrapers. Not your typical Wednesday drama, but here's the thing - this hits close to home for anyone running their own Git infrastructure.
VulpineCitrus (amazing name, btw) dropped this bombshell on December 2nd about their self-hosted Git forge getting absolutely hammered by AI scrapers. We're talking millions of commits being downloaded by bots from companies like Anthropic and OpenAI. The bandwidth bills alone would make you weep.
The Bot Apocalypse Is Real
This isn't some isolated incident. Since 2023, aggressive scraper activity has exploded across the web. GitHub tightened its rate limits because the crawler load was crushing its infrastructure. And these bots completely ignore robots.txt and just feast on public repositories.
What really gets me fired up is the sheer audacity of it all:
- Scrapers hit servers with DDoS-like request volumes
- They degrade site performance and drive away legitimate traffic
- Zero compensation to infrastructure owners
- Complete disregard for bandwidth costs
<> "The fight with the bots is on" - as sodimel put it in the HN comments, and they're not wrong./>
The Defense Playbook
The Hacker News crowd came out swinging with solutions, and honestly, their advice is solid:
The Nuclear Option:
- mappu recommended going full WireGuard VPN for single-user setups
- ThatPlayer and wrxd pushed Tailscale for small teams
- Completely eliminate public exposure = zero scraper risk (a minimal sketch of the idea follows below)
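To make the "zero public exposure" point concrete, here's a minimal sketch of the principle: bind the forge's listener to the VPN tunnel address instead of 0.0.0.0, so nothing on the public internet can even open a connection. The 10.8.0.1 address and port 3000 are hypothetical values for a WireGuard-style setup; a real forge would expose the same thing as a bind-address option in its own config.

```python
# Minimal sketch (Python stdlib) of the "no public exposure" idea:
# the service listens only on the VPN tunnel address, never on 0.0.0.0.
# 10.8.0.1 and port 3000 are placeholder values for a WireGuard/Tailscale setup;
# in practice you'd set the equivalent bind-address option in your forge's config.
from http.server import HTTPServer, SimpleHTTPRequestHandler

VPN_TUNNEL_ADDR = "10.8.0.1"  # assumption: your wg0 / tailnet interface address

server = HTTPServer((VPN_TUNNEL_ADDR, 3000), SimpleHTTPRequestHandler)
print(f"Listening on {VPN_TUNNEL_ADDR}:3000 (VPN clients only)")
server.serve_forever()  # scrapers on the public internet never see this socket
```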
The Surgical Strike:
- FabCH and komali2 advocate for geoblocking
- Block regions where most scrapers originate
- Works great unless you need global collaboration (see the sketch of the country check below)
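If you go the geoblocking route, the core check is tiny. Here's a rough sketch using MaxMind's GeoLite2 country database via the geoip2 Python library; the database path and the blocked-country codes are illustrative assumptions, not a recommendation of which regions to block.

```python
# Rough geoblocking sketch using MaxMind's GeoLite2 country database.
# The .mmdb path and BLOCKED set are illustrative assumptions; tune both to
# your own traffic, and remember the global-collaboration caveat above.
import geoip2.database
import geoip2.errors

BLOCKED = {"XX", "YY"}  # placeholder ISO country codes -- pick your own

reader = geoip2.database.Reader("/var/lib/geoip/GeoLite2-Country.mmdb")

def is_blocked(client_ip: str) -> bool:
    """Return True if the client's country is on the block list."""
    try:
        country = reader.country(client_ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return False  # unknown origin: let it through (or flip this default)
    return country in BLOCKED

# Example: reject early in your middleware or reverse-proxy hook
if is_blocked("203.0.113.7"):
    print("403: blocked region")
```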
The Fortification:
- Login walls and Anubis for public-facing forges
- Rate limiting (though specifics weren't detailed)
- Ongoing bot signature detection (one rough sketch of rate limiting plus signature checks follows below)
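Since the thread left the specifics vague, here's one rough sketch of what "rate limiting plus bot signature detection" can look like: a per-IP token bucket combined with a check against a few self-declared AI crawler user-agent strings. The bucket parameters and the signature list are assumptions for illustration; well-behaved bots identify themselves, badly behaved ones won't, so treat this as a mitigation rather than a fix.

```python
# Rough sketch: per-IP token bucket + user-agent signature check.
# RATE/BURST and BOT_SIGNATURES are illustrative assumptions, not a vetted list.
import time
from collections import defaultdict

RATE = 1.0        # tokens refilled per second, per client IP
BURST = 20.0      # max tokens a client can bank
BOT_SIGNATURES = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")  # self-declared crawlers

_buckets: dict[str, tuple[float, float]] = defaultdict(lambda: (BURST, time.monotonic()))

def allow_request(ip: str, user_agent: str) -> bool:
    """Return False for known crawler UAs or clients that blew their request budget."""
    if any(sig.lower() in user_agent.lower() for sig in BOT_SIGNATURES):
        return False
    tokens, last = _buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last request
    if tokens < 1.0:
        _buckets[ip] = (tokens, now)
        return False
    _buckets[ip] = (tokens - 1.0, now)
    return True

# Example: a scraper hammering one endpoint gets cut off once the burst runs out
print(allow_request("198.51.100.9", "git/2.43.0"))     # True (until the budget is gone)
print(allow_request("198.51.100.9", "ClaudeBot/1.0"))  # False (signature match)
```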
Why This Matters More Than You Think
Here's what's actually happening: we're watching the death of open web infrastructure in real time. Small developers and indie teams are getting priced out by AI companies' data hunger.
The market is responding predictably:
- Tailscale raised $115M in 2024 - coincidence? I think not
- Demand for GitHub Enterprise is spiking
- Self-hosting newsletters like Selfh.st are amplifying awareness
- "Data sovereignty" tools are having their moment
But here's the frustrating part: hashar and dspillett pointed out that these scrapers don't even need Git-specific data. Regular web scraping would suffice for most AI training purposes. They're just being lazy and inefficient.
The Real Winner Here
The consensus from those 100+ HN comments is crystal clear: VPNs are the best first step. Fighting bots directly is like playing whack-a-mole with a hammer made of bandwidth bills.
VulpineCitrus's struggle represents a broader shift toward private infrastructure. We're moving away from the open, collaborative web toward locked-down, zero-trust networking. That's... actually kind of sad when you think about it.
The indie dev community is basically subsidizing AI training data with their hosting costs. That's backwards and unsustainable.
My Bet: Within 18 months, we'll see a new category of "anti-scraper SaaS" tools emerge specifically for Git forges and self-hosted infrastructure. The market opportunity is too obvious, and the pain is too real. VPN adoption will accelerate dramatically, and public Git forges will become the exception rather than the rule. The open web is getting paywalled, one scraped repository at a time.

