2025-07-19 The current setup defending my sites
===============================================

I wrote this post (and posted it on Emacs Wiki) because people have been wondering about it on Reddit. I'm no longer on Reddit. If you see other people on the net wondering whether one of my sites is down, feel free to repost this message or parts of it. Sadly, you won't be able to link to it, because the people wondering are probably banned by the firewall.

Why am I having visitors banned by the firewall? The web has been under attack by AI scrapers since around 2022. That's when big companies decided they needed to train AI, and one of the sources of training material was the web. (Another source was a huge collection of pirated books, but that's a different story.) And if your task is to scrape as much of the web as possible, you can't be picky. The result is devastating. Let me quote Drew DeVault:

> If you think these crawlers respect robots.txt then you are several
> assumptions of good faith removed from reality. These bots crawl
> everything they can find, robots.txt be damned, including expensive
> endpoints like git blame, every page of every git log, and every
> commit in every repo, and they do so using random User-Agents that
> overlap with end-users and come from tens of thousands of IP
> addresses – mostly residential, in unrelated subnets, each one
> making no more than one HTTP request over any time period we tried
> to measure – actively and maliciously adapting and blending in with
> end-user traffic and avoiding attempts to characterize their
> behavior or block their traffic. -- Please stop externalizing your
> costs directly into my face, by Drew DeVault, for SourceHut

So people have been scrambling to defend their sites against the AI scraper stampede. There are no good tools.

One of the first measures was to block self-identified scrapers and bots. Any user agent containing the words "bot", "crawler", "spider", "ggpht" or "gpt" is automatically redirected to a "No Bots" page with an HTTP status of 410, which means the resource is gone and the user agent should remove it from its database. And then I have another list of user agents that keep hitting the site: bots to help search engine optimisers (SEO), bots to "audit" the site, bots to check uptime, get page previews, and on and on. Whenever I checked the top hitters on my sites, I'd find another user agent or two to add to the list.

But as you saw in Drew DeVault's blog post, AI scrapers have been working around this by faking regular user agents, making them indistinguishable from humans. The solution, therefore, is not to listen to what they say but to watch what they do.

One tool I stumbled upon pretty early was fail2ban. The traditional way of using it is to have it check a log file such as the sshd log for failed login attempts. If an IP address causes too many failed login attempts, it gets banned for 10 minutes. A nice trick is that you can also have it check its own log files, and if an IP address gets banned multiple times, it gets banned for 1 week. I started applying this to the web server log files. I figured a human clicking a bunch of links might show a burst of activity, so I defined a rate limit of 30 hits in 60 seconds. That is: the average rate must not exceed one hit every 2 seconds, but activity bursts of up to 30 hits are OK. I also exclude a lot of URLs matching images and other resources.
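For illustration, here is the rate-limit idea sketched in Python rather than as an actual fail2ban filter: flag any IP address that makes more than 30 non-resource hits within any 60 second window. The simplified log line format, the list of excluded file extensions and the way offenders are reported are assumptions made for this sketch; in the real setup, fail2ban does this work by watching the access log.

```python
# Sketch only: flag IP addresses exceeding 30 hits in any 60 second window,
# ignoring images and other static resources. Not the actual fail2ban filter.
import re
import sys
from collections import defaultdict, deque

WINDOW = 60        # seconds
MAX_HITS = 30      # bursts of up to 30 hits are OK
EXCLUDE = re.compile(r"\.(png|jpe?g|gif|svg|ico|css|js|woff2?)(\?|$)", re.I)

def parse(line):
    """Assumed, simplified format: 'ip epoch-seconds "METHOD path ..."'."""
    m = re.match(r'(\S+) (\d+) "(?:GET|POST|HEAD) (\S+)', line)
    return (m.group(1), int(m.group(2)), m.group(3)) if m else None

def offenders(lines):
    hits = defaultdict(deque)              # ip -> timestamps inside the window
    flagged = set()
    for line in lines:
        parsed = parse(line)
        if not parsed:
            continue
        ip, t, path = parsed
        if EXCLUDE.search(path):
            continue                       # images and other resources don't count
        q = hits[ip]
        q.append(t)
        while q and q[0] <= t - WINDOW:
            q.popleft()                    # forget hits older than 60 seconds
        if len(q) > MAX_HITS:
            flagged.add(ip)                # fail2ban would now ban for 10 minutes
    return flagged

if __name__ == "__main__":
    for ip in sorted(offenders(sys.stdin)):
        print(ip)
```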
The main limitation is that this rule is limited to single IP addresses. And as you saw in Drew DeVault's blog post, AI scrapers have been working around this by using services that distribute requests over whole networks. The solution, therefore, is to defend against entire organisations.

Multiple times per hour, I have jobs scheduled that go through the last two hours of the web server access log, extracting all the IP addresses and determining their autonomous system number (ASN). That number identifies whole internet service providers (ISP) or similar companies.

I know, using autonomous systems makes this a very broad ban hammer. It catches innocent people who use an ISP that hires out computing power and bandwidth to AI scrapers. But I don't know any other way to fight back against bots "using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses". So this is what it is. On the positive side, the bans are temporary. They expire after a while. If the AI scrapers are done ingesting the world-wide web, the ban is over. If they're still at it, the ban is reinstated.

The first job bans "active" autonomous systems:

* If load exceeds 10, the number of hits in a 2 hour period may not exceed 300 per ASN.
* If load exceeds 5, the number of hits in a 2 hour period may not exceed 400 per ASN.
* Under regular load, the number of hits in a 2 hour period may not exceed 500 per ASN.

This includes everything showing up in the web server access log, including hits for embedded things such as CSS files and images.

The second job bans autonomous systems hitting expensive end-points:

* If load exceeds 10, the number of expensive hits in a 2 hour period may not exceed 10 per ASN.
* If load exceeds 5, the number of expensive hits in a 2 hour period may not exceed 20 per ASN.
* Under regular load, the number of expensive hits in a 2 hour period may not exceed 30 per ASN.

Expensive end-points are filtered RSS feeds, Recent Changes, and full-text searches.

The third job bans autonomous systems hosting bots:

* If load exceeds 10, the number of bot hits in a 2 hour period may not exceed 10 per ASN.
* If load exceeds 5, the number of bot hits in a 2 hour period may not exceed 20 per ASN.
* Under regular load, the number of bot hits in a 2 hour period may not exceed 30 per ASN.

A bot hit is counted when the web server returned an HTTP status 410 as mentioned above. In other words, these are all the user agents containing the words "bot", "crawler", "spider", "ggpht" or "gpt".

The bans from the three jobs mentioned just now last for 1 hour. If such a ban was made more than 5 times in a day, the ban is extended to 1 week. Banning an ASN means that all the networks it manages are banned. (A sketch of what such a job might look like follows below.)

If the system works, the AI scraper stampede starts, load starts to climb up to 10, everything slows down to a crawl, the number of threads goes up from 350 to 450, the number of TCP connections goes up from 150 to 550, the number of wiki processes goes up from 1 or 2 to 20, and after a few minutes my jobs kick in and start banning networks left and right until things have calmed down.

I'm still learning. The programmers working on AI scrapers are still learning. The arms race isn't over until their funding dries up. Until we all decide that the costs of AI aren't worth it. So this post is just a snapshot. I'll continue tweaking the setup.

I'm sorry if this ban hammer is hitting you. It's still better than taking my sites offline.
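To make the job descriptions above a little more concrete, here is a sketch of what the first job could look like, in Python. The log parsing, the asn_for() lookup and the ban() call are placeholders I made up for illustration; they stand in for whatever ASN database and firewall front-end the real jobs use.

```python
# Sketch of the first banning job: count hits per autonomous system over the
# last two hours of the access log and ban every ASN that exceeds a
# load-dependent threshold. asn_for() and ban() are placeholders.
import os
import re
import sys
from collections import Counter

ASN_CACHE = {}  # ip -> ASN; a real setup would fill this from an ASN database

def threshold():
    """Pick the per-ASN hit limit based on the current load average."""
    load = os.getloadavg()[0]
    if load > 10:
        return 300
    if load > 5:
        return 400
    return 500

def asn_for(ip):
    """Placeholder lookup; returns None when the ASN is unknown."""
    return ASN_CACHE.get(ip)

def ban(asn):
    """Placeholder: have the firewall ban all networks of this ASN for 1 hour."""
    print(f"would ban AS{asn}")

def main(log_lines):
    hits = Counter()
    for line in log_lines:
        m = re.match(r"(\S+) ", line)      # assumes the IP address comes first
        if not m:
            continue
        asn = asn_for(m.group(1))
        if asn is not None:
            hits[asn] += 1
    limit = threshold()
    for asn, count in hits.items():
        if count > limit:
            ban(asn)                       # repeat offenders would get a 1 week ban

if __name__ == "__main__":
    main(sys.stdin)                        # feed it the last two hours of the access log
```

The other two jobs would look much the same, except that they only count hits on expensive end-points, or hits answered with status 410, and use the lower thresholds listed above.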
Taking the sites offline is something I've had to do in the past because I did not know what else to do.

If the ban hits you, the easy solution is to switch networks. You might still be able to access the site from a mobile phone using mobile data, for example. (Using a phone on your usual wifi network won't work, since that traffic still comes from the same banned network.) A harder solution is to use a VPN or to switch ISP. An alternative for those of you with a static IP address within a network that is often banned is to contact me and I can add your specific IP address to an allow-list. Use the Your IP address page if you don't know your IP number. In that case, however, I suspect that it is not static.

I can't wait for the next AI winter. 🥶

#Butlerian_Jihad

2025-07-24. Sadly, the new setup is starting to hurt a lot of innocent people. I've been getting a handful of emails, but there's also talk on Reddit:

* Emacs Wiki
* Campaign Wiki

😥

2025-08-12. Putting the people I follow on fedi on an allow-list: 2025-08-03 GoToSocial and the Butlerian Jihad.

2025-09-14. For the technical aspects, see the Butlerian Jihad pages.