2025-03-21 A summary of my bot defence systems
==============================================

If you've followed my Butlerian Jihad pages, you know that I'm constantly fiddling with the setup. Each page got written in the middle of an attack, as I was trying to save my sites, documenting as I went along. But if you're looking for an overview, there is nothing to see. It's all over the place. Since the topic has gained some traction in recent days, I'm going to assemble all the things I do on this page.

Here's Drew DeVault complaining about the problem that system administrators have been facing for a while now:

> If you think these crawlers respect robots.txt then you are several
> assumptions of good faith removed from reality. These bots crawl
> everything they can find, robots.txt be damned, including expensive
> endpoints like git blame, every page of every git log, and every
> commit in every repo, and they do so using random User-Agents that
> overlap with end-users and come from tens of thousands of IP
> addresses – mostly residential, in unrelated subnets, each one
> making no more than one HTTP request over any time period we tried
> to measure – actively and maliciously adapting and blending in with
> end-user traffic and avoiding attempts to characterize their
> behavior or block their traffic.

-- Please stop externalizing your costs directly into my face, by Drew DeVault, for SourceHut

I had read some similar reports before, on fedi, but this one links to quite a few of them: FOSS infrastructure is under attack by AI companies, by Niccolò Venerandi, for LibreNews.

I'm going to skip the defences against spam, as spam hasn't been a problem in recent months, surprisingly.

The short summary:

* my robots.txt files tell the well-behaved bots to stay away;
* my web server config blocks a lot of misbehaving, self-identifying bots;
* fail2ban blocks individual IP numbers that are too active;
* fail2ban also blocks individual IP numbers that fetch too many URLs matching particular patterns that I know humans rarely request;
* when manual intervention is required, I run a script that pulls network data out of the access logfiles and then I ban whole networks at the firewall.

The first defence against bots is robots.txt. All well-behaving bots should read it every now and then, and then either stop crawling the site or slow down. Let's look at the file for Emacs Wiki.

If I find that there are a lot of requests from a particular user agent that looks like a bot, and it has a URL where I can find instructions for how to address it in robots.txt, this is what I do: I tell them to stop crawling the entire site. Most of these are search engine optimizers, brand awareness monitors and other such creeps.

The file also tells all well-behaving crawlers to slow down to a glacial tempo, and it lists all the expensive endpoints that they should not be crawling at all. Conversely, this means that any bot that still crawls those URLs is a misbehaving bot and deserves to be blocked.

Worth noting, perhaps, that "an expensive endpoint" means a URL that runs some executable to do something complicated, resulting in an answer that's always different. If the URL causes the web server to run a CGI script, for example, the request loads Perl, loads a script, loads all its libraries, compiles it all, runs it once, and answers the request with the output. And since the answer is dynamic, it can't very well be cached, or additional complexity needs to be introduced and even more resources need to be allocated and paid for. In short, an expensive endpoint is like loading an app: slow but useful, if done rarely. You'd do this for a human, for example. It's a disaster if bots swarm all over the site, clicking on every link.
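To make this concrete, here's a minimal sketch of what such a robots.txt looks like. The user agents and paths are just examples, not a copy of the actual Emacs Wiki file:

    # SEO bots, brand monitors and similar creeps: go away entirely
    User-agent: SemrushBot
    Disallow: /

    User-agent: AhrefsBot
    Disallow: /

    # everybody else: slow down and stay away from the expensive endpoints
    User-agent: *
    Crawl-delay: 20
    Disallow: /wiki?action=
    Disallow: /wiki/download/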
It's also worth noting that not all my sites have the same expensive endpoints, so the second half of robots.txt can vary, which makes maintenance of the first half a chore. I have a little script that allows me to add one bot to "all" the files, but it's annoying to have to do that. And recently I just copied a list from an AI / LLM User-Agents: Blocking Guide.

I use Apache as my web server and I have a bunch of global configuration files to handle misbehaving bots and crawlers. This first example blocks fediverse user agents from accessing the expensive endpoints on my sites. That's because whenever anybody posts a URL to one of my sites, all the servers whose users get a copy of that post fetch a preview within the next 60 seconds. That means hundreds of hits, which is particularly obnoxious for expensive endpoints. The response tells them that they are forbidden from accessing the page.

    # Fediverse instances asking for previews: protect the expensive endpoints
    RewriteCond %{REQUEST_URI} /(wiki|download|food|paper|hug|helmut|input|korero|check|radicale|say|mojo|software)
    RewriteCond %{HTTP_USER_AGENT} Mastodon|Friendica|Pleroma [nocase]
    # then it's forbidden
    RewriteRule ^(.*)$ - [forbidden,last]

Next are the evil bots that self-identify as bots but don't seem to heed the robots.txt files. These are all told that whatever page they were looking for, it's now gone (410). And if there's a human looking at the output, it even links to an explanation. Adding new user agents to this list is annoying because I need to connect as root and restart the web server after making any changes.

    # SEO bots, borked feed services and other shit
    RewriteCond "%{HTTP_USER_AGENT}" "academicbotrtu|ahrefsbot|amazonbot|awariobot|bitsightbot|blexbot|bytespider|dataforseobot|discordbot|domainstatsbot|dotbot|elisabot|eyemonit|facebot|linkfluence|magpie-crawler|megaindex|mediatoolkitbot|mj12bot|newslitbot|paperlibot|pcore|petalbot|pinterestbot|seekportbot|semanticscholarbot|semrushbot|semanticbot|seokicks-robot|siteauditbot|startmebot|summalybot|synapse|trendictionbot|twitterbot|wiederfrei|yandexbot|zoominfobot|velenpublicwebcrawler|gpt|\bads|feedburner|brandwatch|openai|facebookexternalhit|yisou|docspider" [nocase]
    RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]

For some of my sites, I disallow all user agents containing the words "bot", "crawler", "spider", "ggpht" or "gpt", with the exception of "archivebot" and "wibybot", because I want to give those two access. Again, these bots are all told that whatever page they were looking for, it's now gone (410).

    # Private sites block all bots and crawlers. This list does not include
    # social.alexschroeder.ch, communitywiki.org, www.emacswiki.org,
    # oddmuse.org, orientalisch.info, korero.org.
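    # The three conditions below are ANDed: the request must be for one of
    # the private hosts, the user agent must not match one of the allowed
    # bots (archivebot, wibybot, or anything starting with "gwene"), and it
    # must contain one of the generic bot keywords.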
RewriteCond "%{HTTP_HOST}" "^((src\.)?alexschroeder\.ch|flying-carpet\.ch|next\.oddmuse\.org|((chat|talk)\.)?campaignwiki\.org|((archive|vault|toki|xn--vxagggm5c)\.)?transjovian\.org)$" [nocase] RewriteCond "%{HTTP_USER_AGENT}" "!archivebot|^gwene|wibybot" [nocase] RewriteCond "%{HTTP_USER_AGENT}" "bot|crawler|spider|ggpht|gpt" [nocase] RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last] I also eliminate a lot of bots looking for PHP endpoints. I can do this because I know that I don't have any PHP application installed. # Deny all idiots that are looking for borked PHP applications RewriteRule \.php$ https://alexschroeder.ch/nobots [redirect=410,last] There's also one particular image scraper that's using a unique string in its user agent. # Deny the image scraper # https://imho.alex-kunz.com/2024/02/25/block-this-shit/ RewriteCond "%{HTTP_USER_AGENT}" "Firefox/72.0" [nocase] RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last] Next, all requests get logged by Apache in the access.log file. I use fail2ban to check this logfile. This is somewhat interesting because fail2ban is usually used to check for failed ssh login attempts. Those IP numbers that fail to login in a few times are banned. What I'm doing is I wrote a filter that treats every hit on the web server as a "failed login attempt". This is the filter: [Definition] # Most sites in the logfile count! What doesn't count is fedi.alexschroeder.ch, or chat.campaignwiki.org. failregex = ^(www\.)?(alexschroeder\.ch|campaignwiki\.org|communitywiki\.org|emacswiki\.org|flying-carpet\.ch|korero\.org|oddmuse\.org|orientalisch\.info):[0-9]+ # Except css files, images... ignoreregex = ^[^"]*"(GET /(robots\.txt |favicon\.ico |[^/ \"]+\.(css|js) |[^\"]*\.(jpg|JPG|png|PNG) |css/|fonts/|pdfs/|txt/|pics/|export/|podcast/|1pdc/|static/|munin/|osr/|indie/|rpg/|face/|traveller/|hex-describe/|text-mapper/|contrib/pics/|roll/|alrik/|wiki/download/)|(OPTIONS|PROPFIND|REPORT) /radicale) And this is the jail, saying that any IP number may make 30 hits in 60 seconds. If an IP number exceeds this (2s per page!) then it gets blocked at the firewall for 10 minutes. [alex-apache] enabled = true port = http,https logpath = %(apache_access_log)s findtime = 60 maxretry = 30 I also have another filter for a particular substring in URLs that I found the bots are requesting all the time: [Definition] failregex = ^(www\.emacswiki\.org|communitywiki\.org|campaignwiki\.org):[0-9]+ .*rcidonly= The corresponding jail says that when you trigger request such a URL for the third time in an hour, you're blocked at the firewall for 10 minutes. [alex-bots] enabled = true port = http,https logpath = %(apache_access_log)s findtime = 3600 maxretry = 2 (At the same time, these URL's redirect to a warning so that humans know that this is a trap.) Furthermore, fail2ban also comes with a recidive filter that watches its own logs. If an IP has been banned five times in a day, it gets banned for a week. [recidive] enabled = true To add to the alex-bots jail, here's what my Apache configuration says: RSS feeds for single pages are errors. RewriteCond %{QUERY_STRING} action=rss RewriteCond %{QUERY_STRING} rcidonly=.* RewriteRule .* /error.rss [last] Note that all my sites also use the following headers, so anybody ignoring these is also a prime candidate for blocking. 
    # https://github.com/rom1504/img2dataset#opt-out-directives
    Header always set X-Robots-Tag: noai
    Header always set X-Robots-Tag: noimageai

All of the above still doesn't handle extremely distributed attacks. In such situations, almost all IP numbers are unique. What I try to do in this situation is block the entire IP ranges that they come from. I scan the access.log for IP numbers that connected to a URL that shouldn't be used by bots because of robots.txt: URLs containing rcidonly, because I know humans will very rarely click them and they are expensive to serve. For each such IP number, I determine the IP range it comes from, and then I block it all. Basically, this is what I keep repeating:

    # prefix with a timestamp
    date
    # log some candidates without whois information, skipping my fedi instance
    tail -n 2000 /var/log/apache2/access.log \
      | grep -v ^social \
      | grep "rcidonly" \
      | bin/admin/network-lookup-lean > result.log
    # count
    grep ipset result.log | wc -l
    # add
    grep ipset result.log | sh
    # document
    grep ipset result.log >> bin/admin/ban-cidr

You can find the scripts in my admin collection.

#Administration #Butlerian_Jihad

2025-03-22. The drawback of using the firewall to ban broad swaths of the Internet is that these networks host bots (bad) but also networked services that I'm interested in (good). Yesterday I found that @algernon@come-from.mad-scientist.club had gone silent, and had been silent for quite a while, yet I kept seeing replies to them by others. Something was off. We got into contact via an alt account and indeed, I had blocked the IPv4 range his server was on. By my count I have already had to unblock three networks on my list. It's not a great solution, to be honest. And it doesn't expire, either. The list still contains 47021 IP ranges.

2025-06-12. Something to look into is asncounter, which I found via Traffic meter per ASN without logs. Apparently it doesn't use UDP and DNS to find the ASN but downloads the whole data set before doing local lookups. Nice! Ideally, the network blocks would expire after a while and spring back up automatically, fail2ban style. Perhaps something like that is still possible.

2025-06-18. And here we are: 2025-06-16 Ban autonomous systems. Getting closer to banning the entire corporate Internet, I fear.
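As for the wish above that network blocks should expire on their own: ipset itself can expire entries if the set is created with a timeout. This is just a sketch of the idea, not what my scripts currently generate; the set name and the timeout are placeholders:

    # a set of networks whose entries expire after a week (604800 seconds)
    ipset create banned-networks hash:net timeout 604800
    # drop all traffic coming from members of the set
    iptables -I INPUT -m set --match-set banned-networks src -j DROP
    # ban a network; the entry disappears again once the timeout runs out
    ipset add banned-networks 203.0.113.0/24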