2025-03-21 A summary of my bot defence systems
==============================================

If you've followed my Butlerian Jihad pages, you know that I'm constantly fiddling with the setup. Each page got written in the middle of an attack, as I was trying to save my sites, documenting as I went along. But if you're looking for an overview, there is nothing to see. It's all over the place. Since the topic has gained some traction in recent days, I'm going to assemble all the things I do on this page.

Here's Drew DeVault complaining about the problem that system administrators have been facing for a while now:

> If you think these crawlers respect robots.txt then you are several
> assumptions of good faith removed from reality. These bots crawl
> everything they can find, robots.txt be damned, including expensive
> endpoints like git blame, every page of every git log, and every
> commit in every repo, and they do so using random User-Agents that
> overlap with end-users and come from tens of thousands of IP
> addresses – mostly residential, in unrelated subnets, each one
> making no more than one HTTP request over any time period we tried
> to measure – actively and maliciously adapting and blending in with
> end-user traffic and avoiding attempts to characterize their
> behavior or block their traffic.

-- Please stop externalizing your costs directly into my face, by Drew DeVault, for SourceHut

I had read some similar reports before, on fedi, but this one links to quite a few of them: FOSS infrastructure is under attack by AI companies, by Niccolò Venerandi, for LibreNews.

I'm going to skip the defences against spam, as spam hasn't been a problem in recent months, surprisingly.

The short summary:

* my robots.txt files tell the well-behaved bots to stay away;
* my web server config blocks a lot of misbehaving, self-identifying bots;
* fail2ban blocks individual IP numbers that are too active;
* fail2ban also blocks individual IP numbers that fetch too many URLs matching particular patterns that I know humans rarely request;
* when manual intervention is required, I run a script that pulls network data out of the access logfiles and then I ban whole networks at the firewall.

The first defence against bots is robots.txt. All well-behaving bots should read it every now and then, and then either stop crawling the site or slow down. Let's look at the file for Emacs Wiki.

If I find that there are a lot of requests from a particular user agent that looks like a bot, and it has a URL where I can find instructions for how to address it in robots.txt, this is what I do: I tell them to stop crawling the entire site. Most of these are search engine optimizers, brand awareness monitors and other such creeps.

The file also tells all well-behaving crawlers to slow down to a glacial tempo, and it lists all the expensive endpoints that they should not be crawling at all. Conversely, this means that any bot that still crawls those URLs is a misbehaving bot and deserves to be blocked.

Worth noting, perhaps, that "an expensive endpoint" means a URL that runs some executable to do something complicated, resulting in an answer that's always different. If the URL causes the web server to run a CGI script, for example, the request loads Perl, loads a script, loads all its libraries, compiles it all, runs it once, and answers the request with the output. And since the answer is dynamic, it can't very well be cached, or additional complexity needs to be introduced and even more resources need to be allocated and paid for. In short, an expensive endpoint is like loading an app: slow but useful, if done rarely. You'd do this for a human, for example. It's a disaster if bots swarm all over the site, clicking on every link.
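To make this concrete, here's a minimal sketch of what such a robots.txt looks like. The user agents and paths are just examples, not a copy of the actual Emacs Wiki file:

    # SEO bots, brand monitors and similar creeps: go away entirely
    User-agent: SemrushBot
    Disallow: /

    User-agent: AhrefsBot
    Disallow: /

    # everybody else: slow down and stay away from the expensive endpoints
    User-agent: *
    Crawl-delay: 20
    Disallow: /wiki?action=
    Disallow: /wiki/download/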
It's also worth noting that not all my sites have the same expensive endpoints, so the second half of robots.txt can vary, which makes maintenance of the first half a chore. I have a little script that allows me to add one bot to "all" the files, but it's annoying to have to do that. And recently I just copied a list from an AI / LLM User-Agents: Blocking Guide.

I use Apache as my web server and I have a bunch of global configuration files to handle misbehaving bots and crawlers. This first example blocks fediverse user agents from accessing the expensive endpoints on my sites. That's because whenever anybody posts a URL to one of my sites, all the servers whose users get a copy of that post fetch a preview within the next 60 seconds. That means hundreds of hits, which is particularly obnoxious for expensive endpoints. The response tells them that they are forbidden from accessing the page.

    # Fediverse instances asking for previews: protect the expensive endpoints
    RewriteCond %{REQUEST_URI} /(wiki|download|food|paper|hug|helmut|input|korero|check|radicale|say|mojo|software)
    RewriteCond %{HTTP_USER_AGENT} Mastodon|Friendica|Pleroma [nocase]
    # then it's forbidden
    RewriteRule ^(.*)$ - [forbidden,last]

Next are the evil bots that self-identify as bots but don't seem to heed the robots.txt files. These are all told that whatever page they were looking for, it's now gone (410). And if there's a human looking at the output, it even links to an explanation. Adding new user agents to this list is annoying because I need to connect as root and restart the web server after making any changes.

    # SEO bots, borked feed services and other shit
    RewriteCond "%{HTTP_USER_AGENT}" "academicbotrtu|ahrefsbot|amazonbot|awariobot|bitsightbot|blexbot|bytespider|dataforseobot|discordbot|domainstatsbot|dotbot|elisabot|eyemonit|facebot|linkfluence|magpie-crawler|megaindex|mediatoolkitbot|mj12bot|newslitbot|paperlibot|pcore|petalbot|pinterestbot|seekportbot|semanticscholarbot|semrushbot|semanticbot|seokicks-robot|siteauditbot|startmebot|summalybot|synapse|trendictionbot|twitterbot|wiederfrei|yandexbot|zoominfobot|velenpublicwebcrawler|gpt|\bads|feedburner|brandwatch|openai|facebookexternalhit|yisou|docspider" [nocase]
    RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last]

For some of my sites, I disallow all user agents containing the words "bot", "crawler", "spider", "ggpht" or "gpt", with the exception of "archivebot" and "wibybot", because I want to give those two access. Again, these bots are all told that whatever page they were looking for, it's now gone (410).

    # Private sites block all bots and crawlers. This list does not include
    # social.alexschroeder.ch, communitywiki.org, www.emacswiki.org,
    # oddmuse.org, orientalisch.info, korero.org.
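    # The three conditions below are ANDed: the request must be for one of
    # the private hosts, the user agent must not match one of the allowed
    # bots (archivebot, wibybot, or anything starting with "gwene"), and it
    # must contain one of the generic bot keywords.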
RewriteCond "%{HTTP_HOST}" "^((src\.)?alexschroeder\.ch|flying-carpet\.ch|next\.oddmuse\.org|((chat|talk)\.)?campaignwiki\.org|((archive|vault|toki|xn--vxagggm5c)\.)?transjovian\.org)$" [nocase] RewriteCond "%{HTTP_USER_AGENT}" "!archivebot|^gwene|wibybot" [nocase] RewriteCond "%{HTTP_USER_AGENT}" "bot|crawler|spider|ggpht|gpt" [nocase] RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last] I also eliminate a lot of bots looking for PHP endpoints. I can do this because I know that I don't have any PHP application installed. # Deny all idiots that are looking for borked PHP applications RewriteRule \.php$ https://alexschroeder.ch/nobots [redirect=410,last] There's also one particular image scraper that's using a unique string in its user agent. # Deny the image scraper # https://imho.alex-kunz.com/2024/02/25/block-this-shit/ RewriteCond "%{HTTP_USER_AGENT}" "Firefox/72.0" [nocase] RewriteRule ^ https://alexschroeder.ch/nobots [redirect=410,last] Next, all requests get logged by Apache in the access.log file. I use fail2ban to check this logfile. This is somewhat interesting because fail2ban is usually used to check for failed ssh login attempts. Those IP numbers that fail to login in a few times are banned. What I'm doing is I wrote a filter that treats every hit on the web server as a "failed login attempt". This is the filter: [Definition] # Most sites in the logfile count! What doesn't count is fedi.alexschroeder.ch, or chat.campaignwiki.org. failregex = ^(www\.)?(alexschroeder\.ch|campaignwiki\.org|communitywiki\.org|emacswiki\.org|flying-carpet\.ch|korero\.org|oddmuse\.org|orientalisch\.info):[0-9]+ # Except css files, images... ignoreregex = ^[^"]*"(GET /(robots\.txt |favicon\.ico |[^/ \"]+\.(css|js) |[^\"]*\.(jpg|JPG|png|PNG) |css/|fonts/|pdfs/|txt/|pics/|export/|podcast/|1pdc/|static/|munin/|osr/|indie/|rpg/|face/|traveller/|hex-describe/|text-mapper/|contrib/pics/|roll/|alrik/|wiki/download/)|(OPTIONS|PROPFIND|REPORT) /radicale) And this is the jail, saying that any IP number may make 30 hits in 60 seconds. If an IP number exceeds this (2s per page!) then it gets blocked at the firewall for 10 minutes. [alex-apache] enabled = true port = http,https logpath = %(apache_access_log)s findtime = 60 maxretry = 30 I also have another filter for a particular substring in URLs that I found the bots are requesting all the time: [Definition] failregex = ^(www\.emacswiki\.org|communitywiki\.org|campaignwiki\.org):[0-9]+ .*rcidonly= The corresponding jail says that when you trigger request such a URL for the third time in an hour, you're blocked at the firewall for 10 minutes. [alex-bots] enabled = true port = http,https logpath = %(apache_access_log)s findtime = 3600 maxretry = 2 (At the same time, these URL's redirect to a warning so that humans know that this is a trap.) Furthermore, fail2ban also comes with a recidive filter that watches its own logs. If an IP has been banned five times in a day, it gets banned for a week. [recidive] enabled = true To add to the alex-bots jail, here's what my Apache configuration says: RSS feeds for single pages are errors. RewriteCond %{QUERY_STRING} action=rss RewriteCond %{QUERY_STRING} rcidonly=.* RewriteRule .* /error.rss [last] Note that all my sites also use the following headers, so anybody ignoring these is also a prime candidate for blocking. 
    # https://github.com/rom1504/img2dataset#opt-out-directives
    Header always set X-Robots-Tag: noai
    Header always set X-Robots-Tag: noimageai

All of the above still doesn't handle extremely distributed attacks. In such situations, almost all IP numbers are unique. What I try to do in this situation is block the entire IP ranges that they come from. I scan the access.log for IP numbers that connected to a URL that shouldn't be used by bots because of robots.txt: URLs containing rcidonly, because I know humans will very rarely click them and they are expensive to serve. For each such IP number, I determine the IP range it comes from, and then I block it all. Basically, this is what I keep repeating:

    # prefix with a timestamp
    date
    # log some candidates without whois information, skipping my fedi instance
    tail -n 2000 /var/log/apache2/access.log \
      | grep -v ^social \
      | grep "rcidonly" \
      | bin/admin/network-lookup-lean > result.log
    # count
    grep ipset result.log | wc -l
    # add
    grep ipset result.log | sh
    # document
    grep ipset result.log >> bin/admin/ban-cidr

You can find the scripts in my admin collection.

#Administration #Butlerian_Jihad

2025-03-22. The drawback of using the firewall to ban broad swaths of the Internet is that these networks host bots (bad) but also networked services that I'm interested in (good). Yesterday I found that @algernon@come-from.mad-scientist.club had gone silent, and had been silent for quite a while, yet I kept seeing replies to them by others. Something was off. We got into contact via an alt account and indeed, I had blocked the IPv4 range his server was on. By my count I have already had to unblock three networks on my list. It's not a great solution, to be honest. And it doesn't expire, either. The list still contains 47021 IP ranges.

2025-06-12. Something to look into is asncounter, which I found via Traffic meter per ASN without logs. Apparently it doesn't use UDP and DNS to find the ASN but downloads the whole data set before doing local lookups. Nice! Ideally, the network blocks would expire after a while and spring back up automatically, fail2ban style. Perhaps something like that is still possible.

2025-06-18. And here we are: 2025-06-16 Ban autonomous systems. Getting closer to banning the entire corporate Internet, I fear.
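As for the wish above that network blocks should expire on their own: ipset itself can expire entries if the set is created with a timeout. This is just a sketch of the idea, not what my scripts currently generate; the set name and the timeout are placeholders:

    # a set of networks whose entries expire after a week (604800 seconds)
    ipset create banned-networks hash:net timeout 604800
    # drop all traffic coming from members of the set
    iptables -I INPUT -m set --match-set banned-networks src -j DROP
    # ban a network; the entry disappears again once the timeout runs out
    ipset add banned-networks 203.0.113.0/24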