2026-02-17 Bot check
====================

I've had a few emails from people who got banned from my sites over the weeks and months and years of the bots scraping the wikis. The wikis are valuable "content" -- text written by humans! -- so of course the AI companies want their bots to copy it and train their products with it. My server being small and my wiki being optimized for slow, human readers means that average system load goes up dramatically until the system grinds to a standstill. These are my options:

* Pay for a lot more infrastructure to serve the AI companies, for free. I refuse to do that.
* Rewrite my software for a web full of bots. This isn't easy to do because the software is different and therefore some sort of migration is required.
* Exclude bots. This is what I've been trying to do for a while, now.

My first layer of defence is the banning of whole autonomous systems. The observation all system administrators have struggled with is that the bots are highly distributed: every IP address only shows up a handful of times. But in general, the bots are hosted by data centres belonging to particular commercial entities, and so the autonomous systems responsible for them can be banned wholesale. This seemed like it wouldn't bother regular people too much because commercial and residential networks are usually kept separate. Perhaps people wouldn't be able to visit my sites from work, but they'd still be able to visit them from home.
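The post doesn't show how an autonomous system gets turned into bannable address ranges, but the usual approach is to look up every prefix the AS announces in a route registry. A minimal sketch, assuming a RADb whois answer (here replaced by a canned sample so no network call is needed) and a hypothetical ipset named banned-nets:

```shell
# Hypothetical sketch: extract announced prefixes from a route
# registry answer. In practice the input would come from e.g.
#   whois -h whois.radb.net -- '-i origin AS64496'
# but a canned sample stands in for the network call here.
sample='route:      1.2.3.0/24
origin:     AS64496
route:      203.0.113.0/24'
printf '%s\n' "$sample" | awk '/^route:/ {print $2}' | sort -u
# Each resulting prefix could then be banned wholesale, e.g. with
#   ipset add banned-nets "$prefix" -exist
```

AS64496 and 203.0.113.0/24 are documentation values, not anything from my actual ban list.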
If you look at the list of autonomous systems banned for a week, you'll see that it's not so easy:

      174 COGENT-174, US
      559 SWITCH Switch, Swiss Academic and Research Network, CH
      714 APPLE-ENGINEERING, US
     2386 INS-AS, US
     3223 VOXILITY, GB
     3320 DTAG Internet service provider operations, DE
     4134 CHINANET-BACKBONE No.31,Jin-rong Street, CN
     4466 EASYLINK2, US
     4809 CHINATELECOM-CORE-WAN-CN2 China Telecom Next Generation Carrier Network, CN
     4812 CHINANET-SH-AP China Telecom Group, CN
    …

The Swiss Academic and Research Network is banned? Are they training bots on wikis? Maybe. Maybe not. Who knows.

In any case, people were getting banned all the time. Friends were getting banned because they used a virtual private network (VPN), which means their traffic went through commercial networks and got banned with the rest. Friends were getting banned because their mobile networks were getting banned. Friends were getting banned because they used an internet service provider (ISP) that also rented out networking and computation to AI companies.

I needed a second layer of defence that would prevent the bots from accessing my sites (in order to keep the average system load in check) and also stop them from trying (so they wouldn't get banned by the first layer of defence). For a while, I used basic authentication when average system load rose to unacceptable levels and switched it off again as load came back down (as described in 2026-01-30 Locking the gate). This worked well enough: average system load came down within minutes when the site locked up, and users knew what to do because if they didn't log in correctly, the error message shown would tell them about the trivial username and password to use. And who knows, perhaps we'll get back to it. The crucial drawback, however, was that the delay still allowed enough bots to get through for their autonomous system to get banned. I still got complaints from people about the sites being inaccessible to them.
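The temporary basic-authentication lock described above could be as small as this; a sketch only, with a made-up htpasswd path and realm text (the post doesn't show the actual config):

```apache
# Hypothetical sketch of the temporary lock: enable when load is
# unacceptable, comment out again when it comes back down.
<Location "/">
    AuthType Basic
    AuthName "Site under heavy load"
    AuthUserFile /etc/apache2/trivial.htpasswd
    Require valid-user
</Location>
# A custom 401 page could name the trivial account to use:
ErrorDocument 401 /locked.html
```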
So now I've taken another look at @splitbrain@social.splitbrain.org's botcheck system. The big benefit, as far as I could see, was that the basic idea works entirely within an Apache config file. The actual botcheck system also involves a binary that reads files and tells the web server what to accept and what to skip, but that part is actually optional.

> Each request gets checked for the presence of a cookie. If the
> cookie is set, the request is served as usual. If the cookie is
> missing, a simple HTML page with a button is shown. Real users are
> asked to click the button, get a cookie valid for 30 days and the
> page reloads, this time serving the original request. From then on
> they can browse the site as usual. -- Fighting Bots

Every site that is thus protected (the Oddmuse wikis) includes the config file:

    Include conf-site/botcheck.conf

/etc/apache2/conf-site/botcheck.conf:

    # Handle non-JavaScript confirmation POST (preserve query string; host must match)
    RewriteCond %{REQUEST_METHOD} POST
    RewriteCond %{HTTP_REFERER} ^https?://([^/:]+)(?::[0-9]+)?(/[^?]*)(\?.*)?$ [NC]
    RewriteCond %{HTTP_HOST} ^%1 [NC]
    RewriteRule ^/botcheck-confirm$ %2%3 [L,R=303,NE,E=BOTCHECK_CONFIRM:1,UnsafeAllow3F]

    # Fallback redirect when no Referer
    RewriteCond %{REQUEST_METHOD} POST
    RewriteRule ^/botcheck-confirm$ / [L,R=303,E=BOTCHECK_CONFIRM:1]

    # Set cookie for confirmed users (Max-Age 2592000 s = 30 days)
    Header always set Set-Cookie "botcheck=1; Max-Age=2592000; Path=/; SameSite=Lax" env=BOTCHECK_CONFIRM

    # The botcheck page, served with 402 status code when direct access is denied
    Alias /botcheck /var/www/html/botcheck.html
    ErrorDocument 402 /botcheck
    Require all granted
    RewriteCond %{REQUEST_URI} ^/botcheck$
    RewriteRule .* - [L]

    # Skip botcheck during error handling subrequests
    RewriteCond %{ENV:REDIRECT_STATUS} !^$
    RewriteRule .* - [E=BOTCHECK:OK]

    # Allow non-GET methods without botcheck
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{REQUEST_METHOD} !GET
    RewriteRule .* - [E=BOTCHECK:OK]

    # Skip access control for the main Oddmuse feeds
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{QUERY_STRING} action=(rss|journal)
    RewriteRule /(emacs|wiki)(/[^/]*)? - [E=BOTCHECK:OK]

    # Skip access control for the Oddmu feeds
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteRule .*\.rss - [E=BOTCHECK:OK]

    # Set environment variable for allow-listed IPs
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{REMOTE_ADDR} XXX [OR]
    RewriteCond %{REMOTE_ADDR} XXX
    RewriteRule .* - [E=BOTCHECK:OK]

    # Set environment variable for allow-listed User-Agents
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{HTTP_USER_AGENT} Monit
    RewriteRule .* - [E=BOTCHECK:OK]

    # Set environment variable for valid cookie
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{HTTP_COOKIE} botcheck
    RewriteRule .* - [E=BOTCHECK:OK]

    # Return 402 status and the botcheck page if none of the conditions are met
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteRule .* - [L,R=402]

XXX stands for my IPv4 and IPv6 addresses. At the moment I think this is only required for Monit, but who knows. I probably still need to add exceptions for other services and certain bots.

As you can see, I don't use the RewriteMap directive that runs the lookup program supplied in the original botcheck setup. Instead, I implement the "excluded paths" by adding extra rules. This is required for all URLs meant for machine consumption, namely feeds: there, no human can click the button.

Non-mainstream browsers I tested so far:

* cha doesn't work;
* dillo (from Debian Trixie) doesn't work;
* edbrowse works;
* eww in Emacs doesn't work;
* links2 works but requires a manual reload;
* lynx works;
* w3m works. 😬

/var/www/html/botcheck.html:

I've taken the HTML page as-is and added an empty, hidden input element. Otherwise lynx wouldn't send a POST request. Note that if you follow the link and click the button, you won't move off the page, since the page sends you to your original target … which is that very page. You have to visit Emacs Wiki to experience it.
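For illustration, the confirmation form presumably looks something like the following; this is my guess at the shape, not the actual botcheck.html, with the empty hidden input being the lynx workaround mentioned above:

```html
<!-- Hypothetical sketch of the confirmation form. The POST goes to
     /botcheck-confirm, which the Apache config above rewrites back
     to the original page while setting the cookie. The empty hidden
     input is needed so that lynx actually sends a POST body. -->
<form method="post" action="/botcheck-confirm">
  <input type="hidden" name="confirm" value="">
  <button type="submit">I am not a bot</button>
</form>
```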
#Butlerian_Jihad #Apache #Administration

Updates
-------

2026-02-17. Seems to be doing OK, so far. The Munin graph shows the number of Apache connections currently at around 5, compared to a maximum of about 150 a bit more than 24 h ago.

2026-02-19. Strangely enough, the number of banned hosts isn't going down. The number of IP address ranges banned remains high, at around 18,000. Checking the four jobs that are adding to the list right now (expensive endpoints, active autonomous systems, attempted edits and no bots), it seems that the only job that has been contributing numbers is the "no bots" job. That's the job that checks whether IP numbers are requesting URLs which I'm qualifying as unreachable and ephemeral. That is, links that no human would follow or bookmark, nor reach by following links. In other words, these are links that crawlers have picked up in previous runs and added to data sets that are now being used to train AI. As such, they qualify for instant punishment.

2026-02-20. Yeah, the number of banned systems remains high. 😥

    # watch-recent-bans | cut -d ' ' -f 1-3,5- | sed 's/\[.*\]//'
    Feb 19 08:53:08 watch-nobots: 886
    Feb 19 09:42:04 watch-nobots: 6
    Feb 19 10:51:51 watch-nobots: 16
    Feb 19 12:30:15 watch-nobots: 16
    Feb 19 13:20:51 watch-nobots: 4
    Feb 19 14:12:02 watch-nobots: 124
    Feb 19 16:55:17 watch-nobots: 2322
    Feb 19 17:07:04 watch-nobots: 3142
    Feb 19 17:40:54 watch-nobots: 9
    Feb 19 19:43:02 watch-nobots: 826
    Feb 19 21:12:13 watch-active-autonomous-systems: 15
    Feb 19 22:10:45 watch-active-autonomous-systems: 26
    Feb 19 23:02:04 watch-nobots: 7
    Feb 20 03:11:33 watch-nobots: 6652
    Feb 20 05:04:27 watch-nobots: 6215
    Feb 20 09:39:40 watch-nobots: 6221
    Feb 20 10:12:18 watch-nobots: 39
    Feb 20 10:40:51 watch-nobots: 54
    Feb 20 11:41:39 watch-nobots: 7

So the question is: is there value in banning the bots that are being redirected to the nobots request?
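The pipeline in the listing above just trims syslog lines for readability. On one made-up sample line (hostname and pid invented here) it behaves like this:

```shell
# cut keeps the timestamp (fields 1-3) and everything from field 5
# onwards, dropping the hostname in field 4; sed then removes the
# bracketed pid. The hostname "myhost" and pid are made up.
line='Feb 19 08:53:08 myhost watch-nobots[1234]: 886'
printf '%s\n' "$line" | cut -d ' ' -f 1-3,5- | sed 's/\[.*\]//'
# → Feb 19 08:53:08 watch-nobots: 886
```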
I think I will change the setup such that instead of redirecting to a page with a 410 result, I will use this file as the 410 error document and serve it from memory (or maybe I should feed them poison…).

2026-03-01. Looks like the introduction of botcheck was a success. The number of banned IP address ranges fell from 20K to about 1K.

[Graph: the steady decline of banned hosts over the last three weeks, followed by a sharp drop to near zero in week 9.]