2026-02-17 Bot check
====================

I've had a few emails from people who got banned from my sites over the weeks and months and years of the bots scraping the wikis. The wikis are valuable "content" -- text written by humans! -- so of course the AI companies want their bots to copy it and train their products with it. My server being small and my wiki being optimized for slow, human readers means that average system load goes up dramatically until the system grinds to a standstill. These are my options:

* Pay for a lot more infrastructure to serve the AI companies, for free. I refuse to do that.
* Rewrite my software for a web full of bots. This isn't easy to do because the software is different and therefore some sort of migration is required.
* Exclude bots. This is what I've been trying to do for a while, now.

My first layer of defence is the banning of whole autonomous systems. The observation all system administrators have struggled with is that the bots are highly distributed: every IP address only shows up a handful of times. But in general, the bots are hosted by data centres belonging to particular commercial entities, and so the autonomous systems responsible for them can be banned wholesale. This seemed like it wouldn't bother regular people too much because commercial and residential networks are usually kept separate. Perhaps people wouldn't be able to visit my sites from work, but they'd still be able to visit them from home.
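The post doesn't show how an autonomous system gets turned into bannable address ranges, but the usual approach is to look up every prefix the AS announces in a route registry. A minimal sketch, assuming a RADb whois answer (here replaced by a canned sample so no network call is needed) and a hypothetical ipset named banned-nets:

```shell
# Hypothetical sketch: extract announced prefixes from a route
# registry answer. In practice the input would come from e.g.
#   whois -h whois.radb.net -- '-i origin AS64496'
# but a canned sample stands in for the network call here.
sample='route:      1.2.3.0/24
origin:     AS64496
route:      203.0.113.0/24'
printf '%s\n' "$sample" | awk '/^route:/ {print $2}' | sort -u
# Each resulting prefix could then be banned wholesale, e.g. with
#   ipset add banned-nets "$prefix" -exist
```

AS64496 and 203.0.113.0/24 are documentation values, not anything from my actual ban list.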
If you look at the list of autonomous systems banned for a week, you'll see that it's not so easy:

      174 COGENT-174, US
      559 SWITCH Switch, Swiss Academic and Research Network, CH
      714 APPLE-ENGINEERING, US
     2386 INS-AS, US
     3223 VOXILITY, GB
     3320 DTAG Internet service provider operations, DE
     4134 CHINANET-BACKBONE No.31,Jin-rong Street, CN
     4466 EASYLINK2, US
     4809 CHINATELECOM-CORE-WAN-CN2 China Telecom Next Generation Carrier Network, CN
     4812 CHINANET-SH-AP China Telecom Group, CN
    …

The Swiss Academic and Research Network is banned? Are they training bots on wikis? Maybe. Maybe not. Who knows.

In any case, people were getting banned all the time. Friends were getting banned because they used a virtual private network (VPN), which means their traffic went through commercial networks and got banned with the rest. Friends were getting banned because their mobile networks were getting banned. Friends were getting banned because they used an internet service provider (ISP) that also rented out networking and computation to AI companies.

I needed a second layer of defence that would prevent the bots from accessing my sites (in order to keep the average system load in check) and also stop them from trying (so they wouldn't get banned by the first layer of defence). For a while, I used basic authentication when average system load rose to unacceptable levels and switched it off again as load came back down (as described in 2026-01-30 Locking the gate). This worked well enough: average system load came down within minutes when the site locked up, and users knew what to do because if they didn't log in correctly, the error message shown would tell them about the trivial username and password to use. And who knows, perhaps we'll get back to it. The crucial drawback, however, was that the delay still allowed enough bots to get through for their autonomous system to get banned. I still got complaints from people about the sites being inaccessible to them.
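The temporary basic-authentication lock described above could be as small as this; a sketch only, with a made-up htpasswd path and realm text (the post doesn't show the actual config):

```apache
# Hypothetical sketch of the temporary lock: enable when load is
# unacceptable, comment out again when it comes back down.
<Location "/">
    AuthType Basic
    AuthName "Site under heavy load"
    AuthUserFile /etc/apache2/trivial.htpasswd
    Require valid-user
</Location>
# A custom 401 page could name the trivial account to use:
ErrorDocument 401 /locked.html
```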
So now I've taken another look at @splitbrain@social.splitbrain.org's botcheck system. The big benefit, as far as I could see, was that the basic idea works entirely within an Apache config file. The actual botcheck system also involves a binary that reads files and tells the web server what to accept and what to skip, but that part is actually optional.

> Each request gets checked for the presence of a cookie. If the
> cookie is set, the request is served as usual. If the cookie is
> missing, a simple HTML page with a button is shown. Real users are
> asked to click the button, get a cookie valid for 30 days and the
> page reloads, this time serving the original request. From then on
> they can browse the site as usual. -- Fighting Bots

Every site that is thus protected (the Oddmuse wikis) includes the config file:

    Include conf-site/botcheck.conf

/etc/apache2/conf-site/botcheck.conf:

    # Handle non-JavaScript confirmation POST (preserve query string; host must match)
    RewriteCond %{REQUEST_METHOD} POST
    RewriteCond %{HTTP_REFERER} ^https?://([^/:]+)(?::[0-9]+)?(/[^?]*)(\?.*)?$ [NC]
    RewriteCond %{HTTP_HOST} ^%1 [NC]
    RewriteRule ^/botcheck-confirm$ %2%3 [L,R=303,NE,E=BOTCHECK_CONFIRM:1,UnsafeAllow3F]

    # Fallback redirect when no Referer
    RewriteCond %{REQUEST_METHOD} POST
    RewriteRule ^/botcheck-confirm$ / [L,R=303,E=BOTCHECK_CONFIRM:1]

    # Set cookie for confirmed users (Max-Age 2592000 s = 30 days)
    Header always set Set-Cookie "botcheck=1; Max-Age=2592000; Path=/; SameSite=Lax" env=BOTCHECK_CONFIRM

    # The botcheck page, served with 402 status code when direct access is denied
    Alias /botcheck /var/www/html/botcheck.html
    ErrorDocument 402 /botcheck
    Require all granted
    RewriteCond %{REQUEST_URI} ^/botcheck$
    RewriteRule .* - [L]

    # Skip botcheck during error handling subrequests
    RewriteCond %{ENV:REDIRECT_STATUS} !^$
    RewriteRule .* - [E=BOTCHECK:OK]

    # Allow non-GET methods without botcheck
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{REQUEST_METHOD} !GET
    RewriteRule .* - [E=BOTCHECK:OK]

    # Skip access control for the main Oddmuse feeds
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{QUERY_STRING} action=(rss|journal)
    RewriteRule /(emacs|wiki)(/[^/]*)? - [E=BOTCHECK:OK]

    # Skip access control for the Oddmu feeds
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteRule .*\.rss - [E=BOTCHECK:OK]

    # Set environment variable for allow-listed IPs
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{REMOTE_ADDR} XXX [OR]
    RewriteCond %{REMOTE_ADDR} XXX
    RewriteRule .* - [E=BOTCHECK:OK]

    # Set environment variable for allow-listed User-Agents
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{HTTP_USER_AGENT} Monit
    RewriteRule .* - [E=BOTCHECK:OK]

    # Set environment variable for valid cookie
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteCond %{HTTP_COOKIE} botcheck
    RewriteRule .* - [E=BOTCHECK:OK]

    # Return 402 status and the botcheck page if none of the conditions are met
    RewriteCond %{ENV:BOTCHECK} !^OK$
    RewriteRule .* - [L,R=402]

XXX stands for my IPv4 and IPv6 addresses. At the moment I think this is only required for Monit, but who knows. I probably still need to add exceptions for other services and certain bots.

As you can see, I don't use the RewriteMap directive that runs the lookup program supplied in the original botcheck setup. Instead, I implement the "excluded paths" by adding extra rules. This is required for all URLs meant for machine consumption, namely feeds: there, no human can click the button.

Non-mainstream browsers I tested so far:

* cha doesn't work;
* dillo (from Debian Trixie) doesn't work;
* edbrowse works;
* eww in Emacs doesn't work;
* links2 works but requires a manual reload;
* lynx works;
* w3m works. 😬

/var/www/html/botcheck.html:

I've taken the HTML page as-is and added an empty, hidden input element. Otherwise lynx wouldn't send a POST request. Note that if you follow the link and click the button, you won't move off the page, since the page sends you to your original target … which is that very page. You have to visit Emacs Wiki to experience it.
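For illustration, the confirmation form presumably looks something like the following; this is my guess at the shape, not the actual botcheck.html, with the empty hidden input being the lynx workaround mentioned above:

```html
<!-- Hypothetical sketch of the confirmation form. The POST goes to
     /botcheck-confirm, which the Apache config above rewrites back
     to the original page while setting the cookie. The empty hidden
     input is needed so that lynx actually sends a POST body. -->
<form method="post" action="/botcheck-confirm">
  <input type="hidden" name="confirm" value="">
  <button type="submit">I am not a bot</button>
</form>
```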
#Butlerian_Jihad #Apache #Administration

Updates
-------

2026-02-17. Seems to be doing OK, so far. The Munin graph shows the number of Apache connections currently at around 5, compared to a maximum of about 150 a bit more than 24 h ago.

2026-02-19. Strangely enough, the number of banned hosts isn't going down. The number of IP address ranges banned remains high, at around 18,000. Checking the four jobs that are adding to the list right now (expensive endpoints, active autonomous systems, attempted edits and no bots), it seems that the only job that has been contributing numbers is the "no bots" job. That's the job that checks whether IP numbers are requesting URLs which I'm qualifying as unreachable and ephemeral. That is, links that no human would follow or bookmark, nor reach by following links. In other words, these are links that crawlers have picked up in previous runs and added to data sets that are now being used to train AI. As such, they qualify for instant punishment.

2026-02-20. Yeah, the number of banned systems remains high. 😥

    # watch-recent-bans | cut -d ' ' -f 1-3,5- | sed 's/\[.*\]//'
    Feb 19 08:53:08 watch-nobots: 886
    Feb 19 09:42:04 watch-nobots: 6
    Feb 19 10:51:51 watch-nobots: 16
    Feb 19 12:30:15 watch-nobots: 16
    Feb 19 13:20:51 watch-nobots: 4
    Feb 19 14:12:02 watch-nobots: 124
    Feb 19 16:55:17 watch-nobots: 2322
    Feb 19 17:07:04 watch-nobots: 3142
    Feb 19 17:40:54 watch-nobots: 9
    Feb 19 19:43:02 watch-nobots: 826
    Feb 19 21:12:13 watch-active-autonomous-systems: 15
    Feb 19 22:10:45 watch-active-autonomous-systems: 26
    Feb 19 23:02:04 watch-nobots: 7
    Feb 20 03:11:33 watch-nobots: 6652
    Feb 20 05:04:27 watch-nobots: 6215
    Feb 20 09:39:40 watch-nobots: 6221
    Feb 20 10:12:18 watch-nobots: 39
    Feb 20 10:40:51 watch-nobots: 54
    Feb 20 11:41:39 watch-nobots: 7

So the question is: is there value in banning the bots that are being redirected to the nobots request?
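The pipeline in the listing above just trims syslog lines for readability. On one made-up sample line (hostname and pid invented here) it behaves like this:

```shell
# cut keeps the timestamp (fields 1-3) and everything from field 5
# onwards, dropping the hostname in field 4; sed then removes the
# bracketed pid. The hostname "myhost" and pid are made up.
line='Feb 19 08:53:08 myhost watch-nobots[1234]: 886'
printf '%s\n' "$line" | cut -d ' ' -f 1-3,5- | sed 's/\[.*\]//'
# → Feb 19 08:53:08 watch-nobots: 886
```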
I think I will change the setup such that instead of redirecting to a page with a 410 result, I will use this file as the 410 error document and serve it from memory (or maybe I should feed them poison…).

2026-03-01. Looks like the introduction of botcheck was a success. The number of banned IP address ranges fell from 20K to about 1K.

[Graph: the steady decline of banned hosts over the last three weeks, followed by a sharp drop to near zero in week 9.]