2025-03-20 Something about the bot defence is working
=====================================================

At midnight, there was a surge in activity. CPU usage went up. Load went up, too. But it stayed within reasonable bounds -- less than 4 instead of the more than 80 I have seen in the past.

And the number of IP addresses blocked by fail2ban went from 40 to 50. I'm usually sceptical of this because the big attacks come from a far wider variety of IP numbers. In this case, however, maybe there was some probing that resulted in blocks? I don't know. Lucky, I guess?

In any case, the site is still up. Yay for small wins. Also, I cannot overstate how good it feels to have some Munin graphs available.

alex-bots is a setup I described in 2025-02-19 Bots again, cursed. Basically, a request to one of my Oddmuse wikis containing the parameter rcidonly hits an expensive endpoint: "all changes for this single page" or "a feed for this single page". This is something a human would rarely access, and yet somehow the URLs landed in some dataset for AI training, I suspect. So I redirect any request containing “rcidonly” in the query string to /nobots, warning humans not to click on these links.

In addition to that, the filter /etc/fail2ban/filter.d/alex-bots.conf contains this:

    [Definition]
    failregex = ^(www\.emacswiki\.org|communitywiki\.org|campaignwiki\.org):[0-9]+ <HOST> .*rcidonly=

And I added a section using this filter to my jail /etc/fail2ban/jail.d/alex.conf:

    [alex-bots]
    enabled = true
    port = http,https
    logpath = %(apache_access_log)s
    findtime = 3600
    maxretry = 2

So if an IP number visits three URLs containing "rcidonly" in an hour, they get banned for ten minutes. The recidive filter (a standard filter you just need to activate) then makes sure that any IP number that got blocked three times gets blocked for a week.

#Administration #Butlerian_Jihad

2025-03-20.
Ever since Drew DeVault published his blog post, more people seem to notice what's going on: AI ingestion is killing web sites and web services.

> If you think these crawlers respect robots.txt then you are several
> assumptions of good faith removed from reality. These bots crawl
> everything they can find, robots.txt be damned, including expensive
> endpoints like git blame, every page of every git log, and every
> commit in every repo, and they do so using random User-Agents that
> overlap with end-users and come from tens of thousands of IP
> addresses – mostly residential, in unrelated subnets, each one
> making no more than one HTTP request over any time period we tried
> to measure – actively and maliciously adapting and blending in with
> end-user traffic and avoiding attempts to characterize their
> behavior or block their traffic. -- Please stop externalizing your
> costs directly into my face, by Drew DeVault, for SourceHut
>
> Then, yesterday morning, KDE GitLab infrastructure was overwhelmed
> by another AI crawler, with IPs from an Alibaba range; this caused
> GitLab to be temporarily inaccessible by KDE developers. I then
> discovered that, one week ago, an Anime girl started appearing on
> the GNOME GitLab instance, as the page was loaded. It turns out that
> it's the default loading page for Anubis, a proof-of-work challenger
> that blocks AI scrapers that are causing outages. -- FOSS
> infrastructure is under attack by AI companies, by Niccolò
> Venerandi, for LibreNews
>
> What do SourceHut, GNOME’s GitLab, and KDE’s GitLab have in common,
> other than all three of them being forges? Well, it turns out all
> three of them have been dealing with immense amounts of traffic from
> “AI” scrapers, who are effectively performing DDoS attacks with such
> ferocity it’s bringing down the infrastructures of these major open
> source projects.
> Being open source, and thus publicly accessible,
> means these scrapers have unlimited access, unlike with proprietary
> projects. … Everything about this “AI” bubble is gross, and I can’t
> wait for this bubble to pop so a semblance of sanity can return to
> the technology world. Until the next hype train rolls into the
> station, of course. -- FOSS infrastructure is under attack by AI
> companies, by Thom Holwerda, for OSnews

2025-03-22. Ordinary sysadmins get hit as well. Here's Sean Conner of The Boston Diaries. He reports on Friday, March 21, 2025 that his logs show a total of 468439 requests for February 2025. The top hitter was 4.231.104.62 with 43242 requests (9%). This was from MICROSOFT-CORP-MSN-AS-BLOCK, US. But the ASN has more networks, of course. Adding them all up gives 78889 (17%).

He links to the IP to ASN Mapping Service by Team Cymru. I started switching my network-lookup script to use it because it also supports IPv6.

Something that I haven't done is find the ASN and then block all the network ranges belonging to that ASN. That's where I want to be, actually.

2025-03-26. More media are picking it up, but with a strange focus on "open source".

> As it currently stands, both the rapid growth of AI-generated
> content overwhelming online spaces and aggressive web-crawling
> practices by AI firms threaten the sustainability of essential
> online resources. The current approach taken by some large AI
> companies—extracting vast amounts of data from open-source
> projects without clear consent or compensation—risks severely
> damaging the very digital ecosystem on which these AI models depend.
> -- Open Source devs say AI crawlers dominate traffic, forcing blocks
> on entire countries, by Benj Edwards, for Ars Technica

@bagder@mastodon.social recently had some numbers:

> The AI bots that desperately need OSS for code training, are now
> slowly killing OSS by overloading every site. The curl website is
> now at 77TB/month, or 8GB every five minutes.

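As a quick sanity check of those two figures, here is a small sketch. It assumes decimal units (1 TB = 1000 GB) and a 30-day month; neither assumption comes from the post, and the function name is made up for illustration.

```python
def gb_per_interval(tb_per_month: float,
                    interval_minutes: float = 5,
                    days_per_month: int = 30) -> float:
    """Convert a monthly volume in TB to GB per interval.

    Assumes decimal units (1 TB = 1000 GB) and a fixed month length.
    """
    intervals_per_month = days_per_month * 24 * 60 / interval_minutes
    return tb_per_month * 1000 / intervals_per_month


if __name__ == "__main__":
    print(f"{gb_per_interval(77):.1f} GB every five minutes")
```

Under these assumptions the result lands a bit under 9 GB per five minutes, so the quoted 8GB is in the right range.
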
@gluejar@tilde.zone writes:

> There's a war going on on the Internet. AI companies with billions
> to burn are hard at work destroying the websites of libraries,
> archives, non-profit organizations, and scholarly publishers, anyone
> who is working to make quality information universally available on
> the internet. -- AI bots are destroying Open Access, by Eric
> Hellman

2025-04-02. The bots keep eating everything of value.

> Since January 2024, we have seen the bandwidth used for downloading
> multimedia content grow by 50%. This increase is not coming from
> human readers, but largely from automated programs that scrape the
> Wikimedia Commons image catalog of openly licensed images to feed
> images to AI models. Our infrastructure is built to sustain sudden
> traffic spikes from humans during high-interest events, but the
> amount of traffic generated by scraper bots is unprecedented and
> presents growing risks and costs. -- How crawlers impact the
> operations of the Wikimedia projects, Birgit Mueller, Chris Danis
> and Giuseppe Lavagetto, all for the Wikimedia Foundation

2025-06-18. See 2025-03-21 A summary of my bot defence systems for the next instalment.
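
A footnote on the alex-bots filter from the top of this page: a pattern like that can be tried out on a few log lines before fail2ban ever sees it. The sketch below is only an approximation. fail2ban expects a `<HOST>` tag in a failregex to mark where the client address sits, and here it is replaced by a much simpler pattern of my own. The log lines are invented samples in the shape of Apache's vhost_combined format, and the function name is hypothetical. The `fail2ban-regex` tool that ships with fail2ban is the proper way to test a filter against a real log.

```python
import re
from typing import Optional

# Simplified stand-in for fail2ban's <HOST> tag (an assumption for this
# sketch): any run of characters that can appear in an IPv4 or IPv6 address.
HOST = r"(?P<host>[0-9A-Fa-f:.]+)"

FAILREGEX = (
    r"^(www\.emacswiki\.org|communitywiki\.org|campaignwiki\.org)"
    r":[0-9]+ <HOST> .*rcidonly="
).replace("<HOST>", HOST)

# Invented sample lines, not taken from any real log.
BOT_LINE = (
    'communitywiki.org:443 203.0.113.7 - - [20/Mar/2025:00:00:01 +0100] '
    '"GET /wiki?action=rc&rcidonly=HomePage HTTP/1.1" 302 512 "-" "Mozilla/5.0"'
)
HUMAN_LINE = (
    'communitywiki.org:443 198.51.100.9 - - [20/Mar/2025:00:00:02 +0100] '
    '"GET /wiki/HomePage HTTP/1.1" 200 4096 "-" "Mozilla/5.0"'
)


def offending_host(line: str) -> Optional[str]:
    """Return the client address if the line matches the filter, else None."""
    match = re.search(FAILREGEX, line)
    return match.group("host") if match else None


if __name__ == "__main__":
    for line in (BOT_LINE, HUMAN_LINE):
        print(offending_host(line))
```

Only the first sample line contains "rcidonly=", so only that one yields an address; the second yields nothing.
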