2026-01-02 Xobaque does RFC 5005
================================

Very few blogs (this one?) support RFC 5005 "Feed Paging and Archiving". Xobaque is a search engine that doesn't crawl the web. It gets fed using feeds. That makes it the ideal companion for web rings, blog planets and the like. And I just added blog crawling using RFC 5005 to it.

As long as Xobaque finds a "next" link, it'll continue following that chain. In order not to overwhelm sites, there's a 5s pause between these requests.

I'm still not sure how good this idea is. A site like my own, with a bit more than 6000 blog pages, takes over 50 hours to crawl. And a new crawl is started every 24 hours. In my particular case, however, the next run that reaches a year page would notice that it hasn't changed and stop. So I hope it's still OK?

I think the next thing I'd like to do is support sitemaps. I don't have one, but all the Blogspot blogs have one. There are 277 Blogspot blogs on the RPG Planet. The problem with sitemaps, as far as I am concerned, is that you then need to request each and every page. It's not quite random crawling, but it's getting closer. 😭

#Xobaque #Feeds

2026-01-03. There are four blogs on Planet RPG which don't allow Xobaque using robots.txt. I wonder if Planet Jupiter -- the software I use to aggregate all the feeds -- takes robots.txt into account? I wrote it so long ago I don't remember.

As for the timeouts I've been getting for Blogspot blogs and others, I think the problem is that their ASN is blocked as per my Butlerian Jihad. The fail2ban firewall chain needs an exception.
List the handles:

```
nft -a list chain inet f2b-table f2b-chain
```

Insert the new rule between the accept rules and the reject rules that fail2ban manages:

```
nft insert rule inet f2b-table f2b-chain handle 110 \
  ct state established,related accept \
  comment "accept responses to outgoing traffic"
```

Result:

```
table inet f2b-table {
  chain f2b-chain { # handle 1
    type filter hook input priority filter - 1; policy accept;
    tcp dport 0-65535 ip6 saddr @gotosocial6 accept # handle 12
    tcp dport 0-65535 ip saddr @gotosocial accept # handle 11
    tcp dport 0-65535 ip6 saddr @allowlist6 accept # handle 10
    tcp dport 0-65535 ip saddr @allowlist accept # handle 9
    ct state established,related accept comment "accept responses to outgoing traffic" # handle 161
    tcp dport 0-65535 ip saddr @addr-set-butlerian-jihad reject with icmp port-unreachable # handle 110
    tcp dport 0-65535 ip6 saddr @addr6-set-butlerian-jihad reject with icmpv6 port-unreachable # handle 116
    tcp dport { 80, 443 } ip6 saddr @addr6-set-alex-apache reject with icmpv6 port-unreachable # handle 123
    tcp dport { 80, 443 } ip saddr @addr-set-alex-apache reject with icmp port-unreachable # handle 130
    tcp dport 0-65535 ip saddr @addr-set-butlerian-jihad-week reject with icmp port-unreachable # handle 136
    tcp dport 0-65535 ip6 saddr @addr6-set-butlerian-jihad-week reject with icmpv6 port-unreachable # handle 142
    meta l4proto tcp ip6 saddr @addr6-set-recidive reject with icmpv6 port-unreachable # handle 148
    meta l4proto tcp ip saddr @addr-set-recidive reject with icmp port-unreachable # handle 154
  }
}
```

2026-01-04. I implemented the import of sitemaps and sitemap indexes. Since those requests will all hit the same host, however, I'm using a hard-coded delay of 5s between requests. What I need to consider:

* Get the crawl delay from robots.txt. Sadly, I just discovered that Crawl-delay is a non-standard extension that isn't supported by Google, for example.
* How to schedule the full crawl of an OPML such that the individual feed pages are fetched with the least amount of waiting while still upholding the crawl delay per host.

2026-01-07. I've been working on better parallelizing things. The current code assumes that there won't be all that many locations on the command line (!) and therefore starts a goroutine for every OPML file to assemble a list of feeds; in a second step it then starts ten worker goroutines that keep working on this task list. If full feeds are being fetched, finding a next page no longer delays that goroutine. Instead, it starts a new goroutine that sleeps and then adds the new URL to the task list, thus ensuring that the delay is kept.

And now, as I think about it, I guess that won't work, because at the beginning, when the number of URLs is known, the task channel is stuffed with all the URLs and then closed. Gaaah.

2026-01-08. I think I fixed it. You don't close the task channel, and the workers use select to get tasks off the channel; if there are no more tasks, they quit. The important part is that the workers must find the "next" link before doing the work of parsing and storing the page, so that when they're done, the next task is already waiting. In previous iterations I was doing that at the end, at which point control returns to the worker, who pulls from the task list … and apparently, often enough, the task hasn't arrived yet. Changing the order of things fixed this issue.

Oh, and just in case the workers all get going before the job that assembles the tasks, add a tiny millisecond sleep before they get started. Parallel stuff is hard.