2026-01-02 Xobaque does RFC 5005
================================

Very few blogs (this one?) support RFC 5005 "Feed Paging and Archiving". Xobaque is a search engine that doesn't crawl the web. It gets fed using feeds. That makes it the ideal companion for web rings, blog planets and the like. And I just added blog crawling using RFC 5005 to it.

As long as Xobaque finds a "next" link, it'll continue following that chain. In order not to overwhelm sites, there's a 5s pause between these requests.

I'm still not sure how good this idea is. A site like my own, with a bit more than 6000 blog pages, takes over 50 hours to crawl. And a new crawl is started every 24 hours. In my particular case, however, the next run that reaches a year page would notice that it hasn't changed and stop. So I hope it's still OK?

I think the next thing I'd like to do is support sitemaps. I don't have one, but all the Blogspot blogs have one. There are 277 Blogspot blogs on the RPG Planet. The problem with sitemaps, as far as I am concerned, is that you then need to request each and every page. It's not quite random crawling, but it's getting closer. 😭

#Xobaque #Feeds

2026-01-03. There are four blogs on Planet RPG which don't allow Xobaque using robots.txt. I wonder if Planet Jupiter -- the software I use to aggregate all the feeds -- takes robots.txt into account? I wrote it so long ago I don't remember.

As for the timeouts I've been getting for Blogspot blogs and others, I think the problem is that their ASN is blocked as per my Butlerian Jihad. The fail2ban firewall chain needs an exception.
List the handles:

```
nft -a list chain inet f2b-table f2b-chain
```

Insert the new rule between the accept rules and the reject rules that fail2ban manages:

```
nft insert rule inet f2b-table f2b-chain handle 110 \
  ct state established,related accept \
  comment "accept responses to outgoing traffic"
```

Result:

```
table inet f2b-table {
  chain f2b-chain { # handle 1
    type filter hook input priority filter - 1; policy accept;
    tcp dport 0-65535 ip6 saddr @gotosocial6 accept # handle 12
    tcp dport 0-65535 ip saddr @gotosocial accept # handle 11
    tcp dport 0-65535 ip6 saddr @allowlist6 accept # handle 10
    tcp dport 0-65535 ip saddr @allowlist accept # handle 9
    ct state established,related accept comment "accept responses to outgoing traffic" # handle 161
    tcp dport 0-65535 ip saddr @addr-set-butlerian-jihad reject with icmp port-unreachable # handle 110
    tcp dport 0-65535 ip6 saddr @addr6-set-butlerian-jihad reject with icmpv6 port-unreachable # handle 116
    tcp dport { 80, 443 } ip6 saddr @addr6-set-alex-apache reject with icmpv6 port-unreachable # handle 123
    tcp dport { 80, 443 } ip saddr @addr-set-alex-apache reject with icmp port-unreachable # handle 130
    tcp dport 0-65535 ip saddr @addr-set-butlerian-jihad-week reject with icmp port-unreachable # handle 136
    tcp dport 0-65535 ip6 saddr @addr6-set-butlerian-jihad-week reject with icmpv6 port-unreachable # handle 142
    meta l4proto tcp ip6 saddr @addr6-set-recidive reject with icmpv6 port-unreachable # handle 148
    meta l4proto tcp ip saddr @addr-set-recidive reject with icmp port-unreachable # handle 154
  }
}
```

2026-01-04. I implemented the import of sitemaps and sitemap indexes. Since those requests will all hit the same host, however, I'm using a hard-coded delay of 5s between requests. What I need to consider:

* Get the crawl delay from robots.txt. Sadly, I just discovered that Crawl-delay is a non-standard extension that isn't supported by Google, for example.
* How to schedule the full crawl of an OPML such that the individual feed pages are fetched with the least amount of waiting while still upholding the crawl delay per host.

2026-01-07. I've been working on better parallelizing things. The current code assumes that there won't be all that many locations on the command line (!) and therefore starts a goroutine for every OPML file to assemble a list of feeds; in a second step it then starts ten worker goroutines that keep working on this task list. If full feeds are being fetched, finding a next page no longer delays that goroutine. Instead, it starts a new goroutine that sleeps and then adds the new URL to the task list, thus ensuring that the delay is kept.

And now, as I think about it, I guess that won't work, because at the beginning, when the number of URLs is known, the task channel is stuffed with all the URLs and then closed. Gaaah.

2026-01-08. I think I fixed it. You don't close the task channel, and the workers use select to get tasks off the channel; if there are no more tasks, they quit. The important part is that the workers must find the "next" link before doing the work of parsing and storing the page, so that when they're done, the next task is already waiting. In previous iterations I was doing that at the end, at which point control returns to the worker, who pulls from the task list … and apparently, often enough, the task hasn't arrived yet. Changing the order of things fixed this issue.

Oh, and just in case the workers all get going before the job that assembles the tasks, add a tiny millisecond sleep before they get started. Parallel stuff is hard.