2026-01-11 Xobaque imports Sitemaps
===================================

If you have a small number of sites you want to cover with your own search engine, I might have you covered. Xobaque is a search engine based on SQLite. The most useful part of it is that it doesn't crawl the web.

So how does it index the pages? You feed it a list of feeds (RSS, Atom, JSON), a list of OPML files (containing feeds), or a list of Sitemaps (linking to every page directly).

I have been running three instances for a while now. The Emacs database is the smallest one at just 27 MB. The indieblog.page database is 709 MB. The largest one is the RPG database with 769 MB.

So, how small is 'a small number of sites'? The indieblog.page search index covers 6,429 domains and 273,721 pages; the RPG search index covers 436 domains and 147,981 pages. 😬

Xobaque knows how to handle paginated feeds as specified in RFC 5005. All the blogs hosted by Google via Blogspot and Blogger support this. That means the search engine can go through the feed following the links to the next feed page until it has ingested the whole site -- as long as the whole site shows up in the paginated feed at some point.

Xobaque knows how to handle sitemaps as specified on the Sitemaps site, sitemaps.org. A sitemap is a document that links to all the pages on a site. All the blogs hosted by Google via Blogspot and Blogger support these Sitemaps, too. The drawback from my point of view is that the search engine then has to request every single one of these pages individually. A feed or a paginated feed is better since you're getting ten pages or more per request.

Xobaque follows the directives in a /robots.txt page as specified in RFC 9309 and honors the non-standard Crawl-Delay directive. Yay!

#Xobaque #Search
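
To make the RFC 5005 part concrete, here is a minimal Python sketch of walking a paginated Atom feed by following its rel="next" links. This is not Xobaque's actual code, and the Blogspot feed URL is just an example.

    # A sketch of walking an RFC 5005 paginated Atom feed, not Xobaque's
    # actual code; the Blogspot feed URL below is just an example.
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    def walk_feed(url):
        """Yield every entry, following rel="next" links until the end."""
        while url:
            with urllib.request.urlopen(url) as response:
                feed = ET.parse(response).getroot()
            for entry in feed.findall(ATOM + "entry"):
                yield entry
            # RFC 5005 marks the next page of the archive with rel="next".
            url = None
            for link in feed.findall(ATOM + "link"):
                if link.get("rel") == "next":
                    url = link.get("href")
                    break

    for entry in walk_feed("https://example.blogspot.com/feeds/posts/default"):
        print(entry.findtext(ATOM + "title"))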
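
Similarly, here is a sketch of what reading a sitemap involves: parse the XML, collect every <loc> entry, and recurse if the document turns out to be a sitemap index pointing at further sitemaps. Again, this is my own illustration, not Xobaque's implementation.

    # A sketch of collecting page URLs from a sitemap; handles both a
    # plain urlset and a sitemap index that points at further sitemaps.
    import urllib.request
    import xml.etree.ElementTree as ET

    SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def page_urls(sitemap_url):
        """Yield every page URL listed in the sitemap."""
        with urllib.request.urlopen(sitemap_url) as response:
            root = ET.parse(response).getroot()
        if root.tag == SM + "sitemapindex":
            for sitemap in root.findall(SM + "sitemap"):
                yield from page_urls(sitemap.findtext(SM + "loc"))
        else:  # a urlset lists one <loc> per page on the site
            for url in root.findall(SM + "url"):
                yield url.findtext(SM + "loc")

Each of those URLs then costs one request, which is why a paginated feed delivering ten or more pages per request is the better deal.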
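
And finally a sketch of honoring /robots.txt together with Crawl-Delay, using nothing but the Python standard library; the "xobaque" user agent token is a guess for illustration, not the engine's real token.

    # A sketch of honoring /robots.txt and Crawl-Delay with the Python
    # standard library; "xobaque" as the user agent token is a guess.
    import time
    import urllib.robotparser

    robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
    robots.read()

    if robots.can_fetch("xobaque", "https://example.com/some/page"):
        delay = robots.crawl_delay("xobaque")  # None if no Crawl-Delay is set
        if delay:
            time.sleep(delay)  # wait between requests to the same host
        # ... fetch the page here ...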