2025-09-10 Large feeds
======================

I've been working on Xobaque because I'm using it to back Search indieblog.page. As I'm importing nearly 5000 feeds, I find that a single SQLite writer is in fact a bottleneck, and that my simple approach of "select foo", if found "update foo", else "insert foo" isn't very fast. Sadly, I cannot use UPSERT because full-text search is implemented by a virtual table which doesn't allow constraints or indexes, and UPSERT requires a constraint to work.

I also discovered that some blogs come with a gazillion pages going back to the beginning of the century. There's one blog with over 12000 pages in its feed. Short pages, for sure, but still. There's one blog with over 4000 pages; one with over 2000 pages; six with over 1000 pages. What the hell!

I understand having a lot of pages on the blog. I don't understand having a lot of pages in a feed. The feed is for updates. Do people expect readers to read the entire back catalogue? I mean, I have listened through entire podcasts, which are powered by a feed, so I understand that the feed for a blog and the feed for a podcast serve different needs. But still. 12000 pages in a feed feels like downloading the whole site, every single time somebody requests an update.

These feeds are not "news" but "archives", and there's a place for both of them. Upload an archive for the initial load, then keep uploading news to keep the index up to date. This is what RFC 5005 is for: feed paging and archiving.

I guess I'm just confused because it all illustrates that I walked into this search engine stuff without understanding the huge variety out there.

The current architecture is this: 10 goroutines fetch feeds from the OPML; this uses If-Modified-Since and If-None-Match headers and skips feeds that return a 304 (there's a sketch of this below). This is fast. Before a feed is written to disk, a lock is acquired so that there is only one SQLite writer at a time. This is slow and cannot be changed without moving away from SQLite.

On fedi, @jonny@neuromatch.social had some interesting suggestions for my use of SQLite: split the selects, inserts and updates into chunks. And I did, and it helped! Each feed is broken into groups of up to 1000 items (only very few feeds have more than that); up to 1000 pages with matching URLs are read from disk and their title and content are compared to the data in the feed; the result is a number of updates, a number of inserts, a number of duplicate URLs that get ignored, and a number of unchanged pages. The updates are executed with a single prepared statement; then the inserts are executed with a single prepared statement (see the second sketch below). This already reduces the running time of one pass to about 12h. 😅 This is better than before!

I'm not sure what to do next to improve performance. The most obvious step would be to use three global prepared statements for select, insert and update to reduce SQL parsing. Not sure how much that will help.

I have a log with 45 min of output. In this time window, 1782 feeds resulted in some SQL-based processing and 2179 feeds were skipped due to HTTP caching. If I use the timestamp of the Last-Modified header that I have on disk and filter the feed for pages with published or updated dates greater than that, I might speed things up by excluding the older pages from the select statements (last sketch below). If such a date is missing, no big deal. For many feeds, however, this will result in less database activity.
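For what it's worth, the conditional fetch looks roughly like this in Go. This is a minimal sketch, not the actual Xobaque code: the function name and the idea of passing the cached ETag and Last-Modified values around as strings are assumptions for illustration.

```go
package feeds

import (
	"io"
	"net/http"
)

// fetchFeed is a hypothetical sketch of the conditional request: send the
// validators stored from the last fetch and skip the feed on a 304.
// Each of the 10 worker goroutines would call something like this.
func fetchFeed(client *http.Client, url, cachedETag, cachedLastModified string) (body []byte, changed bool, err error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, false, err
	}
	if cachedETag != "" {
		req.Header.Set("If-None-Match", cachedETag)
	}
	if cachedLastModified != "" {
		req.Header.Set("If-Modified-Since", cachedLastModified)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotModified {
		return nil, false, nil // 304: the feed is unchanged, skip it
	}
	body, err = io.ReadAll(resp.Body)
	return body, true, err
}
```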
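And here is a rough sketch of the chunked write path described above. The pages table, its columns, and the surrounding types are made up for this example; the real schema is an FTS virtual table and the details differ. The point is the shape: one prepared select reused per URL, an in-memory diff, then one prepared statement for all updates and one for all inserts, all while holding the single-writer lock.

```go
package feeds

import (
	"database/sql"
	"errors"
	"sync"
)

// Item stands in for one entry of a parsed feed; the fields are illustrative.
type Item struct {
	URL, Title, Content string
}

// writeChunk is a hypothetical sketch of handling one group of up to 1000
// items: diff against the existing rows, then batch the writes through one
// prepared UPDATE and one prepared INSERT while holding the writer lock.
func writeChunk(db *sql.DB, writer *sync.Mutex, items []Item) error {
	writer.Lock()
	defer writer.Unlock()

	sel, err := db.Prepare(`SELECT title, content FROM pages WHERE url = ?`)
	if err != nil {
		return err
	}
	defer sel.Close()

	seen := make(map[string]bool)
	var inserts, updates []Item
	for _, it := range items {
		if seen[it.URL] {
			continue // duplicate URL within the feed: ignored
		}
		seen[it.URL] = true
		var title, content string
		err := sel.QueryRow(it.URL).Scan(&title, &content)
		switch {
		case errors.Is(err, sql.ErrNoRows):
			inserts = append(inserts, it)
		case err != nil:
			return err
		case title != it.Title || content != it.Content:
			updates = append(updates, it)
		default:
			// unchanged: nothing to write
		}
	}

	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	upd, err := tx.Prepare(`UPDATE pages SET title = ?, content = ? WHERE url = ?`)
	if err != nil {
		return err
	}
	defer upd.Close()
	for _, it := range updates {
		if _, err := upd.Exec(it.Title, it.Content, it.URL); err != nil {
			return err
		}
	}

	ins, err := tx.Prepare(`INSERT INTO pages (url, title, content) VALUES (?, ?, ?)`)
	if err != nil {
		return err
	}
	defer ins.Close()
	for _, it := range inserts {
		if _, err := ins.Exec(it.URL, it.Title, it.Content); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```

Making the three prepared statements global, as mentioned above, would mean preparing them once instead of once per chunk.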
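Finally, the date filter I'm considering might look something like this. Again, just a sketch under assumptions: the item type with Published and Updated timestamps and the zero-value convention for "no date" are invented for the example.

```go
package feeds

import "time"

// FeedItem stands in for a parsed feed entry; only the dates matter here.
type FeedItem struct {
	URL                string
	Published, Updated time.Time // zero value means the feed gave no date
}

// newItems is a hypothetical sketch of the filter: keep only items newer
// than the timestamp stored for the feed (Last-Modified today, the feed's
// own published/updated date later). Items without any date are kept so
// they still go through the normal select.
func newItems(items []FeedItem, since time.Time) []FeedItem {
	if since.IsZero() {
		return items // nothing stored yet: process everything
	}
	var fresh []FeedItem
	for _, it := range items {
		ts := it.Updated
		if ts.IsZero() {
			ts = it.Published
		}
		if ts.IsZero() || ts.After(since) {
			fresh = append(fresh, it)
		}
	}
	return fresh
}
```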
Looking at the numbers in the piece of log I have: 93119 pages were processed, of which 160 were duplicates, 9534 resulted in inserts, 362 resulted in updates, and 83063 were unchanged. Clearly, skipping the unchanged ones would save database activity! Actually, I'll store the feed's published/updated date in addition to the web server's Last-Modified date, if available, and use that instead.

#Xobaque #Feeds #Blogs