2025-09-10 Large feeds
======================

I've been working on Xobaque because I'm using it to back Search indieblog.page. As I'm importing nearly 5000 feeds, I find that a single SQLite writer is in fact a bottleneck, and that my simple approach of "select foo", if found "update foo", else "insert foo" isn't very fast. Sadly, I cannot use UPSERT because full-text search is implemented by a virtual table which doesn't allow constraints or indexes, and UPSERT requires a constraint to work.

I also discovered that some blogs come with a gazillion pages going back to the beginning of the century. There's one blog with over 12000 pages in its feed. Short pages, for sure, but still. There's one blog with over 4000 pages; one with over 2000 pages; six with over 1000 pages. What the hell!

I understand having a lot of pages on the blog. I don't understand having a lot of pages in a feed. The feed is for updates. Do people expect readers to read the entire back catalogue? I mean, I have listened through entire podcasts, which are powered by a feed, so I understand that the feed for a blog and the feed for a podcast serve different needs. But still. 12000 pages in a feed feels like downloading the whole site, every single time somebody requests an update.

These feeds are not "news" but "archives", and there's a place for both of them. Upload an archive for the initial load, then keep uploading news to keep the index up to date. This is what RFC 5005 is for: feed paging and archiving.

I guess I'm just confused because it all illustrates that I walked into this search engine stuff without understanding the huge variety out there.

The current architecture is this: 10 goroutines fetch feeds from the OPML; this uses If-Modified-Since and If-None-Match headers and skips feeds that return a 304 (there's a sketch of this below). This is fast. Before a feed is written to disk, a lock is acquired so that there is only one SQLite writer at a time. This is slow and cannot be changed without moving away from SQLite.

On fedi, @jonny@neuromatch.social had some interesting suggestions for my use of SQLite: split the selects, inserts and updates into chunks. And I did, and it helped! Each feed is broken into groups of up to 1000 items (only very few feeds have more than that); up to 1000 pages with matching URLs are read from disk and their title and content are compared to the data in the feed; the result is a number of updates, a number of inserts, a number of duplicate URLs that get ignored, and a number of unchanged pages. The updates are executed with a single prepared statement; then the inserts are executed with a single prepared statement (see the second sketch below). This already reduces the running time of one pass to about 12h. 😅 This is better than before!

I'm not sure what to do next to improve performance. The most obvious step would be to use three global prepared statements for select, insert and update to reduce SQL parsing. Not sure how much that will help.

I have a log with 45 min of output. In this time window, 1782 feeds resulted in some SQL-based processing and 2179 feeds were skipped due to HTTP caching. If I use the timestamp of the Last-Modified header that I have on disk and filter the feed for pages with published or updated dates greater than that, I might speed things up by excluding the older pages from the select statements (last sketch below). If such a date is missing, no big deal. For many feeds, however, this will result in less database activity.
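For what it's worth, the conditional fetch looks roughly like this in Go. This is a minimal sketch, not the actual Xobaque code: the function name and the idea of passing the cached ETag and Last-Modified values around as strings are assumptions for illustration.

```go
package feeds

import (
	"io"
	"net/http"
)

// fetchFeed is a hypothetical sketch of the conditional request: send the
// validators stored from the last fetch and skip the feed on a 304.
// Each of the 10 worker goroutines would call something like this.
func fetchFeed(client *http.Client, url, cachedETag, cachedLastModified string) (body []byte, changed bool, err error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, false, err
	}
	if cachedETag != "" {
		req.Header.Set("If-None-Match", cachedETag)
	}
	if cachedLastModified != "" {
		req.Header.Set("If-Modified-Since", cachedLastModified)
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotModified {
		return nil, false, nil // 304: the feed is unchanged, skip it
	}
	body, err = io.ReadAll(resp.Body)
	return body, true, err
}
```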
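And here is a rough sketch of the chunked write path described above. The pages table, its columns, and the surrounding types are made up for this example; the real schema is an FTS virtual table and the details differ. The point is the shape: one prepared select reused per URL, an in-memory diff, then one prepared statement for all updates and one for all inserts, all while holding the single-writer lock.

```go
package feeds

import (
	"database/sql"
	"errors"
	"sync"
)

// Item stands in for one entry of a parsed feed; the fields are illustrative.
type Item struct {
	URL, Title, Content string
}

// writeChunk is a hypothetical sketch of handling one group of up to 1000
// items: diff against the existing rows, then batch the writes through one
// prepared UPDATE and one prepared INSERT while holding the writer lock.
func writeChunk(db *sql.DB, writer *sync.Mutex, items []Item) error {
	writer.Lock()
	defer writer.Unlock()

	sel, err := db.Prepare(`SELECT title, content FROM pages WHERE url = ?`)
	if err != nil {
		return err
	}
	defer sel.Close()

	seen := make(map[string]bool)
	var inserts, updates []Item
	for _, it := range items {
		if seen[it.URL] {
			continue // duplicate URL within the feed: ignored
		}
		seen[it.URL] = true
		var title, content string
		err := sel.QueryRow(it.URL).Scan(&title, &content)
		switch {
		case errors.Is(err, sql.ErrNoRows):
			inserts = append(inserts, it)
		case err != nil:
			return err
		case title != it.Title || content != it.Content:
			updates = append(updates, it)
		default:
			// unchanged: nothing to write
		}
	}

	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	upd, err := tx.Prepare(`UPDATE pages SET title = ?, content = ? WHERE url = ?`)
	if err != nil {
		return err
	}
	defer upd.Close()
	for _, it := range updates {
		if _, err := upd.Exec(it.Title, it.Content, it.URL); err != nil {
			return err
		}
	}

	ins, err := tx.Prepare(`INSERT INTO pages (url, title, content) VALUES (?, ?, ?)`)
	if err != nil {
		return err
	}
	defer ins.Close()
	for _, it := range inserts {
		if _, err := ins.Exec(it.URL, it.Title, it.Content); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```

Making the three prepared statements global, as mentioned above, would mean preparing them once instead of once per chunk.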
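Finally, the date filter I'm considering might look something like this. Again, just a sketch under assumptions: the item type with Published and Updated timestamps and the zero-value convention for "no date" are invented for the example.

```go
package feeds

import "time"

// FeedItem stands in for a parsed feed entry; only the dates matter here.
type FeedItem struct {
	URL                string
	Published, Updated time.Time // zero value means the feed gave no date
}

// newItems is a hypothetical sketch of the filter: keep only items newer
// than the timestamp stored for the feed (Last-Modified today, the feed's
// own published/updated date later). Items without any date are kept so
// they still go through the normal select.
func newItems(items []FeedItem, since time.Time) []FeedItem {
	if since.IsZero() {
		return items // nothing stored yet: process everything
	}
	var fresh []FeedItem
	for _, it := range items {
		ts := it.Updated
		if ts.IsZero() {
			ts = it.Published
		}
		if ts.IsZero() || ts.After(since) {
			fresh = append(fresh, it)
		}
	}
	return fresh
}
```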
Looking at the numbers in the piece of log I have: 93119 pages were processed, of which 160 were duplicates, 9534 resulted in inserts, 362 resulted in updates, and 83063 were unchanged. Clearly, skipping the unchanged ones would save database activity! Actually, I'll store the feed's published/updated date in addition to the web server's Last-Modified date, if available, and use that instead.

#Xobaque #Feeds #Blogs