2025-09-15 Searching the blogs on the RPG Planet
================================================

I'm maintaining the RPG Planet. Having recently developed Xobaque, a search engine that indexes feeds, I wanted to give it a try. And here we are: Search RPG Planet! 🥳 Let's see how it goes.

As for the RPG Planet: If you have an RPG blog that isn't listed, let me know and I'll add it. I'm also interested in backfilling entries that are no longer part of the current feeds. This is tricky, unfortunately, because almost nobody implements RFC 5005: Feed Paging and Archiving.

Wayback Machine
---------------

I tried an approach that asks the Wayback Machine for all the copies of a feed. It's generic, but it's sloooooow! 😴 I'm going to avoid using this mechanism, if I can!

This uses waybackpack, htmlq and xpath, …

    for feed in (xpath -e //outline/attribute::xmlUrl example-opml.xml \
            2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p')
        for url in (waybackpack --list $feed)
            echo $url
            set body (curl --silent $url)
            if echo $body | head -n 1 | grep --silent "<html>"
                # we got the HTML playback page instead of the feed,
                # so get the feed URL from the playback iframe
                set url (htmlq -u $url -a src "#playback")
                echo ... $url
                set body (curl --silent $url)
            end
            echo $body | ./xobaque import feed file -
        end
    end

I started looking around and I think I have answers for the big platforms.

Blogspot
--------

Blogspot is important because there are 265 blogs with "blogspot" in their name that I'm listing. There may be more. I remembered that Blogspot supports the start-index query parameter and returns 25 results for every call, so the pages start at index 1, 26, 51, and so on.

Here's blogspot-import:

    #!/usr/bin/fish
    # Go through previous pages of Blogspot feeds and import them all.
    echo (count $argv) OPML files to consider
    if test -z "$argv"
        exit
    end
    set feeds (string match --entire blogspot \
        (xpath -e //outline/attribute::xmlUrl $argv \
            2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p'))
    echo (count $feeds) feeds found
    for feed in $feeds
        echo $feed
        for i in (seq 400)
            set start (math "($i-1)*25+1")
            set url "$feed?start-index=$start"
            echo " $url"
            set body (curl --silent $url)
            # count the entries on this page
            set cnt (string match --all --regex "<entry>" "$body" | count)
            if test $cnt = 0
                echo " done."
                break
            else
                echo $body | xobaque import feed file -
            end
        end
    end

Wordpress
---------

Wordpress is important because there are 128 blogs with feed URLs ending in /feed or /feed/ that I'm listing. There may be more. I remembered that Wordpress supports the paged query parameter and returns 10 results for every call. This is going to be slow. 😴

Here's wordpress-import:

    #!/usr/bin/fish
    # Go through previous pages of Wordpress feeds and import them all.
    echo (count $argv) OPML files to consider
    if test -z "$argv"
        exit
    end
    # get the link of the first item, for both RSS and Atom feeds
    set xpath "string(/rss/channel/item[position()=1]/link/text() | /feed/entry[position()=1]/link/@href)"
    set feeds (xpath -e //outline/attribute::xmlUrl $argv \
        2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p' \
        | string match --entire --regex '/feed/?$')
    echo (count $feeds) feeds found
    for feed in $feeds
        echo $feed
        for i in (seq 1000)
            if test $i = 1
                set url $feed
            else
                set url "$feed?paged=$i"
            end
            echo " $url"
            set body (curl --location --silent $url)
            if test $i = 1
                set first (echo $body | xpath -e $xpath 2>/dev/null)
                if test -z $first
                    echo " empty feed"
                    break
                end
            else
                set this (echo $body | xpath -e $xpath 2>/dev/null)
                if test -z $this
                    echo " no more items"
                    break
                else if test $this = $first
                    echo " pagination failed"
                    break
                end
            end
            echo $body | xobaque import feed file -
        end
    end

In fact, this is too slow! I'm not sure what the cause of the slowness is. Perhaps it's the xobaque startup time?
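One way to test that would be to time a single import of a feed that is already on disk; if a small file takes almost as long as a large one, the fixed startup cost dominates. A minimal sketch (example.xml is a placeholder for any saved feed):

    # Time one import; compare a tiny feed against a big one.
    # example.xml stands in for a feed saved earlier.
    time xobaque import feed file example.xml

If startup is indeed the problem, the fix is to import as many files as possible per invocation, which is what the following setup does.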
I started using a setup with multiple steps:

1. get a list of the feeds
2. download the paginated feeds for 40 of them
3. start xobaque and import them all in one go

This is wordpress-list:

    #!/usr/bin/fish
    # List the Wordpress feeds from the OPML files on the command line.
    if test -z "$argv"
        exit
    end
    set feeds (xpath -e //outline/attribute::xmlUrl $argv \
        2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p' \
        | string match --entire --regex '/feed/?$' \
        | tail -n +2)
    for feed in $feeds
        echo $feed
    end

Use it to create a todo list.

    ./wordpress-list /home/alex/planet/*.opml > wordpress-todo.txt

Then pipe a bunch of these into the next script. The first forty:

    head -n 40 wordpress-todo.txt | xargs ./wordpress-feed

The next forty:

    tail -n +41 wordpress-todo.txt | head -n 40 | xargs ./wordpress-feed

And so on.

The wordpress-feed script downloads the paginated feeds:

    #!/usr/bin/fish
    # Go through the feeds on the command line and get all the pages
    # for each feed.
    if test -z "$argv"
        exit
    end
    # get the link of the first item, for both RSS and Atom feeds
    set xpath "string(/rss/channel/item[position()=1]/link/text() | /feed/entry[position()=1]/link/@href)"
    set dir feeds-(date '+%Y-%m-%d %H:%M')
    mkdir --parent $dir
    set f 0
    for feed in $argv
        set f (math $f+1)
        set previous_first_link ""
        for i in (seq 1000)
            if test $i = 1
                set url $feed
            else
                set url "$feed?paged=$i"
            end
            echo $url
            set body (curl --location --silent $url)
            set first_link (echo $body | xpath -e $xpath 2>/dev/null)
            if test -z $first_link
                echo " empty feed"
                break
            else if test $first_link = $previous_first_link
                echo " pagination failed"
                break
            end
            set previous_first_link $first_link
            echo $body > "$dir/$f-$i.xml"
        end
    end

This generates a bunch of XML files in a directory called feeds-2025-09-16 00:29 (depending on the date and time). Process them all:

    xobaque import feed file feeds-*/*.xml

This seems much quicker.

Even better!
------------

Extract all the feed URLs from a bunch of OPML files using xpath and sed:

    xpath -e //outline/attribute::xmlUrl \
        /home/alex/planet/indie.opml \
        /home/alex/planet/osr.opml \
        /home/alex/planet/other.opml \
        2>/dev/null \
        | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p'

Save this list of URLs in a file, take 50 items at a time, and process them with get-feed-archive, which tries to detect Blogspot or Wordpress pagination:

    #!/usr/bin/fish
    # Go through the feeds on the command line and get all the pages
    # for each feed.
    if test -z "$argv"
        exit
    end
    # get the link of the first item, for both RSS and Atom feeds
    set xpath "string(/rss/channel/item[position()=1]/link/text() | /feed/entry[position()=1]/link/@href)"
    set dir feeds-(random)
    mkdir $dir
    set f 0
    for feed in $argv
        set f (math $f+1)
        set previous_first_link ""
        set type ""
        for i in (seq 1000)
            if test $i = 1
                set url $feed
            else if test $type = "wordpress"
                if string match --quiet --entire "?" $feed
                    set url "$feed&paged=$i"
                else
                    set url "$feed?paged=$i"
                end
            else if test $type = "blogspot"
                set start (math "($i-1)*25+1")
                if string match --quiet --entire "?" $feed
                    set url "$feed&start-index=$start"
                else
                    set url "$feed?start-index=$start"
                end
            else
                echo " unknown pagination type"
                break
            end
            echo $url
            set body (curl --location --silent $url)
            set first_link (echo $body | xpath -e $xpath 2>/dev/null)
            if test -z $first_link
                echo " empty feed"
                break
            else if test $first_link = $previous_first_link
                echo " pagination failed"
                break
            end
            set previous_first_link $first_link
            echo $body > "$dir/$f-$i.xml"
            if test $i = 1
                # determine the pagination type from the URL,
                # or failing that, from the feed itself
                if string match --quiet --entire blogspot $feed
                    set type blogspot
                else if string match --quiet --entire wordpress $feed
                    set type wordpress
                else if echo $body | xpath -e //generator 2>/dev/null | grep --silent wordpress
                    set type wordpress
                else if echo $body | xpath -e /feed/id 2>/dev/null | grep --silent blogger
                    set type blogspot
                else if set link (echo $body | xpath -e '/rss/channel/link/text()' 2>/dev/null | grep blogspot)
                    set type blogspot
                    set feed $link"feeds/posts/default"
                else
                    echo " unknown pagination type"
                    break
                end
                echo " pagination type $type"
            end
        end
    end
    echo $dir

Then process the directory using xobaque import feed file feeds-*/*.xml or similar. Then delete the temporary directories again. Now I just need to put this in some sort of yearly job, maybe?

What about RFC 5005
-------------------

Did you know that there is an actual spec for paginating feeds? RFC 5005 is the best! Except not a single feed on the RPG Planet has a match for rel="next" or rel="previous". Not even my own blog. 😭
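To illustrate what we're missing: under RFC 5005, a paged feed simply carries link elements pointing to further pages, and a backfilling reader follows them until they run out. A hand-written sketch with made-up URLs:

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Example Blog</title>
      <!-- the page we fetched -->
      <link rel="self" href="https://example.com/feed"/>
      <!-- RFC 5005: the next, older page of the feed -->
      <link rel="next" href="https://example.com/feed?paged=2"/>
      <!-- entries follow … -->
    </feed>

With links like these, none of the platform-specific pagination guessing above would be necessary.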
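Checking for such links is cheap, at least. Something like this (a sketch; feed-urls.txt stands in for the list of feed URLs saved earlier):

    # report every feed that advertises an RFC 5005 next-page link
    for feed in (cat feed-urls.txt)
        curl --location --silent $feed | grep --quiet 'rel="next"'
        and echo $feed supports paging
    end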
Others
------

Current status, accounting for multiple feeds (people with feeds in both indie.opml and osr.opml):

* 478 unique blogs in total
* 265 Blogspot blogs
* 128 Wordpress blogs
* 85 other blogs

What about these 85 other blogs, I wonder. I've removed a number of dead ones, but I still have questions.

Top posters
-----------

The imports are still ongoing, of course. I'll update this table.

+---------------------------------+-------+
| DOMAIN                          | COUNT |
+---------------------------------+-------+
| theotherside.timsbrannan.com    |  6524 |
| grognardia.blogspot.com         |  4744 |
| jrients.blogspot.com            |  3502 |
| dysonlogos.blog                 |  3228 |
| githyankidiaspora.com           |  3090 |
| www.crossplanes.com             |  3028 |
| towerofthearchmage.blogspot.com |  2894 |
| bxblackrazor.blogspot.com       |  2565 |
| vulpinoid.blogspot.com          |  2555 |
| stargazersworld.com             |  2333 |
+---------------------------------+-------+

#Xobaque