2025-09-15 Searching the blogs on the RPG Planet
================================================

I'm maintaining the RPG Planet. Having recently developed Xobaque, a search engine that indexes feeds, I wanted to give it a try. And here we are: Search RPG Planet! 🥳 Let's see how it goes.

As for the RPG Planet: If you have an RPG blog that isn't listed, let me know and I'll add it. I'm also interested in backfilling entries that are no longer part of the current feeds. This is tricky, unfortunately, because almost nobody implements RFC 5005: Feed Paging and Archiving.

Wayback Machine
---------------

I tried an approach that asks the Wayback Machine for all the copies of a feed. It's generic, but it's sloooooow! 😴 I'm going to avoid using this mechanism, if I can!

This uses waybackpack, htmlq and xpath, …

    for feed in (xpath -e //outline/attribute::xmlUrl example-opml.xml \
            2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p')
        for url in (waybackpack --list $feed)
            echo $url
            set body (curl --silent $url)
            if echo $body | head -n 1 | grep --silent "<html>"
                # we got the HTML playback page instead of the feed,
                # so get the feed URL from the playback iframe
                set url (htmlq -u $url -a src "#playback")
                echo ... $url
                set body (curl --silent $url)
            end
            echo $body | ./xobaque import feed file -
        end
    end

I started looking around and I think I have answers for the big platforms.

Blogspot
--------

Blogspot is important because there are 265 blogs with "blogspot" in their name that I'm listing. There may be more. I remembered that Blogspot supports the start-index query parameter and returns 25 results for every call, so the pages start at index 1, 26, 51, and so on.

Here's blogspot-import:

    #!/usr/bin/fish
    # Go through previous pages of Blogspot feeds and import them all.
    echo (count $argv) OPML files to consider
    if test -z "$argv"
        exit
    end
    set feeds (string match --entire blogspot \
        (xpath -e //outline/attribute::xmlUrl $argv \
            2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p'))
    echo (count $feeds) feeds found
    for feed in $feeds
        echo $feed
        for i in (seq 400)
            set start (math "($i-1)*25+1")
            set url "$feed?start-index=$start"
            echo " $url"
            set body (curl --silent $url)
            # count the entries on this page
            set cnt (string match --all --regex "<entry>" "$body" | count)
            if test $cnt = 0
                echo " done."
                break
            else
                echo $body | xobaque import feed file -
            end
        end
    end

Wordpress
---------

Wordpress is important because there are 128 blogs with feed URLs ending in /feed or /feed/ that I'm listing. There may be more. I remembered that Wordpress supports the paged query parameter and returns 10 results for every call. This is going to be slow. 😴

Here's wordpress-import:

    #!/usr/bin/fish
    # Go through previous pages of Wordpress feeds and import them all.
    echo (count $argv) OPML files to consider
    if test -z "$argv"
        exit
    end
    # get the link of the first item, for both RSS and Atom feeds
    set xpath "string(/rss/channel/item[position()=1]/link/text() | /feed/entry[position()=1]/link/@href)"
    set feeds (xpath -e //outline/attribute::xmlUrl $argv \
        2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p' \
        | string match --entire --regex '/feed/?$')
    echo (count $feeds) feeds found
    for feed in $feeds
        echo $feed
        for i in (seq 1000)
            if test $i = 1
                set url $feed
            else
                set url "$feed?paged=$i"
            end
            echo " $url"
            set body (curl --location --silent $url)
            if test $i = 1
                set first (echo $body | xpath -e $xpath 2>/dev/null)
                if test -z $first
                    echo " empty feed"
                    break
                end
            else
                set this (echo $body | xpath -e $xpath 2>/dev/null)
                if test -z $this
                    echo " no more items"
                    break
                else if test $this = $first
                    echo " pagination failed"
                    break
                end
            end
            echo $body | xobaque import feed file -
        end
    end

In fact, this is too slow! I'm not sure what the cause of the slowness is. Perhaps it's the xobaque startup time?
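One way to test that would be to time a single import of a feed that is already on disk; if a small file takes almost as long as a large one, the fixed startup cost dominates. A minimal sketch (example.xml is a placeholder for any saved feed):

    # Time one import; compare a tiny feed against a big one.
    # example.xml stands in for a feed saved earlier.
    time xobaque import feed file example.xml

If startup is indeed the problem, the fix is to import as many files as possible per invocation, which is what the following setup does.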
I started using a setup with multiple steps:

1. get a list of the feeds
2. download the paginated feeds for 40 of them
3. start xobaque and import them all in one go

This is wordpress-list:

    #!/usr/bin/fish
    # List the Wordpress feeds from the OPML files on the command line.
    if test -z "$argv"
        exit
    end
    set feeds (xpath -e //outline/attribute::xmlUrl $argv \
        2>/dev/null | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p' \
        | string match --entire --regex '/feed/?$' \
        | tail -n +2)
    for feed in $feeds
        echo $feed
    end

Use it to create a todo list.

    ./wordpress-list /home/alex/planet/*.opml > wordpress-todo.txt

Then pipe a bunch of these into the next script. The first forty:

    head -n 40 wordpress-todo.txt | xargs ./wordpress-feed

The next forty:

    tail -n +41 wordpress-todo.txt | head -n 40 | xargs ./wordpress-feed

And so on.

The wordpress-feed script downloads the paginated feeds:

    #!/usr/bin/fish
    # Go through the feeds on the command line and get all the pages
    # for each feed.
    if test -z "$argv"
        exit
    end
    # get the link of the first item, for both RSS and Atom feeds
    set xpath "string(/rss/channel/item[position()=1]/link/text() | /feed/entry[position()=1]/link/@href)"
    set dir feeds-(date '+%Y-%m-%d %H:%M')
    mkdir --parent $dir
    set f 0
    for feed in $argv
        set f (math $f+1)
        set previous_first_link ""
        for i in (seq 1000)
            if test $i = 1
                set url $feed
            else
                set url "$feed?paged=$i"
            end
            echo $url
            set body (curl --location --silent $url)
            set first_link (echo $body | xpath -e $xpath 2>/dev/null)
            if test -z $first_link
                echo " empty feed"
                break
            else if test $first_link = $previous_first_link
                echo " pagination failed"
                break
            end
            set previous_first_link $first_link
            echo $body > "$dir/$f-$i.xml"
        end
    end

This generates a bunch of XML files in a directory called feeds-2025-09-16 00:29 (depending on the date and time). Process them all:

    xobaque import feed file feeds-*/*.xml

This seems much quicker.

Even better!
------------

Extract all the feed URLs from a bunch of OPML files using xpath and sed:

    xpath -e //outline/attribute::xmlUrl \
        /home/alex/planet/indie.opml \
        /home/alex/planet/osr.opml \
        /home/alex/planet/other.opml \
        2>/dev/null \
        | sed -n -e 's/.*xmlUrl="\(.*\)"/\1/p'

Save this list of URLs in a file, take 50 items at a time, and process them with get-feed-archive, which tries to detect Blogspot or Wordpress pagination:

    #!/usr/bin/fish
    # Go through the feeds on the command line and get all the pages
    # for each feed.
    if test -z "$argv"
        exit
    end
    # get the link of the first item, for both RSS and Atom feeds
    set xpath "string(/rss/channel/item[position()=1]/link/text() | /feed/entry[position()=1]/link/@href)"
    set dir feeds-(random)
    mkdir $dir
    set f 0
    for feed in $argv
        set f (math $f+1)
        set previous_first_link ""
        set type ""
        for i in (seq 1000)
            if test $i = 1
                set url $feed
            else if test $type = "wordpress"
                if string match --quiet --entire "?" $feed
                    set url "$feed&paged=$i"
                else
                    set url "$feed?paged=$i"
                end
            else if test $type = "blogspot"
                set start (math "($i-1)*25+1")
                if string match --quiet --entire "?" $feed
                    set url "$feed&start-index=$start"
                else
                    set url "$feed?start-index=$start"
                end
            else
                echo " unknown pagination type"
                break
            end
            echo $url
            set body (curl --location --silent $url)
            set first_link (echo $body | xpath -e $xpath 2>/dev/null)
            if test -z $first_link
                echo " empty feed"
                break
            else if test $first_link = $previous_first_link
                echo " pagination failed"
                break
            end
            set previous_first_link $first_link
            echo $body > "$dir/$f-$i.xml"
            if test $i = 1
                # determine the pagination type from the URL,
                # or failing that, from the feed itself
                if string match --quiet --entire blogspot $feed
                    set type blogspot
                else if string match --quiet --entire wordpress $feed
                    set type wordpress
                else if echo $body | xpath -e //generator 2>/dev/null | grep --silent wordpress
                    set type wordpress
                else if echo $body | xpath -e /feed/id 2>/dev/null | grep --silent blogger
                    set type blogspot
                else if set link (echo $body | xpath -e '/rss/channel/link/text()' 2>/dev/null | grep blogspot)
                    set type blogspot
                    set feed $link"feeds/posts/default"
                else
                    echo " unknown pagination type"
                    break
                end
                echo " pagination type $type"
            end
        end
    end
    echo $dir

Then process the directory using xobaque import feed file feeds-*/*.xml or similar. Then delete the temporary directories again. Now I just need to put this in some sort of yearly job, maybe?

What about RFC 5005
-------------------

Did you know that there is an actual spec for paginating feeds? RFC 5005 is the best! Except not a single feed on the RPG Planet has a match for rel="next" or rel="previous". Not even my own blog. 😭
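To illustrate what we're missing: under RFC 5005, a paged feed simply carries link elements pointing to further pages, and a backfilling reader follows them until they run out. A hand-written sketch with made-up URLs:

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Example Blog</title>
      <!-- the page we fetched -->
      <link rel="self" href="https://example.com/feed"/>
      <!-- RFC 5005: the next, older page of the feed -->
      <link rel="next" href="https://example.com/feed?paged=2"/>
      <!-- entries follow … -->
    </feed>

With links like these, none of the platform-specific pagination guessing above would be necessary.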
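Checking for such links is cheap, at least. Something like this (a sketch; feed-urls.txt stands in for the list of feed URLs saved earlier):

    # report every feed that advertises an RFC 5005 next-page link
    for feed in (cat feed-urls.txt)
        curl --location --silent $feed | grep --quiet 'rel="next"'
        and echo $feed supports paging
    end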
Others
------

Current status, accounting for multiple feeds (people with feeds in both indie.opml and osr.opml):

* 478 unique blogs in total
* 265 Blogspot blogs
* 128 Wordpress blogs
* 85 other blogs

What about these 85 other blogs, I wonder. I've removed a number of dead ones, but I still have questions.

Top posters
-----------

The imports are still ongoing, of course. I'll update this table.

+---------------------------------+-------+
| DOMAIN                          | COUNT |
+---------------------------------+-------+
| theotherside.timsbrannan.com    |  6524 |
| grognardia.blogspot.com         |  4744 |
| jrients.blogspot.com            |  3502 |
| dysonlogos.blog                 |  3228 |
| githyankidiaspora.com           |  3090 |
| www.crossplanes.com             |  3028 |
| towerofthearchmage.blogspot.com |  2894 |
| bxblackrazor.blogspot.com       |  2565 |
| vulpinoid.blogspot.com          |  2555 |
| stargazersworld.com             |  2333 |
+---------------------------------+-------+

#Xobaque