## Meet Eomyidae

Meet Eomyidae, the flying gopher.

by Christoph Lohmann <20h@r-36.net>

## What is eomyidae?

Eomyidae is a family of extinct rodents from North America and Eurasia,
related to modern-day pocket gophers and kangaroo rats. They are known
from the Middle Eocene to the Late Miocene in North America and from the
Late Eocene to the Pleistocene in Eurasia. Eomyids were generally small,
but occasionally large, and tended to be squirrel-like in form and
habits. The family includes the earliest known gliding rodent, Eomys
quercyi.

Flying gophers!

## See it.

[ASCII art of a gopher]

## How did it evolve so far?

* Over 20 iterations of different ways to crawl the gopherspace.
* New problems kept arising.
* Whenever things did not scale, the algorithm had to be redone.

## How does it crawl now?

1. Use some initial URI and add it to the queue.
2. Load the old state of the queue, if there is any.
3. Sort the URIs by hostname.
4. Add URIs to the jobs based on how many selectors there are and how
   well known the host is.
5. Crawl the jobs.
6. Add the newly found selectors to the queue, filtering out those that
   have been crawled already.
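As an illustration, this loop could be sketched in Python roughly as
below. It is a minimal sketch, not the eomyidae source: gopher_fetch(),
the pickle state file, and the fixed per-host job limit are assumptions
made for this example.

    # Minimal sketch of the crawl loop above -- NOT the eomyidae source.
    # Assumes gopher URIs of the form gopher://host:port/1/selector.
    import pickle
    import socket
    from collections import defaultdict
    from urllib.parse import urlparse

    STATE_FILE = "state.pickle"  # hypothetical state cache

    def gopher_fetch(host, port, selector):
        """Send one selector and return the raw response text."""
        with socket.create_connection((host, port), timeout=10) as sock:
            sock.sendall(selector.encode("utf-8") + b"\r\n")
            data = b""
            while chunk := sock.recv(4096):
                data += chunk
        return data.decode("utf-8", errors="replace")

    def menu_links(text):
        """Yield (itemtype, selector, host, port) from a menu (RFC 1436)."""
        for line in text.splitlines():
            display, *rest = line.split("\t")
            if display and len(rest) >= 3:
                yield display[0], rest[0], rest[1], rest[2]

    def crawl(start_uri):
        try:  # 2. Load the old state of the queue, if there is any.
            with open(STATE_FILE, "rb") as f:
                queue, seen = pickle.load(f)
        except FileNotFoundError:  # 1. Otherwise seed the queue.
            queue, seen = [start_uri], set()

        by_host = defaultdict(list)  # 3. Sort the URIs by hostname.
        for uri in queue:
            by_host[urlparse(uri).hostname].append(uri)

        queue = []
        for host, uris in sorted(by_host.items()):
            # 4. Build jobs per host; the real scheduler weighs how many
            #    selectors a host has and how well known it is. The cap
            #    of 10 jobs per host is an arbitrary stand-in here.
            for uri in uris[:10]:
                url = urlparse(uri)
                selector = url.path[2:]  # strip the "/1" item-type prefix
                try:  # 5. Crawl the job.
                    menu = gopher_fetch(url.hostname, url.port or 70,
                                        selector)
                except OSError:
                    continue  # host unreachable, skip for now
                seen.add(uri)
                for itype, sel, h, p in menu_links(menu):
                    link = "gopher://%s:%s/1%s" % (h, p, sel)
                    # 6. Queue new menus, skip already crawled selectors.
                    if itype == "1" and link not in seen:
                        queue.append(link)

        with open(STATE_FILE, "wb") as f:  # cache all state for restarts
            pickle.dump((queue, seen), f)

    if __name__ == "__main__":
        crawl("gopher://bitreich.org/1/lawn")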
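The robots.txt handling described in the next sections, including the
crawl-delay parameter mentioned later, could build on Python's standard
urllib.robotparser once the file has been fetched over gopher. Again a
sketch under assumptions, reusing the hypothetical gopher_fetch() helper
from above:

    # Sketch of the robots.txt check. urllib.robotparser is standard
    # library; gopher_fetch() is the hypothetical helper from above.
    import urllib.robotparser

    def robots_policy(host, port):
        """Return a parser for the host's robots.txt, or None if absent."""
        rp = urllib.robotparser.RobotFileParser()
        try:
            rp.parse(gopher_fetch(host, port, "/robots.txt").splitlines())
        except OSError:
            return None  # no robots.txt reachable: crawling is allowed
        return rp

    # Usage sketch, with a host taken from this talk:
    rp = robots_policy("bitreich.org", 70)
    if rp is None or rp.can_fetch("eomyidae", "/something"):
        delay = rp.crawl_delay("eomyidae") if rp else None
        # ... crawl, sleeping `delay` seconds between requests ...

can_fetch() and crawl_delay() are real urllib.robotparser methods; the
gopher transport and the fallback when no robots.txt is reachable are
the assumed parts.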
## What has been implemented so far?

* robots.txt support
  * Some details are missing.
  * User-Agent: eomyidae
* A block list for manual intervention.
* Effective caching of all state, so if something goes wrong, just
  restart.
* An information page at gopherproject.org.

## What is crawled now?

* Eomyidae only crawls menus.

## How do you know Eomyidae crawled you?

* When it requests robots.txt, eomyidae is friendly and gives you a hint
  on how to control it. You should see this in your logs:

      This is eomyidae, your friendly crawler. See \
      gopher://gopherproject.org/1/eomyidae for more info. \
      Have a nice day!

## How can you block Eomyidae from ever coming back?

* Put this into a robots.txt that is reachable at domain/0/robots.txt:

      User-agent: eomyidae
      Disallow: /

* Or, if you do not like any crawlers:

      User-agent: *
      Disallow: /

## Tips for robots.txt writers.

* Do not use menu item types in the pattern.

      Disallow: /1/something
      Disallow: /something

  Just /something is enough.

## Where do we start in gopherspace to crawl?

Eomyidae simply uses the gopher lawn:

    gopher://bitreich.org/1/lawn

If you are in the lawn, you will be indexed. This gets people to
auto-sort or auto-curate their links.

## Statistics so far.

* Eomyidae has seen 924 gopher servers.
* 4523668 unique selectors have been crawled.

## What is the future plan?

* Be as helpful as possible.
* Do not be annoying.
* Publish data for reuse.

## Be as helpful as possible.

* There is now a gopher-validator on bitreich. The plan is to run it on
  the crawled menus and to find a simple way to tell or report to server
  owners what is wrong.
* At first it can be used for the lawn, to check whether links are still
  active.
* Allow some easy way to access wayback machines.

## Do not be annoying.

* Respect the crawl-delay parameter in robots.txt.
* Only crawl the front page of a gopherhole over and over again to check
  for updates.
* If someone wants the crawling stopped, stop it. That is why there is
  the info page at gopherproject.org, so you are able to contact me.

## Publish data for reuse.

* So far the data is not published.
* There are known formats, like WebArc, but those are inefficient.
* See http://commoncrawl.org/ for details.
* My idea is to publish the raw menus in a nicely sorted way.

## Crawl other file formats?

* There is the big GDPR problem with downloading all files.
* The wayback machines do it.
* I will try to see how much useful information we get from menus alone:
  whether authors used good descriptions that make texts easy to find,
  or whether some other parsing is needed.
* Eomyidae should be simple in its processing too.
* Unless we buy Amazon.

## What is currently in development?

* Separation of the crawling arbiters, so they do not hammer some
  servers in some situations, as happens now.
* Recrawling based on timestamps.
* A simple publishing method for the data on gopherproject.org.
* A basic search.
* It all depends on me as a hobbyist having the idea and implementing
  it.
* Of course you can help too.

## Eomyidae has been published today.

    git://bitreich.org/eomyidae
    gopher://bitreich.org/1/scm/eomyidae

* So far it is implemented in Python, as a prototype.
* That is what Python is for.
* Please wait for the first release tag; there will be big changes.

## Goopher

["Goopher" in ASCII art, in the style of the Google logo]

    [ ________________________________________ ]
                     [ search ]

Thanks josuah, for creating the proposal. ;)

## Questions for the future.

* Ranking?
  * See the discussion on gopher during the day.
* Graphs and connections?
  * See the discussion on gopher during the day.
* Regular publishing of data with timestamps.
  * For wayback machines.

## Questions?

Do you have any questions?

## Thanks

Thank you very much for listening.

Christoph Lohmann <20h@r-36.net>

Or __20h__ at #bitreich-en on Freenode.