2018-04-09 ___D_e_a_l_i_n_g__w_i_t_h__r_o_g_u_e__c_r_a_w_l_e_r_s________________

Today I got hit by a crawler that thinks indexing all of my stagit
repo pages is a good idea. Now I am unsure about the usefulness of a
robots.txt file: if someone wants to access a selector, or all of
them, fine by me. As long as my data volume limit is not hit by it,
let them access my selectors.

But I have seen a lot of spiders creating selectors that aren't
valid, and I think one needs to deal with this properly.

I am implementing the following steps:

* Add a pf(1) table for greylisting potential spammers
* Add some tarpit selectors that will trigger another check whether
  the calling IP is already in the greylist.
* If the calling IP is in the greylist and is hitting a bogus
  selector again, move it to the blacklist
* Blacklisted IPs will get blocked from the system entirely for X hours
* The tarpit daemon will slowly respond to each request with a huge,
  potentially never ending text file stating some explanation and
  then hang up
* A cron job will clean up the blacklist after a while.

So how to do this with pf(1)? Turns out to be quite easy:

'''pf.conf
table <spammers-black> persist
block in on egress proto tcp from <spammers-black> to port 70
'''

The entries can be filled with pfctl(1); I am using a simple script
called update-pf:

'''shell
# pfctl -t spammers-black -T replace -f /var/gopher/blacklist
'''

And deleted with the '-T expire <seconds>' command. The former will
be done within the trap CGI and the latter in a cronjob.

Note that this script is for geomyidae, other servers do not provide
REMOTE_ADDR. Check the documentation (or better, the source!) of your
gopher server.

The CGI:

'''shell
#!/bin/ksh

# first visit: remember the IP in the greylist,
# second visit: move it from the greylist to the blacklist
grep "$REMOTE_ADDR" /var/gopher/greylist > /dev/null

if [ "$?" -ne "0" ]; then
	echo "$REMOTE_ADDR" >> /var/gopher/greylist
else
	sed -i.bak "s,$REMOTE_ADDR,,g" /var/gopher/greylist
	echo "$REMOTE_ADDR" >> /var/gopher/blacklist
fi

doas /sbin/update-pf 2>/dev/null

gopher-tarpit
'''

gopher-tarpit is just a dumb, slowly sending program; you can use
anything really. Adjust the server settings to your needs, please.
It sends some selectors pointing to the CGI again:

'''c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

char message[] =
	"i\tHi this is a tarpit...\tInfo\tvernunftzentrum.de\t70\r\n"
	"iFollow any of the links below or this selector again, and you will be banned\tInfo\tserver\tport\r\n"
	"1Some uninteresting content (do not follow!)\t/pit/\tvernunftzentrum.de\t70\r\n"
	"1More uninteresting content (do not follow!)\t/pit/\tvernunftzentrum.de\t70\r\n"
	".\r\n";

int main (int argc, char **argv)
{
	size_t l = strlen(message);

	/* drip the menu out one byte per second, then hang up */
	for (int i = 0; i < l; i++) {
		putchar(message[i]);
		fflush(stdout);
		sleep(1);
	}

	return 0;
}
'''
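One detail not shown above: the CGI runs as the gopher daemon's user,
so doas(1) has to be told to let that user run the update script as
root. A minimal sketch for /etc/doas.conf, assuming the hypothetical
user name _gopher for the daemon:

'''doas.conf
# let the gopher user reload the pf table, and nothing else
permit nopass _gopher as root cmd /sbin/update-pf
'''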
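The cleanup cronjob from the last step of the list is also not
spelled out above. A sketch of what it could look like, assuming
blacklisted IPs should be released after roughly a day; the second
pfctl call rewrites the blacklist file so that the next '-T replace'
does not simply re-add the expired addresses:

'''crontab
# root's crontab: once an hour, drop table entries older than 24 hours
# and rebuild the blacklist file from what is still blocked
0 * * * * pfctl -t spammers-black -T expire 86400 && pfctl -t spammers-black -T show > /var/gopher/blacklist
'''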
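To check that the whole chain works, something like the following
should do, assuming the trap is reachable under the /pit/ selector.
Note that the second request will blacklist the address you are
testing from, so do it from an address you can afford to ban:

'''shell
# first request puts us on the greylist, second one on the blacklist
printf '/pit/\r\n' | nc vernunftzentrum.de 70
printf '/pit/\r\n' | nc vernunftzentrum.de 70
# on the server, the address should now show up in the table
pfctl -t spammers-black -T show
'''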