2025-08-21 Looking at web server logs
=====================================

@bookandswordblog@scholar.social recently wondered how to look at web server logs. Ever since I started fighting the bots, I've had to look at the web server stats, and my admin folder is full of small scripts. But if you're new to it all, how do you start?

Where are the files?
--------------------

If you use Apache on Debian, the current log file is /var/log/apache2/access.log, so look for something similar. If you use nginx instead of Apache, it might be in a folder like /var/log/nginx. The log file itself might be named differently, too.

Log format
----------

The log file format can vary, too. If you use Apache on Debian, the log formats are set in /etc/apache2/apache2.conf using the LogFormat directive, and each virtual host can pick one of them using the CustomLog directive. See Log Files.

In my case, I use a config file that sets the following:

```
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %>D" vhost_combined
CustomLog ${APACHE_LOG_DIR}/access.log vhost_combined
```

This adds %>D at the end of the format and is otherwise identical to the default vhost_combined format. The format codes are documented on the mod_log_config page.
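Since the only change here is the trailing duration in microseconds, a quick sanity check is to pull out that last field and convert it to milliseconds. A minimal sketch, run on an invented sample line in the format above (user agent abbreviated):

```shell
# Convert the trailing %>D field (microseconds) to milliseconds.
# The sample line is made up to match the vhost_combined format above.
line='social.alexschroeder.ch:443 37.27.248.47 - - [21/Aug/2025:00:59:49 +0200] "POST /users/alex/inbox HTTP/1.1" 202 4207 "-" "Mastodon/4.4.2" 33180'
echo "$line" | awk '{printf "%.1f ms\n", $NF / 1000}'   # $NF is the last field → 33.2 ms
```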
An example line from my logs:

```
social.alexschroeder.ch:443 37.27.248.47 - - [21/Aug/2025:00:59:49 +0200] "POST /users/alex/inbox HTTP/1.1" 202 4207 "-" "Mastodon/4.4.2 (http.rb/5.3.1; +https://beige.party/)" 33180
```

This means:

* a request to host social.alexschroeder.ch on port 443
* from IP address 37.27.248.47
* with no remote log name
* and no remote user
* on 2025-08-21 00:59:49 CEST
* posted information to /users/alex/inbox (to read information the request would have used GET instead of POST)
* the response was a status 202 (see HTTP response status codes for a list)
* and contained 4207 bytes
* with no referrer provided
* from a Mastodon user agent self-identifying as from beige.party
* taking 33180 microseconds, or about 33 milliseconds (this is the information I added)

Processing the log using awk
----------------------------

I'm going to use awk to process the log files. It's a very old programming language. Every line is split into words based on spaces. Each word is numbered and available as a variable starting with the dollar sign and its number. Therefore, when processing the access.log file using the format shown above, the numbered variables contain the following:

```
$1   social.alexschroeder.ch:443
$2   37.27.248.47
$3   -
$4   -
$5   [21/Aug/2025:00:59:49
$6   +0200]
$7   "POST
$8   /users/alex/inbox
$9   HTTP/1.1"
$10  202
$11  4207
```

You'll note that those square brackets and double quotes are sometimes very annoying. 🤨

A block is enclosed in braces { like this } and is executed for every line. You don't need to declare variables before assigning to them. An element of an associative array (a hash map in Perl or Java, a dictionary in Python) is accessed using square brackets like[this]. A block after END is executed just once, at the end of the input.

Counting the hits per IP number
-------------------------------

We use an associative array where the key is the IP address of the visitor and the value is the number of hits.
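The counting idiom on its own looks like this, counting words fed in on standard input (a minimal sketch with made-up input):

```shell
# Count occurrences with an associative array, one word per line:
# count[$1]++ tallies each word, the END block prints the totals.
printf 'a\nb\na\n' \
| awk '{count[$1]++} END {for (w in count) print count[w], w}' \
| sort --numeric-sort --reverse
# → 2 a
#   1 b
```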
At the end, we print the number of hits, a tab, and the IP address, for every address. The ++ operator increments the operand by one. $2 is the IP address, and hits is the associative array. We then use sort --numeric-sort --reverse to sort these lines by their hits, putting more hits at the top. head --lines=20 then prints the first 20 lines.

```
awk '{hits[$2]++} END {for (ip in hits) printf "%d\t%s\n", hits[ip], ip}' /var/log/apache2/access.log \
| sort --numeric-sort --reverse \
| head --lines=20
```

Counting the bytes per IP number (the bandwidth used)
-----------------------------------------------------

In order to sum the bandwidth used, we need to sum the bytes sent per IP number. The awk script here does that and prints the number of MB (1000000 bytes) per IP number. We then use sort --numeric-sort --reverse to sort these lines by their volume, putting more volume at the top. head --lines=20 then prints the first 20 lines.

```
awk '{vol[$2] += $11} END {for (ip in vol) printf "%dM\t%s\n", vol[ip]/1E6, ip}' /var/log/apache2/access.log \
| sort --numeric-sort --reverse \
| head --lines=20
```

Ranking the requests by popularity
----------------------------------

In order to rank the requests, we need to print the requests and count them, then sort them. The awk script just prints the request itself; sort sorts the lines and uniq --count counts the unique lines (it requires sorted input). We then use sort --numeric-sort --reverse to sort these lines, putting more hits at the top. head --lines=20 then prints the first 20 lines.

```
awk '{print $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

Sometimes it can be interesting to note which requests are not OK (status 200). Instead of starting with the whole access log, we only print requests if the status is not 200. This uses a feature of awk that we haven't used before: you can have a condition before the block.
```
awk '$10 != 200 {print $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

And if you're interested in the status itself:

```
awk '$10 != 200 {print $10, $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

Or if you're interested only in errors for one particular virtual host:

```
awk '$1 == "alexschroeder.ch:443" && $10 >= 400 {print $10, $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

#Administration #Butlerian_Jihad
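Postscript: if you want to try these pipelines without access to a real server, you can run them against a throw-away log. A self-contained sketch with three invented lines, counting hits per IP:

```shell
# Build a tiny fake access log (all values invented) and count hits per IP.
log=$(mktemp)
cat > "$log" <<'EOF'
example.org:443 10.0.0.1 - - [21/Aug/2025:00:00:01 +0200] "GET / HTTP/1.1" 200 512 "-" "test" 100
example.org:443 10.0.0.2 - - [21/Aug/2025:00:00:02 +0200] "GET /feed HTTP/1.1" 404 128 "-" "test" 100
example.org:443 10.0.0.1 - - [21/Aug/2025:00:00:03 +0200] "GET /about HTTP/1.1" 200 256 "-" "test" 100
EOF
# Same pipeline as above, minus head (only two distinct IPs here).
awk '{hits[$2]++} END {for (ip in hits) printf "%d\t%s\n", hits[ip], ip}' "$log" \
| sort --numeric-sort --reverse
rm "$log"
```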