2025-08-21 Looking at web server logs
=====================================

@bookandswordblog@scholar.social recently wondered how to look at web server logs. Ever since I started fighting the bots, I've had to look at the web server stats, and my admin folder is full of small scripts. But if you're new to it all, how do you start?

Where are the files?
--------------------

If you use Apache on Debian, the current log file is /var/log/apache2/access.log, so look for something similar. If you use nginx instead of Apache, it might be in a folder like /var/log/nginx. The log file itself might be named differently, too.

Log format
----------

The log file format can vary, too. If you use Apache on Debian, the log formats are set in /etc/apache2/apache2.conf using the LogFormat directive, and each virtual host can pick one of them using the CustomLog directive. See Log Files.

In my case, I use a config file that sets the following:

```
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" %>D" vhost_combined
CustomLog ${APACHE_LOG_DIR}/access.log vhost_combined
```

This adds %>D at the end of the format and is otherwise identical to the default vhost_combined format. The format codes are documented on the mod_log_config page.
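Since the only change here is the trailing duration in microseconds, a quick sanity check is to pull out that last field and convert it to milliseconds. A minimal sketch, run on an invented sample line in the format above (user agent abbreviated):

```shell
# Convert the trailing %>D field (microseconds) to milliseconds.
# The sample line is made up to match the vhost_combined format above.
line='social.alexschroeder.ch:443 37.27.248.47 - - [21/Aug/2025:00:59:49 +0200] "POST /users/alex/inbox HTTP/1.1" 202 4207 "-" "Mastodon/4.4.2" 33180'
echo "$line" | awk '{printf "%.1f ms\n", $NF / 1000}'   # $NF is the last field → 33.2 ms
```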
An example line from my logs:

```
social.alexschroeder.ch:443 37.27.248.47 - - [21/Aug/2025:00:59:49 +0200] "POST /users/alex/inbox HTTP/1.1" 202 4207 "-" "Mastodon/4.4.2 (http.rb/5.3.1; +https://beige.party/)" 33180
```

This means:

* a request to host social.alexschroeder.ch on port 443
* from IP address 37.27.248.47
* with no remote log name
* and no remote user
* on 2025-08-21 00:59:49 CEST
* posted information to /users/alex/inbox (to read information the request would have used GET instead of POST)
* the response was a status 202 (see HTTP response status codes for a list)
* and contained 4207 bytes
* with no referrer provided
* from a Mastodon user agent self-identifying as from beige.party
* taking 33180 microseconds, or about 33 milliseconds (this is the information I added)

Processing the log using awk
----------------------------

I'm going to use awk to process the log files. It's a very old programming language. Every line is split into words based on spaces. Each word is numbered and available as a variable starting with the dollar sign and its number. Therefore, when processing the access.log file using the format shown above, the numbered variables contain the following:

```
$1   social.alexschroeder.ch:443
$2   37.27.248.47
$3   -
$4   -
$5   [21/Aug/2025:00:59:49
$6   +0200]
$7   "POST
$8   /users/alex/inbox
$9   HTTP/1.1"
$10  202
$11  4207
```

You'll note that those square brackets and double quotes are sometimes very annoying. 🤨

A block is enclosed in braces { like this } and is executed for every line. You don't need to declare variables before assigning to them. An element of an associative array (a hash map in Perl or Java, a dictionary in Python) is accessed using square brackets like[this]. A block after END is executed just once, at the end of the input.

Counting the hits per IP number
-------------------------------

We use an associative array where the key is the IP address of the visitor and the value is the number of hits.
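The counting idiom on its own looks like this, counting words fed in on standard input (a minimal sketch with made-up input):

```shell
# Count occurrences with an associative array, one word per line:
# count[$1]++ tallies each word, the END block prints the totals.
printf 'a\nb\na\n' \
| awk '{count[$1]++} END {for (w in count) print count[w], w}' \
| sort --numeric-sort --reverse
# → 2 a
#   1 b
```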
At the end, we print the number of hits, a tab, and the IP address, for every address. The ++ operator increments the operand by one. $2 is the IP address, and hits is the associative array. We then use sort --numeric-sort --reverse to sort these lines by their hits, putting more hits at the top. head --lines=20 then prints the first 20 lines.

```
awk '{hits[$2]++} END {for (ip in hits) printf "%d\t%s\n", hits[ip], ip}' /var/log/apache2/access.log \
| sort --numeric-sort --reverse \
| head --lines=20
```

Counting the bytes per IP number (the bandwidth used)
-----------------------------------------------------

In order to sum the bandwidth used, we need to sum the bytes sent per IP number. The awk script here does that and prints the number of MB (1000000 bytes) per IP number. We then use sort --numeric-sort --reverse to sort these lines by their volume, putting more volume at the top. head --lines=20 then prints the first 20 lines.

```
awk '{vol[$2] += $11} END {for (ip in vol) printf "%dM\t%s\n", vol[ip]/1E6, ip}' /var/log/apache2/access.log \
| sort --numeric-sort --reverse \
| head --lines=20
```

Ranking the requests by popularity
----------------------------------

In order to rank the requests, we need to print the requests and count them, then sort them. The awk script just prints the request itself; sort sorts the lines and uniq --count counts the unique lines (it requires sorted input). We then use sort --numeric-sort --reverse to sort these lines, putting more hits at the top. head --lines=20 then prints the first 20 lines.

```
awk '{print $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

Sometimes it can be interesting to note which requests are not OK (status 200). Instead of starting with the whole access log, we only print requests if the status is not 200. This uses a feature of awk that we haven't used before: you can have a condition before the block.
```
awk '$10 != 200 {print $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

And if you're interested in the status itself:

```
awk '$10 != 200 {print $10, $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

Or if you're interested only in errors for one particular virtual host:

```
awk '$1 == "alexschroeder.ch:443" && $10 >= 400 {print $10, $8}' /var/log/apache2/access.log \
| sort \
| uniq --count \
| sort --numeric-sort --reverse \
| head --lines=20
```

#Administration #Butlerian_Jihad
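Postscript: if you want to try these pipelines without access to a real server, you can run them against a throw-away log. A self-contained sketch with three invented lines, counting hits per IP:

```shell
# Build a tiny fake access log (all values invented) and count hits per IP.
log=$(mktemp)
cat > "$log" <<'EOF'
example.org:443 10.0.0.1 - - [21/Aug/2025:00:00:01 +0200] "GET / HTTP/1.1" 200 512 "-" "test" 100
example.org:443 10.0.0.2 - - [21/Aug/2025:00:00:02 +0200] "GET /feed HTTP/1.1" 404 128 "-" "test" 100
example.org:443 10.0.0.1 - - [21/Aug/2025:00:00:03 +0200] "GET /about HTTP/1.1" 200 256 "-" "test" 100
EOF
# Same pipeline as above, minus head (only two distinct IPs here).
awk '{hits[$2]++} END {for (ip in hits) printf "%d\t%s\n", hits[ip], ip}' "$log" \
| sort --numeric-sort --reverse
rm "$log"
```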