The Apache logo, licensed under the Apache License, Version 2.0

Analyze Apache access logs


Categories: web
Keywords: apache shell web

Last month I volunteered to help analyze the traffic on a website. Previously, some sort of online dashboard made it possible to monitor which URLs were accessed most, which were the referrers, and so on (I guess similar to Google Analytics).

Since this dashboard is not available anymore, the owner stopped looking at this statistical data, until I proposed analyzing the access logs, which are still available.

Log format

The format of the Apache access logs looks as follows

ww.xxx.yy.zz - - [03/Feb/2020:06:19:24 +0300] "GET /robots.txt HTTP/1.1" 202 197 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Every line contains one entry. Its fields are space-separated, some of them are quoted, and the time field, which contains a space, is surrounded by square brackets.

The first field is (in our case) a scrambled IP address (I guess for compliance with the GDPR; I did not investigate further). Between the square brackets is the access time: the date format is %d/%b/%Y, the time format is %H:%M:%S, followed by the offset.

After the date comes the request: in this case the page robots.txt has been requested through GET over HTTP/1.1. After that comes the HTTP response code, in this case 202, followed by the size of the returned object, in this case 197 bytes. The next field is the referrer, which here is empty. The last field is the user agent; in this case it shows that the page has been requested by a bot from Google.
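Broken down field by field, the example entry above reads:

ww.xxx.yy.zz                                                                # client IP address (scrambled)
- -                                                                         # identd and clientid, never set
[03/Feb/2020:06:19:24 +0300]                                                # access time with offset
"GET /robots.txt HTTP/1.1"                                                  # the request
202                                                                         # HTTP response code
197                                                                         # size of the returned object in bytes
"-"                                                                         # referrer (none in this case)
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"  # user agent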

A file full of those lines looks daunting to someone who normally uses graphical interfaces, and the log file had to be downloaded manually from the server, so I decided to help automate some of the boring tasks.

Automatically downloading the data

The first step to automate was the download of the access logs. It lowers the entry barrier for checking the status regularly.

In our case, as there was no official API for downloading the data, we resorted to curl. Firefox can generate curl commands from its network monitor, so after a couple of trials we managed to automate the login, save the relevant cookie, and download the data.
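The result boils down to something like the following sketch; the URLs and form fields are placeholders, since the actual ones depend entirely on the hosting provider

# log in and store the session cookie (URLs and form fields are placeholders)
curl --silent --cookie-jar cookies.txt \
     --data 'user=USERNAME' --data 'password=PASSWORD' \
     'https://provider.example/login';

# reuse the cookie to download the access log
curl --silent --cookie cookies.txt \
     --output access.log \
     'https://provider.example/logs/access.log';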

Saving data with git

As blindly downloading the data from the server might overwrite already downloaded data (if for some reason the logs are manipulated on the server side, or if the download fails, the local data gets corrupted, and it is afterwards impossible, for whatever reason, to download it again, …), I decided to version control it with git.

This is automated with

DIR='/directory/where/curl/saves/the/downloaded/data';
git -C "$DIR" add .;
git -C "$DIR" diff-index --cached --quiet HEAD || git -C "$DIR" commit --message='import';

Filtering bots

For many tasks it makes sense not to consider bots. The easiest solution is to grep the relevant entries out and analyze the remaining logs. I did not remove them from the logs themselves, as it is still useful to see where bots are crawling, for example whether they are trying to access pages that do not exist.

This is a non-exhaustive list (sorted alphabetically) of the bots we grepped out; a sketch of the filtering step follows at the end of this section

  • '360Spider'

  • 'Adsbot'

  • 'Aggregator'

  • 'AhrefsBot'

  • 'AspiegelBot'

  • 'Baiduspider'

  • 'CheckMarkNetwork'

  • 'Cincraw'

  • 'Dataprovider'

  • 'DingTalkBot'

  • 'Discordbot'

  • 'DotBot'

  • 'Feedly'

  • 'GarlikCrawler'

  • 'Go-http-client'

  • 'Google Favicon'

  • 'Googlebot'

  • 'K7MLWCBot'

  • 'LightspeedSystemsCrawler'

  • 'LinkLint-spider'

  • 'LinkWalker'

  • 'Linkbot'

  • 'LinkedInBot'

  • 'MJ12bot'

  • 'MSOffice'

  • 'MTRobot'

  • 'MauiBot'

  • 'MegaIndex'

  • 'MojeekBot'

  • 'NewsBlur Feed Fetcher'

  • 'NextCloud-News'

  • 'PagePeeker'

  • 'PingdomPageSpeed'

  • 'Python-urllib'

  • 'SEMrushBot'

  • 'SafeDNSBot'

  • 'Seekport Crawler'

  • 'SemrushBot'

  • 'SeobilityBot'

  • 'SerendeputyBot'

  • 'SeznamBot'

  • 'SiteCheckerBotCrawler'

  • 'Slackbot'

  • 'SynologyChatBot'

  • 'TelegramBot'

  • 'TestBot'

  • 'TombaPublicWebCrawler'

  • 'Turnitin'

  • 'TweetmemeBot'

  • 'TwitterBot'

  • 'Twitterbot'

  • 'W3C-checklink'

  • 'YandexBot'

  • 'YandexMobileBot'

  • 'aiohttp'

  • 'appengine'

  • 'applebot'

  • 'bingbot'

  • 'cliqzbot'

  • 'cloudsystemnetworks'

  • 'coccocbot-image'

  • 'coccocbot-web'

  • 'commoncrawl'

  • 'facebookexternalhit'

  • 'linkfluence'

  • 'msnbot'

  • 'petalbot'

  • 'proximic'

  • 'python-requests'

  • 'qwant'

  • 'securityheaders'

  • 'semalt'

  • 'serpstatbot'

  • 'smtbot'

  • 'zoominfobot'

There are surely many more "official" bots, but as the user agent is free text that can be set arbitrarily by the client, there will always be some values we need to add afterward.

When filtering out the bots, I noticed that some of them tried to access pages like .git/HEAD, .env, wp-admin, or login. As the site is not made with WordPress, there is no git repository, and there is no login page, these are probably malicious bots checking whether there is sensitive data to steal or a login page to brute-force.
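The filtering step itself is then a simple inverted grep. Here bots.txt is a hypothetical file containing the patterns from the list above, one per line and without the quotes

# keep only the entries whose user agent matches none of the bot patterns
grep -v -f bots.txt "logfile";

# or, for a quick ad-hoc check, with an inline pattern
grep -v -E 'Googlebot|bingbot|AhrefsBot|SemrushBot' "logfile";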

Sorting entries

The entries were, strangely, not sorted by date. This is impractical when looking at the log files, but fortunately this can be fixed with sort

sort by date
LC_ALL=C sort -s -b --field-separator=' ' --key=4.9,4.12n --key=4.5,4.7M --key=4.2,4.3n --key=4.14,4.15n --key=4.17,4.18n --key=4.20,4.21n "logfile"
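For reference, the --key offsets are character positions inside the fourth field, which looks like [03/Feb/2020:06:19:24

# character positions inside field 4, i.e. [03/Feb/2020:06:19:24
#   2-3   day    (numeric)
#   5-7   month  (name, hence the M modifier)
#   9-12  year   (numeric)
#   14-15 hour, 17-18 minute, 20-21 second
# the keys are listed from most to least significant: year, month, day, hour, minute, second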

Filtering by date

Analyzing the whole log at once, especially for gathering statistical data, does not make much sense. For example, we might want to look only at the traffic of the current month or year.

With grep and date it is possible to automate this task easily

# filter current month and current year
grep "\[[0-9][0-9]/$(date "+%b")/$(date '+%Y')\:";

# filter only current year
grep "\[[0-9][0-9]/.*/$(date '+%Y')\:";

Gathering some statistical data

While we can now easily filter the entries we are interested in, it is still not very practical to look at them all one by one.

With awk it is possible to easily query the data and gather some statistical information

For example, it is possible to collect and count the HTTP status codes

print http status
awk '{a[$9]++} END {for(k in a) print k,"-",a[k]}'

The output, filtered to the entries since the beginning of the month, looked like

200 - 11506
206 - 2
301 - 2662
304 - 1044
400 - 26
401 - 4
403 - 24
404 - 2938
500 - 127

And similarly, it is possible to query which resources are the most requested

sort resources by requests
awk '{a[$7]++} END {for(k in a) print a[k],k}' | sort --numeric --reverse;

The output looked like

2064 /favicon.ico
1510 /style.css
1475 /
1396 /index.xml
970 /robots.txt
601 /some-other-resource.html
477 /android_logo.svg
...
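Combining the two queries, it is also possible to check which non-existent resources are requested most; this is where the probes for wp-admin, .env, and the like mentioned above typically show up

# most requested resources among the 404 responses
awk '$9 == 404 {a[$7]++} END {for(k in a) print a[k],k}' | sort --numeric --reverse | head;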

goaccess

While I was implementing and packaging those snippets in a shell script, it occurred to me that there might already be some existing tool that I could install locally.

A quick search revealed goaccess, which seems to be the tool I was looking for. It can analyze these (and other) log files and present the statistical data in a TUI, similar to what I was doing by hand.

At first I thought that, except for the download functionality, I could throw everything away. But then I noticed that goaccess always analyzes the whole file.

Thus I was not able (or at least I did not find such functionality) to ignore the bots, which make up the majority of the requests, or to analyze only what happened in a specific time frame.

So, while I no longer need most of the functions for gathering data, the ones for filtering the log entries (by time, or by whether the request comes from a bot) are essential for creating useful statistics.

It also reduces the load time of goaccess, as it has to analyze less data

filter data before passing it to goaccess
cat data | filter | goaccess --no-ip-validation --log-format='%h %^[%d:%t %^] "%r" %s %b "%R" "%u"' --date-format='%d/%b/%Y' --time-format='%H:%M:%S';

# or use
cat data | filter | goaccess --no-ip-validation --log-format='COMBINED';

where

  • %h is the placeholder for the IP address (notice that I also needed to use --no-ip-validation)

  • %^ ignores the two - fields; those should be identd and clientid, which are never set.

  • [%d:%t %^] the time

    • %d date field

    • %t time field

    • %^ ignore time offset

  • "%r" the request

  • %s http status code

  • %b size returned to the client

  • "%R" the referrer

  • "%u" user-agent