Analyze Apache access logs
Last month I volunteered to help analyze the traffic on a website. Previously, an online dashboard showed which URLs were accessed most, which referrers led visitors to the site, and so on (I guess similar to Google Analytics).
Since this dashboard is not available anymore, the owner stopped looking at these statistics, until I proposed analyzing the access logs, which are still available.
Log format
The format of the Apache access logs looks as follows
ww.xxx.yy.zz - - [03/Feb/2020:06:19:24 +0300] "GET /robots.txt HTTP/1.1" 202 197 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
Every line contains an entry, and some entries are quoted. Entries are space-separated, and the time entry, which contains a space, is surrounded by square brackets.
The first entry is (in our case) a scrambled IP address (I guess for compliance with the GDPR, but I did not investigate further). Between the square brackets is the access time: the date format is %d/%b/%Y, the time format is %H:%M:%S, and the offset follows.
After the date comes the request; in this case, the page robots.txt has been requested with GET over HTTP/1.1. Next is the HTTP response code, here 202, followed by the size of the returned object, here 197 bytes. The next entry is the referrer, but in this case there is none. The last entry is the user agent, which shows that this request came from a Google bot.
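To make the field layout concrete, here is a minimal sketch that prints the fields of each entry by name. The function name describe_entries is made up for illustration, and the sketch assumes that the quoted fields never contain an escaped double quote.

```shell
#!/bin/sh
# Print the main fields of each access-log line read from stdin.
# Splitting on double quotes keeps the request, the referrer, and the
# user agent intact even though they contain spaces.
# Assumption: no escaped double quotes inside those fields.
describe_entries() {
    awk -F'"' '{
        # $1 holds the host, the two "-" placeholders, and the timestamp
        split($1, head, " ")
        # $3 holds the status code and the response size
        split($3, tail, " ")
        timestamp = head[4] " " head[5]
        gsub(/\[|\]/, "", timestamp)
        printf "host=%s\ntime=%s\nrequest=%s\n", head[1], timestamp, $2
        printf "status=%s\nsize=%s\n", tail[1], tail[2]
        printf "referrer=%s\nagent=%s\n", $4, $6
    }'
}
```

Piping the example line from above through describe_entries prints one name=value pair per line, which is easier to read than the raw entry.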
A file full of those lines looks daunting to someone who normally uses graphical interfaces, and the log file had to be downloaded manually from the server, so I decided to help automate some of the boring tasks.
Automatically downloading the data
The first step was to automate the download of the access logs. It lowers the entry barrier for continuously checking the status.
In our case, as there was no official API for downloading the data, we resorted to curl. Firefox can export a request as a curl command, so after a couple of tries, we managed to automate the login, save the relevant cookie, and download the data.
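The exact commands depend on the hosting provider, so the following is only a sketch of the general shape: log in once, keep the session cookie in a cookie jar, and reuse it for the download. The URLs, the form field names, the environment variables LOG_USER and LOG_PASSWORD, and the function name download_logs are all placeholders, not the real values.

```shell
#!/bin/sh
# All URLs, form field names, and paths below are placeholders.
LOGIN_URL='https://hosting.example/login'
LOG_URL='https://hosting.example/logs/access.log'
DIR='/directory/where/curl/saves/the/downloaded/data'

download_logs() {
    cookie_jar=$(mktemp) || return 1
    # Log in and store the session cookie in the cookie jar
    curl --silent --show-error --fail \
        --cookie-jar "$cookie_jar" \
        --data-urlencode "user=$LOG_USER" \
        --data-urlencode "password=$LOG_PASSWORD" \
        "$LOGIN_URL" > /dev/null &&
    # Reuse the stored cookie to fetch the access log
    curl --silent --show-error --fail \
        --cookie "$cookie_jar" \
        --output "$DIR/access.log" \
        "$LOG_URL"
    status=$?
    rm -f "$cookie_jar"
    return "$status"
}
```

With --fail, curl returns a non-zero exit status on HTTP errors, so a failed login does not silently continue with the download.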
Saving data with git
Blindly downloading the data from the server might overwrite data that was already downloaded, for example if the log is manipulated on the server side, or if an error during the download corrupts the local copy and it is not possible, for whatever reason, to download it again. I therefore decided to version control the logs with git.
This is automated with
DIR='/directory/where/curl/saves/the/downloaded/data';
git -C "$DIR" add .;
git -C "$DIR" diff-index --cached --quiet HEAD || git -C "$DIR" commit --message='import'
Filtering bots
For many tasks, it makes sense not to consider bots. The easiest solution is to grep the relevant entries out and analyze the remaining ones. I did not remove them from the stored logs, as it is still useful to see where bots are crawling, for example whether they are trying to access pages that do not exist.
This is an incomplete list (sorted alphabetically) of the bots we grepped out
- '360Spider'
- 'Adsbot'
- 'Aggregator'
- 'AhrefsBot'
- 'AspiegelBot'
- 'Baiduspider'
- 'CheckMarkNetwork'
- 'Cincraw'
- 'Dataprovider'
- 'DingTalkBot'
- 'Discordbot'
- 'DotBot'
- 'Feedly'
- 'GarlikCrawler'
- 'Go-http-client'
- 'Google Favicon'
- 'Googlebot'
- 'K7MLWCBot'
- 'LightspeedSystemsCrawler'
- 'LinkLint-spider'
- 'LinkWalker'
- 'Linkbot'
- 'LinkedInBot'
- 'MJ12bot'
- 'MSOffice'
- 'MTRobot'
- 'MauiBot'
- 'MegaIndex'
- 'MojeekBot'
- 'NewsBlur Feed Fetcher'
- 'NextCloud-News'
- 'PagePeeker'
- 'PingdomPageSpeed'
- 'Python-urllib'
- 'SEMrushBot'
- 'SafeDNSBot'
- 'Seekport Crawler'
- 'SemrushBot'
- 'SeobilityBot'
- 'SerendeputyBot'
- 'SeznamBot'
- 'SiteCheckerBotCrawler'
- 'Slackbot'
- 'SynologyChatBot'
- 'TelegramBot'
- 'TestBot'
- 'TombaPublicWebCrawler'
- 'Turnitin'
- 'TweetmemeBot'
- 'TwitterBot'
- 'W3C-checklink'
- 'YandexBot'
- 'YandexMobileBot'
- 'aiohttp'
- 'appengine'
- 'applebot'
- 'bingbot'
- 'cliqzbot'
- 'cloudsystemnetworks'
- 'coccocbot-image'
- 'coccocbot-web'
- 'commoncrawl'
- 'facebookexternalhit'
- 'linkfluence'
- 'msnbot'
- 'petalbot'
- 'proximic'
- 'python-requests'
- 'qwant'
- 'securityheaders'
- 'semalt'
- 'serpstatbot'
- 'smtbot'
- 'zoominfobot'
There are surely many more "official" bots, but as user agents are free text that can be freely set by the end user, there will always be some values we might need to add afterward.
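One possible way to apply the list above is a single grep call with one -e option per name. This is a sketch, not necessarily the exact command we used: the function name filter_bots is made up, only a handful of names are shown, and matching case-insensitively (-i) is my own choice here.

```shell
#!/bin/sh
# Drop every log line that contains one of the listed bot names.
# -v inverts the match, -i ignores case, -F treats names as plain strings.
# Only a few names from the list are shown; the others are added the same way.
filter_bots() {
    grep -v -i -F \
        -e 'AhrefsBot' \
        -e 'Googlebot' \
        -e 'SemrushBot' \
        -e 'bingbot' \
        -e 'python-requests'
}
```

Note that this matches against the whole line and not only the user agent, so a request for a URL that happens to contain one of the names is dropped too. Also, grep exits with status 1 when no line is left, which matters if the surrounding script stops on errors.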
When filtering out the bots, I noticed that some of them tried to access pages like .git/HEAD, .env, wp-admin, or login. As the site is not made with WordPress and has neither a git repository nor a login page, those are probably malicious bots checking whether there is sensitive data to steal or a login page to brute-force.
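Such probes are easy to spot with awk, since the requested path is the seventh space-separated field. The following sketch counts the requests per probed path; the function name count_probes and the pattern (which only covers the four paths mentioned above) are illustrative and far from exhaustive.

```shell
#!/bin/sh
# Count requests per path for a few commonly probed locations.
# $7 is the requested path; the pattern is illustrative, not exhaustive.
count_probes() {
    awk '$7 ~ /^\/(\.git\/|\.env|wp-admin|login)/ {hits[$7]++}
         END {for (path in hits) print hits[path], path}' |
        sort -k1,1nr -k2
}
```

The output has the same "count path" shape as the other statistics in these notes, with the most probed path first.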
Sorting entries
The entries were, strangely, not sorted by date. This is impractical when looking at the log files, but fortunately, this can be fixed with sort, using the year, month, day, hour, minute, and second inside the fourth field as keys
LC_ALL=C sort -s -b --field-separator ' ' --key=4.9,4.12n --key=4.5,4.7M --key=4.2,4.3n --key=4.14,4.15n --key=4.17,4.18n --key=4.20,4.21n "logfile"
Filter by date
Analyzing the whole log at once, especially for gathering statistical data, does not make much sense. For example, we might want to look only at the traffic of the current month or year.
With grep and date it is possible to automate this task easily
# filter current month and current year (LC_ALL=C gives the English month name used in the log)
grep "\[[0-9][0-9]/$(LC_ALL=C date '+%b')/$(date '+%Y'):";
# filter only current year
grep "\[[0-9][0-9]/.*/$(date '+%Y'):";
Gathering some statistical data
While we were able to filter with ease the entries we are interested in, it is still not that practical to look at them all one by one.
With awk it is possible to easily query the data and gather some statistical information
For example, it is possible to count the HTTP status codes
awk '{dict[$9]++} END {for(k in dict) print k,"-",dict[k]}' | sort
The output, filtered to the beginning of the month, looked like
200 - 11506
206 - 2
301 - 2662
304 - 1044
400 - 26
401 - 4
403 - 24
404 - 2938
500 - 127
Similarly, it is possible to query which resources are the most requested
awk '{dict[$7]++} END {for(k in dict) print dict[k],k}' | sort --numeric --reverse;
The output looked like
2064 /favicon.ico
1510 /style.css
1475 /
1396 /index.xml
970 /robots.txt
601 /some-other-resource.html
477 /android_logo.svg
...
It is also possible to count the requests per day
awk -F'[[/]' '{date=sprintf("%.4s/%02d/%s", $4,(index("JanFebMarAprMayJunJulAugSepOctNovDec",$3)+2)/3,$2); dict[date]++} END {for(k in dict) print k,dict[k]}' | sort | less
The output looked like
2019/12/31 116
2020/01/01 189
2020/01/02 187
2020/01/03 203
2020/01/04 147
2020/01/05 150
2020/01/06 254
2020/01/07 347
2020/01/08 309
2020/01/09 200
goaccess
While I was implementing and packing those snippets in a shell script, it occurred to me that there might already be existing tools that I could install locally.
A quick search revealed goaccess, which seems to be the tool that I was looking for. It can analyze these (and other) log files and collect statistical data in a TUI like I was doing by hand.
In the beginning, I thought that except for the download functionality, I could throw everything away. But then I noticed that goaccess would always analyze the whole file.
Thus I was not able (or at least I did not find such functionality) to ignore specific bots, which make up the majority of requests, or to analyze only what happened in a specific time frame.
So while I no longer need most of the functions for gathering data, the ones for filtering the log entries (by time, or by being a bot or not) are essential for creating useful statistics.
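As an illustration of such a filtering step, here is a minimal sketch that combines both filters. The function name filter, its two optional arguments, and the short bot list are assumptions made for this example, not the actual script.

```shell
#!/bin/sh
# Hypothetical filter step: drop a few known bots, then keep only the
# entries of the given month and year (default: the current ones).
# Usage: filter [MONTH_ABBREVIATION] [YEAR], for example: filter Feb 2020
filter() {
    # LC_ALL=C gives the English month abbreviation used in the log
    month=${1:-$(LC_ALL=C date '+%b')}
    year=${2:-$(date '+%Y')}
    grep -v -i -F -e 'Googlebot' -e 'bingbot' -e 'AhrefsBot' |
        grep "\[[0-9][0-9]/$month/$year:"
}
```

Since the function reads from standard input and writes to standard output, it can sit in the middle of a pipeline in front of any other tool.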
Filtering also reduces the load time of goaccess, as it has less data to analyze
cat data | filter | goaccess --no-ip-validation --log-format='%h %^[%d:%t %^] "%r" %s %b "%R" "%u"' --date-format='%d/%b/%Y' --time-format='%H:%M:%S';
# or use
cat data | filter | goaccess --no-ip-validation --log-format='COMBINED';
where
- %h is the placeholder for the IP (notice I also needed to use --no-ip-validation)
- %^ ignores the two - (everything up to the [), those should be identd and userid, which are never set
- [%d:%t %^] time
  - %d date field
  - %t time field
  - %^ ignore time offset
- "%r" the request
- %s HTTP status code
- %b size returned to the client
- "%R" the referrer
- "%u" user-agent
If you have questions or comments, found typos or errors, or the notes are not clear, just contact me.