Wednesday 12 October 2011

Top Command Line Tips: Apache Access Log

I said I should do some top command line tips. So I thought I'd start with some useful Apache access log monitoring and analysis commands.
These can come in useful if odd things are happening on your webserver: maybe you think there are dodgy spiders, or you need to see only the requests from your IP, that kind of thing.

Assumptions
I'm assuming the path to your access log is /var/log/apache/access.log. So adjust if it's different.
Also assuming your access file's LogFormat is 'combined' and that you actually have permission to view the logs.
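
To see why the commands below cut on the double-quote character, it can help to pull one line apart. This is a minimal sketch using a made-up log line (the IP, path, referrer and user agent are invented for illustration), assuming the standard 'combined' format:

```shell
# A made-up line in 'combined' format (all values are illustrative only).
line='203.0.113.7 - - [12/Oct/2011:10:15:32 +0100] "GET /index.html HTTP/1.1" 200 5120 "http://example.com/" "Mozilla/5.0 (compatible; ExampleBot/1.0)"'

# Splitting on " gives: 1 = IP and date, 2 = request, 3 = status and size,
# 4 = referrer, 5 = a single space, 6 = user agent.
request=$(printf '%s\n' "$line" | cut -d'"' -f2)
status_size=$(printf '%s\n' "$line" | cut -d'"' -f3)
referrer=$(printf '%s\n' "$line" | cut -d'"' -f4)
agent=$(printf '%s\n' "$line" | cut -d'"' -f6)

printf 'request:  %s\n' "$request"
printf 'status:   %s\n' "$status_size"
printf 'referrer: %s\n' "$referrer"
printf 'agent:    %s\n' "$agent"
```

Note that field 3 keeps its leading space, which is why some of the commands follow up with cut -d' ' -f2 to get at the status code.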

Commands Overview
Commands and options we'll be using are:

  • tail - view the end of file/input, last 10 lines by default.
    • -f - keep running and output new lines as the file grows
  • head - view the start of a file/input
  • grep - search a file/input
    • -E - extended regular expressions
    • -f - get search patterns from file
  • sort - err...
    • -r - sort descending
    • -g - numerical sorting
  • cut - split the input by a character and show a particular field
    • -d - specify the delimiting character, TAB by default.
    • -f - list of fields to output.
  • zcat - display a zipped file
  • awk - scripting language. My awk is pretty basic so I'm only using simple stuff. In short, the input is split by whitespace into $1, $2, $3, etc. If you prefer, you could replace awk with perl -ane, where the input is split into @F.
  • uniq - takes sorted data and collapses repeated lines into one
    • -c - outputs the count of occurrences at the start of the line.
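
Most of the commands below end in the same sort | uniq -c | sort -rg chain. A small sketch on made-up status codes shows why the first sort matters: uniq only collapses duplicate lines that sit next to each other.

```shell
# Made-up status codes, one per line (illustrative data only).
codes=$(printf '%s\n' 200 404 200 200 404 301)

# Without sorting first, uniq only merges adjacent repeats,
# so 200 and 404 each show up in more than one group.
unsorted=$(printf '%s\n' "$codes" | uniq -c | wc -l | tr -d ' ')

# Sorting first groups identical lines together; the final sort -rg
# puts the biggest count at the top.
counted=$(printf '%s\n' "$codes" | sort | uniq -c | sort -rg)

echo "groups without sort: $unsorted"
echo "$counted"
```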

1 - Most Common 404s (Page Not Found)
cut -d'"' -f2,3 /var/log/apache/access.log | awk '$4==404{print $4" "$2}' | sort | uniq -c | sort -rg
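
If you want to check that the filter picks out what you expect, you can try it on a couple of made-up lines first. Bear in mind that awk compares with ==; a single = would assign 404 to the field and so match every line. A minimal sketch, with invented sample data:

```shell
# Two made-up log lines: one 404 and one 200 (illustrative only).
sample=$(printf '%s\n' \
    '203.0.113.7 - - [12/Oct/2011:10:15:32 +0100] "GET /missing.html HTTP/1.1" 404 312 "-" "Mozilla/5.0"' \
    '203.0.113.7 - - [12/Oct/2011:10:15:40 +0100] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"')

# After the cut, awk sees: $1 method, $2 path, $3 protocol, $4 status, $5 size.
not_found=$(printf '%s\n' "$sample" | cut -d'"' -f2,3 | awk '$4 == 404 {print $4" "$2}')

echo "$not_found"
```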

2 - Count requests by HTTP code

cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg

3 - Largest Images
cut -d'"' -f2,3 /var/log/apache/access.log | grep -E '\.jpg|\.png|\.gif' | awk '{print $5" "$2}' | sort | uniq | sort -rg

4 - Filter Your IP's Requests
tail -f /var/log/apache/access.log | grep <your IP>
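
One thing to watch with a plain grep: an IP such as 10.0.0.1 also matches 10.0.0.12, and the dots match any character. A possible stricter variant is sketched here with a made-up helper called requests_from, which escapes the dots and anchors the match to the start of the line:

```shell
# requests_from IP [FILE...] - hypothetical helper name.
# Prints only lines that start with exactly IP followed by a space,
# as the client address does in 'combined' format.
# Reads stdin when no FILE is given.
requests_from() {
    # Escape the dots so they match literally, then anchor at line start.
    pattern="^$(printf '%s' "$1" | sed 's/\./\\./g') "
    shift
    grep -- "$pattern" "$@"
}

# Live usage, with an example address from the documentation range:
#   tail -f /var/log/apache/access.log | requests_from 203.0.113.7
```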

5 - Top Referring URLs
cut -d'"' -f4 /var/log/apache/access.log | grep -v '^-$' | grep -v '^http://www.yoursite.com' | sort | uniq -c | sort -rg

6 - Watch Crawlers Live
For this we need an extra file which we'll call bots.txt. Here's the contents:

Bot
Crawl
ai_archiver
libwww-perl
spider
Mediapartners-Google
slurp
wget
httrack

This just helps us to pick out common user agents used by crawlers.
Here's the command:
tail -f /var/log/apache/access.log | grep -f bots.txt

7 - Top Crawlers
This command will show you all the spiders that crawled your site with a count of the number of requests.
cut -d'"' -f6 /var/log/apache/access.log | grep -f bots.txt  | sort | uniq -c | sort -rg
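
Note that grep -f is case sensitive, so the pattern wget will not match a user agent that calls itself Wget. If you'd rather not worry about case, one option is to add -i. A sketch with a cut-down pattern list and made-up user agents:

```shell
# Cut-down pattern file and made-up user agents, for illustration only.
patterns=$(mktemp)
printf '%s\n' Bot wget slurp > "$patterns"

agents=$(printf '%s\n' \
    'Wget/1.12 (linux-gnu)' \
    'Mozilla/5.0 (compatible; Yahoo! Slurp)' \
    'Mozilla/5.0 (compatible; YandexBot/3.0)' \
    'Mozilla/5.0 (Windows NT 6.1) Firefox/7.0')

# Case sensitive: only the agent containing "Bot" with that exact
# capitalisation is counted.
strict=$(printf '%s\n' "$agents" | grep -c -f "$patterns")

# Case insensitive: the three crawlers match, the browser does not.
relaxed=$(printf '%s\n' "$agents" | grep -c -i -f "$patterns")

echo "case sensitive matches:   $strict"
echo "case insensitive matches: $relaxed"
rm -f "$patterns"
```

The trade-off is that -i makes short patterns like Bot a bit greedier, so check the results for false positives.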


How To Get A Top Ten
You can easily turn the commands above that aggregate (the ones using uniq) into a top ten by adding this to the end:
| head

That is, pipe the output to the head command.
Simple as that.
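
head prints ten lines unless told otherwise, so a top five or a top twenty is just head -n 5 or head -n 20. If you use the chain a lot, you could wrap it in a small shell function. This is only a sketch, and top_n is a made-up name:

```shell
# top_n [N] - hypothetical helper: count identical input lines and
# print the N most frequent (10 when N is omitted).
top_n() {
    sort | uniq -c | sort -rg | head -n "${1:-10}"
}

# Example: the three most common HTTP status codes.
#   cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | top_n 3
```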

Zipped Log Files
If you want to run the above commands on a logrotated (gzipped) file, start with a zcat on the file and pipe it into the first command, dropping the filename from that command.

So this:
cut -d'"' -f3 /var/log/apache/access.log | cut -d' ' -f2 | sort | uniq -c | sort -rg
Would become this:
zcat /var/log/apache/access.log.1.gz | cut -d'"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rg
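
If you want a report across the current log and the rotated ones in one go, one approach is to let gzip work out what needs unzipping. This sketch assumes GNU gzip, where the -f flag makes gzip -cd (which is what zcat runs) pass plain files through unchanged; all_logs is a made-up helper name:

```shell
# all_logs FILE... - hypothetical helper: print every given log,
# unzipping the compressed ones and passing plain files through as-is.
# Assumes GNU gzip: -c writes to stdout, -d decompresses, and -f lets
# input that is not gzipped through untouched.
all_logs() {
    gzip -cdf -- "$@"
}

# Example: status code counts across current and rotated logs.
#   all_logs /var/log/apache/access.log* | cut -d'"' -f3 | cut -d' ' -f2 | sort | uniq -c | sort -rg
```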

If there's another report you'd like to know how to do, just ask in the comments.


See Also: Top 10 command line shortcuts, Terminal tweaks and My favourite CLI posts (by others)