Maintain an Archive of AWStats Monthly Website Traffic Reports

I use AWStats to generate reports from my Apache webserver logs. I prefer statically linked reports, so instead of relying on the “Update now” option on the AWStats report page, I run the following script every hour or so as a cron job to update the reports.

#!/bin/bash
perl /path/to/awstats_buildstaticpages.pl -config=MYSITE -update -staticlinks -awstatsprog=/path/to/awstats.pl -dir=/var/www/MYSITE/statistics/ > /dev/null
chown MYUSERNAME:MYGROUPNAME /var/www/MYSITE/statistics/awstats.MYSITE.*

The result of that little script is that AWStats prepares a detailed report of webserver activity for the current month. That’s great, but how can you maintain an archive of detailed reports for previous months? Answer: write another bash script…

Now, this script is somewhat more complicated than the two-line effort above but it will prepare a set of detailed reports for the previous month, store them in a date-named subfolder and generate a web page listing all sets of monthly reports it’s created so far. Just run it as a cron job on the first of every month and make sure it doesn’t conflict with the running of the regular update script above, especially if you’ve got AWStats’ file-lock option turned on.
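One way to keep the two cron jobs from running at the same time is to wrap both in a small lock helper. The following is only a sketch: `run_exclusive`, the `LOCK_DIR` location and the crontab paths are all made up for illustration and are not part of the scripts on this page.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of the scripts on this page): run a command
# only if no other wrapped AWStats job currently holds the lock.
# LOCK_DIR is an assumed location; any path writable by the cron user will do.
LOCK_DIR="${LOCK_DIR:-/tmp/awstats_MYSITE.lock}"

run_exclusive() {
	# mkdir is atomic: it fails if the lock directory already exists
	if ! mkdir "$LOCK_DIR" 2>/dev/null; then
		echo "Another AWStats job holds $LOCK_DIR; skipping." >&2
		return 1
	fi
	"$@"
	local status=$?
	rmdir "$LOCK_DIR"
	return $status
}

# Example crontab entries (paths and script names are placeholders):
#   15 * * * *  /path/to/run_exclusive.sh /path/to/awstats_update.sh
#   45 3 1 * *  /path/to/run_exclusive.sh /path/to/awstats_monthly_reports.sh

# When called with arguments, run them under the lock
if [ "$#" -gt 0 ]; then
	run_exclusive "$@"
fi
```

Note that a job which finds the lock taken is skipped rather than delayed, so it is still worth scheduling the monthly job at a minute when the hourly update is unlikely to be running.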

NOTE: This script won’t go back through your whole history of apache logs, generating AWStats reports for each month of activity it can find. That’s a job for another script…

Here’s a directory tree by way of explanation…

# Website root
/var/www/MYSITE/index.html

# AWStats detailed report for current month, generated by the two-line script above
/var/www/MYSITE/statistics/awstats.MYSITE.html

# AWStats detailed report for October, created by script below when it ran on 1st November 
/var/www/MYSITE/statistics/2011_10/awstats.MYSITE.html

# AWStats detailed report for September, created by script below when it ran on 1st October
/var/www/MYSITE/statistics/2011_09/awstats.MYSITE.html

# Web page created by script below with links to...
#   - Current month's detailed report
#   - Detailed report for October
#   - Detailed report for September
/var/www/MYSITE/statistics/index.html

Below is the script itself. It won’t work out of the box – there are a few variables to set first, mostly paths to AWStats and your webserver. You can copy and paste it from this page or download the script here. As with the other scripts I’ve made available to download, change the extension and permissions before running it. There’s not much more to say about the script here as I’ve made extensive notes in the file. I hope you find it useful!

#!/bin/bash
## ---------------------------------------------------------------------------
# awstats_monthly_reports.sh
## ---------------------------------------------------------------------------
# Script to generate a full set of statically linked awstats reports
# for the previous calendar month for an individual website.
#
# Author: Richard Leszczynski
# Email: contact NOSPAM @makerdyne.com
# Created: 30/10/2011
# Modified: 26/11/2011
## ---------------------------------------------------------------------------
#
# USAGE:
# 	To be run once a month at the beginning of the month.
# User running the script must have execute permission for the awstats perl
# scripts and write permissions for the directory in which the statistics
# report is to be created. (so 'root', most likely)
#
# BEHAVIOUR:
# 	1. Creates (if necessary) a 'statistics' subfolder for the website to
# store the monthly statistics reports.
# 	2. Creates a subfolder within 'statistics' for the monthly report that
# is to be created. Folder name will be a numeric representation of the year
# and month (e.g. 2011_10)
#	3. Creates a full set of statically-linked awstats reports for the
# previous month using awstats_buildstaticpages.pl
#	4. Changes the owner & group of the folder/reports created; ideally to
# those of a regular user, not root.
# 	5. Updates statistics/index.html with a link to the newly created
# report
#
#
# REQUIREMENTS:
#	awstats already configured for the website that the monthly report is
# to be generated for. For instructions, see
# http://awstats.sourceforge.net/docs/awstats_setup.html
#
# RECOMMENDATIONS:
# 	That an awstats.pl command such as
# perl /PATH/TO/awstats.pl -config=MYSITE -update -output -staticlinks > /PATH/TO/mysite/statistics/awstats.MYSITE.html
# is already running on a regular basis through a cron job in order to give
# you something to look at between the generation of each monthly report.
#
#	If you change the path of the report that the command above outputs
# to the statistics directory, it will be linked to within the generated
# statistics/index.html page
#
# It is also recommended that an awstats.pl command such as
# perl /PATH/TO/awstats.pl -config=MYSITE -update
# is added as a prerotate action within /etc/logrotate.d/apache2. See the
# awstats documentation for more information on this.
## ---------------------------------------------------------------------------
# Paths & Variables  Required: Change to suit your installation...
## ---------------------------------------------------------------------------
# Full path of your website's DocumentRoot - NO TRAILING SLASH!
# e.g. /var/www/MYSITE
WEBSITE_DIR="/var/www/MYSITE"
# Full path of awstats.pl script
AWSTATS_PATH="/path/to/awstats.pl"
# Full path of awstats_buildstaticpages.pl
AWSTATS_BSP_PATH="/path/to/awstats_buildstaticpages.pl"
# awstats website configuration name. The text you'd enter in place
# of MYSITE in the command 'perl awstats.pl -config=MYSITE -update'
MYSITE="mywebsitename"
# User and group to own the created monthly report
# e.g. USER=myusername , GROUP=users , or
# e.g. USER=www-data , GROUP=www-data
MYUSER=myusername
MYGROUP=mygroupname
## ---------------------------------------------------------------------------

# Check for pre-existing statistics directory
WEBSITE_DIR=${WEBSITE_DIR%/}
if [ ! -d "$WEBSITE_DIR" ]; then
	echo "$WEBSITE_DIR does not exist! Check path for WEBSITE_DIR has been set correctly. Exiting..."
	exit 1
fi
STATS_DIR=$WEBSITE_DIR/statistics
if [ ! -d "$STATS_DIR" ]; then
	mkdir "$STATS_DIR"
fi
chown $MYUSER:$MYGROUP "$STATS_DIR"

# New Subdirectory to hold the new set of monthly reports
LMONTH=$(date --date="last month" +%m)
LYEAR=$(date --date="last month" +%Y)

REPORT_DIR="$STATS_DIR/${LYEAR}_${LMONTH}"
if [ ! -d "$REPORT_DIR" ]; then
	mkdir "$REPORT_DIR"
fi
chown $MYUSER:$MYGROUP "$REPORT_DIR"

# Run awstats_buildstaticpages.pl to generate a complete set of
# statically linked reports for the last month
perl "$AWSTATS_BSP_PATH" -config=$MYSITE -update -month=$LMONTH -year=$LYEAR -staticlinks -awstatsprog="$AWSTATS_PATH" -dir="$REPORT_DIR" >/dev/null
chown $MYUSER:$MYGROUP "$REPORT_DIR/awstats.$MYSITE".*

# Create an index.html file to link to both the current/rolling report
# and a history of all the monthly reports created to date
INDEX_FILE=$STATS_DIR/index.html
echo "<!--File generated automatically by awstats_monthly_reports.sh-->" > "$INDEX_FILE"
echo "<html><body><h1>AWSTATS Website Statistics for $MYSITE</h1>" >> "$INDEX_FILE"
echo "<h2>Rolling report for this month</h2>" >> "$INDEX_FILE"
echo "<a href=\"awstats.$MYSITE.html\">1st $(date --date="today" +%B\ %Y) to today</a>" >> "$INDEX_FILE"

echo "<h2>Archive of monthly website activity reports</h2>" >> "$INDEX_FILE"
# obtain a list of directories within the STATS_DIR that match the date format which the
# monthly reports are stored in
cd "$STATS_DIR"
MONTHLY_DIRS=$(ls -rd */ | grep -E '^[12][9012][0-9]{2}_[01][0-9]/')
for dr in $MONTHLY_DIRS
do
	TEMP_YEAR=${dr:0:4}
	TEMP_MONTH=${dr:5:2}
	# Derive the month name directly from the directory's own year and month
	MONTH_NAME=$(date --date="$TEMP_YEAR-$TEMP_MONTH-01" +%B)
	echo "<a href=\"/statistics/${dr}awstats.$MYSITE.html\">$MONTH_NAME $TEMP_YEAR</a><p>" >> "$INDEX_FILE"
done
echo "</body></html>" >> "$INDEX_FILE"
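The index page turns each archive folder name into a readable label. As a minimal sketch of that step, assuming GNU date, the helper below converts a name such as 2011_10/ into "October 2011". `month_label` is a hypothetical name used only for this illustration.

```shell
#!/bin/bash
# Turn an archive directory name such as "2011_10/" into "October 2011".
# Assumes GNU date; month_label is a hypothetical helper, not part of the script above.
month_label() {
	local dir="${1%/}"          # drop a trailing slash if present
	local year="${dir:0:4}"
	local month="${dir:5:2}"
	# LC_ALL=C keeps the month name in English regardless of system locale
	LC_ALL=C date --date="${year}-${month}-01" +"%B %Y"
}

month_label "2011_10/"
month_label "2011_09"
```

Building the date from the folder name itself means the label does not depend on the day or month on which the script happens to run.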

Automatically Update MaxMind’s GeoIP files

I recently installed the AWStats log file analyser for viewing my Apache web server logs and activated the plugins that enable its reports to show the geographic location of IP addresses that have visited the server. In the AWStats configuration file you can choose which of two different geolocation database sources to use to identify IP addresses by country. There’s the GeoIPfree plugin, which uses a database that’s reportedly out of date, and the GeoIP plugin, which uses a database that is actively maintained by a company called MaxMind. MaxMind also go one step further and provide a database and plugin, GeoIP_City_Maxmind, to locate IP addresses by city. I chose to activate the MaxMind plugins.

Now, although Debian provide the MaxMind country database as the geoip-database package in their repositories, I couldn’t work out if it was the latest version or not, but Debian being Debian, I suspect not. There was also no sign of the city database in Debian’s repositories, so I decided to write a modest bash script for downloading the latest databases direct from MaxMind’s website. Their free databases are updated during the first week of every month, so I have the bash script running as a cron job on the 10th of each month. You can download a copy of the script here.

Make sure that you change the file extension and permissions after you’ve downloaded it. I’ve changed the file extension as WordPress doesn’t allow the uploading of .sh files by default and simply changing the extension is quicker than trying to modify the media uploader’s behaviour.

mv /path/to/GeoIP_update.txt /home/myuser/bin/GeoIP_update.sh
chown myuser:mygroup /home/myuser/bin/GeoIP_update.sh
chmod 744 /home/myuser/bin/GeoIP_update.sh

Alternatively, copy-and-paste from below, but whichever method you use there are several variables that must be set before using it for the first time.

#!/bin/bash
# Script to update Maxmind's GeoIP files
# Author: Richard Leszczynski
# Contact: contact NOSPAM @makerdyne.com
# Created: 18 November 2011
# Last Modified: 21 November 2011

# --------------------------------------------------------------------
# Purpose:
# --------------------------------------------------------------------
# To download the latest versions of the files provided by
# www.MaxMind.com for geolocating IP addresses.

# --------------------------------------------------------------------
# Recommended Usage:
# --------------------------------------------------------------------
# Run as a cron job once a month, about 8-10 days into the month.
# If run as a user, not root, the user must have write permissions
# on both the DOWNLOADS_DIR and GEOIP_DIR directories (see below).
#
# Script should be silent unless you've made any configuration errors.
# If nothing downloads, remove the -q option from wget and run the
# script from the command line. wget is too complicated for me to
# grab errors from it but see the "Variables set by user" section
# below for why it might occasionally go wrong.

# --------------------------------------------------------------------
# Variables to be set by user:
# --------------------------------------------------------------------
#
# Path to downloads folder. This folder will keep a copy of the zipped
# GeoIP files in a subfolder called MaxMind_GeoIP .
# - The script requires that DOWNLOADS_DIR already exists. Create it
# before running the script or you will get an error.
# - A permanent copy of the zipped files is required to allow wget to
# inspect the existing, local files and only initiate a download if
# it discovers newer versions on MaxMind's servers. Use of the wget -N
# option which enables this behaviour is requested by MaxMind to limit
# the load on their servers.
# - Also note that MaxMind are pretty hot on monitoring the number of
# downloads people make from their servers and appear to block IPs for
# 24-48 hours if you exceed a certain (and small! 6?) number of
# download attempts within a short space of time.
# - If you're having trouble downloading the files with this script,
# remove the -q (quiet) option from the wget line and run the script
# from the command line. If wget reports "Connection timed out", you
# have probably hit MaxMind's download attempt limit. Wait 24-48 hours
# and then try again. Don't forget to put -q back on the wget line
# before you run the script as a cron job again.
DOWNLOADS_DIR="/home/myuser/downloads"

# Path to in-use location of GeoIP files. The path that programs which
# will make use of the GeoIP files expect to find them.
# - The script requires that GEOIP_DIR already exists. Create it
# before running the script or you will get an error.
GEOIP_DIR="/usr/share/GeoIP"

# Maxmind provide files for geolocation of IP address to countries
# and also to individual cities. Do you also want to download the
# files for cities? [yes/no]
CITIES_ENABLED="yes"

# MaxMind have developed a database of IPv6 addresses. Choose whether
# you want to download the files for geolocating IPv6 addresses in
# addition to the default files for IPv4. [yes/no]
IPv6_ENABLED="no"

# Enter the user/group who you want to own the downloaded files.
MYUSER="myusername"
MYGROUP="mygroupname"

# There are 4 GeoIP files to download, the countries file, the cities file
# and their two equivalents for IPv6. The binary versions - not csv -
# are what you will need in most circumstances.
# Update the URLs here should MaxMind change their locations.
COUNTRIES="http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz"
CITIES="http://geolite.maxmind.com/download/geoip/database/GeoLiteCity.dat.gz"
COUNTRIES_IP6="http://geolite.maxmind.com/download/geoip/database/GeoIPv6.dat.gz"
CITIES_IP6="http://geolite.maxmind.com/download/geoip/database/GeoLiteCityv6-beta/GeoLiteCityv6.dat.gz"

# The script can send results by email when it finishes.
# Set to enable emailing of results [yes/no]
ENABLE_EMAIL="yes"
# Specify a valid email address
EMAIL_ADDRESS="me@mydomain.com"

# --------------------------------------------------------------------

DOWNLOADS_DIR="${DOWNLOADS_DIR%/}"
if [ ! -d "$DOWNLOADS_DIR" ]; then
	echo "The specified DOWNLOADS_DIR '$DOWNLOADS_DIR' does not exist!"
	echo "Check you have specified the correct path, or create it manually."
	exit 1
fi

GEOIP_DIR="${GEOIP_DIR%/}"
if [ ! -d "$GEOIP_DIR" ]; then
	echo "The specified GEOIP_DIR '$GEOIP_DIR' does not exist!"
	echo "Check you have specified the correct path, or create it manually."
	exit 1
fi

# Note: The files to be downloaded are compressed in .gz format and
# gunzip removes the original archive file after uncompressing its
# contents, so to allow wget to use its -N option, a copy of the
# archive (as downloaded) is kept in a subfolder of the DOWNLOADS_DIR
# for use by subsequent executions of the script.
MAXMIND_DIR="$DOWNLOADS_DIR"/MaxMind_GeoIP
if [ ! -d "$MAXMIND_DIR" ]; then
	mkdir "$MAXMIND_DIR"
	chown $MYUSER:$MYGROUP "$MAXMIND_DIR"
fi

# Build the list of URLs to download, starting with the countries file
ARRAY=("$COUNTRIES")
CITIES_ENABLED=${CITIES_ENABLED,,}
if [ "$CITIES_ENABLED" == "yes" ]; then
	ARRAY+=("$CITIES")
fi
IPv6_ENABLED=${IPv6_ENABLED,,}
if [ "$IPv6_ENABLED" == "yes" ]; then
	ARRAY+=("$COUNTRIES_IP6")
	if [ "$CITIES_ENABLED" == "yes" ]; then
		ARRAY+=("$CITIES_IP6")
	fi
fi

DOWNLOADED=""
for url in "${ARRAY[@]}"
do
	cd "$MAXMIND_DIR"
	FILENAME=$(basename "$url")
	FILE_OLD_CHECKSUM=0
	TIME_STAMP=""
	if [ -f "$FILENAME" ]; then
		FILE_OLD_CHECKSUM=$(md5sum "$FILENAME" | cut -d ' ' -f 1)
		TIME_STAMP="-N"
		echo "File checksum is $FILE_OLD_CHECKSUM"
	fi
	wget $TIME_STAMP -q --tries=10 --wait=120 --limit-rate=50k "$url"
	# Basic check for first-run wget problems below - it won't catch
	# errors if a local copy of the file already exists.
	if [ ! -f "$FILENAME" ]; then
		echo "Unable to download '$FILENAME' from the location:"
		echo "$url"
		echo "and no local copy exists. Check URL and read comments within script file for notes on wget."
		exit 1
	# Copy downloaded file to where it will be used
	else
		FILE_NEW_CHECKSUM=$(md5sum "$FILENAME" | cut -d ' ' -f 1)
		# only copy file from DOWNLOADS_DIR to GEOIP_DIR if it has changed
		if [ "$FILE_OLD_CHECKSUM" != "$FILE_NEW_CHECKSUM" ]; then
			DOWNLOADED="$DOWNLOADED $FILENAME" 
			cp -fp "$FILENAME" "$DOWNLOADS_DIR"
			cd "$DOWNLOADS_DIR"
			gunzip -fq "$FILENAME"
			mv -u "${FILENAME%.gz}" "$GEOIP_DIR"
		fi
	fi
done

# Use the ENABLE_EMAIL variable set in the user configuration section above
ENABLE_EMAIL=${ENABLE_EMAIL,,}
if [ "$ENABLE_EMAIL" == "yes" ]; then
	echo "GeoIP_update.sh downloaded the following files: $DOWNLOADED" | mail -E -s "GeoIP_update.sh Results" "$EMAIL_ADDRESS"
fi
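The script decides whether to install a file by comparing its checksum before and after wget runs. The sketch below isolates that check using a throwaway local file instead of a real download, so it can be tried without touching MaxMind's servers. `file_changed` is a hypothetical helper name and is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: succeed (return 0) only when FILE's current MD5 checksum
# differs from the checksum recorded earlier.
file_changed() {
	local file="$1" old_sum="$2" new_sum
	new_sum=$(md5sum "$file" | cut -d ' ' -f 1)
	[ "$old_sum" != "$new_sum" ]
}

# Demo with a throwaway file instead of a real download
tmp=$(mktemp)
echo "version 1" > "$tmp"
before=$(md5sum "$tmp" | cut -d ' ' -f 1)

# Nothing has touched the file yet, so the check reports no change
file_changed "$tmp" "$before" || echo "unchanged: nothing to install"

# Simulate wget -N fetching a newer copy
echo "version 2" > "$tmp"
file_changed "$tmp" "$before" && echo "changed: copy to GEOIP_DIR"
rm -f "$tmp"
```

Passing 0 as the old checksum, as the script does when no local copy exists yet, always counts as a change, so a first-time download is always installed.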