Traffic analysis: the mystery of the missing visitors

We’ve already looked at using analytics to find out about the visitors to your website, and had a closer look at Google Analytics (GA), currently the most popular webstats service.

Now, if you’re anything like me then you’ve got GA installed on your site and have compared the data it produces with the data thrown up by your web server’s stats (something called AWStats, in my case). More often than not, the figures don’t seem to agree – in fact, generally I’d say that GA reports substantially fewer visits than the server stats show.

Why should that be? The answer is that GA relies on a technique called page tagging, whereas logfile analysis depends on the data generated by your server as it records each individual file it serves. Let’s find out what that means in practice.

Page tagging

To implement GA on your website, you have to add a snippet of JavaScript code (the Google Analytics Tracking Code, or GATC) to every page that you want included in the stats gathering process. Every time the page is opened, the JavaScript sends data about the browser activity to Google’s servers, where the information is recorded and held ready for you to view. This has some advantages.

For one thing, it means that views of cached pages are recorded. If a page is cached by the user’s own browser or by the user’s ISP for future use, a subsequent visit to that page won’t be recorded in the website server’s logfiles – because the files have been retrieved from the cache and not from the server. So a substantial number of revisits to a site may be missed in stats generated from logfiles.

If the same page contains the GATC, however, the browser executes the snippet of JavaScript regardless of where it retrieves the page from – so even visits to cached pages get recorded. As a result, GA may actually give you a more accurate record of repeat visits to your website (or visits from the same ISP, in some cases) than your server’s own logfiles do!
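To see why the two counts drift apart, here’s a toy model in Python. It’s only a sketch: the `PageView` record, its `served_from_cache` flag and the sample visits are all invented for illustration, and real caches behave far less tidily than this.

```python
from dataclasses import dataclass


@dataclass
class PageView:
    """One page view, as a hypothetical record for this illustration."""
    visitor: str
    served_from_cache: bool  # True if a browser or ISP cache supplied the page


def count_logfile_hits(views):
    # The web server only logs requests that actually reach it,
    # so views answered from a cache never show up.
    return sum(1 for view in views if not view.served_from_cache)


def count_tagged_views(views):
    # The tracking snippet runs whenever the page is opened in a browser,
    # wherever the page came from (assuming nothing blocks the script).
    return len(views)


# Made-up sample: two visitors, each loading the page fresh and then
# coming back to a cached copy.
SAMPLE_VIEWS = [
    PageView("alice", served_from_cache=False),
    PageView("alice", served_from_cache=True),
    PageView("bob", served_from_cache=False),
    PageView("bob", served_from_cache=True),
]

if __name__ == "__main__":
    print("Logfile hits:", count_logfile_hits(SAMPLE_VIEWS))
    print("Tagged views:", count_tagged_views(SAMPLE_VIEWS))
```

With this sample the logfile sees two requests while the page tag sees all four views, which is the caching effect in miniature.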

On the other hand, there are some users and visits that GA misses.

Which way does the cookie crumble?

GA relies on the use of cookies to track visitors’ movements through a site that uses GA tracking. However, some users regard this as an invasion of their privacy and consequently choose not to allow GA’s cookies. The arrival of the EU Cookie Directive in UK law in May 2011, and the compliance deadline of 26 May 2012, have highlighted this issue very publicly, at least in the UK, and it’s now commonplace on major websites to see warnings about the use of cookies.

One month on, it’s unclear how far users of websites have taken to blocking GA cookies as a result of all this publicity. It could be an insignificant number, or it could be tens of thousands. But inevitably the effect will be to depress GA visit figures – the only question is, by how much?

“Block ad” tackle

A longer-standing problem is the existence of browser plug-ins that remove advertising from web pages. Some of these plug-ins may block the GATC, whether automatically or as a configurable option.

Much the same applies to users who’ve chosen to disable JavaScript. But this was a relatively small problem even five years ago (estimated at about three per cent of users in the US, and less than half that percentage in the EU). And with so many popular websites these days relying on JavaScript for basic functions, the number of users disabling it is likely to be vanishingly small by now.

Logfile analysis

OK, so we’ve had a look at the human visitors and how GA records them (or doesn’t, in some cases). But what about the logfiles? Are there any peculiarities that affect their figures?

Aye, robot

Where logfile analysis wins hands-down is on non-human visitors – the search engine spiders, or “bots”.

Bots typically don’t execute JavaScript on a page. As a result, webstats analysis packages that rely on page tags are likely to miss them (as we’ve seen, this can affect visits by humans too). Logfile analysis packages, on the other hand, can easily identify the bots – and often report their visits separately, so as not to skew the data and suggest that your website is more popular than it really is.
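As a rough idea of how a logfile tool might pick out the bots, here’s a minimal Python sketch. It assumes the server writes the common “combined” log format, and it guesses at bots by looking for a few tell-tale words in the user agent string. The `BOT_MARKERS` list and the sample lines are my own inventions for illustration; real packages such as AWStats rely on much longer, maintained lists.

```python
import re
from collections import Counter

# Pattern for one line of the "combined" log format (an assumption about
# how the server is configured).
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Hypothetical, deliberately short list of substrings that suggest a bot.
BOT_MARKERS = ("bot", "spider", "crawl", "slurp")


def classify(line):
    """Return 'bot', 'human' or 'unparsed' for a single log line."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return "unparsed"
    agent = match.group("agent").lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return "bot"
    return "human"


def tally(lines):
    """Count how many lines fall into each category."""
    return Counter(classify(line) for line in lines)


# Made-up sample lines: one browser, one search engine spider, one junk line.
SAMPLE_LINES = [
    '203.0.113.5 - - [20/Jun/2012:10:15:32 +0100] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows NT 6.1; rv:13.0) Gecko/20100101 Firefox/13.0"',
    '66.249.66.1 - - [20/Jun/2012:10:16:02 +0100] "GET /about.html HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    'not a log line',
]

if __name__ == "__main__":
    print(tally(SAMPLE_LINES))
```

Matching on the user agent is only a heuristic, since a bot can claim to be an ordinary browser, but it shows why the logfile has something to work with where the page tag has nothing.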

Instant gratification

The other big advantage that server logfiles have is that the data they record don’t have to travel to a remote location; they’re recorded there and then. With GA and other page-tagging stats analysis tools, the data must be sent from the user’s browser to the tool’s own servers, and thus have to negotiate the Internet before being recorded. If there are any problems along the way, then the data may be delayed or even lost.

Don’t sweat the small stuff

Ultimately, though, it’s probably not worth fretting about the discrepancies between page tags and logfiles. As Kay’s said before, avoid paralysis by analysis – it’s the overall trends that are likely to give you the most worthwhile information, not the individual visitors’ actions.

