Caveat: I don't work for Google and this is only a surface treatment of the subject. There are likely to be details that are not 100% accurate but they should tell you enough to get you running.

Introduction


If you have seen the various reports you can produce in Google Analytics (GA), you might be amazed that all of this information can be obtained. It is surely mysterious and magical, perhaps being powered by unicorns and wizards. Actually, it isn't, but it does rely on a number of tricks to work out the relevant data, most of which rely simply on how widely used Google Analytics is. What needs to be understood is why and when the information in Google Analytics is not correct.

What is crucial is that you do not expect GA to be 100% accurate. It cannot be, and in this post I will explain why. For instance, you set up GA on your site, visit it, go and look at the demographics report and expect to see 1 visitor from your age group and gender, but the information is either missing or incorrect (ignoring the fact that you usually have to wait a while for it to appear anyway).

Also, it is crucial that you do not expect data taken from a small number of data points to be accurate. GA is definitely designed for large numbers of visitors, where a few percent error is acceptable.

Web Requests Don't Contain Much Data

When you request a web page, there is a range of data contained in the request. Some of it, like the source and destination addresses, is needed for the mechanics to work correctly. Other information is added so that web servers can know a little something about the "agent" that is requesting the page (the browser, program, hacking tool etc.). This allows the web server to send back slightly different content to browsers that might not support certain functionality. There are also fields that the browser can fill in to tell the web server whether, for instance, it can accept compressed data in the response, which reduces the load on the network. The only other thing of consequence that is sent to the web server is cookies (see later).

What is NOT sent to the web server, by default, is anything that specifically identifies you, the user. The reasons are simple: 1) There is no unified definition of what a web user is - no standard "identity" and 2) The web server does not need to know!
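To make that concrete, here is a rough sketch of what a request looks like once it is pulled apart. The request text is made up for illustration (www.example.com, the session value and the `parse_request` helper are my own assumptions, not anything GA-specific), but the header names are the standard ones a browser sends.

```python
# A made-up request, similar in shape to what a browser sends for a page.
RAW_REQUEST = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "Accept-Language: en-GB,en;q=0.5\r\n"
    "Cookie: session=abc123\r\n"
    "\r\n"
)


def parse_request(raw):
    """Split a raw HTTP request into method, path and a dict of headers."""
    head = raw.split("\r\n\r\n", 1)[0]          # everything before the blank line
    lines = head.split("\r\n")
    method, path, _version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")    # split on the first colon only
        headers[name.strip().lower()] = value.strip()
    return method, path, headers


method, path, headers = parse_request(RAW_REQUEST)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Notice what is in there: the agent, the compression support, the language and a cookie. There is no name, age or email address anywhere in the request.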

For instance, when you go to amazon.co.uk, unless you have an account and sign in, Amazon does not know who you are, how old you are or where else you have been. If it does know, it works it out with the same kind of trickery that GA uses to track people and decide who they are.

Cookies

The reason cookies are so important is that they allow web servers to store information on each machine that visits the site. Note that the cookies are just small text files and they are stored on disk, so if you log in from another machine, that cookie will not exist there unless the site you visit sends it back to you again.

The cookies can contain anything that the web server decides to store there, including usernames and passwords if they are stupid. Usually, though, a cookie will be used to store your "session" when you are logged in. When you change between pages, the cookie is sent with each request and the site knows who you are. Without it, you would be lost between pages, since your identity is not sent in the web page request.

The trick here is sharing cookies between sites so that a whole group of sites can follow you and potentially work out who you are (you might have seen virus checkers complaining about tracking cookies). Now, you can't normally share cookies across sites. They are only sent to sites whose domain matches the "origin" domain of the cookie. So if a cookie has an origin domain of google.co.uk, then it will only be sent with requests for a page at google.co.uk. Amazon, for instance, cannot demand that your browser sends it a Google cookie. Note also that the browser sends the cookie(s) automatically to any site with the correct domain.
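The matching rule can be sketched in a few lines. This is a simplified version of what browsers do (it ignores paths, the secure flag, expiry and so on), and `cookie_is_sent` is just a name I have made up for the illustration.

```python
def cookie_is_sent(cookie_domain, request_host):
    """Simplified domain match: the host is the cookie domain or a subdomain of it."""
    cookie_domain = cookie_domain.lower().lstrip(".")
    request_host = request_host.lower()
    return request_host == cookie_domain or request_host.endswith("." + cookie_domain)


print(cookie_is_sent("google.co.uk", "www.google.co.uk"))  # True
print(cookie_is_sent("google.co.uk", "www.amazon.co.uk"))  # False
```

The check on "." + domain matters: without it, a site called notgoogle.co.uk would be handed Google's cookie.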

What actually happens, though, is that a site, let's say Amazon, can get permission from Google to embed a small piece of Google code into its own page. When that Google code is called and sends a request to, say, google.co.uk, the Google cookie will be sent with it, even though you visited Amazon. Google can then know that you visited Amazon and, more usefully, Amazon can know that you visited Amazon (naturally, if you have an account at Amazon, they would already know that).
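Here is a toy model of that flow. It is only a sketch of the mechanism as described above: the `Browser` class, the cookie value and the URLs are all invented, and real browsers (and the real GA script) differ in plenty of details, including newer restrictions on this kind of cookie.

```python
class Browser:
    """A toy browser that only knows about hosts, cookies and the Referer header."""

    def __init__(self):
        self.cookies = {}  # cookie domain -> cookie value

    def request(self, host, referer=None):
        """Build the headers that would be sent to `host`."""
        headers = {"Host": host}
        if referer:
            headers["Referer"] = referer
        for domain, value in self.cookies.items():
            # Cookies go only to the matching domain (or its subdomains).
            if host == domain or host.endswith("." + domain):
                headers["Cookie"] = value
        return headers

    def visit(self, page_url, page_host, embedded_hosts):
        """Request the page itself, then whatever the page embeds from other hosts."""
        sent = [self.request(page_host)]
        for host in embedded_hosts:
            sent.append(self.request(host, referer=page_url))
        return sent


browser = Browser()
browser.cookies["google.co.uk"] = "id=visitor-42"  # set on an earlier visit to Google

requests = browser.visit(
    "https://www.amazon.co.uk/",
    "www.amazon.co.uk",
    embedded_hosts=["www.google.co.uk"],
)
for headers in requests:
    print(headers)
```

The request to Amazon carries no Google cookie, but the request triggered by the embedded code carries both the Google cookie and the address of the Amazon page that caused it.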

The Clever Part

The question, though, is how much can Google really determine from this? Well, of course, it depends on whether you have a Google account or not. If I have a Google account and am logged in, a local Google cookie will contain a user identifier (the same is true of any site you log in to, not just Google). If I then visit Amazon and this cookie is sent to Google, Google don't just know that someone visited Amazon, they know that I, Luke Briner, thirty-something, living in the UK etc. have visited Amazon. This information is much richer than the basic information. It is interesting and, from Google's point of view, commercially very valuable to a large company that sells to millions of people, like Amazon. It can allow Amazon to ask what age range they attract, what gender, what nationalities. Perhaps it shows that French people don't last very long on the site, so they might consider building a French language web site.

Of course, this can be extended across other sites. Google owns YouTube, Google+ etc., so any of these sites that contain membership information can be used by GA to identify the user.

What Else Can They Determine?

Earlier on, we mentioned the web request containing information about the browser. It looks something like this: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0 and this tells Google what operating system and browser you are using. It can also determine whether you are on a mobile phone browser and, with reasonable accuracy, even what handset you are using. There are thousands of these unique strings, which Google can easily translate into real-world data. (They look really weird but there is a funny history about how they evolved to be so weird.)
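As a rough idea of how such a string gets translated, here is a minimal sketch. The lookup table and the `describe_user_agent` function are my own illustration and only recognise a handful of cases; I have no idea how Google actually does it, but the principle of matching known tokens is the same.

```python
import re

# A tiny, hand-picked lookup table; a real tool would need thousands of entries.
WINDOWS_VERSIONS = {
    "Windows NT 6.1": "Windows 7",
    "Windows NT 10.0": "Windows 10",
}


def describe_user_agent(ua):
    """Return a rough guess at the OS, the browser and whether it is a mobile device."""
    platform = re.search(r"\(([^)]*)\)", ua)  # the first bracketed section
    tokens = [t.strip() for t in platform.group(1).split(";")] if platform else []
    os_name = "unknown"
    for token in tokens:
        if token in WINDOWS_VERSIONS:
            os_name = WINDOWS_VERSIONS[token]
        elif token.startswith("Android"):
            os_name = token
    browser = re.search(r"(Firefox|Chrome)/([\d.]+)", ua)
    return {
        "os": os_name,
        "browser": browser.group(1) if browser else "unknown",
        "version": browser.group(2) if browser else "unknown",
        "mobile": "Mobile" in ua,
    }


info = describe_user_agent(
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"
)
print(info)
```

For the string above this reports Windows 7 and Firefox 27.0 on a non-mobile device. Anything not in the table simply comes back as "unknown", which is one more way the reports can end up with gaps.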

GA can have a good guess at your location and language, either by seeing what language the browser has sent in the request and/or by using your machine's IP address to get a rough geographic position (although often this is the location of your service provider rather than your actual machine).
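The language part is easy to sketch, because the browser sends a standard Accept-Language header listing languages with optional weights. The `preferred_languages` helper below is my own simplified reading of that header, not GA's code, and it does not attempt the IP address lookup, which needs a geolocation database.

```python
def preferred_languages(header):
    """Return the language tags from an Accept-Language header, most preferred first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # no explicit weight means "most preferred"
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as least preferred
        prefs.append((tag.strip(), quality))
    prefs.sort(key=lambda p: p[1], reverse=True)  # stable sort keeps ties in sent order
    return [tag for tag, _ in prefs]


print(preferred_languages("fr-FR,fr;q=0.8,en;q=0.5"))
```

A visitor sending that header is probably a French speaker, but it is still only a guess: it tells you how the browser is configured, not who is sitting in front of it.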

One of the other useful features is the referrer information which tells us which page the visitor came from. This is helpful because we might compare how useful Google is compared to Bing or Yahoo but it also can help us decide whether our internal links are working in the way we expect - are people clicking the "buy it" link or simply clicking through from Google shopping?
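A crude version of that sorting might look like the sketch below. The categories and the `classify_referrer` function are invented for illustration, and the list of search engines is just the three mentioned above.

```python
from urllib.parse import urlparse

SEARCH_ENGINES = ("google", "bing", "yahoo")


def classify_referrer(referrer, own_host):
    """Put a referrer URL into a rough bucket: direct, internal, search or other."""
    if not referrer:
        return "direct"  # no referrer was sent, e.g. a typed address or a bookmark
    host = urlparse(referrer).hostname or ""
    if host == own_host:
        return "internal"
    labels = host.split(".")
    for engine in SEARCH_ENGINES:
        if engine in labels:
            return "search:" + engine
    return "other"


print(classify_referrer("https://www.google.co.uk/search?q=shoes", "www.example.com"))
print(classify_referrer("https://www.example.com/products", "www.example.com"))
print(classify_referrer("", "www.example.com"))
```

Note the "direct" bucket: when no referrer is sent at all, there is nothing to go on, so those visits cannot be attributed to anything.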

GA can determine which pages people visit with reasonable accuracy by using an identifier in a cookie. It won't tell us who it is specifically, but it will say that user X went to pages A, B and C and stayed on B for 10 minutes. It will then say where the user left. It will add timings for all of these visits.

What Can We Add to GA?

Google knows that GA cannot determine everything, so it allows us to add custom information to our reports. For instance, we can add events that record specific things happening on our site, or we can add e-commerce values to certain events so GA can track how much money certain pages are making (and therefore which might need the most attention).

Why Isn't It Perfect?

GA isn't perfect for several reasons.

The main reason is that a lot of the data is inferred rather than being directly known. For instance, the fact that a user has been on page B for, say, 10 minutes is inferred from the fact that they went to page A, then page B and then page C, and the gap between the arrival times at pages B and C is 10 minutes. Of course, this might not be accurate. Perhaps a user clicked on page B but then went to another browser tab to do something else - they might have got through page B very quickly once they went back to it. GA cannot know this - and this is why we need lots of data for these odd events to come out in the averages.
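The inference is simple enough to sketch. The page names, timestamps and the `inferred_time_on_page` function below are made up for illustration; the point is only that the "time on page" is nothing more than the gap between two arrival times.

```python
from datetime import datetime

# Made-up page hits for one visitor: (page, arrival time).
hits = [
    ("/page-a", datetime(2014, 3, 1, 10, 0, 0)),
    ("/page-b", datetime(2014, 3, 1, 10, 1, 0)),
    ("/page-c", datetime(2014, 3, 1, 10, 11, 0)),
]


def inferred_time_on_page(hits):
    """Infer seconds spent on each page from the gap until the next page hit."""
    ordered = sorted(hits, key=lambda h: h[1])
    durations = []
    for (page, arrived), (_next_page, next_arrived) in zip(ordered, ordered[1:]):
        durations.append((page, (next_arrived - arrived).total_seconds()))
    if ordered:
        # There is no later hit to measure against, so the last page is unknown.
        durations.append((ordered[-1][0], None))
    return durations


for page, seconds in inferred_time_on_page(hits):
    print(page, seconds)
```

Page B comes out as 600 seconds whether the visitor read it carefully or left it in a background tab, and the last page has no duration at all because nothing arrives after it.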

The second way in which it suffers is due to general web latency and other factors outside the control of Google. For instance, you might visit a page that sends information to Google but this packet might get lost, or it might take a long time to reach Google, which may or may not affect its timings. People might also be using various security tools that block expected behaviour or override browser settings. All of these would affect individual readings but, again, over time and with increasing numbers, these values will come out in the wash.

Another problem is that it relies on cookies, and cookies can be disabled, blocked or deleted, which can badly skew the data when the number of users is low. Again, GA works on the basis that most people do not play around with cookie settings.

What Can We Do to Improve It?

Firstly, by understanding how GA works, we can avoid expecting too much of it. We cannot expect any specific piece of data to be 100% accurate.

Secondly, we can learn how to interpret the data and decide between what is actually a problem and what might just look like a problem because there is not enough data.

Thirdly, if we really need split-second timing information, we need to code this into the web site in question so we can have complete visibility of what is happening and when, with much greater accuracy. Note that we can still suffer the same problems, in that we don't know whether a specific user is struggling with signup or has gone to make a cup of tea. We could mitigate this by having some kind of timeout with a message that the user can click to tell the system that they are still there.
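One way to sketch the timeout idea: if the site records a timestamp every time the user does something (including clicking an "I'm still here" message), we can add up only the gaps that are shorter than some idle limit. The `active_seconds` function and the 120-second limit are assumptions of mine for illustration, not a recommendation.

```python
def active_seconds(event_times, idle_timeout=120):
    """Sum the gaps between consecutive events, ignoring gaps longer than idle_timeout.

    event_times are seconds since the user arrived on the page.
    """
    ordered = sorted(event_times)
    total = 0
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        if gap <= idle_timeout:
            total += gap  # the user was plausibly still there
    return total


# Activity at 0s, 30s and 90s, then a ten-minute tea break, then 690s and 720s.
events = [0, 30, 90, 690, 720]
print(active_seconds(events))         # counts only the short gaps
print(max(events) - min(events))      # the naive "first to last" figure
```

With these numbers the naive figure is 720 seconds, but only 120 seconds are counted as active. It is still a guess, just a better informed one.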

Fourthly, if we want to track incoming links from pages we have control over, rather than hoping that GA works it out (which it might), we can add some custom parameters to the URL which allow us to tell Google exactly what we want GA to know. This is especially useful if we have a campaign on social media and want to know how many people came to our site from that specific link. We do that using the Google URL builder. Usually, it's then easiest to use a URL shortener service like bit.ly to make the URL short enough to post on Twitter etc.
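The URL builder is really just adding query string parameters, so the same thing can be sketched by hand. The `utm_source`, `utm_medium` and `utm_campaign` names are the standard campaign parameters; the `add_campaign_params` helper, the example.com address and the campaign values are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def add_campaign_params(url, source, medium, campaign):
    """Append campaign tracking parameters to a URL, keeping any existing query."""
    parts = urlparse(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),      # where the link was posted, e.g. twitter
        ("utm_medium", medium),      # the kind of channel, e.g. social
        ("utm_campaign", campaign),  # our own name for the campaign
    ]
    return urlunparse(parts._replace(query=urlencode(query)))


tagged = add_campaign_params(
    "https://www.example.com/signup", "twitter", "social", "spring_sale"
)
print(tagged)
```

Anyone arriving through that link is then reported against the campaign we named, regardless of what the referrer says.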

As with any management tool, we should allow GA to be what it is (it is free after all!) and not coerce it too much to do what it is not designed to do. Probably more importantly, we need to ask a very important question that is often lost in all the discussions: "what do we actually need to know?". It is easy to say, "I want to know what device the user is using", but that is not actually a need. The reality might be that we want to know if users are visiting from Android so we could write an Android app, but the question might actually be, "are users visiting from Android on the app or are they using the Android web browser?" Unless you can answer these questions before you start on your data crusade, you'll end up like those managers or politicians who are addicted to stats for the sake of it and who spend large amounts of time and money on things that are not actually business critical.