blufive

[As some of you may have noticed, I'm a bit of a web browser weenie. Some of my bubbling-under posts are about web browser statistics. So, as a prelude, here's an updated version of something I wrote for a different audience some time ago]

Many common web statistics analysis packages are lousy at identifying browsers beyond a few market leaders. In the case of openly-published large-scale web browser statistics, browser detection/identification is almost uniformly appalling. To be fair, it is particularly difficult to do this job well. The only remotely accurate way to identify a browser from the server side is to analyse the HTTP "User-Agent" header that the browser sends when it requests a file. Unfortunately, this isn't as simple as it may sound. Here are some of the problems.

Manufacturer Controlled Spoofing

In the early days of the web, when support for things as basic as images varied wildly from one browser to another, it became common for web servers to send different content to different browsers. As new browsers came on the market, they would sometimes send user-agent strings resembling those of a similarly capable competitor – usually a version of Netscape, the market leader at the time – so that web servers would allow the browser to access more sophisticated content.

This practice was particularly widespread at the height of the "browser wars". As a result, to this day, most browsers have a user-agent string that begins "Mozilla/x.x" – "Mozilla" being the internal product codename of Netscape's Navigator web browser. To add to the confusion, the name "Mozilla" has now been passed on to the open-source Mozilla project, which, in turn, uses a "Mozilla/5.0" user-agent, leading many browser detection routines to categorise it as the non-existent "Netscape 5".

Nowadays, of course, Netscape is no longer the market leader, so browsers imitate Internet Explorer, which in turn still mimics Netscape 4.0. Confused yet? You're not the only one.
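To make the problem concrete, here is a minimal sketch of the sort of crude detection described above, one that trusts only the leading "Mozilla/x.x" token. The function name and the two sample strings are my own illustrations of typical user-agent formats from the period, not taken from any particular stats package.

```python
def naive_browser(user_agent: str) -> str:
    """Crude detection: trust only the leading product token."""
    if user_agent.startswith("Mozilla/5"):
        return "Netscape 5"  # a browser that was never released
    if user_agent.startswith("Mozilla/"):
        return "Netscape"
    return "Unknown"


# Illustrative strings in the usual formats (assumed, not from real logs).
ie6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
mozilla_17 = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616"

print(naive_browser(ie6))         # Netscape
print(naive_browser(mozilla_17))  # Netscape 5
```

Both answers are wrong: the first string is Internet Explorer and the second is the open-source Mozilla browser.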

User Controlled Spoofing

In the last few years, a combination of the overwhelming market dominance of Internet Explorer, poor support for standards in many older browsers, and security concerns for e-commerce transactions has led many websites to allow only one or two browsers to access their sites.

Many users are unhappy about being prevented from accessing sites of interest on the basis of the browser they are using, so some minority browsers have started to offer the user control over the user agent string that is sent. In most cases, when spoofing, these browsers provide enough information to allow correct identification by those in the know, but many crude browser detection methods will jump to the "wrong" conclusion, and let the user in.

The problem for anyone wanting to analyse browser statistics is that many stats packages are just as vulnerable to this deception as web servers.

Variable User Agent Strings

In addition to deliberate spoofing, most browsers have many different user agent strings, indicating differences in language support, release/service pack level, operating system, ISP customisation, and so on. To get some idea of what's going on, have a look at these May 2004 stats for a server at a US university [warning: 1.6MB HTML file]. The bulk of that file is a list of all the different user agent strings collected by the server over the month. Over 17,000 of them. A vast number of those (I'd guess at about 7-8,000) appear to be different versions of Internet Explorer 6, and other browsers display similar variety, not to mention all the search spiders, download managers, et cetera.
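One way to cope with that variety is to collapse each string down to a browser family before counting. The following is only a rough sketch: the `tally` helper and the sample strings are hypothetical, and it keys on nothing but the "MSIE" token, so it would still be fooled by any browser that imitates Internet Explorer.

```python
import re
from collections import Counter

# Major version number following the "MSIE" token, e.g. "MSIE 6.0" -> "6".
MSIE_RE = re.compile(r"MSIE (\d+)\.")


def tally(user_agents):
    """Count requests per browser family instead of per raw string."""
    counts = Counter()
    for ua in user_agents:
        match = MSIE_RE.search(ua)
        counts[f"IE {match.group(1)}" if match else "other"] += 1
    return counts


# Three distinct strings that all describe Internet Explorer 6 (illustrative).
sample = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)",
]

print(tally(sample))
```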

Opera

Recent releases of Opera have built-in user-controlled spoofing, and are set to masquerade as Internet Explorer 5 or 6 by default (depending on the version of Opera). While it's an imperfect mimic (to allow clued-in people to identify Opera), many statistics packages are fooled by the disguise. Evidence suggests that most Opera users never bother to change this default, so many sources probably under-report usage of this browser. Many sources that do detect Opera do not distinguish individual versions.
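The fix is largely a matter of checking for Opera before checking for Internet Explorer. Here is a minimal sketch, assuming the spoofed string carries an "Opera" token followed by a version number; the `identify` function, the regular expressions and the sample strings are illustrative assumptions rather than anyone's actual detection code.

```python
import re

OPERA_RE = re.compile(r"Opera[ /](\d+(?:\.\d+)?)")
MSIE_RE = re.compile(r"MSIE (\d+(?:\.\d+)?)")


def identify(user_agent):
    """Return (browser, version), testing for Opera before MSIE."""
    match = OPERA_RE.search(user_agent)
    if match:
        return ("Opera", match.group(1))
    match = MSIE_RE.search(user_agent)
    if match:
        return ("Internet Explorer", match.group(1))
    return ("Unknown", None)


# Illustrative strings: Opera in its default disguise, and the real thing.
opera_as_ie = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.54 [en]"
real_ie = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(identify(opera_as_ie))  # ('Opera', '7.54')
print(identify(real_ie))      # ('Internet Explorer', '6.0')
```

Reverse the order of the two checks and the first string gets counted as Internet Explorer 6, which is exactly the mistake many packages make.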

Safari

Apple's Safari web browser has a manufacturer-controlled spoof, which is designed to be mistaken for a Mozilla/Gecko-based browser. Like Opera, it includes enough information to allow easy identification by those in the know. Many public web stats sources are falling for the spoof, and including it with other Mozilla/Gecko based browsers, though the situation is improving.
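As a rough illustration, a detector that merely looks for the word "Gecko" will fall for the spoof, while one that checks for the "AppleWebKit" or "KHTML" tokens first will not. The function names and sample strings below are my own assumptions, written in the usual format of these user-agents.

```python
def naive_engine(user_agent):
    """Falls for the spoof: any mention of 'Gecko' counts as Gecko."""
    return "Gecko" if "Gecko" in user_agent else "Other"


def engine_family(user_agent):
    """Check the give-away tokens before concluding it is Gecko."""
    if "AppleWebKit" in user_agent or "KHTML" in user_agent:
        return "WebKit/KHTML"
    if "Gecko/" in user_agent:  # genuine Gecko strings carry a build date
        return "Gecko"
    return "Other"


# Illustrative strings (assumed formats).
safari = ("Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) "
          "AppleWebKit/125.2 (KHTML, like Gecko) Safari/125.8")
mozilla = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616"

print(naive_engine(safari))    # Gecko
print(engine_family(safari))   # WebKit/KHTML
print(engine_family(mozilla))  # Gecko
```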

Mozilla/Gecko

Few sources correctly differentiate all Mozilla/Gecko based browsers. Again, this is a difficult task, as mozilla.org releases dozens of builds every day, all with different user-agent strings, as part of their test process. There are also several commercial and non-commercial entities (such as Netscape, IBM, Sun Microsystems, Debian and others) releasing browsers based on Mozilla code. It's also possible for users to alter the Mozilla user-agent string.
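Telling the Gecko family apart usually means looking at what follows the "Gecko/" build date. This is a sketch under the assumption that a branded browser appends a "Name/version" token after the date and that a plain Mozilla build does not; the `gecko_product` helper, the "Mozilla suite" label and the sample strings are all hypothetical.

```python
import re

# "Gecko/YYYYMMDD", optionally followed by a "Name/version" product token.
GECKO_RE = re.compile(r"Gecko/(\d{8})(?:\s+(\S+)/(\S+))?")
RV_RE = re.compile(r"rv:([\d.]+\w*)")


def gecko_product(user_agent):
    """Return (product, version, build_date), or None if not Gecko."""
    match = GECKO_RE.search(user_agent)
    if match is None:
        return None
    build_date, name, version = match.groups()
    if name is None:
        # No trailing product token: treat it as a plain Mozilla build.
        rv = RV_RE.search(user_agent)
        return ("Mozilla suite", rv.group(1) if rv else None, build_date)
    return (name, version, build_date)


# Illustrative strings (assumed formats).
suite = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040616"
firefox = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040614 Firefox/0.9"
netscape = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)"

for ua in (suite, firefox, netscape):
    print(gecko_product(ua))
```

Since users can edit the string, and test builds appear daily, the build date is worth keeping alongside the product name rather than relying on either alone.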

AOL/Windows and Other Internet Explorer-Based Browsers

Many stats packages do not accurately distinguish web client software that embeds parts of Internet Explorer from Internet Explorer itself. The most common example of this is the browser software that AOL provides to its customers, though there are others that take this approach, such as MSN Explorer, older versions of the CompuServe client, and NeoPlanet. Most of the time, these browsers will behave much like Internet Explorer, but there are exceptions.
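A minimal sketch of how the AOL client might be separated out, assuming it adds an "AOL x.x" token alongside the usual "MSIE" one. The `classify_ie` function and the sample strings are illustrative, and other embedding clients would each need their own token check.

```python
import re

AOL_RE = re.compile(r"\bAOL (\d+(?:\.\d+)?)")
MSIE_RE = re.compile(r"\bMSIE (\d+(?:\.\d+)?)")


def classify_ie(user_agent):
    """Separate the AOL client from plain Internet Explorer."""
    msie = MSIE_RE.search(user_agent)
    if msie is None:
        return None  # not an IE-based client at all
    aol = AOL_RE.search(user_agent)
    if aol:
        return f"AOL {aol.group(1)} (IE {msie.group(1)} engine)"
    return f"Internet Explorer {msie.group(1)}"


# Illustrative strings (assumed formats).
aol_client = "Mozilla/4.0 (compatible; MSIE 6.0; AOL 9.0; Windows NT 5.1)"
plain_ie = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(classify_ie(aol_client))  # AOL 9.0 (IE 6.0 engine)
print(classify_ie(plain_ie))    # Internet Explorer 6.0
```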

Platform

Few stats packages make a detailed distinction between user platforms, so we can't distinguish Internet Explorer (Mac) from Internet Explorer (Windows), or similar. This can be important, as (for example) Internet Explorer for Mac OS is not simply a port of the Windows code – it has a completely different rendering engine, which is much better at handling some CSS and standards-compliant code than its Windows counterpart.
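Recording the platform alongside the browser need not be complicated. The sketch below assumes the platform can be read from a handful of substrings in the user-agent; the token list, function names and sample strings are my own illustrations, and the "MSIE" test deliberately ignores the spoofing problems described earlier.

```python
# (substring to look for, platform label); checked in order.
PLATFORM_TOKENS = [
    ("Windows", "Windows"),
    ("Mac", "Macintosh"),
    ("Linux", "Linux"),
]


def platform_of(user_agent):
    """Return the first platform whose token appears in the string."""
    for token, name in PLATFORM_TOKENS:
        if token in user_agent:
            return name
    return "Unknown"


def browser_and_platform(user_agent):
    """Key statistics on (browser, platform) rather than browser alone."""
    # Simplification: treats any "MSIE" token as genuine Internet Explorer.
    browser = "Internet Explorer" if "MSIE" in user_agent else "Other"
    return (browser, platform_of(user_agent))


# Illustrative strings (assumed formats).
ie_mac = "Mozilla/4.0 (compatible; MSIE 5.23; Mac_PowerPC)"
ie_win = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

print(browser_and_platform(ie_mac))  # ('Internet Explorer', 'Macintosh')
print(browser_and_platform(ie_win))  # ('Internet Explorer', 'Windows')
```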