Right now, people are once again wildly discussing hit counts and similar nonsense. Usually, I don't care about these (my server has an absurdly high free allowance that I can never use, and the server load is also low - so why should I care how much comes in?), but with the various announcements of hit counts, page views, and visits, I always have to smile a little.
Just as a small analysis of the whole story. First, the most important part: where do these numbers come from? Basically, there are two possibilities. One relies on the fact that pages contain a small element (e.g., an image - sometimes invisible - or a piece of JavaScript or an iframe - all commonly referred to as a web bug) that is counted. The other method works through the log files of the web server and evaluates them. There is a third one, where the individual visitor is identified via a cookie - but this is rarely used, except by some rather unpopular advertising systems.
Basically, there are only a few raw numbers that such a system can really provide (leaving aside individualization via cookies): hits on the one hand, and bytes transferred on the other. Of somewhat remote usefulness, there is also the number of distinct hosts (IP addresses) that have accessed the site.
But these numbers have a problem: they are purely technical. And thus strongly dependent on technology. Hits go up if you have many external elements. Bytes go up if you have many long pages (or large images or ...). IP addresses go down if many visitors are behind proxies. And they go up if you have many ISDN users - because of the dynamic dial-up addresses. Changes in the numbers are therefore due to both changes in visitors and changes in the pages.
All these numbers are as meaningful as the coffee grounds in the morning cup. That's why people derive other numbers from these - at least technically defined - numbers, which are supposed to say something. Here, the visits (visits to the website), the page impressions (accesses to real page addresses), and the visitors (different visitors) are to be mentioned.
Let's take the simplest number, which at least has a rudimentary connection to the real world: page impressions. There are different ways to get there. You can put the aforementioned web bugs on the pages that are to be counted. The number is then about as reliable as the counting system. Unfortunately, the counting systems are anything but reliable - more on that in a moment. The alternative - going through the web server log files - is a bit better. Here, you simply count how many hits with the MIME type text/html (or whatever is used for your own pages) are delivered. You can also count .html - but many pages no longer have this in their addresses, so the MIME type is more reliable.
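For the curious, the log-file variant boils down to something like this - a minimal sketch in Python, assuming a hypothetical log format where the response Content-Type has been appended as the last quoted field (the stock combined log format does not include it; with Apache you would add it yourself via a custom LogFormat):

```python
import re

# Hypothetical log lines: combined format plus the response Content-Type
# as the final quoted field. Timestamps and paths are made up.
LOG_LINES = [
    '1.2.3.4 - - [x] "GET /post/42 HTTP/1.1" 200 5120 "text/html; charset=utf-8"',
    '1.2.3.4 - - [x] "GET /style.css HTTP/1.1" 200 800 "text/css"',
    '1.2.3.4 - - [x] "GET /logo.png HTTP/1.1" 200 3000 "image/png"',
]

def count_page_impressions(lines):
    """Count only responses whose Content-Type starts with text/html."""
    impressions = 0
    for line in lines:
        match = re.search(r'"([^"]+)"\s*$', line)  # grab the last quoted field
        if match and match.group(1).startswith("text/html"):
            impressions += 1
    return impressions

print(count_page_impressions(LOG_LINES))  # 1 page impression out of 3 hits
```

Three hits, one page impression - which already shows why hits alone are the least interesting number of the lot.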
Significance? Well, rather doubtful. Many users are forced through proxies by their providers - but a proxy has the property of helping to avoid hits. If one visitor has retrieved a page, it may (depending on the proxy configuration) be delivered to other visitors from the cache, never touching the server. This affects all of AOL, for example - the numbers there are clearly distorted. And the closer a blogger really is to the A-list, the more distorted the numbers tend to be, since cache hits are more frequent than on less visited blogs.
In addition, browsers also do such things - cache pages. Or visitors do something else - reload pages. Proxies repeat some loading process automatically because the first one may not have gone through completely due to timeout - all of these are distortions of the numbers. Nevertheless, page impressions are still at least halfway usable. Unless you use web bugs.
Because web bugs have a general problem: they are not main pages, but embedded objects. Here, browsers behave even more stubbornly - what is in the cache is displayed from the cache. Why fetch the little picture again? Of course, you can prevent this with suitable headers - nevertheless, it often goes wrong. JavaScript-based techniques miss users without JavaScript entirely (and believe me, there are significantly more of them than is commonly admitted). In the end, web bugs have all the problems of the actual pages, plus a few additional problems of their own. Why are they still used? Because they are the only way to have your statistics counted on a system other than your own. So indispensable for the global length comparisons.
Well, let's leave page impressions and thus the area of rationality. Let's come to visits, and thus closely related to visitors. Visitors are mysterious beings on the web - you only see the accesses, but who it is and whether you know them, that is not visible. All the more important for marketing purposes, because everything that is nonsense and cannot be verified can be wonderfully exploited for marketing.
Visitors are only recognizable to a web server via the IP of the access, plus the headers that the browser sends. Unfortunately, this is much more than one would like to admit - but (except for the cookie setters with individual user tracking) not enough for unique identification. Because users share IPs - every proxy will be counted as one IP. Users may use something like tor - and thus the IP is often different than the last time. Users share a computer in an Internet café - and thus it is not actually users, but computers that are assigned. There are headers set by caches with which assignments can be made - but if the users behind the cache all use only private addresses (the 10.x.x.x, 172.16.x.x through 172.31.x.x, or 192.168.x.x ranges that you know from the relevant RFCs), this does not help either.
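Those private ranges are the ones from RFC 1918, and checking whether an address falls into them is trivial - a small sketch with Python's standard ipaddress module (note that only 172.16.0.0/12 is private, not all of 172.x.x.x, a detail that is regularly gotten wrong):

```python
import ipaddress

# The three RFC 1918 private blocks: addresses in these ranges can never
# identify an individual visitor on the public web.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_rfc1918(addr: str) -> bool:
    """True if the address lies in one of the private ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)

print(is_rfc1918("172.16.0.9"))   # True  - inside the /12
print(is_rfc1918("172.32.0.1"))   # False - outside it
print(is_rfc1918("192.168.1.1"))  # True
```

Any statistics package that sees only these addresses in a forwarded-for header is counting ghosts.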
Visitors can still be assigned a bit if the period is short - but over days? Sorry, but in the age of dynamic IP addresses, that doesn't help at all. The visitors of today and those of tomorrow can be the same or different - no idea. Nevertheless, it is proudly announced how many visitors one had in a month. Of course, this no longer has any meaning. Even daily numbers are already strongly changed by dynamic dial-ups (not everyone uses a flat rate and has the same address for 24 hours).
But to add to the madness, not only the visitors are counted (allegedly), but also their visits. Yes, that's really exciting. Because what is a visit? Ok, recognizing a visitor again over a short period of time (with all the problems that proxies and the like bring about, of course) works quite well - and you also know exactly when a visit begins. Namely, with the first access. But when does it end? Because there is no such thing as ending a web visit (a logout). You just go away. Don't come back so quickly (if at all).
Yes, that's when it gets really creative. Do you just take the time intervals between hits? Or - since visitors surely read the content - do you derive the cutoff after which a hit counts as a new visit from the size of the last page retrieved? How do you filter out regular refreshes? How do you deal with the visitor-counting problems above?
Not at all. You just make something up. Out of thin air. Then a number comes out. Usually based on a time interval between hits - long pause, new visit. That's simply counted. And added to a sum. Regardless of the fact that a visit may have been interrupted by a phone call - so two counted visits were really one visit with a pause. Regardless of the fact that users share computers or IP addresses - so one counted visit was really 10 interwoven visits.
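The whole "long pause, new visit" trick can be sketched in a few lines - a hypothetical sessionizer with the usual arbitrary 30-minute cutoff, fed with made-up (ip, timestamp) hits:

```python
SESSION_TIMEOUT = 30 * 60  # the customary, entirely arbitrary 30 minutes

def count_visits(hits, timeout=SESSION_TIMEOUT):
    """hits: iterable of (ip, unix_timestamp) pairs. A new 'visit' starts
    whenever the same IP has been silent for longer than `timeout` seconds."""
    last_seen = {}
    visits = 0
    for ip, ts in sorted(hits, key=lambda h: h[1]):
        if ip not in last_seen or ts - last_seen[ip] > timeout:
            visits += 1
        last_seen[ip] = ts
    return visits

hits = [
    ("1.2.3.4", 0), ("1.2.3.4", 60),     # one reader...
    ("1.2.3.4", 60 + 45 * 60),           # ...back after a 45-minute phone call:
                                         # counted as a second "visit"
    ("5.6.7.8", 100), ("5.6.7.8", 200),  # one "visit" - or ten people
                                         # behind the same proxy, who knows
]
print(count_visits(hits))  # 3 "visits"
```

Same hits, different timeout, different number of visits - the metric is defined by the knob, not by reality.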
Oh, yes, I know that some software uses the referrer headers of the browser to assign paths through the system and thus build clearer visits. Which of course no longer works smoothly if the user goes back with the back button or enters an address again without a referrer being produced. Or uses a personal firewall that partially filters referrers.
What is really cute is that all these numbers are thrown on the market without clear statements being made. Of course, sometimes it is said which service the numbers were determined with - but what does that say? Can the numbers be faked there? Does the operator count correctly (at blogcounter.de you can certainly fake the numbers in the simplest way), and do they count sensibly at all? Oh well, just take numbers.
The argument is often brought up that although the numbers cannot be compared directly as absolute numbers across counter boundaries, you can compare numbers from the same counter - companies are founded on this, which make money by renting out this coffee ground technology to others and thus realizing the great cross-border rankings. Until someone notices how the counters can be manipulated in a trivial way ...
It gets really cute when the numbers are lined up along the time axis and things like average dwell time are derived from them. Then, in combination with the page size and an assumed reading speed, it is determined how many pages were actually read and how many were just clicked through - some software really does "evaluate" such things.
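To make the absurdity concrete, here is a sketch of what such an "evaluation" amounts to - the reading speed and the classification rule are invented for illustration, just as the real packages invent theirs:

```python
READING_SPEED_CPS = 20  # characters per second - a number pulled from thin air

def classify_hit(page_chars, dwell_seconds, speed=READING_SPEED_CPS):
    """Mimic the dubious heuristic: if the visitor stayed at least as long
    as the assumed reading time for the page, the page counts as 'read'."""
    expected_seconds = page_chars / speed
    return "read" if dwell_seconds >= expected_seconds else "just clicked"

print(classify_hit(3000, 200))  # 3000 chars "need" 150 s -> "read"
print(classify_hit(3000, 10))   # -> "just clicked"
```

Change the assumed reading speed and the share of "read" pages changes with it - which tells you everything about the value of the resulting statistic.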
So let's summarize: there is a limited framework of information that you can build on. These are hits (i.e., retrievals from the server), hosts (i.e., retrieving IP addresses), and amounts transferred (summing the bytes of the retrievals). In addition, there is auxiliary information such as referrers and possibly cookies. All numbers can be manipulated and falsified - and many are actually falsified by everyday Internet technology (the most common case being caching proxies).
These rather unreliable numbers are chased through - partly non-public - algorithms and then mumbo jumbo is generated, which is used to show what a cool frood you are and where the towel hangs.
And I'm supposed to participate in such nonsense?
PS: According to the awstats evaluation, the author of this posting had 20,172 visitors, 39,213 visits, 112,034 page views in 224,402 accesses, and pushed 3.9 gigabytes over the line last month - which, as noted above, is completely irrelevant and meaningless, except that he might look for more sensible hobbies.