Conspiracies in Web Tracking

Despite my headline, I’m not really going to go all Mulder on you and start ranting about Big Brother and privacy issues and all that. Instead it’s just some thoughts I’ve been entertaining lately on technology and tracking people and habits on the Web. Some people may choose to see the things I’m writing about as conspiratorial, and that’s fine for them; they may not want to read on, though. :)

Google

This whole train of thought was initiated by Google, when I was working on updating my own search results page: in order to make my search results look like Google’s, I was viewing the HTML source of Google’s results page (why I’ve never viewed the source before, I don’t know) and uncovered a very clever method of tracking which links get clicked.

Now, quick reality check here: I have nothing against tracking Web activity, whether it’s in the form of Web server logfiles, cookies, link redirection, whatever. I just take it for granted these days that every site I’m visiting is tracking me in some way or another. No biggie. C’est la vie.

And I’m highlighting Google here not because they’re the only search engine that tracks clickthroughs on search results—no, every search engine that wants to stay in the game needs to do this—but because I’m more impressed with how Google is doing this and thought it warranted a detailed look.

Anyway, back to the Google link tracking. Looking at the source of the results, this is the anchor tag for one of the result links:
<a href=http://www.st-patricks-day.com/ onmousedown="return clk(1,this)">
See what’s happening? The link target is benign, it’ll take you to the site you want. But the onMouseDown JavaScript event is the workhorse—it operates (invisibly) when the link is clicked, and triggers a function. Curious, I examined the source of that function (I’ve reformatted it to be readable):

function clk(n,el) {
  if (document.images) {
    (new Image()).src="/url?sa=T&start="+n+"&url="+escape(el.href);
  }
  return true;
}

This is one of the most clever, elegant ways to track clicks that I’ve seen. The only problem is, it relies on JavaScript being enabled, but since the majority of users out there do have it enabled, it will be statistically accurate most of the time. (However, so far I’ve only seen it used on Internet Explorer; it didn’t show up in my tests on Firefox.)

What’s it do? Simple: it creates a new image inside the document (most likely a transparent one-pixel image) when the link is clicked, and the source serving up that image is passed a string telling it what URL was clicked and what page of results it was on. That source is a program that logs the click to a database and then returns an image. The user never sees any of this happen; the image request fires off in the background, the link activates, and they’re on their merry way.
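For the curious, the program on the other end of that /url request probably looks something like this. This is just my sketch in PHP (Google certainly isn’t running anything like this verbatim, and the table and column names are made up), but it captures the idea: log the parameters, then hand back a throwaway image.

<?php
// Hypothetical sketch of the image-serving click logger -- not Google's code.
$url   = isset($_GET['url'])   ? $_GET['url']         : '';
$start = isset($_GET['start']) ? (int) $_GET['start'] : 0;

// Record the click (table and column names are invented).
mysql_connect('localhost', 'user', 'pass');
mysql_select_db('tracking');
mysql_query(sprintf(
    "INSERT INTO clickthroughs (url, position, ip, clicked_at)
     VALUES ('%s', %d, '%s', NOW())",
    mysql_escape_string($url),
    $start,
    mysql_escape_string($_SERVER['REMOTE_ADDR'])
));

// Answer the Image() request with a 1x1 transparent GIF.
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
?>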

Clever as hell. Uncovering this about Google was surprising, but not too surprising; I always assumed they were tracking this type of data. What’s just as interesting is that I’ve never seen anyone online talk about it.

Why do I think this is more clever than, say, Yahoo? Yahoo tracks clickthroughs the obvious, old-fashioned way: link redirection. The result link points at their own server with the target URL passed as a parameter, and that server logs the click and redirects you to the external site. It doesn’t rely on JavaScript trickery. AlltheWeb uses link redirection also, but they mask it with JavaScript in the same way that Serendipity does, which I cover below.

And I can’t shake the intuitive feeling that Google’s method is more efficient, resource-wise, than the other ways being employed. Tough to say for sure, though, since I don’t have access to the actual systems…

The end result is that Google (as well as all the other search engines) has a massive database somewhere of what links are being clicked and roughly what priority they occupy in the clickaverse. And don’t forget the standard data that is logged when a server receives a request: IP address, browser, referring page… I’m drooling at the possibilities of what could be mined out of this data. It wouldn’t surprise me if this data figures heavily in PageRank.

TypeKey

TypeKey is one of the recent hot topics raging through the blogosphere: a “free, open system providing a central identity that anyone can use to log in and post comments on blogs and other web sites.” Think Passport for the weblog world. The general idea is to cut down on comment spam by requiring would-be commenters to register and log in to leave comments, and rather than having a thousand different logins across a thousand different sites, they can use TypeKey instead.

There’s some controversy surrounding all this, but that’s not what I’m writing about. Nor do I really have a strong opinion one way or the other; my take is that this type of project is generally doomed to failure, and I suspect Six Apart is biting off more than they can chew with this: dealing with all the data management hassles and liability issues of such a scheme is not something I’d want to undertake, but more power to them if it works.

My point: the interesting aspect I see in the TypeKey service is the data that will be collected about users’ commenting habits. Not the comments themselves (important distinction here): TypeKey won’t be tracking the actual comments a person makes on a weblog; rather, it will only verify that person’s identity and grab some metadata about the comment in the process.

At the very least TypeKey will track and store this information about users:

  • Login name
  • Password
  • Email address
  • Date and time they were last authenticated
  • Where they were last authenticated
  • Possibly the IP address of the person, if the authentication service requires it

I guarantee it. And this data will be aggregated, so for every user record there will be a timestamped log of which post on which site they left a comment on. By itself, a small dataset doesn’t reveal much. But as the data grows over time, you can start looking for trends and teasing out interesting tidbits about the commenting habits of users.

Like what sites a user is more likely to comment on, and how frequently. Or what times of day a user is most active. Or you can look for spikes of activity on sites, correlate those to hot topics. You may be able to spot likely comment spammers (even if they keep changing identities) based on perceived spam patterns.
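To make that concrete, here’s the sort of query you could run against such a dataset. The table and column names are pure invention on my part (nobody outside Six Apart knows what they’ll actually store), but it shows how quickly “just metadata” turns into a profile:

<?php
// Hypothetical: which sites a given TypeKey user comments on most often.
// The auth_log table is imaginary -- just the kind of record described above.
mysql_connect('localhost', 'user', 'pass');
mysql_select_db('typekey');

$login = 'example_user';  // whoever we're profiling

$result = mysql_query(sprintf(
    "SELECT site_url, COUNT(*) AS comments, MAX(authenticated_at) AS last_seen
     FROM auth_log
     WHERE login_name = '%s'
     GROUP BY site_url
     ORDER BY comments DESC",
    mysql_escape_string($login)
));

while ($row = mysql_fetch_assoc($result)) {
    printf("%s: %d comments, last on %s\n",
        $row['site_url'], $row['comments'], $row['last_seen']);
}
?>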

Even as indirect data, this is pretty powerful stuff. And valuable.

Serendipity

Serendipity is a PHP-based weblog package that you’ve probably seen around the Web. From what I’ve seen it seems pretty solid, and I’m highlighting it here to point out something it does that I haven’t seen other blog software do (at least, not overtly): external link tracking, just like the search engines.

On a Serendipity-powered blog, you’ll often notice, along with other interesting blog-related statistics like most popular posts, a readout called “Top Exits.” These are the most frequently clicked links leading away from the site, so obviously the blog is tracking what links are being clicked on.

Here’s how it works: there’s a PHP script that handles the link redirect (called exit.php, for instance) which is passed the URL of the link. This URL is stored in a database along with the total number of clickthroughs it has garnered. The clickthrough count is updated and you get redirected appropriately. Same way the search engines work. Mostly this is invisible to the user, and a bit of JavaScript trickery makes it appear that nothing is happening at all.
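I haven’t seen exit.php itself, so take this sketch, with made-up table and column names, as my guess at the general shape of such a script rather than Serendipity’s actual code:

<?php
// Sketch of an exit.php-style redirect tracker (not Serendipity's real code).
$url = isset($_GET['url']) ? $_GET['url'] : '/';

mysql_connect('localhost', 'user', 'pass');
mysql_select_db('blog');

// Bump the clickthrough count for this URL (hypothetical "exits" table).
mysql_query(sprintf(
    "UPDATE exits SET clicks = clicks + 1 WHERE url = '%s'",
    mysql_escape_string($url)
));

// Send the visitor on their way.
header('Location: ' . $url);
exit;
?>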

It’s not terribly clever, this JavaScript; it simply changes the status bar of the browser window to reflect the target link (rather than the system-encoded tracking link) when the mouse hovers over it. But it gets the job done, and the clickthrough is tracked.

The clever part is managing this database of URLs; it has to happen automagically, or else nobody would use the software, since no one wants to hand-write obfuscated tracking links and JavaScript. I haven’t looked at Serendipity’s source, but I imagine the system examines each blog entry for links and, if it finds any, updates the database and replaces each link with the tracking link and the necessary JavaScript. That way the process is as invisible to the blog author as it is to the end user.
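If I were writing that rewriting pass myself, it might look something like the sketch below. Again, this is my guess at the technique, not Serendipity’s implementation; the table, script name and regex are all placeholders.

<?php
// Hypothetical pass over an entry's body: find external links, make sure
// they're in the tracking table, and point them at the redirect script.
function rewrite_links($body)
{
    return preg_replace_callback('/href="(https?:\/\/[^"]+)"/i', 'track_link', $body);
}

function track_link($matches)
{
    $url = $matches[1];

    // Register the URL (assumes a unique key on the url column).
    mysql_query(sprintf(
        "INSERT IGNORE INTO exits (url, clicks) VALUES ('%s', 0)",
        mysql_escape_string($url)
    ));

    // Swap in the tracking link, plus the status-bar trick for mouseovers.
    return sprintf(
        'href="/exit.php?url=%s" onmouseover="window.status=\'%s\'; return true"',
        urlencode($url),
        $url
    );
}
?>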

chuggnutt.com

Now I know you’re wondering, what kind of tracking have I got going on here? I’m glad you asked! I’m more than happy to share.

There are the standard Apache Web server logfiles, which track the IP address, date/time, request, browser and referring link of every visitor. I pull those periodically and parse them into a MySQL database running on my home machine, so I can run all sorts of nifty reports on the data that I wouldn’t be able to do otherwise.
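The parsing itself is nothing fancy; a simplified version of the idea looks like this (not my exact script, and the table name here is made up), keyed to Apache’s standard combined log format:

<?php
// Simplified sketch: parse Apache "combined" log lines into MySQL.
// Format: host ident user [date] "request" status bytes "referer" "agent"
$pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\S+) "([^"]*)" "([^"]*)"$/';

mysql_connect('localhost', 'user', 'pass');
mysql_select_db('stats');

$fp = fopen('/var/log/apache/access_log', 'r');
while (!feof($fp)) {
    $line = trim(fgets($fp, 8192));
    if (!preg_match($pattern, $line, $m)) {
        continue;   // skip blank or malformed lines
    }
    mysql_query(sprintf(
        "INSERT INTO access_log (ip, logged_at, request, status, referer, agent)
         VALUES ('%s', '%s', '%s', %d, '%s', '%s')",
        mysql_escape_string($m[1]),
        mysql_escape_string($m[2]),
        mysql_escape_string($m[3]),
        (int) $m[4],
        mysql_escape_string($m[6]),
        mysql_escape_string($m[7])
    ));
}
fclose($fp);
?>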

I do rudimentary tracking of the number of hits to each individual blog post. That’s just a number in the database.

And I set a cookie when a user leaves a comment, if they check the “Remember me” box. Inside that cookie are three text values: the name they left, the email address they left, and the website address they left. All three items can be blank, so it’s entirely possible there are chuggnutt.com cookies containing no data at all floating around on people’s machines.
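For the record, that’s about as simple as tracking gets; roughly along these lines, though the cookie name, form field names and separator here are placeholders rather than what the site actually uses:

<?php
// Roughly how the "Remember me" cookie gets set when a comment is posted.
// Cookie name, field names and separator are illustrative only.
if (isset($_POST['remember'])) {
    $value = implode('|', array(
        $_POST['name'],     // may be blank
        $_POST['email'],    // may be blank
        $_POST['url']       // may be blank
    ));
    setcookie('comment_info', $value, time() + 365 * 24 * 60 * 60);
}
?>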

And that’s it, currently. For all my enthusiasm for the tracking technologies and methods I’ve been spewing about here, what it all really boils down to on my own site is that by and large, I’m just too damn lazy to do more than what’s there already.

Ironic, isn’t it?

However, given the time, I may someday implement some of the things I’ve covered here. I like Google’s method of clickthrough tracking, but I wouldn’t want to rely solely on JavaScript being enabled for it to work; a combination of techniques, perhaps. Though to be honest, I’m not sure what benefit I’d personally get from tracking clickthroughs to other sites… but I am a big brother, after all…

If I ever decide to do something like this, though, I’ll document it here. Open and educational!

2 comments

  1. My blog currently runs Serendipity. I ended up turning off the exit tracking, mainly because I didn’t really care who was clicking on what. I didn’t like the weird URLs either (I have Firefox set to not allow Javascript changes to the status bar…I hate those sites with scrolling messages down there).

    Serendipity has a data table called ‘references’ where all the URLs used in the blog content are sucked out and stored, presumably for doing that exit tracking. Looking at the ‘entries’ table, it looks like Serendipity leaves the blog content alone, so it must swap on the fly.

    I guess I’m more interested in how people get to my blog than what they click on to leave it.

  2. Scrolling status bar JavaScript messages… gah… I remember I did that to an old site of mine once, back in, like, 1997.

    Swapping link content on the fly? If so, that must be a huge performance hit; I wonder how well Serendipity scales in that case.

    For the most part, I’m more interested in how people got here, rather than where they leave to, as well. But that’s appropriate for our niches on the web–the "destination" sites. By and large, data on what links users clicked from my site has no intrinsic value–not like clickthrough data from search engines ("departure" sites).

    Though of course, clickthrough data for your site and mine *is* available… in the logfiles of the sites that get linked to from us. That’s something to think about.

Comments are closed.