... I point out that I simply delete trackbacks from blogs if their sole purpose is to promote some obscure Amazon shops. Sorry, but just because advertising junk is stored in weblog software doesn't mean I let every inappropriate trackback through. And no, just because a keyword from that post also appears in one of my posts doesn't make it an interesting trackback - it's just spam.
I regularly post source snippets, log file excerpts and the like. For this I use the PRE tag so the text is displayed preformatted and in a monospaced font. That works well in most browsers, but a couple of them are giving me quite a bit of trouble. First of all Safari 1.0 - ok, that's inevitably dying out, and the only problem there is that the horizontal scrollbar obscures the bottom line. You can work around that with a trailing blank line if necessary.
But IE for Windows is also acting up - users tell me that the block is always rendered at its full width, without a scrollbar. I don't have Windows here, so I can't test it, but that would be annoying of course - it would mean I can't use PRE on the front page, because it messes up the layout.
Really extreme is IE 5.5 Mac: it hides the PRE completely. And I don't understand why. They simply aren't displayed. The page validates of course. Well, IE Mac 5.5 will hopefully soon be extinct too and the poor folks still using it have my sympathy, but no source code.
But for Windows IE I'd be grateful for a tip on the CSS problem. If you can fix it with normal CSS means and without too heavy-handed hacks, I could build that in. Here's an example article with PRE blocks.
So, I've added Gravatars to the comments. Anyone who has one will now be displayed with a picture. At the moment though, the distribution of Gravatars is still a bit sparse - I find them kind of fun, as they make commenters somewhat more personally recognizable. Not just anonymous names in the background.
Gravatars are looked up by the email address you enter, so to be clear: I will definitely not publish that address. Gravatars use an MD5 hash of the email address, so the address cannot simply be read back out of the link. And besides, WordPress doesn't publish the email anywhere else anyway.
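For the curious, here is a minimal Python sketch of how such a link is put together. It assumes the URL scheme Gravatar documents today (`https://www.gravatar.com/avatar/<hash>` with an optional `s` size parameter), which is not necessarily what the WordPress plugin used here emits; the function name `gravatar_url` is made up for illustration.

```python
import hashlib


def gravatar_url(email: str, size: int = 40) -> str:
    """Build a Gravatar image URL from an email address (illustrative helper)."""
    # Gravatar hashes the trimmed, lowercased address, so different
    # spellings of the same address end up at the same picture.
    normalized = email.strip().lower()
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    # Only the 32-character hex digest ends up in the link, not the address.
    return f"https://www.gravatar.com/avatar/{digest}?s={size}"
```

Calling `gravatar_url("  Someone@Example.com ")` and `gravatar_url("someone@example.com")` gives the same link, because the address is normalized before hashing.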
But if you still don't want to enter your regular address: I have 50 Google Mail invites left over. If you send me a message via my feedback form, you can get one and use that instead. Google Mail has a pretty decent spam filter and with 1 GB of storage space it takes a very long time to fill up if you don't empty it. Perfect as a throwaway account...
And if you don't want that either, you'll simply get my default Gravatar and look a bit pale.
You should never have to reach for your mouse. To make sure Conkeror remains pure, I do not own a mouse.
So if you're a mouse-phobic, you might find some relief with this browser.
And because I'm an experimentally inclined fellow, I naturally had to try it out right away. Ok, Emacs key bindings are terrible (hey, I'm a VI guy), but the whole thing is still quite usable - you could get used to it, if only the other applications on your system had similar controls. And here's a tip for Mac users: yes, the whole thing works for you too. However, you need to start the browser with a parameter, which Firefox.App itself doesn't support. Instead, just enter the following command in the terminal (warning, one line!):
/Applications/Firefox.App/Contents/MacOS/firefox -chrome chrome://conkeror/content
You may need to adjust the path to Firefox.App. After that, a small window opens with a rather spartan help file. Read it thoroughly, because if you don't at least remember how to open the help page, you'll be stuck. The big B goes back in the history, so if you get lost, you can always get back to the help with it. Oh yes, and quitting doesn't work with Apple-Q - after all it's Emacs. So press Ctrl-X and then C, one after the other.
Since I had an interesting study object, I wanted to see how much I could uncover in my logfiles with a bit of cluster analysis. So I created a matrix from referrers and accessing IP addresses and got an overview of typical user scenarios - how do normal users look in the log, how do referrer spammers look, and how does our friend look.
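Roughly, such a matrix can be built like this. This is a sketch and not the script I actually used: it assumes the log has already been parsed into (IP, referrer) pairs, and the names `referrer_matrix` and `format_profile` are made up. The header count only includes hits that carry a referrer, which is how the numbers in the listings in this post appear to add up.

```python
from collections import Counter, defaultdict


def referrer_matrix(hits):
    """Group hits by client IP and count how often each referrer occurs.

    `hits` is an iterable of (ip, referrer) pairs; an empty referrer is
    stored as "-", as in common access log formats.
    """
    matrix = defaultdict(Counter)
    for ip, referrer in hits:
        matrix[ip][referrer or "-"] += 1
    return matrix


def format_profile(ip, counts, width=60):
    """Render one IP's referrer counts, most frequent first."""
    # Header count: only hits that actually carry a referrer.
    with_referrer = sum(n for ref, n in counts.items() if ref != "-")
    lines = [f"{ip}: {with_referrer} accesses"]
    for referrer, n in counts.most_common():
        # Long URLs are cut off and marked with " ..." to keep lines short.
        shown = referrer if len(referrer) <= width else referrer[:width] + " ..."
        lines.append(f"{n:04d}*{shown}")
    return "\n".join(lines)
```

Timestamps are left out to keep the sketch short; adding the first and last access per IP would mean carrying a third field through `referrer_matrix`.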
All three variants can be distinguished well, even though I'd currently rather shy away from capturing it algorithmically - all of it can be simulated quite well. Still, a few peculiarities are noticeable. First, a completely normal user:
aa.bb.cc.dd: 7 accesses, 2005-02-05 03:01:45.00 - 2005-02-04 16:18:09.00
0065*-
0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031994 ...
0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031612 ...
0001*http://mudbomb.com/archives/2005/02/02/wysiwyg-plugin-for-wo ...
0001*http://www.heise.de/newsticker/meldung/55992
0001*http://log.netbib.de/archives/2005/02/04/nzz-online-archiv-n ...
0001*http://www.heise.de/newsticker/meldung/56000
0001*http://a.wholelottanothing.org/2005/02/no_one_can_have.html
You can nicely see how this user clicked away from my weblog and came back - the referrers are by no means all links to me, but incorrect referrers that browsers send when switching from one site to another. Referrers are actually supposed to be sent only when a link is really clicked - hardly any browser does that correctly. The visit falls within a single day, and the user got in directly by entering the domain name (the "-" referrers, i.e. accesses without a referrer, are at the top, and the earliest referrer that appears is listed first).
Or here's an access from me:
aa.bb.cc.dd: 6 accesses, 2005-02-04 01:11:56.00 - 2005-02-03 08:27:09.00
0045*-
0001*http://www.aylwardfamily.com/content/tbping.asp
0001*http://temboz.rfc1437.de/view
0001*http://web.morons.org/article.jsp?sectionid=1&id=5947
0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4029220 ...
0001*http://sport.ard.de/sp/fussball/news200502/03/bvb_verpfaende ...
0001*http://www.cadenhead.org/workbench/entry/2005/02/03.html
I recognize myself by the referrer with temboz.rfc1437.de - that's my online aggregator. Looks similar - a lot of incorrectly sent referrers. Another user:
aa.bb.cc.dd: 19 accesses, 2005-02-12 14:45:35.00 - 2005-01-31 14:17:07.00
0015*http://www.muensterland.org/system/weblogUpdates.py
0002*-
0001*http://www.google.com/search?q=cocoa+openmcl&ie=UTF-8&oe=UTF ...
0001*http://blog.schockwellenreiter.de/8136
0001*http://www.google.com/search?q=%22Rainer+Joswig%22&ie=UTF-8& ...
0001*http://www.google.com/search?q=IDEKit&hl=de&lr=&c2coff=1&sta ...
This one came repeatedly (across multiple days) via my update page on muensterland.org and also searched for Lisp topics. And once they came over from the Schockwellenreiter. Absolutely typical behavior.
Now in comparison, a typical referrer spammer:
aa.bb.cc.dd: 6 accesses, 2005-02-12 17:27:27.00 - 2005-02-02 09:25:22.00
0002*http://tramadol.freakycheats.com/
0001*http://diet-pills.ronnieazza.com/
0001*http://phentermine.psxtreme.com/
0001*http://free-online-poker.yelucie.com/
0001*http://poker-games.psxtreme.com/
All referrers are direct domain referrers. No "-" referrers - so no accesses without a referrer. No other accesses - if I analyzed it more precisely by page type, it would be noticeable that no images, etc. are accessed. Easy to recognize - just looks sparse. Typical is also that each URL is listed only once or twice.
Now our new friend:
aa.bb.cc.dd: 100 accesses, 2005-02-13 15:06:16.00 - 2005-02-11 07:07:55.00
0039*-
0030*http://irish.typepad.com
0015*http://www208.pair.com
0015*http://blogs.salon.com
0015*http://hfilesreviewer.f2o.org
0015*http://betas.intercom.net
0005*http://vowe.net
0005*http://spleenville.com
What stands out are the referrers without a trailing slash - atypical for referrer spam. Also, just normal sites. Also noticeable is that pages are accessed without a referrer - hidden behind these are the RSS feeds. This one is also easily distinguishable from users. Especially since there's a certain rhythm to it - apparently always 15 accesses with one referrer, then switch the referrer. Either the referrer list is quite small, or I was lucky that it tried the same one with me twice - one of them is there 30 times.
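To check such a rhythm, you could collapse one IP's referrers into runs. The sketch below assumes the hits of a single IP are available in chronological order and that the accesses with one referrer really arrive back to back; the listing above only shows totals, so that is a guess. The helper name `referrer_runs` is made up.

```python
from itertools import groupby


def referrer_runs(referrers):
    """Collapse a chronological referrer sequence into (referrer, run_length) pairs."""
    # groupby only merges *adjacent* equal items, which is exactly what a
    # "15 hits, then switch" pattern would look like in the log.
    return [(ref, sum(1 for _ in group)) for ref, group in groupby(referrers)]
```

A bot working through its list in blocks would show up as a series of runs of equal length, while a normal visitor's referrers change from hit to hit.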
Normal bots don't need much comparison - few of them send referrers, so they are completely uninteresting here. I had one that caught my attention:
aa.bb.cc.dd: 5 accesses, 2005-02-13 15:21:26.00 - 2005-01-31 01:01:07.00
2612*-
0003*http://www.everyfeed.com/admin/new_site_validation.php?site= ...
0002*http://www.everyfeed.com/admin/new_site_validation.php?site= ...
A new search engine for feeds that I didn't know yet. Apparently the admin had just entered my address somewhere beforehand and then the bot started collecting pages. After that, he activated my newly found feeds in the admin interface. Seems to be a small system - the bot runs from the same IP as the admin interface. Most other bots come from entire bot farms, web spidering is an expensive affair after all ...
In summary, the current generation of referrer spam bots and other bad bots is still quite primitive in structure. They don't use botnets to spread their accesses over many different addresses and hide that way, they use pure server URLs instead of page URLs, and they have other quite typical characteristics such as certain rhythms. They also almost always come multiple times.
Unfortunately, these are not good features to capture algorithmically - unless you load your referrers into an SQL database and check each one against the typical criteria with appropriate queries. This way you could definitely catch the usual suspects and block them right on the server, because normal user accesses look quite different.
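As a rough idea of what that could look like, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The table layout, the helper names and the threshold of three hits are all assumptions made for illustration. The only criterion it checks is the one from above: an IP whose referrers are all pure server URLs and never a page URL.

```python
import sqlite3
from urllib.parse import urlsplit


def referrer_kind(referrer):
    """Classify a referrer as 'none', 'site' (host only) or 'page'."""
    if not referrer or referrer == "-":
        return "none"
    parts = urlsplit(referrer)
    # "http://host" and "http://host/" both count as pure server URLs.
    if parts.path in ("", "/") and not parts.query:
        return "site"
    return "page"


def suspicious_ips(hits, min_site_hits=3):
    """Return IPs that sent only pure server URLs as referrers.

    `hits` is an iterable of (ip, referrer) pairs. The threshold is an
    arbitrary assumption, not a value derived from the logs.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hits (ip TEXT, referrer TEXT, kind TEXT)")
    db.executemany(
        "INSERT INTO hits VALUES (?, ?, ?)",
        [(ip, ref, referrer_kind(ref)) for ip, ref in hits],
    )
    # In SQLite a comparison yields 0 or 1, so SUM() counts matching rows.
    rows = db.execute(
        """
        SELECT ip FROM hits
        GROUP BY ip
        HAVING SUM(kind = 'page') = 0 AND SUM(kind = 'site') >= ?
        ORDER BY ip
        """,
        (min_site_hits,),
    ).fetchall()
    db.close()
    return [ip for (ip,) in rows]
```

Hits without a referrer are deliberately ignored by the query, so a bot that also fetches feeds with an empty referrer is still caught. A bot that sends plausible page URLs would slip through.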
However, new generations are already in the works - as my little friend shows, the one with the missing slash. And thanks to the stupid browsers with their incorrectly generated referrers (which say much more about the browser's history than about actual link following), you can't simply counter-check the referenced pages, since many referrers are pure blind referrers.
I just found some referrers in my logs where I absolutely couldn't find anything on the referring sites that would point back to me. Nothing unusual so far - referrer spam would be the first suspicion. But the sites mentioned in the referrers are perfectly normal weblogs and other sites - no one who would have reason to spam their site (for example, a blog with about 1 post per month, or an Irish site and a few other strange referrers). The numbers are also different from normal referrer spam: that usually comes either only 1-2 times, or with many addresses and each of them about 100 times or so. This one comes about 15 times.
So I dug around in the logs a bit to see if I could find something. And sure enough, the referrers have unusual characteristics: they don't end with a /. Normally an address that doesn't end with / is automatically redirected to the /-variant. Referrers are thus normally /-terminated or direct HTML pages or something comparable. Pure site specifications without a / at the end are rather rare.
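Testing a referrer for this is simple. A small sketch with the standard library's `urlsplit`; the function name `is_bare_host_referrer` is made up for illustration:

```python
from urllib.parse import urlsplit


def is_bare_host_referrer(referrer: str) -> bool:
    """True for referrers like "http://example.com" that have no path at all."""
    if not referrer or referrer == "-":
        return False
    parts = urlsplit(referrer)
    # "http://example.com/" has the path "/", so it does not count:
    # only a completely missing path is the unusual case described above.
    return bool(parts.netloc) and parts.path == "" and not parts.query
```

On its own this only marks a referrer as unusual, since a referrer header can contain anything. It becomes interesting together with the other characteristics.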
Something else also stands out: the pages were actually accessed - or at least downloaded. And the pages belonging to one referrer are quite randomly mixed - with normal users you'd actually expect some form of consistency in what comes through as a referrer. Above all, it's rare for 15 links to come to one page all at once...
And the essential criterion: the IP of the accessing computer is always the same across all the different referrers. An analysis then produced the following picture:
15 betas.intercom.net
15 blogs.salon.com
15 hfilesreviewer.f2o.org
30 irish.typepad.com
5 spleenville.com
5 vowe.net
15 www208.pair.com
All clearly fake referrers. Additionally, 34 accesses to my RSS feeds without a referrer. Accesses were only to direct posts and RSS feeds - not to overview pages or archive pages. It looks very much like the bot is proceeding as follows: search for RSS feeds, grab them, then search for permalinks to articles in them and download them to access comment forms, for example. The whole thing nicely disguised as supposed visitors, including forged referrers that seem unsuspicious. Also not too many accesses from one referrer, rather switch it up more often.
Actually nothing new - with email spam, forged but real sender addresses are quite common, because they make filtering harder. But with scraper bots, I'm seeing this kind of mimicry live for the first time - I've only been observing these symptoms for about 1-2 weeks now.
For admins, this whole thing is quite annoying, since you can use referrer logs even less than you could before. Previous referrer spam was certainly a nuisance, but due to the pretty dumb names of the referrers it was easy to recognize. This form of log phenomenon also falsifies the referrers - but is much less noticeable. Could be interesting for weblogs that display their referrers directly in the post.
And of course the problem remains that I still don't know what the bot wants to do with the collected information. I strongly suspect spam, but that's just a guess - it could also be a bot searching for typical security holes. In any case it's a bot and in any case it has no good intentions - because otherwise it wouldn't need to hide.