The Sohu.com Search Bot Is Acting Strange

The search bot from sohu.com is currently crawling my pages. So far, so good. It uses robots.txt, which is already a good sign. But there are two things that really puzzle me:

First, it requests every page twice: once with HEAD and once with GET. That's pretty stupid for two reasons. For one, a single conditional GET does the same job in one round trip: the server answers 304 Not Modified if nothing has changed and sends the page if it has. For another, it makes dynamically generated pages get generated twice. Even though a HEAD request returns only the header lines, the server often has to build the whole page anyway, for example to calculate the Content-Length (of course, this depends on how the generating system is written).
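To make the conditional GET point concrete, here is a minimal sketch of how a crawler could do it with Python's standard library. The function names `conditional_headers` and `fetch_if_changed` are made up for illustration, and this is of course not how the Sohu bot works; it only shows what one request per page could look like, assuming the crawler stored the ETag or Last-Modified value from its previous visit.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validator headers for a conditional GET from a cached copy."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch url with one request.

    Returns (status, body, etag, last_modified); body is None when the
    server answers 304 Not Modified.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.status,
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; it just means "unchanged".
        if err.code == 304:
            return (304, None, etag, last_modified)
        raise
```

On the first visit the crawler would call `fetch_if_changed(url)` and remember the returned validators. On later visits it passes them back in, and an unchanged page costs one short 304 response instead of a HEAD followed by a GET.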

Second, every few pages it accesses a page called abcdefghijklmn.htm. And I really don't understand what that nonsense is supposed to be. Some kind of keep-alive check? No idea. Very strange.
