Content-type: matter-transport/sentient-life-form

Vampire - An extension of mod_python that makes it more developer-friendly. For example, it can also perform automatic code reloading.

Overview of new features in Apache 2.2 - Apache HTTP Server - what's new in Apache 2.2. Very interesting: the Event MPM. With this, Apache finally reports back at the top of the line for Keep-Alive sessions (previously, Apache had to reserve a worker for each Keep-Alive, which made Apache nearly unusable for streaming with a larger number of clients).

Django, Apache and FCGI

In Django, lighttpd and FCGI, second take I described a method how to run Django with FCGI behind a lighttpd installation. I did run the Django FCGIs as standalone servers so that you can run them under different users than the webserver. This document will give you the needed information to do the same with Apache 1.3.

Update: I maintain my descriptions now in my trac system. See the Apache+FCGI description for Django.

Update: I changed from using unix sockets to using tcp sockets in the description. The reason is that unix sockets need write access from both processes - webserver and FCGI server - and that's a bit hard to setup right, sometimes. tcp sockets are only a tad bit slower but much easier to set up.

First the main question some might ask: why Apache 1.3? The answer is simple: many people still have Apache 1.3 running as their main server and can't easily upgrade to Apache 2.0 - for example if they run large codebases in mod perl or mod python they will run into troubles with migrating because Apache 2.0 will require mod perl2 or mod python2 and both are not fully compatible with older versions. And even though lighttpd is a fantastic webserver, if you already run Apache 1.3 there might just not be the need for another webserver.

So what do you need - besides the python and django stuff - for Apache 1.3 with FastCGI? Just the mod rewrite module and mod fastcgi module installed, that's all. Both should come with your systems distribution. You will still need all the python stuff I listed in the lighttpd article.

mod_fastcgi is a bit quirky in it's installation, I had to play a bit around with it. There are a few pitfalls I can think of:

the specification of the socket can't be an absolute path but must be a relative path with respect to the FastCgiIpcDir
the specification of the FCGI itself (even though it's purely virtual) must be in a fully qualified form with respect to the document root you want to use. If you use a relative path, it will be relative to the document root of the default virtual host - and that's most surely not the document root you will use if you want to set up a virtual host with the FCGI.
the FCGI itself can't be defined within a virtual host - it must be defined in the main server config. That's where the relative addressing problem comes into play.
the socket file must be both readable and writeable by the FCGI user and the Apache user. Usually you do this by changing the socket file to group writeable and changing the group of that socket file to a group where both the user and the apache are members of.

Now here is the config snippet you have to add to your httpd.conf. I use the same directories as with the lighttpd sample, you most surely will have to adapt that to your situation.


 FastCgiExternalServer /home/gb/work/myproject/publichtml/admin.fcgi -host 127.0.0.1:8000
FastCgiExternalServer /home/gb/work/myproject/publichtml/main.fcgi -host 127.0.0.1:8001

 <VirtualHost *> ServerAdmin gb@bofh.ms
 Servername www.example.com
 ErrorLog /home/gb/work/myproject/logs/django-error.log
 CustomLog /home/gb/work/myproject/logs/django-access.log combined
 DocumentRoot /home/gb/work/myproject/public_html
 RewriteEngine On
 RewriteRule ^(/admin/.)$ /admin.fcgi$1 [L]
 RewriteRule ^(/main/.)$ /main.fcgi$1 [L]
 </VirtualHost> ```

You have to allow the webserver write access to the logs directory, so you might want to use a different location for them - possibly in `/var/log/apache/ `or whereever your apache puts it's logs. The FastCgiExternalServer directives must be outside of the virtual host definitions, but must point to files within the virtual hosts document root. But those files needn't (and probably shouldn't) exist in the filesystem, they are purely virtual. The given setup reflects the setup I did for the lighttpd scenario.

Now restart your apache, start your django-fcgi.py and you should be able to access your django application. Keep in mind to copy the admin_media files over to the document root, otherwise your admin will look very ugly.

django-fcgi.py --settings=myproject.settings.main --host=127.0.0.1 --port=8000 --daemon django-fcgi.py --settings=myproject.settings.admin --host=127.0.0.1 --port=8001 --daemon


Have fun.

Apache modauthtkt is a framework for Single-Signon in Apache-based solutions across technology boundaries (CGI, mod_perl and whatever else exists). I should take a look at it, could be interesting for me.

First Django Tutorials Online

The Django programmers start with the tutorials. The first tutorial primarily deals with creating the database model and the basic code for the objects to be managed, and the second tutorial deals with the automatically generated administration interface. Very nice, all of it.

The system is of course strongly focused on content creation and management - but still general enough so that it can also be used for differently structured content. The entire administration is automatically created from the object model and some hints, so it always aligns with the real data in the system. And the default look is also quite appealing.

Server integration is done simply via mod python - so via Apache. Which is also an advantage, as mod python offers very high performance right out of the box. And for more demanding cases, there's the caching in Django. I must say, what I've seen of Django so far, I like it very much.

An important note is missing in the installation instructions: Apache2 is mandatory, and therefore also ModPython in the corresponding version. However, Mac OS X only provides Apache 1.3, and many other servers also only have the 1.3 Apache available, so Django still has a real drawback here.

By the way, if you want to upgrade from Apache to Apache2 on Debian: if mod perl is in use, forget it. The mod perl2 for Apache2 in Debian Sarge is complete garbage - as if the API changes in mod perl2 compared to the old mod perl weren't annoying enough. In principle, you can no longer get Perl modules to run so easily with it.

Update: By the way, there is currently a lot of activity in the Subversion for Django to eliminate the requirement for Apache. A simple development server is already included, so in the future you will no longer need Apache for initial experiments. And you could also set up the deployment on other legs in the long run - for example, FCGI behind lighttpd.

Update 2: The third tutorial is out and deals with the view for the visitor. They have a pretty intense pace right now with Django.

Spyce is a Python web framework with damn good performance: a simple page with a template behind it delivers over 90 hits per second on my machine (Spyce integrated into Apache via mod_python, memory cache). Take that, PHP!

Apache2, php5-fcgi, php4-fcgi, mod_fastcgi HowTo

Apache2, php5-fcgi, php4-fcgi, mod_fastcgi HowTo provides everything you need to know to run PHP as an FCGI process. And even in German. The little bit of Apache2 in there can be mentally converted to Apache 1.3, the Apache is actually hardly affected.

FCGI offers, in combination with suexec, the possibility to run PHP per virtual host under a dedicated user and thus the possibility in shared hosting environments to set up files in a virtual host so that another user with his PHP cannot read them. You could even run the FCGI-PHPs in a chroot jail to isolate them even more.

In addition, FCGI is often significantly more resource-efficient for PHP, as fewer PHP processes can run than Apache processes and the Apache processes do not become so bloated. If you have many virtual hosts, this can lead to the FCGI processes catching up in number - but then you should consider whether the FCGI processes should not run better on a dedicated machine.

This would be exactly the right thing for simon, especially since I could then also allow PHP for the other users.

mod_fastcgi and mod_rewrite

Well, I actually tried using PHP as FastCGI - among other things because I could also use a newer PHP version. And what happened? Nothing. And there was a massive problem with mod rewrite rules. In the WordPress .htaccess, everything is rewritten to the index.php. The actual path that was accessed is appended to the index.php as PATH INFO. Well, and the PHP then spits out this information again and does the right thing.

But when I had activated FastCGI, that didn't work - the PHP always claimed that no input file was passed. So as if I had called the PHP without parameters. The WordPress administration - which works with normal PHP files - worked wonderfully. And the permission stuff also worked well, everything ran under my own user.

Only the Rewrite-Rules didn't work - and thus the whole site didn't. Pretty annoying. Especially since I can't properly test it without taking down my main site. It's also annoying that suexec apparently looks for the actual FCGI starters in the document root of the primary virtual server - not in those of the actual virtual servers. This makes the whole situation a bit unclear, as the programs (the starters are small shell scripts) are not where the files are. Unless you have created your virtual servers below the primary virtual server - but I personally consider that highly nonsensical, as you can then bypass Perl modules loaded in the virtual server by direct path specifications via the default server.

Ergo: a failure. Unfortunately. Annoying. Now I have to somehow put together a test box with which I can analyze this problem ...

Update: a bit of searching and digging on the net and a short test and I'm wiser: PATH_INFO with PHP as FCGI version under Apache is broken. Apparently, PHP gets the wrong PATH_INFO entry and the wrong SCRIPT NAME. As a result, the interpreter simply does not find its script when PATH INFO is set and nothing works anymore. Now I have to search further to see if there is a solution. cgi.fix_pathinfo = 1 (which is generally offered as a help for this) does not work anyway. But if I see it correctly, there is no usable solution for this - at least none that is obvious to me. Damn.

Update 2: I found a solution. This is based on simply not using Apache, but lighttpd - and putting Apache in front as a transparent proxy. This works quite well, especially if I strongly de-core the Apache and throw the PHP out of it, it also becomes much slimmer. And lighttpd can run under different user accounts, so I also save myself the wild hacking with suexec. However, a lighttpd process then runs per user (lighttpd only needs one process per server, as it works with asynchronous communication) and the PHPs run wild as FastCGI processes, not as Apache-integrated modules. Apache itself is then only responsible for purely static presences or sites with Perl modules - I still have quite a few of those. At the moment I only have a game site running there, but maybe it will be switched in the next few days. The method by which cruft-free URIs are produced is quite funny: in WordPress you can simply enter the index.php as an Error-Document: ErrorDocument 404 /index.php?error=404 would be the entry in the .htaccess, in lighttpd there is an equivalent entry. This automatically redirects non-existent files (and the cruft-free URIs do not exist as physical files) to WordPress. There it is then checked whether there really is no data for the URI and if there is something there (because it is a WordPress URI), the status is simply reset. For the latter, I had to install a small patch in WordPress. This saves you all the RewriteRules and works with almost any server. And because it's now 1:41, I'm going to bed now ...

Alternative Rewrite Rules result in a significantly simpler .htaccess, especially one that doesn't constantly need to be updated by WordPress. This is particularly practical if you also use the .htaccess for other purposes. Additionally, Apache is not necessarily faster with the complex Rewrite-Rules from WordPress. I have activated them myself, let's see how WordPress 1.5 performs with these entries. If there are no problems, they will stay that way, because I like them much better than the other variant. And they don't have the problems that the others have - old mod_rewrite can only do greedy matching, which makes creating complex lists of rewrites quite hairy ...

And log files again

Since I had an interesting study object, I wanted to see how much I could uncover in my logfiles with a bit of cluster analysis. So I created a matrix from referrers and accessing IP addresses and got an overview of typical user scenarios - how do normal users look in the log, how do referrer spammers look, and how does our friend look.

All three variants can be distinguished well, even though I'd currently rather shy away from capturing it algorithmically - all of it can be simulated quite well. Still, a few peculiarities are noticeable. First, a completely normal user:


aa.bb.cc.dd: 7 accesses, 2005-02-05 03:01:45.00 - 2005-02-04 16:18:09.00
 0065*-
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031994 ...
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031612 ...
 0001*http://mudbomb.com/archives/2005/02/02/wysiwyg-plugin-for-wo ...
 0001*http://www.heise.de/newsticker/meldung/55992
 0001*http://log.netbib.de/archives/2005/02/04/nzz-online-archiv-n ...
 0001*http://www.heise.de/newsticker/meldung/56000
 0001*http://a.wholelottanothing.org/2005/02/no_one_can_have.html

You can nicely see how this user clicked away from my weblog and came back - the referrers are by no means all links to me, but incorrect referrers that browsers send when switching from one site to another. Referrers are actually supposed to be sent only when a link is really clicked - hardly any browser does that correctly. The visit was on a defined day and they got in directly by entering the domain name (the "-" referrers are at the top and the earliest referrer that appears is at the top).

Or here's an access from me:


aa.bb.cc.dd: 6 accesses, 2005-02-04 01:11:56.00 - 2005-02-03 08:27:09.00
 0045*-
 0001*http://www.aylwardfamily.com/content/tbping.asp
 0001*http://temboz.rfc1437.de/view
 0001*http://web.morons.org/article.jsp?sectionid=1&id=5947
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4029220 ...
 0001*http://sport.ard.de/sp/fussball/news200502/03/bvb_verpfaende ...
 0001*http://www.cadenhead.org/workbench/entry/2005/02/03.html

I recognize myself by the referrer with temboz.rfc1437.de - that's my online aggregator. Looks similar - a lot of incorrectly sent referrers. Another user:


aa.bb.cc.dd: 19 accesses, 2005-02-12 14:45:35.00 - 2005-01-31 14:17:07.00
 0015*http://www.muensterland.org/system/weblogUpdates.py
 0002*-
 0001*http://www.google.com/search?q=cocoa+openmcl&ie=UTF-8&oe=UTF ...
 0001*http://blog.schockwellenreiter.de/8136
 0001*http://www.google.com/search?q=%22Rainer+Joswig%22&ie=UTF-8& ...
 0001*http://www.google.com/search?q=IDEKit&hl=de&lr=&c2coff=1&sta ...

This one came more often (across multiple days) via my update page on muensterland.org and also searched for Lisp topics. And they came from the shock wave guy once. Absolutely typical behavior.

Now in comparison, a typical referrer spammer:


aa.bb.cc.dd 6 accesses, 2005-02-12 17:27:27.00 - 2005-02-02 09:25:22.00
 0002*http://tramadol.freakycheats.com/
 0001*http://diet-pills.ronnieazza.com/
 0001*http://phentermine.psxtreme.com/
 0001*http://free-online-poker.yelucie.com/
 0001*http://poker-games.psxtreme.com/

All referrers are direct domain referrers. No "-" referrers - so no accesses without a referrer. No other accesses - if I analyzed it more precisely by page type, it would be noticeable that no images, etc. are accessed. Easy to recognize - just looks sparse. Typical is also that each URL is listed only once or twice.

Now our new friend:


aa.bb.cc.dd: 100 accesses, 2005-02-13 15:06:16.00 - 2005-02-11 07:07:55.00
 0039*-
 0030*http://irish.typepad.com
 0015*http://www208.pair.com
 0015*http://blogs.salon.com
 0015*http://hfilesreviewer.f2o.org
 0015*http://betas.intercom.net
 0005*http://vowe.net
 0005*http://spleenville.com

What stands out are the referrers without a trailing slash - atypical for referrer spam. Also, just normal sites. Also noticeable is that pages are accessed without a referrer - hidden behind these are the RSS feeds. This one is also easily distinguishable from users. Especially since there's a certain rhythm to it - apparently always 15 accesses with one referrer, then switch the referrer. Either the referrer list is quite small, or I was lucky that it tried the same one with me twice - one of them is there 30 times.

Normal bots don't need much comparison - few of them send referrers and are therefore completely uninteresting. I had one that caught my attention:


aa.bb.cc.dd: 5 accesses, 2005-02-13 15:21:26.00 - 2005-01-31 01:01:07.00
 2612*-
 0003*http://www.everyfeed.com/admin/new_site_validation.php?site= ...
 0002*http://www.everyfeed.com/admin/new_site_validation.php?site= ...

A new search engine for feeds that I didn't know yet. Apparently the admin had just entered my address somewhere beforehand and then the bot started collecting pages. After that, he activated my newly found feeds in the admin interface. Seems to be a small system - the bot runs from the same IP as the admin interface. Most other bots come from entire bot farms, web spidering is an expensive affair after all ...

In summary, it can be concluded that the current generation of referrer spammer bots and other bad bots are still quite primitive in structure. They don't use botnets to use many different addresses and hide that way, they use pure server URLs instead of page URLs and have other quite typical characteristics such as certain rhythms. They also almost always come multiple times.

Unfortunately, these are not good features to capture algorithmically - unless you run your referrers into a SQL database and check each referrer with appropriate queries against the typical criteria. This way you could definitely catch the usual suspects and block them right on the server. Because normal user accesses look quite different.

However, new generations are already in the works - as my little friend shows, the one with the missing slash. And thanks to the stupid browsers with their incorrectly generated referrers (which say much more about the browser's history than about actual link following), you can't simply counter-check the referenced pages, since many referrers are pure blind referrers.

ModSecurity - Web Intrusion Detection And Prevention / mod_security is an Apache module that examines requests and decides based on filters whether a request should be allowed through or whether a filter measure (script, log, etc.) should be triggered. Quite interesting, even though I'm generally skeptical about rule-based filtering against attacks - it only finds known or expected attacks. The real danger lies in the unexpected attacks...

CVS Module for Apache - An Apache module that serves files from a CVS repository and performs a checkout when needed