apache

Vampire - An extension of mod_python that makes it more developer-friendly. For example, it can also perform automatic code reloading.

Overview of new features in Apache 2.2 - Apache HTTP Server - what's new in Apache 2.2. Very interesting: the Event MPM. With this, Apache is finally competitive again for Keep-Alive sessions (previously, Apache had to reserve a worker for each Keep-Alive connection, which made it nearly unusable for streaming to a larger number of clients).

Django, Apache and FCGI

In Django, lighttpd and FCGI, second take I described a method for running Django with FCGI behind a lighttpd installation. I ran the Django FCGIs as standalone servers so that you can run them under different users than the webserver. This document gives you the information you need to do the same with Apache 1.3.

Update: I now maintain my descriptions in my trac system. See the Apache+FCGI description for Django.

Update: I changed the description from using unix sockets to using tcp sockets. The reason is that unix sockets need write access from both processes - webserver and FCGI server - and that is sometimes a bit hard to set up right. tcp sockets are only a tad slower but much easier to set up.

First, the main question some might ask: why Apache 1.3? The answer is simple: many people still have Apache 1.3 running as their main server and can't easily upgrade to Apache 2.0. For example, if they run large codebases in mod_perl or mod_python, they will run into trouble migrating, because Apache 2.0 requires the newer Apache 2 versions of mod_perl and mod_python, and neither is fully compatible with its older version. And even though lighttpd is a fantastic webserver, if you already run Apache 1.3 there might just be no need for another webserver.

So what do you need - besides the python and django stuff - for Apache 1.3 with FastCGI? Just the mod_rewrite module and the mod_fastcgi module installed, that's all. Both should come with your system's distribution. You will still need all the python stuff I listed in the lighttpd article.

mod_fastcgi is a bit quirky in its installation; I had to play around with it a bit. There are a few pitfalls I can think of:

  • the specification of the socket can't be an absolute path but must be a relative path with respect to the FastCgiIpcDir
  • the specification of the FCGI itself (even though it's purely virtual) must be in a fully qualified form with respect to the document root you want to use. If you use a relative path, it will be relative to the document root of the default virtual host - and that's most surely not the document root you will use if you want to set up a virtual host with the FCGI.
  • the FCGI itself can't be defined within a virtual host - it must be defined in the main server config. That's where the relative addressing problem comes into play.
  • the socket file must be both readable and writable by the FCGI user and the Apache user. Usually you do this by making the socket file group-writable and assigning it to a group that both the FCGI user and the Apache user belong to.

Now here is the config snippet you have to add to your httpd.conf. I use the same directories as with the lighttpd sample, you most surely will have to adapt that to your situation.


```
FastCgiExternalServer /home/gb/work/myproject/public_html/admin.fcgi -host 127.0.0.1:8000
FastCgiExternalServer /home/gb/work/myproject/public_html/main.fcgi -host 127.0.0.1:8001

<VirtualHost *>
    ServerAdmin gb@bofh.ms
    ServerName www.example.com
    ErrorLog /home/gb/work/myproject/logs/django-error.log
    CustomLog /home/gb/work/myproject/logs/django-access.log combined
    DocumentRoot /home/gb/work/myproject/public_html
    RewriteEngine On
    RewriteRule ^(/admin/.*)$ /admin.fcgi$1 [L]
    RewriteRule ^(/main/.*)$ /main.fcgi$1 [L]
</VirtualHost>
```

You have to allow the webserver write access to the logs directory, so you might want to use a different location for them - possibly in `/var/log/apache/` or wherever your Apache puts its logs. The FastCgiExternalServer directives must be outside of the virtual host definitions, but must point to files within the virtual host's document root. Those files needn't (and probably shouldn't) exist in the filesystem; they are purely virtual. The given setup mirrors the one I used for the lighttpd scenario.
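
To make the effect of the two rewrite rules more tangible, here is a minimal Python sketch that mimics them. It assumes the patterns are meant to end in `.*`, so that everything below /admin/ and /main/ is matched. The function name `rewrite` is purely illustrative; this is not how mod_rewrite works internally.

```python
import re

# Python equivalents of the two RewriteRule patterns (assumed to end in ".*").
RULES = [
    (re.compile(r"^(/admin/.*)$"), r"/admin.fcgi\1"),
    (re.compile(r"^(/main/.*)$"), r"/main.fcgi\1"),
]


def rewrite(path):
    """Return the rewritten path, or the path unchanged if no rule matches."""
    for pattern, target in RULES:
        if pattern.match(path):
            # Like the [L] flag: stop after the first rule that matches.
            return pattern.sub(target, path)
    return path


print(rewrite("/admin/polls/"))     # /admin.fcgi/admin/polls/
print(rewrite("/main/2005/02/"))    # /main.fcgi/main/2005/02/
print(rewrite("/media/style.css"))  # /media/style.css
```

The original path ends up behind the virtual .fcgi name, which is how the FCGI server gets to see it as extra path information.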

Now restart your Apache, start your django-fcgi.py processes and you should be able to access your Django application. Remember to copy the admin_media files over to the document root, otherwise your admin will look very ugly.

django-fcgi.py --settings=myproject.settings.main --host=127.0.0.1 --port=8000 --daemon
django-fcgi.py --settings=myproject.settings.admin --host=127.0.0.1 --port=8001 --daemon
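
If Apache answers with an error after this, it can help to check first whether the two FCGI servers accept TCP connections at all. The following is a small sketch for that, assuming the host and ports from the commands above; `is_listening` is a hypothetical helper, not part of Django or django-fcgi.py.

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, ...
        return False


if __name__ == "__main__":
    # Ports as used in the django-fcgi.py commands (assumed unchanged).
    for port in (8000, 8001):
        state = "up" if is_listening("127.0.0.1", port) else "DOWN"
        print(f"127.0.0.1:{port} {state}")
```

This only tells you that something is listening on the port, not that it speaks FastCGI correctly.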


Have fun.

Apache mod_auth_tkt is a framework for single sign-on in Apache-based solutions across technology boundaries (CGI, mod_perl and whatever else exists). I should take a look at it; it could be interesting for me.

First Django Tutorials Online

The Django programmers start with the tutorials. The first tutorial primarily deals with creating the database model and the basic code for the objects to be managed, and the second tutorial deals with the automatically generated administration interface. Very nice, all of it.

The system is of course strongly focused on content creation and management - but still general enough so that it can also be used for differently structured content. The entire administration is automatically created from the object model and some hints, so it always aligns with the real data in the system. And the default look is also quite appealing.

Server integration is done simply via mod_python - that is, via Apache. This is also an advantage, as mod_python offers very high performance right out of the box. And for more demanding cases, there's the caching in Django. I must say, from what I've seen of Django so far, I like it very much.

An important note is missing from the installation instructions: Apache2 is mandatory, and therefore also mod_python in the corresponding version. However, Mac OS X only provides Apache 1.3, and many other servers also only have Apache 1.3 available, so Django still has a real drawback here.

By the way, if you want to upgrade from Apache to Apache2 on Debian: if mod_perl is in use, forget it. The mod_perl2 for Apache2 in Debian Sarge is complete garbage - as if the API changes in mod_perl2 compared to the old mod_perl weren't annoying enough. In practice, you can no longer get Perl modules to run easily with it.

Update: By the way, there is currently a lot of activity in Django's Subversion repository to eliminate the requirement for Apache. A simple development server is already included, so in the future you will no longer need Apache for initial experiments. And in the long run you could also base the deployment on something else - for example, FCGI behind lighttpd.

Update 2: The third tutorial is out and deals with the view for the visitor. They have a pretty intense pace right now with Django.

Spyce is a Python web framework with damn good performance: a simple page with a template behind it delivers over 90 hits per second on my machine (Spyce integrated into Apache via mod_python, memory cache). Take that, PHP!

Apache2, php5-fcgi, php4-fcgi, mod_fastcgi HowTo

Apache2, php5-fcgi, php4-fcgi, mod_fastcgi HowTo provides everything you need to know to run PHP as an FCGI process. And it is even in German. The little bit of Apache2 in there can be mentally translated to Apache 1.3; Apache itself is hardly affected.

In combination with suexec, FCGI makes it possible to run PHP per virtual host under a dedicated user. In shared hosting environments you can therefore set up the files in a virtual host so that another user cannot read them with his PHP. You could even run the FCGI PHP processes in a chroot jail to isolate them further.

In addition, FCGI is often significantly more resource-efficient for PHP, since you can run fewer PHP processes than Apache processes and the Apache processes do not become so bloated. If you have many virtual hosts, the number of FCGI processes can catch up again - but then you should consider whether the FCGI processes would be better off on a dedicated machine.

This would be exactly the right thing for simon, especially since I could then also allow PHP for the other users.

mod_fastcgi and mod_rewrite

Well, I actually tried using PHP as FastCGI - among other things because I could then also use a newer PHP version. And what happened? Nothing worked. Specifically, there was a massive problem with the mod_rewrite rules. In the WordPress .htaccess, everything is rewritten to index.php, and the path that was actually requested is appended to index.php as PATH_INFO. PHP then picks this information out again and does the right thing.

But once I had activated FastCGI, that didn't work - PHP always claimed that no input file had been passed, as if I had called PHP without parameters. The WordPress administration - which works with normal PHP files - worked wonderfully. And the permission stuff also worked well; everything ran under my own user.

Only the rewrite rules didn't work - and thus the whole site didn't. Pretty annoying, especially since I can't properly test it without taking down my main site. It's also annoying that suexec apparently looks for the actual FCGI starters in the document root of the primary virtual server - not in those of the actual virtual servers. This makes the whole situation a bit unclear, as the programs (the starters are small shell scripts) are not where the files are. Unless you have created your virtual servers below the primary virtual server - but I personally consider that highly nonsensical, as you can then bypass Perl modules loaded in the virtual server by direct path specifications via the default server.

Ergo: a failure. Unfortunately. Annoying. Now I have to somehow put together a test box with which I can analyze this problem ...

Update: a bit of searching and digging on the net and a short test, and I'm wiser: PATH_INFO with the FCGI version of PHP under Apache is broken. Apparently, PHP gets the wrong PATH_INFO entry and the wrong SCRIPT_NAME. As a result, the interpreter simply does not find its script when PATH_INFO is set, and nothing works anymore. Now I have to search further to see if there is a solution. cgi.fix_pathinfo = 1 (which is generally offered as a remedy for this) does not work anyway. If I see it correctly, there is no usable solution for this - at least none that is obvious to me. Damn.

Update 2: I found a solution. It is based on simply not using Apache for this, but lighttpd - with Apache in front as a transparent proxy. This works quite well, and if I strip Apache down heavily and throw PHP out of it, it also becomes much slimmer. lighttpd can run under different user accounts, so I also save myself the wild hacking with suexec. The price is that one lighttpd process runs per user (lighttpd only needs one process per server, as it works with asynchronous communication), and the PHP interpreters run as standalone FastCGI processes, not as modules integrated into Apache. Apache itself is then only responsible for purely static sites or sites with Perl modules - I still have quite a few of those. At the moment I only have a game site running this way, but maybe the rest will be switched over in the next few days.

The method by which cruft-free URIs are produced is quite funny: in WordPress you can simply enter index.php as an error document. `ErrorDocument 404 /index.php?error=404` would be the entry in the .htaccess; lighttpd has an equivalent entry. This automatically hands requests for non-existent files (and the cruft-free URIs do not exist as physical files) to WordPress. WordPress then checks whether there really is no data for the URI, and if there is something there (because it is a WordPress URI), the status is simply reset. For the latter, I had to install a small patch in WordPress. This saves you all the RewriteRules and works with almost any server. And because it's now 1:41, I'm going to bed ...
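
The idea behind the error document trick can be sketched in a few lines of Python. This is only an illustration of the control flow and not WordPress code; `KNOWN_URIS` and `handle_missing_file` are hypothetical names standing in for the blog's content lookup and its patched 404 handling.

```python
# Hypothetical stand-in for the blog's content lookup; not WordPress code.
KNOWN_URIS = {"/2005/02/13/hello-world/": "Hello world post"}


def handle_missing_file(request_uri):
    """Emulate the error document trick.

    The webserver found no physical file and calls the application,
    which then decides whether the URI is really unknown.
    """
    body = KNOWN_URIS.get(request_uri)
    if body is not None:
        # The URI belongs to the application: reset the status to 200.
        return 200, body
    # Nothing is known under this URI: keep the 404 the server started with.
    return 404, "Not found"


print(handle_missing_file("/2005/02/13/hello-world/"))
print(handle_missing_file("/no/such/page/"))
```

The webserver needs no knowledge of the URI scheme at all; everything that is not a file on disk goes through this one entry point.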

Alternative Rewrite Rules result in a significantly simpler .htaccess, especially one that doesn't constantly need to be updated by WordPress. This is particularly practical if you also use the .htaccess for other purposes. Additionally, Apache is not necessarily faster with the complex Rewrite-Rules from WordPress. I have activated them myself, let's see how WordPress 1.5 performs with these entries. If there are no problems, they will stay that way, because I like them much better than the other variant. And they don't have the problems that the others have - old mod_rewrite can only do greedy matching, which makes creating complex lists of rewrites quite hairy ...

Logfiles, once again

Since I now had an interesting object of study, I wanted to see whether a bit of cluster analysis on my logfiles would bring anything interesting to light. So I built a matrix of referrers and accessing IP addresses and used it to get an overview of typical user scenarios - what normal users look like in the log, what referrer spammers look like, and what our friend looks like.

All three variants are easy to tell apart, although at the moment I would still shy away from capturing this algorithmically - all of it can be simulated quite well. Still, a few things stand out. First, a perfectly normal user:


aa.bb.cc.dd: 7 accesses, 2005-02-05 03:01:45.00 - 2005-02-04 16:18:09.00
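
A matrix like this can be built with very little code. The following Python sketch assumes the log is in Apache's "combined" format and uses a deliberately minimal regular expression (it does not handle escaped quotes inside fields). The names `referrer_matrix` and `report` are illustrative and are not the script that produced the listings in this post.

```python
import re
from collections import Counter, defaultdict

# Minimal pattern for the Apache "combined" format: client IP first,
# referrer as the second-to-last quoted field.
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"(?P<referrer>[^"]*)" "[^"]*"$'
)


def referrer_matrix(lines):
    """Map each client IP to a Counter of the referrers it sent."""
    matrix = defaultdict(Counter)
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # lines that do not parse are silently skipped
            matrix[match.group("ip")][match.group("referrer")] += 1
    return matrix


def report(matrix):
    """Print one block per IP, most frequent referrer first."""
    for ip, referrers in matrix.items():
        print(f"{ip}: {len(referrers)} distinct referrers")
        for referrer, count in referrers.most_common():
            print(f" {count:04d}*{referrer}")
```

Usage would be along the lines of `report(referrer_matrix(open("access.log")))`, with the file name being whatever your Apache writes its access log to.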
 0065*-
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031994 ...
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4031612 ...
 0001*http://mudbomb.com/archives/2005/02/02/wysiwyg-plugin-for-wo ...
 0001*http://www.heise.de/newsticker/meldung/55992
 0001*http://log.netbib.de/archives/2005/02/04/nzz-online-archiv-n ...
 0001*http://www.heise.de/newsticker/meldung/56000
 0001*http://a.wholelottanothing.org/2005/02/no_one_can_have.html

You can nicely see how this user clicked away from my weblog and came back again - the referrers are by no means all links to me, but false referrers that browsers send when the user switches from one site to another. Referrers are really only supposed to be sent when a link is actually clicked, but hardly any browser gets that right. The visit took place on a single, well-defined day, and the user entered directly by typing the domain name (the "-" referrers are at the top, and the top entry is the earliest referrer that occurs).

Or here is one of my own visits:


aa.bb.cc.dd: 6 accesses, 2005-02-04 01:11:56.00 - 2005-02-03 08:27:09.00
 0045*-
 0001*http://www.aylwardfamily.com/content/tbping.asp
 0001*http://temboz.rfc1437.de/view
 0001*http://web.morons.org/article.jsp?sectionid=1&id=5947
 0001*http://www.tagesschau.de/aktuell/meldungen/0,1185,OID4029220 ...
 0001*http://sport.ard.de/sp/fussball/news200502/03/bvb_verpfaende ...
 0001*http://www.cadenhead.org/workbench/entry/2005/02/03.html

I recognize myself by the referrers containing temboz.rfc1437.de - that is my online aggregator. It looks similar: a lot of wrongly sent referrers. Yet another user:


aa.bb.cc.dd: 19 accesses, 2005-02-12 14:45:35.00 - 2005-01-31 14:17:07.00
 0015*http://www.muensterland.org/system/weblogUpdates.py
 0002*-
 0001*http://www.google.com/search?q=cocoa+openmcl&ie=UTF-8&oe=UTF ...
 0001*http://blog.schockwellenreiter.de/8136
 0001*http://www.google.com/search?q=%22Rainer+Joswig%22&ie=UTF-8& ...
 0001*http://www.google.com/search?q=IDEKit&hl=de&lr=&c2coff=1&sta ...

This one came several times (that is, on several days) via my update page on muensterland.org, and in addition he searched for Lisp topics. He also came over once from the Schockwellenreiter. Absolutely typical behavior.

Now, for comparison, a typical referrer spammer:


aa.bb.cc.dd: 6 accesses, 2005-02-12 17:27:27.00 - 2005-02-02 09:25:22.00
 0002*http://tramadol.freakycheats.com/
 0001*http://diet-pills.ronnieazza.com/
 0001*http://phentermine.psxtreme.com/
 0001*http://free-online-poker.yelucie.com/
 0001*http://poker-games.psxtreme.com/

All referrers are bare domain referrers. There are no "-" referrers - so no accesses without a referrer. There are no other accesses either - if I analyzed it more closely by page type, it would show that no images etc. are requested. Easy to recognize - it simply looks meager. It is also typical that each URL appears only once or twice.

Now our new friend:


aa.bb.cc.dd: 100 accesses, 2005-02-13 15:06:16.00 - 2005-02-11 07:07:55.00
 0039*-
 0030*http://irish.typepad.com
 0015*http://www208.pair.com
 0015*http://blogs.salon.com
 0015*http://hfilesreviewer.f2o.org
 0015*http://betas.intercom.net
 0005*http://vowe.net
 0005*http://spleenville.com

What stands out are the referrers without a trailing / - untypical for referrer spam. Besides, they are perfectly normal sites. What also stands out is that pages are accessed without a referrer - those are the RSS feeds. So this one, too, is easy to distinguish from users, above all because there is a certain rhythm to it - apparently always 15 accesses with one referrer, then a switch to the next referrer. Either the referrer list is fairly small, or I was lucky that he tried the same one on me twice - one of them shows up 30 times.

Normal bots don't need much comparison - very few of them send referrers, so they are completely uninteresting here. There was one that caught my attention:


aa.bb.cc.dd: 5 accesses, 2005-02-13 15:21:26.00 - 2005-01-31 01:01:07.00
 2612*-
 0003*http://www.everyfeed.com/admin/new_site_validation.php?site= ...
 0002*http://www.everyfeed.com/admin/new_site_validation.php?site= ...

A new search engine for feeds that I didn't know yet. Apparently the admin had entered my address somewhere shortly before, and then the bot started collecting the pages. Afterwards he approved the feeds it had newly found on my site in the administration interface. It seems to be a small system - the bot runs from the same IP as the administration interface. Most other bots come from whole bot farms; web spidering is a resource-intensive business, after all ...

In summary, the current generation of referrer spam bots and other malicious bots is still built quite primitively. They don't use botnets to spread over many different addresses and hide that way, they use bare server URLs instead of page URLs, and they show quite a few other typical traits, such as certain rhythms. They also almost always come back several times.

Unfortunately these are not good features for capturing them algorithmically - unless you feed your referrers into an SQL database and check each referrer against the typical criteria with appropriate SELECTs. That way you could well catch the usual suspects and block them right on the server, because normal user accesses look clearly different.

However, new generations are already in the works - as my little friend with the missing / shows. And thanks to the stupid browsers with their wrongly generated referrers (which say much more about the browser's history than about actual link following), you can't simply cross-check the referenced pages, since many referrers are pure blind referrers.

ModSecurity - Web Intrusion Detection And Prevention / mod_security is an Apache module that looks into requests and decides, based on filters, whether a request is let through or a filter action (script, log, etc.) is triggered. Quite interesting, even though I am generally rather skeptical about rule-based filtering against attacks - it only finds known or expected attacks. The dangerous ones, though, are the unexpected attacks ...

CVS Module for Apache - An Apache module that serves files from a CVS repository and performs a checkout when needed.
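
A minimal sketch of such a check, using Python's sqlite3 module with an in-memory database: it flags IPs that never sent an empty referrer and only ever sent bare server URLs, which is the primitive pattern described above. The table layout, the function name `suspicious_ips` and the restriction to http:// URLs are assumptions made for this example. A bot that also fetches pages without a referrer would slip through.

```python
import sqlite3


def suspicious_ips(rows):
    """rows: iterable of (ip, referrer) pairs; "-" means no referrer sent.

    Flags IPs that never sent an empty referrer and only ever sent bare
    server URLs (nothing after the host name except an optional "/").
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hits (ip TEXT, referrer TEXT)")
    db.executemany("INSERT INTO hits VALUES (?, ?)", rows)
    # 'http://%/_%' matches URLs with at least one character after the
    # first slash that follows the host name, i.e. real page URLs.
    query = """
        SELECT ip FROM hits
        GROUP BY ip
        HAVING SUM(referrer = '-') = 0
           AND SUM(referrer LIKE 'http://%/_%') = 0
        ORDER BY ip
    """
    return [ip for (ip,) in db.execute(query)]
```

The result could then be turned into deny rules for the webserver, although with so few criteria false positives are to be expected.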