Caching Strategies for PHP-Based Systems
There are basically two ways to implement caching in a PHP-based system. Okay, there are many more, but two main approaches are clearly identifiable. I've compiled what's interesting in this context - especially since some colleagues are currently suffering under high server load. The whole thing is kept general, but for understandable reasons also considers the specific implications for WordPress.
- Caching of pre-compiled PHP pages
- Caching of page output
There are numerous variations for both main approaches. PHP pages themselves exist on web servers as source code - unprocessed and not optimized in any way for the loading process. With complex PHP systems running, parsing and compiling into internal code happens for every PHP file. With systems that have many includes and many class libraries, this can be quite substantial. The first main direction of caching starts at this point: the generated intermediate code is simply stored away. Either in shared memory (memory blocks that are available to many processes of a system collectively) or on the hard disk. There are a number of solutions here - I personally use turck-mmcache. The reason is mainly that it doesn't cache in shared memory but on the disk (which as far as I know the other similar solutions also do) and that there is a Debian package for turck-mmcache. And that I've had relatively few negative experiences with it so far (at least on Debian stable - on Debian testing things are different, where PHP applications crash on you). Since WordPress is based on a larger set of library modules with quite substantial source content, such a cache brings quite a bit to reduce WordPress's baseline load. Since these caches are usually completely transparent - with no visible effects except for the speed improvement - you can also generally enable such a cache.
The second main direction for caching is the intermediate storage of page contents. Here's a special feature: pages are often dynamically generated depending on parameters - and therefore a page doesn't always produce the same output. Just think of mundane things like displaying the username when a user is logged in (and has stored a cookie for it). Page contents can also be different due to HTTP Basic Authentication (the login technique where the popup window for username and password appears). And POST requests (forms that don't send their contents via the URL) also produce output that depends on this data.
Basically, an output cache must consider all these input parameters. A good strategy is often not to cache POST results at all - because error messages etc. would also appear there, which depending on external sources (databases) could produce different outputs even with identical input values. So really only GET requests (URLs with parameters directly in the URL) can be meaningfully cached. However, you must consider both the sent cookies and the sent parameters in the URL. If your own system works with basic authentication, that must also factor into the caching concept.
A second problem is that pages are rarely purely static - even static pages certainly contain elements that you'd prefer to have dynamically. Here you need to make a significant decision: is purely static output enough, or does a mix come in? Furthermore, you still need to decide how page updates should affect things - how does the cache notice that something has changed?
One approach you can pursue is a so-called reverse proxy. You simply put a normal web proxy in front of the web server so that all access to the web server itself is technically routed through this web proxy. The proxy sits directly in front of the web server and is thus mandatory for all users. Since web proxies should already handle the problem of user authentication, parameters, and POST/GET distinction quite well (in the normal application situation for proxies, the problems are the same), this is a very pragmatic solution. Updates are also usually handled quite well by such proxies - and in an emergency, users can persuade the proxy to fetch the contents anew through a forced reload. Unfortunately, this solution only works if you have the server under your own control - and the proxy also consumes additional resources, which means there might not be room for it on the server. It also heavily depends on the application how well it works with proxies - although problems between proxy and application would also occur with normal users and therefore need to be solved anyway.
The second approach is the software itself - ultimately, the software can know exactly when contents are recreated and what needs to be considered for caching. Here there are again two directions of implementation. MovableType, PyDS, Radio Userland, Frontier - these all generate static HTML pages and therefore don't have the problem with server load during page access. The disadvantage is obvious: data changes force the pages to be recreated, which can be annoying on large sites (and led me to switch from PyDS to WordPress).
The second direction is caching from the dynamic application itself: on first access, the output is stored under a cache key. On the next access to the cache key, you simply check whether the output is already available, and if so, it's delivered. The cache key is composed of the GET parameters and the cookies. When database contents change, the corresponding entries in the cache are deleted and thus the pages are recreated on the next access.
WordPress itself has Staticize, a very practical plugin for this purpose. In the current beta, it's already included in the standard scope. This plugin creates a cache entry for pages as described above. And takes parameters and cookies into account - basic authentication isn't used in WordPress anyway. The trick, though, is that Staticize saves the pages as PHP. The cache pages are thus themselves dynamic again. This dynamism can now be used to mark parts of the page with special comments - which allows dynamic function calls to be used again for these parts of the page. The advantage is obvious: while the big efforts for page creation like loading the various library modules and reading from the database are completely done, individual areas of the site can remain dynamic. Of course, the functions for this must be structured so they don't need WordPress's entire library infrastructure - but for example, dynamic counters or displays of currently active users or similar features can thus remain dynamic in the cached pages. Matt Mullenweg uses it, for example, to display a random image from his library even on cached pages. Staticize simply deletes the entire cache when a post is created or changed - very primitive and with many files in the cache it can take a while, but it's very effective and pragmatic.
Which caches should you sensibly deploy and how? With more complex systems, I would always check whether I can deploy a PHP code cache - so turck mCache or Zend Optimizer or phpAccelerator or whatever else there is.
I would personally only activate the application cache itself when it's really necessary due to load - with WordPress you can keep a plugin on hand and only activate it when needed. After all, caches with static page generation have their problems - layout changes only become active after cache deletion, etc.
If you can deploy a reverse proxy and the resources on the machine are sufficient for it, it's certainly always recommended. If only because you then experience the problems yourself that might exist in your own application regarding proxies - and which would also cause trouble to every user behind a web proxy. Especially if you use Zope, for example, there are very good opportunities in Zope to improve the communication with the reverse proxy - a cache manager is available in Zope for this. Other systems also offer good fundamentals for this - but ultimately, any system that produces clean ETag and Last-Modified headers and correctly handles conditional GET (conditional accesses that send which version you already have locally and then only want to see updated contents) should be suitable.