A while back we bought some nice new IBM Xeon based servers as IMAP servers. Because email is an IO intensive application, we bought each of these machines with 12G of RAM so we could do as much caching as possible. Our previous machines with 8G of RAM showed that quite a lot of that RAM was eaten up as "active/application" RAM, leaving only about 1G to 2G of memory for caching, so we thought that by getting 12G we'd be leaving a lot more RAM available for caching.

So it was quite surprising when after a few days of running, we saw memory stats like this:

             total       used       free     buffers     cached
Mem:      12466848   12419764      47084      463564    1550232
-/+ buffers/cache:   10405968    2060880
Swap:      2048276      69828    1978448

All 12G of memory was being used, but only 1.5G of it was for caching; the other 10G+ was "active/application" memory again. How could a 12G server doing less work, with fewer running processes, be using more memory than a server with 8G?
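As a quick sanity check on how to read those numbers, here is a minimal Python sketch of the arithmetic behind the "-/+ buffers/cache" row, assuming the older `free` output layout shown above with values in kilobytes. The function name is just illustrative.

```python
def buffers_cache_row(used_kb, free_kb, buffers_kb, cached_kb):
    """Recompute the '-/+ buffers/cache' row of older `free` output.

    The first value is memory used excluding buffers and page cache
    (the "active/application" figure); the second is what would be
    free if buffers and cache were dropped.
    """
    app_used = used_kb - buffers_kb - cached_kb
    effectively_free = free_kb + buffers_kb + cached_kb
    return app_used, effectively_free


# Values taken from the "Mem:" row shown above.
app_used, effectively_free = buffers_cache_row(
    used_kb=12419764, free_kb=47084, buffers_kb=463564, cached_kb=1550232
)
print(app_used, effectively_free)  # 10405968 2060880
```

So roughly 10G of the 12G was accounted as application memory, and only about 2G was buffers, cache or free.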

Well, after some debugging work with Andrew Morton and Chris Mason, we found that the memory was being consumed by a bug in ReiserFS: if you use data=journal with particular workloads, it leaks "zero-refcount pages on the LRU". As Chris noted, "The fastmail guys find all the fun bugs" (see here for an example of a previous bug that only our workload seemed to be hitting regularly).

So after Chris produced a patch, we tried it out. The good news was that it appeared to fix the leak: "active/application" memory no longer grew to take up 10G+. The bad news was that the cache memory didn't grow beyond 1.5G or so either, leaving us with about 10G of free memory not being used by anything!

Some more investigation suggested that the machine was running out of "low memory", which is the only memory a 32-bit kernel can use for the inode cache, and that whenever inodes were reclaimed, the page cache belonging to those inodes was reclaimed along with them. In general, most systems don't need to cache lots and lots of inodes, but cyrus (our IMAP server) stores each email on disk as a separate file. A quick calculation suggests that this one server alone had >50 million files on it. The continuous access to lots of separate small files was filling low memory with inodes, causing older ones to be pushed out quickly and their associated cache memory to be reclaimed with them.
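If you want to check whether a 32-bit machine is in the same situation, the kernel exposes the low memory zone in /proc/meminfo as LowTotal and LowFree on highmem kernels. The following is a rough Python sketch, not the tooling we actually used; the sample text and its LowTotal/LowFree values are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def low_memory_free_fraction(fields):
    """Return LowFree / LowTotal, or None if the kernel does not report
    a separate low memory zone (as on a 64-bit kernel)."""
    total = fields.get("LowTotal")
    free = fields.get("LowFree")
    if not total or free is None:
        return None
    return free / total


# Hypothetical sample; the LowTotal/LowFree numbers are illustrative only.
SAMPLE = """MemTotal:     12466848 kB
LowTotal:       800000 kB
LowFree:         20000 kB
"""

print(low_memory_free_fraction(parse_meminfo(SAMPLE)))  # 0.025
```

On a live machine you would pass in the contents of /proc/meminfo instead of SAMPLE. A low memory zone that stays nearly full while plenty of high memory sits free is the pattern described above.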

Andrew suggested we try running a full 64-bit kernel, because that removes the low memory limitations that the 32-bit kernel introduces. These new servers support the x86-64 64-bit computing extensions, but because the rest of our existing IMAP servers don't, we had been running 32-bit kernels with PAE enabled to address memory > 4G. Previously this had never appeared to be a problem; it just worked.

We decided to try a 64-bit kernel on the new machines, and once we did that, everything came together nicely. So with the ReiserFS patch stopping the memory leak, and the 64-bit kernel removing the inode caching limitation, we can now use the full 12G in these machines for caching inodes and disk. The result has been a nice decrease in system load average as considerably more indexes and emails are now kept hot in the memory cache on these machines.

             total       used       free     buffers     cached
Mem:      12295924   12229816      66108     1503084    7841212
-/+ buffers/cache:    2885520    9410404
Swap:      2048276     121500    1926776

Some graphs really help illustrate this as well.

On the left you can see the 32-bit kernel before the patch, where all memory is listed as "apps". In the middle is the 32-bit kernel with the ReiserFS patch, where most memory is left as "unused". The spike in the middle was caused by some tests on a single 10G file, which did cause the cache memory to be used. The right hand side shows the ReiserFS patch together with a 64-bit kernel, where most memory is now being used for the cache.

Here you can see how before the 64-bit kernel, it wasn't possible to have more than 100,000-200,000 inodes in the inode cache at a time. After the 64-bit kernel, the inode cache can easily grow up to almost 2 million items with no problems.
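For anyone wanting to collect similar numbers, the kernel reports inode cache counts in /proc/sys/fs/inode-nr as two integers: the number of inodes allocated and the number of those that are free. Here is a small Python sketch for parsing it, assuming that two-field layout; the sample values are made up, and this is not the script behind our graphs.

```python
def parse_inode_nr(text):
    """Parse the contents of /proc/sys/fs/inode-nr.

    The file holds two whitespace-separated integers:
    nr_inodes (allocated) and nr_free_inodes (free).
    """
    nr_inodes, nr_free = (int(x) for x in text.split()[:2])
    return {"allocated": nr_inodes, "free": nr_free}


# Hypothetical sample contents, for illustration only.
print(parse_inode_nr("1950000\t12000\n"))
# {'allocated': 1950000, 'free': 12000}
```

Sampling this every minute or so and plotting the allocated count is enough to see whether the inode cache is hitting a ceiling.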

Kernel performance optimisation can often seem a bit of a dark art. There are lots of potential bottleneck areas (network, IO, memory, CPU, scheduler, all sorts of different caches, etc) and a number of knobs to change, including some that aren't really even documented (eg lowmem_reserve_ratio), and things can change from one kernel version to the next. On top of that, when you run into bugs that other people either haven't run into, or don't actually realise they're running into, it can be an interesting/frustrating process to investigate and dig to find out what's actually going on.

In this case, it was nice to find a solution and be able to make the most of the new servers we bought.

Update (25-Sep-07): Someone suggested we try altering the value of vfs_cache_pressure in /proc/sys/vm/. According to the kernel documentation, vfs_cache_pressure:

Controls the tendency of the kernel to reclaim the memory which is
used for caching of directory and inode objects.

At the default value of vfs_cache_pressure = 100 the kernel will
attempt to reclaim dentries and inodes at a "fair" rate with respect
to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure
causes the kernel to prefer to retain dentry and inode caches.
Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer
to reclaim dentries and inodes.

I tried lowering this, and it does help (cache goes from 1.5G up to 2.5G, and maximum inodes in use goes from < 100,000 up to about 250,000), but that's still nowhere near what the 64-bit kernel is able to achieve by eliminating low zone memory altogether.
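For reference, the setting is a plain integer in /proc/sys/vm/vfs_cache_pressure, so reading and changing it can be sketched in a few lines of Python. The helper names are illustrative, and the value 50 is only an example of "lower than the default", not the value I tested with. Writing to the real file needs root, so the demonstration below runs against a temporary file.

```python
import os
import tempfile

VFS_CACHE_PRESSURE = "/proc/sys/vm/vfs_cache_pressure"


def read_pressure(path=VFS_CACHE_PRESSURE):
    """Return the current vfs_cache_pressure value as an int."""
    with open(path) as f:
        return int(f.read().strip())


def write_pressure(value, path=VFS_CACHE_PRESSURE):
    """Set vfs_cache_pressure; values below 100 make the kernel prefer
    to retain dentry and inode caches."""
    if value < 0:
        raise ValueError("vfs_cache_pressure must not be negative")
    with open(path, "w") as f:
        f.write(f"{value}\n")


# Demonstration against a temporary file rather than the live /proc entry.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vfs_cache_pressure")
    write_pressure(100, demo_path)  # the kernel default
    write_pressure(50, demo_path)   # example lower value (an assumption)
    print(read_pressure(demo_path))  # 50
```

A change made this way does not survive a reboot, so it is easy to experiment with and back out.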