Default zone_reclaim_mode = 1 on NUMA kernels is bad for file/email/web servers
This is a technical post. Regular FastMail users subscribed to receive email updates from the FastMail blog can just ignore this post.
So over the last couple of weeks we noticed that our new IMAP servers with 48G of RAM haven't been performing as well as expected, and there were some oddities. Two things in particular stuck out:
- There was free memory. There's 20T of data on these machines, so the
kernel should have used lots of memory for caching, but for some
reason it wasn't. We saw cache ~ 2G, buffers ~ 25G, unused ~ 5G
- The machine has SSDs for very hot data. In total, there's about
16G of data on the SSDs. Almost all of that 16G of data should end
up being cached, so there should be little reading from the SSDs at
all. Instead we saw at peak times 2k+ blocks read/s from the SSDs.
Again a sign that caching wasn't working.
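For anyone wanting to check the same numbers on their own machine, here is a minimal Python sketch that pulls the cache, buffers and free figures out of /proc/meminfo. The function names (parse_meminfo, summarize) are just ours for illustration, and it assumes the usual "Name: value kB" layout of that file.

```python
import os

MEMINFO_PATH = "/proc/meminfo"


def parse_meminfo(text):
    """Return {field name: integer value} for /proc/meminfo-style text.

    Most values are reported in kB (really KiB); a few counters such as
    HugePages_Total have no unit.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def summarize(values):
    """Convert the three fields discussed above from KiB to GiB."""
    kib_per_gib = 1024 * 1024
    return {
        key: values.get(key, 0) / kib_per_gib
        for key in ("Cached", "Buffers", "MemFree")
    }


if __name__ == "__main__":
    # Only meaningful on Linux; skip quietly elsewhere.
    if os.path.exists(MEMINFO_PATH):
        with open(MEMINFO_PATH) as f:
            summary = summarize(parse_meminfo(f.read()))
        for key, gib in summary.items():
            print(f"{key:8s} {gib:6.1f} GiB")
```

If MemFree stays at several GiB on a busy file server with far more data on disk than RAM, that is the same symptom we were seeing.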
After doing some searching, we found this thread on the Linux kernel mailing list.
It appears the patch discussed there never went anywhere, and zone_reclaim_mode is still defaulting to 1 on our pretty standard file/email/web server type machine with a NUMA kernel.
By changing it to 0, we saw an immediate massive change in caching behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and reads from the SSDs dropped from around 2000 blocks/s to around 100 blocks/s.
So if you’re using newer AMD/Intel processors with a NUMA kernel in a web server/file server/email server setup, you should make sure you set /proc/sys/vm/zone_reclaim_mode to 0. I’ve posted to the LKML about this, but haven’t heard anything, so I have no idea if anyone regards this default value as a bug or not.
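As a rough sketch of how to check and change the setting from a script, the Python below reads /proc/sys/vm/zone_reclaim_mode and writes 0 to it if it is currently non-zero. The helper names (read_mode, disable_zone_reclaim) are ours for illustration. It assumes a kernel built with NUMA support (otherwise the file doesn't exist), and writing to it needs root.

```python
import os

ZONE_RECLAIM_PATH = "/proc/sys/vm/zone_reclaim_mode"


def read_mode(path=ZONE_RECLAIM_PATH):
    """Return the current zone_reclaim_mode as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def disable_zone_reclaim(path=ZONE_RECLAIM_PATH):
    """Write 0 if the current value is non-zero; return the previous value.

    Writing to the real /proc file requires root privileges.
    """
    previous = read_mode(path)
    if previous != 0:
        with open(path, "w") as f:
            f.write("0\n")
    return previous


if __name__ == "__main__":
    if os.path.exists(ZONE_RECLAIM_PATH):
        # Report only; call disable_zone_reclaim() as root to change it.
        print("zone_reclaim_mode =", read_mode())
    else:
        print("zone_reclaim_mode not present (kernel without NUMA support?)")
```

A value written this way is lost on reboot. To make it stick, the usual route is a line reading vm.zone_reclaim_mode = 0 in /etc/sysctl.conf.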