This is a technical post. Regular FastMail users subscribed to receive email updates from the FastMail blog can just ignore this post.
For the last few years, most of the IMAP servers we've bought have followed the same hardware configuration: a 1U server with an LSI SCSI or SAS controller, connected to two external RAID storage units. The RAID storage units use an ARECA controller and present the internal SATA/SAS disks as SCSI/SAS volumes. This setup has worked really well and has generally been very solid.
However, after recently upgrading the hard drives in one of our RAID storage boxes, we started experiencing some annoying kernel errors. Under high IO load, as we synced new data to the new drives, we'd see something like this in the kernel log:
[ 1378.310010] mptscsih: ioc1: attempting task abort! (sc=ffff88083cfa6000)
[ 1378.310091] sd 2:0:0:0: [sdj] CDB: Read(10): 28 00 0d 18 ad 2d 00 00 02 00
[ 1378.682660] mptscsih: ioc1: task abort: SUCCESS (sc=ffff88083cfa6000)
These would usually be repeated many times, and sometimes we'd see messages like this afterwards:
[ 1400.805969] Errataon LSI53C1030 occurred.sc->req_bufflen=0x1000,xfer_cnt=0x400
[ 1400.827927] mptbase: ioc1: LogInfo(0x11070000): F/W: DMA Error
[ 1401.090516] mptbase: ioc1: LogInfo(0x11070000): F/W: DMA Error
Simultaneously, the RAID controller would report in its log:
2010-08-16 08:24:50 Host Channel 0 SCSI Bus Reset
And there would often be some corruption of any data that was being written at the time.
We'd seen a problem like this once before after buying new hard drives, and upgrading the drives' firmware had made it go away. Unfortunately, in this case the new drives already had the latest firmware, so that wasn't going to help.
We tried a number of things: downgrading the SCSI bus speed to 80 MB/s; using the latest version of the LSI driver from their website (4.22) rather than the version that comes in the vanilla Linux kernel (3.04.14); reducing the SCSI queue depth on the LSI card from 64 to 16; upgrading the RAID controller firmware to the very latest version. None of these helped. In each case, under high IO load, we could trigger the error within 10 minutes.
My final thought was that maybe it was timeout related. With SCSI, the HBA can queue a lot of requests to be completed out of order. So if you shove a lot of IOPs at the RAID unit (so many that the write-back cache fills up), maybe the internal scheduler in the RAID controller interacts badly with the TCQ in the hard drives, and some requests end up taking a very long time to complete. The HBA has a timeout, and if a request takes longer than that, it assumes something has gone wrong, tries to cancel everything outstanding, and resets the bus.
In Linux, you can control the timeout for each SCSI target device (eg a RAID volumeset in our case) via the /sys/block/<device>/device/timeout tunable.
The default value for the timeout on these LSI cards is 30 seconds. I increased it to 300 seconds on all targets, and we started the IO storm again.
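As a concrete sketch of what this looks like (the helper function name and the scratch-directory parameter are our own additions for illustration; on a real machine you'd run it as root against /sys itself):

```shell
#!/bin/sh
# Hypothetical helper: raise the SCSI command timeout on every sd* device.
# The sysfs root is a parameter so the loop can be exercised against a
# scratch directory; on a real machine it defaults to /sys.
set_scsi_timeout() {
    secs="$1"
    sysroot="${2:-/sys}"
    for dev in "$sysroot"/block/sd*; do
        t="$dev/device/timeout"
        # Only touch devices that actually expose a writable tunable.
        [ -w "$t" ] && echo "$secs" > "$t"
    done
}

# On a live system (as root): bump every device from the 30s default to 300s.
# set_scsi_timeout 300
```

Note the change doesn't survive a reboot, so it needs to be reapplied at boot time.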
Normally we'd see problems within 10 minutes. We let this run for 24 hours and hit not a single problem!
Not 100% conclusive proof, but it's looking pretty likely that that's the culprit. So my assumption is that the LSI card has a 30 second default timeout, and that under heavy IO load the RAID unit can take longer than 30 seconds to respond to some queued requests. That would explain why the problem only occurs under heavy load, when the write-back cache fills up.
Hopefully this helps someone else if they encounter this problem one day.
Additional: Even with these changes, we noticed that a high IO load on one RAID volume (eg. in our case, moving users around) can severely affect the performance of other RAID volumes. The issue is that each SCSI HBA has a queue depth it can manage, but in the kernel each mounted volume has its own outstanding request queue. When the number of volumes is large, the sum of requests in the volume queues can be much larger than the HBA queue, causing poor response times as lots of processes block on IO. On our systems with a large number of volumes, reducing the per-device queue depth (/sys/block/sd*/device/queue_depth) from the default of 64 to 16 resulted in much more even performance.
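The queue-depth change can be scripted the same way as the timeout one. A minimal sketch (the function name and the testable sysfs-root parameter are our own; 16 is simply the value that worked for us):

```shell
#!/bin/sh
# Hypothetical helper: cap the per-device queue depth so the sum of the
# per-volume queues stays closer to what the HBA itself can manage.
# The sysfs root is parameterised for testing; it defaults to /sys.
set_queue_depth() {
    depth="$1"
    sysroot="${2:-/sys}"
    for dev in "$sysroot"/block/sd*; do
        q="$dev/device/queue_depth"
        # Skip devices that don't expose a writable queue_depth attribute.
        [ -w "$q" ] && echo "$depth" > "$q"
    done
}

# On a live system (as root): drop every device from the default of 64 to 16.
# set_queue_depth 16
```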