Diary of an outage
As some of you are no doubt aware, yesterday we had a fairly serious outage. It only affected a small number of users, but for them it meant some some 4-6 hours with partial or no access to mail. In this technical post I'll be explaining exactly what happened and what we are doing to fix the problem and ensure it doesn't happen again in the future.
For the non-technical readers, you can skip to the last paragraph.
Early morning page
At around 4:30am Melbourne time (around 17:30 UTC) I was paged by our automated monitoring tools. We have extensive monitoring of most aspects of our infrastructure, and the system has permission to wake people up if it notices a problem. In this case, it noticed that one of our many backend mail servers, imap21, was no longer reeachable. Unlike many other service providers we don't do automatic failovers because if the software makes the wrong decision it can lead to worse problems than might otherwise occur. We prefer to put a human into the loop to make sure things look sane before taking any action.
A machine failure of some sort is not as uncommon as you might think. Usually it's a single disk failure, but we've variously had disk controllers fail, power supplies fail, and kernel or other bugs cause machines to crash. In the event of a crash, the machine usually reboots. The normal procedure for a night-time failure is to failover all mail stores on the machine to replicas on other machines. That's anywhere between 15 and 40 stores depending on hardware configuration. We prefer to return service to the original machine if possible so we can maintain the replicas and thus keep our redundancy level, but that's usually not possible in the case of a hardware fault - the machine is no longer fit for service.
The failover process involves shutting down both the current "master" slot and the target replica slot, reconfiguring them such that the master becomes the replica and the replica the master, and then starting them both up. The system configuration database is also updated so that all other services (web client, IMAP/POP frontends, mail delivery) know where to find the stores now. This is all done by a scripted process that can be initiated with just one command. In the event that the current master slot is unavailable, that part is skipped.
This morning imap21 failed in a fairly serious way. It was not available on the network. The management console showed that it tried to reboot, found corruption on the operating system disk, tried to repair it and couldn't and was waiting for manual intervention. That immediately told me this was a serious problem and the best course of action was not to try and repair it, but instead failover its 15 slots to replicas. I did this, checked our monitoring to confirm all systems were operating normally, and went back to bed. I knew that full repairs would potentially be a big job, and I wanted enough sleep to be able to do it without messing things up further. As far as I was able to tell service had been fully restored to users, which is the single most important thing.
Fast forward to 9am. I arrived at the office after a bit more sleep at home and on the train (you take what you can get!). On arrival I was told that a number of users had been reporting either missing mail or errors on login. The first thing I thought of turned out to correct: at least one of the replicas I had failed over to were not up to date. But first we had to prove it!
The Cyrus replication model is an "eventually consistent" one, fairly common among database server software (which is what Cyrus is, though optimised for email storage). When some action is performed (delivering a message, creating a folder, moving a message, etc), the operation is performed on the local data store, and a record of the operation is written to a replication log. A separate pair of programs (sync_client and sync_server) use that log to perform the same actions on the replicas. Typically that happens in near realtime, but if replication was in fact behind, then we should see lots of unprocessed lines in the replication log.
Getting to the replication log was something of a problem because as noted above, imap21 refused to boot. A filesystem check was not pretty - literally thousands of errors. That's quite worriesome as this is a root disk which generally shouldn't have any corruption on it (it almost never recieves writes), but figuring out why the crash happened is something of a secondary concern at this point. So I reinstalled the OS into the root partition. That's something we do all the time and it's fast and accurate. Twenty minutes later it was fully reinstalled and the machine was up and running, and we got to inspect the replication logs.
The good news was that 14 of the 15 stores on this machine had fully up-to-date replicas, so there was only one mail store to deal with. The bad news is that the remaining store had some 2-3 days worth of unreplicated events in it's log. To add more pain to the situation that store was coincidentally the current "new user" store, where new users are created. That means that any users that had signed up in the last 2-3 days did not exist at all on the replica, thus the reported login errors. A horrible first experience!
Examining the log, we discovered that right at the top there was a complex series of folder renames within a single replication event. This is not a particularly unusual operation. This time it tripped a known, rare bug in the way renames are replicated that caused the replication process (sync_client) to abort. The Cyrus master daemon starts it up again, but then it hits the same point and dies again, over and over. Replication stops.
Fixing this is on our list of things to do. Because it's a fairly rare thing to happen and usually gets dealt with quickly, it hasn't quite made it far enough up our list to deal with. Obviously that has now changed, but lets talk about why it wasn't dealt with this time.
Monitoring the monitors
As noted, we have a lot of system monitoring running, and we place a huge amount of faith in it. It's almost by definition that if there's no problems being reported, then there are no problems, at least in overall system health. Of course individual per-user problems can appear from time to time, which is unfortunate. A support ticket is the usual way to get to the bottom of individual problems like that.
Something we monitor is replication lag, which is almost entirely based on the size of the replication logs. If they grow "too large", a low-level warning is produced. A warning of this type results in a message being posted to our IRC channel and an email being sent, but it will never generate a page. It's the kind of thing that we look at and action every now and again, and as noted it's a fairly rare event under normal operation.
There is however one time where that warning can occur yet not be a problem, and that's when users are being moved around a lot. User moves are done via the replication system, and when you're doing a lot of them at once it can generate a lot of replication log traffic, sometimes causing replication to lag significantly for short periods of time. Something we've been doing for the last few months is redesigning our disk layout into something much easier to reason about and work with, and that has required a lot of moves. This particular warning has not been as uncommon as it should have been recently, leading to the situation where we've started ignoring it.
Obviously this is a dangerous place to be. I recognise how bad that sounds - "system warnings were ignored" - but this is what happens when you have a warning that doesn't quite match up with the importance of the situation. Think of it like the fuel light on your car. It needs to be calibrated to come on at just the right time. Too late, and you run out of fuel before you can refill. Too early, and you start to learn that it isn't really a problem; you won't run out of fuel for ages. At that point you might as well not have the light at all. It's not a perfect analogy, but it's instructive.
So far we haven't figured out the original cause of the filesystem corruption (we suspect a hardware failure that isn't visible to our normal tools) that led to the crash. But that's not quite the point. Had appropriate action been taken when the replication lag was first noticed, we would have had fully up-to-date replicas at failover, and this entire situation would have been little more then a ten-minute outage and me being a little tired for the day.
Obviously, we have a few things to do out of this!
- The replication lag monitor needs to be able to send a page whethings are getting bad. Our current thinking is that it should page if it doesn't have a least one replica less than five minutes behind.
- It also needs to understand that there are other things that can cause lag and compensate, which means it needs to know what user move operations are in progress. That said, replication falling behind is always a little dangerous, so we might be better to somehow change the way moves happen so that we always have at least one viable replica at all times. We'll need to consider this in more detail.
- We need to make the replication system able to cope with hard replication failures. One idea we're currently considering is to put failed replication events into another logfile, and come back to them once the main log is empty. This needs a bit of thought, particularly around the consequences of operations being applied out-of-order.
- We need to fix the renaming bug. Having a last-ditch protection against replication problems is great, but even better than handling problems is avoiding them in the first place.
It's now 13 hours since the problem was fully understood and recovery began. We believe that most if not all missing mail and broken accounts have been repaired. If you were unfortunate enough to be affected and you're still seeing problems, please contact support.
We are very sorry for this outage. We understand how important your email is and how much it affects you when it's unavailable. We're proud of our track record on reliabilty but we know we've dropped the ball on this one. To our new users, who are the ones most likely affected by this outage, we understand very well how this makes us look. We're working hard to get the situation resolved and to restore your trust. Thank you all for your patience and understanding.