Improvements after outage last week
This is a technical post. Regular FastMail users subscribed to receive email updates from the FastMail blog can just ignore this post.
Last week we had an outage that lasted about 1 hour and affected all users. This is one of the worst outages we’ve had in the last 4 years. Our overall reliability over that period can be put down to our redundant slots & stores architecture and to using a very reliable hosting provider (NYI).
The outage last week was a sequence of events set in motion by a recent internal change. We had switched our internal DNS server to slave off Opera’s servers to allow better internal DNS integration. Unfortunately we were only part way through that process, and had only set up one internal server. It’s our general policy that everything we set up these days must be replicated between at least two servers. We had intended to do that here, but hadn’t got around to it.
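As an illustration of how the two-server policy above could be checked mechanically, here is a minimal sketch. The inventory format, the service names and the host names are all hypothetical, not a description of our actual tooling.

```python
def under_replicated(services, minimum=2):
    """Return the names of services running on fewer than `minimum` distinct hosts.

    `services` maps a service name to the list of hosts it runs on.
    This inventory format is an assumption made for the example.
    """
    return sorted(
        name
        for name, hosts in services.items()
        if len(set(hosts)) < minimum
    )


# Hypothetical inventory: internal DNS lives on one host only.
inventory = {
    "internal-dns": ["db1"],
    "database": ["db1", "db2"],
}

for name in under_replicated(inventory):
    print(f"WARNING: {name} is not replicated on at least two servers")
```

Run regularly, a check like this would flag a half-finished migration such as the single internal DNS server before it could become a single point of failure.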
That internal DNS server was also running on the machine that’s our primary database server. Unfortunately that server crashed with a kernel panic. Normally we’d just fail everything over to our replica database server, but because the internal DNS server was also down, all our tools which expected to be able to resolve internal domain names also failed, and we weren’t able to fail over easily. Also, because the internal DNS was down, we weren’t easily able to access the remote management module (RMM) of the server to reboot it, and had to go through the NYI ticket system, which always takes a bit longer.
The net result is that something we should have detected within a few minutes, and easily handled with our failover tools, took almost an hour to resolve.
We’ve now set up the internal DNS servers as part of our standard redundant setup. We’ve also set up consistent naming and IP addresses for all our RMMs so that they’ll be easier to access, and even if there are DNS problems, we’ll be able to reach them by IP.
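The idea of reaching an RMM by IP when DNS is down can be sketched roughly as follows. The host names, the addresses (taken from the documentation range 192.0.2.0/24) and the `resolve_rmm` helper are hypothetical and only show the fallback pattern, not our real tools.

```python
import socket

# Hypothetical static table of RMM addresses, kept alongside the tools so
# that it remains usable when internal DNS is unavailable.
RMM_FALLBACK_IPS = {
    "rmm-db1.internal.example": "192.0.2.11",
    "rmm-db2.internal.example": "192.0.2.12",
}


def resolve_rmm(hostname, fallback=RMM_FALLBACK_IPS, resolver=socket.gethostbyname):
    """Resolve an RMM host name, falling back to a static IP if DNS fails."""
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror (failed lookup) is a subclass of OSError.
        if hostname in fallback:
            return fallback[hostname]
        raise
```

Because the naming and addressing are consistent, a table like this stays small and predictable, and an operator can also just type the IP directly.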
We can’t stop servers crashing, but we aim to have every service redundant so that if any server fails, we can fail over to a replica within a short amount of time, either automatically where possible, or manually where we think it’s better to have some human intervention first.
Overall, I believe that our continuous attempts to improve reliability have been working very well, and we always aim to learn from any problems and do better.
Update 6/Oct: I’ve posted some additional information to this forum thread.