Dec 10: Security - Availability

Technical

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 9th December was about email authentication. The following post is from our support team.

Technical level: medium

Availability is the ability of authorised users to access their data in a timely manner.

While both Confidentiality and Integrity are important, they go unnoticed unless something goes wrong. Availability, on the other hand, is super visible. If you have an outage, it will be all over the media - like Yahoo, or Microsoft, or Google, or indeed FastMail.

Availability at FastMail

Our record speaks for itself: our public Pingdom page and status page show how reliably available we are.

We achieve this by eliminating single points of failure and replicating all data in close to real time.

I was in New York earlier this year consolidating our machines, removing some really old ones, and also moving everything to new cabinets which have a better cooling system and more reliable power. We had out-grown the capacity in our existing cabinets, and didn't have enough power to run completely on just half our circuits any more.

Our new cabinets have redundant power - a strip up each side of the rack. Every server is wired to both strips and can run from just one, and each strip has the capacity to run the entire rack by itself.

[Photos: PDU power feeds; rear-of-rack cabling]

The servers are laid out in such a way that we can shut down any one cabinet. In fact, we can shut down half the cabinets at a time without impacting production users. In 2014 it's not such a big deal to be able to reinstall any one of your machines in just a few minutes - but in 2005 when we switched to fully automated installation of all our machines, only a few big sites were doing it. For the past few years, we've been at the point where we can shut down any machine with a couple of minutes' notice to move service off it, and users don't even notice that it's gone. We can then fully reinstall the operating system.

We have learned some hard lessons about availability over the years. The 2011 incident took a week to recover from because it hit every server at exactly the same time. We couldn't mitigate it by moving load to the replicas. We are careful not to upgrade everywhere at once any more, no matter how obvious and safe the change looks!
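
That lesson boils down to a simple rule: roll a change out one replica at a time, health-check after each step, and halt at the first failure so the untouched replicas stay on the known-good version. A minimal sketch of that loop (the host names, upgrade step, and health check below are hypothetical stand-ins, not our actual tooling):

```python
def rolling_upgrade(hosts, upgrade, healthy):
    """Upgrade hosts one at a time; return the hosts actually upgraded.

    Raises RuntimeError as soon as an upgraded host fails its health
    check, leaving the rest of the fleet on the old, known-good version.
    """
    done = []
    for host in hosts:
        upgrade(host)                 # apply the change to this host only
        if not healthy(host):
            raise RuntimeError(
                f"{host} unhealthy after upgrade; halting rollout")
        done.append(host)             # safe - move on to the next host
    return done
```

With real upgrade and health-check functions plugged in, a failure on the second host means the third is never touched - exactly the property that was missing in 2011, when the same change hit every server at once.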

Availability and Jurisdiction

People often ask why we're not running production out of our Iceland datacentre. We host only secondary MX, secondary DNS, and an offsite replica of all data there.

While we work hard on the reliability of our systems, a lot of the credit for our uptime has to go to our awesome hosting provider, NYI. They provide rock-solid power and network. To give you some examples:

  • During Hurricane Sandy, when other datacentres were bucketing fuel
    up the staircases and having outages, we lost power on ONE circuit
    for 30 seconds. It took out two units which hadn't been cabled
    correctly, but they weren't user facing anyway.
  • We had a massive DDoS attempted against us a while ago using the
    NTP reflection flaw. NYI blocked just the NTP port to the one host
    being targeted, informed us, and asked their upstream providers to
    push the block out onto the network to kill off the attack. Our
    customers didn't even notice.
  • They provide 24/7 onsite technical staff. Once when they were busy
    with another emergency, I had to wait 30 minutes for a response on
    an issue. The CEO apologised to me personally for having to wait.
    Normal response times are within 2 minutes.
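
To give a sense of scale for that NTP attack: the monlist feature answers a tiny spoofed query with a very large response, which is what makes it attractive for reflection. The arithmetic below uses the amplification factor commonly cited for NTP (roughly 557x, from US-CERT's advisory on UDP amplification); the 100 Mbit/s attacker is a made-up example, not a figure from our incident:

```python
# Back-of-envelope for a reflection DDoS: the victim receives roughly
# (attacker bandwidth) x (amplification factor) of traffic. The ~557x
# figure for NTP monlist is the commonly cited one; the attacker
# bandwidth is an arbitrary illustration.
NTP_AMPLIFICATION = 557

def reflected_mbps(attacker_mbps, amplification=NTP_AMPLIFICATION):
    """Traffic aimed at the victim, in Mbit/s."""
    return attacker_mbps * amplification

print(reflected_mbps(100))  # 55700 - a 100 Mbit/s attacker yields ~55 Gbit/s
```

That multiplier is why blocking the reflected traffic upstream, as NYI did, matters: the victim's own link is saturated long before the attacker's is.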

The only outage this year that can be attributed to NYI at all is a 5-minute one, when they switched the network uplink from copper to fibre and set the wrong routing information on the new link. 5 minutes in a year is pretty good.
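
For context on what "5 minutes in a year" means, here is the arithmetic behind uptime percentages (pure illustration, no FastMail-specific numbers beyond that one):

```python
# Uptime arithmetic: convert between downtime minutes per year and the
# "number of nines" uptime percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def uptime_pct(downtime_minutes):
    """Uptime percentage implied by this much downtime in a year."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_YEAR)

def downtime_budget(pct):
    """Minutes of downtime per year allowed at the given uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - pct) / 100.0

print(f"{uptime_pct(5):.4f}%")          # 5 minutes/year is ~99.9990% uptime
print(f"{downtime_budget(99.99):.1f}")  # "four nines" still allows ~52.6 min
```

In other words, a single 5-minute outage in a year is comfortably inside even a "five nines" budget of about 5.3 minutes.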

The sad truth is, we just don't have the reliability from our Iceland datacentre to provide the uptime that our users expect of us.

  • Network stats to New York: you can see that the only time it drops
    below 99.99% is July, when I moved all the servers, and there was
    the outage on the 26th (actually 5 minutes by my watch). As far as
    I can tell, the outages on the 31st were actually a Pingdom error
    rather than a problem in NYI.
  • Network stats to Iceland: ignore the 5-hour outage in August,
    because that was actually me in the datacentre. We don't have
    dual-cabinet redundancy there, so I couldn't keep services up while
    I replaced parts. Even so, there are multiple outages longer than
    10 minutes. These would have been very user-visible if production
    traffic ran there; as it is, they just page the poor on-call
    engineer.

If we were to run production traffic out of another datacentre, we would first have to be convinced that it matched the quality NYI provides. Availability is the lifeblood of our customers: they need email to be up, all the time.

Human error

Once you get the underlying hardware and infrastructure to the level of reliability we have, the normal cause of problems is human error.

We have put a lot of work this year into processes to help avoid human errors causing production outages. There will be more on the testing process and beta => qa => production rollout stages in a later blog post. We've also had to change our development style slightly to deal with the fact that we now have two fully separate instances of our platform running in production - we'll also blog about that, since it's been a major project this year.

General internet issues

Of course, the internet itself is never 100% reliable, as our customers on Optus and Vodafone in Australia saw recently. Optus were providing a route back from NYI which went through Singtel, and it wasn't passing packets. There was nothing we could do; we had to wait for Optus to figure out what was wrong and fix it at their end.

We had a similar situation with Virgin Media in the UK back in 2013, but then we managed to route traffic via a proxy in our Iceland datacentre. This wouldn't have worked for Australia, because traffic from Australia to Iceland travels through New York too.

We are looking at what is required to run up a proxy in Australia for Asia-Pacific region traffic if there are routing problems from this part of the world again. Of course, that depends on the traffic from our proxy being able to get through.
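
At its core, a fallback proxy like that just relays TCP bytes between the stranded clients and the datacentre over a path that still works. A minimal sketch using Python's asyncio streams (the upstream host name and ports are hypothetical; a production deployment would more likely use something like nginx or haproxy):

```python
import asyncio

async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except (ConnectionResetError, BrokenPipeError):
        pass  # peer went away mid-transfer; nothing more to forward
    finally:
        writer.close()

async def handle(client_reader, client_writer, upstream_host, upstream_port):
    """Relay one client connection to the upstream datacentre and back."""
    upstream_reader, upstream_writer = await asyncio.open_connection(
        upstream_host, upstream_port)
    await asyncio.gather(
        pipe(client_reader, upstream_writer),   # client -> datacentre
        pipe(upstream_reader, client_writer))   # datacentre -> client

async def main():
    # "mail.example.com" stands in for the real datacentre endpoint.
    server = await asyncio.start_server(
        lambda r, w: handle(r, w, "mail.example.com", 443),
        host="0.0.0.0", port=8443)
    async with server:
        await server.serve_forever()

# To run the relay: asyncio.run(main())
```

The relay never inspects the traffic, so end-to-end TLS between the client and the datacentre is unaffected - which is also why the proxy only helps when the path from the proxy onwards is actually passing packets.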

One of the nastiest network issues we've ever had was when traffic to/from Iceland was being sent through two different network switches in London, depending on the exact source/destination address pair - and one of the switches was faulty - so only half our traffic was getting through. That one took 6 hours to resolve. Thankfully, there was no production traffic to Iceland, so users didn't notice.