Dec 14: On Duty!


This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 13th December was about our DNS hosting system. The following post on 15th December shows how we load the mailbox screen so quickly.

Technical level: low

BZZZT, BZZZT. Wake up, something is broken! My wife hates the SMS tone on my phone. It triggers bad memories of the middle of the night.

When I started at FastMail in 2004, we already had a great production monitoring system. Like many things, we were doing this well before most internet services. Everyone was on call all the time, and we used two separate SMS services plus the jfax faxing service API to make a phone call (which would squeal at you) just in case the SMSes failed.

The first year was particularly bad - there was a kernel lockup bug which we hadn't diagnosed. Thankfully Chris Mason traced it to a bug in reiserfs, triggered when too many files were deleted at once, and fixed it. Things became less crazy after that.

End-to-end monitoring

Our production monitoring consists of two main jobs, "pingsocket" and "pingfm". pingsocket runs every 2 minutes; pingfm runs every 10 minutes. There are also individual service monitors for replication, which inject data into individual databases or mail stores and check that the same data reaches the replicas in a reasonable timeframe. There are tests for external things like RBL listings and outbound mail queue sizes, and finally tests for server health like temperature, disk failure and power reliability.

pingsocket is the per-machine test. It checks that all services which are supposed to be running, are running. It has some basic abilities to restart services, but mostly it will alert if something is wrong.
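
The real pingsocket is FastMail-internal, but the core idea - check each expected service, try a restart, and alert only if that fails - can be sketched in a few lines. The service names and the check/restart hooks below are hypothetical stand-ins:

```python
def check_services(expected, is_running, restart):
    """Per-machine service check in the spirit of pingsocket.

    Tries one restart for each stopped service, and returns alerts
    only for the services that stayed down."""
    alerts = []
    for name in expected:
        if is_running(name):
            continue
        restart(name)  # basic self-healing before bothering a human
        if not is_running(name):
            alerts.append(f"{name}: down, restart failed")
    return alerts
```

The hooks are passed in rather than hard-coded so the same logic could drive a real process check (a pid file, a systemd query) or a test harness.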

pingfm is the full end-to-end test. It logs in via the website and sends itself an email. It checks that the email was delivered correctly and processed by a server-side rule into a subfolder. It then fetches the same email again via POP3 and injects that into yet another folder and confirms receipt. This tests all the major email subsystems and the web interface.

Any time something breaks that we don't detect with our monitoring, we add a new test that will catch it. We also think through likely failure modes when we add something new, and write tests for them - for example, the calendar service was being tested on every server even before we released it to beta.

Notification levels

The split has always been between "urgent" (SMS) and "non-urgent" (email). Many issues, like disk usage levels, have both a warning threshold and an alert threshold. We get emails telling us about the issue first, and only a page if it reaches the alert stage. This is great for getting sleep.

We now have a middle level where the bot notifies us via our internal IRC channel - more interruptive than an email, but still not enough to wake anybody.
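
As a sketch of the three levels, here's a tiny classifier for a metric with two thresholds plus the optional IRC middle level. The channel names and thresholds are illustrative, not FastMail's actual values:

```python
EMAIL, IRC, PAGE = "email", "irc", "page"

def notify_level(value, warn, alert, irc=None):
    """Map a metric (e.g. disk usage %) to a notification channel.

    Illustrative only: below the warning threshold nothing is sent,
    and only crossing the alert threshold wakes anybody up."""
    if value >= alert:
        return PAGE    # urgent: SMS the person on duty
    if irc is not None and value >= irc:
        return IRC     # middle level: interrupts, but won't wake anyone
    if value >= warn:
        return EMAIL   # non-urgent: read it in the morning
    return None
```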

Duty Roster

Until 2011, every alert went to everybody in the company - though it would choose somebody first based on a complex algorithm. If you didn't want to be paged, you had to turn your phone off. This made holidays hard, and occasionally everyone would be offline at once, which wasn't great. Once I switched my phone on in Newark Airport to discover that the site was down and nobody else had responded, so I sat perched on a chair with my laptop and fixed things. We knew we needed a better system.

That system is called 'Lars Botlesen' or larsbot for short (as a joke on the Opera CEO's name). The bot has been injected into the notification workflow, and if machines can talk to the bot, they will hand over their notification instead of paging or emailing directly. The bot can then take action.

The best thing we ever added was the 30-second countdown before paging. Every 5 seconds the channel gets a message. Here's one from last night (with phone numbers and pager keys XXXed out):

<Lars_Botlesen> URGENT gateway2 - [rblcheck: new: dirty, but only one clean IP remaining. Run rblcheck with -f to force] [\n   "listed",\n   {\n      "" : 1\n   },\n   {\n      "" : "listed"\n   }\n]\n
<Lars_Botlesen> LarsSMS: 'gateway2: [\n   "listed",\n   {\n      "" : 1\n   },\n   {\n      "" : "listed"\n   }\n]\n' in 25 secs to brong (XXX)
<Lars_Botlesen> LarsSMS: 'gateway2: [\n   "listed",\n   {\n      "" : 1\n   },\n   {\n      "" : "listed"\n   }\n]\n' in 20 secs to brong (XXX)
<brong_> lars ack
<Lars_Botlesen> sms purged: brong (XXX) 'gateway2: [\n   "listed",\n   {\n      "" : 1\n   },\n   {\n      "" : "listed"\n   }\n]\n'
* Lars_Botlesen has changed the topic to: online, acked mode, Bron Gondwana in Australia/Melbourne on duty - e:XXX, p:+XXX,Pushover
<Lars_Botlesen> ack: OK brong_ acked 1 issues. Ack mode enabled for 30 minutes

We have all set our IRC clients to alert us if a message contains the text LarsSMS - so if anyone is at their computer, they can acknowledge the issue and stop the page. By default the bot stays in acknowledged mode for 30 minutes, suppressing further messages - because the person who acknowledged is actively working to fix the issue and looking out for new issues as well.

This was particularly valuable while I was living in Oslo. During my work day, the Australians would be asleep - so I could often catch an issue before it woke the person on duty. Also if I broke something, I would be able to avoid waking the others.

If I hadn't acked, it would have SMSed me (and sent a notification through Pushover, the phone push service which has replaced jfax). Actually, it wouldn't have, because this happened 2 seconds later:

<robn> lars ack
* Lars_Botlesen has changed the topic to: online, acked mode, Bron Gondwana in Australia/Melbourne on duty - e:XXX, p:+XXX,Pushover
<Lars_Botlesen> ack: no issues to ack. Ack mode enabled for 30 minutes

We often get a flood of acknowledgements when people are awake. Everyone tries to be first!

If I hadn't responded within 5 minutes of the first SMS, larsbot would go ahead and page everybody. So in the worst case, where the person on duty doesn't respond, it is only 5 minutes and 30 seconds before everyone is alerted. This is a major part of how we keep our high availability.
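
The whole escalation flow - a 30-second countdown that an ack cancels, then an SMS to the person on duty, then everybody 5 minutes later - can be sketched as a small function. Names and behaviour are simplified; the real larsbot is FastMail-internal:

```python
COUNTDOWN = 30        # seconds of IRC countdown before the first SMS
ESCALATE = 5 * 60     # seconds after the first SMS before paging everyone

def pages_sent(ack_time):
    """Who gets paged, given when (in seconds after the alert) someone
    acked - or None if nobody ever did."""
    if ack_time is not None and ack_time < COUNTDOWN:
        return []                    # caught on IRC: the SMS is purged
    pages = ["on-duty"]              # first SMS goes out at t=30s
    if ack_time is None or ack_time >= COUNTDOWN + ESCALATE:
        pages.append("everyone")     # worst case: everybody at t=5:30
    return pages
```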

Duty is no longer automatically allocated - you tell the bot that you're taking over. You can use the bot to pass duty to someone else as well, but it's discouraged. The person on duty should be actively and knowingly taking on the responsibility.

We pay a stipend for being on duty, and an additional amount for actually being paged. The amount is enough that you won't feel too angry about being woken, but low enough that there's no incentive to break things so you make a fortune. The quid pro quo for being paid is that you are required to write up a full incident report detailing not only what broke, but what steps you took to fix it, so that everyone else knows what to do next time. Obviously, if you are on duty and _don't_ respond to a page, you get docked the entire day's duty payment and it goes to the person who responded instead.

Who watches the watchers?

As well as larsbot, which runs on a single machine, we have two separate machines running something called arbitersmsd. It's a very simple daemon which pings larsbot every minute and confirms that larsbot is running and happy.

Arbitersmsd is like a spider with tentacles everywhere - it sits on standalone servers with connections to all our networks, internal and external, as well as a special external uplink which is physically separate from our main connection. It also monitors those links. If larsbot can't get a network connection to the world, or one of the links goes down, arbitersmsd will scream to everybody through every link it can.
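
One pass of an arbitersmsd-style watchdog might look like the sketch below: ping over every link, and if anything is unreachable, push an alert out through every link that still works. The link names and the ping/send hooks are hypothetical:

```python
def watchdog_pass(links, ping, send_alert):
    """One monitoring pass over all network links.

    Returns the links that failed, after alerting through every
    surviving link so at least one message gets out."""
    failed = [name for name in links if not ping(name)]
    if failed:
        message = "links down: " + ", ".join(failed)
        for name in links:
            if name not in failed:   # use every path that still works
                send_alert(name, message)
    return failed
```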

There's also a copy running in Iceland, which we hear from occasionally when the intercontinental links have problems - so we're keenly aware of how reliable (or not) our Iceland network is.

Automate the response, but have a human decide

At one stage, we decided to try to avoid being woken for some types of failure by using Heartbeat, a high-availability solution for Linux, on our frontend servers. The thing is, our servers are actually really reliable, and we found that Heartbeat failed more often than our systems did - so the end result was reduced reliability! It's counter-intuitive, but automated high-availability often isn't.

Instead, we focused on making the failover itself as easy as possible, while having a human make the decision to actually perform it. Combined with our fast-response paging system, this gives us high availability, safely. It also means we have great tools for our everyday operations. I joke that you should be able to use the tools at 3am while not sober and barely awake. There's a big element of truth there, though - one day it will be 3am, and I won't be sober - and the tool needs to be usable. It's a great design goal, and it makes the tools nice to use while awake as well.

A great example of how handy these tools are was when I had to upgrade the BIOS on all our blades because of a bug in our network card firmware which would occasionally flood the entire internal network. This was the cause of most of our outages through 2011-2012, and took months to track down. We had everything configured so that we could shut down an entire bladecentre, because there were failover pairs in our other bladecentre.

It took me 5 cluster commands and about 2 hours to do the whole thing (just because the BIOS upgrades took so long, there were about 20 updates to apply, and I figured I should do them all at once):

  1. as -r bc1 -l fo -a - move all the services to the failover pair
  2. asrv -r bc1 all stop - stop all services on the machines
  3. as -r bc1 -a shutdown -h now - shut down all the blades
  4. This step was a web-based bios update, in which I checked checkboxes
    on the management console for every blade and every BIOS update. The
    console fetched the updates from the web and uploaded them onto each
    blade. This is the bit that took 2 hours. At the end of the updates,
    every server was powered up automatically.
  5. asrv -r bc1 -a all start - start all services again
  6. as -r bc2 -l fo -m - move everything from bladecentre 2 that's not
    mastered there back to its primary host

It's tooling like this which lets us scale quickly without needing a large operations team. For the second bladecentre I decided to reinstall all the blades as well; it took an extra 10 minutes and one more command: as -r bc2 -l utils/ -r before the shutdown.

Nobody knows

Some days are worse than others. Often we go weeks without a single incident.

The great thing about our monitoring is that it usually alerts us to potential issues before they become visible. We often fix things, sometimes even in the middle of the night, and our customers aren't aware. We're very happy when that happens - if you never know that we're here, then we've done our job well!

BZZZT Twitch - oh, just a text from a friend. Phew.