This blog post is part of the FastMail 2014 Advent Calendar.
Technical level: medium
Any server in our entire system can be reinstalled in under 10 minutes, without users being aware.
That was a very ambitious goal in 2004 when I started at FastMail. DevOps was in its infancy and the tools available to us weren't very good yet, but staying with what we had was not scalable. Every machine was hand-built by following a "script", which was really a set of instructions on our internal wiki, and hoping you got every step perfect. Every machine was slightly different.
We chose Debian Linux as the base system for the new automated installs, using FAI to install the machines from network boot to a full operating system with all our software in place.
Our machines are listed by name in a large central configuration file, which maps from the hardware addresses of the ethernet cards (easy to detect during startup, and globally unique) to a set of roles. The installation process uses those roles to decide what software to install.
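As a rough illustration of that lookup, here is a minimal Python sketch. It is not our actual configuration format: the hardware addresses, machine names, role names and package lists are all made up for the example, and the real install uses FAI rather than a script like this.

```python
# Hypothetical central configuration: hardware address -> machine name and roles.
MACHINES = {
    "00:25:90:aa:bb:01": {"name": "imap1", "roles": ["base", "imap"]},
    "00:25:90:aa:bb:02": {"name": "web1", "roles": ["base", "web"]},
}

# Hypothetical mapping from role to the packages that role needs.
ROLE_PACKAGES = {
    "base": ["openssh-server", "ntp"],
    "imap": ["cyrus-imapd"],
    "web": ["nginx"],
}


def normalise_mac(mac):
    """Lower-case the address and use ':' separators so lookups are consistent."""
    return mac.strip().lower().replace("-", ":")


def packages_for(mac):
    """Return (machine name, ordered package list) for a hardware address."""
    machine = MACHINES.get(normalise_mac(mac))
    if machine is None:
        raise KeyError(f"unknown hardware address: {mac}")
    packages = []
    for role in machine["roles"]:
        for pkg in ROLE_PACKAGES[role]:
            if pkg not in packages:  # keep first occurrence, drop duplicates
                packages.append(pkg)
    return machine["name"], packages
```

The point of keying on the hardware address is that a freshly wiped machine knows nothing about itself except what the network card reports, so everything else has to be derived from that one value.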
Aside: the three types of data
I am a believer in this taxonomy, which splits data into three different types for clarity of thinking:
- Own Output - creative effort you have produced yourself. In theory
you can reproduce it, though anyone who has lost hours of work to a
crash knows just how disheartening it can be to repeat yourself.
- Primary Copy of others' creative output. Unreproducible. Lose this,
it's gone forever.
- Secondary Copy of anything. Cache. Can always be re-fetched.
There is a bit of blurring between categories in practice, particularly as you might find secondary copies in a disaster and get back primary data you thought was lost. But for planning, these categories are very valuable for deciding how to care for the data.
Own Output - stick it in version control. Always.
Since the effort that goes into creating is so high compared to data
storage cost, there's no reason to discard anything, ever. Version
control software is designed for precisely this purpose.
The repository then becomes a Primary Copy, and we fall through to:
Primary Copy - back it up. Replicate it. Everything you can to
ensure it is never lost. This stuff is gold.
In FastMail's case as an email host, it is other people's precious
memories. We store emails on RAIDed drives with battery-backed RAID
units on every backend server, and each copy is replicated to two
other servers, giving a total of 3 copies, each on RAID1, or 6 disks
in total with a full copy of every message on them.
One of those copies is in a datacentre a third of the distance
around the world from the other two.
On top of this, we run nightly backups to a completely separate
format on a separate system.
Secondary Copy - disposable. Who cares. You can always get it again.
Actually, we do keep backups of Debian package repositories for
every package we use just in case we want to reinstall and the
mirror is down. And we keep a local cache of the repository for fast
reinstalls in each datacentre too.
It's amazing how much stuff on a computer is just cache. For example, operating system installs. It is so frustrating when installing a home computer how intermingled the operating system and updates (all cache) become with your preference selections and personal data (own creative output or primary copy). You find yourself backing up a lot more than is strictly necessary.
Operating system is purely cache
We avoid the need to do full server backups at FastMail by never changing config files directly. All configuration goes in version-controlled template files. No ifs, no buts. It took a while to train ourselves with good habits here - reinstalling frequently and throwing out anything that wasn't stored in version control until everyone got the hint.
The process of installing a machine is a netboot with FAI which wipes the system drive, installs the operating system, and then builds the config from git onto the system and reboots ready-to-go. This process is entirely repeatable, meaning the OS install and system partition is 100% disposable, on every machine.
If we were starting today, we would probably build on Puppet or one of the other automation toolkits that didn't exist or weren't complete enough when I first built this. Right now we still use Makefiles and Perl's Template-Toolkit to generate the configuration files. You can run make diff on each configuration directory to see what's different between a running machine and the new configuration, then make install to upgrade the config files and restart the related service. It works fine.
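To make the diff step concrete, here is a minimal Python sketch of the same idea: render a config file from a template, then compare it with what is currently installed. It is an illustration only. Our real setup uses Makefiles and Template-Toolkit, and the function names, the template syntax (Python's string.Template) and the example variables below are all assumptions made for this example.

```python
import difflib
from string import Template


def render(template_text, variables):
    """Fill in a template; raises KeyError if a variable is missing."""
    return Template(template_text).substitute(variables)


def config_diff(installed_text, rendered_text, path="config"):
    """Return a unified diff between the installed and freshly generated file.

    An empty string means the running machine already matches the
    version-controlled configuration.
    """
    return "".join(
        difflib.unified_diff(
            installed_text.splitlines(keepends=True),
            rendered_text.splitlines(keepends=True),
            fromfile=f"{path} (installed)",
            tofile=f"{path} (generated)",
        )
    )


if __name__ == "__main__":
    # Hypothetical template and values, purely for illustration.
    template = "myhostname = $hostname\n"
    generated = render(template, {"hostname": "mx1.example.com"})
    print(config_diff("myhostname = old.example.com\n", generated, "main.cf"))
```

An install step would then write the generated text over the installed file and restart the service, but only when the diff is non-empty.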
It doesn't matter what exact toolkit is used to automate system installs, so long as it exists. It's the same process regardless of whether we just want to reinstall to ensure a clean system, are recovering from a potential system compromise, are replacing failed hardware, or we have new machines to add to our cluster.
User data on separate filesystems
Most of our machines are totally stateless. They perform compute roles, generating web pages, scanning for spam, routing email. We don't store any data on them except cached copies of frequently accessed files.
User data is stored in the following places:
- Email storage backends (of course!)
- File storage backends
- Database servers
- Backup servers
- Outbound mail queue (this one is a bit of a special case - email can
be held for hours because the receiving server is down, misconfigured,
or temporarily blocking us. We use drbd between two machines for the
spool, because postfix doesn't like it when the inode changes)
The reinstall leaves these partitions untouched. We create data partitions using either LUKS or the built-in encryption of our SSDs, and then create partitions with labels so they can be automatically mounted. All the data partitions are currently created with the ext4 filesystem, which we have found to be the most stable and reliable choice on Linux for our workload.
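Mounting by label is what lets a freshly installed system find its data again without knowing device names. As a small sketch of the idea in Python, the function below turns a list of filesystem labels into fstab lines. The labels, the /mnt mount root and the mount options are assumptions for the example, not our real layout, and it presumes any encrypted volume has already been unlocked so the labelled filesystem is visible.

```python
def fstab_entry(label, mount_root="/mnt", fstype="ext4"):
    """Build one fstab line that mounts a filesystem by its label."""
    return f"LABEL={label} {mount_root}/{label} {fstype} defaults 0 2"


def fstab_for(labels):
    """Build fstab lines for every data partition label, in sorted order."""
    return "\n".join(fstab_entry(label) for label in sorted(labels)) + "\n"


if __name__ == "__main__":
    # Hypothetical labels, purely for illustration.
    print(fstab_for(["data2", "data1"]), end="")
```

Because the entries refer to labels rather than device names, the same generated file works even if the disks are detected in a different order after the reinstall.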
All data is on multiple machines
We use different replication systems for different data. As mentioned in the Integrity post, we use an application-level replication system for email data so we can get strong integrity guarantees. We use a multi-master replication system for our MySQL database, which we will write about in this series as well. I'd love to write about the email backup protocol too, but I'm not sure I'll have time in this series! And the file storage backend uses another protocol again.
The important thing is that every type of data is replicated over multiple machines, so with just a couple of minutes' notice you can take a machine out of production and reinstall or perform maintenance on it (the slowest part of shutting down an IMAP server these days is copying the search databases from tmpfs to real disk so we don't have to rebuild them after the reboot).
Our own work
We use the git version control system for all our own software. When I started at FastMail we used CVS, and we converted to Subversion and then finally to Git.
We have a reasonably complex workflow, involving per-host branches, per-user development branches, and a master branch where everything eventually winds up. The important thing is that nothing is considered "done" until it's in git. Even for simple one-off tasks, we will write a script and archive it in git for later reference. The amount of code that a single person can write is so small these days compared to the size of disks that it makes sense to keep everything we ever do, just in case.