This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on December 3rd was about how we do real-time notifications. The next post on December 5th is about data integrity.

Technical level: highly technical

We've written a lot about our slots/stores architecture before - so I'll refer you to our documentation rather than rehashing the details here.

We have evolved over the years, and particularly during the Opera time, I had to resist the forces suggesting a "put all your storage on a SAN and your processing on compute nodes" design, or "why don't you just virtualise it", as if that's a magic wand that solves all your scalability and IO challenges.

Luckily I had a great example to point to: Berkeley University had a week-long outage on their Cyrus systems when their SAN lost a drive. They were sitting so close to the capability limits of their hardware that their mail architecture couldn't handle the extra load of adding a new disk, and everything fell over. Because there was one single pool of IO, this meant every single user was offline.

I spent my evenings that week (I was living in Oslo) logging in to their servers and helping them recover. Unfortunately, the whole thing is very hard to google - search for "Berkeley Cyrus" and you'll get lots of stuff about the Berkeley DB backend in Cyrus and how horrible it is to upgrade...

So we are very careful to keep our IO spread out across multiple servers with nothing shared, so an issue in one place won't spread to the users on other machines.

The history of our hardware is also, to quite a large degree, the history of the Cyrus IMAPd mail server. I've been on the Cyrus Governance board for the past 4 years, and writing patches for a lot longer than that.

Email is the core of what we do, and it's worth putting our time into making it the best we can. There are things you can outsource, but hardware design and the mail server itself have never been among them for us.

Early hardware - meta data on spinning disks

When I started at FastMail 10 years ago, our IMAP servers were honking great IBM machines (6 rack units each) with a shared disk array between them, and a shiny new 4U machine with a single external RAID6 unit. We were running a pre-release CVS 2.3 version of Cyrus on them, with a handful of our own patches on top.

One day, that RAID6 unit lost two hard disks in a row, and a third started having errors. We had no replicas. We did have backups, but it took a week to get everyone's email restored onto the new servers we had just purchased and were still testing. At least we had new servers! For that week though, users didn't have access to their old email. We didn't want this to ever happen again.

Our new machines were built more along the lines of what we have now, and we started experimenting with replication. The machines were 2U boxes from Polywell (long since retired now), with 12 disks - 4 high speed small drives in two sets of RAID1 for metadata, and 8 bigger drives (500Gb! - massive for the day) in two sets of RAID5 for email spool.

Even then I knew this was the right way - standalone machines with lots of IO capability, and enough RAM and processor (they had 32Gb of RAM) to run the mail server locally, so there are minimal dependencies in our architecture. You can scale that as widely as you want, with a proxy in front that can direct connections to the right host.

We also had 1U machines with a pair of attached SATA to SCSI drive units on either side. Those drive units had the same disk layout as the Polywell boxes, except the OS drives were in the 1U box - I won't talk any more about these, they're all retired too.

This ran happily for a long time on Cyrus 2.3. We wrote a tool to verify that replicas were identical to masters in all the things that matter (what can be seen via IMAP), and pushed tons of patches back to the Cyrus project to improve replication as we found bugs.
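
As a rough illustration of what such a comparison involves (this is not the FastMail tool, and the `MailboxState` structure and its fields are hypothetical), a checker might reduce each mailbox to its IMAP-visible state and report every difference:

```python
from dataclasses import dataclass, field


@dataclass
class MailboxState:
    """IMAP-visible state of one mailbox (hypothetical structure)."""
    uidvalidity: int
    uidnext: int
    # uid -> (frozenset of flags, content digest)
    messages: dict = field(default_factory=dict)


def compare_mailbox(master: MailboxState, replica: MailboxState) -> list:
    """Return a list of human-readable differences; empty means identical."""
    problems = []
    if master.uidvalidity != replica.uidvalidity:
        problems.append("uidvalidity differs")
    if master.uidnext != replica.uidnext:
        problems.append("uidnext differs")
    # Walk the union of UIDs so messages missing on either side are reported.
    for uid in sorted(master.messages.keys() | replica.messages.keys()):
        m = master.messages.get(uid)
        r = replica.messages.get(uid)
        if m is None:
            problems.append(f"uid {uid} only on replica")
        elif r is None:
            problems.append(f"uid {uid} only on master")
        elif m != r:
            problems.append(f"uid {uid} differs in flags or content")
    return problems
```

An empty list means the replica matches the master for everything this sketch models; a real checker would cover more state than these few fields.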

We also added checksums to verify data integrity after various corruptions were detected between replicas which showed a small rate of bitrot (on the order of 20 damaged emails per year across our entire system) - and tooling to allow the damage to be fixed by pulling back the affected email from either a replica or the backup system and restoring it into place.
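
The detect-and-restore idea can be sketched in a few lines. This is only an illustration under stated assumptions: SHA-1 is picked as the checksum for the example, and `fetch_good_copy` is a hypothetical callback standing in for "ask a replica, then the backup system".

```python
import hashlib
from typing import Callable, Optional


def digest(data: bytes) -> str:
    """Checksum of a message's raw bytes (SHA-1 is an assumption here)."""
    return hashlib.sha1(data).hexdigest()


def verify_or_repair(data: bytes, expected: str,
                     fetch_good_copy: Callable[[], Optional[bytes]]) -> bytes:
    """Return bytes matching `expected`, pulling a good copy if needed."""
    if digest(data) == expected:
        return data  # the local copy is intact
    candidate = fetch_good_copy()  # hypothetical: replica first, then backup
    if candidate is not None and digest(candidate) == expected:
        return candidate
    raise ValueError("no intact copy available")
```

The restored copy is itself checked against the stored checksum, so a damaged replica cannot silently replace a damaged master.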

Metadata on SSD

Cyrus has two metadata files per mailbox (actually, there are more these days): cyrus.index and cyrus.cache. With the growing popularity of SSDs around 2008-2009, we wanted to use them, but cyrus.cache was just too big for the SSDs we could afford. It's also only used for search and some sort commands, but the architecture of Cyrus meant that you had to mmap the whole file every time a mailbox was opened, just in case a message was expunged. People had tried running with cache on slow disk and index on SSD, and it was still too slow.

There's another small directory which contains the global databases - mailboxes database, seen and subscription files for each user, sieve scripts, etc. It's a very small percentage of the data, but our calculations on a production server showed that 50% of the IO went to that config directory, about 40% to cyrus.index, and only 10% to cache and spool files.

So I spent a year concentrating on rewriting the entire internals of Cyrus. This became Cyrus 2.4 in 2010. It has consistent locking semantics, which actually make it a robust QRESYNC/CONDSTORE compatible server (new standards which required stronger guarantees than the Cyrus 2.3 data structures could provide), and also meant that cache wasn't loaded until it was actually needed.
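
The "not loaded until needed" idea is ordinary lazy initialisation. The toy class below is a Python illustration only (Cyrus itself is written in C, and the `Mailbox` class and its method names are invented for this sketch): opening a mailbox maps nothing, and the cache file is mapped the first time something asks for it.

```python
import mmap


class Mailbox:
    """Toy mailbox that maps its cache file only on first use (illustrative)."""

    def __init__(self, cache_path: str):
        self.cache_path = cache_path
        self._file = None
        self._cache = None  # nothing is mapped when the mailbox is opened

    @property
    def cache_loaded(self) -> bool:
        return self._cache is not None

    def cache(self) -> mmap.mmap:
        """Map the cache file on demand, e.g. for a search or sort."""
        if self._cache is None:
            self._file = open(self.cache_path, "rb")
            self._cache = mmap.mmap(self._file.fileno(), 0,
                                    access=mmap.ACCESS_READ)
        return self._cache

    def close(self) -> None:
        """Release the mapping and file handle if they were ever created."""
        if self._cache is not None:
            self._cache.close()
            self._file.close()
            self._cache = None
            self._file = None
```

Operations that only need index data never touch the cache file at all, which is what lets the large file sit on slower storage without hurting the common case.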

This was a massive improvement for SSD-based machines, and we bought a bunch of 2U machines from E23 (our existing external drive unit vendor) and then later from Dell through Opera's sysadmin team.

These machines had 12 x 2Tb drives in them, and two Intel x25E 64Gb SSDs. Our original layout was 5 sets of RAID1 for the 2Tb drives, with two hotspares.

Email on SSD

We ran happily for years with the 5 x 2Tb split, but something else came along: search. We wanted dedicated IO bandwidth for search. We also wanted to load the initial mailbox view even faster. We decided that almost all users get enough email in a week that their initial mailbox view can be generated from a week's worth of email.

So I patched Cyrus again. For now, this set of patches is only in the FastMail tree; it's not in upstream Cyrus. I plan to add it after Cyrus 2.5 is released. All new email is delivered to the SSD, and only archived off later. A mailbox can be split, with some emails on the SSD, and some not.

We purchased larger SSDs (Intel DC3700 - 400Gb), and we now run a daily job to archive emails that are bigger than 1Mb or older than 7 days to the slow drives.
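
The archiving rule is simple enough to write down directly. This is a sketch of the policy as described, not the actual job: "1Mb" is taken as 1 MiB, the age is measured from a delivery timestamp, and the function and constant names are made up for the example.

```python
from datetime import datetime, timedelta

MAX_SSD_SIZE = 1024 * 1024        # "bigger than 1Mb", taken as 1 MiB here
MAX_SSD_AGE = timedelta(days=7)   # "older than 7 days"


def should_archive(size_bytes: int, delivered: datetime, now: datetime) -> bool:
    """True if a message should move from the SSD to the slow drives."""
    return size_bytes > MAX_SSD_SIZE or (now - delivered) > MAX_SSD_AGE
```

Either condition alone is enough: a large attachment leaves the SSD at the next daily run even if it arrived that morning, and a tiny message leaves once it is a week old.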

This cut the IO to the big disks so much that we can put them back into a single RAID6 per machine. So our 2U boxes are now in a config imaginatively called 't15', because they have 15 x 1Tb spool partitions on them. We call one of these spools plus its share of SSD and search drive a "teraslot", as opposed to our earlier 300Gb and 500Gb slot sizes.

They have 10 drives in a RAID6 for 16Tb available space, 1Tb for operating system and 15 1Tb slots.

They also have 2 drives in a RAID1 for search, and two SSDs for the metadata.

Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/sdb1   917G  691G  227G  76% /mnt/i14t01
/dev/mapper/sdb2   917G  588G  329G  65% /mnt/i14t02
/dev/mapper/sdb3   917G  789G  129G  86% /mnt/i14t03
/dev/mapper/sdb4   917G   72M  917G   1% /mnt/i14t04
/dev/mapper/sdb5   917G  721G  197G  79% /mnt/i14t05
/dev/mapper/sdb6   917G  805G  112G  88% /mnt/i14t06
/dev/mapper/sdb7   917G  750G  168G  82% /mnt/i14t07
/dev/mapper/sdb8   917G  765G  152G  84% /mnt/i14t08
/dev/mapper/sdb9   917G   72M  917G   1% /mnt/i14t09
/dev/mapper/sdb10  917G  800G  118G  88% /mnt/i14t10
/dev/mapper/sdb11  917G  755G  163G  83% /mnt/i14t11
/dev/mapper/sdb12  917G  778G  140G  85% /mnt/i14t12
/dev/mapper/sdb13  917G  789G  129G  87% /mnt/i14t13
/dev/mapper/sdb14  917G  783G  134G  86% /mnt/i14t14
/dev/mapper/sdb15  917G  745G  173G  82% /mnt/i14t15
/dev/mapper/sdc1   1.8T  977G  857G  54% /mnt/i14search
/dev/md0           367G  248G  120G  68% /mnt/ssd14

The SSDs use software RAID1, and since Intel DC3700s have strong onboard crypto, we are using that rather than OS level encryption. The slot and search drives are all mapper devices because they use LUKS encryption. I'll talk more about this when we get to the confidentiality post in the security series.

The current generation

Finally we come to our current generation of hardware. The 2U machines are pretty good, but they have some issues. For a start, the operating system shares IO with the slots, so interactive performance can get pretty terrible when working on those machines.

Also, we only get 15 teraslots per 2U.

So our new machines are 4U boxes with 40 teraslots on them. They have 24 disks in the front on an Areca RAID controller, and 12 drives in the back connected directly to the motherboard SATA.

The front drives are divided into two lots of 2Tb x 12 drive RAID6 sets, for 20 teraslots each.
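
The capacity figures follow from the usual RAID6 rule that two drives' worth of space goes to parity. A small worked check (the function name is just for illustration):

```python
def raid6_usable_tb(drives: int, drive_tb: int) -> int:
    """Usable capacity of a RAID6 set: two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID6 needs at least 4 drives")
    return (drives - 2) * drive_tb


print(raid6_usable_tb(10, 2))  # 2U boxes: 16, i.e. 1Tb for the OS plus 15 slots
print(raid6_usable_tb(12, 2))  # 4U boxes: 20 per set, i.e. 20 teraslots each
```

Two 12-drive sets therefore give the 40 teraslots per 4U box.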

In the back, there are 6 2Tb drives in a pair of software RAID1 sets (3 drives per set, striped, for 3Tb usable) for search, and 4 Intel DC3700s as a pair of RAID1s. Finally, a couple of old 500Gb drives for the OS - we have tons of old 500Gb drives, so we may as well recycle them. In a way, this is really two servers in one, because they are completely separate RAID sets just sharing the same hardware.

Finally, they have 192Gb of RAM. Processor isn't so important, but cache certainly is!

Here's a snippet from the config file showing how the disk is distributed in a single Cyrus instance. Each instance has its own config file, and own paths on the disks for storage:

servername: sloti33d1t01
configdirectory: /mnt/ssd33d1/sloti33d1t01/store1/conf
sievedir: /mnt/ssd33d1/sloti33d1t01/store1/conf/sieve
duplicate_db_path: /var/run/cyrus/sloti33d1t01/duplicate.db
statuscache_db_path: /var/run/cyrus/sloti33d1t01/statuscache.db
partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive
tempsearchpartition-default: /var/run/cyrus/search-sloti33d1t01
metasearchpartition-default: /mnt/ssd33d1/sloti33d1t01/store1/search
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
archivesearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search-archive
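
To see at a glance which storage tier each option lands on, the snippet can be parsed as plain `key: value` lines and grouped by mount point. This is a small illustrative helper rather than anything from the Cyrus tooling; `parse_config` and `mount_of` are hypothetical names, and the sample reuses three lines from the config above.

```python
def parse_config(text: str) -> dict:
    """Parse simple 'key: value' lines into a dict, skipping blanks and comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        options[key.strip()] = value.strip()
    return options


def mount_of(path: str) -> str:
    """First two path components, e.g. '/mnt/ssd33d1' or '/var/run'."""
    return "/" + "/".join(path.strip("/").split("/")[:2])


sample = """\
partition-default: /mnt/ssd33d1/sloti33d1t01/store1/spool
archivepartition-default: /mnt/i33d1t01/sloti33d1t01/store1/spool-archive
datasearchpartition-default: /mnt/i33d1search/sloti33d1t01/store1/search
"""
for key, path in parse_config(sample).items():
    print(f"{key:32} {mount_of(path)}")
```

The output makes the split visible: the default partition is on the SSD mount, the archive partition on the slot's own RAID6 partition, and the search data on the search drives.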

As for the disks themselves, we have a tool to spit out the drive config of the SATA attached drives. It just pokes around in /sys for details:

$ utils/
  1 - HDD 500G RDY sdc        3QG023NC
  2 - HDD 500G RDY sdd        3QG023TR
  3 E SSD 400G RDY sde  md0/0 BTTV332303FA400HGN
  4 - HDD   2T RDY sdf  md2/0 WDWMAY04568236
  5 - HDD   2T RDY sdg  md3/0 WDWMAY04585688
  6 E SSD 400G RDY sdh  md0/1 BTTV3322038L400HGN
  7 - HDD   2T RDY sdi  md2/1 WDWMAY04606266
  8 - HDD   2T RDY sdj  md3/1 WDWMAY04567563
  9 E SSD 400G RDY sdk  md1/0 BTTV323101EM400HGN
 10 - HDD   2T RDY sdl  md2/2 WDWMAY00250279
 11 - HDD   2T RDY sdm  md3/2 WDWMAY04567237
 12 E SSD 400G RDY sdn  md1/1 BTTV324100F9400HGN
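
For readers curious what "poking around in /sys" looks like, here is a minimal sketch. It is not the tool shown above: it reports only the device name, whether the kernel considers the device rotational, and its size, using the standard Linux sysfs files `queue/rotational` and `size`. The function name and the tuple layout are assumptions for the example.

```python
import os


def list_drives(sys_block: str = "/sys/block") -> list:
    """Return (name, kind, size_in_GB) for each sd* device under sys_block."""
    drives = []
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue  # skip loop devices, md arrays, etc.
        base = os.path.join(sys_block, name)
        with open(os.path.join(base, "queue", "rotational")) as f:
            kind = "HDD" if f.read().strip() == "1" else "SSD"
        with open(os.path.join(base, "size")) as f:
            sectors = int(f.read().strip())  # sysfs counts 512-byte units
        drives.append((name, kind, sectors * 512 // 10**9))
    return drives
```

On a Linux machine, `list_drives()` with no arguments reads the live `/sys/block` tree; passing a different directory makes the function easy to exercise against a fake tree.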

And the Areca tools work for the drives in front:

$ utils/cli64 vsf info
  # Name             Raid Name       Level   Capacity Ch/Id/Lun  State
  1 i33d1spool       i33d1spool      Raid6   20000.0GB 00/00/00   Normal
  2 i33d2spool       i33d2spool      Raid6   20000.0GB 00/01/00   Normal
GuiErrMsg: Success.
$ utils/cli64 disk info
  # Enc# Slot#   ModelName                        Capacity  Usage
  1  01  Slot#1  N.A.                                0.0GB  N.A.
  2  01  Slot#2  N.A.                                0.0GB  N.A.
  3  01  Slot#3  N.A.                                0.0GB  N.A.
  4  01  Slot#4  N.A.                                0.0GB  N.A.
  5  01  Slot#5  N.A.                                0.0GB  N.A.
  6  01  Slot#6  N.A.                                0.0GB  N.A.
  7  01  Slot#7  N.A.                                0.0GB  N.A.
  8  01  Slot#8  N.A.                                0.0GB  N.A.
  9  02  Slot 01 WDC WD2003FYYS-02W0B0            2000.4GB  i33d1spool
 10  02  Slot 02 WDC WD2003FYYS-02W0B0            2000.4GB  i33d1spool
 11  02  Slot 03 WDC WD2000FYYZ-01UL1B0           2000.4GB  i33d1spool
 12  02  Slot 04 WDC WD2003FYYS-02W0B0            2000.4GB  i33d1spool
 13  02  Slot 05 WDC WD2003FYYS-02W0B0            2000.4GB  i33d1spool
 14  02  Slot 06 WDC WD2003FYYS-02W0B0            2000.4GB  i33d1spool
 15  02  Slot 07 WDC WD2003FYYS-02W0B1            2000.4GB  i33d1spool
 16  02  Slot 08 WDC WD2003FYYS-02W0B0            2000.4GB  i33d1spool
 17  02  Slot 09 WDC WD2000F9YZ-09N20L0           2000.4GB  i33d1spool
 18  02  Slot 10 WDC WD2003FYYS-02W0B1            2000.4GB  i33d1spool
 19  02  Slot 11 WDC WD2003FYYS-02W0B1            2000.4GB  i33d1spool
 20  02  Slot 12 WDC WD2003FYYS-02W0B1            2000.4GB  i33d1spool
 21  02  Slot 13 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 22  02  Slot 14 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 23  02  Slot 15 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 24  02  Slot 16 WDC WD2000FYYZ-01UL1B0           2000.4GB  i33d2spool
 25  02  Slot 17 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 26  02  Slot 18 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 27  02  Slot 19 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 28  02  Slot 20 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 29  02  Slot 21 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 30  02  Slot 22 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
 31  02  Slot 23 WDC WD2002FYPS-01U1B1            2000.4GB  i33d2spool
 32  02  Slot 24 WDC WD2003FYYS-02W0B0            2000.4GB  i33d2spool
GuiErrMsg: Success.

We always keep a few free slots on every machine, so we have the capacity to absorb the slots from a failed machine. We never want to be in the state where we don't have enough hardware!

Filesystem         Size  Used Avail Use% Mounted on
/dev/mapper/md2    2.7T  977G  1.8T  36% /mnt/i33d1search
/dev/mapper/md3    2.7T  936G  1.8T  35% /mnt/i33d2search
/dev/mapper/sda1   917G  730G  188G  80% /mnt/i33d1t01
/dev/mapper/sda2   917G  805G  113G  88% /mnt/i33d1t02
/dev/mapper/sda3   917G  709G  208G  78% /mnt/i33d1t03
/dev/mapper/sda4   917G  684G  234G  75% /mnt/i33d1t04
/dev/mapper/sda5   917G  825G   92G  91% /mnt/i33d1t05
/dev/mapper/sda6   917G  722G  195G  79% /mnt/i33d1t06
/dev/mapper/sda7   917G  804G  113G  88% /mnt/i33d1t07
/dev/mapper/sda8   917G  788G  129G  86% /mnt/i33d1t08
/dev/mapper/sda9   917G  661G  257G  73% /mnt/i33d1t09
/dev/mapper/sda10  917G  799G  119G  88% /mnt/i33d1t10
/dev/mapper/sda11  917G  691G  227G  76% /mnt/i33d1t11
/dev/mapper/sda12  917G  755G  162G  83% /mnt/i33d1t12
/dev/mapper/sda13  917G  746G  172G  82% /mnt/i33d1t13
/dev/mapper/sda14  917G  802G  115G  88% /mnt/i33d1t14
/dev/mapper/sda15  917G  159G  759G  18% /mnt/i33d1t15
/dev/mapper/sda16  917G   72M  917G   1% /mnt/i33d1t16
/dev/mapper/sda17  917G  706G  211G  78% /mnt/i33d1t17
/dev/mapper/sda18  917G   72M  917G   1% /mnt/i33d1t18
/dev/mapper/sda19  917G   72M  917G   1% /mnt/i33d1t19
/dev/mapper/sda20  917G   72M  917G   1% /mnt/i33d1t20
/dev/mapper/sdb1   917G  740G  178G  81% /mnt/i33d2t01
/dev/mapper/sdb2   917G  772G  146G  85% /mnt/i33d2t02
/dev/mapper/sdb3   917G  797G  120G  87% /mnt/i33d2t03
/dev/mapper/sdb4   917G  762G  155G  84% /mnt/i33d2t04
/dev/mapper/sdb5   917G  730G  187G  80% /mnt/i33d2t05
/dev/mapper/sdb6   917G  803G  114G  88% /mnt/i33d2t06
/dev/mapper/sdb7   917G  806G  112G  88% /mnt/i33d2t07
/dev/mapper/sdb8   917G  786G  131G  86% /mnt/i33d2t08
/dev/mapper/sdb9   917G  663G  254G  73% /mnt/i33d2t09
/dev/mapper/sdb10  917G  776G  142G  85% /mnt/i33d2t10
/dev/mapper/sdb11  917G  743G  174G  82% /mnt/i33d2t11
/dev/mapper/sdb12  917G  750G  168G  82% /mnt/i33d2t12
/dev/mapper/sdb13  917G  743G  174G  82% /mnt/i33d2t13
/dev/mapper/sdb14  917G  196G  722G  22% /mnt/i33d2t14
/dev/mapper/sdb15  917G  477G  441G  52% /mnt/i33d2t15
/dev/mapper/sdb16  917G  539G  378G  59% /mnt/i33d2t16
/dev/mapper/sdb17  917G   72M  917G   1% /mnt/i33d2t17
/dev/mapper/sdb18  917G   72M  917G   1% /mnt/i33d2t18
/dev/mapper/sdb19  917G   72M  917G   1% /mnt/i33d2t19
/dev/mapper/sdb20  917G   72M  917G   1% /mnt/i33d2t20
/dev/md0           367G  301G   67G  82% /mnt/ssd33d1
/dev/md1           367G  300G   67G  82% /mnt/ssd33d2
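
In the listing above, eight slots show 1% usage (four on each RAID set). Counting them can be automated from the same `df` output. The sketch below is illustrative only: treating "1% used" as "empty" is an assumption, as are the mount-name pattern and the function names.

```python
import re

# Slot mounts look like /mnt/i33d1t16 or /mnt/i14t04; search and SSD mounts don't match.
SLOT_MOUNT = re.compile(r"^/mnt/i\d+(?:d\d+)?t\d+$")


def slot_usage(df_output: str) -> dict:
    """Map each slot mount point to its Use% from `df -h` style output."""
    usage = {}
    for line in df_output.splitlines():
        fields = line.split()
        # Data rows have exactly six columns; the header splits into seven.
        if len(fields) == 6 and SLOT_MOUNT.match(fields[5]):
            usage[fields[5]] = int(fields[4].rstrip("%"))
    return usage


def free_slots(usage: dict, threshold: int = 1) -> list:
    """Slots at or below `threshold` percent used are treated as empty."""
    return sorted(m for m, pct in usage.items() if pct <= threshold)


def can_absorb(failed_in_use: int, free_elsewhere: int) -> bool:
    """True if the surviving machines have room for the failed machine's slots."""
    return free_elsewhere >= failed_in_use
```

Summing `len(free_slots(...))` across every other machine and passing it to `can_absorb` gives a quick check that a single machine failure can still be absorbed.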

Some more copy'n'paste:

$ free
             total       used       free     shared    buffers     cached
Mem:     198201540  197649396     552144          0    7084596  120032948
-/+ buffers/cache:   70531852  127669688
Swap:      2040248    1263264     776984

Yes, we use that cache! Of course, the Swap is a little pointless at that size...

$ grep 'model name' /proc/cpuinfo
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
model name  : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz

IMAP serving is actually very low on CPU usage. We don't need a super-powerful CPU to drive this box. The CPU load is always low, it's mostly IO wait - so we just have a pair of 4 core CPUs.

The future?

We're in a pretty sweet spot right now with our hardware. We can scale these IMAP boxes horizontally "forever". They speak to the one central database for a few things, but that could be easily distributed. In front of these boxes are frontends with nginx running an IMAP/POP/SMTP proxy, and compute servers doing spam scanning before delivering via LMTP. Both look up the correct backend from the central database for every connection.
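
nginx's mail proxy finds the backend for each connection by calling an `auth_http` endpoint, which answers with `Auth-Status`, `Auth-Server` and `Auth-Port` headers. Whether the frontends described here are wired up exactly this way is an assumption; the sketch below only shows the shape of such a lookup, with an in-memory dict standing in for the central database and with password checking left out entirely.

```python
# Hypothetical stand-in for the central database: user -> backend IP address.
BACKENDS = {
    "alice@example.com": "10.0.0.33",
}
# Backend port per protocol (illustrative values).
PORTS = {"imap": 143, "pop3": 110, "smtp": 25}


def auth_http_response(request_headers: dict) -> dict:
    """Build the response headers nginx expects from an auth_http endpoint."""
    user = request_headers.get("Auth-User", "")
    protocol = request_headers.get("Auth-Protocol", "")
    backend = BACKENDS.get(user)
    if backend is None or protocol not in PORTS:
        # Any Auth-Status other than "OK" is reported to the client as an error.
        return {"Auth-Status": "Invalid login or password"}
    return {
        "Auth-Status": "OK",
        "Auth-Server": backend,          # must be an IP address, not a hostname
        "Auth-Port": str(PORTS[protocol]),
    }
```

Because the lookup happens per connection, moving a user to a different backend only requires updating the mapping; the next connection is routed to the new host.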

For now, these 4U boxes come in at about US$20,000 fully stocked, and our entire software stack is optimised to get the best out of them.

We may containerise the Cyrus instances to allow fairer IO and memory sharing between them if there is contention on the box. For now, it hasn't been necessary because the machines are quite beefy, and anything which adds overhead between the software and the metal is a bad thing. As container software gets more efficient and easier to manage, it might become worthwhile rather than running multiple instances on the single operating system as we do now.