Dec 21: File Storage

Product

This blog post is part of the FastMail 2014 Advent Calendar.

The previous post on 20th December saw us open source our core JavaScript library. The following post on 22nd December is the long-awaited beta release of our contact syncing solution.

Technical level: high

Why does an email service need a file storage system?

For many years, well before cloud file storage became an everyday thing, FastMail has had a file storage feature. Like many features, it started as a logical extension of what we were already doing: people were accessing their email anywhere via a web browser, so it would be really nice to be able to access important files everywhere as well. And there are obvious email integration points too, such as being able to save attachments from emails to somewhere more structured, or having files that you commonly want to attach to emails (e.g. your resume) stored at FastMail, rather than having to upload them again each time.

Lots of FastMail's early features exist because they're the kind of thing that we wanted for ourselves! It also turns out that having generic file storage is useful for other features, as we discovered later.

A generic file service

The first implementation of our file storage system used a real filesystem on a large RAID array. To make the data available to our web servers, we used NFS. While in theory a nice and straightforward solution, unfortunately it all ended up being fairly horrible. We used the NFS server built into the Linux kernel at that time, and although it was supposed to be stable, that was not our experience at all. While all our other servers had many months of uptime, the file storage server running NFS would freeze or crash somewhere between every few days and every week. This was particularly surprising to us because we weren't actually stressing it much: the workload wasn't high compared to the IO that some file servers perform.

Having the NFS server go offline and people losing access to their files until it was rebooted was bad enough, but there was a much worse problem. Any process that tried to access a file on the NFS mount would freeze up until the server came back. Since the number of processes handling web requests was limited, all it took was a few hundred requests by users trying to access their file storage, and suddenly there were no processes left to handle other web requests, and ALL web requests would start failing, meaning no one was able to access the FastMail site at all. Not nice. We tried a combination of soft mounts and other flags, but couldn't find a combination that was both consistently reliable and failure safe.

Apparently I have suppressed the memories (something to do with being woken by both young children AND broken servers), but Rob M remembers, and he says: "In one of those great moments of frustration at being woken up again by a crashed NFS server, Bron wanted to do a complete rewrite, and to use an entirely different way of storing the files. Instead of storing the file system structure in a real filesystem, we decided to use a database. However we didn't want to store the actual file data in the database, as that would result in a massive monolithic database with all user file data in it, which would not be easy to manage. So the approach he came up with is a rather neat hybrid that has worked really well." So there you go.

One of my first major projects at FastMail was this new file storage service. I was fresh from building data management tools for late-phase clinical trials (drugs, man) for Quintiles in New Jersey (that's right, I moved from living in New Jersey and working on servers in Melbourne to living in Melbourne and working on servers in New York). I over-engineered our file storage system using many of the same ideas I had used for clinical data.

Interestingly, a lot of the work I'd been doing at Quintiles looked very similar in design to git, though it was years before git came out: data addressed by digest (SHA-1), with signatures on digests of lists of digests providing integrity and authenticity checks over large blocks of data. That product doesn't seem to exist any more though.
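The digest-of-digests idea can be sketched roughly as follows. This is a hypothetical illustration, not the Quintiles or FastMail code; the manifest format and function names are invented, and a real system would sign the top-level digest rather than just store it.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

# Stand-in for a real key-value pool: digest -> bytes.
store = {}

def put_chunk(data: bytes) -> str:
    """Store data under the SHA-1 of its contents."""
    digest = sha1_hex(data)
    store[digest] = data
    return digest

def put_manifest(chunks) -> str:
    """Store each chunk, then store the list of chunk digests as a
    manifest. The manifest is content-addressed too, so one small
    digest covers an arbitrarily large block of data."""
    digests = [put_chunk(c) for c in chunks]
    manifest = "\n".join(digests).encode()
    return put_chunk(manifest)

# Verifying the top-level digest transitively verifies every chunk.
top = put_manifest([b"part one", b"part two"])
```

Because every object is addressed by its own digest, checking the one top-level digest is enough to detect corruption anywhere in the tree, which is the same property git later built on.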

The original build of the file service was based on many of the same concepts: file version forking and merging (which was too complex and got scrapped) and a very clever cache hierarchy and invalidation scheme. Blobs (the file contents themselves) are stored in a key-value pool spread over multiple machines, with a push to many copies before the store is considered successful, and a background cleaning task that ensures they are spread everywhere and garbage collected when no longer referenced.

The blob storage system is very simple - we could definitely build, or grab off the shelf, something a lot faster and better these days, but it's very robust, and that matters to us more than speed.
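As a rough illustration of the blob pool behaviour described above: writes only succeed once enough copies exist, and reads can come from any store. The `BlobPool` class, the replica count, and the dict-backed stores are all invented for this sketch; the real system's wire protocol and background cleaning task are not shown.

```python
import hashlib

REPLICAS = 3  # assumed policy: a put must land on this many stores

class BlobPool:
    def __init__(self, n_stores: int):
        # Each dict stands in for a storage machine.
        self.stores = [dict() for _ in range(n_stores)]

    def put(self, data: bytes) -> str:
        """Store a blob under its content digest, succeeding only once
        REPLICAS copies have been written. A background task (not shown)
        would spread the blob to the remaining stores."""
        key = hashlib.sha1(data).hexdigest()
        written = 0
        for store in self.stores:
            store[key] = data
            written += 1
            if written == REPLICAS:
                break
        if written < REPLICAS:
            raise IOError("not enough stores available")
        return key

    def get(self, key: str) -> bytes:
        """Any store holding the blob can serve a read."""
        for store in self.stores:
            if key in store:
                return store[key]
        raise KeyError(key)

pool = BlobPool(n_stores=5)
key = pool.put(b"resume.pdf contents")
```

Keying blobs by digest also makes garbage collection simple in principle: a blob can be deleted once no node in the database references its digest.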

Interestingly enough, while the caching system was great when there was a lower volume of changes and slow database servers, it eventually became faster to remove a layer of caching entirely as our needs and technology changed.

Database backed nodes

The same basic architecture still exists today. The file storage is a giant database table in our central mysql database. Every entry is a "Node", with a primary key called NodeId, and a "ParentNodeId". Node number 1 is treated specially, and is of class 'RootNode'. It's the top of the tree.
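The node table described above might look something like this sketch, using SQLite in place of our central MySQL database. Only `NodeId`, `ParentNodeId`, and the special RootNode (node 1) come from the post; the other columns and the sample rows are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE Node (
        NodeId       INTEGER PRIMARY KEY,
        ParentNodeId INTEGER,          -- NULL only for the root
        Class        TEXT,             -- e.g. 'RootNode' (other classes assumed)
        Name         TEXT
    )
""")
# Node 1 is treated specially: it is the RootNode at the top of the tree.
db.execute("INSERT INTO Node VALUES (1, NULL, 'RootNode', '')")
db.execute("INSERT INTO Node VALUES (2, 1, 'UserNode', 'robm.fastmail.fm')")
db.execute("INSERT INTO Node VALUES (3, 2, 'DirNode', 'files')")

# Listing a directory is then just a query on ParentNodeId.
rows = db.execute(
    "SELECT NodeId, Name FROM Node WHERE ParentNodeId = ?", (1,)).fetchall()
```

Storing only the tree structure in the database, with blob contents elsewhere, is what keeps the central database small even though it indexes every user's files.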

Because there are hundreds of thousands of top level nodes (ParentNodeId == 1), it would be crazy to read the key 'ND:1' (node directories, parent 1) for normal operations. Instead, we fetch "UA:$UserId", which is the ACL for the user's userid, and then walk the tree back up from each ACL which grants the user any access, building a tree that way.
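That upward walk might look roughly like this. The node IDs match the debug trace shown next; the function name, the dict-of-dicts layout (standing in for the "N:$NodeId" cache fetches), and the deeper node granted by the ACL are invented for the sketch.

```python
# parent=None marks the RootNode (node 1); every other node points up.
nodes = {
    1:        {"parent": None,   "name": ""},
    2:        {"parent": 1,      "name": "robm.fastmail.fm"},
    504452:   {"parent": 1,      "name": "brong.fastmail.fm"},
    1394098:  {"parent": 1,      "name": "admin.fastmail.fm"},
    20872929: {"parent": 504452, "name": "websites"},  # assumed placement
}

def visible_top_level(acl_node_ids):
    """Walk each ACL-granted node back up toward the root; the last
    node before the root is a top-level entry visible to this user.
    This way we never have to read all children of node 1."""
    top = set()
    for node_id in acl_node_ids:
        while nodes[node_id]["parent"] is not None:
            parent = nodes[node_id]["parent"]
            if parent == 1:  # reached the RootNode
                top.add(node_id)
                break
            node_id = parent
    return sorted(nodes[n]["name"] for n in top)

# ACLs granting access to three separate trees yield three top-level entries.
listing = visible_top_level([2, 20872929, 1394098])
```

Walking up from the handful of ACL entries touches only the user's slice of the tree, instead of the hundreds of thousands of siblings under node 1.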

For example:

$ DEBUG_VFS=1 vfs -u brong@fastmail.fm ls /
INIT
Fetching ND:504452
/:
--
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching N:504452
d---   504452 2005-09-14 04:31:35 brong.fastmail.fm/
d---  1394098 2006-01-20 00:44:04 admin.fastmail.fm/
d---        2 2005-09-13 07:46:04 robm.fastmail.fm/

Whereas if we're inside an ACL path we walk the tree normally from that ACL (we still need to check the other ACLs to see if they also impact the data we're looking at...):

$ DEBUG_VFS=1 vfs -u brong@fastmail.fm ls '~/websites'
INIT
Fetching ND:504452
Fetching N:504452
Fetching UA:485617
Fetching N:20872929
Fetching N:3
Fetching N:1394099
Fetching N:1394098
Fetching N:2
Fetching UT:485617
/brong.fastmail.fm/files/websites: