This is the third post in the 2016 FastMail Advent Calendar. Stay tuned for another post tomorrow.
This is quite a technical post about one of the key database formats in our technology stack at FastMail. I will be writing about the database formats that are in Cyrus IMAPd now, and our ideas for the future.
When I started using Cyrus in 2003, there were three main database formats available:
- flat files (very inefficient and simplistic, and not binary safe)
- berkeley db (quite powerful, but prone to upgrade problems)
- skiplist (entirely embedded within cyrus, but had corruption issues)
The Wikipedia page for skip lists explains the history of the data structure itself. Cyrus' implementation of skip lists is an embedded single file key-value database engine, with pointers being represented as offsets within the file.
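To make the structure concrete, here's a minimal in-memory skip list sketch. It's illustrative only - the real Cyrus implementation is in C and stores `forward` pointers as file offsets rather than object references, but the lookup and insert logic is the same shape:

```python
import random

class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        # forward[i] is the next node at level i; on disk this would be
        # a 64-bit file offset rather than an object reference
        self.forward = [None] * level

class SkipList:
    MAX_LEVEL = 8

    def __init__(self):
        self.head = Node(None, None, self.MAX_LEVEL)

    def _random_level(self):
        # each level is half as populated as the one below it
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def get(self, key):
        node = self.head
        # walk forward at each level, dropping down when we'd overshoot
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        if node and node.key == key:
            return node.value
        return None

    def set(self, key, value):
        # remember the rightmost node seen at each level so we can
        # stitch the new node into every chain that includes it
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        nxt = node.forward[0]
        if nxt and nxt.key == key:
            nxt.value = value
            return
        level = self._random_level()
        new = Node(key, value, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new
```

The "stitch into every chain" step in `set` is exactly the part that becomes tricky on disk: each pointer update dirties an earlier location in the file.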
I got to know the skiplist format while fixing the many reliability bugs back in 2007. The full commit history in the git repository shows my initial patches being added by Ken Murchison when the project was still using CVS, and then further updates over time.
By 2010, skiplist was pretty solid, but we had a problem with recovery times once we started using it for really big files. If a server crashed for whatever reason, the recovery time for a hundreds-of-megabyte conversations database was on the order of half an hour. In that time the user was locked out of their email. Being locked out for half an hour is unacceptable!
Based on my experience with skiplist, I experimented with many different formats (called things like skiplist2), but settled on the name twoskip due to the twin-linked-lists at the core of the design. You can read the entire implementation plus a ton of comments containing design information - it's all open source and available in git, both twoskip itself (about 2500 lines) and the mappedfile library (about 500 lines) which I built to abstract the storage layer.
The key tradeoff was adding an extra fsync to be sure that the file was marked dirty before appending anything. Since writes can be arbitrarily reordered, it was theoretically possible for pointers to be updated without the new records being appended to the file before a crash. By using three fsyncs per transaction, there is no risk of opening a file and having invalid pointers (absent buggy software or disk errors.)
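The commit ordering can be sketched like this. This is a toy model, not the real twoskip code: the header layout (`DIRTY_OFFSET`) and record encoding are made up for illustration, but the three-fsync sequencing matches the description above:

```python
import os
import struct

DIRTY_OFFSET = 0  # hypothetical: location of a dirty flag in the file header

def commit(fd, records, pointer_updates):
    """Toy sketch of twoskip's three fsyncs per transaction."""
    # 1. mark the file dirty, and make sure that hits disk before anything else
    os.pwrite(fd, struct.pack("<Q", 1), DIRTY_OFFSET)
    os.fsync(fd)

    # 2. append the new records; fsync so they exist on disk before any
    #    pointer can refer to them
    end = os.lseek(fd, 0, os.SEEK_END)
    for rec in records:
        os.write(fd, rec)
    os.fsync(fd)

    # 3. stitch the pointers in, mark the file clean again, and fsync to
    #    publish the commit
    for offset, value in pointer_updates:
        os.pwrite(fd, struct.pack("<Q", value), offset)
    os.pwrite(fd, struct.pack("<Q", 0), DIRTY_OFFSET)
    os.fsync(fd)
    return end
```

If a crash lands between steps 1 and 3, the file is found dirty on the next open and repaired; it can never be found clean with pointers referring past the appended data.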
So twoskip places very few requirements on the host operating system. To have a consistent file there are only three:
- fcntl locking must work, and must ensure that multiple processes don't believe they can write to the file at the same time.
- fsync must at least provide a barrier: no writes issued after the fsync can apply before writes that were done before it.
- overwriting a single 64-bit location must write either the old value or the new value, never something random with only half the bytes.
Obviously if you have a power failure and a disk goes crazy and writes rubbish over a few sectors in its last powerless cry for help, then there's nothing any file format can do. But stick to those basics, and twoskip is sound.
To be sure that your commit actually succeeded by the time that the commit function call returned, you also need fsync to guarantee that the changes actually made it to disk when it returns, but that's all.
A twoskip file is in one of two states - clean or dirty. In the clean state, all pointers are guaranteed to be correct and lookups can be done in O(log N). If the file is opened in the dirty state, a clean read is still always possible, but it might be necessary to fall back to a slower linear scan on hitting a dirty pointer. The database must be repaired before any further writes, but repair is very fast.
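The read path can be sketched as follows, with a plain dict and list standing in for the real on-disk index and record log. The point is the fallback: records themselves are append-only and survive a torn commit, so a linear scan is always safe even when the skip pointers can't be trusted:

```python
def lookup(db, key):
    """Toy model of a clean/dirty-aware read.

    db is a stand-in structure:
      {"clean": bool,
       "index": {key: value},          # stands in for the O(log N) pointer walk
       "log":   [(key, value), ...]}   # the append-only record log
    """
    if db["clean"]:
        # clean: pointers are trustworthy, do the fast indexed lookup
        return db["index"].get(key)
    # dirty: pointers may be stale, but scanning the append-only log
    # still yields a correct answer - the last record for a key wins
    value = None
    for rec_key, rec_value in db["log"]:
        if rec_key == key:
            value = rec_value
    return value
```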
And fast repair is key for our usage. It doesn't matter too much if a single command takes 120ms instead of 110ms to return because the database was a little slower - that's lost in network latency noise. But if an account is locked for 10 minutes, that is very visible. So low variability is more important than raw speed.
I was particularly reminded of the value of these tradeoffs this week when an issue with the blobId storage that we added for JMAP caused someone's replication to go crazy - a twoskip database had grown to over 50GB on the replica, with "only" 800MB of legitimate data. We had to kill the process that was creating the file to stop the disk filling up.
When the database was next loaded it ran recovery on the file. Just 10 seconds and it was ready to use again. That was a full recovery and consistency check of the file.
The biggest design challenge with twoskip is that the file is append-only except for pointer updates. Over time with random inserts, the data gets quite fragmented with jumps all over the place, because the logical ordered linked list is scattered all over the on-disk offsets. Secondly, deleted records are never removed.
To deal with this I followed the same logic that the skiplist format uses. On commit we check some metrics. If the file is due for a "checkpoint" then a new database is opened at $filename.NEW and the engine iterates over all the existing records and writes them in order to the new file, then renames the compacted DB over the existing one. An example from our logs:
2016-12-02T02:17:31.259652-05:00 imap21 sloti21d2t40/ctl_conversationsdb: twoskip: checkpointed /mnt/ssd21d2/sloti21d2t40/store254/conf/domain/b/brong.net/user/b/brong.conversations (414414 records, 147739952 => 52722344 bytes) in 2.120 seconds
All well and good - in this case I was using a command line tool, but for those 2.12 seconds, my account was still locked against anything else happening. If an imap connection had triggered the repack, then my client would have been waiting for a response until the repack finished. A random process pays the garbage collection tax when the threshold is reached.
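The checkpoint step itself can be sketched like this. The record encoding and the `iterate_records` callback are hypothetical stand-ins; the important details are real though: write everything in order to `$filename.NEW`, fsync once, then atomically rename over the original so readers only ever see the old file or the new one:

```python
import os

def checkpoint(path, iterate_records):
    """Toy sketch of a twoskip-style checkpoint.

    iterate_records(path) is a hypothetical generator yielding
    (key, value) byte-string pairs of the live records.
    """
    newpath = path + ".NEW"
    live = sorted(iterate_records(path))   # write records in key order
    with open(newpath, "wb") as f:
        for key, value in live:
            # illustrative length-prefixed encoding, not the real format
            f.write(len(key).to_bytes(4, "little") + key)
            f.write(len(value).to_bytes(4, "little") + value)
        f.flush()
        os.fsync(f.fileno())
    # atomic on POSIX filesystems: readers see old or new, never half
    os.rename(newpath, path)
```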
Even without changing the twoskip format very much, there are some things that could be done to reduce the impact of repacks and allow more concurrency:
Read-only isolated transactions
This one needs some slight format changes. Instead of replacing records by writing a brand new record and linking the new record into the chain, just add one more pointer to the chain which says "this record is replaced with a new value" and points to a value node, which then points back to the previous value node, and so on. Likewise a delete would be a node in this chain.
Readers can extend their mmap of the existing file descriptor without ever releasing and reopening if the file changes, and by skipping over any node that is located past the filesize at start of transaction, they can get a consistent read. With a bit of clever lock interleaving, this could allow other processes to write while a long-running reader kept reading from a consistent view - MVCC.
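The snapshot trick can be modelled with a toy structure: the reader captures the file size when its transaction starts, and simply ignores anything located past that horizon. Offsets and the log layout here are stand-ins for the real mmap'd file:

```python
class ReadTransaction:
    """Toy model of a consistent read against an append-only file.

    db is a stand-in: {"size": int,
                       "log": [(offset, key, value), ...]}
    """
    def __init__(self, db):
        self.db = db
        self.horizon = db["size"]   # file size at transaction start

    def get(self, key):
        value = None
        for offset, rec_key, rec_value in self.db["log"]:
            if offset >= self.horizon:
                continue            # appended after our snapshot: invisible
            if rec_key == key:
                value = rec_value   # later value nodes replace earlier ones
        return value
```

Because writers only ever append, the region below the reader's horizon never changes, which is what makes the view consistent without holding a lock.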
Almost everything is already in place to do this; it just needs a couple more record types and some changes to the code which locates the next record. It will be an in-place upgrade.
Nested transactions

This one can be done totally independently of other changes: to roll back a sub-transaction, just run the recovery code with the end offset at the most recent savepoint rather than at the start of the root transaction.
The JMAP code in Cyrus would benefit greatly from being able to start processing a single command and then abort if it hit a conflict without having to fail the entire batch.
We're still a little way off nested transactions being valuable for everything, because we have too many internal database formats which are updated in lockstep - they would all have to support nested transactions to safely roll back partial updates.
But adding it to twoskip is just a matter of changing the transaction struct format slightly and adding nested transaction interface support to the cyrusdb layer.
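A savepoint in an append-only file really is just a remembered end offset. This toy model uses a list as the "file" to show the mechanics:

```python
class Transaction:
    """Toy model: the 'file' is an append-only list, a savepoint is the
    length at the moment it was taken, and rollback truncates to it."""

    def __init__(self):
        self.log = []
        self.savepoints = []

    def append(self, record):
        self.log.append(record)

    def savepoint(self):
        self.savepoints.append(len(self.log))

    def rollback_to_savepoint(self):
        # equivalent to running recovery with this as the end offset
        end = self.savepoints.pop()
        del self.log[end:]

    def release_savepoint(self):
        # keep the writes, discard the savepoint
        self.savepoints.pop()
```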
Checkpoint with concurrency

This depends on having isolated reader transactions. You could open one, copy all the records to a new file, and then replay the changes made to the log since that point. Obviously $database.NEW would be exclusively locked for the duration, so two processes could never try to checkpoint at the same time. Once the new database has all the records from the read-only transaction, it's time to take an exclusive lock on the active database and replay the changes made since then.
With this, it would still take 2+ seconds to checkpoint my 400k record database, but at least other processes could keep reading and writing it while this happened, and the only time they would be locked out was for a small catchup transaction at the end.
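The two-phase shape of that scheme can be sketched with a toy log. The `during` callback stands in for the window in which other writers are still free to append; only the tail replay in phase two would need the exclusive lock:

```python
def concurrent_checkpoint(db, during=lambda: None):
    """Toy two-phase checkpoint against db = {"log": [(key, value), ...]}."""
    # phase 1: snapshot and compact; writers may keep appending meanwhile
    snapshot_end = len(db["log"])
    latest = {}
    for key, value in db["log"][:snapshot_end]:
        latest[key] = value                 # last write wins; drops dead records
    new_log = sorted(latest.items())        # stands in for writing $database.NEW

    during()  # concurrent writers run here, unblocked

    # phase 2: exclusive lock, replay only the tail appended since the snapshot
    for key, value in db["log"][snapshot_end:]:
        new_log.append((key, value))
    db["log"] = new_log                     # stands in for the atomic rename
```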
A further optimisation then would be to log "needs checkpoint" to some central repacking daemon, rather than each process doing the checkpoint itself - so the user is never waiting on database repacks.
A new format?
So there are possible optimisations to twoskip for production, but there's another interesting force at play in Cyrus-land. Object storage. A downside of twoskip is that writes are quite random. Every time you append a new record (create, update or delete) you need to update an average of two offsets from earlier in the file to stitch the new record into the skip lists. These locations are effectively random, so a couple of arbitrary blocks are dirtied somewhere in the large file for every record written.
This is very expensive in filesystems which always copy blocks rather than updating them in place, and it can lead to fragmentation. Not to mention the difficulty of rsyncing or storing back to an object store.
When we had performance issues with random updates in Xapian for our search engine, we built a multi-tiered database solution. There are some differences - in Xapian's case the IDs are checked afterwards to see if they still exist, so the ordering of databases doesn't matter. It's also safe to write to tmpfs for the active database, because we can always re-index the missing messages after a restart, but the concept is transferable.
One thing we could do for a low level database format is to split it into an appendable write file, plus multiple read-only files. Of course the read-only files must keep tombstones for deleted keys unless they are the bottom file, because otherwise we could have deleted data re-surface from old copies. And the databases must be kept in order, but:
- you can avoid updating any pointers by having each process read the entire log file and build an in-memory skiplist structure with just the offsets for the keys and values in the log.
- all read-only files can have a greatly simplified on-disk structure because there's no need to handle crashes - you just generate the file and fsync it once before making it active.
- it's possible to repack multiple of these read-only databases together and atomically switch them into place by changing the active list.
So this is on the roadmap. It will be a directory of files rather than a single file on disk, and more complex, but it will be very cheap to rsync or store into an object store as a set of files.
A downside is that a key-based search needs to be done on every file, and an iterator needs to keep open cursors into each of the files to step through them, so reads might get slightly slower. Again though, low variability is more valuable to me than maximum speed, and this design reduces variability further.
Another possibility is to use a different key-value engine. We've played with LMDB but haven't managed to make it stable with thousands of databases spread over different filesystems yet. We're open to new engines here too, so long as the recovery story is sound.
So we have options for improving the ordered key-value storage layer of Cyrus. That's everything below the cyrusdb interface...
On the top side of the cyrusdb interface there's another interesting issue. We've already added poor-man's indexing to mailboxes.db with the Reverse ACL support, but there are sqlite databases embedded in the calendar and contacts subsystems because we need multiple indexes, and that adds yet another database type. Cyrus has tons of storage types:
- cyrusdb (twoskip, skiplist, quotalegacy, sql, etc): the key-value store. The most important file of this format is mailboxes.db, but there are many others. Each cyrusdb database has its own microformats for both keys and values.
- cyrus.index: one per mailbox, fixed-width header and records
- cyrus.header: one per mailbox, variable-size single record; stores user flags, uniqueid and quotaroot
- cyrus.cache: one per mailbox, multiple variable-size records with internal structure; records located by offsets stored in cyrus.index records
- search databases: one per mailbox (squatter) or multiple per user (xapian)
- rfc822 files: multiple per mailbox, one per message, with hard links for exactly duplicated content
- sieve: both raw script files and compiled bytecode
- sqlite databases: one per user for dav, and a global one for calendar alarms
One of my goals is to unify everything except for the rfc822 files themselves and the cache files. Cache will be split into multiple files per user, but based on size rather than mailbox, so it never gets insanely big. We have problems with 2-million plus message mailboxes now, where the cache size breaks the 32 bit offset limit for storage in cyrus.index. A nice splitting and aging policy for caches could cut back on repack IO considerably.
But hand-coding microformats sucks, and a generic way to specify indexes and have the database layer take care of them is very useful. The choice here is to either move to sqlite or similar and use the structure it offers, or write our own.
If we write our own, it will look something like this:
- data stored as CBOR - I'd been looking at msgpack, but being an RFC is a winner. RFCs don't change, so the spec can be relied upon.
- schemas for different datatypes stored in the database, so data can be stored by reference to the schema, and the schema can detail which indexes are to be generated.
- indexes by schema/key/key2 with a pointer to table row.
- tables by schema/primarykey to complete value.
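The schema-driven indexing idea can be sketched like this. The schema shape, field names, and key tuples here are all hypothetical - the real store would use CBOR-encoded values and byte-string keys - but it shows how the database layer, not the caller, derives every index entry:

```python
# hypothetical schema: which field is the primary key, which are indexed
CONTACT_SCHEMA = {
    "name": "contact",
    "primary": "uid",
    "indexes": ["email", "surname"],
}

def store(db, schema, record):
    """Toy schema-driven write: db is a plain dict standing in for the
    key-value store; on disk the record would be CBOR-encoded."""
    pk = record[schema["primary"]]
    # tables by schema/primarykey -> complete value
    db[(schema["name"], pk)] = record
    # indexes by schema/field/fieldvalue -> pointer back to the table row
    for field in schema["indexes"]:
        db[("idx", schema["name"], field, record[field])] = pk
```

The caller never hand-codes a key microformat; adding an index is a schema change, and the layer keeps tables and indexes in step inside one transaction.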
And all of this behind an interface that allows everything to be updated transactionally. With nested transactions, this all becomes much easier to reason about: lock inversions can be avoided, and partial updates aborted cleanly.
So there would be a single database per user with all the non-cache data for that user, and a single database per server with the non-cache data for that server (combining the current mailboxes.db, server level annotations, alarms, etc).
There might even be another layer above this with full cluster listings of mailboxes, which would probably be stored in a distributed database of some sort to allow clustering without the single-point-of-failure that is the mupdate master.
Help us out!
The goal of all this is operational simplicity. My key intention is to make the whole system self healing. Right now it mostly is, but there are still outages and data loss risks if a server crashes, and some cases that require manual reconstructs.
The dream service is 100% self-healing and self-balancing across multiple servers.
All these database formats are freely available as part of the open source Cyrus IMAPd server. Join us on Freenode #cyrus or on the mailing lists and help us make this shared resource even better.