Dec 21: Building a backup system for Cyrus
This is the twenty-first post in the FastMail 2015 Advent Calendar. Stay tuned for another post tomorrow.
Cyrus doesn't presently have a built-in backup system, so administrators of Cyrus systems are left to devise their own. While this is fairly trivial in a naive way (messages are stored as plain rfc822 files on disk, which any block- or file-based backup software can deal with), Cyrus stores various kinds of metadata in database files, with at-times complicated locking and lock ordering requirements, which can't be reliably backed up or restored by naive block or file copy.
FastMail's current in-house backup system hooks into Cyrus internals from Perl, in ways that aren't always strictly orthodox. It has worked well enough for us in practice, but is showing its age. Rather than spending significant effort refactoring it, we're spending the effort instead building a more robust system directly into Cyrus. This means metadata can be reliably backed up alongside messages, because the backup system understands the data and how to safely read/clone/restore it.
To build a backup system, we need a way to store backup data, a way to communicate mailbox data to (and eventually, from) the backup storage, and tools for managing it all.
Our in-house backup system stores each user's backup data as an append-only log of concatenated tar/gzip data, plus an SQLite index of the log's contents for fast discovery and recovery.
The Cyrus backup system we're developing is taking a similar approach. Each user's data is stored as an append-only log of concatenated gzip chunks (one chunk per backup invocation), plus an SQLite database indexing the current state of their mailboxes and the offsets of their message contents within the log. The index can be completely rebuilt by re-parsing the log, and the log's integrity can be verified using the index.
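To make the log-plus-index idea concrete, here is a minimal Python sketch. It is not the Cyrus implementation or its on-disk format: the `message` table, its columns, and the function names are all made up for illustration. It only shows that gzip chunks can be appended back to back, and that an index of offsets lets you decompress a single chunk without reading the rest of the log.

```python
import gzip
import io
import sqlite3

# Hypothetical index schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS message (
    guid         TEXT PRIMARY KEY,
    chunk_offset INTEGER NOT NULL,  -- where the gzip chunk starts in the log
    pos          INTEGER NOT NULL,  -- where the message starts inside the chunk
    length       INTEGER NOT NULL
)
"""


def open_index(path=":memory:"):
    """Open (or create) the SQLite index."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def append_chunk(log, index, messages):
    """Append one gzip chunk containing `messages` (guid -> bytes) to the log.

    `log` is a binary file object opened for reading and writing.
    Returns the offset at which the new chunk starts.
    """
    log.seek(0, io.SEEK_END)
    chunk_offset = log.tell()
    log.write(gzip.compress(b"".join(messages.values())))

    pos = 0
    for guid, body in messages.items():
        index.execute(
            "INSERT INTO message (guid, chunk_offset, pos, length) "
            "VALUES (?, ?, ?, ?)",
            (guid, chunk_offset, pos, len(body)),
        )
        pos += len(body)
    index.commit()
    return chunk_offset


def read_message(log, index, guid):
    """Fetch one message by seeking straight to the chunk that holds it."""
    row = index.execute(
        "SELECT chunk_offset, pos, length FROM message WHERE guid = ?",
        (guid,),
    ).fetchone()
    if row is None:
        raise KeyError(guid)
    chunk_offset, pos, length = row

    log.seek(chunk_offset)
    with gzip.GzipFile(fileobj=log, mode="rb") as chunk:
        data = chunk.read(pos + length)
    return data[pos:pos + length]
```

Because nothing already written is ever modified, a failed or interrupted write can only affect the tail of the log, and the index can always be thrown away and regenerated, provided the log itself records enough to rebuild it (this sketch doesn't go that far).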
The gzip format is good here: it's well-understood, it's easy to inspect with standard tools that are available basically everywhere, it provides reasonable compression, and it's very fast.
There is also a twoskip Cyrus database that maps users to their location on backup storage.
Communicating user data to backup storage
For getting data from real mailboxes into the backup system, we're using the Cyrus replication protocol. We've previously touched on this protocol in ImapClone - Invisible Migration. The backup system presents itself like any other replica target, but it replicates state into backup storage instead of mailboxes. Because the main storage is an append-only log, deletes are just flag replications: the deleted data stays in the log.
Using the replication protocol means not needing to devise a new mechanism for what is basically the same thing, and also has some interesting emergent characteristics:
- To send data from your live IMAP servers to your backup servers, you just use sync_client as you would for any other replica
- You can back up to multiple destinations at once by configuring a replication channel and matching sync_client for each destination
- Your backups can be as hot as you like: set up your backup channels to run sync_client in repeat/rolling mode and set their sync_repeat_interval to your desired frequency (in seconds!)
- You can schedule backups at specific times using the EVENTS section of your cyrus.conf
- You can use sync log chaining to shift backup workload from your live IMAP servers to their replicas
- You can provoke a backup at any time by running sync_client manually
There are some tradeoffs to make in the setup. More frequent replications mean smaller replication work units, which in turn mean larger relative protocol overhead and smaller initial log chunks with less effective compression, but the workload is more evenly distributed over time. Less frequent replications reduce the protocol overhead and get better initial compression on disk, but at the expense of heavier workload at the times they occur. The sweet spot(s), configuration-wise, will shake out with time and experience.
This architecture will be complemented by a tool for compacting backup data. Its two main functions will be:
- Removal of deleted data that has outlived its retention period
- Coalescing small adjacent log chunks for better gzip compression
This is similar to the repacking in the current backup system, which uses an SQLite database to know which files are still needed, and streams them into a new tar file before deleting the old one.
Both functions are achieved by using the index to determine what is still needed, and creating a new log+index with only this information, choosing appropriate log chunk boundaries along the way. At no time is a backup log edited in place: it's important, especially at this experimental stage, to preserve the original log/index pair, just in case.
This tool will be particularly important for getting good compression of very hot backups. A busy mailbox may produce tens or hundreds of log chunks per day if it's replicating changes as soon as they happen. This sort of granularity is great — you get to flush the gzip stream very frequently, so you know the data is on disk — but each time you flush, the dictionary resets, so your compression is not good. Being able to coalesce this down after the fact means good compression of long term storage, while still allowing short, frequent initial writes.
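The effect is easy to see with a few lines of Python. This is only an illustration of the compression tradeoff, not the compaction tool itself (which also has to drop expired data and write a new index); the function names are made up. Note that it builds a new log from the old one rather than rewriting anything in place.

```python
import gzip


def write_hot_log(records):
    """Simulate a very hot backup: every record gets its own gzip chunk."""
    return b"".join(gzip.compress(record) for record in records)


def coalesce(log_bytes):
    """Build a new log holding the same data as a single gzip chunk.

    gzip.decompress reads through concatenated gzip members, so the
    whole hot log can be decompressed in one call.
    """
    return gzip.compress(gzip.decompress(log_bytes))


if __name__ == "__main__":
    records = [
        b"From: someone@example.com\r\nSubject: report %d\r\n\r\nbody\r\n" % i
        for i in range(200)
    ]
    hot = write_hot_log(records)
    compacted = coalesce(hot)
    print(len(hot), "bytes as 200 chunks")
    print(len(compacted), "bytes as one chunk")
```

Each tiny chunk pays for its own gzip header and trailer and starts with an empty dictionary, so similar messages that would compress very well together barely compress at all when written separately.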
There are already tools in progress for rebuilding indices from log files, verifying log files from indices, locking a user's backup while you work on it with other tools (useful if you want to use sqlite to poke around in an index, for example, without getting trodden on by an incoming replication), and so on. The implementation of these tools — reading, processing, and verifying the stored backup data — will underpin the other major component of this system: recovery.
There are two main aspects to recovery: finding the data to be recovered, and recovering it. The former will be determined largely by need — I expect this part to take shape as we start trying to use it. We anticipate the latter to be handled by extensions to the replication protocol, allowing us to replicate into a mailbox while letting the destination mailbox determine things like uid and modseq, though it might also potentially be handled by LMTP injection.
One of the biggest difficulties has been the undocumentedness of the Cyrus replication protocol. I've spent a lot of time up to my ears in the replication code, wrapping my head around how (and sometimes why) it works from the nuts and bolts up. It would be nice to have a coherent protocol document shake out as a result of this implementation, though writing code is a lot more interesting than writing documentation….
Another big difficulty has been some of the assumptions built into the replication protocol. For example, it assumes the receiving end can store a message to disk and then link it into the relevant mailboxes, without necessarily knowing beforehand which mailboxes those are. The backup system writes messages into the backup log belonging to the user whose mailbox they are stored in, so it needs to know this in advance. The replication protocol provides a hint about this when it requests the current state, but its state requests can be batched across multiple users, so this hint can be unhelpfully ambiguous: we know which backups are about to have messages stored in them, but not which of those messages go in which backups. However, in the event of failure, the replication protocol promotes fine-granularity replications upwards until they're eventually user-level replications, which it does one at a time, with a protocol restart between each. So we can conveniently work around this one by just rejecting state requests that cross multiple backups. The replication event will be promoted and retried in a way that is unambiguous, and on we go.
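The workaround boils down to a very small check. Here is a rough Python sketch of the idea. The real code lives inside Cyrus and looks nothing like this; `owner_of`, `backup_for_request` and `AmbiguousRequest` are hypothetical names, and the sketch assumes mailbox names of the form user.<username>.<folder> with no dots in usernames.

```python
class AmbiguousRequest(Exception):
    """A state request touched mailboxes belonging to more than one backup."""


def owner_of(mailbox):
    """Return the user that owns a mailbox such as 'user.bethany.work-reports'.

    Assumes '.' as the hierarchy separator and no dots in usernames.
    """
    parts = mailbox.split(".")
    if len(parts) < 2 or parts[0] != "user" or not parts[1]:
        raise ValueError(f"not a user mailbox: {mailbox!r}")
    return parts[1]


def backup_for_request(mailboxes):
    """Pick the single backup a state request refers to, or refuse it.

    Refusing is safe here: the sending side is expected to retry with
    progressively coarser units until it is syncing one user at a time.
    """
    owners = {owner_of(mailbox) for mailbox in mailboxes}
    if len(owners) != 1:
        raise AmbiguousRequest(sorted(owners))
    return owners.pop()
```

A request covering user.bethany and user.bethany.work-reports resolves to bethany's backup, while one that mixes in user.janes is refused and left for the sender to retry.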
The other big difficulty is the handling of mailbox renames. Renaming an ordinary (not Inbox, not special-use) mailbox within a user is fairly trivial, at least within the backup system. You just log the rename, then change the mailbox name in the index. Renaming a user's inbox (i.e. changing their username) requires awkward juggling of the on-disk backup location (based on their username), their entry in the twoskip database (pointing their username to their on-disk backup location), plus logging the rename to their backup log and updating all of their mailbox names within the index database (ideally atomically; at least safely). The current plan is to reject these for now and massage things through by hand, and the tools that develop out of that will inform what shape the automatic handling eventually takes. And then there's renaming a mailbox such that it belongs to a different user (e.g. renaming user.bethany.work-reports, without also renaming user.bethany), or renaming a user's inbox but not their folders (e.g. renaming user.janes but without renaming user.janes.work-reports). The replication protocol doesn't strictly make these cases impossible, though I don't yet know if they can ever come up in practice…
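To keep the cases straight, here is a small Python sketch that sorts a single rename into one of the buckets described above. It is purely illustrative: the function names and the three labels are made up, it assumes user.<username>.<folder> naming with no dots in usernames, and it only looks at one rename at a time, so it can't tell whether a user's folders are being renamed along with their inbox.

```python
def split_mailbox(name):
    """Split 'user.bethany.work-reports' into ('bethany', 'work-reports').

    A user's inbox ('user.bethany') has an empty folder part.
    """
    parts = name.split(".", 2)
    if len(parts) < 2 or parts[0] != "user" or not parts[1]:
        raise ValueError(f"not a user mailbox: {name!r}")
    folder = parts[2] if len(parts) == 3 else ""
    return parts[1], folder


def classify_rename(old, new):
    """Classify a rename as 'folder-rename', 'user-rename' or 'unusual'."""
    old_user, old_folder = split_mailbox(old)
    new_user, new_folder = split_mailbox(new)

    if old_user == new_user and old_folder and new_folder:
        # Ordinary case: log the rename, update the name in the index.
        return "folder-rename"
    if old_user != new_user and not old_folder and not new_folder:
        # Inbox to inbox: the username changes, so the backup location
        # and the user-to-backup mapping are affected too.
        return "user-rename"
    # Everything else, e.g. a folder moving from one user to another.
    return "unusual"
```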
The friends you make along the way
The Cyrus backup code hasn't landed in the official Cyrus repository yet, though it will soon — probably sometime in January. When it does land, it will have enough functionality that people can run it in a write/examine/verify-only fashion (no recovery) alongside their existing backup strategies to shake out the gremlins, and we can get to work on sketching out and implementing recovery tools.