Dec 1: A tangled path of workarounds

... or why CardDAV took so long

This is the first post in the FastMail 2015 Advent Calendar. Stay tuned for another post tomorrow.

This blog post is very technical, detailing significant changes to the internals of our mail server software which were made back in June.

It's long and kinda rambling, but it had been a long week and what else would I have to do on a Saturday night after the kids are asleep (other than save a draft and proofread it on Monday morning before actually posting) ... or so I thought. Turns out it stayed in the Drafts bucket until the Advent Calendar!

An attempted workaround

In theory, the world is a happy place where everybody follows standards and software is predictable. In practice, the world is nothing like that. This is a source of both frustration and comfort to those of us who work in IT – frustration because we beat our heads against broken stuff every day of the week, comfort because if this stuff was easy we'd be out of a job!

In this particular case, the "culprit" is the Apple contacts connector for Mac OS X. It only supports a single addressbook, which is fine. It doesn't let you decide which addressbook on your server to use, which is… problematic.

For business accounts, we provide both the user's own addressbook and a shared addressbook via CardDAV, so that more powerful clients can edit both. In future we plan to have a third separate addressbook for automatically added email addresses. People don't want every single address they have ever emailed flooding device addressbooks, but it can still be useful to have them for spam whitelisting, and with a good client the list can be managed remotely.

We store the shared addressbook on a user called masteruser in the primary domain (or masteruser_{businessname} in one of our domains if the business doesn't have its own domain).

Now, CalDAV and CardDAV don't actually support this at all, in theory. We implement it by returning the canonical path to the shared calendar or addressbook even though it's outside the requested collection path. We decided to do this upon determining that it's what Google's CalDAV server does, and that every client we could test against worked with it. It looks like this:

Request:
PROPFIND /dav/addressbooks/user/brong%40brong.net/ HTTP/1.0
Depth: 1
...

Response:
<?xml version="1.0" encoding="utf-8"?>
<A:multistatus xmlns:A="DAV:" xmlns:E="http://me.com/_namespace/" xmlns:D="urn:ietf:params:xml:ns:carddav" xmlns:C="http://calendarserver.org/ns/" xmlns:XFA68="urn:ietf:params:xml:ns:caldav" xmlns:CY="http://cyrusimap.org/ns/">
 <A:response>
  <A:href>/dav/addressbooks/user/brong@brong.net/</A:href>
   ...
     <A:resourcetype>
      <A:collection/>
     </A:resourcetype>
   ...
 </A:response>
 <A:response>
  <A:href>/dav/addressbooks/user/brong@brong.net/Default/</A:href>
   ...
     <A:resourcetype>
      <A:addressbook/>
      <A:collection/>
     </A:resourcetype>
   ...

 </A:response>
 <A:response>
  <A:href>/dav/addressbooks/user/masteruser_brongnet@brong.net/Shared/</A:href>
   ...
     <A:resourcetype>
      <A:addressbook/>
      <A:collection/>
     </A:resourcetype>
   ...
 </A:response>
</A:multistatus>

In this particular case, a user with username matt@{domain} discovered that his Mac was choosing the shared addressbook rather than his personal addressbook.

A little debugging showed that we were returning the addressbooks in alphabetical order of their owner, so the masteruser was being returned first.

Down the rabbit hole

There's a function in Cyrus called mboxlist_findall() which is used to iterate over mailboxes. It is used for both the IMAP list commands, and for processing mailboxes in other parts of the code.

To add to the complexity, Cyrus supports two different options for the separator between mailbox hierarchy levels ('.' or '/') and two different namespaces for mailbox names: one where all the user's mailboxes are subfolders of INBOX, and one where they appear at the top level (the "alternative namespace").

Here's where I switch to past tense, because a lot of this code was changed back in June!

For the alternative namespace there was a second function mboxlist_findall_alt() which could be switched into place instead. Either way, the function took a single glob expression to detail which mailboxes to display. The glob was rewritten to match separate parts of the mailbox listing.

I changed the way that mboxlist_findall was called from the DAV code so it would always return the logged in user's addressbook before any other user's. In the process, I had to change it slightly. The LIST code is approximately the most complex and error prone of the whole system, so it has pretty good test coverage already. I made sure it passed all the tests and rolled it out.
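As a rough illustration of the ordering idea only (this is not the Cyrus code, which is C and works on mailbox names rather than URLs), here is a minimal Python sketch. The function name order_collections and the example.com paths are made up for the example.

```python
def order_collections(hrefs, login):
    """Return hrefs with the logged-in user's collections first.

    sorted() is stable, so within each group the incoming order
    (alphabetical by owner, in the case described above) is preserved.
    """
    own_prefix = f"/dav/addressbooks/user/{login}/"
    # False sorts before True, so "is not mine" pushes other owners back.
    return sorted(hrefs, key=lambda href: not href.startswith(own_prefix))


# Hypothetical listing, in the alphabetical-by-owner order we used to return.
hrefs = [
    "/dav/addressbooks/user/masteruser@example.com/Shared/",
    "/dav/addressbooks/user/matt@example.com/Default/",
]
ordered = order_collections(hrefs, "matt@example.com")
```

With this ordering, matt's Default addressbook comes out ahead of the masteruser's Shared one.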

Doesn't really exist

Here's where things get a little uncertain. On Monday a different customer reported an issue where Mozilla Thunderbird had managed to rename all of their folders under INBOX.Trash.userX. The logs clearly showed it happening, but didn't show how Thunderbird could have been coerced into doing this. I haven't been able to repeat it in my testing since.

The background was that Thunderbird was showing a phantom user folder. This only happens if it's configured to show all folders rather than just subscribed (the default in modern Thunderbird is to only show subscribed), and happened because Thunderbird was using a particular method of listing namespaces which showed the phantom user folder thanks to my changes.

The phantom user folder appeared because the listing reported that user had subfolders. That is true in one sense (everyone's account sits under user in the global namespace), but false in another (only the account owner's folders were visible, so there was no reason to produce a user folder). It turns out that masteruser addressbooks (see above) are also folders for some purposes, and were causing user to appear in business user accounts independently anyway. Only this one user had problems.

But he had pretty bad problems, and I wanted to fix this properly. That was my Monday, from 7am when I first heard of the customer issue, until 2am the next day when I had completely rewritten the mboxlist_findall and mboxlist_findsub functions to be massively simplified, and convert every mailbox into the external namespace before testing it against the glob expression, rather than doing multiple layers of translation.

This time however, I wasn't jumping in quite so fast. I wanted to give this to QA for some serious testing first. And I'm really glad I did, because it found a nasty bug with rename and domain split users.

Somewhat independently, over the past few years I've been slowly moving more code to the simpler mboxlist_allmbox which only works on a single prefix, and has a simpler callback structure. During the week I also extended this with a more advanced API that allows getting just subfolders, just the parent folder, or also checking the related DELETED namespace used internally to avoid dataloss on user error if you turn on the delayed delete option (we turn it on, because we only run a full backup every 24 hours, and using this option we avoid the gap entirely – it's impossible for a customer to accidentally delete messages in a way that we can't restore, once they are on our servers).

Domain Split

We own some hundreds of domains, awesome things like sent.com (which we regularly fend off people trying to buy), and less awesome things like internet-e-mail.com (apologies to the legitimate handful among the 64 active accounts on that domain who don't look like scammy free-trial accounts). Sidebar: tons of free trial accounts have signed up only to discover to their disappointment that we block URLs and obvious verification emails from other services before you verify with us, to avoid being a vector for attacks. So they just sit idle until the trial period expires.

But I digress. Once upon a time, when you got your username at FastMail, you had that username for every domain in our system, which was great if you got in first and managed to get bob. But it's not so great for every other person who wants their name. When we started allowing people to have their own domains on family and business accounts, having a single shared namespace was even worse.

So we switched on domain splitting within Cyrus, so that every user is in their own domain. This is implemented with internal names like this:

brong.net!user.brong.Trash

In my namespace, this is INBOX.Trash in the normal namespace, INBOX/Trash with unixhierarchysep turned on, or just Trash in the altnamespace.

For another user in my domain, this is user.brong.Trash, user/brong/Trash, Other Users.brong.Trash or Other Users/brong/Trash depending on settings. Oh, and the Other Users bit is configurable too, you can call it whatever you like in settings and the list code has to manage.
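To make the mapping concrete, here is a small Python sketch of the translations listed above. It is an illustration, not the Cyrus implementation: the function to_external and its parameters are invented for this example, it only covers user mailboxes seen by a non-admin in the same domain, and it ignores how dots inside folder names are stored when unixhierarchysep is on.

```python
def to_external(internal, viewer, *, unixhierarchysep=False,
                altnamespace=False, other_users="Other Users"):
    """Convert an internal name like 'brong.net!user.brong.Trash' into the
    name that `viewer` (a user@domain in the same domain) would see."""
    domain, _, path = internal.partition("!")
    parts = path.split(".")  # this sketch assumes '.'-separated internal names
    if len(parts) < 2 or parts[0] != "user":
        raise ValueError("only user mailboxes are handled in this sketch")
    owner, rest = parts[1], parts[2:]
    sep = "/" if unixhierarchysep else "."

    if f"{owner}@{domain}" == viewer:
        # The owner's view: everything under INBOX, or at the top level
        # when the alternative namespace is enabled.
        if altnamespace:
            return sep.join(rest) if rest else "INBOX"
        return sep.join(["INBOX"] + rest)

    # Another user's view: under 'user', or under the configurable
    # "Other Users" prefix in the alternative namespace.
    prefix = other_users if altnamespace else "user"
    return sep.join([prefix, owner] + rest)


name = "brong.net!user.brong.Trash"
mine = to_external(name, "brong@brong.net")
theirs = to_external(name, "other@brong.net",
                     unixhierarchysep=True, altnamespace=True)
```

Even this toy version has four code paths for one mailbox, which gives some idea of why the list code is so easy to get wrong.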

For the admin user, it's even worse. At least they can't see altnamespace, but it appears to them as: user.brong.Trash@brong.net or user/brong/Trash@brong.net which is completely bogus by any reading of the standard, because the parent folder is user/brong@brong.net. But that's how it is right now – besides nobody has to see the admin namespace except us, and our tools know how to handle it. (NOTE: some of this isn't strictly true any more since mid-November when I added initial cross-domain support, but I'll write about that another time)

It turns out that my changes broke renaming folders for users who were domain split, including renaming entire domain-split users. This was really bad, but we found it in QA, and I not only fixed it, but wrote tests to make sure it never happens again.

(another sidebar: not all users are domain split yet because it takes a lot of IO to rename an account, so we have hacky workarounds for THAT too)

Standards compliance

We use the Open Source Cyrus IMAPd server. We're really keen to not only follow the standards as closely as possible, but also to manage compatibility with clients so that everyone gets a good experience.

One place where Cyrus has always failed was complete compatibility with the LIST-EXTENDED RFC. Most of it is massively over complex for clients and they don't use it, but some parts are very useful. For years we have advertised the capability, despite not being 100% compliant. The excellent ImapTest tool told us exactly what we were doing wrong.

I've taken a few stabs at fixing this over the years. I knew that one thing we needed to do was pass multiple folder globs to mboxlist_findall at once, because the standard expects you to only output a folder once per LIST command, even if it matches multiple list expressions. Finally after doing all this work, I could do that. At the same time, I could fix the RENAME bug.

So I finished about 1am on Thursday night/Friday morning, getting it to pass ALL the tests in the listext suite. Then I spent the whole Friday workday working with QA and testing it myself to be sure that the list command was right. One of the things I did was replace our glob code, initially written in 1993. It had a bug based on failure to backtrack on partial matches. I could roll my own backtracking engine, or rewrite it to use the pcreposix engine which we already embed for sieve filters to use. I chose to use pcreposix.

Two bugs that snuck through

The first was a documentation failure I think, or a reading comprehension failure on my part. I had a list of special characters to escape, and '+' didn't make it. However, if the '+' character existed multiple times in the folder name, it would crash pcreposix. Oops. Just adding it to the list of characters to escape fixed this bug – but according to the logs it affected 3 people overnight before being fixed. Those people couldn't read their folder listing while the bug was in the wild.
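For the curious, here is roughly what translating a LIST pattern into a regular expression looks like, sketched in Python with its own re module rather than C and pcreposix. The function names are invented for this example. In IMAP, '*' matches anything and '%' matches anything except the hierarchy separator; everything else is escaped as a literal, which is exactly the step where '+' was missed. The second function shows the once-per-LIST rule: each mailbox is tested against all the patterns and emitted at most once.

```python
import re


def glob_to_regex(glob, sep="."):
    """Translate an IMAP LIST pattern into a compiled regular expression."""
    out = []
    for ch in glob:
        if ch == "*":
            out.append(".*")                          # any characters
        elif ch == "%":
            out.append("[^" + re.escape(sep) + "]*")  # stop at the separator
        else:
            out.append(re.escape(ch))                 # literal: '+', '.', '(' ...
    return re.compile("".join(out), re.DOTALL)


def list_mailboxes(mailboxes, globs, sep="."):
    """Return each mailbox once if it matches any of the patterns."""
    regexes = [glob_to_regex(g, sep) for g in globs]
    return [m for m in mailboxes if any(r.fullmatch(m) for r in regexes)]


mailboxes = ["INBOX", "INBOX.a+b+c", "INBOX.Trash", "INBOX.Trash.old"]
listed = list_mailboxes(mailboxes, ["INBOX.%", "INBOX.Tr*"])
```

Here INBOX.Trash matches both patterns but appears only once in the result, and the folder with several '+' characters is matched as plain text.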

The second bug was much more subtle. ONLY for domain split users, if they had a plus address that didn't match exactly to a folder name (either mismatching on case or for a subfolder which didn't exist) then the fuzzy match would fail, and the message would wind up in INBOX.

Actually, we didn't have a test on fuzzy matching in our test suite at all – I fixed that, and was very confused to notice that all the tests still passed. Only when I also turned on virtdomains and created a domain-split user did they start failing.

This was caused by the fact that the fuzzy matches were running on folders in the admin namespace, but using a search expression in the otheruser namespace, which just happen to be the same for non-domain-split users.
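Here is a sketch of the kind of fuzzy matching involved, with loud caveats: the exact rules Cyrus applies are not spelled out in this post, so the behaviour below (case-insensitive match, then falling back to the nearest existing parent folder, then INBOX) is an assumption made for illustration, and the function name fuzzy_target is invented. The important property is that the plus detail and the folder names are compared in the same namespace; here both are names relative to the user's own folder tree.

```python
def fuzzy_target(detail, folders, sep="."):
    """Pick a delivery folder for user+detail@domain.

    `folders` are the user's folder names relative to their own hierarchy
    (for example 'Archive.2015'), i.e. the same namespace as `detail`.
    """
    by_lower = {name.lower(): name for name in folders}
    parts = detail.split(sep)
    while parts:
        candidate = sep.join(parts).lower()
        if candidate in by_lower:
            return by_lower[candidate]  # return the real, correctly-cased name
        parts.pop()                     # assumed fallback: try the parent
    return "INBOX"


folders = ["Archive", "Archive.2015", "Lists"]
target = fuzzy_target("archive.2016", folders)
```

In this example the detail has the wrong case and names a subfolder that doesn't exist, so the message lands in Archive rather than INBOX.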

In a dream world

A few things I would love to have differently. For one, FastMail would be using entirely the opposite defaults to what we have now. We would be in the altnamespace, so that folders weren't all prefixed by INBOX, and we'd have unixhierarchysep turned on, so that folder names and usernames could contain the dot (.) character. The common defaults from 1999 aren't the common defaults of today.

Unfortunately there are some bugs with altnamespace around subfolders of INBOX, and issues around case sensitivity of INBOX as well. I wrote up the gory detail years ago. We're closer now, but again, the altnamespace fixes have taken a long time because they touch a lot of the code, and it's very easy to get wrong.

And then we need to move all the FastMail users over to this new, fixed altnamespace. This should help particularly with awful clients like Outlook. It's heartbreaking every time a user loses email due to their client having lied about storing the messages to our servers – and since we never saw the message, our excellent backup system can't help get them back.

This is why we're still on mail.messagingengine.com. We plan to move everything over to imap.fastmail.com and smtp.fastmail.com rather than white-labelling our own service, but we want to get all the namespace stuff just-so, so that it's a great experience from day one.

OS X Addressbook

Which brings us right back to the OS X Addressbook. As I said earlier, we try to obey standards as much as possible, but when a particularly popular client doesn't work how users want with our servers, we are forced to create workarounds so that our customers get a good experience.

Even if we reported a bug to Apple right now, it would still take months at a minimum before a fix is in the field and on all devices. So we need to get this working for everyone so we can move CardDAV out of beta.

Unfortunately, all the changes didn't actually fix the original problem! It looks like the list of returned URLs is also being sorted within the Apple client code, and then trimmed to the first two. That's always a top level container that's not an addressbook, plus the alphabetically first URL. We can't really tell Matt to change his name to Adam, so we switched to generating a synthetic URL. Instead of returning:

https://carddav.messagingengine.com/dav/addressbooks/user/matt@example.com/Default
https://carddav.messagingengine.com/dav/addressbooks/user/masteruser@example.com/Shared

We will be generating something like:

https://carddav.messagingengine.com/dav/addressbooks/user/matt@example.com/Default
https://carddav.messagingengine.com/dav/addressbooks/zzzz/masteruser@example.com/Shared

This puts the shared addressbook after the personal one when sorted alphabetically as well. Hopefully no other popular client will break, and we'll have a solution that works.
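We can't see inside the client, so the following Python is only a model of the behaviour we think we're seeing (sort the URLs, keep the first two). The function client_pick is hypothetical, and the paths are the example ones from above with the container added.

```python
def client_pick(hrefs):
    """Model of the suspected client behaviour: sort, then keep two."""
    return sorted(hrefs)[:2]


container = "/dav/addressbooks/user/matt@example.com/"
personal = container + "Default"

before = [
    container,
    personal,
    "/dav/addressbooks/user/masteruser@example.com/Shared",
]
after = [
    container,
    personal,
    "/dav/addressbooks/zzzz/masteruser@example.com/Shared",
]

picked_before = client_pick(before)
picked_after = client_pick(after)
```

Under this model, the original URLs leave the client holding the Shared addressbook ("masteruser" sorts before "matt"), while the 'zzzz' URLs leave it with the container and matt's own Default addressbook.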

Abusing plus addressing even more

So we did that. The 'zzzz' workaround is on our production servers, and working fine for the sorting. But then people wanted the ability to add the Shared addressbook on their CardDAV device as well.

Oops.

We don't allow users to log in directly as the masteruser, and it wouldn't scale across a large group anyway. We already use + addresses in POP3 logins to allow selecting a subfolder (e.g. emailing brong+Archive@brong.net delivers directly to my Archive folder, and if I log in with that as my POP3 username, it fetches messages from that folder).

So now if you log in with a plus address, the CalDAV and CardDAV server will only return collections which match the plus name. For example, if I used brong+Shared@brong.net as my CardDAV username, I would only see the shared addressbook. If I use brong+Default@brong.net it only shows my personal addressbook. In theory if I had access to the masteruser's Default addressbook, I'd see that too. This even works for calendars: I could log in as brong+b0e80a66-d474-4474-b1d5-d0affe6ce67d@brong.net to see just one calendar (I don't really have one called that).
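The filtering is simple enough to sketch in a few lines of Python. This is not the server code; the function names are invented, and matching the plus name exactly against the last path component is an assumption for the example.

```python
def split_login(login):
    """Split 'brong+Shared@brong.net' into ('brong@brong.net', 'Shared').

    Returns None as the second item when there is no plus part.
    """
    local, at, domain = login.partition("@")
    user, plus, detail = local.partition("+")
    return user + at + domain, (detail if plus else None)


def visible_collections(login, hrefs):
    """Keep only the collections whose name matches the plus part."""
    _, detail = split_login(login)
    if detail is None:
        return list(hrefs)
    return [h for h in hrefs
            if h.rstrip("/").rsplit("/", 1)[-1] == detail]


hrefs = [
    "/dav/addressbooks/user/brong@brong.net/Default/",
    "/dav/addressbooks/zzzz/masteruser_brongnet@brong.net/Shared/",
]
shared_only = visible_collections("brong+Shared@brong.net", hrefs)
```

Authentication still happens as the plain user (brong@brong.net here); the plus part only narrows what gets listed.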

Working around bugs

This isn't the first time we've had to work around exciting bugs for our DAV services. One interesting one is that if you create a CalDAV account with a client which supports autodiscovery, you'll notice an odd server name:

$ dig +short srv _caldavs._tcp.fastmail.fm
0 1 443 caldav-d49.messagingengine.com.
$ dig +short srv _caldavs._tcp.fastmail.com
0 1 443 caldav-d277161.messagingengine.com.

That's right – we return a different servername for each domain. When you connect to a URL on that domain, we strip the digits after the 'd' and check if they match a DomainId in our internal table. Then if you logged in without a domain, we append that domain to your username before checking your authentication. That way we can support all the different Kates (ssh, Bobs) on the different domains, even if their client doesn't include the domain in the login. This was caused by one of the Apple clients as well – which is kind of sad, since they wrote the spec! The client would send the full username@domain if you gave a servername, but would only send the raw username if it found the servername via autodiscovery.
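In Python the hostname trick might look something like this. The DOMAINS table is a stand-in for our internal one (the two entries are inferred from the dig output above), and the function name and regular expression are made up for this sketch.

```python
import re

# Stand-in for the internal DomainId table; entries inferred from the
# SRV records shown above.
DOMAINS = {49: "fastmail.fm", 277161: "fastmail.com"}

HOST_RE = re.compile(r"caldav-d(\d+)\.messagingengine\.com\.?")


def effective_username(host, username):
    """Append the domain implied by the hostname to a bare username."""
    if "@" in username:
        return username              # client already sent user@domain
    m = HOST_RE.fullmatch(host.lower())
    if m is None:
        return username              # not one of the per-domain names
    domain = DOMAINS.get(int(m.group(1)))
    return f"{username}@{domain}" if domain else username


login = effective_username("caldav-d49.messagingengine.com", "kate")
```

So a client that found the server via autodiscovery and only sends "kate" still ends up authenticating as the Kate on the right domain.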

There was an interesting discussion on the calconnect-l mailing list recently about working around remote calendars which change the UID field on every record on every fetch. It's really nasty, but we cope with it by generating a synthetic UID field at our end based on a hash of other fields. I'd love to be able to force everyone else to comply with standards – but the fact is that every customer has calendars that they want to be able to view, and they don't necessarily have the clout to force the source of the information they need to fix its data model. We have the choice between allowing them to view the calendar, or insisting that the data is invalid.
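A synthetic UID of this kind can be sketched in Python as below. Which fields we actually hash, and which hash we use, isn't something I've described here, so the field list, the SHA-1 choice, and the "@synthetic.invalid" suffix are all assumptions for illustration only.

```python
import hashlib


def synthetic_uid(event, fields=("DTSTART", "DTEND", "SUMMARY", "LOCATION")):
    """Derive a stable UID from event properties other than UID itself.

    `event` is a plain dict of iCalendar property name to string value.
    """
    h = hashlib.sha1()
    for name in fields:
        # Separate names and values with NUL so fields can't run together.
        h.update(name.encode("utf-8") + b"\0")
        h.update(event.get(name, "").encode("utf-8") + b"\0")
    return h.hexdigest() + "@synthetic.invalid"


first_fetch = {"UID": "abc-1", "DTSTART": "20151201T090000Z",
               "DTEND": "20151201T100000Z", "SUMMARY": "Standup"}
second_fetch = dict(first_fetch, UID="xyz-2")  # same event, new random UID

stable = synthetic_uid(first_fetch) == synthetic_uid(second_fetch)
```

The trade-off is that if the remote side edits one of the hashed fields, the event looks like a delete plus a create instead of an update.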

Where possible, we will do the extra work so our customers have the best experience. Where it's not possible within the existing framework, we'll work on building something better.

And apologies for problems that leak into view along the way. Thankfully no mail was lost due to the fuzzy-match problem, only some messages were delivered to INBOX rather than the proper target folder for some users. And the +++ folder issue only hit a few people and only for one day – the data wasn't damaged, just invisible temporarily.