UPDATE: the cause of the "infinite copy" bug was tracked down to a user side applescript. I have written a separate blog post about it.
This is a technical blog/rant. Users of FastMail who don't use Apple's email clients can safely skip it.
I've spent quite a lot of the past few days dealing with bugs caused by the Mail app in Apple's new Mavericks update.
Apple's mail client has been buggy with IMAP connections forever. It was infamous a couple of years ago for creating folders called INBOX.INBOX.INBOX.INBOX.INBOX.INBOX (until it hit the mailbox limit). We now block those at create time because they were causing "interesting" problems as well as being confusing.
This release has other interesting bugs that I've looked into in the past couple of days. When you uncheck the "keep a copy of my sent email on the server" checkbox, it auto-rechecks itself. I confirmed that report myself on our test laptop.
I also confirmed that the 'Archive' folder (special-use \Archive) doesn't appear in the folder listing, and neither do any subfolders if you have any (one of my accounts does - one per year for the past *mumble* years).
So we know it's not the breakfast of champions. That's not a giant surprise, it's never been the strong point of that OS. The previous revision has problems too:
229.12 UID STORE 588201 +FLAGS.SILENT ($Junk Junk JunkRecorded) 229.12 OK Completed 230.12 UID STORE 588201 -FLAGS.SILENT ($NotJunk NotJunk) 230.12 OK Completed 231.12 UID EXPUNGE 588201 231.12 OK Completed 232.12 UID STORE 588201 +FLAGS.SILENT (\Deleted) 232.12 OK Completed
Anyone who can read IMAP can see that it tries to expunge the message BEFORE the \Deleted flag is set, which is pointless. UID EXPUNGE only deletes messages with the \Deleted flag set.
I found _that_ one because our web interface expunges all \Deleted items, so users noticed they only got the expected behaviour across multiple clients if they left the web interface open at the same time.
Millions of messages
But this doozy takes the cake. I found it nearly a week ago when we had an IO error because a user's cache file was overflowing the 32 bit offset counter that still exists in Cyrus (I have an experiemental branch with 64 bit offsets, but it's not ready for production yet)
They had 71 unique messages in their Junk Mail folder, but included expunged messages (we keep them for a week for backup purposes) there were over 1 MILLION separate entries in the index file. We de-duplicate on store, so the fact that there were over 100,000 copies of the most prolific message in the index didn't totally flood the disk.
I noticed then that they were using 10.9's mail app:
3.19 ID ("name" "Mac OS X Mail" "version" "7.0 (1816)" "os" "Mac OS X" "os-version" "10.9 (13A603)" "vendor" "Apple Inc.")
I wiped the expunged messages from the cache, emailed the user, and left it at that.
This morning I checked again, there were nearly a million messages again, so I enabled telemetry on the account and emailed the user a second time.
Here's what I saw in the telemetry (one of many):
44.18 SELECT "INBOX.Junk Mail" (CONDSTORE) [...] 45.18 FETCH 1:* (FLAGS UID MODSEQ) (CHANGEDSINCE 213634) 45.18 OK Completed (0.000 sec) 46.18 IDLE + idling DONE 46.18 OK Completed 47.18 CHECK 47.18 OK Completed 48.18 UID COPY 3360991:3361069 "INBOX.Junk Mail" * 158 EXISTS * 79 RECENT 48.18 OK [COPYUID 1318456612 3360991:3361069 3361070:3361148] Completed
Yes you read that right. It's copying all the email from the Junk Folder back into the Junk Folder again!. This is legal IMAP, so our server proceeds to create a new copy of each message in the folder.
It then expunges the old copies of the messages, but it's happening so often that the current UID on that folder is up to over 3 million. It was just over 2 million a few days ago when I first emailed the user to alert them to the situtation, so it's grown by another million since.
The only way I can think this escaped QA was that they used a server which (like gmail) automatically suppresses duplicates for all their testing, because this is a massively bad problem.
I discovered my second attempt to contact the user about this problem in their Junk folder tonight. 10 times already!
Given that my colleague had just been paged by high disk usage on that user's server - a usage which was growing fast, and which got reduced massively by expunging old deleted messages... it was time to lock the account until the faulty email client is disabled. We don't lock user accounts lightly, but running a malfunctioning piece of software which is affecting other users and resisting attempts to contact qualifies, and a promise to disable the faulty software will be enough to resume service.
Yes, Mail.app was cleaning up after itself, but we keep deleted emails for a week, and even though it wasn't using disk space, over a million emails were using enough meta database space that a disk had filled up. There are only 79 actual emails in this folder with a total usage of about 2MB, yet the meta files:
91M cyrus.index 906M cyrus.annotations 1.2G cyrus.cache
Over 2GB of disk usage.
The sad thing is - there are about 600 copies of Cyrus on our production farm, and I can roll out a new copy in about 5 minutes. There are umpteen million copies of Mail.app out there, and I can't get them fixed on any particular schedule - so if this happens with more than one user the only real solution that I have is to code a workaround directly into our server to protect our other users.
We already wrote a special case for another one of Apple's brilliant ideas - make the search box default to a full text search of every mailbox. The most expensive possible option for the server.
We have rate limits for other things, but we've never considered needing a rate limit for the COPY command - it would usually hit quota, but because these messages are expunged as fast as they are created, the quota doesn't catch this issue.
The 4 million message 32 bit limit of the UID field will become - as many people have pointed out to me, it's 4 billion, not 4 million - so much for last minute ideas when writing late at night. Sorry for the confusion. It would take a lot longer then a few days to hit the limit!
interesting on that folder soon too, which is another thing we've never hit in production before. The workaround here is at least known - create a new folder, copy the messages across, delete the old folder, and rename the folder into place with a new UIDVALIDITY and new messages.