Intermittent bayes db corruption resolved
This is a technical post describing the history of, and recent efforts to track down, a bug that was corrupting some users’ bayes databases. FastMail users who are subscribed to receive email updates from the FastMail blog can ignore this post if they are not interested.
Over the past few years, we’ve had sporadic reports of users’ bayes databases being corrupted and reset to empty. When this happened, email delivery for that user fell back to using the global bayes database, which reduced the accuracy of their spam detection until they retrained their own database with more spam and non-spam messages.
I had tried several times to track down the cause, each time with no luck. Whenever the problem occurred, there was an error message of this form in the logs:
bayes: bayes db version 0 is not able to be used, aborting!
Often, searching the internet for an error message will turn up other people who have had the same problem and tracked down the solution, but in this case it didn’t. Each time I tried to work through the code to see what was going wrong, I reached a dead end and couldn’t see any obvious problem.
Because the corruptions were very intermittent, and because losing a bayes database isn’t critical (it doesn’t cause email to be lost or inaccessible, and the database can be rebuilt just by reporting email as spam/non-spam again), tracking this down was always a lower priority.
Recently though, after one corruption report too many, I decided to track down the cause once and for all. Bit by bit over the course of several weeks, I added more and more logging to the server code to narrow down where the problem was occurring.
The logging results proved to be very odd. In the vast majority of cases, writing to a particular database worked fine, but every now and then it caused data to be lost. Eventually I managed to create a reproducible test case. It turned out to be a very odd issue: performing a particular programming action with a database library worked fine the first 5 or 6 times, but on the 6th or 7th it would cause data to be lost. Clearly something odd was happening in the lower-level library code.
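To give a feel for what a test case of this shape looks like, here is a minimal sketch in Python. It is not the actual test case: this post does not name the library or the action involved, so the sketch uses Python's standard `dbm.dumb` module purely as a stand-in, and the names `first_failing_iteration`, `make_store_checks`, and the `token` keys are all hypothetical. The idea is simply to repeat one action, verify after every repetition that nothing written earlier has gone missing, and report the first iteration where the check fails.

```python
import dbm.dumb
import os
import tempfile


def first_failing_iteration(action, verify, max_iterations=20):
    """Repeat action(i), checking verify(i) after each run.

    Returns the 1-based iteration at which verify first fails,
    or None if every iteration passes.
    """
    for i in range(1, max_iterations + 1):
        action(i)
        if not verify(i):
            return i
    return None


def make_store_checks(path):
    """Build an (action, verify) pair around a dbm.dumb file at `path`.

    dbm.dumb is only a stand-in for the unnamed database library.
    """

    def action(i):
        # Open the database, write one new key, and close it again.
        with dbm.dumb.open(path, "c") as db:
            db[f"token{i}"] = str(i)

    def verify(i):
        # Reopen and confirm every key written so far is still present.
        with dbm.dumb.open(path, "r") as db:
            return all(f"token{n}".encode() in db for n in range(1, i + 1))

    return action, verify


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        action, verify = make_store_checks(os.path.join(tmp, "bayes_test"))
        print("first failing iteration:", first_failing_iteration(action, verify))
```

Running the loop well past the 6th or 7th repetition matters here, because a check that stops after two or three iterations would have reported that everything was fine.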
Fortunately there was a straightforward workaround, so I’ve now patched our code with it. Over the last few weeks I’ve monitored the logs, which show that the error message above has completely disappeared and no databases are being corrupted any more.
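The monitoring itself can be as simple as counting log lines that contain the error text. The following is a rough sketch rather than the script actually used: the function names are hypothetical, and it assumes only that the logs are plain text with the message appearing verbatim somewhere on a line.

```python
# The error text quoted earlier in this post.
ERROR_TEXT = "bayes: bayes db version 0 is not able to be used, aborting!"


def count_corruption_errors(lines):
    """Count the lines that contain the bayes corruption error."""
    return sum(1 for line in lines if ERROR_TEXT in line)


def count_in_file(path):
    """Count corruption errors in one plain-text log file (hypothetical path)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return count_corruption_errors(fh)
```

A count that stays at zero across several weeks of logs is what suggests the workaround is holding.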
I’ve reported the bugs to the maintainers of the underlying modules causing the problems, so hopefully in the long term they’ll be fixed there as well.