You're reading this version of this blog post, so we succeeded. Our primary datacentre moved and you didn't even notice!
Over the past week, we have moved all of the FastMail, Pobox and Listbox hardware from New York Internet's Manhattan datacentre to their Bridgewater, New Jersey location. The new location gives us more space to expand while keeping the same great service we have always received from NYI's network and remote hands teams.
To prepare for this move, we performed numerous "fire drills" over the last 3 weeks. We shut down half the FastMail infrastructure at a time during the Australian day, to make sure nobody noticed and that all the hardware would come back up cleanly.
Our design goal is to have sufficient redundancy that we can run on half capacity, comfortably during non-peak times and with a little slowdown during peak times. This is due to our commitment to high availability. The fire drills gave us a high level of confidence that our systems were still meeting this goal in practice.
In 2014 I spent a week in New York and moved all the servers to a new set of racks. In the process we reconfigured that redundancy such that we could take down entire racks at a time if we ever had to - either for this type of move within NYI or even to move to a new provider! It's part of our regular contingency planning, and has been very valuable for this week's work.
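A minimal sketch of what checking that kind of rack-level redundancy could look like. The service names, rack names and placement map here are made up for illustration and are not the real layout or tooling; the idea is only that a service survives a rack outage if at least one of its replicas lives in a rack that stays up.

```python
from itertools import combinations

# Hypothetical placement map: service -> set of racks holding a replica.
# These names are illustrative only, not the real infrastructure.
PLACEMENT = {
    "imap": {"rack-a", "rack-c"},
    "smtp": {"rack-a", "rack-b", "rack-d"},
    "web": {"rack-b", "rack-c"},
}


def services_lost(placement, racks_down):
    """Return the services that have no replica outside the racks that are down."""
    down = set(racks_down)
    # A service is lost when every rack it lives in is part of the outage.
    return sorted(name for name, racks in placement.items() if racks <= down)


def survives_any_outage(placement, n_down=1):
    """True if no service is lost for any combination of n_down racks going offline."""
    all_racks = sorted(set().union(*placement.values()))
    return all(
        not services_lost(placement, combo)
        for combo in combinations(all_racks, n_down)
    )
```

With this example map, `survives_any_outage(PLACEMENT, 1)` holds, but taking down `rack-a` and `rack-c` together would lose `imap`, which is the sort of gap a fire drill is meant to surface before a real move.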
We had migrated the Pobox/Listbox hardware from Philadelphia up to New York over a few batches in the last 12 months. While not the same 50/50 plan we used for FastMail, we felt confident we could repeat the batched moves over a much shorter timeline for this move.
With those plans in hand, we are pleased to say that we moved every service, and virtually every server, in only two days!
Getting it done
We have to start by thanking NYI for their assistance and diligent preparation leading up to this move. They set up racks in the new datacentre and bridged all our networks through dark fibre between their two locations.
This week, two of our operations team flew to New York to put our plan into effect. The moves were scheduled for Monday and Tuesday nights (8th and 9th of May), starting at 6pm New York time (8am Melbourne). Rob and Jon are the operations leads for the two sets of infrastructure. They led the move on the ground, working with NYI staff and a team of movers. Back in Australia, I was one of the two operations staff monitoring the move and keeping services running smoothly.
On Tuesday during the day, we were running with half the hardware in each datacentre across the bridged network. We're now entirely in New Jersey with nothing left in Manhattan.
The moves took longer than planned, as moves always do! Missing rack rails, slightly smaller racks than expected, networks not quite coming together on the first go, and similar surprises meant Rob and Jon were up until 5am their time getting the last bits up and working. The datacentre crew at NYI-NJ were amazing as well. We were very fortunate 15 years ago when we found NYI; they really are a gem amongst datacentres! As I've said before, a lot of our reliability can be attributed to having a really good datacentre partner. With their help, we were back up and running for the US day.
But enough talking about the move, let's see some more photos!
Even though I knew the plan and had confidence that we had tested each of the individual tasks required, you never know what's going to happen on the ground. (Yes, we even had a plan for what would happen if the truck crashed on either day.) So I speak for everyone when I give Rob and Jon huge high fives for pulling this off so smoothly!
Things customers may have noticed
There will always be a few hitches with a massively coordinated operation. Issues we dealt with in the process:
A handful of FastMail App users were using the QA server, which was offline for about 8 hours on Monday night. Likewise the beta server was offline for about 8 hours on Tuesday night.
Pobox and Listbox new logins broke on Monday night because of an undeclared dependency on the billing service, which was offline during the Monday part of the move. Once that was identified as the cause, the quickest fix was to push forwards and bring up the billing service again in New Jersey.
A bug in Pobox service provisioning cropped up, unrelated to the move. But, because other services were intentionally offline, the bad behavior persisted long enough to cause Pobox DNS to break for 30-40 minutes. During that time, some Pobox users could not send mail, and others reported bouncing messages from their correspondents. As continuous delivery of mail is always our highest goal, we deeply apologize to anyone affected by this issue.
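The login issue above came down to a dependency that was not written down anywhere. As a rough sketch, assuming dependencies are declared in a simple map (the service names and the graph below are hypothetical, not the actual Pobox or Listbox architecture), a pre-move check could list everything that would be affected by taking a given set of services offline:

```python
# Hypothetical declared dependency graph: service -> services it needs.
# An undeclared dependency is exactly an edge missing from a map like this.
DEPENDS_ON = {
    "login": {"billing"},
    "billing": {"database"},
    "webmail": {"login"},
    "database": set(),
}


def affected_by(offline, depends_on):
    """Return every service that is offline or transitively depends on an offline one."""
    affected = set(offline)
    changed = True
    # Keep sweeping until no new service is pulled in by a dependency.
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in affected and deps & affected:
                affected.add(service)
                changed = True
    return sorted(affected)
```

Here `affected_by({"billing"}, DEPENDS_ON)` reports `billing`, `login` and `webmail`. A check like this only helps when the map is complete, so the real fix is recording the dependency once it has been discovered.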