Behind the Scenes of the Bloglines Datacenter Move (Part 3)

As it happened, the new datacenter was built out before the custom blog article replication code was completed and tested. This was OK, because we wanted to stress test the new datacenter machines anyway. After configuring the new machines, we started running some test crawls against an older version of our feed database. To differentiate this test crawler from the Redwood City production crawlers, we changed the User Agent. Many people noticed a crawler with the User Agent “Bloglines/3.0-rho”, and some speculated that “rho” was the initials of one of the engineers. Actually, rho in this case is the Greek letter. We didn’t want to call it a beta, because it wasn’t really, so we went further down the Greek alphabet. Rho is greater than beta, you see. Yes, we’re easily amused.

The replication code started to stabilize, and we began copying blog articles from the old datacenter to the new one. This happened in fits and starts as we debugged the code. The fact that it happened without us having to take the site down was a great advantage. We also continued to test the Bloglines installation at the new datacenter.

Concurrently, we started working out the datacenter move checklist, enumerating all the items that had to be completed and the point at which each had to happen. The blog articles were being copied in the background, but all the other databases in the system could only be copied once we could be sure that they wouldn’t be updated (i.e., while they were operating in read-only mode). With Bloglines, we could “cheat” a little. By turning off the crawlers in Redwood City, we could ensure that many of the databases in the system would not be modified, while still keeping the site alive. We could then start copying these databases before taking the site down, which would reduce the total amount of downtime further. So our move checklist was divided up into the following sections:

  1. Tasks that had to be completed before the day of the move
  2. Tasks to do after the crawlers were turned off
  3. Tasks to do after the site was taken down
  4. Verification steps after everything was moved to the new datacenter
  5. Tasks to do after the site was back up at the new datacenter
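
A phased checklist like the one above can be modeled in a few lines of Python. This is only an illustrative sketch, not the tooling used for the move: the phase names mirror the five sections, while the `Checklist` class and its methods are hypothetical names made up for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Phase(IntEnum):
    """The five checklist sections, in the order they must be worked through."""
    BEFORE_MOVE_DAY = 1
    CRAWLERS_OFF = 2
    SITE_DOWN = 3
    VERIFICATION = 4
    SITE_BACK_UP = 5


@dataclass
class Checklist:
    """Hypothetical tracker: tasks grouped by phase, plus a set of finished tasks."""
    tasks: dict = field(default_factory=lambda: {p: [] for p in Phase})
    done: set = field(default_factory=set)

    def add(self, phase, name):
        # Register a task under the phase in which it has to be carried out.
        self.tasks[phase].append(name)

    def complete(self, name):
        # Mark a task as finished.
        self.done.add(name)

    def current_phase(self):
        # Return the earliest phase that still has unfinished tasks,
        # or None once every task in every phase is done.
        for phase in Phase:
            if any(t not in self.done for t in self.tasks[phase]):
                return phase
        return None
```

The point of grouping tasks this way is that a later phase only becomes current once everything in the earlier phases is finished, so nothing that needs the site down gets started while only the crawlers are off.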

When we were reasonably confident in the blog article replication code and had worked out a reasonably complete move checklist, we set a date for the datacenter move three weeks out: the evening of Friday, December 16. Friday evenings are the slowest, traffic-wise, and that would give us an entire weekend to fix any issues that arose during the transfer. Given that the migration didn’t actually happen until Monday, December 19, it’s safe to assume that some issues came up during the intervening time.

Tomorrow, I’ll talk about the joys of broken DNS caches and pirates.