Behind the Scenes of the Bloglines Datacenter Move (Part 2)

The simplest (and safest) way to move a site is to take it completely down, copy all the data to the new machines, and then bring the site back up at the new datacenter. We could have done that, but the downtime required would have run into days, and we didn’t want that. Actually, an even simpler way to move a site is to physically pick up the machines and move them to the new datacenter. Going across the country, that still would have required probably 24 hours of downtime, factoring in the time to pull the machines from Redwood City, pack them, put them on an airplane, unpack them, reinstall them in Bedford, and reconfigure them for their new network environment. And after a journey like that, chances are some of the machines wouldn’t have come back up. So our only real option was to build a system that would copy at least the bulk of our data to the new datacenter in the background, while Bloglines was still live and operating.
The Bloglines back-end consists of a number of logical databases. There’s a database for user information, including what each user is subscribed to, what their password is, and so on. There’s also a database for feed information, containing things like the name and description of each feed. There are also several databases that track link and GUID information. And finally, there’s the system that stores all the blog articles and related data. We have almost a trillion blog articles in the system, dating back to when we first went on-line in June 2003. Even compressed, the blog articles make up the largest chunk of data in the Bloglines system, by a large margin. By our calculations, if we could transfer the blog article data ahead of time, the other databases could be copied over in a reasonable amount of time, limiting our downtime to just a few hours.
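The arithmetic behind that calculation is worth spelling out. Here’s a back-of-the-envelope sketch; the data size, link speed, and efficiency figures below are illustrative assumptions, not Bloglines’ actual numbers:

```python
# Back-of-the-envelope transfer-time estimate. All figures here are
# hypothetical assumptions for illustration, not Bloglines' real numbers.

def transfer_hours(data_bytes: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to push data_bytes across a link_mbps link,
    assuming only `efficiency` of nominal bandwidth is achieved."""
    bits = data_bytes * 8
    effective_bps = link_mbps * 1_000_000 * efficiency
    return bits / effective_bps / 3600

# Example: 10 TB of compressed articles over a 100 Mbit/s cross-country link.
hours = transfer_hours(10e12, 100)
print(f"{hours:.0f} hours")  # ~317 hours, i.e. roughly two weeks
```

At numbers anywhere in this ballpark, the article store alone takes weeks to move, while the remaining databases fit in a few hours of downtime, which is exactly why the articles had to go over ahead of time, in the background.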
We don’t use a traditional database to store blog articles. Instead we use a custom replication system based on flat files and smaller databases. It works well and scales using cheap hardware. One possibility for transferring all this data was the Unix utility rdist. We had used rdist back at ONElist to do a similar datacenter move, and it worked well. Instead, though, we decided to extend our replication system so that it would replicate all the blog articles to the new datacenter in the background, while keeping everything sync’ed up. This was obviously a tricky bit of programming, but we decided it was the best way to accomplish the move, and it would give us functionality we’d need later (keeping multiple datacenters sync’ed up, for example).
As the new machines were being built out at Bedford, work started on the blog article replication improvements. In the meantime, we still had a service to run. All growing database-driven Internet services have growing pains and scaling issues; that’s just a fact of life. So, in the midst of all this, we couldn’t stop working on improving the existing Bloglines site. It made for an interesting juggling act.