Behind the Scenes of the Bloglines Datacenter Move (Part 5)

Also see Parts One, Two, Three, and Four.
The move itself went almost perfectly. At 2pm, we took the crawlers offline and started copying many of the databases. At 4pm, we took the entire site down and started copying the remaining databases. Around 5pm, wandering around barefoot, I broke my toe, but that didn’t affect things (other than my toe). Around 7:30pm, it became clear that we would need an extra half hour to finish, so we updated the plumber page with the new estimate. Around 8:20pm, everything was back up and we completed testing the site. We took the plumber down at 8:30pm.
After the site came back up, we found a couple of small things that we didn’t discover during testing, but nothing major. And over the past couple of weeks, we’ve continued to tweak and tune the service. The one scare we had happened the Thursday after the move, when fully half of our database machines decided to freeze up, all within half an hour of each other. The site still functioned OK, but it definitely scared us. The biggest unknown with the move, at least for me, was the new machines themselves. How many of the new machines would fail when put under load? We had done a lot of stress testing before the move, but we weren’t able to completely test everything. Having half of the database machines fail almost simultaneously brought that fear to the foreground. Luckily it was a one-time occurrence, although we’re still tracing the problem. Update: This happened again yesterday. For those interested, we think it’s an issue with the ACPI support in the Red Hat Enterprise Linux 4 Update 2 kernel on the Dell 1850s that we use.
So after all this work, what do we have? The extra hardware allows us to crawl every feed in the system twice an hour now. Also, the website is much more responsive. And we have room to grow. I think it was worth the broken toe.