Bloglines Crash

Sigh. Looks like the main Bloglines user database corrupted itself around 2:30am this morning. It did so in such a way that the database still ‘ran’, for some definition of that, which means that none of our automated monitoring picked up on it. We will change that.
We take snapshots hourly, and we’re recovering to the most recent good snapshot. Unfortunately, that means that if you registered with Bloglines after around 2am Pacific time (and that’s a lot of people), you will need to register again.
What caused the corruption? We’re not sure yet, although we’ve been having issues with the JFS filesystem that we’re using. So it may have been that.
Update: The system is back on-line, based on the 2:05am userdb snapshot. We now need to go through the bad database files and find out what really went wrong. We are also bulking up our monitoring system to specifically detect this type of problem in the future.
Update 2: We think we’ve pinpointed the problem. It only affected a small number of users and it had nothing to do with the filesystem, which is good. It was a bug in one of our programs, which is bad. In a few cases, a batch process would delete some site records from the database without removing any subscriptions that referenced those sites. This ended up looking like database corruption, but it wasn’t. The batch process in question runs nightly. The change we made was for it to delete some sites that were invalid (for a couple of definitions of invalid). This is part of our ongoing attempts to make sure our crawler only crawls valid RSS feeds. Yes, we did test the program before we pushed it to the site, but our testing didn’t uncover this particular behavior.
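For anyone curious what this kind of bug looks like, here is a minimal sketch in Python using an in-memory SQLite database. It is not our actual code or schema: the `sites` and `subscriptions` tables, their columns, and the function names are all hypothetical, made up for illustration. It shows a check that finds subscriptions pointing at sites that no longer exist, and a delete that removes a site and its subscriptions together.

```python
import sqlite3


def find_orphaned_subscriptions(conn):
    """Return (user_id, site_id) pairs whose site_id has no matching site row."""
    return conn.execute(
        """
        SELECT s.user_id, s.site_id
        FROM subscriptions AS s
        LEFT JOIN sites ON sites.id = s.site_id
        WHERE sites.id IS NULL
        ORDER BY s.user_id, s.site_id
        """
    ).fetchall()


def delete_site(conn, site_id):
    """Delete a site and every subscription that references it, atomically."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("DELETE FROM subscriptions WHERE site_id = ?", (site_id,))
        conn.execute("DELETE FROM sites WHERE id = ?", (site_id,))


def demo():
    # Hypothetical schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sites (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE subscriptions (user_id INTEGER, site_id INTEGER);
        INSERT INTO sites VALUES (1, 'http://a.example/rss'),
                                 (2, 'http://b.example/rss');
        INSERT INTO subscriptions VALUES (10, 1), (11, 2);
        """
    )

    # The buggy pattern: remove the site row but leave its subscriptions behind.
    conn.execute("DELETE FROM sites WHERE id = 1")
    conn.commit()
    orphans_after_bad_delete = find_orphaned_subscriptions(conn)

    # The safer pattern: site 2 and its subscription go away together,
    # so no new orphan appears.
    delete_site(conn, 2)
    orphans_after_good_delete = find_orphaned_subscriptions(conn)

    conn.close()
    return orphans_after_bad_delete, orphans_after_good_delete
```

A query like the one in `find_orphaned_subscriptions` could also run as a periodic monitoring check, raising an alert whenever it returns any rows. A database that enforces foreign key constraints would reject the bad delete outright; whether that fits depends on the storage engine in use.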
