The Aggregated Me

The concept of aggregation is increasingly important on the Internet, as the sheer number of information resources increases. The average user wants to track more and more things on the Internet; an aggregator quickly becomes necessary as one’s bookmark list grows to infinity. The first aggregators, what I call ‘general purpose’ aggregators, like Bloglines, Google Reader, and Newsgator, are focused on tracking blogs and news feeds, making it easy to subscribe to whatever blogs the user came across.
The new service FriendFeed has been getting a lot of attention the past couple of weeks. It's the latest in the line of what I call 'individual aggregators,' services that aggregate all the distributed parts of a person's on-line presence in one place. A person may have a blog, a Twitter account, a Flickr photostream. These services combine all of those items in one place. This trend started with Facebook's newsfeed, continued with Plaxo's Pulse, and now several other services, including Tumblr, can do most of what the individual aggregators do. These services are different from the general purpose aggregators in that they're focused on tracking individuals, not feeds. But the general purpose aggregators can do what the individual aggregators can do, because the underlying technology, RSS, is the same. It's really just a matter of user interfaces and a key bit of information.

The Problem

The individual aggregators collect a list of all of the distributed parts of a person’s on-line presence. They ask each user to list their Twitter account, their Flickr account, their YouTube account, their blog. This list doesn’t exist anywhere in a way that’s machine readable. Each of the individual aggregators has to deduce this information and then maintain it. Or more specifically, each user has to maintain this information on each of the individual aggregators. Wouldn’t it be better if this list existed somewhere under direct control of the user in a way where it wasn’t siloed in a centralized, proprietary service? That way, every aggregator could take advantage of it and users would only have to update the list in one place.

A Modest Proposal

This problem is actually a general purpose version of a problem already solved by something called RSS Autodiscovery. In order to make it easier for general purpose aggregators to find RSS feeds to subscribe to, many publishers included a special line of text in the headers of their HTML. I have one on my blog:

<link rel="alternate" type="application/rss+xml" title="RSS" href="…" />

Aggregators know to look for this line, which tells them where the RSS feed for that blog exists. Can’t we just extend this to include a list of all the other aspects of a person’s identity? Have one line for each service the person uses, and change the title accordingly. So, I could include:

<link rel="alternate" type="application/rss+xml" title="Flickr Feed" href="…" />

for my Flickr feed. This doesn’t have to only apply to services that publish RSS feeds. I could even do something like:

<link rel="alternate" type="application/twitter" title="Twitter" href="wingedpig" />

to indicate my Twitter account.
By doing this, the list of all the parts of a person’s on-line presence is kept under the control of the person, associated with their blog. It’s distributed, open, and easy to implement.
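On the consuming side, pulling these links out of a page's headers takes only a few lines of code. Here's a sketch in Python (my illustration, not any aggregator's actual code; the URL in the example is made up):

```python
# A sketch of how an aggregator might collect these links (illustrative only;
# the example URL below is hypothetical).
from html.parser import HTMLParser

class PresenceLinkParser(HTMLParser):
    """Collect every <link rel="alternate"> entry from a page's headers."""
    def __init__(self):
        super().__init__()
        self.links = []  # (type, title, href) tuples

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are routed here by HTMLParser too.
        if tag == "link":
            attrs = dict(attrs)
            if attrs.get("rel") == "alternate":
                self.links.append(
                    (attrs.get("type"), attrs.get("title"), attrs.get("href")))

parser = PresenceLinkParser()
parser.feed('<html><head>'
            '<link rel="alternate" type="application/rss+xml" '
            'title="Flickr Feed" href="http://example.com/flickr.rss" />'
            '</head></html>')
print(parser.links)
```

An aggregator would run something like this over a person's blog URL, then offer to subscribe the user to each discovered service in one step.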

How To Make It Work

For this to work, a couple of things need to happen. Blog publishing software has to be modified to ask for this information and then insert it into the headers of a person's blog. Then, aggregators need to be modified to look for this information, and to periodically recheck it. The general purpose aggregators need to augment their interfaces to allow people to subscribe to these new feeds. But none of these things are terribly difficult to do.

Moving On

Well all, sadly it is time for me to move along from Bloglines. The service is in great hands with Ask and I am confident it will live long and prosper under their watch.
But “why” you cry? According to my ever-present Ask PR twin, “to spend more time with my family.” And it’s true; my cats have begun to run with a bad crowd. Beard Papa has some new stores that need visiting. And there are American Idol tryouts just around the corner and I need to get gussied up.
But the real reason is once a start up guy, always a geek (I know, say it ain’t so!). So what’s next? At some point I’ll start another company; that’s a difficult habit to break. But I’m also going to focus more of my time helping other startups and newbie entrepreneurs, something I’m finding increasingly rewarding. I haven’t really discussed that aspect of my career much here on the blog, but I’ve recently been involved with two startups. When Plaxo first started, I gave them some technical help. More recently, I’ve been serving on the board of directors of One True Media, a startup that provides some great on-line tools to create and edit videos. I’ve also really enjoyed my recent speaking engagements, where I’ve been able to talk about the process of starting a company and some of the lessons I’ve learned.
In the meantime, Bloglines is in the capable hands of a great team at Ask, first and foremost led by Jim Lanzone, and including Robyn, Paul, Alan, Ben, Rob, Ryan, Andrew, Doug and Scott. Over the past year and a half, they’ve become the driving force behind the service and it is time to let them run with it. During that time, some of the major updates to Bloglines included: Package tracking, HotKeys, drag-and-drop feed management, Ping tracking, as well as vast and numerous improvements to the Bloglines infrastructure. And if that wasn’t enough, just look at the new Blog & Feed Search as an indication that there are lots of great Bloglines innovations to come.
I’m not disappearing. I’ll be one of the speakers at this Saturday’s Techdirt Greenhouse conference. I hope to see you there.

Bloglines Blog Search Launches

This evening, we rolled out an industry leading blog search engine for Bloglines. In addition, there's a new 'Blogs & Feeds' tab on the main Ask site. We started working on the new search engine soon after Bloglines was acquired by Ask last year, and the engine is the result of a lot of hard work by both the Ask search team in New Jersey and the Bloglines team in California. Generating an index with over 1.5 billion archived blog posts, and keeping it updated with new blog posts (within 5 minutes of us receiving the posts) is non-trivial. Please check it out; it's quite an impressive piece of work.

The Business 2.0 Next Net 25

Thanks to Om Malik, Erick Schonfeld, and everyone else at Business 2.0 for both naming Bloglines one of the Next Net 25 companies, as well as for inviting me up for a roundtable discussion yesterday afternoon.

Unfortunately, I could only stay for about half of the discussion, but the part I was there for was interesting. There was much talk about whether we’re in a bubble again, and whether things are different this time or not. My 2C is that the whole ‘Web 2.0’ thing is just more of the same and while things may look different now, most of what we’re seeing is just a continuation of trends that started back in the 1990s. Cheap companies? Yep, ONElist got to 1M users before we took outside investment. Fast growth? Sure, remember Hotmail? User generated content built around communities? Yep, ONElist again (and several others). This is not a criticism, by any means. I’m extremely happy about all of these trends, obviously.

Anyways, thanks again to everyone involved. Why did I have to leave early? To meet with my cat sitter. Sigh, what a life I lead. Anyways, there are more pictures of the gathering up on Flickr.

Behind the Scenes of the Bloglines Datacenter Move (Part 5)

Also See Parts One, Two, Three, Four.
The move itself went almost perfectly. At 2pm, we took the crawlers off-line and started copying many of the databases. At 4pm, we took the entire site down and started copying the remaining databases. Around 5pm, wandering around barefoot, I broke my toe, but that didn’t affect things (other than my toe). Around 7:30pm, it became clear that we would require an extra half an hour to complete things, so we updated the plumber page with the new estimate. Around 8:20pm, everything was back up and we completed testing the site. We took the plumber down at 8:30pm.
After the site came back up, we found a couple of small things that we didn't discover during testing, but nothing major. And over the past couple of weeks, we've continued to tweak and tune the service. The one scare we had happened the Thursday after the move, when fully half of our database machines decided to freeze up, all within half an hour of each other. The site still functioned ok, but it definitely scared us. The biggest unknown with the move, at least for me, was the new machines themselves. How many of the new machines would fail when put under load? We had done a lot of stress testing before the move, but we weren't able to completely test everything. Having half of the database machines fail almost simultaneously brought that fear to the foreground. Luckily it was a one-time occurrence, although we're still tracing the problem. Update: This happened again yesterday. For those interested, we think it's an issue with the ACPI support in the Redhat Enterprise Linux 4 Update 2 kernel on the Dell 1850s that we use.
So after all this work, what do we have? The extra hardware allows us to crawl every feed in the system twice an hour now. Also, the website is much more responsive. And we have room to grow. I think it was worth the broken toe.

Behind the Scenes of the Bloglines Datacenter Move (Part 4)

With two weeks to go before the move, we started having daily status meetings with all the people involved with the move: people from site ops, net ops and the entire Bloglines team. These only ran about 10-15 minutes each, but were invaluable in getting issues taken care of quickly. We were still working through issues with blog article migration, but we thought we could still make the December 16th date. We came up with estimates on how long it would take to transfer the other databases to the new co-lo, and arrived at a total of 4 hours of downtime, with an additional 2 hours of the crawlers being turned off ahead of time.

Unfortunately, at a point after that, it became clear that we wouldn’t hit the December 16th date; we’d most likely be ready two days later on Sunday December 18th. Ask Jeeves has a winter shutdown, which this year started on December 23rd and runs through January 2. We had a couple of options at that point: do the move on Sunday or one of the weekdays before December 23, or push the move out to the new year, most likely to January 6th. In my experience, user-based Internet services have two slow periods during the year: July/August and the last half of December. Because of this, and because moving to the new datacenter would greatly improve the user experience, we decided to push for the move to happen in December. We targeted Monday, December 19th to give us an extra day past when we estimated we’d be ready. And we decided to start the process at 2pm, which would hopefully let us finish without extending too far in the evening, while still avoiding the peak time of traffic to the site.

On Sunday, December 18th we put up a blog post announcing the upcoming downtime, and also inserted a link at the top of every page on the site alerting users to the downtime. One of the last things we did before the move was to have Ben, our UI/graphics guru, modify the Bloglines Plumber, giving him a pirate makeover. It was going to be a special downtime, and we wanted to make sure he looked good (we’re fans of both Talk Like a Pirate Day and the Flying Spaghetti Monster).

At this point, I want to get a little technical. Don't worry, it'll only last a paragraph and you won't be quizzed. When planning a move like this, where a site will end up with a new IP address, you need to take some DNS issues into consideration. DNS is like the white pages of the Internet. It maps domain names to IP addresses, which identify the actual machines. Each DNS record has a Time To Live (TTL), which specifies how long the record is valid (and how long you can cache the record before asking for it again). DNS records are cached all over the Internet, and many of these caches are broken. When planning this move, we did a couple of things:

  1. A week before the move, we turned the TTLs down to 5 minutes.
  2. Before the move itself, we put the Bloglines Plumber downpage up at the new datacenter.
  3. To take down the site, we configured the webservers at the old datacenter to proxy to the new datacenter.
  4. We then changed the DNS records to point to the new datacenter.
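The TTL mechanics behind step 1 can be sketched with a toy cache (my illustration only; real resolvers are considerably more involved):

```python
# A toy resolver cache illustrating TTL expiry (illustrative sketch,
# not how any real DNS resolver is implemented).
import time

class DNSCache:
    def __init__(self):
        self._records = {}  # hostname -> (ip, absolute expiry time)

    def put(self, name, ip, ttl):
        # A record may be served from cache until its TTL elapses.
        self._records[name] = (ip, time.time() + ttl)

    def get(self, name):
        entry = self._records.get(name)
        if entry is None:
            return None               # cache miss: must query upstream DNS
        ip, expires_at = entry
        if time.time() > expires_at:  # TTL elapsed: must re-query upstream
            del self._records[name]
            return None
        return ip

cache = DNSCache()
cache.put("www.example.com", "10.0.0.1", ttl=300)  # a 5-minute TTL
print(cache.get("www.example.com"))  # prints 10.0.0.1 while the record is fresh
```

Turning the TTL down to 5 minutes a week ahead of time means that, by move day, every well-behaved cache will notice the new IP address within minutes of the record changing.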

By proxying, I mean that the webservers would just act like a go-between, taking an incoming request, forwarding it to the webservers at the new datacenter, and returning the response. When we were ready to bring the site back up at the new datacenter, we removed the downpage at the new datacenter, but kept the webservers running at the old datacenter, which, to this day, still proxy requests to the new datacenter. That way, even if a client tries to connect to the old datacenter because they have incorrect DNS records, they’ll still get the site running at the new datacenter.

Ok, enough of the nerd lesson. I’ll wrap this up next time.

Behind the Scenes of the Bloglines Datacenter Move (Part 3)

As it happened, the new datacenter was built out before the custom blog article replication code was completed and tested. This was ok, because we wanted to stress test the new datacenter machines. After configuring the new machines, we started running some test crawls against an older version of our feed database. To differentiate this test crawler from the Redwood City production crawlers, we changed the User Agent. Many people noticed a crawler with the User Agent "Bloglines/3.0-rho", and some speculated that rho were the initials of one of the engineers. Actually, rho in this case is the Greek letter. We didn't want to call it a beta, because it wasn't really, so we went down the Greek alphabet. Rho is greater than beta, you see. Yes, we're easily amused.
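Giving a crawler a distinct User Agent is a one-liner in most HTTP libraries. A sketch in Python (the feed URL here is made up, and this is not the actual crawler code):

```python
# Hypothetical sketch: tagging a test crawler with its own User Agent so
# publishers can tell it apart from the production crawler in their logs.
import urllib.request

req = urllib.request.Request(
    "http://example.com/feed.xml",  # hypothetical feed URL
    headers={"User-Agent": "Bloglines/3.0-rho"},
)
# Note: urllib normalizes header names to "Xxxx-xxxx" capitalization.
print(req.get_header("User-agent"))  # prints Bloglines/3.0-rho
# response = urllib.request.urlopen(req)  # the actual fetch; needs a network
```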

The replication code started to stabilize, and we began copying blog articles from the old datacenter to the new one. This happened in fits and starts as we debugged the code. The fact that it happened without us having to take the site down was a great advantage. We also continued to test the Bloglines installation at the new datacenter.

Concurrently, we started working out the datacenter move checklist, enumerating all the items that had to be completed, and at which point. The blog articles were being copied in the background, but all the other databases in the system could only be copied when we could be assured that they wouldn’t be updated (ie. they were operating in read-only mode). With Bloglines, we could “cheat” a little. By turning off the crawlers in Redwood City, we could assure that many of the databases in the system would not be modified, while still keeping the site alive. We could then start copying these databases, and the total amount of downtime would be reduced further. So our move checklist was divided up into the following sections:

  1. Tasks that had to be completed before the day of the move
  2. Tasks to do after the crawlers were turned off
  3. Tasks to do after the site was taken down
  4. Verification steps after everything was moved to the new datacenter
  5. Tasks to do after the site was back up at the new datacenter

When we were reasonably confident in the blog article replication code and we had worked out a reasonably complete move checklist, we set a date for the datacenter move three weeks hence, the evening of Friday December 16. Friday evenings are the slowest, traffic wise. And that would give us an entire weekend to fix any issues that arose during the transfer. Seeing how the migration didn’t actually happen until Monday December 19, it’s safe to assume that some issues came up during the intervening time.

Tomorrow, I’ll talk about the joys of broken DNS caches and pirates.

How Much Downtime is Too Much?

I received an email about my datacenter move posts. Jeremy Kraybill asked:

    I’m curious if you considered a zero-downtime move at all, where you would keep the “old bloglines” still running while data was transferred to the new bloglines datacenter, and then switch over to the “new bloglines” via DNS after the new site was up? And either users have data loss of several hours (arguably better than downtime of the same amount), or you replicate transaction logs for user-critical data.

That's a good question. We didn't consider a zero-downtime move, and the reasons why illustrate some of the tradeoffs to consider. One was the engineering effort involved. It would have required a substantial amount of additional work to pull off a zero-downtime (or close to zero) move. It would have also greatly increased the risk of something going wrong. We did enough work ahead of time to reduce the downtime to a 4 hour window, which we believed our users would accept. The benefits of moving to the new datacenter sooner rather than later also factored into our decision.

Those aren’t the only things to think about when considering downtime. Bloglines is a free service and isn’t currently monetized. If Ask Jeeves generated a significant amount of revenue from us, that would have factored into our thinking. But even then, the ‘net is littered with examples of sites like eBay. I haven’t checked recently, but eBay, at least in the past, had a policy of regular, scheduled downtimes.

In any event, with scheduled downtime, it’s important to communicate with your users. For us, that meant a post to the Bloglines blog 24 hours in advance. The blog post had specific times listed, along with links to a site that converted the times into all other timezones. We also added a link at the top of every page on the site alerting users to the downtime. During the downtime, we displayed a page that explained exactly when we’d be back on-line. And we updated that page when we went half an hour over our scheduled time. Finally, people appreciate humor. If you’re going to have downtime, make the down page fun. For the datacenter move, for example, we gave the Bloglines plumber a pirate makeover.

I’ll continue my tale of the Bloglines datacenter move tomorrow, with a post that could be titled ‘Rho Rho Rho Your Boat.’

Behind the Scenes of the Bloglines Datacenter Move (Part 2)

The simplest (and safest) way to move a site is to take it completely down, copy all the data to the new machines, and then bring the site back up at the new datacenter. We could have done that, but the length of downtime required would have numbered in the days, and we didn’t want to do that. Actually, an even simpler way to move a site is to physically take the machines and move them to the new datacenter. Going across country, that still would have required probably 24 hours of downtime, factoring in the time to pull the machines from Redwood City, pack them, put them on an airplane, unpack them, reinstall them in Bedford, and reconfigure them for their new network environment. And after a journey like that, chances are some of the machines wouldn’t come back up. So our only real option was to create a system that would copy at least a large amount of our data to the new datacenter in the background, while Bloglines was still live and operating.
The Bloglines back-end consists of a number of logical databases. There's a database for user information, including what each user is subscribed to, what their password is, etc. There's also a database for feed information, containing things like the name of each feed, the description for each feed, etc. There are also several databases which track link and guid information. And finally, there's the system that stores all the blog articles and related data. We have well over a billion blog articles in the system, dating back to when we first went on-line in June, 2003. Even compressed, the blog articles constitute the largest chunk of data in the Bloglines system, by a large margin. By our calculations, if we could transfer the blog article data ahead of time, the other databases could be copied over in a reasonable amount of time, limiting our downtime to just a few hours.
We don’t use a traditional database to store blog articles. Instead we use a custom replication system based on flat files and smaller databases. It works well and scales using cheap hardware. One possibility for transferring all this data was to use the unix utility rdist. We had used rdist back at ONElist to do a similar datacenter move, and it worked well. However, instead, we decided to extend the replication system so that it’d replicate all the blog articles to the new datacenter in the background, while keeping everything sync’ed up. This was obviously a tricky bit of programming, but we decided it was the best way to accomplish the move, and it would give us functionality that we would need later (keeping multiple datacenters sync’ed up, for example).
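The dual-write idea behind that replication can be sketched in a few lines (my illustration, not the actual Bloglines replication code):

```python
# Toy sketch of background replication: each article is written to the local
# store immediately and queued for copying to the new datacenter, so the
# site stays live while the transfer proceeds in the background.
# (Illustrative only; the real system used flat files and network transport.)
import queue

replication_queue = queue.Queue()
local_store, remote_store = {}, {}  # stand-ins for the two datacenters

def store_article(article_id, body):
    local_store[article_id] = body             # readers see it right away
    replication_queue.put((article_id, body))  # a copy ships asynchronously

def drain_replication_queue():
    """Normally run by a background worker; drained synchronously here."""
    while not replication_queue.empty():
        article_id, body = replication_queue.get()
        remote_store[article_id] = body        # in reality, a network send

store_article("post-1", "<item>example article</item>")
drain_replication_queue()
print(remote_store["post-1"])  # prints <item>example article</item>
```

Because the queue decouples the write from the copy, the site never has to stop accepting new articles while the backlog drains, which is exactly what made the background transfer possible.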
As the new machines were being built out at Bedford, work started on the blog article replication improvements. In the meantime, we still had a service to run. All growing database-driven Internet services have growing pains. All growing database-driven Internet services have scaling issues. That’s just a fact of life. So, in the midst of all this, we couldn’t stop working on improving the existing Bloglines site. It made for an interesting juggling effort.

Behind the Scenes of the Bloglines Datacenter Move (Part 1)

One week ago, we moved the Bloglines service from the AT&T datacenter in Redwood City, CA to MCI in Bedford, Massachusetts. This was a challenging and complex undertaking that required months of preparation by many groups. Now that the dust has settled, over the next couple of days I’ll explain some of the process involved.
We had been at AT&T since Bloglines first went on-line in June, 2003, and had been very happy with them. AT&T is a tier 1 colocation facility. They aren't the cheapest, but we never had to worry about power outages or other issues that can crop up with other facilities. After we were acquired by Ask Jeeves in February, we started talking about moving the Bloglines service to the main Ask facility, which is in Massachusetts. This made sense for a number of reasons: it would be easier for operations, it would be easier for us to quickly expand in the future, and it would be easier for us to tie into other parts of Ask Jeeves.
Once the decision was made to move, we had two tasks: figure out how many machines to build out in Bedford, and figure out how to do the move with the minimum amount of downtime. In my experience, estimating how much hardware you’ll need at some point in the future can be difficult, especially when you’re growing quickly and you don’t have a lot of history to use in estimating. I believe in the concept of overwhelming firepower (when in doubt, double or triple it), so we overestimated everything. In the end, the new system has 3 times the number of machines that we were running in Redwood City, and each of those machines is probably twice as fast as any of the old boxes. Once operations had the configurations, they set about ordering, installing, and configuring the machines. That left us with having to figure out how to move the site across the country while minimizing downtime.
I’ll continue that part of the story tomorrow.