Power, Heat and Hard Drives

Two things you need to worry about when running a server cluster are power and heat. Machines today may be small, but they put out a lot of heat and they suck up a lot of power. Many colocation centers (places where you keep your servers) weren’t designed to handle a lot of small servers that generate a lot of heat. This means, for example, that with Bloglines we can’t completely fill a rack at the colocation center with thin servers. The cooling system at the colo wasn’t designed for that kind of heat density.
How can you tell if you have a heat problem? Modern machines have temperature sensors on the CPU and sometimes on the motherboard, which you can (with some difficulty, unfortunately) access through the operating system. You can set triggers that shut down a machine if its temperature rises past a certain point. That will hopefully prevent the CPU from being fried.
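To make the trigger idea concrete, here is a rough sketch in Python. It is not what Bloglines runs. It assumes a Linux machine that exposes its sensors as files under /sys/class/thermal/thermal_zone*/temp, with readings in thousandths of a degree Celsius; other systems expose sensors differently. The function names and the 85 degree threshold are made up for illustration, so check your hardware's rated limits before picking a number.

```python
from pathlib import Path

# Assumption: Linux sysfs thermal interface, readings in millidegrees Celsius.
THERMAL_ROOT = Path("/sys/class/thermal")

# Hypothetical threshold for illustration only; use your hardware's rated limit.
SHUTDOWN_THRESHOLD_C = 85.0


def read_temperatures(root=THERMAL_ROOT):
    """Return {zone_name: degrees_celsius} for every readable sensor under root."""
    temps = {}
    for temp_file in sorted(Path(root).glob("thermal_zone*/temp")):
        try:
            raw = temp_file.read_text().strip()
            temps[temp_file.parent.name] = int(raw) / 1000.0
        except (OSError, ValueError):
            # Skip sensors that cannot be read or that report garbage.
            continue
    return temps


def zones_over_threshold(temps, threshold_c=SHUTDOWN_THRESHOLD_C):
    """Return the names of zones at or above the threshold, sorted by name."""
    return [zone for zone, temp in sorted(temps.items()) if temp >= threshold_c]
```

A monitoring job could call zones_over_threshold(read_temperatures()) every minute or so and, if the list is not empty, send an alert or power the machine off. The sketch deliberately stops short of issuing the shutdown itself.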
CPUs aren’t the only things affected by heat, of course. What may not be obvious is that you also need to worry about hard drives, which can be very sensitive to heat. If you find yourself with hard drive failures, you should seriously investigate whether you have a heat problem. About a month ago, a drive on a backup Bloglines database machine failed. That immediately raised a red flag about heat. Of course, when dealing with a large number of hard drives, you’re going to have a failure every once in a while. We haven’t had any failures since, but if we had, we would have looked at making adjustments to lower the temperature around the machines.