Mark's Tale

Long ago, in a parallel universe rather like our own, an ISP (who shall remain nameless) bought 12 RAID packs from a manufacturer (who shall remain nameless). Between them, and when used with appropriate servers, these RAID packs were intended to hold the email belonging to every single one of our dial-up and ADSL customers.

In principle it was a very sensible thing to do: take 12 disk drives, put them in a case with three power supplies, add an intelligent controller that provides a single SCSI interface to the host server and manages the data across the drives, and you should have a very reliable, high-performance disk storage system. The drives were formatted as RAID 5 with two spare drives, so that any single drive failure would not affect the integrity of the data, and the contents of the faulty drive would simply be rebuilt onto one of the spare drives from the data and parity information held on the remaining working drives.
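For anyone who hasn't met RAID 5 before, the rebuild trick is nothing more exotic than XOR. The snippet below is only a rough sketch of the principle, not what the controller in our packs actually ran: the block contents, the three-data-plus-one-parity layout and the `xor_blocks` helper are all made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks, each on its own drive,
# plus one parity block computed from them.
data_blocks = [b"mail", b"box1", b"data"]
parity = xor_blocks(data_blocks)

# Pretend the second drive has died. XOR-ing the surviving data
# blocks with the parity block gives back the missing block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data_blocks[1]
```

The catch is that a rebuild has to read every surviving drive from end to end, so a second drive failing part-way through leaves nothing to rebuild from.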

Nothing too unusual so far - this is how RAIDs are supposed to work, after all. Only these RAIDs didn't...

We only really realised just how much trouble they were going to be after the packs had been in service for a while and we had migrated all our customer email boxes from our previous outsource provider to the RAID systems. Then things started to go wrong.

To begin with, we started to get SCSI errors from the RAID packs. You would expect this to be impossible, as the RAID system should detect any drive errors and compensate for them by swapping in one of the spare drives and rebuilding the data onto it. However, it seemed, after consultation with the manufacturers, that the RAID controller was unable to cope properly with errors from the disk drives (?!).

Then, one by one, drives started failing, but instead of dealing with this in the way that a RAID should, our boxes either completely forgot what they were supposed to be doing next or, even more infuriatingly, took 12-14 hours to do the drive rebuild and then forgot what they were supposed to be doing next. Either way, the result was a RAID pack with completely scrambled data and around 20,000 mailboxes erased from existence. We eventually became quite adept at recovering what fragments of data were still accessible, but it took weeks to get anything usable off the packs.

Firmware upgrades for the RAID controller were tried (taking all 12 packs out of service and upgrading them was a long process all on its own), as were replacement IBM drives, since the Maxtors originally installed were "known to have bugs in the firmware", but nothing made any difference at all. In the end, we even wrote our own monitoring system to interact with the command line system on the RAID console ports to try and get an early warning of drive problems and errors, as the RAID OS itself seemed to be totally unable to do this, but it didn't help much.

If anything, we were told, trying to copy the data from a failing pack to a new one wouldn't help, as reading the data might generate so many new drive errors that several drives could fail at once, something that no RAID could recover from. We couldn't even swap the drives out one at a time, as there was a better than 75% chance that simply swapping out a single drive would result in the pack forgetting everything and losing all the data on it.

Overall, I don't think I could recommend the purchase of this hardware to anyone who was storing critical data (and what else would you use a RAID pack for?).

So by now you can probably begin to understand why we took a pack out into the car park and beat the living **** out of it. I don't know exactly how many extra hours of work we had to put in to develop the original mail system, babysit the RAID packs overnight as they took 12-14 hours to painstakingly rebuild their data into a pile of garbage, recover what data we could from the trashed packs, communicate with all the affected customers, build an entirely new email system and move all the customer data to it, but there were a lot.

One memorable event just before Christmas 2003, not long before we moved to a new system (see below), resulted in me working a 44-hour day, at the end of which we still had a trashed RAID pack with no readable data on it. This, in turn, led to me running a temperature of 103 in the small hours of Christmas Day and the complete destruction of my Christmas holidays that year. That's my personal reason for being a bit cheesed off with the things, though it's only one of a dozen similar events over nights, weekends and holidays that my team and I had to deal with.

As a final insult, the last time we tried to use one of these RAID packs, a few weeks ago, there was a burning smell shortly followed by a blue flash and a loud bang as one of the power supplies blew up (taking down the power to half the server room with it). We tried again with a completely separate pack and, again, a power supply died in a very convincing fashion. They have now been declared too unsafe to use and are awaiting either reclamation or destruction.

Incidentally, in case you are worried that the equipment we smashed up could have been put to better use: it had non-functioning PSUs (all three of them), a faulty controller card and drives that consistently produced errors, and (as outlined above) it didn't work anyway. About the only things that might have been salvageable were the fans, and we couldn't be bothered to remove them. If you need any spare parts (apart from the PSUs, which are too dangerous to use in our opinion) please feel free to make us an offer for them - anyone still unlucky enough to be using this equipment deserves all the help we can give.

We never did get any compensation for the problems from the manufacturers, or even any acknowledgement that the problems were not solved. The Directors simply ignored the calls and emails we made to them in the end. It will be interesting to see if this Web site changes that situation...

And the email system? We bought NetApp Filers (www.netapp.com) in late 2003 and our customer data has been completely safe on these ever since, with billions of NFS operations carried out so far with no errors or downtime. I'd recommend them to anyone thinking of buying resilient disk storage systems.

Don't you love a story with a happy ending? :-)