Once Upon A Time
This is the story of a RAID array (actually, several RAID arrays). It begins in 2002, when we decided that we needed to buy a reliable storage platform to house our customers' email...
For those of a non-technical nature, a RAID array is basically a box of hard drives that is resistant to failure. If one of the drives breaks, a spare steps in to take its place. Nothing needs to be restarted, everything just works as if nothing had happened. At least that's the theory.
So, we looked around and found an array that seemed ideal for our purposes. To avoid embarrassment, we won't mention the name of the company that made it, but since we need to call them something we'll use the name "Fibrenotix". The product highlights included:
- Compact, space saving, 3U high design
- Automatic rebuild & hot-spare support
- Easy to use / Easy configuration / Quick install
We bought some, installed them, configured them and moved our mail platform onto them. It worked. We were happy.
Look, there's a picture of one over on the right. Okay, it wasn't the prettiest bit of machinery out there, but it was going to live in a datacentre, so aesthetics weren't that important.
Something Wicked This Way Comes
After a while one of the drives failed. The hot-spare burst into life and everything kept going. We replaced the failed drive and tried to return it under warranty. Fibrenotix had an interesting warranty returns policy. It went something like this:
- You contact the warranty returns person to request an RMA number.
- They fax an RMA form to you.
- You fill it in and send it back along with the drive.
- They send the drive off to the manufacturer.
- When the manufacturer returns a replacement drive they send it back to you.
Most companies keep their own stock of spare drives, but apparently this one did not. It's not exactly a fast procedure, but we followed it, filled in the RMA form, sent off the drive and waited. After a week or so a drive turned up along with a piece of paper saying "no fault found". Yes, the drive that their own hardware had marked as being faulty had taken a short trip around the country and come back to us. We were less happy.
Time passed. More drives failed. Some were even replaced. We began to notice that rather a lot of drives had failed. We started to become concerned and prodded the manufacturer.
More time passed. More drives failed. After a while, it appeared that the arrays weren't always noticing when a drive failed, leading to errors and system outages - not a particularly useful trait for a RAID array. We became more concerned and prodded the manufacturer a bit harder. Our customers started muttering about poor service.
After a bit of investigation, the technical people at Fibrenotix came up with a new firmware version that "was created for a number of customers who were experiencing the same issues". Why this wasn't the standard firmware wasn't clear, but we took the systems offline and upgraded them, expecting this to fix the problems we had been seeing. (Oh, how innocent we were!)
More drives failed, but the new firmware had made a difference. Now, when a drive failed, the array noticed and brought the hot-spare into use. Sadly, it didn't seem to actually copy any of the original data onto it beforehand, thus corrupting a large part of the array and making the whole thing unusable. Our mail system now had much larger outages. We became very concerned (and in some cases began to suffer from lack of sleep). Our customers began to leave.
After much thought the people at Fibrenotix decided that the problem must be a faulty batch of drives and supplied some replacements. At this stage we were ready to believe pretty much anything, so much drive swapping ensued. Unsurprisingly, the problems did not go away. Our mail service did not improve. Our remaining customers were arming themselves with pitchforks and torches. We found another supplier.
The Knight In Shining Armour
All stories should have a happy ending. This is no exception.
We switched over to equipment from Network Appliance, which has been working almost perfectly for well over a year. Replacements for failed drives are typically on their way before we are even aware anything has gone wrong, and we are happy again. Our customers have even put down their pitchforks.
Happily Ever After?
And the equipment from Fibrenotix?
Well, this picture sums it up.
"Isn't that a bit wasteful?" you ask. Yes, we thought that, too, but there's one last chapter to our tale of woe...
Not wanting to throw away a piece of equipment that may still be useful in some fashion, we powered one up not too long ago. Events went something like this:
- We put the RAID array into a rack.
- We connected it to a power supply.
- We turned it on.
- It went bang and smoke came out of it.
- The lights went out on lots of nearby bits of equipment.
The fate of the old RAID arrays was sealed. Enough was indeed enough. We thought long and hard about what we should do with them. We decided on this: