Two posts ago I sent my props to Robin Harris (Storagemojo); now I'm feeling the blogger's biteback. Robin - what are you
doing?
In his most recent post, Storagemojo links over to a post of Jon Bach's (dated Feb 5, 07) from Puget Custom
Computers where Jon discusses the problems of using cheap RAID controllers in desktop systems and he gives failure
numbers for drives that he has been working with in his business. Jon argues that RAID in desktop systems is a royal pain because
repairing an array on these systems is more of a problem than restoring from backup. I'm not going to argue much with Jon here
and I certainly don't have a bone to pick with him, but I can tell you with some certainty that my own experience on my own desktop
systems has been different. I would much rather replace and remirror a disk drive than restore an entire system. Then again, I only
support my family and me, a conundrum of requirements and systems that would probably be incomprehensible to anybody but me.
(I know there are readers out there that know what I'm talking about). Jon apparently has a lot of good customers and I'm sure he
does what is right for them.
My heartburn is with Robin for implying that Jon's numbers might be relevant for enterprise storage and implying (by reference)
that RAID is a problem too. Geez, Robin, you know better. I'm not even going to address the RAID thing because that is so wrong.
On the disk side... engineering teams at all major storage companies spend many, many hours qualifying drives and working with
drive manufacturers to increase reliability. Then there is the by-now-famous burn-in process where drives are rigorously tested to
weed out those likely to experience early life failures. Then there is some likelihood that these drives live in server rooms with better
environmental conditions. Then there are advanced functions like background scrubbing to find and relocate bad blocks, yadda,
yadda, yadda. And it is not a bunch of hooey - it is a necessity for staying in business in the ultra-conservative
enterprise storage market.
There are 3 main reasons the "big guys" like EMC, EqualLogic, Hitachi, HP, NetApp and everybody else don't publish their drive
failure numbers.
- Contractual restrictions
- High percentage of NTFs (no trouble founds). Storage system vendors and drive manufacturers agree to disagree about whether a drive
failed (which is probably why contractual restrictions about failure disclosures are in place). The drive manufacturers will always
claim a lower percentage of failures than the storage system vendors and customers. This doesn't make the disk drive
manufacturers less honest, by the way - there actually is a high percentage of NTF drives returned where it appears there is
nothing wrong with them. It's rumored that some of these drives show up in retail outlets and other channels - something that, if
true, could possibly impact the numbers Jon Bach is seeing.
- The first vendor to state their numbers loses - big time. Considering the amount of slop in the analysis, if any vendor were to
give numbers for drive failures, all competitors would publish lower claims and then the great wrestling match of mind-numbing
analysis minutiae would start and foreheads would slap keyboards and drool would seep into the cracks.
These are all things that I'm pretty sure Robin understands, but it does make for fun blogging.