Author Topic: No cure in sight for unpredictable hard drive loss  (Read 4200 times)
61Dynamic
« on: February 26, 2007, 10:55:24 AM »

Here is an interesting article from Ars Technica on hard drive reliability:
http://arstechnica.com/news.ars/post/20070225-8917.html

Google tracked the failure rate of 100,000 of its hard drives and found that drives fail more frequently after the first two years. This contradicts the "bathtub" model, which holds that a faulty drive will fail within the first year of its life, after which the failure rate tapers off until it rises again near the end of the drive's life (the graph is shaped like a bathtub).
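For anyone who hasn't seen it plotted, the bathtub shape is just a hazard rate built from three components. Here's a toy sketch; the rates and shape parameters are made-up illustrative numbers, not figures from the Google paper:

```python
import math

def bathtub_hazard(t, infant=0.10, random_rate=0.02,
                   wearout_scale=6.0, wearout_shape=4.0):
    """Toy hazard rate (failures/year) at drive age t (years):
    infant mortality that decays with age, a constant random-failure
    floor, and a Weibull-style wear-out term that grows with age.
    All parameter values are assumptions for illustration."""
    infant_term = infant * math.exp(-3.0 * t)  # fades within the first year
    wearout_term = (wearout_shape / wearout_scale) * \
                   (t / wearout_scale) ** (wearout_shape - 1)
    return infant_term + random_rate + wearout_term

# Classic bathtub: high at t=0, low in mid-life, rising toward end-of-life.
rates = {year: round(bathtub_hazard(year), 3) for year in (0, 1, 2, 3, 5)}
```

Google's data is interesting precisely because their observed failure rates didn't follow this curve.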

Quote
Earlier this month, Google researchers released a fascinating paper called "Failure Trends in a Large Disk Drive Population" that examined hard drive failure rates in Google's infrastructure. Two conclusions stood out: self-monitoring data isn't useful for predicting individual drive failures, and temperature and activity levels don't correlate well with drive failure. This throws conventional wisdom about predicting drive failures into question, so we sought out some independent expert analysis to weigh in on the findings. First, we briefly recap last week's Google study.
jani
« Reply #1 on: February 27, 2007, 06:06:28 PM »

If you liked the Google research report, you'll enjoy Schroeder and Gibson's paper even more.

StorageMojo's summary is probably a bit more useful for those whose eyes glaze over at 16-page research papers. Here's the opening blurb:

Quote
Which do you believe?

    * Costly FC and SCSI drives are more reliable than cheap SATA drives.
    * RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.
    * After infant mortality, drives are highly reliable until they reach the end of their useful life.
    * Vendor MTBF are a useful yardstick for comparing drives.

According to one of the “Best Paper” award winners at FAST ’07, none of these claims is backed by empirical evidence.

In addition, NetApp has responded to the findings of the paper.

Enjoy!

Jan
61Dynamic
« Reply #2 on: February 27, 2007, 07:48:00 PM »

The Ars post actually links to the Gibson report but I haven't had a chance to read it yet. Thanks for adding the Mojo link.
Serge Cashman
« Reply #3 on: February 27, 2007, 09:14:14 PM »

An 8% annual failure rate at two years is quite amazing. It looks like a quarter of the drives are dead within four years. (That's consumer-grade drives under heavy use, but in a controlled environment.) I can relate to that.
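The "quarter in four years" figure roughly checks out. A quick sketch, using assumed per-year failure rates loosely patterned on the study's numbers (a low infant-mortality first year, then about 8% annually), not exact figures:

```python
# Assumed annual failure rates for years 1-4 (illustrative, not from the paper)
annual_failure_rates = [0.02, 0.08, 0.08, 0.08]

surviving = 1.0
for afr in annual_failure_rates:
    surviving *= (1.0 - afr)  # assume independent survival each year

dead_fraction = 1.0 - surviving
# dead_fraction comes out near 0.24, i.e. roughly a quarter of the drives gone
```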
« Last Edit: February 27, 2007, 09:28:48 PM by Serge Cashman »
kal
« Reply #4 on: February 28, 2007, 02:07:15 AM »

Nobody in a self-respecting datacenter assumes RAID5 is safe. You usually have
hot-spare disks so that reconstruction begins within seconds after a drive failure.
You use small and fast (a.k.a. expensive) disks, so that reconstruction
takes less time. You use more advanced replication (e.g. mirroring or RAID6)
for really important data. You make backups, possibly to some different
physical location, and you take some of those backups off-line.
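A back-of-envelope calculation shows why the hot spare and the fast rebuild matter. All the numbers below are assumptions for illustration (array size, failure rate, rebuild time), not vendor figures:

```python
# Odds that a RAID 5 set loses a second drive during reconstruction,
# assuming independent failures at a constant rate.
n_remaining = 7        # surviving drives in an assumed 8-drive set
afr = 0.08             # assumed 8% annual failure rate per drive
rebuild_hours = 12.0   # hot spare kicks in at once; small fast disks

hourly_rate = afr / (365 * 24)
# Probability at least one of the survivors fails inside the rebuild window:
p_second_failure = 1 - (1 - hourly_rate) ** (n_remaining * rebuild_hours)
# Slightly under 0.1% per incident: small, but across many arrays and
# many years of operation it is not negligible.
```

Stretch the rebuild window to days (no hot spare, slow large disks) and that probability grows proportionally, which is exactly why nobody sensible treats RAID 5 alone as safe.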

And then your data are still not safe, since some major regional disaster may
destroy all your redundant arrays and backups.

And BTW: in the last four years I've seen quite a lot of failed disks, but very few if any failed arrays. Backups are used far more frequently to recover deleted data than to restore failed systems. The user is the weakest link in the chain (this of course includes me; I once deleted half of my home directory, and restored it from an off-line backup).

Slough
« Reply #5 on: February 28, 2007, 07:02:08 AM »

"RAID 5 is safe because the odds of two drives failing in the same RAID set are so low."

I recently had a disk crash; both the master and slave disks went. Fortunately I maintained a backup of critical data, including images, on an external USB drive, so after installing a new disk I was up and running in less than a day, including reinstallation of software and images. I suspect the probability of multiple drives going at once is not that low, since they share the same motherboard and power supply.
tived
« Reply #6 on: February 28, 2007, 01:02:09 PM »

Thanks, interesting read.

hmm, will I have to rethink my storage needs?

Henrik
dlashier
« Reply #7 on: March 04, 2007, 03:12:32 AM »

As an ISP I currently run about 50 hard drives, all SCSI and properly cooled. In ten years I have had exactly two drive failures, both Quantum. I've only ever run two Quantums, and they both failed in less than a year; needless to say I don't buy Quantums anymore. The rest of the drives are either IBM (now Hitachi) or Fujitsu. I switched to SCSI around 10 years ago after a couple of nightmare situations caused by failed IDE drives. Several of the drives I'm running are approaching 10 years old, and the average age is probably around 5 years.

OTOH I've had very bad luck with ATA drives in workstations, particularly Maxtor. Average life seems to be about two years, sometimes much less, occasionally somewhat more; but then most PC boxes don't provide adequate cooling. IMO IDE drives have no place in a critical application, although I've heard that some WD drives are built to commercial specs.

- DL
« Last Edit: March 04, 2007, 03:21:02 AM by dlashier »

61Dynamic
« Reply #8 on: March 04, 2007, 10:57:12 AM »

Quote
hmm, will I have to rethink my storage needs?
How would you change them? There is no alternative to hard drive storage at this time. The lesson to take from this study is to always have redundancy and backups.

Quote
...but then most PC boxes don't provide adequate cooling.
Ah, but according to the studies in the article, heat showed no correlation with hard drive failure; it only becomes an issue at extreme temperatures. Google's drives were in a properly cooled environment. For the average home computer, extreme heat is not a great concern, since the level of activity is far below what you'd see in a server environment (oh, and the study also showed that the amount of activity had no effect on failure rate either).

I think the problem with the consumer ATA/SATA market is that there is a strong push to expand capacity but not to improve reliability. As long as a majority of their drives last the life of a typical computer (2-3 years), the manufacturers are happy.
jani
« Reply #9 on: March 04, 2007, 04:15:17 PM »

Quote
As an ISP I currently run about 50 harddrives, these are SCSI and properly cooled.  In ten years I have had exactly two drive failures - both Quantum. I've only got two Quantums and they both failed in less than a year. Needless to say I don't buy Quantums anymore. The rest of the drives are either IBM (now Hitachi) or Fujitsu. I switched to SCSI around 10 years ago because of a couple nightmare situations due to failed IDE drives. Several of the drives I'm running are approaching 10 years and the average age is probably around 5 years.

OTOH I've had very bad luck in workstations with ATA drives, particularly Maxtor. Average life seems to be about two years, sometimes much less, occasionally somewhat more, but then most PC boxes don't provide adequate cooling. IMO IDE drives have no place in a critical application although I've heard that some WD drives are built to commercial specs.
Even 50 hard drives is just a drop in the ocean, and can only count as anecdotal observation, at least compared to the studies cited earlier in this thread.

And if you note NetApp's observations regarding reliability, you'll find that they started using ATA drives around the turn of the century (it feels funny putting it that way). They were not the only ones.

Your personal experiences may be due to the fact that, earlier, SCSI drives reserved 10% of their capacity for automatic reallocation of bad blocks. Where an ATA drive would expose its full 40 GB capacity, for instance, the corresponding SCSI drive would expose only 36 GB. When bit rot set in, the SCSI drive had quite a bit of headroom, while the ATA drive didn't.

I'm not sure whether current SCSI drives still make use of this feature. It may actually be less helpful today than it was ten years ago.

Jan
dlashier
« Reply #10 on: March 05, 2007, 02:37:07 AM »

Quote
Your personal experiences may be due to that earlier, SCSI drives used 10% of their capacity for automatic reallocation of bad blocks. Where an ATA drive would use the full 40 GB capacity, for instance, the corresponding SCSI drive would use only 36. When bit rot set in, the SCSI drive had quite a bit of headroom, while the ATA drives didn't.

Jan, perhaps, but the failures I've had have been catastrophic, not bit rot, but that may just prove your point.

I agree that 50 drives, plus another 15 or so that have been retired, is a small sample, but I believe it's large enough to be statistically significant. Google apparently did not have stats for SCSI for comparison, but my experience with a couple dozen ATA drives parallels theirs (average life two to three years). There's a reason SCSI drives cost up to 5x as much, and the controller boards just aren't that much more complex; I'm sure they're made to tighter specs, better bearings, etc. I do have some ATAs that have lasted; one is going on 7 years now, but I've also had several fail in less than a year.

- DL
« Last Edit: March 05, 2007, 02:41:02 AM by dlashier »

jani
« Reply #11 on: March 05, 2007, 04:07:35 AM »

Quote
I agree that 50 drives plus another 15 or so that have been retired is a small sample but I believe large enough to be statistically significant.
No, not in the population of hard drives.

To achieve statistical significance with a reasonable degree of confidence for large populations, you need at least approximately 1000 samples.
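To make that concrete, here's the textbook normal-approximation margin of error for an observed failure proportion. The drive counts and the 8% rate are just example inputs:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% confidence margin of error (normal approximation) for an
    observed proportion p measured over n drives."""
    return z * math.sqrt(p * (1 - p) / n)

small = margin_of_error(0.08, 50)    # ~±7.5 points: an "8% AFR" could be ~0-15%
large = margin_of_error(0.08, 1000)  # ~±1.7 points: now the estimate means something
```

With 50 drives the confidence interval is wider than the effect you're trying to measure, which is exactly the problem with small-fleet anecdotes.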

Quote
Google apparantly did not have stats for scsi for comparision, but my experience with a couple dozen ATA drives parallels their experience (avg two to three years). There's a reason scsi drives cost up to 5x as much - the controller boards just aren't that much more complex. I'm sure they're made to tighter specs, better bearings etc. I do have some ATA's that have lasted - one is going on 7 years now but I've also had several fail in less than a year.
This is why you need to read the paper by Schroeder and Gibson that I linked to, which includes data not only on SCSI drives but also on FC drives. You will note that they studied a larger number of SCSI drives than SATA drives, so you can't exactly fault them for testing ATA reliability instead of SCSI reliability.

Jan