Author Topic: False Disk Failures  (Read 4258 times)
John.Murray
Sr. Member
Posts: 893
« on: March 22, 2013, 02:05:16 PM »
Fascinating read:
http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/

In a nutshell, nearly 50% of the reported disk failures in a datacenter turn out to be false failures.  The SAS vs. SATA statistics from an actual data center are also very interesting.

francois
Sr. Member
Posts: 6733
« Reply #1 on: March 23, 2013, 05:38:31 AM »
Quote from: John.Murray on March 22, 2013, 02:05:16 PM
"In a nutshell, nearly 50% of the reported disk failures in a datacenter turn out to be false failures. The SAS vs. SATA statistics from an actual data center are also very interesting."

Thanks for sharing this article. It is very interesting.
Francois
tived
Sr. Member
Posts: 688
« Reply #2 on: March 23, 2013, 09:13:36 PM »
Great read, thanks

Henrik

PS: I've already experienced it on a small scale, and what a great relief it is to find it up and running again after a reboot. Obviously that raises the next question: replace, or continue using it? I have since replaced all the disks with flash drives, and I haven't seen the problem since. I still have the drives and will probably put them to use somewhere with redundancy :-)

thanks
John.Murray
Sr. Member
Posts: 893
« Reply #3 on: March 24, 2013, 12:55:00 AM »
Henrik:

There are some excellent manufacturer-supplied tools for verifying disk health; for my personal use, I have no problem returning a disk to service.
http://www.seagate.com/support/downloads/seatools/
http://support.wdc.com/product/download.asp?groupid=606&sid=3

Professionally, there is really no choice - replace the physical drive, rebuild and move on...
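For anyone who wants to script this kind of health check instead of running the vendor GUIs above, a minimal sketch of the decision logic might look like the following. The attribute names and zero-tolerance thresholds here are illustrative assumptions, not vendor guidance:

```python
# Sketch: decide whether a drive that "failed" once is safe to return to
# service, based on a few commonly reported SMART counters. Thresholds are
# illustrative assumptions only -- a real policy should follow the vendor
# tools linked above.

def ok_to_return_to_service(smart: dict) -> bool:
    """Return True if no SMART counter suggests a genuine hardware fault."""
    # Reallocated or pending sectors indicate real media trouble.
    if smart.get("reallocated_sector_count", 0) > 0:
        return False
    if smart.get("current_pending_sector", 0) > 0:
        return False
    # Uncorrectable errors are disqualifying as well.
    if smart.get("offline_uncorrectable", 0) > 0:
        return False
    return True

# A drive that dropped out once but shows clean counters -- a likely
# "false failure" as described in the article:
print(ok_to_return_to_service({"reallocated_sector_count": 0}))  # True
print(ok_to_return_to_service({"current_pending_sector": 8}))    # False
```

The point of the sketch is only that "failed once" and "SMART shows damage" are separate questions; the article's false failures are drives where the second check comes back clean.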
« Last Edit: March 24, 2013, 01:05:54 AM by John.Murray »

Ellis Vener
Sr. Member
Posts: 1727
« Reply #4 on: March 24, 2013, 07:49:06 PM »
Thanks, John Murray.
Ellis Vener
http://www.ellisvener.com
Creating photographs for advertising, corporate and industrial clients since 1984.
Justan
Sr. Member
Posts: 1876
« Reply #5 on: March 25, 2013, 12:28:38 AM »
Quote from: tived on March 23, 2013, 09:13:36 PM
"I've already experienced it on a small scale, and what a great relief it is to find it up and running again after a reboot. ... I still have the drives and will probably put them to use somewhere with redundancy :-)"

Agreed on the linked reference article.

I come across anywhere from 2 to a dozen or so failed drives per year, give or take. Sometimes putting the drive in a test environment where it doesn't get as hot will appear to solve the issue. Sometimes a reboot will solve it, and often neither will.

I have to wonder whether the tests in the article addressed relocating the suspect drive to a cooler environment and/or one that did not vibrate as much. I also wonder whether they let a drive run for weeks after an unexpected failure. These variables can have a huge impact on reliability studies, and the article didn't mention them, or if it did, I missed it.

But anyhow, I agree that once a drive shows itself as faulty in a production environment, I would refuse to return it to production unless I'd established with certainty that something other than the drive was the culprit. I suppose there is a cost point where it is worth using a suspect drive, but it would be a small dollar value. Drives don't cost enough to justify jeopardizing even an hour of time for the five or more people who rely on the drive. Doing so where twenty or more people use the drive amounts to stupid management.

Of course some business IT monkeys don't even bother to label drives with the date they were placed in service, but I digress.
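The downtime arithmetic in the post above is easy to make concrete. A rough sketch, with hypothetical figures for drive price and the hourly cost of each affected person:

```python
# Sketch: compare a replacement drive's price against one hour of downtime
# for the people who depend on it. All dollar figures are hypothetical.

def downtime_cost(people: int, hourly_cost_per_person: float, hours: float) -> float:
    """Total cost of an outage affecting `people` for `hours`."""
    return people * hourly_cost_per_person * hours

drive_cost = 150.0                            # hypothetical replacement price
one_hour_5 = downtime_cost(5, 50.0, 1.0)      # 5 people at $50/hr
one_hour_20 = downtime_cost(20, 50.0, 1.0)    # 20 people at $50/hr

print(one_hour_5)    # 250.0 -- already exceeds the drive's price
print(one_hour_20)   # 1000.0
```

Even with conservative hourly rates, a single hour of lost time for a handful of users outweighs the drive, which is the economic point being made.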

John.Murray
Sr. Member
Posts: 893
« Reply #6 on: March 28, 2013, 12:10:05 AM »
Quote from: Justan on March 25, 2013, 12:28:38 AM
"I have to wonder whether the tests in the article addressed relocating the suspect drive to a cooler environment and/or one that did not vibrate as much. I also wonder whether they let a drive run for weeks after an unexpected failure."
The article simply says what it says; what I find fascinating is a data center agreeing to share specific statistics with the author and allowing them to be published.  Remember, the data center referred to has ~2 million spindles.

What bothers me the most is the idea of a $10-15 controller having a "glitch" that ends up failing the array it's a member of.  As the author states, this would be completely unacceptable in other controllers such as automotive & industrial ECUs.  Even so, his speculation regarding accelerator pedal malfunctions in a handful of Toyotas is probably spot-on.

As far as testing goes, you'll usually see Kesender test units, or a custom equivalent (we have the one linked).  Again, we *never* return failed units to production, and I personally don't know anyone who does.  Besides testing, we use our rig to securely erase failed/retired drives.

« Last Edit: March 28, 2013, 12:26:20 AM by John.Murray »

robo60
Newbie
Posts: 1
« Reply #7 on: October 31, 2013, 05:23:01 PM »
Quote from: John.Murray on March 28, 2013, 12:10:05 AM
"The article simply says what it says; what I find fascinating is a data center agreeing to share specific statistics with the author and allowing them to be published. ... Again, we *never* return failed units to production, and I personally don't know anyone who does."
Hi everyone. I am an occasional reader of this forum and web site. I wrote the blog post referenced, and it's a pleasure to see you discussing it.

A few things as follow-up. 1) It turns out I was pretty much on the money about Toyota's acceleration problems; the transcripts from the lawsuit released over the last week pretty much confirm it. 2) You're absolutely right: typically no enterprise will return drives to production. There are some insidious aspects to that. A drive controller chip costs ~$4-$7. A disk costs an OEM ~$40 to $200 depending on "quality", performance, and volume, but they'll often charge you $400-$600 for the drive. What's more, they charge you a big annual service fee which includes the drive replacements. They make a lot of money on that service, so they like you returning drives. It's a way to keep a revenue stream.
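The markup implied by those figures is worth working out. Using only the ranges quoted in the post (no other assumptions):

```python
# Sketch: OEM drive markup using the ranges quoted above.
# OEM drive cost ~$40-$200; customer is charged $400-$600.

oem_cost_low, oem_cost_high = 40, 200
charge_low, charge_high = 400, 600

# Best case for the buyer: lowest charge over the most expensive drive.
min_markup = charge_low / oem_cost_high    # 400 / 200 = 2x
# Worst case: highest charge over the cheapest drive.
max_markup = charge_high / oem_cost_low    # 600 / 40 = 15x

print(f"markup ranges from {min_markup:.0f}x to {max_markup:.0f}x")
```

A 2x to 15x spread on the hardware alone, before the annual service fee, makes clear why a vendor is happy to have drives returned rather than triaged.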

In contrast, it turns out array vendors like EMC and NetApp have been managing this stuff behind the scenes for years. They do resets. Rumor is they even do full reformats, essentially remanufacturing the drive in place. And of course the guys who really have financial incentives to get it right, many with millions of drives (Google probably has 10 million), keep the drives in service.

On heat and vibration: these are usually not a factor if servers are designed properly, and the datacenters where this information was gathered are very well designed. So they should not be a factor at all.

Rob

