Author Topic: Testing protocol for new hard disk drives?  (Read 4079 times)
Ellis Vener
« on: November 26, 2012, 09:19:27 AM »

Based partly on recommendations in this thread http://www.luminous-landscape.com/forum/index.php?topic=72320.msg575810#new and on other research, last week I purchased 5 Western Digital 3 TB Red drives for a RAID configuration (I will probably go with RAID 5). I ordered 5 instead of just 4 so I will have a spare on hand, and in case one tests bad straight out of the box.

The question now is: how to test the drives once they arrive?

Any protocol or tools you recommend?
I already have current versions of DiskWarrior and Techtools

Ellis Vener
http://www.ellisvener.com
Creating photographs for advertising, corporate and industrial clients since 1984.
Steve Weldon
« Reply #1 on: November 26, 2012, 09:53:28 AM »

I run SMART tests routinely, every time I handle a mechanical hard drive. I also test the drive as part of overall system tests: I spin the machine up to near its maximum load and leave it there for at least as long as it takes to settle on cooling-fan settings (noise vs. performance), and for as long as 48 hours if I have the luxury.

----------------------------------------------
http://www.BangkokImages.com
Justan
« Reply #2 on: November 26, 2012, 09:58:38 AM »

The RAID controllers I've worked with do a great job of testing/diagnostics when they initialize the drives.

Ellis Vener
« Reply #3 on: November 26, 2012, 11:42:43 AM »

So, Justan, no need to run tests on the individual drives before installing them?

Justan
« Reply #4 on: November 26, 2012, 12:50:03 PM »

^Not IMO.

When I do server builds, typically I let the install process do its thing and then run the system for a couple of weeks before putting the box in service.

In the distant past I did stress tests on the major components, but they never revealed anything that just running the box for a while didn't reveal anyway, so I dropped the habit.

Plus, of course, the beauty of RAID 5 is that there is built-in fail-over ability for everything except, of course, the controller(s). My current in-house servers use Dell PERC RAID controllers, and I run two 3-drive RAID 5 arrays, each with its own controller. Nice performance, that.
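That fail-over ability rests on XOR parity: any one lost chunk in a stripe can be rebuilt by XOR-ing everything that survives. A minimal Python sketch (the function name and sample data are mine, purely for illustration):

```python
def parity(blocks):
    """XOR equal-length blocks together; in RAID 5 this is the parity chunk."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# One stripe of a 3-drive RAID 5: two data chunks plus their parity
d0, d1 = b"photo-archive", b"stripe-chunk!"
p = parity([d0, d1])

# Lose d1 (a drive dies): the XOR of the survivors rebuilds it exactly
assert parity([d0, p]) == d1
```

The same XOR also shows why a single undetected corrupt chunk poisons a rebuild: the reconstruction is only as good as every surviving block.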

As an opportunity to spout, even with RAID 5 you still need multiple reliable backups.

Ellis Vener
« Reply #5 on: November 26, 2012, 12:52:33 PM »

> As an opportunity to spout, even with RAID 5 you still need multiple reliable backups.

Of course. RAID arrays are just mass-storage mechanisms/strategies, not in and of themselves backup solutions.

Justan
« Reply #6 on: November 26, 2012, 12:55:32 PM »

> As an opportunity to spout, even with RAID 5 you still need multiple reliable backups.
>
> Of course. RAID arrays are just mass-storage mechanisms/strategies, not in and of themselves backup solutions.

Yes.

You might be amazed how many do not understand that detail.

elolaugesen
« Reply #7 on: November 26, 2012, 04:29:11 PM »

And then make sure you store a backup copy off-site/away from home.

Finally: test your restore procedures by actually using the backup files. This should be done fairly frequently.

Cheers, elo
John.Murray
« Reply #8 on: November 27, 2012, 12:25:52 AM »

You can't beat Iometer. For servers and critical applications, we test each and every spindle individually for 24 hours before putting it into service:
http://www.iometer.org/doc/downloads.html

We use 8 KB blocks with an 80% random factor, which basically beats the crap out of the drive... ;)


If you are running windows, try SQLIO:
http://www.microsoft.com/en-us/download/details.aspx?id=20163

+1 backups
+1-squared testing backups

Ellis Vener
« Reply #9 on: November 27, 2012, 08:20:56 AM »

Unfortunately, John, none of those applications are for OS X.

Vladimirovich
« Reply #10 on: November 27, 2012, 08:52:26 AM »

> Plus, of course, the beauty of RAID 5 is that there is built-in fail-over ability for everything except of course the controller(s).

When one drive fails in RAID 5 and you replace it, the rebuild process strains the remaining old drives, so there is a very good chance of one of those old drives failing during the rebuild... RAID 6 is better.
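The rebuild risk can be put in rough numbers. A degraded rebuild must read every surviving drive end to end, and consumer drives are commonly specced at one unrecoverable read error (URE) per 10^14 bits. A back-of-envelope sketch (not a measurement; the 1e-14 spec rate is an assumption, and real-world rates vary widely):

```python
def p_unrecoverable_read(capacity_bytes, drives_read, ure_per_bit=1e-14):
    """Chance of at least one unrecoverable read error while a rebuild
    reads every surviving drive end to end, treating each bit as an
    independent trial at the quoted spec rate."""
    bits_read = capacity_bytes * 8 * drives_read
    return 1 - (1 - ure_per_bit) ** bits_read

# Degraded 5-drive RAID 5 of 3 TB disks: the rebuild reads the 4 survivors
print(f"{p_unrecoverable_read(3e12, 4):.0%}")  # roughly 62% at the spec rate
```

At that spec rate, a large-array RAID 5 rebuild is closer to a coin flip than a sure thing, which is exactly why the second parity block of RAID 6 matters.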
tived
« Reply #11 on: November 27, 2012, 10:51:03 PM »

I second the latter, and run that over 8 WD Red 3 TB drives.

Henrik
K.C.
« Reply #12 on: November 27, 2012, 11:20:56 PM »

Are you really going to run your RAID 24/7/365 and demand server-level data transfers?

I'd bet that's not the case. Something tells me this is a working drive or archive for photos, so comparisons to the demands of a server aren't really valid. Unless you really want to use a lot of electricity and tell all your friends how cool your RAID is.

I'm a systems administrator and have run HP servers with RAID 5 for years. It's appropriate for an enterprise-level system, but I use real enterprise drives as well, not the WD Reds. I did just buy a pair of WD Reds for a Synology NAS at home.

Working photos and archives are all on WD or Seagate drives in Newer Technology Guardian MAXimus hardware RAID 1 storage. I turn them off when I'm not working, and duplicate drives are stored off-site.

Horses for courses.

Justan
« Reply #13 on: November 28, 2012, 10:26:52 AM »

> When one drive fails in RAID 5 and you replace it, the rebuild process strains the remaining old drives, so there is a very good chance of one of those old drives failing during the rebuild...

I agree that there is always a chance of multiple drive failures. This is why multiple backups are important no matter the RAID level, because of the first (figurative) law of working with computers: chit happens.

> RAID 6 is better.

The additional parity block is an improvement over RAID 5. However, according to some sources, RAID 6 does slower writes than a similar RAID 5 array. If true, one takes a performance hit on an ongoing basis in trade for something that might happen someday.

Is the extra parity block worth the ongoing performance hit? I could argue it equally in agreement or disagreement, and I bet you could too.
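For what it's worth, the write penalty is easy to quantify for small random writes under the classic read-modify-write scheme: the controller reads the old data chunk and each old parity chunk, then writes them all back updated. A trivial sketch of that count:

```python
def small_write_ios(parity_chunks):
    """I/Os for one small random write under read-modify-write:
    read the old data chunk and each old parity chunk, then write
    the new data chunk and each new parity chunk."""
    return 2 * (1 + parity_chunks)

assert small_write_ios(1) == 4  # RAID 5: 4 I/Os per small write
assert small_write_ios(2) == 6  # RAID 6: 50% more write I/O; reads are unaffected
```

Full-stripe sequential writes avoid most of this, which is why the penalty matters far more for random-write workloads than for bulk photo transfers.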

Vladimirovich
« Reply #14 on: November 28, 2012, 01:42:16 PM »

> I agree that there is always a chance of multiple drive failures.

The chance is bigger because rebuilding is a bigger workload; that's the whole point...

> The additional parity block is an improvement over RAID 5. However, according to some sources, RAID 6 does slower writes than a similar RAID 5 array. If true, one takes a performance hit on an ongoing basis in trade for something that might happen someday.
>
> Is the extra parity block worth the ongoing performance hit? I could argue it equally in agreement or disagreement, and I bet you could too.


Slower writing; no hit for reading.
TStanding
« Reply #15 on: November 29, 2012, 10:31:05 PM »

Download a copy of SoftRAID from our web site (www.softraid.com).  You can use it for free for 30 days during the evaluation period.  We don't offer RAID 5 (yet), but we have a great disk certify feature which you can use for testing new disks and making sure they are reliable.

When certifying a disk, it does up to 8 full passes on each disk, writing a pattern out to the entire disk and then reading it back to ensure there are no errors. You can do all 5 disks at once (assuming you have a way to connect them to your Mac).

I usually use 3 passes and run it on every device I purchase before I start using it (disks, SSDs, CF cards, SDHC cards, thumb drives, etc.). It has saved me more than once.
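The certify pass described above (fill the whole device with a pattern, read it back, compare) can be sketched in a few lines. This toy version targets an ordinary scratch file rather than a raw device, and the particular patterns are my choice, not SoftRAID's:

```python
import os
import tempfile

def certify(path, size, patterns=(b"\x00", b"\xff", b"\xaa")):
    """One pass per pattern: fill the target with the pattern, read it
    back, and collect any pattern that fails to verify."""
    failed = []
    for pat in patterns:
        expected = pat * size
        with open(path, "wb") as f:
            f.write(expected)          # write pass
        with open(path, "rb") as f:
            if f.read() != expected:   # verify pass
                failed.append(pat)
    return failed

# Three clean passes over a 1 MiB scratch file
fd, name = tempfile.mkstemp()
os.close(fd)
assert certify(name, 1 << 20) == []
os.remove(name)
```

Against a real disk you would open the raw device node with appropriate privileges and work in large aligned chunks instead of one buffer; the logic stays the same.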

Tim Standing
SoftRAID LLC

BartvanderWolf
« Reply #16 on: November 30, 2012, 04:43:22 AM »

> It has saved me more than once.

Hi Tim,

How's that? After initial formatting, all sectors should be good for use. With modern drives, when sectors get marginal during their lifetime, the data will automatically be relocated and the sectors disabled. What does running a write/read test reveal other than a total mechanical failure that would also have been detected during normal use?

Cheers,
Bart
TStanding
« Reply #17 on: November 30, 2012, 11:19:43 AM »

> Hi Tim,
>
> How's that? After initial formatting, all sectors should be good for use. With modern drives, when sectors get marginal during their lifetime, the data will automatically be relocated and the sectors disabled. What does running a write/read test reveal other than a total mechanical failure that would also have been detected during normal use?
>
> Cheers,
> Bart

First, my disclaimer: I don't have a lot of inside information for what I am about to say, just empirical observations based on buying 10-20 disks a year (for testing SoftRAID) and the experience of tens of thousands of SoftRAID users.

- I have found that some new disk drives produce read errors during the first few months of use. If you use Mac OS X, you may not see these, as not all read errors get reported to the user. In the past, when I have encountered a disk like this, I could get it to stop producing read errors by writing to every sector on the disk. I don't know what causes this; I just know that writing to every sector and then reading them all prevents it from occurring. I have seen 3 disks like this during the past 10 years (approximately 2-3% of the disks I have purchased). This observation is the reason we added the disk certify function to SoftRAID.

- I find that a small percentage of disks fail in the first couple of days of use. If you put them into use straight away, you risk wasting time setting up a volume only to have to diagnose the problem of a flaky disk a few days later. I encountered this a couple of weeks ago when I set up a RAID volume for testing with a group of USB 3 thumb drives (from a reputable manufacturer). One of them had an intermittent read failure. It took me two days to figure out why my RAID volume was acting up. When I certified all the thumb drives, I found one which frequently failed to read data from a specific block on the drive.

- The study by Google engineers on disk failure (see link below) shows that failure is bimodal, with a disproportionate number of disks failing early and late in their lives. It is better to try to cull these early failures before you put the disk into use. This is borne out by the experience of our users, many of whom have reported new disks failing during certification.

- The same study by Google engineers showed that most SMART data was useless for predicting disk failure. They did find that three SMART attribute values were predictive of disk failure; these values include the count of reallocated sectors. According to the study, if a disk reallocates just one sector, it's much more likely to fail in the next 30-90 days. We use these three SMART attributes in SoftRAID to warn users of the increased likelihood of disk failure.
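A warning check built on that finding needs nothing more than a few zero thresholds. The attribute names below are the standard SMART ones; treating any nonzero reallocated/pending/uncorrectable count as a warning follows the "even one reallocated sector" result, but the exact attribute set is my assumption, not SoftRAID's:

```python
# Standard SMART attribute names; any nonzero raw value is flagged,
# per the finding that even one reallocated sector predicts failure.
WATCH = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable")

def smart_warnings(attrs):
    """Given {attribute_name: raw_value}, return the watched
    attributes whose raw value is nonzero."""
    return [name for name in WATCH if attrs.get(name, 0) > 0]

assert smart_warnings({"Reallocated_Sector_Ct": 0}) == []
assert smart_warnings({"Reallocated_Sector_Ct": 1}) == ["Reallocated_Sector_Ct"]
```

In practice the input dict would come from parsing a SMART report for the drive; the point is that the decision rule itself is a simple nonzero test, not a complex model.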

The study by Google engineers which followed a population of 100,000 disks over 8 months can be found at: http://research.google.com/archive/disk_failures.pdf

Tim
PierreVandevenne
« Reply #18 on: November 30, 2012, 02:45:38 PM »

> - I have found that some new disk drives produce read errors during the first few months of use. If you use Mac OS X, you may not see these, as not all read errors get reported to the user. In the past, when I have encountered a disk like this, I could get it to stop producing read errors by writing to every sector on the disk. I don't know what causes this; I just know that writing to every sector and then reading them all prevents it from occurring. I have seen 3 disks like this during the past 10 years (approximately 2-3% of the disks I have purchased). This observation is the reason we added the disk certify function to SoftRAID.
> ...
> - The same study by Google engineers showed that most SMART data was useless for predicting disk failure. They did find that three SMART attribute values were predictive of disk failure; these values include the count of reallocated sectors. According to the study, if a disk reallocates just one sector, it's much more likely to fail in the next 30-90 days. We use these three SMART attributes in SoftRAID to warn users of the increased likelihood of disk failure.

A reasonable view on the whole. I have observed the same behaviour in quite a few drives, but, and this is a big BUT, the apparent improvement that can be observed after some use is almost always the result of a sector reallocation, which, as you noted, is a bad sign in itself.

One word about the Google study: it found SMART parameters not to be useful from their perspective as a predictor of failure. The stated reason is that 36% of their failed drives showed no SMART anomalies. If you want to develop a predictor, that's annoying: somewhat useful, but not too much. But if you look at the problem from the user's angle, the picture becomes very, very different. I am not going to rewrite the whole paper here, but various errors lead to 10 times, 39 times, or 14 times greater chances of the drive dying in the relatively short term.

If your drive is about to die, SMART will tell you in about 2/3 of the cases.

If SMART tells you something, your risk of failure is at least ten times greater than normal.

I have seen drives with one reallocated sector stay at one for several years. However, in my experience (significant, but nowhere near Google's sample size), an increasing reallocation count is a 100% predictor.
chrismurphy
« Reply #19 on: December 14, 2012, 07:30:07 PM »

The RAID 5 concern is not so much the incidence of a second (full-blown) disk failure, but two unrelated problems that negatively affect a rebuild should one drive fully fail. One is a read error: the actual loss of a sector of data, usually reported as a read error with no other result returned to the kernel. The other is a corrupt parity chunk that goes undetected.

A sector read error during rebuild will cause the rebuild to halt. You can try to restart the rebuild and maybe it won't happen again. If it's a persistent read failure, the array is toast and will have to be re-created from scratch, with data restored from backups.

A corrupt sector that goes undetected by drive ECC will result in corrupt reconstructed data. And that will propagate into your backups, etc.

Anyway, for these and other reasons, I consider RAID 5 to be more trouble than it's worth, for those who care about their time and data, when used with consumer hardware.

Professional photographers in particular, and quite a few amateurs, produce an amount of data comparable to what a business of 100 or even 1,000+ people produces. Of course those businesses have dedicated IT people to handle their storage requirements; photographers don't. As a consequence there's a gap in good storage solutions that are both reasonably priced and reliable.

As for disk testing: again, nearline and enterprise disks are usable right out of the box, as they're tested before they're distributed. For consumer disks I use a combination of several SMART extended foreground tests (read-only) and checks of SMART attributes for concerning changes, interlaced with either ATA Secure Erase (write and read) or writing zeros. ATA Secure Erase is faster since it's done by the drive firmware.

I'm not aware of any GUI apps for OS X that do these things. The de facto standard for monitoring and testing SMART drives, smartmontools, is a set of CLI-only programs. Binaries exist for Linux and Windows, but not for OS X for some reason; instead, smartmontools is available for building from source via MacPorts. I think most people would find it easier to boot from a Linux LiveCD that has smartmontools and hdparm (used for ATA Secure Erase) already on it, assuming they aren't already scared away.

Anyway, I tend to be more of a fan of NAS-based storage, Linux- or FreeBSD-based, which has these tools integrated and lets them be scheduled, along with regular array scrubs. NAS4Free combined with ECC memory and nearline drives is a pretty robust and scalable solution. You do give up some performance versus Thunderbolt or USB 3 solutions for certain operations, but I'll argue for other ways to mitigate this rather than advocate large locally attached storage.