This is lifted from the latest Imaging Resource newsletter (http://www.imaging-resource.com/IRNEWS/).
With the discussion of backup strategies and RAID systems on here, I thought it would be relevant. The rest of the newsletter is worth reading too...
"Publisher's Note: IR Site Returns to Normal
It was a long and hairy battle, but the Imaging Resource servers are now back to normal.
Apologies for our outage last Friday through Sunday. It was testimony that the best laid plans of mice and men (but particularly the latter) are prone to going astray.
The short version of the story is that a bad drive appears to have taken down our entire Redundant Array of Independent Disks (RAID). Compounding matters, problems with the DiskSync backup system meant that a restore that should have taken only a few hours dragged on for almost 20 hours.
A word of warning to users of RAID 5 systems: The redundancy built into RAID 5 will tolerate the failure of any one drive in the system, but only in the sense that it can reconstruct missing data. This is now the third time in my career that I've seen a RAID 5 system fail completely. (I ran a small systems integration company in a previous life, else I probably wouldn't have seen as many.)
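The single-drive tolerance of RAID 5 rests on XOR parity: each stripe stores a parity block equal to the XOR of its data blocks, so any one missing block can be rebuilt from the survivors. The sketch below is a deliberately minimal illustration of that idea only; the block sizes, the 3-data-plus-1-parity layout, and the `parity` helper are invented for this example and do not reflect how any real controller stripes data.

```python
def parity(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data blocks, as if on three drives, plus a parity block on a fourth.
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Drive 2 fails: rebuild its block by XOR-ing the surviving blocks with parity.
rebuilt = parity([data[0], data[2], p])
assert rebuilt == data[1]
```

Note what this math cannot do: it rebuilds one block only when the system correctly knows *which* block is missing. If several drives return corrupt data, as in the bus fault described above, parity has no way to tell which block is the liar, and "reconstruction" just writes back garbage.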
In this case, one of the drives in the array failed in a way that caused it to interfere with data flowing over the SCSI bus. Data from all the drives consequently became corrupt, as did any "reconstructed" data that was written back to fix an apparently failing drive. Because the fault affected data from all drives, the system wasn't able to identify which drive was the actual source of the problem. It did indicate a particular drive in the array, but that drive was actually fine; it just happened to sit at the position along the SCSI bus that was most affected by the bus problem.
So we had to completely wipe the array (actually moving to a new server chassis and full array of entirely new hard drives), reload the OS and restore from our online backups.
This was where the second major hassle developed. We use a system marketed by our ISP under the name of DiskSync. It proved horrendously unreliable. It did indeed preserve all our data (it does seem to do a good job of that), but kept hanging whenever it encountered a symbolic link (or alias) in the directory structure. This meant that the process proceeded by fits and starts, needing to be restarted many times.
It may be we just didn't know critical info about how to use DiskSync, but a utility that purports to be a backup solution for Linux shouldn't hang whenever it encounters a symlink, even in its default configuration.
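The behavior one would expect from a Linux backup tool is to record a symlink as a link (path plus target) rather than follow it, so that a link pointing back up the tree can never trap the scan in a loop. A minimal sketch of that using only Python's standard library (the `scan` helper is hypothetical, not DiskSync's interface):

```python
import os

def scan(root):
    """List files under root, recording symlinks by target instead of
    following them. os.walk(followlinks=False) never descends into a
    symlinked directory, so link loops cannot hang the walk."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                entries.append((path, "symlink -> " + os.readlink(path)))
            else:
                entries.append((path, "file"))
    return entries
```

Even a symlink that points at its own parent directory just shows up as one entry in the listing; the walk terminates normally.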
Going forward, we're going to configure our servers so one of the secondary boxes will be able to stand in for the main server in a pinch. Performance might be lower on the secondary box and some of the housekeeping and deployment services on the primary box won't be supported, but the site itself would be able to stay up and running.
Longer term, we plan to install our own hardware here in Atlanta to have hands-on access ourselves when we need it. This solution will also involve a 24x7 "hot spare" synced with the primary server every couple of hours, so we can transfer operations with the flip of a virtual switch.
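The "hot spare" plan amounts to an incremental mirror: every couple of hours, copy across only files that are new or changed on the primary, so the spare stays close to current at low cost. A real deployment would likely use rsync or similar; the sketch below is a toy pure-Python version of the same idea, and the `sync` helper, paths, and change test (modification time) are all assumptions for illustration.

```python
import os
import shutil

def sync(primary, spare):
    """Mirror primary's tree onto spare, copying only new or newer files.
    Returns the list of destination paths actually copied."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(primary):
        rel = os.path.relpath(dirpath, primary)
        dest_dir = os.path.join(spare, rel)
        os.makedirs(dest_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dest_dir, name)
            # Skip files already mirrored and unchanged since the last run.
            if (not os.path.exists(dst)
                    or os.path.getmtime(src) > os.path.getmtime(dst)):
                shutil.copy2(src, dst)  # copy2 preserves the modification time
                copied.append(dst)
    return copied
```

Because `copy2` preserves modification times, a second run over an unchanged tree copies nothing, which is what makes a frequent sync schedule affordable.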
On the face of it, this would be a more expensive solution than our current one using DiskSync. But when you consider that the revenue we lost as a result of this outage could have paid for the duplicate hardware in one fell swoop, the economics change."