Bitrot, huge disks, and RAID

A few months ago, I had a bad experience with rotting JPEGs. I stored them on a 3ware RAID with buggy firmware, and it took me a darned long time to figure out why 10% of my pictures were corrupt. (Flashing the firmware seems to have helped.) Most of the errors were tiny--one bit changed in a 10MB file--but that was enough to make the pictures impossible to recover.

As a result, I've become obsessed with understanding the bitrot problem and figuring out practical ways to solve it.

So let's take it from bottom to top:

1. Big disks, RAID, scrubbing

We're seeing enormous disks right now (2TB!), and on disks that size, sectors fail at a non-trivial rate. If you're storing 2TB of data you can't afford to lose, you should be guarding against sector- and bit-level errors with something like RAID-1. You should also have your RAID do weekly verifies so that failed sectors can be rebuilt from the second disk. Your RAID won't do this unless you tell it to, so figure out how.

How this works: when you do a weekly "scrub," your RAID controller or Linux software RAID reads every sector on every disk. If a read fails, the disk reallocates the bad sector and the controller copies the data back from the good disk. (If you were using just one disk, you'd have lost some bits from that sector, or everything in it.) You are much less likely to see the same sector fail on two disks at once, so this kind of scrubbing works really well.

Scrubbing is easy. Your hardware RAID probably has a "verify" scheduler. If you're using Linux software RAID, you can put a script like this in your cron.weekly:

#!/bin/sh
# kick off a full read/verify pass on each md array
echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action
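
If you want to see how a check went, md exposes the status in /proc and sysfs (the paths below assume your arrays are md0 and md1, as in the script above):

# progress of a running check shows up here
cat /proc/mdstat

# after the check finishes, a non-zero value here means mismatched blocks were found
cat /sys/block/md0/md/mismatch_cnt
cat /sys/block/md1/md/mismatch_cnt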

You don't have to do much more until a disk fails. But what you've done is ensure that media errors don't corrupt your data.

But you must scrub. You can't wait for a disk to fail, not any more. If you do that, you will find out about bad sectors when you're rebuilding the RAID (errors will show up on the "good" disk), and lose some data that way. Weekly scrubs are recommended.

2. Your RAID controller is buggy. Your disk controller flips bits randomly. Your bus scrambles data. You have bad RAM.

These things are not handled well by most systems today. If you're using a system to store backups, you can always keep checksums and do a full verify of your backups. This is an especially good idea because it helps you detect hardware problems before they creep into data for which you have no second copy.
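
As a rough sketch of what I mean (the paths are made up, and this assumes GNU coreutils): record a checksum for every file in the backup once, then re-verify the whole tree on a schedule.

# one time: record a checksum for every file in the backup
cd /backup/photos && find . -type f -print0 | xargs -0 sha256sum > ~/photos.sha256

# every verify run: re-read everything and print only the files whose contents changed
cd /backup/photos && sha256sum --quiet -c ~/photos.sha256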

My issue was buggy RAID firmware, but bad RAM is the most likely culprit, so use ECC. It's hard to find cheap motherboards that support ECC (thanks, Intel?), but it's important.
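
If you're not sure whether a box actually has ECC working (as opposed to ECC DIMMs sitting in a board that ignores them), one way to check on Linux is to ask the BIOS via dmidecode (treat this as a hint rather than proof, since it only reports what the firmware claims):

# "Single-bit ECC" or similar means the memory controller is using ECC; "None" means it isn't
sudo dmidecode --type memory | grep -i 'error correction'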

3. End to end checksumming

The only filesystem that actually checks that the hardware is working end to end is ZFS (available in OpenSolaris). I recommend it for a lot of reasons.

ZFS keeps a checksum for every block and verifies it on every read, so it can catch silent corruption (e.g., bits your controller flipped, which your disk's own checksumming would never notice). This is better in almost every way than playing "trust me" with your hardware.

You still have to schedule weekly scrubs (zpool scrub; what ZFS calls "resilvering" is the separate job of rebuilding a replaced disk).
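
For the curious, the whole setup is only a couple of commands (a sketch; "tank" and the device names are placeholders for your own pool and disks):

# build a mirrored pool; every block gets a checksum that is verified on every read
zpool create tank mirror c0t0d0 c0t1d0

# weekly, from cron: read and verify every block, repairing from the other half of the mirror
zpool scrub tank

# shows any checksum errors that were found (and fixed), per device
zpool status -v tank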

Hardware for storage

In the meantime, I'm having a reasonably difficult time finding a server that is low-power, low-noise, has ECC RAM, and works with OpenSolaris. If you give up on low-noise or ECC, things get easier, but I do love over-constrained problems. Leave comments with good links, if you have any.

6 comments:

  1. For all of these reasons, I've decided to go with a Mac Pro. Yes, they're over-priced, but you get ECC memory and vendor-supported ZFS, which can run on up to four internal disks. (ZFS was announced as a feature of OS X 10.6 Server; it's unclear whether it can be kludged together on the 10.6 client, but you can download it now from macosforge, which is at least somewhat promising.)

    Presently, I'm playing it fast and loose with disk striping plus Time Machine backups. If a disk dies, I'm fine. If I get silent corruption, I may or may not be able to recover from the backup, which itself may or may not have had silent corruption.

    If I was willing to put up with more pain, I'd probably get one of these Windows Home Server boxes and try to shoehorn OpenSolaris onto it. That's less attractive because network filesystems are nowhere near the speed of local ones, even over fast networks. Plus I want all of the weird Apple semantics to "just work" which means I really do want the Apple port of ZFS.

  2. I would encourage you to look at the Dell Precision 3400 or 5400 platforms--they support ECC and are generally of a higher build quality than the consumer lines. They don't have hot-swap drives or power supplies, but that's not crucial for a home server. They have room for 2 internal SATA drives, so with a pair of WD Green drives you can get up to 4TB in a single box with good noise and thermal levels. You can find people on eBay selling new grey-market motherboard/PSU/chassis systems relatively cheap. I was able to put together a quad-core/8GB RAM/2TB system for under $1000 in Q1 2009. I don't want to promote any specific seller, but I can provide specifics about my situation if you are still in the market.

  3. I solve this problem differently. I don't try to prevent bitrot, because it will happen. Instead, I keep multiple copies of the data on different systems, using different technologies, and I scrub and verify those copies. See longtermarchive.com for an overview.

  4. longtermarchive.com says "It Works!" - any more clues?

  5. Anonymous:

    ZFS on Mac is dead, it appears. Check the link.

    I'm using an OpenSolaris server here just to get ZFS.

  6. ZFS on Mac is not dead, it just isn't supported by Apple itself anymore:

    Maybe without Apple support you won't see many updates to ZFS, but you won't for OpenSolaris either (since it's dead).
