A good-enough ZFS NAS

[followup from October 2010: Don't get WD EADS drives for this, as they fail a lot. Also look for an ethernet controller that allows "jumbo frames" as Intel has decided the motherboard listed here shouldn't support that.]

I've kept an old 2001-era Dell box running OpenSolaris with ZFS for the last few years. Filled it up. But I was so very happy with ZFS that I wanted to build a new box. And who wouldn't want a Netapp-like storage system for cheap?

I thought this would be easy, but it really wasn't, and so I thought I'd write it up for everyone who's in a similar situation. (If you're impatient, you can skip down to the parts list below.)

My "impossible" goals for a ZFS NAS:
  1. Quiet!
  2. Low power
  3. ECC RAM
  4. Big storage (4TB+), with 5-6 disks
  5. Compatible with OpenSolaris
  6. Reasonable cost
If you didn't care about most of those things, you could just get a NetApp for $100,000. But look, you probably didn't do that.

At a medium price point, you could just buy Sun hardware. For example, this guy converted an Ultra 40 workstation to run ZFS, but sometime after that Sun canceled the Ultra 40. Now you can only buy the 4-disk version, and that's actually not enough for a huge NAS when the bootdisk eats one (more on that later).

Rack-mount stuff satisfies most of the above, but it's a little bit expensive, and very loud. And if you want loud, you could just buy a Sun Fire X4140.

Sun has been very slow to make Solaris work on non-Sun hardware, so compatibility turns out to be difficult. You can find people who've read some of the Sun whitepapers and cloned the hardware for a smaller cost. For instance, "Thumper" (Sun's massive ZFS box) apparently uses multiple AOC controllers like this one. The AOC costs $99 and supports 8 SATA disks.

I care about ECC RAM because I decided to worry about bitrot.

Even if you go with the latest generation chips, Intel has a small edge in power consumption. When I started this project, Intel had a 2:1 edge, so I was determined to get an integrated Intel solution with ECC RAM that was reasonably compatible. Sun's last generation hardware was mostly AMD, but the current stuff is Intel Xeon, so all this works ok. Also, Intel's integrated LAN, and SATA chipset (ICH9x) are well supported, which means you don't need add-on cards to get basic stuff done.

Parts list:
  • Supermicro MBD-X7SBL-LN1
  • 4GB ECC RAM
  • 5 x Western Digital Caviar Green 1.5TB (WD15EADS)
  • SATA DVD-ROM (any $20 one)
  • Antec P180 mini to hold 5 drives (I'm using an old Sonata). P183 if you want 6 drives.
  • Intel Xeon E3110. If you're cheap: a non-Xeon E5200. 45nm to save power.
  • OpenSolaris nevada build 101b
Motherboard/RAM: Intel's desktop chipsets don't use ECC, but I found the Intel 3000 chipset, which is sort of a low-end server/workstation chipset for not much more than a desktop board. Also it has 6 SATA ports and integrated video. You don't need the 3010 unless you want cool add-on cards.

The Intel board I ordered was actually really terrible, so my second try was a Supermicro MBD-X7SBL-LN1, also based on the Intel 3000. It's great: micro-ATX, integrated video, wonderful and sensible layout. Slow video, just like you want on a server. In contrast, Intel's board has the power plug in the middle of the board, SATA ports at random angles, and it takes ages to boot (POST). But this Supermicro motherboard is really absolutely wonderful.

4GB of ECC RAM is now mostly free ($50). You could buy 6GB, or 8, or whatever you like. Solaris runs in x64 mode, so all your RAM will get used.

CPU: The cheapest 45nm Xeon I found is the Intel Xeon E3110. 3GHz dual-core seems excessive, but it is a Xeon, so it made me feel better, and it uses less than 40w.

Drives: Western Digital Caviar Green drives use 3.7W at idle. Most drives use 8W. Samsung EcoGreen drives are similar to the Caviar Green. That idle figure isn't for a spun-down drive; the platters are actually spinning.

In contrast, Seagate 1.5TB drives use 8W at idle, and they have firmware bugs. (I just flashed my colo RAID and bricked one of the drives, so I had no interest in upgrading 5 of them.) I went with the Caviar Green.
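To put the idle numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the box runs 24/7 and that the drives sit idle the whole time, which is a simplification; the variable and function names are just for illustration.

```python
# Idle power figures quoted above (watts per drive).
GREEN_IDLE_W = 3.7      # WD Caviar Green
TYPICAL_IDLE_W = 8.0    # "most drives", including the Seagate 1.5TB
N_DRIVES = 5            # drives in this build

# Assumption: the NAS is powered on all year and the drives stay idle.
HOURS_PER_YEAR = 24 * 365


def idle_watts(per_drive_w, n_drives=N_DRIVES):
    """Total idle draw for the whole set of drives, in watts."""
    return per_drive_w * n_drives


saved_w = idle_watts(TYPICAL_IDLE_W) - idle_watts(GREEN_IDLE_W)
saved_kwh_per_year = saved_w * HOURS_PER_YEAR / 1000

print(f"Idle saving: {saved_w:.1f} W, about {saved_kwh_per_year:.0f} kWh/year")
```

Across five drives that is a bit over 20W of difference at idle, which matters when the whole box is supposed to be low power.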

For the moment, 1.5TB drives are priced at about 10c/GB, and 2TB drives are about 15c/GB. So 1.5TB is a nice choice for now.
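Worked out per drive, using only the approximate per-GB prices above (decimal gigabytes, so 1.5TB is treated as 1500GB; the names below are illustrative):

```python
N_DRIVES = 5  # drives in this build

# (capacity in GB, approximate price in dollars per GB) from the paragraph above
options = {
    "1.5TB": (1500, 0.10),
    "2TB": (2000, 0.15),
}


def drive_price(capacity_gb, dollars_per_gb):
    """Approximate price of one drive in dollars."""
    return capacity_gb * dollars_per_gb


for name, (gb, rate) in options.items():
    each = drive_price(gb, rate)
    print(f"{name}: ~${each:.0f} per drive, ~${each * N_DRIVES:.0f} for {N_DRIVES}")
```

In other words, a 2TB drive costs roughly twice as much as a 1.5TB drive for a third more space.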

Really read this: "consumer" drives need you to enable TLER (time-limited error recovery) before you put them in a RAID. Western Digital provides WDTLER.EXE to switch this. What you want is for a drive to fail fast (7 seconds), and tell the OS and RAID about it.

Also, turn on AHCI in your BIOS. Solaris supports it now, and it's faster. (You have to be in IDE mode to run WDTLER, though.)

Booting huge drives, booting ZFS

Drives with sizes above 1TB use a new partitioning scheme. Booting from >1TB drives is totally incompatible with the shipping version of Solaris 10u6. 10u6 will complain about everything regarding a 1.5TB drive. People say that the relevant patches might make it into 10u8, but this isn't a sure thing, and it probably won't ship for a year. Just skip the "stable" Solaris 10.

So in the meantime, you head back to OpenSolaris (the "community" version), which of course has the latest patches. I'm running "nevada" 101b. It supports the new partitioning scheme, and the installer doesn't say anything about your drives being too big. This new build is supposed to provide boot support up to 2TB, but I'm not sure how tested it is.

Also, I was very pleased to learn that ZFS is now bootable.

Oh, but! ZFS is not bootable in RAIDZ (equivalent to RAID-5/6) mode, which you probably want for your main storage. So you need to dedicate some disks to boot ZFS, and create a separate RAIDZ to store data. (I cheated and used a single bootdisk, but I keep good backups.) Yes, I've devoted a 1.5TB drive to booting an OS that uses 20GB. Yeah it would be smarter to pull 20GB from the front of each drive and create a tiny mirror and a bigger RAIDZ. But anyway, I didn't do that. Maybe someone else will.
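Here is a rough sketch of what the two layouts cost in space. It assumes single-parity RAIDZ (one drive's worth of parity), decimal terabytes, and ignores ZFS metadata overhead, so treat the results as ballpark figures. The function name is made up for this example.

```python
DRIVE_TB = 1.5     # size of each drive
N_DRIVES = 5       # drives in the box
BOOT_SLICE_GB = 20 # roughly what the OS needs


def raidz_usable_tb(n_disks, disk_tb, parity=1):
    """Usable space of a RAIDZ group: (disks - parity) * disk size.

    Assumption: single parity by default; real pools lose a bit more
    to metadata.
    """
    return (n_disks - parity) * disk_tb


# Layout I used: one whole drive for boot, RAIDZ across the other four.
as_built_tb = raidz_usable_tb(N_DRIVES - 1, DRIVE_TB)

# Smarter layout: a 20GB slice from every drive for a boot mirror,
# and a RAIDZ across the remainder of all five drives.
data_slice_tb = DRIVE_TB - BOOT_SLICE_GB / 1000
sliced_tb = raidz_usable_tb(N_DRIVES, data_slice_tb)

print(f"Dedicated bootdisk: {as_built_tb:.2f} TB usable")
print(f"Sliced drives:      {sliced_tb:.2f} TB usable")
print(f"Space given up:     {sliced_tb - as_built_tb:.2f} TB")
```

Under those assumptions, the dedicated bootdisk gives up a bit less than one drive's worth of usable space compared with slicing, in exchange for a much simpler setup.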

One caveat: smartmontools doesn't work with the Intel ICH9 ("-d ata" is not implemented). If you need SMART monitoring of your disks, get a Marvell-based controller, like the AOC above.

Now, to summarize, we've got:
  • Integrated video, LAN, ECC motherboard.
  • Low power, server-class parts.
  • Low-power hard drives, updated to enable TLER.
  • Quiet case, can't hear it.
  • 4GB RAM, x64-compatible Intel CPU.
  • >4TB available, with ZFS and snapshots!
  • Cost ~$1500.
  • It Works!

Other options

If you want to save the 1.5TB hassle and the TLER hassle, get the RE2-GP (this class of drive appears to add 70% to the price):


If you are willing to relax my ECC RAM requirement, a world of cheaper options (based on desktop-class hardware) opens to you. Lots of the motherboards seem to mix up two SATA chipsets, which seems really wrong to me. But here are some interesting links I found:



Also, many people have built ZFS NAS with 2 huge drives (mirrored), on an Intel Atom motherboard (the chip used in Netbooks). Unfortunately it is hard to find an Atom motherboard with >2 SATA ports. But this seriously minimizes power, and with the next-generation motherboards will do so even more. If you need 2TB or less this is a very good option.

19 comments:

  1. Nice work guy. It would be very nice to have something similar to openfiler or freenas but with the good zfs under clothes...

  2. Exactly how important is ECC RAM, though? I've been reading and nobody seems to be able to agree on whether ECC is important for home servers where 100% uptime isn't vital. Also, I'm not sure there are many reasonably priced server boards available where I am. Any idea if there are any Core 2 Duo boards that support ECC and have 8+ SATA ports and a couple of PCI-e slots?

  3. Recent numbers here:
    http://www.morganclaypool.com/doi/pdf/10.2200/S00193ED1V01Y200905CAC006

    "A recent study by Schroeder et al. [75] evaluated DRAM errors for the population of servers at Google and found FIT rates substantially higher than previously reported (between 25,000 and 75,000) across multiple DIMM technologies. That translates into correctable memory errors affecting about a third of Google machines per year and an average of one correctable error per server every 2.5 hours. Because of ECC technology, however, only approximately 1.3% of all machines ever experience uncorrectable memory errors per year."

    As for boards, most of the 8-port boards (e.g., ASUS) use two different 4-port controllers (one Intel, one something-else). This would probably work, but if I wanted 8 ports, I would probably use the AOC controller since it's only $99.

  4. Hi. Would you be able to tell me how many watts your setup is pulling at the wall?

    I once had S3000AHLX board that idled around 90w (with e4500 and 1 hdd), and now I'm thinking of getting an ASUS AM2 board since it has ECC support with much lower power usage at much lower price.

    If your Supermicro board is more power efficient than the Intel board I had, I might consider getting core 2 setup again. Thanks.

  5. <80W per the kill-a-watt (with 5 caviar green drives and 4 sticks of RAM).

  6. I recommend putting the OS on a pair of mirrored thumb drives and limit the writes by moving “/var” and “swap” to the spinning disks.

  7. Are thumb drives as reliable as "real" SSD right now? I don't really know the reliability trade-off.

  8. Thanks for the write-up. A friend also suggested this MB, so I got one.
    I'm an OpenSolaris (0906) newbie and am getting (in /var/adm/messages):

    Aug 15 19:47:14 foo unix: [ID 954099 kern.info] NOTICE: IRQ18 is being shared by drivers with different interrupt levels.

    That IRQ is listed by "mdb -k" as

    18 0x82 9 PCI Lvl Fixed 1 4 0x0/0x12 0, ata_intr, uhci_intr, uhci_intr

    Did you see this too? How did you fix it? The BIOS doesn't offer IRQ options except for the serial ports (IRQ 3 vs. 4).

  9. Can you use AHCI mode instead of regular ATA mode?

  10. Thanks. That did the trick!

  11. I have a serious impression that the ECC requirement is not so strong as some suggest.

    If a NAS provides a non-ECC workstation with iSCSI "disks" to be used, the workstation would format the "disks" and therefore the data would be once again exposed to the memory errors ECC is supposed to detect and correct.

    Does that make sense?

  12. Good work! I am very nearly at the point of trying to duplicate your server. Have you considered adding (or have you already added) more information somewhere for those of us who want such a server and can build and use one, but aren't able to come up with the design on our own?

    A kind of mini reference design would be very nice.

    Or possibly, how many beers would I need to buy you to get you to do some coaching/advice on how to get my own running? 8-)

  13. Why do you use the expensive xeon over a core 2 duo ?

  14. @erik - I think the chip I have is basically an E8400, but the prices were very similar at the time. I guess I always believe Xeon has better thermal management, even when it's not true. :)

  15. your Supermicro board is more power efficient than the Intel board I had, I might consider getting core 2 setup again. it will help me in my Project Thanks.

  16. To those doubting the need for ECC - I say it depends on what you need to store. For me, my primary storage concern is home videos and pictures, which to me are impossible to put a price on.

    Here's a true story. A year or two ago, my process for backing up my important media was to plug in the SD card containing the pictures/video into my desktop Windows PC, then copy the files over to my Linux fileserver in the basement using either the Samba share + a manual MD5 check or just use rsync (on Cygwin). After that the SD card is formatted, permanently erasing the original media files.

    Anyway, while hash checking some files that I had copied over, I got hash check errors, signaling that the files were not a 1:1 binary match after the transfer. It took me a little while to stop blaming the Windows-Linux Samba bridge for the issue and finally run memtest86 and discover that the RAM was the culprit. It scared the daylights out of me. I didn't suspect it because the system ran remarkably stable for having corrupt memory issues (I didn't see any obvious issues in the system error logs, either). I've had Windows desktops act really weird when memory goes corrupt, so I'm usually not surprised at that point when I test the RAM and it ends up being corrupt.

    The file in question was a JPEG photograph of my daughter; opening the file worked fine and did not look any different to the human eye than the original. However, if the wrong bit(s) got flipped, I'm sure it could have completely corrupted the file. Other media filetypes may have a higher probability of corruption with randomly fiddled bits, as well.

    My point is that the showstopper (kernel panic / bluescreen) blowups are not what scare me - it's the silent corruption of your important data that is far more sinister and frightening. Now, you may be willing to take that chance at the tradeoff of slightly cheaper components, but I am certainly not. If you do decide to go non-ECC, I would encourage you to always use rsync or something that automatically does post-transfer file integrity checks.

    P.S. The RAM in question was Crucial Ballistix, which in hindsight has ended up being a very failure-prone model of RAM, although it was commonly the cheapest available for the high performance balance it offered.

  17. Did you try smartmon with "-d sat,12" to use the 12 byte sata format?

    That works for me with the Intel D510 motherboard...

  18. TLER should be enabled and set to 7s or less for RAID, but only for hardware RAID.

    For software RAID, disable TLER completely and let the OS handle errors, i.e. disable it for Linux software RAID and ZFS.

    One should not be using hardware RAID with ZFS, hence my suggestion to disable it completely.

  19. The point of ECC is so the ZFS system does not introduce errors/malformed data due to RAM which are then written to disk, perhaps backed up, deduplicated or archived.

    If bit-rot means nothing to you on 1TB+ drive RAID arrays, by all means use normal RAM. Ignorance is bliss.
    ECC RAM complements ZFS integrity checking and should be used without question on such a system.

    If your user writes garbage to your ZFS iSCSI target because their RAM is faulty, it is not the fault of ZFS.
    Taking that same user's good data from their iSCSI disk space and then unknowingly corrupting it due to bad RAM is what ECC can mitigate.


    Storing and backing up data while maintaining its integrity is more complex than it appears at a casual glance.

    http://serverfault.com/questions/77710/is-bit-rot-on-hard-drives-a-real-problem-what-can-be-done-about-it

    http://blog.fosketts.net/2011/07/11/dropbox-data-format-deduplication/

    http://blogs.oracle.com/elowe/entry/zfs_saves_the_day_ta
