ZFS NAS followup: SSD is amazing

I've been running my ZFS NAS for about a year. Since then I've upgraded many times (from build 101b to snv132, where I am now), and I've enabled dedup for the storage pool.

Here are a few notes and updated recommendations:

Single-parity RAIDZ makes me nervous; dual-parity RAIDZ2 is better for integrity and a group of mirrors is better for speed. Of the 5x1.5TB Caviar Green drives I started with, I've replaced 2 due to small failures that ZFS detected. (I can't easily upgrade to RAIDZ2.)

Weekly scrubs find errors, but you have to do a little work to optimize for scrub time. At one point, my scrubs took 80 hours, and now they take about 16 hours for a larger amount of data. What helped? A few things:
  • Disable access time (atime). Otherwise, every read updates file metadata, so if you verify lots of files daily, each snapshot ends up holding its own copy of that metadata, and scrubs take dramatically longer. Disabling atime also gives you a general performance boost.
  • Install a newer ZFS that has metadata prefetch during scrub (I think this was added in b129).
  • One of my datasets had 4 million files, 20 snapshots, and compression turned on, and destroying this dataset reduced my scrub time a lot. (It was a Mac rsync backup, which now uses Time Machine instead.) I think compression can slow down scrub, but it might be unrelated.
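
As a rough illustration of the first tip, here is a small Python sketch that only builds (and prints) the commands involved; it does not run them. The pool name "tank" and the dataset "tank/backup" are made-up placeholders, not my real layout. `zfs set atime=off`, `zpool scrub`, and `zpool status` are the standard commands.

```python
import shlex
from typing import List


def atime_off_command(dataset: str) -> List[str]:
    """Command that disables access-time updates on one dataset."""
    return ["zfs", "set", "atime=off", dataset]


def scrub_command(pool: str) -> List[str]:
    """Command that starts a scrub of the whole pool."""
    return ["zpool", "scrub", pool]


def scrub_status_command(pool: str) -> List[str]:
    """Command that reports pool health, including scrub progress."""
    return ["zpool", "status", pool]


if __name__ == "__main__":
    # "tank" and "tank/backup" are hypothetical names used for illustration.
    for cmd in (
        atime_off_command("tank/backup"),
        scrub_command("tank"),
        scrub_status_command("tank"),
    ):
        print(shlex.join(cmd))
```

Child datasets normally inherit atime from their parent, so setting it once near the top of the pool is usually enough.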

Dedup eats write performance, and you must use an SSD

When I first enabled dedup and replayed all my datasets (zfs send/recv), I was able to write to the dedup'd RAIDZ volume at only 3MB/sec! Previously I could write at 60MB/sec.

The best theory I have is that the DDT ("dedup table") was using more space than I have RAM, so even a small write required a very large number of reads from disk. Nothing helped much until I put an SSD into the pool. The DDT is now cached, and I can usually write at 25MB/sec.
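
To get a feel for why the DDT can outgrow RAM, here is a back-of-the-envelope Python sketch. It assumes roughly 320 bytes of memory per DDT entry, which is a commonly quoted rule of thumb and not something I measured, and it assumes one entry per unique block. The pool sizes in the example are made up.

```python
# Assumption: ~320 bytes of RAM per in-core DDT entry (rule of thumb only).
DDT_ENTRY_BYTES = 320


def ddt_entries(unique_data_bytes: int, avg_block_bytes: int) -> int:
    """Approximate number of DDT entries: one per unique block (rounded up)."""
    return -(-unique_data_bytes // avg_block_bytes)


def ddt_ram_bytes(unique_data_bytes: int, avg_block_bytes: int,
                  entry_bytes: int = DDT_ENTRY_BYTES) -> int:
    """Approximate memory needed to keep the whole DDT cached."""
    return ddt_entries(unique_data_bytes, avg_block_bytes) * entry_bytes


if __name__ == "__main__":
    TiB = 2 ** 40
    GiB = 2 ** 30
    KiB = 2 ** 10
    # Hypothetical pool: 3 TiB of unique data with a 64 KiB average block size.
    need = ddt_ram_bytes(3 * TiB, 64 * KiB)
    print(f"estimated DDT size: {need / GiB:.1f} GiB")
```

Under these assumptions the table for a few terabytes of small-block data runs to many gigabytes, far more than a home NAS has in RAM, which fits the slowdown I saw.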

Also, "upgrading" to dedup is somewhat difficult and time-consuming. There is no automatic way.

However, my "dedupratio" is 1.7 for the volume now, so even though the write speed isn't as good as before, the results are amazing, and I can tolerate it for the storage efficiency. Read speed is as good as before, too.
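
For a sense of what a dedupratio of 1.7 means in disk space, here is a tiny Python sketch. It assumes the ratio is simply logical data divided by physically stored data, which is my reading of the property and not an official definition.

```python
def space_saved_fraction(dedupratio: float) -> float:
    """Fraction of space saved, assuming dedupratio = logical / physical."""
    if dedupratio < 1.0:
        raise ValueError("dedupratio is expected to be at least 1.0")
    return 1.0 - 1.0 / dedupratio


if __name__ == "__main__":
    print(f"{space_saved_fraction(1.7):.1%} of the space is saved")
```

Under that assumption, 1.7 works out to storing the same data in a bit under 60% of the space.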

SSD+ZFS is magic

ZFS is the first system that makes a tiny $90 SSD super-useful. With normal filesystems, you have to manually move "hot" data (like your OS) to the drive, and then you run out of space or spend $1000 to get more. ZFS does this automatically, using the SSD as a cache. I got the OCZ Vertex 30GB drive, and while I know there are faster Intel drives, this has made an enormous difference.

As I said above, the SSD has improved my dedup write speed by 8x. It also serves as a cache of hot data, so if you read a lot of filesystem metadata (like you would by compiling over NFS), the pool can perform 50x faster than it does without the SSD. (This 50x number is based on a benchmark of opening all the files in a folder, reading the first 30k, then closing.)

Also, the SSD acts as a "log" device and can handle small writes much better than disks can. So when an NFS client wants to do a small write, the ZFS NAS can respond dramatically faster than a disk-based server can, but still can guarantee data integrity.

There is some debate about which devices are suitable for use as a ZFS log. Some devices may slow down a fast mirror by writing more slowly than the disks would. Also, the device isn't allowed to do any RAM buffering to improve write speeds (and my OCZ Vertex might do this wrong). But in the meantime, NFS and CIFS are quite a lot faster, so I will pretend that I'm not really in much danger of data loss. Currently, about 2GB of my SSD is devoted to log, and the rest to cache. Here's a solaris-discuss thread that says the OCZ is slower than it seems to me. My NFS compiles are incredibly fast right now.
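
For the log/cache split, a minimal Python sketch of the commands looks like this. It only builds and prints them. The pool name "tank" and the slice names are hypothetical (yours will differ), and it assumes the SSD has already been partitioned into a small slice for the log and a larger one for the cache.

```python
import shlex
from typing import List


def add_log_command(pool: str, device: str) -> List[str]:
    """Command that adds a device (or slice) to the pool as a separate log."""
    return ["zpool", "add", pool, "log", device]


def add_cache_command(pool: str, device: str) -> List[str]:
    """Command that adds a device (or slice) to the pool as a read cache (L2ARC)."""
    return ["zpool", "add", pool, "cache", device]


if __name__ == "__main__":
    # Hypothetical names: slice 0 is the ~2GB log slice, slice 1 is the rest.
    print(shlex.join(add_log_command("tank", "c7t1d0s0")))
    print(shlex.join(add_cache_command("tank", "c7t1d0s1")))
```

Losing a cache device is harmless, since it only holds copies of data that is already in the pool. The log device is the one to be careful with.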

You can spend more on an Intel SSD, and supposedly it's even faster.

Dedupe is an amazing technology, but you have to give it the hardware it needs. If I could figure out how to make a quiet case that held 10 drives, I'd probably avoid it. But for a 4-disk RAIDZ, it is a good match for me.

My advice is to add an SSD to your ZFS box, no matter what. For certain, you can use a $90 SSD as a nice fast cache. If you want to use the device as a ZIL (ZFS intent log) and you're paranoid about data integrity, read the thread above and spend $500 on an SSD. Otherwise, I think my $90 one does pretty well too.

4 comments:

  1. Out of curiosity, how exactly did you add your small SSD to the zpool? Did you add it as a log device (implied by your discussion of write performance and the ZIL), cache device (L2ARC), or both (if possible)?

    In my understanding, log device usage can't effectively scale much past 50% of main memory size, rendering a 30GB SSD questionably useful on a modest-scale server. The ideal would seem to be using the same SSD as a read cache (L2ARC) and log device (ZIL), as well as perhaps for the boot partition, but short of using plain files on the SSD as virtual devices for the log and cache volumes, I'm not sure whether this is even possible in Solaris.

  2. As a quick follow-up, it seems possible to simply pre-partition a single SSD into boot, log, and cache partitions:

    http://blogs.sun.com/ds/entry/make_the_most_of_your

  3. I partitioned it and added 2GB as log, and the rest as cache. So yeah a huge log doesn't help much.

    Most people say you shouldn't add an un-mirrored log on *older* builds (it can cause unrecoverable failure of the pool), but it seems reasonable to do so on newer builds... maybe.

    Also, people recommend different SSDs for different types of operations. Thread here on SSD best practices:
    http://opensolaris.org/jive/thread.jspa?threadID=127994&tstart=0

  4. Thanks for the informative blog.

    I've been using an SSD as a ZIL on my ZFS pool and I've gained a big concern. You can't import your pool if it's missing a log device.

    When your log device goes bad, ZFS will fall back to using the data pool for the ZIL. That part is great. But if you reboot, your pool is hosed. If you export your pool, hosed. You can't import it without the log device. It's hosed.

    Best thing to do is mirror the SLOG or remove it when it goes bad so that an unplanned reboot won't make life suck.

    There are recovery procedures but I couldn't get the tool (logfix) to compile on latest SVN. This writes the labels to a new log device. One needs a copy of the GUID from zpool.cache too.

    More info:
    http://opensolaris.org/jive/thread.jspa?messageID=377018
