Getting rid of ZFS dedupe (and a new drive enclosure)

I previously wrote about my ZFS NAS, and how I managed to get dedupe to write moderately fast with a cheap SSD.

I've decided to go back on the decision to use dedupe at all, and I have some fast, noisy, and pretty great hardware to talk about instead.

Today, the publicly available "build 134" of OpenSolaris has quite a few dedupe bugs that make simple operations like destroying a dataset or even rm'ing a bunch of files very slow, sometimes leaving a server unresponsive for a day at a time. These bugs are getting fixed (there's a lot of talk about "single threading" and things like that), but it has been 6 months without much forward progress. The Nexenta guys are patching their OS from the development versions, which apparently are buggy, but I didn't want to deal with any of that.

At the same time, I had a security camera logging to my NAS, and it almost filled the NAS up. I tried to delete an old 500GB snapshot, and it locked the machine trying to remove all the deduped blocks. Basically there was no way to free up space. Also, my dedupratio at the time was about 1.8. When you're using 5TB on a 4 x 1.5TB RAIDZ (roughly 4.5TB usable, so the data wouldn't fit without dedupe), there's no easy way to go back.
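To make that arithmetic concrete, here is a rough sketch in Python. It assumes single-parity RAIDZ leaves about (disks - 1) disks of usable space, ignores metadata overhead and TB vs. TiB differences, and reads "using 5TB" as 5TB of logical data before dedupe. The function names are made up for illustration and are not part of any ZFS tool.

```python
def raidz_usable_tb(disks, disk_tb, parity=1):
    """Approximate usable capacity: data disks times disk size (assumption)."""
    return (disks - parity) * disk_tb


def physical_tb(logical_tb, dedup_ratio):
    """Approximate on-disk space for deduped data (logical / ratio)."""
    return logical_tb / dedup_ratio


logical = 5.0        # TB of data as the filesystem sees it (my reading of the post)
ratio = 1.8          # dedupratio quoted in the post

usable = raidz_usable_tb(disks=4, disk_tb=1.5)   # old 4 x 1.5TB RAIDZ pool
on_disk = physical_tb(logical, ratio)            # what dedupe squeezes it into
fits_without_dedup = logical <= usable           # could the data be rewritten undeduped?

print(f"usable ~{usable:.1f} TB, deduped data ~{on_disk:.1f} TB on disk")
print("fits without dedupe:", fits_without_dedup)
```

Under these assumptions the logical data is larger than the pool, which is why simply switching dedupe off and rewriting the data in place was not an option.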

So I decided to get more disks and not ever turn on dedupe again.

I considered building a totally new server in a new enormous case, so I could put more disks in it. But this is tricky: you can get the cooling wrong, and a lot of the off-the-shelf bigger cases sound like wind tunnels. Instead, I decided to do it the way datacenter guys do it: get a "direct attached" drive enclosure.

I found this enclosure from Sans Digital: the TowerRAID TR8X.

The TR8X is not cheap in home NAS terms (it's $400, plus a $300 LSI SAS controller), but compared to a box from Dell (which costs thousands) it is very cheap. And it has the advantages of a "real" external drive enclosure: it has good cooling, it has hot-pluggable drives, and it's actually SAS, so every OS on earth will recognize the drives instantly, and you can use enterprise SAS drives if you want to spend the money.

It turns out the cheaper "port multiplier" SATA controllers have mostly buggy drivers, even on Linux sometimes. But this SAS one works everywhere.

So I plugged it in.

I don't usually believe theoretical performance numbers will ever really happen, but a ZFS stripe on 8 drives actually did give me 700MB/sec performance.

With my old 4-disk setup, I was also nervous about running "single parity" RAIDZ. With this upgrade I was able to go to RAIDZ2 (so 3 disks have to fail before there's data loss), and the TR8X is still pulling >320MB/sec read and write with all that turned on. And a scrub of 5TB takes less than 4 hours. It's way too fast for anything I do.
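As a quick sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It assumes decimal units (1TB = 1,000,000MB), treats the stripe throughput as spread evenly across the 8 drives, and treats the 4-hour scrub time as an upper bound, so the computed scrub rate is a lower bound. The helper names are hypothetical.

```python
def failures_until_data_loss(parity):
    """A RAIDZ vdev survives `parity` failed disks; one more loses data."""
    return parity + 1


def min_rate_mb_s(data_tb, hours):
    """Lowest average rate (MB/s) that moves data_tb in at most `hours`."""
    return data_tb * 1_000_000 / (hours * 3600)


per_drive = 700 / 8                   # MB/s per drive in the 8-disk stripe
scrub_rate = min_rate_mb_s(5, 4)      # 5TB scrubbed in under 4 hours

print("RAIDZ1 loses data after", failures_until_data_loss(1), "failed disks")
print("RAIDZ2 loses data after", failures_until_data_loss(2), "failed disks")
print(f"stripe: ~{per_drive:.1f} MB/s per drive")
print(f"scrub: at least ~{scrub_rate:.0f} MB/s on average")
```

The per-drive figure is plausible for spinning disks, and the scrub rate lands in the same ballpark as the >320MB/sec read and write numbers above.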

This enclosure isn't for everyone: it is quite a bit noisier than my old quiet NAS, though a lot of that is due to the really noisy Hitachi A7k2000 disks I got.

But it is a really solid piece of hardware, performance is amazing, and I didn't have to do a whole lot of reconfiguration to upgrade.

I wish everyone luck with dedupe, but I'm going brute-force for now, instead.

5 comments:

  1. If the main(?) point of going the zfs/open solaris route for your NAS is extreme robustness and reliability, how come long-standing, slow-to-be-fixed, bugs in a feature of the FS don't make you nervous, even if it's "just" performance bugs, not data loss bugs?

    (I ask since I'm intrigued by this option for myself, but your experience above makes me a little skeptical.)

    -matt

  2. Good question. The dedupe feature hasn't been released to any "stable" Solaris build yet, so they released it to their developer track early. You could say "too early", but I think it has been good for them.

    I have no insight into how Sun/Oracle does their work, but I do see most of the bugs being fixed and a general attention to getting things right. It's an impressive group of engineers. Their bugs are public:

    http://bugs.opensolaris.org/bugdatabase/search.do?process=1&type=&bugStatus=&keyword=&textSearch=dedup&perPage=10&sortBy=date

    So I don't think this is some sort of neglected project, by any means. From what I've heard, they are working furiously to stabilize the dedup feature for inclusion in the main Solaris build, and this is taking some time to get right.

    I'm not sure what impact the Oracle acquisition has had on the development builds, though Oracle has committed to fully continuing work on this project.

    I think the ZFS product is really incredible, it's just a bit inconvenient to be on the bleeding edge all the time.

  3. I don't trust drobo nearly as much. Too many data loss reports.

  4. What. Data is such a F'ing hassle.
