Cloud DBs

I've been doing some research on "cloud databases" - the non-relational key/value storage systems that people are using to scale their web apps past MySQL and SQLite.

First are the Bigtable clones, where you actually get columns and higher-level features:
HBase: database used for Hadoop
Cassandra: database used by Facebook

These are "big" projects, and if you have a big application you might consider them. But there is a lot of code there, and they seem pretty new for their complexity.

Next are the low-level improvements on what people know as "DBM" or "Berkeley DB", the simple "put a value, get a value" interfaces. These packages typically wrap a number of different backends: a fixed-length database (flat file), a hash table, and a B+Tree. Compared to Berkeley DB, these guys are faster and usually LGPL:

By some reports, using this kind of code is 10-100x faster than using MySQL or SQLite to do the same task. And bindings are good, supporting C, Java, Ruby, and Python.
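For anyone who hasn't used a DBM-style store, the whole interface is roughly the following. This is a minimal sketch using Python's standard-library dbm.dumb module as a stand-in; it is not one of the faster packages discussed here, and the path and key names are made up for illustration.

```python
import dbm.dumb
import os
import tempfile


def roundtrip(path):
    """Store one value under a key, reopen the database, and read it back."""
    # "c" opens the database for read/write, creating it if needed.
    with dbm.dumb.open(path, "c") as db:
        db[b"user:42"] = b"alice"   # put a value
    # "r" reopens it read-only, as a second process might.
    with dbm.dumb.open(path, "r") as db:
        return db[b"user:42"]       # get a value


with tempfile.TemporaryDirectory() as tmp:
    print(roundtrip(os.path.join(tmp, "demo")))
```

There is no query language and no schema: keys and values are just bytes, which is a big part of why these stores can be so fast.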

Of course there's the popular "memcached", which stores key/value pairs in RAM across multiple machines. Memcached is interesting because people are using its protocol as a standard for persistent key/value storage, on top of its well-known role as a RAM-only cache:

A feature you might want is to be able to access your database over the network, rather than by opening files on the local disk. Interesting entrants here:
http://memcachedb.org/ - Danga's memcached + Berkeley DB

Tokyo Cabinet and MemCacheDB support the "memcached protocol" and most of the above do some kind of RESTful storage. CouchDB does map/reduce for its indices, which sounds neat but proves to be 100x slower than MySQL in practice.
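Part of why the memcached protocol works as a de facto standard is that its text form is tiny. Here is a rough sketch of building "set" and "get" requests by hand; the helper names are mine, and a real client would also handle sockets, replies, and errors.

```python
def build_set(key, value, flags=0, exptime=0):
    """Return the bytes of a memcached text-protocol "set" request.

    Wire format: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
    """
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key):
    """Return the bytes of a memcached text-protocol "get" request."""
    return f"get {key}\r\n".encode("ascii")


print(build_set("user:42", b"alice"))
print(build_get("user:42"))
```

Any server that answers these requests can sit behind an existing memcached client library, whether it keeps the data in RAM or on disk.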

Finally, you should know which of these systems support horizontal scaling (i.e. linear scaling when you add more machines), and those include HBase, Cassandra, and some layers on top of the key/value guys. Most of the above systems (including CouchDB) do not scale horizontally, and you basically make full replicas of all of your data, or just use them on one disk.

LightCloud: scaling layer built on Tokyo Tyrant.
Project Voldemort: used by LinkedIn and others
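The usual trick behind these scaling layers is to hash each key onto one of many plain key/value servers. Below is a sketch of a consistent hash ring, which is one common way to do that; I'm not claiming this is how LightCloud or Voldemort implement it, and the class and node names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to nodes so that adding a node moves only some of the keys."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas   # virtual points per node, to even out load
        self._ring = []            # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(len(moved), "of", len(keys), "keys moved to the new node")
```

With a naive "hash mod N" scheme nearly every key would move when N changes; here only the keys claimed by the new node do.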

At this point, I'm very impressed with the Tokyo stuff, and I especially like that I can break the key/value abstraction and do cursor ops on the btree directly. So if I have 1000 keys that appear sequentially, it is insanely fast to fetch them.
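To show what "breaking the key/value abstraction" buys you, here is a toy ordered store with a cursor-style scan. It is an in-memory stand-in built on a sorted list, not Tokyo Cabinet's actual API, and the names are made up.

```python
import bisect


class SortedStore:
    """Key/value store that also keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []      # sorted keys, standing in for the B+tree leaves
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, limit):
        """Return up to `limit` (key, value) pairs with key >= start, in order."""
        i = bisect.bisect_left(self._keys, start)   # one seek to position the cursor
        return [(k, self._values[k]) for k in self._keys[i:i + limit]]


store = SortedStore()
for n in range(1000):
    store.put(f"event:{n:04d}", n)
print(store.scan("event:0010", 3))
```

A pure key/value interface would need 1000 separate lookups for 1000 sequential keys; with a cursor it is one seek followed by a sequential read.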

For smaller projects I think I'm going to test out Tokyo Cabinet, and for larger ones LightCloud. I'd love to hear other suggestions.

My mod_gzip settings (deflate.conf)

Paul Buchheit posted about gzip settings, and I thought I'd post my deflate.conf (Apache) because the default Apache stuff isn't nearly aggressive enough.

This stuff goes into conf.d/deflate.conf, and it's cribbed from several places on the web. Wish I could credit them, but I forgot.

If you don't use settings like this you'll find that your CSS and JS files don't get compressed, or you'll compress them all the time, even for the browsers that can't handle them, or you'll get your stuff cached by proxies that will serve the files to browsers that can't handle them, etc. I've not done totally exhaustive testing, but this is what I use on all my sites.

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css
AddOutputFilterByType DEFLATE application/xml application/xhtml+xml application/rss+xml
AddOutputFilterByType DEFLATE application/javascript application/x-javascript

DeflateCompressionLevel 9

BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
# IE5.x and IE6 get no gzip, but allow 7+
BrowserMatch \bMSIE\s7 !no-gzip
# IE 6.0 after SP2 has no gzip bugs!
BrowserMatch \bMSIE.*SV !no-gzip
# Sometimes Opera pretends to be IE with "Mozilla/4.0"
BrowserMatch \bOpera !no-gzip
Header append Vary User-Agent env=!dont-vary
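Since BrowserMatch rules apply in order and later ones can undo earlier ones, it helps to trace what a given browser actually ends up with. Here's a rough Python simulation of the rules above. It only mimics the set/unset logic (a leading "!" removes a variable); it is not Apache itself, and the sample User-Agent strings are just examples.

```python
import re

# (pattern, variables) pairs transcribed from the BrowserMatch lines above.
RULES = [
    (r"^Mozilla/4", ["gzip-only-text/html"]),
    (r"^Mozilla/4\.0[678]", ["no-gzip"]),
    (r"\bMSIE", ["!no-gzip", "!gzip-only-text/html"]),
    (r"\bMSIE\s7", ["!no-gzip"]),
    (r"\bMSIE.*SV", ["!no-gzip"]),
    (r"\bOpera", ["!no-gzip"]),
]


def deflate_env(user_agent):
    """Return the set of variables left after applying every rule in order."""
    env = set()
    for pattern, variables in RULES:
        if re.search(pattern, user_agent):   # BrowserMatch is case-sensitive
            for var in variables:
                if var.startswith("!"):
                    env.discard(var[1:])
                else:
                    env.add(var)
    return env


for ua in [
    "Mozilla/4.08 [en] (Win98; I)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/3.0",
]:
    print(ua, "->", sorted(deflate_env(ua)))
```

An empty set means everything in the AddOutputFilterByType lists gets compressed for that browser. Note that in this trace a plain IE6 User-Agent ends up with an empty set too, so check the result against what you intend for old IE.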

There's an nginx equivalent around here somewhere, I'll dig it up.

Bravo - Google Updater open-sourced

Google has open-sourced the updater used by Earth, Chrome, etc.:
http://google-opensource.blogspot.com/2009/04/google-update-goes-open-source.html

The blog post above talks about two things: (a) transparency in how Google does things, and (b) saving people time.

I wouldn't underestimate (b), notably how big a hurdle it is to write the basic "Client Software Infrastructure" today: downloads, builds, upgrade/downgrade, and especially autoupdate. It could take a mid-level engineer a few months to do this at a basic level, and a year to do it really right. Even some big companies (e.g., Adobe) make autoupdate clunky and awkward.

Google's system is well engineered, simple for the user, and it works really well.

Of course, Microsoft and Apple should provide these frameworks. Apple doesn't do much here aside from their own apps, though I like Sparkle, which is free.

Microsoft started to provide the basic tools in MSI, though it's not a full framework. But even their lower-level stuff is pretty much broken. A quick list of ways Microsoft dropped the ball with MSI:

1. Their installers are very slow.
2. MSI uses "chained certificates" that usually expire after a year, requiring software authors to chain together all old signing certs to make update work.
3. MSI uses an ancient compressor that makes installers nearly 2x as big as those built with the better compressors available today.

So anyway, thanks Google. Even though someone else probably should have done this instead, it's a big step forward.

A good-enough ZFS NAS

[followup from October 2010: Don't get WD EADS drives for this, as they fail a lot. Also look for an ethernet controller that allows "jumbo frames" as Intel has decided the motherboard listed here shouldn't support that.]

I've kept an old 2001-era Dell box running OpenSolaris with ZFS for the last few years. Filled it up. But I was so very happy with ZFS that I wanted to build a new box. And who wouldn't want a Netapp-like storage system for cheap?

I thought this would be easy, but it really wasn't, and so I thought I'd write it up for everyone who's in a similar situation. (If you're impatient, you can skip down to the parts list below.)

My "impossible" goals for a ZFS NAS:
  1. Quiet!
  2. Low power
  3. ECC RAM
  4. Big storage (4TB+), with 5-6 disks
  5. Compatible with OpenSolaris
  6. Reasonable cost
If you didn't care about most of those things, you could just get a NetApp for $100,000. But look, you probably didn't do that.

At a medium price point, you could just buy Sun hardware. For example, this guy converted an Ultra 40 workstation to run ZFS, but sometime after that Sun canceled the Ultra 40. Now you can only buy the 4-disk version, and that's actually not enough for a huge NAS when the bootdisk eats one (more on that later).

Rack-mount stuff satisfies most of the above, but it's a little bit expensive, and very loud. And if you want loud, you could just buy a Sun Fire X4140.

Sun has been very slow to make Solaris work on non-Sun hardware, so compatibility turns out to be difficult. You can find people who've read some of the Sun whitepapers and cloned the hardware for a smaller cost. For instance, "Thumper" (Sun's massive ZFS box) apparently uses multiple AOC controllers like this one. The AOC costs $99 and supports 8 SATA disks.

I care about ECC RAM because I decided to worry about bitrot.

Even if you go with the latest-generation chips, Intel has a small edge in power consumption. When I started this project, Intel had a 2:1 edge, so I was determined to get an integrated Intel solution with ECC RAM that was reasonably compatible. Sun's last-generation hardware was mostly AMD, but the current stuff is Intel Xeon, so all this works ok. Also, Intel's integrated LAN and SATA chipset (ICH9x) are well supported, which means you don't need add-on cards to get basic stuff done.

Parts list:
  • Supermicro MBD-X7SBL-LN1
  • 4GB ECC RAM
  • 5 x Western Digital Caviar Green 1.5TB (WD15EADS)
  • SATA DVD-ROM (any $20 one)
  • Antec P180 mini to hold 5 drives (I'm using an old Sonata). P183 if you want 6 drives.
  • Intel Xeon E3110. If you're cheap: a non-Xeon E5200. 45nm to save power.
  • OpenSolaris nevada build 101b
Motherboard/RAM: Intel's desktop chipsets don't use ECC, but I found the Intel 3000 chipset, which is sort of a low-end server/workstation chipset for not much more than a desktop board. Also it has 6 SATA ports and integrated video. You don't need the 3010 unless you want cool add-on cards.

The Intel board I ordered was actually really terrible, so my second try was a Supermicro MBD-X7SBL-LN1, also based on the Intel 3000. It's great: micro-ATX, integrated video, wonderful and sensible layout. Slow video, just like you want on a server. In contrast, Intel's board has power plug in the middle of the board, SATA ports at random angles, and it takes ages to boot (POST). But this Supermicro motherboard is really absolutely wonderful.

4GB of ECC RAM is now mostly free ($50). You could buy 6GB, or 8, or whatever you like. Solaris runs in x64 mode, so all your RAM will get used.

CPU: The cheapest 45nm Xeon I found is the Intel Xeon E3110. 3GHz dual-core seems excessive, but it is a Xeon, so it made me feel better, and it uses less than 40W.

Drives: Western Digital Caviar Green drives use 3.7W at idle. Most drives use 8W. Samsung EcoGreen are similar. That 3.7W figure is with the platters actually spinning, not spun down.

In contrast, Seagate 1.5TB drives use 8W at idle, and they have firmware bugs. (I just flashed my colo RAID and bricked one of the drives, so I had no interest in upgrading 5 of them.) I went with the Caviar Green.
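To put the idle numbers in perspective, here's the back-of-envelope arithmetic for the five drives in the parts list. It assumes the box sits idle around the clock, which is a simplification; the function name is just for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def idle_savings(drives, watts_low, watts_high):
    """Return (watts saved, kWh saved per year) for a set of always-on drives."""
    watts_saved = drives * (watts_high - watts_low)
    kwh_per_year = watts_saved * HOURS_PER_YEAR / 1000
    return watts_saved, kwh_per_year


watts, kwh = idle_savings(5, 3.7, 8.0)
print(f"{watts:.1f} W saved, about {kwh:.0f} kWh per year")
```

That is less heat to move out of the case as well, which helps with the "quiet" goal.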

For the moment, 1.5TB drives are priced at about 10c/GB, and 2TB drives are about 15c/GB. So 1.5TB is a nice choice for now.
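Spelled out per drive, using the rough prices above and decimal gigabytes (the helper name is made up):

```python
def drive_price_dollars(capacity_gb, cents_per_gb):
    """Approximate drive price in dollars from capacity and price per GB."""
    return capacity_gb * cents_per_gb / 100


small = drive_price_dollars(1500, 10)   # 1.5TB drive
large = drive_price_dollars(2000, 15)   # 2TB drive
print(f"1.5TB: ${small:.0f}, 2TB: ${large:.0f}")
```

So the 2TB drive costs twice as much for a third more space.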

Really read this: "consumer" drives need you to enable TLER (time-limited error recovery) before you put them in a RAID. Western Digital provides WDTLER.EXE to switch this. What you want is for a drive to fail fast (7 seconds), and tell the OS and RAID about it.

Also, turn on AHCI in your BIOS. Solaris supports it now, and it's faster. (You have to be in IDE mode to run WDTLER, though.)

Booting huge drives, booting ZFS

Drives with sizes above 1TB use a new partitioning scheme. Booting from >1TB drives is totally incompatible with the shipping version of Solaris 10u6. 10u6 will complain about everything regarding a 1.5TB drive. People say that the relevant patches might make it into 10u8, but this isn't a sure thing, and it probably won't ship for a year. Just skip the "stable" Solaris 10.

So in the meantime, you head back to OpenSolaris (the "community" version), which of course has the latest patches. I'm running "nevada" 101b. It supports the new partitioning scheme, and the installer doesn't say anything about your drives being too big. This new build is supposed to provide boot support up to 2TB, but I'm not sure how tested it is.

Also, I was very pleased to learn that ZFS is now bootable.

Oh, but! ZFS is not bootable in RAIDZ (equivalent to RAID-5/6) mode, which you probably want for your main storage. So you need to dedicate some disks to boot ZFS, and create a separate RAIDZ to store data. (I cheated and used a single bootdisk, but I keep good backups.) Yes, I've devoted a 1.5TB drive to booting an OS that uses 20GB. Yeah it would be smarter to pull 20GB from the front of each drive and create a tiny mirror and a bigger RAIDZ. But anyway, I didn't do that. Maybe someone else will.
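Here's a quick sketch of what the two layouts give you. It assumes single-parity RAIDZ, decimal gigabytes, and ignores ZFS metadata overhead, so treat the numbers as upper bounds; the function name is hypothetical.

```python
def raidz_usable_gb(disks, disk_gb, parity=1):
    """Approximate usable space of a RAIDZ vdev: data disks times disk size."""
    if disks <= parity:
        raise ValueError("need more disks than parity devices")
    return (disks - parity) * disk_gb


# What I built: one whole drive for boot, four drives in RAIDZ.
as_built = raidz_usable_gb(4, 1500)

# The smarter layout: 20GB off the front of all five drives for a boot
# mirror, and the remaining 1480GB of each drive in RAIDZ.
carved = raidz_usable_gb(5, 1500 - 20)

print(as_built, "GB as built vs", carved, "GB with carved-out boot slices")
```

The carved-out layout would also mirror the boot pool across all five drives instead of leaving it on a single disk.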

One caveat: smartmontools doesn't work with the Intel ICH9 ("-d ata" is not implemented). If you need SMART monitoring of your disks, get a Marvell-based controller, like the AOC above.

Now, to summarize, we've got:
  • Integrated video, LAN, ECC motherboard.
  • Low power, server-class parts.
  • Low-power hard drives, updated to enable TLER.
  • Quiet case, can't hear it.
  • 4GB RAM, x64-compatible Intel CPU.
  • >4TB available, with ZFS and snapshots!
  • Cost ~$1500.
  • It Works!

Other options

If you want to save the 1.5TB hassle and the TLER hassle, get the RE2-GP (this class of drive appears to add 70% to the price):


If you are willing to relax my ECC RAM requirement, a world of cheaper options (based on desktop-class hardware) opens to you. Lots of the motherboards seem to mix up two SATA chipsets, which seems really wrong to me. But here are some interesting links I found:



Also, many people have built a ZFS NAS with 2 huge drives (mirrored), on an Intel Atom motherboard (the chip used in netbooks). Unfortunately it is hard to find an Atom motherboard with >2 SATA ports. But this seriously minimizes power, and with the next-generation motherboards will do so even more. If you need 2TB or less this is a very good option.