Jul 5, 2009

Mesh kinda worked!

[update: Despite its using an absurd quantity of resources, I'm letting Mesh finish syncing 13GB of stuff in a small folder. Moe.exe is using 170MB of RAM, but it's actually really working.]

[update 2: Mesh has copied 9GB in about 4 hours, or about 600KB/sec. That's about 10% utilization of my 802.11n network.]

I rebooted my laptop and Windows Live Mesh started syncing.

The amazing thing is that it's doing as many as 500,000 context switches/sec (no idea that was possible), and using most of my CPU (on a dual-core machine). So mostly the computer is running at "slow" speed while it's going.
This is on a 2.53GHz Core 2 Duo. I just had no idea copying files over a network took this many resources.

I guess you could do a context switch per packet, um, write the packet to a log file, um, write the packet to a database, maybe with a different thread for each task, and synchronize them somehow. But I think that might be more CPU-efficient than this.

And below is the graph of CPU...remember that's 80% of CPU on a dual-core box.







And finally, here's some more evidence of what's going on. This thing is actually writing data to disk 41 bytes at a time. That is just an insane amount of work for the kernel to do, and very unnecessary...
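To get a feel for why tiny writes hurt, here's a minimal Python sketch. It is not Mesh's code: the 41-byte record size comes from the observation above, and everything else (the `CountingSink` class, the record count, the 64KB buffer) is made up for illustration. It counts how many write calls actually reach the underlying "device" with and without a buffer in front of it.

```python
import io

class CountingSink(io.RawIOBase):
    """Hypothetical stand-in for a file: counts the write calls that reach it."""
    def __init__(self):
        super().__init__()
        self.calls = 0
        self.total = 0

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.total += len(b)
        return len(b)

def write_records(out, count, size=41):
    """Write `count` records of `size` bytes each, one write() per record."""
    record = b"x" * size
    for _ in range(count):
        out.write(record)

# Unbuffered: every 41-byte record is its own call into the sink.
raw = CountingSink()
write_records(raw, 10_000)

# Buffered: a 64KB buffer coalesces the small records into a few large writes.
sink = CountingSink()
buffered = io.BufferedWriter(sink, buffer_size=64 * 1024)
write_records(buffered, 10_000)
buffered.flush()

print("unbuffered calls:", raw.calls)
print("buffered calls:  ", sink.calls)
```

The same bytes arrive either way; the only difference is how many trips into the lower layer it takes to get them there.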




Jul 3, 2009

Social Network APIs for spam control?

I'm wondering if there's a way that we can use the proliferation of social networks to solve part of the email spam problem.

For instance, what if every time you sent mail, you asked Facebook (or, hm, OpenSocial), "Hey, give me proof that so-and-so knows me"?

Then, if Facebook signed that request using some simple PKI system, you'd include the sig in your email header, and you could verify it's not spam using PKI. (Twitter, etc. could do the same thing.)

It could be possible to roll this kind of thing into an Outlook plugin, and webmail systems could follow fast with OAuth implementations of the same thing. You could deploy this widely within 12 months, and your mail would get through, very reliably.

Jul 1, 2009

Blatantly Latent: WiFi makes network filesystems pokey

I was just doing some benchmarking of my favorite photo app against a mounted NAS volume, on wireless 802.11n. Actually it doesn't take benchmarking to notice that things are a lot slower over wireless.

A lot.

Slower.

802.11n is not even close to "ethernet without wires". In some cases it takes 20x longer to do things than 100Base-T. It's not clear from the specs why this is true. But from what I've observed, the gap has a lot more to do with latency than throughput.

I'm sitting 12 feet from my WRT610N router with an Intel 802.11n chipset in my laptop, going through one wall, and this is mostly what you'd call a state of the art combo.

I can stream data at about 6MB/sec over this connection.

But my "average" ping to my router is nearly 10ms, about the same as my DSL ping to my ISP. For things like big HTTP streams, 802.11n does really nicely. It can stream video, and stuff like that.

Network filesystems were invented 20 years ago, when throughput and latency were balanced differently. The most data you can request at a time over these systems is maybe 64k, and that's not so much when latency is high.

This means if you issue 64KB reads in a tight loop, you can maybe use half of a typical 802.11n line. But just barely, and it depends on noise, and on whether your computer is doing anything else.

It's not nearly as good as an HTTP stream.

And still, the 64k at a time approach sounds close to reasonable, until you realize that the code everyone's using to read files (stdio and jpeglib and libpng) has embedded default buffer sizes of 4-8k! And that's when it gets really bad.

Yeah. Oh it's slow.

If you make software, you really have to change these 4-8k defaults, or even implement readahead if you don't want to be entirely gummed up by a normal wireless network. Because if you're only reading 8k at a time, you're reading at less than 25% of your network's capacity, maybe 10%.
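A back-of-the-envelope model makes those percentages concrete. This is a sketch under simple assumptions: each synchronous read costs one round trip plus the time to move the data at the link's streaming rate, with no readahead and no protocol overhead. The 10ms ping and 6MB/sec figures are the ones quoted above; the function name and the model itself are just for illustration.

```python
def sync_read_throughput(read_size, latency_s=0.010, link_bytes_per_s=6_000_000):
    """Bytes/sec for back-to-back synchronous reads of `read_size` bytes.

    Each read pays one round trip (latency_s) plus the time to move the
    data at the link's streaming rate. A deliberately simple model.
    """
    time_per_read = latency_s + read_size / link_bytes_per_s
    return read_size / time_per_read

LINK = 6_000_000  # ~6MB/sec streaming rate quoted above

for size in (4 * 1024, 8 * 1024, 64 * 1024, 1024 * 1024):
    rate = sync_read_throughput(size, link_bytes_per_s=LINK)
    print(f"{size // 1024:5d}KB reads: {rate / 1e6:5.2f} MB/sec "
          f"({100 * rate / LINK:4.1f}% of the link)")
```

Under these assumptions, 8KB reads land at roughly 12% of the link and 64KB reads at roughly half, which lines up with the "maybe 10%" and "half" figures above. The round trip dominates until the reads get much bigger.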

No way around it. You are just slow.

People will notice that, and you'll be slow compared to the people who know this and code around it. They'll be 4x faster easily, and maybe more.

My advice: pretend you're reading streams 64k at a time. Don't seek a lot. Don't use a database that reads in 4k blocks.
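Here's what "read 64k at a time, don't seek" can look like in practice. A minimal sketch: the helper name and the file path in the comment are hypothetical, and an in-memory buffer stands in for a file on the NAS so it runs anywhere.

```python
import io

CHUNK = 64 * 1024  # 64KB per read, matching the advice above

def read_all_in_chunks(fileobj, chunk_size=CHUNK):
    """Read a binary stream front to back in large sequential chunks.

    Returns (bytes_read, number_of_reads). No seeking, so a network
    filesystem sees one plain sequential stream.
    """
    total = 0
    reads = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        reads += 1
    return total, reads

# On a real file, buffering=0 gives a raw file object, so each read()
# maps to roughly one request of chunk_size bytes:
# with open("photo.cr2", "rb", buffering=0) as f:   # hypothetical path
#     print(read_all_in_chunks(f))

one_mb = b"\0" * (1024 * 1024)  # 1MB stand-in for a file
print(read_all_in_chunks(io.BytesIO(one_mb)))            # 16 reads at 64KB
print(read_all_in_chunks(io.BytesIO(one_mb), 8 * 1024))  # 128 reads at 8KB
```

Same megabyte, 16 round trips instead of 128. On a link with 10ms of latency, that difference is most of the story.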

I guess we should also think about when these 20-year-old filesystems are going to get updated. In the long term, we need network filesystems that can deal with latency in a smarter way. WebDAV tried to do that with HTTP, but with an XML fetish that doesn't look efficient over the wire. In general, a smarter approach to block sizes (intelligent read-ahead) and hints based on recent usage would help a lot. If I've just opened a dozen 30MB .CR2 files and read their full contents, software that adapts to that case would be very nice, rather than running at 10% of the network's speed because of a 20-year-old protocol.

The batchy, async "sync" protocols are all very proprietary right now, and they don't degrade nicely to "be like NFS or Samba". There's a big split in both openness and in "sometimes I need async speed and sometimes I need synchronous operation".

The cost of not updating this piece of the technology stack over time is that the 4x gap in throughput you can find today between "synchronous reading" and "streaming" widens further, until there are no "common" and high-performance network filesystems, and we all use custom or async protocols for high-performance situations.

We could do that, but I think there's potentially a middle ground, using some read-ahead, and some batching to avoid these issues. And if applications can usefully say "be totally async" then we get protocols like the ones we see in sync solutions today.

Finally, wireless standards could put some focus on latency as well as throughput. It's nice marketing to stream 3 HD video streams, but it's also nice when that doesn't come at the cost of increased latency for other common operations.

It's important to keep throughput and latency in balance. Already the big improvements made in the 802.11n standard show a trend towards putting them out of whack. Let's see if software or hardware makes the next step towards improving that.

Jun 28, 2009

Sync is (almost) the new backup

I've been wrestling with how to keep my photos and videos and code and email backed up for several years now. Haven't lost much yet, but I spend a lot of time working on it. Currently I'm using a combination of rsync to a machine I put in a colo, plus a sync tool I made for myself to schlep files from Windows to UNIX filesystems, plus ZFS (with its awesome snapshots) running on Solaris, plus Mozy, plus a little Time Machine on the Macs, and occasionally Crashplan.

My idea in writing this post is that the only system that works for a majority of people is p2p sync of actual files, because this results in multiple "live" replicas of a file, which can be verified to work all the time:
  1. Not backup to DVD/USB drive
  2. Not cloud backup over 128kbps DSL
  3. Not cloud sync.
Here's why.

Cloud Dreams!

First what we've got is the "cloud" guys purporting to store all our stuff online! What a great solution, and it is the right solution, because it doesn't get lost. But look at the actual cost:

Amazon S3: $1.80/GB/yr
MozyPro: $6.00/GB/yr
1TB Box in colo: $1.50/GB/yr
MozyHome: $60/yr (unlimited magic)
1TB local hard drive: $90 ($0.09/GB, maybe $0.03/GB/yr)

Now the photos-only guys (no raw, limited video, etc.):
Picasaweb: $1.25-2.00/GB/yr
Flickr: $24.95/yr
Smugmug: $39.95/yr

Loss-averagers

Let's first blow up the "unlimited" storage idea. We're in an era of stupidly slow upstream connections, and these guys get away with "surely nobody is going to upload stuff to us for 6 months".

I installed special traffic shaping on my firewall and bought a really fast line so I could upload for 6 months without really noticing it, and then I did it. But really, is this pricing model real? Only because people don't use it. So they have a theoretical backup, not a real one.

For instance, Mozy clearly isn't faring so well when I store 400GB there. Looks like they are losing $2340 on me annually, when you compare to their "pro" pricing. Ouch.

Same with Smugmug, Flickr, etc. Having Smugmug store my stuff in S3 at $1.44/GB/yr (Amazon's bulk rate) is losing them about $536/yr. I imagine Flickr pays more than that, but I don't have any numbers.

So, PCs are 30x cheaper than the cloud

Storing an extra copy of this data on a new hard drive in my computer costs $12/year, plus maybe $8 for power if I leave it on all the time, or about $20/year.

I could even buy a netbook with a 160GB hard drive, and use that as a backup brick for $1/GB/year, cheaper than Amazon S3.
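The arithmetic behind those numbers, as a small Python sketch. The prices are the ones listed above. The 400GB figure is the Mozy number from the text, and applying the same 400GB to Smugmug is an assumption (it is what the $536 figure implies).

```python
STORED_GB = 400  # the 400GB figure from the text

# Per-GB yearly prices and flat-rate plans quoted in the lists above.
PER_GB_YEAR = {"MozyPro": 6.00, "Amazon S3 (bulk)": 1.44, "local hard drive": 0.03}
FLAT_YEAR = {"MozyHome": 60.00, "Smugmug": 39.95}

def yearly_cost(gb, price_per_gb_year):
    """Cost of storing `gb` gigabytes for one year at a per-GB rate."""
    return gb * price_per_gb_year

mozy_gap = yearly_cost(STORED_GB, PER_GB_YEAR["MozyPro"]) - FLAT_YEAR["MozyHome"]
smugmug_gap = yearly_cost(STORED_GB, PER_GB_YEAR["Amazon S3 (bulk)"]) - FLAT_YEAR["Smugmug"]
local = yearly_cost(STORED_GB, PER_GB_YEAR["local hard drive"])

print(f"Mozy gap:    ${mozy_gap:,.2f}/yr")     # 400 * 6.00 - 60
print(f"Smugmug gap: ${smugmug_gap:,.2f}/yr")  # 400 * 1.44 - 39.95
print(f"Local drive: ${local:,.2f}/yr")        # 400 * 0.03
```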

Sync today...tied to the cloud

So today, we have serious R&D going into things like "Microsoft Live Mesh" and Dropbox, and Zumodrive, and SugarSync. They handle the hard problem of folders that stay in sync no matter what you do to them. Lots of people are working on this.

But all of these services (except Mesh) today are tied to costs in the $1+/GB/year world of cloud storage. And so I can't use them for storing my "important" data. Even though maybe they are cool, I'm not paying an extra $1000/year for the privilege, and most don't scale to the level I'd need anyway.

Backup today...sort of

On the backup front, Mac users can get a Time Capsule and store their data on a second hard drive. And there are dozens of backup programs that will copy data you might or might not be able to restore in the future. Microsoft has some sorts of backup built into Windows 7, with the Time Machine equivalent being relegated to "Ultimate".

Crashplan is probably the best p2p backup for regular users. It mostly works, even if its CPU and memory usage is a bit high.

But the problem with backup is twofold:
1. Restore has to work. You're not watching daily for failures in your external USB drive, unless you're an enterprise IT guy whose job it is to keep the near-line backup device functioning.
2. You are using hardware for no incremental benefit. Backup is complex and there's no benefit for you to buy more of it to store more backups. Why buy it or install it? Why verify it still works? When you run out of space, what happens?

People need a benefit for backing up today.

What's missing: p2p Sync

Microsoft bought a little company called FolderShare a few years ago, and hasn't improved it much. 20,000 files, max, not so many folders. But what FolderShare did is approximately, sort of, the right thing: a PC-to-PC replication feature where files are actually usable while they're replicated. Why is this so important? Because you didn't know that all your backups were corrupted, or that your backup USB drive wasn't working.

But if your files stopped working on your laptop but were still good on your desktop, that would be noticeable. You could fix it, and you'd notice it, and you'd hopefully get a new laptop before it was too late.

The only system today that appears to do p2p replication is Mesh, and it's a complex weirdo with Silverlight UI in a browser (huh?!). My install on XP took an hour and asked to set up remote desktop more prominently than setting up stuff to sync, and I couldn't figure out exactly how to sync data to my wife's computer. Mesh appears to be a toolset to solve problems, but it doesn't really help you figure it out much.

Using Mesh is a letdown. When I clicked on a folder in the web UI, it took 20 seconds to show the 15 files inside it, and I'm apologizing in advance for not having really run it through its paces. I uninstalled it. I get the idea, but the implementation is just awful.

Mesh also lets you replicate folders from all over your hard drive, which is the other useful feature you really want. Dropbox doesn't do this.

I'm wondering if somehow Mesh will win despite its unreasonable bulk. I suppose the Microsoft 3.0 rule still applies. They are certainly thinking right, even if the execution is off.

Meanwhile, there are rumors of Dropbox doing useable p2p sync. I think this would be a great thing, and maybe people would understand it. Dropbox mostly just does one thing and does it really well.

But I need to be able to bring home a new computer, add it to my network, wait for all my files to show up, and turn off the old one. Three copies of everything I care about, files shared between my wife and me. Easy, right?

But somebody...really just needs to do it.

Jun 9, 2009

PC hardware is crap (99% uptime sucks)

Until now, I've mostly been on the 2 year "faster, better" upgrade path. New CPUs were 4x faster than what I had on my desk, so I upgraded. This happened often...and until the last couple of years, computers lasted longer than I kept them.

I realized the servers in my closet were going on about 6 years old and needed upgrades, and over the last year, the laptops are all 3 years old, and we've replaced power supplies and laptop keyboards and many hard drives and even LCDs, and just tonight my monitor stopped turning on, after my Macbook hard drive started clicking and not booting.

Planned obsolescence.

And all I can think is that the quality of PC hardware seems to be crap right now. Quality of hardware still matters. My computer is less reliable than my car, and a few years ago, it seemed to be much more reliable. When my car fails, it doesn't drive off a cliff, it just smells bad or stops moving. When my computer fails, it loses all my data and leaves me unable to do work for a week.

(MTBF of hard drives appears to be going up exponentially in the marketing literature. It's not true.)

If anything will propel cloud computing to the mainstream, it will simply be the demise of the PC as a reliable store for data. The complexity of client OSes makes it impossible to replace hardware or to ensure that data, settings, or availability of the "upgrade experience" is reasonable. The size of the data is huge, and moving it to a new machine is too difficult.

If I had to give my own IT operations an "N-nines" uptime percentage, I would only qualify for 99% uptime over the past year. It is common for each machine to have a full day outage once a year. We have 8 computers in the house, so generally the internet still works, but individual PCs aren't so great anymore.
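For reference, here's how "nines" translate into hours, as a quick Python sketch. The function names are just for illustration, and the only input from the text is the one full-day outage per machine per year.

```python
HOURS_PER_YEAR = 365 * 24

def uptime_percent(downtime_hours_per_year):
    """Uptime percentage for a given number of hours down per year."""
    return 100.0 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)

def downtime_hours(uptime_pct):
    """Hours per year a given uptime percentage allows you to be down."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100.0)

print(f"one full-day outage a year: {uptime_percent(24):.2f}% uptime")
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime allows {downtime_hours(pct):.1f} hours down per year")
```

A single day-long outage by itself works out to about 99.7%, which is still short of three nines. Add the time spent fixing, restoring, and reinstalling, and two nines is about right.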

99% uptime is awful, and it should be possible to fix the hardware AND the software to do better.

I keep great backups, so I don't usually lose data, but I lose a TON of time. Fixing things, restoring things, reinstalling things.

We're seeing cloud services with 99.9% and 99.99% uptime, and the PC is looking very dated in this model. Because the PC wastes your time, and the PC loses your data.

On the hardware end: why isn't every new PC shipping with RAID-1? OEMs seem to charge 5x retail prices for hard drives, so why aren't there free replacement parts for 3-5 years, like the manufacturers offer? Why can I only buy redundant power on server configurations? Why can't the cabling bus be separated from the power supply, so I can buy a standard part and replace it? Fans? RAM tests? Is it still 1991?

On the software side: why do so few people back up their data? Why is it so hard to restore a full OS, or to upgrade to a new machine? Where is my distributed cloud filesystem? I used one in college in 1993.

I could build most of this redundancy into a PC for $500 extra, and I would pay that for certain to have a reliable computer.

And even the expensive computers from companies like Apple don't have it, not much of it anyway.

The PC needs to evolve or be obsoleted by much better ways to store data.

Today, I can get service contracts that promise to replace things that break, maybe within 24 hours or 4 hours or a week, but nobody in the hardware industry seems to consider it their business to preserve my use of the computer or that my data stays around.

Hard drive manufacturers are working on their "data recovery" businesses (very profitable), instead of trying to improve the environment where so much data needs to be recovered.

Seagate, since you've lost so much data for me this year, why not ship a 2x2.5" hard drive in a 3.5" pack that does its own backups and beeps loudly when it fails and plugs into its replacement for migration? It's possible to build entirely new systems that work better.

I absolutely don't mind a fragmentation into "netbook" and "reliable home PC" market, or whatever it takes to have a place where a person's photos and documents can live without getting lost entirely every 5 years. But the software and hardware that's getting built has to adapt to current needs, and 5 year "catastrophic" failures are just no good for most people's photos and videos and important documents.

All my techy friends buy NAS boxes and set up Linux servers with RAID-6 and bizarre filesystems, and we all spend too much time at it. And lose data too. Regular people just lose data, and that's the end of it.

This really needs to get fixed, in hardware or software.

The long-term arrow is pointing towards software, and the only place things are moving is software.

But today's software isn't ready for it, not entirely.

And today, I think hardware manufacturers are missing out on a boatload of revenue by not offering better hardware, and I think they should meet the demand for it with innovative products that focus on making a promise to consumers that's not solely about price.

May 31, 2009

Getting the Whole Thing Done (comments on Google Wave)

I was excited to see some of the videos around Google Wave. And what I think they've done brilliantly is to build enough of what we used to call a "Demo", based on real tech. They've very simply picked one of the "right problems" to solve, and they've actually dealt with all the details required to get something done.

Most software projects are robotic foster children, cobbled together by program managers who mostly think of their jobs as copying what other people have done to have "feature parity", and engineers who think of user experience as a necessary evil. It's not an experience that balances the best of user needs with the greatest technology you can dream about.

You can be an innovator in one piece of the puzzle and forever whine that you were ahead of your time (I've done it), build a great technology that nobody uses, or insist that things should work a certain way but never build them. But the people who get it done work harder and slay a lot of beasts along the way, not by copying what others do, but by really trying to finish something new. They should be commended.

One of the things I've noticed is, when you start with the right problem and really do some work on it, you enable a new way of doing things. Consider that Flickr started as a gaming platform, and ended up solving some problems that enabled a new kind of community. These things are a different slice against the problem space, and a way of not letting the technology limit what can be done.

Our Hello product, which never made the hurdle to being on the Web, also did some incredibly novel things on a tech level. The idea that conversations were documents with all kinds of media, that you could merge participants into...yeah, that's very familiar to me.

I do believe that these things can only happen on small teams that really own a problem. They can't be told no, they can't have a surrogate parent dictating UI standards or technologies to be used. Because for the most part, these top-down choices keep you in the box that was there before.

Once you've seen this happen a few times, you understand that there are some technology paths that lead to magic. When you find yourself considering a technology that if you built it, you could use for 10 different applications, without spinning off into a 200-page spec or a mess of edge cases, there is magic there. The things you can keep simple, and still see the path forward, those are the great ones, those are the real enabling innovations that spur ten years of growth.

Similarly, to make good products, you have to have a belief that certain things should be both possible and easy for people to use.

To use a little example, we built drag and drop into the browser in a dozen proposals and demos and never got critical mass around distributing it. Drag and drop into the browser was one of the first conversations I had with the Blogger guys when we were talking about Picasa and Blogger/Google working together in 2003, and it got really shipped after 5 years. Man, I could tell you a dozen neat tricks for communicating with a webpage, but they're not something you can use, and that's not good enough.

The "Demo" piece of this is the belief that this should be easy and possible, and the will to make it happen. Attaching drag and drop photos into Wave, and making it important to the experience, and getting over that hurdle to convince everyone that it's obvious that it should happen. That Demo is an important act, because it sets a path for everyone to follow.

Technology innovation and user experience is this story that is very hard to tell as you're doing it, but is completely obvious in retrospect. Once someone innovates, it's obvious to everyone else that it always should have been that way, but it isn't obvious when you're doing it. So I do have some respect for the people who can put their arms around the whole problem and get something done.

I don't usually quote Larry Ellison, but I'll make this exception:
There are really four phases. In phase one, everyone tells you you're crazy and it's the stupidest thing they ever heard. In phase two, they say, "There is some merit to the argument. It's still crazy, but there's some merit to it." Phase three is, "Well, we've done it better than they have." And phase four is, "What are you talking about? It was our idea in the first place."

May 19, 2009

Earthquakes, from the Wackshit Citizen Science Department

Lorna has an art studio about 60 feet from our wireless router. So mostly it works, and on 98 out of 100 days she streams iTunes music from my computer, and she paints with her music playing.

But one day a couple weeks ago, the music would.not.stop.skipping. 3 hours of it. We reset all the routers, looked for secret settings in the firmware, tried to move things around.

"Rebuffering..." declared iTunes. And "rebuffering..." And "rebuffering..."

Finally, Lorna got fed up, bad mood, can't paint without music. Decided I could, should, and would fix it, and walked across the yard.

Arriving in my office, she found me looking a little bit shaken. We'd just had an earthquake, a quite big one.

She hadn't felt it. That was funny.

And iTunes worked again.

But we didn't think much of this until a couple days later, when we were talking, and, do you think? They could have been related? Maybe EM interference? No way? But the timing?

WE HAVE TO TEST THIS!

So I wrote some scripts to download stuff continuously from her Mac, out back. Measure the bandwidth, graph it.
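A minimal sketch of that kind of monitor in Python. This is not the actual script behind the page, just an illustration of the measure-and-log loop; the function names, the log format, and the 4MB sample size are all assumptions. An in-memory buffer stands in for the remote download so it runs anywhere.

```python
import io
import time

def measure_throughput(stream, max_bytes, chunk_size=64 * 1024, clock=time.monotonic):
    """Read up to max_bytes from a binary stream and return bytes/sec."""
    start = clock()
    received = 0
    while received < max_bytes:
        chunk = stream.read(min(chunk_size, max_bytes - received))
        if not chunk:
            break  # stream ended early
        received += len(chunk)
    elapsed = clock() - start
    return received / elapsed if elapsed > 0 else float("inf")

def log_sample(log, stream, max_bytes, now=time.time):
    """Append one 'timestamp<TAB>bytes_per_sec' line that a grapher could plot."""
    rate = measure_throughput(stream, max_bytes)
    log.write(f"{now():.0f}\t{rate:.0f}\n")
    return rate

# Against a real machine you might pass urllib.request.urlopen(url) as the
# stream, where url is whatever file the remote Mac serves (hypothetical).
# Here an in-memory buffer stands in for the download.
log = io.StringIO()
log_sample(log, io.BytesIO(b"\0" * (4 * 1024 * 1024)), max_bytes=4 * 1024 * 1024)
print(log.getvalue(), end="")
```

Run something like this every minute from cron and you have a time series. If the numbers fall off a cliff right before the ground moves, you'll have it on record.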

So! Here is our little internal toy monitor thing: http://quake.herf.org/

How weird that you can monitor my wireless remotely. But now you can. Just what you wanted.







Today, there was an earthquake, and we didn't notice much on our little monitor (screenshot nearby). But it wasn't a big earthquake, so there's still hope.

And maybe, one day, it will catch something, and besides, I have to keep everyone entertained somehow.

May 5, 2009

Ramdisk for Photoshop OMG FAST

I happened to have 4GB of RAM lying around from a recent server rebuild (don't ask...)

So I stuck it in my XP box, giving me a total of 6GB. Brilliant thing to do for a 32-bit OS that only gives each process 2GB (and can't address more than 4GB in total), right?

Well then you get this thing called Ramdisk Plus.

And you make a 3GB Ramdisk in "unmanaged" mode, which XP mostly doesn't notice because it doesn't care.

And then you tell Photoshop it can use that disk to swap. And oh boy, it does.

Here's perfmon showing 6000 writes/sec at about 500MB/sec while running "Add Noise". It does more than that if you ask nicely.


Apr 20, 2009

Cloud DBs

I've been doing some research on "cloud databases" - the non-relational key/value storage systems that people are using to scale their web apps past MySQL and SQLite.

First are the Bigtable clones, where you actually get columns and higher-level features:
HBase: database used for Hadoop
Cassandra: database used by Facebook

These are "big" projects, and if you have a big application you might consider them. But there is a lot of code there, and they seem pretty new for their complexity.

Next are the low-level improvements on what people know as "DBM" or "Berkeley DB", the simple "put a value, get a value" interfaces. Typically these packages wrap a number of different backends: a fixed database (flat file), a hash table, and a B+Tree. Compared to Berkeley DB, these guys are faster and usually LGPL:

By some reports, using this kind of code is 10-100x faster than using MySQL or SQLite to do the same task. And bindings are good, supporting C, Java, Ruby, and Python.

Of course there's the popular "memcached" which stores key/value pairs in RAM across multiple machines. Memcached is interesting because people are using the protocol as a standard for persistent key/value storage (as well as what everyone knows, an implementation for RAM-only caching):

A feature you might want is to be able to access your database over the network, rather than by touching disk. Interesting entrants here:
http://memcachedb.org/ - Danga's memcached + Berkeley DB

Tokyo Cabinet and MemCacheDB support the "memcached protocol" and most of the above do some kind of RESTful storage. CouchDB does map/reduce for its indices, which sounds neat but proves to be 100x slower than MySQL in practice.

Finally, you should know which of these systems support horizontal scaling (i.e. linear scaling when you add more machines), and those include HBase, Cassandra, and some layers on top of the key/value guys. Most of the above systems (including CouchDB) do not scale horizontally, and you basically make full replicas of all of your data, or just use them on one disk.

LightCloud: scaling layer built on Tokyo Tyrant.
Project Voldemort: used by LinkedIn and others

At this point, I'm very impressed with the Tokyo stuff, and I especially like that I can break the key/value abstraction and do cursor ops on the btree directly. So if I have 1000 keys that appear sequentially, it is insanely fast to fetch them.
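To illustrate why cursor access on an ordered index is so much cheaper than a pile of point lookups, here's a toy sketch in Python. It does not use Tokyo Cabinet's API: a sorted list with `bisect` stands in for the B+tree, and the `SortedStore` class and the `photo:` key scheme are made up for the example.

```python
import bisect

class SortedStore:
    """Toy ordered key/value store; a sorted list stands in for a B+tree."""
    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, limit):
        """Cursor-style read: jump to the first key >= start, then walk forward."""
        i = bisect.bisect_left(self._keys, start)
        for key in self._keys[i:i + limit]:
            yield key, self._values[key]

store = SortedStore()
for n in range(5000):
    store.put(f"photo:{n:06d}", n)  # zero-padded so keys sort in numeric order

# One seek plus a sequential walk, instead of 1000 independent get() calls.
rows = list(store.scan("photo:001000", 1000))
print(rows[0], rows[-1], len(rows))
```

The trick only works if your keys sort the way you want to read them, which is why the keys here are zero-padded.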

For smaller projects I think I'm going to test out Tokyo Cabinet, and for larger ones LightCloud. Love to hear other suggestions.

Apr 17, 2009

my mod_gzip settings (deflate.conf)

Paul Buchheit posted about gzip settings, and I thought I'd post my deflate.conf (Apache) because the default Apache stuff isn't nearly aggressive enough.

This stuff goes into conf.d/deflate.conf, and it's cribbed from several places on the web. Wish I could credit them, but I forgot.

If you don't use settings like this you'll find that your CSS and JS files don't get compressed, or you'll compress them all the time, even for the browsers that can't handle them, or you'll get your stuff cached by proxies that will serve the files to browsers that can't handle them, etc. I've not done totally exhaustive testing, but this is what I use on all my sites.

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css
AddOutputFilterByType DEFLATE application/xml application/xhtml+xml application/rss+xml
AddOutputFilterByType DEFLATE application/javascript application/x-javascript

DeflateCompressionLevel 9

BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
# IE5.x and IE6 get no gzip, but allow 7+
BrowserMatch \bMSIE\s7 !no-gzip
# IE 6.0 after SP2 has no gzip bugs!
BrowserMatch \bMSIE.*SV !no-gzip
# Sometimes Opera pretends to be IE with "Mozilla/4.0"
BrowserMatch \bOpera !no-gzip
Header append Vary User-Agent env=!dont-vary
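If you want to see what those BrowserMatch lines do to a given browser, here's a small Python simulation. It's a sketch based on my reading of how mod_setenvif works (rules run in order, a case-sensitive regex search against the User-Agent, and a leading "!" unsets a variable); it is not Apache code, and the sample User-Agent strings are just typical examples.

```python
import re

# The BrowserMatch rules above, in order: (regex, [variables]); "!" unsets.
RULES = [
    (r"^Mozilla/4", ["gzip-only-text/html"]),
    (r"^Mozilla/4\.0[678]", ["no-gzip"]),
    (r"\bMSIE", ["!no-gzip", "!gzip-only-text/html"]),
    (r"\bMSIE\s7", ["!no-gzip"]),
    (r"\bMSIE.*SV", ["!no-gzip"]),
    (r"\bOpera", ["!no-gzip"]),
]

def browser_env(user_agent):
    """Return the set of environment flags left set after all rules run."""
    env = set()
    for pattern, variables in RULES:
        if re.search(pattern, user_agent):
            for var in variables:
                if var.startswith("!"):
                    env.discard(var[1:])
                else:
                    env.add(var)
    return env

SAMPLES = {
    "Netscape 4.08": "Mozilla/4.08 [en] (WinNT; U)",
    "IE 6": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "IE 7": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Firefox 3": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko Firefox/3.0",
}

for name, ua in SAMPLES.items():
    print(f"{name:14s} -> {sorted(browser_env(ua)) or 'no restrictions'}")
```

Worth noting: with the rules exactly as listed, an IE 6 User-Agent ends up with neither flag set, so if you really want IE 5.x and 6 to get no gzip, check that something sets no-gzip for them before the later lines clear it.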

There's an nginx equivalent around here somewhere, I'll dig it up.

Apr 10, 2009

Bravo - Google Updater open-sourced

Google has open-sourced the updater used by Earth, Chrome, etc.:
http://google-opensource.blogspot.com/2009/04/google-update-goes-open-source.html

The blog post above talks about two things: (a) transparency in how Google does things, and (b) saving people time.

I wouldn't underestimate (b), and notably how big a hurdle it is to write the basic "Client Software Infrastructure" today - downloads, builds, upgrade/downgrade, and especially autoupdate. You could spend a few months of a mid-level engineer to do this at a basic level, and a year to do it really right. Some big companies (e.g., Adobe) even make autoupdate clunky and awkward.

Google's system is well engineered, simple for the user, and it works really well.

Of course, Microsoft and Apple should provide these frameworks. Apple doesn't do much here aside from their own apps, though I like Sparkle, which is free.

Microsoft started to provide the basic tools in MSI, though it's not a full framework. But even their lower-level stuff is pretty much broken. A quick list of ways Microsoft dropped the ball with MSI:

1. Their installers are very slow.
2. MSI uses "chained certificates" that usually expire after a year, requiring software authors to chain together all old signing certs to make update work.
3. MSI uses an ancient compressor that makes installers nearly 2x as big as the better compressors available today would.

So anyway, thanks Google. Even though someone else probably should have done this instead, it's a big step forward.

Apr 2, 2009

A good-enough ZFS NAS

I've kept an old 2001-era Dell box running OpenSolaris with ZFS for the last few years. Filled it up. But I was so very happy with ZFS that I wanted to build a new box. And who wouldn't want a Netapp-like storage system for cheap?

I thought this would be easy, but it really wasn't, and so I thought I'd write it up for everyone who's in a similar situation. (If you're impatient, you can skip down to the parts list below.)

My "impossible" goals for a ZFS NAS:
  1. Quiet!
  2. Low power
  3. ECC RAM
  4. Big storage (4TB+), with 5-6 disks
  5. Compatible with OpenSolaris
  6. Reasonable cost
If you didn't care about most of those things, you could just get a NetApp for $100,000. But look, you probably didn't do that.

At a medium pricepoint, you could just buy Sun hardware. e.g., this guy converted an Ultra 40 workstation to run ZFS, but sometime after that Sun canceled the Ultra 40. Now you can only buy the 4-disk version, and that's actually not enough for a huge NAS when the bootdisk eats one (more on that later).

Rack-mount stuff satisfies most of the above, but it's a little bit expensive, and very loud. And if you want loud, you could just buy a Sun Fire X4140.

Sun has been very slow to make Solaris work on non-Sun hardware, so compatibility turns out to be difficult. You can find people who've read some of the Sun whitepapers and cloned the hardware for a smaller cost. For instance, "Thumper" (Sun's massive ZFS box) apparently uses multiple AOC controllers like this one. The AOC costs $99 and supports 8 SATA disks.

I care about ECC RAM because I decided to worry about bitrot.

Even if you go with the latest generation chips, Intel has a small edge in power consumption. When I started this project, Intel had a 2:1 edge, so I was determined to get an integrated Intel solution with ECC RAM that was reasonably compatible. Sun's last generation hardware was mostly AMD, but the current stuff is Intel Xeon, so all this works ok. Also, Intel's integrated LAN, and SATA chipset (ICH9x) are well supported, which means you don't need add-on cards to get basic stuff done.

Parts list:
  • Supermicro MBD-X7SBL-LN1
  • 4GB ECC RAM
  • 5 x Western Digital Caviar Green 1.5TB (WD15EADS)
  • SATA DVD-ROM (any $20 one)
  • Antec P180 mini to hold 5 drives (I'm using an old Sonata). P183 if you want 6 drives.
  • Intel Xeon E3110. If you're cheap: a non-Xeon E5200. 45nm to save power.
  • OpenSolaris nevada build 101b
Motherboard/RAM: Intel's desktop chipsets don't support ECC, but I found the Intel 3200 chipset, which is sort of a low-end server/workstation chipset, on boards that cost not much more than a desktop board. These boards also have 6 SATA ports and integrated video. You don't need the 3210 unless you want cool add-on cards.

The Intel board I ordered was actually really terrible, so my second try was a Supermicro MBD-X7SBL-LN1, also based on the Intel 3200. It's great: micro-ATX, integrated video, wonderful and sensible layout. Slow video, just like you want on a server. In contrast, Intel's board has the power plug in the middle of the board, SATA ports at random angles, and it takes ages to boot (POST). But this Supermicro motherboard is really absolutely wonderful.

4GB of ECC RAM is now mostly free ($50). You could buy 6GB, or 8, or whatever you like. Solaris runs in x64 mode, so all your RAM will get used.

CPU: The cheapest 45nm Xeon I found is the Intel Xeon E3110. 3GHz dual-core seems excessive, but it is a Xeon, so it made me feel better, and it uses less than 40W.

Drives: Western Digital Caviar Green drives use 3.7W at idle. Most drives use 8W. Samsung EcoGreen drives are similar to the Caviar Green. That idle figure is with the platters actually spinning, not spun down.

In contrast, Seagate 1.5TB drives use 8W at idle, and they have firmware bugs. (I just flashed my colo RAID and bricked one of the drives, so I had no interest in upgrading 5 of them.) I went with the Caviar Green.

For the moment, 1.5TB drives are priced at about 10c/GB, and 2TB drives are about 15c/GB. So 1.5TB is a nice choice for now.
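
For a rough sense of what those per-GB prices mean per drive, here is a small sketch. The prices are the approximate ones quoted above; the function name is mine, and integer cents are used only to keep the arithmetic exact.

```python
# Per-drive cost from the approximate per-GB prices quoted in the post.
# drive_cost_cents is a hypothetical helper name, not from any library.

def drive_cost_cents(capacity_gb: int, cents_per_gb: int) -> int:
    """Return the price of one drive in cents."""
    return capacity_gb * cents_per_gb

cost_1500 = drive_cost_cents(1500, 10)  # 1.5TB drive at about 10c/GB
cost_2000 = drive_cost_cents(2000, 15)  # 2TB drive at about 15c/GB

print(f"1.5TB drive: ${cost_1500 / 100:.0f}")
print(f"2TB drive:   ${cost_2000 / 100:.0f}")
```

So the 2TB drive costs about twice as much for a third more space, which is why 1.5TB is the sweet spot at these prices.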

Really read this: "consumer" drives need you to enable TLER (time-limited error recovery) before you put them in a RAID. Western Digital provides WDTLER.EXE to switch this. What you want is for a drive to fail fast (7 seconds), and tell the OS and RAID about it.

Also, turn on AHCI in your BIOS. Solaris supports it now, and it's faster. (You have to be in IDE mode to run WDTLER, though.)

Booting huge drives, booting ZFS

Drives with sizes above 1TB use a new partitioning scheme. Booting from >1TB drives is totally incompatible with the shipping version of Solaris 10u6. 10u6 will complain about everything regarding a 1.5TB drive. People say that the relevant patches might make it into 10u8, but this isn't a sure thing, and it probably won't ship for a year. Just skip the "stable" Solaris 10.

So in the meantime, you head back to OpenSolaris (the "community" version), which of course has the latest patches. I'm running "nevada" 101b. It supports the new partitioning scheme, and the installer doesn't say anything about your drives being too big. This new build is supposed to provide boot support up to 2TB, but I'm not sure how tested it is.

Also, I was very pleased to learn that ZFS is now bootable.

Oh, but! ZFS is not bootable in RAIDZ mode (RAIDZ is roughly equivalent to RAID-5, and RAIDZ2 to RAID-6), which you probably want for your main storage. So you need to dedicate some disks to boot ZFS, and create a separate RAIDZ to store data. (I cheated and used a single bootdisk, but I keep good backups.) Yes, I've devoted a 1.5TB drive to booting an OS that uses 20GB. Yeah it would be smarter to pull 20GB from the front of each drive and create a tiny mirror and a bigger RAIDZ. But anyway, I didn't do that. Maybe someone else will.
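
Here is a rough sketch of what that layout leaves for data, assuming single-parity RAIDZ over the four non-boot drives. The parity level is my assumption, and the estimate ignores ZFS metadata overhead, so treat the result as an upper bound.

```python
# Rough usable-capacity estimate for a RAIDZ pool.
# Assumptions (not from the post): single parity, no metadata overhead.

TB = 10**12   # drive makers' terabyte
TIB = 2**40   # what most operating systems report

def raidz_usable_bytes(pool_drives: int, drive_bytes: int, parity_drives: int = 1) -> int:
    """Capacity left for data once parity drives are subtracted."""
    if pool_drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (pool_drives - parity_drives) * drive_bytes

total_drives = 5
boot_drives = 1                     # the single dedicated bootdisk
drive_bytes = 1500 * 10**9          # one 1.5TB drive

usable = raidz_usable_bytes(total_drives - boot_drives, drive_bytes)
print(f"usable: {usable / TB:.2f} TB ({usable / TIB:.2f} TiB)")
```

Three data drives' worth of space comes to 4.5TB in drive-maker units, or a bit over 4 in binary units.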

One caveat: smartmontools doesn't work with the Intel ICH9 ("-d ata" is not implemented). If you need SMART monitoring of your disks, get a Marvell-based controller, like the AOC above.

Now, to summarize, we've got:
  • Integrated video, LAN, ECC motherboard.
  • Low power, server-class parts.
  • Low-power hard drives, updated to enable TLER.
  • Quiet case, can't hear it.
  • 4GB RAM, x64-compatible Intel CPU.
  • >4TB available, with ZFS and snapshots!
  • Cost ~$1500.
  • It Works!
Other options

If you want to save the 1.5TB hassle and the TLER hassle, get the RE2-GP (this class of drive appears to add 70% to the price):


If you are willing to relax my ECC RAM requirement, a world of cheaper options (based on desktop-class hardware) opens to you. Lots of the motherboards seem to mix up two SATA chipsets, which seems really wrong to me. But here are some interesting links I found:



Also, many people have built ZFS NAS with 2 huge drives (mirrored), on an Intel Atom motherboard (the chip used in Netbooks). Unfortunately it is hard to find an Atom motherboard with >2 SATA ports. But this seriously minimizes power, and with the next-generation motherboards will do so even more. If you need 2TB or less this is a very good option.

Mar 30, 2009

Slow-motion finance: good housing starting to drop

Since I spent a lot of last year trying to understand real estate, friends now ask me things like, "Should I buy a house now? Prices are down in LA by half, right? And my realtor friend said a house sold last week for asking price."

Actually, I don't usually say whether or not anybody should buy anything. But if you want to know if desirable housing areas are near their bottom yet? With prices down almost 10% in nicer zipcodes? Nope, not close.

Housing is slow-motion finance. Frozen snails slow.

The first thing you should know is that houses don't sell for 20% or 30% below asking price. The vast majority of offers are just ignored. Most deals close at within 10% of asking, and the waiting game in the middle is where buyers don't buy and sellers don't sell. Some buyers buy, but a lot just wait.

So, what you see right before housing drops in price is that inventory goes up. The number of homes on the market goes way up. You should just figure out months of inventory (homes for sale / homes sold per month), and if you see more than 6 months of inventory, things are slow. If you see 24 months of inventory, you're in free-fall. Prices are going to be on the way down for a while when you see this, and it doesn't matter what the median prices say right now.
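
The inventory calculation is simple enough to sketch. The 6-month and 24-month thresholds are the ones above; the function names, the "normal" label, and the sample numbers are made up for illustration.

```python
# Months of inventory = homes currently for sale / homes sold in a month.
# Thresholds come from the post; names and sample figures are hypothetical.

def months_of_inventory(homes_for_sale: int, homes_sold_per_month: int) -> float:
    """How long it would take to sell everything listed at the current pace."""
    if homes_sold_per_month <= 0:
        return float("inf")  # nothing is selling at all
    return homes_for_sale / homes_sold_per_month

def describe_market(months: float) -> str:
    if months >= 24:
        return "free-fall"
    if months > 6:
        return "slow"
    return "normal"

# Hypothetical zipcode: 240 listings, 10 sales last month.
months = months_of_inventory(240, 10)
print(f"{months:.0f} months of inventory: {describe_market(months)}")
```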

Inventories in many LA Westside zipcodes have tripled in the last two years. It usually takes about a year of increasing inventory before the peak declines in prices happen. And that's the first of several of these years we're in right now.

And a quick review of how affordability is working on the high end:
  1. Higher end homes are now facing a 7% financing rate, compared with conforming loans being offered around 4%
  2. Temporary "conforming" limits have ended, so fewer loans are available at 4%
  3. Banks now check your income when you apply for a loan
  4. The top income tax rate is moving up to 39%, while the mortgage interest deduction is moving to 28%. (Previously both were 35%.) Did you notice this in the bailout bill?
  5. Oh, there was a huge stock market crash
  6. General acknowledgement that you're not going to get rich buying stuff you can't afford
  7. Option ARMs and other adjustable-rate loans are starting to get more evil, fast
These factors are all causing downward pressure on housing prices, but the thing that is really fascinating about real estate is how long it takes to decline.
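
To put a number on the first item in that list, here is a sketch using the standard fixed-rate amortization formula. The 30-year term and the $1M loan are my assumptions, just to compare the 4% and 7% rates.

```python
# Monthly payment on a fixed-rate loan (standard amortization formula).
# The 30-year term and $1,000,000 principal are hypothetical examples.

def monthly_payment(principal: float, annual_rate: float, years: int = 30) -> float:
    r = annual_rate / 12      # monthly interest rate
    n = years * 12            # number of payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

loan = 1_000_000
at_4 = monthly_payment(loan, 0.04)
at_7 = monthly_payment(loan, 0.07)

print(f"4%: ${at_4:,.0f}/month")
print(f"7%: ${at_7:,.0f}/month")
# The same monthly budget supports a smaller loan at the higher rate:
print(f"loan supported at 7% for the 4% payment: {at_4 / at_7:.0%} of the original")
```

Under these assumptions, a buyer with a fixed monthly budget can borrow roughly 28% less at 7% than at 4%.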

In stock-market-years, housing takes forever to fall (from a buyer's perspective).

If we had a housing market that reacted like the stock market you'd see things like this:
  1. Interest rates for Jumbo loans move from 5 to 7%: prices go down by 20% tomorrow.
  2. Income tax rate change: prices down by 10% tomorrow.
  3. Market down by 40%: oh @#&...
But in the housing market, these things take years to play out. A change that reduces the affordability of housing by 10% will have no effect tomorrow or next week, and it will take multiple years to work its way into the market.

Despite current declines, my guess is that we won't see the full effect of price declines in "better" areas for 2-3 years.

Also, redfin has nice market data. For instance, they automatically make this page about Beverly Hills 90210, and you can swap in your own zipcode: http://www.redfin.com/zipcode/90210

Finally, one other interesting thing about the LA market. The median prices are being set by the massive foreclosures in the inland areas, not the areas where most people live. That seems silly, but right now it's true. So if you meet a realtor who tells you that prices are down 40% and you should buy, they're telling you a story about a place where you probably don't know anybody, and it's very very hot during the summer.

Mar 17, 2009

Bitrot, huge disks, and RAID

A few months ago, I had a bad experience with rotting JPEGs. I stored them on a 3ware RAID with buggy firmware, and it took me a darned long time to figure out why 10% of my pictures were corrupt. (Flashing the firmware seems to have helped.) Most of the errors were tiny--one bit changed in a 10MB file. But enough to make them really impossible to recover.

As a result, I've become obsessed with understanding the bitrot problem and figuring out practical ways to solve it.

So let's take it from bottom to top:

1. Big disks, RAID, scrubbing

We're seeing enormous disks right now (2TB!) and on these disks sectors fail at a reasonable rate. If you're storing 2TB of data you can't afford to lose, you should really be guarding against sector and bit-level errors with something like RAID-1. Similarly, you should have your RAID do weekly verifies so that failed sectors can be recovered from a second disk. Your RAID won't do this unless you tell it to, so figure out how.

How this works: if you do a weekly "scrub" your RAID controller or Software RAID will read every sector on a disk, and if failures are detected, the disk will reallocate a sector, and the controller will copy data from the good disk. (If you were using just one disk instead, you'd have lost some bits from this sector, or all of the data there.) You are much less likely to see the same sector fail on two disks at once, so this kind of scrubbing works really well.

Scrubbing is easy. Your hardware RAID probably has a "verify" scheduler. If you're using Linux software RAID, you can put a script like this in your cron.weekly:

echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action

You don't have to do a lot more until a disk fails. But what you've done is ensured that media errors don't corrupt your data.
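
If you have more arrays than you want to list by hand, the cron script above can be generalized. This is a minimal sketch, assuming the standard Linux md sysfs files (sync_action and mismatch_cnt); the function names are mine, and the sys_block parameter exists only so you can point it at a test directory.

```python
# Ask every Linux software-RAID (md) array to run a "check" pass.
# Assumes the md sysfs layout: /sys/block/mdN/md/sync_action.
from pathlib import Path

def request_md_checks(sys_block: Path = Path("/sys/block")) -> list[str]:
    """Write "check" to each idle array; return the arrays that were started."""
    started = []
    for action in sorted(sys_block.glob("md*/md/sync_action")):
        # Leave arrays alone if they are already resyncing or checking.
        if action.read_text().strip() == "idle":
            action.write_text("check\n")
            started.append(action.parent.parent.name)
    return started

def mismatch_counts(sys_block: Path = Path("/sys/block")) -> dict[str, int]:
    """Read mismatch_cnt for each array, e.g. after a check has finished."""
    return {
        path.parent.parent.name: int(path.read_text())
        for path in sorted(sys_block.glob("md*/md/mismatch_cnt"))
    }
```

Calling request_md_checks() as root from a weekly cron job does the same thing as the two echo lines, and mismatch_counts() gives you something to look at once the check finishes.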

But you must scrub. You can't wait for a disk to fail, not any more. If you do that, you will find out about bad sectors when you're rebuilding the RAID (errors will show up on the "good" disk), and lose some data that way. Weekly scrubs are recommended.

2. Your RAID controller is buggy. Your disk controller flips bits randomly. Your bus scrambles data. You have bad RAM.

These things are not as well handled by most systems today. If you're using a system to store backups, you can always run checksums and do a full verify of your backup. This is an especially good idea, because it will help you detect hardware problems before they creep into other data that you have no second copy to rely on.
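
A full verify of a backup can be as simple as keeping a manifest of checksums and re-reading everything later. This is a minimal sketch; the choice of SHA-256, the function names, and the manifest format are all my assumptions, not any particular backup tool's behavior.

```python
# Build a checksum manifest for a directory tree and verify it later.
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        str(p.relative_to(root)): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or no longer match the manifest."""
    bad = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_file(path) != expected:
            bad.append(rel)
    return bad
```

Build the manifest when you write the backup, store it somewhere else, and run verify() on a schedule. A single flipped bit in a 10MB JPEG shows up as a mismatch.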

My issue was buggy RAID firmware. In general, though, bad RAM is the most likely culprit, so use ECC. It's hard to find cheap motherboards that use ECC (thanks Intel?), but it is important.

3. End to end checksumming

The only system I know of that actually checks, end to end, that the hardware is working is ZFS (available in OpenSolaris). I recommend it for a lot of reasons.

ZFS keeps a checksum for every block it writes, stored separately from the data it covers, and verifies it on every read. That means you can see silent corruption (e.g., bits your controller flipped, rather than bits your disk's own checksumming might detect). This is better in almost every way than playing "trust me" with your hardware.
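
To make the idea concrete, here is a toy model of checksummed, mirrored storage. It is a sketch of the general technique, not ZFS's actual on-disk format: the class name is made up, the "disks" are Python dicts, and SHA-256 is my choice of checksum.

```python
# Toy model: a checksum kept apart from the data catches silent corruption,
# and a mirror copy lets the read path repair it.
import hashlib

class MirroredStore:
    def __init__(self) -> None:
        self.copies = ({}, {})   # two pretend disks: block number -> bytes
        self.checksums = {}      # kept separately from the data blocks

    def write(self, block_no: int, data: bytes) -> None:
        self.checksums[block_no] = hashlib.sha256(data).digest()
        for disk in self.copies:
            disk[block_no] = data

    def read(self, block_no: int) -> bytes:
        expected = self.checksums[block_no]
        for disk in self.copies:
            data = disk[block_no]
            if hashlib.sha256(data).digest() == expected:
                # Found a good copy: overwrite every copy with it (self-heal).
                for other in self.copies:
                    other[block_no] = data
                return data
        raise IOError(f"block {block_no}: every copy failed its checksum")

store = MirroredStore()
store.write(7, b"holiday photo bytes")

# Simulate a controller flipping one bit on the first disk.
damaged = bytearray(store.copies[0][7])
damaged[0] ^= 0x01
store.copies[0][7] = bytes(damaged)

print(store.read(7) == b"holiday photo bytes")
```

A plain mirror without checksums can tell that two copies differ, but not which one is right. The separately stored checksum is what breaks the tie.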

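You still have to schedule weekly scrubs (ZFS calls this a "scrub", and you start one with "zpool scrub"; "resilvering" is the related term for rebuilding onto a replaced disk).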

Hardware for storage

In the meantime, I'm having a reasonably difficult time finding a server that is low-power, low-noise, has ECC RAM, and works with OpenSolaris. If you give up on low-noise or ECC, things get easier, but I do love over-constrained problems. Leave comments with good links, if you have any.

Mar 11, 2009

Micro-lending for the rest of us...Also, real time accounting?

Lorna and I were talking over dinner about businesses going under, sometimes for really dumb reasons. For instance, you hear about mostly-profitable small businesses that can't get financing on the same terms they could before.

And so we were wondering why nobody we know can invest in (or even provide high-interest loans to) these companies. They just close up shop and disappear (like bank executives on a private jet? Hm, not really.)

It's pretty clear these guys can't go do an IPO to get financing. And angel/VC/private equity is interested in massive upside, not short-term loans to operating businesses.

But if you come back to it, why can a regular person invest in a public company? Because there are annual reports, and the SEC, and reasonable penalties for doing the wrong thing. And as a result you get something we call transparency, and people can sort of understand (after the fact) what's going on in a public company. And people go to jail for saying the wrong thing at the wrong time.

For private equity, something else happens: you get a small number of investors, typically with high net-worth, so nobody prevents them from taking big risks. In most cases, you actually get an inside view of a company's books. They're not public, but they're known to the parties who are investing.

The problem with all these old systems is that they ignore computers and the internet.

How do we get financing to small private companies without making them go through Sarbanes-Oxley and hire PricewaterhouseCoopers to do their annual reports?

Well, how about we have them use accounting software that posts all their financial data online instantly, and make a market for investors who want to take risks based on that data?

Consider what happens if companies that want financing in this market commit to using one of these "open" accounting systems in their day-to-day operations (not paper, and not one set of internal books and another for the public): you'll have a million eyes looking over every receipt. Find a way to make this software free, and integrate it with all the accounting systems that are in place today.

Who would sign up for this? Well lots of small business owners, if the business they spent 20 years building is facing bankruptcy. Eventually, companies could have their privacy or the benefit of easy financing, or some mix in the private equity model.

And who else should do this? Right now, we're also trying to figure out how to build more regulation into our financial system.

But isn't the problem lack of information? With information, there are people to crunch the numbers. With mega-billion-dollar companies and form 10-K, there isn't much transparency at all. It's hard to understand, really.

But couldn't we build a real-time accounting system, with a variety of privacy models, to really fix this? Use it in government, use it for financial firms that get bailout money, use it for small businesses that need emergency lending.

Related thoughts: I think you might find some inspiration in Glenn Kelman's opening of Redfin's financial model. Great stuff, and really not shocking enough to keep it private, if investment could flow more freely as a result.

Mar 6, 2009

Eric Lewis, Westin Lobby at TED

Eric Lewis did an impromptu performance in the Westin Lobby at TED.
Some samples...


Eric Lewis at TED 2009 from Michael Herf on Vimeo.

Data Solutions to Big Problems

There are a lot of problems that have proved resilient to "elegant" science - the classical science that says you can reduce most things in the world to a small equation or simple algorithm.

I've been having a ton of conversations about this recently, and I love pulling a particular quote out of context from the Hays/Efros paper "Scene Completion Using Millions of Photographs". These guys were trying to solve the "remove that object from my picture" problem. Lots of people have tried, and previous approaches make blurry blobs or funny textures to fill in the gaps. In other words, the old methods don't work too well.

Hays & Efros say, "What if we had a big database of similar images to use?" And this wonderful thing happens:
"Indeed, our initial experiments with the gist descriptor on a dataset of ten thousand images were very discouraging. However, increasing the image collection to two million yielded a qualitative leap in performance..."
And reading this should get you very excited. It means that you can solve some problems with "lots of data" that you can't solve with a moderate amount of data. Their problem is solvable if you have 2 million images but not with 10,000.

We see similar things in Machine Translation, and many people have suggested biology and medical science will be revolutionized by similar techniques.

But this brings up two important issues:

1. Trust-the-Oracle-Science?

The first one came from a discussion with my cousin, who's a newly-minted lawyer working with DNA evidence. And at a certain level, the legal community is now trusting an algorithm to tell them the truth. In effect, the query they're asking is now quite simple and specific: how similar are these DNA fragments? And so there's not a multi-dimensional Bayesian learning network or a lot of opportunity for a coding slip-up to mess with evidence.

That's today. Simple means Admissible. Guilty or Innocent.

But consider what happens if we base a new science on machine learning techniques? We will have these very self-referential systems where "truth" is just part of a large statistical model, and small errors might multiply to have quite serious consequences. Predicting health risks and the usual impacts on insurance, knowing too much about a person's behavior, privacy, you get it. At a certain level, you wonder if the computer becomes an oracle that accidentally has too much power.

Traditional science must re-double efforts to incorporate the results from these new models, and establish firm theory for why the "oracle" is telling the truth, or not.

2. Google Science

More optimistically, there are a ton of interesting problems to solve using this kind of data-rich approach. 

Another conversation yesterday brings me to this question: can a regular 2-person research team make headway against the Googles of the world? Or does Google necessarily have such an advantage with its reams of data, such that research in certain areas, outside the Googleplex, comes to a halt? 

This is a harsh question, but you could argue that Google has a huge lead in machine translation, and more training data for, say, clicks on images than everyone else in the world.

But the Hays/Efros example gives me a lot of hope that these tools will be readily available. These two guys at CMU were able to build a database of 2.3M images from Flickr, all with public tags, and they were able to do useful computation on it. 

I do think that there is some danger of research becoming verticalized. If one organization controlled all medical data, they would also control the research results to be gleaned from it. So as we see changes in the next 10 years, we should ensure that whoever controls medical data is absolutely required to make aggregate data public.

But with a few small tweaks, the Flickr/Hays/Efros example and the existing data available on the web make it possible for very small teams to make headway here, as well as the 900-lb gorillas. I'm very optimistic that there isn't a Google wall, yet.

Today, it is still mostly a small project to write a web crawler. And distributed systems that crunch through petabytes of data are quite affordable. And we are seeing serious efforts to make data like this public, accessible through APIs, and useful for research.

I think that today, if you're clever enough your two-person team can make the next great vision algorithm or medical breakthrough. And we should work to keep it that way.

Feb 25, 2009

Notes on Toolbar 6

I tried the quick search in Google Toolbar 6.

Typed "chrome":
#1 result: Search google for chrome
#2 result: [Run] Google Chrome

Typed "photoshop":
#1 result: Search google for photoshop
#2 result: Run Photoshop Lightroom
#3 result: Run Photoshop

So I know Google wants me to search, but it's a bit much, and the sorting isn't very smart. I would type "Lightroom" if I wanted Lightroom, after all.

I expected something more like Quicksilver or Spotlight: if I type a name that exactly matches a local app in my Start menu, maybe running it could be done by hitting "enter" rather than "down arrow once or twice or three times, then enter".

Maybe I don't really need another way to search the web. I have a browser for that. This toolbar search is fast, and it could be useful if it really did what I wanted, instead of what it does. As it is, I'll probably uninstall it.

Feb 24, 2009

Why you shouldn't look at screensavers

16 hours/day.
70W (video card 30W + monitor 40W).
= $4 /month.

And you probably don't look at it that much.

Use power saving. It's the best screensaver. 

$0.01/month. Energy Star.
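
Here's the arithmetic behind the $4 figure, as a sketch. The post doesn't state an electricity rate, so the $0.12/kWh below is my assumption, picked because it roughly reproduces the number. The function name is made up.

```python
# Monthly electricity cost of leaving a screensaver running.
# The 70W and 16 hours/day are from the post; the rate is an assumed value.

def monthly_cost_usd(watts: float, hours_per_day: float,
                     usd_per_kwh: float, days: int = 30) -> float:
    kwh = watts / 1000 * hours_per_day * days
    return kwh * usd_per_kwh

ASSUMED_RATE = 0.12  # USD per kWh (assumption, not from the post)

screensaver = monthly_cost_usd(watts=70, hours_per_day=16, usd_per_kwh=ASSUMED_RATE)
print(f"screensaver: ${screensaver:.2f}/month")
```

That works out to about 34 kWh a month, spent on a display nobody is looking at.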

Feb 23, 2009

Property vs. Contract Rights

To me, the End Days Battle (i.e. where things really go in our economy) seems to be a fight between Contract Law and Property Rights.

Bankruptcy establishes a pretty clear hierarchy: 

Contracts lose to Property. Your creditors get to claim your assets, regardless of the weird contracts you wrote that obligate you to pay all your assets to your nephew.

And I must point out that, in this regard, a CDS (credit default swap) is just a Contract. It's not Property. And as we've written a lot of bad contracts in the last five years that nobody can pay for, at some point somebody's going to blow up a lot of them.

I bet that even nationalization can't pay for some of this stuff. Didn't work for Iceland.

Really, some of the contracts just have to be torn up.