Nerdblog.com: March 2009

Slow-motion finance: good housing starting to drop

Since I spent a lot of last year trying to understand real estate, friends now ask me things like, "Should I buy a house now? Prices are down in LA by half, right? And my realtor friend said a house sold last week for asking price."

Actually, I don't usually say whether or not anybody should buy anything. But if you want to know if desirable housing areas are near their bottom yet? With prices down almost 10% in nicer zipcodes? Nope, not close.

Housing is slow-motion finance. Frozen snails slow.

The first thing you should know is that houses don't sell for 20% or 30% below asking price. The vast majority of offers that low are just ignored. Most deals close within 10% of asking, and the waiting game in the middle is where buyers don't buy and sellers don't sell. Some buyers buy, but a lot just wait.

So, what you see right before housing drops in price is that inventory goes up: the number of homes on the market goes way up. To measure it, divide the number of homes for sale by the number of homes sold in a month. The result is months of inventory. If you see more than 6 months of inventory, things are slow. If you see 24 months of inventory, you're in free-fall. Prices are going to be on the way down for a while when you see this, and it doesn't matter what the median prices say right now.
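Here is a minimal sketch of that ratio in Python. The function names, the "normal" label for anything at or under 6 months, and the sample numbers are all made up for illustration. Only the 6-month and 24-month thresholds come from the paragraph above.

```python
def months_of_inventory(homes_for_sale, homes_sold_per_month):
    """Months needed to sell current listings at the current sales pace."""
    if homes_sold_per_month <= 0:
        raise ValueError("need at least one sale per month to compute a ratio")
    return homes_for_sale / homes_sold_per_month


def describe(months):
    """Label a reading using the rough thresholds from the text (6 and 24)."""
    if months >= 24:
        return "free-fall"
    if months > 6:
        return "slow"
    return "normal"  # assumed label; the text only names the two cases above


# Made-up example: 240 listings and 10 sales last month.
example = months_of_inventory(240, 10)
print(example, describe(example))
```

Track the number for a single zipcode month over month. The trend matters more than any one reading.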

Inventories in many LA Westside zipcodes have tripled in the last two years. It usually takes about a year of increasing inventory before the peak declines in prices happen. And right now we're in the first of several such years.

And a quick review of how affordability is working on the high end:

Higher-end homes now face a 7% financing rate, while conforming loans are offered at around 4%
The temporary "conforming" limits have ended, so fewer loans are available at 4%
Banks now check your income when you apply for a loan
The top income tax rate is moving up to 39%, while the mortgage interest deduction is moving to 28%. (Previously both were 35%.) Did you notice this in the bailout bill?
Oh, there was a huge stock market crash
General acknowledgement that you're not going to get rich buying stuff you can't afford
Option ARMs and other adjustable-rate loans are starting to get more evil, fast

These factors are all causing downward pressure on housing prices, but the thing that is really fascinating about real estate is how long it takes to decline.

In stock-market-years, housing takes forever to fall (from a buyer's perspective).

If we had a housing market that reacted like the stock market you'd see things like this:

Interest rates for Jumbo loans move from 5 to 7%: prices go down by 20% tomorrow.
Income tax rate change: prices down by 10% tomorrow.
Market down by 40%: oh @#&...
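The 20% in the first item is roughly what fixed-payment arithmetic gives. Here is a rough sketch, assuming a 30-year fully amortizing fixed-rate loan and a buyer whose monthly budget stays the same. It ignores down payments, taxes, and insurance, and the $1,000,000 loan is just a made-up starting point.

```python
def monthly_payment(loan, annual_rate, years=30):
    """Payment on a fully amortizing fixed-rate loan."""
    r = annual_rate / 12.0
    n = years * 12
    if r == 0:
        return loan / n
    return loan * r / (1.0 - (1.0 + r) ** -n)


def loan_for_payment(payment, annual_rate, years=30):
    """Largest loan a given monthly payment can carry (inverse of the above)."""
    r = annual_rate / 12.0
    n = years * 12
    if r == 0:
        return payment * n
    return payment * (1.0 - (1.0 + r) ** -n) / r


# Hypothetical buyer: whatever payment supports a $1,000,000 loan at 5%.
budget = monthly_payment(1_000_000, 0.05)
loan_at_7 = loan_for_payment(budget, 0.07)
drop = 1.0 - loan_at_7 / 1_000_000
print(f"payment ${budget:,.0f}/month; loan at 7% ${loan_at_7:,.0f}; drop {drop:.1%}")
```

Under those assumptions, the same payment carries a loan about 19% smaller at 7% than at 5%.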

But in the housing market, these things take years to play out. A change that reduces the affordability of housing by 10% will have no effect tomorrow or next week, and it will take multiple years to work its way into the market.

Despite current declines, my guess is that we won't see the full effect of price declines in "better" areas for 2-3 years.

Also, Redfin has nice market data. For instance, they automatically make this page about Beverly Hills 90210, and you can swap in your own zipcode: http://www.redfin.com/zipcode/90210

Finally, one other interesting thing about the LA market. The median prices are being set by the massive foreclosures in the inland areas, not the areas where most people live. That seems silly, but right now it's true. So if you meet a realtor who tells you that prices are down 40% and you should buy, they're telling you a story about a place where you probably don't know anybody, and it's very very hot during the summer.

Bitrot, huge disks, and RAID

A few months ago, I had a bad experience with rotting JPEGs. I stored them on a 3ware RAID with buggy firmware, and it took me a darned long time to figure out why 10% of my pictures were corrupt. (Flashing the firmware seems to have helped.) Most of the errors were tiny--one bit changed in a 10MB file. But enough to make them really impossible to recover.

As a result, I've become obsessed with understanding the bitrot problem and figuring out practical ways to solve it.

So let's take it from bottom to top:

1. Big disks, RAID, scrubbing

We're seeing enormous disks right now (2TB!) and on these disks sectors fail at a reasonable rate. If you're storing 2TB of data you can't afford to lose, you should really be guarding against sector and bit-level errors with something like RAID-1. Similarly, you should have your RAID do weekly verifies so that failed sectors can recover from a second disk. Your RAID won't do this unless you tell it to, so figure out how.

How this works: if you do a weekly "scrub", your RAID controller or software RAID will read every sector on each disk. If failures are detected, the disk will reallocate the sector, and the controller will copy the data over from the good disk. (If you were using just one disk instead, you'd have lost some bits from this sector, or all of the data there.) You are much less likely to see the same sector fail on two disks at once, so this kind of scrubbing works really well.

Scrubbing is easy. Your hardware RAID probably has a "verify" scheduler. If you're using Linux software RAID, you can put a script like this in your cron.weekly:

echo check > /sys/block/md0/md/sync_action
echo check > /sys/block/md1/md/sync_action

You don't have to do a lot more until a disk fails. But what you've done is ensured that media errors don't corrupt your data.

But you must scrub. You can't wait for a disk to fail, not any more. If you do that, you will find out about bad sectors when you're rebuilding the RAID (errors will show up on the "good" disk), and lose some data that way. Weekly scrubs are recommended.

2. Your RAID controller is buggy. Your disk controller flips bits randomly. Your bus scrambles data. You have bad RAM.

These things are not as well handled by most systems today. If you're using a system to store backups, you can always run checksums and do a full verify of your backup. This is an especially good idea, because it will help you detect hardware problems before they creep into other data that you have no second copy to rely on.
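A minimal sketch of that kind of verify in Python: record a SHA-256 for every file once, then re-hash later and report anything that changed. The function names are made up for this example, and how you store the manifest (JSON, a text file, whatever) is left open.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def verify(root, manifest):
    """Return relative paths that are missing or whose contents changed."""
    bad = []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or sha256_of(full) != expected:
            bad.append(rel)
    return bad
```

Build the manifest right after you write the backup, keep a copy of it somewhere other than the disks being checked, and run verify on a schedule. A single flipped bit in a 10MB file is enough to change the hash.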

My issue was buggy RAID firmware. In general, though, bad RAM is the most likely culprit, so use ECC. It's hard to find cheap motherboards that support ECC (thanks Intel?), but it is important.

3. End to end checksumming

The only system that actually checks that the hardware is working is ZFS (available in OpenSolaris). I recommend it for a lot of reasons.

ZFS stores a checksum for every block it writes, so you can see silent corruption (e.g., bits your controller flipped, which the disk's own error checking will never notice). This is better in almost every way than playing "trust me" with your hardware.

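You still have to schedule weekly scrubs (ZFS calls this "scrubbing" too, and the command is zpool scrub; "resilvering" is the ZFS term for rebuilding data onto a replaced disk).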
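To make the idea concrete, here is a toy model in Python of a two-way mirror with a checksum per block. It is not how ZFS is implemented; the class and its methods are invented for illustration. It only shows why a checksum stored apart from the data lets a reader tell which copy is the good one, which plain mirroring can't do.

```python
import hashlib


def checksum(block):
    return hashlib.sha256(block).digest()


class MirroredStore:
    """Toy two-way mirror that keeps a checksum for every block."""

    def __init__(self):
        self.disks = [{}, {}]   # block number -> bytes, one dict per "disk"
        self.sums = {}          # block number -> checksum, kept apart from the data

    def write(self, n, block):
        self.sums[n] = checksum(block)
        for disk in self.disks:
            disk[n] = block

    def read(self, n):
        """Return the first copy that matches its checksum, repairing the others."""
        good = next((d[n] for d in self.disks if checksum(d[n]) == self.sums[n]), None)
        if good is None:
            raise IOError(f"block {n}: every copy fails its checksum")
        for disk in self.disks:
            if disk[n] != good:
                disk[n] = good
        return good

    def scrub(self):
        """Read every block; return the numbers of blocks that needed repair."""
        repaired = []
        for n in sorted(self.sums):
            damaged = any(checksum(d[n]) != self.sums[n] for d in self.disks)
            self.read(n)  # raises IOError if no copy is intact
            if damaged:
                repaired.append(n)
        return repaired


store = MirroredStore()
store.write(0, b"holiday photo bytes")
original = store.disks[0][0]
store.disks[0][0] = bytes([original[0] ^ 0x01]) + original[1:]  # flip one bit on disk 0
print(store.scrub())
```

After the scrub, both copies of block 0 hold the original bytes again. Without the checksum, the mirror would only know that the two copies disagree, not which one to keep.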

Hardware for storage

In the meantime, I'm having a reasonably difficult time finding a server that is low-power, low-noise, has ECC RAM, and works with OpenSolaris. If you give up on low-noise or ECC, things get easier, but I do love over-constrained problems. Leave comments with good links, if you have any.

Micro-lending for the rest of us...Also, real time accounting?

Lorna and I were talking over dinner about businesses going under, sometimes for really dumb reasons. For instance, you hear about mostly-profitable small businesses that can't get financing on the same terms they could before.

And so we were wondering why nobody we know can invest in (or even provide high-interest loans to) these companies. They just close up shop and disappear (like bank executives on a private jet? Hm, not really.)

It's pretty clear these guys can't go do an IPO to get financing. And angel/VC/private equity is interested in massive upside, not short-term loans to operating businesses.

But if you come back to it, why can a regular person invest in a public company? Because there are annual reports, and the SEC, and reasonable penalties for doing the wrong thing. And as a result you get something we call transparency, and people can sort of understand (after the fact) what's going on in a public company. And people go to jail for saying the wrong thing at the wrong time.

For private equity, something else happens: you get a small number of investors, typically with high net-worth, so nobody prevents them from taking big risks. In most cases, you actually get an inside view of a company's books. They're not public, but they're known to the parties who are investing.

The problem with all these old systems is that they ignore computers and the internet.

How do we get financing to small private companies without making them go through Sarbanes-Oxley and hire PricewaterhouseCoopers to do their annual reports?

Well, how about we have them use accounting software that posts all their financial data online instantly, and make a market for investors who want to take risks based on that data?

Consider what happens if companies that want financing in this market commit to use one of these "open" accounting systems in their day-to-day operations (not paper, or one set of internal books and another for the public), and you'll have a million eyes to look over every receipt. Find a way to make this software free, and integrate it with all the accounting systems that are in place today.

Who would sign up for this? Well, lots of small business owners, if the business they spent 20 years building is facing bankruptcy. Eventually, companies could choose their privacy, or the benefit of easy financing, or some mix of the two as in the private equity model.

And who else should do this? Right now, we're also trying to figure out how to build more regulation into our financial system.

But isn't the problem lack of information? With information, there are people to crunch the numbers. With mega-billion-dollar companies and Form 10-K, there isn't much transparency at all. It's hard to understand, really.

But couldn't we build a real-time accounting system, with a variety of privacy models, that really fixes this? Use it in government, use it for financial firms that get bailout money, use it for small businesses that need emergency lending.

Related thoughts: I think you might find some inspiration in Glenn Kelman's opening of Redfin's financial model. Great stuff, and really not shocking enough to keep it private, if investment could flow more freely as a result.

Eric Lewis, Westin Lobby at TED

Eric Lewis did an impromptu performance in the Westin Lobby at TED.
Some samples...

Eric Lewis at TED 2009 from Michael Herf on Vimeo.

Data Solutions to Big Problems

There are a lot of problems that have proved resilient to "elegant" science - the classical science that says you can reduce most things in the world to a small equation or simple algorithm.

I've been having a ton of conversations about this recently, and I love pulling a particular quote out of context from the Hays/Efros paper "Scene Completion Using Millions of Photographs". These guys were trying to solve the "remove that object from my picture" problem. Lots of people have tried, and previous approaches make blurry blobs or funny textures to fill in the gaps. In other words, the old methods don't work too well.

Hays & Efros say, "What if we had a big database of similar images to use?" And this wonderful thing happens:

Indeed, our initial experiments with the gist descriptor on a dataset of ten thousand images were very discouraging. However, increasing the image collection to two million yielded a qualitative leap in performance...

And reading this should get you very excited. It means that you can solve some problems with "lots of data" that you can't solve with a moderate amount of data. Their problem is solvable if you have 2 million images but not with 10,000.
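One small piece of the intuition can be shown with a toy in Python: the bigger the collection, the closer the best available match to any given query. This is not the Hays/Efros method. The 3-number "descriptor", the random data, and the collection sizes are all invented, and the sketch says nothing about where the qualitative leap happens. It only shows the direction of the effect.

```python
import math
import random


def nearest_distance(query, points):
    """Distance from query to its closest point (brute-force search)."""
    return min(math.dist(query, p) for p in points)


rng = random.Random(0)
dim = 3  # tiny stand-in for a real image descriptor
library = [tuple(rng.random() for _ in range(dim)) for _ in range(20000)]
query = tuple(rng.random() for _ in range(dim))

for size in (100, 2000, 20000):
    # Each larger collection contains the smaller ones, so the best match
    # can only stay the same or get closer.
    print(size, round(nearest_distance(query, library[:size]), 4))
```

Real descriptors have far more dimensions, where "close enough to be useful" takes far more data to reach, which fits with a jump showing up somewhere between ten thousand and two million images.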

We see similar things in Machine Translation, and many people have suggested biology and medical science will be revolutionized by similar techniques.

But this brings up two important issues:

1. Trust-the-Oracle-Science?

The first one came from a discussion with my cousin, who's a newly-minted lawyer working with DNA evidence. At a certain level, the legal community is now trusting an algorithm to tell them the truth. In effect, the question they're asking is quite simple and specific: how similar are these DNA fragments? So there's no multi-dimensional Bayesian learning network, and not a lot of opportunity for a coding slip-up to mess with evidence.

That's today. Simple means Admissible. Guilty or Innocent.

But consider what happens if we base a new science on machine learning techniques. We will have these very self-referential systems where "truth" is just part of a large statistical model, and small errors might multiply to have quite serious consequences. Predicting health risks and the usual impacts on insurance, knowing too much about a person's behavior, privacy, you get it. At a certain level, you wonder if the computer becomes an oracle that accidentally has too much power.

Traditional science must re-double efforts to incorporate the results from these new models, and establish firm theory for why the "oracle" is telling the truth, or not.

2. Google Science

More optimistically, there are a ton of interesting problems to solve using this kind of data-rich approach.

Another conversation yesterday brings me to this question: can a regular 2-person research team make headway against the Googles of the world? Or does Google necessarily have such an advantage with its reams of data, such that research in certain areas, outside the Googleplex, comes to a halt?

This is a harsh question, but you could argue that Google has a huge lead in machine translation, and more training data for, say, clicks on images than everyone else in the world.

But the Hays/Efros example gives me a lot of hope that these tools will be readily available. These two guys at CMU were able to build a database of 2.3M images from Flickr, all with public tags, and they were able to do useful computation on it.

I do think that there is some danger of research becoming verticalized. If one organization controlled all medical data, they would also control the research results to be gleaned from it. So as we see changes in the next 10 years, we should ensure that whoever controls medical data is absolutely required to make aggregate data public.

But with a few small tweaks, the Flickr/Hays/Efros example and the existing data available on the web make it possible for very small teams to make headway here, as well as the 900-lb gorillas. I'm very optimistic that there isn't a Google wall, yet.

Today, it is still mostly a small project to write a web crawler. And distributed systems that crunch through petabytes of data are quite affordable. And we are seeing serious efforts to make data like this public, accessible through APIs, and useful for research.

I think that today, if you're clever enough, your two-person team can make the next great vision algorithm or medical breakthrough. And we should work to keep it that way.