Data Solutions to Big Problems

There are a lot of problems that have proved resistant to "elegant" science - the classical science that says you can reduce most things in the world to a small equation or simple algorithm.

I've been having a ton of conversations about this recently, and I love pulling a particular quote out of context from the Hays/Efros paper "Scene Completion Using Millions of Photographs". These guys were trying to solve the "remove that object from my picture" problem. Lots of people have tried, and previous approaches make blurry blobs or funny textures to fill in the gaps. In other words, the old methods don't work too well.

Hays & Efros say, "What if we had a big database of similar images to use?" And this wonderful thing happens:
"Indeed, our initial experiments with the gist descriptor on a dataset of ten thousand images were very discouraging. However, increasing the image collection to two million yielded a qualitative leap in performance..."
And reading this should get you very excited. It means that you can solve some problems with "lots of data" that you can't solve with a moderate amount of data. Their problem is solvable if you have 2 million images but not with 10,000.
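To make the intuition concrete, here is a toy sketch of why a bigger collection helps a nearest-neighbor search. It is not the Hays/Efros pipeline: the "descriptors" are just random vectors standing in for gist descriptors, and the names (`make_descriptors`, `nearest_distance`), the 8 dimensions, and the collection sizes are all made up for illustration. The only point is that the best available match can only get closer as the collection grows.

```python
import math
import random


def make_descriptors(count, dims, rng):
    """Build `count` random vectors; a stand-in for real image descriptors."""
    return [[rng.random() for _ in range(dims)] for _ in range(count)]


def nearest_distance(query, database):
    """Return the Euclidean distance from `query` to its closest database entry."""
    return min(math.dist(query, item) for item in database)


# Assumed toy setup: 8-dimensional descriptors, fixed seed for repeatability.
rng = random.Random(0)
DIMS = 8
database = make_descriptors(20000, DIMS, rng)
query = make_descriptors(1, DIMS, rng)[0]

# The small collection is a subset of the large one, so the large one
# can never do worse at finding a close match.
small = nearest_distance(query, database[:100])
large = nearest_distance(query, database)

print(f"closest match among 100 images:    {small:.3f}")
print(f"closest match among 20,000 images: {large:.3f}")
```

Real descriptors live in far more dimensions than this, which is presumably part of why it took millions of images rather than thousands before the matches became good enough to use.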

We see similar things in machine translation, and many people have suggested that biology and medical science will be revolutionized by similar techniques.

But this brings up two important issues:

1. Trust-the-Oracle-Science?

The first one came from a discussion with my cousin, who's a newly minted lawyer working with DNA evidence. At a certain level, the legal community is now trusting an algorithm to tell them the truth. In effect, the query they're asking right now is quite simple and specific: how similar are these DNA fragments? So there's no multi-dimensional Bayesian learning network involved, and not a lot of opportunity for a coding slip-up to mess with evidence.

That's today. Simple means Admissible. Guilty or Innocent.

But consider what happens if we base a new science on machine learning techniques. We will have these very self-referential systems where "truth" is just part of a large statistical model, and small errors might multiply to have quite serious consequences. Think of predicting health risks and the usual impacts on insurance, knowing too much about a person's behavior, privacy, you get it. At a certain level, you wonder if the computer becomes an oracle that accidentally has too much power.

Traditional science must redouble its efforts to incorporate the results from these new models and establish firm theory for why the "oracle" is telling the truth, or not.

2. Google Science

More optimistically, there are a ton of interesting problems to solve using this kind of data-rich approach. 

Another conversation yesterday brings me to this question: can a regular 2-person research team make headway against the Googles of the world? Or does Google necessarily have such an advantage with its reams of data that research in certain areas, outside the Googleplex, comes to a halt?

This is a harsh question, but you could argue that Google has a huge lead in machine translation, and more training data for, say, clicks on images than everyone else in the world.

But the Hays/Efros example gives me a lot of hope that these tools will be readily available. These two guys at CMU were able to build a database of 2.3M images from Flickr, all with public tags, and they were able to do useful computation on it. 

I do think that there is some danger of research becoming verticalized. If one organization controlled all medical data, they would also control the research results to be gleaned from it. So as we see changes in the next 10 years, we should ensure that whoever controls medical data is absolutely required to make aggregate data public.

But with a few small tweaks, the Flickr/Hays/Efros example and the existing data available on the web make it possible for very small teams, as well as the 900-lb gorillas, to make headway here. I'm very optimistic that there isn't a Google wall, yet.

Today, it is still mostly a small project to write a web crawler. And distributed systems that crunch through petabytes of data are quite affordable. And we are seeing serious efforts to make data like this public, accessible through APIs, and useful for research.

I think that today, if you're clever enough, your two-person team can make the next great vision algorithm or medical breakthrough. And we should work to keep it that way.

1 comment:

  1. Scene completion (texture synthesis, in-painting) is one of my favorite difficult imaging problems, something I miss.

    My gut feeling when I was researching it was that the problem could be better addressed by finding an improved image decomposition / domain in which to work (cf. the gradient domain work). Often a candidate source texture can be dramatically improved with a fairly simple tweak--changing brightness, saturation or lighting gradient. Could there be some way to factor out rotation and perspective distortion? For this problem, is there a better way to decompose an image into components that matter and components that don't?

    As far as broader implications go... In the case of scratch removal, stain removal and speck removal, the challenge is reconstructing reality as faithfully as possible; but in the case of blemish removal and healing brushes, the objective is usually creating a lie that's so good it is perceptually indistinguishable from truth; and that has more troubling implications: the difference between finding truth and finding lies indistinguishable from truth.