Image decoding notes, multi-core, SSD

I found myself giving a mini-lecture to some relatives over Thanksgiving about SSD and multi-core, and how well they are suited to each other.

The latest SSDs do an enormous number of I/Os per second, and they can move >200MB/sec. On every count, this is way better than a hard drive.

So that brings me to decoding images, and how multi-core CPUs (think 8-16 cores) will handle it. With a lot of cores, you can do two things for images: improve latency (load a single image faster) or improve throughput (process a whole lot of images at once).

To make things difficult, let's talk about 20MB raw files. You just shot 1000 of them (20GB!). Today's regular hard drive and 4-core CPU are pretty evenly matched, but very slow: you'll probably spin a CPU core for a second or two per image, and you'll have to be careful about scheduling disk traffic to maximize throughput, because 4 threads banging on a disk will seek too much and make you disk-bound. You can do it, but it's very tricky.
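
To make that concrete, here's a minimal sketch of one such scheduling scheme (a sketch only, with decodeRaw standing in for the real decoder): a single reader streams files off the disk in order, and a pool of decode workers pulls from a bounded queue, so the drive only ever sees one sequential stream of requests.

    package main

    import (
        "os"
        "runtime"
        "sync"
    )

    // decodeRaw stands in for the real decode step (a second or two of CPU per image).
    func decodeRaw(data []byte) {
        _ = data
    }

    // processAll uses a single sequential reader so the disk never sees competing
    // seek patterns, while a pool of workers burns the CPU time.
    func processAll(paths []string) {
        jobs := make(chan []byte, 4) // small buffer keeps memory bounded

        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for data := range jobs {
                    decodeRaw(data)
                }
            }()
        }

        // One reader keeps disk access sequential; on an SSD you could just as
        // well let every worker read its own file.
        for _, p := range paths {
            if data, err := os.ReadFile(p); err == nil {
                jobs <- data
            }
        }
        close(jobs)
        wg.Wait()
    }

    func main() {
        processAll(os.Args[1:])
    }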

Now let's give ourselves a 200MB/sec drive with essentially free seeks, and 16 cores. Split the work any way you like and you'll come out okay. A normal hard drive (short of a 16-way RAID) can't keep up: with conventional disks, you'd have way more CPU than disk.
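
A quick back-of-the-envelope check, using the numbers in this post (the 1.5 seconds per image just splits the difference on the one-to-two-second guess above; it's not a measurement):

    package main

    import "fmt"

    func main() {
        const (
            images      = 1000.0
            mbPerImage  = 20.0  // MB per raw file
            ssdMBps     = 200.0 // sequential read speed
            secPerImage = 1.5   // guessed CPU time to decode one image
            cores       = 16.0
        )
        fmt.Printf("I/O time: %.0f s\n", images*mbPerImage/ssdMBps) // ~100 s
        fmt.Printf("CPU time: %.0f s\n", images*secPerImage/cores)  // ~94 s
    }

Roughly 100 seconds of reading against roughly 94 seconds of decoding: overlap the two and the whole 20GB batch is done in under two minutes, with neither side sitting idle for long.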

This is why I think SSD has to be disruptive at this point in the evolution of computers. Beyond the wins in power usage and size, a multi-core CPU simply demands too much intelligence from a disk controller to let 4-16 threads with different goals share one spinning drive. Even for bandwidth-intensive operations like image decoding, the conventional disk has become the bottleneck. SSD and multi-core are perfect technology partners.

Anyway, back to decoding images: a typical 20MB raw file is compressed with a Huffman coder, in one big stream. If your goal is to reduce latency, this is a bad choice, because most implementations will decode that stream on a single core. If you're making a new file format, put some natural blocking in, please. The lossless portion of JPEG or raw decoding (the Huffman stage) is often the #1 bottleneck in modern decoders: it's branch-heavy, and it's hard to parallelize.
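
For illustration only, here's a hypothetical blocked layout (none of these names come from any real format): write a small index of independently decodable blocks up front, so each core can seek straight to its own block instead of waiting in line behind one long Huffman stream.

    package rawformat

    // BlockIndex and BlockedRawFile are hypothetical, not any shipping raw format.
    type BlockIndex struct {
        Offsets []uint64 // byte offset of each compressed block in the file
        Rows    []uint32 // first image row covered by each block
    }

    type BlockedRawFile struct {
        Header []byte     // sensor metadata, Huffman tables, etc.
        Index  BlockIndex // written up front so readers can fan out immediately
        Blocks [][]byte   // each block is a self-contained Huffman stream
    }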

But can you make a parallel Huffman decoder? I believe it's possible, because you can seek into the middle of the byte stream and re-synchronize with the bitstream (it's a static table, not a dynamic one). There don't seem to be any simple implementations of this; even Intel's IPP is slower than a good C implementation.
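
Here's a minimal sketch of that re-synchronization idea, assuming a static table; decodeSymbol is a placeholder for a real table-driven decode step, and this is only the splice test, not a complete decoder. Worker i+1 decodes its chunk speculatively and records the bit positions where symbols started; worker i keeps decoding past its own chunk boundary until it lands on one of those positions, and from there the two outputs agree and can be spliced (everything worker i+1 decoded before the splice point is thrown away).

    package huffsketch

    type bitPos int

    // decodeSymbol is a placeholder for one static-table Huffman decode step:
    // consume the bits of the symbol starting at pos and return the bit
    // position where the next symbol starts.
    func decodeSymbol(stream []byte, pos bitPos) bitPos {
        // ...walk the static Huffman table bit by bit...
        return pos + 1 // placeholder
    }

    // speculativeBoundaries is run by worker i+1: decode the first n symbols of
    // its chunk and record every bit position where a symbol began.
    func speculativeBoundaries(stream []byte, start bitPos, n int) map[bitPos]bool {
        seen := make(map[bitPos]bool, n)
        pos := start
        for i := 0; i < n; i++ {
            seen[pos] = true
            pos = decodeSymbol(stream, pos)
        }
        return seen
    }

    // findSplice is run by worker i: keep decoding past the end of its own chunk
    // until it reaches a symbol boundary the next worker also saw. A real
    // implementation would bound this search and fall back to serial decoding.
    func findSplice(stream []byte, pos bitPos, nextSeen map[bitPos]bool) bitPos {
        for !nextSeen[pos] {
            pos = decodeSymbol(stream, pos)
        }
        return pos
    }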

My bet: linear scaling across cores is possible for Huffman decoding, and the rest of JPEG and raw decoding is linearly scalable too.

Does anyone know how in-camera DSP approaches this problem? A camera that shoots 10fps must have some pretty nice parallel magic.

2 comments:

  1. If you didn't see it, there was a nice article at AnandTech on SSDs the other day.

    http://www.anandtech.com/storage/showdoc.aspx?i=3531

  2. This comment is a year too late, but many JPEGs are peppered with DRI/RSTn markers. Parallelizing the processing of the regions between RSTn markers is definitely doable before you even begin to look at parallelizing the Huffman decoding of the encoded stream itself (a sketch of the marker scan follows after the comments).

    We weren't using this feature for parallelization but instead for minimizing the instantaneous memory footprint in a Java JPEG decoder in J2ME (don't ask -- J2ME has terrible native JPEG decoding support so we had to perform unnatural acts like this).

    Only problem: not all JPEGs are peppered with RSTs.

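
The restart-marker trick in the second comment is the one piece of standard JPEG machinery that hands you this parallelism for free, so here's a minimal sketch of the marker scan (assuming scanStart and scanEnd already bracket the entropy-coded data after the SOS header):

    package jpegsplit

    // restartOffsets finds the RSTn markers (0xFFD0-0xFFD7) inside a JPEG's
    // entropy-coded data and returns the start offset of every independently
    // decodable segment, so each segment can be handed to its own core.
    // Stuffed 0xFF 0x00 bytes are ignored automatically because 0x00 falls
    // outside the RST marker range.
    func restartOffsets(data []byte, scanStart, scanEnd int) []int {
        offsets := []int{scanStart}
        for i := scanStart; i+1 < scanEnd; i++ {
            if data[i] == 0xFF && data[i+1] >= 0xD0 && data[i+1] <= 0xD7 {
                offsets = append(offsets, i+2) // next segment starts after the marker
            }
        }
        return offsets
    }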