I found myself giving a mini-lecture to some relatives over Thanksgiving about SSD and multi-core, and how well they are suited to each other.
The latest SSDs do an enormous number of I/Os per second, and they can move more than 200MB/sec. On both counts, that's way better than a hard drive.
So that brings me to decoding images, and how multi-core CPUs (think 8-16 cores) will be put to work. For images, you can do two things with a lot of cores: improve latency (load a single image faster) or improve throughput (process a whole lot of images at once).
To make things difficult, let's talk about 20MB raw files. You just shot 1000 of them (20GB!). Today's regular hard drive and 4-core CPU are pretty evenly matched, but very slow: you'll probably spin a CPU for a second or two per image, and you'll have to be careful about scheduling disk traffic to keep throughput up, because 4 threads banging on a disk will seek too much and leave you disk-bound. You can do it, but it's very tricky.
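Here's a rough sketch of the shape that pipeline tends to take (in Go purely for brevity; the file list and the decodeRaw stand-in are placeholders, not any real decoder): one reader streams files off the disk sequentially, and a pool of per-core workers does the heavy decode, so the drive never has to arbitrate seeks between four competing readers.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"sync"
)

type job struct {
	name string
	data []byte
}

// decodeRaw is a stand-in for the CPU-heavy part (Huffman decode, demosaic, ...).
func decodeRaw(j job) {
	fmt.Printf("decoded %s (%d bytes)\n", j.name, len(j.data))
}

func main() {
	files := os.Args[1:] // e.g. the 1000 raw files from the shoot

	// Small buffer: the reader stays just far enough ahead of the workers.
	jobs := make(chan job, 4)

	// One reader: the disk sees a single sequential stream of requests, so a
	// spinning drive never has to arbitrate seeks between competing readers.
	go func() {
		for _, name := range files {
			data, err := os.ReadFile(name)
			if err != nil {
				fmt.Fprintln(os.Stderr, err)
				continue
			}
			jobs <- job{name, data}
		}
		close(jobs)
	}()

	// One decode worker per core: the CPU-bound half of the pipeline.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				decodeRaw(j)
			}
		}()
	}
	wg.Wait()
}
```

The only real tuning knob is the queue depth: big enough that the workers never starve, small enough that you aren't holding dozens of 20MB buffers in memory at once.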
Now let's give ourselves a 200MB/sec drive, effectively free seeks, and 16 cores. Split the work any way you like and you'll come out okay. Normal hard drives (short of a 16-way RAID) won't keep up; using conventional disks, you'd have way more CPU than disk.
This is why I think SSD has to be disruptive at this point in the evolution of computers. Beyond the power and size advantages, a multi-core CPU makes the problem worse: letting 4-16 threads with different goals share one spinning disk requires way too much intelligence in your disk controller. Even for bandwidth-intensive operations like image decoding, the conventional disk has become the bottleneck. SSD and multi-core together are the perfect technology partners.
Anyway, back to decoding images: a typical 20MB raw file is compressed with a Huffman compressor, in one big stream. If your goal is to reduce latency, that's a bad choice, because most people will implement the decode on a single core. If you're making a new file format, put some natural blocking in, please. The lossless (Huffman) portion of JPEG or raw decoding is often the #1 bottleneck in modern decoders: it's branch-heavy, and it's hard to parallelize.
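Here's a sketch of what "natural blocking" buys you: if a hypothetical container carried an index of independently coded tiles, the decoder could fan each tile out to its own core with no coordination at all. (The tile layout and decodeTile below are invented for illustration; they don't match any real raw format.)

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical container layout: an index of independently coded tiles, each a
// self-contained Huffman stream. Invented for illustration, not a real format.
type tile struct {
	offset, length int64
}

// decodeTile stands in for decoding one tile's compressed data.
func decodeTile(id int, data []byte) {
	fmt.Printf("tile %d: %d bytes decoded\n", id, len(data))
}

// decodeTiles decodes every tile concurrently. Because each tile begins at a
// known byte offset, no core has to wait for another core's bitstream position.
func decodeTiles(data []byte, index []tile) {
	var wg sync.WaitGroup
	for i, t := range index {
		wg.Add(1)
		go func(i int, t tile) {
			defer wg.Done()
			decodeTile(i, data[t.offset:t.offset+t.length])
		}(i, t)
	}
	wg.Wait()
}

func main() {
	// Fake 1MB "file" split into four 256KB tiles.
	data := make([]byte, 1<<20)
	index := []tile{{0, 1 << 18}, {1 << 18, 1 << 18}, {2 << 18, 1 << 18}, {3 << 18, 1 << 18}}
	decodeTiles(data, index)
}
```

JPEG's optional restart markers get you partway there, but you still have to scan the stream to find them; a real index would hand you the split points for free.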
But can you make a parallel Huffman decoder? I believe it's possible, because you can seek to the middle of a set of bytes and re-synchronize with the bitstream (it's a static table, not dynamic). There don't seem to be any simple implementations of this. Even Intel's IPP is slower than a good C implementation.
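The resynchronization idea is easier to see with a toy example. Below is a rough Go sketch (not JPEG, just a made-up static prefix code) showing the property a parallel decoder would exploit: a decode started at an arbitrary bit offset near the middle of the stream tends to fall back onto the true symbol boundaries after a few symbols, because the table is fixed.

```go
package main

import "fmt"

// Toy static prefix code, standing in for a JPEG-style static Huffman table.
// Bits are kept as strings of '0'/'1' purely for clarity.
var codes = map[byte]string{'a': "0", 'b': "10", 'c': "110", 'd': "111"}

// encode concatenates the codewords for a message.
func encode(msg string) string {
	bits := ""
	for i := 0; i < len(msg); i++ {
		bits += codes[msg[i]]
	}
	return bits
}

// boundaries decodes from bit offset start and returns the bit position at the
// end of each decoded symbol. The code is prefix-free, so at most one codeword
// can match at any position.
func boundaries(bits string, start int) []int {
	var ends []int
	pos := start
	for pos < len(bits) {
		advanced := false
		for _, code := range codes {
			if pos+len(code) <= len(bits) && bits[pos:pos+len(code)] == code {
				pos += len(code)
				ends = append(ends, pos)
				advanced = true
				break
			}
		}
		if !advanced {
			break // ran off the end mid-codeword
		}
	}
	return ends
}

func main() {
	bits := encode("abcadbdcabadcbadabcdacbdab")

	truth := map[int]bool{} // the real symbol boundaries, decoding from bit 0
	for _, p := range boundaries(bits, 0) {
		truth[p] = true
	}

	// Start speculative decodes at a few arbitrary offsets near the middle.
	// Once a speculative decode lands on a true boundary, every symbol after
	// that point is identical to the real decode: it has resynchronized.
	mid := len(bits) / 2
	for off := mid; off < mid+4; off++ {
		sync := -1
		for _, p := range boundaries(bits, off) {
			if truth[p] {
				sync = p
				break
			}
		}
		fmt.Printf("speculative start at bit %d: back in sync at bit %d\n", off, sync)
	}
}
```

One way a parallel decoder could use this: kick off speculative decodes at a few offsets near the split point, and keep whichever one the first half's decode eventually runs into; the wasted work is a handful of symbols, not half the image.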
I think linear scaling across cores is possible for Huffman decoding, and that the whole of JPEG and raw decoding is linearly scalable too.
Does anyone know how in-camera DSP approaches this problem? A camera that shoots 10fps must have some pretty nice parallel magic.