Global signatures for filesystem contents

I've been using the "rsync trick" to back up my data for a couple of years now. By creating hardlinks to the previous snapshot's files and then rsync'ing on top of them, you can easily keep a multi-day history of a filesystem. The idea first found me while I was reading Slashdot, and I've adapted it to several of my own purposes since.
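For reference, here's a minimal sketch of that rotation (the paths and the two-snapshot depth are made up for illustration):

    import os
    import shutil
    import subprocess

    SRC = "/home/me/data/"          # trailing slash matters to rsync
    SNAP_NEW = "/backups/day.0"     # newest snapshot
    SNAP_OLD = "/backups/day.1"     # previous snapshot

    # Rotate: drop the oldest snapshot, shift yesterday's into its place.
    if os.path.exists(SNAP_OLD):
        shutil.rmtree(SNAP_OLD)
    if os.path.exists(SNAP_NEW):
        os.rename(SNAP_NEW, SNAP_OLD)

    # Hardlink-copy the previous snapshot: "cp -al" links every file
    # instead of copying its data, so this is fast and nearly free.
    if os.path.exists(SNAP_OLD):
        subprocess.run(["cp", "-al", SNAP_OLD, SNAP_NEW], check=True)

    # rsync on top: unchanged files keep sharing inodes with day.1, while
    # changed files are written fresh (rsync replaces them via rename, so
    # the historical copy's inode is left untouched).
    subprocess.run(["rsync", "-a", "--delete", SRC, SNAP_NEW + "/"], check=True)

Newer rsyncs can fold the hardlink step into the transfer with --link-dest, but the two-step version shows the mechanics.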

But recently I had this idea of looking at a filesystem as a collection of blocks (which it is -- fixed-size in most implementations, variable in a few), and considering what happens if you hash each block by its contents and keep an index of those hashes.
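As a rough illustration (the 32K block size and the hash truncated to 64 bits are choices I'm making here, not anything fixed), chopping a file into blocks and hashing each one looks like this:

    import hashlib

    BLOCK_SIZE = 32 * 1024  # 32K, matching the sizing discussion below

    def block_hashes(path):
        """Yield a 64-bit content fingerprint for each block of a file."""
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                # First 8 bytes of a strong hash; the index is what has
                # to cope with the (rare) collisions this allows.
                yield hashlib.sha256(block).digest()[:8]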

With such a structure, you could refcount each block written to the device, and when you were about to write a block whose contents already existed, you could simply bump the refcount on the existing block instead of allocating a new one. I believe contiguity wouldn't be a problem (random seeks would only bite if the shared blocks ended up scrambled across the disk, which shouldn't happen often): mostly you'd be writing copies of files that are exactly the same as, or only slightly different from, existing ones, so new files would refer to existing runs of blocks and seeks would be minimized.
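To make that write path concrete, here's a toy in-memory version of the index plus refcounting (the names and structure are mine -- a real filesystem would keep all of this on disk next to the allocator):

    import hashlib

    BLOCK_SIZE = 32 * 1024

    class BlockStore:
        """Toy content-addressed block store with per-block refcounts."""

        def __init__(self):
            self.blocks = []     # physical block contents, by block number
            self.refcount = []   # references to each physical block
            self.index = {}      # 64-bit content hash -> block number

        def write_block(self, data):
            """Return a block number for data, reusing an identical block."""
            h = hashlib.sha256(data).digest()[:8]
            blockno = self.index.get(h)
            if blockno is not None and self.blocks[blockno] == data:
                # Same contents already stored: bump the refcount,
                # allocate nothing. (Comparing contents guards against
                # hash collisions -- the shorter the hash, the more of
                # this checking you need.)
                self.refcount[blockno] += 1
                return blockno
            blockno = len(self.blocks)
            self.blocks.append(data)
            self.refcount.append(1)
            self.index[h] = blockno
            return blockno

        def free_block(self, blockno):
            """Drop one reference; the block is reclaimable at zero."""
            self.refcount[blockno] -= 1

    store = BlockStore()
    first = store.write_block(b"x" * BLOCK_SIZE)
    second = store.write_block(b"x" * BLOCK_SIZE)
    assert first == second and store.refcount[first] == 2

The contents comparison on a hash hit is the "more checking" mentioned below: with a 64-bit hash you could arguably skip it, with a 32-bit one you couldn't.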

Let's say you want to store 1TB in 32K blocks under this scheme -- that's 2^25 blocks, so you're talking about 256MB of hash information with 64-bit hashes (you could get away with 32-bit hashes if you were willing to do more checking on each match). Probably you'd keep this index in RAM, because lookups into it would be essentially random, with very little locality.
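The arithmetic, spelled out:

    TB = 2**40                  # 1TB of data
    BLOCK = 32 * 2**10          # 32K blocks
    blocks = TB // BLOCK        # 2**25 = 33,554,432 blocks
    print(blocks * 8 // 2**20)  # 64-bit hashes: 256 MB of index
    print(blocks * 4 // 2**20)  # 32-bit hashes: 128 MB of index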

But suddenly, you'd be able to write the same 1GB file to disk 10 times in about 1.01GB, since nine of the ten copies would cost nothing but refcount bumps and per-file block maps. There are a lot of applications, backup among them, where this would be a very interesting thing to do.

So I guess I should crack open the Linux source and start "backupfs" -- maybe it'd even get done.
