Sync is (almost) the new backup

I've been wrestling with how to keep my photos and videos and code and email backed up for several years now. Haven't lost much yet, but I spend a lot of time working on it. Currently I'm using a combination of rsync to a machine I put in a colo, plus a sync tool I made for myself to schlep files from Windows to UNIX filesystems, plus ZFS (with its awesome snapshots) running on Solaris, plus Mozy, plus a little Time Machine on the Macs, and occasionally Crashplan.

My claim in this post is that the only system that works for a majority of people is p2p sync of actual files, because it produces multiple "live" replicas of each file, which can be verified to work all the time. In other words:
  1. Not backup to DVD/USB drive
  2. Not cloud backup over 128kbps DSL
  3. Not cloud sync.
Here's why.

Cloud Dreams!

First, we've got the "cloud" guys purporting to store all our stuff online! What a great solution, and it is the right solution, because the data doesn't get lost. But look at the actual cost:

Amazon S3: $1.80/GB/yr
MozyPro: $6.00/GB/yr
1TB Box in colo: $1.50/GB/yr
MozyHome: $60/yr (unlimited magic)
1TB local hard drive: $90 ($0.09/GB, maybe $0.03/GB/yr)

Now the photos-only guys (no raw, limited video, etc.):
Picasaweb: $1.25-2.00/GB/yr
Flickr: $24.95/yr
Smugmug: $39.95/yr

Loss-averagers

Let's first blow up the "unlimited" storage idea. We're in an era of stupidly slow upstream connections, and these guys get away with "surely nobody is going to upload stuff to us for 6 months".

I installed special traffic shaping on my firewall and bought a really fast line so I could upload for 6 months without really noticing it, and then I did it. But really, is this pricing model real? Only because people don't use it. So they have a theoretical backup, not a real one.
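To put the upstream problem in numbers, here is a rough sketch of how long an initial upload takes. It assumes the 400GB I mention below, decimal gigabytes, and a perfectly sustained link with no protocol overhead; the 768 and 2000 kbps rates are illustrative assumptions, not measurements.

```python
def upload_days(size_gb: float, upstream_kbps: float) -> float:
    """Days needed to push size_gb through a sustained upstream_kbps link."""
    bits = size_gb * 1e9 * 8                 # decimal gigabytes -> bits
    seconds = bits / (upstream_kbps * 1e3)   # kbps -> bits per second
    return seconds / 86_400                  # seconds -> days


# 128 kbps is the DSL figure from the intro; the other rates are assumptions.
for kbps in (128, 768, 2000):
    print(f"{kbps:>5} kbps: {upload_days(400, kbps):6.1f} days")
```

At 128 kbps that works out to roughly 289 days of continuous uploading, which is why "unlimited" is a safe promise for the provider.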

For instance, Mozy clearly isn't faring so well when I store 400GB there. Looks like they are losing $2340 on me annually, when you compare to their "pro" pricing. Ouch.

Same with Smugmug, Flickr, etc. Having Smugmug store my stuff in S3 at $1.44/GB/yr (Amazon's bulk rate) is losing them about $536/yr. I imagine Flickr pays more than that, but I don't have any numbers.
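The loss figures above are just storage cost minus the flat fee. A minimal sketch of that arithmetic, assuming the same 400GB is stored with each provider (the Smugmug figure only works out if that is the case) and using the prices from the table:

```python
def annual_loss(stored_gb: float, cost_per_gb_year: float, flat_fee: float) -> float:
    """What the provider spends on storage beyond what the customer pays."""
    return stored_gb * cost_per_gb_year - flat_fee


STORED_GB = 400  # assumed identical for both providers

# Mozy: valued at MozyPro's $6.00/GB/yr, against the $60/yr MozyHome fee.
mozy = annual_loss(STORED_GB, 6.00, 60.00)
# Smugmug: valued at S3's $1.44/GB/yr bulk rate, against the $39.95/yr fee.
smugmug = annual_loss(STORED_GB, 1.44, 39.95)

print(f"Mozy:    ${mozy:,.2f}/yr")
print(f"Smugmug: ${smugmug:,.2f}/yr")
```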

So, PCs are 30x cheaper than the cloud

Storing an extra copy of this data on a new hard drive in my computer costs $12/year, plus maybe $8 for power if I leave it on all the time, or about $20/year.
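Here is a quick sketch of where a "30x" kind of number comes from. It assumes 400GB of data, the $0.03/GB/yr drive cost and $8/yr power figure from this post, and compares against the per-GB cloud prices listed earlier; which cloud price you pick changes the ratio.

```python
STORED_GB = 400

# Local copy: amortized drive cost plus an assumed $8/yr for power.
local_per_year = STORED_GB * 0.03 + 8.00

# Per-GB yearly prices taken from the list near the top of the post.
cloud_rates = {
    "1TB box in colo": 1.50,
    "Amazon S3": 1.80,
    "MozyPro": 6.00,
}


def cost_ratio(rate_per_gb_year: float) -> float:
    """How many times more the cloud option costs than the local drive."""
    return (STORED_GB * rate_per_gb_year) / local_per_year


for name, rate in cloud_rates.items():
    print(f"{name:<16} ${STORED_GB * rate:>8,.2f}/yr  ({cost_ratio(rate):.0f}x local)")
```

The colo box comes out at about 30x the local drive and S3 at about 36x.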

I could even buy a netbook with a 160GB hard drive, and use that as a backup brick for $1/GB/year, cheaper than Amazon S3.

Sync today...tied to the cloud

So today, we have serious R&D going into things like "Microsoft Live Mesh" and Dropbox, and Zumodrive, and SugarSync. They handle the hard problem of folders that stay in sync no matter what you do to them. Lots of people are working on this.

But all of these services (except Mesh) today are tied to costs in the $1+/GB/year world of cloud storage. And so I can't use them for storing my "important" data. Even though maybe they are cool, I'm not paying an extra $1000/year for the privilege, and most don't scale to the level I'd need anyway.

Backup today...sort of

On the backup front, Mac users can get a Time Capsule and store their data on a second hard drive. And there are dozens of backup programs that will copy data you might or might not be able to restore in the future. Microsoft has some sort of backup built into Windows 7, with the Time Machine equivalent being relegated to "Ultimate".

Crashplan is probably the best p2p backup for regular users. It mostly works, even if its CPU and memory usage are a bit high.

But the problem with backup is twofold:
1. Restore has to work. You're not watching daily for failures in your external USB drive, unless you're an enterprise IT guy whose job it is to keep the near-line backup device functioning.
2. You are using hardware for no incremental benefit. Backup is complex and there's no benefit for you to buy more of it to store more backups. Why buy it or install it? Why verify it still works? When you run out of space, what happens?

People need a benefit for backing up today.

What's missing: p2p Sync

Microsoft bought a little company called FolderShare a few years ago, and hasn't improved it much. 20,000 files, max, not so many folders. But what FolderShare did is approximately, sort of, the right thing: a PC-to-PC replication feature where files are actually usable while they're replicated. Why is this so important? Because you didn't know that all your backups were corrupted, or that your backup USB drive wasn't working.

But if your files stopped working on your laptop but were still good on your desktop, that would be noticeable. You'd notice it, you could fix it, and you'd hopefully get a new laptop before it was too late.
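The "noticing" part can be automated once every replica is a plain folder of files. Below is a minimal sketch of the idea, not any existing product's method: hash every file in each replica and flag copies that are missing or that disagree with the majority. The function names and the majority-vote rule are my own assumptions, and it assumes the replica folders are all reachable as local or mounted paths.

```python
import hashlib
from collections import Counter
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks so large videos fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its content hash."""
    return {p.relative_to(root).as_posix(): digest(p)
            for p in root.rglob("*") if p.is_file()}


def find_suspects(replicas: dict[str, Path]) -> list[tuple[str, str, str]]:
    """Return (file, replica, problem) for every copy that looks wrong."""
    snaps = {name: snapshot(root) for name, root in replicas.items()}
    all_files = set().union(*snaps.values())
    problems = []
    for rel in sorted(all_files):
        hashes = {name: snap.get(rel) for name, snap in snaps.items()}
        present = [h for h in hashes.values() if h is not None]
        majority, votes = Counter(present).most_common(1)[0]
        # Without a strict majority we cannot tell which copy is the good one.
        contested = len(set(present)) > 1 and votes * 2 <= len(present)
        for name, h in hashes.items():
            if h is None:
                problems.append((rel, name, "missing"))
            elif contested:
                problems.append((rel, name, "no majority"))
            elif h != majority:
                problems.append((rel, name, "differs from majority"))
    return problems
```

Calling `find_suspects({"laptop": Path(...), "desktop": Path(...), "netbook": Path(...)})` on three synced folders would list the laptop's copy as soon as it stops matching the other two. A hash mismatch cannot distinguish corruption from an edit that has not synced yet, so a real tool would also need to look at modification times or a sync log.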

The only system today that appears to do p2p replication is Mesh, and it's a complex weirdo with Silverlight UI in a browser (huh?!). My install on XP took an hour and asked to set up remote desktop more prominently than setting up stuff to sync, and I couldn't figure out exactly how to sync data to my wife's computer. Mesh appears to be a toolset to solve problems, but it doesn't really help you figure it out much.

Using Mesh is a letdown. When I clicked on a folder in the web UI, it took 20 seconds to show the 15 files inside it, and I'm apologizing in advance for not having really run it through its paces. I uninstalled it. I get the idea, but the implementation is just awful.

Mesh also lets you replicate folders from all over your hard drive, which is the other useful feature you really want. Dropbox doesn't do this.

I'm wondering if somehow Mesh will win despite its unreasonable bulk. I suppose the Microsoft 3.0 rule still applies. They are certainly thinking right, even if the execution is off.

Meanwhile, there are rumors of Dropbox doing useable p2p sync. I think this would be a great thing, and maybe people would understand it. Dropbox mostly just does one thing and does it really well.

But I need to be able to bring home a new computer, add it to my network, wait for all my files to show up, and turn off the old one. Three copies of everything I care about, files shared between my wife and me. Easy, right?

But somebody...really just needs to do it.

20 comments:

  1. I've got the luxury that my "work" machine and "home" machine are somewhat fluid, since both are owned by Rice, yet it's reasonably common to mix things up. I've set up a spare 1.5TB of Firewire space on my work machine, to which I can rsync from home. It's slow (I have to throttle my upstream rsync below 300 kbps of my rated 768 kbps upstream to maintain reasonably snappy web performance). But it works and it's cheap.

    While my "work" machine is, indeed, backing up my home machine, I also get to see and use all that data (photos and music) at work. I don't need to sync work data back home in the opposite direction since all of my truly "work" data lives on professionally maintained file servers, version repositories, and so forth.

    ZFS really and truly is the missing element from all this and I'm deeply disappointed that Apple has stricken mention of ZFS from its web site. ZFS on my Mac Pro at home was very much the plan, and that plan is now going to need to change.

  2. I always enjoy learning what other people think about Amazon Web Services and how they use them. If you want to back up to Amazon S3 on Windows, check out CloudBerry Backup.

  3. There is one thing you have not covered with your sync solution. What happens if you are robbed or your house burns down? I would be happy if I still had my data to work with, because my insurance will pay for the damage but won't bring my data back.

  4. Have you tried Wuala? http://wua.la/ They do p2p sync and you can earn storage by sharing it.

  5. I've been using Dropbox for about 6 months. ~50k files in 15G, so clearly not quite the storage requirements you've got. But 3 times now I've installed the app on a fresh machine, waited a few hours, and had everything back up and running. Of course, I do a fresh install of apps and settings based on notes I keep, but it's been a great way to keep my data in sync across multiple machines.

  6. Thanks for all the suggestions.

    I'm also using Dropbox, but more for moving documents and fonts around. I should try putting my "latest" photos and stuff on it and see how it works.

  7. Having tried both Dropbox and Syncplicity, I can say that Syncplicity is far, far better than Dropbox and the likes.

  8. Offsite backups are extremely difficult to implement in most environments because of throughput and encryption.

    I do not believe any solution is acceptable unless it offers end-to-end encryption that cannot be reversed by anyone but the party sending the data. Many solutions like Mozy do allow you to do this, but this does become a deal breaker for many of the home-grown solutions like rsync and S3. Many people may not have this requirement, but I believe this is critical.

    The real problem though is how long it takes to upload/download your data. A 400GB backup as you mentioned would take well over a month to complete your initial upload. Downloading would take perhaps a quarter of that. A typical user (who would be considering stronger backup methods) will have a difficult time just keeping the changed files up to date in a reasonable amount of time. It can be done, but the time it takes to upload/download becomes a big problem.

    In most cases, backing up to removable disk/media is a solution that solves all the problems you mention. That means backing up to a local USB/eSATA/Firewire/Tape device on a daily basis and taking the unit off-site once a week (sooner or later depending on needs). When taking the unit off-site, a secondary unit would be brought back on-site to replace it. With strong encryption on the backup, the location where you store the disk is not a major concern, as the data is safe. Storing this backup at a friend's, parent's, or colleague's house is a very effective and economical way of handling this problem.

    One requirement I have for myself and all my clients is having three copies of all data. This includes the current live data, a backup, and current off-site backup.

    One thing to keep an eye on when you are using this technique or many others: do not back up to the same file(s) on a daily basis. This will introduce a window of opportunity for data loss while your current backup is being overwritten and your new backup hasn't finished. If a system failure happens in that window, you will need to recover from an off-site backup. At minimum I recommend two copies on-site that alternate on a daily basis, then one copy is taken off-site and rotated with the previous week's copy.

    If you are able to use tape, this process is a lot easier as you can go with even more copies.

    If you are backing up 1-10GB, online cloud solutions may be perfect, but once you get into the 25-100GB+ range, it becomes counterproductive. This is especially the case with companies like Time Warner cracking down on their "unlimited download" limit and making new limits.

    This is a great post, and I have had this same discussion what seems like a million times.

    Christopher Spence
    Lexan Systems LLC

  9. Christopher--really nice comments. USB drives do of course work, and I even managed to cycle my offsite disks a couple times when that was my major offsite solution. They also win in cost-effectiveness. Saved my butt quite a bunch.

    But with a few exceptions (RAID-1 enclosures, etc.) the hardware is relatively fragile unless you're redundant here too. To me the USB drive or NAS is just another level of redundancy, and I use the "3 copies" or even more for important data.

    Since this single disk MUST be able to restore all your data, you need to have multiple drives, and this gets complicated, because it's not verified to work as often as you'd think.

    I suppose the idea with "sync" is that I would rather have a folder with files in it as my backup than a file I can't decode from Veritas, Crashplan, Retrospect, or even "tar". In some ways, the most survivable and recoverable formats are the most popular ones. There are data recovery services for NTFS and HFS+, not so much with Windows backup.

    So I think, still, that a "backup" done to a working NAS or to a USB drive should be a simpler thing, not an opaque format but a filesystem with some metadata and snapshots added.

  10. You don't necessarily need to use RAID1 or a RAID1 enclosure for your backup devices, as you would be creating a new backup each day (not overwriting the old one), although you may or may not be using the same drive. If your backup window allows verification after backup, this will solve the RAID1 problem and allow you to easily use a single 500GB-1TB drive for your backup (providing you have at least two to have one off-site at all times).

    Acronis is a good solution, they will create an image file to an external device that is 256 bit AES encrypted. You can use a boot CD generated by the program to boot and restore 100% bare metal without first installing the operating system.

    One of the major problems I find with backing up off-site while maintaining encryption is that it has to be done at a file level. Container-based encryption is great for local use, but trying to back up a 500GB-1TB container (because you have no idea what your exact growth is, your container has to be rather large) is pretty much impossible, and cannot be done incrementally. I have yet to find a solid, reliable way to back up off-site securely using encryption without using services like Mozy, because of the requirement that encryption cannot be reversible by the data center. Rsync would be perfect, but Windows rsync is weak and the only solutions that do encryption with rsync prior to delivery are not stable or mainstream.

    On the other hand, there are some cloud solutions that have an attractive alternative, but not for the typical home user. They have the ability to ship drives pre-loaded with your initial backup and will overnight a drive in return for recovery. This makes off-site backups a much more viable option.

    A backup on a NAS is difficult to take off-site, unless you back up to your NAS/SAN and use an external device (tape/drive) to further copy it to a portable device. Once you break the 1TB limit of readily available hard disks, this solution becomes more difficult. At least for a non-enterprise.

    Windows backup is not really a good solution for mission critical backups, it has poor open file support and does not even exist in Vista or even Windows 7 (not 100% sure on Windows 7 but I believe that is the case as well).

    For a business or an enterprise, cost is not as prohibitive and there are even more ways to make this work. Although to be honest, you would be surprised how low priority backups are to many businesses, even those who claim losing their data would put them out of business. We frequently find dysfunctional or absent backup solutions in place.

    What makes matters worse, disk technology has been increasing at an alarming rate in terms of storage, but performance has not been. Backup solutions have not been keeping up either. Creating an array to store 10TB of data is pretty easy, creating a solution to back it up will cost 10x-50x as much.

    Hope this helps, good luck in your quest, I feel your pain.

    Christopher Spence
    Lexan Systems LLC

  11. The funny thing about multiple backups--backup software seems much smarter than restore software. In my experience, keeping multiple copies isn't as useful as it seems, because the software to sift through them is pretty limited. What software would you use to difference 3 mostly-identical trees of millions of files to find dupes and corruption? The UIs really suck for this.

    Regarding full copies: I like disk images if all I'm doing is replacing a hard drive and getting stuff back. But if I'm moving PC->Mac, having gotten fed up after my laptop motherboard stops booting? Files are then much more useful. Again, back to sync.

    I think there are a variety of solutions that solve different problems, but in terms of holding onto your photos for 50 years, I want a variety...also I want the ability to restore files from a new platform, new machine 5 years from now. And I think "backup" has not served that need very well.

  12. Most people only make the switch from PC to Mac or Mac to PC once in their lifetime. Maybe twice. I am not sure changing a daily routine for those one or two days is ideal.

    As for the difference between three backups, that is a problem, but that is where verification of the backup file to confirm it is in good working order and even testing a restore is a good practice.


    As for sync, it lacks one thing that becomes a deal breaker for us: the ability to encrypt with absolute certainty that no one else can decrypt it. This gives you a lot of leeway in where you store your off-site backups. I would not even store files on a remote dedicated server, as admin access is easily compromised.

    Not only do many businesses neglect their backups, even more neglect the security of their backups. A single stolen backup tape can bypass all security on-site and throughout the operating systems.

    Christopher Spence
    Lexan Systems LLC

  13. Crashplan is all you need. The software is free and will back up all of your data to a hard drive on another computer in near real-time. Or, you can back up to "a friend" who can then also push their data back to you. www.crashplan.com

  14. I can weigh in on why we don't do sync, and why our archives are not files.

    ----Sync----
    To me, the promise of sync is "Help me manage my data on these computers"
    So when I hear sync, I don't hear backup. True sync replicates all the human errors and machine corruption automatically. Got a virus that munges your data? That'll be synced. Have a bad SATA controller that's dropping a bit 1 time in a billion? That file will be munged and the bad file will be synced. Am I a fan of sync? As they say in Minnesota, "You betcha".

    Some day, perhaps CrashPlan will sync. Our compression and data de-duplication is so good, I dare say we'd probably be in the top 5% for efficiency, maybe even #1. But it's a lower priority for us.

    ----Backup----
    To me, the promise of backup is "Keep my data safe, no excuses."
    That's a big promise, it suggests more than just software, but a process that is proven.

    Process examples:
    If your data isn't backed up offsite, that's not good enough.
    If you have a lot of data and it's only offsite, that's not good enough either.
    If you backup onsite and offsite but don't actually TEST those backups on a regular basis, well, that's not good enough either.

    Data examples:
    The two main reasons people lose data are:
    a)Hardware failure (note hardware failure can be intermittent and thus hard to detect)
    b)Human failure (save over files, accepting a virus payload, deleting things we didn't intend to, the list is a long one)

    A good backup solution must protect against these situations as well. This necessitates the obvious need for versioning and deletion recovery.

    Today, CrashPlan does onsite and offsite backups with automatic autonomous archive validation. It delivers on "our" promise of backup. The thing we're most proud of is the fact our backups assume that the destinations will fail you.

    ----Why we don't backup----
    The surveys will tell you people don't backup due to difficulty. I think if you dig deeper, you'll see a component of that is cost. Cost presents itself in various forms - time to figure out the software, time to backup, disk space used, actual fees paid to a third party, you get the idea.

    That's why we invested heavily in reducing bandwidth and disk space used. It reduces costs. We de-duplicate all the data (this isn't lame "hey that's a duplicate file" data de-duplication, this is find byte pattern X in files A, B and C at any position).

    ---Archive Formats---
    You need to store data in a way that enforces the promise of backup while at the same time, helps solve the problem of folks not backing up.

    We use a "closed" format for the following reasons.
    - Store 4 Billion files in 10. This reduces overhead of destination file system.
    - Store every version of every file for all time, often in less space than the original.
    - Encrypt everything, ensuring nobody can tell what's inside
    - Store unique data from source only, no duplicate data.
    - Have redundant data and checksums in place to detect and heal around any archive corruption that occurs

    Security, Integrity, cost reduction, these are reasons why all your data is "packed up" into a few closed files. If we didn't feel these were so important, we'd have thrown the files out there like sync does. It would have saved us millions in R&D.

    So that's our reason why we're closed like that. We're still improving the "API" on our storage format, eventually you may see us open it up. At a minimum, you'll have command line tools to work with them.

    In summary, everything we do, we do under the promise of backup.
    Sync is a different promise, the two are not 100% compatible IMHO.

    They're similar, but not identical. I'd rather have the best backup promise implementation or the best sync implementation rather than compromising the two for something less.

    ~Matthew

  15. Thanks Matthew, it's great to have this level of comments.

    I think the criticisms of sync you make are certainly valid for simplistic implementations (even rsync), but there's no reason that versioning and corruption detection can't happen on a filesystem, or a networked sync implementation.
    e.g.,:
    - You can easily move files to a "deleted" folder on each replica, rather than replicating deletes directly
    - You can do the same with modifications to files.

    I'm totally in agreement that sync (without filesystem support for snapshots) still leaves this "old" data vulnerable. But to be clear: even "old versions" of files in my model can be verified against other copies on the network. Of course, keeping ten versions of big files is ultimately ugly (mbox files could be deadly...), and you start wanting a more efficient format like Crashplan's. But I would love to have "deletes" and "last version" in regular files, and the rest in an opaque log.

    I'm just putting it out there--what if the opaque files ultimately cause us to lose more data (or at least access to it) in the long run? Why does restore software always suck compared to backup? To restore an old version of a file in most "backup" apps, I click, wait 60 seconds, etc. My filesystem browser is instant. So is my ZFS snapshot folder. These tools are really hard to get right.

    And if you guys had a simple set of tools written against your format, that would be amazing.

    I think Crashplan is one of the very best products in this space...it *always* comes up when I'm talking with friends about this topic. Giving away the software for the usual p2p cases is wonderful, and it makes it possible to get your data back, a lot more often than before. I'm almost a big fan. :)

    I am not using Crashplan right now (I'd like to), mostly because it seems to consume huge amounts of CPU--to use a couple examples, mozy and dropbox together use 1/5 the resources of Crashplan. If you can fix this, you'll have a total fan.

    Along with a good backup, I'm still thinking about how to have my most valuable files stored in folders on multiple computers, compared against each other daily. I think in the "keeping data for 20 years" plan, this fixes a lot of the icky places where backups fail, and I think a lot more people would use a system like this than use backup today.

  16. re>cpu.
    You are in complete control on how much cpu we use. Under settings, you have CPU% for when you're at keyboard, and when you're away. Are you a performance nut? then set it to 0% present (meaning don't do anything) and 90% when you're away.

    Also - on every OS, we're scheduled as "idle" cpu usage. We're only using that which isn't used anyway.

    So from our point of view, this is a non-issue. It should be for you as well.

    re>sync
    Yes - you're right. Lots of neat things you could do. We've got some ideas over here as well... lots of efficiency potential!

    re>"data for 20 years"
    Ok, now you're talking about yet another promise, which I call archive. That's entirely different as well... That is something you'll see from us before you see sync I think. That's the promise of, "I'm going to remove this from my laptop - you must guarantee me this file will be available 100 years from now and survive large geographic regions being offline". Very challenging promise... can't say more other than to say we're doing some exciting things there.

  17. Mozy is also much slower if I remember correctly. Even Acronis sucks up cpu cycles on our quad cpu machines (roughly 70-80% of the four cpus), but it also is able to backup 200gb in a little over an hour with 256 bit encryption.

    I have looked at CrashPlan myself a few times as well, I haven't given it a try though. The ability to have on-site while doing off-site is a great solution, we stay away from off-site solutions that do not have this functionality.

    I have played with ZFS a bit, and I love the snapshot feature, one of the biggest features SAN solutions bring to the table. ZFS is very immature and there are a few issues with its stability, especially on non-Sun OS.

  18. This comment has been removed by the author.

  19. Joe Said "From http://aws.amazon.com/s3/#pricing I find the cost to be listed as:

    $0.120 per GB – storage used / month over 500 TB.

    Where'd you get $1.44/GB/yr ?"

    .12 x 12 months is $1.44/yr.

  20. Feel free to check out my Creating & Keeping Persistent Digital Memories too.

    I use cloud (Carbonite) for work files (disaster CYA). Evernote too. For the rest I use Acronis and live data shadowing using NTI Shadow. Two 1 GB drives are doing that. Need to get something offsite though.

    I also store a lot of files in PersonalBrain. Makes it easy to go File -> Create BrainZip and know that I have database & files all together. 1 place to backup.

    Really good post -- thanks!
