[sllug-members]: easy backups
Scott Patten
scott at pattens.net
Mon May 1 21:32:36 MDT 2006
Bart Whiteley wrote:
> On Mon, 2006-05-01 at 18:41 -0600, Mac Newbold wrote:
>
>> Today at 6:05pm, Bart Whiteley said:
>>
>>
>>> On Mon, 2006-05-01 at 17:50 -0600, Scott Patten wrote:
>>>
>>>> I'd like to add a second vote for rdiff-backup. It combines the
>>>> advantages of diff with those of rsync. It will very efficiently
>>>> create
>>>> a copy of a directory and also keep compressed backups of the changes
>>>> that occurred since the last backup. When you want to restore a
>>>> file,
>>>> you just copy it. When you want to restore an older copy then you
>>>> issue
>>>> an rdiff-backup command that decompresses and recombines the files to
>>>> give you the version that you are after.
>>>>
>>> What are the advantages of rdiff-backup over
>>> http://www.mikerubel.org/computers/rsync_snapshots/ ?
>>>
>> I'll answer a different question and say that rsnapshot has the advantage
>> that every file in any of your snapshots is a complete file, usually a
>> hard-link to identical versions of the same file. You can easily search
>> your backups with grep or whatever you normally like to use, and restoring
>> a certain version is as easy as doing a cp (I like cp -p too). The file
>> permissions and ownership are perfectly preserved. The biggest advantage
>> I've found in that is knowing that I can quickly and very easily do large
>> sweeping restores in case of emergency. Once I know which version I want
>> (like the most recent backup) it is extremely easy to get it and do
>> whatever you want with it. Since it uses rsync, it also only transfers the
>> files that have actually changed, and you can turn on compression for
>> remote backups, so that it saves on bandwidth and time. (It doesn't store
>> the files compressed, just gzips them during transfer over the network.)
>>
>> I suspect that an rdiff backup to a remote server would have to transfer
>> at least the smaller of the two file versions to find the difference
>> between them, and if so, it wouldn't provide any bandwidth/speed savings
>> over an rsync/rsnapshot.
>>
>> The place it could provide an advantage is in the space required for a
>> backup, especially if you have lots of large files that get small changes
>> (like log files for example). The diffs could also be bigger than the file
>> itself, though, if a majority of the file changed, since it shows the old
>> way and the new way in the diff. And generally diffs are ridiculously bad
>> with binary files, like images, compressed files, non-text documents (like
>> the MS Office formats), etc.
>>
>> The rdiff method can make it more fragile too. This method also adds more
>> complications for restoring or examining files in your backups, since
>> they're not whole files, just patches, basically. To restore, you need to
>> have a valid version of every patch in the dependency chain. I don't know
>> how they do it in rdiff-backup, but if the full backup was on April 1st,
>> and you wanted the version from April 27th, you might need 27 patch files,
>> and if you didn't have all of them, perfect and without any corruption,
>> you wouldn't be able to recover the file fully. If they do fulls monthly,
>> an incremental to the full weekly, and an incremental to the weekly each
>> day, then you'll only need 3 patches for everything to get back perfectly.
>> All of this is avoided by storing only whole file versions with
>> rsync-based methods.
>>
>>
>
> That's what I was thinking. As I read his description of rdiff-backup,
> I could only think of ways in which it was inferior to rsnapshot-like
> solutions. I was just wondering if I missed something...
>
Man you guys are amazing. At least I read your article. rdiff-backup
is in most respects rsnapshot-like.
If you're insane about speed then use rsnapshot. If you're concerned
about space or your time then try rdiff-backup. By the way,
rdiff-backup can save old versions as whole files, as diffs or as
gzipped versions of the files or diffs.
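To make the "as diffs" option concrete, here is a rough Python sketch of the reverse-delta idea: the newest version stays whole, and each increment records only how to get back to the version before it. This is an illustration, not rdiff-backup's code. It uses Python's line-based difflib instead of librsync's binary deltas, and the names make_reverse_delta, apply_delta and restore are made up for the example.

```python
import difflib


def make_reverse_delta(new_lines, old_lines):
    """Record how to turn new_lines back into old_lines."""
    matcher = difflib.SequenceMatcher(a=new_lines, b=old_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Lines already present in the newer version: store a reference only.
            delta.append(("copy", i1, i2))
        else:
            # Lines that differ: store the old content itself (may be empty).
            delta.append(("data", old_lines[j1:j2]))
    return delta


def apply_delta(new_lines, delta):
    """Rebuild the older version from the newer one plus a reverse delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(new_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def restore(mirror, increments, steps_back):
    """Walk back steps_back versions from the mirror (0 returns the mirror)."""
    lines = mirror
    for delta in increments[:steps_back]:
        lines = apply_delta(lines, delta)
    return lines


# Three versions of a file, oldest first (made-up sample data).
v1 = ["alpha\n", "beta\n", "gamma\n"]
v2 = ["alpha\n", "BETA\n", "gamma\n", "delta\n"]
v3 = ["alpha\n", "BETA\n", "delta\n"]

mirror = v3  # the newest version is kept as a whole file
increments = [make_reverse_delta(v3, v2), make_reverse_delta(v2, v1)]
```

In this sketch, restoring the newest version needs no patching at all; only older versions walk the chain, one delta per step back.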
From the rdiff-backup site:
* *rsync <http://rsync.samba.org>* - the inspiration for
rdiff-backup. Although rsync and rdiff-backup do not share any
code, rdiff-backup uses the rsync algorithm, invented by rsync
author Andrew Tridgell.
Compared to rdiff-backup, rsync is faster, so it is often the
better choice when pure mirroring is required. Also rdiff-backup
does not have a separate server like rsyncd (instead it relies on
ssh-based networking and authentication).
However, rdiff-backup uses much less memory than rsync on large
directories. Second, by itself rsync only mirrors and does not
keep incremental information (but see below). Third, rsync may not
preserve all the information you want to back up. For instance, if
you don't run rsync as root, you will lose all ownership
information. Fourth, rdiff-backup has a number of extra features,
like extended attribute and ACL support, detailed file statistics,
and SHA1 checksums.
* *rsync-based scripts* - Because rsync does not save incremental
information, it is usually inappropriate for backing up. There are
several utilities which use the rsync binary, but keep old data by
using rsync --link-dest option and rotating the destination
directory.
Compared to rdiff-backup, these are usually faster but use more
memory and disk space. They make each increment appear as a
separate complete directory, which is a neat feature. On the other
hand, these will usually be missing the features that are missing
from rsync (see above).
Here are various programs which use the rsync strategy:
o *Mike Rubel's rsync snapshots
<http://www.mikerubel.org/computers/rsync_snapshots/>* - the
original rsync hardlinking script
o *rsnapshot <http://www.rsnapshot.org/>* - based off Mike
Rubel's article
o *Dirvish <http://www.dirvish.com/>* - perhaps the most
feature-filled of these programs
o *Backup Buddy
<http://www.effortlessis.com/backupbuddy/others.php>* - a
simple, easy-to-use option
o *rsback <http://www.pollux.franken.de/hjb/rsback/>*
o *Snapback2 <http://www.perusion.com/misc/Snapback2/>*
o *rsync-incr <http://colas.nahaboo.net/software/rsync-incr/>*
o *rsyncbackup <http://rsyncbackup.erlang.no/>*
o *ccollect <http://linux.schottelius.org/ccollect/>*
o *RIBS backup <http://www.ribs-backup.org/>*
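The hard-link strategy these scripts share can be sketched in a few lines of Python. This is a simplified illustration, not the code of any program listed above: the snapshot function and the snap.0/snap.1 directory names are hypothetical, it compares file contents rather than using rsync's own checks, and it handles regular files only.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(source, previous, destination):
    """Copy source into destination, hard-linking files unchanged since previous."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.join(destination, rel), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            new = os.path.join(destination, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: both snapshots share one inode
            else:
                shutil.copy2(src, new)  # new or changed: store a full copy


# Demo in a throwaway directory: two snapshots with one edit in between.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "src")
    os.makedirs(src_dir)
    for name, text in (("same.txt", "unchanged\n"), ("edit.txt", "v1\n")):
        with open(os.path.join(src_dir, name), "w") as f:
            f.write(text)

    snap0 = os.path.join(tmp, "snap.0")
    snapshot(src_dir, None, snap0)

    with open(os.path.join(src_dir, "edit.txt"), "w") as f:
        f.write("v2\n")

    snap1 = os.path.join(tmp, "snap.1")
    snapshot(src_dir, snap0, snap1)

    shared = os.path.samefile(os.path.join(snap0, "same.txt"),
                              os.path.join(snap1, "same.txt"))
    with open(os.path.join(snap0, "edit.txt")) as f:
        old_text = f.read()
    with open(os.path.join(snap1, "edit.txt")) as f:
        new_text = f.read()
```

Each snapshot directory looks like a complete copy, but an unchanged file costs only a directory entry, while a changed file costs a full copy no matter how small the change was.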
Features
For many people hard disks provide the form of persistent storage that
is most readily available and cheapest per MB. I think that rdiff-backup
is often the best way to back up one hard drive to another.
* *Easy to use:* In most cases, the command
rdiff-backup dir1 dir2
will work out-of-the-box to back up dir1 to dir2.
rdiff-backup dir1 user@system::/dir2
will back up dir1 to dir2 on a different system (provided
rdiff-backup is installed on both systems). rdiff-backup also
comes with a lot of up-to-date documentation
<http://www.nongnu.org/rdiff-backup/docs.html>.
* *Creates mirror:* rdiff-backup makes the backup directory into an
almost exact copy of the source directory (the only difference is
one extra subdirectory on the backup side). If you delete a file
from the source directory you can simply copy it from the backup
directory, use "find" or "locate" to find the file, or use any
other familiar utility. Also, if the two directories are on
different disks, you can recover almost immediately if the disk
containing the source directory crashes, just by mounting the
backup directory where the source directory used to be.
* *Keeps increments:* Normally, with a mirror, any changes made to
the source directory are immediately sent to the backup directory,
and old changes are lost. rdiff-backup saves those changes in the
form of reverse diffs, so you can recover the older form of the file.
For instance, suppose last week you deleted half of some document,
thinking that what you had written was garbage. Yesterday, your
backup event ran, saving these changes. Today you realize that you
were on to something and want what you deleted back. If you just
mirrored, you would be out of luck, since the copy on your mirror
would be the newer one. With rdiff-backup, the newer version would
indeed be present, but in a special directory (rdiff-backup-data/)
there would be a file that recorded this change. Running
rdiff-backup on this file recovers the version from a week ago.
* *Preserves all information:* Whether you restore from the mirror
directory or from an earlier incremental backup, rdiff-backup will
reproduce your files exactly as they were. Files missing at the
time of backup will also be missing after the restore. Files hard
linked when backed up will be hard linked after the restore.
rdiff-backup also preserves permissions, user and group ownership,
modification time, device files, fifos, and symlinks.
Sometimes it is impossible for the information to be replicated
exactly on the destination. For instance, ownership cannot usually
be replicated without root access at the destination; windows file
systems may not be case sensitive and have no ownership at all.
rdiff-backup records file metadata in a separate file so that all
information is preserved even if the destination file system is
missing features.
* *Space efficient:* Suppose you have a large database file that
changes a little bit every day. A normal incremental backup would
keep saving copy after copy of this database, wasting a lot of
space. rdiff-backup uses librsync, which implements the same
efficient diffing algorithm that rsync uses. It works on binary
files as well as text, so only a fraction of the data in your
database would be saved in each incremental backup.
* *Bandwidth efficient:* rdiff-backup depends on librsync, and thus
uses the same diffing algorithm as rsync (rsync and rdiff-backup
strictly speaking do not share any code however). As a result,
when writing to a remote location, rdiff-backup will only
send diffs over and can use much less bandwidth than, say, ftp or scp.
For instance, suppose you slightly alter large file A to make
large file A', and A is still on the remote system. When
rdiff-backup is run, it will only send over the diff A->A' (in
order to "copy" A' to the remote system). Neither A nor A' needs
to be sent in its entirety.
* *Transparent data format:* Except for recording the hard link
structure of old data sets, rdiff-backup doesn't absolutely
require any data files formatted specifically for rdiff-backup. So
if you want to stop using rdiff-backup in the future, you won't be
stuck with any undecipherable files in some strange format. As
noted above, the mirror directory will just be a copy of the
source directory as it was when rdiff-backup was last run. Earlier
states of your files are saved just by 1) keeping a copy of them,
2) in diff form as produced by rdiff, or 3) as a gzipped version
of 1 or 2.
* *Filesystem feature autodetection:* People use rdiff-backup in
many different environments. The filesystem they want to back up
may be on Linux, Windows, or Mac. It may or may not be case
sensitive, support characters like ":", have resource forks,
extended attributes, or access control lists. Moreover, the file
system they are backing up to may or may not support these features.
rdiff-backup tries to handle these situations automatically
without the need for switches like --acl --ea --no-ownership, etc.
When run it will run tests on both the source and destination
filesystems to see what features each supports like case
sensitivity, changing uid/gid ownership, resource forks, extended
attributes, or access control lists. To see the results of this
testing, run rdiff-backup with verbosity 4 or higher, as in |-v4|.
* *Mac OS X resource fork support:* On Mac OS X systems,
rdiff-backup will back up the resource forks which store, for
instance, Finder information. Most unix backup programs would only
back up the data forks and discard the resource forks.
* *ACL and EA support:* If rdiff-backup can find the pylibacl
<http://pylibacl.sourceforge.net/> and pyxattr
<http://pyxattr.sourceforge.net/> modules, and if the file system
supports these features, rdiff-backup will preserve Access Control
Lists and user-level Extended Attributes.
* *Keeps statistics:* After each session rdiff-backup writes summary
statistics to a text file. You can inspect these to see how large
your repository is, how fast it is growing, and how much space
rdiff-backup is saving you, and more. Here is an example
|session_statistics| file:
StartTime 1124018521.00 (Sun Aug 14 06:22:01 2005)
EndTime 1124019454.64 (Sun Aug 14 06:37:34 2005)
ElapsedTime 933.64 (15 minutes 33.64 seconds)
SourceFiles 975715
SourceFileSize 13078345389 (12.2 GB)
MirrorFiles 975604
MirrorFileSize 13076177922 (12.2 GB)
NewFiles 119
NewFileSize 1103075 (1.05 MB)
DeletedFiles 8
DeletedFileSize 190653 (186 KB)
ChangedFiles 2032
ChangedSourceSize 395324417 (377 MB)
ChangedMirrorSize 394069372 (376 MB)
IncrementFiles 2233
IncrementFileSize 6098156 (5.82 MB)
TotalDestinationSizeChange 8265623 (7.88 MB)
Errors 80
rdiff-backup also saves very detailed statistics in a
|file_statistics|. This file is also in (compressed) text form,
but is usually too voluminous to read manually.
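Because session_statistics is plain text, it is easy to process with a script. A minimal sketch in Python, assuming the file follows the layout of the sample above (a name, a numeric value, and an optional human-readable note in parentheses); parse_session_statistics is a hypothetical helper, not part of rdiff-backup:

```python
def parse_session_statistics(text):
    """Turn 'Name value (note)' lines into a dict of numbers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 2)  # name, value, optional "(note)"
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, value = parts[0], parts[1]
        stats[name] = float(value) if "." in value else int(value)
    return stats


# A few lines copied from the sample session_statistics file above.
SAMPLE = """\
ElapsedTime 933.64 (15 minutes 33.64 seconds)
SourceFileSize 13078345389 (12.2 GB)
ChangedFiles 2032
ChangedSourceSize 395324417 (377 MB)
IncrementFileSize 6098156 (5.82 MB)
Errors 80
"""

stats = parse_session_statistics(SAMPLE)

# Size of the stored increments relative to the source files that changed.
increment_ratio = stats["IncrementFileSize"] / stats["ChangedSourceSize"]
print(f"increments are {increment_ratio:.1%} of the changed source size")
```

For the session above, the increments take up roughly 1.5% of the size of the changed source files.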