January 16, 2024
 

Source: Copying large files with Rsync, and some misconceptions

There is a notion that a lot of people working in the IT industry often copy and paste from internet howtos. We all do it, and the copy-and-paste itself is not a problem. The problem is when we run things without understanding them.

Some years ago, a friend who used to work on my team needed to copy virtual machine templates from site A to site B. They could not understand why the file they copied was 10GB on site A, but became 100GB on site B.

The friend believed that rsync is a magic tool that should just “sync” the file as it is. However, what most of us forget is to understand what rsync really is, how it is used, and, most importantly in my opinion, where it comes from. This article provides some further information about rsync, and an explanation of what happened in that story.

About rsync

rsync is a tool that was created by Andrew Tridgell and Paul Mackerras, who were motivated by the following problem:

Imagine you have two files, file_A and file_B. You wish to update file_B to be the same as file_A. The obvious method is to copy file_A onto file_B.

Now imagine that the two files are on two different servers connected by a slow communications link, for example, a dial-up IP link. If file_A is large, copying it onto file_B will be slow, and sometimes not even possible. To make it more efficient, you could compress file_A before sending it, but that would usually only gain a factor of 2 to 4.

Now assume that file_A and file_B are quite similar, and to speed things up, you take advantage of this similarity. A common method is to send just the differences between file_A and file_B down the link and then use this list of differences to reconstruct the file on the remote end.

The problem is that the normal methods for creating a set of differences between two files rely on being able to read both files. Thus they require that both files are available beforehand at one end of the link. If they are not both available on the same machine, these algorithms cannot be used. (Once you have copied the file over, you don't need the differences anymore.) This is the problem that rsync addresses.

The rsync algorithm efficiently computes which parts of a source file match parts of an existing destination file. Matching parts then do not need to be sent across the link; all that is needed is a reference to the part of the destination file. Only parts of the source file which are not matching need to be sent over.

The receiver can then construct a copy of the source file using the references to parts of the existing destination file and the original material.

Additionally, the data sent to the receiver can be compressed using any of a range of common compression algorithms for further speed improvements.

The rsync algorithm addresses this problem in a lovely way, as many of us know.
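If you want to see this at work, rsync can report its own statistics. In the sketch below the host and paths are only placeholders; the --stats option prints, among other things, how much matched data could be reused from the existing destination copy and how much literal data actually had to cross the link:

rsync -av --stats file_A user@remote:/data/

On a repeat transfer of a file that already exists (and is similar) on the destination, the reported literal data should be far smaller than the total file size.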

After this introduction to rsync, back to the story!

Problem 1: Thin provisioning

There were two things that would help the friend understand what was going on.

The problem with the file getting significantly bigger on the other side was caused by Thin Provisioning (TP) being enabled on the source system. Thin provisioning is a method of optimizing the efficiency of available space in Storage Area Networks (SAN) or Network Attached Storage (NAS).

Because of TP, the source file occupied only 10GB on disk, but when it was transferred using rsync without any additional options, the destination received the full 100GB. rsync could not do the magic automatically; it had to be configured.

The flag that does this work is -S or --sparse, and it tells rsync to handle sparse files efficiently. And it does what it says: the empty (sparse) regions are not written out on the destination, so both source and destination will have a 10GB file.
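If you want to try this out, here is a small sketch, reusing the destination from the story (the test file name is made up). truncate creates a file with a large apparent size but no allocated blocks, du shows the difference between apparent and allocated size, and -S preserves the holes when copying:

truncate -s 10G sparse_test.img
du -h --apparent-size sparse_test.img    # reports the 10G apparent size
du -h sparse_test.img                    # reports almost nothing actually allocated on disk
rsync -avS sparse_test.img syncuser@host1:/destination

Run the same rsync without -S and the copy on the destination side will occupy the full apparent size on disk.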

Problem 2: Updating files

The second problem appeared when sending over an updated file. The destination was now receiving just the 10GB, but the whole file (containing the virtual disk) was always transferred, even when only a single configuration file had been changed on that virtual disk. In other words, the whole file was re-sent even though only a small portion of it had changed.

The command used for this transfer was:

rsync -avS vmdk_file syncuser@host1:/destination

Again, understanding how rsync works would help with this problem as well.

The above is the biggest misconception about rsync. Many of us think rsync will simply send the delta updates of the files, and that it will automatically update only what needs to be updated. But this is not the default behaviour of rsync.

As the man page says, the default behaviour of rsync is to create a new copy of the file in the destination and to move it into the right place when the transfer is completed.
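You can see this for yourself during a large transfer: list the destination directory and you should find a hidden temporary file (the random suffix below is made up) that rsync renames over the target once the transfer completes:

ls -a /destination
# .vmdk_file.k3R7Qz    rsync's working copy; it replaces vmdk_file when the transfer is done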

To change this default behaviour of rsync, you have to set the following flags and then rsync will send only the deltas:

--inplace               update destination files in-place
--partial               keep partially transferred files
--progress              show progress during transfer

So the full command that would do exactly what the friend wanted is:

rsync -av --partial --inplace --append --progress vmdk_file syncuser@host1:/destination

Note that the sparse flag -S had to be removed, for two reasons. The first is that --sparse and --inplace cannot be used together when sending a file over the wire. The second is that once you have sent a file over with --sparse, you can't update it with --inplace anymore. Note that versions of rsync older than 3.1.3 will reject the combination of --sparse and --inplace.

So even though the friend ended up copying 100GB over the wire, that only had to happen once. All the following updates only copied the differences, making the transfers extremely efficient.
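One way to convince yourself of this, using the same command as above, is to make a small change inside the virtual disk and run the transfer again; the summary line that -v prints at the end (bytes sent, plus a speedup factor) should show only a tiny fraction of the 10GB crossing the wire on the second run:

rsync -av --partial --inplace --append --progress vmdk_file syncuser@host1:/destination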


Just one important note about one option that you suggested:

--append This causes rsync to update a file by appending data onto the end of the file …

The use of --append can be dangerous if you aren't 100% sure that the files that are longer have only grown by the appending of data onto the end.

So if you use "--append" and then modify the source file by changing content in the middle of it (instead of only appending content), you will get a different file on the destination! You can check it with md5sum.
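Here is a small way to reproduce that, reusing the paths from the article and assuming /destination is a directory on host1 (the dd offset is arbitrary): overwrite a chunk in the middle of the already-synced source file without changing its length, sync again with --append, and compare checksums on both ends:

dd if=/dev/urandom of=vmdk_file bs=1M count=1 seek=100 conv=notrunc    # change 1MB in the middle, keep the length
rsync -av --append --progress vmdk_file syncuser@host1:/destination
md5sum vmdk_file
ssh syncuser@host1 md5sum /destination/vmdk_file

Because the file did not grow, --append leaves the destination untouched, and the two checksums no longer match.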


I thought the delta-xfer algorithm is always used, unless you use the --whole-file option to disable it? So I would've thought that the large file should be transferred "efficiently" over the network; it's just that the receiver side is chewing up twice the space if not using --inplace (the existing local destination file is used as a "basis", and matching blocks can be copied from there to the temporary file).

I've reviewed the rsync man page and I can conclude the same. Rsync by default uses the delta-xfer algorithm for the network transfer; it's only the handling of the transferred delta chunks on the destination end that is, by default, applied to a temporary file before being copied over the target file. The options the author has listed override the target-file reconstruction step by dropping the use of a temporary file altogether.
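For reference, that behaviour can be toggled explicitly (the paths below are placeholders): -W or --whole-file disables the delta-xfer algorithm, and it is also the implicit default when both source and destination are local paths, since reading a local basis file costs about as much as simply copying it; --no-whole-file forces delta-xfer back on:

rsync -av --whole-file vmdk_file syncuser@host1:/destination
rsync -av --no-whole-file vmdk_file /mnt/backup/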


What I want to be able to do is backup in a manner similar to how Apple's Time Machine does it, but again without the GUI. Specifically, on a regular schedule (every hour or every day) I want to do a full backup of all files to a different drive on the machine or elsewhere on the local network, but of course I only want it to write the data that has changed.

My understanding is that Apple uses "sparse" files of some kind in Time Machine, so that's fine, but what I really need is two things: The ability to easily recover a single file, group of files, or entire directory from the backup (not necessarily the most recent backup; I should be able to view or recover files from earlier backups in addition to the most recent, as is possible in Time Machine).

But more important, and this is the thing that nobody seems to be able to explain how to do, is that I need to be able to reformat the hard drive and then put everything (system files AND user files) back as they were prior to the last update. A big bonus would be the ability to roll back the last kernel update and get back to how everything was just before that kernel update. In other words something similar to taking a snapshot of a virtual machine, then restoring from that snapshot, but without actually using a virtual machine.

As far as I know, rsync isn’t the solution for you. Rsync is for file syncing, and although it can be used (and is used) as a light backup solution, it isn’t well-suited to do everything you ask for (without some heavy-duty scripting).

I think you should look at the backup options available. One possibility would be this:

https://fedoramagazine.org/backup-on-fedora-silverblue-with-borg/

You could also look into Deja Dup.

Yes, Deja Dup is a friendly front end for Duplicity (http://duplicity.nongnu.org/), something I incorporate into a lot of DR and backup solutions on a regular basis. They both take a lot of the guesswork out of using tar and rsync together, and offer the flexibility to be used in a myriad of ways.

Another option is Veeam Agent for Linux. There's a free edition that will do what you want and you can set it up from the command line. There is not an ARM version of it yet, but it does exist for Debian, RPM-based distributions, openSUSE, and several other flavors.

It offers file-level recovery and Forever Forward Incremental backups. Scheduling of the job is done using crontab.

https://www.veeam.com/linux-backup-free.html
