Backup user data with rsync


Categories: backup
Keywords: backup bash

I know I do not have an ideal backup strategy, but I do believe in a continual improvement process, and especially for backups, waiting until I find the perfect strategy is not the right approach.

Ideally, the programs used for backing up my data (and for restoring it) have the following properties:

  • support multiple platforms (at least Windows, GNU/Linux, and Android), as I do not want to find different solutions for the same problem

  • open source, and use an already existing format for the data; this should ensure that it will always be possible to restore and validate backups, even if development ceases or if there are any issues with the backup program

  • back up multiple directories

  • offer the possibility to filter/ignore files and directories

  • rotating backups/backups with revisions/snapshots (no differential/incremental backups)

  • easy to use, in order to avoid noticing too late that the data has not been backed up correctly

  • being able to restore single files

  • can be triggered manually or automatically

This is because I do not want to find out that I cannot access my data, or that I need to find different solutions for different operating systems.

I also want to use something I could recommend to less tech-savvy people for backing up their data.

Notice: I’m putting my effort into backing up the data; for backing up the operating system my criteria would be completely different. It would be, for example, completely fine to use system-specific programs.

What follows is a non-exhaustive overview of programs I’ve found and looked at (not necessarily tried).

A more complete list can be found at https://en.wikipedia.org/wiki/List_of_backup_software.

But somehow, none of those programs ticked all the boxes.

For example, because of its decentralization and deduplication features, bup seems a very nice solution…

But it is still in an early stage, and I wanted to be sure that I could restore my files with future versions as well.

attic is discontinued, but its fork, borg, is maintained. Unfortunately, it does not support Windows.

Duplicati works both on Windows and GNU/Linux. It is not as mainstream as others and has no official Debian package. Unfortunately, it is currently in a sort of limbo: v2 is labeled as beta and v1 as not supported, so what am I supposed to use?

I’ve tested this program, but it did not seem to be very stable (it’s marked beta after all!): my backups got corrupted (at least the software claimed so), and I was not able to repair them or restore my data!

rdiff-backup and rsnapshot were more what I was looking for.

With rdiff-backup, the latest snapshot is simply a copy of all files on the drive, so restoring them is super easy: just copy the files. It is "harder" to restore older snapshots.

With rsnapshot all snapshots are simply a copy of all files. This is even better.

The biggest disadvantage of rsnapshot is that it is not well suited for automating the backup job unless the device to be backed up is on all the time. The files are not pushed to the server but pulled from the device: the server contacts the client and copies the data. Normally, in my case, it’s the server that is on all the time, and the client decides when it wants to make a backup (by hand or automatically with a cron job).

As many backup solutions I liked (but not necessarily listed here) use rsync behind the scenes, I tried to come up with my own backup program.

I wanted to achieve the following properties (with rsync and possibly other tools):

  • rotating backups

  • unchanged files are handled with hardlinks in order not to consume too much space

  • should be able to list which files and folders to back up

  • should be able to blacklist files and subfolders (for example thumbnails or other files that are automatically generated)

  • backup folder name should contain the date, in order to make it easy to see when it was taken

Note that other properties that are surely important, like compression and encryption, were out of scope, as my use case is primarily local backups (thus backing up the data to a second drive or to a local remote machine), or backups via ssh to a second machine at home.

Also note that I do not want incremental or differential backups, as I want to be able to delete older backups, for example, if there is not enough space on my disk.

The first version

After some fiddling, this is what came out (do not use!)

#!/usr/bin/env bash

set -o errexit
set -o nounset

CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/"

# -------------- configs

backupdir="/folder/where/to/store/backups"; # on remote
remote="remote device where to store backups"
readarray -t dir_to_backup < "$CONFIG_DIR/includes.cfg"

ssh "$remote" mkdir --parents "$backupdir";

currentdir=$(date '+%Y-%m-%d_%H-%M-%S');
currentdir="$backupdir/$currentdir";

rsync \
  --progress --info=progress2 --human-readable \
  --times \
  --links \
  --new-compress --recursive \
  "${dir_to_backup[@]}" \
  "$remote:$currentdir" ;

As I wanted a solution that works on multiple devices and systems, it felt easier and more natural to store the paths to back up in a configuration file. To support the XDG path specification, I’ve used CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/" as config directory, which uses "$XDG_CONFIG_HOME/backup" if XDG_CONFIG_HOME is defined, and otherwise falls back to "$HOME/.config/backup/".

As for config files, there is currently only includes.cfg, but variables like backupdir and remote should eventually go into a config file too! includes.cfg is a line-based list of files and directories to copy. Its content is read into an array with readarray and expanded into rsync parameters.
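For illustration, a minimal includes.cfg could look like this (the paths are just examples; one file or directory per line, each passed verbatim to rsync, so absolute paths are the safest choice):

/home/user/Documents
/home/user/Pictures
/home/user/Workspace
/home/user/Music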

excluding certain files or patterns

rsync supports --exclude-from=<file>, so I’ve decided to add an excludes.cfg file in the config directory.

There are many automatically generated files (unfortunately) that one normally wants to exclude.

I’ve currently listed

  • Thumbs.db

  • desktop.ini

  • *.pyc

  • __pycache__

  • __MACOSX

  • .DS_Store

but there are surely a lot of other files that are completely unnecessary.
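Based on the list above, excludes.cfg could simply contain the following (rsync reads one pattern per line; blank lines and lines starting with # or ; are ignored):

# automatically generated files that are not worth backing up
Thumbs.db
desktop.ini
*.pyc
__pycache__
__MACOSX
.DS_Store

With the exclude file in place, the script becomes: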

#!/usr/bin/env bash

set -o errexit
set -o nounset

CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/"

# -------------- configs

backupdir="/folder/where/to/store/backups"; # on remote
remote="remote device where to store backups"
readarray -t dir_to_backup < "$CONFIG_DIR/includes.cfg"

ssh "$remote" mkdir --parents "$backupdir";

currentdir=$(date '+%Y-%m-%d_%H-%M-%S');
currentdir="$backupdir/$currentdir";

rsync \
  --progress --info=progress2 --human-readable \
  --times \
  --links \
  --new-compress --recursive \
  --exclude-from="$CONFIG_DIR/excludes.cfg" \
  "${dir_to_backup[@]}" \
  "$remote:$currentdir" ;

Avoid duplicate files

The main issue with the current approach is that if most files do not change, we will have multiple identical copies. This might not be a problem, but if we do backups often, like every couple of hours, it will quickly fill up the backup drive. So the next step is trying to avoid this unnecessary space consumption with hardlinks. Actually, rsync already supports such a feature, as long as all backup directories have the same structure: --link-dest.

Thus I need to query the directory of the previous backup (if present) and give it to rsync as a parameter.

#!/usr/bin/env bash

set -o errexit
set -o nounset

CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/"

# -------------- configs

backupdir="/folder/where/to/store/backups"; # on remote
remote="remote device where to store backups"
readarray -t dir_to_backup < "$CONFIG_DIR/includes.cfg"

ssh "$remote" mkdir --parents "$backupdir";
# --link-dest needs a full path here: a relative path would be resolved against the destination directory
optpreviousdir="$backupdir/$(ssh "$remote" ls "$backupdir" | sort | tail -1 )";


currentdir=$(date '+%Y-%m-%d_%H-%M-%S');
currentdir="$backupdir/$currentdir";

rsync \
  --progress --info=progress2 --human-readable \
  --times \
  --links \
  --new-compress --recursive \
  --link-dest="$optpreviousdir" \
  --exclude-from="$CONFIG_DIR/excludes.cfg" \
  "${dir_to_backup[@]}" \
  "$remote:$currentdir" ;

Thanks to this, if "$optpreviousdir" points to a previous backup on the remote, rsync will hardlink the files that did not change for us!
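A quick way to convince yourself that the deduplication works is to compare inode numbers and disk usage on the backup destination (the paths below are only illustrative):

ls -li /backups/2021-05-01_03-00-00/Documents/report.odt \
       /backups/2021-05-02_03-00-00/Documents/report.odt
# unchanged files share the same inode number and show a link count greater than 1
du -sh /backups/2021-05-01_03-00-00 /backups/2021-05-02_03-00-00
# du counts hard-linked files only once, so the second snapshot adds almost nothing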

I soon found out the first big issue with this approach.

If for whatever reason a backup fails, the previous backup directory might have only a very small subset of the files to back up. This means that unconditionally using the previous backup might not be a good idea.

Using hard links has at least one other problem: they are not supported on FAT32.

While it is easy to say "avoid FAT32", some systems, like Android phones or other less powerful devices, use FAT32-formatted cards. Thus it is not possible, with the current strategy, to back up all photos from the internal storage to the SD card. And for those with an Android device that supports exFAT: no, even exFAT does not support links.

Note: there seem to be hacks for adding hardlink support to FAT32, but I did not explore them further, as depending on the operating system the backup might get screwed up.

And what about Windows? Thankfully I’m mainly interested in Windows 10, not the older versions. This means that it is possible to create symlinks without administrator rights. Note that the article is a little bit outdated: I tested my backup script on a fresh Windows 10 version 20H2, and I did not even need to enable the developer settings. One can test hardlinks from Cygwin:

touch a.txt
ln a.txt b.txt
ls -la a.txt b.txt # note that link count is 2 and not 1
echo "Hello World" > a.txt
cat b.txt # prints "Hello World": both names point to the same data

Otherwise, one can use fsutil (available since Windows XP!) to verify that two files are hard-linked together:

fsutil hardlink create b.txt a.txt
fsutil hardlink list b.txt

If I needed to support an older Windows version, running the backup script from Cygwin as administrator should not be a problem, but I did not test whether everything works unchanged.

Distinguishing between partial and complete backups

I decided to distinguish between complete backups and backups that did not finish. This is useful for having a good overview of the current status. It would be very unfortunate to try to restore some files, only to notice that the last backup was not complete, and then have to search for the last complete one.

One might argue that partial backups should be discarded, but I tend to disagree: incomplete backups, depending on how far they got, can contain a lot of useful information.

#!/usr/bin/env bash

set -o errexit
set -o nounset

CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/"

# -------------- configs

backupdir="/folder/where/to/store/backups"; # on remote
remote="remote device where to store backups"
readarray -t dir_to_backup < "$CONFIG_DIR/includes.cfg"

ssh "$remote" mkdir --parents "$backupdir";
# again, build the full path for --link-dest
optpreviousdir="$backupdir/$(ssh "$remote" ls "$backupdir" | grep -v partial | sort | tail -1 )";


currentdir=$(date '+%Y-%m-%d_%H-%M-%S');
currentdir="$backupdir/$currentdir";
tmpdir="$currentdir.partial";

rsync \
  --progress --info=progress2 --human-readable \
  --times \
  --links \
  --new-compress --recursive \
  --link-dest="$optpreviousdir" \
  --exclude-from="$CONFIG_DIR/excludes.cfg" \
  "${dir_to_backup[@]}" \
  "$remote:$tmpdir" ;

ssh "$remote" mv --no-target-directory "$tmpdir" "$currentdir";

Now, thanks to the temporary .partial directory, I’ve also solved the first issue with hardlinks: --link-dest will now point to nowhere, or to the last successful backup, and in case of failure it will be possible to easily distinguish a partial backup from a complete one.

Improve space optimizations

When I did a couple more tests, I noticed that if a backup was nearly complete and many files changed, it might have been better to let --link-dest point to the partial backup!

Also, parsing the output of ls, while it should not be problematic in this case, is not an ideal solution.

Looking at the man page, I noticed that rsync supports multiple --link-dest directories. Thus, for this purpose, I can drop the distinction between incomplete and complete backups, and I changed how they are handled:

#!/usr/bin/env bash

set -o errexit
set -o nounset

CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/"

# -------------- configs

backupdir="/folder/where/to/store/backups"; # on remote
remote="remote device where to store backups"
readarray -t dir_to_backup < "$CONFIG_DIR/includes.cfg"

ssh "$remote" mkdir --parents "$backupdir";
mapfile -d '' optpreviousdirs < <(ssh "$remote" find "$backupdir" -maxdepth 1 -mindepth 1 -type d -print0);


currentdir=$(date '+%Y-%m-%d_%H-%M-%S');
currentdir="$backupdir/$currentdir";
tmpdir="$currentdir.partial";

rsync \
  --progress --info=progress2 --human-readable \
  --times \
  --links \
  --new-compress --recursive \
  "${optpreviousdirs[@]/#/--link-dest=}" \
  --exclude-from="$CONFIG_DIR/excludes.cfg" \
  "${dir_to_backup[@]}" \
  "$remote:$tmpdir" ;

ssh "$remote" mv --no-target-directory "$tmpdir" "$currentdir";

With find "$backupdir" -maxdepth 1 -mindepth 1 -type d -print0 I’ve created a \0-separated list (as \0 is not a valid character in file names) of directories. Thanks to mapfile -d '', this can easily be saved into a bash array. Later, when invoking rsync, all paths are prefixed with --link-dest=.

Handling errors

As long as I used a similar script only on my GNU/Linux machine, everything seemed to work perfectly. Thanks to Termux, I decided to test it on my Android device for backing up all my pictures.

As I had some connectivity issues (apparently because of the battery manager), rsync lost the connection while transferring the files. I had to execute it multiple times, which generated a lot of temporary backups.

It occurred to me that in some situations it would be nice to give the user the option to retry rsync without generating a new backup.

This could be accomplished by giving a backup directory as a parameter instead of generating a new one every time, but the quick and dirty solution is adding an interactive query.

As I used set -o errexit, an exit code different from 0 would stop the execution immediately, so I had to disable this flag temporarily and ask the user what to do:

#!/usr/bin/env bash

set -o errexit
set -o nounset

CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/"

backupdir="/mnt/pi/rbackup/backupfede/rsync2"
remote=backupfede
readarray -t dir_to_backup < "$CONFIG_DIR/includes.cfg"

ssh "$remote" mkdir --parents "$backupdir";
mapfile -d '' optpreviousdirs < <(ssh "$remote" find "$backupdir" -maxdepth 1 -mindepth 1 -type d -print0);

currentdir=$(date '+%Y-%m-%d_%H-%M-%S');
currentdir="$backupdir/$currentdir";
tmpdir="$currentdir.partial";

rsync_res=1
while [ $rsync_res -ne 0 ]; do
  set +o errexit
  rsync \
    --progress --info=progress2 --human-readable \
    --times \
    --links \
    --new-compress --recursive \
    "${optpreviousdirs[@]/#/--link-dest=}" \
    --exclude-from="$CONFIG_DIR/excludes.cfg" \
    "${dir_to_backup[@]}" \
    "$remote:$tmpdir" ;
  rsync_res=$?;
  set -o errexit
  if [ $rsync_res -ne 0 ]; then :;
    read -rp "rsync failed with error code $rsync_res, retry? [y/N] " retry;
    if [ "$retry" != "y" ]; then :;
      exit $rsync_res;
    fi
  fi
done

ssh "$remote" mv --no-target-directory "$tmpdir" "$currentdir";

For a headless automated backup solution, it might make more sense to retry rsync a couple of times and then exit. For an automated solution on a PC, it could also be possible to interact with the user, for example with a notification and a timeout.
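A rough sketch of such a non-interactive variant, replacing the while loop above ("${rsync_opts[@]}" stands for the long option list and is not a variable in the actual script):

max_attempts=3;
rsync_res=1;
for attempt in $(seq 1 "$max_attempts"); do :;
  set +o errexit;
  rsync "${rsync_opts[@]}" "${dir_to_backup[@]}" "$remote:$tmpdir";
  rsync_res=$?;
  set -o errexit;
  [ $rsync_res -eq 0 ] && break;
  printf 'rsync failed with error code %s (attempt %s of %s)\n' "$rsync_res" "$attempt" "$max_attempts" >&2;
done
if [ $rsync_res -ne 0 ]; then :;
  exit $rsync_res;
fi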

handling name clashes when backing up multiple folders

As long as I used my script only on my desktop systems, all the names of the directories I wanted to back up were pretty much unique: ~/Documents, ~/Images, ~/Workspace, and ~/Music.

On my Android phone it is very easy to get name clashes if it has an external SD card: there is a DCIM folder in the internal storage and a DCIM folder on the SD card.

In order to avoid collisions, I decided to add the --relative option to rsync. This preserves the relative location of all folders backed up. At first I thought it would not be that nice, because if I am backing up a folder nested deep in other subfolders (until now it has not been the case) its data is less visible in the backup, but it is at least a saner default choice.
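A small sketch of what --relative changes (the Android paths are only illustrative):

# without --relative both sources would end up in a single, clashing DCIM folder
rsync --relative /storage/emulated/0/DCIM /storage/sdcard1/DCIM "$remote:$tmpdir"
# on the destination this creates
#   $tmpdir/storage/emulated/0/DCIM/...
#   $tmpdir/storage/sdcard1/DCIM/...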

config file

I’ve thought a little about what to use as a config file. It should be possible to comment out lines, and it should be easy to parse from bash.

In the end, I’ve decided to use a bash script and source it. It is a terrible idea, as it means the config file could contain code that gets executed… so don’t do it!
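What such a settings.cfg could look like, using the variable names the script expects (the values are only examples):

# sourced by the backup script, so keep it to plain variable assignments
backupdir="/mnt/backups/laptop"   # where the snapshots are stored
remote="backupserver"             # ssh host; leave it out for a purely local backup
#maxnumbackups=10                 # optional, see "deleting old backups" below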

#!/usr/bin/env bash

set -o errexit
set -o nounset

CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/backup/"

readarray -t dir_to_backup < "$CONFIG_DIR/includes.cfg"

if [ ${#dir_to_backup[@]} -eq 0 ]; then :;
  printf "%s does not contain any directory to back up\n" "$CONFIG_DIR/includes.cfg" >&2; exit 1;
fi

source "$CONFIG_DIR/settings.cfg"
if [ -z "${backupdir+x}" ]; then :;
  printf "backupdir not defined in %s\n" "$CONFIG_DIR/settings.cfg" >&2; exit 1;
fi
if [ -z "${remote+x}" ]; then :;
  exec_ssh=();
else :;
  exec_ssh=(ssh "$remote");
fi

"${exec_ssh[@]}" mkdir --parents "$backupdir";
mapfile -d '' optpreviousdirs < <("${exec_ssh[@]}" find "$backupdir" -maxdepth 1 -mindepth 1 -type d -print0 );

currentdir=$(date '+%Y-%m-%d_%H-%M-%S');
currentdir="$backupdir/$currentdir";
tmpdir="$currentdir.partial";

rsync_res=1;
while [ $rsync_res -ne 0 ]; do
  set +o errexit;
  rsync \
    --progress --info=progress2 --human-readable \
    --times \
    --links \
    ${remote:+"--new-compress"} --recursive \
    --relative \
    "${optpreviousdirs[@]/#/--link-dest=}" \
    --exclude-from="$CONFIG_DIR/excludes.cfg" \
    "${dir_to_backup[@]}" \
    ${remote:+"$remote:"}"$tmpdir" ;
  rsync_res=$?;
  set -o errexit;
  if [ $rsync_res -ne 0 ]; then :;
    read -rp "rsync failed with error code $rsync_res, retry? [y/N] " retry; # bashism
    if [ "$retry" != "y" ]; then :;
      exit $rsync_res;
    fi
  fi
done

"${exec_ssh[@]}" mv --no-target-directory "$tmpdir" "$currentdir";

deleting old backups

This is easy: just delete the folder. Ideally, this can be automated: either the user can configure a maximum number of backups, or, if there is not enough space, the oldest backup gets deleted.

The easy way is letting the user define the maximum number of backups; in our script we would add this at the end, after a successful backup:

if [ -n "${maxnumbackups+x}" ]; then :;
  # find does not sort its output, so sort the date-named directories first
  # (newest first) before deleting everything beyond the most recent $maxnumbackups
  mapfile -t optpreviousdirs < <(printf '%s\n' "${optpreviousdirs[@]}" | sort --reverse);
  optpreviousdirs=("${optpreviousdirs[@]:$maxnumbackups}");
  if [ ${#optpreviousdirs[@]} -gt 0 ]; then :;
    printf '%s\0' "${optpreviousdirs[@]}" | xargs -0 --no-run-if-empty -- ssh "$remote" rm -rf;
  fi
fi

But what if the backup was not successful? As I currently do not delete backups automatically, I do not have a good solution and have not put too much thought into it.

Probably a better solution would be querying how much space is left and letting the user know if there are problems.
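A rough sketch of such a check, reusing the exec_ssh array from the script (the 90% threshold is arbitrary and GNU df is assumed):

used=$("${exec_ssh[@]}" df --output=pcent "$backupdir" | tail -1 | tr -dc '0-9');
if [ "$used" -ge 90 ]; then :;
  printf 'warning: the backup filesystem is %s%% full\n' "$used" >&2;
fi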

issues and improvements

Windows support

As Cygwin permissions and Windows permissions do not match exactly, I had some issues. I’ve currently added --executability --chmod=ugo=rwX to the rsync parameters, which at the moment is good enough, but not if you want to preserve such metadata.

Another option might have been setting noacl in /etc/fstab.
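For example, the default cygdrive line in Cygwin’s /etc/fstab could be extended with noacl (the entry below assumes the usual default line; adjust it to your own fstab):

# disable the POSIX permission emulation for all /cygdrive mounts
none /cygdrive cygdrive binary,noacl,posix=0,user 0 0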

It might be interesting to see how other rsync ports for Windows behave…​

time to calculate files to send

Tested on a phone with nearly 200 GB of pictures, most of the time is spent calculating the file list, and then, because of a network error, it needs to start again from scratch. This seems to imply that the current solution does not scale well, even if on a PC I can handle backups of 400 GB without any issues.

Worse, if for whatever reason the job fails, even repeating it currently means recalculating all the files to send.

A possible approach could be trying to split and parallelize the rsync jobs, even if in the case of the Android phone, as most images are in the same folder, it might not be that simple.

complete backup

rsync uses the file size and the modification time to decide whether it needs to check if a file has changed. It is possible to use --checksum to compare the contents instead of relying on this heuristic, but that means reading the content of every file, both on the client and on the server. Especially on Windows, and depending on how much data you are backing up, this operation will take much more time. It might make more sense not to use --checksum most of the time, and then once in a while, for example every month, use --checksum to ensure that every file really has been backed up. It might also make sense to distinguish the backups where --checksum has been used from the others, by using a slightly different name for their directories.
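One way to wire this in could be a small optional block right after currentdir is set (and before tmpdir is derived from it); a rough sketch, where the BACKUP_CHECKSUM variable and the rsync_extra_opts array are mine and not part of the script above:

rsync_extra_opts=();
if [ -n "${BACKUP_CHECKSUM+x}" ]; then :;
  rsync_extra_opts+=(--checksum);     # compare file contents instead of size and mtime
  currentdir="$currentdir-checked";   # so verified snapshots stand out in the listing
fi
# ...and pass "${rsync_extra_opts[@]}" as an additional rsync argument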

no redundancy

As unchanged files are all hard-linked together, changing one of them by accident means changing it in every snapshot. A possible solution is to make all files read-only (rsync supports --chmod), but that makes restoring the files impractical. It might be more practical to mount the file system as read-only for everyone, except for a special user whose sole role is to execute the backup job… This solution might not be viable when using Windows and an external drive.

tagging folders

Currently, the script does not care that much about how the backup folders are named. The user could rename them to add meaningful information, like backup-before-reset or backup-after-importing-library. But doing so might interfere with the logic for deleting older backups, so folder names must still begin with the time the backup was created.

provide a GUI

As the whole backup script is mostly one rsync call and a couple of ssh commands, it should be possible to create a simple GUI. It might also be nice to see some statistics on the backups, like how much space they take and so on, but there already exist a lot of useful programs for that, like WinDirStat for Windows and ncdu for the console. Unfortunately, most Windows programs do not acknowledge or handle hardlinks; for example, the total size reported by WinDirStat will be wrong.

Granted, most graphical programs won’t work if the data is backed up via ssh.

File permissions are lost

Whether permissions are preserved depends on the file system. Currently, this is a non-issue for me, as the main use cases are user files, like images and documents… as long as the data is readable, it’s all good. Also, I will often do a backup from my phone to the server… and "restore" it on my PC, so in this case preserving permissions would even be troublesome.

compression

Files are not compressed, but this can be handled by the filesystem (Btrfs, ZFS, and NTFS support it; unfortunately ext and FAT partitions do not).

validating data

There is no support for validating the data before backing it up. This could be done both before and after the synchronization. For example, .mp3 files can be verified with mp3val and .jpg files with jpeginfo; for many formats a corresponding program can be found.
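As an illustration, a pre-backup check over a photo directory might look like this (jpeginfo must be installed; the path is only an example, and files not reported as [OK] deserve a closer look):

# -c verifies the compressed image data, not only the header
find ~/Pictures -iname '*.jpg' -print0 | xargs -0 --no-run-if-empty jpeginfo -c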

But those do not help restore the bits of a backup that get corrupted over time.

Probably a good filesystem is sufficient to do the hard work, like ext4 with journaling.

Automation

Automation can be done with cron on GNU/Linux and the Task Scheduler on Windows; I still need to check what is available for non-rooted Android phones. It should be noted that in case of failure it is important to notify the user. The first two approaches (and one does not exclude the other) are listed below, followed by a small cron example:

  • send an email, which does require an internet connection, but can notify a user who is not at the device in real time

  • use a system notification
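As a sketch, a crontab entry on GNU/Linux could look like this (the schedule, the script path, and the address are illustrative; cron mails any output of the job to MAILTO, which covers the email option above):

MAILTO=user@example.com
# run the backup every day at 03:30
30 3 * * * /home/user/bin/backup.sh

For unattended runs, the interactive retry question and the progress output of the script should of course be dropped or redirected.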

Conclusion

Contrary to other backup solutions, my script (which, by the way, is still a work in progress) has neither a test suite nor has it stood the test of time, but as backups are simply a copy of the files, I’m sure I’ll be able to access the backed-up data whenever I want.

Also, the script relies mostly on rsync and, optionally, a few commands through ssh: both are battle-proven, extremely backward-compatible programs available on nearly all systems. Presumably, both will be there for a long time, so I am going to assume that the current script can be used for years without modification, with future versions of rsync, ssh, and the operating system. This, unfortunately, does not hold for Android, as newer versions discourage the usage of certain functions on which the whole software landscape of GNU/Linux systems relies. Newer Android versions are also restricting access to files and folders; Termux, for example, is already no longer available on the Google Play Store. Unfortunately, it seems that reusable solutions are getting much harder to find for "smart" devices.

Nevertheless, do not use my script as your only backup solution; I am not doing that either. This is only one additional way I ensure the availability of the data I’m interested in.


Do you want to share your opinion? Or is there an error, or are some parts not clear enough?

You can contact me here.