Context
I had been trying to figure out a more efficient way of backing up directories on my computers. For example, I have a huge directory where my Obsidian vault lives. This vault is filled with a bunch of PDF and Markdown files that I use for my everyday notes. I do like keeping this folder backed up in case my laptop or PC dies.
The way I would back up this folder was by simply turning it into a GitHub repo and uploading everything privately to my GitHub account. This was not what GitHub was made for, and I was nearing 2 GB for the entire repo. It was time to search for a different solution.
Enter the infamous `xz` vulnerability. This put compression libraries on my radar, so I decided to take the entire directory and compress it down using both `xz` and `gzip`. I noticed that the compression from `xz` was far better than from `gzip`. For my purposes I wanted to optimize for size at the cost of compression speed. Additionally, the nice people at r/DataHoarder had mentioned that a really cool program called `par2` was a thing: it lets me recover any amount I want of a damaged file, using bit parity magic.
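If you want to reproduce the comparison, it boils down to something like this; the directory path and the exact compression levels are placeholders, not the precise commands I ran:

```bash
# Tar the directory once, then compress copies with gzip and xz at their
# highest (slowest) settings and compare the resulting sizes.
tar -cf vault.tar ~/Notes            # "~/Notes" stands in for the real vault

gzip -9 --keep vault.tar             # -> vault.tar.gz
xz   -9e --keep vault.tar            # -> vault.tar.xz (slower, usually smaller)

ls -lh vault.tar vault.tar.gz vault.tar.xz
```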
For my purposes, I want to store big directories as small as possible, store their MD5 sum so I can verify the files are correct, and store a parity file so I can recover any part of them I want (up to 30% in my case). Stored this way, they are very easy and efficient to share on a local NAS or even with Syncthing.
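In shell terms, the per-archive bookkeeping is something like the following sketch (file names are placeholders):

```bash
# Checksum the compressed archive and create parity data with 30% redundancy.
md5sum vault.tar.xz > vault.tar.xz.md5
par2 create -r30 vault.tar.xz.par2 vault.tar.xz

# Later, after pulling the archive off the NAS or out of Syncthing:
md5sum -c vault.tar.xz.md5           # confirm the file is intact
par2 repair vault.tar.xz.par2        # rebuild it from the parity blocks if not
```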
Some downsides to this system are:
- `xz` is not as common as I thought, and it has been somewhat complicated to move files to systems that don't have the utility.
- Incremental backups are not possible.
PS: I recognize that `xz` might have vulnerabilities that might not be known yet.
Archiver
Archiver is a bash script that helps you back up whatever you want, however you want.
Usage
For example:
archives.json
[
{
"name": "Wallpapers",
"target": "~/Media/Pictures/Wallpapers",
"archive": {
"name": "walls",
"destination": "~/OneDrive/Wallpapers"
},
"timestamps": {
"last_archive": 1712438594,
"last_upload": 1712438693
},
"sync_command": "onedrive --synchronize --single-directory 'Wallpapers'",
"md5sum": "6de26f11ad638fd145f3d1412e0bf1c6"
},
{
"name": "Books",
"target": "~/Documents/Books",
"archive": {
"name": "books",
"destination": "~/OneDrive/Books"
},
"timestamps": {
"last_archive": 1712439022,
"last_upload": 1712439638
},
"sync_command": "onedrive --synchronize --single-directory 'Books'",
"md5sum": "b4ae6185bb5a20d19c0b30f9778a10cb"
}
]
You specify a list of attribute sets with three key parts:
- `target`: the target directory you wish to back up
- `archive`: the name of the archive and where you want to store it
- `sync_command`: the command you wish to use to back up this specific directory
Other details like `timestamps` and `name` are useful for other purposes if you wish to climb under the hood and use them. The MD5 sum is also useful if you wish to verify the integrity of your files after retrieving them.
PS: an example `archives.json` is provided.
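If you want to poke at those fields outside the script, a tool like jq works well. The queries below are only an illustration and assume jq is installed:

```bash
# List each entry's target and where its archive ends up.
jq -r '.[] | "\(.name): \(.target) -> \(.archive.destination)"' archives.json

# Grab a single entry's sync_command by name, e.g. "Books".
jq -r '.[] | select(.name == "Books") | .sync_command' archives.json
```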
Methodology
- Your `target` gets converted into a tarball.
- That tarball is compressed into an `xz` archive.
  - This format was chosen because of its excellent compression ratio.
  - Though in the future I would like to implement multiple formats for this.
- A parity archive is created from that compressed tarball.
  - Uses the `par2cmdline` utilities.
  - A single block file with 30% redundancy is created.
  - Additionally, you can use the index file that's created, but `par2` doesn't really need it.
- A Unix timestamp and an MD5 sum are taken from the archived tarball.
- Your `sync_command` hook is run at the end, and a secondary timestamp is taken once it finishes.
- Your `archives.json` file is updated with all of the fresh timestamps and the MD5 sum.
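For a single entry, those steps boil down to roughly the sketch below. The variable values, the `eval` of the sync hook, and the jq-based update of `archives.json` are my own illustration of the flow, not code lifted from the script:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values for one entry in archives.json.
entry_name="Books"
target="$HOME/Documents/Books"
name="books"
dest="$HOME/OneDrive/Books"
sync_command="onedrive --synchronize --single-directory 'Books'"

# 1-2. Tarball the target and compress it with xz.
tar -cf "$dest/$name.tar" -C "$(dirname "$target")" "$(basename "$target")"
xz -9e "$dest/$name.tar"                      # leaves $dest/$name.tar.xz

# 3. Parity archive: one recovery file with 30% redundancy, plus an index file.
( cd "$dest" && par2 create -r30 -n1 "$name.tar.xz.par2" "$name.tar.xz" )

# 4. Timestamp and MD5 sum of the compressed tarball.
last_archive="$(date +%s)"
md5="$(md5sum "$dest/$name.tar.xz" | cut -d' ' -f1)"

# 5. Run the sync hook, then take the second timestamp.
eval "$sync_command"
last_upload="$(date +%s)"

# 6. Write the fresh timestamps and MD5 sum back into archives.json (uses jq).
tmp="$(mktemp)"
jq --arg entry "$entry_name" --arg md5 "$md5" \
   --argjson a "$last_archive" --argjson u "$last_upload" \
   '(.[] | select(.name == $entry)) |= (
      .md5sum = $md5
      | .timestamps.last_archive = $a
      | .timestamps.last_upload = $u)' archives.json > "$tmp" && mv "$tmp" archives.json
```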