
My backup methods

There is an old folk saying:

There are two kinds of people: those who backup, and those who will.

If you are not the kind of person who does backups, there is good news: it is not a hard task to set up automatic backups on your application servers. Let’s consider a few options.

Background

Let’s say we have a Linux server with a MySQL database, an nginx HTTP server and a running web application. There is no CDN; everything is kept on that machine. What has to be backed up?

  • MySQL data,
  • Files uploaded by users (let’s call them media),
  • Maybe application logs?
  • Maybe nginx logs?

For sure we should not back up:

  • The source code – we have a repository for it,
  • MySQL/nginx configs – same as above in combination with deployment scripts (ansible, chef, puppet, fabric),
  • Libraries and packages installed during application deployment – same as above.

The rule of thumb is that the backup data combined with the source code from our repository should give us a running and fully functional system. It’s obvious, but worth mentioning.

Backup methods

I distinguish two categories of backup methods:

  1. Without cloud (manual download, cron + rsync),
  2. Cloud-based (with repository, Dropbox, Amazon Glacier).

Manual download

This one is as simple as it gets. You log in to the target server and create a zip archive with the data. Then you download it over FTP, rsync, scp or any other method. It is also worth mentioning that some hosting providers offer tools like cPanel that can do this for you.
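The manual routine boils down to two commands. A minimal sketch, assuming the data lives under `/var/lib/mysql-dumps` and `/srv/app/media` (both paths, the host name and the user are examples, not real values):

```shell
# On the server: bundle the data worth keeping into a dated archive.
tar czf backup-$(date +%F).tar.gz /var/lib/mysql-dumps /srv/app/media

# From your workstation: pull the archive down over SSH.
scp user@example.com:backup-$(date +%F).tar.gz ~/backups/
```
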

Advantages:

  • simple.

Disadvantages:

  • time-consuming,
  • manual,
  • we have to remember about it,
  • … and more.

Crontab, rsync and remote server

This approach is quite popular nowadays. In fact, it is an “upgraded” version of the previous method: the entire process becomes automatic. We accomplish this with a bash script that creates the desired zip archive with the data, scheduled in cron (https://en.wikipedia.org/wiki/Cron). The last step, the data upload, can be done in two ways:

  • the source server uploads the data to the backup server using rsync (https://en.wikipedia.org/wiki/Rsync), for example at the end of the bash script, or
  • the backup server downloads the data by itself. In this approach we have to make sure that the backup server will have something to download: for example, the source server creates the backup archive at 2:00 am and the backup server downloads it at 4:00 am.

Advantages:

  • automatic,
  • write once and forget forever.

Disadvantages:

  • we need to maintain another server that keeps backups,
  • what if the backup server goes down?

Use your repository

Why not use Git as a backup tool? It sounds easy, and it is easy: simply replace rsync in your backup routine with a git commit and a git push.

Advantages:

  • automatic,
  • write once and forget forever,
  • no need to maintain any additional server that keeps data,
  • cloud-based.

Disadvantages:

  • cloud-based Git hosting usually has a limit of 1 or 2 GB per repository.

One might say that Git is designed for storing and managing text files rather than large binaries, but reinforcements are on their way: git-lfs, git-annex, bup, or the commercially available Perforce.

Dropbox

Dropbox provides a headless version of its daemon (https://www.dropbox.com/install-linux). You can install it on your Linux machine, start it in the background and copy the backup archive into the Dropbox sync folder.
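Once the daemon is running, anything placed under `~/Dropbox` gets synced automatically, so the upload step of the cron script becomes a plain copy. The archive path below is an assumption:

```shell
# The Dropbox daemon watches ~/Dropbox; copying a file there uploads it.
mkdir -p ~/Dropbox/backups
cp /var/backups/app/backup-$(date +%F).tar.gz ~/Dropbox/backups/
```
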

Advantages:

  • automatic,
  • write once and forget forever,
  • no need to maintain any additional server that keeps data,
  • cloud-based.

Disadvantages:

  • the Dropbox free tier is limited to 2 GB of data, and Dropbox Plus costs around 100 € per year; there is nothing in between.

An interesting multi-platform option to consider might be Duplicati.

Amazon Glacier

Amazon’s cloud storage comes in two flavours: S3 (https://aws.amazon.com/s3/) and Glacier (https://aws.amazon.com/glacier/). Glacier is less expensive, but it comes with a handicap: data access is slower and costs more. That sounds like the best fit for backups, since we upload data often and (hopefully) rarely request it.

To upload data to the cloud we can use the AWS CLI (https://aws.amazon.com/cli/).
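The upload itself is a single command. This sketch assumes a vault named `app-backups` was already created in the AWS Management Console and that credentials were set up with `aws configure`; it cannot run without a real AWS account:

```shell
# Upload the latest archive to an existing Glacier vault.
# "--account-id -" means "the account of the configured credentials".
aws glacier upload-archive \
    --vault-name app-backups \
    --account-id - \
    --body /var/backups/app/backup-$(date +%F).tar.gz
```
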

This routine is the most complicated one in this article: many things have to be configured via the AWS Management Console. Because of that complexity, we will come back to it in a separate article.

Advantages:

  • automatic,
  • write once and forget forever,
  • no need to maintain any additional server that keeps data,
  • cloud-based,
  • there is no maximum limit for data volume,
  • cheap.

Disadvantages:

  • Glacier vault inventories are updated only once per day,
  • standard data retrieval takes between 3 and 5 hours; faster access costs much more,
  • there is no web GUI for it, which means that for every action you have to prepare scripts or use existing tools like the AWS CLI, FastGlacier, AGSU, Arq or, as mentioned already, git-annex,
  • it requires multi-step configuration.

Testing

Remember to test whether your backups actually work and whether you are able to restore the system from them. Remember the GitLab incident (https://www.theregister.co.uk/2017/02/01/gitlab_data_loss/)? They had about five different backup mechanisms and none were working reliably.
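A restore drill can be as simple as unpacking the newest archive on a scratch machine and loading the dump into a throwaway database. Paths, the archive name and the `restore_test` database are assumptions; if any of these steps fails, the backup was never really a backup:

```shell
# Unpack the archive into a scratch directory...
mkdir -p /tmp/restore
tar xzf /var/backups/app/backup-$(date +%F).tar.gz -C /tmp/restore

# ...and load the dump into a throwaway database to prove it is valid.
mysql restore_test < /tmp/restore/db-$(date +%F).sql
```
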

Good practices

  • use more than one backup technique,
  • make sure that the backup works,
  • make sure that you are able to restore the system from the backup archive,
  • do not back up files that are not needed (like the source code),
  • monitor your backup servers in the same fashion as you monitor your application servers,
  • if you have limited storage, remember to rotate backup archives,
  • make it automatic,
  • add it to your deployment scripts (fabric, chef, ansible, puppet),
  • use rsync instead of scp; rsync simply works better, see: http://stackoverflow.com/a/20257021.
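Rotation, in particular, is one line when the archives carry a date in their name. The directory and the 30-day retention window below are assumptions:

```shell
# Keep only the last 30 daily archives; delete anything older.
find /var/backups/app -name 'backup-*.tar.gz' -mtime +30 -delete
```
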

Deploying a backup mechanism can take a while, but it is not optional; nowadays it is a must-have. Ask yourself one question: what would happen if…?

Python and Django

I use Python and Django on a daily basis. Luckily, the Python community has developed many modules that can help with backup management.

Probably every technology has something similar, so it is highly likely that you will not have to write anything from scratch.

Related and further materials

http://blog.tkassembled.com/326/creating-long-term-backups-with-amazon-glacier-on-linux/


Marcin Skiba

Marcin is a full stack software developer based in Łódź, Poland. He loves to learn new technologies, improve coding skills and share his knowledge with others.
