I have been knee-deep in backups for the past few weeks, but I think I can finally see the light at the end of the tunnel. What looked like a simple idea turned out to be much more complicated to implement. I don’t know why, but there seems to be practically no information out there covering this topic. Maybe it’s just because backups suck? Either way, they are extremely important to the vitality of a company: without a workable copy of your data, you are screwed if something happens to the original. So today I am going to write about managing cloud data and cloud backups and hopefully shine some light on this seemingly foreign topic.
Part of being a cloud-based company means dealing with cloud-based storage. Some of the terms involved are slightly different from the standard backup and storage terminology: buckets, object-based storage, S3, GCS and boto all come up when dealing with cloud storage and backups. It turns out that there are a handful of tools out there for dealing with these storage requirements, which I will be discussing today.
The Google and Amazon APIs are nice because they allow third-party tools to manage the storage, outside of the official and standard tools. In my journey to find a solution I ran across several workable tools that I would like to mention. The end goal of this project was to sync a massive number of files from S3 storage to GCS. Each of the following tools met at least some of my requirements, and each has its own set of uses. They are included here in no real order:
- duplicity/duply – This tool works with S3 for small scale storage.
- Rclone – This one looks very promising, supports S3 to GCS sync.
- aws-cli – The official command line tool supported by AWS.
S3cmd – This was the first tool that came close to doing what I wanted. It has a number of handy options and is capable of syncing S3 buckets, but unfortunately the way it is designed does not cope well with reading and writing a very large number of files. It is a great tool for smaller sets of data.
s3s3mirror – This is an extremely fast copy tool written in Java and hosted on GitHub. This thing is awesome at copying data quickly: it was able to copy about 6 million files in a little over 5 hours the other day. One extremely nice feature is its built-in intelligent sync, so it knows which files have already been copied over. Even better, the tool is faster still when it only has to do reads, so once your initial sync has completed, additional syncs are blazing fast.
This is a jar file so you will need to have Java installed on your system to run it.
sudo apt-get install default-jre-headless
Then you will need to grab the code from Github.
git clone git@github.com:cobbzilla/s3s3mirror.git
And to run it.
cd s3s3mirror
./s3s3mirror.sh first-bucket/ second-bucket/
That’s pretty much it. There are some handy flags, but this is the main command. There is an -r flag for changing the retry count, a -v flag for verbosity and troubleshooting, and a --dry-run flag to see what will happen.
The only downside of this tool is that it only seems to support S3 at this point. The source is posted on GitHub, though, so it could easily be adapted to work with GCS, which is something I am actually looking at doing.
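As a rough sketch of how those flags might be combined, the snippet below previews the mirror first and then runs it for real. The `run` helper and the `DRY_RUN` variable are my own additions, not part of s3s3mirror; they only print each command unless you set `DRY_RUN=0`. The bucket names are placeholders, the retry count of 10 is arbitrary, and I am assuming -r takes the count as its next argument.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
# (run and DRY_RUN are hypothetical helpers, not part of s3s3mirror.)
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholder bucket names; replace with your own.
SRC="first-bucket/"
DST="second-bucket/"

# 1. Preview what would be copied, with verbose output.
run ./s3s3mirror.sh --dry-run -v "$SRC" "$DST"

# 2. The real copy, with a higher retry count (10 is an arbitrary example).
run ./s3s3mirror.sh -v -r 10 "$SRC" "$DST"
```

Running the preview first is cheap insurance when the destination bucket already holds data you care about.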
Gsutil – The Python command line tool created and developed by Google. This is the most powerful tool that I have found so far. It has a ton of command line options, can communicate with other cloud providers, is open source and is under active development and maintenance. Gsutil is scriptable and has code for dealing with failures: it can retry failed copies and resume interrupted transfers, and it can check which files and directories already exist, for scenarios where synchronizing buckets is important.
The first step to using gsutil after installation is to run through the configuration with the gsutil config command. Follow the instructions to link gsutil with your account. After the initial configuration has been run you can modify or update all the gsutil goodies by editing the config file, which lives in ~/.boto by default. Two config settings worth mentioning are parallel_process_count and parallel_thread_count. These control how many processes and threads gsutil uses at once, so on really beefy boxes you can crank these numbers up quite a bit higher than their defaults. To utilize the parallel processing you simply need to set the -m flag on your gsutil command.
gsutil -m cp -R gs://source-bucket gs://destination-bucket
gsutil -m rsync -r gs://source-bucket gs://destination-bucket
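For reference, the two parallelism settings mentioned above sit in the [GSUtil] section of the boto config file. The sketch below writes a sample section to a scratch file rather than touching your real ~/.boto. The file path and the values 8 and 10 are purely illustrative assumptions, not recommendations.

```shell
# Scratch file so the real ~/.boto is left untouched (hypothetical path).
SAMPLE_BOTO="${TMPDIR:-/tmp}/boto-parallel-sample.cfg"

cat > "$SAMPLE_BOTO" <<'EOF'
[GSUtil]
# Illustrative values only; tune them to the CPU and network of your box.
parallel_process_count = 8
parallel_thread_count = 10
EOF

# Show the result so it can be copied into ~/.boto by hand.
cat "$SAMPLE_BOTO"
```

These settings only come into play when gsutil is invoked with -m.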
One very nice feature of gsutil is that it has built-in functionality to interact with AWS and S3 storage. To enable this functionality you need to copy your AWS access key ID and secret access key into your ~/.boto config file, as aws_access_key_id and aws_secret_access_key. After that, you can test out the updated config by listing your buckets that live on S3.
gsutil ls s3://
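Here is a minimal sketch of what the AWS part of the config might look like. It writes a sample [Credentials] section to a scratch file. The path and both key values are placeholders I made up, so the listing will not authenticate until real keys are filled in.

```shell
# Scratch file so the real ~/.boto is left untouched (hypothetical path).
SAMPLE_BOTO="${TMPDIR:-/tmp}/boto-s3-sample.cfg"

cat > "$SAMPLE_BOTO" <<'EOF'
[Credentials]
# Placeholder values; substitute your own AWS keys.
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY
EOF

# The file holds secrets, so restrict it to the current user.
chmod 600 "$SAMPLE_BOTO"

# To try the file without editing ~/.boto, point gsutil at it:
#   BOTO_CONFIG="$SAMPLE_BOTO" gsutil ls s3://
```

The BOTO_CONFIG environment variable tells gsutil to read a config file other than the default ~/.boto, which is handy for testing a change before committing to it.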
So your final command to sync an S3 bucket to Google Cloud would look similar to the following:
gsutil -m cp -R s3://bucket-name gs://bucket-name
Notice the -R flag, which makes the copy recursive so that everything in one bucket is copied to the other, instead of a single-layer copy.
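For repeat runs, gsutil also has an rsync command that only transfers what is missing or changed, and its -n flag shows what would happen without copying anything. The sketch below just prints the two commands for review rather than running them. The `rsync_command` function is a hypothetical helper of mine, and the bucket names are the placeholders from the command above.

```shell
SRC="s3://bucket-name"   # placeholder source bucket
DST="gs://bucket-name"   # placeholder destination bucket

# Print the rsync command for a given mode: "preview" adds -n (dry run).
# (rsync_command is a hypothetical helper, not part of gsutil.)
rsync_command() {
  if [ "$1" = "preview" ]; then
    echo "gsutil -m rsync -r -n $SRC $DST"
  else
    echo "gsutil -m rsync -r $SRC $DST"
  fi
}

# Review both lines, run the preview first, then the real sync.
rsync_command preview
rsync_command copy
```

I have left rsync's -d flag out on purpose, since it deletes objects in the destination that are not in the source, which is rarely what you want from a backup.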
There is one final tool that I’d like to cover. It isn’t a command line tool, but it turns out to be incredibly useful for copying large sets of data from S3 into GCS: the GCS Online Import tool. Follow the link and fill out the interest form, and after a little while you should hear from somebody at Google about setting up and using your new account. It is free to use and the support is very good. Once you have been approved, you will need to provide a little bit of information for setting up sync jobs (your AWS ID and key) and allow your Google account to sync the data. It is all very straightforward. This tool saved me from having to sync my S3 storage to GCS manually, which would have taken at least 7 days (and that was even with a monster EC2 instance).
Ultimately, the tools you choose will depend on your specific requirements. I ended up using a combination of s3s3mirror, AWS bucket versioning, the Google cloud import tool and gsutil. My requirements are probably different from the next person’s, and each backup scenario is unique, so combining these tools gives you the flexibility to handle pretty much any scenario. Let me know if you have any questions or know of some other tools that I have failed to mention here. Cloud backups are an interesting and unique challenge that I am still mastering, so I would love to hear any tips and tricks you may have.