E3 Backing up data
Overview and discussion of backing up data on a home user's budget to keep your stuff safe.
Episode Notes
This is a compilation of 3 pieces of audio, firstly an introduction, recorded today. The second piece is an Audioboo I recorded in 2016 about backing up data on personal computers. The third piece is recorded today giving an overview of how I back up data today. The third section was recorded on an iPhone so happy to get comments on audio quality, and anything else you'd like to say on the podcast.
Backblaze: http://www.backblaze.com
Restic: http://www.restic.org/
Scoop: https://www.scoop.sh/
Rclone: http://www.rclone.org
Transcript:
Hello everybody, and welcome to Kerry’s Chaos, episode 3. This episode is all about backing up data and backing up data effectively on a home user’s budget. The first section was recorded in 2016, and basically gives an overview of some of the options available, and the second section is how I back up in 2020. If you have any queries, comments, etc, you can reach me as kerry at gotss dot nett, or khoath on Twitter.
I’m wondering whether people have trouble with me recording things on the phone like I did for the last section, or whether the audio is of sufficiently good quality, and how people enjoyed the podcast. So, curious on feedback on that, because it is taken from several disparate sources.
Good evening everybody and welcome to this short chat on backing up big data on a budget.
How many times have you heard it? Back-ups are important. You need to back up your data so that you don't lose it in the event of a catastrophe. The question arises: how many back-ups should you have? Where should they be stored? How much do you want to spend protecting your data?
There are a number of options for backing up your stuff. Some people use external hard drives. Some people use CD and DVD media, Blu-ray media, tapes. Some people even print data out to be scanned back in. But let's consider the average hoarding person. You've managed to collect a couple of terabytes of data over the years, because let's face it, you've been on the internet, you've got a reasonably fast internet connection and your hoarding instincts as a person have resulted in you downloading a stack of stuff that is considered highly, highly important to you.
The first thing you need to decide is how much of your data actually needs to be backed up. Yes, it would be inconvenient if those three seasons of Game of Thrones were deleted off your hard drive but you could probably torrent them down again. Not so much the back-ups of your commitment ceremony or some song you slogged out four hours on, working with a group of friends that are never ever going to meet in person ever again.
So first of all, decide whether you're backing everything up or whether you're only backing the important stuff up. If you're backing up everything, then that's pretty straightforward and we can move on to the people who might be backing up some of their stuff.
How much do you want to spend on storage and how reliable do you want that storage to be? Nothing is ever a hundred percent reliable so you're always working against probabilities, equipment failure and acts of God to decide how stable and reliable your back-ups are going to be.
If you have under a terabyte of data, then potentially one of the online cloud services such as Google Drive or Dropbox or Box or SpiderOak might be worth looking at. However, it's worth considering: do you trust these services to look after your data and if you intend to encrypt the data before you back it up, how are you going to do that? With what program, what algorithm and what are you going to do in the event that you lose a passphrase?
There are various back-up services that claim to be able to back up unlimited amounts of data for, for example, $5.00 a month per computer and I'm referring here to services such as CrashPlan and Backblaze. A couple of things to keep in mind with these services. You probably want to use the local encryption settings so that your data is encrypted before it leaves your computer so that even if CrashPlan or Backblaze or one of the similar services is subpoenaed for your data, nobody will be able to decrypt it. This is probably important if you've nicked a whole pile of pirated stuff or you've got a whole lot of data on your hard drive that you really shouldn't have; child porn; terrorist bomb plans; the list goes on. (I hope nobody listening to these audio boos has those sorts of things on their hard drives.)
Now CrashPlan claims that they do in fact back up unlimited data, but there are a couple of things to consider when backing up unlimited data. How fast is your internet upstream? If your internet upstream is fast enough, you might be able to push one to ten gig of data a day. That still means that backing up two terabytes of data is going to take a significantly long time. You can drop $350.00 on a seed drive that they send to you; you fill it up with up to a terabyte of data and ship it back to them, and they preload that to your account. However, it is still going to take significant time and significant bandwidth, possibly impacting your internet use, so you need to decide whether it is actually worth backing up all of that data over the internet.
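To put rough numbers on that, here is a minimal Python sketch of the arithmetic. It assumes a steady daily upload rate and decimal units (2 terabytes taken as 2000 gig); the function name is just for illustration.

```python
def days_to_upload(total_gb: float, gb_per_day: float) -> float:
    """Days needed to push total_gb at a steady rate of gb_per_day."""
    if gb_per_day <= 0:
        raise ValueError("gb_per_day must be positive")
    return total_gb / gb_per_day


TOTAL_GB = 2000  # two terabytes, counted in decimal gigabytes (an assumption)

# The talk's range of "one to ten gig of data a day", plus a midpoint.
for rate in (1, 5, 10):
    days = days_to_upload(TOTAL_GB, rate)
    print(f"{rate:>2} GB/day: {days:.0f} days (about {days / 365:.1f} years)")
```

Even at the optimistic end of ten gig a day, the first full upload of two terabytes works out to 200 days, which is why the mailed drive option exists at all.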
If you decide to back it up though, CrashPlan, Backblaze, whatever, will store all of the stuff that you need to store. Do however be aware of the terms and conditions of the plan and read the fine print to make sure that unlimited truly does mean unlimited.
Another option for people may be external drives and a lot of people are seeing external hard drives on Amazon for $129.00 so that you can buy a 5 terabyte hard drive. These are probably quite handy and can store a lot of data, but a couple of things need to be kept in mind. As I always say and as many other storage experts say, it's not if a hard drive will fail, it's when a hard drive will fail. All hard drives fail eventually. Some of them fail within a day or a week or a month of being owned, and some of them are still ticking away for ten years. The probability of hard drive failure is something you can read research papers on, and there can be endless debates as to whether Western Digital or Seagate or insert your other favourite drive brand, is the best type of hard drive. But even that can fluctuate between manufacturing batches, temperature considerations and shock considerations.
So if you are going to go out and buy yourself a five terabyte hard drive, it might not be a bad idea to go out and buy yourself two five terabyte hard drives, so that you've got one that's connected to your computer as hot storage, and a back-up drive that acts as your back-up in case something goes wrong with the hot storage. You'll need a way to keep the two drives synchronised. If you're on a PC platform, I would strongly suggest something like Robocopy. TeraCopy is fine in the GUI and the Microsoft SyncToy will handle GUI lists of folders that need to be kept in sync. However, I'm not sure what the limitations on SyncToy are. Robocopy will quite happily copy terabytes from one drive to another and keep the archives in sync. You do need to be careful with Robocopy, however, because you need to specify the /xo switch so that it doesn't copy old information over new information. It's also worth noting that Robocopy is a command line utility and unless you get hold of a friendly geek to help you with the batch file, then you may have trouble automating this. Also, not all batch files are created equal. I've seen a lot of batch files for Robocopy missing the /xo switch. But a peruse of the Robocopy documentation does in fact point out that /xo is somewhat important. You also need to exclude the files that you don't want to back up with Robocopy such as BTSync folders, Dropbox control information, etc.
The other thing you probably want to keep in mind is if you're on a Mac or a Linux box, you may want to consider rsync for backing stuff up. Rsync is handy in the fact that it is quite flexible, can handle thousands and thousands of files and can be fired off from cron jobs relatively easily.
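For anyone wondering what "don't copy old information over new information" actually means in practice, here is a small Python sketch of the idea. This is not Robocopy and doesn't claim to match its behaviour in every detail; the function name `sync_newer` and the `exclude_dirs` parameter are made up for illustration, and it simply compares modification times.

```python
import shutil
from pathlib import Path


def sync_newer(src: Path, dst: Path, exclude_dirs=frozenset()) -> list:
    """One-way copy from src to dst that never overwrites a newer file.

    A file is skipped when the copy already in dst has a modification
    time at least as recent as the source file. Any file that lives
    under a directory named in exclude_dirs is ignored entirely.
    Returns the relative paths that were copied.
    """
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        # rel.parts[:-1] holds the directory names, not the file name.
        if any(part in exclude_dirs for part in rel.parts[:-1]):
            continue
        target = dst / rel
        if target.exists() and target.stat().st_mtime >= path.stat().st_mtime:
            continue  # destination is the same age or newer: leave it alone
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves the timestamp
        copied.append(rel)
    return copied
```

A call might look like `sync_newer(Path("D:/data"), Path("E:/backup"), exclude_dirs={".sync"})`, where both paths and the excluded folder name are placeholders. Leaving out the timestamp check is the scripted equivalent of forgetting the switch: the older copy wins.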
So you have two hard drives. One primary, one secondary. You went and ponied up and got two five terabyte hard drives. It's up to you whether you switch the hard drives around on a weekly or monthly basis so that the spare becomes the regular one and the regular one becomes the spare. But there's another thing that you need to keep in mind. Even storing data on hard drives has a probability of failing. Drives have an uncorrectable bit error rate that means that occasionally they're not going to be able to pull back a sector that was written to them. This doesn't happen very often but it does in fact happen. What is the guarantee that all of the data that you have written to your drives is actually uncorrupted?
I would strongly suggest finding a utility that will generate SHA1 sums or MD5 sums of trees of files. Make lists of the files on your hard drive with their MD5 or SHA1 sums and scatter the manifest and catalogue across a couple of cloud services so that if you do need to run a test on a hard drive to see if indeed it is failing, you can pull back the SHA1 sums and run them against the data to catch any differences. At least that way you will be able to tell which of your two drives is good and which has gone bad.
The other thing to keep in mind is that all of this takes some time and some ingenuity to actually set up. You'll have to find the right utilities; you'll have to find the right batch files and you'll have to be disciplined enough to actually carry out this back-up plan on a regular basis. Things like Carbonite, CrashPlan and Backblaze make it easy because services run in the background that back this data up to the cloud or your external hard drives or your friends' computers. Now it's worth mentioning that if you do use CrashPlan to back up data to your friends' computers that you'll be using their hard disk space and you'll have to negotiate with them but also keep in mind that the data is in fact encrypted and your friends don't get access to your CrashPlan data.
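If you can't find a utility you like, the manifest idea is simple enough to sketch in Python using only the standard library. The function names here are made up for illustration, and this is a bare-bones version: it keeps the manifest as a dictionary in memory and leaves saving it to a file (and scattering that file across cloud services) up to you.

```python
import hashlib
from pathlib import Path


def file_sha1(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA1 of one file, read in chunks so large files don't fill memory."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA1 sum."""
    return {
        p.relative_to(root).as_posix(): file_sha1(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict) -> list:
    """Return the relative paths that are missing or no longer match."""
    problems = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or file_sha1(path) != expected:
            problems.append(rel)
    return problems
```

Build the manifest once from the drive you trust, then run `verify` against each drive in turn. An empty list means every file still matches; anything listed is either missing or has changed since the manifest was made.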
If you are going to back up data to locally connected hard drives, you may wish to consider whether the data should be encrypted to protect it from prying eyes. But the SHA1 sums are certainly worth considering.
So some of you were going to ask me, “Well how do we even know if hard drives are going to fail? Is there any warning that a drive is on the way out?” Well it turns out that in approximately 70 to 80% of cases, there is actually warning that a drive is going to fail. The technology that tells you this is known as SMART: Self-Monitoring Analysis and Reporting Technology. There are utilities for Linux, MacOS and Windows called smartmontools that can run in the background on your system as a service and can provide you with information about impending drive failures. This is fairly good for connected drives that are directly installed in the computer. But some USB to SATA bridges don't pass through the SMART inquiry commands in a standard method. You may have to do some fiddling to actually get smartmontools to check these.
Other external drives such as the WD series of drives and some of the Seagate drives do come with software that is meant to monitor the health of the drive, and warn you potentially of an impending drive failure. It's possible that the warning will not come in time though, and a mechanical fault that stops the drive from powering up or stops the drive from spinning will give you no amount of SMART warning even if you do choose to use this technology. So SMART is one of those things that just makes things a little bit safer and a little bit more informative. It has allowed me however to replace failing drives in RAID arrays.
RAID. I suppose I should mention RAID. A Redundant Array of Inexpensive Disk drives. If you have multiple copies of data, then it's less likely that you're actually going to lose the data. This is a pretty simple idea. RAID 1 is an exact mirror of the data. Two drives contain exactly the same data. It's fast to write data to both drives identically. It actually doubles read speed if you're using both the drives but it does halve write speed. So RAID cards with caches and stuff can be useful so the operating system can dump a bit under a gig of data at the drives and the drives can get on writing it to the array. Be aware though there are pitfalls to RAID.
Most people consider that going out and building themselves a RAID 5 array in a home NAS is probably going to be a good way to back up the three or four terabytes of data they've got. There are a couple of things with RAID 5 arrays that I have learnt from painful experience. When a drive fails in a RAID 5 array, it is imperative that you replace the failed drive as soon as possible. That is, dash on down to the computer store the day the drive fails, or have another drive as a hot spare. There is however a problem with RAID 5, and that is that once one drive has failed in the array, it is possible, and in fact more probable than you'd think, that whilst rebuilding the array onto the replacement drive, a second one of the existing drives will fail. If two drives fail in a RAID 5 array, you're essentially left doing block level restores and dragging off the array as much data as still appears readable, with tools that may break your brain.
RAID 5 does use a fair amount of CPU. So for example some of the home NASes, which vary in accessibility, such as the QNAP, etc, use a 1.2 GHz ARM processor and a cut down version of Linux with maybe 256 or 512 meg of RAM; they will read and write to the RAID drives fairly OK but will burn a fair bit of CPU doing it. These home NASes will take anywhere between two and four drives and are fairly quiet and fairly low power and have iTunes servers and all sorts of other stuff in them. The question you've got to ask yourself is how accessible are the web interfaces, are they in fact usable, and are you going to pay for the extended tech support who will help you rebuild the array in the event that it fails, or are you an mdadm ninja who can SSH into the thing like I usually do and rebuild the arrays by hand, provided that they're willing to be rebuilt? RAID 5 is a nice option because the usable capacity is n - 1 drives.
So if you have four two terabyte drives, (so that's eight terabytes of raw storage), and you put them in a RAID 5 array, one drive's worth of storage is given over to parity. It isn't one particular drive, because in RAID 5 the parity is actually spread across all of the drives. One drive's worth of data is used for parity redundancy information, which means that with four two terabyte drives, you will end up with six terabytes of usable storage minus a little bit for administrative overhead in a RAID 5 array. RAID could probably have its own discussion, and I could do an entire boo on RAID, and that may happen another night.
RAID 6 is a little bit nicer because we run the array with dual parity. This means that you can handle the loss of two drives in a RAID array that is in RAID 6 mode. You don't really win much though if you've only got four drives in your RAID. Four minus two is two so if you've got four two terabyte drives, you only end up with four terabytes of fairly reliable storage in RAID 6. RAID 6 however does start to make sense if you have six or eight drives. If you have eight drives in RAID 6, you lose two drives for redundancy, which means that if we have 16 terabytes of storage, and we have eight drives, we subtract two drives for parity, and we end up with 12 terabytes of usable storage. Which means that the storage ratio is more efficient the more drives you have in RAID 6. However, don't think you're going to go out and build a RAID 6 array with 27 drives. Unfortunately, the more drives you put in a RAID array, the greater the chance that one of the drives is going down for the count and isn't coming back up again. In fact, I could probably do an entire boo on the failures and shortcomings of RAID.
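The capacity sums for RAID 5 and RAID 6 can be written down as a short Python sketch. It assumes identical drives and ignores the small administrative overhead; the function name is just for illustration.

```python
def usable_tb(drives: int, drive_tb: float, parity_drives: int) -> float:
    """Usable capacity when parity_drives' worth of space goes to parity.

    Use parity_drives=1 for RAID 5 and parity_drives=2 for RAID 6.
    """
    if drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (drives - parity_drives) * drive_tb


# The three examples from the talk, all with two terabyte drives.
for label, count, parity in [("RAID 5", 4, 1), ("RAID 6", 4, 2), ("RAID 6", 8, 2)]:
    raw = count * 2
    usable = usable_tb(count, 2, parity)
    print(f"{label}, {count} x 2 TB: {usable:g} TB usable of {raw} TB raw "
          f"({usable / raw:.0%})")
```

That gives 6 terabytes for four drives in RAID 5, 4 terabytes for four drives in RAID 6, and 12 terabytes for eight drives in RAID 6, so the usable share climbs from 50% to 75% as a RAID 6 array grows from four drives to eight.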
But that will give you some idea as to how safe your data may or may not be. For the technically apt, you could potentially store your data on cloud services such as Google Drive or Amazon S3. You will however have to be fairly competent with command line tools and web APIs unless you're going to use something like Amazon Back-up for S3 or Amazon S3 Explorer, which are two of the almost accessible apps for Windows. There are of course command line tools for the Linux and Mac users such as s3cmd that will put and retrieve objects from buckets on Amazon S3 including multi-part uploads etc. Amazon S3 however does cost, and you'll have to look at their pricing page. Essentially it was three cents a gig a month to store, last time I looked, in the US West 1 region, and nine cents per gig to actually retrieve the stuff from Amazon S3. You can store the data for about one cent a gig a month if you push the data off into Glacier, which is to say that when you push data off into Glacier, the storage costs reduce amazingly. However the restore time jumps to three to five hours if you restore an Amazon Glacier batch. Also if you restore more than, (I believe), 25% of your data, there are extra restoration fees for Glacier. Somebody's probably running around in a data centre somewhere jamming tapes into tape drives. I have no idea whether this is actually true and if anybody has any information about how Glacier actually works, I'd be happy to hear from you.
Google is playing with a new set of technologies which is currently in beta called Google Nearline Storage. The ability to back up piles and piles of data to Google services with the correct web APIs with a three to five second restore time. I don't know whether that classes as warm storage, but if the technology matures and becomes reliable, it could be quite useful for people running a blindy radio station. Back up a couple of terabytes of music that you've got for your radio station on the Google Nearline Storage, and have a jukebox application that allows you to pull back a song in three to five seconds from warm storage, whilst the other song is queued and playing, or you're banging on about what time it is and how many friends are tuned in.
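To see what those S3 per-gig prices add up to for a hoard of a couple of terabytes, here is a quick Python sketch. The rates are the ones quoted in the talk (three cents a gig a month to store, nine cents a gig to retrieve) and will be out of date, so treat them as placeholders and check the pricing page. The 2000 gig figure and the function names are assumptions for the example.

```python
def monthly_storage_cost(gb: float, dollars_per_gb_month: float) -> float:
    """Cost of keeping gb stored for one month."""
    return gb * dollars_per_gb_month


def retrieval_cost(gb: float, dollars_per_gb: float) -> float:
    """One-off cost of pulling gb back down."""
    return gb * dollars_per_gb


DATA_GB = 2000          # two terabytes in decimal gigabytes (an assumption)
STORE_RATE = 0.03       # dollars per gig per month, as quoted in the talk
RETRIEVE_RATE = 0.09    # dollars per gig retrieved, as quoted in the talk

per_month = monthly_storage_cost(DATA_GB, STORE_RATE)
print(f"Storage:      ${per_month:.2f} a month")
print(f"One year:     ${per_month * 12:.2f}")
print(f"Full restore: ${retrieval_cost(DATA_GB, RETRIEVE_RATE):.2f}")
```

At those rates two terabytes comes to about $60 a month to keep and about $180 to pull all of it back once, which is worth knowing before you commit everything to S3.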
Look guys, if you have any questions about big data and big data home storage, I'd be happy to hear from you, and I'd be happy to answer them. I don't know whether this talk has been useful to anyone or instructive. If there are any things people want me to talk about specifically, I'd be more than happy to put some posts out there to inform you guys about how to handle big data storage and stuff like that. If you've listened this long, thank you very much for listening. Goodnight everyone.
So we are essentially four years on from the big data on a budget post. I think that file was generated in 2014, and it is now 2000, no, 2016. It is now 2020. So, four years on. And I suppose the big question everybody would have is “How am I backing up my data in 2020?” I do have backups, and I have those backups of all different sorts of computers and things.
For a start, on my main desktop I’m using Backblaze, because I feel that five dollars per month for my primary computer (five dollars US), is a bargain, and that backing up that sort of data is sensible to a Cloud destination. And it has satisfactory encryption and things to keep that data safe. It means that I can use the Backblaze app on iOS to restore the data, or I can actually generate zips of the data and restore it onto the computer, downloading it from Backblaze, or you can pay them a fee and they will send you a hard drive.
I also, however, have on-site backup. So as well as having Backblaze, I have a utility called Restic. And I may do a complete podcast on Restic, because it’s a very awesome utility. It backs up a tree of files, and it uses data deduplication, and it saves snapshots of the file system at various points in time. And because it saves backups at various points in time, I have a Restic job that backs up the key files of my computer to B2 Cloud. Now B2 Cloud is a cloud storage system that is half a cent per gig for storage, and I believe one cent per gig for retrieval. And you can retrieve that data from B2 Cloud and restore it to your computer, so I have snapshots of the key files on my computer stored.
I also have a local hard drive copy of data which is backed up to a 4 terabyte Seagate external drive. But because I know about the reliability of external drives, I have about three or four other hard drives that have the 453 Gig backup set from my Restic backup jobs stored on them.
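For anyone curious why deduplication and snapshots go together so well, here is a toy Python sketch of the general idea. It is not how Restic is implemented: this version keeps everything in memory, cuts files into fixed-size chunks and does no encryption, whereas Restic uses content-defined chunking and encrypts the repository. The class and method names are made up for illustration.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size
        self.chunks = {}      # SHA-256 hex digest -> chunk bytes
        self.snapshots = []   # each snapshot maps file name -> list of digests

    def snapshot(self, files: dict) -> int:
        """Store a {name: bytes} tree and return the new snapshot's index."""
        tree = {}
        for name, data in files.items():
            ids = []
            for start in range(0, len(data), self.chunk_size):
                chunk = data[start:start + self.chunk_size]
                chunk_id = hashlib.sha256(chunk).hexdigest()
                # A chunk that is already stored costs no extra space.
                self.chunks.setdefault(chunk_id, chunk)
                ids.append(chunk_id)
            tree[name] = ids
        self.snapshots.append(tree)
        return len(self.snapshots) - 1

    def restore(self, index: int) -> dict:
        """Rebuild every file in a snapshot from its list of chunks."""
        tree = self.snapshots[index]
        return {
            name: b"".join(self.chunks[chunk_id] for chunk_id in ids)
            for name, ids in tree.items()
        }
```

Take a snapshot of a file, append to it, and take another: the second snapshot only adds the new chunks, yet both points in time can still be restored in full. That is the property that makes it cheap to keep many snapshots around.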
I also have Restic backups of other computers like my primary Dell in the bedroom, and the Dell on the kitchen table. And that way, I feel that I have satisfied my backup requirements, and I have multiple copies of my backups should the worst ever happen and a drive fail in my primary computer.
So, that is how I’m currently backing up my data in 2020, and I’m always open to suggestions, input on other things I should be doing, and I think I will do a podcast on the actual ins and outs of Restic as a backup tool.
I also use Rclone to sync a whole stack of file trees to cloud storage, Rclone being the Rsync Swiss army knife of cloud storage. Talks to about 60 different backends, or 30 different backends, very powerful utility. May also do a podcast on that one.
So, hope that has answered some of the questions about how I back up big data on a budget.
Support Conversations with Kerry by contributing to their Tip Jar: https://tips.pinecast.com/jar/kerrykos
Find out more at https://kerrykos.pinecast.co