I)Introduction

This topic has been covered many times in many places, but since I had an incident last Sunday (as of this writing) where my desktop ate its home partition after a power outage, I figured it was a good time to write about backups while they were fresh on my mind.

The 3-2-1 rule is simple and fundamental to backups: keep no fewer than 3 copies of your data, on 2 different types of media, with 1 copy off site. There are further standards that can be layered on top, but this is a safe starting point that mitigates a lot of problems.

An example of this in play is what I’m currently doing in my home lab. I have my PVE machine, a backup server running Proxmox Backup, and a dump of my VMs to tape. It’s a simple, straightforward architecture. In a more modern environment the tape may be displaced by S3 or something compatible. There are ups and downs to both.

II)How to back things up

When I say how to back things up, I don’t mean the technology; I mean the analysis of the data itself. This requires the business to have a good understanding of their systems’ criticality and the overhead of actually making copies. Is the data disposable? How much data loss is acceptable? What cost is the business willing to incur to make sure their data is protected to an adequate level? How long can they wait for the data to be brought back online, and what are the fiscal impacts of that wait?

Two key terms in the business are Recovery Time Objective and Recovery Point Objective, often referred to as RTO and RPO respectively. These are the two key numbers that should come up in any backup discussion. What do they mean?

  • Recovery Point Objective is how much data loss is acceptable. If you’re running backups on Sundays and your system gets hit by a meteor on a Thursday, everything written between Sunday’s backup and the meteor strike is gone; the RPO is the business deciding ahead of time whether a gap like that is acceptable.
  • Recovery Time Objective is the target time to resolution. Something breaks, you restore it, business resumes; the time that takes is the RTO.
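
To make the RPO arithmetic concrete, here’s a trivial sketch. The schedule and timestamps are made up, and GNU date is assumed:

```shell
#!/bin/sh
# Hypothetical scenario: nightly backup finishes at 02:00, the server
# fails at 14:30 the same day. Worst-case data loss is everything
# written since the last good backup.
last_backup="2024-06-02 02:00"
failure="2024-06-02 14:30"
loss_seconds=$(( $(date -d "$failure" +%s) - $(date -d "$last_backup" +%s) ))
echo "Worst-case loss window: $(( loss_seconds / 3600 ))h $(( loss_seconds % 3600 / 60 ))m"
```

If the business can only tolerate losing an hour, a once-a-night schedule like this one fails the RPO before anything has even broken.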

Without these numbers being known, a backup project will wind up rudderless and stuck in implementation.

III)What Scenarios Are We Protecting From?

There are two key scenarios that backups or replication can protect against:

  • Physical faults
  • Logical faults

Physical faults are the obvious ones: the usual drive failures, a system getting struck by lightning, water in the server, dinosaur attacks, and so on. RAID, while much better than a single disk, is not a panacea for data loss and never will be. RAID controllers can eat disks; I’ve seen a single disk add zero out a whole set of drives.

Logical faults can be more diverse. This could be anything from data corruption, a ransomware attack, or even an employee deleting something they shouldn’t.

The distinction is an important one. Replication technologies can protect against physical faults, while snapshots protect against logical faults. A true backup can protect against both. My desktop had RAID, and it did nothing to prevent my partition from getting corrupted during a power loss.

It is important to note that replication is not a backup. I’ve talked with people who had files deleted; management asked them for the replica, and they had to break the news that the deletion had propagated at the speed of replication.

IV)The Technologies

The technologies needed to achieve these goals vary with the goals themselves. This is why it’s better to start with goals rather than products. It may take multiple technologies to achieve a goal or set of goals.

As an example, if 24 hours of loss is acceptable, a conventional nightly backup will work. If the allowable data loss is, say, 15 minutes, integration with the storage layer in one form or another is probably required. Rolling array snapshots are a good bet here. At the application level, binary logging for databases is another option, as is versioning. Shadow Copy is an easily enabled Windows technology that works well in many environments. On the Linux side, Btrfs and LVM offer snapshotting as well. With synchronous replication and the right technologies, zero or near-zero data loss may be possible; it is usually reserved for things like banking transactions, and comes at great monetary cost.

Before implementing array snapshots at these levels, it is best to understand the implications of doing so and to test accordingly. In VMware, for instance, snapshot consolidation can cause uptime issues, especially when there is a large amount of change over a long period of time. Btrfs has issues with volumes in the tens of terabytes in my experience, especially on an older OS like RHEL 6.

In the enterprise, this can also involve complex storage replication systems. That may mean fully synchronous replication to a nearby site followed by asynchronous replication to a site farther away.

The achievable recovery time is largely driven by how much data you have to restore, the file count, the network, and the media it’s on. There is nuance to RTO, but those are good general guidelines. Sizing and testing should be discussed for any serious enterprise application or specific backup system. The 1s and 0s can only move so fast, and a second can seem like an eternity when you’re on a bridge call waiting for the backups to restore.
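
A quick back-of-envelope for that restore-time math, with illustrative numbers rather than measurements:

```shell
#!/bin/sh
# Rough restore time: data volume divided by effective throughput.
# All numbers here are assumptions for illustration.
data_gb=2000            # 2 TB to bring back
link_mbit=1000          # 1 Gbit/s network, best case
gb_per_hour=$(( link_mbit * 3600 / 8 / 1000 ))   # 450 GB/hour at line rate
echo "Best-case restore: about $(( data_gb / gb_per_hour )) hours"
```

Real restores rarely hit line rate once protocol overhead, small files, and media seek times get involved, so padding an estimate like this is wise.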

V)Putting it to use

So, what does all of this mean? It really boils down to this:

  • Figure out what needs to be protected
  • Figure out the protection goals
  • Determine what product(s) need to be used to achieve those goals
  • Implement a solution
  • Put the solution into a normalized operational state
  • Run test restorations periodically
  • Periodically re-evaluate for changes in backup needs with the business
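
The test-restoration step is worth automating. A minimal sketch (paths and file names are made up) that pulls one file back from the backup copy and verifies it matches the original bit for bit:

```shell
#!/bin/sh
# Minimal periodic restore test: restore a sample file and compare checksums.
set -e
src=/tmp/restore-demo/src
bak=/tmp/restore-demo/bak
restore=/tmp/restore-demo/restore
mkdir -p "$src" "$bak" "$restore"
echo "payroll data" > "$src/important.txt"
cp "$src/important.txt" "$bak/"        # stand-in for the real backup job
cp "$bak/important.txt" "$restore/"    # the test restoration
if cmp -s "$src/important.txt" "$restore/important.txt"; then
    echo "restore test OK"
else
    echo "restore test FAILED" >&2
    exit 1
fi
```

A real version would pull from the actual backup product and alert on failure, but even something this simple catches the classic surprise of backups that have silently been writing nothing for months.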

The planning and review parts are really important. If an organization is siloed without good process around provisioning, things can easily get missed in the backup realm. Backups are something that people largely ignore until they need them, and by then it’s too late. A certain amount of proactive management is required. Having a robust server build process can be really helpful in collecting information about desired backup state, and should be done before a server is built.

Regardless of the product, my experience is that there is always a level of care and feeding to daily backup operations. Someone has to investigate why a backup didn’t complete, which is a fairly common occurrence. As an administrator, expect to spend some time every week in the backup realm, at minimum checking reports to ensure backups ran correctly. Larger organizations will tend to have a dedicated administrator running their backups.

VI)Back to my home lab

How did I make out after losing my primary desktop’s data? Not bad. I got lucky in that I had built a file server I was migrating data to the night before, so I had a fresh copy that had been rsynced to it. That put my effective RPO at a few hours once I got my desktop rebuilt, which I can’t complain about. I’ve certainly lost more data at home over the years, before I knew better. Even if I didn’t have that NAS copy, I have a copy on LTO tape. The obvious future goal is to push the data into a proper backup system.

The only part of the solution I don’t have yet is an off-site location for the tapes. Maybe it’s time to start looking into a safe deposit box.

By utadmin
