Maintaining effective recovery management procedures is not a trivial task. From making sure processes are running correctly to controlling the costs of operations, there is no shortage of challenges. In this chapter of the Shortcut Guide to Availability, Continuity, and Disaster Recovery, we will examine five of the top operational challenges we commonly face in recovery management:

- Scheduling and monitoring backups
- Choosing the right storage media
- Controlling the cost of offsite storage
- Keeping up with growing data volumes
- Planning for disaster recovery
These five challenges are interrelated. For example, dealing with growing data volumes is directly related to controlling costs. Monitoring is less directly related to recovery operations but is just as important—we do not want to find out about a failed backup operation when we try to restore critical data after a disaster. As we address each of these five challenges, we will consider both the fundamentals of the individual challenges as well as how the challenges influence each other.
Scheduling and monitoring go hand in hand: together, they cover deciding what types of backups are required and making sure those backups are performed correctly. As with most IT operations, recovery management is driven by business requirements, so in the course of this discussion, we will have to refer back to those requirements when deciding on the proper schedule of backup types.
In Chapter 1, we looked at the business case for recovery management. That discussion included some obvious requirements, such as restoring from isolated failures and compliance, and some not-so-obvious requirements, including varying backup policies based on the business value of data. Meeting these requirements with operational procedures entails implementing the right combination of backup types at the right times.
Let's start with a quick review of the various types of backups, then consider how we combine them to achieve the level of data protection we need.
There are a few different types of backups because we need to balance competing requirements in recovery management. Ideally, we would have comprehensive coverage of all data at all points in time and we would be able to restore any or all of that data rapidly. Add to that minimizing cost and resources required to perform backups, and we would have the ideal solution. We will not be getting our ideal solution anytime soon, so we will have to settle for the optimal realizable solution.
Pragmatic recovery management solutions build on three types of backups to find that optimal solution:

- Full backups
- Incremental backups
- Differential backups
These methods have different advantages and disadvantages and so tend to complement each other as part of a recovery management strategy.
Full backups, as the name implies, make a complete copy of data that might later need to be restored. This approach has a number of advantages. A full backup is completely self-contained, so a complete restore operation can be performed using a single backup. A key disadvantage of full backups is the time and space required to create and store them. The size of a full backup is proportional to the amount of data to be protected. (The size is proportional to, rather than equal to, the size of the source data because of compression.)
Incremental backups reduce the time and space required for full backups by copying only data that has changed since the previous backup. Consider a simplified example to see how significant the storage reduction can be.
If we assume that only 10% of data changes between backups, then the amount of storage required for backup can be significantly reduced. For example, assume we have to back up about 5TB of data. Assuming a 30% compression rate by the backup software and a 10% rate of change, an incremental backup of that 5TB can be stored with as little as 350GB of storage (see Table 3.1 and the sketch that follows it).
| Total Data Size (GB) | Full Backup Size (GB, 30% Compression) | Changed Data Size (GB, 10% Rate of Change) | Incremental Backup Size (GB, 30% Compression) |
|---|---|---|---|
| 500 | 350 | 50 | 35 |
| 1000 | 700 | 100 | 70 |
| 2000 | 1400 | 200 | 140 |
| 5000 | 3500 | 500 | 350 |
Table 3.1: Incremental backups yield significant savings in storage when compared with full backups.
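To make the trade-off concrete, the following Python sketch reproduces the arithmetic behind Table 3.1. The flat compression ratio and rate of change are illustrative assumptions; real workloads vary from day to day.

```python
# A minimal sketch of the arithmetic behind Table 3.1, assuming a flat
# 30% compression ratio and a 10% rate of change (both illustrative).
COMPRESSION = 0.30   # backup software shrinks data by 30%
CHANGE_RATE = 0.10   # 10% of data changes between backups

def full_backup_size(total_gb: float) -> float:
    """Size of a compressed full backup."""
    return total_gb * (1 - COMPRESSION)

def incremental_backup_size(total_gb: float) -> float:
    """Size of a compressed incremental backup (changed data only)."""
    return total_gb * CHANGE_RATE * (1 - COMPRESSION)

for total in (500, 1000, 2000, 5000):
    print(total, full_backup_size(total), incremental_backup_size(total))
# 5000 GB -> 3500 GB full backup, but only 350 GB per incremental
```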
Clearly, the savings in storage are substantial; however, in IT as in economics, there is no free lunch. What we gain in reduced storage costs and backup time is accompanied by a disadvantage during restore operations.
In its simplest form, restoring from incremental backups requires restoring from a full backup and then applying each incremental backup performed since that full backup. Depending on the backup software, it is possible to create a synthetic backup, which results from merging a full backup with some number of incremental backups. The advantage of synthetic backups is that they combine the advantages of both full and incremental backups. As Figure 3.1 depicts, a synthetic backup is a full backup plus all changes made since that full backup was taken.
Figure 3.1: Synthetic backups merge full backups and incremental backups to allow for more efficient restore operations.
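Conceptually, creating a synthetic backup is a merge operation: start from the full backup and apply each incremental in chronological order, letting newer versions of files win. Here is a minimal Python sketch of that idea, using hypothetical file-to-version mappings rather than real block-level data:

```python
# A minimal sketch of assembling a synthetic backup from dictionaries
# mapping file paths to file versions. Purely illustrative; real backup
# products work at the block or object level.
full_backup = {"report.doc": "v1", "budget.xls": "v1", "plan.ppt": "v1"}
incrementals = [
    {"report.doc": "v2"},                   # Monday: report changed
    {"budget.xls": "v2", "memo.txt": "v1"}, # Tuesday: budget changed, memo added
]

def synthesize(full: dict, increments: list[dict]) -> dict:
    """Merge a full backup with incrementals; newest changes win."""
    synthetic = dict(full)
    for inc in increments:   # apply in chronological order
        synthetic.update(inc)
    return synthetic

print(synthesize(full_backup, incrementals))
# {'report.doc': 'v2', 'budget.xls': 'v2', 'plan.ppt': 'v1', 'memo.txt': 'v1'}
```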
An alternative to incremental backups is the differential backup. Like incremental backups, differential backups start with a full backup and then back up only changes. The difference is that a differential backup captures all changes since the last full backup, not just those since the most recent backup of any kind. An advantage of differential backups is that a complete restore requires only the full backup plus the latest differential. A disadvantage, relative to incremental backups, is that each successive differential re-copies everything that has changed since the full backup, so more storage is needed; the sketch below contrasts the two restore chains.
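The backup names in this sketch are hypothetical; the point is simply that an incremental restore must apply every backup in the chain, whereas a differential restore needs only two:

```python
# A minimal sketch contrasting the restore chains of the two schemes.
# Each list is ordered oldest to newest.
incremental_chain = ["full_sun", "inc_mon", "inc_tue", "inc_wed", "inc_thu"]
differential_chain = ["full_sun", "diff_mon", "diff_tue", "diff_wed", "diff_thu"]

def restore_set_incremental(chain: list[str]) -> list[str]:
    """Incremental restore needs the full backup plus every incremental."""
    return chain

def restore_set_differential(chain: list[str]) -> list[str]:
    """Differential restore needs only the full backup and the latest differential."""
    return [chain[0], chain[-1]]

print(restore_set_incremental(incremental_chain))    # 5 backups to apply
print(restore_set_differential(differential_chain))  # 2 backups to apply
```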
Given the various types of backups, what is the best combination of these backup methods to provide adequate data protection while controlling costs?
There are a number of ways to schedule backups, all based on the idea of a rotation. The idea originated when tapes were virtually the only option for most businesses, so we sometimes hear these methods referred to as "tape rotation" schemes. Rotation schemes balance the desire to re-use backup storage with the need to maintain a reasonable number of recovery points.
One of the simplest schemes is the round-robin method. A business has a fixed amount of backup storage, either tapes or disk, but we will describe the process in terms of tapes for simplicity. Under the round-robin scheme, a tape is used for each backup, and once all tapes have been used, the tape containing the earliest backup is re-used. A small business, for example, may be satisfied with using five tapes and performing a full backup each night of the business week. Every Monday, the previous Monday's tape is overwritten, and so on for each day. This method is simple to manage but leaves the business without a recovery point earlier than the prior week. It also makes it difficult to maintain off-site backups while preserving the ability to rapidly recover from an isolated failure.
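The round-robin scheme amounts to simple modular arithmetic, as this minimal sketch shows (the tape labels are illustrative):

```python
# A minimal sketch of round-robin rotation with five tapes and one full
# backup per business night.
TAPES = ["tape_mon", "tape_tue", "tape_wed", "tape_thu", "tape_fri"]

def tape_for_backup(backup_number: int) -> str:
    """Return which tape to (over)write for the Nth nightly backup."""
    return TAPES[backup_number % len(TAPES)]

for night in range(7):
    print(night, tape_for_backup(night))
# Night 5 reuses tape_mon: the oldest recovery point is overwritten.
```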
A common rotation scheme is based on a combination of monthly, weekly, and daily backups. This is sometimes referred to as the grandfather‐father‐son (GFS) method. The basic idea is that a full backup is done each week (the father) and differential backups are performed each day (the son). At some point in the month, the latest father tape is promoted to "grandfather" status and removed from the rotation. This backup can be stored for long periods in off‐site storage. The remaining weekly backups are maintained, typically on‐site, and overwritten after 1 month. The daily backups are overwritten on a weekly basis. There are a number of variations on the GFS scheme, but the key points are that it supports long‐term off‐site backups as well as weekly and daily backups.
The GFS scheme provides a good balance of efficient use of tapes or disk with the ability to maintain multiple recovery points.
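A minimal sketch of one possible GFS schedule follows. The promotion rule used here, where the last weekly backup of the month becomes the monthly grandfather, is just one of the variations mentioned above:

```python
# A minimal, illustrative grandfather-father-son schedule: weekly fulls
# on Sunday, daily differentials otherwise, last Sunday of the month
# promoted to grandfather status.
import calendar
from datetime import date, timedelta

def gfs_backup_type(day: date) -> str:
    """Classify a date as son (daily), father (weekly), or grandfather (monthly)."""
    if day.weekday() == 6:  # Sunday: full backup
        days_left = calendar.monthrange(day.year, day.month)[1] - day.day
        if days_left < 7:   # last Sunday of the month
            return "grandfather (monthly full, sent offsite)"
        return "father (weekly full, kept 1 month)"
    return "son (daily differential, overwritten weekly)"

d = date(2024, 6, 1)
while d <= date(2024, 6, 30):
    print(d, gfs_backup_type(d))
    d += timedelta(days=1)
```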
Monitoring backups is a two‐part process. We want to monitor backup operations to ensure they execute as expected and we want to verify that the backups performed are valid and usable.
As the number of servers that require backup grows, so does the complexity of monitoring and managing backup operations. The complexity stems from a number of factors:
As a baseline, backup administrators should capture and monitor several metrics for all backup operations:
In addition to those metrics, it can be important to track errors and failures. Significant errors, such as a failed backup, should be addressed as soon as possible in most situations. Minor failures, such as the detection of a bad block on a disk, should be monitored over time in case the failure is not an isolated incident but part of a larger systemic problem.
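A simple way to watch for systemic problems is to count recurring device-and-error pairs over time. The following sketch is illustrative; the log entries and threshold are hypothetical:

```python
# A minimal sketch of tracking minor errors so that recurring problems
# stand out from isolated incidents.
from collections import Counter

error_log = [
    ("tape_drive_2", "bad block"), ("server_14", "slow throughput"),
    ("tape_drive_2", "bad block"), ("tape_drive_2", "bad block"),
]

def recurring_errors(log, threshold: int = 3):
    """Flag (device, error) pairs seen at least `threshold` times."""
    counts = Counter(log)
    return [item for item, n in counts.items() if n >= threshold]

print(recurring_errors(error_log))
# [('tape_drive_2', 'bad block')] -- not an isolated incident; investigate.
```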
The moment you need to restore a file from a backup is the wrong time to find out the backup is corrupted. There are a few ways to verify backups, although the details will vary with backup software.
One way to verify a backup is to have the backup software perform a verification pass immediately following the backup operation. This operation compares the contents of the original files with the files on the backup medium. The ability to perform this type of verification will depend on your backup software. It will also extend the time required to perform a backup, which may not be an option if you are already working within a tight backup window.
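For a sense of what such a verification pass does, here is a minimal sketch that compares source files with their backup copies by hash. The paths are hypothetical, and a real product would verify at the block level and account for files that change during the backup:

```python
# A minimal sketch of a post-backup verification pass: hash each source
# file and its backup copy and compare.
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(source_dir: Path, backup_dir: Path) -> list[Path]:
    """Return relative paths of files whose backup copy is missing or differs."""
    mismatches = []
    for src in source_dir.rglob("*"):
        if src.is_file():
            rel = src.relative_to(source_dir)
            copy = backup_dir / rel
            if not copy.exists() or file_digest(src) != file_digest(copy):
                mismatches.append(rel)
    return mismatches

# Example (hypothetical paths):
# print(verify(Path("/data"), Path("/backup/data")))  # [] means all files match
```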
Compression features may also provide the ability to check the integrity of the compressed files without performing a block‐by‐block comparison. This could be faster than other full comparison operations but again depends on the features of your backup software.
A tried-and-true method is to randomly select a backup, restore it, and compare it with the original files. This is helpful but does have some drawbacks. First, unless a backup administrator is well versed in statistics and probability, she might not adequately sample the set of backups, at least from a formal quality control perspective. Second, data is constantly changing, so the original files may not be available to compare against. (After all, one of the reasons we make backups is to be able to restore a previous state.) Even with these limitations, it is useful to perform some level of verification using features offered by backup software and to randomly select backups for manual verification.
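Even a simple uniform random sample is better than habitually spot-checking the same recent jobs. A minimal sketch, with a hypothetical backup catalog:

```python
# A minimal sketch of randomly sampling backups for manual verification.
# random.sample gives every backup an equal chance of being picked.
import random

backup_catalog = [f"backup_{n:03d}" for n in range(1, 201)]  # 200 backup sets

def pick_for_verification(catalog: list[str], k: int = 5) -> list[str]:
    """Select k backups, without replacement, for restore-and-compare testing."""
    return random.sample(catalog, k)

print(pick_for_verification(backup_catalog))
```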
Scheduling the right combination of full, incremental, and differential backups helps to ensure a business will be able to recover to a number of points in the case of isolated or catastrophic failure. The verification process is an extra but worthwhile step to help ensure the integrity of backups.
Costs are always a factor that must be taken into consideration when managing risks, and backups are no different. An optimal backup schedule will balance costs with the ability to recover to different points in time. Verification processes may be less than ideal because of other business constraints, including costs. Nonetheless, proper scheduling and monitoring of backup operations contributes to data protection and overall risk mitigation related to data loss.
Magnetic tapes have a long history in information technology, and they have proven themselves a reliable and cost-effective storage medium. As data volumes have grown, along with the seemingly never-ending desire for faster backup and restore operations, the industry has adopted disk storage as an alternative. There are advantages and disadvantages to both, so choosing between them is not a trivial task. Finding the right choice, or the right combination of tape and disk, depends on a sound understanding of operational requirements, budget constraints, and your business' data protection strategy.
Tape backups have a number of features that make them a clear choice for business backups:
Now, let's consider the disadvantages:
The fact that tapes' ease of handling is both an advantage and a disadvantage illustrates the kinds of challenges, and the lack of clear-cut choices, we face with backup media.
Often, the advantages of tapes outweigh their drawbacks; nonetheless, advances in disk technology have created a viable alternative to tapes.
Disk storage has a number of advantages for data protection. Some of these, not surprisingly, correspond to disadvantages of tapes. The advantages of disk as a backup medium include:
Disk storage does have its downsides:
So how should one balance the pros and cons of both options to come up with the optimal mix?
The best option for your business is determined by balancing often‐competing requirements. For example, maintaining a low‐cost solution is an obvious concern, but the total cost of data protection includes more than the cost of tapes and a tape drive. If there are significant opportunity costs associated with long recovery times, backing up to disk may actually cost the business less in the long run than using tapes. However, backing up archived user directories to tape may provide the required data protection at a lower cost than using disks because rapid restoration is not essential.
The first step to identifying the best option for your business is to classify data according to the level of protection required for each type of business data. In this case, the level of protection includes:
Some data, such as that in an order management system, will require near-real-time recovery points and rapid recovery. For such data, the additional cost of disk can be justified. For other types of data, the reliability of the storage medium may be especially important. Although email may appear at first glance to be a moderate concern, that can easily change in the event of litigation: if e-discovery is required, a business must be able to produce the required documents from online systems or archives.
Also, consider the need for data life cycle management. The amount of data stored cannot grow forever without cost; at some point, the value of retaining archives will be outweighed by the cost of maintaining them. The difficult part is determining when that point is reached.
For each category of data, determine the volume of data that must be processed in a given time period to meet RTOs. Also, determine the amount of storage required to meet RPOs. These two pieces of information are somewhat dependent on each other. For example, the amount of data that needs to be processed to meet recovery point and time objectives will depend on which type of backup is used. You might want to consider a few different scenarios with different combinations of full, incremental, and differential backups to understand the trade-offs between time to back up, time to restore, and the volume of storage needed for backups; a simple way to compare scenarios is sketched after Figure 3.2.
Figure 3.2: Data should be categorized according to recovery and archiving objectives to facilitate determining the optimal use of disk and tape storage.
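A simple way to compare such scenarios is to estimate backup and restore times from data volumes and throughput. The following sketch uses illustrative throughput and size figures; substitute measurements from your own environment:

```python
# A minimal sketch for comparing backup scenarios against recovery
# objectives. All throughput and size figures are illustrative
# assumptions, not vendor numbers.
def backup_hours(data_gb: float, throughput_gb_per_hr: float) -> float:
    """Hours needed to write data_gb at the given throughput."""
    return data_gb / throughput_gb_per_hr

def restore_hours(full_gb, inc_gb_each, num_incrementals, throughput_gb_per_hr):
    """Restore time = full backup plus every incremental since it."""
    total_gb = full_gb + inc_gb_each * num_incrementals
    return total_gb / throughput_gb_per_hr

# Scenario: weekly 3500 GB full, six daily 350 GB incrementals, 500 GB/hr.
print(f"Nightly incremental backup: {backup_hours(350, 500):.1f} hours")
print(f"Worst-case restore: {restore_hours(3500, 350, 6, 500):.1f} hours")  # 11.2
# If the RTO is 8 hours, this schedule fails; consider differentials,
# synthetic backups, or faster media.
```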
The last step in the process is selecting the most cost‐effective storage option that meets your requirements. If an option does not meet the needed recovery point or time objectives, the question of cost is irrelevant.
Be sure to consider a combination of tape and disk options to meet all requirements. For example, it may be best to use disk-to-disk backup to meet recovery objectives and then use tapes for archiving. When the archives are created from the backups, the process is called "disk to disk to tape." One advantage of this approach is that the time to back up a production system is reduced while you still retain the long-term cost advantage of tapes. This method can also reduce the load on the servers being backed up by allowing archives to be made from a copy of production data. Another cost consideration with regard to archives and disaster recovery is the cost of offsite storage.
Offsite storage is an essential part of comprehensive data protection strategies. It provides a level of protection in the event of a disaster at a primary facility. Businesses will vary in their use of offsite storage and the types of facilities used. For example, for a small business, offsite could be a safe deposit box in a local bank. Larger businesses might use different offices to store each other's backup tapes or they may use a third‐party service provider. Regardless of what form of offsite storage is used, there are a number of considerations to controlling the cost associated with this part of data protection.
The time required to move data to offsite storage, and the associated cost, will depend on the type of storage media. Moving data to offsite disk storage, such as a backup service that hosts disk arrays for clients, will have relatively low labor costs. Depending on the volume of data transmitted, there may, however, be a marginal cost for necessary network bandwidth. The time required to move the data will depend on the volume of data and the network speed. Here again, RPOs and RTOs will help determine the necessary capacity.
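The underlying arithmetic is straightforward, as the sketch below shows; the link speed and data volume are illustrative:

```python
# A minimal sketch of transfer-time arithmetic for offsite replication.
def transfer_hours(data_gb: float, link_mbps: float) -> float:
    """Hours to move data_gb over a link_mbps link (ignoring protocol overhead)."""
    gigabits = data_gb * 8
    seconds = (gigabits * 1000) / link_mbps  # gigabits -> megabits, then divide
    return seconds / 3600

print(f"{transfer_hours(350, 100):.1f} hours")
# 350 GB over 100 Mbps is roughly 7.8 hours -- barely inside a nightly window.
```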
The cost of moving tapes to offsite storage will typically be dominated by labor costs. Small businesses may be able to use informal practices, such as having an employee leave the office early to drop off tapes at another office on the way home at the end of the day. Other businesses will require more established procedures using staff or third-party providers to transport tapes.
Any time tapes are moved offsite, there is a risk of losing or damaging them. A lost tape could be a minor inconvenience or a significant problem, depending on what is on the tape. If the tape contains sensitive information, such as personally identifiable information, protected health care data, or financial information, it should be encrypted to prevent unauthorized disclosure. Consider compliance requirements when transporting tapes offsite to ensure information is adequately protected when it leaves the business.
The cost of offsite storage will vary with the volume of tapes stored offsite, the length of time they are stored, and possibly the number of times the storage facility is accessed. In addition, the quality of the storage facility with regard to data protection will influence cost. Renting a self‐service storage locker from a local vendor might be the least‐expensive option possible, but we can forget about any type of environmental controls, fire suppression equipment, or other necessary controls. At the other end of the spectrum, storing tapes inside a mountain may protect your tapes as well as one can expect, but the costs for such a service may only be justified for the most sensitive corporate information.
Cloud storage is an emerging option for offsite storage. Public cloud providers typically charge by the volume of data stored and the length of time the data is stored. This type of "pay as you go" model may help reduce the need for additional hardware as the volume of data grows. As with other third-party providers, we must consider the long-term viability of the cloud storage provider along with the reliability of its service. Also, consider using strong encryption when storing confidential, sensitive, and private data in the cloud. Remember that the definition of strong encryption changes with time: encryption algorithms and key lengths that were once considered sufficient for protecting information can be compromised using today's hardware and cryptanalysis techniques. In short, the cost of offsite storage depends on a number of factors: the volume of data stored, the type of media used, the level of service provided by the storage facility, and how the data is transported to the facility.
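To see how pay-as-you-go charges behave as data grows, consider this minimal cost-model sketch. The per-GB rate and growth figure are illustrative assumptions only; real providers also charge for requests and data retrieval:

```python
# A minimal sketch of a pay-as-you-go storage cost model. The rate and
# growth assumptions are illustrative, not any provider's pricing.
def cloud_cost(start_gb: float, monthly_growth: float,
               rate_per_gb_month: float, months: int) -> float:
    """Total storage charges over `months`, with compounding data growth."""
    total, stored = 0.0, start_gb
    for _ in range(months):
        total += stored * rate_per_gb_month
        stored *= 1 + monthly_growth
    return total

# 5 TB growing 3% per month at a hypothetical $0.02/GB-month, over one year:
print(f"${cloud_cost(5000, 0.03, 0.02, 12):,.2f}")
```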
Businesses are faced with growing volumes of data. When we look into the sources of this phenomenon, we find there is no single cause; instead, the growth is driven by a wide range of business initiatives and requirements:
Much of this data will fall under the umbrella of enterprise data protection strategies, which means more data to back up. It does not necessarily mean more time to back it up, longer acceptable recovery times, or a bigger budget to cover additional hardware and storage media. About the only option left is to improve backup technology. A significant advance in this arena has been data deduplication.
Deduplication takes advantage of the fact that the contents of data blocks on disk are often duplicated across multiple data blocks. Duplication creeps into even the most storage-conscious organizations, in part because it is often easier to produce and manage information by duplicating it. Consider a few examples: multiple employees save the same copy of a report; software developers save a number of versions of similar application code; sales reports with common corporate and department data are sent to sales staff along with their personalized reports. This type of duplication creates opportunities for more efficient storage on backup media.
Deduplication works at the data block level. During a backup operation, a data block is read and compared with all other data blocks read before it in the same backup operation. This step is faster than it may sound at first: instead of explicitly comparing blocks with each other, a hash value (the output of a hash function) is calculated for each block. If a prior block has the same hash value, the backup program only needs to store a pointer to the previous block instead of storing another copy. This can result in significant savings in storage, as the sketch below illustrates.
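Here is a minimal sketch of the idea. Real products use more sophisticated block handling and must guard against hash collisions; this is illustrative only:

```python
# A minimal sketch of hash-based deduplication at the block level.
import hashlib

BLOCK_SIZE = 4096  # bytes; illustrative

def deduplicate(data: bytes):
    """Split data into blocks, storing each unique block only once."""
    store = {}    # hash -> block contents
    recipe = []   # ordered list of hashes needed to reassemble the data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:   # first time we see this block: keep it
            store[digest] = block
        recipe.append(digest)     # a duplicate costs only a pointer
    return store, recipe

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # 4 blocks, 2 unique
store, recipe = deduplicate(data)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
# 4 blocks referenced, 2 blocks stored
```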
There are two options for where to perform the deduplication process, each with advantages and disadvantages:

- Source-side deduplication, performed on the server being backed up
- Target-side deduplication, performed on the backup server
With source-side deduplication, network traffic is reduced because duplicate blocks are not sent to the backup server; only references to already-copied blocks are sent. A potential drawback is that the additional load adversely affects the source server's performance. This can be mitigated by scheduling backups during non-peak demand periods. However, if the source server is a virtual machine, consider the load placed on the physical server by other virtual machines during the backup; virtual environments have particular needs of their own when it comes to backups.
Figure 3.3: Deduplication reduces the number of data blocks stored in backups by substituting pointers to previously backed up data blocks.
Target‐side deduplication performs the deduplication operations on the backup server. This approach can offer greater throughput in highly distributed environments with dedicated backup servers. A tradeoff is that network traffic is not reduced because duplicated blocks are copied from the source system to the backup server.
Deduplication is a key technique for accommodating growing data volumes in data protection strategies. With deduplication, existing backup infrastructure can keep up with growing data volumes without requiring an equivalent increase in storage media.
The last of the five key operational challenges of recovery management we will examine is disaster recovery. A business' disaster recovery plan answers the question "How would we keep the business running if there was a catastrophic loss at a key business site?" It is important to understand how this differs from other reasons we perform backups. If someone accidentally deletes an important file, someone in tech support can restore it from a recent backup. The staff is in place, the backups are available, the backup applications and servers are up and running, and, perhaps most important, it is an isolated incident. It may be an important file, but chances are its loss would not adversely affect ongoing operations. Disaster scenarios are different.
Disaster recovery plans are implemented when normal business operations cannot continue as normal. This can be caused by a range of factors:
In cases such as these, normal backup and restore procedures are not enough and the chances of maintaining normal business operations will depend heavily on how well the business planned prior to the disaster. That planning should take into account both physical requirements and application requirements.
The physical requirements for disaster recovery include both information technology infrastructure and a physical space to house equipment and employees. At minimum, this includes:
Proper planning will help identify essential business operations, levels of service required, and the minimal amount of physical space required in a disaster situation. This planning in turn will help control the cost of maintaining the physical requirements for disaster recovery.
With the physical requirements attended to, the next step is ensuring that data and applications are up to date and ready to carry on business operations in the event of a disaster. A basic question that must be answered is: how long can operations be down before there are significant, adverse consequences? Can your business, for example, continue to operate if data on backup tapes has to be restored to standby servers before the disaster recovery site becomes operational? Can the business recreate data that was created after the last offsite backup and before the disaster struck? If the answer to either question is no, replication and high-availability services should be considered.
Replication systems are used to copy data from primary servers to standby servers on an ongoing basis. Replication systems can efficiently capture changes to disks, copy the data to standby servers, and ensure data is duplicated offsite in a reasonably short period of time. This approach does require continuously operating hardware at both the primary and disaster recovery sites as well as sufficient bandwidth to keep the standby server up to date.
Replication services maintain the data but do not control the process of switching operations from the primary to the standby servers. High‐availability systems perform this function. If rapid failover is needed, high‐availability systems can be configured to monitor primary servers and automatically switch to the standby servers as soon as a failure is detected. As a general rule, the more speed and automation required in the failover process, the greater the cost. In situations where disaster recovery budgets are constrained and slower failovers are tolerable, a manual failover process can be used.
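As a rough illustration of automated failure detection, the following sketch polls a primary server and triggers a hypothetical failover after repeated missed heartbeats. The hostname, port, and failover action are placeholders; real high-availability products additionally handle quorum, split-brain, and fencing:

```python
# A minimal sketch of a heartbeat-style availability monitor.
import socket
import time

def is_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(primary: str, port: int, max_misses: int = 3):
    """Poll the primary; escalate only after several consecutive misses."""
    misses = 0
    while True:
        misses = 0 if is_alive(primary, port) else misses + 1
        if misses >= max_misses:   # tolerate brief network glitches
            print("Primary down; initiating failover to standby")
            break                  # a failover routine would run here
        time.sleep(10)

# Example (hypothetical host):
# monitor("primary.example.com", 443)
```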
An alternative to maintaining a dedicated disaster recovery site is to use a third-party disaster recovery service. The business model of such operations is based on pooled risk: it is unlikely that all customers will need standby servers at the same time, so the provider can deliver disaster recovery services more efficiently than a typical business could on its own.
Maintaining continuous business operations in the event of disaster is one of the most difficult challenges in recovery management. Planning is key to success. By prioritizing business operations and identifying operations that can tolerate degraded performance, businesses can find a balance between maintaining services and controlling the cost of disaster recovery services.
Recovery management entails a number of technical challenges. Scheduling and monitoring backups, choosing the right storage media, controlling the cost of offsite storage, keeping up with growing data volumes, and planning for disaster recovery all present technical issues. Understanding the business drivers behind recovery management is necessary to properly prioritize business operations, which in turn is essential to making the appropriate choices when confronting each of these challenges.