Professionals can develop their businesses with effective strategies, stay ahead of the competition by analyzing dynamic market conditions, and build brand loyalty with exceptional customer service—and it could all be for nothing if the IT systems that store, manage, and distribute that information fail and there is no recovery management process in place. It would be as if the work had never been done in the first place.
Data loss is an all-too-common problem. We lose information on the small scale with damaged laptops and misplaced flash drives. We lose information on the large scale with natural disasters that destroy entire data centers. Sometimes human error is at fault, and sometimes applications fail with unfortunate consequences. Regardless of the cause of the initial data loss, the ripple effects can range from redundant work to reproduce the lost data to, in the worst case, legal liabilities, brand damage, and business disruptions.
Recovery management is a framework to mitigate the risk of lost data and lost IT systems. It includes practices such as making backup copies of essential data, maintaining stand-by systems in case primary systems fail, and establishing policies and procedures to cost-effectively protect business assets, applying appropriate procedures based on the value of the data. Of course, no business practice will eliminate all risk or guarantee recovery from calamitous events. We can, however, implement cost-effective measures that allow a business to continue to operate at, or near, normal operating levels in spite of adverse events.
Recovery management addresses the threats of different kinds of losses, from hardware failures and software bugs to stolen laptops and malicious acts. One of the surprising aspects of recovery management is the number of different situations that benefit from having a sound plan in place. Some of these are obvious, but many are not.
The "obvious" drivers behind recovery management are the reasons that come to mind first when we think of file backups and stand‐by servers:
It is easy to imagine a what‐if scenario in these areas, especially if you have ever had a hard drive fail or tried to recover a system after fire or water damage. A brief meeting with an auditor can dispel any lax attitudes toward maintaining the integrity and availability of essential corporate data.
Isolated failures are limited in scope, affecting only a few people or business processes. A typical example is the accidentally deleted file. Someone may decide they no longer need a file and delete it. Fortunately, popular end user operating systems (OSs) typically have a staging area for deleted files (for example, the Windows Recycle Bin and the Mac OS Trash) so that removed files can often be recovered by end users. Once the staging area of deleted files has been purged, restoring a backup copy of the deleted file is the best way to recover it.
Utility programs are available for recovering files even after they have been purged from a staging area such as the Recycle Bin. These programs work by reclaiming the data blocks on the disk that contain the contents of the deleted file before another application overwrites those blocks. Once the contents of a block have been overwritten, there is no easy way to recover them, at least not by conventional means.
Application errors present another type of isolated error. A bug in a database application may incorrectly update data. Just as OSs provide a staging area for deleted files in case there is an error, databases often retain recovery information, at least for short periods of time. If an error is caught in time, the database application can help recover the correct data. After that window has passed, restoring data from a backup copy is often the preferred method for correcting the mistake.
In addition to our own applications, we need to consider the risk of malicious applications, generally known as malware. Sometimes these programs are designed to corrupt files on compromised devices. If the corruption is found in time, backups can be used to restore files to their original states. Unfortunately, not all data loss incidents are so easily remedied.
Many of us only think of natural disasters when we are paying our insurance premiums. Like insurance, though, we will be glad to have backups and disaster recovery plans if disaster ever occurs.
Large-scale disasters, such as Hurricane Katrina in 2005 and the Northridge, California earthquake in 1994, are infrequent, but fires, flooding, and other local events are common enough to warrant disaster recovery planning. Some of the key elements one needs to consider in a recovery management strategy relate to these catastrophic failures. When formulating a recovery management strategy and defining requirements, consider questions such as:
- Where should backup copies of data and stand-by systems be located?
- How long is the business willing to tolerate a service disruption?
- What procedures need to be in place to ensure services can be started at a disaster recovery site?
- What are the steps to resuming normal operations once the primary site is up and functional?
Stakeholders in a business depend on IT professionals to protect information assets of the enterprise from the worst consequences of disasters. Regulations and internal policies define collective expectations for protection. Ensuring compliance is another obvious driver behind recovery management.
There are many dimensions to governance, and one of them is ensuring that the business can continue to function under a range of circumstances, including the failure of key processes and systems. Recovery management plays an important role here. In the event of technical failure, human error, or natural disaster, the business has a means to recover and re-establish a normal operating mode.
Compliance often entails more than just having a "Plan B" in the event of disaster. We need to demonstrate we have that capability in place and test it periodically to ensure our recovery management policies and procedures continue to meet the changing needs of the business.
Hardware failures, natural disasters, and compliance are obvious drivers behind the adoption of recovery management practices. They are not, however, the only aspects of business operations that should drive, and benefit from, recovery management.
It is easy to think about backups and disaster recovery in the most basic terms: make copies of important data so that you can restore them in case of an adverse event. This is certainly sound reasoning, but it does not capture everything we need to consider about recovery management. The problem with this line of reasoning is that it focuses only on data and not on the other business aspects that drive the creation and use of that data in the first place.
Figure 1.1: Additional requirements for recovery management become clear when we consider the business strategy and operations that drive the creation, analysis, and management of business data.
If we examine why we create, analyze, and manage the particular types of data we have, we will find that the tasks are tied to some operational process. For example, we keep customer data for order fulfillment and sales operations. Human resources data is kept to track employees' performance history, benefits, and skill sets. These operations are in turn created in order to execute a business strategy, such as increasing market share, improving customer service, and retaining top talent.
If we consider each level of this three-tier model as a source of influence on recovery management, we can ask two broad questions. First, how does each tier shape requirements for recovery management? Second, does recovery management enable new capabilities that allow us to expand or improve each tier? To answer these questions, we will start at the bottom and work our way up with:
- Data-driven requirements
- Operations-driven requirements
- Business strategy–driven requirements
These levels all include a combination of business and technology issues but with varying emphasis. Data-driven requirements are dominated by technical considerations, while business strategy is, not surprisingly, subject primarily to business considerations.
Rule number one of data-driven recovery management is that not all data is of equal value. Before we can define recovery management procedures, we need to understand how data falls into different groups based on:
- How critical the data is to business operations
- How quickly the data must be restored after a loss
- How much data loss, if any, can be tolerated
- How rapidly the volume of the data is growing
Sometimes it is more important to recover all data than it is to get it back quickly. A company's financial database may be down for several hours without significant impact on the business, but if even a single entry in the general ledger were missing, the integrity of financial reports would be lost. In other cases, the time it takes to recover data is the most important factor. For example, as long as a company's product catalog is unavailable for online purchases, online revenue stops and purchases are potentially lost to competitors.
The duration between a data loss event and the point at which the data should be available again is known as the recovery time objective (RTO). The point in time from which we should be able to recover lost data is known as the recovery point objective (RPO). RTO specifies how long we can tolerate being without our data; RPO specifies how much lost data (in terms of time windows) we can tolerate.
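To make these definitions concrete, here is a small worked example in Python. The timestamps, objectives, and variable names are hypothetical; the sketch simply checks whether a recovery met its RPO and RTO.

```python
from datetime import datetime, timedelta

# Hypothetical objectives for an order-processing database.
RPO = timedelta(hours=24)   # tolerate losing at most one day of data
RTO = timedelta(hours=4)    # service must be restored within four hours

last_backup = datetime(2024, 3, 4, 23, 0)   # most recent successful backup
failure = datetime(2024, 3, 5, 14, 30)      # moment the data loss occurred
restored = datetime(2024, 3, 5, 17, 0)      # moment service was restored

data_loss_window = failure - last_backup    # data created since the backup is lost
downtime = restored - failure               # how long users were without the service

print(f"Data loss window: {data_loss_window} "
      f"(RPO {'met' if data_loss_window <= RPO else 'missed'})")
print(f"Downtime: {downtime} "
      f"(RTO {'met' if downtime <= RTO else 'missed'})")
```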
RPOs are based on how much data we are willing to lose to a data loss event. Figure 1.2 depicts a basic backup strategy employing nightly backups. At any point in time, we can recover all the data from the previous day, but any data created or modified on the day of a data loss event would not have been backed up. This may be sufficient for applications with a low number of transactions during a day, such as an HR database tracking changes to employees' 401(k) funds; if data is lost, it is neither difficult nor expensive to recreate it. Applications with high transaction volumes, or those for which recreating data would be difficult, require more robust recovery management strategies, such as continuous data protection.
Figure 1.2: In this simple example, backups are performed nightly, making the previous day's close of business the RPO. The business is willing to risk the need to recreate up to a full day's worth of transactions.
In addition to deciding on an RPO, we must decide how long we are willing to be without our data. Some categories of data can have relatively long RTOs. Again, an HR application may be down for a day without severe adverse consequences. Sales and customer support applications and data, however, may require near continuous availability. In the event of data loss, the business operations that depend on these systems may not tolerate the time it would take a systems administrator to find the proper backup tape, select the lost data, and restore it to the application. In this way, our RTOs and RPOs constrain our options for implementing backup and recovery.
Figure 1.3: RTOs are defined by the amount of time that can pass between a data loss event and the restoration of data before there are adverse consequences for business operations.
Another constraint on how we implement backups and disaster recovery procedures is the rate at which data volumes grow. There are many sources of increasing data volumes:
- Growing numbers of transactions and database records
- Email messages that duplicate attachments across many mailboxes
- Multiple versions and copies of documents
The rapid growth in data volumes is driving the adoption of better data management techniques, such as more efficient management of network storage devices and the use of deduplication in backup systems.
Data is easily duplicated. Database records, email messages, and multiple versions of a document can all result in redundant data. An obvious question is: why back up and store redundant copies? Why not back up one copy and track references to where the data is reused? This is exactly what data deduplication does.
Deduplication processes operate either at the source system being backed up or at the target system receiving the copy of the backup. As each block of data is processed, the deduplication process determines whether a block of data with the same content has already been backed up. If it has, the system stores a reference to the copy that was made earlier instead of making another copy of the block.
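A minimal sketch of this idea, assuming target-side deduplication with fixed-size blocks and an in-memory store (both simplifications), might look like the following. Production systems use more sophisticated chunking and persistent indexes.

```python
import hashlib

class DedupStore:
    """Minimal content-addressed block store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block contents

    def backup(self, data: bytes, block_size: int = 4096) -> list[str]:
        """Split data into blocks; store only new content, always record a reference."""
        refs = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.blocks:   # this content was never seen before
                self.blocks[fingerprint] = block
            refs.append(fingerprint)             # reference the stored copy
        return refs

    def restore(self, refs: list[str]) -> bytes:
        """Reassemble the original data from the stored blocks."""
        return b"".join(self.blocks[r] for r in refs)

store = DedupStore()
refs_a = store.backup(b"report v1 " * 1000)
refs_b = store.backup(b"report v1 " * 1000)   # identical copy: no new blocks stored
assert store.restore(refs_a) == store.restore(refs_b)
print(f"{len(refs_a) + len(refs_b)} references, {len(store.blocks)} unique blocks stored")
```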
As our expectations for continuous availability grow, acceptable RTOs shrink. It is difficult to find maintenance windows to update applications, patch OSs, and perform other routine maintenance because customers are coming to expect 24 hour a day, 7 day a week access to applications. How much downtime is acceptable depends on the type of data and its level of criticality for business operations, but it is safe to say that for many customer-facing applications, the tolerance for downtime is close to zero. Businesses look to continuous data protection to ensure continuous availability. If data is so critical that we cannot tolerate virtually any downtime, data replication is probably the appropriate strategy.
With replication, as data is written to a primary system, it is copied to a stand-by system that maintains a close-to-real-time copy of the data on the primary system. If the primary system fails, operations switch to the stand-by system and continue as normal. When the primary system is restored, data that was updated on the stand-by server is copied back to the primary server, and operations can then be shifted back to the primary system.
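The toy sketch below models this write, mirror, failover, and failback cycle. Real replication operates at the storage or database layer, so the class and method names here are purely illustrative assumptions.

```python
class ReplicatedStore:
    """Sketch of synchronous replication with manual failover (illustrative only)."""

    def __init__(self):
        self.primary = {}
        self.standby = {}
        self.failed_over = False

    def write(self, key, value):
        if self.failed_over:
            self.standby[key] = value   # primary is down; standby takes writes
        else:
            self.primary[key] = value   # write to the primary...
            self.standby[key] = value   # ...and mirror to the standby before returning

    def fail_over(self):
        """Primary failed: switch operations to the stand-by copy."""
        self.failed_over = True

    def fail_back(self):
        """Primary restored: copy changes made on the standby, then switch back."""
        self.primary.update(self.standby)
        self.failed_over = False

store = ReplicatedStore()
store.write("order-1001", "pending")
store.fail_over()                       # simulate a primary failure
store.write("order-1001", "shipped")    # business continues on the standby
store.fail_back()
print(store.primary["order-1001"])      # "shipped": standby changes survived fail-back
```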
Key considerations include:
- Whether updates are copied to the stand-by system synchronously or with some delay
- Where the stand-by systems are located relative to the primary site
- How failover to the stand-by system, and failback to the primary, are managed
Figure 1.4: Replication duplicates all transactions on a standby server. In the event of a server or storage failure on the primary devices, the standby devices can be rapidly deployed.
Replication supports disaster recovery as well as continuous availability. Stand‐by servers may be located in different offices or data centers from the primary site. This helps to mitigate the risk of site‐specific threats, such as fire and flooding, to the primary site.
The type and volume of data we have drives some recovery management requirements. Factors such as how much data we can afford to lose and how long we can wait before data is restored have long been fundamental issues. Increasing volumes of data are also driving the need for more cost-effective storage strategies, such as data deduplication. Expectations for continuous availability and the needs of disaster recovery are well met by replication technologies. In addition to these data-driven requirements, the day-to-day operations required to maintain an IT infrastructure are also a source of recovery management requirements.
Operations-driven requirements focus on the implementation aspects of recovery management. These are the issues that systems administrators and IT managers have to consider when formulating the best way to implement the data-driven and business strategy–driven requirements. Three commonly encountered types of requirements are:
- Automating procedures to reduce manual effort and human error
- Accounting for server virtualization in backup operations
- Addressing application-specific backup constraints, such as those of database and email servers
We will delve into the technical details of these and other operation issues in Chapters 2 and 3, so we will just introduce some of the most salient elements of these issues here.
In an ideal world, recovery management procedures require minimal manual intervention, especially for backups, replication, and other ongoing tasks. Even in our less-than-ideal world, backup procedures and disaster recovery preparations should be as automated as possible, for two reasons: the more automated a procedure, the less opportunity for human error, and the less routine work administrators must perform by hand. Backups can be scheduled for automatic execution. Logs can be generated for future reference. Errors can be flagged and generate alerts to notify systems administrators.
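As a sketch of what such automation might look like, the following job archives a directory, logs the outcome, and raises an alert on failure. The paths and the alert hook are hypothetical placeholders; a scheduler such as cron would run the script nightly.

```python
import logging
import subprocess
from datetime import datetime

logging.basicConfig(filename="backup.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def send_alert(message: str) -> None:
    """Placeholder: notify administrators (email, pager, chat webhook, etc.)."""
    print(f"ALERT: {message}")

def nightly_backup(source: str, destination: str) -> None:
    stamp = datetime.now().strftime("%Y%m%d")
    target = f"{destination}/backup-{stamp}.tar.gz"
    result = subprocess.run(["tar", "-czf", target, source],
                            capture_output=True, text=True)
    if result.returncode == 0:
        logging.info("Backup of %s to %s succeeded", source, target)
    else:
        logging.error("Backup of %s failed: %s", source, result.stderr.strip())
        send_alert(f"Backup of {source} failed; see backup.log")   # flag for follow-up

nightly_backup("/var/data", "/mnt/backups")   # hypothetical paths
```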
Restoring data and services can be automated as well. When replication is used, automatic failover to a stand‐by server can be enabled, at least with some replication systems. A drawback of this type of rapid failover is a potentially more complex configuration. For example, an additional service may be required to detect the failure of the primary server and automatically redirect service to the stand‐by server. Alternatively, if automatic failover is not used, a systems administrator could manually update a local domain name server to map a domain name to the stand‐by server instead of the primary server.
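A minimal sketch of automated failure detection might look like the following, assuming a TCP health check and hypothetical host names. A real deployment would redirect clients by updating DNS, a load balancer, or a virtual IP rather than a local variable.

```python
import socket
import time

PRIMARY, STANDBY, PORT = "db-primary.example.com", "db-standby.example.com", 5432
active = PRIMARY

def is_alive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Consider a server healthy if its service port accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Runs as a simple monitoring daemon.
while True:
    if active == PRIMARY and not is_alive(PRIMARY, PORT):
        active = STANDBY   # automatic failover: new connections go to the stand-by
        print("Primary unreachable; failing over to stand-by")
    time.sleep(10)         # poll interval balances detection speed against load
```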
Virtualization can significantly increase server utilization and reduce costs, but it also introduces new variables into the recovery management equation. The most basic question is, how should we back up our virtual machines? There are several options:
- Install a backup client in each virtual machine and treat it as a physical server
- Shut down the virtual machine and copy its image files
- Take a snapshot of the running virtual machine and back up the snapshot
Treating virtual servers as physical servers can simplify backup procedures, but it requires installing a backup client on each virtual machine. The second option eliminates the need for installing a client in exchange for shutting down the virtual machine. This may be acceptable depending on the function of the server. Snapshot copies require a staging area to store the snapshot and can briefly degrade performance of the virtual machine while the snapshot is made. The best option will depend on the specific business requirements of the virtual machine.
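For example, the snapshot option might be scripted as follows in a libvirt-managed environment. The domain name and paths are hypothetical, merging the snapshot overlay back into the base image is omitted, and the exact virsh options should be verified against your hypervisor's documentation.

```python
import subprocess

# Illustrative only: a snapshot-based backup of a libvirt-managed virtual machine.
DOMAIN = "erp-app-01"   # hypothetical virtual machine name

# Create a disk-only snapshot so the VM keeps running; new writes go to an
# overlay file, leaving the base disk image stable for copying.
subprocess.run(["virsh", "snapshot-create-as", DOMAIN, "backup-snap",
                "--disk-only", "--atomic"], check=True)

# Copy the now-stable base image to backup storage (paths are hypothetical).
# Merging the overlay back into the base image afterward is omitted here.
subprocess.run(["cp", f"/var/lib/libvirt/images/{DOMAIN}.qcow2",
                f"/mnt/backups/{DOMAIN}.qcow2"], check=True)
```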
Backup and restore operations become more complex when we are working with files that are used by applications such as email servers and database servers. Consider some of the characteristics of database servers, for example. Databases typically use a small number of large files to store data about a large number of transactions. This has a number of implications for backup and recovery operations:
- Data files are held open and constantly changing while the database runs, so a simple file copy may capture an inconsistent state
- A change to a single record modifies a very large file, so file-level incremental backups are inefficient
- Transaction logs must be backed up along with the data files to support point-in-time recovery
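The first implication above is why database engines expose their own backup interfaces. As a small illustration, SQLite's online backup API (available through Python's standard sqlite3 module) produces a consistent copy of a live database; the file names are hypothetical.

```python
import sqlite3

# Copying a live database file directly risks an inconsistent backup; asking
# the database engine to produce the copy yields a consistent snapshot instead.
source = sqlite3.connect("orders.db")          # hypothetical production database
target = sqlite3.connect("orders-backup.db")   # destination for the backup copy

source.backup(target)   # the engine copies pages in a transactionally safe way

target.close()
source.close()
```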
As we can see, the way applications use file storage to implement services, such as data management and email, can have an impact on the way systems administrators implement recovery management operations. Recovery management requirements are shaped in part by operational considerations, such as the efficiency of day‐to‐day procedures, the increasing use of server virtualization and the implications for backup operations, and application‐specific constraints on the way we perform backups for databases and email systems.
Unlike data‐ and operations‐driven requirements, business strategies are as varied as businesses themselves, so there is no universal set of requirements we can all adopt as our own. Instead, in this section, we will consider two broad strategies that can provide examples to help elucidate the types of business strategy–driven requirements in your own business.
A business may decide that providing online access to detailed, historical account data is crucial to improving customer service. Implementing this strategy will require increasing amounts of storage to support the customer service application, but it will also increase the demands on recovery management services. These increased demands include additional backup storage and increased throughput to continue to meet RTOs and RPOs with larger volumes of data. They can be met with a combination of additional hardware and network services as well as improved backup techniques, such as deduplication.
Availability is a fundamental attribute of online services. It would be hard to imagine running a factory without a steady supply of power; it is equally difficult to imagine running a modern business without continuous access to the applications and data that provide business services. To mitigate the risk of lost services, businesses can implement redundancy at multiple levels:
- Redundant hardware components, such as disks and power supplies
- Redundant data, maintained through backups and replication to stand-by systems
- Redundant sites, such as disaster recovery facilities in separate locations
We must remember that there are many ways a business service can fail, so there will be multiple techniques required to mitigate that risk. Both traditional backup operations and replication services should be considered when there is a need to maintain continuous access to business services.
By considering recovery management from the perspective of data, operational, and business strategy requirements, we can identify essential aspects of business processes that need protection. The next logical step is to develop a plan for addressing those needs.
The first stage in developing a recovery management strategy is assessing threats and risks to services. This is followed by assigning RPOs and RTOs as well as defining the policies and applications needed to address those threats and ultimately implement the recovery management strategy.
Risks are adverse outcomes that we typically want to protect against, such as data loss, system failure, or security breaches. Threats are ways in which a risk can be realized, for example, a data loss (the risk) could occur if a poorly‐developed application inadvertently deleted files from a server (the threat). Although there are many types of threats, we will consider several with obvious impact on recovery management.
Threats that disrupt the functioning of IT services fall into several categories, all of which must be addressed in a recovery management strategy:
- Hardware and software failures
- Security threats, such as malware
- Natural disasters
- Human error
- Power failures
Hardware failures are better understood than software failures. Consider that when we buy hard drives, we can get an estimated mean time between failures. This metric does not tell us when a hard drive will fail, but it at least gives us some idea of how long, on average, we can expect the device to function. Software, including OSs, is more complex and diverse, and is developed under widely varying levels of quality control. There are no well-established metrics comparable to mean time between failures for measuring the reliability of software. From a recovery management perspective, it is safe to assume that both hardware and software will fail and could corrupt relatively isolated sets of data or damage entire disks of data; given that assumption, we back up appropriately.
Security threats can pose significant risks to information systems. Viruses, worms, Trojan horses, and blended threats (multiple attack vectors in a single package) can all be used to destroy or tamper with data. Data breaches that result in large numbers of disclosed records are well documented in the popular press. Security threats to the integrity and availability of data are less frequent topics of discussion but still dangerous to businesses. Reliable and timely backups can make a significant difference in the cost and time it takes to recover from a security breach. Of course, if the files on the backups themselves have been corrupted by malware or another security threat, this is not an option. Often the best strategy is to have a security management strategy in place to mitigate the risk from malware and other security threats. One way to mitigate malware risks is to use backup software that incorporates antivirus scanning, which can check files during both backup and restore operations.
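As a simplified illustration of scanning during backup, the sketch below checks each file against a set of known-bad hashes before archiving it. The hash set, paths, and detection method are stand-ins; real products integrate full antivirus engines.

```python
import hashlib
import tarfile
from pathlib import Path

# The hash below is a stand-in for a real malware signature database.
KNOWN_BAD_HASHES = {"0" * 64}   # hypothetical SHA-256 values of known-bad files

def is_clean(path: Path) -> bool:
    """Flag a file as suspicious if its hash matches a known-bad signature."""
    return hashlib.sha256(path.read_bytes()).hexdigest() not in KNOWN_BAD_HASHES

def backup_with_scan(source_dir: str, archive: str) -> None:
    """Archive only the files that pass the scan; report the rest."""
    with tarfile.open(archive, "w:gz") as tar:
        for path in Path(source_dir).rglob("*"):
            if path.is_file():
                if is_clean(path):
                    tar.add(path)   # only clean files reach the backup archive
                else:
                    print(f"Skipped suspicious file: {path}")

backup_with_scan("/var/data", "/mnt/backups/scanned.tar.gz")   # hypothetical paths
```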
Natural disasters need little explanation. The key questions we need to answer about disaster recovery include where to store backup copies of data and stand-by servers, how long we are willing to tolerate service disruption, and what procedures need to be in place to ensure services can be started at a disaster recovery site. In addition, what are the steps to resuming normal operations once the primary site is up and functional?
Human error will always be with us, so we must design systems in ways that minimize the potential impact of error. Programmatic techniques, such as validating input and prompting for verification of destructive operations, are one way. Organizational techniques, such as separation of duties and requiring authorization from multiple individuals, are another way to mitigate the risk of human error.
Disruption caused by power failures can be mitigated with multiple power supplies. Large data centers may employ a redundant source of primary power, including on-site generators, which may not be practical for smaller facilities. Facilities of any size should consider uninterruptible power supplies (UPSs) for temporary power. A UPS can provide power during brief outages and allow time for a controlled shutdown of systems in the event of a long outage. These threats are just some of the ways that risks to the business can be realized.
When systems are down and information is unavailable, businesses are adversely affected.
Some of the most immediate concerns we have about loss of business continuity are:
- Disrupted business operations and lost revenue
- Redundant work to reproduce lost data
- Legal and compliance liabilities
- Damage to the brand and to customer trust
Businesses face a host of risks to their operations and many risks can be realized by multiple types of threats. It is prudent, and cost effective, to plan ahead and develop a recovery management strategy before an adverse event occurs.
A sound recovery management strategy is a combination of (1) policies that address the various data, operations, and business requirements with respect to the risks and threats a business faces and (2) applications and technologies that enable the implementation of those policies.
The purpose of recovery management policies is to document and put into practice methods for mitigating the risks facing businesses. Several types of policies are essential:
- Backup policies
- Continuity and failover policies
- Disaster recovery policies
- Security policies
Policies should define the scope of what should be done to mitigate risks; technical implementation details are defined after policies are formulated. They are codified as procedures that are executed by systems administrators and other IT professionals responsible for day‐to‐day operations.
Backup policies specify the types of data and applications that should be backed up, along with the RPO and RTO for each type of data. For example, an HR database may have an RPO of the previous business day and an RTO of 4 hours. Procedures for this policy may call for a combination of weekly full backups plus nightly incremental backups, as in the sketch below.
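To illustrate how such a policy translates into a restore procedure, the following sketch computes the chain of backups needed to restore to a given day under a hypothetical weekly-full, nightly-incremental schedule.

```python
from datetime import date, timedelta

# Restoring to a given day requires the most recent full backup plus every
# incremental backup taken since. The schedule and dates are hypothetical.
FULL_BACKUP_DAY = 6   # Sunday (date.weekday(): Monday=0 ... Sunday=6)

def restore_chain(target: date) -> list[str]:
    """Return the ordered list of backups needed to restore to `target`."""
    days_since_full = (target.weekday() - FULL_BACKUP_DAY) % 7
    last_full = target - timedelta(days=days_since_full)
    chain = [f"full backup of {last_full}"]
    day = last_full + timedelta(days=1)
    while day <= target:
        chain.append(f"incremental backup of {day}")
        day += timedelta(days=1)
    return chain

# Restoring to a Wednesday requires Sunday's full plus three incrementals.
for step in restore_chain(date(2024, 3, 6)):
    print(step)
```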
Continuity and failover policies focus on critical data and applications. The purpose of these policies is to ensure that systems that should be available at all times are protected with high‐availability techniques. For example, a sales database may have an RPO of the last 10 minutes and an RTO of 10 minutes as well. These demanding constraints warrant a replication‐based solution.
Disaster recovery policies specify what disaster recovery procedures should accomplish and who should be involved. These policies specify criteria for establishing disaster recovery sites or services, such as location in separate buildings or different localities depending on the criticality of the data and services protected. They should also specify the RPOs and RTOs of different categories of services. The policy should also include some description of when disaster recovery procedures are implemented, typically when service infrastructure is so compromised that normal services cannot be maintained.
Disaster recovery policies should also indicate the need to test disaster recovery procedures and systems at regular intervals. Modifications to procedures should be tested when they are implemented and then tested again during regularly scheduled test operations.
Security policies must take into account much more than recovery management, but they should include directives on the appropriate use of the Internet and restrictions on installing unauthorized software on company devices.
In addition to policies defining what is required of disaster recovery procedures, we need applications to meet those needs.
Disaster recovery depends on two types of systems: backup and restore applications and high-availability solutions. Backup and restore applications give us the means to recover from a wide array of adverse events, from hardware failures that lose data and software bugs that corrupt the integrity of data to natural disasters that destroy entire data centers. It is important to consider backup and recovery operations when deploying new systems and implementing new business services. We must be able to back up and restore critical data within the time ranges allotted to us by the business. Growing volumes of data make this more difficult; however, techniques such as deduplication can help us keep pace with the growth in data volumes.
High‐availability solutions allow us to replicate services and data on stand‐by servers and keep them up to date. These solutions are essential when we must maintain 24 × 7 systems and allow for extremely short RTOs.
A recovery management strategy should take into account a variety of requirements. Some of these requirements are a function of the criticality of the data we have, some are dictated by operational and efficiency considerations, and others are derived from business strategy. Regardless of the source of the requirements, a sound recovery management strategy starts with codifying those requirements in policies that can be used to develop operational procedures to protect business services and data. Applications such as backup and restore systems and high‐availability solutions play a critical role in implementing those policies and procedures. We will turn our attention to those implementation issues in the next chapters.