Most organisations will have an IT disaster recovery (DR) plan in place. However, it was probably created some time ago and, in many cases, will now be unfit for purpose.
The problem is that a DR plan has to deal with the capabilities and constraints of the IT environment as it stood when the plan was written, so a DR plan created in 2005 will hardly have been designed for virtualisation, cloud computing and fabric networks.
The good news is that the relentless improvements in IT have created a much better environment, one where the focus should now shift away from DR towards business continuity (BC).
At this stage, it is worth setting out a basic definition of both terms to show how they differ.
- Business continuity (BC): a plan that attempts to deal with the failure of any aspect of an IT platform in a manner that still retains some capability for the organisation to carry on working.
- Disaster recovery (DR): a plan that attempts to get an organisation back up and working again after the failure of any aspect of an IT platform.
Hopefully, the major difference is clear: BC is all about an IT platform coping with a problem; DR is all about bringing things back when the IT platform has not coped.
Historically, the costs and complexities of putting in place the technical capabilities for BC meant that only the richest organisations with the strongest needs for continuous operation could afford it. Now, it should be within the reach of most organisations, at least to a reasonable extent.
Business continuity is based around the need for a high-availability platform, something that was covered in an earlier article. Through the correct use of 'N+M' equipment alongside well-architected and well-implemented virtualisation, cloud and mirroring, an organisation should be able to put some level of BC in place that covers the majority of cases.
Note the use of the word "majority" here. Creating a fully BC-capable IT platform is not a low-cost project. The organisation must be fully involved in deciding how far the BC approach goes: by balancing its own risk profile against the costs involved, it can decide at what point a BC strategy becomes too expensive for the business to fund.
This is where DR still comes in. Let's assume that the business has agreed that the IT platform must be able to survive the failure of any single item of equipment in the data centre itself. It has authorised the investment of funds for an N+1 architecture at the IT equipment level and, as such, the IT team now has one more server, one more storage system and one more set of network paths per system than is needed. However, as the data centre is based on monolithic technologies, the costs of implementing an N+1 architecture around the UPS, the cooling system and the auxiliary generation systems were deemed too high.
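The N+1 logic above can be sketched in a few lines: with one more unit than the workloads need, any single failure leaves enough capacity, but a second failure does not. This is a minimal illustration with assumed unit counts and capacities, not a description of any specific estate.

```python
# Minimal sketch of N+M capacity reasoning. All figures are illustrative
# assumptions: four servers of equal capacity where three are enough.

def surviving_capacity(units: int, unit_capacity: int, failures: int) -> int:
    """Capacity remaining after a number of unit failures."""
    return max(units - failures, 0) * unit_capacity

def meets_demand(units: int, unit_capacity: int, demand: int, failures: int) -> bool:
    """True if the surviving units can still carry the workload."""
    return surviving_capacity(units, unit_capacity, failures) >= demand

# N+1: four servers, workload needs the capacity of three (300 units).
print(meets_demand(4, 100, 300, failures=1))  # True: a single failure is absorbed
print(meets_demand(4, 100, 300, failures=2))  # False: a second failure means degradation or outage
```

The same check extended to the UPS, cooling and generators would show N+0 for those items, which is exactly the gap the DR plan has to cover.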
Therefore, the DR team has to look at what will be needed should any of these items fail, as well as what happens if N+1 is not good enough.
The first things that have to be agreed with the business are how long it will take to reach a specified level of recovered function, and what that level of function is. These two points are known as the recovery time objective (RTO) and the recovery point objective (RPO). Neither is something the IT team should be defining on its own: the business has to be involved and must fully understand what the RTO and RPO mean. In particular, the RPO defines how much data loss has to be accepted, and this could have a knock-on impact on how the business views its BC investment.
For example, in an N+1 architecture, the failure of a single item will have no direct impact on the business, as there is still enough capacity for everything to keep running. Should a second item fail, the best that can happen is that the speed of response for the workload or workloads on that set of equipment will slow; the worst is that the workload or workloads will stop working altogether. In the former case, the RPO will be to regain the full speed of response within a stated RTO, which would generally be defined as the time taken for replacement equipment to be obtained, installed and fully implemented. The DR plan may therefore state that a certain amount of spares inventory has to be held, or that agreements have to be in place with suppliers for same-day delivery of replacements. The plan must also include all the steps required to install and implement the new equipment, along with the timescales that must be met if the RTO is to be achieved.
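The timescale check at the end of that paragraph is simple arithmetic: the runbook steps must sum to less than the agreed RTO, or the plan has to change (for example, by holding spares on site rather than relying on supplier delivery). A small sketch, with entirely assumed step names and durations:

```python
# Hypothetical DR runbook check: do the planned recovery steps fit inside
# the business-agreed RTO? Step names and hours are illustrative assumptions.

RTO_HOURS = 24  # example business-agreed recovery time objective

recovery_steps = {
    "obtain replacement (same-day supplier agreement)": 8,
    "install and cable the hardware": 4,
    "restore configuration and data": 6,
    "validate and return to service": 2,
}

planned_hours = sum(recovery_steps.values())
print(f"Planned recovery time: {planned_hours}h against an RTO of {RTO_HOURS}h")
print("RTO met" if planned_hours <= RTO_HOURS else "RTO at risk: consider on-site spares")
```

If the supplier lead time alone exceeded the RTO, the calculation would point straight at the spares-inventory option the plan mentions.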
In the latter case, where the workload has stopped, the RPO has to include a definition of the amount of data that could be lost over specified periods. In most cases this will be per hour or per quarter-hour; in high-transaction systems, it could be per minute or per second. The impact on the RTO is therefore dependent on the business's view of how many such 'chunks' of data loss it believes it can afford. The DR team has to provide a fully quantified plan for meeting the RPO within the constraints of the business-defined RTO. If it is physically impossible to balance the two, the team has to go back to the business, which must decide either to invest in a BC strategy for this area or to lower its expectations on the RPO so that a reasonable RTO can be agreed.
In essence, BC has to be the main focus for a business: it is far more important to create and manage an IT platform in a manner that lets the organisation maintain its business capability. The DR plan is essentially a safety net, there for when the BC plan fails. BC ensures that the business continues, even if process flows (and therefore cash flows) are impacted to some extent. DR is there to try to stop a business from failing outright: once a workload or workloads have stopped, the process flows are no longer there.
Both BC and DR are critical for an organisation to have in place; the key is to make sure that each complements and feeds into the other, so that there are no holes in the overall strategy.