BC vs. DR vs. HA
The terms Business Continuity (BC) and Disaster Recovery (DR) are very often used interchangeably. While they are somewhat related, they are not the same thing. Business Continuity refers to the ability to maintain business functionality in the case of an outage, planned or unplanned, while Disaster Recovery is more IT-centric and technical in nature and refers to the ability to recover a particular component that might support one or more business capabilities.
As an example, we might have a loading platform that provides operational data feeds at 15-minute intervals into a relational database platform. This relational database platform is used by a critical business dashboard to monitor some Key Performance Indicators (KPIs) used by Business Leadership. In this example, the data loading platform, the relational database, and the KPI dashboard should each have an individual DR Plan to ensure that each component is adequately protected against an outage.
The Business Continuity Plan would address how all three components work together to support the flow of data that is critical to the KPI Dashboard. If the data loading process is not operational, or if the database is not available, the Dashboard’s value is diminished; likewise, it doesn’t make sense to protect the database at a very high level while de-prioritizing the load platform and/or the dashboard platform.
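To make the "weakest link" point concrete, here is a minimal sketch of how the availability of the whole data flow might be estimated when all three components must be up at once. The availability figures and function names are hypothetical, chosen only for illustration, and the calculation assumes component failures are independent.

```python
from math import prod

# Hypothetical availability figures (fraction of time each component is up).
# These are illustrative assumptions, not measurements from a real system.
component_availability = {
    "data_loading_platform": 0.99,
    "relational_database": 0.9999,
    "kpi_dashboard": 0.995,
}


def end_to_end_availability(components):
    """Availability of a chain in which every component must be up,
    assuming the components fail independently of each other."""
    return prod(components.values())


def weakest_component(components):
    """Name of the component with the lowest availability."""
    return min(components, key=components.get)


overall = end_to_end_availability(component_availability)
print(f"End-to-end availability: {overall:.4f}")
print(f"Weakest component: {weakest_component(component_availability)}")
```

With these assumed numbers, the flow as a whole is available roughly 98.5% of the time, which is lower than any single component. The highly protected database adds little if the load platform is left at a much lower level.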
In short, as the diagram below depicts, Disaster Recovery is a component of Business Continuity, and a Business Continuity Plan will likely be composed of many Disaster Recovery Plans.
The other term in this diagram that is sometimes used interchangeably with Disaster Recovery is High Availability (HA) or Fault Tolerance. High Availability/Fault Tolerance refers to redundancy that is built in to protect against a component failure within a system. This redundancy can take three forms:
- Hardware Redundancy – Redundant disk storage, power, network, etc.
- Software Redundancy – Clustering and load balancing
- Environmental Redundancy – Fault domains and availability zones
Depending on how resilient a platform is required to be, there might be many redundancy points built in to protect against multiple, simultaneous failures. An example of this relative to “Hardware Redundancy” might be the network connectivity for a hardware node. The node could be configured with multiple network interface cards (NICs) and multiple network ports on each card. When the network connections are defined, the ports could be bonded together across all cards, and those ports could further be wired across multiple network switches. It would then be possible to lose a network card, a port/cable on another card, and one of the network switches ALL at the same time, with NO loss of network connectivity for the node.
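The bonded-port scenario can be sketched as a small model. This is only an illustration under assumed names: the layout (two NICs with two ports each, cross-wired to two switches) and identifiers such as `nic0` and `switch-a` are hypothetical, not taken from a real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    nic: str     # network interface card the port belongs to
    name: str    # port (and its cable) identifier
    switch: str  # switch the port is cabled to


# Hypothetical layout: two NICs, two ports each, cross-wired to two switches.
BONDED_PORTS = [
    Port("nic0", "nic0-p0", "switch-a"),
    Port("nic0", "nic0-p1", "switch-b"),
    Port("nic1", "nic1-p0", "switch-a"),
    Port("nic1", "nic1-p1", "switch-b"),
]


def node_has_connectivity(ports, failed_nics=(), failed_ports=(), failed_switches=()):
    """The bonded link stays up while at least one port still has a
    working NIC, a working port/cable, and a working switch."""
    return any(
        p.nic not in failed_nics
        and p.name not in failed_ports
        and p.switch not in failed_switches
        for p in ports
    )


# Lose a card, a port on the other card, and one switch at the same time.
survives = node_has_connectivity(
    BONDED_PORTS,
    failed_nics={"nic0"},
    failed_ports={"nic1-p0"},
    failed_switches={"switch-a"},
)
print(survives)
```

In this assumed layout the node survives that triple failure because `nic1-p1` still reaches `switch-b`. If `switch-b` failed instead of `switch-a`, no path would remain, so which combinations are survivable depends on how many ports there are and how they are wired.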
This is an example of a fault tolerant, highly available network connection for a hardware node:
When high availability/fault tolerance starts to get into the Environmental Redundancy space, the distinction between High Availability and Disaster Recovery becomes somewhat blurred. For the sake of this discussion, we’ll assume that High Availability provides protection within a geographic instance of a system and Disaster Recovery provides protection across geographic instances.
Active/Active (or Dual Active) vs. Active/Standby
So now that we’ve drawn a line between Business Continuity and Disaster Recovery, how do terms like Active/Active or Warm Standby fit into the picture? These are terms that refer to how you plan to utilize your systems. As an example, let us assume we have a primary production system and a secondary DR system.
Active/Active (or Dual Active) – Both systems are utilized to run production workloads during regular, non-DR processing periods. Data may be completely synchronized between the two systems, or there may be a slight tolerance for synchronization lag. The bottom line is that the business workload would be balanced in some way across both systems to make full use of the assets, but in the case of a failure on one side, the surviving system would be able to handle the critical production workload.
Active/Standby – ONLY the Primary system is utilized to run production workloads during regular, non-DR periods. The Secondary system is utilized only if the Primary becomes unavailable. As a further definition, the Standby can be designated “Hot”, “Warm”, or “Cold” depending on how tightly the systems are synchronized and how quickly the Standby can be up and running production workloads. Generally, a “Hot” standby can be up and running within seconds to minutes of a primary failure, while a “Warm” standby might take a few hours to a day or two, and a “Cold” standby might take several days to a week or more.
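As a rough illustration of the Hot/Warm/Cold distinction, the sketch below labels a standby by how long it takes to start running production workloads. The exact cut-offs (`HOT_LIMIT` of one hour and `WARM_LIMIT` of two days) are assumptions picked to fit the loose ranges above; they are not standard values, and each organization would set its own.

```python
from datetime import timedelta

# Assumed cut-offs, chosen only to match the loose ranges described above.
HOT_LIMIT = timedelta(hours=1)   # "seconds to minutes"
WARM_LIMIT = timedelta(days=2)   # "a few hours to a day or two"


def classify_standby(time_to_production):
    """Label a standby system by how quickly it can take over
    production workloads after the primary fails."""
    if time_to_production < timedelta(0):
        raise ValueError("time_to_production cannot be negative")
    if time_to_production <= HOT_LIMIT:
        return "Hot"
    if time_to_production <= WARM_LIMIT:
        return "Warm"
    return "Cold"


for label, duration in [
    ("5 minutes", timedelta(minutes=5)),
    ("6 hours", timedelta(hours=6)),
    ("5 days", timedelta(days=5)),
]:
    print(label, "->", classify_standby(duration))
```

Takeover time is only one of the two criteria named above. A fuller classification would also account for how tightly the standby's data is synchronized with the primary.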
Many companies are starting to standardize their terminology based on well-documented and well-known standards such as ITIL (Information Technology Infrastructure Library). ITIL is a set of detailed practices for IT Service Management that focuses on aligning IT services with the needs of the business. ITIL uses its own set of terms to classify the different levels of availability. Here is a quick terminology cross reference:
Now that we have high-level definitions for varying levels of availability, the next blog in this series will cover some techniques for gathering availability requirements and making sure that your Business Continuity and DR Plans focus on the most critical areas. Basically, how do you get the most bang for your buck when it comes to a DR/BCP solution? This will include defining Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) for your business applications.