Learn about the three key metrics that help devise an effective business continuity plan and disaster recovery strategy.
Staying up and running at all times is the goal of all IT services. Whether it is enterprise applications, IT applications, cloud services, or supporting data center services, business resilience depends on how quickly mission-critical services are restored once there is a disruption. The disruption could be a single server failure affecting one application, network downtime affecting a set of systems in a location, or an entire site outage due to a major crisis. Resiliency is a crucial element of an organization’s business continuity (BC) and disaster recovery (DR) plan that helps formulate preparedness, protection, response and recovery objectives to quickly and effectively get the disrupted services operational with minimal data loss and user impact.
Data storage plays a pivotal role in ensuring application availability and performance. When there is storage-related downtime, it has a detrimental impact on application access and, in turn, business continuity. In this blog, we look at the importance and applicability of three principal metrics which, in the storage world, matter just as much as – if not more than – anywhere else in IT. These are:
- Recovery point objective (RPO)
- Recovery time objective (RTO)
- Recovery time actual (RTA)
The adage ‘less is more’ could not be more apt than for this trifecta. The lower the value of each metric (measured in units of time), the more effective the response to storage failures and the faster business services resume. Ideally, all three metrics would be zero. In practice, the goal for the IT team is to get them as close to zero as possible. Achieving shorter recovery times depends on having the right set of data backup and recovery practices in place.
Recovery Point Objective (RPO)
Consider a site-level incident where the entire data storage is down, affecting many applications. Here, RPO can be understood as the period of data loss that the applications can be allowed to suffer, measured back from the time of the incident to the point when the last known good copy of the data is available for recovery. It is a service level objective, or a measure of loss tolerance: how much data loss, expressed as a period of time, can the organization realistically accept when a storage failure affects data access? So, if RPO is defined as 12 hours in the business continuity plan and the last known available data backup before the outage is from 9 hours ago, then the RPO threshold has not been violated.
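The 12-hour example above can be expressed as a small check. This is only an illustrative sketch: the function names and the timestamps are hypothetical, and it assumes the data loss window is simply the time between the last known good copy and the incident.

```python
from datetime import datetime, timedelta


def data_loss_window(incident_time: datetime, last_good_copy_time: datetime) -> timedelta:
    """Span of data that would be lost: from the last good copy up to the incident."""
    if last_good_copy_time > incident_time:
        raise ValueError("the last good copy cannot be later than the incident")
    return incident_time - last_good_copy_time


def rpo_met(incident_time: datetime, last_good_copy_time: datetime, rpo: timedelta) -> bool:
    """True when the data loss window stays within the RPO."""
    return data_loss_window(incident_time, last_good_copy_time) <= rpo


# Hypothetical timestamps matching the example: RPO of 12 hours, backup from 9 hours ago.
incident = datetime(2024, 1, 15, 18, 0)
last_backup = incident - timedelta(hours=9)

print(rpo_met(incident, last_backup, timedelta(hours=12)))  # True: 9 h of loss is within the 12 h objective
```

Had the last good backup been 13 hours old, the same check would report that the RPO threshold was violated.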
Recovery Time Objective (RTO)
RTO is another service level objective, one that sets the target expectation for the IT team to get the service operational again. It denotes the period of time within which the organization expects the affected service to be restored, measured from the moment of disruption (in our case, due to a storage issue). For example, the RTO for a high availability scenario can be set as 5 minutes for a small incident like a disk failure, which necessitates a mirror copy to be made active. In a disaster recovery scenario, where the primary site and DR site are separated by a long distance, TBs of backup data need to be made available at the DR site (typically through remote replication), many connections have to be reconfigured, and services have to be restarted, so RTO can be many hours or even days.
Recovery Time Actual (RTA)
RTA refers to the actual time elapsed to complete the data recovery and make the storage copy available for application access. While RTO is the estimated value set as a target, RTA is the actual time measured against it. For good data governance and compliance, the RTA achieved must be less than the RTO set in the BC/DR plan. In some cases, IT teams simulate a DR-like scenario in a test environment (parallel to and independent of production) and examine the effectiveness of their backup and recovery tool by measuring RTA. If there is a significant gap between the estimated RTO and the actual RTA, you may need to revisit your failover strategy to ensure that the switch from source to target happens faster.
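Measuring RTA in a drill and comparing it with the RTO could look like the sketch below. The drill timestamps and the 4-hour RTO are made-up values for illustration, the function names are hypothetical, and the sketch assumes that an RTA exactly equal to the RTO still counts as meeting the target.

```python
from datetime import datetime, timedelta


def recovery_time_actual(disruption_start: datetime, service_restored: datetime) -> timedelta:
    """RTA: time elapsed from the disruption until the service is usable again."""
    if service_restored < disruption_start:
        raise ValueError("service cannot be restored before the disruption starts")
    return service_restored - disruption_start


def rto_overshoot(rta: timedelta, rto: timedelta) -> timedelta:
    """How far the RTA exceeded the RTO; zero when the target was met."""
    return max(rta - rto, timedelta(0))


# Hypothetical DR drill: failover started at 09:00, service was back at 14:30.
rta = recovery_time_actual(datetime(2024, 5, 2, 9, 0), datetime(2024, 5, 2, 14, 30))
rto = timedelta(hours=4)

overshoot = rto_overshoot(rta, rto)
if overshoot > timedelta(0):
    print(f"RTO missed by {overshoot}; revisit the failover strategy")
else:
    print("RTA is within the RTO")
```

Recording the overshoot for each drill, rather than a simple pass or fail, makes it easier to see whether changes to the failover process are actually closing the gap.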
The figure below shows a schematic representation of RPO, RTO and RTA where the actual data loss and recovery time are higher than the set objectives. Hence the organization’s failover and recovery response is not in compliance with their estimated BC/DR plan.
Factors to Consider While Planning RPO and RTO Targets
There are various factors that need to be considered by IT Operations while estimating RPO and RTO metrics as part of their BC/DR plan:
- Failover automation: With automatic storage failover, the RPO and RTO numbers will be much lower than with a manual recovery process. So, set your objectives based on how automated and fast your failover process is.
- Frequency of copying data for restoration: RPO, specifically, depends on how often a copy of the data is taken, and therefore on how far back in time from the disruption the most recent available copy is. By using the last known good copy of the data that is closest to the time of the failure, data loss can be considerably minimized.
- Distance between the storage and copy: If both the primary storage and its replicated copy are available at the same site, RPO and RTO numbers can be low. If they are available at different sites, based on the geographical distance between the sites, the objectives need to be increased accordingly.
- Assessment of application criticality: IT can define the recovery metrics based on the business relevance of the application to be restored. Is it a highly critical application handling frequently accessed data? If so, RPO and RTO need to be set lower so that there is concerted focus and effort on bringing it back up soon.
- Economics of recovery: RPO, RTO and RTA metrics are all influenced by various factors in the IT environment, such as the tools available for backup and recovery, the skills of the IT staff involved and how trained they are at handling different failure scenarios, and the availability of snapshots/backup copies, additional hardware costs and storage space, etc. So, the overall cost associated with achieving a demanding RTO could be higher. By assessing the financial impact involved, RTO can be set at a realistic number that is both achievable and affordable.
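To illustrate how copy frequency bounds the achievable RPO, here is a rough sketch. It rests on a simplifying assumption that is not from any specific product: each copy captures the data as of the moment it starts and becomes usable only once it finishes, so the worst-case data loss is roughly the copy interval plus the copy duration. The function names and schedule values are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(copy_interval: timedelta,
                         copy_duration: timedelta = timedelta(0)) -> timedelta:
    """Worst case: failure strikes just before the next copy finishes,
    so the newest usable copy is one interval plus one copy duration old."""
    return copy_interval + copy_duration


def schedule_supports_rpo(copy_interval: timedelta, rpo: timedelta,
                          copy_duration: timedelta = timedelta(0)) -> bool:
    """True when even the worst-case loss stays within the RPO target."""
    return worst_case_data_loss(copy_interval, copy_duration) <= rpo


target_rpo = timedelta(hours=12)

# Hypothetical schedules: a nightly backup that takes 2 hours, and an hourly snapshot.
nightly_ok = schedule_supports_rpo(timedelta(hours=24), target_rpo, timedelta(hours=2))
hourly_ok = schedule_supports_rpo(timedelta(hours=1), target_rpo)

print("nightly backup supports 12 h RPO:", nightly_ok)
print("hourly snapshot supports 12 h RPO:", hourly_ok)
```

A check like this is a starting point only; replication distance, automation, and cost, as listed above, all shift what is realistic.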
Three Lines of Defense from DataCore Software-Defined Storage
DataCore SANsymphony is an industry-leading software-defined storage (SDS) solution that incorporates many data protection techniques out of the box to help achieve your business continuity and disaster recovery goals and improve RPO and RTO.
The first line of defense is achieving local redundancy with the help of SYNCHRONOUS MIRRORING. SANsymphony automatically replicates data from your active storage device to a mirror device and creates an active-active grid between them. In the event of a device failure, SANsymphony will automatically fail over to the mirrored device, thus ensuring business continuity and high availability for the application. Once the affected storage device is operational again, SANsymphony will resynchronize with that device and restore its original connection path to the application.
Both the failover and failback are zero-touch processes and happen without manual intervention. And because of the near real-time speed at which this happens, both RPO and RTO values are maintained at zero and there is no data loss or application impact. Three-way synchronous mirroring is also supported by SANsymphony for increased resiliency.
The second line of defense is disaster recovery of a full site or data center by leveraging ASYNCHRONOUS REMOTE REPLICATION. Here, SANsymphony creates copies of data from a primary site to the DR site and provides redundancy over long distances. Because of the distance between the sites, data mirroring happens asynchronously and not in real time. When there is a site outage, SANsymphony fails over to the DR site enabling business operations to continue with minimal disruption.
- RTO is typically longer because the DR site connection must be configured to the production application and associated services may need to be restarted.
- RPO, on the other hand, depends on the asynchronously replicated copy of the last known good data, and is typically achieved in minutes.
Indeed, site failover helps to provide redundancy in the event of a natural disaster, but it can also be used for controlled site swaps for scenarios such as planned site maintenance, scheduled power outage, construction activity, etc. Additionally, the bidirectional replication feature in SANsymphony allows for swapping between the primary and remote site based on IT need.
The third line of defense is rolling back to the last known good data status. For this, SANsymphony supports three techniques: BACKUP (through integration with backup solutions such as Veeam), SNAPSHOT, and CONTINUOUS DATA PROTECTION (CDP). All three of these are point-in-time recovery methodologies where a copy of the data is taken periodically in the form of a backup (which is typically less frequent), snapshot (which is comparatively more frequent) and CDP rollback volumes (which provide one-second granularity for data rollback to a previously known good state before a disruption happened).
CDP only creates copies of incremental changes to the data and does not copy the entire storage volume every time. The IT administrator chooses a restore point that precedes the disruptive event, and CDP creates a rollback volume from it, which is then served to the application. For example, after a ransomware attack, CDP can be used to roll back to a point in time just before the breach happened. You can achieve close to zero RPO and very fast RTO.
Note: CDP using SANsymphony supports rolling back data within a 14-day interval. CDP is not a replacement for backup or snapshots and is recommended to be used as a complementary recovery utility.
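Choosing a restore point just before a breach, while respecting the 14-day window, can be sketched as follows. This is not SANsymphony's API: the function and constant names are hypothetical, and the sketch assumes that "within a 14-day interval" means the restore point must be no more than 14 days old, and that restore points fall on whole seconds.

```python
from datetime import datetime, timedelta

# Values taken from the text; the names are illustrative, not product settings.
CDP_WINDOW = timedelta(days=14)
CDP_GRANULARITY = timedelta(seconds=1)


def pick_restore_point(breach_time: datetime, now: datetime,
                       window: timedelta = CDP_WINDOW) -> datetime:
    """Return the latest whole-second restore point strictly before the breach."""
    restore_point = breach_time.replace(microsecond=0) - CDP_GRANULARITY
    if restore_point > now:
        raise ValueError("restore point lies in the future")
    if restore_point < now - window:
        raise ValueError("breach is older than the CDP window; use a snapshot or backup")
    return restore_point


# Hypothetical incident: breach detected as starting at 09:30:00.5 the previous day.
now = datetime(2024, 3, 15, 12, 0, 0)
breach = datetime(2024, 3, 14, 9, 30, 0, 500000)

print(pick_restore_point(breach, now))
```

When the breach falls outside the window, the error path reflects the note above: CDP complements snapshots and backups rather than replacing them.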
The figure below highlights the differences in RPO and RTO values between storage backup, snapshot, and CDP with SANsymphony. RPO for CDP is the lowest amongst the three.
The 3-2-1 Strategy to Improve BC/DR Objectives
The 3-2-1 rule is a time-tested BC/DR strategy that helps to minimize failures and long recovery times. It advises keeping at least three (3) copies of your data on two (2) independent storage media, with one (1) copy stored offsite. If one of the storage locations becomes inaccessible, there will be another data copy to fail over to. This strategy can be followed as a best practice to improve operational continuity.
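A simple way to audit an inventory of data copies against the 3-2-1 rule is sketched below. The `DataCopy` type, its fields, and the sample inventory are hypothetical, and the sketch treats two copies as being on independent media whenever their `medium` labels differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataCopy:
    name: str
    medium: str    # e.g. "primary-array", "mirror-array", "tape"
    offsite: bool  # True when the copy is stored away from the primary site


def satisfies_3_2_1(copies: Iterable[DataCopy]) -> bool:
    """At least 3 copies, on at least 2 distinct media, with at least 1 offsite."""
    copies = list(copies)
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory: production volume, local mirror, and a replica at the DR site.
inventory = [
    DataCopy("production", "primary-array", offsite=False),
    DataCopy("local-mirror", "mirror-array", offsite=False),
    DataCopy("dr-replica", "dr-array", offsite=True),
]

print(satisfies_3_2_1(inventory))  # True: 3 copies, 3 media, 1 offsite
```

Dropping the DR replica from the inventory makes the check fail on two counts at once: too few copies and nothing offsite.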
To summarize, RPO, RTO and RTA are instrumental in planning for BC/DR. Understanding how they work for your specific IT environment and business needs will help you set realistic objectives and realize faster recovery times. With the help of DataCore SANsymphony, you can leverage built-in capabilities such as mirroring, replication, snapshots, CDP, etc. to lower your recovery objectives and actuals (to even zero or near-zero values), thereby reducing the impact of disruption on application access and availability. These metrics are also important for compliance, as auditors might look for your recovery SLAs to compare the actuals against. Contact us to find out how SANsymphony can help you plan and implement data protection and recovery strategies to achieve business resiliency.