There is nothing certain but the uncertain.
This maxim generally holds good in life, and even more so in business. When it comes to keeping data available and accessible all the time, many uncertainties can interrupt data access, affect data integrity, and even cause data loss, all of which have a direct detrimental impact on business continuity.
When your mission-critical applications and data suffer downtime, both your top line and your bottom line take a hit. Revenue and productivity drop while the outage lasts, and operational costs rise with every delay in fixing the issue and restoring normal service.
To minimize these risks, enterprises put various business continuity and disaster recovery practices in place, and it behoves the IT team to ensure that data is reliably stored, always available, and accessible. In doing so, IT organizations track several metrics that define SLAs, measure quality of service, and demonstrate compliance. Let’s look at some of these data storage metrics and understand their implications for the business.
#1 Availability
Availability can simply be understood as system uptime, i.e., the percentage of time the storage system is operational and its data can be accessed. Highly available systems are designed to minimize downtime and avoid loss of service. All organizations expect high availability for their applications and business services, but no single IT component delivers it alone. High availability depends on many IT infrastructure components, including the storage hardware and software, working in concert and quickly restoring essential services in the event of a failure.
Availability is typically expressed as a number of nines.
1 nine = 90% availability, 2 nines = 99% availability, 3 nines = 99.9% availability, 4 nines = 99.99% availability, and so on. The converse of availability is downtime. So, if a storage system has an annual SLA of 7 nines availability (99.99999%), it may be down for no more than about 3.15 seconds in a year (0.00001% of roughly 31.5 million seconds). You need to fully understand your business requirements and the costs involved to determine and set your availability goals. Service providers, too, offer availability SLAs as part of their contracts.
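The arithmetic behind the nines can be sketched in a few lines of Python. The function names are illustrative only, and the sketch assumes a 365-day year, so the figures shift slightly in leap years.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a non-leap year


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines gives 0.999."""
    return 1 - 10 ** (-nines)


def annual_downtime_seconds(nines: int) -> float:
    """Return the downtime per year permitted by an SLA with this many nines."""
    return SECONDS_PER_YEAR * 10 ** (-nines)


# Print the downtime budget for 1 through 7 nines.
for n in range(1, 8):
    pct = availability_from_nines(n) * 100
    print(f"{n} nines ({pct:.5f}%): {annual_downtime_seconds(n):,.2f} s/year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the cost of an availability target tends to climb steeply at the high end.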
To improve availability, organizations generally use replication techniques that create redundant data copies to enable continuous data access. Avoiding single points of failure is key to improving data availability.
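As a rough illustration of why redundancy helps and single points of failure hurt, the sketch below uses the textbook probability model for components in parallel and in series. It assumes failures are independent, which real systems rarely satisfy exactly, and the function names and sample figures are hypothetical.

```python
import math


def redundant_availability(replica_availability: float, replicas: int) -> float:
    """Availability of N independent replicas when any one of them can serve the data."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1 - (1 - replica_availability) ** replicas


def chain_availability(*component_availabilities: float) -> float:
    """Availability of a chain in which every component must be up at once."""
    return math.prod(component_availabilities)


# Two replicas that are each 99% available, versus a chain of three
# components where the weakest one is a single point of failure.
print(f"{redundant_availability(0.99, 2):.4%}")
print(f"{chain_availability(0.999, 0.999, 0.99):.4%}")
```

Under these assumptions, adding a replica raises availability above that of any single copy, while a chain is always less available than its weakest component.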
#2 Durability
Durability refers to the continued persistence of data. Businesses have long-term data retention goals, and they meet them by improving the durability of both the data and the storage infrastructure preserving it. This matters most in object storage, where data is archived and preserved for long periods. A high level of durability means the data is protected against bit rot, degradation, and any other form of corruption or loss.
RAID alone won’t suffice. By taking regular backups, by replicating and erasure-coding data/objects, and by enabling WORM/immutability, durability of data can be increased.
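To make erasure coding concrete, here is a deliberately minimal sketch using a single XOR parity chunk, which can rebuild any one lost chunk. This is the simplest possible scheme, chosen for illustration; production object stores generally use stronger codes (such as Reed-Solomon) that survive several simultaneous losses. All names below are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks):
    """Compute the parity chunk as the XOR of all equal-length data chunks."""
    return reduce(xor_bytes, chunks)


def recover_missing(chunks, parity):
    """Rebuild the single missing chunk (marked None) from the survivors and parity."""
    survivors = [c for c in chunks if c is not None]
    if len(survivors) != len(chunks) - 1:
        raise ValueError("exactly one chunk must be missing")
    return reduce(xor_bytes, survivors, parity)


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)

# Simulate losing the second chunk, then rebuild it from the rest.
damaged = [data[0], None, data[2]]
print(recover_missing(damaged, parity))
```

The design trade-off is storage overhead versus tolerance: one parity chunk costs far less capacity than a full replica but protects against only a single loss.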
#3 Reliability
Reliability is typically associated with the infrastructure storing the data. It refers to the probability that the storage system will work as expected. A storage system may be available for a certain period of time yet not work as expected during it; in that case, its reliability is low. Many factors contribute to the reliability of a system, and it is not easy to measure. One common indicator is mean time between failures (MTBF): the predicted elapsed time between inherent failures of a storage system during normal operations. A high MTBF indicates high reliability, because failures are expected less often; a low MTBF indicates low reliability.
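MTBF is commonly estimated from observed history by dividing total operating time by the number of failures. A minimal sketch, with a hypothetical function name and made-up sample figures:

```python
def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """Estimate MTBF as total operating time divided by the number of failures."""
    if failure_count <= 0:
        raise ValueError("need at least one observed failure to estimate MTBF")
    return total_operating_hours / failure_count


# Hypothetical example: one year of operation (8,760 hours) with 4 failures.
print(mtbf_hours(8760, 4))
```

A longer interval between failures means the system fails less often, so the estimate is only as good as the observation window behind it.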
Preparing comprehensive test procedures to cover real-time production scenarios and edge cases can be helpful to equip the storage system to handle failures. Proper sizing, configuration, hardware maintenance and software patching/updates also contribute towards improving reliability.
#4 Resiliency
Resiliency describes the ability of a storage system to self-heal, recover, and continue operating after a failure, outage, security incident, or similar disruption. High resiliency doesn’t by itself mean high data availability. It just means that the storage infrastructure is equipped to overcome disruptions. Resiliency is not a standalone metric; it spans business continuity, incident response, and recovery techniques that reduce the magnitude and duration of disruptive events.
One indicator of resiliency is mean time to repair (MTTR), which captures how long it takes to get the storage infrastructure up and running after a failure. The lower the MTTR, the better the resiliency.
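MTTR can be computed as the average duration of past repairs. The sketch below also relates it to availability through the widely used steady-state formula MTBF / (MTBF + MTTR); that formula is not from this article and assumes the system simply alternates between running and being repaired. The names and figures are hypothetical.

```python
def mean_time_to_repair(repair_durations_hours) -> float:
    """Average time taken to restore service across recorded incidents."""
    if not repair_durations_hours:
        raise ValueError("need at least one recorded repair")
    return sum(repair_durations_hours) / len(repair_durations_hours)


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given MTBF and MTTR in the same unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical incident log: three repairs, durations in hours.
mttr = mean_time_to_repair([0.5, 1.5, 1.0])
print(mttr)
print(steady_state_availability(999, mttr))
```

Under this model, halving the MTTR improves availability about as much as doubling the MTBF, which is one reason fast automated recovery is worth the investment.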
Resiliency of a storage system can be improved through redundancy and failover and by building in software-defined intelligence to automatically detect issues and self-heal in a short span of time.
#5 Fault Tolerance
Fault tolerance is similar to the concept of availability, but it goes one step further to guarantee zero downtime. While a highly available storage system may have minimal interruption, a fault-tolerant system will have no service interruption. Because of its more complex design, a fault-tolerant system is typically quite expensive to maintain: it involves running active-active copies of data all the time, with the automation needed to fail over whenever a component of the storage system fails and would otherwise cause downtime. This failover is non-disruptive: applications and data access are not impacted at all, and the business continues to function as expected.
Synchronous mirroring is widely used in the storage world for enabling fault tolerance. Data from the primary storage is synchronously mirrored to another storage device located in the same site or in a metro cluster. Automatic failover, resynchronization, and failback mechanisms ensure continuous data access and business operations overcoming downtime. Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are maintained at zero for a fully fault-tolerant system.
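The core idea of synchronous mirroring, acknowledging a write only after both copies hold it, can be sketched with a toy in-memory model. This is an illustration under simplifying assumptions, not how any particular product works: the `Volume` and `SyncMirror` classes are hypothetical, and rollback of partial writes, resynchronization, and failback are all left out.

```python
class Volume:
    """A toy storage device holding blocks in memory."""

    def __init__(self, name: str):
        self.name = name
        self.blocks = {}
        self.online = True

    def write(self, key, value):
        if not self.online:
            raise OSError(f"{self.name} is offline")
        self.blocks[key] = value

    def read(self, key):
        if not self.online:
            raise OSError(f"{self.name} is offline")
        return self.blocks[key]


class SyncMirror:
    """Writes go to both volumes before returning; reads fail over automatically."""

    def __init__(self, primary: Volume, secondary: Volume):
        self.primary = primary
        self.secondary = secondary

    def write(self, key, value):
        # The call returns (acknowledges) only after both copies are written,
        # so an acknowledged write is never missing from the mirror.
        self.primary.write(key, value)
        self.secondary.write(key, value)

    def read(self, key):
        # Serve from the first online copy; the caller never sees the failover.
        for volume in (self.primary, self.secondary):
            if volume.online:
                return volume.read(key)
        raise OSError("no online copy available")


mirror = SyncMirror(Volume("site-a"), Volume("site-b"))
mirror.write("block-1", b"payload")
mirror.primary.online = False      # simulate a primary failure
print(mirror.read("block-1"))      # still served, from the mirror copy
```

Because every acknowledged write already exists on both volumes, no data is lost at failover. The price is added write latency, since each write waits for the slower of the two copies.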
Availability vs Durability vs Reliability vs Resilience vs Fault Tolerance
Hopefully you now have a better understanding of these mission-critical data storage metrics and the differences between them. Each of them comes at an incremental cost, both financial and in added latency, so architects constantly make trade-offs when deciding how far to push each one. Make sure to weigh these costs appropriately as you build and optimize your storage infrastructure.