Wendy Stutzman

How to Avoid System Outages & Most Common Severity Level 1 Issues: Part 1

In the Pursuit of 100% Availability

You already know that when you and your users cannot access a system or its data, the cost can go beyond lost revenue: it can damage your organization’s reputation. At DataCore, we are committed to delivering the highest availability possible. In this series, we define severity levels, explain why outages happen, and show how you can avoid them.

Defining Severity 1 Issues 

At DataCore, a Severity 1 issue is one in which no data can be accessed. We are constantly improving our documentation and products to help avoid outages that affect product data. In a study of the most common Severity 1 issues that left product data unavailable, the top two causes were a system-wide power outage at the site and all of the DataCore thin-provisioned pools becoming full at the same time.

What to Do When Failure Is Not an Option

When two DataCore servers whose virtual disks are in a synchronous mirror configuration lose power at the same time (or nearly the same time), the result can be a double failure. Recovering from a double failure requires manual intervention to decide which side of the virtual disk held the last known good data. Once this has been decided, the user must select the ‘Force Online’ option to allow hosts to gain access and the mirrors to begin synchronizing. Deciding which side is good may not be easy if, for instance, power was restored and then lost again, or one side of the mirror already had an issue when power was lost. In such cases, we recommend opening an incident with our technical support team so that the last known “good” side of the virtual disk can be determined.

How to Quickly Recover from an Outage

One way to avoid this entire scenario is to use an Uninterruptible Power Supply (UPS) on one of the DataCore servers (and on all the components in the path from the application to its data). This is a relatively easy way to prevent that server from losing power at the same time as its partner server. With the proper power commands, the server with the UPS can shut down properly if power cannot be restored in a timely manner. It is prudent to implement a UPS even if the servers are located in the same data center and that data center has full power redundancy. Why? Because even these data centers are not immune to catastrophic power outages.
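
As a rough illustration of what "shut down properly" implies, the sketch below shows the kind of decision a UPS-aware shutdown routine has to make: start a clean shutdown while there is still enough battery runtime to finish it. This is not DataCore's UPS support. The function name, its parameters, and the 60-second safety margin are all hypothetical and chosen only for illustration.

```python
def should_shut_down(on_battery: bool,
                     runtime_remaining_s: float,
                     shutdown_duration_s: float,
                     safety_margin_s: float = 60.0) -> bool:
    """Decide whether a UPS-backed server should begin a clean shutdown.

    All names and the default margin are illustrative assumptions,
    not values taken from DataCore documentation.
    """
    # On utility power there is nothing to do.
    if not on_battery:
        return False
    # Begin shutting down once the remaining battery runtime is no longer
    # comfortably larger than the time a clean shutdown needs.
    return runtime_remaining_s <= shutdown_duration_s + safety_margin_s


# Example: 10 minutes of battery left, shutdown takes 90 seconds.
print(should_shut_down(True, 600, 90))   # still waiting for power to return
print(should_shut_down(True, 120, 90))   # time to shut down cleanly
```

In practice, the runtime and on-battery status would come from the UPS vendor's own monitoring software, and the actual configuration steps are described in the DataCore Online Help Guide.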

This works because, if both sides of a mirrored configuration lose power, the mirrors can re-synchronize without human intervention once the systems are operational again, as long as the two outages are staggered by a sufficient amount of time. Conversely, if both sides lose power at or around the same time, the hosts cannot access the virtual disks, even after power is restored, until a person indicates which side has the most current data.
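
The difference between the two cases can be sketched as a simple rule. The model below is a deliberate simplification, not the product's actual logic: it assumes the only input is the time at which each server lost power, and it uses a hypothetical five-minute threshold, because the article does not say how large a "sufficient" stagger is.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: the article gives no number for how far apart
# the two power losses must be.
MIN_STAGGER = timedelta(minutes=5)


def recovery_path(loss_a: datetime, loss_b: datetime,
                  min_stagger: timedelta = MIN_STAGGER) -> str:
    """Return which recovery path a mirrored pair would follow.

    "automatic": the outages were staggered, so the mirrors can
                 re-synchronize on their own.
    "manual":    a double failure; someone must pick the side with the
                 last known good data and select 'Force Online'.
    """
    gap = abs(loss_a - loss_b)
    return "automatic" if gap >= min_stagger else "manual"


t0 = datetime(2024, 1, 1, 12, 0, 0)
print(recovery_path(t0, t0 + timedelta(seconds=2)))   # manual
print(recovery_path(t0, t0 + timedelta(minutes=20)))  # automatic
```

The UPS on one server is what turns the first case into the second: it keeps that server running long enough for the gap between the two power losses to become meaningful.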


Stay Tuned for More Tips

Please refer to the DataCore Online Help Guide for more information on how to configure UPS Support.
