Business Continuity Measures for Volatile Times

Webcast Transcript

I think the audience will find this very timely. It certainly is down here in South Florida. About this time of year, we start to concern ourselves more with the weather. Although I think for a lot of you who’ve been north of here, in places across the northern hemisphere, the weather seems to have been a bit more critical part of your thinking. So, the advice that follows is intended to help you maintain your business continuity plans by adapting to those inevitable, yet unforeseen, turns of events that are certain to shake things up for you.

It does require that we do a really objective self-assessment of the risks, especially those risks which have cascading effects, because those tend to have a big influence on how well you can shore up your safeguards. A lot of these have quite preventable consequences.

And so, what we’re here to do is kind of share with you some of the experiences of ours from a couple of decades doing this and give you some of the tactical choices that you can make to account for the recurring upheaval just around the corner.

So, the first thing that I want to do is help you with a quick test. That test is basically trying to figure out how would your business continuity and disaster recovery strategy be compromised if you were to make, in the next year or so, some significant change.

That could be a change in the makeup of your storage systems, in the suppliers (that is, the manufacturers of that gear), or maybe in the topology. That is, maybe you’re thinking of moving from a classical 3-tier SAN to a hyperconverged system. So, put that in the context of how you currently do business continuity and disaster recovery, and also look at how the weather patterns might be affecting those choices that you make.

The weather, in particular, has been shifting quite regularly and surprising a lot of us. I think even the weather forecasters are kind of shaking their heads about what’s going on here. But it’s absolutely true, regardless of what you think the origins are, that the intensity and the frequency of these shifting weather patterns have become much more severe.

And areas where, at one time, you would think, “Wow, we don’t ever see any of those kinds of weird storms and weird behavior,” are suddenly getting hit with major damage: hurricanes and flooding and tornadoes and other natural disasters.

So, that brings it to the forefront and it’s just one of those things, but aside, really aside from acts of God and everything else that we’re concerned with, I really want to focus on more mundane and more self-caused issues that you should be considering.

Now, I’m guessing that some of you are already in the market for some new gear. You may be going through a hardware refresh on your servers and on your storage, and as you look at the possibilities out there, some things draw you. They call it the “shiny object syndrome.”

It’s when you look at how appealing a brand-new capability might be. Now, what I want to do is think about that from the perspective of if you make that change, if you bring in that new piece of gear, how would it affect the way you approach business continuity and what difficulties, in particular, will that transition cause you?

Those are the pointed areas that we want to talk about, especially if you fast forward a year from now, look back at what you did and say, “Wow, did I make some good decisions?” Did you make some good decisions at the time or was that short-sighted? And that’s the essence of what we want to prepare you for, so you can anticipate some of these and you can actually feel good about those decisions a year from now.

Why do we even care about that? Because these changes tend to be very disruptive. Some of the forward-looking modernization steps that you may be taking may have some hidden consequences, and those consequences tend to have a lot to do with the way you currently replicate data to a disaster recovery site, maybe even the way you take snapshots for your backups, and how you do failover from one location to the other.

So, in this picture, what we’re showing is the primary data center using some form of, say, synchronous replication to a DR site. The way you approach that today, how would that be different if you were changing out some of the gear that handles that replication? And how would that change the way you would fall back and restore your primary site?

Because these are ingrained procedures in your BCDR plans, and they can be quite disturbed by any choices you make going forward. So, really what we want to do is prevent you from being in this situation where you kind of say, “Wow, I didn’t see that coming. I had made a casual decision that seemed to be compartmentalized and it seemed like the right choice at the time, but I had not considered what the downstream effects are.” So, we want to put you in a good position to address that and have all of that well understood.

So, for our first poll, what I want to do is just get a quick idea. How many different tools does your organization currently employ to safeguard critical data at a DR site?

This’ll give us a measure of whether you maybe don’t even have a DR site, or you’re dealing with one or two tools, just a couple, or you’re dealing with a handful of different things in order to get different portions of your critical data an extra copy somewhere else, where you can recover should something happen to the primary data center.

So, let me look at the votes that are coming in. The first thing I see is that there’s a—it’s just starting to roll up. Let me give that a second. And I’ll have a second poll later which is also related to this, and so, it’s just kind of helping frame the rest of the conversation, so I know where to steer it.

So, right now what we’re seeing is about a third of you don’t even have a DR site. Wow, that itself is a little bit puzzling, but you have to understand from a budgeting standpoint that may be out of your reach right now. So, hopefully we’ll give you some ideas on how to address that.

Another third or so say you have just one tool; about 28% are saying one tool. There’s about 14% right now saying two tools, and a quarter of you, surprisingly, have several that you’re dealing with.

So, several means you’ve got your hands full trying to figure out how to coordinate these different elements of your disaster recovery plan, because they are possibly sandboxed from each other. But there’s a good answer to help with that.

So, let’s continue on. The first thing I want to point out in this regard is that, for some of you, maybe the reason you’re operating the way you are is that you’re using features that are built into, say, your SAN, your storage area network array. That device, that intelligent storage controller, is providing some of the data protection, the replication services, for you.

So, for those of you that answered two or several, the reason you might have said that is you might have, maybe, two models of storage in your SAN, and each one of those uses a different replication technique. At the end of the day, they’re trying to copy data across to the DR site, but they may be using different protocols or unique proprietary mechanisms to do that.

So, that in itself creates some difficulties. The second thing is that because these features are in the array, the capacities are isolated, and they’re not available to be shared. And so, you have to make some explicit judgment as to which data you should put in one location, which should go in the other, and over time, as the value of that data changes, you may need to juggle those decisions a bit.

So, with that in mind, what DataCore principally is going to offer to you in this case is a way to consolidate those data protection services. So, there’s some uniformity to them. Instead of relying on the capabilities of any specific array or any specific storage device, for example, we basically up-level the place that those functions occur. So, that any storage that you put in the mix, today’s storage and future storage, will benefit from the same operational procedure.

You can perform infrastructure wide BCDR without concern for any one particular brand or model. That’s the essence of what we want to elaborate on. And the way we accomplish that is by having a continuum of safeguards. That is, a collection of capabilities that span the entire broad range of strategies you’re going to use to protect the information.
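Editorial illustration: the idea of moving data protection above the individual devices can be sketched in a few lines of Python. This is a toy model under stated assumptions, not DataCore’s implementation; the names `BlockStore`, `ExternalArray`, `ServerInternalDisks` and `Mirror` are hypothetical and exist only for this example.

```python
from typing import Protocol


class BlockStore(Protocol):
    """Anything that can store and return blocks (hypothetical interface)."""

    def write(self, block: int, data: bytes) -> None: ...
    def read(self, block: int) -> bytes: ...


class ExternalArray:
    """Stand-in for a SAN array of some brand."""

    def __init__(self) -> None:
        self._luns: dict[int, bytes] = {}

    def write(self, block: int, data: bytes) -> None:
        self._luns[block] = data

    def read(self, block: int) -> bytes:
        return self._luns[block]


class ServerInternalDisks:
    """Stand-in for plain servers with internal drives."""

    def __init__(self) -> None:
        self._disks: list[tuple[int, bytes]] = []

    def write(self, block: int, data: bytes) -> None:
        self._disks.append((block, data))

    def read(self, block: int) -> bytes:
        # The latest write for the block wins.
        for blk, data in reversed(self._disks):
            if blk == block:
                return data
        raise KeyError(block)


class Mirror:
    """Protection done above the devices: same procedure for any BlockStore."""

    def __init__(self, *copies: BlockStore) -> None:
        self.copies = copies

    def write(self, block: int, data: bytes) -> None:
        for copy in self.copies:
            copy.write(block, data)


primary, dr_site = ExternalArray(), ServerInternalDisks()
volume = Mirror(primary, dr_site)
volume.write(7, b"payroll")
```

The two back ends store data in completely different ways, yet the mirroring step never needs to know which one it is talking to. Swapping either of them for a third kind of device would leave `Mirror` untouched.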

They range, at the extreme, from critical data that requires zero RPO and zero RTO, which calls for synchronous mirroring. And that synchronous mirroring can occur on a local scale, that is, within just a few feet of each other, where two copies are there. But, more likely in these contexts, it’s more about doing stretched clusters and metro clusters, and I’ll show you some illustrations of that. A companion part of the capability there is the ability to automatically fail over to that other copy, should something go wrong with the primary.

And so, these operate in what’s known as an active-active configuration, so there’s never any interruption in data access, nor is there any data loss should you have a complete failure of one-half of your storage infrastructure.

I also mentioned some things about three-way resiliency, for when you want to go just one level further. And then there’s a second set I put in that bucket, which has to do with regional failures, where an entire metropolitan area may be affected by earthquakes and floods and just more grave danger.

And that’s where measures like async replication and advanced site recovery kick in, where you’re trying to get much further away with your second copy and be able to restore that in a timely fashion. And, by the way, in that case there are some tradeoffs. Those tradeoffs have to do with dealing with a crash-consistent image, rather than that zero RPO and zero RTO. You will have, potentially, some things to recover from and there’s going to be a hiccup in the behavior.
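Editorial illustration: the tradeoff between synchronous mirroring and async replication comes down to when a write is acknowledged. The sketch below is a simplified model, not how any particular product behaves; the class `MirroredVolume` and its methods are hypothetical names chosen for this example.

```python
from collections import deque


class MirroredVolume:
    """Toy model of a volume with one remote copy."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.primary: list[str] = []
        self.remote: list[str] = []
        self.pending: deque[str] = deque()  # writes not yet shipped to the remote site

    def write(self, data: str) -> None:
        self.primary.append(data)
        if self.synchronous:
            self.remote.append(data)   # acknowledged only once both copies exist
        else:
            self.pending.append(data)  # acknowledged now, shipped later

    def replicate(self, max_writes: int) -> None:
        """Ship up to `max_writes` queued writes across the inter-site link."""
        for _ in range(min(max_writes, len(self.pending))):
            self.remote.append(self.pending.popleft())

    def writes_lost_if_primary_fails(self) -> int:
        return len(self.primary) - len(self.remote)


sync_volume = MirroredVolume(synchronous=True)
async_volume = MirroredVolume(synchronous=False)
for record in ("order-1", "order-2", "order-3"):
    sync_volume.write(record)
    async_volume.write(record)

async_volume.replicate(max_writes=2)  # the link has only caught up on two writes
```

If the primary site were lost at this moment, the synchronous volume would have nothing to recover, while the asynchronous one would be missing whatever was still queued. That gap is the non-zero RPO the speaker refers to.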

And then there are other steps like continuous data protection and snapshots, which basically give you a well-known point in time. CDP may be new to a few of you. It is a way to create something like a DVR, a digital video recording, of any I/Os to critical data, which you can rewind back in time.

So, let’s say you were hit by something like ransomware. You could actually turn back the clock on your data and go right to the point where the attack occurred and restore the view before the data had been encrypted. And so, you can kind of thumb your nose at the folks who are trying to take you for ransom there. So, interesting capabilities to pursue.
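Editorial illustration: the DVR analogy can be made concrete with a small write journal that keeps every write along with its timestamp. This is a minimal sketch of the general idea only; `WriteJournal` and `view_at` are hypothetical names, and the timestamps and data are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WriteJournal:
    """Toy continuous-data-protection log: every write is kept with its timestamp."""

    entries: list = field(default_factory=list)  # (timestamp, block, data)

    def write(self, timestamp: float, block: int, data: bytes) -> None:
        self.entries.append((timestamp, block, data))

    def view_at(self, timestamp: float) -> dict:
        """Rebuild the volume as it looked at `timestamp` by replaying earlier writes."""
        volume = {}
        for ts, block, data in sorted(self.entries, key=lambda entry: entry[0]):
            if ts > timestamp:
                break
            volume[block] = data
        return volume


journal = WriteJournal()
journal.write(100.0, 0, b"invoice-data")
journal.write(110.0, 1, b"patient-record")
journal.write(200.0, 0, b"ENCRYPTED")  # ransomware hits at t=200
journal.write(201.0, 1, b"ENCRYPTED")

clean = journal.view_at(199.0)  # rewind to just before the attack
```

Because nothing is overwritten in the journal, the encrypted writes do not destroy the earlier state. Choosing a rewind time just before the attack yields the clean view.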

What this means is that I can maintain, or you can maintain, the same BCDR process while you refresh gear. We know that refreshing the hardware and in some cases, firmware and everything else, can be a fairly volatile situation, despite all the best practices.

So, what we’re basically saying here is we have developed mechanisms throughout the years that allow you to pull out old gear, replace it with new—in terms of the storage makeup—and do that behind the scenes, on a work day, at normal operating time, without having to take down time. And all of that can happen, even if you’re choosing a different manufacturer, different model of product that was incompatible with a previous one.

So, the migration of the data from the old to the new occurs in the background, non-disruptively. You can do this as frequently or as infrequently as your organization requires, but the point is your operational procedures remain constant, even though you’re choosing a different place to store your data. So, this is just the completion of that animation.
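Editorial illustration: a background migration that keeps serving I/O can be modeled as a block-by-block copy during which incoming writes are applied to both the old and the new storage. This is a simplified sketch under that assumption, not a description of the product’s internals; the function `migrate` and its arguments are hypothetical.

```python
def migrate(old: dict, new: dict, incoming_writes: list) -> None:
    """Copy blocks from old to new one at a time, with host writes arriving in between."""
    writes = iter(incoming_writes)
    for block in sorted(old):          # block list captured when migration starts
        new[block] = old[block]        # background copy of one block
        write = next(writes, None)     # host I/O keeps arriving meanwhile
        if write is not None:
            blk, data = write
            old[blk] = data            # still served from the old gear
            new[blk] = data            # and mirrored to the new gear
    for blk, data in writes:           # writes that arrive after the copy pass
        old[blk] = data
        new[blk] = data


old_array = {0: "a", 1: "b", 2: "c"}
new_array = {}
migrate(old_array, new_array, [(2, "C2"), (0, "A2"), (5, "f"), (1, "B2"), (7, "g")])
```

When the function returns, both copies hold identical contents, including the writes that landed mid-migration, so the old gear can be retired without the hosts having seen an interruption.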

Some of you are thinking, “Well, okay, SANs are what I’ve been running, but I’m starting to look at things like hyperconverged systems. How would this help me in cutting over to that?”

Part of the beauty of what DataCore’s doing, again by up-leveling where these data protection services occur, is that topology doesn’t matter. So, in several of our accounts, what you’ll find is customers are making their first steps into HCI by using hyperconverged at their disaster recovery site, or even at the other end of their metro clusters. So on the one hand, they’ve built an environment with classical external SAN arrays. Yet, they’re mirroring that information to a new standup, which includes just basic servers with internal storage running the DataCore software, and that is also the host for the virtual machines that will fail over.

It’s a very dramatic cut from a visual standpoint, from the optics, yet operationally, they look identical. The same failover techniques, the same recovery techniques apply as if we had like systems on both ends.

Now there are some specific items to avoid. I’ll call them, “no-no’s” here, but it basically says, “Given what you know now, you should, at all costs, avoid relying on a hardware array for your data protection.” Because that’s going to lock you into certain behaviors that are going to create the upheaval downstream.

So, any time you do that, you undermine the utility and you can’t cross that box boundary, and by box boundary I mean the physical hardware where that runs. And you also tend to prematurely shorten the life of that equipment, because when new gear comes into place and it doesn’t use the same technique for replication, then you basically end up having to make a choice or fall into the camp of, “Oh, I’m doing it two different ways or three different ways to pull this off,” and that’s not manageable in the long run.

The second no-no is choosing a strategy for DR that is specific to a topology. So, it’s either saying, “Well, I do this always and I expect always to have a SAN underneath me,” or “I expect it to be strictly hyperconverged.” Those can be, as we say here, equally short-sighted. They don’t give you the opportunity to clearly segregate the role of the storage versus compute, and you can be affected by so many variables in terms of quality of service. So, kind of freeing yourselves from those constraints is the essence of our message.

I want to give a good example of a customer that’s done this and has done this early on. They’ve been doing this for over nine years. It’s Maimonides Medical Center. They do critical patient care in the New York area, and they, obviously, are not in a position to accept any planned or unplanned downtime.

So, at one point, the entire data center was at one location in the hospital. What they then did was split that into two active-active sites, with one at the main hospital and the other at the MIS facility some blocks away.

And throughout that, they maintained two active copies, data in two places as we describe it, so that whenever there was a maintenance activity or something broke at one of the sites, the other would seamlessly take over. They’ve been doing this without any storage-related downtime for nearly a decade. Might be a decade by now.

And it’s a sizable configuration of one petabyte. There’s a large community of clinicians, medical staff, and doctors that depend on this and run quite a variety of applications. And, basically, they’re saying this is the way they run their ship. Some examples of this are a little bit more visual. I think if you’re more on the technical side, on the networking side, you can appreciate the second picture.

This is for a customer that runs different generations of Dell gear. At the primary site, the West site, they’re running Compellent, and as their metro cluster partner they run EqualLogic. Normally, those two couldn’t talk to each other; in fact, neither of them has the same capabilities. So, what DataCore does is rationalize the use of this and standardize the tools that they use to make the copies across them, irrespective of what happens to be underneath.

They happen to have made that choice, and when the next thing comes into the picture, 3PAR or something else they want to put in here, if they were switching out to HP or if they wanted to stay within the Dell family, those would be choices they could make at any time because it’s not germane to their procedure. It’s just the place where data lands at the end of the day and from where it’s retrieved.

This also gives you a little bit more of a picture of some of the redundant paths you would take. So if you’re concerned about any single point of failure, the cable structure, in terms of the inter-site links and the links to the switches, has to be well thought-out. And that’s part of the best practices that our authorized partners can provide you with.

Here’s another configuration. This is also in healthcare, so those tend to be much more paranoid about taking care of data this way. In this case, this hospital, a medical center, runs three sites. So, the two in Manhattan, on the right, are the principal crunching centers. And then they have their fallback site, which is in Secaucus, New Jersey.

This was a very important choice that they made, because when Superstorm Sandy came through here, it flooded the Manhattan area and also caused huge power outages, with everything else that was going on. Those sites on the right were able to fail over to the higher ground, the Secaucus colo, which was sitting high and dry, fortunately.

Now, for us boaters, not a good thing, but certainly for a data center, it’s a really good situation, and that kept their business running. That kept their critical data services running. Obviously, they had to shuffle a bunch of other things around to do that, but from an IT perspective, they were well-protected.

And once things were restored at those Manhattan facilities, they could quickly just flip the switch and come back to that. So, in this case, the two are relying on a common site and they’re both doing the same thing: synchronous mirroring to that site in a reciprocal relationship.

More and more, in fact, what I’m finding is the need for a third copy. And the reason for the third copy is interesting. There are two ways this could play out, and I’m speaking specifically about a third nearby copy that is also synchronously mirrored.

So, in this picture I have the west node, in the middle the east node, and the third copy is on the north node, all in a campus environment, within the same metro area. When we need to take one of those systems down for some period of time, for preventive maintenance, an upgrade, anything else, what tends to happen if you just have two of these systems, depending on how they were sized, is that if west only relied on east to take over, east might be a little bit sluggish if it wasn’t well-proportioned to handle the excess load.

And at that point, you basically have this one unit, and you’re a little bit more vulnerable, because if that baby were to go down as part of some rolling failure, then that would be a casualty as well. So, in order to maintain a highly available environment, even when you’ve taken one of those nodes down, you use this three-way resiliency.

And that gives you kind of that peace of mind about, “Hey, yes, I can tolerate. I’m willing to do the preventive maintenance more frequently, knowing that I’ve always got that shored up by the other two nodes, and I know that my users are not going to be finding any—we’re not going to suffer from a performance standpoint while I make that cutover.” Really good approach to keeping everything as you would expect.
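Editorial illustration: the argument for a third synchronous copy is simple arithmetic on how many copies remain while one node is in maintenance. The sketch below just counts copies; the function names and status strings are hypothetical and not taken from any product.

```python
def copies_remaining(total_copies: int, in_maintenance: int, failed: int) -> int:
    """Number of synchronized copies still online."""
    return max(total_copies - in_maintenance - failed, 0)


def status(total_copies: int, in_maintenance: int, failed: int) -> str:
    remaining = copies_remaining(total_copies, in_maintenance, failed)
    if remaining == 0:
        return "outage"
    if remaining == 1:
        return "running, but no redundancy"
    return "running, still redundant"


two_way_during_maintenance = status(total_copies=2, in_maintenance=1, failed=0)
three_way_during_maintenance = status(total_copies=3, in_maintenance=1, failed=0)
two_way_rolling_failure = status(total_copies=2, in_maintenance=1, failed=1)
three_way_rolling_failure = status(total_copies=3, in_maintenance=1, failed=1)
```

With two copies, planned maintenance leaves a single unprotected node, and one more failure is an outage. With three copies, the same sequence of events still leaves data accessible.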

I did want to also share with you some of the visuals, in terms of what’s going on behind the scenes, from an administrative console standpoint. And this is the perspective that one gets when one is in charge of the entire kit and caboodle.

From here, we are orchestrating the replication that occurs, and we’re able to determine what systems may have been put in maintenance. We can look at the performance behavior that’s occurring and look at that in a device- and supplier-independent way, as well as a topology-independent way.

We have a 360-degree view of our data protection strategy, without having to deal with the specific nuances of any one of those models or brands. That’s the real important part of this thing. We learned this once, we continue to apply that same technique throughout even as things shift.

What this ends up looking like, as you elaborate your own infrastructure, is that you start moving from what may have been silos with different topologies to a uniform way to standardize those BCDR practices, despite the diverse elements that make up these different pockets.

So, the centerpiece of this picture is what I want to draw your attention to. That might’ve been your system of record, for example, where your major tier-1 apps are running. It may be using the classical, big external SANs serving both bare-metal machines and virtualized machines. And more often now, I’ve also seen some containers in terms of development efforts, and those would all be proper clients that would be consuming storage from those external storage areas.

In this case, DataCore is the leveling layer, the storage virtualization layer, that is pooling those external storage arrays and presenting them as a collective view of the path.

On the left-hand side, you may also have selected certain workloads, let’s say VDI, virtual desktop infrastructure, where you want to test what these hyperconverged clusters look like. You may have made that choice, and we’re still using the same DataCore software there to facilitate how we do our BCDR.

Same for our secondary storage. What you see at the DR site and the remote office and branch office is a similar software function, or the same software function, but with slightly different topologies. The ROBOs would look more like a hyperconverged cluster. The DR site needed to be a little beefier in this case. It looks more like a converged infrastructure, where there is still the separation between the external hosts that are consumers of storage and the providers of storage below them, but we’re not relying on big arrays to pull that off.

We’ve basically instituted the local storage and pooled those resources, or just external trays, but we don’t need intelligent controllers there anymore, because that function is being subsumed by the DataCore capabilities, which I’ll show you in a graph later.

Essentially then, what we’re saying is that the software is giving you a choice of topology-independent ways to implement this, and those may change over time. You may have started on the left of this, where you’re pooling and virtualizing the storage arrays. Then you kind of collapse that as those arrays come off lease or their useful life expires. You replace them with just internal storage inside these servers, x86 servers, which are acting as storage controllers.

And then in the third wave of your transition, you may actually bring in the virtual machines and the workloads right on to those same servers. You may have to beef them up at that point, but you can certainly move gracefully from the left to the right with your objective of trying to reduce the complexity of your infrastructure, while maintaining the same set of practices.

And then you might later come back and say, “Wow, that wasn’t such a good idea. I actually wanted to separate my host from my storage and kind of create a hybrid.”

The way that we provide the software to allow that is we essentially have three editions. There’s the enterprise class, the all-inclusive, richest feature set, and then we have kind of the mid-range, the ST edition, as well as an edition targeting the large secondary storage opportunities that you have in the adjacent community here, big file servers, things like that, where you’re not going to want to spend the same price per terabyte as you would for your most critical workloads. So, any of those licenses, and these are all software licenses that you run on an x86 server, can play in any of these topologies.

Here’s a look at the broader stack. So far in this conversation we’ve been focusing on the data protection services portion of this, things like synchronous mirroring, replication, snapshots and CDP, but there are a number of other services that you would expect and need from a storage infrastructure, whether they have to do with provisioning, with charting and alerting, with understanding from an analytics standpoint what your future capacity requirements are, or maybe with adjustments to your quality of service.

So, all of those are an intrinsic portion of what the software stack brings, and it doesn’t matter how you come into this, whether you’re coming through Fibre Channel or iSCSI, or what kind of storage protocol you are using on the backend. It may be NVMe for your really fast low-latency requirements and you’re going to hook up some of those NVMe flash cards directly on these DataCore servers. Or you may be using direct-attach disk or even having some of that capacity sitting in the cloud. It matters not what you’ve chosen; you’re simply tasking those different nodes with specific responsibilities based on the workloads you expect from them.

With that, I wanted to have a second poll. Now that you’ve kind of seen the bigger picture of what we’re trying to get at, I would very much like to understand which of the following changes to your data storage infrastructure, if you were to make an assessment right now, would most upset your BCDR strategy.

(A) What if you were suddenly told, “Hey, in the next six months, we’re convinced that we need to move our DR location to the cloud”? Or, for those of you who didn’t have a DR location, “We’re going to put a DR location in the cloud.” How would that affect you?

Or might it be that you’re thinking about adding an all-flash array to your existing SAN? Evaluate whether the techniques that you’re using for replication would change dramatically as a result of that. Not conceptually, I mean actually. How does your BCDR plan change? And if you were to suddenly cut over to a hyperconverged system, maybe even have that alongside your current storage, would that operate very differently? I’m guessing it would, so I would like to get an idea from you.

So, Carlos, I’m not seeing the second poll results. I don’t know if you can see it.

Carlos Nieves: I can see them, Augie and, right now, we have about 68% of the voters have chosen “Move your DR location to the cloud.” About 10% “Add an all-flash array to your existing SAN.” And the rest “Introduce a hyperconverged system alongside or in place of your current storage.”

Augie Gonzalez: Okay, so what I want you to walk away with after we’re done here today is to kind of write down for yourselves: the fact that I’m having to make that move, how consequential is it? How does that ripple across the procedures and the training and the auditing that I need to do on my BCDR plan, and how can I prevent that from being such a catastrophic change in the way I do my safeguards? Thanks for those inputs, by the way.

So, DataCore is here to help you. It’s here to help you on several different fronts. Today, we’re very much focused on the business continuity and disaster recovery as part of today’s webcast, but really any time you’re thinking about making a storage expansion or a storage refresh, that may be a good trigger to reconsider how you’re going to approach your techniques for keeping safe copies somewhere else.

We are also very much a part of organizations that are going through a consolidation effort, where they’re trying to rein in the variety of gear, actually pool those resources and deal with them in a uniform way, so we can help you there. We can also help at the edge, where you’re running remote offices or branch offices in addition to the main data center and trying to create a reciprocal arrangement, using either the main data center or those ROBO facilities as the DR site.

Or at the edge, you need to send some of that data back. It uses the same techniques, so we can be of really good assistance on that front. Specifically, for you as individuals, there are several responsibilities that we can help you with.

So, if your mandate is to ensure operational continuity, we can help you do that even as hardware ages and as new gear takes its place. And I think that probably would be job one, and part of the tools that we offer to do that is the ability to migrate the data behind the scenes, so as this gear is replaced, we can make that shift, evacuate the old gear, move the information to the new one, and nobody has to know that you went through that.

Even to the point where that gear shows up at a different location: not just the supplier and the model, but the location has also changed. And you do all of these things while you sidestep any points of failure and outages, so you may, in fact, have severe outages as part of your scenario, in terms of one building on campus, but systems continue running.

There are some side benefits as well, and that’s the ability to pool capacity across what might otherwise be treated as isolated devices. So, rather than saying, “Well, I’ve got this island of storage and I have made some explicit decisions about what applications I use on that,” I can actually aggregate the resources at my disposal, no matter how varied, and treat that as a pool of resources that I can draw from based on peaks and valleys in my consumption.

And last is the ability to actually authenticate the consumers. It’s determining who deserves the high-priority service, who we’re trying to prevent from being rogue consumers, and who gets the middle-of-the-road service. All of those are part of the intelligence capabilities here.

If there’s anything I want to leave you with, it’s this: don’t let these hidden dependencies put your BCDR plan at risk. Whether that’s location, topology or suppliers, keep an eye on that. Keep an eye on the gravity that any switch there might be creating for you.

And really, from our standpoint, this is the assessment that we’ve heard from our customers. For example, with the City of Carmel, there is a really good webcast from them. They’ve been deployed using DataCore since 2010. So, that’s what? Nine years? And zero downtime throughout that, with all kinds of changes, obviously, occurring through that period.

So, think of us as an independent software vendor that’s uniquely capable of preserving your well-crafted BCDR strategy, yet allowing you to modernize and return money to the organization because you’re not having to throw away things that suddenly have been ripped and replaced.

And, if you wanted to share some of this information quickly with some of your colleagues who may not be ready to sit on a webcast for the 45 minutes or so that we’re working with here, give them a quick look at these measures.

And there are several resources there, including some two-minute videos that give you an idea of exactly what we’re talking about and illustrate what we mean by these cascading dependencies, so I very much encourage you to do that.

And with that, I think what we’d like to do is go through our questions, see what we’ve got here. And let me run through those, Carlos, pardon me.

So, the first one is, “Does it work well with Office 365?” That’s an interesting one. So, Office 365 is hosted software, and so there’s already an infrastructure behind the scenes that’s taking care of that.

You would be relying on the SaaS supplier for that, and that’s kind of outside of your control in general if you’re running the hosted one. If you happen to run Office locally on local servers, then absolutely it would help.

This is the second question, which is, “Are there any special storage requirements to use your software?” I read that as, “Are there any special compatibility things that we care about that might preclude us from doing that?” And no, the software basically expects to talk to storage, standard storage using the familiar protocols, whether those are SAS or SATA or NVMe. Natively, most of those things are just basically using the SCSI protocol to convey that.

We look like a host to that storage, and that’s the traditional way that connection occurs. It’s completely compatible, so we don’t have to go through special qualifications to do that.

There’s one more question in here, “How do you price the software?” I gave a little hint to that. It’s basically capacity. It’s how much capacity is placed behind the responsibility of DataCore, how much usable capacity we can draw on, and based on that we charge a price per terabyte.

The more volume you put under our control, the lower the price per terabyte, and then those three editions that I showed you earlier, the EN, the ST and the LS, just set the feature capability. In some cases everything is built in; the others have a more tailored feature set that’s applicable for mid-range, and the one where performance is not so much a factor is the LS. That covers it.

I don’t see any other questions. We’ve got a couple more minutes here if you’d like to ask anything, and I’ll turn it over to Carlos.
