Does your team need to manage billions of files, petabytes of data or thousands of tenants with limited staff and resources? Do you require the scalability of cloud storage secure in your data center? Do you want to provide cloud storage services or find a cost-effective way to keep scaling data sets online and accessible?
Object storage is the solution for you!
However, there are significant differences between various object storage solutions. How do you develop the framework for your evaluation? What questions should you ask internal stakeholders and object storage vendors, and what should you consider when testing workflows?
In this educational webinar, IDC’s Eric Burgener and DataCore’s Ryan Meek and Adrian Herrera discuss this topic with an emphasis on providing a framework for evaluating the differences between object storage solutions.
Adrian: Hello, everyone, and welcome to our webinar on evaluating object storage solutions. My name is Adrian Herrera, and I’m going to be acting as the moderator today. I have Ryan Meek with me. Eric Burgener is having some technical difficulties, but he will try to join us in a few minutes.
And with that, why don’t you introduce yourself, Ryan? Give a little bit of background. I’ll give a little bit of background about myself, and then we’ll get started with the content. So take it away, Ryan.
Ryan: Sounds great. Thanks, A.J. And thanks, everyone, for joining us today. I am Ryan Meek. I am a technical director at DataCore Software, in sales in the Americas. I actually started out my career in development, and along the way in the storage industry I’ve been a developer, a tester, a product manager, and a trainer. So before going into sales and before carrying a bag and being on the technical side, I came in through multiple roles, to say the least. Thanks, A.J.
Adrian: Yeah, so you have, you know, great experience, all the way from developing the actual object storage software to implementing it across various environments and workloads. So I am looking forward to getting your opinion on a lot of the content that we’re going to cover. And my name is Adrian Herrera. I have hosted a lot of these thought leadership webinars at DataCore.
I actually started my background more on the media side, on the digital audio side, where I was doing product management and then I, you know, parlayed into consumer storage services. I did a little bit of time at Yahoo and then moved into enterprise Cloud storage and after that into object storage. So I’ve had a number of different product management, product marketing roles, all relating to media, consumer storage, and also enterprise storage.
So here’s what we hope you learn in this webinar. You’ll learn the evolving requirements driving the need for object storage; object storage architectures, deployment and management options; criteria to consider during your object storage evaluation; and the framework and timeline for a typical object storage evaluation. So we’re going to, you know, try to keep this very high level, and we’re going to try to keep this very informative and conversational.
So please feel free to ask your questions throughout. You have the question dialogue in your GoToWebinar or GoToMeeting control panel. So please, feel free to ask questions throughout this presentation.
And this is part of a series that we’re hosting at DataCore. It’s all about object storage and object storage in relation to other types of storage. And I always like to set the foundation by just presenting the three types of storage architectures. We have block, we have file, we have object. You know, from a high level, those are the three different types of storage architectures.
On the block side, you know, block is made for high transaction data, data with a high rate of change. It can be, you know, viewed as a blank volume, just a blank disk for operating systems, databases, and those types of applications to run on. In the middle we have file architectures, and file architectures are really what people are most familiar with. They’re very similar to the way you file data in a file cabinet. You have directories, you have trees, you have file names.
And I think this is the way that most people in the world, when they think about storage, that’s where their mind goes, data is stored on file systems or in file systems, I should say.
And then you have object storage. And object storage has been around for quite some time. It’s been around for about 15 years. Most people interact with object storage on a daily basis, even if they don’t know it. All major Cloud storage solutions are based upon object storage, but now we’re seeing the momentum and the popularity of object storage increase.
And we’re going to cover why that is in this webinar. We’ll go over the reasons and the evolving requirements for object storage and why it’s increasing in popularity. So Ryan, do you have anything to add to that?
Ryan: Yeah, I would just note there, you know, on the object side, [Swarm 00:04:55] has been a first class, purpose built object storage system for over 15 years now. And when we would originally be evaluating object storage solutions, well, A) there just weren’t many, but B) we would spend much of the time explaining essentially what’s on this slide. So this is very much the way the understanding started at the beginning.
People are familiar, mostly, you know, storage folks are familiar with Block. End users are familiar with File. And then Object was really about communicating, you know, how it’s different, what’s important to an object system, the different dimensions to be aware of, like the number of objects, the way that capacity can be added, the ease of use tradeoffs with, you know, systems that are purpose built for one particular use case.
And so we’ll just look forward to getting into a lot of those topics in this presentation.
Adrian: And then here’s something that we need to point out, the difference between object storage protocol and object storage architecture. And the reason is that you can have file protocols on top of object storage and you can have object protocols on top of file system based storage. And you’ll notice that we put some indicators there. On the file system side you have POSIX compliant SMB and NFS, and limited scale S3.
And on the object side you have REST, your standard RESTful interfaces, which most object storage solutions have; well, all object storage solutions have. Their primary interfaces are RESTful interfaces, usually based on HTTP. Now most object storage solutions support S3. But the difference is in the underlying architecture. So even though you have a protocol, or a way to interface with the underlying storage infrastructure and the way that the data is actually addressed, if you have those protocols on different architectures that weren’t optimized for those specific ways to interact and communicate with the addressing mechanism in the underlying media,
you do have some issues, performance issues, throughput issues, if something wasn’t designed specifically for that type of communication. So that’s why – go ahead.
Ryan: Oh, no. I would say real specifically, you know, with file systems, it’s difficult to add capacity and it’s difficult to add, you know, many readers and writers to the same files, especially because they are POSIX compliant. You know, you tend to see file locking. You tend to see usage that depends on file locking. You know, many people reading and writing from the same files is difficult. And so when you overlay an object system on an underlying file system, everyone here knows what happens to a file system when you put 40 million files in one directory, for instance: it just gets really slow.
And you know, that is one simple way that that can affect performance. And then, you know, in the other direction, an object system, what’s noted here is limited SMB, NFS, and that’s specifically because use cases sometimes really depend on either POSIX compliance or file system locking or, you know, user pathways that are more or less built for file.
Now the object system in general isn’t limited in its own way. It’s not limited in capacities, it’s not limited in numbers of objects in the same way. But if someone, you know, has a purpose built NFS application and is, you know, used to the performance of a purpose built NFS server, where you’re essentially sending bits and the bits are landing on a disk and then the write is complete, you know, when that’s compared to a protocol translation layer in the middle, and then the actual bits ending up in the object store, it feels like more latency; it is more latency.
So they’re not, it’s not a one to one comparison in each direction and it’s just really good to understand which protocols are being used and if an object system is layered on top of a file system, you know, it’s really good to be aware of the limitations it will inherit.
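To make the protocol translation point concrete, here is a minimal sketch (not any vendor's implementation) of how a file-style gateway might emulate a directory listing on top of a flat object namespace. The function name `list_directory` and the sample keys are hypothetical and chosen only for illustration. The point is that an object store has no real directories, so every "directory" listing becomes a prefix scan that a native file server would not need to perform.

```python
def list_directory(keys, prefix, delimiter="/"):
    """Emulate a directory listing over a flat object key namespace.

    keys:      iterable of full object keys (the flat namespace)
    prefix:    the "directory" being listed, e.g. "video/"
    delimiter: the character treated as a path separator
    Returns (subdirectories, files) directly under the prefix.
    """
    subdirs, files = set(), []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue  # key lives outside the emulated directory
        rest = key[len(prefix):]
        if delimiter in rest:
            # Anything past the next delimiter is collapsed into one "subdirectory".
            subdirs.add(rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            files.append(rest)
    return sorted(subdirs), files


# Hypothetical keys in a flat namespace.
keys = [
    "video/2021/a.mp4",
    "video/2021/b.mp4",
    "video/readme.txt",
    "logs/app.log",
]

print(list_directory(keys, "video/"))
print(list_directory(keys, ""))
```

The scan over every key is deliberately naive. A real gateway would rely on an index, but the extra translation step between the file view and the object view remains, and that step is where the added latency Ryan mentions comes from.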
Adrian: Yeah, and the reason we’re bringing that – yeah, we’re bringing that up because when you evaluate object storage solutions, a lot of it is going to be validation of workloads, and you’ll see why in a bit. We’re going to cover that information. And I believe Eric has joined us. Eric, can you hear us?
Eric: You bet. Good to be here, gentlemen.
Adrian: Yeah, likewise. Eric, do you want to just quickly introduce yourself. You came just in time, because the next slide is an IDC slide.
Eric: OK. So yes, I’m Eric Burgener. I’m a research VP in the infrastructure systems group at IDC. And I focus on block, file, and object based storage there in my research.
Adrian: Yeah. And with that, Eric, so we just covered the three types of storage architectures. We covered, you know, the protocols vs. storage, and it’s good to set the stage because, you know, at IDC you had some very interesting results from a survey you sent out and we’re going to cover that in a bit. But why don’t you walk everyone through how object storage is evolving based upon IDC’s research.
Eric: OK. So yeah, there’s definitely a sea change happening in object storage. And you know, we started to see the evidence of this in the last couple of years. It was really tied initially to what we call digital transformation, which is basically enterprises moving in the direction of much more data-centric business models, capturing a lot more data, analyzing that data, using a lot of new technologies like artificial intelligence, machine learning, etc., to be able to analyze that data.
And that’s really brought in a lot of new requirements, particularly in the storage infrastructure, for other types of new technologies. Solid state storage, we’re seeing a lot more QLC out there, NVMe is becoming a lot more prevalent. And a lot of the reason why people are deploying these new technologies is in direct response to the requirements of these new workloads that they’re putting in place as they move towards these much more data-centric business models.
Now on the architecture side, we’re also seeing some changes happen there. We’ve been tracking software-defined storage as a market for, at this point, not quite 10 years, but we’ve certainly seen extremely rapid growth in that space, and we’re seeing a lot of cases where, when legacy systems come up for technology refresh, enterprises are moving towards software-defined scale-out architectures. In the file and object space, we’ve also seen a move towards a unified namespace. In other words, a separate file based system and a separate object based system operate under a single namespace that can be searched very simply, and it also eases the movement of data back and forth between those two, which allows you to optimize data placement based on what the requirements are.
Does it require low latency, high performance? Are we looking more for, you know, lower cost, massive capacity, that kind of thing? We’ve also seen on the unstructured storage side, so file and object, systems supporting a lot more access methods. And one of the things that’s driving that is there’s been a significant interest in moving towards denser workload consolidation. So you know, it’s interesting, some of the primary research we did last year around how enterprises are responding to digital transformation indicated that about 70 percent of the enterprises that are currently going through it will be modernizing their server, storage and/or data protection infrastructure just within the next two years.
They see this as a critical success factor in their ability to digitally transform their companies. And so this idea of if we’re getting rid of older systems, can we move from having more storage silos to fewer storage silos, what kind of capabilities do we need in these new systems to be able to consolidate workloads? And multiple access methods is definitely one of those areas.
Containerized environments: certainly, as more enterprises move towards DevOps and they’re looking at re-architecting applications around microservices based designs, making them more Cloud native, this is becoming a key issue, and certainly storage systems going forward will need to support those environments through things like CSI and Kubernetes. Hybrid Cloud: this basically is the way that enterprises are building IT infrastructure these days. There’ll be some workloads where it’s more effective to have them in the Cloud. Others you keep on-prem, and you need a unified management console that gives you the observability, the visibility across that entire IT infrastructure, regardless of where it may reside.
So these are some of the things that are coming out in the architectures. Another key thing I’ll mention on the object storage side is the infusion of enterprise class capabilities that traditionally weren’t in object storage. As customers look to put new kinds of workloads in those environments, you get an object storage platform that supports an NVMe based Flash tier. This gives you the ability to deliver lower latencies, which gives you the opportunity to put workloads there that in the past you might not have put in object storage, and potentially get rid of another storage silo.
So with that kind of consideration, features like snapshots, different kinds of replication, different kinds of erasure coding options, those things all start to become more important as customers need capabilities to manage the data they’re placing. So all of this is really resulting in a sea change going on in the object storage space these days.
Adrian: Yeah, I think it’s interesting to point out, you classify this as data-centric. And if we were to point to the architectures and technologies, I would say that, you know, object storage and block storage are really the two architectures that were developed and architected specifically for data. I think Ryan touched upon a lot of the points in object storage, why it was developed and architected specifically for data.
That leads to these new sets of object storage workloads. I think some of these, you know, can be seen as traditional. Some of them are a little newer. But you know, these are all use cases that we have certainly seen in our, you know, over a decade of experience with object storage.
Can you walk everyone through these and then maybe, Ryan, if you want to comment on some of these from an actual use case perspective, what we’ve seen in the field?
Eric: Yeah, you bet. So, you know, traditionally object has been generally used for things like backup, archive, DR. That’s sort of the classic case where what you really cared about was low dollar per gig cost, massive capacity, and often immutability of the data. Some of these newer workloads that are being put in place look, from an analytics point of view, more like the older high performance computing style workloads. And there is also the ability to move NAS environments and run them on top of object now, because of object’s newfound ability to deliver lower latencies and provide some of the other capabilities that traditionally you had to get a dedicated filer for.
So this is a perfect example of workloads that are being consolidated, in many cases onto object platforms. You know, I mentioned data earlier, and that people are collecting significantly more data because of their digital transformation. So going forward, 70 to 80 percent of all the data that will be captured and retained for analysis and other types of purposes will be unstructured data. So that is going to be file and object. Block is not a growth area. That’ll certainly continue to stay there on the primary side, but all of the growth is in the unstructured arena with file and object.
And given the newfound capabilities of object, you know, there’s a real serious question that could be asked here: how many workloads can we run in these environments, where we can meet the availability, fast recovery, and performance requirements that maybe traditionally people don’t associate with object platforms?
So again, you know, this is, especially these top two, these are the – I mean top three – these are the newer workloads being deployed or created as part of digital transformation. And you know, object is becoming an increasingly attractive platform for consolidation [unintelligible 00:18:36].
Ryan: Yeah, I think that’s all right on, Eric. And you know, we see that in big data and analytics. When people started talking about big data, a lot of times, you know, they meant Hadoop, or they meant Hadoop file systems. And that turned into data lakes, and it seems like it’s really gone away from those ideas and more into, you know, large object systems with more and more enterprise feature requirements, I think as you were just noting in the previous slide.
Also, you’re right, AI and ML are driving much of this growth. And those aren’t necessarily independent use cases in the same clusters, but they definitely can be. You know, on the traditional [NAS 00:19:29] use case side, we just see more and more data where, for no other reason than NFS was a great way to access it, there’s a big legacy data set that needs to be incorporated, and people are also planning for the future.
So you know, NFS as well as REST access. The high performance computing space we see exploding. The data protection space as well, although, you know, we don’t see it in quite as much proportion as we used to. And then I would say the only other big one I would note would be large scale video, which is still really interesting for object storage.
And I don’t know if that showed up in your research as much. That would be a question.
Eric: Well, you know, so that’s one of – I would put sort of the media and entertainment style workloads in the HPC arena just from the point of view that, you know, often what they’re about is streaming performance, where we need, you know, high [serial conflict 00:20:31], and the scale out object platforms actually are great architecture for that.
Ryan: That’s what we see, you know, especially because HTTP is kind of the built in protocol. And then once you can stream on HTTP, you know, you can reference a very large collection, the sort of, you know, needle in a haystack kind of queries, you know, very quickly, no matter how long it’s been since content has been viewed. So we definitely see these use cases picking up.
Eric: You know, on that point you just made, I think one of the real advantages that object brings to the table is this much richer meta-data that you have with that data type. And you know, searches, particularly when you’re trying to define specific data sets that you’ll be running analytics on as an example. This ability to, you know, you’ve got all that information about the data itself that really lets you assemble those sets, those new data sets much more rapidly when you need them. So I think, you know, there really are some interesting capabilities within object for these newer workloads.
Ryan: Absolutely. That’s a great point, and I think what that means for evaluating object storage is, you know, pay attention to the meta-data system, and you know, just make sure it scales along with the object system, which may or may not have had more effort put into it. You know, early on we saw meta-data systems based on relational databases, and I can very comfortably say that’s not really the way you want to go with this.
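As an illustration of the meta-data point Eric and Ryan make here, the sketch below assembles a data set by querying object meta-data rather than by walking paths. Everything in it is an assumption for illustration: the `StoredObject` class, the `select_dataset` helper, and the meta-data fields are hypothetical, and the linear scan stands in for whatever index a real meta-data system would provide.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A hypothetical object: a key plus free-form descriptive meta-data."""
    key: str
    metadata: dict = field(default_factory=dict)


def select_dataset(objects, **required):
    """Return the keys of objects whose meta-data matches every required field."""
    return [
        obj.key
        for obj in objects
        if all(obj.metadata.get(name) == value for name, value in required.items())
    ]


# Hypothetical catalog; the field names are illustrative only.
catalog = [
    StoredObject("cam01/0001.mp4", {"sensor": "camera", "labelled": "yes"}),
    StoredObject("cam01/0002.mp4", {"sensor": "camera", "labelled": "no"}),
    StoredObject("lidar/0001.bin", {"sensor": "lidar", "labelled": "yes"}),
]

training_set = select_dataset(catalog, sensor="camera", labelled="yes")
print(training_set)
```

With billions of objects, a scan like this is exactly what would not scale, which is why the meta-data system deserves its own scrutiny during an evaluation.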
Adrian: Yeah. And we’ll definitely cover meta-data coming up here. You both have thrown out the phrase large, right, and scaling. And I just want to define that for all of the listeners out there. When you say large data sets, when you say scaling data sets, you know, what do you mean? So Ryan, maybe, I know you’ve had a lot of experience here. What did you see 10 years ago? And what are you seeing now, just from a capacity and file count? Like average capacity, average file count? Just to set the foundation.
And then, you know, Eric, you can go over this slide, which kind of shows the growth of object storage and the trajectory of where it’s going. But Ryan, why don’t you define that just so the viewers out there know what you mean when you say large and scaling?
Ryan: I mean in the beginning we would think of large as 100 to 250 terabytes, and we were always designing for petabyte scale. We were then happy when single petabyte opportunities started coming around, you know, more than a few years ago. Probably around five to eight years ago. And then more recently, double digit petabyte installations are really more and more common. So you know, the system should be comfortable at multiple petabytes.
It should not be, you know, optimized or, you know, the performance limits shouldn’t be below, you know, five, six, seven petabytes at least.
Adrian: And how about object counts and file counts?
Ryan: I mean it gets interesting where the object counts really start to make file systems slow down, and we see that in the hundreds of millions range. You know, for sure you don’t want to put 10 billion, 50 billion objects or files into some kind of legacy file system. And you know, you need to really think about how the system will behave under that scale. But we see that very regularly these days.
Adrian: Yeah, we throw out numbers like billions and it sounds, you know, like a very, very large number. And who can have a billion files, a billion objects? But it’s becoming quite common for many organizations to have billions.
Ryan: It happens faster than they thought it was going to happen.
Adrian: Yes, yes. And Eric, that kind of is a good segue to growth that we’re seeing. We’re seeing a lot more capacity, we’re seeing a lot more files. And that’s really coming out in what you see here, correct?
Eric: Yeah. So you know, always in the unstructured space, object has been roughly twice the size of file. And in fact it’s been growing slightly faster. So there’s decent growth rates for both. The file side, you know, over this five year forecast period, is going to be growing at just under 10 percent CAGR, whereas object storage is going to be closer to 11 percent. So they’re relatively close. But clearly object is much larger.
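The compound growth arithmetic behind those figures can be sketched as follows. This is a rough worked example, not IDC's data: the rates are rounded to 10 and 11 percent, the starting sizes are normalized units (file = 1, object = 2, reflecting only the "roughly twice the size" remark), and the `project` helper is hypothetical.

```python
def project(start, cagr, years):
    """Compound a starting value at a constant annual growth rate."""
    return start * (1 + cagr) ** years


# Normalized, assumed starting sizes: object is roughly twice the size of file.
file_now, object_now = 1.0, 2.0

# Rounded rates from the discussion: just under 10% for file, closer to 11% for object.
file_later = project(file_now, 0.10, 5)
object_later = project(object_now, 0.11, 5)

print(f"file after 5 years:   {file_later:.2f}")
print(f"object after 5 years: {object_later:.2f}")
print(f"object/file ratio:    {object_later / file_later:.2f}")
```

Under these assumptions both markets grow by well over half in five years, and the gap between them widens slightly rather than closing.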
And you know, the problem about the size of data sets that Ryan mentioned, that’s only going to get worse as more enterprises get more into digital transformation and they’re collecting and retaining a lot more data. Because particularly for machine learning style workloads, the more data you have, the better results you’ll get, the better business insights you’ll be able to drive off of that data. So people are clearly going to be retaining more.
You know, one other point I’ll make here on this slide is that a lot of Cloud based storage in the past has been object. And we’ve really seen, as enterprises get into this use of the Cloud and what kind of workloads it works best for, that there is a bit of repatriation also going on at the same time. And in fact, some primary research we did in 2020 indicated that 84 percent of enterprises have repatriated at least one workload from the Cloud back into an on-premises infrastructure.
And there tend to be five reasons for that. So that’s performance, availability, security, governance, and/or cost. Generally one of those five reasons drives you to bring something back into an on-prem location, but with this growth on the object side, there’s a lot of object storage being deployed on premises today as well as in the Cloud. And often those workloads were in the Cloud before, but you have other associated applications, or maybe sometimes the compute that you’re going to be using to run the analytics against that data is actually sitting in your data centers.
You know, this idea of locality is one of the reasons that people will want to get the data closer to the compute and have it both sitting in an on-prem location. Now obviously they can do that potentially in the Cloud as well. But they’re going to be paying for the compute resources there and things of that nature. So you know, the point I want to make here is that this object growth that we see, which clearly this is the dominant market in the unstructured space, that has a major on-premises component associated with it.
Adrian: Yeah –
Ryan: Yeah, and we also see – excuse me, A.J. So we also see a driver for that being, you know, the data was collected at the edge, and that is closer to the office locations, which is closer to the data center. So you know, if you’re training self driving cars and you have, you know, petabytes of video coming in that needs to be indexed and classified, you really – I mean sending that up to the Cloud before operating on it isn’t really what you want to do.
You want to run your analysis much closer to where it’s gathered and then possibly store the result sets or do continued analysis up in the Cloud. And then agreeing with you again too, we see performance oriented workloads being repatriated back to on-prem, especially when they were associated with like a mandate to send a large set of activities to the Cloud.
Adrian: Yeah, and that’s really – that’s a perfect segue for what’s driving the need for object storage. You both touched upon a lot of elements that fall into these categorizations, right. We spoke about data growth, about accounts and users. I think you both just covered the distributed access and applications well, especially in this hybrid Cloud environment. But – and Eric, I think you covered, I mean you add all those up and it’s really a storage TCO, a data access and a business flexibility issue, correct?
I mean at the end of the day, if you can’t get your data fast enough, if you can’t put it to where it needs to be processed fast enough, then ultimately, you know, that comes out as lost revenue or lost opportunities. Are there any other characteristics from the business perspective that have been communicated to you, that the impact of maybe – I mean you mentioned cost is a big factor in the Cloud, correct, from a repatriation perspective?
Eric: Yeah. So you know, generally the cost issue is the egress charges associated with moving the data to the different locations where you need it. And you know, that’s been the primary issue. If you’re just going to write data once to the Cloud and leave it there, the Cloud actually is very cost effective. But it tends to be, you know, these use cases, how you’re going to use the data, that drive that. Another comment I’ll make, though, about budget is that, you know, I mentioned workload consolidation before.
And this idea of replacing two, three, sometimes four existing storage silos with a single platform that can meet all the performance and availability requirements of each of those independent platforms for their workloads, but now you’re managing it all centrally. So this actually is a huge TCO savings. Because you deal with fewer vendors, you’re paying less maintenance, because it’s all, you know, in a single system. You’re managing it centrally. You’ve got less power and cooling, less floor space. So this idea of workload consolidation is being driven in large part by wanting to make their infrastructure more efficient and certainly consolidating, making it more compact, is one way to do that. But the, you know, the caveat there is can we still meet our SLAs in our environment. And that would be the reason to not consolidate going forward.
So you know, that’s again where this idea of, wow, you can buy, you know, an all Flash object storage platform if you want it, or you can buy tiers of solid state type of thing. And that gives you an ability to place workloads that might require more performance on that object platform along with other workloads that maybe don’t, and they might sit more on the HDD side. And you know, so you’ve got the flexibility there. You know, to me it seems like the real driving factor behind all of these decisions that are being made around infrastructure modernization is enabling agility.
Agility is the driver. That was the driver for the Cloud. It’s the driver for all these new decisions around infrastructure modernization.
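Eric's egress point earlier in this answer (write-once data is cheap to keep in the Cloud, frequently moved data is not) can be made concrete with a small cost model. This is a sketch under loudly stated assumptions: the `monthly_cloud_cost` function is hypothetical, and the two prices are placeholder round numbers chosen for easy arithmetic, not any provider's actual rates.

```python
def monthly_cloud_cost(stored_tb, read_fraction, storage_price_per_tb, egress_price_per_tb):
    """Monthly cost = storage charge + egress charge for the data moved out.

    read_fraction is the share of the stored data moved out of the Cloud each month.
    """
    storage = stored_tb * storage_price_per_tb
    egress = stored_tb * read_fraction * egress_price_per_tb
    return storage + egress


# Placeholder prices (assumptions, not real provider rates).
STORAGE_PRICE = 20.0  # currency units per TB per month
EGRESS_PRICE = 90.0   # currency units per TB moved out

# Write once and leave it there: no egress at all.
archive = monthly_cloud_cost(1000, 0.0, STORAGE_PRICE, EGRESS_PRICE)

# Analytics that pulls half of the data set out every month.
analytics = monthly_cloud_cost(1000, 0.5, STORAGE_PRICE, EGRESS_PRICE)

print(archive, analytics)
```

With these placeholder numbers the usage pattern, not the capacity, dominates the bill, which is the shape of the repatriation argument made above.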
Adrian: Yeah, absolutely. And you have some great data to back up some of the points you just spoke about. Why don’t you walk everyone through the next few slides?
Eric: Yeah. So you know, I had mentioned earlier how multiple access methods are associated – or they’re actually one of the feature requirements for workload consolidation. And one of the things that we asked last year, in a survey that we completed in December of 2020, was about companies’ existing usage of, and plans to use, systems that have multiple access methods. So you can see that almost half of the organizations, 42 and a half percent, will have invested in systems that support multiple access methods on the same system by the end of 2021.
So that’s pretty critical when you think about, you know, classically file systems did NFS and SMB, object was S3 or Swift. And you know, now you’ve got a lot of systems that support all of those as well as a number of additional ones. You know, you mentioned Hadoop and how there were a lot of data sets out there and the ability to gain access to those and maybe bring them into your new environment, because you still want to use that data for future analytics. You know, FUSE is another one we see. FTP. You know, there are a variety of access methods out there that organizations seem to be interested in, but I would say that the top ones, NFS, SMB and S3, are the ones that seem to get the most play.
And so this is just great evidence that, OK, you know, there’s a lot of workload consolidation going on out there. You can see how many people are interested in this particular access method feature.
Adrian: Yeah, and we definitely see that in the market too. And here you cover the unstructured – the latency requirements for storage systems, which I think is very interesting.
Eric: Yeah. So you know, surprising, right? If you think about five years ago, did anybody ever think about latency when it came to object? Not very often. But that is – that’s very commonly thought about these days as you think about what workloads you want to place on what types of systems. So in this one, almost 65 percent of the organizations out there, you know, they expect to add technologies that are going to allow them to deliver lower latencies in their unstructured storage environments.
So this, you know, this might mean things like Flash, but it might also mean things like, hey, we want to be able to use replication as a data protection mechanism for smaller files and objects that we keep in that platform and for larger ones we want to use erasure coding.
You know, we get the performance we need on those smaller files. We get the dollar per gig cost benefit of using EC, where you’re spreading that data across more devices out there. And so there’s a lot of ways to potentially impact latency other than just let’s put Flash in the system. But clearly that has been one of the popular approaches.
Adrian: It’s actually very gratifying to hear you note that tradeoff between replication and erasure coding. We put a great deal of effort into making that something that could happen transparently, where, you know, data was stored with replicas and then after a certain amount of time, maybe when it wasn’t as hot, it would be converted to erasure coding. And we were very early. So it’s gratifying to see people start to appreciate that kind of a feature set.
Eric: Yeah. You know, I’ve also seen cases where a policy can be set so that files or objects under a certain size are stored using replication as the protection mechanism and, if they’re above a certain size, they’re stored with EC. And then obviously the ability to move back and forth between those, you know, on demand within the system is also an important feature for flexibility.
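As an illustration of the size-based policy Eric describes, here is a minimal Python sketch. The 1 MB threshold, the three replicas, the 4+2 erasure coding scheme, and the names `Protection` and `choose_protection` are all assumptions made for the example; they are not taken from any particular product discussed in the webinar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protection:
    scheme: str      # "replication" or "erasure_coding"
    overhead: float  # raw bytes stored per byte of user data


def choose_protection(size_bytes: int,
                      threshold_bytes: int = 1_000_000,  # assumed cutoff
                      replicas: int = 3,                  # assumed replica count
                      data_slices: int = 4,               # assumed EC data slices
                      parity_slices: int = 2) -> Protection:
    """Pick replication for small objects and erasure coding for large ones."""
    if size_bytes < threshold_bytes:
        # Small object: full copies, so reads touch a single copy.
        return Protection("replication", float(replicas))
    # Large object: split into data slices plus parity slices.
    overhead = (data_slices + parity_slices) / data_slices
    return Protection("erasure_coding", overhead)


small = choose_protection(4_096)        # a 4 KB object
large = choose_protection(50_000_000)   # a 50 MB object
print(small, large)
```

Under these assumed settings the small object costs three times its size in raw capacity, while the large one costs one and a half times, which is the dollar-per-gig benefit of EC mentioned above.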
Adrian: Absolutely. And then, you know, the usage drivers. We covered a lot of these, but do you want to walk everyone through this data?
Eric: Yeah, you bet. So this was a case where, you know, we had asked in the survey people to define their high performance workloads. And so, you know, we got a sense of what the percentage was of, you know, how many of your workloads do you consider high performance, how many do you not? Of those ones that are, how many of them are running on object? So that’s this particular data set.
So you can see that – so it’s almost 35 percent of the people out there, they’re expecting low latency capabilities from these next generation object stores that they’d either already purchased or are looking to buy because of the workloads they’re planning to put on top of those. So you know, performance, there’s the throughput aspect, there’s IOPS, there’s latency. And different workloads are interested in different aspects of performance.
But you can see that, you know, there’s a pretty healthy use of object platforms for workloads that enterprises have self-identified as high performance for their environments.
Adrian: Yeah. And I think the takeaway here, as you described it, is object storage is finally moving out of just the cheap and deep and archival categorization. I think if you can take anything away from all of this analysis, it’s that. You know, for the first 10 to 15 years of its existence, you know, object storage was always in the, you know, hey, it’s only good for archive. And now what we’re seeing is that, no, it’s also good for some primary use cases depending on what those primary use cases are.
And we’re seeing just a lot more innovation, and a lot more definition of what those use cases are. And –
Eric: Well, that – yeah. But that requires that the object platforms support some of these newer capabilities that you didn’t see in object platforms, you know, in the past.
Adrian: Absolutely, absolutely. And we’re going to cover some of those characteristics in the following slide. So now we’re shifting into – we set the stage, we talked about object storage and how it’s evolving, we talked about the market drivers and why object storage is evolving. And now let’s talk about object architectures, deployment, and management. So in your evaluation, what do you specifically look for under these characteristics?
And with that, Ryan, you know, maybe you want to walk us through the high level object storage architectures.
Ryan: Yeah, sure thing, A.J. So you know, the two main architectures we generally see in the field can be characterized by the words ring and mesh, where a ring-based system is a system where the content of the object actually implies its location at some point on the ring. And so by essentially having a handle to the object, you know which place in the ring to go to for content. Mesh based systems have different characteristics. Essentially the object data itself doesn’t imply a location in the cluster.
And so it’s much easier to do things like add capacity, because it doesn’t imply a migration of objects from one place to another. With mesh based systems it’s also in some cases easier to add different form factors and different size discs. Ring based systems definitely have ways to deal with this, and they’ve also been very successful at scale. So it’s not to say that they wouldn’t be good solutions for some problems. But it is good to know which one you’re dealing with when you’re evaluating a system.
I think the other thing to know is metadata location and whether the metadata is actually encapsulated in the objects, in the object system or whether it requires, you know, an additional NoSQL database. And so I think once you look at a system and can ascertain whether it’s a ring based system or a mesh based system, you start being able to predict a little bit more about how it will behave.
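To make the ring idea concrete, the following Python sketch places objects on a simple hash ring and counts how many of them change location when a node is added. It is an illustration only: the `Ring` class, the node names, and hashing the object identifier with SHA-256 are assumptions for the example, and real ring-based systems typically add refinements such as virtual nodes and replica placement that are left out here.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical hash ring: an object's identifier determines its node."""

    def __init__(self, nodes):
        self._entries = sorted((_point(n), n) for n in nodes)
        self._positions = [p for p, _ in self._entries]

    def locate(self, object_id: str) -> str:
        # The first node clockwise from the object's position owns it.
        i = bisect.bisect_right(self._positions, _point(object_id))
        return self._entries[i % len(self._entries)][1]


objects = [f"object-{i}" for i in range(10_000)]
old_ring = Ring([f"node-{i}" for i in range(4)])
new_ring = Ring([f"node-{i}" for i in range(5)])  # one node added

moved = [o for o in objects if old_ring.locate(o) != new_ring.locate(o)]
fraction_moved = len(moved) / len(objects)
print(f"{fraction_moved:.1%} of objects must migrate to the new node")
```

Every object that moves lands on the newly added node, which is the background migration a ring implies when capacity is added. In a mesh-style design, where location is looked up rather than derived from the object, existing objects can stay where they are and only new writes need to land on the new capacity.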
Adrian: Eric, do you have anything to add?
Eric: You know, I think the point that Ryan makes about data movement going on in the background as you’re expanding the system is important for people to be aware of as they make purchase decisions here. It’s about how heavily you tax the back end infrastructure that connects all the nodes in either of these two architectures, and how heavy utilization of that network impacts performance for other workloads running on that system while it’s rebalancing as a result of adding a new node, things of that nature.
So that’s certainly something to ask about for people that are looking to buy one of these two architectures.
Ryan: Well, it should be on the must list, because generally people, once they provision an object system, if it’s successful, they will want to expand the capacity. I mean we just see it over and over again. And so they should really understand how the system will behave. You know, can it add capacity at run time? Can it do it with servers, you know, that will be on the market in a year or two?
And you know, is it designed for that kind of expansion? So I would put that on the must list.
Adrian: And that – so, and I think we covered another reason why, just with the scale, where, you know, before we saw millions of files, now we’re seeing billions of files. That change occurred, you know, over just a matter of years. You know, also, you know, hundreds of terabytes to single digit petabytes and now double digit petabytes.
So now the data’s there. Now the storage, from the management perspective, you know, why don’t you walk us through this, Ryan, and then also maybe give everyone a concept of what a typical environment looks like by the number of servers. I think that would help everyone out there, you know, get their head around what we’re talking about; how many servers, how many drives do we typically see that – do people need to manage?
Ryan: OK. Yeah, so I mean but taking it from the top, you know, the single pane of glass Web management idea goes back to the excellent points Eric was making about enterprise features now really being required from the object stores. You know, we see quite a bit of interest in those requirements, in event notifications so that the system’s easy to maintain.
We have barely talked about automated failover and recovery, but it needs to be hands off. You know, these systems need to be self healing and recover from failures without any, you know, IT intervention and have full availability, high availability. We just talked about adding capacity on demand. And then to your point, you know, what does a one petabyte cluster look like? Well, I mean these days that might be in as few as seven servers, which may be, you know, an ideal choice for something like a four-two erasure coding strategy, just because it leaves more servers than slices.
And it has excellent characteristics for high availability. You know, each one of those seven servers may have between 12 and 24 discs, you know, between 10 and 20 terabytes each these days. That’s another place where a purpose-built object system really matters, because if a volume fails and all of the other servers in the cluster participate in the recovery, the recovery becomes very short. Where the recovery becomes very fast, that just leads to more nines of data protection, and that goes back to high availability and data protection.
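The sizing Ryan walks through can be checked with a little arithmetic. The Python sketch below uses only the figures he quotes (seven servers, 12 to 24 discs per server, 10 to 20 terabytes per disc, four-two erasure coding); the function names are hypothetical, and the calculation ignores file system overhead and any capacity held in reserve.

```python
def usable_tb(servers: int, discs_per_server: int, tb_per_disc: int,
              data_slices: int = 4, parity_slices: int = 2) -> float:
    """Usable capacity in TB when every object is stored with k+m erasure coding."""
    raw_tb = servers * discs_per_server * tb_per_disc
    return raw_tb * data_slices / (data_slices + parity_slices)


def servers_beyond_slices(servers: int, data_slices: int = 4,
                          parity_slices: int = 2) -> int:
    """Servers left over after placing one slice of an object per server."""
    return servers - (data_slices + parity_slices)


low = usable_tb(7, 12, 10)    # 840 TB raw at the small end of the range
high = usable_tb(7, 24, 20)   # 3,360 TB raw at the large end of the range
spare = servers_beyond_slices(7)
print(f"usable capacity: {low:.0f} TB to {high:.0f} TB, spare servers: {spare}")
```

With these inputs the usable range is 560 TB to 2,240 TB, so a one petabyte cluster sits comfortably inside it. The one server beyond the six slices is what "more servers than slices" refers to: after a single server failure there are still six servers available, one for each slice of an object.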
I would say that the last point here is volume portability, and I would say it’s not really a feature that’s requested by customers in many cases. But when they understand how powerful it is, they tend to use it. So for instance, volumes that are filled up at an edge location, and by volumes we mean discs essentially. So discs that are filled up with data at an edge location, whether that’s a, you know, a mobile data gathering mechanism, or whether it’s a mobile Army hospital, or whether it’s, you know, a sporting, a live sporting event, you know, that data and those discs can be added to a larger cluster in another place simply by plugging the disc in.
So that’s been used for everything from, you know, extremely high throughput sneakernet (sorry for the geek joke there) to, you know, IoT data gathering, and – or actually, you know, medical imaging at older hospitals that have very little throughput. So you know, there’s powerful features in the background resulting from purpose-built architectures.
And I think it all goes to, you know, Eric’s point about, you know, enterprise feature sets and more and more requirements for object based systems.
Eric: Yeah, you know, one thing I’ll add there, Ryan, on the automated failover and recovery: one of the things that was sort of an issue for object platforms in the past, if you compare them to other systems, was the recovery times and the impact of recovery, like you know, what’s the overhead that gets put on the system when you’re recovering, and what impact does that have on other workloads, you know, the noisy neighbor type of issue. And I think that, again, some of these newer technologies and the opportunity to choose between different kinds of data protection, erasure coding vs. replication, all of those can potentially speed recovery times and lessen the impact, you know, of when you have to go rebuild data in the system because of various failures.
So you know, to me recovery and availability, clearly they’re related, you know, recovery times and availability. But they’re not exactly the same thing. And I feel like object systems in the past have had pretty good availability but if there was a failure, they had slower recovery times than some of the other systems. And I think that is one of the things that’s really changed with this new era of object platform.
Ryan: I would 100 percent agree with you there.
Adrian: Yeah, and something else has changed, right. Data and tenant management. Do you want to walk everyone through data and tenant management? I just want to also do a quick time check. You know, we have about 10 minutes left.
Eric: Yeah. You know, I think this also goes to the points we were making about quality of service, manageability, single points, you know, single pane of glass; the ability to browse and query metadata and search based on metadata. So if you have hundreds of millions of objects, you know, or billions of objects, you really want those searches to return in single digit, double digit millisecond latencies. That gives you Web scale search over very large collections. And then doing it at scale. You know, if you’re a telecom, you know, you do need these quality of service limits in place.
You need metering and quota support. You know, you need to be able to connect into identity management systems, you know, even at the corporate level. So these all become, you know, features that are – enterprise features that are moving into object stores and should be there in a mature form that you can actually POC and verify in your own environment.
Adrian: Yeah, absolutely. And this is definitely moving beyond just the cheap and deep object stores that we have seen in the past. And you’ve talked about metadata a lot, Ryan. Do you have anything to add to this specific slide? We give a little bit more detail on metadata and what that means in the object world.
Ryan: I mean especially from the time check perspective we should probably move through. But I’d just say the main thing to look at from an evaluation perspective is evaluate the metadata modules as well as the base object modules, especially if they’re completely different systems. And you know, evaluate how they perform at scale. Evaluate, you know, evaluate them with hundreds of millions of objects in them. You know, check for the availability of custom metadata and custom tagging. It’s really more and more important, especially to the machine learning and artificial intelligence use cases that we were talking about that are on the rise.
Adrian: Yeah, definitely. And then open source vs. commercial. I know you, Eric, you have some opinions on open source vs. commercial. You do too, Ryan. So why don’t you both share those.
Ryan: Eric, you go first.
Eric: OK. Well, you know, many people have probably heard this, but open source, it’s often like that free puppy that, you know, yeah, it’s free but once you get it, there’s a lot of work and there’s a lot of cleanup involved. And you know, there is open source out there. I think there’s a lot of platforms that have used open source as the core and they’ve created a commercial distribution around that.
And that adds some of the things that traditional open source lacks, which is, you know, generally they don’t have good documentation associated with it. There’s no commercial quality tech support. There’s no specific roadmap that you can count on. I mean certainly new features come out from the community, but it doesn’t have the same kind of predictability that you get from more commercial offerings.
And so you know, we’ve typically seen people that deploy open source have this sophisticated management expertise to be able to deal with those kinds of environments because they clearly require more for deployment and ongoing management. The commercial platforms have generally addressed those shortcomings. They provide, you know, great documentation, enterprise class, worldwide tech support. They have roadmaps that they manage their systems to that they communicate on a regular basis with customers.
So you know, that – to me that’s kind of the difference. The only reason to go with open source has often been a cost issue, although there are some other concerns. If you’re going to be doing some of your own development around a platform, you know, you don’t have that ability to add your own feature sets onto commercial offerings. Whereas you do have that with open source and that’s something that’s attractive to some enterprises. But generally the enterprises are more interested in the commercial offerings because, you know, it’s a budget issue. They don’t have the deep technical expertise to do commercial, you know, to do development on these platforms.
They need systems that are ready to work out of the chute today and have good complete management capabilities. So that’s kind of how I would contrast the two.
Ryan: I mean I would echo those points.
Adrian: Ryan, do you have anything to add?
Ryan: And you know, yeah, real specifically, I mean we are big fans of open source. You know, we use Linux extensively. And we’ve seen very successful installations with open source object systems. In many cases they’ll be in places like national labs or universities where they can have a group that’s responsible for deploying and maintaining it, whereas, you know, some – we can do very large installations in some of the more enterprise IT commercial offerings, and you know, specifically to our experience we can stand up petabytes of data in a day and then it can be self healing and it actually doesn’t need a group.
So in some, in many ways it’s just about understanding how much effort you want to put into the system and how you want it to behave.
Adrian: Yeah, definitely. The expertise of your staff too, right? And that’s a good segue into the build vs. buy. Do you want to walk everyone through this, Ryan?
Ryan: Yeah. So I mean the build option is certainly, where people ship software as well as provide appliances, the build option is just about being aware of your use case, sizing your hardware, understanding, you know, the number of objects you’re looking at. And really being flexible. Many times it’s about, you know, you have the ability in some cases to come with your own vendor, and you can have them maintain the hardware, and they’re familiar with everything they need to do.
So that’s a very convenient way to go. As opposed to appliance model, which is much more turnkey. It can be much easier to purchase. It can be a simpler solution for maintenance as well as, you know, getting software support. So in many cases it’s just a little bit of a simple vs., you know, set decision.
Adrian: Yeah, but let’s talk about all the knobs you can go ahead and customize. I think this slide covers it. Right? Everything you need to consider when putting an appliance in place. And this is, you know, if you do buy a pre-packaged appliance, and we’re also believers in that. We have pre-packaged appliances that are available for purchase. You need to size it for specific use case, correct? And this is everything that you consider as a solutions architect, right, Ryan, when developing or architecting a solution for someone?
Ryan: Absolutely. And you know, really simply, you know, large form factor, many-U servers are going to be appropriate for different use cases than, you know, NVMe or Flash based discs. And you know, you can pick the ones that are right for your use case, counting object size, capacity, availability, and everything that needs to be considered. It’s – there are many different dimensions to the choice. And so it’s interesting to get into that decision at a detailed level.
Adrian: And again, it’s – I think the point to take away is there’s a lot to consider when talking about work flows, when talking about work loads. And this brings us to the approach and timeline for object storage evaluation. So you know, Eric, this is an IDC slide. Why don’t you walk everyone through the evaluation approach.
Eric: OK. Well, you know, a lot of the primary research we’ve done over the last 18 months indicates that as CIOs move into this new era and move through infrastructure modernization, they’re thinking a lot more about optimal workload placement. And so when you start a process thinking about what are the requirements of your particular workload, how quickly will it grow, how much data do you need to start with, what are the performance requirements in terms of latency, throughput, IOPS, etc., how much availability is required?
So that’s really the right first place to start. And then from there it makes it much easier to match what the requirements of the workload are with the architectural capabilities of different approaches. You know, and we talked about a lot of the bullet points that I’ve got here under architectural capabilities. But there are certainly different implications for different architectures. Right?
I mean like, you know, scale out architectures are going to be a great way to deal with high growth environments, environments where growth might be unexpected. You know, you’re not sure you need that flexibility vs. older scale up designs. And you know, that may not be real appropriate because most object is scale out; but I mean one of the reasons why it is, is because generally object environments are large and they experience a significant amount of growth.
But that also applies to a lot of the software capabilities that reside within those platforms and are also driven in many cases by some of the architectural decisions that get made. So you know, there’s got to be a close match between the specific workloads you’re planning on putting on the platform and the platform capabilities itself. Other considerations, you know, we mentioned these.
The consumption and deployment models, I’ll just mention one thing very quickly there. Because your build vs. buy slide, IDC actually recognizes five what we call enterprise storage consumption models. And build vs. buy as you’ve shown, storage appliance, software only, are two of those. The other three are not necessarily applicable in the object space, but just for kicks, so everybody knows what we’re talking about, there’s a converged infrastructure model, there’s a hyper-converged model, and then there’s a Cloud or sort of a services related model.
And each of those different consumption models target different kinds of workloads and different sets of customer requirements. So you know, I completely agree with the comments that you made about build vs. buy. There’s strong reasons why customers want to buy software and add their own hardware vs. people that want to buy appliances. But those obviously need to be taken into account here to match the model that is your preference with the workload requirements and, you know, what architectural capabilities does that bring to the table here.
So these are the starting points in my mind as you’re thinking about what are we going to buy, what do we want to consolidate onto it, what platforms are we going to be requiring to do that? This is the starting point.
Adrian: Yeah, absolutely. And this is going to be our last slide. We’ll go ahead and open it up to questions. If you have questions, go ahead and type them in. We have a few that have already come in. We’ll go ahead and address them after we present this final slide. But Ryan, why don’t you walk everyone through the evaluation timeline as I start to organize the questions?
Ryan: Yeah, so we typically see customers, you know, getting some idea, I want to add object to my environment; you know, this looks like something that I either definitely need to do or need to decide whether I need to do something about this. That typically lasts around three to six months, is what we’ve seen. Then they go through a phase of identifying candidates. They go through word of mouth. They go through Google. They go to conferences. Once they have identified candidates and put evaluations in place, we see each one of those lasting around one to two months.
And it’s not all up time and features. It’s, you know, gathering information, preparing the network, installations and deployments, and then workflow testing. If they’re being incredibly thorough, it includes adding capacity, it includes, you know, eliminating servers and discs and monitoring the system during recovery times. And then procurement. Who knows how long that takes.
Adrian: It depends what industry you’re in. And if you’re the government, many, many years, right?
Ryan: It does.
Adrian: So we have a few questions here. We’re going to go ahead and open up the questions. Again, this is a typical evaluation timeline. We just want to set the foundation for what to expect from an evaluation perspective. This can actually happen faster. I think the fastest that we’ve seen was measured in weeks. But this is what we typically see from start to finish. Here are some next steps, but as everyone reads the next steps, they’re very easy to read here, we do have a few questions.
The first is about latency. You know, we spoke about latency. So what does that mean in the object storage world? Are we talking nanoseconds, milliseconds? How many milliseconds? So Ryan, why don’t you take that.
Ryan: I mean even with – so with HTTP in general, I mean there will be some built in latency. And then we tend to see from request to response single digit milliseconds. There are certainly sub-millisecond latency systems and that is possible. But we typically see the desired latencies being single digit millisecond. And with a long tail response pattern in a histogram, you can certainly get outside that. But you really want the mean to be in single digits.
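Ryan’s point about the mean and the long tail can be illustrated with a small Python sketch. The sample latencies below are made up for the example (98 requests at 3 ms and 2 at 40 ms), and `summarize` is a hypothetical helper using a nearest-rank percentile; it is not a measurement of any system.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean and selected nearest-rank percentiles of the samples."""
    ordered = sorted(latencies_ms)
    count = len(ordered)

    def percentile(p: int) -> float:
        # Nearest-rank method: smallest value with at least p% of samples at or below it.
        rank = max(1, (p * count + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# Assumed sample: mostly fast responses with a small slow tail.
samples = [3.0] * 98 + [40.0] * 2
summary = summarize(samples)
print(summary)
```

For this sample the mean stays in single digits (3.74 ms) even though the 99th percentile is 40 ms, which is why it is worth looking at the tail of the histogram as well as the mean when testing a workflow.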
Adrian: Eric, do you have anything to add to that based upon your research? Did you specifically ask organizations about the level of latency that they’re looking for these days?
Eric: Well, we didn’t specifically ask that in the survey, but yeah, there is something I’d like to say here. So you know, this year there’s going to be about 30 billion dollars spent on storage – on external storage systems. About 17 billion of that will go to systems that have hard disc drives in them. So there’s still a lot of primary workloads that are being run on systems with hard discs.
And the typical latencies for hard disc based systems range between five and 20 milliseconds. So if you’re able to deliver with an object storage platform, let’s say two to four milliseconds of latency, and you’re able to do that more consistently than on a typical, you know, HDD based array, that means that your object platform can take on those workloads that you’re running on that array and it can meet and in fact beat the performance requirements there.
So I, you know, yeah, there’s a lot of excitement around all Flash and sub-millisecond and there are clearly workloads that need that, but there are still a lot of workloads that are running on systems that don’t deliver that kind of latency and they’re doing just fine.
Adrian: Absolutely. That’s a good point. The next was about data mobility and hybrid Cloud environments as a requirement. So it was something that we didn’t specifically call out. But where would you weight that in order of importance based upon organizations that you talked to, Eric? Because I think, you know, this is something that we also see as an organization. We hear a lot about hybrid Cloud and being able to move data from different Clouds. Are organizations actually doing that? Are they moving data from let’s say S3 to Azure and back and forth? Are there work loads actually out there that require that?
Eric: Well, there are. But you know, this idea of there’s movement that’s happening on a weekly or even monthly basis is – that doesn’t really represent what’s happening out there. One of the reasons people want mobility with their data is pricing reasons, because, you know, if they can easily move their data from let’s say Amazon to Azure or Google, then that means that there’s probably going to be more price parity for similar services between those three, among those three vendors.
So that’s one concern. But there clearly are cases – if you’re going to repatriate a workload back into on-prem, you want that to be as easy as possible. There might also be cases where you decide you need to purchase some new services for analysis against data you have in the Cloud, and maybe your current Cloud provider doesn’t have that service.
So you might think about, well, maybe we want to move to this other Cloud provider because they offer those capabilities and we’ll use them for these workloads now. So I think that – the flexibility that you get is important, but it’s not something that typically you’re going to be using on an extremely regular basis. And you know, there may be some work loads where that is true, that you do have much more, you know, frequent usage of that capability and that kind of concern.
But I think that, you know, it’s a flexibility that it just gives you more options going forward as an IT strategist.
Adrian: Definitely. And I think that wraps up the questions. We do have a few more questions that we’ll go ahead and answer directly via email, since I know we’re already over time. But I just wanted to thank the both of you so much for your time. Obviously thank the viewers and those who have stayed with us going over time here; it’s a very interesting topic, as you can see. We could have probably spent a half a day if not a full day on this particular topic.
But you know, Eric is there if you want to reach out to IDC and schedule some time with him. We’re always available and at your disposal at DataCore. So if we can act as an advisor or if we can help you in any way, please do not hesitate to reach out to us at email@example.com. Any closing comments for everyone out there? Eric, let’s start with you. Ryan, we’ll wrap up with you. So Eric, any closing comments for the viewers?
Eric: Yeah, I just think it’s very important for people to understand this change that’s happening in object and that it’s a whole new world for what kind of workloads you can potentially move to that environment. With the massive scalability that object supports, that’s a great match for what’s happening in most enterprises today, that they’re going to be dealing with multiple petabytes, tens of petabytes of data, and object is a great place for that data.
It’s cost effective, scales well, it’s got rich metadata. So please don’t think about object as what object was five, six years ago. It’s very different today.
Adrian: And Ryan, any closing comments for everyone out there?
Ryan: Yeah, I mean I would really echo that. I would say, you know, object, it’s not just for backup and archive anymore. You know, look for primary workloads, petabyte-scale data sets, you know, rich metadata models, really exciting use cases on the horizon and currently. And so yeah, echoing what Eric said, and I just want to thank everyone for joining us.
Adrian: All right. Thanks again. So with that, this concludes our webinar. Thank you, everyone, for joining us. Thank you, Eric. Thanks, Ryan.
Ryan: You bet. Thank you.
Eric: Thank you.