Cloud storage systems differ from the storage systems we are accustomed to today in that cloud service providers — companies such as Amazon and Google — began building them a decade or more before they actually opened their public services.
When they were first establishing themselves more than a decade ago, they were focused on constructing an infrastructure with tremendous scalability to support the storage demands of their large Web properties — we’re talking about tens of thousands of servers distributed throughout the world, networked for 100 percent availability, ensuring data and services are always there for consumers.
Listen to the podcast (13:32 minutes).
When these cloud service providers were designing their systems, they very quickly came to the realization that traditional backup techniques weren’t going to work for them. As many of you probably know, the problem with backup is that the model is predicated on the idea that you need file servers, email servers, databases and so on to establish a production tier — Tier 1 storage.
In addition to Tier 1 storage, you also require a backup system that copies all of that data onto another system — this could be a tape library, or it could be a disk-to-disk target. The backup software may apply compression and deduplication and do its best, but ultimately it is taking all the data on the production tier and copying it to another tier. That puts a tremendous load on any environment, and while that is painful enough in a data center, it is simply not feasible when you’re talking about petabytes of data added every quarter and exabyte-scale repositories.
A Perfect Copy
What the cloud providers had to do is develop protection as an integral part — an intrinsic part of the storage system — rather than as an add-on.
If you’re familiar with RAID technology, where several disks in an array back up and protect one another, this is the same concept applied at a much bigger scale, with much greater reliability built in.
Essentially, what the cloud providers do is assign storage nodes in their systems. A storage node is a server with internal storage (hard drives). They may be RAID protected, but the important thing is that the server is then connected to another server via a standard Ethernet connection. Many of those servers connected together in a cluster form one of the storage centers for a cloud storage system. These clusters can then be connected across geographies to multiple data centers all around the world.
It works something like this: When data lands on one of these servers, that server pushes copies out to other servers in the cluster — and to servers in other data centers around the world. This is very powerful. Essentially, it means the unit of failure of the system is a server loaded with disk drives, connected via Ethernet to the rest of the cluster and across a wide area network to other data centers. The entire system is built around the idea that one can protect data by making exact copies of it.
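That replication pattern can be sketched in a few lines of code. This is a minimal, hypothetical illustration — the names (StorageNode, Cluster) and the replication factor of three are assumptions, not any provider's actual design. When an object lands on one node, full copies are pushed to peer nodes, preferring peers in other regions so that losing a data center cannot destroy every copy:

```python
REPLICAS = 3  # assumed replication factor; real systems tune this

class StorageNode:
    """One server loaded with disk drives, reachable over Ethernet."""
    def __init__(self, name, region):
        self.name = name
        self.region = region
        self.objects = {}  # object_id -> bytes

    def store(self, object_id, data):
        self.objects[object_id] = data

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def put(self, object_id, data):
        """Write to one node, then push exact copies to peers."""
        primary = self.nodes[0]
        primary.store(object_id, data)
        # Prefer peers in other regions, so a whole-data-center
        # failure still leaves surviving copies elsewhere.
        peers = sorted(self.nodes[1:],
                       key=lambda n: n.region == primary.region)
        for peer in peers[:REPLICAS - 1]:
            peer.store(object_id, data)

    def get(self, object_id):
        """Any surviving replica can serve the read."""
        for node in self.nodes:
            if object_id in node.objects:
                return node.objects[object_id]
        raise KeyError(object_id)
```

Because every replica is a complete copy, the loss of any single node — the unit of failure — is invisible to readers.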
This massive change in system architecture was made possible with the falling price of servers with internal storage. Over the years, the price has come down to a point where the servers that make up these nodes are so inexpensive that they can be thrown away upon failure.
By building protection into a system like this, with the capability of instantly making perfect copies of your data, those who purchase cloud storage get not only tremendous reliability and data durability, but also availability — a perfect copy of your data is now available in multiple locations all over the world. This is a newly available resource that allows you to protect your data in a way that was never before possible.
In utilizing this technology, IT organizations can fundamentally change how they think of protecting their data. Now the challenge lies in determining how to get this data to these data centers and ensuring that when you need that data back, you’re able to retrieve it in a timely fashion.
Strengths and Weaknesses of the Cloud
The challenge when thinking about how to leverage cloud technology to protect your data in a different way is to consider what the cloud is good for, and what the cloud is not good for.
One of the biggest benefits of protecting data in the cloud is that the cloud is not in your own data center. It doesn’t take the resources of your data center, and if something were to happen in your data center, you would still be able to fall back to this external resource. A problem with cloud storage is that the data is outside your data center, so if you need to get a lot of data back from the cloud in a hurry, you would not be in a good place. Even through multiple T1s, once you start talking about terabytes of data, you are really talking about waiting days before you can regain access to that data.
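The scale of that delay is easy to estimate. Here is a back-of-the-envelope sketch, assuming a standard T1 line at 1.544 Mbps and a decimal terabyte, and ignoring protocol overhead:

```python
T1_BPS = 1.544e6          # one T1 line, bits per second
TERABYTE_BITS = 1e12 * 8  # 1 TB (decimal) in bits

def transfer_days(terabytes, t1_lines):
    """Idealized days to move the data; real links add overhead."""
    seconds = terabytes * TERABYTE_BITS / (t1_lines * T1_BPS)
    return seconds / 86400

# Pulling just 1 TB back over four bonded T1s takes roughly two weeks.
days = transfer_days(1, 4)
```

Even under these generous assumptions, a multi-terabyte restore over T1s is measured in weeks, not hours — which is why pipe capacity, not cloud durability, is the constraint that matters.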
In a traditional backup model, if you need to be able to recover terabytes of data very quickly, you would employ disk backup for your backups, as opposed to tape. You would have multiple GigE connections between the backup server and that target, and then you would try to siphon, as quickly as possible, all of your full images and apply the incrementals back into the primary tier of whatever host you were trying to recover. This is a model that is painful — it can take hours; however, this is how most IT organizations today meet their recovery time objectives.
Rethinking Primary Storage
When you’re thinking about the cloud for storage — and remember, the reason you want to think about the cloud for storage is because it offers tremendous protection and availability to protect your data — it is important to consider how you will bypass the need to bring all that data back before you’re operational. Because the pipes are only so big, if you need to move terabytes of storage, you’re not going to have the luxury of multiple GigE connections to the cloud.
One way of doing this is to rethink how you handle primary storage, and essentially build the backup into the primary tier. It’s not dissimilar from the way the cloud vendors had to rethink backup to deal with scalability requirements. At Nasuni, we have developed a file server — and this is just an example of how you can approach the problem — that has backup built into it. We are using delta technology, snapshotting, compression and deduplication built into the primary tier so that you can push out a very small image of your primary storage tier to the cloud. That saves your network, and it saves you storage in the cloud, but more importantly, when something happens in your data center and you need to bring that file server back online, we can very quickly restore a working image of that file server from these highly compressed images. We can do this because we act as your local file server so we know exactly what critical pieces of information are needed to flow back.
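As a rough illustration of that idea — a sketch of the general technique, not Nasuni's actual implementation — the following code splits the data into fixed-size chunks, deduplicates by content hash, and compresses before upload, so a second snapshot of mostly unchanged data sends very little over the wire. The chunk size and all names are assumptions:

```python
import hashlib
import zlib

CHUNK = 4096  # assumed fixed chunk size for illustration

def chunks(data):
    for i in range(0, len(data), CHUNK):
        yield data[i:i + CHUNK]

def snapshot(data, cloud):
    """Upload one snapshot; return (manifest, bytes actually sent)."""
    sent = 0
    manifest = []                    # ordered chunk hashes = the snapshot
    for chunk in chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        manifest.append(digest)
        if digest not in cloud:      # dedup: skip chunks the cloud has
            cloud[digest] = zlib.compress(chunk)
            sent += len(cloud[digest])
    return manifest, sent

def restore(manifest, cloud):
    """Rebuild the full data set from a snapshot manifest."""
    return b"".join(zlib.decompress(cloud[d]) for d in manifest)
```

A manifest is tiny compared to the data it describes, and an unchanged chunk costs nothing to snapshot again — which is what lets the primary tier push a very small image to the cloud on each snapshot.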
For example, on a 10 terabyte file server, the amount of information that you need to bring back for that file server to give your users full access to those 10 terabytes of files is probably on the order of a few hundred megabytes — not even a gigabyte of storage. Being smart about recovering that critical metadata fast and first, before you start worrying about bringing in the specific files your users will want, is very important. This can only be done if you are in the primary tier — if you’re actually the application tier — not behind a backup system. The benefit to you is that you can now eliminate your internal backup storage system.
Think about the cleanness of that model. You’re talking about a Tier 1 storage system that needs no backup and that can recover from a full failure almost instantaneously from the cloud. You can essentially build a new file server, reconnect it to your account in the cloud, and instantly have access to all of your files. Your users will experience degraded performance until the working set — the files they are actually working on — is cached locally again, but they will be able to access those files right away. That is sufficient for most businesses, especially if you’re talking about user files, document management systems and so on. That’s one way of rethinking how you architect Tier 1 storage so that you eliminate backup and still meet recovery time objectives, even though the data sets are very large.
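The recovery model described above can be sketched as follows. This is a hypothetical illustration — the class and its names are invented, not Nasuni's API. The small metadata map is restored first so every file is immediately visible; contents are fetched from the cloud on first access and cached locally, rebuilding the working set on demand:

```python
class RecoveredFileServer:
    """Metadata-first recovery: browse everything now, fetch on demand."""
    def __init__(self, metadata, cloud):
        self.metadata = metadata   # path -> object id; small, restored first
        self.cloud = cloud         # object id -> bytes, held in the cloud
        self.cache = {}            # local working set, warmed on access

    def listdir(self):
        """Available immediately after the metadata restore."""
        return sorted(self.metadata)

    def read(self, path):
        """First access pays a cloud round-trip; later reads hit the cache."""
        if path not in self.cache:
            self.cache[path] = self.cloud[self.metadata[path]]
        return self.cache[path]
```

This is why a few hundred megabytes of metadata can stand in for terabytes of files: users see the full namespace right away, and only the files they actually touch are pulled across the wire.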
We see a lot of people who are still thinking of using the cloud as backup — it works great for a few hundred gigabytes, but beyond that, organizations start getting really nervous about being able to pull all that data back. So what we’re saying is that this is not a use case the cloud is good for, because you will always be constrained by the pipes. It is much better to rethink how you achieve Tier 1 storage and to use systems that are fundamentally designed to speak natively to the cloud for storage and to provide fast, intelligent recovery without multiple GigE connections to the cloud.
The discussion on cloud storage and how it can offer tremendous levels of reliability is a very interesting one, but I think what is more applicable to your business is how you can rethink your own backup strategy to leverage those systems — by rethinking how you do storage in Tier 1 to eliminate the complexity in your backup systems.