Learning to Love Reasonable Downtime
May 24, 2010 6:00 AM PT
Working for a disaster recovery solutions designer is often difficult. After being bombarded by ad slogans, magazine articles and just plain life experience, many company executives are looking to achieve a mythical figure for server uptime. Perceived uptime of 99.999 percent -- or "five nines" -- equates to a little over five minutes (about five minutes and 15 seconds) of unscheduled downtime per year, and the number is achievable.
The problem is that this number is only achievable if the language is carefully scrutinized and an overwhelming amount of hardware and software is brought to the table to produce the desired effect. This leads to two issues in the modern enterprise: either management doesn't understand the language, or they are unwilling to allocate the appropriate amount of budget for the project.
Language is a funny thing. Words can mean many things, and often words are overlooked entirely so that the language means what a person would prefer it to mean. In this case, the issues surround two phrases: "perceived uptime" and "unscheduled downtime" -- both of which are absolutely critical to understanding the idea of five nines of uptime.
Perception and Reality
Perceived uptime is generally accepted to mean that, as long as the user gets the appropriate response from the server, the system is considered to be online. This means that a load-balanced set of Web servers will serve Web pages to users, even if one of the servers goes completely offline for a time. The servers experience downtime, but the system does not, as another Web server can handle the load of the failed device for a time. Five nines of perceived uptime is accomplished for the system as a whole, but that doesn't mean that each and every server in the system can achieve five nines of uptime individually.
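To illustrate why the system as a whole can reach a figure that no single server does, here is a minimal sketch in Python. It assumes that server failures are independent of one another and that any one surviving server can carry the full load; neither is guaranteed in a real environment. The function name and the 99.9 percent per-server figure are purely illustrative.

```python
def system_availability(server_availability: float, servers: int) -> float:
    """Availability of a pool that counts as up while at least one server is up.

    Assumes independent failures and that a single surviving server can
    handle the whole load (both are simplifying assumptions).
    """
    if not 0.0 <= server_availability <= 1.0:
        raise ValueError("server_availability must be between 0 and 1")
    if servers < 1:
        raise ValueError("servers must be at least 1")
    # The pool is down only when every server is down at the same time.
    return 1.0 - (1.0 - server_availability) ** servers


# Hypothetical example: each server is individually up 99.9 percent of the time.
single = system_availability(0.999, 1)
pair = system_availability(0.999, 2)
print(f"one server: {single:.5f}, two servers: {pair:.7f}")
```

Under these assumptions, two servers that each manage only three nines give a pool that is down about one millionth of the time, which is better than five nines. Correlated failures, such as a shared power feed or a bad patch applied to every server at once, would make the real figure worse.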
The other component of the definition of five nines that's traditionally overlooked is "unscheduled downtime." All servers require maintenance. This could be updates, patches, fixes and other very normal and very important configuration changes. Many of these maintenance procedures require the server to be offline for some period of time, even if it's just for a reboot. Now, an average Windows server reboot can take upwards of 10 minutes, but let's say that in your environment it's closer to three minutes. Do that twice in one year, and you're already beyond the roughly five minutes that five nines of uptime would allow. The point is that patches and other planned maintenance fall into scheduled downtime, and therefore are not counted toward the total downtime numbers. This means that you could have entire weekends during which the servers and systems are down, but still meet the requirements of five nines metrics.
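The arithmetic behind the downtime budget can be sketched in a few lines of Python. This is an illustration only: it assumes an average year of 365.25 days, and the function names are made up for the example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap years included


def downtime_budget_minutes(nines: int) -> float:
    """Downtime allowed per year for an uptime target of the given number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # five nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


def within_budget(nines: int, outage_minutes: list[float]) -> bool:
    """True if the listed outages fit inside the yearly budget."""
    return sum(outage_minutes) <= downtime_budget_minutes(nines)


budget = downtime_budget_minutes(5)
print(f"five nines allows about {budget:.2f} minutes per year")
# Two three-minute reboots only fit if they are classed as scheduled
# downtime and therefore left out of the list entirely.
print(within_budget(5, [3.0, 3.0]))
```

Five nines works out to roughly five and a quarter minutes per year, so two three-minute reboots would break the target if they counted. Only outages classed as unscheduled belong in the list passed to `within_budget`, which is exactly the distinction the definition depends on.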
The other thing that's usually ignored is the massive hardware and systems investment you'll need in order to meet five nines of perceived uptime for your environment. Everything will require redundant servers, storage and networking in order to make this work, which will quickly add up on the budget sheets. For example, if you need three terabytes of storage for the production dataset, you will need at least three more for the redundant dataset. Add to this the possibility of another three for a remote dataset, and your storage budget just tripled. You also need to load-balance any servers that handle client requests, and either load-balance or otherwise cluster back-end servers as well. This means you need at least twice the number of servers necessary for a non-load-balanced solution set, and possibly a higher multiple if you also want site redundancy.
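A rough way to see how these multiples accumulate is sketched below in Python. The class and field names are hypothetical, and the sketch assumes that every redundant copy needs the same storage and the same number of servers as production, which is a simplification of real sizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyPlan:
    """Hypothetical capacity plan; all names here are illustrative."""

    production_storage_tb: float
    production_servers: int
    local_copies: int = 1   # redundant copies kept alongside production
    remote_copies: int = 0  # additional copies kept at a remote site

    @property
    def multiplier(self) -> int:
        # Production itself counts as one copy.
        return 1 + self.local_copies + self.remote_copies

    def total_storage_tb(self) -> float:
        return self.production_storage_tb * self.multiplier

    def total_servers(self) -> int:
        return self.production_servers * self.multiplier


# The article's example: 3 TB in production, one local and one remote copy.
plan = RedundancyPlan(production_storage_tb=3, production_servers=4,
                      local_copies=1, remote_copies=1)
print(plan.total_storage_tb(), plan.total_servers())
```

With one local and one remote copy, the three-terabyte example becomes nine terabytes, and a hypothetical four-server tier becomes twelve servers. Dropping the remote copy brings both figures back down to double rather than triple.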
Next, your network must have redundancy if any of your solutions require incoming or outgoing network access. Otherwise, the loss of a network link will create perceived downtime. This means multiple links from multiple service providers to avoid any single point of failure in the networking solutions. Also remember to get multiple routers and switches, as they can be single points of failure as well.
Add It Up
As you can see, achieving five nines is not impossible, but it is expensive, and therefore may not be the best choice for every system in your environment. Many solutions can provide redundancy without duplicating each of these components (though each system will need at least some level of redundancy), so allowing for a limited amount of downtime can sometimes dramatically decrease your overall cost of ownership. The Recovery Time Objective (RTO), or the amount of time a server can be offline before severely impacting business goals, becomes a critical part of disaster-recovery planning for your environment. Saying that you want near-zero downtime will quickly drive up your budget on servers that have no need for it, such as non-critical file servers. If you can live without a system for as little as 15 minutes, the range of solutions and price points open to you suddenly widens.
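One way to put this into practice is to record an RTO for each system and separate those that need near-zero downtime from those that can tolerate a short outage. The Python sketch below is only an illustration: the 15-minute threshold comes from the example above, while the system names and their RTO values are invented.

```python
from datetime import timedelta

# Threshold borrowed from the 15-minute example; treat it as a tunable assumption.
NEAR_ZERO_RTO = timedelta(minutes=15)


def needs_full_redundancy(rto: timedelta) -> bool:
    """True if the system cannot tolerate an outage as long as the threshold."""
    return rto < NEAR_ZERO_RTO


def split_by_rto(systems: dict[str, timedelta]) -> tuple[list[str], list[str]]:
    """Return (near-zero-downtime systems, systems that can tolerate an outage)."""
    critical = sorted(name for name, rto in systems.items()
                      if needs_full_redundancy(rto))
    tolerant = sorted(name for name, rto in systems.items()
                      if not needs_full_redundancy(rto))
    return critical, tolerant


# Hypothetical inventory; the names and RTO values are made up.
inventory = {
    "order-entry": timedelta(minutes=1),
    "email": timedelta(minutes=30),
    "departmental-file-server": timedelta(hours=4),
}
critical, tolerant = split_by_rto(inventory)
print("near-zero downtime:", critical)
print("can tolerate an outage:", tolerant)
```

Only the systems in the first list would need the full redundant build described earlier; the rest can be matched to less expensive recovery options.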
Some systems will require five nines of perceived uptime, but understanding what that term really means, and making sure that each system actually requires it, will have a large impact on both your end-users' happiness and your bottom line. Analyze each system to determine what its uptime requirements truly are, and whether it can justify the budget that reaching that uptime level will require. Based on that analysis, find the solutions that give each system the uptime it needs without spending more than is required to deliver it.
Mike Talon is an enterprise systems engineer at Double-Take Software.