Over the years, I have been involved with my share of data center recovery exercises — everything from power outages to hurricanes to earthquakes to chemical explosions.
Each time, no matter how well the business continuity plan was written, there was always something new to learn.
Here are a few lessons that stand out from my experience.
Where’s the Business Continuity Plan?
Before you tell everyone that you have a business continuity plan (BCP) in place, you should actually make sure that you have written one so you know what to do. Oh, and by the way, everyone should know where it is so that it can be easily located. This sounds like a no-brainer, but you would be surprised at how many companies jump in with both feet without actually planning out their recovery process.
So what is the lesson learned? It’s important to actually write out a business continuity plan and point out the weaknesses — and don’t be afraid to present it to your executive team. However, before you do that, I suggest knowing the answers to the questions they are going to ask, which will help you get their buy-in and support for the plan. Here are a few things you will need to know:
- How much will this cost?
- What if we lost our data center today?
- What are our existing recovery plans?
- What disasters are we currently protected against?
- What is our risk to specific disasters?
How Much Will This Cost?
This might be the easiest question because the answer isn’t a dollar figure. Instead, address what a data center disaster would cost the company in man-hours and business impact. While your costs may involve purchasing risk analysis or additional hardware (all of which are important and will need the management team’s support), the more important question you can ask is, “What will this cost us if we don’t?”
I can tell you for a fact that the management team will understand the impact when your company can’t service its customers, can’t receive messages and can’t close sales. The cost is easily justified. You only need one business continuity plan to protect you from every disaster, and let’s face it, whether it is a tsunami or a simple mistake like pulling the wrong disk from the array, you are at risk for downtime. It is money well spent, and knowing that a detailed BCP is in place puts everyone at ease.
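To make the "what will this cost us if we don't?" argument concrete, the outage cost can be sketched as lost man-hours plus revenue that cannot be booked while systems are down. The snippet below is only a rough illustration: the class name, the field names and every figure in it are hypothetical placeholders, not numbers from any real company, and a real estimate would include more factors (lost customers, penalties, recovery labor).

```python
from dataclasses import dataclass


@dataclass
class DowntimeEstimate:
    """Rough outage cost model. All inputs are hypothetical placeholders."""

    outage_hours: float            # how long business-critical systems are down
    staff_affected: int            # employees who cannot work during the outage
    hourly_labor_cost: float       # average loaded cost per employee-hour
    hourly_revenue_at_risk: float  # sales that cannot be closed per hour

    def man_hours_lost(self) -> float:
        # Man-hours lost = outage duration x number of idle employees.
        return self.outage_hours * self.staff_affected

    def total_cost(self) -> float:
        # Business impact = cost of idle labor + revenue not booked.
        labor = self.man_hours_lost() * self.hourly_labor_cost
        revenue = self.outage_hours * self.hourly_revenue_at_risk
        return labor + revenue


# Illustrative figures only: a 48-hour outage affecting 200 people.
estimate = DowntimeEstimate(
    outage_hours=48,
    staff_affected=200,
    hourly_labor_cost=50.0,
    hourly_revenue_at_risk=10_000.0,
)
print(f"Man-hours lost: {estimate.man_hours_lost():,.0f}")
print(f"Estimated cost: ${estimate.total_cost():,.0f}")
```

Putting even a crude figure like this next to the price of the plan tends to make the comparison easy for an executive team to follow.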
However, as I said before, once you have your BCP written, make sure you know where it is. The last thing you want is to have to think “now where did I put that?” or “who had it last?” In fact, I’d highly recommend getting a fire-proof safe. Put the plan in, along with some granola bars, coffee and aspirin, because you will need all three.
What If We Lost Our Data Center Today?
It’s important to be honest, because you will shortchange yourself if you try to sugar-coat it. The truth is, if a company can’t recover its business-critical systems within a 48-hour period, the risks of going out of business increase significantly, and at the very least, there’s a loss of customers, productivity and ultimately revenue. That is the truth, and it will get the attention of most executives, who will therefore provide the support and resources you will need to implement your plan.
It’s important to remember that the plan doesn’t get written in a weekend and implemented the following week. It takes months to do, and there is a process with 10 steps that are clearly outlined. This is followed by testing, revisions and continuous updates, so executive support is imperative to the overall process.
While this may be the extreme, I know most companies certainly have the ability to recover their business-critical systems within a two- to four-hour RTO (recovery time objective) and are protected from the most common data center incidents like a failed drive, processor or power supply. These are the most common disasters IT managers face (and hopefully the only ones your company will have to face). However, that doesn’t cover an incident like the one I experienced a few years ago, which is not one I would have thought to include in a BCP.
What Are Our Recovery Plans?
A few years ago, I was managing a BCP team that was setting up the controls to replicate systems from five locations on the East Coast to a disaster recovery facility in Arizona. A new rack of servers was built as a duplicate of the primary data center’s IT systems and shipped to one of the locations to back up one of the remaining sites.
The shipping company dropped the rack off the loading dock and the impact shot the drives through the chassis.
However, that wasn’t the real disaster we faced, just one of the challenges that we had to deal with along the way.
The real disaster occurred when we were about to bring the disaster recovery facility online and I received a call that the UPS (uninterruptible power supply) at one of the locations had exploded. I had never considered the fact that UPS units are essentially large, chemical-filled batteries that can explode, and when they do, they cover everything in all of their chemical-makeup glory.
In addition, it was the primary power source for the data center, so not only was the power out for all of the IT systems, but those IT systems were also covered in toxic goo, and we had to promptly contact hazmat for cleanup. Once it was determined that the data center wasn’t coming online any time soon, the recovery process was started, and systems were brought online within 15 minutes, restoring operations to a functional level. This was only possible because everyone knew what their responsibilities were, and we reacted as a coordinated team with controlled violence. So, know your recovery plan inside and out!
What Are Our Existing Recovery Plans?
You may be surprised to learn that your existing data center recovery plans may not be in all that bad of shape. However, there is always room for improvement.
Most recovery plans include some form or combination of tape for recovery, which is an option, but only for those systems that have an RTO/RPO (recovery point objective) greater than 24 to 48 hours. It’s not really a solution if you need to recover an entire data center.
Recovering an entire data center requires a colocation or disaster recovery facility with a virtualized blade server infrastructure, which minimizes overall footprint, power and cooling costs and keeps the systems readily available rather than merely readily recoverable. The difference between available and recoverable is a big one when it comes to RPO and RTO.
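The split between tape-recoverable systems and systems that need to be kept available can be sketched as a simple classification by RTO and RPO. This is a minimal illustration, not a sizing tool: the function name, the system names and their objectives are all hypothetical, and the 24-hour default cutoff is an assumption taken from the low end of the 24 to 48 hour range mentioned above.

```python
from typing import Dict, Tuple


def recovery_method(rto_hours: float, rpo_hours: float,
                    tape_threshold_hours: float = 24.0) -> str:
    """Suggest a recovery approach for one system.

    Tape is suggested only when BOTH objectives are looser than the
    threshold; otherwise the system is flagged for a standby site.
    The threshold value is an assumption, not a standard.
    """
    if min(rto_hours, rpo_hours) > tape_threshold_hours:
        return "tape restore"
    return "replicated standby"


# Hypothetical inventory: system name -> (RTO in hours, RPO in hours).
systems: Dict[str, Tuple[float, float]] = {
    "order entry": (2, 0.25),
    "email": (4, 1),
    "archive file share": (72, 48),
}

for name, (rto, rpo) in systems.items():
    print(f"{name}: {recovery_method(rto, rpo)}")
```

Running an inventory through a rule like this gives a first cut of which systems must live at the disaster recovery facility and which can wait for a tape restore.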
What Disasters Are We Currently Protected Against?
Most IT managers will have procedures in place that protect against a server or storage failure or data corruption, but few are protected against the failure of an entire data center. Many types of disasters will be identified in the risk assessment of a BCP, but they boil down to three categories.
There are sudden-impact disasters like environmental accidents, chemical spills and fires; natural disasters like hurricanes, tornadoes and earthquakes; and human-related disasters like malicious attacks. These are the types of disasters that a business continuity plan typically addresses, so that you can respond to the disaster as a whole rather than just recover a few servers. It is a tougher sell to executives to prepare for this type of disaster, but the whole point is to be proactive and preventative in order to keep the business assets protected. The best disaster is the one that doesn’t happen, and the best plan is the one that doesn’t need to be enacted.
What Is Our Risk to Specific Disasters?
Depending on the location of the corporate facilities or data center, some of these three disaster categories may not apply to your organization. For example, although there is a fault line that runs through the middle of the Midwestern United States, northern California is far more likely to be hit by an earthquake than St. Louis. Similarly, San Francisco isn’t likely to be as at risk for a tornado as a place in the Midwest. This will all be identified in the risk assessment of the BCP analysis, which helps evaluate and rate the level of risk to which your company or data center is subject.
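One common way to rate location-specific risk is a likelihood-times-impact score. The sketch below assumes a 1-to-5 scale for each factor; the scale, the function names and the example ratings for a hypothetical Midwestern site are illustrative assumptions, not the output of an actual risk assessment.

```python
from typing import Dict, List, Tuple


def risk_score(likelihood: int, impact: int) -> int:
    """Score one hazard as likelihood x impact, each rated 1 (low) to 5 (high)."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    return likelihood * impact


def rank_risks(ratings: Dict[str, Tuple[int, int]]) -> List[Tuple[str, int]]:
    """Return (hazard, score) pairs sorted from highest to lowest score."""
    scored = [(name, risk_score(l, i)) for name, (l, i) in ratings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings for a site in the Midwest: (likelihood, impact).
midwest_site = {
    "earthquake": (1, 5),
    "tornado": (3, 4),
    "chemical spill on nearby highway": (2, 4),
}

for hazard, score in rank_risks(midwest_site):
    print(f"{hazard}: {score}")
```

The ranking itself matters less than the exercise of rating each hazard for each site, which forces the location-specific thinking described above.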
Another environmental example that is often overlooked is when a company is near a major interstate highway. In this case, there is the potential risk of a semi truck full of chemicals or other hazardous materials overturning and necessitating the evacuation of several square miles. An event like this recently occurred just north of Boston, where a chemical company that made paint exploded, leveling an entire city block and causing structural damage to buildings beyond that.
While these types of events are less common than the typical “worm du jour” attacking your IT systems, they are certainly important to consider, not only for the recovery of your data center, but also for the health and safety of your company’s most valuable asset: its employees.
Brace Rennels is CBCP (certified business continuity professional) at Double-Take Software.