Corporate data centers are running ever more efficiently, and even delivering on the promise of self-monitoring and automation, but there is always the chance that something, somewhere, somehow might go terribly wrong.
To confront the risk, organizations need solid assessment strategies, and ways to identify where staff, skills, infrastructure or sanity may be lacking. TechNewsWorld talked with Daniel Berg, vice president and CTO of services for Sun Microsystems, to find out more about how enterprises can go about assessing and cutting IT risk.
TechNewsWorld: First, how important is risk management when we’re talking about corporate data centers?
Dan Berg: Risk management is a critical component in the operation of any data center. Operational risk, although usually intangible, is a leading indicator of the health and overall availability of a data center. Measuring this risk can involve looking at many different aspects of a data center's operations. At a minimum, the measurement should cover the skill levels of the staff managing the data center, the risk inherent in the processes used, and systemic risk, which includes areas like configuration and capacity analysis.
Having risk measurements gives IT management the best view of the overall health of their operations. A cause-and-effect relationship can be established so that when changes are made to skill levels, operational processes or infrastructure, management can see and predict what the impact will be.
TNW: Why is it so significant to IT operations?
Berg: Risk is a leading indicator of the health of an IT operation. With defined methods for understanding this risk, management can have real confidence in the actual state and health of their operation.
Consider that one in every 200 “administrative touches” to systems results in an error condition of some sort that needs to be fixed. You have a one-half-of-a-percent probability of having a problem just by going to work. Sun has done numerous studies of mixed IT environments at customer sites so we can look at yours and determine what we call your Operational Risk Index. For example, if your administrators have had Sun-certified training, the chances of them making a boo-boo are 59 percent lower.
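The arithmetic behind those figures can be sketched in a few lines. The 1-in-200 touch-error rate and the 59 percent training reduction come from the interview; the independence assumption and the 100-touch example are illustrative, not part of Sun's model.

```python
# Numbers from the interview; treating touches as independent is an
# illustrative assumption, not Sun's published methodology.
P_ERROR_PER_TOUCH = 1 / 200      # one in 200 administrative touches goes wrong
TRAINING_REDUCTION = 0.59        # Sun-certified training: 59% fewer errors

def prob_at_least_one_error(touches, p=P_ERROR_PER_TOUCH):
    """Probability of at least one error across a number of independent touches."""
    return 1 - (1 - p) ** touches

# Compare an untrained admin with a trained one over 100 touches.
untrained = prob_at_least_one_error(100)
trained = prob_at_least_one_error(100, P_ERROR_PER_TOUCH * (1 - TRAINING_REDUCTION))
print(f"untrained: {untrained:.1%}, trained: {trained:.1%}")
```

Even a one-half-percent per-touch error rate compounds quickly over many touches, which is why per-administrator training effects show up so strongly at the data center level.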
TNW: What are the key symptoms of a data center at risk?
Berg: We look at a number of things, including systemic complexity and configuration, skill levels and user training, operational processes, security conditions and patch levels. Any one of these areas can show clear symptoms of trouble and thus carry risk. However, it is important to take a holistic view of this risk, considering the severity and probability of each risk component. An analogy would be the risk of an airplane crash versus an automobile crash. The probability of an airplane crash is low but the severity is high. On the other hand, the probability of an automobile crash is relatively high but the severity is typically lower.
So what risk is critical? The answer depends on which service level agreements and other operational metrics are critical to your business, but understanding and quantifying both the probability and the severity of the risk is essential. We look at both for each risk component. This is important because not all risks are the same. The probability of a risk is not sufficient by itself; you must also understand the impact, or severity, if the risk becomes real.
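Berg's probability-times-severity weighting can be sketched as a simple scoring rule. The scoring formula, the 0-to-10 severity scale and the example numbers are assumptions for illustration, not Sun's actual Operational Risk Index model.

```python
# Minimal sketch of weighing probability against severity. The product
# scoring rule and all values here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RiskComponent:
    name: str
    probability: float   # chance the risk materializes, 0..1
    severity: float      # impact if it does, on an assumed 0..10 scale

    @property
    def score(self):
        # Expected impact: probability weighted by severity.
        return self.probability * self.severity

# The airplane/automobile analogy from the interview, with made-up numbers.
risks = [
    RiskComponent("airplane crash", probability=0.0001, severity=10.0),
    RiskComponent("automobile crash", probability=0.02, severity=3.0),
]

for r in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{r.name}: {r.score:.4f}")
```

Under this toy scoring, the high-probability, low-severity automobile risk outranks the low-probability, high-severity airplane risk, which is exactly why neither probability nor severity alone is a sufficient measure.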
TNW: How does an IT department make a diagnosis?
Berg: In our case we have a risk model and a service to generate a risk level. The Operational Risk Index is the model and process by which we measure risk. Our Preventative Service offering includes an Operational Risk Index analysis to help inform the preventative actions that are taken.
TNW: What are the possible prescriptions?
Berg: There are many prescriptions, and in no case is there just one; there are many ways to mitigate a given risk. For example, suppose you have relatively high risk around your ability to back up and recover data. Would you mitigate it by improving the backup and restore process? By raising the skill level of the folks who perform the backup and restore? By checking that your tape libraries are well maintained and have the proper patch levels? Any one of these could be a prescription, but if you can look at all of these risks with a summary index, you can determine which action will decrease your overall risk level the most.
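The summary-index idea can be sketched as follows: score each risk component, aggregate into one index, and pick the mitigation that lowers the index most. The component scores, the mean aggregation and the mitigation effects are all hypothetical values for illustration, not Sun's model.

```python
# Hypothetical component scores (0..1) for the backup/restore example above.
baseline = {"process": 0.6, "staff_skills": 0.4, "tape_patch_levels": 0.5}

def overall_index(components):
    """Aggregate component scores into one index (an assumed mean aggregation)."""
    return sum(components.values()) / len(components)

# Each candidate action reduces one component by an assumed fraction.
mitigations = {
    "revise backup/restore process": ("process", 0.5),
    "train backup operators": ("staff_skills", 0.4),
    "patch tape libraries": ("tape_patch_levels", 0.3),
}

def index_after(action):
    """Overall index if the given mitigation is applied to the baseline."""
    component, reduction = mitigations[action]
    adjusted = dict(baseline)
    adjusted[component] *= 1 - reduction
    return overall_index(adjusted)

best = min(mitigations, key=index_after)
print(f"baseline index: {overall_index(baseline):.3f}")
print(f"best action: {best} -> {index_after(best):.3f}")
```

The point of the single index is exactly this comparison: rather than debating mitigations in isolation, each candidate action is judged by how far it moves the one number.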
TNW: What are the metrics for assessing the outcomes?
Berg: We think of your Operational Risk Index like a cholesterol level. If a doctor tells you that your cholesterol is high, it does not mean you will die tomorrow; it does, however, indicate that you have a better chance of dying than if it were low. Equally, a low cholesterol level does not guarantee that you won't die tomorrow. The important thing is that your Operational Risk Index, like a cholesterol level, gives you a baseline for determining your overall health. You can then begin to monitor and manage that health by watching which direction the index is moving.
It’s easiest to use your overall risk index for a given service or data center. This index should give you a better indication than just about anything else of what outcomes to expect from that service or data center. The outcomes can be typical measures such as availability, incident rates and transaction volumes. The key is to link risk indexes to your expected outcomes.
TNW: Where are there still challenges or room for improvement in assessing and dealing with risk management in the data center?
Berg: The biggest area I see is finding better ways to automate the measurement of process and skill risk. Today this is usually done by interviewing people and documenting the process. But as technologies like business process modeling and process orchestration become a reality, it will get much easier to capture process risk just by inspecting the process notation. It also becomes easier to compare against industry best practices.