Improving IT Ops Service Levels and Efficiency
Operations teams are the front line of today's IT organizations. They're responsible for delivering critical business services to internal users and customers. With the complex interconnections that make up modern enterprise applications, though, conventional IT management tools can't help these teams deliver the service levels expected by the business.
The key is to embrace a new ideal of managing every infrastructure component from the perspective of how it impacts end-user service levels. To do this, IT operations teams must understand the three laws of service-oriented IT operations management:
- Transaction response times are more important than server resource utilization;
- Every component affects transaction response times; and
- Every infrastructure layer affects component response times.
By incorporating these three laws into their day-to-day processes for application monitoring, problem solving and problem prevention, IT operations teams can deliver on service level improvements and increase efficiency.
The most common source of information about application performance problems for IT operations teams is the help desk -- meaning that an end-user complained about the problem before the IT operations team even knew about it. To detect slowdowns early -- before end-users call to complain -- apply the first law and focus on transaction response times. Specifically, a service-oriented monitoring program should follow these steps:
- Determine which applications and transactions are most critical -- make these the priority.
- Set up alerts on critical transaction response times. Build alerts on both the average response time and the slowest response time seen by any individual user. Make these alerts your team's highest priority.
- Set up alerts for key component response times. This requires that you know which components make up the infrastructure of your critical applications. By detecting when a key component's service level is starting to degrade, you can identify and fix problems before your overall SLA has been violated. Make these alerts the second priority.
- Set up alerts for key component resources. These are alerts for machine resources -- CPU, memory, etc. These alerts will indicate only that the key component may have a problem in the future, so make these alerts the third priority.
- Integrate all the above alerts into your process and tools. Alerts in a vacuum are of little use. Integrate the above three types of alerts into your event management system -- and prioritize these three higher than any other remaining alerts. As part of a separate process, review the remaining alerts and remove those that provide no operational value.
- Assign responsibility for first triage of all your alert types. Generally it should be the team with responsibility for end-to-end service level delivery. Make sure that the responsible teams are notified whenever a service-level alert is created.
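The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alert` class, the tier names in `PRIORITY`, and the threshold parameters are hypothetical and do not come from any particular monitoring or event management product. The sketch raises transaction alerts on both the average and the slowest response time, then sorts a mixed set of alerts into the three priority tiers, with all remaining alerts last.

```python
from dataclasses import dataclass
from statistics import mean

# Priority tiers from the steps above; a lower number is triaged first.
# The tier names are hypothetical labels chosen for this sketch.
PRIORITY = {
    "transaction": 1,         # critical transaction response times
    "component_response": 2,  # key component response times
    "component_resource": 3,  # CPU, memory, and other machine resources
    "other": 4,               # every remaining alert
}

@dataclass
class Alert:
    kind: str     # one of the PRIORITY keys
    source: str   # transaction or component name
    message: str

def transaction_alerts(name, response_times_s, avg_limit_s, max_limit_s):
    """Return alerts when the average or the slowest response time exceeds its limit."""
    alerts = []
    if not response_times_s:
        return alerts  # nothing measured, nothing to report
    avg = mean(response_times_s)
    slowest = max(response_times_s)
    if avg > avg_limit_s:
        alerts.append(Alert("transaction", name,
                            f"average {avg:.2f}s exceeds {avg_limit_s:.2f}s"))
    if slowest > max_limit_s:
        alerts.append(Alert("transaction", name,
                            f"slowest {slowest:.2f}s exceeds {max_limit_s:.2f}s"))
    return alerts

def triage_order(alerts):
    """Sort alerts by tier; unknown kinds fall into the lowest tier."""
    return sorted(alerts, key=lambda a: PRIORITY.get(a.kind, PRIORITY["other"]))
```

Calling `transaction_alerts("checkout", [0.4, 0.5, 3.2], avg_limit_s=1.0, max_limit_s=2.0)` yields two alerts, because one slow request pushes both the average and the maximum over their limits. In practice the thresholds would come from the SLA for each critical transaction.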
Problem Solving
Resolving outages quickly is often the most difficult and highest-profile part of IT operations. Most IT operations teams address outages by holding a bridge call -- where subject-matter experts from each technology group (the Web-server tier, the app-server tier, the database people, the mainframe people, etc.) call in to discuss how their respective components are performing and attempt to isolate the problem. At large enterprises, it is not uncommon to have 50 or more people on these calls!
The service-oriented approach is much more efficient. Instead of gathering all your high-value people for a conference call, give your IT operations team visibility into historical transaction response times.
Start with the transaction data from 10 minutes prior to when the problem first surfaces.
Identify the slow transactions from the trouble period, and follow those transactions "hop-by-hop" across the infrastructure. By looking at the response times at and between each node in the infrastructure, your team can isolate the performance bottleneck and identify the slow component.
With the slow server identified, drill down into the server stack to find the root cause. Check OS, virtual machine, storage, and component resources to get to the heart of the problem.
Now you can assign the trouble ticket to the appropriate owner of the problem resource.
Once the problem is addressed, verify that service levels have been restored.
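The hop-by-hop isolation step described above can be illustrated with a short Python sketch. It assumes, hypothetically, that each transaction is recorded as a list of `(hop_name, seconds)` pairs, where a hop is either a node (such as `"db"`) or a link between nodes (such as `"app->db"`); real transaction-tracing tools will expose this data in their own formats. The sketch compares the trouble period against the baseline from before the problem surfaced and reports the hop whose average time grew the most.

```python
from collections import defaultdict
from statistics import mean

def mean_hop_times(transactions):
    """Average seconds spent per hop across a set of transactions.

    Each transaction is a list of (hop_name, seconds) pairs; this record
    layout is an assumption made for the sketch.
    """
    samples = defaultdict(list)
    for hops in transactions:
        for hop, seconds in hops:
            samples[hop].append(seconds)
    return {hop: mean(values) for hop, values in samples.items()}

def find_bottleneck(baseline, trouble):
    """Return (hop, increase_in_seconds) for the hop that slowed down the most.

    Returns None when the trouble period contains no transactions.
    """
    base = mean_hop_times(baseline)
    slow = mean_hop_times(trouble)
    # A hop that never appeared in the baseline counts its full time as increase.
    deltas = {hop: t - base.get(hop, 0.0) for hop, t in slow.items()}
    if not deltas:
        return None
    worst = max(deltas, key=deltas.get)
    return worst, deltas[worst]

# Transactions from before the problem surfaced (hypothetical data).
baseline = [
    [("web", 0.05), ("web->app", 0.01), ("app", 0.10), ("app->db", 0.01), ("db", 0.08)],
]
# Slow transactions from the trouble period (hypothetical data).
trouble = [
    [("web", 0.05), ("web->app", 0.01), ("app", 0.12), ("app->db", 0.01), ("db", 1.50)],
    [("web", 0.06), ("web->app", 0.01), ("app", 0.11), ("app->db", 0.01), ("db", 1.70)],
]

hop, increase = find_bottleneck(baseline, trouble)
print(f"slowest hop: {hop} (+{increase:.2f}s over baseline)")
```

With this sample data the database hop stands out, so the drill-down into OS, virtual machine, storage, and component resources would start on that server.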
The last area, problem prevention, incorporates all three of the laws of service-oriented IT operations. Service-oriented problem prevention involves taking a systematic approach to finding and fixing potential problems before they occur. In particular, it involves finding existing production architecture abnormalities, applying problem-solving methodologies in preproduction, and verifying that changes have no impact on SLAs during change management.
Typical architectural problems that can be identified include failed connections, unexpected dependencies, errors, transaction hangs and queuing, excessive requests, and antivirus scans or backups during peak hours.
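For the change-management check mentioned above, one simple way to verify that a change has not hurt service levels is to compare transaction response times sampled before and after the change. The Python sketch below is illustrative only: the nearest-rank 95th percentile, the 10 percent tolerance, and the function names are assumptions chosen for the example, not a prescribed method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def change_regressed(before_s, after_s, sla_limit_s, tolerance=0.10, pct=95):
    """Return True if response times after a change look worse than before.

    A change is flagged when the post-change percentile either exceeds the
    SLA limit or is more than `tolerance` (as a fraction) above the
    pre-change percentile. Both criteria are assumptions for this sketch.
    """
    p_before = percentile(before_s, pct)
    p_after = percentile(after_s, pct)
    return p_after > sla_limit_s or p_after > p_before * (1 + tolerance)
```

A call such as `change_regressed(before, after, sla_limit_s=2.0)` could run as a gate in preproduction and again shortly after the change reaches production, using the same critical transactions that drive the monitoring alerts.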
The service-oriented approach to IT operations applies a simple idea -- that each component of the IT infrastructure should be measured by its impact on user service levels -- to the daily work of the IT operations team. By applying it, IT operations teams can significantly improve on service levels while reducing the overall inefficiencies in their activities.