Beyond War Room Network Management
The IT news headlines are full of incidents of major cloud instances brought down for days, and unfortunately often weeks, with some of the largest of these due to network issues in association with virtualization and storage sprawl. The price in the cloud era for such disruptions is very high and very public.
A big part of the solution to preventing such outages comes from comprehensive, automated and increasingly integrated network management capabilities. The tasks before network managers have never been more daunting. There are far more devices, hybrid networks, hybrid compute resources, higher levels of virtualization, and there is a need to maintain security and compliance requirements throughout.
What's more, the pressure to keep costs down and to seek lower-cost alternatives for converged infrastructure remains a constant companion to business and IT architects, and therefore an ongoing network challenge.
Into this environment, HP has delivered a wide-ranging update to its Network Management Center suite Version 9.1. The emphasis is on a comprehensive lifecycle approach to network management with deep data gathering, automated root cause analytics, and intelligent and proactive response features that enable consistently high performance and network reliability.
BriefingsDirect recently sat down with Ashish Kuthiala, director of product marketing for HP Software's Network Management Center, to dig into the new offerings and to better understand why previous fragmented approaches to network performance and stability just won't hold up for most enterprises. The discussion is moderated by Dana Gardner, principal analyst at Interarbor Solutions.
Listen to the podcast (36:24 minutes).
Here are some excerpts:
Dana Gardner: What is it about the new IT environment that is taxing the older ways of network management?
Ashish Kuthiala: When you're looking at the network today, it has become very complex and is increasingly becoming more complex. With new domains coming in, such as voice over IP (VoIP), webcasts and video traffic, multiprotocol label switching (MPLS) services, unified communications and cloud computing and virtualization, it just becomes a nightmare to manage your network for your business.
Then you look at the volume of network devices coming online. Now, everyone wants to be in the instant-on enterprise mode. Everyone has to be connected. Everything has to be connected. Everyone expects immediate gratification and instant results. You have to respond to this opportunity continuously, and "any time, anywhere, any way" is the new tagline for anybody who is working.
Let's look at the job of the director of network ops in a particular IT organization. Not only does he have to configure, manage and standardize a network, he has to provision, he has to deliver, and he has to report on it. He has to do it very proactively and he has to do it very strategically at the lowest cost possible.
IT budgets are shrinking or remaining flat, whereas the demands on IT are really going up. It's estimated that a customer can lose about (US)$70,000 a minute during a network outage, as I'm sure you've seen in the recent news. It's a big business inhibitor if the network goes down. The network is what provides the experience to the end user for all the IT services they use.
Gardner: Why isn't the previous mode of network management able to keep up?
Kuthiala: Today, if you were to look into a customer's IT department managing a network environment, you would often see a war-room-like approach to managing networks. ... They're very reactive. They have multiple tools, legacy approaches, and a lot of band-aids. The inability to tie together what used to be separate domains has become unacceptable.
The inability to cope with the scale and complexity, with different teams hunched over their different monitors, is what I call the "swiveling chair syndrome." If there is a network outage, you have these eight or 10 different operators looking at different aspects of the network. They are just swiveling in their chairs, talking to each other and looking for data that should really be on one screen for them to manage. The lack of scalability of such tools just adds to the problem.
Gardner: How does an automated approach work better?
Kuthiala: To manage your network today, you really need to understand how your network is constructed from the bottom up, how it ties together, how it changes over time, and how it self-organizes. You need to build that kind of intelligence into your root-cause analysis.
The design of the tools has to be built ground up, based on these decisions. That's how you need to construct the tools. That's how they need to be integrated. For an operator, all these need to build upon each other.
It has to be in the right context. It cannot be siloed. It is a nightmare to manage. The desired nirvana for a network team is to reduce the numerous point tools to manage various aspects of network management. It has to be proactive, not reactive.
You have compliance management diagnostics and change issues that you need to take human error out of, and you need to automate that. You want to reduce the manual effort, the errors and increase control over your environment. You want to reduce the mean time to repair network outages, and maintain cost optimization as your network grows.
Today for customers, "performance is the new fault." So just because a network device is up and running, and you can ping it, doesn't mean it is providing the quality of service it should to the end user. It's really the performance that the network is being measured against. ...
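To make the distinction between "up" and "performing" concrete, here is a minimal Python sketch. It is an illustration only, not part of NNMi or any HP product; the `Probe` fields and the latency and loss limits are assumed values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One hypothetical measurement of a device (field names are assumptions)."""
    reachable: bool     # did the device answer a ping?
    latency_ms: float   # measured round-trip latency
    loss_pct: float     # measured packet loss, in percent


def classify(probe, max_latency_ms=150.0, max_loss_pct=1.0):
    """Return 'down', 'degraded', or 'healthy' for one probe result."""
    if not probe.reachable:
        return "down"
    # Reachable is not enough: the device must also meet its service targets.
    if probe.latency_ms > max_latency_ms or probe.loss_pct > max_loss_pct:
        return "degraded"
    return "healthy"
```

A device that answers pings but takes 400 ms to do so is classified as "degraded" rather than "healthy", which is the sense in which performance is treated as a fault.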
Customers are looking for a solution that's efficient, automated, and secure for them. When they manage a network, they should be able to do things like fault, performance, change, configuration, compliance, trending and reporting, and this ties into their business services.
So, HP looked at this problem. As you know, we've had a long history of about 20 years with the HP OpenView product in network management. As we acquired other companies such as Opsware, they brought in additional tools with them. We looked at the tools and the evolving landscape of the network management domain and, about five years ago, embarked on a re-architecture plan for these products from the ground up.
The approach wasn't to make these products just work together by putting in connectors, but we wanted them to be integrated from bottom up, from the data level itself, where the data would build upon each other.
Now, as we look at the Network Management Center (NMC), it is a complete portfolio of solutions and tools that lets you do network management in an integrated and automated way.
This really builds upon the HP Network Node Manager i (NNMi), the related special plug-ins that handle complex services such as multicast traffic, VoIP, etc., as well as the network automation piece of it, which really helps customers automate and manage their change, compliance and configuration of network devices that they need to do on an ongoing basis.
The five-year journey of re-architecting our NMC portfolio completes with the 9.1 release that we are talking about today.
So the earlier 9.0 release introduced a number of features including better user interfaces, the ability to scale to large environments, and tying our products together into better functioning solutions. With 9.1, we are building on that.
We've strengthened the ability of our customers to manage cloud services. The most critical capability that a customer must have is to manage the network the same way that they have managed traditional networks, and it doesn't matter if they have to go across the cloud or are looking at private or public clouds.
Gaining visibility into the network elements, whether they are local or off-premise, and into the health and quality of the cloud services being delivered, is the most important step. Can I reach my device? Is it healthy? Is it performing to the expected levels of business needs?
And of course, configuration compliance management of these devices across the cloud is very important, and corrective actions and rollbacks are very important. Our tools are able to do that across different environments.
The 9.1 release is also focused on the managed service provider's (MSP's) market needs. There is a big trend of IT outsourcing to MSPs, and one of the things that customers want to outsource is network management services. So this is a big, growing market, and our MSPs need platforms to manage their customers' network environments in a way that maximizes their profit.
They need to scale and grow with their customer in expanding network environments, reduce their hardware spend and their training costs, as well as grow their revenues and create new lines of business, as their own customers move to new and complex services.
For example, a customer might go from traditional phones to IP telephones, and at that point, the MSP has to manage that aspect of their customer's environment as well, and they don't want at this point to buy a new tool.
The size of the customer's network might increase, and you don't want to buy another server, another set of tools and deploy another set of operators to manage that.
We have introduced multi-tenancy capability and security groups that allow our customers to separate their data and views into secure partitions. This helps them manage multiple customers, departments or sites per single software instance, driving down their cost and giving them a flexible architecture.
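The idea of partitioning one software instance by tenant can be sketched in a few lines of Python. This is a simplified illustration, not the product's data model; the node names, the `tenant` field, and the `visible_nodes` helper are all hypothetical.

```python
# Hypothetical inventory held by a single management instance.
NODES = [
    {"name": "core-rtr-1", "tenant": "customer-a"},
    {"name": "edge-sw-7", "tenant": "customer-a"},
    {"name": "branch-fw-2", "tenant": "customer-b"},
]


def visible_nodes(nodes, allowed_tenants):
    """Return the names of nodes an operator may see, given the tenants
    that operator is authorized for."""
    allowed = set(allowed_tenants)
    return [node["name"] for node in nodes if node["tenant"] in allowed]
```

An operator authorized only for `customer-b` would see just `branch-fw-2`, even though all three devices are managed by the same instance.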
We've also done a lot of work on the performance-based, time-based thresholds for better alerting. What this means is that the performance data is in the context of the network topology, providing a unique point for your fault monitoring. It helps customers get proactive notification of performance degradation, fix it proactively, and guarantee service delivery levels.
We've also increased the number of months that the data is retained. It's up to 13 months now, which allows you to do forecasting and trending. This is a sufficient data retention period for compliance requirements for real-time and historical data, and allows a very efficient analysis.
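One common reading of a time-based threshold is "alert only when a metric stays above its limit for several consecutive samples," so that a single spike does not raise an incident. The Python sketch below assumes that reading; the function name and parameters are illustrative and are not taken from the product.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if `samples` contains at least `min_consecutive`
    values in a row that are strictly above `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

With a 90 percent utilization threshold and three required samples, the series 50, 95, 60, 96, 97, 98 triggers an alert at the last sample, while an isolated 95 does not.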
Our user interface (UI) has been enhanced based on the feedback we've gotten from customers. The common look and feel UI across all the products and our solution set ensures lower training cost -- train once, leverage across all these tools.
The UIs show relevant contextual information on the nodes and incidents they're managing, giving them a lot of operational efficiency. The breadcrumb history and the easy navigation with right-click menus also allow the operators to get to the root cause more quickly, making them much more efficient and improving the time to resolution.
The analysis pane shows you a number of system components and helps you get key information, including availability and performance graphs, really quickly.
Gardner: In some of these high-profile outages that we've had recently, it seems that they were doing updates and that caused the cascading or spiraling effect and ultimately brought the network down. What is it about your suite and your comprehensive approach that could help ameliorate something like that?
Kuthiala: A network constantly needs updates, whether it's configuration updates or being in compliance with a number of different policies -- Sarbanes-Oxley (SOX) or the Health Insurance Portability and Accountability Act (HIPAA), and government regulations.
Typically, customers have a set of people who use multiple tools or manually log into a number of these devices and do these configuration changes manually. This is very dangerous. One, there is human error involved. Second, when something goes wrong, you don't know what has gone wrong, and you are scrambling to fix it. Think about doing this across 50,000, 60,000, 70,000 devices in your network.
Our network automation capabilities allow customers to automatically make these changes through our tools. As they implement these changes, it takes minutes or hours, versus days, to keep these devices configured to the latest and greatest configurations and in compliance.
Think about when you are on the 59,000th device that you are updating and you realize there is an error. This was not the right thing to do, and you need to roll back. If you're doing this manually, you're spending many hours fixing the error while your business is suffering during that time. Our automation capabilities help customers; with a few clicks of buttons they are able to automate all of this.
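The rollback scenario described here can be illustrated with a small Python sketch: snapshot each device's configuration before changing it, and restore every snapshot if any device fails validation. This is a toy model under stated assumptions, not HP's automation engine; configurations are plain dictionaries, and `change` and `validate` are hypothetical callbacks supplied by the caller.

```python
def apply_with_rollback(configs, change, validate):
    """Apply `change` to every device in `configs` (device -> config dict).

    `change(config)` returns the new config; `validate(device, config)`
    returns False if the new config must not be deployed. On the first
    failure, all devices already touched are restored and False is returned.
    """
    snapshots = {}
    for device in list(configs):
        snapshots[device] = configs[device]          # keep the original
        candidate = change(dict(configs[device]))    # work on a copy
        if not validate(device, candidate):
            # Roll back every device changed so far, including this one.
            for done, original in snapshots.items():
                configs[done] = original
            return False
        configs[device] = candidate
    return True
```

The point of the sketch is that the rollback path is the same whether the error surfaces on the third device or the 59,000th: the snapshots taken along the way make the restore mechanical rather than manual.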
Today, customers might be looking at a number of incidents -- 10,000 to 15,000 incidents. For example, if somebody yanks a LAN cord out and puts it back in, what really has happened is the interface has gone down and come back up. And now that is flagged as an incident or an event that the operator has to pay attention to.
With our root cause analysis engine, and the ability to map the topology dynamically in a spiral discovery fashion, the network topology is always up-to-date. The root cause analysis engine helps figure out whether this is an incident that needs to be paid attention to or not, auto-resolving some of that.
The incidents that boil up to the operators are meaningful, and therefore are reduced in number to those that are actionable. We have had customers whose incidents have been reduced from 10,000 to 12,000 down to 400, and only about 100 of those have to be acted upon and escalated to the next level of management.
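The LAN-cord example suggests a simple form of incident reduction: if an interface goes down and comes back up within a short window, treat the pair as a resolved flap rather than an actionable incident. The Python sketch below shows only that one rule; it is far simpler than a real root cause analysis engine, and the event format and the 30-second window are assumptions made for illustration.

```python
def actionable_outages(events, flap_window_s=30):
    """Filter interface events down to outages worth an operator's time.

    `events` is a time-ordered list of (timestamp_s, interface, state)
    tuples, where state is 'down' or 'up'. Returns (interface, down_time)
    pairs for outages that lasted longer than `flap_window_s` or that
    have not cleared yet, ordered by the time the outage began.
    """
    pending = {}      # interface -> time it went down
    actionable = []
    for ts, iface, state in events:
        if state == "down":
            pending.setdefault(iface, ts)
        elif state == "up" and iface in pending:
            down_ts = pending.pop(iface)
            if ts - down_ts > flap_window_s:
                actionable.append((iface, down_ts))
            # Otherwise it was a short flap and is auto-resolved.
    # Interfaces that never came back up are still open incidents.
    actionable.extend(pending.items())
    return sorted(actionable, key=lambda item: item[1])
```

In this model a cord pulled and reseated within five seconds produces two raw events but no incident, while an interface that stays down is still reported.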
Automation really takes a lot of the work out of your hands and enables you to fix errors very proactively, and if there is a mistake, fix it right away with a few clicks. ...
I'm talking very specifically about the configuration of network devices. The software that your network device comes with is the key differentiator in how they act, and the intelligence that they provide. So this has to be not only managed really well, but there are patches and upgrades, just as you have software patches and upgrades on your servers. These have to be managed. Sometimes, there are government regulations or company regulations that you want to propagate across these devices.
But tying to the business service management set of tools or the suite stems from the fact that, when you look at it from a business service availability aspect, it's not just about the network. There are servers, there are applications, and they are all tied together. For example, if an application business service is not working, do you know if it's the server? Do you know if it's the application? Do you know if it is the network?
Our Business Service Management offering ties in these aspects through our runtime service model. This ties your network, to your application, to your server and is able to give your business a look into how your business service is going to be affected by the failure of any one of these infrastructure elements.
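A service model of this kind can be thought of as a dependency graph: each business service depends on applications, which depend on servers and network elements. The Python sketch below walks such a graph to find everything affected by one failed element. The graph contents and the `impacted` helper are hypothetical and are not drawn from HP's runtime service model.

```python
# Hypothetical dependency map: item -> things it directly depends on.
DEPENDS_ON = {
    "online-orders": ["web-app"],
    "web-app": ["app-server-1", "core-switch"],
    "email": ["mail-server", "core-switch"],
}


def impacted(failed, depends_on):
    """Return, sorted by name, every item that depends on `failed`
    either directly or through a chain of other items."""
    result = set()
    changed = True
    while changed:
        changed = False
        for item, deps in depends_on.items():
            if item in result:
                continue
            # Affected if any dependency has failed or is itself affected.
            if any(dep == failed or dep in result for dep in deps):
                result.add(item)
                changed = True
    return sorted(result)
```

Here a failure of `core-switch` affects the web application, email, and, through the web application, online ordering, whereas a failure of `mail-server` affects only email.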
Gardner: Now Network Management Center is a fairly significant set of different products, but most people already have something in place. So this is not a matter of starting greenfield. This is a matter of coexistence, migration and transformation. How do you get started?
Kuthiala: Most customers today have in place something to monitor their networks, but a lot of customers have not automated their configuration, compliance and diagnostic capabilities that we talked about.
We've seen a trend in our customer base where they buy smaller node packs to manage a small number of devices with our automation capabilities. Once they have put that in place, they start to see other efficiency use cases that they can achieve using our network automation capabilities.
We observe that these customers come back and buy more licenses for managing a greater number of network devices. So, that's almost like a greenfield opportunity here.
But, when we look at most customers managing their networks and doing performance and monitoring, for example, if they have an instance of our software, it's an in-place upgrade. We offer a dual entitlement and run a parallel program that allows customers to seamlessly set up another parallel environment and bring the network up there, start to manage it, and seamlessly shift.
We've had an instance of a customer in the EMEA region, where they were testing our latest software and running it in parallel to see how it was functionally different and what effect on productivity it would have on their operators. A couple of weeks went by and their senior management started getting escalations for network problems.
Senior management turned to the network operations team and asked, "We have all these incidents showing up. What is going on? Is something wrong?"
Almost sheepishly, the network operator team had to acknowledge that they were testing the new platform and had completely forgotten about the old tool which they needed to shut down because the new platform ignored the incidents that were not meaningful. They had "accidentally" migrated to the new platform and were managing the network much more efficiently.
A lot of our customers use this approach to migrate to the new platform, and of course, our approach is modular. Start with the core product and add the special plug-ins to manage your IP telephony, MPLS or multicast capabilities.