This enterprise case study discussion focuses on Blue Cross and Blue Shield of Florida and how it has been able to improve its applications’ performance — and even change the culture of how it tests, deploys, and operates its applications.
Join Victor Miller, senior manager of systems management at Blue Cross and Blue Shield of Florida in Jacksonville, for a discussion moderated by Dana Gardner, principal analyst at Interarbor Solutions.
Listen to the podcast (15:56 minutes).
Here are some excerpts:
Victor Miller: The way we looked at applications was by their silos. It was a bunch of technology silos monitoring and managing their individual ecosystems. There was no real way of pulling information together. We didn’t represent what the customer was actually experiencing inside the applications.
One of the things we realized was that we had to focus on the customers, seeing exactly what they were doing in the application and bringing that information back. We were looking at the performance of the end-user transactions, what the end users were actually doing inside the app, versus what the Oracle database is doing, for example.
When you start pulling that information together, it allows you to get full traceability of the performance of the entire application from a development, test, staging, performance testing, and then also production side. You can actually compare that information to understand exactly where you’re at. Also, you’re breaking down those technology silos, when you’re doing that. You move more toward a proactive transactional monitoring perspective.
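The environment-to-environment comparison Miller describes can be sketched as a simple check: record the same end-user transaction’s response times in each lifecycle stage and flag where one environment diverges from the test baseline. The transaction timings and the 2x regression factor below are hypothetical; in practice this data would come from a monitoring suite such as the HP/Mercury tooling discussed later, not hard-coded lists.

```python
from statistics import median

# Hypothetical response times (seconds) for one end-user transaction,
# captured in each lifecycle environment.
timings = {
    "development": [0.41, 0.39, 0.44, 0.40],
    "test":        [0.52, 0.48, 0.50, 0.55],
    "staging":     [0.61, 0.58, 0.63, 0.60],
    "production":  [1.90, 2.10, 1.75, 2.05],
}

def regressions(timings, baseline="test", factor=2.0):
    """Return environments whose median latency exceeds `factor` times the baseline."""
    base = median(timings[baseline])
    return {env: round(median(times) / base, 2)
            for env, times in timings.items()
            if median(times) > factor * base}

print(regressions(timings))  # flags production at ~3.9x the test baseline
```

The point of the comparison is exactly what the transcript describes: once the same transaction is measured in every environment, a slowdown that only appears in production stands out immediately instead of being buried in per-silo metrics.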
We’re looking at how the users are using it and what they’re doing inside the applications, like you said, instead of the technology around it. The technology can change. You can add more resources or remove resources, but really it all comes down to the end users and what they’re experiencing in the apps.
Blue Cross and Blue Shield is one of the 39 independent Blue Cross and Blue Shield companies throughout the United States. We’re based out of Florida. We’ve been around since about 1944. We’re an independent licensee of the Blue Cross and Blue Shield Association. One of our main focuses is healthcare.
We do sell insurance, but we also have our retail environment, where we’re bringing in more healthcare services. It’s really about the well-being of our Florida population. We do things to help Florida as a whole, to make everyone more healthy where possible.
When we started looking at things, we thought we were doing fine. But once we actually started bringing the data together to understand what was really going on, we found that our customers weren’t happy with the performance or availability of their applications.
We started looking at the technology silos and bringing them together in one holistic perspective. We started seeing that, from an availability perspective, we weren’t looking very good. So, we had to figure out what we could do to resolve that. In doing that, we had to break down the technology silos, and really focus on the whole picture of the application, and not just the individual components of the applications.
Our previous directors reorganized our environment and brought in a systems management team. Its responsibility is to monitor and help manage the infrastructure, centralize the tool suites, and understand exactly which capabilities we’re going to use. We created a vision of what we wanted to do, and we’ve been driving that vision for several years to make sure it stays on target and focused on solving this problem.
We were such early adopters that we actually chose best-of-breed. We had an agent-based monitoring environment, and we moved to agentless. At the time, we adopted Mercury SiteScope. Then we also brought in Mercury’s BAC and a lot of the Topaz technologies, with diagnostics and things like that. We had other capabilities, like Bristol Technology’s TransactionVision.
HP purchased all of those companies and brought them under one umbrella of product suites. It allowed us to combine the best-of-breed. We bought technologies that didn’t overlap, solved a problem, and integrated well with each other. It allowed us to get more traceability into these spaces, so we can get really good information about the performance and availability of the applications we’re focusing on.
One of the major things was that it was people, process, and technology that we focused on to make this happen. On the people side, we moved our command center from our downtown office to our corporate headquarters, where all the admins are, so they can be closer to the command center. If there is a problem, the command center can contact them directly and they can go down there.
We instituted what I guess I’d like to refer to as “butts in the seat.” I can’t come up with a better name for it, but when a person is on call, they work down in the command center. They do their regular operational work, but they’re in the command center, so if there’s an incident, they’re there to resolve it.
With the agent-based technologies, we were monitoring thousands of measurement points. But you have to be very reactive, because you come in after the fact trying to figure out which one triggered. Moving to the agentless technology is a different perspective on getting the data: you focus on the key areas inside those systems that you want to pay attention to, versus the monitor-everything model.
In doing that, our admins were challenged to be a little bit more specific as to what they wanted us to pay attention to from a monitoring perspective to give them visibility into the health of their systems and applications.
[Now] there is a feedback loop and the big thing around that is actually moving monitoring further back into the process.
We’ve found that if we fix something in development, it may cost a dollar. If we fix it in testing, it might cost (US)$10. In production staging it may cost $1,000. It could be $10,000 or $100,000 when it’s in production, because that goes back through the entire lifecycle again, and more people are involved. So moving things further back in the lifecycle has been a very big benefit.
Also, it involved working with the development and testing staffs to understand that you can’t throw an application over the wall and say, “Monitor my app, because it’s in production.” We may have no idea what’s in your application, or we might say that it’s monitored because we’re monitoring the infrastructure around your application, but we may not be monitoring a specific component of the application.
The challenge there is reeducating people and making sure that they understand that they have to develop their app with monitoring in mind. Then, we can make sure that we can actually give them visibility back into the application if there is a problem, so they can get to the root cause faster, if there’s an incident.
We’ve created several different processes around this, and we focused on monitoring every single technology. We still monitor those from a siloed perspective, but then we also added a few transactional monitors on top of that inside those silos — for example, transaction scripts that run the same database query over and over again to get information out of there.
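A synthetic transaction monitor of the kind described — rerunning the same database query on a schedule and timing it — can be sketched in a few lines. This is an illustrative stand-in using SQLite: a real monitor at Blue Cross would run through the HP/Mercury suite against their Oracle databases, and the table, query, and alert threshold here are all assumed for the example.

```python
import sqlite3
import time

ALERT_THRESHOLD_S = 0.5  # assumed SLA for this synthetic transaction

def timed_probe(conn, query):
    """Run the monitoring query once and return (elapsed_seconds, row_count)."""
    start = time.perf_counter()
    rows = conn.execute(query).fetchall()
    return time.perf_counter() - start, len(rows)

# Stand-in database; a real monitor would connect to the Oracle instance
# the application uses and run a query that exercises a key transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO claims (status) VALUES (?)",
                 [("open",), ("open",), ("closed",)])

elapsed, count = timed_probe(conn, "SELECT * FROM claims WHERE status = 'open'")
if elapsed > ALERT_THRESHOLD_S:
    print(f"ALERT: probe took {elapsed:.3f}s (over {ALERT_THRESHOLD_S}s)")
else:
    print(f"OK: {count} rows in {elapsed:.3f}s")
```

Run on a schedule, a probe like this turns a silo-level component (the database) into a transaction-level signal: the alert fires on what the end user would feel, not just on an infrastructure counter.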
At the same time, we had to make some changes, where we started leveraging the Universal Configuration Management Database (UCMDB), or Run-time Service Model, to bring that data together and build business services out of it to show how all these things relate to each other. The UCMDB behind the scenes is one of the cornerstones of the technology. It brings all that silo-based information together to create a much better picture of the apps.
We don’t necessarily call it the system of record. We have multiple systems of record. It’s more like the federation adapter for all these records to pull the information together. It guides us into those systems of record to pull that information out.
About eight years ago when we first started this, we had incident meetings where we had between 15 and 20 people going over 20 or 30 incidents per week. We had those every day of the week. On Friday, we would review all the ones for the first four days of the week. So, we were spending a lot of time doing that.
Out of those meetings, we came up with what I call “the monitor of the day.” If we found something that was an incident that occurred in the infrastructure that was not caught by some type of monitoring technology, we would then have it monitored. We’d bring that back, and close that loop to make sure that it would never happen again.
Another thing we did was improve our availability. We were taking something like five or six hours to resolve some of these major incidents. We looked at the 80:20 rule, and now we solve 80 percent of the problems in a very short amount of time. We have six or seven people resolving incidents, and our command center staff is in the command center 24 hours a day to do this type of work.
When they need additional resources, they just pick up the phone and call them down. So it’s a level 1 or level 2 person working with one admin to solve a problem, versus having all hands on deck, with 50 admins in a room resolving an incident.
I’m not saying that we don’t have those now. We do, but when we do, it’s a major problem. It’s not something very small. It could be a firmware on a blade enclosure going down, which takes an entire group of applications down. It’s not something you can plan for, because you’re not making changes to your systems. It’s just old hardware or stuff like that that can cause an outage.
Another thing this has done for us: those 20 or 30 incidents we had per week are down to one or two. Knock on wood on that one, but it’s really a testament to a lot of the things our IT department has done as a whole. They’re putting a lot of effort into reducing the number of incidents occurring in the infrastructure. And we’re partnering with them to get the monitoring in place that gives them visibility into the applications, so we can throw alerts on trends or symptoms, versus throwing the alert on the actual error that occurs in the infrastructure.
[Since the changes] customer satisfaction for IT is a lot higher than it used to be. IT is being called in to support and partner with the business, versus the business saying, “I want this,” and then IT doing it in a vacuum. It’s more of a partnership between the two entities. Operations is creating dashboards and visibility into business applications for the business, so they can see exactly how their own department is performing, versus just from an IT perspective. We can get the data down to specific people now.
Some of the big things I’m looking at next are closed-loop processes. I’ve started working with our change management team to change the way we make changes in our environment, so that everything is configuration item (CI) based. Doing that allows for complete traceability of an asset, or CI, through its entire lifecycle.
You understand every incident, request, and problem that ever occurred on that asset, but you can also see financial information, inventory information, and location information, and start bringing it all together to make smart decisions based on the data you have in your environment.
The really big thing is to help reduce the cost of IT in our business, and to do whatever we can to cut our costs and keep a lean ship going.