Police departments release annual figures detailing the number of crimes committed in a given year. While these statistics may be useful for high-level planning, they don’t put anyone in jail. What is needed is convincing evidence of exactly who did what, where and when.
The same can be said for network management. SNMP (Simple Network Management Protocol) traffic figures can guide long-range capacity planning and show when a particular circuit is overloaded, but they don’t identify the exact user and application responsible. Network technicians need the ability to drill down into the underlying data. Just as crime scene investigators need to examine DNA and clothing fibers, circuit saturation investigators need to look at the packets to finger the culprits.
Two common ways of analyzing the network traffic are RMON (remote network monitoring) and NetFlow.
Deploying Remote Investigators
Remote Network Monitoring, or RMON, is an extension of SNMP that allows network managers to monitor and troubleshoot distributed LANs (local area networks) from a central console. This can be done with either hardware or software. The hardware method is to install probes in each of the LANs or at network consolidation points. The alternative is to include software in network devices such as switches and routers that gathers the required information. The original RMON specification is covered in RFC (Request for Comments) 1757. RMON Version 2 (RMON2) is detailed in RFC 2021 and extends RMON’s analysis to the application layer.
As described in RFC 2021: “Remote network monitoring devices, often called monitors or probes, are instruments that exist for the purpose of managing a network. Often these remote probes are standalone devices and devote significant internal resources for the sole purpose of managing a network. An organization may employ many of these devices, one per network segment, to manage its Internet. In addition, these devices may be used for a network management service provider to access a client network, often geographically remote.”
RMON gathers and monitors nine groups of information:
- Statistics: Bytes sent, packets sent, packets dropped, CRC (cyclic redundancy check) errors, runts, giants, fragments, etc.
- History: Samples and records network statistics including sample period, number of samples and items sampled
- Alarm: Creates and logs an event when any of the statistics cross established thresholds
- Host: Statistics for individual hosts including address, packets, bytes received and transmitted and error packets
- HostTopN: Lists the top hosts for a given time period ranked according to a particular monitored statistic — for example, top ten hosts in terms of bytes received
- Matrix: Statistics on conversations between two addresses — packets, bytes and errors for each pair of addresses
- Filters: Packets can be matched by a filter equation
- Packet Capture: Allows packets to be captured after they flow through the channel
- Events: Event type, description and last time event sent
A vendor does not need to enable all nine of these groups. However, some of the groups do require others to be activated.
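As a rough illustration of what the HostTopN group reports, the ranking can be sketched in a few lines of Python. This is not how a probe implements it internally; the function name `host_top_n` and the sample addresses and byte counts are made up for the example.

```python
from collections import Counter

def host_top_n(samples, n=10):
    """Rank hosts by total bytes received over the sampled period."""
    totals = Counter()
    for host, bytes_received in samples:
        totals[host] += bytes_received
    # most_common() returns (host, total) pairs, largest total first
    return totals.most_common(n)

# Hypothetical per-interval samples: (host address, bytes received)
samples = [
    ("10.0.0.5", 1200),
    ("10.0.0.7", 300),
    ("10.0.0.5", 800),
    ("10.0.0.9", 1500),
]

for host, total in host_top_n(samples, n=2):
    print(host, total)
```

The same pattern works for any monitored statistic, such as packets sent or error packets, by changing what is summed.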
Making the Usual Suspects Talk
RMON and RMON2 make it much easier to use SNMP variables and provide some visibility into the packet-level data needed to answer the questions of who was doing what when. The problem is that RMON has not been broadly adopted by networking vendors due to the heavy hardware and system load it entails.
One can also purchase standalone packet analyzers. The disadvantage of this approach, in addition to the costs involved, is that they must be deployed at the right location to analyze the data. They can be moved to a chronically overloaded circuit to locate the source of the load, but are not good at detecting intermittent problems.
Even if there are numerous packet analyzers, they must be configured properly to operate on a switched network. In order for the packet analyzer to see the traffic on a port, one of two things must happen:
- The port of the switch must be connected to a hub first, from which the connection continues on. The packet analyzer also plugs into the hub to promiscuously listen to all traffic to and from the link.
- Next-generation switches support something called “spanning” or “port mirroring,” where a port on the switch can be set up to receive all traffic to and from another port. This is the preferred configuration.
One of these must be configured before you can capture and analyze the data. Obviously, this situation presents problems, most of which are addressed by NetFlow. Rather than relying on separately deployed packet analyzers, NetFlow performs the traffic analysis right in the switch or router.
NetFlow was originally developed by Cisco Systems as part of its Internetwork Operating System (IOS). It is, however, an openly documented protocol, and other vendors ship comparable implementations, including Juniper Networks’ cflowd and Huawei Technologies’ NetStream.
NetFlow involves two elements: a flow generator and a flow collector. The switch or router acts as the flow generator, sending a continuous stream of records describing what it sees on one or more interfaces to a NetFlow collector, which is a server or appliance that collects the data from one or more flow generators.
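To make the collector side concrete, here is a minimal Python sketch that decodes export datagrams. The article does not say which export version is in use, so this assumes the fixed-layout NetFlow version 5 format (a 24-byte header followed by 48-byte flow records) carried over UDP. The names `parse_v5` and `listen` and the port number 2055 are assumptions for illustration, not part of any product.

```python
import socket
import struct

HEADER = struct.Struct("!HHIIIIBBH")                # 24-byte v5 header
RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")  # 48-byte v5 flow record

def parse_v5(datagram):
    """Decode one NetFlow v5 export datagram into a list of flow dicts."""
    version, count = HEADER.unpack_from(datagram, 0)[:2]
    if version != 5:
        raise ValueError(f"expected NetFlow v5, got version {version}")
    flows = []
    for i in range(count):
        f = RECORD.unpack_from(datagram, HEADER.size + i * RECORD.size)
        flows.append({
            "src": socket.inet_ntoa(f[0]),
            "dst": socket.inet_ntoa(f[1]),
            "input_if": f[3],
            "packets": f[5],
            "bytes": f[6],
            "src_port": f[9],
            "dst_port": f[10],
            "protocol": f[13],
            "tos": f[14],
        })
    return flows

def listen(port=2055):
    """Receive export datagrams forever (port number is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    while True:
        datagram, (exporter, _) = sock.recvfrom(65535)
        for flow in parse_v5(datagram):
            print(exporter, flow)

# Hand-built datagram with one flow record, so the parser can be tried offline.
sample = HEADER.pack(5, 1, 0, 0, 0, 0, 0, 0, 0) + RECORD.pack(
    socket.inet_aton("10.1.1.20"), socket.inet_aton("192.0.2.10"),
    socket.inet_aton("0.0.0.0"), 3, 1, 12, 9000, 0, 0,
    51000, 80, 0, 0, 6, 0, 0, 0, 0, 0, 0)
flows = parse_v5(sample)
print(flows[0])
```

A production collector would also track sequence numbers to detect lost datagrams and would write the records to storage rather than printing them.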
The NetFlow generator examines the packets based on seven key fields: source and destination IP address, source and destination port, Layer 3 protocol type, type-of-service byte and input logical interface. If those seven fields are identical for two or more packets, the generator assigns those packets to the same flow or conversation. When that conversation is complete, it sends the data to the collector.
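The grouping rule can be sketched as a dictionary keyed on the seven fields. This is a simplified model of the idea, not the code a router runs. The packet dictionaries, field names and the `aggregate` function are hypothetical.

```python
from collections import namedtuple

# The seven key fields that define a flow
FlowKey = namedtuple(
    "FlowKey",
    "src_ip dst_ip src_port dst_port protocol tos input_if",
)

def aggregate(packets):
    """Group packets into flows; packets with identical keys share a flow."""
    flows = {}
    for pkt in packets:
        key = FlowKey(*(pkt[field] for field in FlowKey._fields))
        entry = flows.setdefault(key, {"packets": 0, "bytes": 0})
        entry["packets"] += 1
        entry["bytes"] += pkt["length"]
    return flows

base = {"src_ip": "10.1.1.20", "dst_ip": "192.0.2.10", "src_port": 51000,
        "dst_port": 80, "protocol": 6, "tos": 0, "input_if": 3}
packets = [
    {**base, "length": 1500},
    {**base, "length": 400},                      # same seven fields: same flow
    {**base, "dst_port": 443, "length": 200},     # one field differs: new flow
]

flows = aggregate(packets)
for key, counters in flows.items():
    print(key.dst_port, counters)
```

Because the key is a plain tuple, changing any single field, even just the type-of-service byte, starts a separate flow entry.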
Since NetFlow uses the switch or router’s CPU, two approaches can be used to reduce the overhead. One is to sample the packets; for example, only analyzing every tenth packet rather than all of them. The other is to only activate it on certain key interfaces.
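The sampling trade-off can be shown with a short sketch: look at only every tenth packet, then multiply the sampled byte count by ten to estimate the total. The functions below are illustrative only, and the uniform 500-byte packets are a deliberately simple assumption. With real, mixed traffic the scaled figure is an estimate rather than an exact count.

```python
def sample_one_in_n(packet_lengths, n=10):
    """Yield every n-th packet length (deterministic 1-in-n sampling)."""
    for index, length in enumerate(packet_lengths, start=1):
        if index % n == 0:
            yield length

def estimate_total_bytes(sampled_lengths, n=10):
    """Scale the sampled byte count back up by the sampling rate."""
    return n * sum(sampled_lengths)

packet_lengths = [500] * 1000          # hypothetical traffic: 1,000 packets
sampled = list(sample_one_in_n(packet_lengths, n=10))
estimate = estimate_total_bytes(sampled, n=10)

print(len(sampled), "packets examined instead of", len(packet_lengths))
print("estimated bytes:", estimate, "actual bytes:", sum(packet_lengths))
```

The device does a tenth of the work, at the cost of possibly missing short-lived conversations that fall entirely between samples.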
A single NetFlow collector can receive data from hundreds of network interfaces. The collector, in addition to storing the NetFlow statistics, generally also includes analysis software that can determine:
- The applications seen on the interface
- The hosts communicating on the interface
- Who the hosts are conversing with and with what protocol (and much more)
The above information alone will help answer over 90 percent of questions pertaining to:
- Who: The host causing the problem
- What: The application the offending host was using
- When: The time stamps related to when the issue surfaced
- Where: The router/switch and the interface the traffic was seen on
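A who, what, when and where query over collected flow records might look like the following sketch. The record layout, the device name `edge-rtr-1`, the sample figures and the small port-to-application table are all assumptions made for the example. Commercial analyzers use much richer application identification.

```python
from collections import Counter
from datetime import datetime

# Hypothetical port-to-application table
APPLICATIONS = {80: "HTTP", 443: "HTTPS", 25: "SMTP"}

def top_talkers(flows, exporter, interface, start, end, n=3):
    """Rank (host, application) pairs by bytes on one interface in a window."""
    totals = Counter()
    for flow in flows:
        if (flow["exporter"], flow["interface"]) != (exporter, interface):
            continue  # where: a different router or interface
        if not start <= flow["time"] < end:
            continue  # when: outside the window of interest
        app = APPLICATIONS.get(flow["dst_port"], f"port {flow['dst_port']}")
        totals[(flow["src"], app)] += flow["bytes"]  # who and what
    return totals.most_common(n)

flows = [
    {"exporter": "edge-rtr-1", "interface": 2, "time": datetime(2024, 1, 8, 7, 5),
     "src": "10.1.1.20", "dst_port": 80, "bytes": 900_000},
    {"exporter": "edge-rtr-1", "interface": 2, "time": datetime(2024, 1, 8, 7, 40),
     "src": "10.1.1.20", "dst_port": 80, "bytes": 600_000},
    {"exporter": "edge-rtr-1", "interface": 2, "time": datetime(2024, 1, 8, 7, 15),
     "src": "10.1.1.31", "dst_port": 25, "bytes": 40_000},
    {"exporter": "edge-rtr-1", "interface": 2, "time": datetime(2024, 1, 8, 11, 0),
     "src": "10.1.1.31", "dst_port": 443, "bytes": 2_000_000},
]

report = top_talkers(flows, "edge-rtr-1", 2,
                     datetime(2024, 1, 8, 7, 0), datetime(2024, 1, 8, 9, 0))
for (host, app), total in report:
    print(host, app, total)
```

The large 11 a.m. transfer is ignored because it falls outside the window being investigated.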
With NetFlow analysis, one computer collects flows from dozens or even hundreds of interfaces. The geographical limitations are dramatically reduced, and the cost is a fraction of what would be spent on deploying multiple packet analyzers.
Making the Arrest
NetFlow is included with higher-end hardware from Cisco and other networking vendors, but it must be activated. As mentioned, due to the overhead, it won’t always be activated. The state of Maine, for example, decided to use NetFlow on parts of its network which connect about 750 entities around the state.
“NetFlow is integral to Cisco’s IOS, so there was no extra cost in turning it on in the routers,” said Duncan Bond, the State of Maine’s data network supervisor. “Because the reporting capability is there already, it is simple enough to turn on, and we didn’t have to install any extra equipment at the edge. It was a no brainer. It is free information.”
The state has a backbone using Nortel ATM (asynchronous transfer mode) switches. At the ATM locations, Cisco routers direct traffic to the edge sites, typically over T1 connections. The state also has a SONET (synchronous optical network) ring in the capital area.
“On the entities that are connected right at the core, we don’t have a need for flow-related information because we have plenty of bandwidth,” said Bond. “But for all of our WAN (wide area network)-based edge locations, we have turned on NetFlow reporting.”
Although it was free to set up the generators, he did have to purchase a server to act as the collector. He uses a dual-CPU (3.8 GHz) Windows server with 4 GB of RAM. The data is stored on an internal 400 GB RAID (redundant array of independent disks) array. The software he uses for analyzing the NetFlow data is Scrutinizer from Plixer International of Sanford, Maine. Scrutinizer comes in a free version that monitors an unlimited number of interfaces and stores the data for 24 hours. Commercial versions run from US$1,995 to $8,995.
The NetFlow data is initially available in one-minute intervals, and then rolls up to five-minute and half-hour intervals. Bond has the system set to retain the one-minute data for seven days. Beyond the first few days, that level of granularity is no longer useful. Network administrators can drill down and view the data in real time, but most often it is used after the fact to locate what was causing an earlier slowdown.
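The roll-up and retention scheme can be sketched as follows. This is not how Scrutinizer stores its data, which the article does not describe. It simply shows one-minute byte counts being summed into five-minute and half-hour buckets, and old one-minute samples being pruned after seven days. Function names and sample values are hypothetical.

```python
from collections import defaultdict

MINUTES_PER_DAY = 24 * 60

def roll_up(samples, width):
    """Sum (minute, bytes) samples into buckets `width` minutes wide."""
    buckets = defaultdict(int)
    for minute, byte_count in samples:
        # Each bucket is labeled with the minute at which it starts
        buckets[minute - minute % width] += byte_count
    return sorted(buckets.items())

def prune(samples, now, keep_days=7):
    """Drop fine-grained samples older than the retention period."""
    cutoff = now - keep_days * MINUTES_PER_DAY
    return [(minute, count) for minute, count in samples if minute >= cutoff]

# One hour of hypothetical one-minute samples, 1,000 bytes each
one_minute = [(minute, 1000) for minute in range(60)]

five_minute = roll_up(one_minute, 5)
half_hour = roll_up(one_minute, 30)
print(len(five_minute), "five-minute buckets;", len(half_hour), "half-hour buckets")
```

Rolling up before pruning keeps the long-term trend available for capacity planning while the detailed one-minute data ages out.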
“Typically users don’t call right when they have a problem, but will wait till later in the day to phone in a trouble ticket,” Bond says. “At that time the problem has passed so we can’t see it in real time any more; that is where this tool really shines.”
He gives the example of an office on a relatively low-speed circuit that called in at 9 a.m. to report that the circuit had been slow since staff started arriving around 7 a.m. The bandwidth graphs showed that the circuit was saturated. Clicking into Scrutinizer then showed that the cause was Windows updates that were scheduled to occur at that time.
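The arithmetic behind a saturated circuit is simple: convert the bytes seen in an interval to bits and divide by what the line could have carried in that time. The sketch below assumes a T1 at its nominal 1.544 Mbps, since the article mentions T1 connections to edge sites. The per-application byte counts are invented for illustration and are not figures from Maine.

```python
T1_BPS = 1_544_000  # nominal T1 line rate in bits per second

def utilization(byte_count, seconds, capacity_bps=T1_BPS):
    """Fraction of circuit capacity used over the interval (1.0 = saturated)."""
    return (byte_count * 8) / (seconds * capacity_bps)

def share_by_application(app_bytes):
    """Fraction of the observed traffic attributable to each application."""
    total = sum(app_bytes.values())
    return {app: count / total for app, count in app_bytes.items()}

# Hypothetical one-minute totals per application, in bytes
minute_bytes = {"windows-update": 9_000_000, "email": 1_500_000, "web": 500_000}

load = utilization(sum(minute_bytes.values()), seconds=60)
shares = share_by_application(minute_bytes)

print(f"circuit load: {load:.0%}")
for app, share in shares.items():
    print(f"{app}: {share:.0%} of traffic")
```

SNMP counters alone would supply the first number. The per-application breakdown is what flow data adds.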
“Windows updates and virus updates are particularly expensive in terms of bandwidth,” says Bond. “Depending on how their business works, either we can reschedule the updates or they have to put up with it.”
In other cases, he has found problems connected with server consolidation. Consolidating the servers to central locations cuts down on maintenance costs, but shifts more traffic to the WAN. NetFlow shows the increase in bandwidth utilization by specific users and applications as those changes occur, so capacity can be added as needed. However, Scrutinizer is primarily used by Bond and his crew of four to spot and address immediate blocks.
“It is in relatively constant use,” he said, “and the payback is enormous.”
The lesson to be learned from this tale is that packet and flow monitoring make it much easier to find and remove the sources of network problems. RMON, however, is not currently as widely supported as NetFlow.
In addition, NetFlow only needs to be activated on the network equipment; there is no need to install software probes or packet capture appliances. Probably the wisest choice, therefore, is to activate NetFlow and set up a server to capture and analyze the data.
Drew Robb is a Los Angeles-based freelancer who specializes in technology subjects.