This week we're going to consider an often-overlooked function of your IT operations management system: data collection and correlation.
It's easy to focus on the reports and the dashboard views. That's what we all look at, day in and day out. So, it’s also easy to forget about the basic data collection and subsequent data correlation that goes on behind the scenes automatically. The last time you thought about that was probably during the platform sales pitch.
But collection and correlation represent a crucial step. If relevant data doesn't get into the system, then there can be no downstream magic.
The data collection subsystem must be capable of ingesting a wide variety of data delivered by multiple protocols. It will poll systems, applications and services, either directly or via downstream distributed clients. It will accept SNMP traps and event and alert notifications. It will absorb system logs, event logs, application and service logs.
And it must do so without creating a bottleneck or dropping potentially important packets. Downstream filtering can help here, but adequate processing power and network bandwidth are prerequisites.
Auto-discovery of new devices, applications and services is a valuable tool, but you shouldn't depend on it without periodic verification. Managed devices often carry performance plug-ins, such as poller modules; verify that these are configured properly, lest you auto-discover misleading information. As always, make sure you are collecting the right data from any new service.
A simple ICMP poller is great for up/down availability testing, while a fully instrumented SNMP-based application can provide vast quantities of information.
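For illustration, here is a minimal sketch of such an up/down poller in Python. It shells out to the system `ping` binary (the `-W` timeout flag assumed here is Linux-specific), and takes an injectable probe function so the check can be swapped for an SNMP query or a mock; none of these names come from any particular product:

```python
import subprocess

def icmp_up(host: str, timeout_s: int = 2) -> bool:
    """Return True if the host answers a single ICMP echo (via system ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def poll(hosts, probe=icmp_up):
    """Map each host to "up" or "down" using the supplied probe function."""
    return {h: ("up" if probe(h) else "down") for h in hosts}
```

Because the probe is a parameter, a distributed client could substitute its own reachability test without changing the polling loop.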
Now you have all the data you need for your infrastructure, from the hardware level up to application services. You can be confident, at least, that you are keeping tabs on everything that could impact essential business services.
Dealing with the Tidal Wave of Data
Now that you have access to all the data, you have a new problem: managing the infamous tidal wave of information. You need an event correlation system that processes your data step by step.
Here, the first step is event filtering. A vast amount of the data that you collect is actually of no interest at all. Services are reporting that they are available, cable modems are delivering well-within-spec telemetry, and so on. You often have a tidal wave of data indicating that everything is fine. Your event correlator will discard data that are within threshold levels, that merely reflect general debugging, or that are otherwise unrelated to network, application or service availability or a fault condition.
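A rough sketch of this filtering step, assuming invented event fields (`severity`, `metric`, `value`) and a simple per-metric threshold table; a real correlator's rule engine would be far richer:

```python
IGNORABLE_SEVERITIES = {"debug", "info"}

def filter_events(events, thresholds):
    """Keep only events that are fault-related or breach a metric threshold."""
    kept = []
    for ev in events:
        # Debug chatter and routine "all is well" reports are dropped outright.
        if ev.get("severity") in IGNORABLE_SEVERITIES:
            continue
        # Telemetry that is within its configured threshold is also dropped.
        metric, value = ev.get("metric"), ev.get("value")
        limit = thresholds.get(metric)
        if limit is not None and value is not None and value <= limit:
            continue
        kept.append(ev)
    return kept
```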
The next step in the process is event aggregation. Here, multiple events that are very similar (but not necessarily identical) are combined into an aggregate that represents the underlying event data. The objective is to summarize a collection of input events into a smaller collection that can be processed using various analytics methods.
For example, the aggregate may provide statistical summaries of the underlying events and the resources that are affected by those events. Another example is temporal aggregation, in which the same problem is reported over and over by the event source until it is cleared by solving the underlying problem.
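Temporal aggregation might look something like the following sketch, which collapses events sharing a source and type into one summary carrying a repeat count and first/last timestamps (the field names are assumptions, not any vendor's schema):

```python
def aggregate(events):
    """Collapse repeated reports of the same problem into one summary event."""
    groups = {}
    for ev in events:
        key = (ev["source"], ev["type"])
        g = groups.setdefault(key, {
            "source": ev["source"], "type": ev["type"],
            "count": 0, "first": ev["ts"], "last": ev["ts"],
        })
        g["count"] += 1
        g["first"] = min(g["first"], ev["ts"])  # earliest report
        g["last"] = max(g["last"], ev["ts"])    # most recent report
    return list(groups.values())
```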
Event de-duplication is a special type of event aggregation that consists of merging exact duplicates of the same event. Such duplicates may be caused by network instability – for example, the same event is sent twice by the event source because the first instance was not acknowledged quickly enough, but both instances eventually reach the event destination.
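Because the duplicates are exact copies, de-duplication reduces to remembering what has already been seen. A minimal sketch, treating each event as a flat dictionary:

```python
def dedupe(events):
    """Drop exact duplicates (e.g. retransmitted traps), preserving order."""
    seen = set()
    out = []
    for ev in events:
        key = tuple(sorted(ev.items()))  # hashable fingerprint of the event
        if key not in seen:
            seen.add(key)
            out.append(ev)
    return out
```

In practice a timestamp field would differ between retransmissions, so a real system fingerprints only the identity fields, not the whole record.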
While the data are being processed, an additional process of suppression may be applied. Suppression associates a priority with each event, and may hide a lower-priority event while a related higher-priority event exists.
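As a sketch, suppression keyed per resource might look like this, assuming a numeric `priority` field where a larger number means more urgent:

```python
def suppress(events):
    """For each resource, surface only the highest-priority event; hide the rest."""
    best = {}
    for ev in events:
        resource = ev["resource"]
        if resource not in best or ev["priority"] > best[resource]["priority"]:
            best[resource] = ev
    return list(best.values())
```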
In a sophisticated management system, the topology of your network will be known to the system. In this case, an additional technique is brought into play. Event masking (also known as topological masking in network management) involves ignoring events pertaining to systems that are downstream of a failed system.
For example, servers that are downstream of a crashed router will fail availability polling, but that information can be safely ignored during the timeframe that the router is out of service.
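This masking step can be sketched as a reachability walk over the known topology. The representation here, a parent-to-children dictionary, is an assumption for illustration; a real system would query its topology database:

```python
def downstream_of(topology, failed):
    """Return every node reachable below any failed node (depth-first walk)."""
    masked, stack = set(), list(failed)
    while stack:
        node = stack.pop()
        for child in topology.get(node, []):
            if child not in masked:
                masked.add(child)
                stack.append(child)
    return masked

def mask_events(events, topology, failed):
    """Drop events from systems that sit behind a failed upstream device."""
    masked = downstream_of(topology, failed)
    return [ev for ev in events if ev["source"] not in masked]
```

Note that the failed device's own events are deliberately kept: those point at the root cause, while the downstream failures are just noise.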
Because all of this is happening at the beginning of the evaluation stream, it's easy to overlook the importance of these primary stages of the management chain. However, they are fundamental to the smooth running of your network, applications and services.