What do we mean when we talk about a ‘Clean Signal’?
Today’s IT organizations face a huge – and growing – problem. The infrastructure undergirding today's business services generates tidal waves of events. In turn, monitoring and management systems create alerts that provide a window into all of the problems, roadblocks, challenges, and of course, reboots, that are impacting the delivery of your services.
These millions of raw events and associated alerts are inundating even the most well-equipped and staffed IT teams.
As the deluge continues, it often becomes impossible to get a handle on the most important questions: Which business services and users are impacted – and to what degree? What is really causing this event? How quickly can we identify the source of the problem and fix it?
The necessity of filtering out irrelevant events to reveal true root cause – and its true business impact – is the heart of the concept of the Clean Signal.
The Layered Approach
As you know if you are a regular consumer of IT industry blogs, a wide variety of IT impact technologies exist, and they tackle various layers of the problem:
- The flood of log messages, telemetry data, and system alerts can be reduced at the outset by paying attention only to those data and messages that indicate a change of state, and then confirming that each transition is real and not some kind of glitch.
- Your monitoring system must know the network configuration. For instance, when an upstream device fails, it is highly likely that the devices downstream of it will eventually be adversely affected.
- To provide true root cause analysis, your management/monitoring technology must understand the applications being run, as well as the application-level dependencies between them. For a simple example, email cannot be delivered without functioning DNS. But since DNS is a behind-the-scenes service, it may not be readily apparent that the problem doesn’t reside with the mail server.
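The topology and dependency layers above can be illustrated with a small sketch. It assumes a hand-written dependency map (the `DEPENDS_ON` table and the node names in it are hypothetical, chosen to mirror the email/DNS example) and treats a failing node as a root cause only if nothing it depends on is also failing. A real platform would discover this topology rather than hard-code it.

```python
from typing import Dict, List, Set

# Hypothetical service topology: each node maps to the nodes it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "email": ["mail-server", "dns"],
    "mail-server": ["core-switch"],
    "dns": ["core-switch"],
    "core-switch": [],
}


def root_causes(failing: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the failing nodes that have no failing direct or transitive dependency."""

    def has_failing_dependency(node: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(node, []):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in failing or has_failing_dependency(dep, seen):
                return True
        return False

    return {n for n in failing if not has_failing_dependency(n, {n})}


if __name__ == "__main__":
    # Email and DNS both raise alerts; only DNS is reported as the root cause.
    print(root_causes({"email", "dns"}, DEPENDS_ON))
```

With this approach, the alert on the email service is suppressed in favor of the DNS alert, which is where an operator actually needs to look.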
The knowledge base that allows for these degrees of correlation and integration should also provide established threshold levels and other pertinent data in a rules-based event processing environment.
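As a rough sketch of how a threshold rule and the change-of-state filtering from the first layer might fit together, the example below alerts only when a metric crosses its threshold and stays there for several consecutive samples. The `ThresholdRule` and `StateChangeFilter` names, the metric name, and the numbers are all assumptions made for illustration, not part of any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    metric: str
    threshold: float      # values above this count as "critical"
    confirm_samples: int  # consecutive samples needed to accept a state change


class StateChangeFilter:
    """Emit an alert only when the state changes and the change persists."""

    def __init__(self, rule: ThresholdRule) -> None:
        self.rule = rule
        self.state = "ok"  # last confirmed state
        self.pending = 0   # consecutive samples that disagree with it

    def observe(self, value: float) -> Optional[str]:
        candidate = "critical" if value > self.rule.threshold else "ok"
        if candidate == self.state:
            self.pending = 0  # any unconfirmed blip has ended
            return None
        self.pending += 1
        if self.pending < self.rule.confirm_samples:
            return None       # not confirmed yet; may be a glitch
        self.state = candidate
        self.pending = 0
        return f"{self.rule.metric}: {candidate}"


if __name__ == "__main__":
    flt = StateChangeFilter(ThresholdRule("cpu_percent", 90.0, 3))
    for sample in [50, 95, 50, 95, 96, 97, 98, 40, 41, 42]:
        alert = flt.observe(sample)
        if alert:
            print(alert)
```

Ten raw samples produce two alerts here: one when the high reading is confirmed and one when it clears. The single spike to 95 early in the series is treated as a glitch and dropped.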
The holy grail, of course, is to determine the true root cause and obtain a clear knowledge of the affected business service. That’s the Clean Signal. But once you find the Clean Signal, what happens next is how you derive value.
You still need to correct the issue, while making sure the remediation is minimal and as unobtrusive as possible to the process that keeps the impacted business service running.
Having the sophistication described above – with, for example, a monitoring and management system that comprises a comprehensive service topology description with which to correlate incidents – leads naturally to some level of automated remediation.
This is achieved by referencing a library of workflows, on both a technical and a business level. This is the crucial step in reducing operator workload. Many of the issues that operators deal with are resolved with a simple reset of the culprit service or a reboot of a given server. Automations can take care of this, saving you time and money.
Minimal Operator Intervention
This process also includes the verification of the error condition. You don't want to initiate a remediation process against an event that may have cleared. Verification also keeps operator intervention to a minimum, allowing your valuable and scarce engineers and support personnel to concentrate on more pressing issues – like planning ahead.
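A minimal sketch of this verify-then-remediate flow might look like the following. The `WORKFLOWS` table, the alert type names, and the `still_failing` check are hypothetical stand-ins: a real workflow would restart a service or reboot a host, whereas these only return a description of what they would do.

```python
from typing import Callable, Dict

# Hypothetical workflow library: maps an alert type to a remediation action.
WORKFLOWS: Dict[str, Callable[[str], str]] = {
    "service_down": lambda target: f"restart service on {target}",
    "host_unresponsive": lambda target: f"reboot {target}",
}


def remediate(alert_type: str, target: str,
              still_failing: Callable[[str], bool]) -> str:
    """Re-verify the error condition, then run the matching workflow or escalate."""
    if not still_failing(target):
        # The event cleared on its own; do not act on stale information.
        return "skipped: condition already cleared"
    workflow = WORKFLOWS.get(alert_type)
    if workflow is None:
        # Nothing in the library fits; hand the incident to an operator.
        return "escalated: no workflow for this alert type"
    return "ran: " + workflow(target)


if __name__ == "__main__":
    print(remediate("service_down", "web-01", lambda target: True))
    print(remediate("service_down", "web-01", lambda target: False))
```

Only the escalated case reaches a person, which is the point of keeping the verification step and the workflow lookup in front of the operator queue.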
In parallel, your Clean Signal should be delivered via a robust visualization layer. Today’s management and monitoring technologies offer extensive and meaningful dashboards that give operators an easy-to-glean status for each stage: issue recognition, isolation, root cause and Clean Signal, event verification, and remediation – available at all times.
While your IT operations platform should provide all of the functionality described here, you should also insist on a deployment model that ensures you will get the best fit for your organization. The platform's ability to pinpoint, verify and validate the cause of the event and drive the remediation will allow your IT team to deliver a predictable infrastructure – a state that you can depend on to run your revenue-generating services.
The Clean Signal gets the right information to the right person and results in measurable gains in your service uptime. Best of all, it can replace the heterogeneous gaggle of monitoring tools that you assembled in an ad-hoc manner as your business grew.