Events, Incidents and Problems – How Are They Related?
If your business is like most modern enterprises, you depend on information technology to drive your business forward. Whether it’s your e-commerce portal, your supply chain system or your call center platform, your critical business services always have to be online and operating at peak performance. Otherwise, your customers and employees are directly affected – as is your bottom line.
That’s why managing service availability is so important – and why so many forward-looking companies are investing in ITIL systems and processes to keep their IT environment up and running. However, if you’re new to ITIL, it can be really confusing. You’ve probably heard about events, incidents and problems – but what are these, and how are they related? And why do you need three different management processes to do exactly the same thing?
Actually, event management, incident management and problem management aren’t the same. All three are related, but each process has a unique purpose – and all three work together to dramatically improve service availability.
Let’s start with event management. ITIL defines an event as “A change of state which has significance for the management of a Configuration Item or IT Service.”
Still feeling less than enlightened? Let’s look at what this actually means.
To start with, a Configuration Item (CI) is simply a component that you manage in your IT environment – such as a router, server, database, or application. An event is simply something that happens to a CI – that’s what ITIL means by “a change of state”.
Of course, lots of things can happen to a CI, so there are many different types of events. For instance, if a router port fails, that’s an event. If an application is taking too long to respond, that’s an event as well. And when a disk hits 95% capacity, that can generate an event too. The list is almost endless, resulting in thousands – or even tens of thousands – of events every day. These events can also come from many different places, including monitoring systems, log analysis systems, SNMP traps and other sources.
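To make “a change of state” concrete, here is a minimal Python sketch built on the disk example above. The `Event` class, the state names and the `detect_event` function are all hypothetical names chosen for illustration, not part of ITIL or any particular monitoring tool. The point is simply that an event is only produced when the state of the CI actually changes.

```python
from dataclasses import dataclass
from typing import Optional

DISK_WARN_PERCENT = 95  # threshold borrowed from the disk example in the text


@dataclass(frozen=True)
class Event:
    """A change of state on a Configuration Item (hypothetical structure)."""
    ci: str
    old_state: str
    new_state: str


def disk_state(used_percent: float) -> str:
    """Map a raw disk usage reading to a coarse state."""
    return "nearly_full" if used_percent >= DISK_WARN_PERCENT else "ok"


def detect_event(ci: str, previous_percent: float,
                 current_percent: float) -> Optional[Event]:
    """Return an Event only if the CI moved from one state to another."""
    old, new = disk_state(previous_percent), disk_state(current_percent)
    if old == new:
        return None  # same state as before, so there is nothing to report
    return Event(ci=ci, old_state=old, new_state=new)
```

With this sketch, a disk going from 90% to 96% produces an event, while a disk going from 96% to 97% does not, because its state has not changed.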
Why Event Management?
Obviously, you can’t just ignore these events. On the other hand, it’s not practical to look at each event manually – there are just too many. And because they come from so many different sources, the task is even more challenging. Even worse, most of these events are just noise. They don’t actually affect your business services – making it incredibly difficult to see real issues.
That’s why you need an automated event management system. The first goal of this system is to filter out the noise, so that you only get events that actually have a service impact. The system then analyzes these remaining events to help you pinpoint the actual issue. For example, a single network failure can cause thousands of symptomatic events – everything from failed credit card transactions through to unresponsive webpages and application errors.
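The two steps described here, filtering out noise and then tracing symptoms back to a cause, can be sketched in a few lines of Python. This is a deliberately simple illustration under stated assumptions, not how any real event management product works: the `DEPENDS_ON` map, the CI names and the `affects_service` flag are all made up for the example, and the correlation rule is just “blame the furthest upstream CI that is also reporting events”.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str            # name of the CI that reported the event
    message: str
    affects_service: bool  # assumed to be set by an earlier classification step


# Hypothetical dependency map: each CI points to the CI it depends on.
DEPENDS_ON = {
    "web-frontend": "core-switch-1",
    "payment-api": "core-switch-1",
    "core-switch-1": None,
}


def filter_noise(events):
    """Keep only the events that have a service impact."""
    return [e for e in events if e.affects_service]


def probable_cause(ci, failing):
    """Walk up the dependency chain and pick the furthest upstream failing CI."""
    cause = ci
    seen = {ci}
    current = DEPENDS_ON.get(ci)
    while current is not None and current not in seen:
        if current in failing:
            cause = current
        seen.add(current)
        current = DEPENDS_ON.get(current)
    return cause


def correlate(events):
    """Group events under the CI that most likely caused them."""
    failing = {e.source for e in events}
    groups = defaultdict(list)
    for e in events:
        groups[probable_cause(e.source, failing)].append(e)
    return dict(groups)
```

Fed a switch failure, a page timeout, a failed card transaction and a routine backup message, this sketch drops the backup message and files the other three under the switch. Even in this toy version, somebody had to write the dependency map and the rule by hand.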
Unfortunately, most event management systems expect you to write the correlation rules that tie symptomatic events back to their cause yourself. This can take months or even years and still only cover a small number of failure scenarios. As a result, you’re still left with a huge number of events, without any clear understanding of what’s really going on. On the other hand, there are also intelligent event management platforms that come with extensive out-of-the-box analysis capabilities – reducing event volumes by a factor of up to 100,000, and automatically pinpointing the root cause of service outages more than 90% of the time.
How Does Incident Management Fit In?
ITIL defines an incident as “an unplanned interruption to an IT Service or a reduction in the Quality of an IT Service.”
Hang on a minute. Wasn’t that what event management was all about – analyzing event data to pinpoint service issues? Exactly! Ultimately, event management systems turn huge amounts of raw event data into a few meaningful, actionable incidents. In some cases, IT staff raise these incidents manually based on what they are seeing in the event management system, but some event management systems can also report incidents automatically.
Once an incident is raised, that’s where incident management takes over. The goal of incident management is to restore service as quickly as possible – even if the fix is only temporary. As a simple example, this might involve resetting a server, even if the long-term solution is to apply a software patch.
However, things aren’t usually that straightforward. Incidents are usually triaged by first-level support, which categorizes and prioritizes the incident, and then tries to restore service by checking whether the incident is related to a known error with a corresponding workaround. If first-level support can’t find a solution, they pass the incident over to second-level support.
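The triage flow just described can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Incident` fields, the priority thresholds and the contents of `KNOWN_ERRORS` are invented, and a real service desk tool would categorize and match incidents in a far richer way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    summary: str
    users_affected: int
    priority: str = "unassigned"
    assigned_to: str = "first-level"
    workaround: Optional[str] = None


# Hypothetical known error database: incident summary -> documented workaround.
KNOWN_ERRORS = {
    "login service unresponsive": "Restart the login service",
}


def prioritize(users_affected: int) -> str:
    """Assign a priority from the number of affected users (made-up thresholds)."""
    if users_affected >= 100:
        return "high"
    if users_affected >= 10:
        return "medium"
    return "low"


def triage(incident: Incident) -> Incident:
    """First-level triage: prioritize, look for a known error, else escalate."""
    incident.priority = prioritize(incident.users_affected)
    incident.workaround = KNOWN_ERRORS.get(incident.summary.lower())
    if incident.workaround is None:
        # No known error matches, so hand over to second-level support.
        incident.assigned_to = "second-level"
    return incident
```

An incident that matches a known error stays with first-level support and comes back with its workaround attached. Anything else is escalated.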
And, that’s where the fun starts. Because business services are so complex, this process typically involves multiple domain experts who spend hours or even days trying to restore service. Often, these delays are caused by poor event management practices – support is overwhelmed with a flood of uncorrelated event data, making it incredibly difficult to identify the issue. On the other hand, intelligent event management platforms can often pinpoint the issue before it ever hits support, dramatically lowering service restoration times.
Why You Need Problem Management
With incident management, you’re reactive – the goal is just to restore service as quickly as possible. However, wouldn’t it be better if you could avoid service outages in the first place?
Let’s go back to that example of resetting a server to restore service. That may resolve the incident, but why did the service go down in the first place? Was it a software bug, incorrect configuration, or even a hardware manufacturing issue? If the service outage created widespread disruption, you’ll want to know – and take steps to fix the problem. The same applies if the same type of incident keeps on happening – even if the impact of each individual incident is relatively low.
That’s what problem management is all about – proactively preventing incidents wherever possible, and reducing their impact when they happen. At its most basic, a problem is something that causes one or more incidents – fix the problem, and you’ll eliminate similar incidents in future. Problem management is about managing this problem lifecycle – from initial problem recording and prioritization, through to investigation and resolution.
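As a rough illustration of how problems might be identified from incident history, the Python sketch below flags an incident category as a problem candidate when either of the two conditions mentioned earlier applies: a single incident caused widespread disruption, or the same type of incident keeps happening. The function name, the tuple layout and both thresholds are assumptions chosen for the example.

```python
from collections import Counter


def problem_candidates(incidents, min_recurrences=3, major_impact=100):
    """Return incident categories that deserve a problem record.

    incidents is a list of (category, users_affected) tuples.
    A category qualifies if any single incident affected at least
    major_impact users, or if it occurred at least min_recurrences times.
    """
    counts = Counter(category for category, _ in incidents)
    candidates = set()
    for category, users_affected in incidents:
        if users_affected >= major_impact or counts[category] >= min_recurrences:
            candidates.add(category)
    return sorted(candidates)
```

A low-impact server reset that happens three times is flagged alongside a one-off outage that hit hundreds of users, while an isolated minor incident is not.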