Excessive Alert Noise: Cause, effect, and solution

With the exponential growth of the IT sector over the last few years, traditional operational tools and processes are no longer enough to stay ahead of the market. Problems and anomalies are treated as ‘events’. Each event triggers an alert in the system, leading to separate incidents that each require individual resolution. As data volumes, hybrid environments, operational tools, and metrics have multiplied, alert volume has grown with them. Teams are inundated with a high volume and variety of log data, much of it false or redundant alerts.

About 40% of IT organizations see over a million event alerts a day, with 11% receiving over 10 million alerts a day.

Most IT teams today operate in disparate silos, often unaware of the assets they have, how those assets are utilized, or how they depend on one another, which compounds the problem.

Why is there an excess of alert noise?

Some of the common reasons for an increasing volume of alert noise are:

  1. Lack of stack awareness
  2. Static thresholds
  3. Alert Storms

Lack of stack awareness

Static thresholds

Alert Storms

Alert noise is estimated to cost companies an average of $1.27 million per year.

How does CloudFabrix help with alert reduction?

The CloudFabrix AIOps platform uses a combination of user configurations and advanced AI/ML algorithms, such as correlation, anomaly detection, and forecasting, to reduce alert volume through grouping, suppression, and prevention.

Rule Based → AI/ML and Analytics Based Approach

Instead of relying on manual tagging and rule-based grouping, CloudFabrix automatically groups multiple alerts into actionable problems based on time and asset dependency. It also uses predictive analytics, which reduces alert noise significantly.
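To make the idea of time-based and dependency-based grouping concrete, here is a minimal sketch in Python. It is not CloudFabrix's implementation. The Alert class, the group_alerts function, the depends_on map, and the 300-second window are all hypothetical names and values chosen for illustration, and the dependency check only looks at direct dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record used only for this illustration."""
    asset: str
    timestamp: float  # seconds since some reference point
    message: str


def related(a, b, depends_on):
    """Return True if a and b are the same asset or directly depend on each other."""
    return a == b or b in depends_on.get(a, set()) or a in depends_on.get(b, set())


def group_alerts(alerts, depends_on, window_seconds=300.0):
    """Greedily group alerts that are close in time and related by dependency.

    depends_on maps an asset name to the set of assets it directly depends on.
    An alert joins the first existing group whose latest alert is within
    window_seconds and that contains at least one related asset.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for group in groups:
            close_in_time = alert.timestamp - group[-1].timestamp <= window_seconds
            if close_in_time and any(
                related(alert.asset, member.asset, depends_on) for member in group
            ):
                group.append(alert)
                break
        else:
            # No matching group was found, so this alert starts a new one.
            groups.append([alert])
    return groups
```

With a dependency map in which a web server depends on an application server, which in turn depends on a database, alerts from all three within a few minutes collapse into one group, while an alert from an unrelated device stays separate. A production system would likely follow transitive dependencies and use learned rather than fixed time windows.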

Static Thresholds → Dynamic Thresholds

To address this problem

Alert Storms → Actionable Incidents

  1. Planned Outages: With our platform, alerts can be configured to be ignored during planned outages such as patching, backup, or maintenance. In addition, we can automatically exclude network device access ports from monitoring, since these generate an excessive number of unwanted alerts whenever employees, remote users, phones, and other devices connect to or disconnect from the network.
  2. Unplanned Outages: Unplanned outages, device fluctuations, and flapping situations cause alert storms. We detect these automatically and suppress the alerts during events such as network disruptions or device unavailability.
  3. Cascading Alerts: These occur when a device or component fails and other parts of the system raise alerts because they depend on it, or because the monitoring system has lost connectivity to the dependent devices. This deluge of alerts often points to the same underlying issue. Such alerts can be grouped together if the system knows the interdependencies and can identify the root cause.
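The first two cases above, suppression during planned maintenance and suppression of flapping devices, can be sketched in a few lines of Python. This is an illustrative sketch only, not the platform's actual logic. The names in_maintenance, FlapDetector, and should_suppress, the shape of the windows dictionary, and the default thresholds are all assumptions made for the example.

```python
from collections import deque


def in_maintenance(asset, timestamp, windows):
    """Return True if timestamp falls inside a planned window for the asset.

    windows maps an asset name to a list of (start, end) tuples.
    """
    return any(start <= timestamp <= end for start, end in windows.get(asset, []))


class FlapDetector:
    """Flags an asset as flapping when it changes state too often.

    An asset is considered flapping once it has recorded more than
    max_changes state changes within the last period_seconds.
    """

    def __init__(self, max_changes=4, period_seconds=60.0):
        self.max_changes = max_changes
        self.period_seconds = period_seconds
        self._changes = {}  # asset name -> deque of recent change timestamps

    def record(self, asset, timestamp):
        """Record one state change and return True if the asset is flapping."""
        changes = self._changes.setdefault(asset, deque())
        changes.append(timestamp)
        # Drop changes that have fallen out of the sliding period.
        while changes and timestamp - changes[0] > self.period_seconds:
            changes.popleft()
        return len(changes) > self.max_changes


def should_suppress(asset, timestamp, windows, detector):
    """Suppress alerts raised during planned maintenance or while flapping.

    Alerts inside a maintenance window are suppressed without being
    counted as state changes by the flap detector.
    """
    if in_maintenance(asset, timestamp, windows):
        return True
    return detector.record(asset, timestamp)
```

In this sketch, an alert from a server inside its patching window is dropped immediately, and a switch port that changes state more than the allowed number of times in a minute is suppressed until it settles. Fixed limits are used here for simplicity; they are placeholders, not recommended values.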

Please feel free to ask anything about AIOps, and we will be happy to answer any queries you have.
