“The warnings in cockpits now are prioritized so you don’t get alarm fatigue…We work very hard to avoid false positives because false positives are one of the worst things you could do to any warning system. It just makes people tune them out.” – Captain Chesley “Sully” Sullenberger.
The human brain is designed to be aware of potential dangers in our environment, a result of necessity when humans were living among their predators. Over time, as civilizations emerged, these awareness techniques evolved into a collaborative effort: for example, building a castle with a moat and lookout towers to warn of an approaching enemy.
In other words, humans are hardwired to know as much as possible about our surroundings, as if our lives depended on it, because they often did. Fast forward to today, and every industry has examples of “alert fatigue,” a problem in which people become desensitized to alerts, notifications, and alarms and fail to respond appropriately.
Our innate desire to be notified about every possible detail has resulted in alert fatigue not just inside of IT but in many facets of our everyday lives. And it is a real problem with significant consequences.
Causes of Alert Fatigue
Because we desire to be notified about our surroundings and environment, we tend to add more alerts to our systems, the way hoarders store newspapers in their living rooms. We know we don’t need them now, but they may prove useful later. But later never comes, and eventually you move away from your house or job, leaving the mess of alert notifications behind for the next person to clean up.
The need for additional alerts is often due to the simple fact that your enterprise is growing in complexity. One day, you manage 1,000 nodes or entities, each with defined alerts. In another year, you have added a few hundred (or more) nodes and more alert notifications.
With the growth in complexity comes the reliance on default alert settings. Metrics like resource health, bandwidth utilization, or throughput are all standard alerts offered out-of-the-box by many vendors. Getting alerts for a few devices is one thing, but getting thousands during an event or outage is another.
The growth in your enterprise and managed nodes is rarely matched by comparable growth in human resources. Low or inadequate staffing is another cause of alert fatigue. When there are not enough operators to handle the workload, alert notifications run wild: they go unchecked, or they are acknowledged and closed quickly without remediation.
Another cause of alert fatigue is a large volume of false positives. False positives are a natural result of off-the-shelf monitoring solutions, as operators need time to tune the system for the needs of their specific environment. Dealing with a large volume of false positives and low staffing is the primary way alert fatigue enters your enterprise.
Risks of Alert Fatigue
Left unchecked, alert fatigue has serious consequences. I’m not just talking about missing an alert on a router, and suddenly, Adam in Accounting can’t look at cat photos after lunch. Alert fatigue can lead to someone’s death.
The opening quote gave an obvious example: the need to reduce false positives inside an airplane cockpit. Nobody wants their pilot ignoring alarms while flying the plane. A not-so-obvious example involves a hospital administering a lethal dose of medicine more than 38 times the prescribed amount. A famous example from 2022 was when Rogers Communications suffered an outage lasting for days for some customers, including emergency services.
Alert fatigue causes higher workloads for operators, leading to burnout if staffing needs are unmet. Burnout leads to increased turnover as employees look for less stressful work. There is also the time spent, and therefore lost, due to the constant need to respond to and remediate alerts. The extra time spent on alert remediation leads to missed deadlines for other projects, decreased productivity, and slower response times.
Alert fatigue is why companies complain they don’t have time to bring things into compliance because they are too busy “putting out fires,” without realizing they are not just the firefighters but also the arsonists.
How To Reduce Alert Fatigue
Reducing alert fatigue requires understanding what an alert is and what it is not. For this, I offer a simple sentence:
“Alerts require action; everything else is informational and can be reviewed later.”
Alerts should be concise, provide context, and be actionable. If you send an alert or notification, you intend to grab and hold the user’s attention while the user decides how to take action.
Once alert fatigue hits, the first line of defense is a series of email rules to forward alerts to a different folder. Once email rules are deployed, your alerting system has lost all purpose. You are doomed to miss critical notifications as they are redirected to a nested folder, never to be read.
Here are some ways to avoid missing critical information and reduce alert fatigue.
Automation
Consider automating as many first-response actions as possible. Using alert triggers is one way to do this, as they perform actions automatically when an alert is raised. Continue to automate as many tasks in the chain as possible so that the user is alerted only when action on their part is essential.
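As a rough illustration of an alert trigger, the sketch below tries an automated first response and notifies a person only when that response is missing or fails. It is a minimal sketch under stated assumptions, not any vendor's API: `handle_alert`, `restart_service`, the `runbooks` mapping, and the alert fields are all hypothetical names chosen for this example.

```python
def handle_alert(alert, runbooks, notify):
    """Run the automated first response; notify a human only if it does not resolve the alert.

    `runbooks` maps an alert type to a callable that returns True when the
    automated action resolved the problem. All names here are hypothetical.
    """
    action = runbooks.get(alert["type"])
    if action is not None:
        try:
            if action(alert):
                return "auto-remediated"
        except Exception as exc:
            # Keep the failure as context so the human sees what was already tried.
            alert = {**alert, "automation_error": str(exc)}
    notify(alert)
    return "escalated"


def restart_service(alert):
    """Placeholder first response; a real one would call your orchestration tooling."""
    return bool(alert.get("restartable", False))


# Example wiring: one automated runbook, and a list standing in for a pager.
paged = []
runbooks = {"service_down": restart_service}
handle_alert({"type": "service_down", "restartable": True}, runbooks, paged.append)
handle_alert({"type": "disk_failure", "device": "db-01"}, runbooks, paged.append)
```

In this wiring, only the disk failure reaches the pager, because no automated response exists for it. The restart is handled without interrupting anyone.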
A key aspect of automation is flexibility. Platforms such as Selector allow for automation directly or through partner automation platforms, allowing for greater flexibility for customers looking to automate remediation of alerts.
Intelligent thresholds
Legacy monitoring and alerting tools fired when a static threshold was crossed. For example, if the CPU utilization on a server is greater than 80%, send an alert to a person or team. Over time, these tools tried to improve the user experience by gathering usage data and calculating a baseline. This way, if 80% CPU utilization were “normal” for the server, an alert would not be sent at 80%, only when utilization rose meaningfully above that baseline.
This rudimentary approach to baselining was never adequate for anyone’s needs, but it was better than nothing. A more modern approach to baselining is through time-series analysis and forecasting models. These models account for trends and seasonalities hidden in the data, allowing for a better forecast of the values expected next, so alerts can be configured around those expected values instead of static thresholds.
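To make the difference concrete, the sketch below compares a static threshold with a very simple seasonal baseline. It is a stand-in for illustration only: it uses a per-slot mean and standard deviation, not a full forecasting model, and the function names, the `k` multiplier, and the `min_sigma` floor are assumptions made for this example.

```python
from statistics import mean, stdev


def static_alerts(values, threshold=80.0):
    """Indices where a value exceeds a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def seasonal_baseline_alerts(values, period, k=3.0, min_history=3, min_sigma=1.0):
    """Indices where a value deviates from the baseline for its slot in the cycle.

    The baseline for index i is built from earlier values at the same position
    in the cycle (i - period, i - 2 * period, ...). A point alerts when it is
    more than k standard deviations from that slot's mean.
    """
    alerts = []
    for i, v in enumerate(values):
        history = values[i % period:i:period]  # same slot, earlier cycles only
        if len(history) < min_history:
            continue  # not enough data to call anything abnormal yet
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)  # floor avoids a zero-width band
        if abs(v - mu) > k * sigma:
            alerts.append(i)
    return alerts


# Four samples per cycle; slots 1 and 2 are routinely busy (85% and 90%).
cpu = [20.0, 85.0, 90.0, 30.0] * 5
cpu[16] = 70.0  # unusual load in a slot that is normally quiet

noisy = static_alerts(cpu)                         # fires on every busy slot
focused = seasonal_baseline_alerts(cpu, period=4)  # fires on the unusual point
```

With this sample data, the static rule fires on every routine busy period and never notices the 70% reading. The seasonal baseline stays quiet during expected peaks and flags only the reading that is abnormal for its slot.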
This modern approach to baselining is not only for metrics but also for logs. Selector creates baselines for both metrics and logs and provides the ability to create multivariate alert thresholds, tying together correlated events so the user better understands the required actions in response to an alert notification.
Increase headcount
Every company wants to “do more with less,” but the reality is that it is no longer possible for small teams of operators to manage the sheer scale of telemetry available today. And that’s where a company like Selector helps by leveraging AIOps to increase productivity. But for those companies lagging, hiring more staff to help with the workload will be necessary even after automation and intelligent thresholds are applied.
However, even large teams of operators struggle to keep up with the complexity and volume of telemetry seen in modern networking environments. Companies need a paradigm shift to enable these finite teams to scale their impact, which often brings them back to technical solutions like AIOps.
Continuous Improvement
Another reason for hiring a dedicated administrator for your alerting and monitoring system is for continuous improvement. The administrator would be responsible for reviewing the alerting data and adjusting accordingly. For example, your logs may indicate a device was rebooted, while metrics may show the device was down for 60 seconds. Alerting on those activities separately means two alerts, which is double the work. This is an area where Selector shines through alert consolidation/deduplication, with some customers seeing a reduction ratio of 75:1 in alert volume.
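The reboot example can be sketched as a small consolidation step that folds alerts for the same device into one incident when they arrive close together. This is a minimal sketch of time-window grouping, not a description of how Selector implements consolidation. The `Alert` and `Incident` types, the `consolidate` function, and the 120-second window are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    device: str
    source: str       # e.g. "log" or "metric"
    message: str
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    device: str
    first_seen: float
    last_seen: float
    alerts: list = field(default_factory=list)


def consolidate(alerts, window_seconds=120.0):
    """Group alerts per device; an alert joins the open incident if it arrives within the window."""
    incidents = []
    open_by_device = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = open_by_device.get(alert.device)
        if current is not None and alert.timestamp - current.last_seen <= window_seconds:
            current.alerts.append(alert)
            current.last_seen = alert.timestamp
        else:
            current = Incident(alert.device, alert.timestamp, alert.timestamp, [alert])
            open_by_device[alert.device] = current
            incidents.append(current)
    return incidents


raw = [
    Alert("router-1", "log", "device rebooted", 0.0),
    Alert("router-1", "metric", "device down for 60 seconds", 60.0),
    Alert("router-2", "metric", "high latency", 75.0),
]
incidents = consolidate(raw)  # the two router-1 alerts collapse into one incident
```

Grouping by device and time is the simplest possible correlation key. A real deployment would likely also need topology or service relationships, so that a failed upstream device and its downstream symptoms land in the same incident.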
The administrator could also determine which alerts are urgent and important and which are not. Going further, they could set priority levels for alerts and ensure teams only receive a certain number of alerts marked “high” per day. Of course, these administrators are premium talent. Much like the previous section on increased headcount, hiring premium talent to solve the problem is a stopgap at best. When the tipping point is crossed, companies will look to a platform like Selector to identify improvement opportunities as part of daily operations.
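One way to picture a daily cap on “high” alerts is a small per-team budget. The sketch below is illustrative only: the `HighPriorityBudget` class, the routing labels, and the choice to send overflow to a review queue rather than drop it are all assumptions, not a prescribed policy.

```python
from collections import defaultdict
from datetime import date


class HighPriorityBudget:
    """Caps how many 'high' alerts each team is paged for per calendar day."""

    def __init__(self, max_high_per_day=5):
        self.max_high_per_day = max_high_per_day
        self._paged = defaultdict(int)  # (team, day) -> pages sent so far

    def route(self, team, priority, day):
        """Return where an alert should go: 'page', 'review', or 'digest'."""
        if priority != "high":
            return "digest"  # informational; read later
        key = (team, day)
        if self._paged[key] < self.max_high_per_day:
            self._paged[key] += 1
            return "page"
        # Over budget: nothing is dropped, but the overflow is a signal that
        # the priority levels themselves need tuning.
        return "review"


budget = HighPriorityBudget(max_high_per_day=2)
today = date(2024, 1, 15)
routes = [budget.route("network-ops", "high", today) for _ in range(3)]
```

Here the third “high” alert of the day goes to the review queue instead of the pager. A team that regularly exhausts its budget is telling the administrator that too many alerts are marked “high”.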
Summary
Alert fatigue is a pervasive problem that affects many industries. Alerts become meaningless when you get so many that you can’t read them. Identifying and reducing alert fatigue requires a concerted effort across teams, and most companies don’t have the time, headcount, or expertise to dedicate to such efforts.
Reducing alert fatigue is a core function of the Selector platform. Our proprietary machine learning algorithms de-duplicate alerting data, ensuring only actionable, relevant alerts are sent your way. Fewer alerts mean less noise, which enables your team to take action faster rather than sift through dozens or hundreds of alerts.
Selector consolidates multiple alerts for the same event and filters out non-actionable events. By grouping clusters of related alerts, Selector reduces the overall number of notifications your users receive, enabling quicker response times and letting your operations team focus on solving priority issues immediately.
Nobody wants their batteries drained by constant alert notifications, but it happens over time. Still, it is possible to reduce alert fatigue through your efforts or by leveraging a platform designed for that purpose. Remove alert fatigue and help your team stay focused, productive, and motivated to ensure your business runs smoothly.