Network operations management comprises the activities that networking staff perform to monitor and manage a network’s availability and performance and to respond to related alerts. These activities are essential to ensure that the network infrastructure runs smoothly and within its optimal operating conditions. However, defining what these optimal conditions should be isn’t clear-cut. Third-party vendors may provide guidance in certain areas, but in most cases the responsibility falls on the network operations engineers to define the thresholds. The challenge is twofold: one needs to define these thresholds and then keep them updated as network conditions change over time.
Defining these thresholds usually goes wrong in one of two ways. First, the thresholds are set too high, which gives a false representation of the normal operating range, so genuine problems can go unnoticed. Second, the thresholds are set too low, which results in a very noisy network operation. So, the challenge is to set and maintain a balance that produces meaningful alerts when these thresholds are crossed.
Oftentimes, these thresholds are set based on what humans have learned over time and what we perceive to be normal or not for these metrics. In today’s cloud era, network and IT infrastructures are more complex, and thousands of different metrics (with millions of associated time series) are required to fully understand the operating state. It is beyond our scope of knowledge to decide what is normal or not at this level of complexity and scale.
How can algorithms help?
One advantage of algorithms is that they extend our ability to perform job-related tasks. In this situation, algorithms can learn what is normal or not for each of the thousands of metrics or millions of time series separately. Essentially, these algorithms rely on the same learning mechanism we do: referencing historical data.
Selector’s platform integrates algorithms to learn from the past and to help identify what the normal operating conditions are. In doing so, the algorithms dynamically set these thresholds to values that are representative of the behavior of these metrics. When thresholds are crossed, this suggests that an anomaly has occurred and requires further investigation.
Selector’s auto-baselining algorithm computes the dynamic threshold for each time series in real time. It accomplishes this by looking at both the short-term past and the long-term past. The short-term past (usually a few hours) is representative of the current dynamic behavior. The long-term past (usually a few weeks) enables the identification of seasonal patterns that determine the expected operating range for that time series based on the hour of the day, the day of the week, etc.
By combining the short-term past and long-term past, Selector’s algorithm (which is continuously updated) will determine the normal operating range to reflect evolving conditions.
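To make the idea of combining the two horizons concrete, here is a minimal Python sketch. It is not Selector’s actual algorithm, whose internals are not described here. It assumes a simple scheme in which the center of the band is a weighted average of the recent mean and the seasonal mean for the same hour of the week, and the width is a multiple of the larger standard deviation. The names dynamic_band and is_anomaly, and the parameters k and weight, are hypothetical.

```python
from statistics import mean, stdev


def dynamic_band(seasonal, recent, k=3.0, weight=0.5):
    """Return a (low, high) expected operating range for one time series.

    seasonal: values seen at this same hour of the week in previous weeks
              (the long-term past); needs at least two points.
    recent:   the latest values, e.g. the last few hours
              (the short-term past); needs at least two points.
    k:        band width in standard deviations (illustrative default).
    weight:   how much the short-term past counts versus the seasonal one.
    """
    # Blend the current level with the seasonally expected level.
    center = weight * mean(recent) + (1 - weight) * mean(seasonal)
    # Use the more variable of the two horizons so the band is not too tight.
    spread = max(stdev(recent), stdev(seasonal))
    return center - k * spread, center + k * spread


def is_anomaly(value, band):
    """A value outside the band is flagged for further investigation."""
    low, high = band
    return value < low or value > high


# Example: link utilization samples for one hour-of-week slot.
band = dynamic_band(seasonal=[100, 104, 96, 100], recent=[98, 102, 100, 100])
print(band, is_anomaly(105, band), is_anomaly(130, band))
```

Because the band is recomputed as new samples arrive, it follows the metric as conditions evolve, which a fixed static threshold cannot do.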
Once anomalies have been detected, they can be correlated with any other anomaly identified by the system or other ongoing events received via different mechanisms such as SNMP traps, syslog messages, structured events from third-party applications, configuration changes, etc. Alerts can be configured and associated with the anomalies detected or with the aggregated event clusters and turned into incidents that may trigger ticket creation in the end user’s ticketing systems.
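As an illustration of what such correlation can look like, the sketch below groups an anomaly with other events from the same device that occurred close to it in time. The matching rule (same device, within a fixed time window) is an assumption made for this example and not a description of how Selector correlates events. The Event class, the correlate function, and the window_s parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str       # e.g. "anomaly", "syslog", "snmp_trap", "config_change"
    device: str       # device the event was reported for
    timestamp: float  # seconds since the epoch


def correlate(anomaly, events, window_s=300.0):
    """Return the events on the anomaly's device within window_s seconds of it."""
    return [
        e for e in events
        if e.device == anomaly.device
        and abs(e.timestamp - anomaly.timestamp) <= window_s
    ]


anomaly = Event("anomaly", "router-1", 1000.0)
events = [
    Event("syslog", "router-1", 900.0),          # same device, close in time
    Event("config_change", "router-1", 2000.0),  # same device, too late
    Event("snmp_trap", "router-2", 1000.0),      # different device
]
related = correlate(anomaly, events)
print([e.source for e in related])
```

A cluster built this way could then be attached to a single alert instead of raising one alert per event.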
Setting static thresholds should be an activity limited to certain key metrics for which there is a well-known operating boundary. For everything else, given the scale and complexity at which infrastructures operate, Machine Learning (ML) algorithms such as Selector’s auto-baselining can deliver the accuracy and real-time behavior needed to produce valuable information.
Let’s explore some of the benefits of the Selector Analytics platform:
First, operations teams are no longer required to create and maintain static thresholds.
Second, by automatically and dynamically computing and updating the normal operating ranges, the Selector Analytics platform can help reduce ‘alert fatigue’ by minimizing the false positives associated with manual static thresholds.
Below are several screenshots of this unique feature:
Interested in learning more about this feature? Contact us today for a free demo!