Introduction
A new generation of IT operations tools is emerging, collectively referred to as “AIOps”. As the name implies, these are IT operations tools that leverage artificial intelligence / machine learning (AI/ML). Because questions often arise about how much AI/ML these products actually use, it is worth examining one example: the Selector platform.
AI is commonly divided into three broad categories. Narrow AI, the category into which most AIOps tools fall, can perform specific tasks but is not humanlike. General AI would be comparable to human intelligence, and Super AI would exceed it.
While the future of General AI and Super AI is uncertain, Narrow AI is already making significant contributions, dramatically increasing the efficiency and effectiveness of operations teams.
Supervised Learning
No discussion of AI/ML would be complete without supervised learning, perhaps the most widely discussed and best-known machine learning approach. Supervised learning is common in classification and regression tasks, where a model is trained to learn the relationship between structure in the data and a label (this is a cat, this is a dog, ...).
Some aspects of AIOps may be well suited to supervised learning, for example root cause analysis of previously captured anomalies. This blog, however, focuses on rapid anomaly detection, ranking, and contextualization from unlabeled data streaming at a rate of millions of messages per second.
Rapid anomaly detection approaches that are already delivering significant value include unsupervised learning and self-supervised learning. Neither requires long training times or large datasets of previously labelled data.
Unsupervised Learning
Unsupervised learning discovers structure in data without any hints, such as labels, about what that structure is. Clustering similar items together is one example. This approach is extremely helpful in network health applications, where the goal is to contextualize and correlate many different types and sources of data.
Unlike static rules-based systems, unsupervised learning creates a dynamically assembled and filtered view of networks and applications. The analysis adjusts to whatever data is available, rather than relying on rules-based logic that works best only when specific data is present.
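To make the idea of clustering concrete, here is a minimal sketch of plain k-means, a standard unsupervised clustering algorithm. It is not the method used by the Selector platform, which this post does not specify. The two features per point (latency in milliseconds and error rate in percent) and the sample values are hypothetical; a real system would also normalize features so that one does not dominate the distance.

```python
import random
from math import dist


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters using plain k-means; no labels are needed."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments no longer change
        centroids = new_centroids
    labels = [min(range(k), key=lambda i: dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical (latency_ms, error_pct) readings from five devices.
readings = [(10, 0.1), (12, 0.2), (11, 0.1), (200, 5.0), (210, 4.8)]
centroids, labels = kmeans(readings, k=2)
print(labels)
```

The three healthy-looking devices end up sharing one cluster label and the two degraded devices share the other, even though the algorithm was never told which readings were "good" or "bad".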
Anomaly sources and their connections to other resources are identified. Correlation also occurs across time: a configuration change made before the anomaly emerged can be correlated with it, and operations teams can quickly drill down into what that change was. Other context, such as inventory data, can be used to provide a richer view of anomaly sources and connections.
The result is a clear, noise-reduced picture of health that focuses the energy, experience and skills of operations teams on the next steps: collaboration and action.
Most importantly, this approach discovers anomalies that have never occurred before. This is critical in increasingly dynamic and complex environments.
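Correlation across time can be illustrated with a simple lookback window. The sketch below is an assumption for illustration only: the `Event` record, the `preceding_changes` helper, the 30-minute window and the optional set of related devices (standing in for topology or inventory context) are all hypothetical and do not describe how Selector performs correlation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    device: str
    kind: str  # hypothetical kinds: "config_change" or "anomaly"
    time: datetime
    detail: str


def preceding_changes(anomaly, events, window=timedelta(minutes=30), related_devices=None):
    """Return config changes on the anomalous device (or related devices)
    that happened within `window` before the anomaly, most recent first."""
    devices = {anomaly.device} | set(related_devices or ())
    candidates = [
        e for e in events
        if e.kind == "config_change"
        and e.device in devices
        and anomaly.time - window <= e.time <= anomaly.time
    ]
    return sorted(candidates, key=lambda e: e.time, reverse=True)


# Hypothetical event history around a packet-loss anomaly on router1.
t0 = datetime(2024, 1, 1, 12, 0)
history = [
    Event("router1", "config_change", t0 - timedelta(minutes=10), "changed routing policy"),
    Event("router1", "config_change", t0 - timedelta(hours=2), "updated NTP server"),
    Event("router2", "config_change", t0 - timedelta(minutes=5), "edited interface description"),
]
anomaly = Event("router1", "anomaly", t0, "packet loss spike")

print([e.detail for e in preceding_changes(anomaly, history)])
print([e.detail for e in preceding_changes(anomaly, history, related_devices={"router2"})])
```

Only the change inside the window surfaces for router1; passing router2 as a related device (for example, a neighbor known from inventory data) widens the view to include its recent change as well.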
Self-Supervised Learning
Self-supervised learning is often described as close to how people learn: by observation. Rather than relying on human-provided labels, the data supplies its own training signal, for example by using past values of a measurement to predict the next one.
In network health applications, it is common to monitor thousands of specific measurements and then alert on threshold violations. Thresholds can be set manually by operations teams, or default heuristics can be applied. Both approaches have drawbacks, ranging from the time needed to maintain the thresholds to a flood of false alarms.
A better approach is to use self-supervised learning: observe what is “normal”, and then alert when a measurement deviates significantly from a prediction based on that normal behavior. In this way, thresholds are not only dynamic, they are automated. The result is less noise and less manual threshold setting by operations teams.
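As a minimal sketch of a prediction-based dynamic threshold, the code below uses an exponentially weighted moving average (EWMA) of the mean and variance as a deliberately simple stand-in for a learned model of "normal". Each new observation acts as the target for the previous prediction, which is the self-supervised part. The function name `detect_anomalies` and the parameters `alpha`, `k` and `warmup` are hypothetical choices, not Selector's models or settings.

```python
from math import sqrt


def detect_anomalies(values, alpha=0.3, k=3.0, warmup=5):
    """Return indices of values that deviate more than k standard deviations
    from an EWMA prediction built only from earlier values."""
    if not values:
        return []
    anomalies = []
    mean, var = values[0], 0.0  # baseline learned from the first observation
    for i, x in enumerate(values[1:], start=1):
        predicted = mean
        error = x - predicted
        # The threshold is not fixed: it tracks the observed variability.
        if i >= warmup and abs(error) > k * sqrt(var):
            anomalies.append(i)
            continue  # keep the anomaly out of the "normal" baseline
        # Self-supervised update: the observed value is the label for the prediction.
        mean = predicted + alpha * error
        var = (1 - alpha) * (var + alpha * error * error)
    return anomalies


# Hypothetical latency samples (ms) with one spike at index 10.
latency = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 50, 11, 10]
print(detect_anomalies(latency))
```

No one had to choose a threshold such as "alert above 40 ms"; the limit is derived from recent behavior. One design trade-off: excluding flagged points protects the baseline from spikes, but a genuine, persistent level shift would keep alerting, so a practical system would need a way to re-learn the new normal.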
Conclusion
The human imagination, aided by science fiction created for entertainment, often creates expectations that are ahead of where technology currently is. In the case of AI, that expectation is General AI and Super AI. However, Narrow AI is real and, applied intelligently, is delivering significant value for operations teams.
For anomaly detection, the AI/ML behind the AIOps label comes from unsupervised learning and self-supervised learning. In addition to these recognized “AI/ML” approaches, other algorithms are used, for example recommender systems for ranking.