Networking Field Day 35: Selector AI Alerting Discussion with Nitin Kumar

Description

Selector delivers consolidated, actionable alerts through your preferred collaboration platform, such as Slack or Teams. Alerts depend on Selector’s powerful event correlation fueled by advanced AI/ML techniques. Automations can be leveraged to generate service tickets that include detailed summaries, root cause analysis, and even suggested remediations.

Video length: 5:58
Speaker: Nitin Kumar, Co-founder and CTO

Transcript

Nitin: The alerting layer is where the system figures out issues on its own in the background and then publishes that information. Take the alert we saw earlier: it indicated that a particular user made some configuration changes at a specific route, which caused damage. Most likely, the Cloud RAM services product was affected, and many devices were impacted. The system figures all of this out from the data it collects.

This is the pipeline that runs in the background. Data comes into the collection layer on the left, and is stored in our data store. Machine learning and statistical algorithms are used to determine what is good and bad. We focus on the ‘red’ data points, which indicate issues. These points might not be related to each other, so clusters of related events are identified. For example, if a person made a change, a set of connected events is identified, forming what we call correlation trees. These trees can be visualized as graphs, where connected edges indicate related events. The graph is then published into the alerting layer, where summarization occurs, creating a more digestible alert.
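A minimal sketch of the two middle stages described above, not Selector's implementation: flagging "red" data points and grouping related events. It assumes a simple z-score test stands in for the machine learning and statistical algorithms, and that the relationships between events are already known as pairs. The function names, the threshold, and the sample event names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_anomalies(series, threshold=2.0):
    """Return the indices of points whose z-score exceeds the threshold.

    Assumption: a z-score test is only a stand-in for whatever models
    the real system uses to separate good data from bad.
    """
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, v in enumerate(series) if abs(v - mu) / sigma > threshold]


def correlation_clusters(events, links):
    """Group events into connected components with union-find.

    `links` is a list of (event_a, event_b) pairs judged to be related.
    Each returned cluster is a sorted list of event names.
    """
    parent = {e: e for e in events}

    def find(x):
        # Follow parent pointers to the representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for e in events:
        groups[find(e)].add(e)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical data: one latency spike, and four events of which three are linked.
latency_ms = [10, 11, 9, 10, 10, 11, 9, 10, 50]
red_points = flag_anomalies(latency_ms)

events = ["config_change", "bgp_flap", "packet_loss", "fan_alarm"]
links = [("config_change", "bgp_flap"), ("bgp_flap", "packet_loss")]
clusters = correlation_clusters(events, links)
```

Here the three linked events end up in one cluster while the unrelated fan alarm stays on its own, which mirrors the point that red data points are not necessarily related to each other.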

The graph that is generated is typically very dense, and it contains detailed information, such as the configuration change that triggered the events. Summarization helps reduce the complexity; without it, each node in the graph would be a separate alert, leading to an overwhelming number of alerts. Instead, the system generates a single alert with comprehensive information, allowing users to drill down into the details as needed.
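The summarization step can be illustrated with a small sketch. It assumes the correlation tree is available as directed (cause, effect) edges and that a node with no incoming edge is the triggering event; both are assumptions made for illustration, and the function and field names are hypothetical rather than Selector's.

```python
def summarize(edges, details):
    """Collapse a correlation tree into one alert dictionary.

    `edges` is a list of (cause, effect) pairs.
    `details` maps an event id to a human-readable description.
    Assumption: events that nothing else points to are the root causes.
    """
    causes = {c for c, _ in edges}
    effects = {e for _, e in edges}
    roots = sorted(causes - effects)
    return {
        "root_cause": [details.get(r, r) for r in roots],
        "impacted": sorted(effects),
        "event_count": len(causes | effects),
    }


# Hypothetical tree: one configuration change fans out into three symptoms.
edges = [("cfg", "bgp"), ("bgp", "loss"), ("bgp", "latency")]
details = {"cfg": "user made a configuration change"}
alert = summarize(edges, details)
```

Four nodes become a single alert that names the triggering change and still lists the impacted events for drill-down, instead of four separate alerts.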

This process of data democratization enables even those who are not experts in infrastructure to understand what’s happening. In summary, instead of sifting through hundreds of alerts, the system quickly identifies that a specific user made a change that caused the network issue.

Once an alert is generated, it can be consumed as is, but typically, it is mapped into a ServiceNow incident ticket. The system creates this ticket automatically, including all relevant information, which can then be used by downstream teams to address the issue.
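As a rough sketch of what mapping an alert into an incident could look like, the snippet below builds a JSON body for a ServiceNow incident record. The alert structure is hypothetical, and which fields Selector actually populates is not stated in the talk; `short_description` and `description` are standard fields on the ServiceNow incident table, and the 160-character cut is only an illustrative choice. No request is sent here.

```python
import json


def to_incident_payload(alert):
    """Build an incident body from a summarized alert (hypothetical shape)."""
    lines = [f"Root cause: {alert['root_cause']}", "Impacted:"]
    lines += [f"- {item}" for item in alert["impacted"]]
    return {
        # Keep the one-line summary short; the cut-off is an assumption.
        "short_description": alert["title"][:160],
        "description": "\n".join(lines),
    }


# Hypothetical alert content.
alert = {
    "title": "Configuration change caused service degradation",
    "root_cause": "configuration change by a user",
    "impacted": ["service-a", "device-1"],
}
payload = to_incident_payload(alert)
body = json.dumps(payload)  # would be sent to the incident table endpoint
```

Keeping the payload builder separate from the HTTP call makes it straightforward to add the extra, customer-requested fields to the ticket without touching the transport code.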

Ticket creation can be customized. The system has a default view of what information is included in the ServiceNow ticket, but during the managed service engagement, customers can request that additional data be included.

There are other capabilities as well. While we’ve focused on alerting and browsing today, the system also serves as a single source of truth for many users. For example, we have an excellent SNMP collection mechanism, and many use us as a frontend to ServiceNow. Instead of creating thousands of tickets, our system creates a single, comprehensive incident in ServiceNow. Additionally, having a Kubernetes-based environment allows us to deploy on-premises, off-premises, or in the cloud, scaling as needed based on data growth.
