Tokyo Interop Presentation

Description

This video is a recording of Kevin Kamel and John Heintz’s presentation at Tokyo Interop 2024. They introduced Selector as a platform that transforms operational data into actionable insights, discussed three critical operations challenges Selector solves with AI, and gave a demonstration of the platform.

Video length: 15:32
Speakers
Kevin Kamel, VP of Product Management, Selector
John Heintz, Principal Solution Architect, Selector

Transcript

Kevin: Thanks for joining us today. I’m Kevin Kamel, and I head up product here at Selector AI. With me is John Heintz, a principal solution architect here with the company. Today, we’ll be presenting about Selector. I’ll walk you through a high-level introduction, and John will provide a demo of the platform and its capabilities.

So, what is Selector AI? Selector is a platform to transform your operational data into actionable insights. We collect telemetry from your network and environment directly from your devices or from your existing monitoring systems. We then analyze that data using AI and ML to establish a sense of what’s normal in your environment, helping to detect when things go wrong and pointing you towards the source of the problems so that you can quickly address them.

Diving in, we see three critical challenges that operations staff deal with today. First, the collection and management of heterogeneous data. By heterogeneous data, I mean the telemetry collected across the environment—metrics, logs, and events from the network up to your applications and everything in between. As network operators, it’s not uncommon for us to rely on a large number of tools—10, 20, or more—to capture all the different types of telemetry emitted by the environment. There’s a lot of friction here. Implicit siloing happens during the deployment of those tools: not everyone can access each tool, and not everyone even knows how to use each one in the first place. This is a big problem for teams, and it is something we solve with Selector by collecting and consolidating all of the relevant telemetry and offering our customers a single pane of glass—a single place to look at the data—helping them focus on the incident rather than needing to log into all of those different legacy tools.

Beyond that is a lack of automated insights delivered by existing tools. Most do little more than collect and warehouse the telemetry in the environment. An incident happens, and the team is left to manually sort out how and why that incident occurred. In our experience, a vast majority—something like 90% of the time spent resolving a network or IT issue—is actually spent investigating the issue. Operators respond to some sort of alert or notification and then begin to manually dig through the different platforms, attempting to connect the dots as to what happened. This is a labor-intensive process and typically requires the involvement of senior staff.

With Selector, your operational data is continuously analyzed. This enables us to notice in real time when things are starting to go wrong. We then send what we call a smart alert: a notification of an emerging problem that leverages our ML-driven correlation to identify its root cause. Smart alerts tell the team that something is wrong and point them at the cause, enabling the team to skip a lengthy investigation and focus on quickly remediating the issue.

Which brings us to siloed collaboration. Today, incident response remains a fragmented process. Siloed tools often mean that remediation requires reaching out to different teams and asking for their help. These activities occur across the phone, Zoom, chat, and sometimes even email. This process is slow and inefficient. The same information is repeated over and over with every new conversation. With Selector, we offer an alternative model.

Combining ChatOps (Slack, Teams, or whatever your favorite collaboration platform is) with Selector Co-Pilot, our Conversational AI Co-Pilot powered by an integrated generative LLM, enables a chat-based triage model where you can use natural language to dig into the state of your system, learn more about an incident, and share information with your team in real time. There is no need to learn arcane query languages to leverage the platform, and no special training is needed either. Everyone can use Co-Pilot to interrogate their telemetry and dig into what may be wrong and why.

In terms of data sources, Selector can work with pretty much any type of data from any source. Integrations at Selector are built with a declarative ETL, a low-code/no-code approach that allows us to very quickly build out different integrations to get data into the platform. This allows us to work with community and commercial software, or even proprietary software that may be unique to your organization.
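To make the declarative idea concrete, here is a minimal Python sketch of what a low-code integration mapping could look like. The field names, the syslog example, and the apply_mapping helper are hypothetical illustrations, not Selector’s actual integration schema.

    # Hypothetical declarative mapping for a syslog feed: describe where each
    # normalized field comes from, rather than writing bespoke parsing code.
    SYSLOG_INTEGRATION = {
        "source": "syslog",
        "extract": {
            "timestamp": "header.timestamp",
            "device": "header.hostname",
            "message": "body",
        },
    }

    def apply_mapping(raw: dict, mapping: dict) -> dict:
        """Resolve each dotted path in the mapping against a raw record."""
        def resolve(path, obj):
            for key in path.split("."):
                obj = obj.get(key) if isinstance(obj, dict) else None
            return obj
        return {field: resolve(path, raw) for field, path in mapping["extract"].items()}

    raw = {"header": {"timestamp": "2024-06-13T01:02:03Z", "hostname": "router1"},
           "body": "Interface Gi0/0/1 changed state to down"}
    print(apply_mapping(raw, SYSLOG_INTEGRATION))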

Let’s walk through how the platform works end to end. As our first step, we’re going to collect the telemetry—metrics, logs, and events—and pull that data into our data lake. Along the way, we’ll pull in any CMDB data that may be available, along with any sources of metadata that may be relevant, including topological data, if present.

Now, once we have the data flowing into the system, we’re going to analyze it. For time-series data, we’re going to process the metrics. We’re going to apply proprietary machine learning to baseline each of the time series submitted to the system, accounting for cyclicity, time of day, and seasonality (time of year). We’ll establish a sense of what’s normal for each time series, and when we see an anomaly—a deviation from normal—we’ll record it, along with its metadata, for later use in correlation.
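As a rough illustration of the baselining idea (not Selector’s proprietary models), the sketch below buckets a metric by hour of week to capture time-of-day and day-of-week cyclicity, then flags points that deviate too far from the learned norm.

    from collections import defaultdict
    from statistics import mean, stdev

    def seasonal_baseline(samples):
        """Group (datetime, value) samples by hour-of-week to capture
        time-of-day and day-of-week cyclicity; return per-bucket mean/std."""
        buckets = defaultdict(list)
        for ts, value in samples:
            buckets[(ts.weekday(), ts.hour)].append(value)
        return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0) for k, v in buckets.items()}

    def is_anomaly(ts, value, baseline, k=3.0):
        """Flag a point that deviates more than k standard deviations from the
        learned norm for its hour-of-week bucket."""
        mu_sigma = baseline.get((ts.weekday(), ts.hour))
        if mu_sigma is None:
            return False                      # no history for this bucket yet
        mu, sigma = mu_sigma
        return sigma > 0 and abs(value - mu) > k * sigma

    # Example (hypothetical samples):
    # baseline = seasonal_baseline(history)          # history: [(datetime, value), ...]
    # alert = is_anomaly(now, current_value, baseline)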

For logs, the processing is similar. We apply proprietary ML to normalize logs entering the system and leverage models to perform clustering and named entity recognition. The resulting pipeline allows large-scale log processing, converting relevant logs into events while capturing the related metadata. These events are then stored for later use in correlation.
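For a sense of what clustering buys you, here is a toy sketch that masks variable fields so logs differing only in numbers or addresses collapse into a single template. Selector’s actual pipeline uses ML models rather than a regex like this, and the log lines are invented examples.

    import re
    from collections import defaultdict

    # Mask variable fields (numbers, IP addresses) so that logs that differ
    # only in those values collapse into one cluster / template.
    VARIABLE = re.compile(r"\b(?:\d{1,3}(?:\.\d{1,3}){3}|\d+)\b")

    def template(line: str) -> str:
        return VARIABLE.sub("<*>", line)

    def cluster_logs(lines):
        clusters = defaultdict(list)
        for line in lines:
            clusters[template(line)].append(line)
        return clusters

    logs = [
        "Interface Gi0/0/1 changed state to down",
        "Interface Gi0/0/2 changed state to down",
        "BGP neighbor 10.1.1.2 Down BFD adjacency down",
    ]
    for tmpl, members in cluster_logs(logs).items():
        print(f"{len(members):>3}x  {tmpl}")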

There are some other interesting outcomes worth mentioning here as well. The applied clustering, which groups together similar logs, negates the need to use regex to search through logs—this work has already been done. Further, the process serves as a form of log compression, enabling customers to shift away from expensive log storage platforms like Splunk and Elastic for their operational log data.

To recap, we’ve processed the metrics and logs and identified anomalies. We’ve recorded those as enriched events by leveraging metadata from the environment. From here, we apply recommender models to correlate across those anomalies. We leverage temporal correlation to identify all the anomalies that occur in a relevant period of time. We also leverage contextual correlation to look at all of the metadata and further group together the events. For example, events related to a given router or interface may be grouped together. We also leverage topology, if available. At this stage, we have grouped together the related events and produced a graph representing all of them. Finally, association models are applied, which form a directed graph.
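A minimal sketch of the temporal-plus-contextual grouping idea is shown below. The window size, tag names, and events are illustrative assumptions, and Selector’s actual recommender and association models are far richer than this pairwise check.

    from itertools import combinations

    def correlate(events, window_s=300):
        """Toy temporal + contextual correlation: connect two events if they
        occur within window_s seconds of each other AND share at least one
        metadata tag (device, interface, site, ...)."""
        edges = []
        for a, b in combinations(events, 2):
            close_in_time = abs(a["ts"] - b["ts"]) <= window_s
            shared_tags = set(a["tags"].items()) & set(b["tags"].items())
            if close_in_time and shared_tags:
                edges.append((a["name"], b["name"], sorted(shared_tags)))
        return edges

    events = [
        {"name": "interface_down", "ts": 100,  "tags": {"device": "r1", "interface": "Gi0/0/1"}},
        {"name": "bgp_flap",       "ts": 130,  "tags": {"device": "r1", "peer": "r2"}},
        {"name": "cpu_spike",      "ts": 9000, "tags": {"device": "r7"}},
    ]
    print(correlate(events))   # only the first two events end up connected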

That directed graph leads to the identification of the root cause of the incident. Customers can consume the resulting correlations in a variety of ways: we can use them to drive the creation of tickets in ServiceNow, Jira, or other ticketing systems. Some of our customers go a step further and drive closed-loop automation, but more often than not, customers use the correlations to drive the creation of smart alerts, which are delivered through Slack, Teams, and other collaboration platforms.

Here, you see an example of a smart alert. At the top, we have a clearly identified root cause. This is where the team needs to focus their efforts to remediate the issue. Below that are all the related events that resulted from the root cause. Alert deduplication, or consolidation, is a primary outcome of Selector’s correlation process. In an enterprise environment, a single smart alert frequently consolidates a large number of alerts; this can easily be 75 or 100 to 1.
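Conceptually, a smart alert is just the correlated group collapsed into a single notification. The sketch below shows one hypothetical way to build such a payload and post it to a Slack incoming webhook; the field names and webhook URL are placeholders, not Selector’s actual alert format.

    import requests   # third-party: pip install requests

    def build_smart_alert(root_cause, related_events):
        """Collapse one correlated group into a single notification:
        the inferred root cause on top, everything it explains below."""
        return {
            "title": f"Root cause: {root_cause['name']} on {root_cause['tags']['device']}",
            "related_events": [e["name"] for e in related_events],
            "consolidation_ratio": f"{len(related_events) + 1}:1",
        }

    def post_to_slack(alert, webhook_url):
        """Send the consolidated alert as a plain-text Slack webhook message."""
        lines = [alert["title"], f"Consolidated events ({alert['consolidation_ratio']}):"]
        lines += [f"- {name}" for name in alert["related_events"]]
        requests.post(webhook_url, json={"text": "\n".join(lines)}, timeout=10)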

To facilitate curation, Selector offers Selector Co-Pilot, our Conversational AI Co-Pilot, a game-changing feature that immediately democratizes the insights raised by the platform. With Co-Pilot, there are no fancy queries to write and no obtuse domain-specific languages to learn. You simply come in and ask questions of the system in any language and get understandable summaries in response, summaries that anyone on your team can understand.

To do this, we leverage an integrated LLM for alert summarization and retrieval-augmented generation, along with the original correlations and root cause we described earlier. With Co-Pilot, you can ask things like, “Why is my service not working?” “What is the severity of the current issue?” “Which customers are affected?” “Are there any known issues with this specific model and firmware of the device?” and much, much more. Everyone on the team can run these queries, and the responses are easy for everyone to understand — no more learning arcane query languages or different schemas to figure out which queries to write in order to answer your questions.
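The retrieval-augmented flow can be sketched generically as below. Here, event_store.search and llm are placeholders for whatever retrieval index and model client sit behind Co-Pilot; those internals aren’t described in the talk, so this is only an illustration of the general pattern.

    def answer(question, event_store, llm):
        """Minimal RAG sketch: retrieve the correlated events most relevant to
        the question, then ask an LLM to summarize them in plain language."""
        context = event_store.search(question, limit=5)   # e.g. keyword or vector search
        prompt = (
            "You are a network operations assistant.\n"
            f"Question: {question}\n"
            "Relevant correlated events:\n"
            + "\n".join(f"- {event}" for event in context)
            + "\nAnswer in plain language for a mixed audience."
        )
        return llm(prompt)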

I think that gives a pretty good overview of what we do here at Selector. Let’s take a look at the platform in action with a demo from John.

John: Thanks, Kevin, and hello everyone! I’m John Heintz. In this demo, we’re going to show a real example of how an organization typically engages with the Selector platform. It highlights the capabilities of the platform, including natural language interactions, automatic correlation leveraging metadata about the network, and remediation using collaboration tools.

Slack is a popular endpoint for Selector smart alerts, and the demo begins with a user receiving a smart alert in Slack, asking some clarifying questions about the event, and then pivoting to the portal to do more triage and eventually take remediation action via Slack. As you can see in the Slack alert, we already have good context about the root cause of the event: the device affected and the downstream effects.

At first glance, it appears that an interface flap is causing downstream issues with BGP. After getting the alert, the operator wants to gain more context about the current status. The first question would be to understand the device state, so the operator asks if the device is still reachable. As you can see from the response, the device is still reachable, but that doesn’t mean there might not be a service degradation. So next, the operator wants to understand the potential impact. Router 1 and Router 2 appear to be in a degraded state with two KPIs in red for BGP flaps and interface operational status. With this information, the operator wants to do more research using the Selector portal. They click on the link in the alert notification to open a dashboard that’s automatically tailored to the devices and events in this notification. 

The first widget we’re going to look at is the one for BGP sessions. This specific KPI identifies if a BGP session is flapping. A green status indicates a stable session, while a red status indicates it is flapping. This widget is automatically filtered for the devices in the smart alert. As you can see in the Sunburst Widget, the peer interface name and peer host tags indicate the failing session is related to a common interface between a pair of routers. By clicking on one of the interfaces on the outer ring, we will see more detail about the BGP flapping.

The timeline heat map gives a visual indication of the health, and the flap rate tells us the volume of flaps for the time period. Scrolling down, we can see a line plot of all the eBGP peers for that device and additional relevant information such as the admin status, the peer interface configuration events, and interface uptime. At this point, we have a clear sense that a flapping interface is in turn causing BGP to flap, and it appears to be isolated to just this pair of interfaces.

So now the operator goes back to the main alert page to learn more. We can see on another widget that the interfaces are related to a common circuit and carrier. This will be useful later when we open a ticket with them. Looking further down the page, we see various raw logs and log events. As you can see from the log events, there is a repeated interface down event on both routers along with BFD, BGP, and LLDP events. Because we have an integration with a AAA server, we also know there are no recent configuration changes that could be causing the flapping. This was not caused by human error.

We also have additional KPIs for the device that Selector has built. These are helpful for understanding the full context of the health of the device. We see red for BGP flapping and interface operational status, with the others being green. Note the latency line plots during the flapping events. The devices have always been reachable, but the latency for Router 1 did go past the auto-baseline threshold. For every device in the network, Selector automatically creates a dynamic threshold. This can be seen by looking at Router 2, which has a much lower dynamic threshold due to its closer proximity to the source of the pings. In this case, since the latency deviation on Router 1 is not consistent, it has not flagged the KPI red yet, but it is something to consider when making a remediation decision.
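To illustrate the idea of a per-device dynamic threshold (Selector’s actual baselining is proprietary), a simple exponentially weighted scheme would look something like this, where a router close to the ping source naturally learns a lower threshold than a distant one:

    def ewma_threshold(latencies, alpha=0.1, k=3.0):
        """Dynamic threshold sketch: exponentially weighted moving average of
        the latency plus k times an EWMA of the absolute deviation."""
        avg = latencies[0]
        dev = 0.0
        for x in latencies[1:]:
            dev = (1 - alpha) * dev + alpha * abs(x - avg)
            avg = (1 - alpha) * avg + alpha * x
        return avg + k * dev

    print(ewma_threshold([12, 13, 12, 14, 13, 12, 15, 13]))   # ms samples, illustrative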

The operator now has enough understanding of the network event to make a decision on what to do next. In this case, they are going to disable the interface via Slack since there are redundant interfaces and then open a ticket with the carrier. But before we show that, I will explain a bit more about the correlation. On the top left is a widget called Correlation Graph. It helps us understand how the various events are connected and is used in the smart alert.

First, I will expand the widget. On the top, we have a summary of the events and the labels. Scrolling down on the left side, we see more detail on the various labels or tags that are part of the correlation set. By increasing the correlation level, we can see all the events that share common tags along with the details of the tags selected. Machine learning continually examines log and metric events and their association levels using temporal and contextual correlation to automatically create this graph, which produces the smart alert that you saw on Slack. For example, when I hover over the BGP event, I can see the event name and all the associated labels such as local and remote.

This additional context is key for understanding how events are related in the network. In this case, the interface and BGP events directly relate to the circuit down event, which raises the criticality of this event for the operator. 

The final step is to go back to Slack and ask it to disable the interface. This same workflow can also be done directly from the portal if desired. So the operator goes back to Slack and instructs it to disable the interface on Router 1, Site 1, so that the flapping doesn’t continue and potentially degrade the site. As you can see, Selector responds with a visual indication of the interface being down.
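As a rough sketch of what a chat-driven "disable the interface" action could look like behind the scenes, the snippet below parses a hypothetical command and pushes a shutdown with Netmiko. The command grammar, inventory format, and choice of Netmiko are assumptions for illustration, not how Selector implements it.

    import re
    from netmiko import ConnectHandler   # third-party: pip install netmiko

    def handle_chat_command(text, inventory):
        """Parse 'disable interface <name> on <device>' and push a shutdown."""
        m = re.match(r"disable interface (\S+) on (\S+)", text, re.IGNORECASE)
        if not m:
            return "Sorry, I did not understand that command."
        interface, device = m.group(1), m.group(2)
        conn = ConnectHandler(**inventory[device])        # host, credentials, device_type
        conn.send_config_set([f"interface {interface}", "shutdown"])
        conn.disconnect()
        return f"{interface} on {device} has been administratively disabled."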

I hope you enjoyed watching this video. If you want to learn more about Selector AI and how it can enhance your network operations, please visit us at www.selector.ai
