Networking Field Day 35: Democratization of Data Access Using Network LLMs with Selector AI

Description

In this brief demo of the Selector platform, a user interacts with Selector Copilot to explore behavior within their network infrastructure. They first look into the latency of their transit routers, revealing a regional issue. The user drills down into network topology information to further investigate the latency, where they access details about devices, interfaces, sites, and circuits. Selector Copilot is then leveraged to surface circuit errors. Notably, each visualization provided by Selector Copilot can be copied and pasted onto a dedicated dashboard.

Video length: 21:10
Speaker: Nitin Kumar, Co-founder and CTO

Transcript

Nitin: So I’m going to spend the rest of the time talking about Selector’s ability to democratize access to data using our Network LLM. What do I mean by that? What does data democratization mean, and why is it important for networks? Let’s take a retail customer. A retail customer has a very complex network: retail stores across the globe, those stores are connected, and there are payment applications probably running at the cash register.

One of the big challenges is when the registers are not able to connect to the payment application sitting in the cloud, and the network is always to blame. The networking infrastructure team comes in and has to figure out what’s going wrong. Is it the Wi-Fi inside the store, or is it the LAN connection? There’s probably an SD-WAN connection going out to the internet. And then of course, you have the cloud provider, AWS or Azure.

When things go wrong, people who are experts in each of these domains have access to the data and understand what’s going on, but no one else can figure out where the problem is. The ability to access any kind of data and understand it across all these different domains is data democratization. You are no longer hostage to the experts or vendors of a particular domain. Anybody can do it, and Selector provides a conversational interface that makes this possible.

What is that conversational interface? I want to first show you the interface. We’ll do a brief preview of the demo, and John is going to walk you through some scenarios. Then we’ll go into the technology. There are fundamentally two ways in which users see insights: there is an alerting interface, where Selector generates alerts for you, and there is a browse interface, where you log into the system and just interact with it, asking questions. That’s the conversational interface.

Let’s take the conversational interface. If you’re having issues at your retail sites, you just go and ask a simple question: “Are we having any uplink issues in the US?” Once you do that, the entire topology pops up in front of you and you can continue your investigation from there. You can keep asking questions, and we’re going to show you how that works.

Then, of course, there is the alerting layer. When alerts are generated, it’s not just one event and one alert. A consolidated alert shows up, as seen here, where someone named Joe made a configuration change at a particular device, causing a widespread outage. The different markets affected are listed in the alert, as well as the different protocols and events. There are close to 50 events that this single alert encapsulates, so instead of seeing 50 alerts from this incident, you see only one.
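
To make the consolidation idea concrete, here is a minimal sketch of grouping raw events under a suspected root cause, so that roughly 50 events surface as one alert. The talk does not describe Selector’s actual correlation logic; every name and field below is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Event:
    device: str
    market: str
    protocol: str
    detail: str

@dataclass
class ConsolidatedAlert:
    root_cause: str                 # e.g. "config change by Joe on edge-7"
    events: list = field(default_factory=list)

    def summary(self) -> str:
        markets = sorted({e.market for e in self.events})
        protocols = sorted({e.protocol for e in self.events})
        return (f"{self.root_cause}: {len(self.events)} events, "
                f"markets={markets}, protocols={protocols}")

def consolidate(events, root_cause_of):
    """Group raw events by suspected root cause so one incident
    produces one alert instead of dozens."""
    groups = defaultdict(list)
    for e in events:
        groups[root_cause_of(e)].append(e)
    return [ConsolidatedAlert(rc, evs) for rc, evs in groups.items()]
```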

What possible products got affected? For instance, if this is a colocation provider, one of their products is cloud on-ramp. Due to this change, their cloud on-ramp offering might be at risk or something might be going on with it. All of that information is brought to you in the alert right away. Once this alert shows up, it has different workflows. You can create an incident around it, such as a PagerDuty incident or a ServiceNow incident, to report what happened. Teams downstream can then take care of the situation. All of that can happen from here.

I’ll focus initially on the browse workflow, which shows how to converse with the platform. Then we’ll talk about the alerting later on. Let’s do a very simple demo. We have three sets of demos. The first set is where you have a backbone network that’s providing connectivity to different parts of the country. There are data centers in the country, and you are providing connectivity through your backbone network. There is an alert or an ongoing issue where some sites are not able to connect to other sites.

When you come in, you enter the portal, which is all blank at first, and you start a conversation with the system. John’s going to enter the first question: “Show me, are there any latency issues going on anywhere?” As soon as you type it in, in plain English, the system processes the request.

The system runs the query and shows you a general latency map. The details are not important, but you see a bunch of red, which means certain areas are affected.

You then ask, “Let me see what my network topology is like.” Once you enter that question, you get to see the entire topology right there. As John is doing these queries, he is also saving the outputs for a troubleshooting session later on. The outputs that the query is producing are showing up as widgets, and John is saving them on the left-hand side, building a canvas from scratch. You don’t have to worry about the insights or dashboards that are already present; you’re building the dashboard as you go along.

Now, because there is an issue in a certain area in Denver, you can say, “Show me the topology around Denver.” You’re now trying to get to the root cause and analyze what’s going on. As soon as you do that, you get a simpler view of the topology, which just shows the connections around Denver. Now, you know Denver is connected to the rest of the country along certain lines. Then you might say, “I know there are certain latency issues, I need to see what the errors are.” You ask, “Show me the circuit errors around Denver.” 

That immediately shows you that within this whole system, there is one circuit at risk, showing a lot of errors. You get to the root cause right away. Of course, if you had an alert, all this information would have already been present. This use case involves drilling down manually and communicating with the machine in English to determine that there were circuit errors inside Denver. 

We will go into more details about this workflow later on, but this is just a preview. 

In the second example, we have a customer with a multi-cloud environment. They have built a backbone across public clouds and provide connectivity to tenants running applications in the cloud. If there was an outage in AWS US East, you might want to check which tenants were affected. You issue a query to see which tenants experienced the outage and identify several affected tenants. Then you can query further to find out how many of those customers are exceeding their allocated quota. The system presents this information in a human-readable format, allowing you to analyze historical data and predict future behavior. For instance, you can check projected bandwidth needs to ensure tenants have a good user experience and potentially provision more bandwidth.
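
The talk does not say how the projection is computed; as a trivial stand-in for whatever forecasting the platform actually uses, a linear extrapolation over historical usage (hypothetical numbers) looks like this:

```python
import numpy as np

def projected_bandwidth(history_gbps: list[float], days_ahead: int = 30) -> float:
    """Fit a linear trend to daily bandwidth samples and extrapolate."""
    days = np.arange(len(history_gbps))
    slope, intercept = np.polyfit(days, history_gbps, 1)
    return slope * (len(history_gbps) - 1 + days_ahead) + intercept

# A tenant trending upward toward its allocated quota:
usage = [4.1, 4.3, 4.2, 4.6, 4.8, 5.0, 5.1]
print(f"{projected_bandwidth(usage):.1f} Gbps expected in 30 days")
```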

In the final example, a retail customer has stores connecting to the cloud. If you query whether stores in Atlanta are experiencing connectivity issues, the system will show that some stores in Atlanta are indeed having issues.

You might think there are certain internet providers in that area that could be causing issues, so you want to know who your providers in Atlanta are. You ask, “Show me my providers in Atlanta,” and you get a list of those providers. For example, in that particular area, your providers might be AT&T, Verizon, or Comcast.

Next, you might want to know if these providers are having issues in that area. You start checking, and you see all the issues. For example, you might find that there are several providers with issues, like Comcast and CenturyLink. You eventually determine that Verizon is likely the one causing the issue. You then focus on tracking Verizon to diagnose the problem.

This process demonstrates how English queries get translated into underlying system queries. The user experience is the conversational interface we have all come to expect from chatbots. Handling this in the context of networking data and infrastructure, however, is complex.
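
As a toy illustration of that translation step (plain English in, a structured query against the data store out), consider the sketch below. A production system would use an LLM rather than regular expressions, and the query fields here are invented:

```python
import re

# Toy patterns only; a real system hands the question to an LLM.
PATTERNS = [
    (re.compile(r"circuit errors around (\w+)", re.I),
     lambda m: {"metric": "circuit_errors", "site": m.group(1)}),
    (re.compile(r"providers in (\w+)", re.I),
     lambda m: {"table": "providers", "city": m.group(1)}),
    (re.compile(r"latency issues", re.I),
     lambda m: {"metric": "latency", "filter": "anomalous"}),
]

def translate(question: str) -> dict:
    """Map a plain-English question to a structured query."""
    for pattern, build in PATTERNS:
        m = pattern.search(question)
        if m:
            return build(m)
    raise ValueError(f"unrecognized question: {question!r}")

print(translate("Show me the circuit errors around Denver"))
# {'metric': 'circuit_errors', 'site': 'Denver'}
```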

Audience: This is Chris Grundman. You might be covering this next, but I’m curious about the model behind this. Did you build and train your own LLM, or is it based on something else with some customizations? 

Audience: Thank you, Chris. If I could jump in, this is Remington. I’m really interested because I see a lot happening in terms of CMDB and service mapping, and I’m putting these pieces together in my mind. For a customer that doesn’t have a well-built or well-maintained system, is that part of what your managed service is essentially building for the customer? Can you clarify how that part works? Or what if I don’t even have a CMDB?

Nitin: Even if it’s not part of a formal CMDB, that information is always available. They might have spreadsheets, or a Google Drive sitting somewhere, where the metadata is available. Our system, as part of the managed service, takes those sources and polls those endpoints to build a table, a CMDB, in our system. So if the customer doesn’t have a formal CMDB, we build it for our operations. The CMDB is the single most important thing that gives you business context for these queries, because raw data most of the time does not tell you whether it’s Verizon or AT&T. It’ll just have some cryptic IP or some ID. That ID-to-Verizon mapping is sitting somewhere else, and we build it to provide the kind of user experience needed.
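
That ID-to-name join is straightforward to picture. A minimal sketch, with entirely made-up data:

```python
# Raw telemetry carries only a cryptic circuit ID; the CMDB row
# supplies the business context. All values here are invented.
CMDB = {
    "ckt-8843-atl": {"provider": "Verizon", "site": "Atlanta"},
    "ckt-1207-atl": {"provider": "Comcast", "site": "Atlanta"},
}

def enrich(event: dict) -> dict:
    """Join a raw event with CMDB metadata so queries can say
    'Verizon' instead of 'ckt-8843-atl'."""
    return {**event, **CMDB.get(event["circuit_id"], {})}

print(enrich({"circuit_id": "ckt-8843-atl", "errors": 412}))
# {'circuit_id': 'ckt-8843-atl', 'errors': 412,
#  'provider': 'Verizon', 'site': 'Atlanta'}
```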

Audience: And is there a hook if I do have a very well-curated CMDB that allows me to keep it in sync with your efforts? So as I’m onboarding hardware and adding inventory, I put it in my system, and it’s going to then push up?

Nitin: We have sophisticated customer integrations. A lot of customers use NetBox as the CMDB on their side, where they onboard devices, circuits, and service providers, and we have a first-class integration with NetBox: they give us credentials to access it, and we periodically scrape it and transfer that information to our side. For customers who want NetBox as a CMDB but, for whatever reason, have not operationalized it, we provide NetBox as a service as well.
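
The periodic scrape Nitin describes could be done with pynetbox, the standard Python client for the NetBox API. This is only a sketch under assumptions: the URL and token are placeholders, and which objects get pulled is a guess, not a detail from the talk.

```python
import pynetbox

# Placeholder credentials; in practice the customer grants access.
nb = pynetbox.api("https://netbox.example.com", token="YOUR_TOKEN")

def snapshot() -> dict:
    """Pull devices and circuits into plain dicts, ready to load
    into a local CMDB table."""
    devices = [{"name": d.name, "site": str(d.site)}
               for d in nb.dcim.devices.all()]
    circuits = [{"cid": c.cid, "provider": str(c.provider)}
                for c in nb.circuits.circuits.all()]
    return {"devices": devices, "circuits": circuits}

# Run on a schedule (cron, a task queue) so the local copy stays
# in sync as the customer onboards new inventory.
```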

Audience: Thank you. 

Audience: Could you go into, not too deeply, the discovery process? I’m really curious about customers whose CMDB is in sad shape, or who don’t have one for whatever reason. Do you have something that gets into their network and crawls it and builds it that way?

Nitin: Let me hold on to that question; I’m probably going to get to it as I describe the stack, and I will make sure I answer it in that context.

Audience: I’ll hold then, thank you.

So, the stack:

In recent years, there has been a revolutionary change in conversational LLMs. There is a natural tendency to assume that with tools like ChatGPT or Google, you can simply connect your infrastructure data to these services and get a conversational interface. While this is a fair question, the issue lies in the significant gap between where infrastructure data is and where these services operate.

Firstly, these services are hosted on public cloud platforms, while your data may reside on-premises or in specific infrastructure. Transferring data from your local environment to these cloud services is a major challenge. Even if you manage to transfer the data, the models these services are trained on primarily understand general English language. They can answer questions about the weather or geographical distances, but they don’t comprehend technical details like optics levels on a Cisco or Arista line card, or SNMP data and syslogs from vendors.

The models are not trained to understand these specific technical contexts, which creates a gap. To bridge this gap, you need to build a specialized layer of software. Instead of using public cloud-based LLMs, you need to develop a custom LLM that operates closer to your infrastructure. Once you have this LLM, you need to connect your infrastructure to it.

This connection involves a complex technical stack, which is where our company comes in. We develop the software needed to collect and process data from various devices and systems. Data collection itself is a significant challenge; it involves identifying which devices to query and implementing mechanisms for crawling and gathering the relevant data.

There are mechanisms where you can say, “Discover all my devices in this subnet or this particular region,” and the system will figure out what those devices are, make the initial connection, and start polling them over SNMP or gNMI. This collection has to be diverse and span the globe. You need to deploy a fleet of collectors in different regions, like the US or Asia, because these collectors need to be close to the infrastructure. Once they collect the data, it has to be presented in a uniform layout.
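
A bare-bones version of that sweep, with the actual SNMP client call stubbed out (pysnmp or a similar library would supply it), might look like:

```python
import ipaddress

SYS_DESCR = "1.3.6.1.2.1.1.1.0"  # standard MIB-2 sysDescr.0 OID

def discover(subnet: str, snmp_get) -> list[str]:
    """Sweep a subnet and return hosts that answer an SNMP GET.
    `snmp_get(host, oid)` is a stand-in for a real client call
    and returns None on timeout."""
    found = []
    for host in ipaddress.ip_network(subnet).hosts():
        if snmp_get(str(host), SYS_DESCR) is not None:
            found.append(str(host))
    return found

# Discovered hosts would then be enrolled for ongoing SNMP or
# gNMI polling by the collector closest to them.
```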

There is no universal data format; SNMP data looks very different from, say, syslog data. So you need to build a layer that can understand any schema and convert it to a normalized schema. That’s the technology we build. We have a data compiler, which we call a data hypervisor. Just like compute hypervisors abstract the details of compute and storage from applications, our data hypervisor abstracts the variety of formats of the underlying data and presents it in a uniform way to the storage layers.
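
In miniature, the hypervisor idea is a set of per-source adapters that all emit one normalized record shape. The field names below are illustrative, not Selector’s actual schema:

```python
NORMALIZED_FIELDS = {"device", "metric", "value", "timestamp"}

def from_snmp(sample: dict) -> dict:
    return {"device": sample["agent_addr"], "metric": sample["oid"],
            "value": sample["val"], "timestamp": sample["ts"]}

def from_gnmi(update: dict) -> dict:
    return {"device": update["target"], "metric": update["path"],
            "value": update["value"], "timestamp": update["time"]}

ADAPTERS = {"snmp": from_snmp, "gnmi": from_gnmi}

def normalize(source: str, record: dict) -> dict:
    """Convert a source-specific record to the normalized schema."""
    row = ADAPTERS[source](record)
    assert set(row) == NORMALIZED_FIELDS
    return row
```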

Then, you need to store this data. The storage for logs is very different from the storage for metrics and events. You need to roll out a storage layer, and once you have storage, you have to be able to query it in a uniform way. The query interface then needs to connect to the LLM so that it can interpret the data effectively.
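
The uniform query layer on top of those stores can be pictured as a single facade; both store classes here are hypothetical stand-ins:

```python
class MetricStore:
    def query(self, q: dict) -> list:
        return []  # would hit a time-series database

class LogStore:
    def query(self, q: dict) -> list:
        return []  # would hit a log/search index

STORES = {"metric": MetricStore(), "log": LogStore()}

def unified_query(kind: str, q: dict) -> list:
    """One entry point for every data type; translated English
    queries from the conversational layer land here."""
    return STORES[kind].query(q)
```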

Audience: Michael Winston asked whether collection also means getting data from existing systems. You mentioned NetBox, which implies support for other systems like Nautobot, ServiceNow, or Forward Networks.

Nitin: The data hypervisor performs ETL (Extract, Transform, Load) before feeding the data into the data store; that is where normalization happens. Instead of programming the logic, you configure it. Programming means writing code; configuring means the schema is discovered and mapped accordingly.

Sometimes the schema can be discovered by the system, which identifies what the schema should be. However, users can override these discoveries and define their own schema mappings. This configuration in the data hypervisor allows for the necessary transformations during the ETL process.
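
The “configure, not program” distinction might look like this in miniature: a declarative field mapping, discovered by the system but overridable by the user, applied by one generic transform. The SNMP object names are real; everything else is invented:

```python
DISCOVERED = {"ifHCInOctets": "bytes_in", "sysName": "device"}
USER_OVERRIDES = {"ifHCInOctets": "rx_bytes"}   # operator preference

MAPPING = {**DISCOVERED, **USER_OVERRIDES}      # overrides win

def apply_mapping(record: dict) -> dict:
    """Rename source fields to the normalized schema; unmapped
    fields pass through unchanged."""
    return {MAPPING.get(k, k): v for k, v in record.items()}

print(apply_mapping({"sysName": "den-core-1", "ifHCInOctets": 9912}))
# {'device': 'den-core-1', 'rx_bytes': 9912}
```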

The collection and data hypervisor components are crucial for feeding any system, as they handle data ingestion and ETL without requiring additional platforms. This capability is key to successful deployments. Without a data compiler, integrating diverse data formats and ensuring smooth data ingestion can become a significant challenge, limiting the effectiveness of sophisticated queries and models. Embracing data diversity and not avoiding it is essential for managing complex infrastructure environments.

Explore the Selector platform