Networking Field Day 35: Selector AI Demo Part 2

Description

In this demo, a user leverages Selector’s Conversational AI, Selector Copilot, to investigate performance within their network infrastructure. The user first probes into the health of tenants located in a specific geographic region. Selector Copilot provides a visualization of the current state and summarization of the overall condition and afflicted tenants, along with probable root cause. The user then interacts with Selector Copilot to explore resource allocation, historical usage, and projected bandwidth. Each visualization provided by Selector Copilot can be copied and pasted onto a dedicated dashboard.

Video length: 28:57
Speakers
Nitin Kumar, Co-founder and CTO
Debashis Mohanty, VP Solutions Engineering and Customer Delivery
John Heintz, Principal Solution Architect

Nitin: So now let’s go back to understanding the LLM infrastructure. With the LLM deployed in the Selector platform, John will show you the queries we discussed earlier, along with more detail on the SQL that gets executed, in a more interactive experience.

The first customer is a transit provider with a backbone network connecting data centers. There’s a complaint about issues from Data Center A to Data Center B, likely around Denver. How does one interactively find the root cause of this issue? I want to emphasize that while there are automated root cause analyses available, this is how you would interactively go about finding the answer.

John: Please feel free to chime in as I go through this. I’ll explain a little bit about the platform and what you’re seeing to bring it all together. We’ll go through the three demos shown in the video live and slowly so you can see what’s happening behind the scenes.

In this case, we’re using the Copilot feature built into the stack. This could be via Slack, WebEx, Teams, or any collaboration client, and you’ll get the same results. The first thing I did was look at the latency seen by all transit routers. You can see a visual representation as well as a textual one.

Now, I’m going to build an ad hoc dashboard. One thing we’ve heard from customers is that they have too many dashboards already. I like this approach because it allows you to build dashboards on demand, as needed, rather than having a thousand dashboards for every scenario, which doesn’t scale very well. 

Here, I’ve asked a question and gotten my backbone devices along with a couple of core sites. I see a few in red, so I’ll look into more details about those specific sites. Selector has synthetics, which we call Ping Mesh. These are deployed across the backbone, so they’re constantly pinging each other, running trace routes, understanding paths, and finding anomalies using auto-baselining. To show a bit more about the query: this is what I typed, and this is what the LLM inferred. Latency is the metric we’re querying, “threshold violation” is the matrix visualization displayed here, and we specified that the host equals “transit routers.” That is essentially what happened behind the scenes as I typed this.
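
As an illustration of what that inference step might produce, here is a hedged sketch; the prompt is the one above, but the table and column names are invented and are not Selector’s actual schema.

```python
# Hypothetical sketch only: a natural-language prompt translated into a
# structured query. Table and column names are invented for illustration.
prompt = "show me latency seen by all transit routers"

inferred_sql = """
SELECT src_device, dst_device, latency_ms,
       latency_ms > baseline_ms AS threshold_violation
FROM   ping_mesh_latency
WHERE  host_group = 'transit-routers'
ORDER  BY src_device, dst_device;
"""
```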

Now, let’s say the user doesn’t know the network topology or wants to understand it better. They can ask about it, and I’ll bring this over to the main dashboard to make it easier to see, so everyone doesn’t have to strain their eyes on a small laptop screen.

As I zoom in here, I get a detailed view of the topology, including every circuit connecting these elements. When I hit this button, you can see that I typed “Network topology,” and the system knew to display the Layer 2 topology as a topo map. Layer 2 topology is the metric, “topo map” is the visualization, and this filter can be set to show by device, device interface, and so on.
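
Conceptually, the request decomposes into a metric, a visualization, and a filter, each adjustable on its own; the field names in this sketch are invented.

```python
# Invented field names: one short phrase ("Network topology") decomposes
# into a metric, a visualization, and a filter.
inferred_request = {
    "metric": "layer2_topology",
    "visualization": "topo_map",
    "filter": {"show_by": "device"},   # or "device_interface", etc.
}
```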

It’s handy because if you like this view, you can share it to another collaboration client or hit the API button to get the API query instantly. 

Nitin: This feature allows you to generate and integrate this data into other applications as needed. This democratizes access to complex SQL queries. While these queries can be complex and typically require expert knowledge, someone less familiar can simply request “Network topology” and obtain the information.
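
As a hedged illustration of that workflow, the snippet below shows how another application might replay the query exposed by the API button; the endpoint, payload, and token are placeholders, not Selector’s published API.

```python
import requests

# Placeholder endpoint, token, and payload: replaying the query behind a
# widget from another application.
resp = requests.post(
    "https://selector.example.com/api/v1/query",
    headers={"Authorization": "Bearer <token>"},
    json={"query": "network topology", "format": "json"},
    timeout=30,
)
resp.raise_for_status()
topology = resp.json()   # the same data the widget displayed
```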

John: Next, since I saw these elements were related to Denver, I’ll ask for the topology around Denver. I’ll bring this over to the larger dashboard again. You’ll see that now it specifies “site equals Denver” or “Z site equals Denver,” and the rest remains the same. This map has been updated accordingly. If I zoom in, I can see different elements and click on them to get more information. I have the nodes, which are the devices in Denver; the edges, which are the interfaces; and the topology of Denver itself. We refer to this as a query chain. It allows you to take a high-level piece of data, click on it, and pull in additional context. These can be customized based on what a user wants to pull in, or we offer generic ones applied across multiple customers. The goal is to provide quick context for what’s going on.
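
A query chain might be configured along these lines; the structure and names below are invented to show the idea of follow-up queries keyed off a clicked element.

```python
# Invented structure: clicking a topology element triggers follow-up
# queries that pull extra context for that element.
query_chain = {
    "source": "layer2_topology",
    "on_click": [
        {"panel": "nodes",    "query": "devices WHERE site = '{site}'"},
        {"panel": "edges",    "query": "interfaces WHERE site = '{site}'"},
        {"panel": "topology", "query": "topology WHERE site = '{site}'"},
    ],
}
```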

Audience: Is this only data from the network itself, or is there a way to get real-time data from third-party sources? For example, if your network vendor just issued a bug report or there’s a security vulnerability report, there may be something outside your network causing the problem. Is it possible to have that data available?

Nitin: Yes, because the ingestion layer can connect to infrastructure devices and other sources. If a security vulnerability is available on a website or in a URL, you can bring that in. The correlation aspect, which I did not cover today, would correlate that information with other networking data. This means you can see it in a hover like this: “Oh, this device just had a vulnerability.” All of that information would be available in the storage layer. The challenge then becomes querying and displaying it effectively.

John: This has come up quite a bit from customers—hooking up knowledge bases for certain vendors. If you know your device, model, and OS, the system can tell you about vulnerabilities, especially if they match an event occurring on the system. This saves time for the user. 
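
One way such a knowledge-base match could work, sketched with invented data: flag devices whose platform and OS version appear in a vendor vulnerability feed.

```python
# All data below is invented for illustration.
vuln_feed = [
    {"platform": "ExampleOS", "versions": {"9.1", "9.2"}, "advisory": "EXAMPLE-ADV-001"},
]
inventory = [
    {"device": "den-core-1", "platform": "ExampleOS", "version": "9.2"},
]

for dev in inventory:
    for vuln in vuln_feed:
        if dev["platform"] == vuln["platform"] and dev["version"] in vuln["versions"]:
            print(f"{dev['device']}: matches {vuln['advisory']}")
```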

Now, I’ve asked for circuit errors. It will pull errors and display them with colors, indicating whether the link is good or bad. I can highlight just the bad links to identify the actual problems in my network. If I click into this, it will bring up the circuit, showing it is between two devices—Cincinnati and Denver.

The KPIs for that circuit are easy to see. Operational status is down on both ends, there’s a status-change alarm, and there are errors on the Denver side. Below, I have more context from the logs for these devices, including device events, circuit names, descriptions, and more. These events can generate tickets straight into a ticketing system. We’re showing you here how you could use the Copilot feature to do this somewhat manually.

Next, I’ll set up the next demo. If there are any questions, please ask.

Nitin: The next demo is about a multicloud operator who is managing a backbone across different clouds using transit gateways and AWS constructs. They have built a network and are providing connectivity to tenants. In such an environment, they depend heavily on the underlying Cloud infrastructure. They need to analyze how the cloud infrastructure is behaving and whether it has been properly provisioned. For instance, they recently heard about an outage in US East and want to see who was affected. 

John: When I go to the ‘I’ button, we now show the tenant status as the metric. In this case, tenant status is defined in our ETL to represent a customer’s status in their environment. Honeycomb is one of the representations used here, showing multiple types of data with color coding to indicate severity: red for critical, shades of orange and yellow for less severe, and green for good.
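
The color coding could be expressed as a simple severity mapping; the thresholds below are illustrative, not Selector’s actual cutoffs.

```python
# Illustrative thresholds for the honeycomb color coding described above.
def cell_color(severity: float) -> str:
    if severity >= 0.9:
        return "red"       # critical
    if severity >= 0.6:
        return "orange"
    if severity >= 0.3:
        return "yellow"
    return "green"         # healthy
```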

We have a few tenants experiencing problems. Next, I’ll show some other examples. In this scenario, we see tenants bursting beyond their provisioned capacity. I’ll request data for tenants that are exceeding their allocation, and what I’ll get is a line plot illustrating these bursts.

We have different types of representations here. We still have tenants in this query, but now the metric is discard percentage, shown as a line plot; it’s looking for the percentage of traffic being discarded. I’m focusing on the region US East because that’s the one we were looking at in our previous example. To understand context, what I also like is that you can look at historical data. For example, if I’m an operator trying to do some analysis, I could review historical usage in the East.

I’ll bring this over to the main dashboard to make it easier to see. This is a similar query, but if I click on the “I” button here, it will go back the last 30 days. The point is, it’s going back in time, looking at this data over a historical period. At the same time, I can look into the future.

So, we’ll finish by showing projected usage. We’ll look at the future bandwidth for tenants in US East and drag this over as well. It’s great to do this live so you can see the actual query text as it happens.

Audience: If I have a bunch of optics, can you project when my optics are going to fail? 

Nitin: Yes, if the failure reason is slow temperature creep or voltage fluctuations, we can actually alert that as well. You would see the temperature slowly grow, and we can compute the failure prediction based on that.

Because you can see the temperature slowly grow, you can compute its slope, if that is the reason for failure. If the reason for failure is dirt and the like, probably not; but usually, if an optic gets really hot, it will fail. So theoretically, you could project that. We haven’t productized that feature just yet, but we would tell you that you’re going to hit this particular temperature in the next 30 days. And if that projection ever changes, alert me right away: you might be okay with, say, memory growing for up to 30 days, but as soon as that projection changes, let me know. That’s how these future time plots are useful.
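
A minimal sketch of that projection idea, assuming evenly spaced temperature samples and a purely linear trend; the threshold and data are made up.

```python
import numpy as np

# Made-up samples: optic temperature read every 6 hours over two weeks.
hours = np.arange(0, 14 * 24, 6, dtype=float)
temps = 55.0 + 0.01 * hours               # slow linear creep, for illustration
FAIL_TEMP_C = 70.0                        # illustrative failure threshold

slope, intercept = np.polyfit(hours, temps, 1)   # fit the linear trend
if slope > 0:
    hours_left = (FAIL_TEMP_C - (slope * hours[-1] + intercept)) / slope
    print(f"Projected to reach {FAIL_TEMP_C} C in ~{hours_left / 24:.1f} days")
    if hours_left < 30 * 24:
        print("ALERT: projection has moved inside the 30-day window")
```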

Yes, one reason you can look at them is to see how things are going, but you cannot be staring at dashboards all the time. So you can convert that into an alertable event that you can react to.

Audience: That’s an interesting piece. We’ve talked a lot about the language model piece of being able to make these human language queries into the data. As Jeremy said earlier, it’s the microscope versus the doctor. Does that use AI also, or is that more just straight limits and trends? 

Nitin: No, I’m going to talk about that after this. This is just the browse interface. I’ll talk about the alerting interface, and all of that will be covered as part of it.

Audience: Excellent. 

Nitin: And this is the last one: the retail customer. They have a lot of stores across the country, and these stores talk to applications running in the public cloud. How do they go about managing this? If they’re seeing some issues in Atlanta, how do they find the right circuit provider or internet provider to file an incident with?

John: The first thing they can do is ask for the current packet loss. This is a real-time view of it. In this case, I’m going to look at latency to AWS. I’m focusing on a certain cloud provider from Atlanta because they’ve been complaining about issues with their connectivity. I’m asking for the last two days.

Sometimes you have to get the structure of the request right to get the actual data back that you want. In this case, I’ve asked for data from the last two days, and you’ll see that it’s incorporated into the query. This gives us the latency event we’re seeing.

If there’s an issue, you might need to check the providers. For instance, if there are problems in Atlanta, you can ask to show the providers there. This will give you the three providers and a text summary.

If you want to see the errors from those providers, the system will infer the type of errors you want to see. Once the data is loaded, you can drag it over, scroll down, and filter to focus on specific providers. For example, if Verizon is known to be problematic, you can focus on Verizon’s errors.

To do this, you ask to see errors for Verizon for the last two days. After bringing this data back to the dashboard, you can expand it. If you want to look specifically at Atlanta, you can type in “ATL” to narrow it down to just the events from Atlanta. You can then analyze the frequency and details of these events, often grouping and attaching them for further investigation.

This approach is common for our customers, allowing them to collect and analyze events organically and then attach them to an incident. In this case, the incident now has all the tags, metadata, and events related to the device and interfaces. When I hit “Create New Incident,” it pushes this into whatever ITSM integration I have provisioned, such as ServiceNow or another system. I can set the priority and define certain characteristics as well.
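
For the ITSM hand-off, here is a hedged sketch using ServiceNow’s standard Table API; the instance URL, credentials, and field values are placeholders.

```python
import requests

# Placeholder instance, credentials, and field values.
payload = {
    "short_description": "Provider errors on circuits in ATL",
    "description": "Grouped events, tags, and interface metadata attached.",
    "urgency": "2",
}
resp = requests.post(
    "https://<instance>.service-now.com/api/now/table/incident",
    auth=("user", "password"),
    headers={"Accept": "application/json"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["result"]["number"])   # the new incident number
```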

We also have customers who auto-create incidents with all this context, as well as workflows where incidents are created manually based on operator research in the ticket and the event data.

Audience: How are you getting latency data? Are there agents deployed in the network or how is that? 

Nitin: Regarding latency data, there are a couple of ways we obtain it. We build synthetic agents that can be deployed across the infrastructure. These agents use UDP packets to measure packet loss, latency, and jitter, and to run trace routes. For larger service providers who already have their own agents, we connect to their databases to retrieve the data. This method is quicker and more efficient than deploying new agents.
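
In the spirit of those synthetic agents, here is a toy UDP round-trip probe; it assumes an echo responder at the target, and real agents also measure loss, jitter, and paths.

```python
import socket
import time

# Toy probe: send one UDP packet and time the echo. Assumes the target
# runs a UDP echo responder; a timeout counts toward packet loss.
def udp_rtt_ms(host: str, port: int = 7, timeout: float = 1.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        start = time.perf_counter()
        sock.sendto(b"probe", (host, port))
        sock.recvfrom(64)
        return (time.perf_counter() - start) * 1000.0   # RTT in ms
    except socket.timeout:
        return None
    finally:
        sock.close()
```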

John: We prefer using existing data sources whenever possible; it happens much more quickly when the data is already onboarded.

Audience: Most of your customers already have a data lake of some type for data aggregation. 

John: Any enterprise customer typically has that, but it’s in their own format. So, it’s a matter of building a data hypervisor to extract that data, normalize it, and then enhance it with metadata, which is the big key.

Audience: Can you guys talk a little bit more about that statement? What does that mean? How does that work? How does somebody get you the metadata? Who handles that?

Nitin: Okay, now is a good time to talk about metadata. As you can imagine, you have metrics and events coming into our system, and you need to annotate that with other metadata like customer IDs, uplinks, sites, and a lot of other information. There are a few ways in which that data from a customer environment can show up in our stack.

First, there are metadata stores like NetBox, Nautobot, etc. Customers have already deployed these and are feeding all their data into them. We provide a NetBox integration: you configure it, we understand the NetBox schema, and we pull everything in. Then, you can curate it. The NetBox connection is just another data source for us. Just as we connect to devices and use SNMP to pull data, here we connect to NetBox, understand the schema, and bring that data in. That’s one way.
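
A minimal sketch of such a pull against NetBox’s REST API follows; the URL and token are placeholders, and Selector’s actual integration understands the full schema rather than a single endpoint.

```python
import requests

NETBOX = "https://netbox.example.com"             # placeholder URL
headers = {"Authorization": "Token <api-token>"}  # placeholder token

# Pull device records from NetBox's standard REST API.
devices = requests.get(
    f"{NETBOX}/api/dcim/devices/", headers=headers, timeout=30
).json()["results"]

for dev in devices:
    print(dev["name"], dev["site"]["name"])
```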

The second way is that customers sometimes have spreadsheets or documents lying around on an HTTP server and provide us access to them. We connect to these sources and bring the data in. Additionally, some customers use other CMDBs like NetMRI, and we connect to those as well. CloudGenix has its own repositories, which we integrate with too.

Once the data is in, there is one more source: sophisticated customers might have their own software development teams that want to publish metadata into our system. We offer an API interface for this purpose. Customers can bring in any kind of metadata they have manufactured on their side, create a meta store on our platform, and push metadata using APIs. For example, they might want to set expectations for the number of prefixes from a tenant or the number of errors. This metadata is then used to enhance and augment the incoming metrics, making that context available in the system.
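
A hypothetical illustration of that push follows; the endpoint and record schema are invented, not Selector’s actual API.

```python
import requests

# Invented endpoint and schema: publish customer-manufactured metadata,
# such as expected prefix counts per tenant, into a meta store.
record = {
    "store": "tenant_expectations",
    "rows": [
        {"tenant": "acme", "expected_prefixes": 120, "max_errors": 0},
    ],
}
resp = requests.post(
    "https://selector.example.com/api/v1/metastore",   # placeholder URL
    headers={"Authorization": "Bearer <token>"},
    json=record,
    timeout=30,
)
resp.raise_for_status()
```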

Is there anything else you wanted to see? 

Audience: I think that makes sense. 

Audience (Michael Winston): Yeah, can we also send streaming data via gNMI or some other streaming protocol?

Nitin: Yes, more real-time information can get in there; that is a property of the ingest layer. It could be streaming telemetry like gNMI, or even a Kafka bus. A lot of our customers have streaming data available in Kafka, so we just connect to the Kafka bus and the streaming data starts showing up. It can be just records that keep coming in to us, which we then ingest into the stores.
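
Consuming such a bus might look like the sketch below, using the kafka-python client; the broker, topic, and message shape are illustrative.

```python
import json

from kafka import KafkaConsumer   # pip install kafka-python

# Illustrative broker, topic, and message shape.
consumer = KafkaConsumer(
    "network-telemetry",
    bootstrap_servers=["kafka.example.com:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for msg in consumer:
    record = msg.value   # e.g. {"device": ..., "metric": ..., "value": ...}
    # ...normalize, attach metadata, and write into the store...
```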

Audience: In that case, you would have to, in your data hypervisor, know the structure of those message packets, map them into fields, and then any kind of extra context or metadata that’s in your meta store gets automatically attached. 

Nitin: Yes. And now that we’re talking about meta stores, here’s another use case I did not anticipate when we built the system: once the data is inside our store and you have a SQL interface to it, a lot of data rationalization can be done. When you are working with NetBox or your own spreadsheets, you don’t necessarily see whether anything is wrong in your data. Are my interface descriptions all uniform? Are any of them missing? I think I told my teams to do this, but how much have they actually done? If the teams are fixing interface descriptions, are they 80% done? 90% done? What am I looking at?

A lot of our customers look at the meta stores they’ve built in our platform and run SQL queries on an ongoing basis to make sure things are valid. For example, they’ve built dashboards called inconsistency dashboards. Such a dashboard should always be empty, and you can set an alert if a row shows up in that table, indicating some inconsistency somewhere that needs to be addressed.
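
An inconsistency check of that kind could be as simple as the query sketched below; the table, columns, and naming convention are invented.

```python
# Invented table, columns, and naming convention: rows appear only when
# an interface description is missing or breaks the expected pattern,
# so the resulting dashboard should normally be empty.
inconsistency_query = """
SELECT device, interface, description
FROM   interface_metadata
WHERE  description IS NULL
   OR  description NOT LIKE 'to-%'   -- expected 'to-<peer>-<circuit-id>'
"""
```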

Once you have a SQL interface to a table, you can do a lot of things that traditionally were not possible, because SQL gives you a great facility to slice and dice and look at different things. Yes, you can use spreadsheets, but how does your data get into the spreadsheet to start with? A lot of that capability, which I did not anticipate early on, is actively used. The interface description case is one that really caught my eye: they know what the interface description should look like, but it doesn’t match for whatever reason, and they look at the inconsistency dashboards to see what needs to be fixed.

Audience: So then the metadata store is kind of acting—I hate to use the word intent—but it’s kind of acting like an intent store. It’s like this is how I want things to be or how I expect things to be, then your system can use that to correlate or use it to even set thresholds for metrics or however you’re going to process that data. 

Nitin: Yes.

Audience: One last thing before we move on: I’m just thinking through how many boxes I would need to deploy. What I mean is, is the data aggregated locally and then sent over TLS tunnels across the Internet, or how are you getting all this data? I’m assuming most of this is happening back in your cloud. How do you get the data from the network to Selector?

Nitin: Most of the time, the collection layer is separate from our cloud because the collectors have to be deployed on-prem. The collectors are very efficiently written; we’ve seen that a reasonably sized VM can poll 5,000 to 7,000 devices. One of our larger customers has a fleet of about 30,000 devices, and we have about eight or nine pollers installed globally. These pollers poll locally, and there is of course high availability and redundancy built in. You most likely want two pollers, and you can load-share so that if one poller goes down, the remaining pollers pick up all of that. All of those facilities exist.

Once these pollers collect the data, they publish it up to our cloud over HTTPS (TLS) tunnels. And “our cloud” is just a figure of speech; it could be the customer’s cloud, deployed in their public infrastructure. A few of our customers actually let us deploy in our cloud, so it’s a mix depending on customers’ security policies.

In our cloud or our customers’ cloud, the system is very elastic. Since it’s all built on Kubernetes, the infrastructure can scale according to the load. If query demand is high, you can allocate two nodes to the query infrastructure. If you need longer data retention, you can scale the storage layer accordingly. It all depends on customer policy—whether you want to retain the data for nine months or 30 days. You need to allocate the necessary storage for that. If you anticipate users querying the system frequently, you would dedicate more nodes to the query layer. 

Since everything is powered by Kubernetes, each layer can be scaled independently. This flexibility has contributed to our success because we don’t require customers to start with a thousand servers. Most likely, they won’t need that. We can start small and grow organically as more data is ingested into the system.
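
Scaling a layer independently could look like the sketch below, using the standard Kubernetes Python client; the deployment and namespace names are invented.

```python
from kubernetes import client, config   # pip install kubernetes

# Invented deployment/namespace names: give the query layer a second
# replica when query demand is high.
config.load_kube_config()
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="query-layer",
    namespace="selector",
    body={"spec": {"replicas": 2}},
)
```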

Audience: Since you brought up the topic of managed service, I’d like to ask a question. Do you support customer-managed keys for encryption if we were to use your cloud?

Nitin: Could you clarify the question?

Audience: Many financial services and regulated companies require control over the encryption of their data so that the provider can’t reuse or access the data outside of where it’s loaded. We use customer-managed keys (CMKs) to encrypt the data. It’s an endpoint you can subscribe to, and we provide the encryption key. This might be getting too technical, and we can take it offline, but this control is important for some of our customers.

Nitin: Right. Customers with such external encryption requirements most likely have their own cloud footprint where all these policies are in place. They handle it locally on their side, and we would not do it in our cloud. That would be a use case where they host the infrastructure on their side.

Audience: I have more questions, but we can discuss those offline.

Debashis: Sure, no problem. Just to clarify, the deployment can be in our cloud, on the customer’s premises, or even in their own AWS or GCP instance. They can provide access there. Regardless of where it’s deployed—whether in our cloud, on-premise, or in their own cloud instance—it’s the exact same stack deployed in all three environments.

Explore the Selector platform