Networking Field Day 30: Deep Dive into Selector AI

Description

Take a deep dive into Selector in this Networking Field Day 30 presentation from Nitin Kumar, Selector Co-founder and CTO. He explains the challenges that network operators face and how Selector has solved them. He also goes into the user experience, from problem to insights.

Video length: 01:21:34
Speaker: Nitin Kumar, Co-founder and CTO, Selector AI

Transcript

[Video text]
Networking Field Day 30 Selector AI
Introduction to Selector AI
Deba Mohanty
January 20, 2023

Thank you. My name is Nitin. I am the CTO and co-founder of Selector. I’m going to go into a deep dive into the platform. This is the agenda for my session.

As we said, a big problem for network operators is that there's a lot of data. What you want is something that correlates all of that data and tells you what to do. It's a simple enough problem, and everyone understands it. But it hasn't been solved for years. That's why we exist, and we've solved it.

My goals for the session are three things. First, I want to tell you why this problem is hard. It sounds like a simple problem, so what are the hard parts about it? If this problem hadn't been hard, it would have been solved years ago; it's still a work in progress. Second, what has Selector AI done to solve that problem? We have made certain design decisions … Kubernetes adoption was one of them. What have we done to solve those problems?

And third, what is the user experience like? You want to correlate all this data, but what is the end-user experience? How does a user consume the product? If you were using Uber, the user experience is you open the Uber app, and you're up and running. That's the taxi-hailing experience Uber provided. What is the user experience that Selector provides? Of course, stop me anytime for questions.

The user experience, to start with: everything is Slack native. Before that … let me give some background on this. This is an example of a data center operator who operates a spine-leaf fabric. You have a lot of spines; you have a bunch of leaves; and then you have servers that are running applications. This is a very simplistic scenario that the demo is based upon. When this infrastructure is running, applications will at some point have issues. Something happens. What is the day in the life of such an operator? This is not the onboarding part, which you might ask about. Things have been onboarded, things are working, Selector is deployed. What happens after that?

Slack native

As Deba mentioned, we are a Slack native company. That means you can consume a lot of the most important parts of our product via Slack. You don’t have to go to a dashboard. We do have dashboards, but you can do most of your stuff on Slack. Tier one teams can do things on Slack. When I say Slack, I mean Microsoft Teams also … everything, because every organization has an official policy of using Microsoft Teams. But there’s shadow IT; they always use Slack. [00:03:00 laughter]

Audience member: It’s been spoken.

Nitin Kumar: We support both even for our customers who do Microsoft Teams. We have a Slack integration. The Selector AI app is available in the Slack Marketplace. If you search for us, it will be there. We are an app in the Azure Marketplace as well. 

This is a bet. Again, we placed a bet on this four years ago when we started the company. We didn’t know COVID was going to come. We felt that something like an enterprise communication tool has got to be the way these things have to be consumed. It’s hard to do dashboards. Plus, as a product, it gets hard. Right now, if I say my product is available on Slack, everybody kind of gets it. That’s the important thing. So, now we’re on Slack.

Really it’s all alert-driven. This is an alert that a Slack operator gets. First, it’s an alert on Slack. This alert is not a dump of information. It has something that is actionable, and it tells you, by just looking at it, really what happened. 

You don’t know the details of it, but at a 10,000-foot view, you know that this person, called Joe, has made a configuration change that has caused certain protocols to go down. There’s a summary over there—a simple English looking sentence. But there’s a lot of technology that has gone into building that sentence. The devices that are affected, the two devices—a spine leaf four and leaf switch 150—those devices have been affected. The BFD [bidirectional forwarding detection] sessions have gone down. The two applications that are most likely to be affected are Office 365 and mail … and video. [It’s a] made-up example. In real life, this looks a little more detailed and a little more complicated, but the concepts are all explained over here.

Jordan Martin: Just coming right out, finger-pointing right away. It's definitely Joe. It's important information. I don't want to be Joe if this message comes across Slack.

Nitin Kumar: Then, you are probably running thousands of applications. Those applications haven't been affected, only your video application and Office 365. Your search space has been pruned a lot. You don't have to look at all the leaf devices; you only have to look at those two. You have to start there; you have a starting point. That's what Deba said. In any outage, it is the triage that takes 90% of the time. If you know what happened and where it happened, fixing it is not really that hard. That's the easy part. I've written a lot of code, and debugging it with a debugger is where I spend most of my time. The fix is probably a line, a plus-five or plus-six change or something like that, but I spend a whole week getting there.

Jordan Martin: … character. 

Nitin Kumar: This piece of information, however trivial it looks, is how the product works. I'll do the demo as well. This is the core of it, and it all happens on Slack. Then, there is the portal link as well. If somebody wants to get more information out of it, you click on that portal link, and then you can do more debugging.

Tim Bertino: Is there some sort of alert suppression that’s happening when you’re doing the correlation so that you’re not sending a whole bunch of different Slack messages? You’re sending one that’s got all the correlated information.

User experience

Nitin Kumar: That’s what I want to show. So, I’m going to try something which is I’m going to do part of the demo now just to show this aspect. This is that Slack channel where stuff is happening. I took a screenshot from here. It’s assuming an alert has fired. In this case, that alert happened earlier. And there’s another instance of this. I see this alert. I want to quickly figure out more about what’s going on. I could be on my mobile. 

First of all, since I’m on Slack, I already have access to this application on my mobile phone. I’m going to assume I saw this alert; I’m going to ask for more information so /select (slash select) is our keyword on Slack. I’m going to say, “status of devices.” I just type that command on my Slack client. I should see the result of that over here. I typed it on my phone. I said, “status of my devices.” I don’t have to remember the exact command or how to pull out all the device status. 

As soon as I type it on my phone, I see this. You can see these are my devices. If you're managing infrastructure, you know certain devices are red. Red means bad; green means good. Other people who are on your team see all of this right away as well. Whoever those nine people are on the channel over there, they suddenly have access to this information.

I can keep typing over here. I'm going to use /select again. Now I know a certain device is bad, and I want to get more detail around it. I know the device is bad, but what about the device is bad? I know the device called L150 is bad, and I want to see what's going on over here, so I ask for its status, typed again on my mobile phone. The status of that device is going to show up over here. This is primarily the user experience.

Here, the chart is not very clear, but different things about the devices show up over here—the logs that happened on the device, the status of all the ports on the device. Everything shows up over here. I can keep going there. Maybe I’m back on my computer or somebody else … one of my team members is already on this computer. They’re already seeing all of this. They can go back. Let me go back to this link. I saw this alert, and I want to go back and see the information at that point in time.

Inside Selector’s technology

Now, let’s see the technology that we built to get to this view—how this was built and what happened. This is the pipeline; I’m going to walk through this pipeline. There’s a bunch of data processing things that happen. What we just saw on Slack is on the far right. You saw that the alerts were formed. The alerts were described in a certain way, and it gives the impression that it’s just a Slack gimmick. It’s not. The information is hard to curate. How do you get there? That’s where the technology lies. That’s where the eventual goal is; I’m going to walk you over there. 

Information curation

These are the raw metrics. Starting from the left, the metrics come in. As I said earlier, these are different kinds of metrics. They could be ping-mesh metrics; they could be SNMP metrics. You've configured them. The metrics come in; the logs come in; events come in. Different kinds of streams are coming into your system, and we accept all of them through connectors.

Once those streams come in through the connectors, we put them into a common data lake that we host. These are still raw metrics. This is what we cannot afford to store for months. They are just the raw metrics, which is why they're shown in white. The key part is they are sitting in a common data store; they are not siloed. The original metric sources were siloed, but they're brought into one data store. That's the first technological challenge that we've solved: getting these different forms of data into one storage. It's harder than you would think. That's why silos exist today. People will have a metric store, or people will have a log store; they just live side by side, one dashboard for this, one dashboard for that. We've already brought them together. That's the first challenge that's been solved.

Now, if you want to do anything with this data, it is computationally infeasible because there are millions and millions of data points. If you started to analyze each one of them, there's no amount of compute that will solve for that. You just cannot observe it all. You just cannot write good enough software that is going to do all of that.

Machine learning

What we then do is use ML to convert these white numbers and sentences into reds and greens. Green means things are good; red means things are bad. Our assumption is, which I think is a fairly good assumption, we don’t have to really correlate between the greens. We can throw away the greens from a correlation point of view momentarily. We don’t have to worry about the greens. 

Once we've thrown away the greens, we're left with only the reds. The volume of data that our software has to look at is significantly reduced. Sometimes it gets reduced by 99%. In a given period of time, only so many bad things can happen, maybe a thousand red events, but you had a billion good events. We've sifted out the reds.

Once you have the reds, you need to start clustering them together. Some events could just be unrelated. The bad configuration Joe made over here is related to these events, but something that happened over there is not related. That's where part of the correlation algorithm comes into play. We call it collaborative filtering. I'll describe more about how that happens.

You start forming these clusters. To answer your question, if we alerted on the reds directly, you would see that many alerts. If you had a thousand things that went bad, you would have seen a thousand alerts. Because of the collaborative filtering clustering that we do, bunches are formed. Like the thing I showed you on Slack: only two clusters got formed, and two alerts got generated. First of all, the user doesn't have to configure these two alerts. The system is already looking for bad behavior. It clusters the bad behavior together and then sends it over as problems, as alerts. That's the pipeline. I'm going to go deep into each part of the pipeline to explain how that works.

Tim Bertino: How are thresholds for what's good and bad determined? And do customers have the ability to customize that, or would you even want customers to have that ability?

Nitin Kumar: I’ll describe that. Customers don’t want to configure that. Our system uses unsupervised machine learning to get to what is good and bad. I’ll describe all of that. 

Tim Bertino: So, you’re taking baselines.

Nitin Kumar: Yes.

Rita Younger: But you could have management software, for example, that appears to be an anomaly because of its behavior with contacting multiple devices. So, it seems like it’d be a good idea to allow them to tune.

Nitin Kumar: Yeah. After the fact, you can say: "Yes, you've discovered this cluster, but I don't want you to take it into account in the future." There is some amount of human learning. When I said we don't want our customers to do that, 99% of the time, we don't want them to be involved because it puts additional burden on them. And sometimes they don't know. But for certain aspects when they do know, that ability is there; that's what we call rules to prune your correlations.

Jordan Martin: So, maybe the elephant in the room in talking about a solution like this is that correlation is an incredibly challenging process. You know, that’s the tool. My question to you: are you doing analysis on the accuracy, statistical analysis on the accuracy, of what that ML is coming up with and constantly revising that? What does that look like? How are you ensuring that you’re getting the best possible data out of there? You’re not going to tell us how you do it, and I’m not asking you. How are you looking at that? Because I imagine there have to be false positives … false negatives. What does that look like?

Nitin Kumar: We’ve built a feedback loop. If I’m discovering this correlation for you, there is a browser interface to that. Then you can say: bad alert … good alert … something like that. That feedback the human does. That feedback is then looped into the correlations phase and then it knows I don’t need to be correlating that. That is one aspect of the feedback. The same thing applies for the baselining aspect. Yes, you said this is a threshold violation, but I know this really is not. It’s expected. There are things around seasonality that we’ve built that, yes, I expect things to go up at these times of day. We do all of that as well. 

Jordan Martin: Okay. Does that feedback filter out to other customers’ products or does each entity have its own bit of training that happens around what is and what isn’t? 

Nitin Kumar: It is a mixture of both. Certain things we do for one customer do benefit other customers. For example, this log line is bad; this log line is really harmless, although it looks …

Jordan Martin: They're being verbose.

Nitin Kumar: Right, yeah. Stuff like that does transfer from one deployment to the other, and sometimes other things don't. Our video provider has one kind of seasonality. The seasonality that we see in a different deployment is very different, so it is tuned over there.

Pete Robertson: Okay. To follow up on that then: if, in each individual customer deployment, there is obviously a period of time of data ingestion and then training the AI model specific to that customer, with limited preloaded intelligence from your broader customer base, how long are you seeing in a typical customer to get to time to value?

Nitin Kumar: We've been very cognizant of that observation. As a startup, we cannot walk into a customer and say: "I'm going to train for a year before I show value." We knew that traditional ML models that need that much data to train would be non-starters. We said: "Okay. We have to compromise a little bit. Yes, we will train the models more aggressively on lesser amounts of data. Then, maybe we get a few more false positives, which we'll prune away." It's an engineering compromise.

For us, it takes a day to show them value, like: "Look. You had these latency values. I'm able to tell you what is good and what is bad. You gave me all of these logs that came in. I was able to cluster them based on what I've learned from other deployments or from your deployment." Within a day, we're able to show value with respect to ML. Generally speaking, a product has a pilot phase before it goes to deployment. Our pilots run for 30 days. Generally, 30 days is enough for them to see the value. I'll show you examples of where that helps.

Data ingestion

So now let's go through different parts of this pipeline and show what goes on. The first part is data ingestion. If you see in the top right corner, I'm walking through the pipeline and showing different aspects of it. Data ingestion has two important properties that you have to solve for. First, you have to build enough connectors for the different kinds of data sources. There are a few categories of data sources that you have to build for: customers let you connect directly to devices; or there are message buses in the middle that you connect to; or you connect to a data store.

Somebody has already stood up a Splunk. Somebody’s already stood up a Prometheus. You connect to that. You need to be able to connect to these three different kinds. Each of them requires a different kind of technology that you have to build. There is no one-size-fits-all. You have to do all of this. Then, of course, there are cloud providers. Cloud providers have APIs that you can connect to as long as you have the credentials and all that. They give you API access.

Now we need to be able to ingest a given new piece of data in a minimal amount of time. What we call this is: there has to be minimal time to integrate new connectors. We've realized there's no universal data format. Every data format looks different. There is no universal format, and there is no way customers or people will arrive at something that is uniform. So, we embrace that fact. Instead of having to write custom Python parsing code to take in this data, we've invested in a compiler.

We've built … this is one of the patents that we've filed … we built this compiler that we've written once. As and when a new data source comes in, you just feed that schema to the compiler. This compiler generates the code, and that code executes. Our time to integrate and parse a new data format that we haven't ever seen is as fast as this compiler can run, which is seconds. We don't have to write new code. You don't have to deploy. You don't have to do anything. We internally call this compiler the data hypervisor. The name hypervisor came from the compute world. Just like hypervisors abstracted disk, network, and storage from the application, this compiler abstracts out the data variety that our applications have to see. So that's one aspect of the data ingestion.
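
To make the data hypervisor idea concrete, here is a minimal sketch of schema-driven parsing: a small "compiler" turns a field schema into a parser function at runtime, so onboarding a new data source means supplying a schema rather than writing new parsing code. The schema fields, record shape, and field paths below are made up for illustration; this is not Selector's implementation.

    # Minimal sketch: "compile" a field schema into a parser function so a new
    # data source needs only a schema, not hand-written parsing code.
    from typing import Any, Callable

    def compile_parser(schema: dict[str, str]) -> Callable[[dict], dict]:
        """schema maps an output field name to a dotted path into the raw record."""
        paths = {field: path.split(".") for field, path in schema.items()}

        def parse(raw: dict) -> dict[str, Any]:
            out: dict[str, Any] = {}
            for field, path in paths.items():
                value: Any = raw
                for key in path:
                    value = value.get(key) if isinstance(value, dict) else None
                out[field] = value
            return out

        return parse

    # A hypothetical record from an SNMP-style source and its schema.
    schema = {"device": "source.hostname", "ifname": "metric.interface", "in_octets": "metric.value"}
    parse = compile_parser(schema)
    record = {"source": {"hostname": "leaf150"}, "metric": {"interface": "e114", "value": 123456}}
    print(parse(record))  # {'device': 'leaf150', 'ifname': 'e114', 'in_octets': 123456}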

The second aspect is these three things. This is how you do high-volume data ingestion. The difference between ingesting operational metrics and business intelligence data is scale. There's so much scale involved in ingesting that data that just doesn't exist for other kinds of BI (business intelligence) tools. We do a lot of analytics. Yes, the slicing and dicing of dashboards, the Slack piece, that is common to the BI world. But in the operational world where we all live, how do you deal with these giant fire hoses of data?

There are three important problems to be solved in this space. First, you have to have a scale-out architecture, because data volume generally starts small and grows gradually over a period of time. Your architecture needs to be able to elastically use more compute resources, more storage, more of whatever it needs. As I said earlier, we've adopted Kubernetes, so all of this comes naturally. We start small at a deployment and suddenly we grow. It's just a matter of tuning the Kubernetes environment up to a larger cluster. That's the one part.

Distribution architecture

The second part is so much data comes in, we need an internal data distribution architecture. We’ve ingested all of this data, but this data has to be distributed inside our system as well. If you were to take a router or a chassis, there is a fabric inside that takes in traffic from different ports and distributes it to different ports. We also have a fabric inside. We have applications, our ML applications, our baselining applications, our query applications. They need to have uniform access to that data.

We ourselves deploy Kafka inside our cluster. We have a Kafka bus running where all the data comes in. We dump it into Kafka, and then all our applications consume it out of Kafka. The third part is disaggregated collection of data. You had asked earlier whether we do that; yes, we have to do that. If we are in the cloud, we need to have a collector on premises that is polling locally, because the administrative policies are set like that. It then streams up to the other instance.
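
To picture the internal Kafka bus just mentioned, here is a hedged sketch using the kafka-python client: a connector publishes normalized records to a shared topic, and each downstream application (baselining, correlation, query) consumes the same stream. The topic name, broker address, and record fields are assumptions for illustration, not Selector's internal setup.

    # Illustrative only: fan ingested records out over an internal Kafka bus so
    # that every downstream service reads the same stream.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # A connector publishes every normalized record to a shared topic.
    producer.send("raw-metrics", {"device": "leaf150", "ifname": "e114", "latency_ms": 17})
    producer.flush()

    # Any downstream application (baselining, ML, query) consumes the same bus.
    consumer = KafkaConsumer(
        "raw-metrics",
        bootstrap_servers="kafka:9092",
        group_id="baselining",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        record = message.value
        # hand the record to the baselining pipeline here
        print(record)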

There are two aspects of the disaggregated collectors that sometimes get missed. Building the collection is hard, but yes, you have to do that. Then, you also have to manage them. You need to have a central place where all these collectors are present. If you just deploy one somewhere else and it stops working, whose responsibility is it to figure out whether that remote thing is working or not?

We’ve paid special attention to it. Yes, it’s deployed distributed but still managed centrally. These collectors, when they are installed on premises, they dial up to the master instance. They establish a control connection on which heartbeats are exchanged so, if they die locally, at least somebody can be notified, or it can be restarted. All of that needs to be taken care of. If you do not do these things, the solution is just not going to work. This is on the data ingestion part.
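
The "deployed remotely, managed centrally" point can be sketched as a collector that dials home with periodic heartbeats while the central instance flags any collector that goes quiet. The URL, collector ID, and intervals below are hypothetical; the real control connection is Selector's own.

    # Sketch of a remote collector heartbeat and the central health check.
    import time
    import requests

    HEARTBEAT_URL = "https://central.example.com/api/collector/heartbeat"  # hypothetical endpoint
    COLLECTOR_ID = "dc1-collector-01"

    def heartbeat_loop(interval_s: int = 30) -> None:
        """Runs on the on-prem collector: report in every interval."""
        while True:
            try:
                requests.post(HEARTBEAT_URL, json={"collector": COLLECTOR_ID, "ts": time.time()}, timeout=5)
            except requests.RequestException:
                pass  # keep trying; the central side will notice the silence
            time.sleep(interval_s)

    def is_healthy(last_seen_ts: float, interval_s: int = 30) -> bool:
        """Runs centrally: flag a collector that has missed three heartbeats."""
        return time.time() - last_seen_ts < 3 * interval_s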

Metrics and events

Now let’s get to the more interesting part—converting metrics and how to correlate them. The first basic problem is metrics are fundamentally numbers, just numbers—billions and billions of numbers. Logs are English sentences. 

If you want to correlate them together, they're apples and oranges. You cannot correlate them. Metric correlation itself is a hard problem, granted, but now you're looking at metrics and logs, which are fundamentally different classes of information. Some technology has to be built to unify them. Some common currency has to be built so that you can say: "Ah, now I can compare these things." I'm going to talk about the technology for that.

Jordan Martin: One of the things that I think also needs to be considered is there are a lot of static elements that contribute to the analysis of our network—things that don’t change, things that don’t show up in logs, things that don’t show up as metrics—that are kind of steady state but have an impact on the analysis of what’s going on.

How do you correlate not only disparate sets of data, because that’s an incredibly challenging problem like, to your point, numbers … sentences. I’m sure there are probably 15 other ingestion formats that could exist beyond those two. But then how do you take into consideration the things that are static that are not generating those things?

Nitin Kumar: Yeah.

Pete Robertson: There's a context to what is taking place, right? And the context is not necessarily ingested through telemetry or SNMP. To Jordan's point, how are you making the correlation in light of that context to drive the actual wins?

Nitin Kumar: Yes. Javier is sitting at the back, and he pointed this out just before the presentation yesterday. You need to include metadata as well in this light. The metadata is a very important piece of information that generally gets missed. I don't have a slide describing that, but we have first-class integration with metadata stores. NetBox is the most commonly used store these days, and we connect into NetBox. We use the NetBox APIs and suck in that information. Those tables exist in the system as a data element. And as those streams are coming in, they start getting joined and used as well.

Jordan Martin: How do you reconcile that with your strategy of ingesting all the data only as long as you need it to build insights? Are you just redoing that process on some interval? How does that work?

Nitin Kumar: Two parts to that. The meta stores are not voluminous.

Jordan Martin: Okay. You treat that data a little bit differently.

Nitin Kumar: Yeah. Because even though we don't expose it to the users, these retention policies and how long something is stored, our admins do control that. We have some amount of control where these meta stores can live forever. That's number one, because they're a drop in the ocean compared to the other stuff.

The other part is, even for these meta stores, we have a polling period set to a day sometimes … very infrequent. And the third option is where we have a customer who says, "Hey, I know when I've updated my meta store. Just give me a button so I know I've updated it here, and I'm going to make an API call that will call your ingest on demand, or I'm going to log into the portal and do things." A combination of these three things allows us to get metadata into the platform.

Let’s see how we do the metrics and log stuff. The metrics say … so this is the metrics pipeline. How do we go from numbers into a color? On the left, you have these numbers. This chart I’m showing is a latency chart as a data set. Latency is from point X to point Y. These latencies have different numbers. Then, we have to convert them into good numbers or bad numbers. 

For example, here, a number like 174 is still considered green, good. A number like 17, which is smaller, is still red. This is where the ML comes into play. It says that 174 is normal for this connection. However, 17 is not normal for this connection. Fifty-six is getting closer to being abnormal. We generally have three colors: red, yellow, and orange … although orange and red are considered the same category. 

How does this baselining thing work? There are two aspects of it as and when the data comes in. You’ll always see in the machine learning space, there is the training pipeline, which is shown at the top, and then there is the inference pipeline shown at the bottom. The data gets fed into the training pipeline as well as the inference path. The inference path is kind of lagging behind the training part. The training part does its training; it builds its model and makes it available to the inference pipeline. You can think of the model as a giant lookup store. You look up into it and you get a value back, although it’s not as simplistic as that.

This model in the middle is continuously being built by the training pipeline. As data is coming in, the training pipeline is looking at data samples. It goes back in time. It looks at the current data sample and what the threshold value should be. It does all of that, and it starts writing into the model. Then the real-time inference part is just comparing the computed threshold with the actual value and coming up with an answer. It may not show up well on this screen, but if you see the thing on the right, there is a faded line that is sort of the baseline of the chart. Then there are these spikes in the middle. Those spikes in the middle get flagged as red lines.

If you look at the behavior of the underlying metric, it is just a sea of greens, then a red, a sea of greens, then another red. That's the picture I showed initially: you had white things, which are just the numbers. Then, because of this technology that we've built, those numbers get converted into red elements and green elements.
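
A toy version of that baselining step makes the idea concrete: a "training" pass learns a per-series baseline from history, and an "inference" pass colors each new sample. The z-score thresholds and the latency history below are illustrative only, not Selector's actual model.

    # Toy baselining: learn a per-series baseline, then color new samples.
    import statistics

    def train_baseline(history: list[float]) -> dict:
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history) or 1e-9
        return {"mean": mean, "stdev": stdev}

    def infer_color(value: float, model: dict) -> str:
        z = abs(value - model["mean"]) / model["stdev"]
        if z < 2:
            return "green"
        if z < 3:
            return "yellow"
        return "red"

    # Latency history where ~170 ms is normal for this path, so 174 stays green
    # while 17 gets flagged, matching the example in the talk.
    history = [168, 171, 174, 169, 172, 175, 170, 173]
    model = train_baseline(history)
    print(infer_color(174, model), infer_color(17, model))  # green red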

Another interesting part is the seasonality implementation over here. You know that, yes, it’s going to spike up at a point in time, but that spike is normal. There is a seasonality process running that is looking at the numbers over a larger period of time. It recognizes that in the morning at 8 AM, things are going to spike up, so it then artificially pulls up the threshold. Don’t flag this as a red because I’ve seen this happen. 

We kind of look at seasonality this way … if you look at the human brain, there is system one behavior, where you look at something and you instantly react. That's what real-time baselining is. Seasonality is system two behavior: you go back, you think about it, and that influences your behavior. There are two systems running over different swaths of data. One looks at real-time data; the other looks at more historical data.
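
The seasonality ("system two") process can be sketched the same way: learn a separate baseline per hour of day from longer history, so a spike that happens every morning raises the 8 AM threshold instead of firing a red event. The hourly grouping and the three-standard-deviation threshold are assumptions for illustration.

    # Sketch of seasonal baselining: one threshold per hour of day.
    from collections import defaultdict
    import statistics

    def train_seasonal(history: list[tuple[int, float]]) -> dict[int, float]:
        """history: (hour_of_day, value) pairs collected over days or weeks."""
        by_hour: dict[int, list[float]] = defaultdict(list)
        for hour, value in history:
            by_hour[hour].append(value)
        # Illustrative threshold: hourly mean plus three hourly standard deviations.
        return {h: statistics.fmean(v) + 3 * (statistics.pstdev(v) or 1e-9) for h, v in by_hour.items()}

    def is_red(hour: int, value: float, thresholds: dict[int, float]) -> bool:
        return value > thresholds.get(hour, float("inf"))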

Now we have numbers that we converted into events. Now we have a red event and a green event. That’s how we converted numbers into this common currency of events. We have to do the same process for logs.

Logs

Logs are more complicated. Logs traditionally are written by developers for themselves. They think they are going to be debugging in the field, and they just write this message for themselves. There is no rhyme or reason to how the log looks. I’m guilty of writing logs that only I can understand; nobody else can understand. 

What is a log? It’s really a signature of a state that the original program has detected. Maybe the original program … some piece of code … some piece of configuration … got into a bad state. If this state happens, print this log. That’s what that piece of code is. Our goal as an analytics tool is to reverse engineer what that state was. 

There are two parts of the log. You have a log line. You need to classify this log line into what it represents. Does this log line represent a BGP down event? Does this log line represent a BGP up event? What does this log line represent? 

You might have millions and millions of logs, but you probably have 100 classes of events because there are 100 conditions that have happened. BGP went down, BGP went up, stuff like that. That's the first part of the technology that has to be built. Inside a log line, there are important entities. This log line refers to a BFD session. This log line refers to an interface. This log line refers to an AS number. And in this log line, we've highlighted that there is a tag called 10.1.9.12 or there is AS number 65101. You need to be able to infer that.

Traditionally, folks have written regular expression parsers to get that information back. Regular expression parsers are impossible to maintain. The moment you get them right, the vendor changes the message in the next release, and you have to go back and change the expression. You need to reverse engineer from the log.

This is where we use two very important pieces of machine learning technology. The first is called clustering: you cluster all of these logs into different clusters. That's how you detect the pattern. This cluster is one group of things; that cluster is another group of things. That's number one. Number two is called named entity recognition [NER]. You know what the entities in your system are. By looking at this log line, you would be able to extract that this is an IP address, and this is an AS number. NER is a well-established discipline of machine learning; we've applied it to logs.

Once these two …

Audience member: Can you say that acronym one more time? You said NER?

Nitin Kumar: NER and clustering.

Audience member: And NER stands for what again?

Nitin Kumar: Named entity recognition.

Audience member: Thank you.

Nitin Kumar: But I didn’t give the explanation. Even the log processing … we call this process inside our system log mining. We are mining the logs and converting them into signals. Raw logs come in. As with the metrics, there’s a training phase and then there is an inference phase. 

In the training phase, these two things happen: clustering happens, and NER happens. That goes into building the model. When the raw logs are coming in, they are evaluated against that model. Then suddenly the mined logs look like this: every log line has an event name, and every log line has a certain set of tags. Now, because of this process, we have metrics converted into events, and we have logs converted into events as well.
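
A toy example shows the shape of the output of log mining: each raw log line ends up with an event template (the cluster it belongs to) and a set of extracted tags. The real clustering and named entity recognition are ML models; the regex heuristic below only stands in for them to show what the mined logs look like.

    # Toy log mining: derive a template (cluster key) and extract entities.
    import re

    IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
    NUM_RE = re.compile(r"\b\d+\b")

    def mine(line: str) -> dict:
        masked = IP_RE.sub("<ip>", line)
        tags = {"ip": IP_RE.findall(line), "number": NUM_RE.findall(masked)}
        template = NUM_RE.sub("<num>", masked)
        return {"event": template, "tags": tags}

    line = "BFD session to neighbor 10.1.9.12 (AS 65101) changed state to DOWN"
    print(mine(line))
    # {'event': 'BFD session to neighbor <ip> (AS <num>) changed state to DOWN',
    #  'tags': {'ip': ['10.1.9.12'], 'number': ['65101']}}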

Correlator

At some point in the stack, this piece of code that I'm going to describe next, the correlator, doesn't know whether an event was sourced from a metric stream or from a log stream. The common currency has been found and determined, and we're going to work with that now.

The event data model … to understand how correlation works, think of this Mickey Mouse-like event diagram [pointing to image on screen]. There is an event name; every event has a name, and it has a set of tags.

In this case, for example, if you had an event called BGP down, the tags that exist for that event are: what is the IP address, what is the hostname, what is the AS number, what is this, … You could have any number of tags associated with an event called BGP down. There could be another event called if down, and it has a different set of tags: which interface went down … what the remote interface is. These are its tags. This is where the metadata augmentation comes in very handy.

When I described those pipelines from the raw metrics and the logs that generate these events, the events go through yet another phase where they are enriched with more metadata. Even though your raw metric doesn't contain a site name or location or customer ID, all of that information gets enriched onto the event.
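
The event data model and the metadata enrichment step can be sketched as follows: an event is just a name plus a bag of tags, and enrichment joins in fields such as site and role from a source of truth like NetBox. The lookup table, field names, and hostname here are hypothetical.

    # Sketch of the event model: a name plus tags, enriched from metadata.
    from dataclasses import dataclass, field

    @dataclass
    class Event:
        name: str                          # e.g. "bgp_down", "if_down"
        tags: dict[str, str] = field(default_factory=dict)

    # Hypothetical metadata pulled from a source of truth, keyed by hostname.
    METADATA = {"leaf150": {"site": "dc1", "role": "leaf", "rack": "r12"}}

    def enrich(event: Event) -> Event:
        extra = METADATA.get(event.tags.get("hostname", ""), {})
        event.tags.update(extra)           # the event now carries site/role/rack tags too
        return event

    e = enrich(Event("bgp_down", {"hostname": "leaf150", "peer_ip": "10.1.9.12", "asn": "65101"}))
    print(e.tags)  # includes site, role, and rack from the metadata join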

It's very important for the correlation algorithm to work. The metastore augmentation is one of those key pieces of technology that we discovered: from a technology point of view, we just think of it as another data source, but the effect that it has on our results is tremendous.

Correlation and recommendation

Once we know what an event model looks like, how does the correlation work? Think of correlation as this: the more tag values that match between two events, the more strongly they are correlated. In this example, if I look at event one and event three, there are two matching tags, the blue tag and the black tag, so they're strongly correlated. Event one and event two are weakly correlated because only one tag, the green one, matches. Event two and event three are not correlated at all because no tags match. This is the core algorithm.
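
A minimal sketch of that rule: count how many tag values two events share, treat pairs above a threshold as correlated, and fold mutually overlapping pairs into clusters. The event names, tags, and the two-tag threshold below are illustrative; the production collaborative-filtering algorithm is more involved than this.

    # Sketch of tag-overlap correlation and clustering.
    from itertools import combinations

    def overlap(a: dict, b: dict) -> int:
        """Number of tag values two events share."""
        return sum(1 for k, v in a.items() if b.get(k) == v)

    def cluster(events: list[dict], min_shared_tags: int = 2) -> list[set[int]]:
        clusters: list[set[int]] = []
        for i, j in combinations(range(len(events)), 2):
            if overlap(events[i]["tags"], events[j]["tags"]) < min_shared_tags:
                continue
            merged = {i, j}
            remaining = []
            for c in clusters:
                if c & merged:
                    merged |= c            # fold overlapping clusters together
                else:
                    remaining.append(c)
            remaining.append(merged)
            clusters = remaining
        return clusters

    events = [
        {"name": "config_change", "tags": {"hostname": "leaf150", "user": "joe", "ifname": "e114"}},
        {"name": "bfd_down",      "tags": {"hostname": "leaf150", "ifname": "e114"}},
        {"name": "fan_warning",   "tags": {"hostname": "spine4"}},
    ]
    print(cluster(events))  # [{0, 1}]: the config change and BFD event cluster; the fan warning does not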

Since we have time, I'll tell you how we arrived at this. My chief data scientist and I were discussing Netflix recommendation systems. This was before we had done this. … We share a lot of common taste in movies, and he would say, "Hey, watch this movie or this one." Also, on Spotify, sometimes I'd be listening to music and suddenly it would recommend: "You should listen to this." I'm like: how did this software know that I like this?

Then we did some research on it. And we found that every movie, every song, can be decomposed into features.

So, think of the movie Pulp Fiction. If the movie is Pulp Fiction, you and I understand Pulp Fiction as just a name, maybe [that it’s also] directed by Quentin Tarantino. That’s about it. But internally, the Netflix system breaks down Pulp Fiction into thousands of attributes … thousands of attributes … like this movie has a screenplay. This movie has this and that. It breaks down a movie into a thousand attributes. 

Now you take another movie that will have another thousand attributes. Then, the system does an intersection of those attributes. So, if I've liked Pulp Fiction, most likely I'm going to like the other movie because, out of those thousand attributes, maybe 60 attributes in that movie match 60 attributes in this movie, so most likely Nitin is going to like [this other] movie.

That’s how recommendation systems work. Then we realized that means recommendation systems are correlating these two movies. Pulp Fiction and Jackie Brown were correlated because of certain common attributes. 

But wait. We can do this with network events as well. If two events are heavily correlated, there must be a root cause between them. In the case of the movie system, the commonality was my choice in movies or my choice in music. [crosstalk] Something.

We felt that the same applies to networking events or any kind of events. If they have similar attributes, there must be a root cause between them: Joe must have issued that configuration … [laughter] because things went down. [interjection] Of course, when you start the implementation, you don’t believe it’s going to work for you; you want it to work; and then we’ve deployed this again and again and again, and it just works. 

The magic of this algorithm is [that] it’s domain independent. I do not have to model the underlying domain. Some folks might argue that you’re losing some fidelity. That’s okay, because as soon as you rely on understanding the domain, you become embedded in it, and every network looks very different. You then have to start taking care of every nuance in the rule modeling. 

This thing … just works. Of course! The premise here is tags have to be well documented … you have to have good tagging. Let’s solve that problem. It’s a data problem. The rest of the world has solved that data problem. Even if the original tags are not clean, let’s put in some work to solve for that … because you can see the end result there. If I solve the problem of cleaning up tags, creating more tags, it’s a bounded problem. Maybe it’s hard. It’s a bounded problem. 

That’s been one of our key differentiators as we go into a system in a deployment and we’re able to show correlations. We sometimes don’t understand their exact network topology. That’s not important for this part of the problem. And it was inspired by Netflix and Spotify.

Audience member 1: I have to tell you, that was the best explanation I’ve ever heard.

Audience member 2: Seriously.

Nitin Kumar: Thank you.

Audience member 3: Yeah. It was great.

Nitin Kumar: And it’s true. It’s true. I’m not kidding you.

I'm a runner. You don't look at me as a runner, but I run, and Spotify plays. Sometimes it's playing music I don't want, so I skip it. And suddenly it replaces it with something, and I'm like, yeah, I want to listen to that. It really is the same music as the other music. If your ear is trained to that kind of music … and my daughter said the same thing, "Daddy, you listen to the same kind of music all the time." For her, all my music sounds the same, and Spotify is just fooling me into that.

Thank you. It’s a true story. 

This algorithm, we want it patented in the context of networking, but really it's a recommendation system. From a different point of view, now we've correlated. We've gotten through 90% of the hard problem. We've gotten there, but this is not all of it.

If you look inside the machine, it has produced this massive JSON. It's produced this because, if you look at the correlation, it's produced a lot of tags. It's produced a lot of information. All of that is, again, overwhelming to the user. This is the reverse part of the query. In the first part, when I was doing my stuff on Slack, I wasn't typing in complex things. I was just [typing] "status of application," "status of …."

That part of the problem was me as a user of the system, and that's called natural language processing [NLP]. I'm going to type in English. That gets converted into something the machine understands, and it goes from there. This is the reverse part of the problem. The machine has produced this correlation. For the machine to give it back to the human, you need a separate piece of technology. It's called natural language generation.

The last part of the stack takes this, and it converts it into an English sentence. Over here, "configuration by Joe" … if you look at that piece of the message, that information is embedded in this JSON object, but we've not presented it along those lines. This is where the system also learns. Sometimes the users give us feedback that, no, this is not very clear, so we need to adjust for that. That's where a lot of the back and forth of learning happens.

Now, the final piece of the stack takes in this JSON object and produces this nicely formatted thing. Then, the last part is just an API call to Slack to post it. Again, it looks very simple, like this is just what the system produced. But to get to this, there were seven or eight steps that had to happen, and that's what Selector AI is about.
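
As an illustration of those last two steps, the sketch below turns a correlation result into a one-line English summary and posts it to a Slack incoming webhook. The JSON shape and webhook URL are made up, and the fixed template only stands in for the natural language generation described above.

    # Sketch: summarize a correlation cluster and post it to Slack.
    import requests

    cluster = {
        "root_cause": {"event": "config_change", "user": "joe", "device": "leaf150"},
        "impact": {"events": ["bfd_down"], "devices": ["spine4", "leaf150"],
                   "applications": ["Office 365 mail", "video"]},
    }

    def summarize(c: dict) -> str:
        rc, im = c["root_cause"], c["impact"]
        return (f"Configuration change by {rc['user']} on {rc['device']} correlates with "
                f"{', '.join(im['events'])} on {', '.join(im['devices'])}; "
                f"likely affected applications: {', '.join(im['applications'])}.")

    # Post via a Slack incoming webhook (placeholder URL).
    requests.post("https://hooks.slack.com/services/XXX/YYY/ZZZ",
                  json={"text": summarize(cluster)}, timeout=5)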

Having come from the networking space, I knew it was not easy, but somebody has to build it, and that’s what we’ve built.

Jordan Martin: This is the thing. This light bulb for me … this whole table is filled with people who do this natively through experience and intuition. What we’re trying to do is to translate that into something that could be done programmatically. That’s been the challenge historically. I don’t even think about what we do in the context of doing this, but as you explain it’s like, “Holy crap, that is something?” It really is something of value to be able to distill down to the important part [crosstalk] … and really make that meaningful so …

David Penaloza Seijas: You know there’s a parallel to when you’re troubleshooting and trying to find a cause, then you do more or less the same. You look at IP addresses and then you keep hopping from one device to another, from one process to another, until you get there, not with the same speed of course. I don’t see green and red, no offense. [laughter] But it’s a good way of seeing it because essentially I don’t think it’s an easy thing to write programmatically or repeat or emulate programmatically what you think … the way you think … the train of thought … the way you correlate things. This is something you do while you’re trying to figure it out; it’s breathing down your neck. It’s not that it just happens there.

Nitin Kumar: Yeah. I'll go to the demo. I think it'll work. That's what they tell me, but I'll check it out.

What our customers love best

Correlation is one key aspect folks ask about, whether you can do it, but there are other capabilities of our system as well, which, to my surprise, [our] customers love more. I used to think these problems were already solved.

Single source of truth

The number one thing is Selector becomes a single source of truth. One of our customers uses the term "no more swivel chair." [They say,] "I don't have to go look from one dashboard to the other dashboard. I just look at one thing and you guys tell me everything. I don't have to go to Splunk; I don't have to go to the metrics store; I don't have to do any of that." That's the single source of truth.

Democratization of data access

Anybody can look at that data and do anything with it. They are able to make queries through Slack into their own data, which, before Selector, they weren't able to access. Some data was trapped behind a firewall, like the load balancer data. All of that we made available through our system.

SNMP and gNMI cloud native collectors

I do believe we are one of the first cloud-native SNMP or gNMI collectors. I don't know of any product out there, aside from a lot of open source, that gives you a Kubernetes-based SNMP or gNMI collector. You have these products that are hardware-based, and you have a lot of open source. Open source is hard to operationalize. It's good, but you cannot really deploy it at scale. We are one of the first SNMP and gNMI cloud-native collectors. What I mean by cloud-native is Kubernetes-based.

Front-end to ServiceNow 

This is really interesting. If you can imagine, we have a customer who said that when an outage happens, “I’m more afraid of the ServiceNow bill than the actual damage caused by the outage.”

What happens during an outage? Thousands of alerts fire … thousands and thousands of events fire. Each of those gets into ServiceNow, and guess what? ServiceNow charges them on a per-event basis. [The customer says,] "My ServiceNow bill is more than the cost of the outage."

We are used as a front-end to ServiceNow. All of those events we correlate, and we produce two or three ServiceNow incidents. We follow an ITSM incident methodology. Those incidents get into ServiceNow. They're like, "Okay, yeah. We'll use Selector as a ServiceNow front-end." They use it for more, but this is the reason why they selected us.

David Penaloza Seijas: Well, ServiceNow isn’t going to do correlation. It’s just going to fire up, and anyway, it’s not convenient for them from a cost point of view to do correlation. There [inaudible] more money. 

Nitin Kumar: Yeah. Why would they … what's that book, The Innovator's Dilemma? A company that is successful at certain things will not do anything that will kill its business model. Startups like us, that's why we are here.

Customizable

Then, we are customizable. What I mean by that is, even though it's a networking use case, the core platform … if you look at it … I've spoken a lot today, and I've never used words like BGP and all that. I never used core networking principles, because the underlying platform is just producing data and consuming data. It's working in terms of math. It's independent of what the actual semantics of the data are.

We are able, using a solution engineering team, to take the platform and solutionize it in the context of a customer. By the way, this is not a professional services engagement; it's part of our product offering. We call this a data scientist as a service or a solution engineer as a service. You get all of that as part of the Selector service. You don't have to pay anything separate. We do tailor everything in the context of a customer.

Flexible deployment

Then there's the deployment flexibility we talked about earlier: we can be deployed on-premises, in the cloud, on our cloud, or completely on a VM or a set of VMs in the data center. These are other capabilities that we've built and have been very successful with.

Summary

I want to summarize here. We are a Kubernetes native architecture. We adopted Kubernetes. It has helped us a lot … elastic scale out … don’t have to reinvent the wheel. Kubernetes has solved a lot of the important problems for us. We are built on top of that. 

We do allow for a lot of network-centric observability use cases where we do routing analytics, SD-WAN analytics, application analytics. A lot of it is use cases that are network-centric.

Finally, as I said, we are tailored for customers. We say solution engineering and data scientist as a service.

Jordan Martin: A question about data sources still. I keep coming back to this, at least for me. We talked about logs versus metrics versus metadata. What about parts of the network that can impact you that you don't control? Talking about internet performance or those types of things … is there any consideration or any thought to how those things get integrated into the analysis process?

Nitin Kumar: Yeah. Believe it or not, there are experts in that space. That's ThousandEyes. ThousandEyes will give you a very accurate view of what your WAN connectivity looks like or what your internet connectivity looks like. Kentik is another example. We integrate with their insights.

All of these products have an API access or even a streaming access into the insights that they manufacture so that we don’t have to reinvent the wheel. Customers don’t want us to reinvent the wheel because they’ve already deployed these established products in this space. 

If Selector comes in and says, “Hey, we do all of these things.” They’re like, “Why would I pay you to do this? I’m paying a lot of money to this company. I’m very happy with them. You need to fill in the gap that these guys are not able to do.”

For stuff like that, which is outside the network, we integrate with existing solutions. The good news is these solutions are very modern, and they provide great API access into the insights they produce. They have a lot of software integrations built in because they were born in a day and age where integration with other software systems is very important. For a company like Kentik, integrating with us is a good thing for us.

Tim Bertino: As you said at the beginning of this, for the time being, you’re focused on network devices … network events. With what you’re doing with tagging metrics and logs, and turning those into correlated events, is there potentially a security play here in the future? Pulling in that information from security infrastructure and then correlating and saying, “Hey, you may have an issue here, here, and here.”

Nitin Kumar: Yes and no. There's definitely a play with security. The requirements for a security play are a lot of audit requirements, long-term storage, and stuff like that. Our plan, at least for the next two to three years, is to integrate with SIEM and SOAR systems and make use of their analytics.

Tim Bertino: Okay. That makes sense. 

David Varnum: You talk a little bit about the chat and communicating with Selector. We saw that Joe made a configuration change that potentially caused an issue. Can you interact with the chat and say, “What is the change that Joe made? Was this change approved through a change request?” Things like that.

Nitin Kumar: Yes. I didn't have … so you can do two things. The configuration: now that you know Joe made the change, Joe's changes are also sitting in a data store inside the platform, so you can query it. You can say, "Show me all the changes made by Joe in the last 10 hours." All of those will show up on your platform. That's number one, and you can still stay on Slack and get all that information.

This is an example of the correlation graph. Of course, we're not going to expose this to the user. This is the correlation graph the system built from all those events that happened. It's all rooted at this information, and this is the real version of the picture I showed in the PowerPoint. Then, if I were to click on this … now you get a little more of a deep dive into that point in time, 9:42 to 10:42. This is the state of your system at that point. Then, this is where I go. I'll expand this. You can see this is what Joe did. Joe did this on L150: set interface down e114, and the username is Joe.

This is the query that’s going on. Now you are in developer mode. At this point in time, you know exactly what’s going on. Somebody is having a chat with Joe, but you want to find out more. As an investigator, you can now play with this query over here and do different things. You can browse around.

Again, our experience is inspired by mine as a developer. When I see software crash, I have a core dump. I take the core dump and feed it into a debugger. You don't know what you're looking for, but a debugger gives you all the tools. I can highlight a variable, and it tells me what happened over here. I'm still the one looking, but the debugger tool has made it easy for me to browse for information.

Once the alert has fired in our system, you can now browse for this information. This is one way. Now I know L150 is … so here, this is all that happened. These are all the logs that got parsed and got correlated at this point in time. I can look at this log, and I can see what this message looks like. Again, maybe there are millions of logs happening over there. I'm not inundated by any of that, and I'm able to see my logs right away. I'm able to see the application metrics, like this loss information. As you can see, at this point in time the configuration happened, and at this point in time, in the bottom window, you can see packet loss between these two applications.

These are the underlying metrics. These are the metrics that the correlator had access to, and it built that graph over there.

Jordan Martin: When you said that the correlation graph wasn’t exposed to the end user, are you just saying that it wasn’t exposed in Slack when the message was given? Like the consumer of your product can go to that page [crosstalk] because I think there are a ton of things that can be learned.

You're looking at that winning event. [crosstalk] Maybe I don't fully understand. [crosstalk] We got to this conclusion; let me look at all of them, all of the events, or all of the related metadata. That would help me, as a human, to maybe train it a little bit better.

Nitin Kumar: Yes 

Jordan Martin: Okay. So, they do have access to it. It’s just not the first thing they see. They’re going to see something more natural. 

Nitin Kumar: I'm just used to saying that because, as a startup, we build in phases. There was a point in our life when we would throw that at the user, and they were like, "What is this?" We've worked on that. We worked on putting layers of software on top of that graph just so you're not exposed to it; you're exposed to that chat message we just saw, but the information is there. As I said, we went to the portal from the Slack message. You go to this portal. Depending upon what your privileges are as a user, read-only users might not have access to all the charts. But as an admin, you have access to all of this.

Jordan Martin: When you said that I just didn’t know if you’re looking like at a Dev version of this or something where we were … [crosstalk] so that’s what a user will see if they click on it if they have the rights to do so. 

Nitin Kumar: Yeah.

Pete Robertson: And it’s really important because, I mean, especially for all of us sitting in the room here, we’re engineers. We don’t just have blind trust. We always want to understand how something was determined. We want to know what’s going on kind of behind the scenes. And that’s been generally a challenge in industry of … getting people to adopt automation or AI or ML is that feeling of giving up control. But the more you can reassure that, as I look at that, like this makes a lot of sense to me. This is putting together and connecting the dots that I would have done manually. It’s going to instill confidence in the outcome.

Nitin Kumar: Yeah.

David Penaloza Seijas: It's still going to help though, because maybe you want to see all of it. You want to know how it was correlated. But some other users just get an alert and want to know the quick details, so they'll just look at Slack. I have a ton of [inaudible] errors, and people have evolved to the point that they simply come to me: I don't care about the details. I only want the sausage, and this is what you're going to give me. Don't give me the details. Just simplify it. It is addressing more than one, well, audience. And the intent is clear. If you want to know more, get it here. But this is what you need, at least if you want to keep yourself informed on what's happening. I think it's a valuable thing.

Nitin Kumar: Those 247 people I showed on the chat in the other channel earlier, all of those 247 people are never going to go to the portal, but maybe five of them are.

Audience member: Joe. [laughter] 

Nitin Kumar: Including Joe. 

Audience member: Joe’s too busy at HR right now. [laughter]

Nitin Kumar: Yes. Out of those 247 people, five want to go to the portal. You should see other Slack channels where we have engineers. They don't issue natural language queries. They know exactly what they want. They issue these long queries, and they get their data because they've learned the platform. They understand what the query language looks like.

People ask me why we called the company Selector. The reason is that our data is exposed using a SQL query. We felt that SQL is a good query language, and it lets you query any kind of information. There's no need to invent a new query language. SQL is there. Let's just use SQL as our … if you look at our queries, they look very similar to SQL. The most common statement is a select statement. If you ask, "What is SQL?" the first keyword you learn is select. So that's why we call ourselves Selector. That's just trivia.

Jordan Martin: We’ve seen what you do. Super interesting. But I want to know what you think is coming. What’s next?

Nitin Kumar: Yeah. Three things, because, being the CTO, I have charted a roadmap for our company for the next two years. The first thing I still feel we want to get to … I want to say a self-service model, but not self-service in the typical sense of log into a portal and create an account, not in that kind of fashion. Today our solution engineers have to do a lot of background YAML coding to set this up. We have an internal language that we call the Selector meta language, S2ML. We use GitOps to deploy all of that. There’s a lot of CI/CD that happens, so it’s still code-oriented. It’s not code as in C code or Python code; it’s YAML code. But code is still code.
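Purely as a sketch of what “YAML code is still code” means in practice, here is what loading and sanity-checking a config file before a GitOps pipeline applies it could look like. The file layout and key names are invented for illustration; they are not the actual S2ML format.

import sys
import yaml  # PyYAML

# Hypothetical top-level sections; the real S2ML schema is not shown in the presentation.
REQUIRED_KEYS = {"sources", "correlations", "alerts"}

def validate_config(path: str) -> dict:
    """Load a hypothetical S2ML-style YAML file and check its top-level sections."""
    with open(path) as fh:
        config = yaml.safe_load(fh) or {}
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing sections: {sorted(missing)}")
    return config

if __name__ == "__main__":
    cfg = validate_config(sys.argv[1])
    print(f"OK: {len(cfg['sources'])} sources, {len(cfg['alerts'])} alerts")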

The first thing I want to get to is that the platform should be stood up completely via the portal. Completely. I can point and click, and the stuff will happen. You don’t have to log into a machine and do anything. Simplicity. Everything has to be point and click. That’s number one.

Number two: we want to address more predictive use cases. Forecasting is a different area of machine learning and AI that we want to get into. We didn’t start with forecasting and predictive technologies right away because you first need to understand more about your customers. What is it they’re trying to predict?

This year we are going to be focusing on forecasting. There are different kinds of forecasting that are interesting to customers. For example, this one link is going to get saturated in the next 40 days, so how do you get ahead of that right away? Or if somebody is provisioning VLANs on an interface: today you have 10 VLANs on this interface, and likely in the next quarter you’re going to get to 2,000 VLANs or some number of VLANs on this interface. At that point, you need to be moved to a different port.
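A minimal sketch of the kind of trend-based forecast being described: fit a linear trend to recent utilization samples and estimate how many days remain before the link saturates. The sample data and threshold are made up, and a production forecaster would be far more sophisticated than a straight-line fit.

import numpy as np

# Hypothetical daily peak-utilization samples for one link, as a fraction of capacity.
days = np.arange(30)                      # last 30 days
util = 0.55 + 0.004 * days + np.random.default_rng(0).normal(0, 0.01, 30)

# Fit a straight line: utilization ~= slope * day + intercept.
slope, intercept = np.polyfit(days, util, deg=1)

SATURATION = 0.95                         # treat 95% utilization as "saturated"
if slope > 0:
    day_of_saturation = (SATURATION - intercept) / slope
    days_remaining = day_of_saturation - days[-1]
    print(f"Projected to hit {SATURATION:.0%} in about {days_remaining:.0f} days")
else:
    print("Utilization is flat or falling; no saturation forecast")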

Forecasting along those lines … the biggest forecasting opportunity is that people are doing multi-cloud networking. When you do multi-cloud networking, you have to deploy a lot of routing constructs in cloud providers. You’ll deploy a transit gateway. You’ll deploy a load balancer service. You always start cheap because you don’t know whether whatever you’re doing is going to stick or not. You’re also just trying it out, so you don’t want to pay for a transit gateway that can hold a huge number of routes; you’ll start with a smaller route limit. Of course, once it’s deployed, you forget about it. Your service sticks, and suddenly your transit gateway just keels over and dies. It’s not AWS’s fault, because that’s what you paid for, and you just forgot about it. You need to deploy a bigger transit gateway. You need to pay AWS a little bit more.

Selector is going to predict that [your] transit gateway capacity is going to run out in the next whatever, 50 days. You better start getting that budget approved so that you can deploy the next level of transit gateway.

David Penaloza Seijas: That would be wonderful. Everybody tries to get the budget approved at the last minute rather than … [laughter]

Nitin Kumar: Right. We’ve seen a few outages, not us, rather our customers have had outages, where they just forgot that they had deployed this transit gateway when they were trying things out. How does this cloud on-ramping work? I’ll run some applications over there, and of course, it runs great. Then everyone’s happy, and they keep getting more users onto it. Then they have an outage. Prediction is one of the things that we want to do this year.

David Penaloza Seijas: I got self-service model, more predictive use cases, and … 

Nitin Kumar: As a startup, I only have an intuition that this is what we want to do. But we always build with customers. If you build in a vacuum, you will build something that nobody is going to use, or you’re not able to tailor it. This year we want to get into adjacent networking use cases like Kubernetes networking. As I said earlier, Kubernetes is a complex network running under the covers. If we can do this kind of correlation over here, we can do it over there as well. But we want to build this with a customer partner who actually runs a very large Kubernetes network so that we can learn from them. That’s one thing that we want to do.

We want to get into multi-cloud networking. My belief is that a vendor cannot build an analytics solution. If I am a manufacturer of certain things, whatever, pick a laptop, pick a router or a switch, the manufacturer of that piece of equipment cannot build the analytics solution on top of it. They just cannot. They think they can, but they will never have the resources to build the analytics layer on top. Building analytics and observability requires resources. It requires effort. They will never be able to shift budget dollars away from what they are building to the analytics layer. It is always an afterthought. That’s number one.

And anywhere that piece of equipment is deployed, there will be equipment from a bunch of other manufacturers. Whoever the customer is, they will never trust an analytics solution from a single vendor. This is the reason I started Selector. At Juniper I could not … well, I could, but it would not be used. They’d say, “Oh, you’ll always be partial to Juniper.” Whoever the other customers are, they’ll never trust me on that.

Now multi-cloud networking is going along those same lines. You have multi-cloud networks from company A, company B … we know that’s the thing these days. My belief is these multi-cloud networking companies will never have a robust analytics solution. A company like Selector will be that analytics and observability layer on top of multi-cloud networks. That’s another use case that we want to go after, maybe this year. Then again, multi-cloud networking is still nascent this year; it’ll take a few more years to stick. That’s the next use case we’re going to go after.

Rita Younger: It seems to me with all the information you’re gathering, digital twins might be something kind of on the right path.

Nitin Kumar: Yes. We explored digital twins initially. The reason I want to say no, we’re not going after digital twins in the traditional sense, is that it involves a deep understanding of what you are the twin of. If I decided I wanted to be a digital twin of a data center fabric, that’s good business. It’s a very hard problem to solve, but it means I have to invest completely in being a digital twin of that thing. You have to become an expert there, so I’m not sure, as a company, that I want to tie myself down to a given flavor. That’s my honest opinion, but it might change. That’s why, even though we initially thought of doing digital twins, we stepped back.

I’m like, no, it requires a lot of expertise, and then you’ll be in that one vertical forever. I’ve seen certain companies that have been on that path, and they have never been able to transition to a different vertical. There’s probably a lot of money there, and those companies can make a lot of money.

Rita Younger: Thank you. 

Tim Bertino: Have you seen any of your customers take data or alerts that they’re getting out of this platform and use that as information to take to vendor support if they’re having issues … to prove: hey we are having a problem here and here’s the data?

Nitin Kumar: Yeah.

Tim Bertino: Is there a good way to export that to take to vendor support cases or anything? How would they do that?

Nitin Kumar: That requires a certain amount of automation on the vendors’ side to expose APIs into their support systems.

Tim Bertino: Even if it’s not an automation thing; say you have a case open and you’re actively working with a support individual.

Jordan Martin: Do I write a PDF or something?

Nitin Kumar: Yeah. Our service providers have outages. A couple of the ones that we worked with use the usual vendors you know … CAF [laughter] … the usual vendors, and it’s an outage call. Every vendor says, “It’s not my fault, not my fault. It’s that one link. It’s the load balancer. It’s that.”

Then our customer says, “Selector is saying the load balancer went down. So, Mr. F vendor, you are responsible. You have to prove to me why this thing is red. Why is it red? Show me the logs of your device against what Selector is saying went bad. You need to prove it to me.”

As I said earlier, in the blameless postmortem (it’s never really blameless, though), we get used a lot. And vendors also sometimes appreciate this help. I’ve been on the vendor side at Juniper. When I get hauled up and told, “Hey, the router crashed; this line card crashed,” the first thing I want is: “Okay, can you tell me what happened around that time? What piece of code got triggered? If you could just give me all of that information, I’ll get to the root cause faster.”

Vendors appreciate the information that we’re able to provide to them. We give them the logs that were seen by these particular routers. Around that time, these were your metrics; this is what they looked like. We’ve had a lot of success with working with vendors as well.
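One way a customer could package that kind of evidence today, sketched with hypothetical file names and a hypothetical log format: pull the log lines from a window around the incident and save them for the support case. This is not a described Selector export feature, just an illustration of the idea.

from datetime import datetime, timedelta
from pathlib import Path

INCIDENT = datetime(2023, 1, 20, 14, 32)   # hypothetical incident time
WINDOW = timedelta(minutes=15)              # grab +/- 15 minutes of context

def in_window(line: str) -> bool:
    """Assumes each log line starts with an ISO timestamp like '2023-01-20T14:31:07'."""
    try:
        ts = datetime.fromisoformat(line.split()[0])
    except (ValueError, IndexError):
        return False
    return abs(ts - INCIDENT) <= WINDOW

# Hypothetical syslog export gathered from the affected routers.
lines = Path("router-logs.txt").read_text().splitlines()
evidence = [line for line in lines if in_window(line)]
Path("support-case-evidence.txt").write_text("\n".join(evidence))
print(f"Saved {len(evidence)} log lines around the incident")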

Pete Robertson: Earlier on you mentioned, and we had a little bit of discussion around, APM and starting to pull application insights into the model, because obviously the business cares about the application. Obviously there are other solutions out there that are trying to do correlation of the broader infrastructure, say in the data center. How do you maintain your focus or your strategy, understanding that boiling the ocean, trying to pull too much in and make sense of it, usually doesn’t result in the outcome we’re looking for? Yet, obviously, more context and more data mean a more meaningful outcome.

Nitin Kumar: As I said earlier, APM today is focused on how the application is behaving internally, not really on how the application is interacting with the other things it needs to interact with.

Our strategy and focus are going to be how that application works with other applications. I don’t have any intentions of building a Dynatrace killer or … what is Jyoti’s other company? I forget the name now. The first APM product that was built, I don’t have any intentions of building that. However, with today’s APM tools, how applications interact with their databases, how they interact with the Postgres cluster, all of that visibility is completely absent.

Applications bring in a different flavor of that connectivity. There’ll be a lot of HTTP requests, a lot of TCP retransmits. We want to get into the space that is around connectivity. We don’t want to get into what an application does internally, whether this piece of code has to be more efficient. The APM tools in that space know more than us, and they will figure out a way. Just like we don’t compete with the solution that Arista builds to debug an Arista switch, we don’t compete with the solution that Cisco builds to debug a Cisco switch. They have thousands of developers who know more about that thing than we could possibly figure out.

We don’t want to go deep into application monitoring as the term is usually meant, but into how an application works with its ecosystem. That’s the space we will go after.

David Penaloza Seijas: I think that’s a smart move, though, or smart lack of movement, because it often happens that somebody comes up with a product that initially works well. Then they simply try to cover all possible use cases, and you end up not doing any of them with the same depth that you did the first one. Knowing exactly what you’re good at and sticking with it means you’re going to be known for something. That’s different from, oh, they tried everything, and they did one thing well, but another 49 that sucked totally. I think it’s excellent as well.

Nitin Kumar: I think you are defined, as a person and as a company, by what you do not do rather than what you do. Saying “I choose not to do this; I will not do this” is more telling of you as a company. Not because of laziness, not because you don’t have the skills, but because you want to focus on this particular thing: this is what I’m going to do. I’m a firm believer in identifying the things that we will not do.

David Penaloza Seijas: That’s right. You simply need to know which battles to fight and what you do best.

Ed Harmoush: If you were, say if you were God for a day, and you could force all the other network vendors to implement some sort of telemetry system by your definition to best help Selector AI, what would you implement? What would you tell them to do or force them to do in this case?

Nitin Kumar: Let me think. [laughter]

David Penaloza Seijas: It was like an interview question. [laughter]

Nitin Kumar: That’s fine. Let’s take a vendor of a networking flavor. My ask of the vendors would be [to] make it easy for me to pick and choose. There needs to be a protocol. Maybe gNMI was that protocol. Define a protocol by which an analytics system can interact with other devices. We’ve created BGP; we’ve created OSPF over the years … how routers and switches talk to each other. Today, there is no protocol of how an analytic system, like us, talks with other things. We need to first agree that a protocol needs to be built. Maybe it’s gNMI; maybe it’s not. 

Once that protocol is built, every part of it should be specified, so I know what the control workflow is and what the data workflow is. There is no RFC when it comes to this today. My ask of the vendors would be to collaborate: let’s build an RFC; let’s build a protocol in which this information exchange is prescriptive, not open to interpretation, so that everyone benefits. That would be my take if I were God.
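To make the gNMI point concrete, here is a sketch of what a standardized, machine-readable telemetry subscription might look like. The structure loosely mirrors a gNMI SubscriptionList with OpenConfig-style paths, but it is built as plain data for illustration rather than with any particular library, and the exact field names are not from a spec.

import json

# Loosely modeled on a gNMI SubscriptionList; field names here are illustrative.
subscription_request = {
    "subscribe": {
        "mode": "STREAM",
        "subscriptions": [
            {
                "path": "/interfaces/interface[name=Ethernet1]/state/counters",
                "mode": "SAMPLE",
                "sample_interval_ns": 10_000_000_000,   # every 10 seconds
            },
            {
                "path": "/network-instances/network-instance[name=default]/protocols"
                        "/protocol[identifier=BGP][name=bgp]/bgp/neighbors",
                "mode": "ON_CHANGE",
            },
        ],
    }
}

# An analytics system and a device would exchange something like this over a well-defined protocol.
print(json.dumps(subscription_request, indent=2))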

David Penaloza Seijas: Would you hire him, Ed? [Laughter]

Tim Bertino: I think you could run for office on that platform.

Nitin Kumar: Talking about running for office, an interesting anecdote on that. People ask me what the process for starting the company was like. My analogy is that it’s almost like running for office, because you pitch to investors and you pitch to customers, and they’re very different pitches. For investors, you have to pitch a certain way, because you have to pitch the long-term vision. You cannot pitch tactical stuff. If you pitch the long-term vision to a customer, they’ll tell you to get out of here. It’s almost like running for office. You need to run, you need to win the primaries, so you cater to one part of your base.

Once you win the primary and win the nomination, you have to move away from that position to a different position [laughter] so you can become president. It’s the same when you need the VC funding. You pitch a certain way. Once you get that … it’s not lying. I’m not trying to say that this is lying; it’s just a different perspective.

You pitch to the VC in a certain way. Once you start, you get the money. You start the company. You start pitching to customers. It takes a while to get to a different pitch mode. You start with: I am going to do this … I’m going to do this. The customers are like, yeah. Whatever. You need to be more grounded when you pitch to customers, so it’s almost like running for office.

[Video text]

GestaltIT.com

TechFieldDay.com

Explore the Selector platform