Networking Field Day 35: Selector AI and the Workings of an LLM

Description

An LLM differs from a traditional function in that it takes an output and imputes, or infers, the function and arguments that would produce it. We first consider how this process works within Selector when an English phrase is converted to a query. We then step through the design of Selector’s LLM, which starts from a base LLM that understands English, is fine-tuned on English-phrase-to-SQL translation, and is then fine-tuned again, on-premises, with customer-specific entities. In this way, each Selector deployment relies on an LLM tailored to the customer at hand.

Video length: 13:56
Speaker: Nitin Kumar, Co-founder and CTO

Transcript

Nitin: The LLM does what is known as imputing. What does “impute” mean? In our past lives we wrote traditional functions: a function takes a set of arguments and produces an output. For example, a function could take in three arguments (A, B, and C) and produce an output. Another function could take in a different set of arguments and produce a different output. Functions are written by humans to perform specific tasks. An LLM, on the other hand, does the reverse: it infers. The capability we’re trying to leverage is that, given an output, the LLM infers the function that maps to that output. Essentially, it does a reverse mapping. If you feed it an output and apply this key technology called a model, it can say, “If you used this function with these arguments, this output would be produced.”
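For readers who prefer code, here is a minimal sketch of the forward-versus-inverse distinction Nitin draws; the function and the imputed query below are illustrative stand-ins, not Selector code.

```python
# A traditional function: humans write it, arguments go in, an output comes out.
def forward_function(a: int, b: int, c: int) -> int:
    return a + b * c

output = forward_function(2, 3, 4)  # forward direction: args -> output (14)

# The LLM runs the mapping in reverse. Given a desired output (an English
# request), it imputes a function and arguments that would produce it:
#
#   "show me port errors in Ashburn"          <- the observed output
#         |  LLM + model (reverse mapping)
#         v
#   query_port_errors(site="Ashburn")         <- the inferred function and args
```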

In a degenerate case, which is computationally inefficient, this model could be nothing but a series of all possible outputs and all possible functions that would be the answer. Clearly, that’s not feasible—there’s not enough computational power to create such a large lookup table. This is where training the model and building it comes into play. How do we build the model? That’s the next layer we need to peel back.
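The infeasible lookup-table version of that model might look like the sketch below; the entries are illustrative and exist only to show why enumeration does not scale.

```python
# The degenerate "model": a literal table from every possible output to the
# function and arguments that produce it. Enumerating every phrase a user could
# ever type is computationally infeasible, which is why a model is trained instead.
DEGENERATE_LOOKUP = {
    "show me port errors in Ashburn": ("query_port_errors", {"site": "Ashburn"}),
    "show me bgp flaps for tenant Splunk": ("query_bgp_flaps", {"tenant": "Splunk"}),
    # ...one entry per conceivable request, which is exactly the problem.
}
```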

The LLM will take in an English phrase, and if it has the right model alongside it, it will output the SQL query that corresponds to that phrase. Building that model is what we call “training.” How does Selector build its networking LLM? We start with a base model. This base model understands English: we use a public-domain model that has been trained on the nuances of English. It also handles tasks like translation between languages.
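As a rough illustration of the English-to-SQL step, the sketch below drives a generic instruction-tuned model through an OpenAI-compatible client; the model name, table schema, and prompt are assumptions for illustration and say nothing about Selector’s actual interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You translate English questions about network telemetry into SQL. "
    "Available table: port_errors(device, site, tenant, error_count, ts)."
)

def english_to_sql(phrase: str) -> str:
    """Ask the model to impute the SQL query that corresponds to the phrase."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-tuned base model would do for this sketch
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": phrase},
        ],
    )
    return response.choices[0].message.content

print(english_to_sql("show me port errors in Ashburn"))
# Plausible output: SELECT device, error_count FROM port_errors WHERE site = 'Ashburn';
```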

Selector’s IP isn’t involved in this part of the model. After we have this base model, we proceed with the first fine-tuning step. Fine-tuning is the process of making a model more intelligent by giving it more examples. In our case, we teach the model how to understand SQL and the nuances of SQL implementation within Selector. We give it enough examples, and it learns from them. Note that this fine-tuning is independent of any customer data. This part of the fine-tuning happens in our cloud because it doesn’t depend on customer data. We train the model and periodically publish it to our deployments.
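The first, customer-independent fine-tuning pass consumes generic phrase-to-SQL pairs; the JSONL layout below follows a common prompt/completion convention and is an assumption, not Selector’s actual training format.

```python
import json

# Generic English-phrase / SQL pairs with no customer data in them: the kind of
# examples the cloud-side fine-tuning pass can learn from.
GENERIC_EXAMPLES = [
    {"prompt": "show me interfaces with errors",
     "completion": "SELECT device, interface, error_count FROM port_errors WHERE error_count > 0;"},
    {"prompt": "count bgp sessions that are down",
     "completion": "SELECT COUNT(*) FROM bgp_sessions WHERE state = 'down';"},
]

with open("sql_finetune.jsonl", "w") as f:
    for example in GENERIC_EXAMPLES:
        f.write(json.dumps(example) + "\n")
```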

The third aspect involves further fine-tuning the model in a customer’s environment using their specific data. For instance, in a multi-cloud networking environment, customers might have tenants, such as a company called Splunk. We need to teach the model to recognize keywords that are specific to that customer’s environment. These customer-specific entities are taught to the model on-premises, as the data cannot leave the customer’s cloud or infrastructure. This completes the whole process.
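The on-premises pass works the same way, but the pairs are generated from entities that exist only in that customer’s environment; the entity values and helper below are illustrative.

```python
# Expand customer-specific entities (tenants, sites) into phrase/SQL training
# pairs. This data comes from the customer's environment and never leaves it.
CUSTOMER_ENTITIES = {
    "tenant": ["Splunk"],
    "site": ["Ashburn"],
}

def entity_examples(entities: dict) -> list[dict]:
    examples = []
    for tenant in entities.get("tenant", []):
        examples.append({
            "prompt": f"show me port errors for tenant {tenant}",
            "completion": f"SELECT device, error_count FROM port_errors WHERE tenant = '{tenant}';",
        })
    for site in entities.get("site", []):
        examples.append({
            "prompt": f"show me port errors in {site}",
            "completion": f"SELECT device, error_count FROM port_errors WHERE site = '{site}';",
        })
    return examples
```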

To summarize, the core model is common across different customers, but the final fine-tuning is done on customer premises using their data. This fine-tuning is essential to ensure the model understands customer-specific entities and environments. Once the model is fine-tuned, it runs in an execution environment called LLOS, which is the runtime environment for the model. The translator layer interacts with this environment, sending English sentences and receiving SQL queries in return. This process allows us to perform tasks like querying errors in specific locations, with the LLM imputing the correct SQL query and returning the desired results.
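Conceptually, the translator layer’s exchange with the runtime looks like the sketch below; the endpoint, payload shape, and field names are assumptions for illustration, since the talk does not describe LLOS’s actual API.

```python
import requests

LLOS_URL = "http://llos.local:8080/translate"  # hypothetical runtime endpoint

def translate(phrase: str) -> str:
    """Send an English sentence to the model runtime and return the imputed SQL."""
    resp = requests.post(LLOS_URL, json={"phrase": phrase}, timeout=10)
    resp.raise_for_status()
    return resp.json()["sql"]

# e.g. querying errors in a specific location:
sql = translate("show me errors in Ashburn")
# The translator layer would then hand this SQL to the data layer for execution.
```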

There’s another aspect of the LLM that I won’t delve into today, which involves summarizing data that comes back. This is more of an English-to-English translation layer.

Audience Member: Can I jump in for a second? This is Remmy. Can you clarify the timeline for training the model on customer data? I assume you first gather the data, then perform model training. How long does this process take from data collection to model usability?

Nitin: This slide should help explain that. These are different deployments: Customer A runs in their cloud, Customer B runs in their cloud, and Selector Cloud runs in our cloud. The bulk of the model training and fine-tuning happens in Selector Cloud. This training happens periodically, and we publish the model to our customers, maybe every week or every other week. Once the model is trained, it doesn’t need to be retrained on customer premises, except for a thin layer of fine-tuning, which takes just a couple of hours.

Audience Member: So, during the solution-building process, your team creates tables based on customer requirements. The customer doesn’t have to worry about this part—your team handles it.

Nitin: Yes, that’s correct. The tables are the source for the outer fine-tuning layer. The local trainer reads these tables and hydrates the model with customer-specific data. The first step in the process is knowing which tables to query based on a customer’s query phrase. We already know the finite set of tables that exist, and the LLM handles the mapping.
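A sketch of that table-awareness step: the deployment’s finite set of tables is known up front, and their descriptions become context the model can map a phrase onto. Table names and descriptions here are illustrative, not Selector’s schema.

```python
# The finite set of tables in this deployment, with short descriptions the
# local trainer can use when hydrating the model.
KNOWN_TABLES = {
    "port_errors": "per-interface error counters by device, site, and tenant",
    "bgp_sessions": "BGP session state and flap history by device and peer",
}

def describe_schema() -> str:
    """Render the known tables as context for mapping a query phrase to a table."""
    return "\n".join(f"{name}: {desc}" for name, desc in KNOWN_TABLES.items())
```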

Audience Member: So, if we didn’t do any training on Customer A, for example, the base model would still know how to handle general queries like “show me port errors.” However, customer-specific entities like “Ashburn” need to be taught to the model locally.

Nitin: Exactly. If a customer has a CMDB with all their devices and sites, the model needs to be trained to recognize those entities. The core model might not know specific customer entities, but that’s what the local training is for.

Audience Member: One other point I’d like to clarify: Selector AI, the platform, doesn’t need direct access to the infrastructure itself, right? It uses the data we feed it to train the LLM and reduce MTTR.

Nitin: The platform does need access to the data, but not necessarily to the devices themselves. If you don’t have a discovery engine, you can still feed the platform data from sources like NetBox, CMDB, or Splunk, and it will do its job. But if you need device discovery, there is a module within the ingest layer that can perform that task.
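For example, inventory can be pulled from a source of truth that already has access and pushed to the platform; pynetbox is the standard NetBox client, while the ingest endpoint and payload below are hypothetical.

```python
import pynetbox
import requests

# Pull device inventory from NetBox, which already has access to the environment.
nb = pynetbox.api("https://netbox.example.com", token="REDACTED")
inventory = [{"name": dev.name, "site": str(dev.site)} for dev in nb.dcim.devices.all()]

# Push it to the platform's ingest layer (hypothetical endpoint, for illustration only).
requests.post("https://selector.example.com/ingest/devices", json=inventory, timeout=30)
```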

Audience Member: So, you don’t need direct access to every device in the network. You can just feed the platform data from things that already have established access, significantly speeding up deployment time.

Nitin: That’s right. If you don’t want to give access to your end devices, you can provide the data, and the platform will still function effectively.

Audience Member: That’s really interesting. Thank you.

Nitin: To minimize drift and hallucinations, we use public-domain software called Chroma DB to store the models.
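For context, the sketch below shows the common pattern for using a vector store such as Chroma to ground a model: store the names and examples that actually exist, then retrieve the closest matches before translation. It illustrates the general technique, not Selector’s internal use of the store.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("known_entities")

# Store the entities that actually exist in this deployment.
collection.add(
    ids=["site-ashburn", "tenant-splunk"],
    documents=["Ashburn is a site", "Splunk is a tenant"],
)

# Before translating a phrase, retrieve the closest known entities so the model
# is grounded in names that really exist, which curbs drift and hallucination.
hits = collection.query(query_texts=["show me port errors in Ashburn"], n_results=2)
print(hits["documents"])
```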

Audience Member: I was asking about the total timeframe—how long it takes to train once you have the customer data, and how much data is needed for effective training.

Nitin: The data you need is the same data you would need to annotate your queries with business context, whether or not you’re using LLMs. So the long pole is generally getting access to that data, which is driven by the deployment environment and security policies. But once the data is connected and available, local training takes about half an hour to an hour.

Explore the Selector platform