Growing up in America meant being bombarded with marketing messages that border on the ridiculous. You can’t go a full day without seeing an ad for some anti-aging cream, a yogurt containing particular bacteria to boost your immune system, or an energy drink promising you will grow wings.
The world of tech is no different. Everyone seeks an edge and will stretch the truth with marketing messages. Look no further than the phrase “full-stack observability” used by companies whose products don’t offer network monitoring. Leaving the most critical piece of infrastructure out of your platform and still calling it “full-stack” is a bold move.
In simple terms, observability tools are an evolution of traditional legacy monitoring platforms and a must-have for anyone responsible for maintaining and monitoring a modern, globally distributed architecture. There are a plethora of observability tools on the market today, each with its own set of features and capabilities, and each with its own definition of “observability.” And those definitions come wrapped in messaging that stretches the truth about usability, affordability, and scalability.
While a decent observability platform will help you monitor and troubleshoot complex systems, it is crucial to understand that every observability platform is flawed. To better understand where and why observability platforms fail, we must start at the beginning.
A Brief History of Network Monitoring
At some point during the previous century, network monitoring was born. It’s hard to say exactly when, but it evolved in the same timeframe as ARPANET. Engineers collected metrics and logs to help respond to events and to debug and troubleshoot network performance.
Fast forward a few decades to when internet usage was increasing exponentially every month, thanks to everyone getting hundreds of hours online for free. In time, networked systems grew in complexity, shifting from servers self-hosted in closets to globally distributed systems hosted by cloud providers. The result was a shift in monitoring methods, from simple metric collection to something marketed as “visibility.” This technique groups similar metrics to find correlations between events and systems.
In time, the marketing message shifted again, this time to “observability.” The word itself is borrowed from control theory, which offers a classical definition of observability: “…the ability to infer the internal state of a system based upon the external outputs.”
Search for a definition of observability today and you will uncover dozens of candidates, as each software vendor has sought to leverage the latest buzzword in its marketing materials. Since each company gets to define what observability means for its customers, the result is market confusion. If we apply observability to these definitions themselves, we will discover that the internal state of observability is “many marketing messages.”
But those messages often fail to help their target audience — system administrators, who keep corporate IT running daily. Sysadmins want to demonstrate their value, be productive, and not feel like a failure every time Adam in Accounting opens a ticket because something is “running slow.”
Observability Demystified
Let’s think briefly about what observability is and what it is not. Observability is a superset of monitoring, as you cannot infer a system’s health without measuring the outputs somehow. Let’s take my favorite example — a trash bin.
A trash bin, by default, lacks many external outputs. The most critical metric a trash bin should communicate externally is “I’m full.” But you often only know a trash bin is full by opening the lid or door (a physical constraint). So we need a way to monitor for fullness. A sensor to measure weight would work, except not all trash weighs the same or has the same shape, so we need additional metrics for fullness. We can add these metrics and claim the trash bin is observable, but what we have really done is make the trash bin observable for our specific requirements, not necessarily the requirements of someone else.
Simply stated, observability is a way to infer a system’s health through the outputs and metrics provided. But you also need the ability to adapt the metrics collected. In modern observability platforms, this is often done by applying tags to your systems, allowing you to slice the output data and return a filtered view.
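As a rough sketch of what that looks like in practice, here is a minimal Python example in which each metric sample carries tags and a small helper returns a filtered view. The `MetricSample` structure and `filter_by_tags` helper are hypothetical, for illustration only, and not any particular vendor’s API.

```python
from dataclasses import dataclass, field

@dataclass
class MetricSample:
    """One observed output from a system, plus the tags that give it context."""
    name: str                                  # e.g., "bin.fill_pct"
    value: float
    tags: dict = field(default_factory=dict)   # e.g., {"site": "hq", "floor": "3"}

def filter_by_tags(samples, **wanted):
    """Return only the samples whose tags match every requested key/value pair."""
    return [s for s in samples if all(s.tags.get(k) == v for k, v in wanted.items())]

samples = [
    MetricSample("bin.fill_pct", 91.0, {"site": "hq", "floor": "3"}),
    MetricSample("bin.fill_pct", 12.5, {"site": "hq", "floor": "1"}),
    MetricSample("bin.weight_kg", 7.8, {"site": "branch", "floor": "1"}),
]

# Slice the output data: only the bins on floor 3 at headquarters.
for sample in filter_by_tags(samples, site="hq", floor="3"):
    print(sample.name, sample.value, sample.tags)
```

The specifics will differ between platforms, but the pattern is the same: the outputs you collect only become useful once tags let you carve them into the view you actually need.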
However, observability by itself is not the solution to all your problems. Anyone telling you otherwise is trying to sell you something.
The Fundamental Problem with Observability
Data without context is useless. That’s the problem.
It’s not a new problem, either. The original network monitoring tools collected metrics directly from devices, so it was easy to know when a router or switch had a spike in CPU usage. What was harder to understand was why the spike was happening in the first place. That’s the context missing when you simply observe an output from a system.
Shifting to a monitoring platform offering “visibility” into your systems gave you context, but you had to add the metadata yourself by manually correlating groups of metrics. Today, we use tagging to provide context for an observability platform. Lack of context has always been an issue and is, thus, the fundamental problem with any monitoring platform.
Think of all the ways your observability tools have failed you:
- Blind spots: Observability tools may not capture data and events occurring inside modern networks, especially connections to the cloud, traffic through VPN tunnels, or database bottlenecks.
- Alerting: Some observability tools can’t alert on what they can’t see, and even when they have visibility, they may not have sufficient capabilities to notify users of issues, forcing customers to rely on third-party integrations. Many such tools are reactive, requiring an event to occur before any analysis is performed, which leads to alert fatigue.
- Root cause analysis: Other observability tools require 27 different dashboards to help users perform root cause analysis. Often, a tool is not collecting the specific metric needed by default, so users respond by collecting as much data as possible, which slows analysis even further.
- Complexity: Some observability tools reach the upper bounds of their scalability sooner than anticipated. Modern globally distributed networks are complex and growing, and your tools need to scale to match. Many observability tools also fail to consider that data collected from different sources arrive at different scales (aggregated versus raw), leading to poor analysis.
- Cost: Observability tools promise a low price to get started, but costs quickly add up as additional packages and extensions become necessary, not to mention the extra storage costs from collecting as much data as possible because you don’t know what you are looking for.
In short, the “unknown unknowns” are the Achilles heel for these systems. They require a lot of manual configuration and updates to observe the correct things, which change with time. While these systems promise the ability to be proactive in your monitoring, you spend too much time in reactive mode first.
Therefore, the next generation of network monitoring platforms must include contextual observability as a core offering. This emphasis on context is the most significant shift in modern network observability. Collecting data is no longer enough; understanding the context in which data is collected is crucial.
NextGen Network Observability Platforms
The next generation of network observability platforms must (1) be genuinely data-centric and (2) provide the context necessary for actionable insights.
Being data-centric is crucial: your observability platform must be able to collect any data from any system, anywhere. The collected data is a permanent asset, as applications and systems come and go over time. The data is then transformed through advanced analytical techniques, such as machine learning classification models, which allow the platform to find correlations otherwise undiscovered by traditional monitoring. In other words, with a bit of math, we can uncover some of the unknown unknowns. No advanced knowledge is necessary.
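To make “a bit of math” less abstract, here is a minimal sketch of the kind of analysis involved, assuming the platform has already collected per-minute metrics from two unrelated sources. It uses scikit-learn’s IsolationForest as a stand-in for whatever model a real platform would run; the metric names and numbers are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Pretend these are per-minute metrics federated from two unrelated sources:
# switch CPU (%) and application request rate (req/s).
cpu = rng.normal(35, 5, 500)
requests = rng.normal(200, 20, 500)

# Inject a simultaneous spike across both metrics at one point in time.
cpu[250] = 55
requests[250] = 320

features = np.column_stack([cpu, requests])

# An unsupervised model scores each minute of combined data; -1 marks outliers.
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(features)

print("anomalous minutes:", np.where(labels == -1)[0])
```

The model doesn’t know or care which team owns which metric; it simply surfaces the minutes where the combined data looks unusual, which is exactly the kind of correlation a human staring at two separate dashboards tends to miss.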
Context is provided through the same analytical techniques, which infuse metadata to help correlate different data points and find the true root cause. For example, a traditional monitoring tool might report that a network switch is at 80% CPU utilization. The next generation of network observability platforms will not only report the 80% CPU utilization but will correlate it with a spike in application request rates and provide historical context for anomaly detection.
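A minimal sketch of that correlation-plus-history step, assuming the platform exposes the two series as aligned arrays (the series names and values below are hypothetical):

```python
import numpy as np

# Hypothetical aligned, per-minute series collected by the platform.
switch_cpu = np.array([41, 43, 40, 44, 42, 45, 43, 42, 44, 80], dtype=float)
app_requests = np.array([510, 540, 500, 550, 525, 560, 535, 520, 548, 990], dtype=float)

# Context 1: how strongly do the two signals move together?
correlation = np.corrcoef(switch_cpu, app_requests)[0, 1]

# Context 2: is the latest CPU reading anomalous against its own history?
history, latest = switch_cpu[:-1], switch_cpu[-1]
z_score = (latest - history.mean()) / history.std()

print(f"CPU/request correlation: {correlation:.2f}")     # near 1.0 for this toy data
print(f"latest CPU z-score vs. history: {z_score:.1f}")  # far beyond a 3-sigma alert line
```

The 80% reading on its own is just a number; paired with the request spike and its own history, it becomes a story about demand, not a faulty switch.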
But these NextGen tools will use analytics to go even further. They can build forecast models to help predict network bottlenecks before they happen. They will automatically create dependency mappings and show you the root cause along with all the nodes affected. They will also provide insight into the business impact of an incident, helping you understand an outage’s effect on revenue, customer satisfaction, and overall operations.
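For the forecasting piece, even a simple trend fit illustrates the idea: extrapolate a utilization series and estimate when it will cross a capacity threshold. A real platform would use far more sophisticated models; the link utilization numbers below are invented for the sketch.

```python
import numpy as np

# Hypothetical daily peak utilization (%) of a network link over two weeks.
days = np.arange(14)
utilization = np.array([52, 53, 55, 54, 57, 58, 60, 61, 63, 64, 66, 68, 69, 71], dtype=float)

# Fit a simple linear trend as a stand-in for a real forecasting model.
slope, intercept = np.polyfit(days, utilization, 1)

# Estimate when the link will cross a 90% capacity threshold.
threshold = 90.0
days_until_threshold = (threshold - intercept) / slope

print(f"trend: +{slope:.2f}% per day")
print(f"projected to reach {threshold:.0f}% around day {days_until_threshold:.0f}")
```

The point is not the specific model; it is that the platform answers “when will this become a problem?” instead of waiting for the problem to page you.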
The result will be the breakdown of those corporate data silos between different monitoring and observability platforms. You won’t need twenty-seven unique dashboards to discover your root cause.
Summary
Anyone who has built their own monitoring system will tell you it is not fun to be in a meeting where you are asked, “Why didn’t your system catch the issue?” The reality is that no one, not you and not the legacy monitoring vendors, will capture everything you need at the precise times you need it, no matter how many tags or customizations you add.
Even if your legacy tools can capture every piece of data possible, correlation and root cause analysis remain exercises for the user to complete, jumping from one dashboard to the next and hoping each time that the next dashboard will be the one showing the root cause.
The next generation of monitoring and observability tools will fundamentally differ from the legacy tools and their years of false promises. These new platforms will federate data from all available sources — formatting, normalizing, and automatically labeling incoming information. They will allow for real-time analytics where anomalies are detected and correlations are discovered, and provide inputs to automation tools for alerts and service management.
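As a toy illustration of what “normalize and automatically label” might look like, here is a sketch that maps two differently shaped source records into one common schema. Every field name, source name, and record here is invented; it is a sketch of the pattern, not any platform’s actual pipeline.

```python
def normalize(record: dict, source: str) -> dict:
    """Map a source-specific record into one common, labeled schema (illustrative only)."""
    if source == "netflow":
        return {
            "timestamp": record["ts"],
            "metric": "bytes_transferred",
            "value": float(record["bytes"]),
            "labels": {"source": "netflow", "device": record["router"]},
        }
    if source == "apm":
        return {
            "timestamp": record["time"],
            "metric": "request_latency_ms",
            "value": float(record["latency"]),
            "labels": {"source": "apm", "service": record["svc"]},
        }
    raise ValueError(f"unknown source: {source}")

# Two records with different shapes land in the same schema, ready for shared analytics.
events = [
    normalize({"ts": "2024-05-01T12:00:00Z", "bytes": 10240, "router": "edge-1"}, "netflow"),
    normalize({"time": "2024-05-01T12:00:01Z", "latency": 250, "svc": "checkout"}, "apm"),
]
for event in events:
    print(event)
```

Once everything lands in one labeled shape, the real-time analytics, anomaly detection, and automation hooks all have a single stream to work from.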
Modern observability platforms will finally provide vertically integrated platforms to deliver the single pane of glass we’ve all been waiting for and pave the way for revolutionizing observability.