Observability isn’t about Metrics. It’s about Understanding

If you’ve worked in monitoring or observability for any length of time, you’ve probably heard some variation of this:

“We need more visibility.”

And the usual response is pretty predictable:

  • More metrics
  • More logs
  • More traces

Basically, more data. Which is great, because collecting data is something we’ve gotten very, very good at. Collecting the data isn’t the hard part: understanding it is.

Which leads to the uncomfortable truth universal among all observability solutions…

Step One is the Easy Part

Modern observability tools do a fantastic job of collecting data. You can set up pipelines for:

  • Metrics
  • Logs
  • Traces

and before long you have dashboards full of charts doing their best impression of something useful. You look at it, glean some information, make an assertion, confirm your findings with the team that owns the system, and BINGO! – your solution works.

Step one, complete! Uncork the champagne, pour a round for everyone, order the ubiquitous pizzas, whatever floats your boat.

Unfortunately, that is also where a lot of teams stop. This is only the very first step in monitoring a complex infrastructure. Generally speaking, this would not be a problem if step one actually solved anything.

The reality is that it doesn’t.

Step Two is Where Things Get Interesting

The real challenge is not getting the data: it’s making that data useful for the people who are supposed to use it.

And that is where things start to fall apart, because “the people” is not just one group.

You are dealing with:

  • Someone at the help desk trying to fix something from five minutes ago
  • An engineer who wants to dig into the root cause of an incident
  • A manager who wants to know if this is affecting more than the source of the incident
  • And at least one person who opened the dashboard, stared at it, and quietly closed the tab

I have been that last person more than once. And it’s usually on something I built. That’s not a great feeling, but it is an honest one.

And it points to a bigger issue.

The Bake-Off Reality

Over the years I have sat through more “bake-offs” than I can count. Some as the engineer evaluating tools, some as the one representing the vendor being evaluated. And you start to notice a pattern pretty quickly.

Every platform looks good in a demo. Many look great when the telemetry starts streaming in.

Most of them can:

  • Ingest data
  • Store data
  • Visualize data

That part is expected now. If a solution cannot do that, it’s not a solution, it’s just software.

The part that is harder to evaluate, and easier to miss, is this:

Can someone who did not build this actually understand what they are looking at?

That is where the real differences start to show up. While you have the pre-sales engineers on the line, you are seeing answers to your questions, but not necessarily the answers that the consumers of this tool at your organization are looking for. That’s a huge (and under-discussed) delta.

And it is also where a lot of tools quietly fall apart once they leave the demo/staging environment.

Metrics Don’t Help if They Don’t Make Sense

I’ve been in the business of monitoring and observability for a long, long time and there are facts that ring true across all vendors, solutions, and topics.

Metrics: A CPU graph by itself does not solve anything. Since the first days of virtualization, CPU has been at a premium, with multiple guests running on hosts that all share the same physical processors. Containerization has further muddied the waters. Docker hosts frequently report CPU numbers as the sum of the percentages for all cores. That means you could have a 6-core system reporting 547% CPU utilization. Without knowing the core count, that data point is worthless.

Logs: A stream of logs does not explain what broke. Sometimes, even the first log in a stream isn’t indicative of what’s happening. This is further compounded by distributed logging across multi-tiered systems.

Traces: A trace can be incredibly powerful, but only if you already know what you are looking for. You know who typically knows what they are looking for? The developers, and only them. (Unless they are especially nice when they instrument their code.)

Otherwise, it is just more noise, and this is the part we do not talk about enough.

Data without context does not create clarity: it creates work.

Which is usually handed to the same people who were already trying to solve the problem. This is the disconnect between the monitoring engineer (typically you) and the people at your organization for whom you installed this solution (everyone else).
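To make the Docker CPU example above a little more concrete, here is a minimal sketch in Python of what adding context to that number could look like. It assumes the reported figure really is a sum of per-core percentages and that you know the core count. The function names (normalize_cpu_percent, describe_cpu) are made up for illustration and are not part of any monitoring tool.

```python
def normalize_cpu_percent(summed_percent: float, core_count: int) -> float:
    """Turn a per-core summed CPU percentage into a 0-100 figure for the whole host.

    Assumption: summed_percent is the sum of each core's utilization,
    so a fully busy 6-core host would report 600.
    """
    if core_count <= 0:
        raise ValueError("core_count must be a positive integer")
    return summed_percent / core_count


def describe_cpu(summed_percent: float, core_count: int) -> str:
    """Render the same data point as a sentence a non-specialist can act on."""
    normalized = normalize_cpu_percent(summed_percent, core_count)
    busy_cores = summed_percent / 100  # each 100% is roughly one fully busy core
    return (
        f"{normalized:.0f}% of total CPU capacity "
        f"(about {busy_cores:.1f} of {core_count} cores busy)"
    )


print(describe_cpu(547.0, 6))
# prints: 91% of total CPU capacity (about 5.5 of 6 cores busy)
```

The arithmetic is trivial, which is rather the point: "547%" and "91% of capacity" are the same measurement, but only one of them tells the person reading the dashboard how worried to be.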

Observability as Translation

At some point I stopped thinking about observability as a purely technical discipline. Yes, of course, there’s the technical aspect of getting the telemetry into the solution, but there’s the next phase, and it feels a lot more like a translation problem.

You are taking:

  • System behavior
  • Raw signals
  • A whole lot of moving parts

and turning that into something a human can understand and act on… quickly. And like any translation, the audience matters.

  • What works for an engineer does not always work for a help desk analyst.
  • What works for a help desk analyst probably does not land with leadership.

If you hand everyone the same dashboard and call it done, you have not solved the problem. You just moved it somewhere harder to see. Worse, you may have introduced a number of red herrings to the mix because someone does not realize that dropped packets on a router interface with QoS configured are intentional, and that TCP was designed to retry. (Sorry, I’ll step off my network engineer soapbox now.)

What Good Actually Looks Like

Good observability starts with a different question. It’s never “What can we collect?”, but “Who needs to understand this, and what do they need to do with it?”

If you pivot your thinking, things change.

  • Dashboards get simpler and more targeted to needs/roles
  • Signals become clearer because you’re taking the time to sift the wheat from the chaff
  • Noise starts to fade into the background because (hopefully) you’ve tuned your thresholds as part of your alerting scheme

If you take the time to do this, people spend less time guessing and more time actually solving problems.

Which, for most organizations, would be a nice upgrade.

Where this is Going

This is something I am going to come back to throughout this series, because it is not just about observability.

Whether you are:

  • Building dashboards
  • Writing documentation
  • Designing products
  • Supporting a community

The same rule applies: If people cannot understand it, they cannot use it.

And if they cannot use it, it does not matter how good it is.

Although the very best kind of correct is to be “technically correct” (thank you Central Bureaucracy from New New York circa 3009), that may not translate to your audience. That’s what you and your solution are: translators.

Closing Musing

I have spent a lot of time around systems that promise better visibility. The ones that stand out are not the ones with the most data. They are the ones that make that data make sense.

Corollary: The absolute best solutions are the ones that are flexible enough to allow you (the owner) to make visual changes and empower your other teams to design their own visualizations.

Corollary to the Corollary: And the absolute best solutions support #DARKTHEME natively.


This is the kind of problem I enjoy solving: making complex systems easier to understand and more useful for the people who rely on them.

If you are working on something similar, I would love to connect.
