So I'm walking home, irritated and weary, trying to summon the energy to compose another thread explaining "Why Prometheus Is Not An Observability Tool",

or maybe "How To Tell If Your 'Observability' Team Is Actually A _Monitoring_ Team" (hint: you're using metrics/TSDBs),
but then I remembered my vow to stop talking about low-level storage formats and start talking about observability in terms of end-user functionality.

So let's try something new here. ☺️📈🐝
To recap, the technical definition of observability is the ability to understand what's happening inside a system just by observing it from the outside.

You need to be able to understand any system state, even ones you've never seen and couldn't have predicted in advance.
Which means you need to be able to ask any question, in any combination of ways, to tease out what's going on.

And you need to do this *without* shipping new custom code to handle it, because that's the whole point of unknown-unknowns: you don't know. ☺️

With me so far?

OK!
Here is where I normally drop into a bunch of crap about high cardinality, high dimensionality, events and indexes and schemas.

But I think we can make this much simpler. 🤔
Your tool provides observability if you can answer these questions:

🐝 is something wrong?
🐝 what is wrong?
🐝 how is it wrong?
🐝 why is it wrong? what happened to cause the error?
🐝 who all is impacted by the error?
🐝 what {1..n} things do those affected have in common? (see the sketch after this list)
🐝 where in the system is the error or latency coming from?
🐝 is this a transient error, or is it linked to specific characteristics, e.g. size of payload, source, etc.?
🐝 And so many more.
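
To make one of these concrete, here's a minimal sketch (toy data, invented field names, not any particular vendor's API) of answering "what do those affected have in common?" by breaking down wide events along arbitrary dimensions:

```python
from collections import Counter

# Toy wide events, one per request; every field name here is invented.
events = [
    {"status": 500, "region": "us-east-1", "build": "abc9", "user_id": "u123"},
    {"status": 500, "region": "us-east-1", "build": "abc9", "user_id": "u456"},
    {"status": 200, "region": "eu-west-1", "build": "abc8", "user_id": "u789"},
]

errored = [e for e in events if e["status"] >= 500]

# Break down the errors along every dimension and surface the outliers:
for field in ("region", "build", "user_id"):
    print(field, Counter(e[field] for e in errored).most_common(3))
```

The point: any field can be a breakdown key, at any cardinality, with no schema or index declared up front.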

I don't mean to imply that o11y is all about errors; it manifestly is not.
Observability is about following the trail of breadcrumbs, from high-level trends to nitty-gritty raw rows and back again, to deeply understand your system, your code, and your users.
Look closely at those questions again. How would you answer them today?

Most people will say they have a metrics tool like Datadog or Prometheus for the high-level trends (do I have a problem?), and a logging/tracing tool for the low-level details (what exactly happened?).
So doesn't that make them observability tools??

Absolutely not.

Every time you leap from tool to tool in the middle of a single question, you *break* the contract of observability. You stop doing science and start outright guessing, just eyeballing correlations based on timestamps.
Most people aren't gathering canonical logs with any discipline; they're just spewing out random snippets at execution time, which means they can only ever find the things they thought to log and knew to look for in the first place.

Which also breaks observability.
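
By contrast, here's a minimal sketch of the canonical-log pattern, assuming a hypothetical request handler (every name here is invented for illustration): accumulate context onto one wide event as the request executes, then emit it exactly once at the end.

```python
import json
import time

def handle_request(request):
    # One wide, structured event per request, built up as it executes.
    event = {"endpoint": request["endpoint"], "user_id": request["user_id"]}
    start = time.monotonic()
    try:
        event["payload_bytes"] = len(request["body"])
        ...  # do the actual work here
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event))  # emitted exactly once, fully structured

handle_request({"endpoint": "/export", "user_id": "u42", "body": b"..."})
```

Because every request emits the same wide event, you don't have to have predicted the question at logging time; the dimensions are simply there to query.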
It's observability if you can start with high-level trends & aggregates, just like your metrics tool...

then slice and dice and drill down to specific events, like your logging tool...

and also trace them, or zoom back out, or correlate outliers in the rich surrounding context.
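
Here's a hypothetical sketch of that loop over the same kind of wide events (toy data, invented fields again), where the aggregate view and the raw rows come out of one dataset, so there's no tool-hopping mid-question:

```python
from statistics import median

# Toy wide events; fields invented for illustration.
events = [
    {"endpoint": "/export", "duration_ms": 3200.0, "user_id": "u42"},
    {"endpoint": "/export", "duration_ms": 2900.0, "user_id": "u42"},
    {"endpoint": "/home",   "duration_ms": 45.0,   "user_id": "u7"},
]

# 1. High-level trend, like your metrics tool: which endpoint is slow?
by_endpoint = {}
for e in events:
    by_endpoint.setdefault(e["endpoint"], []).append(e["duration_ms"])
medians = {k: median(v) for k, v in by_endpoint.items()}
worst = max(medians, key=medians.get)

# 2. Drill down, like your logging tool: the raw rows behind the trend.
raw_rows = [e for e in events if e["endpoint"] == worst]
print(worst, medians[worst], raw_rows)
```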
Does this help? Is it more meaningful or less than the other explanations? Have I made a compelling case for why it matters, or no?

I assume this is what every vendor is busy building on the backend -- merging together all their tools into one. Believe me, they know it matters.
It's not only that you shouldn't have to pay to store the same data 3+ times (tho you shouldn't).

It's that your data becomes exponentially more powerful when you can move beyond dumb regexps and searches for known unknowns, and start exploring your systems in novel ways.
There are some truly elegant and powerful monitoring and logging tools out there. Kudos to their makers. I mean it.

But they are still building for known unknowns. Canned dashboards. Schemas. Indexes. Text search. "Custom metrics".

This is not observability.