0/ This is a thread about why tracing will gradually replace most logging, at least where distributed or cloud-native architectures are concerned. And we’re going to explore this through the lens of a relational data model.
It’s going to be fun!
Thread: 👇
1/ The best logging is always *structured* logging. That is, logging statements are most useful if they encode key:value pairs which can then be queried and *analyzed* in the aggregate.
(Even for plain, textual logs, NLP and stats can extract basic structure.)
2/ A structured log implicitly defines a *relational table*, with the keys for each attribute defining the columns, and the values for each log line defining rows in this (theoretical) table.
Like this:
3/ And, naturally, there are a number of implicit columns in our table as well. Things like host, timestamp, etc:
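To make tweets 2 and 3 concrete, here is a minimal sketch of how one structured logging statement turns into rows and columns. The log keys (`event`, `customer_id`, `latency_ms`), the host name, and the timestamps are all invented for illustration; a real pipeline would get the implicit columns from its logging agent.

```python
import json

# Hypothetical output of ONE structured logging statement (keys are invented).
RAW_LOGS = [
    '{"event": "checkout", "customer_id": "c-17", "latency_ms": 120}',
    '{"event": "checkout", "customer_id": "c-42", "latency_ms": 340}',
]


def to_rows(raw_lines, host, start_ts):
    """Parse each line and attach the implicit columns (host, timestamp)."""
    rows = []
    for offset, line in enumerate(raw_lines):
        row = json.loads(line)            # explicit columns: the logged keys
        row["host"] = host                # implicit column
        row["timestamp"] = start_ts + offset  # implicit column (fake clock)
        rows.append(row)
    return rows


rows = to_rows(RAW_LOGS, host="web-1", start_ts=1_700_000_000)
columns = list(rows[0])  # the "schema" implied by the logging statement

print(columns)
for row in rows:
    print([row[col] for col in columns])
```

The keys of the logging statement become the explicit columns, and every emitted line becomes one row.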
4/ Now, to be clear, we’re talking about the “abstract idea” of relational tables here, and not actually inserting every log line into MySQL or similar – that would be a disaster at scale. :)
Just think of each line of logging instrumentation as a “table schema.”
5/ Once we realize this, we can write queries with *most* SQL niceties (WHERE filters, GROUP BY aggregations, etc).
But what about “JOIN”? How does *that* work in logging systems? The long answer won’t fit here.
The short answer? “Poorly.” Bummer. :-/
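The WHERE and GROUP BY half of tweet 5 can be sketched with an in-memory SQLite table. This is purely illustrative (tweet 4 still applies: nobody should load production logs into a database like this), and the table name, columns, and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One hypothetical logging statement == one table schema.
conn.execute(
    "CREATE TABLE checkout_log "
    "(ts INTEGER, host TEXT, customer_id TEXT, latency_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO checkout_log VALUES (?, ?, ?, ?)",
    [
        (1, "web-1", "c-17", 120),
        (2, "web-1", "c-42", 340),
        (3, "web-2", "c-17", 95),
        (4, "web-2", "c-42", 410),
    ],
)

# WHERE filter: which checkouts were slow?
slow = conn.execute(
    "SELECT customer_id, latency_ms FROM checkout_log "
    "WHERE latency_ms > 300 ORDER BY ts"
).fetchall()

# GROUP BY aggregation: average latency per host.
avg_by_host = conn.execute(
    "SELECT host, AVG(latency_ms) FROM checkout_log "
    "GROUP BY host ORDER BY host"
).fetchall()

print(slow)
print(avg_by_host)
```

Both queries work because every column they touch was logged by the same service. The JOIN problem starts when the column you need lives somewhere else.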
6/ Why is it a bummer? Well, because when we’re instrumenting a microservice, by definition *we only have access to data from that microservice!*
What about version numbers of peer services? Or request customer_ids? Or downstream feature flags? Surely those could be relevant…
7/ But relevant or not, that data lives *in other services.* Which means it’s not there to log. What’s an eng to do??
Faced with this conundrum, engineers stuck with logs will inevitably/sadly hack something together rather than address the underlying structural issue. (😭)
8/ E.g., have you ever seen a customer_id painstakingly propagated across function and *process* boundaries just so someone can add it to instrumentation?
That’s an error-prone *and* expensive way of implementing log JOINs via app code (rather than automatically via tracing).
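Here is a rough sketch of what that hand-rolled JOIN tends to look like. Everything in it is hypothetical: the two "services" are plain functions in one process, and the `x-customer-id` header name is invented. The point is only that `customer_id` has to be threaded through every hop by hand.

```python
def frontend_handle(request):
    """Hypothetical frontend service: the only place customer_id is known."""
    customer_id = request["customer_id"]
    # Hand-rolled propagation: copy the value into an ad hoc header.
    headers = {"x-customer-id": customer_id}
    return payments_handle(headers, {"amount": request["amount"]})


def payments_handle(headers, body):
    """Hypothetical downstream service: must remember to read the header."""
    customer_id = headers.get("x-customer-id", "unknown")
    return charge_card(body["amount"], customer_id)


def charge_card(amount, customer_id):
    """customer_id is an extra parameter here only so it can be logged."""
    log_line = {"event": "charge", "amount": amount, "customer_id": customer_id}
    return log_line


ok = frontend_handle({"customer_id": "c-17", "amount": 25})
# If any caller forgets the header, the "JOIN" silently degrades.
broken = payments_handle({}, {"amount": 25})

print(ok)
print(broken)
```

Every new attribute you want in the downstream log means another header, another parameter, and another place to forget it.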
9/ When we implement JOIN manually in this way, we are taking on *literally the hardest part of distributed tracing instrumentation* (namely, “context propagation”) and trying to manage it via one-off hacks. It doesn’t end well. (TL;DR “use @opentelemetry instead”)
10/ So again, “that’s wasteful.” And ineffective.
The right way to solve this problem is to leverage distributed tracing to perform a much (much) more powerful JOIN.
Let’s imagine that your system looks like this:
11/ Now, when a truly modern observability solution “assembles a trace,” it’s *really* executing a JOIN across the entire *distributed* transaction, and thus populating a wider and more powerful table: one with columns from every Span that participates in the trace.
Like this:
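As a toy version of that JOIN, the sketch below merges spans that share a `trace_id` into one wide row, prefixing each column with the service that produced it. The span shape, service names, and attributes are assumptions made for illustration, not any vendor's data model, and it assumes one span per service per trace, which real traces rarely guarantee.

```python
from collections import defaultdict

# Hypothetical spans; each carries only what its own service knows.
SPANS = [
    {"trace_id": "t1", "service": "frontend",
     "attrs": {"customer_id": "c-17", "latency_ms": 480}},
    {"trace_id": "t1", "service": "payments",
     "attrs": {"version": "v2", "latency_ms": 400}},
    {"trace_id": "t2", "service": "frontend",
     "attrs": {"customer_id": "c-42", "latency_ms": 130}},
    {"trace_id": "t2", "service": "payments",
     "attrs": {"version": "v1", "latency_ms": 60}},
]


def assemble_wide_rows(spans):
    """JOIN spans on trace_id, yielding one wide row per trace."""
    wide = defaultdict(dict)
    for span in spans:
        for key, value in span["attrs"].items():
            # Prefix with the service name so columns do not collide.
            wide[span["trace_id"]][f'{span["service"]}.{key}'] = value
    return dict(wide)


wide_rows = assemble_wide_rows(SPANS)
for trace_id, row in wide_rows.items():
    print(trace_id, row)
```

Note that `frontend.customer_id` and `payments.version` now sit in the same row, even though neither service ever saw the other's data.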
12/ Now, when people think about tracing, they tend to think about this giant table “one trace (or row) at a time.”
Imagine restricting a logging system to display only one log-line at a time. This is just as bad… perhaps worse. And yet it passes for “tracing.” :-/
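The alternative to "one row at a time" is to aggregate over the whole wide table. A minimal sketch, assuming wide rows with invented column names (`frontend.latency_ms`, `payments.version`) and invented values:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical wide rows, one per trace, with columns from two services.
WIDE_ROWS = [
    {"frontend.latency_ms": 480, "payments.version": "v2"},
    {"frontend.latency_ms": 130, "payments.version": "v1"},
    {"frontend.latency_ms": 520, "payments.version": "v2"},
    {"frontend.latency_ms": 110, "payments.version": "v1"},
]


def mean_by(rows, group_col, value_col):
    """Equivalent of: SELECT group_col, AVG(value_col) ... GROUP BY group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {key: mean(values) for key, values in groups.items()}


latency_by_version = mean_by(WIDE_ROWS, "payments.version", "frontend.latency_ms")
print(latency_by_version)
```

Grouping a frontend column by a payments column is exactly the query that per-service logs cannot answer, and that a single-trace view hides.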
13/ It’s really only in the past few years that observability technology has developed to the point that these massive, *distributed* tables can be hydrated both dynamically and in real time.
14/ And all of that data engineering is worth it! Because when the relational tables are as wide as your distributed system is deep, amazing things are possible – and I don’t see how logging will ever be able to catch up.
PS/ For example applications of these sorts of dynamic, relational tables, see any of the following (or play with http://lightstep.com/sandbox )
https://twitter.com/el_bhs/status/1364282343196827650
https://twitter.com/el_bhs/status/1227358990968877056
https://lightstep.com/blog/announcing-lightsteps-change-intelligence/