published new releases of the `tracing`, `tracing-core`, and `tracing-subscriber` #rustlang crates today.

of course, there are a few new features and bug fixes. more interesting, though, is that *some* users will see much better performance...and there's a story behind it (1/n)
i was surprised, because i've spent a lot of time trying to optimize `tracing` performance. i think performance matters a lot for observability tools: you're not going to record as much data if you know everything you record is a big perf hit. (3/n)
...and i was a bit embarrassed, because i've often told people that `tracing` shouldn't have a significant performance difference compared to the `log` crate. i've done some benchmarks to back this up! i was starting to worry that maybe these claims had been wrong all along (4/n)
so, what happened? with a lot of help from @oli_obk, i was able to do some digging, and i realized something i think is really interesting. `tracing` *is* well-tuned for performance, but a lot of the optimizations we've done in the past depend on certain assumptions. (5/n)
`tracing`'s original intended use case was for diagnostics in the @Linkerd proxy, and other applications built using @tokio_rs.

tokio is a platform for asynchronous networking applications. it tends to be used in high performance servers and infrastructure like Linkerd. (6/n)
because of this original use case, i've made some assumptions in the performance tuning i've done. in particular, i've made the assumption that *amortized* performance is what we should optimize for. (7/n)
a lot of `tracing`'s optimizations rely on paying some up front cost the *first* time something happens, in return for better performance every other time.

in a server, daemon, or other long-running piece of software, this makes perfect sense! (8/n)
suppose i choose to make an optimization where the first time i hit some event, it takes 10x as long, in return for making every *subsequent* hit take half as long.

if i hit that event 100,000 times over the two weeks my server runs for, this makes perfect sense! (9/n)
but what if i'm writing a batch-oriented program, like a CLI tool? something that runs to completion in a few milliseconds?

maybe i hit that event *two* times. suddenly, the same optimization has made my program much slower! (10/n)
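(to put rough numbers on that made-up example: if the baseline cost of the event is c, the optimized version costs 10c for the first hit plus 0.5c for each hit after that, so it only comes out ahead once 10c + 0.5c(n - 1) < nc, i.e. after roughly 19 hits. 100,000 hits clears that easily; 2 hits doesn't come close.)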
we had done something like this in the way tracing does filtering.

filtering matters to me a lot. a good observability tool should ensure the overhead of data that's not collected is as low as possible. that way, users can instrument liberally, and turn collection on as needed. (11/n)
both `tracing`'s `EnvFilter` and `log`'s `env_logger` implement a kind of filtering where a max verbosity level can be set for multiple targets (typically Rust modules) in the program. this is fairly common in the ecosystem, so we wanted to optimize for it. (12/n)
`log` has a nice fast path where the *most* verbose level that any of those filters would enable is stored globally. that way, we can just check that first, to see if a given `log` macro should be skipped, without having to ask the logger to evaluate the whole filter (13/n)
this approach works nicely with a filter like
"foo=info,bar::baz=warn". neither of these filtering directives enable the "trace" or "debug" levels; the max global level is "info". skipping `trace!` and `debug!` logs is really fast! (14/n)
it falls down a bit if you *do* enable the more verbose levels. something like "foo::bar::baz=trace,quux=warn,zotz=error".

because the "trace" level is set for one tiny module, we are *never* in the fast path. even in `quux` and `zotz`, we have to ask the logger (15/n)
when i was designing `tracing`, i thought we could do better than this. and (IMHO), we did!

we have a filter-caching system where each individual macro stores a filter result in a static. skipping it is a single load & branch, even if `trace` is enabled someplace else. (16/n)
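(for the curious, here's a heavily simplified sketch of the idea. the states and names are made up for illustration; the real machinery is `tracing-core`'s callsite registration and its `Interest` type:)

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// simplified per-callsite cache states
const UNKNOWN: u8 = 0; // not evaluated yet
const NEVER: u8 = 1;   // filter said no: skip, until invalidated
const ALWAYS: u8 = 2;  // filter said yes: record without re-checking

// each macro expansion gets its own static cache...
static CALLSITE_CACHE: AtomicU8 = AtomicU8::new(UNKNOWN);

fn event_enabled(evaluate_filter: impl Fn() -> bool) -> bool {
    // ...so the hot path is a single load and a branch, even if some
    // *other* module has "trace" enabled.
    match CALLSITE_CACHE.load(Ordering::Relaxed) {
        NEVER => false,
        ALWAYS => true,
        _ => {
            // first hit: evaluate the full filter and seed the cache.
            let enabled = evaluate_filter();
            CALLSITE_CACHE.store(if enabled { ALWAYS } else { NEVER }, Ordering::Relaxed);
            enabled
        }
    }
}
```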
this system works great.

the thing is, the *first* time you hit that callsite, you have to evaluate the filter to determine the value in the cache. and you need to store the cached value's address in case you need to invalidate it later.

this takes time. (17/n)
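(that "store the address" part implies something like a global registry of every seeded cache, so a filter change can walk the list and reset them. again, a minimal sketch with made-up names; `tracing-core` has its own real callsite registry for this:)

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Mutex;

// hypothetical registry of every callsite cache that has been seeded
static REGISTRY: Mutex<Vec<&'static AtomicU8>> = Mutex::new(Vec::new());

fn register(cache: &'static AtomicU8) {
    // part of the up-front cost paid the first time a callsite is hit:
    // take a lock and remember where the cached value lives.
    REGISTRY.lock().unwrap().push(cache);
}

fn invalidate_all() {
    // when the filter changes, reset every cache back to "unknown"
    // so the next hit re-evaluates the new filter.
    for cache in REGISTRY.lock().unwrap().iter() {
        cache.store(0 /* UNKNOWN */, Ordering::Relaxed);
    }
}
```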
astute readers have probably already figured out where i'm going with this. the overhead of seeding the cache is amortized over the rest of the program's lifetime. in long running programs, this is great. in @Linkerd, we saw *better* performance versus the `log` crate (18/n)
but those numbers came from pretty long-running load tests that measured the latency of each request. if a request caused a new callsite to get registered in the cache...that was an *outlier*, and we ignored it. overall, the performance was good. (19/n)
`rustc` isn't shaped like that! this overhead was much more meaningful in the compiler, and for other software that runs to completion. (20/n)
anyway, we ended up adding a global maximum level as a faster fast path on top of the callsite cache. the implementation was quite interesting ( https://github.com/tokio-rs/tracing/blob/c9f8b6dbe9cf05c029cf88d532ade9e7d0118ef2/tracing-core/src/metadata.rs#L583-L652) but that's a story for another time. (21/n)
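(the user-visible piece of this is `LevelFilter::current()`. very roughly, the check is now shaped something like this, before the callsite cache is ever touched; `maybe_record` is a placeholder, not the real macro expansion:)

```rust
use tracing::level_filters::LevelFilter;
use tracing::Level;

fn maybe_record(level: Level) {
    // new fastest path: one atomic load of the global max level. if nothing
    // anywhere enables `level`, bail out without consulting (or seeding!)
    // the per-callsite cache.
    if level > LevelFilter::current() {
        return;
    }
    // otherwise fall through to the per-callsite cache check and, on a
    // first hit, the full filter evaluation.
}
```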
what's the lesson from all this? performance, especially in libraries, depends on *assumptions*. sometimes, it's as simple as saying one implementation is "fast" and another is "slow", but this is much rarer than we are willing to admit. (22/n)
we make *guesses* about how the code is actually used. and even if our guesses are *usually* right, there might be a case where those assumptions fall down. this time, for me, that case happened to be...the Rust compiler, and i felt pretty bad about myself for a bit. (23/n)
so, that's my shaggy performance regression story. go download the latest `tracing` ( https://crates.io/crates/tracing/0.1.18) and `tracing-subscriber` ( https://crates.io/crates/tracing-subscriber/0.2.10)...especially if you're writing a short-lived CLI app! (fin)
one last update on this: no matter how hard i try, i simply *cannot* make this change have any impact on Linkerd request latency benchmarks, at all. not even in the highest latency buckets (above the 99.9th percentile). i guess that really drives home the message here.
(which is a bummer! i was hoping it would make the proxy faster!)
also! thanks to the aforementioned @oli_obk, @kneecaw, @eddyb_r, and @mattgodbolt's Compiler Explorer, for copious help optimizing the new fast path