published new releases of the `tracing`, `tracing-core`, and `tracing-subscriber` #rustlang crates today.

of course, there are a few new features and bug fixes. more interesting, though, is that *some* users will see much better performance...and the story behind it (1/n)
the story starts with me checking my email on Monday of this week.

i was pretty surprised, and a bit embarrassed, to see that a pull request to switch the Rust compiler from the `log` crate to `tracing` (!!!) resulted in a fairly big perf hit: https://github.com/rust-lang/rust/pull/74726#issuecomment-663915420 (2/n)
i was surprised, because i've spent a lot of time trying to optimize `tracing` performance. i think performance matters a lot for observability tools: you're not going to record as much data if you know everything you record is a big perf hit. (3/n)
...and i was a bit embarrassed, because i've often told people that `tracing` shouldn't have a significant performance difference compared to the `log` crate. i've done some benchmarks to back this up! i was starting to worry that maybe these claims had been wrong all along (4/n)
so, what happened? with a lot of help from @oli_obk, i was able to do some digging, and i realized something i think is really interesting. `tracing` *is* well-tuned for performance, but a lot of the optimizations we've done in the past depend on certain assumptions. (5/n)
`tracing`'s original intended use case was for diagnostics in the @Linkerd proxy, and other applications built using @tokio_rs.

tokio is a platform for asynchronous networking applications. it tends to be used in high performance servers and infrastructure like Linkerd. (6/n)
because of this original use case, i've made some assumptions in the performance tuning i've done. in particular, i've made the assumption that *amortized* performance is what we should optimize for. (7/n)
a lot of `tracing`'s optimizations rely on paying some up-front cost the *first* time something happens, in return for better performance every other time.

in a server, daemon, or other long-running piece of software, this makes perfect sense! (8/n)
suppose i choose to make an optimization where the first time i hit some event, it takes 10x as long, in return for making every *subsequent* hit take half as long.

if i hit that event 100,000 times over the two weeks my server runs for, this makes perfect sense! (9/n)
but what if i& #39;m writing a batch-oriented program, like a CLI tool? something that runs to completion in a few milliseconds?

maybe i hit that event *two* times. suddenly, the same optimization has made my program much slower! (10/n)
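to make the trade-off concrete, here's a quick back-of-the-envelope sketch. the numbers are made up (a baseline cost of 1 unit per hit, 10x for the first hit, 0.5x afterwards), but they show how the same optimization wins big for the server and loses badly for the CLI tool:

```rust
// hypothetical costs, in arbitrary units, for an event whose baseline cost is 1
fn optimized_cost(hits: u64) -> f64 {
    // first hit pays 10x to seed the optimization; every later hit pays 0.5x
    10.0 + hits.saturating_sub(1) as f64 * 0.5
}

fn unoptimized_cost(hits: u64) -> f64 {
    hits as f64
}

fn main() {
    // long-running server: ~50,009.5 vs 100,000 -- nearly twice as fast
    println!("server: {} vs {}", optimized_cost(100_000), unoptimized_cost(100_000));
    // short-lived CLI tool: 10.5 vs 2 -- roughly 5x slower
    println!("cli:    {} vs {}", optimized_cost(2), unoptimized_cost(2));
}
```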
we had done something like this in the way tracing does filtering.

filtering matters to me a lot. a good observability tool should ensure the overhead of data that's not collected is as low as possible. that way, users can instrument liberally, and turn it on as needed. (11/n)
both `tracing`'s `EnvFilter` and `log`'s `env_logger` implement a kind of filtering where a max verbosity level can be set for multiple targets (typically Rust modules) in the program. this is fairly common in the ecosystem, so we wanted to optimize for it. (12/n)
`log` has a nice fast path where the *most* verbose level that any of those filters would enable is stored globally. that way, we can just check that first, to see if a given `log` macro should be skipped, without having to ask the logger to evaluate the whole filter (13/n)
this approach works nicely with a filter like
"foo=info,bar::baz=warn". neither of these filtering directives enable the "trace" or "debug" levels; the max global level is "info". skipping `trace!` and `debug!` logs is really fast! (14/n)
it falls down a bit if you *do* enable the more verbose levels. something like "foo::bar::baz=trace,quux=warn,zotz=error".

because the "trace" level is set for one tiny module, we are *never* in the fast path. even in `quux` and `zotz`, we have to ask the logger (15/n)
when i was designing `tracing`, i thought we could do better than this. and (IMHO), we did!

we have a filter-caching system where each individual macro stores a filter result in a static. skipping it is a single load & branch, even if `trace` is enabled someplace else. (16/n)
this system works great.

the thing is, the *first* time you hit that callsite, you have to evaluate the filter to determine the value in the cache. and you need to store the cached value's address in case you need to invalidate it later.

this takes time. (17/n)
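here's a simplified, standalone sketch of the caching idea. this is *not* `tracing`'s actual implementation (the real one deals with callsite registration, invalidation, and multiple subscribers), but it shows the shape of the fast path and the cost of seeding it:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const UNCACHED: usize = 0; // filter not yet evaluated for this callsite
const DISABLED: usize = 1;
const ENABLED: usize = 2;

// imagine one of these statics generated per `trace!`/`debug!`/... callsite
static CALLSITE_INTEREST: AtomicUsize = AtomicUsize::new(UNCACHED);

// stand-in for walking every directive in something like an `EnvFilter`
fn evaluate_full_filter() -> bool {
    false
}

fn callsite_enabled() -> bool {
    // steady state: a single load and branch, even if `trace` is enabled
    // for some other module
    match CALLSITE_INTEREST.load(Ordering::Relaxed) {
        ENABLED => true,
        DISABLED => false,
        _ => {
            // first hit: evaluate the whole filter, seed the cache, and (in
            // the real thing) register the callsite so the cache can be
            // invalidated if the filter configuration changes
            let enabled = evaluate_full_filter();
            CALLSITE_INTEREST.store(
                if enabled { ENABLED } else { DISABLED },
                Ordering::Relaxed,
            );
            enabled
        }
    }
}

fn main() {
    assert!(!callsite_enabled()); // first call pays the seeding cost
    assert!(!callsite_enabled()); // every later call is just a load + branch
}
```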
astute readers have probably already figured out where i'm going with this. the overhead of seeding the cache is amortized over the rest of the program's lifetime. in long-running programs, this is great. in @Linkerd, we saw *better* performance versus the `log` crate (18/n)
but these were over pretty long-running load tests that measured the latency of each request. if a request caused a new callsite to get registered in the cache...that was an *outlier*, and we ignored it. overall, the performance was good. (19/n)
`rustc` isn't shaped like that! this overhead was much more meaningful in the compiler, and for other software that runs to completion. (20/n)
anyway, we ended up adding a global maximum level as a faster fast path on top of the callsite cache. the implementation was quite interesting (https://github.com/tokio-rs/tracing/blob/c9f8b6dbe9cf05c029cf88d532ade9e7d0118ef2/tracing-core/src/metadata.rs#L583-L652), but that's a story for another time. (21/n)
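conceptually, the check order now looks something like this sketch. it uses the public `LevelFilter::current()` API (the new global max) and `STATIC_MAX_LEVEL` (the compile-time ceiling); the real macro expansion is more involved, and only events that survive both checks go on to the per-callsite cache and, on a miss, the full filter:

```rust
use tracing::level_filters::{LevelFilter, STATIC_MAX_LEVEL};
use tracing::Level;

// purely illustrative helper -- the ordering of the checks is the point
fn event_might_be_enabled(level: Level) -> bool {
    let level = LevelFilter::from_level(level);
    // 1. compile-time ceiling (the `max_level_*` cargo features)
    level <= STATIC_MAX_LEVEL
        // 2. the new global dynamic ceiling: a single atomic load, no callsite
        //    cache lookup, no subscriber involved
        && level <= LevelFilter::current()
}

fn main() {
    println!("TRACE might be enabled: {}", event_might_be_enabled(Level::TRACE));
    println!("ERROR might be enabled: {}", event_might_be_enabled(Level::ERROR));
}
```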
what's the lesson from all this? performance, especially in libraries, depends on *assumptions*. sometimes, it's as simple as saying one implementation is "fast" and another is "slow", but this is much rarer than we are willing to admit. (22/n)
we make *guesses* about how the code is actually used. and even if our guesses are *usually* right, there might be a case where those assumptions fall down. this time, for me, that case happened to be...the Rust compiler, and i felt pretty bad about myself for a bit. (23/n)
so, that's my shaggy performance regression story. go download the latest `tracing` (https://crates.io/crates/tracing/0.1.18) and `tracing-subscriber` (https://crates.io/crates/tracing-subscriber/0.2.10)...especially if you're writing a short-lived CLI app! (fin)
one last update on this: no matter how hard i try, i simply *cannot* make this change have any impact on Linkerd request latency benchmarks, at all. not even in the highest latency buckets (above the 99.9th percentile). i guess that really drives home the message here.
(which is a bummer! i was hoping it would make the proxy faster!)
also! thanks to the aforementioned @oli_obk, @kneecaw, @eddyb_r, and @mattgodbolt's Compiler Explorer, for copious help optimizing the new fast path