0/ Now that organizations are building or buying observability, they are realizing that it can get really damned expensive. And not just “expensive,” but “expensive and out of control.”

This is a thread about *observability value:* both the benefits and the costs.
1/ You hear so much about observability because it *can* be awesome. :) Benefits roll up into at least one of the following:

- Reducing latencies or error rates (foreach service)
- Reducing MTTR (also foreach service)
- Improving velocity or communication (foreach team)
2/ But most observability vendors charge based on something that has literally no value on its own: *the telemetry.*

This is rough for customers, especially since these vendors provide no mechanism to scale or *control* the telemetry volume (why would they? it’s $$$!).
3/ Let’s dig into this a bit more. Really, there are two flavors of telemetry – statistics (i.e., “metrics”) and events (i.e., “traces and logs”), and they should be considered separately.
4/ For metrics telemetry, the cost driver is cardinality, especially around “custom metrics.” Per-metric cardinality is combinatorial and grows well into the millions, and customers pay accordingly.

It really doesn’t need to be that way!
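To make "combinatorial" concrete: in the worst case, every distinct combination of tag values becomes its own time series, so one high-cardinality tag can dominate the whole product. A toy sketch (the metric name and tag counts are made up for illustration):

```python
from math import prod

# Hypothetical tag cardinalities for a single metric, e.g. http_requests_total.
# Worst case, each time series is one combination of tag values, so the
# series count is the product of per-tag distinct-value counts.
tag_values = {
    "endpoint": 50,
    "status_code": 8,
    "region": 12,
    "customer_id": 10_000,  # one high-cardinality tag dominates the product
}

series_count = prod(tag_values.values())
print(series_count)  # 50 * 8 * 12 * 10_000 = 48,000,000 time series
```

Drop that one `customer_id` tag and the same metric is 4,800 series instead of 48 million.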
5/ Most of that “cardinality budget” is spent on long-tail metrics that *literally never appear in query results*. Customers should have a simple slider to trade off cardinality, spend, and query result quality – schematically, like this:
6/ Want perfect fidelity for a business-critical metric? Drag the slider all the way to the right. Want to degrade gracefully for a customer_id tag with millions of values? Drag the slider to the left, pay 99% less, and still have metric data for your largest customers.
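One way that slider could work under the hood: keep the k most frequent tag values at full fidelity and fold the long tail into a single "other" bucket. This is only a sketch of the idea (the function name and sample data are invented here):

```python
from collections import Counter

def limit_cardinality(samples, k):
    """Keep the k most frequent tag values; fold the long tail into 'other'.

    `samples` maps a tag value (e.g. a customer_id) to its sample count.
    Toy sketch of the 'slider': larger k = more fidelity, more spend.
    """
    counts = Counter(samples)
    kept = dict(counts.most_common(k))
    tail = sum(counts.values()) - sum(kept.values())
    if tail:
        kept["other"] = tail
    return kept

samples = {"cust_a": 900, "cust_b": 500, "cust_c": 40, "cust_d": 7, "cust_e": 3}
print(limit_cardinality(samples, 2))
# {'cust_a': 900, 'cust_b': 500, 'other': 50}
```

Your largest customers keep their own series; the millions of tiny ones collapse into one.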
7/ Event data is interesting, too.

(An aside: 99% of “logging telemetry” is really just “tracing telemetry.” Any log *about a transaction* (i.e., “almost all of them”) should be attached to the trace context so it can benefit from trace analysis.)
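"Attached to the trace context" can be as simple as stamping every structured log with the active trace/span IDs. A minimal sketch – `current_trace_context` is a hypothetical stand-in for whatever your tracing library actually exposes (e.g. an OpenTelemetry-style context):

```python
import json
import logging

# Hypothetical stand-in for the tracing library's active-context accessor.
def current_trace_context():
    return {"trace_id": "4bf92f3577b34da6", "span_id": "00f067aa0ba902b7"}

def log_event(message, **fields):
    """Emit a JSON log with trace/span IDs attached, so any log about a
    transaction can be joined back to its trace for analysis."""
    record = {"message": message, **fields, **current_trace_context()}
    logging.getLogger("app").info(json.dumps(record))
    return record

logging.basicConfig(level=logging.INFO)
log_event("charge failed", customer_id="cust_a", amount_cents=1299)
```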
8/ In any case, there is a fundamental tradeoff between the *number of transactions recorded* (txns/sec – i.e., those that survive sampling at a given analytical stage) and the *level of detail* (bytes/txn) in those transactions.
9/ When we multiply “txns/sec” by “bytes/txn,” we end up with a “bytes/sec” throughput. To visualize the trade-offs, we can chart various telemetry throughput targets as (hyperbolic) lines against the following axes:
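For a fixed budget, the feasible points trace out a hyperbola: halve the transactions you keep and you can double the detail per transaction. A quick sketch with an arbitrary example budget:

```python
# Fix a telemetry budget in bytes/sec; the feasible (txns/sec, bytes/txn)
# pairs lie on a hyperbola: bytes_per_txn = budget / txns_per_sec.
BUDGET = 10_000_000  # 10 MB/s -- an arbitrary example budget

points = [(tps, BUDGET // tps) for tps in (100, 1_000, 10_000, 100_000)]
for tps, bpt in points:
    print(f"{tps:>7} txns/s -> {bpt:>7} bytes/txn")
```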
10/ For tracing (or logging) data, we must trade off between *sampling* (either in the clients, in collection infra, or before hitting long-term durable storage) and *detail*. There is no “right answer” here, and it’s just something that should be considered carefully.
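One common way to draw that line in the clients is deterministic head sampling: hash the trace ID so every service makes the same keep/drop decision for a given trace. A toy sketch (not any particular vendor's implementation):

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Hash the trace ID to [0, 1) and keep the trace if it falls under
    the sample rate. Deterministic per trace, so every service in the
    transaction agrees on the keep/drop decision."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return (h / 0x1_0000_0000) < sample_rate

print(keep_trace("4bf92f3577b34da6", 0.1))
```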
11/ But no matter where we draw that line, the reality is that there is *A LOT* of tracing/logging/event data; especially when microservices are involved, as the data volume is a function of transaction rate *multiplied by* microservice count!
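Back-of-envelope, with purely illustrative numbers, to show how fast "transaction rate × microservice count" gets big:

```python
# All numbers here are made up for illustration.
txns_per_sec = 5_000
services_per_txn = 20    # microservices touched per transaction
spans_per_service = 3    # spans/logs emitted per service hop
bytes_per_span = 500

bytes_per_sec = txns_per_sec * services_per_txn * spans_per_service * bytes_per_span
print(bytes_per_sec / 1e6, "MB/s")  # 150.0 MB/s
```

A modest 5k txns/sec becomes 150 MB/s of event data before you've sampled anything.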
12/ The elephant in the room around tracing data is where we *collect, store, and analyze* the data. The moment you send things over the WAN, you are paying a 100x cost penalty; but if you just keep things in a “dumb” collector, you can’t query over it dynamically.
13/ Architecturally, this means that computation *must* be distributed – close to the data – and that data *must* stay close to the services themselves; or the network cost is simply too disruptive to overall observability ROI.
14/ So, in closing…

Observability can be _incredibly_ valuable!

*Do* build an ROI case around its primary beneficiaries: your services, your teams, and your brand.

But `Telemetry != Observability`: *don’t* pay vendors extra for a telemetry firehose you can't even control.
You can follow @el_bhs.