0/ Now that organizations are building or buying observability, they are realizing that it can get really damned expensive. And not just “expensive,” but “expensive and out of control.”

This is a thread about *observability value:* both the benefits and the costs.
1/ You hear so much about observability because it *can* be awesome. :) Benefits roll up into at least one of the following:

- Reducing latencies or error rates (foreach service)
- Reducing MTTR (also foreach service)
- Improving velocity or communication (foreach team)
2/ But most observability vendors charge based on something that has literally no value on its own: *the telemetry.*

This is rough for customers, especially since these vendors provide no mechanism to scale or *control* the telemetry volume (why would they? it’s $$$!).
3/ Let’s dig into this a bit more. Really, there are two flavors of telemetry – statistics (i.e., “metrics”) and events (i.e., “traces and logs”), and they should be considered separately.
4/ For metrics telemetry, the cost driver is cardinality, especially around “custom metrics.” Per-metric cardinality is combinatorial and grows well into the millions, and customers pay accordingly.

It really doesn’t need to be that way!
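To make "combinatorial" concrete: in the worst case, every distinct combination of tag values becomes its own time series, so one high-cardinality tag can dominate the whole product. A toy sketch (the metric name and tag counts are made up for illustration):

```python
from math import prod

# Hypothetical tag cardinalities for a single metric, e.g. http_requests_total.
# Worst case, each time series is one combination of tag values, so the
# series count is the product of per-tag distinct-value counts.
tag_values = {
    "endpoint": 50,
    "status_code": 8,
    "region": 12,
    "customer_id": 10_000,  # one high-cardinality tag dominates the product
}

series_count = prod(tag_values.values())
print(series_count)  # 50 * 8 * 12 * 10_000 = 48,000,000 time series
```

Drop that one `customer_id` tag and the same metric is 4,800 series instead of 48 million.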
5/ Most of that “cardinality budget” is spent on long-tail metrics that *literally never appear in query results*. Customers should have a simple slider to trade off cardinality, spend, and query result quality – schematically, like this:
6/ Want perfect fidelity for a business-critical metric? Drag the slider all the way to the right. Want to degrade gracefully for a customer_id tag with millions of values? Drag the slider to the left, pay 99% less, and still have metric data for your largest customers.
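One way that slider could work under the hood: keep the k most frequent tag values at full fidelity and fold the long tail into a single "other" bucket. This is only a sketch of the idea (the function name and sample data are invented here):

```python
from collections import Counter

def limit_cardinality(samples, k):
    """Keep the k most frequent tag values; fold the long tail into 'other'.

    `samples` maps a tag value (e.g. a customer_id) to its sample count.
    Toy sketch of the 'slider': larger k = more fidelity, more spend.
    """
    counts = Counter(samples)
    kept = dict(counts.most_common(k))
    tail = sum(counts.values()) - sum(kept.values())
    if tail:
        kept["other"] = tail
    return kept

samples = {"cust_a": 900, "cust_b": 500, "cust_c": 40, "cust_d": 7, "cust_e": 3}
print(limit_cardinality(samples, 2))
# {'cust_a': 900, 'cust_b': 500, 'other': 50}
```

Your largest customers keep their own series; the millions of tiny ones collapse into one.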
7/ Event data is interesting, too.

(An aside: 99% of “logging telemetry” is really just “tracing telemetry.” Any log *about a transaction* (i.e., “almost all of them”) should be attached to the trace context so it can benefit from trace analysis.)
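"Attached to the trace context" can be as simple as stamping every structured log with the active trace/span IDs. A minimal sketch – `current_trace_context` is a hypothetical stand-in for whatever your tracing library actually exposes (e.g. an OpenTelemetry-style context):

```python
import json
import logging

# Hypothetical stand-in for the tracing library's active-context accessor.
def current_trace_context():
    return {"trace_id": "4bf92f3577b34da6", "span_id": "00f067aa0ba902b7"}

def log_event(message, **fields):
    """Emit a JSON log with trace/span IDs attached, so any log about a
    transaction can be joined back to its trace for analysis."""
    record = {"message": message, **fields, **current_trace_context()}
    logging.getLogger("app").info(json.dumps(record))
    return record

logging.basicConfig(level=logging.INFO)
log_event("charge failed", customer_id="cust_a", amount_cents=1299)
```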
8/ In any case, there is a fundamental tradeoff between the *number of transactions recorded* (txns/sec – i.e., those that survive sampling at a given analytical stage) and the *level of detail* (bytes/txn) in those transactions.
9/ When we multiply “txns/sec” by “bytes/txn,” we end up with a “bytes/sec” throughput. To visualize the trade-offs, we can chart various telemetry throughput targets as (hyperbolic) lines against the following axes:
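For a fixed budget, the feasible points trace out a hyperbola: halve the transactions you keep and you can double the detail per transaction. A quick sketch with an arbitrary example budget:

```python
# Fix a telemetry budget in bytes/sec; the feasible (txns/sec, bytes/txn)
# pairs lie on a hyperbola: bytes_per_txn = budget / txns_per_sec.
BUDGET = 10_000_000  # 10 MB/s -- an arbitrary example budget

points = [(tps, BUDGET // tps) for tps in (100, 1_000, 10_000, 100_000)]
for tps, bpt in points:
    print(f"{tps:>7} txns/s -> {bpt:>7} bytes/txn")
```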
10/ For tracing (or logging) data, we must trade off between *sampling* (either in the clients, in collection infra, or before hitting long-term durable storage) and *detail*. There is no “right answer” here, and it’s just something that should be considered carefully.
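One common way to draw that line in the clients is deterministic head sampling: hash the trace ID so every service makes the same keep/drop decision for a given trace. A toy sketch (not any particular vendor's implementation):

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Hash the trace ID to [0, 1) and keep the trace if it falls under
    the sample rate. Deterministic per trace, so every service in the
    transaction agrees on the keep/drop decision."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return (h / 0x1_0000_0000) < sample_rate

print(keep_trace("4bf92f3577b34da6", 0.1))
```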
11/ But no matter where we draw that line, the reality is that there is *A LOT* of tracing/logging/event data; especially when microservices are involved, as the data volume is a function of transaction rate *multiplied by* microservice count!
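Back-of-envelope, with purely illustrative numbers, to show how fast "transaction rate × microservice count" gets big:

```python
# All numbers here are made up for illustration.
txns_per_sec = 5_000
services_per_txn = 20    # microservices touched per transaction
spans_per_service = 3    # spans/logs emitted per service hop
bytes_per_span = 500

bytes_per_sec = txns_per_sec * services_per_txn * spans_per_service * bytes_per_span
print(bytes_per_sec / 1e6, "MB/s")  # 150.0 MB/s
```

A modest 5k txns/sec becomes 150 MB/s of event data before you've sampled anything.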
12/ The elephant in the room around tracing data is where we *collect, store, and analyze* the data. The moment you send things over the WAN, you are paying a 100x cost penalty; but if you just keep things in a “dumb” collector, you can’t query over it dynamically.
13/ Architecturally, this means that computation *must* be distributed – close to the data – and that data *must* stay close to the services themselves; or the network cost is simply too disruptive to overall observability ROI.
14/ So, in closing…

Observability can be _incredibly_ valuable!

*Do* build an ROI case around its primary beneficiaries: your services, your teams, and your brand.

But `Telemetry != Observability`: *don’t* pay vendors extra for a telemetry firehose you can't even control.
You can follow @el_bhs.