Measuring Mean Time To Recovery https://twitter.com/QuinnyPig/status/1265036735307808772
Let's measure how long it takes 1000 different animals to run 100 meters and then punish the slow ones
Yo human, cheetah's MTTR is consistently four times as fast as yours, we're going to need you to work on that in Q3
also btw some of you are running 1000 meters but we won't tell you which, also btw there is some confusion over where the finish line is
it is difficult to overstate how nonsensical MTTR and MTBF are as failure metrics in complex systems where the very definitions of "failure" and "recovery" are different every time and socially constructed and subject to motivated reasoning and negotiation.
Scientists used to believe in "types": short, portly men were more likely to be jovial, etc. (also tons of racism to unpack here). This is approximately how most orgs analyze incidents today.
This is what you look like when you measure mean time to recovery. (https://www.autodeskresearch.com/publications/samestats)
As Gould points out in Full House, and as many others have before and since, (trends in) measures of central tendency don't always tell you much about the *distribution* of data (the proverbial Full House). I think this is true of incident data.
If anyone has a bunch of incident data lying around, it would be cool to see a histogram of their incident TTRs, and then perhaps one for each SEV level. You might be surprised by how little it looks like a bell curve -- or I by how much!
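If you want to try this, here is a minimal sketch of a per-SEV text histogram. The incident records, the SEV labels, and the 60-minute bin width are all invented for illustration; swap in your own export.

```python
from collections import Counter, defaultdict

# Hypothetical incident records: (SEV level, time to recovery in minutes).
# These numbers are invented purely for illustration.
INCIDENTS = [
    ("SEV1", 12), ("SEV1", 340), ("SEV1", 45),
    ("SEV2", 8), ("SEV2", 15), ("SEV2", 22), ("SEV2", 610),
    ("SEV3", 5), ("SEV3", 7), ("SEV3", 9), ("SEV3", 95),
]


def histogram(ttrs, bin_width=60):
    """Count TTRs per fixed-width bin, keyed by the bin's lower edge."""
    return Counter((t // bin_width) * bin_width for t in ttrs)


def render(counts, bin_width=60):
    """Render one text row per bin, including empty bins so gaps stay visible."""
    rows = []
    for lo in range(0, max(counts) + bin_width, bin_width):
        bar = "#" * counts.get(lo, 0)
        rows.append(f"{lo:>4}-{lo + bin_width - 1:<4} {bar}")
    return "\n".join(rows)


# Group TTRs by SEV level, then print one histogram per level.
by_sev = defaultdict(list)
for sev, ttr in INCIDENTS:
    by_sev[sev].append(ttr)

for sev in sorted(by_sev):
    print(sev)
    print(render(histogram(by_sev[sev])))
```

Empty bins are printed on purpose: the gap between the clump of short incidents and the occasional very long one is the part a mean hides.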
Generally speaking, when a measure has some left wall (no incident takes negative time), high variance, and no right wall, the central tendencies are especially likely to distort. Add to this the lack of a clear unimodal distribution (tendency to clump towards the mode) and...
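To make the left-wall, no-right-wall point concrete, here is a small sketch comparing the mean and the median of a skewed sample. The TTR values are made up; the only assumption is that a few incidents run far longer than the rest.

```python
import statistics

# Hypothetical TTRs in minutes: most incidents are short, a few are very long.
TTRS = [4, 6, 7, 9, 10, 12, 15, 18, 25, 40, 55, 90, 480, 1500]


def summarize(ttrs):
    """Return the mean, the median, and the share of incidents shorter than the mean."""
    mean = statistics.mean(ttrs)
    median = statistics.median(ttrs)
    share_below_mean = sum(t < mean for t in ttrs) / len(ttrs)
    return {"mean": mean, "median": median, "share_below_mean": share_below_mean}


summary = summarize(TTRS)
print(f"mean:   {summary['mean']:.1f} min")
print(f"median: {summary['median']:.1f} min")
print(f"incidents shorter than the mean: {summary['share_below_mean']:.0%}")
```

In this invented sample the two longest incidents drag the mean to roughly ten times the median, and 12 of the 14 incidents are shorter than the "average" incident.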
There seems to me to be no particular duration that incidents _ought_ to have. They take as long as they take. All we can say is that we'd like each one to be shorter, but what do we actually learn about future incidents by studying the _duration_ of past incidents?
And I think we learn especially little by studying the mean.
I also think we are embedding some assumptions here, for example

- duration correlates with customer impact
- duration correlates with stress on engineers
- duration correlates with some objective level of difficulty in resolving the incident

and so on. Do they?
RE that last one: can you even measure this?
Here's a fun thought experiment: Your month-over-month MTTR is up by 10%. How many distinct explanations can you come up with for this?
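A sketch to get the thought experiment started: three quite different stories that each produce the same 10% rise. All numbers and scenario names are invented, and these are only three of many possible explanations.

```python
import statistics

# Hypothetical baseline month: 12 incidents, TTR in minutes, mean of 30.
BASELINE = [15, 15, 20, 20, 30, 30, 30, 30, 40, 40, 45, 45]

SCENARIOS = {
    # Every incident really did take 10% longer.
    "everything got slower": [t * 1.1 for t in BASELINE],
    # Nothing changed except a single incident that ran 36 minutes longer.
    "one long outlier": BASELINE[:-1] + [81],
    # The two quickest incidents were simply never declared as incidents.
    "short incidents not declared": BASELINE[2:],
}


def pct_change(before, after):
    """Percentage change in mean TTR from one month to the next."""
    old = statistics.mean(before)
    new = statistics.mean(after)
    return (new - old) / old * 100


for name, month in SCENARIOS.items():
    print(f"{name}: MTTR {pct_change(BASELINE, month):+.1f}%")
```

The metric cannot tell these apart: one is a real slowdown, one is a single bad day, and one is a change in what gets counted as an incident.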
*stares directly into the camera* https://twitter.com/ReinH/status/1265798723855986689?s=20
(cite https://twitter.com/ReinH/status/1265798432205094912?s=20)
You can follow @ReinH.