Updated definition:
Monitoring is for running and understanding other people's code (aka "your infrastructure")
Observability is for running and understanding *your* code -- the code you write, change and ship every day; the code that solves your core business problems. https://twitter.com/mipsytipsy/status/911711540008628224


Questions monitoring tools (like datadog, signalfx) can answer:
* When will my disk fill up?
* Am I running out of capacity in $(cluster)?
* Did the % free memory drop after my last deploy?
* What is the avg, 90th, 99th percentile latency per service?
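To make the first question concrete, here is a minimal sketch of answering "when will my disk fill up?" from an aggregate metric. It assumes you already have (timestamp, bytes used) samples from some metrics store. The function name `seconds_until_full` and the sample values are hypothetical, and a plain least-squares line is just one simple way to extrapolate, not how any particular vendor does it.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Fit a least-squares line through (timestamp, bytes_used) samples and
    return the seconds from the last sample until the line hits capacity.
    Returns None when there is too little data or usage is flat or shrinking."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # disk is not filling
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


# Hypothetical samples: usage grows by 10 units every 60 seconds.
samples = [(0.0, 10.0), (60.0, 20.0), (120.0, 30.0)]
eta = seconds_until_full(samples, capacity=100.0)
```

Note that this only needs a single pre-aggregated time series per disk, which is exactly the kind of data metrics tools are built to keep.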

Questions observability tools (like honeycomb, lightstep) can answer:
* What (1..many) things do all the errors in that spike have in common?
* How many exports per second is $app doing, and how large are they, and how does this compare to the average export size in kB?
* Break down by app and sort by export size: what are your top 3 export users, and what is the sum of their total throughput compared to the overall throughput?
* Are the errors evenly distributed across workers, AZs, instance types, software versions, build_id versions, shards?
* Are the timeouts happening for all our users, or only the test users, or only our top users by write volume, etc?
* For all of the deliveries that failed over a specified time period, what are the top three reasons they failed, and what % of failures were from a single app?
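What these questions have in common is that they need raw events with lots of fields, which you can slice by any field after the fact. A rough sketch of "are the errors evenly distributed across build_ids, AZs, ...?", assuming each request was recorded as one dict: the field names (`build_id`, `az`, `error`), the helper `breakdown`, and the sample events are all made up for illustration and are not any tool's actual API.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

Event = Dict[str, Any]


def breakdown(events: Iterable[Event], field: str) -> List[Tuple[Any, int, int, float]]:
    """Group events by `field` and return (value, total, errors, error_rate)
    rows, sorted by error count descending (ties broken by value)."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for ev in events:
        value = ev.get(field)
        totals[value] += 1
        if ev.get("error"):
            errors[value] += 1
    rows = [(v, totals[v], errors[v], errors[v] / totals[v]) for v in totals]
    rows.sort(key=lambda r: (-r[2], str(r[0])))
    return rows


# Hypothetical events: one dict per request, as wide as you like.
events = [
    {"build_id": "b41", "az": "us-east-1a", "error": False},
    {"build_id": "b42", "az": "us-east-1a", "error": True},
    {"build_id": "b42", "az": "us-east-1b", "error": True},
    {"build_id": "b41", "az": "us-east-1b", "error": False},
    {"build_id": "b42", "az": "us-east-1a", "error": False},
]

by_build = breakdown(events, "build_id")  # every error is on build b42
by_az = breakdown(events, "az")           # errors are spread across AZs
```

The same function works for any field you happened to record, including ones you did not know you would care about when you wrote the instrumentation. Pre-aggregated metrics cannot do that, because the per-request detail is gone by the time you ask.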

Running infrastructure means running black boxes. You may have some insight into them (god i hope so) but you don't have the ability to tweak their instrumentation, and you certainly aren't shipping code changes every day.
Infrastructure code gets package-upgraded, rarely.
And when it comes to monitoring and understanding your infrastructure, metrics-based monitoring tools that let you understand performance in aggregate are the tool for the job.
Especially when workloads are high-throughput with little differentiation (routers, etc.), metrics are king.
When it comes to aligning developer perspective with user experience to provide core business value, though: event-based observability tools are the only way to get at the information you need.
You need the flexibility and precision of a scalpel, not an axe.
To see an expert yet beginner-friendly (and entertaining!) intro to observability for business problems, check out this talk from @seebails -- https://observe2020.io/2020/03/chris-bailey/
And to continue my killjoy track record of stiffly caring about technical definitions for technical terms: if you'd like to read more, please read my three-year history of observability in the software domain, aka how we got here and where we're going: https://thenewstack.io/observability-a-3-year-retrospective/
annnd -- should you be in the mood to build an o11y tool in-house, or want to argue with me about why datadog and signalfx and their ilk are definitely not observability tools (or about what constitutes an o11y tool), do read this: https://www.honeycomb.io/blog/so-you-want-to-build-an-observability-tool/