The topic of “observability” has been getting a lot more mainstream discussion lately. If you’re interested in learning more about what it is, what it isn’t, and what it means for your own monitoring efforts, look no further: this list of articles is your one-stop shop for learning all about it.
What the hell is observability? How is it any different from monitoring? Is it just the “devops” vs “sysadmin” debate all over again? This article answers these questions and more.
This quote from the article sums it up quite well: “Monitoring tells you whether a system is working, observability lets you ask why it isn’t working.”
Now that we’ve introduced some nuance into our world, our terminology is getting overloaded. This post sets out definitions for each term and explains how the terms differ.
How do I make my applications more observable?
USE and RED are two methods for deciding what to instrument and why. This article walks you through their meaning and usage.
One of the best parts about the RED Method comes when you instrument all of your services to emit the same data: it becomes soooo much easier to spot the troublesome service in a microservice/distributed architecture.
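To make that concrete, here is a minimal sketch of RED-style bookkeeping (Rate, Errors, Duration) in plain Python. It isn’t taken from any of the linked articles: the `RedRecorder` class, the service names, and the numbers are all hypothetical, and a real setup would use a metrics library rather than in-memory lists. The point is only that once every service reports the same three things, picking out the odd one is a one-liner.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import median


@dataclass
class RedStats:
    """RED data for one service: request count, error count, durations."""
    requests: int = 0
    errors: int = 0
    durations: list = field(default_factory=list)  # seconds per request


class RedRecorder:
    """Hypothetical in-memory recorder; every service reports the same fields."""

    def __init__(self):
        self.stats = defaultdict(RedStats)

    def observe(self, service, duration_s, ok):
        s = self.stats[service]
        s.requests += 1
        s.durations.append(duration_s)
        if not ok:
            s.errors += 1

    def summary(self, window_s):
        """Turn raw counts into Rate, Errors, Duration for a time window."""
        return {
            name: {
                "rate_per_s": s.requests / window_s,
                "error_ratio": s.errors / s.requests,
                "median_duration_s": median(s.durations),
            }
            for name, s in self.stats.items()
        }


def worst_by_errors(summary):
    """Name of the service with the highest error ratio."""
    return max(summary, key=lambda name: summary[name]["error_ratio"])


recorder = RedRecorder()
# Made-up observations: (duration in seconds, succeeded?)
for duration, ok in [(0.020, True), (0.030, True), (0.025, True), (0.040, True)]:
    recorder.observe("checkout", duration, ok)
for duration, ok in [(0.200, False), (0.050, True), (0.300, False), (0.060, True)]:
    recorder.observe("inventory", duration, ok)

report = recorder.summary(window_s=2.0)
print(worst_by_errors(report))
```

Because both services emit identical fields, the same comparison works no matter how many services you add.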
The list of best practices at the end is worthwhile reading.
Another method for deciding what to monitor is the Four Golden Signals, popularized by the Site Reliability Engineering book. This article series by Steve Mushero walks you through what the signals mean and how to gather them.
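For reference, the four signals are latency, traffic, errors, and saturation. Here is a rough sketch of computing them for one service over one time window. It is an illustration under assumptions, not the method from the article series: the `Request` record and `golden_signals` function are hypothetical, and measuring saturation as queue depth over queue capacity is just one possible stand-in for “how full is this service.”

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_s: float
    ok: bool


def golden_signals(requests, window_s, queue_depth, queue_capacity):
    """Latency, traffic, errors, and saturation for a non-empty list of requests."""
    latencies = sorted(r.latency_s for r in requests)
    # Nearest-rank 95th percentile, computed with integer math.
    rank = max(1, (95 * len(latencies) + 99) // 100)
    return {
        "latency_p95_s": latencies[rank - 1],
        "traffic_per_s": len(requests) / window_s,
        "error_ratio": sum(not r.ok for r in requests) / len(requests),
        # Assumption: queue fullness is the saturation signal for this service.
        "saturation": queue_depth / queue_capacity,
    }


# Made-up data: 20 requests in a 10-second window, every tenth one fails.
sample = [Request(latency_s=i / 100, ok=(i % 10 != 0)) for i in range(1, 21)]
signals = golden_signals(sample, window_s=10.0, queue_depth=30, queue_capacity=100)
print(signals)
```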
Who is actually doing observability?
There are a number of teams out there doing observability-like things, and some of the larger, more engineering-focused companies have more mature Observability teams that are focused on providing expertise and a platform to other teams.
As far back as 2013, Twitter was one of the first companies to work toward solving the problem of monitoring high-scale, distributed systems. For more details on their architecture (from 2016), see these posts: “Observability at Twitter: technical overview, part I” and “Observability at Twitter: technical overview, part II”.
Did I miss anything?
Is there an article or video you think I’ve missed and should be on this page? Send it on over!