Hey folks, welcome to the 2017 Q4 Best Of Special Edition issue of Monitoring Weekly! This issue contains all the most-liked articles from Q4 2017. In case you missed something or just want another good look at the best stuff from the past quarter, here it is. Go grab a coffee and settle in, because there’s a lot of great stuff to read.
This issue is sponsored by:
Free Guide: Low-Risk Continuous Delivery
Adopting Continuous Delivery can bring a lot of benefits, but deploying to production can be filled with uncertainty. Learn how to reduce the risks with the right culture, architecture, and tooling to deploy early and often. Check out this free guide as we explore solutions.
Monitoring News, Articles, and Blog posts
Monitoring traceroute through Prometheus and Grafana
Speaking as a former network engineer, there is one easy way to make your network engineers love you: give them a proper traceroute when you tell them about suspected problems. (Bonus points for packet captures!) One thing I didn’t even think about, and yet seems so obvious now, is running all that traceroute data through Grafana. The example given in this article is straightforward, but it could easily be expanded to be even more useful.
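To give a feel for the approach, here’s a minimal sketch (all names and the sample output are hypothetical, not from the article) of parsing plain-text traceroute output into per-hop latencies — the kind of structured data you could then push into Prometheus and graph in Grafana:

```python
import re

# Hypothetical helper: parse `traceroute` output into
# (hop_number, host, address, [rtt_ms, ...]) tuples suitable for
# exporting as gauges, e.g. traceroute_rtt_ms{hop="1",host="gateway"}.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)((?:\s+[\d.]+ ms)+)")

def parse_traceroute(output):
    hops = []
    for line in output.splitlines():
        m = HOP_RE.match(line)
        if not m:
            continue  # skip the header and "* * *" timeout lines
        hop, host, addr, rtts = m.groups()
        latencies = [float(x) for x in re.findall(r"([\d.]+) ms", rtts)]
        hops.append((int(hop), host, addr, latencies))
    return hops

sample = """traceroute to example.com (93.184.216.34), 30 hops max
 1  gateway (192.168.1.1)  1.10 ms  0.98 ms  1.05 ms
 2  * * *
 3  core1.example.net (203.0.113.5)  12.40 ms  11.90 ms  12.10 ms"""

for hop, host, addr, rtts in parse_traceroute(sample):
    print(hop, host, min(rtts), max(rtts))
```

From there, each hop’s min/max/avg RTT becomes a time series, and a sudden latency jump at hop 3 is something a network engineer can act on immediately.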
I love it when James Turnbull writes about monitoring (you have read The Art of Monitoring, right?), and this post is no exception. James walks us through the foundational concepts behind Prometheus and a simple prototype setup. Love it.
We’ve talked a lot here at Monitoring Weekly about the use of Google’s “Golden Signals” (errors, latency, saturation, traffic). This post is a multi-part series that dives deep on both what the signals mean in different scenarios and how to collect them for different parts of your infrastructure. Check out the rest of the series too:
Getting a look at what the big companies are doing internally is always interesting (though not often directly useful for most of us, unfortunately). This time, we get a peek at Lyft’s dashboarding: Grafana, statsd, statsrelay, SaltStack, Wavefront, and a whole lot of custom code.
I don’t know why, but I’ve not seen an article that walks through the setup and use of the full TICK stack before this one. Most only talk about using pieces of it (usually Telegraf and/or InfluxDB), so it’s especially interesting to see how they all fit together.
James Turnbull is writing a new book, this time on Prometheus. If you’ve ever had the pleasure of reading one of his many books, you know how great it’s gonna be. Pre-orders are open now–I just bought mine.
Not sure where to start on observability? Charity Majors has got you covered with some really helpful tips and best practices.
I love reading about other people’s monitoring stacks and this writeup from the folks at OLX Group is certainly a good read. It’s a pretty complex setup with a bunch of tools strung together, including some I haven’t seen in a while (well hello there Brubeck!) and some new ones (moira).
Alright, here’s the deal: I’m not even going to attempt to summarize this beast of a post. It’s incredible and if you only read one article this issue, it should be this one. Also: better make a pot of coffee/tea, because it’s a bit of a read (and worth it).
Looking for the right solution for monitoring your Docker-based infrastructure? The folks at Rancher have updated their assessment, looking at ten different solutions and the pros/cons of each.
I love spotting *NIX stuff in mainstream film and this one is super cool: Kibana on screen in Mr. Robot. I went looking for other examples of this sort of thing in film and turns out it’s surprisingly uncommon for movies/TV to obviously feature *NIX on screen. Major props to the tech consultants with Mr. Robot for the dedication to realism.
I really love mental models. The USE Method and the Four Golden Signals from the SRE book are two of my favorites for performance analysis and monitoring. I hadn’t seen RED before, but it looks like a really useful mental model. Regardless of which you prefer, the author observes that the two models are very much complementary, as they cover related-but-not-identical metrics.
The folks at The Economist set out to improve monitoring by leaps and bounds and made great initial headway with a good strategy: an internal two-day hackathon dedicated to improving monitoring. I really like the approach because it gets people out of the mindset of the usual day-to-day tasks.
I’m not sure how I missed this, but here’s a great talk from the recent LISA17 conference about mitigating false alarms and the true costs they impose on your staff and company.
While the RED method works pretty well, it has a major shortcoming: diagnosing whether the issue is with one service or a dependency. This article proposes a solution to that while also making the case for standardizing on RED metrics for every service your organization runs.
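To make the standardization idea concrete, here’s a minimal sketch (class and service names hypothetical, not from the article) of recording the three RED metrics — Rate, Errors, Duration — the same way for every service:

```python
from collections import defaultdict

# Hypothetical sketch: one uniform recorder for the RED metrics.
# Every service reports through the same observe() call, so dashboards
# and alerts can be templated identically across the whole org.
class REDRecorder:
    def __init__(self):
        self.requests = defaultdict(int)    # total requests per service
        self.errors = defaultdict(int)      # failed requests per service
        self.durations = defaultdict(list)  # per-request latencies (seconds)

    def observe(self, service, duration_s, ok=True):
        self.requests[service] += 1
        if not ok:
            self.errors[service] += 1
        self.durations[service].append(duration_s)

    def snapshot(self, service):
        n = self.requests[service]
        durs = sorted(self.durations[service])
        return {
            "rate": n,  # divide by the window length for req/s
            "error_ratio": self.errors[service] / n if n else 0.0,
            "p50_s": durs[len(durs) // 2] if durs else None,
        }

red = REDRecorder()
red.observe("checkout", 0.120)
red.observe("checkout", 0.450, ok=False)
red.observe("checkout", 0.130)
print(red.snapshot("checkout"))
```

The payoff of this uniformity is exactly what the article argues for: when every service exposes the same three signals, comparing a service against its dependencies becomes a side-by-side query instead of detective work.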
My book, Practical Monitoring, is now available from Amazon and Safari Books Online! It was many months in the making, and I’m proud to see it on shelves now. If you’ve already got your copy, please let me know what you think and leave a review on Amazon! Want to try it before you buy? There’s a free chapter available at the link.
A much-needed effort at defining terminology and breaking down the overloaded “monitoring” term. The author goes into more detail, but I’ll include the TLDR here because it’s so good:
“Monitoring is the process of observing systems and testing whether they function correctly. Analytics is the process of turning data (usually behavioral data) into insights. Observability is the property of a system that supports analytics. Diagnostics is the process of determining what’s wrong with a system, and also relies on observability. Root cause analysis is corporate mumbo jumbo.” – Baron Schwartz
High cardinality in metric data has always been a pain point. If you’ve ever been on the unfortunate end of having someone encode a UID into a metric path and watching it bring a metrics server to its knees, you know how tough it is. This article makes the case that high cardinality isn’t just desirable–it’s required. And while we’re at it, we should all stop assuming it’s impossible to achieve.
It always annoys me when companies carve out “planned downtime” from their availability reporting and goals. This article says what we’re all thinking: planned downtime punishes the customer for our failure to build resilient and maintainable systems.
Want more about browser performance metrics? Here you go. This article goes into more depth on the current generation of metrics available to us, what they mean, and how they’re best used.
This is actually the first time I’ve ever seen Kafka and Graphite in the same stack, but it really does have a great use case. The author pulls data from over a hundred weather endpoints and stores them in Kafka+Graphite for later trend analysis and historical reporting. Pretty neat use of the tools.
The folks at InfoQ, who interviewed me about my book, Practical Monitoring, a while back, have now interviewed Charity Majors about observability, how applications have changed, and what we should do to keep pace with understanding the health of our apps.
Enjoy your weekend, folks!
– Mike (@mike_julian) Monitoring Weekly Editor