I hope you don’t mind another “gift”, because it’s time for our “Best of Q4” issue! I’ve gone back over the past few months and pulled out the most popular articles as chosen by you… Enjoy! 😍
This issue is sponsored by:
Join the Elastic Community Conference. Save the date and submit.
ElasticCC is a free technical conference for the community, happening February 11–12. Submit your stories and learnings from ELK to Elastic observability and security until January 4; introduction, deep dive, legacy, or cutting edge are all welcome. And don't forget to join us in February!
Articles & News on monitoring.love
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
We hear about the sidecar pattern all the time, but rarely with an explanation of what it is or why we should care. Here you go.
If you’ve been asleep for the past five years, this is a great way to get caught up on the past and present of Observability as an industry practice.
Ever wonder why we don’t hear more about open source APM tooling (i.e., is it even a thing)? Wonder no more; this article has you covered.
If you can’t (or don’t want to) use Promtail, it’s now possible to push your logs directly to Loki. Lots of useful bits about Python and Loki logging internals. A great read.
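For a rough idea of what "pushing directly" looks like, here is a minimal sketch that is not taken from the linked article. It assumes Loki's HTTP push endpoint (`/loki/api/v1/push`) and its JSON body of labelled streams with nanosecond timestamps sent as strings. The helper names, the `app` label, and the `localhost:3100` address are all made up for illustration.

```python
import json
import time
import urllib.request


def build_loki_payload(line, labels, ts_ns=None):
    """Build the JSON body for one log line (helper name is hypothetical)."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Loki expects each entry as [timestamp-in-nanoseconds-as-string, log line].
    return {"streams": [{"stream": dict(labels), "values": [[str(ts_ns), line]]}]}


def push_to_loki(payload, url="http://localhost:3100/loki/api/v1/push"):
    """POST the payload to Loki. The URL is an assumed local default."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Building a payload needs no running Loki; sending it does.
payload = build_loki_payload("user logged in", {"app": "demo"})
```

Calling `push_to_loki(payload)` only works against a reachable Loki instance, so treat that half as untested scaffolding.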
This might be the best article I’ve read on incident response and postmortems in a long while. Read this. Share it.
An excellent summary of Google’s “Four Golden Signals” for SRE, including some examples that feel appropriate for this audience.
Running Kubernetes and thinking about monitoring it with Prometheus, but not sure how to get started? This is the definitive guide you’ve been looking for.
We’ve covered the Monika project a couple of times this year. Looks like they’ve added some new alerting features and more flexible capabilities.
Whether or not you believe the “single root cause” exists, eBay’s Groot event-graph-based approach to RCA demonstrates some extremely impressive numbers for their causality graphs. The whitepaper on Groot’s design (in partnership with University of Illinois Urbana-Champaign and Peking University) can be found here.
A comparison of three popular log aggregators. I’d like to see more dimensions covered (e.g. transforms, metrics exporting, etc) but it’s still a useful look at how each performs at basic log collection duties.
On the other hand, if you’re feeling adventurous and thinking about writing your own website-monitoring Monika knock-off, this article has you covered.
Honestly, this would have saved my bacon at a previous gig where we used TLS certificates for everything.
Incident responses can be a chaotic experience for everyone. This post from Datadog highlights some best practices for collecting your data in preparation for writing the postmortem.
How to autoscale your Kubernetes systems using any custom metric. Yes please and thank you.
I’m a big believer that for any new technology (e.g. events) to become ubiquitous, there needs to be an open source alternative to provide competition and training opportunities. This article does a good job summarizing the most popular open source tools representing the pillars of observability.
Most of us have a passing understanding of SLIs, SLOs, and how they feed into SLAs. Unfortunately, many of us still struggle with the question of how to leverage them for availability numbers and error budgets. This post aims to answer these for us.
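If the error budget arithmetic feels abstract, here is a small sketch of the usual calculation. It is not from the linked post, and it assumes a purely time-based availability SLO over a rolling window. The function names and the 30-day default are my own choices.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    # The budget is whatever fraction of the window the SLO does not cover.
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining_minutes(slo_percent, downtime_minutes, window_days=30):
    """Budget left after observed downtime; negative means the SLO is blown."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget = error_budget_minutes(99.9)
remaining = budget_remaining_minutes(99.9, downtime_minutes=10)
```

A 99.9% target over 30 days works out to about 43.2 minutes of budget, so 10 minutes of downtime leaves roughly 33 minutes to spend.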
Congratulations to the OpenTelemetry project on reaching their GA milestone for Tracing components! 🎉
Good monitoring and testing go hand in hand. Here’s an interesting tool I first learned about this week for testing Kubernetes operators.
I have a lot of conflicting feels on this one. Yes, I agree with the author’s take, but I also recognize that not everyone has the resources or freedom to prioritize High Availability for their entire architecture. Here’s a terrible thought… are you better off taking a service down for maintenance without notifying your customer?
We’ve seen countless articles explaining what OpenTelemetry is, where and how it can help us, etc. This is one of the few articles I’ve read that actually walks us through the considerations leading up to adoption, which questions to ask yourselves, and how to plan the rollout.
A thorough look at the logging patterns for Kubernetes clusters with a comprehensive look at the pros and cons for each approach.
In a cloud-native world, it shouldn’t surprise me that we don’t see more network monitoring articles. I’m probably one of the few who gets excited to see articles about SNMP, but this one’s a doozy.
I don’t think any single article can address the variety of cultural and systemic organizational issues that can lead to Really Bad Alerting Practices™. Nevertheless, this one does a solid job covering many of the aspects within our direct control and influence.
Some excellent tips on how to leverage Prometheus labels more effectively in Grafana. Bonus points to the author for demonstrating the Prometheus labels API.
Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor