Announcing my new video course: Monitor Anything
How do you improve monitoring, specifically? Where do you even start? Worse: how do you know you’re done? If this resonates, I’ve got something in the works you’re going to love: a foolproof framework for how to monitor any app, service, or infrastructure. Read more about it and pre-order the course here.
This issue is sponsored by:
Adopting Continuous Delivery can bring a lot of benefits, but deploying to production can be filled with uncertainty. Learn how to reduce the risks with the right culture, architecture, and tooling to deploy early and often.
Check out this free guide as we explore solutions.
Articles & News
This pair of articles does a great job of introducing some foundational security incident response topics, which I personally think deserve significantly more discussion within the realm of monitoring (it’s a damn shame that the security monitoring field is populated with FUD and big-E enterprise vendors with something they desperately want to sell you).
If you run an ELK cluster, you’ve likely run into Elasticsearch performance challenges. This article makes a few suggestions for basic Elasticsearch performance improvements, such as how to size memory to avoid JVM heap issues.
Ever been confused by InfluxDB’s internals around retention? I know I have. This article does a wonderful job of explaining how it all works.
It’s hard enough detecting failures in web systems normally, but the folks at Walmart Labs have gone a few steps further: detecting failures within web experiments, on live traffic, in real time. Definitely a non-trivial problem and a fascinating topic.
The post-Monitorama writeups are starting to roll in, after a good weekend’s worth of rest. I totally agree with this writeup, too: Logan McDonald’s talk was one of my favorites of the conference. While we’re waiting on the 2018 videos to be posted, you can watch the lightning talk version she did last year.
Snyk is a really neat product, but there’s one flaw: its alerts aren’t tied into your production alerting mechanisms. This article resolves that for Icinga by wrapping a Snyk API call with a custom Icinga plugin (the code is in the article). It should be easily adaptable to other monitoring systems as well.
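If you want a feel for what such a plugin involves before clicking through, here is a minimal sketch in Python. It is not the article's code: it only shows the standard Nagios/Icinga plugin contract (exit code 0/1/2/3 plus a one-line status with optional performance data). The thresholds and the `fetch_vuln_count` helper are hypothetical placeholders, and the Snyk API call itself is deliberately left out.

```python
import sys

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(vuln_count, warn=1, crit=5):
    """Map a vulnerability count to an (exit_code, status_line) pair.

    The warn/crit thresholds are illustrative assumptions, not recommendations.
    """
    if vuln_count is None or vuln_count < 0:
        return UNKNOWN, "SNYK UNKNOWN - could not determine vulnerability count"
    if vuln_count >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif vuln_count >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Everything after "|" is performance data: label=value;warn;crit
    status = (
        f"SNYK {label} - {vuln_count} known vulnerabilities "
        f"| vulns={vuln_count};{warn};{crit}"
    )
    return code, status


def fetch_vuln_count():
    """Hypothetical placeholder: swap in a real query against Snyk here."""
    return 0


def main():
    code, status = evaluate(fetch_vuln_count())
    print(status)
    return code
```

To use it as an actual check, end the script with `sys.exit(main())` so Icinga sees the exit code. The same contract works for Nagios-compatible systems in general, which is a big part of why this approach adapts so easily.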
Integrating a test suite with statsd, Grafana, OpsGenie, Slack, and some custom HTML reports. This is actually really neat and gives me a lot of ideas.
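For the statsd piece specifically, the idea is simple enough to sketch. This is my own rough illustration, not the article's implementation: it turns test results into plain StatsD lines (a counter per outcome, a timer per duration) and sends them over UDP. The metric naming scheme, the `tests` prefix, and the shape of the `results` dict are all assumptions; 8125 is the conventional StatsD port.

```python
import socket


def statsd_lines(results, prefix="tests"):
    """Turn {test_name: (passed, duration_seconds)} into StatsD lines.

    Uses the plain StatsD text format: "<bucket>:<value>|<type>",
    where "c" is a counter and "ms" is a timer in milliseconds.
    """
    lines = []
    for name, (passed, seconds) in results.items():
        outcome = "passed" if passed else "failed"
        lines.append(f"{prefix}.{name}.{outcome}:1|c")
        lines.append(f"{prefix}.{name}.duration:{round(seconds * 1000)}|ms")
    return lines


def send(lines, host="127.0.0.1", port=8125):
    """Fire-and-forget each line to a StatsD daemon over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for line in lines:
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

Calling `send(statsd_lines({"login": (True, 0.25)}))` at the end of a test run would emit a pass counter and a 250 ms timer for a test named `login`. From there, graphing in Grafana and alerting on the failure counters is mostly configuration.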
Only a slide deck instead of a video, but it’s written in a way that’s still really useful. I’ve been itching for a reason to use osquery for a while and this deck just reminds me of that.
Hat tip to SRE Weekly for tuning me in to this one: a very detailed and well thought-out postmortem template. The only thing I’d personally change is the “root cause” bit to “contributing factors”. After all, there’s no such thing as a root cause.
See you next week!
– Mike (@mike_julian) Monitoring Weekly Editor