This issue is sponsored by:
Why Hosted Metric Analytics to Monitor Modern Cloud Applications & Infrastructure at Scale
Modern cloud application architectures require a modern monitoring and analytics approach. Find out why SaaS leaders like Workday, Intuit, Box, and Reddit chose hosted metric analytics for real-time insights across all their engineering teams.
From The Community
This is a list of really awesome tips about improving how you use Grafana. Seriously, it’s a great list.
Dan Barker, one of the writers for OpenSource.com and organizer of DevOpsDays Kansas City and the DevOps Kansas City meetup, just released a fantastic guide to open source monitoring and observability tools. Highly recommend downloading the full guide.
I’ve been teaching the USE and RED methods to clients lately and it’s been fascinating to see the impact it has on their monitoring. In my experience, the biggest challenge in monitoring isn’t alert fatigue, but knowing what needs to be instrumented and monitored to begin with. USE and RED make for a fantastic starting point.
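For anyone who hasn’t run into them: USE covers Utilization, Saturation, and Errors for resources, and RED covers Rate, Errors, and Duration for request-driven services. Here’s a minimal sketch of what RED boils down to for a batch of requests. The `Request` record, the `red_summary` helper, and the "5xx counts as an error" rule are all assumptions made up for illustration, not anything taken from the methods’ authors.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_s: float
    status: int


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration over one observation window."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_per_s": total / window_s,                      # Rate
        "error_ratio": errors / total if total else 0.0,     # Errors
        "max_duration_s": max(durations) if durations else 0.0,  # Duration
    }


sample = [
    Request(0.1, 200),
    Request(0.3, 500),
    Request(0.2, 200),
    Request(0.4, 200),
]
print(red_summary(sample, window_s=2.0))
```

In a real system you’d track a duration distribution (a histogram, say) rather than just the maximum, but the three questions stay the same.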
The (in)famous Corey Quinn of Last Week in AWS recently graced the website of Monitoring Weekly and wrote this gem on a new/old concept: Observerless. Now with video of this talk from ServerlessConf, available at A Cloud Guru.
Most of the time, these Top N lists are garbage, but this one… Well, this one is gold. It starts off with Coda Hale’s seminal Metrics, Metrics Everywhere talk from 2011 and only gets better from there.
Basically exactly what the title says: metrics you should absolutely be tracking if you’re running in AWS.
I like this straightforward explanation of log levels. For those who don’t know, these are a subset of the well-defined syslog severity levels, which are awesome and more people should use them.
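To make the syslog connection concrete, here’s a rough sketch that lines up Python’s standard `logging` levels against the eight syslog severities from RFC 5424. The article’s own set of levels may differ, and the mapping table itself is my assumption about the most natural pairing, so treat it as illustrative.

```python
import logging

# The eight syslog severities from RFC 5424; 0 is the most severe.
SYSLOG_SEVERITIES = {
    0: "Emergency",
    1: "Alert",
    2: "Critical",
    3: "Error",
    4: "Warning",
    5: "Notice",
    6: "Informational",
    7: "Debug",
}

# Assumed mapping from Python's logging levels to syslog severities.
# Python has no direct equivalent of Emergency, Alert, or Notice.
PYTHON_TO_SYSLOG = {
    logging.CRITICAL: 2,
    logging.ERROR: 3,
    logging.WARNING: 4,
    logging.INFO: 6,
    logging.DEBUG: 7,
}


def syslog_severity(levelno: int) -> int:
    """Return the syslog severity number for a Python logging level."""
    return PYTHON_TO_SYSLOG[levelno]


for levelno, severity in PYTHON_TO_SYSLOG.items():
    print(logging.getLevelName(levelno), "->", severity, SYSLOG_SEVERITIES[severity])
```

Note how the application-level names cover five of the eight severities, which is the "subset" relationship in action.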
Everyone’s favorite book, the Google Site Reliability Engineering book, now has a companion book: The Site Reliability Workbook. This new book aims to be the practical application of the original book, which was a whole lot of theory. Looking at the table of contents, there’s a lot of great stuff about monitoring, incident management, and more.
Need a bit more k8s and Prometheus in your life? Here’s a new well-written walkthrough.
With all this talk about “high cardinality” around monitoring lately, this article finally explains what it really means in concrete examples.
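If you want a quick feel for why cardinality matters, here’s a back-of-the-envelope sketch. It assumes a metric whose labels can combine freely, so the worst-case number of distinct time series is the product of each label’s distinct values. The label names and counts are made up for illustration.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical labels on a request-count metric.
low = {"method": 5, "status": 6, "region": 4}
# Adding one high-cardinality label (a user ID) to the same metric.
high = {**low, "user_id": 100_000}

print(max_series(low))   # 120
print(max_series(high))  # 12000000
```

One extra label turns a trivial metric into millions of potential series, which is why tagging by user ID or request ID is a very different proposition from tagging by region.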
Calculating a latency SLO is harder than it first seems, and you’re probably doing it wrong (I know I am).
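One way to get it wrong (which may or may not be the one the article focuses on) is averaging percentiles across hosts instead of computing them over all requests. Here’s a small sketch with made-up numbers; the `percentile` helper uses the nearest-rank definition, and the 200 ms threshold is an assumption for illustration.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fraction_within(values: list[float], threshold_ms: float) -> float:
    """Share of requests at or under the latency threshold."""
    return sum(1 for v in values if v <= threshold_ms) / len(values)


# Hypothetical latencies in milliseconds.
host_a = [10] * 99 + [1000]      # busy host, one slow request
host_b = [10] * 5 + [1000] * 5   # quiet host, half its requests slow

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)
print("average of per-host p99s:", (p99_a + p99_b) / 2)
print("p99 over all requests:", percentile(host_a + host_b, 99))
print("fraction under 200 ms:", fraction_within(host_a + host_b, 200))
```

The averaged figure doesn’t correspond to any real request population. Counting good requests against total requests, as in `fraction_within`, sidesteps the problem because counts can be summed across hosts and time windows.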
Cindy Sridharan/@copyconstruct is back with a new monster post on health checks and it’s great. Go read it.
When I’m not writing this newsletter, I help companies overhaul their approach to monitoring. After working and talking with tons of large companies, you know what they all have in common? They struggle to get buy-in from the teams they’re trying to help. Yep, you’re not alone. Want some expert help with the problem? Let’s chat.
In other words, they traded an invoice large enough to give most CFOs a heart attack for a complexity level high enough to give most SREs a heart attack. Then again, the team managing their ES stack is probably larger than most of our entire SRE teams.
This upcoming feature in Grafana looks amazing and I can’t wait to see it out of beta. In essence, this allows you to take a query that’s defined in a dashboard and explore deeper in an ad hoc way by changing the query–without losing the dashboard config. Only Prometheus is supported in the beta and I’m looking forward to seeing more datasources supported on this.
Breaking down large problems into smaller problems is a tried-and-true method of solving problems and finding insight, and this article on logging does just that. The article makes the observation that logging is really five separate problems.
This isn’t strictly monitoring-related, but given the kind of work everyone reading this newsletter does, I know you’ll be interested in it. For those not familiar with it, the annual State of DevOps Report is a tremendous work headed up by Dr. Nicole Forsgren using legit, rigorous research and statistical analysis methods. The results of this one are pretty neat to read.
This interview between A Cloud Guru and Charity Majors has some great stuff in it that will get you thinking about what your future in monitoring and observability could be.
It’s always interesting to get a peek into how other companies use tools. The folks at Logicify have gone into detail on how they use Grafana and the different use cases they have for it. I especially like the business intelligence use case.
Not sure what to write for your status page updates? Follow these instructions–they’re great. You may also be interested in the followup article, Status page updates: It’s all about timing.
Want more about monitoring RDS than you could possibly ask for? Here you go: a monster post about RDS covering metrics, logging, and even CloudTrail.
Mark Carter, Product Manager for Stackdriver, talks to Software Engineering Daily about monitoring, tracing, observability and everything in between.
My thoughts on this are kinda tangential to the article, but everyone loves a good rant, right? To quote Jeff Hodges, “A systems engineer without a good startup idea inevitably winds up doing monitoring.” Building time series databases is a hard problem so maybe this post will head some startups off at the pass should they decide, “I know! We’ll build yet another damn monitoring service!”
Monitoring the awful horribleness that is the banking industry has always fascinated me (I’m a masochist, clearly), so this post from the folks at Plaid got my attention. They take us through how they chose the components for the next iteration of their monitoring platform and how it all fits together to monitor 9600+ banks.
One of the annoying things about monitoring is that it’s actually kinda hard to do at small scale. When you’ve got 100+ nodes, it makes sense to deploy a robust monitoring infrastructure, but that’d be dumb when you have one or two servers (such as a personal VPS). And yet, there aren’t a lot of good tools for monitoring that few systems well. This seems to be a solution to that problem area.
This issue is sponsored by:
Move Faster, See Everything, and Deploy Confidently
Get real-time analytics and massive scale so your Dev and Ops teams can move faster on a stable cloud application estate. Use full stack monitoring to slash MTTR. Start your free 30-day trial today with Wavefront by VMware.
I had the pleasure of speaking with the hiring manager recently and it sounds like a really awesome gig. If you’re into Ops/SRE/DevOps and love monitoring, click through to check it out.
Want your job listed here? Why not submit a post to the job board? It’s only $199/ad for 30 days.