Hey folks, welcome to the 2018 Q1 Best Of Special Edition issue of Monitoring Weekly! This issue contains all the most-liked articles from Q1 2018. In case you missed something or just want another good look at the best stuff from the past quarter, here it is. Go grab a coffee and settle in, ’cause there’s a lot of great stuff to read.
This issue is sponsored by: Free Guide: Low-Risk Continuous Delivery
Adopting Continuous Delivery can bring a lot of benefits, but deploying to production can be filled with uncertainty. Learn how to reduce the risks with the right culture, architecture, and tooling to deploy early and often.
Check out this free guide as we explore solutions.
Monitoring News, Articles, and Blog posts
On-call doesn’t have to suck
Another monster knowledge-bomb from Cindy/@copyconstruct. I won’t try to summarize it all here, but it’s definitely worth a read.
Structure your logs, people. (well, sometimes don’t bother, but if you’re reading this newsletter: structure your logs!)
Continuing in their series on structured logging, this post from Snyk (guest-posting for Honeycomb) goes through how they structure their logs. One of the more interesting bits in this is how they effectively record each request end-to-end by logging both the start and end of it at a minimum, allowing them to follow a request and the actions it encounters along the way of being serviced.
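For anyone who wants to try the start-and-end idea, here is a minimal sketch in Python. It is not Snyk's implementation; the event names, the field names, and the `logged_request` helper are all made up for illustration. The point is only that every line is a JSON object and every line for one request carries the same `request_id`.

```python
import io
import json
import time
import uuid
from contextlib import contextmanager


def log_event(stream, **fields):
    """Write one JSON object per line so the log stays machine-parseable."""
    stream.write(json.dumps(fields, sort_keys=True) + "\n")


@contextmanager
def logged_request(stream, method, path):
    """Log the start and end of a request under a shared request_id.

    Hypothetical helper: the field and event names are assumptions,
    not taken from the article.
    """
    request_id = str(uuid.uuid4())
    started = time.monotonic()
    log_event(stream, event="request_start", request_id=request_id,
              method=method, path=path)
    status = "ok"
    try:
        # Hand the id to the caller so intermediate events can reuse it.
        yield request_id
    except Exception:
        status = "error"
        raise
    finally:
        # The end event is written even when the handler raises.
        duration_ms = round((time.monotonic() - started) * 1000, 3)
        log_event(stream, event="request_end", request_id=request_id,
                  status=status, duration_ms=duration_ms)


def demo():
    """Run one fake request and return the parsed log events."""
    stream = io.StringIO()
    with logged_request(stream, "GET", "/health") as request_id:
        log_event(stream, event="cache_lookup", request_id=request_id, hit=True)
    return [json.loads(line) for line in stream.getvalue().splitlines()]
```

Filtering on a single `request_id` then gives you the whole story of that request, from start to end, in order.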
On-call handoff is an under-appreciated and oft-overlooked aspect of on-call. Having the opportunity to discuss the previous on-call period is incredibly worthwhile for teams. One of the tricky parts, though, is remembering everything about it. The folks at GitLab have created a tool to make this much easier to do.
Logging using purely AWS services has always felt kinda wonky to me, in a “wow, that’s…complicated” sort of way. It hasn’t actually changed, but this article from AWS goes over the architecture for a multi-account logging infrastructure using a dedicated AWS account for log processing, aggregation, and storage. I can’t imagine ever using this architecture, but it does give some interesting ideas.
We all understand that infrastructure has changed dramatically in recent years, as new methodologies and tools have started spreading. But has your approach to monitoring and observability changed with it? Are you sure?
Do you use HAProxy? Then you’re probably familiar with the treasure trove of metrics it provides. The author of this article walks us through getting HAProxy metrics into Prometheus, and specifically, getting the elusive Duration metric (the D in the RED model: Rate, Errors, Duration) using Fluentd.
More open-sourced monitoring tools from the folks at LinkedIn’s SRE team. This time they’ve released a tool to aid in investigation of an alert and another tool to visualize the results…in ASCII.
A home-grown self-hosted status page app from the folks at Crisp, complete with some basic alerting functionality. Seems to be primarily built for Node-based apps, though it has some support for HTTP and TCP checks.
This is a much better explanation of why SQL databases are a poor choice for time series data than I’ve been able to give. It’s not like I’m some monitoring expert or anything, right? Very nice.
Why be on-call if you don’t have to be? The author gives a few compelling reasons for why they’re still on-call as a manager.
Everyone loves a good statistical index. I had almost forgotten about Apdex until I saw this article, which, as it turns out, is also the most interesting and complex thing I’ve seen done with Prometheus. I’d really love to see more of this beyond this proof-of-concept. Anyone doing anything similar? I’d love to hear from you!
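If you had forgotten about Apdex too, here is a quick refresher in Python. This is only the standard Apdex formula applied to a list of response times, not the Prometheus approach from the article. Requests at or under the target threshold T count as satisfied, those up to 4T count as tolerating (at half weight), and everything slower counts as frustrated. The function name and sample numbers are just for illustration.

```python
def apdex(durations, threshold):
    """Return the Apdex score for response times, given target threshold T.

    satisfied:  duration <= T
    tolerating: T < duration <= 4T (counted at half weight)
    frustrated: duration > 4T (counted as zero)
    """
    if not durations:
        raise ValueError("need at least one sample to compute Apdex")
    satisfied = sum(1 for d in durations if d <= threshold)
    tolerating = sum(1 for d in durations if threshold < d <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(durations)


# Four requests against T = 0.5s: two satisfied, one tolerating, one frustrated.
score = apdex([0.1, 0.2, 0.6, 3.0], threshold=0.5)
print(score)  # 0.625
```

A score of 1.0 means every request met the target; 0.0 means every request was frustrating.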
This is a really interesting stack of technology: Bucky, Celery, Tornado, and StatsD. It’s a neat setup for solving the problem of async task monitoring.
Prometheus Blog Series
One of the best things about long breaks is that people seem to catch up on all the writing they’ve been meaning to do. Our first stop this issue is a five-parter on Prometheus and it’s pretty awesome.
- Part 1: Metrics and Labels
- Part 2: Metric types
- Part 3: Exposing and collecting metrics
- Part 4: Instrumenting code in Go and Java
- Part 5: Alerting rules
Much has been said about the differences between monitoring monoliths and microservices and the new challenges that come with a microservice architecture. But you know what? This article is a pretty clear and concise explanation of those challenges and I really like it.
The author takes us through how to use Grafana 5.0’s new programmatic dashboard and data source definitions.
The folks at Sitewards reflect on 12 months of Prometheus in their organization, lessons learned, and where they hope to go from here.
Straightforward enough for a checklist, which is pretty neat. I’d caution you to be wary about relying on load average, though.
This is quite a nice overview of monitoring at the conceptual/component level.
Curious about Prometheus? Here’s another well-written start-up guide.
The author does a really great job of writing an explainer article covering monitoring, observability, and a whole bunch of related things. Also, Monitoring Weekly and I are mentioned, which was a really nice surprise for my ego (but the article is totally worth reading).
There has been a trend lately of people (cough, monitoring companies, cough) jumping on this rising tide of “observability” and suddenly changing their marketing from “We do monitoring!” to “We do observability!” All that would be great if they had actually been doing observability all along. Pretty much none of them were. This post lays out a vision for what is actually required to say you have an observability platform versus a monitoring platform.
Enjoy your weekend, folks!
– Mike (@mike_julian) Monitoring Weekly Editor