Hey folks, welcome to another installment of Monitoring Weekly! Did you write something about monitoring recently? Maybe got an idea rolling around in your head? Send it on over and let the community learn from you. 😀
Monitoring News, Articles, and Blog posts
Prometheus Blog Series
One of the best things about long breaks is that people seem to catch up on all the writing they’ve been meaning to do. Our first stop this issue is a five-parter on Prometheus and it’s pretty awesome.
Ever wondered about the underlying mechanics behind distributed logging? This two-parter goes into the topic, working up from first principles.
I love seeing writeups about real-world monitoring architectures, mainly because they’re never as simple as the various vendors would have you believe. This article does a great job explaining MakeMyTrip’s monitoring architecture and how each component is used.
I love Monitorama. In my not-so-humble opinion, it’s the best tech conference out there right now. 😉 Jason Dixon, the founder, just made it even better: speakers are now paid $1,000 for full-length talks and $200 for lightning talks.
Speaking of which, the CFP closes on February 1st. Have you submitted?
Building a monitoring platform in your typical startup is pretty straightforward. Building something in an older enterprise is…less so. This talk from the folks at Northern Trust about their experiences in rolling out Prometheus and Grafana is a great watch.
Ever wondered what MySQL metrics you should be gathering and why? Here you go–complete with some awesome owl art.
More open-sourced monitoring tools from the folks at LinkedIn’s SRE team. This time they’ve released a tool to aid in investigation of an alert and another tool to visualize the results…in ASCII.
I love silly, misleading metrics.
Did I say love? I meant loathe.
Yeah, this isn’t about typical monitoring, but I figure you lot are interested in anything to do with metrics, regardless of where in the stack it falls. NPS is one of those metrics you’ll run into a lot when working with your marketing team and customer support team, so it helps to understand the background and the pitfalls of the measurement (according to this article, it’s mostly pitfalls).
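If you haven't seen the arithmetic behind NPS, here's a minimal sketch using the standard definition: respondents rate 0 to 10, 9s and 10s count as promoters, 0 through 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and sample ratings are mine, made up for illustration, and not taken from the linked article.

```python
def net_promoter_score(ratings):
    """Return NPS (-100 to 100) for a list of 0-10 survey ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)   # 9s and 10s
    detractors = sum(1 for r in ratings if r <= 6)  # 0 through 6
    # Passives (7s and 8s) only count toward the total.
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical survey results: a polarized audience and a lukewarm one.
polarized = [10, 10, 0, 0]
lukewarm = [7, 8, 7, 8]

print(net_promoter_score(polarized))
print(net_promoter_score(lukewarm))
```

Both sample audiences come out to the same score even though they look nothing alike, which is one reason a single NPS number can mislead.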
The folks at Booking.com, whom some of you might know as the creators and managers of one of the largest known Graphite setups, have done a nice writeup on their incident management process. One thing that really stuck out to me was this bit:
However, too many escalations create an overload in work. Communication and “freedom to leave” are the ways to go. Even in an all-hands-on-deck situation, you can decide whether you can help or not. You are free to join, assess the situation. You are free to leave when you can’t help or because the situation is under control. We trust our people to make the right call and to manage their time in the best possible way.
Datadog does a fantastic job of putting out amazing articles from time to time. After a bit of a break in publishing, they’ve put out a two-parter on PostgreSQL. Part 1 covers the key metrics of Postgres and what they mean.
This is Part 2 of the above link, covering how to actually get the metrics you learned about in Part 1.
Want more Postgres? Here, have some more Postgres! I love the SysAdvent articles, and this one about monitoring Postgres replication lag is a great one.
If you’re a Sensu user, you’ll love this fantastically detailed post on what Sensu aggregates are and how to use them. If you’re not a Sensu user, most modern tools have a similar concept. Either way, most of you should totally be using aggregates (or similar).
Full disclosure: My company, Aster Labs, is a Sensu Partner. I received no consideration, financial or otherwise, for including this post.
See you next week!
— Mike (@mike_julian) Monitoring Weekly Editor