Hey folks, welcome to the Q3 Best Of Special Edition issue of Monitoring Weekly! This issue contains all the most-liked articles from Q3 2017. In case you missed something or just want another good look at the best stuff from the past quarter, here it is. Go grab a coffee and settle in, 'cause there's a lot of great stuff to read.
Monitoring News, Articles, and Blog posts
Monitoring and Observability
Lots of hubbub about “monitoring vs observability” lately! I really like this post for its in-depth treatment of what it means to build “observable” systems and how that's different from plain old monitoring.
For those not steeped in the world of the ELK stack, Elasticsearch is the E in the ELK acronym. Elasticsearch serves as the data storage and search engine components of the logging/analysis stack. To say Elasticsearch can get confusing for newcomers is, well, an understatement. To rectify this, the fine folks at Elastic have written a wonderful guide on it. Even if you’ve been around the Elasticsearch ecosystem for a while, it might be worth reading for the refresher.
More slick stuff from the team at LinkedIn. This time, they’ve open sourced what amounts to an internal PagerDuty. This could be rather useful for those of you with internal compliance requirements that prohibit the use of SaaS.
Metrics are great and all…assuming you know what they actually tell you (or don’t tell you). The author takes a hard look at disk utilization and saturation metrics. Skip to the end for the immediate takeaway (spoiler: don’t rely on util%).
Being in the midst of a stats bender myself, this article on the math behind percentiles is timely and relevant. If you’re interested in understanding percentiles and using them effectively, this is a really helpful article.
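One classic gotcha the stats-minded will recognize: percentiles can't be averaged. A quick sketch (using numpy, with made-up latency distributions) shows how far off the "average of per-server p99s" can be from the true global p99:

```python
import numpy as np

# Simulated request latencies (ms) from two servers with different load profiles.
rng = np.random.default_rng(42)
server_a = rng.exponential(scale=20, size=10_000)
server_b = rng.exponential(scale=80, size=10_000)

# Per-server 99th percentiles.
p99_a = np.percentile(server_a, 99)
p99_b = np.percentile(server_b, 99)

# The global p99 must be computed over the combined raw data --
# averaging the per-server p99s gives a meaningless number.
global_p99 = np.percentile(np.concatenate([server_a, server_b]), 99)
naive_p99 = (p99_a + p99_b) / 2

print(f"global p99: {global_p99:.0f} ms vs averaged p99s: {naive_p99:.0f} ms")
```

The two numbers disagree badly, which is exactly why aggregation of pre-computed percentiles across hosts is a trap.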
Everything you wanted to know but didn’t think to ask about load averages. I’ve always been taught that load average is CPU load average, but apparently, that’s not quite true for Linux (but is for other *NIX operating systems). There’s some really great stuff in this article.
The author makes a compelling point for championing a single metric as The Metric to Use for improving website performance and maintaining that performance over time. I especially like the author’s disdain for the document.onLoad method of measuring page load time.
I like statsd, I really do–it revolutionized the world of monitoring when it was introduced back in 2011 and continues to change how teams interact with their apps even today. But, as Brian points out, it does have some warts and growing pains. Maybe it’s time for the next iteration/spiritual successor?
It seems like half the people who do monitoring in the web-scale world forget about synthetic monitoring entirely, and the other half just pay for Pingdom. For some reason, I always forget about worldPing, which is built by the lovely folks at Grafana Labs. This article is one team’s experience in moving from one unnamed-and-inadequate synthetic tool to worldPing.
I ran across this earlier in the week, which is way better than constantly googling for “high availability chart.” Of course, when it comes to availability numbers, it’s as they say: there be dragons. I’m convinced most SLAs are just lies and wishful thinking, but using these as targets is helpful.
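If you'd rather derive the chart than google it, the arithmetic is trivial; a quick sketch:

```python
# Allowed downtime per year for common availability targets ("the nines").
MINUTES_PER_YEAR = 365.25 * 24 * 60

for target in ["99%", "99.9%", "99.99%", "99.999%"]:
    availability = float(target.rstrip("%")) / 100
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{target}: ~{downtime_min:.0f} minutes of downtime per year")
```

Seeing "five nines" translate to about five minutes per year is usually enough to make an SLA conversation more honest.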
ELK too complex for your needs? Grepping syslog flat files not flexible enough? oklog aims to solve that middle ground in tooling. It’s a pretty new project but looks like it could be neat for a lot of use cases.
Transition and growth stories are always fun. It’s like getting a free peek into the future and learning from other people’s stories of troubles and woe and how they overcame them. This article details Matomy’s growth from humble Graphite beginnings to a robust metrics platform consisting of Brubeck, InfluxDB, and Grafana, including their brief stint with a scaled-out Carbon+Whisper cluster.
If you were fortunate enough to be at Monitorama 2016, you might have seen Heinrich Hartmann’s great talk on Statistics For Engineers. He’s just turned his attention to understanding the math behind the well-known metric from ‘iostat’, average queue length. It’s worth a read, even if it’s just to remind you that you probably should have paid more attention in math class (yeah, I’m guilty too…).
With all this talk of structured logs lately, it felt like people were glossing over the constraints and limitations in the network gear world–then I ran across this new project. This tool listens on the syslog port, taking unstructured logs from network devices and turning them into structured (JSON) entries. Super awesome. Maybe my favorite new tool this year.
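The core idea is simple enough to sketch. This is a hypothetical parser (not the project's actual code; field names and the regex are illustrative) that turns a classic BSD-syslog line from a network device into a structured JSON record:

```python
import json
import re

# Hypothetical parser: classic BSD-syslog line -> structured record.
SYSLOG_RE = re.compile(
    r"<(?P<pri>\d+)>"                                  # priority value
    r"(?P<timestamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) "    # e.g. "Oct  5 14:02:11"
    r"(?P<host>\S+) "                                  # device hostname
    r"(?P<message>.*)"                                 # the rest of the line
)

def parse_syslog(line: str) -> dict:
    m = SYSLOG_RE.match(line)
    if not m:
        return {"raw": line}  # fall back to the unparsed line
    record = m.groupdict()
    record["pri"] = int(record["pri"])
    return record

line = "<189>Oct  5 14:02:11 edge-sw1 %LINK-5-CHANGED: Interface Gi0/1, changed state to up"
print(json.dumps(parse_syslog(line)))
```

A real implementation also has to cope with the many vendor-specific dialects of "syslog," which is exactly the unglamorous work that makes a tool like this worth adopting.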
There’s something beautiful and compelling about a good graph, I must say, and it seems I’m not the only one who thinks so. Someone at LinkedIn has been saving interesting graphs internally for a while and decided to write about it. Bonus points for the graphs-as-art at the end, which reminds me of the fantastic lightning talk by Dave Josephsen at Monitorama 2016.
This came out a few weeks ago, but I somehow missed it and it’s definitely worth mentioning: send your NetFlow data to ELK. Much better than writing six figure checks to some NetFlow vendor that doesn’t play nicely with your other tools.
I’ve never found a good way to monitor serverless/FaaS effectively. Even on AWS, Lambda’s Cloudwatch/Cloudwatch Logs story is kinda…crap. I don’t normally link you folks to commercial solutions, but Dashbird is unique in that it’s solving a problem no one else is working on, and it looks really promising too.
Not a lot to add here: Prometheus-as-a-Service. Well, mostly–it’s a Prometheus-compatible API, at least.
In a similar vein as above, the author goes through Linux delay accounting and the various uses for it. Sadly, not a lot of monitoring tools support this data yet–anyone want to take a crack at it?
There’s not much here…except for some neat and interesting graphs. With Cloudflare handling so much of the global internet traffic, they’ve got a unique perspective on what’s happening with internet traffic. What does traffic look like when a country cuts itself off from the internet? Or when a hurricane hits a region? Click through and find out.
Carnegie Mellon has started a lecture series on time series databases, which is both apropos and super neat. First up is Paul Dix, the founder behind InfluxDB. Click through for an hour of Paul talking about the unique challenges of TSDBs, TSDB design, and InfluxDB internals.
This is a surprisingly in-depth article on the author’s experience with implementing monitoring in their new Kubernetes-based infrastructure. They cover everything from metrics to tracing, and even how to write the readiness/liveness checks.
Continuing the discussion of “what the hell is observability”, the author of this article makes the case that monitoring is an action–a thing you do–while observability is a property or attribute, much like “testability” or “usability.”
Ignoring the SignalFx-specific examples, this article is one of the simplest explanations I’ve seen yet for predictive statistical functions. The author goes over linear projection and double exponential smoothing, which are two of the most effective predictive functions for our sort of data. Chances are high that whatever you’re using for metrics also supports these functions.
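Double exponential smoothing (Holt's linear method) is only a few lines on its own; here's a rough sketch, with arbitrary smoothing parameters:

```python
def double_exponential_smoothing(series, alpha=0.5, beta=0.3, horizon=3):
    """Holt's linear method: smooth a series and project `horizon` steps ahead."""
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        last_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    # Linear projection from the final level along the final trend.
    return [level + (i + 1) * trend for i in range(horizon)]

# A steadily growing metric (say, disk usage in GB) projected forward.
usage = [10, 12, 14, 16, 18, 20]
print(double_exponential_smoothing(usage))
```

On a perfectly linear series like this one, the forecast simply continues the line; the alpha/beta knobs control how quickly the fit reacts to changes in level and trend respectively.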
Way back in the wild days of the 90s, monitoring was simple: you built a thing and slapped a cronjob on it to tell you when it went wrong. Nagios allowed us to do essentially the same thing at scale. Fast-forward to today, and the world of infrastructure and applications, and how we monitor it all, is vastly different. Has terminology caught up with the modern times yet? People much smarter than I have been asking this question for a while (@lusis’s post from 2011, @grepory’s talk from 2016) and the trend continues with this article. I think it will be very interesting to see where this leads us.
Enjoy your weekend, folks!
– Mike (@mike_julian) Monitoring Weekly editor