Thanks for joining us for another issue of Monitoring Weekly! This week we have a grab-bag of monitoring stories with an emphasis on logging and incident response. We hope you enjoy them.
Monitoring News, Articles, and Blog Posts
Prometheus is a cloud-native monitoring system seeing impressive growth and adoption. This hands-on guide demonstrates how to instrument an existing app and expose your metrics using a Prometheus-compatible HTTP endpoint and then ask questions of your data with PromQL.
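For a feel of what a "Prometheus-compatible HTTP endpoint" involves, here is a minimal sketch using only the Python standard library. It is not taken from the linked guide, which may well use an official client library instead. The metric name `app_requests_total`, the `path` label, and the port are all assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter store: request path -> number of requests seen.
REQUESTS = {}


def record_request(path):
    """Increment the counter for one handled request path."""
    REQUESTS[path] = REQUESTS.get(path, 0) + 1


def escape_label(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Requests handled, by path.",
        "# TYPE app_requests_total counter",
    ]
    for path, count in sorted(REQUESTS.items()):
        lines.append('app_requests_total{path="%s"} %d' % (escape_label(path), count))
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Serve the current counter values for Prometheus to scrape.
            body = render_metrics().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # Any other path counts as application traffic in this sketch.
            record_request(self.path)
            self.send_response(204)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()
```

Once Prometheus scrapes this endpoint, a PromQL query such as `rate(app_requests_total[5m])` would show the per-second request rate for each path.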
What happens when you need some old log entries but your logging service has already purged them? This author discovered a handy pipeline for reconstructing archived Papertrail logs using S3 and Amazon Athena.
If you’re monitoring at “Twitter Scale”, there’s a good chance you’ve looked at stream processing engines like Apache Storm or Twitter’s in-house successor, Heron. This article gives an overview of Heron’s design and how their engineers worked to identify wasted cycles, refactor the stream manager, and measure their optimizations.
Alert fatigue is a common problem among operations and engineering teams. Auto-remediation is one possible strategy for fixing routine problems before they result in paging a human operator, but it remains mostly an academic discussion; few companies have made the engineering investment in fully automated self-healing systems. Facebook is one of those companies. This article from 2016 looks at Facebook’s auto-remediation design, and where they draw the line between system automation and manual intervention.
The new Grafana release includes a number of improvements and bugfixes, with a particular emphasis on alerts and notifications.
Datadog has a history of really helpful blog posts covering how to monitor popular open source services. This week they’ve added a three-part series on the Apache HTTP server. Even if you’re not a Datadog customer, these posts are a great resource for understanding how specific services can be instrumented and monitored effectively.
Ask any frugal Heroku customer, and they’ll tell you how to keep Heroku runtimes from sleeping by “pinging” them with automated HTTP checks. The author of this article uses a similar approach with AWS CloudWatch to keep seldom-used Lambda functions “warm” in memory. A crafty workaround for a practical service concern in the land of serverless apps.
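One way the function side of this trick might look, as a sketch rather than the article's actual code: a Lambda handler that recognizes a scheduled CloudWatch Events invocation (these arrive with `"source": "aws.events"`) and returns early, so the warm-up ping costs almost nothing. The `do_real_work` helper and the return values are hypothetical.

```python
def is_warmup_event(event):
    """Return True for scheduled CloudWatch Events invocations."""
    return isinstance(event, dict) and event.get("source") == "aws.events"


def do_real_work(event):
    """Hypothetical stand-in for the function's normal request handling."""
    return {"statusCode": 200, "body": "hello"}


def handler(event, context):
    # Short-circuit scheduled pings so they only keep the container warm.
    if is_warmup_event(event):
        return {"warmed": True}
    return do_real_work(event)
```

The scheduling half lives outside the code: a CloudWatch Events rule with a rate expression that targets the function every few minutes.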
An example of how to instrument your application with endpoints for polling health status or performance data. The py-healthcheck library makes it easy to add custom Python functions exposing telemetry via Flask routes or Tornado handlers.
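The general pattern can be sketched in plain Python without any web framework. To be clear, this is not py-healthcheck's API: the `register` decorator, the `run_checks` function, and the example check are hypothetical names invented for illustration. Each check returns a pass/fail flag plus a message, and the aggregate result maps to an HTTP status code that a route or handler could return.

```python
import json

# Hypothetical registry of check functions; each returns (passed, message).
CHECKS = []


def register(check):
    """Decorator that adds a check function to the registry."""
    CHECKS.append(check)
    return check


def run_checks():
    """Run every registered check and return (http_status, json_body)."""
    results = []
    healthy = True
    for check in CHECKS:
        try:
            passed, output = check()
        except Exception as exc:
            # A check that raises is reported as a failure, not a crash.
            passed, output = False, str(exc)
        healthy = healthy and passed
        results.append({"checker": check.__name__, "passed": passed, "output": output})
    status = 200 if healthy else 500
    body = json.dumps({"status": "success" if healthy else "failure", "results": results})
    return status, body


@register
def config_loaded():
    # Hypothetical check; a real one might ping a database or a cache.
    return True, "config ok"
```

A Flask route or Tornado handler would then only need to call `run_checks()` and send back the status code and JSON body it returns.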
This story recalls an interesting bug in Elasticsearch where its own internal health statistics endpoint was publishing bad data, causing nodes to get removed from the cluster, added back, and then removed again. Wash, rinse, repeat. Though the article only mentions the bug affecting 5.1.1, this issue affects versions 5.1.1, 5.1.2, and 5.2.0, so make sure to check your production nodes and upgrade as needed.
Root cause analysis can be done very poorly, especially when it amounts to finger pointing without consideration of the underlying human factors. However, RCA is a valuable process for tracing the origin of a service outage. TJ Gibson gave a related talk at the recent SREcon 2017, offering sage advice for identifying biases, challenging your own assumptions, and recruiting outside perspectives when performing your own RCA investigations. Audio, video, and slides from the presentation are included.
An excellent primer on statistical modeling tools and techniques. Many of these questions and explanations apply to the monitoring data you’re probably already collecting. Even if you’re not a data analyst (or especially if you strive to be), it’s worth your time to check this one out.
Having spent a large portion of my career developing integrations and custom pipelines between third-party services, I find this story hits close to home. After moving from AWS CloudFront to Fastly, this team needed a way to forward their CDN logs to their in-house fluentd log aggregator. This walkthrough covers the steps they performed on each end to get these otherwise incompatible services talking the same language.
SF Prometheus Meetup Group
The San Francisco Prometheus Meetup Group is meeting on April 11th. Matthias Radestock, CTO at Weaveworks, will go into how they monitor their Kubernetes-backed Weave Cloud with Prometheus.
Fresh from London, the inaugural DevOps Exchange San Francisco meetup is on March 30th. The first one’s theme is, as you might have guessed, monitoring! You’ll hear from Avi Freedman on network traffic telemetry, Alex Solomon on incident response at PagerDuty, and Roy Rapoport on insight engineering at Netflix.
Thanks for joining us, folks! If you like what you’ve seen, invite your friends and colleagues! As always, if you have interesting articles, news, events, or tools to share, send them our way by emailing us (just reply to this email).
See you next week!