Hey folks, welcome to another installment of Monitoring Weekly! Did you write something about monitoring recently? Maybe got an idea rolling around in your head? Send it on over and let the community learn from you. :D
Monitoring News, Articles, and Blog posts
How To Establish a High Severity Incident Management Program
Monitoring and incident management go hand-in-hand, and this article from the great folks at Gremlin is pretty awesome. It is incredibly thorough in its approach, covering example severity levels and their meanings, incident lifecycles, the creation of severity levels for different kinds of products, and much more. Seriously great article.
This is quite a nice overview of monitoring at the conceptual/component level.
This article from the folks at GCP walks us through what a typical escalation policy looks like. What’s interesting about their escalation policies is how detailed they are.
Most Sensu deployments I’ve seen rely on Graphite as the TSDB, so I’m pretty happy about this article, which takes you through Sensu+InfluxDB.
I’ve always thought it was dumb that I had to send my MySQL RDS logs to S3 before being able to move them elsewhere. Now you can send them to CloudWatch Logs. Still not great, but at least there’s better integration between CloudWatch Logs and third-party logging systems than there is with S3, so all told this is a pretty good improvement.
[Project STAR: Streamlining Our On-Call Process | LinkedIn Engineering](https://engineering.linkedin.com/blog/2018/01/project-star-streamlining-our-on-call-process)
At first, I expected this article to basically be about LinkedIn’s efforts at reworking an internal version of PagerDuty, but the more I read, the more interesting it got: in setting out to solve a standard scheduling problem, they found large organizational challenges, such as engineers misunderstanding the impact and importance of being on-call to LinkedIn’s mission. I love this particular bit: “In reality, however, Voyager On-Call is far more important than even the most important project. If the site goes down, even the greatest revenue-doubling project is dead in the water.”
Continuing the series, Part 4 talks about the actual product that spawned the series to begin with and the lessons learned so far. In essence, the CAP theorem is the bane of anyone building a high-quality, distributed data store.
Datadog recently published a deep-dive treatment of AWS EC2 metrics, covering which metrics matter and what they mean, as well as how to actually collect them.
See you next week!
– Mike (@mike_julian) Monitoring Weekly Editor