Thanks for joining us for another issue of Monitoring Weekly!
Monitoring News, Articles, and Blog Posts
These notes from KubeCon Europe 2017 detail many aspects of running your own High-Availability Prometheus installation. While HA Prometheus is still in its infancy (in practice, it means running multiple independent instances), this article contains a number of useful takeaways, regardless of your Prometheus deployment type.
Paul Dix, cofounder and CTO at InfluxData, covers some of the evolution of InfluxDB’s time series storage engine (TSM) and the work going into their next-generation time series index (TSI). This index redesign is intended to alleviate current memory issues with high-cardinality metrics at scale.
Jamie Wilkinson, SRE at Google, recently gave a presentation at SREcon on how to design effective and useful alerts. Though Google uses Borgmon internally, Jamie relates all of his recommendations to Prometheus and shows how to implement them there.
Rushin Barot, SRE at LinkedIn, spoke at SREcon 2017 about how LinkedIn monitors the news feed using some pretty neat in-house tools. Their idea of using a “dark canary” to measure the impact of a deploy against production traffic without actually affecting production is clever and reminds me of the Facebook “dark launch” of username support (2009).
Charity Majors, cofounder at Honeycomb, spoke at SREcon 2017 about the recent explosion of infrastructure tools and what the future of monitoring should really look like: less monitoring, more observability and debugging. Probably my (that is, Mike’s) favorite talk from SREcon on monitoring (sorry Rushin and Jamie!).
A recently revived meetup in San Francisco, DOXSFO, hosted a monitoring-themed event last week with Avi Freedman (CEO, Kentik), Alex Solomon (CTO, PagerDuty), and Roy Rapoport (Insight Engineering at Netflix, aka Netflix’s monitoring team) all giving really interesting talks on their respective domains.
A high-level overview, including some lessons learned and caveats, of how Kik integrated a number of open source monitoring components to measure performance across an auto-scaling fleet of cloud instances providing video features to their chat service.
If you love monitoring, then it stands to reason that you probably also care about good engineering and scaling practices. This post from the Stripe blog talks about how they build rate-limiting strategies into their API services. While this isn't your typical monitoring story, it's still a thoughtful and insightful read about how to design proactively when building new systems.
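For readers who have not built a rate limiter before, here is a minimal sketch of one common strategy, the token bucket. This is an illustration only and not necessarily what the post describes: the `TokenBucket` type, its fields, and the idea of passing the clock in as plain seconds are all assumptions made to keep the example small and deterministic.

```go
package main

import "fmt"

// TokenBucket is a hypothetical single-goroutine rate limiter.
// It holds up to `capacity` tokens and regains `refillRate` tokens
// per second. It is not safe for concurrent use.
type TokenBucket struct {
	capacity   float64
	tokens     float64
	refillRate float64 // tokens added per second
	lastSec    float64 // timestamp of the last refill, in seconds
}

// NewTokenBucket returns a full bucket. nowSec is the current time
// in seconds, injected so the example does not depend on a real clock.
func NewTokenBucket(capacity, refillRate, nowSec float64) *TokenBucket {
	return &TokenBucket{
		capacity:   capacity,
		tokens:     capacity,
		refillRate: refillRate,
		lastSec:    nowSec,
	}
}

// Allow refills the bucket for the time elapsed since the last call,
// then spends one token if one is available.
func (b *TokenBucket) Allow(nowSec float64) bool {
	if elapsed := nowSec - b.lastSec; elapsed > 0 {
		b.tokens += elapsed * b.refillRate
		if b.tokens > b.capacity {
			b.tokens = b.capacity // never store more than a full burst
		}
		b.lastSec = nowSec
	}
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// Burst of 2 requests, refilling at 1 request per second.
	b := NewTokenBucket(2, 1, 0)
	for _, t := range []float64{0, 0, 0, 1, 1} {
		fmt.Printf("t=%.0fs allowed=%v\n", t, b.Allow(t))
	}
}
```

A production limiter would also need locking (or shared storage across API servers) and a real time source; both are left out here on purpose.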
One of InfluxData’s platform engineers explains why they moved away from Go’s standard-library expvar package to their own forked version. Both libraries provide a standard HTTP endpoint for exposing application-level statistics to polling services and collectors.
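If you have not used expvar, the sketch below shows roughly what “a standard HTTP endpoint for application-level statistics” means in practice. It uses only Go’s standard-library `expvar` package, not InfluxData’s fork, and the `requests_served` counter and `readVar` helper are hypothetical names chosen for illustration. To stay self-contained, it calls the handler in-process instead of starting a real server.

```go
package main

import (
	"encoding/json"
	"expvar"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requestsServed is a hypothetical application counter. expvar.NewInt
// registers it by name so it shows up in the published JSON document.
var requestsServed = expvar.NewInt("requests_served")

// readVar plays the role of a polling collector: it requests the
// expvar JSON document and returns the raw JSON value of one variable.
func readVar(name string) (string, error) {
	req := httptest.NewRequest(http.MethodGet, "/debug/vars", nil)
	rec := httptest.NewRecorder()
	// expvar.Handler is the same handler that importing expvar
	// registers at /debug/vars on http.DefaultServeMux.
	expvar.Handler().ServeHTTP(rec, req)

	var vars map[string]json.RawMessage
	if err := json.Unmarshal(rec.Body.Bytes(), &vars); err != nil {
		return "", err
	}
	raw, ok := vars[name]
	if !ok {
		return "", fmt.Errorf("variable %q not published", name)
	}
	return string(raw), nil
}

func main() {
	requestsServed.Add(3) // the application records some work

	value, err := readVar("requests_served")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Println("requests_served =", value)
}
```

In a real service you would typically just import expvar and run an HTTP server on the default mux, then point your collector at `/debug/vars`.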
For anyone running on Microsoft’s Azure cloud service or considering it for future use, this final post of a four-part series looks at how to monitor your “serverless” Azure Functions with elmah.io.
Cerebro is an “open alerting system” designed to integrate with Graphite’s time-series API and Seyren’s alerting and scheduling features. It offers a native REST API and dashboard to allow users to interactively or programmatically construct alerting rules with custom notification recipients.
Thanks for joining us, folks! If you like what you’ve seen, invite your friends and colleagues! As always, if you have interesting articles, news, events, or tools to share, send them our way by emailing us (just reply to this email).
See you next week!
Monitoring Weekly curators