Last week I was at Monitorama and it was an amazing time! Not only were the talks fantastic, but I also had the opportunity to join Liz Fong-Jones from Honeycomb.io for a podcast with The New Stack to dig a little deeper into some of the topics in monitoring and site reliability engineering.
We kicked off by discussing how well-designed monitors can help us all avoid fatigue and burnout, a topic that is near and dear to my heart (see: Best Practices for Proactive Monitoring and Building Monitors You Can Trust). The focus here is on something that is simple to say but hard to implement: when you cannot keep up with the flood of information, knowing too much ultimately means knowing too little, so knowing “what you need to know” is key. And difficult. I discussed these concepts in my talk, where the key takeaways were: plan how specific your knowledge needs to be (e.g. do you need to know if a VM / container died, or if the resurrection failed?), create a workflow around your alerts so they go to the correct people, and routinely “clean” your monitors / metrics so that they keep pace with changes to your infrastructure. For the full spiel, I encourage you to listen to my talk (linked below).
As we moved through our conversation, we shifted scope to chaos engineering, incident management, and tracing. These are all critically important parts of the engineering process. Chaos engineering helps us figure out “what we need to know that we don’t already know” by introducing intentional stress and “misbehavior” into an application environment or system. Regardless of the source of an incident, reviewing it once it is resolved is critical. (Do you have an incident management platform? Pro tip: you definitely should.) In order to recover from an incident, you need data on what is going wrong or went wrong. I could continue to summarize, but I’d recommend you just jump in at this point. Perhaps on your lunch break!
Where to listen to us:
Our podcast at The New Stack
The New Stack’s blog post about the podcast
Talks referenced in the podcast are:
John Allspaw, “Taking Human Performance Seriously In Software”
Nora Jones, “Chaos Engineering Traps”
Nida Farrukh, “The Power and Creativity of Postmortem Repair Items”