A Monitoring Reality Check: More of the Same Won’t Work

By: Jonah Kowall

On December 7, 2021, Amazon’s cloud services suffered a major outage that affected not only Amazon’s own services, such as Alexa, Ring and Amazon deliveries, but also many third-party services we use day-to-day, including Netflix and Disney+.

Causes for the outage, which began at 7:30 am PST and lasted nearly seven hours, were detailed in a Root Cause Analysis report published by AWS that shed light on factors that may have contributed to the extended length of the disruption. According to the issue summary, “congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.”

Even after AWS moved some traffic off the affected network, the issue was not fully resolved, and without sufficient visibility the team could not further isolate the underlying problems. The RCA states, “monitoring data was still not visible to our operations team, so they had to continue resolving the issue with reduced system visibility.”

These gaps in monitoring and the general lack of visibility into the affected infrastructure were the main reasons the resolution took so long. Without reliable data, the team likely spent precious time looking for the issue in the wrong areas of the infrastructure and application services. Making matters worse, AWS services that customers rely on, such as Amazon CloudWatch, which provides monitoring to AWS customers, were also impacted. That meant neither AWS nor the customers who depend on its monitoring services could diagnose issues in their infrastructure.

Anyone running applications or infrastructure knows how important and difficult monitoring is. It is no wonder that over $20 billion per year is spent attempting to address monitoring challenges (based on Gartner estimates). The most basic function of monitoring is to notify IT and DevOps teams when an issue occurs. During a major outage, it’s critical to know why something is happening and where the failure is. Monitoring should both provide notification when an issue occurs and help isolate where the problem lies.

Few people think about what to do when monitoring doesn’t work, or who is monitoring the monitoring stack. In the industry, we often phrase this as “who is watching the watcher?” Using a second service to keep your primary monitoring service honest is considered a best practice for exactly this purpose.
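One common way to watch the watcher is a heartbeat check, sometimes called a dead man’s switch: the primary monitoring system checks in at regular intervals with an independent service, and silence itself becomes the alert. The Python sketch below illustrates the idea only. The class name, method names and the 60-second threshold are hypothetical, not taken from any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatWatchdog:
    """Hypothetical watchdog that runs outside the primary monitoring stack.

    The primary monitoring system is expected to check in regularly.
    If it stays silent for longer than max_silence_seconds, the watchdog
    treats the monitoring system itself as the thing that has failed.
    """

    max_silence_seconds: float
    last_heartbeat: Optional[float] = None  # timestamp of the last check-in

    def record_heartbeat(self, now: float) -> None:
        # Called whenever the primary monitoring system checks in.
        self.last_heartbeat = now

    def is_primary_silent(self, now: float) -> bool:
        # A monitor that has never checked in is treated as silent.
        if self.last_heartbeat is None:
            return True
        return (now - self.last_heartbeat) > self.max_silence_seconds


# Illustrative usage with made-up timestamps (seconds).
watchdog = HeartbeatWatchdog(max_silence_seconds=60)
watchdog.record_heartbeat(now=100.0)
if watchdog.is_primary_silent(now=200.0):
    print("Primary monitoring has gone quiet; page on-call via a separate channel")
```

The design point is that the watchdog should share as little infrastructure as possible with the system it watches, and it should alert through a channel that does not depend on that system either.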

When monitoring tools have gaps that prevent IT and DevOps teams from detecting or diagnosing an outage, we often add even more monitoring as part of the postmortem to avoid a recurrence. While necessary, adding more monitoring creates more work for teams attempting to shore up gaps in visibility, alerting, or resilience within the monitoring system. Additional monitoring only helps if the monitoring system itself works, which is why resilience is critical to functional monitoring. That resilience was missing in the recent AWS incident, and likely in many other outages.

How do you build resilient observability to avoid extended downtime? A key step is to separate your monitoring from your infrastructure, which means using external services that have fewer dependencies on the same infrastructure. An ideal approach is to use multiple monitoring systems, or to have your monitoring services delivered from multiple regions, data centers, or cloud providers. This helps ensure you have the visibility needed to determine what problems you are having and how you might fix them.
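To make the multi-region idea concrete, here is a minimal Python sketch of how health checks from several independent vantage points might be combined. It assumes each vantage point reports True (healthy), False (unhealthy) or None (no data received from that probe). The function name, the status labels and the simple majority rule are illustrative assumptions, not a description of any vendor’s implementation.

```python
from typing import Dict, Optional


def classify_service(results: Dict[str, Optional[bool]]) -> str:
    """Combine probe results from independent vantage points.

    results maps a vantage point name to True (healthy), False (unhealthy)
    or None (the probe itself returned no data).
    """
    # Keep only the vantage points that actually reported something.
    reporting = {name: ok for name, ok in results.items() if ok is not None}

    # No probe reported at all: we cannot see the service, which is a
    # monitoring failure rather than evidence about the service.
    if not reporting:
        return "monitoring_blind"

    failures = sum(1 for ok in reporting.values() if not ok)

    # A strict majority of reporting probes see a failure.
    if failures * 2 > len(reporting):
        return "service_down"

    # Some probes failed or went missing, but most still see a healthy service.
    if failures > 0 or len(reporting) < len(results):
        return "degraded"

    return "healthy"


# Illustrative usage: one vantage point has lost visibility, the others have not.
status = classify_service({"region-a": None, "region-b": True, "region-c": True})
print(status)  # degraded
```

Treating “no data” as a separate state from “unhealthy” is the important part. It lets a team distinguish an outage in the service from an outage in the monitoring, which is the distinction that was hard to make during the AWS incident.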

At Logz.io, we provide cloud-native monitoring solutions. We run across eight regions within AWS, and as a result, this most recent AWS outage affected only one of our regions. Customers running in the unaffected regions were fully up and running. This is a good example of how building independent cloud regions into your architecture can create resilience and provide business advantages. It makes good sense, as nearly all companies today rely on cloud providers to run their businesses. Granted, AWS has had an impressive track record, but it was a challenging end to the year with two outages: the one referenced here and another on the 22nd of December.

As someone who was on-call for 17 years, I know the pressure that DevOps teams (and others who are responsible for observability) feel daily. The job is very difficult, and nothing is more difficult than when the pressure is on to deliver and save the day during a problem or outage.

Unlike many other jobs, this is a challenge that never ends. It’s a 24/7 job, 365 days a year. Thanks to all of you who make the technology possible for all of us users. The reality is that when it comes to cloud monitoring, doing more of the same will just give you the same results. To improve results and mitigate gaps in monitoring, we must take a more distributed approach to offset the risks of extended downtime.