If you could know information about your systems in advance, what would you choose to know? If there was a set of repeating behaviors that happened consistently before an outage, would you want to know what they were?
This is the idea behind proactive monitoring – shifting from a “reactive” context to one that allows you to act before a problem arises. Here are some guidelines to help you get started with your own customized solution.
1. Find your healthy baseline
Establishing what “health” means for your system(s) is the first step. Once you know what “healthy” is, you can grow from that. So what’s healthy?
Health will vary a lot from use case to use case. For example, if you’re responsible for supporting an application that sells toys, you might expect extremely high seasonal traffic around certain holidays. If you’re running an application that handles food or food delivery (non-commercial), you might expect traffic spikes around certain meal times. Whatever your specific use case, creating and maintaining a healthy baseline requires understanding not only your technology but also your business and what can reasonably be expected to impact resource usage. Some ideas to get started:
- Meet with the appropriate person or persons who would be able to give you insight into some of the expected business cycles that you can translate into infrastructure and application code (expanding on this in the next section)
- If you’re working in an environment that’s already established, take advantage of the information already being collected (monitoring, metrics) so that you can establish what, if any, patterns of usage have started to appear
- Make note of which, if any, areas of your environment fall on either extreme of the usage spectrum: those that are relatively flat in usage, where even small deviations from the norm are worth a glance, and those that are prone to extreme fluctuation.
- In the case of fluctuation, see if you can establish any correlation to business patterns that might explain the fluctuations. If so, you can build in the relevant cycles to your healthy baseline.
- Relatedly, see if the fluctuations are related to frequent outages. If a service you depend on is experiencing frequent difficulties, then depending on your circumstances you might want to consider outsourcing, insourcing, or switching to a different service entirely.
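As a sketch of what “deviation from the baseline” can look like in code, the snippet below flags any sample that strays more than a few standard deviations from the mean of a trailing window. The window size, threshold, and the hourly request counts are all hypothetical placeholders, not values from the article:

```python
from statistics import mean, stdev

def deviations(samples, window=24, threshold=3.0):
    """Flag indices where a sample strays more than `threshold`
    standard deviations from the trailing window's mean."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma and abs(samples[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Hypothetical hourly request counts: a steady weekly rhythm
# around 100-104 requests, with one anomalous spike.
counts = [100 + (i % 5) for i in range(48)]
counts[40] = 500
print(deviations(counts))  # [40] -- only the spike stands out
```

In a real setup the “window” would encode the business cycles identified above (daily, weekly, seasonal), so that expected spikes are part of the baseline rather than alerts.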
2. Understanding the business
I touched on this briefly in the previous section, but the concept deserves its own section. Why? Whatever application(s), system(s), etc. you are building and maintaining, you are doing so to meet a set of business needs. The step here, then, is to ensure that you stay on the same page as the “state of the business”. This can involve tracking things like:
- New / unique users and/or visitors
- Duration spent on site
- Page views
- Navigation patterns
These can help reveal whether a product or feature is delivering the intended experience. As a more direct example, you might introduce a new feature to your product. While monitoring customer experience and business logic, you may find that customers are logging in less frequently or that fewer customers are signing up than expected. This lets you catch releases that, while they passed testing, may actually be interfering with your users’ experience of your product.
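To make the release-regression idea concrete, here is a minimal sketch of flagging a post-release drop in a business metric such as signups. The figures and the 10% tolerance are illustrative assumptions, not recommendations:

```python
def metric_regressed(baseline, current, tolerance=0.10):
    """Return True if `current` has dropped more than `tolerance`
    (as a fraction) below `baseline` -- a possible release regression."""
    if baseline <= 0:
        return False
    return (baseline - current) / baseline > tolerance

# Hypothetical daily signups before and after a release.
print(metric_regressed(baseline=200, current=150))  # True  (25% drop)
print(metric_regressed(baseline=200, current=195))  # False (2.5% drop)
```

Wiring a check like this into a post-deploy dashboard or alert is one way to tie business metrics back to individual releases.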
3. Practice proactive security
When speaking about security, a lot of the discussion centers on reactivity. How quickly can you respond once an intrusion or other malicious behavior is found? How quickly can you detect that behavior in the first place?
While there is a lot to be said for accurately assessing a situation and responding appropriately, it might go without saying that the best security issue is one that never arises. I get it, that reads like some sort of IT utopia, but hear me out. Once you establish your baseline, you can start drawing inferences from what deviations from the norm might mean, which also allows you to take preventative action. Beyond the baseline, there are habits you can start and maintain that will help you keep ahead of common security issues. Here are some ideas to get you started:
- If you are in an organization that is large enough to have a separate security team, make sure there is communication between IT and security teams.
- Monitor identity management and take action if there is a policy violation. A Cloud Guru ran an excellent workshop on a small variation of this at ServerlessConf 2017, which they have on their public GitHub. I recommend taking a look (and noting the caveats in the README).
- Monitor to verify that keys and tokens are not pushed into a code repository – public or private.
- Regularly engage in security testing, including penetration tests and security audits.
- Make sure that you are familiar with any external dependencies, e.g. libraries, that you are using in your code base.
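As a sketch of the keys-and-tokens point above, a pre-commit or CI check might scan files for well-known credential formats. The two patterns below are illustrative only; dedicated scanners such as git-secrets or truffleHog ship far more comprehensive rule sets:

```python
import re

# A couple of well-known credential shapes, for illustration only.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key ID
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # PEM private key
]

def find_secrets(text):
    """Return (line_number, match) pairs for suspicious strings."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern in SECRET_PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((lineno, match.group(0)))
    return hits

# Hypothetical file contents with a fake key ID.
sample = "db_host = example.com\naws_key = AKIAABCDEFGHIJKLMNOP\n"
print(find_secrets(sample))  # [(2, 'AKIAABCDEFGHIJKLMNOP')]
```

Running a check like this on every push (and on the full history, once) catches credentials before they land in a repository, public or private.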
For additional research into proactive security practices, I definitely recommend taking a look at the Rugged Software Manifesto (2012), our Cloud Operations Blueprint, and other articles that have been published in the space since.
4. Broaden out to your infrastructure
I touched on this a bit in the previous section, but now I’m going to branch out to the broader infrastructure. A lot of the terminology in this section is AWS-specific, but the examples laid out should help you think of ideas that apply to your own infrastructure – whether you are using a different cloud provider or maintaining on-prem equipment.
A common practice in AWS is to run services in multiple regions so that, in the event of an outage, your own services aren’t down for the count. Monitoring your PaaS can give you insight into whether, for example, services in us-east are starting to experience latency issues that may indicate an upcoming outage. Catching a potential issue while it is still small allows you to course correct with minimal (or no) customer impact. Returning to your healthy baseline: knowing how your infrastructure should behave under its current level of stress lets you determine whether its behavior is within normal limits. Key areas to focus on when getting started:
- Monitoring (and understanding) your network traffic flow. This is especially relevant if you’re in a regulated industry and need to ensure that traffic is or is not permitted between different services / deployments / etc. in use.
- Monitoring and building failover in the event that there is a partial or full outage in one or more of your microservices.
- Monitoring impact of infrastructure changes. Part of this is making sure you know which application(s) are using which resources, so when it comes time to make changes to those resources (updates, scaling, etc.) you can anticipate where potential problems may arise based on historical data.
- Monitoring the services you use to maintain your infrastructure. What this means varies significantly depending on how you deploy and version control your infrastructure, but it can include configuration management tools such as Puppet or Chef, or container orchestration like Kubernetes.
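One way to get started on the multi-region point: a small probe that measures latency to each region’s endpoint and compares it against a per-region budget derived from your baseline. The regions, measurements, and budgets below are hypothetical:

```python
import time
from urllib.request import urlopen

def probe_latency(url, timeout=5.0):
    """Measure round-trip time to an endpoint, or None if unreachable."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout):
            pass
    except OSError:
        return None
    return time.monotonic() - start

def regions_over_budget(latencies, budgets):
    """Given measured latencies and per-region budgets (in seconds),
    return the regions that are slow or unreachable."""
    return [region for region, latency in latencies.items()
            if latency is None or latency > budgets.get(region, 1.0)]

# Hypothetical measurements against per-region baselines (seconds).
measured = {"us-east-1": 0.9, "eu-west-1": 0.2, "ap-south-1": None}
budgets = {"us-east-1": 0.5, "eu-west-1": 0.5, "ap-south-1": 0.5}
print(regions_over_budget(measured, budgets))  # ['us-east-1', 'ap-south-1']
```

A probe like this, run on a schedule from outside the regions it watches, gives you the early latency signal described above without depending on the infrastructure being measured.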
5. “Watch the watchmen”
One last item before you go: a reminder to “watch the watchmen”. Whether you monitor proactively or reactively, if your primary means of communication goes down and you don’t know it, you won’t be able to make full use of all the visibility you have so carefully configured. Don’t neglect any third-party tools in your monitoring and/or incident management stack, including any relevant endpoints if you are using an API or public status pages.
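A common pattern for watching the watchmen is a dead man’s switch: the monitoring system must check in periodically, and silence itself becomes the alert, delivered over an independent channel. A minimal sketch, where the 60-second window and the simulated clock are assumptions for illustration:

```python
import time

class DeadMansSwitch:
    """Alert when the monitoring system itself stops checking in.
    `alert` should use a channel independent of the primary monitor."""

    def __init__(self, max_silence, alert, clock=time.monotonic):
        self.max_silence = max_silence
        self.alert = alert
        self.clock = clock
        self.last_heartbeat = clock()

    def heartbeat(self):
        """Called by the monitoring system to prove it is alive."""
        self.last_heartbeat = self.clock()

    def check(self):
        """Called on a schedule from an independent host or service."""
        silent_for = self.clock() - self.last_heartbeat
        if silent_for > self.max_silence:
            self.alert(silent_for)
            return False
        return True

# Simulated clock so the example is deterministic.
now = [0.0]
alerts = []
switch = DeadMansSwitch(max_silence=60, alert=alerts.append,
                        clock=lambda: now[0])
switch.heartbeat()
now[0] = 30;  print(switch.check())  # True  -- within the 60s window
now[0] = 120; print(switch.check())  # False -- the monitor has gone quiet
print(alerts)                        # [120.0]
```

The key design choice is that `check` and `alert` live outside the system being watched; a switch hosted alongside the monitor fails silently with it.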
Proactive monitoring practices let you apply what you know of your environment to watch for precursors of potential problems, so they can be addressed before there is a full-on outage or data loss. Please remember that, just like “regular” or “reactive” monitoring, your monitoring configurations are in a state of flux and can be improved iteratively as needs change and environments grow.