10 Ways to Simplify Cloud Monitoring

By: Daniel Berman

August 9, 2019

Is monitoring in the cloud special enough to warrant a list of tips and best practices? We think so. On the one hand, monitoring in the cloud might seem easy since there is a large number of solutions to choose from. On the other hand, though, the dynamic and distributed nature of the cloud can make the process much more challenging. In this article, we’ll cover ten tips and best practices that will help you ace your cloud monitoring game.

1. Keep It Super Simple (KISS)

Every second spent on monitoring is a second not spent on your app. You should write as little code as possible, since you’ll have to test and maintain it. The time spent doing this adds up in the long run.

When evaluating a tool, the best question to ask yourself is, “How hard is it to monitor another service?” You will perform this operation very frequently, and the total effort to do so may skyrocket if you multiply it by the number of services under your command. Sure, there is some intrinsic complexity involved in setting up the tool, but if you have a choice between a tool that gets the job done and one that has more features and is harder to use, apply the You Ain’t Gonna Need It (YAGNI) rule.

After the initial setup, there is a maintenance phase. It’s a well-known fact that the only thing that stops planned work is unplanned work. You can minimize outages and failures by simplifying monitoring operations. For example, Prometheus applies dependency inversion: with its pull model, monitoring depends on your app rather than the other way around (as it would with a push model). It also reduces operational complexity by making the collectors totally independent in a high availability (HA) setup. That’s one fewer distributed system for you to manage!
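To make the pull model concrete, here is a minimal sketch using only the Python standard library. The app exposes a /metrics endpoint and knows nothing about who scrapes it. The metric names, the port, and the simplified "name value" text format are assumptions made for illustration, not taken from any particular tool.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real app would update these as it works.
METRICS = {"app_requests_total": 0, "app_errors_total": 0}


def render_metrics(metrics):
    """Render counters as 'name value' lines, one per metric, sorted by name."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counters on GET /metrics; everything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics until interrupted (port is an assumption)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The collector decides when and how often to fetch it, so the app carries no collector address or retry logic of its own.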

2. Instrumentation Is the Way

Once you choose a simple monitoring tool and set it up, the question of what to monitor arises. The answer? “The Four Golden Signals”, obviously! These are: latency, traffic, errors, and saturation.

But what does “latency” mean for your app, and what values are acceptable? There’s only a handful of people who know that: you, your fellow operators, the business, and the application developers.

Embedding this expertise into a monitoring system requires instrumenting the application. This means that the services should expose relevant metrics. An added benefit of instrumentation is that every new metric can be validated against business needs.
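As a rough sketch of what instrumentation can look like, the decorator below records the four golden signals for a single function. The in-memory stores, the metric names, and the checkout function are all hypothetical; a real service would hand these values to its metrics library of choice.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory stores; the names are illustrative only.
counters = defaultdict(int)    # traffic and errors
latencies = defaultdict(list)  # latency samples, in seconds
in_flight = defaultdict(int)   # concurrent calls, a rough saturation signal


def instrumented(name):
    """Record traffic, errors, latency, and in-flight calls for a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            counters[f"{name}_requests_total"] += 1
            in_flight[name] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                counters[f"{name}_errors_total"] += 1
                raise
            finally:
                in_flight[name] -= 1
                latencies[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart):
    """Hypothetical business operation: total up a cart of prices."""
    if not cart:
        raise ValueError("empty cart")
    return sum(cart)
```

Which latency values are acceptable for checkout is still a decision for the people who know the service. The code only makes the numbers available.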

3. Automated Infrastructure Monitoring? Leave It to Your Provider

Some tools may tempt you with the promise of zero-configuration monitoring while lacking other features. These promises may include AI-based anomaly detection and automated alerting. Have you ever wondered how they can provide value if the quirks and the desired behaviors of your system are unknown to the tool?

You might say to yourself that these tools are great for monitoring infrastructure. Indeed, there are common tasks like load balancing or storing relational data that shouldn’t require manual instrumentation. But, if spinning up custom monitoring for your infrastructure is a problem, maybe you should be considering using a hosted solution from your provider instead.

The price tag on a cloud load balancer includes monitoring (as well as upgrades, failovers and fault remediation), so why not consider outsourcing standard utilities and focus on value-adding services instead? When thinking of running infrastructure on your own, make sure that you consider the full cost of maintaining it.

4. Make Sure Monitoring Can Keep Up With You

Everything changes in the cloud. The implications of these constant changes are not always straightforward, though. The number of service instances (machines, containers, or pods) isn’t fixed, and another machine or service instance might appear without human interaction. Since changes to the state of your cloud environment are automated (by autoscaling rules, for instance), monitoring has to adjust accordingly. In an ideal world, we’d like to achieve something called location transparency at the monitoring level and refer to services by name, rather than by IP and port.

The ideal monitoring tool should integrate seamlessly with the service discovery mechanism you already run (like Consul or ZooKeeper), with the clustering software (like Kubernetes), or with the cloud provider directly. According to the KISS principle discussed in the first tip, you shouldn’t need to write any adapters for infrastructure purposes.

Integration ubiquity isn’t a must, although it may reduce the number of moving parts. There should be no need to change a monitoring tool when switching cloud providers. Prometheus is an example of a product that balances integration and configuration requirements without vendor lock-in. It not only integrates out of the box with major cloud providers and service discovery tools, it also integrates with niche alternatives via either DNS or a file (via an adapter). Of course, the ELK Stack is also open source and therefore vendor independent. It is also well integrated.
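The file-based route mentioned above can be sketched as a small adapter that turns an inventory of instances into a JSON list of target groups, each with "targets" and "labels" keys, which is the shape Prometheus file-based service discovery reads. The inventory records, label names, and addresses here are hypothetical, and the exact format should be checked against the documentation of the version you run.

```python
import json
from collections import defaultdict


def to_file_sd(instances):
    """Group instance records into target groups keyed by service and env."""
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst["service"], inst["env"])].append(inst["address"])
    return [
        {"targets": sorted(addresses), "labels": {"service": service, "env": env}}
        for (service, env), addresses in sorted(groups.items())
    ]


def write_file_sd(instances, path):
    """Write the target groups to a JSON file for the collector to watch."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(to_file_sd(instances), fh, indent=2)
```

Regenerating the file whenever the inventory changes lets the collector follow instances as they come and go, and the services themselves are addressed by label rather than by IP and port.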

5. One Dimension Is Not Enough

Some monitoring systems have a hierarchy of metrics: node.1.cpu.seconds. Others provide labels with dimensions: node_cpu_seconds{node_id="1"}. The hierarchy forces an operator to choose the structure up front. Consider how you would express a measurement such as node_cpu_seconds{node_id="1", env="staging"} in a hierarchical system: should the environment come before or after the node?

More dimensions allow more advanced queries to be made with ease. The answer to the question, “What is the latency of services in staging with the latest version of the app?” boils down to selecting appropriate label values in each dimension. As a side effect, aggregates become less brittle. A sum over http_request_count{env="production"} will always yield correct values, regardless of the actual node IDs.
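The following sketch shows why a label-based sum does not care about node IDs. The sample data and the sum_matching helper are hypothetical and stand in for whatever query language your monitoring system provides.

```python
# Hypothetical samples: (metric name, labels, value).
SAMPLES = [
    ("http_request_count", {"env": "production", "node_id": "1"}, 120),
    ("http_request_count", {"env": "production", "node_id": "2"}, 80),
    ("http_request_count", {"env": "staging", "node_id": "3"}, 5),
]


def sum_matching(samples, metric_name, **selector):
    """Sum the values of all series whose name and labels match the selector."""
    return sum(
        value
        for metric, labels, value in samples
        if metric == metric_name
        and all(labels.get(key) == wanted for key, wanted in selector.items())
    )
```

Calling sum_matching(SAMPLES, "http_request_count", env="production") adds up both production nodes. Adding or removing a node changes the data but not the query.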

6. Does It Scale?

It’s great if your tool works in a PoC environment without any problems. However, will that tool scale when the demand for your product skyrockets? The system throughput should increase proportionally with the number of resources added. Consider vertical scaling before horizontal. Machines are cheap (compared to person-hours) and available at a Terraform rerun (if you practice infrastructure as code).

Also, don’t think of scale in a Google sense. We love to think big, but it’s more practical to keep things realistic. Complicating the monitoring infrastructure is rarely worth it. You can counter many scaling issues by taking a closer look at the collected metrics. Do you actually need all the unique metrics? High metric cardinality is a simple recipe for swamping even the most performant systems.
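A quick back-of-the-envelope check can show how fast cardinality grows. The sketch below assumes the worst case, where every combination of label values occurs, so the number of time series for one metric is bounded by the product of the distinct values per label. The label names and counts are made up for illustration.

```python
from math import prod


def series_upper_bound(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(set(values)) for values in label_values.values())


# Hypothetical labels for a single metric.
modest = {"env": ["staging", "production"], "node_id": range(50)}
risky = dict(modest, user_id=range(10_000))
```

Here series_upper_bound(modest) is 100, while adding a per-user label raises the bound to 1,000,000 for the same metric. Labels with unbounded values are usually the first thing to question.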

7. Recycle and Reuse

There may be valid reasons for running the infrastructure yourself. Maybe none of the databases offered by your provider have the desired business-critical features, for example. However, there should be very few such cases in your system. If you are running applications on-premises, just grab ready-made monitoring plugins and tune them to your needs.

Doing so reduces the need for instrumentation. You will still have to manually fine-tune the visualization and alerting. Adding custom monitoring on top of custom infrastructure is rarely justified by business needs.

8. Knock, Knock

Monitoring without alerts is like a car without gasoline—you’re not going anywhere with it. Indeed, there is some value in the on-the-spot root cause analysis, but you can crunch the same data from the logs. The true value of monitoring is letting the human operator know when their attention is required.

What should alerting look like, then? Ideally, human operators should only be alerted synchronously on actionable, end-user, system-wide symptoms. Being awakened at 3 a.m. without a good reason isn’t the most pleasant experience. Beware of the signal-to-noise ratio; the only thing worse than not having monitoring is having monitoring with alerts that people ignore because of a high false-positive rate.
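One common way to cut false positives is to page only when a user-facing symptom persists across several consecutive evaluations. The sketch below illustrates that idea; the 5% threshold and the three-evaluation window are arbitrary assumptions, and should_page is a hypothetical helper rather than any tool’s API.

```python
def should_page(error_ratios, threshold=0.05, for_evaluations=3):
    """Return True only if the most recent evaluations all exceed the threshold.

    error_ratios is a chronological list of error ratios (errors / requests),
    one per evaluation interval.
    """
    if for_evaluations < 1:
        raise ValueError("for_evaluations must be at least 1")
    if len(error_ratios) < for_evaluations:
        return False
    recent = error_ratios[-for_evaluations:]
    return all(ratio > threshold for ratio in recent)
```

A single spike such as [0.01, 0.20, 0.01] does not page anyone, whereas a sustained run such as [0.10, 0.10, 0.10] does.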

9. Beware of Vendor Lock-in

Although the application monitoring solutions readily available from your cloud provider may look dazzling and effortless to set up, they don’t necessarily allow instrumentation (principle #2). Even if they do, they will be tied to a particular cloud provider.

Beyond crippling your ability to migrate or go multi-cloud should the need arise, vendor lock-in will keep you from being able to assemble your system locally. This can raise your costs (since every little experiment has to be run in the cloud), operational complexity (the need to manage several accounts for development, staging, and production), and iteration cycle time (provisioning cloud resources is usually an order of magnitude slower than provisioning local resources, even accounting for automation).

10. Dig a Well Before You Get Thirsty

You may be tempted to put off creating a proper monitoring system, especially if you’re running a startup. After all, it’s a non-functional requirement and the customers won’t be paying extra for it. However, you want to have that monitoring in place so that you learn about an outage before an enraged customer lets you know. The best time to set up a monitoring system is right now.

You can start off with a simple non-HA setup without any databases and then talk to the business about what to monitor first. As you likely know by now, monitoring is driven by business requirements, even if the business does not always recognize that. Starting early will let you amortize the cost of implementation and gradually build up your monitoring capabilities while you learn from every outage (not if, but when they happen). In the process, you will gain agility and confidence in the knowledge that you’re monitoring the right things.

Summing it up

We believe that applying these ten principles to your own projects will help you make the most of your monitoring and logging. These are not the only ideas out there, of course, and you may find that not all of them apply to your specific workflows or the organization as a whole. There is no true one-size-fits-all solution, and nobody but you knows your business.

Remember, you can start this process gradually! After all, imperfect monitoring is better than no monitoring at all.