Best Practices for Automating Monitoring

By: Evan Klein

Development teams, and even operations teams, often neglect application monitoring. Deadlines, inexperience, company culture, and management priorities can all lead to poor or missing monitoring in the platforms being developed.

The five monitoring best practices listed below illustrate how infrastructure can be configured to be ready for automation. The techniques cover not only application instance monitoring, but also machine and cluster monitoring.

Design Applications with Monitoring in Mind From the Beginning

Projects should use monitoring frameworks inside their codebases to provide better quality metrics. For example, many monitoring frameworks offer ways to capture metrics that can be exposed in a dashboard. Micrometer, for instance, is a metrics library for JVM applications that can capture measurements with several meter types, such as counters, gauges, and timers.

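import java.util.List;
import io.micrometer.core.annotation.Timed;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/")
public class StudentService {
	@Autowired
	StudentRepository studentRepository;

	// @Timed asks Micrometer to record how long each call takes.
	@GetMapping
	@Timed
	public List<Student> listStudents() {
		return studentRepository.list();
	}
}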

In the example above, the @Timed annotation tells Micrometer to measure the method’s execution time and record it as a timer metric.

Those metrics are only available when the application is developed using a framework that provides this kind of information. For the same reason, applying some monitoring features to legacy systems may be hard; the code may not be easily structured to capture useful information about the software. Additionally, new monitoring frameworks may demand updated libraries while legacy systems may need older versions of the same library to work. In that case, only a refactor can open legacy systems to use advanced monitoring metrics in this way.
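To give a rough idea of what a timer metric records, the sketch below wraps a call and tracks how many times it ran and how long it took, using only the Java standard library. It is not Micrometer's implementation; the SimpleTimer and TimerDemo classes and their methods are hypothetical names chosen for illustration, and the timer is not thread-safe.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a timer meter: counts calls and sums elapsed time.
// Not thread-safe; a real metrics library handles concurrency for you.
class SimpleTimer {
    private long count;
    private long totalNanos;

    // Runs the task, records how long it took, and returns the task's result.
    <T> T record(Supplier<T> task) {
        long start = System.nanoTime();
        try {
            return task.get();
        } finally {
            count++;
            totalNanos += System.nanoTime() - start;
        }
    }

    long count() {
        return count;
    }

    long totalNanos() {
        return totalNanos;
    }

    // Mean duration per call in nanoseconds, or 0 if nothing was recorded.
    double meanNanos() {
        return count == 0 ? 0.0 : (double) totalNanos / count;
    }
}

class TimerDemo {
    public static void main(String[] args) {
        SimpleTimer timer = new SimpleTimer();
        int sum = timer.record(() -> 1 + 2);
        System.out.println("result=" + sum + " calls=" + timer.count()
                + " meanNanos=" + timer.meanNanos());
    }
}
```

A wrapper like this has to be called explicitly around each method, which is exactly the kind of manual work an annotation-based framework removes.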

Classify Resources into Multiple Categories

Test, production, databases, CPU-bound, backend, Java, Python, payment, CRM—there are many ways to classify a product. Each resource on your infrastructure should be classified using the many tags you think are useful to evaluate what’s working correctly and what’s not.

For example, if response times increase across all CPU-bound applications at once, that may point to an infrastructure failure, since those systems generally perform few I/O operations and are unlikely to be slowed by disk or network alone. For managers or tech leaders, knowing the failure incidence by platform (Java, Python, etc.) may provide useful insight into which platform to use in the future.
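A minimal sketch of how tags make this kind of comparison possible is shown below. The Resource and TagReport classes, the resource names, and the sample response times are all hypothetical; a real monitoring platform would store tags and compute these aggregates for you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model: one monitored resource, its tags, and a sample metric.
class Resource {
    final String name;
    final Set<String> tags;
    final double responseMillis;

    Resource(String name, Set<String> tags, double responseMillis) {
        this.name = name;
        this.tags = tags;
        this.responseMillis = responseMillis;
    }
}

class TagReport {
    // Average response time per tag; a resource counts once for each of its tags.
    static Map<String, Double> averageByTag(List<Resource> resources) {
        Map<String, List<Double>> samples = new TreeMap<>();
        for (Resource r : resources) {
            for (String tag : r.tags) {
                samples.computeIfAbsent(tag, k -> new ArrayList<>()).add(r.responseMillis);
            }
        }
        Map<String, Double> averages = new TreeMap<>();
        samples.forEach((tag, values) -> averages.put(tag,
                values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0)));
        return averages;
    }

    public static void main(String[] args) {
        // Made-up resources and numbers, purely for illustration.
        List<Resource> resources = List.of(
                new Resource("billing-api", Set.of("production", "java", "cpu-bound"), 120.0),
                new Resource("report-job", Set.of("production", "python", "cpu-bound"), 80.0),
                new Resource("crm-db", Set.of("production", "database"), 15.0));
        System.out.println(averageByTag(resources));
    }
}
```

Because each resource carries several tags, the same data can be sliced by workload type, platform, or environment without re-registering anything.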

Another useful tip is to classify temporal activities. Besides organizing resources, you’ll want to classify events that happened to your infrastructure. Events can be new versions, changes in the infrastructure, or new advertisements that could result in increased infrastructure usage.

Beyond helping you understand why your metrics changed over time, those event markers can be useful for projecting how new events will affect current infrastructure. By looking up similar events in the historical data, it becomes possible to anticipate outages and create processes that reduce them.
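One way to picture event markers is as a timeline that can be queried when a metric changes. The sketch below is only an illustration under that assumption: InfraEvent, EventTimeline, and the event kinds are hypothetical names, and most monitoring tools offer their own annotation or event features for this.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical event marker: something that happened to the infrastructure.
class InfraEvent {
    final Instant at;
    final String kind;          // e.g. "deploy", "infra-change", "campaign"
    final String description;

    InfraEvent(Instant at, String kind, String description) {
        this.at = at;
        this.kind = kind;
        this.description = description;
    }
}

class EventTimeline {
    private final List<InfraEvent> events = new ArrayList<>();

    void mark(InfraEvent event) {
        events.add(event);
    }

    // Events that occurred in the window [anomaly - lookback, anomaly], inclusive.
    List<InfraEvent> before(Instant anomaly, Duration lookback) {
        Instant from = anomaly.minus(lookback);
        List<InfraEvent> hits = new ArrayList<>();
        for (InfraEvent e : events) {
            if (!e.at.isBefore(from) && !e.at.isAfter(anomaly)) {
                hits.add(e);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        EventTimeline timeline = new EventTimeline();
        timeline.mark(new InfraEvent(Instant.parse("2024-03-01T07:00:00Z"), "campaign", "spring promotion"));
        timeline.mark(new InfraEvent(Instant.parse("2024-03-01T10:00:00Z"), "deploy", "new version"));
        // Which events happened in the hour before latency rose at 10:30?
        Instant anomaly = Instant.parse("2024-03-01T10:30:00Z");
        for (InfraEvent e : timeline.before(anomaly, Duration.ofHours(1))) {
            System.out.println(e.at + " " + e.kind + " " + e.description);
        }
    }
}
```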

Add Monitoring Tools With Your Applications

A monitoring log or metric can be classified into one of three categories: application, machine, or cluster monitoring. Application monitoring provides the highest number of metrics related to the product. These metrics can measure sales or database performance, for example.

Outside of creating your own metrics (as the first practice in this article dictates), there are many platform-related parameters that can be automatically extracted. These can be shipped directly to a monitoring tool or collected using an agent.

Some monitoring providers supply agents that integrate with the application. They can monitor resource usage (memory, CPU, network, disk, etc.) without any code changes in the app, and they can also provide insights into method execution and whether the application needs to scale up or down.

Agents are tightly coupled to the platform they are built for. Java agents, for example, can efficiently extract useful information about the application because the platform runs in a virtual machine. Low-level platforms such as C and C++ may be able to expose only a fraction of what a JVM provides.

Agents, however, are platform and provider dependent and must be integrated inside the application. Another way to plug in monitoring tools is to connect the application to an external logging or monitoring service. The system sends its entire log stream to that service, which processes, interprets, and indexes the data. These log streams can then be associated with monitoring metrics and alerts.
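External services can index a log stream more easily when each line has a predictable structure. The sketch below shows one possible key=value layout written to standard output, where a log collector could pick it up. The LogLine class, the field names, and the line layout are assumptions made for illustration, not a format required by any particular provider.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical formatter for structured, key=value log lines.
class LogLine {
    // Produces: <timestamp> <level> <key=value ...> msg="<message>", with keys sorted.
    static String format(Instant timestamp, String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append(timestamp).append(' ').append(level);
        for (Map.Entry<String, String> field : new TreeMap<>(fields).entrySet()) {
            sb.append(' ').append(field.getKey()).append('=').append(field.getValue());
        }
        // Escape double quotes so the message stays a single quoted token.
        sb.append(" msg=\"").append(message.replace("\"", "\\\"")).append('"');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Writing to stdout lets the surrounding platform forward the stream.
        System.out.println(format(Instant.now(), "INFO", "order created",
                Map.of("service", "billing", "env", "production")));
    }
}
```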

While it’s not always feasible to follow the first recommended practice of this article (design applications with monitoring in mind from the beginning), adding monitoring tools is easy, even on legacy systems.

Use Containers to Uniformize the Platform

Container platforms such as Docker are great for simplifying infrastructure and deployment. They can also help make your projects uniform and make it easy to inject application monitoring into them.

For each technology platform (such as Java and Python) used in a company, the IT or DevOps team can provide a base Docker image that has a uniform structure among all projects. That image can have the many monitoring components the application needs—such as auto-registration (with a logging or monitoring backend)—embedded in it.

Many Docker monitoring features are available without changing a platform’s innards. However, this approach is particularly useful for guaranteeing that even legacy systems are running the same startup process, which makes it easier to send the same environment variables to all container instances and direct monitoring data to the same address.
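On the application side, a shared startup process might resolve its monitoring settings from environment variables, roughly as sketched below. The variable names (MONITORING_ENDPOINT, SERVICE_NAME, DEPLOY_ENV), the default values, and the MonitoringConfig class are hypothetical; a real base image would define its own names.

```java
import java.util.Map;

// Hypothetical settings that a shared base image could pass to every container.
class MonitoringConfig {
    final String endpoint;
    final String serviceName;
    final String environment;

    MonitoringConfig(String endpoint, String serviceName, String environment) {
        this.endpoint = endpoint;
        this.serviceName = serviceName;
        this.environment = environment;
    }

    // Reads settings from an environment map, falling back to placeholder defaults.
    static MonitoringConfig fromEnv(Map<String, String> env) {
        return new MonitoringConfig(
                env.getOrDefault("MONITORING_ENDPOINT", "http://localhost:9000"),
                env.getOrDefault("SERVICE_NAME", "unknown-service"),
                env.getOrDefault("DEPLOY_ENV", "development"));
    }

    public static void main(String[] args) {
        MonitoringConfig config = fromEnv(System.getenv());
        System.out.println("Sending monitoring data for " + config.serviceName
                + " (" + config.environment + ") to " + config.endpoint);
    }
}
```

Taking the environment as a Map rather than calling System.getenv() directly keeps the lookup easy to test with fixed values.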

For Clusters, Use an Orchestrator That Enables Automating a Process on Each Node

Automating monitoring means every resource must be automatically registered and classified on a monitoring platform. We’ve reviewed how to do this for applications and containers themselves, but clusters and nodes need to be monitored as well.

Dynamic clusters can scale the number of nodes up or down based on demand. More importantly, a single group can have different sets of machines designed for various purposes. Database machines will have more memory, for example, while backend applications could have more CPU cores. All of those machines must be monitored the second they become available, even if there is no application running on them.

A cluster orchestrator such as Kubernetes (K8s) simplifies the process and already has some monitoring features built in. K8s also provides a workload type called a DaemonSet, which ensures that all cluster nodes (or a selected subset) run a copy of a given pod. In other words, it’s possible to register a monitoring application as a DaemonSet. It will then run on every node, capturing hardware metrics and logs, even as nodes are created and removed automatically by the cluster.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: gcr.io/fluentd-elasticsearch/fluentd:v2.5.1
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

The example above (provided by K8s docs) shows a DaemonSet using Fluentd to send logs to an Elasticsearch stack.

Many monitoring providers, including Prometheus, Dynatrace, New Relic, and Logz.io, have official DaemonSets.

Conclusion

Monitoring is essential in any highly distributed and heterogeneous environment. With monitoring, an operator can identify an issue and apply appropriate measures to mitigate it before it propagates to other nodes of a cluster.

As time passes, more and more services can be created inside an application. This growth forces operational teams to be more creative about, and more competent in, delivering the expected monitoring quality. Automating monitoring is essential to quickly delivering high-level monitoring capabilities even with a large number of services running in a cluster.

The best practices described in this article will make operational teams’ lives easier by automating monitoring and helping them anticipate outages.