Best Practices for Monitoring Kubernetes using Grafana
Microservices and containers have taken the tech industry by storm. Kubernetes is one of the tools that has evolved to manage these new aspects of software development. It is an open-source system for automating deployment, scaling, and management of containerized applications. One of the biggest challenges that organizations face when adopting Kubernetes is performing monitoring tasks in this dynamic environment.
Traditional monitoring strategies don’t work for containerized applications, since containers are ephemeral and are difficult to troubleshoot. When you add container orchestration to this mix, managing your application’s underlying infrastructure and taking care of its operational aspects at scale become very challenging. Having an efficient strategy for centralized metrics and monitoring dashboards is key to successfully running your applications in Kubernetes.
Grafana is an open-source data visualization and analytics tool for time-series data, and it can be used to monitor your Kubernetes cluster. It can query a large number of data stores and helps users visualize, alert on, and understand their metrics. Grafana can be installed on all major operating systems, and users access it through a web browser.
This article looks at some best practices for monitoring your Kubernetes cluster with Grafana. We’ll examine this tool’s ability to leverage metrics that give you in-depth insights into the health and performance of your Kubernetes cluster, nodes, pods, and containers through sophisticated dashboards.
Grafana in the Kubernetes Monitoring Architecture
Grafana dashboards are centralized places to check real-time metrics. They are critical to monitoring both your applications and your infrastructure. You can leverage Kubernetes metrics in Grafana to get complete visibility into the state of your Kubernetes cluster and ensure that your services are running as expected.
Metrics that you can monitor with Grafana dashboards include:
- Kubernetes cluster resource utilization (CPU/memory at the cluster, node, pod, and container level)
- Actual CPU/memory usage of Kubernetes cluster nodes
- Health status of individual Kubernetes nodes
- Available resources on individual Kubernetes nodes
- Requested usage vs. actual usage of resources
- Pod health and availability
Which Kubernetes Metrics to Monitor
From a Kubernetes monitoring standpoint, there are two types of metrics available: system-level metrics and application-level metrics. System-level metrics can be fetched from core Kubernetes sources such as cAdvisor, Metrics Server, and the Kubernetes API server. Further metrics can be fetched from third-party monitoring integrations such as Prometheus Node Exporter (host-level metrics) and kube-state-metrics (the state of Kubernetes objects), while application-level metrics typically come from the instrumented services themselves. You can read more about the Kubernetes Monitoring Architecture here.
The following three lists contain important Kubernetes metrics that you should monitor:
Cluster Metrics
- Cluster level overview of workloads deployed
- Cluster CPU usage: used vs. total
- Cluster memory usage: used vs. total (default memory requests and limits for a namespace can be configured with a LimitRange manifest such as memory-defaults.yaml, applied to a namespace such as default-mem-example)
- Cluster file system usage: used vs. total
- Cluster network I/O pressure
- Cluster health (pod status, pod restarts, pod throttling)
- Overview of nodes, pods, and containers
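As a rough illustration of the "used vs. total" ratios above, the sketch below builds an instant-query URL for the Prometheus HTTP API and parses a response. It assumes Prometheus is the data source and that cAdvisor and kube-state-metrics are being scraped. The base URL, the PromQL expression, and the sample response are illustrative assumptions, and metric names can differ between exporter versions.

```python
import json
from urllib.parse import urlencode

# Assumed PromQL: CPU cores in use (cAdvisor) divided by allocatable cores
# (kube-state-metrics). Adjust metric names to match your exporters.
CLUSTER_CPU_RATIO = (
    'sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))'
    ' / sum(kube_node_status_allocatable{resource="cpu"})'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus /api/v1/query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> list:
    """Turn an instant-vector response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]


# Hypothetical response body, shaped like a Prometheus instant-vector result.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "0.42"]}],
    },
})

# The host name below is a placeholder, not a real endpoint.
url = instant_query_url("http://prometheus.example:9090", CLUSTER_CPU_RATIO)
[(labels, ratio)] = parse_instant_vector(sample_body)
print(f"cluster CPU used: {ratio:.0%}")
```

In a Grafana panel you would normally enter only the PromQL expression; the HTTP handling here just shows what the data source does on your behalf.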
Node Metrics
- Health check for master nodes: API server, scheduler, controller manager, etc.
- Degradation of master nodes
- Number of nodes available for serving pods
- Node CPU utilization
- Node memory usage
- Node disk space available for placing pods
- Node disk I/O usage
- Node network traffic (in and out)—receive and transmit
- Node network traffic errors
- Node network traffic drop
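Several of the node metrics above map onto series exposed by Prometheus Node Exporter. The following sketch collects a few candidate PromQL expressions in one place. It assumes Node Exporter is installed and that nodes are identified by the `instance` label; the dictionary keys, the `__NODE__` placeholder, and the `render` helper are hypothetical names used only for this example.

```python
# Candidate PromQL expressions for some of the node metrics listed above.
# __NODE__ is a placeholder that render() replaces with an instance label value.
NODE_QUERIES = {
    "cpu_utilization":
        '1 - avg(rate(node_cpu_seconds_total{mode="idle",instance="__NODE__"}[5m]))',
    "memory_used_ratio":
        '1 - node_memory_MemAvailable_bytes{instance="__NODE__"}'
        ' / node_memory_MemTotal_bytes{instance="__NODE__"}',
    "disk_available_bytes":
        'node_filesystem_avail_bytes{instance="__NODE__",mountpoint="/"}',
    "network_receive_bytes_per_s":
        'rate(node_network_receive_bytes_total{instance="__NODE__"}[5m])',
    "network_transmit_bytes_per_s":
        'rate(node_network_transmit_bytes_total{instance="__NODE__"}[5m])',
    "network_receive_errors_per_s":
        'rate(node_network_receive_errs_total{instance="__NODE__"}[5m])',
    "network_receive_drops_per_s":
        'rate(node_network_receive_drop_total{instance="__NODE__"}[5m])',
}


def render(name: str, instance: str) -> str:
    """Return the named query with the placeholder filled in."""
    return NODE_QUERIES[name].replace("__NODE__", instance)


# The instance value is a made-up example.
query = render("memory_used_ratio", "10.0.0.5:9100")
print(query)
```

In an actual dashboard, the instance would more likely come from a Grafana template variable than from string substitution in code.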
Pod/Container Metrics
- Resource allocation for pods
- Pods which are either underprovisioned or overprovisioned
- Number of running pods in the cluster
- Healthy vs. unhealthy pods in the cluster
- Percentage of throttled containers
- Number of container restarts that have occurred
- Number of persistent volumes in a failed or pending state
- Container CPU and memory utilization (the corresponding requests and limits are configured per pod or container in the pod manifest, for example in a file such as memory-defaults-pod.yaml)
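One way to reason about underprovisioned and overprovisioned pods is to compare actual usage against the requested amount. The sketch below does this for CPU. The `PodUsage` type, the pod names, and the 30% and 90% thresholds are assumptions made for illustration, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    cpu_request_cores: float  # what the pod asked for
    cpu_usage_cores: float    # what it actually used, e.g. a 5-minute average


def classify(pod: PodUsage, low: float = 0.3, high: float = 0.9) -> str:
    """Label a pod by the ratio of actual usage to its request.

    The default thresholds are illustrative assumptions only.
    """
    if pod.cpu_request_cores <= 0:
        return "no-request"
    ratio = pod.cpu_usage_cores / pod.cpu_request_cores
    if ratio < low:
        return "overprovisioned"
    if ratio > high:
        return "underprovisioned"
    return "ok"


# Hypothetical sample data.
pods = [
    PodUsage("api", cpu_request_cores=1.0, cpu_usage_cores=0.1),
    PodUsage("worker", cpu_request_cores=0.5, cpu_usage_cores=0.6),
    PodUsage("cache", cpu_request_cores=0.5, cpu_usage_cores=0.3),
]
report = {p.name: classify(p) for p in pods}
print(report)
```

The same comparison can be expressed as a ratio of two queries in a single Grafana panel; the code only makes the logic explicit.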
Troubleshooting with Kubernetes and Grafana
Grafana dashboards are excellent resources for data visualization, and they provide meaningful insights into the metrics collected from various data sources. These dashboards can be beneficial in a number of troubleshooting scenarios, such as the following:
- Correlating cluster instability and performance degradation issues with resource planning—requests vs. limits.
- Visualizing container restarts that might indicate a problem with your application.
- Correlating throttled pods or unhealthy pod states with I/O wait times and memory spikes on nodes.
- Correlating unhealthy pod states or throttled pods with CPU utilization.
- Determining the source of I/O waits by correlating I/O wait spikes with disk or network spikes using the disk I/O and network stats.
- Monitoring Kubernetes nodes and identifying workload bottlenecks.
From an application perspective, using the RED metrics (Request rate, Error rate, and Duration) for instrumenting the services running in Kubernetes is critical for investigating any performance issue. Leveraging Grafana’s built-in alerting capabilities can make it easy to notify teams when business thresholds are breached.
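To make the RED metrics concrete, here is a minimal sketch that derives all three from a list of request records for one time window. The `Request` type, the choice of HTTP 5xx as the error condition, and the 95th percentile as the duration statistic are assumptions; in practice these values usually come from counters and histograms exported by the service rather than from raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: int
    status: int


def red_summary(requests: list, window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for one observation window."""
    n = len(requests)
    if n == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: ceil(0.95 * n), in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n,
        "p95_ms": durations[rank - 1],
    }


# Hypothetical sample: 20 requests in a 10-second window, two of them failing.
sample = [
    Request(duration_ms=50 * i, status=500 if i in (10, 20) else 200)
    for i in range(1, 21)
]
summary = red_summary(sample, window_s=10)
print(summary)
```

An alert rule could then compare `error_ratio` or `p95_ms` against a threshold agreed with the team that owns the service.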
How to Add Data Sources in Grafana
Grafana fetches information from data sources and displays it in graphs. These data sources are the storage backends for your time series data. Grafana supports several data sources out of the box, including Prometheus, InfluxDB, MySQL, Elasticsearch, AWS CloudWatch, and Azure Monitor.
When building your dashboard, you can combine data from multiple data sources into a single dashboard. However, each panel is tied to a specific data source. Each data source has a query editor that lets you write queries against that data store to drive your visualizations. You can choose from a number of visualization options and apply them to your panels.
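Data sources can be added through the Grafana UI or through its HTTP API. As a sketch of the second route, the code below builds a request body for the `POST /api/datasources` endpoint. The data source name and the Prometheus URL are placeholders, and the request itself (including authentication) is deliberately left out.

```python
import json


def prometheus_datasource(name: str, url: str, default: bool = False) -> dict:
    """Build a request body for Grafana's POST /api/datasources endpoint."""
    return {
        "name": name,
        "type": "prometheus",  # plugin id of the built-in Prometheus data source
        "url": url,
        "access": "proxy",     # queries are issued by the Grafana backend
        "isDefault": default,
    }


# The URL is an assumed in-cluster service address, not a real endpoint.
payload = prometheus_datasource(
    "Prometheus", "http://prometheus.monitoring.svc:9090", default=True
)
print(json.dumps(payload, indent=2))
```

Keeping the definition in code or configuration, rather than clicking through the UI, makes it easier to reproduce the same setup across clusters.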
Below is a screenshot showing the data sources that are currently officially supported by Grafana:
How to Build a Grafana Dashboard
Setting up a dashboard in Grafana is very straightforward. Dashboards consist of panels that can fetch information from various underlying data sources. By default, Grafana comes with a variety of panels, such as Graph, Singlestat, Heatmap, and Table. You can also install panel plugins that add new visualizations for both time series and non-time series data.
You can organize your panels into rows, and you can drag and drop them across the dashboard. In addition, you can customize your panel’s look and feel from a wide range of available visualizations, and you can display data in a format that works best for your use case.
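Behind the UI, a dashboard is a JSON document whose panels carry a grid position and one or more queries. The following sketch assembles a small dashboard in that shape. It is a simplified assumption of the JSON model: real exports contain many more fields, the `timeseries` panel type applies to recent Grafana versions (older ones use `graph`), and the titles and queries are illustrative.

```python
import json


def panel(title: str, expr: str, x: int, y: int, w: int = 12, h: int = 8) -> dict:
    """One time series panel bound to a single PromQL query."""
    return {
        "type": "timeseries",
        "title": title,
        "gridPos": {"x": x, "y": y, "w": w, "h": h},  # the grid is 24 columns wide
        "targets": [{"refId": "A", "expr": expr}],
    }


def two_column_layout(specs: list) -> list:
    """Place (title, expr) pairs left to right, two panels per row."""
    return [
        panel(title, expr, x=(i % 2) * 12, y=(i // 2) * 8)
        for i, (title, expr) in enumerate(specs)
    ]


# Illustrative panels; the queries assume Node Exporter metrics.
specs = [
    ("Node CPU", '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))'),
    ("Node memory available", "sum(node_memory_MemAvailable_bytes)"),
    ("Node network receive", "sum(rate(node_network_receive_bytes_total[5m]))"),
]
dash = {
    "title": "Kubernetes / Nodes",
    "tags": ["kubernetes", "nodes"],
    "panels": two_column_layout(specs),
}
print(json.dumps(dash, indent=2))
```

Generating the layout from a list keeps panel sizes and positions uniform, which is harder to guarantee when panels are dragged into place by hand.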
Tips and Tricks for Building a Grafana Kubernetes Dashboard
- Keep your dashboards simple. Do not add too much information to a single dashboard, and try to limit the number of panels on a dashboard. Ideally, each panel should display a single metric, such as CPU, memory, or disk space. More graphs do not indicate better dashboards. At the end of the day, the key metrics on your dashboard should be easy to understand and actionable.
- Ensure that dashboards are consistent by design. A simple trick to make consistent dashboards is to use the same layout and visualizations for all of them. If your dashboards are built differently for your various services, it can be confusing and difficult to make correct decisions during troubleshooting.
- Design dashboards with the audience and their requirements in mind. The development team will need a detailed dashboard with less aggregation and more diagnostics for troubleshooting purposes. Management might be interested in an aggregated dashboard that shows a high-level picture of all the services and their SLA/SLI/SLO. Make sure your dashboards are configured to help your staff with their decision-making processes.
- Tag your dashboards. Once your teams start creating dashboards for their services, you are likely to end up with a lot of dashboards. Tagging your dashboards is critical for organizing and grouping them.
- Leverage open-source dashboards from Grafana. There is no need to reinvent the wheel. A large and active community works on this technology stack, and it is likely that someone else has already created a solution for your problem. Take a look at the official and community-built dashboards here.
- Leverage Grafana plugins as extension points. A large number of third-party plugins are available (apart from the core plugins) that can be integrated with your Grafana dashboards to enhance the visualization of data. Currently, plugins come in three types: panel, data source, and app.
- Source control your Grafana dashboards. One of the biggest challenges with dashboard maintenance is version drift. You can save Grafana dashboards as JSON files in source control and deploy them to the Kubernetes cluster via an automated build pipeline. This will ensure that the Grafana dashboards are consistent across all Kubernetes clusters in all environments.
- Switch into the query-focused Explore mode for troubleshooting issues. In the Explore workflow, you can focus on the query without having to modify queries in existing dashboards. Once you have the final query ready, you can start working on a new dashboard or modify an existing one.
- Take advantage of template variables to create dynamic and reusable dashboards. Doing so gives you the ability to monitor a large number of components (set as template variables) with one centralized dashboard. Since you don’t need to hardcode the application name, server name, etc. in your metrics query, maintenance of the dashboards is much easier.
- Boost dashboard performance by lazy loading of panels. There’s no need to load all the panels in a dashboard when you first open it. As you scroll down on your dashboard, the queries are executed against the backend data store, and metrics are displayed in the panels. This feature is enabled by default starting with v6.2.
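Two of the tips above, template variables and source control, can be sketched together. The code below defines a query-type variable and serializes a dashboard so that repeated exports produce identical text. It assumes a Prometheus data source and kube-state-metrics; the field set is a simplified view of the dashboard JSON model, and `export_for_git` is a hypothetical helper, not a Grafana feature.

```python
import json


def namespace_variable() -> dict:
    """A query-type template variable, referenced in panel queries as $namespace."""
    return {
        "name": "namespace",
        "type": "query",
        # label_values() is provided by Grafana's Prometheus data source;
        # kube_pod_info is exposed by kube-state-metrics.
        "query": "label_values(kube_pod_info, namespace)",
        "multi": True,
        "includeAll": True,
    }


def export_for_git(dashboard: dict) -> str:
    """Serialize a dashboard without instance-specific fields, with sorted keys."""
    cleaned = {k: v for k, v in dashboard.items() if k not in ("id", "version")}
    return json.dumps(cleaned, indent=2, sort_keys=True) + "\n"


# Illustrative dashboard: one query filtered by the variable instead of a
# hardcoded namespace. The regex match (=~) allows multi-value selections.
dashboard = {
    "id": 7,
    "version": 3,
    "title": "Kubernetes / Pods",
    "tags": ["kubernetes", "pods"],
    "templating": {"list": [namespace_variable()]},
    "panels": [{
        "type": "timeseries",
        "title": "Container restarts",
        "targets": [{
            "refId": "A",
            "expr": 'sum by (pod) (kube_pod_container_status_restarts_total'
                    '{namespace=~"$namespace"})',
        }],
    }],
}
text = export_for_git(dashboard)
print(text)
```

Dropping `id` and `version` and sorting keys keeps diffs limited to real changes, so a pipeline can deploy the same file to every cluster.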
Conclusion
Having a consolidated observability tool for cloud native applications is incredibly helpful. Grafana is one of the most popular visualization tools for metrics, one of “the three pillars of observability” (metrics, logging, and tracing). Because users can bring in their preferred observability tooling by adding different data sources to Grafana dashboards, it can serve as a single place to monitor a Kubernetes environment.