Prometheus and Grafana are two monitoring tools that, when combined, provide most of the information DevOps and development teams need to build and maintain applications. Prometheus collects many types of metrics from almost every variety of service, written in almost any programming language, and Grafana queries, processes, and visualizes these metrics.
Together, these two tools serve the needs of most R&D groups supporting on-premises or cloud applications—from organizations that do not have high service-level objectives (SLOs) to businesses with mission-critical production environments and high-frequency traffic. Time series databases, document databases, SQL databases, cloud providers’ monitoring services, and other applications can all be used as data sources for Grafana’s visualizations. Grafana has become the industry standard because it gathers data from many different data sources, collects the data in one place, and displays it in a unified way. For these reasons, Grafana’s usage has increased, and DevOps teams prefer it over the available individual monitoring and logging user interfaces.
Grafana’s ability to help teams track mean time to repair (MTTR) is another reason for its popularity. MTTR, the average time it takes to restore service after an incident, is a key metric used by production teams to measure DevOps efficiency and productivity. To meet demanding SLAs and high service standards, a team needs to be able to continuously monitor, identify, and immediately act upon incidents so that the recovery process can start right away. Using the right graphs in Grafana gives teams the ability to keep track of each individual service and the system as a whole. That said, the team still needs to know what to track, and this is where Prometheus comes in.
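To make the MTTR arithmetic concrete, here is a minimal Python sketch. It assumes MTTR is measured from the moment an incident is detected to the moment it is resolved; the incident timestamps and the function name are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved)
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),    # 30 minutes
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),    # 90 minutes
    (datetime(2024, 1, 20, 8, 15), datetime(2024, 1, 20, 8, 45)),   # 30 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 0:50:00
```

A team would normally derive these timestamps from its incident tracker or alert history rather than hard-code them.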
What Is Prometheus?
Prometheus is a tool that every DevOps professional should be familiar with. It’s an open-source monitoring and alerting system built on a time series data model. Prometheus collects metrics from different services and stores each one as a time series identified by a metric name, with a time stamp attached to every sample. This storage model allows Prometheus to query metrics quickly and return data sets that can be easily manipulated for visualization. Labels, which are key-value pairs attached to a metric, are what make the data model dimensional. Combining a metric name with labels identifies one specific dimension of that metric, which makes querying more precise and efficient.
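The dimensional model can be pictured with a small Python sketch. This is an illustration of the idea only, not how Prometheus stores data internally; the `Sample` type, the `select` helper, and the `http_requests_total` metric with its labels are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    name: str         # metric name
    labels: tuple     # sorted (key, value) pairs, e.g. (("method", "GET"),)
    timestamp: float  # seconds since the epoch
    value: float


def make_sample(name, labels, timestamp, value):
    return Sample(name, tuple(sorted(labels.items())), timestamp, value)


def select(samples, name, **matchers):
    """Return samples with the given metric name whose labels equal every matcher."""
    return [
        s for s in samples
        if s.name == name
        and all(dict(s.labels).get(k) == v for k, v in matchers.items())
    ]


samples = [
    make_sample("http_requests_total", {"method": "GET", "status": "200"}, 1700000000, 120),
    make_sample("http_requests_total", {"method": "GET", "status": "500"}, 1700000000, 3),
    make_sample("http_requests_total", {"method": "POST", "status": "200"}, 1700000000, 40),
]

# Roughly what the PromQL selector http_requests_total{method="GET"} expresses
get_requests = select(samples, "http_requests_total", method="GET")
print([s.value for s in get_requests])  # [120, 3]
```

Each distinct combination of metric name and label values is its own time series, which is why adding a label matcher narrows a query to one dimension of the metric.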
Unlike monitoring tools that rely on an agent pushing data from the monitored host, Prometheus pulls (scrapes) metrics over HTTP. To use Prometheus, users either instrument their code with Prometheus’ metric types or, if the code cannot be changed, run an exporter alongside the monitored service. The exporter translates the service’s native statistics or log entries into Prometheus metrics and exposes them on an HTTP endpoint, which the Prometheus server scrapes at regular intervals. Prometheus comes with its own query language, PromQL, which facilitates the acquisition of metrics and allows other tools, like Grafana, to get the data they require.
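In practice, an exporter publishes its metrics on an HTTP endpoint that the Prometheus server scrapes. The sketch below shows that shape using only the Python standard library and the Prometheus text exposition format. The counter names, the values, and port 8000 are assumptions for illustration; a real service would more likely use an official Prometheus client library than hand-roll the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (metric name, label pairs).
COUNTERS = {
    ("app_requests_total", (("method", "GET"),)): 3,
    ("app_requests_total", (("method", "POST"),)): 1,
}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrape requests on /metrics."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(COUNTERS), end="")
```

Calling `serve()` starts the endpoint. A Prometheus server configured with this host and port as a scrape target would then fetch `/metrics` on its own schedule, so the service never needs to know where Prometheus runs.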
Prometheus, like Kubernetes, is hosted by the Cloud Native Computing Foundation (CNCF). It has a long list of exporters, making it possible to collect metrics from almost any piece of software. Databases, HTTP servers, other monitoring systems, and even issue trackers or continuous integration tools can all be monitored using Prometheus. Kubernetes itself can be a data source for Prometheus when the Prometheus Operator is used. As stated in the CoreOS.com documentation, the mission of the Prometheus Operator is “to make running Prometheus on top of Kubernetes as easy as possible, while preserving Kubernetes-native configuration options.” New exporters are created all the time. The development community appears to be uniting around Prometheus and will likely continue to invest in making it the best metric monitoring tool available.
A Winning Integration
Every DevOps professional wants the following features from a monitoring system:
- Deployment simplicity,
- Minimal code intrusion,
- High-value ROI, and
- Low maintenance effort.
When put together, Prometheus and Grafana (as a monitoring backend and a user interface system) provide all of these capabilities. Deployment of either tool is simple. Both have Docker images, Helm charts, and other straightforward deployment options. The configuration steps are relatively quick to execute, and both tools work together out of the box. Prometheus queries are easily defined, and you can use template variables to dynamically change values in your dashboards.
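As a rough sketch of how a template variable ties into a Prometheus query, the Python snippet below builds a small dashboard definition. The field names follow the general shape of Grafana’s dashboard JSON model, but they vary between versions and should be checked against your own Grafana. The `http_requests_total` metric and the `instance` variable name are assumptions, and the `interpolate` helper only imitates the substitution Grafana performs before it sends the query.

```python
import json
from string import Template

dashboard = {
    "title": "Service overview (example)",
    "templating": {
        "list": [
            {
                # Populates a drop-down with every value of the `instance` label
                "name": "instance",
                "type": "query",
                "query": "label_values(up, instance)",
            }
        ]
    },
    "panels": [
        {
            "title": "Request rate",
            "type": "timeseries",
            "targets": [
                {"expr": 'rate(http_requests_total{instance="$instance"}[5m])'}
            ],
        }
    ],
}


def interpolate(expr, **variables):
    """Imitate variable substitution; not Grafana's actual implementation."""
    return Template(expr).substitute(**variables)


expr = dashboard["panels"][0]["targets"][0]["expr"]
print(interpolate(expr, instance="web-1:9100"))
# rate(http_requests_total{instance="web-1:9100"}[5m])
print(json.dumps(dashboard["templating"], indent=2))
```

Choosing a different value in the drop-down changes only the variable, so one panel definition can serve every instance.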
There are also some hosted Grafana options available on the market, including the recently released Infrastructure Monitoring by Logz.io, which offers an easy correlation between Grafana and Kibana within the same user interface.
Grafana was built to support the time-series data model that Prometheus is based on. Therefore, it is the ideal tool to visualize the metrics Prometheus provides. Prometheus is designed for working with modern technologies like Kubernetes, serverless architecture, and microservices. As a result, it can provide the kind of data DevOps staff need to maintain a high-availability production environment.
Grafana comes with the ability to upload ready-made dashboards for use with each of these modern technologies. Additionally, the user community at large has developed dashboards with many visualizations for a variety of use cases related to these technologies. The dashboards are preconfigured to work with Prometheus servers and provide valuable information for DevOps teams from the moment they are implemented.
Monitor the Monitoring Systems
Another way to employ the Prometheus and Grafana combination is by monitoring the monitoring tools themselves. Prometheus is an excellent metrics collector, but when it is used to monitor an application, you must also keep track of the other systems used to monitor the application, track their users’ experiences, and constantly ensure that all the monitoring services are alive and healthy.
For log management, most DevOps teams use the ELK Stack, Splunk, or one of the many other logging systems that can help with root cause analysis. For continual validation of the application’s availability, tools like Pingdom and Uptime Robot are commonly employed.
The availability and performance of these tools and services have to be validated as well, as they are part of the production environment’s mission-critical stack. With that in mind, both Prometheus and Grafana offer ready-made support for monitoring them: Prometheus has exporters that collect the relevant metrics, and Grafana visualizes those metrics in a variety of dashboards.
Grafana, through its template variables feature, allows you to work with different Prometheus servers by simply switching between them in the dashboard view. In this situation, you have one centralized Grafana that knows to pull data from a set of Prometheus servers and display data from one Prometheus server at a time.
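One common way to set this up is a variable of type `datasource` that lists the configured Prometheus data sources, with each panel pointing at the variable instead of a fixed server. The sketch below builds such a definition in Python. The field names reflect the dashboard JSON model as used in recent Grafana versions and are an assumption here (older versions reference the data source as a plain string), and `node_load1` is just an example metric.

```python
import json


def dashboard_for(expr):
    """Build a one-panel dashboard whose data source is chosen by a variable."""
    return {
        "title": "Per-cluster overview (example)",
        "templating": {
            "list": [
                {
                    # Lists every configured data source of type "prometheus"
                    "name": "datasource",
                    "type": "datasource",
                    "query": "prometheus",
                }
            ]
        },
        "panels": [
            {
                "title": "System load",
                "type": "timeseries",
                # Resolved to whichever Prometheus server is selected
                "datasource": {"type": "prometheus", "uid": "${datasource}"},
                "targets": [{"expr": expr}],
            }
        ],
    }


multi_server_dashboard = dashboard_for("node_load1")
print(json.dumps(multi_server_dashboard, indent=2))
```

Switching the drop-down re-runs the same queries against the selected server, which matches the one-server-at-a-time behavior described above.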
Both tools have an alerting module. However, the UI of Prometheus’ Alertmanager doesn’t meet the needs of most DevOps and production teams. Grafana provides a more straightforward solution to this problem. When you create a table panel and query the ALERTS metric, which Prometheus generates for every pending or firing alert, all of those alerts can be displayed in a single pane. Grafana alerts can also be used alongside Prometheus alerts, since they close some of the gaps in the Prometheus alerting module.
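The alert state that Prometheus records can be read through its `ALERTS` series, for instance with an instant query against the HTTP API. The Python sketch below builds such a query URL and turns a response into table rows. The server address, the alert names, and the sample response are made up for illustration; only the response layout follows the documented instant-query format.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def alert_rows(response, state=None):
    """Flatten an instant-query result into (alertname, alertstate, severity) rows."""
    rows = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        if state is None or labels.get("alertstate") == state:
            rows.append((labels.get("alertname"), labels.get("alertstate"), labels.get("severity")))
    return rows


# Hypothetical response to the query `ALERTS`
SAMPLE_RESPONSE = json.loads("""
{"status": "success",
 "data": {"resultType": "vector", "result": [
   {"metric": {"__name__": "ALERTS", "alertname": "HighErrorRate",
               "alertstate": "firing", "severity": "page"},
    "value": [1700000000, "1"]},
   {"metric": {"__name__": "ALERTS", "alertname": "DiskAlmostFull",
               "alertstate": "pending", "severity": "warn"},
    "value": [1700000000, "1"]}
 ]}}
""")

print(build_query_url("http://prometheus.example:9090", "ALERTS"))
print(alert_rows(SAMPLE_RESPONSE, state="firing"))
# [('HighErrorRate', 'firing', 'page')]
```

A Grafana table panel does the equivalent work for you: it runs the query and renders one row per alert series.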
Grafana has many integrations with collaboration tools. In one of its latest versions, a feature was added that allows a user to send an image of a graph representing the specific alert that was generated. Grafana simply renders the panel associated with the alert rule as a PNG image and includes it in the notification. This way, all of the collaboration systems can easily display the image.
Monitoring the Monitoring Tools
Just like other applications and systems, Prometheus and Grafana are not fail-safe. The monitoring state of mind requires DevOps teams to keep all production services available. Since both Prometheus and Grafana are production-supporting systems, they, too, must be monitored to constantly validate the accuracy of the data that is collected and displayed.
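A simple starting point is to poll the health endpoints that both tools expose: `/-/healthy` on Prometheus and `/api/health` on Grafana. The following Python sketch does this with the standard library. The host names are placeholders, and in a real setup the check should run from somewhere outside the monitoring stack, so that it keeps working when the stack itself is down.

```python
from urllib.request import urlopen

HEALTH_PATHS = {
    "prometheus": "/-/healthy",
    "grafana": "/api/health",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except OSError:  # covers URLError, HTTPError, and timeouts
        return None


def check_stack(base_urls, fetch_status=http_status):
    """Map each tool to True if its health endpoint answers with HTTP 200."""
    results = {}
    for tool, base_url in base_urls.items():
        url = base_url.rstrip("/") + HEALTH_PATHS[tool]
        results[tool] = fetch_status(url) == 200
    return results


# Placeholder addresses; replace them with your own servers before running:
# print(check_stack({
#     "prometheus": "http://prometheus.example:9090",
#     "grafana": "http://grafana.example:3000",
# }))
```

The `fetch_status` parameter exists so the check can be exercised with a stub in tests, without touching the network.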
In the case of large-scale applications, DevOps teams have already invested significant effort in production maintenance. Requiring them to also maintain a large-scale monitoring stack that can handle the system load is rarely worth the cost and effort, since Grafana monitoring can easily be outsourced to a managed solution. Doing so gives DevOps teams confidence and peace of mind while providing the same visibility and functionality as a self-managed system.
Grafana and Prometheus work so well together that other tools and systems are starting to have built-in Prometheus and Grafana support as part of their releases. Both continue to develop new features and capabilities, and both continue to be extended by the community, making the future of this partnership very bright. Every DevOps and SRE team that chooses to implement Prometheus and Grafana together as part of their monitoring stack will gain from this integration—both as it exists now and as it evolves in the future. Anticipated metrics for newly supported tools, new measurements, new visualizations, and new dashboards promise to make this pairing even more effective.