Guide to AWS Monitoring with Prometheus and Logz.io

Prometheus is a widely used open source monitoring system, built around a time-series database, that is a common choice for monitoring the health and performance of AWS infrastructure. Its ecosystem covers data collection, storage, alerting, and analysis, so the tool set offers a complete package of monitoring capabilities. Prometheus is ideal for scraping metrics from cloud-native services, storing the data for analysis, and monitoring the data with alerts.

In this article, we’ll take a look at the Prometheus ecosystem, offer some key considerations for setting up Prometheus to monitor AWS, highlight some of its shortcomings, and show how to go about solving them with Logz.io.

Prometheus Ecosystem

Prometheus has three core components: a scraper that pulls metrics from the endpoints that exporters expose, a time-series database, and an alerting system called Alertmanager.

Using this system, an exporter reads metrics from AWS infrastructure and exposes the data for Prometheus to scrape. For example, you can run Node Exporter on EC2 and then configure Prometheus to pull metrics from your machines. Node Exporter collects system-level metrics and runs a small HTTP server to expose them.
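
To make the exposed data concrete, here is a minimal sketch of reading the plain-text format that an exporter serves on its metrics endpoint. The sample payload and the parse_metrics helper are illustrative assumptions, not part of any Prometheus library, and the parser assumes that samples carry no trailing timestamps.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse exposition-format text into a {series: value} mapping.

    Assumption: each sample line is "<series> <value>" with no trailing
    timestamp. A production parser has to handle timestamps as well.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, value = line.rsplit(None, 1)  # the value is the last token
        samples[series] = float(value)
    return samples


# Hand-written sample in the style of Node Exporter output (values are made up).
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_filesystem_avail_bytes Filesystem space available in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{mountpoint="/"} 9.0e+10
"""

metrics = parse_metrics(SAMPLE)
```

Prometheus does this parsing for you on every scrape; the sketch is only meant to show how simple the format is to inspect by hand when you are debugging an exporter.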

While Prometheus scraping can be used to collect metrics from all kinds of infrastructure, it’s especially popular in Kubernetes-based environments because of its comparative ease of use there. Its automatic discovery of new Kubernetes services has dramatically simplified Kubernetes monitoring. And we all know how popular Kubernetes is among today’s cloud developers.

Once data is scraped, Prometheus stores the metrics in its time-series database and evaluates alerting rules against them, while Alertmanager routes the resulting notifications to your desired endpoint.

Other tools in this ecosystem include Grafana, Trickster, Thanos, M3DB, Cortex, Pushgateway, and a number of Prometheus exporters.

Trickster is a caching layer on top of Prometheus that can cache queries that are very frequent and/or large in scale; this can prove extremely useful in lowering the pressure on Prometheus itself.

Thanos, Cortex, and M3DB extend Prometheus with features such as high availability, horizontal scaling, and long-term storage of historical data. While Prometheus itself is a single-node solution, you can write its data to these systems to consolidate metrics from multiple servers for analysis.

Pushgateway enables push-based metrics in your Prometheus setup. By default, Prometheus only pulls metrics from the scrape targets you configure. Jobs that cannot be scraped directly, such as short-lived batch jobs, can push their metrics to Pushgateway, and Prometheus will then pull the metrics from there.
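
As a rough sketch of what a push looks like, the snippet below builds an HTTP request for Pushgateway's /metrics/job/<job> endpoint using only the Python standard library. The gateway address, job name, and metric name are hypothetical placeholders, and the helper names are our own rather than part of any client library.

```python
from urllib.request import Request, urlopen


def build_push_request(gateway: str, job: str, name: str, value: float) -> Request:
    """Build a POST request that pushes one sample for the given job.

    The body uses the same text format that exporters expose, and it
    must end with a newline.
    """
    url = f"{gateway}/metrics/job/{job}"
    body = f"{name} {value}\n".encode()
    return Request(url, data=body, method="POST")


def push(request: Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code (needs a running gateway)."""
    with urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical values: a local gateway and a made-up batch job metric.
request = build_push_request(
    "http://localhost:9091", "nightly_backup", "backup_duration_seconds", 12.5
)
```

Calling push(request) only succeeds when a Pushgateway is actually listening at that address. In a real project, an official Prometheus client library is usually a better fit than hand-built requests.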

And while Prometheus is a powerful solution for collecting and storing metrics from cloud-native environments, its visualization capabilities are lacking.

As a result, most Prometheus users visualize their data with Grafana, an open source data visualization tool that easily connects to Prometheus. It has great support for PromQL, the Prometheus query language, and is a highly capable and flexible metric visualization solution.

Prometheus Challenges

As mentioned, Prometheus runs on a single node, so it is not inherently designed for high availability. Since Prometheus stores metrics on the disk of a single machine, many users end up reducing the granularity or number of the metrics they keep as their data grows. In some cases, this comes at the expense of monitoring critical information.

To scale your system without reducing the cardinality of your metrics, you can implement tools like Thanos and Trickster to centralize your Prometheus metrics for storage and analysis.

But of course, adding components means more installations, more infrastructure, more configuration, more upgrades, and more maintenance tasks, all of which require time. As a result, high availability Prometheus deployments can become increasingly difficult to manage as data volumes grow.

Finally, metrics are only one piece of the observability puzzle, and Prometheus isn’t purpose-built to collect and store logs or traces. For this reason, Prometheus users will inevitably end up isolating their metrics from their log and trace data, which can prove a recipe for observability tool sprawl. Those who want to unify their logs, metrics, and traces in one solution will need a different approach.

Key AWS Metrics to Monitor

Usage

Usage defines the percentage of consumption of any resource. For example, if you’re saving 10 GB of data on a 100 GB disk, the usage percentage is 10%. There are different ways to monitor usage.
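
The calculation itself is simple, as the sketch below shows. The usage_percent helper is our own illustration, and the second function assumes you want to measure a local disk using Python's standard shutil module.

```python
import shutil


def usage_percent(used: float, total: float) -> float:
    """Return consumption as a percentage of the total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * used / total


def disk_usage_percent(path: str = "/") -> float:
    """Measure how full the filesystem holding `path` is right now."""
    usage = shutil.disk_usage(path)
    return usage_percent(usage.used, usage.total)


# The example from the text: 10 GB stored on a 100 GB disk.
example = usage_percent(10, 100)
```

Exporters publish the raw used and total values, so in practice the same division is usually written as a query or alert expression rather than in application code.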

CPU

CPU usage is important to monitor because it reveals sustained high consumption and related issues. This metric is available for AWS services like EC2 machines, load balancers, RDS, etc. One example threshold is all of your CPU cores hitting 100% utilization.

Disk

Disk is the permanent storage (secondary storage) available to be consumed. This can be a critical metric to keep an eye on since if there is no disk left, all your software could stop working. Generally, the threshold for this is 90%. If you see 90% consumption, you should quickly extend the disk size. Services like RDS and EC2 have these metrics available.

Memory

Memory is the RAM used during any processing. At 100% memory utilization, the OOM killer may be triggered and terminate your process. The threshold here can be 80% utilization. Services like RDS, ElastiCache, EC2, and ECS have these metrics.
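
The example thresholds mentioned so far (100% for CPU, 90% for disk, 80% for memory) can be expressed as a small check like the one below. This is only a sketch: the THRESHOLDS table and the breached helper are hypothetical names, and the right limits for your workload may well differ.

```python
# Example limits taken from the text above; tune them for your own workload.
THRESHOLDS = {"cpu": 100.0, "disk": 90.0, "memory": 80.0}


def breached(readings: dict[str, float], thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the sorted names of resources at or above their threshold."""
    return sorted(
        name
        for name, value in readings.items()
        if name in thresholds and value >= thresholds[name]
    )


# Made-up readings, in percent.
alerts = breached({"cpu": 45.0, "disk": 92.5, "memory": 80.0})
```

In a Prometheus setup this logic normally lives in alerting rules rather than in application code, but the comparison being made is the same.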

Bandwidth

Bandwidth is the network I/O being consumed by your services. You have to make sure that your network I/O doesn’t reach the networking limit defined by AWS, which depends on the instance type (10 Gbps is a common figure). You can monitor this in services like NAT Gateway, EC2, ElastiCache, and RDS.

Request Count

Request count helps you identify the usage of a given resource. It tells you how many times the resource was requested. You have to watch for any anomaly here. Most AWS services have this metric, with the most important ones being load balancers, ElastiCache, RDS, and EC2.
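
What counts as an anomaly depends on your traffic, but one simple approach is to flag a request count that sits several standard deviations away from a recent baseline. The sketch below illustrates that idea only; the is_anomalous helper, the three-sigma cutoff, and the sample counts are all assumptions for the sake of the example.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[float], latest: float, max_sigmas: float = 3.0) -> bool:
    """Flag `latest` if it is more than `max_sigmas` deviations from the baseline."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) / spread > max_sigmas


# Made-up request counts per minute.
recent = [100, 102, 98, 101, 99]
spike = is_anomalous(recent, 250)
normal = is_anomalous(recent, 101)
```

Real traffic usually has daily and weekly patterns, so a fixed baseline like this is best treated as a starting point rather than a finished detector.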

AWS Errors

An error number shows if there is an increase or decrease in errors. Below are a few important error metrics that you should watch.

ELB Status Code

You should keep an eye on Elastic Load Balancer status codes as well. An increase in error status codes means that your application may not be performing well.
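
Raw error counts are easier to judge as a share of total traffic. The following sketch computes a server error rate from per-class response counts; the error_rate_percent helper and the "2xx", "4xx", and "5xx" keys are hypothetical labels chosen for illustration, not the metric names that AWS publishes.

```python
def error_rate_percent(counts: dict[str, int]) -> float:
    """Return the share of responses in the 5xx class, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic means no errors to report
    return 100 * counts.get("5xx", 0) / total


# Made-up response counts for one time window.
rate = error_rate_percent({"2xx": 950, "4xx": 30, "5xx": 20})
```

Alerting on the rate rather than the raw count keeps the alert meaningful as your traffic grows.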

S3 Access Errors

This metric gives the number of requests that resulted in failed states either due to a permission error or “not found” error.

Unhealthy Hosts

ELB and ALB generally have this metric. It is one of the most important metrics to monitor since it tells you how many backends are failing health checks and, by extension, how many healthy backends remain to serve requests. Any increase in this number can be a problem, so make sure to configure an alert for it.

AWS Performance Metrics

In the modern era of cloud computing, where latency can also be treated as an error, it is important to keep a watch on performance metrics. These will let you know if any scaling is required to run your application properly. Below are a few metrics that you should monitor in this space.

Latency Increase

Latency numbers are very important. These can tell you a lot about your application saturation and how it can scale for further requests. If you see latency increase, there may be some problem with your application or you may need to increase the number of instances of your application.
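
Averages tend to hide slow requests, so latency is often tracked with percentiles. Here is a minimal sketch using the nearest-rank method; the percentile helper and the sample values are our own illustration and are not tied to any particular AWS metric.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Return the p-th percentile of `samples` using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


# Made-up response times in milliseconds, including one slow outlier.
latencies_ms = [120, 80, 95, 300, 110, 105, 90, 100, 85, 115]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Here the median stays at 100 ms while the 95th percentile exposes the 300 ms outlier, which is the kind of increase you would want an alert to catch.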

Surge Queue Length

Surge queue length is the number of requests waiting to be served. Classic Load Balancers (ELB) report this metric as SurgeQueueLength. You don’t want your requests to sit in a queue, as this can dramatically increase response time.

Integrating Prometheus with your AWS services

Using the CloudWatch Exporter to expose AWS metrics for Prometheus scraping is a popular way to monitor AWS. Let’s go through an example of implementing this exporter to collect EC2 metric data.

Integration of EC2 with Prometheus with the CloudWatch Exporter

To integrate your EC2 machines with Prometheus, first run the CloudWatch exporter using the following command:

java -jar target/cloudwatch_exporter-*-SNAPSHOT-jar-with-dependencies.jar 9106 example.yml

Next, configure your Prometheus server to scrape the exporter by adding a job under scrape_configs:

- job_name: cloudwatch
  metrics_path: /metrics
  static_configs:
    - targets: ['ip_of_ec2_machine:port']

Now, configure the CloudWatch agent to specify which metrics to collect from the machines.

First, install the CloudWatch agent. You can follow the AWS documentation to install it or use the command below:

sudo yum install amazon-cloudwatch-agent

Then update the Prometheus scrape config to identify the new metrics sources:

global:
  scrape_interval: 1m
  scrape_timeout: 10s
scrape_configs:
  - job_name: MY_JOB
    sample_limit: 10000
    ec2_sd_configs:
      - region: us-east-1
        port: 9404
        filters:
          - name: instance-id
            values:
              - i-98765432109876543
              - i-12345678901234567

You can get the detailed instructions for the above steps in the AWS documentation.

Integration of CloudWatch Metrics with Prometheus

The easiest way to gather all of your metrics is to take them directly from CloudWatch, as most events are logged there. Simply install a CloudWatch exporter on one of your machines and run it:

java -jar target/cloudwatch_exporter-*-SNAPSHOT-jar-with-dependencies.jar 9106 example.yml

Supply the proper configuration along with your AWS credentials; these values can go in environment variables:

export AWS_ACCESS_KEY_ID="aws_key"
export AWS_SECRET_ACCESS_KEY="aws_secret"
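
Before starting the exporter, it can help to confirm that both variables are actually set. The check below is a small sketch that covers only the environment-variable route described above; the missing_credentials helper is a hypothetical name, and AWS also supports other ways of supplying credentials that this check knows nothing about.

```python
import os

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Check a sample mapping instead of the real environment.
missing = missing_credentials({"AWS_ACCESS_KEY_ID": "aws_key"})
```

Calling missing_credentials() with no argument inspects the real process environment, and an empty list means both values are present.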

Now, configure your Prometheus server to scrape the CloudWatch exporter's metrics endpoint by adding a job under scrape_configs:

- job_name: cloudwatch
  metrics_path: /metrics
  static_configs:
    - targets: ['ip_of_cloud_watch_exporter_vm:port']

Further documentation on this is available from Logz.io, and you can also read about AWS Lambda integration with Prometheus.

Solving Prometheus Issues with Logz.io

As we’ve seen in the above discussion, scaling Prometheus can be a significant challenge and you may end up managing multiple components including Thanos, Trickster, Grafana, and underlying infrastructure. As an alternative, Logz.io can solve this problem for you, and very easily at that.

Using Logz.io, you can configure your existing Prometheus server to forward the metrics and thus offload the management complexity to the Logz.io Open 360™ observability platform.

To illustrate this process, let’s quickly walk through how this is done.

Send Prometheus Metrics to Logz.io

To get started, you can easily configure Prometheus to perform a remote write to Logz.io servers. Using this approach, your Prometheus servers will act as a scraper and then write those metrics to Logz.io for storage and analysis. After taking this step, you can easily build dashboards on top of these metrics within Logz.io.

To start, simply create a Logz.io account, and select the correct region and listener configuration. Next, get your metrics account token from Settings > Manage tokens > Data shipping tokens > Metrics.

Then add the remote write URL in the Prometheus configuration:

global:
  external_labels:
    p8s_logzio_name: <labelvalue>
remote_write:
  - url: https://<<LISTENER-HOST>>:8053
    bearer_token: <<PROMETHEUS-METRICS-SHIPPING-TOKEN>> 
    remote_timeout: 30s
    queue_config:
      batch_send_deadline: 5s  #default = 5s
      max_shards: 10  #default = 1000
      min_shards: 1
      max_samples_per_send: 500 #default = 100
      capacity: 10000  #default = 500
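
To get a feel for what the queue settings mean, the arithmetic below estimates how many samples Prometheus may hold in memory and how many sends it takes to drain one full shard. It assumes that capacity applies per shard, as the Prometheus documentation describes it; the helper names are our own.

```python
import math


def max_buffered_samples(max_shards: int, capacity: int) -> int:
    """Upper bound on buffered samples, assuming `capacity` is per shard."""
    return max_shards * capacity


def sends_to_drain_shard(capacity: int, max_samples_per_send: int) -> int:
    """Number of full-size sends needed to empty one full shard buffer."""
    return math.ceil(capacity / max_samples_per_send)


# Values taken from the configuration above.
buffered = max_buffered_samples(max_shards=10, capacity=10000)
sends = sends_to_drain_shard(capacity=10000, max_samples_per_send=500)
```

With the values shown, that is up to 100,000 buffered samples and 20 sends per full shard, which is worth keeping in mind when you size memory for the Prometheus server.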

Now, simply restart Prometheus and your metrics will begin streaming to Logz.io, so you can begin building dashboards or exploring metrics using the Logz.io metrics explorer.

Logz.io brings metrics, traces, and logs together in a single platform, so it’s easy to correlate across all your data and to detect and solve issues quickly. When logs, distributed traces, and stack traces are presented with metrics, it becomes much easier to pinpoint the location and time of an issue, decreasing mean time to resolution and increasing your team’s overall efficiency.

Conclusion

Prometheus is a great tool to utilize as you begin your monitoring journey, but as your usage and scale inevitably grow, related complexity can become a significant hurdle.

For many teams, an easier alternative approach is to employ Prometheus but also ship the metrics to a managed SaaS platform such as Logz.io. This way, you can save engineering costs and spend more time building new features – all while retaining the powerful innovation of the open source community.

Logz.io is designed to be simple to integrate and use, and, importantly, it provides PromQL support for building custom dashboards and alerts on top of any metrics that you ship. You can also use AWS Kinesis to send the metrics to Logz.io, or use Logz.io’s Telemetry Collector without an intermediate Prometheus setup. Get started with a free 14-day trial of Logz.io, and monitor your AWS applications with a modern cloud-native solution based on Prometheus!
