A Guide to Monitoring AWS Lambda Metrics with Prometheus & Logz.io
In this post, we’ll discuss key considerations and strategies for monitoring your AWS Lambda functions, including: which Lambda metrics you’ll want to monitor, how to collect AWS Lambda metrics with Prometheus and Logz.io, how to create a monitoring dashboard with alerts, and how to search and visualize your metrics.
Key AWS Lambda Metrics to Monitor
The following section briefly highlights the key Lambda metrics to monitor so you can understand your functions’ behavior and identify problems. There are three key categories of AWS Lambda metrics:
1. Invocation metrics
- Invocations – The number of times your function code is executed, including successful executions and executions that result in a function error.
- Errors – The number of invocations that result in a function error.
- DeadLetterErrors – For asynchronous invocation, the number of times Lambda attempts to send an event to a dead-letter queue but fails.
- Throttles – The number of invocation requests that are throttled, which is when Lambda intentionally rejects a request and does not execute the function.
- ProvisionedConcurrencyInvocations – The number of times your function code is executed on provisioned concurrency.
- ProvisionedConcurrencySpilloverInvocations – The number of times your function code is executed on standard concurrency when all provisioned concurrency is in use.
2. Performance metrics
- Duration – The amount of time that your function code spends processing an event.
- IteratorAge – For event source mappings that read from streams, the age of the last record in the event.
3. Concurrency metrics
- ConcurrentExecutions – The number of function instances that are processing events.
- ProvisionedConcurrentExecutions – The number of function instances that are processing events on provisioned concurrency.
- ProvisionedConcurrencyUtilization – For a version or alias, the value of ProvisionedConcurrentExecutions divided by the total amount of provisioned concurrency allocated.
For more information on these metrics, check out the detailed AWS documentation.
Now that we know which metrics to monitor, let’s set up some dashboards to monitor them. While Lambda metrics are automatically published to CloudWatch, most engineering teams prefer other solutions to collect and analyze their metrics.
Let’s check out how we can use Prometheus and Logz.io to collect metrics at scale so they can be searched and analyzed.
Collecting and Monitoring Lambda Metrics with Prometheus and Logz.io
Now comes the fun part. To put the metrics described above into action, we’ll need to collect the metric data and build dashboards to monitor them. We’ll do these things with Prometheus and Logz.io.
But metrics don’t show the full picture of what’s happening with your Lambda functions. Once we identify problems in metrics monitoring dashboards, we’ll need log data to troubleshoot them. We’ll also show how to correlate your metrics with the associated logs in Logz.io.
Let’s get started.
Collecting and Analyzing Metrics with Prometheus and Logz.io
Before we set up the monitoring dashboards, we’ll need to get the metric data into our monitoring solutions.
In this section, we’ll pull Lambda metrics from CloudWatch with Prometheus. Then, if you’d like to try it yourself, you can follow along to forward those metrics to Logz.io’s Prometheus-as-a-service for scalable storage and analysis.
If you’re NOT interested in using Prometheus to collect your Lambda metrics, check out these directions for sending your metrics straight from CloudWatch to Logz.io with the otel/opentelemetry-collector.
How to Pull Lambda Metrics from CloudWatch into Your Prometheus Server
Assuming you already have Prometheus installed, let’s start pulling Lambda metrics from CloudWatch (where they are automatically published) into a Prometheus server so we can monitor them.
To scrape the metrics, we will use yet-another-cloudwatch-exporter. It can export metrics for many AWS services, but for now we’ll stick to exporting AWS Lambda metrics with it. Here is an example yet-another-cloudwatch-exporter config file (in YAML):
discovery:
  jobs:
    - regions:
        - us-east-1
      type: lambda
      enableMetricData: true
      metrics:
        - name: Duration
          statistics: [ Sum, Maximum, Minimum, Average ]
          period: 300
          length: 3600
        - name: Invocations
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: Errors
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: Throttles
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: DeadLetterErrors
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: DestinationDeliveryFailures
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: ProvisionedConcurrencyInvocations
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: ProvisionedConcurrencySpilloverInvocations
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: IteratorAge
          statistics: [ Average, Maximum ]
          period: 300
          length: 3600
        - name: ConcurrentExecutions
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: ProvisionedConcurrentExecutions
          statistics: [ Sum ]
          period: 300
          length: 3600
        - name: ProvisionedConcurrencyUtilization
          statistics: [ Maximum ]
          period: 300
          length: 3600
        - name: UnreservedConcurrentExecutions
          statistics: [ Sum ]
          period: 300
          length: 3600
Once the right credentials and configuration are set, you can run the CloudWatch metrics exporter locally, as shown below:
docker run --rm -v $PWD/credentials:/exporter/.aws/credentials -v $PWD/yace.yml:/tmp/config.yml -p 5000:5000 --name yace quay.io/invisionag/yet-another-cloudwatch-exporter
Then you will be able to see metrics on port 5000, as shown below. You can see that the dev-generate-random-errors Lambda metrics are being reported. You can find details on running and configuring yet-another-cloudwatch-exporter in its official GitHub repo.
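If you’d like to sanity-check the exporter before pointing Prometheus at it, you can query its metrics endpoint directly. This is just a quick sketch assuming the port and aws_lambda metric prefix from the example above; adjust them if your setup differs:
# List the Lambda metrics the exporter is currently serving
curl -s http://localhost:5000/metrics | grep aws_lambda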
Using PromQL Queries
Now that Prometheus is configured successfully, we can query our AWS/Lambda metrics using the PromQL query language.
The screenshot below shows the result of querying aws_lambda_errors_sum for the dev-generate-random-errors Lambda function.
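For illustration, queries along these lines return a per-function error count and an overall error rate. This is a sketch rather than a definitive recipe; the exact metric and label names (such as dimension_FunctionName) depend on your exporter version and configuration, so verify them against what your exporter actually exposes:
# Total errors reported for one function
sum(aws_lambda_errors_sum{dimension_FunctionName="dev-generate-random-errors"})
# Errors as a fraction of invocations across all monitored functions
sum(aws_lambda_errors_sum) / sum(aws_lambda_invocations_sum)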
Prometheus is a widely popular way to collect metrics from Lambda functions. But these metrics are often analyzed in Grafana, due to Prometheus’s limited visualization capabilities. Learn more about how these tools work together here.
This example shows a small volume of metric data. But Prometheus servers and roll-up servers can be a lot to manage, upgrade, and maintain at large scale. Plus, you’ll need a totally separate solution for logging, which can slow down your monitoring and troubleshooting workflows and can lead to tool sprawl.
These are the reasons Logz.io built Prometheus-as-a-service, which collects all your Prometheus metrics on a scalable SaaS platform for long-term storage. Plus, it unifies your metrics alongside your log analytics, which consolidates your monitoring data and makes it faster to monitor and troubleshoot your Lambda functions.
Forwarding Prometheus Metrics to Logz.io for Scalable Storage
Logz.io offers a scalable Prometheus-as-a-service solution called Infrastructure Monitoring.
All we have to do to forward our Prometheus metrics to Logz.io Infrastructure Monitoring is add the remote_write url to our prometheus.yml config file. This will automatically forward all the metrics collected by Prometheus to a remote destination – in this case, Logz.io.
Just open a free Logz.io account and add the remote_write parameters to the config. It’s not complicated, but you can find the detailed documentation here to set it up yourself.
Here is an example prometheus.yml file. Note that you’ll need to add the account token – which you can find in your Logz.io account page – to the file:
global:
  scrape_interval: 60s
  evaluation_interval: 60s
  external_labels:
    p8s_logzio_name: LambdaMetrics
scrape_configs:
  - job_name: 'cloudwatch_yet-another-exporter'
    scrape_interval: 30s
    static_configs:
      - targets: ['<exporter-host-ip>:5000']
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
remote_write:
  - url: https://listener.logz.io:8053
    bearer_token: <Metrics Account Token>
    remote_timeout: 30s
    queue_config:
      batch_send_deadline: 5s
      max_shards: 10
      min_shards: 1
      max_samples_per_send: 500
      capacity: 10000
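Before restarting Prometheus with the updated file, you can validate it with promtool, which ships with Prometheus. A quick sanity check, assuming promtool is on your path and the file is named prometheus.yml:
# Validate the updated configuration before restarting Prometheus
promtool check config prometheus.yml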
Now, all your metrics will be automatically scraped from CloudWatch and forwarded to Logz.io for long-term storage and analysis. In other words, you can keep your existing Prometheus servers in place, without needing to manage the data storage at scale. Plus, your metrics can be unified with log and trace analytics:
Creating AWS Lambda Metrics Monitoring Dashboards
Now we can set up monitoring dashboards in Logz.io that monitor our generate-random-errors Lambda. Creating new monitoring visualizations (called panels) is easy. Hit the ‘Add panel’ button, select the data source, and add the PromQL query to visualize the desired metrics:
The query above monitors all our errors from the function. After hitting save, you can see the new visual in the top right panel below.
We also added panels that monitor latency (which measures the average and max durations of all the invocations), throttles, invocations, and global concurrent executions – a good start on the list of metrics covered in the ‘Key AWS Lambda Metrics to Monitor’ section.
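If you want to build similar panels yourself, queries along these lines are a reasonable starting point. As above, treat the metric names here as assumptions to verify against what your exporter actually emits:
# Average and maximum duration across all monitored functions
avg(aws_lambda_duration_average)
max(aws_lambda_duration_maximum)
# Total throttles and invocations
sum(aws_lambda_throttles_sum)
sum(aws_lambda_invocations_sum)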
But nobody wants to stare at a dashboard all day to see if something goes wrong, so we can set alerts to do that for us.
Setting Up Logz.io Alerts
The alert below triggers automatically when the Lambda error count is greater than 17 (an arbitrary threshold), evaluated every minute over the last 2 minutes. You can set the threshold by dragging the highlighted heart icon up and down the y-axis.
Then, we can configure the alert to notify us when it triggers via endpoints like Slack, PagerDuty, VictorOps, ServiceNow, OpsGenie, Gmail, and many others:
In the real world, after being alerted of spiking Lambda metrics, we’d immediately want to investigate the related logs to debug the problem. This would normally require accessing a separate system to begin querying the logs.
But Logz.io also offers a Log Management solution, so we can correlate our error metrics with the associated logs – immediately bringing up the logs that explain the spiking metrics by hitting this link we’ve added:
Hitting this link brings us straight to the relevant logs so we can debug any problems in our Lambda functions.
But how did we get our Lambda logs into Logz.io, you may ask? Check out our Lambda logging blog to learn more.
Your Monitoring Game Plan
Go into your Lambda monitoring journey with a game plan. First, review the metrics we covered in the ‘Key AWS Lambda Metrics to Monitor’ section to decide exactly what you want to monitor.
Once you know which metrics you want to monitor, the next question becomes how you’re going to collect and analyze the data. Most of today’s ‘monitor-ers’ choose Prometheus. It’s easy to set up and integrate with modern stacks to collect your metrics. But there are a few considerations.
The first is scale and time. The data we used in this example was very small-scale. More realistic production environments require tools to effectively ingest and store huge quantities of data.
Using Prometheus for this is doable – some of the best engineering teams in the world use it. The question for you is whether or not your team has the time and resources to manage it.
The next consideration is whether you want your metrics and other monitoring data (logs and/or traces) all in one place, or if you’re okay with having them spread out.
Logz.io is a good alternative for teams that want somebody else to manage these open source tools and/or want all of their monitoring data in one place.
If that’s the case for you, try Logz.io for free here. Otherwise, Prometheus is your best bet.