[Figure: OpenStack monitoring]

OpenStack is an open source project that allows enterprises to implement private clouds. Well-known companies such as PayPal and eBay have been using OpenStack to run production environments and mission critical services for years.

However, establishing and running a private cloud is not an easy task: it involves controlling a large and complex system assembled from multiple modules. Issues occur more frequently in such IT environments, so operations teams should log and monitor system activities at all times. This helps them catch developing performance problems before they affect users.

The Elasticsearch, Logstash and Kibana (ELK) open source stack is one of the leading logging platforms due to its scalability, performance, and ease of use. It’s well-suited for this purpose. Here, I will discuss OpenStack monitoring and provide a step-by-step guide that shows how to retrieve, ship, and analyze the data using the ELK Stack.

Retrieving the Logs

Most OpenStack services such as Nova, Cinder, and Swift write their logs to subdirectories of the /var/log directory. (For example, Nova’s raw log file is in /var/log/nova.) In addition, OpenStack allows you to retrieve the logs using its REST API and CLI. Here, we will use the API because it returns the data in a structured JSON format, making the logging and shipping process simpler due to its good compatibility with Logstash and Elasticsearch.

Authentication

To use the OpenStack APIs, you need an authentication token, which you can get from Keystone with a curl request.

In that request, TENANT_NAME is the name of the project that we want to monitor, USERNAME and PASSWORD are your OpenStack environment admin credentials, and OPENSTACK_IP is the IP address of your OpenStack deployment.

In the response, the id under the access.token property is the token that we use to access the APIs. Also keep the tenant id, because we will use it later to monitor the specific tenant's resources.
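As a minimal sketch of that exchange in Python (rather than curl), the snippet below builds the token request and pulls the token id and tenant id out of the reply. It assumes the Keystone v2.0 API on port 5000, which matches the access.token structure described here; the function names are our own illustration, not part of OpenStack.

```python
import json
import urllib.request


def build_token_request(openstack_ip, tenant_name, username, password):
    """Build a POST request for the Keystone v2.0 token endpoint.

    Port 5000 and the /v2.0/tokens path are assumptions about the deployment.
    """
    body = {
        "auth": {
            "tenantName": tenant_name,
            "passwordCredentials": {"username": username, "password": password},
        }
    }
    return urllib.request.Request(
        url=f"http://{openstack_ip}:5000/v2.0/tokens",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token_and_tenant(response):
    """Return (token id, tenant id) from a parsed Keystone v2.0 response."""
    token = response["access"]["token"]
    return token["id"], token["tenant"]["id"]


def authenticate(openstack_ip, tenant_name, username, password):
    """Send the request; this needs a reachable Keystone service."""
    request = build_token_request(openstack_ip, tenant_name, username, password)
    with urllib.request.urlopen(request, timeout=10) as reply:
        return extract_token_and_tenant(json.load(reply))
```

Calling authenticate(OPENSTACK_IP, TENANT_NAME, USERNAME, PASSWORD) would return the two values that the later API calls need.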

Nova Metrics

As in any other cloud, the core of an OpenStack cloud is Nova, the compute module. Nova is responsible for provisioning and managing the virtual machines. Nova monitoring can be divided into three layers: the underlying hypervisor, the individual server (VM), and the specific tenant.

Hypervisor metrics expose the underlying infrastructure performance. The server metrics provide information on the virtual machines’ performance. Tenant metrics provide detailed information about user usage.

Hypervisor Metrics

Monitoring the hypervisor is very important. Issues with this layer will lead to broad failure and issues with VM provisioning and performance. The hypervisor exposes a lot of metrics, but you will need to pick the ones that are most important to you. We picked the following ones that we believe provide the baseline transparency that is required to keep a healthy environment:

  • current_workload: number of tasks in progress, for example build, snapshot, and migrate
  • running_vms: number of running VMs
  • vcpus: number of available virtual CPUs (vcpus_used reports how many are in use)
  • free_disk_gb: free hard drive capacity in GB
  • free_ram_mb: amount of available memory in MB

These metrics cover the available capacity for both computation and storage, so you can see the current load and spot the resource shortages that would eventually harm your OpenStack cloud performance.

To retrieve this information, query the Nova API with the token obtained from Keystone.
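A minimal Python sketch of that request follows. It assumes the Nova v2 API on port 8774 and its os-hypervisors/statistics resource, which returns the fields under a hypervisor_statistics object; the helper names are hypothetical, and the sample values in the usage note are made up.

```python
import json
import urllib.request

# The five baseline fields picked in the list above.
BASELINE_FIELDS = (
    "current_workload",
    "running_vms",
    "vcpus",
    "free_disk_gb",
    "free_ram_mb",
)


def hypervisor_stats_request(openstack_ip, tenant_id, token):
    """Build a GET request for the aggregated hypervisor statistics.

    Port 8774 and the /v2/ path prefix are assumptions about the deployment.
    """
    url = f"http://{openstack_ip}:8774/v2/{tenant_id}/os-hypervisors/statistics"
    return urllib.request.Request(url, headers={"X-Auth-Token": token})


def pick_baseline(response):
    """Keep only the baseline fields; fields missing from the reply become None."""
    stats = response.get("hypervisor_statistics", {})
    return {name: stats.get(name) for name in BASELINE_FIELDS}


def fetch_baseline(openstack_ip, tenant_id, token):
    """Send the request; this needs a reachable Nova API."""
    request = hypervisor_stats_request(openstack_ip, tenant_id, token)
    with urllib.request.urlopen(request, timeout=10) as reply:
        return pick_baseline(json.load(reply))
```

For a reply such as {"hypervisor_statistics": {"running_vms": 4, "vcpus": 16}}, pick_baseline keeps those two values and sets the other three fields to None, which makes gaps in the data easy to spot downstream.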

Server Metrics

Nova server metrics contain information about the individual instances that run on the compute nodes. Monitoring the instances helps to ensure that loads are being distributed evenly and that network activity and CPU time are being reported.

The metrics are retrieved from the Nova API for each server. The list of servers in a project comes from the same API.
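The two calls can be sketched in Python as follows: list the servers of a project, then ask for the diagnostics of each one. This assumes the Nova v2 API on port 8774 with its servers and servers/{id}/diagnostics resources; the helper names are illustrative, and the exact diagnostics fields depend on the hypervisor driver.

```python
import json
import urllib.request

NOVA_PORT = 8774  # assumed default Nova API port


def servers_url(openstack_ip, tenant_id):
    """URL that lists the servers of one project."""
    return f"http://{openstack_ip}:{NOVA_PORT}/v2/{tenant_id}/servers"


def diagnostics_url(openstack_ip, tenant_id, server_id):
    """URL that returns the diagnostics (metrics) of one server."""
    return f"{servers_url(openstack_ip, tenant_id)}/{server_id}/diagnostics"


def server_ids(servers_response):
    """Extract the server ids from a parsed server listing."""
    return [server["id"] for server in servers_response.get("servers", [])]


def fetch_json(url, token):
    """GET a URL with the auth token; this needs a reachable Nova API."""
    request = urllib.request.Request(url, headers={"X-Auth-Token": token})
    with urllib.request.urlopen(request, timeout=10) as reply:
        return json.load(reply)


def collect_diagnostics(openstack_ip, tenant_id, token):
    """Return {server id: diagnostics} for every server in the project."""
    listing = fetch_json(servers_url(openstack_ip, tenant_id), token)
    return {
        sid: fetch_json(diagnostics_url(openstack_ip, tenant_id, sid), token)
        for sid in server_ids(listing)
    }
```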

Tenant Metrics

The tenant (or project) is the group of users that has access to specific resources and for which resource quotas are defined. Monitoring the quota together with the instances inside each project helps identify when a particular quota needs to change to match resource allocation trends.

The quota for each tenant can also be retrieved through the API.
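A minimal Python sketch of comparing a tenant's instance quota with its current instance count is shown below. It assumes the Nova v2 os-quota-sets resource, whose reply carries the limits under a quota_set object; the function names and the output field names are our own.

```python
def quota_url(openstack_ip, tenant_id):
    """URL of the quota set for one tenant (port 8774 is an assumption)."""
    return f"http://{openstack_ip}:8774/v2/{tenant_id}/os-quota-sets/{tenant_id}"


def instance_quota_usage(quota_response, servers_response):
    """Compare the instance quota with the number of servers in the project.

    A limit of -1 means "unlimited" in Nova, so no percentage is reported.
    """
    limit = quota_response["quota_set"]["instances"]
    used = len(servers_response.get("servers", []))
    percent = round(100 * used / limit, 1) if limit > 0 else None
    return {
        "instances_limit": limit,
        "instances_used": used,
        "instances_used_pct": percent,
    }
```

Shipping the percentage alongside the raw numbers makes it simple to chart how close each project is to its quota.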

The Fourth Component: RabbitMQ

In addition to these three groups, Nova components use RabbitMQ for both remote procedure calls (RPCs) and internal communication. It is crucial to log and monitor its performance because it is the default OpenStack messaging system. If this fails, it will disrupt your whole cloud deployment.

The following metrics will be collected using rabbitmqctl:

  • count: number of active queues
    Command: rabbitmqctl list_queues name | wc -l
  • memory: size of queues in bytes
    Command: rabbitmqctl list_queues name memory | grep compute
  • consumers: number of consumers by queue
    Command: rabbitmqctl list_queues name consumers | grep compute
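To turn the rabbitmqctl output into structured records that are easier to ship, a small Python parser can help. This is a sketch that assumes the usual tab-separated "name value" lines; it skips banner and header lines such as "Listing queues ...", which the wc -l approach above would count as queues. The -q (quiet) flag is used to suppress those banners where the installed version supports it.

```python
import subprocess


def parse_list_queues(output, column):
    """Parse `rabbitmqctl list_queues name <column>` output into records.

    Lines that are not "<name><TAB><integer>" (banners, headers) are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) == 2 and parts[1].strip().isdigit():
            rows.append({"queue": parts[0], column: int(parts[1])})
    return rows


def compute_queue_metrics(column):
    """Run rabbitmqctl and keep the compute queues (mirrors `| grep compute`).

    Needs rabbitmqctl on the PATH and permission to query the broker.
    """
    result = subprocess.run(
        ["rabbitmqctl", "-q", "list_queues", "name", column],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = parse_list_queues(result.stdout, column)
    return [row for row in rows if "compute" in row["queue"]]
```

compute_queue_metrics("memory") and compute_queue_metrics("consumers") correspond to the second and third commands above, and len(parse_list_queues(...)) gives the queue count without the banner lines.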

Log Shipping

The next step is to aggregate all of the logs and ship them to Elasticsearch. Here, we will present two methods: one using Logstash and the second using an Amazon S3 bucket.

Using Logstash

One of the most fundamental tools for moving logs is Logstash, which is one of the three components of the ELK Stack that I mentioned earlier. When using Logstash, the input, output, and filter should be specified. Together, they define the transportation and transformation of the logs.


The Logstash configuration file has three parts:

  • input: defines the source of log data and how a particular log source will be processed. This includes parameters such as the frequency (interval) of new incoming data.
  • filter: processes the logs in the Logstash pipeline and can drop, convert, or even replace part of a log.
  • output: defines where the data will be sent. If you are a Logz.io user, you can point this block at your Logz.io account.

Note: This Logstash example is not ideal because it works with a fixed number of tenants and instances per project. If your environment is highly dynamic, you will need to develop a mechanism that auto-discovers and updates the tenant and instance information.
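One possible shape for such a discovery mechanism is sketched below in Python. It takes an already-parsed tenant listing (assumed here to come from Keystone v2.0, which returns tenants under a "tenants" key) and the server listing of each tenant, derives the set of URLs to poll, and reports what changed since the previous run. All names are hypothetical, and regenerating the Logstash configuration from the result is left out.

```python
def tenant_ids(tenants_response):
    """Ids of the enabled tenants in a parsed tenant listing."""
    return [
        tenant["id"]
        for tenant in tenants_response.get("tenants", [])
        if tenant.get("enabled", True)
    ]


def poll_targets(openstack_ip, tenants_response, servers_by_tenant):
    """Build the set of URLs to poll: one quota URL per tenant and one
    diagnostics URL per server (Nova v2 on port 8774 is an assumption)."""
    targets = set()
    for tid in tenant_ids(tenants_response):
        base = f"http://{openstack_ip}:8774/v2/{tid}"
        targets.add(f"{base}/os-quota-sets/{tid}")
        for server in servers_by_tenant.get(tid, {}).get("servers", []):
            targets.add(f"{base}/servers/{server['id']}/diagnostics")
    return targets


def diff_targets(previous, current):
    """Return (added, removed) URLs between two discovery runs, sorted."""
    return sorted(current - previous), sorted(previous - current)
```

Running the discovery on a schedule and acting only on the added and removed lists keeps the polling configuration in step with the environment without rebuilding it from scratch each time.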

Using S3

There are several ways to store files in S3, but using Logz.io makes it easy to configure and seamlessly ship logs to the ELK Stack with S3. You will not have to automate the export and import of data.

In this example, we will reuse our previous Logstash configuration to store the logs in an S3 bucket that will be continuously tracked and used by the Logz.io ELK Stack. For that purpose, we will change the last part of the Logstash configuration (the output section) to point to the S3 bucket.

After you run ./bin/logstash -f <PATH_TO_THE_LOGSTASH_CONFIG>, the log files will appear in the S3 bucket, ready to be shipped to Elasticsearch.


Next, log into your Logz.io account and select Log Shipping. Then, look for the S3 Bucket option in the left-hand menu and fill out the input fields with the required information, as shown below:

[Figure: S3 bucket configuration]

Build The Dashboard

Now, we are ready to present the shipped metrics data. First, we will start with the hypervisor metrics:

  1. To create the chart, click on the Visualize item in the menu at the top of the page and select the type of chart that you want to use. In our case, we used a line chart.
  2. In the Metrics settings, select the type of aggregation. We used the sum over the hypervisor current workload field.
  3. In the Buckets settings, we selected Date Histogram for the X axis in the dropdown menu and left “automatic” for the interval.
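Behind a chart like this, Kibana runs an Elasticsearch aggregation. The steps above correspond roughly to the request body sketched below in Python: a date histogram on the timestamp with a sum of the workload field inside each bucket. The field name hypervisor_statistics.current_workload and the five-minute interval are assumptions (Kibana picks the interval itself when left on automatic), and Elasticsearch versions before 7.2 use "interval" instead of "fixed_interval".

```python
import json


def workload_over_time_query(
    field="hypervisor_statistics.current_workload", interval="5m"
):
    """Build a search body: sum of `field` per time bucket, no raw hits."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "over_time": {
                "date_histogram": {
                    "field": "@timestamp",
                    "fixed_interval": interval,
                },
                "aggs": {"total": {"sum": {"field": field}}},
            }
        },
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a search request.
    print(json.dumps(workload_over_time_query(), indent=2))
```

Passing a different field, such as hypervisor_statistics.vcpus_used, yields the equivalent query for the other hypervisor charts.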

The resulting chart is shown below:

[Figure: hypervisor current workload over time]

We followed almost the same steps to create the following charts for the hypervisor memory usage and the number of cores (hypervisor_statistics.vcpus_used) in use:

[Figure: hypervisor memory used over time]

[Figure: hypervisor core amount in use]

To create the chart above, we defined minimum and maximum values so that we can easily see whether the tenants are about to hit their core limits.

Kibana’s flexibility on top of the OpenStack logs in Elasticsearch allows us to create a comprehensive and rich dashboard that helps us control and monitor our cloud.

[Figure: OpenStack monitoring dashboard]

The Conclusion

By its nature, the OpenStack cloud is a complex and evolving system that continuously generates vast amounts of log data. However, not all of this data can be accessed easily without a robust and structured monitoring system.

The ELK Stack not only provides the robustness and supports the real-time performance required, but when correctly deployed, it is also flexible enough to support the ongoing monitoring and control that cloud operations teams must have.

Logz.io is an AI-powered log analysis platform that offers the open source ELK Stack as a cloud service with machine learning technology and can be used for log analysis, IT infrastructure and application monitoring, business intelligence, and more. Start your free trial today!

Asaf Yigal is co-founder and VP Product at Logz.io. Prior to Logz.io, Asaf co-founded Currensee, a social-trading platform, which was later acquired by OANDA in 2013. Prior to Currensee, Asaf played executive roles at Akorri in developing an end-to-end performance monitoring platform and at Onaro in developing a storage resource management platform. Both Akorri and Onaro were acquired by NetApp. Prior to Onaro, Asaf headed a research team in the Israeli Navy, taking an artificial intelligence system to military deployment. Asaf holds a B.S. from the Technion and is an Instrument-rated private pilot.