How to Use ELK to Monitor Performance


Very often when troubleshooting performance issues, I would see a service or a couple of machines slow down and reach high CPU utilization. This might mean that they lack resources because of high load, but just as often it means that there is a bug in the code, an exception, or an error flow that over-utilizes resources. To find out which, I had to jump between New Relic/Nagios and the ELK Stack.

So, I decided that I wanted a single pane of glass to view performance metrics combined with all the events generated by the apps, operating systems, and network devices.

To use ELK to monitor your platform’s performance, a couple of tools and integrations are needed: probes run on each host to collect various system performance metrics, the data is shipped to Logstash, stored and aggregated in Elasticsearch, and then turned into Kibana graphs that operations teams can use to present their results. In this article, I will share how we built our ELK Stack to monitor our own service performance.

1. Collecting and Shipping


In the first stage of collecting and shipping data to Logstash, we used a tool called Collectl. This cool open-source project comes with a great number of options that allow operations to measure various metrics from many different IT systems and save the data for later analysis. We used it to generate, track, and save metrics such as network throughput, CPU disk I/O wait percentage, free memory, and idle CPU (indicating over- or under-use of computing resources). It can also monitor other system resources such as inode usage and open sockets.

Collectl command example:

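As a stand-in sketch (these are standard Collectl options, but the exact flag set we ran is an assumption), an invocation that samples several subsystems and writes plot-format output could look like:

```shell
# -scdmn: sample the cpu, disk, memory, and network subsystems
# -P: write output in plot format (one space-delimited line per sample)
# -f: filename prefix for the output file (Collectl appends the date)
collectl -scdmn --interval 30 -P -f /var/log/collectl/performance-tab
```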

Finally, Collectl outputs the metrics into a log file in plot format. This live open-source project knows how to gather information, but it does not ship that data to the ELK Stack on its own.

Using a Docker Container

We encapsulated Collectl in a Docker container in order to have a Docker image that basically covered all of our data collecting and shipping needs. We used Collectl version 4.0.0 and made the following configurations to avoid a couple of issues:

— In order to avoid data overflow within the container, we only keep data collected during the current day. Longer retention periods are handled by the ELK Stack itself, so you don’t need to worry about keeping all of the data in the container’s log file.

— Collectl samples metrics at one interval but dumps the output to disk at another, called the flush interval. We flush every second, as this is the closest to real time that you can achieve, while sampling every 30 seconds; even a 30-second collection interval is quite aggressive and may not be necessary for every use case. An output formatter produces the plot format, which puts the various values on the same line with a space delimiter.

The Collectl configuration file should look something like this:

DaemonCommands = -f /var/log/collectl/performance-tab -r00:00,0 -F1 -s+YZ -oz -P --interval 30
PQuery =   /usr/sbin/perfquery:/usr/bin/perfquery:/usr/local/ofed/bin/perfquery
PCounter = /usr/mellanox/bin/get_pcounter
VStat =    /usr/mellanox/bin/vstat:/usr/bin/vstat
OfedInfo = /usr/bin/ofed_info:/usr/local/ofed/bin/ofed_info
Ipmitool =  /usr/bin/ipmitool:/usr/local/bin/ipmitool:/opt/hptc/sbin/ipmitool
IpmiCache = /var/run/collectl-ipmicache
IpmiTypes = fan,temp,current



RSYSLOG is another component of the container: it picks the data up from the log file and ships it to the ELK Stack. So that Logstash only has to grok the necessary fields instead of everything, it is advisable to use RSYSLOG to add a bit more metadata to the log, taking the metrics just before shipping and adding information such as the instance name and host IP. Along with the proper timestamp, this information is then sent to Logstash.

Small Gotchas

At this stage, there are two issues that require some attention:

1 – Timestamp: First of all, Collectl doesn’t include the time zone in the timestamps of the collected data. Therefore, if you are running hosts in various time zones, their metrics won’t be aligned properly in your ELK Stack. To hack around this problem, we query the time zone the container is running in and set the timestamp accordingly.
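A minimal sketch of that fix (the function name is an assumption): produce a zone-aware timestamp from inside the container so that hosts in different time zones line up in Elasticsearch.

```shell
# Build a timestamp that carries the container's own UTC offset,
# e.g. 2016-05-01T12:00:00+0200. %z expands to the numeric offset
# of the local time zone, which is exactly what Collectl's output lacks.
zone_stamp() {
    date "+%Y-%m-%dT%H:%M:%S%z"
}
```

This stamp can then be prepended to each metric line before RSYSLOG ships it, which is what the `zone_time` field in the grok pattern later picks up.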

2 – Follow the Collectl log filename: The other complication is that Collectl outputs the data into a file whose name doesn’t remain constant: only the filename prefix is customizable, and Collectl automatically appends the current date. RSYSLOG is unable to follow a file whose name changes from day to day. One way around this is to use the latest version of RSYSLOG, version 8, which I assume most users are not yet running. For older versions, we created a short script that runs from cron inside the container and links a constant name to the ever-changing data file. RSYSLOG then follows that constant name even though it is a link whose target changes daily, like a pointer that moves to whatever the Collectl log filename is at any moment.
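A minimal sketch of that cron-driven link script, assuming Collectl appends a `-YYYYMMDD` date suffix to the configured prefix (the exact suffix, paths, and function name here are assumptions):

```shell
# Point a constant symlink at Collectl's date-suffixed plot file so that
# RSYSLOG can keep following a single, stable filename.
link_collectl_log() {
    logdir="$1"
    today=$(date +%Y%m%d)
    # -s: make a symbolic link; -f: replace yesterday's link in place
    ln -sf "$logdir/performance-tab-$today.tab" "$logdir/daily.log"
}
```

A crontab entry that fires just after midnight (the script path is hypothetical), e.g. `0 0 * * * /usr/local/bin/link-collectl-log.sh`, keeps the link pointing at the current day’s file.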

$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/spool/rsyslog

# Logengine access file:
$InputFileName /var/log/collectl/daily.log
$InputFileTag performance-tab:
$InputFileStateFile stat-performance
$InputFileSeverity info
$InputFilePersistStateInterval 20000

$template logzioPerformanceFormat,"[XXLOGZTOKENXX] <%pri%>%protocol-version% %timestamp:::date-rfc3339% XXHOSTNAMEXX %app-name% %procid% %msgid% [type=performance-tab instance=XXINSTANCTIDXX] XXOFSETXX %msg% XXUSERTAGXX\n"
if $programname == 'performance-tab' then @@XXLISTENERXX;logzioPerformanceFormat

Container Checklist

— Collectl
— Collectl output file link rotation script
— crontab config for rotation script
— RSYSLOG with the shipping configuration

The Docker image can be pulled from DockerHub at this link:

2. Parsing the Data

After the collection and shipping stages comes data parsing. Collectl returns unstructured log data, basically a series of numbers, which is fed into a Logstash grok expression to extract each field name and its value.

The Collectl configuration parameters explicitly set a specific output pattern, and the RSYSLOG configuration adds the time zone at a specific place in the shipped message. If you are using these two configurations together, the grok pattern that you need is:

%{GREEDYDATA:zone_time} %{NUMBER:cpu__user_percent:int} %{NUMBER:cpu__nice_percent:int}
%{NUMBER:cpu__sys_percent:int} %{NUMBER:cpu__wait_percent:int} %{NUMBER:cpu__irq_percent:int}
%{NUMBER:cpu__soft_percent:int} %{NUMBER:cpu__steal_percent:int}
%{NUMBER:cpu__idle_percent:int} %{NUMBER:cpu__totl_percent:int} 
%{NUMBER:cpu__guest_percent:int} %{NUMBER:cpu__guestN_percent:int} 
%{NUMBER:cpu__intrpt_sec:int} %{NUMBER:cpu__ctx_sec:int} %{NUMBER:cpu__proc_sec:int} 
%{NUMBER:cpu__proc__queue:int} %{NUMBER:cpu__proc__run:int} %{NUMBER:cpu__load__avg1:float} 
%{NUMBER:cpu__load__avg5:float} %{NUMBER:cpu__load__avg15:float} %{NUMBER:cpu__run_tot:int} 
%{NUMBER:cpu__blk_tot:int} %{NUMBER:mem__tot:int} %{NUMBER:mem__used:int} 
%{NUMBER:mem__free:int} %{NUMBER:mem__shared:int} %{NUMBER:mem__buf:int} 
%{NUMBER:mem__cached:int} %{NUMBER:mem__slab:int} %{NUMBER:mem__map:int} 
%{NUMBER:mem__anon:int} %{NUMBER:mem__commit:int} %{NUMBER:mem__locked:int} 
%{NUMBER:mem__swap__tot:int} %{NUMBER:mem__swap__used:int} %{NUMBER:mem__swap__free:int} 
%{NUMBER:mem__swap__in:int} %{NUMBER:mem__swap__out:int} %{NUMBER:mem__dirty:int} 
%{NUMBER:mem__clean:int} %{NUMBER:mem__laundry:int} %{NUMBER:mem__inactive:int} 
%{NUMBER:mem__page__in:int} %{NUMBER:mem__page__out:int} %{NUMBER:mem__page__faults:int} 
%{NUMBER:mem__page__maj_faults:int} %{NUMBER:mem__huge__total:int} 
%{NUMBER:mem__huge__free:int} %{NUMBER:mem__huge__reserved:int} 
%{NUMBER:mem__s_unreclaim:int} %{NUMBER:sock__used:int} %{NUMBER:sock__tcp:int} 
%{NUMBER:sock__orph:int} %{NUMBER:sock__tw:int} %{NUMBER:sock__alloc:int} 
%{NUMBER:sock__mem:int} %{NUMBER:sock__udp:int} %{NUMBER:sock__raw:int} 
%{NUMBER:sock__frag:int} %{NUMBER:sock__frag_mem:int} %{NUMBER:net__rx_pkt_tot:int} 
%{NUMBER:net__tx_pkt_tot:int} %{NUMBER:net__rx_kb_tot:int} %{NUMBER:net__tx_kb_tot:int} 
%{NUMBER:net__rx_cmp_tot:int} %{NUMBER:net__rx_mlt_tot:int} %{NUMBER:net__tx_cmp_tot:int} 
%{NUMBER:net__rx_errs_tot:int} %{NUMBER:net__tx_errs_tot:int} %{NUMBER:dsk__read__tot:int} 
%{NUMBER:dsk__write__tot:int} %{NUMBER:dsk__ops__tot:int} %{NUMBER:dsk__read__kb_tot:int} 
%{NUMBER:dsk__write__kb_tot:int} %{NUMBER:dsk__kb__tot:int} %{NUMBER:dsk__read__mrg_tot:int} %{NUMBER:dsk__write__mrg_tot:int} %{NUMBER:dsk__mrg__tot:int} %{NUMBER:inode__numDentry:int} %{NUMBER:inode__openfiles:int} %{NUMBER:inode__max_file_percent:int} 
%{NUMBER:inode__used:int} %{NUMBER:nfs__reads_s:int} %{NUMBER:nfs__writes_s:int} 
%{NUMBER:nfs__meta_s:int} %{NUMBER:nfs__commit_s:int} %{NUMBER:nfs__udp:int} 
%{NUMBER:nfs__tcp:int} %{NUMBER:nfs__tcp_conn:int} %{NUMBER:nfs__bad_auth:int} 
%{NUMBER:nfs__bad_client:int} %{NUMBER:nfs__reads_c:int} %{NUMBER:nfs__writes_c:int} 
%{NUMBER:nfs__meta_c:int} %{NUMBER:nfs__commit_c:int} %{NUMBER:nfs__retrans:int} 
%{NUMBER:nfs__authref:int} %{NUMBER:tcp__ip_err:int} %{NUMBER:tcp__tcp_err:int} 
%{NUMBER:tcp__udp_err:int} %{NUMBER:tcp__icmp_err:int} %{NUMBER:tcp__loss:int} 
%{NUMBER:tcp__f_trans:int} %{NUMBER:buddy__page_1:int} %{NUMBER:buddy__page_2:int} 
%{NUMBER:buddy__page_4:int} %{NUMBER:buddy__page_8:int} %{NUMBER:buddy__page_16:int} 
%{NUMBER:buddy__page_32:int} %{NUMBER:buddy__page_64:int} %{NUMBER:buddy__page_128:int} 
%{NUMBER:buddy__page_256:int} %{NUMBER:buddy__page_512:int} %{NUMBER:buddy__page_1024:int}
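As a sketch of where this pattern sits in a Logstash pipeline (the file layout and the date-filter settings are assumptions, not our exact config):

```
filter {
  grok {
    # the full pattern above, abbreviated here for readability
    match => { "message" => "%{GREEDYDATA:zone_time} %{NUMBER:cpu__user_percent:int} ..." }
  }
  date {
    # use the zone-aware timestamp captured in zone_time as the event time
    match => [ "zone_time", "ISO8601" ]
  }
}
```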


3. Visualization

If you have a fast ELK Stack, you will see the information almost immediately. This obviously depends on the performance of your deployment, but you can expect results in half a minute or less, giving you a very up-to-date stream of information.

We have several pre-built alerts and dashboards for performance metrics using Collectl. If you’re using our service, feel free to drop us a line in the chat box and we’ll share them with you.


If you’d like to learn more, feel free to leave a comment below!


3 responses to “How to Use ELK to Monitor Performance”

  1. Mark Seger says:

Always happy to see people using collectl in new and exciting ways. One comment, though: I’d claim the plot format is structured, in that the header describes the data contained below. That makes it very convenient to import into a spreadsheet and get the right column headers. There really is a method to my madness. 😉

Sounds like what you have is working for you, though I’ll at least mention the --export capability, with which you can have collectl generate output in any format of your choosing. One such example is lexpr, which reports each variable as a single line. In fact, gexpr and graphite both use an almost identical form, except they send data over a socket. And if you look at vmstat, you’ll see an example of how to generate a custom format, in this case the same format as vmstat but with all the benefits of collectl, such as timestamps, playing back data, etc.

I’d also be remiss if I didn’t plug colmux. I don’t know how anyone can manage a cluster without it. Think top-anything: an integrated view of anything collectl can generate across a cluster of nodes, displayed in a top-like format sorted by the column of your choice. You can even change the sort column in real time with the arrow keys. I can’t live without it!

    Anyhow, thanks again for using collectl and if you ever have any questions feel free to send me an email, though if you post something on the collectl website or send it to the mailing list others will benefit as well.

  2. Alex Cz says:


I implemented part of this in my architecture, but hit a couple of limitations:

What about having more than one disk/partition to monitor?
Similarly, what happens when you want to monitor several network cards?

Would you be so kind as to share your dashboards?

I guess they are cleaner than mine (attached)!


  3. Noni Peri says:

    Hi Alex,

    Regarding your questions about multiple disks and NICs:
By default, the collectl utility provides aggregate information, but it offers a way to filter it: for disks this is “--dskfilt” and for NICs this is “--netfilt”. I haven’t had the need, but if I did, I would modify the container to accept these filters from external variables and run multiple containers, one per required filter.
Regarding the dashboard, mine looks like the one below. I use the “user_tag” to label machines with similar roles so that I can quickly filter the dashboard. I also like including a table visualisation in my dashboards so that I can easily click it and have the rest of the visualisations filtered by that term. If you have an account with our service (free trial accounts are available), you can add this dashboard (and others) from our ELK Apps without the hassle of building it yourself 🙂
