“How do you monitor Elasticsearch at this scale?” is a question we are asked again and again by ELK Stack users and our customers. Recognizing the challenge, we wanted to share some of our monitoring engineering with everyone!
A Short History Lesson
When Elastic released Elasticsearch 2.0, the company released Marvel 2.0 as well. This was a most-welcome move because the community finally got the option to monitor Elasticsearch clusters in production — an option which was previously unavailable by license.
The problem with Marvel 2.0 is that a large number of metrics were removed, with the most useful metrics available only in the Elasticsearch API. For us, this meant Marvel suddenly became obsolete and of little use as we required deeper visibility into our clusters.
Our Monitoring Solution
As an initial replacement, we first used our “es-health” Docker image to monitor our Elasticsearch clusters. This served us well to a certain degree, but we still felt like it was not enough for real-time, complex analysis.
We were already using Graphite to retrieve real-time metrics from all of our microservices (using this Docker image / Java Agent), so we decided to try to analyze our Elasticsearch clusters with Graphite as well.
The successful result of this engineering effort was a lightweight Docker image that serves the purpose of monitoring Elasticsearch perfectly.
How does it work? It’s pretty simple. At every set interval, a request is made to the /_nodes/stats API on an Elasticsearch cluster of your choice. We recurse on those values and use the “pickle” protocol to send any numeric metric found to Graphite. Once the metrics are in Graphite, it’s only a matter of playing around with Grafana to produce a rich monitoring dashboard for the Elasticsearch cluster.
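The flow described above can be sketched in a few lines of Python. This is a minimal illustration and not the code inside our Docker image: the function names, the key sanitizing, and the "es" prefix are assumptions made for the example. The only parts taken from Graphite itself are the pickle message layout (a list of (path, (timestamp, value)) tuples preceded by a 4-byte big-endian length header) and the default pickle receiver port, 2004.

```python
import json
import pickle
import socket
import struct
import time
import urllib.request


def fetch_node_stats(base_url):
    """Fetch the node stats document from an Elasticsearch cluster."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats") as response:
        return json.load(response)


def flatten_numeric(node, prefix):
    """Recursively yield (dotted_path, value) for every numeric leaf."""
    if isinstance(node, dict):
        for key, value in node.items():
            # Dots separate path segments in Graphite, so sanitize each key
            # (hypothetical rule chosen for this sketch).
            safe_key = str(key).replace(".", "_").replace(" ", "_")
            yield from flatten_numeric(value, f"{prefix}.{safe_key}")
    elif isinstance(node, bool):
        return  # bool is a subclass of int; do not report it as a metric
    elif isinstance(node, (int, float)):
        yield prefix, node
    # Strings, lists, and None are ignored.


def build_pickle_message(stats, prefix, timestamp):
    """Build one length-prefixed pickle message for Graphite's pickle receiver."""
    metrics = [
        (path, (timestamp, value))
        for path, value in flatten_numeric(stats, prefix)
    ]
    payload = pickle.dumps(metrics, protocol=2)
    header = struct.pack("!L", len(payload))
    return header + payload


def send_to_graphite(message, host, port=2004):
    """Send a prepared message to the Graphite pickle receiver."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(message)


# Example usage (hosts are placeholders):
#   stats = fetch_node_stats("http://localhost:9200")
#   message = build_pickle_message(stats, "es", int(time.time()))
#   send_to_graphite(message, "graphite.example.com")
```

Running the three commented lines in a loop with a sleep between iterations gives the "once at every set interval" behavior.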
The Docker image is available here on the Docker Hub.
To complement this Docker image, we are also collecting metrics from each node with this Docker image. Combined, these two metric collectors provide all of the data we need (and more!) to understand everything that is going on behind the scenes and within our clusters.
Of course — and as is often the case with monitoring — the real problem with metrics is not how to retrieve them but what to do with them once everything is in place.
Monitoring Tips and Tricks
I could go through all the metrics our Docker image produces, but that would be a waste of time. Instead, I would like to give some recommendations based on our experience.
First and foremost: Most people open monitoring dashboards only when things go south, but you must be able to understand how your dashboard behaves when everything is working as expected. Make sure that you know the readings of your metrics when everything is going well so that you have a comparison whenever you actually need it. Otherwise, having to answer questions such as “Is this always like that?” forces you to waste time on false positives instead of actually finding the issue.
Another thing that is probably obvious, but I want to mention it anyway: Search for anomalies. There are a lot of metrics in Elasticsearch, and you are probably not familiar with all of them — and not even with most of them. But that doesn’t mean that they don’t count.
Visibility is the name of the game here. If you suddenly see some irregularity in your metrics, try to search for the ones that first started to act weird. Elasticsearch malfunctions will often be reflected in many metrics — and you need to find out where the issue started. When you have a general sense of the origin, search online to find information on what to do next.
Something that you absolutely need to keep in mind is that many different companies use Elasticsearch for many different things.
Each Elasticsearch use case has a set of different best practices and different monitoring caveats. That is the main reason that I don’t want to elaborate on the different metrics — what is right for us can be totally wrong for you. The bottom line: Don’t jump ahead and do everything that the Internet says about configuring Elasticsearch or what specific metrics to follow. Develop your own method.
Finally, no monitoring post can be complete without at least some DevOps porn.
Here are some screenshots of our Grafana dashboard, which is based on the es2graphite and collectd2graphite containers mentioned above. Enjoy!
Why do you use a different technology stack to monitor ES? I.e., why Graphite/Grafana instead of ES/Kibana?
Well, we do use ELK for monitoring Elasticsearch as well, as I mentioned in the blog post.
But each technology has its own benefits for different use cases.
Specifically we chose Graphite as one of the stacks we are using since it better handles metrics retention for longer periods, it’s real-time, and it provides us with mathematical functions that Kibana does not offer.
A related question…
There are N monitoring services out there, from New Relic (https://newrelic.com ) to Sematext SPM (https://sematext.com/spm ) with Elasticsearch and other integrations.
Considering you are a SaaS company, have you considered using a monitoring SaaS, and if so, what sort of analysis led you to run monitoring in-house?
Well, that is a great question.
And I think it even deserves a dedicated blog post, but here is the quick answer:
Monitoring is a big and complex world.
For the most part, we are heavy users of our own service and base most of our day-to-day operations on ELK.
For alerting, we use Nagios in-house, since it’s the most reliable solution we could find and we can’t let any alert go to waste.
As for Graphite: it’s pretty easy to set up (future blog post coming!) and just works great out of the box.
So we didn’t see any reason to use a third party there.
On ELK, my honest and unbiased opinion is to use one of the SaaS companies out there since it is really hard to maintain a scalable solution in house.
If you have any more questions on our monitoring decisions, I’ll be happy to answer!
Thanks Roi. Can you please elaborate on what you mean by “Specifically we chose Graphite as one of the stacks we are using since it better handles metrics retention for longer periods”?
As you can see here – https://graphite.readthedocs.io/en/latest/config-carbon.html#storage-schemas-conf
You can configure how Graphite should handle metrics once a specific period has passed.
For example, you can keep one data point every 10 seconds for a day, then downsample to one per hour for a week, and then to one per week for a year.
This gives us the ability to save metrics for a really long period of time, and apply the same dashboard for analyzing differences over time in a snap.
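To make that example concrete: in storage-schemas.conf the policy would be written as the retention string 10s:1d,1h:7d,1w:1y. The short Python sketch below only does the arithmetic behind such a string, showing how many data points each archive holds. The helper names are made up for illustration, and a year is assumed to be 365 days.

```python
# Seconds per unit for the suffixes used in the retention string.
# Assumption for this sketch: a year is counted as 365 days.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 7 * 86400,
    "y": 365 * 86400,
}


def parse_duration(text):
    """Convert a value such as '10s' or '7d' into seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def describe_retentions(retentions):
    """Return (seconds_per_point, number_of_points) for each archive."""
    archives = []
    for part in retentions.split(","):
        precision, history = part.strip().split(":")
        step = parse_duration(precision)
        span = parse_duration(history)
        archives.append((step, span // step))
    return archives


RETENTIONS = "10s:1d,1h:7d,1w:1y"

for step, points in describe_retentions(RETENTIONS):
    print(f"one point every {step}s, {points} points kept")
```

The point counts stay small even for a full year, which is why long retention is cheap with this approach.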
Interesting. We’ve implemented something like this in ES by ourselves (for both metric and non-metric data). Good to know they have it out of the box. Thank you.
Is it possible to also share the Grafana dashboard that you use for “ES Metrics”?
That would be a bit difficult since we have some graphs and templates for our in-house metrics, but I’ll try to provide some version of this for you next week 🙂
I created a version of the dashboard that can be suited for generic use.
On GitHub, you can find this file: https://github.com/logzio/logzio-es2graphite/blob/master/Grafana%20Dashboard%20Example
Open it in a text editor, and use find-replace to replace all occurrences of GRAPHITE_PREFIX with the es2graphite prefix, and COLLECTD_GRAPHITE_PREFIX with the collectd2graphite prefix.
Then just load it into Grafana.
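If you would rather script the find-replace, here is a small Python sketch. One thing to watch for: GRAPHITE_PREFIX is a substring of COLLECTD_GRAPHITE_PREFIX, so the longer placeholder has to be replaced first. The function name, the file names, and the prefixes in the usage comment are placeholders for illustration only.

```python
def fill_prefixes(template, es_prefix, collectd_prefix):
    """Fill both placeholders in the dashboard text.

    COLLECTD_GRAPHITE_PREFIX is replaced first because it contains
    GRAPHITE_PREFIX; doing it the other way around would corrupt it.
    This assumes the collectd prefix does not itself contain the
    text GRAPHITE_PREFIX.
    """
    text = template.replace("COLLECTD_GRAPHITE_PREFIX", collectd_prefix)
    return text.replace("GRAPHITE_PREFIX", es_prefix)


# Example usage (file names and prefixes are placeholders):
#   with open("dashboard-example.json") as source:
#       filled = fill_prefixes(source.read(), "es2graphite.prod", "collectd.prod")
#   with open("dashboard.json", "w") as target:
#       target.write(filled)
```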
Let me know if this works for you.