The Top 5 Elasticsearch Mistakes & How to Avoid Them

By: Asaf Yigal

February 1, 2016

Elasticsearch is open-source software that indexes and stores information in a NoSQL database based on the Lucene search engine, and it also happens to be one of the most popular indexing engines today. Elasticsearch is also part of the ELK Stack.

The software is used by growing startups such as DataDog as well as established enterprises such as The Guardian, StackOverflow, and GitHub, to make their infrastructures, products, and services more scalable.

Despite the increasing popularity of Elasticsearch, there are several common and critical mistakes that users tend to make while using the software. Let’s take a closer look at five of the mistakes and how you can avoid making them.

1. Not Defining Elasticsearch Mappings

Say that you start Elasticsearch, create an index, and feed it JSON documents without defining a schema. Elasticsearch will then iterate over each field of the JSON document, guess its type, and create a corresponding mapping. While this may seem ideal, the mappings that Elasticsearch guesses are not always accurate. If, for example, the wrong field type is chosen, indexing errors will pop up.

To fix this issue, you should define mappings explicitly, especially in production environments. It’s a best practice to index a few documents, let Elasticsearch guess the field types, and then grab the mapping it creates with GET /index_name/doc_type/_mapping. You can then take matters into your own hands and make any changes that you see fit without leaving anything up to chance.

For example, if you index your first document like this:

{
  "action": "Some action",
  "payload": "2016-01-20"
}

Elasticsearch will mark the “payload” field as “date.”

Now, suppose that your next document looks like this:

{
  "action": "Some action 1",
  "payload": "USER_LOCKED"
}

Here, “payload” isn’t actually a date. An error message may pop up and the new document will not be indexed because Elasticsearch has already mapped the field as “date.”
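One way to avoid the conflict above is to create the index with an explicit mapping before any documents arrive. The sketch below only builds and prints the request body; it assumes the Elasticsearch 2.x-era mapping syntax that was current when this article was written (newer versions use different type names and no longer have document types). The index name "my_index" and the document type "event" are hypothetical placeholders.

```python
import json

# Hypothetical placeholders, not taken from the article.
INDEX_NAME = "my_index"
DOC_TYPE = "event"


def build_mapping():
    """Return an index-creation body that keeps "payload" a plain string."""
    return {
        "mappings": {
            DOC_TYPE: {
                "properties": {
                    "action": {"type": "string"},
                    # Assumption: 2.x syntax. "not_analyzed" stores the value
                    # as one exact term, so "2016-01-20" and "USER_LOCKED"
                    # are both accepted as strings.
                    "payload": {"type": "string", "index": "not_analyzed"},
                }
            }
        }
    }


# Print the request that would be sent when creating the index.
print("PUT /" + INDEX_NAME)
print(json.dumps(build_mapping(), indent=2))
```

With "payload" declared as a string up front, the first document no longer decides the field type for every document that follows.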

2. Combinatorial Explosions

Combinatorial explosions are computing problems that can cause an exponential growth in bucket generation for certain aggregations and can lead to uncontrolled memory usage. In some aggregations, there is not enough memory in the world to support their combinatorial explosions.

The Elasticsearch “terms” aggregation builds buckets according to your data, but it cannot predict in advance how many buckets will be created. This can be problematic for parent aggregations that contain child aggregations. Combining the unique values at each level may cause a vast increase in the number of buckets that are created.

Let’s look at an example.

Say that you have a data set that represents a sports team. If you want to look specifically at the top 10 players and the supporting players for each of them, the aggregation will look like this:

{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}

The aggregation will return a list of the top 10 players and, for each of them, a list of the top five supporting players, so a total of 50 values will be returned. Despite its small size, this query can consume a large amount of memory with minimal effort.

A terms aggregation can be visualized as a tree that uses buckets for every level. Therefore, a bucket for each player in the “players” aggregation will make up the first level and a bucket for every supporting player in the “other” aggregation will make up the second level. Consequently, a single team of n players will produce n² buckets. Imagine what would happen if you had a dataset of 500 million documents.
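The n² growth is easy to underestimate, so here is a back-of-the-envelope sketch of the arithmetic. It is a simplified model that assumes every player can be paired with every other player; the function names and the sample team sizes are made up for illustration.

```python
def buckets_before_trimming(unique_values, levels):
    """Buckets at the deepest level if every value pairs with every value."""
    return unique_values ** levels


def buckets_returned(sizes):
    """Buckets left once each level has been trimmed to its "size"."""
    total = 1
    for size in sizes:
        total *= size
    return total


# Two levels ("players" and "other") trimmed to sizes 10 and 5.
for n in (20, 1000, 100000):
    built = buckets_before_trimming(n, 2)
    kept = buckets_returned([10, 5])
    print(n, "players:", built, "buckets built,", kept, "returned")
```

The number of returned values stays at 50 no matter how large the team is, while the number of buckets built along the way grows with the square of the number of unique values.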

Collection modes help to control how child aggregations are computed. The default collection mode of an aggregation is called depth-first: it builds the entire tree first and then trims the edges. While depth-first is an appropriate collection mode for most aggregations, it would not work well in the “players” aggregation example above. Therefore, Elasticsearch allows you to change the collection mode of specific aggregations to something more appropriate.

Cases such as the example above should use the breadth-first collection mode, which builds and trims the tree one level at a time to control combinatorial explosions. This collection mode drastically reduces the amount of memory that is consumed and keeps nodes stable:

{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}
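To get a feel for why trimming one level at a time helps, the toy model below compares how many buckets each collection mode has to hold at its peak for the two-level example. It is not how Elasticsearch is implemented internally; it only mirrors the description above, and the function names are invented for this sketch.

```python
def depth_first_peak(unique_values):
    """The whole two-level tree exists before anything is trimmed."""
    return unique_values * unique_values


def breadth_first_peak(unique_values, top_size):
    """Level one is trimmed to top_size before level two is built.

    Level two is then built only under the surviving buckets, so at most
    top_size * unique_values buckets exist at once in this toy model.
    """
    kept = min(top_size, unique_values)
    return kept * unique_values


n = 1000  # hypothetical number of unique players
print("depth-first peak:  ", depth_first_peak(n))
print("breadth-first peak:", breadth_first_peak(n, 10))
```

In this model the depth-first peak grows quadratically with the number of unique values, while the breadth-first peak grows only linearly once the first level has been cut down to its top 10.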

3. Production Flags

By default, the first cluster that Elasticsearch starts is called elasticsearch. If you are unsure about how to change a configuration, it’s best to stick to the default configuration. However, it is a good practice to rename your production cluster to prevent unwanted nodes from joining your cluster.

Below is an example of how you might want to rename your cluster and nodes:

cluster.name: elasticsearch_production

node.name: elasticsearch_node_001

Recovery settings affect how nodes recover when clusters restart. Elasticsearch allows nodes that belong to the same cluster to join that cluster automatically whenever a recovery occurs. Some nodes within a cluster boot up quickly after a restart, but others may take a bit longer (because nodes receive the restart command at different times, for example).

This difference in startup times can cause inconsistencies within the data that is meant to be evenly distributed among the nodes in the cluster. In particular, when large amounts of data are involved, rebalancing nodes after a restart can take quite a while, from several hours to a few days, and take a lot out of your budget. To avoid this, tell Elasticsearch to hold off on recovery until enough nodes have joined the cluster:

gateway.recover_after_nodes: 10

Additionally, it is important to configure the number of nodes that are expected in the cluster as well as the amount of time to wait for them to boot up:

gateway.expected_nodes: 10
gateway.recover_after_time: 5m
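The three gateway settings work together, which can be hard to see from the configuration lines alone. The sketch below is a simplified model of the decision, not Elasticsearch code: recovery waits until recover_after_nodes have joined, then starts as soon as expected_nodes are present or recover_after_time has passed, whichever comes first. The function and parameter names are invented for illustration, and the defaults mirror the values shown above.

```python
def should_start_recovery(nodes_joined, seconds_waiting,
                          recover_after_nodes=10,
                          expected_nodes=10,
                          recover_after_seconds=5 * 60):
    """Toy model of the gateway recovery decision.

    seconds_waiting is the time elapsed since recover_after_nodes
    nodes were first present in the cluster.
    """
    if nodes_joined < recover_after_nodes:
        return False  # too few nodes: keep waiting
    if nodes_joined >= expected_nodes:
        return True   # everyone is here: recover immediately
    return seconds_waiting >= recover_after_seconds  # otherwise wait it out


# Hypothetical cluster that tolerates two missing nodes.
print(should_start_recovery(7, 600, recover_after_nodes=8))   # False
print(should_start_recovery(8, 60, recover_after_nodes=8))    # False
print(should_start_recovery(8, 300, recover_after_nodes=8))   # True
print(should_start_recovery(10, 0, recover_after_nodes=8))    # True
```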

With the right configurations in place, a recovery that would have taken hours or days to complete can be finished in a matter of seconds. Additionally, the minimum_master_nodes setting is very important for cluster stability. It helps prevent split brain, which is the existence of two master nodes in a single cluster and can result in data loss.

The recommended value for this setting is (N/2) + 1, with N/2 rounded down, where N is the number of master-eligible nodes. So, if you have 10 regular nodes that can hold data and become masters, the value would be six. If you have three dedicated master nodes and 1,000 data nodes, the value would be two (only counting the potential masters):

discovery.zen.minimum_master_nodes: 2
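The quorum formula above is simple enough to compute by hand, but a small helper makes the rounding explicit. This is only a sketch of the arithmetic; the function name is made up for illustration.

```python
def minimum_master_nodes(master_eligible):
    """Return (N / 2) + 1 using integer division, N = master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible // 2 + 1


# The two cases from the text: 10 mixed nodes, and 3 dedicated masters.
print(minimum_master_nodes(10))  # 6
print(minimum_master_nodes(3))   # 2
```

Only master-eligible nodes count toward N, so adding data-only nodes does not change the result.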

4. Capacity Provisioning

Provisioning can help to equip and optimize Elasticsearch for operational performance. It requires that Elasticsearch be designed in such a way that will keep nodes up, stop memory from growing out of control, and prevent unexpected actions from shutting down nodes.

“How much space do I need?” is a question that users often ask themselves. Unfortunately, there is no set formula, but certain steps can be taken to assist with the planning of resources.

First, simulate your actual use case. Boot up your nodes, fill them with real documents, and push them until the shard breaks. Booting up and testing nodes can be quite easy with Amazon Web Services’ Elasticsearch offering (but it needs additional features to become a fully functioning ELK Stack).

Still, be sure to keep in mind that the concept of “start big and scale down” can save you time and money when compared to the alternative of adding and configuring new nodes when your current amount is no longer enough. Once you define a shard’s capacity, you can easily apply it throughout your entire index. It is very important to understand resource utilization during the testing process because it allows you to reserve the proper amount of RAM for nodes, configure your JVM heap space, and optimize your overall testing process.
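Once testing has told you how much data a single shard can hold comfortably, applying that figure to a whole index is a one-line calculation. The sketch below assumes sizes measured in gigabytes; the function name and the sample numbers are hypothetical and are not recommendations.

```python
import math


def shards_needed(total_data_gb, shard_capacity_gb):
    """Round up so the last partial shard still gets a home."""
    if shard_capacity_gb <= 0:
        raise ValueError("shard capacity must be positive")
    return math.ceil(total_data_gb / shard_capacity_gb)


# Hypothetical figures: 500 GB of data, shards tested up to 30 GB each.
print(shards_needed(500, 30))
```

The shard capacity has to come from your own load tests, since it depends on document size, mappings, and query patterns.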

5. Oversized Template

Large templates are directly related to large mappings. In other words, if you create a large mapping for Elasticsearch, you will have issues with syncing it across your nodes, even if you apply it as an index template. The issues with big index templates are mainly practical (you might need to do a lot of manual work, with the developer as the single point of failure), but they can also relate to Elasticsearch itself. Remember: You will always need to update your template when you make changes to your data model.

Is there a better solution? Yes, dynamic templates.

Dynamic templates automatically add field mappings based on your predefined mappings for specific types and names. However, you should always try to keep your templates small in size.
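As an illustration of how small a template can stay, the sketch below builds an index template with a single dynamic template rule that maps every new string field as an exact-value string, instead of listing each field by name. It assumes the Elasticsearch 2.x-era template and mapping syntax that was current when this article was written; the template name, the "logs-*" pattern, and the rule name are hypothetical.

```python
import json

TEMPLATE_NAME = "logs_template"  # hypothetical name


def build_index_template():
    """Return a small index template body with one dynamic template rule."""
    return {
        "template": "logs-*",  # hypothetical index name pattern
        "mappings": {
            "_default_": {
                "dynamic_templates": [
                    {
                        "strings_as_exact_values": {
                            # Applies to any new field detected as a string.
                            "match_mapping_type": "string",
                            "mapping": {
                                "type": "string",
                                "index": "not_analyzed",
                            },
                        }
                    }
                ]
            }
        },
    }


# Print the request that would register the template.
print("PUT /_template/" + TEMPLATE_NAME)
print(json.dumps(build_index_template(), indent=2))
```

One rule like this covers every string field that shows up later, so the template does not have to grow each time the data model gains a field.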

In Conclusion

Elasticsearch is a distributed full-text search and analytics engine that enables multiple tenants to search through their entire data sets, regardless of size, at unprecedented speeds. In addition to its full-text search capabilities, Elasticsearch doubles as an analytics system and distributed database. While these three capabilities are impressive on their own, Elasticsearch combines all of them to form a real-time search and analytics application that can keep up with customer needs.