In Elasticsearch parlance, a document is serialized JSON data. In a typical ELK setup, when you ship a log or metric, it is sent to Logstash, which groks, mutates, and otherwise handles the data as defined by the Logstash configuration. The resulting JSON is indexed in Elasticsearch.

Elasticsearch documents live in segments of a shard, and each shard is itself a Lucene index. As additional documents are shipped, more segments are created. Whenever a search is executed, Elasticsearch checks each segment stored in a shard, which means that as segments grow in quantity, searches become increasingly inefficient. To combat this, Elasticsearch periodically merges similarly sized segments into a single, larger segment and deletes the original, smaller segments.

Segments are immutable, which has an important implication for documents. When a document is deleted, it is not immediately removed from Elasticsearch. Instead, it is flagged as deleted, making it inaccessible to users, but it remains in the segment. During a segment merge, documents flagged as deleted are not written to the new segment, so segment merges are the point at which deleted documents actually drop out of Elasticsearch. Segment immutability also means that document updates function the same way: when a document is “updated,” it is actually flagged as deleted and replaced with a new document containing the appropriate field change(s). Just like documents flagged for deletion outright, these old versions are removed only when Elasticsearch performs a segment merge.

Documents via API

Elasticsearch’s API allows you to create, get, update, delete, and index documents both individually and in bulk (depending on the endpoint). Although interacting with individual documents has remained virtually unchanged since Elasticsearch 2.x, the release of Elasticsearch 6.x added features to delete and update by query and improved the formerly very manual reindexing process. A few general examples are provided below for each of the endpoints, but if you’d like to see more examples and the full list of endpoints, please take a look at the Elasticsearch API documentation.

Individual Documents

Get
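A minimal sketch, in console-style request syntax, assuming a placeholder index named my_index with a _doc type and a document with ID 1:

# Retrieve document 1 from my_index
GET /my_index/_doc/1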

Delete
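Deleting uses the same path with the DELETE verb. As discussed above, this only flags the document as deleted until a segment merge removes it for good (placeholder names again):

# Flag document 1 in my_index as deleted
DELETE /my_index/_doc/1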

Index
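A sketch of indexing a document with an explicit ID; POSTing to /my_index/_doc without an ID would have Elasticsearch generate one instead. The field names here are made up for illustration:

# Index (or overwrite) document 1
PUT /my_index/_doc/1
{
  "event": "eclipse",
  "timestamp": "2018-07-27T20:22:00Z"
}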

Update
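A sketch of a partial update via the _update endpoint, which merges the supplied doc into the existing document (same placeholder names):

# Merge the supplied fields into document 1
POST /my_index/_doc/1/_update
{
  "doc": {
    "event": "lunar_eclipse"
  }
}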

If you are updating multiple documents in sequence and some exist while others do not, you will want to use the _update endpoint with doc_as_upsert set to true. This will create the document if it does not exist and update it if it does.
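Sketched with the same placeholder names: if document 2 does not exist, this call creates it from the doc body; if it does exist, the fields are merged in:

# Update document 2 if present, otherwise create it from the doc body
POST /my_index/_doc/2/_update
{
  "doc": {
    "event": "solar_eclipse"
  },
  "doc_as_upsert": true
}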

Multiple Documents

Multi-Get

_mget allows you to retrieve multiple documents based on index, type, or id. For example, to retrieve documents of a specific type:
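A sketch that fetches documents 1 and 2 of the _doc type from a placeholder index; alternatively, the body accepts a docs array in which each entry carries its own _index, _type, and _id:

# Fetch two documents from my_index in one round trip
GET /my_index/_doc/_mget
{
  "ids": ["1", "2"]
}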

Bulk

_bulk allows you to post several create, update, delete, etc. requests in a single call. To perform these operations you still need to include the complete JSON for each request, e.g.
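A sketch with placeholder names. The body is newline-delimited JSON: each action line is followed by its payload line, except delete, which takes no payload:

POST /_bulk
{ "index" : { "_index" : "my_index", "_type" : "_doc", "_id" : "1" } }
{ "event" : "lunar_eclipse" }
{ "update" : { "_index" : "my_index", "_type" : "_doc", "_id" : "2" } }
{ "doc" : { "event" : "solar_eclipse" } }
{ "delete" : { "_index" : "my_index", "_type" : "_doc", "_id" : "3" } }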

Important: Note that you cannot pretty-print the request body with _bulk, as \n is the delimiter.

Update or Delete by Query

_update_by_query does exactly what you’d expect: it allows you to change the data in documents that match a given query. Since you’ll be using one query at a time, you can use pretty print here. (Side note: you will be able to use pretty print with both _delete_by_query and _reindex as well.) There are lots of options to use with this endpoint, so as a more precise example, let’s say that you’ve been tracking lunar eclipse data and now want to add solar. Since up until now you were only tracking one type of eclipse, perhaps you tagged your lunar eclipse data simply as “eclipse”, so now you’re going to update “eclipse” to “lunar_eclipse” (and incoming data will be tagged “solar_eclipse” as appropriate). What might this look like?
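Something like the following sketch, assuming the data lives in a placeholder index named eclipses and the tag lives in a field named event:

POST /eclipses/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.event = 'lunar_eclipse'"
  },
  "query": {
    "term": {
      "event": "eclipse"
    }
  }
}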

To break that down, the Painless script changes the value of an existing field wherever it matches a certain value. Scripts can also be used to modify fields or to do more complex operations, for example adding a field that doesn’t exist with a default value and then updating existing values based on a series of criteria, as sketched below.
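A rough sketch of that pattern, again with made-up field names: the script backfills a missing field with a default and then branches on existing values:

POST /eclipses/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.visibility == null) { ctx._source.visibility = 'unknown' } else if (ctx._source.visibility == 'partial') { ctx._source.visibility = 'partial_visibility' }"
  }
}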

One final tidbit: when you update (or delete) by query, Elasticsearch takes an initial snapshot of the state the index was in before any modifications and works from that. If the index changes after that snapshot, a common example being additional data written to the index before the operation concludes, you’ll encounter a version conflict. It’s important to be aware of which conflicts you’ll encounter when running an update (or delete) so you know whether they are conflicts you need to manually resolve or not. In the latter case, you can set “conflicts” to “proceed”. This will count the conflicts, but will neither update (or delete) the conflicting documents nor stop the update (delete) process:
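Building on the earlier placeholder names:

POST /eclipses/_update_by_query
{
  "conflicts": "proceed",
  "script": {
    "lang": "painless",
    "source": "ctx._source.event = 'lunar_eclipse'"
  },
  "query": {
    "term": {
      "event": "eclipse"
    }
  }
}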

The syntax for delete by query is very similar to that of update by query. Continuing with the above example, if you wanted to delete all of your eclipse data (don’t do it!) you would do something like this:
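Assuming everything in the placeholder eclipses index is eclipse data, a match_all query wipes all of it:

POST /eclipses/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}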

Reindex

If you ever need to change mappings (discussed below), your shard count, shard size, etc., you will need to reindex your cluster. With the reindex API, this is actually rather straightforward:
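A sketch with placeholder index names; you will generally want to create new_index with the desired settings and mappings before running this:

POST /_reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}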

The same code block will be included below when I talk about reindexing mappings. Why put it in twice? Because it’s that important. Grok it!

Creating Structure with Mappings

In order to structure documents for searches, Elasticsearch relies on mappings. Mappings can be user-defined and, depending on the use case, can vary from simple to extremely complex. For a dive into how to create mappings, please check out our earlier Elasticsearch Mapping blog post. Important caveat: in 2018, Elasticsearch started implementing changes with the goal of removing mapping types. For more information, please see our blog post about the Removal of Mapping Types from earlier this year.

🎱 Reply Hazy: When Mappings Aren’t Clear

The most common issue that Elasticsearch users come across after mapping their documents is the mapping conflict. Mapping conflicts happen when a mapped field has different types within the same index. How does this happen? As it turns out, mapping conflicts usually arise for one of two reasons:

#1: Same name, different type

When defining a mapping, it is important to grok that while you as a user may logically separate the fields A.response and B.response, Elasticsearch does not. So if A.response is defined as an integer, e.g. an HTTP response code, and B.response is defined as a string, e.g. response message text, then the response field will have a mapping conflict.
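To make that concrete with made-up names: if the first document below causes dynamic mapping to type response as a long, the second document’s string value collides with that mapping and is rejected:

# response is dynamically mapped as a long
PUT /logs/_doc/1
{ "response": 200 }

# a string response now conflicts with the existing long mapping
PUT /logs/_doc/2
{ "response": "request timed out" }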

#2: Updated field definition, same index

One of the difficulties of mapping is that it requires you as an Elasticsearch admin or architect to be a little prescient: you need to know all of your field definitions before you send over your data. This is a tall order, especially since changing requirements frequently result in changes to the data being shipped to Elasticsearch, and thus require you to update your mappings. So what happens if you need to update a field you previously defined as an integer to a string? You guessed it: mapping conflict.

So how do you resolve these mapping conflicts? Reindex. In the latter scenario you should expect to reindex your data whenever you need to update an existing field definition. Why? To quote Elasticsearch:

“In order to make your data searchable, your database needs to know what type of data each field contains and how it should be indexed. If you switch a field type from e.g. a string to a date, all of the data for that field that you already have indexed becomes useless. One way or another, you need to reindex that field.” (Source)

Reindexing was initially a pretty manual process, as outlined in the referenced Elastic blog post, but with the release of version 2.3, Elastic added the _reindex API endpoint, which drastically simplifies the process. If you’re running Elasticsearch 2.3 or later, instead of the described manual process all you need to do is pass the original (source) and new (destination) indices to the _reindex endpoint. Note that to reindex you do need to create a new index with a new name; you cannot reindex your documents into a new index with the same name as the original.
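And here, as promised, is the same sketch again (placeholder index names as before):

POST /_reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}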

Exceptions, exceptions, exceptions

There is one more common type of mapping error: the mapper parsing exception. Straightforwardly, this means that Elasticsearch cannot parse the JSON as it arrived against the mapping as you have defined it. Two common causes are that an invalid JSON request is being sent, or that Logstash has been configured such that the resulting JSON does not match what the mapping definition expects. In either case, the exception text provides a guide to the cause of the error. For example, one user on Stack Overflow ran into exactly this while indexing a document.
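The exact post isn’t reproduced here, but an error response of this kind generally takes roughly the following shape; the caused_by detail varies with the actual problem, and everything below is illustrative:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse",
    "caused_by": {
      "type": "json_parse_exception",
      "reason": "Unexpected character: was expecting a comma to separate Object entries"
    }
  },
  "status": 400
}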

In this user’s case, as the responder pointed out, they were sending over invalid JSON.

A Quick Note on Document Security

Here at Logz.io we take security very seriously. We protect our users’ data by keeping up to date with the latest requirements of a variety of security standards. If you are hosting your own Elasticsearch cluster, you will need to ensure that your data is kept secure in compliance with the standards put forth by the relevant regulatory bodies. To get started, you can use X-Pack to configure document- and field-level access rules as applicable; a sketch follows below. In future blog posts, we will take a deeper look at a series of security topics to help you secure your Elasticsearch cluster.
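As a hedged sketch only (the role name, index, fields, and query below are all made up, and the _xpack-prefixed path assumes a 6.x cluster with X-Pack security enabled), a role granting read access to a subset of fields and documents might look like:

# Grant read access to two fields, restricted to documents matching the query
POST /_xpack/security/role/events_reader
{
  "indices": [
    {
      "names": [ "my_index" ],
      "privileges": [ "read" ],
      "field_security": {
        "grant": [ "event", "timestamp" ]
      },
      "query": {
        "term": { "event": "lunar_eclipse" }
      }
    }
  ]
}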
