Elasticsearch Tutorial

Elasticsearch is often described as a search server. That might be confusing because we usually think of search as something that we do, not something that needs to be served. However, the reality is that search can be quite complex, and search servers have been developed in response to that fact.

Described in more familiar terms, Elasticsearch is a NoSQL database. That means it stores data in a schema-free, document-oriented way and that you cannot use SQL to query it. Unlike most NoSQL databases, though, Elasticsearch has a strong focus on search capabilities and features — so much so, in fact, that the easiest way to get data from Elasticsearch is to search for it using the REST API.

How to Install Elasticsearch

The requirements for Elasticsearch are simple: Java 7 or later. Take a look at my Logstash tutorial to ensure that you are set. Also, make sure that your operating system is on the Elastic support matrix; otherwise, you might run up against strange and unpredictable issues. Once that is done, you can start installing Elasticsearch.
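
Before downloading anything, it is worth double-checking that a suitable Java runtime is actually on your path. A quick version check should report a version of 1.7 or higher (the exact output varies by vendor and build):

> java -version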

Elasticsearch can be downloaded as a standalone distribution or installed using the apt and yum repositories. To keep things simple, let’s just download the distribution because it works for all operating systems. Be sure, though, to rethink this before you go into production:

wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/zip/elasticsearch/2.1.1/elasticsearch-2.1.1.zip
unzip elasticsearch-2.1.1.zip
cd elasticsearch-2.1.1

On Linux and other Unix-based systems, you can now run bin/elasticsearch — or on Windows, bin/elasticsearch.bat — to get it up and running. And that’s it! To confirm that everything is working fine, point curl or your browser to http://127.0.0.1:9200, and you should see something like the following output:

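> curl http://127.0.0.1:9200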
{
  "name" : "Bloodhawk",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.1.1",
    "build_hash" : "40e2c53a6b6c2972b3d13846e450e66f4375bd71",
    "build_timestamp" : "2015-12-15T13:05:55Z",
    "build_snapshot" : false,
    "lucene_version" : "5.3.1"
  },
  "tagline" : "You Know, for Search"
}
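
Another quick sanity check, should you want one, is the cluster health endpoint (the ?pretty parameter simply formats the JSON response for human reading):

> curl 'http://127.0.0.1:9200/_cluster/health?pretty'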

Creating an Index in Elasticsearch

Adding data to Elasticsearch is called “indexing.” This is because when you feed data into Elasticsearch, the data is placed into Apache Lucene indexes. This makes sense because Elasticsearch uses the Lucene indexes to store and retrieve its data. Although you do not need to know a lot about Lucene, it does help to know how it works when you start getting serious with Elasticsearch.

Elasticsearch behaves like a REST API, so you can use either the POST or the PUT method to add data to it. You use PUT when you know or want to specify the ID of the data item, and POST if you want Elasticsearch to generate an ID for the data item:

> curl -X POST http://127.0.0.1:9200/logs/my_app -d '{"timestamp": "2015-01-18 12:34:56", "message": "User logged in", "user_id": 4, "admin": false}'
{
  "_id": "AVJWJkaW0D5QbnIxzP5S",
  "_index": "logs",
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "my_app",
  "_version": 1,
  "created": true
}

> curl -X PUT http://127.0.0.1:9200/app/users/4 -d '{"id": 4, "username": "john", "last_login": "2015-01-18 12:34:56"}'
{
  "_id": "4",
  "_index": "app",
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "users",
  "_version": 1,
  "created": true
}

The data for the document is sent as a JSON object. You might be wondering how we can index data without defining the structure of the data. Well, with Elasticsearch, like with most other NoSQL databases, there is no need to define the structure of the data beforehand. To ensure optimal performance, though, you can define mappings for data types. More on this later.
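
As a small taste of what that looks like, here is a minimal sketch of a mapping, assuming you were creating the app index from scratch (on an existing index you would use the _mapping endpoint instead). The field names mirror the user document above; in 2.x, "index": "not_analyzed" keeps a string field from being tokenized:

> curl -X PUT http://127.0.0.1:9200/app -d '{
  "mappings": {
    "users": {
      "properties": {
        "username": { "type": "string", "index": "not_analyzed" },
        "last_login": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
      }
    }
  }
}'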

If you are not comfortable with curl, look into the unofficial Sense Chrome plugin or the Sense Kibana app. You can also use log shippers like Beats and data pipelines like Logstash to automate the data-ingestion process.

Elasticsearch Query: Getting Information Out

Once you have your data indexed into Elasticsearch, you can start searching and analyzing it. The simplest query you can do is to fetch a single item. Once again, because Elasticsearch is a REST API, we use GET:

> curl -X GET http://127.0.0.1:9200/app/users/4
{
  "_id": "4",
  "_index": "app",
  "_source": {
    "id": 4,
    "last_login": "2015-01-18 12:34:56",
    "username": "john"
  },
  "_type": "users",
  "_version": 1,
  "found": true
}

The fields starting with an underscore are all meta fields of the result. The _source object is the original document that was indexed.
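
If you only want the original document back, without the meta fields, you can append _source to the same URL; this endpoint exists for exactly that purpose:

> curl -X GET http://127.0.0.1:9200/app/users/4/_source
{
  "id": 4,
  "last_login": "2015-01-18 12:34:56",
  "username": "john"
}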

We also use GET to do searches by calling the _search endpoint:

> curl -X GET 'http://127.0.0.1:9200/_search?q=logged'
{
  "_shards": {
    "failed": 0,
    "successful": 10,
    "total": 10
  },
  "hits": {
    "hits": [
      {
        "_id": "AVJWJkaW0D5QbnIxzP5S",
        "_index": "logs",
        "_score": 0.095891505,
        "_source": {
          "admin": false,
          "message": "User logged in",
          "timestamp": "2015-01-18 12:34:56",
          "user_id": 4
        },
        "_type": "my_app"
      }
    ],
    "max_score": 0.095891505,
    "total": 1
  },
  "timed_out": false,
  "took": 62
}

The result contains a number of extra fields that describe both the search and the result. Here’s a quick rundown:

  • took: The time in milliseconds the search took
  • timed_out: Whether or not the search timed out
  • shards: The number of Lucene shards that were searched, along with counts of how many succeeded and failed
  • hits: The actual results, along with meta information for the results

The search we did above is known as a URI Search, and it is the simplest way to query Elasticsearch. When you provide only a word, Elasticsearch searches all of the fields of all the documents for that word. You can build more specific searches by using Lucene queries:

  • username:johnb – Looks for documents where the username field is equal to “johnb”
  • john* – Looks for documents that contain terms that start with john and are followed by zero or more characters, such as “john,” “johnb,” and “johnson”
  • john? – Looks for documents that contain terms that start with john followed by only one character. Matches “johnb” and “johns” but not “john.”

There are many other ways to search including the use of boolean logic, the boosting of terms, the use of fuzzy and proximity searches, and the use of regular expressions.
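
For a flavor of what those look like, here are a few illustrative examples in the same Lucene query syntax (remember that spaces and quotes must be URL-encoded when used in the q= parameter):

  • username:john AND admin:false – Boolean logic across two fields
  • john~1 – A fuzzy search that matches terms within one edit of “john,” such as “joan” or “johns”
  • message:"logged in"~2 – A proximity search: the two terms must occur within two positions of each other
  • username:john^2 message:login – Boosts matches on the username field to count twice as much toward the score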

What is even more awesome is that URI searches are just the beginning. Elasticsearch also provides a request body search with a Query DSL for more advanced searches. There is a wide array of options available in these kinds of searches, and you can mix and match different options to get the results that you require. Some of the options include geo queries, “more like this” queries, and scripted queries.

The DSL also makes a distinction between a filtering context and a query context for query clauses. Clauses used as filters test documents in a boolean fashion: Does the document match the filter, “yes” or “no”? Filters are also generally faster than queries, but queries can also calculate a score based on how closely a document matches the query. This score is used to determine the ordering and inclusion of documents:

> curl -X GET http://127.0.0.1:9200/logs/_search -d '{
  "query": {
    "match_phrase": {
      "message": "User logged in"
    }
  }
}'
{
  "_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "AVJWJkaW0D5QbnIxzP5S",
        "_index": "logs",
        "_score": 0.46027923,
        "_source": {
          "admin": false,
          "message": "User logged in",
          "timestamp": "2015-01-18 12:34:56",
          "user_id": 4
        },
        "_type": "my_app"
      }
    ],
    "max_score": 0.46027923,
    "total": 1
  },
  "timed_out": false,
  "took": 29
}
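
To see the filter and query contexts working together, here is a minimal sketch (reusing the log document indexed earlier) in which the match clause scores documents while the term clause only filters them; the bool query accepts a filter clause as of Elasticsearch 2.0:

> curl -X GET http://127.0.0.1:9200/logs/_search -d '{
  "query": {
    "bool": {
      "must": { "match": { "message": "logged in" } },
      "filter": { "term": { "admin": false } }
    }
  }
}'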

Removing Elasticsearch Data

Deleting items from Elasticsearch is just as easy as entering data into Elasticsearch. The HTTP method to use this time is — surprise, surprise! — DELETE:

> curl -X DELETE http://127.0.0.1:9200/app/users/4
{
  "_id": "4",
  "_index": "app",
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "users",
  "_version": 2,
  "found": true
}
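
If you now repeat the GET request from earlier, the “found”: false in the response confirms that the document is gone:

> curl -X GET http://127.0.0.1:9200/app/users/4
{
  "_id": "4",
  "_index": "app",
  "_type": "users",
  "found": false
}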

As with retrieving data, you don’t need to know the ID of the item that you’re deleting; you can also delete items that match a query. One caveat: delete-by-query was removed from Elasticsearch’s core in version 2.0 and moved into an optional plugin, so the request below only works as intended once that plugin is installed:

> curl -X DELETE 'http://127.0.0.1:9200/app/users/_query?q=username:john'
{
  "_id": "_query",
  "_index": "app",
  "_shards": {
    "failed": 0,
    "successful": 1,
    "total": 2
  },
  "_type": "users",
  "_version": 1,
  "found": false
}

Note that found is false here and that the _id is literally “_query”: without the plugin installed, Elasticsearch falls back to interpreting _query as a plain document ID and reports that no such document exists. With the plugin, the response instead summarizes how many documents matched the query and were deleted.
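
Installing the plugin is a one-liner from the Elasticsearch directory (restart the node afterward for it to take effect):

> bin/plugin install delete-by-query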

What’s Next?

We have touched on just the basics of CRUD operations in Elasticsearch. Elasticsearch is a search server, so it is not surprising that there is an immense depth to its search features. Since the release of Elasticsearch 2.0, there has also been a wealth of analytical tools available. Be sure to explore the Elasticsearch documentation as well as my accompanying Logstash tutorial and the company’s complete guide to the ELK Stack.

Logz.io is a predictive, cloud-based log management platform that is built on top of the open-source ELK Stack and can be used for log analysis, application monitoring, business intelligence, and more. Start your free trial today!

Jurgens tries to write good code for a living. He even succeeds at it sometimes. When he isn’t writing code, he’s wrangling data as a hobby. Sometimes the data wins, but we don’t talk about that. Ruby and Elasticsearch are his weapons of choice, but his ADD always allows for new interests. He’s also the community maintainer for a number of Logstash inputs.