Elasticsearch Queries: A Guide to Query DSL

Guide to Elasticsearch Queries

Getting an Elasticsearch query right, down to its syntax, can be tough and confounding, even though search is the primary function of Elastic…umm…search. To help, this guide will take you through the ins and outs of common search queries for Elasticsearch and set you up for future querying success.

Lucene Query Syntax

Elasticsearch is part of the ELK Stack and is built on Lucene, the search library from Apache, and exposes Lucene’s query syntax. It’s such an integral part of Elasticsearch that when you query the root of an Elasticsearch cluster, it will tell you the Lucene version:

{"name":"node-1","cluster_name":"my-cluster","cluster_uuid":"8AqSmmKdQgmRVPsVxyxKrw","version":{"number":"6.1.2","build_hash":"5b1fea5","build_date":"2018-01-10T02:35:59.208Z","build_snapshot":false,"lucene_version":"7.1.0","minimum_wire_compatibility_version":"5.6.0","minimum_index_compatibility_version":"5.0.0"},"tagline":"You Know, for Search"}

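Knowing the Lucene syntax and operators will go a long way in helping you build queries. The syntax is used in both the simple and the standard query string queries. Its basics (fields, ranges, wildcards, regular expressions, fuzzy matching, and boolean operators) are covered in the sections that follow.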

The Query DSL

The Query DSL can be invoked using most of Elasticsearch’s search APIs. For simplicity, we’ll look only at the Search API, which uses the _search endpoint. When calling the Search API, you can specify the index and/or type on which you want to search (mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0, so the type-based examples below apply to older versions only). You can even search on multiple indices and types by separating their names with commas or by using wildcards:

Search on all the Logstash indices:

curl localhost:9200/logstash-*/_search

Or search in the current and legacy indices, in the documents type:

curl localhost:9200/current,legacy/documents/_search

Search in the clients index, in the bigcorp and smallco types:

curl localhost:9200/clients/bigcorp,smallco/_search

We’ll be using Request Body Searches, so searches should be invoked as follows:

curl localhost:9200/_search -H 'Content-Type: application/json' -d '{"query":{"match": {"_all":"meaning"}}}'

URI Search

The easiest way to search your Elasticsearch cluster is through URI search. You can pass a simple query to Elasticsearch using the q query parameter. The following query will search your whole cluster for documents with a name field equal to “travis”:

curl "localhost:9200/_search?q=name:travis"

With the Lucene syntax, you can build quite impressive searches. Usually you’ll have to URL-encode characters such as spaces (we omitted it in these examples for clarity):

curl "localhost:9200/_search?q=name:john~1 AND (age:[30 TO 40} OR surname:K*) AND -city"
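The URL-encoding step that the example above leaves out can be sketched in Python with the standard library. This is only an illustration: the helper name `uri_search_url` and the host value are assumptions made for the example, not part of Elasticsearch.

```python
from urllib.parse import quote


def uri_search_url(host: str, lucene_query: str) -> str:
    """Build a URI search URL with the Lucene query percent-encoded."""
    # safe="" makes quote() encode every reserved character, including "/".
    return f"http://{host}/_search?q={quote(lucene_query, safe='')}"


# The same query as in the curl example, with spaces and brackets left raw.
query = "name:john~1 AND (age:[30 TO 40} OR surname:K*) AND -city"
url = uri_search_url("localhost:9200", query)
print(url)
```

The printed URL contains no raw spaces or brackets, so it can be passed to curl or a browser as-is.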

A number of options are available that allow you to customize the URI search, specifically in terms of which analyzer to use (analyzer), whether the query should be fault-tolerant (lenient), and whether an explanation of the scoring should be provided (explain).

Although the URI search is a simple and efficient way to query your cluster, you’ll quickly find that it doesn’t support all of the features ES offers. The full power of Elasticsearch is evident through Request Body Search. Using Request Body Search allows you to build a complex search request using various elements and query clauses that will match, filter, and order as well as manipulate documents depending on multiple criteria.

The Request Body Search

Request Body Search uses a JSON document that contains various elements to create a search on your Elasticsearch cluster. Not only can you specify search criteria, you can also specify the range and number of documents that you expect back, the fields that you want, and various other options.

The first element of a search is the query element that uses Query DSL. Using Query DSL can sometimes be confusing because the DSL can be used to combine and build up query clauses into a query that can be nested deeply. Since most of the Elasticsearch documentation only refers to clauses in isolation, it’s easy to lose sight of where clauses should be placed.

To use the Query DSL, you need to include a “query” element in your search body and populate it with a query built using the DSL:

{"query": { "match": { "_all": "meaning" } } }

In this case, the “query” element contains a “match” query clause that looks for the term “meaning” in all of the fields in all of the documents in your cluster.
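For readers who script their searches rather than use curl, here is a minimal Python sketch of the same request body. It only prepares the request and does not send it; the URL assumes a local, unsecured node like the one in the curl examples, and `build_search_request` is a hypothetical helper name.

```python
import json
from urllib.request import Request

# Assumption: a local, unsecured node on localhost:9200, as in the curl examples.
SEARCH_URL = "http://localhost:9200/_search"


def build_search_request(query: dict) -> Request:
    """Wrap a Query DSL clause in a search body and prepare (not send) the request."""
    body = {"query": query}
    return Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request({"match": {"_all": "meaning"}})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would execute the search, which needs a running cluster.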

The query element is used along with other elements in the search body:

{
  "query": {
    "match": { "_all": "meaning" }
  },
  "fields": ["name", "surname", "age"],
  "from": 100, "size": 20
}

Here, we’re using the “fields” element to restrict which fields should be returned and the “from” and “size” elements to tell Elasticsearch we’re looking for hits 100 to 119, counting from zero (skip the first 100 hits, then return the next 20).
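The arithmetic behind “from” and “size” is easy to get wrong by one page, so here is a small Python sketch. The helper `page_body` and its zero-based page numbering are assumptions made for illustration.

```python
import json


def page_body(query: dict, page: int, page_size: int) -> dict:
    """Return a search body for a zero-based page number."""
    # "from" is the number of hits to skip; "size" is how many to return.
    return {"query": query, "from": page * page_size, "size": page_size}


# Page 5 (the sixth page) of 20 hits each skips the first 100 hits.
body = page_body({"match": {"_all": "meaning"}}, page=5, page_size=20)
print(json.dumps(body))
```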

Fields

You might be looking for events where a specific field contains certain terms. You specify the field, type a colon, then a space, then the string in quotation marks or the value without quotes. Here are some Lucene field examples:

  • name: "Ned Stark"
  • status: 404

Be careful with values with spaces such as “Ned Stark.” You’ll need to enclose it in double quotes to ensure that the whole value is used.

Filters vs. Queries

People who used Elasticsearch before version 2 will be familiar with filters and queries as separate things. You used to build up a query body using both. The difference between the two was that filters were generally faster because they check only whether a document matches at all, not how well it matches. In other words, filters give a boolean answer, whereas queries return a calculated score of how well a document matches. Since version 2, the two have been merged: the same query clauses are used in either a query context, where a score is calculated, or a filter context, where it is not.

Scoring

We have mentioned the fact that Elasticsearch returns a score along with all of the matching documents from a query:

curl "localhost:9200/_search?q=application"
{
  "_shards":{
    "total" : 5,
    "successful" : 5,
    "failed" : 0
    },
  "hits":{
    "total" : 1,
    "max_score": 2.3,
    "hits" : [
      {
      "_index" : "logstash-2016.04.04",
      "_type" : "logs",
      "_id" : "1",
      "_score": 2.3,
      "_source" : {
        "message" : "Log message from my application"
        }
      }
    ]
  }
}

This score is calculated against the documents in Elasticsearch based on the provided queries. Factors such as the length of a field, how often the specified term appears in the field, and (in the case of wildcard and fuzzy searches) how closely the term matches the specified value all influence the score. The calculated score is then used to order documents, usually from the highest score to lowest, and the highest scoring documents are then returned to the client. There are various ways to influence the scores of different queries such as the boost parameter. This is especially useful if you want certain queries in a complex query to carry more weight than others and you are looking for the most significant documents.

When using a query in a filter context (as explained earlier), no score is calculated. This provides the enhanced performance usually associated with using filters but does not provide the ordering and significance features that come with scoring.

Term Level Queries

1. Range Queries

You can search for fields within a specific range, using square brackets for inclusive range searches and curly braces for exclusive range searches:

  • age:[3 TO 10] — Will return events with age between 3 and 10
  • price:{100 TO 400} — Will return events with prices between 101 and 399
  • name: [Adam TO Ziggy] — Will return names between and including Adam and Ziggy

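As you can see in the examples above, you can use ranges in non-numerical fields like strings and dates as well. Note that curly braces exclude the endpoints themselves, so the price example matches every value greater than 100 and less than 400; that is 101 through 399 only when prices are whole numbers.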
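The bracket rules can be restated as a short Python sketch. This is not how Lucene evaluates ranges internally; `in_range` is a hypothetical helper that only mirrors the inclusive and exclusive semantics described above.

```python
def in_range(value, lower, upper, include_lower=True, include_upper=True):
    """Mimic Lucene range brackets: [ ] include an endpoint, { } exclude it."""
    above = value >= lower if include_lower else value > lower
    below = value <= upper if include_upper else value < upper
    return above and below


print(in_range(10, 3, 10))                       # age:[3 TO 10]
print(in_range(100, 100, 400, False, False))     # price:{100 TO 400}
print(in_range("Adam", "Adam", "Ziggy"))         # name:[Adam TO Ziggy]
print(in_range(40, 30, 40, include_upper=False)) # age:[30 TO 40}
```

The last call shows the mixed form used in the URI search example, where the lower bound is included and the upper bound is not.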

2. Wildcard Queries

The search would not be a search without wildcards. You can use the * character for multiple character wildcards or the ? character for single character wildcards:

  • Ma?s — Will match Mars, Mass, and Maps
  • Ma*s — Will match Mars, Matches, and Massachusetts
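To get a feel for the two wildcard characters without a cluster, Python’s `fnmatch` module happens to use the same meaning for `?` and `*`. This is only a local approximation of the matching rule, not the Lucene implementation, and the word list is made up for the example.

```python
from fnmatch import fnmatchcase

words = ["Mars", "Mass", "Maps", "Matches", "Massachusetts", "Mas"]

# "?" stands for exactly one character.
single = [w for w in words if fnmatchcase(w, "Ma?s")]
# "*" stands for zero or more characters.
multi = [w for w in words if fnmatchcase(w, "Ma*s")]

print(single)
print(multi)
```

Note that “Mas” matches Ma*s but not Ma?s, because the asterisk can also match nothing at all.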

3. Regex Queries (regexp)

Regex queries (regexp) give you even more power. Just place your regex between forward slashes (/):

  • /p[ea]n/ — Will match both pen and pan
  • /<.+>/ — Will match text that resembles an HTML tag
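One detail worth knowing is that Lucene regular expressions are anchored: the pattern has to match the whole term, not just part of it. The sketch below imitates that with Python’s `re.fullmatch`. Python’s regex dialect is not the same as Lucene’s, so treat this as an approximation that holds for simple patterns like the first example; `term_matches` is a hypothetical helper.

```python
import re


def term_matches(pattern: str, term: str) -> bool:
    """Return True only if the pattern matches the entire term."""
    return re.fullmatch(pattern, term) is not None


for term in ["pen", "pan", "pin", "open"]:
    print(term, term_matches("p[ea]n", term))
```

Here “open” is rejected even though it contains “pen,” which is the behavior the anchoring rule implies.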

4. Fuzzy Queries

Fuzzy searching uses the Damerau-Levenshtein Distance to match terms that are similar in spelling. This is great when your data set has misspelled words.

Use the tilde (~) to find similar terms:

  • blow~

This will return results like “blew,” “brow,” and “glow.”

Use the tilde (~) along with a number to specify how big the distance between words can be:

  • john~2

This will match, among other things, “jean,” “johns,” “jhon,” and “horn.”
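To see why those words fall within a distance of 2, here is a minimal Python sketch of the optimal string alignment variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and swaps of adjacent characters. It is a textbook version written for illustration, not the code Lucene runs.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


for word in ["jean", "johns", "jhon", "horn"]:
    print(word, osa_distance("john", word))
```

The swap rule is what makes “jhon” a distance of 1 from “john” rather than 2.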

5. Free Text

It’s as simple as it sounds. Just type in the term or value you want to find. This can be a field, a string within a field, etc.

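6. Elasticsearch Term Query

The term query returns documents that contain an exact match for a given term. A term query takes a single field, so to match on two fields you combine two term clauses in a bool query (covered under Compound Queries below). Take this example from a database of baseball statistics:

POST /mlb_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "pitcher_last": { "value": "rivera", "boost": 1.0 }}},
        { "term": { "pitcher_first": { "value": "mariano" }}}
      ]
    }
  },
  "_source": ["date", "innings_pitched", "pitch_count", "cutters", "fastballs"]
}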

Make sure you are using the term query here, NOT the match query (called the text query in early versions). The term query searches for the exact term; the match query analyzes the search text first, which typically lowercases it and strips punctuation.

7. Elasticsearch Terms Set Query

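Similar to the term query, the terms_set query can hunt down multiple values: it returns documents that contain at least a minimum number of the terms you supply, where the minimum is read from a field in each document or computed by a script. The fields it searches are defined in the PUT request that creates the index. To further the baseball example: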

PUT /pitchers
{
  "mappings": {
    "properties": {
      "pitcher_last": {
        "type": "keyword"
      },
      "pitcher_first": {
        "type": "keyword"
      },
      "pitch_type": {
        "type": "keyword"
      }
    }
  }
}

Compound Queries

Boolean Operators and the Bool Query

As with most computer languages, Elasticsearch supports the AND, OR, and NOT operators:

  • jack AND jill — Will return events that contain both jack and jill
  • ahab NOT moby — Will return events that contain ahab but not moby
  • tom OR jerry — Will return events that contain tom or jerry, or both

Although there are multiple query clause types, the type you’ll use the most is the compound query, because it combines multiple clauses to build up complex queries.

The Bool Query is probably used the most because it can combine the features of some of the other compound query clauses such as the And, Or, Filter, and Not clauses. It is used so much that these four clauses have been deprecated in various versions in favor of using the Bool query. Using it is best explained with an example:

curl localhost:9200/_search -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        { "fuzzy": { "name": { "value": "john", "fuzziness": 2 }}}
      ],
      "must_not": [
        { "match": { "_all": "city" }}
      ],
      "should": [
        { "range": { "age": { "gte": 30, "lte": 40 }}},
        { "wildcard": { "surname": "K*" }}
      ]
    }
  }
}'

Within the query element, we’ve added the bool clause that indicates that this will be a boolean query. There’s quite a lot going in there, so let’s cover it clause-by-clause, starting at the top:

must

All queries within this clause must match a document in order for ES to return it. Think of this as your AND queries. The query that we used here is the fuzzy query, and it will match any documents that have a name field that matches “john” in a fuzzy way. The extra “fuzziness” parameter tells Elasticsearch that it should use a Damerau-Levenshtein distance of 2 to determine the fuzziness.

must_not

Any documents that match the query within this clause will be excluded from the result set. This is the NOT or minus (-) operator of the Query DSL. In this case, we do a simple match query, looking for documents that contain the term “city.” Using _all as the field name indicates that the term can appear in any of the document’s fields.

should

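Up until now, we have been dealing with absolutes: must and must_not. Should is not absolute and is the closest equivalent to the OR operator. When a bool query has no must or filter clause, Elasticsearch returns documents that match one or more of the queries in the should clause. When a must or filter clause is present, as in this example, the should queries become optional by default: they no longer decide whether a document is returned, but each one that matches raises the document’s score.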

The first query that we provided looks for documents where the age field is between 30 and 40. The second query does a wildcard search on the surname field, looking for values that start with “K.”

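The query contains three different clause types. Elasticsearch returns documents that satisfy the must clause and are not excluded by the must_not clause, and it ranks those that also match a should query higher. To require at least one should query to match, as the OR in the earlier URI search does, add “minimum_should_match”: 1 to the bool query. These queries can be nested, so you can build up very complex queries by specifying a bool query as a must, must_not, should, or filter query.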
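How the clause types interact can be sketched in plain Python, with each clause reduced to a yes/no predicate. This deliberately ignores scoring and is not Elasticsearch code. One fact it encodes is that should clauses are optional by default when a must clause is present, and at least one is required when should stands alone. The predicates and sample document below are hypothetical stand-ins for the fuzzy, match, range, and wildcard queries.

```python
def bool_matches(doc, must=(), must_not=(), should=(), minimum_should_match=None):
    """Decide whether doc passes a bool query whose clauses are plain predicates."""
    if minimum_should_match is None:
        # Default: should is optional next to must, otherwise one clause must match.
        minimum_should_match = 0 if must or not should else 1
    if not all(clause(doc) for clause in must):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    return sum(1 for clause in should if clause(doc)) >= minimum_should_match


# Hypothetical stand-ins for the clauses in the curl example.
def name_is_john(doc):
    return doc["name"] == "john"

def mentions_city(doc):
    return "city" in doc["notes"]

def age_30_to_40(doc):
    return 30 <= doc["age"] <= 40

def surname_starts_with_k(doc):
    return doc["surname"].startswith("K")


clauses = dict(
    must=[name_is_john],
    must_not=[mentions_city],
    should=[age_30_to_40, surname_starts_with_k],
)
older_john = {"name": "john", "surname": "Smith", "age": 52, "notes": ""}

print(bool_matches(older_john, **clauses))
print(bool_matches(older_john, minimum_should_match=1, **clauses))
```

The sample document matches no should clause, so it passes with the default setting and is rejected once one should match is required.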

filter

One clause type we haven’t discussed for a compound query is the filter clause. Here is an example where we use one:

curl localhost:9200/_search -H 'Content-Type: application/json' -d '{
  "query": {
    "bool": {
      "must": [
        { "match_all": {}}
      ],
      "filter": [
        { "term": { "email": "joe@bloggs.com" }}
      ]
    }
  }
}'

The match_all query in the must clause tells Elasticsearch that it should return all of the documents. This might not seem to be a very useful search, but it comes in handy when you use it in conjunction with a filter as we have done here. The filter we have specified is a term query, asking for all documents that contain an email field with the value “joe@bloggs.com.”

Because a filter decides which documents we want without contributing to the score, the score comes entirely from the match_all query, which gives every document a score of 1.

One thing to note is that this query won’t work if the email field is analyzed, which is the default for text fields in Elasticsearch. The reason is best discussed in another blog post, but it comes down to the fact that Elasticsearch analyzes field values at index time and full-text queries at search time, whereas a term query is not analyzed. In this case, the analyzer breaks the email address into separate terms (with the standard analyzer, joe and bloggs.com). The index then holds those terms rather than the whole address, so a term query for the exact value “joe@bloggs.com” finds nothing, while searches for any of the individual terms will match.

Boosting Queries

The boosting query takes three parameters: positive, negative, and negative_boost. The positive query is the main query: a document has to match it to be returned, and it supplies the relevance score.

The negative query identifies documents you want to rank lower without excluding them. The negative_boost is a value between 0 and 1 by which the score of any document that also matches the negative query is multiplied. If you set the negative_boost to .25, such a document keeps a quarter of its original score; .5 leaves half; .1 leaves a tenth; and so on. This gives you a lot of flexibility in grading your results.
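The effect of negative_boost on a single document can be sketched as simple arithmetic: a document that matches the positive query keeps its score unless it also matches the negative query, in which case the score is multiplied by negative_boost. The function name and the sample scores below are made up for illustration.

```python
def boosting_score(positive_score: float, matches_negative: bool, negative_boost: float) -> float:
    """Final score of a document that already matched the positive query."""
    if not 0 <= negative_boost <= 1:
        raise ValueError("negative_boost must be between 0 and 1")
    # Only documents that also match the negative query are demoted.
    return positive_score * negative_boost if matches_negative else positive_score


print(boosting_score(2.0, matches_negative=False, negative_boost=0.25))
print(boosting_score(2.0, matches_negative=True, negative_boost=0.25))
```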

Constant Score Queries

This is a valuable tool for giving every document that matches a certain condition the same fixed score. The “constant_score”: {} wrapper takes a filter query and pairs it with a boost value, which becomes the score of every matching document:

GET /_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "type": "nginx" }
      },
      "boost": 1.5
    }
  }
}

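So in this instance, every NGINX log is returned with a score of 1.5, no matter how often or where the term appears. Used as one clause of a larger query, this lets you give NGINX logs a greater value than others (presumably than other server logs like apache2 logs or IIS logs).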

Disjunction Max Queries

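Imagine if your Google results could distinguish between results that include several of the things you’re searching for and results that include only one. That’s what this does. A dis_max query returns documents that match at least one of its queries and scores each document by its best-matching query; the tie_breaker (a value between 0 and 1) then adds a fraction of the scores of any other matching queries.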

You can group queries together as nested fields within the “queries”: [ ] parameter.

GET /_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "term": { "type": "nginx" } },
        { "term": { "fuzzy": "server" } }
      ],
      "tie_breaker": 0.5
    }
  }
}
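How dis_max combines clause scores can be sketched as arithmetic: take the highest score among the matching clauses, then add the tie_breaker times the scores of the other matching clauses. The helper below is hypothetical and uses None to mark a clause that did not match; the sample scores are made up.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine per-clause scores the way a dis_max query does.

    clause_scores holds one entry per clause; None means the clause did not match.
    Returns None when no clause matched, meaning the document is not returned.
    """
    matching = [score for score in clause_scores if score is not None]
    if not matching:
        return None
    best = max(matching)
    # Every other matching clause contributes only a tie_breaker-sized share.
    return best + tie_breaker * (sum(matching) - best)


print(dis_max_score([2.0, 1.0], tie_breaker=0.5))
print(dis_max_score([2.0, None], tie_breaker=0.5))
```

With a tie_breaker of 0, only the best clause counts; with 1, the result is the plain sum of all matching clauses.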

function_score Queries

Function score queries, as their name suggests, exist to make it easier to use a function to compute a score. You define a query and then set the rules for how to boost the score of each result.

Conclusion

The hardest thing about Elasticsearch is the depth and breadth of the available features. We have tried to cover the essential elements in as much detail as possible without drowning you in information. Ask any questions you might have in the comments, and look out for more in-depth posts covering some of the features we have mentioned. You can also read my prior Elasticsearch tutorial to learn more.
