OpenSearch Queries: Query DSL and Beyond

OpenSearch Queries

OpenSearch has evolved rapidly since its fork from the source code of the last truly open source version of Elasticsearch. 

So far, the community’s work has focused on removing proprietary code from Elastic, including a number of things that were never purely open source themselves. These include some aspects of the querying languages and capabilities of Elasticsearch. But given OpenSearch’s commitment to being, well, open, there has been effort to replace those elements of the ELK Stack with something new.

As a result, OpenSearch queries now look both very familiar and also fairly unique. There are new elements to talk about in addition to carryover from Lucene and Query DSL. For more information on these, look at our introduction to Elasticsearch queries

For starters, the primary language for OpenSearch queries is Query DSL. However, there are also two plugins that allow you to use two other querying languages: SQL and PPL (Piped Processing Language). 

PPL was created by AWS for Open Distro for Elasticsearch. As Open Distro rolls over now into OpenSearch, PPL’s utility is inherently intertwined with OpenSearch applications.

Data Types and Equivalents: Query DSL, PPL, SQL

OpenSearch includes the Query Workbench (within OpenSearch Dashboards), where you can switch to SQL or PPL via plugins, translate SQL to REST, as well as query data in the OpenSearch Dashboards UI. You can also download data in CSV, JSON, text, Java Database Connectivity (JDBC), and Open Database Connectivity (ODBC) formats.

The Query Workbench is actually a part of the SQL plugin, which we will discuss below in greater detail.

The following table compares the three main querying language types in OpenSearch:

[Sourced  from https://opensearch.org/docs/search-plugins/ppl/datatypes/ ]

PPL TypeOpenSearch TypeSQL Type
booleanbooleanBOOLEAN
bytebyteTINYINT
byteshortSMALLINT
integerintegerINTEGER
longlongBIGINT
floatfloatREAL
floathalf_floatFLOAT
floatscaled_floatDOUBLE
doubledoubleDOUBLE
stringkeywordVARCHAR
texttextVARCHAR
timestampdateTIMESTAMP
ipipVARCHAR
timestampdateTIMESTAMP
binarybinaryVARBINARY
structobjectSTRUCT
arraynestedSTRUCT

Query DSL in OpenSearch

OpenSearch docs break down Query DSL into three categories: full-text queries, term-level queries, and boolean queries. We covered Query DSL extensively in our Elasticsearch queries introduction. Here, it suffices to say that boolean queries cover must, must_not, should, and filter requests. Let’s go into slightly more detail on the other two categories:

Term-Level Queries

Term-level queries are made up of the following list of options: term, terms, IDs, range, prefix, exists, wildcard, and regexp. You can find examples for Query DSL term-level searches in our Elasticsearch queries tutorial. Here is an example of prefix, though:

Prefix 

This query looks for any hits with a specified prefix:

GET shakespeare/_search
{
  "query":Hello {
    "prefix": {
      "speaker": "KING"
    }
  }
}

Full-Text Queries

OpenSearch full-text queries are a little more expansive and cover a lot of options. The list includes match, multi_match, match_phrase, match_phrase_prefix, match_bool_prefix, common, query_string, simple_query_string, match_all, and match_none.

There are also a number of optional query parameters for full-text OpenSearch Queries:

allow_leading_wildcard
analyze_wildcard
analyzer
auto_generate_synonyms_phrase_query
boost
cutoff_frequency
flags
fuzziness
fuzzy_transpositions
fields
enable_position_increments
low_freq_operator
max_determined_states
max_expansions
minimum_should_match
operator (and or or)
phrase_slop
prefix_length
quote_field_suffix
rewrite
slop
tie_breaker
time_zone
type
zero_terms_query

Take for example fuzziness.

Fuzziness instructs the query on how flexible it should be in searching for values. It instructs the limit on edits to a result in order to get it back to the original query. For example, it would take 1 character change to get from “desert” to “dessert.” Those changes can be deletions, insertions, or replacements (so you would count a letter switch as a single edit, not 2 with a deletion and an insertion). Settings can include values AUTO, 1, or any other positive value.

GET _search
{
  "query": {
    "match": {
      "title": {
        "query": "desert",
        "fuzziness": "AUTO",
      }
    }
  }
}

This dictates the number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matches a certain value. For example, the distance between wined and wind is 1. The default, AUTO, chooses a value based on the length of each term and is a good choice for most use cases.

Piped Processing Language (PPL) Queries and Syntax

OpenSearch carries over Piped Processing Language (PPL) from OpenDistro for Elasticsearch . Syntax uses the pipe marker ( | ) to separate query commands. The platform release comes with a PPL CLI built-in as well as a way to save JSON or text results.

There are also only a few commands to use with the language.

There are also only a few commands to use with the language.

Search

Syntax starts with the search command, followed by the source and any other expressions you need to add:

search source=<<index>> [some bool-expression]
search command 1 | command 2 | command 3

All the following queries are placed after the first clause and pipe: search source=<<index>> |

Head

This is a pretty important one: specify the number of results you want:

search source=voters | fields state | head 5

Dedup

This critical feature – data deduplication – removes duplicate documents from results (when they come up for different fields).

dedup [int] <<field>> [keepempty=<<true/false>>] [consecutive=<<true/false>>]

The keepempty parameter tells the query to keep results with empty fields.

You can also input multiple fields at a time:

search source=voters | dedup 3 age keepempty=true | fields age, state, phone number;
AgeStatePhone Number
87NJ555-1235
87PA555-2347
87CT555-3467
86NJ555-7654
86NY555-9876
86NC555-8642

Eval

The eval query makes quick math of your results. Input an expression within the query based on the results it returns. The final results will include the answer to the expression.

In this example, assume you’re looking at data on an election from six months ago (.5 years). For some reason, you want to update the ages in the results to be contemporary:

search source=voters | eval age = age + .5 | fields age;

Fields

Add or remove fields using + or - symbols:

search source=voters | fields age, state, city | fields - state;

Rename

To display a certain result with an alternate label, use rename to change how a field’s results display. In this example, the field was accidentally named Republican instead of Party (Republican is one of the values). To correct the error:

search source=candidates | rename republican as party | fields party;

Sort

This will sort results in a defined field either in ascending order (default) or descending order (marked by -)

search source=candidates | sort age | fields age, party, state, city, office;
AgePartyStateCityOffice
35RepublicanNJNew BrunswickMayor
47DemocratNJNew BrunswickMayor
52IndependentNJNew BrunswickMayor
67GreenNJNew BrunswickMayor
search source=candidates | sort - age | fields age, party, state, city, office;
AgePartyStateCityOffice
67GreenNJNew BrunswickMayor
52IndependentNJNew BrunswickMayor
47DemocratNJNew BrunswickMayor
35RepublicanNJNew BrunswickMayor

Also use sort for two different fields simultaneously:

search source=candidates | sort + age - sex | fields age, sex, party, state, city, office;
AgeSexPartyStateCityOffice
43MBlueNJHighland ParkMayor
44MCodependent PartyNJHighland ParkMayor
47FDemublicanNJHighland ParkMayor
75FRepubricatNJHighland ParkMayor

Stats

Calculate stats using the five optional functions: AVG, MAX, MIN, SUM, COUNT. The following example will give you the highest five average voter ages among the 50 US states:

search source=voters | stats avg(age) by state; | head 5;
Stateavg(age)
Florida55.70
California53.33
Arizona52.89
New York49.25
Utah48.00

Where

Limit results by very specific values:

search source=candidates | where party=republican and gender =”F” | fields party, gender;

Rare

Find the least common values from a given list of fields with the rare query.

search source=candidates | rare age by party | fields age, party;

Result:

AgeParty
17Republican
104Democrat

Top                                   ,

Finds the most common fields in a list, with the top result having the most hits:

top N <<field-list>> <<by-clause>>
search source=voters | top 5 age by state

Result:

AgeState
85Florida
84Florida
84California
83Texas
83Florida

Plugin: OpenSearch SQL

The OpenSearch SQL plugin supports complex SQL queries that include things like JOINs, subqueries and more. It supports the Query Workbench in OpenSearch Dashboards, an SQL CLI, and other functions.

The order of operations follows this (complete) syntax layout:

SELECT [DISTINCT] (* | expression) [[AS] alias] [, ...]
FROM index_or_table_name
[WHERE predicates]
[GROUP BY <<column_name>> [, ...]
 [HAVING predicates]]
[ORDER BY <<column_name>> [IS [NOT] NULL] [ASC | DESC] [, ...]]
[LIMIT [offset, ] size];

Summary

This post encompasses a brief overview of OpenSearch queries, highlighting some of the differences with Elasticsearch (which relies on Query DSL), as well as providing some additional info on Piped Processing Language whose Open Distro origins make it a critical component in the AWS-spearheaded OpenSearch project.

Subscribe to the blog to get more info on OpenSearch, queries, and the like.

Observability at scale, powered by open source

Internal

Consolidate Your AWS Data In One Place

Learn More
scaleup-logo Join our annual user conference! November 16, 2021