OpenSearch Queries: Query DSL and Beyond

By: Gedalyah Reback

September 9, 2021

OpenSearch Queries: Query DSL and Beyond

OpenSearch has evolved rapidly since its fork from the source code of the last truly open source version of Elasticsearch.

So far, the community’s work has focused on removing proprietary code from Elastic, including a number of things that were never purely open source themselves. These include some aspects of the querying languages and capabilities of Elasticsearch. But given OpenSearch’s commitment to being, well, open, there has been effort to replace those elements of the ELK Stack with something new.

As a result, OpenSearch queries now look both very familiar and also fairly unique. There are new elements to talk about in addition to carryover from Lucene and Query DSL. For more information on these, look at our introduction to Elasticsearch queries.

For starters, the primary language for OpenSearch queries is Query DSL. However, there are also two plugins that allow you to use two other querying languages: SQL and PPL (Piped Processing Language).

PPL was created by AWS for Open Distro for Elasticsearch. As Open Distro rolls over now into OpenSearch, PPL’s utility is inherently intertwined with OpenSearch applications.

Data Types and Equivalents: Query DSL, PPL, SQL

OpenSearch includes the Query Workbench (within OpenSearch Dashboards), where you can switch to SQL or PPL via plugins, translate SQL to REST, as well as query data in the OpenSearch Dashboards UI. You can also download data in CSV, JSON, text, Java Database Connectivity (JDBC), and Open Database Connectivity (ODBC) formats.

The Query Workbench is actually a part of the SQL plugin, which we will discuss below in greater detail.

The following table compares the three main querying language types in OpenSearch:

[Sourced from https://opensearch.org/docs/search-plugins/ppl/datatypes/ ]

PPL Type	OpenSearch Type	SQL Type
boolean	boolean	BOOLEAN
byte	byte	TINYINT
byte	short	SMALLINT
integer	integer	INTEGER
long	long	BIGINT
float	float	REAL
float	half_float	FLOAT
float	scaled_float	DOUBLE
double	double	DOUBLE
string	keyword	VARCHAR
text	text	VARCHAR
timestamp	date	TIMESTAMP
ip	ip	VARCHAR
timestamp	date	TIMESTAMP
binary	binary	VARBINARY
struct	object	STRUCT
array	nested	STRUCT

Query DSL in OpenSearch

OpenSearch docs break down Query DSL into three categories: full-text queries, term-level queries, and boolean queries. We covered Query DSL extensively in our Elasticsearch queries introduction. Here, it suffices to say that boolean queries cover must, must_not, should, and filter requests. Let’s go into slightly more detail on the other two categories:

Term-Level Queries

Term-level queries are made up of the following list of options: term, terms, IDs, range, prefix, exists, wildcard, and regexp. You can find examples for Query DSL term-level searches in our Elasticsearch queries tutorial. Here is an example of prefix, though:

Prefix

This query looks for any hits with a specified prefix:

GET shakespeare/_search
{
  "query":Hello {
    "prefix": {
      "speaker": "KING"
    }
  }
}

Full-Text Queries

OpenSearch full-text queries are a little more expansive and cover a lot of options. The list includes match, multi_match, match_phrase, match_phrase_prefix, match_bool_prefix, common, query_string, simple_query_string, match_all, and match_none.

There are also a number of optional query parameters for full-text OpenSearch Queries:

allow_leading_wildcard
analyze_wildcard
analyzer
auto_generate_synonyms_phrase_query
boost
cutoff_frequency
flags
fuzziness
fuzzy_transpositions
fields
enable_position_increments
low_freq_operator
max_determined_states
max_expansions
minimum_should_match
operator (and or or)
phrase_slop
prefix_length
quote_field_suffix
rewrite
slop
tie_breaker
time_zone
type
zero_terms_query

Take for example fuzziness.

Fuzziness instructs the query on how flexible it should be in searching for values. It instructs the limit on edits to a result in order to get it back to the original query. For example, it would take 1 character change to get from “desert” to “dessert.” Those changes can be deletions, insertions, or replacements (so you would count a letter switch as a single edit, not 2 with a deletion and an insertion). Settings can include values AUTO, 1, or any other positive value.

GET _search
{
  "query": {
    "match": {
      "title": {
        "query": "desert",
        "fuzziness": "AUTO",
      }
    }
  }
}

This dictates the number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matches a certain value. For example, the distance between wined and wind is 1. The default, AUTO, chooses a value based on the length of each term and is a good choice for most use cases.

Piped Processing Language (PPL) Queries and Syntax

OpenSearch carries over Piped Processing Language (PPL) from OpenDistro for Elasticsearch . Syntax uses the pipe marker ( | ) to separate query commands. The platform release comes with a PPL CLI built-in as well as a way to save JSON or text results.

There are also only a few commands to use with the language.

Search

Syntax starts with the search command, followed by the source and any other expressions you need to add:

search source=<<index>> [some bool-expression]

search command 1 | command 2 | command 3

All the following queries are placed after the first clause and pipe: search source=<<index>> |

Head

This is a pretty important one: specify the number of results you want:

search source=voters | fields state | head 5

Dedup

This critical feature – data deduplication – removes duplicate documents from results (when they come up for different fields).

dedup [int] <<field>> [keepempty=<<true/false>>] [consecutive=<<true/false>>]

The keepempty parameter tells the query to keep results with empty fields.

You can also input multiple fields at a time:

search source=voters | dedup 3 age keepempty=true | fields age, state, phone number;

Age	State	Phone Number
87	NJ	555-1235
87	PA	555-2347
87	CT	555-3467
86	NJ	555-7654
86	NY	555-9876
86	NC	555-8642

Eval

The eval query makes quick math of your results. Input an expression within the query based on the results it returns. The final results will include the answer to the expression.

In this example, assume you’re looking at data on an election from six months ago (.5 years). For some reason, you want to update the ages in the results to be contemporary:

search source=voters | eval age = age + .5 | fields age;

Fields

Add or remove fields using + or - symbols:

search source=voters | fields age, state, city | fields - state;

Rename

To display a certain result with an alternate label, use rename to change how a field’s results display. In this example, the field was accidentally named Republican instead of Party (Republican is one of the values). To correct the error:

search source=candidates | rename republican as party | fields party;

Sort

This will sort results in a defined field either in ascending order (default) or descending order (marked by -)

search source=candidates | sort age | fields age, party, state, city, office;

Age	Party	State	City	Office
35	Republican	NJ	New Brunswick	Mayor
47	Democrat	NJ	New Brunswick	Mayor
52	Independent	NJ	New Brunswick	Mayor
67	Green	NJ	New Brunswick	Mayor

search source=candidates | sort - age | fields age, party, state, city, office;

Age	Party	State	City	Office
67	Green	NJ	New Brunswick	Mayor
52	Independent	NJ	New Brunswick	Mayor
47	Democrat	NJ	New Brunswick	Mayor
35	Republican	NJ	New Brunswick	Mayor

Also use sort for two different fields simultaneously:

search source=candidates | sort + age - sex | fields age, sex, party, state, city, office;

Age	Sex	Party	State	City	Office
43	M	Blue	NJ	Highland Park	Mayor
44	M	Codependent Party	NJ	Highland Park	Mayor
47	F	Demublican	NJ	Highland Park	Mayor
75	F	Repubricat	NJ	Highland Park	Mayor

Stats

Calculate stats using the five optional functions: AVG, MAX, MIN, SUM, COUNT. The following example will give you the highest five average voter ages among the 50 US states:

search source=voters | stats avg(age) by state; | head 5;

State	avg(age)
Florida	55.70
California	53.33
Arizona	52.89
New York	49.25
Utah	48.00

Where

Limit results by very specific values:

search source=candidates | where party=republican and gender =”F” | fields party, gender;

Rare

Find the least common values from a given list of fields with the rare query.

search source=candidates | rare age by party | fields age, party;

Result:

Age	Party
17	Republican
104	Democrat

Top ,

Finds the most common fields in a list, with the top result having the most hits:

top N <<field-list>> <<by-clause>>

search source=voters | top 5 age by state

Result:

Age	State
85	Florida
84	Florida
84	California
83	Texas
83	Florida

Plugin: OpenSearch SQL

The OpenSearch SQL plugin supports complex SQL queries that include things like JOINs, subqueries and more. It supports the Query Workbench in OpenSearch Dashboards, an SQL CLI, and other functions.

The order of operations follows this (complete) syntax layout:

SELECT [DISTINCT] (* | expression) [[AS] alias] [, ...]
FROM index_or_table_name
[WHERE predicates]
[GROUP BY <<column_name>> [, ...]
 [HAVING predicates]]
[ORDER BY <<column_name>> [IS [NOT] NULL] [ASC | DESC] [, ...]]
[LIMIT [offset, ] size];

Summary

This post encompasses a brief overview of OpenSearch queries, highlighting some of the differences with Elasticsearch (which relies on Query DSL), as well as providing some additional info on Piped Processing Language whose Open Distro origins make it a critical component in the AWS-spearheaded OpenSearch project.

Subscribe to the blog to get more info on OpenSearch, queries, and the like.