OpenSearch Queries: Query DSL and Beyond

September 9, 2021
OpenSearch Queries: Query DSL and Beyond

    OpenSearch has evolved rapidly since its fork from the source code of the last truly open source version of Elasticsearch. 

    So far, the community’s work has focused on removing proprietary code from Elastic, including a number of things that were never purely open source themselves. These include some aspects of the querying languages and capabilities of Elasticsearch. But given OpenSearch’s commitment to being, well, open, there has been effort to replace those elements of the ELK Stack with something new.

    As a result, OpenSearch queries now look both very familiar and also fairly unique. There are new elements to talk about in addition to carryover from Lucene and Query DSL. For more information on these, look at our introduction to Elasticsearch queries

    For starters, the primary language for OpenSearch queries is Query DSL. However, there are also two plugins that allow you to use two other querying languages: SQL and PPL (Piped Processing Language). 

    PPL was created by AWS for Open Distro for Elasticsearch. As Open Distro rolls over now into OpenSearch, PPL’s utility is inherently intertwined with OpenSearch applications.

    Data Types and Equivalents: Query DSL, PPL, SQL

    OpenSearch includes the Query Workbench (within OpenSearch Dashboards), where you can switch to SQL or PPL via plugins, translate SQL to REST, as well as query data in the OpenSearch Dashboards UI. You can also download data in CSV, JSON, text, Java Database Connectivity (JDBC), and Open Database Connectivity (ODBC) formats.

    The Query Workbench is actually a part of the SQL plugin, which we will discuss below in greater detail.

    The following table compares the three main querying language types in OpenSearch:

    [Sourced  from https://opensearch.org/docs/search-plugins/ppl/datatypes/ ]

    PPL TypeOpenSearch TypeSQL Type
    booleanbooleanBOOLEAN
    bytebyteTINYINT
    byteshortSMALLINT
    integerintegerINTEGER
    longlongBIGINT
    floatfloatREAL
    floathalf_floatFLOAT
    floatscaled_floatDOUBLE
    doubledoubleDOUBLE
    stringkeywordVARCHAR
    texttextVARCHAR
    timestampdateTIMESTAMP
    ipipVARCHAR
    timestampdateTIMESTAMP
    binarybinaryVARBINARY
    structobjectSTRUCT
    arraynestedSTRUCT

    Query DSL in OpenSearch

    OpenSearch docs break down Query DSL into three categories: full-text queries, term-level queries, and boolean queries. We covered Query DSL extensively in our Elasticsearch queries introduction. Here, it suffices to say that boolean queries cover must, must_not, should, and filter requests. Let’s go into slightly more detail on the other two categories:

    Term-Level Queries

    Term-level queries are made up of the following list of options: term, terms, IDs, range, prefix, exists, wildcard, and regexp. You can find examples for Query DSL term-level searches in our Elasticsearch queries tutorial. Here is an example of prefix, though:

    Prefix 

    This query looks for any hits with a specified prefix:

    GET shakespeare/_search
    {
      "query":Hello {
        "prefix": {
          "speaker": "KING"
        }
      }
    }
    

    Full-Text Queries

    OpenSearch full-text queries are a little more expansive and cover a lot of options. The list includes match, multi_match, match_phrase, match_phrase_prefix, match_bool_prefix, common, query_string, simple_query_string, match_all, and match_none.

    There are also a number of optional query parameters for full-text OpenSearch Queries:

    allow_leading_wildcard
    analyze_wildcard
    analyzer
    auto_generate_synonyms_phrase_query
    boost
    cutoff_frequency
    flags
    fuzziness
    fuzzy_transpositions
    fields
    enable_position_increments
    low_freq_operator
    max_determined_states
    max_expansions
    minimum_should_match
    operator (and or or)
    phrase_slop
    prefix_length
    quote_field_suffix
    rewrite
    slop
    tie_breaker
    time_zone
    type
    zero_terms_query

    Take for example fuzziness.

    Fuzziness instructs the query on how flexible it should be in searching for values. It instructs the limit on edits to a result in order to get it back to the original query. For example, it would take 1 character change to get from “desert” to “dessert.” Those changes can be deletions, insertions, or replacements (so you would count a letter switch as a single edit, not 2 with a deletion and an insertion). Settings can include values AUTO, 1, or any other positive value.

    GET _search
    {
      "query": {
        "match": {
          "title": {
            "query": "desert",
            "fuzziness": "AUTO",
          }
        }
      }
    }
    

    This dictates the number of character edits (insert, delete, substitute) that it takes to change one word to another when determining whether a term matches a certain value. For example, the distance between wined and wind is 1. The default, AUTO, chooses a value based on the length of each term and is a good choice for most use cases.

    Piped Processing Language (PPL) Queries and Syntax

    OpenSearch carries over Piped Processing Language (PPL) from OpenDistro for Elasticsearch . Syntax uses the pipe marker ( | ) to separate query commands. The platform release comes with a PPL CLI built-in as well as a way to save JSON or text results.

    There are also only a few commands to use with the language.

    There are also only a few commands to use with the language.

    Search

    Syntax starts with the search command, followed by the source and any other expressions you need to add:

    search source=<<index>> [some bool-expression]
    search command 1 | command 2 | command 3

    All the following queries are placed after the first clause and pipe: search source=<<index>> |

    Head

    This is a pretty important one: specify the number of results you want:

    search source=voters | fields state | head 5

    Dedup

    This critical feature – data deduplication – removes duplicate documents from results (when they come up for different fields).

    dedup [int] <<field>> [keepempty=<<true/false>>] [consecutive=<<true/false>>]

    The keepempty parameter tells the query to keep results with empty fields.

    You can also input multiple fields at a time:

    search source=voters | dedup 3 age keepempty=true | fields age, state, phone number;
    AgeStatePhone Number
    87NJ555-1235
    87PA555-2347
    87CT555-3467
    86NJ555-7654
    86NY555-9876
    86NC555-8642

    Eval

    The eval query makes quick math of your results. Input an expression within the query based on the results it returns. The final results will include the answer to the expression.

    In this example, assume you’re looking at data on an election from six months ago (.5 years). For some reason, you want to update the ages in the results to be contemporary:

    search source=voters | eval age = age + .5 | fields age;

    Fields

    Add or remove fields using + or - symbols:

    search source=voters | fields age, state, city | fields - state;

    Rename

    To display a certain result with an alternate label, use rename to change how a field’s results display. In this example, the field was accidentally named Republican instead of Party (Republican is one of the values). To correct the error:

    search source=candidates | rename republican as party | fields party;

    Sort

    This will sort results in a defined field either in ascending order (default) or descending order (marked by -)

    search source=candidates | sort age | fields age, party, state, city, office;
    AgePartyStateCityOffice
    35RepublicanNJNew BrunswickMayor
    47DemocratNJNew BrunswickMayor
    52IndependentNJNew BrunswickMayor
    67GreenNJNew BrunswickMayor
    search source=candidates | sort - age | fields age, party, state, city, office;
    AgePartyStateCityOffice
    67GreenNJNew BrunswickMayor
    52IndependentNJNew BrunswickMayor
    47DemocratNJNew BrunswickMayor
    35RepublicanNJNew BrunswickMayor

    Also use sort for two different fields simultaneously:

    search source=candidates | sort + age - sex | fields age, sex, party, state, city, office;
    AgeSexPartyStateCityOffice
    43MBlueNJHighland ParkMayor
    44MCodependent PartyNJHighland ParkMayor
    47FDemublicanNJHighland ParkMayor
    75FRepubricatNJHighland ParkMayor

    Stats

    Calculate stats using the five optional functions: AVG, MAX, MIN, SUM, COUNT. The following example will give you the highest five average voter ages among the 50 US states:

    search source=voters | stats avg(age) by state; | head 5;
    Stateavg(age)
    Florida55.70
    California53.33
    Arizona52.89
    New York49.25
    Utah48.00

    Where

    Limit results by very specific values:

    search source=candidates | where party=republican and gender =”F” | fields party, gender;

    Rare

    Find the least common values from a given list of fields with the rare query.

    search source=candidates | rare age by party | fields age, party;

    Result:

    AgeParty
    17Republican
    104Democrat

    Top                                   ,

    Finds the most common fields in a list, with the top result having the most hits:

    top N <<field-list>> <<by-clause>>
    search source=voters | top 5 age by state

    Result:

    AgeState
    85Florida
    84Florida
    84California
    83Texas
    83Florida

    Plugin: OpenSearch SQL

    The OpenSearch SQL plugin supports complex SQL queries that include things like JOINs, subqueries and more. It supports the Query Workbench in OpenSearch Dashboards, an SQL CLI, and other functions.

    The order of operations follows this (complete) syntax layout:

    SELECT [DISTINCT] (* | expression) [[AS] alias] [, ...]
    FROM index_or_table_name
    [WHERE predicates]
    [GROUP BY <<column_name>> [, ...]
     [HAVING predicates]]
    [ORDER BY <<column_name>> [IS [NOT] NULL] [ASC | DESC] [, ...]]
    [LIMIT [offset, ] size];

    Summary

    This post encompasses a brief overview of OpenSearch queries, highlighting some of the differences with Elasticsearch (which relies on Query DSL), as well as providing some additional info on Piped Processing Language whose Open Distro origins make it a critical component in the AWS-spearheaded OpenSearch project.

    Subscribe to the blog to get more info on OpenSearch, queries, and the like.

    Get started for free

    Completely free for 14 days, no strings attached.