A Breakdown of Language Analyzers for Elasticsearch

Any search engine needs to be able to parse language. As the field of natural language processing (NLP) has grown, text analysis has come to include specific handling for stop words and for tokenizing text and tagging (or marking) the tokens by part of speech. In Elasticsearch (and elsewhere), the most attention has been paid to English, although the ELK stack has built-in support for 34 languages as of this writing. The English analyzer in particular comes equipped with a stemmer, a possessive stemmer, a keyword marker, a lowercase filter and a stop word filter.
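As a rough illustration of how those components fit together, the sketch below builds index settings for a custom analyzer that chains the same kinds of filters. The analyzer name (rebuilt_english), the filter names and the sample keyword list are placeholders chosen for this example; the filter types (stop, keyword_marker, stemmer) and the lowercase filter are standard Elasticsearch token filters.

```python
import json

# Sketch only: index settings for a custom analyzer that mirrors the kinds of
# components found in the built-in English analyzer. All names below
# ("rebuilt_english", "english_stop", ...) are placeholders for this example.
english_like_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                # Words listed here are protected from stemming.
                "english_keywords": {"type": "keyword_marker", "keywords": ["example"]},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english",
                },
            },
            "analyzer": {
                "rebuilt_english": {
                    "tokenizer": "standard",
                    # Token filters run in the order listed.
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_keywords",
                        "english_stemmer",
                    ],
                }
            },
        }
    }
}


def filter_chain(settings: dict, analyzer_name: str) -> list:
    """Return the ordered token filter names for the given analyzer."""
    return settings["settings"]["analysis"]["analyzer"][analyzer_name]["filter"]


if __name__ == "__main__":
    # This JSON could be sent as the body of an index-creation request.
    print(json.dumps(english_like_settings, indent=2))
```

The keyword marker sits before the stemmer so that protected words are flagged before stemming happens.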

There are also language analyzers that are not native to Elasticsearch, built by others in the open-source community. At the bottom of this article you will find a list of independent analyzers organized by language. While you might think this list (and the links to their repositories) is pretty thin, fret not. We intend to continuously integrate (get it?) new entries into it and maintain this index as an essential resource for DevOps, NLP, search, logs (of course), and machine learning.

What Are Analyzers?

Analyzers in Elasticsearch (and in any Lucene-based search implementation, really) consist of two main kinds of components: tokenizers and filters. Each analyzer has exactly one tokenizer, while filters are optional (minimum 0). Beyond that requirement, there is no maximum number of filters within an analyzer.

Tokenizers break text into discrete pieces, usually words, but they can be configured to split text in any number of ways. Filters then work on the resulting tokens by modifying, adding or removing them. Note that from time to time you’ll run into an introduction that blurs the definitions of “tokenizer” and “analyzer.”

Common tasks for tokenizers include lowercasing all letters (in Elastic, the Lowercase Tokenizer), splitting text on the spaces between words (Whitespace Tokenizer) and splitting on anything that is not a letter, which sifts out punctuation (Letter Tokenizer). The two common types of filters are character filters (applied before the tokenizer) and token filters (applied after the input has been tokenized).
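The pieces described above can be sketched in plain Python. This is a toy model, not Elasticsearch code: the function names, the tiny stop word set and the regular expressions are all assumptions made for illustration, and the real tokenizers handle many more edge cases.

```python
import re
from typing import Callable, List

# Toy stop word list for this example only.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def letter_tokenizer(text: str) -> List[str]:
    """Split on anything that is not a letter, which drops punctuation and digits."""
    return re.findall(r"[^\W\d_]+", text)


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    """Token filter: drop tokens found in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run character filters, then one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze("<p>The Quick, brown fox!</p>",
                  [strip_html], letter_tokenizer,
                  [lowercase_filter, stop_filter]))
    print(analyze("The Quick, brown fox!",
                  [], whitespace_tokenizer, [lowercase_filter]))
```

Swapping letter_tokenizer for whitespace_tokenizer in the same call shows the practical difference: the whitespace version keeps "quick," and "fox!" with their punctuation attached.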

The order of these stages is fixed: character filters run first, then the tokenizer, then token filters. Within each stage, however, you can chain as many filters as you like, and they run in the order you list them.
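In index settings, a custom analyzer is declared with three keys: char_filter, tokenizer and filter. The sketch below is a minimal example of that shape. The analyzer name (my_custom_analyzer) and the sample text are hypothetical; html_strip, whitespace, lowercase and stop are built-in component names.

```python
import json

# Sketch only: settings for one custom analyzer. "my_custom_analyzer" is a
# placeholder name; the components it references are built into Elasticsearch.
custom_analyzer_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],    # zero or more
                    "tokenizer": "whitespace",        # a single tokenizer
                    "filter": ["lowercase", "stop"],  # zero or more
                }
            }
        }
    }
}


def analyze_request_body(analyzer: str, text: str) -> dict:
    """Build a request body for the index-level _analyze API."""
    return {"analyzer": analyzer, "text": text}


if __name__ == "__main__":
    print(json.dumps(custom_analyzer_settings, indent=2))
    print(json.dumps(analyze_request_body("my_custom_analyzer",
                                          "<b>The</b> Quick Fox")))
```

Sending the second body to the _analyze endpoint of an index created with these settings is a convenient way to inspect which tokens an analyzer actually produces.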

Built-In to and Recommended by Elasticsearch

Elasticsearch offers built-in support for 34 languages as of November 2019. They include Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK (Chinese, Japanese, Korean), Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish (i.e., Irish Gaelic), Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani (dialect of Kurdish), Spanish, Swedish, Turkish, and Thai.

There are some analyzer plugins that are recommended by Elastic for use in Elasticsearch, namely: 

  1. ICU – Unicode support through the ICU libraries, for Asian languages in particular
  2. Stempel – Stemming in Polish
  3. Ukrainian Analysis Plugin – Stemming in Ukrainian
  4. Kuromoji – Japanese
  5. Nori – Korean 
  6. SmartCN – Mandarin Chinese (simplified) and mixed English-Chinese texts
  7. Phonetic – utilizes multiple phonetic algorithms (Caverphone, Metaphone and Soundex)
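These plugins are installed with the plugin manager that ships with Elasticsearch. The sketch below prints one install command per plugin in the list above. It assumes Elasticsearch 5.0 or later, where the tool is bin/elasticsearch-plugin (older releases used bin/plugin), and that the commands are run from the Elasticsearch home directory; the dictionary and function names are made up for this example.

```python
# Display name -> official plugin identifier.
RECOMMENDED_PLUGINS = {
    "ICU": "analysis-icu",
    "Stempel": "analysis-stempel",
    "Ukrainian": "analysis-ukrainian",
    "Kuromoji": "analysis-kuromoji",
    "Nori": "analysis-nori",
    "SmartCN": "analysis-smartcn",
    "Phonetic": "analysis-phonetic",
}


def install_command(plugin_id: str) -> str:
    """Return the shell command that installs one plugin (Elasticsearch 5.0+)."""
    return f"bin/elasticsearch-plugin install {plugin_id}"


if __name__ == "__main__":
    for name, plugin_id in RECOMMENDED_PLUGINS.items():
        print(f"# {name}")
        print(install_command(plugin_id))
```

Each node in the cluster needs the plugin installed, followed by a restart, before the new analyzers become available.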

Of course, this doesn’t exhaust the world’s estimated 7,000+ active languages. While you might have a hard time finding some of those vernaculars in analyzer form, a number are already covered by enterprising multilingual engineers.

Below is a list of popular languages and the independently developed analyzers built for them for use with the ELK Stack. These are at different levels of development, and some languages have multiple entries:

Analyzers, Tokenizers and Filters by Language

Multilingual Analyzers

There are some analyzers that are not specific to one language or even a limited family of vernaculars. Two examples are jLemmaGen (with an obvious focus on lemmatization) and Snowball (based on the programming language of the same name, a language designed specifically for writing stemmers):

jLemmaGen covers 17 languages that sometimes overlap with languages already natively covered by Elasticsearch: Czech, English, Persian, Polish, Russian, Ukrainian, Bulgarian, Croatian, Estonian, Hungarian, Lithuanian, Macedonian, Resian, Romanian, Serbian, Slovak, and Slovene/Slovenian. The Snowball compiler translates stemmers written in Snowball into other programming languages. Natural languages it covers are Hindi, Basque, Catalan, Greek, Lithuanian, Indonesian, Arabic, Tamil, Irish, Czech, Armenian, Romanian, Serbian, Turkish, and Hungarian.

Middle Eastern Languages

Hebrew

Hebmorph:

In order to install, use the following command:

~/elasticsearch-1.5.2$ bin/plugin --install analysis-hebrew --url https://bintray.com/artifact/download/synhershko/elasticsearch-analysis-hebrew/elasticsearch-analysis-hebrew-1.7.zip

From there, install Hebrew dictionaries from here and then add the following as the dictionary path in elasticsearch.yml:

hebrew.dict.path: /PATH/TO/HSPELL/FOLDER/

More options for Hebrew Analyzers

Hebmorph-Exist: A morphology analyzer for Hebmorph by OpenSiddur 

Sefaria: This deals with various dialects of Ancient Hebrew, including Biblical and post-Biblical Hebrew. Sefaria utilizes a corpus of classic texts to feed its tokenizer.

Elasticsearch-Hebrew: An analyzer built with Docker in mind 

Grammar Analyzer: Python-based analyzer for Hebrew grammar

Arabic

Arabic is standardized, but Modern Standard Arabic (MSA) is not commonly spoken. Arabic maintains diglossia, a situation where the common written form of communication differs wildly from the dialects its speakers use. As a result, many linguists treat the various dialects of colloquial Arabic as separate languages. With the proliferation of non-standard written communication online and the increase in audiovisual media, many have begun work on Arabic dialect-specific tokenizers and analyzers. There are also many options for additional MSA analyzers.

ADAM: Analyzer for Dialectal Arabic Morphology

Shami-Sentiment-Analyzer: Sentiment analysis analyzer for Arabic dialects spoken in the Levant (Israel, Syria, Lebanon, Jordan and Palestinian territories) 

Intelligent Tunisian: A morphological analyzer containing a corpus of 1,000 Tunisian Arabic dialect words. 

Kurdish

There are several non-mutually intelligible dialects of Kurdish, the two main ones being Kurmanji and Sorani. Sorani is built into Elasticsearch.

European Languages

Estonian

Snowball for Estonian: An Estonian-specific analyzer written in the Snowball programming language provides stemming and filtering capabilities for Lucene. This is separate from the general Snowball analyzer mentioned above. https://lucene.apache.org/core/corenews.html

Polish

Morfologik: This is a comprehensive tool for Polish language analysis available on GitHub. It includes a morphosyntactic dictionary of Polish, a stemmer, a speller, and finite-state automaton (FSA) tooling.

PolishLangAnalyzer: A .NET library meant to extend Lucene.NET, built on the Morfologik dictionary mentioned above.

Icelandic

Modern Icelandic is available in Lucene and for Microsoft Azure. There are other options though.

ICEMorph: If you want to get really niche, this UCLA-developed tool will parse Old Norse and Old Icelandic for you, two related, equally complex and no-longer-spoken ancestors of the modern Scandinavian languages.

Nifgraup: This includes a spellchecker (Hunspell-is), thesaurus and morphological analyzer that covers over 300 inflection rules for Icelandic.

Serbian

Serbian Analyzer: Besides being available in jLemmaGen and Snowball, you can find a Serbian-specific analyzer here.

Slovak

Lucene.Net.SlovakAnalyzer: Marks Slovak stop words and contains conditions for overstemming. 

Greek

Greeklish: Generates Latin characters from Greek tokens; essentially a transliteration plugin. It only works on lowercase Greek characters, and hence requires a Greek lowercase filter earlier in the filter chain. Here you can see some examples.
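The ordering requirement can be sketched as index settings in which a Greek lowercase filter runs before the transliteration filter. The lowercase filter with language set to greek is a standard Elasticsearch token filter. The Greeklish filter type name used here (skroutz_greeklish) is an assumption and should be checked against the plugin's own documentation, and the analyzer and filter names are placeholders.

```python
# Sketch only. ASSUMPTION: "skroutz_greeklish" is the filter type registered
# by the Greeklish plugin; verify the exact name in the plugin's README.
greeklish_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Built-in lowercase filter with Greek-specific rules.
                "greek_lowercase": {"type": "lowercase", "language": "greek"},
                "greeklish": {"type": "skroutz_greeklish"},
            },
            "analyzer": {
                "greek_with_greeklish": {
                    "tokenizer": "standard",
                    # Lowercasing must come first: the Greeklish filter
                    # only handles lowercase Greek characters.
                    "filter": ["greek_lowercase", "greeklish"],
                }
            },
        }
    }
}


def lowercase_runs_first(settings: dict, analyzer_name: str) -> bool:
    """Check that the Greek lowercase filter precedes the Greeklish filter."""
    chain = settings["settings"]["analysis"]["analyzer"][analyzer_name]["filter"]
    return chain.index("greek_lowercase") < chain.index("greeklish")


if __name__ == "__main__":
    print(lowercase_runs_first(greeklish_settings, "greek_with_greeklish"))
```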

Portuguese

Portuguese is a standard inclusion for most software — Elasticsearch included — but it’s one of those languages that sees a lot of variety when it comes to creating parsing tools, mainly because of its dialect separation. It has also become conventional to provide separate support for the Brazilian and European dialects of the language as pronunciation and some standard syntax can make the two forms of Portuguese sound like diverging languages. There is a lot of argument over the effectiveness of certain tools though, resulting in a proliferation of original, open-source analyzers for Brazilian Portuguese.

Brazilian-Analyzer: This analyzer is written in PHP but is based on Java Brazilian Analyzer. According to its repository, it needs Zend_Lucene to run but should run in most any Lucene implementation, even though it is coded in PHP. This is an older analyzer though.

Text-Mining: This tokenizer for general Portuguese will provide you with stopword removal and part-of-speech tagging among other capabilities. It is written in Python.

PTStem: This is actually a combination of three stemmers for Portuguese, packaged for R. It includes the Porter stemmer (via SnowballC), the spell-checker Hunspell, and RSLP.
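For the built-in side of the picture, Elasticsearch ships separate portuguese and brazilian analyzers. A minimal sketch of a mapping that uses both follows; the field names (body_pt, body_br) and the locale helper are hypothetical and exist only for this example.

```python
# Sketch only: one text field per dialect, each with its built-in analyzer.
# The field names are placeholders.
portuguese_mappings = {
    "mappings": {
        "properties": {
            "body_pt": {"type": "text", "analyzer": "portuguese"},
            "body_br": {"type": "text", "analyzer": "brazilian"},
        }
    }
}


def analyzer_for_locale(locale: str) -> str:
    """Pick the built-in analyzer name for a Portuguese locale tag."""
    normalized = locale.replace("_", "-").lower()
    return "brazilian" if normalized == "pt-br" else "portuguese"


if __name__ == "__main__":
    for tag in ("pt-BR", "pt_PT", "pt"):
        print(tag, "->", analyzer_for_locale(tag))
```

Running the same sample text through both analyzers with the _analyze API is a quick way to judge whether the dialect split matters for a given corpus.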

French

Again, French comes standard with most any software. But since there are so many options, some have decided to take a federated approach by incorporating multiple tokenizers and filters into larger analyzers:

Elasticsearch-analyzers-compare-plugin (no longer supported beyond Elasticsearch 5.0)

Indian Languages

Indic NLP Library: This has a large set of features, including text normalization, tokenization, romanization and indicization, and even machine translation functions. It might be considered essential for any serious linguist or tokenizer enthusiast (if such people exist).

Sanskrit

Lucene Analyzer for Sanskrit: Yep. It provides normalization into SLP1 (i.e., the Sanskrit Library Phonetic basic encoding scheme, an ASCII transliteration scheme) from IAST transliteration and Devanagari script, along with a syllable tokenizer, a word tokenizer that treats compounds lightly, and a separate compound tokenizer. The latter can resolve sandhi to separate words at the appropriate spot (desandhification), filter and merge prepositions or “preverbs” with their following verbs, and work from user-defined or customizable word/compound lists.

There are few widely available open-source tools for Indian languages outside of Hindi, apart from an analyzer that Microsoft provides. But there are several from the Indian Language Technology Proliferation and Deployment Center that identify the root and grammatical features of words in various Indian languages:

Kannada

Morph Analyzer for Kannada: root identification and grammatical feature extractor

Kannada CRF Chunker

Kannada Synset (synonym set)

Bengali

Morph Analyzer for Bengali

Tamil

Morph Analyzer for Tamil

Asian Languages (Non-Indian)

Because of the variety in characters among Asian languages, there are several tools that group capabilities for multiple Asian languages together.

Vietnamese

Docker-Elasticsearch: For Japanese and Vietnamese; affiliated with neither Docker nor Elasticsearch.

Elasticsearch-for-careers: A Vietnamese-specific analyzer to help with job searches. It is built with the VnTokenizer included.

VN-Lucene: Also uses VnTokenizer

Tibetan

Lucene Analyzer for Tibetan: Based on the Tibetan-NLP analyzers. It includes lemmatization, a list of stop words, a “diacritics transliteration schema” (DTS), a syllable tokenizer and an affix tokenizer, among other language-specific modes like the “PaBaFilter” that normalizes words with either phoneme “pa” or “ba.”
