Introduction to Elasticsearch Tokenization and Analysis

Elasticsearch is a text engine.  This is usually good if you have text to index, but can cause problems with other types of input (log files).  One of the more confusing elements of elasticsearch is the idea of tokenization and how fields are analyzed.

Tokens

In a text engine, you might want to take a string and search for each “word”.  The rules that are used to convert a string into words are defined in a tokenizer.   A simple string:

The quick brown fox

can easily be processed into a series of tokens:

[“the”, “quick”, “brown”, “fox”]

But what about punctuation?

Half-blood prince

or

/var/log/messages

The default tokenizer in elasticsearch will split those up:

[“half”, “blood”, “prince”]

[“var”, “log”, “messages”]

Unfortunately, this means that searching for “half-blood price” might also find you an article about a royal prince who fell half way to the floor while donating blood.

As of this writing, there are 12 built-in tokenizers.

You can test some input text against a tokenizer on the command line:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d '/var/log/messages'

Analyzers

An analyzer lets you combine a tokenizer with some other rules to determine how the text will be indexed.  This is not something I’ve had to do, so I don’t have examples or caveats yet.

You can test the analyzer rules on the command line as well:

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase' -d 'The quick brown fox'

Mappings

When you define the mapping for your index, you can control how each field is analyzed.  First, you can specify *if* the field is even to be analyzed or indexed:

"myField": {
    "index": "not_analyzed"
}

By using “not_analyzed”, the value of the field will not be tokenized in any way and will only be available as a raw string.  Since this is very useful for logs, the default template in logstash uses this to create the “.raw” fields (e.g. myField.raw).

You can also specify “no”, which will prevent the field from being indexed at all.

If you would like to use a different analyzer for your field, you can specify that:

"myField": {
    "analyzer": "spanish"
}

 

Leave a Reply

Your email address will not be published.