Monthly Archives: December 2015

Introduction to Elasticsearch Tokenization and Analysis

Elasticsearch is a text engine.  This is usually good if you have text to index, but it can cause problems with other types of input, such as log files.  One of the more confusing elements of Elasticsearch is the idea of tokenization and how fields are analyzed.


In a text engine, you might want to take a string and search for each “word”.  The rules used to convert a string into words are defined by a tokenizer.  A simple string:

The quick brown fox

can easily be processed into a series of tokens:

[“the”, “quick”, “brown”, “fox”]

But what about punctuation?  Consider these two strings:

Half-blood prince

/var/log/messages

The default tokenizer in Elasticsearch will split each of those on the punctuation:

[“half”, “blood”, “prince”]

[“var”, “log”, “messages”]

Unfortunately, this means that searching for “half-blood prince” might also find you an article about a royal prince who fell halfway to the floor while donating blood.

As of this writing, there are 12 built-in tokenizers.

You can test some input text against a tokenizer on the command line:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d '/var/log/messages'
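On a 1.x cluster, that call returns one entry per token.  Trimmed, the response looks roughly like this (exact offsets, token types, and position numbering vary by version):

```json
{
  "tokens" : [
    { "token" : "var",      "start_offset" : 1, "end_offset" : 4,  "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "log",      "start_offset" : 5, "end_offset" : 8,  "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "messages", "start_offset" : 9, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}
```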


An analyzer lets you combine a tokenizer with some other rules to determine how the text will be indexed.  This is not something I’ve had to do, so I don’t have examples or caveats yet.
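Since that's not something I've done in production, treat the following as a hedged sketch rather than battle-tested config: a custom analyzer is declared in the index settings, naming a tokenizer plus one or more token filters (the index name “test-index” and analyzer name “lowercase_keyword” are invented for this example):

```shell
# Hypothetical index "test-index".  Defines a custom analyzer,
# "lowercase_keyword", that keeps the whole string as a single token
# (keyword tokenizer) and lowercases it (lowercase token filter).
curl -XPUT 'localhost:9200/test-index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'

# Once defined, you can exercise it with _analyze just like the built-ins:
curl -XGET 'localhost:9200/test-index/_analyze?analyzer=lowercase_keyword&pretty' -d 'The quick brown fox'
```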

You can test the analyzer rules on the command line as well:

curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=lowercase' -d 'The quick brown fox'


When you define the mapping for your index, you can control how each field is analyzed.  First, you can specify *whether* the field is to be analyzed or indexed at all:

"myField": {
    "index": "not_analyzed"
}
By using “not_analyzed”, the value of the field will not be tokenized in any way and will only be available as a raw string.  Since this is very useful for logs, the default template in logstash uses this to create the “.raw” fields (e.g. myField.raw).

You can also specify “no”, which will prevent the field from being indexed at all.
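For reference, here is a hedged sketch of what such a “.raw” multi-field mapping looks like in ES 1.x syntax (the index, type, and field names are invented, and the real logstash template differs in its details):

```shell
# Hypothetical names throughout.  Maps "myField" as an analyzed string
# with a "raw" sub-field that is stored un-tokenized, so both
# myField (tokenized) and myField.raw (exact string) can be queried.
curl -XPUT 'localhost:9200/test-index/_mapping/test-type' -d '{
  "test-type": {
    "properties": {
      "myField": {
        "type": "string",
        "fields": {
          "raw": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}'
```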

If you would like to use a different analyzer for your field, you can specify that:

"myField": {
    "analyzer": "spanish"



It was a nice run, but logstash-forwarder is dead.  In its place comes filebeat, a lightweight (still Java-free and written in Go) log file shipper that is actually supported by Elastic.

If you’re coming from logstash-forwarder, Elastic provides a migration guide.

We hope to migrate our own stuff to filebeat soon, which will certainly yield more postings.  Stay tuned!


Testing your logstash configuration


As your logstash configuration grows (mine is over 3,000 lines in 40+ files right now), you’ll want a way to make sure you don’t break anything with a new release (and that Elastic doesn’t, either!).

You can run logstash with the ‘--configtest’ option, but that only checks for syntax errors.

Logstash uses the rspec harness, so I wanted to start there.  All the documentation I could find on the web was old and wrong (even postings from Elastic).  Thanks to the great community in the #logstash IRC channel, I was able to get it working.


I had three goals when I started:

  • to re-use my production configuration files, which are already logically arranged (“apache”, “yum”, etc).  I did not want to repeat the configuration in the test files.
  • to test my desired output.  OK, this probably goes without saying.
  • to use logstash 1.4, which is [still] in use by my main cluster.  I’ve heard that things change some in 1.5, as you might expect.


You can either run the test scripts on a production machine (using a short-lived second instance of logstash) or make a VM that has logstash and your logstash filter{} configurations installed.


As part of your installation, all your configs should be in /etc/logstash/conf.d.  We’ll use a small config example, test.conf:

filter {
    if [type] == "test" {
        grok {
            # %{TZ} soaks up the "UTC" token, which TIMESTAMP_ISO8601 does not match;
            # leaving it unnamed means it doesn't create a field
            match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} %{TZ} %{WORD:word1} %{INT:int1:int} %{NUMBER:[inner][float1]:float}" ]
            tag_on_failure => [ "_grokparsefailure_test" ]
        }
    }
}

This grok pattern would match a string like this:

2015-12-01 12:01:02.003 UTC Hello 42 3.14159

and create four fields of the appropriate types (string, integer, float).

On my logstash machine, the built-in rspec files are in /opt/logstash/spec.  I made a parallel directory as /opt/logstash/my-spec that contained this script, test.rb:

# encoding: utf-8
require "test_utils"

# Load the production filter config rather than repeating it here
file = "/etc/logstash/conf.d/test.conf"
@@configuration = String.new
@@configuration << File.read(file)

describe "Test event" do
  extend LogStash::RSpec

  config(@@configuration)

  message = %(2015-12-01 12:01:02.003 UTC Hello 42 3.14159)

  sample("message" => message, "type" => "test") do
    insist { subject["type"] } == "test"
    insist { subject["timestamp"] } == "2015-12-01 12:01:02.003"
    insist { subject["word1"] } == "Hello"
    insist { subject["int1"] } == 42
    insist { subject["inner"]["float1"] } == 3.14159
  end
end

Then run it:

cd /opt/logstash
bin/logstash rspec my-spec/test.rb

and you should see

Finished in 0.361 seconds
1 example, 0 failures


In my real-world config, I have a series of filters in one file that do a lot of processing on the events.  In the simple example above, you can also imagine wanting to run the date{} filter on the `timestamp` field to update @timestamp.  As you add more complexity, update the test cases to match.  So, if your config included date{}, you would add this:
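A minimal sketch of such a date{} stanza (assuming the grok'd `timestamp` field holds just the date and time, and that the source is known to be UTC):

```
date {
    # Joda-style format matching "2015-12-01 12:01:02.003"
    match => [ "timestamp", "yyyy-MM-dd HH:mm:ss.SSS" ]
    timezone => "UTC"
}
```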

insist { subject["@timestamp"] } == Time.iso8601("2015-12-01T12:01:02.003Z").utc

Please note that this example uses the “insist” assertion syntax that ships with logstash 1.4.  I’ve heard that rspec3 moves to “expect”.  There may be other syntax changes when using logstash 2.