{"id":166,"date":"2015-12-11T17:28:50","date_gmt":"2015-12-11T17:28:50","guid":{"rendered":"http:\/\/svops.com\/blog\/?p=166"},"modified":"2015-12-11T17:28:50","modified_gmt":"2015-12-11T17:28:50","slug":"introduction-to-elasticsearch-tokenization-and-analysis","status":"publish","type":"post","link":"http:\/\/svops.com\/blog\/introduction-to-elasticsearch-tokenization-and-analysis\/","title":{"rendered":"Introduction to Elasticsearch Tokenization and Analysis"},"content":{"rendered":"<p>Elasticsearch is a text engine. \u00a0This is usually good if you have text to index, but can cause problems with other types of input (log files). \u00a0One of the more confusing elements of elasticsearch is the idea of tokenization and how fields are analyzed.<\/p>\n<h1>Tokens<\/h1>\n<p>In a text engine, you might want to take a string\u00a0and search for each &#8220;word&#8221;. \u00a0The rules that are used to convert a string into words are defined in a <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-tokenizers.html\" target=\"_blank\">tokenizer<\/a>. 
\u00a0 A simple string:<\/p>\n<blockquote><p>The quick brown fox<\/p><\/blockquote>\n<p>can easily\u00a0be processed into a series of tokens:<\/p>\n<blockquote><p>[&#8220;the&#8221;, &#8220;quick&#8221;, &#8220;brown&#8221;, &#8220;fox&#8221;]<\/p><\/blockquote>\n<p>But what about punctuation?<\/p>\n<blockquote><p>Half-blood prince<\/p><\/blockquote>\n<p>or<\/p>\n<blockquote><p>\/var\/log\/messages<\/p><\/blockquote>\n<p>The default tokenizer in elasticsearch will split those up:<\/p>\n<blockquote><p>[&#8220;half&#8221;, &#8220;blood&#8221;, &#8220;prince&#8221;]<\/p>\n<p>[&#8220;var&#8221;, &#8220;log&#8221;, &#8220;messages&#8221;]<\/p><\/blockquote>\n<p>Unfortunately, this means that searching for &#8220;half-blood prince&#8221; might\u00a0also find you an article about a royal <em>prince<\/em> who fell <em>half<\/em> way to the floor while donating <em>blood<\/em>.<\/p>\n<p>As of this writing, there are 12 <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-tokenizers.html\" target=\"_blank\">built-in tokenizers<\/a>.<\/p>\n<p>You can test some input text against a tokenizer on the command line:<\/p>\n<pre>curl -XGET 'localhost:9200\/_analyze?analyzer=standard&amp;pretty' -d '\/var\/log\/messages'<\/pre>\n<h1>Analyzers<\/h1>\n<p>An <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/current\/analysis-analyzers.html\" target=\"_blank\">analyzer<\/a> lets you combine a tokenizer with some other rules to determine how the text will be indexed. 
\u00a0This is not something I&#8217;ve had to do, so I don&#8217;t have examples or caveats yet.<\/p>\n<p>You can test the analyzer rules on the command line as well:<\/p>\n<pre><span class=\"pln\">curl <\/span><span class=\"pun\">-<\/span><span class=\"pln\">XGET <\/span><span class=\"str\">'localhost:9200\/_analyze?tokenizer=keyword&amp;filters=lowercase'<\/span> <span class=\"pun\">-<\/span><span class=\"pln\">d <\/span><span class=\"str\">'The quick brown fox'<\/span><\/pre>\n<h1>Mappings<\/h1>\n<p>When you define the mapping for your index, you can control how each field is analyzed. \u00a0First, you can specify <em>if<\/em> the field is even to be analyzed or\u00a0indexed:<\/p>\n<pre><span class=\"str\">\"myField\"<\/span><span class=\"pun\">:<\/span> <span class=\"pun\">{<\/span>\r\n<span class=\"str\">    \"type\"<\/span><span class=\"pun\">:<\/span> <span class=\"str\">\"string\",\r\n<\/span><span class=\"str\">    \"index\"<\/span><span class=\"pun\">:<\/span> <span class=\"str\">\"not_analyzed\"\r\n<\/span><span class=\"pun\">}<\/span><\/pre>\n<p>By using &#8220;not_analyzed&#8221;, the value of the field will not be tokenized in any way and will only be available as a raw string. \u00a0Since this is very useful for logs, the default template in logstash uses this to create the &#8220;.raw&#8221; fields (e.g. myField.raw).<\/p>\n<p>You can also specify &#8220;no&#8221;, which will prevent the field from being indexed at all.<\/p>\n<p>If you would like to use a different analyzer for your field, you can specify that:<\/p>\n<pre><span class=\"str\">\"myField\"<\/span><span class=\"pun\">:<\/span> <span class=\"pun\">{\r\n<\/span><span class=\"str\">    \"type\"<\/span><span class=\"pun\">:<\/span> <span class=\"str\">\"string\",\r\n<\/span><span class=\"str\">    \"analyzer\"<\/span><span class=\"pun\">:<\/span> <span class=\"str\">\"spanish\"\r\n<\/span><span class=\"pun\">}<\/span><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Elasticsearch is a text engine. \u00a0This is usually good if you have text to index, but can cause problems with other types of input (log files). 
\u00a0One of the more confusing elements of elasticsearch is the idea of tokenization and &hellip; <a href=\"http:\/\/svops.com\/blog\/introduction-to-elasticsearch-tokenization-and-analysis\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts\/166"}],"collection":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/comments?post=166"}],"version-history":[{"count":1,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts\/166\/revisions"}],"predecessor-version":[{"id":167,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts\/166\/revisions\/167"}],"wp:attachment":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/media?parent=166"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/categories?post=166"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/tags?post=166"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}