Category Archives: ELK (Elasticsearch/Logstash/Kibana)

Adventures in ELK pipeline debugging

I recently brought up some machines in a new colo and needed to ship their logs to a local logstash and then on to a centralized elasticsearch server at another site.

The first batch of machines seemed to come up OK.  When I brought the second set up a few weeks later, I started seeing errors in logstash:

Lumberjack input: the pipeline is blocked, temporary refusing new connection.

Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.

The “refusing new connection” part was easily confirmed in the logstash-forwarder logs:

Connecting to [10.1.2.3]:5544 (10.1.2.3)
Connected to 10.1.2.3
Read error looking for ack: EOF

These errors seem to be commonly understood to mean that your ELK infrastructure is congested.  Perhaps logstash was dead or slow (it wasn’t); perhaps elasticsearch was dead or slow (it wasn’t, either).

Elasticsearch looked healthy, and there were no signs of a similar problem in other colos.  We’d see this behavior for most of an hour, and then a small batch of logs would make it through, which just added to the mystery.

Upon further investigation, we found that even the first batch of machines was suffering from this problem.  That they had been able to process a backlog of files and stay up to date was surprising.

Turning up debugging in logstash yielded one additional message, but it was basically a repeat of information already seen.

The next step was to see what was going on when logstash talked to elasticsearch.  I grabbed a few minutes’ worth of traffic with tcpdump:

tcpdump -ni eth0 -w tcpdump.out tcp port 9200

This showed three interesting datapoints:

  1. all interactions between logstash and elasticsearch resulted in “200” responses (OK)
  2. the small successful batches were being sent exactly 40 seconds apart
  3. elasticsearch was responding quickly to the indexing requests that made it through.

A google search for issues around “40 seconds” turned up nothing.

To confirm the 40-second problem, I ran another tcpdump but restarted logstash at the same time.  This showed it to work for a while (~2 minutes), and then slow down to 40-second intervals.

There didn’t seem to be much more to learn from tcpdump.  That led to java debugging steps, starting with a thread dump:

kill -3 <pid>

This added a thread dump to logstash’s logfile.  Looking through it, I found about 1/3 of the threads were “BLOCKED”:


java.lang.Thread.State: BLOCKED (on object monitor)
at java.net.InetAddress.getLocalHost(InetAddress.java:1486)
- waiting to lock <0x0000000538a5dfe0> (a java.lang.Object)
at org.jruby.ext.socket.SocketUtils.gethostname(SocketUtils.java:77)
at org.jruby.ext.socket.RubySocket.gethostname(RubySocket.java:257)
at org.jruby.ext.socket.RubySocket$INVOKER$s$0$0$gethostname.call(RubySocket$INVOKER$s$0$0$gethostname.gen)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:134)
at rubyjit.LogStash::Filters::Metrics$$flush_486027b1b0fe337b5753334309092d65d96630e51028566121.__file__(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-metrics-1.0.0/lib/logstash/filters/metrics.rb:186)
at rubyjit.LogStash::Filters::Metrics$$flush_486027b1b0fe337b5753334309092d65d96630e51028566121.__file__(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-metrics-1.0.0/lib/logstash/filters/metrics.rb)

The great thing here is the stack trace, which showed that the blocked call was coming from the metrics{} filter, on this line:

event["message"] = Socket.gethostname

This led to the discovery that these logstash instances – in this colo only – didn’t have access to a DNS server.  The lookup is there so that the metric event can be tagged with the hostname that generated it; without working DNS, the gethostname() call blocked.

Adding a line to /etc/hosts fixed the problem:

127.0.0.1 myhost.mydomain.com

Hopefully nobody else will hit this exact problem, but perhaps the debugging techniques described above will be helpful.

Filebeat

It was a nice run, but logstash-forwarder is dead.  In its place comes filebeat, a lightweight (still Java-free and written in Go) log file shipper that is actually supported by Elastic.

If you’re coming from logstash-forwarder, Elastic provides a migration guide.

We hope to migrate our own stuff to filebeat soon, which will certainly yield more postings.  Stay tuned!


Testing your logstash configuration

Overview

As your logstash configuration grows (mine is over 3,000 lines in 40+ files right now), you’ll want a way to make sure you don’t break anything with a new release (and that Elastic doesn’t, either!).

You can run logstash with the ‘--configtest’ option, but that only checks for syntax errors.
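With a 1.x install under /opt/logstash, that check is roughly:

/opt/logstash/bin/logstash agent --configtest -f /etc/logstash/conf.d/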

Logstash uses the rspec harness, so I wanted to start there.  All the documentation I could find on the web was old and wrong (even postings from Elastic).  Thanks to the great community in the #logstash IRC channel, I was able to get it working.

Goals

I had three goals when I started:

  • to re-use my production configuration files, which are already logically arranged (“apache”, “yum”, etc).  I did not want to repeat the configuration in the test files.
  • to test my desired output.  OK, this probably goes without saying.
  • to use logstash 1.4, which is [still] in use by my main cluster.  I’ve heard that things change somewhat in 1.5, as you might expect.

Setup

You can either run the test scripts on a production machine (using a short-lived second instance of logstash) or build a VM that has logstash and your logstash filter{} configurations installed.

Config

As part of your installation, all your configs should be in /etc/logstash/conf.d.  We’ll use a small config example, test.conf:

filter {
    if [type] == "test" {
        grok { 
            match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} %{WORD:word1} %{INT:int1:int} %{NUMBER:[inner][float1]:float}" ]
            tag_on_failure => [ "_grokparsefailure_test" ]
        }
    } 
}

This grok pattern would match a string like this:

2015-12-01 12:01:02.003 Hello 42 3.14159

and create four fields of the appropriate type (string, integer, float).

On my logstash machine, the built-in rspec files are in /opt/logstash/spec.  I made a parallel directory as /opt/logstash/my-spec that contained this script, test.rb:

# encoding: utf-8
require "test_utils"

file = "/etc/logstash/conf.d/test.conf"
@@configuration = String.new
@@configuration << File.read(file)

describe "Test event" do
  extend LogStash::RSpec

  config(@@configuration)

  message = %(2015-12-01 12:01:02.003 Hello 42 3.14159)

  sample("message" => message, "type" => "test") do
    insist { subject["type"] } == "test"
    insist { subject["timestamp"] } == "2015-21-01 12:01:02.003 UTC"
    insist { subject["word1"] } == "Hello"
    insist { subject["int1"] } == 42
    insist { subject["inner"]["float1"] } == 3.14159
  end
end

Running

cd /opt/logstash
bin/logstash rspec my-spec/test.rb

and you should see

Finished in 0.361 seconds
1 example, 0 failures

Conclusion

In my real-world config, I have a series of filters in one file that do a lot of processing on the events.  In the simple example above, you can also imagine wanting to run the date{} filter on the `timestamp` field to update @timestamp.  As you add more complexity, update the test cases to match.  So, if your config included date{}, you would add this:

insist { subject["@timestamp"] } == Time.iso8601("2015-12-01T12:01:03.003Z").utc

Please note that this example uses the rspec2 “insist” construct.  I’ve heard that rspec3 uses “expect”.  There may be other syntax issues when using logstash2.

Finding the lag in your ELK cluster

By default, logstash sets @timestamp to the time when it processes the log.  Most people (correctly) use the date{} filter to set it to the timestamp of the event itself.

Before you run the date{} filter, tuck away the value of @timestamp with a little bit of ruby:

ruby {
  code => "
    begin
      event['logstash_ts'] = event['@timestamp']
    rescue Exception => e
      event['logstash_ruby_exception'] = '[logstash_ts]: ' + e.message
    end
  "
}

This will create a new field called ‘logstash_ts’ that contains the timestamp when logstash received the event.

Now it’s safe to run your date{} filter to change @timestamp to the event’s timestamp.

After the date{} filter, compute the difference between the event’s time and the received time.  That’s the lag in getting the event processed by logstash:

ruby {
  code => "
    begin
      event['logstash_lag'] = ( ( event['logstash_ts'] - event['@timestamp'] ) ).to_f
    rescue Exception => e
      event['logstash_ruby_exception'] = '[logstash_lag]: ' + e.message
    end
  "
}

If this lag is high, the problem could be:

  • a delay in shipping the logs.  Check the health of your shipper.
  • a delay in processing the logs.  Is your logstash machine powerful enough?  Is your config “economical”?

There’s one more timestamp available, which is when Elasticsearch actually indexes the document.   You can enable the field in your mapping:

curl -XPUT localhost:9200/_template/timestamp -d '
{
  "template" : "*",
  "order" : 10000,
  "mappings" : {
    "_default_" : {
      "_timestamp" : { "enabled" : true }
    }
  }
}
'

Note that this is deprecated in elasticsearch 2.0 and there’s no viable alternative!

Any new indexes created after that template is applied will have _timestamp stored for each document.

If the difference between our logstash_ts field and elasticsearch’s _timestamp is high, then:

  • you have connectivity problems between logstash and elasticsearch
  • elasticsearch is busy doing something else (check garbage collection!)
  • (maybe) you have a high index refresh_interval.

I haven’t found an easy way to compute the difference between these last two timestamps.  Scripted fields should work, but they’re disabled and/or poorly documented.


Check out the tips!

I’ve just added a ‘Logstash Tips’ menu section which will appear at the top of the page.  The tips don’t appear in the regular blog article area.  This section will replace pastebin as a repository for reusable tips.

I will be converting more logstash tips over to this section, and will also add sections for Elasticsearch, python, and grok{} patterns.

ELK System Overview

Overview

There seem to be a lot of old ELK guides on the internet.  It’s time to make a new one that can begin its own eventual decay.

To help minimize the aging process, I’m not going to cover how to install specific packages on specific platforms, but rather discuss the choice of tools and configurations that are available.

Shipper

You have to get your log files off the remote machines and eventually into Elasticsearch.  It’s the shipper’s job to, um, ship the logs to the next step.

There are a few shippers, some of which are outlined here.

tl;dr: use filebeat if you're only moving logs around.

logstash

You can use the full logstash build as your shipper.  There are almost no reasons to do this.  Being JVM-based, it’s big and uses a lot of memory.  It also has more features than you’ll typically need in a shipper.

The only excuse to run logstash as a remote shipper is if you have a ton of logs and need to apply some business logic about which ones to ship.  For example, maybe DEBUG logging is enabled in your production environment (?); you could use a full logstash to ship only the more important levels for processing.
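As a sketch, the shipper-side filter for that case might be as simple as this (the assumption that the level appears literally in the message is mine, not from any particular log format):

filter {
  # drop debug-level lines before they leave the box
  if [message] =~ /DEBUG/ {
    drop { }
  }
}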

logstash-forwarder

NOTE: logstash-forwarder is dead.  See filebeat, below.

Before filebeat, this was the right choice.  It’s a light-weight program that does nothing other than read log files and send them to logstash.  Traffic is encrypted with SSL, so certs are required.

logstash-forwarder speaks the “lumberjack” protocol with logstash.
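For reference, a minimal logstash-forwarder config looks roughly like this (the paths, server address, and cert location are placeholders):

{
  "network": {
    "servers": [ "10.1.2.3:5544" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt"
  },
  "files": [
    {
      "paths": [ "/var/log/messages", "/var/log/secure" ],
      "fields": { "type": "syslog" }
    }
  ]
}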

filebeat

Filebeat is the replacement for logstash-forwarder.  It’s also lightweight, gives you the option of not using encryption, and Elastic is planning to add some nice client-side features (multiline support and a basic ‘grep’).

Filebeat requires logstash 1.5+.
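A minimal filebeat.yml sketch (filebeat 1.x syntax; the paths and logstash host are placeholders):

filebeat:
  prospectors:
    -
      paths:
        - /var/log/messages
      document_type: syslog
output:
  logstash:
    hosts: ["10.1.2.3:5044"]

On the logstash side, this pairs with the beats{} input rather than the lumberjack{} input.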

beaver

If you need a broker (see below), then beaver is a lightweight shipper that can talk to redis.  Note that its traffic is not encrypted.

Broker

Many guides describe the use of a broker like redis or rabbitmq between the shipper and logstash.

If you’re using logstash and/or logstash-forwarder as your shipper, you don’t need a broker.  Both of these packages keep track of where they are in the local files, and should recover from a logstash outage.  (If the outage lasts through a file rotation, this may not be true!).

I only like to use brokers when shipping logs from systems that don’t automatically handle logstash failures (e.g. syslog, netflow, etc.).  This covers unplanned outages, and also lets you release changes to logstash without losing data.
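On the logstash indexer, reading from a redis broker is just a redis{} input; a sketch (the host and key are placeholders):

input {
  redis {
    host => "broker.example.com"
    data_type => "list"
    key => "logstash"
  }
}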

Logstash Indexer

Here’s where the magic happens….  Unstructured data is turned into structured information.   See our guide to patterns and best practices.

Elasticsearch

The storage part of the whole equation.

See our best practices.

Better than NewRelic

Application Performance Logging

You can learn a lot about your code by instrumenting it.  You’ll find methods that are called more than you’re expecting, or be able to identify which part of your code is slowing the app down.

There are SaaS solutions for this, like New Relic.  In my experience, New Relic “runs” out of the box, and may provide information on common tools (MySQL, etc), but doesn’t provide any real information about your app without adding instrumentation calls to your code.

Also, New Relic works on an aggregated summary of your application.  If it supplies details on a specific request, I never found it.

If you have to instrument your code, and you want to know exactly what your users are doing, then what’s the appeal of New Relic?

Recently, I was involved with a project to instrument some “under performing” python code.  New Relic had been installed, but had provided no value.

Caveat: I was involved on the analysis side.  Another developer wrote the logging framework.

A performance class was created, and we added a custom Django handler that initialized a performance object for each request.

To instrument a method, you would wrap it in a ‘with’ statement:

with performanceTracking("block name"):
    # ... the block of code being timed ...

The ‘with’ syntax provides __enter__ and __exit__ hooks, which the performance class used to start and stop a timer for that block of code.  On __exit__, logging information for the block was added to a data structure for the entire request.
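A minimal sketch of such a class (the name comes from the example above; the real implementation also nested child blocks under the request instead of logging each block on its own):

import json
import logging
import time

logger = logging.getLogger("performance")

class performanceTracking(object):
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        # start the timer for this named block
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # stop the timer and log this block's stats at DEBUG
        duration_ms = int((time.time() - self.start) * 1000)
        logger.debug(json.dumps({
            "name": self.name,
            "calls": 1,
            "duration": duration_ms,
        }))
        return False  # don't swallow exceptions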

When the request finished, the handler would write the entire logging block using the standard logging mechanism.  Logging these at DEBUG made it easy to disable them in production.

What you ended up with was a nested set of performance information:

{
    "requestid": "12345",
    "name": "my web page",
    "calls": 1,
    "duration": 200,
    "children": [
        {
           "calls": 2,
           "name": "mysql user lookup",
           "duration": 190
        },
        {
           "calls": 1,
           "name": "something else",
           "duration": 10
        }
    ]
}

You could now see that “mysql user lookup” was called twice (!), and was responsible for 95% of the time spent (!).  Even better, you knew this information for this individual request.  You could summarize the performance of these blocks across the system (à la New Relic), but you could also zoom in and see the details for this particular user.  Powerful stuff, right?

With your own code, the possibilities are limitless.  In a web app environment, maybe the page is authenticated.  Add the user’s ID to the logging block, and now you can see exactly what happened for Mary on Tuesday at 10AM.  If you run multiple applications or subsystems, add their identifiers to the logs and you can break performance out by component.

Once the information was written to the logs, it was loaded into an Elasticsearch cluster for analysis.

Performance Analysis

With all of this data, the next step is to see what’s going on.  The ELK environment has come a long way in the year since this project, so it would probably be even easier now!

With the data parsed in logstash and fed into elasticsearch, you can make dashboards in kibana to show things like:

  • total application performance over time (graph the top-level “duration” field).
  • the performance of each code block over time
  • the most-frequently called code blocks
  • the worst-performing code blocks
  • whether performance degrades with load

We found it to be very useful to wrap every third-party API call with this framework.  If the vendor was slow, we’d be slow, but now we had our own data to isolate the cause and work with the vendor.

Combine this information with other sources of data and you can determine neat stuff like:

  • the overhead added by your web server (the difference between the total request time and the application’s total time).

Doing a basic code review on the worst-performing block would usually make it very clear why the performance was suffering.  I was the largest creator of tickets for the dev group in the company.

Using this information, we focused on two goals:

  • making the lines flat, i.e. creating consistent, scalable performance.
  • making the lines lower, i.e. increasing overall performance.

We were very successful in both of these goals.

We had several interesting “deep dives” into this data, but one of them stands out:

The whole system was monitored with Gomez, which one of the executives liked.  Gomez started to report performance problems, but they weren’t seen in the summary data that we graphed.

Since we had the details, I was able to find the exact request that Gomez had made (user name and time), and see the block of code that was slow.

It turned out that the system was writing a row into a table every time a user logged in.  The Gomez test account logged in a lot, so there were a lot of rows for this user.  The application pulled these rows, looking for some recent information (last search string, etc).  Unfortunately, the app pulled in all the rows, not the just recent ones that it needed.

It was easy to find and easy to fix.  Management was happy.

Elasticsearch best practices

Don’t forget to check out the Logstash best practices, too.

Memory

Give elasticsearch half of your system’s RAM, up to 32GB.

Make sure the allocated memory doesn’t get swapped out by using mlockall.  In your config/elasticsearch.yml, add:

bootstrap.mlockall: true

You may need to allow this as part of the startup by running

ulimit -l unlimited

On (at least) centos6, you can have this run for you in the init.d script by adding this line to /etc/sysconfig/elasticsearch:

MAX_LOCKED_MEMORY=unlimited

For centos7, edit /usr/lib/systemd/system/elasticsearch.service:

LimitMEMLOCK=infinity

After restarting, confirm the setting is correct in elasticsearch:

curl http://localhost:9200/_nodes/process?pretty

Index Names

Use an index for each day.  There are only two ways to delete data in elasticsearch (deleting individual documents or dropping entire indexes), and using curator to drop old daily indexes is the right one.

Note that this is the default from logstash.
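For example, with curator 3.x a nightly cleanup might look roughly like this (the 30-day retention and index prefix are assumptions, and flags vary between curator versions):

curator --host localhost delete indices --prefix logstash- --older-than 30 --time-unit days --timestring '%Y.%m.%d'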

Run an odd number of nodes

This helps prevent the split-brain problem (along with setting discovery.zen.minimum_master_nodes to a quorum of your master-eligible nodes).

Run at least three nodes

With one replica (two copies), using three nodes will give you an I/O boost.

Adjust the Mapping

Elasticsearch supports many different field types, and you should use the appropriate one for each field.

By using ‘int’, you can use comparisons (“http_status:>500”) or ranges (“http_status:[400 TO 499]”).  Other field types give similar benefits over just using strings.
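If you need to force a type, you can do it with an index template, similar to the _timestamp template shown earlier (the field name and index pattern here are just examples):

curl -XPUT localhost:9200/_template/field_types -d '
{
  "template" : "logstash-*",
  "mappings" : {
    "_default_" : {
      "properties" : {
        "http_status" : { "type" : "integer" }
      }
    }
  }
}
'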

Logstash best practices

Don’t forget to check out the Elasticsearch best practices, too.

Set @timestamp

If your log entry contains the time at which the event occurred, use the date{} filter to replace @timestamp with this value.  See “Save the Date”, too.

If you’ve already grok{}’ed out a field called ‘apache_timestamp’, your date{} stanza might look like this:

filter {
  if [type] == "access" {
    date {
      match => [ 'apache_timestamp', "dd/MMM/yyyy:HH:mm:ss Z" ]
      remove_field => [ 'apache_timestamp' ]
    }
  }
}

Save the Date

By overwriting @timestamp (as suggested above), you won’t know when logstash processed the event.  Being able to measure this lag is important.  Cover this by copying @timestamp to another field (using mutate’s add_field) before applying the date{} filter.
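A sketch of that copy, assuming you’re happy storing the received time as a string (the field name is arbitrary):

filter {
  mutate {
    # stash the receive time before date{} overwrites @timestamp
    add_field => { "logstash_received_at" => "%{@timestamp}" }
  }
}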

Use Smart Patterns

You can cast the field to integer or float in the pattern, e.g.:

%{INT:my_field:int}

Use grok’s tag_on_failure

When the grok{} filter fails, it adds a tag called “_grokparsefailure”.  This is helpful, unless you have multiple grok{} filters.  In this case, add a unique tag_on_failure attribute to each grok{}.

grok {
    match => [
        "message", "%{PATTERN_1}"
    ]
    tag_on_failure => [ "failedPattern1" ]
}

Note that the syslog{} input uses grok{} internally, so it can also throw _grokparsefailure messages.  In logstash 1.5, this tag is _grokparsefailure_sysloginput.

Increase the Output Workers

Many outputs (including elasticsearch{}) support multiple output workers.  Each worker maintains its own cache of flush_size events.  Specifying multiple output workers can increase throughput.
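A sketch for a logstash 1.x elasticsearch{} output (the host and numbers are made up; tune them for your hardware):

output {
  elasticsearch {
    host => "es.example.com"
    protocol => "http"
    workers => 4
    flush_size => 500
  }
}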

Increase the Filter Workers

If you’re not using the multiline{} filter, you can increase the number of logstash filter worker threads.  This will better utilize the CPU of your logstash machine.  In your startup script, use the “-w” flag with a value slightly less than the number of CPUs on the machine, e.g.:

-w 18

Increase the Open File Descriptors

On centos, the init script will run ‘ulimit -n’ for you if you set the value in /etc/sysconfig/logstash, e.g.:

LS_OPEN_FILES=65535


Introduction to Logstash Grok Patterns

Introduction

The grok filter – and its use of patterns – is the truly powerful part of logstash.   Grok allows you to turn unstructured log text into structured data.

grok

The grok filter attempts to match a field with a pattern.  Think of a pattern as a named regular expression.  Patterns allow for increased readability and reuse.  If the pattern matches, logstash can create additional fields (similar to a regex capture group).

This example takes the event’s “message” field and attempts to match it against 5 different patterns (e.g. “IP”, “WORD”).  If it finds a match for the entire expression, it will add a field for each named pattern (the IP will be stored in the “client” field, etc).

filter {
 grok {
   match => [ "message", "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" ]
 }
}

If the input doesn’t match the pattern, a tag will be added for “_grokparsefailure”.  You can (and should; see best practices) customize this tag.

Patterns

Logstash ships with lots of predefined patterns.  You can browse them on github.

Patterns consist of a label and a regex, e.g.:

USERNAME [a-zA-Z0-9._-]+

In your grok filter, you would refer to this as %{USERNAME}:

filter {
 grok {
   match => [ "message", "%{USERNAME}" ]
 }
}

Patterns can contain other patterns, e.g.:

SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}


Target Variables

A pattern can store the matched value in a new field.  Specify the field name in the grok filter:

filter {
 grok {
   match => [ "message", "%{USERNAME:user}" ]
 }
}

If you’re using a regexp, you can make a new field with an Oniguruma trick:

filter {
  grok {
    match => [ "message", "(?<myField>[a-z]{3})" ]
  }
}

This would find three lower case letters and create a field called ‘myField’.

Casting

By default, grok’ed fields are strings.  Numeric fields (int and float) can be declared in the pattern:

filter {
 grok {
   match => [ "message", "%{USERNAME:user:int}" ]
 }
}

Note that this is just a hint that logstash will pass along to elasticsearch when it tries to insert the event.  If the field already exists in the index with a different type, this won’t change the mapping in elasticsearch until a new index is created.

Custom Patterns

While logstash ships with many patterns, you eventually will need to write a custom pattern for your application’s logs.  The general strategy is to start slowly, working your way from the left of the input string, parsing one field at a time.
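For example, a common approach while developing is to anchor the pieces you’ve already worked out and dump the remainder into a temporary field (the pattern below is illustrative, not tied to any particular log format):

filter {
  grok {
    match => [ "message", "^%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:rest_of_line}" ]
  }
}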

Your pattern does not need to match the entire event message, so you can skip leading and trailing information if you just need something from the middle.

Grok uses Oniguruma regular expressions.

Be sure to use the debugger (see below) when developing custom patterns.

Debugging

There is an online grok debugger available for building and testing patterns.