Finding the lag in your ELK cluster

By default, logstash sets @timestamp to the time when it processes the log.  Most people (correctly), use the date filter to set this to the event’s log.

Before you run the date{} filter, tuck away the value of @timestamp with a little bit of ruby:

ruby {
  code => "
    begin
      event['logstash_ts'] = event['@timestamp']
      rescue Exception => e
        event['logstash_ruby_exception'] = '[logstash_ts]: ' + e.message
    end
  "
}

This will create a new field called ‘logstash_ts’ that contains the timestamp when logstash received the event.

Now it’s safe to run your date{} filter to change @timestamp to the event’s timestamp.

After the date{} filter, compute the difference between the event’s time and the received time.  That’s the lag in getting the event processed by logstash:

ruby {
  code => "
    begin
      event['logstash_lag'] = ( ( event['logstash_ts'] - event['@timestamp'] ) ).to_f
    rescue Exception => e
      event['logstash_ruby_exception'] = '[logstash_lag]: ' + e.message
    end
  "
}

If this lag is high, the problem could be:

  • a delay in shipping the logs.  Check the health of your shipper.
  • a delay in processing the logs.  Is your logstash machine powerful enough?  Is your config “economical”?

There’s one more timestamp available, which is when Elasticsearch actually indexes the document.   You can enable the field in your mapping:

curl -XPUT localhost:9200/_template/timestamp -d '
 {
 "template" : "*",
 "order" : 10000,
 "mappings" : {
  "_default_" : {
   "_timestamp" : { "enabled" : true }
  }
 }
 }
 '

Note that this is deprecated in elasticsearch 2.0 and there’s no viable alternative!

Any new indexes created after that template is applied will have _timestamp stored for each document.

If the difference between our logstash_ts field and elasticsearch’s _timestamp is high, then:

  • you have connectivity problems between logstash and elasticsearch
  • elasticsearch is busy doing something else (check garbage collection!)
  • (maybe) you have a high index refresh_interval.

I haven’t found an easy way to compute the difference between these last two timestamps.  Scripted fields should work, but they’re disabled and/or poorly documented.

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *