Adventures in ELK pipeline debugging

I recently brought up some machines in a new colo and needed to ship their logs to a local logstash and then on to a centralized elasticsearch server at another site.

The first batch of machines seemed to come up OK.  When I brought the second set up a few weeks later, I started seeing errors in logstash:

Lumberjack input: the pipeline is blocked, temporary refusing new connection.

Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.

The “refusing new connection” part was easily confirmed in the logstash-forwarder logs:

Connecting to [10.1.2.3]:5544 (10.1.2.3)
Connected to 10.1.2.3
Read error looking for ack: EOF

These errors seem to be common understood to mean that your ELK infrastructure is congested.  Perhaps logstash is dead/slow (no); perhaps elasticsearch was dead/slow (no).

Elasticsearch looked healthy, and there were no signs of similar problem in other colos.  We’d see this behavior for most of an hour, and then a small batch of logs would make it through, which just added to the mystery.

Upon further investigation, we found that even the first batch of machines was suffering from this problem.  That they were able to process a backlog of files and keep up to date is surprising.

Turning up debugging in logstash yielded one additional message, but it was basically a repeat of information already seen.

The next step was to see what was going on when logstash talked to elasticsearch.  I grabbed a few minute’s info with tcpdump:

tcpdump -ni eth0 -w tcpdump.out tcp port 9200

This showed three interesting datapoints:

  1. all interactions between logstash and elasticsearch resulted in “200” responses (OK)
  2. the small successful batches were being sent exactly 40 seconds apart
  3. elasticsearch was responding quickly to the indexing requests that made it through.

A google search for issues around “40 seconds” turned up nothing.

To confirm the 40-second problem, I ran another tcpdump but restarted logstash at the same time.  This showed it to work for a while (~2 minutes), and then slow down to 40-second intervals.

There didn’t seem to be much more to learn from tcpdump.  That led to java debugging steps, starting with a thread dump:

kill -3 <pid>

This added a thread dump to logstash’s logfile.  Looking through it, I found about 1/3 of the threads were “BLOCKED”:


java.lang.Thread.State: BLOCKED (on object monitor)
at java.net.InetAddress.getLocalHost(InetAddress.java:1486)
- waiting to lock <0x0000000538a5dfe0> (a java.lang.Object)
at org.jruby.ext.socket.SocketUtils.gethostname(SocketUtils.java:77)
at org.jruby.ext.socket.RubySocket.gethostname(RubySocket.java:257)
at org.jruby.ext.socket.RubySocket$INVOKER$s$0$0$gethostname.call(RubySocket$INVOKER$s$0$0$gethostname.gen)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:134) at rubyjit.LogStash::Filters::Metrics$$flush_486027b1b0fe337b5753334309092d65d96630e51028566121.__file__(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-metrics-1.0.0/lib/logstash/filters/metrics.rb:186)
at rubyjit.LogStash::Filters::Metrics$$flush_486027b1b0fe337b5753334309092d65d96630e51028566121.__file__(/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-metrics-1.0.0/lib/logstash/filters/metrics.rb)

The great thing here is the stacktrace, which showed that the block was coming from the metrics{} filter, on this line:

event["message"] = Socket.gethostname

This led to the discovery that these logstash instances – in only this colo – didn’t have access to a DNS server.  This is seemingly required so that the metric event can be tagged with the hostname from which it originates.

Adding a line to /etc/hosts fixed the problem:

127.0.0.1 myhost.mydomain.com

Hopefully nobody else will hit this exact problem, but perhaps the debugging techniques described above will be helpful.

Leave a Reply

Your email address will not be published.