{"id":171,"date":"2016-02-03T23:20:06","date_gmt":"2016-02-03T23:20:06","guid":{"rendered":"http:\/\/svops.com\/blog\/?p=171"},"modified":"2016-02-03T23:20:06","modified_gmt":"2016-02-03T23:20:06","slug":"adventures-in-elk-pipeline-debugging","status":"publish","type":"post","link":"http:\/\/svops.com\/blog\/adventures-in-elk-pipeline-debugging\/","title":{"rendered":"Adventures in ELK pipeline debugging"},"content":{"rendered":"<p>I recently brought up some machines in a new colo and needed to ship their logs to a local logstash and then on to a centralized elasticsearch server at another site.<\/p>\n<p>The first batch of machines seemed to come up OK. \u00a0When I brought the second set up a few weeks later, I started seeing errors in logstash:<\/p>\n<pre>Lumberjack input: the pipeline is blocked, temporary refusing new connection.\r\n\r\nLumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.\r\n<\/pre>\n<p>The &#8220;refusing new connection&#8221; part was easily confirmed in the logstash-forwarder logs:<\/p>\n<pre>Connecting to [10.1.2.3]:5544 (10.1.2.3)\r\nConnected to 10.1.2.3\r\nRead error looking for ack: EOF<\/pre>\n<p>These errors seem to be common understood to mean that your ELK infrastructure is congested. \u00a0Perhaps logstash is dead\/slow (no); perhaps elasticsearch was dead\/slow (no).<\/p>\n<p>Elasticsearch looked healthy, and there were no signs of similar problem in other colos. \u00a0We&#8217;d see this behavior for most of an hour, and then a small batch of logs would make it through, which just added to the mystery.<\/p>\n<p>Upon further investigation, we found that even the first batch of machines was suffering from this problem. 
\u00a0That they were able to process a backlog of files and keep up to date is surprising.<\/p>\n<p>Turning up debugging in logstash yielded one additional message, but it was basically\u00a0a repeat of information already seen.<\/p>\n<p>The next step was to see what was going on when logstash talked to elasticsearch. \u00a0I grabbed a few minutes&#8217; worth of traffic with tcpdump:<\/p>\n<pre>tcpdump -ni eth0 -w tcpdump.out tcp port 9200<\/pre>\n<p>This showed three\u00a0interesting datapoints:<\/p>\n<ol>\n<li>all interactions between logstash and elasticsearch resulted in &#8220;200&#8221; responses (OK)<\/li>\n<li>the small successful batches were being sent exactly 40 seconds apart<\/li>\n<li>elasticsearch was responding quickly to the indexing requests that made it through<\/li>\n<\/ol>\n<p>A Google search for issues around &#8220;40 seconds&#8221; turned up nothing.<\/p>\n<p>To confirm the 40-second problem, I ran another tcpdump but restarted logstash at the same time. \u00a0This showed it working for a while (~2 minutes), then slowing to 40-second intervals.<\/p>\n<p>There didn&#8217;t seem to be much more to learn from tcpdump. \u00a0That led to Java debugging steps, starting with a thread dump:<\/p>\n<pre>kill -3 &lt;pid&gt;<\/pre>\n<p>This added a thread dump to logstash&#8217;s logfile. 
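<\/p>\n<p>As a rough gauge of how stuck things are, the thread states in the dump can be tallied from the shell (a sketch &#8211; the logfile path here is illustrative):<\/p>\n<pre># count threads in each state from a JVM thread dump\r\ngrep -o 'java.lang.Thread.State: [A-Z_]*' \/var\/log\/logstash\/logstash.log | sort | uniq -c | sort -rn<\/pre>\n<p>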
\u00a0Looking through it, I found about 1\/3 of the threads were &#8220;BLOCKED&#8221;:<\/p>\n<p style=\"padding-left: 30px;\"><code><br \/>\njava.lang.Thread.State: BLOCKED (on object monitor)<br \/>\nat java.net.InetAddress.getLocalHost(InetAddress.java:1486)<br \/>\n- waiting to lock &lt;0x0000000538a5dfe0&gt; (a java.lang.Object)<br \/>\nat org.jruby.ext.socket.SocketUtils.gethostname(SocketUtils.java:77)<br \/>\nat org.jruby.ext.socket.RubySocket.gethostname(RubySocket.java:257)<br \/>\nat org.jruby.ext.socket.RubySocket$INVOKER$s$0$0$gethostname.call(RubySocket$INVOKER$s$0$0$gethostname.gen)<br \/>\nat org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:134) at rubyjit.LogStash::Filters::Metrics$$flush_486027b1b0fe337b5753334309092d65d96630e51028566121.__file__(\/opt\/logstash\/vendor\/bundle\/jruby\/1.9\/gems\/logstash-filter-metrics-1.0.0\/lib\/logstash\/filters\/metrics.rb:186)<br \/>\nat rubyjit.LogStash::Filters::Metrics$$flush_486027b1b0fe337b5753334309092d65d96630e51028566121.__file__(\/opt\/logstash\/vendor\/bundle\/jruby\/1.9\/gems\/logstash-filter-metrics-1.0.0\/lib\/logstash\/filters\/metrics.rb)<\/code><\/p>\n<p>The great thing here is the stacktrace, which showed that the block was coming from the <a href=\"https:\/\/www.elastic.co\/guide\/en\/logstash\/current\/plugins-filters-metrics.html\" target=\"_blank\">metrics{} filter<\/a>, on this line:<\/p>\n<pre>event[<span class=\"pl-s\"><span class=\"pl-pds\">\"<\/span>message<span class=\"pl-pds\">\"<\/span><\/span>] <span class=\"pl-k\">=<\/span> <span class=\"pl-c1\">Socket<\/span>.gethostname<\/pre>\n<p>This led to the discovery that these logstash instances &#8211; in only this colo &#8211; didn&#8217;t have access to a DNS server. 
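<\/p>\n<p>That theory was easy to check from the shell (a sketch &#8211; output varies by distro):<\/p>\n<pre># does the machine's own hostname resolve?  this hangs or fails without\r\n# a reachable DNS server or an \/etc\/hosts entry\r\ngetent hosts $(hostname -f)\r\n\r\n# which resolvers are configured?\r\ncat \/etc\/resolv.conf<\/pre>\n<p>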
\u00a0This is seemingly required so that the metric event can be tagged with the hostname from which it originates.<\/p>\n<p>Adding a line to \/etc\/hosts fixed the problem:<\/p>\n<pre>127.0.0.1 myhost.mydomain.com<\/pre>\n<p>Hopefully nobody else will hit this exact problem, but perhaps the debugging techniques described above will be helpful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently brought up some machines in a new colo and needed to ship their logs to a local logstash and then on to a centralized elasticsearch server at another site. The first batch of machines seemed to come up &hellip; <a href=\"http:\/\/svops.com\/blog\/adventures-in-elk-pipeline-debugging\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[11],"tags":[],"_links":{"self":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts\/171"}],"collection":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/comments?post=171"}],"version-history":[{"count":1,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts\/171\/revisions"}],"predecessor-version":[{"id":172,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/posts\/171\/revisions\/172"}],"wp:attachment":[{"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/media?parent=171"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/categories?post=171"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/svops.com\/blog\/wp-json\/wp\/v2\/tags?post=171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}
]}}