Category Archives: ELK (Elasticsearch/Logstash/Kibana)

Monitoring your log files

Overview

If you’ve set up your ELK cluster and logs are flowing in from your shippers, you’re now sitting on a goldmine of data.  The question becomes, “what should I do?!??”

A first step is to make Kibana dashboards, but they provide little value in a lights-out environment (see http://svops.com/blog/?p=11).

When you’re ready to actively monitor the information that’s sitting in the cluster, you’ll want to pull it into your monitoring system (Nagios, Zabbix, ScienceLogic, whatever).

There are many benefits to this approach over Logstash’s built-in notifications, including:

  • one alerting system (common message format, distribution groups, etc).
  • one escalation system (*)
  • one acknowledgement system (*)
  • one dashboard for monitoring

(*) Logstash doesn’t provide these features.

This system is also better than using Logstash’s nagios-related plugins, since you’ll be querying all the documents in Elasticsearch, not just one document at a time.  You’ll also be using Elasticsearch as a database, rather than using Logstash’s metric{} functionality as a poor substitute.

There are two systems that you should build.  I’ll reference Nagios as the target platform.

Individual Metrics

Querying Elasticsearch for the total number of Java exceptions that have occurred is a good example of an individual metric.

In Nagios, you would first define a virtual host (e.g. “elasticsearch”, “java”, “my_app”, etc) and a virtual service (e.g. “java exceptions”).  The service would run a new command (e.g. “run_es_query”).  Set the check interval to something that makes sense for your organization.

The magic comes in writing the underlying program that is run by the “run_es_query” command.  This program should take a valid Elasticsearch query_string as a parameter, and run it against the cluster.

In the Nagios world, the script reports OK, WARNING, CRITICAL, etc. through its exit code.  The output of the script can also include performance data, which is used for charting.

The python elasticsearch module makes writing the script pretty easy.  Write one script for each query type (max, count, most recent document, etc.); this will keep the code from becoming unreadable by trying to be too generic.
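
As a rough illustration, here’s a minimal sketch of such a “count” script using the python elasticsearch module.  The index pattern, thresholds, and example query are assumptions for illustration, not values from any particular deployment:

#!/usr/bin/env python
# Minimal sketch of a Nagios "count" check against Elasticsearch.
# The index pattern, thresholds, and query below are examples only.
import sys
from elasticsearch import Elasticsearch

def main():
    # usage: check_es_count.py '<query_string>' <warn> <crit>
    query = sys.argv[1]                     # e.g. 'message:"NullPointerException"'
    warn, crit = int(sys.argv[2]), int(sys.argv[3])

    es = Elasticsearch(["http://localhost:9200"])
    body = {"query": {"query_string": {"query": query}}}
    count = es.count(index="logstash-*", body=body)["count"]

    # Nagios reads the exit code; everything after "|" is performance data.
    status, label = 0, "OK"
    if count >= crit:
        status, label = 2, "CRITICAL"
    elif count >= warn:
        status, label = 1, "WARNING"
    print("%s - %d matching documents | count=%d;%d;%d" % (label, count, count, warn, crit))
    sys.exit(status)

if __name__ == "__main__":
    main()

The “run_es_query” command definition would then just pass the query string and thresholds through to a script like this.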

Bulk Metrics

If you wanted to count the Java exceptions, but report them on a machine-by-machine basis, you would not want to launch the “individual metric” command once for each physical host.  Doing so would run one query against Elasticsearch per host, which doesn’t scale well at all.

The better alternative is to run one “bulk” script that pulls the data for all hosts from Elasticsearch and then passes that information to Nagios using the “passive check” system.  Nagios will react to the information as configured.
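
As a sketch (with an assumed index, field name, thresholds, and Nagios command-file path), the bulk version might aggregate by host in a single query and then submit one passive result per host:

#!/usr/bin/env python
# Sketch of a "bulk" check: one aggregation query, one passive check
# result per host.  The index, field name, thresholds, and command-file
# path are assumptions for illustration.
import time
from elasticsearch import Elasticsearch

NAGIOS_CMD = "/usr/local/nagios/var/rw/nagios.cmd"   # Nagios external command file

es = Elasticsearch(["http://localhost:9200"])
body = {
    "size": 0,
    "query": {"query_string": {"query": 'message:"Exception"'}},
    "aggs": {"per_host": {"terms": {"field": "host", "size": 1000}}},
}
result = es.search(index="logstash-*", body=body)

now = int(time.time())
with open(NAGIOS_CMD, "w") as cmd:
    for bucket in result["aggregations"]["per_host"]["buckets"]:
        host, count = bucket["key"], bucket["doc_count"]
        state = 2 if count >= 100 else (1 if count >= 10 else 0)
        # Format: [time] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<state>;<output>
        cmd.write("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;java exceptions;%d;%d exceptions\n"
                  % (now, host, state, count))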

Where’s the Code?

I’ve written this plugin a few times for different platforms, but always as (unsharable) work-for-hire.  I hope to rewrite this in my spare time some day, but this outline should get you started.

Debugging your ELK cluster

Question

My ELK (Elasticsearch/Logstash/Kibana) cluster isn’t working.  How do I fix it?

Answer

Start at the beginning.

The Shipper

There are several popular pieces of software to ship your logs from the client to the logstash indexer.  Whether you’re using a full logstash installation, the logstash-forwarder, beaver, or something else, start by testing the network connectivity from your client to the logstash indexer:

telnet <ls_server> <ls_port>

There is no standard logstash port, so check your server configuration for the correct value.

If you can reach the server manually, then your shipper should be able to as well.

If you cannot reach the server with telnet, then you have some networking or connectivity issue.  Go work on that!

If you’re using the full logstash agent as your shipper, run it with “--debug” and check its own log files in /var/log/logstash/.

For logstash-forwarder, run it with the “-quiet=false” flag (version 0.4) or the “-verbose -debug” flags (older versions).

Check the list of filenames that you’ve configured – do they really match your paths?  Do any wildcards expand as desired?  In logstash-forwarder’s debug mode, it will show you the list of files that it’s processing.
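
If you just want a quick sanity check on the client that a wildcard matches anything at all, a few lines of python will do (note that python’s glob rules only approximate the forwarder’s own matching, and the path below is just an example):

#!/usr/bin/env python
# Quick sanity check: does this wildcard actually match any files?
# Python's glob rules approximate, but don't exactly match, the shipper's globbing.
import glob
import sys

pattern = sys.argv[1] if len(sys.argv) > 1 else "/var/log/my_app/*.log"   # example path
matches = glob.glob(pattern)
print("%d file(s) match %s" % (len(matches), pattern))
for path in matches:
    print("  " + path)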

Logstash

First, check that logstash can reach elasticsearch, using the same method as before.  From your logstash server:

telnet <es_server> <es_port>

If you cannot reach the server, check the network.

If you can reach the server, we need to confirm that logstash is receiving the information from the shipper and what it’s doing with the data.  Add the following to your logstash output stanza and restart logstash:

output {
    stdout { codec => rubydebug }
}

This instructs logstash to print out a copy of each message that it processes.  These are usually written to /var/log/messages.

If information is being printed to the logs, then the shipper is sending good data to logstash.

Check the “@timestamp” value in these records.  By default, the documents will be written to a daily elasticsearch index named for that date (e.g. logstash-2014.11.04).

Don’t forget to disable the extra “output” section, or you’ll run out of disk space pretty quickly!

Logstash also has “--debug” and “--verbose” command-line options that you can enable in your startup script, e.g. /etc/init.d/logstash.

Elasticsearch

If you can ship logs to logstash and logstash can see them, then logstash should be sending them to elasticsearch.  Check to see that the total document count on your server is increasing:

curl -s "localhost:9200/_nodes/stats?pretty"

And examine the output at the beginning:

{
  "cluster_name" : "my_cluster",
  "nodes" : {
    "my_node" : {
      "indices" : {
        "docs" : {
          "count" : 123456789,
          "deleted" : 0
        },

If you run this a couple of times, you should see the number increasing.
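
If you’d rather not eyeball the curl output, a small script can take the two samples for you (a sketch using the python elasticsearch module; the host and 30-second interval are arbitrary):

#!/usr/bin/env python
# Sketch: sample the total document count twice and report the difference.
# Host and sleep interval are arbitrary examples.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def total_docs():
    stats = es.nodes.stats(metric="indices")
    return sum(node["indices"]["docs"]["count"] for node in stats["nodes"].values())

before = total_docs()
time.sleep(30)
after = total_docs()
print("docs before=%d, after=%d, delta=%d" % (before, after, after - before))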

If the document count is not increasing, check the elasticsearch log file, typically /var/log/elasticsearch/elasticsearch.log.

Kibana

If documents are being written to elasticsearch, but you can’t find them in kibana, there are a few things to check:

First, is the default index for your dashboard correct?  In Kibana 3, click the “gear” in the top-right corner, switch to the “Index” tab, and confirm the setting:

[Screenshot: Kibana 3 dashboard settings, Index tab]

Second, make sure that your Kibana date range covers the dates on the documents being added to the index.  If the date is being overwritten (using logstash’s date filter), the documents will carry the event’s original timestamp and may show up in the past.  If the date is not being overwritten, they will be stamped with the time they were processed and will show up at the current time.