Overview
If you’ve set up your ELK cluster and logs are flowing in from your shippers, you’re now sitting on a goldmine of data. The question becomes, “what should I do now?”
A first step is to build Kibana dashboards, but they’re of little value in a lights-out environment (see http://svops.com/blog/?p=11).
When you’re ready to actively monitor the information that’s sitting in the cluster, you’ll want to pull it into your monitoring system (Nagios, Zabbix, ScienceLogic, whatever).
There are many benefits to this approach over Logstash’s built-in notifications, including:
- one alerting system (common message format, distribution groups, etc).
- one escalation system (*)
- one acknowledgement system (*)
- one dashboard for monitoring
(*) Logstash doesn’t provide these features.
This system is also better than using Logstash’s nagios-related plugins, since you’ll be querying all the documents in Elasticsearch, not just one document at a time. You’ll also be using Elasticsearch as a database, rather than using Logstash’s metric{} functionality as a poor substitute.
There are two systems that you should build. I’ll reference Nagios as the target platform.
Individual Metrics
A query against Elasticsearch for the total number of Java exceptions that have occurred is a good example of an individual metric.
In Nagios, you would first define a virtual host (e.g. “elasticsearch”, “java”, “my_app”, etc) and a virtual service (e.g. “java exceptions”). The service would run a new command (e.g. “run_es_query”). Set the check interval to something that makes sense for your organization.
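As a sketch, the corresponding Nagios object definitions might look like the following. Every name, path, threshold, and query here is an illustrative assumption, not something from a real deployment:

```cfg
# Hypothetical virtual host: "elasticsearch" is not a real machine,
# just an anchor for the ES-backed services.
define host {
    host_name   elasticsearch
    use         generic-host
    address     127.0.0.1
}

# Command wrapping the (hypothetical) query script; $ARG1$ is the
# query_string, $ARG2$/$ARG3$ are warning/critical thresholds.
define command {
    command_name  run_es_query
    command_line  /usr/local/bin/check_es_count.py "$ARG1$" $ARG2$ $ARG3$
}

# Virtual service tying them together; tune check_interval for
# your organization.
define service {
    host_name             elasticsearch
    service_description   java exceptions
    check_command         run_es_query!tags:exception AND language:java!10!50
    check_interval        5
    use                   generic-service
}
```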
The magic comes in writing the underlying program that is run by the “run_es_query” command. This program should take a valid Elasticsearch query_string as a parameter, and run it against the cluster.
In the Nagios world, the script has to return the values to show OK, WARNING, etc. The output of the script can also include performance data, which is used for charting.
The python elasticsearch module makes writing the script pretty easy. Write one script for each query type (max, count, most recent document, etc.); this keeps each script readable, rather than producing one script so generic that it becomes unmaintainable.
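A minimal sketch of a count-style check along these lines, using the official Python elasticsearch client. The index pattern, thresholds, and query are assumptions you would replace with your own; the Nagios exit codes and the perfdata format after the pipe are the standard plugin conventions:

```python
#!/usr/bin/env python
"""Sketch: count documents matching a query_string and report
the result as a Nagios check (assumed index pattern and names)."""
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(count, warn, crit):
    """Map a document count onto a Nagios state and output line.
    The text after the pipe is performance data, used for charting."""
    if count >= crit:
        state, label = CRITICAL, "CRITICAL"
    elif count >= warn:
        state, label = WARNING, "WARNING"
    else:
        state, label = OK, "OK"
    output = "%s - %d matching documents | count=%d;%d;%d" % (
        label, count, count, warn, crit)
    return state, output


def run_check(query, warn, crit, index="logstash-*"):
    # Requires the `elasticsearch` package; the count API returns
    # {"count": N} for a query_string query.
    from elasticsearch import Elasticsearch
    es = Elasticsearch()
    body = {"query": {"query_string": {"query": query}}}
    result = es.count(index=index, body=body)
    return classify(result["count"], warn, crit)


if __name__ == "__main__" and len(sys.argv) >= 4:
    state, output = run_check(sys.argv[1], int(sys.argv[2]),
                              int(sys.argv[3]))
    print(output)
    sys.exit(state)
```

Nagios reads the exit code for the state and the printed line for the status text and chart data, which is why the script both prints and exits with the mapped value.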
Bulk Metrics
If you wanted to count the Java exceptions but report them on a machine-by-machine basis, you would not want to launch the “individual metric” command once per physical host. That would run one query against Elasticsearch for every host, which doesn’t scale well at all.
The better alternative is to run one “bulk” script that pulls the data for all hosts from Elasticsearch and then passes that information to Nagios using the “passive check” system. Nagios will react to the information as configured.
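A sketch of that bulk approach: one terms aggregation returns per-host counts, and each count is written to the Nagios external command file as a passive result. The host field name, service name, thresholds, and command-file path are all assumptions to adapt to your setup:

```python
#!/usr/bin/env python
"""Sketch: one aggregation query -> many passive check results.
Field names, paths, and the service name are placeholders."""
import time


def passive_result_line(host, service, state, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT line for the Nagios
    external command file."""
    now = int(now if now is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        now, host, service, state, output)


def counts_to_results(counts, warn, crit, service="java exceptions"):
    """Turn a {host: count} mapping into (host, service, state,
    output) tuples using simple warn/crit thresholds."""
    results = []
    for host, count in counts.items():
        if count >= crit:
            state, label = 2, "CRITICAL"
        elif count >= warn:
            state, label = 1, "WARNING"
        else:
            state, label = 0, "OK"
        output = "%s - %d exceptions | count=%d" % (label, count, count)
        results.append((host, service, state, output))
    return results


def fetch_counts(query, index="logstash-*", host_field="host.keyword"):
    # One aggregation instead of one query per host.
    from elasticsearch import Elasticsearch
    es = Elasticsearch()
    body = {
        "size": 0,
        "query": {"query_string": {"query": query}},
        "aggs": {"per_host": {"terms": {"field": host_field,
                                        "size": 1000}}},
    }
    resp = es.search(index=index, body=body)
    return {b["key"]: b["doc_count"]
            for b in resp["aggregations"]["per_host"]["buckets"]}


def submit(results, cmd_file="/var/lib/nagios3/rw/nagios.cmd"):
    # Appending to the command file is how passive results are fed
    # to Nagios; the path varies by distribution.
    with open(cmd_file, "a") as fh:
        for host, service, state, output in results:
            fh.write(passive_result_line(host, service, state,
                                         output) + "\n")
```

The target services must have passive checks enabled in their Nagios definitions; from there, Nagios applies its normal notification and escalation logic to each submitted result.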
Where’s the Code?
I’ve written this plugin a few times for different platforms, but always as (unsharable) work-for-hire. I hope to rewrite this in my spare time some day, but this outline should get you started.