
Monitoring your log files

Overview

If you’ve set up your ELK cluster and logs are flowing in from your shippers, you’re now sitting on a goldmine of data.  The question becomes, “what should I do?!”

A first step is to build Kibana dashboards, but they’re of little value in a lights-out environment (see http://svops.com/blog/?p=11).

When you’re ready to actively monitor the information that’s sitting in the cluster, you’ll want to pull it into your monitoring system (Nagios, Zabbix, ScienceLogic, whatever).

There are many benefits to this approach over Logstash’s built-in notifications, including:

  • one alerting system (common message format, distribution groups, etc).
  • one escalation system (*)
  • one acknowledgement system (*)
  • one dashboard for monitoring

(*) Logstash doesn’t provide these features.

This system is also better than using Logstash’s Nagios-related plugins, since you’ll be querying all the documents in Elasticsearch, not just one document at a time.  You’ll also be using Elasticsearch as a database, rather than using Logstash’s metric{} functionality as a poor substitute.

There are two systems that you should build.  I’ll reference Nagios as the target platform.

Individual Metrics

Querying Elasticsearch for the total number of Java exceptions that have occurred is a good example of an individual metric.

In Nagios, you would first define a virtual host (e.g. “elasticsearch”, “java”, “my_app”, etc) and a virtual service (e.g. “java exceptions”).  The service would run a new command (e.g. “run_es_query”).  Set the check interval to something that makes sense for your organization.
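As a rough sketch, those definitions might look like the following (the template names, paths, and thresholds here are examples to adapt, not a drop-in configuration):

define host {
    use          generic-host
    host_name    elasticsearch
    address      127.0.0.1
}

define command {
    command_name run_es_query
    command_line /usr/local/nagios/libexec/check_es_query.py "$ARG1$" $ARG2$ $ARG3$
}

define service {
    use                  generic-service
    host_name            elasticsearch
    service_description  java exceptions
    check_command        run_es_query!tags:exception!50!100
    check_interval       5
}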

The magic comes in writing the underlying program that is run by the “run_es_query” command.  This program should take a valid Elasticsearch query_string as a parameter, and run it against the cluster.

In the Nagios world, the script communicates OK, WARNING, etc. through its exit code.  The output of the script can also include performance data, which is used for charting.
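For example, a count-style check might print a line like this, where everything after the pipe is performance data in the standard label=value;warn;crit;min format:

ES QUERY OK - 12 matching documents | count=12;50;100;0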

The Python elasticsearch module makes writing the script pretty easy.  Write one script for each query type (max, count, most recent document, etc); this will help keep your code from becoming unreadable due to being so generic.
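Here’s a minimal sketch of what the “count” variant might look like, assuming a logstash-* index pattern and thresholds passed on the command line (the names and defaults are illustrative):

#!/usr/bin/env python
# Sketch of a "count" check: check_es_query.py QUERY WARN CRIT
import sys
from elasticsearch import Elasticsearch

query, warn, crit = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
es = Elasticsearch(['localhost:9200'])  # adjust for your cluster

# Run the query_string against the cluster and count matching documents.
count = es.count(index='logstash-*', q=query)['count']

# Nagios reads the exit code: 0=OK, 1=WARNING, 2=CRITICAL.
code, label = 0, 'OK'
if count >= crit:
    code, label = 2, 'CRITICAL'
elif count >= warn:
    code, label = 1, 'WARNING'

# Everything after the pipe is performance data for charting.
print('ES QUERY %s - %d matching documents | count=%d;%d;%d;0'
      % (label, count, count, warn, crit))
sys.exit(code)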

Bulk Metrics

If you wanted to count the Java exceptions but report them on a machine-by-machine basis, you would not want to launch the “individual metric” command once for each physical host.  Doing so would run one query per host against Elasticsearch, which doesn’t scale well at all.

The better alternative is to run one “bulk” script that pulls the data for all hosts from Elasticsearch and then passes that information to Nagios using the “passive check” system.  Nagios will react to the information as configured.
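A rough sketch of that bulk script, assuming a logstash-* index, a not_analyzed host field to group on, and the standard Nagios external command file (the paths, field names, and thresholds are placeholders for your setup):

#!/usr/bin/env python
# Sketch of a bulk check: one aggregation query, many passive results.
import time
from elasticsearch import Elasticsearch

CMD_FILE = '/usr/local/nagios/var/rw/nagios.cmd'  # path varies by install

es = Elasticsearch(['localhost:9200'])

# One terms aggregation returns the exception count for every host at once.
body = {
    'query': {'query_string': {'query': 'tags:exception'}},
    'aggs': {'per_host': {'terms': {'field': 'host.raw', 'size': 1000}}},
    'size': 0,
}
result = es.search(index='logstash-*', body=body)

now = int(time.time())
with open(CMD_FILE, 'w') as cmd:
    for bucket in result['aggregations']['per_host']['buckets']:
        count = bucket['doc_count']
        code = 2 if count >= 100 else (1 if count >= 50 else 0)
        output = 'ES QUERY - %d exceptions | count=%d' % (count, count)
        # PROCESS_SERVICE_CHECK_RESULT is Nagios's passive-check command.
        cmd.write('[%d] PROCESS_SERVICE_CHECK_RESULT;%s;java exceptions;%d;%s\n'
                  % (now, bucket['key'], code, output))

For this to work, each host needs a matching service definition with passive_checks_enabled set, so Nagios knows where to file the results.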

Where’s the Code?

I’ve written this plugin a few times for different platforms, but always as (unsharable) work-for-hire.  I hope to rewrite this in my spare time some day, but this outline should get you started.

Monitoring using the New Relic API

A client has some code that is instrumented with a New Relic agent. We wanted to track the performance of individual portions of the code – mostly dependencies on other services like databases and third-party data sources. Rather than have yet another alerting platform, we wanted to pull the information into Nagios. Fortunately, New Relic offers an API that’s pretty easy to use.

The first step is to enable API access in your New Relic account and get the API key. According to the NR doc, the steps are:

  1. Sign in to the New Relic user interface.
  2. Select (account name) > Account settings > Integrations > Data sharing > API access.
  3. Click Enable API Access, and then copy or make a note of your API key.

Once you have the API key, the first request you’ll want to make is to get your account_id. Try this:

curl -gH "x-api-key:YOUR_API_KEY" 'https://rpm.newrelic.com/api/v1/accounts.xml'

Note that the account_id also appears in the page URLs when you’re logged in to the New Relic website.

With that done, there are only a handful of URLs that you might want to hit. New Relic breaks things down by application, so you’ll need a list of those:

curl -gH "x-api-key:YOUR_API_KEY" 'https://rpm.newrelic.com/api/v1/accounts/YOUR_ACCOUNT_ID/applications.xml'

The results of that call will not only give you the IDs for each application, but also links to the Overview and Servers pages.  Again, note that the application ID appears in the page URLs when you’re on the New Relic website.

Grab a list of the metrics that are available for the application:

curl -gH "x-api-key:YOUR_API_KEY" 'https://api.newrelic.com/api/v1/accounts/YOUR_ACCOUNT_ID/applications/YOUR_APPLICATION_ID/metrics.xml'

And finally, pull a statistic for that metric:

curl -gH "x-api-key:YOUR_API_KEY" 'https://api.newrelic.com/api/v1/accounts/YOUR_ACCOUNT_ID/applications/YOUR_APPLICATION_ID/data.xml?metrics[]=YOUR_METRIC_NAME&field=call_count&begin=2013-11-14T00:00:00&end=2013-11-14T23:59:59&summary=1'

This will return xml-formatted data for that metric for a single day. With “summary=1”, you get only one row returned. To get smaller buckets throughout the day, leave “summary” off.

In a quick scan, we didn’t find a way to get more than one metric value per call, so we make multiple calls to get what we need.

Note that you can use “data.json” or “data.csv” to have the data returned in different formats.  We used XML during manual development and then switched to JSON when we started writing the Nagios plugin.
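For illustration, the core of such a plugin might look roughly like this (the IDs, metric name, threshold, and the exact JSON shape are assumptions to adapt to your own account):

#!/usr/bin/env python
# Sketch of a Nagios check against the New Relic v1 data API.
import sys
import requests

API_KEY = 'YOUR_API_KEY'
URL = ('https://api.newrelic.com/api/v1/accounts/YOUR_ACCOUNT_ID'
       '/applications/YOUR_APPLICATION_ID/data.json')

params = {
    'metrics[]': 'YOUR_METRIC_NAME',
    'field': 'call_count',
    'begin': '2013-11-14T00:00:00',
    'end': '2013-11-14T23:59:59',
    'summary': '1',
}
resp = requests.get(URL, headers={'x-api-key': API_KEY}, params=params)
resp.raise_for_status()

# With summary=1 you get a single row back; the indexing here assumes
# the list-of-rows shape we observed and may need adjusting.
value = resp.json()[0]['call_count']

# Hypothetical threshold; exit codes follow the Nagios convention.
if value > 10000:
    print('NEWRELIC WARNING - call_count=%d | call_count=%d' % (value, value))
    sys.exit(1)
print('NEWRELIC OK - call_count=%d | call_count=%d' % (value, value))
sys.exit(0)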

We now use this plugin to check over 50 metrics every three minutes for the client, pulling the average_response_time, max_response_time, and call_count.