Category Archives: Monitoring

Better than NewRelic

Application Performance Logging

You can learn a lot about your code by instrumenting it.  You’ll find methods that are called more than you’re expecting, or be able to identify which part of your code is slowing the app down.

There are SaaS solutions for this, like New Relic.  From my experience, New Relic “runs” out of the box, and may provide information on common tools (MySQL, etc.), but doesn’t provide any real information about your app without adding instrumentation calls to your code.

Also, New Relic works on an aggregated summary of your application.  If it supplies details on a specific request, I never found it.

If you have to instrument your code, and you want to know exactly what your users are doing, then what’s the appeal of New Relic?

Recently, I was involved with a project to instrument some “underperforming” Python code.  New Relic had been installed, but had provided no value.

Caveat: I was involved on the analysis side.  Another developer wrote the logging framework.

A performance class was created, and we added a custom Django handler that initialized a performance object for each request.

To instrument a method, you would wrap it in a ‘with’ statement:

with performanceTracking("block name"):
    some_block_of_code()   # the code being timed

The ‘with’ syntax provides __enter__ and __exit__ hooks, which were used in the performance class to start and stop a timer for that block of code.  On __exit__, the timing information for the block was added to a data structure for the entire request.
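
The original class wasn’t published, but a minimal sketch of how such a context manager could work might look like this (the performanceTracking name comes from the snippet above; the thread-local stack, millisecond durations, and the merging of repeated calls are assumptions on my part):

import threading
import time

# Thread-local storage so concurrent requests don't mix their timings.
_local = threading.local()

def _request_stack():
    """Return the per-request stack, creating a root entry if needed."""
    if not hasattr(_local, "stack"):
        _local.stack = [{"name": "request", "calls": 1, "duration": 0, "children": []}]
    return _local.stack

class performanceTracking(object):
    """Time a named block of code and record it under the enclosing block."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        parent = _request_stack()[-1]
        # Re-use an existing entry with the same name, so a block that runs
        # twice shows up as "calls": 2 instead of two separate entries.
        for child in parent["children"]:
            if child["name"] == self.name:
                self.entry = child
                self.entry["calls"] += 1
                break
        else:
            self.entry = {"name": self.name, "calls": 1, "duration": 0, "children": []}
            parent["children"].append(self.entry)
        _request_stack().append(self.entry)
        self.start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.entry["duration"] += int((time.time() - self.start) * 1000)
        _request_stack().pop()
        return False  # never swallow exceptions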

When the request finished, the handler would write the entire logging block using the standard logging mechanism.  Logging these at DEBUG made them easy to disable in production.
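
The request-level wiring wasn’t shown either.  Assuming a Django middleware class (the article only says “custom django handler”), and reusing the thread-local _local from the sketch above, it could look roughly like this (the X-Request-ID header and the logger name are illustrative):

import json
import logging
import time

logger = logging.getLogger("performance")

class PerformanceLoggingMiddleware(object):
    """Create the root timing entry per request and log it on the way out."""

    def process_request(self, request):
        # Fresh root entry for this request.
        _local.stack = [{
            "requestid": request.META.get("HTTP_X_REQUEST_ID", "unknown"),
            "name": request.path,
            "calls": 1,
            "duration": 0,
            "children": [],
        }]
        _local.request_start = time.time()

    def process_response(self, request, response):
        root = _local.stack[0]
        root["duration"] = int((time.time() - _local.request_start) * 1000)
        # DEBUG level, so the whole thing can be switched off in production.
        logger.debug(json.dumps(root))
        return response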

What you ended up with was a nested set of performance information:

{
    "requestid": "12345",
    "name": "my web page",
    "calls": 1,
    "duration": 200,
    "children": [
        {
           "calls": 2,
           "name": "mysql user lookup",
           "duration": 190
        },
        {
           "calls": 1,
           "name": "something else",
           "duration": 10
        }
    ]
}

You could now see that “mysql user lookup” was called twice (!), and was responsible for 95% of the time spent (!).  Even better, you knew this information for this individual request.  You could summarize the performance of these blocks across the system (à la New Relic), but you could also zoom in and see the details for this particular user.  Powerful stuff, right?

With your own code, the possibilities are limitless.  In a web app environment, maybe the page is authenticated.  Add the user’s ID to the logging block, and now you can see exactly what happened for Mary on Tuesday at 10AM.  If you run multiple applications or subsystems, add their identifiers to the logs and you can break performance out by component.

Once the information was written to the logs, it was loaded into an Elasticsearch cluster for analysis.

Performance Analysis

With all of this data, the next step is to see what’s going on.  The ELK stack has come a long way in the year since this project, so it would probably be even easier now!

With the data properly parsed in Logstash and fed into Elasticsearch, you can build Kibana dashboards to show things like the following (a sample query sketch appears after the list):

  • total application performance over time (graph the top-level “duration” field).
  • the performance of each code block over time
  • the most-frequently called code blocks
  • the worst-performing code blocks
  • does performance degrade with load?
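
As a sketch of the kind of query a panel like the first one could sit on, here’s an aggregation that charts the average and maximum top-level duration per hour using the Python elasticsearch client (the logstash-* index pattern and the @timestamp field are the usual Logstash defaults; adjust them to your own mapping):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

query = {
    "size": 0,
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"},
            "aggs": {
                "avg_duration": {"avg": {"field": "duration"}},
                "max_duration": {"max": {"field": "duration"}},
            },
        }
    },
}

result = es.search(index="logstash-*", body=query)
for bucket in result["aggregations"]["per_hour"]["buckets"]:
    print("%s avg=%sms max=%sms" % (
        bucket["key_as_string"],
        bucket["avg_duration"]["value"],
        bucket["max_duration"]["value"],
    ))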

We found it to be very useful to wrap every third-party API call with this framework.  If the vendor was slow, we’d be slow, but now we had our own data to isolate the cause and work with the vendor.

Combine this information with other sources of data and you can determine neat stuff like:

  • the overhead added by your web server (the difference between the web server’s total request time and the application’s total time).

Doing a basic code review on the worst-performing block would usually make it very clear why the performance was suffering.  I was the largest creator of tickets for the dev group in the company.

Using this information, we focused on two goals:

  • making the lines flat, i.e. creating consistent, scalable performance.
  • making the lines lower, i.e. increasing overall performance.

We were very successful in both of these goals.

We had several interesting “deep dives” into this data, but one of them stands out:

The whole system was monitored with Gomez, which one of the executives liked.  Gomez started to report performance problems, but they weren’t seen in the summary data that we graphed.

Since we had the details, I was able to find the exact request that Gomez had made (user name and time), and see the block of code that was slow.

It turned out that the system was writing a row into a table every time a user logged in.  The Gomez test account logged in a lot, so there were a lot of rows for this user.  The application pulled these rows, looking for some recent information (last search string, etc).  Unfortunately, the app pulled in all the rows, not just the recent ones that it needed.

It was easy to find and easy to fix.  Management was happy.

Monitoring your log files

Overview

If you’ve set up your ELK cluster and logs are flowing in from your shippers, you’re now sitting on a goldmine of data.  The question becomes, “what should I do?!??”

A first step is to make Kibana dashboards, but they are of little value in a lights-out environment (see http://svops.com/blog/?p=11).

When you’re ready to actively monitor the information that’s sitting in the cluster, you’ll want to pull it into your monitoring system (Nagios, Zabbix, ScienceLogic, whatever).

There are many benefits to this approach over Logstash’s built-in notifications, including:

  • one alerting system (common message format, distribution groups, etc).
  • one escalation system (*)
  • one acknowledgement system (*)
  • one dashboard for monitoring

(*) Logstash doesn’t provide these features.

This system is also better than using Logstash’s nagios-related plugins, since you’ll be querying all the documents in Elasticsearch, not just one document at a time.  You’ll also be using Elasticsearch as a database, rather than using Logstash’s metric{} functionality as a poor substitute.

There are two systems that you should build.  I’ll reference Nagios as the target platform.

Individual Metrics

Querying Elasticsearch for the total number of Java exceptions that have occurred is a good example of an individual metric.

In Nagios, you would first define a virtual host (e.g. “elasticsearch”, “java”, “my_app”, etc) and a virtual service (e.g. “java exceptions”).  The service would run a new command (e.g. “run_es_query”).  Set the check interval to something that makes sense for your organization.

The magic comes in writing the underlying program that is run by the “run_es_query” command.  This program should take a valid Elasticsearch query_string as a parameter, and run it against the cluster.

In the Nagios world, the script has to exit with the appropriate return code to indicate OK, WARNING, etc.  The output of the script can also include performance data, which is used for charting.

The python elasticsearch module makes writing the script pretty easy.  Write one script for each query type (max, count, most recent document, etc); this will help keep your code from becoming unreadable due to being so generic.
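
As a sketch of the “count” flavor of such a script, using the Python elasticsearch client (the flag names and thresholds below are illustrative; the original plugin wasn’t published):

#!/usr/bin/env python
import argparse
import sys

from elasticsearch import Elasticsearch

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main():
    parser = argparse.ArgumentParser(description="Count documents matching a query_string")
    parser.add_argument("--host", default="localhost:9200")
    parser.add_argument("--index", default="logstash-*")
    parser.add_argument("--query", required=True, help="Elasticsearch query_string syntax")
    parser.add_argument("--warning", type=int, required=True)
    parser.add_argument("--critical", type=int, required=True)
    args = parser.parse_args()

    try:
        es = Elasticsearch([args.host])
        count = es.count(index=args.index, q=args.query)["count"]
    except Exception as err:
        print("UNKNOWN: query failed: %s" % err)
        sys.exit(UNKNOWN)

    # Nagios reads the exit code for state and everything after "|" as perfdata.
    perfdata = "matches=%d;%d;%d" % (count, args.warning, args.critical)
    if count >= args.critical:
        print("CRITICAL: %d matches for '%s' | %s" % (count, args.query, perfdata))
        sys.exit(CRITICAL)
    if count >= args.warning:
        print("WARNING: %d matches for '%s' | %s" % (count, args.query, perfdata))
        sys.exit(WARNING)
    print("OK: %d matches for '%s' | %s" % (count, args.query, perfdata))
    sys.exit(OK)

if __name__ == "__main__":
    main()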

Bulk Metrics

If you wanted to count the Java exceptions, but report them on a machine-by-machine basis, you would not want to launch the “individual metric” command once for every physical host.  Doing this would result in many queries being run against Elasticsearch, and wouldn’t scale well at all.

The better alternative is to run one “bulk” script that pulls the data for all hosts from Elasticsearch and then passes that information to Nagios using the “passive check” system.  Nagios will react to the information as configured.
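
A sketch of the bulk approach, again with the Python elasticsearch client: one terms aggregation returns the per-host counts in a single query, and the results are written to Nagios as passive checks via its external command file.  The query, the host field name, the service name, and the command-file path are all placeholders to adapt; the PROCESS_SERVICE_CHECK_RESULT line is the standard Nagios external-command format.

import time

from elasticsearch import Elasticsearch

COMMAND_FILE = "/var/spool/nagios/cmd/nagios.cmd"  # a common default; check your install
SERVICE = "java exceptions"
CRITICAL_AT = 10

es = Elasticsearch(["localhost:9200"])
result = es.search(
    index="logstash-*",
    body={
        "size": 0,
        "query": {"query_string": {"query": "tags:java_exception"}},  # illustrative query
        "aggs": {"per_host": {"terms": {"field": "host.raw", "size": 1000}}},
    },
)

now = int(time.time())
with open(COMMAND_FILE, "w") as cmd:
    for bucket in result["aggregations"]["per_host"]["buckets"]:
        host, count = bucket["key"], bucket["doc_count"]
        state = 2 if count >= CRITICAL_AT else 0  # 0=OK, 2=CRITICAL
        cmd.write("[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%d java exceptions\n"
                  % (now, host, SERVICE, state, count))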

Where’s the Code?

I’ve written this plugin a few times for different platforms, but always as (unsharable) work-for-hire.  I hope to rewrite this in my spare time some day, but this outline should get you started.

Debugging your ELK cluster

Question

My ELK (Elasticsearch/Logstash/Kibana) cluster isn’t working.  How do I fix it?

Answer

Start at the beginning.

The Shipper

There are several popular pieces of software to ship your logs from the client to the logstash indexer.  Whether you’re using a full logstash installation, the logstash-forwarder, beaver, or something else, start by testing the network connectivity from your client to the logstash indexer:

telnet <ls_server> <ls_port>

There is no standard logstash port, so check your server configuration for the correct value.

If you can reach the server manually, then your shipper should be able to as well.

If you cannot reach the server with telnet, then you have some networking or connectivity issue.  Go work on that!

If you’re using the full logstash agent as your shipper, run it with “--debug” and check its own log files in /var/log/logstash/.

For logstash-forwarder, run it with the “-quiet=false” flag (version 0.4) or the “-verbose -debug” flags (older versions).

Check the list of filenames that you’ve configured – do they really match your paths?  Do any wildcards expand as desired?  In logstash-forwarder’s debug mode, it will show you the list of files that it’s processing.

Logstash

First, check that logstash can reach elasticsearch, using the same method as before.  From your logstash server:

telnet <es_server> <es_port>

If you cannot reach the server, check the network.

If you can reach the server, we need to confirm that logstash is receiving the information from the shipper and what it’s doing with the data.  Add the following to your logstash output stanza and restart logstash:

output {
    stdout { codec => rubydebug }
}

This instructs logstash to print out a copy of each message that it processes.  These are usually written to /var/log/messages.

If information is being printed to the logs, then the shipper is sending good data to logstash.

Check the “@timestamp” value in these records.  By default, the documents will be written to a daily elasticsearch index based on that date (e.g. logstash-2014.11.04).

Don’t forget to disable the extra “output” section, or you’ll run out of disk space pretty quickly!

Logstash also has “--debug” and “--verbose” command-line options that you can enable in your startup script, e.g. /etc/init.d/logstash.

Elasticsearch

If you can ship logs to logstash and logstash can see them, then logstash should be sending them to elasticsearch.  Check to see that the total document count on your server is increasing:

curl -s "localhost:9200/_nodes/stats?pretty"

And examine the output at the beginning:

{
  "cluster_name" : "my_cluster",
  "nodes" : {
    "my_node" : {
      "indices" : {
        "docs" : {
          "count" : 123456789,
          "deleted" : 0
        },

If you run this a couple of times, you should see the number increasing.
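
If you’d rather not eyeball the numbers, a small script can do the comparison for you.  This sketch uses the Python elasticsearch client against the same node-stats data shown above, and assumes a single-node cluster:

import time

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

def doc_count():
    stats = es.nodes.stats()
    node = list(stats["nodes"].values())[0]  # assumes a single node
    return node["indices"]["docs"]["count"]

before = doc_count()
time.sleep(30)
after = doc_count()
print("documents added in the last 30 seconds: %d" % (after - before))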

If the document count is not increasing, check the elasticsearch log file, typically in /var/log/elasticsearch/elasticsearch.log

Kibana

If documents are being written to elasticsearch, but you can’t find them in kibana, there are a few things to check:

First, is the default index for your dashboard correct?  In Kibana 3, click the “gear” in the top-right corner, switch to the “Index” tab, and confirm the setting:

[Screenshot: Kibana 3 dashboard settings, “Index” tab]

Second, make sure that your kibana date range covers the dates being used when documents are added to the index.  If the timestamp is being overwritten (using logstash’s date filter), the logs will show up at the original event time, which may be in the past.  If it is not being overwritten, they will show up at the time they were received.

Monitoring using the New Relic API

A client has some code that is instrumented with a New Relic agent. We wanted to track the performance of individual portions of the code – mostly dependencies on other services like databases and third-party data sources. Rather than have yet another alerting platform, we wanted to pull the information into Nagios. Fortunately, New Relic offers an API that’s pretty easy to use.

The first step is to enable API access in your New Relic account and get the API key. According to the NR doc, the steps are:

  1. Sign in to the New Relic user interface.
  2. Select (account name) > Account settings > Integrations > Data sharing > API access.
  3. Click Enable API Access, and then copy or make a note of your API key.

Once you have the API key, the first request you’ll want to make is to get your account_id. Try this:

curl -gH "x-api-key:YOUR_API_KEY" 'https://rpm.newrelic.com/api/v1/accounts.xml'

Note that the account_id also appears in the page URLs when you’re logged in to the New Relic website.

With that done, there are only a handful of URLs that you might want to hit. New Relic breaks things down by application, so you’ll need a list of those:

curl -gH "x-api-key:YOUR_API_KEY" 'https://rpm.newrelic.com/api/v1/accounts/YOUR_ACCOUNT_ID/applications.xml'

The results of that call will not only give you the IDs for each application, but also links to the Overview and Servers pages. Again, note that the application ID appears in the page URLs when you’re on the New Relic website.

Grab a list of the metrics that are available for the application:

curl -gH "x-api-key:YOUR_API_KEY" 'https://api.newrelic.com/api/v1/accounts/YOUR_ACCOUNT_ID/applications/YOUR_APPLICATION_ID/metrics.xml'

And finally, pull a statistic for that metric:

curl -gH "x-api-key:YOUR_API_KEY" 'https://api.newrelic.com/api/v1/accounts/YOUR_ACCOUNT_ID/applications/YOUR_APPLICATION_ID/data.xml?metrics[]=YOUR_METRIC_NAME&field=call_count&begin=2013-11-14T00:00:00&end=2013-11-14T23:59:59&summary=1'

This will return xml-formatted data for that metric for a single day. With “summary=1”, you get only one row returned. To get smaller buckets throughout the day, leave “summary” off.

In a quick scan, we didn’t find a way to get more than one metric value per call, so we made multiple calls to get what we needed.

Note that you can use “data.json” or “data.csv” to have the data returned in different formats. We used xml during manual development and then switched to json when we started writing the nagios plugin.
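
A rough sketch of the data pull in Python, using the same v1 endpoint as the curl examples but with data.json.  The requests library, the thresholding, and especially the parsing of the response are my assumptions; the exact JSON shape isn’t reproduced here, so adjust the extraction to what your account returns:

import sys

import requests

API_KEY = "YOUR_API_KEY"
ACCOUNT_ID = "YOUR_ACCOUNT_ID"
APP_ID = "YOUR_APPLICATION_ID"
METRIC = "YOUR_METRIC_NAME"
FIELD = "average_response_time"
WARNING_AT = 500.0  # illustrative threshold

url = ("https://api.newrelic.com/api/v1/accounts/%s/applications/%s/data.json"
       % (ACCOUNT_ID, APP_ID))
params = {
    "metrics[]": METRIC,
    "field": FIELD,
    "begin": "2013-11-14T00:00:00",
    "end": "2013-11-14T23:59:59",
    "summary": 1,
}

response = requests.get(url, headers={"x-api-key": API_KEY}, params=params)
response.raise_for_status()

# With summary=1 the API returns a single row; adapt this to the JSON you get back.
rows = response.json()
value = float(rows[0][FIELD])

if value >= WARNING_AT:
    print("WARNING: %s %s=%.1f | %s=%.1f" % (METRIC, FIELD, value, FIELD, value))
    sys.exit(1)

print("OK: %s %s=%.1f | %s=%.1f" % (METRIC, FIELD, value, FIELD, value))
sys.exit(0)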

We now use this plugin to check over 50 metrics every three minutes for this client, pulling the average_response_time, max_response_time, and call_count.

Monitoring and Alerting

Monitoring is a good thing (duh!).  It’s one of the core functions that an operations group provides.  As with most things, there are good ways and bad ways of doing monitoring.

When you monitor, you end up with some kind of dashboard, say with Nagios:

[Screenshot: a Nagios status dashboard]

Which is very helpful, when you’re looking at it.

Ever seen a setup like this?

[Photo: a wall of monitors in a security operations center]

I think people who build systems like this really liked the movie War Games, but have missed one important difference – in the movie, it was someone’s job to sit and watch the screens 24/7/365.  Do you have staff for that?  Should we treat people that way?

In the real world, people have better things to do than stare at a monitor, waiting for some indicator to turn red.  Large displays like this become “monitoring theater” (see Security Theater) – basically fluff to make people think the system is being monitored. But, with those monitors sitting there, what happens when nobody is looking?


You must have alerting to make your monitoring worthwhile. Do you?