Duplicated elasticsearch documents

Intro

The first thing to notice is that the duplicate documents probably have different _id values, so the problem then becomes, “who is inserting the duplicates?”.

If you’re running logstash, some things to look at include:

  • duplicate input{} stanzas
  • duplicate output{} stanzas
  • two logstash processes running
  • bad file glob patterns
  • bad broker configuration

Duplicate Stanzas

Most people aren’t silly enough to deliberately create duplicate input or output stanzas, but there are still easy ways for them to occur:

  • a logstash config file you’ve forgotten (00-mytest.conf)
  • a backup file (00-input.conf.bak)

Remember that logstash will read in all the files it finds in your configuration directory!
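
For example, imagine a conf.d directory containing these two files (the filenames are hypothetical, and the elasticsearch output syntax shown is from the logstash 1.x era):

    # /etc/logstash/conf.d/30-output.conf
    output {
      elasticsearch { host => "localhost" }
    }

    # /etc/logstash/conf.d/30-output.conf.bak
    # A forgotten backup, but logstash loads it anyway!
    output {
      elasticsearch { host => "localhost" }
    }

Logstash concatenates every file in the directory into a single configuration, so each event now flows through two identical output{} stanzas and is indexed twice, with each copy getting its own _id.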

Multiple Processes

Sometimes your shutdown script doesn’t work, leaving you with two copies of your shipper running.  Check for this with ‘ps’ and kill off the older one.
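
A quick way to check (the PIDs and paths here are made up for illustration):

    $ ps -ef | grep '[l]ogstash'
    logstash  1234     1  ... agent -f /etc/logstash/conf.d
    logstash  5678     1  ... agent -f /etc/logstash/conf.d
    $ kill 1234    # the older of the two

The [l] in the grep pattern is a small trick that keeps grep from matching its own process.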

File Globs

If your file glob pattern is fairly open (e.g. “*”), you might be picking up files that have been rotated (“foo.log” and “foo.log.00”).  Both copies contain the same lines, so each line gets indexed once per file.
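
One way to tighten things up in the file input (the paths here are hypothetical):

    input {
      file {
        # Too broad: matches foo.log, foo.log.00, foo.log.1.gz, ...
        # path => "/var/log/app/*"

        # Tighter: only the live log files
        path    => "/var/log/app/*.log"
        exclude => "*.gz"    # exclude takes filename patterns, handy for
                             # compressed rotations under broader globs
      }
    }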

Logstash-forwarder sets a ‘file’ field that you can check in this case.

If you’ve enabled _timestamp in elasticsearch, it will show you when each of the duplicates was indexed, which might give you a clue.
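
A sketch of both checks against an elasticsearch cluster from that era (_timestamp was deprecated in elasticsearch 2.0 and removed in 5.0; the index name and query string are hypothetical):

    # Enable _timestamp for future logstash indices (1.x-era mapping API)
    curl -XPUT 'localhost:9200/_template/timestamps' -d '{
      "template": "logstash-*",
      "mappings": {
        "_default_": { "_timestamp": { "enabled": true, "store": true } }
      }
    }'

    # Pull back the file and _timestamp fields for a duplicated event
    curl 'localhost:9200/logstash-2014.01.01/_search?pretty&q=message:somestring&fields=file,_timestamp'

Comparing the two fields across the duplicates tells you which source file each copy came from and when it was indexed.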

Brokers

As for brokers: multiple logstash indexers can safely read from the same broker only if the broker hands each event to exactly one consumer.  If every indexer instead receives its own copy of the stream, every event gets indexed once per indexer.
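
With redis as the broker, for example, the difference comes down to the data_type setting of the input (the host and key values are hypothetical):

    input {
      redis {
        host => "broker.example.com"
        key  => "logstash"

        # "list": events are popped atomically, so several indexers can
        # share one queue and each event is consumed exactly once
        data_type => "list"

        # "channel" (redis pub/sub): every subscribed indexer receives
        # every event, so two indexers means two copies in elasticsearch
        # data_type => "channel"
      }
    }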
