Elasticsearch backup strategies

Update: This is an old blog post and is no longer relevant as of version 1.x of Elasticsearch. Now we can just use the snapshot feature.
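
For the curious, the snapshot feature boils down to a couple of REST calls: register a repository, then create snapshots in it. Here is a minimal sketch in Perl, assuming a shared file system repository (the repository name, snapshot name, and path are placeholders, so adjust them for your environment):

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Register an 'fs' snapshot repository; the location must be reachable by
# every node in the cluster (e.g. an NFS mount). Placeholder values only.
$ua->put(
    'http://localhost:9200/_snapshot/my_backup',
    Content => '{ "type": "fs", "settings": { "location": "/mount/backups/my_backup" } }'
);

# Create a snapshot of the entire cluster and wait for it to complete.
$ua->put(
    'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'
);

Restoring is just as simple: a POST to the corresponding _restore endpoint.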

Hello again! Today we’re going to talk about backup strategies for Elasticsearch. One popular way to make backups of ES requires the use of a separate ES node, while another relies entirely on the underlying file system of a given set of ES nodes.

The ES-based approach:

  • Bring up an independent (receiving) ES node on a machine that has network access to the actual ES cluster.
  • Trigger a script to perform a full index import from the ES cluster to the receiving node.
  • Since the receiving node is a standalone, single-node cluster of its own, every shard will be represented on said node.
  • Shut down the receiving node.
  • Preserve the /data/ directory from the receiving node.

The file system-based approach:

  • Identify a quorum of nodes in the ES cluster.
  • A quorum is necessary in order to ensure that all of the shards are represented.
  • Trigger a script that will preserve the /data/ directory of each selected node (a rough sketch follows below).
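
A deliberately naive version of such a preservation script might look something like this; it assumes passwordless SSH access to each node, rsync on every host, and a placeholder data path, so adjust everything for your environment:

use strict;
use warnings;

# The selected nodes that, together, hold every shard at least once.
my @nodes  = qw( es-node-01 es-node-02 es-node-03 );
my $target = '/backups/elasticsearch';

for my $node (@nodes) {
    # Pull each node's /data/ directory into a per-node backup directory.
    # The remote path is a placeholder; point it at your actual data path.
    system(
        'rsync', '-az', '--delete',
        "$node:/var/lib/elasticsearch/data/",
        "$target/$node/"
    ) == 0 or warn "rsync from $node failed: $?\n";
}

Note that the nodes are copied one after another, which only widens the coherency window discussed below.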

At first glance the file system-based approach appears simpler – and it is – but it comes with some drawbacks, notably the fact that coherency is impossible to guarantee due to the amount of time required to preserve /data/ on each node. In other words, if data changes on a node between the start and end times of the preservation mechanism, those changes may or may not be backed up. Furthermore, from an operational perspective, restoring nodes from individual shards may be problematic.

The ES-based approach does not have the coherency problem; however, beyond the fact that it is more complex to implement and maintain, it is also more costly in terms of service delivery. The actual import process itself requires a large number of requests to be made to the cluster, and the resulting resource consumption on both the cluster nodes and the receiving node is non-trivial. On the other hand, having a single, coherent representation of every shard in one place may pay dividends during a restoration scenario.

As is often the case, there is no one solution that is going to work for everybody all of the time – different environments have different needs, which call for different answers.  That said, if your primary goal is a consistent, coherent, and complete backup that can be easily restored when necessary (and overhead be damned!), then the ES-based approach is clearly the superior of the two.

import it!

Regarding the ES-based approach, it may be helpful to take a look at a simple import script as an example. How about a quick and dirty Perl script (adapted from the docs)?

use strict;
use warnings;
use ElasticSearch;

# The receiving node: a standalone, single-node "cluster".
my $local = ElasticSearch->new(
    servers => 'localhost:9200'
);

# Any member of the actual ES cluster being backed up.
my $remote = ElasticSearch->new(
    servers    => 'cluster_member:9200',
    no_refresh => 1
);

# Scan through the entire 'content' index, keeping the scroll context
# alive for five minutes between batches.
my $source = $remote->scrolled_search(
    index       => 'content',
    search_type => 'scan',
    scroll      => '5m'
);

# Import everything returned by the scrolled search into the local node.
$local->reindex( source => $source );

You’ll want to replace the relevant elements (host names, index name, and so on) with something sane for your environment, of course.

As for preserving the resulting /data/ directory (in either method), I will leave that as an exercise to the reader beyond the rough sketch above, since there are simply too many equally valid ways to go about it. It’s worth noting that the import method doesn’t need to be complex at all – in fact, it really shouldn’t be, since complex backup schemes tend to introduce more chances for failure than is necessary.

Happy indexing!

Author: phrawzty

I have a computer.

11 thoughts on “Elasticsearch backup strategies”

  1. This assumes your backup node has enough RAM and disk to store the entire cluster’s index, right? Seems like the time needed to back up and restore would be too long to be practical with a large index. Then again, I don’t know if using something like the S3 gateway would be any faster or more reliable.

    Haven’t tried it, but here’s another approach that suggests disabling flushing.
    http://karussell.wordpress.com/2011/07/10/how-to-backup-elasticsearch-with-rsync/

    1. Yes, it assumes that the backup node is capable of managing the entire contents of the cluster, that is true. The link you posted is functionally a filesystem-based approach – it is, after all, simply rsync’ing the contents of the data folder. It’s not a bad approach, but the considerations regarding quorum and consistency still apply.

  2. How can you force one node to receive a copy of all shards? It isn’t clear to me how your example script ensures this.

    1. As I noted in the blog post, you’re functionally bringing up a second, independent ES “cluster” (even if it’s only one node), then importing an entire index to said cluster. Your question doesn’t really apply to the situation.

  3. Ah ok, I get it now. The Perl code was a bit too concise for me to really get how this works at the ElasticSearch level. Thanks for the clarification.

  4. I am trying to understand how to implement the ES-based approach described above. Basically I am trying to understand the simple import script, which contains a lot of commands that I am not familiar with. Is there a way to perform an import using the ES API?

    1. Actually, that script does use the API. There are two API calls: one to set up a scroll search (i.e. gather the data), and another to “reindex” (i.e. import) said data into the new index.

  5. Thanks for your reply. I do see now the scroll search in the API section of the guide on elasticsearch.org although I do not see the reindex API anywhere. Regardless of that, why would you need to run a different cluster in order to perform a hot backup? If I have two nodes in the same cluster running on different machines and I shutdown one node and backup the data directory doesn’t that provide a valid backup? Sorry if this is a stupid question I’m a real novice. Thanks again.

    1. If you are running a simple two-node cluster where each index is fully replicated to both nodes, then your approach would work. If, however, you are running a larger cluster, then you likely do not have full replication to every node (or even any one single node, given a large enough data set).

      Please re-read the blog post – especially the section on “file-based approach” – as it addresses this question exactly. 🙂

  6. Okay. I’m learning. (Slowly). In order to ensure full replication on every node of a multi-node cluster you would need to set replicas to nodes - 1. This is wasteful from a storage perspective, but it means you can shut down any node at any time, back up the data directory, and know you have a coherent backup. The ES-based approach is much more efficient from a storage usage standpoint, but it costs in time and CPU during the re-indexing process. Correct?

    1. Having full replication across every node is costly from a CPU / memory perspective as well, since each shard in an index is a full Lucene instance.
