Elasticsearch backup strategies

Update: This is an old blog post and is no longer relevant as of version 1.x of Elasticsearch. Now we can just use the snapshot feature.

Hello again! Today we’re going to talk about backup strategies for Elasticsearch. One popular way to make backups of ES requires the use of a separate ES node, while another relies entirely on the underlying file system of a given set of ES nodes.

The ES-based approach:

  • Bring up an independent (receiving) ES node on a machine that has network access to the actual ES cluster.
  • Trigger a script to perform a full index import from the ES cluster to the receiving node.
  • Since the receiving node is unique, every shard will be represented on said node.
  • Shut down the receiving node.
  • Preserve the /data/ directory from the receiving node.

The file system-based approach:

  • Identify a quorum of nodes in the ES cluster.
  • Quorum is necessary in order to ensure that all of the shards are represented.
  • Trigger a script that will preserve the /data/ directory of each selected node.
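Before preserving anything, it is worth checking that the selected nodes really do hold a copy of every shard between them. Here is a minimal sketch of that check in Python (my choice of language, not something from the original scripts). The node names and the shard layout are made up for illustration; in practice you would fill `shards_by_node` from whatever your cluster reports about shard allocation.

```python
def missing_shards(all_shards, shards_by_node, selected_nodes):
    """Return the shard ids that no selected node holds a copy of."""
    covered = set()
    for node in selected_nodes:
        covered |= set(shards_by_node.get(node, ()))
    return set(all_shards) - covered


# Hypothetical layout: 4 shards, each stored on two of three nodes.
shards_by_node = {
    "node-a": {0, 1, 2},
    "node-b": {1, 2, 3},
    "node-c": {0, 3},
}

print(missing_shards(range(4), shards_by_node, ["node-a", "node-b"]))  # set()
print(missing_shards(range(4), shards_by_node, ["node-a"]))            # {3}
```

An empty result means the selection is safe to back up from; anything else names the shards that would be lost.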

At first glance the file system-based approach appears simpler – and it is – but it comes with some drawbacks, notably the fact that coherency is impossible to guarantee due to the amount of time required to preserve /data/ on each node. In other words, if data changes on a node between the start and end times of the preservation mechanism, those changes may or may not be backed up. Furthermore, from an operational perspective, restoring nodes from individual shards may be problematic.
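To make the coherency problem concrete, here is a rough Python sketch that compares file modification times before and after a copy, and reports anything that moved in between. It is only a detector, not a fix: it assumes that any write to a file updates its mtime, and the function names and the throwaway directory are mine, standing in for a real node's data directory.

```python
import os
import tempfile


def snapshot_mtimes(root):
    """Map each file under root (as a relative path) to its mtime in nanoseconds."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[os.path.relpath(path, root)] = os.stat(path).st_mtime_ns
            except FileNotFoundError:
                pass  # file vanished between listing and stat; the diff will show it
    return mtimes


def changed_between(before, after):
    """Return the sorted paths that were added, removed, or modified."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))


# Demo on a throwaway directory standing in for a node's data directory.
with tempfile.TemporaryDirectory() as data_dir:
    segment = os.path.join(data_dir, "segment_1")
    with open(segment, "w") as fh:
        fh.write("old")
    before = snapshot_mtimes(data_dir)
    # ... the (slow) copy of data_dir would run here ...
    os.utime(segment, ns=(0, 0))  # force a different mtime to simulate a write mid-copy
    after = snapshot_mtimes(data_dir)
    changed = changed_between(before, after)

print(changed)  # ['segment_1']
```

If the list is non-empty, the copy you just made cannot be trusted to be coherent, and you are left to retry or to accept the risk.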

The ES-based approach does not have the coherency problem; however, beyond the fact that it is more complex to implement and maintain, it is also more costly in terms of service delivery. The actual import process itself requires a large number of requests to be made to the cluster, and the resulting resource consumption on both the cluster nodes and the receiving node is non-trivial. On the other hand, having a single, coherent representation of every shard in one place may pay dividends during a restoration scenario.

As is often the case, there is no one solution that is going to work for everybody all of the time – different environments have different needs, which call for different answers.  That said, if your primary goal is a consistent, coherent, and complete backup that can be easily restored when necessary (and overhead be damned!), then the ES-based approach is clearly the superior of the two.

import it !

Regarding the ES-based approach, it may be helpful to take a look at a simple import script as an example.  How about a quick and dirty Perl script (straight from the docs) ?

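use strict;
use warnings;
use ElasticSearch;

# Receiving node that will hold the backup copy.
my $local = ElasticSearch->new(
    servers => 'localhost:9200'
);

# Any member of the live cluster; no_refresh keeps the client on the
# listed server rather than refreshing the node list from the cluster.
my $remote = ElasticSearch->new(
    servers    => 'cluster_member:9200',
    no_refresh => 1
);

# A scan-type scrolled search streams every document in the index.
my $source = $remote->scrolled_search(
    index       => 'content',
    search_type => 'scan',
    scroll      => '5m'
);

# Copy every scrolled document into the receiving node.
$local->reindex(source => $source);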
You’ll want to replace the relevant elements with something sane for your environment, of course.

As for preserving the resulting /data/ directory (in either method), I will leave that as an exercise to the reader, since there are simply too many equally relevant ways to go about it.  It’s worth noting that the import method doesn’t need to be complex at all – in fact, it really shouldn’t be, since complex backup schemes tend to have more chances for failure than necessary.
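For what it’s worth, here is one deliberately boring way to do the preserving, sketched in Python with nothing but the standard library: roll the data directory into a timestamped tarball. The `archive_data_dir` name, the `es-data` label, and the `nodes/0` layout in the demo are all my own placeholders, and the sketch assumes that the node has already been shut down (or that you accept the coherency risk described earlier).

```python
import os
import tarfile
import tempfile
import time


def archive_data_dir(data_dir, dest_dir, label="es-data"):
    """Write data_dir into a timestamped .tar.gz under dest_dir and return its path."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive_path = os.path.join(dest_dir, f"{label}-{stamp}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the data directory's own name,
        # so the archive does not embed the absolute path of the host.
        tar.add(data_dir, arcname=os.path.basename(os.path.normpath(data_dir)))
    return archive_path


# Demo on a throwaway directory tree; the layout below is purely illustrative.
with tempfile.TemporaryDirectory() as work:
    data_dir = os.path.join(work, "data")
    os.makedirs(os.path.join(data_dir, "nodes", "0"))
    with open(os.path.join(data_dir, "nodes", "0", "segment_1"), "w") as fh:
        fh.write("example")
    backup_dir = os.path.join(work, "backups")
    os.makedirs(backup_dir)
    archive = archive_data_dir(data_dir, backup_dir)
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())

print(os.path.basename(archive))
print(members)
```

The UTC timestamp in the file name keeps successive backups from overwriting each other, which is about as much cleverness as a backup script should have.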

Happy indexing!

Send your logs to the cloud; Loggly vs. Papertrail

N.B. This post is from 2011 – the landscape has changed since then…


Centralised cloud-based logging.  It sounds tasty – and it is – but who should you go with?  Well, Loggly and Papertrail are the only games in town when it comes to the aforementioned service; the only other competitor in this space is Splunk Storm, but their offering – well-pedigreed though it may be – is strictly in private beta at this time, and therefore cannot really be considered a valid option.

The fact of the matter is that Loggly and Papertrail are, at a high level, functionally identical. They offer more or less the same bouquet of functionality, including alert triggers, aggregate visualisation, and even map reduce tools for data mining and reporting. Loggly has been around longer, and has a better track record for open-source involvement, meaning that the eco-system around their service is more mature; however, that doesn’t mean that they are necessarily superior to Papertrail in terms of the actual service.

My suggestion: If you’re in a hurry, flip a coin and go with one or the other. If you have the time, you should go ahead and try both out for a bit; Papertrail has a 7-day free trial programme, and Loggly is free (in perpetuity) for sufficiently small amounts of data and retention (which is no problem if you’re just poking around).

I’m very interested in hearing about actual user experiences with either or both, so please don’t hesitate to add a comment or drop me a line directly via the contact form.

Edit: From @pyr : « you  can also consider @datadoghq which has a different take on the issue but might fit the bill. »

Edit 2: From the comments, there’s also Logentries, which I don’t personally have any experience with, but which appears to be a reasonably comprehensive offering as well.

Heavyweight tilt : GitHub vs. Bitbucket

When it comes to code hosting on The Internets today, GitHub is absolutely the hottest, trendiest service going – but it’s not alone. Right now, the primary direct competitor to GitHub is Bitbucket, and choosing the best service for you or your company can be a less than obvious scenario – so let’s break it down, shall we?

GitHub is generally considered to be the most popular code hosting and collaboration site out there today. They have an excellent track record for innovation and evolution of their service, and they put their money where their mouth is, notably by promoting and releasing their own internal tools into the open source community.  Their site offers a buffet of ever-improving facilities for collaborative activity, notably including an integrated issue tracker and excellent code comparison tools, among others. To be fair, not every feature has had the same level of care and attention paid to it, and as a result, some elements feel quite a bit more mature than others; however, again, they never stop trying to make things better.

Bitbucket looks a lot like GitHub.  That’s a fact.  I don’t honestly know which one came first, but it’s clear that today they’re bouncing off of each other in terms of design, features, and functionality.  You can more or less transpose your user experience between the two sites without missing too much of a beat, so for a casual user looking to contribute here and there, you get two learning curves for the price of one (nice).  Bitbucket’s pace of evolution is (perhaps) less blistering, but they too are capable of rolling out new and improved toys over time.

let’s get down to brass tacks

Both services offer the same basic functionality, which is the ability to create an account, and associate that account with any number of publicly-accessible repositories; however, if you want a private repository, GitHub will make you pay for it, whereas Bitbucket offers it gratis.  There, as it is said, lies the rub.  More on this later.

One of the big differences between the two services lies in their respective origins: GitHub remains an independent start-up, whereas Bitbucket (although once independent) was acquired by – and is now strongly associated with – Atlassian (of JIRA fame). It is my opinion that this affects the cultural make-up of Bitbucket in subtle ways, leading to a more corporate take on development, deployment, and importantly, community relations and involvement.  Take a look at their respective blogs (go ahead, I’ll wait).

A quick scan of the past few months from each blog will reveal some important differences:

  • GitHub’s release schedule is more aggressive, with improvements and new features coming more regularly, whereas Bitbucket places greater emphasis on their tight integration with JIRA, Jenkins, and other industry tools.
  • Bitbucket advertises paid services and software on their blog, whereas GitHub advertises open source projects.
  • Bitbucket’s blog has one recent author, whereas GitHub’s blog has many recent authors.
  • GitHub hosts more community events (notably drinkups, heh) over a greater geographic area than Bitbucket (and their posts have more community response overall).

Also, check out GitHub’s “about us” page – brogrammers abound!  I’d compare the group to Bitbucket, but as it so happens, they don’t have an analogous page.

Previously I mentioned that GitHub would like you to pay for private repositories.  This is obviously part of their revenue scheme (and who can blame them for wanting to get that cheese?), but it also has the side-effect of making people choose to willingly host their projects publicly.  This has ended up creating a (very) large community of active participants representing a variety of languages and interests, which in turn results in more projects, and so on and so forth.  This feedback loop is interesting since it auto-builds popularity: as more people use it, the more people will use it.

These observations are in no way objective statements of the superiority of one platform over the other – they are, however, indicative of cultural differences between the two companies.  This is (or, at least, should be) a non-trivial element when deciding which service is right for you or your organisation.  For example, I’m a beer-drinking open source veteran who works in start-ups and small companies, so culturally my preferences are different from those of a suit-wearing system architect working for a thousand-person consulting firm.  One isn’t necessarily better than the other – they’re just not the same (and that’s OK).

but wait, there’s more

Alright, here comes the shocker: for paid services (i.e. private repositories), GitHub is much more expensive than Bitbucket.  As in nowhere near the same price.  At all.  How can this be?  Well, I’m not privy to the financials of either company (if I were, I doubt I’d have written this post), but hey, the money for all those great open source projects, drinkups, and (bluntly) salaries has to come from somewhere – and while Bitbucket has Atlassian’s pockets backing them, GitHub has to stand on their own successes, and live with their own failures.

The two services are not dissimilar technically speaking, so it’s really up to you to decide which culture is better suited for your project.  Do you just need a spot to put your private project, that you program alone, isolated from the greater Internet?  Bitbucket.  Do you have a public project that you’d like other people to discover, hack on together, and build a community around?  GitHub.  As for paid services, well, I suppose that comes down to whether you want to pay extra to support what GitHub is doing or not.

Now, let’s be fair, for a lot of companies, “culture” is an irrelevant factor in their purchasing department – cost is the only concern.  Fair enough.  But let’s say you’ve got a team of developers, all of whom already have their own projects on GitHub, are familiar with the tools and processes, and have a network of fellow hackers built-in and ready to go.  In that case, perhaps culture is worth something after all.