HAProxy Puppet module (phrawzty remix)

As part of a big Logstash project at Mozilla (more on that to come), I was looking for an HAProxy module for Puppet, stumbling across the official Puppetlabs module in the process.  I’m told that this module works fairly well, with the caveat that it sometimes outputs poorly-formatted configuration files (due to a manifestly buggy implementation of concat).  Furthermore, the module more or less requires storeconfigs, which we do not use in our primary Puppet system.

Long story short, while I never ended up using HAProxy as part of the project, I did remix the official module to solve both of the aforementioned issues.  From the README :

This module is based on Puppetlabs’ official HAProxy module; however, it has been “remixed” for use at Mozilla. There are two major areas where the original module has been changed :

  • Storeconfigs, while present, are no longer required.
  • The “listen” stanza format has been abandoned in favour of a frontend / backend style configuration.

A very simple configuration to proxy unrelated Redis nodes :

  class { 'haproxy': }

  haproxy::frontend { 'in_redis':
    ipaddress       => $::ipaddress,
    ports           => '6379',
    default_backend => 'out_redis',
    options         => { 'balance' => 'roundrobin' }
  }

  haproxy::backend { 'out_redis':
    listening_service => 'redis',
    server_names      => ['node01', 'node02'],
    ipaddresses       => ['node01.redis.server.foo', 'node02.redis.server.foo'],
    ports             => '6379',
    options           => 'check'
  }
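
Once Puppet has done its thing, the resulting configuration can be sanity-checked on the node itself; assuming the stock config file location, something like this will do :

  haproxy -c -f /etc/haproxy/haproxy.cfg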

If that sounds interesting to you, the module is available on my puppetlabs-haproxy repo on GitHub. Pull requests welcome !

LCA, the event that was

Linux Conf Australia 2013 (alias linux.conf.au, alias LCA) was an amazing experience, and it’s difficult to summarise it succinctly, but there are definitely some highlights and important take-aways that I’d like to make special note of.

My fundamental reason for attending was that Ben Kero and I were invited to do a talk regarding Puppet (an important infrastructural tool that we use at Mozilla).  The talk, like all of the presentations given by Mozillians at LCA this year (and there were more than a few), was very well-received, and delivered to a packed auditorium – in fact, people had to be turned away at the door beforehand !

Since our talk bumped up against the lunch period, we had occasion to stay in the space and launch an informal panel discussion that included other Mozillians as well as representatives from a number of other Open Source companies and organisations.  We engaged on a variety of topics, ranging from IT-centric subjects to questions about the future of the web and the importance of open standards and market competition.  It was only when the organisers forced us out that the auditorium was finally cleared.

Interestingly, from the moment I arrived in Australia, and through until the very last day, I found myself acting as an ambassador for Mozilla.

For example, a day before the conference even started, I was part of an informal debate concerning the future of mobile and the importance of FirefoxOS within it.  Other participants included an Ubuntu employee and a number of hardware hackers who – as it turned out – were already trying to port FirefoxOS to other types of phones.

Another example even occurred at a local cricket match.  I was wearing my Firefox T-shirt (which usually generates interest), and ended up doing some impromptu demos of Firefox for Android for a couple of different groups – the highly satisfying result was that a handful of people downloaded and installed it on the spot !

These, and more, are all opportunities to engage people about Mozilla, and in many cases, to introduce our values and mission to entirely new audiences.  During this and other conferences, I’ve found that even long-time users of Firefox and Thunderbird (for example) are often not even aware of who and what Mozilla is, and what we’re about.  I try my best to be a good ambassador, and I’m proud to represent the organisation wherever, and whenever I can.

From a personal perspective, the conference had two major benefits: it was an excellent learning experience, and a fantastic opportunity to spend time in-person with a number of my co-workers in the IT/Ops team.

Of note, I had the pleasure of attending a number of talks and presentations about technologies that we use in our environment, which have already had a direct impact on my work-flow.  Beyond the directly-applicable benefits, I was particularly inspired by a presentation about programming for embedded devices (shout-out to Bunnie Huang) which, due to peculiar technical constraints, requires a very precise and measured approach to development.  These ideals could – and should – be carried over outside of the embedded world.

In summary, LCA 2013 was an excellent opportunity to learn, teach, and engage the public concerning technology specifically, Mozilla in particular, and open source in general.  I’m already looking forward to next year !

How to use Puppet like an Adult

Hello friends,

Recently, Ben Kero (a fellow Mozillian) and I were invited to present a talk at Linux Conf Australia.  To say that we were excited about presenting at one of the best Libre / Open Source conferences in the world is an understatement.  We knew that we’d have to bring our A-game, and in all modesty, I like to think that we did.  If you were there in person, I’d like to personally thank you for coming out, and if you couldn’t make it, that’s ok – the organisers have made many videos from the 2013 LCA available online, including ours, entitled “How to use Puppet like an Adult”.

We cover a variety of topics, including parametrisation, how to select good pre-built modules, and how you can build eco-systems around Puppet itself.  Please feel free to drop us a line, either on Twitter or here on the blog.  Thanks !

Puppet at Mozilla, the podcast

Hello everybody !

If you’re interested in learning a little about Puppet at Mozilla, I highly recommend that you check out Puppet Podcast #7, where Brandon Burton (@solarce) and I (@phrawzty) talk with Puppet Labs’ Mike Stahnke (@stahnma) about just that topic.  It was our first podcast together, and frankly, it was more difficult than I thought it’d be.  That said, it was a great experience, and I hope to do it again sometime.  Hope you enjoy it !

quickly generate an encrypted password

Hi everybody !  Here’s a quick method for generating encrypted passwords that are suitable for things like /etc/passwd.  I realise that this isn’t terribly complex, but honestly, I always forget how to do this until I actually need to do it – so here’s a reminder for all of us. 🙂

#!/bin/bash

if [ "x$1" == 'x' ]; then
    echo "USAGE: $0 'password'"
    exit 1
fi

# Get an md5sum of the password string; a slice of this is used as the salt.
md5=$( echo "$1" | md5sum )
extract="${md5:2:8}"

# Calculate the SHA-512 hash of the password string using the extracted salt.
mkpasswd -m SHA-512 "$1" "$extract"
exit $?
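
For reference, here’s one way the output might be used.  This is just a quick sketch that assumes the script above is saved as mkpass.sh (the name is arbitrary) and that the target account already exists :

# Generate the encrypted password string.
hash=$( ./mkpass.sh 'correct horse battery staple' )

# usermod -p expects a pre-encrypted password, so the hash can be applied directly.
sudo usermod -p "$hash" someuser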

Elasticsearch backup strategies

Update: This is an old blog post and is no longer relevant as of version 1.x of Elasticsearch. Now we can just use the snapshot feature.
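
For the curious, here’s a minimal sketch of that feature against a 1.x cluster; the repository name, filesystem path, and snapshot name are placeholders, and the location needs to be reachable by every node in the cluster :

# Register a shared filesystem repository to hold the snapshots.
curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
    "type": "fs",
    "settings": { "location": "/mnt/es_backups" }
}'

# Snapshot all indices and wait for the operation to finish.
curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'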

Hello again! Today we’re going to talk about backup strategies for Elasticsearch. One popular way to make backups of ES requires the use of a separate ES node, while another relies entirely on the underlying file system of a given set of ES nodes.

The ES-based approach:

  • Bring up an independent (receiving) ES node on a machine that has network access to the actual ES cluster.
  • Trigger a script to perform a full index import from the ES cluster to the receiving node.
  • Since the receiving node is unique, every shard will be represented on said node.
  • Shut down the receiving node.
  • Preserve the /data/ directory from the receiving node.

The file system-based approach:

  • Identify a quorum of nodes in the ES cluster.
  • Quorum is necessary in order to ensure that all of the shards are represented.
  • Trigger a script that will preserve the /data/ directory of each selected node.

At first glance the file system-based approach appears simpler – and it is – but it comes with some drawbacks, notably the fact that coherency is impossible to guarantee due to the amount of time required to preserve /data/ on each node. In other words, if data changes on a node between the start and end times of the preservation mechanism, those changes may or may not be backed up. Furthermore, from an operational perspective, restoring nodes from individual shards may be problematic.

The ES-based approach does not have the coherency problem; however, beyond the fact that it is more complex to implement and maintain, it is also more costly in terms of service delivery. The actual import process itself requires a large number of requests to be made to the cluster, and the resulting resource consumption on both the cluster nodes and the receiving node is non-trivial. On the other hand, having a single, coherent representation of every shard in one place may pay dividends during a restoration scenario.

As is often the case, there is no one solution that is going to work for everybody all of the time – different environments have different needs, which call for different answers.  That said, if your primary goal is a consistent, coherent, and complete backup that can be easily restored when necessary (and overhead be damned!), then the ES-based approach is clearly the superior of the two.

import it !

Regarding the ES-based approach, it may be helpful to take a look at a simple import script as an example.  How about a quick and dirty Perl script (straight from the docs) ?

use strict;
use warnings;

use ElasticSearch;

# Local (receiving) node that will store the backup copy.
my $local = ElasticSearch->new(
    servers => 'localhost:9200'
);

# Remote cluster member to pull the index from.
my $remote = ElasticSearch->new(
    servers    => 'cluster_member:9200',
    no_refresh => 1
);

# Scan through the entire 'content' index in scrolled batches.
my $source = $remote->scrolled_search(
    index       => 'content',
    search_type => 'scan',
    scroll      => '5m'
);

# Re-index everything from the remote cluster into the local (receiving) node.
$local->reindex( source => $source );

You’ll want to replace the relevant elements with something sane for your environment, of course.

As for preserving the resulting /data/ directory (in either method), I will leave that as an exercise to the reader, since there are simply too many equally relevant ways to go about it.  It’s worth noting that the import method doesn’t need to be complex at all – in fact, it really shouldn’t be, since complex backup schemes introduce more chances for failure than necessary.

Happy indexing!

Send your logs to the cloud; Loggly vs. Papertrail

N.B. This post is from 2011 – the landscape has changed since then…


Centralised cloud-based logging.  It sounds tasty – and it is – but who should you go with?  Well, Loggly and Papertrail are practically the only games in town when it comes to this sort of service; the only other competitor in the space is Splunk Storm, but their offering – well-pedigreed though it may be – is strictly in private beta at this time, and therefore cannot really be considered a valid option.

The fact of the matter is that Loggly and Papertrail are, at a high level, functionally identical. They offer more or less the same bouquet of functionality, including alert triggers, aggregate visualisation, and even map reduce tools for data mining and reporting. Loggly has been around longer, and has a better track record for open-source involvement, meaning that the eco-system around their service is more mature; however, that doesn’t mean that they are necessarily superior to Papertrail in terms of the actual service.

My suggestion: If you’re in a hurry, flip a coin and go with one or the other. If you have the time, you should go ahead and try both out for a bit; Papertrail has a 7-day free trial programme, and Loggly is free (in perpetuity) for sufficiently small amounts of data and retention (which is no problem if you’re just poking around).

I’m very interested in hearing about actual user experiences with either or both, so please don’t hesitate to add a comment or drop me a line directly via the contact form.

Edit: From @pyr : « you can also consider @datadoghq which has a different take on the issue but might fit the bill. »

Edit 2: From the comments, there’s also Logentries, which I don’t personally have any experience with, but which appears to have a reasonably comprehensive offering as well.

Heavyweight tilt : GitHub vs. Bitbucket

When it comes to code hosting on The Internets today, GitHub is absolutely the hottest, trendiest service going – but it’s not alone. Right now, the primary direct competitor to GitHub is Bitbucket, and choosing the best service for you or your company can be less than obvious – so let’s break it down, shall we?

GitHub is generally considered to be the most popular code hosting and collaboration site out there today. They have an excellent track record for innovation and evolution of their service, and they put their money where their mouth is, notably by promoting and releasing their own internal tools into the open source community.  Their site offers a buffet of ever-improving facilities for collaborative activity, notably including an integrated issue tracker and excellent code comparison tools, among others. To be fair, not every feature has had the same level of care and attention paid to it, and as a result, some elements feel quite a bit more mature than others; however, again, they never stop trying to make things better.

Bitbucket looks a lot like GitHub.  That’s a fact.  I don’t honestly know which one came first, but it’s clear that today they’re bouncing off of each other in terms of design, features, and functionality.  You can more or less transpose your user experience between the two sites without missing too much of a beat, so for a casual user looking to contribute here and there, you get two services for the price of one learning curve (nice).  Bitbucket’s pace of evolution is (perhaps) less blistering, but they too are capable of rolling out new and improved toys over time.

let’s get down to brass tacks

Both services offer the same basic functionality: the ability to create an account and associate that account with any number of publicly-accessible repositories; however, if you want a private repository, GitHub will make you pay for it, whereas Bitbucket offers it gratis.  There, as it is said, lies the rub.  More on this later.

One of the big differences between the two services lies in their respective origins: GitHub remains an independent start-up, whereas Bitbucket (although once independent) was acquired by – and is now strongly associated with – Atlassian (of JIRA fame). It is my opinion that this affects the cultural make-up of Bitbucket in subtle ways, leading to a more corporate take on development, deployment, and importantly, community relations and involvement.  Take a look at their respective blogs (go ahead, I’ll wait).

A quick scan of the past few months from each blog will reveal some important differences:

  • GitHub’s release schedule is more aggressive, with improvements and new features coming more regularly, whereas Bitbucket places greater emphasis on their tight integration with JIRA, Jenkins, and other industry tools.
  • Bitbucket advertises paid services and software on their blog, whereas GitHub advertises open source projects.
  • Bitbucket’s blog has one recent author, whereas GitHub’s blog has many recent authors.
  • GitHub hosts more community events (notably drinkups, heh) over a greater geographic area than Bitbucket (and their posts have more community response overall).

Also, check out GitHub’s “about us” page – brogrammers abound!  I’d compare the group to Bitbucket, but as it so happens, they don’t have an analogous page.

Previously I mentioned that GitHub would like you to pay for private repositories.  This is obviously part of their revenue scheme (and who can blame them for wanting to get that cheese?), but it also has the side-effect of encouraging people to willingly host their projects publicly.  This has ended up creating a (very) large community of active participants representing a variety of languages and interests, which in turn results in more projects, and so on and so forth.  This feedback loop is interesting since it auto-builds popularity: the more people use it, the more people will use it.

These observations are, in no way, objective statements of the superiority of one platform over the other – they are, however, indicative of cultural differences between the two companies.  This is (or, at least, should be) a non-trivial element when deciding which service is right for you or your organisation.  For example, I’m a beer-drinking open source veteran that works in start-ups and small companies, so culturally my preferences are different than those of a suit-wearing system architect, working for a thousand-person consulting firm.  One isn’t necessarily better than the other – they’re just not the same (and that’s OK).

but wait, there’s more

Alright, here comes the shocker: for paid services (i.e. private repositories), GitHub is much more expensive than Bitbucket.  As in nowhere near the same price.  At all.  How can this be?  Well, I’m not privy to the financials of either company (if I were, I doubt I’d have written this post), but hey, the money for all those great open source projects, drinkups, and (bluntly) salaries has to come from somewhere – and while Bitbucket has Atlassian’s pockets backing them, GitHub has to stand on their own successes, and live with their own failures.

The two services are not dissimilar technically speaking, so it’s really up to you to decide which culture is better suited for your project.  Do you just need a spot to put a private project that you work on alone, isolated from the greater Internet?  Bitbucket.  Do you have a public project that you’d like other people to discover, hack on together, and build a community around?  GitHub.  As for paid services, well, I suppose that comes down to whether you want to pay extra to support what GitHub is doing or not.

Now, let’s be fair: for a lot of companies, “culture” is an irrelevant factor in their purchasing decisions – cost is the only concern.  Fair enough.  But let’s say you’ve got a team of developers, all of whom already have their own projects on GitHub, are familiar with the tools and processes, and have a network of fellow hackers built-in and ready to go.  In that case, perhaps culture is worth something after all.

Improvements in Cassandra 1.0, briefly stated

DataStax recently announced the availability of Cassandra 1.0 (stable), and along with that announcement, they made a series of blog posts (1, 2, 3, 4, 5) about many of the great new features and improvements that the current version brings to the table.

For those of you looking for an executive summary of those posts, you’re in luck, ’cause I’ve got your back on this one.

  • New multi-layer approach to compression that provides improvements to both write and (especially) read operations.
  • Said compression strategy also yields potentially significant disk space savings.
  • Leverages the JNA library in order to provide in-memory caching; this procedure is optimised for garbage collection, resulting in a more efficient collection and a smaller overall footprint.
  • A much improved compaction strategy results in less costly compaction runs, improving overall performance on each node.
  • Fewer requests are made over the network, and said requests are smaller in size, improving overall performance across the cluster.

In short, 1.0 is a very significant, very important upgrade to 0.8 (et al.), and one which will likely bring it to the forefront of the hardcore big data / NoSQL scene at large.

Nagios plugin to parse JSON from an HTTP response

Update 2015-10-07: This plugin has evolved – please check the latest README for up to date details.

Hello all !  I wrote a plugin for Nagios that will parse JSON from an HTTP response.  If that sounds interesting to you, feel free to check out my check_http_json repo on GitHub.  The plugin has been tested with Ruby 1.8.7 and 1.9.3.  Pull requests welcome !

Usage: ./check_http_json.rb -u <uri> -e <element> -w <warn> -c <crit>
-h, --help                       Help info.
-v, --verbose                    Additional human output.
-u, --uri URI                    Target URI. Incompatible with -f.
    --user USERNAME              HTTP basic authentication username.
    --pass PASSWORD              HTTP basic authentication password.
-f, --file PATH                  Target file. Incompatible with -u.
-e, --element ELEMENT            Desired element (ex. foo=>bar=>ish is foo.bar.ish).
-E, --element_regex REGEX        Desired element expressed as regular expression.
-d, --delimiter CHARACTER        Element delimiter (default is period).
-w, --warn VALUE                 Warning threshold (integer).
-c, --crit VALUE                 Critical threshold (integer).
-r, --result STRING              Expected string result. No need for -w or -c.
-R, --result_regex REGEX         Expected string result expressed as regular expression. No need for -w or -c.
-W, --result_warn STRING         Warning if element is [string]. -C is required.
-C, --result_crit STRING         Critical if element is [string]. -W is required.
-t, --timeout SECONDS            Wait before HTTP timeout.

The --warn and --crit arguments conform to the Nagios threshold format guidelines.

If a simple result of either string or regular expression (-r or -R) is specified :

  • A match is OK and anything else is CRIT.
  • The warn / crit thresholds will be ignored.

If the warn and crit results (-W and -C) are specified :

  • A match is WARN or CRIT and anything else is OK.
  • The warn / crit thresholds will be ignored.

Note that (-r or -R) and (-W and -C) are mutually exclusive.

Note also that the response must be pure JSON. Bad things happen if this isn’t the case.
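
By way of illustration, here are a couple of hand-run invocations against a local Elasticsearch node.  This is just a rough sketch that assumes the plugin lives in the current directory and that something is listening on port 9200 :

# String check : OK if the cluster status is green, CRIT otherwise.
./check_http_json.rb -u 'http://localhost:9200/_cluster/health' -e 'status' -r 'green'

# Integer thresholds : WARN below 4 nodes, CRIT below 3 (standard Nagios range syntax).
./check_http_json.rb -u 'http://localhost:9200/_cluster/health' -e 'number_of_nodes' -w '4:' -c '3:'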

How you choose to implement the plugin is, of course, up to you.  Here’s one suggestion:

# check json from http
define command{
 command_name    check_http_json-string
 command_line    /etc/nagios3/plugins/check_http_json.rb -u 'http://$HOSTNAME$:$ARG1$/$ARG2$' -e '$ARG3$' -r '$ARG4$'
}
define command{
 command_name    check_http_json-int
 command_line    /etc/nagios3/plugins/check_http_json.rb -u 'http://$HOSTNAME$:$ARG1$/$ARG2$' -e '$ARG3$' -w '$ARG4$' -c '$ARG5$'
}

# make use of http json check
define service{
 service_description     elasticsearch-cluster-status
 check_command           check_http_json-string!9200!_cluster/health!status!green
}
define service{
 service_description     elasticsearch-cluster-nodes
 check_command           check_http_json-int!9200!_cluster/health!number_of_nodes!4:!3:
}