Last year, my esteemed colleague JP Schneider and I were invited to keynote at a couple of conferences. We gave two variations on the same talk, entitled “Genesis: Terraforming a New Home for Firefox Crash Reporter”, at Hashiconf 2015 and Velocity EU 2015.
The blurb for these talks is as follows:
Everyone loves talking about different stacks used in migrating applications to DevOps methodologies, but the most important and valuable change is not a technology stack; instead, it is the human stack.
In this keynote address, we will cover the intersection of technology and humans, and walk the audience through a real life example of the move of Firefox crash reporter. The three engineers tasked with this had to build the plane while it was in the air, all without landing or crashing.
As with many projects, hard deadlines and requirements made the team work through a lot of tough decisions and compromises while simultaneously training devs, product managers, managers, and other engineers in the new world of DevOps and Continuous Delivery.
The talks were a lot of fun and were well received by both audiences. We kept things light (including images and quotes from Shakespeare to Mobb Deep) and we kept them honest (covering both our successes and our failures).
The Velocity talk, which at 25 minutes in length is the shorter of the two, is aimed at a more general audience; the Hashiconf talk is longer, and includes a lot more detail about the Hashicorp tools that we used to reach our goals. I hope you enjoy either or both of them. 🙂
Terraform is a Hashicorp tool which embraces the Infrastructure as Code model to manage a variety of platforms and services in today’s modern, cloud-based Internet. It’s still in development, but it already provides a wealth of useful functionality, notably with regard to Amazon and Digital Ocean interactions. The one thing it doesn’t do very well, however, is manage pre-existing infrastructure. In this blog post we’ll explore a way to integrate extant infra into a basic Terraform instance.
Note that this post is current as of Terraform v0.3.6. Hashicorp has hinted that future versions of Terraform will handle this problem in a more graceful way, so be sure to check those changelogs regularly. 🙂
A full example and walk-through will follow; however, for those familiar with Terraform and just looking for the tl;dr, I got you covered.
Declare a new, temporary resource in your Terraform plan that is nearly identical to the extant resource.
Apply the plan, thus instantiating the temporary “twinned” resource and building a state file.
Alter the appropriate id fields to be the same as the extant resource in both the state and config files.
Perform a refresh which will populate the state file with the correct data for the declared extant resource.
Remove the temporary resource from AWS manually.
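As a rough illustration of step 3 above, here is a minimal Python sketch that rewrites every occurrence of the temporary twin's id inside a parsed state file. It makes no assumptions about the state file layout beyond "it's JSON". The function names and the `temp-twin` id are hypothetical and are not part of Terraform. Keep a copy of the state file before trying anything like this.

```python
import json


def retarget_ids(node, old_id, new_id):
    """Return a copy of a parsed JSON tree in which every string value
    equal to old_id has been replaced by new_id (keys are left alone)."""
    if isinstance(node, dict):
        return {key: retarget_ids(value, old_id, new_id) for key, value in node.items()}
    if isinstance(node, list):
        return [retarget_ids(item, old_id, new_id) for item in node]
    if node == old_id:
        return new_id
    return node


def retarget_state_file(path, old_id, new_id):
    """Rewrite a JSON state file in place, swapping old_id for new_id."""
    with open(path) as handle:
        state = json.load(handle)
    with open(path, "w") as handle:
        json.dump(retarget_ids(state, old_id, new_id), handle, indent=4)


# Hypothetical usage: point the twin's entry at the extant bucket.
# retarget_state_file("terraform.tfstate", "temp-twin", "phrawzty-tftest-1422290325")
```

The config file still needs the same change made by hand, and a refresh afterwards (step 4) is what actually fills in the correct attributes.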
faster and more dangerous, please.
Walking through the process and meticulously checking every step? Ain’t nobody got time for that!
Edit the state file and insert the resource directly – it’s just JSON, after all.
In the examples below, the notation [...] is used to indicate truncated output or data.
Also note that the AWS cli tool is assumed to be configured and functional.
The extant resource in this case is an S3 bucket called phrawzty-tftest-1422290325. This resource is unknown to Terraform.
$ aws s3 ls | grep tftest
2015-01-26 17:39:07 phrawzty-tftest-1422290325
Declare the temporary twin in the Terraform config:
Verify that Terraform is satisfied with the state:
Refreshing Terraform state prior to plan...
aws_s3_bucket.phrawzty-tftest: Refreshing state... (ID: phrawzty-tftest-1422290325)
No changes. Infrastructure is up-to-date. This means that Terraform
could not detect any differences between your configuration and
the real physical resources that exist. As a result, Terraform
doesn't need to do anything.
Recently, Ben Kero (a fellow Mozillian) and I were invited to present a talk at Linux Conf Australia. To say that we were excited about presenting at one of the best Libre / Open Source conferences in the world is an understatement. We knew that we’d have to bring our A-game, and in all modesty, I like to think that we did. If you were there in person, I’d like to personally thank you for coming out, and if you couldn’t make it, that’s ok – the organisers have made many videos from the 2013 LCA available online, including ours, entitled “How to use Puppet like an Adult“.
We cover a variety of topics, including parametrisation, how to select good pre-built modules, and how you can build eco-systems around Puppet itself. Please feel free to drop us a line, either on Twitter or here on the blog. Thanks !
Update: This is an old blog post and is no longer relevant as of version 1.x of Elasticsearch. Now we can just use the snapshot feature.
Hello again! Today we’re going to talk about backup strategies for Elasticsearch. One popular way to make backups of ES requires the use of a separate ES node, while another relies entirely on the underlying file system of a given set of ES nodes.
The ES-based approach:
Bring up an independent (receiving) ES node on a machine that has network access to the actual ES cluster.
Trigger a script to perform a full index import from the ES cluster to the receiving node.
Since the receiving node is unique, every shard will be represented on said node.
Shut down the receiving node.
Preserve the /data/ directory from the receiving node.
The file system-based approach:
Identify a quorum of nodes in the ES cluster.
Quorum is necessary in order to ensure that all of the shards are represented.
Trigger a script that will preserve the /data/ directory of each selected node.
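To make the "all of the shards are represented" requirement concrete, here is a small Python sketch that checks whether a chosen set of nodes covers every shard. The shard-to-node mapping is supplied by hand here. How you obtain it from your cluster is left open, and the function name is purely illustrative.

```python
def missing_shards(all_shards, shards_by_node, selected_nodes):
    """Return the set of shards that no selected node holds a copy of."""
    covered = set()
    for node in selected_nodes:
        covered.update(shards_by_node.get(node, ()))
    return set(all_shards) - covered


# Hypothetical three-node cluster, three shards, one replica each.
layout = {"node-a": {0, 1}, "node-b": {1, 2}, "node-c": {2, 0}}
print(missing_shards(range(3), layout, ["node-a"]))
print(missing_shards(range(3), layout, ["node-a", "node-b"]))
```

An empty result means the selection is safe to back up from. Anything else names the shards that would be lost.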
At first glance the file system-based approach appears simpler – and it is – but it comes with some drawbacks, notably the fact that coherency is impossible to guarantee due to the amount of time required to preserve /data/ on each node. In other words, if data changes on a node between the start and end times of the preservation mechanism, those changes may or may not be backed up. Furthermore, from an operational perspective, restoring nodes from individual shards may be problematic.
The ES-based approach does not have the coherency problem; however, beyond the fact that it is more complex to implement and maintain, it is also more costly in terms of service delivery. The actual import process itself requires a large number of requests to be made to the cluster, and the resulting resource consumption on both the cluster nodes and the receiving node is non-trivial. On the other hand, having a single, coherent representation of every shard in one place may pay dividends during a restoration scenario.
As is often the case, there is no one solution that is going to work for everybody all of the time – different environments have different needs, which call for different answers. That said, if your primary goal is a consistent, coherent, and complete backup that can be easily restored when necessary (and overhead be damned!), then the ES-based approach is clearly the superior of the two.
import it !
Regarding the ES-based approach, it may be helpful to take a look at a simple import script as an example. How about a quick and dirty Perl script (straight from the docs) ?
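use ElasticSearch;

my $local = ElasticSearch->new(
    servers => 'localhost:9200'
);
my $remote = ElasticSearch->new(
    servers    => 'cluster_member:9200',
    no_refresh => 1
);
my $source = $remote->scrolled_search(
    index       => 'content',
    search_type => 'scan',
    scroll      => '5m'
);
# pull everything found by the scan into the receiving node
$local->reindex(source => $source);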
You’ll want to replace the relevant elements with something sane for your environment, of course.
As for preserving the resulting /data/ directory (in either method), I will leave that as an exercise to the reader, since there are simply too many equally relevant ways to go about it. It’s worth noting that the import method doesn’t need to be complex at all – in fact, it really shouldn’t be, since complex backup schemes tend to have more chances for failure than is necessary.
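For what it's worth, one of those many ways is a plain timestamped tarball. The Python sketch below is only that, a sketch. The paths are placeholders, the function name is made up, and it assumes the node has been shut down (or is otherwise quiescent) before the archive is taken.

```python
import tarfile
import time
from pathlib import Path


def archive_data_dir(data_dir, backup_dir):
    """Write a gzip-compressed, timestamped tarball of data_dir into
    backup_dir and return the path of the new archive."""
    data_dir = Path(data_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = backup_dir / f"es-data-{stamp}.tar.gz"
    with tarfile.open(target, "w:gz") as archive:
        # arcname keeps the archive relative instead of storing the full path
        archive.add(data_dir, arcname=data_dir.name)
    return target


# Hypothetical usage; substitute the paths used in your environment.
# archive_data_dir("/path/to/elasticsearch/data", "/path/to/backups")
```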
Hello everybody ! Today’s post is about the Distributed Numeric Assignment (or « DNA » ) plug-in for the 389 Directory Server (also known as the Fedora, Red Hat, and CentOS Directory Servers). Although this plug-in has existed for quite some time, there isn’t a whole lot of documentation about how to implement it in a real-world scenario. I recently submitted some documentation to the maintainer of the 389 wiki, but since i’m not sure how, when, or in what form that documentation will come to exist on their site, i thought i’d expand on it here as well. If you’ve made it this far, i’m going to assume that you’re already familiar with the basics of LDAP, and already have an instance of Directory Server up and running – if not, i suggest you take a look through the official Red Hat documentation in order to get you started.
By way of some background, it is worth noting that my basic requirement was simply to have a centralised back-end for authenticating SSH logins to the various machines in our park. The actual numerical values for the UID and GID fields did not need to be the same, they simply needed to be both extant and unique for each user, with the further caveat that they should not collide with any existing values that might be defined locally on the machines. This is a very basic set of requirements, so it is an excellent starting point for our example. The first step is to activate the DNA plug-in via the console :
[TAB] Servers and Applications
Domain -> Server -> Server Group -> Directory Server
Server -> Plug-ins -> Distributed Numeric Assignment
[X] Enable plug-in
The Directory Server needs to be restarted in order for the activation to take effect. This can either be done via the console, or via the command-line as normal. The next step is to define how DNA will interact with new user data ; this is different from configuring the plug-in itself, in that we will be setting up a layer in between the plug-in and the user data that will allow certain values to be generated automatically (which is, of course, the end goal of this exercise). Consider the following two LDIF snippets :
As you can see, they are nearly identical. This configuration activates the DNA magic-number functionality for the UID and GID fields as shown in the Posix attributes section of the console, though the values used may require further explanation. The only particular requirement for the magic number (specified by the « dnamagicregen » field) is that it be a value that cannot occur naturally, which is to say a value that would not be generated by the DNA plug-in, nor set manually at any time. The default value is « 0 », but since this is clearly a number with meaning on the average Posix system, i would recommend a suitably large number that is unlikely to ever be used, such as « 99999 ». Non-numerical values can technically be used too ; however, these will not be acceptable to the console, so unless you’re using a third-party interface (or doing everything from the commandline), a numerical value must be used.
The « dnanextvalue » field functionally indicates where the count will start from. As noted previously, in order to avoid collisions with existing local entries on the various machines, i chose a start point of « 1000 », which was more than acceptable in my environment. Once these two snippets are integrated via the commandline, simply re-start the Directory Server (again), and you’re good to go. From now on, any time that a new user is created with the value « 99999 » entered into either (or both) of the UID and GID Posix fields, DNA will automagically generate real values as appropriate.
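If the magic-number idea still feels a bit abstract, the following toy Python model shows the behaviour described above. A value equal to the magic number gets swapped for the next free number, and anything else passes through untouched. This is an illustration of the concept only, not the plug-in's actual code. The class and method names are invented, and it models neither ranges nor multiple servers.

```python
class MagicNumberAllocator:
    """Toy model: hand out sequential values whenever the magic number is seen."""

    def __init__(self, next_value=1000, magic=99999):
        self.next_value = next_value  # plays the role of « dnanextvalue »
        self.magic = magic            # plays the role of the magic number

    def resolve(self, requested):
        """Return the value that would end up stored for a new entry."""
        if requested != self.magic:
            return requested          # explicit values are kept as-is
        assigned = self.next_value
        self.next_value += 1
        return assigned


uids = MagicNumberAllocator()
print(uids.resolve(99999))  # first generated value
print(uids.resolve(99999))  # next generated value
print(uids.resolve(500))    # a manually chosen value is left alone
```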
Hello all ! Today we’re going to take a look at a somewhat obscure problem that – once encountered – can cause nothing but headaches for a system administrator. The problem relates to conflicts in CPAN RPM packages, and what can be done to work around the issue. If you’ve made it this far, i’m going to assume a couple of things : you’re comfortable with RPMs and repositories, have worked with a .spec file before, and you know what Perl modules are. Good ? Ok, let’s go.
Edit : About a week after i posted this article, the pastebin i uploaded the examples to disappeared. Maybe it will come back – i don’t know – but if not, sorry for the broken links…
CPAN is an enormous collection of Perl modules. If you’ve ever written a Perl script, there’s a good chance you’ve used a module that – at one point or another – came from this archive. One of the really neat features of CPAN is the interactive manner in which modules can be downloaded and installed from the archive using Perl right from the command line (frankly, if you’re reading this post, there’s a good chance you’ve used this feature, too). This is a fairly common way to install new modules and add functionality to your system, especially if you’re coding for local use (i.e. on your personal box).
It’s useful, but it’s not perfect, and one of the key areas where it starts to fail is scalability : if you’ve got a bunch of machines, and you need to SSH into each one to interactively install a CPAN module or two, it’s going to be a hassle. Likewise, CPAN doesn’t often find its way into the hearts and minds of enterprise Red Hat or CentOS environments, where the official policy is often to install software via RPM only (for support, administration, and sanity reasons, this is often the case).
Luckily, some of the most commonly used CPAN modules exist as RPMs in the default repositories. Some, but not all (and not even « many ») – for this, there are other repositories available. Some examples :
That last one – Magnum – is particularly interesting given the subject of our post today. From their info page :
At Magnum we have a firm rule that all CPAN modules on our machines are installed from RPMs. The Fedora and Centos projects build RPMs for many CPAN modules, but there are always ones missing and the ones that are available often lag behind the most up to date versions. For that reason, we build a lot of RPMs of CPAN modules. And we don’t want to keep that work to ourselves, so on these pages we make them available for anyone to download.
Their RPMs are generated automagically using a great tool called « cpanspec », which does exactly what you think it does : given a CPAN tarball, it will generate a .spec file suitable for building an installable RPM. It is available in the standard repositories, and can be installed easily via YUM as normal, so go ahead and do that now. Ok, example time : say you needed HTML::Laundry, but after a quick peek through your repositories, it becomes readily apparent that an RPM is not available. Thanks to cpanspec, all is not lost :
We just downloaded the tarball right from the CPAN website, and ran cpanspec against it. The « --packager » argument simply defines the person who’s generating the .spec, and doesn’t necessarily have to be anything accurate. Go ahead and try it for yourself. Now take a look at the resulting .spec file (or on the pastebin here). As you can see, it fills in all the fields, including the critical (and often tricky-to-determine) « BuildRequires » and « Requires » items. Frankly, it’s solid gold, and it has made the lives of CentOS / RHEL admins all over the world much easier.
That said, it’s not perfect, and there are times when you might run into problems. Actually, you may run into two problems in particular. The first is conflicts over ownership, which arises when multiple RPMs claim to be responsible for the same file (or files, or directories, or features, or whatever). The second is more nefarious : an RPM that writes files to the system without declaring ownership for them – a condition often referred to as « clobbering ». The former is irritating, but at least it’s not destructive, unlike the latter, which can cause all manner of headaches. To illustrate these two problems, let’s take a look at another example (this one being decidedly more real-world than that of Laundry above) : CGI.pm.
The .spec file that is generated from this tarball is functional and correct, and we can build an installable RPM out of it, so at first all appears well. Again, go ahead and try for yourself – i’ll wait. You may wish to capture the build output for review – otherwise, check the pastebin. I’d like to draw your attention to the « Installing » lines. By trimming the « Installing /var/tmp/perl-CGI.pm.3.49-1-root-root » element from each of those lines, we can see the actual paths and files that this RPM will install to. Examples :
At first glance this looks perfectly acceptable. But look what happens when we try to install the resulting RPM (clipped for brevity) :
[root@host-119 build]# rpm -iv /usr/src/redhat/RPMS/noarch/perl-CGI.pm-3.49-1.noarch.rpm
Preparing packages for installation...
file /usr/share/man/man3/CGI.3pm.gz from install of perl-CGI.pm-3.49-1.noarch conflicts with file from package perl-5.8.8-27.el5.x86_64
file /usr/share/man/man3/CGI::Cookie.3pm.gz from install of perl-CGI.pm-3.49-1.noarch conflicts with file from package perl-5.8.8-27.el5.x86_64
file /usr/share/man/man3/CGI::Pretty.3pm.gz from install of perl-CGI.pm-3.49-1.noarch conflicts with file from package perl-5.8.8-27.el5.x86_64
As it turns out, the Perl package that comes with RHEL / CentOS already contains CGI.pm. This is normal, since it’s so popular, and is included as a convenience. Thus, RPM – in an attempt to preserve the coherence of the package management system – refuses to install overtop of the existing owned files. This is a fine illustration of the first of the two problems previously noted : conflicts over ownership. As i mentioned above, it’s aggravating, but it’s not a bug – it’s a feature, and it’s doing exactly what it’s designed to do. Irritating, but not ultimately dire.
If you look carefully, though, it’s also an illustration of the second problem. Note the list of files that are conflicting. Look back to the list of files that the package contains – notice anything missing from the conflicts list ? That’s right – the actual module files (*.pm) are not showing conflicts, which means they’d get overwritten without complaint by RPM. You might be thinking « who cares ? that’s what i want » right now, but trust me, it’s not what you want. Imagine this CGI package, with this version of CGI.pm, gets installed, and then later you upgrade the Perl package – your CGI.pm files will get overwritten by the Perl package, because as far as RPM is concerned, Perl owns those files. All of a sudden, things break because you had scripts that relied on your particular version, but since you just upgraded Perl, you think (quite naturally) that the problem could be anywhere – where do you even start looking ?
Imagine the headache if there are multiple administrators, multiple servers, multiple data centres, and multiple clients paying multiple dollars. No fun at all.
So how can we upgrade CGI.pm, using an RPM, without running into these problems ? As is often the case, the answer is deceptively simple, but not immediately obvious. Ultimately what we want to accomplish is twofold :
Avoid the man conflicts.
Ensure that the existing owned module files are not clobbered by our new package.
Concerning the man pages – and i’m going to be perfectly blunt here – the solution is to simply not install them, since, of course, they’re already there. As for avoiding a clobbering condition, this requires a little bit of investigation into how Perl modules and libraries are stored on an RHEL / CentOS machine. Consider the following output :
[root@host-119 ~]# ls -d /usr/lib64/perl5/*
/usr/lib64/perl5/5.8.8 /usr/lib64/perl5/site_perl /usr/lib64/perl5/vendor_perl
What’s it all mean ? Well, the « 5.8.8 » directory is the default directory as defined by the Perl architecture, and is system and platform-agnostic, which is to say that it’s (supposed to be) the same on every system. The « vendor_perl » directory contains everything that is specific to RHEL / CentOS (the « vendor » of the distribution). As you may recall from the rpmbuild output above, this is where the RPM wants to install the modules (thus creating the clobbering condition).
There’s a third directory there, promisingly named « site_perl » ; as the name implies, this is where site-specific files are stored, which is to say items that are neither part of the default Perl architecture, nor part of the RHEL / CentOS distribution. As you’ve no doubt guessed by now, site_perl is where we’re going to put our new modules.
Luckily for us, the only thing that needs to be changed is the .spec file – and we even get a headstart, since cpanspec does most of the heavy lifting for us. Examining the .spec file once more, we see the following lines of note (again, cut for brevity) :
These indicate that the target installation directory is that of the vendor, which is normally the case, and thus the default setting. Since we want to install to the site directory, we make the following changes :
That solves our clobbering problem quite nicely, but what about the man files ? As i mentioned above, the idea is to simply avoid installing them altogether, but since they’re generated automatically during the build process, how can we exclude them ? What i’m about to present is a bit of a hack, but it’s absolutely effective, and ultimately quite clean : we delete them after they’ve been generated, and then don’t declare them in the file list. Some items are already being potentially deleted by default, so let’s go ahead and add our own line into the mix :
This will look for all of the « manified » man files and just remove them from the build tree. All that’s left now is to remove them from the file list. This is as simple as deleting (or commenting out) their sole declaration :
Another option is to simply use the « --excludedocs » argument when installing the RPM. I opted to remove the docs altogether in order to ensure that the package can be installed without errors by anyone else without needing to know about the argument requirement ahead of time (and to facilitate automated rollouts).
What you’ll end up with is a .spec file that looks like this. Go ahead and build your RPM – it’ll install without conflicts and without danger. This is a technique that can be used for other CPAN packages as well, so go ahead and install everything you’ve always wanted.
Happy 2010 fair readers ! I hope that all is well with you and yours. Let’s get right to business : Virtualbox has a feature that allows you to access the host OS’s file system from the guest OS (shared folders), which is super useful, but not exactly perfectly implemented. In particular, there are known, documented performance issues in certain scenarios, such as when accessing a Linux host via a Windows guest (which, as you might imagine, is a pretty regular sort of activity).
One common (?) workaround is to install and configure Samba on the Linux host, then access it from the Windows guest like one would access any network server. The problem here is that it requires that Samba be installed and configured, which can be a pain in the, well, you know. Furthermore, the connection will be treated like any other, and the traffic will travel up and down the network stack, which is fundamentally unnecessary since the data is, physically speaking, stored locally.
Instead, here’s another workaround, one that keeps things simple, and solves the performance problem : just map the shared folder to a local drive in the guest OS. It’s that easy. For those of us who aren’t too familiar with the Windows explorer interface (me included, heh), there are tonnes of step by step instructions available. For whatever reason (i suspect Netbios insanity), accessing the network share via a mapped drive manages to avoid whatever condition creates the lag problems, resulting in rapid, efficient access to the underlying filesystem.
Hello again fair readers ! Today’s quick tip concerns the problem with missing time zones when deploying CentOS 5.3 (and some of the more recent Fedoras) in a kickstart environment. It’s a known problem, and unfortunately, since the source of the problem (an incomplete time zone data file) lies deep in the heart of the kickstart environment, fixing it directly is a distinct pain in the buttock region.
There is, however, a workaround – and it’s not even that messy ! The first step is to use a region that does exist, such as « Europe/Paris », which will satisfy the installer – then set the time zone to what you actually want after the fact in the « %post » section. So, in the top section of the kickstart file, we’ll put :
# set temporarily to avoid time zone bug during install
timezone --utc Europe/Paris
The « --utc » switch simply states that the system clock is in UTC, which is pretty standard these days, but ultimately optional. Next, in the %post section towards the end, we’ll shoehorn our little hack fix into place :
# fix faulty time zone setting
mv /etc/sysconfig/clock /etc/sysconfig/clock.BAD
sed 's@^ZONE="Europe/Paris"@ZONE="Etc/UTC"@' /etc/sysconfig/clock.BAD > /etc/sysconfig/clock
tzdata-update
So, what’s going on there ? Let’s break it down :
In the first line, we’re just backing up the original configuration file, to use in the next line…
The second line is the important one – this is the actual manipulation which will fix the faulty time zone, setting it to whatever we want. In this example « Etc/UTC » is used, but you can pick whatever is appropriate.
The tool being used here is « sed », a non-interactive editor which dates back to the 1970’s, and which is still used by system administrators around the world every day.
The command we’re issuing to sed is between the single quotes – astute readers will notice that it’s a regular expression substitution, but with @’s instead of the more usual /’s. In it, we simply state that the instance of « ZONE="Europe/Paris" » is to be replaced with « ZONE="Etc/UTC" ».
This change is to be made against the backup file, and outputted to the actual config.
Finally, we run « tzdata-update » which, as you’ve no doubt guessed, updates the time zone data system-wide, based (in part) on the newly-corrected clock config.
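For readers who find sed expressions hard to parse, here is roughly the same substitution written out in Python. It is only meant to show what the expression does to the text of the clock file. The function name is made up, and there is no suggestion that you should run Python in your %post section.

```python
import re


def set_zone(clock_text, old_zone, new_zone):
    """Replace a ZONE="old_zone" entry at the start of a line with
    ZONE="new_zone", leaving every other line untouched."""
    pattern = r'^ZONE="' + re.escape(old_zone) + '"'
    return re.sub(
        pattern,
        lambda _match: f'ZONE="{new_zone}"',
        clock_text,
        flags=re.MULTILINE,  # let ^ match at the start of each line, as sed does
    )


before = 'ZONE="Europe/Paris"\nUTC=true\n'
print(set_zone(before, "Europe/Paris", "Etc/UTC"))
```

If the file does not contain the expected zone, the text comes back unchanged, which is also how the sed version behaves.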
And that, as they say, is that. Happy kickstarting, friends, and i’ll see you next time !
Hello again, everybody ! Today i thought that we’d take a look at a fun and useful topic of interest to many system administrators : load balancing & redundancy. Now, i know, it doesn’t sound too exciting – but trust me, once you get your first mini-cluster set up, you’ll never look at service management quite the same way again. It’s not even that tough to set up, and you can get a basic setup going in almost no time at all, thanks to some great open source software that can be found in more or less any modern repository.
First, as always, a little bit of theory. The most basic web server setup (for example), looks something like figure 001, below :
As you can see, this is a functional setup, but it does have (at least) two major drawbacks :
A critical failure on the web server means the service (i.e. the content being served) disappears along with it.
If the web server becomes overloaded, you may be forced to take the entire machine down to upgrade it (or just let your adoring public deal with a slow, unresponsive website, i suppose).
The solution to both of these problems forms the topic of this blog entry : load balancing. The idea is straightforward enough : by adding more than one web server, we can ensure that our service continues to be available even when a machine fails, and we can also spread the love, er, load, across multiple machines, thus increasing our overall efficiency. Nice !
batman and round robin
Now, there are a couple of ways to go about this, one of which is called « Round Robin DNS » (or RRDNS), which is both very simple and moderately useful. DNS, for those needing a refresher, is (in a nutshell) the way that human-readable hostnames get translated into machine-readable numbers. Generally speaking, hostnames are tied to IP addresses in a one-to-one or many-to-one fashion, such that when you type in a hostname, you get a single number back. For example :
$ host www.dark.ca
www.dark.ca has address 184.108.40.206
In other words, when you type http://www.dark.ca into your browser, you get one particular machine on the Internet (as indicated by the address); however, it is also possible to set up a one-to-many relationship – this is the basis of RRDNS. A very common example is Google :
$ host www.google.com
www.google.com is an alias for www.l.google.com.
www.l.google.com has address 220.127.116.11
www.l.google.com has address 18.104.22.168
www.l.google.com has address 22.214.171.124
www.l.google.com has address 126.96.36.199
www.l.google.com has address 188.8.131.52
www.l.google.com has address 184.108.40.206
So what’s going on here ? In essence, the Google administrators have created a situation whereby typing http://www.google.com into your browser will get you one of a whole group of possibilities. In this way, each time you request some content from them, one of any number of machines will be responsible for delivering that service. (Now, to be fair, the reality of what’s going on at Google is likely far more complex, but the premise is identical.) Your web browser will only get one answer back, which is more or less randomly provided by the DNS server, and that response is the machine you’ll interact with. As you can see, this (sort of) satisfies our problem of resource usage, and it (sort of) addresses the problem of resource failure. For those of you who are more visually inclined, please see figure 002 below :
It’s not perfect, but it is workable, and most of all, it’s dead simple to set up – you just need to set your DNS configuration up and you’re good to go (an exercise i leave to you, fair reader, as RRDNS is not really the focus of our discussion today). Thus, while RRDNS is a simple method for implementing a rudimentary load balancing infrastructure, it still has notable failings :
The load balancing isn’t systematic at all – by pure chance, one machine could end up getting hammered while others do very little, for example.
If a machine fails, there’s a chance that the DNS response will contain the address of the downed machine. In other words, the chances of you getting the downed machine are 1 in X, where X is the number of possible responses to the DNS query. The odds get better (or worse, depending on how you look at it) as more machines fail.
A slightly more obscure problem is that of response caching : as a method of optimisation, many DNS systems, as well as software that interacts with DNS, will cache (hold on to) hostname lookups for variable lengths of time. This can invalidate the magic of RRDNS altogether…
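To put some numbers on the second point, here is a tiny Python sketch. The first function hands out addresses in strict rotation, which is an idealised version of what a round-robin DNS server does. The second computes the « 1 in X » odds of being handed a dead machine. The addresses are documentation-only placeholders and both function names are invented for the example.

```python
from fractions import Fraction
from itertools import cycle, islice


def rotate_answers(addresses, lookups):
    """Idealised round robin: one address per lookup, in strict rotation."""
    return list(islice(cycle(addresses), lookups))


def chance_of_dead_answer(total_machines, failed_machines):
    """Odds that a single lookup returns a machine that is down, assuming
    every address is equally likely to be handed out."""
    return Fraction(failed_machines, total_machines)


pool = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
print(rotate_answers(pool, 5))
print(chance_of_dead_answer(6, 1))
print(chance_of_dead_answer(6, 3))
```

Note that nothing in this model notices a failure. The dead address keeps getting handed out until somebody edits the DNS records, which is exactly the weakness described above.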
another attack vector
Another approach to the problem, and the one we’ll be exploring in great depth in this article, is using a dedicated load balancing infrastructure, combining a handful of great open source tools and proven methodologies. First, however, some more theory.
Our new approach to load balancing must propose both a solution to the original problems (critical failure & resource usage), as well as address and solve the drawbacks of RRDNS as noted above. Really, what we want is an intelligent (or, at least, systematic) distribution of load across multiple machines, and a way to ensure that requests don’t get sent to downed machines by accident. It’d be nice if these functions were automated too, since the last thing an administrator wants to do is baby-sit racks of servers. What we’d like, in other words, could be represented by replacing the phrase « RRDNS » in figure 002 above, with the word « magic ». For now, let’s imagine that this magic sits on a machine that we’ll call « Load Balancer » (or LB, for short), and that this LB machine would have a similar conceptual relationship to the web servers as RRDNS does. Consider figure 003 :
This is a basic way of thinking about what’s going to happen. It looks a lot like figure 002, but there is a very important difference : instead of relying on the somewhat nebulous concept of DNS for our load balancing, we can now give that responsibility to a proper machine running and dedicated to the purpose. As you can imagine, this is already a huge improvement, since this opens the door to all sorts of additional features and possibilities that simply aren’t possible with straight DNS. Another interesting aspect of this diagram is that, visually speaking, it would appear that the Internet cloud only « sees » one machine (the load balancer), even though there are a number of web servers behind it. This concept of having a single point of entry lies at the very core of our strategy – both figuratively andliterally – as we’ll soon discover
In the here and now, however, we’re still dealing with theory, and a solution based on « magic » is about as theoretical as it gets. Luckily for us though, magic is exactly what we’re about to unleash – in the form of « Linux Virtual Server », or « LVS » for short. From their homepage :
The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the server cluster is fully transparent to end users, and the users interact as if it were a single high-performance virtual server. […] The Linux Virtual Server as an advanced load balancing solution can be used to build highly scalable and highly available network services, such as scalable web, cache, mail, ftp, media and VoIP services.
The thing about LVS is that while it’s not inherently complex, it is highly malleable, and this means you really do need to have a solid handle on exactly what you want to do, and how you want to do it, before you start playing around. Put another way, there are a myriad of ways to use LVS, but you’ll only use one of them at a time, and picking the right methodology is important. The best way to do this is by building maps and really getting a solid feel for how the various components of the overall architecture relate to each other. Once you’ve got a good mental idea of what things should look like, actually configuring LVS is about as straightforward as it gets (no, really!).
let’s complicate the issue further, for science !
Looking back to figure 003, we can see that our map includes the Internet, the Load Balancer, and some Web Servers. This is a pretty typical sort of setup, and thus, we can approach it in a few different ways. One of the decisions that needs to be made fairly early on, though, has more to do with topology and routing than LVS specifically : how, exactly, do the objects on the map relate to each other at a network level ? As always, there can be lots of answers to this question – each with its advantages and disadvantages – but ultimately we must pick only one. Since I value simplicity when it comes to technology, figure 004 describes a simple network topology :
Now, for those of you out there who may have some experience with LVS, you can see exactly where this is headed – for everybody else, this might not be what you were expecting at all. Let’s take a look at some of the more obvious points :
There are two load balancers.
The web servers are on the same network segment as the LBs.
Unlike the previous diagrams, the LBs do not appear to be « in between » the Internet and the web servers.
The first point is easy : there are two LBs for reasons of redundancy, as a single LB represents a single point of failure. In other words, if the LB stops working for whatever reason, all of your services behind it become functionally unavailable ; thus, you really, really want to have another machine ready to go immediately following a failure.
The second and third points need a little more explanation – but the short answer is two words : « Direct Routing » (or DR for short). From the LVS wiki :
Direct Routing [is] an IP load balancing technology implemented in LVS. It directly routes packets to backend server through rewriting MAC address of data frame with the MAC address of the selected backend server. It has the best scalability among all other methods because the overhead of rewriting MAC address is pretty low, but it requires that the load balancer and the backend servers (real servers) are in a physical network.
If that sounds heavy, don’t worry – figure 005 explains it in easy visual form :
In a nutshell, a request gets sent to the LB, which then passes it to a Web Server, which in turn responds directly to the client. It’s fast, efficient, scalable, and easy to set up, with the only caveat being that the LBs and the machines they’re balancing must be on the same network. As long as you’re willing to accept that restriction, Direct Routing is an excellent choice – and it’s the one we’ll be exploring further today.
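If you prefer code to pictures, the MAC rewriting described in the LVS wiki quote can be modelled in a few lines. This is a toy sketch and not how the kernel does it : the « Frame » class, the function name and all of the addresses below are placeholders invented for the example.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Frame:
    dst_mac: str   # layer 2: which machine on the segment picks the frame up
    src_ip: str    # layer 3: the client that sent the request
    dst_ip: str    # layer 3: the address the client was trying to reach

def direct_route(frame: Frame, real_server_mac: str) -> Frame:
    """Model of Direct Routing on the LB: swap the destination MAC for that
    of the chosen real server and leave the IP addresses untouched."""
    return replace(frame, dst_mac=real_server_mac)

# Placeholder addresses, purely for illustration.
incoming = Frame(dst_mac="aa:aa:aa:aa:aa:01", src_ip="10.1.2.3", dst_ip="192.168.0.40")
forwarded = direct_route(incoming, real_server_mac="bb:bb:bb:bb:bb:02")
print(forwarded)
```

Because only the MAC address changes, the frame can only be delivered to a machine on the same network segment, which is exactly the restriction mentioned above.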
a little less conversation, a little more action
So with that in mind, let’s get started. I’m going to be describing four machines in the following scenario. All four are identical off-the-shelf servers running CentOS 5.2 – nothing fancy here. The naming and numbering conventions are simple as well :
You probably noticed the fifth item in this list, labelled « Virtual Web Server ». This represents our virtual, or clustered service, and is not a real machine. This will be explained in further detail later on – for now, let’s go ahead and install the key software on both of the Load Balancer machines :
[root@A01 etc]# yum install ipvsadm piranha httpd
« ipvsadm » is, as you might have guessed, the administrative tool for « IPVS », which is in turn an acronym for « Internet Protocol Virtual Server », which makes more sense when you say « IP-based Virtual Server » instead. As the name implies, IPVS works on IP traffic inside the kernel (strictly speaking it balances at the transport layer, Layer-4 of the OSI model, since it looks at both IP addresses and TCP/UDP ports), and is used to spread incoming connections to one IP address towards other IP addresses according to one of many pre-defined methods. ipvsadm is the tool that allows us to control our new load balancing infrastructure, and is the key software component around which this entire exercise revolves. It is powerful, but sort of a pain to use, which brings us to the second item in the list : piranha.
Piranha is a web-based tool (hence httpd, above) for administering LVS, and is effectively a front-end for ipvsadm. As installed in CentOS, however, the Piranha package contains not only the PHP pages that make up the interface, but also a handful of other tools of particular interest and usefulness that we’ll take a look at as well. For now, let’s continue with some basic setup and configuration.
A quick word of warning : before starting « piranha-gui » (one of the services supplied by Piranha) up for the first time, it’s important that both LBs have the same time set on them. You’ve probably already got NTP installed and functioning, but if not, here’s a hint :
[root@A01 ~]# yum -y install ntp && ntpdate pool.ntp.org && chkconfig ntpd on && service ntpd start
Moving right along, the next step is to define a login for the Piranha web interface :
[root@A01 ~]# /usr/sbin/piranha-passwd
You can define multiple logins if you like, but for now, one is certainly enough. Now, unless you plan to run your load balanced infrastructure on a completely internal network, you’ll probably want to set up some basic restrictions on who can access the interface. Since the interface is served via an instance of Apache HTTPd, all we have to do is set up a normal « .htaccess » file. A full breakdown of .htaccess (and, in particular, mod_access) is outside of the scope of this document, but the simple gist is as follows :
[root@A01 ~]# cat /etc/sysconfig/ha/web/secure/.htaccess
# by default, deny from everybody
Deny from all
# requests from this network are allowed
Allow from 192.168.0
With those items out of the way, we can now activate piranha-gui :
[root@A01 ~]# chkconfig piranha-gui on && service piranha-gui start
Congratulations ! The interface is now running on port 3636, and can be accessed via your browser of choice – in the case of our example, it’d be « http://A01:3636/ ». The username for the web login is « piranha », and the password is the one we set above. Now that we’re logged in, let’s take a look at the interface in greater depth.
look out – piranhas !
The first screen – known as the « Control » page – is a summary of the current state of affairs. Since nothing is configured or even active, there isn’t a lot to see right now. Moving on to the « Global Settings » tab, we have our first opportunity to start putting some settings into place :
Primary server public IP : Put the IP address of the « primary » LB. In this example, we’ll put the IP of A01.
Primary server private IP : If we weren’t using direct routing, this field would need a value. In our example, therefore, it should be empty.
Use network type : Direct Routing (of course!)
On to the « Redundancy » tab :
Redundant server public IP : Put the IP address of the « secondary » LB. In this example, we’ll put the IP of A02.
Syncdaemon : Optional and useful – but know that it requires additional configuration in order to make it work.
This feature (which is relatively new to LVS) ensures that the state information (i.e. connections, etc.) is shared with the secondary in the event that a failover occurs. For more information, please see this page from the LVS Howto.
It is not necessary, strictly speaking, so we can just leave it unchecked for now.
Under the « Virtual Servers » tab, let’s go ahead and click « Add », then select the new unconfigured entry and hit « Edit » :
Name : This is an arbitrary identifier for a given clustered service. For our example, we’d put « WWW ».
Application Port : The port number for the service – HTTP runs on port 80, for example.
Protocol : TCP or UDP – this is normally TCP.
Virtual IP Address : This is the IP address of the virtual service (VIP), which you may recall from the table above. This is the IP address that clients will send requests to, regardless of the real IP addresses (RIP) of the real servers which are responsible for the service. In our example, we’d put 192.168.0.40 .
Each service that you wish to cluster needs a unique « address : port » pairing. For example, 192.168.0.40:80 could be a web service, and 192.168.0.40:25 would likely be a mail service, but if you wanted to run another, separate web service, you’d need to assign a different virtual IP.
Virtual IP Network Mask : Normally this is 255.255.255.255, indicating a single IP address (the Virtual IP Address above).
You can actually cluster subnets, but this is outside of the scope of this tutorial.
Device : The Virtual IP address needs to be assigned to a « virtual network interface », which can be named more or less anything, but generally follows the format « ethN:X », where « ethN » is the physical device, and X is an incremental numeric identifier. For example, if your physical interface is « eth0 », and this is the first virtual interface, then it would be named « eth0:1 ».
If and when you set up multiple virtual interfaces, it is important to not mix these up. Piranha has no facility for sanity checking these identifiers, so you may wish to track them yourself in a Google document or something.
Scheduling : There are a number of options here, and some are very different from one another. For the purposes of this exercise, we’ll pick a simple, yet relatively effective scheduler called « Least-Connections ».
This does exactly what it sounds like : when a new request is made to the virtual service, the LB will check to see how many connections are open to each of the real servers in the cluster, and then route the connection to the machine with the fewest connections. Congrats, you’ve now got load balancing !
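The idea behind the Least-Connections scheduler fits in a handful of lines. What follows is only a sketch of the principle (the real scheduler lives in the kernel and is more refined) ; the function name and the connection counts are invented for the example.

```python
def least_connections(active):
    """Return the name of the real server with the fewest open connections.
    `active` maps server name -> open connection count; ties go to the
    server that appears first in the mapping."""
    if not active:
        raise ValueError("no real servers in the cluster")
    return min(active, key=active.get)

# Hypothetical snapshot of the cluster at the moment a new request arrives.
active = {"B01": 12, "B02": 7}
chosen = least_connections(active)
active[chosen] += 1          # the new connection now counts against that server
print(chosen, active)
```

Run it a few times in a loop and the counts level out, which is the whole point.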
Finally, let’s add some real servers into the cluster. From the « Edit » screen we’re already in, click on the « Real Server » sub-tab.
Name : This is the hostname of the real server. In our example, we’d put B01.
Address : The IP address of the real server. In our example, for B01, we’d put 192.168.0.38 .
Port : Generally speaking this can be left empty, as it will inherit the Port value defined in the virtual service (in this case, 80).
A value would be required here if your real server is running the service on a different port than that specified in the virtual service ; if your real server is running a web service on port 8080 instead of 80, for example.
Weight : Despite the name, this value is used in various different ways depending on which Scheduler you selected for the virtual service. In our example, however, this field is irrelevant, and can be left empty.
You can apply and add as many real servers as you like, one at a time, in this fashion. Go ahead and set up B02 (or whatever your equivalent is) now.
If you’re wondering when the secondary LB is going to be configured, well, wonder no longer : the future is now. Luckily, this step is very, very easy. From the secondary :
Phew ! That was a lot of work. After consuming a suitable refreshment, let’s move on to the final few steps. Earlier I mentioned that there were some other items that we’d need to learn about besides the Piranha interface – « Pulse » is one such item. Pulse, as a tool, is in the same family as some other tools you may have heard of, such as « Heartbeat », « Keepalived », or « OpenAIS ».
The basic idea of all of these tools is simple : to provide a « failover » facility between a group of two or more machines. In our example, our primary LB is the one that is normally active, but in the case that it fails for some reason, we’d like our secondary to kick in and take over the responsibilities of the unavailable primary – this is what Pulse does. Each of the load balancers runs an instance of « pulse » (the executable, not the package), which behaves in this fashion :
Each LB sends out a broadcast packet (a pulse, as it were) stating that it is alive. As long as the active LB (commonly the primary) continues to announce itself, everybody is happy and nothing changes.
If, however, the inactive LB (commonly the secondary) notices that it hasn’t seen any pulses from the active LB lately, it assumes that the active LB has failed.
The secondary, formerly inactive LB, then becomes active. This state is maintained until such a time as the primary starts announcing itself again, at which point the secondary demotes itself back to inactivity.
The difference between the active and the inactive server is actually very simple : the active server is the one with the virtual addresses assigned to it (remember those, from the Virtual Servers tab in Piranha?).
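The behaviour described above boils down to a timeout. Here is a toy model of the secondary’s side of the conversation ; it is a sketch of the logic only, and the class, its method names and the « dead_after » setting are all hypothetical rather than anything taken from Pulse itself.

```python
class StandbyMonitor:
    """Toy model of the inactive LB listening for pulses from the active one."""

    def __init__(self, dead_after: float):
        self.dead_after = dead_after   # seconds of silence tolerated (hypothetical setting)
        self.last_pulse = 0.0          # time at which the last pulse was seen
        self.active = False            # True while this LB holds the virtual addresses

    def pulse_received(self, now: float) -> None:
        """The primary announced itself: remember when, and step down if needed."""
        self.last_pulse = now
        self.active = False

    def tick(self, now: float) -> bool:
        """Periodic check: take over the VIPs if the primary has been silent too long."""
        if now - self.last_pulse > self.dead_after:
            self.active = True
        return self.active

standby = StandbyMonitor(dead_after=6.0)
standby.pulse_received(now=0.0)
print(standby.tick(now=2.0))    # primary heard recently: stay inactive
print(standby.tick(now=10.0))   # silence for too long: become active
standby.pulse_received(now=12.0)
print(standby.tick(now=13.0))   # primary is back: inactive again
```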
Let’s go ahead and start it up (on the primary LB first, then on the secondary) :
[root@A01 ~]# chkconfig pulse on
[root@A01 ~]# service pulse start
an internet ballgame drama – in 5 parts
You may have noticed that we haven’t even touched the « real » servers (i.e. the web servers) yet. Now is the time. As it so happens, there’s only one major step that relates to the real servers, but it’s a very, very important one : defining VIPs, and then ensuring that the web servers are OK with the routing voodoo that we’re using to make this whole load balancing infrastructure work. The solution is simple, but the reason for the solution may not be immediately obvious – for that, we need to take a look at the IP layer of each packet (neat!). First, though, let’s run through a series of little stories :
Alice has a ball that she’d like to pass to Bob, so she tosses it his way.
Bob catches the ball, sees that it’s from Alice, and throws it back at her. What great fun !
Now imagine that Alice and Bob are hanging out with a few hundred million of their closest friends – but they still want to play ball.
Alice writes Bob’s name on the ball and passes it to somebody, who then passes it to somebody else, and so forth.
Eventually the ball gets passed to Bob. Unfortunately for Bob, he has no idea where it came from, so he can’t send it back.
The solution is obvious :
Alice writes « From : Alice, To : Bob » on the ball, then passes it along.
Bob gets the ball, and switches the names around so that it says « From : Bob, To : Alice », and sends it back.
OK, so, those were some nice stories, but how do they apply to our Load Balancing setup ? As it turns out, all we need to do is throw in some tubes, and we’ve described one of the basic functions of the Internet Protocol – that the source and destination IP addresses of a given packet are part of the IP layer of said packet. Let’s complicate it by one more level :
Alice prepares the ball as above, and sends it flying.
Bob, who’s been avoiding Alice since things got weird at the bar last week-end, gets the ball and passes it along to Charles.
Charles – who’s had a not-so-secret crush on Alice since high school – happily writes « From : Charles, To : Alice », and tosses it away.
Alice receives the ball, but much to her surprise, it’s from Charles, and not Bob as she expected. Awkward !
With that last story in mind, let’s take another look at figure 005 above (go ahead, I’ll wait). Notice anything ? That’s right – the original source sends their packet off, but then receives a response from a different machine than they expected. This does not work – it violates some basic rules about how communications are supposed to function on the Internet. For the thrilling conclusion – and a solution to the problem – let’s return to our drama :
As it turns out, Bob is a player : he gets so many balls from so many women that he needs to keep track of them all in a little notebook.
When Bob gets Alice’s ball he passes it to Charles, then he records where it came from and who he gave it to in his notebook.
Charles – in an attempt to get into Bob’s circle of friends – agrees to write « From : Bob, To : Alice » on the ball, then sends it back.
Alice – expecting a ball from Bob – is happy to receive her Bob-signed spheroid.
Bob then gets another ball from Denise, passes it to Edward, and records this relationship as well.
Edward – a sycophant if ever there was – prepares the ball in the same fashion as Charles, and fires it back.
Of course, the more balls Bob has to deal with, the more helpers he can use to spread the work around. As you’ve no doubt pieced together, Alice and Denise are any given sources on the Internet, Bob is our LB, and Charles & Edward are the web servers. Instead of writing people’s names on balls, we should now make the mental leap to IP addresses in packets. With our tables of hostnames and addresses in mind, let’s consider the following example :
The source sends a request for a web page.
The source IP is « 10.1.2.3 », and the destination IP is « 192.168.0.40 » (the VIP for WWW).
The packet is sent to A01, which is currently active, and thus has the VIP for WWW assigned to it.
A01 then forwards the packet to B02 (by chance), which crafts a response packet.
The RIP for B02 is « 192.168.0.39 », but instead of using that, the source IP is set to « 192.168.0.40 », and the destination is « 10.1.2.3 ».
The source, expecting a response from « .40 », indeed receives a packet that appears to be from WWW. Done and done.
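The same walk-through can be written as a small check, using the addresses from the example above. It is a sketch of the reasoning rather than of any real network stack, and the « Packet » class and helper functions are made up for the occasion.

```python
from dataclasses import dataclass

VIP = "192.168.0.40"       # virtual address for WWW
B02_RIP = "192.168.0.39"   # real address of B02

@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str

def reply_from(request: Packet, answering_as: str) -> Packet:
    """Build the response packet, sent back to whoever made the request."""
    return Packet(src_ip=answering_as, dst_ip=request.src_ip)

def client_accepts(request: Packet, reply: Packet) -> bool:
    """The client only recognises a reply coming from the address it contacted."""
    return reply.src_ip == request.dst_ip and reply.dst_ip == request.src_ip

request = Packet(src_ip="10.1.2.3", dst_ip=VIP)
print(client_accepts(request, reply_from(request, B02_RIP)))  # Charles signs his own name
print(client_accepts(request, reply_from(request, VIP)))      # Charles signs as Bob
```

The first reply is the awkward ball from Charles ; the second is the one Alice was expecting.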
The theory is sound, but how can we implement this in practice ? As I said – it’s simple ! We simply add a dummy interface to each of the web servers that has the same address as the VIP, which will allow the web servers to interact with packets properly. This is best done by creating a simple sysconfig entry on each of the web servers for the required dummy interface, as follows :
[root@B01 ~]# vim /etc/sysconfig/network-scripts/ifcfg-lo:0
# for VIP
NAM
all together now
The « lo » indicates that it’s a « Loopback address », which is best described by Wikipedia :
Such an interface is assigned an address that can be accessed from management equipment over a network but is not assigned to any of the real interfaces on the device. This loopback address is also used for management datagrams, such as alarms, originating from the equipment. The property that makes this virtual interface special is that applications that use it will send or receive traffic using the address assigned to the virtual interface as opposed to the address on the physical interface through which the traffic passes.
In other words, it’s a fake IP that the machine can use to make packets anyways. Now, there is a known scenario in which a machine with a given loopback address will, in this particular situation, cause confusion on the network about which interface actually « owns » a given address. It has to do with ARP, and interested readers are encouraged to Google for « LVS ARP problem » for more technical details – for now, let’s just get right to the solution. On each of the real servers, we’ll need to edit « sysctl.conf » :
[root@B01 ~]# vim /etc/sysctl.conf
# this file already has stuff in it, so put this at the bottom
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
At this point we’ve now explored each key item that is necessary to make this whole front-end infrastructure work, but it is perhaps not quite clear how it all works together. So, let’s take a step back for a moment and review :
There are four servers : two are load balancers, and two are web servers.
Of the two load balancers, only one is active at any given time ; the other is a backup.
Every DNS entry for the sites on the web servers points to one actual IP address.
This IP address is called the « Virtual IP ».
The VIP is claimed by the active load balancer, meaning that when a request is made for a given website, it goes to the active LB.
The LB then re-directs the request to an actual web server.
The re-direction can be random, or based on varying levels of logical decision making.
The web server will respond directly – the LB is not a proxy.
Great ! Now, what software runs where, and why ?
The load balancers use LVS in order to manage the relationship between VIPs and RIPs.
Pulse is used between the LBs in order to determine who is alive, and which one is active.
An optional (but useful) web interface to both LVS and Pulse comes in the form of Piranha, which runs on a dedicated instance of Apache HTTPd on port 3636.
And that, my friends, is that ! If you have any questions, feel free to comment below (remember to subscribe to the RSS feed for responses). Happy balancing !
oh, p.s., one last thing…
In case you’re wondering how to keep your LVS configuration file synchronised across both of the load balancers, one way to do it would be with a network-aware filesystem – POHMELFS, for example. 😉
Hi everybody – here’s a super-quick update for you concerning « ethtool », and how to use it to set options in Fedora properly. Ethtool is a great little tool that can be used to configure all manner of network interface related settings – notably the speed and duplex of a card – on the fly and in real time. One of the most common situations where ethtool would be used is at boot time, especially for cards which are finicky, or have buggy drivers, or poor software support, or.. well, you get the idea.
Times were that if you needed to use ethtool to configure a NIC setting at boot time, you’d just stick the given command line into « rc.local », or perhaps another runlevel script, and forget about it. The problem with this approach is (at least) twofold :
Frankly, it’s easy to forget about something like this, which makes future support / debugging of network issues more of a pain.
Anything that automatically modifies the runlevel script (such as updates to the parent package) may destroy your local edits.
In order to deal with these issues, and to standardise the implementation of the ethtool-at-boot technique, the Red Hat (and, thus, Fedora) maintainers introduced an option for defining ethtool parameters on a per-interface basis via the standard « sysconfig » directory system. Now, this actually happened a number of years ago, but the implementation was poorly announced (and poorly documented at the time), and thus, even today a lot of users and administrators don’t seem to know about it.
Now, there’s a very good chance that you already know this, but just to refresh your memory : in the sysconfig directory, there is another directory called « network-scripts », which in turn contains a series of files named « ifcfg-eth? », where « ? » is a device number. Each network device has a configuration file associated with it ; for example, ifcfg-eth1 is the configuration file for the « eth1 » device.
In order to specify the ethtool options for a given network interface, simply edit the associated configuration file, and add an « ETHTOOL_OPTS » line. For example :
ETHTOOL_OPTS="autoneg off speed 100 duplex full"
Now, whenever the network service initialises that interface, ethtool will be run with the specified options. Simple, easy, and best of all, standardised. What could be better ?
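If you’d like to double-check what will be run for a given interface, a few lines of Python can read the configuration file back. This is a sketch resting on one assumption, namely that the network scripts hand the options to « ethtool -s » for that device ; the function name and the sample file contents are made up for the example.

```python
import shlex

def ethtool_command(device, ifcfg_text):
    """Return the argv implied by the ETHTOOL_OPTS line of an ifcfg file,
    or None when the file does not define one.
    Assumption: the options are passed to `ethtool -s <device>`."""
    for line in ifcfg_text.splitlines():
        tokens = shlex.split(line, comments=True)   # handles quotes and # comments
        if tokens and tokens[0].startswith("ETHTOOL_OPTS="):
            options = tokens[0].split("=", 1)[1].split()
            return ["ethtool", "-s", device] + options
    return None

# Hypothetical contents of ifcfg-eth1.
sample = '''DEVICE=eth1
ONBOOT=yes
ETHTOOL_OPTS="autoneg off speed 100 duplex full"
'''
print(" ".join(ethtool_command("eth1", sample)))
```

Comparing that against the output of « ethtool eth1 » after a network restart is a quick way to confirm the settings took.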