Handling extant resources in Terraform

Terraform is a Hashicorp tool that embraces the Infrastructure as Code model to manage a variety of platforms and services in today’s modern, cloud-based Internet.  It’s still in development, but it already provides a wealth of useful functionality, notably with regard to Amazon and Digital Ocean interactions.  The one thing it doesn’t do, however, is manage pre-existing infrastructure very well.  In this blog post we’ll explore a way to integrate extant infra into a basic Terraform instance.

Note that this post is current as of Terraform v0.3.6.  Hashicorp has hinted that future versions of Terraform will handle this problem in a more graceful way, so be sure to check those changelogs regularly. 🙂

summary

A full example and walk-through will follow; however, for those familiar with Terraform and just looking for the tl;dr, I got you covered.

  • Declare a new, temporary resource in your Terraform plan that is nearly identical to the extant resource.
  • Apply the plan, thus instantiating the temporary “twinned” resource and building a state file.
  • Alter the appropriate id fields to be the same as the extant resource in both the state and config files.
  • Perform a refresh which will populate the state file with the correct data for the declared extant resource.
  • Remove the temporary resource from AWS manually.
  • Voilà.

faster and more dangerous, please.

Walking through the process and meticulously checking every step? Ain’t nobody got time for that!

  • Edit the state file and insert the resource directly – it’s just JSON, after all.

examples

In the examples below, the notation [...] is used to indicate truncated output or data.

Also note that the AWS cli tool is assumed to be configured and functional.

S3

The extant resource in this case is an S3 bucket called phrawzty-tftest-1422290325. This resource is unknown to Terraform.

$ aws s3 ls | grep tftest
2015-01-26 17:39:07 phrawzty-tftest-1422290325

Declare the temporary twin in the Terraform config:

resource "aws_s3_bucket" "phrawzty-tftest" {
    bucket = "phrawzty-tftest-1422353583"
}

Verify and prepare the plan:

$ terraform plan -out=terratest.plan
    [...]
Path: terratest.plan

+ aws_s3_bucket.phrawzty-tftest
    acl:    "" => "private"
    bucket: "" => "phrawzty-tftest-1422353583"

Apply the plan (this will create the twin):

$ terraform apply ./terratest.plan
    [...]
aws_s3_bucket.phrawzty-tftest: Creation complete

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
    [...]
State path: terraform.tfstate

Verify that both the extant and temporary resources exist:

$ aws s3 ls | grep phrawzty-tftest
2015-01-26 17:39:07 phrawzty-tftest-1422290325
2015-01-27 11:14:09 phrawzty-tftest-1422353583

Verify that Terraform is aware of the temporary resource:

$ terraform show
aws_s3_bucket.phrawzty-tftest:
  id = phrawzty-tftest-1422353583
  acl = private
  bucket = phrawzty-tftest-1422353583

Alter the config file:

  • Insert the name of the extant resource in place of the temporary.
  • Strictly speaking this is not necessary, but it helps to keep things tidy.
resource "aws_s3_bucket" "phrawzty-tftest" {
    bucket = "phrawzty-tftest-1422290325"
}

Alter the state file:

  • Insert the name (id) of the extant resource in place of the temporary.
            "resources": {
                "aws_s3_bucket.phrawzty-tftest": {
                    "type": "aws_s3_bucket",
                    "primary": {
                        "id": "phrawzty-tftest-1422290325",
                        "attributes": {
                            "acl": "private",
                            "bucket": "phrawzty-tftest-1422290325",
                            "id": "phrawzty-tftest-1422290325"
                        }
                    }
                }
            }

Refresh the Terraform state (note the ID):

$ terraform refresh
aws_s3_bucket.phrawzty-tftest: Refreshing state... (ID: phrawzty-tftest-1422290325)

Verify that Terraform is satisfied with the state:

$ terraform plan
Refreshing Terraform state prior to plan...

aws_s3_bucket.phrawzty-tftest: Refreshing state... (ID: phrawzty-tftest-1422290325)

No changes. Infrastructure is up-to-date. This means that Terraform
could not detect any differences between your configuration and
the real physical resources that exist. As a result, Terraform
doesn't need to do anything.

Remove the temporary resource:

$ aws s3 rb s3://phrawzty-tftest-1422353583/
remove_bucket: s3://phrawzty-tftest-1422353583/

S3, faster.

For the sake of this example, the state file already contains an S3 resource called phrawzty-tftest-blah.

Add the “extant” resource directly to the state file.

            "resources": {
                [...]
                },
                "aws_s3_bucket.phrawzty-tftest": {
                    "type": "aws_s3_bucket",
                    "primary": {
                        "id": "phrawzty-tftest-1422290325",
                        "attributes": {
                            "acl": "private",
                            "bucket": "phrawzty-tftest-1422290325",
                            "id": "phrawzty-tftest-1422290325"
                        }
                    }
                }

Refresh:

$ terraform refresh
aws_s3_bucket.phrawzty-tftest: Refreshing state... (ID: phrawzty-tftest-1422290325)
aws_s3_bucket.phrawzty-tftest-blah: Refreshing state... (ID: phrawzty-tftest-blah)

Verify:

$ terraform show
aws_s3_bucket.phrawzty-tftest:
  id = phrawzty-tftest-1422290325
  acl = private
  bucket = phrawzty-tftest-1422290325
aws_s3_bucket.phrawzty-tftest-blah:
  id = phrawzty-tftest-blah
  acl = private
  bucket = phrawzty-tftest-blah

That’s that.

RabbitMQ plugin for Collectd

Hello all,

I wrote a rudimentary RabbitMQ plugin for Collectd.  If that sounds interesting to you, feel free to take a look at my GitHub.  The plugin itself is written in Python and makes use of the Python plugin for Collectd.

It will accept four options from the Collectd plugin configuration :

Locations of binaries :

RmqcBin = /usr/sbin/rabbitmqctl
PmapBin = /usr/bin/pmap
PidofBin = /bin/pidof

Logging :

Verbose = false
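
For illustration, the collectd side of things would look something along these lines in « collectd.conf » — note that the module name and path below are placeholders, so check the GitHub repo for the real values :

<LoadPlugin python>
    Globals true
</LoadPlugin>

<Plugin python>
    ModulePath "/opt/collectd-plugins/"
    Import "rabbitmq_info"
    <Module rabbitmq_info>
        RmqcBin  "/usr/sbin/rabbitmqctl"
        PmapBin  "/usr/bin/pmap"
        PidofBin "/bin/pidof"
        Verbose  false
    </Module>
</Plugin>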

It will attempt to gather the following information :

From « rabbitmqctl list_queues » :

messages
memory
consumers
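
(For reference, the corresponding manual invocation would look something like this :)

rabbitmqctl list_queues name messages memory consumers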

From « pmap » of « beam.smp » :

memory mapped
memory writeable/private (used)
memory shared

Props to Garret Heaton for inspiration and conceptual guidance from his « redis-collectd-plugin ».

workaround for slow shared folders in Virtualbox 3.x

Happy 2010 fair readers !  I hope that all is well with you and yours.  Let’s get right to business : Virtualbox has a feature that allows you to access the host OS’s file system from the guest OS (shared folders), which is super useful, but not exactly perfectly implemented.  In particular, there are known, documented performance issues in certain scenarios, such as when accessing a Linux host via a Windows guest (which, as you might imagine, is a pretty regular sort of activity).

One common (?) workaround is to install and configure Samba on the Linux host, then access it from the Windows guest like one would access any network server.  The problem here is that it requires that Samba be installed and configured, which can be a pain in the, well, you know.  Furthermore, the connection will be treated like any other, and the traffic will travel up and down the network stack, which is fundamentally unnecessary since the data is, physically speaking, stored locally.

Instead, here’s another workaround, one that keeps things simple, and solves the performance problem : just map the shared folder to a local drive in the host OS.  It’s that easy.  For those of us who aren’t too familiar with the Windows explorer interface (me included, heh), there are tonnes of step by step instructions available.  For whatever reason (i suspect Netbios insanity), accessing the network share via a mapped drive manages to avoid whatever condition creates the lag problems, resulting in a rapid, efficient access to the underlying filesystem.
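
For reference, the point-and-click mapping described in those instructions boils down to a single command anyway.  Assuming the shared folder was defined in Virtualbox under the name « myshare » (shares are exposed to the guest under the pseudo-server « \\vboxsvr ») :

C:\> net use X: \\vboxsvr\myshare /persistent:yes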

Hope that helps – enjoy !

load balancing, or, « how i learned to love a piranha »

Hello again, everybody !  Today i thought that we’d take a look at a fun and useful topic of interest to many system administrators : load balancing & redundancy.  Now, i know, it doesn’t sound too exciting – but trust me, once you get your first mini-cluster set up, you’ll never look at service management quite the same way again.  It’s not even that tough to set up, and you can get a basic setup going in almost no time at all, thanks to some great open source software that can be found in more or less any modern repository.

First, as always, a little bit of theory.  The most basic web server setup (for example), looks something like figure 001, below :

Figure 001

As you can see, this is a functional setup, but it does have (at least) two major drawbacks :

  • A critical failure on the web server means the service (i.e. the content being served) disappears along with it.
  • If the web server becomes overloaded, you may be forced to take the entire machine down to upgrade it (or just let your adoring public deal with a slow, unresponsive website, i suppose).

The solution to both of these problems forms the topic of this blog entry : load balancing.  The idea is straightforward enough : by adding more than one web server, we can ensure that our service continues to be available even when a machine fails, and we can also spread the love, er, load, across multiple machines, thus increasing our overall efficiency.  Nice !

batman and round robin

Now, there are a couple of ways to go about this, one of which is called « Round Robin DNS » (or RRDNS), which is both very simple and moderately useful.  DNS, for those needing a refresher, is (in a nutshell) the way that human-readable hostnames get translated into machine-readable numbers.  Generally speaking, hostnames are tied to IP addresses in a one-to-one or many-to-one fashion, such that when you type in a hostname, you get a single number back.  For example :

$ host www.dark.ca
www.dark.ca has address 88.191.66.127

In other words, when you type http://www.dark.ca into your browser, you get one particular machine on the Internet (as indicated by the address); however, it is also possible to set up a one-to-many relationship – this is the basis of RRDNS.  A very common example is Google :

$ host www.google.com
www.google.com is an alias for www.l.google.com.
www.l.google.com has address 74.125.39.99
www.l.google.com has address 74.125.39.103
www.l.google.com has address 74.125.39.104
www.l.google.com has address 74.125.39.105
www.l.google.com has address 74.125.39.106
www.l.google.com has address 74.125.39.147

So what’s going on here ?  In essence, the Google administrators have created a situation whereby typing in http://www.google.com into your browser will get you one of a whole group of possibilities.  In this way, each time you request some content from them, one of any number of machines will be responsible for delivering that service.  (Now, to be fair, the reality of what’s going on at Google is likely far more complex, but the premise is identical.)  Your web browser will only get one answer back, which is more or less randomly provided by the DNS server, and that response is the machine you’ll interact with.  As you can see, this (sort of) satisfies our problem of resource usage, and it (sort of) addresses the problem of resource failure.  For those of you who are more visually inclined, please see figure 002 below :

Figure 002

It’s not perfect, but it is workable, and most of all, it’s dead simple to set up – you just need to set your DNS configuration up and you’re good to go (an exercise i leave to you, fair reader, as RRDNS is not really the focus of our discussion today).  Thus, while RRDNS is a simple method for implementing a rudimentary load balancing infrastructure, it still has notable failings :

  • The load balancing isn’t systematic at all – by pure chance, one machine could end up getting hammered while others do very little, for example.
  • If a machine fails, there’s a chance that the DNS response will contain the address of the downed machine.  In other words, the chances of you getting the downed machine are 1 in X, where X is the number of possible responses to the DNS query.  The odds get better (or worse, depending on how you look at it) as more machines fail.
  • A slightly more obscure problem is that of response caching : as a method of optimisation, many DNS systems, as well as software that interacts with DNS, will cache (hold on to) hostname lookups for variable lengths of time.  This can invalidate the magic of RRDNS altogether…
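
Incidentally, if you do want to experiment with RRDNS despite those failings, the zone-file side really is as simple as advertised.  A minimal sketch, assuming BIND and some made-up documentation addresses :

; three A records for the same name ; the DNS server hands them out in rotation
www    IN  A  192.0.2.10
www    IN  A  192.0.2.11
www    IN  A  192.0.2.12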

another attack vector

Another approach to the problem, and the one we’ll be exploring in great depth in this article, is using a dedicated load balancing infrastructure, combining a handful of great open source tools and proven methodologies.  First, however, some more theory.

Our new approach to load balancing must propose both a solution to the original problems (critical failure & resource usage), as well as address and solve the drawbacks of RRDNS as noted above.  Really, what we want is an intelligent (or, at least, systematic) distribution of load across multiple machines, and a way to ensure that requests don’t get sent to downed machines by accident.  It’d be nice if these functions were automated too, since the last thing an administrator wants to do is baby-sit racks of servers.  What we’d like, in other words, could be represented by replacing the phrase « RRDNS » in figure 002 above, with the word « magic ».  For now, let’s imagine that this magic sits on a machine that we’ll call « Load Balancer » (or LB, for short), and that this LB machine would have a similar conceptual relationship to the web servers as RRDNS does.  Consider figure 003 :

Figure 003

This is a basic way of thinking about what’s going to happen.  It looks a lot like figure 002, but there is a very important difference : instead of relying on the somewhat nebulous concept of DNS for our load balancing, we can now give that responsibility to a proper machine dedicated to the purpose.  As you can imagine, this is already a huge improvement, since this opens the door to all sorts of additional features and possibilities that simply aren’t possible with straight DNS.  Another interesting aspect of this diagram is that, visually speaking, it would appear that the Internet cloud only « sees » one machine (the load balancer), even though there are a number of web servers behind it.  This concept of having a single point of entry lies at the very core of our strategy – both figuratively and literally – as we’ll soon discover.

In the here and now, however, we’re still dealing with theory, and a solution based on « magic » is about as theoretical as it gets. Luckily for us though, magic is exactly what we’re about to unleash – in the form of « Linux Virtual Server », or « LVS » for short.  From their homepage :

The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the server cluster is fully transparent to end users, and the users interact as if it were a single high-performance virtual server. […] The Linux Virtual Server as an advanced load balancing solution can be used to build highly scalable and highly available network services, such as scalable web, cache, mail, ftp, media and VoIP services.

The thing about LVS is that while it’s not inherently complex, it is highly malleable, and this means you really do need to have a solid handle on exactly what you want to do, and how you want to do it, before you start playing around.  Put another way, there are a myriad of ways to use LVS, but you’ll only use one of them at a time, and picking the right methodology is important.  The best way to do this is by building maps and really getting a solid feel for how the various components of the overall architecture relate to each other.  Once you’ve got a good mental idea of what things should look like, actually configuring LVS is about as straightforward as it gets (no, really!).

let’s complicate the issue further, for science !

Looking back to figure 003, we can see that our map includes the Internet, the Load Balancer, and some Web Servers.  This is a pretty typical sort of setup, and thus, we can approach it from a few different ways.  One of the decisions that needs to be made fairly early on, though, has more to do with topology and routing than LVS specifically : how, exactly, do the objects on the map relate to each other at a network level ?  As always, there can be lots of answers to this question – each with their advantages and disadvantages – but ultimately we must pick only one.  Since i value simplicity when it comes to technology, figure 004 describes a simple network topology :

Figure 004

Now, for those of you out there who may have some experience with LVS, you can see exactly where this is headed – for everybody else, this might not be what you were expecting at all.  Let’s take a look at some of the more obvious points :

  • There are two load balancers.
  • The web servers are on the same network segment as the LBs.
  • Unlike the previous diagrams, the LBs do not appear to be « in between » the Internet and the web servers.

The first point is easy  : there are two LBs for reasons of redundancy, as a single LB represents a single point of failure.  In other words, if the LB stops working for whatever reason, all of your services behind it become functionally unavailable, thus, you really, really want to have another machine ready to go immediately following a failure.

A little bit more explanation is required to explain the second and third points – but the short answer is two words : « Direct Routing » (or DR for short).  From the LVS wiki :

Direct Routing [is] an IP load balancing technology implemented in LVS. It directly routes packets to backend server through rewriting MAC address of data frame with the MAC address of the selected backend server. It has the best scalability among all other methods because the overhead of rewriting MAC address is pretty low, but it requires that the load balancer and the backend servers (real servers) are in a physical network.

If that sounds heavy, don’t worry – figure 005 explains it in easy visual form :

Figure 005

In a nutshell, requests get sent to the LB, which then passes them to a web server, which in turn responds directly to the client.  It’s fast, efficient, scalable, and easy to set up, with the only caveat being that the LBs and the machines they’re balancing must be on the same network.  As long as you’re willing to accept that restriction, Direct Routing is an excellent choice – and it’s the one we’ll be exploring further today.

a little less conversation, a little more action

So with that in mind, let’s get started.  I’m going to be describing four machines in the following scenario.  All four are identical off-the-shelf servers running CentOS 5.2 – nothing fancy here.  The naming and numbering conventions are simple as well :

HOSTNAME    ROLE                         IP ADDRESS
A01         Load Balancer (primary)      …
A02         Load Balancer (secondary)    …
B01         Web Server                   192.168.0.38
B02         Web Server                   192.168.0.39
WWW         Virtual Web Server           192.168.0.40

You probably noticed the fifth item in this list, labelled « Virtual Web Server ».  This represents our virtual, or clustered service, and is not a real machine.  This will be explained in further detail later on – for now, let’s go ahead and install the key software on both of the Load Balancer machines :

[root@A01 etc]# yum install ipvsadm piranha httpd

« ipvsadm » is, as you might have guessed, the administrative tool for « IPVS », which is in turn an acronym for « Internet Protocol Virtual Server », which makes more sense when you say « IP-based Virtual Server » instead.  As the name implies, IPVS is implemented at the IP level (which is more generically known as Layer-3 of the OSI model), and is used to spread incoming connections to one IP address towards other IP addresses according to one of many pre-defined methods.  It’s the tool that allows us to control our new load balancing infrastructure, and is the key software component around which this entire exercise revolves.  It is powerful, but sort of a pain to use, which brings us to the second item in the list : piranha.
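
Just to demystify things a little : here is roughly how the clustered web service we are about to build could be declared with ipvsadm by hand, using the addresses from the table above.  This is purely illustrative (Piranha and friends will manage the real table for us), but it shows what is actually going on underneath :

[root@A01 ~]# ipvsadm -A -t 192.168.0.40:80 -s lc                    # virtual service (VIP:port), « lc » = least-connections
[root@A01 ~]# ipvsadm -a -t 192.168.0.40:80 -r 192.168.0.38:80 -g   # add real server B01, « -g » = direct routing
[root@A01 ~]# ipvsadm -a -t 192.168.0.40:80 -r 192.168.0.39:80 -g   # add real server B02
[root@A01 ~]# ipvsadm -L -n                                          # list the current table, numeric output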

Piranha is a web-based tool (hence httpd, above) for administering LVS, and is effectively a front-end for ipvsadm.  As installed in CentOS, however, the Piranha package contains not only the PHP pages that make up the interface, but also a handful of other tools of particular interest and usefulness that we’ll take a look at as well.  For now, let’s continue with some basic setup and configuration.

A quick word of warning : before starting « piranha-gui » (one of the services supplied by Piranha) up for the first time, it’s important that both LBs have the same time set on them.  You’ve probably already got NTP installed and functioning, but if not, here’s a hint :

[root@A01 ~]# yum -y install ntp && ntpdate pool.ntp.org && chkconfig ntpd on && service ntpd start

Moving right along, the next step is to define a login for the Piranha web interface :

[root@A01 ~]# /usr/sbin/piranha-passwd

You can define multiple logins if you like, but for now, one is certainly enough.  Now, unless you plan to run your load balanced infrastructure on a completely internal network, you’ll probably want to set up some basic restrictions on who can access the interface.  Since the interface is served via an instance of Apache HTTPd, all we have to do is set up a normal « .htaccess » file.  Now, a full breakdown of .htaccess (and, in particular, mod_access) is outside of the scope of this document, but the simple gist is as follows :

[root@A01 ~]# cat /etc/sysconfig/ha/web/secure/.htaccess
Order deny,allow
Deny from all              # by default, deny from everybody
Allow from 192.168.0       # requests from this network are allowed

With those items out of the way, we can now activate piranha-gui :

[root@A01 ~]# chkconfig piranha-gui on && service piranha-gui start

Congratulations !  The interface is now running on port 3636, and can be accessed via your browser of choice – in the case of our example, it’d be « http://A01:3636/ ».  The username for the web login is « piranha », and the password is the one we set above.  Now that we’re logged in, let’s take a look at the interface in greater depth.

look out – piranhas !

The first screen – known as the « Control » page – is a summary of the current state of affairs.  Since nothing is configured or even active, there isn’t a lot to see right now.  Moving on to the « Global Settings » tab, we have our first opportunity to start putting some settings into place :

  • Primary server public IP : Put the IP address of the « primary » LB.  In this example, we’ll put the IP of A01.
  • Primary server private IP : If we weren’t using direct routing, this field would need a value.  In our example, therefore, it should be empty.
  • Use network type : Direct Routing (of course!)

On to the « Redundancy » tab :

  • Redundant server public IP : Put the IP address of the « secondary » LB.  In this example, we’ll put the IP of A02.
  • Syncdaemon : Optional and useful – but know that it requires additional configuration in order to make it work.
    • This feature (which is relatively new to LVS) ensures that the state information (i.e. connections, etc.) is shared with the secondary in the event that a failover occurs.  For more information, please see this page from the LVS Howto.
    • It is not necessary, strictly speaking, so we can just leave it unchecked for now.

Under the « Virtual Servers » tab, let’s go ahead and click « Add », then select the new unconfigured entry and hit « Edit » :

  • Name : This is an arbitrary identifier for a given clustered service.  For our example, we’d put « WWW ».
  • Application Port : The port number for the service – HTTP runs on port 80, for example.
  • Protocol : TCP or UDP – this is normally TCP.
  • Virtual IP Address : This is the IP address of the virtual service (VIP), which you may recall from the table above.  This is the IP address that clients will send requests to, regardless of the real IP addresses (RIP) of the real servers which are responsible for the service.  In our example, we’d put 192.168.0.40 .
    • Each service that you wish to cluster needs a unique « address : port » pairing.  For example, 192.168.0.40:80 could be a web service, and 192.168.0.40:25 would likely be a mail service, but if you wanted to run another, separate web service, you’d need to assign a different virtual IP.
  • Virtual IP Network Mask : Normally this is 255.255.255.255, indicating a single IP address (the Virtual IP Address above).
    • You can actually cluster subnets, but this is outside of the scope of this tutorial.
  • Device : The Virtual IP address needs to be assigned to a « virtual network interface », which can be named more or less anything, but generally follows the format « ethN:X », where N is the physical device, and X is an incremental numeric identifier.  For example, if your physical interface is « eth0 », and this is the first virtual interface, then it would be named « eth0:1 ».
    • If and when you set up multiple virtual interfaces, it is important to not mix these up.  Piranha has no facility for sanity checking these identifiers, so you may wish to track them yourself in a Google document or something.
  • Scheduling : There are a number of options here, and some are very different from one another.  For the purposes of this exercise, we’ll pick a simple, yet relatively effective scheduler called « Least-Connections ».
    • This does exactly what it sounds like : when a new request is made to the virtual service, the LB will check to see how many connections are open to each of the real servers in the cluster, and then route the connection to the machine with the least connections.  Congrats, you’ve now got load balancing !

Finally, let’s add some real servers into the cluster.  From the « Edit » screen we’re already in, click on the « Real Server » sub-tab.

  • Name : This is the hostname of the real server.  In our example, we’d put B01.
  • Address : The IP address of the real server.  In our example, for B01, we’d put 192.168.0.38 .
  • Port : Generally speaking this can be left empty, as it will inherit the Port value defined in the virtual service (in this case, 80).
    • A value would be required here if your real server is running the service on a different port than that specified in the virtual service ; if your real server is running a web service on port 8080 instead of 80, for example.
  • Weight : Despite the name, this value is used in various different ways depending on which Scheduler you selected for the virtual service.  In our example, however, this field is irrelevant, and can be left empty.

You can apply and add as many real servers as you like, one at a time, in this fashion.  Go ahead and set up B02 (or whatever your equivalent is) now.
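
Behind the scenes, everything we have just clicked through gets written to « /etc/sysconfig/ha/lvs.cf » on the primary.  The snippet below is only a rough illustration of the sort of thing that ends up in there (abridged and approximate, so treat the file that Piranha actually generates as the authority) :

primary = <IP of A01>
backup = <IP of A02>
backup_active = 1
service = lvs
network = direct
virtual WWW {
     active = 1
     address = 192.168.0.40 eth0:1
     vip_nmask = 255.255.255.255
     port = 80
     scheduler = lc
     protocol = tcp
     server B01 {
         address = 192.168.0.38
         active = 1
         weight = 1
     }
     server B02 {
         address = 192.168.0.39
         active = 1
         weight = 1
     }
}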

If you’re wondering when the secondary LB is going to be configured, well, wonder no longer : the future is now.  Luckily, this step is very, very easy.  From the secondary :

[root@A02 ~]# scp root@A01:/etc/sysconfig/ha/lvs.cf /etc/sysconfig/ha/

now is a good time to grab a beer

Phew !  That was a lot of work.  After consuming a suitable refreshment, let’s move on to the final few steps.  Earlier i mentioned that there were some other items that we’d need to learn about besides the Piranha interface – « Pulse » is one such item.  Pulse, as a tool, is in the same family as some other tools you may have heard of, such as « Heartbeat », « Keepalived », or « OpenAIS ».

The basic idea of all of these tools is simple : to provide a « failover » facility between a group of two or more machines.  In our example, our primary LB is the one that is normally active, but in the case that it fails for some reason, we’d like our secondary to kick in and take over the responsibilities of the unavailable primary – this is what Pulse does.  Each of the load balancers runs an instance of « pulse » (the executable, not the package), which behaves in this fashion :

  • Each LB sends out a broadcast packet (a pulse, as it were) stating that they are alive.  As long as the active LB (commonly the primary) continues to announce itself, everybody is happy and nothing changes.
  • If, however, the inactive LB (commonly the secondary) server notices that it hasn’t seen any pulses from the active LB lately, it assumes that the active LB has failed.
  • The secondary, formerly inactive LB, then becomes active.  This state is maintained until such a time as the primary starts announcing itself again, at which point the secondary demotes itself back to inactivity.

The difference between the active and the inactive server is actually very simple : the active server is the one with the virtual addresses assigned to it (remember those, from the Virtual Servers tab in Piranha?).

Let’s go ahead and start it up (on the primary LB first, then on the secondary) :

[root@A01 ~]# chkconfig --add pulse
[root@A01 ~]# service pulse start
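
Once pulse is running on both machines, a quick way to see which LB is currently active is to look for the virtual address on its public interface (a rough check ; the interface name is assumed to be eth0 here) :

[root@A01 ~]# ip addr show eth0

The active LB is the one that lists 192.168.0.40 (typically on an alias such as « eth0:1 ») ; the inactive one will not have it.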

an internet ballgame drama – in 5 parts

You may have noticed that we haven’t even touched the « real » servers (i.e. the web servers) yet.  Now is the time.  As it so happens, there’s only one major step that relates to the real servers, but it’s a very, very important one : defining VIPs, and then ensuring that the web servers are OK with the routing voodoo that we’re using to make this whole load balancing infrastructure work.  The solution is simple, but the reason for the solution may not be immediately obvious – for that, we need to take a look at the IP layer of each packet (neat!).  First, though, let’s run through a series of little stories :

  • Alice has a ball that she’d like to pass to Bob, so she tosses it his way.
  • Bob catches the ball, sees that it’s from Alice, and throws it back at her.  What great fun !

Now imagine that Alice and Bob are hanging out with a few hundred million of their closest friends – but they still want to play ball.

  • Alice writes Bob’s name on the ball, who then passes it to somebody else, and so forth.
  • Eventually the ball gets passed to Bob.  Unfortunately for Bob, he has no idea where it came from, so he can’t send it back.

The solution is obvious :

  • Alice writes « From : Alice, To : Bob » on the ball, then passes it along.
  • Bob gets the ball, and switches the names around so that it says « From : Bob, To : Alice », and sends it back.

OK, so, those were some nice stories, but how do they apply to our Load Balancing setup ?  As it turns out, all we need to do is throw in some tubes, and we’ve described one of the basic functions of the Internet Protocol – that the source and destination IP addresses of a given packet are part of the IP layer of said packet.  Let’s complicate it by one more level :

  • Alice prepares the ball as above, and sends it flying.
  • Bob, who’s been avoiding Alice since things got weird at the bar last week-end, gets the ball and passes it along to Charles.
  • Charles – who’s had a not-so-secret crush on Alice since high school – happily writes « From : Charles, To : Alice », and tosses it away.
  • Alice receives the ball, but much to her surprise, it’s from Charles, and not Bob as she expected. Awkward !

With that last story in mind, let’s take another look at figure 005 above (go ahead, i’ll wait).  Notice anything ?  That’s right – the original source sends their packet off, but then receives a response from a different machine than they expected.  This does not work – it violates some basic rules about how communications are supposed to function on the Internet.   For the thrilling conclusion – and a solution to the problem – let’s return to our drama :

  • As it turns out, Bob is a player : he gets so many balls from so many women that he needs to keep track of them all in a little notebook.
  • When Bob gets Alice’s ball he passes it to Charles, then records where it came from and who he gave it to in his notebook.
  • Charles – in an attempt to get into Bob’s circle of friends – agrees to write « From : Bob, To : Alice » on the ball, then sends it back.
  • Alice – expecting a ball from Bob – is happy to receive her Bob-signed spheroid.
  • Bob then gets another ball from Denise, passes it to Edward, and records this relationship as well.
  • Edward – a sycophant if ever there was – prepares the ball in the same fashion as Charles, and fires it back.

Of course, the more balls Bob has to deal with, the more helpers he can use to spread the work around.  As you’ve no doubt pieced together, Alice and Denise are any given sources on the Internet, Bob is our LB, and Charles & Edward are the web servers.  Instead of writing people’s names on balls, we now make the mental leap to IP addresses in packets.  With our tables of hostnames and addresses in mind, let’s consider the following example :

  • The source sends a request for a web page.
    • The source IP is « 10.1.2.3 », and the destination IP is « 192.168.0.40 » (the VIP for WWW).
  • The packet is sent to A01, which is currently active, and thus has the VIP for WWW assigned to it.
  • A01 then forwards the packet to B02 (by chance), which crafts a response packet.
    • The RIP for B02 is « 192.168.0.39 », but instead of using that, the source IP is set to « 192.168.0.40 », and the destination is « 10.1.2.3 ».
  • The source, expecting a response from « .40 », indeed receives a packet that appears to be from WWW.  Done and done.

The theory is sound, but how can we implement this in practice ?  As i said – it’s simple !  We simply add a dummy interface to each of the web servers that has the same address as the VIP, which will allow the web servers to interact with packets properly.  This is best done by creating a simple sysconfig entry on each of the web servers for the required dummy interface, as follows :

[root@B01 ~]# vim /etc/sysconfig/network-scripts/ifcfg-lo:0
# for VIP
DEVICE=lo:0
IPADDR=192.168.0.40
NETMASK=255.255.255.255
BROADCAST=192.168.0.40
ONBOOT=yes
NAME=vip0
[root@B02 ~]# vim /etc/sysconfig/network-scripts/ifcfg-lo:0
# for VIP
DEVICE=lo:0
IPADDR=192.168.0.40
NETMASK=255.255.255.255
BROADCAST=192.168.0.40
ONBOOT=yes
NAME=vip0

The « lo » indicates that it’s a « Loopback address », which is best described by Wikipedia :

Such an interface is assigned an address that can be accessed from management equipment over a network but is not assigned to any of the real interfaces on the device. This loopback address is also used for management datagrams, such as alarms, originating from the equipment. The property that makes this virtual interface special is that applications that use it will send or receive traffic using the address assigned to the virtual interface as opposed to the address on the physical interface through which the traffic passes.

In other words, it’s a fake IP that the machine can use to make packets anyways.  Now, there is a known scenario in which a machine with a given loopback address will, in this particular situation, cause confusion on the network about which interface actually « owns » a given address.  It has to do with ARP, and interested readers are encouraged to Google for « LVS ARP problem » for more technical details – for now, let’s just get right to the solution.  On each of the real servers, we’ll need to edit « sysctl.conf » :

[root@B01 ~]# vim /etc/sysctl.conf
# this file already has stuff in it, so put this at the bottom
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2

Now, reload the new settings :

[root@B01 ~]# sysctl -p

That’s it – problem solved.

all together now !

At this point we’ve now explored each key item that is necessary to make this whole front-end infrastructure work, but it is perhaps not quite clear how it all works together.  So, let’s take a step back for a moment and review :

  • There are four servers : two are load balancers, and two are web servers.
    • Of the two load balancers, only one is active at any given time ; the other is a backup.
  • Every DNS entry for the sites on the web servers points to one actual IP address.
    • This IP address is called the « Virtual IP ».
    • The VIP is claimed by the active load balancer, meaning that when a request is made for a given website, it goes to the active LB.
  • The LB then re-directs the request to an actual web server.
    • The re-direction can be random, or based on varying levels of logical decision making.
    • The web server will respond directly – the LB is not a proxy.

Great !  Now, what software runs where, and why ?

  • The load balancers use LVS in order to manage the relationship between VIPs and RIPs.
  • Pulse is used between the LBs in order to determine who is alive, and which one is active.
  • An optional (but useful) web interface to both LVS and Pulse comes in the form of Piranha, which runs on a dedicated instance of Apache HTTPd on port 3636.

And that, my friends, is that !  If you have any questions, feel free to comment below (remember to subscribe to the RSS feed for responses).  Happy balancing !

oh, p.s., one last thing…

In case you’re wondering how to keep your LVS configuration file synchronised across both of the load balancers, one way to do it would be with a network-aware filesystem – POHMELFS, for example. 😉

(complex) partitioning in kickstart

UPDATE: This article was written back in 2009. According to a commenter below, Busybox has been replaced by Bash in RHEL 6; perhaps Fedora as well?

Bonjour my geeky friends ! 🙂  As you are likely aware, it is now summer-time here in the northern hemisphere, and thus, i’ve been spending as much time away from the computer as possible.  That said, it’s been a long time, i shouldn’t have left you, without a strong beat to step to.

Now, if you’re not familiar with kickstarting, it’s basically just a way to automate the installation of an operating environment on a machine – think hands-free installation.  Anaconda is the OS installation tool used in Fedora, RedHat, and some other Linux OS’s, and it can be used in a kickstart capacity.  For those of you looking for an intro, i heavily suggest reading over the excellent documentation at the Fedora project website.  The kickstart configuration process could very easily be a couple of blog entries on its own (which i’ll no doubt get around to in the future), but for now i want to touch on one particular aspect of it : complex partition schemes.

how it is

The current method for declaring partitions is relatively powerful, in that all manner of basic partitions, LVM components, and even RAID devices can be specified – but where it falls short is in actually creating the partitions on the disk itself.  The options that can be supplied to the partition keywords can make this clunky at best (and impossible at worst).

A basic example of a partitioning scheme that requires nothing outside of the available functions :

DEVICE                 MOUNTPOINT               SIZE
/dev/sda               (total)                  500,000 MB
/dev/sda1              /boot/                       128 MB
/dev/sda2              /                         20,000 MB
/dev/sda3              /var/log/                 20,000 MB
/dev/sda5              /home/                   400,000 MB
/dev/sda6              /opt/                     51,680 MB
/dev/sda7              swap                       8,192 MB

Great, no problem – we can easily define that in the kickstart :

part  /boot     --asprimary  --size=128
part  /         --asprimary  --size=20000
part  /var/log  --asprimary  --size=20000
part  /home                  --size=400000
part  /opt                   --size=51680
part  swap                   --size=8192

But what happens if we want to use this same kickstart on another machine (or, indeed, many other machines) that don’t have the same disk size ?  One of the options that can be used with the « part » keyword is « --grow », which tells Anaconda to create as large a partition as possible.  This can be used along with « --maxsize= », which does exactly what you think it does.

Continuing with the example, we can modify the « /home » partition to be of a variable size, which should do us nicely on disks which may be smaller or larger than our original 500GB unit.

part  /home  --size=1024  --grow

Here we’ve stated that we’d like the partition to be at least a gig, but that it should otherwise be as large as possible given the constraints of both the other partitions, as well as the total space available on the device.  But what if you also want « /opt » to be variable in size ?  One way would be to grow both of them :

part  /home  --size=1024  --grow
part  /opt   --size=1024  --grow

Now, what do you think that will do ? If you guessed « grow both of them to half the total available size each », you’d be correct.  Maybe this is what you wanted – but then again, maybe it wasn’t.  Of course, we could always specify a maximum ceiling on how far /opt will grow :

part  /opt  --size=1024  --maxsize=200000  --grow

That works, but only at the potential expense of /home.  Consider what would happen if this was run against a 250GB disk ; the other (static) partitions would eat up some 48GB, /opt would grow to the maximum specified size of 200GB, and /home would be left with the remaining 2GB of available space.

If we were to add more partitions into the mix, the whole thing would become an imprecise mess rather quickly.  Furthermore, we haven’t even begun to look at scenarios where there may (or may not) be more than one disk, nor any fun tricks like automatically setting the swap size to be the same as the actual amount of RAM (for example).  For these sorts of things we need a different approach.

the magic of pre, the power of parted

The kickstart configuration contains a section called « %pre », which should be familiar to anybody who’s dealt with RPM packaging.  Basically, the pre section contains text which will be parsed by the shell during the installation process – in other words, you can write a shell script here.  Fairly be thee warned, however, as the shell spawned by Anaconda is « BusyBox », not « bash », and it lacks some of the functionality that you might expect.  We can use the %pre section to our advantage in many ways – including partitioning.  Instead of using the built-in functions to set up the partitions, we can do it ourselves (in a manner of speaking) using « parted ».

Parted is, as you might expect, a tool for editing partition data.  Generally speaking it’s an interactive tool, but one of the nifty features is the « scripted mode », wherein partitioning commands can be passed to Parted on the command-line and executed immediately without further intervention.  This is very handy in any sort of automated scenario, including during a kickstart.

We can use Parted to lay the groundwork for the basic example above, wherein /home is dynamically sized.  Initially this will appear inefficient, since we won’t be doing anything that can’t be accomplished by using the existing Kickstart functionality, but it provides an excellent base from which to do more interesting things.  What follows (until otherwise noted) are text blocks that can be inserted directly into the %pre section of the kickstart config :

# clear the MBR and partition table
dd if=/dev/zero of=/dev/sda bs=512 count=1
parted -s /dev/sda mklabel msdos

This ensures that the disk is clean, so that we don’t run into any existing partition data that might cause trouble.  The « dd » command overwrites the first bit of the disk, so that any basic partition information is destroyed, then Parted is used to create a new disk label.

TOTAL=`parted -s /dev/sda unit mb print free | grep Free | awk '{print $3}' | cut -d "M" -f1`

That little line gives us the total size of the disk, and assigns to a variable named « TOTAL ».  There are other ways to obtain this value, but in keeping with the spirit of using Parted to solve our problems, this works.  In this instance, « awk » and « cut » are used to extract the string we’re interested in.  Continuing on…

# calculate start points
let SWAP_START=$TOTAL-8192
let OPT_START=$SWAP_START-51680

Here we determine the starting position for the swap and /opt partitions.  Since we know the total size, we can subtract 8GB from it, and that gives us where the swap partition starts.  Likewise, we can calculate the starting position of /opt based on the start point of swap (and so forth, were there other partitions to calculate).
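
With the 500,000 MB example disk from earlier, that works out to SWAP_START = 500000 - 8192 = 491808, and OPT_START = 491808 - 51680 = 440128.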

# partitions IN ORDER
parted -s /dev/sda mkpart primary ext3 0 128
parted -s /dev/sda mkpart primary ext3 128 20128
parted -s /dev/sda mkpart primary ext3 20128 40256
parted -s /dev/sda mkpart extended 40256 $TOTAL
parted -s /dev/sda mkpart logical ext3 40256 $OPT_START
parted -s /dev/sda mkpart logical ext3 $OPT_START $SWAP_START
parted -s /dev/sda mkpart logical $SWAP_START $TOTAL

The variables we populated above are used here in order to create the partitions on the disk.  The syntax is very simple :

  • « parted -s »  : run Parted in scripted (non-interactive) mode.
  • « /dev/sda » : the device (later, we’ll see how to determine this dynamically).
  • « mkpart » : the action to take (make partition).
  • « primary | extended | logical » : the type of partition.
  • « ext3 » : the type of filesystem (there are a number of possible options, but ext3 is pretty standard).
    • Notice that the « extended » and « swap » definitions do not contain a filesystem type – it is not necessary.
  • « start# end# » : the start and end points, expressed in MB.

Finally, we must still declare the partitions in the usual way.  Take note that this does not occur in the %pre section – this goes in the normal portion of the configuration for defining partitions :

part  /boot     --onpart=/dev/sda1
part  /         --onpart=/dev/sda2
part  /var/log  --onpart=/dev/sda3
part  /home     --onpart=/dev/sda5
part  /opt      --onpart=/dev/sda6
part  swap      --onpart=/dev/sda7

As i mentioned when we began this section, yes, this is (so far) a remarkably inefficient way to set this particular basic configuration up.  But, again to re-iterate, this exercise is about putting the groundwork in place for much more interesting applications of the technique.

mo’ drives, mo’ better

Perhaps some of your machines have more than one drive, and some don’t.  These sorts of things can be determined, and then reacted upon dynamically using the described technique.  Back to the %pre section :

# Determine number of drives (one or two in this case)
set $(list-harddrives)
let numd=$#/2
d1=$1
d2=$3

In this case, we’re using a built-in function called « list-harddrives » to help us determine which drive or drives are present, and then assign their device identifiers to variables.  In other words, if you have an « sda » and an « sdb », those identifiers will be assigned to « $d1 » and « $d2 », and if you just have an sda, then $d2 will be empty.

This gives us some interesting new options ; for example, if we wanted to put /home on to the second drive, we could write up some simple logic to make that happen :

# if $d2 has a value, it's that of the second device.
if [ ! -z "$d2" ]
then
  HOMEDEVICE=$d2
else
  HOMEDEVICE=$d1
fi

# snip...
part  /home  --size=1024  --ondisk=/dev/$HOMEDEVICE  --grow
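
One practical wrinkle worth flagging (this is a sketch, not something lifted from the kickstart above) : the main body of a kickstart file is not parsed by the shell, so « $HOMEDEVICE » will not expand there on its own.  The usual trick is to have %pre write the dynamic « part » lines out to a file, then pull that file in with « %include » :

# at the end of %pre, write the dynamic line(s) out to a temporary file...
echo "part  /home  --size=1024  --ondisk=/dev/$HOMEDEVICE  --grow" > /tmp/part-include

# ...and in the main section of the kickstart, in place of the literal « part /home » line :
%include /tmp/part-include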

That « part » line, of course, assumes that the other partitions are defined, and that /home is the only entity which should be grown dynamically – but you get the idea.  There’s nothing stopping us from writing a normal shell script that could determine the number of drives, their total size, and where the partition start points should be based on that information.  In fact, let’s examine this idea a little further.

the size, she is dynamic !

Instead of trying to wrangle the partition sizes together with the default options, we can get as complex (or as simple) as we like with a few if statements, and some basic maths.  Thinking about our layout then, we can express something like the following quite easily :

  • If there is one drive that is at least 500 GB in size, then /opt should be 200 GB, and /home should consume the rest.
  • If there is one drive is less than 500 GB, but more than 250 GB, then /opt and /home should each take half.
  • If there is one drive that is less than 250 GB, then /home should take two-thirds, and /opt gets the rest.
# $TOTAL and $SWAP_START from above...
if [ $TOTAL -ge 512000 ]
then
  # /opt gets a flat 200 GB ; /home takes whatever is left
  let OPT_START=$SWAP_START-204800
elif [ $TOTAL -lt 512000 ] && [ $TOTAL -ge 256000 ]
then
  # get the dynamic space total, which is between where /var/log ends, and swap begins
  let DYN_TOTAL=$SWAP_START-40256
  # /home and /opt each take half, so /opt starts half-way into that space
  let OPT_START=$DYN_TOTAL/2
  let OPT_START=$OPT_START+40256
elif [ $TOTAL -lt 256000 ]
then
  # /home takes two-thirds of the dynamic space, so /opt starts two-thirds of the way in
  let DYN_TOTAL=$SWAP_START-40256
  let OPT_START=$DYN_TOTAL/3
  let OPT_START=$OPT_START+$OPT_START
  let OPT_START=$OPT_START+40256
fi

Now, instead of having to create three different kickstart files, each describing a different scenario, we’ve covered it with one – nice !

other possibilities

At the end of the day, the possibilities are nearly endless, with the only restriction being that whatever you’d like to do has to be do-able in BusyBox – which, at this level, provides a lot of great functionality.

Stay tuned for more entries related to kickstarting, PXE-based installations, and so forth, all to come here on dan’s linux blog.  Cheers !

getting Dia to give you a pdf

Hello again !  Today’s quick tip concerns a software package called Dia, which is an open source tool (available for both Windows and Linux, as it goes) used to make diagrams, flowcharts, network maps, and so forth.  It has its own file format (.dia), which is (obviously?) useful for saving the projects you’re working on, but less useful if you need to give the diagram to anybody else, either in print or electronic form.

Dia can export to a variety of formats including SVG, PNG, and EPS, but one export format that it lacks native support for is the venerable PDF, which has become a de facto standard for transmitting documents between diverse environments.  There are many advantages and interesting aspects of the PDF format, not the least of which being that what you see on your screen is what you get when it’s printed.  It is unfortunate, then, that Dia won’t spit out a PDF (even if you ask very nicely).

Of course, being that it’s so easy to print directly to PDF (via CUPS, for example) these days, having native support for PDF may not, at first, seem all that useful.  Well, as it turns out, printing directly to PDF might not give you quite what you were looking for.  In practice, you do get a PDF, but what appeared to be a modestly-sized diagram in Dia will turn out to be a multi-page monster in (virtually) printed form.  As a general rule, this is not what you want.

In order to get a usable PDF we need to use an intermediate step between Dia and the final file.  The idea, quite simply, is to export the diagram as one of the supported formats, then convert that file into a PDF.  There are a number of options here, but for our purposes we’ll save the diagram as an EPS file, then use a quick little command-line tool called « epstopdf » to perform the conversion.
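
As an aside, the EPS export itself does not have to be done through the GUI ; Dia has a command-line export mode, which makes it easy to script the whole conversion (the flag below is taken from the Dia man page, so double-check it against your version) :

$ dia --export=somediagram.eps somediagram.dia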

There’s a good chance that you don’t have epstopdf on your machine.  If you’re using Ubuntu, you used to be able to install it easily via the APT packager, but these days the little conversion tool comes as part of a larger suite of tools called « texlive-extra-utils ».  This suite is dependent on a number of other packages, so go ahead and install them all :

$ sudo apt-get install texlive-extra-utils

EDIT : In Ubuntu 10.04, the package is named « texlive-font-utils ».

Among many, many little items of interest, our target application will be installed.  To use it, simply feed it the name of the EPS file as an argument :

$ epstopdf somediagram.eps

It will automatically output a PDF file of the same name.  There you go – a nice, shiny PDF of your Dia diagram.

Enjoy !