how to deal with broken time zones during a CentOS 5.3 kickstart

Hello again fair readers !  Today’s quick tip concerns the problem with missing time zones when deploying CentOS 5.3 (and some of the more recent Fedoras) in a kickstart environment.  It’s a known problem, and unfortunately, since the source of the problem (an incomplete time zone data file) lies deep in the heart of the kickstart environment, fixing it directly is a distinct pain in the buttock region.

There is, however, a workaround – and it’s not even that messy !  The first step is to use a region that does exist, such as « Europe/Paris », which will satisfy the installer – then set the time zone to what you actually want after the fact in the « %post » section.  So, in the top section of the kickstart file, we’ll put :

# set temporarily to avoid time zone bug during install
timezone --utc Europe/Paris

The « --utc » switch simply states that the system clock is in UTC, which is pretty standard these days, but ultimately optional.  Next, in the %post section towards the end, we’ll shoehorn our little hack fix into place :

# fix faulty time zone setting
mv /etc/sysconfig/clock /etc/sysconfig/clock.BAD
sed 's@^ZONE="Europe/Paris"@ZONE="Etc/UTC"@' /etc/sysconfig/clock.BAD > /etc/sysconfig/clock
/usr/sbin/tzdata-update

So, what’s going on there ?  Let’s break it down :

  • In the first line, we’re just backing up the original configuration file, to use in the next line…
  • The second line is the important one – this is the actual manipulation which will fix the faulty time zone, setting it to whatever we want.  In this example « Etc/UTC » is used, but you can pick whatever is appropriate.
    • The tool being used here is « sed », a non-interactive editor which dates back to the 1970s, and which is still used by system administrators around the world every day.
    • The command we’re issuing to sed is between the single quotes – astute readers will notice that it’s a substitution built around a regular expression, but with @’s as delimiters instead of the more usual /’s (which saves us from having to escape the slash in the zone name).  In it, we simply state that the instance of « ZONE="Europe/Paris" » is to be replaced with « ZONE="Etc/UTC" ».
    • This change is made against the backup file, and the result is written to the actual config.
  • Finally, we run « tzdata-update » which, as you’ve no doubt guessed, updates the time zone data system-wide, based (in part) on the newly-corrected clock config.
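
If you’d like to try the substitution safely before committing it to a kickstart file, here is a minimal sketch that does the same thing against a scratch copy.  The « fix_zone » function name and the scratch file are hypothetical, and the sample contents are only an assumption about what the installer writes to the clock config :

```shell
#!/bin/sh
# Hypothetical helper: rewrite the ZONE line of a clock-style config file.
# $1 = file to fix, $2 = zone the installer wrote, $3 = zone we actually want
fix_zone() {
    cp "$1" "$1.BAD"    # keep a backup, as in the %post snippet
    sed "s@^ZONE=\"$2\"@ZONE=\"$3\"@" "$1.BAD" > "$1"
}

# Try it on a scratch file instead of the real /etc/sysconfig/clock.
scratch=$(mktemp)
printf 'ZONE="Europe/Paris"\nUTC=true\nARC=false\n' > "$scratch"
fix_zone "$scratch" "Europe/Paris" "Etc/UTC"
cat "$scratch"
```

The first line of the scratch file should now carry the new zone, while the « .BAD » copy still holds the original.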

And that, as they say, is that.  Happy kickstarting, friends, and i’ll see you next time !

load balancing, or, « how i learned to love a piranha »

Hello again, everybody !  Today i thought that we’d take a look at a fun and useful topic of interest to many system administrators : load balancing & redundancy.  Now, i know, it doesn’t sound too exciting – but trust me, once you get your first mini-cluster set up, you’ll never look at service management quite the same way again.  It’s not even that tough to set up, and you can get a basic setup going in almost no time at all, thanks to some great open source software that can be found in more or less any modern repository.

First, as always, a little bit of theory.  The most basic web server setup (for example), looks something like figure 001, below :

Figure 001

As you can see, this is a functional setup, but it does have (at least) two major drawbacks :

  • A critical failure on the web server means the service (i.e. the content being served) disappears along with it.
  • If the web server becomes overloaded, you may be forced to take the entire machine down to upgrade it (or just let your adoring public deal with a slow, unresponsive website, i suppose).

The solution to both of these problems forms the topic of this blog entry : load balancing.  The idea is straightforward enough : by adding more than one web server, we can ensure that our service continues to be available even when a machine fails, and we can also spread the love, er, load, across multiple machines, thus increasing our overall efficiency.  Nice !

batman and round robin

Now, there are a couple of ways to go about this, one of which is called « Round Robin DNS » (or RRDNS), which is both very simple and moderately useful.  DNS, for those needing a refresher, is (in a nutshell) the way that human-readable hostnames get translated into machine-readable numbers.  Generally speaking, hostnames are tied to IP addresses in a one-to-one or many-to-one fashion, such that when you type in a hostname, you get a single number back.  For example :

$ host www.dark.ca
www.dark.ca has address 88.191.66.127

In other words, when you type http://www.dark.ca into your browser, you get one particular machine on the Internet (as indicated by the address); however, it is also possible to set up a one-to-many relationship – this is the basis of RRDNS.  A very common example is Google :

$ host www.google.com
www.google.com is an alias for www.l.google.com.
www.l.google.com has address 74.125.39.99
www.l.google.com has address 74.125.39.103
www.l.google.com has address 74.125.39.104
www.l.google.com has address 74.125.39.105
www.l.google.com has address 74.125.39.106
www.l.google.com has address 74.125.39.147

So what’s going on here ?  In essence, the Google administrators have created a situation whereby typing http://www.google.com into your browser will get you one of a whole group of possibilities.  In this way, each time you request some content from them, one of any number of machines will be responsible for delivering that service.  (Now, to be fair, the reality of what’s going on at Google is likely far more complex, but the premise is identical.)  Your web browser will only use one answer, which is more or less randomly provided by the DNS server, and that response is the machine you’ll interact with.  As you can see, this (sort of) satisfies our problem of resource usage, and it (sort of) addresses the problem of resource failure.  For those of you who are more visually inclined, please see figure 002 below :

Figure 002

It’s not perfect, but it is workable, and most of all, it’s dead simple to set up – you just need to set your DNS configuration up and you’re good to go (an exercise i leave to you, fair reader, as RRDNS is not really the focus of our discussion today).  Thus, while RRDNS is a simple method for implementing a rudimentary load balancing infrastructure, it still has notable failings :

  • The load balancing isn’t systematic at all – by pure chance, one machine could end up getting hammered while others do very little, for example.
  • If a machine fails, there’s a chance that the DNS response will contain the address of the downed machine.  In other words, the chances of you getting the downed machine are 1 in X, where X is the number of possible responses to the DNS query.  The odds get better (or worse, depending on how you look at it) as more machines fail.
  • A slightly more obscure problem is that of response caching : as a method of optimisation, many DNS systems, as well as software that interacts with DNS, will cache (hold on to) hostname lookups for variable lengths of time.  This can invalidate the magic of RRDNS altogether…
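
To put a number on the second point, here is a small simulation of the « 1 in X » problem.  It’s only a sketch : no real DNS is queried, the three addresses are made up for the example, and the rotation is far more orderly than a real DNS server would be :

```shell
#!/bin/sh
# Illustrative simulation only: no real DNS is queried here.
# Three hypothetical web servers; pretend the second one has failed.
servers="192.0.2.10 192.0.2.11 192.0.2.12"
dead="192.0.2.11"

# Hand out addresses in strict rotation, like an idealised RRDNS server,
# and count how many of the answers point at the dead machine.
lookups=9
failed=0
i=0
while [ "$i" -lt "$lookups" ]; do
    # pick entry number (i mod 3) + 1 from the list
    n=$(( i % 3 + 1 ))
    answer=$(echo "$servers" | cut -d' ' -f"$n")
    [ "$answer" = "$dead" ] && failed=$(( failed + 1 ))
    i=$(( i + 1 ))
done
echo "$failed of $lookups lookups went to the downed machine"
```

With one dead machine out of three, a third of the lookups are handed a useless address, and nothing in DNS itself will notice or correct that.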

another attack vector

Another approach to the problem, and the one we’ll be exploring in great depth in this article, is using a dedicated load balancing infrastructure, combining a handful of great open source tools and proven methodologies.  First, however, some more theory.

Our new approach to load balancing must propose both a solution to the original problems (critical failure & resource usage), as well as address and solve the drawbacks of RRDNS as noted above.  Really, what we want is an intelligent (or, at least, systematic) distribution of load across multiple machines, and a way to ensure that requests don’t get sent to downed machines by accident.  It’d be nice if these functions were automated too, since the last thing an administrator wants to do is baby-sit racks of servers.  What we’d like, in other words, could be represented by replacing the phrase « RRDNS » in figure 002 above, with the word « magic ».  For now, let’s imagine that this magic sits on a machine that we’ll call « Load Balancer » (or LB, for short), and that this LB machine would have a similar conceptual relationship to the web servers as RRDNS does.  Consider figure 003 :

Figure 003

This is a basic way of thinking about what’s going to happen.  It looks a lot like figure 002, but there is a very important difference : instead of relying on the somewhat nebulous concept of DNS for our load balancing, we can now give that responsibility to a proper machine dedicated to the purpose.  As you can imagine, this is already a huge improvement, since this opens the door to all sorts of additional features and possibilities that simply aren’t possible with straight DNS.  Another interesting aspect of this diagram is that, visually speaking, it would appear that the Internet cloud only « sees » one machine (the load balancer), even though there are a number of web servers behind it.  This concept of having a single point of entry lies at the very core of our strategy – both figuratively and literally – as we’ll soon discover.

In the here and now, however, we’re still dealing with theory, and a solution based on « magic » is about as theoretical as it gets. Luckily for us though, magic is exactly what we’re about to unleash – in the form of « Linux Virtual Server », or « LVS » for short.  From their homepage :

The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the server cluster is fully transparent to end users, and the users interact as if it were a single high-performance virtual server. […] The Linux Virtual Server as an advanced load balancing solution can be used to build highly scalable and highly available network services, such as scalable web, cache, mail, ftp, media and VoIP services.

The thing about LVS is that while it’s not inherently complex, it is highly malleable, and this means you really do need to have a solid handle on exactly what you want to do, and how you want to do it, before you start playing around.  Put another way, there are a myriad of ways to use LVS, but you’ll only use one of them at a time, and picking the right methodology is important.  The best way to do this is by building maps and really getting a solid feel for how the various components of the overall architecture relate to each other.  Once you’ve got a good mental idea of what things should look like, actually configuring LVS is about as straightforward as it gets (no, really!).

let’s complicate the issue further, for science !

Looking back to figure 003, we can see that our map includes the Internet, the Load Balancer, and some Web Servers.  This is a pretty typical sort of setup, and thus, we can approach it from a few different ways.  One of the decisions that needs to be made fairly early on, though, has more to do with topology and routing than LVS specifically : how, exactly, do the objects on the map relate to each other at a network level ?  As always, there can be lots of answers to this question – each with their advantages and disadvantages – but ultimately we must pick only one.  Since i value simplicity when it comes to technology, figure 004 describes a simple network topology :

figure 004

Now, for those of you out there who may have some experience with LVS, you can see exactly where this is headed – for everybody else, this might not be what you were expecting at all.  Let’s take a look at some of the more obvious points :

  • There are two load balancers.
  • The web servers are on the same network segment as the LBs.
  • Unlike the previous diagrams, the LBs do not appear to be « in between » the Internet and the web servers.

The first point is easy : there are two LBs for reasons of redundancy, as a single LB represents a single point of failure.  In other words, if the LB stops working for whatever reason, all of your services behind it become functionally unavailable ; thus, you really, really want to have another machine ready to go immediately following a failure.

A little bit more explanation is required to explain the second and third points – but the short answer is two words : « Direct Routing » (or DR for short).  From the LVS wiki :

Direct Routing [is] an IP load balancing technology implemented in LVS. It directly routes packets to backend server through rewriting MAC address of data frame with the MAC address of the selected backend server. It has the best scalability among all other methods because the overhead of rewriting MAC address is pretty low, but it requires that the load balancer and the backend servers (real servers) are in a physical network.

If that sounds heavy, don’t worry – figure 005 explains it in easy visual form :

figure 005

In a nutshell, requests get sent to the LB, which then passes them to the Web Server, which in turn responds directly to the client.  It’s fast, efficient, scalable, and easy to set up, with the only caveat being that the LBs and the machines they’re balancing must be on the same network.  As long as you’re willing to accept that restriction, Direct Routing is an excellent choice – and it’s the one we’ll be exploring further today.

a little less conversation, a little more action

So with that in mind, let’s get started.  I’m going to be describing four machines in the following scenario.  All four are identical off-the-shelf servers running CentOS 5.2 – nothing fancy here.  The naming and numbering conventions are simple as well :

[TABLE=2]

You probably noticed the fifth item in this list, labelled « Virtual Web Server ».  This represents our virtual, or clustered service, and is not a real machine.  This will be explained in further detail later on – for now, let’s go ahead and install the key software on both of the Load Balancer machines :

[root@A01 etc]# yum install ipvsadm piranha httpd

« ipvsadm » is, as you might have guessed, the administrative tool for « IPVS », which is in turn an acronym for « Internet Protocol Virtual Server », which makes more sense when you say « IP-based Virtual Server » instead.  As the name implies, IPVS works with IP addresses – strictly speaking it balances at the transport level (Layer-4 of the OSI model), since it deals in address and port pairs – and is used to spread incoming connections to one IP address towards other IP addresses according to one of many pre-defined methods.  It’s the tool that allows us to control our new load balancing infrastructure, and is the key software component around which this entire exercise revolves.  It is powerful, but sort of a pain to use, which brings us to the second item in the list : piranha.

Piranha is a web-based tool (hence httpd, above) for administering LVS, and is effectively a front-end for ipvsadm.  As installed in CentOS, however, the Piranha package contains not only the PHP pages that make up the interface, but also a handful of other tools of particular interest and usefulness that we’ll take a look at as well.  For now, let’s continue with some basic setup and configuration.

A quick word of warning : before starting « piranha-gui » (one of the services supplied by Piranha) up for the first time, it’s important that both LBs have the same time set on them.  You’ve probably already got NTP installed and functioning, but if not, here’s a hint :

[root@A01 ~]# yum -y install ntp && ntpdate pool.ntp.org && chkconfig ntpd on && service ntpd start

Moving right along, the next step is to define a login for the Piranha web interface :

[root@A01 ~]# /usr/sbin/piranha-passwd

You can define multiple logins if you like, but for now, one is certainly enough.  Now, unless you plan to run your load balanced infrastructure on a completely internal network, you’ll probably want to set up some basic restrictions on who can access the interface.  Since the interface is served via an instance of Apache HTTPd, all we have to do is set up a normal « .htaccess » file.  Now, a full breakdown of .htaccess (and, in particular, mod_access) is outside of the scope of this document, but the simple gist is as follows :

[root@A01 ~]# cat /etc/sysconfig/ha/web/secure/.htaccess
Order deny,allow
# by default, deny from everybody
Deny from all
# requests from this network are allowed
Allow from 192.168.0

With those items out of the way, we can now activate piranha-gui :

[root@A01 ~]# chkconfig piranha-gui on && service piranha-gui start

Congratulations !  The interface is now running on port 3636, and can be accessed via your browser of choice – in the case of our example, it’d be « http://A01:3636/ ».  The username for the web login is « piranha », and the password is the one we set above.  Now that we’re logged in, let’s take a look at the interface in greater depth.

look out – piranhas !

The first screen – known as the « Control » page – is a summary of the current state of affairs.  Since nothing is configured or even active, there isn’t a lot to see right now.  Moving on to the « Global Settings » tab, we have our first opportunity to start putting some settings into place :

  • Primary server public IP : Put the IP address of the « primary » LB.  In this example, we’ll put the IP of A01.
  • Primary server private IP : If we weren’t using direct routing, this field would need a value.  In our example, therefore, it should be empty.
  • Use network type : Direct Routing (of course!)

On to the « Redundancy » tab :

  • Redundant server public IP : Put the IP address of the « secondary » LB.  In this example, we’ll put the IP of A02.
  • Syncdaemon : Optional and useful – but know that it requires additional configuration in order to make it work.
    • This feature (which is relatively new to LVS) ensures that the state information (i.e. connections, etc.) is shared with the secondary, so that it is already there in the event that a failover occurs.  For more information, please see this page from the LVS Howto.
    • It is not necessary, strictly speaking, so we can just leave it unchecked for now.

Under the « Virtual Servers » tab, let’s go ahead and click « Add », then select the new unconfigured entry and hit « Edit » :

  • Name : This is an arbitrary identifier for a given clustered service.  For our example, we’d put « WWW ».
  • Application Port : The port number for the service – HTTP runs on port 80, for example.
  • Protocol : TCP or UDP – this is normally TCP.
  • Virtual IP Address : This is the IP address of the virtual service (VIP), which you may recall from the table above.  This is the IP address that clients will send requests to, regardless of the real IP addresses (RIP) of the real servers which are responsible for the service.  In our example, we’d put 192.168.0.40 .
    • Each service that you wish to cluster needs a unique « address : port » pairing.  For example, 192.168.0.40:80 could be a web service, and 192.168.0.40:25 would likely be a mail service, but if you wanted to run another, separate web service, you’d need to assign a different virtual IP.
  • Virtual IP Network Mask : Normally this is 255.255.255.255, indicating a single IP address (the Virtual IP Address above).
    • You can actually cluster subnets, but this is outside of the scope of this tutorial.
  • Device : The Virtual IP address needs to be assigned to a « virtual network interface », which can be named more or less anything, but generally follows the format « ethN:X », where N is the physical device, and X is an incremental numeric identifier.  For example, if your physical interface is « eth0 », and this is the first virtual interface, then it would be named « eth0:1 ».
    • If and when you set up multiple virtual interfaces, it is important to not mix these up.  Piranha has no facility for sanity checking these identifiers, so you may wish to track them yourself in a Google document or something.
  • Scheduling : There are a number of options here, and some are very different from one another.  For the purposes of this exercise, we’ll pick a simple, yet relatively effective scheduler called « Least-Connections ».
    • This does exactly what it sounds like : when a new request is made to the virtual service, the LB will check to see how many connections are open to each of the real servers in the cluster, and then route the connection to the machine with the least connections.  Congrats, you’ve now got load balancing !
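
To make the idea concrete, here is a tiny sketch of the decision that a least-connections scheduler has to make.  It only mimics the concept – it is not how LVS implements it internally – and the connection counts are invented for the example :

```shell
#!/bin/sh
# Illustrative only: this mimics the idea behind the scheduler,
# it is not how LVS implements it internally.
# Input: lines of "<server> <open connections>"; output: the least busy server.
least_connections() {
    sort -k2,2n | head -n 1 | cut -d' ' -f1
}

# Hypothetical connection counts for the two real servers.
pick=$(printf 'B01 12\nB02 7\n' | least_connections)
echo "next request goes to $pick"
```

Here B02 is holding fewer connections than B01, so it is the one that gets the next request.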

Finally, let’s add some real servers into the cluster.  From the « Edit » screen we’re already in, click on the « Real Server » sub-tab.

  • Name : This is the hostname of the real server.  In our example, we’d put B01.
  • Address : The IP address of the real server.  In our example, for B01, we’d put 192.168.0.38 .
  • Port : Generally speaking this can be left empty, as it will inherit the Port value defined in the virtual service (in this case, 80).
    • A value would be required here if your real server is running the service on a different port than that specified in the virtual service ; if your real server is running a web service on port 8080 instead of 80, for example.
  • Weight : Despite the name, this value is used in various different ways depending on which Scheduler you selected for the virtual service.  In our example, however, this field is irrelevant, and can be left empty.

You can apply and add as many real servers as you like, one at a time, in this fashion.  Go ahead and set up B02 (or whatever your equivalent is) now.
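
For the curious, the settings entered so far correspond roughly to the ipvsadm commands below.  This is a sketch of the equivalent, not a record of what Piranha literally runs, and to keep it harmless the « run » wrapper (a hypothetical helper) only prints each command instead of executing it :

```shell
#!/bin/sh
# Dry-run wrapper: print the command rather than executing it.
run() { echo "$@"; }

VIP=192.168.0.40   # the virtual IP from the example
PORT=80

# -A adds a virtual service, -t marks it as TCP, -s picks the scheduler
# ("lc" is least-connections).
run ipvsadm -A -t "$VIP:$PORT" -s lc

# -a adds a real server to that service, -r gives its address,
# -g selects direct routing ("gatewaying").
for rip in 192.168.0.38 192.168.0.39; do
    run ipvsadm -a -t "$VIP:$PORT" -r "$rip:$PORT" -g
done
```

Seeing it spelled out this way also makes it fairly obvious why a front-end like Piranha is a nice thing to have once the number of services starts to grow.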

If you’re wondering when the secondary LB is going to be configured, well, wonder no longer : the future is now.  Luckily, this step is very, very easy.  From the secondary :

[root@A02 ~]# scp root@A01:/etc/sysconfig/ha/lvs.cf /etc/sysconfig/ha/

now is a good time to grab a beer

Phew !  That was a lot of work.  After consuming a suitable refreshment, let’s move on to the final few steps.  Earlier i mentioned that there were some other items that we’d need to learn about besides the Piranha interface – « Pulse » is one such item.  Pulse, as a tool, is in the same family as some other tools you may have heard of, such as « Heartbeat », « Keepalived », or « OpenAIS ».

The basic idea of all of these tools is simple : to provide a « failover » facility between a group of two or more machines.  In our example, our primary LB is the one that is normally active, but in the case that it fails for some reason, we’d like our secondary to click in and take over the responsibilities of the unavailable primary – this is what Pulse does.  Each of the load balancers runs an instance of « pulse » (the executable, not the package), which behaves in this fashion :

  • Each LB sends out a broadcast packet (a pulse, as it were) stating that they are alive.  As long as the active LB (commonly the primary) continues to announce itself, everybody is happy and nothing changes.
  • If, however, the inactive LB (commonly the secondary) server notices that it hasn’t seen any pulses from the active LB lately, it assumes that the active LB has failed.
  • The secondary, formerly inactive LB, then becomes active.  This state is maintained until such a time as the primary starts announcing itself again, at which point the secondary demotes itself back to inactivity.

The difference between the active and the inactive server is actually very simple : the active server is the one with the virtual addresses assigned to it (remember those, from the Virtual Servers tab in Piranha?).
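
The decision that the inactive LB has to make can be sketched in a few lines.  To be clear, this is not pulse’s actual code : the function name is hypothetical, and the « dead time » of 18 seconds is simply an assumed value for the sake of the example :

```shell
#!/bin/sh
# Illustrative only: the function and variable names are hypothetical and
# this is not pulse's actual code, just the decision it has to make.
# $1 = time the last pulse was seen, $2 = current time, $3 = dead time (all in seconds)
peer_is_dead() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

deadtime=18          # assumed value for the example
last_pulse=100

if peer_is_dead "$last_pulse" 110 "$deadtime"; then
    echo "t=110: take over the VIP"
else
    echo "t=110: stay inactive"
fi

if peer_is_dead "$last_pulse" 130 "$deadtime"; then
    echo "t=130: take over the VIP"
else
    echo "t=130: stay inactive"
fi
```

Ten seconds of silence is tolerated, thirty is not – at which point the secondary claims the virtual addresses for itself.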

Let’s go ahead and start it up (on the primary LB first, then on the secondary) :

[root@A01 ~]# chkconfig pulse on
[root@A01 ~]# service pulse start

an internet ballgame drama – in 5 parts

You may have noticed that we haven’t even touched the « real » servers (i.e. the web servers) yet.  Now is the time.  As it so happens, there’s only one major step that relates to the real servers, but it’s a very, very important one : defining VIPs, and then ensuring that the web servers are OK with the routing voodoo that we’re using to make this whole load balancing infrastructure work.  The solution is simple, but the reason for the solution may not be immediately obvious – for that, we need to take a look at the IP layer of each packet (neat!).  First, though, let’s run through a series of little stories :

  • Alice has a ball that she’d like to pass to Bob, so she tosses it his way.
  • Bob catches the ball, sees that it’s from Alice, and throws it back at her.  What great fun !

Now imagine that Alice and Bob are hanging out with a few hundred million of their closest friends – but they still want to play ball.

  • Alice writes Bob’s name on the ball and hands it to a neighbour, who then passes it to somebody else, and so forth.
  • Eventually the ball gets passed to Bob.  Unfortunately for Bob, he has no idea where it came from, so he can’t send it back.

The solution is obvious :

  • Alice writes « From : Alice, To : Bob » on the ball, then passes it along.
  • Bob gets the ball, and switches the names around so that it says « From : Bob, To : Alice », and sends it back.

OK, so, those were some nice stories, but how do they apply to our Load Balancing setup ?  As it turns out, all we need to do is throw in some tubes, and we’ve described one of the basic functions of the Internet Protocol – that the source and destination IP addresses of a given packet are part of the IP layer of said packet.  Let’s complicate it by one more level :

  • Alice prepares the ball as above, and sends it flying.
  • Bob, who’s been avoiding Alice since things got weird at the bar last week-end, gets the ball and passes it along to Charles.
  • Charles – who’s had a not-so-secret crush on Alice since high school – happily writes « From : Charles, To : Alice », and tosses it away.
  • Alice receives the ball, but much to her surprise, it’s from Charles, and not Bob as she expected. Awkward !

With that last story in mind, let’s take another look at figure 005 above (go ahead, i’ll wait).  Notice anything ?  That’s right – the original source sends their packet off, but then receives a response from a different machine than they expected.  This does not work – it violates some basic rules about how communications are supposed to function on the Internet.   For the thrilling conclusion – and a solution to the problem – let’s return to our drama :

  • As it turns out, Bob is a player : he gets so many balls from so many women that he needs to keep track of them all in a little notebook.
  • When Bob gets Alice’s ball he passes it to Charles, then records where it came from and who he gave it to in his notebook.
  • Charles – in an attempt to get into Bob’s circle of friends – agrees to write « From : Bob, To : Alice » on the ball, then sends it back.
  • Alice – expecting a ball from Bob – is happy to receive her Bob-signed spheroid.
  • Bob then gets another ball from Denise, passes it to Edward, and records this relationship as well.
  • Edward – a sycophant if ever there was – prepares the ball in the same fashion as Charles, and fires it back.

Of course, the more balls Bob has to deal with, the more helpers he can use to spread the work around.  Now, as you’ve no doubt pieced together, Alice and Denise are any given sources on the Internet, Bob is our LB, and Charles & Edward are the web servers.  Now, instead of writing people’s names on balls, we should now make the mental leap to IP addresses in packets.  With our tables of hostnames and addresses in mind, let’s consider the following example :

  • The source sends a request for a web page.
    • The source IP is « 10.1.2.3 », and the destination IP is « 192.168.0.40 » (the VIP for WWW).
  • The packet is sent to A01, which is currently active, and thus has the VIP for WWW assigned to it.
  • A01 then forwards the packet to B02 (by chance), which crafts a response packet.
    • The RIP for B02 is « 192.168.0.39 », but instead of using that, the source IP is set to « 192.168.0.40 », and the destination is « 10.1.2.3 ».
  • The source, expecting a response from « .40 », indeed receives a packet that appears to be from WWW.  Done and done.
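
The same exchange can be sketched as a toy model, with each packet reduced to a « source destination » pair.  This is purely illustrative – no real packets are involved, and the « reply » function is a made-up name :

```shell
#!/bin/sh
# Illustrative only: packets are modelled as "src dst" strings.
CLIENT=10.1.2.3
VIP=192.168.0.40

# The client addresses its request to the VIP.
request="$CLIENT $VIP"

# The load balancer hands the request to a real server without touching
# the IP addresses (in direct routing only the MAC address changes).
forwarded="$request"

# The real server answers by swapping source and destination, so the
# reply carries the VIP as its source rather than the server's own RIP.
reply() {
    set -- $1
    echo "$2 $1"
}

echo "request: $request"
echo "reply:   $(reply "$forwarded")"
```

Note that the RIP of the web server never appears anywhere in the exchange, which is exactly why the client is none the wiser.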

The theory is sound, but how can we implement this in practice ?  As i said – it’s simple !  We simply add a dummy interface to each of the web servers that has the same address as the VIP, which will allow the web servers to interact with packets properly.  This is best done by creating a simple sysconfig entry on each of the web servers for the required dummy interface, as follows :

[root@B01 ~]# vim /etc/sysconfig/network-scripts/ifcfg-lo:0
# for VIP
DEVICE=lo:0
IPADDR=192.168.0.40
NETMASK=255.255.255.255
BROADCAST=192.168.0.40
ONBOOT=yes
NAME=vip0
[root@B02 ~]# vim /etc/sysconfig/network-scripts/ifcfg-lo:0
# for VIP
DEVICE=lo:0
IPADDR=192.168.0.40
NETMASK=255.255.255.255
BROADCAST=192.168.0.40
ONBOOT=yes
NAME=vip0

The « lo » indicates that it’s a « Loopback address », which is best described by Wikipedia :

Such an interface is assigned an address that can be accessed from management equipment over a network but is not assigned to any of the real interfaces on the device. This loopback address is also used for management datagrams, such as alarms, originating from the equipment. The property that makes this virtual interface special is that applications that use it will send or receive traffic using the address assigned to the virtual interface as opposed to the address on the physical interface through which the traffic passes.

In other words, it’s a fake IP that the machine can use to make packets anyways.  Now, there is a known scenario in which a machine with a given loopback address will, in this particular situation, cause confusion on the network about which interface actually « owns » a given address.  It has to do with ARP, and interested readers are encouraged to Google for « LVS ARP problem » for more technical details – for now, let’s just get right to the solution.  On each of the real servers, we’ll need to edit « sysctl.conf » :

[root@B01 ~]# vim /etc/sysctl.conf
# this file already has stuff in it, so put this at the bottom
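net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
# ARP requests arrive on the physical interface, so set "all" as well
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2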

Now, load the new settings :

[root@B01 ~]# sysctl -p

That’s it – problem solved.
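
If you’re setting up more than a couple of real servers, it’s easy to miss one.  Here is a minimal sketch of a check for the two loopback ARP keys ; the « check_arp_settings » function is a hypothetical helper, and it is run here against a scratch file rather than the real sysctl.conf :

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a sysctl.conf-style file sets both
# ARP keys for the loopback interface to the values used above.
# $1 = path of the file to check
check_arp_settings() {
    grep -Eq '^net\.ipv4\.conf\.lo\.arp_ignore *= *1$' "$1" &&
    grep -Eq '^net\.ipv4\.conf\.lo\.arp_announce *= *2$' "$1"
}

# Try it on a scratch copy rather than the real /etc/sysctl.conf.
scratch=$(mktemp)
printf 'net.ipv4.conf.lo.arp_ignore = 1\nnet.ipv4.conf.lo.arp_announce = 2\n' > "$scratch"
if check_arp_settings "$scratch"; then
    echo "ARP settings present"
else
    echo "ARP settings missing"
fi
```

On a live machine you can also ask the kernel directly, for example with « sysctl net.ipv4.conf.lo.arp_ignore », which reports the value currently in effect.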

all together now !

At this point we’ve now explored each key item that is necessary to make this whole front-end infrastructure work, but it is perhaps not quite clear how it all works together.  So, let’s take a step back for a moment and review :

  • There are four servers : two are load balancers, and two are web servers.
    • Of the two load balancers, only one is active at any given time ; the other is a backup.
  • Every DNS entry for the sites on the web servers points to one actual IP address.
    • This IP address is called the « Virtual IP ».
    • The VIP is claimed by the active load balancer, meaning that when a request is made for a given website, it goes to the active LB.
  • The LB then re-directs the request to an actual web server.
    • The re-direction can be random, or based on varying levels of logical decision making.
    • The web server will respond directly – the LB is not a proxy.

Great !  Now, what software runs where, and why ?

  • The load balancers use LVS in order to manage the relationship between VIPs and RIPs.
  • Pulse is used between the LBs in order to determine who is alive, and which one is active.
  • An optional (but useful) web interface to both LVS and Pulse comes in the form of Piranha, which runs on a dedicated instance of Apache HTTPd on port 3636.

And that, my friends, is that !  If you have any questions, feel free to comment below (remember to subscribe to the RSS feed for responses).  Happy balancing !

oh, p.s., one last thing…

In case you’re wondering how to keep your LVS configuration file synchronised across both of the load balancers, one way to do it would be with a network-aware filesystem – POHMELFS, for example. 😉

where to specify ethtool options in Fedora

Hi everybody – here’s a super-quick update for you concerning « ethtool », and how to use it to set options in Fedora properly.  Ethtool is a great little tool that can be used to configure all manner of network interface related settings – notably the speed and duplex of a card – on the fly and in real time.  One of the most common situations where ethtool would be used is at boot time, especially for cards which are finicky, or have buggy drivers, or poor software support, or.. well, you get the idea.

Times were that if you needed to use ethtool to configure a NIC setting at boot time, you’d just stick the given command line into « rc.local », or perhaps another runlevel script, and forget about it.  The problem with this approach is (at least) twofold :

  • Frankly, it’s easy to forget about something like this, which makes future support / debugging of network issues more of a pain.
  • Anything that automatically modifies the runlevel script (such as updates to the parent package) may destroy your local edits.

In order to deal with these issues, and to standardise the implementation of the ethtool-at-boot technique, the Red Hat (and, thus, Fedora) maintainers introduced an option for defining ethtool parameters on a per-interface basis via the standard « sysconfig » directory system.  Now, this actually happened a number of years ago, but the implementation was poorly announced (and poorly documented at the time), and thus, even today a lot of users and administrators don’t seem to know about it.

Now, there’s a very good chance that you already know this, but just to refresh your memory : in the sysconfig directory, there is another directory called « network-scripts », which in turn contains a series of files named « ifcfg-eth? », where « ? » is a device number.  Each network device has a configuration file associated with it ; for example, ifcfg-eth1 is the configuration file for the « eth1 » device.

In order to specify the ethtool options for a given network interface, simply edit the associated configuration file, and add an « ETHTOOL_OPTS » line.  For example :

ETHTOOL_OPTS="autoneg off speed 100 duplex full"

Now, whenever the network service initialises that interface, ethtool will be run with the specified options.  Simple, easy, and best of all, standardised.  What could be better ?
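
If you’re curious what that boils down to, here’s a rough sketch that reads an ifcfg file and prints the equivalent command line.  It assumes the network scripts hand the options to « ethtool -s » for the device in question (that’s my understanding of how it works, but check your own initscripts), and the function name « show_ethtool_cmd » is invented for the example.  Nothing is actually changed on the card.

```shell
# show_ethtool_cmd FILE DEVICE
# (hypothetical helper) print the ethtool command implied by FILE's ETHTOOL_OPTS line
show_ethtool_cmd() {
    cfg="$1"; dev="$2"
    # a subshell keeps the variables from the ifcfg file out of our own environment
    (
        ETHTOOL_OPTS=""
        . "$cfg"
        # assumption : the options are passed to "ethtool -s DEVICE"
        [ -n "$ETHTOOL_OPTS" ] && echo "ethtool -s $dev $ETHTOOL_OPTS"
    )
}

# usage (give the full path, since "." would otherwise search $PATH) :
#   show_ethtool_cmd /etc/sysconfig/network-scripts/ifcfg-eth1 eth1
```

Running the printed command by hand is a quick way to confirm that the card accepts the settings before trusting them to a reboot.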

(complex) partitioning in kickstart

UPDATE: This article was written back in 2009. According to a commenter below, Busybox has been replaced by Bash in RHEL 6; perhaps Fedora as well?

Bonjour my geeky friends ! 🙂  As you are likely aware, it is now summer-time here in the northern hemisphere, and thus, i’ve been spending as much time away from the computer as possible.  That said, it’s been a long time, i shouldn’t have left you, without a strong beat to step to.

Now, if you’re not familiar with kickstarting, it’s basically just a way to automate the installation of an operating environment on a machine – think hands-free installation.  Anaconda is the OS installation tool used in Fedora, Red Hat, and some other Linux OS’s, and it can be used in a kickstart capacity.  For those of you looking for an intro, i strongly suggest reading over the excellent documentation at the Fedora project website.  The kickstart configuration process could very easily be a couple of blog entries on its own (which i’ll no doubt get around to in the future), but for now i want to touch on one particular aspect of it : complex partition schemes.

how it is

The current method for declaring partitions is relatively powerful, in that all manner of basic partitions, LVM components, and even RAID devices can be specified – but where it fails is in the creating of the actual partitions on the disk itself.  The options that can be supplied to the partition keywords can make this clunky at best (and impossible at worst).

A basic example of a partitioning scheme that requires nothing outside of the available functions :

DEVICE                 MOUNTPOINT               SIZE
/dev/sda               (total)                  500,000 MB
/dev/sda1              /boot/                       128 MB
/dev/sda2              /                         20,000 MB
/dev/sda3              /var/log/                 20,000 MB
/dev/sda5              /home/                   400,000 MB
/dev/sda6              /opt/                     51,680 MB
/dev/sda7              swap                       8,192 MB

Great, no problem – we can easily define that in the kickstart :

part  /boot     --asprimary  --size=128
part  /         --asprimary  --size=20000
part  /var/log  --asprimary  --size=20000
part  /home                  --size=400000
part  /opt                   --size=51680
part  swap                   --size=8192

But what happens if we want to use this same kickstart on another machine (or, indeed, many other machines) that don’t have the same disk size ?  One of the options that can be used with the « part » keyword is « --grow », which tells Anaconda to create as large a partition as possible.  This can be used along with « --maxsize= », which does exactly what you think it does.

Continuing with the example, we can modify the « /home » partition to be of a variable size, which should do us nicely on disks which may be smaller or larger than our original 500GB unit.

part  /home  --size=1024  --grow

Here we’ve stated that we’d like the partition to be at least a gig, but that it should otherwise be as large as possible given the constraints of both the other partitions, as well as the total space available on the device.  But what if you also want « /opt » to be variable in size ?  One way would be to grow both of them :

part  /home  --size=1024  --grow
part  /opt   --size=1024  --grow

Now, what do you think that will do ? If you guessed « grow both of them to half the total available size each », you’d be correct.  Maybe this is what you wanted – but then again, maybe it wasn’t.  Of course, we could always specify a maximum ceiling on how far /opt will grow :

part  /opt  --size=1024  --maxsize=200000  --grow

That works, but only at the potential expense of /home.  Consider what would happen if this was run against a 250GB disk ; the other (static) partitions would eat up some 48GB, /opt would grow to the maximum specified size of 200GB, and /home would be left with the remaining 2GB of available space.

If we were to add more partitions into the mix, the whole thing would become an imprecise mess rather quickly.  Furthermore, we haven’t even begun to look at scenarios where there may (or may not) be more than one disk, nor any fun tricks like automatically setting the swap size to be the same as the actual amount of RAM (for example).  For these sorts of things we need a different approach.

the magic of pre, the power of parted

The kickstart configuration contains a section called « %pre », which should be familiar to anybody who’s dealt with RPM packaging.  Basically, the pre section contains text which will be parsed by the shell during the installation process – in other words, you can write a shell script here.  Fairly be thee warned, however, as the shell spawned by Anaconda is « BusyBox », not « bash », and it lacks some of the functionality that you might expect.  We can use the %pre section to our advantage in many ways – including partitioning.  Instead of using the built-in functions to set up the partitions, we can do it ourselves (in a manner of speaking) using « parted ».

Parted is, as you might expect, a tool for editing partition data.  Generally speaking it’s an interactive tool, but one of the nifty features is the « scripted mode », wherein partitioning commands can be passed to Parted on the command-line and executed immediately without further intervention.  This is very handy in any sort of automated scenario, including during a kickstart.

We can use Parted to lay the groundwork for the basic example above, wherein /home is dynamically sized.  Initially this will appear inefficient, since we won’t be doing anything that can’t be accomplished by using the existing Kickstart functionality, but it provides an excellent base from which to do more interesting things.  What follows (until otherwise noted) are text blocks that can be inserted directly into the %pre section of the kickstart config :

# clear the MBR and partition table
dd if=/dev/zero of=/dev/sda bs=512 count=1
parted -s /dev/sda mklabel msdos

This ensures that the disk is clean, so that we don’t run into any existing partition data that might cause trouble.  The « dd » command overwrites the first bit of the disk, so that any basic partition information is destroyed, then Parted is used to create a new disk label.

TOTAL=`parted -s /dev/sda unit mb print free | grep Free | awk '{print $3}' | cut -d "M" -f1`

That little line gives us the total size of the disk, and assigns it to a variable named « TOTAL ».  There are other ways to obtain this value, but in keeping with the spirit of using Parted to solve our problems, this works.  In this instance, « grep » picks out the free-space line, then « awk » and « cut » are used to extract the string we’re interested in.  Continuing on…
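
To see what each stage of that pipeline is doing without touching a real disk, here’s a small sketch that feeds it a sample line instead.  The function name « total_mb » and the sample line itself are made up for illustration ; real Parted output will differ from disk to disk.

```shell
# total_mb : read "parted ... unit mb print free" output on stdin, print the size in MB
total_mb() {
    # grep keeps the free-space line, awk takes the third column (the size),
    # and cut throws away everything from the "M" of "MB" onwards
    grep Free | awk '{print $3}' | cut -d "M" -f1
}

# an invented sample of the free-space line for a freshly wiped disk
printf '        0.03MB  500108MB  500108MB  Free Space\n' | total_mb
```

Bear in mind that this only yields a single number when there is a single free region, which is the case right after the label has been re-created.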

# calculate start points
let SWAP_START=$TOTAL-8192
let OPT_START=$SWAP_START-51680

Here we determine the starting position for the swap and /opt partitions.  Since we know the total size, we can subtract 8GB from it, and that gives us where the swap partition starts.  Likewise, we can calculate the starting position of /opt based on the start point of swap (and so forth, were there other partitions to calculate).
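
For trying the numbers out on a workstation, the same sums can be wrapped in a function.  This is only a sketch : the name « start_points » is invented, and it uses the POSIX « $(( )) » arithmetic form rather than « let ».

```shell
# start_points TOTAL : print SWAP_START and OPT_START (in MB) for a disk of TOTAL MB
start_points() {
    total="$1"
    swap_start=$(( total - 8192 ))        # swap takes the last 8GB of the disk
    opt_start=$(( swap_start - 51680 ))   # /opt sits directly in front of swap
    echo "SWAP_START=$swap_start OPT_START=$opt_start"
}

start_points 500000   # prints : SWAP_START=491808 OPT_START=440128
```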

# partitions IN ORDER
parted -s /dev/sda mkpart primary ext3 0 128
parted -s /dev/sda mkpart primary ext3 128 20128
parted -s /dev/sda mkpart primary ext3 20128 40256
parted -s /dev/sda mkpart extended 40256 $TOTAL
parted -s /dev/sda mkpart logical ext3 40256 $OPT_START
parted -s /dev/sda mkpart logical ext3 $OPT_START $SWAP_START
parted -s /dev/sda mkpart logical $SWAP_START $TOTAL

The variables we populated above are used here in order to create the partitions on the disk.  The syntax is very simple :

  • « parted -s »  : run Parted in scripted (non-interactive) mode.
  • « /dev/sda » : the device (later, we’ll see how to determine this dynamically).
  • « mkpart » : the action to take (make partition).
  • « primary | extended | logical » : the type of partition.
  • « ext3 » : the type of filesystem (there are a number of possible options, but ext3 is pretty standard).
    • Notice that the « extended » and « swap » definitions do not contain a filesystem type – it is not necessary.
  • « start# end# » : the start and end points, expressed in MB.
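
Since a typo in any of those numbers will happily eat a disk, it can be worth rehearsing the commands first.  Here’s one way to sketch that : the variable « PARTED » and the function « make_partitions » are inventions for this example, and by default the commands are only printed, not run.

```shell
# PARTED is the command that gets run ; the default only prints what would be done
PARTED="${PARTED:-echo parted}"

# make_partitions DEVICE OPT_START SWAP_START TOTAL   (positions in MB)
make_partitions() {
    dev="$1"; opt_start="$2"; swap_start="$3"; total="$4"
    $PARTED -s "$dev" mkpart primary ext3 0 128 &&
    $PARTED -s "$dev" mkpart primary ext3 128 20128 &&
    $PARTED -s "$dev" mkpart primary ext3 20128 40256 &&
    $PARTED -s "$dev" mkpart extended 40256 "$total" &&
    $PARTED -s "$dev" mkpart logical ext3 40256 "$opt_start" &&
    $PARTED -s "$dev" mkpart logical ext3 "$opt_start" "$swap_start" &&
    $PARTED -s "$dev" mkpart logical "$swap_start" "$total"
}

# usage :
#   make_partitions /dev/sda 440128 491808 500000
```

Once the printed commands look right, setting PARTED=parted makes the same function do the real work.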

Finally, we must still declare the partitions in the usual way.  Take note that this does not occur in the %pre section – this goes in the normal portion of the configuration for defining partitions :

part  /boot     --onpart=/dev/sda1
part  /         --onpart=/dev/sda2
part  /var/log  --onpart=/dev/sda3
part  /home     --onpart=/dev/sda5
part  /opt      --onpart=/dev/sda6
part  swap      --onpart=/dev/sda7

As i mentioned when we began this section, yes, this is (so far) a remarkably inefficient way to set this particular basic configuration up.  But, again to re-iterate, this exercise is about putting the groundwork in place for much more interesting applications of the technique.

mo’ drives, mo’ better

Perhaps some of your machines have more than one drive, and some don’t.  These sorts of things can be determined, and then reacted upon dynamically using the described technique.  Back to the %pre section :

# Determine number of drives (one or two in this case)
set $(list-harddrives)
let numd=$#/2
d1=$1
d2=$3

In this case, we’re using a built-in function called « list-harddrives » to help us determine which drive or drives are present, and then assign their device identifiers to variables.  In other words, if you have an « sda » and an « sdb », those identifiers will be assigned to « $d1 » and « $d2 », and if you just have an sda, then $d2 will be empty.
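
The « set » trick is easier to follow with some sample input.  Since « list-harddrives » only exists inside the installer, this sketch reads the same kind of text (a name and a size per line) from standard input instead ; the function name « drives_from_list » and the sizes shown are made up.

```shell
# drives_from_list : read "name size" lines on stdin, report the count and first two names
drives_from_list() {
    # unquoted on purpose : each drive contributes two words, its name and its size.
    # the "--" matters : without it, empty input would make "set" dump every variable
    set -- $(cat)
    echo "numd=$(( $# / 2 ))"
    echo "d1=${1:-}"
    echo "d2=${3:-}"
}

# two drives, so the names land in positions 1 and 3
printf 'sda 476940.0\nsdb 476940.0\n' | drives_from_list
```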

This gives us some interesting new options ; for example, if we wanted to put /home on to the second drive, we could write up some simple logic to make that happen :

# if $d2 has a value, it's that of the second device.
if [ -n "$d2" ]
then
  HOMEDEVICE=$d2
else
  HOMEDEVICE=$d1
fi

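# the variable only exists inside %pre, so write the finished line out to a file
echo "part  /home  --size=1024  --ondisk=/dev/$HOMEDEVICE  --grow" > /tmp/part-include

The file is then pulled into the normal (non-%pre) portion of the configuration with a « %include /tmp/part-include » line, since shell variables from %pre aren’t visible there.  That, of course, assumes that the other partitions are defined, and that /home is the only entity which should be grown dynamically – but you get the idea.  There’s nothing stopping us from writing a normal shell script that could determine the number of drives, their total size, and where the partition start points should be based on that information.  In fact, let’s examine this idea a little further.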

the size, she is dynamic !

Instead of trying to wrangle the partition sizes together with the default options, we can get as complex (or as simple) as we like with a few if statements, and some basic maths.  Thinking about our layout then, we can express something like the following quite easily :

  • If there is one drive that is at least 500 GB in size, then /opt should be 200 GB, and /home should consume the rest.
  • If there is one drive that is less than 500 GB, but at least 250 GB, then /opt and /home should each take half.
  • If there is one drive that is less than 250 GB, then /home should take two-thirds, and /opt gets the rest.

# $TOTAL from above...
if [ $TOTAL -ge 512000 ]
then
  let OPT_START=$SWAP_START-204800
elif [ $TOTAL -lt 512000 ] && [ $TOTAL -ge 256000 ]
then
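  # get the dynamic space total, which is between where /var/log ends, and swap begins
  let DYN_TOTAL=$SWAP_START-40256
  # /opt starts half-way through the dynamic space, counted from where /var/log ends
  let OPT_START=40256+$DYN_TOTAL/2
elif [ $TOTAL -lt 256000 ]
then
  let DYN_TOTAL=$SWAP_START-40256
  # /home takes the first two-thirds of the dynamic space ; /opt gets the rest
  let OPT_START=$DYN_TOTAL/3
  let OPT_START=40256+$OPT_START+$OPT_START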
fi
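
Rules like these are worth checking against a few disk sizes before letting them loose on real hardware.  Here’s a sketch of the same three cases as a function ; the name « opt_start » is invented, the arithmetic uses « $(( )) » instead of « let », and the start of /opt is measured from the beginning of the disk, so the 40256 MB taken up by the fixed partitions is added back in.

```shell
# opt_start TOTAL : print where /opt should start (in MB) under the three rules above
opt_start() {
    total="$1"
    fixed_end=40256                             # where /var/log ends
    swap_start=$(( total - 8192 ))              # swap takes the last 8GB
    dyn_total=$(( swap_start - fixed_end ))     # space shared by /home and /opt
    if [ "$total" -ge 512000 ]; then
        echo $(( swap_start - 204800 ))             # /opt is a fixed 200GB
    elif [ "$total" -ge 256000 ]; then
        echo $(( fixed_end + dyn_total / 2 ))       # /home and /opt split evenly
    else
        echo $(( fixed_end + dyn_total * 2 / 3 ))   # /home gets two-thirds
    fi
}

opt_start 600000   # prints : 387008
opt_start 300000   # prints : 166032
opt_start 200000   # prints : 141290
```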

Now, instead of having to create three different kickstart files, each describing a different scenario, we’ve covered it with one – nice !

other possibilities

At the end of the day, the possibilities are nearly endless, with the only restriction being that whatever you’d like to do has to be do-able in BusyBox – which, at this level, provides a lot of great functionality.

Stay tuned for more entries related to kickstarting, PXE-based installations, and so forth, all to come here on dan’s linux blog.  Cheers !

getting Dia to give you a pdf

Hello again !  Today’s quick tip concerns a software package called Dia, which is an open source tool (available for both Windows and Linux, as it goes) used to make diagrams, flowcharts, network maps, and so forth.  It has its own file format (.dia), which is (obviously?) useful for saving the projects you’re working on, but less useful if you need to give the diagram to anybody else, either in print or electronic form.

Dia can export to a variety of formats including SVG, PNG, and EPS, but one export format that it lacks native support for is the venerable PDF, which has become a de facto standard for transmitting documents between diverse environments.  There are many advantages and interesting aspects of the PDF format, not the least of which being that what you see on your screen is what you get when it’s printed.  It is unfortunate, then, that Dia won’t spit out a PDF (even if you ask very nicely).

Of course, being that it’s so easy to print directly to PDF (via CUPS, for example) these days, having native support for PDF may not, at first, seem all that useful.  Well, as it turns out, printing directly to PDF might not give you quite what you were looking for.  In practice, you do get a PDF, but what appeared to be a modestly-sized diagram in Dia will turn out to be a multi-page monster in (virtually) printed form.  As a general rule, this is not what you want.

In order to get a usable PDF we need to use an intermediate step between Dia and the final file.  The idea, quite simply, is to export the diagram as one of the supported formats, then convert that file into a PDF.  There are a number of options here, but for our purposes we’ll save the diagram as an EPS file, then use a quick little command-line tool called « epstopdf » to perform the conversion.

There’s a good chance that you don’t have epstopdf on your machine.  If you’re using Ubuntu, you used to be able to install it easily via the APT packager, but these days the little conversion tool comes as part of a larger suite of tools called « texlive-extra-utils ».  This suite is dependent on a number of other packages, so go ahead and install them all :

$ sudo apt-get install texlive-extra-utils

EDIT : In Ubuntu 10.04, the package is named « texlive-font-utils ».

Among many, many little items of interest, our target application will be installed.  To use it, simply feed it the name of the EPS file as an argument :

$ epstopdf somediagram.eps

It will automatically output a PDF file of the same name.  There you go – a nice, shiny PDF of your Dia diagram.
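
If you’ve got a whole directory of exported diagrams, a small loop saves some typing.  This is just a sketch, and the function name « eps_dir_to_pdf » is made up ; it relies only on epstopdf being called with a single file name, exactly as above.

```shell
# eps_dir_to_pdf DIR : run epstopdf on every .eps file found in DIR
eps_dir_to_pdf() {
    dir="$1"
    command -v epstopdf >/dev/null 2>&1 || { echo "epstopdf not found" >&2; return 1; }
    for f in "$dir"/*.eps; do
        [ -e "$f" ] || continue      # the glob matched nothing
        epstopdf "$f" || return 1    # stop at the first file that fails to convert
    done
}

# usage :
#   eps_dir_to_pdf ~/diagrams
```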

Enjoy !

pohmelfs pt. 2, return of pohmelfs !

Hello again fair readers.  Today i’m going to re-visit POHMELFS, which i introduced in an earlier blog post.  I received a comment on that post which basically asked for more information on some of the more interesting (read : advanced) features of POHMELFS, such as distributed storage, and the like.  Well, today is the day !  If you need a refresher, be sure to skim over my previous post, as we’re going to dive in now right where i left off last time.

patch for the win

One of the reasons that there was a bit of a delay between my last POHMELFS post and this one was because i hit a bug.  Given that we’re working with staging-level code here, that’s to be expected – luckily, thanks to some quick work by Evgeniy Polyakov on the POHMELFS mailing list, there is still hope – hope in the form of a tasty little patch.

diff --git a/drivers/staging/pohmelfs/trans.c b/drivers/staging/pohmelfs/trans.c
index eab7868..bf7b09a 100644
--- a/drivers/staging/pohmelfs/trans.c
+++ b/drivers/staging/pohmelfs/trans.c
@@ -467,7 +467,8 @@ int netfs_trans_finish_send(struct netfs_trans *t, struct pohmelfs_sb *psb)
 				continue;
 		}

-		if (psb->active_state && (psb->active_state->state.ctl.prio >= st->ctl.prio))
+		if (psb->active_state && (psb->active_state->state.ctl.prio >= st->ctl.prio) &&
+				(t->flags & NETFS_TRANS_SINGLE_DST))
 			st = &psb->active_state->state;

 		err = netfs_trans_push(t, st);

Basically, this patch fixes a minor, but ultimately crippling bug related to writing to multiple servers.  The details are not important – what’s important is that we apply the patch and keep the dream alive.  First, you’ll need to copy and paste that block of code into a text file on one of the systems (in « ~/pohmel.diff », for example).  Then, in order to apply the patch, we’ll need to use a standard tool called (appropriately) « patch » :

[root@host_75 ~]# cd /usr/src/linux
[root@host_75 linux]# patch -p 1 < ~/pohmel.diff
patching file drivers/staging/pohmelfs/trans.c
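
If you’re nervous about patching a kernel tree blind, GNU patch has a « --dry-run » option that goes through the motions without changing any files.  Here’s a sketch that only applies the diff when the rehearsal succeeds ; the function name « apply_patch_safely » is invented, and it assumes the patch on your system is the GNU one.

```shell
# apply_patch_safely DIR DIFF : apply DIFF inside DIR, but only if a dry run succeeds
# DIFF should be an absolute path, since we change directory before reading it
apply_patch_safely() {
    dir="$1"; diff_file="$2"
    (
        cd "$dir" || exit 1
        if patch -p1 --dry-run < "$diff_file" >/dev/null; then
            patch -p1 < "$diff_file"
        else
            echo "patch does not apply cleanly ; nothing was changed" >&2
            exit 1
        fi
    )
}

# usage :
#   apply_patch_safely /usr/src/linux "$HOME/pohmel.diff"
```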

Now, just as we did last time, we must play the kernel and module compilation and installation game (fun!).  If you need a refresher on how to do this, just go back to my previous post.  Note that this time around, the whole process will be much faster, since only the POHMELFS components need to be recompiled – everything else will stay the same.  As a result, you can skip the part where you archive the entire kernel tree and copy it over – instead, just patch and recompile on each server and the client.  It’s your call.

Once that’s out of the way we’ll reboot, and then it’s off to the races.

a new challenger appears !

It’s now time to add a third machine into the mix (« host_147 » in this case).  Using this new box, we’ll create a simple sort of setup which is, in fact, quite representative of how things might work in the real world : two storage servers and a client.  As you no doubt recall, one of the neat features of POHMELFS is that it can be employed in a parallel fashion, meaning that a file which appears to the client to be in one place, is actually located in more than one storage medium.  A general way of describing these ideas is by using the terms « logical » and « physical » ; the logical medium is the filesystem that the client sees, and the physical medium is the actual hard drive upon which the data is stored.

In this case, host_75 and host_166 will be the servers, each containing one copy of the data on their respective physical mediums (i.e. hard drives), and host_147 will be our client, which will access the data via the logical medium (i.e. the POHMELFS export).  The new machine was set up in the same way as host_166 was, so we’ll skip over that, and get right to the good stuff.

A new directory should be created on each of the machines : « /opt/pohtest ».  This will serve as the export directory on the servers, and the mount directory on the client – don’t put any data in it yet, though.

server config

On the servers, we’ll initiate the server daemon.  Unlike our first test, where we just let the defaults ride, this time around we’ll configure things a bit more intelligently :

[root@host_75 (and host_166) ~]# fserver -r /opt/pohtest -l /var/log/pohmelfs.log -d

In the above example, « -r » defines the directory to export, « -l » is where to output the logs to, and « -d » puts the process into the background, instead of on our console as before.  This is normally how things would work, so it’s good to get used to it now.  Now, we can follow the log files on each machine by using « tail » :

[root@host_75 (and host_166) ~]# tail -f /var/log/pohmelfs.log
Server is now listening at 0.0.0.0:1025.

client config

With the servers up and ready to go, we can now turn our attention to the client.  Don’t forget to load the pohmelfs module first !

[root@host_147 ~]# modprobe pohmelfs
[root@host_147 ~]# cfg -A add -a 192.168.0.75 -p 1025 -i 1

Now we mount.  It’s important that we mount before we attempt to add the second server into the mix – trying to do it ahead of time will only result in terrible, crippling failure.

[root@host_147 ~]# mount -t pohmel -o idx=1 none /opt/pohtest/

No output means it worked (as usual), so let’s verify :

[root@host_147 ~]# df | grep poh
none                 154590376  10018492 144571884   7% /opt/pohtest

Great, now let’s add the other server :

[root@host_147 opt]# cfg -A add -a 192.168.0.166 -p 1025 -i 1

Now we must wait at least 5 seconds for the synchronisation to occur.  In reality it’s shorter than that, but 5 seconds is an easy number to remember, and it’s safe.  So far this looks exactly the same as before, but there’s a bit of a conceptual twist – as you can see, both of those new add statements have the same index (as denoted by the -i).  This means that they’re grouped together as part of the same logical medium.  We can check on this by using the « show » action :

[root@host_147 ~]# cfg -A show -i 1
Config Index = 1
Family    Server IP                                            Port     
AF_INET   192.168.0.75                                         1025
AF_INET   192.168.0.166                                        1025

Everything seems on the up and up so far.  Recall the mount line from a moment ago : of the options passed to it, the notable one is « idx=1 », which means index 1 (as seen above) – this is very important to specify, as without it, POHMELFS won’t be able to determine which logical group you’re talking about.
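
Since the order of operations matters here (first server, then mount, then any further servers), it can help to see the client-side steps in one place.  The sketch below simply strings together the commands already shown ; the function name « pohmel_client_up » and the « RUN » variable are inventions for this example, and setting RUN=echo prints the commands rather than running them.

```shell
# RUN is put in front of every command ; set RUN=echo for a harmless rehearsal
RUN="${RUN:-}"

# pohmel_client_up MOUNTPOINT INDEX FIRST_SERVER [MORE_SERVERS...]
pohmel_client_up() {
    mnt="$1"; idx="$2"; first="$3"
    shift 3
    $RUN modprobe pohmelfs || return 1
    $RUN cfg -A add -a "$first" -p 1025 -i "$idx" || return 1
    # mount before adding any further servers to the same index
    $RUN mount -t pohmel -o "idx=$idx" none "$mnt" || return 1
    for ip in "$@"; do
        $RUN cfg -A add -a "$ip" -p 1025 -i "$idx" || return 1
    done
    # give the synchronisation a moment, as described above
    $RUN sleep 5
}

# usage :
#   RUN=echo
#   pohmel_client_up /opt/pohtest 1 192.168.0.75 192.168.0.166
```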

And if we take a look at the log output on the servers, we’ll see that the client connection has been accepted.  Both of the logs should show the accepted line, but with different port numbers (the trailing digits at the end) :

Accepted client 192.168.0.147:48277.

There are other diagnostics we can run to take a look at what we’ve got running.  At this stage they won’t tell us anything we don’t already know, but it will give us some practice with the tools and data, so that when the time comes to debug problems down the road, we’ll be ready.

For example, POHMELFS will write some handy information to « mountstats », which is exactly what it sounds like :

[root@host_147 ~]# cat /proc/1/mountstats
   ...
device none mounted on /opt/pohtest with fstype pohmel
idx addr(:port) socket_type protocol active priority permissions
1 192.168.0.75:1025 1 6 1 0 3
1 192.168.0.166:1025 1 6 1 0 3

It’s not lined up very nicely, but the interesting column right now is « active », which lists « 1 » in both cases, meaning the connections are open.  The « permissions » column lists « 3 » for both nodes which, in this case, means that they’re both available for reading and writing (as opposed to being read or write-only, which are also valid options).
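
Given that the columns don’t line up, a little awk makes the « active » column easier to read.  This is only a sketch based on the seven-column layout shown above (the function name « pohmel_states » is made up), so if your kernel prints something different it will need adjusting.

```shell
# pohmel_states : read mountstats text on stdin, print each POHMELFS server and its state
pohmel_states() {
    # server lines have seven fields, a numeric index first and addr:port second ;
    # the fifth field is the "active" flag
    awk 'NF == 7 && $1 ~ /^[0-9]+$/ && $2 ~ /:[0-9]+$/ {
        print $2, ($5 == 1 ? "active" : "inactive")
    }'
}

# usage :
#   pohmel_states < /proc/1/mountstats
```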

but will it blend ?

Accepting the connection is one thing – successfully reading and writing files is entirely another.  Let’s do some tests ; first we’ll use the client to create an empty file in the mount :

[root@host_147 ~]# cd /opt/pohtest/
[root@host_147 pohtest]# touch FILE
[root@host_147 pohtest]# ls
FILE

Great, now let’s take a look at our servers :

[root@host_166 pohtest]# ls -l
total 0
-rw-r--r-- 1 root root 0 2009-07-06 16:58 FILE
[root@host_166 ~]#

And the other :

[root@host_75 ~]# ls -l /opt/pohtest/
total 0
-rw-r--r-- 1 root root 0 2009-07-06 16:46 FILE
[root@host_75 ~]#

Now, during my limited tests, i noticed a small lag time between my manipulations on the client, and when those actions were reflected on the servers.  At this stage of the game i’m not sure whether that’s normal or not, or exactly what’s causing it – so don’t be alarmed if you see a small lag as well.  I’ll be sure to post further updates on this point once i’ve got more information.

Update : As per Evgeniy on the mailing list :

This delay is not a bug, but feature - POHMELFS has local cache on
clients and  data written on client is stored in that cache first and
then flushed to the server when client is under memory pressure or when
another one requests updated but not yet flushed data.

To force client to flush the data one can 'sync' on client or use
'flush' utility on the server. The latter will invalidate data on the
client (which implies it to be flushed to the server first), so server
update will become visible next time client reads that data.

how not to do it

Let’s do another little test.  On one of the servers, we’ll perform a manipulation in the POHMELFS export directory :

[root@host_75 ~]# touch /opt/pohtest/host75file
[root@host_75 ~]# ls -l /opt/pohtest/
total 4
-rw-r--r-- 1 root root 5 2009-07-06 16:46 FILE
-rw-r--r-- 1 root root 0 2009-07-06 16:57 host75file
[root@host_75 ~]#

Great, but if we take a look at the other server :

[root@host_166 ~]# ls -l /opt/pohtest/
total 4
-rw-r--r-- 1 root root 5 2009-07-06 16:59 FILE

And the client :

[root@host_147 ~]# ls -l /opt/pohtest/
total 0
-rw-r--r-- 1 root root 5 2009-07-06 20:47 FILE

We notice that it’s not there.  Why ?  Unfortunately, like so much bureaucracy, we didn’t go through the proper channels.  Recall that our client has certain software running on it that allows it to speak to both servers, and that the mountpoint uses that software to ensure consistency across the shared filesystem.  In the example above, we wrote directly to the underlying filesystem of the server – completely avoiding said software – and thus POHMELFS had no way of knowing that a manipulation had occurred.

In short – if you want to keep things consistent, you must interact via a client.  But what if we want our servers to be able to interact with the data as well ?  Well, there’s nothing stopping us from setting up client processes on our servers, too.  This, however, will have to wait for the next instalment.

See you on the intertubes !