Hello everybody ! Today’s post is about the Distributed Numeric Assignment (or « DNA » ) plug-in for the 389 Directory Server (also known as the Fedora, Red Hat, and CentOS Directory Servers). Although this plug-in has existed for quite some time, there isn’t a whole lot of documentation about how to implement it in a real-world scenario. I recently submitted some documentation to the maintainer of the 389 wiki, but since i’m not sure how, when, or in what form that documentation will come to exist on their site, i thought i’d expand on it here as well. If you’ve made it this far, i’m going to assume that you’re already familiar with the basics of LDAP, and already have an instance of Directory Server up and running – if not, i suggest you take a look through the official Red Hat documentation in order to get you started.
By way of some background, it is worth noting that my basic requirement was simply to have a centralised back-end for authenticating SSH logins to the various machines in our park. The actual numerical values for the UID and GID fields did not need to be the same ; they simply needed to be both extant and unique for each user, with the further caveat that they should not collide with any existing values that might be defined locally on the machines. This is a very basic set of requirements, so it is an excellent starting point for our example. The first step is to activate the DNA plug-in via the console :
[TAB] Servers and Applications
Domain -> Server -> Server Group -> Directory Server
Server -> Plug-ins -> Distributed Numeric Assignment
[X] Enable plug-in
The Directory Server needs to be restarted in order for the activation to take effect. This can either be done via the console, or via the command-line as normal. The next step is to define how DNA will interact with new user data ; this is different from configuring the plug-in itself, in that we will be setting up a layer in between the plug-in and the user data that will allow certain values to be generated automatically (which is, of course, the end goal of this exercise). Consider the following two LDIF snippets :
As you can see, they are nearly identical. This configuration activates the DNA magic-number functionality for the UID and GID fields as shown in the Posix attributes section of the console, though the values used may require further explanation. The only particular requirement for the magic number (specified by the « dnamagicregen » field) is that it be a value that cannot occur naturally, which is to say a value that would not be generated by the DNA plug-in, nor set manually at any time. The default value is « 0 », but since this is clearly a number with meaning on the average Posix system, i would recommend a suitably large number that is unlikely to ever be used, such as « 99999 ». Non-numerical values can technically be used too ; however, these will not be acceptable to the console, so unless you’re using a third-party interface (or doing everything from the commandline), a numerical value must be used.
The « dnanextvalue » field functionally indicates where the count will start from. As noted previously, in order to avoid collisions with existing local entries on the various machines, i chose a start point of « 1000 », which was more than acceptable in my environment. Once these two snippets are integrated via the commandline, simply re-start the Directory Server (again), and you’re good to go. From now on, any time that a new user is created with the value « 99999 » entered into either (or both) of the UID and GID Posix fields, DNA will automagically generate real values as appropriate.
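As a rough illustration of the kind of entries described above, the sketch below prints one DNA configuration entry per attribute. Only « dnamagicregen » and « dnanextvalue » are discussed in the text ; the « dnatype », « dnafilter » and « dnascope » attribute names, the layout of the DN, and the « dc=example,dc=com » suffix are assumptions, so check them against the plug-in documentation for your version before loading anything.

```shell
#!/bin/sh
# Print a DNA configuration entry for one attribute.
# Assumed, not taken from the article: the dnatype/dnafilter/dnascope
# attribute names, the entry DN layout, and the dc=example,dc=com suffix.
dna_entry() {
    name="$1"      # label for the config entry, e.g. "UID numbers"
    attr="$2"      # attribute to manage, e.g. uidNumber
    magic="$3"     # magic value that triggers generation
    start="$4"     # first value to hand out
    cat <<EOF
dn: cn=$name,cn=Distributed Numeric Assignment Plugin,cn=plugins,cn=config
objectClass: top
objectClass: extensibleObject
cn: $name
dnatype: $attr
dnamagicregen: $magic
dnafilter: (objectclass=posixAccount)
dnascope: dc=example,dc=com
dnanextvalue: $start
EOF
}

# One entry for UIDs and one for GIDs, nearly identical as described.
dna_entry "UID numbers" uidNumber 99999 1000
echo
dna_entry "GID numbers" gidNumber 99999 1000
```

The output can be saved to a file and loaded from the commandline with a tool such as ldapmodify.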
Hello all ! Today we’re going to take a look at a somewhat obscure problem that – once encountered – can cause nothing but headaches for a system administrator. The problem relates to conflicts in CPAN RPM packages, and what can be done to work around the issue. If you’ve made it this far, i’m going to assume a couple of things : you’re comfortable with RPMs and repositories, have worked with a .spec file before, and you know what Perl modules are. Good ? Ok, let’s go.
Edit : About a week after i posted this article, the pastebin i uploaded the examples to disappeared. Maybe it will come back – i don’t know – but if not, sorry for the broken links…
CPAN is an enormous collection of Perl modules. If you’ve ever written a Perl script, there’s a good chance you’ve used a module that – at one point or another – came from this archive. One of the really neat features of CPAN is the interactive manner in which modules can be downloaded and installed from the archive using Perl right from the command line (frankly, if you’re reading this post, there’s a good chance you’ve used this feature, too). This is a fairly common way to install new modules and add functionality to your system, especially if you’re coding for local use (i.e. on your personal box).
It’s useful, but it’s not perfect, and one of the key areas where it starts to fail is scalability : if you’ve got a bunch of machines, and you need to SSH into each one to interactively install a CPAN module or two, it’s going to be a hassle. Likewise, CPAN doesn’t often find its way into the hearts and minds of enterprise Red Hat or CentOS environments, where the official policy is often to install software via RPM only (for support, administration, and sanity reasons, this is often the case).
Luckily, some of the most commonly used CPAN modules exist as RPMs in the default repositories. Some, but not all (and not even « many ») – for this, there are other repositories available. Some examples :
That last one – Magnum – is particularly interesting given the subject of our post today. From their info page :
At Magnum we have a firm rule that all CPAN modules on our machines are installed from RPMs. The Fedora and Centos projects build RPMs for many CPAN modules, but there are always ones missing and the ones that are available often lag behind the most up to date versions. For that reason, we build a lot of RPMs of CPAN modules. And we don’t want to keep that work to ourselves, so on these pages we make them available for anyone to download.
Their RPMs are generated automagically using a great tool called « cpanspec », which does exactly what you think it does : given a CPAN tarball, it will generate a .spec file suitable for building an installable RPM. It is available in the standard repositories, and can be installed easily via YUM as normal, so go ahead and do that now. Ok, example time : say you needed HTML::Laundry, but after a quick peek through your repositories, it becomes readily apparent that an RPM is not available. Thanks to cpanspec, all is not lost :
We just downloaded the tarball right from the CPAN website, and ran cpanspec against it. The « --packager » argument simply defines the person who’s generating the .spec, and doesn’t necessarily have to be anything accurate. Go ahead and try it for yourself. Now take a look at the resulting .spec file (or on the pastebin here). As you can see, it fills in all the fields, including the critical (and often tricky-to-determine) « BuildRequires » and « Requires » items. Frankly, it’s solid gold, and it has made the lives of CentOS / RHEL admins all over the world much easier.
That said, it’s not perfect, and there are times when you might run into problems. Actually, you may run into two problems in particular. The first is a conflict over ownership, which arises when multiple RPMs claim to be responsible for the same file (or files, or directories, or features, or whatever). The second is more nefarious : an RPM that writes files to the system without declaring ownership for them – a condition often referred to as « clobbering ». The former is irritating, but at least it’s not destructive, unlike the latter, which can cause all manner of headaches. To illustrate these two problems, let’s take a look at another example (this one being decidedly more real-world than that of Laundry above) : CGI.pm.
The .spec file that is generated from this tarball is functional and correct, and we can build an installable RPM out of it, so at first all appears well. Again, go ahead and try for yourself – i’ll wait. You may wish to capture the build output for review – otherwise, check the pastebin. I’d like to draw your attention to the « Installing » lines. By trimming the « Installing /var/tmp/perl-CGI.pm.3.49-1-root-root » element from each of those lines, we can see the actual paths and files that this RPM will install to. Examples :
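As a small sketch of that trimming step, the helper below strips the prefix with sed. The function name and the sample input lines are made up for illustration ; only the build root path comes from the text above.

```shell
#!/bin/sh
# Strip the "Installing <buildroot>" prefix from rpmbuild output lines,
# leaving only the paths the package would install to.
# strip_buildroot is a hypothetical helper; the sample lines are made up.
strip_buildroot() {
    buildroot="$1"
    # Use | as the sed delimiter, since paths are full of slashes.
    # -n plus the p flag prints only the lines that actually matched.
    sed -n "s|^Installing ${buildroot}||p"
}

printf '%s\n' \
    'Installing /var/tmp/perl-CGI.pm.3.49-1-root-root/usr/share/man/man3/CGI.3pm' \
    'Manifying blib/man3/CGI.3pm' |
    strip_buildroot /var/tmp/perl-CGI.pm.3.49-1-root-root
```

Lines that do not start with « Installing » are dropped, so the result is a plain list of target paths that can be compared against what is already on the system.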
At first glance this looks perfectly acceptable. But look what happens when we try to install the resulting RPM (clipped for brevity) :
[root@host-119 build]# rpm -iv /usr/src/redhat/RPMS/noarch/perl-CGI.pm-3.49-1.noarch.rpm
Preparing packages for installation...
file /usr/share/man/man3/CGI.3pm.gz from install of perl-CGI.pm-3.49-1.noarch conflicts with file from package perl-5.8.8-27.el5.x86_64
file /usr/share/man/man3/CGI::Cookie.3pm.gz from install of perl-CGI.pm-3.49-1.noarch conflicts with file from package perl-5.8.8-27.el5.x86_64
file /usr/share/man/man3/CGI::Pretty.3pm.gz from install of perl-CGI.pm-3.49-1.noarch conflicts with file from package perl-5.8.8-27.el5.x86_64
As it turns out, the Perl package that comes with RHEL / CentOS already contains CGI.pm. This is normal, since it’s so popular, and is included as a convenience. Thus, RPM – in an attempt to preserve the coherence of the package management system – refuses to install overtop of the existing owned files. This is a fine illustration of the first of the two problems previously noted : conflicts over ownership. As i mentioned above, it’s aggravating, but it’s not a bug – it’s a feature, and it’s doing exactly what it’s designed to do. Irritating, but not ultimately dire.
If you look carefully, though, it’s also an illustration of the second problem. Note the list of files that are conflicting. Look back to the list of files that the package contains – notice anything missing from the conflicts list ? That’s right – the actual module files (*.pm) are not showing conflicts, which means they’d get overwritten without complaint by RPM. You might be thinking « who cares ? that’s what i want » right now, but trust me, it’s not what you want. Imagine that this CGI package, with this version of CGI.pm, gets installed, and then later you upgrade the Perl package – your CGI.pm files will get overwritten by the Perl package, because as far as RPM is concerned, Perl owns those files. All of a sudden, things break because you had scripts that relied on your particular version, but since you just upgraded Perl, you think (quite naturally) that the problem could be anywhere – where do you even start looking ?
Imagine the headache if there are multiple administrators, multiple servers, multiple data centres, and multiple clients paying multiple dollars. No fun at all.
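One way to spot this sort of overlap before installing anything is to compare two plain lists of paths. The sketch below does only that ; the helper name and the sample paths are hypothetical. On a real machine the two lists could be produced with « rpm -qlp » against the new package file and « rpm -ql perl » against the installed package.

```shell
#!/bin/sh
# Given two files, each listing one path per line, print the paths that
# appear in both: the files a new package would share with an installed one.
# shared_paths is a hypothetical helper. comm needs sorted input.
shared_paths() {
    new_list="$1"
    owned_list="$2"
    tmp_new=$(mktemp) && tmp_owned=$(mktemp) || return 1
    sort -u "$new_list" > "$tmp_new"
    sort -u "$owned_list" > "$tmp_owned"
    # -12 suppresses lines unique to either file, leaving the common ones.
    comm -12 "$tmp_new" "$tmp_owned"
    rm -f "$tmp_new" "$tmp_owned"
}

# Demo with made-up lists.
new=$(mktemp); owned=$(mktemp)
printf '%s\n' /demo/CGI.pm /demo/CGI/Cookie.pm /demo/New.pm > "$new"
printf '%s\n' /demo/strict.pm /demo/CGI.pm /demo/CGI/Cookie.pm > "$owned"
shared_paths "$new" "$owned"
rm -f "$new" "$owned"
```

Any shared path that RPM does not also report as a conflict deserves a second look.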
So how can we upgrade CGI.pm, using an RPM, without running into these problems ? As is often the case, the answer is deceptively simple, but not immediately obvious. Ultimately what we want to accomplish is twofold :
Avoid the man conflicts.
Ensure that the existing owned module files are not clobbered by our new package.
Concerning the man pages – and i’m going to be perfectly blunt here – the solution is to simply not install them, since, of course, they’re already there. As for avoiding a clobbering condition, this requires a little bit of investigation into how Perl modules and libraries are stored on an RHEL / CentOS machine. Consider the following output :
[root@host-119 ~]# ls -d /usr/lib64/perl5/*
/usr/lib64/perl5/5.8.8 /usr/lib64/perl5/site_perl /usr/lib64/perl5/vendor_perl
What’s it all mean ? Well, the « 5.8.8 » directory is the default directory as defined by the Perl architecture, and is system and platform-agnostic, which is to say that it’s (supposed to be) the same on every system. The « vendor_perl » directory contains everything that is specific to RHEL / CentOS (the « vendor » of the distribution). As you may recall from the rpmbuild output above, this is where the RPM wants to install the modules (thus creating the clobbering condition).
There’s a third directory there, promisingly named « site_perl » ; as the name implies, this is where site-specific files are stored, which is to say items that are neither part of the default Perl architecture, nor part of the RHEL / CentOS distribution. As you’ve no doubt guessed by now, site_perl is where we’re going to put our new modules.
Luckily for us, the only thing that needs to be changed is the .spec file – and we even get a headstart, since cpanspec does most of the heavy lifting for us. Examining the .spec file once more, we see the following lines of note (again, cut for brevity) :
These indicate that the target installation directory is that of the vendor, which is normally the case, and thus the default setting. Since we want to install to the site directory, we make the following changes :
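One way to apply that kind of change mechanically is sketched below. It assumes that the generated .spec uses « INSTALLDIRS=vendor » along with the « perl_vendorlib » and « perl_vendorarch » macros, and that the matching « perl_sitelib » and « perl_sitearch » macros are defined on the build host ; if the .spec defines those macros itself, that definition needs the same treatment by hand. The helper name and the sample fragment are made up.

```shell
#!/bin/sh
# Rewrite a cpanspec-style .spec so that modules go to the site
# directories instead of the vendor ones. Prints the result to stdout.
# Assumptions: the spec uses INSTALLDIRS=vendor and the perl_vendorlib /
# perl_vendorarch macros, and the site equivalents exist on the build host.
vendor_to_site() {
    sed -e 's/INSTALLDIRS=vendor/INSTALLDIRS=site/' \
        -e 's/perl_vendorlib/perl_sitelib/g' \
        -e 's/perl_vendorarch/perl_sitearch/g' "$1"
}

# Demo on a made-up two-line fragment.
spec=$(mktemp)
cat > "$spec" <<'EOF'
%{__perl} Makefile.PL INSTALLDIRS=vendor
%{perl_vendorlib}/*
EOF
vendor_to_site "$spec"
rm -f "$spec"
```

Reviewing the rewritten file before building is still worthwhile, since a blind search-and-replace knows nothing about the rest of the .spec.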
That solves our clobbering problem quite nicely, but what about the man files ? As i mentioned above, the idea is to simply avoid installing them altogether, but since they’re generated automatically during the build process, how can we exclude them ? What i’m about to present is a bit of a hack, but it’s absolutely effective, and ultimately quite clean : we delete them after they’ve been generated, and then don’t declare them in the file list. Some items are already being potentially deleted by default, so let’s go ahead and add our own line into the mix :
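A line in that spirit, assuming the generated man pages carry a « .3pm » suffix, would be a find over $RPM_BUILD_ROOT. The sketch below wraps the same find in a hypothetical helper and runs it against a throwaway directory, so that it can be tried safely outside of a build.

```shell
#!/bin/sh
# Remove generated man pages from a build tree, in the same style as the
# cleanup lines a cpanspec .spec already carries in its install section.
# Assumption: the man pages match *.3pm* ; remove_manpages is hypothetical.
remove_manpages() {
    find "$1" -type f -name '*.3pm*' -exec rm -f {} ';'
}

# Demo on a temporary tree standing in for $RPM_BUILD_ROOT.
root=$(mktemp -d)
mkdir -p "$root/usr/share/man/man3" "$root/usr/lib/perl5/site_perl"
touch "$root/usr/share/man/man3/CGI.3pm" "$root/usr/lib/perl5/site_perl/CGI.pm"
remove_manpages "$root"
find "$root" -type f    # only the module file should remain
rm -rf "$root"
```

Inside a .spec the same find would simply take $RPM_BUILD_ROOT as its starting directory.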
This will look for all of the « manified » man files and just remove them from the build tree. All that’s left now is to remove them from the file list. This is as simple as deleting (or commenting out) their sole declaration :
Another option is to simply use the « --excludedocs » argument when installing the RPM. I opted to remove the docs altogether in order to ensure that the package can be installed without errors by anyone else without needing to know about the argument requirement ahead of time (and to facilitate automated rollouts).
What you’ll end up with is a .spec file that looks like this. Go ahead and build your RPM – it’ll install without conflicts and without danger. This is a technique that can be used for other CPAN packages as well, so go ahead and install everything you’ve always wanted.
Happy 2010 fair readers ! I hope that all is well with you and yours. Let’s get right to business : Virtualbox has a feature that allows you to access the host OS’s file system from the guest OS (shared folders), which is super useful, but not exactly perfectly implemented. In particular, there are known, documented performance issues in certain scenarios, such as when accessing a Linux host from a Windows guest (which, as you might imagine, is a pretty regular sort of activity).
One common (?) workaround is to install and configure Samba on the Linux host, then access it from the Windows guest like one would access any network server. The problem here is that it requires that Samba be installed and configured, which can be a pain in the, well, you know. Furthermore, the connection will be treated like any other, and the traffic will travel up and down the network stack, which is fundamentally unnecessary since the data is, physically speaking, stored locally.
Instead, here’s another workaround, one that keeps things simple, and solves the performance problem : just map the shared folder to a local drive in the guest OS. It’s that easy. For those of us who aren’t too familiar with the Windows explorer interface (me included, heh), there are tonnes of step by step instructions available. For whatever reason (i suspect Netbios insanity), accessing the network share via a mapped drive manages to avoid whatever condition creates the lag problems, resulting in rapid, efficient access to the underlying filesystem.
Hello again fair readers ! Today’s quick tip concerns the problem with missing time zones when deploying CentOS 5.3 (and some of the more recent Fedoras) in a kickstart environment. It’s a known problem, and unfortunately, since the source of the problem (an incomplete time zone data file) lies deep in the heart of the kickstart environment, fixing it directly is a distinct pain in the buttock region.
There is, however, a workaround – and it’s not even that messy ! The first step is to use a region that does exist, such as « Europe/Paris », which will satisfy the installer – then set the time zone to what you actually want after the fact in the « %post » section. So, in the top section of the kickstart file, we’ll put :
# set temporarily to avoid time zone bug during install
timezone --utc Europe/Paris
The « --utc » switch simply states that the system clock is in UTC, which is pretty standard these days, but ultimately optional. Next, in the %post section towards the end, we’ll shoehorn our little hack fix into place :
# fix faulty time zone setting
mv /etc/sysconfig/clock /etc/sysconfig/clock.BAD
sed 's@^ZONE="Europe/Paris"@ZONE="Etc/UTC"@' /etc/sysconfig/clock.BAD > /etc/sysconfig/clock
tzdata-update
So, what’s going on there ? Let’s break it down :
In the first line, we’re just backing up the original configuration file, to use in the next line…
The second line is the important one – this is the actual manipulation which will fix the faulty time zone, setting it to whatever we want. In this example « Etc/UTC » is used, but you can pick whatever is appropriate.
The tool being used here is « sed », a non-interactive editor which dates back to the 1970s, and which is still used by system administrators around the world every day.
The command we’re issuing to sed is between the single quotes – astute readers will notice that it’s a regular expression, but with @’s instead of the more usual /’s. In it, we simply state that the instance of « ZONE="Europe/Paris" » is to be replaced with « ZONE="Etc/UTC" ».
This change is made against the backup file, and the result is written to the actual config.
Finally, we run « tzdata-update » which, as you’ve no doubt guessed, updates the time zone data system-wide, based (in part) on the newly-corrected clock config.
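To try the substitution without touching a real system, the same two steps can be run against a throwaway file. The helper name « fix_zone » and the sample file contents are made up for this sketch ; the sed expression follows the one above, with the zone names passed in as arguments.

```shell
#!/bin/sh
# Replace the ZONE= line of a clock file, using @ as the sed delimiter
# so that the slashes in zone names need no escaping.
# fix_zone is a hypothetical helper; the file used below is a temporary
# stand-in, not the real /etc/sysconfig/clock.
fix_zone() {
    file="$1"; from="$2"; to="$3"
    # Keep the original as a backup, then write the corrected copy.
    mv "$file" "$file.BAD"
    sed "s@^ZONE=\"$from\"@ZONE=\"$to\"@" "$file.BAD" > "$file"
}

clock=$(mktemp)
printf '%s\n' 'ZONE="Europe/Paris"' 'UTC=true' > "$clock"
fix_zone "$clock" Europe/Paris Etc/UTC
cat "$clock"
rm -f "$clock" "$clock.BAD"
```

Only the ZONE line changes ; every other line of the file passes through sed untouched.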
And that, as they say, is that. Happy kickstarting, friends, and i’ll see you next time !
Hello again, everybody ! Today i thought that we’d take a look at a fun and useful topic of interest to many system administrators : load balancing & redundancy. Now, i know, it doesn’t sound too exciting – but trust me, once you get your first mini-cluster set up, you’ll never look at service management quite the same way again. It’s not even that tough to set up, and you can get a basic setup going in almost no time at all, thanks to some great open source software that can be found in more or less any modern repository.
First, as always, a little bit of theory. The most basic web server setup (for example), looks something like figure 001, below :
As you can see, this is a functional setup, but it does have (at least) two major drawbacks :
A critical failure on the web server means the service (i.e. the content being served) disappears along with it.
If the web server becomes overloaded, you may be forced to take the entire machine down to upgrade it (or just let your adoring public deal with a slow, unresponsive website, i suppose).
The solution to both of these problems forms the topic of this blog entry : load balancing. The idea is straightforward enough : by adding more than one web server, we can ensure that our service continues to be available even when a machine fails, and we can also spread the love, er, load, across multiple machines, thus increasing our overall efficiency. Nice !
batman and round robin
Now, there are a couple of ways to go about this, one of which is called « Round Robin DNS » (or RRDNS), which is both very simple and moderately useful. DNS, for those needing a refresher, is (in a nutshell) the way that human-readable hostnames get translated into machine-readable numbers. Generally speaking, hostnames are tied to IP addresses in a one-to-one or many-to-one fashion, such that when you type in a hostname, you get a single number back. For example :
$ host www.dark.ca
www.dark.ca has address 188.8.131.52
In other words, when you type http://www.dark.ca into your browser, you get one particular machine on the Internet (as indicated by the address) ; however, it is also possible to set up a one-to-many relationship – this is the basis of RRDNS. A very common example is Google :
$ host www.google.com
www.google.com is an alias for www.l.google.com.
www.l.google.com has address 184.108.40.206
www.l.google.com has address 220.127.116.11
www.l.google.com has address 18.104.22.168
www.l.google.com has address 22.214.171.124
www.l.google.com has address 126.96.36.199
www.l.google.com has address 188.8.131.52
So what’s going on here ? In essence, the Google administrators have created a situation whereby typing http://www.google.com into your browser will get you one of a whole group of possibilities. In this way, each time you request some content from them, one of any number of machines will be responsible for delivering that service. (Now, to be fair, the reality of what’s going on at Google is likely far more complex, but the premise is identical.) Your web browser will only get one answer back, which is more or less randomly provided by the DNS server, and that response is the machine you’ll interact with. As you can see, this (sort of) satisfies our problem of resource usage, and it (sort of) addresses the problem of resource failure. For those of you who are more visually inclined, please see figure 002 below :
It’s not perfect, but it is workable, and most of all, it’s dead simple to set up – you just need to set your DNS configuration up and you’re good to go (an exercise i leave to you, fair reader, as RRDNS is not really the focus of our discussion today). Thus, while RRDNS is a simple method for implementing a rudimentary load balancing infrastructure, it still has notable failings :
The load balancing isn’t systematic at all – by pure chance, one machine could end up getting hammered while others do very little, for example.
If a machine fails, there’s a chance that the DNS response will contain the address of the downed machine. In other words, the chances of you getting the downed machine are 1 in X, where X is the number of possible responses to the DNS query. The odds get better (or worse, depending on how you look at it) as more machines fail.
A slightly more obscure problem is that of response caching : as a method of optimisation, many DNS systems, as well as software that interacts with DNS, will cache (hold on to) hostname lookups for variable lengths of time. This can invalidate the magic of RRDNS altogether…
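The principle, and the « 1 in X » problem along with it, can be sketched in a few lines. Real DNS servers rotate or shuffle their answers in their own ways, so this is only an illustration of strict rotation ; the helper name is hypothetical and the addresses are placeholders from a documentation range.

```shell
#!/bin/sh
# Pick an address from a list in strict rotation: request number N gets
# the address at position (N mod count).
# pick_round_robin is a hypothetical helper, not how any particular DNS
# server implements its answer ordering.
pick_round_robin() {
    n="$1"; shift
    skip=$(( n % $# ))
    # Drop the first "skip" addresses; the next one is the answer.
    shift "$skip"
    printf '%s\n' "$1"
}

# Four requests against three placeholder addresses.
for request in 0 1 2 3; do
    pick_round_robin "$request" 192.0.2.10 192.0.2.11 192.0.2.12
done
```

Nothing in the rotation knows whether a machine is alive : if the second address is down, every third request is still pointed at it.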
another attack vector
Another approach to the problem, and the one we’ll be exploring in great depth in this article, is using a dedicated load balancing infrastructure, combining a handful of great open source tools and proven methodologies. First, however, some more theory.
Our new approach to load balancing must propose both a solution to the original problems (critical failure & resource usage), as well as address and solve the drawbacks of RRDNS as noted above. Really, what we want is an intelligent (or, at least, systematic) distribution of load across multiple machines, and a way to ensure that requests don’t get sent to downed machines by accident. It’d be nice if these functions were automated too, since the last thing an administrator wants to do is baby-sit racks of servers. What we’d like, in other words, could be represented by replacing the phrase « RRDNS » in figure 002 above, with the word « magic ». For now, let’s imagine that this magic sits on a machine that we’ll call « Load Balancer » (or LB, for short), and that this LB machine would have a similar conceptual relationship to the web servers as RRDNS does. Consider figure 003 :
This is a basic way of thinking about what’s going to happen. It looks a lot like figure 002, but there is a very important difference : instead of relying on the somewhat nebulous concept of DNS for our load balancing, we can now give that responsibility to a proper machine dedicated to the purpose. As you can imagine, this is already a huge improvement, since this opens the door to all sorts of additional features and possibilities that simply aren’t possible with straight DNS. Another interesting aspect of this diagram is that, visually speaking, it would appear that the Internet cloud only « sees » one machine (the load balancer), even though there are a number of web servers behind it. This concept of having a single point of entry lies at the very core of our strategy – both figuratively and literally – as we’ll soon discover.
In the here and now, however, we’re still dealing with theory, and a solution based on « magic » is about as theoretical as it gets. Luckily for us though, magic is exactly what we’re about to unleash – in the form of « Linux Virtual Server », or « LVS » for short. From their homepage :
The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the server cluster is fully transparent to end users, and the users interact as if it were a single high-performance virtual server. […] The Linux Virtual Server as an advanced load balancing solution can be used to build highly scalable and highly available network services, such as scalable web, cache, mail, ftp, media and VoIP services.
The thing about LVS is that while it’s not inherently complex, it is highly malleable, and this means you really do need to have a solid handle on exactly what you want to do, and how you want to do it, before you start playing around. Put another way, there are a myriad of ways to use LVS, but you’ll only use one of them at a time, and picking the right methodology is important. The best way to do this is by building maps and really getting a solid feel for how the various components of the overall architecture relate to each other. Once you’ve got a good mental idea of what things should look like, actually configuring LVS is about as straightforward as it gets (no, really!).
let’s complicate the issue further, for science !
Looking back to figure 003, we can see that our map includes the Internet, the Load Balancer, and some Web Servers. This is a pretty typical sort of setup, and thus, we can approach it in a few different ways. One of the decisions that needs to be made fairly early on, though, has more to do with topology and routing than LVS specifically : how, exactly, do the objects on the map relate to each other at a network level ? As always, there can be lots of answers to this question – each with their advantages and disadvantages – but ultimately we must pick only one. Since i value simplicity when it comes to technology, figure 004 describes a simple network topology :
Now, for those of you out there who may have some experience with LVS, you can see exactly where this is headed – for everybody else, this might not be what you were expecting at all. Let’s take a look at some of the more obvious points :
There are two load balancers.
The web servers are on the same network segment as the LBs.
Unlike the previous diagrams, the LBs do not appear to be « in between » the Internet and the web servers.
The first point is easy : there are two LBs for reasons of redundancy, as a single LB represents a single point of failure. In other words, if the LB stops working for whatever reason, all of your services behind it become functionally unavailable, thus, you really, really want to have another machine ready to go immediately following a failure.
The second and third points need a little more explanation – but the short answer is two words : « Direct Routing » (or DR for short). From the LVS wiki :
Direct Routing [is] an IP load balancing technology implemented in LVS. It directly routes packets to backend server through rewriting MAC address of data frame with the MAC address of the selected backend server. It has the best scalability among all other methods because the overhead of rewriting MAC address is pretty low, but it requires that the load balancer and the backend servers (real servers) are in a physical network.
If that sounds heavy, don’t worry – figure 005 explains it in easy visual form :
In a nutshell, requests get sent to the LB, which then passes them to the Web Server, which in turn responds directly to the client. It’s fast, efficient, scalable, and easy to set up, with the only caveat being that the LBs and the machines they’re balancing must be on the same network. As long as you’re willing to accept that restriction, Direct Routing is an excellent choice – and it’s the one we’ll be exploring further today.
a little less conversation, a little more action
So with that in mind, let’s get started. I’m going to be describing four machines in the following scenario. All four are identical off-the-shelf servers running CentOS 5.2 – nothing fancy here. The naming and numbering conventions are simple as well :
You probably noticed the fifth item in this list, labelled « Virtual Web Server ». This represents our virtual, or clustered service, and is not a real machine. This will be explained in further detail later on – for now, let’s go ahead and install the key software on both of the Load Balancer machines :
[root@A01 etc]# yum install ipvsadm piranha httpd
« ipvsadm » is, as you might have guessed, the administrative tool for « IPVS », which is in turn an acronym for « Internet Protocol Virtual Server », which makes more sense when you say « IP-based Virtual Server » instead. As the name implies, IPVS works in terms of IP addresses : it balances at the transport layer (more generically known as Layer-4 of the OSI model), and is used to spread incoming connections to one IP address and port towards other IP addresses according to one of many pre-defined methods. ipvsadm is the tool that allows us to control our new load balancing infrastructure, and is the key software component around which this entire exercise revolves. It is powerful, but sort of a pain to use, which brings us to the second item in the list : piranha.
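To give a feel for what driving ipvsadm by hand involves, the sketch below prints (but does not run) the commands for a single TCP service with round-robin scheduling and direct routing. The helper name is hypothetical and the addresses in the demo call are placeholders ; running the printed commands for real requires root and the ipvsadm package.

```shell
#!/bin/sh
# Print (not run) the ipvsadm commands for one TCP virtual service that
# uses round-robin scheduling and direct routing.
# print_ipvs_rules is a hypothetical helper; the addresses in the demo
# call are placeholders.
print_ipvs_rules() {
    vip="$1"; port="$2"; shift 2
    # -A adds a virtual service, -t marks it as TCP, -s picks the scheduler.
    echo "ipvsadm -A -t $vip:$port -s rr"
    for rip in "$@"; do
        # -a adds a real server, -r names it, -g selects direct routing.
        echo "ipvsadm -a -t $vip:$port -r $rip:$port -g"
    done
}

print_ipvs_rules 192.168.0.40 80 192.168.0.38 192.168.0.39
```

Every service and every real server needs its own line, and none of it takes care of health checks or failover, which is where a front-end earns its keep.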
Piranha is a web-based tool (hence httpd, above) for administering LVS, and is effectively a front-end for ipvsadm. As installed in CentOS, however, the Piranha package contains not only the PHP pages that make up the interface, but also a handful of other tools of particular interest and usefulness that we’ll take a look at as well. For now, let’s continue with some basic setup and configuration.
A quick word of warning : before starting « piranha-gui » (one of the services supplied by Piranha) up for the first time, it’s important that both LBs have the same time set on them. You’ve probably already got NTP installed and functioning, but if not, here’s a hint :
[root@A01 ~]# yum -y install ntp && ntpdate pool.ntp.org && chkconfig ntpd on && service ntpd start
Moving right along, the next step is to define a login for the Piranha web interface :
[root@A01 ~]# /usr/sbin/piranha-passwd
You can define multiple logins if you like, but for now, one is certainly enough. Now, unless you plan to run your load balanced infrastructure on a completely internal network, you’ll probably want to set up some basic restrictions on who can access the interface. Since the interface is served via an instance of Apache HTTPd, all we have to do is set up a normal « .htaccess » file. Now, a full breakdown of .htaccess (and, in particular, mod_access) is outside of the scope of this document, but the simple gist is as follows :
[root@A01 ~]# cat /etc/sysconfig/ha/web/secure/.htaccess
# by default, deny from everybody
Deny from all
# requests from this network are allowed
Allow from 192.168.0
With those items out of the way, we can now activate piranha-gui :
[root@A01 ~]# chkconfig piranha-gui on && service piranha-gui start
Congratulations ! The interface is now running on port 3636, and can be accessed via your browser of choice – in the case of our example, it’d be « http://A01:3636/ ». The username for the web login is « piranha », and the password is the one we set above. Now that we’re logged in, let’s take a look at the interface in greater depth.
look out – piranhas !
The first screen – known as the « Control » page – is a summary of the current state of affairs. Since nothing is configured or even active, there isn’t a lot to see right now. Moving on to the « Global Settings » tab, we have our first opportunity to start putting some settings into place :
Primary server public IP : Put the IP address of the « primary » LB. In this example, we’ll put the IP of A01.
Primary server private IP : If we weren’t using direct routing, this field would need a value. In our example, therefore, it should be empty.
Use network type : Direct Routing (of course!)
On to the « Redundancy » tab :
Redundant server public IP : Put the IP address of the « secondary » LB. In this example, we’ll put the IP of A02.
Syncdaemon : Optional and useful – but know that it requires additional configuration in order to make it work.
This feature (which is relatively new to LVS) ensures that the state information (i.e. connections, etc.) is shared with the secondary in the event that a failover occurs. For more information, please see this page from the LVS Howto.
It is not necessary, strictly speaking, so we can just leave it unchecked for now.
Under the « Virtual Servers » tab, let’s go ahead and click « Add », then select the new unconfigured entry and hit « Edit » :
Name : This is an arbitrary identifier for a given clustered service. For our example, we’d put « WWW ».
Application Port : The port number for the service – HTTP runs on port 80, for example.
Protocol : TCP or UDP – this is normally TCP.
Virtual IP Address : This is the IP address of the virtual service (VIP), which you may recall from the table above. This is the IP address that clients will send requests to, regardless of the real IP addresses (RIP) of the real servers which are responsible for the service. In our example, we’d put 192.168.0.40 .
Each service that you wish to cluster needs a unique « address : port » pairing. For example, 192.168.0.40:80 could be a web service, and 192.168.0.40:25 would likely be a mail service, but if you wanted to run another, separate web service, you’d need to assign a different virtual IP.
Virtual IP Network Mask : Normally this is 255.255.255.255, indicating a single IP address (the Virtual IP Address above).
You can actually cluster subnets, but this is outside of the scope of this tutorial.
Device : The Virtual IP address needs to be assigned to a « virtual network interface », which can be named more or less anything, but generally follows the format « ethN:X », where N is the physical device, and X is an incremental numeric identifier. For example, if your physical interface is « eth0 », and this is the first virtual interface, then it would be named « eth0:1 ».
If and when you set up multiple virtual interfaces, it is important to not mix these up. Piranha has no facility for sanity checking these identifiers, so you may wish to track them yourself in a Google document or something.
Scheduling : There are a number of options here, and some are very different from one another. For the purposes of this exercise, we’ll pick a simple, yet relatively effective scheduler called « Least-Connections ».
This does exactly what it sounds like : when a new request is made to the virtual service, the LB will check to see how many connections are open to each of the real servers in the cluster, and then route the connection to the machine with the least connections. Congrats, you’ve now got load balancing !
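To make the scheduler's decision concrete, here is a minimal sketch of the least-connections choice in plain shell. It only illustrates the idea and is not how IPVS does it inside the kernel ; the function name « pick_least_conn » and the connection counts are made up for the example.

```shell
# pick_least_conn: read "server active_connections" pairs on stdin and
# print the server with the fewest connections (the first one wins a tie)
pick_least_conn() {
    awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; best = $1 } END { print best }'
}

# hypothetical snapshot of the open connections on each real server
printf '%s\n' "B01 12" "B02 7" | pick_least_conn
```

With the sample numbers above, the next request would go to B02, since it has fewer connections open.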
Finally, let’s add some real servers into the cluster. From the « Edit » screen we’re already in, click on the « Real Server » sub-tab.
Name : This is the hostname of the real server. In our example, we’d put B01.
Address : The IP address of the real server. In our example, for B01, we’d put 192.168.0.38 .
Port : Generally speaking this can be left empty, as it will inherit the Port value defined in the virtual service (in this case, 80).
A value would be required here if your real server is running the service on a different port than that specified in the virtual service ; if your real server is running a web service on port 8080 instead of 80, for example.
Weight : Despite the name, this value is used in various different ways depending on which Scheduler you selected for the virtual service. In our example, however, this field is irrelevant, and can be left empty.
You can apply and add as many real servers as you like, one at a time, in this fashion. Go ahead and set up B02 (or whatever your equivalent is) now.
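For the curious, the virtual service and real servers entered above correspond roughly to a handful of ipvsadm commands. The sketch below only prints them instead of running them (running them needs root and the ipvsadm package) ; the helper name « lvs_commands » is made up, and 192.168.0.39 is assumed to be the address of B02.

```shell
# print (not run) the ipvsadm commands for one virtual service
#   -A  add a virtual service     -t  TCP service at address:port
#   -s  scheduler (lc = least-connection)
#   -a  add a real server         -r  address of the real server
#   -g  direct routing ("gatewaying")
lvs_commands() {
    vip_port=$1; shift
    echo "ipvsadm -A -t $vip_port -s lc"
    for rip in "$@"; do
        echo "ipvsadm -a -t $vip_port -r $rip -g"
    done
}

lvs_commands 192.168.0.40:80 192.168.0.38 192.168.0.39
```

In normal operation you would not type these yourself, since the Piranha tools build the table from the saved configuration ; « ipvsadm -L -n » is handy for checking what is actually loaded.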
If you’re wondering when the secondary LB is going to be configured, well, wonder no longer : the future is now. Luckily, this step is very, very easy. From the secondary :
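The command that belonged here is missing from this copy of the post. One way to do it, assuming that Piranha keeps its configuration in « /etc/sysconfig/ha/lvs.cf » (the usual location on CentOS) and that root can SSH from A02 to A01, is simply to copy the file across. The sketch below builds and prints the command so that it can be checked before it is run :

```shell
# assumption: Piranha's configuration lives at this path on both load balancers
LVS_CONF=/etc/sysconfig/ha/lvs.cf
PRIMARY=A01

# meant to be run on the secondary (A02): pull the file from the primary
SYNC_CMD="scp root@${PRIMARY}:${LVS_CONF} ${LVS_CONF}"
echo "$SYNC_CMD"
# once it looks right, run it with:  $SYNC_CMD
```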
Phew ! That was a lot of work. After consuming a suitable refreshment, let’s move on to the final few steps. Earlier i mentioned that there were some other items that we’d need to learn about besides the Piranha interface – « Pulse » is one such item. Pulse, as a tool, is in the same family as some other tools you may have heard of, such as « Heartbeat », « Keepalived », or « OpenAIS ».
The basic idea of all of these tools is simple : to provide a « failover » facility between a group of two or more machines. In our example, our primary LB is the one that is normally active, but in the case that it fails for some reason, we’d like our secondary to kick in and take over the responsibilities of the unavailable primary – this is what Pulse does. Each of the load balancers runs an instance of « pulse » (the executable, not the package), which behaves in this fashion :
Each LB sends out a broadcast packet (a pulse, as it were) stating that they are alive. As long as the active LB (commonly the primary) continues to announce itself, everybody is happy and nothing changes.
If, however, the inactive LB (commonly the secondary) server notices that it hasn’t seen any pulses from the active LB lately, it assumes that the active LB has failed.
The secondary, formerly inactive LB, then becomes active. This state is maintained until such a time as the primary starts announcing itself again, at which point the secondary demotes itself back to inactivity.
The difference between the active and the inactive server is actually very simple : the active server is the one with the virtual addresses assigned to it (remember those, from the Virtual Servers tab in Piranha?).
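The decision the inactive LB makes can be sketched in a few lines of shell. This is only an illustration of the logic described above, not Pulse's actual code ; the function name and the 18 second limit are assumptions made for the example.

```shell
# peer_is_dead LAST_SEEN NOW DEADTIME
# succeeds when the peer has been silent for longer than DEADTIME seconds
peer_is_dead() {
    [ $(( $2 - $1 )) -gt "$3" ]
}

DEADTIME=18   # seconds; hypothetical value for the example

# last pulse seen at t=100, it is now t=125
if peer_is_dead 100 125 "$DEADTIME"; then
    echo "no pulse for 25s : claim the virtual addresses"
fi
```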
Let’s go ahead and start it up (on the primary LB first, then on the secondary) :
[root@A01 ~]# chkconfig pulse on
[root@A01 ~]# service pulse start
an internet ballgame drama – in 5 parts
You may have noticed that we haven’t even touched the « real » servers (i.e. the web servers) yet. Now is the time. As it so happens, there’s only one major step that relates to the real servers, but it’s a very, very important one : defining VIPs, and then ensuring that the web servers are OK with the routing voodoo that we’re using to make this whole load balancing infrastructure work. The solution is simple, but the reason for the solution may not be immediately obvious – for that, we need to take a look at the IP layer of each packet (neat!). First, though, let’s run through a series of little stories :
Alice has a ball that she’d like to pass to Bob, so she tosses it his way.
Bob catches the ball, sees that it’s from Alice, and throws it back at her. What great fun !
Now imagine that Alice and Bob are hanging out with a few hundred million of their closest friends – but they still want to play ball.
Alice writes Bob’s name on the ball and hands it to the person next to her, who then passes it to somebody else, and so forth.
Eventually the ball gets passed to Bob. Unfortunately for Bob, he has no idea where it came from, so he can’t send it back.
The solution is obvious :
Alice writes « From : Alice, To : Bob » on the ball, then passes it along.
Bob gets the ball, and switches the names around so that it says « From : Bob, To : Alice », and sends it back.
OK, so, those were some nice stories, but how do they apply to our Load Balancing setup ? As it turns out, all we need to do is throw in some tubes, and we’ve described one of the basic functions of the Internet Protocol – that the source and destination IP addresses of a given packet are part of the IP layer of said packet. Let’s complicate it by one more level :
Alice prepares the ball as above, and sends it flying.
Bob, who’s been avoiding Alice since things got weird at the bar last week-end, gets the ball and passes it along to Charles.
Charles – who’s had a not-so-secret crush on Alice since high school – happily writes « From : Charles, To : Alice », and tosses it away.
Alice receives the ball, but much to her surprise, it’s from Charles, and not Bob as she expected. Awkward !
With that last story in mind, let’s take another look at figure 005 above (go ahead, i’ll wait). Notice anything ? That’s right – the original source sends their packet off, but then receives a response from a different machine than they expected. This does not work – it violates some basic rules about how communications are supposed to function on the Internet. For the thrilling conclusion – and a solution to the problem – let’s return to our drama :
As it turns out, Bob is a player : he gets so many balls from so many women that he needs to keep track of them all in a little notebook.
When Bob gets Alice’s ball he passes it to Charles, then he records where it came from and who he gave it to in his notebook.
Charles – in an attempt to get into Bob’s circle of friends – agrees to write « From : Bob, To : Alice » on the ball, then sends it back.
Alice – expecting a ball from Bob – is happy to receive her Bob-signed spheroid.
Bob then gets another ball from Denise, passes it to Edward, and records this relationship as well.
Edward – a sycophant if ever there was – prepares the ball in the same fashion as Charles, and fires it back.
Of course, the more balls Bob has to deal with, the more helpers he can use to spread the work around. Now, as you’ve no doubt pieced together, Alice and Denise are any given sources on the Internet, Bob is our LB, and Charles & Edward are the web servers. Now, instead of writing people’s names on balls, we should now make the mental leap to IP addresses in packets. With our tables of hostnames and addresses in mind, let’s consider the following example :
The source sends a request for a web page.
The source IP is « 10.1.2.3 », and the destination IP is « 192.168.0.40 » (the VIP for WWW).
The packet is sent to A01, which is currently active, and thus has the VIP for WWW assigned to it.
A01 then forwards the packet to B02 (by chance), which crafts a response packet.
The RIP for B02 is « 192.168.0.39 », but instead of using that, the source IP is set to « 192.168.0.40 », and the destination is « 10.1.2.3 ».
The source, expecting a response from « .40 », indeed receives a packet that appears to be from WWW. Done and done.
The theory is sound, but how can we implement this in practice ? As i said – it’s simple ! We simply add a dummy interface to each of the web servers that has the same address as the VIP, which will allow the web servers to interact with packets properly. This is best done by creating a simple sysconfig entry on each of the web servers for the required dummy interface, as follows :
[root@B01 ~]# vim /etc/sysconfig/network-scripts/ifcfg-lo:0
# for VIP
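The rest of the file is cut off in this copy of the post. A minimal sketch of what such an entry usually contains, assuming the VIP of 192.168.0.40 from our example (the exact set of keys the author used is not known) :

```shell
# /etc/sysconfig/network-scripts/ifcfg-lo:0 (sketch; the values are assumptions)
# an alias on the loopback device
DEVICE=lo:0
# the VIP for WWW
IPADDR=192.168.0.40
# a single address, matching the mask used for the virtual server
NETMASK=255.255.255.255
ONBOOT=yes
```

Once the file is in place, « ifup lo:0 » brings the interface up without a reboot.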
all together now
The « lo » indicates that it’s a « Loopback address », which is best described by Wikipedia :
Such an interface is assigned an address that can be accessed from management equipment over a network but is not assigned to any of the real interfaces on the device. This loopback address is also used for management datagrams, such as alarms, originating from the equipment. The property that makes this virtual interface special is that applications that use it will send or receive traffic using the address assigned to the virtual interface as opposed to the address on the physical interface through which the traffic passes.
In other words, it’s a fake IP that the machine can use to make packets anyways. Now, there is a known scenario in which a machine with a given loopback address will, in this particular situation, cause confusion on the network about which interface actually « owns » a given address. It has to do with ARP, and interested readers are encouraged to Google for « LVS ARP problem » for more technical details – for now, let’s just get right to the solution. On each of the real servers, we’ll need to edit « sysctl.conf » :
[root@B01 ~]# vim /etc/sysctl.conf
# this file already has stuff in it, so put this at the bottom
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
At this point we’ve now explored each key item that is necessary to make this whole front-end infrastructure work, but it is perhaps not quite clear how it all works together. So, let’s take a step back for a moment and review :
There are four servers : two are load balancers, and two are web servers.
Of the two load balancers, only one is active at any given time ; the other is a backup.
Every DNS entry for the sites on the web servers points to one actual IP address.
This IP address is called the « Virtual IP ».
The VIP is claimed by the active load balancer, meaning that when a request is made for a given website, it goes to the active LB.
The LB then re-directs the request to an actual web server.
The re-direction can be random, or based on varying levels of logical decision making.
The web server will respond directly – the LB is not a proxy.
Great ! Now, what software runs where, and why ?
The load balancers use LVS in order to manage the relationship between VIPs and RIPs.
Pulse is used between the LBs in order to determine who is alive, and which one is active.
An optional (but useful) web interface to both LVS and Pulse comes in the form of Piranha, which runs on a dedicated instance of Apache HTTPd on port 3636.
And that, my friends, is that ! If you have any questions, feel free to comment below (remember to subscribe to the RSS feed for responses). Happy balancing !
oh, p.s., one last thing…
In case you’re wondering how to keep your LVS configuration file synchronised across both of the load balancers, one way to do it would be with a network-aware filesystem – POHMELFS, for example. 😉
Hi everybody – here’s a super-quick update for you concerning « ethtool », and how to use it to set options in Fedora properly. Ethtool is a great little tool that can be used to configure all manner of network interface related settings – notably the speed and duplex of a card – on the fly and in real time. One of the most common situations where ethtool would be used is at boot time, especially for cards which are finicky, or have buggy drivers, or poor software support, or.. well, you get the idea.
Times were that if you needed to use ethtool to configure a NIC setting at boot time, you’d just stick the given command line into « rc.local », or perhaps another runlevel script, and forget about it. The problem with this approach is (at least) twofold :
Frankly, it’s easy to forget about something like this, which makes future support / debugging of network issues more of a pain.
Anything that automatically modifies the runlevel script (such as updates to the parent package) may destroy your local edits.
In order to deal with these issues, and to standardise the implementation of the ethtool-at-boot technique, the Red Hat (and, thus, Fedora) maintainers introduced an option for defining ethtool parameters on a per-interface basis via the standard « sysconfig » directory system. Now, this actually happened a number of years ago, but the implementation was poorly announced (and poorly documented at the time), and thus, even today a lot of users and administrators don’t seem to know about it.
Now, there’s a very good chance that you already know this, but just to refresh your memory : in the sysconfig directory, there is another directory called « network-scripts », which in turn contains a series of files named « ifcfg-eth? », where « ? » is a device number. Each network device has a configuration file associated with it ; for example, ifcfg-eth1 is the configuration file for the « eth1 » device.
In order to specify the ethtool options for a given network interface, simply edit the associated configuration file, and add a « ETHTOOL_OPTS » line. For example :
ETHTOOL_OPTS="autoneg off speed 100 duplex full"
Now, whenever the network service initialises that interface, ethtool will be run with the specified options. Simple, easy, and best of all, standardised. What could be better ?
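As a rough sketch of what happens behind the scenes : the network scripts read the interface's file and hand ETHTOOL_OPTS to « ethtool -s ». The snippet below imitates that using a throwaway copy of the configuration rather than the real file ; the temporary file and the extra keys in it are assumptions for the example, and the command is only printed, not run.

```shell
# build a throwaway ifcfg file so that nothing on the real system is touched
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
DEVICE=eth1
ONBOOT=yes
ETHTOOL_OPTS="autoneg off speed 100 duplex full"
EOF

# ifcfg files are plain shell variable assignments, so they can be sourced
. "$cfg"
ETHTOOL_CMD="ethtool -s $DEVICE $ETHTOOL_OPTS"
echo "$ETHTOOL_CMD"
rm -f "$cfg"
```

After restarting the interface, running « ethtool eth1 » is a quick way to confirm that the speed and duplex actually took effect.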
UPDATE: This article was written back in 2009. According to a commenter below, Busybox has been replaced by Bash in RHEL 6; perhaps Fedora as well?
Bonjour my geeky friends ! 🙂 As you are likely aware, it is now summer-time here in the northern hemisphere, and thus, i’ve been spending as much time away from the computer as possible. That said, it’s been a long time, i shouldn’t have left you, without a strong beat to step to.
Now, if you’re not familiar with kickstarting, it’s basically just a way to automate the installation of an operating environment on a machine – think hands-free installation. Anaconda is the OS installation tool used in Fedora, RedHat, and some other Linux OS’s, and it can be used in a kickstart capacity. For those of you looking for an intro, i heavily suggest reading over the excellent documentation at the Fedora project website. The kickstart configuration process could very easily be a couple of blog entries on its own (which i’ll no doubt get around to in the future), but for now i want to touch on one particular aspect of it : complex partition schemes.
how it is
The current method for declaring partitions is relatively powerful, in that all manner of basic partitions, LVM components, and even RAID devices can be specified – but where it fails is in the creating of the actual partitions on the disk itself. The options that can be supplied to the partition keywords can make this clunky at best (and impossible at worst).
A basic example of a partitioning scheme that requires nothing outside of the available functions :
Great, no problem – we can easily define that in the kickstart :
part /boot --asprimary --size=128
part / --asprimary --size=20000
part /var/log --asprimary --size=20000
part /home --size=400000
part /opt --size=51680
part swap --size=8192
But what happens if we want to use this same kickstart on another machine (or, indeed, many other machines) that don’t have the same disk size ? One of the options that can be used with the « part » keyword is « --grow », which tells Anaconda to create as large a partition as possible. This can be used along with « --maxsize= », which does exactly what you think it does.
Continuing with the example, we can modify the « /home » partition to be of a variable size, which should do us nicely on disks which may be smaller or larger than our original 500GB unit.
part /home --size=1024 --grow
Here we’ve stated that we’d like the partition to be at least a gig, but that it should otherwise be as large as possible given the constraints of both the other partitions, as well as the total space available on the device. But what if you also want « /opt » to be variable in size ? One way would be to grow both of them :
part /home --size=1024 --grow
part /opt --size=1024 --grow
Now, what do you think that will do ? If you guessed « grow both of them to half the total available size each », you’d be correct. Maybe this is what you wanted – but then again, maybe it wasn’t. Of course, we could always specify a maximum ceiling on how far /opt will grow :
part /opt --size=1024 --maxsize=200000 --grow
That works, but only at the potential expense of /home. Consider what would happen if this was run against a 250GB disk ; the other (static) partitions would eat up some 48GB, /opt would grow to the maximum specified size of 200GB, and /home would be left with the remaining 2GB of available space.
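The arithmetic behind that claim can be checked with a few lines of shell. This follows the reasoning above (that /opt takes everything up to its ceiling) and works in MB, treating 250GB as 250000MB ; how Anaconda really shares space between growing partitions may differ from one version to the next.

```shell
# all sizes in MB, taken from the kickstart lines above
DISK=250000
STATIC=$(( 128 + 20000 + 20000 + 8192 ))   # /boot, /, /var/log and swap
OPT_MAX=200000

# what is left over for the two growing partitions
LEFT=$(( DISK - STATIC ))

# /opt takes as much as it is allowed to, /home gets the remainder
if [ "$LEFT" -gt "$OPT_MAX" ]; then OPT=$OPT_MAX; else OPT=$LEFT; fi
HOME_MB=$(( LEFT - OPT ))

echo "static=${STATIC}MB opt=${OPT}MB home=${HOME_MB}MB"
```

That leaves /home with 1680MB, which is the « remaining 2GB » mentioned above, give or take.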
If we were to add more partitions into the mix, the whole thing would become an imprecise mess rather quickly. Furthermore, we haven’t even begun to look at scenarios where there may (or may not) be more than one disk, nor any fun tricks like automatically setting the swap size to be the same as the actual amount of RAM (for example). For these sorts of things we need a different approach.
the magic of pre, the power of parted
The kickstart configuration contains a section called « %pre », which should be familiar to anybody who’s dealt with RPM packaging. Basically, the pre section contains text which will be parsed by the shell during the installation process – in other words, you can write a shell script here. Fairly be thee warned, however, as the shell spawned by Anaconda is « BusyBox », not « bash », and it lacks some of the functionality that you might expect. We can use the %pre section to our advantage in many ways – including partitioning. Instead of using the built-in functions to set up the partitions, we can do it ourselves (in a manner of speaking) using « parted ».
Parted is, as you might expect, a tool for editing partition data. Generally speaking it’s an interactive tool, but one of the nifty features is the « scripted mode », wherein partitioning commands can be passed to Parted on the command-line and executed immediately without further intervention. This is very handy in any sort of automated scenario, including during a kickstart.
We can use Parted to lay the groundwork for the basic example above, wherein /home is dynamically sized. Initially this will appear inefficient, since we won’t be doing anything that can’t be accomplished by using the existing Kickstart functionality, but it provides an excellent base from which to do more interesting things. What follows (until otherwise noted) are text blocks that can be inserted directly into the %pre section of the kickstart config :
# clear the MBR and partition table
dd if=/dev/zero of=/dev/sda bs=512 count=1
parted -s /dev/sda mklabel msdos
This ensures that the disk is clean, so that we don’t run into any existing partition data that might cause trouble. The « dd » command overwrites the first bit of the disk, so that any basic partition information is destroyed, then Parted is used to create a new disk label.
That little line gives us the total size of the disk, and assigns it to a variable named « TOTAL ». There are other ways to obtain this value, but in keeping with the spirit of using Parted to solve our problems, this works. In this instance, « awk » and « cut » are used to extract the string we’re interested in. Continuing on…
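The line in question is missing from this copy of the post. A minimal sketch of one way to get the same value, assuming that Parted prints a header line such as « Disk /dev/sda: 500108MB » when asked for MB units ; the sample output below is made up so that the parsing can be tried without touching a disk, and the helper name is hypothetical.

```shell
# disk_total_mb: pull the size in MB out of "parted ... unit MB print" output
disk_total_mb() {
    awk '/^Disk \/dev\// { print $3 }' | cut -d M -f 1
}

# in %pre this would be:
#   TOTAL=$(parted -s /dev/sda unit MB print | disk_total_mb)
TOTAL=$(printf '%s\n' "Model: ATA EXAMPLE (scsi)" "Disk /dev/sda: 500108MB" | disk_total_mb)
echo "$TOTAL"
```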
# calculate start points
Here we determine the starting position for the swap and /opt partitions. Since we know the total size, we can subtract 8GB from it, and that gives us where the swap partition starts. Likewise, we can calculate the starting position of /opt based on the start point of swap (and so forth, were there other partitions to calculate).
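The calculation itself did not survive in this copy, but the description is enough for a sketch. It assumes that TOTAL already holds the disk size in MB (500000 is used here purely for illustration) and that /opt keeps the 51680MB it had in the static layout ; the variable names are made up.

```shell
# assumption: TOTAL holds the disk size in MB
TOTAL=500000

# swap is the last 8GB of the disk, and /opt is the 51680MB just before it
SWAP_START=$(( TOTAL - 8192 ))
OPT_START=$(( SWAP_START - 51680 ))

echo "opt starts at ${OPT_START}MB, swap starts at ${SWAP_START}MB"
```

On the 500000MB example this puts /opt at 440128MB, which is exactly where /home ends in the static layout (128 + 20000 + 20000 + 400000).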
« ext3 » : the type of filesystem (there are a number of possible options, but ext3 is pretty standard).
Notice that the « extended » and « swap » definitions do not contain a filesystem type – it is not necessary.
« start# end# » : the start and end points, expressed in MB.
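The Parted commands that this list describes are missing from this copy. As a sketch of their general shape, « parted -s DEVICE mkpart TYPE [FS] START END », here is a helper that prints the commands for our layout rather than running them. The start and end points follow from the sizes used earlier ; the helper name and the decision to print instead of execute are assumptions made for the example.

```shell
# print_layout TOTAL_MB: emit the parted commands for the example layout
print_layout() {
    total=$1
    swap_start=$(( total - 8192 ))
    opt_start=$(( swap_start - 51680 ))
    p="parted -s /dev/sda mkpart"
    echo "$p primary ext3 0 128"                    # /boot     -> sda1
    echo "$p primary ext3 128 20128"                # /         -> sda2
    echo "$p primary ext3 20128 40128"              # /var/log  -> sda3
    echo "$p extended 40128 $total"                 # container -> sda4
    echo "$p logical ext3 40128 $opt_start"         # /home     -> sda5
    echo "$p logical ext3 $opt_start $swap_start"   # /opt      -> sda6
    echo "$p logical $swap_start $total"            # swap      -> sda7
}

print_layout 500000
```

Dropping the « echo » in front of each line (or piping the output to « sh ») would turn the dry run into the real thing.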
Finally, we must still declare the partitions in the usual way. Take note that this does not occur in the %pre section – this goes in the normal portion of the configuration for defining partitions :
part /boot --onpart=/dev/sda1
part / --onpart=/dev/sda2
part /var/log --onpart=/dev/sda3
part /home --onpart=/dev/sda5
part /opt --onpart=/dev/sda6
part swap --onpart=/dev/sda7
As i mentioned when we began this section, yes, this is (so far) a remarkably inefficient way to set this particular basic configuration up. But, to re-iterate, this exercise is about putting the groundwork in place for much more interesting applications of the technique.
mo’ drives, mo’ better
Perhaps some of your machines have more than one drive, and some don’t. These sorts of things can be determined, and then reacted upon dynamically using the described technique. Back to the %pre section :
# Determine number of drives (one or two in this case)
In this case, we’re using a built-in function called « list-harddrives » to help us determine which drive or drives are present, and then assign their device identifiers to variables. In other words, if you have an « sda » and an « sdb », those identifiers will be assigned to « $d1 » and « $d2 », and if you just have an sda, then $d2 will be empty.
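The snippet being described is missing from this copy. A sketch of the usual approach, assuming that « list-harddrives » prints one « device size » pair per line (for instance « sda 476940.0 ») ; sample output is substituted below so that it can be tried outside of the installer.

```shell
# in %pre this would be:  drives=$(list-harddrives)
drives="sda 476940.0
sdb 476940.0"

# split the output into positional parameters: name size name size ...
# (left unquoted on purpose, so that the shell splits on whitespace)
set -- $drives
numd=$(( $# / 2 ))
d1=$1
d2=$3

echo "found $numd drive(s) : d1=$d1 d2=$d2"
```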
This gives us some interesting new options ; for example, if we wanted to put /home on to the second drive, we could write up some simple logic to make that happen :
# if $d2 has a value, it's that of the second device.
if [ ! -z "$d2" ]
then HOMEDEVICE=$d2
else HOMEDEVICE=$d1
fi
part /home --size=1024 --ondisk=/dev/$HOMEDEVICE --grow
That, of course, assumes that the other partitions are defined, and that /home is the only entity which should be grown dynamically – but you get the idea. There’s nothing stopping us from writing a normal shell script that could determine the number of drives, their total size, and where the partition start points should be based on that information. In fact, let’s examine this idea a little further.
the size, she is dynamic !
Instead of trying to wrangle the partition sizes together with the default options, we can get as complex (or as simple) as we like with a few if statements, and some basic maths. Thinking about our layout then, we can express something like the following quite easily :
If there is one drive that is at least 500 GB in size, then /opt should be 200 GB, and /home should consume the rest.
If there is one drive that is less than 500 GB, but more than 250 GB, then /opt and /home should each take half.
If there is one drive that is less than 250 GB, then /home should take two-thirds, and /opt gets the rest.
# $TOTAL from above...
if [ $TOTAL -ge 512000 ]
elif [ $TOTAL -lt 512000 ] && [ $TOTAL -ge 256000 ]
# get the dynamic space total, which is between where /var/log ends, and swap begins
elif [ $TOTAL -lt 256000 ]
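The bodies of those branches are cut off in this copy, so here is a self-contained sketch of the three rules instead. It assumes the fixed partitions from earlier (40128MB in front for /boot, / and /var/log, and 8192MB of swap at the end) and takes 200 GB to be 204800MB, in line with the 512000 and 256000 thresholds ; the function and variable names are made up.

```shell
# sizes_for TOTAL_MB: print "home_mb opt_mb" following the three rules above
sizes_for() {
    total=$1
    # the dynamic space lies between the end of /var/log and the start of swap
    dyn=$(( total - 8192 - 40128 ))
    if [ "$total" -ge 512000 ]; then
        opt=204800
        home=$(( dyn - opt ))
    elif [ "$total" -ge 256000 ]; then
        home=$(( dyn / 2 ))
        opt=$(( dyn - home ))
    else
        home=$(( dyn * 2 / 3 ))
        opt=$(( dyn - home ))
    fi
    echo "$home $opt"
}

sizes_for 600000
sizes_for 300000
sizes_for 160000
```

The resulting sizes would then be turned into start and end points for Parted in the same way as before.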
Now, instead of having to create three different kickstart files, each describing a different scenario, we’ve covered it with one – nice !
At the end of the day, the possibilities are nearly endless, with the only restriction being that whatever you’d like to do has to be do-able in BusyBox – which, at this level, provides a lot of great functionality.
Stay tuned for more entries related to kickstarting, PXE-based installations, and so forth, all to come here on dan’s linux blog. Cheers !
Hello again ! Today’s quick tip concerns a software package called Dia, which is an open source tool (available for both Windows and Linux, as it goes) used to make diagrams, flowcharts, network maps, and so forth. It has its own file format (.dia), which is (obviously?) useful for saving the projects you’re working on, but less useful if you need to give the diagram to anybody else, either in print or electronic form.
Dia can export to a variety of formats including SVG, PNG, and EPS, but one export format that it lacks native support for is the venerable PDF, which has become a de facto standard for transmitting documents between diverse environments. There are many advantages and interesting aspects of the PDF format, not the least of which being that what you see on your screen is what you get when it’s printed. It is unfortunate, then, that Dia won’t spit out a PDF (even if you ask very nicely).
Of course, being that it’s so easy to print directly to PDF (via CUPS, for example) these days, having native support for PDF may not, at first, seem all that useful. Well, as it turns out, printing directly to PDF might not give you quite what you were looking for. In practice, you do get a PDF, but what appeared to be a modestly-sized diagram in Dia will turn out to be a multi-page monster in (virtually) printed form. As a general rule, this is not what you want.
In order to get a usable PDF we need to use an intermediate step between Dia and the final file. The idea, quite simply, is to export the diagram as one of the supported formats, then convert that file into a PDF. There are a number of options here, but for our purposes we’ll save the diagram as an EPS file, then use a quick little command-line tool called « epstopdf » to perform the conversion.
There’s a good chance that you don’t have epstopdf on your machine. If you’re using Ubuntu, you used to be able to install it easily via the APT packager, but these days the little conversion tool comes as part of a larger suite of tools called « texlive-extra-utils ». This suite is dependent on a number of other packages, so go ahead and install them all :
$ sudo apt-get install texlive-extra-utils
EDIT : In Ubuntu 10.04, the package is named « texlive-font-utils ».
Among many, many little items of interest, our target application will be installed. To use it, simply feed it the name of the EPS file as an argument :
$ epstopdf somediagram.eps
It will automatically output a PDF file of the same name. There you go – a nice, shiny PDF of your Dia diagram.
Hello again fair readers. Today i’m going to re-visit POHMELFS, which i introduced in an earlier blog post. I received a comment on that post which basically asked for more information on some of the more interesting (read : advanced) features of POHMELFS, such as distributed storage, and the like. Well, today is the day ! If you need a refresher, be sure to skim over my previous post, as we’re going to dive in now right where i left off last time.
patch for the win
One of the reasons that there was a bit of a delay between my last POHMELFS post and this one was because i hit a bug. Given that we’re working with staging-level code here, that’s to be expected – luckily, thanks to some quick work by Evgeniy Polyakov on the POHMELFS mailing list, there is still hope – hope in the form of a tasty little patch.
diff --git a/drivers/staging/pohmelfs/trans.c b/drivers/staging/pohmelfs/trans.c
index eab7868..bf7b09a 100644
@@ -467,7 +467,8 @@ int netfs_trans_finish_send(struct netfs_trans *t, struct pohmelfs_sb *psb)
- if (psb->active_state && (psb->active_state->state.ctl.prio >= st->ctl.prio))
+ if (psb->active_state && (psb->active_state->state.ctl.prio >= st->ctl.prio) &&
+ (t->flags & NETFS_TRANS_SINGLE_DST))
st = &psb->active_state->state;
err = netfs_trans_push(t, st);
Basically, this patch fixes a minor, but ultimately crippling bug related to writing to multiple servers. The details are not important – what’s important is that we apply the patch and keep the dream alive. First, you’ll need to copy and paste that block of code into a text file on one of the systems (in « ~/pohmel.diff », for example). Then, in order to apply the patch, we’ll need to use a standard tool called (appropriately) « patch » :
Now, just as we did last time, we must play the kernel and module compilation and installation game (fun!). If you need a refresher on how to do this, just go back to my previous post. Note that this time around, the whole process will be much faster, since only the POHMELFS components need to be recompiled – everything else will stay the same. As a result, you can skip the part where you archive the entire kernel tree and copy it over – instead, just patch and recompile on each server and the client. It’s your call.
Once that’s out of the way we’ll reboot, and then it’s off to the races.
a new challenger appears !
It’s now time to add a third machine into the mix (« host_147 » in this case). Using this new box, we’ll create a simple sort of setup which is, in fact, quite representative of how things might work in the real world : two storage servers and a client. As you no doubt recall, one of the neat features of POHMELFS is that it can be employed in a parallel fashion, meaning that a file which appears to the client to be in one place is actually located on more than one storage medium. A general way of describing these ideas is by using the terms « logical » and « physical » ; the logical medium is the filesystem that the client sees, and the physical medium is the actual hard drive upon which the data is stored.
In this case, host_75 and host_166 will be the servers, each containing one copy of the data on their respective physical mediums (i.e. hard drives), and host_147 will be our client, which will access the data via the logical medium (i.e. the POHMELFS export). The new machine was set up in the same way as host_166 was, so we’ll skip over that, and get right to the good stuff.
A new directory should be created on each of the machines : « /opt/pohtest ». This will serve as the export directory on the servers, and the mount directory on the client – don’t put any data in it yet, though.
On the servers, we’ll initiate the server daemon. Unlike our first test, where we just let the defaults ride, this time around we’ll configure things a bit more intelligently :
In the above example, « -r » defines the directory to export, « -l » is where to write the logs, and « -d » puts the process into the background, instead of on our console as before. This is normally how things would work, so it’s good to get used to it now. Now, we can follow the log files on each machine by using « tail » :
[root@host_75 (and host_166) ~]# tail -f /var/log/pohmelfs.log
Server is now listening at 0.0.0.0:1025.
With the servers up and ready to go, we can now turn our attention to the client. Don’t forget to load the pohmelfs module first !
[root@host_147 ~]# modprobe pohmelfs
[root@host_147 ~]# cfg -A add -a 192.168.0.75 -p 1025 -i 1
Now we mount. It’s important that we mount before we attempt to add the second server into the mix – trying to do it ahead of time will only result in terrible, crippling failure.
[root@host_147 ~]# mount -t pohmel -o idx=1 none /opt/pohtest/
No output means it worked (as usual), so let’s verify :
[root@host_147 opt]# cfg -A add -a 192.168.0.166 -p 1025 -i 1
Now we must wait at least 5 seconds for the synchronisation to occur. In reality it’s shorter than that, but 5 seconds is an easy number to remember, and it’s safe. So far this looks exactly the same as before, but there’s a bit of a conceptual twist – as you can see, both of those new add statements have the same index (as denoted by the -i). This means that they’re grouped together as part of the same logical medium. We can check on this by using the « show » action :
[root@host_147 ~]# cfg -A show -i 1
Config Index = 1
Family Server IP Port
AF_INET 192.168.0.75 1025
AF_INET 192.168.0.166 1025
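If you’d rather check this from a script than by eye, a small sketch is to count the server lines in the « show » output. The helper name « count_servers » is hypothetical, and it assumes the output format shown above, where each server line begins with « AF_INET ».

```shell
# count_servers is a hypothetical helper: it reads "cfg -A show" output on
# stdin and prints how many server entries (AF_INET lines) it contains.
count_servers() {
    awk '$1 == "AF_INET" { n++ } END { print n + 0 }'
}

# Usage on the client:
#   cfg -A show -i 1 | count_servers
```

With the two servers configured as above, the count should be 2 ; anything less suggests that one of the add statements didn’t take.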
Everything seems on the up and up so far, so we can go ahead and try our first mount. A series of options will be passed to the mount line, notably « idx=1 », which means index 1 (as seen above) – this is very important to specify, as without it, POHMELFS won’t be able to determine which logical group you’re talking about.
And if we take a look at the log output on the servers, we’ll see that the client connection has been accepted. Both of the logs should show the accepted line, but with different port numbers (the trailing digits at the end) :
Accepted client 192.168.0.147:48277.
There are other diagnostics we can run to take a look at what we’ve got running. At this stage they won’t tell us anything we don’t already know, but it will give us some practice with the tools and data, so that when the time comes to debug problems down the road, we’ll be ready.
For example, POHMELFS will write some handy information to « mountstats », which is exactly what it sounds like :
It’s not lined up very nicely, but the interesting column right now is « active », which lists « 1 » in both cases, meaning the connections are open. The « permissions » column lists « 3 » for both nodes which, in this case, means that they’re both available for reading and writing (as opposed to being read-only or write-only, which are also valid options).
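To make the « permissions » value a little less magic, here is a minimal sketch that decodes it as a two-bit mask. Note the assumption : it treats bit 1 as read and bit 2 as write, which fits « 3 » meaning read and write, but that mapping is not confirmed anywhere in this post, and the name « decode_perm » is made up for the example.

```shell
# decode_perm is a hypothetical helper. ASSUMPTION: 1 = read, 2 = write,
# so 3 (both bits set) = read and write.
decode_perm() {
    perm="$1"
    case $(( perm & 3 )) in
        3) echo "read-write" ;;
        1) echo "read-only" ;;
        2) echo "write-only" ;;
        *) echo "none" ;;
    esac
}

# Usage:
#   decode_perm 3
```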
but will it blend ?
Accepting the connection is one thing – successfully reading and writing files is entirely another. Let’s do some tests ; first we’ll use the client to create an empty file in the mount :
[root@host_147 ~]# cd /opt/pohtest/
[root@host_147 pohtest]# touch FILE
[root@host_147 pohtest]# ls
Great, now let’s take a look at our servers :
[root@host_166 pohtest]# ls -l
-rw-r--r-- 1 root root 0 2009-07-06 16:58 FILE
And the other :
[root@host_75 ~]# ls -l /opt/pohtest/
-rw-r--r-- 1 root root 0 2009-07-06 16:46 FILE
Now, during my limited tests, i noticed a small lag time between my manipulations on the client, and when those actions were reflected on the servers. At this stage of the game i’m not sure whether that’s normal or not, or exactly what’s causing it – so don’t be alarmed if you see a small lag as well. I’ll be sure to post further updates on this point once i’ve got more information.
Update : As per Evgeniy on the mailing list :
This delay is not a bug, but feature - POHMELFS has local cache on
clients and data written on client is stored in that cache first and
then flushed to the server when client is under memory pressure or when
another one requests updated but not yet flushed data.
To force client to flush the data one can 'sync' on client or use
'flush' utility on the server. The latter will invalidate data on the
client (which implies it to be flushed to the server first), so server
update will become visible next time client reads that data.
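Following that advice, a minimal sketch for the client side is to write and then call the standard « sync » command straight away, so that the data doesn’t sit in the local cache. The helper name « write_and_flush » is hypothetical ; the server-side « flush » utility is not shown here, since its exact usage isn’t covered in this post.

```shell
# write_and_flush is a hypothetical helper, not part of POHMELFS.
write_and_flush() {
    target="$1"
    shift
    # Write the remaining arguments as a single line, then ask the kernel
    # to flush cached writes with the standard sync command.
    printf '%s\n' "$*" > "$target" && sync
}

# Usage on the client:
#   write_and_flush /opt/pohtest/FILE hello
```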
how not to do it
Let’s do another little test. On one of the servers, we’ll perform a manipulation in the POHMELFS export directory :
Great, but if we take a look at the other server :
[root@host_166 ~]# ls -l /opt/pohtest/
-rw-r--r-- 1 root root 5 2009-07-06 16:59 FILE
And the client :
[root@host_147 ~]# ls -l /opt/pohtest/
-rw-r--r-- 1 root root 5 2009-07-06 20:47 FILE
We notice that it’s not there. Why ? Unfortunately, like so much bureaucracy, we didn’t go through the proper channels. Recall that our client has certain software running on it that allows it to speak to both servers, and that the mountpoint uses that software to ensure consistency across the shared filesystem. In the example above, we wrote directly to the underlying filesystem of the server – completely avoiding said software – and thus POHMELFS had no way of knowing that a manipulation had occurred.
In short – if you want to keep things consistent, you must interact via a client. But what if we want our servers to be able to interact with the data as well ? Well, there’s nothing stopping us from setting up client processes on our servers, too. This, however, will have to wait for the next instalment.