workaround for slow shared folders in Virtualbox 3.x

Happy 2010 fair readers !  I hope that all is well with you and yours.  Let’s get right down to business : Virtualbox has a feature called « shared folders » that allows you to access the host OS’s file system from the guest OS, which is super useful, but not perfectly implemented.  In particular, there are known, documented performance issues in certain scenarios, such as when accessing a Linux host from a Windows guest (which, as you might imagine, is a pretty regular sort of activity).

One common (?) workaround is to install and configure Samba on the Linux host, then access it from the Windows guest like one would access any network server.  The problem is that installing and configuring Samba can be a pain in the, well, you know.  Furthermore, the connection will be treated like any other, and the traffic will travel up and down the network stack, which is fundamentally unnecessary since the data is, physically speaking, stored locally.

Instead, here’s another workaround, one that keeps things simple and solves the performance problem : just map the shared folder to a drive letter in the guest OS.  It’s that easy.  For those of us who aren’t too familiar with the Windows Explorer interface (me included, heh), there are tonnes of step-by-step instructions available.  For whatever reason (i suspect NetBIOS insanity), accessing the share via a mapped drive manages to avoid whatever condition creates the lag problems, resulting in rapid, efficient access to the underlying filesystem.
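
If you’d rather skip the pointy-clicky route, the same mapping can be done from a command prompt inside the Windows guest.  A minimal sketch, assuming a shared folder named « myshare » (Virtualbox exposes its shares under the special host name « vboxsvr ») :

REM map the shared folder "myshare" to drive letter X:
net use X: \\vboxsvr\myshare /persistent:yes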

Hope that helps – enjoy !

(complex) partitioning in kickstart

UPDATE: This article was written back in 2009. According to a commenter below, Busybox has been replaced by Bash in RHEL 6; perhaps Fedora as well?

Bonjour my geeky friends ! 🙂  As you are likely aware, it is now summer-time here in the northern hemisphere, and thus, i’ve been spending as much time away from the computer as possible.  That said, it’s been a long time, i shouldn’t have left you, without a strong beat to step to.

Now, if you’re not familiar with kickstarting, it’s basically just a way to automate the installation of an operating environment on a machine – think hands-free installation.  Anaconda is the OS installation tool used in Fedora, Red Hat, and some other Linux distributions, and it can be used in a kickstart capacity.  For those of you looking for an intro, i heavily suggest reading over the excellent documentation at the Fedora project website.  The kickstart configuration process could very easily be a couple of blog entries on its own (which i’ll no doubt get around to in the future), but for now i want to touch on one particular aspect of it : complex partition schemes.
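
If you’ve never laid eyes on one, a kickstart file is just a flat text file : a run of directives, a package list, and optional script sections.  A bare-bones sketch for orientation (illustrative only – not a complete, working config) :

# minimal kickstart sketch (illustrative, not complete)
install
lang en_US.UTF-8
keyboard us
rootpw changeme
timezone America/Montreal
bootloader --location=mbr

# partition directives (the subject of this post) go here
part /boot --asprimary --size=128

%packages
@core

%pre
# optional shell script, run before the installation proper begins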

how it is

The current method for declaring partitions is relatively powerful, in that all manner of basic partitions, LVM components, and even RAID devices can be specified – where it falls short is in the actual creation of the partitions on the disk itself.  The options that can be supplied to the partition keywords make complex layouts clunky at best (and impossible at worst).

A basic example of a partitioning scheme that requires nothing outside of the available functions :

DEVICE                 MOUNTPOINT               SIZE
/dev/sda               (total)                  500,000 MB
/dev/sda1              /boot/                       128 MB
/dev/sda2              /                         20,000 MB
/dev/sda3              /var/log/                 20,000 MB
/dev/sda5              /home/                   400,000 MB
/dev/sda6              /opt/                     51,680 MB
/dev/sda7              swap                       8,192 MB

Great, no problem – we can easily define that in the kickstart :

part  /boot     --asprimary  --size=128
part  /         --asprimary  --size=20000
part  /var/log  --asprimary  --size=20000
part  /home                  --size=400000
part  /opt                   --size=51680
part  swap                   --size=8192

But what happens if we want to use this same kickstart on another machine (or, indeed, many other machines) that don’t have the same disk size ?  One of the options that can be used with the « part » keyword is « --grow », which tells Anaconda to create as large a partition as possible.  This can be used along with « --maxsize= », which does exactly what you think it does.

Continuing with the example, we can modify the « /home » partition to be of a variable size, which should do us nicely on disks which may be smaller or larger than our original 500GB unit.

part  /home  --size=1024  --grow

Here we’ve stated that we’d like the partition to be at least a gig, but that it should otherwise be as large as possible given the constraints of both the other partitions, as well as the total space available on the device.  But what if you also want « /opt » to be variable in size ?  One way would be to grow both of them :

part  /home  --size=1024  --grow
part  /opt   --size=1024  --grow

Now, what do you think that will do ? If you guessed « grow both of them to half the total available size each », you’d be correct.  Maybe this is what you wanted – but then again, maybe not.  Of course, we could always specify a maximum ceiling on how far /opt will grow :

part  /opt  --size=1024  --maxsize=200000  --grow

That works, but only at the potential expense of /home.  Consider what would happen if this was run against a 250GB disk ; the other (static) partitions would eat up some 48GB, /opt would grow to the maximum specified size of 200GB, and /home would be left with the remaining 2GB of available space.

If we were to add more partitions into the mix, the whole thing would become an imprecise mess rather quickly.  Furthermore, we haven’t even begun to look at scenarios where there may (or may not) be more than one disk, nor any fun tricks like automatically setting the swap size to match the actual amount of RAM (for example).  For these sorts of things we need a different approach.

the magic of pre, the power of parted

The kickstart configuration contains a section called « %pre », which should be familiar to anybody who’s dealt with RPM packaging.  Basically, the pre section contains text which will be parsed by the shell during the installation process – in other words, you can write a shell script here.  Fairly be thee warned, however, as the shell spawned by Anaconda is « BusyBox », not « bash », and it lacks some of the functionality that you might expect.  We can use the %pre section to our advantage in many ways – including partitioning.  Instead of using the built-in functions to set up the partitions, we can do it ourselves (in a manner of speaking) using « parted ».

Parted is, as you might expect, a tool for editing partition data.  Generally speaking it’s an interactive tool, but one of the nifty features is the « scripted mode », wherein partitioning commands can be passed to Parted on the command-line and executed immediately without further intervention.  This is very handy in any sort of automated scenario, including during a kickstart.

We can use Parted to lay the groundwork for the basic example above, wherein /home is dynamically sized.  Initially this will appear inefficient, since we won’t be doing anything that can’t be accomplished by using the existing Kickstart functionality, but it provides an excellent base from which to do more interesting things.  What follows (until otherwise noted) are text blocks that can be inserted directly into the %pre section of the kickstart config :

# clear the MBR and partition table
dd if=/dev/zero of=/dev/sda bs=512 count=1
parted -s /dev/sda mklabel msdos

This ensures that the disk is clean, so that we don’t run into any existing partition data that might cause trouble.  The « dd » command overwrites the first bit of the disk, so that any basic partition information is destroyed, then Parted is used to create a new disk label.

TOTAL=`parted -s /dev/sda unit mb print free | grep Free | awk '{print $3}' | cut -d "M" -f1`

That little line gives us the total size of the disk in MB, and assigns it to a variable named « TOTAL ».  There are other ways to obtain this value, but in keeping with the spirit of using Parted to solve our problems, this works.  In this instance, « awk » and « cut » are used to extract the string we’re interested in.  Continuing on…
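
To see where that « $3 » comes from, here’s roughly what the parted command prints on a freshly-labelled disk (illustrative output – the columns shift a little between parted versions).  grep isolates the « Free Space » line, awk grabs the third field (the size), and cut chops off the trailing « MB » :

[root@host_75 ~]# parted -s /dev/sda unit mb print free
   ...
Number  Start   End       Size      Type  File system  Flags
        0.02MB  500000MB  500000MB        Free Space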

# calculate start points
let SWAP_START=$TOTAL-8192
let OPT_START=$SWAP_START-51680

Here we determine the starting position for the swap and /opt partitions.  Since we know the total size, we can subtract 8GB from it, and that gives us where the swap partition starts.  Likewise, we can calculate the starting position of /opt based on the start point of swap (and so forth, were there other partitions to calculate).
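
As an aside, this is also where the swap-equals-RAM trick teased earlier would slot in : instead of hard-coding 8192, we could read « MemTotal » out of /proc/meminfo (reported in kB).  A minimal sketch, assuming grep and awk are available in the install environment (they normally are) :

# size swap to match the installed RAM instead of a fixed 8192 MB
MEM_KB=`grep MemTotal /proc/meminfo | awk '{print $2}'`
let SWAP_SIZE=$MEM_KB/1024
let SWAP_START=$TOTAL-$SWAP_SIZE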

# partitions IN ORDER
parted -s /dev/sda mkpart primary ext3 0 128
parted -s /dev/sda mkpart primary ext3 128 20128
parted -s /dev/sda mkpart primary ext3 20128 40128
parted -s /dev/sda mkpart extended 40128 $TOTAL
parted -s /dev/sda mkpart logical ext3 40128 $OPT_START
parted -s /dev/sda mkpart logical ext3 $OPT_START $SWAP_START
parted -s /dev/sda mkpart logical $SWAP_START $TOTAL

The variables we populated above are used here in order to create the partitions on the disk.  The syntax is very simple :

  • « parted -s »  : run Parted in scripted (non-interactive) mode.
  • « /dev/sda » : the device (later, we’ll see how to determine this dynamically).
  • « mkpart » : the action to take (make partition).
  • « primary | extended | logical » : the type of partition.
  • « ext3 » : the type of filesystem (there are a number of possible options, but ext3 is pretty standard).
    • Notice that the « extended » and « swap » definitions do not contain a filesystem type – it is not necessary.
  • « start# end# » : the start and end points, expressed in MB.

Finally, we must still declare the partitions in the usual way.  Take note that this does not occur in the %pre section – this goes in the normal portion of the configuration for defining partitions :

part  /boot     --onpart=/dev/sda1
part  /         --onpart=/dev/sda2
part  /var/log  --onpart=/dev/sda3
part  /home     --onpart=/dev/sda5
part  /opt      --onpart=/dev/sda6
part  swap      --onpart=/dev/sda7

As i mentioned when we began this section, yes, this is (so far) a remarkably inefficient way to set up this particular basic configuration.  But, to re-iterate, this exercise is about putting the groundwork in place for much more interesting applications of the technique.

mo’ drives, mo’ better

Perhaps some of your machines have more than one drive, and some don’t.  These sorts of things can be determined, and then reacted upon dynamically using the described technique.  Back to the %pre section :

# Determine number of drives (one or two in this case)
set $(list-harddrives)
let numd=$#/2
d1=$1
d2=$3

In this case, we’re using a built-in function called « list-harddrives » to help us determine which drive or drives are present, and then assign their device identifiers to variables.  In other words, if you have an « sda » and an « sdb », those identifiers will be assigned to « $d1 » and « $d2 », and if you just have an sda, then $d2 will be empty.
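
For reference, « list-harddrives » prints two words per drive – the device name followed by its size in MB – which is why « $#/2 » yields the number of drives.  Illustrative output :

sda 500107.86
sdb 250059.35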

This gives us some interesting new options ; for example, if we wanted to put /home on to the second drive, we could write up some simple logic to make that happen :

# if $d2 has a value, it's that of the second device.
if [ ! -z "$d2" ]
then
  HOMEDEVICE=$d2
else
  HOMEDEVICE=$d1
fi

# snip...
part  /home  --size=1024  --ondisk=$HOMEDEVICE  --grow

That, of course, assumes that the other partitions are defined, and that /home is the only entity which should be grown dynamically – but you get the idea.  There’s nothing stopping us from writing a normal shell script that could determine the number of drives, their total size, and where the partition start points should be based on that information.  In fact, let’s examine this idea a little further.
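
One detail worth spelling out before we move on : the main section of a kickstart config is not parsed by the shell, so a variable like « $HOMEDEVICE » means nothing there.  The standard trick is to have the %pre script write the finished part lines out to a temporary file, and then pull that file into the main section with « %include ».  A minimal sketch :

# in %pre : emit the directive with the variable already expanded
echo "part /home --size=1024 --ondisk=$HOMEDEVICE --grow" > /tmp/part-home

# in the main section of the config :
%include /tmp/part-home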

the size, she is dynamic !

Instead of trying to wrangle the partition sizes together with the default options, we can get as complex (or as simple) as we like with a few if statements, and some basic maths.  Thinking about our layout then, we can express something like the following quite easily :

  • If there is one drive that is at least 500 GB in size, then /opt should be 200 GB, and /home should consume the rest.
  • If there is one drive that is less than 500 GB but more than 250 GB, then /opt and /home should each take half.
  • If there is one drive that is less than 250 GB, then /home should take two-thirds, and /opt gets the rest.

# $TOTAL from above...
if [ $TOTAL -ge 512000 ]
then
  let OPT_START=$SWAP_START-204800
elif [ $TOTAL -lt 512000 ] && [ $TOTAL -ge 256000 ]
then
  # get the dynamic space total, which sits between where /var/log ends (40128) and where swap begins
  let DYN_TOTAL=$SWAP_START-40128
  # /opt starts halfway into the dynamic space -- note the offset back to an absolute position
  let OPT_START=$DYN_TOTAL/2
  let OPT_START=$OPT_START+40128
elif [ $TOTAL -lt 256000 ]
then
  let DYN_TOTAL=$SWAP_START-40128
  # /home takes the first two-thirds, so /opt starts two-thirds of the way in
  let OPT_START=$DYN_TOTAL/3
  let OPT_START=$OPT_START+$OPT_START
  let OPT_START=$OPT_START+40128
fi

Now, instead of having to create three different kickstart files, each describing a different scenario, we’ve covered it with one – nice !

other possibilities

At the end of the day, the possibilities are nearly endless, with the only restriction being that whatever you’d like to do has to be do-able in BusyBox – which, at this level, provides a lot of great functionality.

Stay tuned for more entries related to kickstarting, PXE-based installations, and so forth, all to come here on dan’s linux blog.  Cheers !

pohmelfs pt. 2, return of pohmelfs !

Hello again fair readers.  Today i’m going to re-visit POHMELFS, which i introduced in an earlier blog post.  I received a comment on that post which basically asked for more information on some of the more interesting (read : advanced) features of POHMELFS, such as distributed storage, and the like.  Well, today is the day !  If you need a refresher, be sure to skim over my previous post, as we’re going to dive right in where i left off last time.

patch for the win

One of the reasons that there was a bit of a delay between my last POHMELFS post and this one is that i hit a bug.  Given that we’re working with staging-level code here, that’s to be expected – luckily, thanks to some quick work by Evgeniy Polyakov on the POHMELFS mailing list, there is still hope – hope in the form of a tasty little patch.

diff --git a/drivers/staging/pohmelfs/trans.c b/drivers/staging/pohmelfs/trans.c
index eab7868..bf7b09a 100644
--- a/drivers/staging/pohmelfs/trans.c
+++ b/drivers/staging/pohmelfs/trans.c
@@ -467,7 +467,8 @@ int netfs_trans_finish_send(struct netfs_trans *t, struct pohmelfs_sb *psb)
 				continue;
 		}

-		if (psb->active_state && (psb->active_state->state.ctl.prio >= st->ctl.prio))
+		if (psb->active_state && (psb->active_state->state.ctl.prio >= st->ctl.prio) &&
+				(t->flags & NETFS_TRANS_SINGLE_DST))
 			st = &psb->active_state->state;

 		err = netfs_trans_push(t, st);

Basically, this patch fixes a minor, but ultimately crippling, bug related to writing to multiple servers.  The details are not important – what’s important is that we apply the patch and keep the dream alive.  First, you’ll need to copy and paste that block of code into a text file on one of the systems (in « ~/pohmel.diff », for example).  Then, in order to apply the patch, we’ll need to use a standard tool called (appropriately) « patch » :

[root@host_75 ~]# cd /usr/src/linux
[root@host_75 linux]# patch -p1 < ~/pohmel.diff
patching file drivers/staging/pohmelfs/trans.c

Now, just as we did last time, we must play the kernel and module compilation and installation game (fun!).  If you need a refresher on how to do this, just go back to my previous post.  Note that this time around, the whole process will be much faster, since only the POHMELFS components need to be recompiled – everything else will stay the same.  As a result, you can skip the part where you archive the entire kernel tree and copy it over – instead, just patch and recompile on each server and the client.  It’s your call.
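
If you go that route, the module-only rebuild is mercifully quick – something like this on each box (make’s dependency tracking will skip everything that hasn’t changed) :

[root@host_75 ~]# cd /usr/src/linux
[root@host_75 linux]# make modules && make modules_install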

Once that’s out of the way we’ll reboot, and then it’s off to the races.

a new challenger appears !

It’s now time to add a third machine into the mix (« host_147 » in this case).  Using this new box, we’ll create a simple sort of setup which is, in fact, quite representative of how things might work in the real world : two storage servers and a client.  As you no doubt recall, one of the neat features of POHMELFS is that it can be employed in a parallel fashion, meaning that a file which appears to the client to be in one place is actually located in more than one storage medium.  A general way of describing these ideas is by using the terms « logical » and « physical » ; the logical medium is the filesystem that the client sees, and the physical medium is the actual hard drive upon which the data is stored.

In this case, host_75 and host_166 will be the servers, each containing one copy of the data on their respective physical mediums (i.e. hard drives), and host_147 will be our client, which will access the data via the logical medium (i.e. the POHMELFS export).  The new machine was set up in the same way as host_166 was, so we’ll skip over that, and get right to the good stuff.

A new directory should be created on each of the machines : « /opt/pohtest ».  This will serve as the export directory on the servers, and the mount directory on the client – don’t put any data in it yet, though.

server config

On the servers, we’ll initiate the server daemon.  Unlike our first test, where we just let the defaults ride, this time around we’ll configure things a bit more intelligently :

[root@host_75 (and host_166) ~]# fserver -r /opt/pohtest -l /var/log/pohmelfs.log -d

In the above example, « -r » defines the directory to export, « -l » is where to output the logs to, and « -d » puts the process into the background, instead of on our console as before.  This is normally how things would work, so it’s good to get used to it now.  Now, we can follow the log files on each machine by using « tail » :

[root@host_75 (and host_166) ~]# tail -f /var/log/pohmelfs.log
Server is now listening at 0.0.0.0:1025.

client config

With the servers up and ready to go, we can now turn our attention to the client.  Don’t forget to load the pohmelfs module first !

[root@host_147 ~]# modprobe pohmelfs
[root@host_147 ~]# cfg -A add -a 192.168.0.75 -p 1025 -i 1

Now we mount.  It’s important that we mount before we attempt to add the second server into the mix – trying to do it ahead of time will only result in terrible, crippling failure.

[root@host_147 ~]# mount -t pohmel -o idx=1 none /opt/pohtest/

No output means it worked (as usual), so let’s verify :

[root@host_147 ~]# df | grep poh
none                 154590376  10018492 144571884   7% /opt/pohtest

Great, now let’s add the other server :

[root@host_147 opt]# cfg -A add -a 192.168.0.166 -p 1025 -i 1

Now we must wait at least 5 seconds for the synchronisation to occur.  In reality it’s shorter than that, but 5 seconds is an easy number to remember, and it’s safe.  So far this looks exactly the same as before, but there’s a bit of a conceptual twist – as you can see, both of our add statements have the same index (as denoted by « -i »).  This means that they’re grouped together as part of the same logical medium.  We can check on this by using the « show » action :

[root@host_147 ~]# cfg -A show -i 1
Config Index = 1
Family    Server IP                                            Port     
AF_INET   192.168.0.75                                         1025
AF_INET   192.168.0.166                                        1025

Everything seems on the up and up so far.  Recall the options we passed on the mount line earlier, notably « idx=1 », which means index 1 (as seen above) – this is very important to specify, as without it, POHMELFS won’t be able to determine which logical group you’re talking about.

And if we take a look at the log output on the servers, we’ll see that the client connection has been accepted.  Both of the logs should show the accepted line, but with different port numbers (the trailing digits at the end) :

Accepted client 192.168.0.147:48277.

There are other diagnostics we can run to take a look at what we’ve got running.  At this stage they won’t tell us anything we don’t already know, but it will give us some practice with the tools and data, so that when the time comes to debug problems down the road, we’ll be ready.

For example, POHMELFS will write some handy information to « mountstats », which is exactly what it sounds like :

[root@host_147 ~]# cat /proc/1/mountstats
   ...
device none mounted on /opt/pohtest with fstype pohmel
idx addr(:port) socket_type protocol active priority permissions
1 192.168.0.75:1025 1 6 1 0 3
1 192.168.0.166:1025 1 6 1 0 3

It’s not lined up very nicely, but the interesting column right now is « active », which lists « 1 » in both cases, meaning the connections are open.  The « permissions » column lists « 3 » for both nodes which, in this case, means that they’re both available for reading and writing (as opposed to being read-only or write-only, which are also valid options).

but will it blend ?

Accepting the connection is one thing – successfully reading and writing files is entirely another.  Let’s do some tests ; first we’ll use the client to create an empty file in the mount :

[root@host_147 ~]# cd /opt/pohtest/
[root@host_147 pohtest]# touch FILE
[root@host_147 pohtest]# ls
FILE

Great, now let’s take a look at our servers :

[root@host_166 pohtest]# ls -l
total 0
-rw-r--r-- 1 root root 0 2009-07-06 16:58 FILE
[root@host_166 ~]#

And the other :

[root@host_75 ~]# ls -l /opt/pohtest/
total 0
-rw-r--r-- 1 root root 0 2009-07-06 16:46 FILE
[root@host_75 ~]#

Now, during my limited tests, i noticed a small lag time between my manipulations on the client, and when those actions were reflected on the servers.  At this stage of the game i’m not sure whether that’s normal or not, or exactly what’s causing it – so don’t be alarmed if you see a small lag as well.  I’ll be sure to post further updates on this point once i’ve got more information.

Update : As per Evgeniy on the mailing list :

This delay is not a bug, but feature - POHMELFS has local cache on
clients and  data written on client is stored in that cache first and
then flushed to the server when client is under memory pressure or when
another one requests updated but not yet flushed data.

To force client to flush the data one can 'sync' on client or use
'flush' utility on the server. The latter will invalidate data on the
client (which implies it to be flushed to the server first), so server
update will become visible next time client reads that data.
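
In other words, if you want your writes to show up on the servers right away, a plain old « sync » on the client will force the flush :

[root@host_147 pohtest]# sync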

how not to do it

Let’s do another little test.  On one of the servers, we’ll perform a manipulation in the POHMELFS export directory :

[root@host_75 ~]# touch /opt/pohtest/host75file
[root@host_75 ~]# ls -l /opt/pohtest/
total 4
-rw-r--r-- 1 root root 5 2009-07-06 16:46 FILE
-rw-r--r-- 1 root root 0 2009-07-06 16:57 host75file
[root@host_75 ~]#

Great, but if we take a look at the other server :

[root@host_166 ~]# ls -l /opt/pohtest/
total 4
-rw-r--r-- 1 root root 5 2009-07-06 16:59 FILE

And the client :

[root@host_147 ~]# ls -l /opt/pohtest/
total 0
-rw-r--r-- 1 root root 5 2009-07-06 20:47 FILE

We notice that it’s not there.  Why ?  Unfortunately, like so much bureaucracy, we didn’t go through the proper channels.  Recall that our client has certain software running on it that allows it to speak to both servers, and that the mountpoint uses that software to ensure consistency across the shared filesystem.  In the example above, we wrote directly to the underlying filesystem of the server – completely avoiding said software – and thus POHMELFS had no way of knowing that a manipulation had occurred.

In short – if you want to keep things consistent, you must interact via a client.  But what if we want our servers to be able to interact with the data as well ?  Well, there’s nothing stopping us from setting up client processes on our servers, too.  This, however, will have to wait for the next instalment.

See you on the intertubes !

pohmelfs update

Hello again ! You may be wondering when the next update in the POHMELFS series is coming – well, rest assured that i’m working on it even as you read this, and that it will be worth the wait.

Remember that we’re working with Staging-level code, and that sometimes things don’t always go as well as one might hope – in this case, there are some discrepancies between the code that the POHMELFS devs are using, and that which was released in the 2.6.30 code (according to the devs, at least).

I’ve recently received a nice new patch from one of the devs, and once i’ve got that squared away, we’ll continue with our exercise.

force disk geometry with sfdisk

Hello again !  This is a quick and dirty update which covers a handy little trick when dealing with writeable removable media – especially USB drives, compact flash cards, and the like.

I end up using a lot of USB keys in my environment for a variety of reasons, not the least of which is as handy portable Linux drives that can be stuck into any workstation and booted from directly.  They’re like LiveCDs, except since i can write to them, any changes that are made during a session don’t disappear when the machine reboots (nice).  As an aside, if that sounds interesting to you, i suggest checking out the Fedora LiveCD on USB Howto.

USB keys are so ubiquitous now that we buy in bulk, meaning we’ll get a bunch of identical units at one time.  Once in a while (though more often than i’d like), one of the keys will end up having a detected geometry which is different from the others.  This isn’t normally a big deal, but it can cause slight variations in the space apparently available for partitions.  This ends up being a problem if i’m looking to clone data from one key to another using a disk imaging tool such as « Partimage » (another tool that gets a lot of play around here).

The solution is fantastically simple, but perhaps not immediately obvious, as it requires the use of a tool that – for the most part – never gets touched by the average user (or admin !) : « sfdisk ».  Sfdisk is a partition table manipulator that allows us to do a number of advanced (read: dangerous) operations to disks.  Since the common day-to-day operations one might perform on a disk, such as creating or modifying partition assignments, are covered by the more common « fdisk » (or even « cfdisk »), sfdisk is rarely called upon outside of bizarre or extreme situations.

Altering geometry is one such situation.

change is good

The first thing we need to do is determine what the correct geometry is.  This is obtained easily enough by running an fdisk report against a known-good key (sdc, in this case) :

[root@host_166 ~]# fdisk -l /dev/sdc

Disk /dev/sdc: 4001 MB, 4001366016 bytes
19 heads, 19 sectors/track, 21648 cylinders
Units = cylinders of 361 * 512 = 184832 bytes
Disk identifier: 0xf1bcd225

 Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1       21648     3907454+  83  Linux

Alternatively, we could ask sfdisk :

[root@host_166 ~]# sfdisk -g /dev/sdc
/dev/sdc: 21648 cylinders, 19 heads, 19 sectors/track

Now that we have the correct geometry, we can get sfdisk to alter that of the naughty key (sdb, in this case).  As you can likely guess, -C is the cylinders, -H is the heads, and -S is the sectors (per track) :

[root@host_166 ~]# sfdisk -C 21648 -H 19 -S 19 /dev/sdb

Depending on your particular version of sfdisk and distro, this may trigger an interactive process which will ask you to create the desired partitions on the key.  Assuming you just want one big Linux partition, you can hit « enter » and accept every default until it’s done.
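
As an aside, depending on the version, sfdisk can also read the desired partition layout from standard input, skipping the interactive prompts entirely.  A minimal sketch – one bootable Linux (type 83) partition spanning the whole key – though fair warning, the exact input syntax varies between sfdisk versions :

[root@host_166 ~]# echo ',,83,*' | sfdisk -C 21648 -H 19 -S 19 /dev/sdb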

And that’s that – one key brought rapidly in line with the others.

Cheers !

pohmelfs – the network filesystem of the future !

Hello again fair readers !  Today we’re going to take a look at POHMELFS, which is a network file system that was just recently integrated into the Linux kernel.  This is an excellent exercise for three reasons : we’ll learn about some great new concepts related to network file systems, we get to compile and install our own kernel, and we get to play with a great new platform that could, eventually, become the new standard for network file systems in the *nix world.  Sound good ?  Let’s go !

A network file system is, according to Wikipedia, « any computer file system that supports sharing of files, printers and other resources as persistent storage over a computer network. » It is important to note that there is also a specific protocol called « Network File System », which is exactly what it sounds like, and is highly popular in the *nix world.  For the purposes of this article, unless otherwise stated every time i use « network file system » or « nfs », i’m referring to the concept, not the protocol.

Building on the idea of an nfs, there are other types of network-aware file systems which fall into similar categories, such as « distributed file systems », « parallel file systems », « distributed parallel file systems », and so forth.  These terms are largely non-standard, and conventional wisdom tends to put all of these sub-categories into the one big nfs family.  POHMELFS, for example, is described by its creator as a « distributed parallel internet filesystem » – in fact, the name itself is an acronym for « Parallel Optimized Host Message Exchange Layered File System », which is, mercifully, pronounced simply as « poh-mel ».

The two interesting portions of the name are « parallel » and « optimized »…

parallel

One of the interesting aspects of POHMELFS is that a given chunk of data can (and in many cases, should) reside in more than one distinct storage unit (different physical servers, for example).  The client can be made aware of this, which means that accesses to the data can happen in a « parallel » fashion, which has two major advantages :

  • Seamless real-time replication of the data during write operations
  • Faster overall data accesses during read operations

Imagine a scenario whereby you have two file servers : serverA and serverB, plus any number of clients (clientA, clientB, clientC, etc.).  In a classic non-parallel (and non-replicating) file system scenario, data would not be identical between the two file servers.  Clients would need to know ahead of time which server had the data they wanted, and if one of the servers went down, it would take all of its unique data with it.  Furthermore, if all of the clients wanted to read data from serverB at once, everybody would get bogged down, since there is only a finite amount of resources available for everybody to use.

The same scenario with a parallel file system is much better indeed !  Firstly, data would be identical on the two file servers, meaning that if one were to go down, there would be no direct loss of data availability (already a huge improvement).  Secondly, and this is also huge, clients could spread their requests for data between the two servers, thus reducing the direct load on any one given server, and resulting in better overall resource usage (translation : better performance).

optimized

The author claims that POHMELFS is designed from the ground-up with performance in mind.  This design principle has resulted in some amazing benchmarks which, frankly stated, make more or less every other network file system look slow in comparison ; some types of data access processes are merely rapid, whereas others are revolutionary.  Of course, speed isn’t everything, and as you might expect, POHMELFS isn’t quite as feature rich (yet?) as many of the other nfs options on the market.

take it for a test drive

At the top of the article i mentioned that POHMELFS was recently introduced into the Linux kernel.  What i really meant was that as of kernel version 2.6.30, POHMELFS is located in the « staging » tree of the overall source collection.  This is already a little bit of a warning – code in the staging area is generally considered usable, but not ready to be merged into the kernel proper.  If this doesn’t sound like your bag, it’s time to bail out now, but for those of you in the mood for adventure, the staging tree is a great place to find interesting new features and functionalities.

In the test scenario we’re about to run through, there are two servers : « host_75 » and « host_166 ».  Both of these machines are very basic installations of Fedora 8, which from a server perspective is not very different from any other Fedora release prior or since – therefore, unless otherwise noted, these operations should be identical on any other Fedora box.

kernel configuration

Long ago, in a faraway land where dragons battled wizards for supremacy over ancient battlements, the process of properly configuring, compiling, and installing a new Linux kernel was arcane knowledge – the stuff of legends.  In our modern, enlightened age, installing a new kernel on your machine is (relatively) simple !  Heck, you even get a menu these days…

The first step in compiling a new kernel is to make sure that your system has the necessary tools with which to work.  On a standard Fedora system, the fastest way is simply to install all of the development tools as a single mass operation.  This is overkill, really, but it’s simple, and since we’re just testing things out anyways, fast and simple are our primary criteria :

[root@host_75 ~]# yum groupinstall 'Development Tools'

The kernel menu i mentioned a few moments ago is going to require an additional package which is not part of the development tools group : « ncurses-devel ».

[root@host_75 ~]# yum install ncurses-devel

Next up, we need to download and unpack the kernel source package.  Your best bet is from kernel.org, which is reliable, rapid, and most importantly, secure :

[root@host_75 ~]# cd /usr/src
[root@host_75 src]# wget http://eu.kernel.org/pub/linux/kernel/v2.6/linux-2.6.30.tar.bz2
[root@host_75 src]# tar -xvjf linux-2.6.30.tar.bz2
[root@host_75 src]# ln -s linux-2.6.30 linux

Finally, we can take a look at the configuration menu.  You’ll notice that we issue the command « make », which is the standard mechanism used across the *nix world for compiling source code.  We’ll see it again a little later on, but for now, understand that all we’re doing here is « making » the configuration menu, not the kernel itself.

[root@host_75 src]# cd linux
[root@host_75 linux]# make menuconfig

Now there are a lot (a LOT) of possible options for configuring a kernel, and if you’ve got the time and the inclination, going through each one and reading the description can be very enlightening.  That said, what we’re interested in is POHMELFS, and in order to enable it, we’re going to have to explicitly tell the configuration that we’re interested in staging-level code.

First, enable « Staging Drivers » :

Device Drivers ---> [*] Staging Drivers

Then disable « Exclude Staging drivers from being built ».  It’s set up this way in order to prevent somebody from building anything from staging by accident :

Device Drivers ---> Staging Drivers ---> [ ] Exclude Staging drivers from being built

Next, enable POHMELFS as a module (if you’d like a refresher on modules, just check out any post on this blog with the « modules » tag) :

Device Drivers ---> Staging Drivers ---> <M> POHMELFS filesystem support

And, optionally, support for encryption :

Device Drivers ---> Staging Drivers ---> <M> POHMELFS filesystem support ---> [*] POHMELFS crypto support

Finally, you may wish to add a « local version string » – this is an identifier that you can customise to help you keep track of each kernel build.

General Setup ---> (-pohmelfs_test) Local version

Now we save and exit !

build & install the kernel

From here, all that’s left is to let it build – depending on your hardware, this can take a while.  Patience, grasshopper.

[root@host_75 linux]# make && make modules && make modules_install && make install

There are four distinct commands here, each of which will execute only if the previous one was successful – that’s what the double-ampersand (&&) does.

  • make : Builds the actual kernel (this is the part that takes forever)
  • make modules : Builds the modules (everything enabled as <M>, such as POHMELFS)
  • make modules_install : Copies the modules to their proper positions
  • make install : Creates the initrd (which i discussed in a previous post), sets up the bootloader (which we’ll take a look at), and so forth

Once the process is done, which is to say that all four items executed successfully, the last thing we need to check before we reboot is the bootloader – in this case, « GRUB ».  The « make install » will add an entry for our new kernel into the GRUB configuration.  This is fairly automatic, but i like to check it, just to be safe.  The new entry should look something like this :

title Fedora (2.6.30-pohmelfs_test)
 root (hd0,0)
 kernel /vmlinuz-2.6.30-pohmelfs_test ro root=/dev/sda1
 initrd /initrd-2.6.30-pohmelfs_test.img

From here, we reboot, and when the GRUB menu appears, select our new POHMELFS item instead of the default entry.

userspace tools

The actual POHMELFS executables – the « userspace tools », so called since they are used by the user, not by the system – are not included with the kernel.  This is normal.  Even though our kernel now supports POHMELFS, our system doesn’t actually have any of the software which will interface with the kernel module yet.  This has to be downloaded, configured, and compiled in the same fashion as the kernel ; however, whereas the kernel was easily downloaded via HTTP, the POHMELFS source is only available via « GIT ».

GIT, in a nutshell, is a platform for managing source code (like « CVS », « Subversion », or « VSS », just to name a few).  For our purposes today, we’re only going to use one of its many features : copying the source code from the official POHMELFS site so that we can compile it ourselves.  If you don’t already have GIT installed, now is the time !  Don’t delay, act today !

[root@host_75 ~]# yum install git

Depending on your existing installation, this may cause a fairly large number of new packages to be downloaded and installed, so don’t worry if the list looks huge.  With that out of the way, we can download the source – this is known as « cloning » a « project » in GIT terminology :

[root@host_75 ~]# git clone http://www.ioremap.net/git/pohmelfs-server.git

Preparation of the source and so forth for POHMELFS was, not too long ago, a bit of a pain in the yoohoo.  Now, the author has graciously included a tool which will take most of the pain out of the process – though there are still a few things to look out for :

[root@host_75 ~]# cd pohmelfs-server
[root@host_75 pohmelfs-server]# ./autogen.sh

This will take a little while, and will output a handful of lines as it goes along.  The next step is a standard « ./configure », which, if you’ve never compiled anything before, is just about the most standard possible way to pre-configure source for compilation in the *nix world.  Normally, ./configure accepts a variety of options (take a look at « ./configure --help » for a taste), but for now, we’re only interested in one :

[root@host_75 pohmelfs-server]# ./configure --with-kdir-path=/usr/src/linux/drivers/staging/pohmelfs

The supplied option tells ./configure where the POHMELFS kernel code is – specifically, where the « netfs.h » file is located.  This is important later on, so take note.  This process outputs tonnes of lines as it checks the capabilities and desires of your environment, and customises the configuration as best it can for your machine.  Once it’s done, we can go ahead and « make » :

[root@host_75 pohmelfs-server]# make

As of this writing, the above make may fail.  As it turns out, certain elements in the POHMELFS source, as they stand now, expect OpenSSL to be installed on the system.  This is true even if you were clever and provided « --disable-openssl » to ./configure above (good thinking, though !).  We’ve got two options here : either modify the POHMELFS source in order to remove the references to things which do not exist, or simply install OpenSSL and be done with it.  If you’ve already got OpenSSL on your system, then no worries, you probably didn’t even see this problem.

OpenSSL, briefly, is an open-source implementation of a series of encryption protocols and cryptographic algorithms which, among other things, allow for « secure » websites (i.e. via HTTPS), and other such things.  As such, it’s a pretty standard thing to have on a machine (especially a server), and since it’s so easy to install in our test scenario, we’ll just go ahead and do that now :

[root@host_75 pohmelfs-server]# yum install openssl openssl-devel

Back to POHMELFS, it’s time to reconfigure.  We’ll enable openssl now, since we’ve got it anyways…

[root@host_75 pohmelfs-server]# ./configure --with-kdir-path=/usr/src/linux/drivers/staging/pohmelfs --enable-openssl
[root@host_75 pohmelfs-server]# make

As of this writing, the above make will fail (again, possibly).  In this instance, a required file can’t be located : netfs.h .  Remember that one from above ?  Of course you do.  As it turns out, even though we explicitly specified the path to find this file, certain elements in the source (possibly auto-generated) expect it to be elsewhere.  As with OpenSSL, we have two options : alter the code ourselves, or just satisfy the requirement as painlessly as possible.  Well, you already know how we roll in these parts :

[root@host_75 pohmelfs-server]# mkdir /usr/src/linux/drivers/staging/pohmelfs/fs
[root@host_75 pohmelfs-server]# ln -s /usr/src/linux/drivers/staging/pohmelfs /usr/src/linux/drivers/staging/pohmelfs/fs/pohmelfs

What we’ve done here is create a link that points from where the source wants netfs.h to be, to where it actually is.  Hey, it’s staging-level code, remember ?  No worries – this was an easy one anyways.  With that out of the way, we can make away !

[root@host_75 pohmelfs-server]# make
   ...
[root@host_75 pohmelfs-server]# make install

The make install will put the necessary binaries in the appropriate places on the system.  In particular :

[root@host_75 ~]# which fserver cfg
/usr/local/bin/fserver
/usr/local/bin/cfg

and again !

That’s one server down, one to go.  But, wait, that was a lot of work, and even more waiting, wasn’t it ?  Doing it again would suck ; luckily, there are some shortcuts we can take.

If the hardware of both machines is more or less the same, there’s a better-than-average chance that the same kernel you’ve already compiled will work on the other server – you can just port it over.  Now, there are very particular, very clean ways to go about this, and to those that like their test environments clean and tidy, i salute you ; we, however, know better.  Let’s just pack up the source, copy it over, and deploy it all at once :

[root@host_75 ~]# cd /usr/src/
[root@host_75 src]# tar -cvzf src.tar.gz linux
   ...
[root@host_75 src]# scp src.tar.gz root@host_166:/usr/src/
   ...

From here on in you’ll want to keep an eye on the hostname being used – we’re dealing with two machines now…

[root@host_166 ~]# cd /usr/src/
[root@host_166 src]# tar -xvzf src.tar.gz
[root@host_166 src]# cd linux
[root@host_166 linux]# make modules_install && make install

Nice !  Reboot the second machine now, and don’t forget to choose the new kernel from the GRUB menu.

Likewise, we don’t need to install GIT on the second machine – we’ll just do like we did with the kernel :

[root@host_75 ~]# tar -cvzf poh.tar.gz --exclude=.git pohmelfs-server/
[root@host_75 ~]# scp poh.tar.gz root@host_166:~/

Notice the « --exclude » option in the tar command ?  This is to prevent the GIT-specific stuff (which is substantial) from being archived, as it is not useful where we’re going.

[root@host_166 ~]# tar -xvzf poh.tar.gz
   ...
[root@host_166 ~]# cd pohmelfs-server/
[root@host_166 pohmelfs-server]# make install

testing time

The first thing we need to do is load the POHMELFS module, which was generated way back when we built the kernel :

[root@host_75 ~]# modprobe pohmelfs
[root@host_75 pohmelfs-server]# lsmod
Module                  Size  Used by
pohmelfs               59284  0

You will likely have a lot more items in this list – but one of them must be « pohmelfs ».  Before we start the server daemon, we’ll have to decide which directory we want to « export », which is to say which one we’d like to share on the network.  For now, let’s pick « /tmp », since it’s simple (and it’s the daemon default).  Let’s put a file in there so that we can check to see if our share is properly working :

[root@host_75 ~]# touch /tmp/POHTEST.TXT

Next, we start the server daemon.  For our first test, we’ll just launch the binary without any options – by default, the daemon will launch in a local console, export /tmp (as noted above), and bind the process to port 1025.  Eventually you may wish to change some or all of these defaults, but for now, we’ll keep it simple :

[root@host_75 ~]# fserver
Server is now listening at 0.0.0.0:1025.

The most basic test possible at this point is to telnet from the second machine to the first, just to see if we can connect on the port :

[root@host_166 ~]# telnet host_75 1025
Trying 192.168.0.75...
Connected to 192.168.0.75.
Escape character is '^]'.
^]
telnet> quit
Connection closed.

This will create some output on the server console :

Accepted client 192.168.0.166:49744.
fserver_recv_data: size: 40, err: -104: Success [0].
Dropped thread 1 for client 192.168.0.166:49744.
Disconnected client 192.168.0.166:49744, operations served: 0.
Dropping worker: 3086465936.

Looks good !  Let’s try a proper connection with the POHMELFS userspace tool : « cfg ».  The options we’ll pass to it are the most basic possible set :

  • « -A add » : Action is to add a new connection
  • « -a <address> » : Connect to this server
  • « -p <port> » : Connect on this port

[root@host_166 ~]# cfg -A add -a 192.168.0.75 -p 1025
Timed out polling for ack
main: err: -1.

Uh oh !  What happened ?  The error message tells us that the client « timed out » (i.e. waited too long) for an acknowledgement of the connection from the server.  But why ?  The answer, though simple, is not immediately obvious : we forgot to load the POHMELFS module on the second machine.  No worries, go ahead and do that now, and we’ll try again :

[root@host_166 ~]# modprobe pohmelfs
[root@host_166 ~]# cfg -A add -a 192.168.0.75 -p 1025
[root@host_166 ~]#

In a stroke of user-friendliness to last the ages, a successful cfg execution will produce no output.  No news is good news, i suppose…

Alright, now that we’ve got the server up, and we’ve prepared the client for a connection, we’ll need to pick a spot to « mount » the remote share, then initiate the mount itself :

[root@host_166 ~]# mkdir pohtest
[root@host_166 ~]# mount -t pohmel -o idx=1 none pohtest/
mount: unknown filesystem type 'pohmel'

Curses, foiled again !  Well, i was, at least – your mileage may vary on this one.  If you get this error instead of a successful mount, the problem is very likely that the « pohmel » file system type isn’t declared in all the proper places.  Check « /proc/filesystems » and « /etc/filesystems » :

[root@host_166 ~]# cat /proc/filesystems | grep poh
nodev    pohmel
[root@host_166 ~]# cat /etc/filesystems | grep poh
[root@host_166 ~]#

Ah ha !  Let’s rectify that little oversight and try again :

[root@host_166 ~]# echo "nodev pohmel" >> /etc/filesystems
[root@host_166 ~]# mount -t pohmel -o idx=1 none pohtest/

No output ?  Great success !  Let’s verify :

[root@host_166 ~]# df | grep poh
none                 154590376  10007348 144583028   7% /root/pohtest

The server console also confirms the connection :

fserver_root_capabilities: avail: 148053020672, used: 10247524352, export: 0, inodes: 39911424, flags: 2.
Accepted client 192.168.0.166:37617.

And, finally, we can see our test file :

[root@host_166 ~]# cd pohtest
[root@host_166 pohtest]# ls -l
total 0
-rw-r--r-- 1 root root 0 2009-06-16 15:39 POHTEST.TXT

Closing the connection cleanly is as simple as umounting :

[root@host_166 ~]# umount pohtest

This is confirmed on the server console :

Dropped thread 1 for client 192.168.0.166:58955.
Disconnected client 192.168.0.166:58955, operations served: 1.
Dropping worker: 3086400400.

that’s a wrap, for now

Now i know what you’re thinking : where’s the parallel storage ?  Where’s the real-time mirroring and all that fun stuff ?  It’s coming.  For now, we’re just getting our feet wet with the technology.  As time and testing permit, i’ll post more about POHMELFS – so stay tuned !

UPDATE : Check out the next instalment in the series ! http://www.dark.ca/2009/07/06/pohmelfs-pt-2-return-of-pohmelfs/

Last but not least, be sure to check out « fserver -h » for such useful features as « fork to background » and « logfile » – both of which, i guarantee, you’ll want to look into if you intend to play around any more with POHMELFS.
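
For example, a more realistic invocation – the same flags we lean on in the next instalment – might export a dedicated directory, write to a logfile, and fork into the background (« /opt/export » here is just an illustrative path) :

[root@host_75 ~]# fserver -r /opt/export -l /var/log/pohmelfs.log -d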

Finally, remember that while the code is very mature for staging, it’s still considered highly experimental.  Good luck, and happy hacking !