Linux 3.1: Steady Storage
November 6, 2011 Leave a comment
The gestation period for the 3.1 kernel was fairly long for various reasons, but it’s out and there are some steady storage updates but nothing earth shattering. But don’t mistake this for complacency, the storage bits of the kernel are still progressing very nicely.
3.1 – Me Develop You Long Time
The development time of the 3.1 kernel was a bit longer than usual starting with the 3.0 release on from July 21 to the 3.1 release on Oct. 24 of this year. Three months is a little longer than usual but Linux was on vacation during the development and kernel.org had a few security issues. But the intrepid kernel developers persevered and we have a shiny new Linux kernel. Let’s take a look at the storage aspects of the 3.1 kernel starting with the fan favorite, file systems.
There were some nice new features added to the 3.1 kernel around file systems.
The ext3 file system, while some outdated, is still in very heavy use. In the 3.1 kernel, a very nice feature was added to the default options for ext3. The option that is now default in et3 is write barriers. You can read more about write barriers here but in general write barriers ensure that the data is written to the disk at key points in the data cycle. In particular, write barriers happen before the journal commit to make sure all of the transaction logs are on the disk. Write barriers will impact performance for many workloads, but the trade-off is that you get a much more reliable file system with less chance of losing data. Ext3 had write barriers for some time but they were never turned on by default even though most distributions turned them on. So the kernel developers decided in the 3.1 kernel that they would turn on write barriers by default. While I am a performance junky, I’m an even bigger junky for making sure I don’t lose data.
I’m truly surprised by how many people who know Linux and file systems well enough still do silly things such as making the command, “ls -l” part of a script or put millions of 4KB files in a single directory. Performing a simple “ls” command, or the more feared, “ls -l”, puts a tremendous strain on the file system forcing it to walk a big chunk of the file system and look up inodes to gather metadata. A big part of this process is the use of the readdir() function.
In the 3.1 kernel, Josef Bacik from Red Hat, created a patch that improves readdir() performance, particularly for “ls” commands that follow readdir() with an stat() command. The patch improved performance on a simple, but long running script that uses “ls” by 1,300 seconds (22 minutes). Here is a plot of the pattern of the script before the patch, and here is a plot of the same script after the patch. The differences between the two are remarkable.
The second patch to btrfs, while somewhat complicated because it deals with the idea of locking during metadata operations, results in significantly better read performance and in most cases improved write performance. However, there are cases, such as those for dbench, where the write performance suffers a bit. But overall, it is a good change for btrfs.
Everyone loves NFS (if you don’t, you should), because it’s the only file system standard. NFS is NFS is NFS. It is the same protocol for Linux as for OSX as for HPUX, as for Windows, etc. So it allows one to share data between different systems even if they have different operating systems. Plus it is a well known file system that is fairly easy to configure and operate and has well-known error paths.
In the 3.1 kernel, there was a patch to NFS, actually to the v4.1 standard of NFS (sometimes called pNFS), that added IPV6 support. This patch is fairly significant because as pNFS becomes a reality (yeah – it’s been a long road), some people are going to want to use IPV6, particularly in government areas. So this patch is a dandy one for NFS.
The XFS developers have been keeping a steady pace of great developments. In the 3.1 kernel they added some performance enhancements. The details are very involved, but if you want to read the kernel commits, you can see the first one here and the second one here.
The venerable reiserfs file system is still used in many places. In the 3.1 kernel, a simple patch was included to make the default mount option for reiserfs to barrier=flush. This will have an impact on existing reiserfs file systems that are upgraded to the 3.1 kernel.
HFSPlus (HFS+) is a file system developed by Apple. It is used on a variety of Apple platforms including iPods. In the 3.1 kernel, a 2TB limitation on HFS+ was changed to a dynamic limit based on the block size. While HFS+ may not be used as a primary file system for most Linux users, having the ability to mount and interact with Apple file systems can be very useful.
One of my favorite file systems in the Linux kernel is SquashFS (I think I’ve said this before). In past kernels, squashfs supported the XZ and LZO compression methods in the kernel, in addition to the default ZLIB compression method, giving it access to three different compression methods. It was decided to drop the ZLIB support in the 3.1 kernel since there are two other compression options. Plus dropping support for ZLIB reduces the kernels size which is important for people using Linux for embedded applications.
Block Layer Patches
File systems are cool and people love to talk about them but there are other aspects to Linux storage that are just as key. One important aspect is the block layer. In the 3.1 kernel there were several important patches to the block layer that you should be aware of.
One patch that may seem esoteric but is more important than you think, added the ability to force the completion of an IO operation to go to the core (cpu) that request the operation. The kernel has the ability to move processes around the cores in a system. For IO operations, the block layer usually has the concept of completing an IO operation to a core that is on the same socket as the requesting process (using the blk_cpu_to_group() function). Sometimes you want the requesting process to also get the completion notification. This patch allows you to set capability on a per block device basis such as,
echo 2 > /sys/block//queue/rq_affinity
There were a couple of performance tuning patches for the CFQ IO scheduler that were pretty detailed and beyond the scope of this quick article. But you have to love performance additions.
The third set of changes to the block layer focused on the device mapper (dm) portion of the block layer. The first patch that is significant is the addition of the capability of supporting the MD (multi-device) RAID-1 personality using dm-raid. This is not a insignificant patch since the MD RAID-1 has some unique features. With this patch you can now use the same features but using dm-raid.
The second dm patch adds the ability to parse and use metadata devices with dm-raid. This is significant because the metadata information allows dm-raid to correctly reassemble disks in the correct order when booting or if a disk fails or there is some other fault. This is an extremely important task that otherwise you have to do by hand (yuck).
If you have looked at the kernel patches for the last several kernels you will realize that there has been a large effort around improving the scalability of the VFS layer. The 3.1 kernel added some additional patches to improve scalability (and performance). The details are there if you want to read them but they are pretty deep.
iSCSI is becoming more popular as a replacement for Fibre Channel (FC) networks since you just use Ethernet networks. In the 3.1 kernel the current iSCSI implementation, called STGT, was declared obsolete after the inclusion of the linux-iscsi.org (LIO) SCSI target was included in the kernel. LIO is a full featured in-kernel implementation o the iSCSI target mode (RFC-3720) but was not the only in-kernel iSCSI target considered. A second set of patches, called SCST was also considered and I guess the discussion around LIO vs. SCST was pretty rough. For a more in-depth discussion you can read the lwn article here.
The 3.1 kernel was fairly quiet from a storage perspective but the thing I like most about it is that steady progress was made on the kernel. If you recall there was a period of time when storage development in the kernel was stagnate. Then in the later 2.6.x series it start picking up with a huge flurry of development, new file systems, scalability improvements, and performance improvements. So the fact that storage is still being developed in the kernel illustrates that it is important and work is still on-going.
There were some file system developments of note, particularly if you are upgrading an existing system to use the 3.1 kernel. There were also some developments in the block layer (if block layer patches go into the kernel you know they are good quality because no one wants to cause problems in the block layer) and the VFS layer had some further scalability improvements. And finally for people using iSCSI, there was a new iSCSI target implementation included in the kernel, so if you use iSCSI targets on Linux, be sure you test these out thoroughly before deploying things in production.