Linux 2.6.39: IO Improvements for the Last of the 2.6 Series


The 2.6.39 kernel has been out for a bit, so reviewing its IO improvements is a good way to understand what's happening in the Linux storage world. And just in case you didn't know, this will be the last kernel in the 2.6.x series; the next kernel will be 3.0, not 2.6.40.

2.6.39 is Out!

For those who live under a rock (and not that stupid Geico commercial), the 2.6.39 kernel was released on May 18th. For us Linux storage junkies there were some very nice additions to the kernel.

File Systems

While there are many aspects to storage in Linux, one of the most obvious is file systems. Just about every new kernel brings new file system features, and 2.6.39 was definitely no exception. Some of the more notable patches touched 11 file systems in the kernel. The following sections highlight some of the more noticeable ones.

Ext4

There was an update to ext4 that went into the 2.6.37 kernel but was disabled because it wasn't quite ready for production use (a corruption bug was found). The change allowed ext4 to use the block IO layer (called "bio") directly, instead of the intermediate "buffer" layer. This can improve performance and scalability, particularly when your system has lots of cores (SMP). You can read more about this update here. But remember that it was disabled in the source code of the 2.6.37 kernel.

In the 2.6.39 kernel, the ext4 code was updated and the corruption bugs were fixed, so this scalability patch was re-enabled. You can read the code commit here. This is one of Ted Ts'o's favorite patches and it can really pump up ext4 performance on SMP systems (BTW: get ready for large SMP systems, because we could easily see 64-core mainstream servers later this year).

Btrfs

In the 2.6.39 kernel, btrfs added the option of different compression and copy-on-write (COW) settings for each file or directory. Before this patch, these settings were on a per-file system basis, so these new changes allow much finer control (if you want). The first commit for this series is here.

Btrfs is under fairly heavy development so there were other patches to improve functionality and performance as well as trace points. Trace points can be very important because they allow debugging or monitoring to be more fine grained.

GFS2

GFS2 is a clustered file system that has been in the Linux kernel for quite some time (since Red Hat bought Sistina). In the 2.6.39 kernel a few performance changes were made. The first patch improves deallocation, resulting in roughly a 25% improvement in GFS2 file deletion performance (always good to see better metadata performance).

The second patch improves cluster mmap scalability. Sounds like a mouthful, but it's a fairly important performance patch. Remember that GFS2 is a clustered file system, where each node can perform IO, possibly to the same file. This patch improves performance when file access is done via mmap (an important access method for some workloads) from several nodes.

The third patch comes from Dave Chinner, one of the key XFS developers. It reduces the impact of "log pushing" on sequential write throughput; in other words, it improves overall file system performance by reducing the cost of log operations (yeah, performance!).

The fourth patch converts the glock (global lock) hash table of GFS2 to use RCU (read-copy-update). In a nutshell, RCU is a synchronization mechanism that lets readers traverse a shared data structure with almost no overhead, while updaters work on copies and old versions are reclaimed only once all readers are done with them. Using RCU for the glock hash table makes lock lookups scale much better on busy systems. Again, this patch is a nice performance upgrade for GFS2.

HPFS

HPFS (High Performance File System) is something of an ancient file system, originally designed for OS/2 (remember those days?). It is still in the kernel, and during the hunt for parts of the kernel still using the BKL (Big Kernel Lock), it was found that HPFS still used a number of those functions.

When SMP (Symmetric Multi-Processing) support was first introduced into the Linux kernel, a set of coarse locking functions (the BKL) was introduced with it. These functions allowed the kernel to work on SMP systems, particularly those with small core counts. However, with common servers now having up to 48 cores, the BKL doesn't scale well. So the kernel developers went on a BKL hunt, finding the code that still used BKL functions and recoding it to use much more scalable locking mechanisms.

One of the last remaining parts of the kernel that used BKL functions was HPFS. What made removing those routines difficult is that no one could be found who really maintained the code, and no one stepped up to say they still used it. So there was some discussion about eliminating the code completely, but instead it was decided to make a best effort to patch the HPFS code. Mikulas Patocka created three patches for HPFS that were pulled into the 2.6.39 kernel. The first allowed HPFS to compile with PREEMPT and SMP enabled (basically, the BKL parts of HPFS were gone). The second patch implemented fsync in HPFS (which shows how old the code is). And the third patch removed the CR/LF option (only really used back in the 2.2 kernel).

So, in the 2.6.39 kernel the final bits of the BKL in HPFS were eliminated. This, along with the elimination of the other remaining BKL pieces in the kernel, allowed BKL victory to be declared (and all of the peasants rejoiced).

XFS

The strong development pace of XFS continued in the 2.6.39 kernel with two important patches. The first patch made delayed logging the default (starting with 2.6.35 you had to turn it on explicitly). Delayed logging can greatly improve performance, particularly metadata performance. If you want to see an example of this, read Ric Wheeler's Red Hat Summit paper on testing file systems with 1 billion files.

The second patch removed the use of the page cache to back the buffer cache in XFS. The buffer cache now has its own LRU, so the page cache is no longer needed to provide persistent caching, and this patch saves that overhead. The patch also means that XFS can handle 64KB pages (if you want to go there) and has an impact on the 16TB file system limit on 32-bit machines.

Ceph

In the 2.6.39 kernel, two patches were added to Ceph. The first one added a mount option, ino32. This option makes Ceph report 32-bit inode numbers, which is useful for 64-bit kernels with 32-bit userspace.

The second patch adds a lingering-request and watch/notify event framework to Ceph (i.e., more tracking infrastructure in Ceph).

Exofs

Exofs is an object-based file system in the Linux kernel (which makes it pretty cool IMHO). In the 2.6.39 kernel there was one major patch for Exofs, which added the option of mounting the file system by osdname. This can be very useful if more devices are added later or the login order changes.

Nilfs2

Nilfs2 is a very cool file system in the kernel: what is termed a log-structured file system. In the 2.6.39 kernel nilfs2 added functions that expose the standard attribute set to user space via the chattr and lsattr tools. This can be very useful for tools that read file attributes from user space (mostly management and monitoring tools).

CIFS

While Linux is the one and only true operating system for the entire planet (as we all know), we do have to work with other operating systems from time to time, much to our dismay :). The most obvious interaction is with Windows systems via CIFS (Common Internet File System). In the 2.6.39 kernel a patch was added that allows user names longer than 32 bytes, enabling better integration between Windows and Linux systems.

Squashfs

One of my favorite file systems in the Linux kernel is SquashFS. In the 2.6.39 kernel, a patch was added that allows SquashFS to use the xz decompressor. xz is a lossless compression format that uses the LZMA2 compression algorithm.

Block Patches

Another key aspect of storage in Linux is the block layer in the kernel. The 2.6.39 kernel had a few patches for this layer, helping to improve performance and add new capabilities.

The first patch adds the capability of syncing a single file system. Typically, the sync(2) call commits the buffer cache to disk, but it does so for all mounted file systems. You may not want that on a system with several mounted file systems, so this patch introduces a new system call, syncfs(2). It takes a file descriptor as an argument and syncs only the file system containing that descriptor.

The DM layer in the kernel also saw some patches the first of which is quite interesting. This first patch added a new target to the DM layer called a “flakey target” in the patch commit. This target is the same as the linear target except that it returns I/O errors periodically. This target is useful for simulating failing devices for testing purposes. This is not a target you want to use for real work of course, but if you are developing or testing things, it might be worthwhile (at the very least it has an interesting name).

The second patch introduces a new merge function for striped targets in the DM layer. It improves performance by about 12-35% when a reasonable chunk size (64K, for example) is used in conjunction with a stripe count that is a power of two. What the patch does is allow large block I/O requests to be merged and handled properly by the DM layer. File systems such as XFS and ext4 take care to assemble large block I/O requests to improve performance, and the DM layer now passes these requests through rather than splitting them up before handing them to the underlying hardware, which previously eliminated the performance gains the file systems took pains to create.

Summary

The 2.6.39 kernel is somewhat quiet, with no really huge storage-oriented features, but it shows continual progress in the form of performance gains in a number of places. The release touched a large number of file systems as well as the block layer. No fewer than 11 file systems had noticeable patches (I didn't discuss 9p in this article). To me that signals lots of work on Linux storage (love to see that).

The block layer, while sometimes a quiet piece of the kernel, had some patches that improved it for both users and developers. It gained a new system call, syncfs, that syncs a single file system at a time, a very useful feature for systems with many mounted file systems.

The DM layer also improved performance for file systems that assemble larger block I/O requests by supporting them for the striped target. This can be a great help with more enterprise class oriented hardware.

Lots of nice improvements in this kernel, and it's good to see so much work focused on storage within Linux. This also puts the Linux world on a solid footing moving into the new 3.0 series of kernels, which is what we'll see next instead of a 2.6.40 kernel. But before moving on to the 3.0 series I wanted to thank all the developers who worked on the 2.6 series of kernels (and the 2.5 development series). The 2.6 series started off as a kernel with more functionality and ambition than the prior series. It then developed into a great kernel not only for average users but also for the enterprise world. From the mid-2.6.2x kernels to 2.6.39, Linux storage has been developed at a wonderful rate, and we are all better off because of this development. Thanks everyone, and I look forward to the fun of the 3.0 series!
