Linux 3.1: Steady Storage

The gestation period for the 3.1 kernel was fairly long for various reasons, but it’s out and there are some steady storage updates but nothing earth shattering. But don’t mistake this for complacency, the storage bits of the kernel are still progressing very nicely.

3.1 – Me Develop You Long Time

The development time of the 3.1 kernel was a bit longer than usual starting with the 3.0 release on from July 21 to the 3.1 release on Oct. 24 of this year. Three months is a little longer than usual but Linux was on vacation during the development and had a few security issues. But the intrepid kernel developers persevered and we have a shiny new Linux kernel. Let’s take a look at the storage aspects of the 3.1 kernel starting with the fan favorite, file systems.

File Systems

There were some nice new features added to the 3.1 kernel around file systems.

The ext3 file system, while some outdated, is still in very heavy use. In the 3.1 kernel, a very nice feature was added to the default options for ext3. The option that is now default in et3 is write barriers. You can read more about write barriers here but in general write barriers ensure that the data is written to the disk at key points in the data cycle. In particular, write barriers happen before the journal commit to make sure all of the transaction logs are on the disk. Write barriers will impact performance for many workloads, but the trade-off is that you get a much more reliable file system with less chance of losing data. Ext3 had write barriers for some time but they were never turned on by default even though most distributions turned them on. So the kernel developers decided in the 3.1 kernel that they would turn on write barriers by default. While I am a performance junky, I’m an even bigger junky for making sure I don’t lose data.

I’m truly surprised by how many people who know Linux and file systems well enough still do silly things such as making the command, “ls -l” part of a script or put millions of 4KB files in a single directory. Performing a simple “ls” command, or the more feared, “ls -l”, puts a tremendous strain on the file system forcing it to walk a big chunk of the file system and look up inodes to gather metadata. A big part of this process is the use of the readdir() function.

In the 3.1 kernel, Josef Bacik from Red Hat, created a patch that improves readdir() performance, particularly for “ls” commands that follow readdir() with an stat() command. The patch improved performance on a simple, but long running script that uses “ls” by 1,300 seconds (22 minutes). Here is a plot of the pattern of the script before the patch, and here is a plot of the same script after the patch. The differences between the two are remarkable.

The second patch to btrfs, while somewhat complicated because it deals with the idea of locking during metadata operations, results in significantly better read performance and in most cases improved write performance. However, there are cases, such as those for dbench, where the write performance suffers a bit. But overall, it is a good change for btrfs.

Everyone loves NFS (if you don’t, you should), because it’s the only file system standard. NFS is NFS is NFS. It is the same protocol for Linux as for OSX as for HPUX, as for Windows, etc. So it allows one to share data between different systems even if they have different operating systems. Plus it is a well known file system that is fairly easy to configure and operate and has well-known error paths.

In the 3.1 kernel, there was a patch to NFS, actually to the v4.1 standard of NFS (sometimes called pNFS), that added IPV6 support. This patch is fairly significant because as pNFS becomes a reality (yeah – it’s been a long road), some people are going to want to use IPV6, particularly in government areas. So this patch is a dandy one for NFS.

The XFS developers have been keeping a steady pace of great developments. In the 3.1 kernel they added some performance enhancements. The details are very involved, but if you want to read the kernel commits, you can see the first one here and the second one here.

The venerable reiserfs file system is still used in many places. In the 3.1 kernel, a simple patch was included to make the default mount option for reiserfs to barrier=flush. This will have an impact on existing reiserfs file systems that are upgraded to the 3.1 kernel.

HFSPlus (HFS+) is a file system developed by Apple. It is used on a variety of Apple platforms including iPods. In the 3.1 kernel, a 2TB limitation on HFS+ was changed to a dynamic limit based on the block size. While HFS+ may not be used as a primary file system for most Linux users, having the ability to mount and interact with Apple file systems can be very useful.

One of my favorite file systems in the Linux kernel is SquashFS (I think I’ve said this before). In past kernels, squashfs supported the XZ and LZO compression methods in the kernel, in addition to the default ZLIB compression method, giving it access to three different compression methods. It was decided to drop the ZLIB support in the 3.1 kernel since there are two other compression options. Plus dropping support for ZLIB reduces the kernels size which is important for people using Linux for embedded applications.

Block Layer Patches

File systems are cool and people love to talk about them but there are other aspects to Linux storage that are just as key. One important aspect is the block layer. In the 3.1 kernel there were several important patches to the block layer that you should be aware of.

One patch that may seem esoteric but is more important than you think, added the ability to force the completion of an IO operation to go to the core (cpu) that request the operation. The kernel has the ability to move processes around the cores in a system. For IO operations, the block layer usually has the concept of completing an IO operation to a core that is on the same socket as the requesting process (using the blk_cpu_to_group() function). Sometimes you want the requesting process to also get the completion notification. This patch allows you to set capability on a per block device basis such as,

echo 2 > /sys/block//queue/rq_affinity

There were a couple of performance tuning patches for the CFQ IO scheduler that were pretty detailed and beyond the scope of this quick article. But you have to love performance additions.

The third set of changes to the block layer focused on the device mapper (dm) portion of the block layer. The first patch that is significant is the addition of the capability of supporting the MD (multi-device) RAID-1 personality using dm-raid. This is not a insignificant patch since the MD RAID-1 has some unique features. With this patch you can now use the same features but using dm-raid.

The second dm patch adds the ability to parse and use metadata devices with dm-raid. This is significant because the metadata information allows dm-raid to correctly reassemble disks in the correct order when booting or if a disk fails or there is some other fault. This is an extremely important task that otherwise you have to do by hand (yuck).

VFS Layer

If you have looked at the kernel patches for the last several kernels you will realize that there has been a large effort around improving the scalability of the VFS layer. The 3.1 kernel added some additional patches to improve scalability (and performance). The details are there if you want to read them but they are pretty deep.


iSCSI is becoming more popular as a replacement for Fibre Channel (FC) networks since you just use Ethernet networks. In the 3.1 kernel the current iSCSI implementation, called STGT, was declared obsolete after the inclusion of the (LIO) SCSI target was included in the kernel. LIO is a full featured in-kernel implementation o the iSCSI target mode (RFC-3720) but was not the only in-kernel iSCSI target considered. A second set of patches, called SCST was also considered and I guess the discussion around LIO vs. SCST was pretty rough. For a more in-depth discussion you can read the lwn article here.


The 3.1 kernel was fairly quiet from a storage perspective but the thing I like most about it is that steady progress was made on the kernel. If you recall there was a period of time when storage development in the kernel was stagnate. Then in the later 2.6.x series it start picking up with a huge flurry of development, new file systems, scalability improvements, and performance improvements. So the fact that storage is still being developed in the kernel illustrates that it is important and work is still on-going.

There were some file system developments of note, particularly if you are upgrading an existing system to use the 3.1 kernel. There were also some developments in the block layer (if block layer patches go into the kernel you know they are good quality because no one wants to cause problems in the block layer) and the VFS layer had some further scalability improvements. And finally for people using iSCSI, there was a new iSCSI target implementation included in the kernel, so if you use iSCSI targets on Linux, be sure you test these out thoroughly before deploying things in production.

What’s an IOPS?


There have been many articles exploring the performance aspects of file systems, storage systems, and storage devices. Coupled with Throughput (Bytes per second), IOPS (Input/Output Operations per Second) is one of the two measures of performance that are typically examined when discussing storage media. Vendors will publish performance results with data such as “Peak Sequential Throughput is X MB/s” or “Peak IOPS is X” indicating the performance of the storage device. But what does an IOPS really mean and how is it defined?

Typically an IOP is an IO operation where data is sent from an application to the storage device. An IOPS is the measure of how many of these you can perform per second. But notice that the phrase “typically” is used in this explanation. That means there is no hard and fast definition of an IOPS that is standard for everyone. Consequently, as you can imagine, it’s possible to “game” the results and publish whatever results you like. (see a related article, Lies, Damn Lies and File System Benchmarks). That is the sad part of IOPS and Bandwidth – the results can be manipulated to be almost whatever the tester wants.

However, IOPS is a very important performance measure for applications because, believe it or not, many applications perform IO using very small transfer sizes (for example, see this article). How quickly or efficiently a storage system an perform IOPS can drive overall performance of the application. Moreover, today’s systems have lots of cores and run several applications at one time, further pushing the storage performance requirements. Therefore, knowing the IOPS measure of your storage devices is important but you just need to critical of the numbers that are published.

Measuring IOPS

There are several tools that are commonly used for measuring IOPS on systems. The first one is called Iometer, that you commonly see used on Windows systems. The second most common tool is IOzone, which have been used in the articles published on Linux Magazine because it is open-source, easy to build on almost any system, has a great deal of tests and options, and is widely used for storage testing. It is fairly evident at this point that having two tools could lead to some differences in IOPS measurements. Ideally there should be a precise definition of an IOPS with an accepted way to measure it. Then the various tools for examining IOPS would have to prove that they satisfy the definition (“certified” is another way of saying this). But just picking the software tool is perhaps the easiest part of measuring IOPS.

One commonly overlooked aspect of measuring IOPS is the size of the I/O operation (sometimes called the “payload size” using the terminology of the networking world). Is the I/O operation involve just a single byte? Or does it involve 1 MB? Just stating that a device can achieve 1,000 IOPS really tells you nothing. Is that 1,000 1-byte operations per second or 1,000 1MB operations per second?

The most common IO operation size for Linux is 4KB (or just 4K). It corresponds to the page size on almost all Linux systems so usually produces the best IOPS (but not always). Personally, I want to see IOPS measures for a range of IO operation sizes. I like to see 1KB (in case there is some exceptional performance at really small payload sizes), 4KB, 32KB, 64KB, maybe 128KB or 256KB, and 1MB. The reason I like to see a range of payload sizes is that it tells me how quickly the performance drops with payload size which I can then compare to the typical payload size of my application(s) (actually the “spectrum” of payload sizes). But if push comes to shove, I want to at least see the 4KB payload size but most importantly I want the publisher to tell me the payload size they used.

A second commonly overlooked aspect of measuring IOPS is whether the IO operation is a read or write or possibly a mix of them (you knew it wasn’t going to be good when I start numbering discussion points). Hard drives, which have spinning media, usually don’t have much difference between read and write operations and how fast they can execute them. However, SSDs are a bit different and have asymmetric performance. Consequently, you need to define how the IO operations were performed. For example, it could be stated, “This hardware is capable of Y 4K Write IOPS” where Y is the number, which means that the test was just write operations. If you compare some recent results for the two SSDs that were tested (see this article) you can see that SSDs can have very different Read IOPS and Write IOPS performance – sometimes even an order of magnitude different.

Many vendors choose to publish either Read IOPS or Write IOPS but rarely both. Other vendors like to publish IOPS for a mixed operation environment stating that the test was 75% Read and 25% Write. While they should be applauded for stating the mix of IO operations, they should also publish their Read IOPS performance (all read IO operations), and their Write IOPS performance (all write IO operations) so that the IOPS performance can be bounded. At this point in the article, vendors should be publishing the IOP performance measures something like the following:

  • 4K Read IOPS =
  • 4K Write IOPS =
  • (optional) 4K (X% Read/Y% Write) IOPS =

Note that the third bullet is optional and the ratios of read and write IOPS is totally up to the vendor.

A third commonly overlooked aspect of measuring IOPS is whether the IO operations are sequential or random. With sequential IOPS, the IO operations happen sequentially on the storage media. For example block 233 is used for the first IO operation, followed by block 234, followed by block 235, etc. With random IOPS the first IO operation is on block 233 and the second is on block 568192 or something like that. With the right options on the test system, such as a large queue depth, the IO operations can be optimized to improve performance. Plus the storage device itself may do some optimization. With true random IOPS there is much less chance that the server or storage device can optimize the access very much.

Most vendors report the sequential IOPS since typically it has a much larger value than random IOPS. However, in my opinion, random IOPS is much more meaningful, particularly in the case of a server. With a server you may have several applications running at once, accessing different files and different parts of the disk so that to the storage device, the access looks random.

So, at this point in the discussion, the IOPS performance should be listed something like the following:

  • 4K Random Read IOPS =
  • 4K Random Write IOPS =
  • 4K Sequential Read IOPS =
  • 4K Sequential Write IOPS =
  • (optional) 4K Random (X% Read/Y% Write) IOPS =

The IOPS can be either random or sequential (I like to see both), but at the very least they should publish if the IOPS are sequential or random.

A fourth commonly overlooked aspect of measuring IOPS is the queue depth. With Windows storage benchmarks, you see the queue depth adjusted quite a bit in the results. Linux does a pretty good job setting good queue depths so there is much less need to change the defaults. However, the queue depths can be adjusted which can possibly change the performance. Changing the queue depth on Linux is fairly easy.

The Linux IO Scheduler has the functionality to sort the incoming IO request into something called the request-queue where they are optimized for the best possible device access which usually means sequential access. The size of this queue is controllable. For example, you can look at the queue depth for the “sda” disk in a system and change it as shown below:

# cat /sys/block/sda/queue/nr_requests
# echo 100000 > /sys/block/sda/queue/nr_requests

Configuring the queue depth can only be done by root.

At this point the IOPS performance should be published something like the following:

  • 4K Random Read IOPS = X (queue depth = Z)
  • 4K Random Write IOPS = Y (queue depth = Z)
  • 4K Sequential Read IOPS = X (queue depth = Z)
  • 4K Sequential Write IOPS = Y (queue depth = Z)
  • (optional) 4K Random (X% Read/Y% Write) IOPS = W (queue depth = Z)

Or if you like they need to tell you the queue depth once if it applies to all of the tests.

In the Linux world, not too many “typical” benchmarks try different queue depths since typically the queue depth is 128 already which provides for good performance. However, depending upon the workload or the benchmark, you can adjust the queue depth to produce better performance. However, just be warned that if you change the queue depth for some benchmark, real application performance could suffer.

Notice that it is starting to take a fair amount of work to list the IOPS performance. There are at least four IOPS numbers that need to be reported for a specified queue depth. However, I personally would like to see the IOPS performance for several payload sizes and several queue depths. Very quickly, the number of tests that need to be run is growing quite rapidly. To take the side of the vendors, producing this amount of benchmarking data takes time, effort, and money. It may not be worthwhile for them to perform all of this work if the great masses don’t understand nor appreciate the data. On the other hand, taking the side of the user, this type of data is very useful and important since it can help set expectations when we buy a new storage device. And remember, the customer is always right so we need to continue to ask the vendors for this type of data.

There are several other “tricks” you can do to improve performance including more OS tuning, turning off all cron jobs during testing, locking process to specific cores using numactl, and so on. Covering all of them is beyond this article but you can assume that most vendors like to tune their systems to improve performance (ah – the wonders of benchmarks). One way to improve this situation is to report all details of the test environment (I try to do this) so that one could investigate which options might have been changed. However, for rotating media (hard drives), one can estimate the IOPS performance of single devices (i.e. individual drives).

Estimating IOPS for Rotating Media

For pretty much all rotational storage devices, the dominant factors in determining IOPS performance are seek time, access latency, and rotational speed (but typically we think of rotational speed as affecting seek time and latency). Basically the dominant factors affect the time to access a particular block of data on the storage media and report it back. For rotating media the latency is basically the same for read or write operations, making our life a bit easier.

The seek time is usually reported by disk vendors and is the time it takes for the drive head to move into position to read (or write) the correct track. The latency refers to the amount of time it takes for the specific spot on the drive to be in place underneath a drive head. The sum of these two times is the basic amount of time to read (or write) a specific spot on the drive. Since we’re focusing on rotating media, these times are mechanical so we can safely assume they are much larger than the amount of time to actually read the data or get it back to the drive controller and the OS (remember we’re talking IOPS so the amount of data is usually very small).

To estimate the IOPS performance of a hard drive, we simple use the average of these two times to compute the number of IO operations we can do per second.

Estimated IOPS = 1 / (average latency + average seek time)

For both numbers, the values should be in milliseconds (or at least in the same units – I’ll leave the math up to you). For example, if a disk has an average latency of 3 ms and an average seek time of 4.45 ms, then the estimated IOPS performance is,

Estimated IOPS = 1 / (average latency + average seek time)
Estimated IOPS = 1 / (3 + 4.45 ms)
Estimated IOPS = 1 / (0.00745)
Estimated IOPS = 134

This handy-dandy formula works for rotating media for single drives (SSDs IOPS performance is more difficult to estimate and not as accurate). Estimating performance for storage arrays that have RAID controllers and several drives is much more difficult and is usually not easy to do. However, there are some articles floating around the web that attempt to estimate the performance.


IOPS is one of the important measures of performance of storage devices. Personally I think it is the first performance measure one should examine since IOPS are important to the overall performance of a system. However, there is no standard definition of an IOPS so just like most benchmarks, it is almost impossible to compare values from one storage device to another or one vendor to another.

In the article I tried to explain a bit about IOPS and how they can be influenced by various factors. Hopefully this helps you realize that published IOPS benchmarks perhaps have been “gamed” by vendors and that you should ask for more details on how the values were found. Even better, you can run the benchmarks yourself or even ask posted benchmarks how they tested for IOPS performance.

Linux 2.6.39: IO Improvements for the Last In the 2.6 Series

The 2.6.39 kernel has been out for a bit so reviewing the IO improvements is a good exercise to understand what’s happening in the Linux storage world. But just in case you didn’t know, this will be the last kernel in the 2.6.x series. The next kernel will be 3.0 and not 2.6.40.

2.6.39 is Out!

For those that live under a rock (and not that stupid Geico commercial), the 2.6.39 kernel was released on May 18th. For us Linux storage junkies there were some very nice additions to the kernel.

File Systems

While there are many aspects to storage in Linux, one of the most obvious features are file systems. In just about every new kernel there are new features associated with file systems and the 2.6.39 kernel was definitely not an exception. Some of the more notable patches touched 11 file systems in the kernel. The following sections highlight some of the more noticeable ones.


There was an update to ext4 that went in the 2.6.37 kernel but was disabled because it wasn’t quite ready for production use (i.e. a corruption bug was found). The change allowed ext4 to use the Block IO layer (called “bio”) directly, instead of the intermediate “buffer” layer. This can improve performance and scalability, particular when your system has lots of cores (smp). You can read more about this update here. But remember that it was disabled in the 2.6.37 kernel in the source code.

In the 2.6.39 kernel, the ext4 code was updated and the corruption bugs fixed. So this scalability patch was re-enabled. You can read the code commit here. This is one of Ted Ts’o’ favorite patches that can really pump up ext4 performance on smp systems (BTW – get ready for large smp systems because we can easily see 64-core mainstream servers later this year).


In the 2.6.39 kernel, btrfs added the option of different compression and copy-on-write (COW) settings for each file or directory. Before this patch, these settings were on a per-file system basis, so these new changes allow much finer control (if you want). The first commit for this series is here.

Btrfs is under fairly heavy development so there were other patches to improve functionality and performance as well as trace points. Trace points can be very important because they allow debugging or monitoring to be more fine grained.


GFS2 is a clustered file system that has been in the Linux kernel for quite some time (since Red Hat bought Sistina). In the 2.6.39 kernel a few performance changes were made. The first patch improves deallocation performance resulting in about a 25% improvement in file deletion performance of GFS2 (always good to see better metadata performance).

The second patch improves the cluster mmap scalability. Sounds like a mouthful but it’s a fairly important performance patch. Remember that GFS2 is a clustered file system where each node has the ability to perform IO perhaps to the same file. This patch improves the performance when the file access is done via mmap (an important file access method for some workloads) by several nodes.

The third patch comes from Dave Chinner, one of the key xfs developers. The patch reduces the impact of “log pushing” on sequential write throughput performance. This means that the patch improves performance by reducing the impact of log operations on the overall file system performance (yeah performance!).

The fourth patch uses an RCU for the glock hash table. In a nutshell, this patch adds an RCU (read-copy-update) for the glock (global lock) of GFS2. Using an RCU is commonly used for locks on files so that processes can actually read or write to the file excluding other processes (so you don’t get data corruption). RCUs are useful because they have very low overhead. Again, this patch is a nice performance upgrade for GFS2.


HPFS (High Performance File System) is something of an ancient file system, originally designed for OS/2 (remember those days?). It is still in the kernel and during the hunt for parts of the kernel that were still using parts of the BKL (Big Kernel Lock), it was found that HPFS still used a number of those functions.

When SMP (Symmetric Multi-Processing) was first introduced into the Linux kernel, functions that performed locking (BKL) were introduced. These functions allowed the kernel to work with SMP systems particularly with small core counts. However, with common servers have up to 48 cores, the kernel doesn’t scale quite as well. So, the kernel developers went on the BKL hunt to eliminate the code in the kernel that used BKL functions and recode them to use much more scalable locking mechanisms.

One of the last remaining parts of the kernel that used BKL functions was HPFS. What made things difficult in removing those routines is that no one could be found that really maintained the code and no one stepped up and said that they really used it any more. So there was some discussion about just eliminating the code completely, but instead it was decided to go ahead and give the best effort possible to patch the HPFS code. Mikulas Patocka created three patches for HPFS that were pulled into the 2.6.39 kernel. The first allowed HPFS to compile with the options of PREEMPT and SMP (basically the BKL parts of HPFS were gone). The second patch implemented fsync in HPFS (shows how old the code is). And the third patch removed the CR/LF option (this was only really used in the 2.2 kernel).

So, in the 2.6.39 kernel the final bits of BKL in HPFS were eliminated. This, and other BKL pieces were eliminated in the kernel, allowed BKL victory to be declared (and all of the peasants rejoiced).


The strong development pace of XFS continued in the 2.6.39 kernel with two important patches. The first patch, made delayed logging the default (you had to explicitly turn it on in past kernels starting with 2.6.35). Delayed logging can greatly improve performance, particularly metadata performance. If you want to see an example of this, read Ric Wheeler’s Red Hat Summit paper on testing file systems with 1 billion files.

The second patch removed the use of the page cache to back the buffer cache in xfs. The reason for this is that the buffer cache has it’s own LRU now so you don’t need the page cache to provide persistent caching. This patch saves the overhead of using the page cache. The patch also means that xfs can handle 64k pages (if you want to go there) and has an impact on the 16TB file system limit for 32 bit machines.


In the 2.6.39 kernel, two patches were added to Ceph. The first one added a mount option, ino32. This option allows ceph to report 32 bit ino values which is useful for 64-bit kernels with 32-bit userspace.

The second patch adds lingering request and watch/notify event framework to Ceph (i.e. more tracking information in Ceph).


Exofs is an object oriented file system in the Linux kernel (which makes it pretty cool IMHO). In the 2.6.39 kernel there was one major patch for Exofs that the mount option of mounting the file system by osdname. This can be very useful if more devices were added later or the login order has changed.


Nilfs2 is a very cool file system in the kernel that is a what is termed a log-structured file system. In the 2.6.39 kernel nilfs2 added some functions that exposed the standard attribute set to user-space via the chattr and lsattr functions. This can be very useful for several tools that read the attributes of files in user-space (mostly management and monitoring tools).


While Linux is the one and only true operating system for the entire planet (as we all know), we do have to work with other file systems from time to time much to our dismay :). The most obvious interaction is with Windows systems via CIFS (Common Internet File System). In the 2.6.39 kernel a patch was added that allowed user names longer than 32 bytes. This patch allows better integration between Windows systems and Linux systems.


One of my favorite file systems in the Linux kernel is SquashFS. In the 2.6.39 kernel, a patch was added that allowed SquashFS to support the xz decompressor. Xz is compression technology that is lossless and that uses the LZMA2 compression algorithm.

Block Patches

Another key aspect to storage in Linux is the block layer in the kernel. The 2.6.39 kernel had a few patches for this layer helping to improve performance and add new capability.

The first patch adds the capability of syncing a single file system. Typically, the sync(2) function commits the buffer cache to disk but does so for all mounted file systems. However, you may not want to to do this for a system that has several mounted file systems. So this patch introduces a new system call, syncfs(2). This system call takes a file descriptor as an argument and then syncs on that file system.

The DM layer in the kernel also saw some patches the first of which is quite interesting. This first patch added a new target to the DM layer called a “flakey target” in the patch commit. This target is the same as the linear target except that it returns I/O errors periodically. This target is useful for simulating failing devices for testing purposes. This is not a target you want to use for real work of course, but if you are developing or testing things, it might be worthwhile (at the very least it has an interesting name).

The second patch introduces a new merge function for striped targets in the DM. This patch improves performance by about 12-35% when a reasonable chunk size is used (64K for example), in conjunction with a stripe count that is a power of 2. What the patch does is allow large block I/O requests to be merged together and handled properly by the DM layer. File systems such as XFS and ext4 take care to assemble large block I/O requests to improve performance and now the DM layer supports these requests rather than piece them out to the underlying hardware, eliminating the performance gains the file systems took pains to create.


The 2.6.39 kernel is somewhat quiet with no really huge storage oriented features but it does show that continual progress is being made with performance gains in a number of places. It touched a large number of file systems and also touched the block layer to some degree. No less than 11 file systems had noticeable patches (I didn’t discuss 9p in this article). To me that signals lots of work on Linux storage (love to see that).

The block layer, while sometimes a quiet piece of the kernel, had some patches that improved it for both users and developers. The block layer introduced a new system call, syncfs, that allows a single file system to be synced at a time which is a very useful feature for systems that have many mounted file systems.

The DM layer also improved performance for file systems that assemble larger block I/O requests by supporting them for the striped target. This can be a great help with more enterprise class oriented hardware.

Lots of nice improvements in this kernel and it’s good to see so much work focused on storage within Linux. This also puts the Linux world on a solid footing moving into the new 3.0 series of kernels which is what we’ll see next instead of a 2.6.40 kernel. But before moving onto the 3.0 series of kernels I wanted to thank all developers who worked on the 2.6 series of kernels (and the 2.5 development series). The 2.6 series started off as a kernel with more functionality and ambition that the prior series. It then developed into not only a great kernel for average users but also for the enterprise world. From the mid-2.6.2x kernels to 2.6.39, Linux storage has been developed at a wonderful rate and we are all better off because of this development. Thanks everyone and I look forward to the fun of the 3.0 series!

Linux 3.0 Some Cool Changes for Storage

The first shiny brand new kernel, 3.0, of the 3.x kernel series is out and there are some goodies for storage (as always). Let’s take a look at some of these.

2.6.x – So Long and Thanks for All the IO!

Before jumping into the 3.0 kernel I wanted to thank all of the developers and testers who put so much time into the 2.6 kernel series. I remember when the 2.5 “development” kernel came out and the time so many people put into it. I also remember when the 2.6 kernel came out and I was so excited to try it (and a bit nervous). Then I started to watch the development of the 2.6 kernel series, particularly the IO side of the kernel, and was very excited and pleased with the work. So to everyone who was involved in the development and testing of the 2.6 series, my profound and sincere thanks.

3.x – Welcome of the Excitement!

Let’s move on to the excitement of the new 3.x series. The 3.0 kernel is out and ready for use. It has some very nice changes/additions for the IO inclined (that’s us if you’re reading this article). We can break down the developments into a few categories including file systems, which is precisely where we’ll start.

The biggest changes to file systems in 3.0 happened to btrfs. There were three major classes of improvements:

  • Automatic defragmentation
  • Scrubbing (this is really cool)
  • Performance improvements

These improvements added greatly to the capability of btrfs, the fair-haired boy of Linux file systems at the moment.

Btrfs – Automatic defragmentation
The first improvement, automatic defragmentation, does exactly what it sounds like: it automatically defrags an online file system. Normally, Linux file systems such as ext4 and XFS do a pretty good job about keeping files as contiguous as possible by delaying allocations (to combine data requests), using extents (contiguous range of blocks), and using other techniques. However, don’t forget that btrfs is a COW (copy-on-write) file system. COWs are great for a number of things including file systems. For btrfs when a file is first written, it is usually laid out in sequential order as best it can (very similar to XFS or ext4). However, because of the COW nature of the file system, any changes to the file are written to free blocks and not over the data that was already there. Consequently, the file system can fragment fairly quickly.

Prior to the 3.0 kernel, the way to handle fragmentation was to either (1) defrag the file system periodically, or (2) mount the file system with COW disabled. The first option is fairly easy to do,

$ btrfs filesystem defragment

But you have to remember run the command or you have to put it in a cron job. But doing this also means that the performance will suffer a bit during the defragmentation process.

The second option involved mounting btrfs with an option to turn off COW. The mount option is,

-o nodatacow

This will limit the fragmentation of the file system but you lose the goodness that COW gives to btrfs. What is really needed is a way to defragment the file system when perhaps the file system isn’t busy or a technique to limit the fragmentation of the file system without giving up COW.

In the 3.0 kernel, btrfs gained some ability to defragment on the fly for certain types of data. In particular, it now has the mount option,

-o autodefrag

This option tells btrfs to look for small random writes into files and queues them for an automatic defrag process. According to the notes in the commit, the defrag capability isn’t well suited for database workloads yet but it does work for smaller files such as rpm (don’t forget that rpm based distros have an rpm database that is constantly being updated), sqlite or bdb databases. This new automatic defrag feature of btrfs is very useful in limiting this one source of fragmentation. If you think the file system has gotten too fragmented you can always defragment it by hand via the “btrfs” command.

Since btrfs is constantly being compared to ZFS, let’s compare the defrag capabilities of both. The authors of ZFS have tried to mitigate the impact of COW on the fragmentation by keeping the file changes in a buffer as long as possible before flushing the data to disk. Evidently, the authors feel that this limits most of the fragmentation in the file system. However, it’s impossible to completely eliminate fragmentation by just using a large buffer. On the other hand, btrfs has an a defrag utility if you want to defrag your file system. Also, this new feature focuses on the main source of fragmentation – small writes to existing files. While this features isn’t perfect it does provide the basis of a very good defrag capability without having to rely on large buffers. I would say the score on this feature is, btrfs – 1, ZFS – nil.

Btrfs – Scrubbing
The second new feature, which is my personal favorite, is scrubbing has been added to btrfs. Remember that btrfs computes checksums of data and metadata and stores them in a checksum tree. Ideally these checksums can be used to determine if data or metadata has gone bad (e.g. “bit rot”).

In the 3.0 kernel, btrfs gained the ability to do was is called a “scrub” of the data. This means that the stored checksums are compared to a freshly computed checksum to determine if the data has been corrupted. In the current patch set the scrub checks the extents in the file system and if a problem is found, a good copy of the data is searched for. If a good copy is found then it will be used to overwrite the bad copy. Note that this approach also captures checksums that have become corrupted in addition to data that may have become corrupted.

Given the size of drives and the size of RAID groups, the probably of hitting an error is increasing. Having the ability to scrub the data in a file system is a great feature.

In keeping with the comparison to ZFS, ZFS has had this feature for some time and btrfs only gained this feature in the 3.0 kernel. But let me point out one quick thing. I have seen many people say that ZFS has techniques to prevent data corruption. This statement is partially true for ZFS and now btrfs. The file systems have the ability to detect data corruption. But they can only correct the corruption if a copy of the bad data exists. If you have a RAID-1 configuration or a RAID-5 or 6, then the file system can find a good copy of the block (or construct one), and over write the bad data block(s). If you only have a single disk or RAID-0, then either file system will only detect data corruption but can’t correct it (Note: there is an option in ZFS that tells it to make two copies of the data but this halves your usable capacity as if you used the drive in a RAID-1 configuration with two equal-size partitions on the drive. You can also tell it to make 3 copies of the data on a single disk if you really want to do that). So I would score this one, btrfs – 1, ZFS – 1.

Btrfs – Performance Improvements
The automatic defragmentation and scrubbing features are really wonderful additions to btrfs but there is even more (and you don’t have to trade to get what is behind curtain #2). In the 3.0 kernel, there were some performance improvements added to btrfs.

The first feature is that the performance of file creations and deletes was improved. When btrfs does a file creation or file deletion it has to do a great number of b+ tree insertions such as inode names, directory name items, directory name index, etc. In the 3.0 kernel, the b+ tree file insertion or deletions are delayed which improves performance. The details of the implementation are fairly involved but you can read them here but basically it tries to do these functions in batches rather than addressing them one at a time. The result is that the for some microbenchmarks, the file creations have improved by about 15% and the file deletions have improved by about 20%.

A second performance improvement does not flush the checksum items of unchanged file data. While this doesn’t sound like a big deal, it helps fsync speeds. In the commit for the patch, a simple sysbench test doing a “random write + fsync” improved by almost a factor of 10 from about 112.75 requests/sec to 1216 requests/sec.

The third big performance improvement is the inclusion of a new patch that allocates chunks in a better sequence for multiple devices (RAID-0, RAID-1), especially when there is an odd-number of devices. Prior to this patch, when multiple devices were used, btrfs allocated chunks on the devices in the same order. This could cause problems when there were an odd number of devices in a RAID-1 or RAID-10 configuration. This patch sorts the devices before allocating and allocates stripes on the devices with the most available space as long as there is space available (capacity balancing).

Other File System changes in 3.0

There were some other changes to Linux file systems in the 3.0 kernel.

In the 2.6.38 kernel, XFS gained the ability for manual SSD discards from userspace using the FITRIM ioctl. The patch was not designed to be run during normal workloads since the freespace btree walks can cause large performance degradations. So while XFS had some “TRIM” capability it was not “online” when the file system was operating.

In the 3.0 kernel, XFS gained a patch that implemented “online” discard support (i.e. TRIM). The patch uses the function “blkdev_issue_discard” once a transaction commits (the space is unused).

NILFS2 gained the ability to resize while online in the 3.0 kernel. This patch added a resize ioctl (IO Control) that makes online resizing possible (uber cool).

In the 3.0 kernel, the clustered file system, OCFS2, gained a couple of new features. The first feature it gained was page discards (aka TRIM). This first patch added the FITRIM ioctl. The second patch added the ability for the OCFS2 function, ocfs2_trim_fs, to trim freed clusters in a volume.

A second patch set which, at first glance, seems to be inconsequential, actually has some downstream implications. There were two patches, this patch and this patch, which allow OCFS to move extents. Based on these patches, keep an eye on future OCFS2 capabilities.

If you watch the ext4 mailing list you will see that there is a great deal of development still going on in this file system. If you look deep enough a majority of this development is to add features and even more capability to ext4. Ext4 is shaping up to be one great file system (even better than it already is).

In the 3.0 kernel, there were a few patches added to ext4. The first patch set adds the capability of what is called “hole punching” in files. There are situations where an application doesn’t need the data in the middle of a file. Both XFS and OCFS2 have the capability of punching a hole in a file where a portion of the file is marked as unwanted and the associated storage is released (i.e. it can be reused). In the 3.0 kernel, this function was made “generic” so that any file system could use it (it was added to the fallocate system call). To learn more about this capability, take a look at this LWN article.

The first patch for ext4 added new routines which are used by fallocate when the hole punch flag is passed. The second patch modifies the truncate routines to support hole punching.

The second ext4 patch set is a very useful patch to help avoid data corruption. This commit added the the ability to prevent ext4 from being mounted multiple times. If a non-clustered or non-parallel file system is mounted more than once you have the potential for data corruption. For example, in the case of a high-availability (HA) NFS configuration where one NFS gatweay is exporting the data, the other standby NFS will also have to mount the file system. However, if the HA gets confused and both gateways suddenly start exporting the file system to multiple NFS clients, you are almost guaranteed to get data corruption (I’ve seen this happen several times). This patch prevents that from happening.

TRIM Summary
There is lots of interest surrounding TRIM for SSDs (and rightly so). In an upcoming article (it may not be posted) I write about how to check if TRIM is working with your SSD. In that article I review which file systems support TRIM (an important part of making sure TRIM works). With the 3.0 kernel, XFS and OCFS2 gained TRIM capability. Here is the summary of TRIM support for file systems as a function of the kernel.

  • 2.6.33: GFS2, nilfs2, btrfs, ext4, fat
  • 2.6.33: GFS2, nilfs2, btrfs, ext4 (including no journal mode), fat
  • 2.6.37: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat
  • 2.6.38: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs (“offline” mode), ext3
  • 3.0: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs (“offline” and “online” mode), ext3, ocfs2

The Guts

There are some dark, dank parts of the kernel where the casual user should never, ever look. These parts are where very serious things happen inside the kernel. One particular aspect is the VFS (Virtual File System). The 3.0 kernel had some important patches that touched this very important part of the kernel as well as other important bits.

VFS Patches
There has been a rash of work in the VFS recently, to make it more scalable. During some of this development it was discovered that some file systems, btrfs and ext4 in particular, had a bottleneck on larger systems. It seems that btrfs was doing a xattr lookup on every write (xattr = extended attributes). This caused an additional tree walk which then hit some per file system locks, stalling performance, and causing quite bad scalability. So in the 3.0 kernel the capability of caching the the xattr to avoid the lookup, was added, greatly improving scalability.

Block Layer Patches
There is another dark and scary layer where only the coders with strong kernel-fu dare travel. That is the block layer. In the 3.0 kernel, a patch was added that gives the block layer the ability to discard bio (block IO) in batches in the function blkdev_issue_discard(), making the discarding process much faster. In a test described in the commit, this patch made discards about 16x faster than before. Not bad – I love the smell of performance gains in new kernels!


There has been a great deal of file system development in the 3.0 kernel. Btrfs in particular had a number of features and capability bringing it closer to a full featured, stable file system. There were a number of other gains as well including TRIM support in XFS and OCFS2 as well as online resizing support in NILFS2, and the development of generic hole punching functions that were then used in ext4. OCFS2 also some patches that could form the basis of some pretty cool future functionality (keep your eyes peeled).

The 3.0 kernel also saw continued development by some of the best kernel developers, improving the VFS and the block layer. These are places only really careful and serious kernel developers tread, because it is so complicated and any change has a huge impact on the overall kernel (and your data).

I think the 3.0 kernel is a great kernel from an IO perspective. It was a really good way to transition from the 2.6 kernel series by giving us wonderful new features but not so much that everything broke. Don’t be afraid to give 3.0 a go, but be sure you do things carefully and one step at a time.