Linux 3.0: Some Cool Changes for Storage

The first shiny brand new kernel, 3.0, of the 3.x kernel series is out and there are some goodies for storage (as always). Let’s take a look at some of these.

2.6.x – So Long and Thanks for All the IO!

Before jumping into the 3.0 kernel I wanted to thank all of the developers and testers who put so much time into the 2.6 kernel series. I remember when the 2.5 “development” kernel came out and the time so many people put into it. I also remember when the 2.6 kernel came out and I was so excited to try it (and a bit nervous). Then I started to watch the development of the 2.6 kernel series, particularly the IO side of the kernel, and was very excited and pleased with the work. So to everyone who was involved in the development and testing of the 2.6 series, my profound and sincere thanks.

3.x – Welcome to the Excitement!

Let’s move on to the excitement of the new 3.x series. The 3.0 kernel is out and ready for use. It has some very nice changes/additions for the IO inclined (that’s us if you’re reading this article). We can break down the developments into a few categories including file systems, which is precisely where we’ll start.

The biggest changes to file systems in 3.0 happened to btrfs. There were three major classes of improvements:

  • Automatic defragmentation
  • Scrubbing (this is really cool)
  • Performance improvements

These improvements added greatly to the capability of btrfs, the fair-haired boy of Linux file systems at the moment.

Btrfs – Automatic defragmentation
The first improvement, automatic defragmentation, does exactly what it sounds like: it automatically defrags an online file system. Normally, Linux file systems such as ext4 and XFS do a pretty good job of keeping files as contiguous as possible by delaying allocations (to combine data requests), using extents (contiguous ranges of blocks), and using other techniques. However, don’t forget that btrfs is a COW (copy-on-write) file system. COWs are great for a number of things including file systems. In btrfs, when a file is first written, it is usually laid out in sequential order as best it can (very similar to XFS or ext4). However, because of the COW nature of the file system, any changes to the file are written to free blocks and not over the data that was already there. Consequently, the file system can fragment fairly quickly.
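
To make the fragmentation argument concrete, here is a toy model in Python. It is only a sketch and not how btrfs actually allocates blocks: the `CowFile` class and its simple "next free block" allocator are assumptions made for illustration. It counts how many extents a file ends up with after a handful of random overwrites.

```python
import random


class CowFile:
    """Toy copy-on-write file: logical block index -> physical block number."""

    def __init__(self, num_blocks):
        # Initial write: blocks are laid out sequentially on "disk".
        self.block_map = list(range(num_blocks))
        # Simplistic bump allocator (an assumption, not the btrfs allocator).
        self.next_free = num_blocks

    def overwrite(self, logical_block):
        # COW: never overwrite in place; point the block at fresh space.
        self.block_map[logical_block] = self.next_free
        self.next_free += 1

    def extent_count(self):
        # An extent is a run of physically contiguous blocks.
        extents = 1
        for prev, cur in zip(self.block_map, self.block_map[1:]):
            if cur != prev + 1:
                extents += 1
        return extents


if __name__ == "__main__":
    rng = random.Random(0)
    f = CowFile(1000)
    print("extents after initial write:", f.extent_count())
    for _ in range(50):
        f.overwrite(rng.randrange(1000))
    print("extents after 50 random overwrites:", f.extent_count())
```

A single overwrite in the middle of the file splits one extent into three, which is why small random writes into existing files hurt so much.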

Prior to the 3.0 kernel, the way to handle fragmentation was to either (1) defrag the file system periodically, or (2) mount the file system with COW disabled. The first option is fairly easy to do,

$ btrfs filesystem defragment

But you have to remember to run the command, or you have to put it in a cron job. Doing this also means that performance will suffer a bit during the defragmentation process.

The second option involved mounting btrfs with an option to turn off COW. The mount option is,

-o nodatacow

This will limit the fragmentation of the file system but you lose the goodness that COW gives to btrfs. What is really needed is a way to defragment the file system when perhaps the file system isn’t busy or a technique to limit the fragmentation of the file system without giving up COW.

In the 3.0 kernel, btrfs gained some ability to defragment on the fly for certain types of data. In particular, it now has the mount option,

-o autodefrag

This option tells btrfs to look for small random writes into files and queue them for an automatic defrag process. According to the notes in the commit, the defrag capability isn’t well suited for database workloads yet, but it does work for smaller files such as rpm (don’t forget that rpm-based distros have an rpm database that is constantly being updated), sqlite, or bdb databases. This new automatic defrag feature of btrfs is very useful in limiting this one source of fragmentation. If you think the file system has gotten too fragmented you can always defragment it by hand via the “btrfs” command.

Since btrfs is constantly being compared to ZFS, let’s compare the defrag capabilities of both. The authors of ZFS have tried to mitigate the impact of COW on fragmentation by keeping the file changes in a buffer as long as possible before flushing the data to disk. Evidently, the authors feel that this limits most of the fragmentation in the file system. However, it’s impossible to completely eliminate fragmentation by just using a large buffer. On the other hand, btrfs has a defrag utility if you want to defrag your file system. Also, this new feature focuses on the main source of fragmentation – small writes to existing files. While this feature isn’t perfect, it does provide the basis of a very good defrag capability without having to rely on large buffers. I would say the score on this feature is, btrfs – 1, ZFS – nil.

Btrfs – Scrubbing
The second new feature, which is my personal favorite, is scrubbing. Remember that btrfs computes checksums of data and metadata and stores them in a checksum tree. Ideally these checksums can be used to determine if data or metadata has gone bad (e.g. “bit rot”).

In the 3.0 kernel, btrfs gained the ability to do what is called a “scrub” of the data. This means that the stored checksums are compared to a freshly computed checksum to determine if the data has been corrupted. In the current patch set the scrub checks the extents in the file system and if a problem is found, a good copy of the data is searched for. If a good copy is found then it will be used to overwrite the bad copy. Note that this approach also captures checksums that have become corrupted in addition to data that may have become corrupted.
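
The basic idea of a scrub can be sketched in a few lines of Python. This is a toy model, not the btrfs implementation: the `scrub` function, the list-of-mirrors layout, and the use of CRC32 as the checksum are all assumptions chosen to keep the example small.

```python
import zlib


def checksum(block: bytes) -> int:
    # CRC32 stands in for the file system's real checksum (an assumption).
    return zlib.crc32(block)


def scrub(mirrors, stored_sums):
    """Verify every block on every mirror; repair from a good copy if one exists.

    mirrors: list of lists of bytes, one inner list per copy of the data.
    stored_sums: the expected checksum for each block index.
    Returns (repaired, unrecoverable) lists of (mirror_index, block_index).
    """
    repaired, unrecoverable = [], []
    for idx, expected in enumerate(stored_sums):
        # Look for any copy of this block whose checksum still matches.
        good = next((m[idx] for m in mirrors if checksum(m[idx]) == expected), None)
        for m_no, mirror in enumerate(mirrors):
            if checksum(mirror[idx]) == expected:
                continue
            if good is None:
                unrecoverable.append((m_no, idx))  # detected, but no good copy
            else:
                mirror[idx] = good  # overwrite the bad copy with the good one
                repaired.append((m_no, idx))
    return repaired, unrecoverable
```

With two mirrors a corrupted block gets repaired from the other copy. With a single copy the same block can only be reported as bad.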

Given the size of drives and the size of RAID groups, the probability of hitting an error is increasing. Having the ability to scrub the data in a file system is a great feature.

In keeping with the comparison to ZFS, ZFS has had this feature for some time and btrfs only gained this feature in the 3.0 kernel. But let me point out one quick thing. I have seen many people say that ZFS has techniques to prevent data corruption. This statement is partially true for ZFS and now btrfs. The file systems have the ability to detect data corruption, but they can only correct the corruption if a good copy of the bad data exists. If you have a RAID-1 configuration or a RAID-5 or 6, then the file system can find a good copy of the block (or construct one), and overwrite the bad data block(s). If you only have a single disk or RAID-0, then either file system will only detect data corruption but can’t correct it. (Note: there is an option in ZFS that tells it to make two copies of the data, but this halves your usable capacity as if you used the drive in a RAID-1 configuration with two equal-size partitions on the drive. You can also tell it to make 3 copies of the data on a single disk if you really want to do that.) So I would score this one, btrfs – 1, ZFS – 1.

Btrfs – Performance Improvements
The automatic defragmentation and scrubbing features are really wonderful additions to btrfs but there is even more (and you don’t have to trade to get what is behind curtain #2). In the 3.0 kernel, there were some performance improvements added to btrfs.

The first feature is that the performance of file creations and deletes was improved. When btrfs does a file creation or file deletion it has to do a great number of b+ tree insertions such as inode names, directory name items, directory name index, etc. In the 3.0 kernel, the b+ tree insertions and deletions are delayed, which improves performance. The details of the implementation are fairly involved, but basically it tries to do these operations in batches rather than addressing them one at a time. The result is that for some microbenchmarks, file creations have improved by about 15% and file deletions have improved by about 20%.
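
The batching idea can be illustrated with a small Python sketch. It is a toy, not the btrfs delayed-item code: the `DelayedItems` class, the dict standing in for the b+ tree, and the `batch_size` parameter are all assumptions made for illustration.

```python
class DelayedItems:
    """Buffer tree updates and apply them in sorted batches (toy model)."""

    def __init__(self, batch_size=64):
        self.tree = {}         # a dict stands in for the on-disk b+ tree
        self.pending = []      # queued (key, value) operations; None = delete
        self.batch_size = batch_size
        self.tree_passes = 0   # how many times the "tree" was touched

    def insert(self, key, value):
        self._queue(key, value)

    def delete(self, key):
        self._queue(key, None)

    def _queue(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # A stable sort groups neighbouring keys together while keeping
        # operations on the same key in their original order.
        for key, value in sorted(self.pending, key=lambda op: op[0]):
            if value is None:
                self.tree.pop(key, None)
            else:
                self.tree[key] = value
        self.pending.clear()
        self.tree_passes += 1
```

Eight inserts with `batch_size=4` touch the "tree" twice instead of eight times, and an insert followed by a delete of the same key is resolved inside one batch.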

A second performance improvement is that btrfs no longer flushes the checksum items of unchanged file data. While this doesn’t sound like a big deal, it helps fsync speeds. In the commit for the patch, a simple sysbench test doing a “random write + fsync” improved by more than a factor of 10, from about 112.75 requests/sec to 1216 requests/sec.

The third big performance improvement is the inclusion of a new patch that allocates chunks in a better sequence for multiple devices (RAID-0, RAID-1), especially when there is an odd number of devices. Prior to this patch, when multiple devices were used, btrfs allocated chunks on the devices in the same order. This could cause problems when there was an odd number of devices in a RAID-1 or RAID-10 configuration. This patch sorts the devices before allocating and allocates stripes on the devices with the most available space as long as there is space available (capacity balancing).
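
Here is a minimal Python sketch of the “sort by free space, then allocate” idea. It is not the btrfs allocator: the `allocate_chunk` function, the device names, and the unit-sized stripes are assumptions picked to show why sorting helps with an odd number of devices.

```python
def allocate_chunk(free_space, stripes, stripe_size):
    """Take stripe_size from each of the `stripes` devices with the most free space.

    free_space: dict of device name -> free space (modified in place).
    Returns the chosen device names, or None if too few devices have room.
    """
    candidates = sorted(
        (dev for dev, free in free_space.items() if free >= stripe_size),
        key=lambda dev: free_space[dev],
        reverse=True,
    )
    if len(candidates) < stripes:
        return None
    chosen = candidates[:stripes]
    for dev in chosen:
        free_space[dev] -= stripe_size
    return chosen


if __name__ == "__main__":
    # Three equal devices, two-way mirroring (RAID-1 style), unit-sized stripes.
    free = {"a": 3, "b": 3, "c": 3}
    chunks = 0
    while allocate_chunk(free, stripes=2, stripe_size=1) is not None:
        chunks += 1
    print("chunks allocated:", chunks, "leftover:", free)
```

In this toy setup, always picking the first two devices would stop after three chunks with the third device untouched, while sorting by free space fits four chunks and leaves only one unit unused.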

Other File System changes in 3.0

There were some other changes to Linux file systems in the 3.0 kernel.

In the 2.6.38 kernel, XFS gained the ability for manual SSD discards from userspace using the FITRIM ioctl. The patch was not designed to be run during normal workloads since the freespace btree walks can cause large performance degradations. So while XFS had some “TRIM” capability it was not “online” when the file system was operating.
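
For the curious, this is roughly what a FITRIM call from userspace looks like, sketched in Python. The helper names `pack_fstrim_range` and `trim_filesystem` are my own, and the hard-coded ioctl number assumes the usual Linux ioctl encoding found on x86 and ARM. Running it for real needs root plus a file system and device that support discard.

```python
import fcntl
import os
import struct

# _IOWR('X', 121, struct fstrim_range) from <linux/fs.h>, assuming the
# common Linux ioctl encoding (x86, ARM); other architectures may differ.
FITRIM = 0xC0185879


def pack_fstrim_range(start=0, length=2**64 - 1, minlen=0):
    # struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
    return struct.pack("=QQQ", start, length, minlen)


def trim_filesystem(mountpoint, minlen=0):
    """Ask the file system mounted at `mountpoint` to discard its free space.

    Returns the number of bytes the file system reports as trimmed.
    """
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        arg = bytearray(pack_fstrim_range(minlen=minlen))
        # The kernel writes the amount actually trimmed back into `len`.
        fcntl.ioctl(fd, FITRIM, arg)
        _, trimmed, _ = struct.unpack("=QQQ", arg)
        return trimmed
    finally:
        os.close(fd)
```

This is the same ioctl the fstrim utility uses, which is why it is usually run from a cron job rather than during a busy workload.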

In the 3.0 kernel, XFS gained a patch that implemented “online” discard support (i.e. TRIM). The patch uses the function “blkdev_issue_discard” once a transaction commits (the space is unused).

NILFS2 gained the ability to resize while online in the 3.0 kernel. This patch added a resize ioctl (IO Control) that makes online resizing possible (uber cool).

In the 3.0 kernel, the clustered file system, OCFS2, gained a couple of new features. The first feature it gained was page discards (aka TRIM). This first patch added the FITRIM ioctl. The second patch added the ability for the OCFS2 function, ocfs2_trim_fs, to trim freed clusters in a volume.

A second patch set which, at first glance, seems to be inconsequential, actually has some downstream implications. There were two patches which allow OCFS2 to move extents. Based on these patches, keep an eye on future OCFS2 capabilities.

If you watch the ext4 mailing list you will see that there is a great deal of development still going on in this file system. If you look deep enough, a majority of this development is to add features and even more capability to ext4. Ext4 is shaping up to be one great file system (even better than it already is).

In the 3.0 kernel, there were a few patches added to ext4. The first patch set adds the capability of what is called “hole punching” in files. There are situations where an application doesn’t need the data in the middle of a file. Both XFS and OCFS2 have the capability of punching a hole in a file where a portion of the file is marked as unwanted and the associated storage is released (i.e. it can be reused). This function was made “generic” so that any file system could use it: it was added to the fallocate system call as the FALLOC_FL_PUNCH_HOLE flag, which first appeared in the 2.6.38 kernel, and in the 3.0 kernel ext4 gained support for it. To learn more about this capability, take a look at this LWN article.

The first patch for ext4 added new routines which are used by fallocate when the hole punch flag is passed. The second patch modifies the truncate routines to support hole punching.
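
From an application’s point of view, punching a hole is a single fallocate call. The sketch below uses Python’s ctypes to call the C library directly; the `punch_hole` helper is my own name, and the code assumes 64-bit Linux (so that off_t is 64 bits) and a file system that supports the flag.

```python
import ctypes
import os
import tempfile

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

# Use the C library already linked into the Python process (Linux only).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                            ctypes.c_int64, ctypes.c_int64]
_libc.fallocate.restype = ctypes.c_int


def punch_hole(fd, offset, length):
    """Release `length` bytes starting at `offset`; the file size is unchanged."""
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if _libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 16384)
        tmp.flush()
        try:
            punch_hole(tmp.fileno(), 4096, 4096)
            print("hole punched, size still", os.path.getsize(tmp.name))
        except OSError as exc:
            print("hole punching not supported here:", exc)
```

After a successful call the punched range reads back as zeros and the blocks behind it go back to the file system. The hole punch flag has to be combined with the keep-size flag, which is why both are set.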

The second ext4 patch set is a very useful patch to help avoid data corruption. This commit added the ability to prevent ext4 from being mounted multiple times. If a non-clustered or non-parallel file system is mounted more than once you have the potential for data corruption. For example, in the case of a high-availability (HA) NFS configuration where one NFS gateway is exporting the data, the other standby NFS gateway will also have to mount the file system. However, if the HA gets confused and both gateways suddenly start exporting the file system to multiple NFS clients, you are almost guaranteed to get data corruption (I’ve seen this happen several times). This patch prevents that from happening.

TRIM Summary
There is lots of interest surrounding TRIM for SSDs (and rightly so). In an upcoming article (it may not be posted yet) I write about how to check whether TRIM is working with your SSD. In that article I review which file systems support TRIM (an important part of making sure TRIM works). With the 3.0 kernel, XFS and OCFS2 gained TRIM capability. Here is the summary of TRIM support for file systems as a function of the kernel.

  • 2.6.33: GFS2, nilfs2, btrfs, ext4, fat
  • 2.6.33: GFS2, nilfs2, btrfs, ext4 (including no journal mode), fat
  • 2.6.37: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat
  • 2.6.38: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs (“offline” mode), ext3
  • 3.0: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs (“offline” and “online” mode), ext3, ocfs2

The Guts

There are some dark, dank parts of the kernel where the casual user should never, ever look. These parts are where very serious things happen inside the kernel. One particular aspect is the VFS (Virtual File System). The 3.0 kernel had some important patches that touched this very important part of the kernel as well as other important bits.

VFS Patches
There has been a rash of work in the VFS recently to make it more scalable. During some of this development it was discovered that some file systems, btrfs and ext4 in particular, had a bottleneck on larger systems. It seems that btrfs was doing an xattr lookup on every write (xattr = extended attributes). This caused an additional tree walk which then hit some per-file-system locks, stalling performance and causing quite bad scalability. So in the 3.0 kernel the capability of caching the xattr to avoid the lookup was added, greatly improving scalability.

Block Layer Patches
There is another dark and scary layer where only the coders with strong kernel-fu dare travel. That is the block layer. In the 3.0 kernel, a patch was added that gives the block layer the ability to discard bio (block IO) in batches in the function blkdev_issue_discard(), making the discarding process much faster. In a test described in the commit, this patch made discards about 16x faster than before. Not bad – I love the smell of performance gains in new kernels!


There has been a great deal of file system development in the 3.0 kernel. Btrfs in particular gained a number of features and capabilities, bringing it closer to a full-featured, stable file system. There were a number of other gains as well, including TRIM support in XFS and OCFS2, online resizing support in NILFS2, and the development of generic hole punching functions that were then used in ext4. OCFS2 also gained some patches that could form the basis of some pretty cool future functionality (keep your eyes peeled).

The 3.0 kernel also saw continued development by some of the best kernel developers, improving the VFS and the block layer. These are places where only really careful and serious kernel developers tread, because they are so complicated and any change has a huge impact on the overall kernel (and your data).

I think the 3.0 kernel is a great kernel from an IO perspective. It was a really good way to transition from the 2.6 kernel series by giving us wonderful new features but not so much that everything broke. Don’t be afraid to give 3.0 a go, but be sure you do things carefully and one step at a time.

