Checksumming Files to Find Bit-Rot

Do You Suffer From Bit-Rot?

Storage admins live in fear of corrupted data. This is why we make backups, copies (replicas), and use other methods to make sure we have copies of the data in case the original is corrupted. One of the most feared sources of data corruption is the proverbial bit-rot.

Bit rot can be caused by a number of sources but the result is always the same – one or more bits in the file have changed, causing silent data corruption. The “silent” part of the data corruption means that you don’t know it happened – all you know is that the data has changed (in essence it is now corrupt).

One source of data corruption is measured by the URE (Unrecoverable Read Error) or UBER (Unrecoverable Bit Error Rate). These measures are a function of the storage media design and tell us the probability of encountering a bit on a drive that cannot be read. Sometimes specific bits on a drive simply cannot be read due to various factors. Usually the drive reports this error and it is recorded in the system logs, and many times the OS will also report an error because it cannot read a specific portion of data. In other cases the drive will read the bit even though it contains bad data (perhaps the bit flipped due to cosmic radiation – which does happen, particularly at higher altitudes), which means that the bit can still be read but its value is now incorrect.

The actual URE or UBER for a particular storage device is usually published by the manufacturer, although many times it can be hard to find. A typical value for hard drives is around 1 x 10^14 bits, which means that, on average, 1 out of every 10^14 bits read cannot be read. Some SSD manufacturers will list their UBER as 1 x 10^15, and some hard drive manufacturers use this same number for enterprise-class drives. Let's convert that number to something a little easier to understand – how many Terabytes (TB) must be read before encountering a URE.

A 1 TB drive has about 2 billion (2 x 10^9) sectors, assuming 512-byte sectors, and let's assume a URE rate of 1 x 10^14 bits. Since each 512-byte sector holds 512 * 8 = 4,096 bits, the URE converts to about 24 x 10^9 sectors (10^14 / 4,096). If you then divide the URE (in sectors) by the number of sectors per drive, you get the following:

24 x 10^9 sectors / 2 x 10^9 sectors per TB = 12 TB


This means that if you read 12 TB of data – twelve 1 TB drives, for example – you can expect, on average, to encounter one URE. If you have 2 TB drives, then reading just six of them gets you to the same point.
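
If you want to play with these numbers yourself, the arithmetic is easy to wrap up in a few lines of Python. This is just a sketch of the example above (512-byte sectors, a URE rate of 1 in 10^14 bits); plug in your own drive capacity and URE rate as needed.

# Rough estimate of how much data can be read before expecting one URE.
# The values below match the example in the text.
bits_per_sector = 512 * 8              # 4,096 bits per 512-byte sector
ure_bits = 1.0e14                      # one unreadable bit per 10^14 bits read
sectors_per_tb = 2.0e9                 # roughly 2 billion 512-byte sectors per TB

ure_sectors = ure_bits / bits_per_sector       # about 24 x 10^9 sectors per URE
tb_before_ure = ure_sectors / sectors_per_tb   # about 12 TB read per expected URE

print("Sectors read per expected URE: %.2e" % ure_sectors)
print("TB read per expected URE: %.1f" % tb_before_ure)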

If you have a RAID-5 group with seven 2 TB drives and one drive fails, the RAID rebuild has to read all of the remaining disks (all six of them). At that point you are almost guaranteed to hit a URE during the RAID-5 rebuild, and the rebuild will fail. This means you have lost the data in that RAID group.

This is just an example of a form of bit-rot – the dreaded URE. It can and does happen and people either don’t see it or go screaming into the night that they lost all of their carefully saved KC and the Sunshine Band flash videos and mp3’s.

So what can be done about bit-rot? There are really two parts to that question: the first is detecting corrupted files and the second is correcting them. In this article, I will talk about some simple techniques using extended file attributes that can help you detect corrupted data (recovery is the subject for a much longer article).

Checksums to the Rescue!

One way to check for corrupted data is through the use of a checksum. A checksum is a simple representation or fingerprint of a block of data (in our case, a file). There are a whole bunch of checksum algorithms, including md5, sha-1, sha-2 (with 256-, 384-, and 512-bit variants), and sha-3. These algorithms can be used to compute the checksum of a chunk of data, such as a file, with longer or more involved checksums typically requiring more computational work. Note that these algorithms are also used in cryptography, but here we are using them simply as a way to fingerprint a file.
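
As an aside, if you're scripting this in Python, the standard hashlib module implements md5 and the sha-2 family directly, so you can compute these fingerprints without any external tools. Here is a minimal sketch that computes two checksums of a file by reading it in chunks so large files don't have to fit in memory (the file name at the bottom is just a placeholder):

import hashlib

def file_checksums(file_name, chunk_size=1024*1024):
    # Compute md5 and sha256 digests of a file, reading it in chunks.
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    f = open(file_name, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            sha256.update(chunk)
    finally:
        f.close()
    return (md5.hexdigest(), sha256.hexdigest())

# Example (hypothetical file name):
# print(file_checksums("./some_file.pdf"))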

So for a given file we could compute a checksum, or even several checksums using different algorithms. Then, before the file is read, its checksum could be recomputed and compared against a stored checksum for that same file. If they do not match, you know the file has changed. And if the time stamps on the file haven't changed since the checksums were computed, then you know the file is corrupt (since no one changed the file, the data obviously fell victim to bit-rot or some other form of data corruption).

If you have read everything carefully to this point you can spot at least one flaw in the logic. The flaw I'm thinking of is that this process assumes that the checksum itself has not fallen victim to data corruption. So we have to find some way of ensuring the checksum itself does not fall victim to bit-rot, or keep a copy of the checksums somewhere that we can reasonably assume will not be corrupted.

However, you can spot other flaws in the scheme (nothing's perfect). One noticeable flaw is that until the checksum of the file is computed and stored in some manner, the file can fall victim to bit-rot. We could go through some gyrations so that the checksum is computed and stored in real time as the file is created. However, that would dramatically increase the computational requirements and also slow down the I/O.

But for now, let’s assume that we are interested in ensuring data integrity for files that have been around a while and maybe haven’t been used for some period of time. The reason this is interesting is that since no one is using the file it is difficult to tell if the data is corrupt. Using checksums can allow us to detect if the file has been corrupted even if no one is actively using the file.

Checksums and Extended File Attributes

The whole point of this discussion is to help protect against bit-rot of files by using checksums. Since the focus is on files it makes sense to store the checksums with the file itself. This is easily accomplished using our knowledge of extended file attributes that we learned in a previous article.

The basic concept is to compute the checksums of the file and store the results in extended attributes associated with the file. That way the checksums are stored with the file itself, which is what we're really after. To help improve things even more, let's compute several checksums of the file since this gives us several ways to detect file corruption. All of the checksums will be stored in extended attributes as well as in a file or database. However, as mentioned before, there is the possibility that the checksums in the extended attributes might themselves be corrupted – so what do we do?

A very simple solution is to store the checksums in a file or simple database and make sure that several copies of that file or database are kept. Then, before you check a file, you first look up its checksum in the file or database and compare it to the checksum in the extended attributes. If they are identical, you know the stored checksum itself hasn't been corrupted and you can safely compare it against a freshly computed checksum of the file.
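
To make that comparison step concrete, here is a rough sketch of one possible verification pass. It recomputes the md5 checksum of a file and compares it against the copy stored in the extended attribute (the attribute name follows the user.checksum.md5sum convention created later in this article, and the sketch uses the Python 2 commands module that the sample code below relies on, plus getfattr's --only-values option to print just the attribute value). Treat it as an illustration, not finished code.

import commands
import shlex

def verify_md5(file_name):
    # Recompute the md5 checksum of the file.
    (status, output) = commands.getstatusoutput("md5sum " + file_name)
    if status != 0:
        return False
    current = shlex.split(output)[0]

    # Read the checksum stored earlier in the extended attribute.
    (status, stored) = commands.getstatusoutput(
        "getfattr --only-values -n user.checksum.md5sum " + file_name)
    if status != 0:
        return False

    # If they match (and the file hasn't been legitimately modified),
    # the file is assumed to be intact.
    return current == stored.strip()

# Example: print(verify_md5("./some_file.pdf"))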

There are lots of refinements to this process that you can develop to improve the probability of the checksums being valid. For example, you could make three copies of the checksum data and compute the checksum of each of these files. Then you compare the checksums of the three copies before you read any data. If two of the three values are the same, you can assume that those two files are correct, that the third is incorrect, and replace it from one of the other copies. But now we are getting into implementation details, which is not the focus of this article.

Let’s take a quick look at some simple Python code to illustrate how we might compute the checksums for a file and store them in extended file attributes.

Sample Python Code

I’m not an expert coder and I don’t play one on television. Also, I’m not a “Pythonic” Python coder, so I’m sure there could be lots of debate about the code. However, the point of this sample code is to illustrate what is possible and how to go about implementing it.

For computing the checksums of a file, I will be using commands that typically come with most Linux distributions. In particular, I will be using md5sum, sha1sum, sha256sum, sha384sum, and sha512sum. To run these commands and grab the output to standard out (stdout), I will use a Python module called commands (note: I'm not using Python 3.x; I wrote this against Python 2.5.2 and have also tested it with Python 2.7.1). This module has Python functions that allow us to run shell commands and capture the output in a tuple (a Python data type).
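
One caveat before we start: the commands module is Python 2 only and was removed in Python 3. If you want to follow along on a Python 3 system, subprocess.getstatusoutput() is a close stand-in that also returns a (status, output) tuple. A tiny sketch (the file name is just an example):

# Python 3 near-equivalent of commands.getstatusoutput
import subprocess

(status, output) = subprocess.getstatusoutput("md5sum ./some_file.pdf")
print(status)     # 0 on success
print(output)     # "<checksum>  ./some_file.pdf"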

However, the output from a shell command can have several parts to it so we may need to break the string into tokens so we can find what we want. A simple way to do that is to use the functions in the shlex module (Simple Lexical Analysis) for tokenizing a string based on spaces.

So let’s get coding! Here is the first part of my Python code to illustrate where I’m headed and how I import modules.

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
# end if


At the top of the code, I import the modules, printing a message and exiting if they can't be found since they are a key part of the application. Then in the main part of the code I define the list of checksum functions I will be using. These are the exact command names used to compute the checksums. Note that I have chosen to compute the checksum of the file using five different algorithms. Having multiple checksums for each file improves the odds of detecting data corruption because I can check all five checksums. It can also help find a corrupted checksum: if one of the checksums in the extended file attributes is wrong but the other four match the file, then we have found a corrupted extended attribute.

For the purposes of this article I’m just going to examine one file, slides_fenics05.pdf (a file I happen to have on my laptop).

The next step in the code is to add the code that loops over all five checksum functions.

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
    for func in checksum_function_list:
        # Create command string to set extended attribute
        command_str = func + " " + file_name;
        checksum_output = commands.getstatusoutput(command_str);
        print "checksum_output: ",checksum_output
    # end for
 
# end if


Notice that I build the exact command line I want to run as a string called "command_str". This is the command executed by the function "commands.getstatusoutput", which returns a 2-tuple (status, output). You can see this in the output from the sample code below.

laytonjb@laytonjb-laptop:~/$ ./test1.py
checksum_output:  (0, '4052e5dd3d79de6b0a03d5dbc8821c60  ./slides_fenics05.pdf')
checksum_output:  (0, 'cdfcadf4752429f01c8105ff15c3e24fa9041b46  ./slides_fenics05.pdf')
checksum_output:  (0, '3c2ad544ba4245dc9e300afe79b81a3a25b2ff6e71e127724acd51124c47a381  ./slides_fenics05.pdf')
checksum_output:  (0, '0761eac4323d35a62c52f3c49dd2098e8b633724ed8dec2ee2de2ddda0874874a916b99287703a9eb1886af62d4ac0b3  ./slides_fenics05.pdf')
checksum_output:  (0, '42674cebe76d0c0567cf1bed21008b005912f0df76990456b669ef3d3942e607d69079e879ceecbb198e846a042f49ee28c145f9b1dc0b4bb4c9ddadd25777c5  ./slides_fenics05.pdf')
laytonjb@laytonjb-laptop:~/Documents/FEATURES/STORAGE094$


You can see that each time the commands.getstatusoutput function is called there are two parts in the output tuple – (1) the status of the command (was it successful?) and (2) the result of the command (the actual output). Ideally we should check the status of the command to determine if it was successful but I will leave that as an exercise for the reader 🙂
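
In case you want a starting point for that exercise, here is a minimal sketch of the same loop with the status check added; it simply reports and skips any checksum command that doesn't return cleanly (this is my illustration, not the article's final code):

import commands

checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"]
file_name = "./slides_fenics05.pdf"

for func in checksum_function_list:
    command_str = func + " " + file_name
    (status, output) = commands.getstatusoutput(command_str)
    if status != 0:
        # The command failed - report it and skip this checksum algorithm.
        print("command failed: " + command_str)
        continue
    print("checksum_output: " + output)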

At this point we want to grab the output from the command (the second item in the 2-tuple) and extract the first part of the string, which is the checksum. To do this we will use the shlex.split function in the shlex module. The code at this point looks like the following:

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
    for func in checksum_function_list:
        # Create command string to set extended attribute
        command_str = func + " " + file_name;
        checksum_output = commands.getstatusoutput(command_str);
        print "checksum_output: ",checksum_output
        tokens = shlex.split(checksum_output[1]);
        checksum = tokens[0];
        print "   checksum = ",checksum," \n";
    # end for
 
# end if


In the code, the output from the checksum command is split (tokenized) based on spaces. Since the first token is the checksum – the piece we want to capture and store in the extended file attribute – we simply take the first element of the list and store it in a variable.

The output from the code at this stage is shown below:

laytonjb@laytonjb-laptop:~$ ./test1.py
checksum_output:  (0, '4052e5dd3d79de6b0a03d5dbc8821c60  ./slides_fenics05.pdf')
   checksum =  4052e5dd3d79de6b0a03d5dbc8821c60  

checksum_output:  (0, 'cdfcadf4752429f01c8105ff15c3e24fa9041b46  ./slides_fenics05.pdf')
   checksum =  cdfcadf4752429f01c8105ff15c3e24fa9041b46  

checksum_output:  (0, '3c2ad544ba4245dc9e300afe79b81a3a25b2ff6e71e127724acd51124c47a381  ./slides_fenics05.pdf')
   checksum =  3c2ad544ba4245dc9e300afe79b81a3a25b2ff6e71e127724acd51124c47a381  

checksum_output:  (0, '0761eac4323d35a62c52f3c49dd2098e8b633724ed8dec2ee2de2ddda0874874a916b99287703a9eb1886af62d4ac0b3  ./slides_fenics05.pdf')
   checksum =  0761eac4323d35a62c52f3c49dd2098e8b633724ed8dec2ee2de2ddda0874874a916b99287703a9eb1886af62d4ac0b3  

checksum_output:  (0, '42674cebe76d0c0567cf1bed21008b005912f0df76990456b669ef3d3942e607d69079e879ceecbb198e846a042f49ee28c145f9b1dc0b4bb4c9ddadd25777c5  ./slides_fenics05.pdf')
   checksum =  42674cebe76d0c0567cf1bed21008b005912f0df76990456b669ef3d3942e607d69079e879ceecbb198e846a042f49ee28c145f9b1dc0b4bb4c9ddadd25777c5  

The final step in the code is to create the command to set the extended attribute for the file. I will create “user” attributes that look like “user.checksum.[function]” where [function] is the name of the checksum command. To do this we need to run a command that looks like the following:

setfattr -n user.checksum.md5sum -v [checksum] [file]


where [checksum] is the checksum that we stored and [file] is the name of the file. I’m using the “user” class of extended file attributes for illustration only. If I were doing this in production, I would run the script as root and store the checksums using the “system” class of extended file attributes since a normal user would not be able to change the result.

At this point, the code looks like the following with all of the “print” functions removed.

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
    for func in checksum_function_list:
        # Create command string to set extended attribute
        command_str = func + " " + file_name;
        checksum_output = commands.getstatusoutput(command_str);
        tokens = shlex.split(checksum_output[1]);
        checksum = tokens[0];
        
        xattr = "user.checksum." + func;
        command_str = "setfattr -n " + xattr + " -v " + str(checksum) + " " + file_name;
        xattr_output = commands.getstatusoutput(command_str);
    # end for
 
# end if


The way we check if the code is working is to look at the extended attributes of the file (recall this article on the details of the command).

laytonjb@laytonjb-laptop:~$ getfattr slides_fenics05.pdf 
# file: slides_fenics05.pdf
user.checksum.md5sum
user.checksum.sha1sum
user.checksum.sha256sum
user.checksum.sha384sum
user.checksum.sha512sum


This lists the extended attributes for the file. We can look at each attribute individually. For example, here is the md5sum attribute.

laytonjb@laytonjb-laptop:~$ getfattr -n user.checksum.md5sum slides_fenics05.pdf 
# file: slides_fenics05.pdf
user.checksum.md5sum="4052e5dd3d79de6b0a03d5dbc8821c60"


If you look at the md5sum in the earlier output listings you can see that it matches the md5 checksum in the extended file attribute associated with the file, indicating that the file hasn't been corrupted.

Ideally we should be checking the status of each command to make sure that it returned successfully. But as I mentioned earlier that exercise is left up to the user.

One other aspect we need to consider is that users may have legitimately changed the data. We should record the date and time when the checksums were computed and store that value in the extended file attributes as well. Then, before recomputing the checksum of a file to see if it is corrupted, we check whether the time stamps associated with the file are more recent than the time stamp recorded when the checksums were originally computed.
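
A rough sketch of that idea follows: record the time the checksums were computed in another "user" attribute and, before trusting a mismatch as corruption, compare it against the file's mtime. The attribute name user.checksum.time is just a convention I'm inventing here for illustration, and the code uses the same Python 2 commands module as the rest of the examples.

import os
import time
import commands

def record_checksum_time(file_name):
    # Store the time (seconds since the epoch) when the checksums were computed.
    stamp = str(int(time.time()))
    command_str = "setfattr -n user.checksum.time -v " + stamp + " " + file_name
    return commands.getstatusoutput(command_str)[0] == 0

def modified_since_checksum(file_name):
    # If the file's mtime is newer than the stored checksum time, the
    # checksums are stale and a mismatch doesn't necessarily mean corruption.
    (status, stored) = commands.getstatusoutput(
        "getfattr --only-values -n user.checksum.time " + file_name)
    if status != 0:
        return True    # no stored time - treat the checksums as stale
    return os.stat(file_name).st_mtime > float(stored)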

Summary

Data corruption is one of the most feared aspects of a storage admin's life. This is why we do backups, replication, etc. – to recover data if the original data gets corrupted. One source of corrupted data is what is called bit-rot: a bit on the storage device goes bad and the data using that bit either cannot be read or returns an incorrect value, meaning the file is now corrupt. But as we accumulate more and more data and this data gets colder (i.e. it hasn't been used in a while), performing backups may not be easy (or even practical), so how do we determine if our data is corrupt?

The technique discussed in this article is to compute the checksum of the file and store it in an extended file attribute. In particular, I compute five different checksums to give us even more data for determining whether the file has been corrupted. By storing all of the checksums in an additional location and ensuring that the stored values aren't corrupt, we can compare the "correct" checksum to the checksum of the file. If they are the same, then the file is not corrupt. But if they differ and the file has not been changed by the user, then the file is likely corrupt.

To help illustrate these ideas, I wrote some simple Python code to show you how it might be done. Hopefully this simple code will inspire you to think about how you might implement something a bit more robust around checksums of files and checking for data corruption.


Extended File Attributes Rock!

Introduction

I think it's a given that the amount of data is increasing at a fairly fast rate. We now have lots of multimedia on our desktops, lots of files on our servers at work, and we're starting to put lots of data into the cloud (e.g. Facebook). One question that affects storage design and performance is whether these files are large or small and how many of them there are.

At this year's FAST (USENIX Conference on File System and Storage Technologies) the best paper went to "A Study of Practical Deduplication" by William Bolosky from Microsoft Research and Dutch Meyer from the University of British Columbia. While the paper covered Windows desktops rather than Linux, and its focus was deduplication, it did present some very enlightening insights into file systems from 2000 to 2010. Some of the highlights from the paper are:


  1. The median file size isn’t changing
  2. The average file size is larger
  3. The average file system capacity has tripled from 2000 to 2010

To fully understand the difference between the first point and the second point you need to remember some basic statistics. The average file size is computed by summing the size of every file and dividing by the number of files. The median file size is found by ordering the list of file sizes from smallest to largest; the median is the value in the middle of the ordered list. With these working definitions, the three observations previously mentioned suggest that desktops have a few really large files that drive up the average file size, while at the same time a large number of small files keeps the median file size about the same despite the increase in the number of files and the increase in large files.

The combination of these observations means that we have many more files on our desktops, and that we are adding some really large files along with a large number of small files.

Yes, it's Windows. Yes, it's desktops. But these observations are another good data point that tells us something about our data: the number of files is getting larger while we are adding some very large files and a large number of small files. What does this mean for us? One thing it means to me is that we need to pay much more attention to managing our data.

Data Management – Who’s on First?

One of the keys to data management is being able to monitor the state of your data which usually means monitoring the metadata. Fortunately, POSIX gives us some standard metadata for our files such as the following:


  • File ownership (User ID and Group ID)
  • File permissions (world, group, user)
  • File times (atime, ctime, mtime)
  • File size
  • File name
  • Is it a true file or a directory?

There are several others (e.g. links) which I didn’t mention here.

With this information we can monitor the basic state of our data. We can compute how quickly our data is changing (how many files have been modified, created, or deleted in a certain period of time). We can also determine how our data is "aging" – that is, how old the average file and the median file are – and we can do this for the entire file system tree or just parts of it. In essence we can get a good statistical overview of the "state of our data".

All of this capability is just great and goes far beyond anything that is available today. However, with file system capacity increasing so rapidly and the median file size staying about the same, we have a lot more files to monitor. Plus we keep data around for longer than we ever have, and over time it is easy to forget what a file name means or what is contained in a cryptically named file. Since POSIX is good enough to give us some basic metadata, wouldn't it be nice to have the ability to add our own metadata – something that we control that would allow us to add information about the data?

Extended File Attributes

What many people don’t realize is that there actually is a mechanism for adding your own metadata to files that is supported by most Linux file systems. This is called Extended File Attributes. In Linux, many file systems support it such as the following: ext2, ext3, ext4, jfs, xfs, reiserfs, btrfs, ocfs2 (2.1 and greater), and squashfs (kernel 2.6.35 and greater or a backport to an older kernel). Some of the file systems have restrictions on extended file attributes, such as the amount of data that can be added, but they do allow for the addition of user controlled metadata.

Any regular file on one of the previously mentioned file systems may have a list of extended file attributes. Each attribute has a name and some associated data (the attribute value). The name starts with what is called a namespace identifier (more on that later), followed by a dot ".", and then followed by a null-terminated string. You can add as many names separated by dots as you like to create "classes" of attributes.

Currently on Linux there are four namespaces for extended file attributes:


  1. user
  2. trusted
  3. security
  4. system

This article will focus on the “user” namespace since it has no restrictions with regard to naming or contents. However, the “system” namespace could be used for adding metadata controlled by root.

The system namespace is used primarily by the kernel for access control lists (ACLs) and can only be set by root. For example, it will use names such as “system.posix_acl_access” and “system.posix_acl_default” for extended file attributes. The general wisdom is that unless you are using ACLs to store additional metadata, which you can do, you should not use the system namespace. However, I believe that the system namespace is a place for metadata controlled by root or metadata that is immutable with respect to the users.

The security namespace is used by SELinux. An example of a name in this namespace would be something such as “security.selinux”.

The user attributes are meant to be used by the user and any application run by the user. The user namespace attributes are protected by the normal Unix user permission settings on the file. If you have write permission on the file then you can set an extended attribute. To give you an idea of what you can do for “names” for the extended file attributes for this namespace, here are some examples:


  • user.checksum.md5
  • user.checksum.sha1
  • user.checksum.sha256
  • user.original_author
  • user.application
  • user.project
  • user.comment

The first three example names are used for storing checksums of the file using three different checksum methods. The fourth example lists the originating author, which can be useful in case multiple people have write access to the file or the original author leaves and the file is assigned to another user. The fifth example can list the application that was used to generate the data, such as the output from an application run. The sixth example lists the project with which the data is associated. And the seventh example is the all-purpose general comment. From these few examples, you can see that it's possible to create some very useful metadata.

Tools for Extended File Attributes

There are several very useful tools for manipulating (setting, getting) extended attributes. These are usually included in the attr package that comes with most distributions. So be sure that this package is installed on the system.

The second thing you should check is that the kernel has attribute support. This should be turned on for almost every distribution that you might use, although there may be some very specialized ones that don't have it enabled. But if you build your own kernels (as yours truly does), be sure it is turned on. You can just grep the kernel's ".config" file for "ATTR" options.
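
If you'd rather check from a script than by hand, a small sketch along these lines works on distributions that install the running kernel's configuration as /boot/config-<kernel release> (a common convention, not a guarantee – adjust the path for your system):

import os
import platform

# Look for attribute-related options in the running kernel's config file.
config_file = "/boot/config-" + platform.release()

if os.path.exists(config_file):
    for line in open(config_file):
        if "ATTR" in line:
            print(line.rstrip())
else:
    print("No kernel config file found at " + config_file)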

The third thing is to make sure that the libattr package is installed. If you installed the attr package then this package should have been installed as well. But I like to be thorough and check that it was installed.

Then finally, you need to make sure the file system you are going to use with extended attributes is mounted with the user_xattr option.

Assuming that you have satisfied all of these criteria (they aren’t too hard), you can now use extended attributes! Let’s do some testing to show the tools and what we can do with them. Let’s begin by creating a simple file that has some dummy data in it.

$ echo "The quick brown fox" > ./test.txt
$ more test.txt
The quick brown fox

Now let’s add some extended attributes to this file.

$ setfattr -n user.comment -v "this is a comment" test.txt


This command sets an extended file attribute with the name "user.comment" (the "-n" option). The "-v" option is followed by the value of the attribute. The final argument for the command is the name of the file.

You can determine the extended attributes on a file with a simple command, getfattr as in the following example,

$ getfattr test.txt
# file: test.txt
user.comment


Notice that this only lists which extended attributes are defined for a particular file, not the values of the attributes. Also notice that it only listed the "user" attributes since the command was run as a regular user. If you ran the command as root and there were system or security attributes assigned, you would see those listed as well.

To see the values of the attributes you have to use the following command:

$ getfattr -n user.comment test.txt
# file: test.txt
user.comment="this is a comment"


With the “-n” option it will list the value of the extended attribute name that you specify.

If you want to remove an extended attribute you use the setfattr command but use the “-x” option such as the following:

$ setfattr -x user.comment test.txt
$ getfattr -n user.comment test.txt
test.txt: user.comment: No such attribute


You can tell that the extended attribute no longer exists because of the output from the getfattr command.

Summary

Without belaboring the point, the amount of data is growing at a very rapid rate even on our desktops. A recent study also pointed out that the number of files is growing rapidly and that we are adding some very large files but also a large number of small files, so the average file size is growing while the median file size is pretty much staying the same. All of this data will result in a huge data management nightmare that we need to be ready to address.

One way to help address the deluge of data is to enable a rich set of metadata that we can use in our data management plan (whatever that is). An easy way to do this is to use extended file attributes. Most of the popular Linux file systems allow you to add metadata to files, and in the case of xfs, you can pretty much add as much metadata as you want to a file.

There are four "namespaces" of extended file attributes that we can access. The one we are interested in as users is the user namespace, because if you have normal write permission on the file, you can add attributes, and if you have read permission, you can read them. But we could use the system namespace as administrators (just be careful) for attributes that we want to assign as root (i.e. users can't change or query the attributes).

The tools to set and get extended file attributes come with virtually every Linux distribution. You just need to be sure they are installed with your distribution. Then you can set, retrieve, or erase as many extended file attributes as you wish.

Extended file attributes can be used to great effect to add metadata to files. It is really up to the user to do this since they understand the data and have the ability to add/change attributes. Extended attributes give a huge amount of flexibility to the user and creating simple scripts to query or search the metadata is fairly easy (an exercise left to the user). We can even create extended attributes as root so that the user can’t change or see them. This allows administrators to add really meaningful attributes for monitoring the state of the data on the file system. Extended file attributes rock!
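
As one example of such a script, here is a small sketch that walks a directory tree and prints the value of a given "user" attribute for every file that has it. It simply shells out to getfattr (using its --only-values option) via the same Python 2 commands module used in the checksum example earlier, so the attr tools need to be installed; the directory and attribute names in the example call are placeholders.

import os
import commands

def find_attribute(top_dir, attr_name):
    # Walk the directory tree and print the value of attr_name for
    # every file that has that extended attribute set.
    for dir_path, dir_names, file_names in os.walk(top_dir):
        for name in file_names:
            full_path = os.path.join(dir_path, name)
            command_str = "getfattr --only-values -n " + attr_name + " " + full_path
            (status, output) = commands.getstatusoutput(command_str)
            if status == 0 and output:
                print(full_path + ": " + output)

# Example: find_attribute("/home/laytonjb", "user.comment")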

How do you Know TRIM is Working With Your SSD in Your System?

Now that you have your shiny new SSD you want to take full advantage of it which can include TRIM support to improve performance. In this article I want to talk about how to tell if TRIM is working on your system (I’m assuming Linux of course).

Overview

Answering the question, "does TRIM work with my system?" is not as easy as it seems. There are several levels to this answer, beginning with "does my SSD support TRIM?". Then we have to make sure the kernel supports TRIM. After that we need to make sure that the file system supports TRIM (or what is referred to as "discard"). If we're using LVM (the device-mapper or dm layer) we have to make sure dm supports discards. And finally, if we're using software RAID (md: multi-device), we have to make sure that md supports discards. So the simple question, "does TRIM work with my system?" has a simple answer of "it depends upon your configuration", but a longer answer if you want details.

In the following sections I’ll talk about these various levels starting with the question of whether your SSD supports TRIM.

Does my SSD support TRIM?

While this is a seemingly easy question – it either works or it doesn't – in actuality it can be somewhat complicated to answer. The first thing to do to determine if your SSD supports TRIM is to upgrade your hdparm package. The reason is that hdparm has fairly recently added the capability of reporting whether the hardware supports TRIM. As of the writing of this article, version 9.37 is the current version of hdparm. It's pretty easy to build the code for yourself (in your user directory) or as root. If you look at the makefile you will see that by default hdparm is installed in /sbin. To install it locally, just modify the makefile so that the "binprefix" variable at the top points to the base directory where you want to install it. For example, if I installed it in my home directory I would change binprefix to /home/laytonjb/binary (I install all applications in a subdirectory called "binary"). Then you simply run "make" and "make install" and you're ready.

Once the updated hdparm is installed, you can test it on your SSD using the following command:

# /sbin/hdparm -I /dev/sdd

where /dev/sdd is the device corresponding to the SSD. If your SSD supports TRIM then you should see something like the following in the output somewhere:

*   Data Set Management TRIM supported

I tested this on a SandForce 1222 SSD that I’ve recently been testing. Rather than stare at all of the output, I like to pipe the output through grep.

[root@test64 laytonjb]# /sbin/hdparm -I /dev/sdd | grep -i TRIM
           *    Data Set Management TRIM supported (limit 1 block)
           *    Deterministic read data after TRIM

Below in Figure 1 is a screenshot of the output (the top portion has been chopped off).




Figure 1: Screenshot of hdparm Output (Top section has been Chopped Off)

Does the kernel support TRIM?

This is one of the easier questions in this article to answer, at least to some degree. I'll walk through the details so you can determine whether your kernel supports TRIM.

TRIM support is also called “discard” in the kernel. To understand what kernels support TRIM a small kernel history is in order:

  • The initial support for discard was in the 2.6.28 kernel
  • In the 2.6.29 kernel, swap was modified to use discard support in case people put their swap space on an SSD
  • In the 2.6.30 kernel, the GFS2 file system gained the ability to generate discard (TRIM) requests
  • In the 2.6.32 kernel, the btrfs file system gained the ability to generate TRIM (discard) requests
  • In 2.6.33, believe it or not, the FAT file system got a "discard" mount option. More importantly, in the 2.6.33 kernel libata (i.e. the SATA driver library) added support for the TRIM command, so kernels 2.6.33 and greater can really support TRIM commands. Also, ext4 added the ability to use "discard" as a mount option so that it can issue TRIM commands
  • In 2.6.36, ext4 added discard support for the no-journal case using the "discard" mount option. Also in 2.6.36, discard support for the delay, linear, mpath, and dm stripe targets was added to the dm layer, and support was added for secure discard. NILFS2 got a "nodiscard" option, since NILFS2 had discard capability when it was added to the kernel in 2.6.30 but no way to turn it off prior to this kernel version
  • In 2.6.37, ext4 gained the ability to do batched discards. This was added as an ioctl called FITRIM to the core code
  • In the 2.6.38 kernel, RAID-1 support for discard was added to the dm layer. Also, xfs added manual support for FITRIM, and ext3 added support for batched discards (FITRIM)
  • In 2.6.39, xfs adds the ability to do batched discards

So in general any kernel 2.6.33 or later should have TRIM capability in the kernel up to the point of the file system. This means the block device layer and the SATA library (libata) support TRIM.

However, you have to be very careful because I’m talking about the kernel.org kernels. Distributions will take patches from more recent kernels and back-port them to older kernels. So please check with your distribution if they are using a 2.6.32 or older kernel (before 2.6.33) but have TRIM support added to the SATA library. You might be in luck.

Does the file system support TRIM?

At this point we know if the hardware supports TRIM and if the kernel supports TRIM, but before it becomes usable we need to know if the file system can use the TRIM (discard) command. Again, this depends upon the kernel so I will summarize the kernel version and the file systems that are supported.

  • 2.6.33: GFS2, nilfs2, btrfs, ext4, fat
  • 2.6.36: GFS2, nilfs2, btrfs, ext4 (including no journal mode), fat
  • 2.6.37: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat
  • 2.6.38: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs, ext3

So if you have hardware that supports TRIM, a kernel that supports TRIM, then you can look up if your file system is capable of issuing TRIM commands (even batched TRIM).

Does the dm layer support TRIM?

The dm layer (device mapper) is a very important layer in Linux. You probably encounter this layer when you are using LVM with your distribution. So if you want to use LVM with your SSD(s), then you need to make sure that the TRIM command is honored starting at the top with the file system, then down to the dm layer, then down to the block layer, then down to the driver layer (libata), and finally down to the drives (hardware). The good news is that in the 2.6.36 kernel, discard support was added to the dm layer! So any kernel 2.6.36 or later supports TRIM in the dm layer.

Does the MD layer support TRIM?

The last aspect we'll examine is support for software RAID via md (multi-device) in Linux. Here I have some bad news: TRIM is not currently supported in the md layer (yet). So if you're using md for software RAID within Linux, the TRIM command is not supported at this time.

Testing for TRIM Support

The last thing I want to mention is how you can test your system for TRIM support beyond just looking at the previous lists to see if TRIM support is there. I used the procedure in this article to test if TRIM was actually working.

The first logical step is to make sure the SSD isn't being used during these tests (i.e. it's quiet). Also be sure that the SSD supports TRIM using hdparm as previously mentioned (I'm using a SandForce 1222 based SSD that I've written about previously). Also be sure your file system uses TRIM. For this article I used ext4 and mounted it with the "discard" option as shown in Figure 2 below.


Figure 2: Screenshot of /etc/fstab File Showing discard mount option

Then, as root, create a file on the SSD drive. The commands are,

[root@test64 laytonjb]# cd /dev/sdd
[root@test64 laytonjb]# dd if=/dev/urandom of=tempfile count=100 bs=512k oflag=direct

 


Below in Figure 3 is the screenshot of this on my test system.


Figure 3: Screenshot of first command

The next step is to get the sector map of the tempfile just created and note the beginning LBA of the first extent (the one that starts at byte 0). It sounds like a mouthful but it isn't hard. The basic command is,

[root@test64 laytonjb]# /sbin/hdparm --fibmap tempfile

 


Below in Figure 4 is the screenshot of this on my test system.


Figure 4: Screenshot of Sector Information for Temporary File

So the beginning LBA for this file, which is the sector we are interested in, is 271360.

We can read that sector directly to show that there is data written there. The basic command is,

[root@test64 laytonjb]# /sbin/hdparm --read-sector 271360 /dev/sdb

 


Below in Figure 5 is the screenshot of this on my test system.


Figure 5: Screenshot of Sector Information for Temporary File

Since the data is not all zeros, that means there is data there.

The next step is to erase the file, tempfile, and sync the system to make sure that the file is actually erased. Figure 6 below illustrates this.


Figure 6: Screenshot of Erasing Temporary File and Syncing the Drive

I ran the sync command three times just to be sure (I guess this shows my age since the urban legend was to always run sync several times to flush all buffers).

Finally, we can repeat reading the sector 271360 to see what happened to the data there.

[root@test64 laytonjb]# /sbin/hdparm --read-sector 271360 /dev/sdb

 


Below in Figure 7 is the screenshot of this on my test system.


Figure 7: Screenshot of Sector Information for Temporary File After File was Erased

Since the data in that sector is now a bunch of zeros, you can see that the drive has “trimmed” the data. This requires a little more explanation.

Before the TRIM command existed, the SSD wouldn't necessarily erase a block right away when the data in it was deleted. The next time it needed the block, it would erase it prior to using it. This means that file deletion performance was very good since the SSD just marked the block as empty and returned to the kernel. However, the next time that block was used, it first had to be erased, potentially slowing down write performance.

Alternatively, the block could be erased when the data was deleted but this means that the file delete performance isn’t very good. However, this can help write performance because the blocks are always erased and ready to be used.

TRIM gives the operating system (file system) a way to tell the SSD that it is done with a block even if the SSD doesn't yet know the block is no longer in use. So when the SSD receives a TRIM command, it marks the block as empty and, when it gets a chance – perhaps when the SSD isn't as busy – it erases the empty blocks (i.e. "trims" them or "discards" their contents). This means that the erase cycle isn't affecting write performance or file delete performance, improving the apparent overall performance. This sounds great and it usually works really well for desktops, where the drive periodically gets a break so the SSD has time to erase trimmed blocks. However, if the SSD is being used heavily, then it may not have an opportunity to erase the blocks and we're back to the behavior without TRIM. This isn't necessarily a bad thing, just a limitation of what TRIM can do for SSD performance.

Summary

I hope you have found this article useful. Answering the question "does my Linux box support TRIM?" is a little convoluted. Unfortunately, the answer isn't simple and you have to step through all of the layers to fully understand whether TRIM is supported. A good way to start is to look at the last list about which file systems support TRIM, select the one you like, and see which kernel version you need. If you need dm support for LVM then you have to use at least 2.6.36. If you use md with your SSDs then I'm afraid you are out of luck with TRIM support.

What’s an IOPS?

Introduction

There have been many articles exploring the performance aspects of file systems, storage systems, and storage devices. Coupled with Throughput (Bytes per second), IOPS (Input/Output Operations per Second) is one of the two measures of performance that are typically examined when discussing storage media. Vendors will publish performance results with data such as “Peak Sequential Throughput is X MB/s” or “Peak IOPS is X” indicating the performance of the storage device. But what does an IOPS really mean and how is it defined?

Typically an IOP is an IO operation where data is sent between an application and the storage device, and IOPS is the measure of how many of these operations can be performed per second. But notice the word "typically" in this explanation. That means there is no hard and fast definition of an IOPS that is standard for everyone. Consequently, as you can imagine, it's possible to "game" the results and publish whatever results you like (see a related article, Lies, Damn Lies and File System Benchmarks). That is the sad part of IOPS and bandwidth – the results can be manipulated to be almost whatever the tester wants.

However, IOPS is a very important performance measure for applications because, believe it or not, many applications perform IO using very small transfer sizes (for example, see this article). How quickly or efficiently a storage system can perform IOPS can drive the overall performance of the application. Moreover, today's systems have lots of cores and run several applications at one time, further pushing the storage performance requirements. Therefore, knowing the IOPS measure of your storage devices is important, but you need to be critical of the numbers that are published.

Measuring IOPS

There are several tools that are commonly used for measuring IOPS on systems. The first one is called Iometer, which you commonly see used on Windows systems. The second most common tool is IOzone, which has been used in the articles published on Linux Magazine because it is open source, easy to build on almost any system, has a great number of tests and options, and is widely used for storage testing. It is fairly evident at this point that having two tools could lead to some differences in IOPS measurements. Ideally there should be a precise definition of an IOPS with an accepted way to measure it. Then the various tools for examining IOPS would have to prove that they satisfy the definition ("certified" is another way of saying this). But just picking the software tool is perhaps the easiest part of measuring IOPS.

One commonly overlooked aspect of measuring IOPS is the size of the I/O operation (sometimes called the "payload size", to use the terminology of the networking world). Does the I/O operation involve just a single byte? Or does it involve 1 MB? Just stating that a device can achieve 1,000 IOPS really tells you nothing. Is that 1,000 1-byte operations per second or 1,000 1 MB operations per second?

The most common IO operation size for Linux is 4KB (or just 4K). It corresponds to the page size on almost all Linux systems so usually produces the best IOPS (but not always). Personally, I want to see IOPS measures for a range of IO operation sizes. I like to see 1KB (in case there is some exceptional performance at really small payload sizes), 4KB, 32KB, 64KB, maybe 128KB or 256KB, and 1MB. The reason I like to see a range of payload sizes is that it tells me how quickly the performance drops with payload size which I can then compare to the typical payload size of my application(s) (actually the “spectrum” of payload sizes). But if push comes to shove, I want to at least see the 4KB payload size but most importantly I want the publisher to tell me the payload size they used.

A second commonly overlooked aspect of measuring IOPS is whether the IO operation is a read or write or possibly a mix of them (you knew it wasn’t going to be good when I start numbering discussion points). Hard drives, which have spinning media, usually don’t have much difference between read and write operations and how fast they can execute them. However, SSDs are a bit different and have asymmetric performance. Consequently, you need to define how the IO operations were performed. For example, it could be stated, “This hardware is capable of Y 4K Write IOPS” where Y is the number, which means that the test was just write operations. If you compare some recent results for the two SSDs that were tested (see this article) you can see that SSDs can have very different Read IOPS and Write IOPS performance – sometimes even an order of magnitude different.

Many vendors choose to publish either Read IOPS or Write IOPS but rarely both. Other vendors like to publish IOPS for a mixed operation environment stating that the test was 75% Read and 25% Write. While they should be applauded for stating the mix of IO operations, they should also publish their Read IOPS performance (all read IO operations), and their Write IOPS performance (all write IO operations) so that the IOPS performance can be bounded. At this point in the article, vendors should be publishing the IOP performance measures something like the following:


  • 4K Read IOPS =
  • 4K Write IOPS =
  • (optional) 4K (X% Read/Y% Write) IOPS =

Note that the third bullet is optional and the ratio of read to write IOPS is totally up to the vendor.

A third commonly overlooked aspect of measuring IOPS is whether the IO operations are sequential or random. With sequential IOPS, the IO operations happen sequentially on the storage media. For example block 233 is used for the first IO operation, followed by block 234, followed by block 235, etc. With random IOPS the first IO operation is on block 233 and the second is on block 568192 or something like that. With the right options on the test system, such as a large queue depth, the IO operations can be optimized to improve performance. Plus the storage device itself may do some optimization. With true random IOPS there is much less chance that the server or storage device can optimize the access very much.

Most vendors report the sequential IOPS since typically it has a much larger value than random IOPS. However, in my opinion, random IOPS is much more meaningful, particularly in the case of a server. With a server you may have several applications running at once, accessing different files and different parts of the disk so that to the storage device, the access looks random.

So, at this point in the discussion, the IOPS performance should be listed something like the following:


  • 4K Random Read IOPS =
  • 4K Random Write IOPS =
  • 4K Sequential Read IOPS =
  • 4K Sequential Write IOPS =
  • (optional) 4K Random (X% Read/Y% Write) IOPS =

The IOPS can be either random or sequential (I like to see both), but at the very least vendors should state whether the published IOPS are sequential or random.

A fourth commonly overlooked aspect of measuring IOPS is the queue depth. With Windows storage benchmarks, you see the queue depth adjusted quite a bit in the results. Linux does a pretty good job setting good queue depths so there is much less need to change the defaults. However, the queue depths can be adjusted which can possibly change the performance. Changing the queue depth on Linux is fairly easy.

The Linux IO scheduler has the functionality to sort incoming IO requests into something called the request-queue, where they are optimized for the best possible device access, which usually means sequential access. The size of this queue is controllable. For example, you can look at the queue depth for the "sda" disk in a system and change it as shown below:

# cat /sys/block/sda/queue/nr_requests
128
# echo 100000 > /sys/block/sda/queue/nr_requests


Configuring the queue depth can only be done by root.
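
If you want to see the current setting for every block device at once, a short sketch like the following reads the same sysfs files (reading works as a normal user; writing a new value still requires root):

import glob

# Print the request queue depth (nr_requests) for each block device
# that exposes one under /sys/block.
for path in glob.glob("/sys/block/*/queue/nr_requests"):
    device = path.split("/")[3]          # e.g. "sda"
    depth = open(path).read().strip()
    print(device + ": " + depth)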

At this point the IOPS performance should be published something like the following:


  • 4K Random Read IOPS = X (queue depth = Z)
  • 4K Random Write IOPS = Y (queue depth = Z)
  • 4K Sequential Read IOPS = X (queue depth = Z)
  • 4K Sequential Write IOPS = Y (queue depth = Z)
  • (optional) 4K Random (X% Read/Y% Write) IOPS = W (queue depth = Z)

Or, if you like, the vendor can simply state the queue depth once if it applies to all of the tests.

In the Linux world, not too many "typical" benchmarks try different queue depths since the default queue depth of 128 already provides good performance. However, depending upon the workload or the benchmark, you can adjust the queue depth to produce better performance. Just be warned that if you change the queue depth for some benchmark, real application performance could suffer.

Notice that it is starting to take a fair amount of work to list the IOPS performance. There are at least four IOPS numbers that need to be reported for a specified queue depth. However, I personally would like to see the IOPS performance for several payload sizes and several queue depths. Very quickly, the number of tests that need to be run is growing quite rapidly. To take the side of the vendors, producing this amount of benchmarking data takes time, effort, and money. It may not be worthwhile for them to perform all of this work if the great masses don’t understand nor appreciate the data. On the other hand, taking the side of the user, this type of data is very useful and important since it can help set expectations when we buy a new storage device. And remember, the customer is always right so we need to continue to ask the vendors for this type of data.

There are several other "tricks" you can do to improve performance, including more OS tuning, turning off all cron jobs during testing, locking processes to specific cores using numactl, and so on. Covering all of them is beyond this article but you can assume that most vendors like to tune their systems to improve performance (ah – the wonders of benchmarks). One way to improve this situation is to report all details of the test environment (I try to do this) so that one could investigate which options might have been changed. However, for rotating media (hard drives), one can at least estimate the IOPS performance of single devices (i.e. individual drives).

Estimating IOPS for Rotating Media

For pretty much all rotational storage devices, the dominant factors in determining IOPS performance are seek time, access latency, and rotational speed (but typically we think of rotational speed as affecting seek time and latency). Basically the dominant factors affect the time to access a particular block of data on the storage media and report it back. For rotating media the latency is basically the same for read or write operations, making our life a bit easier.

The seek time is usually reported by disk vendors and is the time it takes for the drive head to move into position to read (or write) the correct track. The latency refers to the amount of time it takes for the specific spot on the drive to be in place underneath a drive head. The sum of these two times is the basic amount of time to read (or write) a specific spot on the drive. Since we’re focusing on rotating media, these times are mechanical so we can safely assume they are much larger than the amount of time to actually read the data or get it back to the drive controller and the OS (remember we’re talking IOPS so the amount of data is usually very small).

To estimate the IOPS performance of a hard drive, we simply use the sum of these two average times to compute the number of IO operations we can do per second.

Estimated IOPS = 1 / (average latency + average seek time)


Both values need to be in the same units, and the sum must be converted to seconds before taking the reciprocal (I’ll leave the unit conversion to you). For example, if a disk has an average latency of 3 ms and an average seek time of 4.45 ms, then the estimated IOPS performance is,

Estimated IOPS = 1 / (average latency + average seek time)
Estimated IOPS = 1 / (3 ms + 4.45 ms)
Estimated IOPS = 1 / (7.45 ms) = 1 / (0.00745 s)
Estimated IOPS = 134


This handy-dandy formula works for single rotating-media drives (SSD IOPS performance is more difficult to estimate and the estimates are not as accurate). Estimating performance for storage arrays that have RAID controllers and several drives is much more difficult, although there are some articles floating around the web that attempt to do it.
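
If you want to play with the estimate yourself, here is a minimal sketch in C that simply encodes the arithmetic above; the 3 ms and 4.45 ms values are just the example numbers from the text, not measurements from any particular drive.

#include <stdio.h>

int main(void)
{
    double avg_latency_ms = 3.0;   /* average rotational latency (ms) */
    double avg_seek_ms    = 4.45;  /* average seek time (ms) */

    /* convert the combined time to seconds, then invert to get IO/s */
    double estimated_iops = 1.0 / ((avg_latency_ms + avg_seek_ms) / 1000.0);

    printf("Estimated IOPS: %.0f\n", estimated_iops);  /* prints roughly 134 */
    return 0;
}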

Summary

IOPS is one of the important measures of performance of storage devices. Personally I think it is the first performance measure one should examine since IOPS are important to the overall performance of a system. However, there is no standard definition of an IOPS so just like most benchmarks, it is almost impossible to compare values from one storage device to another or one vendor to another.

In this article I tried to explain a bit about IOPS and how they can be influenced by various factors. Hopefully this helps you realize that published IOPS numbers may have been “gamed” by vendors and that you should ask for more details on how the values were obtained. Even better, you can run the benchmarks yourself, or ask those who post benchmarks how they tested for IOPS performance.

Linux 2.6.39: IO Improvements for the Last Kernel in the 2.6 Series


The 2.6.39 kernel has been out for a bit so reviewing the IO improvements is a good exercise to understand what’s happening in the Linux storage world. But just in case you didn’t know, this will be the last kernel in the 2.6.x series. The next kernel will be 3.0 and not 2.6.40.

2.6.39 is Out!

For those that live under a rock (and not that stupid Geico commercial), the 2.6.39 kernel was released on May 18th. For us Linux storage junkies there were some very nice additions to the kernel.

File Systems

While there are many aspects to storage in Linux, one of the most visible is the file system. Just about every new kernel has new file system features, and the 2.6.39 kernel was definitely not an exception – notable patches touched 11 file systems in the kernel. The following sections highlight some of the more noticeable ones.

Ext4

There was an update to ext4 that went into the 2.6.37 kernel but was disabled because it wasn’t quite ready for production use (i.e. a corruption bug was found). The change allowed ext4 to use the block IO layer (called “bio”) directly, instead of the intermediate “buffer” layer. This can improve performance and scalability, particularly when your system has lots of cores (smp). You can read more about this update here. But remember that it was disabled in the source code of the 2.6.37 kernel.

In the 2.6.39 kernel, the ext4 code was updated and the corruption bugs fixed, so this scalability patch was re-enabled. You can read the code commit here. This is one of Ted Ts’o’s favorite patches because it can really pump up ext4 performance on smp systems (BTW – get ready for large smp systems because we could easily see 64-core mainstream servers later this year).

Btrfs

In the 2.6.39 kernel, btrfs added the option of different compression and copy-on-write (COW) settings for each file or directory. Before this patch, these settings were on a per-file system basis, so these new changes allow much finer control (if you want). The first commit for this series is here.

Btrfs is under fairly heavy development so there were other patches to improve functionality and performance as well as trace points. Trace points can be very important because they allow debugging or monitoring to be more fine grained.

GFS2

GFS2 is a clustered file system that has been in the Linux kernel for quite some time (since Red Hat bought Sistina). In the 2.6.39 kernel a few performance changes were made. The first patch improves deallocation performance resulting in about a 25% improvement in file deletion performance of GFS2 (always good to see better metadata performance).

The second patch improves the cluster mmap scalability. Sounds like a mouthful but it’s a fairly important performance patch. Remember that GFS2 is a clustered file system where each node has the ability to perform IO perhaps to the same file. This patch improves the performance when the file access is done via mmap (an important file access method for some workloads) by several nodes.

The third patch comes from Dave Chinner, one of the key xfs developers. The patch reduces the impact of “log pushing” on sequential write throughput performance. This means that the patch improves performance by reducing the impact of log operations on the overall file system performance (yeah performance!).

The fourth patch uses RCU for the glock hash table. In a nutshell, this patch converts the glock (global lock) hash table of GFS2 to use RCU (read-copy-update). Locks such as the glock are used so that a process can read or write a file while excluding other processes (so you don’t get data corruption), and RCU is attractive here because it has very low overhead. Again, this patch is a nice performance upgrade for GFS2.

HPFS

HPFS (High Performance File System) is something of an ancient file system, originally designed for OS/2 (remember those days?). It is still in the kernel, and during the hunt for parts of the kernel that still used the BKL (Big Kernel Lock), it was found that HPFS still used a number of those functions.

When SMP (Symmetric Multi-Processing) was first introduced into the Linux kernel, functions that performed coarse locking (the BKL) were introduced. These functions allowed the kernel to work with SMP systems, particularly those with small core counts. However, with common servers now having up to 48 cores, the kernel doesn’t scale as well with this kind of locking. So the kernel developers went on a BKL hunt to eliminate the code in the kernel that used BKL functions and recode it to use much more scalable locking mechanisms.

One of the last remaining parts of the kernel that used BKL functions was HPFS. What made removing those routines difficult is that no one could be found who really maintained the code, and no one stepped up and said they really used it any more. So there was some discussion about eliminating the code completely, but instead it was decided to give the HPFS code the best patching effort possible. Mikulas Patocka created three patches for HPFS that were pulled into the 2.6.39 kernel. The first allowed HPFS to compile with the PREEMPT and SMP options (basically, the BKL parts of HPFS were gone). The second patch implemented fsync in HPFS (which shows how old the code is). And the third patch removed the CR/LF option (this was only really used in the 2.2 kernel).

So, in the 2.6.39 kernel the final bits of the BKL in HPFS were eliminated. With this and the other remaining BKL pieces removed from the kernel, BKL victory could be declared (and all of the peasants rejoiced).

XFS

The strong development pace of XFS continued in the 2.6.39 kernel with two important patches. The first patch made delayed logging the default (in past kernels, starting with 2.6.35, you had to explicitly turn it on). Delayed logging can greatly improve performance, particularly metadata performance. If you want to see an example of this, read Ric Wheeler’s Red Hat Summit paper on testing file systems with 1 billion files.

The second patch removed the use of the page cache to back the buffer cache in xfs. The reason for this is that the buffer cache now has its own LRU, so the page cache is no longer needed to provide persistent caching; the patch saves the overhead of using it. The patch also means that xfs can handle 64k pages (if you want to go there) and has an impact on the 16TB file system limit for 32-bit machines.

Ceph

In the 2.6.39 kernel, two patches were added to Ceph. The first one added a mount option, ino32. This option allows ceph to report 32 bit ino values which is useful for 64-bit kernels with 32-bit userspace.

The second patch adds a lingering request and watch/notify event framework to Ceph (i.e. more tracking information in Ceph).

Exofs

Exofs is an object-oriented file system in the Linux kernel (which makes it pretty cool IMHO). In the 2.6.39 kernel there was one major patch for Exofs that added the option of mounting the file system by osdname. This can be very useful if more devices are added later or the login order has changed.

Nilfs2

Nilfs2 is a very cool file system in the kernel that is what is termed a log-structured file system. In the 2.6.39 kernel nilfs2 added some functions that expose the standard attribute set to user-space via the chattr and lsattr tools. This can be very useful for the various tools that read file attributes from user-space (mostly management and monitoring tools).

CIFS

While Linux is the one and only true operating system for the entire planet (as we all know), we do have to work with other operating systems from time to time, much to our dismay :). The most obvious interaction is with Windows systems via CIFS (Common Internet File System). In the 2.6.39 kernel a patch was added that allows user names longer than 32 bytes. This patch allows better integration between Windows systems and Linux systems.

Squashfs

One of my favorite file systems in the Linux kernel is SquashFS. In the 2.6.39 kernel, a patch was added that allows SquashFS to support the xz decompressor. Xz is a lossless compression format that uses the LZMA2 compression algorithm.

Block Patches

Another key aspect to storage in Linux is the block layer in the kernel. The 2.6.39 kernel had a few patches for this layer helping to improve performance and add new capability.

The first patch adds the capability of syncing a single file system. Typically, the sync(2) function commits the buffer cache to disk but does so for all mounted file systems. However, you may not want to do this on a system that has several mounted file systems. So this patch introduces a new system call, syncfs(2), which takes a file descriptor as an argument and syncs only the file system that the descriptor belongs to.
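
To make the new call a little more concrete, here is a small sketch of how an application might use syncfs(2) from C; /mnt/data is a made-up mount point, and it assumes a glibc new enough to provide the syncfs() wrapper (otherwise you would call it through syscall(2)).

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* any file descriptor on the target file system will do */
    int fd = open("/mnt/data", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* flush only the file system that contains fd, not every mounted fs */
    if (syncfs(fd) != 0)
        perror("syncfs");

    close(fd);
    return 0;
}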

The DM layer in the kernel also saw some patches, the first of which is quite interesting. This first patch added a new target to the DM layer called a “flakey target” in the patch commit. This target is the same as the linear target except that it returns I/O errors periodically, which is useful for simulating failing devices for testing purposes. This is not a target you want to use for real work, of course, but if you are developing or testing things it might be worthwhile (at the very least it has an interesting name).

The second patch introduces a new merge function for striped targets in the DM layer. This patch improves performance by about 12-35% when a reasonable chunk size is used (64K, for example) in conjunction with a stripe count that is a power of 2. What the patch does is allow large block I/O requests to be merged together and handled properly by the DM layer. File systems such as XFS and ext4 take care to assemble large block I/O requests to improve performance; previously the DM layer broke these apart before handing them to the underlying hardware, eliminating the performance gains the file systems took pains to create, but now it supports them directly.

Summary

The 2.6.39 kernel is somewhat quiet, with no really huge storage-oriented features, but it does show that continual progress is being made, with performance gains in a number of places. It touched a large number of file systems and also touched the block layer to some degree. No fewer than 11 file systems had noticeable patches (I didn’t discuss 9p in this article). To me that signals lots of work on Linux storage (love to see that).

The block layer, while sometimes a quiet piece of the kernel, had some patches that improved it for both users and developers. It introduced a new system call, syncfs, that allows a single file system to be synced at a time, which is a very useful feature for systems that have many mounted file systems.

The DM layer also improved performance for file systems that assemble larger block I/O requests by supporting them for the striped target. This can be a great help with more enterprise class oriented hardware.

Lots of nice improvements in this kernel, and it’s good to see so much work focused on storage within Linux. This also puts the Linux world on a solid footing moving into the new 3.0 series of kernels, which is what we’ll see next instead of a 2.6.40 kernel. But before moving on to the 3.0 series I wanted to thank all the developers who worked on the 2.6 series of kernels (and the 2.5 development series). The 2.6 series started off as a kernel with more functionality and ambition than the prior series. It then developed into a great kernel not only for average users but also for the enterprise world. From the mid-2.6.2x kernels to 2.6.39, Linux storage has developed at a wonderful rate and we are all better off because of it. Thanks everyone, and I look forward to the fun of the 3.0 series!

Switching to Scientific Linux 6.1

Introduction

I started using Linux very seriously in about 1993 or so, when I converted my home system to Linux using Yggdrasil and a bunch of floppies. (Actually, I remember Linus’ posting to comp.os.minix because I was looking for a *nix that I could run on my own system – I used *nix heavily in graduate school and read that mailing list in hopes of figuring out how to install Minix on my home system.) So I started using Yggdrasil and really liked it.

However, Yggdrasil’s run pretty much ended in 1995, although I used it for a while longer because it was fun (and easy). Around 1996 or 1997 I switched over to Red Hat Linux and found that I really liked it. Plus, the world was settling on rpm as the application distribution format, so I headed down the Red Hat path.

I used Red Hat for quite a while on all kinds of systems: my personal desktop at home, desktops at work (that was an interesting adventure because of SCO and the whole lawsuit threat), and HPC systems. I was very happy with it and I tried to purchase support when I could for production systems. Then Red Hat announced their change to Red Hat Enterprise Linux (RHEL), so I decided to make a switch.

At that time I switched over to CentOS for various reasons. For production systems I switched to RHEL, but for my home systems I used CentOS. I liked CentOS very much – it was just like the RHEL I used at work but without the support costs, which I couldn’t afford.

CentOS

I used CentOS all over my home whenever I needed Linux. At one point I had 13 servers and desktops all running CentOS, and I was happy as a clam. The security updates came out fairly quickly, the community was fairly good, and they even tolerated non-CentOS questions such as general admin questions. I used CentOS on many HPC systems and wrote lots of articles using CentOS as the OS. Sorry Red Hat – I just couldn’t afford the support costs at the time, and CentOS gave me everything I needed.

During this time I also tried SuSE because we used it at Linux Networx, since it was cheaper for HPC than Red Hat. I even had SuSE on my laptop for a few years but didn’t use it too much.

I also tried CAOS Linux because my friend Greg Kurtzer, who developed Warewulf and Perceus, was developing it. I used it on a few small clusters and wrote a few articles about it. It was close enough to Red Hat that I was comfortable with it, but I still used CentOS on my desktop (old habits die hard).

I also tried Ubuntu during this time; it was nice and easy to use and worked well on laptops, so I switched my laptops over to it. However, it did have a slight learning curve, so I didn’t switch any HPC systems or production systems over to it.

I still used CentOS until the great “whine” debacle of 2010-2011.

CentOS Community disintegrates

I don’t know the exact date, but around 2010 or 2011 the whole CentOS project started to unravel. It didn’t track RHEL updates very quickly, particularly for RHEL 6.x and RHEL 5.6 (CentOS 5.5 had been slow enough).

The mailing lists soon filled with users asking about the newer versions. Then the volume on the mailing lists turned up and the “developers” became belligerent, secretive, and amazingly rude. I know they were doing CentOS work on the side, but the fast disintegration of CentOS soon became apparent.

So I was stuck. I couldn’t afford to buy RHEL from Red Hat (maybe the Workstation version, but I was looking for the server version) and CentOS was rapidly becoming a steaming pile. I wasn’t ready to jump with both feet into Ubuntu or SuSE (sorry guys) because I knew RHEL well enough that I could focus on what I wanted to do with it, rather than on how to install and admin it.

Scientific Linux

I had known about Scientific Linux for a while since I work in that field. I had heard some good things about it, and I was impressed by the speed at which they put out distributions and security updates. So I thought I would give it a try.

I grabbed the SL6.1 DVD install iso and put it on my main test system (I use it for all of my storage testing). The installation went very smoothly – exactly like I’m used to. I have a tendency to go a little heavy on the initial package selection, so I restrained myself with this installation. However, I did choose the alternative yum repos so I could get some extra stuff. For example, I installed my all-time favorites: gkrellm, nedit, and vlc. Easy as pie. But one of the cool things that comes with SL is all of the extra XFS goodies (got to love xfs!).

Summary

I suppose CentOS 6.1 would have been just as easy to install, but SL6.1 was just as easy, gave me a few more goodies than CentOS, and doesn’t subject me to any of the drama surrounding CentOS. So I get the exact behavior that I want (RHEL or CentOS), the easy installation (RHEL or CentOS), none of the CentOS drama, and a price I can afford, for now. Seems like a good deal, so I will definitely be switching over to SL on all my production boxes at home, but I’ll still use RHEL on production systems outside home (BTW – Red Hat is doing some great things around the HPC community, file systems, and storage, so they deserve our support in my opinion).

Linux 3.0: Some Cool Changes for Storage


The first shiny brand new kernel, 3.0, of the 3.x kernel series is out and there are some goodies for storage (as always). Let’s take a look at some of these.

2.6.x – So Long and Thanks for All the IO!

Before jumping into the 3.0 kernel I wanted to thank all of the developers and testers who put so much time into the 2.6 kernel series. I remember when the 2.5 “development” kernel came out and the time so many people put into it. I also remember when the 2.6 kernel came out and I was so excited to try it (and a bit nervous). Then I started to watch the development of the 2.6 kernel series, particularly the IO side of the kernel, and was very excited and pleased with the work. So to everyone who was involved in the development and testing of the 2.6 series, my profound and sincere thanks.

3.x – Welcome to the Excitement!

Let’s move on to the excitement of the new 3.x series. The 3.0 kernel is out and ready for use. It has some very nice changes/additions for the IO inclined (that’s us if you’re reading this article). We can break down the developments into a few categories including file systems, which is precisely where we’ll start.

The biggest changes to file systems in 3.0 happened to btrfs. There were three major classes of improvements:

  • Automatic defragmentation
  • Scrubbing (this is really cool)
  • Performance improvements

These improvements added greatly to the capability of btrfs, the fair-haired boy of Linux file systems at the moment.

Btrfs – Automatic defragmentation
The first improvement, automatic defragmentation, does exactly what it sounds like: it automatically defrags an online file system. Normally, Linux file systems such as ext4 and XFS do a pretty good job of keeping files as contiguous as possible by delaying allocations (to combine data requests), using extents (contiguous ranges of blocks), and using other techniques. However, don’t forget that btrfs is a COW (copy-on-write) file system. COW is great for a number of things, including file systems. When a file is first written, btrfs usually lays it out in sequential order as best it can (very similar to XFS or ext4). However, because of the COW nature of the file system, any changes to the file are written to free blocks and not over the data that was already there. Consequently, the file system can fragment fairly quickly.

Prior to the 3.0 kernel, the way to handle fragmentation was to either (1) defrag the file system periodically, or (2) mount the file system with COW disabled. The first option is fairly easy to do,

$ btrfs filesystem defragment

But you have to remember to run the command or put it in a cron job, and performance will suffer a bit during the defragmentation process.

The second option involved mounting btrfs with an option to turn off COW. The mount option is,

-o nodatacow

This will limit the fragmentation of the file system but you lose the goodness that COW gives to btrfs. What is really needed is a way to defragment the file system when perhaps the file system isn’t busy or a technique to limit the fragmentation of the file system without giving up COW.

In the 3.0 kernel, btrfs gained some ability to defragment on the fly for certain types of data. In particular, it now has the mount option,

-o autodefrag

This option tells btrfs to look for small random writes into files and queue those files for an automatic defrag process. According to the notes in the commit, the defrag capability isn’t well suited for database workloads yet, but it does work for smaller databases such as the rpm database (don’t forget that rpm-based distros have an rpm database that is constantly being updated), sqlite, or bdb databases. This new automatic defrag feature of btrfs is very useful for limiting this one source of fragmentation. If you think the file system has gotten too fragmented, you can always defragment it by hand via the “btrfs” command.
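
If you prefer to set the option from a program rather than from the command line, here is a minimal sketch using the mount(2) system call; /dev/sdb1 and /mnt/btrfs are hypothetical names, and the equivalent command is simply mount -o autodefrag /dev/sdb1 /mnt/btrfs.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* the last argument of mount(2) carries file-system-specific options */
    if (mount("/dev/sdb1", "/mnt/btrfs", "btrfs", 0, "autodefrag") != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}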

Since btrfs is constantly being compared to ZFS, let’s compare the defrag capabilities of both. The authors of ZFS have tried to mitigate the impact of COW on fragmentation by keeping file changes in a buffer as long as possible before flushing the data to disk. Evidently, the authors feel that this limits most of the fragmentation in the file system. However, it’s impossible to completely eliminate fragmentation just by using a large buffer. On the other hand, btrfs has a defrag utility if you want to defrag your file system, and this new feature focuses on the main source of fragmentation – small writes to existing files. While this feature isn’t perfect, it does provide the basis of a very good defrag capability without having to rely on large buffers. I would say the score on this feature is: btrfs – 1, ZFS – nil.

Btrfs – Scrubbing
The second new feature, and my personal favorite, is the scrubbing capability that has been added to btrfs. Remember that btrfs computes checksums of data and metadata and stores them in a checksum tree. Ideally these checksums can be used to determine if data or metadata has gone bad (e.g. “bit rot”).

In the 3.0 kernel, btrfs gained the ability to do what is called a “scrub” of the data. This means that the stored checksums are compared to freshly computed checksums to determine if the data has been corrupted. In the current patch set the scrub checks the extents in the file system and, if a problem is found, searches for a good copy of the data. If a good copy is found, it is used to overwrite the bad copy. Note that this approach also catches checksums that have become corrupted, in addition to data that may have become corrupted.

Given the size of drives and the size of RAID groups, the probability of hitting an error is increasing. Having the ability to scrub the data in a file system is a great feature.

In keeping with the comparison to ZFS: ZFS has had this feature for some time, while btrfs only gained it in the 3.0 kernel. But let me point out one quick thing. I have seen many people say that ZFS has techniques to prevent data corruption. This statement is only partially true for ZFS, and now for btrfs. The file systems have the ability to detect data corruption, but they can only correct the corruption if a good copy of the data exists. If you have a RAID-1 configuration, or RAID-5 or -6, then the file system can find a good copy of the block (or construct one) and overwrite the bad data block(s). If you only have a single disk or RAID-0, then either file system will only detect data corruption but can’t correct it. (Note: there is an option in ZFS that tells it to make two copies of the data, but this halves your usable capacity, as if you used the drive in a RAID-1 configuration with two equal-size partitions. You can also tell it to make three copies of the data on a single disk if you really want to.) So I would score this one: btrfs – 1, ZFS – 1.

Btrfs – Performance Improvements
The automatic defragmentation and scrubbing features are really wonderful additions to btrfs but there is even more (and you don’t have to trade to get what is behind curtain #2). In the 3.0 kernel, there were some performance improvements added to btrfs.

The first feature is that the performance of file creations and deletions was improved. When btrfs creates or deletes a file it has to do a great number of b+ tree insertions, such as inode names, directory name items, directory name index entries, etc. In the 3.0 kernel, these b+ tree insertions and deletions are delayed, which improves performance. The details of the implementation are fairly involved (you can read them here), but basically btrfs tries to do these operations in batches rather than addressing them one at a time. The result is that for some microbenchmarks, file creations have improved by about 15% and file deletions by about 20%.

A second performance improvement does not flush the checksum items of unchanged file data. While this doesn’t sound like a big deal, it helps fsync speeds. In the commit for the patch, a simple sysbench test doing a “random write + fsync” improved by roughly a factor of 10, from about 112.75 requests/sec to 1216 requests/sec.

The third big performance improvement is the inclusion of a new patch that allocates chunks in a better sequence for multiple devices (RAID-0, RAID-1), especially when there is an odd number of devices. Prior to this patch, when multiple devices were used, btrfs allocated chunks on the devices in the same order, which could cause problems when there was an odd number of devices in a RAID-1 or RAID-10 configuration. This patch sorts the devices before allocating and allocates stripes from the devices with the most available space, as long as there is space available (capacity balancing).

Other File System changes in 3.0


There were some other changes to Linux file systems in the 3.0 kernel.


XFS
In the 2.6.38 kernel, XFS gained the ability for manual SSD discards from userspace using the FITRIM ioctl. The patch was not designed to be run during normal workloads since the freespace btree walks can cause large performance degradations. So while XFS had some “TRIM” capability it was not “online” when the file system was operating.

In the 3.0 kernel, XFS gained a patch that implements “online” discard support (i.e. TRIM). The patch uses the function “blkdev_issue_discard” once a transaction commits (i.e. once the space is known to be unused).
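
For the manual, FITRIM side of things, here is a hedged sketch of what a userspace trim essentially looks like in C (this is basically what the fstrim(8) utility does); /mnt/xfs is a hypothetical mount point.

#include <fcntl.h>
#include <limits.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    struct fstrim_range range;
    int fd = open("/mnt/xfs", O_RDONLY);   /* a directory on the mounted fs */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    memset(&range, 0, sizeof(range));
    range.len = ULLONG_MAX;   /* ask to trim free space across the whole fs */
    /* range.minlen = 0 means trim any free extent, no matter how small */

    if (ioctl(fd, FITRIM, &range) != 0)
        perror("FITRIM");
    else
        printf("Trimmed %llu bytes\n", (unsigned long long)range.len);

    close(fd);
    return 0;
}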

NILFS2
NILFS2 gained the ability to resize while online in the 3.0 kernel. This patch added a resize ioctl (IO Control) that makes online resizing possible (uber cool).

OCFS2
In the 3.0 kernel, the clustered file system, OCFS2, gained a couple of new features. The first feature it gained was page discards (aka TRIM). This first patch added the FITRIM ioctl. The second patch added the ability for the OCFS2 function, ocfs2_trim_fs, to trim freed clusters in a volume.

A second patch set which, at first glance, seems to be inconsequential actually has some downstream implications. There were two patches, this patch and this patch, which allow OCFS2 to move extents. Based on these patches, keep an eye on future OCFS2 capabilities.

EXT4
If you watch the ext4 mailing list you will see that there is a great deal of development still going on in this file system. If you look deep enough a majority of this development is to add features and even more capability to ext4. Ext4 is shaping up to be one great file system (even better than it already is).

In the 3.0 kernel, there were a few patches added to ext4. The first patch set adds the capability of what is called “hole punching” in files. There are situations where an application doesn’t need the data in the middle of a file. Both XFS and OCFS2 have the capability of punching a hole in a file where a portion of the file is marked as unwanted and the associated storage is released (i.e. it can be reused). In the 3.0 kernel, this function was made “generic” so that any file system could use it (it was added to the fallocate system call). To learn more about this capability, take a look at this LWN article.

The first patch for ext4 added new routines which are used by fallocate when the hole punch flag is passed. The second patch modifies the truncate routines to support hole punching.
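
To give a feel for what hole punching looks like from userspace, here is a hedged sketch using fallocate(2) with the punch-hole flag; bigfile.dat and the offsets are made-up values, and the file must live on a file system with hole punch support (XFS, OCFS2, and now ext4).

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* release 1 MiB of storage starting at offset 4 MiB; KEEP_SIZE is
     * required with PUNCH_HOLE so the apparent file size is unchanged */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  4 * 1024 * 1024, 1 * 1024 * 1024) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}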

The second ext4 patch set is a very useful one that helps avoid data corruption. This commit added the ability to prevent ext4 from being mounted multiple times. If a non-clustered, non-parallel file system is mounted more than once, you have the potential for data corruption. For example, in a high-availability (HA) NFS configuration where one NFS gateway is exporting the data, the standby NFS gateway will also have to mount the file system. However, if the HA setup gets confused and both gateways suddenly start exporting the file system to multiple NFS clients, you are almost guaranteed to get data corruption (I’ve seen this happen several times). This patch prevents that from happening.

TRIM Summary
There is lots of interest surrounding TRIM for SSDs (and rightly so). In an upcoming article (it may not be posted yet) I write about how to check whether TRIM is working with your SSD. In that article I review which file systems support TRIM (an important part of making sure TRIM works). With the 3.0 kernel, XFS and OCFS2 gained TRIM capability. Here is a summary of TRIM support in various file systems as a function of the kernel version.

  • 2.6.32: GFS2, nilfs2, btrfs, ext4, fat
  • 2.6.33: GFS2, nilfs2, btrfs, ext4 (including no journal mode), fat
  • 2.6.37: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat
  • 2.6.38: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs (“offline” mode), ext3
  • 3.0: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs (“offline” and “online” mode), ext3, ocfs2

The Guts


There are some dark, dank parts of the kernel where the casual user should never, ever look. These parts are where very serious things happen inside the kernel. One such part is the VFS (Virtual File System). The 3.0 kernel had some important patches that touched this critical piece of the kernel, as well as some other key bits.


VFS Patches
There has been a rash of work recently in the VFS to make it more scalable. During some of this development it was discovered that some file systems, btrfs and ext4 in particular, had a bottleneck on larger systems. It seems that btrfs was doing an xattr lookup on every write (xattr = extended attribute). This caused an additional tree walk, which then hit some per-file-system locks, stalling performance and causing quite bad scalability. So in the 3.0 kernel the capability of caching the xattr to avoid the lookup was added, greatly improving scalability.


Block Layer Patches
There is another dark and scary layer where only the coders with strong kernel-fu dare travel. That is the block layer. In the 3.0 kernel, a patch was added that gives the block layer the ability to discard bio (block IO) in batches in the function blkdev_issue_discard(), making the discarding process much faster. In a test described in the commit, this patch made discards about 16x faster than before. Not bad – I love the smell of performance gains in new kernels!

Summary


There has been a great deal of file system development in the 3.0 kernel. Btrfs in particular gained a number of features and capabilities, bringing it closer to a full-featured, stable file system. There were a number of other gains as well, including TRIM support in XFS and OCFS2, online resizing support in NILFS2, and the development of generic hole punching functions that were then used in ext4. OCFS2 also gained some patches that could form the basis of some pretty cool future functionality (keep your eyes peeled).

The 3.0 kernel also saw continued development by some of the best kernel developers, improving the VFS and the block layer. These are places only really careful and serious kernel developers tread, because it is so complicated and any change has a huge impact on the overall kernel (and your data).

I think the 3.0 kernel is a great kernel from an IO perspective. It was a really good way to transition from the 2.6 kernel series by giving us wonderful new features but not so much that everything broke. Don’t be afraid to give 3.0 a go, but be sure you do things carefully and one step at a time.