IOSTAT Plotter V2 – Update

I’ve been updating nfsiostat_plotter and in the process I found a few bugs in iostat_plotter V2.

The first bug was in the way it kept time. If the time stamp ended in "PM", the code added 12 hours so that everything was based on a 24-hour clock. The problem is that if iostat is run just after noon, the time stamp reads something like "12:49 PM", which then became 24:49 instead of staying 12:49. This bug has been fixed.
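
As a quick sanity check of the correct conversion (using GNU date, assuming it is available on your system), times just after noon should keep hour 12 while times just after midnight should become hour 00:

$ date -d "12:49 PM" +%H:%M
12:49
$ date -d "12:49 AM" +%H:%M
00:49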

The second bug was in the plotting routines when doing "combined" plots where all devices are plotted on the same graph. It's hard to explain, but when there were two or three plots, one above the other, sometimes the grid didn't get plotted, and/or one of the graphs was longer than the others. This has been corrected (I hope).

The new version can be found here. The code has the same name as before (sorry about that). Just download the code and run it like you did before.

IOSTAT Plotter V2

I have had several requests for changes/modifications/updates to iostat_plotter, so I recently found a little time to update the code. Rather than just replace it, I created iostat_plotter_v2.

The two biggest changes made in the code are: (1) moving the legends outside the plots to the top right-hand corner, and (2) allowing all of the devices to be included on the same set of plots. The first change makes the plots look nicer (I think). I tried to maximize the width of the plot without getting too crazy. I also shrunk the text size of the legend so you can get more devices in the plots. I think you can get about 12 devices without the legend bleeding over into the plot below it.

The second change allows you to put all of the devices on the same plot if you use the "-c" option with the script. In the previous version you got a set of 11 plots for each device, which allowed you to clearly examine each device; if you had five devices you got a total of 55 plots (5 devices * 11 plots per device). With the "-c" option you now get 11 plots with all of the devices on each of the appropriate plots.
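
For example, a combined-plot run might look something like the following. This is just a sketch: the script file name and the iostat output file name are placeholders, and you run the new script the same way you ran the original, just with "-c" added.

$ python iostat_plotter_v2.py -c iostat_output.txt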

I hope these couple of features are useful. Please let me know if they are, or if you want other changes. Thanks!

Extended File Attributes Rock!

Introduction

I think it's a given that the amount of data is increasing at a fairly fast rate. We now have lots of multimedia on our desktops, lots of files on our servers at work, and we're starting to put lots of data into the cloud (e.g. Facebook). One question that affects storage design and performance is whether these files are large or small, and how many of them there are.

At this year's FAST (the USENIX Conference on File and Storage Technologies) the best paper went to "A Study of Practical Deduplication" by William Bolosky from Microsoft Research and Dutch Meyer from the University of British Columbia. While the paper covered Windows rather than Linux, focused on desktops, and was primarily about deduplication, it presented some very enlightening insights on file systems from 2000 to 2010. Some of the highlights from the paper are:


  1. The median file size isn’t changing
  2. The average file size is larger
  3. The average file system capacity has tripled from 2000 to 2010

To fully understand the difference between the first point and the second point you need to remember some basic statistics. The average file size is computed by summing the size of every file and dividing by the number of files. The median file size is found by ordering the file sizes from smallest to largest; the median is the value in the middle of that ordered list. With these working definitions, the three observations indicate that desktops probably have a few really large files that drive up the average file size, while at the same time there are enough small files to keep the median file size about the same despite the increase in the number of files and the increase in large files.

The combination of the observations previously mentioned means that we have many more files on our desktops and that we are adding some really large files and about the same number of small files.

Yes, it's Windows. Yes, it's desktops. But these observations are another good data point that tells us something about our data: the number of files is getting larger while we are adding some very large files and a large number of small files. What does this mean for us? One thing it means to me is that we need to pay much more attention to managing our data.

Data Management – Who’s on First?

One of the keys to data management is being able to monitor the state of your data which usually means monitoring the metadata. Fortunately, POSIX gives us some standard metadata for our files such as the following:


  • File ownership (User ID and Group ID)
  • File permissions (world, group, user)
  • File times (atime, ctime, mtime)
  • File size
  • File name
  • Is it a true file or a directory?

There are several others (e.g. links) which I didn’t mention here.
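
As a quick illustration, most of this standard metadata can be pulled for any file with the stat command from GNU coreutils; the format string in the second command just picks out a few of the fields listed above:

$ stat test.txt
$ stat -c "%n: %s bytes, uid=%u gid=%g, mode=%a, mtime=%y" test.txt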

With this information we can monitor the basic state of our data. We can compute how quickly our data is changing (how many files have been modified, created, or deleted in a certain period of time). We can also determine how our data is "aging" – that is, how old the average file and the median file are – and we can do this for the entire file system tree or just certain parts of it. In essence we can get a good statistical overview of the "state of our data".

All of this capability is just great. However, with file system capacity increasing so rapidly and the median file size staying about the same, we have a lot more files to monitor. Plus we keep data around longer than we ever have. Over time it is easy to forget what a file name means or what a cryptically named file contains. Since POSIX is good enough to give us some basic metadata, wouldn't it be nice to have the ability to add our own metadata? Something that we control that would allow us to add information about the data?

Extended File Attributes

What many people don’t realize is that there actually is a mechanism for adding your own metadata to files that is supported by most Linux file systems. This is called Extended File Attributes. In Linux, many file systems support it such as the following: ext2, ext3, ext4, jfs, xfs, reiserfs, btrfs, ocfs2 (2.1 and greater), and squashfs (kernel 2.6.35 and greater or a backport to an older kernel). Some of the file systems have restrictions on extended file attributes, such as the amount of data that can be added, but they do allow for the addition of user controlled metadata.

Any regular file on one of the previously mentioned file systems may have a list of extended file attributes. Each attribute has a name and some associated data (the attribute value). The name starts with what is called a namespace identifier (more on that later), followed by a dot ".", and then a null-terminated string. You can add as many dot-separated components to the name as you like to create "classes" of attributes.

Currently on Linux there are four namespaces for extended file attributes:


  1. user
  2. trusted
  3. security
  4. system

This article will focus on the “user” namespace since it has no restrictions with regard to naming or contents. However, the “system” namespace could be used for adding metadata controlled by root.

The system namespace is used primarily by the kernel for access control lists (ACLs) and can only be set by root. For example, it will use names such as “system.posix_acl_access” and “system.posix_acl_default” for extended file attributes. The general wisdom is that unless you are using ACLs to store additional metadata, which you can do, you should not use the system namespace. However, I believe that the system namespace is a place for metadata controlled by root or metadata that is immutable with respect to the users.

The security namespace is used by SELinux. An example of a name in this namespace would be something such as “security.selinux”.

The user attributes are meant to be used by the user and any application run by the user. The user namespace attributes are protected by the normal Unix user permission settings on the file. If you have write permission on the file then you can set an extended attribute. To give you an idea of what you can do for “names” for the extended file attributes for this namespace, here are some examples:


  • user.checksum.md5
  • user.checksum.sha1
  • user.checksum.sha256
  • user.original_author
  • user.application
  • user.project
  • user.comment

The first three example names are used for storing checksums of the file using three different checksum methods. The fourth example lists the original author, which can be useful in case multiple people have write access to the file or the original author leaves and the file is assigned to another user. The fifth example records the application that was used to generate the data. The sixth example lists the project with which the data is associated. And the seventh example is the all-purpose general comment. From these few examples, you can see that you can create some very useful metadata.

Tools for Extended File Attributes

There are several very useful tools for manipulating (setting, getting) extended attributes. These are usually included in the attr package that comes with most distributions. So be sure that this package is installed on the system.

The second thing you should check is that the kernel has attribute support. This should be turned on for almost every distribution that you might use, although there may be some very specialized ones that don't have it turned on. But if you build your own kernels (as yours truly does), be sure it is turned on. You can just grep the kernel's ".config" file for the "ATTR" options, as shown below.
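
Here is a sketch of that check; the location of the config file varies by distribution, and many kernels also expose their configuration through /proc/config.gz when CONFIG_IKCONFIG_PROC is enabled. You are looking for the *_FS_XATTR options (e.g. for ext3/ext4) set to "y".

$ grep ATTR /boot/config-$(uname -r)
$ zgrep ATTR /proc/config.gz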

The third thing is to make sure that the libattr package is installed. If you installed the attr package then this package should have been installed as well. But I like to be thorough and check that it was installed.
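
If you want to double-check, your package manager can confirm that both packages are present. This is just a sketch; the exact package names can vary a little between distributions.

$ rpm -q attr libattr          # RPM-based distributions
$ dpkg -l attr libattr1        # Debian/Ubuntu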

Then finally, you need to make sure the file system you are going to use with extended attributes is mounted with the user_xattr option.
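
For example, you can remount an existing file system with the option on the fly (the device and mount point below are placeholders for your own file system):

# mount -o remount,user_xattr /home

or make it permanent with an /etc/fstab entry along the lines of:

/dev/sda2   /home   ext4   defaults,user_xattr   0   2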

Assuming that you have satisfied all of these criteria (they aren’t too hard), you can now use extended attributes! Let’s do some testing to show the tools and what we can do with them. Let’s begin by creating a simple file that has some dummy data in it.

$ echo "The quick brown fox" > ./test.txt
$ more test.txt
The quick brown fox

Now let’s add some extended attributes to this file.

$ setfattr -n user.comment -v "this is a comment" test.txt


This command sets an extended attribute named "user.comment" on the file (the "-n" option specifies the attribute name). The "-v" option is followed by the value to store in the attribute. The final argument is the name of the file.

You can determine the extended attributes on a file with a simple command, getfattr as in the following example,

$ getfattr test.txt
# file: test.txt
user.comment


Notice that this only lists what extended attributes are defined for a particular file not the values of the attributes. Also notice that it only listed the “user” attributes since the command was done as a regular user. If you ran the command as root and there were system or security attributes assigned you would see those listed.

To see the values of the attributes you have to use the following command:

$ getfattr -n user.comment test.txt
# file: test.txt
user.comment="this is a comment"


With the “-n” option it will list the value of the extended attribute name that you specify.
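
Alternatively, the "-d" option of getfattr dumps the names and values of all of the user-namespace attributes on a file at once, which is handy when you don't know in advance what attributes are present:

$ getfattr -d test.txt
# file: test.txt
user.comment="this is a comment"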

If you want to remove an extended attribute you use the setfattr command but use the “-x” option such as the following:

$ setfattr -x user.comment test.txt
$ getfattr -n user.comment test.txt
test.txt: user.comment: No such attribute


You can tell that the extended attribute no longer exists because of the message returned by the getfattr command.
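
Putting the pieces together, here is a small sketch of how one of the example names mentioned earlier, an MD5 checksum, might be attached to a file. This assumes the md5sum tool is available; the awk step just strips the file name from its output.

$ setfattr -n user.checksum.md5 -v "$(md5sum test.txt | awk '{print $1}')" test.txt
$ getfattr -n user.checksum.md5 test.txt

The stored value is the 32-character hex digest of the file's contents, which you could later compare against a freshly computed checksum to detect corruption or unexpected changes.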

Summary

Without belaboring the point, the amount of data is growing at a very rapid rate, even on our desktops. A recent study also pointed out that the number of files is growing rapidly and that we are adding some very large files as well as a large number of small files, so that the average file size is growing while the median file size is staying pretty much the same. All of this data will result in a huge data management nightmare that we need to be ready to address.

One way to help address the deluge of data is to enable a rich set of metadata that we can use in our data management plan (whatever that is). An easy way to do this is to use extended file attributes. Most of the popular Linux file systems allow you to add metadata to files, and in the case of xfs, you can pretty much add as much metadata as you want to a file.

There are four "namespaces" of extended file attributes that we can access. The one we are interested in as users is the user namespace because if you have normal write permission on the file, you can add attributes. If you have read permission on the file you can also read the attributes. But we could use the system namespace as administrators (just be careful) for attributes that we want to assign as root (i.e. users can't change or query the attributes).

The tools to set and get extended file attributes come with virtually every Linux distribution. You just need to be sure they are installed with your distribution. Then you can set, retrieve, or erase as many extended file attributes as you wish.

Extended file attributes can be used to great effect to add metadata to files. It is really up to the user to do this since they understand the data and have the ability to add/change attributes. Extended attributes give a huge amount of flexibility to the user and creating simple scripts to query or search the metadata is fairly easy (an exercise left to the user). We can even create extended attributes as root so that the user can’t change or see them. This allows administrators to add really meaningful attributes for monitoring the state of the data on the file system. Extended file attributes rock!

How do you Know TRIM is Working With Your SSD in Your System?

Now that you have your shiny new SSD you want to take full advantage of it which can include TRIM support to improve performance. In this article I want to talk about how to tell if TRIM is working on your system (I’m assuming Linux of course).

Overview

Answering the question, "does TRIM work with my system?" is not as easy as it seems. There are several levels to this answer, beginning with, "does my SSD support TRIM?". Then we have to make sure the kernel supports TRIM. After that we need to make sure that the file system supports TRIM (or what is referred to as "discard"). If we're using LVM (the device-mapper or dm layer), we have to make sure dm supports discards. And finally, if we're using software RAID (md: multi-device), we have to make sure that md supports discards. So the simple question, "does TRIM work with my system?" has a simple answer of "it depends upon your configuration", but a longer answer if you want details.

In the following sections I’ll talk about these various levels starting with the question of whether your SSD supports TRIM.

Does my SSD support TRIM?

While this is a seemingly easy question – it either works or it doesn't – in actuality it can be somewhat complicated to answer. The first thing to do to determine whether your SSD supports TRIM is to upgrade your hdparm package. The reason is that hdparm has fairly recently added the capability to report whether the hardware supports TRIM. As of the writing of this article, version 9.37 is the current version of hdparm. It's pretty easy to build the code for yourself (in your user directory) or as root. If you look at the Makefile you will see that by default hdparm is installed in /sbin. To install it locally, just modify the Makefile so that the "binprefix" variable at the top points to the base directory where you want to install it. For example, if I installed it in my home directory I would change binprefix to /home/laytonjb/binary (I install all applications in a subdirectory called "binary"). Then you simply run 'make' and 'make install' and you're ready.

Once the updated hdparm is installed, you can test it on your SSD using the following command:

# /sbin/hdparm -I /dev/sdd

where /dev/sdd is the device corresponding to the SSD. If your SSD supports TRIM then you should see something like the following in the output somewhere:

*   Data Set Management TRIM supported

I tested this on a SandForce 1222 SSD that I’ve recently been testing. Rather than stare at all of the output, I like to pipe the output through grep.

[root@test64 laytonjb]# /sbin/hdparm -I /dev/sdd | grep -i TRIM
           *    Data Set Management TRIM supported (limit 1 block)
           *    Deterministic read data after TRIM

Below in Figure 1 is a screenshot of the output (the top portion has been chopped off).




Figure 1: Screenshot of hdparm Output (Top section has been Chopped Off)

Does the kernel support TRIM?

To some degree, this is one of the easier questions in this article to answer. However, I'll walk through the details so you can determine whether your kernel works with TRIM.

TRIM support is also called “discard” in the kernel. To understand what kernels support TRIM a small kernel history is in order:

  • The initial support for discard was in the 2.6.28 kernel.
  • In the 2.6.29 kernel, swap was modified to use discard support in case people put their swap space on an SSD.
  • In the 2.6.30 kernel, the ability for the GFS2 file system to generate discard (TRIM) requests was added.
  • In the 2.6.32 kernel, the btrfs file system obtained the ability to generate TRIM (discard) requests.
  • In 2.6.33, believe it or not, the FAT file system got a "discard" mount option. But more importantly, in the 2.6.33 kernel, libata (i.e. the SATA driver library) added support for the TRIM command. So really at this point, kernels 2.6.33 and greater can support TRIM commands. Also, ext4 added the ability to use "discard" as a mount option for supporting the TRIM command.
  • In 2.6.36, ext4 added discard support when there is no journal, using the "discard" mount option. Also in 2.6.36, discard support for the delay, linear, mpath, and dm stripe targets was added to the dm layer. Support was also added for secure discard. NILFS2 got a "nodiscard" option, since NILFS2 had discard capability when it was added to the kernel in 2.6.30 but no way to turn it off prior to this kernel version.
  • In 2.6.37, ext4 gained the ability to do batched discards. This was added as an ioctl called FITRIM to the core code.
  • In 2.6.38, RAID-1 support for discard was added to the dm layer. Also, xfs added manual support for FITRIM, and ext3 added support for batched discards (FITRIM).
  • In 2.6.39, xfs will add the ability to do batched discards.

So in general any kernel 2.6.33 or later should have TRIM capability in the kernel up to the point of the file system. This means the block device layer and the SATA library (libata) support TRIM.

However, you have to be very careful because I’m talking about the kernel.org kernels. Distributions will take patches from more recent kernels and back-port them to older kernels. So please check with your distribution if they are using a 2.6.32 or older kernel (before 2.6.33) but have TRIM support added to the SATA library. You might be in luck.

Does the file system support TRIM?

At this point we know if the hardware supports TRIM and if the kernel supports TRIM, but before it becomes usable we need to know if the file system can use the TRIM (discard) command. Again, this depends upon the kernel so I will summarize the kernel version and the file systems that are supported.

  • 2.6.33: GFS2, nilfs2, btrfs, ext4, fat
  • 2.6.36: GFS2, nilfs2, btrfs, ext4 (including no journal mode), fat
  • 2.6.37: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat
  • 2.6.38: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs, ext3

So if you have hardware that supports TRIM, a kernel that supports TRIM, then you can look up if your file system is capable of issuing TRIM commands (even batched TRIM).
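
As a quick sketch of what this looks like in practice (the device name and mount point below are just examples), an ext4 file system can be mounted with the "discard" option so that the file system issues TRIM commands as files are deleted; on kernels with batched discard support (FITRIM), the fstrim utility from util-linux can instead be run against a mounted file system to trim it on demand:

# mount -t ext4 -o discard /dev/sdd1 /mnt/ssd
# fstrim -v /mnt/ssd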

Does the dm layer support TRIM?

The dm layer (device mapper) is a very important layer in Linux. You probably encounter this layer when you are using LVM with your distribution. So if you want to use LVM with your SSD(s), then you need to make sure that the TRIM command is honored starting at the top with the file system, then down to the dm layer, then down to the block layer, then down to the driver layer (libata), and finally down to the drives (hardware). The good news is that in the 2.6.36 kernel, discard support was added to the dm layer! So any kernel 2.6.36 or later will support TRIM through dm.

Does the MD layer support TRIM?

The last aspect we'll examine is support for software RAID via MD (multi-device) in Linux. Unfortunately, I have some bad news: TRIM is not currently supported in the md layer. So if you're using md for software RAID within Linux, the TRIM command is not supported at this time.

Testing for TRIM Support

The last thing I want to mention is how you can test your system for TRIM support beyond just looking at the previous lists to see if TRIM support is there. I used the procedure in this article to test if TRIM was actually working.

The first logical step is to make sure the SSD isn't being used during these tests (i.e. it's quiet). Also be sure that the SSD supports TRIM using hdparm as previously mentioned (I'm using a SandForce 1222 based SSD that I've written about previously). Also be sure your file system uses TRIM. For this article I used ext4 and mounted it with the "discard" option as shown in Figure 2 below.


Figure 2: Screenshot of /etc/fstab File Showing discard mount option

Then, as root, change to the directory where the SSD is mounted (the mount point of /dev/sdd, not the raw device) and create a file there. The commands are (the mount point below is just an example),

[root@test64 laytonjb]# cd /mnt/ssd
[root@test64 laytonjb]# dd if=/dev/urandom of=tempfile count=100 bs=512k oflag=direct

 


Below in Figure 3 is the screenshot of this on my test system.


Figure 3: Screenshot of first command

The next step is to get the device sectors of the tempfile just created and note the beginning LBA (the first sector value listed after the byte offset of 0). It sounds like a mouthful but it isn't hard. The basic command is,

[root@test64 laytonjb]# /sbin/hdparm --fibmap tempfile

 


Below in Figure 4 is the screenshot of this on my test system.


Figure 4: Screenshot of Sector Information for Temporary File

So the beginning LBA for this file, which is the sector we are interested in, is 271360.

We can read that sector of the file using this location to show that there is data written there. The basic command is,

[root@test64 laytonjb]# /sbin/hdparm --read-sector 271360 /dev/sdd

 


Below in Figure 5 is the screenshot of this on my test system.


Figure 5: Screenshot of Sector Information for Temporary File

Since the data is not all zeros, that means there is data there.

The next step is to erase the file, tempfile, and sync the system to make sure that the file is actually erased. Figure 6 below illustrates this.


Figure 6: Screenshot of Erasing Temporary File and Syncing the Drive

I ran the sync command three times just to be sure (I guess this shows my age since the urban legend was to always run sync several times to flush all buffers).

Finally, we can repeat reading the sector 271360 to see what happened to the data there.

[root@test64 laytonjb]# /sbin/hdparm --read-sector 271360 /dev/sdd

 


Below in Figure 7 is the screenshot of this on my test system.


Figure 7: Screenshot of Sector Information for Temporary File After File was Erased

Since the data in that sector is now a bunch of zeros, you can see that the drive has “trimmed” the data. This requires a little more explanation.

Before the TRIM command existed, the SSD wouldn't necessarily erase a block right away when the data in it was deleted. The next time it needed the block it would erase it prior to using it. This means that file deletion performance was very good, since the SSD just marked the block as empty and returned control to the kernel. However, the next time that block was used, it first had to be erased, potentially slowing down write performance.

Alternatively, the block could be erased when the data was deleted but this means that the file delete performance isn’t very good. However, this can help write performance because the blocks are always erased and ready to be used.

TRIM gives the operating system (file system) the opportunity to tell the SSD that it is done with a block, even though the SSD otherwise has no way of knowing the block is no longer in use. When the SSD receives a TRIM command, it marks the block as empty, and when it gets a chance (when the SSD isn't as busy) it erases the empty blocks (i.e. "trims" them or "discards" their contents). This means that the erase cycle doesn't affect write performance or file delete performance, improving the apparent overall performance. This usually works really well for desktops, where the workload pauses periodically so the SSD has time to erase trimmed blocks. However, if the SSD is being used heavily, then it may not have an opportunity to erase the blocks, and we're back to the behavior without TRIM. This isn't necessarily a bad thing, just a limitation of what TRIM can do for SSD performance.

Summary

I hope you have found this article useful. Answering the question "does my Linux box support TRIM?" is a little convoluted, and unfortunately you have to step through all of the layers to fully understand whether TRIM is supported. A good way to start is to look at the last list of which file systems support TRIM, select the one you like, and see which kernel version you need. If you need dm support for LVM then you have to use at least 2.6.36. If you use MD with your SSDs then I'm afraid you are out of luck with TRIM support for now.

What’s an IOPS?

Introduction

There have been many articles exploring the performance aspects of file systems, storage systems, and storage devices. Coupled with Throughput (Bytes per second), IOPS (Input/Output Operations per Second) is one of the two measures of performance that are typically examined when discussing storage media. Vendors will publish performance results with data such as “Peak Sequential Throughput is X MB/s” or “Peak IOPS is X” indicating the performance of the storage device. But what does an IOPS really mean and how is it defined?

Typically an IOP is an IO operation, where data is sent from an application to the storage device or retrieved from it. An IOPS is the measure of how many of these operations can be performed per second. But notice the word "typically" in this explanation: there is no hard and fast definition of an IOPS that is standard for everyone. Consequently, as you can imagine, it's possible to "game" the results and publish whatever results you like (see a related article, Lies, Damn Lies and File System Benchmarks). That is the sad part of IOPS and bandwidth – the results can be manipulated to be almost whatever the tester wants.

However, IOPS is a very important performance measure for applications because, believe it or not, many applications perform IO using very small transfer sizes (for example, see this article). How quickly or efficiently a storage system can perform IOPS can drive the overall performance of the application. Moreover, today's systems have lots of cores and run several applications at one time, further pushing the storage performance requirements. Therefore, knowing the IOPS measure of your storage devices is important, but you need to be critical of the numbers that are published.

Measuring IOPS

There are several tools that are commonly used for measuring IOPS on systems. The first is called Iometer, which you commonly see used on Windows systems. The second most common tool is IOzone, which has been used in the articles published on Linux Magazine because it is open source, easy to build on almost any system, has a great number of tests and options, and is widely used for storage testing. It is fairly evident at this point that having two tools could lead to some differences in IOPS measurements. Ideally there should be a precise definition of an IOPS with an accepted way to measure it, and the various tools for examining IOPS would have to prove that they satisfy the definition ("certified" is another way of saying this). But just picking the software tool is perhaps the easiest part of measuring IOPS.
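
As a hedged illustration of what such a measurement might look like with IOzone (this is just a sketch, not the exact invocation used in any published article): "-i 0 -i 2" selects the write/rewrite and random read/write tests, "-r 4k" sets a 4KB record size, "-s 1g" uses a 1GB test file, "-O" reports results in operations per second, and "-I" uses O_DIRECT to bypass the page cache.

$ iozone -i 0 -i 2 -r 4k -s 1g -O -I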

One commonly overlooked aspect of measuring IOPS is the size of the I/O operation (sometimes called the "payload size", using the terminology of the networking world). Does the I/O operation involve just a single byte? Or does it involve 1 MB? Just stating that a device can achieve 1,000 IOPS really tells you nothing. Is that 1,000 1-byte operations per second or 1,000 1MB operations per second?

The most common IO operation size for Linux is 4KB (or just 4K). It corresponds to the page size on almost all Linux systems so usually produces the best IOPS (but not always). Personally, I want to see IOPS measures for a range of IO operation sizes. I like to see 1KB (in case there is some exceptional performance at really small payload sizes), 4KB, 32KB, 64KB, maybe 128KB or 256KB, and 1MB. The reason I like to see a range of payload sizes is that it tells me how quickly the performance drops with payload size which I can then compare to the typical payload size of my application(s) (actually the “spectrum” of payload sizes). But if push comes to shove, I want to at least see the 4KB payload size but most importantly I want the publisher to tell me the payload size they used.

A second commonly overlooked aspect of measuring IOPS is whether the IO operation is a read or a write, or possibly a mix of them (you knew it wasn't going to be good when I started numbering discussion points). Hard drives, which have spinning media, usually don't show much difference between read and write operations in how fast they can execute them. However, SSDs are a bit different and have asymmetric performance. Consequently, you need to define how the IO operations were performed. For example, it could be stated, "This hardware is capable of Y 4K Write IOPS", where Y is the number, which means that the test was all write operations. If you compare some recent results for the two SSDs that were tested (see this article) you can see that SSDs can have very different Read IOPS and Write IOPS performance – sometimes even an order of magnitude different.

Many vendors choose to publish either Read IOPS or Write IOPS but rarely both. Other vendors like to publish IOPS for a mixed-operation environment, stating for example that the test was 75% Read and 25% Write. While they should be applauded for stating the mix of IO operations, they should also publish their Read IOPS performance (all read IO operations) and their Write IOPS performance (all write IO operations) so that the IOPS performance can be bounded. At this point in the article, vendors should be publishing the IOPS performance measures something like the following:


  • 4K Read IOPS =
  • 4K Write IOPS =
  • (optional) 4K (X% Read/Y% Write) IOPS =

Note that the third bullet is optional and the ratio of read to write IOPS is totally up to the vendor.

A third commonly overlooked aspect of measuring IOPS is whether the IO operations are sequential or random. With sequential IOPS, the IO operations happen sequentially on the storage media. For example block 233 is used for the first IO operation, followed by block 234, followed by block 235, etc. With random IOPS the first IO operation is on block 233 and the second is on block 568192 or something like that. With the right options on the test system, such as a large queue depth, the IO operations can be optimized to improve performance. Plus the storage device itself may do some optimization. With true random IOPS there is much less chance that the server or storage device can optimize the access very much.

Most vendors report the sequential IOPS since typically it has a much larger value than random IOPS. However, in my opinion, random IOPS is much more meaningful, particularly in the case of a server. With a server you may have several applications running at once, accessing different files and different parts of the disk so that to the storage device, the access looks random.

So, at this point in the discussion, the IOPS performance should be listed something like the following:


  • 4K Random Read IOPS =
  • 4K Random Write IOPS =
  • 4K Sequential Read IOPS =
  • 4K Sequential Write IOPS =
  • (optional) 4K Random (X% Read/Y% Write) IOPS =

The IOPS can be either random or sequential (I like to see both), but at the very least vendors should state whether the published IOPS are sequential or random.

A fourth commonly overlooked aspect of measuring IOPS is the queue depth. With Windows storage benchmarks, you see the queue depth adjusted quite a bit in the results. Linux does a pretty good job setting good queue depths so there is much less need to change the defaults. However, the queue depths can be adjusted which can possibly change the performance. Changing the queue depth on Linux is fairly easy.

The Linux IO scheduler has the functionality to sort incoming IO requests into something called the request queue, where they are optimized for the best possible device access, which usually means sequential access. The size of this queue is controllable. For example, you can look at the queue depth for the "sda" disk in a system and change it as shown below:

# cat /sys/block/sda/queue/nr_requests
128
# echo 100000 > /sys/block/sda/queue/nr_requests


Configuring the queue depth can only be done by root.
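
As a related aside ("sda" below is just an example device), you can also see which IO scheduler is managing a device's request queue, since the scheduler is what performs the sorting described above; the scheduler currently in use is shown in square brackets in the output:

$ cat /sys/block/sda/queue/scheduler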

At this point the IOPS performance should be published something like the following:


  • 4K Random Read IOPS = X (queue depth = Z)
  • 4K Random Write IOPS = Y (queue depth = Z)
  • 4K Sequential Read IOPS = X (queue depth = Z)
  • 4K Sequential Write IOPS = Y (queue depth = Z)
  • (optional) 4K Random (X% Read/Y% Write) IOPS = W (queue depth = Z)

Or, if the same queue depth applies to all of the tests, they only need to state it once.

In the Linux world, not too many “typical” benchmarks try different queue depths since typically the queue depth is 128 already which provides for good performance. However, depending upon the workload or the benchmark, you can adjust the queue depth to produce better performance. However, just be warned that if you change the queue depth for some benchmark, real application performance could suffer.

Notice that it is starting to take a fair amount of work to list the IOPS performance. There are at least four IOPS numbers that need to be reported for a specified queue depth. However, I personally would like to see the IOPS performance for several payload sizes and several queue depths. Very quickly, the number of tests that need to be run is growing quite rapidly. To take the side of the vendors, producing this amount of benchmarking data takes time, effort, and money. It may not be worthwhile for them to perform all of this work if the great masses don’t understand nor appreciate the data. On the other hand, taking the side of the user, this type of data is very useful and important since it can help set expectations when we buy a new storage device. And remember, the customer is always right so we need to continue to ask the vendors for this type of data.

There are several other "tricks" you can do to improve performance, including more OS tuning, turning off all cron jobs during testing, locking processes to specific cores using numactl, and so on. Covering all of them is beyond the scope of this article, but you can assume that most vendors like to tune their systems to improve performance (ah – the wonders of benchmarks). One way to improve this situation is to report all details of the test environment (I try to do this) so that one could investigate which options might have been changed. However, for rotating media (hard drives), one can at least estimate the IOPS performance of single devices (i.e. individual drives), as the next section shows.

Estimating IOPS for Rotating Media

For pretty much all rotational storage devices, the dominant factors in determining IOPS performance are seek time, access latency, and rotational speed (but typically we think of rotational speed as affecting seek time and latency). Basically the dominant factors affect the time to access a particular block of data on the storage media and report it back. For rotating media the latency is basically the same for read or write operations, making our life a bit easier.

The seek time is usually reported by disk vendors and is the time it takes for the drive head to move into position to read (or write) the correct track. The latency refers to the amount of time it takes for the specific spot on the drive to be in place underneath a drive head. The sum of these two times is the basic amount of time to read (or write) a specific spot on the drive. Since we’re focusing on rotating media, these times are mechanical so we can safely assume they are much larger than the amount of time to actually read the data or get it back to the drive controller and the OS (remember we’re talking IOPS so the amount of data is usually very small).

To estimate the IOPS performance of a hard drive, we simply add these two average times and take the reciprocal to get the number of IO operations we can do per second.

Estimated IOPS = 1 / (average latency + average seek time)


For both numbers, the values need to be in the same units and converted to seconds before taking the reciprocal (I'll leave the unit conversion up to you). For example, if a disk has an average latency of 3 ms and an average seek time of 4.45 ms, then the estimated IOPS performance is,

Estimated IOPS = 1 / (average latency + average seek time)
Estimated IOPS = 1 / (3 ms + 4.45 ms)
Estimated IOPS = 1 / (7.45 ms) = 1 / (0.00745 s)
Estimated IOPS = 134


This handy-dandy formula works for single rotating-media drives (estimating SSD IOPS performance is more difficult and less accurate). Estimating performance for storage arrays that have RAID controllers and several drives is much more difficult and is usually not easy to do. However, there are some articles floating around the web that attempt to estimate that performance.

Summary

IOPS is one of the important measures of performance of storage devices. Personally I think it is the first performance measure one should examine since IOPS are important to the overall performance of a system. However, there is no standard definition of an IOPS so just like most benchmarks, it is almost impossible to compare values from one storage device to another or one vendor to another.

In this article I tried to explain a bit about IOPS and how they can be influenced by various factors. Hopefully this helps you realize that published IOPS benchmarks may have been "gamed" by vendors and that you should ask for more details on how the values were obtained. Even better, you can run the benchmarks yourself, or ask those who posted the benchmarks how they tested for IOPS performance.