vmstat_plotter

Scratching that vmstat plotting itch


I’ve been working on an article around vmstat-like tools and I started looking for ways to plot the data. There are some reasonable articles on using something like awk and gnuplot to create plots but I wanted something a little more self-contained that would produce a quick report with all of the data plotted. I did see one tool that was on-line only (you had to upload your data) but that was too much of a pain to use, so in true Clusterbuffer fashion I wrote my own.

It’s not a sophisticated tool by any stretch – it’s just some Python code to parse the data and plot it using matplotlib. Go to this page to read more about it and scroll down to the bottom to download the code. You can also see a simple example here.

Please read through the write-up since there are some limitations to the code that require vmstat to be run with certain flags. There are also vmstat options that it can’t plot because the output is not time dependent.

Enjoy!

NFSiostat_plotter V4

I made an update to nfsiostat_plotter_v3 so that the iostat information, which is used for gathering the CPU usage, is optional. This means you don’t have to run iostat while you run nfsiostat. But it also means you won’t get to see the CPU usage plots.

Please go to this page for more details

IOSTAT Plotter and NFSiostat plotter updates (V3!)

I’ve been updating both iostat_plotter and nfsiostat_plotter. Both versions are now “v3”. There are a few new features in both:


  • Normally nfsiostat doesn’t capture CPU utilization, but this can be an important part of analyzing NFS client performance. With this version of the code you also run iostat while you run nfsiostat, then run iostat_plotter_v3.py to produce the iostat output, which nfsiostat_plotter reads so the CPU utilization can be plotted.
  • For both iostat_plotter and nfsiostat_plotter, the legend has been moved outside the plot to the top right
  • For both iostat_plotter and nfsiostat_plotter, the size of the legend has been shrunk so you can plot more NFS file systems (up to about four before the legend labels start to overlap). iostat_plotter can handle about eight devices.
  • For nfsiostat_plotter, the command option “-c” was added so you could combine the NFS file systems onto a single plot (total of 4 plots)
  • For iostat_plotter, the command option “-c” was added so you could combine the devices onto a single plot (total of 4 plots)
  • For iostat_plotter, the code can read either sysstat v9.x or sysstat v10.x format for iostat. This is very important since CentOS and RHEL ship with a very old sysstat v9.x package.
  • For both iostat_plotter and nfsiostat_plotter, the legend labels are auto-sized based on the string length (it’s a heuristic algorithm).
  • For both iostat_plotter and nfsiostat_plotter, a simple “-h” option was added that produces a small help output.

Go to this page for iostat_plotter_v3 and this page for nfsiostat_plotter_v3. You will see links to the new code there.

IOSTAT Plotter V2 – Update

I’ve been updating nfsiostat_plotter and in the process I found a few bugs in iostat_plotter V2.

The first bug was in the way it kept time. If the time stamp ended in “PM” the code added 12 hours so everything is based on a 24-hour clock. The problem is that if the iostat output is captured just after noon, the time reads something like “12:49 PM”, which then became 24:49 instead of staying at 12:49. This bug has been fixed.
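For the curious, the fix is just the classic 12-hour to 24-hour conversion with its two edge cases (12:xx AM is hour 0 and 12:xx PM stays at hour 12). A minimal Python sketch of the idea, not the actual code in iostat_plotter, might look like this:

def to_24_hour(hour, ampm):
    # Convert an hour on a 12-hour clock plus AM/PM to a 24-hour clock.
    # The edge cases: 12:xx AM is hour 0, and 12:xx PM stays at hour 12.
    if ampm == "AM":
        return 0 if hour == 12 else hour
    else:
        return 12 if hour == 12 else hour + 12

print to_24_hour(12, "PM")   # 12, not 24
print to_24_hour(12, "AM")   # 0
print to_24_hour(1, "PM")    # 13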

The second bug was in the plotting routines when doing “combined” plots where all devices are plotted on the same graph. It’s hard to explain, but when there are two or three plots, one above the other, sometimes the grid didn’t get plotted, and/or one of the graphs was longer than the others. This has been corrected (I hope).

The new version can be found here. The code has the same name as before (sorry about that). Just download the code and run it like you did before.

IOSTAT Plotter V2

I have had several requests for changes/modifications/updates to iostat_plotter so I recently found a little time to update the code. Rather than just replace it, I created iostat_plotter_v2.

The two biggest changes made in the code are: (1) moving the legends outside the plots to the top right hand corner, and (2) allowing all of the devices to be included on the same set of plots. The first change makes the plots look nicer (I think). I tried to maximize the width of the plot without getting too crazy. I also shrunk the text size of the legend so you can get more devices in the plots. I think you can get about 12 devices without the legend bleeding over to the plot below it.

The second change allows you to put all of the devices on the same plot if you use the “-c” option with the script. In the previous version you got a set of 11 plots for each device which allows you to clearly examine each device. If you had five devices you got a total of 55 plots (5 devices * 11 plots per device). Using the “-c” option you will now get 11 plots with all of the devices on each of the appropriate plots.

I hope these couple of features are useful. Please let me know if they are or if you want other changes. Thanks!

Linux 3.1: Steady Storage


The gestation period for the 3.1 kernel was fairly long for various reasons, but it’s out and there are some steady storage updates but nothing earth shattering. But don’t mistake this for complacency, the storage bits of the kernel are still progressing very nicely.

3.1 – Me Develop You Long Time

The development time of the 3.1 kernel was a bit longer than usual, stretching from the 3.0 release on July 21 to the 3.1 release on Oct. 24 of this year. Three months is a little longer than usual, but Linus was on vacation during the development and kernel.org had a few security issues. But the intrepid kernel developers persevered and we have a shiny new Linux kernel. Let’s take a look at the storage aspects of the 3.1 kernel starting with the fan favorite, file systems.

File Systems

There were some nice new features added to the 3.1 kernel around file systems.

Ext3
The ext3 file system, while somewhat outdated, is still in very heavy use. In the 3.1 kernel, a very nice feature was added to the default options for ext3: write barriers are now on by default. You can read more about write barriers here but in general write barriers ensure that the data is written to the disk at key points in the data cycle. In particular, write barriers happen before the journal commit to make sure all of the transaction logs are on the disk. Write barriers will impact performance for many workloads, but the trade-off is that you get a much more reliable file system with less chance of losing data. Ext3 has had write barriers for some time but they were never turned on by default in the kernel, even though most distributions turned them on. So the kernel developers decided in the 3.1 kernel that they would turn on write barriers by default. While I am a performance junkie, I’m an even bigger junkie for making sure I don’t lose data.

Btrfs
I’m truly surprised by how many people who know Linux and file systems well still do silly things such as making the command “ls -l” part of a script or putting millions of 4KB files in a single directory. Performing a simple “ls” command, or the more feared “ls -l”, puts a tremendous strain on the file system, forcing it to walk a big chunk of the file system and look up inodes to gather metadata. A big part of this process is the use of the readdir() function.

In the 3.1 kernel, Josef Bacik from Red Hat created a patch that improves readdir() performance, particularly for “ls” commands that follow readdir() with a stat() call. The patch improved performance on a simple but long-running script that uses “ls” by 1,300 seconds (about 22 minutes). Here is a plot of the pattern of the script before the patch, and here is a plot of the same script after the patch. The differences between the two are remarkable.

The second patch to btrfs, while somewhat complicated because it deals with the idea of locking during metadata operations, results in significantly better read performance and in most cases improved write performance. However, there are cases, such as those for dbench, where the write performance suffers a bit. But overall, it is a good change for btrfs.

NFS
Everyone loves NFS (if you don’t, you should), because it’s the only file system standard. NFS is NFS is NFS. It is the same protocol for Linux as for OSX as for HPUX, as for Windows, etc. So it allows one to share data between different systems even if they have different operating systems. Plus it is a well known file system that is fairly easy to configure and operate and has well-known error paths.

In the 3.1 kernel, there was a patch to NFS, actually to the v4.1 standard of NFS (sometimes called pNFS), that added IPv6 support. This patch is fairly significant because as pNFS becomes a reality (yeah – it’s been a long road), some people are going to want to use IPv6, particularly in government areas. So this patch is a dandy one for NFS.

XFS
The XFS developers have been keeping a steady pace of great developments. In the 3.1 kernel they added some performance enhancements. The details are very involved, but if you want to read the kernel commits, you can see the first one here and the second one here.

Reiserfs
The venerable reiserfs file system is still used in many places. In the 3.1 kernel, a simple patch was included to make barrier=flush the default mount option for reiserfs. This will have an impact on existing reiserfs file systems that are upgraded to the 3.1 kernel.

HFSPlus
HFSPlus (HFS+) is a file system developed by Apple. It is used on a variety of Apple platforms including iPods. In the 3.1 kernel, a 2TB limitation on HFS+ was changed to a dynamic limit based on the block size. While HFS+ may not be used as a primary file system for most Linux users, having the ability to mount and interact with Apple file systems can be very useful.

Squashfs
One of my favorite file systems in the Linux kernel is SquashFS (I think I’ve said this before). In past kernels, squashfs gained support for the XZ and LZO compression methods, in addition to the default ZLIB compression method, giving it access to three different compression methods. It was decided to drop the ZLIB support in the 3.1 kernel since there are two other compression options. Plus dropping support for ZLIB reduces the kernel’s size, which is important for people using Linux for embedded applications.

Block Layer Patches

File systems are cool and people love to talk about them but there are other aspects to Linux storage that are just as key. One important aspect is the block layer. In the 3.1 kernel there were several important patches to the block layer that you should be aware of.

One patch that may seem esoteric but is more important than you think added the ability to force the completion of an IO operation to go to the core (cpu) that requested the operation. The kernel has the ability to move processes around the cores in a system. For IO operations, the block layer usually has the concept of completing an IO operation on a core that is on the same socket as the requesting process (using the blk_cpu_to_group() function). Sometimes you want the completion to go back to the same core that issued the request. This patch allows you to set this capability on a per-block-device basis, such as,

echo 2 > /sys/block/<device>/queue/rq_affinity

There were a couple of performance tuning patches for the CFQ IO scheduler that were pretty detailed and beyond the scope of this quick article. But you have to love performance additions.

The third set of changes to the block layer focused on the device mapper (dm) portion of the block layer. The first significant patch adds support for the MD (multi-device) RAID-1 personality using dm-raid. This is not an insignificant patch since MD RAID-1 has some unique features. With this patch you can now use the same features via dm-raid.

The second dm patch adds the ability to parse and use metadata devices with dm-raid. This is significant because the metadata information allows dm-raid to correctly reassemble disks in the correct order when booting or if a disk fails or there is some other fault. This is an extremely important task that otherwise you have to do by hand (yuck).

VFS Layer

If you have looked at the kernel patches for the last several kernels you will realize that there has been a large effort around improving the scalability of the VFS layer. The 3.1 kernel added some additional patches to improve scalability (and performance). The details are there if you want to read them but they are pretty deep.

New iSCSI

iSCSI is becoming more popular as a replacement for Fibre Channel (FC) networks since you just use Ethernet networks. In the 3.1 kernel the current iSCSI implementation, called STGT, was declared obsolete after the linux-iscsi.org (LIO) SCSI target was included in the kernel. LIO is a full-featured in-kernel implementation of the iSCSI target mode (RFC-3720), but it was not the only in-kernel iSCSI target considered. A second set of patches, called SCST, was also considered, and I guess the discussion around LIO vs. SCST was pretty rough. For a more in-depth discussion you can read the lwn article here.

Summary

The 3.1 kernel was fairly quiet from a storage perspective but the thing I like most about it is that steady progress was made on the kernel. If you recall, there was a period of time when storage development in the kernel was stagnant. Then in the later 2.6.x series it started picking up with a huge flurry of development, new file systems, scalability improvements, and performance improvements. So the fact that storage is still being developed in the kernel illustrates that it is important and work is still on-going.

There were some file system developments of note, particularly if you are upgrading an existing system to use the 3.1 kernel. There were also some developments in the block layer (if block layer patches go into the kernel you know they are good quality because no one wants to cause problems in the block layer) and the VFS layer had some further scalability improvements. And finally for people using iSCSI, there was a new iSCSI target implementation included in the kernel, so if you use iSCSI targets on Linux, be sure you test these out thoroughly before deploying things in production.

RIP Nedit


I have been using Nedit for many years for just about every editing need. However, a recent upgrade to Kubuntu 11.04 illustrated that Nedit was on its last legs and it was time to switch. RIP Nedit.

Nedit

I have been using Nedit for many years for just about every editing need I have. I use it for writing code (it’s really great in this respect), and I write these articles using Nedit (I have been for a very long time). It’s really easy to use, I love the split-screen capability (but you need lots of vertical space for it to be really effective), I like the ability to edit multiple files and switch between them with a tab (yes – this is an extremely common capability but many years ago this didn’t exist), and I liked the auto-indentation feature (great when writing Python code). Overall I just found it to be a great editor, so I started using it full time many years ago on IRIX and continued using it on Linux.

I started to use more features in Nedit and it just became comfortable for me to use. Most people will tell you that once they find a tool that scratches their proverbial itch they are very reluctant to change. I fall into this category. My wife even makes fun of me saying that the reason I don’t switch to anything else is because I don’t like change. Sorry sweety – in this case that isn’t the correct observation. The reason I don’t switch is that it would take me a long time to learn a new tool and become as proficient on it as with Nedit. Is it worth it to go through this process expending time and effort to learn a new editor just because it’s new? It’s highly unlikely that this new tool will make me more productive than with Nedit, so why switch? Switching for the sake of switching is good in some circumstances, but I don’t think this is one of them.

So I soldiered on for many years using Nedit and it worked just great. Whoever built the packages or binaries for the various distros I used (mostly CentOS, Scientific Linux, and Kubuntu) did a great job, presumably using the Lesstif library to build Nedit since it’s built on the Motif toolkit. I didn’t look at how they built the binaries, nor did I really want to know – I was just happy to have Nedit around. And, by the way, if you’re reading this and you built these binaries and packages, thanks very much. Count me as a happy and very appreciative customer.

Then I upgraded my laptop to Kubuntu 11.04 and installed Nedit. I soon noticed a small problem – when I tried copying and pasting a section of text or code, it would not paste. Instead I got the following message in the terminal window:

NEdit warning:
XmClipboardInquireLength() failed: clipboard locked.


A little googling found a recent post indicating that Nedit isn’t really working on Ubuntu anymore. I’m not sure if it’s a Nedit problem or a Motif library problem (presumably Lesstif or OpenMotif), but the result is that Nedit no longer works for me on Kubuntu 11.04 (note: it works fine on CentOS 5.5 and older Ubuntu versions such as 8.04, but not the more recent versions). The only solution is to save any work, exit from Nedit, and start again.

Coupling this problem with comments from my friend Joe Landman, I had to start looking for a new editor.

gedit

I tried all kinds of editors including Kate, Kedit, geany, JuffEd, and gedit. I’m a KDE kind of guy because I really loathe Gnome (and the recent interfaces haven’t made me want to use it), but then again, the whole KDE 4.0 thing was a mess and not until KDE 4.6 or so was it usable again.

Kate was fine and I used it for some work but I found the interface to be a bit clunky. The same for Kedit, although Kedit moved up my list fairly rapidly. I ended up settling on gedit.

Gedit doesn’t have all the features I want, especially being able to split the window so I can see two different parts of the code at the same time. But given the choice between an editor that doesn’t really work but has all the features, and an editor that works but doesn’t have all the features, I think I’ll choose the latter. This is what gedit does for me.

If you have any ideas or suggestions for visual editors please don’t hesitate to suggest something.

Checksumming Files to Find Bit-Rot

Do You Suffer From Bit-Rot?

Storage admins live in fear of corrupted data. This is why we make backups, copies (replicas), and use other methods to make sure we have copies of the data in case the original is corrupted. One of the most feared sources of data corruption is the proverbial bit-rot.

Bit rot can be caused by a number of sources but the result is always the same – one or more bits in the file have changed, causing silent data corruption. The “silent” part of the data corruption means that you don’t know it happened – all you know is that the data has changed (in essence it is now corrupt).

One source of data corruption is called URE (Unrecoverable Read Error) or UBER (Unrecoverable Bit Error Rate). These measures are a function of the storage media design and tell us the probability of encountering a bit on a drive that cannot be read. Sometimes specific bits on drives just cannot be read due to various factors. Usually the drive reports this error and it is put into the system logs. Also, many times the OS will give an error because it cannot read a specific portion of data. Or, in some cases, the drive will read the bit even though it may contain bad data (maybe the bit flipped due to cosmic radiation – which does happen, particularly at higher altitudes) which means that the bit can still be read but the bit is now incorrect.

The actual URE or UBER for a particular storage device is usually published by the manufacturer, although many times it can be hard to find. Typical rates for hard drives are around 1 x 10^14 bits, which means that 1 out of every 10^14 bits cannot be read. Some SSD manufacturers will list their UBER as 1 x 10^15 and some hard drive manufacturers will use this same number for enterprise-class drives. Let’s convert that number to something a little easier to understand – how many Terabytes (TBs) must be read before encountering a URE.

One terabyte (1 TB) drives have about 2 billion (2 x 10^9) 512-byte sectors, and let’s assume a URE rate of 1 in 10^14 bits. With 512-byte sectors, or 512 * 8 = 4,096 bits per sector, that URE rate converts to about 24 x 10^9 sectors read before hitting an unreadable sector. If you then divide that sector count by the number of sectors per drive, you get the following:

24 x 10^9 sectors / 2 x 10^9 sectors per drive = 12 drives (12 TB)


This means that if you read 12TB of data (for example, the full contents of 12 x 1TB drives) you should expect to encounter a URE (i.e. it’s going to happen). If you have 2TB drives, then all you need to read is 6 x 2TB drives and you should expect to encounter a URE.
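If you want to play with the arithmetic yourself, here is a quick Python sketch of the same calculation (the URE rate and sector counts are just the example numbers from above; substitute the numbers for your own drives):

# Rough "how much data can I read before an expected URE" calculation
ure_rate = 1.0e14          # 1 unreadable bit per 10^14 bits read
bits_per_sector = 512 * 8  # 512-byte sectors = 4,096 bits per sector
sectors_per_tb = 2.0e9     # roughly 2 billion 512-byte sectors per 1TB drive

sectors_before_ure = ure_rate / bits_per_sector
tb_before_ure = sectors_before_ure / sectors_per_tb

print "Sectors read before an expected URE: %.2e" % sectors_before_ure
print "TB read before an expected URE:      %.1f" % tb_before_ure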

If you have a RAID-5 group that has seven 2TB drives and one drive fails, the RAID rebuild has to read all of the remaining disks (all six of them). At that point you are almost guaranteed that during the RAID-5 rebuild, you will hit a URE and the RAID rebuild will fail. This means you have lost all of your data.

This is just an example of a form of bit-rot – the dreaded URE. It can and does happen and people either don’t see it or go screaming into the night that they lost all of their carefully saved KC and the Sunshine Band flash videos and mp3’s.

So what can be done about bit-rot? There are really two parts to that question. The first part is detecting corrupted files and the second part is correcting corrupted files. In this article, I will talk about some simple techniques using extended file attributes that can help you detect corrupted data (recovering is the subject for a much longer article).

Checksums to the Rescue!

One way to check for corrupted data is through the use of a checksum. A checksum is a simple representation or fingerprint of a block of data (in our case, a file). There are a whole bunch of checksums including md5, sha-1, sha-2 (including 256, 384, and 512 bit checksums), and sha-3. These algorithms can be used to compute the checksum of a chunk of data, such as a file, with longer or more involved checksums typically requiring more computational work. Note that checksums are also used in cryptography, but we are using them as a way to fingerprint a file.

So for a given file we could compute a checksum using a variety of techniques, or compute different checksums using different algorithms for the same file. Then, before a file is read, the checksum of the file could be computed and compared against a stored checksum for that same file. If they do not match, then you know the file has changed. If the time stamps on the file haven’t changed since the checksums of the file were computed, then you know the file is corrupt (since no one changed the file, the data obviously fell victim to bit-rot or some other form of data corruption).
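To make the idea concrete, here is a small sketch that recomputes a file’s md5 checksum and compares it against a previously stored value, along with the file’s modification time. It uses Python’s hashlib module rather than the md5sum command used later in this article, and the stored_md5 and stored_time values are placeholders for whatever you saved when the checksum was originally computed:

import hashlib
import os

file_name = "./slides_fenics05.pdf"
stored_md5 = "4052e5dd3d79de6b0a03d5dbc8821c60"   # checksum recorded earlier
stored_time = 1300000000.0                        # time the checksum was recorded (placeholder)

# Recompute the md5 checksum of the file in chunks
md5 = hashlib.md5()
f = open(file_name, "rb")
for chunk in iter(lambda: f.read(65536), ""):
    md5.update(chunk)
f.close()

if md5.hexdigest() == stored_md5:
    print "Checksums match - no corruption detected"
elif os.stat(file_name).st_mtime <= stored_time:
    print "File unchanged since the checksum was stored, but the checksum differs - likely bit-rot"
else:
    print "File was modified after the checksum was stored - recompute and store a new checksum"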

If you have read everything carefully to this point you can spot at least one flaw in the logic. The flaw I’m thinking of is that this process assumes that the checksum itself has not fallen victim to data corruption. So we have to find some way of ensuring the checksum itself does not fall victim to bit-rot or that we have a copy of the checksums stored somewhere that we assume does not fall victim to data corruption.

However, you can spot other flaws in the scheme (nothing’s perfect). One noticeable flaw is that until the checksum of the file is created and stored in some manner, the file can fall victim to bit-rot. We could go through some gyrations so that as the file is being created, the checksum is computed and stored in real-time. However, that will dramatically increase the computational requirements and also slow down the I/O.

But for now, let’s assume that we are interested in ensuring data integrity for files that have been around a while and maybe haven’t been used for some period of time. The reason this is interesting is that since no one is using the file it is difficult to tell if the data is corrupt. Using checksums can allow us to detect if the file has been corrupted even if no one is actively using the file.

Checksums and Extended File Attributes

The whole point of this discussion is to help protect against bit-rot of files by using checksums. Since the focus is on files it makes sense to store the checksums with the file itself. This is easily accomplished using our knowledge of extended file attributes that we learned in a previous article.

The basic concept is to compute the checksums of the file and store the result in an extended attribute associated with the file. That way the checksum is stored with the file itself, which is what we’re really after. To help improve things even more, let’s compute several checksums of the file since this will allow us to have several ways to detect file corruption. All of the checksums will be stored in extended attributes as well as in a file or database. However, as mentioned before, there is the possibility that the checksums in the extended attributes might be corrupted – so what do we do?

A very simple solution is to store the checksums in a file or simple database and be sure that several copies of that file or database are made. Then before you check the checksum of a file, you first look up the checksum in the file or database and then compare it to the checksums in the extended attributes. If they are identical, then the file is not corrupted.

There are lots of aspects to this process that you can develop to improve the probability of the checksums being valid. For example, you could make three copies of the checksum data and compute the checksum of these files. Then you compare the checksums of these three files before you read any data. If two of the three values are the same then you can assume that those two files are correct and that the third file is incorrect, resulting in it being replaced from one of the other copies. But now we are getting into implementation details which is not the focus of this article.

Let’s take a quick look at some simple Python code to illustrate how we might compute the checksums for a file and store them in extended file attributes.

Sample Python Code

I’m not an expert coder and I don’t play one on television. Also, I’m not a “Pythonic” Python coder, so I’m sure there could be lots of debate about the code. However, the point of this sample code is to illustrate what is possible and how to go about implementing it.

For computing the checksums of a file, I will be using commands that typically come with most Linux distributions. In particular, I will be using md5sum, sha1sum, sha256sum, sha384sum, and sha512sum. To run these commands and grab the output to standard out (stdout), I will use a Python module called commands (note: I’m not using Python 3.x; this code was written against Python 2.5.2 and also tested against Python 2.7.1). This module has Python functions that allow us to run “shell commands” and capture the output in a tuple (a data type in Python).

However, the output from a shell command can have several parts to it so we may need to break the string into tokens so we can find what we want. A simple way to do that is to use the functions in the shlex module (Simple Lexical Analysis) for tokenizing a string based on spaces.

So let’s get coding! Here is the first part of my Python code to illustrate where I’m headed and how I import modules.

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
# end if


At the top part of the code, I import the modules but catch the exception and exit if the modules can’t be found since they are a key part of the code. Then in the main part of the code I define the list of the checksum functions I will be using. These are the exact command names to be used to compute the checksums. Note that I have chosen to compute the checksum of a file using 5 different algorithms. Since we will have multiple checksums for each file it will help improve the odds of finding a file with data corruption because I can check all five checksums. Plus it might also help find a corrupt checksum since we could compare the checksum of the file against all five measures; if one of the checksums in the extended file attributes is wrong but the other four are correct then we have found a corrupted extended attribute.

For the purposes of this article I’m just going to examine one file, slides_fenics05.pdf (a file I happen to have on my laptop).

The next step in the code is to add the code that loops over all five checksum functions.

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
    for func in checksum_function_list:
        # Create command string to set extended attribute
        command_str = func + " " + file_name;
        checksum_output = commands.getstatusoutput(command_str);
        print "checksum_output: ",checksum_output
    # end for
 
# end if


Notice that I create the exact command line I want to run as a string called “command_str”. This is the command executed by the function “commands.getstatusoutput”. Notice that this function returns a 2-tuple (status, output). You can see this in the output from the sample code below.

laytonjb@laytonjb-laptop:~/$ ./test1.py
checksum_output:  (0, '4052e5dd3d79de6b0a03d5dbc8821c60  ./slides_fenics05.pdf')
checksum_output:  (0, 'cdfcadf4752429f01c8105ff15c3e24fa9041b46  ./slides_fenics05.pdf')
checksum_output:  (0, '3c2ad544ba4245dc9e300afe79b81a3a25b2ff6e71e127724acd51124c47a381  ./slides_fenics05.pdf')
checksum_output:  (0, '0761eac4323d35a62c52f3c49dd2098e8b633724ed8dec2ee2de2ddda0874874a916b99287703a9eb1886af62d4ac0b3  ./slides_fenics05.pdf')
checksum_output:  (0, '42674cebe76d0c0567cf1bed21008b005912f0df76990456b669ef3d3942e607d69079e879ceecbb198e846a042f49ee28c145f9b1dc0b4bb4c9ddadd25777c5  ./slides_fenics05.pdf')
laytonjb@laytonjb-laptop:~/Documents/FEATURES/STORAGE094$


You can see that each time the commands.getstatusoutput function is called there are two parts in the output tuple – (1) the status of the command (was it successful?) and (2) the result of the command (the actual output). Ideally we should check the status of the command to determine if it was successful but I will leave that as an exercise for the reader 🙂
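If you do want to check the status, the first element of the returned tuple is the shell return code (0 means success), so the check only takes a few lines. Here is a minimal, self-contained sketch of the idea for a single command (the file name is the same example file used above):

import sys
import commands

command_str = "md5sum ./slides_fenics05.pdf"
status, output = commands.getstatusoutput(command_str)
if status != 0:
    # Non-zero status means the command failed, so don't trust the output
    print "Command failed: ", command_str
    print "Output: ", output
    sys.exit(1)
print "Command succeeded: ", output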

At this point we want to grab the output from the command (the second item in the 2-tuple) and extract the first part of the string, which is the checksum. To do this we will use the shlex.split function in the shlex module. The code at this point looks like the following:

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
    for func in checksum_function_list:
        # Create command string to set extended attribute
        command_str = func + " " + file_name;
        checksum_output = commands.getstatusoutput(command_str);
        print "checksum_output: ",checksum_output
        tokens = shlex.split(checksum_output[1]);
        checksum = tokens[0];
        print "   checksum = ",checksum," \n";
    # end for
 
# end if


In the code, the output from the checksum command is split (tokenized) based on spaces. Since the first token is the checksum, which is what we’re interested in capturing and storing in the extended file attribute, we take the first token in the list and store it in a variable.

The output from the code at this stage is shown below:

laytonjb@laytonjb-laptop:~$ ./test1.py
checksum_output:  (0, '4052e5dd3d79de6b0a03d5dbc8821c60  ./slides_fenics05.pdf')
   checksum =  4052e5dd3d79de6b0a03d5dbc8821c60  

checksum_output:  (0, 'cdfcadf4752429f01c8105ff15c3e24fa9041b46  ./slides_fenics05.pdf')
   checksum =  cdfcadf4752429f01c8105ff15c3e24fa9041b46  

checksum_output:  (0, '3c2ad544ba4245dc9e300afe79b81a3a25b2ff6e71e127724acd51124c47a381  ./slides_fenics05.pdf')
   checksum =  3c2ad544ba4245dc9e300afe79b81a3a25b2ff6e71e127724acd51124c47a381  

checksum_output:  (0, '0761eac4323d35a62c52f3c49dd2098e8b633724ed8dec2ee2de2ddda0874874a916b99287703a9eb1886af62d4ac0b3  ./slides_fenics05.pdf')
   checksum =  0761eac4323d35a62c52f3c49dd2098e8b633724ed8dec2ee2de2ddda0874874a916b99287703a9eb1886af62d4ac0b3  

checksum_output:  (0, '42674cebe76d0c0567cf1bed21008b005912f0df76990456b669ef3d3942e607d69079e879ceecbb198e846a042f49ee28c145f9b1dc0b4bb4c9ddadd25777c5  ./slides_fenics05.pdf')
   checksum =  42674cebe76d0c0567cf1bed21008b005912f0df76990456b669ef3d3942e607d69079e879ceecbb198e846a042f49ee28c145f9b1dc0b4bb4c9ddadd25777c5  

The final step in the code is to create the command to set the extended attribute for the file. I will create “user” attributes that look like “user.checksum.[function]” where [function] is the name of the checksum command. To do this we need to run a command that looks like the following:

setfattr -n user.checksum.md5sum -v [checksum] [file]


where [checksum] is the checksum that we stored and [file] is the name of the file. I’m using the “user” class of extended file attributes for illustration only. If I were doing this in production, I would run the script as root and store the checksums using the “system” class of extended file attributes since a normal user would not be able to change the result.

At this point, the code looks like the following with all of the “print” functions removed.

#!/usr/bin/python

#
# Test script for setting checksums on file
#

import sys

try:
   import commands                 # Needed to run shell commands and capture their output
except ImportError:
   print "Cannot import commands module - this is needed for this application.";
   print "Exiting..."
   sys.exit();

try:
   import shlex              # Needed for splitting input lines
except ImportError:
   print "Cannot import shlex module - this is needed for this application.";
   print "Exiting..."
   sys.exit();



if __name__ == '__main__':
    
    # List of checksum functions:
    checksum_function_list = ["md5sum", "sha1sum", "sha256sum", "sha384sum", "sha512sum"];
    file_name = "./slides_fenics05.pdf";
    
    for func in checksum_function_list:
        # Create command string to set extended attribute
        command_str = func + " " + file_name;
        checksum_output = commands.getstatusoutput(command_str);
        tokens = shlex.split(checksum_output[1]);
        checksum = tokens[0];
        
        xattr = "user.checksum." + func;
        command_str = "setfattr -n " + xattr + " -v " + str(checksum) + " " + file_name;
        xattr_output = commands.getstatusoutput(command_str);
    # end for
 
# end if


The way we check if the code is working is to look at the extended attributes of the file (recall this article on the details of the command).

laytonjb@laytonjb-laptop:~$ getfattr slides_fenics05.pdf 
# file: slides_fenics05.pdf
user.checksum.md5sum
user.checksum.sha1sum
user.checksum.sha256sum
user.checksum.sha384sum
user.checksum.sha512sum


This lists the extended attributes for the file. We can look at each attribute individually. For example, here is the md5sum attribute.

laytonjb@laytonjb-laptop:~$ getfattr -n user.checksum.md5sum slides_fenics05.pdf 
# file: slides_fenics05.pdf
user.checksum.md5sum="4052e5dd3d79de6b0a03d5dbc8821c60"


If you look at the md5sum in the earlier output listings you can see that it matches the md5 checksum in the extended file attribute associated with the file, indicating that the file hasn’t been corrupted.

Ideally we should be checking the status of each command to make sure that it returned successfully. But as I mentioned earlier that exercise is left up to the user.

One other aspect we need to consider is that users may have changed the data. We should store the date and time when the checksums were computed and store that value in the extended file attributes as well. So before computing the checksum on the file to see if it is corrupted we need to check if the time stamps associated with the file are more recent than the date and time stamp when the checksum was originally computed.
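One hedged way to do this, reusing the same setfattr and getfattr commands as above, is to also store the time the checksums were computed in an extended attribute (I’m inventing the name user.checksum.time here) and compare it against the file’s mtime before trusting the stored checksums. A rough sketch:

import os
import time
import commands

file_name = "./slides_fenics05.pdf"

# Store the time the checksums were computed as an extended attribute
now = str(time.time())
command_str = "setfattr -n user.checksum.time -v " + now + " " + file_name
commands.getstatusoutput(command_str)

# Later: compare the file's mtime against the stored time stamp
command_str = "getfattr --only-values -n user.checksum.time " + file_name
status, stored_time = commands.getstatusoutput(command_str)
if status == 0 and os.stat(file_name).st_mtime > float(stored_time):
    print "File was modified after the checksums were computed - recompute them"
else:
    print "Stored checksums are still valid for comparison"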

Summary

Data corruption is one of the most feared aspects of a storage admin’s life. This is why we do backups, replication, etc. – to recover data if the original data gets corrupted. One source of corrupted data is what is called bit-rot. Basically this is when a bit on the storage device goes bad and the data using that bit cannot be read, or it returns the incorrect value, indicating the file is now corrupt. But as we accumulate more and more data and this data gets colder (i.e. it hasn’t been used in a while), performing backups may not be easy (or even practical), so how do we determine if our data is corrupt?

The technique discussed in this article is to compute the checksum of the file and store it in an extended file attribute. In particular, I compute five different checksums to give us even more data to determine if the file has been corrupted. By storing all of the checksums in an additional location and ensuring that the stored values aren’t corrupt, we can compare the “correct” checksum to the checksum of the file. If they are the same, then the file is not corrupt. But if it’s different and yet the file has not been changed by the user, then the file is likely to be corrupt.

To help illustrate these ideas, I wrote some simple Python code to show you how it might be done. Hopefully this simple code will inspire you to think about how you might implement something a bit more robust around checksums of files and checking for data corruption.

Extended File Attributes Rock!

Introduction

I think it’s a given that the amount of data is increasing at a fairly fast rate. We now have lots of multimedia on our desktops, lots of files on our servers at work, and we’re starting to put lots of data into the cloud (e.g. Facebook). One question that affects storage design and performance is whether these files are large or small and how many of them there are.

At this year’s FAST (USENIX Conference on File System and Storage Technologies) the best paper went to “A Study of Practical Deduplication” by William Bolosky from Microsoft Research and Dutch Meyer from the University of British Columbia. While the paper covered Windows rather than Linux, was more focused on desktops, and was focused on deduplication, it did present some very enlightening insights on file systems from 2000 to 2010. Some of the highlights from the paper are:


  1. The median file size isn’t changing
  2. The average file size is larger
  3. The average file system capacity has tripled from 2000 to 2010

To fully understand the difference between the first point and the second point you need to remember some basic statistics. The average file size is computed by summing the size of every file and dividing by the number of files. But the median file size is found by ordering the list of every file’s size from smallest to largest; the median file size is the one in the middle of the ordered list. So, with these working definitions, the three observations previously mentioned indicate that perhaps desktops have a few really large files that drive up the average file size, but at the same time there are a number of small files that keep the median file size about the same despite the increase in the number of files and the increase in large files.
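A tiny numerical example illustrates the difference. With a handful of huge files mixed in with lots of small ones, the average jumps while the median barely moves (the sizes below are made up purely for illustration):

# File sizes in KB: a bunch of small files plus two very large ones
sizes = [4, 4, 8, 8, 16, 16, 32, 64, 1048576, 4194304]   # last two are a 1GB and a 4GB file

average = float(sum(sizes)) / len(sizes)

ordered = sorted(sizes)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2.0

print "Average file size: %.1f KB" % average   # dominated by the two big files
print "Median file size:  %.1f KB" % median    # stays small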

The combination of the observations previously mentioned means that we have many more files on our desktops, and that we are adding some really large files along with a large number of small files.

Yes, it’s Windows. Yes, it’s desktops. But these observations are another good data point that tell us something about our data. That is, the number of files is getting larger while we are adding some very large files and a large number of small files. What does this mean for us? One thing that it means to me is that we need to pay much more attention to managing our data.

Data Management – Who’s on First?

One of the keys to data management is being able to monitor the state of your data which usually means monitoring the metadata. Fortunately, POSIX gives us some standard metadata for our files such as the following:


  • File ownership (User ID and Group ID)
  • File permissions (world, group, user)
  • File times (atime, ctime, mtime)
  • File size
  • File name
  • Is it a true file or a directory?

There are several others (e.g. links) which I didn’t mention here.

With this information we can monitor the basic state of our data. We can compute how quickly our data is changing (how many files have been modified, created, deleted in a certain period of time). We can also determine how our data is “aging” – that is how old is the average file, the median file, and we can do this for the entire file system tree or certain parts of it. In essence we can get a good statistical overview of the “state of our data”.

All of this capability is just great. However, with file system capacity increasing so rapidly and the median file size staying about the same, we have a lot more files to monitor. Plus we keep data around for longer than we ever have. Over time it is easy to forget what a file name means or what is contained in a cryptic file name. Since POSIX is good enough to give us some basic metadata, wouldn’t it be nice to have the ability to add our own metadata? Something that we control that would allow us to add information about the data?

Extended File Attributes

What many people don’t realize is that there actually is a mechanism for adding your own metadata to files that is supported by most Linux file systems. This is called Extended File Attributes. In Linux, many file systems support it such as the following: ext2, ext3, ext4, jfs, xfs, reiserfs, btrfs, ocfs2 (2.1 and greater), and squashfs (kernel 2.6.35 and greater or a backport to an older kernel). Some of the file systems have restrictions on extended file attributes, such as the amount of data that can be added, but they do allow for the addition of user controlled metadata.

Any regular file on one of the previously mentioned file systems may have a list of extended file attributes. The attributes have a name and some associated data (the actual attribute). The name starts with what is called a namespace identifier (more on that later), followed by a dot “.”, and then followed by a null-terminated string. You can add as many names separated by dots as you like to create “classes” of attributes.

Currently on Linux there are four namespaces for extended file attributes:


  1. user
  2. trusted
  3. security
  4. system

This article will focus on the “user” namespace since it has no restrictions with regard to naming or contents. However, the “system” namespace could be used for adding metadata controlled by root.

The system namespace is used primarily by the kernel for access control lists (ACLs) and can only be set by root. For example, it will use names such as “system.posix_acl_access” and “system.posix_acl_default” for extended file attributes. The general wisdom is that unless you are using ACLs to store additional metadata, which you can do, you should not use the system namespace. However, I believe that the system namespace is a place for metadata controlled by root or metadata that is immutable with respect to the users.

The security namespace is used by SELinux. An example of a name in this namespace would be something such as “security.selinux”.

The user attributes are meant to be used by the user and any application run by the user. The user namespace attributes are protected by the normal Unix user permission settings on the file. If you have write permission on the file then you can set an extended attribute. To give you an idea of what you can do for “names” for the extended file attributes for this namespace, here are some examples:


  • user.checksum.md5
  • user.checksum.sha1
  • user.checksum.sha256
  • user.original_author
  • user.application
  • user.project
  • user.comment

The first three example names are used for storing checksums of the file using three different checksum methods. The fourth example lists the originating author, which can be useful in case multiple people have write access to the file or the original author leaves and the file is assigned to another user. The fifth example can list the application that was used to generate the data, such as output from an application. The sixth example lists the project with which the data is associated. And the seventh example is the all-purpose general comment. From these few examples, you can see that you can create some very useful metadata.
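As a quick illustration of how a script might apply names like these, here is a small Python sketch that tags a file using the setfattr and getfattr tools described below (the file name and attribute values are just made-up examples; you could equally well type the same commands by hand):

import commands

file_name = "./test.txt"

# Tag the file with a project name and a comment
commands.getstatusoutput("setfattr -n user.project -v 'Project X' " + file_name)
commands.getstatusoutput("setfattr -n user.comment -v 'raw simulation output' " + file_name)

# Read one of the attributes back
status, value = commands.getstatusoutput("getfattr --only-values -n user.project " + file_name)
if status == 0:
    print "user.project = ", value
else:
    print "Could not read user.project: ", value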

Tools for Extended File Attributes

There are several very useful tools for manipulating (setting, getting) extended attributes. These are usually included in the attr package that comes with most distributions. So be sure that this package is installed on the system.

The second thing you should check is that the kernel has attribute support. This should be turned on for almost every distribution that you might use, although there may be some very specialized ones that might not have it turned on. But if you build your own kernels (as yours truly does), be sure it is turned on. You can just grep the kernel’s “.config” file for any “ATTR” attributes.

The third thing is to make sure that the libattr package is installed. If you installed the attr package then this package should have been installed as well. But I like to be thorough and check that it was installed.

Then finally, you need to make sure the file system you are going to use with extended attributes is mounted with the user_xattr option.

Assuming that you have satisfied all of these criteria (they aren’t too hard), you can now use extended attributes! Let’s do some testing to show the tools and what we can do with them. Let’s begin by creating a simple file that has some dummy data in it.

$ echo "The quick brown fox" > ./test.txt
$ more test.txt
The quick brown fox

Now let’s add some extended attributes to this file.

$ setfattr -n user.comment -v "this is a comment" test.txt


This command sets an extended file attribute named “user.comment” (the “-n” option). The “-v” option is followed by the value of the attribute. The final argument to the command is the name of the file.

You can determine the extended attributes on a file with a simple command, getfattr as in the following example,

$ getfattr test.txt
# file: test.txt
user.comment


Notice that this only lists what extended attributes are defined for a particular file not the values of the attributes. Also notice that it only listed the “user” attributes since the command was done as a regular user. If you ran the command as root and there were system or security attributes assigned you would see those listed.

To see the values of the attributes you have to use the following command:

$ getfattr -n user.comment test.txt
# file: test.txt
user.comment="this is a comment"


With the “-n” option it will list the value of the extended attribute name that you specify.

If you want to remove an extended attribute you use the setfattr command but use the “-x” option such as the following:

$ setfattr -x user.comment test.txt
$ getfattr -n user.comment test.txt
test.txt: user.comment: No such attribute


You can tell that the extended attribute no longer exists because of the error returned by the getfattr command.

Summary

Without belaboring the point, the amount of data is growing at a very rapid rate even on our desktops. A recent study also pointed out that the number of files is also growing rapidly and that we are adding some very large files but also a large number of small files so that the average file size is growing while the median file size is pretty much staying the same. All of this data will result in a huge data management nightmare that we need to be ready to address.

One way to help address the deluge of data is to enable a rich set of metadata that we can use in our data management plan (whatever that is). An easy way to do this is to use extended file attributes. Most of the popular Linux file systems allow you to add metadata to files, and in the case of xfs, you can pretty much add as much metadata as you want to the file.

There are four “namespaces” of extended file attributes that we can access. The one we are interested in as users is the user namespace because if you have normal write permissions on the file, you can add attributes. If you have read permission on the file you can also read the attributes. But we could use the system namespace as administrators (just be careful) for attributes that we want to assign as root (i.e. users can’t change or query the attributes).

The tools to set and get extended file attributes come with virtually every Linux distribution. You just need to be sure they are installed with your distribution. Then you can set, retrieve, or erase as many extended file attributes as you wish.

Extended file attributes can be used to great effect to add metadata to files. It is really up to the user to do this since they understand the data and have the ability to add/change attributes. Extended attributes give a huge amount of flexibility to the user and creating simple scripts to query or search the metadata is fairly easy (an exercise left to the user). We can even create extended attributes as root so that the user can’t change or see them. This allows administrators to add really meaningful attributes for monitoring the state of the data on the file system. Extended file attributes rock!

How do you Know TRIM is Working With Your SSD in Your System?

Now that you have your shiny new SSD you want to take full advantage of it which can include TRIM support to improve performance. In this article I want to talk about how to tell if TRIM is working on your system (I’m assuming Linux of course).

Overview

Answering the question, “does TRIM work with my system?” is not as easy as it seems. There are several levels to this answer, beginning with, “does my SSD support TRIM?”. Then we have to make sure the kernel supports TRIM. After that we need to make sure that the file system supports TRIM (or what is referred to as “discard”). If we’re using LVM (the device-mapper or dm layer) we have to make sure dm supports discards. And finally if we’re using software RAID (md: multi-device), we have to make sure that md supports discards. So the simple question, “does TRIM work with my system?” has a simple answer of “it depends upon your configuration”, but it has a longer answer if you want details.

In the following sections I’ll talk about these various levels starting with the question of whether your SSD supports TRIM.

Does my SSD support TRIM?

While this is a seemingly easy question – it either works or it doesn’t – in actuality it can be something of a complicated question. The first thing to do to determine if your SSD supports TRIM is to upgrade your hdparm package. The reason for this is that hdparm has fairly recently added the capability to report whether the hardware supports TRIM. As of the writing of this article, version 9.37 is the current version of hdparm. It’s pretty easy to build the code for yourself (in your user directory) or as root. If you look at the makefile you will see that by default hdparm is installed in /sbin. To install it locally, just modify the makefile so that the “binprefix” variable at the top points to the base directory where you want to install it. For example, if I installed it in my home directory I would change binprefix to /home/laytonjb/binary (I install all applications in a subdirectory called “binary”). Then you simply do ‘make’ and ‘make install’ and you’re ready.

Once the updated hdparm is installed, you can test it on your SSD using the following command:

# /sbin/hdparm -I /dev/sdd

where /dev/sdd is the device corresponding to the SSD. If your SSD supports TRIM then you should see something like the following in the output somewhere:

*   Data Set Management TRIM supported

I tested this on a SandForce 1222 SSD that I’ve recently been testing. Rather than stare at all of the output, I like to pipe the output through grep.

[root@test64 laytonjb]# /sbin/hdparm -I /dev/sdd | grep -i TRIM
           *    Data Set Management TRIM supported (limit 1 block)
           *    Deterministic read data after TRIM

Below in Figure 1 is a screenshot of the output (the top portion has been chopped off).




Figure 1: Screenshot of hdparm Output (Top section has been Chopped Off)

Does the kernel support TRIM?

This is one of the easier questions to answer in this article to some degree. However, I’ll walk through the details so you can determine if your kernel works with TRIM or not.

TRIM support is also called “discard” in the kernel. To understand what kernels support TRIM a small kernel history is in order:

  • The initial support for discard was in the 2.6.28 kernel
  • In the 2.6.29 kernel, swap was modified to use discard support in case people put their swap space on an SSD
  • In the 2.6.30 kernel, the ability for the GFS2 file system to generate discard (TRIM) requests was added
  • In the 2.6.32 kernel, the btrfs file system gained the ability to generate TRIM (discard) requests.
  • In 2.6.33, believe it or not, the FAT file system got a “discard” mount option. But more importantly, in the 2.6.33 kernel libata (i.e. the SATA driver library) added support for the TRIM command. So really at this point, kernels 2.6.33 and greater can support TRIM commands. Also, ext4 added the ability to use “discard” as a mount option so it can issue TRIM commands.
  • In 2.6.36 ext4 added discard support when there is no journal, using the “discard” mount option. Also in 2.6.36, discard support for the delay, linear, mpath, and dm stripe targets was added to the dm layer. Support was also added for secure discard. NILFS2 got a “nodiscard” option, since NILFS2 had discard capability when it was added to the kernel in 2.6.30 but no way to turn it off prior to this kernel version.
  • In 2.6.37 ext4 gained the ability to do batched discards. This was added as an ioctl called FITRIM to the core code.
  • In the 2.6.38 kernel, RAID-1 support of discard was added to dm. Also, xfs added manual support for FITRIM, and ext3 added support for batched discards (FITRIM).
  • In 2.6.39 xfs will add the ability to do batched discards.

So in general any kernel 2.6.33 or later should have TRIM capability in the kernel up to the point of the file system. This means the block device layer and the SATA library (libata) support TRIM.

However, you have to be careful because I’m talking about the kernel.org kernels. Distributions take patches from more recent kernels and back-port them to older kernels, so if your distribution uses a 2.6.32 or older kernel (i.e. before 2.6.33), check with them to see whether TRIM support has been back-ported into the SATA library. You might be in luck.
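As a quick sanity check, you can look at the kernel you are running and, on kernels new enough to have these sysfs entries (roughly 2.6.33 and later), the block layer’s discard attributes for the drive (the device name below is an example):

# which kernel am I running?
uname -r
# a value of 0 in discard_max_bytes means the kernel will not send discards to this device
cat /sys/block/sdd/queue/discard_granularity
cat /sys/block/sdd/queue/discard_max_bytes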

Does the file system support TRIM?

At this point we know if the hardware supports TRIM and if the kernel supports TRIM, but before it becomes usable we need to know if the file system can use the TRIM (discard) command. Again, this depends upon the kernel so I will summarize the kernel version and the file systems that are supported.

  • 2.6.33: GFS2, nilfs2, btrfs, ext4, fat
  • 2.6.36: GFS2, nilfs2, btrfs, ext4 (including no journal mode), fat
  • 2.6.37: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat
  • 2.6.38: GFS2, nilfs2, btrfs, ext4 (including no journal mode and batched discard), fat, xfs, ext3

So if you have hardware that supports TRIM and a kernel that supports TRIM, you can look up whether your file system is capable of issuing TRIM commands (even batched TRIM).
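If your kernel and file system support batched discards (the FITRIM ioctl mentioned above), newer util-linux releases also ship an fstrim utility that issues FITRIM against a mounted file system. A quick sketch (the mount point is just an example):

# trim all unused blocks on the mounted file system, with verbose output
fstrim -v /mnt/ssd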

Does the dm layer support TRIM?

The dm layer (device mapper) is a very important layer in Linux. You probably encounter this layer when you are using LVM with your distribution. So if you want to use LVM with your SSD(s), you need to make sure that the TRIM command is honored starting at the top with the file system, then down to the dm layer, then the block layer, then the driver layer (libata), and finally down to the drive (hardware). The good news is that discard support was added to the dm layer in the 2.6.36 kernel, so any kernel 2.6.36 or later supports TRIM through dm.
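One quick way to see whether a particular device-mapper device passes discards through is to look at its queue limits in sysfs (dm-0 below is just an example; use the dm device backing your LVM volume):

# a value of 0 means this dm device will not pass discard requests down the stack
cat /sys/block/dm-0/queue/discard_max_bytes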

Does the MD layer support TRIM?

The last aspect we’ll examine is support for software RAID via md (multi-device) in Linux. Unfortunately, the bad news is that TRIM is not yet supported in the md layer. So if you’re using md for software RAID within Linux, the TRIM command is not supported at this time.

Testing for TRIM Support

The last thing I want to mention is how you can test your system for TRIM support beyond just looking at the previous lists to see if TRIM support is there. I used the procedure in this article to test if TRIM was actually working.

The first logical step is to make sure the SSD isn’t being used during these tests (i.e. it’s quiet). Also be sure that the SSD supports TRIM, using hdparm as previously mentioned (I’m using a SandForce 1222 based SSD that I’ve written about previously), and that your file system uses TRIM. For this article I used ext4 and mounted it with the “discard” option, as shown in Figure 2 below.

Figure_2.png

Figure 2: Screenshot of /etc/fstab File Showing discard mount option
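For reference, the key part of Figure 2 is simply the “discard” mount option; an /etc/fstab entry for an ext4 file system on the SSD would look something like this (the device and mount point are examples):

/dev/sdd1   /mnt/ssd   ext4   defaults,discard   0 0

You can get the same effect on an already-mounted file system with something like “mount -o remount,discard /mnt/ssd”.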

Then, as root, change to the directory where the SSD’s file system is mounted and create a file there (in the commands below, /mnt/ssd stands in for wherever /dev/sdd is mounted on your system). The commands are,

[root@test64 laytonjb]# cd /mnt/ssd
[root@test64 laytonjb]# dd if=/dev/urandom of=tempfile count=100 bs=512k oflag=direct

Below in Figure 3 is the screenshot of this on my test system.

Figure_3.png

Figure 3: Screenshot of first command

The next step is to get the beginning device sector of the tempfile just created – that is, the begin_LBA value on the output line whose byte offset is 0. It sounds like a mouthful but it isn’t hard. The basic command is,

[root@test64 laytonjb]# /sbin/hdparm --fibmap tempfile

Below in Figure 4 is the screenshot of this on my test system.

Figure_4.png

Figure 4: Screenshot of Sector Information for Temporary File

So the beginning LBA we are interested in for this file is sector 271360.
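If you’d rather not read the number out of the table by eye, an awk one-liner along these lines should print it, assuming the usual byte_offset/begin_LBA column layout (which may vary between hdparm versions, so treat this as a sketch):

# print the begin_LBA of the extent that starts at byte offset 0
/sbin/hdparm --fibmap tempfile | awk '$1 == 0 {print $2; exit}'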

We can read that sector directly to show that there is data written there. The basic command is,

[root@test64 laytonjb]# /sbin/hdparm --read-sector 271360 /dev/sdd

Below in Figure 5 is the screenshot of this on my test system.

Figure_5.png

Figure 5: Screenshot of Sector Information for Temporary File

Since the data is not all zeros, that means there is data there.

The next step is to erase the file, tempfile, and sync the system to make sure that the file is actually erased. Figure 6 below illustrates this.

Figure_6.png

Figure 6: Screenshot of Erasing Temporary File and Syncing the Drive

I ran the sync command three times just to be sure (I guess this shows my age since the urban legend was to always run sync several times to flush all buffers).
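In case the screenshot is hard to read, the commands in Figure 6 amount to something like the following, run from the directory where tempfile was created:

[root@test64 laytonjb]# rm tempfile
[root@test64 laytonjb]# sync
[root@test64 laytonjb]# sync
[root@test64 laytonjb]# sync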

Finally, we can read sector 271360 again to see what happened to the data there.

[root@test64 laytonjb]# /sbin/hdparm --read-sector 271360 /dev/sdd

Below in Figure 7 is the screenshot of this on my test system.

Figure_7.png

Figure 7: Screenshot of Sector Information for Temporary File After File was Erased

Since the data in that sector is now a bunch of zeros, you can see that the drive has “trimmed” the data. This requires a little more explanation.

Before the TRIM command existed, the SSD wouldn’t necessarily erase a block right away when the data in it was deleted. The next time it needed the block, it would erase it prior to using it. This makes file deletion performance very good, since the SSD just marks the block as empty and returns control to the kernel. However, the next time that block is used it first has to be erased, potentially slowing down write performance.

Alternatively, the block could be erased as soon as the data was deleted, but then file deletion performance isn’t very good. On the other hand, this can help write performance because the blocks are always erased and ready to be used.

TRIM gives the operating system (really, the file system) a way to tell the SSD that it is done with a block, which is something the SSD cannot reliably determine on its own. When the SSD receives a TRIM command, it marks the block as empty and, when it gets a chance (perhaps when the drive isn’t as busy), it erases the empty blocks (i.e. it “trims” them, or “discards” their contents). This means the erase cycle doesn’t affect write performance or file deletion performance, improving the apparent overall performance. This usually works really well for desktops, where the drive sits idle periodically and has time to erase the trimmed blocks. However, if the SSD is being used heavily, it may not get the opportunity to erase the blocks, and we’re back to the behavior without TRIM. This isn’t necessarily a bad thing, just a limitation of what TRIM can do for SSD performance.

Summary

I hope you have found this article useful. Answering the question “does my Linux box support TRIM?” is a little convoluted; unfortunately, the answer isn’t simple, and you have to step through all of the layers to fully understand whether TRIM is supported. A good place to start is the last list, which shows which file systems support TRIM: pick the one you like and see which kernel version you need. If you need dm support for LVM, then you have to use at least 2.6.36. If you use md with your SSDs, then I’m afraid you are out of luck with TRIM support for now.
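As a quick recap, the checks in this article boil down to something like the following (the device name is an example):

# 1. Does the drive report TRIM support?
/sbin/hdparm -I /dev/sdd | grep -i TRIM
# 2. Is the running kernel new enough (2.6.33 or later for basic TRIM)?
uname -r
# 3. Is the file system mounted with the discard option?
grep discard /etc/fstab /proc/mounts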