Data Integrity via Checksum
Probably the fundamental theorem of storage is,
“Data will always expand to fill the space allotted”
Users are always asking for more space to store their data while at the same time problems are getting larger, requiring more data capacity, and our ability to generate data is increasing as well.
At the same time, a recent study by the University of California at Santa Cruz and Netapp found that data is getting colder. That is, once the data has been created, it is rarely reused. If you use the temperature analogy to how often data is accessed, then if it is rarely, if ever, accessed after it was created, the data is considered to be “cold”.
Studies have shown us that on average, even just two weeks after the data is created, it is rarely touched. This means that the data is getting very cold. But at the same time, users insist that their data be accessible and available which usually means that it has to be on-line and not in an archive.
So we’re creating more data than every before, we’re not using it after a short period of time, and we want the data to stay available. Exacerbating this problem are requirements forcing us to keep data on-line for many years. The National Science Foundation (NSF) requires a data management plan for it’s research award recipients that includes how the recipients will store the data and make it available for some time (up to 20 years in some cases). The Sarbanes-Oxley Act is requiring us to keep records for even longer as well. We’re getting pushed into keeping more data and keeping a lot longer than we did previously.
One consequence of this is data integrity. How do we ensure that the data we’re storing does not become corrupt? I like to break this problem into two pieces:
- How do we check if data has become corrupted?
- If data is corrupted, how do we restore it?
This page deals with some tools that focus on the first piece and eventually will also focus on the second piece.
How do we know when data has become corrupted?
There are several ways to detect when data is corrupted. The most obvious is when you try to use the data and you either can’t read it or if the application crashes or gives you nonsense. But this can take more work because you have to run the applications against the data and have some way to determine the data is corrupted.
An easier way to determine if the data is corrupted (i.e. it has changed) is to use a Checksum. A checksum is a single value that is computed for a block of data (pretty simple idea). To determine if the data is corrupted (i.e. it has changed) then a checksum computed on the data doesn’t match the original stored checksum (assuming the original stored checksum is correct). One important aspect in using checksums to detect corrupted data is that the checksums should be as unique as possible to avoid the case where the data can change without the checksum changing.
There are many ways to compute a checksum such as md5sum, sha1sum, sha2 algorithms (sha256, sha384, sha512) as well as others. These checksums algorithms produce a different length checksum with longer checksums requiring more computational work. Table 1 below lists some checksum algorithms and their corresponding checksum length.
So you can see that some algorithms produce longer checksums than other algorithms.
There are some aspects to checksums that you need to keep in mind, especially considering we are using them to detect if a file has become corrupt. The first aspect is that the longer the checksum the more computational work it takes. So if you want to use a long checksum such as sha-512 then it will take much longer to compute than perhaps md5sum.
The second aspect you need to remember is that you need to store these checksums somewhere that serves as a “golden image”. That is, the stored checksums are presumed to be correct so they too must also not fall victim to data corruption. This is coupled with the fact that longer checksums have a higher probability of data corruption because they are longer. The increase in probability may be slight but where you are discussing petabytes of data, even small changes can have an impact.
The third consideration you must remember is a subtle but important point. If you store the checksums in a central repository (like a database or a simple flat text file), you have to maintain a mapping between the checksum and the file. If you lose this mapping the stored checksums become meaningless. For example, if you store the checksums in a centrally located, single database, you need to store the full path to the file and the checksum. But if the file is moved (new full path), or if the database becomes corrupt, then the entire database is considered worthless. You can start over and recompute the checksums for each file but you run the risk that the files may already be corrupted and then you are jut computing the checksum for a corrupted file (which is essentially defeating the purpose).
Correcting or Restoring Corrupt Data
Let’s assume that we have found a file that has corrupt data, what do you do to correct it? There are several approaches to this and each one has it pluses and minuses.
One way that some file systems correct this is to store the parity for a particular block on the disk along side the data in a RAID format. PanFS, the file system for Panasas does this as part of tiered parity. If the data in a block is found to be corrupt, then the parity data and other data blocks can be consulted and the data restored. If you like, this is RAID at the block level.
This type of approach works well but you sacrifice space on the drive to store the parity of the blocks. It may not be a lot but you have to recognize that you need to store extra information to help with correcting for bad data. Moreover, you have to check the parity blocks periodically so that the parity data is not corrupt (like comparing checksums).
A second approach is elegant and simple – brute force. That is, just make a one or more copies of the data. If one copy is determined to be bad, and another copy is determined to be correct, then you just update the bad data with the good data.
This approach too takes extra space since you are storing copies. So for example, if you keep two copies you will use 3 times the storage space (original + 2 copies). Many storage devices and file systems have taken this approach.
A third approach may be fairly simple – using RAID within the storage devices to correct corrupt data. Let’s say that a file has been found to be corrupt and you know which disk(s) holds the corrupted data (and the other disk(s) don’t hold corrupted data). To correct the bad disk, you tell the RAID controller that particular disk is bad and start the RAID rebuild process.
While this approach sounds simple there are a couple of aspects that need to be considered. The first aspect is that you need to be able to find which disk has corrupted data and make sure that the number of bad disks does not exceed the failed capability of the RAID level (for example 1 disk in the case of RAID-5 and 2 disks in the case of RAID-6). In addition, marking a disk as down and forcing a RAID rebuild means that the “good” disks have to be completely read and a hot-spare disk has to be written (remember that classic RAID is block based so during a rebuild you have to rebuild the entire disk). Remember that disks have a fairly low URE (Unrecoverable Read Error). This means that for large disk sizes and disk counts coupled with certain RAID levels, you may encounter a URE rendering the RAID rebuild failed meaning that all of the data is now bad (time to restore from a backup or a copy).
Keeping in mind the discussion about detecting data corruption and “fixing” or correcting data corruption, let’s move on to the point of a strategy for Data Integrity via Checksum
A Strategy for Data Integrity via Checksum
Given the previous discussion about data integrity and checksums, I have developed a strategy for using checksums for detecting data corruption. Remember that this part of the overall strategy only detects data corruption – it does not correct.
It is pretty obvious that the strategy involves using checksums to detect data corruption. However, I have added some features to help improve the checksum process. Firstly, I don’t compute just one checksum, I compute 5 of them (md5sum, sha-1, sha-256, sha-385, and sha-512). The reason I compute 5 checksums is that data rot (the probability of a single bit flipping spontaneously), is fairly low to begin with, so a single bit flip is unlikely to affect all 5 checksums. Consequently, this makes it fairly easy to determine if a checksum has gone bad. If I compute the 5 checksums and compare them against one another and not all of them agree (it’s basically a pass/fail when you check each checksum), then this indicates that the data is perhaps intact but the checksums have gone bad. Therefore, I just recompute the checksums and store them rather than having to restore all of the data.
The second piece of the strategy is that the checksums should be stored with the data. The reason for this is that if the checksums are centrally stored and something happens to this database, then I’ve lost all mappings between the checksums and the data. To me this seems unacceptable when it’s so easy to store the file’s checksum with the data itself. An easy way to do this is using extended file attributes. So we can easily define the checksums as specific extended attributes for files.
However, we can’t always escape the specter of a central database storing all of the checksums. When checking a file against the stored checksums and all 5 stored checksums don’t match the newly computed values, then the next step is to compare the checksums stored with each file and the newly computed checksums and the centrally stored checksums. If it is found that the checksums stored with the data file match the centrally stored checksums, then it is obvious that the data has been corrupted and needs to be corrected. We can also examine the corner case where the newly computed checksums match the centrally stored checksums match the newly computed checksums. If this is the case, then checksums stored with the file are corrupt and must be replaced (this scenario is very unlikely but we need to account for it anyway). But wait – there is more in the form of data protection.
To make sure the centrally stored checksums (typically a database) do not also suffer from bit-rot we need to do two things. The fist is that we need to compute a checksum of the database (a checksum of the file of checksums if you will) and store it with the database as an extended attribute. Then we need to make three copies of the database and store them in different locations (it shouldn’t be too much data and it’s very compressible). Then when you ready to use the database, you compare the three copies relative to one another. If two checksums match, then you can use one of those databases and repair the bad copy. If none of the three copies match, then you are in trouble because you don’t know which copy of the database is correct (two or three copies could be corrupt). However, the probably is very low of this happening, but if it concerns you, you can store 5, 7, 9, or 11 copies of the database (make an odd number of copies so you can break a tie if the checksums are different from one another).
There are really two steps to the data integrity process. The first step is to take a file that does not have checksums computed (and you are assuming the data has not been corrupted). You compute all 5 checksums and store them with the files in extended attributes, and you also store them in the central database. Once you are done with adding files to the database, be sure to update the checksum of the database and make the desired number of copies.
The second step is to periodically check the data for corruption. To do this, you compute all 5 checksums of the file, and compare them to the extended attribute checksums. If the newly computed checksums match the extended attribute checksums, then the data is not corrupt and nothing needs to be updated. If one to four of the newly computed checksums do not match the checksums stored in the extended attributes, then most likely the stored checksums have become corrupted. I would recommend further checking the centrally stored checksums which should match the newly computed checksums. If they do match, then simply update the checksums stored in the extended attributes. If they do no match, then it’s reasonable to say that the checksums or data have become corrupted so the data needs to be corrected (more on that later).
Overall the process is very simple. However, realize that you will need a fair amount of computational power to compute the checksums. This is particular true if you have a large number of files.
The implementation of this strategy isn’t overly difficult. As an example, the following code is all written in Python.