|  |
 |
Table of contents:
|  | HTML |  | PDF |
This article:
|  |
HTML
|  | PDF | DOI: 10.1147/rd.524.0413 | Copyright info |  |
 |
 |
Undetected disk errors in RAID arrays
|  |  |
by J. L. Hafner, V. Deenadhayalan, W. Belluomini, and K. Rao
|
|
|  |
 |  |  |
|
| |
|
Disk drives are remarkably reliable devices, given their electronic, mechanical, and physical complexity. Modern drives have advanced error detection and correction mechanisms. For example, they can recover from certain internal errors, they can be instructed to remap logical to physical sectors to avoid problem areas on the platters, and they can usually report when the data they are storing is no longer readable (e.g., in the case of an uncorrectable read error, or hard error). Disk drive manufacturers specify reliability parameters, such as MTTF (mean time to failure) or hard-error rate. Several studies [1–4] involve the measurement of failure rates. Unfortunately, disk drives can also fail in ways that are undetected by the drive. This silent data corruption [5, 6] is referred to as an undetected disk error (UDE) in this paper. These come in two classes, undetected write error (UWE) and undetected read error (URE). The former are more significant problems, typically because their effects persist. Interestingly, UWEs produce detectable symptoms only during subsequent read operations; in effect, there is no immediate way to determine whether a data corruption event was caused by a UWE or a URE. We note that while evidence of UDE occurrence exists [5, 7], theoretical models for predicting failure rates for UDEs are not available. Furthermore, disk drive and storage systems manufacturers are reluctant to point out that their products are susceptible to UDEs.
A number of technologies exist that can be applied to the problem of UDE detection and possible correction. For example, applications can examine the contents of the data for reasonableness. For instance, for a simple text-editor application such as Microsoft Notepad, one does not expect to encounter a lot of non-ASCII byte codes. File systems may use checksums on files or blocks of files for error detection. However, for such systems to detect reliably and to have the ability to correct frequently (if not always), they must detect aggressively—on every read access to the physical disks. Most RAID (Redundant Arrays of Independent Disks [8]) subsystems implement a typical host write operation as a combination of reads and writes (e.g., read–modify–write in the case of a RAID5 array), so whereas the application performs only a write operation, the disks perform both reads and writes. In addition, RAID subsystems may read data for the purpose of reconstruction when disks fail detectably (e.g., a fail-stop or a hard sector error). In such cases, the reads or writes to the physical disk are issued by the RAID subsystem on its own and not in response to a host- or application-initiated operation. Furthermore, the RAID software layer is in the best position to correct any UDEs it detects and locates, when such corrections are theoretically and practically possible. (As is well known in the field of computer science, the abstraction layer framework provides a means to separate higher-level functions from lower-level functions and provides well-defined application programming interfaces that enable interaction and communication between the various layers.) For these reasons, we believe that the optimal solution to the problem of UDEs lies in the RAID layer, which is the layer closest to the physical drive. This issue is discussed later in this paper.
The purpose of this paper is to first provide a brief description of the types of disk drive failures that lead to UDEs and the effect they have on the integrity of the data that the drive returns to the requester (host, application, or RAID). We provide a small set of solution technologies and give brief examples of how they are used in some systems. We then focus on RAID storage systems and discuss how UDEs, if not detected sufficiently early, can cause corruption with two side effects: 1) the data corruption occurs in locations physically separate from the misbehaving drive and 2) the detection and location of these errors by the RAID subsystem are masked, or confused. Finally, we present a family of methods deployable in the RAID layer that address the problem of early detection and possible correction.
| |
|
Broadly speaking, UDEs can be classified on the basis of their origin, that is, as being caused by software or firmware malfunctions or by hardware malfunctions. Defects in the drive firmware code can cause the wrong data to be returned by the disk drive in response to a read command or the wrong data to be written by the drive in response to a write command. Typically, UDEs in read operations tend to be transient, while those in write operations are usually persistent; the incorrect data remains in place until it is overwritten by correct data.
UDEs caused by hardware malfunction can be classified as memory or interface related and head or media related. Memory or interface malfunctions can cause one or more bits to be flipped during data transmission to and from the disk drive, and their detection depends on the error-detection capabilities that are employed by the subsystems in which they occur.
With respect to head or media malfunctions, six disk misbehaviors exist that can produce a UDE event, three that produce UWEs and three that produce UREs. Table 1 summarizes the cases and the effect each one has on subsequent data integrity.
|
| Table 1 Summary of UDE types, characteristics, and side effects. (UDE: undetected disk error; UWE: undetected write error; URE: undetected read error.) |
|
|
|
|
|
| UDE types | Characteristics | Data integrity effects |
|
| Write (UWE) | Dropped write | Stale data |
|
| Off-track | |
| | Near off-track | Possible stale data |
| |
| Far off-track (miss target track A, or on or near track B) | Read of track A causes stale data |
| |
| Read of track B on B causes corrupted data |
| |
| Read of track B near B causes possible corrupted data |
|
| Read (URE) | ECC miscorrect | Corrupted data |
|
| Off-track | |
| | Near off-track | Possible stale data |
| |
| Far off-track | Corrupted data |
|
A dropped write (sometimes referred to as a phantom write [6]) occurs when the head does not produce a sufficiently strong signal in the right location to override and reset the bits of the data already present. In other words, the write never occurs and is effectively dropped by the drive. Causes for such errors include short circuits in the pre-amp signal lines or in the actuator cable. These errors can be persistent or intermittent.
An off-track write (also called a misdirected write) occurs when the head is not aligned properly on the track during the write operation, which has two possible consequences. First, the new data may be written slightly off-track, in the gap between two tracks. When a subsequent read occurs, the head may align with the old track or in the gap and so produce either stale or new data, respectively. Second, the new data may be written sufficiently off-track so that it corrupts data in a neighboring or distant track. In this case, the read of the original target location will produce stale data and a read of the offset target location will produce corrupted data; in effect, this is a double error scenario. Such off-track writing can occur for a number of reasons, but most notably from vibrations either in the drive environment or in the drive spindle or arm. In addition, off-track writes can happen as a result of defects in the disk between tracks that confuse the track alignment algorithms.
On reads, the ECC (error correcting code) logic may not function effectively if there are more errors than the code is designed to correct, but the code aggressively miscorrects as if there were fewer errors. However, the probability of such a bit error occurrence (which causes the ECC logic to miscorrect) is several orders of magnitude lower than the probabilities of the types of errors that the logic can properly correct. Thus, an ECC miscorrection is extremely unlikely. The symptoms of these UREs are corrupted data. A read may also track incorrectly, near or far off-track, and in so doing produce either stale data or incorrect data, respectively.
Regardless of the origin of the UDE, the symptoms for applications are the same—incorrect data returned for a read operation. Hence, the relative frequencies of the different causes are not as important to consider as technologies that can validate the data returned on a read operation.
| |
|
In this section, we present some technologies that can be applied singly or in combinations to address some or all of the problems of data corruption caused by UDEs, as well as other causes of data corruption. The list is not exhaustive but is intended only as a general overview of the types of methods that are used. They are also not mutually exclusive but are combined in some systems to improve detection and correction capabilities. We subdivide the technologies into different classes depending on the software layer in which they are implemented and the type of problem they attempt to address.
The first class of technologies applies checking at the middleware or application software layers. For example, some file systems use checksums on data chunks (e.g., 4-KB chunks) and store these checksums separate from the data chunks themselves. These are read with the data chunks, and the checksum is recomputed and compared; thus, a mismatch indicates corruption of either the data or the checksum. Technologies applied at this level have the power to detect a large class of UDEs as well as errors that might be injected by other components in the system such as networks, internal buses, and memory devices (such as DRAM). ZFS (a file system originally created by Sun Microsystems for the Solaris Operating System [9]), the IRON (internal robustness) file system [6], and many enterprise-class database systems have such features. However, many layers of abstraction exist between these methods and the disks themselves, so these methods do not necessarily help in terms of recovery. Let us consider two issues. First, when an application detects an error, it generally has no means of recovery (without going to system backup). Second, as noted in the introduction, these detections are not enforced on every disk read, only those that return data to the application; thus, this may permit additional data corruption (as described in more detail in the section “UDEs and RAID: The problems”). Using ZFS, recovery is possible when software RAID-Z is enabled; this effectively places the detection and recovery within the RAID subsystem and so is more aptly included in our second class of technologies.
The second class of technologies is implemented in the storage system itself. These have the advantage of being aware of every disk read and write, whether generated by the application directly or by the storage system itself. Here, the detection is limited only to errors in the storage system and not to errors introduced along the datapath between the storage system and the application. This class has a number of subclasses with respect to different powers of detection and correction with associated performance impacts and performance/cost tradeoffs.
One subclass is based on data or parity “scrub.” In a data scrub, the disk blocks are read (or verified) via a background process to detect latent disk errors [10]. By itself, a data scrub cannot detect UDEs (see the following paragraph). A parity scrub (see Nexsan systems [11]) provides an additional step beyond a data scrub. In addition to reading all of the data blocks, it recomputes the parity from the data and compares this computed parity with the value stored on the parity disk. A miscompare indicates a data corruption error. Unfortunately, data scrubs and parity scrubs do not always provide for recovery, and a comparison does not always provide a guarantee that no data corruption has occurred (see the section “UDEs and RAID: The problems”).
A second subclass co-locates some form of metadata along with the data. When this metadata contains checksums, a read of the data and metadata can detect data corruption errors, especially those introduced by firmware, software, or memory bus components. When the metadata contains some form of logical addressing of the data, a read can detect off-track writes but only at the offset target location (where data was incorrectly overwritten). These methods are usually combined with a data scrub (see Sundaram [12]) to enhance the effectiveness of a scrub.
A third subclass uses the SCSI (Small Computer System Interface) write–verify disk command (see SCSI Block Command Standard-3 [13]) that requires the disk, after writing the data to the platter, to reread from the platter to determine that the data was written correctly. This works for addressing the dropped write errors but cannot necessarily detect off-track writes because the tracking algorithm, which has already decided that the correct track is being used, may not move the head from the current track (which may or may not be the correct track). This also has a response-time performance impact because the disk must spin a full revolution before the check and before the status is returned to the host. This method has a simple recovery algorithm that causes an attempt to rewrite the data to the same or a different location on disk.
A fourth subclass of the technologies implemented in the storage system is based on the write cache. This is similar to the write–verify approach, but the verification is handled external to the disk drive. New data is held in the cache even after being written to disk. When the data needs to be evicted from the cache, the disk is first read and compared with the cache copy. The eviction occurs only if the data compare is correct. This has the same simple recovery algorithm as the write–verify approach. Unfortunately, this method may reduce the amount of usable cache memory compared to a typical cache (in the worst case to half its value), and it involves a full independent read operation—doubling the I/O and bandwidth requirements. The cache may need to hold both the version of the data last written to disk (if that has not been verified) and any new write data addressed to the same logical address. The new data cannot be written to disk, for example, with a read–modify–write operation until the disk copy has been verified, so two copies (old and new) must be held in the cache. An example of this method is described in [14].
A fifth subclass of storage system detection technologies uses metadata, similar to that at the application layer. In these technologies, the metadata is stored and managed within the storage system itself. The metadata may be a checksum or version number (e.g., a sequence number or timestamp) that is associated with a single block or a set of blocks. The second copy of the metadata is stored in memory for fast comparison and access. This memory copy must be flushed to disk somewhere in the system in order to maintain detection capabilities after system crashes. Depending on the granularity and size of the metadata, the system may not have enough memory to maintain the full set of metadata; thus, the metadata in memory is only a cached subset, and periodic disk reads (and writes) will be required. Judd [15] describes one such system.
Finally, a sixth subclass also uses some form of metadata for each block or set of blocks, but in this subclass, the metadata is stored internal to the RAID array (or RAID stripe). (Data striping is the segmentation of logically sequential data so that segments can be assigned to multiple physical devices and written concurrently.) It then has a locality of reference that can be used to mitigate some of the disk I/O overheads of separate locations. The advantages of this approach are lower I/O and bandwidth penalties for most operations; the disadvantage is potential loss of detection under certain failure scenarios. Before we describe these types of methods in more detail, we provide a short summary of related work and a detailed description of the effects of UDEs within a RAID subsystem.
| |
|
A significant amount of work has been recently undertaken in the industry to address the issues caused by UDEs. For example, one such effort is the end-to-end data integrity checking (which makes use of a DIF [data integrity field]) from the T10 standards organization (see the SCSI Block Command Standard-3 [13]). Unfortunately, the DIF method cannot protect against all types of errors. Most notably, dropped writes may not be detected without additional effort in the application layer.
The work of Schwarz et al. [16] uses signatures (which have some of the characteristics of hashes, a type of checksum) together with data scrubbing to detect some forms of UDEs. These signatures are stored in a database, that is, they are not co-resident with the data blocks. The focus of the paper [16] is on data reliability as a function of the data scrub rate and on which blocks are scrubbed and when they are scrubbed. The authors do not directly address the issue of how the RAID code and data scrub interact with UDEs and do not directly address questions of data correctness.
The paper by Sivathanu et al. [17] presents a very high level survey of various methods that deal with data integrity issues and describes a rough taxonomy of solutions. Some overlap exists between our classes and their taxonomy, but we subdivide the classes in the RAID layer variations. The researchers in [17] do not employ a specific RAID logical layer, although this might fit into the device driver or file system layer. Their paper touches briefly on how RAID1 mirroring or RAID5 scrubbing might help, and it contains an extensive bibliography.
The paper [6] that describes the IRON methodology deserves additional mention. It describes complex disk failure models and root causes and addresses the question of how file systems can address disk faults. The paper includes another taxonomy for detection and recovery techniques, but these are mostly viewed from the perspective of a file system, for example, checking return codes from device drives. The paper also includes a very extensive bibliography.
Malhotra and Trivedi [18] use Markov models and analytic techniques to model data reliability and data corruption, which they refer to as catastrophic error, on the basis of parameterized probabilities for certain errors, most notably probabilities of combinations of random byte and burst errors upon read. They address how these are captured, missed, or mishandled by the ECC on the disk. When applied to UDEs, we have two possible concerns with their model. First, they consider the effect of improper alignment of the head only for read operations and not for write operations; hence, they determine that these are usually corrected by a retry. Second, a dropped write is viewed in their model as a sector-length undetected burst error and hence extremely unlikely in their parameterization. We believe that off-track and dropped writes should be modeled by a probability distribution that is orthogonal to the probability distribution for byte and burst errors, which is what the ECC itself is designed to address.
Sundaram [12] gives a description of how some of the technologies in the toolkit are used in several network appliance products. They use co-located checksums per 4-KB block and periodic data scrubbing to address problems of data corruption caused by firmware errors and aging disks. Checksums are used to detect local data corruption but not dropped writes or far-off-track writes. They solve these problems with a novel technique that involves storing logical identifiers (e.g., logical address) with the data and writing new versions of the data on new physical locations.
Darden [10] discusses enlarging a sector on disk to include eight bytes of metadata. Included in the metadata is a CRC (cyclic redundancy check) that can detect local data corruption. This paper also describes the application of periodic data scrubbing, but primarily for the purpose of early detection of failed sectors. Also included in the metadata are a timestamp and a write stamp (phase bit). The timestamp is updated on a full-stripe write (i.e., an update of all the data in the stripe) and the write stamp is toggled on each write. These are replicated in the parity sectors (and so have some overlap with the data parity appendix scheme described in the section “UDEs in RAID: A family of solutions”). However, the consistency of these stamps is only checked each time an array is opened (e.g., on controller reset). Note that this scheme will be unable to detect two successive dropped writes that would have only toggled the write stamp bit.
| |
|
Storage system designers who care about data integrity also care about data reliability. These systems use RAID algorithms (e.g., RAID1/4/5) to increase reliability. Many systems are implementing RAID6, which we refer to as any two-erasure-tolerant code such as those referred to as EVENODD [19], RDP (row-diagonal parity) [20], or Reed–Solomon [21], to further improve reliability. As far as we know, there has been little analysis of the effect UDEs have on such systems. Typically, such systems use some form of parity scrub for which stored data is checked against the stored parity for UDE detection. We show below that this method cannot be relied on as a primary means to detect data corruption caused by UDEs. In particular, the scrub may mistakenly find that the data or parity is correct when in fact both are wrong.
Furthermore, one may believe naively that any two (or more)-erasure-tolerant code (e.g., any RAID6 type of code) should have the ability to not only detect errors during a parity scrub but also correct one such error (e.g., one UDE) per stripe. However, as we show, this is also not necessarily true because of the side effects of normal and successful RAID operations on a stripe following a UDE. We show cases in which even a RAID6 code can be mistaken about both the integrity of the data and parity and about the location of that error when an error is detected. Even for RAID6, parity scrubs are problematic for detection and correction of errors created by UDEs.
To explain these cases, we examine a state diagram (with partial state transitions) for a typical RAID stripe in a RAID subsystem. The RAID stripe is the smallest collection of data blocks and parity blocks (more generally, redundancy blocks) from multiple disks that are computationally related. Read operations on the data portion of the stripe are typically served by just reading the relevant data blocks. Write operations, which require updates to both data blocks and related parity blocks, may require reading of some data and/or parity blocks (either old data and old parity in a read-modify-write implementation or dependent data in a read-other-parity-compute implementation).
| |
|
We begin with a brief discussion of RAID5. Figure 1 shows a typical RAID5 stripe and the effect only three operations can have on the data consistency and correctness of the stripe. In the top row, the stripe is in a good state in which all of the data is correct and parity is consistent (e.g., after an error-free full-stripe write). (In this figure, A, B, and C designate different data blocks, and P represents a parity block. The letters a, b, and c represent data values, and the designates the XOR operation.)
Figure 1
The second row shows the state of the stripe after a read-modify-write operation to the stripe that involved an undetected write error (UWE) (e.g., dropped write) to data block A. The parity is read, modified, and written correctly. The old data is read correctly, but the new data is not written (the write is dropped and the error undetected). In this case, the parity encodes the correct (new) data, but the data in block A is stale (old). A parity scrub at this point would show this inconsistency, but with only a one-erasure-tolerant code, it is impossible to determine the location of the error. In this case, the RAID subsystem has to mark all of the data in the stripe as bad, resulting in a data loss event for the whole stripe. Note that the same scrub outcome would result if the data block had been written correctly but the parity block write was dropped (although in this case the recomputed parity would restore the stripe to the good state). The third row in Figure 1 shows the stripe after another update of data block A, but this time all reads and writes were error free. In this case, the scrub would again detect that something is wrong. However, now the error has moved from block A to the parity block even though the UWE occurred only on the first drive. Finally, the fourth row shows the stripe after a failure of disk block B and a reconstruction. It is clear that this block B now contains incorrect data, but the stripe is now consistent. In other words, the error migrated from block A to block B but is now undetectable by parity scrub.
This simple example shows the effects of normal (and error-free) operations on a stripe after one UWE. The stripe can contain incorrect data yet still appear to be consistent with respect to a parity consistency check.
| |
|
In the previous section, we saw that a RAID5 stripe can enter into many different states as a result of a single UWE event followed by normal and error-free write operations. Given that RAID6 is being widely deployed in the industry and that it is more complex compared to RAID5, we analyze in more detail the RAID6 stripe states and transitions. A RAID6 erasure code is also a single-error correction code [22]. As such, it comes with a mathematical algorithm that detects and locates (and so corrects) one error provided it is applied in a timely manner, that is, immediately after the error occurs. In the following description, this algorithm is referred to as the locator. The locator may specify the location of an error or may indicate that more than one error has occurred and so no location can be identified (locator output is empty). The locator may also be mistaken and, thus, report a single error location when three or more errors have occurred.
Figure 2 shows the state diagram for a RAID6 stripe. The two major states and their substates are described in more detail in the following text. Note that both sets of substates are only implicit; they are not observable by the normal RAID-level mechanisms. The complete state-transition rules for this state diagram are complicated, as they depend on what is corrupted, on which operation is being applied, and on what sequence of operations is applied. For our purposes, we only indicate that each state is attainable by some sequence of simple operations even though only one UDE event may have occurred. Hence, it is possible for a stripe to appear in any of these states and inner states (i.e., substates). The conclusion is that the parity scrub and location/reconstruction algorithms are not reliable. They can be fooled (undetectable-bad state, in the figure) or they can get confused and possibly provide an incorrect result (locator failure, in the figure).
Figure 2
The parity-consistent state
The parity-consistent state is reached when the parity scrub detects no data or parity inconsistencies. In coding theory terms, the syndromes are all zero. This state has two substates. In the good state, the data and parity are as they should be—namely, as if no error had occurred. This is assumed to be the initial state (e.g., after zeroing all data and parity in a stripe). In this state, any read by the host will return the same data that the host wrote previously, regardless of any failures that may have happened in the meantime. Note that any (fail-free) operation that overwrites all bad data (or parity) in the stripe without first reading it will transition into the good state from any other state. Such operations include full-stripe writes and promoted-full-stripe writes when the data read is correct data. The three actions needed to perform a promoted-full-stripe write are read-other[good], compute parity, and overwrite bad.
The undetectable-bad state has the property that a parity scrub will detect no data or parity inconsistencies, but there are bad blocks of data and/or parity in the stripe (a host read to these blocks will return bad data to the host). The stripe can reach this state from the parity-inconsistent state by a number of different operations. For example, a promoted-full-stripe write in which the dependent data was previously corrupted by a UDE will leave the corrupted data unchanged, but generate incorrect but consistent parity based on this corrupted data. In addition, a rebuild operation in the presence of corrupted data (where the corrupted data is not on the failed drive) will also transition into this undetectabe-bad state. These are both examples of read-other[bad]-compute parity type of operations. In some sense, the undetectable-bad state is the worst state for the stripe, because there is something wrong, but it cannot be determined by normal parity scrub. The parity-inconsistent state
In the parity-inconsistent state, a parity scrub will determine that the parity and data are not consistent, indicating that some error has occurred. The error location algorithm of the erasure code can either 1) correctly identify the location of incorrect data or parity, 2) incorrectly identify a location (i.e., identify a location, but the data or parity in that location is correct data), or 3) fail to determine an error location at all (symptomatically, this is as if more than one error occurred, although in reality only one UWE may in fact have occurred). For our purposes, we combine the last two cases into the locator-failure substate.
As noted, the inner states of the parity-inconsistent state are implicit because there is no direct way to tell what inner state the stripe is in by normal RAID-level mechanisms. Naively, the undetectable-bad and the locator failure states should never occur with a RAID6 algorithm. This would be true if the scrub occurs immediately following a UWE event. After a UWE event, there has been only one error, so that the stripe is clearly in the parity-inconsistent state (in fact, in the locate-and-repair state) and the locator algorithm should correctly identify the UWE location. Because there has been no other operation between the UWE and the scrub, no error propagation has occurred (as we saw in Figure 1) so the RAID-reconstruction algorithm should restore the correct data. Unfortunately, it is not practical to run a scrub after every write operation on the stripe. Scrubbing requires reading all of the disks and would dramatically increase the number of I/Os involved in each write transaction.
To show that every state is achievable, we consider the EVENODD [19] code as typical of XOR-based codes. Consider an EVENODD code with prime number p = 3, with n = 3 data disks and two rows (as in Figure 3). The columns represent disk drives (or strips on each disk). The uppercase letters denote logical addresses; the lowercase letters represent the data or parity content. The P and Q columns contain the formulas used to compute the P and Q parity.
Figure 3
Initially, assume the stripe is in the good state with data values a0, a1, b0, and so on. Now suppose that a write to block A0 with new data value a0’ occurs but that the actual write to block A0 fails with a UWE. The new relevant values are now as follows (← is the assignment operation, and as mentioned previously, is the XOR operation):
A0 ← a0, (the error) B0 ← b0, P0 ← a0’ b0 c0, Q0 ← a0’ x0, Q1 ← b0 y0,
where x0 = c1 b1 c0 and y0 = a1 b1 c0 (the specific values here are not relevant). The stripe has transitioned from the good state into the locate-and-repair state by this operation. Table 2 shows examples of how the stripe can transition to other states by subsequent error-free operations. Correct values are shown in a normal font, while error values are shown in bold. The table demonstrates how the location of the error can migrate. The four operations following a UWE shown in the table all result in different combinations of errors. In all of the cases, the locator function either fails or reports an incorrect error location. These examples show how it is necessary to check for UWEs after each operation. Once a subsequent operation occurs, the ability to identify and correct the error is lost and the system behavior becomes difficult to diagnose because of error migration.
|
| Table 2 Example transitions from the locate-and-repair state after a single undetected write error (UWE) event at location A0. The post-UWE[A0] columns show the values in each block immediately after the UWE. As described in the text, only A0 is incorrect. Each subsequent column shows the values and state after the labeled operation. RMW[a0″] is a read-modify-write to block A0 with new data a0″. RO-PC[b0′] is a read-other-parity-compute to block B0 with new data b0′. PFSW[b0′, b1′] is a promoted full-stripe write to strip B (also read-other). Rebuild[B] is rebuild of strip B (e.g., after a disk failure). All of these operations read the incorrect contents of block A0. When parity and data are consistent, the locator output is { } (the empty set). |
|
|
|
|
|
| Post-UWE[A0] values | Operation After UWE[A0] |
| Location | Initial value | RMW[a0″] | RO-PC[b0′] | PFSW[b0′,b1′] | Rebuild[B] |
| A0 | a0 | a0″ | a0 | a0 | a0 |
| |
| B0 | b0 | b0 | b0′ | b0′ | b0 a0 a0′ |
| |
| P0 | a0′ b0 c0 | a0 a0′ a0″ b0 c0 | a0 b0′ c0 | a0 b0′ c0 | a0′ b0 c0 |
| |
| Q0 | a0′ x0 | a0 a0′ a0″ x0 | a0′ x0 | a0 x0′ | a0′ x0 |
| |
| Q1 | b0 y0 | b0 y0 | b0′ y0 | b0′ y0′ | b0 y0 |
|
|
| Stripe state | | Locate failure | Locate failure | Undetected bad | Locate failure |
| |
| Locator output | | {A0} | {Q0} | { } | {Q0, Q1} |
|
Host operation types: A summary
Once bad data exists on the stripe as a result of a UWE, the state of the stripe evolves differently depending on the host operations that occur. This discussion shows that these host operations can be classified into five classes depending on the state they cause the RAID array to enter. These classes correspond to (a) host reads, which may or may not return bad data but which leave the state of the stripe unchanged, (b) host writes that have both a read dependency on bad data or parity but also overwrite it (e.g., read[bad]-modify-write), (c) host writes that have a read dependency on bad data/parity but do not overwrite the bad data (e.g., promoted-full-stripe write), (d) host writes that have no read dependency on bad data or parity but do not overwrite it (e.g., read[good]-modify-write), and (e) host writes that have no read dependency on bad data/parity but overwrite it (e.g., a full-stripe write as well as a promoted-full-stripe write in which new data overwrites bad data—and a rebuild operation in which the bad data was on the failed disk).
In general, class (b) operations can leave the state of the stripe unchanged but can also transition from a locate-and-repair state to a locator failure state depending on specifics of where the bad data/parity lies and the dependence on that for the specific operation. Class (c) operations typically transition the stripe to the undetected-bad state, and class (d) generally leaves the state of the stripe unchanged. For class (e) operations, transitions are to the good state.
| |
|
Given the problems that UDEs can cause in a RAID subsystem, as described in the previous section, we feel that the best solutions must lie within the RAID subsystem itself (or in a layer even closer to the disk). In this section, we describe a family of UDE detection (and possible correction) technologies that are fully embedded within the RAID layer. The common element in the family is the co-location of metadata associated with each block (or chunk of blocks) of user data with one or more parity blocks computed from the data. As such, they are all variations on what we call the parity appendix method. With the exception of the data parity appendix method (see below), these methods are new.
The advantage of co-locating metadata with the parity data arises from the fact that whenever new data is written to disk, new parity is also written to disk so the metadata updates add no cost to disk read and bandwidth. In addition, for most operations, particularly the read-modify-write implementations, the metadata disk I/O has no additional overhead; it occurs when the old parity is read in the normal course of the implementation. The methods enable a host validated read operation in which data is cross-checked with the metadata. This incurs a penalty of one additional I/O for reading the metadata, but this is an optional feature. Validation can be used on every read operation to guarantee integrity of the data to the host (if the performance penalty is acceptable) or periodically as an integrity control check (similar to the way parity scrub is performed today, with significantly less overhead) or it can be used on demand when there are questions about the behavior of one or more disk drives in the system (e.g., if a read-modify-write detected a problem on a particular disk, host validated reads may be enabled for that disk alone, thereby reducing the overall system impact but providing the extra level of data protection where it is most likely needed).
We use the term chunk to define the granularity of data or parity from a single disk for which validation metadata is maintained. The size of a chunk may be a single sector or a set of contiguous sectors up to the RAID strip size, depending on a number of factors that trade off storage overhead, I/O costs, and other factors of a specific implementation or architecture. The main requirement is for a strip to contain an integral number of chunks. To each chunk of data or parity, we add an appendix; this is a small unit of storage (small compared to the chunk size) that is co-located adjacent to or internal to the chunk. An appendix to a data chunk will contain one copy of some metadata for the data chunk itself and/or other data chunks. An appendix to a parity chunk will contain another copy of the metadata for each of the data chunks that contribute to that parity value. In some cases, this rule can be relaxed; the key is that sufficient copies of the metadata for each data chunk are replicated in sufficiently many parity appendices to provide the same redundancy of metadata as there is for the data itself. One implementation of an appendix is a scheme in which the appendix is one sector of metadata associated with a data chunk. Another implementation is described below using oversized disk sectors.
The metadata in a data appendix may contain checksums of a data chunk (e.g., LRC [longitudinal redundancy check], CRC, hash, or other compressed version of the raw data) or it may contain (embedded in the checksum or separately) the physical or logical block address of the data chunk it represents. Also, it may contain some sort of version number of the chunk (e.g., sequence number or timestamp corresponding to when the data is updated). The size of the metadata (how many bytes for a checksum or for a version number) is dependent again on many factors but is typically controlled by the size of the parity appendix and the width of the RAID stripe, because the parity appendix needs to contain a copy of the metadata for all data chunks it represents, which is usually the width of the data substripe.
Figure 4 shows a schematic of four variations of the basic data/parity/appendix layouts of a RAID6 stripe. (The implementation for RAID5 can be easily derived from the RAID6 version.) The names of each variation suggest where copies of metadata are stored, which directly affects the validation costs and guarantees. All methods store metadata in parity appendices. The parity appendix method only stores metadata in a parity appendix. The data parity appendix method additionally stores some metadata for a data chunk in the appendix for that data chunk itself. The buddy parity appendix method stores metadata for a data chunk in an appendix of a buddy (e.g., in the appendix of the neighboring chunk on the next disk). The data–buddy parity appendix method puts metadata for a data chunk in the appendix of the data chunk, as well as in an appendix of a buddy. Other methods may store copies in additional buddy chunk appendices (extending either the buddy parity appendix or the data–buddy parity appendix method).
Figure 4
We now explain how the simplest implementation of these methods can limit the type of metadata that is stored. For either of the data appendix methods, where a copy of metadata of a data chunk is stored in its own appendix, version numbers can be used. For the other two families, checksums are sufficient. Note that if a version number is not stored co-located with the data, the number has no value. The advantage of version numbers is that they are very easy to compute when compared to checksums, which are usually computationally expensive if they cannot be generated by hardware. One might presume that version numbers can be used to locate a UWE (e.g., a lower version number probably experienced the error), but this logic applies only to dropped writes; for the case of a far-off-track write, version number comparisons can be misleading. Also, use of version numbers without checksum requires that the data chunk and the appendix be updated atomically; adding a checksum to the appendix, when the appendix is in a separate sector, serves as a mechanism to ensure atomicity because it enables cross-checking of the data with its appendix.
A simple implementation of the data parity appendix method is described by Clark et al. [23]. The sectors on disk are formatted for 520 or more bytes, 512 bytes of which are user data. The other header bytes comprise the appendix for each sector. The metadata consists of simple version numbers. The version number is one byte in size, and its byte offset within the header is determined from the logical strip number within the stripe. In this way, the parity header is computed directly from the XOR operation used to compute the parity; no special access is required to the parity header bytes upon update. Also, very little storage overhead is used for the appendix (e.g., 8/520 = 1.5%) as well.
The differences between these four methods involve tradeoffs of protection versus performance impact. This is summarized in Table 3, which has four major columns, one for each method. Each column has two subcolumns in the first two rows to differentiate between the optimal and the critical RAID array state. We define the RAID array to be in an optimal state if all of the disks in the RAID array are in an online (readable and writable) state. We define the RAID array to be in a critical state if the RAID array has reached its fault tolerance limit and will become offline (unwritable) if one (more) member becomes offline.
|
| Table 3 Summary of effectiveness and I/O cost overhead of each of the four appendix methods for detecting and/or locating undetected disk errors, enabled in the RAID layer. |
|
|
|
|
|
| | Parity appendix | Data parity appendix | Buddy parity appendix | Data–buddy parity appendix |
|
|
|
|
| Optimal | Critical | Optimal | Critical | Optimal | Critical | Optimal | Critical |
|
| Detect guarantee | Yes | Maybe | Yes | Maybe | Yes | Yes | Yes | Yes |
| |
| Locate guarantee | R5: maybe | Maybe | Yes | Maybe | Yes | Maybe | Yes | Yes |
| |
| R6: yes | | | | | | | |
| |
| Short write I/O overhead | +0 | | +0 | | +1 | | +2 | |
| |
| Validated read I/O overhead | +1 | | +1 | | +1 | | +1 | |
| |
| Full-stripe write I/O overhead | +0 | | +0 | | +0 | | +0 | |
| |
| Promoted-full-stripe write I/O overhead | +1 | | +1 | | +1 | | +1 | |
|
In the first two rows of Table 3, we indicate the ability of a method to detect or locate a UDE. Maybe means that there exists an algorithm that will correctly provide the indicated function under certain conditions. For example, in RAID5/optimal with the parity appendix scheme (method), if more than two data chunk checksums differ (detect guarantee is yes) from the values in the parity appendix, we can assume that the parity appendix (and so parity) is wrong (locate guarantee is yes, for the case of RAID5/optimal). However, if only one checksum is in error, it cannot be determined whether the parity appendix is wrong or the data chunk is wrong (locate is no for this case, and hence, locate guarantee is maybe for the method). For the case in which locate is yes, we can treat it as a chunk read failure and use RAID5 to reconstruct the chunk, given that it is in the optimal state. However, for the case in which locate is no, we cannot correct the problem.
In the last four rows of Table 3, we indicate the disk I/O overhead associated with each method while performing a short write (write to only one strip), validated read (described above), full-stripe write (write to an entire stripe), and promoted-full-stripe write (write to less than an entire stripe that is promoted to a full-stripe write by reading from the disks the unwritten portion of the stripe). A value +N under a method indicates that N additional disk I/Os are required to implement that validation scheme.
The parity appendix and the data parity appendix schemes have the lowest I/O costs on update, since they only access disks where writes are already occurring. On the other hand, the two buddy methods must also update additional disks and so increase the overall impact on I/O. Nevertheless, the first two methods suffer from a major drawback: If the parity strips are lost, then no validation or checking can occur for an indefinite period. That is, even after the strips have been recovered (rebuilt), they will be populated with metadata that was not validated and so could in fact be wrong. After the data in the chunk has been completely overwritten by the host, the metadata will be in a state in which a validation can legitimately be applied. The two buddy methods solve this problem by maintaining yet another copy of the metadata in a buddy, so that at least one copy survives after tolerating maximum allowed failures of the RAID array.
In short, this class of methods provides a tunable family of solutions that has simple cost, performance, and effectiveness tradeoffs. If the impact of a UDE in an application is small but performance is important, the parity and data parity appendix methods may be sufficient, since they have the lowest I/O cost. If an application cannot tolerate UDEs, such as applications for the financial sector or high-cost scientific experiments that require a multi-day run on a supercomputer, then perhaps the buddy methods are more applicable since they provide more abilities to detect and locate, even though they have relatively high I/O overhead.
| |
|
Storage systems typically let the underlying disk drives provide the required data integrity. Given the ever increasing data-storage requirements for high-density disks, it may now be appropriate for storage systems to address the challenges of UDEs. In this paper, we discussed the causes of both read and write UDEs. We showed how those errors, because they are normally undetected, can complicate and confuse a RAID subsystem—putting stripes in states where incorrect conclusions on the integrity of the data are possible and even likely. We described some technologies that can be used to address the issue at various layers in the storage I/O stack and showed that the RAID layer of the storage system is a good place to address UDEs. Finally, we presented a family of four techniques, three of which are new, that may be integrated into the RAID layer and that provide solutions to the problems of UDEs, with tradeoffs on efficacy and performance costs.
Work on determining the causes and failure rates of data corruption is ongoing. Recently, Bairavasundaram et al. [24] have completed a large-scale study on data corruption in disk drives and reported significant rates of data corruption, especially in circumstances involving nonenterprise disk drives. Their work reinforces the need for effective technologies to detect and correct UDEs.
| |
|
Received September 14, 2007; accepted for publication October 8, 2007; Published online June 24, 2008.
|
|