DOI: 10.1147/rd.523.0307
System RAS implications of DRAM soft errors
by T. J. Dell
The invention of the one-transistor DRAM by IBM researcher Robert Dennard [1] has undoubtedly been a major contribution to the success of the modern computer, but it has also raised a new class of reliability concerns. Early computer reliability engineers found that in addition to the standard defect-driven permanent (or hard) errors, the DRAM device seemed occasionally susceptible to some kind of random events that were originally attributed to noise. However, as the physics of the effects of radiation in semiconductor structures began to be understood, a new class of errors in DRAM devices was identified. These single-event upsets (SEUs), or soft-error rate events (usually shortened to just SER), have to be dealt with appropriately to obtain the desired level of system reliability, availability, and serviceability (RAS).
The concern for DRAM SER has waxed and waned over the last 30 years. As SEU phenomena become better understood, new mitigation techniques are developed to alleviate the SER impact. Then, as more advanced and cost-effective DRAM design and manufacturing processes are employed, new variations of SER sensitivity emerge, and the cycle of identifying problems and implementing mitigation techniques continues. This cycle has abated somewhat for the time being in the DRAM world, but soft-error effects in SRAM, latches, and logic circuits appear to be approaching a critical juncture in the design of large ASICs (application-specific integrated circuits) and modern microprocessors. Nonetheless, new developments in DRAM and system architectures may again bring the DRAM SER issue to the forefront, as discussed in the section “New concern: Alignment of DRAM soft and hard errors,” later in this paper.
Perhaps the first published clarion call of alarm regarding DRAM SER was sounded in 1979 by May and Woods [2]. The authors demonstrated that the charge of a DRAM cell storage capacitor could be upset by alpha-particles that were generated by the radioactive decay of trace contaminants that are found naturally in many substances used in the manufacture of DRAM devices and packages. The authors also noted several mitigation strategies that are still in use today. The first is to reduce the alpha-particle flux in the base materials used. The second is to provide some kind of system- or subsystem-level fault tolerance, which usually takes the form of an error-correcting code (ECC) for DRAM-based subsystems. The third is to design the cell of the DRAM device in a way such that its sensitivity to SEUs is reduced.
The next major event in understanding DRAM SER was the seminal paper by Ziegler and Lanford [3], published later in 1979, which implicated cosmic rays as a major source of DRAM SER. While others had suggested cosmic ray effects in semiconductors in space [4], and May and Woods had even postulated their effect at sea level [2], Ziegler and Lanford gave a complete and convincing treatment of what was to become known as the dominant DRAM SER cause for many years to come.
The fact that now two distinct sources of DRAM SER were identified led to much confusion in the approach used by DRAM suppliers to quantify their SERs. It was possible for an experiment to be done with an alpha-particle source brought into close proximity to a DRAM die such that an accelerated test of the susceptibility of that device to alpha-particles was thought to have been conducted. Then, with the acceleration factors taken into account, a very low level of DRAM SER was predicted. However, what could have actually been happening was that the alpha-particles produced by the source were not of a high enough energy to actually flip DRAM cells, and thus what was actually being measured was the background cosmic ray component. The relative contribution of alpha-particle SER and cosmic ray SER was described clearly in Reference [5], in which the cosmic ray component was shown to be more than 10× greater than the alpha-particle component.
Note that there are significant differences between the two classes of DRAM upsets. The alpha-particle-induced effect is generally related to a much lower energy event. Alpha-particles are stopped by thin chip coatings or even a piece of paper, and therefore, the particles of interest are entirely generated from within the device or its immediate packaging [6]. However, the cosmic ray-induced effect is generally related to energy events that are orders of magnitude greater than an alpha-particle event. It takes a full 50 m of concrete to stop cosmic rays [7], and their source is quite literally out of this world [8]. By reviewing Ziegler's illustration of the potential results of a cosmic ray interaction in the atmosphere (Figure 1), we can clearly see that all of the complex results of a cosmic ray shower must be considered when putting system mitigation techniques in place. Thus, the RAS implications at the system level for each failure mechanism must be understood in terms of the system mitigation techniques employed.
Figure 1
One final point in the DRAM SER debate was whether the magnitude claimed by some people based on accelerated proton beam testing was indicative of the actual field experience of a DRAM device. This point was addressed conclusively in 1994 by O'Gorman [9], who showed that a sample of parts run 200 m underground, where no cosmic ray radiation should exist, had a baseline SER attributable to a small alpha-particle component. However, when the parts were run at 30 m above sea level, 1.6 km above sea level, and 3.1 km above sea level, an increasing cosmic component was predictably seen. Thus, while an a priori prediction of cell design SER performance remains challenging, accelerated proton beam testing of any given actual hardware is considered a fairly well understood science.
DRAM SER can be mitigated by cell design, material purity, and system fault tolerance. These three techniques can be effective in reducing alpha-particle-induced SERs. One of the key parameters of DRAM cell design is its Qcrit, a parameter that measures the amount of charge required to flip the logical bit stored at that cell. Qcrit is related directly to cell capacitance and cell voltage. For alpha-particle hardness, a cell can be designed with a Qcrit greater than 2.5 × 10^6 electrons [2, 6], which would basically eliminate any alpha-particle effect. However, because DRAM devices continue to shrink in cell geometries and support lower and lower voltages, preserving such a large Qcrit has become extremely difficult.
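To make the relationship between cell capacitance, cell voltage, and the alpha-hardness threshold concrete, the following Python sketch estimates the stored charge in electrons. It is only a first-order illustration: it assumes the stored charge is simply C × V, which ignores sense margins and charge-collection dynamics, and the two operating points are hypothetical values chosen to show the scaling trend, not data from the references.

```python
# First-order estimate of the charge stored on a DRAM cell, in electrons.
# Assumption: stored charge ~ C * V; a real Qcrit also depends on sense margin
# and charge-collection dynamics, which this sketch ignores.

ELECTRON_CHARGE_C = 1.602176634e-19  # coulombs per electron
ALPHA_HARD_ELECTRONS = 2.5e6         # threshold quoted in the text [2, 6]


def stored_electrons(capacitance_f, voltage_v):
    """Return the approximate number of electrons held at C farads and V volts."""
    return capacitance_f * voltage_v / ELECTRON_CHARGE_C


def is_alpha_hard(capacitance_f, voltage_v):
    """True if the first-order stored charge meets the 2.5e6-electron threshold."""
    return stored_electrons(capacitance_f, voltage_v) >= ALPHA_HARD_ELECTRONS


if __name__ == "__main__":
    # Hypothetical operating points, chosen only to show the trend.
    for label, c, v in [("older cell", 100e-15, 5.0), ("scaled cell", 30e-15, 1.5)]:
        n = stored_electrons(c, v)
        print(f"{label}: {n:.3g} electrons, alpha-hard={is_alpha_hard(c, v)}")
```

Under these assumptions, lowering both capacitance and voltage reduces the stored charge multiplicatively, which is why preserving a large Qcrit becomes harder with each generation.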
There are other DRAM cell parameters that are more difficult to design, model, and measure, such as the total charge collection region of the cell and the rate at which charge is collected. From the DRAM perspective, it turns out that in spite of major reductions in cell capacitance and in voltage, both the art and the science of good design have come together such that—over a 16-year period from the 256-Kb device to the 256-Mb device—the per-bit SER has dropped more than six orders of magnitude [10]. In fact, the effective plateau of DRAM device SER over the last decade has led some to declare that at least the alpha-particle-induced DRAM SER problem has been eliminated [11]. However, there is still a lot of SER-tolerant circuit design and modeling work to be done. For a longer treatment of modeling from a decade ago as well as an extensive list of references, see Reference [12]. For a contemporary view, see References [13–15].
Another aspect of the cell-design mitigation strategy has to do with the actual design of the cell storage node. Efforts to maximize cell capacitance in a minimal area have led to three-dimensional DRAM structures. Although this very fundamental design parameter is not usually chosen for SER reasons, it can have an extremely significant effect, albeit probably totally unintended. For example, IBM was a pioneer of the trench-cell DRAM capacitor, which is still incorporated in the designs of several manufacturers. There are two ways to design a trench capacitor: One involves storing the charge on the outside of the trench, where it is easily diverted into the device substrate, and one involves storing the charge on the inside of the trench, which is much more resistant to charge loss. IBM chose the inside storage design, and it proved to be vastly superior in terms of SER performance by a factor of more than 1,500× [16].
The second area of mitigation is materials purity. This area is effective only in mitigating alpha-particle-induced SERs and is completely ineffective in providing any assistance to the cosmic ray SER problem. Broadly speaking, the two classes of materials under consideration are those used in the fabrication of the device and those used in the packaging of the device. Materials as diverse as the phosphoric acid used in wafer processing and the lead–tin solder balls used in packaging interconnects have been implicated as sources of alpha-particles [17]. More recently, the detection of thermal neutron-induced alpha-particles in certain boron isotopes has shown that diligence in materials purity must be maintained [18].
One of the main difficulties of mitigating the materials impurity cause of SER is that the trace contaminants that have a great impact on SER may not show up as affecting any other device parameter. All device parameters and inline test data may show excellent device performance during manufacturing, but, for example, the simple fluke of acid bottles being improperly cleaned can lead to extremely bad field results as a result of an increase in alpha-particle SER [19]. Thus, the maintenance of materials purity is of utmost importance if SER is to be controlled for each and every device lot that is built.
The third area of mitigation involves some kind of system-level fault tolerance. Although such schemes can include relatively expensive options, such as full memory or system mirroring, or even triple redundancy of entire systems or subsystems, the typical means of providing memory fault tolerance is the use of ECC. In general, the higher the level at which the ECC is implemented, the more efficient it is. Thus, while chip-level schemes have been proposed [20] and put into production [21], we shall concentrate on the more efficient, higher-level system ECC schemes. A note on system ECC schemes: While a simple parity check of DRAM data can indeed be used with the complement–recomplement [22] scheme to actually correct a soft error, its inability to handle anything more than a single-bit error precludes it from consideration in all but the lowest-end systems.
At first, a very simple single-error-correcting and double-error-detecting (SEC/DED) ECC would seem completely sufficient for solving the DRAM SER problem. Because most upsets affect only a single cell, there can be at most one bad bit in any ECC word, no matter how the logical ECC word is taken physically out of the DRAM devices. Thus, for a common 8-byte memory interface, the 64 data bits are used in conjunction with 8 check bits to form a 72-bit ECC word that provides SEC/DED ECC capability. If any given ECC word contains a single error (whether hard or soft), the code will handily correct it, and no system-level error will result. Given the nature of soft errors, once the upset cell has been rewritten by either a background scrubbing engine [23] or simply by another write to that address, the cell is then restored to a correct value. This approach clearly handles the simple single-cell SEU very well. Unfortunately, not all SEUs are single-cell events [24].
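The 72-bit SEC/DED word described above can be sketched in Python as an extended Hamming code: 64 data bits, 7 Hamming check bits, and 1 overall parity bit. The bit layout (check bits at power-of-two positions, overall parity at position 0) and the function names are assumptions made for illustration; memory controllers often use other SEC/DED constructions, such as Hsiao codes, so this is not a description of any particular product.

```python
# Extended Hamming (72,64) sketch: position 0 holds overall parity, positions
# 1..71 hold a Hamming code with check bits at the power-of-two positions.
# The layout is an illustrative assumption, not a specific controller's code.

PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 positions


def encode(data):
    """Encode a 64-bit integer into a 72-element list of 0/1 bits."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Make the number of ones among positions containing bit p even.
        word[p] = sum(word[k] for k in range(1, 72) if k & p) % 2
    word[0] = sum(word[1:]) % 2  # overall parity over the other 71 bits
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for k in range(1, 72):
        if word[k]:
            syndrome ^= k
    overall_odd = sum(word) % 2 == 1
    fixed = list(word)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        fixed[syndrome] ^= 1  # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity with nonzero syndrome: double error
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status


if __name__ == "__main__":
    value = 0xDEADBEEFCAFEF00D
    cw = encode(value)
    one = list(cw); one[17] ^= 1
    two = list(one); two[40] ^= 1
    print(decode(cw)[1], decode(one)[1], decode(two)[1])
```

A single flipped bit is located by the syndrome and repaired, while two flipped bits leave the overall parity even with a nonzero syndrome, which is the detected-but-uncorrectable case.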
The next step up from SEC/DED codes is the class of symbol-correcting codes in which, instead of a single bit being correctable, a single set of bits, called a symbol, is correctable. Thus, the nomenclature of single-symbol-correcting and double-symbol-detecting (SSC/DSD) codes is used. If the symbol is aligned with the DRAM such that all of the bits from a DRAM contribute to only a single symbol, then the entire DRAM can be missing but the SSC/DSD ECC will correct the bad DRAM symbol, and again the system will not suffer an error [25]. Figure 2(a) shows how DRAM data is grouped into symbols at the system level such that an SSC/DSD code will correct the data for an entire DRAM.
Figure 2
An alternative to this approach is to route each data bit of a DRAM into a separate SEC/DED ECC word such that the same effect is realized [26]. Unfortunately, while this approach is feasible for ×4 DRAM devices (devices whose entire memory contents are accessed through four data bits), it is often not acceptable for ×8 or ×16 devices (8 and 16 data bits, respectively) because of the large amount of total data per access necessitated by having so many ECC words read in parallel. (DRAM data bits are often referred to as DQs, from the standard latch designation of D for the input and Q for the true output; hence, a DRAM DQ is a bidirectional data bit.) The interleaved approach to providing IBM Chipkill* memory protection with SEC/DED coding is shown in Figure 2(b). An excellent summary of ECCs for memory subsystems in general is found in Reference [27].
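The bit-routing idea can be illustrated with a small Python sketch. It assumes 72 ×4 devices feeding four 72-bit ECC words in parallel, with DQ q of every device routed to word q; the device count and the helper names are assumptions for illustration and are not taken from Figure 2(b). The sketch only counts bad bits per word and does not implement the ECC itself.

```python
# Bit-interleaving sketch: each DQ of a x4 DRAM lands in a different ECC word,
# so a whole-device failure shows up as at most one bad bit per word.

N_DEVICES = 72      # assumption: one device per bit of a 72-bit ECC word
DQ_PER_DEVICE = 4   # x4 DRAM


def interleaved_words(device_bits):
    """device_bits[d][q] is DQ q of device d; word q takes DQ q from every device."""
    return [[device_bits[d][q] for d in range(N_DEVICES)]
            for q in range(DQ_PER_DEVICE)]


def kill_device(device_bits, dead):
    """Return a copy in which every DQ of device `dead` is inverted (worst case)."""
    return [[b ^ 1 for b in dqs] if d == dead else list(dqs)
            for d, dqs in enumerate(device_bits)]


def bad_bits_per_word(good, observed):
    """Count mismatching bits in each interleaved ECC word."""
    return [sum(g != o for g, o in zip(gw, ow))
            for gw, ow in zip(interleaved_words(good), interleaved_words(observed))]


if __name__ == "__main__":
    good = [[(d + q) % 2 for q in range(DQ_PER_DEVICE)] for d in range(N_DEVICES)]
    print(bad_bits_per_word(good, kill_device(good, dead=5)))  # one bad bit per word
```

Because every word sees only a single bad bit, a SEC/DED code on each word can correct the loss of the whole device. The cost is that four full ECC words must be fetched per access, which is the scaling problem noted for ×8 and ×16 devices.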
This approach of providing Chipkill memory protection based on the ability of ECC to correct bad data from an entire DRAM has become popular in higher-end server systems because a certain fraction of DRAM hard errors will affect multiple DQs over a large address space of the DRAM and would, therefore, overwhelm the standard SEC/DED ECC [28]. In order to maintain the highest standards of RAS characteristics, such Chipkill memory effects must be adequately handled at the system level. Therefore, while DRAM SER was certainly not the primary reason to implement Chipkill memory ECC at the system level, it clearly provides a valuable benefit in correcting any multibit SER that may occur [29].
Since alpha-particle-induced DRAM SER is well understood and well controlled, and since cosmic ray-induced DRAM SER is well understood and effectively mitigated by either a simple or advanced memory ECC, there should be no concern for system RAS effects of DRAM SER. However, DRAM soft errors can occur in tandem with DRAM hard errors, and the combination of the two is of concern.
New concern: Alignment of DRAM soft and hard errors
If DRAM SERs stay at roughly their current levels, and if system RAS requirements remain about the same, then one could confidently assert that DRAM SER would indeed not be a very pressing system issue. However, while the former premise is unlikely but at least debatable, the latter premise is, almost by definition, certainly not true. System RAS requirements are being driven to higher levels with an intense sense of urgency. Whereas in the past customers were content to have no unscheduled system outages, customers today do not want their systems to be taken away even for short scheduled-maintenance outages and they want absolutely no downtime whatsoever in their mission-critical, 24-hours-a-day, seven-days-a-week business environments. Thus, the question of DRAM SER must be revisited in light of potential DRAM SER increases and demands for better system RAS. In this light, while technically not a new failure mechanism, soft–hard alignment deserves serious consideration.
The reason DRAM SER is still a serious system RAS concern is that higher system RAS requirements now bring to light failure modes that were considered secondary in previous generations. A prime example of this is the alignment of a hard and soft DRAM error. The scenario is the following. A DRAM component fails with a hard error, but the system continues to operate flawlessly as a result of memory subsystem fault tolerance, primarily through the use of one of the aforementioned ECC schemes. At some point in the future, the failing component will likely be replaced, but that does not happen immediately. Now, if the hard error affects only a single DRAM bit, or perhaps a small cluster of bits along a single DRAM bitline, the chances of alignment with another error of any kind are exceedingly small, given the hundreds of millions of possible DRAM addresses at which the small cluster of fails occurs. However, a small fraction of those hard fails will affect the entire DRAM chip, or at least a significant portion of its address space such that now the probability of alignment with any other kind of error is much greater than zero, even if the second error affects only a single DRAM cell. Because the typical DRAM SER can easily be two orders of magnitude greater than the typical DRAM hard-error rate, it is clear that the hard–soft alignment scenario must be considered.
In order to understand the magnitude of this phenomenon, an illustration using industry-published failure rates is instructive. If a memory subsystem consists of two industry-standard, ECC-oriented DIMMs (dual inline memory modules) of ×4 DRAMs, then it has 36 devices that contribute to each ECC word. Assuming a Chipkill memory-correcting ECC [30], this system would be able to correct the failure of any one complete DRAM. However, once a hard Chipkill memory error occurs, the system is susceptible to a subsequent soft error anywhere within the 36 devices (with the exception that if the soft error occurred in the same device that had experienced the hard error, it would be completely irrelevant). Thus, for a typical DRAM device operating at a hard Chipkill memory failure rate of 13.7 FITs [31] and a typical SER of 1,000 FITs [32] (see also [10, 11, 16]), a per-memory-rank failure rate can be projected as a function of time before repair. To convert that to the system level, the total number of ranks must be included. For this example, a 32-GB memory subsystem built of 1-Gb DRAMs is used (Figure 3). The results of analyzing this system for the above assumptions are summarized in Table 1.
Figure 3
Table 1 Failure rate adder for a 32-GB memory subsystem as a result of hard–soft DRAM error alignments as a function of time to repair.
| Time to repair (months) | Memory failure rate adder (FITs) |
| 1 | 102 |
| 6 | 608 |
| 12 | 1,207 |
To put the results of Table 1 into perspective, note that if only hard errors are taken into account, a Chipkill memory ECC would completely eliminate the DRAM component of the memory failure rate. This would leave only 5–10 FITs per DIMM because of the on-DIMM discrete components, and the resultant system failure rate would be 160–320 FITs. By contrast, it can be seen that for even the one-month time-to-repair scenario, a 30% to 60% increase is caused by consideration of the alignment of a hard and soft error. Clearly this is an adder that must be taken into account if proper system-level RAS calculations are to be performed.
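The magnitude of the Table 1 entries can be checked with a rough first-order estimate, sketched below in Python. The model is an assumption made here and not necessarily the one used to generate the table: the rank's hard Chipkill rate is multiplied by the probability, taken as rate × time, that one of the other 35 devices in the rank suffers a soft error before repair. The figure of 730 hours per month and the count of 8 ranks (32 GB divided by 32 data devices of 1 Gb each per rank) are likewise assumptions.

```python
# First-order estimate of the hard-soft alignment adder (assumed model).
# 1 FIT = 1 failure per 1e9 device-hours.

FIT = 1e-9
HARD_CHIPKILL_FITS = 13.7     # per device [31]
SOFT_ERROR_FITS = 1000.0      # per device [32]
DEVICES_PER_RANK = 36         # two x4 ECC DIMMs
DATA_DEVICES_PER_RANK = 32    # assumption: 128 data bits + 16 check bits
DEVICE_GBIT = 1
SUBSYSTEM_GBYTE = 32
HOURS_PER_MONTH = 730         # assumption: 8,760 hours / 12


def rank_count():
    """Ranks needed for the data capacity (check-bit devices excluded)."""
    return SUBSYSTEM_GBYTE * 8 // (DATA_DEVICES_PER_RANK * DEVICE_GBIT)


def alignment_adder_fits(months_to_repair):
    """Hard-failure rate of a rank times P(soft error in another device before repair)."""
    hours = months_to_repair * HOURS_PER_MONTH
    hard_rate_fits = DEVICES_PER_RANK * HARD_CHIPKILL_FITS
    p_soft = (DEVICES_PER_RANK - 1) * SOFT_ERROR_FITS * FIT * hours
    return rank_count() * hard_rate_fits * p_soft


def relative_increase(adder_fits, baseline_fits):
    """Adder expressed as a fraction of the hard-error-only baseline."""
    return adder_fits / baseline_fits


if __name__ == "__main__":
    for months in (1, 6, 12):
        print(months, round(alignment_adder_fits(months), 1))
    # Table 1 one-month adder against the 160-320 FIT baseline from the text.
    print(relative_increase(102, 320), relative_increase(102, 160))
```

Under these assumptions the estimate comes to roughly 101, 605, and 1,210 FITs for 1, 6, and 12 months, within about 2% of the published 102, 608, and 1,207 FITs. Dividing the published one-month adder by the 160–320 FIT baseline gives about 32% to 64%, in line with the 30% to 60% range quoted above.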
In addition, if multibit soft errors start to represent an increased percentage of SEU occurrences, then they would be more likely to align with less catastrophic hard failure mechanisms, such as wordline and bitline failures. The probability of such alignments is so small today that they can be ignored, but they may become a problem if current trends continue.
A final note on the effectiveness of system-level mitigation techniques is that some commonly used schemes that are generally effective for dealing with soft errors may not be effective at all in dealing with the soft–hard alignment. Each mitigation technique must be evaluated in light of each possible failure scenario in order to determine its effectiveness. For example, soft-error background scrubbing is the technique of having the hardware or software read through memory when other memory operations are not pending in order to determine whether the ECC can find any correctable errors. If it does, then the corrected bit is written back into the memory such that any soft errors are repaired. However, for the soft–hard alignment situation, the scrub may not get to the soft-error-containing address before an actual read does; thus, scrubbing may not provide an effective means of protecting against this kind of failure scenario even though it may be useful for other scenarios.
On the other hand, hardware chip sparing (when a new, unused DRAM is used to replace a failing DRAM when a Chipkill memory-class defect is discovered) will indeed be effective for soft–hard alignments [23, 33, 34]. Thus, a complete RAS model is most helpful in evaluating all of the various failure mechanisms against all of the system RAS mitigation strategies, including the ability to model the combined effects of multiple schemes, such as Chipkill memory ECC working in tandem with hardware chip sparing.
It has been shown that even though much of the concern for SER is properly focused on SRAMs and logic structures, there is still the danger that DRAM soft errors will join forces with DRAM hard errors to cause an unacceptable degradation of system RAS characteristics. In particular, Chipkill memory hard errors have been identified as being able to align with random, single-cell soft errors to cause a 30% to 60% increase in memory subsystem failure rates. Good system RAS design requires that these failure mechanisms be evaluated in light of both the system architecture and the latest industry DRAM hard and soft failure rate data. There is a growing sense that even with DRAM cell capacitances remaining stubbornly entrenched above a critical value for alpha-particle sensitivity, eventually device and voltage scaling will prevail, and a new round of SER mitigation will be required because of the lower Qcrit of an alpha-particle-sensitive DRAM cell (see footnote 1).
Therefore, for the current phenomenon of soft–hard DRAM fail alignment, and for the potential occurrence of a revitalized DRAM sensitivity to alpha-particles, the system RAS model must be expanded to include these emerging effects in order for the system design point to be robust enough to prevent customer-impact events resulting from DRAM soft errors. Schemes such as Chipkill memory ECC coupled with hardware DRAM sparing can then be employed to ensure that DRAM SER does not impact the customer's RAS experience.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
Footnote 1: Private correspondence with Dr. Robert H. Dennard, July 2007.
Received July 17, 2007; accepted for publication August 29, 2007; Published online March 14, 2008.