DOI: 10.1147/rd.523.0255
Circuit design and modeling for soft errors
by A. KleinOsowski, E. H. Cannon, P. Oldiges, and L. Wissel
Over the past 30 years, transient errors due to ionizing radiation, or soft errors, have become an increasing concern in semiconductor products. As the semiconductor industry progresses into the sub-100-nm lithography generation, soft errors are becoming a primary design concern. However, soft errors have not prevented continued progress and advancements in the semiconductor industry. Although there are many design solutions, the challenge to research scientists is to make innovative low-cost design modifications and oversee the adoption of these techniques in semiconductor products.
The overarching new phenomenon that occurred at about the 90-nm lithography generation was the advent of soft errors in flip-flops. These clocked logic elements store configuration information and hold the results of cycle-by-cycle computations. Unlike memory arrays, flip-flops are often used individually or in small groups. This usage scenario does not easily lend itself to error checking and correction codes (ECCs). Instead, flip-flops are typically hardened using more capacitance, additional current drive, or transistor-level redundancy. In extreme cases, flip-flops are triplicated and the state is determined by voting among these flip-flops.
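The voting step for triplicated flip-flops can be illustrated with a short sketch. The function name and the example values below are illustrative assumptions, not taken from any product design; the sketch only shows that a single upset copy is outvoted by the two intact copies.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies.

    Each output bit takes the value held by at least two of the inputs.
    """
    return (a & b) | (a & c) | (b & c)


# Three copies of one stored bit; assume a particle strike flips the second copy.
stored = 1
copies = [stored, stored ^ 1, stored]
voted = majority_vote(*copies)
print(voted)  # 1: the upset copy is outvoted
```

The same expression works on multi-bit words, since the vote is taken independently per bit position. Two simultaneous upsets in the same bit position would defeat the vote.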
Other circuit structures may also be vulnerable to soft errors, depending on the chip design and operating environment. Clock nets often have sufficient capacitance and enough restoring drive to suppress radiation-induced single-event transients (SETs). The leaf cells of the clock tree, though, may have weak transistors and be lightly loaded. It is possible for a radiation event to create a spurious pulse on a leaf-cell clock net. This false clock pulse turns on the flip-flops driven by that clock net. If the data input of the flip-flops is logically opposite the stored value of the flip-flop, the stored flip-flop data becomes corrupt.
For high-frequency designs, the duration of a transient pulse on a signal net may be nearly as long as a full clock period. For these designs, errors on combinational logic paths may be captured in a flip-flop, causing a single-bit upset.
Most circuit-level mitigation techniques have power, area, and timing costs that are fairly high. With this in mind, accurate and early modeling as well as fail rate predictions are essential. Designers must know as early as possible whether, and to what degree, they need to modify their designs to be robust against soft errors. This paper describes the modeling techniques that are used to predict the soft-error rate (SER) of circuits, as well as chip design options that can be used to address soft errors.
The role of FIELDAY
The IBM proprietary semiconductor device simulation tool FIELDAY [1, 2] is used extensively to understand charge collection physics and to obtain a quantitative value of the critical charge (Qcrit) required to logically upset SRAM (static RAM) bit cells and flip-flops. This physical effects modeling is critical early in the technology bring-up, prior to circuit hardware availability. Pre-silicon modeling provides early insight into circuit sensitivities.
Prototype transistors in a given technology are modeled using process simulation tools that have been calibrated to previous-generation hardware. The process models are calibrated against secondary ion mass spectroscopy (SIMS) data as well as electrostatic transistor data such as Vt roll off, off-current, and subthreshold slope.
It is important to capture the most detail possible about the transistor structure and doping profiles so that subsequent transistor simulations will be based on realistic scenarios. Figure 1 shows the detail that we capture for an n-FET in IBM 65-nm silicon-on-insulator (SOI) technology. The structure of silicided regions must also be captured so that current flow is modeled accurately. The metal layers must also be defined so that outer fringe capacitance is simulated correctly. Nitride films are also used to create tensile or compressive strain in the channel of the MOS (metal-oxide semiconductor) transistors, which affects drive current and SRAM cell or flip-flop stability.
Figure 1
Physical models used in FIELDAY are calibrated to available transistor data and theory. Band-to-band tunneling models as well as generation and recombination models are calibrated against hardware data obtained from forward and reverse I–V characteristics of body-contacted SOI transistors. The body contact allows monitoring of the I–V characteristics of the floating body. These measurements are then adjusted to factor out the parasitic effect of the body contact (since most logic transistors do not use a body contact). Parameters used in the model for inversion layer quantization are calibrated against gate C–V data. Low and high field mobility parameters are calibrated from transistor transport data such as linear current, transconductance, and on-current. The alpha-particle photogeneration model has been calibrated using theoretical understanding from simulations using the IBM DAMOCLES* program [3].
A typical use of FIELDAY for soft-error radiation event modeling is to perform a two-carrier transient analysis of an event in a single transistor. Energy, strike location, and incidence tilt and rotation angles can be defined. The time evolution of the generated electron–hole pairs and their effect on the internal potential are calculated. Internal potentials and terminal currents are then analyzed. A more powerful capability of FIELDAY is to use its mixed-mode feature. Small circuits of numerical transistors are linked, and FIELDAY then solves the circuit equations and the transistor equations. This mixed-mode capability allows us to determine the critical charge of SRAM cells or flip-flop circuits of proposed transistors and technologies before hardware or compact models are available.
FIELDAY first calculates the dc operating point for the cell. In a typical SOI simulation, an alpha-particle with a given incidence energy strikes one of the off-state transistors at normal incidence directly through the center of the gate. A parameter that varies the number of electron–hole pairs generated is adjusted until the cell just switches state. The minimum amount of charge necessary to cause the cell to flip state is the critical charge, Qcrit. Figure 2(a) shows a six-transistor 65-nm SOI SRAM cell, while Figure 2(b) shows the characteristics of the response of that cell to an alpha-particle strike in the off-state pull-down n-FET (N1). Injection of 0.236 fC (femtocoulombs) of charge in the SOI body and extension regions causes the SRAM cell to flip state.
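To give a feel for the scale of the 0.236-fC value quoted above, the charge can be converted into a number of carriers. The sketch below uses the exact SI elementary charge. The figure of roughly 3.6 eV per electron–hole pair in silicon is a commonly quoted approximation that we assume here; it does not come from the FIELDAY simulations.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per carrier (exact SI value)
EV_PER_PAIR_SI = 3.6  # assumed approximate energy per electron-hole pair in silicon


def pairs_from_charge(q_fc: float) -> float:
    """Number of elementary charges contained in q_fc femtocoulombs."""
    return q_fc * 1e-15 / ELEMENTARY_CHARGE_C


q_crit_fc = 0.236  # value quoted in the text for the 65-nm SOI SRAM cell
pairs = pairs_from_charge(q_crit_fc)
energy_kev = pairs * EV_PER_PAIR_SI / 1e3  # energy needed to generate that many pairs
print(f"{pairs:.0f} pairs, about {energy_kev:.1f} keV")
```

Under these assumptions the critical charge corresponds to roughly 1,500 carriers, which is only a few keV of the energy carried by an alpha-particle.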
Figure 2
Through our modeling of soft errors in SOI and bulk transistors, we have found several important differences and similarities between the two technologies. In this paper, we discuss these differences for an event in an n-FET. The physics is the same for p-FET strikes, but references to electrons and holes should be reversed.
Both bulk and SOI transistors exhibit fast and slow charge collection mechanisms. In bulk technologies, funneling, or field-aided charge collection, occurs within the first few picoseconds or tens of picoseconds. The charge not collected by funneling is free to diffuse, and some of that charge is collected at the drain depletion region over the next tens or hundreds of picoseconds, causing a decrease in the drain node voltage. After a particle passes through the drain and creates electron–hole pairs in the p-well, electrons are collected in the drain and holes from the drain are transported into the p-well. The dominant mechanism for soft errors in bulk technologies is through direct charge collection by the drain. The collected charge may change the node voltage sufficiently to upset the stored data. In a secondary mechanism, holes in the p-well may charge up the well, causing a parasitic NPN bipolar (n+ drain/p-well/n+ source) to temporarily turn on. This secondary mechanism is sensitive to the resistance from the charge collection region to the p-well contact. The holes in the p‐well eventually are collected by a well contact. Furthermore, electrons not initially collected by funneling may diffuse into the drain region, showing up as additional drain current.
For SOI transistors, only charge generated in the thin active silicon layer contributes to soft errors. This means fewer electrons are collected. This small amount of collected charge has little effect in the transistor drain, but it can cause upsets if collected in the transistor body. As in bulk transistors, the generated electrons are collected quickly and holes are transported into the body. Most SOI transistors do not have a body contact however, so even a small number of holes charge up the body of the transistor enough to turn on the parasitic NPN transistor. The conducting transistor will cause the drain voltage to droop. The parasitic bipolar action stays turned on until the excess carriers in the body recombine and the body potential is lowered. Although the fact that a small number of holes can upset an SOI cell could lead to the conclusion that an SOI cell has a higher upset rate than a bulk cell, the opposite is true because of the relative charge collection areas. In SOI, the particle must strike the very small body region to cause significant drain current. In bulk, a particle strike anywhere in a larger region near the drain results in drain current. Because the parasitic bipolar transistor plays such a major role in the charge collection process, it is important to capture the physics of high charge injection in the simulation.
The dominance of the bipolar effect in SOI transistors leads to one advantage for upset modeling. The alpha-particle-induced electron–hole pairs that are generated in the transistor quickly separate, with the generated electrons appearing as drain current and the generated holes flowing into the body. Transistor widths for 90-nm technology and beyond are comparable to the width of the charge generation track, so it is not necessary to generate a three-dimensional (3D) structure when simulating soft errors in SOI circuits. The alpha-particle-induced holes spread out over the entire body shortly after being generated. Two-dimensional (2D) simulations and 3D simulations of transistors show little difference in the charge collection characteristics.
Bulk transistors in a common n-well or p-well with other transistors must be modeled in 3D since the alpha-particle-induced charge does not stay confined within the bulk device channel, as it does for SOI devices. The 3D simulations of charge collection in 130-nm bulk transistors show approximately 40% less charge collection than the same conditions in 2D simulations because generated charge is shared by adjacent transistors. As technology scales and the gate lengths and widths are reduced, more of the alpha-particle-induced charge is collected by adjacent devices.
Finally, because the parasitic bipolar gain in the SOI transistor plays such a major role in the upset susceptibility of a cell, the critical charge is a somewhat simpler concept. In order to determine an SER for a circuit, we must understand how much generated charge ends up in the body of the SOI transistor. Although bias conditions, event track structure, and location and angle of the strike affect the total amount of charge generated and transported into the body, it is possible to define an effective charge collection volume that can be used in SEMM-2 (see the section “The role of SEMM-2” and Reference [4]) for estimating circuit SER. The critical charge for circuits manufactured using bulk technology is a more fluid value and depends much more strongly on the location and angle of an event, the bias on the transistor, and the track structure.
In order to accurately predict the SER of semiconductor products, the circuits of greatest interest, such as the most sensitive circuits and SER-mitigated circuits, are “taped out” (fabricated) on test structures. Radiation testing of these test-site experiments is used to measure Qcrit of SOI circuits and to determine circuit sensitivity to alpha-particles and neutrons or protons. These test results are then used to calibrate the IBM modeling tools for soft errors. Extensive modeling is conducted for additional sensitive circuits that could not be fabricated in test-site experiments.
SRAM arrays are the de facto standard for stressing the performance and reliability of a fabrication process. These large arrays also work well for collecting extensive radiation test data. Arrays of shift-register chains are also fabricated in test-site experiments. These experimental flip-flop chains provide a framework for characterizing product circuit libraries and testing experimental circuit designs.
Since test structures are primarily used to represent circuits in an upcoming product, care is taken to use the same metallization as that used in product chips. Test chips are sliced and the metal stack is measured and characterized. Modeling tools provide insight into the alpha-particle attenuation through the metallization.
SRAM arrays are typically fabricated with 1-Mb to 32-Mb cell counts. Flip-flop chains are typically fabricated with 8-Kb to 30-Kb cell counts. This large quantity of bits provides enough fails to collect high-confidence statistics in radiation test experiments. The SRAM arrays have sufficient bits to provide high-confidence statistics in life test experiments.
The circuit response step of soft-error modeling centers on determining Qcrit, which is determined by circuit modeling conducted with a compact transistor model [5] embedded in a macro model and a circuit simulator such as HSPICE** or Spectre**.
Figure 3 illustrates the injected current pulse and its connection to an SOI n-FET. Each injected current pulse wave shape follows a sharp linear rise with an exponential decay, as introduced in [6, 7]:
[Equation (1)]
Equation (1) is used to numerically generate the wave shape in Figure 3. Pulse rise and fall times for each technology generation are determined from FIELDAY modeling. The double exponential current pulse in Equation (1) works well for modeling up to a few fC of injected charge. For large charge injection, such as in an aerospace environment, this method may cause the connected nodes to go above or below the voltage rail. Alternate wave shapes are required to properly capture the physics of large charge injection events [8].
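As an illustration of the kind of wave shape discussed here, the sketch below uses one widely used double-exponential form, I(t) = Q/(τf − τr) · (exp(−t/τf) − exp(−t/τr)), which integrates to the injected charge Q. This form is an assumption on our part and may differ from Equation (1); the function names and the 1-ps and 10-ps time constants are placeholders, not FIELDAY-derived values.

```python
import math


def double_exp_current(t_s: float, q_c: float, tau_rise_s: float, tau_fall_s: float) -> float:
    """Current (A) at time t_s for an assumed double-exponential pulse of total charge q_c."""
    if t_s < 0.0:
        return 0.0
    scale = q_c / (tau_fall_s - tau_rise_s)
    return scale * (math.exp(-t_s / tau_fall_s) - math.exp(-t_s / tau_rise_s))


def peak_time(tau_rise_s: float, tau_fall_s: float) -> float:
    """Time at which the assumed pulse reaches its maximum (from dI/dt = 0)."""
    return tau_rise_s * tau_fall_s / (tau_fall_s - tau_rise_s) * math.log(tau_fall_s / tau_rise_s)


def integrated_charge(q_c, tau_rise_s, tau_fall_s, t_end_s, steps=20000):
    """Trapezoidal integral of the pulse from 0 to t_end_s."""
    dt = t_end_s / steps
    total = 0.0
    prev = double_exp_current(0.0, q_c, tau_rise_s, tau_fall_s)
    for k in range(1, steps + 1):
        cur = double_exp_current(k * dt, q_c, tau_rise_s, tau_fall_s)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total


Q = 1e-15       # 1 fC of injected charge
TAU_R = 1e-12   # placeholder rise time constant
TAU_F = 10e-12  # placeholder fall time constant
charge = integrated_charge(Q, TAU_R, TAU_F, t_end_s=200e-12)
t_pk = peak_time(TAU_R, TAU_F)
i_pk = double_exp_current(t_pk, Q, TAU_R, TAU_F)
print(f"recovered charge ratio: {charge / Q:.4f}")
```

With the time constants held fixed, the peak current of this form scales linearly with Q, so sweeping the injected charge amounts to scaling the pulse amplitude.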
Figure 3
The Qcrit analysis is conducted by performing a binary search over a range of values of injected charge. Each simulation uses transient analysis. The first simulation injects the lower-bound charge amount and verifies that the circuit is not upset. The second simulation injects the upper-bound charge amount and verifies that the circuit is upset. The third and subsequent simulations bisect the range until the range falls within a predefined tolerance. Current pulse rise and fall times are held constant while the charge amount is varied by scaling the peak current. The minimum amount of charge that upsets the circuit is reported as the circuit Qcrit.
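The search procedure described above can be sketched as follows. The callback `is_upset` is a hypothetical stand-in for one transient circuit simulation at a given injected charge; the toy threshold simply reuses the 0.236-fC figure quoted earlier and is not the output of any simulator.

```python
from typing import Callable


def find_qcrit(is_upset: Callable[[float], bool],
               q_low_fc: float, q_high_fc: float,
               tol_fc: float = 1e-3) -> float:
    """Bisect on injected charge until the bracket is narrower than tol_fc."""
    # The first two simulations confirm that the bounds bracket the threshold.
    if is_upset(q_low_fc):
        raise ValueError("lower-bound charge already upsets the circuit")
    if not is_upset(q_high_fc):
        raise ValueError("upper-bound charge does not upset the circuit")
    while q_high_fc - q_low_fc > tol_fc:
        q_mid = 0.5 * (q_low_fc + q_high_fc)
        if is_upset(q_mid):
            q_high_fc = q_mid   # upset: threshold is at or below q_mid
        else:
            q_low_fc = q_mid    # no upset: threshold is above q_mid
    return q_high_fc            # smallest charge known to upset the circuit


def toy_simulation(q_fc: float) -> bool:
    """Hypothetical stand-in for a transient simulation of one strike."""
    return q_fc >= 0.236


qcrit = find_qcrit(toy_simulation, q_low_fc=0.0, q_high_fc=2.0)
print(f"Qcrit is about {qcrit:.3f} fC")
```

Returning the upper end of the final bracket means the reported value is always a charge that was actually observed to upset the circuit, within the chosen tolerance of the true threshold.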
The results are verified against simulations conducted using the drift-diffusion transistor simulator FIELDAY (see the section “The role of FIELDAY”), as well as against hardware measurements. Figure 4 compares simulated and measured Qcrit values for the slave stage of a 65-nm SOI flip-flop, similar to the flip-flop shown in Figure 5. Qcrit from circuit simulations has excellent agreement with the accelerated hardware measurements.
Figure 4
Figure 5
When modeling Qcrit in SOI technologies, charge must be injected directly into the n-FET body to correctly trigger the bipolar effect. Since digital logic n-FETs in SOI technologies do not have a body contact, a special n-FET compact model with an ideal body contact must be used.
In bulk technologies, digital logic n-FET bodies sit in a common well that is typically biased to GND (ground), or 0 volts. Therefore, bulk technology Qcrit simulations can be simplified by connecting the current pulse between the n-FET drain and GND.
Historically, soft errors have been a primary concern for SRAM arrays. Recent generations of microelectronic circuits have highlighted the need to monitor and characterize soft errors in flip-flops [10]. Transient current pulses in cones of combinational logic (see footnote 1) have become a problem in aerospace environments [11] and are on the horizon in terrestrial electronics.
Unlike SRAM cells, flip-flops are not symmetric. In addition, flip-flops have stacked transistors with internal nodes (node0 and node1 in Figure 5) and transmission gates (P22 and N22 in Figure 5) that can turn on due to an ionizing particle strike and cause the flip-flop data to be overwritten. These complications mean that each transistor in a flip-flop must be analyzed for its soft-error sensitivity. The cross-coupled transistors (P23 and N23, P26 and N26, in Figure 5) can propagate a radiation-induced transient to the true or complement nodes (data and data_b in Figure 5) that is latched and causes a logical upset. Local clock inverters (P21 and N21 in Figure 5) can develop transients that erroneously turn on the flip-flop.
In a bulk technology, the charge generated by the ionizing radiation particle that is collected at the drain of an on-state transistor (P25 or N25 in Figure 5) can upset the flip-flop. These on-state transistors are electrically connected to one or more off-state transistors and act as a collection area for generated charge. In contrast, in SOI technologies, very little charge generated by the ionizing radiation particle is collected by the drain of a transistor. The primary upset mechanism in SOI technologies occurs when the body of an off-state transistor charges up, causing the transistor to conduct current from its drain to source (assuming the drain is at the high-voltage rail potential and the source is at 0 volts). In an on-state SOI transistor, there is no bipolar current since the drain and source are at the same potential.
In a bulk technology, radiation events on the drain of transmission gate transistors (P22 or N22 in Figure 5) have an impact on the true node (data in Figure 5) and can flip the latch. In SOI technology, radiation events on transmission gate transistors can turn on a parasitic bipolar current in the struck device. However, the true node is affected only if the input signal (data_in in Figure 5) has the opposite logical value of the true node.
Since flip-flops have master and slave stages, they are, in essence, two memory cells. A 50% duty cycle complementary clocking scheme makes the master stage sensitive to soft errors half of the time and the slave stage sensitive to soft errors the other half of the time. Other clocking schemes may stress the master or slave stages asymmetrically. In most microelectronic designs, both the master and the slave stages of a flip-flop are sensitive at some time during the clock cycle and must be analyzed.
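One simple way to combine the two stages, assuming each stage contributes only while it is sensitive, is a time-weighted sum of the per-stage rates. This weighting is our own simplifying assumption, and the function name and rates below are placeholders rather than measured values.

```python
def flip_flop_ser(ser_master: float, ser_slave: float,
                  master_sensitive_fraction: float = 0.5) -> float:
    """Time-weighted soft-error rate of a master-slave flip-flop.

    master_sensitive_fraction is the fraction of the clock cycle during which
    the master stage holds the data; the slave holds it for the remainder.
    """
    if not 0.0 <= master_sensitive_fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    slave_fraction = 1.0 - master_sensitive_fraction
    return master_sensitive_fraction * ser_master + slave_fraction * ser_slave


# Placeholder per-stage rates in arbitrary units, 50% duty cycle.
combined = flip_flop_ser(ser_master=2.0, ser_slave=4.0)
print(combined)  # 3.0
```

An asymmetric clocking scheme is represented by moving the fraction away from 0.5, which shifts the weight toward the stage that holds the data for longer.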
Transient current pulses in cones of combinational logic primarily affect small transistors. Pulses will propagate only if the resulting voltage glitch exceeds the threshold voltage of the downstream combinational gates. These transient current pulses are also derated by don't-care conditions on combinational gates.
The role of SEMM-2
The IBM proprietary SER simulation tool SEMM-2 [4, 7, 12, 13] generates a random sample of radiation events and simulates charge collection for each event. FIELDAY modeling provides the physical effects mechanisms of events in a transistor under test, circuit modeling provides the Qcrit of each transistor in the circuit under test, and then SEMM-2 combines this data with the environment-under-test radiation flux to provide raw soft-error fail rates. Raw fail rates are reported in failures in time (FITs) or in fails per bit day. One FIT is equal to one failure per 10⁹ hours.
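The two reporting units are related by simple arithmetic, as the sketch below shows. The helper names and the example per-bit rate are illustrative assumptions; only the definition of one FIT as one failure per 10⁹ hours comes from the text.

```python
HOURS_PER_FIT_UNIT = 1e9  # one FIT is one failure per 1e9 hours
HOURS_PER_DAY = 24.0


def fails_per_bit_day_to_fit(rate_per_bit_day: float) -> float:
    """Convert a per-bit rate in fails per day to FIT per bit."""
    return rate_per_bit_day / HOURS_PER_DAY * HOURS_PER_FIT_UNIT


def fit_to_mtbf_hours(fit: float) -> float:
    """Mean time between failures, in hours, for a constant rate given in FIT."""
    if fit <= 0.0:
        raise ValueError("FIT rate must be positive")
    return HOURS_PER_FIT_UNIT / fit


# Hypothetical per-bit rate, scaled up to a 1-Mb array (2**20 bits).
per_bit_fit = fails_per_bit_day_to_fit(2.4e-11)
array_fit = per_bit_fit * 2**20
print(f"{per_bit_fit:.2e} FIT/bit, {array_fit:.0f} FIT per Mb")
print(f"MTBF at 1000 FIT: {fit_to_mtbf_hours(1000.0):.0f} hours")
```

Because FIT rates add across independent bits, the per-bit figure scales linearly with array size, which is why system-level rates can rise even as per-bit rates fall.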
SEMM-2 can simulate radiation sources used in accelerated testing (alpha-particle sources, proton beam, neutron beam, or ion beam) as well as radiation sources in products (alpha-particles from packaging materials and cosmic ray neutrons). Neutron and proton collisions are precharacterized by the NUSPA model [14, 15] and nuclear optical models [16, 17].
SEMM-2 performs Monte Carlo sampling of radiation events. Consider a thorium foil, which has several isotopes that emit alpha-particles at different energies. SEMM-2 samples the alpha-particle emission angle, emission energy, and emission location in the foil. For a proton beam experiment, Monte Carlo techniques are used to sample the incident location of the proton on the chip, as well as the position at which the proton collision occurs along the trajectory. The daughter particles are also sampled from the precharacterized NUSPA and optical model calculations.
For each radiation event, SEMM-2 traces the trajectory of all ionizing radiation particles. Charge packets are generated along these trajectories, and a fast drift-diffusion model calculates the charge collected at the various nodes in the user-defined circuit cell layout. The final result of the charge collection calculations is the charge collection probability distribution, that is, the probability of collecting at least a given amount of charge for a radiation event from the simulated source (e.g., alpha-particles from a thorium foil). The SER for an accelerated test condition or in a product environment is calculated using a measured or simulated Qcrit value with the charge collection probability distribution.
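The final step, combining Qcrit with the charge collection probability distribution, can be sketched as an exceedance-probability estimate over sampled events. This is a minimal illustration and not the SEMM-2 implementation: the function names, the event rate, and the exponential toy distribution of collected charge are all assumptions made for the example.

```python
import random
from typing import Sequence


def exceedance_probability(collected_fc: Sequence[float], qcrit_fc: float) -> float:
    """Fraction of sampled events that collect at least qcrit_fc of charge."""
    if not collected_fc:
        raise ValueError("need at least one sampled event")
    hits = sum(1 for q in collected_fc if q >= qcrit_fc)
    return hits / len(collected_fc)


def ser_fit(events_per_hour: float, collected_fc: Sequence[float], qcrit_fc: float) -> float:
    """Upset rate in FIT: event rate times the probability that an event upsets the cell."""
    return events_per_hour * exceedance_probability(collected_fc, qcrit_fc) * 1e9


# Toy stand-in for the per-event collected charge (mean 0.1 fC, made up).
rng = random.Random(0)
samples = [rng.expovariate(1.0 / 0.1) for _ in range(100_000)]

p_upset = exceedance_probability(samples, qcrit_fc=0.236)
rate = ser_fit(events_per_hour=1e-6, collected_fc=samples, qcrit_fc=0.236)
print(f"P(upset per event) = {p_upset:.4f}, SER = {rate:.1f} FIT")
```

Raising Qcrit can only lower the exceedance probability, which is why hardening techniques that increase the critical charge reduce the fail rate.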
Figure 6 presents simulated and measured fail rates for thorium foil and proton beam accelerated tests on 90-nm bulk SRAMs. The agreement is quite good. SER simulations are used to estimate constituent SER components for packaged chips (i.e., errors due to alpha-particles from wafer impurities, alpha-particles from solder balls, and alpha-particles from underfill), accounting for the metallization of the product and the alpha-particle flux and energy distribution of a packaged chip.
Figure 6
Soft-error mitigation can be done at several levels, from the underlying process up to the integrated system. Mitigation at each level has its own challenges and benefits. In this section, we review several common soft-error mitigation options.
At the process level, careful selection and screening of materials can reduce the radiation flux at the silicon surface. Removal of ¹⁰B from the process eliminates the thermal neutron-¹⁰B soft-error component [18]. Low-alpha-particle packaging materials, such as low-alpha-particle solder, reduce the alpha-particle soft-error component. As new materials are incorporated into the process, care must be taken to avoid introducing alpha-particle sources on the chip.
Technology choices can have an impact on the SER of a product. SOI technologies have proven to be two to six times less sensitive to single-event upsets than bulk technologies [19, 20]. In bulk technologies, buried wells can reduce charge collection and lower the SER [20].
The single-event upset hardness of flip-flops can be increased by 1) increasing the transistor size, 2) adding passive capacitance, and 3) changing the transistor types with threshold voltage shifts. These three techniques are the simplest and easiest design change methods for increasing hardness. Transistor-level redundancy, such as the dual interlocked circuit element (DICE) [21], and triple modular redundancy (TMR) [22] are also effective mitigation techniques, but they incur much higher area, power, and timing overheads. These heavyweight mitigation techniques are often unnecessary for alpha-particle and terrestrial environments.
The most common and most effective form of microarchitectural mitigation is error checking and correction (ECC). ECC can be as simple as a parity check on an SRAM array or a bank of flip-flops. Large arrays often employ more elaborate SECDED (single-error correct, double-error detect) error correction codes. Heavyweight microarchitectural techniques such as duplicated or triplicated computation units, time-redundant computations, or watchdog processors are applied in mission-critical systems [23–26]. In non-mission-critical systems, derating factors (i.e., don't-care conditions) alone may void the need for microarchitectural changes [27].
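The simplest case mentioned above, a parity check, can be sketched in a few lines. The function names and the example word are illustrative. The sketch shows that a single-bit upset is detected, while a double-bit upset goes unnoticed, which is one reason larger arrays move to SECDED codes.

```python
def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1s (word plus parity) even."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity bit stored alongside it."""
    return even_parity_bit(word) == stored_parity


word = 0b1011_0010
parity = even_parity_bit(word)   # computed when the word is written

single_upset = word ^ (1 << 3)   # one bit flipped by a radiation event
double_upset = word ^ 0b11       # two bits flipped in the same word

print(parity_ok(word, parity))          # True: no error
print(parity_ok(single_upset, parity))  # False: upset detected
print(parity_ok(double_upset, parity))  # True: double upset is missed
```

Parity only detects an odd number of flipped bits and cannot say which bit is wrong, so recovery requires refetching or recomputing the data.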
We conclude this paper with a brief discussion of soft-error trends with technology scaling.
Since 180-nm technology, the SRAM SER has been determined by particle flux rather than by cell robustness. If the sensitive area in an SRAM cell is hit by an alpha-particle or a cosmic ray daughter particle, the cell will most likely flip. As SRAM cells are scaled, each cell has a smaller charge collection area. Thus, the SRAM per-bit SER has decreased over recent generations of integrated circuits, but increased levels of integration have caused an increase in the system-level SRAM SER. SRAMs have entered the realm in which a single alpha-particle or cosmic ray event may cause multiple adjacent SRAM cells to upset. Column multiplexing of SRAM arrays guarantees that physically adjacent cells do not represent multiple bits in an SRAM word. Hence, most multicell upsets still appear as single-bit upsets in the output. Single-bit SRAM upsets are largely solved by ECC.
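The effect of column multiplexing on multicell upsets can be illustrated with a toy address mapping. The mapping below (physical column modulo the multiplexing factor selects the word) is a hypothetical interleaving along a single row, chosen for illustration; real array organizations vary.

```python
from typing import Dict, List, Tuple


def physical_to_logical(column: int, mux_factor: int) -> Tuple[int, int]:
    """Map a physical column to (word index, bit index) under the assumed interleaving."""
    return column % mux_factor, column // mux_factor


def bits_hit_per_word(first_column: int, cluster_width: int,
                      mux_factor: int) -> Dict[int, List[int]]:
    """Group the upset bits of a cluster of adjacent cells by the word they belong to."""
    hits: Dict[int, List[int]] = {}
    for column in range(first_column, first_column + cluster_width):
        word, bit = physical_to_logical(column, mux_factor)
        hits.setdefault(word, []).append(bit)
    return hits


# A three-cell cluster with 4-way multiplexing: each word sees one flipped bit.
narrow = bits_hit_per_word(first_column=6, cluster_width=3, mux_factor=4)
# A five-cell cluster exceeds the multiplexing factor: one word sees two flipped bits.
wide = bits_hit_per_word(first_column=6, cluster_width=5, mux_factor=4)
print(narrow)  # {2: [1], 3: [1], 0: [2]}
print(wide)    # {2: [1, 2], 3: [1], 0: [2], 1: [2]}
```

In this toy mapping, a cluster no wider than the multiplexing factor always appears as single-bit errors in distinct words, each of which single-error-correcting ECC can repair.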
The number of flip-flops on a single chip has increased with scaling and higher levels of integration, and this increase is even greater for high-frequency and multicore designs. This increased number of flip-flops, combined with the increased sensitivity of flip-flops, highlights the need for characterization and selected mitigation of soft errors in flip-flops. Soft-error mitigation strategies are too costly to implement across the board, so it is crucial to identify critical circuits in which soft errors pose the largest risk to system reliability. This requires elaborate fault injection modeling. Once the critical circuits are identified, extensive effort is needed to select the appropriate mitigation strategies while minimizing the performance, power, and area penalties.
Overall, soft errors in integrated circuits present a fascinating, cross-discipline field of study. IBM employs one of the largest groups in the industry to research, characterize, and mitigate soft errors in its mainframes, supercomputers, and ASIC products.
The material in this paper is based on work supported in part by the Defense Advanced Research Projects Agency under its agreement number HR0011-07-9-0002. We thank colleagues in IBM for fruitful discussions regarding this work. A special thanks goes to Robert H. Dennard, Tak H. Ning, Kerry Bernstein, Richard Williams, John Aitken, Dave Heidel, Kenneth Rodbell, Mike Gordon, and Henry Tang. We also thank Scott McAllister and Charles Montrose for support during testing.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of Synopsys, Inc., or Cadence Design Systems in the United States, other countries, or both.
Footnote 1: A logic cone is defined by all signals and all logic gates that influence one latch or one output of the combinational logic circuit.
Received July 16, 2007; accepted for publication September 29, 2007; Published online February 29, 2008.