IBM Journal of Research and Development

Soft Errors in Circuits and Systems   Volume 52, Number 3, 2008
DOI: 10.1147/rd.523.0223

Preface

The reliability of computing systems is very important since many applications demand uninterrupted operation. Permanent and transient failures must be reduced to extremely low levels in order to meet the requirements for current and future systems. The radiation-induced transient failures, generally referred to as soft errors or single-event upsets (SEUs), are the topic of this special issue.

In 1996, the September issue of the IBM Journal of Research and Development focused on SEUs caused by cosmic radiation and most of the papers described soft errors in memory circuits. The work described in this issue extends the previously published work and focuses primarily on soft errors caused by alpha-particles that are emitted from radioactive contamination on or in the chip. The primary concern has also shifted from memory circuits to logic and latch circuits. In addition, the papers in the second section of this issue of the Journal go beyond the discussion of circuit upsets to cover the impact of soft errors at the chip and system levels.

Over the past few decades, the issue of SEUs has grown and evolved to encompass a wide range of topics including the following: 1) numerous sources of radiation, 2) a wide range of new materials, 3) new chip technology and device scaling, 4) memory, latch, and logic circuits, 5) system architectures, 6) software applications, and 7) methods available for SEU mitigation. Each of these topics is briefly mentioned below.

The background cosmic radiation effect on soft-error rates (SERs) in memories is well known. More recently, the susceptibility of modern devices to soft errors due to low-energy neutrons has led to a more accurate determination of the cosmic ray energy spectrum from thermal energies to 1 GeV (billion electron volts). The flux of neutrons varies inversely with energy, resulting in a very large number of low-energy neutrons. In the late 1990s, thermal neutron capture reactions were shown to be a soft-error mechanism, and modern devices are known to be sensitive to both low-energy alpha-particles and low-energy protons. In fact, the modern device (65-nm node) sensitivity to alpha-particles makes this the most prominent soft-error failure mechanism, more significant even than the cosmic ray component. The consequences of alpha-particle sensitivity are highlighted in a number of the papers in this issue.

The sensitivity to thermal neutron capture resulted in the removal of ¹⁰B from most semiconductor materials and processes, especially boron-phosphorus silicate glass. In addition, the alpha-particle activity in Pb has been reduced with the application of extensive testing and monitoring of materials. New materials are also routinely screened for U and Th impurity levels since even parts per billion (ppb) can lead to unacceptable levels of alpha-particle emission (from, e.g., ²¹⁰Po). In addition, some naturally occurring alpha-particle-emitting materials, such as Pt, must be limited in quantity to avoid unacceptable SERs. In order to address these material issues, details of the particle energy loss in the back-end-of-line (BEOL) interconnect structure must be accurately modeled, since the alpha-particle path length determines the charge that is deposited in a device. “Soft” paths (through low-density materials, such as insulators) absorb less of the alpha-particle energy than “dense” paths (through high-Z materials, such as metals). The most critical path is one in which the end-of-range of the alpha-particle lies in the device, at which point all of the alpha-particle charge is deposited in the device.

Over the last several generations of silicon devices, the critical charge (Qcrit) required to upset a device has decreased from much more than 100 fC (femtocoulombs) to less than 1 fC. Devices today can be upset by alpha-particles with energies less than 10 MeV (million electron volts). For many years, it has been predicted that technology scaling to smaller dimensions would lead to SEUs due to direct ionization from protons. This has recently been shown to be true for both bulk SRAM and 65-nm silicon-on-insulator (SOI) devices.
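To give a rough sense of scale for the critical-charge values quoted above, the following sketch converts a charge in femtocoulombs into an equivalent number of electron-hole pairs and an equivalent deposited energy in silicon. It is an illustration only, not a model from this issue: the figure of about 3.6 eV per electron-hole pair is a commonly quoted textbook value for silicon, the function names are hypothetical, and the calculation assumes that every generated carrier is collected, which real devices do not achieve.

```python
# Rough conversion between collected charge and deposited energy in silicon.
# Assumptions (not taken from the preface): ~3.6 eV is needed on average to
# create one electron-hole pair in Si, and all generated charge is collected.

ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per electron (SI definition)
EV_PER_PAIR_SI = 3.6                   # assumed mean pair-creation energy in Si, in eV


def charge_fc_to_pairs(charge_fc):
    """Number of electron-hole pairs corresponding to a charge in femtocoulombs."""
    return charge_fc * 1e-15 / ELEMENTARY_CHARGE_C


def charge_fc_to_energy_kev(charge_fc):
    """Deposited energy (keV) needed to generate that charge, under the assumptions above."""
    return charge_fc_to_pairs(charge_fc) * EV_PER_PAIR_SI / 1e3


if __name__ == "__main__":
    for qcrit_fc in (100.0, 1.0):
        pairs = charge_fc_to_pairs(qcrit_fc)
        energy_kev = charge_fc_to_energy_kev(qcrit_fc)
        print(f"Qcrit = {qcrit_fc:g} fC: about {pairs:.0f} pairs, {energy_kev:.1f} keV deposited")
```

Under these assumptions, 1 fC corresponds to roughly 6,200 electron-hole pairs, or a little over 22 keV of deposited energy, while 100 fC corresponds to a few MeV. This is one way to see why lower-energy particles become a concern as Qcrit falls.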

Since it is much more costly to provide SEU mitigation for latch circuits, the soft-error failure rate of latches is increasing more rapidly with technology scaling. This failure rate increase is further enhanced because the latches are now sensitive to upsets by alpha-particles.

Given that soft errors can lead to upsets of both SRAM cells and latches, the challenge at the system level is to provide high levels of resiliency to these upsets. The trends at the system level toward higher numbers of cores per chip, higher numbers of latches and SRAMs, and tighter power constraints that limit the use of redundancy continue to increase the magnitude of this challenge. Mitigation of soft errors in SRAMs has been widely realized using parity checking, error checking and correction, redundancy, and various forms of voting. SER mitigation for latches is more difficult due to large space penalties, performance degradation, and increased power consumption. In addition, the alpha-particle flux from packaging materials can be greatly reduced by simply increasing the thickness of the BEOL structure. In servers, mitigation is achieved by reducing the flux of incident particles, decreasing the intrinsic sensitivity of circuit elements, building in high levels of error detection and recovery logic at the chip level, and designing robust system architectures.
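Two of the mitigation techniques named above, parity checking and voting, can be sketched in a few lines. The example below is a minimal illustration rather than a description of any IBM design: the function names and the 8-bit word are hypothetical, the parity scheme is simple even parity over one word, and the voter is a bitwise majority over three redundant copies.

```python
# Minimal sketch of parity checking (detection) and majority voting (correction).
# All names and the sample word are hypothetical, chosen only for illustration.

def even_parity_bit(word):
    """Parity bit that makes the total number of 1 bits (word plus parity) even."""
    return bin(word).count("1") % 2


def parity_ok(word, stored_parity):
    """True if the word still matches the parity bit stored alongside it."""
    return even_parity_bit(word) == stored_parity


def majority_vote(a, b, c):
    """Bitwise majority of three redundant copies of the same word."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word, bit):
    """Model a single-event upset by inverting one bit of the word."""
    return word ^ (1 << bit)


if __name__ == "__main__":
    word = 0b10110010
    parity = even_parity_bit(word)

    upset = flip_bit(word, 3)
    print("single upset detected by parity:", not parity_ok(upset, parity))

    double_upset = flip_bit(upset, 5)
    print("double upset detected by parity:", not parity_ok(double_upset, parity))

    print("voting recovers the word:", majority_vote(word, upset, word) == word)
```

Single-bit parity detects any odd number of flipped bits but cannot correct them and misses an even number of flips. Majority voting corrects an upset in one copy at the price of storing three copies, which illustrates the space and power penalties that make the same approach costly for latches.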

In this issue of the Journal, the SERs in both circuits and systems are covered. The paper by Heidel et al. presents experimental data showing latch upsets using low-energy alpha-particles. The use of nuclear and SEU event modeling is described in the paper by Tang. This theme is continued in the paper by Tang et al., in which the role of the metal and insulator layers in absorbing alpha-particle radiation is modeled and experimentally measured. Circuit SER modeling is described in the paper by KleinOsowski et al., and SER measurement techniques are described by Gordon et al.

In the second part of this issue of the Journal, the papers focus on the impact of soft errors at the system level. Papers by Sanda et al. and Bender et al. describe the high resiliency of the IBM POWER6™ microprocessor chip and the POWER6 microprocessor I/O subsystem, respectively. These papers describe how this resiliency is directly measured in high-energy proton beam irradiation experiments. The paper by Sanda et al. also describes a fault injection methodology that can be used to precisely quantify and thereby verify soft-error robustness. The paper by Bender et al. highlights the importance of application conditions such as bandwidth and utilization. Rivers et al. describe a methodology for quantifying the soft-error robustness of an evolving chip design. Finally, Dell describes the history of DRAM soft errors, subsequent mitigation strategies, and architectural considerations that exacerbate the effects of soft errors.

 
 
 David F. Heidel
Research Staff Member
IBM Research Division
 
 Jack M. Hergenrother
Manager, Technology and System z Testing
IBM Systems and Technology Group
 
 Kenneth P. Rodbell
Research Staff Member, Manager SER Research
IBM Research Division

Guest Editors
 

