DOI: 10.1147/rd.523.0285
Soft-error resilience of the IBM POWER6 processor input/output subsystem
by C. Bender, P. N. Sanda, P. Kudva, R. Mata, V. Pokala, R. Haraden, and M. Schallhorn
In evaluating system soft-error rates (SERs), it is important to consider all of the elements of the system working together. For a server, this includes but is not limited to the microprocessors, I/O (input/output) chips, and the memory subsystem. SER mitigation at the system level is carefully orchestrated in IBM systems to detect and correct most errors, prevent the consumption of uncorrectable errors through consistency checking, and maintain continuous operation, even in the rare event of an uncorrectable error.
It is known that signals connecting integrated circuits, including I/O signals, may encounter errors, and it is common practice to apply error-protection schemes such as parity and error correction codes (ECC) [1] to such signals. Data packets are commonly transported with their ECC bits. However, I/O systems can contain complex subsystems [2] as well as individual integrated circuits, which are themselves susceptible to transient errors. With the current technology trend of increased SER in logic circuits [3], both computations and data handling within these logic circuits must be evaluated for SER. In the system, the data packets passing through the I/O are checked at various stages, both within the I/O chip and at the receiving interfaces [4].
Here, we discuss the effects of soft errors in an IBM POWER6* processor I/O hub chip built in IBM 90-nm bulk technology [5]. Proton beam experiments measuring the bandwidth or utilization dependence have illustrated its high error resilience. The I/O hub is very different from the processor in structure, function, and technology used, making its unique sensitivity to soft errors worthwhile to study.
Our analysis and results presented in this paper show that the POWER6 processor I/O subsystem is resilient to soft errors. Signals are resilient to errors that occur due to electrical noise at the buses, which have built-in protocols for correcting bit errors with ECC and requesting as well as resending data and packets. Additionally, the IBM system design has built-in I/O bridge redundancy, which ensures the continuation of function when an error occurs. This allows continuation of the I/O function while degrading performance gracefully without bringing down the system. Also, I/O systems are built to maximize both I/O connectivity and peak bandwidth capacity. This results in excess capacity for typical workloads and configurations, therefore making I/O systems more resilient to soft errors.
Figure 1 shows the POWER6 processor-based system used for these studies. The system board utilizes two dual-core microprocessors and one I/O hub chip.
Figure 1
Both processors were utilized for the measurements. The tests described here used various combinations of the PCI-X**, PCI Express** (PCIe), and on-chip high-speed host Ethernet adapter (HEA) interfaces.
The I/O hub chip includes a high-performance general I/O interface (GX) to the processors, high-speed Ethernet, PCI-X and PCIe interfaces, as well as an auxiliary GX interface (GX passthrough) (Figure 2). The four PCI-X and four PCIe interfaces can be used by systems to provide adapter slots for I/O adapters or integrated PCI (peripheral component interconnect)-attached devices. The HEA provides two 10-Gb/s Ethernet ports, each of which can be configured as two 1-Gb/s Ethernet ports. The auxiliary GX interface is used by systems that require a means to attach I/O expansion drawers for additional PCI-X or PCIe slots.
Figure 2
The GX circuits are depicted in the green-shaded areas of the chip in Figure 2. They are responsible for the direct communication between the I/O hub chip and the central processing unit (CPU) and memory. These tasks include performing the GX bus protocol, the first-level decoding and tagging of read and write requests coming from the CPU to the I/O hub, and routing read and write requests coming from attached I/O devices targeting system memory.
In the POWER6 processor-based system used for the beam tests, the I/O hub chip ports provided two PCI-X adapter slots, three PCIe slots, two integrated PCI-X devices, dual 10-Gb/s Ethernet ports configurable as quad 1-Gb/s Ethernet ports, and one I/O expansion drawer slot.
Proton beam measurements were performed at the Francis H. Burr Proton Therapy Center of the Massachusetts General Hospital [6]. The details are described by Sanda et al. [7]. The lead wall in the beamline has an aperture through which the proton beam passes. Figure 3 shows a proton radiograph with the I/O hub chip in the center of the beam.
Figure 3
Static measurements were performed to determine the baseline of latch flips. The same procedure was performed as described in [7]. Tests were performed to measure L1 and L2 flips from 0 → 1 and 1 → 0. Repeated measurements were taken in each test. Figure 4 shows the x–y map of the accumulated latch flips of one of the tests mentioned above (top) and the layout (bottom).
Figure 4
The POWER6 processor-based system error handling includes consistency checking of I/O hub signals. Errors are detected within the I/O hub chip and parts external to the chip. The external detection and recovery may occur in other parts of the system including the GX bus, the CPU, inter-CPU buses, or the memory subsystem. The point at which the error is detected depends on the type of error. For example, errors at the interface of the I/O hub chip that are not caught by the I/O hub chip would be flagged by other parts of the system. The GX bus is among the first to catch such errors. Errors that get past the GX bus may be flagged by either the CPU or the memory subsystem checks. A majority of the errors were caught by the I/O hub chip as expected.
ECC is employed by all system components, including the I/O hub. It corrects all single-bit errors and detects double-bit errors. The error detection mechanism uses a consumer model. Errors are detected at the earliest possible point, but recovery or failure may be left to the component that uses the data. This provides maximum fault isolation and maximum fault tolerance. For instance, if there are uncorrectable data errors in one program, this model ensures that they affect only that program and no others. In systems that support virtualization, such as the POWER6 processor-based system, this is a critical feature that provides overall system failure resiliency. Checking is also done on any data that the I/O hub sends, so that errors occurring in circuit stages after the I/O hub chip has completed its own error checks are still detected.
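The single-error-correct, double-error-detect behavior described above can be illustrated with a toy code. The following is a minimal sketch using an extended Hamming (8,4) code; it is not the ECC actually used in the POWER6 I/O hub, whose code width and layout are not given here, and the function names `encode` and `decode` are chosen only for this illustration.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                          # index 0 holds the overall parity bit
    word[3], word[5], word[6], word[7] = d  # data bits at non-power-of-two positions
    word[1] = word[3] ^ word[5] ^ word[7]   # parity over positions 1, 3, 5, 7
    word[2] = word[3] ^ word[6] ^ word[7]   # parity over positions 2, 3, 6, 7
    word[4] = word[5] ^ word[6] ^ word[7]   # parity over positions 4, 5, 6, 7
    word[0] = sum(word[1:]) % 2             # makes the parity of all 8 bits even
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # An odd number of flips: treat it as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself was hit.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


codeword = encode(0b1011)
single = list(codeword); single[5] ^= 1
double = list(single); double[2] ^= 1
print(decode(codeword), decode(single), decode(double))
```

In the consumer model, an "uncorrectable" result would be passed along with the data so that the component that consumes it decides how to recover or fail.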
Similarly, cyclic redundancy checks (CRCs), which can detect multiple-bit errors, are employed elsewhere in the I/O system. Data failures are detected, and recovery is accomplished by resending the data either by the I/O hardware itself or by software.
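A detect-and-resend scheme of this kind can be sketched as follows. This is a simplified software model, assuming a CRC-32 appended to each packet and a fixed retry limit; the actual CRC polynomials, packet formats, and retry policies of the I/O hardware are not specified here, and `flaky_link` is a hypothetical stand-in for a link that suffers one transient upset.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return the payload if its CRC matches, otherwise None."""
    payload, received_crc = packet[:-4], int.from_bytes(packet[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def send_with_retry(payload: bytes, link, max_attempts: int = 3):
    """Send over `link` (a callable modelling the wire); resend on CRC failure."""
    for attempt in range(1, max_attempts + 1):
        received = check_packet(link(make_packet(payload)))
        if received is not None:
            return received, attempt
    raise IOError("CRC check failed on every attempt")


def flaky_link():
    """Hypothetical link that flips three bits of the first packet only."""
    calls = {"n": 0}

    def link(packet: bytes) -> bytes:
        calls["n"] += 1
        if calls["n"] > 1:
            return packet
        damaged = bytearray(packet)
        for bit in (3, 17, 42):
            damaged[bit // 8] ^= 1 << (bit % 8)
        return bytes(damaged)

    return link


data, attempts = send_with_retry(b"I/O hub payload", flaky_link())
print(data, attempts)
```

The first transmission is rejected because the multiple-bit corruption changes the CRC, and the second, clean transmission is accepted.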
Functional tests involved starting a system from power-on reset followed by an initial program load (IPL). Once the system IPL (or boot) was complete, an exerciser program for a given adapter was run on the system. The entire system—CPU, memory, and support chips—is employed in exercising the I/O. As the program executed, the I/O hub chip was exposed to the proton beam.
The software package used to exercise the I/O hub chip is one that is used by IBM hardware test teams to verify the POWER* processor-based system hardware components. The package runs as an application on the AIX* or the Linux** operating system and is widely used at various stages of hardware design, from system development to manufacturing. It invokes individual applications, each of which is designed to exercise a specific hardware function. These applications can be run individually or simultaneously with other applications. Multiple copies of a single application can also be run. In this way, interactions among various hardware functions (e.g., processor, L2 cache, system memory, and I/O adapters) can be tested. In our testing we focused on I/O adapters in order to exercise the I/O hub. We ran one or more copies of the Ethernet adapter exerciser. Each copy used either an Ethernet adapter plugged into a PCI slot or the HEA. Each copy would send and receive data simultaneously between two Ethernet ports. The terms Ethernet adapter, Ethernet interface, and Ethernet cable are used interchangeably in this paper.
The functional test stopped on a system checkstop or when a data miscompare was detected. A system checkstop is a system-fatal error that brings the system to an immediate halt, at which point the failing state of the system may be logged for failure analysis. The checkstops initiated by the I/O hub chip itself are generated in the GX circuits. Failures here that cannot be recovered and corrected by the I/O hub itself generally result in a checkstop for one of two reasons: Either they result in the loss of connection between the CPU and memory and all I/O devices, or the checkstop prevents corrupted data from being used by the memory or CPU.
The highest probability that a soft error will affect chip function and result in a checkstop occurs when the GX circuits are 100% utilized. The I/O hub is generally immune to latch flips so long as those latches are not being used functionally; that is, as long as the I/O hub is not depending on the state of a latch, the fact that it temporarily contains the wrong state is of no consequence. Therefore, higher utilization produces higher I/O hub failure rates. Utilization in an I/O device is measured by the bandwidth of data flowing through the I/O device (i.e., the number of millions of bytes of data that move in and out per second, MB/s). The failure rate obtained with the exerciser must be scaled to the bandwidth (BW) of typical applications in order to obtain a projected failure rate.
In our testing, we used eight different configurations using the same Ethernet test exerciser in each case to drive similar but random datastreams to tabulate our dynamic SER rating. The PCI tests used 1-Gb/s Ethernet adapters. The tests are categorized as follows:
1. Idle.
2. Full-BW PCI test.
3. Half-BW PCI test.
4. PCI single adapter test.
5. 10-Gb/s HEA test.
6. 1-Gb/s HEA test.
7. 10-Gb/s HEA + PCI test.
8. 1-Gb/s HEA + PCI test.
Tests 2, 7, and 8 were designed to maximize the bandwidth achieved with the Ethernet interfaces attached to either the PCI slot or the HEA. The terms 1 Gb/s and 10 Gb/s refer to the data transfer rate or bandwidth of the Ethernet interface. Tests 2 through 6 were used to isolate each of the I/O hub chip interfaces (i.e., PCI-X, PCIe, and HEA) that were exercised. Test 1, an idle state test, was used as a baseline and was measured with the system booted to the operating system prompt but not running any I/O traffic. There is some possibility that even when completely idle, a latch flip in the I/O hub can either cause an error checker circuit to falsely detect an error or it can cause the I/O hub to believe there is work to do when in fact there is none. Thus 0% utilization of the I/O hub actually has a non-zero soft-error failure rate.
Each test case is distinguished by the number of Ethernet ports tested. For instance, in test 2, our maximum PCI configuration, we ran the exerciser on eight Ethernet ports, while in test 3 we ran four ports to represent half the bandwidth of the maximum PCI configuration. In test 4, we ran both ports on a single adapter card to represent the minimum bandwidth configuration.
In tests 2 and 4 we had adapter(s) loop back to themselves. That is, by using dual-port Ethernet adapters, we connected the Ethernet cable from one port to the other port on the same adapter card.
For test 3, we connected the Ethernet cable from one port of one adapter to a port on the neighboring adapter. Two dual-port Ethernet adapters are populated under the PCI-X bridge, and likewise two under the PCIe bridge. Thus, for each pair of Ethernet adapters under a given bridge, we used only one Ethernet port on each adapter and connected those two ports together. In this configuration we still exercised all available slots under the PCI-X and PCIe bridges while generating half the bandwidth of the maximum PCI configuration.
For test 5, the 10-Gb/s HEA test, we had a daughter card with dual fiber-optic ports that were tied together, similar to test 2. For test 6, we had a quad-port daughter card and we paired the ports with an Ethernet cable. Tests 7 and 8 consisted mainly of the HEA interface running in addition to the test 3 configuration.
The I/O tests generate read and write traffic across the PCI and GX interfaces to move data between the memory and the Ethernet ports. For example, the exerciser would initiate a message send using TCP/IP (Transmission Control Protocol/Internet Protocol), which would result in a command to the Ethernet (either in the I/O hub chip or in the PCI adapter) causing it to read the message out of memory and send it over the Ethernet cable. The other end of the cable would be connected to another Ethernet device in our system, which would then initiate a write to memory. The CPU would be notified of the message arrival, and the exerciser code would check the data. This process would be repeated by the exerciser, which would keep as many messages in flight as possible to maximize the throughput of each Ethernet device. Testing many Ethernet devices together raises the total I/O hub chip throughput.
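The send, receive, and compare cycle described above can be sketched in a few lines. This is only an illustrative model, not the IBM exerciser: it uses a local socket pair in place of two cabled Ethernet ports, keeps a single message in flight rather than many, and counts miscompares instead of stopping on the first one. The names `run_exerciser` and `recv_exact` are hypothetical.

```python
import os
import socket


def recv_exact(sock, n):
    """Read exactly n bytes, since recv() may return fewer than requested."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def run_exerciser(iterations=100, size=1024):
    """Send random messages through a loopback pair and count data miscompares."""
    sender, receiver = socket.socketpair()   # stands in for two cabled ports
    miscompares = 0
    try:
        for _ in range(iterations):
            message = os.urandom(size)       # random datastream, as in the tests
            sender.sendall(message)
            if recv_exact(receiver, size) != message:
                miscompares += 1             # the real test stopped at this point
    finally:
        sender.close()
        receiver.close()
    return miscompares


print(run_exerciser())
```

Keeping many messages in flight, as the exerciser did, would require concurrent senders and receivers; the single-message loop is kept here for clarity.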
Ethernet adapters were chosen to exercise the PCI interfaces for two reasons: They were readily available and easily exercised without requiring elaborate infrastructure, and the test code we have for them checks for data miscompares. For example, Fibre Channel adapters might have been chosen but would have required a large number of attached disks (and their associated chassis) to generate the amount of traffic we needed.
The Ethernet exerciser and TCP/IP code stack are CPU intensive, and the system we had available was an early development test system with only four CPUs running at suboptimal frequency. Testing all interfaces simultaneously would have degraded the bandwidth achieved by the test due to the lack of CPU availability. Separate experiments were performed to split the chip into pieces for the test, and then their contributions were added together. The interfaces were tested alone and together and then compared with the results of test 1, the idle state.
For test 2, the full-BW PCI test, four dual-port 1-Gb/s Ethernet adapters were configured: two PCI-X and two PCIe adapters, for a total of eight Ethernet interfaces. Each 1-Gb/s Ethernet port can generate a theoretical maximum of 120 MB/s per direction (240 MB/s per direction per adapter). The theoretical peak for eight ports, counting both directions, is 1,920 MB/s. The CPU was fully exercised by this maximum configuration in our test system. Test 3, the half-BW PCI test, used the same adapters as test 2, but only one port on each adapter was used. Theoretically, this would produce 960 MB/s. The intended purpose was to create an additional failure-rate data point from which we could scale the full-BW PCI test to 100%. Test 4, the PCI single adapter test, was the same as the full-BW PCI test, but only one adapter was used. Its purpose was the same as that of test 3.
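The peak-bandwidth figures quoted above follow from simple arithmetic, reproduced in the short sketch below. The 120 MB/s per-direction figure is taken from the text; the function name and the single-adapter value (which the text does not state) are derived here for illustration only.

```python
PORT_MB_S_PER_DIRECTION = 120   # theoretical maximum per 1-Gb/s port, from the text


def theoretical_peak_mb_s(ports, directions=2):
    """Aggregate theoretical peak for a number of full-duplex 1-Gb/s ports."""
    return ports * directions * PORT_MB_S_PER_DIRECTION


full_bw = theoretical_peak_mb_s(8)         # test 2: four dual-port adapters
half_bw = theoretical_peak_mb_s(4)         # test 3: one port on each adapter
single_adapter = theoretical_peak_mb_s(2)  # test 4: derived here, not quoted in the text

print(full_bw, half_bw, single_adapter)    # 1920 960 480
```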
The HEA is an Ethernet function that is built into the I/O hub chip. As described above in the I/O hub overview section, it may be configured as either a 1-Gb/s or a 10‐Gb/s port. The HEA exerciser operates the same as that for the PCI-attached Ethernet devices. The only difference is the lack of any PCI interface between the GX bus and the Ethernet device.
Tests 5 and 6, the 10-Gb/s and 1-Gb/s HEA tests, exercised the two 10-Gb/s or the four 1-Gb/s Ethernet ports on the I/O hub chip, respectively. After observing the poor bandwidth achieved with the PCI 1-Gb/s adapters, we decided to run the HEA 1-Gb/s tests at a 33% lower GX frequency to achieve higher utilization of the GX circuits. Tests 7 and 8, the 10-Gb/s and 1-Gb/s HEA + PCI tests, provided another failure-rate data point to assist in adding the contributions of the two pieces. The latter interfaces were run at a medium rate to avoid overrunning the CPU.
The only interface not tested was the GX-passthrough interface used to attach external I/O expansion drawers. Again, the need for additional machine infrastructure and additional test complexity argued against including it. Since the circuit area uniquely associated with this interface is small, it was skipped.
Figure 5(a) shows the results of the measured error rates (blue bars) for each of the functional tests as compared with the idle test. The yellow bars show these same measurements after adjusting for 100% bandwidth and typical application usage. The green bars show the failure-rate reduction we were able to achieve for the tests as a result of the fact that some of the checkstops which occurred during the measurement would not occur in the field. During the testing, we were running down-level firmware that had several error-correction and recovery functions disabled. Accordingly, firmware and some hardware error correction and recovery mechanisms were missing from the beam measurement environment, and the measured data was reviewed to eliminate such recoverable errors from the data, resulting in the green bars. Finally, the line graph shows the utilization of the GX circuits for each of the tests, from which we can see that higher utilization generally produces higher error rates.
Figure 5
The half-BW PCI and PCI single-adapter deratings [7] were similar to the idle derating. The full-BW (multiadapter) PCI test was more strenuous (i.e., higher bandwidth) than these. The 1-Gb/s HEA and 10-Gb/s HEA tests were successively more strenuous. The 1-Gb/s HEA + PCI test was more susceptible to upset still. Finally, the 10-Gb/s HEA + PCI test was the most strenuous, resulting in the smallest derating factor. Note that the 1-Gb/s HEA test utilization was helped greatly by having reduced the GX bus frequency. In hindsight, we would have repeated the frequency reduction for the other tests as well.
The fact that the half-BW PCI and PCI single-adapter tests showed no statistically significant difference from idle surprised us and meant that we could not use these in assisting us to scale the other functional tests. We were able to conclude that these tests did not exercise the chip enough to escape the error-susceptibility noise of the idle test. As described previously, the I/O hub chip with 0% utilization has some non-zero failure rate, which we call the idle noise. When exercising the chip, some idle fails are replaced with functional fails until the number of functional fails exceeds those of idle.
Once we escape the idle noise, the functional fails scale with the utilization of the chip. We use measured I/O bandwidth as a proxy for utilization. Knowing the bandwidth of a test, we can scale that portion of its failure rate which exceeds the idle noise to 100%. For example, if the measured bandwidth of a test is known to be 25% of theoretical peak, then its scaled failure rate can be calculated as
$$f_{st} = \frac{f_t - f_i}{0.25} + f_i,$$
where $f_{st}$ = the failure rate scaled to 100% bandwidth, $f_t$ = the measured failure rate of the test, and $f_i$ = the measured idle failure rate. Once scaled up for peak bandwidth, we can then scale it back to represent the effects of customer usage. In Figure 5, 20% usage is assumed.
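The scaling step can be sketched numerically. The function below assumes that only the portion of the measured rate above the idle rate is divided by the bandwidth fraction and that the idle rate is then added back unscaled; that reading of the text, the clamping of sub-idle measurements, and the numbers used are all assumptions for illustration, not values or code from the study.

```python
def scale_to_full_bandwidth(f_test, f_idle, bw_fraction):
    """Scale the part of a measured failure rate that exceeds idle to 100% bandwidth.

    f_test      -- measured failure rate of the functional test
    f_idle      -- measured idle failure rate (the idle noise)
    bw_fraction -- measured bandwidth as a fraction of theoretical peak, in (0, 1]
    """
    if not 0 < bw_fraction <= 1:
        raise ValueError("bw_fraction must be in (0, 1]")
    # Assumption: a test that does not rise above the idle noise contributes
    # no functional fails to scale.
    excess = max(f_test - f_idle, 0.0)
    return f_idle + excess / bw_fraction


# Hypothetical rates in arbitrary units, for illustration only.
print(scale_to_full_bandwidth(f_test=2.0, f_idle=1.0, bw_fraction=0.25))  # 5.0
```

Under this form, a test that already runs at 100% of peak is left unchanged, and a test indistinguishable from idle scales to the idle rate itself.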
From the measured error rates, we can calculate per-test deratings from the static test latch-flip rates. The derating is the ratio of the beam energy required to produce a functional error to the energy required to produce a latch flip. This gives us a measure of the number of soft-error latch flips that are required to actually produce a functional failure. Figure 5(b) shows the results of derating for each of the tests.
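Read as "latch flips per functional failure", the derating can be sketched as a simple ratio of two rates measured under the same beam exposure. This interpretation, the function name, and the numbers are assumptions made for illustration; they are not measurements from the study.

```python
def derating(latch_flip_rate, functional_failure_rate):
    """Latch flips per functional failure, both rates taken per unit of beam exposure."""
    if functional_failure_rate <= 0:
        raise ValueError("functional_failure_rate must be positive")
    return latch_flip_rate / functional_failure_rate


# Hypothetical rates in arbitrary units, for illustration only.
static_flip_rate = 4000.0
for test_name, failure_rate in {"idle": 1.0, "busy": 4.0}.items():
    print(test_name, derating(static_flip_rate, failure_rate))
```

A more strenuous test, with a higher functional failure rate for the same latch-flip rate, yields a smaller derating factor.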
Increased utilization and, thus, increased error rates are observed in two ways: first, by increasing the number of circuits in the I/O hub that are active during the test, that is, the HEA + PCI tests show the largest error rates; second, by increasing the bandwidth, that is, the 10-Gb/s HEA test had a higher error rate than either the 1-Gb/s HEA or the full-BW PCI tests. The weighting (yellow bars) removes the bandwidth differences between the tests. Removing the false checkstops (green bars) shows just how much improvement can be gained by engaging effective error recovery.
Since the impact of soft errors is strongly dependent on whether the data in the latch being corrupted by the soft error is being used, the question of typical usage is very important in the derating equation. Peaks and valleys in I/O activity have to be averaged before applying them to calculate a derating. Few applications run the I/O at peak capacity. If we assume as a worst case that the applications do run their I/O at its peak capacity, then we can simply take the weighted average of idle and per-test 100%-scaled-bandwidth failure rates to produce a typical application usage derating. With the advent of the virtualization of I/O, we expect to approach this 100%-scaled-bandwidth utilization. Thus, while this scaling is pessimistic today, it properly covers future expectations.
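The weighted average mentioned above can be written out directly. The sketch below assumes a linear blend between the idle failure rate and the rate scaled to 100% bandwidth, weighted by the fraction of time the I/O is in use (20% in Figure 5); the function name and the example rates are hypothetical.

```python
def typical_usage_rate(f_scaled, f_idle, usage=0.20):
    """Blend the 100%-scaled and idle failure rates by the fraction of I/O usage."""
    if not 0 <= usage <= 1:
        raise ValueError("usage must be between 0 and 1")
    return usage * f_scaled + (1 - usage) * f_idle


# Hypothetical rates in arbitrary units, for illustration only.
print(typical_usage_rate(f_scaled=5.0, f_idle=1.0))
```

At 0% usage the result reduces to the idle rate and at 100% usage to the fully scaled rate, which matches the two end points discussed in the text.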
This study demonstrates the error resilience under even full load for the I/O hub of a high-performance server system. Even under maximum utilization, the error resiliency compared with 0% utilization was quite good. The error resiliency was shown to be further enhanced with the checking and recovery capabilities of the I/O hub chip as well as the rest of the POWER6 microprocessor error-handling system. The study focused on the most-susceptible portion of the I/O hub chip (colored green in Figure 2). Failures in all other portions of the I/O hub chip (colored blue) are completely recoverable by hardware and firmware means and thus further enhance the error resiliency of the system as a whole.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of PCI-SIG Corporation, Linus Torvalds, or InfiniBand Trade Association in the United States, other countries, or both.
Received July 17, 2007; accepted for publication October 9, 2007; Published online March 6, 2008.