0018-8646/2001/$5.00 (C) 2001 IBM Fast pseudorandom-number generators with modulus 2[sup]k[/sup] or 2[sup]k[/sup]-1 using fused multiply-add by R. C. Agarwal, R. F. Enenkl, F. G. Gustavson, A. Kothari, and M. Zubair Many numerically intensive computations done in a scientific computing environment require uniformly distributed pseudorandom numbers in the range (0, 1) and (-1, 1). For multiplicative congruential generators with modulus 2[sup]k[/sup], k [less or = to] 52, and period 2[sup]k-2[/sup], we show that the cost per random number for these two distributions is 3 and 3.125 multiply-adds on RS/6000(R) processors. Our code, on the IBM POWER2 Model 590, produces more than 40 million uniformly distributed pseudorandom numbers per second for both ranges (0, 1) and (-1, 1). Additionally, our code sustains the 40 million per second rate for data out of cache. The Numerical Aerodynamic Simulation (NAS) parallel benchmarks use a linear congruential generator with modulus 246. Our result is about 50 times faster than the generic implementation given in the benchmarks. The extra-accuracy fused multiply-add instruction of RS/6000 machines combined with a few algorithmic innovations gives rise to the 50-fold increase. If IEEE 64-bit arithmetic is used with our Fortran code on POWER and PowerPC(R) architectures, the results we obtain are bit-wise identical to the generic algorithms. The paper gives several illustrations of a general technique called the Algorithm and Architecture approach. We demonstrate herein that programmer-controlled unrolling of loops is equivalent to "customized vectorization of RISC-type code." Customized vectorization is more powerful than ordinary vectorization, and it is only possible on RISC-type machines. We illustrate its use to show that RS/6000 processors can compute the distribution (-1, 1) at the rate of 3.125 multiply-adds. We also specify a linear congruential generator that is related to the multiplicative congruential generator referred to above. It has a full period of 2[sup]k[/sup], where 2[sup]k[/sup] is the modulus. The cost per random number [in the range (0, 1)] for this generator is four multiply-adds on RS/6000 processors. Our code, on the IBM POWER2 Model 590, for this generator produces more than 30 million uniformly distributed pseudorandom numbers per second for the range (0, 1). We show that this generator is "embarrassingly parallel," or EP. Using the Algorithm and Architecture approach, we describe a new concept called "generalized unrolling." Finally, we present a multiplicative congruential generator for which the modulus is not a power of 2. Such a generator, as well as one with modulus 2[sup]k[/sup], is selectable as the generator used in the RANDOM_NUMBER intrinsic function of IBM XL Fortran and XL High Performance Fortran. All of the generators reported here are EP. Using an IBM SP2 machine with 250 wide nodes, it is possible to compute more than ten billion uniform random numbers in a second. 1. Introduction The extra-accurate fused multiply-add (FMA) operation of the RS/6000* and PowerPC* family of RISC microprocessors offers many opportunities to use mathematical innovation to produce fast algorithms for numerically intensive computation (NIC). In this paper we illustrate this assertion by giving several examples and by demonstrating a 50-fold increase in performance (over generic algorithms [1]) for pseudorandom-number generation. The results obtained are bit-wise identical to the results of the generic algorithms. For multiplicative congruential generators [2] of the type s[sub]i+1[/sub] = as[sub]i[/sub] mod 2[sup]k[/sup], x[sub]i+1[/sub] = 2[sup]-k[/sup]s[sub]i+1[/sub] or x[sub]i+1[/sub] = 2[sup]-k+1[/sup]s[sub]i+1[/sub] - 1, (1) with k [less or = to] 52, we show that the cost per random number for the two distribution intervals (0, 1) and (-1, 1) is respectively 3 and 3.125 multiply-adds on RS/6000 processors. We also specify a linear congruential generator of the form s[sub]i+1[/sub] = (as[sub]i[/sub] + c) mod 2[sup]k[/sup], x[sub]i+1[/sub] = 2[sup]-k[/sup]s[sub]i+1[/sub], related to the generator in [2], which has a full period of 2[sup]k[/sup]. The cost per random number [in the range (0, 1)] for this generator is four multiply-adds on RS/6000 processors. Finally, we present a multiplicative congruential generator for the interval (0, 1) of the form s[sub]i+1[/sub] = as[sub]i[/sub] mod (2[sup]k[/sup] - 1), s[sub]i+1[/sub] x[sub]i+1[/sub] = ----------------- , (2) 2[sup]k[/sup] - 1 (discussed in [3]), for which the modulus is not a power of 2. Such a generator, as well as a generator of type (1), is selectable as the generator used in the RANDOM_NUMBER intrinsic function of IBM XL Fortran (XLF) [4] and XL High Performance Fortran (XLHPF) [5]. [By "a generator of type (n)," or "generator (n)," we mean a generator whose description is given by equation(s) (n).] We first introduce some notation. We use q to represent the modulus, either 2[sup]k[/sup] or 2[sup]k[/sup] - 1, depending on the generator. Arithmetic operators in equations are exact. Operands are IEEE normalized numbers. We use the operators [xor], [bisected_circle], [o times], [o slash] for IEEE double-precision (64-bit) floating-point arithmetic, and fl(x) for the correctly rounded double-precision value corresponding to x. In most cases, x = ab + c, where a, b, and c are IEEE numbers. Unless otherwise stated, the rounding mode we use is round-to-zero (chop). This leads to the best performance. Let a, b, and c be arbitrary IEEE 64-bit floating-point numbers. The fused multiply-add operation on RS/6000 and PowerPC computes the correctly rounded d = fl(ab + c) for any of the four IEEE rounding modes. On RS/6000 machines, all 106 bits of the product a * b are added to c in order to guarantee that d will be correctly rounded for all possible values of a, b, and c. POWER and POWER2 RS/6000 machines can on a continuous basis compute one and two FMAs every machine cycle if the results of the FMAs do not cause any pipeline delays. The pipeline delay for these machines is two or three cycles. Loop unrolling can be used to avoid any pipeline delays for the pseudorandom-number calculation. We first consider a multiplicative congruential pseudorandom-number generator for which the modulus is a power of 2. See Knuth [2] for a thorough discussion of pseudorandom-number generators. Let 0 < s[sub]0[/sub] < 2[sup]k[/sup], s[sub]0[/sub] odd, 1 < a < 2[sup]k[/sup], k [less or = to] 52, and for i [greater or = to] 0, let s[sub]i+1[/sub] = as[sub]i[/sub] mod 2[sup]k[/sup], x[sub]i+1[/sub] = 2[sup]-k[/sup]s[sub]i+1[/sub]. (3) For k = 46 and a = 5[sup]13[/sup] we have a specific instance of such a generator. This generator has period 2[sup]k-2[/sup], and it is extensively used by the Numerical Aerodynamic Simulation (NAS) suite of paper and pencil benchmarks [1]. We mention here that the random-number generator (3) is embarrassingly parallel, or EP. In [1, p. 31, bullet 3], this point is made for the EP benchmark. Also, on page 29 of [1], Bailey describes the binary algorithm for exponentiation that allows one to compute a[sup]n[/sup] mod 2[sup]k[/sup] in log[sub]2[/sub](n) steps. The fact that a[sup]n[/sup] mod 2[sup]k[/sup] is computable in log[sub]2[/sub](n) steps is crucial to making the random-number generator (3) embarrassingly parallel. Bailey implicitly points this out in [1, p. 31, bullet 2]. On the IBM POWER2 Model 590, our bit-wise identical algorithms corresponding to Equation (3) compute more than 40 million random numbers per second. On the IBM SP2 machine, using the Model 590 for its nodes, and using p such nodes, we can compute more than 40p million random numbers per second because of the EP nature of these algorithms. Thus, using 25 nodes our algorithms can compute more than a billion random numbers in a second. In [1], Bailey gives a generic algorithm that simulates base 2[sup]23[/sup] multiple-precision arithmetic to compute the s[sub]i[/sub]'s in (3). This algorithm requires 18 floating-point operations and four convert-to-integer operations per random number. We implement the same generator using three multiply-adds. Any pseudorandom s[sub]i[/sub] from (3) can then be placed in the range 0 < x[sub]i[/sub] < 1 by the scaling x[sub]i[/sub] = 2[sup]-k[/sup]s[sub]i[/sub] or be put in the range -1 < x[sub]i[/sub] < 1 by computing x[sub]i[/sub] = 2[sup]-k+1[/sup]s[sub]i[/sub] - 1. These computations respectively require one and two additional floating-point operations. In this paper, we redefine (3) to compute each x[sub]i[/sub] directly, without first computing s[sub]i[/sub], and thus compute x[sub]i+1[/sub] = ax[sub]i[/sub] mod 1. This change avoids the actual scalings done above. We also show how these computations can be done by three and 3.125 multiply-adds per random number on RS/6000 machines. When doing modular arithmetic on integers, it is natural to use the greatest integer function. In floating-point arithmetic this is achieved by using the IEEE round-to-zero (chop) rounding mode. Throughout this paper, except for one place in Section 2, we use the chop rounding mode. Many NIC algorithms are rated by their megaflop rate, although this popular measure is often misleading. We think that pseudorandom-number generation is such a case. An NIC computation related to pseudorandom generation is the EP NAS benchmark [1, 6]. By using mathematical innovation and the fast random-number generation described here, we were able to demonstrate that the RS/6000 POWER2 Model 590 could currently outperform a Cray YMP for EP [6]. In this paper we compare the ratio of the computing time for the generic algorithm [1] to the time of our algorithm. In both cases we use the same model of RS/6000. The new algorithm is written entirely in Fortran and is compiled using the XL Fortran (XLF) compiler [4]. We also implemented another generator of the type (3) in which a = 44485709377909 and k = 48. This generator is the RANF() pseudorandom function used on the CDC Cyber 174 computer, and is also generator 2 in the RANDOM_NUMBER intrinsic function provided with the IBM XLF and XLHPF compilers. It is described in the book Stochastic Simulation by Brian Ripley [7, p. 216], who presents statistical test results for several generators; he finds this generator "quite acceptable." Gordon Slishman has implemented this generator[foot1] using three multiply-adds for single-processor execution of RANDOM_NUMBER. The parallel implementation in XLF and XLHPF is described in Section 4. Slishman also implemented the generator (2), for which the modulus is not a power of 2, for generator 1 of single-processor RANDOM_NUMBER. Parallel versions of both generators 1 and 2 in RANDOM_NUMBER were implemented by Robert Enenkel for SP (distributed-memory) machines, and by Enenkel and Xinmin Tian[foot2] for SMP (shared-memory) machines. Generator (2) is discussed in detail in Section 5. In Section 2 we describe some elementary mathematical ideas that can be used to compute (3) rapidly. These ideas are used to derive an algorithm to compute pseudorandom numbers in the range (0, 1) in three multiply-adds and pseudorandom numbers in the range (-1, 1) in 3.125 multiply-adds. Also, using the IEEE round-to-nearest mode, we show how the latter computation can be done in three multiply-adds. We now discuss POWER and PowerPC models. Before the advent of vector processors, integer arithmetic was generally faster than floating-point arithmetic. It was then customary to produce pseudorandom numbers using fixed-point arithmetic and then convert these integers to floating point with scaling to get floating-point pseudorandom numbers. When fast floating-point processors became available, it became more economical to produce the random numbers directly in floating point. For example, the first equation of generator (1) could be computed on an RS/6000 by setting 0 < s[sub]0[/sub] < 2[sup]k[/sup], s[sub]0[/sub] odd, and letting, for i [greater or = to] 0, u = fl(as[sub]i[/sub] + 2[sup]52+k[/sup]), v = u [bisected_circle] 2[sup]52+k[/sup], s[sub]i+1[/sub] = fl(as[sub]i[/sub] - v). The above three statements define pseudorandom integers in the range 0 < s[sub]i[/sub] < 2[sup]k[/sup] that are bit-wise identical to the generator (1). However, one usually wants random floating-point numbers in the range (0, 1) or (-1, 1). The formulas above would then require modification; i.e., x[sub]i[/sub] = 2[sup]-k[/sup]s[sub]i[/sub] or x[sub]i[/sub] = 2[sup]-k+1[/sup]s[sub]i[/sub] - 1. However, these computations require an extra multiply-add in addition to the four needed to compute s[sub]i[/sub]. One intention of this paper is to demonstrate that this additional multiply-add can be removed. This results in improvement factors of 4/3 and 4/3.125, which come to 33% and 28% improvements per iteration over computation based on the above four statements. In Section 3 we describe two full-period linear congruential generators, x[sub]i+1[/sub] = (ax[sub]i[/sub] + c) mod 1, (4) that compute pseudorandom numbers in the range (0, 1) in four multiply-adds for c = 2[sup]-k[/sup] and c = a2[sup]-k[/sup]. We show a proof that these generators are EP. The proof involves showing that (1 + a + ... + a[sup]n-1[/sup]) mod q can be computed in log[sub]2[/sub](n) steps. Our four-multiply-add implementation of these generators requires us to avoid conditional operations in the unrolled inner loop. It turns out that ordinary unrolling via vectorization fails. To overcome this failure, we introduce "generalized unrolling," which becomes possible because the generator is EP. In [2, Section 3.6, pp. 170-173], Knuth provides a summary on "how to generate random numbers." He recommends using a linear congruential generator, X <--(aX + c) mod q, (5) that satisfies seven properties. However, he states that this class of generator applies primarily to machine-level coding and hence is not portable. For IEEE arithmetic, it makes sense to choose q = 2[sup]k[/sup]. The fused multiply-add instruction is provided by many computer architectures, including the IBM RS/6000 POWER and PowerPC, IBM S/390* G5, Intel/HP IA-64 Itanium**, Apple Power Mac**, HP Precision Architecture RISC 2.0 (e.g., HP900/800), and SGI MIPS** R-8000.[foot3] Thus, within this more restrictive domain, we can propose our four-multiply-add generator as being "portable." In Section 5 we extend the ideas of the previous sections to the generator (2). The modulus of this generator is 2[sup]31[/sup] - 1, and not a power of 2 as for the previous generators. To achieve high performance, nontrivial changes to the algorithm are required. In Section 6 we describe the performance of these algorithms relative to the generic algorithm of [1]. 2. Three multiplicative congruential generators Let a and y be positive integers less than 2[sup]k[/sup], where k [less or = to] 52. Clearly a and y are IEEE numbers. Let x = 2[sup]-k[/sup]y so that 0 < x < 1. Also, x is an IEEE number. Let p = ax. The base-2 (bit) representation of p is p[sub]1[/sub]p[sub]2[/sub] ... p[sub]k[/sub].p[sub]k+1[/sub]p[sub]k+2[/sub] ... p[sub]2k[/sub], where each p[sub]i[/sub] is 0 or 1. Represent p = I + F, where I = p[sub]1[/sub] ... p[sub]k[/sub] and F = .p[sub]k+1[/sub] ... p[sub]2k[/sub]; i.e., I and F are the integer and fractional parts of p. Note that ax mod 1 = F, that I and F are IEEE numbers, and that 2[sup]52[/sup] [less or = to] p + 2[sup]52[/sup] [less or = to] 2[sup]53[/sup] - 1. Let u = fl(ax + 2[sup]52[/sup]). (6) All 2[sup]52[/sup] positive integers in the range 2[sup]52[/sup] to 2[sup]53[/sup] - 1 are all of the IEEE numbers in that range. Thus, u in (6) is an integer, and since we are in chop mode, u is the largest integer not exceeding p + 2[sup]52[/sup]; i.e., u = p + I. (See Lemma 1 for a detailed constructive proof.) Let v = u [bisected_circle] 2[sup]52[/sup]. (7) The computation in (7) is done exactly, since v is computed using IEEE arithmetic, in which u, 2[sup]52[/sup], and v = I are all IEEE numbers. Now let w = fl(ax - v). (8) On RS/6000, w is the fractional part F of ax because the fused multiply-add instruction always delivers the correct answer when its operands and result are IEEE numbers. Note that ax - v = F. Thus, w = ax mod 1. (9) We have just proved the following. Theorem 1 The computation in (6), (7), and (8) above produces the result (9). The value computed by (8) is bit-wise identical to the infinitely precise result (9). When unrolled, Equations (6), (7), and (8) constitute a three-multiply-add implementation of the random-number generator (3). Let a = 5[sup]13[/sup] and choose 1 < s[sub]0[/sub] < 2[sup]k[/sup] with s[sub]0[/sub] an odd integer and k = 46. Set x[sub]0[/sub] = s[sub]0[/sub]2[sup]-k[/sup]. Then 0 < x[sub]0[/sub] < 1. Note that the seven lower-order bits of x[sub]0[/sub] are zero. In fact, each x[sub]i[/sub], i [greater or = to] 0, given by x[sub]i+1[/sub] = ax[sub]i[/sub] mod 1 (10) has this property. We now discuss some aspects of the Algorithm and Architecture approach [8] and how it relates to unrolling. In Section 1.3 of [1], Bailey et al. describe sample codes for the NAS benchmarks. There were two codes distributed for random-number generation, namely RANDLC and VRANLC; the latter is of interest to this paper. Subroutine VRANLC generates n REAL*8 uniform pseudorandom numbers in the range (0, 1) by using Equation (3). The documentation states that VRANLC is the standard version designed for scalar or RISC systems. A comment in this code states that the DO loop below in (11) which generates the n uniform pseudorandom numbers is not vectorizable. T1 = R23 * A A1 = AINT (T1) A2 = A - T23 * A1 C C Generate N results. This loop is not C vectorizable. C DO 120 I = 1, N C C Break X into two parts such that C X = 2[sup]23[/sup] * X1 + X2, compute C Z = A1 * X2 + A2 * X1 (mod 2[sup]23[/sup]), and then C X = 2[sup]23[/sup] * Z + A2 * X2 (mod 2[sup]46[/sup]). C T1 = R23 * X X1 = AINT (T1) X2 = X - T23 * X1 T1 = A1 * X2 + A2 * X1 T2 = AINT (R23 * T1) Z = T1 - T23 * T2 T3 = T23 * Z + A2 * X2 T4 = AINT (R46 * T3) X = T3 - T46 * T4 Y(I)= R46 * X 120 CONTINUE (11) We believe the authors' statement about the code in (11). Today's compilers are not able to vectorize this loop. On the other hand, Equation (3) is vectorizable. The Algorithm and Architecture approach to handling Equation (3) would be to rewrite the code so that it performs well when run on some actual machine, e.g., an RS/6000 RISC-type machine. In the present case we would also need to unroll Equations (6), (7), and (8). Please note that the code in (11) can be replaced by the following code, in which TP52 = 2[sup]52[/sup]: DO i = 0, n - 1 u = a * x(i) + TP52 v = u - TP52 x(i+1) = a * x(i) - v ENDDO (12) However, the above code will not execute in three cycles because of pipeline delays. This code is also not vectorizable by a compiler. In producing the code in (12), we have used the extra accuracy feature of the fused multiply-add instruction that is present on RS/6000 and PowerPC machines. Also, the code in (12) is equivalent to using Equation (9). A compiler cannot recognize this fact. This is where we use the Architecture feature of our approach. We now show how to "unroll" the code in (12). In doing so, we vectorize the code in (12). Let a[sub]j[/sub] = a[sup]j[/sup] mod 2[sup]k[/sup] for 1 [less or = to] j [less or = to] m. Note that, from (3), s[sub]i+j[/sub] = a[sup]j[/sup]s[sub]i[/sub] mod q = a[sub]j[/sub]s[sub]i[/sub] mod q, so that a[sub]j[/sub]s[sub]i[/sub] mod q x[sub]i+j[/sub] = -------------------------------- q = a[sub]j[/sub]x[sub]i[/sub] mod 1, 1 [less or = to] j [less or = to] m. (13) Thus, given x[sub]i[/sub] and (13), we have defined the next m iterations of (10). This unrolling of (10) by m constitutes using a vector of length m to compute these next m iterations of (10). In effect, RS/6000 machines can be viewed as vector machines with a short vector length. Because RS/6000 machines possess functional parallelism (see [9]), RS/6000 machines have very low vector start-up costs. This fact implies that using a short vector length is not a performance drawback. However, to achieve this vectorization it must be programmed. Now let m = 4. Thus, the code in (12) becomes DO i = 0, n - 4, 4 u = a * x(i) + TP52 u2 = a2 * x(i) + TP52 u3 = a3 * x(i) + TP52 u4 = a4 * x(i) + TP52 v = u - TP52 v2 = u2 - TP52 v3 = u3 - TP52 v4 = u4 - TP52 x(i+1) = a * x(i) - v x(i+2) = a2 * x(i) - v2 x(i+3) = a3 * x(i) - v3 x(i+4) = a4 * x(i) - v4 ENDDO The above code eliminates most of the pipeline delays. Every target of an FMA in that code is separated by three independent FMAs. However, the last FMA of the loop is followed by the first FMA of the loop with i replaced by i + 4; thus, there is no separation between the target x(i + 4) and the target u. In order to eliminate all pipeline delays, we need a more complicated "unrolling" of the code in (12). Consider xi = x(4) xi3 = x(3) xi2 = x(2) xi1 = x(1) DO i = 0, n - 8, 8 u4 = a4 * xi + TP52 u3 = a3 * xi + TP52 u2 = a2 * xi + TP52 u = a * xi + TP52 x(i+3) = xi3 x(i+4) = xi v4 = u4 - TP52 v3 = u3 - TP52 v2 = u2 - TP52 v = u - TP52 xi0 = a4 * xi - v4 xi3 = a3 * xi - v3 x(i+1) = xi1 x(i+2) = xi2 xi2 = a2 * xi - v2 xi1 = a * xi - v u4 = a4 * xi0 + TP52 u3 = a3 * xi0 + TP52 u2 = a2 * xi0 + TP52 u = a * xi0 + TP52 x(i+7) = xi3 x(i+8) = xi0 v4 = u4 - TP52 v3 = u3 - TP52 v2 = u2 - TP52 v = u - TP52 xi = a4 * xi0 - v4 xi3 = a3 * xi0 - v3 x(i+5) = xi1 x(i+6) = xi2 xi2 = a2 * xi0 - v2 xi1 = a * xi0 - v ENDDO (14) The code in (14) eliminates all pipeline delays and hence should execute at the rate of one pseudorandom number every three multiply-adds. To start the pipeline, we need to precompute outside of the loop the first four pseudorandom numbers, x(i), i = 1, ... , 4. The code in (14) is unrolled by 8. Now we demonstrate another feature of the Algorithm and Architecture approach. Suppose we decide to use (13) to unroll by m = 8. We have seen that a compiler cannot unroll the present loop (12). We were forced to program the unrolling and to introduce the concept of a "vector instruction" into RS/6000 coding. We now claim that this necessity of having to program a "vector instruction" gives rise to a feature of superscalar machines that vector machines do not possess. This feature allows us to produce "customized vector instructions." We remark that this is a part of the Algorithm and Architecture approach, and we now illustrate this concept with the example of generating uniform pseudorandom numbers in the range (-1, 1). A standard implementation requires an extra operation to convert the range from (0, 1) to (-1, 1). Let c1 = 2[sup]53[/sup] and c2 = 2[sup]53[/sup] - 1. Let 0 < s < 2[sup]k[/sup] with s odd and k = 46. Let x = 2[sup]-k+1[/sup]s. Then 0 < x < 2. Let, for i [greater or = to] 1, u = fl(a[sub]j[/sub]x + c1), (15) v = u [bisected_circle] c2, (16) and x[sub]i+j[/sub] = fl(a[sub]j[/sub]x - v), (17) where 1 [less or = to] j < 8. For j = 8 we let u = fl(a[sub]7[/sub]x + c1), (18) v = u [bisected_circle] c1, (19) x = fl(a[sub]7[/sub]x - v), (20) and x[sub]i+8[/sub] = x [bisected_circle] 1. (21) Equations (15) to (21) define a loop, unrolled by 8, in which the next eight random iterates are computed and also the "seed" x is updated in Equation (20). The first seven iterations cost three multiply-adds, while the last iteration costs four multiply-adds. The functional parallelism of RS/6000 indicates that this loop will execute in 25 multiply-adds, or 25/8 = 3.125 multiply-adds per iteration. We close Section 2 by demonstrating a use of the IEEE "round-to-nearest" rounding mode. Let 0 < u[sub]i[/sub] < 1 be the x[sub]i[/sub] computed by Equation (10). For each u[sub]i[/sub], let x[sub]i[/sub] = 2u[sub]i[/sub] - 1. Then -1 < x[sub]i[/sub] < 1. Let c1 = 2[sup]53[/sup] + 2[sup]k[/sup]. Let SEED be the seed x[sub]0[/sub] for the u[sub]i[/sub] computation given by Equation (10): C use round-to-nearest mode x(0) = TWO * SEED - ONE DO i = 0, n - 1 uc = a * x(i) + c1 v = uc - c1 x(i+1) = a * x(i) - v ENDDO SEED = HALF * x(n) + HALF We now prove the following. Theorem 2 The above code produces results that are bit-wise identical to x[sub]i[/sub] = 2u[sub]i[/sub] - 1, where u[sub]i[/sub] is the x[sub]i[/sub] given by (10). Proof Let au[sub]i[/sub] = I + F, where 0 < F < 1, so u[sub]i+1[/sub] = F. We want to show x[sub]i+1[/sub] = 2F - 1. Let u = ax[sub]i[/sub] + c1. Since a is odd, a = 2a[sub]1[/sub] + 1 for some a[sub]1[/sub], and r = 2(I - a[sub]1[/sub]) + c1 + 2F - 1. Note that -2[sup]k[/sup] < ax[sub]i[/sub] < 2[sup]k[/sup], so that 2[sup]53[/sup] < u < 2[sup]53[/sup] + 2[sup]k+1[/sup]. For IEEE numbers x with exponent 53, namely those x that satisfy 2[sup]53[/sup] [less or = to] x [less or = to] 2[sup]54[/sup] - 2, x is an even integer. For round-to-nearest, the computed value of u, namely uc, is the IEEE number closest to u. We claim uc = c1 + 2(I - a[sub]1[/sub]). Since uc is an even number with exponent 53, it is an IEEE number. Also, u - uc = 2F - 1 or -1 < u - uc < 1 as 0 < F < 1. Thus, uc is the closest IEEE number to u. The fused multiply-add instruction computes v and x[sub]i+1[/sub] exactly. Thus, v = 2(I - a[sub]1[/sub]) and x[sub]i+1[/sub] = 2F - 1. 3. Two full-period generators We now specify a high-level description of the full-period generators. Recall that the full-cycle generator is a linear congruential generator of the form (5) with q = 2[sup]k[/sup]. We follow the recommendation of Knuth (see [2, p. 171]) and choose the value of c to be 1 or a. These two choices of c give rise to the two versions of our full-period generator. These generators have implementations that are nearly the same as the three-multiply-add generator given by Equations (6), (7), and (8). Let x be the current iterate, in which 0 < x < 1. Then the following four multiply-add statements in (22) to (25) below describe the full-period generator: u = fl(ax + 2[sup]52[/sup]), (22) v = u [bisected_circle] 2[sup]52[/sup], (23) w = fl(ax - v), (24) x = w [xor] [c_bar]. (25) In (25), the last operation, x = w [xor] [c_bar], computes the next random number. In (25), [c_bar] = 2[sup]-k[/sup] when c = 1, and [c_bar] = a2[sup]-k[/sup] when c = a. We first consider the case [c_bar] = 2[sup]-k[/sup]. We were not able to reduce the number of multiply-adds to three as we did in Equations (15) to (21). We would need to define c2 = 2[sup]52[/sup] + 2[sup]-46[/sup], and this value cannot be represented in a double-precision word. In Equation (25), note that w < 1. This fact follows from (9). Since [c_bar] is the smallest random number, we are guaranteed that the new x [less or = to] 1. In fact, x will equal 1 only once in 2[sup]k[/sup] iterations of the generator. For k = 46, this is a very rare event. Suppose x = 1. Then (22), (23), and (24) compute w = 0 because x is an integer. However, the new x = [c_bar], and we reach the conclusion that the generator is self-correcting! This feature is very important (see [3] for another example), because it is not necessary to "check" x during each iteration of the calculation. Suppose we chose 2[sup]-46[/sup] < [c_bar] < 1. Then, x in (25) could be greater than 1. In that case we would have to test and possibly correct x during every iteration: IF (x .gt. 1) x = x - 1 (26) A conditional statement of this sort, when present in a pipelined inner loop, can significantly degrade performance. Thus, the choice of [c_bar] = 2[sup]-k[/sup] is essential to making the performance of the full-cycle generator fast. To obtain a new random number every four multiply-adds, it is necessary to unroll (22), (23), (24), and (25) by at least a factor of 2. Our numerical experiments on POWER2 showed that unrolling by 8 was necessary. The code for a linear congruential generator without unrolling is DO i = 0, n - 1 u = a * x(i) + TP52 v = u - TP52 w = a * x(i) - v x(i+1) = w + [c_bar] ENDDO (27) Because of pipeline delays, this code will not execute at the rate of one random number produced every four multiply-adds. It turns out that unrolling via "program vectorization" described in Section 2 does not work well. To see this, note that x[sub]j+i[/sub] = a[sub]j[/sub]x[sub]i[/sub] + [c_bar][sub]j[/sub], (28) where a[sub]j[/sub] = a[sup]j[/sup] mod 2[sup]k[/sup] and [c_bar][sub]j[/sub] = [(1 + a + ... + a[sup]j-1[/sup]) mod 2[sup]k[/sup]][c_bar]. The [c_bar][sub]j[/sub]'s are now variable! We have previously seen that the choice of [c_bar] = 2[sup]-k[/sup] was essential in order to maintain good performance. Any other value of [c_bar] gave rise to the use of conditional statements in the code, such as the statement in (26). Thus, the type of unrolling that comes from ordinary vectorization fails to provide peak performance. However, another form of unrolling works. In the present context, we call this "generalized unrolling." Let N = 2n be an arbitrary even integer. Suppose that we can compute a[sub]n[/sub] and [c_bar][sub]n[/sub] of Equation (28) in log[sub]2[/sub](n) steps. Then we can unroll the code by 2 in (27) as follows: DO i = 0, n - 1 u = a * x(i) + TP52 u1 = a * x(i+n) + TP52 v = u - TP52 v1 = u1 - TP52 w = a * x(i) - v w1 = a * x(i+n) - v1 x(i+1) = w + [c_bar] x(n+i+1) = w1 + [c_bar] ENDDO (29) The code in (29) constitutes "generalized unrolling." We have divided the vector x(0:N - 1) into two vectors x(0:n - 1) and y(0:n - 1) = x(n:N - 1) and worked on them independently. This type of unrolling gives rise to a form of single-instruction, multiple-data (SIMD) parallelism on a single processor. In [9] we also exploit this form of SIMD parallelism on POWER2 machines. The crucial point of the code in (29) for the present application is that the two vectors x and y both use the same [c_bar]. We show how to compute [c_bar][sub]n[/sub] in log[sub]2[/sub](n) steps. First note that [c_bar][sub]j+i[/sub] mod 1 = (a[sup]j[/sup] mod 2[sup]k[/sup])([c_bar][sub]i[/sub] mod 1) + [c_bar][sub]j[/sub] mod 1 (30) and a[sup]j+i[/sup] mod 2[sup]k[/sup] = (a[sup]j[/sup] mod 2[sup]k[/sup])(a[sup]i[/sup] mod 2[sup]k[/sup]) (31) hold for all i and j. We construct a table = TBL(2, 0:46) whose ith entries TBL(1:2, i) are a[sup]2[sup]i[/sup][/sup] and [c_bar][sub]2[sup]i[/sup][/sub], for 0 [less or = to] i [less or = to] 46. Note that TBL(1, i + 1) = TBL(1, i) * TBL(1, i) mod 2[sup]k[/sup], (32) TBL(2, i + 1) = [TBL(1, i) + 1] * TBL(2, i) mod< 1, (33) along with TBL(1, 0) = a and TBL(2, 0) = 2[sup]-46[/sup], generates the table in 46 steps. We can now, given x[sub]0[/sub] and the precomputed table, use Equations (28), (30), and (31) to compute x[sub]n[/sub] in log[sub]2[/sub](n) steps as follows: t = x(0) n0 = n lb = 0 1 nn = n0/2 IF (n0 .gt. 2 * nn) THEN u = t * TBL (1, lb) + TP52 v = u - TP52 w = t * TBL (1, lb) - v t = t + TBL (2, lb) IF (t .gt. 1) t = t - 1 ENDIF lb = lb + 1 n0 = nn IF (n0 .gt. 0) goto 1 x(n) = t (34) We now consider the second generator; this generator has [c_bar] = a2[sup]-k[/sup]. Consider the following code, in which TPMK = 2[sup]-k[/sup]: DO i = 0, n - 1 u = x(i) + TPMK v = a * u + TP52 w = v - TP52 x(i+1) = a * u - w ENDDO (35) Here we have 0 [less or = to] x[sub]i[/sub] < 1. When x[sub]i[/sub] = 1 - 2[sup]-k[/sup], x[sub]i+1[/sub] = 0 as u = 1. Again, this generator is self-correcting. The unrolled-by-2 version of (35) becomes DO i = 0, n - 1 u = x(i) + TPMK u1 = x(i+n) + TPMK v = a * u + TP52 v1 = a * u1 + TP52 w = v - TP52 w1 = v1 - TP52 x(i+1) = a * u - w x(n+i+1) = a * u1 - w1 ENDDO (36) Equations (30), (31), (32), and (34) remain unchanged. Equation (33) is modified to TBL(2, i + 1) = [TBL(1, i) + a] * TBL(2, i) mod 1. (37) 4. Parallel implementation for HPF The IBM XL Fortran (XLF) [4] and XL High Performance Fortran (HPF) [5] languages include a RANDOM_NUMBER intrinsic function that implements two multiplicative congruential generators. These generators are also part of PESSL, the IBM Parallel Engineering and Scientific Subroutine Library for AIX [10]. Generator 1 is of type (2), with a = 7[sup]5[/sup], k = 31, and modulus q = 2[sup]k[/sup] - 1. Generator 2 is of type (3), with a = 44485709377909, k = 48, and modulus q = 2[sup]k[/sup]. The generator that is to be used is selected by setting the generator argument of RANDOM_NUMBER to 1 or 2, respectively. In this section, we discuss implementation issues arising from the parallel directives in HPF for the modulus 2[sup]48[/sup] generator. The modulus 2[sup]31[/sup] - 1 generator is discussed in Section 5. HPF parallel directives The HPF statement CALL RANDOM_NUMBER (A) fills the scalar, vector, or array A with random numbers. HPF provides directives (ALIGN, DISTRIBUTE, BLOCK, CYCLIC, etc.) that allow the programmer to specify what elements of A are to reside on what processor. To achieve high performance, the parallel algorithm works by computing on each processor only those random numbers resident on that processor, ultimately achieving linear speedups for large quantities of random numbers. The N random numbers required on a particular processor j [member] {0, ..., P - 1} are sub-sequences of x[sub]0[/sub], x[sub]1[/sub], x[sub]2[/sub], ..., x[sub]N[/sub] from (3) of the form x[sub]i[sub]j[/sub][/sub], x[sub]i[sub]j[/sub] + k[/sub], x[sub]i[sub]j[/sub][/sub][sub] + 2k[/sub], ..., x[sub]i[sub]j[/sub] + (N - 1)k[/sub], (38) where k is called the stride of the sequence. If A is BLOCK-distributed, k = 1 and i[sub]j[/sub] = jN, j = 0, ..., P - 1. If A is CYCLIC-distributed, k = P and i[sub]j[/sub] = j, j = 0, ..., P - 1. A contiguous sequence is one with stride k = 1, and one with k > 1 is called strided. (There is also a BLOCK-CYCLIC distribution and other variations which are not discussed here for brevity, but which are handled by applying techniques similar to those presented here.) The rest of this section shows how it is possible to compute (38) in O(N + log i[sub]j[/sub]) time, resulting in asymptotically linear speedup. The algorithm used depends on whether the required sequence is contiguous or strided. Contiguous random-number sequences A contiguous random-number sequence has the form x[sub]i[/sub], x[sub]i+1[/sub], x[sub]i+2[/sub], ..., x[sub]i+N-1[/sub]. Contiguous sequences are required in HPF on each processor when the random numbers are written to a BLOCK-distributed vector or array. Once the starting number x[sub]i[/sub] is known for a particular processor, the rest of the sequence can be directly computed by the code (14). The fast calculation of the starting numbers on each processor is dealt with next. Strided random-number sequences If the CYCLIC distribution directive is used in HPF with more than one processor, the sequence of random numbers resident on each processor consists of strided (k > 1) sub-sequences of the form (38). In addition, for a BLOCK distribution, the starting values x[sub]i[sub]j[/sub][/sub], j = 0, ... , P - 1 in (38) must be computed from some available previous x[sub]i[/sub], for example, x[sub]0[/sub]. Both of these situations require the efficient computation of multiple steps of the recurrence (13). That is, given x[sub]i[/sub] and n, we must compute x[sub]i+n[/sub]. The code in (14) does this for several fixed n to achieve the unrolling. However, since n is not known a priori in the current context, we cannot simply precompute the value of a[sub]n[/sub] = a[sup]n[/sup] mod q, (39) and must therefore devise an efficient means to compute it for general n. We now show how to compute (13) in O(log[sub]2[/sub]n) steps, which is crucial to the asymptotically linear speedup of the parallel algorithm. Binary algorithms for exponentiation are described in [2], and one is used in Bailey's random-number generator implementation [1]. We use a binary exponentiation algorithm here, but achieve high performance through the use of the FMA instruction. Let the binary representation of n be n = b[sub]k[/sub] ... b[sub]0[/sub], b[sub]k[/sub] [neq] 0, k = [left_floor]log[sub]2[/sub]n[right_floor], n [neq] 0. Then k n = [summation] 2[sup]j[/sup]b[sub]j[/sub] j=0 and a[sub]n[/sub]x[sub]i[/sub] mod 1 = [(a[sup] [summation][sup]k[/sup][sub]j=0[/sub] 2[sup]j[/sup]b[sub]j[/sub][/sup]) mod q] x[sub]i[/sub] mod 1 k = [([Product] a[sup]2[sup]j[/sup][/sup]) mod q] x[sub]i[/sub] mod 1 j=0 b[sub]j[/sub][neq]0 k = [[Product](a[sup]2[sup]j[/sup][/sup] mod q)] x[sub]i[/sub] mod 1 j=0 b[sub]j[/sub][neq]0 (40) The values T[sub]i[/sub] = a[sup]2[sup]i[/sup][/sup] mod q, i = 0, ... , log[sub]2[/sub]q - 1, are precomputed and stored in a table, allowing the computation of the product (40) in k [less or = to] log[sub]2[/sub]n steps. The product is accumulated in a loop, as in the following pseudocode: t = X(i) DO i = 0, k IF (b[sub]i[/sub] [neq] 0) THEN C Compute t = tT[sub]i[/sub] mod 1. u = t * T(i) + TP52 v = u - TP52 t = t * T(i) - v ENDIF ENDDO X(i + n) = t (41) 5. A modulus 2[sup]31[/sup] - 1 generator We now consider efficient implementation of the generator of type (2) used in RANDOM_NUMBER. This is an extension of the work in the previous section, but is significantly more complicated since the modulus is no longer a power of 2. For the modulus 2[sup]31[/sup] - 1 generator, high performance is achieved by scaling the integer recurrence by 2[sup]-31[/sup], finding a fast implementation of the scaled recurrence, and finally transforming the resulting numbers to the range (0, 1). Both the mathematical issues and the implementation issues arising from the parallel directives in HPF are discussed next. Generator recurrence The random-number sequence x[sub]i[/sub] [member] (0, 1), i = 0, 1, ... (42) produced by the generator is defined by the recurrence s[sub]i+1[/sub] = as[sub]i[/sub] mod q, (43) s[sub]i+1[/sub] x[sub]i+1[/sub] = --------------- (44) q where s[sub]0[/sub] is an integer, 0 < s[sub]0[/sub] < q, and a = 7[sup]5[/sup] = 16807, q = 2[sup]31[/sup] - 1. It has a period of q - 1. This value of q is well suited to an implementation on a machine with 32-bit floating-point arithmetic, such as the IBM S/360 [3]. Although the target architecture of XLF and XLHPF is the IBM RS/6000, which uses IEEE floating-point, this generator has been provided for compatibility with existing applications. The mathematical properties of this generator are discussed in [3]. As for the modulus 2[sup]k[/sup] generator, the algorithm used depends on whether the required sequence is contiguous or strided. Contiguous random-number sequences Before discussing the main contribution of this section, parallel algorithms for strided random-number sequences, it is useful to describe the sequential algorithms (due to G. Slishman[foot4]) on which they are based. Although the ideas here are based on earlier sections of this paper, the implementation for this generator is more involved, since q is not a power of 2. RANDOM_NUMBER uses a sequential implementation of (43) and (44) to produce contiguous random-number sequences of the form x[sub]i[/sub], x[sub]i+1[/sub], x[sub]i+2[/sub], ..., x[sub]i+N-1[/sub], when the starting number x[sub]i[/sub] is known. These sequences are also required by HPF when the random numbers are written to a BLOCK-distributed vector or array. It is convenient for computational purposes to rewrite (43) and (44) in terms of the value y[sub]i[/sub] = 2[sup]-31[/sup]s[sub]i[/sub], (45) so that ay[sub]i[/sub]2[sup]31[/sup] mod q y[sub]i+1[/sub] = ---------------------------------- 2[sup]31[/sup] ay[sub]i[/sub]2[sup]31[/sup] ay[sub]i[/sub]2[sup]31[/sup] - [left_floor]----------------------------[right_floor]q q = ------------------------------------------------------- 2[sup]31[/sup] = ay[sub]i[/sub] ay[sub]i[/sub]2[sup]31[/sup] - [left_floor]----------------------------[right_floor] q (1 - 2[sup]-31[/sup]) = ay[sub]i[/sub] -[left_floor]ay[sub]i[/sub]s[left_floor] (1 - 2[sup]-31[/sup]) (46) where 2[sup]31[/sup] s = -------------- q 2[sup]31[/sup] = ------------------ 2[sup]31[/sup] - 1 1 = ------------------- 1 - 2[sup]-31[/sup] = 1 + 2[sup]-31[/sup] + 2[sup]-62[/sup] + 2[sup]-93[/sup] + ... (47) We now use Lemmas 1, 2, and 6 in the Appendix to transform (46) into an expression (51) that can be efficiently computed with FMAs. Let [s_bar] = 1 + 2[sup]-31[/sup], which is exactly representable as a double-precision IEEE floating-point number. By Lemma 2, we can use [s_bar] in place of s in (48), giving (49). By Lemma 1, we compute [left_floor]ay[sub]i[/sub][s_bar][right_floor], giving (50). [Since ay[sub]i[/sub][s_bar] = 7[sup]5[/sup]2[sup]-31[/sup]s[sub]i[/sub](1 + 2[sup]-31[/sup]) where s[sub]i[/sub] < 2[sup]31[/sup] - 1, ay[sub]i[/sub][s_bar] < 2[sup]15[/sup], satisfying the condition of the lemma.] By Lemma 6, we rearrange the terms in the form of three FMAs (51): y[sub]i+1[/sub] = ay[sub]i[/sub] - [left_floor]ay[sub]i[/sub]s[right_floor] (1 - 2[sup]-31[/sup]) (48) = ay[sub]i[/sub] - [left_floor]ay[sub]i[/sub][s_bar][right_floor] (1 - 2[sup]-31[/sup]) (49) = ay[sub]i[/sub] - (ay[sub]i[/sub][s_bar] [xor] 2[sup]52[/sup] [bisected_circle] 2[sup]52[/sup])* (1 - 2[sup]-31[/sup]) (50) = fl[ay[sub]i[/sub] - (ay[sub]i[/sub][s_bar] [xor] 2[sup]52[/sup])* (1 - 2[sup]-31[/sup]) - 2[sup]52[/sup](1 - 2[sup]-31[/sup])]. (51) The random number x[sub]i+1[/sub] is recovered via (44) and (45) as 2[sup]31[/sup]y[sub]i+1[/sub] x[sub]i+1[/sub] = ----------------------------- 2[sup]31[/sup] - 1 = sy[sub]i+1[/sub]. The floating-point values [y_tilde][sub]i+1[/sub] and [x_tilde][sub]i+1[/sub], corresponding respectively to y[sub]i+1[/sub] and x[sub]i+1[/sub], are computed in RANDOM_NUMBER by the sequence of instructions [y_tilde][sub]0[/sub] = y[sub]0[/sub] and u = fl([y_tilde][sub]i[/sub][a_bar] + 2[sup]52[/sup]), (52) v = fl(uk[sub]1[/sub] - k[sub]2[/sub]), (53) [y_tilde][sub]i+1[/sub] = fl([y_tilde][sub]i[/sub]a - v), (54) [x_tilde][sub]i+1[/sub] = [y_tilde][sub]i+1[/sub] [o times] [s_bar], (55) where [a_bar] = a[s_bar], k[sub]1[/sub] = 1 - 2[sup]-31[/sup], and k[sub]2[/sub] = 2[sup]52[/sup] - 2[sup]21[/sup] are exactly representable in IEEE double precision. The program (52)-(55) computes u = [left_floor][y_tilde][sub]i[/sub][a_bar][right_floor] + 2[sup]52[/sup], v is computed exactly, and it follows that [y_tilde][sub]i+1[/sub] = y[sub]i+1[/sub] and [x_tilde][sub]i+1[/sub] = fl(x[sub]i+1[/sub]). (A detailed proof is given by Lemma 7 in the Appendix.) Although an RS/6000 POWER machine is capable of computing one FMA per clock cycle, the code in (52)-(55) will not execute in four cycles because of pipeline delays. The reason for this is that the result of each FMA is not available for 2-3 cycles, although another independent FMA can be started immediately. Pipeline delays can be eliminated by loop unrolling [as explained in Section 2 for the modulus 2[sup]k[/sup] generator (12)], if a means of efficiently computing y[sub]i+n[/sub], given y[sub]i[/sub], is available. Using (43), (44), and (45), and proceeding as in the derivation of (49), we have s[sub]i+n[/sub] = a[sup]n[/sup]s[sub]i[/sub] mod q = a[sub]n[/sub]s[sub]i[/sub] mod q, y[sub]i+n[/sub] = a[sub]n[/sub]y[sub]i[/sub] - [left_floor]a[sub]n[/sub]y[sub]i[/sub][s_bar][right_floor] (1 - 2[sup]-31[/sup]), (56) where a[sub]n[/sub] = a[sup]n[/sup] mod q. This is computed by the following sequence of instructions: [y_tilde][sub]0[/sub] = y[sub]0[/sub] and u = [y_tilde][sub]i[/sub] [o times] [a_bar][sub]n[/sub], (57) v = fl([y_tilde][sub]i[/sub]a[sub]n[/sub]+u), (58) w = v [xor] 2[sup]52[/sup], (59) x = w [bisected_circle] 2[sup]52[/sup], (60) y = fl([y_tilde][sub]i[/sub]a[sub]n[/sub] - x), (61) [y_tilde][sub]i+n[/sub] = fl(2[sup]-31[/sup]x + y), (62) [x_tilde][sub]i+n[/sub] = [y_tilde][sub]i+n[/sub] [o times] [s_bar], (63) where [a_bar][sub]n[/sub] = 2[sup]-31[/sup]a[sub]n[/sub]. (64) The program (57)-(63) computes x = [left_floor]a[sub]n[/sub][y_tilde][sub]i[/sub][s_bar][right_floor], y is computed exactly, and it follows that [y_tilde][sub]i+1[/sub] = y[sub]i+1[/sub] and [x_tilde][sub]i+1[/sub] = fl(x[sub]i+1[/sub]). (A detailed proof is given by Lemma 8 in the Appendix.) Unlike the modulus 2[sup]k[/sup] generator, the modulus 2[sup]31[/sup] - 1 generator requires more instructions to compute a strided random-number sequence than a contiguous one. An unrolled loop based on (57)-(63) takes seven (instead of four) cycles per number on an RS/6000 POWER machine. Therefore, the direct unrolling used for the 2[sup]k[/sup] generator in (14) does not achieve maximum performance for the 2[sup]31[/sup] - 1 generator. Instead, the sequential implementation of the 2[sup]31[/sup] - 1 generator in RANDOM_NUMBER produces contiguous random-number sequences by using two nested unrolled loops based on (57)-(63) and (52)-(55), analogous to the generalized unrolling in Section 3. For a fixed, preselected batch size m = 32, precomputed values of am = a[sub]m[/sub], a2m = a[sub]2m[/sub], a3m = a[sub]3m[/sub], and abar = [a_bar], ambar = [a_bar][sub]m[/sub], a2mbar = [a_bar][sub]2m[/sub], a3mbar = [a_bar][sub]3m[/sub] are kept. Given an initial seed yi = y[sub]i[/sub], the following code (65) then computes random numbers x(0), ..., x(N) in batches of size 4m. (We assume that N is a multiple of 4m for simplicity.) It takes 18 + 16m cycles per 4m random numbers, for an average of 4.14 cycles per number. This approach is also adopted in the following XLHPF 1.2 parallel algorithm for generating contiguous random-number sequences, which are required when the RANDOM_NUMBER argument vector or array is BLOCK-distributed: DO i = 0, N - 4 * m, 4 * m C (57)-(63) unrolled by 3 u1 = yi * ambar u2 = yi * a2mbar u3 = yi * a3mbar v1 = yi * am + u1 v2 = yi * a2m + u2 v3 = yi * a3m + u3 w1 = v1 + TP52 w2 = v2 + TP52 w3 = v3 + TP52 x1 = w1 - TP52 x2 = w2 - TP52 x3 = w3 - TP52 y1 = yi * am - x1 y2 = yi * a2m - x2 y3 = yi * a3m - x3 yipm = TPM31 * x1 + y1 yip2m = TPM31 * x2 + y2 yip3m = TPM31 * x3 + y3 DO j = 1, m C (52)-(55) unrolled by 4 u = yi * abar + TP52 u1 = yipm * abar + TP52 u2 = yip2m * abar + TP52 u3 = yip3m * abar + TP52 v = u * k1 - k2 v1 = u1 * k1 - k2 v2 = u2 * k1 - k2 v3 = u3 * k1 - k2 yj = yi * a - v yjpm = yipm * a - v1 yjp2m = yip2m * a - v2 yjp3m = yip3m * a - v3 x(i+j) = yj * sbar x(i+m+j) = yjpm * sbar x(i+2 * m+j) = yjp2m * sbar x(i+3 * m+j) = yjp3m * sbar END DO yi = yjp3m END DO (65) Strided random-number sequences If the CYCLIC distribution directive is used in HPF with more than one processor, the sequence of random numbers resident on each processor consists of strided (k > 1) sub-sequences of the form (38). In addition, for a BLOCK distribution, the starting values x[sub]i[sub]j[/sub][/sub], j = 0, ... , P - 1, in (38) must be computed from some available previous x[sub]i[/sub], for example x[sub]0[/sub]. Both of these situations require the efficient computation of multiple steps of the recurrence (43), (44). That is, given y[sub]i[/sub] and n in (56), we must compute y[sub]i+n[/sub]. This is performed by the code in (57)-(63). However, since n is not known a priori, we cannot simply precompute the value of a[sub]n[/sub] = a[sup]n[/sup] mod q, and must therefore devise an efficient means to compute it for general n. We now consider how to compute a[sub]n[/sub] in O(log[sub]2[/sub]n) steps, which is crucial to the asymptotically linear speedup of the parallel algorithm. The outline of the approach is similar to (41), but it is complicated by the need to compute the product modulo q instead of 1: a[sub]n[/sub] = 1 DO i = 0, k IF (b[sub]i[/sub] [neq] 0) THEN a[sub]n[/sub] = a[sub]n[/sub]T[sub]i[/sub] mod q ENDIF ENDDO (66) A means for efficiently computing the modulus operation in (66) using IEEE double-precision arithmetic and the RS/6000 FMA instruction is a main contribution of this section, and is derived next. Computing products modulo q We now consider how to efficiently compute modular products of the form (66). (This is a more detailed description of results that were briefly outlined in [11].) Let d = bc mod q, where b and c are integers representable in IEEE double precision (IEEE integers, for short), with 0 < b < q and 0 < c < q, where b and c are powers of a, modulo q. [In particular, the values b = a[sub]n[/sub] and c = T(i) from (66) satisfy these conditions.] Then bc d = bc - [left_floor]--[right_floor]q. q By (47), 1 - = 2[sup]-31[/sup]s, (67) q so that d = bc - [left_floor]bc2[sup]-31[/sup]s[right_floor](2[sup]31[/sup]-1). By Lemma 3 in the Appendix, it follows that the infinite series s can be replaced by its first two terms; that is, d = bc - [left_floor]bc2[sup]-31[/sup][s_bar][right_floor] (2[sup]31[/sup] - 1). (68) [In Lemma 3 we use m = bc < q[sup]2[/sup] < 2[sup]62[/sup] and the condition that q [product_x] m is satisfied, since q [product_x] b and q [product_x] c (b < q and c < q) and q is prime.] We now show how (68) can be computed. Theorem 3 Let b and c be integers representable in IEEE double precision (IEEE integers), with 0 < b < 2[sup]31[/sup] - 1 and 0 < c < 2[sup]31[/sup] - 1. Let d = bc mod (2[sup]31[/sup] - 1), and compute as follows using IEEE double-precision arithmetic with the round-to-zero rounding mode: u = (2[sup]-62[/sup]b) [o times] c, (69) v = fl[(2[sup]-31[/sup]b)c + u], (70) w = v [xor] 2[sup]52[/sup], (71) x = w [bisected_circle] 2[sup]52[/sup], (72) y = 2[sup]31[/sup] [o times] x, (73) z = fl(bc - y), (74) [d_tilde] = z [xor] x. (75) Then [d_tilde] = d. Proof (See Appendix.) In RANDOM_NUMBER, strided random-number sequences are computed by an unrolled loop based on (57)-(63), similar to (76) below. Given a stride k, ai = a[sub]ki[/sub], i = 1, ..., 4, are first computed by (66) and (69)-(75). Let aibar = [a_bar][sub]ki[/sub], i = 1, ..., 4. Starting with y1 = y[sub]i-3[/sub], y2 = y[sub]i-2[/sub], y3 = y[sub]i-1[/sub], yi = y[sub]i[/sub], the unrolled loop (76) computes strided random numbers X(i) = x[sub]ki[/sub], i = 1, ..., N. It takes seven cycles per random number. DO i = 0, N - 4, 4 u4 = yi * a4bar u3 = yi * a3bar u2 = yi * a2bar u1 = yi * a1bar X(i+1) = y1 * sbar v4 = yi * a4 + u4 v3 = yi * a3 + u3 v2 = yi * a2 + u2 v1 = yi * a1 + u1 X(i+2) = y2 * sbar w4 = v4 + TP52 w3 = v3 + TP52 w2 = v2 + TP52 w1 = v1 + TP52 X(i+3) = y3 * sbar x4 = w4 - TP52 x2 = w2 - TP52 x3 = w3 - TP52 x1 = w1 - TP52 X(i+4) = yi * sbar z4 = yi * a4 - x4 z3 = yi * a3 - x3 z2 = yi * a2 - x2 z1 = yi * a1 - x1 yi = TPM31 * x4 + z4 y3 = TPM31 * x3 + z3 y2 = TPM31 * x2 + z2 y1 = TPM31 * x1 + z1 END DO (76) 6. POWER2 timing results We now give timings for the generators described in Sections 2 and 3. We have confined our timing studies to the RS/6000 POWER2 Model 590. POWER2 is capable of producing two FMAs every cycle, with a pipeline delay of two or three cycles [8, 9]. The POWER2 Model 590 has a clock cycle of 15 ns. The code in Equation (14) gives a pipeline delay of two cycles. To be safe we have doubled the unrolling from 8 to 16 in the actual code used for the timing below. (There are enough floating-point registers on the POWER2 to allow one to double the unrolling factor without negative impact. Running at the lesser unrolling, however, did not affect performance in any measurable way.) Since each uniform random number in the range (0, 1) costs three FMAs, the peak possible rate of random-number generation is 44.44 million per second. We measured performance of the (0, 1) generator for batches of double-word random numbers of size n = 2[sup]i[/sup], where i = 12 to 21. The size of the 590 cache is 2[sup]15[/sup] doublewords, and the size of its TLB is 2[sup]18[/sup] doublewords. All timing measurements were done using the XLF RTC (real-time clock) utility, and represent actual elapsed "wall-clock" time, including system time, etc. For i between 14 and 21, the performance was essentially constant. It varied between 42.90 and 43.50 million random numbers per second. For i = 12 and 13, the values were 39.33 and 41.79. This dropoff is due to a fixed setup cost for the generator. The setup cost consists of saving and restoring the user rounding mode, setting the rounding mode to chop, initializing the unrolled loop so that stores to memory are optimal, and the completion of the loop modulo the unrolled count. Without the setup cost, we measured a rate of 42.33 million random numbers per second for i = 12. For large n this cost becomes negligible. The (0, 1) generator runs at 98% of the peak obtainable rate. The result was independent of a and k. We tested two generators where a = 5[sup]13[/sup] and k = 46, and a = 44485709377909 and k = 48. The generic generator of Bailey et al. [1] ran at the rate of 810000 pseudorandom numbers for the data in cache and 803000 for data not in cache. Thus, the new generator is 53 times faster than the generic generator for both data in cache and data not in cache. We tested two versions of the (-1, 1) generator. For one of these, the number of cycles for 16 random numbers was 25. This is the 3.125-FMA generator, in which rounding is toward zero. The second one produced 16 random numbers in 24 cycles. This is the three-FMA generator, where rounding is to nearest. The results for the round-to-nearest (-1, 1) generator were identical to the results for the (0, 1) generator. Returning to the 3.125-FMA (-1, 1) generator, for i = 14 to 21 the performance was essentially constant; it varied between 41.15 and 40.52 million random numbers per second. The peak obtainable rate is 24/25 of 44.44 = 42.67 million random numbers per second. The measured results were at 96.4% of the peak obtainable rate. For i = 12 and 13, the rates were 37.53 and 39.91 million random numbers per second. Here again we see the effect of the fixed setup cost. In summary, for large enough batches of numbers the multiplicative random-number generators deliver more than 40 million random numbers per second regardless of whether the batch size fits in cache. Now we discuss the four-FMA linear congruential generator. Here we chose batches of size 2[sup]i[/sup] for i = 12, 13, 14, and 15. For i = 12, 13, and 14, we generated 27.62, 29.35, and 30.14 million random numbers per second. The peak possible rate is 33.33 million per second. Thus, our measured rate is about 90% of the peak obtainable. The smaller value for i = 12 was due to the fixed setup cost for this generator. For i = 15, the result was 26.40 million random numbers per second. For larger values of i, this rate per second dropped off very sharply because of cache and TLB thrashing. To alleviate this problem, we set n = 2[sup]i[/sup] - 544 * 8. The number 544 is the sum of a page plus a line in doublewords. The factor of 8 was obtained by dividing n by 8 to set up eight different store queues. However, almost any other value of n would have worked equally well. We tried new batches of size n, where i = 15 to 21. The rates were 30.32, 29.76, 29.91, 29.76, 29.76, 29.75, and 29.76 million random numbers per second. In summary, the four-FMA generator delivers about 30 million pseudorandom numbers per second both for data in cache and data out of cache. 7. Conclusions In this paper, we have given several illustrations of a general technique called the Algorithm and Architecture approach [11]. We have used algorithmic innovation and the FMA instruction in the design of several uniformly distributed pseudorandom-number generators for the intervals (0, 1) and (-1, 1). We have implemented multiplicative congruential pseudorandom-number generators, for the ranges (0, 1) and (-1, 1), of the form s[sub]i+1[/sub] = as[sub]i[/sub] mod 2[sup]k[/sup], x[sub]i+1[/sub] = 2[sup]-k[/sup]s[sub]i+1[/sub] or 2[sup]-k+1[/sup]s[sub]i+1[/sub] - 1, (77) which have a period of 2[sup]k-2[/sup]. We have shown that the theoretical cost per random number for the two ranges is respectively 3 and 3.125 multiply-adds on RS/6000 processors. Our codes, on the IBM POWER2 Model 590, run at 98% and 96.4% of the peak obtainable rate, respectively, and produce more than 40 million uniformly distributed pseudorandom numbers per second for both ranges. Additionally, our code sustains the 40 million per second rate for data out of cache. Our code is about 50 times faster than the generic implementation with k = 46 given in the NAS parallel benchmarks [1], while producing bit-wise identical results. We also implemented a linear congruential generator of the form s[sub]i+1[/sub] = (as[sub]i[/sub] + c) mod 2[sup]k[/sup], x[sub]i+1[/sub] = 2[sup]-k[/sup]s[sub]i+1[/sub], (78) which has the full period of 2[sup]k[/sup]. We have shown that the theoretical cost per random number [in the range (0, 1)] for this generator is four multiply-adds on RS/6000 processors. Our code, on the IBM POWER2 Model 590, runs at about 90% of the peak obtainable rate and produces more than 30 million uniformly distributed pseudorandom numbers per second. Finally, we implemented a multiplicative-congruential generator for the interval (0, 1) of the form s[sub]i + 1[/sub] = as[sub]i[/sub] mod (2[sup]k[/sup] - 1), s[sub]i+1[/sub] x[sub]i+1[/sub] = ----------------- , (79) 2[sup]k[/sup] - 1 for which the modulus is not a power of 2. Generators of type (77) and (79), for the interval (0, 1), are available in the RANDOM_NUMBER intrinsic function of IBM XL Fortran [4] and XL High Performance Fortran [5]. All of the generators reported here are "embarrassingly parallel." Using an IBM SP2 machine with 250 POWER2 wide nodes, it is possible to compute more than ten billion uniform random numbers in a second. 8. Appendix This section contains detailed proofs for the lemmas used in the paper. We first prove a result that allows the floor operation to be computed quickly using IEEE arithmetic. Lemma 1 Let v be representable in IEEE double precision, with 0 [less or = to] v < 2[sup]52[/sup]. Then, with the round-to-zero rounding mode, v [xor] 2[sup]52[/sup] = [left_floor]v[right_floor] + 2[sup]52[/sup] and (v [xor] 2[sup]52[/sup]) [bisected_circle] 2[sup]52[/sup] = [left_floor]v[right_floor]. Proof The lemma is trivially true for v = 0. Therefore, assume that v > 0. Since v is IEEE, it can be written in binary as v = b[sub]0[/sub] . b[sub]1[/sub] ... b[sub]52[/sub] x 2[sup]e[/sup], b[sub]0[/sub] = 1, where e < 52 since v < 2[sup]52[/sup]. Then, [left_floor]v[right_floor] = { b[sub]0[/sub] ... b[sub]e[/sub] e [greater or = to] 0, { 0 e < 0. Clearly [left_floor]v[right_floor] is an IEEE number. Suppose first that e [greater or = to] 0. Then v [xor] 2[sup]52[/sup] performs the following addition, the result of which is truncated to 53 bits. The number of zeros inserted for the normalization of v is k = 52 - e, and k [greater or = to] 1 since e < 52: v = 0 . 0 ... 0 b[sub]0[/sub] ... b[sub]e[/sub] b[sub]e+1[/sub] ... b[sub]52[/sub] x 2[sup]52[/sup], 2[sup]52[/sup] = 1 . x 2[sup]52[/sup], v [xor] 2[sup]52[/sup] = 1 . 0 ... 0 b[sub]0[/sub] ... b[sub]e[/sub] x 2[sup]52[/sup]. If e < 0, then k > 52 and (v [xor] 2[sup]52[/sup]) = 2[sup]52[/sup]. Thus, v [xor] 2[sup]52[/sup] = [left_floor]v[right_floor] + 2[sup]52[/sup], and this establishes the first part of Lemma 1. Now (v [xor] 2[sup]52[/sup]) [bisected_circle] 2[sup]52[/sup] = [left_floor]v[right_floor], because the operands and result are IEEE numbers. [end] Corollary 1 Let x and y be IEEE double-precision numbers satisfying 0 [less or = to] xy < 2[sup]52[/sup]. Then fl(xy + 2[sup]52[/sup]) = [left_floor]xy[right_floor] + 2[sup]52[/sup] and fl(xy + 2[sup]52[/sup]) [bisected_circle] 2[sup]52[/sup] = [left_floor]xy[right_floor]. Proof Use Lemma 1, except that v [xor] 2[sup]52[/sup] in the lemma is replaced by fl(xy + 2[sup]52[/sup]), and also xy = 0 . 0 ... 0 b[sub]0[/sub] ... b[sub]e[/sub] b[sub]e+1[/sub] ... b[sub]105[/sub] x 2[sup]52[/sup] 2[sup]52[/sup] = 1 . x 2[sup]52[/sup] fl(xy + 2[sup]52[/sup]) = 1 . 0 ... 0 b[sub]1[/sub] ... b[sub]e[/sub] x 2[sup]52[/sup]. Therefore fl(xy + 2[sup]52[/sup]) = [left_floor]xy[right_floor] + 2[sup]52[/sup], and the result follows. [end] Lemma 2 Let a = 7[sup]5[/sup], y[sub]i[/sub] = 2[sup]-31[/sup]s[sub]i[/sub], where s[sub]i[/sub] is an integer with 0 < s[sub]i[/sub] < q = 2[sup]31[/sup] - 1, and [s_bar] = 1 + 2[sup]-31[/sup], s = 1 + 2[sup]-31[/sup] + 2[sup]-62[/sup] + ... . Then [left_floor]ay[sub]i[/sub]s[right_floor] = [left_floor]ay[sub]i[/sub][s_bar][right_floor]. Proof Let m = 2[sup]31[/sup]ay[sub]i[/sub]. Then Lemma 3 gives the required result, provided we can show that m satisfies the conditions of Lemma 3. First, we show that m is an integer, m = 2[sup]31[/sup]ay[sub]i[/sub] = 2[sup]31[/sup]7[sup]5[/sup]2[sup]-31[/sup]s[sub]i[/sub] = 7[sup]5[/sup]s[sub]i[/sub], (80) which is an integer because s[sub]i[/sub] is. Second, we show that 0 < m < 2[sup]62[/sup]. From (80), m > 0 since s[sub]i[/sub] is. Also from (80), m < 7[sup]5[/sup]2[sup]31[/sup] < 8[sup]5[/sup]2[sup]31[/sup] = 2[sup]46[/sup] < 2[sup]62[/sup]. Third, we show that q [product_x] m. Since q is prime, if q|m, then q|a or q|s[sub]i[/sub]. But q > a and q > s[sub]i[/sub]; thus, q [product_x] m. [end] Lemma 3 Let m [member] Z, 0 < m < 2[sup]62[/sup], (2[sup]31[/sup] - 1) [product_x] m. Then [left_floor]2[sup]-31[/sup]m + 2[sup]-62[/sup]m[right_floor] = [left_floor]2[sup]-31[/sup]m + 2[sup]-62[/sup]m + 2[sup]-93[/sup]m + 2[sup]-124[/sup]m + ... [right_floor]. Proof Let the binary representation of m be m = b[sub]1[/sub]b[sub]2[/sub] ... b[sub]62[/sub], where each b[sub]i[/sub] is either 0 or 1. Then, 2[sup]-31[/sup]m = b[sub]1[/sub] ... b[sub]31[/sub] . b[sub]32[/sub] ... b[sub]62[/sub], 2[sup]-62[/sup]m = . b[sub]1[/sub] ... b[sub]31[/sub] b[sub]32[/sub] ... b[sub]62[/sub], 2[sup]-93[/sup]m = . 0 ... 0 b[sub]1[/sub] ... b[sub]31[/sub] b[sub]32[/sub] ... b[sub]62[/sub]. Let g = 2[sup]-31[/sup]m + 2[sup]-62[/sup]m and h[sub]3[/sub] = 2[sup]-93[/sup]m, h[sub]4[/sub] = 2[sup]-93[/sup]m + 2[sup]-124[/sup]m, . . . i h[sub]i[/sub] = [summation] 2[sup]-31j[/sup]m, i = 3, 4, ..., j=3 [infinity] h = [summation] 2[sup]-31j[/sup]m, j=3 = lim h[sub]i[/sub] i --> [infinity] (81) We must show that [left_floor]g + h[right_floor] = [left_floor]g[right_floor]. Suppose, to get a contradiction, that [left_floor]g + h[right_floor] [neq] [left_floor]g[right_floor]. Then adding h to g must cause the integer part of g to change. Since the 2[sup]-1[/sup] to 2[sup]-31[/sup] bits of h are all zero, this means that either (A) A carry-out must occur from the 2[sup]-32[/sup] bit of g + h, and this carry must be propagated into the integer part; or (B) No carry is propagated into the integer part, but all of the (infinite number of) bits in the fractional part of g + h are 1, so that g + h = [left_floor]2[sup]-31[/sup]m[right_floor] + .11 ... = [left_floor]2[sup]-31[/sup]m[right_floor] + 1 [member] Z. (82) However, (B) is impossible, since by (47), m g + h = ------------------ , 2[sup]31[/sup] - 1 so (2[sup]31[/sup] - 1) [product_x] m implies g + h [not_member] Z, contradicting (82). Thus (A) holds. But also, [infinity] g + h = [summation] 2[sup]-31j[/sup]m, j=1 and is not an integer. Therefore, for J sufficiently large, [left_floor]g + h[sub]J[/sub][right_floor] = [left_floor]g + h[right_floor]. Consider the 2[sup]-31[/sup] bit of g + h[sub]J[/sub]. If b[sub]62[/sub] + b[sub]31[/sub] = 0 or 10, an incoming carry would change the bit to 1 and no further carry would be propagated. Therefore, by (A), b[sub]62[/sub] + b[sub]31[/sub] = 1. Repeating this argument with the higher-order bits shows that b[sub]i+31[/sub] + b[sub]i[/sub] = 1, i = 1, ... , 31. (83) Now consider the two lowest-order terms in h[sub]J[/sub], those corresponding to j = J - 1 and j = J in (81): 2[sup]-31(J-1)[/sup]m = . 0 ... 0 b[sub]1[/sub] ... b[sub]31[/sub] b[sub]32[/sub] ... b[sub]62[/sub], 2[sup]-31J[/sup]m = . 0 ... 0 0 ... 0 b[sub]1[/sub] ... b[sub]31[/sub] b[sub]32[/sub] ... b[sub]62[/sub] . There is no carry-out from b[sub]32[/sub] ... b[sub]62[/sub] in 2[sup]-31J[/sup]m, since the corresponding bits in 2[sup]-31(J-1)[/sup]m are zero. Therefore, there is no carry-in to the next higher-order 31 bits of the sum, which are therefore all 1 by (83), with no carry-out. Repeating this argument with each higher-order block of 31 bits shows that there is no carry-out from the 2[sup]-32[/sup] bit of g + h, contradicting (A). Therefore, [left_floor]g + h[right_floor] = [left_floor]g[right_floor]. [end] Lemma 4 Compute v as in (69) and (70). Then [left_floor]v[right_floor] = [left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup])[right_floor]. Proof Since b, c [member] Z with 0 < b < 2[sup]31[/sup] - 1 and 0 < c < 2[sup]31[/sup] - 1 implies bc [member] Z with 0 < bc < 2[sup]62[/sup], we can write bc in binary as bc = .b[sub]1[/sub]b[sub]2[/sub] ... b[sub]62[/sub] x 2[sup]e[/sup], where 1 [less or = to] e [less or = to] 62 and b[sub]1[/sub] [neq] 0. Then 2[sup]-31[/sup]bc + 2[sup]-62[/sup]bc is the result of the following addition: 2[sup]-31[/sup]bc = b[sub]1[/sub] ... b[sub]31[/sub] . b[sub]32[/sub] ... b[sub]62[/sub] x 2[sup]e-62[/sup], 2[sup]-62[/sup]bc = . b[sub]1[/sub] ... b[sub]31[/sub] b[sub]32[/sub] ... b[sub]62[/sub] x 2[sup]e-62[/sup], 2[sup]-31[/sup]bc + 2[sup]-62[/sup]bc = c[sub]0[/sub] c[sub]1[/sub] ... c[sub]31[/sub] . c[sub]32[/sub] ... c[sub]62[/sub] b[sub]32[/sub] ... b[sub]62[/sub] x 2[sup]e-62[/sup]. (84) Since e - 62 [less or = to] 0, [left_floor]2[sup]-31[/sup]bc + 2[sup]-62[/sup]bc[right_floor] = [left_floor]c[sub]0[/sub] ... c[sub]31[/sub] x 2[sup]e-62[/sup][right_floor]. (85) Now u = 2[sup]-62[/sup]b [o times] c = .b[sub]1[/sub] ... b[sub]31[/sub] b[sub]32[/sub] ... b[sub]53[/sub] x 2[sup]e-62[/sup], (86) since b[sub]1[/sub] [neq] 0 and b[sub]54[/sub] ... b[sub]62[/sub] are truncated. The FMA in (70) computes v = fl(2[sup]-31[/sup]bc + u). When we replace 2[sup]-62[/sup]bc in (84) with u from (86), the sum in (84) becomes v = c[sub]0[/sub] ... c[sub]31[/sub] . c[sub]32[/sub] ... c[sub]62[/sub] b[sub]32[/sub] ... b[sub]53[/sub] x 2[sup]e-62[/sup], so that [left_floor]v[right_floor] = [left_floor]c[sub]0[/sub] ... c[sub]31[/sub] x 2[sup]e-62[/sup][right_floor] = [left_floor]2[sup]-31[/sup]bc + 2[sup]-62[/sup]bc[right_floor] by (85) as required. [end] Lemma 5 Compute y as in (69)-(73). Then bc - y is an IEEE integer. Proof By (91), bc - y = bc - 2[sup]31[/sup][left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup]) [right_floor] [member] Z, and so will be IEEE if |bc - y| < 2[sup]53[/sup]. For all x [member] R, x - 1 < [left_floor]x[right_floor] [less or = to] x; thus, bc(2[sup]-31[/sup] + 2[sup]-62[/sup]) - 1 < [left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup])[right_floor] [less or = to] bc(2[sup]-31[/sup] + 2[sup]-62[/sup]), bc - 2[sup]31[/sup]bc(2[sup]-31[/sup] + 2[sup]-62[/sup]) [less or = to] bc - 2[sup]31[/sup][left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup])[right_floor] < bc - 2[sup]31[/sup][bc(2[sup]-31[/sup] + 2[sup]-62[/sup]) - 1], - 2[sup]-31[/sup]bc [less or = to] bc - y < - 2[sup]-31[/sup] bc + 2[sup]31[/sup]. However, 0 < bc < 2[sup]62[/sup], so - 2[sup]31[/sup] < bc - y < - 2[sup]-31[/sup] + 2[sup]31[/sup] < 2[sup]31[/sup]; hence, |bc - y| < 2[sup]31[/sup]. [end] Lemma 6 Let 0 [less or = to] u < 2[sup]22[/sup] be an IEEE number. Then, (u [xor] 2[sup]52[/sup] [bisected_circle] 2[sup]52[/sup])(1 - 2[sup]-31[/sup]) = fl[(u [xor] 2[sup]52[/sup])(1 - 2[sup]-31[/sup]) - 2[sup]52[/sup](1 - 2[sup]-31[/sup])]. (87) Proof By Lemma 1, u [xor] 2[sup]52[/sup] = [left_floor]u[right_floor] + 2[sup]52[/sup]. Thus, the right side of (87) equals fl[([left_floor]u[right_floor] + 2[sup]52[/sup])(1 - 2[sup]-31[/sup]) - 2[sup]52[/sup](1 - 2[sup]-31[/sup])] = fl[[left_floor] u[right_floor](1 - 2[sup]-31[/sup])] = fl[[left_floor]u[right_floor] - 2[sup]-31[/sup][left_floor]u[right_floor]]. Since 0 [less or = to] u < 2[sup]22[/sup], [left_floor]u[right_floor] is an integer with at most 22 bits. Therefore, [left_floor]u[right_floor] - 2[sup]-31[/sup][left_floor]u[right_floor] has at most 22 + 31 = 53 bits, making it exactly representable. The right side of (87) thus becomes [left_floor]u[right_floor](1 - 2[sup]-31[/sup]), which equals the left side by Lemma 1. [end] Lemma 7 (52)-(55) result in [y_tilde][sub]i+1[/sub] = y[sub]i+1[/sub] and [x_tilde][sub]i+1[/sub] = fl(x[sub]i+1[/sub]). Proof By definition, [y_tilde][sub]0[/sub] = y[sub]0[/sub]. Applying induction on i, assume [y_tilde][sub]i[/sub] = y[sub]i[/sub] to show [y_tilde][sub]i+1[/sub] = y[sub]i+1[/sub]. By Corollary 1, u = [left_floor]y[sub]i[/sub][a_bar][right_floor] + 2[sup]52[/sup]. We first show that (53) is exact. y[sub]i[/sub][a_bar] = 2[sup]-31[/sup]s[sub]i[/sub]a[s_bar] < 2[sup]-31[/sup](2[sup]31[/sup] - 1)2[sup]15[/sup](1 + 2[sup]-31[/sup]) < 2[sup]15[/sup], so that u = I + 2[sup]52[/sup], where I is an integer less than 2[sup]15[/sup]. Thus, v = (I + 2[sup]52[/sup])(1 - 2[sup]-31[/sup]) - 2[sup]52[/sup] + 2[sup]21[/sup] = I - 2[sup]-31[/sup]I. This is an IEEE number, since I is an integer with at most 15 bits, and so v has at most 15 + 31 = 46 < 53 bits. Therefore, (53) exactly computes v = u(1 - 2[sup]-31[/sup]) - (2[sup]52[/sup] - 2[sup]21[/sup]). Since v is exact, (54) computes [y_tilde][sub]i+1[/sub] = y[sub]i+1[/sub] exactly by (51), provided y[sub]i+1[/sub] is IEEE. But y[sub]i+1[/sub] = 2[sup]-31[/sup]s[sub]i+1[/sub] by (45), where s[sub]i+1[/sub] is an integer less than 2[sup]31[/sup] - 1, so y[sub]i+1[/sub] is an IEEE number. Finally, to show that (55) computes [x_tilde][sub]i+1[/sub] = fl(x[sub]i+1[/sub]), we must establish that fl(y[sub]i+1[/sub]s) = fl(y[sub]i+1[/sub][s_bar]). (88) Since s[sub]i+1[/sub] is an integer less than 2[sup]31[/sup] - 1, we can write s[sub]i+1[/sub] = b[sub]1[/sub] ... b[sub]31[/sub] in binary. Thus, y[sub]i+1[/sub] = . b[sub]1[/sub] ... b[sub]31[/sub], 2[sup]-31[/sup]y[sub]i+1[/sub] = . 0 ... 0 b[sub]1[/sub] ... b[sub]31[/sub], y[sub]i+1[/sub][s_bar] = . b[sub]1[/sub] ... b[sub]31[/sub] b[sub]1[/sub] ... b[sub]31[/sub], y[sub]i+1[/sub]s = . b[sub]1[/sub] ... b[sub]31[/sub] b[sub]1[/sub] ... b[sub]31[/sub] b[sub]1[/sub] ... b[sub]31[/sub] ... . Let i be the index of the first nonzero bit of y[sub]i+1[/sub]. The 53 bits of the mantissa of fl(y[sub]i+1[/sub]s) begin at b[sub]i[/sub] in the first group of 31 bits of y[sub]i+1[/sub]s. If i [less or = to] 10, they end with b[sub]21+i[/sub] in the second group of 31 bits, in which case (88) holds. If i > 10, they end with b[sub]i-10[/sub] in the third group of 31 bits. For (88) to hold in this case, we need the bits of the mantissa coming from the third group to be zero, that is, b[sub]1[/sub] = ... = b[sub]i-10[/sub] = 0. But since b[sub]i[/sub] is the first nonzero bit, b[sub]1[/sub] = ... = b[sub]i-1[/sub] = 0, and the result follows. [end] Lemma 8 (57)-(63) result in [y_tilde][sub]i+1[/sub] = y[sub]i+1[/sub] and [x_tilde][sub]i+1[/sub] = fl(x[sub]i+1[/sub]). Proof By definition, [y_tilde][sub]0[/sub] = y[sub]0[/sub]. Applying induction on i, assume [y_tilde][sub]i[/sub] = y[sub]i[/sub] to show [y_tilde][sub]i+1[/sub] = y[sub]i+1[/sub]. We first show that x = [left_floor]v[right_floor] = [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar][right_floor]. Let I = s[sub]i[/sub]a[sub]n[/sub]. Then I is an integer satisfying I < (2[sup]31[/sup] - 1)(2[sup]31[/sup] - 1) < 2[sup]62[/sup], so we can write I in binary as I = b[sub]1[/sub] ... b[sub]62[/sub], and y[sub]i[/sub]a[sub]n[/sub] = 2[sup]-31[/sup]I. Now y[sub]i[/sub]a[sub]n[/sub][s_bar] = y[sub]i[/sub]a[sub]n[/sub] + y[sub]i[/sub][a_bar][sub]n[/sub], u = y[sub]i[/sub] [o_times] [a_bar][sub]n[/sub], v = fl(y[sub]i[/sub]a[sub]n[/sub] + u), and y[sub]i[/sub]a[sub]n[/sub] = b[sub]1[/sub] ... b[sub]31[/sub] . b[sub]32[/sub] ... b[sub]62[/sub], 2[sup]-31[/sup]y[sub]i[/sub] a[sub]n[/sub] = . b[sub]1[/sub] ... b[sub]31[/sub] b[sub]32[/sub] ... b[sub]62[/sub]. For [left_floor]v[right_floor] to be exact, it is only necessary for the first 31 bits of u to be correct, since the less significant bits align with zeros in y[sub]i[/sub]a[sub]n[/sub] and so cannot affect the integer part of the result. But at least the first 53 bits of u are correct; thus, [left_floor]v[right_floor] = [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar][right_floor]. By Lemma 1, then, x = [left_floor]v[right_floor]. Next, we show that y is exact. This follows if y[sub]i[/sub]a[sub]n[/sub] - [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar][right_floor] is an IEEE number. [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar][right_floor] = [left_floor]y[sub]i[/sub]a[sub]n[/sub] + 2[sup]-31[/sup]y[sub]i[/sub]a[sub]n[/sub][right_floor] = [left_floor]y[sub]i[/sub]a[sub]n[/sub][right_floor] or [left_floor]y[sub]i[/sub]a[sub]n[/sub][right_floor] + 1, since the bits of 2[sup]-31[/sup]y[sub]i[/sub]a[sub]n[/sub] do not overlap with the integer part of y[sub]i[/sub]a[sub]n[/sub]. Thus y[sub]i[/sub]a[sub]n[/sub] - [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar][right_floor] is either b[sub]32[/sub] ... b[sub]62[/sub] or one less than this value, both of which contain at most 31 bits, and thus are IEEE numbers. Finally, [y_tilde][sub]i+1[/sub] = fl(2[sup]-31[/sup]x + y) = fl[2[sup]-31[/sup] [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar] [right_floor] + y[sub]i[/sub]a[sub]n[/sub] - [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar] [right_floor]] = fl[y[sub]i[/sub]a[sub]n[/sub] - [left_floor]y[sub]i[/sub]a[sub]n[/sub][s_bar] [right_floor](1 - 2[sup]-31[/sup])] = fl[y[sub]i[/sub]a[sub]n[/sub] - [left_floor]y[sub]i[/sub]a[sub]n[/sub]s [right_floor](1 - 2[sup]-31[/sup])] by Lemma 2 = fl(y[sub]i+n[/sub]) by (56) = y[sub]i+n[/sub], since y[sub]i+n[/sub] = 2[sup]-31[/sup]s[sub]i+n[/sub] by (45), where s[sub]i+n[/sub] is an integer less than 2[sup]31[/sup] - 1, so y[sub]i+n[/sub] is an IEEE number. It follows that [x_tilde][sub]i+n[/sub] = fl(x[sub]i+n[/sub]), by an argument similar to that used in Lemma 7 to show that [x_tilde][sub]i+1[/sub] = fl(x[sub]i+1[/sub]). [end] Theorem 3 Let b and c be integers representable in IEEE double precision (IEEE integers), with 0 < b < 2[sup]31[/sup] - 1 and 0 < c < 2[sup]31[/sup] - 1. Let d = bc mod (2[sup]31[/sup]-1), and compute as in (69)-(75), using IEEE double-precision arithmetic with the round-to-zero rounding mode. Then [d_tilde] = d. Proof From (68), d = bc - [left_floor]bc2[sup]-31[/sup](1 + 2[sup]-31[/sup])[right_floor] (2[sup]31[/sup] - 1) = bc - 2[sup]31[/sup][left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup]) [right_floor] + [left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup])[right_floor]. (89) We now make use of Lemmas 4 and 5. By Lemma 4, [left_floor]v[right_floor] = [left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup])[right_floor]. Since bc < 2[sup]62[/sup], [left_floor]v[right_floor] [less or = to] [left_floor] 2[sup]62[/sup](2[sup]-31[/sup] + 2[sup]-62[/sup])[right_floor] = 2[sup]31[/sup] + 1, so v satisfies the condition of Lemma 1. By Lemma 1, x = [left_floor]v[right_floor]. (90) Then, since the multiplication in (73) by a power of 2 is exact, y = 2[sup]31[/sup]x = 2[sup]31[/sup][left_floor]bc(2[sup]-31[/sup] + 2[sup]-62[/sup])[right_floor]. (91) The FMA in (74) is exact when its result is an IEEE integer. By Lemma 5, bc - y is an IEEE integer; thus, z = bc - y. From (89), (90), and (91), d = bc - y + x = z + x. Since z + x = d = bc mod q is an IEEE integer, it follows that z + x = z [xor] x = [d_tilde]. [end] *Trademark or registered trademark of International Business Machines Corporation. **Trademark or registered trademark of Intel Corporation, Apple Computer, Inc., or MIPS Technologies, Inc. References 1. D. Bailey, J. Barton, T. Lasinski, and S. Simon, "The NAS Parallel Benchmarks," Technical Report RNR-91-002, Revision 2, NASA Ames Research Center, Moffett Field, CA, 1991. 2. D. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 2nd edition, Volume 2, Addison-Wesley Publishing Co., Reading, MA, 1981. 3. F. G. Gustavson and W. A. Liniger, "A Fast Random Number Generator with Good Statistical Properties," Computing 6, 221-226 (1970). 4. IBM Corporation, XL Fortran for AIX, Language Reference, Version 7 Release 1, 1999, Order No. SC09-2867-00; available through IBM branch offices. 5. IBM Corporation, XL High Performance Fortran for AIX, Language Reference and User's Guide, Version 1 Release 3, 1997; Order No. SC09-2631-00; available through IBM branch offices. 6. R. C. Agarwal, F. G. Gustavson, and M. Zubair, "A Very High Performance Algorithm for NAS EP Benchmark," Proceedings of HPCN'94, Munich, Germany, April 18-20, 1994. 7. Brian D. Ripley, Stochastic Simulation, John Wiley & Sons, Inc., New York, 1987. 8. R. C. Agarwal and F. G. Gustavson, "Algorithm and Architecture Aspects of Producing ESSL BLAS on POWER2, in POWERPC and POWER2: Technical Aspects of the New IBM RISC System/6000," Technical Report SA23-27-37, IBM Corporation, 1994. 9. R. C. Agarwal, F. G. Gustavson, and M. Zubair, "Exploiting Functional Parallelism of POWER2 to Design High-Performance Numerical Algorithms," IBM J. Res. & Dev. 38, 563-576 (1994). 10. IBM Corporation, Parallel Engineering and Scientific Subroutine Library for AIX Guide and Reference, July 2000; Order No. SA22-7273-03; http://www.rs6000.ibm.com/doc_link/en_US/a_doc_lib/sp32/pessl/pessl.html. 11. R. F. Enenkel and F. G. Gustavson, "Fast Calculation of Integer Products Modulo 2**31-1 Using IEEE Floating Point and Fused Multiply-Add," IBM Tech. Disclosure Bull. 41 (412), Article 41287 (August 1998). Footnotes [foot1]Gordon Slishman, private communication, IBM Toronto Laboratory, January 1994. [foot2]Xinmin Tian, private communication, IBM Toronto Laboratory, 1997. [foot3]W. Kahan, private communication, Computer Science Division, University of California, Berkeley, August 2000. [foot4]Gordon Slishman, private communication, IBM Toronto Laboratory, January 1994. Received November 1, 2000; accepted for publication September 16, 2001 Biographical sketches of authors Ramesh C. Agarwal IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, California 95120 (ragarwal@us.ibm.com). Dr. Agarwal received a B.Tech. (Hons.) degree from the Indian Institute of Technology (IIT), Bombay. While there, he received The President of India Gold Medal. He received M.S. and Ph.D. degrees from Rice University and was awarded the Sigma Xi Award for best Ph.D. thesis in electrical engineering. From 1974 to 1977, Dr. Agarwal was at IBM Research in Yorktown Heights, New York. He spent the period 1978-1981 as an Associate Professor at IIT Delhi. In 1982, he returned to IBM. Dr. Agarwal has done research in many areas of engineering, science, and mathematics and has published more than sixty papers in various journals. In 1982, he helped the National Academy of Sciences in studying the acoustic tapes related to the assassination of President Kennedy. He showed that there is no acoustical evidence for the conspiracy theory. In 1994, he analyzed the floating-point divide flaw in the Intel Pentium chip and showed that for spreadsheet calculations using decimal numbers, the probability of divide error increases by several orders of magnitude. His main research focus has been in the area of algorithms and architecture for high-performance computing. His current research activities are in the area of data-mining algorithms and database performance. In 1974, Dr. Agarwal received the Senior Award for best paper from the IEEE ASSP group. He has received several Outstanding Achievement Awards and a Corporate Award from IBM. In 2001, he received a Distinguished Alumnus Award from IIT Bombay. Dr. Agarwal is a Fellow of the IEEE and an IBM Fellow. Robert F. Enenkel IBM Toronto Laboratory, Centre for Advanced Studies, M.S. C1/B4R, 8200 Warden Avenue, Markham, Ontario, Canada L6G 1C7 (enenkel@ca.ibm.com). Dr. Enenkel is a Research Associate at the IBM Centre for Advanced Studies (CAS). Prior to joining CAS, he worked at the IBM Toronto Laboratory on the development of a C compiler and its math library, and developed parallel methods for random-number generation for Fortran and High Performance Fortran compilers. Dr. Enenkel received his B.Sc., M.Sc. and Ph.D. degrees from the University of Toronto, with thesis work in the area of numerical methods for the parallel solution of initial value problems for ordinary differential equations. His current research is in numerical computing as it relates to compilers and operating systems, including floating-point arithmetic, mathematical function libraries, and the performance tuning of algorithms. He is also interested in parallel computing and the application of numerical methods to practical problems in various areas of science. Dr. Enenkel has received an IBM Invention Achievement Award and an IBM Author Recognition Award; he is a member of the Society for Industrial and Applied Mathematics. Fred G. Gustavson IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598 (gustav@us.ibm.com). Dr. Gustavson manages the Algorithms and Architectures group in the Mathematical Sciences Department at the IBM Thomas J. Watson Research Center. He received his B.S. degree in physics, and his M.S. and Ph.D. degrees in applied mathematics, all from Rensselaer Polytechnic Institute. He joined IBM Research in 1963. One of his primary interests has been in developing theory and programming techniques for exploiting the sparseness inherent in large systems of linear equations. Dr. Gustavson has worked in the areas of nonlinear differential equations, linear algebra, symbolic computation, computer-aided design of networks, design and analysis of algorithms, and programming applications. He and his group are currently engaged in activities that are aimed at exploiting the novel features of the IBM family of RISC processors. These include hardware design for divide and square root, new algorithms for POWER2 for the Engineering and Scientific Subroutine Library (ESSL) and for other math kernels, and parallel algorithms for distributed and shared memory processors. Dr. Gustavson has received an IBM Outstanding Contribution Award, an IBM Outstanding Innovation Award, an IBM Outstanding Invention Award, two IBM Corporate Technical Recognition Awards, and a Research Division Technical Group Award. He is a Fellow of the IEEE. Alok Kothari Current address not available. Mohammad Zubair Old Dominion University, Computer Science Department, Hampton Boulevard, Norfolk, Virginia 23529 (zubair@cs.odu.edu). Dr. Zubair has more than thirteen years of research experience in the area of experimental computer science and engineering, both at Old Dominion University and in industry. In his tenure at the University, he has developed several software systems. Two of his research efforts have led to source-code license agreements with major companies. His major industrial assignment was for three years at the IBM Thomas J. Watson Research Center, where he made contributions to IBM ESSL product and in the enabling of parallel benchmarks on IBM SP2. Dr. Zubair's research has been supported by NASA, NSF, ARPA, Los Alamos, AFRL, NRL, JTASC, and the IBM Corporation.