|
1. Introduction
LAPACK algorithm DGEQRF requires more floating-point operations than LAPACK algorithm DGEQR2; see [1]. Yet DGEQRF outperforms DGEQR2 on an RS/6000* workstation by nearly a factor of 3 on large matrices. Dongarra, Kaufman, and Hammarling, in [2], later, Bischof and Van Loan, in [3], and still later, Schreiber and Van Loan, in [4], demonstrated why this is possible by aggregating the Householder transforms before applying them to a matrix C. The result of [3] and [4] was the k-way aggregating WY Householder transform and the k-way aggregating storage-efficient Householder transform. In the latter, the aggregated representation of Q = I YTYT. Here, lower trapezoidal Y is m by k, consisting of k Householder vectors, and upper triangular T is k by k.
Our recursive algorithm, RGEQR3, starts with a block size k = 1 and doubles k in each step. If we were to allow this to continue for a large number of columns, the performance would eventually degrade because the additional floating-point operations (FLOPs) grow cubically in k. The cost of RGEQR3 on an n by n matrix is (13/6)n3 +
· · ·
, whereas the DGEQR2 cost is (4/3)n3 +
· · ·. Thus, to avoid this occurring, RGEQR3 should not be used until k = n/2. Instead, we propose to use a hybrid recursive algorithm, RGEQRF, which is a modification of a standard Level 3 block algorithm in that it calls RGEQR3 for factorizing block columns instead of calling DGEQR2 and DLARFT.1 Hence, RGEQRF is a modified version of Algorithm 4.2 of Bischof and Van Loan, in [3], and RGEQR3 is a recursive Level 3 counterpart of their Algorithm 4.1. This work is a continuation of the work presented in [5].
In Section 2 we describe our new recursive serial algorithms RGEQR3 and RGEQRF and give a proof of correctness. The first subsection discusses aspects of increasing the FLOP count in order to gain performance. In the second subsection we give an explanation as to why the FLOP count increases for QR factorization when one goes from a Level 2 to a Level 3 factorization. Next we derive FLOP counts for LAPACK algorithms DGEQR2 and DGEQRF and our new algorithms RGEQR3 and RGEQRF. These counts are functions of the three parameters m, n, and k, the block size. On the basis of relations among these various counts, one can devise and tune various QR factorization algorithms; the above four are examples. Specifically, we demonstrate the correctness of the new recursive algorithm by using mathematical induction on k = log2 n, k = 0, 1, 2, · · ·. Additionally, a quantitative analysis is presented that details how the FLOP counts of the various standard and new recursive algorithms relate to one another. At the end of the section we detail savings for RGEQR3 in computing the T matrix when one need not update later with T. These savings are especially important for tall thin matrices. We also discuss alternative approaches for performing updates of the matrix, including a description of the routine DQTC (equivalent to DLARFB) used in our implementations.
Uniprocessor performance results for square and tall thin matrices are presented for the 120-MHz IBM POWER2 node and the 332-MHz IBM PPC 604e node in Sections 3 and 4. Roughly speaking, the algorithm RGEQR3 is times faster than DGEQRF, where decreases from 3 to 1.2 as n ranges from 50 to m and m 300. We remark that significantly more effort was needed to tune the parameters for DGEQRF than was needed for RGEQRF, since DGEQRF has two parameters that are to be set (defining a two-dimensional parameter space), whereas only one parameter is to be set in RGEQRF. The fact that less tuning is needed to obtain optimal performance is another feature of the automatic variable blocking of the recursive algorithm. We also mention the case for which recursion fails to produce good performance, and the remedy for this.
In Section 5 we describe our new parallel recursive algorithm. It is related to the LU factorization algorithm described in [6] and the dynamic load-balancing versions of LU and Cholesky factorization in [7]. The algorithm is based on a dynamic load-balancing strategy, implemented using the pool-of-tasks principle in which each processor enters a critical section to assign itself more work as soon as it has completed its last task. This process is fully asynchronous, since there are no fixed synchronization points. The amount of work performed in each task is large enough to make overhead in the work distribution process negligible. Section 6 shows performance results for the parallel algorithm on one, two, three, and four processors of a four-way 332-MHz IBM PPC604e SMP node. The uniprocessor performance of the parallel algorithm is basically the same as for the serial algorithm. The parallel results show nearly perfect speedups, up to 1.97, 2.99, and 3.97 for two, three, and four processors, respectively.
2. Recursive QR factorization
In our recursive algorithm, the QR factorization of an m × n matrix A,
|
A11
|
A12
|
|
= Q
|
|
R11
|
R12
|
|
,
|
(1)
|
|
A21
|
A22
|
0
|
R22
|
is initiated by a recursive factorization of the left-hand side n/2 columns, i.e.,
The remaining part of the matrix is updated,
and then 22 is recursively factorized,
2R22 = 22.
|
(4)
|
The recursion stops when the matrix to be factorized consists of a single column. This column is factorized by applying an elementary Householder transformation to it.
Proof of correctness
Our method of proof is mathematical induction. We consider only the case m n. The recursion takes place on the column dimension n. We want to prove correctness for n = 1, 2,
· · ·. However, recursion breaks the problem into nearly two equal pieces, n1 = n/2 and n2 = n n1. This suggests that we use mathematical induction on k = log2n, k = 0, 1,· · ·. On the basis of this observation, we have the following form of mathematical induction. Suppose the result is true for 1 n 2k. Then we establish the results for all j, 2k < j 2k + 1. Additionally, we need to establish the result for (m, n) where n = 1.
Now we give the proof: For n = 1 we compute the Householder reflector H = I uuT, where u is computed from A(1:m, 1). Thus, the result is true for n = 1. Now suppose the result is true for 1 n 2k. Let 2k < j 2k+1, j1 = j/2 , and j2 = j j1. Since j > 1, we do the computations indicated by Equations (2), (3), and (4) in that order.
Equation (2) is satisfied by the induction hypothesis; Equation (3) is a calculation, and Equation (4) is satisfied by the induction hypothesis. Using Equations (2)(4), we want to show that Equation (1) is true, where Q = Q1Q2 and Q, Q1, and Q2 are m × m. That Q is orthogonal follows by induction, using the induction hypothesis that Q1 and Q2 are orthogonal and the facts that Q1 and Q2 are orthogonal Householder transformations for the cases n1 = 1 and n2 = 1, respectively, and the fact that the product of two orthogonal matrices is an orthogonal matrix.
Partition Q1 and Q2 as follows:
|
Q1 =
|
|
Q11
|
Q12
|
|
;
|
|
Q2 =
|
|
I
|
0
|
|
,
|
(5)
|
|
Q21
|
Q22
|
0
|
2
|
where 2 is
m n1 by m n1. Since Q is orthogonal, QTQ = I, and it is sufficient to show that QTA = R.
Now,
|
Q1TA=
|
|
Q11TA11+Q21TA21
Q11TA12+Q21TA22
|
|
,
|
(6)
|
|
Q12TA11+Q22TA21
Q12TA12+Q22TA22
|
and by substituting (2) and (3) into (6) we obtain
|
Q1TA =
|
|
R11
|
R12
|
|
.
|
(7)
|
|
0
|
22
|
Now QTA = Q2T(Q1TA) becomes equal to R by substituting (7) for Q1TA, then carrying out the multiplication by Q2T [see (5)] and finally substituting R22 for 2T 22. [This latter substitution is valid by Equation (4)].
We use the storage-efficient WY form, that is, Q = I YTYT, to represent the orthogonal matrices, and the update by Q1T in (3) is performed as a series ofmatrix multiplications with general and triangular matrices(i.e., by calls to Level 3 BLAS DGEMM and DTRMM). In Figure 1, we give the details of our recursive algorithm, called RGEQR3.
Figure 1
We assume that A is m by n, where m n. In the else clause there are two recursive calls, one on matrix A(1:m, 1:n1), the other on matrix A(j1:m, j1:n), and the computations Q1TA(1:m, j1:n) and T1(Y1TY2)T2. These two computations consist mostly of calls to either DGEMM or DTRMM. Our implementation of RGEQR3 is made in standard FORTRAN 77, which requires the explicit handling of the recursion.
Algorithm RGEQR3 can be proved correct by following and modifying the above correctness proof. In this regard we note that RGEQR3 uses the storage-efficient representation Q = I YTYT, and so T must also be computed. Furthermore, RGEQR3 computes T recursively. The modification in the proof would include the correctness of the recursive T computation. Since it also follows in a similar way, we omit it.
In Figure 2 we give annotated descriptions of algorithms DGEQRF and DGEQR2 of LAPACK. See [1] for full details. The routine DGEQRF calls DGEQR2, which is a Level 2 version of DGEQRF. In the annotation we have assumed m n.
Figure 2
Remarks on the recursive algorithm RGEQR3
The algorithm RGEQR3 requires more floating-point operations than the algorithm DGEQRF, which in turn requires more floating-point operations than DGEQR2. Dongarra, Kaufman, and Hammarling, in [2], showed how to increase performance by increasing the FLOP count when they aggregated two Householder transforms before they were applied to a matrix C. The computation they considered was
where C is m by n, Q = Q1Q2, and Q1 and Q2 are Householder matrices of the form I iuiuiT, i = 1, 2. Their idea was to better use high-speed vector operations and thereby gain a decrease in execution time. Bischof and Van Loan, in [3], generalized (8) by using the WY transform. They represented the product of k Householder transforms Qi, i = 1, · · ·, k, as
|
Q = Q1Q2 · · · Qk = IWYT.
|
(9)
|
They used (9) to compute QTC = C YWTC. Later on, Schreiber and Van Loan, in [4], introduced a storage-efficient WY representation for Q:
|
Q = Q1Q2 · · · Qk = IYTYT,
|
(10)
|
where T is an upper triangular k by k matrix. In all three cases performance was enhanced by increasing the FLOP count. Here the idea was to replace the matrixvector-type computations with matrixmatrix-type computations. The decrease in execution time occurred because the new code, despite requiring more floating-point operations, made better use of the memory hierarchy.
In (9) and (10) Y is a trapezoidal matrix consisting of k consecutive Householder vectors, ui, i = 1,· · ·, k. The first component of each ui is 1, where the 1 is implicit and hence is not stored. These vectors are scaled with i. For k = 2, the T of Equation (10) is
|
T =
|
|
1
|
1u1Tu2 2
|
|
.
|
(11)
|
|
0
|
2
|
Suppose k1 + k2 = k and T1 and T2 are the associated triangular matrices in (10). We have
|
Q = (Q1 · · · Qk1)(Qk1+1
· · · Qk)
|
= (IY1T1Y1T)(IY2T2Y2T)
|
|
= IYTYT,
|
(12)
|
where Y = (Y1, Y2) is formed by concatenation. Thus, a generalization of (11) is
|
T =
|
|
T1
|
T1Y1TY2T2
|
|
,
|
(13)
|
|
0
|
T2
|
which is essentially a Level 3 formulation of (11), (12).
Schreiber and Van Loan and LAPACK's DGEQRF compute (12) (by LAPACK algorithm DLARFT) via a bordering technique consisting of a series of Level 2 operations. For each k1 = 1,
· · ·, k 1, k2 is chosen to be 1. However, as (12) and (13) suggest, Q = I YTYT can be done recursively as a series of k 1 matrixmatrix computations. Also, the FLOP count of the T computation in (12) by this matrixmatrix computation is the same as the FLOP count of the bordering computation to compute T (see the next subsection, below).
Algorithm RGEQR3 can be viewed as starting with k = 1 and doubling k until k = n/2. If this doubling were allowed to continue, performance would degrade drastically because of the cubically increasing FLOP count in the variable k (see the next subsection for details). To avoid this, RGEQR3 should not be used for large n. Instead, the LAPACK algorithm, DGEQRF, should be revised and used with RGEQR3. In Figure 3 we give our hybrid recursive algorithm, which we name RGEQRF. The routine DQTC described in the next two subsections applies QT = I YTTYT to a matrix C as a series of DGEMM and DTRMM operations.
Figure 3
Note that RGEQRF has no call to LAPACK routine DLARFT, which computes the upper triangular matrix T via Level 2 calls. Instead, routine RGEQR3 computes T via our Level 3 (matrixmatrix) approach in addition to computing , Y, and R. DGEQRF calls DGEQR2 to compute , Y, and R; there is no need to compute T, and so it is not done. However, the if clause of RGEQRF is not invoked after the last and sometimes only column block A(j:m, j:j + jb 1) is factored (this occurs when and only when j + jb 1 = n). In that case, some of the Level 3 T computations can be avoided. An example is given in Figure 4, where matrices Ti, i = 0, · · ·, 4 must be computed, whereas zero matrices Zi, i = 0, · · ·, 3 are not needed, and hence their computation can be avoided. Our algorithm RGEQR3 has an additional parameter ISW (a switch or flag), which when set will avoid the computation of the Zi matrices. ISW is set on in RGEQRF just before the last column block is factored. In RGEQR3, when ISW is set on, the T3 computation is avoided when and only when the right boundary of T3 is equal to n. In Figures 1 and 3 we did not include the switch logic, as it was a detail that would detract from the clarity of these algorithms. The necessary modifications should be clear from the description above. In the next subsection we detail the FLOP-count saving S(m, n) that results when ISW is set on. All performance tests have been made with ISW set on.
Figure 4
Comparative FLOP counts for algorithms DGEQR2, DGEQRF, RGEQRF, and RGEQR3
Some dense linear algebra algorithms have the same FLOP counts for both Level 2 and Level 3 versions. General LU factorization and Cholesky factorization are two examples. Others (QR factorization, for example) exhibit increasing FLOP count for their Level 2 and Level 3 versions as a function of increasing block size. We briefly describe why this is so.
Let L be the lower triangular factor of LU factorization with partial pivoting of a general matrix; i.e., LU = PA. In LAPACK, the routine that produces L, U, and P, given A, is called DGETRF. One can represent L as the product of n rank-one corrections of the identity matrix, i.e.,
L = L1L2 · · · Ln,
where Li = I + lieit. Here, vector lit = (0, · · · , 0, 1, li+1,i,
· · · , lm,i) is the ith column of L. The point we make here is that the product of the n elementary matrices Li requires no arithmetic to produce L; i.e., the product L is only a concatenation of the n vectors li, 1 i n (see [8] for details). On the other hand, Q = Q1Q2 · · · Qn, where each Qi = I iuiuiT is not equal to the concatenation of the n Qi matrices. To form Q we must multiply these n Qi matrices together. The actual representation we use is Q = I YTYT, and the extra work required to produce Q is the work needed to produce the T matrix.
In the Level 2 implementations DGETF2 and DGEQR2 we work only with matrices Li and Qi. In the Level 3 implementations DGETRF and DGEQRF, we work with block matrices L and Q. It is clear from the discussion above that the Level 3 implementation of general LU factorization requires no additional FLOPs, as L can be formed from the Li by concatenation, whereas the Level 3 implementation of general QR factorization requires additional FLOPS, namely those needed to produce T in the representation Q = I YTYT.
Our recursive algorithms RGEQRF and RGEQR3 also exhibit increasing FLOP counts. Furthermore, the FLOP count of RGEQRF is greater than the FLOP count of DGEQRF. In this subsection we compute the FLOP counts of these four algorithms. Specifically, we derive FLOP counts FG, F, FT as functions of m and n, and FB(m, n, k). The four functions refer to LAPACK auxiliary routines DLAR{FG, F, FT, FB}. These routines work on a matrix of size m by n; DLARFB computes QTC, where C is m by n and Q is m by k. Additionally, we compute FLOP-count functions T(m, k), Y(m, k), R(m, k), Diff(k), and M(m, k), S(m, k). Here we use k instead of n. These latter six functions relate RGEQR3 and RGEQRF to DGEQR2 and DGEQRF. The function T(m, k) = FT(m, k) computes the FLOP count of producing the T matrix, which we need to compute the block matrix Q = I YTYT. The cost of doing the k 1 update computations QTC in RGEQR3 is Y(m, k). We were unable to find an explicit solution to Y(m, k), and so computed W(m, k) [= Y(m, k) + W(k)] explicitly. R(m, k), standing for the cost of RGEQR3, was computed as the sum of FG(m, k), Y(m, k), and T(m, k). Diff(k) = Y(m, k) F(m, k) represents the difference in FLOP count between RGEQRF and DGEQRF during one block step, i.e., the FLOP count of one factor-and-update step of RGEQRF/DGEQRF. Now, T(m, k) = M(m, k) + S(m, k). M(m, k) is the modified FLOP count needed to compute only those parts of the T matrix that are necessary to factor a single (and last) block column, i.e., the Ti submatrices of T in Figure 4. S(m, k) is the FLOP count saved by not computing the Zi submatrices of T in Figure 4. We were not able to explicitly compute M and S. However, we produce implicit expressions of M and S and also produce explicit approximations of M and S that satisfy T = M + S.
The purpose of producing these FLOP counts is to be able to quantify and predict the performance properties of our new recursive algorithms. For example, we show that the FLOP count of RGEQR3 is about equal to the sum of the FLOP counts for DGEQR2 and DLARFT for tall thin matrices. Thus, for tall thin matrices we can expect about a threefold speedup of our algorithm over the LAPACK algorithm DGEQRF (see the second paragraph below for an explanation). Our experimental results in Section 3 verify this.
It is counterintuitive that an algorithm with a higher FLOP count will execute faster. The reason is that the FLOP rate depends crucially on the FLOP operands being in the higher levels of memory, e.g., in the cache. Our recursive algorithms automatically block for the memory hierarchy. Additionally, recursion produces a Level 3 implementation of the factorization part of the code. LAPACK implementations, on the other hand, compute the factorization part of the code as Level 2.
For RISC-type processors one usually needs a Level 3-type code to maintain peak FLOP rates, even if the operands are in cache. To a first approximation, execution time is equal to FLOP rate times FLOP count. The main result of this paper shows that recursion improves performance most for tall thin matrices. The reason is found in the difference between RGEQR3 and the sum of DGEQR2 and DLARFT. For tall thin matrices, RGEQRF and DGEQRF spend the majority of their time in RGEQR3 and in DGEQR2 plus DLARFT, respectively. We assume that, if appropriate, the switch ISW has been set on. (If we compare RGEQR3 with the switch on to just DGEQR2, the values of the leading term mk2 of the two FLOP counts are 7/3 and 2. Thus, the FLOP ratio of the two codes is about 7/6 for tall thin matrices.) Now RGEQR3 has a higher FLOP count than DGEQR2 plus DLARFT, but not drastically so. For tall thin matrices they are essentially equal. On the other hand, the FLOP-rate ratio of RGEQR3 to DGEQR2 plus DLARFT is, very loosely speaking, about 3, which explains why the performance results are so good for tall thin matrices.
We start with DGEQR2 and determine its FLOP count. We see from Figure 2 that DLARFG is called n times to compute n Householder transforms, Q(j) = I uuT. This routine computes the 2 (Euclidean) norm of a vector, avoiding overflow. It then scales the vector. The jth vector has size m j + 1. We give the FLOP count as 3(m j + 2) + 2. This is an underestimate because we designate the 2 norm cost as 2(m j) and count square root as one FLOP. Summing from 1 to n, we obtain
FG(m,n) = n(6m3n+13)/2.
Also, for 1 j < n, DGEQR2 applies Q(j)T to A(j:m, j + 1:n) from the left by calling DLARF. Let C = A(j:m, j + 1:n). DLARF computes (I uuT)C =C uwT, where w = CTu. Now w = CTu is computed by calling DGEMV, and C = C uwT is computed by calling DGER. The computation w = CTu followed by the scaling w = w costs 2(m + 1 j)(n j) FLOPs. The computation C = C uwT costs the same, so the FLOP count is 4(m + 1 j)(n j). Summing j from 1 to n 1, we obtain
F(m,n) = 2n(n1)[m(n2)/3].
Now we turn to DGEQRF. The routine in Figure 2 is a simplified version in that no mention is made of blocking parameter nx. However, it is sufficient for our purposes. There are two routines we want to analyze, DLARFT and DLARFB. DLARFT computes T via a Level 2 bordering computation. At the jth step, 1 j n, we have T as an order j 1 upper triangular matrix, and we want to compute the jth column of T. Letting
|
Y1 = (Y,v), T1 =
|
|
T
|
w
|
|
,
|
|
0
|
z
|
we have
(IYTYT)(I vvT) = IY1T1Y1T,
and so the cost of computing column j consists of computing TYTv, where Y is an m by j unit lower trapezoidal matrix and v = (1, uT)T is an m j + 1 vector. DLARFT computes w = T(1:j 1, j) = YTv by calling DGEMV and then calling DTRMV to compute w = Tw. The DGEMV computation, including the scaling by , costs 2(j 1)(m j + 1) FLOPs. The DTRMV computation costs (j 1)2 FLOPs. Thus, the total cost is (j 1)[2m (j 1)]. Summing j from 1 to n, we obtain
|
FT(m,n) = n(n 1)[m (2n1)/6].
|
Now we turn to DLARFB. This routine does the bulk of the computation. DLARFB is a general routine. In one specific instance it has the same function as routine DQTC. For our purposes we want to compute QTC, where C is m by n and Q = I YTYT. Here Y is m by k lower unit trapezoidal and T is order k upper triangular. Both DLARFB and DQTC have the same FLOP count and use an auxiliary work array of size k by n.
Let
|
Y =
|
|
Y1
|
|
,
|
|
Y2
|
where Y1 is an order k unit lower triangular matrix and Y2 is an m k by k rectangular matrix. Also, let
|
C =
|
|
C1
|
|
,
|
|
C2
|
where C1 is k by n and C2 is m k by n. Six computations are performed; they are summarized in Table 1.
|
Table 1 FLOP counts for operations used to perform C = QTC, where Q = I YTYT.
|
|
|
Routine
|
Computation
|
Number of floating-point operations
|
|
|
DTRMM
|
W = Y1T * C1
|
k(k 1)n
|
|
DGEMM
|
W = W + Y2T * C2
|
2k(m k)n
|
|
DTRMM
|
W = TT * W
|
k2n
|
|
DGEMM
|
C2 = C2 Y2 * W
|
2k(m k)n
|
|
DTRMM
|
W = C1 * W
|
k(k 1)n
|
|
Matrix subtract
|
C1 = C1 W
|
kn
|
|
|
|
Total computation
|
C = QTC
|
kn(4m k 1)
|
|
From Table 1 we have
FB(m, n, k) = kn(4mk1).
This concludes the analysis of DGEQRF.
We next turn to RGEQRF. Routines DGEQRF and RGEQRF differ (assuming both use the same blocking strategy) in FLOP count as follows. DGEQRF calls DGEQR2 and DLARFT, whereas RGEQRF calls RGEQR3. Both routines use the DQTC computation for the bulk of their operations. Hence, to compare the FLOP-count difference we need to study the difference between DGEQR2 + DLARFT and RGEQR3. Turning to RGEQR3, we now show that its T computation can be separated out and directly compared to DLARFT. The functional equation governing the T FLOP count is
|
T(m, k) = T(m, k1) + T(mk1,
k2) + k1k2(2mk1),
|
(14)
|
with T(m, 1) = 0. The k1k2(2m k1) FLOP component is the cost of computing T1(Y1TY2)T2, where T1 and T2 are k1 and k2 upper triangular, respectively, Y1 is m by k1 unit lower trapezoidal, and Y2 is m k1 by k2 unit lower trapezoidal.
The computation proceeds as follows. Let us write
|
Y1 =
|
|
Y11
|
|
,
|
|
Y21
|
|
Y31
|
|
|
Y2 =
|
|
Y12
|
|
,
|
|
Y22
|
|
where Y11 and Y12 are order k1 and k2 unit lower triangular, respectively, Y21 is k2 by k1, Y31 is m k by k1, Y22 is m k by k2, and k = k1 + k2. The computation is summarized in Table 2. Note that Y11 is not used. First set W = T(1:k1, k1 + 1:k) = YT21.
|
Table 2
FLOP counts for operations used to compute T3.
|
|
|
Routine
|
Computation
|
Number of floating-point operations
|
|
|
DTRMM
|
W = YT21 * Y12
|
k1k22
|
|
DGEMM
|
W = W + YT31 * Y22
|
2k1k2(m k)
|
|
DTRMM
|
W = T1 * W
|
k12k2
|
|
DTRMM
|
W = W * T2
|
k1k22
|
|
|
|
Total computation
|
W = T1(Y1TY2)T2
|
k1k2(2m k1)
|
|
We claim T(m, k) = FT(m, k). To show this we substitute FT(m, k) into (14) and verify that (14) holds.
Now we want to remove the DLARFG part of the RGEQR3 computation. This computation is done when n = 1. The remaining part of RGEQR3 consists of computing Q = I YTYT minus the T computation, i.e., the Y matrix. The FLOP count is
|
Y(m, k) = Y(m, k1) + Y(mk1,
k2) + k1k2(4mk11).
|
(15)
|
The cost k1k2(4m k1 1) is the cost of calling DQTC, i.e., the A(1:m, k1 + 1:k) = Q1TA(1:m, k1 + 1:k) computation of Figure 1. Also Y(m, 1) = 0.
To solve Equation (15), we compute an approximate solution. Define Y = W W, where
|
W(m, k) = W(m, k1) + W(mk1, k2) + k1k2(8m+k23k12)/2,
|
(16)
|
with W(m, 1) = 0. W(m, k) is a function of k only and satisfies
W(k) = W(k1) + W(k2) + (k2k1)(k1k2)/2,
|
(17)
|
with W(1) = 0.
The solution to (16) is
W(m, k) = k(k1)(4mk)/2.
This can be verified by substituting W(m, k) into (16). Similarly, Y = W W can be verified by substitution using (15), (16), and (17). Now we turn to the solution of (17). Note that k 2k1 = 0 when k is even and 1 when k is odd. Thus, there is no additional count change unless k is odd, and then it is k1(k1 + 1)/2. In particular, W(n) = 0 when n = 2k. To find the general solution we need some notation. Let N = i 0 bi2i be the base-two representation of N. Let Nj = N/2j = i j bi21j, and let nj = ji=0bi2i. The binary tree associated with N has 2i nodes at level i. At the root there is one node of size N0. Let n1 = 0.
There are ni nodes of size Ni+1 + 1 and 2i+1 ni nodes of size Ni+1 at level i + 1. As we saw above, only odd nodes add to the count by the amount Ni+1(Ni+1 + 1)/2. Now, Ni is odd if and only if bi = 1. Thus, Ni + 1 is odd if and only if bi = 0. Using these facts, we see that at level i the increase in the count is
[(1 bi)ni+bi(2i+1 ni)][Ni+1(Ni+1 + 1)/2].
Since W(1) = W(2) = 0, we obtain
W(N) =
|
log2N 1
|
[ni + 2bi(2i ni)][Ni+1(Ni+1 + 1)/2].
|
|
|
i=0
|
To obtain a bound on W(N), we note that the contribution at each node at level i + 1 is four times smaller than that at a node at level i. Since the number of nodes at level i + 1 is double that of level i, we find that the increase at level i + 1 is at most two times smaller than at level i. Thus, an upper bound for W(N) is two times the value at level 0, i.e., 2[N1(N1 + 1)/2]. But when N is even, the contribution at level 0 is 0. Hence, we write N = 2in, where n is odd. Contributions begin to show up only at level i, and there are 2i instances of the same positive contribution. Thus, an upper bound is
W(N) 2i+1(n/2)(n/2+1)/2.
In our implementation we found k = 48 to be an optimal blocking factor for RGEQRF. The corresponding best factor for DGEQRF was nb = 40. For k = 48, W(m, k) = 4512m 54144 W(48), where W(48) = 32.
Next we compute the FLOP-count difference between RGEQRF and DGEQRF during one block iteration of size k, i.e., the call to RGEQR3 minus the calls to DGEQR2 and DLARFT. This difference, Diff(m, k), is Y(m, k) minus F(m, k):
Diff(m, k) = k(k 1 )(k 8)/6 W(k).
|
(18)
|
Note that Diff is independent of m and the cubic growth in k is only 1/6, where k is the blocking factor and k is around 50 for practical problems. For k = 48, Diff is 15008 FLOPS. A realistic problem might have m = 1000. Summing FG(m, k), Y(m, k), and T(m, k), we obtain the FLOP count of RGEQR3:
R(m, k) = 3k2m k(5k2 + 3k 38)/6 W(k).
For m = 1000 and k = 48, R(m, k) 6.8 × 106. The ratio Diff/R(m, k) is 0.0022, about 0.2%. For tall thin matrices the difference is negligible; i.e., the FLOP counts are essentially equal. Thus, we have quantified a remark we made above regarding Equation (12).
Now compare Diff(k) to T(m, k) = FT(m, k). T(m, k) is at least four times more costly to compute, and this occurs when m = k. For tall thin matrices (say m/k = 20), the ratio is greater than 100. The point we make here is that the additional cost of using T in k 1 calls to DQTC of RGEQR3 is tiny compared to the cost of producing it.
Although the point is very minor, the cost T(m, k) is actually less than FT(m, k). We listed the cost of the first DTRMM computation as k1k22. However, Y12 is unit lower triangular; thus, the cost is actually k1k2(k2 1). We used the more expensive cost because the results were equal. Thus, using Level 3 BLAS to replace a series of Level 2 computations can be slightly cheaper because a constant feature in the higher-level BLAS can be exploited, whereas in the lower-level BLAS it is only one operation and usually cannot be taken advantage of.
As mentioned, there is a modification of RGEQR3 which can save FLOPs. Recall that the computation T1(Y1TY2)T2 is done after the second recursive call. The computation is necessary only when Q = I YTYT is needed to update a submatrix to the right of the currently factored submatrix. If the RGEQR3 task is to factor the entire matrix, there is never a right submatrix. In RGEQRF, RGEQR3 is called as a subroutine ñ = n/k times, where k is the block size. On the first ñ 1 calls there is a right submatrix to be updated. On the last call there is none. Thus, in these cases, in RGEQR3 we omit the T3 computation when the right boundary of T3, a submatrix of T, is k. If the right boundary of T3 is less than k, T3 must be computed (see Figure 4). Let M(m, k) be the modified FLOP count and S(m, k) be the saved FLOP count of computing the T3 matrix. We have
|
M(m, k) = T(m, k1) + M(m k1, k2);
|
(19)
|
|
S(m, k) = S(m k1,
k2) + k1k2(2m k1),
|
(20)
|
where M(m, 1) = S(m, 1) = 0. It is clear by induction that M(m, k) + S(m, k) = T(m, k), since the sum of the right-hand side of (19) and (20), using the induction hypothesis, gives the right-hand side of (14). We can show that
|
M(m, k) = f(k)m f1(k);
|
(21)
|
|
S(m, k) = g(k)m g1(k).
|
(22)
|
Furthermore,
|
f(k) = f(k2) + k1(k1 1);
|
|
|
f1(k) = f1(k2) + k1[f(k2) + (k1 1)(2k1 1)/6],
|
(23)
|
and
|
g(k) = g(k2) + 2k1k2
|
|
|
g1(k) = g1(k2) + k1[g(k2) + k1k2],
|
(24)
|
with f(1) = f1(1) = 0 and g(1) = g1(1) = 0 defining the functions f, f1, g, and g1, for k = 1, 2, · · ·.
Consider
|
af(2x) = af(x) + x(x 1);
|
| |
|
af1(2x) = af1(x) + x[f(x) + (x 1)(2x 1)/6],
|
(25)
|
|
ag(2x) = ag(x) + 2x2;
|
|
|
ag1(2x) = ag1(x) + x[ag(x) + x2].
|
(26)
|
The prefix a means approximate. A solution to (25), (26) is
|
af(x) = 1/3(x 1)(x 2);
|
|
af1(x) = (x 1)(4x2 17x + 18)/42;
|
(27)
|
|
ag(x) = 2/3(x + 1)(x 1);
|
|
ag1(x) = (x 1)(5x2 + 5x 9)/21.
|
(28)
|
These functions approximate the true functions (23) and (24). For k = 2 j, they are identical. Note that
f(k) + g(k) = af(k) + ag(k) k(k 1)
|
(29)
|
and
f1(k) + g1(k) = af1(k) + ag1(k) k(k 1)(2k 1)/6.
|
(30)
|
We can also show directly that
g(2k) = g(k);
|
(31)
|
g(2k + 1) = g(k + 1) + 2/3k;
|
(32)
|
g1(2k) = g1(k) + k g(k);
|
(33)
|
g1(2k + 1) = g1(k) + k[ g(k + 1) + (4k 1)/21],
|
(34)
|
where g(k) g(k) ag(k) and g1(k) g1(k) ag1(k).
Now define
ag(x) ag(x + 1) ag(x) = 2/3(2x+1)
|
(35)
|
and
ag1(x) ag1(x + 1) ag1(x) = (5x2 + 5x 3)/7.
|
(36)
|
It turns out that
ag(k) g(k) < ag(k + 1);
|
(37)
|
ag1(k) g1(k) < ag1(k + 1).
|
(38)
|
We can show that (37) and (38) are true by using induction. These proofs are straightforward. A sketch of the proofs is given in the Appendix.
Note that in each region 2j < k 2j+1, Equations (37) and (38) bound g(k) and g1(k) from below and above. Thus, limk [g(k)/k2] = limk [ag(k)/k2] = 2/3. Similarly, limk [g1(k)/k3] = limk [ag1(k)/k3] = 5/21. From (29) and (30) we obtain limk [f(k)/k2] = 1/3 and limk [f1(k)/k2] = 2/21. Let rm(k) = f(k)/g(k) and r(k) = f1(k)/g1(k). Figure 5 gives log plots of rm and r vs. k. The plots verify experimentally that rm and r have the correct limits of 0.5 and 0.4.
Figure 5
We now use the above results to compare RGEQR3 to DGEQR2 with the switch on. The cost of RGEQR3 in this case becomes R(m, k) S(m, k). Using ag(k) and ag1(k) for g(k) and g1(k) and keeping only terms in k2m and k3, R(m, k) S(m, k) k2[(7/3)m (40/21)k]. The FLOP count of DGEQR2, keeping the same terms, is k2(2m k/3). For large m and small k (the tall thin matrix case), the FLOP ratio of RGEQR3 to DGEQR2 is about 7/6; i.e., the FLOP count is about 17% higher. Assuming a 3 to 1 FLOP-rate ratio, it follows that RGEQR3 should execute 24/7 times faster. This loose analysis quantifies the remarks we made above regarding this comparison.
We close this section with some remarks on why the hybrid algorithm RGEQRF should not be replaced with RGEQR3. Let m = n and set k = n in R(m, k). The cubic term is (13/6)n3. For DGEQRF and RGEQRF using fixed nb n, the extra FLOP cost is marginally higher than for DGEQR2, which has a cubic term of (4/3)n3.
Even if we replace T(m, k) with M(m, k) to compute R(m, k) (which one should do), the reduction in the cubic term would only be 3/7, and the new cubic term would be 73/42. In summary, the preceding FLOP-count analysis of this section shows that the hybrid algorithm RGEQRF should be chosen. RGEQRF should exhibit Level 3 performance for even relatively small values of k. For large m and n, and using k = nb n, the additional FLOP count is marginally higher than for DGEQRF.
The routine DQTC
The matrix multiplications in the update operation
 QTC = C YTTYTC
can be performed in several different orders. Depending on the sizes of the matrices involved, the total number of floating-point operations may be quite different for the different approaches. In Table 3 we show the five possible alternatives and the number of floating-point operations required for each of them, assuming that C is n1 × n2, Y is n1 × k unit lower trapezoidal, and T is k × k upper triangular. Since we here consider each individual update through the whole factorization of an m by n matrix, we have for clarity chosen to use the new variables n1 and n2 to denote the size of the matrix to be updated.
Table 3
Alternative approaches for updates, QTC = C YTTYTC, in QR factorization.
|
|
|
Method
|
Number of floating-point operations
|
|
|
C1: C Y * [TT * (YT * C)]
|
kn2(4n1 k 1)
|
|
C2: C Y * [(TT * YT) * C]
|
kn2(4n1 k 1) + k(n1k k + 1/3 k2/3)
|
|
C3: C [Y * (TT * YT)] * C
|
n12(2n2 + 2k 1) + k(1/3 k k2/3)
|
|
C4: C [(Y * TT) * YT] * C
|
n12(2n2 + 2k 1) + k(n1 kn1 2k + 1)
|
|
C5: C (Y * TT) * (YT * C)
|
kn2(4n1 2k) + k(n1k k 2k2/3 + 2/3)
|
|
For comparisons with results in the preceding subsection, where we express the operations in terms of the first update, n1 and n2 should be replaced by m and n k, respectively. In the update performed in a QR factorization of a large m × n matrix where m n, we know that n1 n2 + k and that k is normally much smaller than n1 and n2 for almost all updates performed.
It is apparent that methods C1 and C5 require significantly fewer floating-point operations than the other alternatives for the typical case in the QR factorization, i.e., for n1 large and k relatively small. The C1 method requires the smallest number of floating-point operations of these methods for all updates that occur in RGEQRF for a factorization of an m × n matrix if m n. For the case m = n, method C5 is almost as cheap if m = n is large and k is small.
The larger the block size is or the smaller n2 is compared to n1, the larger is the gain from using method C1 compared to C5. From this we conclude that method C1 definitely is to be preferred by the recursive algorithm RGEQR3, where n2 typically never exceeds half of the block size nb used in the calling routine RGEQRF. For updates in RGEQRF, the choice of method is less obvious. However, in order to reduce the software complexity we have chosen to use the same method for all updates. Therefore, the routine DQTC implements method C1. The routine DLARFB called by the LAPACK algorithm DGEQRF also implements the C1 method, even though the implementation is slightly different. In both routines, the operations are performed as a series of calls to the Level 3 BLAS routines DGEMM and DTRMM.
3. Uniprocessor performance analysis
The uniprocessor performance of RGEQRF has been evaluated on a 120-MHz IBM POWER2 node and on one processor of a four-way 332-MHz IBM PPC 604e SMP node. The performance results are compared to those of the LAPACK Level 3 QR factorization routine DGEQRF. The performance has been analyzed both for square matrices (m = n) and for tall thin matrices (m n), which are important in many applications. Results are presented for large as well as for small matrices.
Both RGEQRF and DGEQRF have parameters that must be set properly in order to obtain optimal performance. The only parameter that must be set in RGEQRF is nb; i.e., nb is the number of columns to be factorized by the pure recursive algorithm RQEQR3. In DGEQRF, nb is the number of columns to be factorized by the Level 2 algorithm DGEQR2. DGEQRF also uses a parameter nx as a crossover point, so that when the remaining matrix to be factorized consists of fewer than nx columns, it is completely factorized by the Level 2 algorithm. By using an appropriately chosen crossover point, the overhead (extra FLOPs) of the Level 3 algorithm is avoided for matrices too small to benefit from the Level 3 operations.
One could argue that a similar crossover point should also be used in RGEQRF in order to improve performance, but in our effort to make the routine as close to self-tuned as possible, we have chosen not to use this approach. We note, however, that turning on the ISW switch has the effect of automatically introducing a second type of blocking. In the following sections, all serial and parallel performance results for the recursive algorithm have been obtained with the ISW switch turned on.
Of course, the optimal settings of nb for RGEQRF and nb and nx for DGEQRF vary depending on the size and the shape of the matrix to be factorized. Since we believe that most users of linear algebra library software do not change these parameters between different calls to the QR factorization routines, we have chosen to find a good general setting for these parameters for each machine used for performance testing. The settings have been made to give the best overall performance for square matrices varying from m = n = 100 to m = n = 2000.
In DGEQRF, the parameters nb and nx are obtained from the LAPACK routine ILAENV, which always should be modified to return the optimal parameters at the time of the installation (as we have done for the two machines considered here). We fear that some users do not notice this, or that the users do not take the time to determine what values to choose for these parameters. One should also note that the values for the parameters nb and nx used in DGEQRF are the same as the ones used in routines for RQ, QL, and LQ factorization. Of course, there are default values to be used (nb = 32 and nx = 128 in the version we used), but use of them will lead to suboptimal performance on many computer systems. We also note that significantly more effort is needed to find optimum parameter settings for DGEQRF than for RGEQRF because the two parameters (nb and nx) used in DGEQRF define a two-dimensional parameter space, while the parameter space for RGEQRF is one-dimensional (nb only). On this basis, we believe that each step toward automatic tuning is just as important as the optimum performance achieved.
A defect in the recursive approach is the overhead cost of handling the recursion (e.g., the calling overhead). Thus, for small m and n the direct method will perform faster. A way to avoid this defect is to introduce a recursive blocking parameter rb. Pure recursion has rb = 1. When n, the recursion variable, satisfies n > rb, a recursion step is taken; otherwise, a direct method is used. For QR the value of rb can be quite small, say rb = 4. This is so because we are interested in achieving Level 3 performance via register blocking; typically an unrolling of 4 works well. However, we have used only pure recursion, rb = 1, in our performance studies.
IBM POWER2 performance results
The performance results are obtained on a 120-MHz IBM POWER2 node, using IBM XL FORTRAN 5.1 [9]. The overall best block sizes nb for square matrices were found to be 48 for RGEQRF and 40 for DGEQRF. The overall best crossover point nx for using the Level 2 algorithm only in DGEQRF was found to be 112. These parameters were used in all performance tests for this node reported in this section, including tests for tall thin matrices. The BLAS routines used are from ESSL Version 2.2 [10].
IBM POWER2square matrices
The performance results in MFLOPs/s for RGEQRF and DGEQRF on square matrices are shown in Figure 6. RGEQRF clearly reaches high performance much earlier than DGEQRF for increasing m = n. For very small matrices RGEQRF does not provide optimal performance, primarily because of overhead from handling the recursion and because we do not use any crossover point similar to nx in DGEQRF to avoid the extra FLOPs performed in the T computations for matrices too small to benefit from Level 3 operations.
Figure 6
We note that the highest performance obtained for RGEQRF is 415.4 MFLOPs/s, which is 90.3% of the theoretical peak performance of the POWER2 node (460 MFLOPs/s). DGEQRF reaches 353.4 MFLOPs/s, which is 76.8% of the theoretical peak performance.
Table 4 shows that the performance of RGEQRF is nearly 20% better than that of DGEQRF for square matrices with n 800. This difference in performance seems to stabilize for large matrices. A similar behavior was also observed in the first results for the hybrid recursive algorithm presented for a 112-MHz IBM PPC 604 node in [5].
|
Table 4
|
Ratio of uniprocessor performance results for RGEQRF and DGEQRF on a 120-MHz IBM POWER2 node.
|
|
|
m = n
|
200
|
400
|
600
|
800
|
1000
|
1200
|
1400
|
1600
|
1800
|
2000
|
|
|
Ratio
|
0.91
|
1.03
|
1.13
|
1.17
|
1.20
|
1.19
|
1.18
|
1.18
|
1.17
|
1.17
|
|
IBM POWER2tall thin matrices
The performance results for RGEQRF and DGEQRF on tall thin matrices are shown in Table 5 and Table 6, respectively. Here, RGEQRF shows better performance than DGEQRF for all but four cases, and the difference is sometimes much larger than for the tests with square matrices. RGEQRF achieves a performance close to or greater than 300 MFLOPs/s for n 100 and m 600. For DGEQRF to obtain similar performance for large m, the number of columns n must be significantly larger (around 400).
|
Table 5
|
Uniprocessor performance on tall thin matrices for RGEQRF on a 120-MHz IBM POWER2 processor.
|
|
|
m\n
|
50
|
100
|
150
|
200
|
250
|
300
|
350
|
400
|
450
|
500
|
|
|
200
|
152.3
|
214.6
|
231.4
|
249.7
|
|
|
|
|
|
|
|
400
|
198.6
|
270.6
|
288.6
|
306.7
|
324.8
|
333.4
|
336.9
|
341.3
|
|
|
|
600
|
218.4
|
294.4
|
302.2
|
322.3
|
333.0
|
340.3
|
356.0
|
355.6
|
357.9
|
362.8
|
|
800
|
218.8
|
295.3
|
301.1
|
319.7
|
337.8
|
346.2
|
362.9
|
368.5
|
364.1
|
369.1
|
|
1000
|
232.1
|
296.6
|
305.0
|
326.8
|
344.1
|
354.2
|
369.0
|
374.2
|
369.2
|
374.7
|
|
1200
|
235.8
|
297.0
|
310.4
|
328.8
|
346.3
|
356.1
|
370.3
|
375.9
|
372.4
|
377.4
|
|
1400
|
220.3
|
298.6
|
307.7
|
327.6
|
341.9
|
348.0
|
366.2
|
372.3
|
368.7
|
373.3
|
|
1600
|
220.7
|
297.5
|
307.6
|
328.2
|
339.1
|
352.7
|
369.6
|
375.2
|
370.7
|
376.3
|
|
1800
|
225.7
|
301.0
|
308.2
|
327.6
|
345.0
|
355.3
|
372.7
|
378.9
|
372.6
|
377.8
|
|
2000
|
225.0
|
295.0
|
301.8
|
322.2
|
341.6
|
351.7
|
362.8
|
371.0
|
368.9
|
371.4
|
|
|
Table 6
|
Uniprocessor performance on tall thin matrices for DGEQRF on a 120-MHz IBM POWER2 processor.
|
|
|
m\n
|
50
|
100
|
150
|
200
|
250
|
300
|
350
|
400
|
450
|
500
|
|
|
200
|
217.4
|
210.3
|
242.9
|
278.0
|
|
|
|
|
|
|
|
400
|
215.8
|
155.6
|
206.4
|
291.0
|
303.0
|
314.1
|
313.6
|
328.9
|
|
|
|
600
|
177.2
|
151.5
|
197.5
|
271.1
|
286.5
|
301.4
|
310.0
|
329.9
|
330.7
|
331.2
|
|
800
|
145.4
|
133.7
|
175.9
|
246.0
|
261.4
|
260.3
|
286.4
|
308.9
|
311.1
|
305.4
|
|
1000
|
147.5
|
146.1
|
186.9
|
248.3
|
264.8
|
279.8
|
289.2
|
309.0
|
311.7
|
314.9
|
|
1200
|
146.5
|
144.0
|
184.9
|
246.9
|
263.7
|
278.3
|
288.0
|
308.3
|
311.4
|
315.0
|
|
1400
|
146.6
|
145.0
|
186.6
|
247.0
|
263.8
|
276.6
|
286.2
|
305.9
|
305.0
|
313.2
|
|
1600
|
131.7
|
130.5
|
170.6
|
232.8
|
250.2
|
264.3
|
274.3
|
296.2
|
299.5
|
305.5
|
|
1800
|
142.0
|
144.3
|
185.2
|
244.1
|
260.5
|
275.0
|
284.9
|
304.2
|
307.2
|
312.1
|
|
2000
|
139.8
|
140.7
|
181.8
|
241.2
|
257.9
|
272.2
|
281.2
|
301.4
|
301.6
|
309.3 |
|
In Figure 7 the performance ratio of RGEQRF to DGEQRF is shown for different values of n. Notably, for m 800 and n 150, RGEQRF outperforms DGEQRF by a factor of 1.52.3; for n = 100, RGEQRF shows more than twice the performance of DGEQRF for m 800. It confirms that the benefit of the recursive algorithm is bigger for tall thin matrices than for square matrices. However, one should note that the crossover point nx used in DGEQRF could be changed to give better performance for the LAPACK routine for tall thin matrices as well. As mentioned above, nx is here chosen to give a good overall performance. In RGEQRF the recursion leads to an automatic blocking that apparently performs well for a wider range of matrix sizes.
Figure 7
IBM PPC 604e uniprocessor performance results
The performance results are obtained on one processor of a four-way 332-MHz IBM PPC 604e SMP node, using IBM XL FORTRAN 5.1 [9]. The overall best block sizes nb for square matrices were found to be 56 for RGEQRF and 40 for DGEQRF. The overall best crossover point nx for using the Level 2 algorithm only in DGEQRF was found to be 72. We remark that this optimal crossover point is very small, in fact less than twice the block size nb. These parameters were used in all performance tests for this node reported in this section, including tests for tall thin matrices. The BLAS routines used are from ESSL Version 3.1 [11].
IBM PPC 604esquare matrices
The performance in MFLOPs/s for RGEQRF and DGEQRF on square matrices is shown in Figure 8. As for the POWER2 results, RGEQRF outperforms DGEQRF for all matrices except the very small ones.
Figure 8
The performance ratio of RGEQRF to DGEQRF is shown in Table 7. Except for drops in performance for m = n = 1000, 1100, and 2000, RGEQRF shows around 15% better performance than DGEQRF for large matrices, which is almost as good as the performance observed for the POWER2 node.
|
Table 7
|
Ratio of uniprocessor performance results for RGEQRF and DGEQRF on a 332-MHz IBM PPC 604e node.
|
|
|
m = n
|
200
|
400
|
600
|
800
|
1000
|
1200
|
1400
|
1600
|
1800
|
2000
|
|
|
Ratio
|
1.05
|
1.14
|
1.15
|
1.14
|
1.04
|
1.15
|
1.15
|
1.15
|
1.15
|
1.03 |
|
However, there seem to be some irregular drops in performance for m = n = 1000, 1100, and 2000 in Figure 8. In order to get a better picture of the critical matrix sizes, the performance results for m = n = 900, 901, 902, · · · , 1200 are presented in Figure 9. From this figure, it is clear that the performance is around 10% lower than what one would expect for 925 < m = n < 1130.
Figure 9
Additional tests indicate that this performance problem depends on the leading dimension of the matrix. All of the performance results presented here are performed with a leading dimension equal to m + 1. If the leading dimension, for example, is increasing by a constant 300, a similar performance drop is observed for m = n around 700 to 800.
This behavior is currently not fully understood, but one possible reason is that this is an effect explained by the fact that the Level 2 cache memory on the 332-MHz PPC 604e node is direct-mapped; i.e., it is not a set-associative cache memory. We remark that no indications of similar problems were observed on the POWER2 node, which indeed has a four-way set-associative Level 2 cache memory.
IBM PPC 604etall thin matrices
The performance results for RGEQRF and DGEQRF on tall thin matrices are shown in Table 8 and Table 9, respectively.
|
Table 8
|
Uniprocessor performance on tall thin matrices for RGEQRF on a 332-MHz IBM PPC 604e processor.
|
|
|
m\n
|
50
|
100
|
150
|
200
|
250
|
300
|
350
|
400
|
450
|
500
|
|
|
200
|
125.2
|
169.3
|
181.4
|
190.4
|
|
|
|
|
|
|
|
400
|
143.4
|
188.4
|
206.9
|
218.9
|
234.0
|
239.3
|
249.6
|
251.3
|
|
|
|
600
|
138.2
|
180.6
|
202.7
|
218.1
|
239.7
|
246.5
|
253.5
|
255.8
|
261.9
|
265.5
|
|
800
|
126.5
|
176.9
|
199.7
|
216.8
|
235.8
|
243.7
|
249.4
|
252.3
|
262.4
|
267.0
|
|
1000
|
114.2
|
152.5
|
168.1
|
176.8
|
195.0
|
199.2
|
211.5
|
213.4
|
232.8
|
237.4
|
|
1200
|
115.7
|
164.3
|
193.5
|
210.5
|
233.6
|
242.6
|
247.9
|
251.4
|
260.3
|
266.5
|
|
1400
|
119.9
|
172.6
|
194.6
|
211.9
|
233.6
|
243.2
|
250.6
|
254.0
|
263.3
|
269.4
|
|
1600
|
121.8
|
172.8
|
194.7
|
212.5
|
233.3
|
242.8
|
249.4
|
252.6
|
261.9
|
268.3
|
|
1800
|
116.0
|
168.5
|
190.9
|
208.9
|
229.2
|
239.4
|
245.9
|
250.2
|
258.6
|
265.5
|
|
2000
|
104.7
|
148.1
|
164.4
|
174.1
|
191.0
|
196.0
|
209.7
|
212.7
|
230.3
|
235.9 |
|
|
Table 9
|
Uniprocessor performance on tall thin matrices for DGEQRF on a 332-MHz IBM PPC 604e processor.
|
|
|
m\n
|
50
|
100
|
150
|
200
|
250
|
300
|
350
|
400
|
450
|
500
|
|
|
| |