========================================================================= Date: 20 February 1991, 19:14:02 EST From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: bizarre instructions Herman Rubin says: Do you mean to tell me that computations involving both integers and floating-point numbers are not important? Or that dividing floats by floats, obtaining an integer quotient and a floating remainder, likewise? That particular step is the first step of any trigonometric or exponential function computation when it is not known in advance that the argument is small. There are other periodic and related functions for which this is useful. It would also speed up interpolation, etc. Since argument reduction for the trigonometric functions is done in practice by multiplying by 1/pi I do not see how it would ben- efit from your proposed instruction. I believe this instruction would have little general utility. An efficient way to do the fortran "x=dint(y)" is important. This is lacking on the Risc System 6000 as was pointed out in a Los Alamos report. By the way is it true that it is impossible to reasonably express this in ADA? Peter Montgomery says: Yes, most programs are written in these languages. As Dan says, the language designers made mistakes. During one review period for Fortran 90, I requested an operation which takes four nonnegative integers a, b, c, n with a < n or b < n (c is optional and defaults to 0). The requested operation returns q and/or r, where a*b + c = q*n + r and 0 <= r < n Why didn't you ask for an integer*8 data type? Wouldn't that give you what you want and have much more general utility? I would like an instruction which counts the number of ones in the binary representation of an integer but I find it hard to argue that this would be widely used. James B. Shearer ========================================================================= Date: 21 February 1991, 19:05:55 EST From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: bizarre instructions Regarding my comment: Since argument reduction for the trigonometric functions is done in practice by multiplying by 1/pi I do not see how it would ben- efit from your proposed instruction. Dik Winter says: You must have a very limited experience. You do it only by multiplying by 1/pi if you are willing to forego a lot of precision. (Check your favorite hand-held calculator. If sin(pi) = 0, the implementation is wrong!) I don't understand this comment. This problem can occur with the divide with remainder instruction as well. The problem is that approximating x mod pi by x mod y where y is the machine rep- resentation of pi may lose all accuracy. Herman Rubin says: There is somewhat greater loss of accuracy in this, and it is still needed to extract the integer part to an integer register and the fractional part. Thus, it needs at least three operations, instead of one. Also, if one is calculating something like x - ln(1+x), a natural operation in certain problems, the computing problems become a little larger than one would expect. In fact, to avoid a loss of accuracy in some quite usual situations, it would even be a good idea to have Boolean operations on floats, and unnormalized floating arithmetic. Floating arithmetic, apart from some preliminaries, and normalization problems, is exactly the same as integer, so why should there be a separate arithmetic unit even? I don't understand the comments about loss of accuracy. As noted above some or all of the bits of the remainder will be bogus because pi is not a machine number. Accurate argument re- duction requires first finding the integer part, n, of the quotient then computing x-n*pi carefully using pi to more than machine pre- cision. It is more convenient to do this if n is in floating for- mat. Counting operations is a poor test of speed when floating divide typically takes 5 times or more as long as floating mult- iply. I don't understand the reference to x-ln(1+x). This will always lose accuracy if evaluated in this form for x near 0. It is usually possible to perform boolean operations on floats al- though it may be a little awkward. I presume the reason for a separate floating point unit is that this leads to faster machines. Regarding my suggestion for an integer*8 type Herman Rubin comments: Yes and no. It would have much more general utility, but it would do an abysmally inefficient job in this situation. You would need to have a way of indicating that the product of two integers*4 is an integer*8, which I do not know how to do in any language with which I am familiar without writing it as a function, and I do not think that one should have to write mult48(a,b) instead of a*b. In addition, how would you write the operation which, when applied to a*b+c and n, yields both q and r? Regarding a function to get a the 8 byte product of 2 4 byte integers I don't see why this is any worse than Montgomery's function. In any case it is not needed. Let i4,j4 be 4 byte i8,j8,k8 be 8 byte. Then write i8=i4 j8=j4 k8=i8*j8 A sufficiently intelligent compiler will do the right thing. It may work to write k8=i4*j4 This sometimes works in the analogous case where k8 is 4 byte and i4 and j4 are 2 bytes. However I am not sure strictly speaking that it should. What does the Fortran standard say? As for getting both q and r this is easy just write iq=ia/ib ir=mod(ia,ib) A reasonable compiler will only generate 1 divide with remainder. I will confess however that while this works when everything is 4 bytes I don't quite see how to make it work when ia is 8 bytes (since it seems unreasonable to define the quotient of a 8 byte integer by a 4 byte integer to be 4 bytes and if it is defined to be 8 bytes it is then unsafe to use the usual 8 byte by 4 byte giving 4 byte quotient instruction). James B. Shearer ========================================================================= Date: 25 February 1991, 23:40:57 EST From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: bizarre instructions In discussing what a compiler should do with i8=j4*k4 where i8 is integer*8, j4,k4 are integer*4, Herman Rubin says I have never seen any language description in which the type of the result in a replacement statement affected what operation is performed. I would not propose this. I would propose defining the type of an integer*4 x integer*4 multiplication to be integer*8. Then the right thing will be done whether i8 is 8 bytes or 4 bytes long. Several people have discussed various instruction sequences with the assumption that all fixed point instructions take the same amount of time. While on many machines most fixed point instructions are 1 cycle, this is generally not true of fixed point multiply and especially not true of fixed point divide. Thus a compiler which uses 2 divides for iq=ia/ib iq=mod(ia,ib) may be wasting the equivalent of 20+ typical fixed point instructions. In fact if you know the compiler will do this the sequence iq=ia/ib ir=ia-iq*ib will be much faster on many machines. The 11 instruction pop subroutine used a divide with remainder. On many machines this instruction will be considerably slower than the other 10 combined. If one is doing many divides by the same integer p it will often pay to replace them with multiplies by some "magic constant" based on the binary representation of 1/p. Regarding wrongheaded instructions I believe there are numerous examples where a complicated instruction was implemented in such a way that a sequence of simpler instructions doing exactly the same thing would execute faster. Of course this may be an implementation problem rather than an architecture problem. James B. Shearer ========================================================================= Date: 1 April 1991, 20:27:13 EST From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: Snakebytes Linley Gwennap writes: If you cannot recompile, your application will still run faster than it would on any other workstation available today. Isn't this a bit of an exaggeration? Suppose your application resembles tomcatv? James B. Shearer ========================================================================= Date: 12 May 1991, 22:34:55 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: generating random numbers Herman Rubin writes: I believe most people are aware of the existence of simulation, including Monte Carlo, or Las Vegas, methods for obtaining answers to otherwise intractable problems. In these, it is almost always the case that considerable use must be made of non-uniform random variables; the quantities of interest are usually exponential, normal, Cauchy, binomial, Poisson, etc., and not uniform. In any particular problem, it is quite likely that thousands or millions or even billions of these random numbers will be used. If this is not a situation which calls for efficiency, it will do until a real one comes along. :-) Even granting that Monte Carlo methods are important I remain unconvinced that: 1) The time spent generating random numbers dominates these computations. 2) Your proposed instructions would reduce the time it takes to generate the random numbers. James B. Shearer ========================================================================= Date: 12 May 1991, 22:49:53 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: float/float = integer, remainder Herman Rubin states: I have already pointed out that every trigonometic and exponential routine does, in some way, float/float -> integer, remainder. The integer is also used. 1. As I have already pointed out, the actual computation being done is float/real -> integer, remainder (since pi and ln(2) are not ma- chine numbers). A proposed float/float -> integer, remainder instruc- tion (assuming it ran at the same speed as floating divide, actually it would probably be slower) would not benefit the trigonometric or exponential routines on any architecture I am familiar with 2. In any case it is not true that "every" exponential routine does this computation. The scalar short precision exp routine in IBM's vs fortran library does not do this computation in any way. James B. Shearer ========================================================================= Date: 13 May 1991, 22:00:34 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Fujitsu linear recurrence vector instruction I believe what Fujitsu has actually done is provide a vector instruction to compute x(j)=a(j)*x(j-1)+b(j) (j=1,length vector register) given a,b and x(0). This runs slower than most vector instructions be- cause it can't be pipelined (ie the add for x(j) feeds the multiply for x(j+1)). It still may be advantageous however if it lies in a loop which can otherwise be vectorized since otherwise time is wasted moving stuff between the vector and scalar units. Of course I may have this all wrong. Perhaps someone from Fujitsu can give us the real story. James B. Shearer ========================================================================= Date: 13 May 1991, 22:18:20 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: how to compute exp(pi**3) without dividing by ln(2) Herman Rubin says: I fail to see how one would compute exp(pi^3) even to short precision without doing some computation like (pi^3)/ln2. I can think of at least 3 ways: 1) Use the power series. Since all terms are positive there are no numerical problems (if we compute in double). 2) Use exp(x)=exp(x/2)**2 to reduce the argument to a small number y, compute exp(y) with a minimax polynomial approximation, recover exp(x) by squaring the appropriate number of times. Each square will lose a bit of accuracy but if we are computing in double this need not be a problem. 3) Write x=a+b+c+d. Compute exp(a), exp(b), exp(c) and exp(d) by table look up. Compute exp(x) as exp(a)*exp(b)*exp(c)*exp(d). I am sure there are others. James B. Shearer ========================================================================= Date: 14 May 1991, 20:14:45 EDT From: JBS at YKTVMZ To: comp-arch at ucbvax.berkeley.edu Subject: float/float = integer, remainder Peter Montgomery gives an example (computing sin(1234.) in 4 digit decimal) which purports to show how to use a float/float -> integer, remainder instruction for argument reduction. There are some problems with his method (as I will discuss later). In any case it is besides the point. I am not claiming you can not use this instruction for argument reduction. I am claiming it will not enable you to do argument reduction any faster. This is because it can be expected to run slower than floating divide. Floating divides are so slow on most recent machines (as compared to floating multiplies or adds) that it pays to go to great lengths to avoid them. In this case this is easy to do as we are dividing by a constant so we can multiply by the reciprocal instead. Consider for example how argument reduction is done for the exp routine (as we will see this is the easier case). We are using the identity exp(x)=exp(n*log(2)+(x-n*log(2))=(2**n)*exp(y) where y=x-n* log(2). We want y as near to 0 as possible so we choose n to be the nearest integer to x/log(2). We will compute exp(y) by using a mini- max polynomial approximation on the interval (-.5*ln(2),.5*ln(2)). If we actually use a minimax approximation valid on a slightly bigger in- terval (say (-.35,.35)) than we can afford to be slightly sloppy in choosing n. Hence we take n to be the nearest integer to x*(1./log(2)). Now we must compute x-n*log(2) without loss of accuracy due to cancell- ation. In this case this is easy. Note n is at most 11 bits long (else exp(x) will overflow or underflow; we are assuming IEEE 64-bit arithme- tic). So write ln(2)=z1+z2 where z1 contains the first 42 bits of ln(2) and z2 the next 53. Then y=(x-n*z1)-n*z2 computes y accurately. Also this entire sequence will be faster than a single floating divide on either a 3090 mainframe or a Risc System 6000. For trigonometric functions we let n be the closest integer to x/(pi/2) and y=x-n*(pi/2). The computation of n and y is trickier for reasons: 1) n may be very large. 2) y must be computed with small relative error not just small absolute error (since sin has a zero at zero). Peter Montgomery's method does not deal with these problems. Since the quotient is returned as an integer it will overflow whenever n>2**31. His example assumes his integers have as many digits as his floats. If the integers were 2 digits long then 785 would overflow. His example also assumes 8 digits of pi/2 suffice for accurate argument argument reduction of 4 digit numbers. This is not true as can be seen by trying sin(355.). Finally in computing 785*pi2, 785 must be convert- ed back to floating point so you still have conversions. Problem 1 can be dealt with by requiring abs(x) to be not too large. Bigger arguments can be forbidden or handled by alternative slower code. Since almost all inputs to the trig functions lie near zero this will not affect the average performance. Problem 2 can be dealt with be dividing pi/2 into more than 2 parts. The precision of pi/2 required increases as allowed values of n increase. Note if you have 3 parts pi1, pi2 and pi3 you must compute t-n*pi2 accurately as well as t=x-n*pi1. The general problem of computing x-y*z accurately even when y*z is near x can handled in several ways without the new instruction. On machines such as the Risc System 6000 with a fused multiply-add there is no problem, the multiply-add instruction does the right thing. On ma- chines that can return the low part of y*z you just subtract the high part first, then the low part. James B. Shearer ========================================================================= Date: 17 May 1991, 01:18:29 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: float/float = integer, remainder David Hough states: A couple of issues have been confused in recent comp.arch discussion. One is whether argument reduction for elementary transcendental functions involves simultaneous computation of a quotient (preferably in integer format) and a remainder; another is whether instructions for that purpose should be included in computer instruction set architectures. Well the reason they have been "confused" is that people advo- cating the float/float = integer, remainder instruction have stated that such an instruction would be useful for argument reduction for elementary transcendental functions. David Hough appears (his post was little unclear) to agree with me that this is not the case. If you are going to put an instruction in your architecture that may run for 1000's of cycles and causes all sorts of design headaches it would be nice if it did something useful. James B. Shearer ========================================================================= Date: 17 May 1991, 01:44:54 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: communication between fixed and float units Herman Rubin says (in connection with the Risc System 6000): As for (2), it would reduce the cost of the hybrid operations down to 1/3 their present cost at least, and in some cases more. For unsigned integer to floating conversion, at present this takes an integer store, a floating load, and a floating add, assuming that the preparations have been made. Other situations are even worse. Things are not this simple. The 3 instructions can be over- lapped with instructions of the opposite type (it turns out a float- ing load is of integer type). A proposed integer->float instruction would presumedly be of both types. This would make the ratio 3:2. Herman Rubin also says (in connection with potential gains from easier communication between the fixed and float units): I believe that the current elementary function library for the RS/6000 does quite a bit of this, and this might give an indication. I agree that easier communication between the units would help in computing elementary functions. However the gains would not be spectacular (I guess 10% or so) and elementary functions themselves don't seem all that important (anybody have numbers on this?). There are also other perhaps easier ways of achieving comparable gains. For example: 1.) Add an instruction which gives the integer part of a floating number as a floating number. 2.) Drop the requirement that the functions work for all round- ing modes. 3.) Write functions which compute more than 1 value at a time. James B. Shearer ========================================================================= Date: 19 May 1991, 21:22:01 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: new instructions Herman Rubin writes: Fixed point arithmetic is little used now because the hardware to support it reasonably well does not exist. It is worse than the floating problems before hardware floating arithmetic, especially if floating is automatically normalized. THAT feature of "modern" architectures is, in my opinion, a sheer horror. I think fixed point arithmetic is little used now because the vast majority of users find it harder to use than floating point with no comp- ensating advantagres. Does anyone seriously believe if a few instructions were added to provide hardware support (btw what is missing? In what way has fixed point support deteriorated?) fixed point usage would increase significantly? Regarding unnormalized floating point what is this good for besides simulating multiple precision integer arithmetic? Herman Rubin also writes: In the early FP computers, much function calculation was done in fixed point, to get increased accuracy at little cost. This only makes sense when the floating point fraction length is less than a full integer. With 64-bit floating point there is little need for increased accuracy in any case. Herman Rubin also writes: How do you expect users who do not even know of the existence of the operations to use them? I expect the compiler to generate the instructions for them. If the compiler won't generate an instruction this is a strong reason for not having it in the instruction set. Herman Rubin also writes: There are many more algorithms than are in the philosophy of software, and especially hardware, designers. The argument here seems to be that there are numerous algorithms which are not used at all today which suddenly would be used all over the place if only a few changes were made to the instruction set. I don't agree. If two algorithms have similar performance both will be in use as there will be some problems which are particually suited to one or the other. If an algorithm is not used at all on today's machines this means it is not competive even on those problems for which it is particually suited. Making changes to the instruction set is unlikely to drastical- ly change the relative performance of two algorithms, hence is unlikely to drastically change the amount of usage each gets. I believe it is perfectly sensible for hardware designers to concentrate on speeding up the existing mix of applications. James B. Shearer ========================================================================= Date: 20 May 1991, 22:16:39 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: un-normalized floating point etc Herman Rubin writes: The last question first: How about multiple precision floating (or fixed) arithmetic? Considering that there are quite a few papers on this, it is certainly a topic of interest. I do not believe it should be necessary here to go into the full range of situations I can list NOW where this would be useful. Now what is the benefit of allowing only normalized floating point? It eliminates the need for a normalization option in floating instructions, and it provides ONE more bit of accuracy. Is that ONE bit exactly what is needed? This is very unlikely. Now what is the cost of not having forced normalization, besides the one bit? There would have to be a method for indicating which result of the operation is wanted (upper, lower, normalized). There would be little additional hardware other than the decoding, by the floating unit, of this information. Well the main cost of not having forced normalization is that your floating point format becomes incompatible with everybody else's. When a standard becomes as entrenched as the ieee floating point standard has become in recent years its technical merits or lack thereof become almost irrelevant. This is the main reason the IBM Risc System 6000 uses IEEE. (Does anyone know of any recent designs which don't use IEEE?) Also your claim that there is little hardware cost is de- batable. If your multiplier must be able produce the normalized product of two unnormalized numbers, the normalization stage has to be able to handle shifts of 100 bits (instead of 0 or 1 only). As well as adding hardware this may slow you down. Finally regarding multiple precision floating point arithmetic. If you mean twice working precision I fail to see the benefit of having unnormalized floating point operations. The Risc System 6000 is able to provide software quad-precision arith- metic with the current instructions. If you mean more precision than this I believe this is very little used. Herman Rubin also writes: So why not have increased integer accuracy? It is no harder to do this, and the same units can be used. I believe there is a good argument for 64 bit integers. However as someone already pointed out 64x64 bit multiplies can not be done with a 53x53 bit floating point multiplier. Herman Rubin also writes: The chicken and the egg again. Anyone who is willing to say that something is not useful is either ignorant, arrogant, or stupid. Nobody can, or should, attempt to ever do this. Even the best people can make big mistakes. Well when a architect leaves something out he is not say- ing it is useless, he is saying its benefits do not match its costs. Btw I suspect to be a good designer you need a bit of arro- gance. James B. Shearer ========================================================================= Date: 21 May 1991, 20:20:38 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: ieee floating standard Richard O'Keefe says: Given that the same posting has already beaten Herman Rubin over the head with the IEEE standard, this doesn't go down well. IEEE extended precision (and the IEEE standard does recommend that it be provided) uses 64-bit signifcands, so a "full" implementation of IEEE 754 would include a 64x64 multiplier, not a 53x53 one. Don't the 80387 and the 68882 chips offer extended precision? What size of multiplier do they use? Let me explain what I meant. When I stated that the IEEE standard has become so entrenched that manufacturers are effective- ly forced to use it, I was referring primarily to the representa- tion of 32 and 64 bit floating point numbers and secondarily to the way floating arithmetic (in round to nearest mode) is defined to work. I do not believe the entire standard is so entrenched. In any case, is it not the case that the IEEE standard for extended requires at least 64 bits in the fraction (not 64 bits)? If so a 128 bit format (a much more sensible length than 80) with 106 bit fractions could utilize a 53X53 multiplier. Also does the standard recommend that extended be provided in hardware? James B. Shearer ========================================================================= Date: 23 May 1991, 20:52:58 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE floating point Hugh LaMaster states: But, the reason that the standard is so entrenched is that it is so useful! Manufacturers are forced to use it because users have requirements for it. I believe the standard has become entrenched for reasons other than technical merit. If the IEEE had chosen some other standard it would probably be equally entrenched by now. Hugh LaMaster also states: There are always programs which run on the ragged edge of precision, and which don't work right on a slightly poorer implementation. I am skeptical about this statement. Do you have some non- contrived examples? I believe 64 bits with any reasonable format is enough for most problems. Certainly I have not noticed any great demand for more than 64 bits which suggests to me there is still some pre- cision to spare. Many users get by with 32 bits. Hugh LaMaster also states: 2) IEEE is very well behaved, compared with other representations. I won't bother to substantiate this, other than to state that many experts agreed at the time it was developed that there was no known way to improve it numerically. I do not consider this a point in favor of the IEEE standard. It is very rare that the optimum design in cases where there is trade- off between qualities A and B, consists of maximizing A and ignoring B. This is because there is usually a diminishing returns effect in which one must give up greater and greater amounts of B to obtain further im- provements in A. The above statement suggests the IEEE standard gave undue weight to numerical quality to the exclusion of factors such as performance and ease of implementation. Couldn't we do without some or all of the following features of the IEEE standard? 4 rounding modes (why only 4? why not 8?). Inf and NaN. Denormalized numbers. Distinction between +0 and -0. Will Nasa ever build a spacecraft of which experts will say "there is no known way to make this vehicle safer"? Hugh LaMaster also states: The down side of IEEE is a performance hit in heavily pipelined FP units, for some input values. On the other hand, it is nice to get the right answer, even if some cases slow down. You are understating the downside. The standard is complex which increases design time and cost. This added design cost applies to software such as compilers as well hardware. The performance hits are not confined to heavily pipelined floating point units on denorm- alized inputs. As to wrong answers, wrong answers are generally caused by users not knowing what they are doing. Users who do know what they are doing don't need IEEE arithmetic to get right answers, users who don't know what they are doing will have no problems getting wrong answers using IEEE arithmetic. James B. Shearer ========================================================================= Date: 24 May 1991, 19:06:52 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE floating point Henry Spencer writes: Unfortunately, numerical computing is too useful and too widespread to remain the plaything of the small number of people who "know what they are doing" in all respects. The fact is, almost nobody who is using computers to get real work done has time for an in-depth study of all the fine points of numerical mathematics. The notion that such people shouldn't try to do numerical computing at all is hopelessly unrealistic, not to mention obnoxiously elitist. Well many users apparently feel they don't have time for a superficial review of the basics of numerical mathematics either. Even willingness to use a little common sense would avoid a lot of problems. I believe it is the notion that such people are offered any significant protection by the IEEE standard that is hopelessly unrealistic. Henry Spencer also writes: Without lengthy analysis by experts, it is not possible to say *for sure* that the answers are right. However, well-designed tools like IEEE FP improve the odds a lot. In particular, they make it much more likely that problems will be obvious and predictable rather than subtle and mysterious. Well to repeat myself I don't believe IEEE floating point (part- icularly its more esoteric aspects) improves the odds a lot because I don't believe it addresses the main sources of error. I also don't believe that IEEE FP is well-designed. Exactly why do you believe nu- merical problems using IEEE FP are more likely to be obvious and pre- dictable than numerical problems using IBM hex (for example)? As long as we are on this subject using IBM hex; 0*x is always 0, x.le.y is the same test as .not.(x.gt.y), and (x.ne.x) is always false. None of the preceding properties hold with IEEE FP. Which behavior would you call "obvious and predictable" as opposed to "subtle and mysterious"? Have you considered the problems this sort of thing causes compiler writers? Henry Spencer also writes: This is very important in the real world, where experts are expensive and in short supply, and a great deal of numerical computing simply has to be done without them. Expert hardware designers and expert compiler writers are also expensive and in short supply. Why allow compromises in the area which is the major source of problems, while demanding perfection in an area which is a minor source of problems? James B. Shearer ========================================================================= Date: 29 May 1991, 01:15:30 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE floating point I said: >As to wrong answers, wrong answers are generally caused by >users not knowing what they are doing. D.C. Lindsay commented: I disagree. Dr. Kahan tells excellent anecdotes on this point, with examples of well-respected products (e.g. Mathematica, hand calculators) giving wrong answers to simple problems. I don't see your point here. In the first place I said generally not always. In the second place these examples support my point since the users here are the writers of Mathematica and the power function on calculators and they are making errors which have nothing to do with the quality of the floating point arithmetic they are using. D.C. Linsay also said Seconded. People who think that the common alternatives are as well designed, are probably unfamiliar with the sore points. For example, Cray format allows a value which will overflow if multiplied by 1 ... IEEE format allows values which when multiplied by 0 do not equal 0 which is in my opinion worse. I said: >...don't believe it addresses the main sources of error. I also don't >believe that IEEE FP is well-designed. Exactly why do you believe nu- >merical problems using IEEE FP are more likely to be obvious and pre- >dictable than numerical problems using IBM hex (for example)? ... Henry Spencer replies A proper response to this would basically be a detailed defence of IEEE FP's more controversial design decisions. I have neither the time nor, really, the expertise to do this. However, salvation arriveth from an unexpected direction. :-) Go read "What Every Computer Scientist Should Know About Floating-Point Arithmetic", by David Goldberg, in the latest (March) issue of ACM Computing Surveys; doing so will enlighten you in detail on the subject. I will confine myself to observing that IBM hex FP is the only FP format I know of that made half the FP instructions -- the single-precision ones -- just about useless to most programmers. I checked the Goldberg reference. He states in effect the pur- pose of his paper is to explain IEEE floating point not defend it. Your statements about IBM hex also are not responsive to the question I asked which to repeat is given that you have exceeded the precision provided why does this produce "obvious and predictable" effects with IEEE as opposed to "subtle and mysterious" effects with IBM hex (or any other alternative). Bill Davidson states: may not have more than a few significant bits for starters, and I would rather not trust my bits to a computer designed with the marketing department winning compromises between "right" and "fast" answers. I believe you have things backwards here. Marketing depart- ments generally love IEEE. After all they don't have to implement it. Keith H. Bierman states: Having 1% bad fp arithmetic is like playing russian roulette every morning. It will bite you someday. Unlike RR, the bad fp will just assist you in doing bad science. Perhaps that isn't a problem, just think of all the extra papers/research we all get to do to disprove the resulting drivel... This would imply people using Crays are more likely to pro- duce drivel than people using IEEE systems. I doubt this is true. James B. Shearer ========================================================================= Date: 30 May 1991, 23:14:39 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE floating point (Goldberg paper) I said: > I checked the Goldberg reference. He states in effect the pur- >pose of his paper is to explain IEEE floating point not defend it. Henry Spencer said: Did you read the paper, or just the introduction? He does a fairly good job of justifying the various decisions as he goes. I have looked through the paper ("What Every Computer Scientist Should Know About Floating-Point Arithmetic" by D. Goldberg, ACM Compu- ing Surveys, Vol 23, 1991, p5-48). He provides examples of ways in which various features of IEEE arithmetic can be used. I don't consider the fact that a feature has some use justification for including it. Justif- ication would also consider the cost of providing a feature and discus- sion of alternative ways of providing some or all of the benefits. A feature should be included only if this sort of analysis indicates a favorable cost/benefit ratio. The paper does little of this and indeed the author states this is not the purpose of his paper. Additionally some of the examples of how to use IEEE seem ill- advised to me. I will discuss one example. On page 22 the author says (I have made small changes to the notation, fmax is the largest finite floating point number): >Here is a practical example that makes use of infinity arithmetic. Con- >sider computing the function x/(x**2+1). This is a bad formula, because >not only will it overflow when x is larger than , but infin- >ity arithmetic will give the wrong answer because it will yield 0 rather >than a number near 1/x. However x/(x**2+1) can be rewritten as 1/(x+ >1/x). This improved expression will not overflow prematurely and be- >cause of infinity arithmetic will have the correct value when x=0: 1/(0+ >1/0)=1/(0+inf)=1/inf=0. Without infinity arithmetic the expression 1/ >(x+1/x) requires a test for x=0, which not only adds extra instructions >but may also disrupt a pipeline. This example illustrates a general >fact; namely, that infinity arithmetic often avoids the need for special >case checking; however formulas need to be carefully inspected to make >sure they do not have spurious behavior... I have the following comments. 1) If the user uses the original "bad" formula out of careless- ness or a belief x will never be large then as the author notes IEEE arithmetic may compute a totally bogus result with no error indication (other than the setting of an overflow bit which nobody looks at). On other systems the error is usually more obvious. 2) An alternative way of dealing with the large x problem is: if(abs(x).le.1.0d18)then y=x/(1+x*x) else y=1/x endif Compared to the author's solution, this has the following virtues a) It is more portable. b) It will be faster (at least on scalar machines) because it uses a single floating point divide. c) It will work properly on denormalized inputs. The authors code fails whenever 1/x overflows but x is not 0. Note the original "bad" code handles this properly. 3) Coding intentional harmless overflows makes it very diffi- cult to detect unintentional harmful overflows (such as the original formula can produce). Thus coding in the author's suggested style would appear to be dangerous. To sum up, in my opinion all that this example shows is that infs provide a new way to screw up. Another quote from this paper. In a footnote on page 17 the author states: >According to Kahan, extended precision has 64 bits of significand be- >cause that was the widest precision across which carry propagation >could be done on the Intel 8087 without increasing the cycle time... This may constitute justification to 8087 designers, I don't see why it should be for the rest of us. James B. Shearer ========================================================================= Date: 31 May 1991, 20:47:30 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic (Goldberg paper) Henry Spencer says: My understanding is that the funnier-looking features, in particular the infinities, NaNs, and signed zeros, mostly cost essentially nothing. I believe this statement is incorrect. Infs and nans compli- cate the design of the floating point unit. Even if there is ultimately no performance impact, this added design effort represents a very real cost. Perhaps some hardware designers could comment. Also as has already been pointed out there is a software cost due to such features as 0*x no longer being 0. In some cases this will have a performance impact in that compilers will generate inferior code. Henry Spencer says: Getting the rounding right reportedly takes work, but that one seems easy to support. As far as round to nearest goes, this does provide a real benefit. I am not totally convinced it's worth the cost but a case can made. The case for the round to even (when halfway) seems much more dubious. I believe this represents a substantial portion of the cost (comments?) and the additional benefits seem quite tenuous. I don't see any good reason for the other 3 modes. I said: >arithmetic may compute a totally bogus result with no error indication >(other than the setting of an overflow bit which nobody looks at). On >other systems the error is usually more obvious. Henry Spencer said: Only if the author goes to substantially more trouble than merely looking at an overflow bit. This really sounds to me like the old "people wrote better when they had to retype a whole page to fix an error, because they had to think about it more" fallacy. People who took care before will continue to take care; people who didn't before won't now. The difference is that both are somewhat less likely to get plausible-looking garbage now. Nobody is proposing that IEEE arithmetic is a replacement for care and understanding. All it does is improve the odds that a given amount of care and understanding will produce meaningful answers. I guess I was a bit obscure here. On IBM hex systems when floating overflow occurs the default behavior is for an error message to be printed which includes a traceback pointing to the offending instruction. Execution then continues (with the max floating number as a fixup) but will terminate if more than a small number (5?) of over- flows occur. I was under the impression that this was typical behavior for pre-IEEE systems. Is that not the case? I consider this behavior much friendlier than typical IEEE behavior which is to run to completion then output infs and nans (or worse plausible but wrong numbers) with no indication of where the error occurred. It seems to me that the existence of infs in IEEE is in- creasing the likelihood of obtaining plausible-looking garbage. I said quoting the paper: >>According to Kahan, extended precision has 64 bits of significand be- >>cause that was the widest precision across which carry propagation >>could be done on the Intel 8087 without increasing the cycle time... > > This may constitute justification to 8087 designers, I don't >see why it should be for the rest of us. Henry Spencer said: To me it sounded like "the precise number of bits was not thought to be crucial, so the limitations of a very important existing implementation were taken into consideration". If you don't like this, explain why the resulting number of bits is wrong or inadequate. If it's satisfactory, why do you care how it was arrived at? Standards decisions often turn on such seemingly trivial or ephemeral issues; what matters is the result. I was unaware the Intel 8087 could be considered a "very important existing implementation" as compared to the DEC format for example. Do you have some evidence for this? 79 bits is clearly an absurd length for a floating point format (the same is true of the 43 bit single extended format). As well being very awkward the extended lengths are too close in precision to the unextended lengths. The rationale behind the extended formats: providing extra precision for intermediate terms is ridiculous. If the added precision is available it should always be used otherwise it is of little value (see various comments on hex format). James B. Shearer ========================================================================= Date: 3 June 1991, 16:33:01 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic Several people have challenged my characterization of typical behavior on fp overflow in pre-IEEE environments. In my defense I was not intending to claim the exact behavior of IBM fortran was typical, just the fact that x/(x*x+1) cannot silently produce garbage. Even so it appears that pre-IEEE behavior was more diverse than I had realized. I said: >... The rationale behind the extended formats: >providing extra precision for intermediate terms is ridiculous. If the >added precision is available it should always be used otherwise it is >of little value... Henry Spencer says: Having had some need to compute hypotenuses without overflow in code that otherwise wanted to use single precision for speed, I'm afraid I disagree. Whether it is "of little value" depends a lot on what you are doing. First a nitpick, you are using the extra exponent range not extra precision. In deciding whether a feature is of value one must look at alternatives. In your example the obvious alternative is to compute the hypotenuses using double precision. I do not see any reason for an intermediate precision. Btw how exactly would you expose an intermediate format to the user? Dik Winter said (responding to the same statement): Providing extra precision for intermediate terms is *not* ridiculous. Witness the maf instructions on the RS6000. Witness also the many old papers that use inner products calculated in extra precision. The problems with hex format are the following: 1. The wobbling precision must let you use the worst case for error estimates. 2. This makes single precision much more useles than binary single precision is already. Note also that Blaauw reportedly has said that were he to design FP again he would not use hex. I believe the maf instructions omit the intermediate round- ing step primarily to improve performance. I believe the papers to which you refer basically require a way to obtain the double precis- ion product of two single precision numbers. I consider this a some- what different issue. I agree that binary formats are superior to hex in that they provide 2 extra bits of precision. I don't agree (particually for 64-bit formats) that this 2 bits makes it significantly easier to avoid numer- ical problems. If I were to design a FP format I would probably use binary also. I am not objecting to the use of binary. James B. Shearer ========================================================================= Date: 3 June 1991, 20:21:53 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic Jerry Callen says: This may be obvious, but let's not confuse "compiler and runtime system behavior" with "floating point hardware behavior." Yes, Fortran on 360/370 systems behaves as you describe. In PL/1 it is possible to completely ignore exponent overflow if you wish and get meaningless results. Similarly, IEEE overflow behavior is under programmer control. If you WANT to ignore overflow and then just check to see if some error occurred at the end of the calculation you may do so; if you want to know immediately, you can do that, too. I believe most users will find it difficult to do these things. I, for example, do not know how do this in fortran on the IBM Risc System 6000. This is a problem for other users on other machines as the "How to detect NaN's" thread in comp.lang.fortran shows. Also it is desirable that any method for doing this should be portable across IEEE systems (since IEEE is supposed to promote portabil- ity) and have no significant performance impact when exceptions do not occur (since otherwise users are encouraged to run in an unsafe way and IEEE is supposed to promote safety). James B. Shearer ========================================================================= Date: 4 June 1991, 20:12:02 EDT From: JBS at YKTVMZ To: comp-arch at ucbvax.berkeley.edu Subject: IEEE floating point Ping Huang asked: Reading this thread, I get the impression that several people feel that IEEE compliance exacts a large toll in speed. I do not doubt this, but I was wondering if anyone had comparative statistics on the speed of floating point operations when performed by an IEEE-compliant package versus one which was not. Of course, to be meaningful the data would have to be for the same piece of hardware. The following quote from "Machine Organization of the IBM RISC System/6000 processor" by G.F. Grohoski, IBM Journal of Research and Development", vol 34, 1990, p37-58 (p56-57) may give some indication. "... the RISC System/6000 uses the IEEE floating-point arith- metic format, while AMERICA used the IBM System/370 format. ... This degraded floating point performance substantially in peak floating- point loops. For example, using the 2D graphics example above, the RS/6000 machine takes seven cycles per loop iteration as opposed to four in AMERICA. On balance however this degradation is less severe; while the potential AMERICA LINPACK performance was approximately 15 MFLOPS, the RISC System/6000 achieves nearly 11 MFLOPS." The main reason for the performance degradation appears to be the need for rounding in IEEE. James B. Shearer ========================================================================= Date: 5 June 1991, 15:41:14 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic I said: > I believe the papers to > which you refer basically require a way to obtain the double precis- > ion product of two single precision numbers. I consider this a some- > what different issue. Dik Winter said: It would be a different issue if it where true. But it is not true, the papers I refer to are about the accumulation of an inner product in double (extra) precision. See for instance: J.H.Wilkinson, Rounding Errors in Numeric Processes, Notes on Applied Science No. 32, HMSO, London; Prentice Hall, New Jersey, 1963. (Yes fairly old; it was already known in that time!) I have checked this reference (actually "Rounding Errors in Algebraic Processes"). I believe it is consistent with what I said. For example p. 10 "Nearly all digital computers produce in the first instance, the 2t-digit product of two t-digit numbers", p. 14 "The subroutine accepts single-precision factors and gives a double preci- sion product", p. 32 "If we are working on a computer on which the inner products can be accumulated, then there may be stages in the work when we obtain results which are correct to 2t binary figures". Throughout this section it is clear what is being discussed is a way to obtain inner products to twice working precision, not slightly more than working precision as in the IEEE extended formats. I said: > I agree that binary formats are superior to hex in that they > provide 2 extra bits of precision. Dik Winter said: A nit; they provide 3 more; nearly a digit. How do you get 3? Suppose the leading hex digit is 1, then we lose 3 bits from the leading zeros, 1 bit from the hidden 1, but gain 2 bits from the smaller exponent field for a net loss of 2 bits. I said > I don't agree (particually for 64-bit > formats) that this 2 bits makes it significantly easier to avoid numer- > ical problems. Dik Winter said: Oh, but they are, see the paper I mentioned above. I found nothing in the above reference to justify this state- ment. James B. Shearer ========================================================================= Date: 5 June 1991, 16:36:46 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic (Goldberg paper) I said: > Also as has already been pointed out there is a software cost >due to such features as 0*x no longer being 0. In some cases this will >have a performance impact in that compilers will generate inferior code. Jim Dehnert said: As a compiler writer who has worked with IEEE systems, I don't know why, at least in any area with an important impact on performance. The reason for the IEEE definition (0*Inf=NaN) is that zero may represent a result merely close to zero, and Inf really represents a very large number -- I can pick representatives of these sets which multiplied together give me a true result in the representable range (not zero or Inf). If I see a literal zero while compiling (the only case I can think of when the compiler does anything that might care about what 0*x equals), I feel quite free to assume that it is really zero, in which case 0*x is really zero even for very large x. So I will do the same thing I would have for non-IEEE. (This is the only case I'm aware of where IEEE defines 0*x != x.) Some people write: IF(0.D0*X.NE.0.D0)THEN to detect infs or NaNs (see recent posts to comp.land.fortran). This works on many systems. If you replace 0*x with 0 then it will not work on your system and your system is broken. Jim Dehnert also said Round to +/- infinity give you the floor and ceiling functions, which come up frequently, for example in many transcendental function algorithms. This appears to apply just to convert to integer instructions. Different modes could be offered for these instructions alone. In any case I don't believe they are useful for transcendental function algor- ithms. In fact the different rounding modes are a problem because the functions must work for all modes. Jim Dehnert also said (referring to trapping floating overflow): It is the case. It's also perfectly achievable behavior for IEEE systems if exceptions are trapped. This has nothing to do with the IEEE standard. Instead, it's probably a result of default Unix software behavior, which is not to trap in most implementations. Possibly I should be blaming Unix. I tend to associate the two. Note in Goldberg's paper he suggests rewriting x/(1+x*x) in a way which assumes trapping is not occurring. I said: > 79 bits is clearly an absurd length for a floating point >format (the same is true of the 43 bit single extended format). Jim Dehnert also said: An interesting statement. Are you of the "only powers of two are real numbers" persuasion? I believe floating format sizes should be compatible with the basic organization of the machine. IE if your machine is based on 8-bit bytes the length best be a multiple of 8. I believe 32 or 64 bits are typical widths of data paths in current machines. Therefore moving 79 bit things around will clearly be wasteful. Consider a machine with 64 bit floating registers and a 64 bit data path between the floating unit and cache. I see no efficient way to include a 79 bit format. An 128 bit format on the other hand is easily accomodated. Herman Rubin asked: Why should it even be the case that the exponent and significand are in the same word? Because if they aren't every load or store will require 2 mem- ory references instead of 1 halving the speed in many cases. James B. Shearer ========================================================================= Date: 6 June 1991, 20:44:38 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic (Goldberg paper) Richard Firth quotes the fortran standard: Evaluation of Arithmetic Expressions ... Once the interpretation has been established in accordance with these rules, the processor may evaluate any mathematically equivalent expression, provided that the integrity of parentheses is not violated. Two arithmetic expressions are mathematiccally equivalent if, for all possible values of their primaries, their mathematical values are equal. However, mathematically equivalent arithmetic expressions may produce different computational results. to conclude: In otherwords: yes, Jim Dehnert is fully entitled to replace 0.0*X with the mathematically equivalent expression 0.0; Some comments: 1. The question is not what fortran requires but what IEEE imple- mented in fortran requires. One of the problems with IEEE's infs and NaNs is that they do not fit well with fortran. 2. Appeals to a standard which predates IEEE are not totally con- vincing. What does the Fortran 90 standard say? 3. In any case my reading of the standard is opposed to yours. Infinity is a possible value of x, 0*infinity does not have math- ematical value 0, therefore 0*x is not mathematically equivalent to 0 in IEEE. Do you replace 0*x with 0 when compiling without optimization? I think that if a (legal) program behaves differently when compiled with and without optimization that then the compiler is broken. Perhaps you disagree. How would you handle 0-x? Goldberg in his paper (p.33) spec- ifically forbids replacing it with -x although the fortran standard would appear to permit it since mathematically +0=-0. Richard Firth also said: ... anyone who writes rubbish like IF (0.0*X .NE. 0.0) would be well advised to take the trouble to learn the language in which he is attempting to write code. So how would you code this (detecting whether x inf or NaN)? James B. Shearer ========================================================================= Date: 6 June 1991, 21:14:20 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: massive linpack Richard Lethin asks: At what size N for massive linpack does the accumulated numerical fuzz swamp the precision available in a double precision number? Very large. Typical relative error (in the vector norm sense) in the solution is n*c*e, where n is matrix size, c is condition number and e is machine epsilon (2**-52 for IEEE double). Therefore if you want a solution accurate to 1 part in 1000 (10 bits) and your matrix has condition number less than 1000 (10 bits) your n can be 2**30 or 10**9. This would require 10**18 storage and 10**27 ops to solve so is well beyond current machines. Note once you have any bits in your answer you can obtain more by iterative improvement an order N*2 pro- cess. Richard Lethin asks: Aren't really large problems of interest sparse matrices anyway? Who holds the record for massive sparse matrix solves? People should try to avoid solving problems in ways which in- volve large dense matrix solves simply because they are so expensive. I suspect in many cases where large dense matrices are solved, a diff- erent method would be faster but I have no real evidence for this. It is unreasonable to ask for record sparse solves since the difficulty is strongly dependent on the structure of the problem. James B. Shearer ========================================================================= Date: 7 June 1991, 19:13:16 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic (Goldberg paper) Bruce Seiler says: The main justification for having all the rounding modes was the +/- infinity modes made creating an interval arith. package much easier. As I recall, without them an interval package could run 300x's slower than straight floating point. With them it was ca. 3x's slower. Besides, these rounding modes were cheap. I don't believe interval arithmetic is used enough to justify any hardware support. Providing quad precision (much more useful in my opinion) would also provide some support of interval arithmetic. In any case I believe your estimate of 3x slower than straight floating point is unrealistic. Does anyone have any examples of interval arith- metic packages? What is their speed compared to straight floating point? The rounding modes may appear cheap but they have a large hidden cost. Every intrinsic function subroutine for example must worry about handling all rounding modes. I said: |> I believe floating format sizes should be compatible with the |> basic organization of the machine. IE if your machine is based on |> 8-bit bytes the length best be a multiple of 8. I believe 32 or 64 |> bits are typical widths of data paths in current machines. Therefore |> moving 79 bit things around will clearly be wasteful. Consider a |> machine with 64 bit floating registers and a 64 bit data path between |> the floating unit and cache. I see no efficient way to include a 79 |> bit format. An 128 bit format on the other hand is easily accomodated. Bruce Seiler said: The floating point chips from Intel and Motorola have 80 bit floating point registers. At the time, the standard developers decided that the main formats should be a power of two in size so array addressing is fast, it is useful to have a format with a wider mantissa to have guard bits, and a hardware ALU finds it convenient to have the mantissa to be a power of two in size. Putting the last two points together gives a lower limit of 79 bits. That is a wierd number so Intel and Moto. rounded up to 80. Actually, Moto. stores an extended to memory in 96 bits so it is long word aligned. If your compiler has good register allocation, there will be little extended loads and stores. I expect you know this already, but I am trying to understand what you saying above. Supporting any format greater than the FP ALU is going to be slow. What IS so bad about extended? According to "Computer Architecture a Quantitative Approach" (Hennessy and Patterson, p.A-53, p. E-2) none of the MIPS R3010, Weitek 3364, TI 8847, Intel i860, Sparc or Motorola M88000 architectures implement IEEE double ex- tended. This represents a lot of independent support for my position that the double extended format is not cost effective. Apparently these vendors do not agree that it is convenient for the fraction to be a power of 2 in length. If your chip has 80 bit registers I fail to see how you can avoid performing 80 bit loads and stores while saving and restoring these registers. My objection to the IEEE double extended format is that it should be 128 bits wide and called quad precision. Also I do not agree that it is useful to have a format with a few extra (guard) bits in the fraction. James B. Shearer ========================================================================= Date: 7 June 1991, 20:12:05 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE floating point (Goldberg paper) I said: > 1. The question is not what fortran requires but what IEEE imple- >mented in fortran requires. One of the problems with IEEE's infs and >NaNs is that they do not fit well with fortran. Robert Firth said: Yes, that's the problem. The Fortran standard describes the implementation of arithmetic expressions in terms of the "mathematical" rules, that is, the mathematical properties of real numbers. It is the business of the implementation to provide a reasonable approximation that, as far as possible, obeys those rules. If it fails to do so, i's not a legitimate "fortran machine" and shouldn't claim to support ANSI F-77. But a language implementation must follow the standard. That is the whole point of standards, after all, that they circumscribe the permissible implementations. You can't correct bogus hardware by writing a bogus compiler. Clearly it would be possible to write a compiler for an IEEE machine which performed all floating point operations by simulating IBM hex arithmetic (for example) using fixed point instructions. Such a compiler might obey the fortran standard. However it would be ludicrous to claim it was obeying the IEEE floating point standard. Are you saying it is impossible to write a compiler which simultaneously obeys the fortran standard and the IEEE floating point standard? Are you saying the fortran standard forbids floating point models which include infs? Do you plan to claim your compiler complies with the IEEE standard? This thread started with my claim that it is expensive for compilers to comply with the IEEE standard. I do not claim it is im- possible. I said: > Do you replace 0*x with 0 when compiling without optimization? >I think that if a (legal) program behaves differently when compiled with >and without optimization that then the compiler is broken. Perhaps you >disagree. Robert Firth said: Yes, I disagree. The issue is the relevance of the difference. Is a compiler broken because an optimised program runs faster? That's a difference in behaviour, but you presumably think it an acceptable one. Likewise, if a program transformation is permitted by the standard, a compiler is entitled to apply that transformation; it is the responsibility of the programmer to be aware which transformations are legal and not assume any of them won't happen. To clarify I was not referring to timing type differences. Perhaps I should have said it is undesirable for a program to behave differently when compiled with and without optimization regardless of whether this is technically permitted by the standard. (Actually I think it would desirable for the standard to exactly specify the behavior of any legal program but apparently it does not.) An example may explain why I believe this. Consider an user with a 100000 line fortran package. It works fine when compiled on other machines (with and without optimization). It works fine when compiled with your compiler without optimization. However it fails when compiled with your compiler with optimization. I believe it will difficult to con- vince this user that the bug is in his package and not in your compiler whatever the technical validity of your position James B. Shearer ========================================================================= Date: 7 June 1991, 21:17:11 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: massive linpack Richard Lethin asked: >At what size N for massive linpack does the accumulated numerical fuzz >swamp the precision available in a double precision number? I answered: > Very large. Typical relative error (in the vector norm sense) >in the solution is n*c*e, where n is matrix size, c is condition number >and e is machine epsilon (2**-52 for IEEE double). Therefore if you >want a solution accurate to 1 part in 1000 (10 bits) and your matrix >has condition number less than 1000 (10 bits) your n can be 2**30 or >10**9. This would require 10**18 storage and 10**27 ops to solve so >is well beyond current machines. Note once you have any bits in your >answer you can obtain more by iterative improvement an order N*2 pro- >cess. David Chase objects: I don't have the time to prove that you're wrong here, but I think that you are either wrong, or else being very misleading. In general, why should we believe you? What do you usually do? Goldberg and Hough are experts, as are the people that I took numerical analysis from too many years ago. Are you an expert? (I only know enough to know when I should be suspicious, but I'm suspicious.) First, you haven't said anything about worst case, which must be worried about. In addition, you haven't discussed a typical condition number. Second, I know that the NA people I once worked with were interested in using exact arithmetic for error analysis. (There's a paper by Boehm, O'Donnell, and Riggle in the 1986 Lisp & FP conference -- when I say "exact", I mean an answer guaranteed to be within epsilon of the true answer). This means that double precision isn't good enough for very accurate error analysis, though I don't know how they defined "very accurate". I suspect that the worst case bounds on error estimation were not satisfactory. Third, "LINPACK" is not a numerical analysis algorithm. There is a wide variety of algorithms in LINPACK, with varying stability and cost. If you don't specify the algorithm, how are we to know what you are talking about? Is it GE, GEPP, GEFP, QR, SVD, or Cholesky? Band matrix, or not? The worst case on GEPP (which is what I think you meant) is still pretty bad, though I've also heard (from an expert) that it is almost never seen. If (as I think you are saying) GEPP is so good, then why do we bother with more stable algorithms like QR and SVD? To clarify I was referring to the numerical properties of Gaussian elimination with partial pivoting (GEPP) when solving a large dense system of linear equations. This is what the TPP (towards peak performance) linpack benchmark does (on a system of size 1000). I presume the massive linpack benchmark is the same thing with larger matrices. Someone queried whether the quoted results were for problems which required pivoting which I agree is a good question. Anyone desiring further authority for my statements about GEPP should be able to find it in any good mathematics library. One reference is the book "Numerical Methods and Software" (David Kahaner, Cleve Moler and Stephen Nash, Prentice Hall, 1989). The accuracy of GEPP is discussed in chapter 3 (esp. p.66-p.67). It is well known there are certain pathological matrices for which GEPP does not perform well numerically. These matrices have the property that in the process of computing the factorization certain intermediate terms are formed which are much larger than the elements in the original matrix. These terms spoil the error analysis. If this does not happen then you are certain to be ok. This problem is almost never seen in practice. There- fore my statement about typical behavior. If you are using an appropriate numerical method the condition number generally will be determined by the stabilty of the physical system you are modeling and not by the number of points in your discret- ization. If your problem uses real inexact data the condition number best not be too large or your answer is going to be garbage regardless of how accurately you compute it. QR and SVD are used for solving least squares problems. Least squares problems can be formulated as linear equations (the normal equa- tions) but doing so unnecessarily increases the condition number. This is not a fault with GEPP but with the formulation. (This is discussed on p202-203 of the above referenced book.) As to credibility: I prefer to judge participants in the news- groups by the quality of their posts rather than by their "credentials" (or lack thereof). However for those readers who care about credentials I do have a few. I have a PhD in mathematics from MIT. Numerical anal- ysis is not my field of specialization but I have been exposed to it over the years, I have access to a mathematics library and I hopefully have enough mathematical sophistication to avoid gross misstatements. As to practical experience I have contributed to IBM's VS Fortran (ver- sion 2) intrinsic function library and to IBM's Engineering and Scienti- fic Subroutine Library (ESSL). James B. Shearer ========================================================================= Date: 11 June 1991, 19:48:25 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic (Goldberg paper) I said: >If your chip has 80 bit registers I fail to see how you can avoid >performing 80 bit loads and stores while saving and restoring these registers. David Chase said: Other people have managed in the past. Perhaps you are insufficiently imaginative. Perhaps not. Of course one way people have managed in the past is to only save and restore the first 64 bits of the register. In my view this demonstrates excessive imagination. James B. Shearer ========================================================================= Date: 11 June 1991, 20:00:27 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic (Goldberg paper) I said: > I don't believe interval arithmetic is used enough to justify > any hardware support. Dik Winter said: Well, IBM thought it important enough to provide support in some models (43xx). Acrith is the keyword. Ah Acrith, I will let Kahan comment ("Anomalies in the IBM Acrith Package" by W. Kahan and E. LeBlanc, Version dated March 13, 1985, p. 9): "ACRITH is not at all typical of IBM products but seems instead first to have escaped prematurely from a research project and then to have evaded quality controls." I said: > In > any case I believe your estimate of 3x slower than straight floating > point is unrealistic. Dik Winter said How come unrealistic? Example, interval add a to b giving c: set round to - inf; c.low = a.low + b.low; set round to + inf; c.upp = a.upp + b.upp; why would this be more than 3x slower than straight floating point? Because it's 4 instructions vrs 1. Also lets see your code for interval multiplication. James B. Shearer ========================================================================= Date: 11 June 1991, 20:18:07 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic (Goldberg paper) I said: > I don't believe interval arithmetic is used enough to justify > any hardware support. Keith Bierman said: This falls into the "H.R." chicken and egg rathole. If interval arithmetic (or anything else new and claimed to be good) is 100x slower than the traditional approach, it will simply never get off the ground. Some comments: 1. Interval arithmetic is not "new". People have been writing books and papers on it for at least 30 years. 2. Herman Rubin is still trying to get people to implement his proposed instructions. The interval arithmetic people got their instructions into the IEEE standard and they have been implemented in hardware in many machines. Interval arithmetic still hasn't gotten off the ground. How long are we expected to wait until concluding it's not going to fly. 3. The real problem with interval arithmetic is not that it's slow but that on real applications its error estimates are gener- ally so pessimistic as to be useless. Extended precision generally gives better estimates with less work. There are many ideas which have seemed promising but never quite made it, bubble memory for example. While judgement can never be final, automatic error estimation in general and interval arithmetic in particular seems unlikely to ever fulfill the hopes many had for it. James B. Shearer ========================================================================= Date: 11 June 1991, 20:58:09 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic David Hough writes: Part of the problem is that the benchmark programs in use measure only common cases. Various versions of the Linpack benchmark all have in common that the data is taken from a uniform random distribution, producing problems of very good condition. So the worst possible linear equation solver algorithm running on the dirtiest possible floating-point hardware should be able to produce a reasonably small residual, even for input problems of very large dimension. This is untrue. Consider pivoting on the column element of smallest magnitude while using IEEE single precision. I don't believe n has to be very large before you are in big trouble. David Hough writes: Such problems may actually correspond to some realistic applications, but many other realistic applications tend at times to get data that is much closer to the boundaries of reliable performance. This means that the same algorithms and input data will fail on some computer systems and not on others. In consequence benchmark suites that are developed by consortia of manufacturers proceeding largely by consensus, such as SPEC, will tend to exclude applications, however realistic, for which results of comparable quality can't be obtained by all the members' systems. If you know of a single realistic application where the diff- erence between 64 bit IEEE rounded to nearest and 64 bit IEEE chopped to zero makes a significant difference in the maximum performance ob- tainable I would like to see it. James B. Shearer ========================================================================= Date: 14 June 1991, 21:25:14 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic David Seal (quoting someone quoting the fortran standard): >The missing rule is (ANSI X3.9-1978 sect 6.6): > "Any numeric operation whose result is not mathematically defined is > prohibited in the execution of an executable program. Examples are > dividing by zero and raising a zero-valued primary to a zero-valued or > negative-valued power." >Who knows whether IEEE infinity is a "mathematically defined result"? If you I would think it isn't: it doesn't correspond to a single mathematical real number in the way that ordinary IEEE numbers do. Some comments: 1. The result of multiplying 2 reals is always mathematically de- fined. Therefore I don't see how this clause prohibits multiplies which will overflow or underflow. 2. Ordinary IEEE numbers don't really correspond to single mathema- tical real numbers, they correspond to intervals. Inf and -inf are just the names of the intervals on the ends. James B. Shearer ========================================================================= Date: 14 June 1991, 21:45:16 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic I said: >Perhaps I should have said it is undesirable for a program to behave >differently when compiled with and without optimization regardless >of whether this is technically permitted by the standard. (Actually >I think it would desirable for the standard to exactly specify the >behavior of any legal program but apparently it does not.) David Boundy said: Undesirable, yes. But also unavoidable in a market where performance is what sells machines. Unavoidable is too strong. I believe there are compilers (such as xlf) for which compiling with optimization does not change the behavior of legal programs. David Boundy said: Thus, standards do "exactly specify the behavior of any legal program:" that's the defintion of a "legal program." I believe this is incorrect. If I write x=a+b+c and the stand- ard does not specify the order of evaluation then the compiler is free to evaluate it as (a+b)+c or as a+(b+c) (or according to some in either way depending on the optimization level). Thus the behavior of the pro- gram is not exactly specified. David Boundy said: ( There's the issue that on our 68000 machines, we do arithmetic at compile time in 64 bits, but the same expression unoptimized will be evaluated at run time in 80 bits. Sigh. ) Well if the behavior of any legal program is exactly specified how can this sort of thing occur. James B. Shearer ========================================================================= Date: 14 June 1991, 22:33:45 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: IEEE arithmetic Dik Winter said (the issue is whether 3x is a realistic estimate of how much slower interval arithmetic will be compared to straight floating point): > How come unrealistic? Example, interval add a to b giving c: > set round to - inf; > c.low = a.low + b.low; > set round to + inf; > c.upp = a.upp + b.upp; > why would this be more than 3x slower than straight floating point? I said: > Because it's 4 instructions vrs 1. Dik Winter: Thus it would be 4 times as slow if each instruction takes exactly the same time to issue. This is not true on processors where the FPU is not pipelined (in that case the setting of the rounding mode is much faster than the addition). It might also not be true on some pipelined processors, i.e. if the multiply unit is not pipelined. Etc. Do you indeed know machines where the above sequence would be four times as slow as a normal multiply? It will be at least 4 times slower on the Risc System 6000. It may be more if the set floating mode instructions screw up the pipe- line (I don't know if they do or not). I said > Also lets see your code > for interval multiplication. Dik Winter: Not difficult, it goes something like: if a.low >= 0.0 then if b.low >= 0.0 then 4 operations else if b.upp <= 0.0 then 4 operations else if b.upp >= - b.low then 4 operations else 4 operations the remainder is left as an exercise for the reader. The code is a bit hairy, and on modern RISC machines you would not put it inline (a subroutine jump to a leaf routine is extremely cheap). We see that also here on many machines the multiply operations take most of the time. (The code assumes that intervals do not contain inf, if you allow that the code only gets hairier, not much more time consuming.) Some comments: 1. Handling this by a subroutine call would seem to require 4 floating loads to pass the input arguments and 2 floating loads to re- trieve the answers. This is already expensive. 2. All the floating compares and branches will severely impact performance on the Risc System 6000. The slowdown will be much more than 3x. 3. Do you know of any machine where the above code will average 3x (or less) the time of a single multiply? On another topic Dik Winter said: But you can get reasonable results without any pivoting if the condition is very good! The need for pivoting is unrelated to the condition number. This matrix 0 1 has condition number 1 and requires pivoting. 1 0 This matrix 1+e 1-e (e small) has bad condition number and is not help- 1-e 1+e ed by pivoting. James B. Shearer ========================================================================= Date: 16 June 1991, 19:22:04 EDT From: JBS at YKTVMV To: comp.arch at grape.ecs.clarkson.edu Subject: Re: IEEE arithmetic David Hough said: > Various versions of the Linpack benchmark all have in common > that the data is taken from a uniform random distribution, producing problems > of very good condition. I believe "very good condition" is an exaggeration. Problems of size 1000 appear to typically have condition number between 10**5 and 10**6. I would not consider this very good. The condition num- ber appears to be growing considerably faster than n, does anyone know the expected asymptotic behavior? David Hough said: >The Linpack benchmark is intended to solve an equation A x = b >where the elements of A are chosen from a uniform random distribution >and b is computed so that the correct solution x is close to a vector >of 1's. So looking at the size of components of x-1's is supposed to >indicate how accurate the solution is, although >the residual b - A x is a more reliable measure of quality. Why do you consider the residual a more reliable measure of quality. If the condition is bad the residual will still be small when the answer is garbage. I asked: >> If you know of a single realistic application where the diff- >>erence between 64 bit IEEE rounded to nearest and 64 bit IEEE chopped >>to zero makes a significant difference in the maximum performance ob- >>tainable I would like to see it. David Hough said: >I doubt you'll find anything like this in normal scientific computation >- the difference in problem domain between floating-point systems with >identical exponent and significand fields and rounding schemes >comparable in cost and care is rarely likely to be extremely >noticable. I would expect to notice the difference in problems like >orbit computations, however - rounding should yield acceptable results >for longer simulated times than chopping, even if everything else is >equal, simply because the statistics of rounding are more favorable. I >have also heard that the details of rounding are important in financial >calculations of bond yields. I believe IEEE chopped will be easier to implement and perform better than IEEE rounded. If rounding almost never matters why require it? I would expect the accuracy of orbit computations to be limited by the accuracy to which the initial parameters are known and by the accur- acy of the physical model (if I am not mistaken the models of tidal eff- ects for example are not too accurate). I find it hard to believe the rounding method matters in bond calculations given 64-bit arithmetic and sound methods (I believe there are foolish ways to compute internal rates of return for example). James B. Shearer ========================================================================= Date: 18 June 1991, 21:12:20 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: IEEE arithmetic (Goldberg paper) I said: >I think that if a (legal) program behaves differently when compiled with >and without optimization that then the compiler is broken. Perhaps you >disagree. Malcolm said: B = 5.0 IF ( ( B .GT. 3.0 ) .OR. MYTEST(A) ) PRINT some diagnotic messages LOGICAL FUNCTION MYTEST(X) REAL X MYTEST = some boolean work PRINT a few diagnostics on the screen RETURN END I modified this to: LOGICAL*4 MYTEST B = 5.0 IF ( ( B .GT. 3.0 ) .OR. MYTEST(B) )THEN WRITE(6,*)' MAIN' ENDIF END LOGICAL FUNCTION MYTEST(X) REAL X WRITE(6,*)' MYTEST' MYTEST = .TRUE. RETURN END I checked the behavior with the IBM VS Fortran and Xlf compi- lers. In both cases the behavior was not changed by optimization. The behavior differs between the two compilers but that is a different is- sue. So this example will not convince me that I'm wrong. (Actually I later softened my original statement to say it is undesirable for optimization to affect a program's behavior.) James B. Shearer ========================================================================= Date: 18 June 1991, 21:25:17 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: IEEE arithmetic Dik Winter said: The origin of the discussion was a remark that interval arithmetic in software had been observed to be 300 times as slow as standard hardware floating point. While interval arithmetic had been observed to be 3 times as slow. Shearer questioned the number 3; I thought he questioned the order of magnitude but apparently he wants to know the exact number. Of course that can not be given as it is entirely machine dependent. Also the factor would be a function of the mix of operations performed. I believe the original remark referred to estimates not observa- tions. I questioned in passing whether 3x was a realistic estimate. I continue to believe it is extremely optimistic and that 10x slower would be more reasonable. Dik Winter: About the multiplication routine, JBS: > 1. Handling this by a subroutine call would seem to require 4 > floating loads to pass the input arguments and 2 floating loads to re- > trieve the answers. This is already expensive. But you have in most cases to load these 4 numbers anyhow, so why would this be more expensive? Why loads to retrieve the answers? Why not just let them sit in the FP registers? If you are doing your operations memory to memory then this is correct. However if you are keeping intermediate results in registers as will often be the case, then you must move your 4 operands from the registers they are in to the registers the subroutine expects them in and you must move your 2 answers out of the return registers into the registers you want them in (if you just let them sit they will be wiped out by the next call). In general it will be possible to avoid doing some of this movement but I don't see how to avoid all of it. Dik Winter: Moreover, never it was said that there is an exact factor of 3 involved; that factor was simply observed, for a set of programs. > 3. Do you know of any machine where the above code will average > 3x (or less) the time of a single multiply? So this is irrelevant. Who observed this? Under what conditions? Dik Winter: > On another topic Dik Winter said: > But you can get reasonable results without any pivoting if the condition > is very good! I should have added the context, where not only the condition of the complete matrix is very good, but also of all its principal minors. This still isn't right. Consider e 1 e small (but not zero). 1 e Dik Winter (in a later post): I am *not* an advocate for interval arithmetic (the people at Karlsruhe are). I do not use it. But I object to the way Shearer handles this: a. Shearer asks: what is the justification for the different rounding modes. b. Many responses come: interval arithmetic. c. Shearer asks: would it not be better helped with quad arithmetic? d. Response: observed speed difference a factor 3 with hardware rounding modes, a factor 300 in software. e. Shearer questions the factor 3. Apparently he believes the factor 300 (does he?). Even if the factor 3 would degrade on other machines to a factor of 5 or even 10, the difference with 300 is still striking. I ask Shearer again: come with an interval add assuming the base arithmetic is round to nearest only (or even worse, with truncating arithmetic, which you advocate in another article). Some comments: Regarding b if interval arithmetic is the only reason for the different rounding modes then I think they may safely be junked. Regarding c what I actually said was I thought quad precision would provide some support for interval arithmetic (not better support). In any case upon further reflection I will withdraw this statement. Regarding d as I said above I believe these were estimates. Regarding e I don't really believe the 300x either and I part- icularly don't believe the 100x ratio. Regarding how I would implement interval arithemtic it is not particularly difficult to do without the rounding modes as long as you don't insist on maintaining the tightest possible intervals. There is no great loss in being a little sloppy since an extra ulp only matters for narrow intervals and the main problem with interval arithmetic is that the intervals don't stay narrow even if you are careful. James B. Shearer ========================================================================= Date: 18 June 1991, 22:59:21 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Condition number of uniform random matrices; residuals David Hough said: jbs@ibm.com asserts about solving linear equation with matrices composed of elements from a uniform random distribution: > Problems > of size 1000 appear to typically have condition number between 10**5 > and 10**6. I haven't observed any evidence to support this. If uniform random 1000x1000 matrices had such condition numbers then it would be impossible to solve them accurately with Gaussian elimination. I did report that gaussian elimination without pivoting was somewhat unstable on matrices of that size, but that only reflects on the stability of that method, rather than the underlying condition of the problem, independent of the method. For instance, the normalized residual computed in the linpack program, RESIDn = RESID/( N*NORMA*NORMX*EPS ) grows very slowly; typically 1.5-2 for 100x100 systems, and around 10 for 1000x1000 systems, using IEEE arithmetic and conventional partial pivoting. So the growth of RESID itself is only slightly faster than linear with N. Of course the size of the residual b-Ax doesn't tell you too much about the size of the error in x, in general. Luck is with us in this case, however: for single precision matrices, the measured difference in the computed x's vs. the nominally correct x's whose elements are all exactly one is in the same ratio, e.g. about 1e-4 for 1000x1000 vs. 1e-5 for 100x100. Of course the actual correct answer for this problem isn't a vector of 1's since the right hand side b was computed by rounding the result of computing A*(1 1 1 1 1 ...) and so the computed b differs from the one corresponding to the vector of 1's, and correspondingly the correct x corresponding to the right hand side b actually used is somewhat different - but fortunately that doesn't matter much for this well-conditioned problem. But this is a good illustration of a point I made previously: in general you don't know the error, because that would be tantamount to knowing the answer, and most people don't spend most of their time solving problems with known answers. So the normalized residual is a good indication of whether an algorithm has solved a linear equations problem about as well as can be expected in a particular floating-point precision. If an answer with a small error bound is desired, regardless of the condition of the problem, something else will have to be done. I tried 10 random matrices of size 1000 and observed condition numbers from 97627. to 1124910. I believe the actual 1000 by 1000 ma- trix generated in the linpack benchmark has condition number 368554. What do you observe it to be? A bad condition number does not guarantee you will be unable to solve the problem accurately. The matrix e 0 has condition number 1/e but there are no 0 1 numeric problems. For the linpack matrix the entries are such (a/16384, a integer between -32768 to 32768) that the row sums are generally computed exact- ly (especially with the 64-bit formats) so (1,...,1) is the exact answer. Otherwise comparisons between different floating formats would be unfair. On the general question of what the growth rate of the condition number with n is I found a somewhat relevant reference ("The Probability that a Numerical Analysis Problem is Difficult", James W. Demmel, Math- ematics of Computation, Vol 50, 1988. p449-480). This paper is a bit obscure but it seemed to be saying if we consider complex matrices in- stead of real, normal distributions about 0 instead of uniform and a somewhat unusual definition of condition number then the asymptotic growth rate of the median condition number will be n**1.5. It also suggested the real case will have worse condition. If anyone knows of better results I would be interested. David Hough said (in a later post) jbs@ibm.com says: > I believe IEEE chopped will be easier to implement and perform > better than IEEE rounded. This is true, but only marginally. The most important cost in IEEE (or VAX) floating-point arithmetic is getting the correct unrounded result, or at least enough of it to permit correct rounding. This requires provision of guard and (for chopping or unbiased rounding) sticky bits, postnormalization, development of full doubled-precision products, careful division and sqrt, etc. Once you've gotten the unrounded answer, rounding (as opposed to chopping) costs an extra add time and dealing with the possibility of a carry out of the most significant bit. Although these are real design costs, I doubt they've affected performance of modern pipelined IEEE designs compared to the cost of getting the unrounded result. Well as I pointed out in an earlier post the performance of the IBM Risc design was seriously affected by the need to round. The per- formance impact was reduced by the multiply-add instruction (which only rounds once) and by the ability of the multiply-add unit to accept an operand with a pending round. Just adding a stage to the floating pipe- line is not satisfactory for a scalar design since the latency is very important. David Hough said: In general, any process that involves repeated inexact conversions, particularly binary->decimal->binary->decimal, or addition of many small inexact quantities, will benefit from the better statistics of rounding over chopping - drifting tendencies will be reduced, and error bounds will be smaller. If your conversion routines are written properly and you output sufficient decimal digits binary->decimal->binary should not drift. I don't believe rounding generally gives smaller error bounds. If you add a bunch of positive numbers both methods lead to a interval containing the true answer. Chopping gives the lower end of the interval, round- ing the middle. The true answer is statistically much more likely to lie near the center of the interval but there is no guarantee of this. Also I believe the statistics of chopping improve if .5 ulp is added to each operand before every operation. I believe this is still cheaper to implement than rounding. David Hough said: For instance, during the 1970's one of the minor regional stock exchanges noticed that its market stock average was declining in a bull market in which most of the component stocks were rising in value. There were a number of factors at work in this phenomenon; using chopping arithmetic was one of them. Do you have an exact reference for this story? David Hough said: Using enough precision can sometimes overcome deficiencies of chopping or dirty arithmetic, otherwise Seymour Cray would have done things differently. But it isn't always enough, as most Cray customers find out occasionally. Well "sometimes" is really almost always. Also Cray arithmetic is an extreme example, problems with Cray arithmetic would not necessar- ily occur with chopping alone. James B. Shearer ========================================================================= Date: 19 June 1991, 00:22:54 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: Implementing Interval Arithmetic with IEEE rounding modes David Hinds said: Really, it seems to me that if support for interval arithmetic is built into a compiler, it should be able to optimize away most of the rounding mode changes without too much trouble. This should be trivial for the cases where the mode changes would otherwise do the most damage, such as in tight inner loops with no dependencies, i.e., vector operations. Just split and do two loops, one for the lower and one for the upper vectors. I believe if you try to do this with DO 10 J=1,N 10 X(J)=Y(J)*Z(J) you will discover things are not so simple. James B. Shearer ========================================================================= Date: 19 June 1991, 00:28:08 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: Implementing Interval Arithmetic with IEEE rounding modes Robert Herdon asked: While I've seen lots of verbage on the pros and cons of interval operations, I've seen NO discussion of its impact on algorithms. The question is then, how does one allow for changes of algorithm flow because of interval arithmetic? I.e., any time a decision is made based on the value of an interval, one sees a dichotomy of possible flow paths. If you are using interval arithmetic to figure error bounds for a non-interval arithmetic calculation then it is necessary to carry along the non-interval arithmetic values at each stage as well to get the con- trol flow correct. Using the usual interval representation this will in- crease the cost considerably. Of course you could represent intervals as (value, maximum possible absolute error). James B. Shearer ========================================================================= Date: 21 June 1991, 00:22:25 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: condition numbers of uniform random matrices David Hough and myself have been disagreeing about the condition number of uniform random matrices. It appears this is because we have been using different matrix norms. My results were for the L1 norm while David Hough's were using the L2 norm. Somewhat to my surprise it appears that uniform random matrices are much better conditioned with respect to the L2 norm than the L1 norm. For example for the size 1000 linpack matrix I compute a L2 norm of 3623. as compared to a L1 norm of 223642. (in an earlier post I reported an L1 norm of 368554, this was an error). 10 random 1000X1000 matrices had L2 norms ranging from 2398. to 27667. while as I posted earlier the L1 norms ranged from 97627. to 1124910. The median L2 norms appear to grow roughly like 5*n. I guess this goes to show condition numbers do not provide terribly precise estimates of the error to be expected. James B. Shearer ========================================================================= Date: 21 June 1991, 00:53:48 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: IEEE arithmetic (Goldberg paper) I said: > I checked the behavior with the IBM VS Fortran and Xlf compi- >lers. In both cases the behavior was not changed by optimization. The >behavior differs between the two compilers but that is a different is- >sue. So this example will not convince me that I'm wrong. (Actually >I later softened my original statement to say it is undesirable for >optimization to affect a program's behavior.) (My original statement was a compiler which compiles a legal program so that it behaves differently when compiled with optimization is "broken".) Jon Krueger selectively quoted this as: >I checked the behavior with the IBM VS >Fortran and Xlf compilers. In both cases the behavior was not changed >by optimization ... So this example will not convince me that I'm wrong. and commented: :I see. Therefore no standard FORTRAN compiler would be permitted :to do any differently, right? These compilers define what is and :is not a legal optimization, is that your assertion? I have the following comments: 1. This is not my position as I believe the final sentence in my post (which you tendentiously failed to quote) makes clear. Do you disagree that it is undesirable for optimization to affect a program's behavior? 2. I believe a compiler can obey the Fortran standard and still be properly called broken. For example it can take 100 hours to compile a trivial program. 3. The person who posted the code example didn't say what the exam- ple was supposed to show. If it was supposed to show any compiler would produce code that behaved differently when compiling with optimization then it is incorrect. James B. Shearer ========================================================================= Date: 21 June 1991, 19:21:26 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: rounded vs. chopped floating-point arithmetic David Hough posted the following: >Nelson Beebe (beebe@math.utah.edu) recollected the following message >to a colleague: >------------------------------------------- > The following little program can be used to illustrate the effect of > truncating arithmetic has on your larger program: > real dt,t0,t1,t2,tend > integer n > n = 0 > dt = 0.018 > t0 = 4000.0 > tend = 5000.0 > t1 = t0 > t2 = t0 > 10 n = n + 1 > t1 = t1 + dt > t2 = t0 + float(n)*dt > if (t2 .lt. tend) go to 10 > write (6,*) t1,t2,(t1 - t2)/t2 > end > On the IBM 3090, this single precision version prints: > 4879.89844 5000.00781 -0.240218341E-01 > That is, the relative error is 2.4%. On the Sun 4, it produces > 5003.70 5000.01 7.37889E-04 > The effect of truncating arithmetic on the running sum is large. > The double precision version is: > double precision dt,t0,t1,t2,tend > integer n > n = 0 > dt = 0.018D+00 > t0 = 4000.0D+00 > tend = 5000.0D+00 > t1 = t0 > t2 = t0 > 10 n = n + 1 > t1 = t1 + dt > t2 = t0 + dfloat(n)*dt > if (t2 .lt. tend) go to 10 > write (6,*) t1,t2,(t1 - t2)/t2 > end > The IBM 3090 result is > 5000.00799995563648 5000.00799999999981 -0.887265231637227285E-11 > The Sun 4 result is > 5000.0080000016 5000.0080000000 3.2341579848518D-13 >------------------------------------------- >Note that satisfactory results are obtained if you use enough precision or if you round rather than chop. Also note that this is not the program >that failed, but rather a drastic simplification of the user's actual application to reveal the essential problem. It's a simple example where >the superior statistics of rounding rather than chopping imply a broader >domain of applicability for a particular program. I believe this example is misleading for 2 reasons. 1.) The main source of error in the hex results is the following. Since 16**3 = 4096 and t0 and tend have been chosen so that the running sum t1 goes from 4000 to 5000, most of the time t1 will have leading hex digit 1. As is well known this causes a loss of accuracy in the hex computation. Running the sum from 3000 to 4000 improves the relative error in the hex results by factors of 10 and 23. While the possibilty of this sort of accuracy loss is undeniably a defect in the hex repres- entation, it says nothing about the relative merits of BINARY chopped vrs binary rounded. 2.) As pointed out above we should run the program using IEEE chopped for a fair comparison. Now the relative error is -.44*10**-2 (vrs .73*10**-3 for rounded) for single precision and -.92*10**-11 (vrs .32*10**-12 for rounded) for double precision. Hence the magnitude of the chopped error is about 6 times the magnitude of the rounded error in the single precision case and about 29 times in the double precision case. These ratios seem substantial but they are not typical for the following reason. As soon as the running sum surpasses 2**12=4096 the binary representation of .018 is such that chopping makes an error of .86 ulps per addition in the single precision case, .968 ulps per addition in the double precision case. Clearly rounding reduces this error to .14 ulps in the single precision case, .032 ulps in the double precision case. The ratios .86/.14 and .968/.032 are consistent with the observed difference in accuracy. However it is easy to see that the average error made by chopping is .5 ulps just twice the average error made by rounding (.25 ulps). Similarly the worse case error made by chopping is 1 ulp just twice the worst case error (.5 ulps) made by rounding. Hence on the average rounding gains one bit of accuracy. This is the result of the fact that the uncertain- ty in the rounded answer is two-sided while the uncertainty in the chopped answer is one-sided. The width of the interval of uncertainty is not any less for rounding. If the rounding errors were independent (which is not the case here) then the answer would be more likely to lie near the center of the uncertainty interval than near the ends. This would decrease the expected error in the rounded answer vrs the chopped answer by a factor of about sqrt(n) where n is the number of rounds. The worst case error ratio however would remain 2. David Hough also said >Correct rounding and chopping, and several other good paradigms, can be >characterized by the property > The rounded computed result is chosen from the two machine-representable > numbers nearest the unrounded infinitely-precise exact result, > according to a rule that depends only on the infinitely-precise > exact result, and not on the operands or operation (or phase of moon). >Most "fast" sort-of-rounding or sort-of-chopping schemes invented by >hardware designers eventually frustrate error analysts because they >can't be so characterized. The chopping scheme I mentioned if properly interpreted has this property. The machine numbers for this scheme lie exactly half-way in- between the IEEE machine numbers (ie they have an explicit 1 bit at the end instead of an implicit 0 bit). Chopping then corresponds to round- ing to nearest (with half way cases always going up). As far as I can see the main problem with this scheme is that small integers are no longer machine numbers which some would no doubt find unsettling. David Hough also said: >As for the first IBM RS/6000 implementation, I have heard that the original floating-point unit was designed to implement IBM 370 arithmetic, and was >changed to IEEE 754 format relatively late in the game. If true then it >would not be surprising that some aspects of 754 were problematic to add in. >The interesting question would then be >which aspects of IEEE arithmetic will be really >problematic for a high-performance RS/6000 implementation designed from >scratch to support 754. They may be some truth to this. On the other hand the design may have benefited from the efforts of the designers to lose as little performance as possible relative to hex. Certainly the performance of this chip seems to compare favorably to competitive IEEE from the start designs. IEEE designers may take the standard for granted and not spend a lot of time thinking about how fast alternatives would be thus not fully understanding how much performance the IEEE standard is giving up. James B. Shearer ========================================================================= Date: 25 June 1991, 18:24:25 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: IEEE arithmetic (Goldberg paper) Dan Prener said: >I would say that, since the program(s) offered above are, according to >the language standard, non-deterministic, then they are illegal. But any program with floating point arithmetic in it is, according to the language standard, non-deterministic (since the lang- uage standard does not specify the floating point model). Surely you don't believe all such programs are illegal. Preston Briggs quotes Dan Prener: :>Indeed, the spirit of FORTRAN is, overwhelmingly, that a program :>whose behavior is changed by optimization is not a legal program. and adds: >A wonderfully concise, quotable statement. >So much nicer than Shearer's assertion that the optimizer must be broken. Isn't it however logically equivalent to my statement: >I think that if a (legal) program behaves differently when compiled with >and without optimization that then the compiler is broken. I asked > Do you disagree that it is undesirable for optimization to affect > a program's behavior? Jon Krueger replied: >Yes. >It is a good and joyful thing, always and everywhere, for >compilers to expose broken programs. >Correct programs can be written in available languages. >The meaning of such programs does not depend on whether >compilers apply permissible optimizations. My comments: 1. I would not buy a compiler from someone who enjoys breaking user code. 2. As I pointed out above the behavior of any Fortran program with floating point operations is not specified by the standard. If you believe optimizers may make any transformation permitted by the standard then the meaning of any program with floating point opera- tions may depend on the optimization level. Hence the above quoted statement would imply all such programs are not correct. In general I am disturbed by the attitude that just because the fortran standard permits some atrocity (such as evaluating a*x+ a*y as a*(x+y)) it is improper to criticize a compiler for doing it. In my view the standard represents a least common denominator set of minimal requirements and any quality compiler will obey a stronger set of requirements. In this regard, I am not a language lawyer but it is my understanding that since the standard says nothing about the accuracy of intrinsic functions a compiler could replace sin(x) with 0 and cos(x) with 1 and be standard compliant. Is this the case? James B. Shearer ========================================================================= Date: 28 June 1991, 18:56:17 EDT From: JBS at YKTVMV To: comp-lang-fortran at ucbvax.berkeley.edu Subject: Re: Need complex matrix solver for CM. Rewriting a dense complex system of linear equations as a real system will not only use twice the storage it will take twice as long to solve (using Gaussian elimination). It is comparable to solving a sym- metric system with a general system solver. James B. Shearer ========================================================================= Date: 28 June 1991, 19:13:12 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: IEEE arithmetic (Goldberg paper) I said: :> 1. I would not buy a compiler from someone who enjoys breaking :>user code. Jon Krueger replied: >Neither would I. >But a compiler which exposes broken user code is a different animal. User code which complies with a variant or extension of the Fortran 77 (or any other) standard may not be portable but it is not "broken". The DO, ENDDO construct is not in the Fortran 77 standard. Nevertheless many Fortran 77 compilers recognize it and many users use it. I would not advise omitting it from your compiler in order to "expose" "broken" user code. James B. Shearer ========================================================================= Date: 1 July 1991, 18:43:19 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: IEEE arithmetic (Goldberg paper) I said: jbs> The DO, ENDDO construct is not in the Fortran 77 standard. jbs> Nevertheless many Fortran 77 compilers recognize it and many users use jbs> it. I would not advise omitting it from your compiler in order to jbs> "expose" "broken" user code. John McCalpin said: >The word "broken" is clearly being used in this context to refer to >code whose behavior is undetermined under the standard --- therefore >*logically* "broken". The examples which have been the topic of >discussion here depend on either subtle issues of floating-point code >re-arrangement or (in the other thread) on a particular >implementation's treatment of functions with side effects. >The use of extensions is an entirely different issue. The resulting >codes are *syntactically* incorrect FORTRAN-77 --- their *logical* >interpretation under a *superset* of the FORTRAN language is >independent of the fact that extensions are used. The Fortran standard can be extended in various ways. You can add constructs such as DO, ENDDO. You can remove ambiguity for example by stating a*b*c will always be evaluated as (a*b)*c. You can specify a floating point model (such as IEEE binary or IBM hex). In all of these cases a code's behavior may be undetermined with respect to the original standard but determined with respect to the extended standard. In such cases I believe it is wrong to say the code is broken. Many Fortran programs will fail given arbitrarily bad floating point arithmetic. Should these programs be considered "broken"? James B. Shearer ========================================================================= Date: 3 July 1991, 18:55:09 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: IEEE arithmetic (Goldberg paper) I said: >>The Fortran standard can be extended in various ways. ... >>You can remove ambiguity for >>example by stating a*b*c will always be evaluated as (a*b)*c. Jon Krueger comments: >Incorrect: not an extension. Extensions are downward compatible. >Extending a language allows sentences from the old language to >mean the same thing in the new language (and at least one new >sentence could be written). By contrast, this change allows >some sentences having many permissible meanings in the old language >to mean one thing in the new language. That may be an improvement, >but it's not an extension. >However, it's perfectly appropriate to call this a definition >of a Fortran-like language, having the possibly desirable >property of rendering some sentences unambiguous that would >be ambiguous in Fortran itself. Programs written in this >language are not broken if they depend on left to right expression >evaluation. If they start calling themselves Fortran programs, >however, and still depend on left to right evaluation, they're broken. I said: >>You can specify a floating point model (such as IEEE binary or >>IBM hex). Jon Kruger comments: >Extension? Or just implementation specific? You be the judge. I said: >>Many Fortran programs will fail given arbitrarily >>bad floating point arithmetic. Should these programs be considered >>"broken"? Jon Kruger comments: >What standards have you reference to? That define not arbitrarily >bad floating point arithmetic? And are these standards part of the >language definition? I believe Fortran refers to family of related languages or dialects. Examples would be "Cray Fortran", "DEC Fortran", "ANSI Fortran 77" etc. Hence a program may not be ANSI standard and still call itself Fortran (although to avoid confusion it would be better to specify the dialect). Because the Fortran 77 standard does not specify a floating point model, a statement such as "X = Y * Z" has many possible mean- ings. However all actual Fortran implementations that I am aware of remove this ambiguity by specifying a model (usually the hardware floating point model). Almost any Fortran program which performs floating point arithmetic assumes this floating point model is not "too bad". By your statements above either these programs are not entitled to call themselves Fortran or they are broken because they are relying on something (that floating point arithmetic is not "too bad") which is not found in the ANSI Fortran 77 standard. Since the Fortran 77 standard allows arbitrarily bad float- ing point arithmetic it is not usable as it stands for floating point computations. It must be extended (or changed if you prefer) to for- bid arbitrarily bad floating point arithmetic (usually by specifying a particular floating point model). Such variants or dialects then define "not arbitrarily bad floating point arithmetic". Note it is not sufficient to specify a floating point model to make the Fortran 77 standard usable, it is also necessary to limit the transformations allowed by the clause about evaluating mathema- tically equivalent expressions. James B. Shearer ========================================================================= Date: 5 July 1991, 15:10:11 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: Implementing Interval Arithmetic with IEEE rounding modes Dan Bernstein said: >Yes. I use it a lot for finding roots quickly and reliably, as well as >for solving differential equations. >Apparently some people in this discussion aren't familiar with the real >advantage of interval arithmetic. It's not the guaranteed error bounds >on final results. It's the guaranteed error bounds on *intermediate* >results. >You see, Newton's method (for instance) gives you an iteration that >converges to a root if it's close enough (in whatever sense). With >interval arithmetic, you start with an interval around the root (zero >through infinity, say) and apply one step of Newton's method. Since the >interval includes numbers arbitrarily close to the root, the resulting >interval must also include numbers arbitrarily close to the root. So the >*intersection* of the intervals does too. You can use *that* as the >starting interval for the next step. Unlike Newton's method, this >modification (under suitable conditions) is guaranteed to converge, and >rather rapidly too. That's the real advantage of interval arithmetic. I am unfamiliar with any advantages of interval arithmetic sufficient to justify cluttering up the floating point standard with 4 rounding modes. This purported application of interval arithmetic does not seem very convincing. Contrary to your statement Newton's method is also guaranteed to converge under suitable conditions. Also your method may "converge" to an interval which is no success. Trying to find sqrt(a) with this method starting with 0,inf as you suggest does not give me confidence that this method is significantly more robust than Newton's. Finally you don't explain why this method requires the precise interval arithmetic in the IEEE standard as oppposed to the approximate interval arithmetic achievable on any reasonable system. James B. Shearer ========================================================================= Date: 27 September 1991, 16:09:43 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: Latest generation of RISCs (was: Creeping featurism... Linley Gwennap said: "You may argue that the CPU design is not related to the system cost, but consider that IBM's system uses 8 VLSIs to do what HP uses 3 VLSIs for." Actually the IBM chip set includes 4 data cache chips whereas I believe the HP design uses commodity cache chips which are not included in the HP 3 chip set. So the above comparison is not quite fair. James B. Shearer ========================================================================= Date: 2 October 1991, 20:17:02 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: Latest generation of RISCs (was: Creeping featurism... Linley Gwennap said: > "You may argue that >the CPU design is not related to the system cost, but consider that IBM's >system uses 8 VLSIs to do what HP uses 3 VLSIs for." I said: > Actually the IBM chip set includes 4 data cache chips whereas I >believe the HP design uses commodity cache chips which are not included >in the HP 3 chip set. So the above comparison is not quite fair. Richard Crisp disagrees: >I would argue that the comparison is fair: they still have more large pincount >packages that have to be manufactured, tested etc in volumes appropriate >for the assembly of the systems commensurate with demand. They are undoubedly >sole sourced items, have been developed by them rather than commodity SRAM >suppliers and generally will pay far more per bit than the HP folks who >chose to use commodity bits and place their efforts elsewhere.The IBM approach >took more engineering resources to develop and costs more to manufacture. >That is the bottom line in my book. Commodity cache chips may well be cheaper than the IBM custom cache chips. However they are not free and their cost must be added to the cost of the 3 HP chips for a fair comparison to the 8 IBM chips. Contrary to the original statement the IBM chips are in fact doing more than the HP chips. Also as long as we are taking chip cost into account I doubt 4 custom cache chips cost 4 times as much as 1 custom floating point chip. Even if the HP design is cheaper per bit I believe the HP cache is 4 times as large so it does not follow that the total cost is less. What basis do you have for your assertions about compari- tive design and manufacturing costs? In any case I don't believe cache design has much to do with which design will be easier to speed up (the original focus of this thread). James B. Shearer ========================================================================= Date: 10 October 1991, 16:12:51 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: humans vrs compilers Colin Plumb said: >Examples of really good compilers: IBM's math library for 3090's is >written in Fortran; it's faster than the hand-coded attempt. I don't know what this can be referring to. Most of the critical portions of IBM's ESSL (Engineering and Scientific Subrou- tine Library) library for 3090's with vector are written in assembler for performance reasons. James B. Shearer ========================================================================= Date: 14 October 1991, 22:40:17 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: Secrets Robert Bernecky writes: >In article <21761@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubi && n) writes: >> >>Are you sure that this information is that concealed? I agree that it >>is not that well written, but in most cases, I was only unable to get >>it when the manufacturers themselves did not know it. >In the case of mainframe manufacturers, such as Hitachi, IBM, and >Fujitsu/Amdahl, they ceased publishing instruction timings more than >10 year ago. I think the 370/145 was about the last machine you could >extract timing information on with available documentation. >A recent request to (Large Manufacturer) regarding instruction timing and inte && rlock >information (I was interested in code scheduling on the /370) resulted >in "Sorry, bub, but we don't give out that information. Our competitors >might be able to use it to their advantage". >SO, knowledge of the information is NOT the issue. The information IS >concealed. For what its worth I have been told that one reason IBM is reluctant to provide timing information for 370 mainframes is that the machines have become so complicated that it is difficult to provide totally accurate information and that providing a simplified model risks customer complaints that their machine is broken when an instruc- tion sequence takes longer than the simplified model predicts. I believe that when customers have a real need for such infor- mation it is usually possible to obtain it. It may require a lot of persistence to find the right person to ask though. As for interlocks and instruction scheduling this was not too important for 3090 and earlier models. The two common 3090 interlocks that I am aware of are 1) modify a gpr, reference memory with address depending on gpr which was just modified: costs 1 cycle. 2) modify memory, reference location just modified: costs about 5 cycles. In some cases inserting a nop will help. Scheduling will be more important on the newest mainframes, unfortunately I don't know enough about them to give advice. As far I know no IBM mainframe compiler schedules instructions. James B. Shearer ========================================================================= Date: 25 October 1991, 22:13:39 EDT From: JBS at YKTVMV To: comp-arch at ucbvax.berkeley.edu Subject: Re: The next-gen of RISC: One FLOP or two?? Jeff Carroll writes: > On the other hand, there are only one or two tricks that can be >played with a MFLOPS number, without descending into really scungy stuff. >One is to give a number for 32-bit floats to compare against a 64-bit >number, and the other is to multiply your peak MFLOPS number by 2 because >you have a MAC instruction. If you mean to imply that it is improper to consider multiply- add instructions when computing a peak MFLOPS number I must disagree. The peak MFLOPS rate is sometimes stated to be a "speed of light" which is guaranteed not to be exceeded. This requires t