.ds HT "\h'-.65m'\v'-.40m'^\v'.40m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nrEP 1 .EQ delim @@ .EN .EQ define newt 'B sup{-1} g' define blambda 'B ~+~ lambda I' define balpha 'B ~+~ alpha I' define indent '~~~~~~' define sqr ' down 40 size +5 \(sq ' define hat '\*(HT' define tilde '\*(TD' define times '\(mu' define ctimes '\*(CT' define notin '\*(NI' define indent2 '~~~~~~~~~~~~~' define inter ' [^ lambda sub L ^,^ lambda sub U ^] ' .EN .nr VS 14 .TL Performance and Library Issues .nl for Mathematical Software .nl on High Performance Computers .AU J.J. Dongarra and D.C. Sorensen .AI Mathematics and Computer Science Division Argonne National Laboratory Argonne Illinois 60439 .FS Work supported in part by the Applied Mathematical Sciences Research Program (KC-04-02) of the Office of Energy Research of the U.S. Department of Energy under Contract W-31-109-Eng-38. .FE .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper discusses some of the fundamental issues facing designers of mathematical software libraries for medium scale parallel processors such as the CRAY X-MP-4 and the Denelcor HEP. We discuss the problems that arise with performance and demonstrate that it may be appropriate to exploit parallelism at all levels of the program, not just at the highest level. We give performance measurements indicating the efficiency of a linear algebra library written in terms of a few high level modules. These modules chosen at the matrix vector level extend the concept of the BLAS [13] and provide enough computational granularity to allow efficient implementations on a wide variety of architectures. Only three modules must be recoded for efficiency in order to transport the library to various machines. We report experience on machines as diverse as the CRAY X-MP and the Denelcor HEP. Finally, we report on some special algorithms for the HEP which take advantage of the fine grain parallelism capabilities. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .NH Introduction. .PP This paper will focus upon some fundamental issues that must be faced if we are to provide mathematical software for the high performance computers and for medium scale parallel systems (2-50 processors) in general. Our experience with the CRAY X-MP and Denelcor HEP has been invaluable in providing insight into what these issues might be. While the experience we shall report here is primarily concerned with the X-MP and HEP, we believe there is much within this experience that will carry over to vector/parallel processors that may appear in the near future. .PP Our discussion will not cover software for massively parallel systems. We shall consider algorithms from linear algebra that are not dependent upon relationship between the number of processors and the size of the problem. These algorithms will be of two varieties: Conversion of standard libraries to versions that are transportable to a wide variety of architectures and specialized algorithms that take advantage of the tightly coupled synchronization that is offered on the Denelcor HEP. .PP We shall touch upon issues involved with parallelism that require restructuring of software libraries to shield the user safely from any machine intrinsics required to take advantage of parallelism from within the libraries. Moreover, we shall go into some detail concerning performance of the HEP. In particular we shall demonstrate that it is appropriate to exploit parallelism at all levels of a program, and not just at the highest level. 
.NH Vector Architectures .PP The current generation of vector computers exploits several advanced concepts to enhance their performance over conventional computers: .sp .ls 1 .in +5. Fast cycle time, .sp Vector instructions to reduce the number of instructions interpreted, .sp Pipelining to utilize a functional unit fully and to deliver one result per cycle, .sp Chaining to overlap functional unit execution, and .sp Overlapping to execute more than one independent vector instruction concurrently. .in 0 .sp .PP Current vector computers typically provide for "simultaneous" execution of a number of elementwise operations through pipelining. Pipelining generally takes the approach of splitting the function to be performed into smaller pieces or stages and allocating separate hardware to each of these stages. Pipelining is analogous to an industrial assembly line where a product moves through a sequence of stations. Each station carries out one step in the manufacturing process, and each of the stations works simultaneously on different units in different phases of completion. With pipelining, the functional units (floating point adder, floating point multiplier, etc.) can be divided into several suboperations that must be carried out in sequence. A pipelined functional unit, then, is divided into stages, each of which does a portion of the work in, say, one clock period. At the end of each clock period, partial results are passed to the next stage and partial results are accepted from the previous stage. .PP The goal of pipelined functional units is clearly performance. After some initial startup time, which depends on the number of stages (called the length of the pipeline, or pipe length), the functional unit can turn out one result per clock period as long as a new pair of operands is supplied to the first stage every clock period. Thus, the rate is independent of the length of the pipeline and depends only on the rate at which operands are fed into the pipeline. Therefore, if two vectors of length @k@ are to be added, and if the floating point adder requires 3 clock periods to complete, it would take @3 ~+~ k@ clock periods to add the two vectors together, as opposed to @3~*~k@ clock periods in a conventional computer. .PP Another feature that is used to achieve high rates of execution is chaining. @Chaining@ is a technique whereby the output register of one vector instruction is the same as one of the input registers for the next vector instruction. If the instructions use separate functional units, the hardware will start the second vector operation during the clock period when the first result from the first operation is just leaving its functional unit. A copy of the result is forwarded directly to the second functional unit and the first execution of the second vector is started. The net result is that the execution of both vector operations takes only the second functional unit startup time longer than the first vector operation. The effect is that of having a new instruction which performs the combined operations of the two functional units that have been chained together. On the CRAY in addition to the arithmetic operations, vector loads from memory to vector registers can be chained with other arithmetic operations. .PP It is also possible to overlap operations if the two operations are independent. 
If an addition and an independent multiplication operation are to be processed, the execution of the second independent operation would begin one cycle after the first operation has started.
.PP
The key to utilizing a high performance computer effectively is to avoid unnecessary memory references. In most computers, data flows from memory into and out of registers, and from registers into and out of functional units, which perform the given instructions on the data. Performance of algorithms can be dominated by the amount of memory traffic, rather than the number of floating point operations involved. The movement of data between memory and registers can be as costly as arithmetic operations on the data. This provides considerable motivation to restructure existing algorithms and to devise new algorithms that minimize data movement.
.PP
Many of the algorithms in linear algebra can be expressed in terms of a SAXPY operation: @y ~<-~y ^+^ alpha x@, @i.e.@ adding a multiple @alpha@ of a vector @x@ to another vector @y@. This would result in three vector memory references for each two vector floating point operations. If this operation comprises the body of an inner loop which updates the same vector @y@ many times, then a considerable amount of unnecessary data movement will occur. Usually, a SAXPY occurring in an inner loop will indicate that the algorithm may be recast in terms of some matrix vector operation, such as @y ~ <-~ y + M * x @, which is just a sequence of SAXPYs involving the columns of the matrix @M@ and the corresponding components of the vector @x@. The advantage of this is that the @y@ vector and the columns of @M@ have a fixed length throughout. This makes it relatively easy to automatically recognize that only the columns of @M@ need be moved into registers while accumulating the result @y@ in a vector register, avoiding two of the three memory references in the innermost loop (a short Fortran sketch of this column-sweep form is given below). This also allows chaining to occur on vector machines, and results in a factor of three increase in performance on the CRAY 1. The cost of the algorithm in these cases is not determined by floating point operations, but by memory references. As we shall see later, this recasting also facilitates retargeting the algorithm to a completely different parallel architecture by rearranging the way the matrix vector operations are done.
.PP
On machines similar to the CRAY, in which the memory cycle time is longer than the processor cycle time, there exists the possibility of a memory bank conflict. The operation of loading elements of a vector from memory into a vector register can be pipelined, so that after some initial startup following a vector load, vector elements arrive in the register at the rate of one element per cycle. To keep vector operands streaming at this rate, sequential elements are stored in different banks or areas in an "interleaved" memory. After an access, a memory bank requires a certain amount of time before another reference can be made. This delay time is referred to as the "memory bank cycle time." This memory cycle time on both the CRAY 1 and the Cyber 205 is four processor cycles. If the same memory bank is accessed before it has a chance to quiesce or cycle, chaining operations will stop. Instead of delivering one result per cycle, the machine will deliver one result per functional unit time. In other words, it will operate at scalar speeds, seriously degrading performance. Note, too, that the Cyber 205 must gather data that are not contiguous in memory before it can begin computing.
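.PP
To make the recasting concrete, the following is a rolled Fortran sketch of the column-sweep form of @y ~<-~ y ~+~ Mx@ referred to above. The routine name echoes the SMXPY module mentioned later in the paper, but the calling sequence and coding shown here are meant only to illustrate the structure, not to serve as a tuned implementation.
.sp
.KS
.nf
      SUBROUTINE SMXPY (N1, Y, N2, LDM, X, M)
C     Column-sweep form of  y <- y + M*x,  where M is N1 by N2 with
C     leading dimension LDM.  The inner loop is a SAXPY on column J
C     of M, so the result y can be held in a vector register across
C     the whole J loop; only the columns of M stream in from memory.
      INTEGER N1, N2, LDM
      REAL Y(*), X(*), M(LDM,*)
      INTEGER I, J
      DO 20 J = 1, N2
         DO 10 I = 1, N1
            Y(I) = Y(I) + X(J)*M(I,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END
.fi
.KE
.sp
Written this way, the load of a column of @M@ can be chained with the multiply and add, which is the source of the performance gain described above.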
.PP
Programs that properly use all of the features mentioned above will fully exploit the potential of a vector machine. These features, when used to varying degrees, give rise to three basic modes of execution: scalar, vector, and super-vector [9]. To provide a feeling for the difference, we give the following table of execution rates on a CRAY 1:
.sp
.KS
.TS
center tab(;);
c | c.
Mode of Execution;Rate of Execution
_;_
Scalar;0 - 10 MFLOPS
Vector;10 - 50 MFLOPS
Super-Vector;50 - 160 MFLOPS
.TE
.KE
.sp
These rates represent, more or less, the upper end of their range. We define the term MFLOPS to be a rate of execution representing millions of floating point operations (additions or multiplications) performed per second.
.PP
The basic difference between scalar and vector performance is the use of vector instructions. The difference between vector and super-vector performance hinges upon avoiding unnecessary movement of data between vector registers and memory. The CRAY 1 is limited in the sense that there is only one path between memory and the vector registers. This creates a bottleneck if a program loads a vector from memory, performs some arithmetic operations, and then stores the results. While the load and arithmetic can proceed simultaneously as a chained operation, the store is not started until that chained operation is fully completed.
.PP
Most algorithms in linear algebra can be easily vectorized. However, to gain the most out of a machine like the CRAY 1, such vectorization is usually not enough. In order to achieve top performance, the scope of the vectorization must be expanded to facilitate chaining and minimization of data movement in addition to utilizing vector operations. Recasting the algorithms in terms of matrix vector operations makes it easy for a vectorizing compiler to achieve these goals. This is primarily due to the fact that the results of the operation can be retained in a register and need not be stored back to memory, thus eliminating the bottleneck. Moreover, when the compiler is not successful, it is reasonable to hand tune these operations, perhaps in assembly language, since there are so few of them and since they involve simple operations on regular data structures. When the process is extended to cover the next level, matrix-vector operations, the generated code can perform at super-vector levels.
.NH
Performance and Transportability
.PP
We are interested in examining the performance of linear algebra algorithms on large-scale scientific vector processors and on emerging parallel processors. In many applications, linear algebra calculations consume large quantities of computer time. If substantial improvements can be found for the linear algebra part, a significant reduction in the overall execution time will be realized. We are motivated to look for alternative formulations of standard algorithms, as implemented in software packages such as LINPACK [2] and EISPACK [10,16], because of their wide usage in general and their poor performance on vector computers. We are also motivated to restructure in a way that will allow these packages to be easily transported to new computers of radically different design. If this can be accomplished without serious loss of efficiency, there are many advantages.
In this section we report on some experience with restructuring these linear algebra packages in terms of the high level modules proposed in [3]. This experience verifies that performance increases are achieved on vector machines and that this modular approach offers a viable solution to the transportability issue. This restructuring often improves performance on conventional computers and does not degrade the performance on any computer we are aware of.
.PP
Both of the packages have been designed in a portable, robust fashion, so they will run in any Fortran environment. The routines in LINPACK even go so far as to use a set of vector subprograms called the BLAS [13] to carry out most of the arithmetic operations. The EISPACK routines do not explicitly make reference to vector routines, but the routines contain a high proportion of vector operations, which most vectorizing compilers detect. The routines from both packages should be well suited for execution on vector computers. As we shall see, however, the Fortran programs from LINPACK and EISPACK do not attain the highest execution rate possible on a CRAY 1 [10]. While these programs exhibit a high degree of vectorization, the construction which leads to super-vector performance is in most cases not present. We will examine how the algorithms can be constructed and modified to enhance performance without sacrificing clarity or resorting to assembly language.
.PP
To give a feeling for the difference between various computers, both vector and conventional, a timing study on many different computers for the solution of a @100 times 100 @ system of equations has been carried out [1]. The LINPACK routines were used in the solution without modification.
.sp 2
.KS
.B
.ce
Solving a System of Linear Equations
.sp
.ce
with LINPACK @ nothing sup a @ in Full Precision @ nothing sup b @
.sp
.R
.ps 10
.TS
center tab(;);
l l c c c c
l l c c c c
l l c c c c.
Computer;OS/Compiler@ nothing sup c @;Ratio@ nothing sup d @;MFLOPS@ nothing sup e @;Time secs
_
CRAY X-MP-1;CFT(Coded BLAS);.36;33;.021
CDC Cyber 205;FTN(Coded BLAS);.48;25;.027
CRAY 1S;CFT(Coded BLAS);.54;23;.030
CRAY X-MP-1;CFT(Rolled BLAS);.57;21;.032
Fujitsu VP-200;Fortran 77(Comp directive);.64;19;.040
Fujitsu VP-200;Fortran 77(Rolled BLAS);.72;17;.040
Hitachi S-810/20;FORT77/HAP(Rolled BLAS);.74;17;.042
CRAY 1S;CFT(Rolled BLAS);1;12;.056
CDC Cyber 205;FTN(Rolled BLAS);1.5;8.4;.082
NAS 9060 w/VPF;VS opt=2(Coded BLAS);1.8;6.8;.101
.TE
.KE
.sp
.nr PS 8
.ps 8
.PP
@ nothing sup a @ LINPACK routines @SGEFA@ and @SGESL@ were used for single precision and routines @DGEFA@ and @DGESL@ were used for double precision. These routines perform standard @LU@ decomposition with partial pivoting and backsubstitution.
.PP
@ nothing sup b @\f2Full Precision\f1 implies the use of (approximately) 64 bit arithmetic, e.g. CDC single precision or IBM double precision.
.PP
@ nothing sup c @\f2OS/Compiler\f1 refers to the operating system and compiler used; (Coded BLAS) refers to the use of assembly language coding of the BLAS; (Rolled BLAS) refers to a Fortran version with single-statement, simple loops; and \f2Comp Directive\f1 refers to the use of compiler directives to set the maximum vector length.
.PP
@ nothing sup d @\f2Ratio\f1 is the number of times faster or slower a particular machine configuration is when compared to the CRAY 1S using a Fortran coding for the BLAS.
.PP
@ nothing sup e @\f2MFLOPS\f1 is a rate of execution, the number of million floating point operations completed per second.
For solving a system of @n@ equations, approximately @ 2/3 n sup 3 ~+~ 2 n sup 2 @ operations are performed (we count both additions and multiplications). For @n ~=~ 100@ this is roughly 687,000 operations, so, for example, the CRAY X-MP-1 time of .021 seconds in the table corresponds to a rate of about 33 MFLOPS.
.nr PS 11
.ps 11
.sp
.PP
The LINPACK routines used to generate the timings in the previous table do not reflect the true performance of "high performance computers". A different implementation of the solution of linear equations, presented in a report by Dongarra and Eisenstat [4], better describes the performance on such machines. That algorithm is based on matrix-vector operations rather than just vector operations. This restructuring allowed the various compilers to take advantage of the features described in Section 2. It is important to note that the numerical properties of the algorithm have not been altered by this restructuring. The number of floating point operations required and the roundoff errors produced by both algorithms are exactly the same; only the way in which the matrix elements are accessed is different. As before, a Fortran program was run and the time to complete the solution of equations for a matrix of order 300 is reported.
.PP
Note that these numbers are for a problem of order 300 and all runs are for full precision.
.ps
.sp 2
.KS
.B
.ce
Solving a System of Linear Equations
.ce
Using the Vector Unrolling Technique
.sp 2
.R
.ps 10
.TS
center tab(;);
l l c c c
l l c c c
l l c c c.
Computer;OS/Compiler@ nothing sup c @;MFLOPS@ nothing sup e @;Time secs
_
CRAY X-MP-4 @"" sup \(sc @;CFT(Coded ISAMAX);356;.051
CRAY X-MP-2 @"" sup \(dd @;CFT(Coded ISAMAX);257;.076
Fujitsu VP-200;Fortran 77(Comp directive);220;.083
Fujitsu VP-200;Fortran 77;183;.099
CRAY X-MP-2 @"" sup \(dd @;CFT;161;.113
Hitachi S-810/20;FORT77/HAP;158;.115
CRAY X-MP-1 @"" sup \(dg @;CFT(Coded ISAMAX);134;.136
CRAY X-MP-1 @"" sup \(dg @;CFT;106;.172
CRAY 1-M;CFT(Coded ISAMAX);83;.215
CRAY 1-S;CFT(Coded ISAMAX);76;.236
CRAY 1-M;CFT;69;.259
CRAY 1-S;CFT;66;.273
CDC Cyber 205;ftn 200 opt=1;31;.59
NAS 9060 w/VPF;VS opt=2(Coded BLAS);9.7;1.9
.TE
.sp
.KE
.nr PS 8
.ps 8
.I Comments:
.R
.in +2.
@"" sup \(sc @ These timings are for four processors with manual changes to use parallel features.
.br
@"" sup \(dd @ These timings are for two processors with manual changes to use parallel features.
.br
@"" sup \(dg @ These timings are for one processor.
.nr PS 11
.ps 11
.sp
.PP
Similar techniques of recasting matrix decomposition algorithms in terms of matrix vector operations have provided significant improvements in the performance of algorithms for the eigenvalue problem. In a paper by Dongarra, Kaufman and Hammarling [7], many of the routines in the EISPACK collection have been restructured to use matrix vector primitives, resulting in improvements by a factor of two to three in performance over the standard implementation on vector computers such as the CRAY X-MP, Hitachi S-810/20, and Fujitsu VP-200.
.PP
Using the matrix vector operations as primitives in constructing algorithms can also play an important role in achieving performance on multiprocessor systems with minimal recoding effort. Again, the recoding is restricted to the relatively simple modules, and the numerical properties of the algorithms are not altered as the codes are retargeted for a new machine. This feature takes on added importance as the complexity of the algorithms reaches the level required for some of the more difficult eigenvalue calculations. A number of factors influence the performance of an algorithm in multiprocessing.
These include the degree of parallelism, process synchronization overhead, load balancing, interprocessor memory contention, and modifications needed to separate the parallel parts of an algorithm.
.PP
A parallel algorithm must partition the work to be done into tasks or processes which can execute concurrently in order to exploit the computational advantages offered by a parallel computer. These cooperating processes usually have to communicate with each other, for example to claim a unique identifier or to follow data dependency rules. This communication takes place at synchronization points within the instruction streams defining the process. The amount of work, in terms of the number of instructions that may be performed between synchronization points, is referred to as the granularity of a task. The need to synchronize and to communicate before and after parallel work will greatly impact the overall execution time of the program. Since processors that are waiting for one another are not doing useful computation, it is obviously better to minimize that overhead. In the situation where segments of parallel code are executing in vector mode, typically at ten to twenty times the speed of scalar mode, granularity becomes an even more important issue, since communication mechanisms are implemented in scalar mode.
.PP
Granularity is also closely related to the degree of parallelism, which is defined to be the percentage of time spent in the parallel portion of the code. Typically, a small granularity job means that parallelism occurs at inner loop levels (although not necessarily the innermost loop). In this case, even the loop setup time in outer loops becomes significant, to say nothing of the frequent need for task synchronization.
.PP
Matrix vector operations offer the proper level of modularity for achieving both performance and transportability across a wide range of computer architectures. Evidence has already been given for a variety of vector architectures. We shall present further evidence in the following sections concerning their suitability for parallel architectures. In addition to the computational evidence, there are several reasons which support the use of these modules. We can easily construct the standard algorithms in linear algebra out of these types of modules. The matrix vector operations are simple and yet encompass enough computation that they can be vectorized and also parallelized at a reasonable level of granularity [5,8]. Finally, these modules can be constructed in such a way that they hide all of the machine specific intrinsics required to invoke parallel computation, thereby shielding the user from any changes to the library which are machine specific.
.NH
The Denelcor HEP Computer.
.PP
The Denelcor HEP is the first commercially available machine of the MIMD variety. It provides a way for multiple instruction streams to operate on multiple data streams concurrently. It supports fine grained parallel computations, and is quite different from the vector computers discussed earlier. The fully configured computing system offered by Denelcor consists of up to 16 processing elements (PEMs) sharing a large global memory through a crossbar switch. Within a single PEM, parallelism is achieved through pipelining independent serial instruction streams called processes. The principal pipeline, which handles the numerical and logical operations, consists of synchronous functional units that have been segmented into an eight stage pipe.
The storage functional unit (SFU) operates asynchronously from this execution pipeline.
.PP
There are four types of memory: data, program, register, and constant memory. Each word of this memory is 64 bits together with an extra special purpose bit used to mark the memory location as full or empty. Processes are synchronized through marking memory and register locations with the full or empty property. This means that a process requesting access to a memory location may be blocked until that location is marked full by another process. Such suspension of a process only takes place in the SFU and therefore does not interfere with other processes that are ready to execute instructions. A very useful diagram for understanding this mechanism and the architecture of the HEP has been provided by H. Jordan [12] and appears in Figure 1.
.sp
---------- figure 1 ----------
.sp
.PP
This architecture allows a fairly clean introduction of parallel constructs into the FORTRAN 77 language through two simple extensions. The first of these is the introduction of asynchronous variables and the second is the CREATE statement. An asynchronous variable is defined by putting a $ character in front of a standard FORTRAN variable name. The type of the variable defaults according to the first character after the $. The $ indicates that the full-empty bit is significant. When an asynchronous variable appears on the left of an assignment statement it must be empty before that statement can be executed, and when it appears on the right of an assignment statement the variable must be full before the statement can be executed. In either case flow of execution for the process is suspended until the appropriate full-empty conditions are met. This provides a straightforward data activated communication mechanism between concurrent processes. In HEP FORTRAN processes are invoked through the CREATE statement. The usage of the CREATE statement is similar to a subroutine CALL, but control does not pass to the subroutine. Instead a parallel process is spawned that operates independently from the parent process (a short schematic example is given below). These intrinsics will change form when the new UNIX based operating system HEP/UPX becomes available, so that no formal extensions to the FORTRAN language will be required. However, the same mechanisms will be available in the new version.
.NH
HEP Performance.
.PP
To analyze HEP performance it is sufficient in most situations to concentrate on the task queue (see Figure 1). A programmer of this machine must have two goals in mind. The first is to create enough processes to fill the pipelines sufficiently so that an instruction is issued from the task queue every machine cycle (100 nanoseconds). Once this has been accomplished it is important to ensure that the compiler generated instructions have been optimized so as to reduce duplication of indexing operations and unnecessary references to variables. Careful management of data movement operations to and from register memory is the key to achieving high performance on this pipelined MIMD architecture. While vector instructions are not implemented in hardware, it is possible to construct such vector operations in software. This is significant in two respects. First, it demonstrates that SIMD constructs may be recovered in this architecture. Second, similar techniques are expected to be applicable to speeding up certain basic operations that do not vectorize.
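.PP
Before turning to that construction, the following schematic fragment shows the two FORTRAN extensions of Section 4 in the simplest setting: a driver spawns a fixed set of worker processes with CREATE and then hands them units of work through a single asynchronous variable. The fragment is a sketch of the mechanism, not code from the library; the names are hypothetical, the asynchronous variables are assumed to be shared through COMMON and to start out empty, and a read of an asynchronous variable is assumed to mark it empty again.
.sp
.KS
.nf
      SUBROUTINE DRIVER (N, NPROC)
C     Sketch only.  Spawn NPROC worker processes, then hand out the
C     work indices 1,...,N one at a time through the asynchronous
C     variable $NEXT.  Each assignment to $NEXT waits until a worker
C     has emptied it by reading it.
      INTEGER N, NPROC, I, IDUM
      COMMON /SCHED/ $NEXT, $NDONE
      DO 10 I = 1, NPROC
         CREATE WORKER
   10 CONTINUE
      DO 20 I = 1, N
         $NEXT = I
   20 CONTINUE
C     A zero index tells a worker to quit; then wait for all of them.
      DO 30 I = 1, NPROC
         $NEXT = 0
   30 CONTINUE
      DO 40 I = 1, NPROC
         IDUM = $NDONE
   40 CONTINUE
      RETURN
      END
C
      SUBROUTINE WORKER
C     Claim the next index; reading $NEXT empties it, which releases
C     the driver to supply another one.
      INTEGER J
      COMMON /SCHED/ $NEXT, $NDONE
   10 J = $NEXT
      IF (J .EQ. 0) GO TO 20
C        ... perform the J-th unit of work here ...
      GO TO 10
   20 $NDONE = 1
      RETURN
      END
.fi
.KE
.sp
The same data activated hand-off reappears in Section 6, where a few permanently created processes serve all of the low-level requests issued by the library.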
.PP
To illustrate the idea of building vector operations in software, let us again consider the computation
.EQ (5.1)
y~<-~y~+~Mx ~,
.EN
where @y@ is a real vector of length @n sub 1@, @M@ is an @n sub 1 times n sub 2 @ real matrix, and @x@ is a real vector of length @n sub 2@. The vector @y@ is to be overwritten with the result @y~+~Mx@. This is a basic task in numerical linear algebra which has many applications. It has been proposed for inclusion in the extended BLAS [3] that will be needed to achieve efficiency and transportability across various parallel architectures.
.PP
To code the above operation in FORTRAN on the HEP, one might simply self-schedule the inner products
.EQ (5.2)
y(i) ~<-~y(i) ~+~ M(i,*)x(*) ~~~for~~ i ~=~ 1,2,...,n sub 1
.EN
in parallel. However, even though a substantial speedup could be achieved, the FORTRAN DO loop that would be used to express the inner product will be inefficient, since it introduces a large amount of integer arithmetic for subscript calculation at each array reference. A clever optimizing compiler might avoid this if the FORTRAN loop were unrolled, but this remains to be done. In fact, the data movement and indexing operations will completely dominate and hide any floating point work if the present HEP FORTRAN 77 compiler is used.
.PP
On a vector machine one would accumulate the desired result in @y@ by adding in appropriate multiples of the columns of @M@ in the following fashion
.EQ (5.3)
y(*) ~<-~y(*) ~+~ M(*,j)x(j) ~~~~for~~ j ~=~ 1,2,...,n sub 2 ~ .
.EN
This operation is inappropriate for an MIMD machine because one would have to synchronize the access to @y@ by each of the processes, and this would cause busy waiting to occur. One would do better to partition the vector @y@ and the rows of the matrix @M@ into blocks
.EQ (5.4)
left ( pile {y sub 1 above y sub 2 above . above . above . above y sub k } right ) ~=~
left ( pile {y sub 1 above y sub 2 above . above . above . above y sub k } right ) ~+~
left ( pile {M sub 1 above M sub 2 above . above . above . above M sub k } right ) x ~
.EN
and self-schedule individual vector operations on each of the blocks in parallel:
.EQ (5.5)
y sub i ~<-~ y sub i ~+~ M sub i x ~~~~for ~~ i ~=~ 1,2,...,k ~.
.EN
If a vector operation such as (5.3) were available to compute each of the blocks in (5.5), there would be nothing left to do. Synchronization costs at the outer loop would be reduced from the requirements of (5.2), and we would have very efficient vector operations at the inner loop level.
.PP
As mentioned previously, there is no vector operation in hardware on the HEP. However, it is possible to construct such a vector operation in software that effectively uses the pipelined arithmetic functional units. One way to achieve this is to set up a contiguous block of registers designated as a buffer vector-register and another block of registers designated as a result vector-register. If the length of the buffer is @l@, then we want @k@ in (5.5) to satisfy @n sub 1 ~=~ (k-1)l~+~ r@, with @y sub i @ of length @l@ for @i~<~ k@ and @y sub k @ of length @r@. Then for each @i@, @y sub i@ is loaded into the result vector-register and then the columns of @M sub i@ are successively loaded into the buffer vector-register, multiplied by the appropriate component of @x@, and added to the result vector-register. Since storage of the result register is deferred until the computation is completed, data movement is minimal.
.PP
This may be described in the following pseudo-code:
.sp2
.KS
@indent2 mark program~matvec(ldm,n,M,x,y)@
.sp1
@lineup comment: ~~ldm~the~leading~dimension~of~M~is~needed~for~column~ indexing@;
.sp1
@lineup begin @
@lineup comment: ~~initialize~the~result~register@;
@lineup result ~<-~ y@;
@lineup "for"~j ~=~ 1 ~step~1~until ~n ~ do@
.sp1
@lineup indent begin @
@lineup indent comment:~~load~the~next~column~of~M~into~buffer@;
@lineup indent buffer ~<-~M(*,j)@;
.sp1
@lineup indent comment:~~multiply~by~x(j)~"and"~add~"to"~result~register@;
@lineup indent buffer ~<-~ buffer size -2 {times} x(j)@;
@lineup indent result ~<-~ result ~+~ buffer@;
@lineup indent end @
.sp1
@lineup comment:~~ store~result~into~y~before~exit@;
@lineup y ~<-~ result@;
@lineup end@.
.sp2
.KE
It is possible to accomplish this in assembly language by unrolling the loops and calculating the index increments at assembly time. This is facilitated by the three address instruction set for data transfer between Data memory and Register memory, and for binary arithmetic operations that allow indexing of registers.
.sp
----- figure 2 -----
.sp
.PP
The performance improvement over FORTRAN is quite dramatic. We illustrate this in Figure 2. Note that nearly 6 megaflops were achieved with this technique on a machine capable of 10 million instructions per second. A factor of 10 decrease in the basic cycle time would provide a 4-PEM system with the capability of over 200 megaflops without any other modifications to the basic design. When this is combined with the opportunity to express parallelism at the so called "outer loop" level and with the ability to multi-task large jobs that split into obvious independent chunks of computation, the HEP architecture appears to be very attractive indeed. However, some basic problems are introduced as we attempt to take advantage of parallelism within library tasks.
.NH
Library Issues.
.PP
We would like to illustrate these difficulties with a specific example and offer some remedies that have proved successful in this setting. The library we shall discuss is a modified version of LINPACK [2] that has been designed for both transportability and efficiency across a wide variety of computing systems including serial, vector, and MIMD architectures. This effort represents a prototype of a library structuring scheme that promises to provide a reasonable compromise between the opposing goals of portability and performance. The ultimate goal of this research is to provide a suitable framework for restructuring existing software libraries that will port efficiently to advanced computer architectures as they emerge.
.PP
The algorithms implemented within this library are based on standard procedures in linear algebra including LU, QR, and Cholesky decompositions. They have been reformulated in terms of a few relatively high level constructs such as matrix-vector operations. This concept is not new; LINPACK was written in terms of a set of low level constructs called the Basic Linear Algebra Subprograms (BLAS). The BLAS operations are simple vector operations such as an inner product, the addition of a scalar multiple of one vector to another, and vector scaling. The BLAS are well suited for operations that occur on some of the vector processors, but they are not the best choice for certain other vector processors, multiprocessors, or parallel processing computers [4]. The next level up from simple vector operations consists of matrix-vector operations. These include such operations as a matrix times a vector and a rank one change to a matrix.
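.PP
As a concrete example of a module at this level, the following rolled Fortran sketch performs the rank one change @M ~<-~ M ~+~ x y sup T@ column by column, so that each pass through the inner loop is itself a simple vector operation. The routine name and coding are ours and are meant only to show the shape of such a module.
.sp
.KS
.nf
      SUBROUTINE SRANK1 (N1, N2, LDM, M, X, Y)
C     Rank one update  M <- M + x*y**T,  where M is N1 by N2 with
C     leading dimension LDM.  Column J of M is updated by the
C     multiple Y(J) of the vector x, a vector operation of length N1.
      INTEGER N1, N2, LDM
      REAL M(LDM,*), X(*), Y(*)
      INTEGER I, J
      DO 20 J = 1, N2
         DO 10 I = 1, N1
            M(I,J) = M(I,J) + X(I)*Y(J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END
.fi
.KE
.sp
On a vector machine each column update vectorizes directly; on a multiprocessor, disjoint blocks of the columns (or of the rows) of @M@ can be handled by separate processes with no synchronization on the result.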
.PP
Of course these matrix vector operations can be coded in terms of the BLAS. However, efficiency may be gained by coding them directly, because they provide enough computational granularity to make use of the more powerful capabilities of the most advanced architectures. In most cases performance is vastly improved through the ability to accumulate results in vector registers and mask data movement between main memory and register memory with floating point operations. This technique was discussed in detail in Section 5 for SMXPY on the HEP, and it is described in [4] for the CRAY machines.
.PP
The module concept has already proven very successful in transporting these LINPACK routines across various architectures. To date it has been run on a variety of machines as reported by Dongarra in [1]. In particular these codes have achieved super-vector speeds on all of the existing CRAY machines. Moreover, near optimal speedups on both the 2- and 4-processor CRAY X-MP have been achieved. Results are reported in the next section. It is important that these modules are able to accommodate both parallel and vector machines of such variety. It is equally important that so few modules are required in order to produce LINPACK. All of this suggests that the modules above are indeed at the right level of granularity.
.PP
Designing the algorithms in terms of such operations is the hard part of an implementation. This can require considerable rethinking of the design of the algorithm. However, restructuring the algorithms in terms of these modules is more than just a challenging exercise. It tends to unify the structure of various algorithms much more than the use of the BLAS. Moreover, the module concept allows us to avoid locking the structure of the algorithms into one computer's architecture.
.PP
Given the algorithms described in terms of these high level modules, we are able to produce a version targeted for a new machine, such as the HEP, by replacing only three modules: matrix-vector multiplication @( y ~=~ y ~+~ Mx)@, vector-matrix multiplication @( y sup T ~=~ y sup T ~+~ x sup T M)@, and a rank one update to a matrix @(M ~=~ M ~+~ x y sup T )@. These modules represent a high level of granularity in the algorithm in the sense that they are based on matrix-vector operations, @O(n sup 2 )@ work, not just vector operations, @O(n)@ work.
.PP
Several tricky issues present themselves as one considers implementing a library of subroutines that take advantage of parallelism through the high level modules. The most important issue that arises with the user's freedom to create processes is that unless great care is taken with the design and structure of the library, the user will have to be concerned with the possibility of invoking the library from parallel branches of his code and unwittingly overwhelming the system resources. Nevertheless, the implementors of such a library must strive to conceal machine dependencies as much as possible from the user. Also, in the case of transporting existing libraries to the HEP one wishes to preserve the interfaces to avoid causing past users to modify their code unnecessarily. This desire is at first glance at odds with the HEP. If we are to invoke parallelism at the level described above, then it would appear that the user must be conscious of the number of parallel tasks throughout his program.
This is the result of limitations on some systems on the total number of processes allowed to be created. Should the library routines be called from multiple branches of a parallel program, the user could inadvertently attempt to create many more processes than are allowed.
.PP
A solution to this dilemma is to construct a scheduler that is capable of serving numerous requests from a few processes. For the HEP this scheduler may take the form of several parallel processes which are capable of performing all of the low-level requests coming from the library. Each of these processes polls an asynchronous variable which, when full, indicates work to be done. The work process assumes the appropriate identity that is indicated and performs the requested computation. These work processes are created at the outset and exist throughout the total computation. The service requests are data activated as described above. This is important on multi-PEM systems, as the overhead for issuing intertask creates is substantial.
.NH
Performance of the Modules.
.PP
In Figure 3 the performance of the LU decomposition with these high performance modules is shown and compared to the corresponding performance with the modules written in FORTRAN (both parallel and non-parallel versions).
.sp
------- figure 3 ------
.sp
The transportability of these codes has also been demonstrated. The modules have been coded for efficiency on CRAY 1 and CRAY X-MP machines as well as many others. A partial list for vector machines was shown in Section 3. The performance of linear algebra routines written in terms of the modules appears to be very satisfactory both from a transportability standpoint and from an efficiency standpoint. The performance is near optimal both on the CRAY and on the HEP machines in comparison with the best specially coded implementations of the standard algorithms. We turn now to special algorithms that are tuned specifically to the HEP.
.NH
Special Algorithms for the HEP
.PP
The above techniques do achieve a level of transportability along with very respectable performance. However, the technique employed above to exploit parallelism involved a @fork-join@ construct in the lowest level routines. Although process creates were avoided, parallel operations were invoked only intermittently with each submatrix during the reduction process. Algorithms that avoid this fork-join synchronization and make a better attempt at keeping the processors busy at all times should make better use of the machine.
.PP
To find such algorithms for linear algebra tasks one naturally turns to the data-flow type algorithms. The HEP architecture and FORTRAN language extensions will support very fine grained synchronization. This may take place at the binary operation level that one traditionally associates with data-flow algorithms. However, it is more profitable to seek a larger grain of computation that is more compatible with the synchronization overhead.
.PP
Our example involves the Givens method for obtaining the @QR@ factorization of a matrix. This algorithm has been a favorite among data-flow and systolic array algorithmists because a basic operation is fine grained (multiply a 2-vector by a 2 times 2 matrix) and the procedure is not complicated by pivoting. The serial variant of the Givens method that we consider is as follows. Given a real @m times n@ matrix @A@, the goal of the algorithm is to apply elementary plane rotations @G sub ij@ which are constructed to zero out the @ji sup th @ element of the matrix @A@.
Such a matrix may be thought of as a @2 times 2@ orthogonal matrix of the form
.EQ
G ~=~ left (^ pile { gamma above - sigma} ~pile {sigma above gamma } ^ right )
.EN
where @sigma sup 2 ~+~ gamma sup 2 ~=~ 1@. If
.EQ
left ( pile {alpha above beta} ~ pile { a sup T above b sup T } ^ right )
.EN
represents a @2 times n @ matrix, then a zero can be introduced into the @beta@ position by left multiplication by @G@ through proper choice of @ gamma @ and @ sigma @. When embedded in the @n times n @ identity, the matrix @ G sub ij@ is of the form
.EQ
G sub ij ~=~ I ~+~ D sub ij
.EN
where all elements of @D sub ij@ are zero with the possible exception of the @ii^,^ ij^,^ ji^,^@ and @jj@ entries. The matrices @ G sub ij@ are used to reduce @A@ to upper triangular form in the following order.
.EQ
left ( G sub {n-1,n} ^.^.^.^G sub 2n G sub 1n right ) left ( G sub n-2,n-1 ^.^.^.^ G sub 2,n-1 G sub 1,n-1 right ) ^.^.^.^ left ( G sub 23 G sub 13 right ) left ( G sub 12 right ) A ~=~ R~.~
.EN
The order of the zeroing pattern may be seen in the following @5 times 5@ example
.EQ
pile {times above ctimes sub 1 above ctimes sub 2 above ctimes sub 4 above ctimes sub 7} ~~~~
pile {times above times above ctimes sub 3 above ctimes sub 5 above ctimes sub 8} ~~~~
pile {times above times above times above ctimes sub 6 above ctimes sub 9} ~~~~
pile { times above times above times above times above ctimes sub 10} ~~~~
pile { times above times above times above times above times} ~~~~
.EN
This order is important if one wishes to "pipeline" the row reduction process. This pipelining may be achieved by expressing @R@ as a linear array in packed form by rows and then dividing this linear array into equal length pipeline segments. A process is responsible for claiming an unreduced row of the original matrix and reducing it to zero by combining it with the existing @R@ matrix using Givens transformations. A new row may enter the pipe immediately after the row ahead has been processed in the first segment. Each row proceeds one behind the other until the entire matrix has been processed. However, these rows cannot be allowed to get out of order once they have entered the pipe because of data dependencies. This is accomplished using an array of asynchronous variables, one protecting access to each segment of the pipe. A process gains entry to the next segment by emptying the corresponding asynchronous variable before releasing the ownership of the segment currently occupied. Granularity may be adjusted to hide the cost of this synchronization by simply adjusting the length of a segment. Segment boundaries do not correspond to row boundaries in @R@. Further details are given in [8]. The pipelining scheme and the synchronization mechanism have been extended in a very natural way to the @QR@ factorization of a sparse matrix [11].
.PP
This algorithm has been coded for the HEP, and the computational results were somewhat surprising. In fact these results provided the motivation to look more closely at the machine performance and resulted in the work described in Section 5. The results shown in Figure 4 compare the Householder method from the modified LINPACK library using FORTRAN modules with a FORTRAN version of the pipelined Givens method.
.sp
----- fig 4 -----
.sp
It turns out that in FORTRAN the Givens method is twice as fast as the Householder method on the HEP. This is contrary to the performance predicted by a complexity analysis based solely upon floating point operations. The reason for this is that memory references (i.e., load and store operations) dominate floating point operations on this machine, and the Givens method requires half the memory references that Householder's method requires.
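.PP
For reference, the basic step that a process applies as it combines its claimed row with one row of @R@ can be sketched in Fortran as follows. This is the textbook form of the rotation, with @sigma sup 2 ~+~ gamma sup 2 ~=~ 1@ as above; it is not the packed, segmented code described in [8], and the routine name is ours.
.sp
.KS
.nf
      SUBROUTINE GIVROW (N, K, R, LDR, W)
C     Rotate row K of the triangular factor R against the work row W
C     so that W(K) is annihilated.  Entries W(1),...,W(K-1) are
C     assumed to have been zeroed by the previous K-1 steps.
      INTEGER N, K, LDR
      REAL R(LDR,*), W(*)
      REAL ALPHA, BETA, DENOM, GAMMA, SIGMA, T
      INTEGER J
      ALPHA = R(K,K)
      BETA  = W(K)
      DENOM = SQRT(ALPHA**2 + BETA**2)
      IF (DENOM .EQ. 0.0) RETURN
      GAMMA = ALPHA/DENOM
      SIGMA = BETA/DENOM
      DO 10 J = K, N
         T      = GAMMA*R(K,J) + SIGMA*W(J)
         W(J)   = -SIGMA*R(K,J) + GAMMA*W(J)
         R(K,J) = T
   10 CONTINUE
      RETURN
      END
.fi
.KE
.sp
Reducing one claimed row of @A@ amounts to applying this step for @k ~=~ 1,2,...@ as the row moves through the segments of the packed @R@ array, with the asynchronous variables described above regulating entry to each segment.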
.NH
Conclusions
.PP
As multiprocessor designs proliferate, research efforts should focus on "generic" algorithms that can be easily transported across various architectures. If a code has been written in terms of high level synchronization and data management primitives that are expected to be supported by every member of the model of computation, then these primitives only need to be customized to a particular realization. A very high level of transportability may be achieved through automating the transformation of these primitives. The benefit to software maintenance, particularly for large codes, is in the isolation of synchronization and data management peculiarities.
.PP
This desire for portability is often at odds with the need to efficiently exploit the capabilities of a particular architecture. Nevertheless, algorithms should not be intrinsically designed for a particular machine. One should be prepared to give up a marginal amount of efficiency in exchange for reduced manpower requirements to use and maintain software. There are many possibilities for implementation of the general ideas that are briefly described above. We are certainly not in a position to recommend a particular implementation with any degree of finality. However, we already have experience indicating the feasibility of both of the approaches discussed here. We believe that a high level of transportability as described above can be achieved without seriously degrading potential performance. We would like to encourage others to consider the challenge of producing transportable software that will be efficient on these new machines.
.sp
.B References
.sp
.IP [1]
J.J. Dongarra,
.I
Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment,
.R
Argonne National Laboratory Report MCS-TM-23, (updated August 1984).
.sp
.IP [2]
J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart,
.I
LINPACK Users' Guide,
.R
SIAM Publications, Philadelphia, 1979.
.sp
.IP [3]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson,
.I
A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms,
.R
MCS/TM 41, (December, 1984), to appear in ACM SIGNUM Jan. 1985.
.sp
.R
.IP [4]
J.J. Dongarra and S.C. Eisenstat,
.I
Squeezing the Most out of an Algorithm in CRAY Fortran,
.R
ACM Trans. Math. Software, Vol. 10, No. 3, 1984.
.sp
.IP [5]
J.J. Dongarra and R.E. Hiromoto,
.I
A Collection of Parallel Linear Equations Routines for the Denelcor HEP,
.R
Parallel Computing, Vol. 1 No. 2, 1984.
.sp
.IP [6]
J.J. Dongarra, E.L. Lusk, R.A. Overbeek, B.T. Smith, and D.C. Sorensen,
.I
New Directions in Software for Advanced Computer Architectures,
.R
Argonne National Laboratory Report MCS/TM 32 (August, 1984).
.sp
.IP [7]
.R
J.J. Dongarra, L. Kaufman, and S. Hammarling,
.I
Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers,
.R
ANL MCS-TM 46, January 1985.
.sp
.IP [8]
J.J. Dongarra, A.H. Sameh, and D.C. Sorensen,
"Some Implementations of the QR Factorization on an MIMD Machine",
.R
ANL/MCS-TM-25, October 1984, to appear in Parallel Computing.
.sp
.IP [9]
K. Fong and T.L. Jordan,
.I
Some Linear Algebra Algorithms and Their Performance on CRAY 1,
.R
Los Alamos Scientific Laboratory, UC-32, June 1977.
.sp
.IP [10]
B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B.
Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977. .sp .IP [11] M.T. Heath and D.C. Sorensen, .I A Pipelined Givens Method for Computing the QR Factorization of a Sparse Matrix, .R Proceedings of the Workshop on Parallel Processing using the HEP, Ed. S. Lakshmivarahan, University of Oklahoma, March 1985. .sp .IP [12] H.F. Jordan, .I An Overview of HEP Architecture, Programming, and Performance .R to appear in book Ed. J.S. Kowalik, Parallel MIMD Computation: The HEP Supercomputer and its Applications (1984). .sp .IP [13] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, .I Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Software, .R 5 (1979), 308-371. .sp .IP [14] E. Lusk and R. Overbeek, "Implementation of Monitors with Macros: A Programming Aid for the HEP and Other Parallel Processors", .R ANL-83-97, (1983). .sp .IP [15] E. Lusk and R. Overbeek, "An Approach to Programming Multiprocessing Algorithms on the Denelcor HEP", .R ANL-83-96, (1983). .sp .IP [16] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd Edition, Springer-Verlag, Berlin, 1976. .sp .IP [17] D.C. Sorensen, .I Buffering for Vector Performance on a Pipelined MIMD Machine, .R Argonne National Laboratory Report MCS-TM-29 (April 1984). .sp .