.EQ
delim @@
.EN
.B
.nr BT ''-%-''
.he ''''
.pl 11i
.de fO
'bp
..
.wh -.5i fO
.LP
.nr LL 6.5i
.ll 6.5i
.nr LT 6.5i
.lt 6.5i
.ta 5.0i
.ft 3
.bp
.R
.sp 1i
.ce 100
.R
.sp .5i
.sp 10
ARGONNE NATIONAL LABORATORY
.br
9700 South Cass Avenue
.br
Argonne, Illinois 60439
.sp .6i
.ps 12
.ft 3
Implementing Dense Linear Algebra Algorithms
Using Multitasking on the CRAY X-MP-4
(or Approaching the Gigaflop)
.ft 3
.ps 11
.sp 3
Jack J. Dongarra and Tom Hewitt
.sp 3
.ps 10
.ft 1
Mathematics and Computer Science Division
.sp 2
Technical Memorandum No. 55
.sp .7i
August, 1985
.ce 0
.pn 1
.he ''-%-''
.he ''''
.bp
.ft 3
.ps 11
.bp
.LP
.LP
.EQ
gsize 11
.EN
.TL
.ps 12
.in 0
Implementing Dense Linear Algebra Algorithms
.br
Using Multitasking on the CRAY X-MP-4
.br
(or Approaching the Gigaflop)
.AU
.ps 11
.in 0
Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i'
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'
Argonne National Laboratory\h'.20i'
Argonne, Illinois 60439\h'.20i'
.AU
.ps 11
.in 0
Tom Hewitt\h'.20i'
.AI
.ps 10
.in 0
CRAY Research Inc.\h'.20i'
Chippewa Falls, Wisconsin 54701\h'.20i'
.FS
.ps 9
.vs 11p
@size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38.
.FE
.QS
.sp 2
.ps 10
.in .25i
.ll -.25i
.I "Abstract \(em"
This note describes some experiments on simple, dense linear algebra algorithms.
These experiments show that the CRAY X-MP is capable of exploiting the small-grain multitasking that arises from standard implementations of LU and Cholesky decomposition.
The implementation described here provides the "fastest" execution rate for LU decomposition, 718 MFLOPS for a matrix of order 1000.
.in
.ll
.QE
.nr PS 11
.nr VS 16
.nr PD 0.5v
.SH
Introduction
.PP
Over the past few months we have been experimenting with some simple linear algebra algorithms on the CRAY X-MP-4 multiprocessor.
The CRAY X-MP family is a general-purpose multiprocessor system.
It inherits the basic vector functions of the CRAY-1S, with major architectural improvements in each individual processor.
The CRAY X-MP-4 system is a four-processor model housed in a physical chassis identical to that of the CRAY-1S.
The system can be used to perform simultaneous scalar and vector processing of either independent job streams or independent tasks within one job.
Hardware in the X-MP enables multiple processors to be applied to a single Fortran program in a timely and coordinated manner.
.PP
All processors share a central bipolar memory (of up to 16 million words), organized in 64 interleaved memory banks.
Each processor has four memory ports: two for vector fetches, one for vector stores, and one for independent I/O operations.
In other words, the total memory bandwidth of the four processors is up to sixteen times that of the CRAY-1S system.
.PP
This note describes results obtained from three experiments: LU decomposition based on matrix-vector operations, LU decomposition based on a "best" implementation for the architecture, and an implementation of Cholesky decomposition based on matrix-vector operations.
.SH
LU Decomposition
.PP
The versions of LU and Cholesky factorization used here are based on matrix-vector modules that allow for a high level of granularity, permitting high performance in a number of different environments (see [2]).
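.PP
To make the structure of these modules concrete, the following Fortran sketch shows one common way to organize an LU decomposition with partial pivoting around matrix-vector products.
It is not the code used in the experiments; the name MVLU and the explicit inner loops are ours for illustration.
The two loop nests that update column K and row K are the matrix-vector modules, and in a tuned version they would be replaced by assembly language kernels whose work is split across the processors.
.sp
.nf
.ft CW
.ps 8
.vs 10p
      SUBROUTINE MVLU(A, LDA, N, IPVT, INFO)
C
C  Illustrative sketch only: LU decomposition with partial pivoting
C  organized so that the bulk of the work lies in two matrix-vector
C  products per step.
C
      INTEGER LDA, N, IPVT(N), INFO
      REAL A(LDA,N)
      INTEGER I, J, K, M, P
      REAL T
      INFO = 0
      DO 60 K = 1, N
C        update column K with a matrix-vector product
         DO 20 M = 1, K-1
            T = A(M,K)
            DO 10 I = K, N
               A(I,K) = A(I,K) - A(I,M)*T
   10       CONTINUE
   20    CONTINUE
C        find the pivot in column K and swap rows K and P
         P = K
         DO 30 I = K+1, N
            IF (ABS(A(I,K)) .GT. ABS(A(P,K))) P = I
   30    CONTINUE
         IPVT(K) = P
         IF (A(P,K) .EQ. 0.0) THEN
            INFO = K
            GO TO 60
         END IF
         IF (P .NE. K) THEN
            DO 40 J = 1, N
               T = A(K,J)
               A(K,J) = A(P,J)
               A(P,J) = T
   40       CONTINUE
         END IF
C        scale the multipliers below the diagonal
         DO 50 I = K+1, N
            A(I,K) = A(I,K)/A(K,K)
   50    CONTINUE
C        update row K with a vector-matrix product
         DO 58 M = 1, K-1
            T = A(K,M)
            DO 56 J = K+1, N
               A(K,J) = A(K,J) - T*A(M,J)
   56       CONTINUE
   58    CONTINUE
   60 CONTINUE
      RETURN
      END
.ps 11
.vs 16p
.ft R
.fi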
.PP
The algorithm designed to give the "best" performance on the X-MP architecture is worth noting.
It is based on standard Gaussian elimination with partial pivoting.
The algorithm is organized so that it zeros out three columns (below the diagonal) of the matrix and then applies these transformations to the rest of the matrix.
The application of the transformations to the remainder of the matrix is split up among the processors.
An assembly language kernel is used to apply the three pivot rows in simultaneous row operations.
This reduces memory traffic and allows a single processor of an X-MP to reach its theoretical maximum sustainable computation rate of 198 MFLOPS.
The assembly language kernel was multitasked in the experiment among two, three, and four CPUs.
Fortran-callable assembly language synchronization subroutines were used; these require less than half a microsecond for synchronization.
.PP
The process of finding the three pivot rows was not multitasked.
However, it was written as an assembly language kernel to reduce overhead for small problems.
As the factorization proceeds, the size of the relevant vector and submatrix decreases.
The final 15 @times@ 15 block of the reduction was performed by an unrolled [2] version of standard Gaussian elimination.
This portion is entirely single-threaded Fortran and runs at two to three times the speed of the Fortran that has not been unrolled.
Synchronization was accomplished through a fork-and-join mechanism using the cluster (shared) registers of the X-MP.
.PP
Listed in Table 1 are the results of the experiments; speedups range from 1.3 for a problem of size 50 @times@ 50 to 3.8 for a matrix of order 1000.
The performance ranges from 97 MFLOPS to an impressive 718 MFLOPS for four processors on a 1000 @times@ 1000 matrix.
For small problems, startup times dominate the overall performance.
It is interesting to note that a system of equations of order 1000 can now be factored and solved in under a second!
(The same problem on a VAX 11/780 would take roughly two hours to complete.)
.sp
.KS
.ce
Table 1
.ce
.I
High-Performance LU Decomposition
.R
.sp
.TS
center tab(;);
c c s s s c s s
c c s s s|c s s
c|c c c c|c c c
n|n n n n|n n n.
;MFLOPS;Speedup over 1 Processor
;# Processors;# Processors
Order;1;2;3;4;2;3;4
_
50;97;124;135;145;1.28;1.39;1.49
100;145;230;281;325;1.64;1.94;2.24
200;172;330;426;526;1.81;2.47;3.05
400;183;353;507;652;1.93;2.77;3.56
600;186;364;535;689;1.96;2.87;3.70
1000;188;372;550;718;1.98;2.92;3.81
.TE
.KE
.sp
.PP
For comparison we give in Table 2 the results for LU decomposition based on matrix-vector operations.
The parallelism here is gained by simply splitting the matrix-vector operation across the processors (see [1] for details on the algorithm; a sketch of this splitting appears at the end of this section).
The matrix-vector modules have been coded in assembly language.
The same number of operations is performed here as in the version that gives the "best" performance.
.sp
.KS
.ce
Table 2
.ce
.I
LU Based on Matrix-Vector Operations
.R
.TS
center tab(;);
c|c c
c|c c
n|n n.
Order;MFLOPS;Ratio of Algorithms from
;(4 processors);Table 1 / Table 2
_
50;57;2.54
100;167;1.95
200;343;1.53
400;537;1.21
600;608;1.13
1000;675;1.06
.TE
.KE
.sp
.PP
As we can see, for problems of lower order the high-performance algorithm is far more efficient.
For large problems, however, there is not much difference; for the matrix of order 1000 there is only a 6% difference in running time.
This illustrates one of the strengths of the matrix-vector design for algorithms of this nature: the matrix-vector modules can easily be changed as we move to a different architecture, while the basic algorithm is unaltered.
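.PP
As an illustration of this splitting, the following Fortran sketch (the names SPMV and MVBLK are ours, and this is not the assembly language kernel used in the experiments) partitions the rows of the update @y = y - Ax@ into NCPU contiguous blocks.
Each call to MVBLK is an independent piece of work on a disjoint block of rows of y; here the blocks are processed one after another, while in a multitasked version each block would be handed to a separate CPU and followed by a fork-and-join synchronization of the kind described above.
.sp
.nf
.ft CW
.ps 8
.vs 10p
      SUBROUTINE SPMV(A, LDA, M, N, X, Y, NCPU)
C
C  Illustrative sketch only: split the matrix-vector update
C  y = y - A*x by rows into NCPU independent pieces.  The pieces
C  are run sequentially here; in a multitasked version each call
C  to MVBLK would execute on its own CPU.
C
      INTEGER LDA, M, N, NCPU
      REAL A(LDA,N), X(N), Y(M)
      INTEGER I1, I2, NB, ICPU
      NB = (M + NCPU - 1)/NCPU
      DO 10 ICPU = 1, NCPU
         I1 = (ICPU-1)*NB + 1
         I2 = MIN(ICPU*NB, M)
         IF (I1 .LE. I2) CALL MVBLK(A(I1,1), LDA, I2-I1+1, N, X, Y(I1))
   10 CONTINUE
      RETURN
      END
C
      SUBROUTINE MVBLK(A, LDA, M, N, X, Y)
C
C  y = y - A*x for an M by N block of rows; this is the piece of
C  work assigned to one processor.
C
      INTEGER LDA, M, N
      REAL A(LDA,N), X(N), Y(M)
      INTEGER I, J
      DO 30 J = 1, N
         DO 20 I = 1, M
            Y(I) = Y(I) - A(I,J)*X(J)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END
.ps 11
.vs 16p
.ft R
.fi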
.SH
Cholesky Decomposition
.PP
A version of Cholesky decomposition for a symmetric positive definite matrix was implemented on the CRAY X-MP-4 based on matrix-vector routines (see [2] for algorithm details).
Table 3 gives the performance of that routine when run on 1 and 4 processors.
In addition, the algorithm was reorganized to compute the information necessary to perform four steps of the decomposition at once.
This results in four independent "full size" matrix-vector multiplications, which can be assigned one to each processor.
The performance of this variant is listed in the last column of Table 3.
.sp
.KS
.ce
Table 3
.ce
.I
Cholesky Decomposition Based on Matrix-Vector Operations
.R
.TS
center tab(;);
c c s c c
c|c c|c|c
c|c c|c|c
n|n n|n|n.
;MFLOPS;;MFLOPS
Order;1 Processor;4 Processors;Speedup;4 Processors
;;(MV split);;(4 MV ops)
_
50;59;52;.88;97
100;117;154;1.32;264
200;163;345;2.12;460
400;184;544;2.96;631
600;189;623;3.30;683
1000;193;689;3.58;733
.TE
.KE
.sp
.PP
Synchronization was accomplished via the X-MP shared registers and semaphores.
Multiprocessing overhead in this case is largely the result of code changes and of the breaking of a large piece of work into smaller pieces, each of which has essentially the same startup time as the original large piece.
.PP
The apparent overhead for multitasking small problems is largely caused by the following items:
.sp
.in +.25i
1.
Single-threaded portions of work.
Finding pivot rows, scaling rows of the matrix, and the scalar square root in Cholesky decomposition are all significant in the current study.
.sp
2.
Less parallel work per processor in the computational kernel.
Even in scalar operations the X-MP uses a considerable amount of parallelism.
In vector mode, 100 floating-point operations can be performed in the time it takes to call a subroutine, and scores of them in the time required to initialize a DO loop.
.sp
3.
Additional memory-bank conflicts.
The single-processor times were obtained with no activity in the other three CPUs.
When four processors are active, additional conflicts will occur, though in this case the effect is small because the matrix-vector operation is conflict-insensitive on the X-MP.
.sp
4.
Synchronization overhead.
This is dominated by the time a CPU spends waiting for all of its vector memory references to be completed.
Thus, it is generally wise to complete all vector memory references before synchronization, as this eliminates the possibility of an inter-CPU memory race condition.
.in -.25i
.sp
.SH
Conclusions
.PP
The CRAY X-MP is capable of small-granularity multitasking.
For large problems that are both vectorized and parallelized, the performance of the X-MP-4 should be in the range of 400 to 700 MFLOPS.
For smaller problems the startup time of the parallel processes can dominate the execution time.
This effect becomes more pronounced as more processors are applied to the problem.
.sp
.B NOTE
.PP
.R
Since the time this work was conducted, Cray Research has introduced microtasking, which provides a mechanism for the user to exploit small-granularity parallelism from Fortran programs conveniently and with low overhead through compiler directives.
.SH
References
.IP [1]
.R
Steve S. Chen, Jack J. Dongarra, and Christopher C. Hsiung,
"Multiprocessing for Linear Algebra Algorithms on the CRAY X-MP-2:
Experiences with Small Granularity,"
.I
J. Parallel and Distributed Computing,
.R
1 (1984), 22-31.
.sp
.IP [2]
.R
Jack J. Dongarra and Stanley C. Eisenstat,
"Squeezing the Most out of an Algorithm in CRAY Fortran,"
.I
ACM Trans. Math. Software,
.R
10, No. 3 (1984), 221-230.