.EQ
delim @@
.EN
.TL
.ps 12
.in 0
Multiprocessing Linear Algebra \h'.35i'
.sp
.br
Algorithms on the CRAY X-MP-2: \h'.35i'
.sp
.br
Experiences with Small Granularity \h'.35i'
.sp 2
.AU
.ps 11
.in 0
Steve C. Chen
.AI
.ps 10
.in 0
CRAY Research Inc. \h'.25i'
Chippewa Falls, Wisconsin\h'.20i'
.AU
.ps 11
.in 0
Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i'
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'
Argonne National Laboratory\h'.20i'
.AU
.ps 11
.in 0
Christopher C. Hsiung
.AI
.ps 10
.in 0
CRAY Research Inc. \h'.25i'
Chippewa Falls, Wisconsin\h'.20i'
.FS
.ps 9
.vs 11p
@size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences Research Program (KC-04-02) of the Office of Energy Research of the U. S. Department of Energy under Contract W-31-109-Eng-38.
.sp .25v
.FE
.QS
.ps 10
.in .25i
.ll -.25i
.I Abstract
\(em This paper gives a brief overview of the CRAY X-MP-2 general-purpose multiprocessor system and discusses how it can be used effectively to solve problems that have small granularity.
An implementation is described for linear algebra algorithms that solve systems of linear equations when the matrix is general and when the matrix is symmetric and positive definite.
.in
.ll
.QE
.nr PS 11
.nr VS 16
.nr PD 0.5v
.SH
Overview of the System
.PP
The CRAY X-MP is a follow-up to the CRAY-1S system offered by CRAY Research, Inc.
The CRAY X-MP family is a general-purpose multiprocessor system.
It inherits the basic vector functions of the CRAY-1S, with major architectural improvements in each individual processor.
The interprocessor communication mechanism and the provision of a Solid-State Disk device (SSD) are new designs that create tremendous potential in the realm of high-speed computing.
.PP
The CRAY X-MP-2 system is the first product of the CRAY X-MP family.
It is a dual-processor model housed in the same physical chassis as the CRAY-1S.
Each processor occupies half the space of the original CRAY-1S, a reduction achieved through a higher level of IC integration, denser packaging, and much improved cooling capacity.
.PP
The system is designed specifically to handle computation-intensive and I/O-intensive jobs in an efficient way.
It can be used to perform simultaneous scalar and vector processing of either independent job streams or independent tasks within one job.
Hardware in the X-MP enables multiple processors to be applied to a single Fortran program in a timely and coordinated manner.
.PP
All processors share a central bipolar memory (of up to 4 million words), organized in 32 interleaved memory banks.
Each processor has four memory ports: two for vector fetches, one for vector stores, and one for independent I/O operations.
In other words, the total memory bandwidth of the two processors is up to eight times that of the CRAY-1S system.
The added memory bandwidth provides a balanced architecture for memory-to-memory vector operations as typified by scientific Fortran codes.
.PP
Other features of the machine include automatic hardware "flexible chaining" [1].
This feature allows each individual processor to perform simultaneous memory fetches, arithmetic, and memory store operations in a series of related vector operations.
The elimination of "chain slot time" guarantees "supervector" speed in all vector operations.
It contrasts with the "fixed chaining" and unidirectional vector fetch/store of the CRAY-1S [7].
The balance between CPU speed and memory bandwidth makes the CRAY X-MP friendly to Fortran codes.
The need to resort to assembly-level hand coding is drastically reduced.
Other improved features of the CPU include multiple data paths and an increased instruction buffer size.
The machine has a faster central clock with an improved cycle time of 9.5 ns, as compared with 12.5 ns on the CRAY-1.
The result is much improved scalar and vector speed over the CRAY-1S.
.PP
In addition, a new, optional, CPU-driven Solid-State Storage Device (SSD) offers an extended central memory, an important element for buffering between the fast central memory and slower disk devices.
The SSD can be used as a fast-access device for large pre-staged or intermediate files generated and manipulated repetitively by user programs, or it can be used by the system for job swapping space and temporary storage of frequently used system programs.
.KS
.ce
Figure 1
.ce
CRAY X-MP-2 System Organization
.sp 8
.ce
(diagram of the machine to be included here)
.sp 7
.KE
.PP
Additional hardware in the X-MP enables efficient and coordinated application of multiple processors to a single user program.
All processors assigned to a task share a unique set of synchronization and communication registers.
There are three kinds of shared registers: a set of binary semaphores, a set of index registers, and a set of data registers.
These registers, in cooperation with the shared central memory, allow processors to signal each other, wait for each other, and transfer data among themselves.
Processors can also interrupt each other through the interprocessor interrupt.
These hardware functions provide a basic mechanism for efficient communication and synchronization between processors.
.PP
Beyond the hardware instructions, software support for multitasking is provided at the library level, where the user makes calls to request multitasking functions from the system.
Three categories of routines provide different mechanisms for parallel processing [5].
First, a task can be created and scheduled for execution through the TSKSTART call.
The calling task may wait for the termination of a created task with the TSKWAIT routine.
A task is a schedulable unit of work that the user expects to be executed in a serial manner.
It is a software entity that the programmer deals with; the physical processor is concealed from the programmer.
.PP
Second, tasks may need to communicate or synchronize with one another as they execute concurrently.
Producing tasks may signal consuming tasks through the EVPOST routine.
A consuming task may wait for the signal through the EVWAIT routine.
Once the signal has been consumed, it may be reset through an EVCLEAR call.
.PP
In the third category, the LOCKON and LOCKOFF routines are used to protect the integrity of code segments or shared resources (e.g., data) among tasks.
.PP
Based on these three categories, other synchronization and communication mechanisms can be developed.
Since the multitasking library employs a queueing mechanism (among others) to handle general situations, it is very flexible.
Of course, the user always has the option to hand code his own synchronization routines through the use of hardware instructions.
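.PP
As an illustration of how these library calls fit together, the following Fortran sketch creates a second task and coordinates it with events.
It is a minimal sketch, not code from our implementation; the names WORKER, ITASK, IGO, and IDONE are ours, and an installation may require additional setup (for example, assigning the event variables to the library) before they are first used.
.KS
.nf
      PROGRAM DEMO
C     Minimal sketch of the multitasking library calls (names ours).
      EXTERNAL WORKER
      INTEGER ITASK(2), IGO, IDONE
      COMMON /SYNC/ IGO, IDONE
C     First word of the task control array gives its length.
      ITASK(1) = 2
C     Create a second task that begins executing subroutine WORKER.
      CALL TSKSTART (ITASK, WORKER)
C     Signal the worker to start, do this task's share of the work,
C     then wait for the worker's completion signal and reset it.
      CALL EVPOST (IGO)
C     ... this task's work goes here ...
      CALL EVWAIT (IDONE)
      CALL EVCLEAR (IDONE)
C     Wait for the created task to terminate.
      CALL TSKWAIT (ITASK)
      END

      SUBROUTINE WORKER
      INTEGER IGO, IDONE
      COMMON /SYNC/ IGO, IDONE
C     Wait for the go-ahead, consume it, do the work, and reply.
      CALL EVWAIT (IGO)
      CALL EVCLEAR (IGO)
C     ... the other task's work goes here ...
      CALL EVPOST (IDONE)
      END
.fi
.KE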
.SH
Granularity of Tasks
.PP
A number of factors influence the performance of an algorithm in multiprocessing.
These include the degree of parallelism, process synchronization overhead, load balancing, interprocessor memory contention, and the modifications needed to separate the parallel parts of an algorithm.
.PP
The size of the work performed in parallel, the granularity of the task, is the first critical factor that should be addressed in matching a parallel algorithm to an architecture.
By granularity, we mean the amount of time the cooperating tasks execute concurrently on related code between synchronization points.
The need to synchronize and to communicate before and after parallel work greatly affects the overall execution time of the program.
Since the processors have to wait for one another instead of doing useful computation, it is obviously better to minimize that overhead.
When segments of parallel code execute in vector mode, typically at ten to twenty times the speed of scalar mode, granularity becomes an even more important issue, since the communication mechanisms are implemented in scalar mode.
.PP
Granularity is also closely related to the degree of parallelism, which is defined to be the percentage of time spent in the parallel portion of the code.
Typically, a small granularity job means that parallelism occurs at an inner loop level (although not necessarily the innermost loop).
In this case, even the loop setup time in outer loops becomes significant, to say nothing of the frequent task synchronization required.
.PP
For the CRAY X-MP family, granularity on the order of milliseconds is considered large.
Many reports [1][4][5] have shown significant speedups for multitasking of large granularity problems on the CRAY X-MP-2.
For example, speedups of 1.8 to 1.9 were seen when two processors were used instead of one to run a particle-in-cell code, a weather forecasting model, and a three-dimensional seismic migration code.
For these problems, the multitasking library is used to implement parallel tasks.
The reported successes indicate that, for large granularity codes that have high degrees of parallelism, the payoff of doing multitasking on the CRAY X-MP-2 is very significant.
The use of the CRAY multitasking library to handle intertask communication proves to be very powerful.
It will be interesting to see, however, how far we can push the granularity down and still get a decent speedup.
As will be shown later, even when the library appears to be too coarse for small granularity tasks, the hardware capability still allows them to be handled efficiently.
For our purposes, we use matrix vector operations to test the machine behavior for small granularity tasks.
.SH
The Algorithms
.PP
We have chosen matrix vector operations for two reasons.
First, we can easily construct the standard algorithms of linear algebra out of these types of modules.
Second, matrix vector operations provide simple modules for parallel execution [2][3].
We will describe an implementation of standard LU factorization with partial pivoting for a general square matrix.
.PP
The algorithm can be described as having basically three distinct parts within a loop:
.KS
.sp
.in +1.
.I
do i = 1,n
.br
.in +1.
perform the i-th matrix vector product (forms part of L)
.br
search for a pivot and interchange
.br
perform the i-th vector matrix product (forms part of U)
.in -1.
.br
end
.in -1.
.KE
.R
.sp
The algorithm organizes storage so that the original matrix is overwritten with the factorization.
The amount of work required to perform the factorization is approximately @2/3 n sup 3 @ floating point operations (here we count both additions and multiplications as operations).
The factored matrix can then be used to solve systems of equations.
In order to maintain stability in the algorithm, a partial pivoting scheme is used.
This helps avoid problems with small divisors, which can cause inaccurate computations.
The pivot is chosen to be the element of largest absolute value in a column.
The method used is a simple variant of Crout reduction [6], but the algorithm has been expressed in terms of modules that are easy both to understand and to implement.
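.PP
To make the module concrete, the following Fortran sketch shows the kind of matrix vector kernel from which the loop above is built, computing the update @y ~=~ y ~+~ Mx@.
The routine name MXVP and its argument order are ours, not those of the actual implementation; running the inner loop over the rows keeps it a single long vector operation.
.KS
.nf
      SUBROUTINE MXVP (N1, N2, M, LDM, X, Y)
C     Sketch of a matrix vector product module:  Y <- Y + M*X,
C     where M is N1 by N2 with leading dimension LDM (names ours).
      INTEGER N1, N2, LDM, I, J
      REAL M(LDM,N2), X(N2), Y(N1)
      DO 20 J = 1, N2
C        Each column of M contributes X(J) times that column to Y;
C        the inner loop is a vector operation of length N1.
         DO 10 I = 1, N1
            Y(I) = Y(I) + X(J)*M(I,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END
.fi
.KE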
.PP
The algorithm described above allows for a number of alternatives for parallel implementation.
That is, the available processors could be assigned to each of the three steps, allowing for multiprocessing within each.
This avenue has not been investigated, since the pivoting does not take as much time as the other steps.
Instead, each processor is given the task of performing one of the matrix vector operations concurrently.
The algorithm can be easily modified so that each matrix vector operation is independent and can proceed in parallel.
The pivoting is handled by one of the two processors in a sequential fashion.
.PP
Depicted graphically, the processing of the matrix by the algorithm at the @i sup th @ stage would look like the following (the matrix factors overwrite the original matrix):
.KS
.sp 10
.KE
.PP
The algorithm is modified slightly to allow independent operations to be performed in parallel.
The modified algorithm in no way increases the number of operations or the complexity over the original algorithm; it simply rearranges the computation within the loop to expose independent operations.
The resulting modified algorithm is of the following form:
.sp
.KS
.in +1.
.I
perform the 1-st matrix vector product
.br
do i = 1,n-1
.in +1.
.br
search for a pivot and interchange
.br
perform the i-th vector matrix product
.br
perform the i+1-st matrix vector product
.in -1.
.br
end
.br
perform the n-th vector matrix product
.in -1.
.R
.KE
.sp
.PP
It is now easy to see how to partition the work between two processors: each matrix vector operation within an iteration is independent of the other.
The two cooperating tasks need to synchronize @2(n-1)@ times during the course of the calculation (once before a task is started and once after).
There is, however, a slight imbalance in the work of the two tasks.
At the @i sup th @ step, the amounts of work for the two tasks are @O((n-i+1)*i)@ and @O((n-i)*i)@ operations.
.PP
To perform the synchronization between tasks, we have used task control (TSKSTART) to start a second task before the LU decomposition routine is entered.
This task waits until it is directed to start up.
Event control (EVPOST, EVWAIT, and EVCLEAR) is used to start and synchronize the work within the algorithm.
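.PP
Schematically, the loop in the master task then takes the following form (a sketch only; the event names NEXT and DONE and the layout are ours).
The second task, started once with TSKSTART before the factorization routine is entered, loops waiting on NEXT, performs the (i+1)-st matrix vector product, and posts DONE.
.KS
.nf
C     Master task: one pass of the modified factorization loop.
C     NEXT and DONE are event variables shared with the second task.
      DO 30 I = 1, N-1
C        Search for a pivot and interchange (sequential).
C        Let the second task begin the (i+1)-st matrix vector product.
         CALL EVPOST (NEXT)
C        Perform the i-th vector matrix product in this task,
C        e.g. with a call to the matrix vector module sketched above.
C        Wait for the second task to finish, then reset its event.
         CALL EVWAIT (DONE)
         CALL EVCLEAR (DONE)
   30 CONTINUE
.fi
.KE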
.PP
Table 1 describes the performance of this implementation of LU decomposition.
The column labeled "Degradation from code change" reflects the loss of performance when the sequential algorithm was restructured to separate independent tasks.
The numbers were obtained by running the original algorithm and the modified algorithm, using no multitasking features, on a single processor and taking the percentage difference.
Measurements were also made of the total time spent in the parallel portion of the algorithm.
As expected, small order matrices consume a greater percentage of time in non-parallel parts than do larger matrices.
Even for matrices of order 50, however, over 75% of the time is used for parallel processing.
.sp
.KS
.ce
Table 1
.ce
Parallel Version of LU Decomposition
.sp
.TS
center;
c c c
c c c
n n n.
	Degradation from	Percentage of
n	code change	parallel code
_	_	_
50	6.5%	75.1%
100	4.9%	82.0%
300	2.0%	87.5%
600	0.7%	97.0%
.TE
.KE
.sp
We also measured the speedup of the algorithm when two processors were used.
Table 2 shows the comparison between the modified one-processor version (without the multitasking mechanism) and the multitasked two-processor version.
The column labeled "Mean granularity" is the average time spent in each of the matrix vector calls for that particular order of problem.
.sp
.KS
.ce
Table 2
.ce
Speedup of Algorithm on 2 Processors vs 1 Processor
.sp
.TS
center;
c c c c c
c c c c c
n n n n n.
n	Optimal	MT library calls	MT CAL calls	Mean granularity
_	_	_	_	_
50	1.44	0.86	1.39	40 @ mu @sec
100	1.57	1.05	1.54	66 @ mu @sec
300	1.77	1.60	1.79	250 @ mu @sec
600	1.91	1.80	1.86	800 @ mu @sec
.TE
.KE
.R
.sp
.PP
The "Optimal" column is an attempt to filter out the time required to synchronize the processors and the time introduced by memory contention caused by multitasking on the two processors.
These numbers were generated by running the program twice.
The first run was a sequential process, using no multitasking constructs.
In the second run the call to the shorter of the two parallel tasks was removed; thus, in some sense, this is the best situation (the calculation, of course, does not produce the correct results, but it provides a good measure of the overhead).
The third column is the result of using the standard multitasking routines in the Multitasking (MT) library.
As an alternative, there are assembly language (CAL) routines that use the hardware instructions directly.
These CAL routines perform the minimum synchronization functions required by the code and are faster than the general-purpose multitasking library routines.*
.FS
* The MT library routines keep track of additional information on the activities of other tasks.
They are better suited to larger and more numerous tasks and are quite flexible for different programming styles.
For this particular small granularity job that information is not needed; hence the CAL mechanism is more suitable.
The overhead is O(1) clocks for the CAL version and O(100) to O(1000) clocks for the multitasking library, depending on the mechanism used.
.FE
.PP
With this algorithm the work partition is well matched to two processors.
The overhead of multitasking is essentially wiped out as the problem size increases, and for small problems it is not a great penalty.
In the limit, the implementation comes very close to attaining the optimal performance.
.PP
We now focus on another algorithm for dealing with a system of equations where the matrix is symmetric and positive definite: Cholesky factorization.
The algorithm can again be described in terms of matrix vector operations, but in this case, because of symmetry, only half the matrix is referenced.
.PP
As before, we can graphically describe the algorithm at the @i sup th @ stage as follows:
.KS
.sp 10
.KE
In this case, the algorithm does not divide naturally into two parts as the previous algorithm did.
To distribute the work between the processors, we take the naive approach of simply splitting the matrix vector operation in half, letting one processor take the left half and the other processor the right half, and then combining the two parts in a sequential step.
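.PP
In outline, the split of one matrix vector product @y ~=~ y ~+~ Mx@ between the two tasks looks as follows (a sketch with our own variable names; W is the extra work array discussed below, and MXVP is the module sketched earlier).
.KS
.nf
C     Split Y <- Y + M*X over the columns of M between two tasks.
      NHALF = N2/2
C     Task 1 accumulates the left half of the columns into Y.
      CALL MXVP (N1, NHALF, M(1,1), LDM, X(1), Y)
C     Task 2, running concurrently, accumulates the right half into
C     a private work array W, which must first be set to zero.
      DO 35 I = 1, N1
         W(I) = 0.0
   35 CONTINUE
      CALL MXVP (N1, N2-NHALF, M(1,NHALF+1), LDM, X(NHALF+1), W)
C     After the tasks synchronize, the halves are combined serially.
      DO 40 I = 1, N1
         Y(I) = Y(I) + W(I)
   40 CONTINUE
.fi
.KE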
Table 3 shows the results.
.sp
.KS
.ce
Table 3
.ce
Parallel Version of Cholesky Decomposition
.sp
.TS
center;
c c c
c c c
n n n.
	Degradation from	Percentage of
n	code change	parallel code
_	_	_
50	33.1%	80.2%
100	24.2%	85.4%
300	8.0%	91.9%
600	3.5%	96.1%
.TE
.KE
.sp
As before, the percentage given here is for the modified code before the multitasking mechanism is put in.
The degradation in code performance is the result of the additional subroutine calls needed to perform the matrix vector product and the fact that one more work array has to be initialized and added to the other half in each iteration.
In Table 4, the modified one-processor version (without the multitasking mechanism) is compared against the multitasked two-processor version.
.sp
.KS
.ce
Table 4
.ce
Speedup of Algorithm on 2 Processors vs 1 Processor
.sp
.TS
center;
c c c c c
c c c c c
n n n n n.
n	Optimal	MT library calls	MT CAL calls	Mean granularity
_	_	_	_	_
50	1.67	0.76	1.56	30 @ mu @sec
100	1.74	0.94	1.54	45 @ mu @sec
300	1.85	1.52	1.85	130 @ mu @sec
600	1.93	1.80	1.92	400 @ mu @sec
.TE
.sp
.KE
.R
As in the previous example, the improvement is substantial when two processors are used to partition the work and perform the task.
For large orders, the parallel program reaches the optimal rate of speedup.
This experiment is also relevant to the case where there are more than two processors: the matrix vector operation would then be split across the matrix in a similar fashion, perhaps going to a block matrix scheme to achieve the desired number of parallel tasks.
We expect to observe the same trend when we can perform a similar experiment with more processors.
.PP
Note that there is a certain amount of fluctuation between runs on the CRAY X-MP, depending on background activity.
The numbers presented here should therefore be given a 1-3% tolerance.
.sp
.SH
Conclusions
.PP
The multitasking concept on the CRAY X-MP-2 has been shown to be advantageous in solving problems with relatively large granularities (that is, when there is more than one millisecond of computation that can be performed in parallel between synchronization points).
.PP
For problems with small granularity but a reasonable degree of parallelism that can be exploited, at least from the standpoint of linear algebra solvers where the granularity is on the order of microseconds, the situation can be handled just as efficiently.
In general, the main sources of overhead (other than synchronization, load imbalance, and code change) are memory contention and operating system services.
The speedup figures of the examples presented here show that the interference from these two factors is insignificant.
Our experience demonstrates that multitasking with small granularity jobs is very promising on the CRAY X-MP-2.
.bp
.SH
References
.sp
.IP [1]
S.C. Chen,
.I
Large-scale and High-Speed Multiprocessor System for Scientific Applications \(em CRAY X-MP-2 Series,
.R
NATO Advanced Research Workshop on High Speed Computation, Nuclear Research Center, Julich, West Germany, June 1983.
.sp
.IP [2]
J.J. Dongarra and R. Hiromoto,
.I
A Collection of Parallel Linear Equations Routines for the Denelcor HEP,
.R
ANL/MCS-TM-15, September 1983.
.sp
.IP [3]
J.J. Dongarra and S.C. Eisenstat,
.I
Squeezing the Most out of an Algorithm in Cray Fortran,
.R
ANL/MCS-TM-9, May 1983.
.sp
.IP [4]
C.C. Hsiung and W. Butscher,
.I
A New Numerical Seismic 3-D Migration Model on the CRAY X-MP,
.R
SIAM Conference on Parallel Processing and Scientific Computing, Norfolk, VA, 1983.
.sp
.IP [5]
J.L. Larson,
.I
An Introduction to Multitasking on the CRAY X-MP-2 Multiprocessor,
.R
to appear in IEEE Computer.
.sp
.IP [6]
G.W. Stewart,
.I
Introduction to Matrix Computations,
.R
Academic Press, New York, 1973.
.sp
.IP [7]
P.M. Johnson,
.I
An Introduction to Vector Processing,
.R
Computer Design, Vol. 17, No. 2, pp. 289-97, February 1978.