.EQ
delim @@
.EN
.B
.nr BT ''-%-''
.he ''''
.pl 11i
.de fO
'bp
..
.wh -.5i fO
.LP
.nr LL 6.5i
.ll 6.5i
.nr LT 6.5i
.lt 6.5i
.ta 5.0i
.ft 3
.bp
.R
.sp 1i
.ce 100
.R
.sp .5i
.sp 10
ARGONNE NATIONAL LABORATORY
.br
9700 South Cass Avenue
.br
Argonne, Illinois 60439
.sp .6i
.ps 12
.ft 3
Implementing Dense Linear Algebra Algorithms
Using Multitasking on the CRAY X-MP-4
(or Approaching the Gigaflop)
.ft 3
.ps 11
.sp 3
Jack J. Dongarra and Tom Hewitt
.sp 3
.ps 10
.ft 1
Mathematics and Computer Science Division
.sp 2
Technical Memorandum No. 55
.sp .7i
August, 1985
.ce 0
.pn 1
.he ''-%-''
.he ''''
.bp
.ft 3
.ps 11
.bp
.LP
.LP
.EQ
gsize 11
.EN
.TL
.ps 12
.in 0
Implementing Dense Linear Algebra Algorithms
.br
Using Multitasking on the CRAY X-MP-4
.br
(or Approaching the Gigaflop)
.AU
.ps 11
.in 0
Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i'
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'
Argonne National Laboratory\h'.20i'
Argonne, Illinois 60439\h'.20i'
.AU
.ps 11
.in 0
Tom Hewitt\h'.20i'
.AI
.ps 10
.in 0
CRAY Research Inc.\h'.20i'
Chippewa Falls, Wisconsin 54701\h'.20i'
.FS
.ps 9
.vs 11p
@size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38.
.FE
.QS
.sp 2
.ps 10
.in .25i
.ll -.25i
.I "Abstract \(em"
This note describes some experiments on simple, dense linear algebra algorithms.
These experiments show that the CRAY X-MP is capable of exploiting the small-grain multitasking that arises from standard implementations of LU and Cholesky decomposition.
The implementation described here provides the "fastest" execution rate for LU decomposition, 718 MFLOPS for a matrix of order 1000.
.in
.ll
.QE
.nr PS 11
.nr VS 16
.nr PD 0.5v
.SH
Introduction
.PP
Over the past few months we have been experimenting with some simple linear algebra algorithms on the CRAY X-MP-4 multiprocessor.
The CRAY X-MP family is a general-purpose multiprocessor system.
It inherits the basic vector functions of the CRAY-1S, with major architectural improvements in each individual processor.
The CRAY X-MP-4 system is a four-processor model housed in a physical chassis identical to that of the CRAY-1S.
The system can be used to perform simultaneous scalar and vector processing of either independent job streams or independent tasks within one job.
Hardware in the X-MP enables multiple processors to be applied to a single Fortran program in a timely and coordinated manner.
.PP
All processors share a central bipolar memory (of up to 16 million words), organized in 64 interleaved memory banks.
Each processor has four memory ports: two for vector fetches, one for vector stores, and one for independent I/O operations.
In other words, the total memory bandwidth of the four processors is up to sixteen times that of the CRAY-1S system.
.PP
This note describes results obtained from three experiments: LU decomposition based on matrix-vector operations, LU decomposition based on a "best" implementation for the architecture, and an implementation of Cholesky decomposition based on matrix-vector operations.
.SH
LU Decomposition
.PP
The versions of LU and Cholesky factorization used here are based on matrix-vector modules that allow for a high level of granularity, permitting high performance in a number of different environments (see [2]).
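.PP
To make the structure of these modules concrete, the following Fortran sketch shows one common way to organize an LU decomposition with partial pivoting around matrix-vector products.
It is not the code used in the experiments; the name MVLU and the explicit inner loops are ours for illustration.
The two loop nests that update column K and row K are the matrix-vector modules, and in a tuned version they would be replaced by assembly language kernels whose work is split across the processors.
.sp
.nf
.ft CW
.ps 8
.vs 10p
      SUBROUTINE MVLU(A, LDA, N, IPVT, INFO)
C
C  Illustrative sketch only: LU decomposition with partial pivoting
C  organized so that the bulk of the work lies in two matrix-vector
C  products per step.
C
      INTEGER LDA, N, IPVT(N), INFO
      REAL A(LDA,N)
      INTEGER I, J, K, M, P
      REAL T
      INFO = 0
      DO 60 K = 1, N
C        update column K with a matrix-vector product
         DO 20 M = 1, K-1
            T = A(M,K)
            DO 10 I = K, N
               A(I,K) = A(I,K) - A(I,M)*T
   10       CONTINUE
   20    CONTINUE
C        find the pivot in column K and swap rows K and P
         P = K
         DO 30 I = K+1, N
            IF (ABS(A(I,K)) .GT. ABS(A(P,K))) P = I
   30    CONTINUE
         IPVT(K) = P
         IF (A(P,K) .EQ. 0.0) THEN
            INFO = K
            GO TO 60
         END IF
         IF (P .NE. K) THEN
            DO 40 J = 1, N
               T = A(K,J)
               A(K,J) = A(P,J)
               A(P,J) = T
   40       CONTINUE
         END IF
C        scale the multipliers below the diagonal
         DO 50 I = K+1, N
            A(I,K) = A(I,K)/A(K,K)
   50    CONTINUE
C        update row K with a vector-matrix product
         DO 58 M = 1, K-1
            T = A(K,M)
            DO 56 J = K+1, N
               A(K,J) = A(K,J) - T*A(M,J)
   56       CONTINUE
   58    CONTINUE
   60 CONTINUE
      RETURN
      END
.ps 11
.vs 16p
.ft R
.fi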
.PP
The algorithm designed to give the "best" performance on the X-MP architecture is worth noting.
It is based on standard Gaussian elimination with partial pivoting.
The algorithm is organized so that it zeros out three columns (below the diagonal) of the matrix and then applies these transformations to the rest of the matrix.
The application of the transformations to the remainder of the matrix is split up among the processors.
An assembly language kernel is used to apply the three pivot rows in simultaneous row operations.
This reduces memory traffic and allows a single processor of an X-MP to reach its theoretical maximum sustainable computation rate of 198 MFLOPS.
The assembly language kernel was multitasked in the experiment among two, three, and four CPUs.
Fortran-callable assembly language synchronization subroutines were used; these require less than half a microsecond for synchronization.
.PP
The process of finding the three pivot rows was not multitasked.
However, it was written as an assembly language kernel to reduce overhead for small problems.
As the factorization proceeds, the size of the relevant vector and submatrix decreases.
The final 15 @times@ 15 block of the reduction was performed by an unrolled [2] version of standard Gaussian elimination.
This portion is entirely single-threaded Fortran and runs at two to three times the speed of the Fortran that has not been unrolled.
Synchronization was accomplished through a fork-and-join mechanism using the cluster (shared) registers of the X-MP.
.PP
Listed in Table 1 are the results of the experiments; speedups range from 1.3 for a problem of size 50 @times@ 50 to 3.8 for a matrix of order 1000.
The performance ranges from 97 MFLOPS to an impressive 718 MFLOPS for four processors on a 1000 @times@ 1000 matrix.
For small problems, startup times dominate the overall performance.
It is interesting to note that a system of equations of order 1000 can now be factored and solved in under a second!
(The same problem on a VAX 11/780 would take roughly two hours to complete.)
.sp
.KS
.ce
Table 1
.ce
.I
High-Performance LU Decomposition
.R
.sp
.TS
center tab(;);
c c s s s c s s
c c s s s|c s s
c|c c c c|c c c
n|n n n n|n n n.
;MFLOPS;Speedup over 1 Processor
;# Processors;# Processors
Order;1;2;3;4;2;3;4
_
50;97;124;135;145;1.28;1.39;1.49
100;145;230;281;325;1.64;1.94;2.24
200;172;330;426;526;1.81;2.47;3.05
400;183;353;507;652;1.93;2.77;3.56
600;186;364;535;689;1.96;2.87;3.70
1000;188;372;550;718;1.98;2.92;3.81
.TE
.KE
.sp
.PP
For comparison we give in Table 2 the results for LU decomposition based on matrix-vector operations.
The parallelism here is gained by simply splitting the matrix-vector operation across the processors (see [1] for details on the algorithm; a sketch of this splitting appears at the end of this section).
The matrix-vector modules have been coded in assembly language.
The same number of operations is performed here as in the version that gives the "best" performance.
.sp
.KS
.ce
Table 2
.ce
.I
LU Based on Matrix-Vector Operations
.R
.TS
center tab(;);
c|c c
c|c c
n|n n.
Order;MFLOPS;Ratio of Algorithms from
;(4 processors);Table 1 / Table 2
_
50;57;2.54
100;167;1.95
200;343;1.53
400;537;1.21
600;608;1.13
1000;675;1.06
.TE
.KE
.sp
.PP
As we can see, for problems of lower order the high-performance algorithm is far more efficient.
For large problems, however, there is not much difference; for the matrix of order 1000 there is only a 6% difference in running time.
This illustrates one of the strengths of the matrix-vector design for algorithms of this nature: the matrix-vector modules can easily be changed as we move to a different architecture, while the basic algorithm is unaltered.
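.PP
As an illustration of this splitting, the following Fortran sketch (the names SPMV and MVBLK are ours, and this is not the assembly language kernel used in the experiments) partitions the rows of the update @y = y - Ax@ into NCPU contiguous blocks.
Each call to MVBLK is an independent piece of work on a disjoint block of rows of y; here the blocks are processed one after another, while in a multitasked version each block would be handed to a separate CPU and followed by a fork-and-join synchronization of the kind described above.
.sp
.nf
.ft CW
.ps 8
.vs 10p
      SUBROUTINE SPMV(A, LDA, M, N, X, Y, NCPU)
C
C  Illustrative sketch only: split the matrix-vector update
C  y = y - A*x by rows into NCPU independent pieces.  The pieces
C  are run sequentially here; in a multitasked version each call
C  to MVBLK would execute on its own CPU.
C
      INTEGER LDA, M, N, NCPU
      REAL A(LDA,N), X(N), Y(M)
      INTEGER I1, I2, NB, ICPU
      NB = (M + NCPU - 1)/NCPU
      DO 10 ICPU = 1, NCPU
         I1 = (ICPU-1)*NB + 1
         I2 = MIN(ICPU*NB, M)
         IF (I1 .LE. I2) CALL MVBLK(A(I1,1), LDA, I2-I1+1, N, X, Y(I1))
   10 CONTINUE
      RETURN
      END
C
      SUBROUTINE MVBLK(A, LDA, M, N, X, Y)
C
C  y = y - A*x for an M by N block of rows; this is the piece of
C  work assigned to one processor.
C
      INTEGER LDA, M, N
      REAL A(LDA,N), X(N), Y(M)
      INTEGER I, J
      DO 30 J = 1, N
         DO 20 I = 1, M
            Y(I) = Y(I) - A(I,J)*X(J)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END
.ps 11
.vs 16p
.ft R
.fi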
.SH
Cholesky Decomposition
.PP
A version of Cholesky decomposition for a symmetric positive definite matrix was implemented on the CRAY X-MP-4 based on matrix-vector routines (see [2] for algorithm details).
Table 3 gives the performance of that routine when run on 1 and 4 processors.
In addition, the algorithm was reorganized to compute the information necessary to perform four steps of the decomposition at once.
This results in four independent "full size" matrix-vector multiplications, which can be assigned one to each processor.
The performance of this variant is listed in the last column of Table 3.
.sp
.KS
.ce
Table 3
.ce
.I
Cholesky Decomposition Based on Matrix-Vector Operations
.R
.TS
center tab(;);
c c s c c
c|c c|c|c
c|c c|c|c
n|n n|n|n.
;MFLOPS;;MFLOPS
Order;1 Processor;4 Processors;Speedup;4 Processors
;;(MV split);;(4 MV ops)
_
50;59;52;.88;97
100;117;154;1.32;264
200;163;345;2.12;460
400;184;544;2.96;631
600;189;623;3.30;683
1000;193;689;3.58;733
.TE
.KE
.sp
.PP
Synchronization was accomplished via the X-MP shared registers and semaphores.
Multiprocessing overhead in this case is largely the result of code changes and of the breaking of a large piece of work into smaller pieces, each of which has essentially the same startup time as the original large piece.
.PP
The apparent overhead for multitasking small problems is largely caused by the following items:
.sp
.in +.25i
1.
Single-threaded portions of work.
Finding pivot rows, scaling rows of the matrix, and the scalar square root in Cholesky decomposition are all significant in the current study.
.sp
2.
Less parallel work per processor in the computational kernel.
Even in scalar operations the X-MP uses a considerable amount of parallelism.
In vector mode, 100 floating-point operations can be performed in the time it takes to call a subroutine, and scores of them in the time required to initialize a DO loop.
.sp
3.
Additional memory-bank conflicts.
The single-processor times were obtained with no activity in the other three CPUs.
When four processors are active, additional conflicts will occur, though in this case the effect is small because the matrix-vector operation is conflict-insensitive on the X-MP.
.sp
4.
Synchronization overhead.
This is dominated by the time a CPU spends waiting for all of its vector memory references to be completed.
Thus, it is generally wise to complete all vector memory references before synchronization, as this eliminates the possibility of an inter-CPU memory race condition.
.in -.25i
.sp
.SH
Conclusions
.PP
The CRAY X-MP is capable of small-granularity multitasking.
For large problems that are both vectorized and parallelized, the performance of the X-MP-4 should be in the range of 400 to 700 MFLOPS.
For smaller problems the startup time of the parallel processes can dominate the execution time.
This effect becomes more pronounced as more processors are applied to the problem.
.sp
.B NOTE
.PP
.R
Since the time this work was conducted, Cray Research has introduced microtasking, which provides a mechanism for the user to exploit small-granularity parallelism from Fortran programs conveniently and with low overhead through compiler directives.
.SH
References
.IP [1]
.R
Steve S. Chen, Jack J. Dongarra, and Christopher C. Hsiung,
"Multiprocessing for Linear Algebra Algorithms on the CRAY X-MP-2:
Experiences with Small Granularity,"
.I
J. Parallel and Distributed Computing,
.R
1 (1984), 22-31.
.sp
.IP [2]
.R
Jack J. Dongarra and Stanley C. Eisenstat,
"Squeezing the Most out of an Algorithm in CRAY Fortran,"
.I
ACM Trans. Math. Software,
.R
10, No. 3 (1984), 221-230.