.EQ
delim @@
.EN
.TL
.ps 12
.in 0
Multiprocessing Linear Algebra \h'.35i'
.sp
.br
Algorithms on the CRAY X-MP-2: \h'.35i'
.sp
.br
Experiences with Small Granularity \h'.35i'
.sp 2
.AU
.ps 11
.in 0
Steve C. Chen
.AI
.ps 10
.in 0
CRAY Research Inc. \h'.25i'
Chippewa Falls, Wisconsin\h'.20i'
.AU
.ps 11
.in 0
Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i'
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'
Argonne National Laboratory\h'.20i'
.AU
.ps 11
.in 0
Christopher C. Hsiung
.AI
.ps 10
.in 0
CRAY Research Inc. \h'.25i'
Chippewa Falls, Wisconsin\h'.20i'
.FS
.ps 9
.vs 11p
@size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences Research Program (KC-04-02) of the Office of Energy Research of the U. S. Department of Energy under Contract W-31-109-Eng-38.
.sp .25v
.FE
.QS
.ps 10
.in .25i
.ll -.25i
.I Abstract
\(em This paper gives a brief overview of the CRAY X-MP-2 general-purpose multiprocessor system and discusses how it can be used effectively to solve problems that have small granularity.
An implementation is described for linear algebra algorithms that solve systems of linear equations when the matrix is general and when the matrix is symmetric and positive definite.
.in
.ll
.QE
.nr PS 11
.nr VS 16
.nr PD 0.5v
.SH
Overview of the System
.PP
The CRAY X-MP is a follow-up to the CRAY-1S system offered by CRAY Research, Inc.
The CRAY X-MP family is a general-purpose multiprocessor system.
It inherits the basic vector functions of the CRAY-1S, with major architectural improvements in each individual processor.
The interprocessor communication mechanism and the provision of a Solid-State Disk device (SSD) are new designs that create tremendous potential in the realm of high-speed computing.
.PP
The CRAY X-MP-2 system is the first product of the CRAY X-MP family.
It is a dual-processor model housed in the same physical chassis as the CRAY-1S.
Each processor occupies half the space of the original CRAY-1S, a reduction achieved through a higher level of IC integration, denser packaging, and much improved cooling capacity.
.PP
The system is designed specifically to handle computation-intensive and I/O-intensive jobs in an efficient way.
It can be used to perform simultaneous scalar and vector processing of either independent job streams or independent tasks within one job.
Hardware in the X-MP enables multiple processors to be applied to a single Fortran program in a timely and coordinated manner.
.PP
All processors share a central bipolar memory (of up to 4 million words), organized in 32 interleaved memory banks.
Each processor has four memory ports: two for vector fetches, one for vector stores, and one for independent I/O operations.
In other words, the total memory bandwidth of the two processors is up to eight times that of the CRAY-1S system.
The added memory bandwidth provides a balanced architecture for memory-to-memory vector operations as typified by scientific Fortran codes.
.PP
Other features of the machine include automatic hardware "flexible chaining" [1].
This feature allows each individual processor to perform simultaneous memory fetches, arithmetic, and memory store operations in a series of related vector operations.
The elimination of "chain slot time" guarantees "supervector" speed in all vector operations.
It contrasts with the "fixed chaining" and unidirectional vector fetch/store of the CRAY-1S [7].
The balance between CPU speed and memory bandwidth makes the CRAY X-MP friendly to Fortran codes.
The need to resort to assembly-level hand coding is drastically reduced.
Other improved features of the CPU include multiple data paths and an increased instruction buffer size.
The machine has a faster central clock with an improved cycle time of 9.5 ns, as compared with 12.5 ns on the CRAY-1.
The result is much improved scalar and vector speed over the CRAY-1S.
.PP
In addition, a new, optional, CPU-driven Solid-State Storage Device (SSD) offers an extended central memory, an important element for buffering between the fast central memory and slower disk devices.
The SSD can be used as a fast-access device for large pre-staged or intermediate files generated and manipulated repetitively by user programs, or it can be used by the system for job swapping space and temporary storage of frequently used system programs.
.KS
.ce
Figure 1
.ce
CRAY X-MP-2 System Organization
.sp 8
.ce
(diagram of the machine to be included here)
.sp 7
.KE
.PP
Additional hardware in the X-MP enables efficient and coordinated application of multiple processors to a single user program.
All processors assigned to a task share a unique set of synchronization and communication registers.
There are three kinds of shared registers: a set of binary semaphores, a set of index registers, and a set of data registers.
These registers, in cooperation with the shared central memory, allow processors to signal each other, wait for each other, and transfer data among themselves.
Processors can also interrupt each other through the interprocessor interrupt.
These hardware functions provide a basic mechanism for efficient communication and synchronization between processors.
.PP
Beyond the hardware instructions, software support for multitasking is provided at the library level, where the user makes calls to request multitasking functions from the system.
Three categories of routines provide different mechanisms for parallel processing [5].
First, a task can be created and scheduled for execution through the TSKSTART call.
The calling task may wait for the termination of a created task with the TSKWAIT routine.
A task is a schedulable unit of work that the user expects to be executed in a serial manner.
It is a software entity that the programmer deals with; the physical processor is concealed from the programmer.
.PP
Second, tasks may need to communicate or synchronize with one another as they execute concurrently.
Producing tasks may signal consuming tasks through the EVPOST routine.
A consuming task may wait for the signal through the EVWAIT routine.
Once the signal has been consumed, it may be reset through an EVCLEAR call.
.PP
In the third category, the LOCKON and LOCKOFF routines are used to protect the integrity of code segments or shared resources (e.g., data) among tasks.
.PP
Based on these three categories, other synchronization and communication mechanisms can be developed.
Since the multitasking library employs a queueing mechanism (among others) to handle general situations, it is very flexible.
Of course, the user always has the option to hand code his own synchronization routines through the use of hardware instructions.
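.PP
As an illustration of how these library calls fit together, the following Fortran sketch creates a second task and coordinates it with events.
It is a minimal sketch, not code from our implementation; the names WORKER, ITASK, IGO, and IDONE are ours, and an installation may require additional setup (for example, assigning the event variables to the library) before they are first used.
.KS
.nf
      PROGRAM DEMO
C     Minimal sketch of the multitasking library calls (names ours).
      EXTERNAL WORKER
      INTEGER ITASK(2), IGO, IDONE
      COMMON /SYNC/ IGO, IDONE
C     First word of the task control array gives its length.
      ITASK(1) = 2
C     Create a second task that begins executing subroutine WORKER.
      CALL TSKSTART (ITASK, WORKER)
C     Signal the worker to start, do this task's share of the work,
C     then wait for the worker's completion signal and reset it.
      CALL EVPOST (IGO)
C     ... this task's work goes here ...
      CALL EVWAIT (IDONE)
      CALL EVCLEAR (IDONE)
C     Wait for the created task to terminate.
      CALL TSKWAIT (ITASK)
      END

      SUBROUTINE WORKER
      INTEGER IGO, IDONE
      COMMON /SYNC/ IGO, IDONE
C     Wait for the go-ahead, consume it, do the work, and reply.
      CALL EVWAIT (IGO)
      CALL EVCLEAR (IGO)
C     ... the other task's work goes here ...
      CALL EVPOST (IDONE)
      END
.fi
.KE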
.SH
Granularity of Tasks
.PP
A number of factors influence the performance of an algorithm in multiprocessing.
These include the degree of parallelism, process synchronization overhead, load balancing, interprocessor memory contention, and the modifications needed to separate the parallel parts of an algorithm.
.PP
The size of the work performed in parallel, the granularity of the task, is the first critical factor that should be addressed in matching a parallel algorithm to an architecture.
By granularity, we mean the amount of time the cooperating tasks execute concurrently on related code between synchronization points.
The need to synchronize and to communicate before and after parallel work greatly affects the overall execution time of the program.
Since the processors have to wait for one another instead of doing useful computation, it is obviously better to minimize that overhead.
When segments of parallel code execute in vector mode, typically at ten to twenty times the speed of scalar mode, granularity becomes an even more important issue, since the communication mechanisms are implemented in scalar mode.
.PP
Granularity is also closely related to the degree of parallelism, which is defined to be the percentage of time spent in the parallel portion of the code.
Typically, a small granularity job means that parallelism occurs at an inner loop level (although not necessarily the innermost loop).
In this case, even the loop setup time in outer loops becomes significant, to say nothing of the frequent task synchronization required.
.PP
For the CRAY X-MP family, granularity on the order of milliseconds is considered large.
Many reports [1][4][5] have shown significant speedups for multitasking of large granularity problems on the CRAY X-MP-2.
For example, speedups of 1.8 to 1.9 were seen when two processors were used instead of one to run a particle-in-cell code, a weather forecasting model, and a three-dimensional seismic migration code.
For these problems, the multitasking library is used to implement parallel tasks.
The reported successes indicate that, for large granularity codes that have high degrees of parallelism, the payoff of doing multitasking on the CRAY X-MP-2 is very significant.
The use of the CRAY multitasking library to handle intertask communication proves to be very powerful.
It will be interesting to see, however, how far we can push the granularity down and still get a decent speedup.
As will be shown later, even when the library appears to be too coarse for small granularity tasks, the hardware capability still allows them to be handled efficiently.
For our purposes, we use matrix vector operations to test the machine behavior for small granularity tasks.
.SH
The Algorithms
.PP
We have chosen matrix vector operations for two reasons.
First, we can easily construct the standard algorithms of linear algebra out of these types of modules.
Second, matrix vector operations provide simple modules for parallel execution [2][3].
We will describe an implementation of standard LU factorization with partial pivoting for a general square matrix.
.PP
The algorithm can be described as having basically three distinct parts within a loop:
.KS
.sp
.in +1.
.I
do i = 1,n
.br
.in +1.
perform the i-th matrix vector product (forms part of L)
.br
search for a pivot and interchange
.br
perform the i-th vector matrix product (forms part of U)
.in -1.
.br
end
.in -1.
.KE
.R
.sp
The algorithm organizes storage so that the original matrix is overwritten with the factorization.
The amount of work required to perform the factorization is approximately @2/3 n sup 3 @ floating point operations (here we count both additions and multiplications as operations).
The factored matrix can then be used to solve systems of equations.
In order to maintain stability in the algorithm, a partial pivoting scheme is used.
This helps avoid problems with small divisors, which can cause inaccurate computations.
The pivot is chosen to be the element of largest absolute value in a column.
The method used is a simple variant of Crout reduction [6], but the algorithm has been expressed in terms of modules that are easy both to understand and to implement.
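.PP
To make the module concrete, the following Fortran sketch shows the kind of matrix vector kernel from which the loop above is built, computing the update @y ~=~ y ~+~ Mx@.
The routine name MXVP and its argument order are ours, not those of the actual implementation; running the inner loop over the rows keeps it a single long vector operation.
.KS
.nf
      SUBROUTINE MXVP (N1, N2, M, LDM, X, Y)
C     Sketch of a matrix vector product module:  Y <- Y + M*X,
C     where M is N1 by N2 with leading dimension LDM (names ours).
      INTEGER N1, N2, LDM, I, J
      REAL M(LDM,N2), X(N2), Y(N1)
      DO 20 J = 1, N2
C        Each column of M contributes X(J) times that column to Y;
C        the inner loop is a vector operation of length N1.
         DO 10 I = 1, N1
            Y(I) = Y(I) + X(J)*M(I,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END
.fi
.KE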
.PP
The algorithm described above allows for a number of alternatives for parallel implementation.
That is, the available processors could be assigned to each of the three steps, allowing for multiprocessing within each.
This avenue has not been investigated, since the pivoting does not take as much time as the other steps.
Instead, each processor is given the task of performing one of the matrix vector operations concurrently.
The algorithm can be easily modified so that each matrix vector operation is independent and can proceed in parallel.
The pivoting is handled by one of the two processors in a sequential fashion.
.PP
Depicted graphically, the processing of the matrix by the algorithm at the @i sup th @ stage would look like the following (the matrix factors overwrite the original matrix):
.KS
.sp 10
.KE
.PP
The algorithm is modified slightly to allow independent operations to be performed in parallel.
The modified algorithm in no way increases the number of operations or the complexity over the original algorithm; it simply rearranges the computation within the loop to expose independent operations.
The resulting modified algorithm is of the following form:
.sp
.KS
.in +1.
.I
perform the 1-st matrix vector product
.br
do i = 1,n-1
.in +1.
.br
search for a pivot and interchange
.br
perform the i-th vector matrix product
.br
perform the i+1-st matrix vector product
.in -1.
.br
end
.br
perform the n-th vector matrix product
.in -1.
.R
.KE
.sp
.PP
It is now easy to see how to partition the work between two processors: each matrix vector operation within an iteration is independent of the other.
The two cooperating tasks need to synchronize @2(n-1)@ times during the course of the calculation (once before a task is started and once after).
There is, however, a slight imbalance in the work of the two tasks.
At the @i sup th @ step, the amounts of work for the two tasks are @O((n-i+1)*i)@ and @O((n-i)*i)@ operations.
.PP
To perform the synchronization between tasks, we have used task control (TSKSTART) to start a second task before the LU decomposition routine is entered.
This task waits until it is directed to start up.
Event control (EVPOST, EVWAIT, and EVCLEAR) is used to start and synchronize the work within the algorithm.
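.PP
Schematically, the loop in the master task then takes the following form (a sketch only; the event names NEXT and DONE and the layout are ours).
The second task, started once with TSKSTART before the factorization routine is entered, loops waiting on NEXT, performs the (i+1)-st matrix vector product, and posts DONE.
.KS
.nf
C     Master task: one pass of the modified factorization loop.
C     NEXT and DONE are event variables shared with the second task.
      DO 30 I = 1, N-1
C        Search for a pivot and interchange (sequential).
C        Let the second task begin the (i+1)-st matrix vector product.
         CALL EVPOST (NEXT)
C        Perform the i-th vector matrix product in this task,
C        e.g. with a call to the matrix vector module sketched above.
C        Wait for the second task to finish, then reset its event.
         CALL EVWAIT (DONE)
         CALL EVCLEAR (DONE)
   30 CONTINUE
.fi
.KE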
.PP
Table 1 describes the performance of this implementation of LU decomposition.
The column labeled "Degradation from code change" reflects the loss of performance when the sequential algorithm was restructured to separate independent tasks.
The numbers were obtained by running the original algorithm and the modified algorithm, using no multitasking features, on a single processor and taking the percentage difference.
Measurements were also made of the total time spent in the parallel portion of the algorithm.
As expected, small order matrices consume a greater percentage of time in non-parallel parts than do larger matrices.
Even for matrices of order 50, however, over 75% of the time is used for parallel processing.
.sp
.KS
.ce
Table 1
.ce
Parallel Version of LU Decomposition
.sp
.TS
center;
c c c
c c c
n n n.
	Degradation from	Percentage of
n	code change	parallel code
_	_	_
50	6.5%	75.1%
100	4.9%	82.0%
300	2.0%	87.5%
600	0.7%	97.0%
.TE
.KE
.sp
We also measured the speedup of the algorithm when two processors were used.
Table 2 shows the comparison between the modified one-processor version (without the multitasking mechanism) and the multitasked two-processor version.
The column labeled "Mean granularity" is the average time spent in each of the matrix vector calls for that particular order of problem.
.sp
.KS
.ce
Table 2
.ce
Speedup of Algorithm on 2 Processors vs 1 Processor
.sp
.TS
center;
c c c c c
c c c c c
n n n n n.
n	Optimal	MT library calls	MT CAL calls	Mean granularity
_	_	_	_	_
50	1.44	0.86	1.39	40 @ mu @sec
100	1.57	1.05	1.54	66 @ mu @sec
300	1.77	1.60	1.79	250 @ mu @sec
600	1.91	1.80	1.86	800 @ mu @sec
.TE
.KE
.R
.sp
.PP
The "Optimal" column is an attempt to filter out the time required to synchronize the processors and the time introduced by memory contention caused by multitasking on the two processors.
These numbers were generated by running the program twice.
The first run was a sequential process, using no multitasking constructs.
In the second run the call to the shorter of the two parallel tasks was removed; thus, in some sense, this is the best situation (the calculation, of course, does not produce the correct results, but it provides a good measure of the overhead).
The third column is the result of using the standard multitasking routines in the Multitasking (MT) library.
As an alternative, there are assembly language (CAL) routines that use the hardware instructions directly.
These CAL routines perform the minimum synchronization functions required by the code and are faster than the general-purpose multitasking library routines.*
.FS
* The MT library routines keep track of additional information on the activities of other tasks.
They are better suited to larger and more numerous tasks and are quite flexible for different programming styles.
For this particular small granularity job that information is not needed; hence the CAL mechanism is more suitable.
The overhead is O(1) clocks for the CAL version and O(100) to O(1000) clocks for the multitasking library, depending on the mechanism used.
.FE
.PP
With this algorithm the work partition is well matched to two processors.
The overhead of multitasking is essentially wiped out as the problem size increases, and for small problems it is not a great penalty.
In the limit, the implementation comes very close to attaining the optimal performance.
.PP
We now focus on another algorithm for dealing with a system of equations where the matrix is symmetric and positive definite: Cholesky factorization.
The algorithm can again be described in terms of matrix vector operations, but in this case, because of symmetry, only half the matrix is referenced.
.PP
As before, we can graphically describe the algorithm at the @i sup th @ stage as follows:
.KS
.sp 10
.KE
In this case, the algorithm does not divide naturally into two parts as the previous algorithm did.
To distribute the work between the processors, we take the naive approach of simply splitting the matrix vector operation in half, letting one processor take the left half and the other processor the right half, and then combining the two parts in a sequential step.
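.PP
In outline, the split of one matrix vector product @y ~=~ y ~+~ Mx@ between the two tasks looks as follows (a sketch with our own variable names; W is the extra work array discussed below, and MXVP is the module sketched earlier).
.KS
.nf
C     Split Y <- Y + M*X over the columns of M between two tasks.
      NHALF = N2/2
C     Task 1 accumulates the left half of the columns into Y.
      CALL MXVP (N1, NHALF, M(1,1), LDM, X(1), Y)
C     Task 2, running concurrently, accumulates the right half into
C     a private work array W, which must first be set to zero.
      DO 35 I = 1, N1
         W(I) = 0.0
   35 CONTINUE
      CALL MXVP (N1, N2-NHALF, M(1,NHALF+1), LDM, X(NHALF+1), W)
C     After the tasks synchronize, the halves are combined serially.
      DO 40 I = 1, N1
         Y(I) = Y(I) + W(I)
   40 CONTINUE
.fi
.KE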
Table 3 shows the results.
.sp
.KS
.ce
Table 3
.ce
Parallel Version of Cholesky Decomposition
.sp
.TS
center;
c c c
c c c
n n n.
	Degradation from	Percentage of
n	code change	parallel code
_	_	_
50	33.1%	80.2%
100	24.2%	85.4%
300	8.0%	91.9%
600	3.5%	96.1%
.TE
.KE
.sp
As before, the percentage given here is for the modified code before the multitasking mechanism is put in.
The degradation in code performance is the result of the additional subroutine calls needed to perform the matrix vector product and the fact that one more work array has to be initialized and added to the other half in each iteration.
In Table 4, the modified one-processor version (without the multitasking mechanism) is compared against the multitasked two-processor version.
.sp
.KS
.ce
Table 4
.ce
Speedup of Algorithm on 2 Processors vs 1 Processor
.sp
.TS
center;
c c c c c
c c c c c
n n n n n.
n	Optimal	MT library calls	MT CAL calls	Mean granularity
_	_	_	_	_
50	1.67	0.76	1.56	30 @ mu @sec
100	1.74	0.94	1.54	45 @ mu @sec
300	1.85	1.52	1.85	130 @ mu @sec
600	1.93	1.80	1.92	400 @ mu @sec
.TE
.sp
.KE
.R
As in the previous example, the improvement is substantial when two processors are used to partition the work and perform the task.
For large orders, the parallel program reaches the optimal rate of speedup.
This experiment is also relevant to the case where there are more than two processors: the matrix vector operation would then be split across the matrix in a similar fashion, perhaps going to a block matrix scheme to achieve the desired number of parallel tasks.
We expect to observe the same trend when we can perform a similar experiment with more processors.
.PP
Note that there is a certain amount of fluctuation between runs on the CRAY X-MP, depending on background activity.
The numbers presented here should therefore be given a 1-3% tolerance.
.sp
.SH
Conclusions
.PP
The multitasking concept on the CRAY X-MP-2 has been shown to be advantageous in solving problems with relatively large granularities (that is, when there is more than one millisecond of computation that can be performed in parallel between synchronization points).
.PP
For problems with small granularity but a reasonable degree of parallelism that can be exploited, at least from the standpoint of linear algebra solvers where the granularity is on the order of microseconds, the situation can be handled just as efficiently.
In general, the main sources of overhead (other than synchronization, load imbalance, and code change) are memory contention and operating system services.
The speedup figures of the examples presented here show that the interference from these two factors is insignificant.
Our experience demonstrates that multitasking with small granularity jobs is very promising on the CRAY X-MP-2.
.bp
.SH
References
.sp
.IP [1]
S.C. Chen,
.I
Large-scale and High-Speed Multiprocessor System for Scientific Applications \(em CRAY X-MP-2 Series,
.R
NATO Advanced Research Workshop on High Speed Computation, Nuclear Research Center, Julich, West Germany, June 1983.
.sp
.IP [2]
J.J. Dongarra and R. Hiromoto,
.I
A Collection of Parallel Linear Equations Routines for the Denelcor HEP,
.R
ANL/MCS-TM-15, September 1983.
.sp
.IP [3]
J.J. Dongarra and S.C. Eisenstat,
.I
Squeezing the Most out of an Algorithm in Cray Fortran,
.R
ANL/MCS-TM-9, May 1983.
.sp
.IP [4]
C.C. Hsiung and W. Butscher,
.I
A New Numerical Seismic 3-D Migration Model on the CRAY X-MP,
.R
SIAM Conference on Parallel Processing and Scientific Computing, Norfolk, VA, 1983.
.sp
.IP [5]
J.L. Larson,
.I
An Introduction to Multitasking on the CRAY X-MP-2 Multiprocessor,
.R
to appear in IEEE Computer.
.sp
.IP [6]
G.W. Stewart,
.I
Introduction to Matrix Computations,
.R
Academic Press, New York, 1973.
.sp
.IP [7]
P.M. Johnson,
.I
An Introduction to Vector Processing,
.R
Computer Design, Vol. 17, No. 2, pp. 289-97, February 1978.