.ds HT "\h'-.65m'\v'-.40m'^\v'.40m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nrEP 1 .EQ delim @@ .EN .EQ define newt 'B sup{-1} g' define blambda 'B ~+~ lambda I' define balpha 'B ~+~ alpha I' define indent '~~~~~~' define sqr ' down 40 size +5 \(sq ' define hat '\*(HT' define tilde '\*(TD' define times '\(mu' define ctimes '\*(CT' define notin '\*(NI' define indent2 '~~~~~~~~~~~~~' define inter ' [^ lambda sub L ^,^ lambda sub U ^] ' .EN .nr VS 14 .TL Performance and Library Issues .nl for Mathematical Software .nl on High Performance Computers .AU J.J. Dongarra and D.C. Sorensen .AI Mathematics and Computer Science Division Argonne National Laboratory Argonne Illinois 60439 .FS Work supported in part by the Applied Mathematical Sciences Research Program (KC-04-02) of the Office of Energy Research of the U.S. Department of Energy under Contract W-31-109-Eng-38. .FE .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper discusses some of the fundamental issues facing designers of mathematical software libraries for medium scale parallel processors such as the CRAY X-MP-4 and the Denelcor HEP. We discuss the problems that arise with performance and demonstrate that it may be appropriate to exploit parallelism at all levels of the program, not just at the highest level. We give performance measurements indicating the efficiency of a linear algebra library written in terms of a few high level modules. These modules chosen at the matrix vector level extend the concept of the BLAS [13] and provide enough computational granularity to allow efficient implementations on a wide variety of architectures. Only three modules must be recoded for efficiency in order to transport the library to various machines. We report experience on machines as diverse as the CRAY X-MP and the Denelcor HEP. Finally, we report on some special algorithms for the HEP which take advantage of the fine grain parallelism capabilities. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .NH Introduction. .PP This paper will focus upon some fundamental issues that must be faced if we are to provide mathematical software for the high performance computers and for medium scale parallel systems (2-50 processors) in general. Our experience with the CRAY X-MP and Denelcor HEP has been invaluable in providing insight into what these issues might be. While the experience we shall report here is primarily concerned with the X-MP and HEP, we believe there is much within this experience that will carry over to vector/parallel processors that may appear in the near future. .PP Our discussion will not cover software for massively parallel systems. We shall consider algorithms from linear algebra that are not dependent upon relationship between the number of processors and the size of the problem. These algorithms will be of two varieties: Conversion of standard libraries to versions that are transportable to a wide variety of architectures and specialized algorithms that take advantage of the tightly coupled synchronization that is offered on the Denelcor HEP. .PP We shall touch upon issues involved with parallelism that require restructuring of software libraries to shield the user safely from any machine intrinsics required to take advantage of parallelism from within the libraries. Moreover, we shall go into some detail concerning performance of the HEP. In particular we shall demonstrate that it is appropriate to exploit parallelism at all levels of a program, and not just at the highest level. 
.NH Vector Architectures .PP The current generation of vector computers exploits several advanced concepts to enhance their performance over conventional computers: .sp .ls 1 .in +5. Fast cycle time, .sp Vector instructions to reduce the number of instructions interpreted, .sp Pipelining to utilize a functional unit fully and to deliver one result per cycle, .sp Chaining to overlap functional unit execution, and .sp Overlapping to execute more than one independent vector instruction concurrently. .in 0 .sp .PP Current vector computers typically provide for "simultaneous" execution of a number of elementwise operations through pipelining. Pipelining generally takes the approach of splitting the function to be performed into smaller pieces or stages and allocating separate hardware to each of these stages. Pipelining is analogous to an industrial assembly line where a product moves through a sequence of stations. Each station carries out one step in the manufacturing process, and each of the stations works simultaneously on different units in different phases of completion. With pipelining, the functional units (floating point adder, floating point multiplier, etc.) can be divided into several suboperations that must be carried out in sequence. A pipelined functional unit, then, is divided into stages, each of which does a portion of the work in, say, one clock period. At the end of each clock period, partial results are passed to the next stage and partial results are accepted from the previous stage. .PP The goal of pipelined functional units is clearly performance. After some initial startup time, which depends on the number of stages (called the length of the pipeline, or pipe length), the functional unit can turn out one result per clock period as long as a new pair of operands is supplied to the first stage every clock period. Thus, the rate is independent of the length of the pipeline and depends only on the rate at which operands are fed into the pipeline. Therefore, if two vectors of length @k@ are to be added, and if the floating point adder requires 3 clock periods to complete, it would take @3 ~+~ k@ clock periods to add the two vectors together, as opposed to @3~*~k@ clock periods in a conventional computer. .PP Another feature that is used to achieve high rates of execution is chaining. @Chaining@ is a technique whereby the output register of one vector instruction is the same as one of the input registers for the next vector instruction. If the instructions use separate functional units, the hardware will start the second vector operation during the clock period when the first result from the first operation is just leaving its functional unit. A copy of the result is forwarded directly to the second functional unit and the first execution of the second vector is started. The net result is that the execution of both vector operations takes only the second functional unit startup time longer than the first vector operation. The effect is that of having a new instruction which performs the combined operations of the two functional units that have been chained together. On the CRAY in addition to the arithmetic operations, vector loads from memory to vector registers can be chained with other arithmetic operations. .PP It is also possible to overlap operations if the two operations are independent. 
If an addition and an independent multiplication operation are to be processed, the execution of the second independent operation would begin one cycle after the first operation has started.
.PP
The key to utilizing a high performance computer effectively is to avoid unnecessary memory references. In most computers, data flows from memory into and out of registers, and from registers into and out of functional units, which perform the given instructions on the data. Performance of algorithms can be dominated by the amount of memory traffic, rather than the number of floating point operations involved. The movement of data between memory and registers can be as costly as arithmetic operations on the data. This provides considerable motivation to restructure existing algorithms and to devise new algorithms that minimize data movement.
.PP
Many of the algorithms in linear algebra can be expressed in terms of a SAXPY operation: @y ~<-~y ^+^ alpha x@, @i.e.@ adding a multiple @alpha@ of a vector @x@ to another vector @y@. This would result in three vector memory references for each two vector floating point operations. If this operation comprises the body of an inner loop which updates the same vector @y@ many times, then a considerable amount of unnecessary data movement will occur. Usually, a SAXPY occurring in an inner loop will indicate that the algorithm may be recast in terms of some matrix vector operation, such as @y ~ <-~ y + M * x @, which is just a sequence of SAXPYs involving the columns of the matrix @M@ and the corresponding components of the vector @x@. The advantage of this is that the @y@ vector and the columns of @M@ have a fixed length throughout. This makes it relatively easy to automatically recognize that only the columns of @M@ need be moved into registers while accumulating the result @y@ in a vector register, avoiding two of the three memory references in the innermost loop (a short Fortran sketch of this column-sweep form is given below). This also allows chaining to occur on vector machines, and results in a factor of three increase in performance on the CRAY 1. The cost of the algorithm in these cases is not determined by floating point operations, but by memory references. As we shall see later, this recasting also facilitates retargeting the algorithm to a completely different parallel architecture by rearranging the way the matrix vector operations are done.
.PP
On machines similar to the CRAY, in which the memory cycle time is longer than the processor cycle time, there exists the possibility of a memory bank conflict. The operation of loading elements of a vector from memory into a vector register can be pipelined, so that after some initial startup following a vector load, vector elements arrive in the register at the rate of one element per cycle. To keep vector operands streaming at this rate, sequential elements are stored in different banks or areas in an "interleaved" memory. After an access, a memory bank requires a certain amount of time before another reference can be made. This delay time is referred to as the "memory bank cycle time." This memory cycle time on both the CRAY 1 and the Cyber 205 is four processor cycles. If the same memory bank is accessed before it has a chance to quiesce or cycle, chaining operations will stop. Instead of delivering one result per cycle, the machine will deliver one result per functional unit time. In other words, it will operate at scalar speeds, seriously degrading performance. Note, too, that the Cyber 205 must gather data that are not contiguous in memory before it can begin computing.
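.PP
To make the recasting concrete, the following is a rolled Fortran sketch of the column-sweep form of @y ~<-~ y ~+~ Mx@ referred to above. The routine name echoes the SMXPY module mentioned later in the paper, but the calling sequence and coding shown here are meant only to illustrate the structure, not to serve as a tuned implementation.
.sp
.KS
.nf
      SUBROUTINE SMXPY (N1, Y, N2, LDM, X, M)
C     Column-sweep form of  y <- y + M*x,  where M is N1 by N2 with
C     leading dimension LDM.  The inner loop is a SAXPY on column J
C     of M, so the result y can be held in a vector register across
C     the whole J loop; only the columns of M stream in from memory.
      INTEGER N1, N2, LDM
      REAL Y(*), X(*), M(LDM,*)
      INTEGER I, J
      DO 20 J = 1, N2
         DO 10 I = 1, N1
            Y(I) = Y(I) + X(J)*M(I,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END
.fi
.KE
.sp
Written this way, the load of a column of @M@ can be chained with the multiply and add, which is the source of the performance gain described above.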
.PP
Programs that properly use all of the features mentioned above will fully exploit the potential of a vector machine. These features, when used to varying degrees, give rise to three basic modes of execution: scalar, vector, and super-vector [9]. To provide a feeling for the difference, we give the following table of execution rates on a CRAY 1:
.sp
.KS
.TS
center tab(;);
c | c.
Mode of Execution;Rate of Execution
_;_
Scalar;0 - 10 MFLOPS
Vector;10 - 50 MFLOPS
Super-Vector;50 - 160 MFLOPS
.TE
.KE
.sp
These rates represent, more or less, the upper end of their range. We define the term MFLOPS to be a rate of execution representing millions of floating point operations (additions or multiplications) performed per second.
.PP
The basic difference between scalar and vector performance is the use of vector instructions. The difference between vector and super-vector performance hinges upon avoiding unnecessary movement of data between vector registers and memory. The CRAY 1 is limited in the sense that there is only one path between memory and the vector registers. This creates a bottleneck if a program loads a vector from memory, performs some arithmetic operations, and then stores the results. While the load and arithmetic can proceed simultaneously as a chained operation, the store is not started until that chained operation is fully completed.
.PP
Most algorithms in linear algebra can be easily vectorized. However, to gain the most out of a machine like the CRAY 1, such vectorization is usually not enough. In order to achieve top performance, the scope of the vectorization must be expanded to facilitate chaining and minimization of data movement in addition to utilizing vector operations. Recasting the algorithms in terms of matrix vector operations makes it easy for a vectorizing compiler to achieve these goals. This is primarily due to the fact that the results of the operation can be retained in a register and need not be stored back to memory, thus eliminating the bottleneck. Moreover, when the compiler is not successful, it is reasonable to hand tune these operations, perhaps in assembly language, since there are so few of them and since they involve simple operations on regular data structures. When the process is extended to cover the next level, matrix-vector operations, the generated code can perform at super-vector levels.
.NH
Performance and Transportability
.PP
We are interested in examining the performance of linear algebra algorithms on large-scale scientific vector processors and on emerging parallel processors. In many applications, linear algebra calculations consume large quantities of computer time. If substantial improvements can be found for the linear algebra part, a significant reduction in the overall execution time will be realized. We are motivated to look for alternative formulations of standard algorithms, as implemented in software packages such as LINPACK [2] and EISPACK [10,16], because of their wide usage in general and their poor performance on vector computers. We are also motivated to restructure in a way that will allow these packages to be easily transported to new computers of radically different design. If this can be accomplished without serious loss of efficiency, there are many advantages.
In this section we report on some experience with restructuring these linear algebra packages in terms of the high level modules proposed in [3]. This experience verifies that performance increases are achieved on vector machines and that this modular approach offers a viable solution to the transportability issue. This restructuring often improves performance on conventional computers and does not degrade the performance on any computer we are aware of.
.PP
Both of the packages have been designed in a portable, robust fashion, so they will run in any Fortran environment. The routines in LINPACK even go so far as to use a set of vector subprograms called the BLAS [13] to carry out most of the arithmetic operations. The EISPACK routines do not explicitly make reference to vector routines, but the routines contain a high proportion of vector operations, which most vectorizing compilers detect. The routines from both packages should be well suited for execution on vector computers. As we shall see, however, the Fortran programs from LINPACK and EISPACK do not attain the highest execution rate possible on a CRAY 1 [10]. While these programs exhibit a high degree of vectorization, the construction which leads to super-vector performance is in most cases not present. We will examine how the algorithms can be constructed and modified to enhance performance without sacrificing clarity or resorting to assembly language.
.PP
To give a feeling for the difference between various computers, both vector and conventional, a timing study on many different computers for the solution of a @100 times 100 @ system of equations has been carried out [1]. The LINPACK routines were used in the solution without modification.
.sp 2
.KS
.B
.ce
Solving a System of Linear Equations
.sp
.ce
with LINPACK @ nothing sup a @ in Full Precision @ nothing sup b @
.sp
.R
.ps 10
.TS
center tab(;);
l l c c c c
l l c c c c
l l c c c c.
Computer;OS/Compiler@ nothing sup c @;Ratio@ nothing sup d @;MFLOPS@ nothing sup e @;Time secs
_
CRAY X-MP-1;CFT(Coded BLAS);.36;33;.021
CDC Cyber 205;FTN(Coded BLAS);.48;25;.027
CRAY 1S;CFT(Coded BLAS);.54;23;.030
CRAY X-MP-1;CFT(Rolled BLAS);.57;21;.032
Fujitsu VP-200;Fortran 77(Comp directive);.64;19;.040
Fujitsu VP-200;Fortran 77(Rolled BLAS);.72;17;.040
Hitachi S-810/20;FORT77/HAP(Rolled BLAS);.74;17;.042
CRAY 1S;CFT(Rolled BLAS);1;12;.056
CDC Cyber 205;FTN(Rolled BLAS);1.5;8.4;.082
NAS 9060 w/VPF;VS opt=2(Coded BLAS);1.8;6.8;.101
.TE
.KE
.sp
.nr PS 8
.ps 8
.PP
@ nothing sup a @ LINPACK routines @SGEFA@ and @SGESL@ were used for single precision and routines @DGEFA@ and @DGESL@ were used for double precision. These routines perform standard @LU@ decomposition with partial pivoting and backsubstitution.
.PP
@ nothing sup b @\f2Full Precision\f1 implies the use of (approximately) 64 bit arithmetic, e.g. CDC single precision or IBM double precision.
.PP
@ nothing sup c @\f2OS/Compiler\f1 refers to the operating system and compiler used; (Coded BLAS) refers to the use of assembly language coding of the BLAS; (Rolled BLAS) refers to a Fortran version with single-statement, simple loops; and \f2Comp Directive\f1 refers to the use of compiler directives to set the maximum vector length.
.PP
@ nothing sup d @\f2Ratio\f1 is the number of times faster or slower a particular machine configuration is when compared to the CRAY 1S using a Fortran coding for the BLAS.
.PP
@ nothing sup e @\f2MFLOPS\f1 is a rate of execution, the number of million floating point operations completed per second.
For solving a system of @n@ equations, approximately @ 2/3 n sup 3 ~+~ 2 n sup 2 @ operations are performed (we count both additions and multiplications). For @n ~=~ 100@ this is roughly 687,000 operations, so, for example, the CRAY X-MP-1 time of .021 seconds in the table corresponds to a rate of about 33 MFLOPS.
.nr PS 11
.ps 11
.sp
.PP
The LINPACK routines used to generate the timings in the previous table do not reflect the true performance of "high performance computers". A different implementation of the solution of linear equations, presented in a report by Dongarra and Eisenstat [4], better describes the performance on such machines. That algorithm is based on matrix-vector operations rather than just vector operations. This restructuring allowed the various compilers to take advantage of the features described in Section 2. It is important to note that the numerical properties of the algorithm have not been altered by this restructuring. The number of floating point operations required and the roundoff errors produced by both algorithms are exactly the same; only the way in which the matrix elements are accessed is different. As before, a Fortran program was run and the time to complete the solution of equations for a matrix of order 300 is reported.
.PP
Note that these numbers are for a problem of order 300 and all runs are for full precision.
.ps
.sp 2
.KS
.B
.ce
Solving a System of Linear Equations
.ce
Using the Vector Unrolling Technique
.sp 2
.R
.ps 10
.TS
center tab(;);
l l c c c
l l c c c
l l c c c.
Computer;OS/Compiler@ nothing sup c @;MFLOPS@ nothing sup e @;Time secs
_
CRAY X-MP-4 @"" sup \(sc @;CFT(Coded ISAMAX);356;.051
CRAY X-MP-2 @"" sup \(dd @;CFT(Coded ISAMAX);257;.076
Fujitsu VP-200;Fortran 77(Comp directive);220;.083
Fujitsu VP-200;Fortran 77;183;.099
CRAY X-MP-2 @"" sup \(dd @;CFT;161;.113
Hitachi S-810/20;FORT77/HAP;158;.115
CRAY X-MP-1 @"" sup \(dg @;CFT(Coded ISAMAX);134;.136
CRAY X-MP-1 @"" sup \(dg @;CFT;106;.172
CRAY 1-M;CFT(Coded ISAMAX);83;.215
CRAY 1-S;CFT(Coded ISAMAX);76;.236
CRAY 1-M;CFT;69;.259
CRAY 1-S;CFT;66;.273
CDC Cyber 205;ftn 200 opt=1;31;.59
NAS 9060 w/VPF;VS opt=2(Coded BLAS);9.7;1.9
.TE
.sp
.KE
.nr PS 8
.ps 8
.I Comments:
.R
.in +2.
@"" sup \(sc @ These timings are for four processors with manual changes to use parallel features.
.br
@"" sup \(dd @ These timings are for two processors with manual changes to use parallel features.
.br
@"" sup \(dg @ These timings are for one processor.
.nr PS 11
.ps 11
.sp
.PP
Similar techniques of recasting matrix decomposition algorithms in terms of matrix vector operations have provided significant improvements in the performance of algorithms for the eigenvalue problem. In a paper by Dongarra, Kaufman and Hammarling [7], many of the routines in the EISPACK collection have been restructured to use matrix vector primitives, resulting in improvements by a factor of two to three in performance over the standard implementation on vector computers such as the CRAY X-MP, Hitachi S-810/20, and Fujitsu VP-200.
.PP
Using the matrix vector operations as primitives in constructing algorithms can also play an important role in achieving performance on multiprocessor systems with minimal recoding effort. Again, the recoding is restricted to the relatively simple modules, and the numerical properties of the algorithms are not altered as the codes are retargeted for a new machine. This feature takes on added importance as the complexity of the algorithms reaches the level required for some of the more difficult eigenvalue calculations. A number of factors influence the performance of an algorithm in multiprocessing.
These include the degree of parallelism, process synchronization overhead, load balancing, interprocessor memory contention, and modifications needed to separate the parallel parts of an algorithm.
.PP
A parallel algorithm must partition the work to be done into tasks or processes which can execute concurrently in order to exploit the computational advantages offered by a parallel computer. These cooperating processes usually have to communicate with each other, for example to claim a unique identifier or to follow data dependency rules. This communication takes place at synchronization points within the instruction streams defining the process. The amount of work, in terms of the number of instructions that may be performed between synchronization points, is referred to as the granularity of a task. The need to synchronize and to communicate before and after parallel work will greatly impact the overall execution time of the program. Since processors that are waiting for one another are not doing useful computation, it is obviously better to minimize that overhead. In the situation where segments of parallel code are executing in vector mode, typically at ten to twenty times the speed of scalar mode, granularity becomes an even more important issue, since communication mechanisms are implemented in scalar mode.
.PP
Granularity is also closely related to the degree of parallelism, which is defined to be the percentage of time spent in the parallel portion of the code. Typically, a small granularity job means that parallelism occurs at inner loop levels (although not necessarily the innermost loop). In this case, even the loop setup time in outer loops becomes significant, to say nothing of the frequent need for task synchronization.
.PP
Matrix vector operations offer the proper level of modularity for achieving both performance and transportability across a wide range of computer architectures. Evidence has already been given for a variety of vector architectures. We shall present further evidence in the following sections concerning their suitability for parallel architectures. In addition to the computational evidence, there are several reasons which support the use of these modules. We can easily construct the standard algorithms in linear algebra out of these types of modules. The matrix vector operations are simple and yet encompass enough computation that they can be vectorized and also parallelized at a reasonable level of granularity [5,8]. Finally, these modules can be constructed in such a way that they hide all of the machine specific intrinsics required to invoke parallel computation, thereby shielding the user from any changes to the library which are machine specific.
.NH
The Denelcor HEP Computer.
.PP
The Denelcor HEP is the first commercially available machine of the MIMD variety. It provides a way for multiple instruction streams to operate on multiple data streams concurrently. It supports fine grained parallel computations, and is quite different from the vector computers discussed earlier. The fully configured computing system offered by Denelcor consists of up to 16 processing elements (PEMs) sharing a large global memory through a crossbar switch. Within a single PEM, parallelism is achieved through pipelining independent serial instruction streams called processes. The principal pipeline, which handles the numerical and logical operations, consists of synchronous functional units that have been segmented into an eight stage pipe.
The storage functional unit (SFU) operates asynchronously from this execution pipeline.
.PP
There are four types of memory: data, program, register, and constant memory. Each word of this memory is 64 bits together with an extra special purpose bit used to mark the memory location as full or empty. Processes are synchronized through marking memory and register locations with the full or empty property. This means that a process requesting access to a memory location may be blocked until that location is marked full by another process. Such suspension of a process only takes place in the SFU and therefore does not interfere with other processes that are ready to execute instructions. A very useful diagram for understanding this mechanism and the architecture of the HEP has been provided by H. Jordan [12] and appears in Figure 1.
.sp
---------- figure 1 ----------
.sp
.PP
This architecture allows a fairly clean introduction of parallel constructs into the FORTRAN 77 language through two simple extensions. The first of these is the introduction of asynchronous variables and the second is the CREATE statement. An asynchronous variable is defined by putting a $ character in front of a standard FORTRAN variable name. The type of the variable defaults according to the first character after the $. The $ indicates that the full-empty bit is significant. When an asynchronous variable appears on the left of an assignment statement it must be empty before that statement can be executed, and when it appears on the right of an assignment statement the variable must be full before the statement can be executed. In either case flow of execution for the process is suspended until the appropriate full-empty conditions are met. This provides a straightforward data activated communication mechanism between concurrent processes. In HEP FORTRAN processes are invoked through the CREATE statement. The usage of the CREATE statement is similar to a subroutine CALL, but control does not pass to the subroutine. Instead a parallel process is spawned that operates independently from the parent process (a short schematic example is given below). These intrinsics will change form when the new UNIX based operating system HEP/UPX becomes available, so that no formal extensions to the FORTRAN language will be required. However, the same mechanisms will be available in the new version.
.NH
HEP Performance.
.PP
To analyze HEP performance it is sufficient in most situations to concentrate on the task queue (see Figure 1). A programmer of this machine must have two goals in mind. The first is to create enough processes to fill the pipelines sufficiently so that an instruction is issued from the task queue every machine cycle (100 nanoseconds). Once this has been accomplished it is important to ensure that the compiler generated instructions have been optimized so as to reduce duplication of indexing operations and unnecessary references to variables. Careful management of data movement operations to and from register memory is the key to achieving high performance on this pipelined MIMD architecture. While vector instructions are not implemented in hardware, it is possible to construct such vector operations in software. This is significant in two respects. First, it demonstrates that SIMD constructs may be recovered in this architecture. Second, similar techniques are expected to be applicable to speeding up certain basic operations that do not vectorize.
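.PP
Before turning to that construction, the following schematic fragment shows the two FORTRAN extensions of Section 4 in the simplest setting: a driver spawns a fixed set of worker processes with CREATE and then hands them units of work through a single asynchronous variable. The fragment is a sketch of the mechanism, not code from the library; the names are hypothetical, the asynchronous variables are assumed to be shared through COMMON and to start out empty, and a read of an asynchronous variable is assumed to mark it empty again.
.sp
.KS
.nf
      SUBROUTINE DRIVER (N, NPROC)
C     Sketch only.  Spawn NPROC worker processes, then hand out the
C     work indices 1,...,N one at a time through the asynchronous
C     variable $NEXT.  Each assignment to $NEXT waits until a worker
C     has emptied it by reading it.
      INTEGER N, NPROC, I, IDUM
      COMMON /SCHED/ $NEXT, $NDONE
      DO 10 I = 1, NPROC
         CREATE WORKER
   10 CONTINUE
      DO 20 I = 1, N
         $NEXT = I
   20 CONTINUE
C     A zero index tells a worker to quit; then wait for all of them.
      DO 30 I = 1, NPROC
         $NEXT = 0
   30 CONTINUE
      DO 40 I = 1, NPROC
         IDUM = $NDONE
   40 CONTINUE
      RETURN
      END
C
      SUBROUTINE WORKER
C     Claim the next index; reading $NEXT empties it, which releases
C     the driver to supply another one.
      INTEGER J
      COMMON /SCHED/ $NEXT, $NDONE
   10 J = $NEXT
      IF (J .EQ. 0) GO TO 20
C        ... perform the J-th unit of work here ...
      GO TO 10
   20 $NDONE = 1
      RETURN
      END
.fi
.KE
.sp
The same data activated hand-off reappears in Section 6, where a few permanently created processes serve all of the low-level requests issued by the library.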
.PP
To illustrate the idea of building vector operations in software, let us again consider the computation
.EQ (5.1)
y~<-~y~+~Mx ~,
.EN
where @y@ is a real vector of length @n sub 1@, @M@ is an @n sub 1 times n sub 2 @ real matrix, and @x@ is a real vector of length @n sub 2@. The vector @y@ is to be overwritten with the result @y~+~Mx@. This is a basic task in numerical linear algebra which has many applications. It has been proposed for inclusion in the extended BLAS [3] that will be needed to achieve efficiency and transportability across various parallel architectures.
.PP
To code the above operation in FORTRAN on the HEP, one might simply self-schedule the inner products
.EQ (5.2)
y(i) ~<-~y(i) ~+~ M(i,*)x(*) ~~~for~~ i ~=~ 1,2,...,n sub 1
.EN
in parallel. However, even though a substantial speedup could be achieved, the FORTRAN DO loop that would be used to express the inner product will be inefficient, since it introduces a large amount of integer arithmetic for subscript calculation at each array reference. A clever optimizing compiler might avoid this if the FORTRAN loop were unrolled, but this remains to be done. In fact, the data movement and indexing operations will completely dominate and hide any floating point work if the present HEP FORTRAN 77 compiler is used.
.PP
On a vector machine one would accumulate the desired result in @y@ by adding in appropriate multiples of the columns of @M@ in the following fashion
.EQ (5.3)
y(*) ~<-~y(*) ~+~ M(*,j)x(j) ~~~~for~~ j ~=~ 1,2,...,n sub 2 ~ .
.EN
This operation is inappropriate for an MIMD machine because one would have to synchronize the access to @y@ by each of the processes, and this would cause busy waiting to occur. One would do better to partition the vector @y@ and the rows of the matrix @M@ into blocks
.EQ (5.4)
left ( pile {y sub 1 above y sub 2 above . above . above . above y sub k } right ) ~=~
left ( pile {y sub 1 above y sub 2 above . above . above . above y sub k } right ) ~+~
left ( pile {M sub 1 above M sub 2 above . above . above . above M sub k } right ) x ~
.EN
and self-schedule individual vector operations on each of the blocks in parallel:
.EQ (5.5)
y sub i ~<-~ y sub i ~+~ M sub i x ~~~~for ~~ i ~=~ 1,2,...,k ~.
.EN
If a vector operation such as (5.3) were available to compute each of the blocks in (5.5), there would be nothing left to do. Synchronization costs at the outer loop would be reduced from the requirements of (5.2), and we would have very efficient vector operations at the inner loop level.
.PP
As mentioned previously, there is no vector operation in hardware on the HEP. However, it is possible to construct such a vector operation in software that effectively uses the pipelined arithmetic functional units. One way to achieve this is to set up a contiguous block of registers designated as a buffer vector-register and another block of registers designated as a result vector-register. If the length of the buffer is @l@, then we want @k@ in (5.5) to satisfy @n sub 1 ~=~ (k-1)l~+~ r@, with @y sub i @ of length @l@ for @i~<~ k@ and @y sub k @ of length @r@. Then for each @i@, @y sub i@ is loaded into the result vector-register and then the columns of @M sub i@ are successively loaded into the buffer vector-register, multiplied by the appropriate component of @x@, and added to the result vector-register. Since storage of the result register is deferred until the computation is completed, data movement is minimal.
.PP
This may be described in the following pseudo-code:
.sp2
.KS
@indent2 mark program~matvec(ldm,n,M,x,y)@
.sp1
@lineup comment: ~~ldm~the~leading~dimension~of~M~is~needed~for~column~ indexing@;
.sp1
@lineup begin @
@lineup comment: ~~initialize~the~result~register@;
@lineup result ~<-~ y@;
@lineup "for"~j ~=~ 1 ~step~1~until ~n ~ do@
.sp1
@lineup indent begin @
@lineup indent comment:~~load~the~next~column~of~M~into~buffer@;
@lineup indent buffer ~<-~M(*,j)@;
.sp1
@lineup indent comment:~~multiply~by~x(j)~"and"~add~"to"~result~register@;
@lineup indent buffer ~<-~ buffer size -2 {times} x(j)@;
@lineup indent result ~<-~ result ~+~ buffer@;
@lineup indent end @
.sp1
@lineup comment:~~ store~result~into~y~before~exit@;
@lineup y ~<-~ result@;
@lineup end@.
.sp2
.KE
It is possible to accomplish this in assembly language by unrolling the loops and calculating the index increments at assembly time. This is facilitated by the three address instruction set for data transfer between Data memory and Register memory, and for binary arithmetic operations that allow indexing of registers.
.sp
----- figure 2 -----
.sp
.PP
The performance improvement over FORTRAN is quite dramatic. We illustrate this in Figure 2. Note that nearly 6 megaflops were achieved with this technique on a machine capable of 10 million instructions per second. A factor of 10 decrease in the basic cycle time would provide a 4-PEM system with the capability of over 200 megaflops without any other modifications to the basic design. When this is combined with the opportunity to express parallelism at the so called "outer loop" level and with the ability to multi-task large jobs that split into obvious independent chunks of computation, the HEP architecture appears to be very attractive indeed. However, some basic problems are introduced as we attempt to take advantage of parallelism within library tasks.
.NH
Library Issues.
.PP
We would like to illustrate these difficulties with a specific example and offer some remedies that have proved successful in this setting. The library we shall discuss is a modified version of LINPACK [2] that has been designed for both transportability and efficiency across a wide variety of computing systems including serial, vector, and MIMD architectures. This effort represents a prototype of a library structuring scheme that promises to provide a reasonable compromise between the opposing goals of portability and performance. The ultimate goal of this research is to provide a suitable framework for restructuring existing software libraries that will port efficiently to advanced computer architectures as they emerge.
.PP
The algorithms implemented within this library are based on standard procedures in linear algebra including LU, QR, and Cholesky decompositions. They have been reformulated in terms of a few relatively high level constructs such as matrix-vector operations. This concept is not new; LINPACK was written in terms of a set of low level constructs called the Basic Linear Algebra Subprograms (BLAS). The BLAS operations are simple vector operations such as an inner product, the addition of a scalar multiple of one vector to another, and vector scaling. The BLAS are well suited for operations that occur on some of the vector processors, but they are not the best choice for certain other vector processors, multiprocessors, or parallel processing computers [4]. The next level up from simple vector operations consists of matrix-vector operations. These include such operations as a matrix times a vector and a rank one change to a matrix.
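.PP
As a concrete example of a module at this level, the following rolled Fortran sketch performs the rank one change @M ~<-~ M ~+~ x y sup T@ column by column, so that each pass through the inner loop is itself a simple vector operation. The routine name and coding are ours and are meant only to show the shape of such a module.
.sp
.KS
.nf
      SUBROUTINE SRANK1 (N1, N2, LDM, M, X, Y)
C     Rank one update  M <- M + x*y**T,  where M is N1 by N2 with
C     leading dimension LDM.  Column J of M is updated by the
C     multiple Y(J) of the vector x, a vector operation of length N1.
      INTEGER N1, N2, LDM
      REAL M(LDM,*), X(*), Y(*)
      INTEGER I, J
      DO 20 J = 1, N2
         DO 10 I = 1, N1
            M(I,J) = M(I,J) + X(I)*Y(J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END
.fi
.KE
.sp
On a vector machine each column update vectorizes directly; on a multiprocessor, disjoint blocks of the columns (or of the rows) of @M@ can be handled by separate processes with no synchronization on the result.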
.PP
Of course these matrix vector operations can be coded in terms of the BLAS. However, efficiency may be gained by coding them directly, because they provide enough computational granularity to make use of the more powerful capabilities of the most advanced architectures. In most cases performance is vastly improved through the ability to accumulate results in vector registers and mask data movement between main memory and register memory with floating point operations. This technique was discussed in detail in Section 5 for SMXPY on the HEP, and it is described in [4] for the CRAY machines.
.PP
The module concept has already proven very successful in transporting these LINPACK routines across various architectures. To date it has been run on a variety of machines as reported by Dongarra in [1]. In particular these codes have achieved super-vector speeds on all of the existing CRAY machines. Moreover, near optimal speedups on both the 2- and 4-processor CRAY X-MP have been achieved. Results are reported in the next section. It is important that these modules are able to accommodate both parallel and vector machines of such variety. It is equally important that so few modules are required in order to produce LINPACK. All of this suggests that the modules above are indeed at the right level of granularity.
.PP
Designing the algorithms in terms of such operations is the hard part of an implementation. This can require considerable rethinking of the design of the algorithm. However, restructuring the algorithms in terms of these modules is more than just a challenging exercise. It tends to unify the structure of various algorithms much more than the use of the BLAS. Moreover, the module concept allows us to avoid locking the structure of the algorithms into one computer's architecture.
.PP
Given the algorithms described in terms of these high level modules, we are able to produce a version targeted for a new machine, such as the HEP, by replacing only three modules: matrix-vector multiplication @( y ~=~ y ~+~ Mx)@, vector-matrix multiplication @( y sup T ~=~ y sup T ~+~ x sup T M)@, and a rank one update to a matrix @(M ~=~ M ~+~ x y sup T )@. These modules represent a high level of granularity in the algorithm in the sense that they are based on matrix-vector operations, @O(n sup 2 )@ work, not just vector operations, @O(n)@ work.
.PP
Several tricky issues present themselves as one considers implementing a library of subroutines that take advantage of parallelism through the high level modules. The most important issue that arises with the user's freedom to create processes is that unless great care is taken with the design and structure of the library, the user will have to be concerned with the possibility of invoking the library from parallel branches of his code and unwittingly overwhelming the system resources. Nevertheless, the implementors of such a library must strive to conceal machine dependencies as much as possible from the user. Also, in the case of transporting existing libraries to the HEP one wishes to preserve the interfaces to avoid causing past users to modify their code unnecessarily. This desire is at first glance at odds with the HEP. If we are to invoke parallelism at the level described above, then it would appear that the user must be conscious of the number of parallel tasks throughout his program.
This is the result of limitations on some systems on the total number of processes allowed to be created. Should the library routines be called from multiple branches of a parallel program, the user could inadvertently attempt to create many more processes than are allowed.
.PP
A solution to this dilemma is to construct a scheduler that is capable of serving numerous requests from a few processes. For the HEP this scheduler may take the form of several parallel processes which are capable of performing all of the low-level requests coming from the library. Each of these processes polls an asynchronous variable which, when full, indicates work to be done. The work process assumes the appropriate identity that is indicated and performs the requested computation. These work processes are created at the outset and exist throughout the total computation. The service requests are data activated as described above. This is important on multi-PEM systems, as the overhead for issuing intertask creates is substantial.
.NH
Performance of the Modules.
.PP
In Figure 3 the performance of the LU decomposition with these high performance modules is shown and compared to the corresponding performance with the modules written in FORTRAN (both parallel and non-parallel versions).
.sp
------- figure 3 ------
.sp
The transportability of these codes has also been demonstrated. The modules have been coded for efficiency on CRAY 1 and CRAY X-MP machines as well as many others. A partial list for vector machines was shown in Section 3. The performance of linear algebra routines written in terms of the modules appears to be very satisfactory both from a transportability standpoint and from an efficiency standpoint. The performance is near optimal both on the CRAY and on the HEP machines in comparison with the best specially coded implementations of the standard algorithms. We turn now to special algorithms that are tuned specifically to the HEP.
.NH
Special Algorithms for the HEP
.PP
The above techniques do achieve a level of transportability along with very respectable performance. However, the technique employed above to exploit parallelism involved a @fork-join@ construct in the lowest level routines. Although process creates were avoided, parallel operations were invoked only intermittently with each submatrix during the reduction process. Algorithms that avoid this fork-join synchronization and make a better attempt at keeping the processors busy at all times should make better use of the machine.
.PP
To find such algorithms for linear algebra tasks one naturally turns to the data-flow type algorithms. The HEP architecture and FORTRAN language extensions will support very fine grained synchronization. This may take place at the binary operation level that one traditionally associates with data-flow algorithms. However, it is more profitable to seek a larger grain of computation that is more compatible with the synchronization overhead.
.PP
Our example involves the Givens method for obtaining the @QR@ factorization of a matrix. This algorithm has been a favorite among data-flow and systolic array algorithmists because a basic operation is fine grained (multiply a 2-vector by a 2 times 2 matrix) and the procedure is not complicated by pivoting. The serial variant of the Givens method that we consider is as follows. Given a real @m times n@ matrix @A@, the goal of the algorithm is to apply elementary plane rotations @G sub ij@ which are constructed to zero out the @ji sup th @ element of the matrix @A@.
Such a matrix may be thought of as a @2 times 2@ orthogonal matrix of the form
.EQ
G ~=~ left (^ pile { gamma above - sigma} ~pile {sigma above gamma } ^ right )
.EN
where @sigma sup 2 ~+~ gamma sup 2 ~=~ 1@. If
.EQ
left ( pile {alpha above beta} ~ pile { a sup T above b sup T } ^ right )
.EN
represents a @2 times n @ matrix, then a zero can be introduced into the @beta@ position by left multiplication by @G@ through proper choice of @ gamma @ and @ sigma @. When embedded in the @n times n @ identity, the matrix @ G sub ij@ is of the form
.EQ
G sub ij ~=~ I ~+~ D sub ij
.EN
where all elements of @D sub ij@ are zero with the possible exception of the @ii^,^ ij^,^ ji^,^@ and @jj@ entries. The matrices @ G sub ij@ are used to reduce @A@ to upper triangular form in the following order.
.EQ
left ( G sub {n-1,n} ^.^.^.^G sub 2n G sub 1n right ) left ( G sub n-2,n-1 ^.^.^.^ G sub 2,n-1 G sub 1,n-1 right ) ^.^.^.^ left ( G sub 23 G sub 13 right ) left ( G sub 12 right ) A ~=~ R~.~
.EN
The order of the zeroing pattern may be seen in the following @5 times 5@ example
.EQ
pile {times above ctimes sub 1 above ctimes sub 2 above ctimes sub 4 above ctimes sub 7} ~~~~
pile {times above times above ctimes sub 3 above ctimes sub 5 above ctimes sub 8} ~~~~
pile {times above times above times above ctimes sub 6 above ctimes sub 9} ~~~~
pile { times above times above times above times above ctimes sub 10} ~~~~
pile { times above times above times above times above times} ~~~~
.EN
This order is important if one wishes to "pipeline" the row reduction process. This pipelining may be achieved by expressing @R@ as a linear array in packed form by rows and then dividing this linear array into equal length pipeline segments. A process is responsible for claiming an unreduced row of the original matrix and reducing it to zero by combining it with the existing @R@ matrix using Givens transformations. A new row may enter the pipe immediately after the row ahead has been processed in the first segment. Each row proceeds one behind the other until the entire matrix has been processed. However, these rows cannot be allowed to get out of order once they have entered the pipe because of data dependencies. This is accomplished using an array of asynchronous variables, one protecting access to each segment of the pipe. A process gains entry to the next segment by emptying the corresponding asynchronous variable before releasing the ownership of the segment currently occupied. Granularity may be adjusted to hide the cost of this synchronization by simply adjusting the length of a segment. Segment boundaries do not correspond to row boundaries in @R@. Further details are given in [8]. The pipelining scheme and the synchronization mechanism have been extended in a very natural way to the @QR@ factorization of a sparse matrix [11].
.PP
This algorithm has been coded for the HEP, and the computational results were somewhat surprising. In fact these results provided the motivation to look more closely at the machine performance and resulted in the work described in Section 5. The results shown in Figure 4 compare the Householder method from the modified LINPACK library using FORTRAN modules with a FORTRAN version of the pipelined Givens method.
.sp
----- fig 4 -----
.sp
It turns out that in FORTRAN the Givens method is twice as fast as the Householder method on the HEP. This is contrary to the performance predicted by a complexity analysis based solely upon floating point operations. The reason for this is that memory references (i.e., load and store operations) dominate floating point operations on this machine, and the Givens method requires half the memory references that Householder's method requires.
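.PP
For reference, the basic step that a process applies as it combines its claimed row with one row of @R@ can be sketched in Fortran as follows. This is the textbook form of the rotation, with @sigma sup 2 ~+~ gamma sup 2 ~=~ 1@ as above; it is not the packed, segmented code described in [8], and the routine name is ours.
.sp
.KS
.nf
      SUBROUTINE GIVROW (N, K, R, LDR, W)
C     Rotate row K of the triangular factor R against the work row W
C     so that W(K) is annihilated.  Entries W(1),...,W(K-1) are
C     assumed to have been zeroed by the previous K-1 steps.
      INTEGER N, K, LDR
      REAL R(LDR,*), W(*)
      REAL ALPHA, BETA, DENOM, GAMMA, SIGMA, T
      INTEGER J
      ALPHA = R(K,K)
      BETA  = W(K)
      DENOM = SQRT(ALPHA**2 + BETA**2)
      IF (DENOM .EQ. 0.0) RETURN
      GAMMA = ALPHA/DENOM
      SIGMA = BETA/DENOM
      DO 10 J = K, N
         T      = GAMMA*R(K,J) + SIGMA*W(J)
         W(J)   = -SIGMA*R(K,J) + GAMMA*W(J)
         R(K,J) = T
   10 CONTINUE
      RETURN
      END
.fi
.KE
.sp
Reducing one claimed row of @A@ amounts to applying this step for @k ~=~ 1,2,...@ as the row moves through the segments of the packed @R@ array, with the asynchronous variables described above regulating entry to each segment.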
.NH
Conclusions
.PP
As multiprocessor designs proliferate, research efforts should focus on "generic" algorithms that can be easily transported across various architectures. If a code has been written in terms of high level synchronization and data management primitives that are expected to be supported by every member of the model of computation, then these primitives only need to be customized to a particular realization. A very high level of transportability may be achieved through automating the transformation of these primitives. The benefit to software maintenance, particularly for large codes, is in the isolation of synchronization and data management peculiarities.
.PP
This desire for portability is often at odds with the need to efficiently exploit the capabilities of a particular architecture. Nevertheless, algorithms should not be intrinsically designed for a particular machine. One should be prepared to give up a marginal amount of efficiency in exchange for reduced manpower requirements to use and maintain software. There are many possibilities for implementation of the general ideas that are briefly described above. We are certainly not in a position to recommend a particular implementation with any degree of finality. However, we already have experience indicating the feasibility of both of the approaches discussed here. We believe that a high level of transportability as described above can be achieved without seriously degrading potential performance. We would like to encourage others to consider the challenge of producing transportable software that will be efficient on these new machines.
.sp
.B References
.sp
.IP [1]
J.J. Dongarra,
.I
Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment,
.R
Argonne National Laboratory Report MCS-TM-23, (updated August 1984).
.sp
.IP [2]
J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart,
.I
LINPACK Users' Guide,
.R
SIAM Publications, Philadelphia, 1979.
.sp
.IP [3]
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson,
.I
A Proposal for an Extended Set of Fortran Basic Linear Algebra Subprograms,
.R
MCS/TM 41, (December, 1984), to appear in ACM SIGNUM Jan. 1985.
.sp
.R
.IP [4]
J.J. Dongarra and S.C. Eisenstat,
.I
Squeezing the Most out of an Algorithm in CRAY Fortran,
.R
ACM Trans. Math. Software, Vol. 10, No. 3, 1984.
.sp
.IP [5]
J.J. Dongarra and R.E. Hiromoto,
.I
A Collection of Parallel Linear Equations Routines for the Denelcor HEP,
.R
Parallel Computing, Vol. 1 No. 2, 1984.
.sp
.IP [6]
J.J. Dongarra, E.L. Lusk, R.A. Overbeek, B.T. Smith, and D.C. Sorensen,
.I
New Directions in Software for Advanced Computer Architectures,
.R
Argonne National Laboratory Report MCS/TM 32 (August, 1984).
.sp
.IP [7]
.R
J.J. Dongarra, L. Kaufman, and S. Hammarling,
.I
Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers,
.R
ANL MCS-TM 46, January 1985.
.sp
.IP [8]
J.J. Dongarra, A.H. Sameh, and D.C. Sorensen,
"Some Implementations of the QR Factorization on an MIMD Machine",
.R
ANL/MCS-TM-25, October 1984, to appear in Parallel Computing.
.sp
.IP [9]
K. Fong and T.L. Jordan,
.I
Some Linear Algebra Algorithms and Their Performance on CRAY 1,
.R
Los Alamos Scientific Laboratory, UC-32, June 1977.
.sp
.IP [10]
B.S. Garbow, J.M. Boyle, J.J. Dongarra, and C.B.
Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977. .sp .IP [11] M.T. Heath and D.C. Sorensen, .I A Pipelined Givens Method for Computing the QR Factorization of a Sparse Matrix, .R Proceedings of the Workshop on Parallel Processing using the HEP, Ed. S. Lakshmivarahan, University of Oklahoma, March 1985. .sp .IP [12] H.F. Jordan, .I An Overview of HEP Architecture, Programming, and Performance .R to appear in book Ed. J.S. Kowalik, Parallel MIMD Computation: The HEP Supercomputer and its Applications (1984). .sp .IP [13] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, .I Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Software, .R 5 (1979), 308-371. .sp .IP [14] E. Lusk and R. Overbeek, "Implementation of Monitors with Macros: A Programming Aid for the HEP and Other Parallel Processors", .R ANL-83-97, (1983). .sp .IP [15] E. Lusk and R. Overbeek, "An Approach to Programming Multiprocessing Algorithms on the Denelcor HEP", .R ANL-83-96, (1983). .sp .IP [16] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd Edition, Springer-Verlag, Berlin, 1976. .sp .IP [17] D.C. Sorensen, .I Buffering for Vector Performance on a Pipelined MIMD Machine, .R Argonne National Laboratory Report MCS-TM-29 (April 1984). .sp .