.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 An Extended Set of Fortran Basic Linear Algebra Subprograms .ps 11 .sp 3 Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 41 (Revision 2) .sp .7i January, 1986 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim @@ .EN .TL .ps 12 .in 0 An Extended Set of Fortran\h'.35i' .sp Basic Linear Algebra Subprograms\h'.35i' .AU .ps 11 .in 0 Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' .AU .ps 11 .in 0 Jeremy Du Croz\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Sven Hammarling\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Richard J. Hanson\h'.20i' .AI .ps 10 .in 0 Applied Math 2646 Sandia National Laboratory\h'.20i' Albuquerque, New Mexico 87185\h'.20i' .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. 
Department of Energy, under Contract W-31-109-Eng-38. Typeset on \*(DY. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH 1. Introduction .PP In 1973 Hanson, Krogh, and Lawson wrote an article in the SIGNUM Newsletter (Vol. 8, no. 4, page 16) describing the advantages of adopting a set of basic routines for problems in linear algebra. The original basic linear algebra subprograms, now commonly referred to as the BLAS and fully described in Lawson, Hanson, Kincaid, and Krogh [7,8], have been very successful and have been used in a wide range of software including LINPACK [4] and many of the algorithms published by the ACM Transactions on Mathematical Software. In particular they are an aid to clarity, portability, modularity and maintenance of software and they have become a \f2de facto \f1standard for the elementary vector operations. An excellent discussion of the .I raison d' \*^etre .R of the BLAS is given in Dodson and Lewis [1]. .PP Special versions of the BLAS, in some cases machine code versions, have been implemented on a number of computers, thus improving the efficiency of the BLAS. However, with some of the modern machine architectures, the use of the BLAS is not the best way to improve the efficiency of higher level codes. On vector machines, for example, one needs to optimize at least at the level of matrix-vector operations in order to approach the potential efficiency of the machine (see [2 and 3]); and the use of the BLAS inhibits this optimization because they hide the matrix-vector nature of the operations from the compiler. 
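.PP
To make the point concrete, the following sketch (ours, in Python rather than Fortran, with hypothetical names) shows how a matrix-vector product y := alpha*A*x + beta*y decomposes into a sequence of Level 1 AXPY operations; only when both loops are packaged in a single routine can an implementor (or compiler) optimize across them.

```python
def axpy(a, x, y):
    """Level 1 style operation: y := a*x + y for vectors x and y."""
    for i in range(len(y)):
        y[i] += a * x[i]

def gemv(alpha, a, x, beta, y):
    """Level 2 style operation: y := alpha*A*x + beta*y, with A stored
    as a list of rows.  Internally this is one AXPY per column of A,
    but the caller sees a single matrix-vector operation, so the
    matrix-vector structure is not hidden from the implementation."""
    m, n = len(a), len(a[0])
    for i in range(m):
        y[i] *= beta
    for j in range(n):
        col = [a[i][j] for i in range(m)]   # j-th column of A
        axpy(alpha * x[j], col, y)
```

This is a sketch of the semantics only; an efficient implementation would avoid forming each column as a temporary list.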
.PP We believe that the time is right to propose the specifications of an additional set of BLAS designed for matrix-vector operations. It has been our experience that a small set of matrix-vector operations occur frequently in the implementation of many of the most common algorithms in linear algebra. We define here the basic operations for that set, together with the naming conventions and the calling sequences. Routines at this level should provide a reasonable compromise between the sometimes conflicting aims of efficiency and modularity and it is our hope that efficient implementations will become available on a wide range of computer architectures. .PP In this paper we shall refer to the existing BLAS of Lawson et al. as ``Level 1 BLAS'' or ``Existing BLAS'', and the proposed new set as ``Level 2 BLAS'' or ``Extended BLAS''. The Level 2 BLAS involve @O(mn )@ scalar operations where @m@ and @n@ are the dimensions of the matrix involved. These could be programmed by a series of calls to the Level 1 BLAS, though we do not recommend that they be implemented in that way. Hence, in a natural sense, the Level 2 BLAS are performing basic operations at one level higher than the Level 1 BLAS. .PP We will make available a complete set of Level 2 BLAS in Fortran 77 so that software developers without access to specific implementations can make use of them. We also plan to develop a test program so that implementations of the extended BLAS can be thoroughly tested before being distributed. We intend to submit the test program and the Fortran 77 version of the routines for publication as an ACM algorithm. .SH 2. 
Scope of the Extended BLAS .PP We propose that the following three types of basic operation be performed by the extended BLAS: a) Matrix-vector products of the form .in +.5i .EQ I y ~<-~ alpha Ax ~+~ beta y ,~~ y ~<-~ alpha A ~ sup T x ~+~ beta y ,~~ and~ ~ y ~<-~ alpha A bar ~ sup T x ~+~ beta y .EN where @ alpha @ and @ beta @ are scalars, @x@ and @y@ are vectors and @A@ is a matrix, and .EQ I x ~<-~ Tx ,~ ~ x ~<-~ T ~ sup T x,~~ and~ ~ x ~<-~ T bar ~ sup T x, .EN where @x@ is a vector and @T@ is an upper or lower triangular matrix. .in -.5i b) Rank-one and rank-two updates of the form .in +.5i .EQ I A ~<-~ alpha xy ~ sup T ~+~ A ,~~ A ~<-~ alpha x y bar ~ sup T ~+~ A ,~~ H ~<-~ alpha x x bar ~ sup T ~+~ H ,~~and ~ H ~<-~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ H , .EN where @~H~@ is a Hermitian matrix. .in -.5i c) Solution of triangular equations of the form .in +.5i .EQ I x ~<-~ T ~ sup -1 x , ~x ~<-~ T ~ sup -T x , ~~and~ ~x ~<-~ T bar ~ sup -T x , .EN where @T@ is a non-singular upper or lower triangular matrix. .in -.5i .PP We propose that, where appropriate, the operations be applied to general, general band, Hermitian, Hermitian band, triangular, and triangular band matrices in both real and complex arithmetic, and in single and double precision. .PP An example illustrating the implementation of the routines is given in Appendix A. .PP In Appendix C we propose some additional routines in which the vectors @x@ and/or @y@ have a higher precision than the matrix, so that on machines with extended precision registers the extra internal precision is not all discarded on return from the routine. We propose these routines as an appendix solely because it is not possible, even in single precision, to specify a complete set within the confines of ANSI Fortran 77. The only case that can be realized is where the matrix is real single precision and the vector is real double precision. .SH 3. 
Naming Conventions
.PP
The proposed name of a Level 2 BLAS routine is in the LINPACK style and consists of five characters, the last of which may be blank.
The fourth and fifth characters in the name denote the type of operation, as follows:
.KS
.TS
center;
l l.
MV	- Matrix-vector product
R	- Rank-one update
R2	- Rank-two update
SV	- Solve a system of linear equations
.TE
.KE
Characters two and three in the name denote the kind of matrix involved, as follows:
.KS
.TS
center;
l l.
GE	General matrix
GB	General band matrix
HE	Hermitian matrix
SY	Symmetric matrix
HP	Hermitian matrix stored in packed form
SP	Symmetric matrix stored in packed form
HB	Hermitian band matrix
SB	Symmetric band matrix
TR	Triangular matrix
TP	Triangular matrix in packed form
TB	Triangular band matrix
.TE
.KE
The first character in the name denotes the Fortran data type of the matrix, as follows:
.TS
center;
l l.
S	REAL
D	DOUBLE PRECISION
C	COMPLEX
Z	COMPLEX*16 or DOUBLE COMPLEX (if available)
.TE
The proposed available combinations are indicated in the table below.
In the first column, under \f2complex\f1, the initial C may be replaced by Z.
In the second column, under \f2real\f1, the initial S may be replaced by D.
See Appendix D for the full subroutine calling sequences.
.PP
The proposed collection of routines can be thought of as being divided into four separate parts: \f2complex\f1, \f2real\f1, \f2double precision\f1, and \f2complex*16\f1.
Each part will be accompanied by a separate testing program.
The routines proposed are written in the ANSI Fortran 77 standard, with the exception of the routines that use COMPLEX*16 variables.
These routines are included for completeness and, because the Fortran standard does not provide for this variable type, may not be available on all machines.
.KS
.ce
Table 3.1
.TS
center;
c c c c c c.
\f2complex	real\f1	MV	R	R2	SV
CGE	SGE	*	*@nothing sup \(dg @
CGB	SGB	*
CHE	SSY	*	*	*
CHP	SSP	*	*	*
CHB	SSB	*
CTR	STR	*			*
CTP	STP	*			*
CTB	STB	*			*
.TE
.KE
.EQ
gsize 8
.EN
.FS
@size -1 {"" sup \(dg}@\|
For the general rank-1 update (GER) we propose two complex routines: CGERC for @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU for @ A ~<-~ alpha xy ~ sup T ~+~ A @.
This is the only exception to the one-to-one correspondence between real and complex routines.
See section 7 for further discussion.
.FE
.EQ
gsize 11
.EN
.PP
We do not propose routines for rank-one and rank-two updates applied to band matrices, because these can be obtained by calls to the rank-one and rank-two full matrix routines.
This is illustrated in Appendix B.
.SH
4. Parameter Conventions
.PP
We propose a convention for the parameter lists similar to that for the existing BLAS, but with extensions where comparable parameters are not present in the existing BLAS.
The proposed order of parameters is as follows:
a) Parameters specifying options.
b) Parameters defining the size of the matrix.
c) Input scalar.
d) Description of the input matrix.
e) Description of input vector(s).
f) Input scalar (associated with the input-output vector).
g) Description of the input-output vector.
h) Description of the input-output matrix.
Note that not every category is present in each of the routines.
.PP
The parameters that specify options are character parameters with the names TRANS, UPLO, and DIAG.
TRANS is used by the matrix-vector product routines as follows:
.KS
.TS
center;
l l.
Value	Meaning
_	_
`N'	Operate with the matrix.
`T'	Operate with the transpose of the matrix.
`C'	Operate with the conjugate transpose of the matrix.
.TE
.KE
In the real case the values `T' and `C' have the same meaning.
.PP
UPLO is used by the Hermitian, symmetric, and triangular matrix routines to specify whether the upper or lower triangle is being referenced, as follows:
.KS
.TS
center;
l l.
Value	Meaning
_	_
`U'	Upper triangle
`L'	Lower triangle
.TE
.KE
.PP
DIAG is used by the triangular matrix routines to specify whether or not the matrix is unit triangular, as follows:
.KS
.TS
center;
l l.
Value	Meaning
_	_
`U'	Unit triangular
`N'	Non-unit triangular
.TE
.KE
When DIAG is supplied as `U' the diagonal elements are not referenced.
.PP
It is worth noting that actual character arguments in Fortran may be longer than the corresponding dummy arguments, so that, for example, the value `T' for TRANS may be passed as `TRANSPOSE'.
.PP
The size of the matrix is determined by the two parameters M and N for an @m@ by @n@ rectangular matrix; by the parameters M, N, KL, and KU for an @m@ by @n@ band matrix with @kl@ sub-diagonals and \f2ku\f1 super-diagonals; and by the parameters N and K for an @n@ by @n@ symmetric, Hermitian, or triangular band matrix with \f2k\f1 super- and/or sub-diagonals.
.PP
The description of the matrix consists either of the array name (A) followed by the leading dimension of the array as declared in the calling (sub)program (LDA), when the matrix is being stored in a two-dimensional array; or the Fortran array name (AP) alone, when the matrix is being stored as a (packed) vector.
When A is not band, then in the former case the actual array must contain at least @(m ~+~ d(n-1))@ elements, where @d@ is the leading dimension of the array and @m~=~n@ for a square matrix; and in the latter case the actual array must contain at least @n(n+1)/2@ elements.
.PP
The scalars always have the dummy argument names ALPHA and BETA.
As with the existing BLAS, the description of a vector consists of the name of the array (X or Y) followed by the storage spacing (increment) in the array of the vector elements (INCX or INCY).
The increment is allowed to be negative, zero, or positive.
When the vector @x@ consists of @k@ elements, the corresponding actual array argument X must be of length at least (1 + (@k@-1)|INCX|).
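.PP
The length requirement above can be sketched explicitly. Following the Level 1 BLAS convention, for nonzero INCX the logical element x(i) is stored at position KX + (i-1)*INCX of the array X, where KX = 1 when INCX > 0 and KX = 1 - (k-1)*INCX when INCX < 0, so that a negative increment walks the array backwards. A Python sketch (the helper name is ours; indices are 0-based):

```python
def vector_indices(k, incx):
    """0-based array positions of the k logical vector elements
    x(1), ..., x(k) in the array X, for increment incx, following the
    Level 1 BLAS convention (negative increments run backwards)."""
    if incx == 0:
        return [0] * k          # every element maps to X(1)
    start = 0 if incx > 0 else -(k - 1) * incx
    return [start + i * incx for i in range(k)]

# The largest position used is (k-1)*abs(incx), so X needs at least
# 1 + (k-1)*abs(incx) elements, matching the requirement in the text.
```

For example, k = 4 with INCX = 2 touches positions 0, 2, 4, 6, so X must hold at least 1 + 3*2 = 7 elements.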
.sp
The following values of parameters are invalid:
.nf
.sp
    Any value of the character parameters DIAG, TRANS, or UPLO whose meaning is not specified.
    M < 0 for the GE and GB routines.
    N < 0 for all routines.
    KL < 0 or KL < M - N for the GB routines.
    KU < 0 or KU < N - M for the GB routines.
    K < 0 for the HB, SB, and TB routines.
    LDA < M for the GE routines.
    LDA < KL + KU + 1 for the GB routines.
    LDA < N for the HE, SY, and TR routines.
    LDA < K + 1 for the HB, SB, and TB routines.
.fi
Note that it is permissible to call the routines with M or N = 0, in which case the routines exit immediately without referencing their vector or matrix arguments.
For those routines that include the parameter BETA, when BETA is supplied as zero the array Y need not be set on input, so that operations such as @y ~ <- ~ alpha A x @ may be performed without initially setting @y@ to zero.
.SH
5. Storage Conventions
.PP
Unless otherwise stated it is assumed that matrices are stored conventionally in a two-dimensional array, with matrix element @a sub ij @ stored in array element A(I,J).
.PP
The routines for real symmetric and complex Hermitian matrices allow the matrix to be stored in either the upper (UPLO = `U') or lower triangle (UPLO = `L') of a two-dimensional array, or to be packed in a one-dimensional array.
In the latter case the upper triangle may be packed sequentially column by column (UPLO = `U'), or the lower triangle may be packed sequentially column by column (UPLO = `L').
Note that for real symmetric matrices packing the upper triangle by columns is equivalent to packing the lower triangle by rows, and packing the lower triangle by columns is equivalent to packing the upper triangle by rows.
(For complex Hermitian matrices the only difference is that the off-diagonal elements are conjugated.)
.PP
For triangular matrices the parameter UPLO serves to define whether the matrix is upper
.br
(UPLO = `U') or lower (UPLO = `L') triangular.
In the packed storage the triangle has to be packed by column. .PP The band matrix routines allow storage in the same style as with LINPACK, so that the @j sup th @ column of the matrix is stored in the @j sup th @ column of the Fortran array. For a general band matrix the diagonal of the matrix is stored in the @ku + 1 sup st @ row of the array. For a Hermitian or symmetric matrix either the upper triangle (UPLO = `U') may be stored in which case the leading diagonal is in the @k+1 sup st @ row of the array, or the lower triangle (UPLO = `L') may be stored in which case the leading diagonal is in the first row of the array. For an upper triangular band matrix (UPLO = `U') the leading diagonal is in the @k+1 sup st @ row of the array and for a lower triangular band matrix (UPLO = `L') the leading diagonal is in the first row. .PP For a Hermitian matrix the imaginary parts of the diagonal elements are of course zero and thus the imaginary parts of the corresponding Fortran array elements need not be set, but are assumed to be zero. In the R and R2 routines these imaginary parts will be set to zero on return. .PP For packed triangular matrices the same storage layout is used whether or not DIAG = `U' (diagonal elements are assumed to have the value 1), i.e. space is left for the diagonal elements even if those array elements are not referenced. .SH 6. 
Specification of the Extended BLAS
.PP
Types and dimensions for the variables occurring in the subroutine specifications are as follows:
.KS
.nf
    INTEGER INCX, INCY, K, KL, KU, LDA, M, N
    CHARACTER*1 DIAG, TRANS, UPLO
.fi
.KE
For routines whose first letter is an S:
.nf
    REAL ALPHA, BETA
    REAL X(*), Y(*)
    REAL A(LDA,*)
    REAL AP(*)
.fi
For routines whose first letter is a D:
.nf
    DOUBLE PRECISION ALPHA, BETA
    DOUBLE PRECISION X(*), Y(*)
    DOUBLE PRECISION A(LDA,*)
    DOUBLE PRECISION AP(*)
.fi
For routines whose first letter is a C:
.nf
    COMPLEX ALPHA
    COMPLEX BETA
    COMPLEX X(*), Y(*)
    COMPLEX A(LDA,*)
    COMPLEX AP(*)
.fi
except that for CHER and CHPR the scalar @alpha@ is real, so that the first declaration above is replaced by:
.nf
    REAL ALPHA
.fi
For routines whose first letter is a Z:
.nf
    COMPLEX*16 ALPHA        or  DOUBLE COMPLEX ALPHA
    COMPLEX*16 BETA         or  DOUBLE COMPLEX BETA
    COMPLEX*16 X(*), Y(*)   or  DOUBLE COMPLEX X(*), Y(*)
    COMPLEX*16 A(LDA,*)     or  DOUBLE COMPLEX A(LDA,*)
    COMPLEX*16 AP(*)        or  DOUBLE COMPLEX AP(*)
.fi
except that for ZHER and ZHPR the first declaration above is replaced by:
.nf
    DOUBLE PRECISION ALPHA
.fi
The generic names, parameter lists, and specifications for the extended BLAS now follow.
Refer to Table 3.1 for the specific subroutine names.
.sp
a) General Matrix-Vector Products
.nf
for a general matrix:
    GEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
for a general band matrix:
    GBMV ( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
.fi
Operation:
.sp
if @TRANS~=~@ `N',@~~~~~y ~ <- ~ alpha Ax ~+~ beta y @
.sp
if @TRANS~=~@ `T',@ ~~~~y ~ <- ~ alpha A ~ sup T x ~+~ beta y@
.sp
if @TRANS~=~@ `C',@ ~~~~y ~ <- ~ alpha A bar ~ sup T x ~+~ beta y @.
.sp
b) Symmetric or Hermitian Matrix-Vector Products
.nf
for a symmetric or Hermitian matrix:
    SYMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
    HEMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
for a symmetric or Hermitian matrix in packed storage:
    SPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )
    HPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )
for a symmetric or Hermitian band matrix:
    SBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
    HBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
.fi
Operation:
.EQ I
y ~ <- ~ alpha Ax ~+~ beta y ~.
.EN
c) Triangular Matrix-Vector Products
.nf
for a triangular matrix:
    TRMV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )
for a triangular matrix in packed storage:
    TPMV ( UPLO, TRANS, DIAG, N, AP, X, INCX )
for a triangular band matrix:
    TBMV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )
.fi
Operation:
.sp
if @TRANS~=~@ `N', @~~~~x ~ <- ~ Tx @
.sp
if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup T x@
.sp
if @TRANS~=~@ `C', @~~~~x ~ <- ~ T bar ~ sup T x@
.sp
d) Triangular Equation Solvers
.nf
for a triangular matrix:
    TRSV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )
for a triangular matrix in packed storage:
    TPSV ( UPLO, TRANS, DIAG, N, AP, X, INCX )
for a triangular band matrix:
    TBSV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )
.fi
Operation:
.sp
if @TRANS~=~@ `N', @~~~~x ~ <- ~ T ~ sup -1 x@
.sp
if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup -T x @
.sp
if @TRANS~=~@ `C', @~~~~x ~ <- ~ T bar ~ sup -T x ~ @.
.sp
e) General Rank-1 Updates
.nf
for a general matrix:
    GER_ ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )
.fi
For real matrices, SGER performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @.
For complex matrices, CGERC performs the operation @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @.
f) Symmetric or Hermitian Rank-1 Updates
.nf
for a symmetric or Hermitian matrix:
    SYR ( UPLO, N, ALPHA, X, INCX, A, LDA )
    HER ( UPLO, N, ALPHA, X, INCX, A, LDA )
for a symmetric or Hermitian matrix in packed storage:
    SPR ( UPLO, N, ALPHA, X, INCX, AP )
    HPR ( UPLO, N, ALPHA, X, INCX, AP )
.fi
Operation:
.EQ I
A ~<- ~ alpha x x bar ~ sup T ~+~ A
.EN
For real symmetric matrices this is simply
.EQ I
A ~<- ~ alpha x x ~ sup T ~+~ A ~.
.EN
g) Symmetric or Hermitian Rank-2 Updates
.nf
for a symmetric or Hermitian matrix:
    SYR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
    HER2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
for a symmetric or Hermitian matrix in packed storage:
    SPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
    HPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
.fi
Operation:
.EQ I
A ~<- ~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ A
.EN
For real symmetric matrices this is simply
.EQ I
A ~<- ~ alpha x y ~ sup T ~+~ alpha y x ~ sup T ~+~ A ~.
.EN
.SH
7. Rationale
.PP
The three basic matrix-vector operations chosen (Section 2) were obvious candidates because they occur in a wide range of linear algebra applications, and they occur at the innermost level of many algorithms.
The hard decision was to restrict the scope to these operations only, since there are many other potential candidates, such as matrix scaling and sequences of plane rotations.
Similarly, we could have extended the scope by applying the operations to other types of matrices, such as complex symmetric or augmented band matrices.
We have aimed at a reasonable compromise between a much larger number of routines each performing one type of operation (e.g. @x ~<-~ L ~ sup -T x@) and a smaller number of routines with a more complicated set of options.
There are, in fact, in each precision 16 real routines performing altogether 43 different operations, and 17 complex routines performing 58 different operations.
.PP We feel that to extend the scope further would significantly reduce the chances of having the routines implemented efficiently over a wide range of machines, because it would place too heavy a burden on implementors. On the other hand, to restrict the scope further would place too narrow a limit on the potential applications of the level 2 BLAS. .PP The parameters @ alpha @ and @ beta@ are included in the non-triangular routines to give extra flexibility, but we recommend that implementors consider special code for the cases where @ alpha ~or ~beta~=~ 1.0 @ and @ alpha ~or~ beta ~=~ -1.0 @. Similarly, as with the level 1 BLAS, we have included an increment parameter with the vectors so that a vector could, for example, be a row of a matrix. But again we recommend that implementors consider special code for the case where the increments are unity. Note that the implementation must be such that when @ beta ~ = ~ 0.0 @ then @y@ need not be defined on input. .PP As noted earlier, corresponding to the real routine SGER we propose two complex routines CGERC (for @A ~<-~ alpha x y bar ~ sup T ~+~ A@) and CGERU (for @A ~<-~ alpha x y ~ sup T ~+~ A@). Both are frequently required. An alternative would be to provide a single complex routine CGER with an option parameter; however this parameter would have become redundant in the real routine SGER. Rather than have redundant parameters, or different parameter lists for the real and complex routines, we have chosen two distinct complex routines; they are analogous to the level 1 BLAS CDOTC (@c ~<- ~ c ~+~ x bar ~ sup T y@) and CDOTU (@c ~<- ~ c ~+~ x ~ sup T y@). .PP Note that no check has been included for singularity, and near singularity, in the triangular equation solving routines. The requirements for such a test depend upon the application and so we felt that this should not be included, but should instead be performed before calling the triangular solver. 
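.PP
The absence of a singularity test can be seen in a sketch (ours, in Python, for the full-storage, lower triangular, TRANS = `N' case) of the solve operation: a zero diagonal element simply produces a division error, so any required check must be made before the call.

```python
def trsv_lower(a, x):
    """x := T**(-1)*x by forward substitution, where T is the lower
    triangle of the square array a (a list of rows).  As in the
    proposal, no test for singularity or near-singularity is made:
    a zero diagonal element raises ZeroDivisionError in the caller."""
    n = len(x)
    for i in range(n):
        s = x[i]
        for j in range(i):
            s -= a[i][j] * x[j]
        x[i] = s / a[i][i]
```

A caller who needs protection would scan the diagonal (and, for near-singularity, estimate a condition number) before invoking the solver.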
.PP On certain machines, which do not use the ASCII sequence on all of their Fortran systems, lower case characters may not exist, so that the innocent looking argument \f2`t'\f1, passed through the parameter TRANS for designating a transposed matrix, is not in the Fortran character set. Some UNIVAC systems do not have a lower case representation using the `field data' character set. On the CDC NOS-2 system, a mechanism is provided for a full 128 ASCII character set by using pairs of 6-bit host characters for certain 7-bit ASCII characters. This means that there is a `2 for 1' physical extension of the logical records that contain lower case letters. This fact can hamper portability of codes written on ASCII machines that are later moved to CDC systems. The only safe way to proceed is to convert the transported text entirely into the Fortran character set. On the other hand we believe that users on ASCII character set systems may wish to treat upper and lower case letters as equivalent in meaning. If this is done, it means that text that will be transported to machines of unknown types must have the ASCII set mapped into the Fortran character set before the text is moved. .PP The band storage scheme used by the GB, HB, SB, and TB routines has columns of the matrix stored in columns of the array, and diagonals of the matrix stored in rows of the array. This is the storage scheme used by LINPACK. An alternative scheme (used in some EISPACK [6,9] routines) has rows of the matrix stored in rows of the array, and diagonals of the matrix stored in columns of the array. The latter scheme has the advantage that a band matrix-vector product of the form @ y ~<- ~ alpha Ax ~+~ beta y @ can be computed using long vectors (the diagonals of the matrix) stored in contiguous elements, and hence is much more efficient on some machines (e.g. CDC Cyber 205) than the first scheme. 
However, other computations involving band matrices, such as @x ~<-~ Tx@, @x ~<-~ T ~ sup -1 x@, and @LU@ and @U ~ sup T U @ factorization, cannot be organized `by diagonals'; instead the computation sweeps along the band, and the LINPACK storage scheme has the advantage of reducing the number of page swaps and allowing contiguous vectors (the columns of the matrix) to be used.
.PP
Although not discussed here, we plan to provide a portable testing package for each of the four parts of this extended BLAS package.
The test package will be self-contained, generating test data and checking for conformity with this proposal in a portable fashion.
.PP
We considered the possibility of generalizing the rank-1 and rank-2 updates to rank-k updates.
Rank-k updates with @k ~>~ 1@ (but @k ~<<~n@) can achieve significantly better performance on some machines than rank-1 updates [5].
But to take advantage of this usually requires complicating the calling algorithm; and moreover rank-k updates with @k ~ approx ~n@ would allow an even higher-level operation, such as matrix multiplication, in by the back door.
We prefer to keep to a clean concept of genuine matrix-vector operations.
.SH
8. Acknowledgements
.PP
An earlier draft of this proposal was discussed at the Parvec IV Workshop organized by John Rice at Purdue University on October 29-30, 1984, and at the SIAM Conference on Applied Linear Algebra in Raleigh, April 29-May 2, 1985.
We wish to thank all the participants at the workshop and meeting for their comments, discussions, and encouragement, as well as the many people who have sent us comments separately.
.bp
.SH
9. References
.sp
.IP [1]
D.S. Dodson and J.G. Lewis, ``Issues Relating to Extension of the Basic Linear Algebra Subprograms,''
.I
ACM SIGNUM Newsletter,
.R
Vol. 20, No. 1, (1985), 2-18.
.sp
.IP [2]
.R
J.J. Dongarra and S.C. Eisenstat, ``Squeezing the Most out of an Algorithm in CRAY Fortran,''
.I
ACM Transactions on Mathematical Software,
.R
Vol. 10, No. 3, (1984), 221-230.
.sp .IP [3] J.J. Dongarra, .I Increasing the Performance of Mathematical Software through High-Level Modularity. .R Proceedings of the Sixth International Symposium on Computing Methods in Engineering and Applied Sciences. (Versailles, France). North-Holland (1984), pp 239-248. .sp .IP [4] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart, .I LINPACK Users' Guide, .R SIAM Publications, Philadelphia, 1979. .sp .IP [5] .R J.J. Dongarra, L. Kaufman, and S. Hammarling .I Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers, .R Argonne National Laboratory Report ANL MCS-TM 46, (January 1985), to appear in Linear Algebra and Its Applications. .sp .IP [6] B.S. Garbow, J.M. Boyle, J.J. Dongarra, C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977. .sp .IP [7] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software .R \fB5\fP (1979), 308-323. .sp .IP [8] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Algorithm 539: Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software .R \fB5\fP (1979), 324-325. .sp .IP [9] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976. .sp .bp .SH Appendix A .PP This appendix gives an example to illustrate three possible implementations, within Fortran, of the proposed extended BLAS. 
The example considers three versions of part of the routine SGEMV: the first a straightforward version that is likely to be suitable for many conventional serial paged or non-paged machines; the second a tuned version, with unrolled loops, suitable for vector machines such as the CRAY 1; the third a version with accumulated inner products aimed at short word length machines.
The examples below assume the value of one for the variables INCX and INCY.
.ps 8
.sp 2
.cs 1 24
.nf
\f2Version 1.\f1
.vs 11p
*
*        Form  y := alpha*A*x + y.
*
         JX = KX
         DO 80, J = 1, N
            IF( X( JX ).NE.ZERO )THEN
               TEMP = ALPHA*X( JX )
               IY = KY
               DO 70, I = 1, M
                  Y( IY ) = Y( IY ) + TEMP*A( I, J )
                  IY = IY + INCY
   70          CONTINUE
            END IF
            JX = JX + INCX
   80    CONTINUE
.vs 16p
\f2Version 2.\f1
.vs 11p
*
*        The following code illustrates a possible Fortran CRAY 1 version.
*        In this version the loops are only unrolled to a depth of 4, but a
*        depth of 16 should be considered for a 'production' version.
*        Special code for cases such as alpha = 1.0 and unit increments has
*        not been included, but this could be considered for a 'production'
*        version.
*
*        Form  y := alpha*A*x + y.  The first two loops are 'clean-up'
*        loops for the left over vectors.
*
         JX = KX
         J = MOD( N, 2 )
         IF( J.EQ.1 )THEN
            TEMP1 = ALPHA*X( JX )
            IY = KY
            DO 50, I = 1, M
               Y( IY ) = Y( IY ) + TEMP1*A( I, J )
               IY = IY + INCY
   50       CONTINUE
            JX = JX + INCX
         END IF
         J = MOD( N, 4 )
         IF( ( J.EQ.2 ).OR.( J.EQ.3 ) )THEN
            TEMP1 = ALPHA*X( JX )
            TEMP2 = ALPHA*X( JX + INCX )
            IY = KY
            DO 60, I = 1, M
               Y( IY ) = ( ( Y( IY ) ) +
     $                   TEMP1*A( I, J - 1 ) ) +
     $                   TEMP2*A( I, J )
               IY = IY + INCY
   60       CONTINUE
            JX = JX + 2*INCX
         END IF
*
*        The main loop runs over groups of four vectors.
*
         JSTART = J + 4
         DO 80, J = JSTART, N, 4
            TEMP1 = ALPHA*X( JX )
            TEMP2 = ALPHA*X( JX + INCX )
            TEMP3 = ALPHA*X( JX + 2*INCX )
            TEMP4 = ALPHA*X( JX + 3*INCX )
            IY = KY
            DO 70, I = 1, M
               Y( IY ) = ( ( ( ( Y( IY ) ) +
     $                   TEMP1*A( I, J - 3 ) ) +
     $                   TEMP2*A( I, J - 2 ) ) +
     $                   TEMP3*A( I, J - 1 ) ) +
     $                   TEMP4*A( I, J )
               IY = IY + INCY
   70       CONTINUE
            JX = JX + 4*INCX
   80    CONTINUE
.vs 16p
\f2Version 3.\f1
.vs 11p
*
*        The following code illustrates a version for short word length
*        machines for which speed is not essential.  Accumulated inner
*        products are used.  DTEMP has been declared double precision.
*
*        Form  y := alpha*A*x + y.
*
         IY = KY
         DO 40, I = 1, M
            DTEMP = ZERO
            JX = KX
            DO 30, J = 1, N
               DTEMP = DTEMP + A( I, J )*DBLE( X( JX ) )
               JX = JX + INCX
   30       CONTINUE
            Y( IY ) = ALPHA*DTEMP + Y( IY )
            IY = IY + INCY
   40    CONTINUE
.vs 16p
.fi
.ps 12
.sp 2
.cs 1
.bp
.SH
Appendix B
.PP
In this appendix we illustrate how to use the full matrix update routines to obtain rank-one and rank-two updates to band matrices.
We assume that the vectors x and y are such that no fill-in occurs outside the band, in which case the update affects only a full rectangle within the band matrix A.
This is illustrated in Fig. B.1 for the case where @m=n=9@, @kl=2@, @ku=3@ and the update commences in row (and column) @l=3@.
.cs 1 24
.nf
   ( * * * *           )       ( 0 )  ( 0 0 * * * * 0 0 0 )
   ( * * * * *         )       ( 0 )
   ( * * |* * * *|     )       ( * )
   (   * |* * * *| *   )       ( * )
   (     |* * * *| * * )   +   ( * )
   (       * * * * * * )       ( 0 )
   (         * * * * * )       ( 0 )
   (           * * * * )       ( 0 )
   (             * * * )       ( 0 )
.fi
.cs 1
.EQ
A ~+~ xy ~ sup T
.EN
Fig. B.1
We see that the update affects only that part of A indicated by the dotted lines, that is, the (@kl@+1) by (\f2ku\f1+1) part of @A@ starting at @a sub ll @.
The routines that we have omitted are GBR, SBR, and SBR2 (in the complex case HBR and HBR2).
Their parameter lists could have been
.sp
.nf
GBR ( M, N, KL, KU, L, ALPHA, X, INCX, Y, INCY, A, LDA )
SBR ( UPLO, N, K, L, ALPHA, X, INCX, A, LDA )
SBR2( UPLO, N, K, L, ALPHA, X, INCX, Y, INCY, A, LDA )
.fi
.sp
where the parameter L denotes the starting row and column for the
update, and the elements @x sub l@ and @y sub l@ of the vectors @x@
and @y@ are in elements X(1) and Y(1) of the arrays X and Y.
Calls to SGBR can be achieved by
.sp
.nf
KM = MIN( KL+1, M-L+1 )
KN = MIN( KU+1, N-L+1 )
CALL SGER( KM, KN, ALPHA, X, INCX, Y, INCY,
           A( KU+1, L ), MAX( KM, LDA-1 ) )
.fi
.sp
Calls to SSBR can be achieved by
.sp
.nf
KN = MIN( K+1, N-L+1 )
IF( UPLO.EQ.'U' )THEN
   CALL SSYR( 'U', KN, ALPHA, X, INCX, A( K+1, L ), MAX( 1, LDA-1 ) )
ELSE
   CALL SSYR( 'L', KN, ALPHA, X, INCX, A( 1, L ), MAX( 1, LDA-1 ) )
END IF
.fi
.sp
and similarly for calls to SSBR2.
.bp
.SH
Appendix C
.PP
In this appendix we propose an additional set of real and complex
single precision level 2 routines which allow extended precision
matrix-vector operations to be performed.
.PP
The names of these routines are obtained by preceding the character
representing the Fortran data type (S or C) by the character E.
These routines are to perform the operations described in section 2
as follows.
.sp 2
.PP
For the matrix-vector operations
.sp
.EQ I
y ~<-~ alpha A x ~+~ beta y, ~~
y ~<-~ alpha A sup T x ~+~ beta y, ~~
y ~<-~ alpha A bar sup T x ~+~ beta y
.EN
@alpha , ~ beta , ~ A,@ and @x@ are single precision, @y@ is double
precision, and the computation of @y@ is to be performed in at least
double precision.
.sp
.PP
For the triangular operations
.sp
.EQ I
x ~<-~ T x, ~~ x ~<-~ T sup T x, ~~ x ~<-~ T bar sup T x,
.EN
.EQ I
x ~<-~ T sup -1 x, ~~ x ~<-~ T sup -T x, ~~ x ~<-~ T bar sup -T x
.EN
.sp
@T@ is single precision, @x@ is double precision, and the computation
of @x@ is to be performed in at least double precision.
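The intent of the mixed-precision matrix-vector operation can be
sketched in C (an illustrative analogue of the semantics only; the
E-prefixed routines themselves are Fortran, and the name
`esgemv_n` is a hypothetical helper for the non-transposed case):

```c
/* Illustrative sketch of y := alpha*A*x + beta*y with single
   precision alpha, beta, A (column-major, leading dimension lda),
   and x, double precision y, accumulating inner products in
   double precision as the proposal requires. */
void esgemv_n(int m, int n, float alpha, const float *a, int lda,
              const float *x, float beta, double *y)
{
    for (int i = 0; i < m; i++) {
        double temp = 0.0;          /* double precision accumulator */
        for (int j = 0; j < n; j++)
            temp += (double)a[i + j * lda] * (double)x[j];
        y[i] = (double)alpha * temp + (double)beta * y[i];
    }
}
```

The single precision data are promoted element by element, so each
inner product is formed entirely in double precision, in the spirit
of Version 3 of the SGEMV example in Appendix A.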
.PP
For the rank-one and rank-two updates
.sp
.EQ I
A ~<-~ alpha x y sup T ~+~ A, ~~
A ~<-~ alpha x y bar sup T ~+~ A,
.EN
.EQ I
H ~<-~ alpha x x bar sup T ~+~ H, ~~
H ~<-~ alpha x y bar sup T ~+~ alpha bar y x bar sup T ~+~ H,
.EN
.sp
@alpha@, @A@, and @H@ are single precision, @x@ and @y@ are double
precision, and the computation is to be performed in at least double
precision.
.PP
The types and dimensions of the corresponding Fortran variables (see
section 6) are as follows.
.PP
For all routines whose first two letters are ES:
.sp
.nf
REAL ALPHA, BETA
REAL A( LDA, * )
REAL AP( * )
DOUBLE PRECISION Y( * )
.fi
.sp
.PP
For routines ESGEMV, ESGBMV, ESSYMV, ESSBMV, and ESSPMV:
.sp
.nf
REAL X( * )
.fi
.sp
but for all the other ES routines:
.sp
.nf
DOUBLE PRECISION X( * )
.fi
.PP
For all routines whose first two letters are EC:
.sp
.nf
COMPLEX ALPHA
COMPLEX BETA
COMPLEX A( LDA, * )
COMPLEX AP( * )
COMPLEX*16 Y( * )   or   DOUBLE COMPLEX Y( * )
.fi
.sp
except that for ECHER and ECHPR the scalar @alpha@ is real, so that
the first declaration is replaced by:
.sp
.nf
REAL ALPHA
.fi
.sp
.PP
For routines ECGEMV, ECGBMV, ECHEMV, ECHBMV, and ECHPMV:
.sp
.nf
COMPLEX X( * )
.fi
.sp
but for all the other EC routines:
.sp
.nf
COMPLEX*16 X( * )   or   DOUBLE COMPLEX X( * )
.fi
.sp
.PP
We intend to make a set of these routines available together with the
routines of the main proposal, and the test program will include
tests of these extended precision level 2 BLAS.
.PP
These extended precision routines require just the COMPLEX*16
extension to ANSI Fortran 77.
An equivalent set of double precision routines would require further
data types to be available, and so we have not included these as part
of this proposal.
However, many systems do have an extended precision real data type
beyond double precision, such as REAL*16, and similarly some systems
have an additional complex data type such as COMPLEX*32.
Thus implementors may wish to provide real and complex double
precision versions of these routines, in which case we recommend that
the names of the routines again be obtained by preceding the
character representing the Fortran data type (D or Z) by the
character E.
We urge implementors of the double precision routines to describe
clearly the arithmetic used for the internal computation.
.PP
We thank Velvel Kahan for insisting that we think about the extended
precision issue.
.bp
.SH
Appendix D
.PP
This appendix contains the calling sequences for all the proposed
level 2 BLAS.
.ps 8
.cs 1 24
.nf
name   options            dim    b-width  scalar  matrix   x-vector   scalar  y-vector

_GEMV( TRANS,             M, N,           ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_GBMV( TRANS,             M, N,  KL, KU,  ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_HEMV( UPLO,                 N,           ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_HBMV( UPLO,                 N,  K,       ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_HPMV( UPLO,                 N,           ALPHA,  AP,      X, INCX,   BETA,   Y, INCY )
_SYMV( UPLO,                 N,           ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_SBMV( UPLO,                 N,  K,       ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_SPMV( UPLO,                 N,           ALPHA,  AP,      X, INCX,   BETA,   Y, INCY )
_TRMV( UPLO, TRANS, DIAG,    N,                   A, LDA,  X, INCX )
_TBMV( UPLO, TRANS, DIAG,    N,  K,               A, LDA,  X, INCX )
_TPMV( UPLO, TRANS, DIAG,    N,                   AP,      X, INCX )
_TRSV( UPLO, TRANS, DIAG,    N,                   A, LDA,  X, INCX )
_TBSV( UPLO, TRANS, DIAG,    N,  K,               A, LDA,  X, INCX )
_TPSV( UPLO, TRANS, DIAG,    N,                   AP,      X, INCX )

name   options   dim    scalar  x-vector  y-vector  matrix

_GER ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )
_HER ( UPLO, N, ALPHA, X, INCX, A, LDA )
_HPR ( UPLO, N, ALPHA, X, INCX, AP )
_HER2( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
_HPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
_SYR ( UPLO, N, ALPHA, X, INCX, A, LDA )
_SPR ( UPLO, N, ALPHA, X, INCX, AP )
_SYR2( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
_SPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
.fi
.cs 1
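As an illustration of the semantics behind one line of the table, the
solve routine _TRSV with UPLO = 'U', TRANS = 'N', and DIAG = 'N'
performs @x ~<-~ T sup -1 x@ by back substitution.  A minimal C
sketch of that operation (for exposition only; `trsv_upper` is a
hypothetical helper, not one of the proposed Fortran routines):

```c
/* Solve T*x = b for x, with T n-by-n upper triangular and
   non-unit diagonal, stored column-major with leading dimension
   ldt; x holds b on entry and the solution on exit.  This mirrors
   the operation performed by _TRSV('U','N','N', ...). */
void trsv_upper(int n, const float *t, int ldt, float *x)
{
    for (int j = n - 1; j >= 0; j--) {       /* back substitution */
        x[j] /= t[j + j * ldt];              /* divide by diagonal */
        for (int i = 0; i < j; i++)          /* eliminate column j */
            x[i] -= t[i + j * ldt] * x[j];
    }
}
```

The column-oriented ordering of the elimination loop matches the
matrix-vector orientation that the level 2 BLAS are designed to
expose.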