.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 An Extended Set of Fortran Basic Linear Algebra Subprograms .ps 11 .sp 3 Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 41 (Revision 2) .sp .7i January, 1986 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 ABSTRACT .R .ps 10 .sp .5i This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim @@ .EN .TL .ps 12 .in 0 An Extended Set of Fortran\h'.35i' .sp Basic Linear Algebra Subprograms\h'.35i' .AU .ps 11 .in 0 Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' .AU .ps 11 .in 0 Jeremy Du Croz\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Sven Hammarling\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Richard J. Hanson\h'.20i' .AI .ps 10 .in 0 Applied Math 2646 Sandia National Laboratory\h'.20i' Albuquerque, New Mexico 87185\h'.20i' .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. 
Department of Energy, under Contract W-31-109-Eng-38. Typeset on \*(DY. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes an extension to the set of Basic Linear Algebra Subprograms. The extensions are targeted at matrix vector operations which should provide for more efficient and portable implementations of algorithms for high performance computers. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH 1. Introduction .PP In 1973 Hanson, Krogh, and Lawson wrote an article in the SIGNUM Newsletter (Vol. 8, no. 4, page 16) describing the advantages of adopting a set of basic routines for problems in linear algebra. The original basic linear algebra subprograms, now commonly referred to as the BLAS and fully described in Lawson, Hanson, Kincaid, and Krogh [7,8], have been very successful and have been used in a wide range of software including LINPACK [4] and many of the algorithms published by the ACM Transactions on Mathematical Software. In particular they are an aid to clarity, portability, modularity and maintenance of software and they have become a \f2de facto \f1standard for the elementary vector operations. An excellent discussion of the .I raison d' \*^etre .R of the BLAS is given in Dodson and Lewis [1]. .PP Special versions of the BLAS, in some cases machine code versions, have been implemented on a number of computers, thus improving the efficiency of the BLAS. However, with some of the modern machine architectures, the use of the BLAS is not the best way to improve the efficiency of higher level codes. On vector machines, for example, one needs to optimize at least at the level of matrix-vector operations in order to approach the potential efficiency of the machine (see [2 and 3]); and the use of the BLAS inhibits this optimization because they hide the matrix-vector nature of the operations from the compiler. 
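.PP
To make the point concrete, the following sketch (ours, in Python rather than Fortran, with hypothetical names) shows how a matrix-vector product y := alpha*A*x + beta*y decomposes into a sequence of Level 1 AXPY operations; only when both loops are packaged in a single routine can an implementor (or compiler) optimize across them.

```python
def axpy(a, x, y):
    """Level 1 style operation: y := a*x + y for vectors x and y."""
    for i in range(len(y)):
        y[i] += a * x[i]

def gemv(alpha, a, x, beta, y):
    """Level 2 style operation: y := alpha*A*x + beta*y, with A stored
    as a list of rows.  Internally this is one AXPY per column of A,
    but the caller sees a single matrix-vector operation, so the
    matrix-vector structure is not hidden from the implementation."""
    m, n = len(a), len(a[0])
    for i in range(m):
        y[i] *= beta
    for j in range(n):
        col = [a[i][j] for i in range(m)]   # j-th column of A
        axpy(alpha * x[j], col, y)
```

This is a sketch of the semantics only; an efficient implementation would avoid forming each column as a temporary list.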
.PP We believe that the time is right to propose the specifications of an additional set of BLAS designed for matrix-vector operations. It has been our experience that a small set of matrix-vector operations occur frequently in the implementation of many of the most common algorithms in linear algebra. We define here the basic operations for that set, together with the naming conventions and the calling sequences. Routines at this level should provide a reasonable compromise between the sometimes conflicting aims of efficiency and modularity and it is our hope that efficient implementations will become available on a wide range of computer architectures. .PP In this paper we shall refer to the existing BLAS of Lawson et al. as ``Level 1 BLAS'' or ``Existing BLAS'', and the proposed new set as ``Level 2 BLAS'' or ``Extended BLAS''. The Level 2 BLAS involve @O(mn )@ scalar operations where @m@ and @n@ are the dimensions of the matrix involved. These could be programmed by a series of calls to the Level 1 BLAS, though we do not recommend that they be implemented in that way. Hence, in a natural sense, the Level 2 BLAS are performing basic operations at one level higher than the Level 1 BLAS. .PP We will make available a complete set of Level 2 BLAS in Fortran 77 so that software developers without access to specific implementations can make use of them. We also plan to develop a test program so that implementations of the extended BLAS can be thoroughly tested before being distributed. We intend to submit the test program and the Fortran 77 version of the routines for publication as an ACM algorithm. .SH 2. 
Scope of the Extended BLAS .PP We propose that the following three types of basic operation be performed by the extended BLAS: a) Matrix-vector products of the form .in +.5i .EQ I y ~<-~ alpha Ax ~+~ beta y ,~~ y ~<-~ alpha A ~ sup T x ~+~ beta y ,~~ and~ ~ y ~<-~ alpha A bar ~ sup T x ~+~ beta y .EN where @ alpha @ and @ beta @ are scalars, @x@ and @y@ are vectors and @A@ is a matrix, and .EQ I x ~<-~ Tx ,~ ~ x ~<-~ T ~ sup T x,~~ and~ ~ x ~<-~ T bar ~ sup T x, .EN where @x@ is a vector and @T@ is an upper or lower triangular matrix. .in -.5i b) Rank-one and rank-two updates of the form .in +.5i .EQ I A ~<-~ alpha xy ~ sup T ~+~ A ,~~ A ~<-~ alpha x y bar ~ sup T ~+~ A ,~~ H ~<-~ alpha x x bar ~ sup T ~+~ H ,~~and ~ H ~<-~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ H , .EN where @~H~@ is a Hermitian matrix. .in -.5i c) Solution of triangular equations of the form .in +.5i .EQ I x ~<-~ T ~ sup -1 x , ~x ~<-~ T ~ sup -T x , ~~and~ ~x ~<-~ T bar ~ sup -T x , .EN where @T@ is a non-singular upper or lower triangular matrix. .in -.5i .PP We propose that, where appropriate, the operations be applied to general, general band, Hermitian, Hermitian band, triangular, and triangular band matrices in both real and complex arithmetic, and in single and double precision. .PP An example illustrating the implementation of the routines is given in Appendix A. .PP In Appendix C we propose some additional routines in which the vectors @x@ and/or @y@ have a higher precision than the matrix, so that on machines with extended precision registers the extra internal precision is not all discarded on return from the routine. We propose these routines as an appendix solely because it is not possible, even in single precision, to specify a complete set within the confines of ANSI Fortran 77. The only case that can be realized is where the matrix is real single precision and the vector is real double precision. .SH 3. 
Naming Conventions
.PP
The proposed name of a Level 2 BLAS routine is in the LINPACK style and consists of five characters, the last of which may be blank.
The fourth and fifth characters in the name denote the type of operation, as follows:
.KS
.TS
center;
l l.
MV	- Matrix-vector product
R	- Rank-one update
R2	- Rank-two update
SV	- Solve a system of linear equations
.TE
.KE
Characters two and three in the name denote the kind of matrix involved, as follows:
.KS
.TS
center;
l l.
GE	General matrix
GB	General band matrix
HE	Hermitian matrix
SY	Symmetric matrix
HP	Hermitian matrix stored in packed form
SP	Symmetric matrix stored in packed form
HB	Hermitian band matrix
SB	Symmetric band matrix
TR	Triangular matrix
TP	Triangular matrix in packed form
TB	Triangular band matrix
.TE
.KE
The first character in the name denotes the Fortran data type of the matrix, as follows:
.TS
center;
l l.
S	REAL
D	DOUBLE PRECISION
C	COMPLEX
Z	COMPLEX*16 or DOUBLE COMPLEX (if available)
.TE
The proposed available combinations are indicated in the table below.
In the first column, under \f2complex\f1, the initial C may be replaced by Z.
In the second column, under \f2real\f1, the initial S may be replaced by D.
See Appendix D for the full subroutine calling sequences.
.PP
The proposed collection of routines can be thought of as being divided into four separate parts: \f2complex\f1, \f2real\f1, \f2double precision\f1, and \f2complex*16\f1.
Each part will be accompanied by a separate testing program.
The routines proposed are written in the ANSI Fortran 77 standard, with the exception of the routines that use COMPLEX*16 variables.
These routines are included for completeness and, because the Fortran standard does not provide for this variable type, may not be available on all machines.
.KS
.ce
Table 3.1
.TS
center;
c c c c c c.
\f2complex	real\f1	MV	R	R2	SV
CGE	SGE	*	*@nothing sup \(dg @
CGB	SGB	*
CHE	SSY	*	*	*
CHP	SSP	*	*	*
CHB	SSB	*
CTR	STR	*			*
CTP	STP	*			*
CTB	STB	*			*
.TE
.KE
.EQ
gsize 8
.EN
.FS
@size -1 {"" sup \(dg}@\|
For the general rank-1 update (GER) we propose two complex routines: CGERC for @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU for @ A ~<-~ alpha xy ~ sup T ~+~ A @.
This is the only exception to the one-to-one correspondence between real and complex routines.
See section 7 for further discussion.
.FE
.EQ
gsize 11
.EN
.PP
We do not propose routines for rank-one and rank-two updates applied to band matrices, because these can be obtained by calls to the rank-one and rank-two full matrix routines.
This is illustrated in Appendix B.
.SH
4. Parameter Conventions
.PP
We propose a convention for the parameter lists similar to that for the existing BLAS, but with extensions where comparable parameters are not present in the existing BLAS.
The proposed order of parameters is as follows:
a) Parameters specifying options.
b) Parameters defining the size of the matrix.
c) Input scalar.
d) Description of the input matrix.
e) Description of input vector(s).
f) Input scalar (associated with the input-output vector).
g) Description of the input-output vector.
h) Description of the input-output matrix.
Note that not every category is present in each of the routines.
.PP
The parameters that specify options are character parameters with the names TRANS, UPLO, and DIAG.
TRANS is used by the matrix-vector product routines as follows:
.KS
.TS
center;
l l.
Value	Meaning
_	_
`N'	Operate with the matrix.
`T'	Operate with the transpose of the matrix.
`C'	Operate with the conjugate transpose of the matrix.
.TE
.KE
In the real case the values `T' and `C' have the same meaning.
.PP
UPLO is used by the Hermitian, symmetric, and triangular matrix routines to specify whether the upper or lower triangle is being referenced, as follows:
.KS
.TS
center;
l l.
Value	Meaning
_	_
`U'	Upper triangle
`L'	Lower triangle
.TE
.KE
.PP
DIAG is used by the triangular matrix routines to specify whether or not the matrix is unit triangular, as follows:
.KS
.TS
center;
l l.
Value	Meaning
_	_
`U'	Unit triangular
`N'	Non-unit triangular
.TE
.KE
When DIAG is supplied as `U' the diagonal elements are not referenced.
.PP
It is worth noting that actual character arguments in Fortran may be longer than the corresponding dummy arguments, so that, for example, the value `T' for TRANS may be passed as `TRANSPOSE'.
.PP
The size of the matrix is determined by the two parameters M and N for an @m@ by @n@ rectangular matrix; by the parameters M, N, KL, and KU for an @m@ by @n@ band matrix with @kl@ sub-diagonals and \f2ku\f1 super-diagonals; and by the parameters N and K for an @n@ by @n@ symmetric, Hermitian, or triangular band matrix with \f2k\f1 super- and/or sub-diagonals.
.PP
The description of the matrix consists either of the array name (A) followed by the leading dimension of the array as declared in the calling (sub)program (LDA), when the matrix is being stored in a two-dimensional array; or the Fortran array name (AP) alone, when the matrix is being stored as a (packed) vector.
When A is not band, then in the former case the actual array must contain at least @(m ~+~ d(n-1))@ elements, where @d@ is the leading dimension of the array and @m~=~n@ for a square matrix; and in the latter case the actual array must contain at least @n(n+1)/2@ elements.
.PP
The scalars always have the dummy argument names ALPHA and BETA.
As with the existing BLAS, the description of a vector consists of the name of the array (X or Y) followed by the storage spacing (increment) in the array of the vector elements (INCX or INCY).
The increment is allowed to be negative, zero, or positive.
When the vector @x@ consists of @k@ elements, the corresponding actual array argument X must be of length at least (1 + (@k@-1)|INCX|).
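.PP
The length requirement above can be sketched explicitly. Following the Level 1 BLAS convention, for nonzero INCX the logical element x(i) is stored at position KX + (i-1)*INCX of the array X, where KX = 1 when INCX > 0 and KX = 1 - (k-1)*INCX when INCX < 0, so that a negative increment walks the array backwards. A Python sketch (the helper name is ours; indices are 0-based):

```python
def vector_indices(k, incx):
    """0-based array positions of the k logical vector elements
    x(1), ..., x(k) in the array X, for increment incx, following the
    Level 1 BLAS convention (negative increments run backwards)."""
    if incx == 0:
        return [0] * k          # every element maps to X(1)
    start = 0 if incx > 0 else -(k - 1) * incx
    return [start + i * incx for i in range(k)]

# The largest position used is (k-1)*abs(incx), so X needs at least
# 1 + (k-1)*abs(incx) elements, matching the requirement in the text.
```

For example, k = 4 with INCX = 2 touches positions 0, 2, 4, 6, so X must hold at least 1 + 3*2 = 7 elements.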
.sp
The following values of parameters are invalid:
.nf
.sp
    Any value of the character parameters DIAG, TRANS, or UPLO whose meaning is not specified.
    M < 0 for the GE and GB routines.
    N < 0 for all routines.
    KL < 0 or KL < M - N for the GB routines.
    KU < 0 or KU < N - M for the GB routines.
    K < 0 for the HB, SB, and TB routines.
    LDA < M for the GE routines.
    LDA < KL + KU + 1 for the GB routines.
    LDA < N for the HE, SY, and TR routines.
    LDA < K + 1 for the HB, SB, and TB routines.
.fi
Note that it is permissible to call the routines with M or N = 0, in which case the routines exit immediately without referencing their vector or matrix arguments.
For those routines that include the parameter BETA, when BETA is supplied as zero the array Y need not be set on input, so that operations such as @y ~ <- ~ alpha A x @ may be performed without initially setting @y@ to zero.
.SH
5. Storage Conventions
.PP
Unless otherwise stated it is assumed that matrices are stored conventionally in a two-dimensional array, with matrix element @a sub ij @ stored in array element A(I,J).
.PP
The routines for real symmetric and complex Hermitian matrices allow the matrix to be stored in either the upper (UPLO = `U') or lower triangle (UPLO = `L') of a two-dimensional array, or to be packed in a one-dimensional array.
In the latter case the upper triangle may be packed sequentially column by column (UPLO = `U'), or the lower triangle may be packed sequentially column by column (UPLO = `L').
Note that for real symmetric matrices packing the upper triangle by columns is equivalent to packing the lower triangle by rows, and packing the lower triangle by columns is equivalent to packing the upper triangle by rows.
(For complex Hermitian matrices the only difference is that the off-diagonal elements are conjugated.)
.PP
For triangular matrices the parameter UPLO serves to define whether the matrix is upper
.br
(UPLO = `U') or lower (UPLO = `L') triangular.
In the packed storage the triangle has to be packed by column. .PP The band matrix routines allow storage in the same style as with LINPACK, so that the @j sup th @ column of the matrix is stored in the @j sup th @ column of the Fortran array. For a general band matrix the diagonal of the matrix is stored in the @ku + 1 sup st @ row of the array. For a Hermitian or symmetric matrix either the upper triangle (UPLO = `U') may be stored in which case the leading diagonal is in the @k+1 sup st @ row of the array, or the lower triangle (UPLO = `L') may be stored in which case the leading diagonal is in the first row of the array. For an upper triangular band matrix (UPLO = `U') the leading diagonal is in the @k+1 sup st @ row of the array and for a lower triangular band matrix (UPLO = `L') the leading diagonal is in the first row. .PP For a Hermitian matrix the imaginary parts of the diagonal elements are of course zero and thus the imaginary parts of the corresponding Fortran array elements need not be set, but are assumed to be zero. In the R and R2 routines these imaginary parts will be set to zero on return. .PP For packed triangular matrices the same storage layout is used whether or not DIAG = `U' (diagonal elements are assumed to have the value 1), i.e. space is left for the diagonal elements even if those array elements are not referenced. .SH 6. 
Specification of the Extended BLAS
.PP
Types and dimensions for the variables occurring in the subroutine specifications are as follows:
.KS
.nf
    INTEGER INCX, INCY, K, KL, KU, LDA, M, N
    CHARACTER*1 DIAG, TRANS, UPLO
.fi
.KE
For routines whose first letter is an S:
.nf
    REAL ALPHA, BETA
    REAL X(*), Y(*)
    REAL A(LDA,*)
    REAL AP(*)
.fi
For routines whose first letter is a D:
.nf
    DOUBLE PRECISION ALPHA, BETA
    DOUBLE PRECISION X(*), Y(*)
    DOUBLE PRECISION A(LDA,*)
    DOUBLE PRECISION AP(*)
.fi
For routines whose first letter is a C:
.nf
    COMPLEX ALPHA
    COMPLEX BETA
    COMPLEX X(*), Y(*)
    COMPLEX A(LDA,*)
    COMPLEX AP(*)
.fi
except that for CHER and CHPR the scalar @alpha@ is real, so that the first declaration above is replaced by:
.nf
    REAL ALPHA
.fi
For routines whose first letter is a Z:
.nf
    COMPLEX*16 ALPHA        or  DOUBLE COMPLEX ALPHA
    COMPLEX*16 BETA         or  DOUBLE COMPLEX BETA
    COMPLEX*16 X(*), Y(*)   or  DOUBLE COMPLEX X(*), Y(*)
    COMPLEX*16 A(LDA,*)     or  DOUBLE COMPLEX A(LDA,*)
    COMPLEX*16 AP(*)        or  DOUBLE COMPLEX AP(*)
.fi
except that for ZHER and ZHPR the first declaration above is replaced by:
.nf
    DOUBLE PRECISION ALPHA
.fi
The generic names, parameter lists, and specifications for the extended BLAS now follow.
Refer to Table 3.1 for the specific subroutine names.
.sp
a) General Matrix-Vector Products
.nf
for a general matrix:
    GEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
for a general band matrix:
    GBMV ( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
.fi
Operation:
.sp
if @TRANS~=~@ `N',@~~~~~y ~ <- ~ alpha Ax ~+~ beta y @
.sp
if @TRANS~=~@ `T',@ ~~~~y ~ <- ~ alpha A ~ sup T x ~+~ beta y@
.sp
if @TRANS~=~@ `C',@ ~~~~y ~ <- ~ alpha A bar ~ sup T x ~+~ beta y @.
.sp
b) Symmetric or Hermitian Matrix-Vector Products
.nf
for a symmetric or Hermitian matrix:
    SYMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
    HEMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
for a symmetric or Hermitian matrix in packed storage:
    SPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )
    HPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )
for a symmetric or Hermitian band matrix:
    SBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
    HBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
.fi
Operation:
.EQ I
y ~ <- ~ alpha Ax ~+~ beta y ~.
.EN
c) Triangular Matrix-Vector Products
.nf
for a triangular matrix:
    TRMV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )
for a triangular matrix in packed storage:
    TPMV ( UPLO, TRANS, DIAG, N, AP, X, INCX )
for a triangular band matrix:
    TBMV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )
.fi
Operation:
.sp
if @TRANS~=~@ `N', @~~~~x ~ <- ~ Tx @
.sp
if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup T x@
.sp
if @TRANS~=~@ `C', @~~~~x ~ <- ~ T bar ~ sup T x@
.sp
d) Triangular Equation Solvers
.nf
for a triangular matrix:
    TRSV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )
for a triangular matrix in packed storage:
    TPSV ( UPLO, TRANS, DIAG, N, AP, X, INCX )
for a triangular band matrix:
    TBSV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )
.fi
Operation:
.sp
if @TRANS~=~@ `N', @~~~~x ~ <- ~ T ~ sup -1 x@
.sp
if @TRANS~=~@ `T', @~~~~x ~ <- ~ T ~ sup -T x @
.sp
if @TRANS~=~@ `C', @~~~~x ~ <- ~ T bar ~ sup -T x ~ @.
.sp
e) General Rank-1 Updates
.nf
for a general matrix:
    GER_ ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )
.fi
For real matrices, SGER performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @.
For complex matrices, CGERC performs the operation @ A ~<-~ alpha x y bar ~ sup T ~+~ A @ and CGERU performs the operation @ A ~<-~ alpha x y ~ sup T ~+~ A @.
f) Symmetric or Hermitian Rank-1 Updates
.nf
for a symmetric or Hermitian matrix:
    SYR ( UPLO, N, ALPHA, X, INCX, A, LDA )
    HER ( UPLO, N, ALPHA, X, INCX, A, LDA )
for a symmetric or Hermitian matrix in packed storage:
    SPR ( UPLO, N, ALPHA, X, INCX, AP )
    HPR ( UPLO, N, ALPHA, X, INCX, AP )
.fi
Operation:
.EQ I
A ~<- ~ alpha x x bar ~ sup T ~+~ A
.EN
For real symmetric matrices this is simply
.EQ I
A ~<- ~ alpha x x ~ sup T ~+~ A ~.
.EN
g) Symmetric or Hermitian Rank-2 Updates
.nf
for a symmetric or Hermitian matrix:
    SYR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
    HER2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
for a symmetric or Hermitian matrix in packed storage:
    SPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
    HPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
.fi
Operation:
.EQ I
A ~<- ~ alpha x y bar ~ sup T ~+~ alpha bar y x bar ~ sup T ~+~ A
.EN
For real symmetric matrices this is simply
.EQ I
A ~<- ~ alpha x y ~ sup T ~+~ alpha y x ~ sup T ~+~ A ~.
.EN
.SH
7. Rationale
.PP
The three basic matrix-vector operations chosen (Section 2) were obvious candidates because they occur in a wide range of linear algebra applications, and they occur at the innermost level of many algorithms.
The hard decision was to restrict the scope to these operations only, since there are many other potential candidates, such as matrix scaling and sequences of plane rotations.
Similarly, we could have extended the scope by applying the operations to other types of matrices, such as complex symmetric or augmented band matrices.
We have aimed at a reasonable compromise between a much larger number of routines each performing one type of operation (e.g. @x ~<-~ L ~ sup -T x@) and a smaller number of routines with a more complicated set of options.
There are, in fact, in each precision 16 real routines performing altogether 43 different operations, and 17 complex routines performing 58 different operations.
.PP We feel that to extend the scope further would significantly reduce the chances of having the routines implemented efficiently over a wide range of machines, because it would place too heavy a burden on implementors. On the other hand, to restrict the scope further would place too narrow a limit on the potential applications of the level 2 BLAS. .PP The parameters @ alpha @ and @ beta@ are included in the non-triangular routines to give extra flexibility, but we recommend that implementors consider special code for the cases where @ alpha ~or ~beta~=~ 1.0 @ and @ alpha ~or~ beta ~=~ -1.0 @. Similarly, as with the level 1 BLAS, we have included an increment parameter with the vectors so that a vector could, for example, be a row of a matrix. But again we recommend that implementors consider special code for the case where the increments are unity. Note that the implementation must be such that when @ beta ~ = ~ 0.0 @ then @y@ need not be defined on input. .PP As noted earlier, corresponding to the real routine SGER we propose two complex routines CGERC (for @A ~<-~ alpha x y bar ~ sup T ~+~ A@) and CGERU (for @A ~<-~ alpha x y ~ sup T ~+~ A@). Both are frequently required. An alternative would be to provide a single complex routine CGER with an option parameter; however this parameter would have become redundant in the real routine SGER. Rather than have redundant parameters, or different parameter lists for the real and complex routines, we have chosen two distinct complex routines; they are analogous to the level 1 BLAS CDOTC (@c ~<- ~ c ~+~ x bar ~ sup T y@) and CDOTU (@c ~<- ~ c ~+~ x ~ sup T y@). .PP Note that no check has been included for singularity, and near singularity, in the triangular equation solving routines. The requirements for such a test depend upon the application and so we felt that this should not be included, but should instead be performed before calling the triangular solver. 
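.PP
The absence of a singularity test can be seen in a sketch (ours, in Python, for the full-storage, lower triangular, TRANS = `N' case) of the solve operation: a zero diagonal element simply produces a division error, so any required check must be made before the call.

```python
def trsv_lower(a, x):
    """x := T**(-1)*x by forward substitution, where T is the lower
    triangle of the square array a (a list of rows).  As in the
    proposal, no test for singularity or near-singularity is made:
    a zero diagonal element raises ZeroDivisionError in the caller."""
    n = len(x)
    for i in range(n):
        s = x[i]
        for j in range(i):
            s -= a[i][j] * x[j]
        x[i] = s / a[i][i]
```

A caller who needs protection would scan the diagonal (and, for near-singularity, estimate a condition number) before invoking the solver.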
.PP On certain machines, which do not use the ASCII sequence on all of their Fortran systems, lower case characters may not exist, so that the innocent looking argument \f2`t'\f1, passed through the parameter TRANS for designating a transposed matrix, is not in the Fortran character set. Some UNIVAC systems do not have a lower case representation using the `field data' character set. On the CDC NOS-2 system, a mechanism is provided for a full 128 ASCII character set by using pairs of 6-bit host characters for certain 7-bit ASCII characters. This means that there is a `2 for 1' physical extension of the logical records that contain lower case letters. This fact can hamper portability of codes written on ASCII machines that are later moved to CDC systems. The only safe way to proceed is to convert the transported text entirely into the Fortran character set. On the other hand we believe that users on ASCII character set systems may wish to treat upper and lower case letters as equivalent in meaning. If this is done, it means that text that will be transported to machines of unknown types must have the ASCII set mapped into the Fortran character set before the text is moved. .PP The band storage scheme used by the GB, HB, SB, and TB routines has columns of the matrix stored in columns of the array, and diagonals of the matrix stored in rows of the array. This is the storage scheme used by LINPACK. An alternative scheme (used in some EISPACK [6,9] routines) has rows of the matrix stored in rows of the array, and diagonals of the matrix stored in columns of the array. The latter scheme has the advantage that a band matrix-vector product of the form @ y ~<- ~ alpha Ax ~+~ beta y @ can be computed using long vectors (the diagonals of the matrix) stored in contiguous elements, and hence is much more efficient on some machines (e.g. CDC Cyber 205) than the first scheme. 
However, other computations involving band matrices, such as @x ~<-~ Tx@, @x ~<-~ T ~ sup -1 x@, and @LU@ and @U ~ sup T U @ factorization, cannot be organized `by diagonals'; instead the computation sweeps along the band, and the LINPACK storage scheme has the advantage of reducing the number of page swaps and allowing contiguous vectors (the columns of the matrix) to be used.
.PP
Although not discussed here, we plan to provide a portable testing package for each of the four parts of this extended BLAS package.
The test package will be self-contained, generating test data and checking for conformity with this proposal in a portable fashion.
.PP
We considered the possibility of generalizing the rank-1 and rank-2 updates to rank-k updates.
Rank-k updates with @k ~>~ 1@ (but @k ~<<~n@) can achieve significantly better performance on some machines than rank-1 updates [5].
But to take advantage of this usually requires complicating the calling algorithm; and moreover rank-k updates with @k ~ approx ~n@ would allow an even higher-level operation, such as matrix multiplication, in by the back door.
We prefer to keep to a clean concept of genuine matrix-vector operations.
.SH
8. Acknowledgements
.PP
An earlier draft of this proposal was discussed at the Parvec IV Workshop organized by John Rice at Purdue University on October 29-30, 1984, and at the SIAM Conference on Applied Linear Algebra in Raleigh, April 29-May 2, 1985.
We wish to thank all the participants at the workshop and meeting for their comments, discussions, and encouragement, as well as the many people who have sent us comments separately.
.bp
.SH
9. References
.sp
.IP [1]
D.S. Dodson and J.G. Lewis, ``Issues Relating to Extension of the Basic Linear Algebra Subprograms,''
.I
ACM SIGNUM Newsletter,
.R
Vol. 20, No. 1, (1985), 2-18.
.sp
.IP [2]
.R
J.J. Dongarra and S.C. Eisenstat, ``Squeezing the Most out of an Algorithm in CRAY Fortran,''
.I
ACM Transactions on Mathematical Software,
.R
Vol. 10, No. 3, (1984), 221-230.
.sp .IP [3] J.J. Dongarra, .I Increasing the Performance of Mathematical Software through High-Level Modularity. .R Proceedings of the Sixth International Symposium on Computing Methods in Engineering and Applied Sciences. (Versailles, France). North-Holland (1984), pp 239-248. .sp .IP [4] J.J. Dongarra, J.R. Bunch, C.B. Moler, and G.W. Stewart, .I LINPACK Users' Guide, .R SIAM Publications, Philadelphia, 1979. .sp .IP [5] .R J.J. Dongarra, L. Kaufman, and S. Hammarling .I Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers, .R Argonne National Laboratory Report ANL MCS-TM 46, (January 1985), to appear in Linear Algebra and Its Applications. .sp .IP [6] B.S. Garbow, J.M. Boyle, J.J. Dongarra, C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide Extension, .R Lecture Notes in Computer Science, Vol. 51, Springer-Verlag, Berlin, 1977. .sp .IP [7] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software .R \fB5\fP (1979), 308-323. .sp .IP [8] C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Algorithm 539: Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software .R \fB5\fP (1979), 324-325. .sp .IP [9] B.T. Smith, J.M. Boyle, J.J. Dongarra, B.S. Garbow, Y. Ikebe, V.C. Klema, and C.B. Moler, .I Matrix Eigensystem Routines - EISPACK Guide, .R Lecture Notes in Computer Science, Vol. 6, 2nd edition, Springer-Verlag, Berlin, 1976. .sp .bp .SH Appendix A .PP This appendix gives an example to illustrate three possible implementations, within Fortran, of the proposed extended BLAS. 
The example considers three versions of part of the routine SGEMV: the first a straightforward version that is likely to be suitable for many conventional serial paged or non-paged machines; the second a tuned version, with unrolled loops, suitable for vector machines such as the CRAY 1; the third a version with accumulated inner products aimed at short word length machines.
The examples below assume the value of one for the variables INCX and INCY.
.ps 8
.sp 2
.cs 1 24
.nf
\f2Version 1.\f1
.vs 11p
*
*        Form  y := alpha*A*x + y.
*
         JX = KX
         DO 80, J = 1, N
            IF( X( JX ).NE.ZERO )THEN
               TEMP = ALPHA*X( JX )
               IY = KY
               DO 70, I = 1, M
                  Y( IY ) = Y( IY ) + TEMP*A( I, J )
                  IY = IY + INCY
   70          CONTINUE
            END IF
            JX = JX + INCX
   80    CONTINUE
.vs 16p
\f2Version 2.\f1
.vs 11p
*
*        The following code illustrates a possible Fortran CRAY 1 version.
*        In this version the loops are only unrolled to a depth of 4, but a
*        depth of 16 should be considered for a 'production' version.
*        Special code for cases such as alpha = 1.0 and unit increments has
*        not been included, but this could be considered for a 'production'
*        version.
*
*        Form  y := alpha*A*x + y.  The first two loops are 'clean-up'
*        loops for the left over vectors.
*
         JX = KX
         J = MOD( N, 2 )
         IF( J.EQ.1 )THEN
            TEMP1 = ALPHA*X( JX )
            IY = KY
            DO 50, I = 1, M
               Y( IY ) = Y( IY ) + TEMP1*A( I, J )
               IY = IY + INCY
   50       CONTINUE
            JX = JX + INCX
         END IF
         J = MOD( N, 4 )
         IF( ( J.EQ.2 ).OR.( J.EQ.3 ) )THEN
            TEMP1 = ALPHA*X( JX )
            TEMP2 = ALPHA*X( JX + INCX )
            IY = KY
            DO 60, I = 1, M
               Y( IY ) = ( ( Y( IY ) ) +
     $                   TEMP1*A( I, J - 1 ) ) +
     $                   TEMP2*A( I, J )
               IY = IY + INCY
   60       CONTINUE
            JX = JX + 2*INCX
         END IF
*
*        The main loop runs over groups of four vectors.
*
         JSTART = J + 4
         DO 80, J = JSTART, N, 4
            TEMP1 = ALPHA*X( JX )
            TEMP2 = ALPHA*X( JX + INCX )
            TEMP3 = ALPHA*X( JX + 2*INCX )
            TEMP4 = ALPHA*X( JX + 3*INCX )
            IY = KY
            DO 70, I = 1, M
               Y( IY ) = ( ( ( ( Y( IY ) ) +
     $                   TEMP1*A( I, J - 3 ) ) +
     $                   TEMP2*A( I, J - 2 ) ) +
     $                   TEMP3*A( I, J - 1 ) ) +
     $                   TEMP4*A( I, J )
               IY = IY + INCY
   70       CONTINUE
            JX = JX + 4*INCX
   80    CONTINUE
.vs 16p
\f2Version 3.\f1
.vs 11p
*
*        The following code illustrates a version for short word length
*        machines for which speed is not essential.  Accumulated inner
*        products are used.  DTEMP has been declared double precision.
*
*        Form  y := alpha*A*x + y.
*
         IY = KY
         DO 40, I = 1, M
            DTEMP = ZERO
            JX = KX
            DO 30, J = 1, N
               DTEMP = DTEMP + A( I, J )*DBLE( X( JX ) )
               JX = JX + INCX
   30       CONTINUE
            Y( IY ) = ALPHA*DTEMP + Y( IY )
            IY = IY + INCY
   40    CONTINUE
.vs 16p
.fi
.ps 12
.sp 2
.cs 1
.bp
.SH
Appendix B
.PP
In this appendix we illustrate how to use the full matrix update routines to obtain rank-one and rank-two updates to band matrices.
We assume that the vectors x and y are such that no fill-in occurs outside the band, in which case the update affects only a full rectangle within the band matrix A.
This is illustrated in Fig. B.1 for the case where @m=n=9@, @kl=2@, @ku=3@ and the update commences in row (and column) @l=3@.
.cs 1 24
.nf
   ( * * * *           )       ( 0 )  ( 0 0 * * * * 0 0 0 )
   ( * * * * *         )       ( 0 )
   ( * * |* * * *|     )       ( * )
   (   * |* * * *| *   )       ( * )
   (     |* * * *| * * )   +   ( * )
   (       * * * * * * )       ( 0 )
   (         * * * * * )       ( 0 )
   (           * * * * )       ( 0 )
   (             * * * )       ( 0 )
.fi
.cs 1
.EQ
A ~+~ xy ~ sup T
.EN
Fig. B.1
We see that the update affects only that part of A indicated by the dotted lines, that is, the (@kl@+1) by (\f2ku\f1+1) part of @A@ starting at @a sub ll @.
The routines that we have omitted are GBR, SBR, and SBR2 (in the complex case HBR and HBR2).
Their parameter lists could have been
.sp
.nf
GBR ( M, N, KL, KU, L, ALPHA, X, INCX, Y, INCY, A, LDA )
SBR ( UPLO, N, K, L, ALPHA, X, INCX, A, LDA )
SBR2( UPLO, N, K, L, ALPHA, X, INCX, Y, INCY, A, LDA )
.fi
.sp
where the parameter L denotes the starting row and column for the
update, and the elements @x sub l@ and @y sub l@ of the vectors @x@
and @y@ are in elements X(1) and Y(1) of the arrays X and Y.
Calls to SGBR can be achieved by
.sp
.nf
KM = MIN( KL+1, M-L+1 )
KN = MIN( KU+1, N-L+1 )
CALL SGER( KM, KN, ALPHA, X, INCX, Y, INCY,
           A( KU+1, L ), MAX( KM, LDA-1 ) )
.fi
.sp
Calls to SSBR can be achieved by
.sp
.nf
KN = MIN( K+1, N-L+1 )
IF( UPLO.EQ.'U' )THEN
   CALL SSYR( 'U', KN, ALPHA, X, INCX, A( K+1, L ), MAX( 1, LDA-1 ) )
ELSE
   CALL SSYR( 'L', KN, ALPHA, X, INCX, A( 1, L ), MAX( 1, LDA-1 ) )
END IF
.fi
.sp
and similarly for calls to SSBR2.
.bp
.SH
Appendix C
.PP
In this appendix we propose an additional set of real and complex
single precision level 2 routines which allow extended precision
matrix-vector operations to be performed.
.PP
The names of these routines are obtained by preceding the character
representing the Fortran data type (S or C) by the character E.
These routines are to perform the operations described in section 2
as follows.
.sp 2
.PP
For the matrix-vector operations
.sp
.EQ I
y ~<-~ alpha A x ~+~ beta y, ~~
y ~<-~ alpha A sup T x ~+~ beta y, ~~
y ~<-~ alpha A bar sup T x ~+~ beta y
.EN
@alpha , ~ beta , ~ A,@ and @x@ are single precision, @y@ is double
precision, and the computation of @y@ is to be performed in at least
double precision.
.sp
.PP
For the triangular operations
.sp
.EQ I
x ~<-~ T x, ~~ x ~<-~ T sup T x, ~~ x ~<-~ T bar sup T x,
.EN
.EQ I
x ~<-~ T sup -1 x, ~~ x ~<-~ T sup -T x, ~~ x ~<-~ T bar sup -T x
.EN
.sp
@T@ is single precision, @x@ is double precision, and the computation
of @x@ is to be performed in at least double precision.
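The intent of the mixed-precision matrix-vector operation can be
sketched in C (an illustrative analogue of the semantics only; the
E-prefixed routines themselves are Fortran, and the name
`esgemv_n` is a hypothetical helper for the non-transposed case):

```c
/* Illustrative sketch of y := alpha*A*x + beta*y with single
   precision alpha, beta, A (column-major, leading dimension lda),
   and x, double precision y, accumulating inner products in
   double precision as the proposal requires. */
void esgemv_n(int m, int n, float alpha, const float *a, int lda,
              const float *x, float beta, double *y)
{
    for (int i = 0; i < m; i++) {
        double temp = 0.0;          /* double precision accumulator */
        for (int j = 0; j < n; j++)
            temp += (double)a[i + j * lda] * (double)x[j];
        y[i] = (double)alpha * temp + (double)beta * y[i];
    }
}
```

The single precision data are promoted element by element, so each
inner product is formed entirely in double precision, in the spirit
of Version 3 of the SGEMV example in Appendix A.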
.PP
For the rank-one and rank-two updates
.sp
.EQ I
A ~<-~ alpha x y sup T ~+~ A, ~~
A ~<-~ alpha x y bar sup T ~+~ A,
.EN
.EQ I
H ~<-~ alpha x x bar sup T ~+~ H, ~~
H ~<-~ alpha x y bar sup T ~+~ alpha bar y x bar sup T ~+~ H,
.EN
.sp
@alpha@, @A@, and @H@ are single precision, @x@ and @y@ are double
precision, and the computation is to be performed in at least double
precision.
.PP
The types and dimensions of the corresponding Fortran variables (see
section 6) are as follows.
.PP
For all routines whose first two letters are ES:
.sp
.nf
REAL ALPHA, BETA
REAL A( LDA, * )
REAL AP( * )
DOUBLE PRECISION Y( * )
.fi
.sp
.PP
For routines ESGEMV, ESGBMV, ESSYMV, ESSBMV, and ESSPMV:
.sp
.nf
REAL X( * )
.fi
.sp
but for all the other ES routines:
.sp
.nf
DOUBLE PRECISION X( * )
.fi
.PP
For all routines whose first two letters are EC:
.sp
.nf
COMPLEX ALPHA
COMPLEX BETA
COMPLEX A( LDA, * )
COMPLEX AP( * )
COMPLEX*16 Y( * )   or   DOUBLE COMPLEX Y( * )
.fi
.sp
except that for ECHER and ECHPR the scalar @alpha@ is real, so that
the first declaration is replaced by:
.sp
.nf
REAL ALPHA
.fi
.sp
.PP
For routines ECGEMV, ECGBMV, ECHEMV, ECHBMV, and ECHPMV:
.sp
.nf
COMPLEX X( * )
.fi
.sp
but for all the other EC routines:
.sp
.nf
COMPLEX*16 X( * )   or   DOUBLE COMPLEX X( * )
.fi
.sp
.PP
We intend to make a set of these routines available together with the
routines of the main proposal, and the test program will include
tests of these extended precision level 2 BLAS.
.PP
These extended precision routines require just the COMPLEX*16
extension to ANSI Fortran 77.
An equivalent set of double precision routines would require further
data types to be available, and so we have not included these as part
of this proposal.
However, many systems do have an extended precision real data type
beyond double precision, such as REAL*16, and similarly some systems
have an additional complex data type such as COMPLEX*32.
Thus implementors may wish to provide real and complex double
precision versions of these routines, in which case we recommend that
the names of the routines again be obtained by preceding the
character representing the Fortran data type (D or Z) by the
character E.
We urge implementors of the double precision routines to describe
clearly the arithmetic used for the internal computation.
.PP
We thank Velvel Kahan for insisting that we think about the extended
precision issue.
.bp
.SH
Appendix D
.PP
This appendix contains the calling sequences for all the proposed
level 2 BLAS.
.ps 8
.cs 1 24
.nf
name   options            dim    b-width  scalar  matrix   x-vector   scalar  y-vector

_GEMV( TRANS,             M, N,           ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_GBMV( TRANS,             M, N,  KL, KU,  ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_HEMV( UPLO,                 N,           ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_HBMV( UPLO,                 N,  K,       ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_HPMV( UPLO,                 N,           ALPHA,  AP,      X, INCX,   BETA,   Y, INCY )
_SYMV( UPLO,                 N,           ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_SBMV( UPLO,                 N,  K,       ALPHA,  A, LDA,  X, INCX,   BETA,   Y, INCY )
_SPMV( UPLO,                 N,           ALPHA,  AP,      X, INCX,   BETA,   Y, INCY )
_TRMV( UPLO, TRANS, DIAG,    N,                   A, LDA,  X, INCX )
_TBMV( UPLO, TRANS, DIAG,    N,  K,               A, LDA,  X, INCX )
_TPMV( UPLO, TRANS, DIAG,    N,                   AP,      X, INCX )
_TRSV( UPLO, TRANS, DIAG,    N,                   A, LDA,  X, INCX )
_TBSV( UPLO, TRANS, DIAG,    N,  K,               A, LDA,  X, INCX )
_TPSV( UPLO, TRANS, DIAG,    N,                   AP,      X, INCX )

name   options   dim    scalar  x-vector  y-vector  matrix

_GER ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )
_HER ( UPLO, N, ALPHA, X, INCX, A, LDA )
_HPR ( UPLO, N, ALPHA, X, INCX, AP )
_HER2( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
_HPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
_SYR ( UPLO, N, ALPHA, X, INCX, A, LDA )
_SPR ( UPLO, N, ALPHA, X, INCX, AP )
_SYR2( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )
_SPR2( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )
.fi
.cs 1
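As an illustration of the semantics behind one line of the table, the
solve routine _TRSV with UPLO = 'U', TRANS = 'N', and DIAG = 'N'
performs @x ~<-~ T sup -1 x@ by back substitution.  A minimal C
sketch of that operation (for exposition only; `trsv_upper` is a
hypothetical helper, not one of the proposed Fortran routines):

```c
/* Solve T*x = b for x, with T n-by-n upper triangular and
   non-unit diagonal, stored column-major with leading dimension
   ldt; x holds b on entry and the solution on exit.  This mirrors
   the operation performed by _TRSV('U','N','N', ...). */
void trsv_upper(int n, const float *t, int ldt, float *x)
{
    for (int j = n - 1; j >= 0; j--) {       /* back substitution */
        x[j] /= t[j + j * ldt];              /* divide by diagonal */
        for (int i = 0; i < j; i++)          /* eliminate column j */
            x[i] -= t[i + j * ldt] * x[j];
    }
}
```

The column-oriented ordering of the elimination loop matches the
matrix-vector orientation that the level 2 BLAS are designed to
expose.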