Table of Contents
  1. Introduction
  2. Content
  3. Installation
  4. Code Generation with M4 macro processor
  5. Testing
  6. Feedback

1. Introduction

   This library of routines is part of a reference implementation for
   the Dense and Banded BLAS routines, along with their
   Extended and Mixed Precision versions, as documented in
   Chapters 2 and 4 of the new BLAS Standard, which is available from:

	http://www.netlib.org/blas/blast-forum/

   EXTENDED PRECISION is only used internally; the input and output
       arguments remain the same as in the existing BLAS.  At present, we
       only allow Single, Double, or Extra internal precision.  Extra
       precision is implemented as double-double precision (128-bit total,
       106-bit significand).  The routines for the double-double precision
       basic arithmetic operations +, -, *, / were developed by David
       Bailey.

       We have designed all our routines assuming that single precision
       arithmetic is actually done in IEEE single precision (32 bits)
       and that double precision arithmetic is actually done in
       IEEE double precision (64 bits). The routines also pass our tests 
       on an Intel machine with 80-bit floating point registers.

   MIXED PRECISION permits some input/output arguments to be of
       different types (mixing real and complex) or precisions (mixing
       single and double).

   The purpose of this implementation is to do a proof of
   concept implementation, showing that the considerable complexity
   of the specification is actually implementable and testable
   with a manageable amount of software. We have not attempted to
   optimize performance, but our code should be as good as 
   straightforward but careful code written by hand.


2. Content

   The BLAS Standard defines language bindings for Fortran 95, Fortran
   77, and C. Here, we have only implemented the C version and
   provided a method for binding to one Fortran 77 ABI.

   In this initial release, we provide the following 11 routines:
	
    Level 1
 	DOT (Inner product)
	SUM (Sum)
	AXPBY (Scaled vector accumulation)
	WAXPBY (Scaled vector addition)
    Level 2
	GEMV (General matrix vector product)
	GBMV (Banded matrix vector product)
	SYMV (Symmetric matrix vector product)
	SBMV (Symmetric banded matrix vector product)
	SPMV (Symmetric matrix vector product, packed format)
	HEMV (Hermitian matrix vector product)
	HBMV (Hermitian banded matrix vector product)
	HPMV (Hermitian matrix vector product, packed format)
	GE_SUM_MV (Summed matrix-vector product)
	TRSV (Triangular solve)
    Level 3
	GEMM (General matrix matrix product)
	SYMM (Symmetric matrix matrix product)
	HEMM (Hermitian matrix matrix product)
   
   All have passed our systematic testing of all
   possible combinations of mixed and extended precision.
   We will eventually include everything in the intersection of 
   Chapter 2 and Chapter 4 with systematic testing.

   2.1 Directory Structure

       This release contains the following directories:

       doc/          - technical report

       m4/           - Directory where m4 macro files and support
                       routines are stored
       m4/dot       \
       m4/sum        } Directories for each function
       m4/,..       /
       m4/test-dot  \
       m4/test-sum   } Directories for each test function
       m4/...       /

       src/          - Directory where C code is stored
       src/dot      \
       src/sum       } Target directories for C code
       src/...      /

       testing/      - Directory where C code for testing is stored
       testing/test-dot  - DOT test code and results
       testing/test-sum  - SUM test code and results
       testing/...

3. Installation

   The reference XBLAS are built similarly to the reference BLAS and
   LAPACK.  The current build system produces a static libxblas.a.
   
   You need to provide a make.inc file in the source directory that
   defines the compiler, optimization flags, and options for building
   a Fortran->C bridge.  Some examples are provided.
   
   Alternatively, you can use the configure script to attempt to produce 
   a make.inc that is appropriate for your system with your C (and optionally, 
   Fortran) compiler.  You need to issue ./configure in the top-most
   directory.  This will produce a make.inc file that you should check
   before proceeding.  To specify a specific compiler(s) to use, you will
   need to do something like:
     
     CC=my_c_compiler FC=my_fortran_compiler ./configure
   
   M4 is not necessary for the distributed archive; it is only
   necessary if you modify the generator sources under m4/.  See
   README.devel for more information.

   The Fortran->C bridge uses details of a specific toolchain's binary
   interface, in particular how the Fortran compiler mangles names.
   See src/f2c-bridge.h for the available options.  Most compilers can
   support different name mangling schemes; be sure to use the *SAME*
   naming options for all your Fortran code.

   The Fortran->C bridge is included in libxblas.a.  Each of the
   bridge's object files matches *-f2c.o, so you can extract them with
   ar if you need to share one libxblas.a between multiple Fortran
   compilers.  Example steps to strip the Fortran->C routines from
   libxblas.a and place them in a separate libxblas-myfortran.a are as
   follows:

     ar t libxblas.a |fgrep -- -f2c.o | xargs ar x libxblas.a
     ar ru libxblas-myfortran.a *-f2c.o
     ar x libxblas.a *-f2c.o
     rm *-f2c.o

4. Code Generation with M4 macro processor

   In the existing BLAS, there are usually 4 routines associated with
   each operation.  All input, output, and internal variables are
   single or double precision and real or complex.  But under the new
   extended and mixed precision rules (see Chapter 4 for details), the
   input, output and internal variables may have different precisions
   and types. Therefore, the combination of all these types results in
   many more routines associated with each operation. For
   example, DOT will have 32 routines altogether, 4 "standard" versions
   (from Chapter 2) and 28 mixed and extended precision versions
   (from Chapter 4).  In addition, the 16 versions with extended precision
   support up to three internal precisions that can be chosen at runtime.
   We have automated the code and test code generation as much as possible.  
   We use the M4 macro processor to facilitate this task.
   
   The idea is to define a macro for each fundamental operation. The
   macro's argument list contains the variables, accompanied by their
   types and precisions. For example, for the operation c <- a + b, we
   define the following macro:

   	ADD(c, c_type, a, a_type, b, b_type)

   where, x_type can be one of:

	real_single
	real_double
	real_extra
	complex_single
	complex_double
	complex_extra

   Inside the macro body, we use an "if-test" on c_type, a_type and
   b_type, to generate the appropriate code segment for "+".
   (This is similar to operator overloading in C++; but we do it
    manually.) All these if-tests are evaluated at macro-evaluation
   time, and do not appear in the executable code. Indeed, our goal
   was to produce efficient C code, which means minimizing branches
   in inner loops.

   Other macros include SUB, MUL, DIV, DECLARE (variable declaration),
   ASSIGN, etc.

   Since these macros are shared among all the BLAS routines, we put
   them in a common header file, named cblas.m4.h.  Each BLAS routine
   also has its own macro file, such as dot.m4, spmv.m4 and gbmv.m4,
   to generate the specific functions.  All the macro files are
   located in the m4/ subdirectory.

   For example, the inner loop of the M4 macro for the dot product is 
   simply as follows (the M4 parameters $2, $3, and $4 are types):

        for (i = 0; i < n; ++i) {
            GET_VECTOR_ELEMENT(x_ii, x_i, ix, $2) 
                  /* put ix-th element of vector x into x_ii */
            GET_VECTOR_ELEMENT(y_ii, y_i, iy, $3) 
                  /* put iy-th element of vector y into y_ii */
            MUL(prod, $4, x_ii, $2, y_ii, $3) /* prod = x[i]*y[i] */
            ADD(sum, $4, sum, $4, prod, $4) /* sum = sum+prod */
            ix += incx;
            iy += incy;
        } /* endfor */

   The motivation for this macro-based approach is simplifying
   software engineering. For example, the file {\tt dot.m4}
   of M4 macros for the dot product is 401 lines long (245 non-comment lines)
   but expands into 11145 lines in 32 C subroutines implementing
   different versions of DOT. Similarly the macros for TRSV expand from
   732 lines (454 non-comment lines) to 37099 lines in 32 C subroutines.
   (This does not count the shared M4 macros in the file cblas.m4.h.)


5. Testing
  
   The goal of the testing code is to validate the underlying implementation.
   The challenges are twofold: First, we must thoroughly test routines 
   claiming to use extra precision internally, where the test code is not
   allowed to declare any extra precision variables or use any other extra
   precision facilities not available to the code being tested. This requires
   great care in generating test data.  Second, we must use M4 to
   automatically generate the many versions of test code needed for the many
   versions of the code being tested.

   For each BLAS routine, we perform the following steps in the test code:
     1) Generate input scalars, vectors and arrays, according to the 
        routine's specification, so that the result exposes the internal
        precision actually used.
     2) Call the BLAS routine
     3) For each output, compute a "test ratio" of the computed
	error to the theoretical error bound, i.e.,
		| Computed_value - "True_value" | / Error_Bound

   By design, the test ratio should be at most 1. A larger ratio indicates
   that the computed result is either completely wrong, or not as
   accurate as claimed in the specification.

   The following section will discuss how we generate "good" inputs in
   order to reveal the internal precisions actually used. 
   For details, see the paper in file doc/report.ps.

   5.1 Testing DOT

       DOT performs the following function:

		r <- beta * r_in + alpha * (SUM_{i=1,n} x_i*y_i)

       Assume that the result r_computed is computed as follows
          precision eps_int internally,
          precision eps_out when the final result is rounded on output
          underflow threshold UN_int internally, 
          underflow threshold UN_out on output
       and that additionally we compute a very accurate approximation r_acc with
          precision eps_acc  = 2^(-106)   (double-double format)
          underflow threshold UN_acc  = 2^(-968)
       Then the error bound satisfies

(*)	|r_computed-r_acc| <= (n+2)(eps_int + eps_acc)*S + U + eps_out*|r_acc|
                            = Scale
       where

       	   S = |alpha| * (SUM_{i=1,n} |x_i|*|y_i|) + |beta|*|r_in|
           U = (|alpha|*n+2)*(UN_int + UN_acc) + UN_out

       Thus, we can confirm that r_computed has been computed as accurately
       as claimed (i.e. with internal precision defined by eps_int and UN_int)
       by testing whether the 

          test ratio = |r_computed - r_acc| / Scale 

       is at most 1.  Suppose that no underflow occurs, and that 

          eps_intX >> eps_int >= eps_acc.

       where eps_intX is the internal precision actually used in some 
       buggy version of DOT that we want to test.  Then we can expect 
       the test ratio to be as large as

                        (n+2)*(eps_intX + eps_acc)*S + eps_out*|r_acc|
          test ratio ~  ----------------------------------------------
                        (n+2)*(eps_int  + eps_acc)*S + eps_out*|r_acc|

                     
       If we can make r_acc very small, then this ratio will be roughly 
 
          test ratio =  eps_intX / eps_int >> 1

       which means the bug in DOT will be detected, and in fact the test ratio
       will actually tell us how much internal precision we effectively used.

       Thus our goal is to pick test data alpha, x(1:n), y(1:n), beta and r_in
       to make |r_acc| as small as possible, MUCH smaller than S, so that
       eps_int term dominates on the right of the inequality (*).
       Otherwise eps_int will be invisible in the error bound, then
       we cannot tell what precision is used internally. 

       In our test generator, we choose input data alpha, beta, r_in, x(1:n) 
       and y(1:n) judiciously in order to cause as much cancellation in r as 
       possible.

   5.2 Choosing input data and computing r_acc

       The general approach is to choose some of the input values of x(i) and y(i)
       so that the exact (partial) dot product of these values has a lot of
       nonzero fraction bits, preferably at least 106. Then the remaining values of
       x(i) and y(i) are chosen to cancel the previous bits as much as possible.
       This latter computation seems to require high precision.
       
       One possibility to use an arbitrary precision package, such as
       MPFUN, but our goal is to make this test code self contained,
       and use no extra precision facilities not available to the code
       being tested.  Since most of our routines can be reduced to a
       series of dot products, testing can be based on DOT.  We only
       need a TRUSTED dot routine. Currently, we are using our own dot
       routine with double-double internal precision to compute
       r_truth, which is accurate to 106 bits. 
       This means that any internal precision higher than double-double 
       cannot be detected, and may result in a tiny test ratio.
       A very tiny test ratio (such as zero) may also occur if the result
       happens to be computed exactly.

       This raises the possibility that there are "matching" bugs in 
       our trusted DOT and the DOT under test, so that bugs are missed.
       To avoid this possibility we also generate some test examples
       where the cancellation is done mathematically (and so exactly)
       rather than depending on a computation. The idea is simple:
       For example, choose x(1:3) and y(1:3) so that 
            x(1)*y(1) = -x(3)*y(3) >> x(2)*y(2)
       so that SUM_{i=1,3} x(i)*y(i) = x(2)*y(2) exactly.

   5.3 Testing SPMV and GBMV
    
       SPMV, GBMV, and many other BLAS2 routines perform the following function:
		y <- beta * y + alpha * A * x

       Testing it is no more difficult than testing DOT, because each
       component of the computed y vector is a dot product, and
       satisfies the error bound (*). So we simply use the same test
       generator as DOT, and compute a test ratio for each component
       of the y vector. The only tricky part is that some entries 
       in each dot product may be fixed. For example, the first row
       of A and the vector x can be chosen freely, but after that
       x is fixed and, if A is symmetric, the first entry of each
       subsequent row is fixed. Our dot-product test generator handles
       all these cases.

       This approach can be generalized to most other Level 2 and 3 BLAS.

6. Feedback

   Please send any comments or bug reports to extended_blas@cs.berkeley.edu.
   This code was developed by

   Xiaoye Li,
   Jim Demmel, 
   David Bailey,
   Yozo Hida,
   Jimmy Iskandar,
   Anil Kapur,
   Michael Martin,
   Brandon Thompson,
   Teresa Tung,
   Daniel Yoo
   
   with help from Ben Wanzo, Berkat Tung, Weihua Shen, and Jason Riedy.
