Algorithm Optimization Examples
Example A–2 Core Loop of a BLAS 2 Routine Using Matrix-Vector Operations
xGEMV computes Y(I) = Y(I) + X(J)*M(I,J)
where x = precision = F, D, or G
MSYNC ;synchronize with the scalar processor
CLR I ;initialize row index
ILOOP:
VLDx Y(I),std,VR0 ;load a strip of Y(I) into VR0
CLR J ;initialize column index
JLOOP:
VLDx M(I,J),std,VR1 ;load a column strip of M(I,J) into VR1
VSMULx X(J),VR1,VR2 ;VR2 gets VR1 scaled by
;the scalar X(J)
VVADDx VR0,VR2,VR0 ;accumulate VR2 into VR0
INC J
IF (J < SIZ) GOTO JLOOP ;loop for all values of J
VSTx VR0,Y(I),std ;store VR0 back into Y(I)
INC I
IF (I < SIZ) GOTO ILOOP ;loop for all values of I
MSYNC ;synchronize with the scalar processor
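Restated at a higher level, the loop sweeps the columns of M, scales
each column by the matching element of X, and accumulates the result
into Y. A minimal C sketch of the same computation follows
(column-major storage with leading dimension ldm, as in Fortran; the
loop order is simplified relative to the strip-mined vector code
above, and the name gemv_update is illustrative):

/* Matrix-vector update: y = y + M*x for an n-by-n matrix M stored
 * column-major with leading dimension ldm.  Each pass of the j loop
 * scales column j of M by the scalar x[j] and adds it into y,
 * mirroring the VSMULx/VVADDx pair above; the i loop is the part a
 * vector unit would execute in register-length strips. */
void gemv_update(int n, const double *m, int ldm,
                 const double *x, double *y)
{
    for (int j = 0; j < n; j++) {
        double xj = x[j];
        for (int i = 0; i < n; i++)
            y[i] += xj * m[i + j * ldm];
    }
}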
There are no set rules to follow when solving the largest problem
size, a system of 1000 simultaneous equations. One potential tool for
optimizing this benchmark is the LAPACK library, developed by Argonne
National Laboratory in conjunction with the University of Illinois
Center for Supercomputing Research and Development (CSRD). The LAPACK
library features equation-solving algorithms that block the data
array into sections that fit into a given cache size. LAPACK calls
not only the BLAS 1 and BLAS 2 routines but also a third level of
BLAS, called the matrix-matrix BLAS or BLAS level 3.
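The central BLAS level 3 routine is the general matrix-matrix
multiply, xGEMM, which computes C = alpha*op(A)*op(B) + beta*C. As a
rough sketch of invoking the double precision version from C (the
trailing underscore and pass-by-reference arguments assume a typical
Fortran calling convention, which varies by compiler; the wrapper
name multiply is illustrative):

/* BLAS level 3 general matrix multiply:
 * C = alpha*op(A)*op(B) + beta*C */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

/* Illustrative wrapper: C = A*B for n-by-n column-major matrices. */
void multiply(int n, const double *a, const double *b, double *c)
{
    const double one = 1.0, zero = 0.0;
    dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
}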
Example A–3 shows that a matrix-matrix multiply is at the heart of
one BLAS 3 routine. The matrix multiplication computation can be
blocked for modern architectures with cache memories. Highly efficient
vectorized matrix multiplication routines have been written for the VAX
vector architecture. For example, a double precision 64 by 64 matrix
multiplication achieves over 85 percent of the peak MFLOPS on the
Model 400 system.
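As a rough illustration of the blocking technique, the three nested
loops of a matrix multiply can be tiled so that the submatrices being
combined stay resident in the cache. A minimal C sketch follows, with
the block size NB an assumed tuning parameter and n taken as a
multiple of NB for brevity:

#define NB 32  /* assumed block size: chosen so three NB-by-NB
                * blocks fit in the cache */

/* Blocked matrix multiply: C = C + A*B, all n-by-n, column-major,
 * with n assumed to be a multiple of NB. */
void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    for (int jj = 0; jj < n; jj += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int ii = 0; ii < n; ii += NB)
                /* Multiply one NB-by-NB block; the A and B blocks
                 * are reused while they remain in the cache. */
                for (int j = jj; j < jj + NB; j++)
                    for (int k = kk; k < kk + NB; k++) {
                        double bkj = b[k + j * n];
                        for (int i = ii; i < ii + NB; i++)
                            c[i + j * n] += a[i + k * n] * bkj;
                    }
}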
Performance can be further improved with other methods that increase
the reuse of data while it is contained in the vector registers. For
example, loop unrolling can be done until all the vector registers have
been fully utilized. Partial results can be formed within the
innermost loop, as the sketch below illustrates.
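As an illustration of this idea, the column sweep shown earlier can
be unrolled four ways so that four columns of M and four elements of
X stay live at once, with the partial results accumulating inside the
innermost loop. A minimal C sketch (the function name is
illustrative, and n is assumed to be a multiple of 4):

/* Four-way unrolled column sweep: y = y + M*x.  Four columns of M
 * and four elements of x stay live per pass, and the partial sums
 * accumulate inside the innermost loop. */
void gemv_unrolled(int n, const double *m, int ldm,
                   const double *x, double *y)
{
    for (int j = 0; j < n; j += 4) {
        const double *c0 = m + (j + 0) * ldm;
        const double *c1 = m + (j + 1) * ldm;
        const double *c2 = m + (j + 2) * ldm;
        const double *c3 = m + (j + 3) * ldm;
        double x0 = x[j],     x1 = x[j + 1];
        double x2 = x[j + 2], x3 = x[j + 3];
        for (int i = 0; i < n; i++)
            y[i] += x0 * c0[i] + x1 * c1[i]   /* partial results   */
                  + x2 * c2[i] + x3 * c3[i];  /* formed in one pass */
    }
}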