Algorithm Optimization Examples
Example A–2 Core Loop of a BLAS 2 Routine Using Matrix-Vector Operations
xGEMV computes Y(I) = Y(I) + X(J)*M(I,J)
where x = precision = F, D, or G
MSYNC ;synchronize with the scalar processor
CLR I ;initialize row index
ILOOP:
VLDx Y(I),std,VR0 ;load a strip of Y(I) into VR0
CLR J ;initialize column index
JLOOP:
VLDx M(I,J),std,VR1 ;load a column strip of M(I,J) into VR1
VSMULx X(J),VR1,VR2 ;VR2 gets VR1 scaled by
;the scalar X(J)
VVADDx VR0,VR2,VR0 ;accumulate VR2 into VR0
INC J
IF (J < SIZ) GOTO JLOOP ;loop for all values of J
VSTx VR0,Y(I),std ;store VR0 back into Y(I)
INC I
IF (I < SIZ) GOTO ILOOP ;loop for all values of I
MSYNC ;synchronize with the scalar processor
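Restated at a higher level, the loop sweeps the columns of M, scales
each column by the matching element of X, and accumulates the result
into Y. A minimal C sketch of the same computation follows
(column-major storage with leading dimension ldm, as in Fortran; the
loop order is simplified relative to the strip-mined vector code
above, and the name gemv_update is illustrative):

/* Matrix-vector update: y = y + M*x for an n-by-n matrix M stored
 * column-major with leading dimension ldm.  Each pass of the j loop
 * scales column j of M by the scalar x[j] and adds it into y,
 * mirroring the VSMULx/VVADDx pair above; the i loop is the part a
 * vector unit would execute in register-length strips. */
void gemv_update(int n, const double *m, int ldm,
                 const double *x, double *y)
{
    for (int j = 0; j < n; j++) {
        double xj = x[j];
        for (int i = 0; i < n; i++)
            y[i] += xj * m[i + j * ldm];
    }
}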
There are no set rules to follow when solving the largest problem
size, a system of 1000 simultaneous equations. One potential tool for
optimizing this benchmark is the LAPACK library, developed by Argonne
National Laboratory in conjunction with the University of Illinois
Center for Supercomputing Research and Development (CSRD). The LAPACK
library features equation-solving algorithms that block the data
array into sections that fit into a given cache size. LAPACK calls
not only the BLAS 1 and BLAS 2 routines but also a third level of
BLAS, called the matrix-matrix BLAS or BLAS level 3.
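The central BLAS level 3 routine is the general matrix-matrix
multiply, xGEMM, which computes C = alpha*op(A)*op(B) + beta*C. As a
rough sketch of invoking the double precision version from C (the
trailing underscore and pass-by-reference arguments assume a typical
Fortran calling convention, which varies by compiler; the wrapper
name multiply is illustrative):

/* BLAS level 3 general matrix multiply:
 * C = alpha*op(A)*op(B) + beta*C */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);

/* Illustrative wrapper: C = A*B for n-by-n column-major matrices. */
void multiply(int n, const double *a, const double *b, double *c)
{
    const double one = 1.0, zero = 0.0;
    dgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
}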
Example A–3 shows that a matrix-matrix multiply is at the heart of
one BLAS 3 routine. The matrix multiplication computation can be
blocked for modern architectures with cache memories. Highly efficient
vectorized matrix multiplication routines have been written for the VAX
vector architecture. For example, a double precision 64 by 64 matrix
multiplication achieves over 85 percent of the peak MFLOPS on the
Model 400 system.
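As a rough illustration of the blocking technique, the three nested
loops of a matrix multiply can be tiled so that the submatrices being
combined stay resident in the cache. A minimal C sketch follows, with
the block size NB an assumed tuning parameter and n taken as a
multiple of NB for brevity:

#define NB 32  /* assumed block size: chosen so three NB-by-NB
                * blocks fit in the cache */

/* Blocked matrix multiply: C = C + A*B, all n-by-n, column-major,
 * with n assumed to be a multiple of NB. */
void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    for (int jj = 0; jj < n; jj += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int ii = 0; ii < n; ii += NB)
                /* Multiply one NB-by-NB block; the A and B blocks
                 * are reused while they remain in the cache. */
                for (int j = jj; j < jj + NB; j++)
                    for (int k = kk; k < kk + NB; k++) {
                        double bkj = b[k + j * n];
                        for (int i = ii; i < ii + NB; i++)
                            c[i + j * n] += a[i + k * n] * bkj;
                    }
}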
Performance can be further improved with other methods that increase
the reuse of data while it is contained in the vector registers. For
example, loop unrolling can be done until all the vector registers have
been fully utilized. Partial results can be formed within the
innermost loop, as the sketch below illustrates.
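As an illustration of this idea, the column sweep shown earlier can
be unrolled four ways so that four columns of M and four elements of
X stay live at once, with the partial results accumulating inside the
innermost loop. A minimal C sketch (the function name is
illustrative, and n is assumed to be a multiple of 4):

/* Four-way unrolled column sweep: y = y + M*x.  Four columns of M
 * and four elements of x stay live per pass, and the partial sums
 * accumulate inside the innermost loop. */
void gemv_unrolled(int n, const double *m, int ldm,
                   const double *x, double *y)
{
    for (int j = 0; j < n; j += 4) {
        const double *c0 = m + (j + 0) * ldm;
        const double *c1 = m + (j + 1) * ldm;
        const double *c2 = m + (j + 2) * ldm;
        const double *c3 = m + (j + 3) * ldm;
        double x0 = x[j],     x1 = x[j + 1];
        double x2 = x[j + 2], x3 = x[j + 3];
        for (int i = 0; i < n; i++)
            y[i] += x0 * c0[i] + x1 * c1[i]   /* partial results   */
                  + x2 * c2[i] + x3 * c3[i];  /* formed in one pass */
    }
}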