Algorithm Optimization Examples
Example A–3 Core Loop of a BLAS 3 Routine Using Matrix-Matrix Operations
xGEMM - computes Y(I,J) = Y(I,J) + X(I,K)*M(K,J)
where x = precision = F, D, or G
MSYNC ;synchronize with scalar
IJLOOP:
VLDx Y(I,J),std,VR0 ;Y(1:N,J) gets loaded into VR0
KLOOP:
VLDx X(I,K),std,VR1 ;X(1:N,K) gets loaded into VR1
VSMULx M(K,J),VR1,VR1 ;VR1 gets VR1 multiplied by
;M(K,J) as a scalar
VVADDx VR0,VR1,VR0 ;VR0 gets VR0 summed with VR1
INC K ;increment K by vector length
IF (K < SIZ) GOTO KLOOP
RESET K ;reset K to 1
VSTx VR0,Y(I,J),std ;VR0 gets stored into Y(I,J)
INC I ;increment I by vector length
IF (I < SIZ) GOTO IJLOOP
INC J ;increment J by vector length
RESET I ;reset I to 1
IF (J < SIZ) GOTO IJLOOP
MSYNC ;synchronize with scalar
The load and store of Y are kept outside the inner
loop to minimize the loads and stores required. Because both rows and
columns are traversed, the algorithm can be blocked for cache size. The
VAX 6000 Model 400 exhibits vector speedups greater than 35 for the 64
by 64 matrix multiplication described above.
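The update order of this loop can be sketched in C as follows. This is a sketch only: the column-major, unit-stride layout and the small SIZ are assumptions chosen to mirror the Fortran-style indexing of the example, and the innermost loop corresponds to the vector operations over I.

```c
#include <stddef.h>

#define SIZ 4  /* small order for illustration; the text uses 64 */

/* Column-sweep GEMM update: Y(:,J) += X(:,K) * M(K,J).
   Arrays are column-major, so y[i + j*SIZ] holds Y(I+1,J+1).
   The load of Y(:,J) and the store back happen once per (I,J)
   sweep, outside the K loop, as in the vector code above. */
static void gemm_update(double *y, const double *x, const double *m)
{
    for (size_t j = 0; j < SIZ; j++)
        for (size_t k = 0; k < SIZ; k++) {
            double s = m[k + j*SIZ];          /* scalar M(K,J)       */
            for (size_t i = 0; i < SIZ; i++)  /* vectorizable over I */
                y[i + j*SIZ] += x[i + k*SIZ] * s;
        }
}
```

Each pass of the K loop performs one vector-scalar multiply and one vector add on a full column, which is exactly the VSMULx/VVADDx pair in the assembly listing.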
Although the overall performance of the 1000 by 1000 benchmark
is lower than that of a single 64 by 64 matrix multiplication, it does
indicate the potential performance when blocking is used. Improving the
performance of this benchmark is particularly challenging because a 1000
by 1000 matrix requires about eight times the 1-Mbyte vector cache.
Further analysis is being conducted to determine the most efficient block
size, one that maximizes the use of BLAS 3 routines while keeping each
block of data within the size of the cache.
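The blocking described above can be sketched in C as a tiled version of the same update. The matrix order N and block size NB here are illustrative assumptions, not the values under analysis; the idea is only that each NB by NB tile of the three matrices is small enough to stay resident in the cache while the BLAS 3 update runs on it.

```c
#include <stddef.h>

#define N  8   /* matrix order; the benchmark uses 1000 */
#define NB 4   /* tile size, chosen so the working tiles fit in cache */

/* Blocked GEMM: Y += X * M, column-major, processed in NB x NB
   tiles. The three inner loops are the same column-sweep update
   as before, restricted to one tile, so the data they touch stays
   cache-resident across the sweep. N is assumed divisible by NB. */
static void gemm_blocked(double *y, const double *x, const double *m)
{
    for (size_t jj = 0; jj < N; jj += NB)
        for (size_t kk = 0; kk < N; kk += NB)
            for (size_t ii = 0; ii < N; ii += NB)
                for (size_t j = jj; j < jj + NB; j++)
                    for (size_t k = kk; k < kk + NB; k++) {
                        double s = m[k + j*N];    /* scalar M(K,J) */
                        for (size_t i = ii; i < ii + NB; i++)
                            y[i + j*N] += x[i + k*N] * s;
                    }
}
```

The result is bitwise identical to the unblocked loop; only the traversal order changes, trading no extra arithmetic for far fewer cache misses once N exceeds the cache capacity.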
The vectorized fraction increases to approximately 98 percent for the
1000 by 1000 benchmark. The proportion of vector arithmetic operations
relative to vector loads and stores is much improved for the BLAS 3
routines. Although the cache size is exceeded, performance more than
doubles when a method is used that blocks data based on the BLAS 3
algorithms. Therefore, the performance of the VAX 6000 Model 400 on
the blocked Linpack 1000