Algorithm Optimization Examples
Example A–3 Core Loop of a BLAS 3 Routine Using Matrix-Matrix Operations
xGEMM - computes Y(I,J) = Y(I,J) + X(I,K)*M(K,J)
where x = precision = F, D, or G
MSYNC ;synchronize with scalar
IJLOOP:
VLDx Y(I,J),std,VR0 ;Y(1:N,J) gets loaded into VR0
KLOOP:
VLDx X(I,K),std,VR1 ;X(1:N,K) gets loaded into VR1
VSMULx M(K,J),VR1,VR1 ;VR1 gets VR1 multiplied by
;M(K,J) as a scalar
VVADDx VR0,VR1,VR0 ;VR0 gets VR0 summed with VR1
INC K ;increment K by vector length
IF (K < SIZ) GOTO KLOOP
RESET K ;reset K to 1
VSTx VR0,Y(I,J),std ;VR0 gets stored into Y(I,J)
INC I ;increment I by vector length
IF (I < SIZ) GOTO IJLOOP
INC J ;increment J by vector length
RESET I ;reset I to 1
IF (J < SIZ) GOTO IJLOOP
MSYNC ;synchronize with scalar
The load and store of Y are kept outside the inner
loop to minimize the loads and stores required. Because both rows and
columns are traversed, the algorithm can be blocked for cache size. The
VAX 6000 Model 400 exhibits vector speedups greater than 35 for the 64
by 64 matrix multiplication described above.
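The update order of this loop can be sketched in C as follows. This is a sketch only: the column-major, unit-stride layout and the small SIZ are assumptions chosen to mirror the Fortran-style indexing of the example, and the innermost loop corresponds to the vector operations over I.

```c
#include <stddef.h>

#define SIZ 4  /* small order for illustration; the text uses 64 */

/* Column-sweep GEMM update: Y(:,J) += X(:,K) * M(K,J).
   Arrays are column-major, so y[i + j*SIZ] holds Y(I+1,J+1).
   The load of Y(:,J) and the store back happen once per (I,J)
   sweep, outside the K loop, as in the vector code above. */
static void gemm_update(double *y, const double *x, const double *m)
{
    for (size_t j = 0; j < SIZ; j++)
        for (size_t k = 0; k < SIZ; k++) {
            double s = m[k + j*SIZ];          /* scalar M(K,J)       */
            for (size_t i = 0; i < SIZ; i++)  /* vectorizable over I */
                y[i + j*SIZ] += x[i + k*SIZ] * s;
        }
}
```

Each pass of the K loop performs one vector-scalar multiply and one vector add on a full column, which is exactly the VSMULx/VVADDx pair in the assembly listing.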
Although the overall performance of the 1000 by 1000 benchmark
is lower than that of a single 64 by 64 matrix multiplication, it does
indicate the potential performance when blocking is used. Improving the
performance of this benchmark is particularly challenging because a 1000
by 1000 matrix requires about eight times the 1-Mbyte vector cache.
Further analysis is being conducted to determine the most efficient block
size, one that maximizes the use of BLAS 3 routines while keeping each
block of data within the size of the cache.
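The blocking described above can be sketched in C as a tiled version of the same update. The matrix order N and block size NB here are illustrative assumptions, not the values under analysis; the idea is only that each NB by NB tile of the three matrices is small enough to stay resident in the cache while the BLAS 3 update runs on it.

```c
#include <stddef.h>

#define N  8   /* matrix order; the benchmark uses 1000 */
#define NB 4   /* tile size, chosen so the working tiles fit in cache */

/* Blocked GEMM: Y += X * M, column-major, processed in NB x NB
   tiles. The three inner loops are the same column-sweep update
   as before, restricted to one tile, so the data they touch stays
   cache-resident across the sweep. N is assumed divisible by NB. */
static void gemm_blocked(double *y, const double *x, const double *m)
{
    for (size_t jj = 0; jj < N; jj += NB)
        for (size_t kk = 0; kk < N; kk += NB)
            for (size_t ii = 0; ii < N; ii += NB)
                for (size_t j = jj; j < jj + NB; j++)
                    for (size_t k = kk; k < kk + NB; k++) {
                        double s = m[k + j*N];    /* scalar M(K,J) */
                        for (size_t i = ii; i < ii + NB; i++)
                            y[i + j*N] += x[i + k*N] * s;
                    }
}
```

The result is bitwise identical to the unblocked loop; only the traversal order changes, trading no extra arithmetic for far fewer cache misses once N exceeds the cache capacity.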
The vectorized fraction increases to approximately 98 percent for the
1000 by 1000 benchmark. The proportion of vector arithmetic operations
relative to vector loads and stores is much improved for the BLAS 3
routines. Although the cache size is exceeded, performance more than
doubles when a method is used that blocks data based on the BLAS 3
algorithms. Therefore, the performance of the VAX 6000 Model 400 on
the blocked Linpack 1000