CSC Cray T3E User’s Guide Juha Haataja and Ville Savolainen (eds.
All rights reserved. The PDF version of this book or parts of it can be used in Finnish universities as course material, provided that this copyright notice is included. However, this publication may not be sold or included as part of other publications without permission of the publisher. © Authors and CSC – Tieteellinen laskenta Oy 1998 2nd edition ISBN 952-9821-43-3 http://www.csc.
Preface

This is the second edition of a user’s guide to the Cray T3E massively parallel supercomputer installed at the Center for Scientific Computing (CSC), Finland. The first edition of this guide was written by Juha Haataja, Yrjö Leino, Jouni Malinen, Kaj Mustikkamäki, Jussi Rahola, and Sami Saarinen. The second edition was written by Juha Haataja, Jussi Heikonen, Yrjö Leino, Jouni Malinen, Kaj Mustikkamäki, and Ville Savolainen.
Contents

Preface
1 Introduction
    1.1 How to use this guide
    1.2 Usage policy
    1.3 Overview of the system
    1.4 Programming environment
    1.5 Programming tools and libraries
    1.6 Notation used in this guide
    1.7 Sources for further information
2 Using the Cray T3E at CSC
    2.1 Logging in
    2.2 Files
5 Fortran programming
    5.1 The Fortran 90 compiler
    5.2 Basic usage
    5.3 Fixed and free format source code
    5.4 Compiler options
    5.5 Optimization options
    5.6 Optimizing for cache
    5.7 Compiler directives
    5.8 Fortran 90 modules
    5.9 Source code preprocessing
    5.10 More information
B Glossary
C Metacomputer Environment
Bibliography
Index
Chapter 1 Introduction

This chapter gives a short introduction to the Cray T3E system. We also describe the policies imposed on using the computer: application forms, scalability testing, and user quotas.

1.1 How to use this guide

This book is divided into ten independent chapters, and it can be used as a handbook. However, we recommend that you browse through at least the first four chapters, which provide a general overview of the Cray T3E system.
1.2 Usage policy

As the Cray T3E is a high-performance computational resource, CSC enforces a usage policy in order to guarantee efficient and fair usage of the computer. When applying for access to the Cray T3E, you are expected to already have a user id on some other computer at CSC. A T3E resource application form can be requested by contacting Ms. Paula Mäki-Välkkilä at CSC, tel. (09) 457 2718, e-mail Paula.Maki-Valkkila@csc.fi.
1.3 Overview of the system

The Cray T3E system at CSC currently has 224 RISC processors for parallel applications. In addition, there are 16 processors for system services and for interactive use. The T3E has a good user and programming environment. The system feels like any Unix computer. You log in to the Internet address t3e.csc.fi and end up on an interactive processor. All processors share a common file system.
Besides the portable MPI and PVM message-passing systems, the high-performance SHMEM library is available. This is a Cray-specific library for parallelization using the “data-passing” or one-sided communication paradigm. See page 70 for further details. In addition to the message-passing and data-passing methods for parallelization, data-parallel programming is also possible on the Cray T3E.
Here the prompt and response of the machine have been typeset with the teletype font, and the user commands are shown in boldface. The generic names given to the commands are indicated with a slanted font type:

    rm file

The optional parts of a command are written inside brackets:

    more [options] [file]

Some commonly used names have been written in the same way as Unix. To introduce new terms, an emphasized text type is used. 1.
The basics of parallel programming are discussed in the textbook Designing and Building Parallel Programs [Fos95]. Another good textbook is Introduction to Parallel Computing: Design and Analysis of Algorithms [KGGK94]. CSC maintains a WWW service called CSC Program Development, which contains examples of parallel codes, an English-Finnish parallel computing dictionary, and some other information. The WWW address is http://www.csc.
Chapter 2 Using the Cray T3E at CSC

This chapter helps you to start using the Cray T3E at CSC: how to log in, where to store files, how to use the compiler and run your codes etc. The usage policy of the machine is discussed in Section 1.2 on page 8.

2.1 Logging in

When logging into the Cray T3E, you will actually get connected to one of the command processors. These are the processing elements (PEs) that are responsible for Unix command processing.
because it uses a secure way to authenticate oneself to the host machine. If you are using an X terminal or an equivalent (a workstation or a microcomputer with software supporting the X Window System), you can establish an X Window System connection to the Cray T3E directly. An easy way to use an X Window System connection to the T3E is to run an ssh connection in a local xterm.
all files before running a job from the home directory tree to the local T3E disk described below. The home directory is suitable only for small initialization files and frequently used small programs. It is not intended for extensive I/O operations or for large data sets. There are three file storage areas available for users. Usually you need not (and should not) refer to directories with full path names.
2.3 Editing files

You can use the Emacs or vi editors on the T3E. To start Emacs, give the command

    emacs [options] [filename]...

Here is an example:

    emacs -nw main.f90

However, because your home directory is shared with other computers at CSC, you can do your editing on some other system. We recommend this approach, because it minimizes the interactive load on the T3E. You get a short introduction to Emacs in Finnish by giving the command

    help emacs

2.
Interactive jobs can use at most 16 processors and 30 minutes of parallel CPU time.

2.5 Executing in batch mode

Batch jobs on all CSC computers are handled by the NQE system (Network Queuing Environment). A more detailed description of this system is given in Chapter 8 (page 81). You can run a batch job by submitting an NQE request, which is a shell script that contains NQE commands and options, shell commands, and input data.
environment. After this, the commands in the script are executed using /bin/sh. This can be overridden using the option -s shell_name. Here follows an example script, which is written into a file called t3e.job. The request name is set to simulation. The job file reserves at most six processors (option -l mpp_p=6), and the approximate maximum wall clock time is 600 seconds (-l mpp_t=600). Standard error is concatenated with the standard output (option -eo).
for some information in English. Cray has published several manuals, which help in using the T3E. On-line versions of the manuals are found at the Web address

    http://www.csc.fi:8080

The most useful manuals are the following:

• CF90 Commands and Directives Reference Manual [Craa]
• Cray T3E Fortran Optimization Guide [Crac]
• Cray C/C++ Reference Manual [Crab].
Chapter 3 The Cray T3E system

This chapter reviews the Cray T3E hardware and operating system.

3.1 Hardware overview

The Cray T3E system consists of the following hardware components:

• processing elements
• interconnect network
• I/O controllers
• external I/O nodes.

This section briefly presents each of the system components and their interactions.
This configuration may change in the future. Use the command grmview to find out the current situation.

3.2 Distributed memory

The T3E has a physically distributed and logically shared memory architecture. Access to the local memory inside the current processing element is faster than access to remote memory. Essentially, the T3E is a MIMD (Multiple Instruction, Multiple Data) computer, although it also supports the SIMD (Single Instruction, Multiple Data) programming style.
[Figure 3.1: The components of a Cray T3E node: microprocessor, support circuitry, local memory, and a network router with links in the +X, -X, +Y, -Y, +Z and -Z directions.]

This is a problem only if you are using the SHMEM library for communication. The MPI library, for example, handles the streams mechanism properly. The streams mechanism can be disabled or enabled on the user level by using the environment variable $SCACHE_D_STREAMS.
Attribute                                  Value
Processor type                             DEC Alpha 21164
Physical address base                      40 bits
Virtual address base                       43 bits
Clock rate on the T3E                      375 MHz
Peak floating-point rate                   750 Mflop/s
Peak instruction issue rate                4 (2 floating-point + 2 integer)
Size of the on-chip instruction cache      8 kB
Size of the on-chip level 1 data cache     8 kB
Size of the on-chip level 2 data cache     96 kB

Table 3.1: Characteristics of the DEC Alpha 21164 processor.

3.
3.5 Local memory hierarchy

The local four-level memory hierarchy of the processing elements is shown in Figure 3.3. Nearest to the execution units are the registers. The caches for instructions and data (ICACHE and DCACHE) are 8 kB each. The second-level cache, SCACHE (96 kB in total), is on the Alpha chip. The fourth level of the memory hierarchy is the main (DRAM) memory (128 MB).
two words in each cp. An SCACHE line is 64 bytes. Therefore, data is moved in consecutive blocks of 64 bytes from the main memory. When you are optimizing your code, the most important thing is to optimize the usage of the DCACHE. Almost as important is to optimize the usage of the SCACHE. For the reasons mentioned above, try to avoid step sizes of 8 kB or 32 kB when you are referencing memory.
[Figure 3.4: A routing example through the 3D torus network of the T3E. The route from the source node to the destination node takes three steps (1, 2, 3) along the X, Y and Z directions.]

Addressing of remote memory is managed by the External Register Set, or E-registers. Latency hiding and synchronization are integrated in 512 + 128 off-chip memory-mapped E-registers.
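The routing example of Figure 3.4 can be illustrated with a small C sketch that counts the minimum number of hops between two nodes of a 3D torus. This is only an illustration: the function names, the coordinate arrays and the torus dimensions are hypothetical, and the real routing hardware may choose its path differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hops along one dimension of a ring with `size` nodes. A message may
   travel in either direction, so the shorter way around is taken. */
int ring_hops(int from, int to, int size)
{
    assert(size > 0 && from >= 0 && from < size && to >= 0 && to < size);
    int direct = abs(to - from);
    int wrapped = size - direct;
    return direct < wrapped ? direct : wrapped;
}

/* Minimum number of hops between two nodes of a 3D torus whose
   dimensions are given in dims (hypothetical values, not the CSC
   configuration). */
int torus_hops(const int src[3], const int dst[3], const int dims[3])
{
    int hops = 0;
    for (int d = 0; d < 3; d++)
        hops += ring_hops(src[d], dst[d], dims[d]);
    return hops;
}
```

Because of the wraparound links, the hop count in each dimension never exceeds half of the dimension size.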
3.7 External I/O

The T3E system has four processing elements per I/O controller, and one out of every two I/O controllers is connected to a GigaRing controller. These controllers can be connected to external I/O clients through high-speed GigaRing channels. Figure 3.5 illustrates the I/O path from a PE to an external disk device.
3.8 The UNICOS/mk operating system

The Cray T3E has a distributed, microkernel-based operating system. This provides a single system image of the global system to the user. UNICOS/mk is a Unix-like operating system based on Cray’s UNICOS system, which runs on parallel vector processor (PVP) platforms such as the Cray C90. The microkernel is based on the CHORUS technology.
The T3E file systems at CSC are located on striped FiberChannel disks residing in one GigaRing, which is attached to a Multi Purpose Node (MPN). The total disk capacity is over 300 GB. Most of the space is allocated for paging (swapping), $TMPDIR and $WRKDIR.

3.10 Resource monitoring

The most useful commands for viewing the global configuration and status of the Cray T3E system are grmview and top.
+ + + + + + + + + + + + + + + + +
APP APP APP OS CMD CMD CMD CMD CMD NUL NUL CMD CMD CMD CMD CMD CMD CMD OS
0xdd 0xde 0xdf 0xe0 0xe1 0xe2 0xe3 0xe4 0xe5 0xe6 0xe7 0xe8 0xe9 0xea 0xeb 0xec 0xed 0xee 0xef
2 2 2 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0
192 192 192 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 unlim unlim unlim unlim unlim 0 0 unlim unlim unlim unlim unlim unlim unlim 0
2 2 2 0 unlim unlim unlim unlim unlim 0 0 unlim unlim unlim unlim unl
You can also use the command ps -PeMf to see what parallel processes are running. Here is an extract from the output:

    F S UID    PID  PPID ... STIME TTY TIME  CMD
    1 R user1  1406 1386 ... 11:17 ?   147:3 ./prog1.x
    1 R user1  1688 1617 ... 11:36 ?   129:5 ./prog2.x
    1 R user2  8817 8809 ... 13:03 ?   44:26 ./prog3.x
    1 R user3  9864 9855 ... 13:22 ?   24:04 ./prog4.x
    1 R user4  9985 9922 ... 13:24 ?   23:15 ./prog5.x

You can compare this with the output of the top command shown above.

3.
Chapter 4 Program development

This chapter shows how to compile and run your programs on the Cray T3E at CSC. Fortran programming is discussed in more detail in Chapter 5 and C/C++ in Chapter 6. Parallel programming (message passing etc.) is discussed in Chapter 7.

4.1 General overview

The Cray T3E environment for program development is automatically initialized upon logging in or startup of a batch job.
The option -Xn or -X n is used to indicate how many processors you want for your application. If you do not provide this option, the program can be run on any number of processors using the mpprun command. This kind of executable is called malleable. Here is a typical example of generating and running a non-malleable executable, which has to be run on a fixed number of PEs:

    t3e% f90 -X 16 -o prog.x prog.f90
    t3e% ./prog.
PBLAS, ScaLAPACK, BLACS, and FFT. The most straightforward way to obtain more information on these libraries is through the man command as follows:

    man intro_lapack

There is a manual page for almost all subroutines in Libsci, the most notable exception being the routines under the PBLAS library. Unfortunately, there are also manual pages for some non-existent routines such as those solving sparse linear systems.
4.3.3 BLACS

Both BLAS and LAPACK are developed for single-processor computations. In order to solve problems of linear algebra on parallel machines where matrices can be distributed over several processors, we need to communicate data between the processors. For this purpose there is a special library called BLACS (Basic Linear Algebra Communication Subroutines). The routines in BLACS can be divided into three classes.
Routines (real, complex):

    PSGETRF  PCGETRF
    PSGETRS  PCGETRS
    PSTRTRS  PCTRTRS
    PSGESV   PCGESV
    PSPOTRF  PCPOTRF
    PSPOTRS  PCPOTRS
    PSPOSV   PCPOSV
    PSGEQRF  PCGEQRF
    PSGERQF  PCGERQF
    PSGEQLF  PCGEQLF
    PSGELQF  PCGELQF
    PSGEQPF  PCGEQPF
    PSGETRI  PCGETRI
    PSTRTRI  PCTRTRI
    PSPOTRI  PCPOTRI
    PSSYTRD  PCHETRD
    PSGEBRD  PCGEBRD
    PSSYEVX  PCHEEVX
    PSSYGVX  PCHEGVX
    INDXG2P
    NUMROC

Explanation:

    LU factorization and solution of general distributed systems of linear equations
    Cholesky factorization and solution of real symmet
4.4 The NAG subroutine library

The NAG library is a comprehensive mathematical subroutine library that has become a de facto standard in the field of numerical programming. NAG routines are not parallelized on the T3E. The installed single-PE version is Mark 17 (July 1998).
You can also use the NAG on-line documentation on Cypress by the command

    naghelp

4.5 The IMSL subroutine library

The IMSL library is another general purpose mathematical subroutine library with two separate parts, MATH/LIBRARY and STAT/LIBRARY. The release installed on the T3E is the IMSL FORTRAN 90 MP Library version 3.0.
The manual Introducing CrayLibs [Crad] contains a summary of Cray scientific library routines. You can use help to get some information about the IMSL and NAG libraries in the CSC environment with the commands

    help imsl
    help nag

You can also use the NAG and IMSL help systems on other computers at CSC as described in Sections 4.4 and 4.5.
Chapter 5 Fortran programming

The Cray T3E offers a Fortran 90 compiler which can be used to compile standard-conforming FORTRAN 77 programs as well. This chapter discusses the most essential compiler features. Parallel programming is described in Chapter 7. Programming tools are discussed in Chapter 9.

5.1 The Fortran 90 compiler

The Cray T3E Fortran 90 compiler (CF90) supports a full implementation of the ANSI and ISO Fortran 90 standard.
5.2 Basic usage

The CF90 compiler is invoked using the command f90 followed by optional compiler options and the filenames to be compiled:

    t3e% f90 [options] filenames

If the -c option is not specified, the f90 command will automatically invoke the linker to create an executable program. You can compile and link in a single step:

    t3e% f90 -o prog.x prog.f90 sub.f90

Here the source code files prog.f90 and sub.f90 were compiled into the executable program prog.x.
File extension   Type                              Notes
.f               Fixed source form (72 columns)    No preprocessing
.f90             Free source form (132 columns)    No preprocessing
.F               Fixed source form (72 columns)    Preprocessing
.F90             Free source form (132 columns)    Preprocessing
.o               Object file                       Passed to linker
.a               Object library file               Passed to linker
.s               Assembly language file            Passed to assembler

Table 5.1: The interpretation of some filename extensions.

5.
Option                                           Explanation
-c                                               Compile only, do not attempt to link
-r2                                              Request for standard listing file (.
-r6 -rm -i32 -s default32 -dp -On -O3 -Osplit2 -Ounroll2 -Oaggress -Obl -g
Option      Explanation
-dn, -en    Report nonstandard code
-dp, -ep    Use double precision
-er, -dr    Round multiplication results
-du, -eu    Round division results upwards
-dv, -ev    Static storage
-dA, -eA    Use the Apprentice tool
-dI, -eI    IMPLICIT NONE statement
-dR, -eR    Recursive procedures
-dP, -eP    Preprocessing, no compilation
-dZ, -eZ    Preprocessing and compilation

Table 5.3: Enabling or disabling some compiler features. The default option is listed first.

5.
You can improve the performance by padding the arrays so that the corresponding elements do not map to the same cache lines:

    INTEGER, PARAMETER :: n = 4096, pad = 8
    REAL, DIMENSION(n+pad) :: a, b
    REAL, DIMENSION(n) :: c
    COMMON /my_block/ a, b, c

The rest of the code is identical.
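The effect of the padding can be checked with a small calculation, sketched here in C. The sketch rests on assumptions that are not stated on this page: REAL elements of 8 bytes, a direct-mapped 8 kB DCACHE with 32-byte lines, and an SCACHE in which each way covers 32 kB with 64-byte lines. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the cache line slot that a byte offset maps to, for a cache
   in which one way holds way_bytes bytes in lines of line_bytes bytes. */
size_t cache_slot(size_t byte_offset, size_t line_bytes, size_t way_bytes)
{
    assert(line_bytes > 0 && way_bytes % line_bytes == 0);
    return (byte_offset % way_bytes) / line_bytes;
}

/* Nonzero if element i of two arrays stored back to back (as in the
   COMMON block above) competes for the same slot. first_len is the
   length of the first array in elements, padding included. */
int arrays_collide(size_t first_len, size_t elem_bytes, size_t i,
                   size_t line_bytes, size_t way_bytes)
{
    size_t offset_a = i * elem_bytes;
    size_t offset_b = (first_len + i) * elem_bytes;
    return cache_slot(offset_a, line_bytes, way_bytes) ==
           cache_slot(offset_b, line_bytes, way_bytes);
}
```

Under these assumptions, unpadded arrays of 4096 elements are exactly 32 kB apart, so a(i) and b(i) collide in both caches, whereas a padding of 8 elements (64 bytes) shifts b(i) to a different slot.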
    !dir$ split
    DO i = 1, 1000
      a(i) = b(i) * c(i)
      t = d(i) + a(i)
      e(i) = f(i) + t * g(i)
      h(i) = h(i) + e(i)
    END DO

The directive is marked with the characters !dir$. If the source code is written using the fixed source form, these characters must be at the beginning of the line.
        a(j,i) = b(j,i) + 1
        a(j,i+1) = b(j,i+1) + 1
      END DO
    END DO

Here we used the inverse operation of loop splitting to decrease the overhead due to loop control. Bottom loading is an effective technique for overlapping loop control and loading of operands for the next iteration of the loop.
The symmetric directive is useful when using the SHMEM communications library.

    !dir$ symmetric [var [, var]...]

This directive declares that a PE-private stack variable has the same local address on all PEs. For more information on the SHMEM library routines, issue the command man intro_shmem. See also Section 7.4 on page 70. The directives

    !dir$ free
    !dir$ fixed

allow you to select the form of the source code within a file.
    F90= f90

    iterate: $(OBJS)
    	$(F90) -o $@ $(OBJS)

    iterate.o: myprec.o cg.o matrix.o
    cg.o: myprec.o matrix.o
    matrix.o: myprec.o

    .SUFFIXES: .f90
    .f90.o:
    	$(F90) $(OPTS) $<

    clean:
    	rm -f *.o iterate

If the module files (.o or .a) are not in the current directory, one can use the -p path option of the f90 command to include additional search paths and/or module files. It is common to place the modules in an archive so that they can be used in several programs.
As an example, consider a simple code that computes and prints a root of a polynomial using IMSL routines ZREAL and WRRRN. The code is written so that it can be run on both T3E and Caper (DEC AlphaServer at CSC), on which the preprocessor replaces the single precision IMSL calls with the corresponding double precision versions by defining macros. Moreover, the variable info is printed if INFO is defined.
and four iterations were performed.

5.10 More information

CSC has published a textbook on Fortran 90 [HRR96]. A general introduction to the Unix programming environment is given in the Metacomputer Guide [Lou97]. Both books are written in Finnish. Code optimization is discussed in the Cray manual Cray T3E Fortran Optimization Guide [Crac]. Compiler directives are explained in the manual CF90 Commands and Directives Reference Manual [Craa]. The WWW address http://www.
Chapter 6 C and C++ programming

This chapter discusses C and C++ programming on the Cray T3E. Parallel programming is described in Chapter 7 and programming tools are discussed in Chapter 9.

6.1 The Cray C/C++ compilers

The Cray C++ Programming Environment contains both the Cray Standard C and the Cray C++ compilers. The Cray Standard C compiler conforms with the ISO and ANSI standards. The Cray C++ compiler conforms with the ISO/ANSI Draft Proposed International Standard.
The compilation process, if successful, creates an absolute object file, named a.out by default. This binary file, a.out, can then be executed. For example, the following sequence compiles the source file myprog.c and executes the resulting malleable program a.out with eight processors:

    t3e% cc myprog.c
    t3e% mpprun -n 8 ./a.
6.3 Calling Fortran from C

Sometimes you need to call Fortran routines from C programs. In the following, we calculate a matrix product using the routine SGEMM from the Libsci library:

    #include
    #include
    #define DGEMM SGEMM
    #define l 450
    #define m 500
    #define n 550

    main()
    {
      double a[n][l], b[l][m], ct[m][n];
      int ll, mm, nn, i, j, k;
      double alpha = 1.0;
      double beta = 0.
The fact that Fortran stores arrays in reverse order compared to C needs to be taken into account. Therefore, the array ct contains the transpose of the result of the matrix multiplication. This program takes about one second to execute on a 375 MHz processor, which corresponds to an execution speed of about 240 Mflop/s.

6.4 C compiler options

The most typical compiler options are given in Table 6.1.
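The two points made about the matrix product in Section 6.3, the reversed storage order and the quoted execution speed, can be made concrete with a short C sketch. The helper names are hypothetical and the code is only an illustration of the arithmetic, not part of the example program.

```c
#include <assert.h>

/* Floating-point operation count of a matrix product with dimensions
   l, m and n: each of the l*m*n inner steps performs one multiplication
   and one addition. */
double matmul_flops(int l, int m, int n)
{
    return 2.0 * l * m * n;
}

/* Offset of element (row, col), both counted from zero, in a Fortran
   (column-major) array with `rows` rows. */
int fortran_offset(int row, int col, int rows)
{
    assert(row >= 0 && row < rows && col >= 0);
    return col * rows + row;
}

/* Offset of element [i][j] in a C (row-major) array with `cols` columns. */
int c_offset(int i, int j, int cols)
{
    assert(i >= 0 && j >= 0 && j < cols);
    return i * cols + j;
}
```

With l = 450, m = 500 and n = 550 the product needs 2 · 450 · 500 · 550 = 247.5 million floating-point operations, which is consistent with the quoted speed for a run time of about one second. The two offset functions show why the result appears transposed: the C element [i][j] occupies the same memory location as the Fortran element with the indices swapped.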
6.5 C compiler directives (#pragma)

The #pragma directives are used within the source program to request certain kinds of special processing. The #pragma directives are extensions to the C and C++ standards. They are classified according to the following types:

• general
• template instantiation (Cray C++ only)
• scalar
• tasking
• inlining.

You can control the compiler analysis of your source code by using #pragma directives.
In the previous format, var_list represents a list of variable names separated by commas. In C, the cache_align directive can appear before or after the declaration of the named objects. In C++, it must appear after the declaration of all named objects.

noreduction

The noreduction compiler directive tells the compiler not to optimize the loop that immediately follows the directive as a reduction loop. If the loop is not a reduction loop, the directive is ignored.
The split directive merely asserts that the loop can profit by splitting. It will not cause incorrect code. The compiler splits the loop only if it is safe. Generally, a loop is safe to split under the same conditions that a loop is vectorizable. The compiler only splits inner loops, but it may not split loops with conditional code. The split directive also causes the original loop to be stripmined, and therefore the data is processed in blocks small enough to fit in the cache.
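What splitting means can be sketched by hand in standard C, using the same loop body as the Fortran split example in Chapter 5. This is only an illustration of the transformation under the assumption that the arrays do not overlap; it does not show what the Cray compiler actually generates, and it leaves out the stripmining step. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: eight arrays are touched in a single pass. */
void fused_loop(size_t n, double *a, const double *b, const double *c,
                const double *d, double *e, const double *f,
                const double *g, double *h)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * c[i];
        double t = d[i] + a[i];
        e[i] = f[i] + t * g[i];
        h[i] = h[i] + e[i];
    }
}

/* Hand-split version: each loop touches fewer arrays at a time.
   Valid only if the arrays do not overlap in memory. */
void split_loops(size_t n, double *a, const double *b, const double *c,
                 const double *d, double *e, const double *f,
                 const double *g, double *h)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
    for (size_t i = 0; i < n; i++)
        e[i] = f[i] + (d[i] + a[i]) * g[i];
    for (size_t i = 0; i < n; i++)
        h[i] = h[i] + e[i];
}
```

Both versions perform the same arithmetic on each element, so they give the same results; the difference is only in how many data streams are active in each loop.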
symmetric

The symmetric directive declares that an auto or register variable has the same local address on all processing elements (PEs). This is useful for global addressing using the SHMEM library functions. The format for this compiler directive is:

    #pragma _CRI symmetric var...

The symmetric directive must appear in local scope.
The compiler can be directed to attempt to unroll all loops generated for the program with the command-line option -hunroll. The amount of unrolling specified on the unroll directive overrides the amount chosen by the compiler when the command-line option -hunroll is specified.
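The effect of unrolling by a factor of two can be sketched by hand in standard C. The sketch below does not use any Cray directive and is not a description of the code the compiler produces; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: one element per iteration. */
void add_one(size_t n, double *a, const double *b)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
}

/* The same loop unrolled by two by hand: half as many loop-control
   steps, plus a cleanup step for an odd trip count. */
void add_one_unrolled2(size_t n, double *a, const double *b)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     = b[i]     + 1.0;
        a[i + 1] = b[i + 1] + 1.0;
    }
    if (i < n)                  /* leftover iteration when n is odd */
        a[i] = b[i] + 1.0;
}
```

The cleanup step is the part that is easy to forget when unrolling by hand, which is one reason to leave the transformation to the compiler where possible.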
6.6 The C++ compiler

The Cray C++ compiler conforms with the ISO/ANSI Draft Proposed International Standard. A revised version of the standard has recently been accepted as the ISO/ANSI standard. The Cray C++ compiler is invoked by the command CC. The compiler consists of a preprocessor, a language parser, a prelinker, an optimizer and a code generator.
Chapter 7 Interprocess communication

This chapter describes how to use the MPI or PVM message-passing libraries on the Cray T3E at CSC. In addition, the properties of the Cray data-passing library SHMEM are described in some detail. The data-parallel High Performance Fortran (HPF) programming model is introduced, too.

7.1 The communication overhead

Parallelization on the T3E can be done by three different approaches: message-passing, data-passing and data-parallel programming.
and MPI. Latency and bandwidth are not equally transparent to the HPF user, but in general HPF programs are slower than SHMEM and MPI applications. The total bandwidth of the machine is very large due to six bi-directional communication links in each PE. It does not matter much where the computational nodes of your application are physically situated.
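The roles of latency and bandwidth can be illustrated with the usual simple cost model, in which the time to move a message is a fixed startup cost plus the message size divided by the bandwidth. The C sketch below assumes this model; the function names and any numbers you feed in are hypothetical and are not measured T3E values.

```c
#include <assert.h>

/* Estimated time in seconds to move `bytes` bytes: a fixed startup
   cost (latency) plus the size divided by the sustained bandwidth. */
double transfer_time(double latency_s, double bandwidth_bytes_per_s,
                     double bytes)
{
    assert(latency_s >= 0.0 && bandwidth_bytes_per_s > 0.0 && bytes >= 0.0);
    return latency_s + bytes / bandwidth_bytes_per_s;
}

/* Message size at which the startup cost equals the transmission
   cost. Shorter messages are dominated by latency. */
double break_even_bytes(double latency_s, double bandwidth_bytes_per_s)
{
    assert(latency_s >= 0.0 && bandwidth_bytes_per_s > 0.0);
    return latency_s * bandwidth_bytes_per_s;
}
```

In this model, sending many short messages costs one latency each, so combining them into fewer and longer messages reduces the total overhead.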
Correspondingly, for C/C++ programs the format is:

    #include
    void sub(...)
    {
      int return_code;
      ...
      return_code = MPI_Routine(parameter_list);
    }

7.2.2 Some MPI routines

The MPI standard includes more than 120 routines. However, one needs only a few of them for efficient message passing and, at minimum, one can do with six MPI routines. The most important MPI routines are listed in Table 7.1 (the Fortran syntax is shown).
Fortran syntax    Meaning
MPI_INIT(rc)      Initialize the MPI session. This should be the very first call.
                  Terminate the MPI session. This should be the very last call.
                  Get the number of processes in comm.
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, id, rc)
      data = id
      CALL MPI_REDUCE(data, s, 1, MPI_INTEGER, &
           MPI_SUM, 0, MPI_COMM_WORLD, rc)
      CALL MPI_BCAST(s, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, rc)
      WRITE(*,*) 'data:', data, 'sum:', s
      CALL MPI_FINALIZE(rc)
    END PROGRAM example

If this program is in the file collect.f90, it can be compiled and run interactively as follows:

    t3e% f90 -o collect.x collect.f90
    t3e% mpprun -n 8 ./collect.
    t3e% cc -o collect.x collect.c

The program may be executed as in the Fortran 90 case above.

7.2.4 Reducing communication overhead in MPI

On the T3E, it is in some cases faster to use the synchronous send routine MPI_SSEND instead of the standard routine MPI_SEND. The synchronous routine avoids some overhead in buffering the messages, but may cause load imbalance due to synchronization.
Some examples of MPI programs are available in the WWW system, see the address

    http://www.csc.fi/programming/examples/mpi/

7.3 Parallel Virtual Machine (PVM)

PVM (Parallel Virtual Machine) is a message-passing library that is well-suited for heterogeneous computing. It is somewhat older and clumsier to use than MPI.

7.3.1 Using PVM on the T3E

You do not need to use any special linker options to use PVM calls in your program.
          CALL PVMFpack(INTEGER8, j, msglen, stride, rc)
          to = j
          CALL PVMFsend(to, tag, rc)
        END DO
      END IF
      from = 0
      CALL PVMFrecv(from, tag, rc)
      CALL PVMFunpack(INTEGER8, message, msglen, stride, rc)
      WRITE (*,*) 'PE#',mype,': message=',message
    END PROGRAM main

Compile, link and run the program as follows (on three processors):

    t3e% f90 -o pvmprog.x pvmprog.f90
    t3e% mpprun -n 3 ./pvmprog.
    PE# PE# PE# PE# PE# PE#
    PE#2: tid=393218 nproc=3
    PE#0: tid=393216 nproc=3
    PE#1: tid=393217 nproc=3
    PE#0: message=0
    PE#2: message=2
    PE#1: message=1

7.3.3 Further information about PVM

CSC has published a textbook on PVM in Finnish [Saa95]. The Cray implementation of PVM is described in the publication Message Passing Toolkit: PVM Programmer’s Manual [Crah]. Some examples of PVM programs are available in the WWW system, see the address

    http://www.csc.fi/programming/examples/pvm/

7.
Routine      Description
num_pes      Returns the total number of PEs.
shmem_add    Performs an atomic add operation on a remote data object.
             A barrier routine for synchronization purposes.
             Sends a local variable to all other PEs.
             Concatenates data from several PEs to each of them.
             An auxiliary routine for ordering calls to shmem_put.
             Reads from a remote (another PE) memory.
             An auxiliary routine for protecting a part of the memory from simultaneous update by multiple tasks.
7.4.1 Using the SHMEM routines

SHMEM routines can be divided into a few basic categories according to their respective tasks. The point-to-point communication routines transfer data between two PEs, whereas collective routines involve data transfer between several PEs. Reduction routines are used to find out certain properties of data stored in the memories of a group of PEs.
7.4.3 Point-to-point communication

Point-to-point communication is the most widely occurring form of data transfer in parallel computing in shared-memory computers. There are two basic routines for this purpose in the SHMEM library, shmem_get and shmem_put. Since SHMEM routines are one-sided, only one of these is needed to transfer data.
Other point-to-point routines

The routines shmem_put and shmem_get operate correctly when the data being transferred consists of items with a size of 8 bytes. If the size of a single data item is only 4 bytes (32 bits), one must call instead the routines shmem_put4 or shmem_get4 in Fortran or shmem_put32, shmem_get32 in C and C++. Another crucial restriction for the basic versions of shmem_put and shmem_get is that they accept only consecutive data items.
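One generic way to live with the restriction to consecutive items is to pack non-consecutive data into a contiguous buffer before the transfer and to unpack it afterwards. The following C sketch shows only this local packing step in plain C; pack_strided and unpack_strided are hypothetical helper names, not SHMEM routines, and the sketch assumes that long is the 8-byte item type.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n items, taken every `stride` elements from src, into the
   contiguous buffer dst (for example, one column of a row-major
   matrix). */
void pack_strided(long *dst, const long *src, size_t stride, size_t n)
{
    assert(stride >= 1);
    for (size_t k = 0; k < n; k++)
        dst[k] = src[k * stride];
}

/* The reverse operation: scatter n contiguous items from src into
   every `stride`-th element of dst. */
void unpack_strided(long *dst, const long *src, size_t stride, size_t n)
{
    assert(stride >= 1);
    for (size_t k = 0; k < n; k++)
        dst[k * stride] = src[k];
}
```

The packed buffer consists of consecutive items, so it satisfies the restriction described above; the cost is one extra local copy on each side.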
responding call would be

    CALL SHMEM_type_op_TO_ALL(target, source, nreduce, &
         pe_start, logpe_stride, pe_size, pwrk, psync)

and here type is one of {INT8, INT4, REAL8, REAL4, COMP8, COMP4}, and op is one of the operations already mentioned. The call above applies reduction operation op on data of type type at address source in the memories of all PEs involved. The result is stored at address target.
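The arguments pe_start, logpe_stride and pe_size select which PEs are involved in the reduction. As these parameters are usually described in the SHMEM manual pages, the set starts at PE pe_start, contains pe_size PEs, and consecutive members are 2^logpe_stride PE numbers apart. The C sketch below spells out this interpretation; the helper functions are hypothetical and are not part of the SHMEM library.

```c
#include <assert.h>

/* PE number of the k-th member (k = 0, 1, ..., pe_size - 1) of the
   set of PEs taking part in the reduction. */
int active_set_member(int pe_start, int logpe_stride, int pe_size, int k)
{
    assert(pe_start >= 0 && logpe_stride >= 0 && k >= 0 && k < pe_size);
    return pe_start + (k << logpe_stride);
}

/* Nonzero if PE number `pe` belongs to the set. */
int in_active_set(int pe, int pe_start, int logpe_stride, int pe_size)
{
    assert(pe_start >= 0 && logpe_stride >= 0 && pe_size >= 0);
    int stride = 1 << logpe_stride;
    int offset = pe - pe_start;
    return offset >= 0 && offset % stride == 0 && offset / stride < pe_size;
}
```

For example, pe_start = 0, logpe_stride = 1 and pe_size = 4 select the PEs 0, 2, 4 and 6, while logpe_stride = 0 with pe_size equal to the total number of PEs selects all of them.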
7.4.5 Other important routines

There are two very important routines which a parallel program on the T3E will almost certainly call, namely shmem_my_pe and shmem_n_pes. The former tells the calling PE its identity, while the latter returns the total number of PEs in use.
    INCLUDE 'mpp/shmem.
    0 : c = 10 12 14 16
    1 : c = 2 4 6 8

The output of the C program is similar.

7.5 High Performance Fortran (HPF)

The Cray T3E system at CSC has a High Performance Fortran (HPF) compiler. HPF is a developing standard agreed on by several computer and software vendors. The Portland Group HPF (PGHPF) version 2.3 on the T3E also supports the CRAFT programming model used on Cray T3D systems.
• The directive SHARED has been changed to DISTRIBUTE.
• The distribution specification : for a degenerate distribution has been changed to *.
• The distribution specification :BLOCK has been changed to BLOCK.
• The intrinsic function IN_DOSHARED is called IN_INDEPENDENT in HPF_CRAFT.

The PGHPF 2.3 compiler conforms to the HPF standard version 1.1.
    t3e% module load pghpf

The compiler is invoked with the command pghpf. It accepts files ending with .hpf, .f, .F, .for or .f90. Files ending with .F are processed using the C preprocessor. Suppose that the previous program is in the file dot.f90. The program can be compiled and run with the following commands:

    t3e% pghpf -Minfo -Mstats -Mautopar dot.
Chapter 8

Batch queuing system

The batch queuing system ensures an optimum load on the computer and a fair distribution of resources for the users. On the Cray T3E the queuing system is called the Network Queuing Environment (NQE).

8.1 Network Queuing Environment (NQE)

The NQE system makes it possible to submit batch jobs from different computers (client computers) to the target computer (execution server).
    # QSUB -l p_mpp_t=7000
    # QSUB
    cd $HOME/sn6309
    mpprun -n $NPES ./mmloop 6000 6000

First, the given command shell (here /bin/ksh) is used to run the script. The default shell is /bin/sh. These two shells are recommended. The option -q tells the NQE system which queue the job should be sent to. Here we have requested the prime queue. At the moment this is the only queue a normal user on the T3E can submit jobs to.
The batch job is submitted with the command

    qsub [options] jobfile

The output from the command looks like this:

    nqs-181 qsub: INFO Request <2227.sn6309>: Submitted to queue by .

The identifier 2227.sn6309 is the request-id of the job. This can be used to check the status of the job with the qstat command. The most common error is the following:

    nqs-4517 qsub: CAUTION No such queue at local host.
Figure 8.1: An example of a cqstat session.

    ---------------------------------
    NQS 3.3.0.4 BATCH REQUEST SUMMARY
    ---------------------------------
    IDENTIFIER     NAME     USER      LOCATION/QUEUE
    -------------  -------  --------  ---------------------
    18996.sn6309   O-As32r  vsammalk  small@sn6309
    19021.sn6309   CHARMM   santa     small@sn6309
    18999.sn6309   p-IV-g2  jmozos    small@sn6309
    18979.sn6309   AlO2_pb  honkala   medium@sn6309
    18929.sn6309   SIC128.  torpo     medium@sn6309
    18930.sn6309   SIC128.  torpo     medium@sn6309
    18993.
    qstat option   Meaning
    -a             Display summary information for all jobs.
    -b             Display summary information for batch jobs.
    -r             Display summary information for running jobs.
    -m             Display information about MPP queue limits.
    -u user        Display information about user’s jobs.
    -f             Display full information about queues or requests.

Table 8.2: Some qstat options.

    ----------------------------------------
    NQS 3.3.0.4 BATCH REQUEST: espy.
8.4 Deleting an NQE batch job

Sometimes it is necessary to delete a job before it is finished. For example, the input may be erroneous and you do not want to waste any CPU time. A job is deleted with the command qdel. The most common way to use the qdel command is

    qdel request-id

You can ensure the deletion of a job by sending a SIGKILL signal to the running job. Use the option -k of the qdel command:

    qdel -k request-id

8.
    sn6309 5 0 0 0 0 0 0
    ----------------------- --- --- --- --- --- --- --- --- --- ------------

8.6 More information

More information on NQE is available in the CSC help system:

    help nqe

See also the manual pages of the commands qsub, qstat and qdel. Another good reference is CSC’s T3E Users’ Information Channel at the WWW address: http://www.csc.
Chapter 9

Programming tools

9.1 The make system

The make utility executes commands in a makefile to update one or more targets, which typically are programs. The make system is mainly used to maintain programs consisting of several source files. When some of the source files are modified, the make system recompiles only the modified files (and those files that depend on the modified files). Here is a typical makefile:

    OBJECTS= func1.o func2.
make compiles the source files func1.c and func2.c, and links them with the NAG library, producing an executable file myprog. The line .c.o: in the example specifies that .c files should be compiled into .o files using the command on the following tab-indented line:

    $(CC) -c $(OPTS) $<

The symbol $(CC) is already defined by the make system, but you could have redefined it to the appropriate compiler at the beginning of the makefile.
changed. The browser may act upon a routine, a file, or an entire program, which is composed of one or more distinct files, but treated by the browser as a single unit. Xbrowse also acts as a base for other Cray Research tools that reference source code. To display a list of available tools, use the left mouse button to click on the Tools menu button. Suppose you want to obtain information about all of your C code.
• The source code pane, located in the middle of the Xbrowse window, is the largest area of the window. This pane displays the current source code.
• The information pane is located at the bottom of the main Xbrowse window and provides information about the status of Xbrowse. You can also type equivalents of the Xbrowse commands for many menu options in this pane. (A list of these commands is available through the Help menu option.
level, position the cursor on the box and press the left mouse button. To close the tree one level, position the cursor on the node name and press the right mouse button. When you click on a node, that node becomes the current node and is displayed in the main Xbrowse window. The command Caller/Call Tree displays a static call tree of routines that call a specified subprogram and, in turn, displays any subprograms called by the specified subprogram.
The user can move to the source of a subroutine by clicking on the name of the subroutine with the right mouse button. TotalView has two execution modes: “all” or “single-processor”. In the execution mode “all”, the breakpoints and execution commands are applied to all processors simultaneously. In the single-processor execution mode, the user can set breakpoints individually for each processor. The execution mode is selected from the PSet menu.
Figure 9.2: An example of a Cray TotalView session.
error, and it can be used to determine the cause of the problem. The TotalView debugger can be used to examine core files. Start the debugger with the executable name (prog.x in the example) and the core file name:

    totalview prog.x core

After this the debugger shows where each process has been stopped and you can examine the values of the variables. Some of the processes may have been stopped in an assembly routine.
when the program is executed. The files are passed to the Apprentice tool for graphical examination. MPP Apprentice is used as follows. Fortran 90 programs are compiled with the option -eA and the object codes are linked with the MPP Apprentice library using the option -lapp:

    t3e% f90 -c -eA prog.f
    t3e% f90 -o prog.x prog.o -lapp

The compiler option -eA and the linker option -lapp also work with the PGHPF High Performance Fortran compiler.
Figure 9.3: An example of an MPP Apprentice session. The upper pane shows timing statistics and the lower pane shows the number of instructions for the selected subroutine.
9.4.2 The appview command

In addition to MPP Apprentice, the appview command provides a quick summary of the profiling data. Its output is similar to the output from the conventional Unix profiler prof. The appview command was developed at CSC and it relies on a few scripts that extract and sort information from the textual report produced by the command apprentice -r. The following example illustrates the usage. The command line is

    appview app.
performance information and instruction counts. PAT is able to analyze programs written in Fortran 90, C, C++ and HPF. The executable only needs to be relinked; no recompiling is necessary. The linker option -l pat and the PAT-specific cld file pat.cld are required. As an example, suppose that a Fortran 90 program in the file prog.f90 is to be analyzed. The following commands can be used:

    t3e% f90 -c prog.f90
    t3e% f90 prog.o -o prog -l pat pat.
program must be compiled with the option -g, which disables optimization and thus increases the run time. More information about PAT in general and on its advanced features can be found in the manual pages (man pat).

9.5 Tracing message passing: VAMPIR

VAMPIR (Visualization and Analysis of MPI Resources) is a profiling tool for MPI applications.
The source code includes the VAMPIRtrace API calls by means of the preprocessor macro USE_VT. It is recommended to include the definitions in the file $PAL_ROOT/include/VT.inc with the option -I. The program can now be run, e.g., with the command

    t3e% mpprun -n8 loadbal < input

If the program is run as a batch job (see Chapter 8), the command use vampir has to be included in the jobfile. By default, the trace file is generated between the calls MPI_INIT and MPI_FINALIZE.
Figure 9.4: A sample of a VAMPIR session. In the lower right corner is the global timeline display showing the communication pattern. The communication statistics display in the upper left corner shows the number of messages sent between processors. The two pie charts show the activities of individual processors.

Before the VAMPIR visualization tool is used for the first time, the user should create a directory .
interval by selected processors.
• Process view shows the portion of time spent in a given activity class.

All the displays have several options which can be controlled through the pop-up menu from the right mouse button. These include
• Communication statistics: the total amount of communication between all processor pairs is shown. This is opened from the Global timeline display by selecting Comm. Statistics.
Chapter 10

Miscellaneous notes

This chapter discusses some additional topics, such as timing of programs and defining the scalability of parallel programs.

10.1 Obtaining timing information

The most useful measure of processing time in a parallel environment is the wall clock time. This is because traditional CPU times are processor-based, whereas the wall clock time gives a global view of aggregate parallel performance.
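As a generic illustration of wall clock timing, here is a minimal portable sketch in C. It assumes a POSIX system that provides clock_gettime with CLOCK_MONOTONIC, and the function name wall_seconds is hypothetical. It is not the Cray-specific timing routine described in this guide, only a stand-in for systems where that routine is unavailable.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime from <time.h> */
#endif
#include <assert.h>
#include <time.h>

/* Wall clock time in seconds, measured from an arbitrary fixed point.
   Only differences between two calls are meaningful. */
static double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;   /* keeps the variable used when assertions are disabled */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}
```

The elapsed time of a code section is obtained by calling wall_seconds() before and after the section and subtracting the two values.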
      /* Wall clock time in seconds */
      return (double) _rtc() * cpcycle * 1.0e-12;
    }

This routine can be called from either C/C++ or Fortran. In C/C++ the syntax is as follows:

    extern double SECS(void);
    double t1, t2, dt;
    t1 = SECS();
    ... perform calculations ...
    t2 = SECS();
    dt = t2 - t1;
    printf("Elapsed time: %f\n",dt);

In Fortran 90:

    REAL :: t1, t2, dt
    REAL, EXTERNAL :: secs
    t1 = secs()
    ... perform calculations ...
    ...computation...
    after = cpused();
    utime = after - before;
    printf("CPU time in user space = %ld clock ticks\n", utime);

10.1.4 Example of timing

Here is an example of a C program which computes the matrix product using the SGEMM routine from Libsci:

    #include
    #include
    #include
    #include
      after = cpused();
      utime = after - before;
      printf("ct[10][10] is %.6f\n", ct[10][10]);
      printf("CPU time in user space = %ld clock ticks\n", utime);
      exit(0);
    }

See Section 6.3 for more details on calling Fortran routines (here SGEMM) from C. Here is an example of compiling and executing this program:

    t3e% cc matmul.c
    t3e% timex mpprun -n 2 ./a.out
    ct[10][10] is -345.015608
    ct[10][10] is -345.
This equation can be normalized by setting $W_1 + W_p = 1$. Here $W_1 = \alpha$ (the sequential portion) and $W_p = 1 - \alpha$ (the parallel portion). Now you get

$$S_p = \frac{1}{\alpha + (1 - \alpha)/p}.$$

For example, if you have a program which contains a 10 % sequential part, the equation reads

$$S_p = \frac{1}{0.1 + 0.9/p}.$$

Setting $p \to \infty$, you get the maximum speedup, which is 1/0.1 = 10. Therefore, the sequential part starts to dominate when you add more processors.
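The normalized form of Amdahl's law is easy to evaluate numerically. The following C sketch does so; the function names amdahl_speedup and amdahl_limit are hypothetical and chosen only for this illustration.

```c
#include <assert.h>

/* Speedup on p processors when the fraction alpha of the
   single-processor run time is sequential. */
static double amdahl_speedup(double alpha, double p)
{
    assert(alpha >= 0.0 && alpha <= 1.0 && p >= 1.0);
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

/* Upper bound of the speedup as p grows without limit. */
static double amdahl_limit(double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    return 1.0 / alpha;
}
```

With a 10 % sequential part, amdahl_speedup(0.1, 10.0) is about 5.26 and amdahl_speedup(0.1, 100.0) is about 9.17, already close to the limit of 10 given by amdahl_limit(0.1).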
You can derive the following connection between the parameters $\alpha$ and $\alpha'$ in Amdahl’s and Gustafson’s laws:

$$\alpha = \frac{\alpha'}{p - \alpha'(p - 1)}, \qquad \alpha' = \frac{\alpha p}{1 + \alpha(p - 1)}.$$

Figure 10.1 shows how these scalability laws are connected. Figure 10.2 shows how the speed of the code scales (according to Amdahl’s law) when $\alpha = 0.02$ and $\alpha = 0.002$.

[Figure: the workloads W1 and Wp under Amdahl’s law and Gustafson’s law]

Figure 10.
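The connection between the two parameters can be checked numerically with a short C sketch. The function names are hypothetical. The function gustafson_speedup assumes the usual statement of Gustafson's law, in which the scaled speedup is α′ + (1 − α′)p; treat that form as an assumption of this sketch.

```c
#include <assert.h>

/* Amdahl's sequential fraction alpha from Gustafson's alpha'. */
static double alpha_from_alpha_prime(double alpha_prime, double p)
{
    assert(alpha_prime >= 0.0 && alpha_prime <= 1.0 && p >= 1.0);
    return alpha_prime / (p - alpha_prime * (p - 1.0));
}

/* Gustafson's sequential fraction alpha' from Amdahl's alpha. */
static double alpha_prime_from_alpha(double alpha, double p)
{
    assert(alpha >= 0.0 && alpha <= 1.0 && p >= 1.0);
    return alpha * p / (1.0 + alpha * (p - 1.0));
}

/* Scaled speedup in the usual form of Gustafson's law (assumed here). */
static double gustafson_speedup(double alpha_prime, double p)
{
    assert(alpha_prime >= 0.0 && alpha_prime <= 1.0 && p >= 1.0);
    return alpha_prime + (1.0 - alpha_prime) * p;
}
```

For α = 0.02 and p = 64 the conversion gives α′ of about 0.566, and converting back returns 0.02. The scaled speedup for this α′ is about 28.3, which equals 64/(1 + 0.02 · 63), the speedup that Amdahl's law gives for α = 0.02.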
10.3 Scalability criteria at CSC

CSC imposes the following scalability criterion for Cray T3E applications: the speed of the application has to increase by at least 50% when the number of processors is doubled. For example, when doubling the number of processors from 8 to 16, the speed of the code should be at least 1.5 times as high. You can use Gustafson’s law to formulate this criterion: compute the same calculation using p and p/2 processors.
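The doubling rule can be checked from two measured wall clock times. The C sketch below is one hedged way to do this. The function name meets_doubling_criterion and its arguments are hypothetical, and it assumes that speed is measured as the inverse of the wall clock time of the same calculation.

```c
#include <assert.h>

/* Returns 1 if the run on p processors is at least 1.5 times as fast
   as the run of the same calculation on p/2 processors, else 0.
   time_half: wall clock time on p/2 processors
   time_full: wall clock time on p processors */
static int meets_doubling_criterion(double time_half, double time_full)
{
    assert(time_half > 0.0 && time_full > 0.0);
    return time_half / time_full >= 1.5;
}
```

For example, a calculation that takes 150 seconds on 8 processors and 100 seconds on 16 processors just meets the rule, whereas 140 seconds against 100 seconds does not.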
Appendix A

About CSC

Center for Scientific Computing, or simply CSC, is a national service center that specializes in scientific computing and data communications. CSC provides modeling, computing and information services for universities, research institutes and industry. For example, Finland’s weather forecasts are computed with the Cray supercomputers operated by CSC.
hours you can leave a message. The Help Desk registers the call, writes down the problem and tries to solve the problem immediately. If this is not possible, the problem is forwarded to the right experts to take care of it. See the WWW pages at the address http://www.csc.fi for more information about CSC services.
Appendix B

Glossary

ANSI American National Standards Institute, an organization deciding on U.S. computer science standards.

Bandwidth The amount of data that can be sent through a given communications circuit per second.

BLACS Basic Linear Algebra Communication Subroutines, a subroutine package for interprocess communication in PBLAS and ScaLAPACK.

BLAS Basic Linear Algebra Subroutines, a subroutine package for fundamental linear algebra operations.
HPF High Performance Fortran, a data-parallel language extension to Fortran 90.

HTML Hypertext Markup Language, a language for writing hypertext documents on the Web.

IEEE Institute of Electrical and Electronics Engineers, the world’s largest technical professional society, based in the USA.

IMSL Fortran subroutine library for numerical and statistical computation.
Non-malleable Non-malleable programs are fixed at compile time to run on a specific number of processors.

NQE Network Queueing Environment, the batch queuing system on the T3E.

PBLAS Parallel BLAS, parallelized version of BLAS.

PDF Portable Document Format, a format defining the final layout of a document. The native file format for the Adobe Acrobat software package.

PE Processing Element, consisting of a microprocessor, local memory and support circuitry.
Appendix C

Metacomputer Environment

Help commands
• help topic (CSC help system)
• usage program (quick help for programs)
• man program (manual pages)
• msgs (system messages)
• lynx, mosaic, netscape (hypertext information system)

Unix commands
• ls (list directory)
• less (print a file to the screen)
• cp (copy a file)
• rm (delete a file)
• mv (move or rename a file)
• cd (change the current directory)
• pwd (print name of the current directory)
• mkdir (create a directory)
E-mail
• pine (start the e-mail program)
• Reading: choose a message with arrow keys and press return
• i (index of received messages)
• c (send mail)
• r (reply to mail)
• f (forward mail)
• q (quit)
• ? or Ctrl-g (help); notation Ctrl-g means “hold down the control key and press g”
• Ctrl-c (interrupt the current operation)

Sending mail:
• pine First.Surname@csc.
Bibliography

[Craa] Cray Research, Inc. CF90 Commands and Directives Reference Manual. SR-3901. 2.6, 5.10
[Crab] Cray Research, Inc. Cray C/C++ Reference Manual. SR-2179 3.0.2. 2.6, 6.7
[Crac] Cray Research, Inc. Cray T3E Fortran Optimization Guide. SG-2518. 2.6, 5.5, 5.10
[Crad] Cray Research, Inc. Introducing CrayLibs. IN-2167 3.0. 4.6
[Crae] Cray Research, Inc. Introducing the Cray TotalView Debugger. IN-2502 3.0. 9.3.2
[Craf] Cray Research, Inc.
[KR97] Tiina Kupila-Rantala, editor. CSC User’s Guide. CSC – Tieteellinen laskenta Oy, 1997. URL http://www.csc.fi/oppaat/cscuser/. 1.7
[Lou97] Kirsti Lounamaa, editor. Metakoneen käyttöopas. CSC – Tieteellinen laskenta Oy, 2nd edition, 1997. 1.7, 2.6, 5.10
[Pac97] Peter S. Pacheco. Parallel Programming with MPI. Morgan Kaufmann Publishers, Inc., 1997. 1.7, 7.2.6
[Saa95] Sami Saarinen. Rinnakkaislaskennan perusteet PVM-ympäristössä. CSC – Tieteellinen laskenta Oy, 1995. 1.7, 7.3.