Early Experiences with the Cray T3E
at the SDSC and the NAVOCEANO MSRC

Joel E. Tohline and John Cazes

Department of Physics & Astronomy
Louisiana State University


Part V: Straightforward F90 Compiler Optimizations

In Part III of our first report, we summarized results from several different test runs in which we measured the performance of our CFD code on a variety of relevant single-processor platforms utilizing several different F90 compilers. At the time of that report we had not had access to a performance monitoring tool that would permit us to directly measure floating-point performance on the T3E processors, so in order to evaluate the code's execution efficiency we relied upon comparisons with code timings on the Cray C90. An analysis of those test runs led us to conclude that "the relatively low efficiency with which our code currently executes on the T3E ... can be ascribed primarily to the immaturity of the F90 compiler that currently is available on the T3E." We also stated that, "In the near future, we should expect to see improvements in the F90 compiler that lead to at least a factor of 3 improvement in execution speeds without having to make any major fortran code modifications."

Since that first report we have gained a better appreciation of the way the current T3E F90 compiler handles the loading and storing of data arrays and the streaming of data into floating-point registers. By making adjustments in the stride of our principal data arrays and by taking advantage of some compiler options, but without making any changes in the implemented HPF algorithm, we are now able to generate executable code with a substantially higher execution efficiency than originally reported. The following discussions detail this improvement.

Note that, by utilizing "pat" -- a performance analysis tool that accesses the hardware performance monitors on the T3E -- at both the SDSC and the NAVO MSRC, we now are able to document the performance of our code through a direct measurement of its MFlop rating rather than simply through timing comparisons with the C90.


Impact of Power-of-2 Arrays:

In the past, we have selected lattice (grid) structures for our three-dimensional CFD simulations that have had 2^Ni zones (where Ni is an integer) in each of the three coordinate directions (i = 1,3). Hence, each of our principal physical variables usually has been defined by arrays with power-of-2 dimensions. On certain vector supercomputers, and certainly on the SIMD-architecture MasPar, arrays with power-of-2 dimensions produce optimal executable code. However, as has been pointed out in a technical report by Anderson, Brooks, and Hewitt (Benchmarking Group, Cray Research) entitled "The Benchmarker's Guide to Single-processor Optimization for CRAY T3E Systems," on each processor of the T3E large arrays that have power-of-2 dimensions cannot be loaded into the cache and then accessed by the floating-point registers in an efficient manner. Generally speaking, when looping through an entire array in order to calculate, say, a vector-vector multiply, many "cache misses" will occur and a low execution efficiency will result if the data is laid out in memory with a power-of-2 stride.

In order to examine the impact of power-of-2 arrays on the execution speed of our code, we reran some of our earlier single-processor tests, changing only the declared size of our principal variable arrays. Specifically, instead of using our normal lattice sizes of 64×32×32 and 16×16×128, we compiled and executed the CFD code using lattice sizes of 67×34×32 and 18×18×128, respectively. Table 6 displays the results of these tests performed on two separate computing platforms (the T3E_600 at the SDSC, and the T3E_900 at the NAVOCEANO MSRC), using two separate compilers (F90 and PGHPF).
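
As an illustration of what this change amounts to, the following minimal Fortran 90 sketch (the array names a2/b2/c2 and ap/bp/cp are purely illustrative and are not taken from our CFD code) performs the same vector-vector multiply over arrays declared with power-of-2 dimensions and over arrays whose declared dimensions have been padded to 67 × 34 × 32. The loop bounds -- i.e., the physical grid -- are identical in both cases; only the declared memory stride changes.

   program pad_demo
   ! Illustrative sketch only: a simple vector-vector multiply carried out
   ! over arrays declared with power-of-2 dimensions (a2, b2, c2) and over
   ! arrays declared with padded, non-power-of-2 dimensions (ap, bp, cp).
   implicit none
   real, dimension(64,32,32) :: a2, b2, c2   ! power-of-2 strides
   real, dimension(67,34,32) :: ap, bp, cp   ! padded, non-power-of-2 strides
   integer :: i, j, k

   a2 = 1.0
   b2 = 2.0
   ap = 1.0
   bp = 2.0

   ! Identical triple loop over the physical 64 x 32 x 32 grid; only the
   ! declared dimensions (and hence the memory stride) differ.
   do k = 1, 32
      do j = 1, 32
         do i = 1, 64
            c2(i,j,k) = a2(i,j,k) * b2(i,j,k)   ! power-of-2 stride: prone to cache misses
            cp(i,j,k) = ap(i,j,k) * bp(i,j,k)   ! padded stride
         end do
      end do
   end do

   print *, c2(64,32,32), cp(64,32,32)
   end program pad_demo

The only cost of the padded declaration is a small amount of unused memory at the end of each column.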

Table 6: Impact of Power-of-2 Arrays

(a) SDSC: T3E_600

                                        Execution Speed (MFlops)
                           ---------------------------------------------------------
                           Grid size      Grid size       Grid size      Grid size
 Test  Compiler  Streams   64 × 32 × 32   16 × 16 × 128   67 × 34 × 32   18 × 18 × 128
 ----  --------  -------   ------------   -------------   ------------   -------------
  B1   F90       OFF            7.20           7.30           21.58          20.00
  C1   PGHPF     OFF           12.79          10.72           13.06          11.29

(b) NAVOCEANO MSRC: T3E_900

                                        Execution Speed (MFlops)
                           ---------------------------------------------------------
                           Grid size      Grid size       Grid size      Grid size
 Test  Compiler  Streams   64 × 32 × 32   16 × 16 × 128   67 × 34 × 32   18 × 18 × 128
 ----  --------  -------   ------------   -------------   ------------   -------------
  E1   F90       OFF            7.74           7.75           24.88          23.05
  F1   PGHPF     OFF           15.07          12.57           15.41          13.29

(Tests B1 and C1 essentially reproduce the results reported for tests B and C in Table 4 of our earlier report. Here, however, our code's single-processor execution speed has been tabulated directly in terms of its MFlop rating.) Four things are clear from Table 6a:

  1. When our code is compiled with the Cray F90 compiler on the T3E, it suffers terribly from cache misses when power-of-2 arrays are declared. Stated another way, by simply changing our array dimensions so that they are not power-of-2, the code speeds up by a factor of 3!

  2. For a given size problem, once the decision is made to utilize either power-of-2 or non-power-of-2 arrays, the Cray F90 timing results appear to be insensitive to the relative sizes of the specified array dimensions. (For example, the results reported for a lattice size of 64×32×32 are essentially indistinguishable from the results reported for a lattice size of 16×16×128.)

  3. When our code is compiled with the PGHPF compiler on a single processor of the T3E, it does not seem to care whether or not the array dimensions are a power-of-2. This seems odd because, as we understand it, the PGHPF compiler ultimately uses the Cray F90 compiler to create an executable code.

  4. Even under the best conditions reported in Table 6a (F90 compiler with non-power-of-2 arrays), the measured rating of 21.58 MFlops is only 3.6% of the processor's 600 MFlops peak-performance rating.

Most of these points were made briefly at the end of Part III of our earlier report. By comparing Table 6b with Table 6a, one additional point can be made:

  1. Despite the fact that the peak-performance rating of the T3E_900 at the NAVO MSRC is 50% higher than the peak-performance rating of the T3E_600 at the SDSC, in moving to the NAVO MSRC machine the execution speed of our code has increased at most by 18%. This almost certainly reflects the fact that, overall, the code's measured MFlop rating is below 5% of peak and, hence, we are not effectively utilizing the machine's floating point capabilities.


The Benefit of STREAMS:

When Cray reconfigured the DEC Alpha chip to serve as the floating-point node of the T3E, as we understand it Cray replaced what would normally be a board-level (level-3) cache with a memory access system called "STREAMS." On the T3E, a user may either activate or deactivate this additional memory access system by turning a STREAMS environment variable "on" or "off," respectively. Both the Cray F90 compiler and the PGHPF compiler recognize the setting of this environment variable at the time of compilation and will attempt to utilize the hardware capabilities of STREAMS if directed to do so. At the SDSC, STREAMS has not yet been activated, so turning this environment variable "on" is in practice identical to leaving it "off." At the NAVO MSRC, however, STREAMS has been activated. We recently have rerun all of the tests reported in Table 6b with the STREAMS environment variable turned "on." Table 7 displays the results of these tests for two of the four selected lattice sizes.

Table 7: The Benefit of STREAMS

NAVOCEANO MSRC: T3E_900

                                 Execution Speed (MFlops)
                           -----------------------------
                           Grid size      Grid size
 Test  Compiler  Streams   64 × 32 × 32   67 × 34 × 32
 ----  --------  -------   ------------   ------------
  E1   F90       OFF            7.74          24.88
  E2   F90       ON             8.85          38.19
  F1   PGHPF     OFF           15.07          15.41
  F2   PGHPF     ON            23.08          23.31

Table 7 shows clearly that, with STREAMS turned on, our code speeds up by roughly 50%, independent of which compiler is being used. (Note that if the Cray F90 compiler is being used, this additional speedup will not be fully realized if the arrays have power-of-2 dimensions; but, again, the PGHPF compiler results seem to be insensitive to the array dimensions.)


Summary:

The earliest model simulations that we carried out on the T3E at the SDSC, and the multi-processor test results that were quoted in our first report (see, for example, Table 2 of that report), were all conducted with a version of our CFD code that utilized power-of-2 arrays and that was compiled with the PGHPF compiler with the "STREAMS" option turned off. As the test results reported in Table 7 show, in precisely that same mode of operation our CFD code achieves a rating of 15.07 MFlops on a single node of the T3E_900 at the NAVO MSRC. From the test results tabulated in Table 7, we are able to sketch out the following path by which significant improvements in the execution speed of our code may be achieved without making any alterations in our basic HPF algorithm:

                        Original mode        Set                 Utilize Cray F90 handling
                        of operation    -->  STREAMS=on     -->  of non-power-of-2 arrays
                        -------------        ----------          -------------------------
 MFlops                     15.07               23.08                     38.19
 Speedup Factor              1.00                1.5                       2.5

This is completely in line with the expectations that we expressed in the summary comments of our first report, viz., "in the near future we should expect to see improvements in [our understanding of] the F90 compiler on the T3E that lead to at least a factor of 3 improvement in execution speeds without having to make any major fortran code modifications."

There are two immediate catches, however.

  1. At the SDSC, "STREAMS" has not been activated, so we are unable to realize the first factor of 1.5 speedup at that facility.

  2. The speedup that can be realized using the Cray F90 compiler on a single processor by adopting non-power-of-2 array dimensions cannot presently be generalized to a parallel HPF application. This is because, apparently, the PGHPF preprocessor does not pass the benefits of this array respecification transparently through to the F90 compilation step. (A sketch of the kind of declaration involved follows this list.)
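
To make concrete what "array respecification" means in the parallel context, here is a minimal HPF sketch (the array name, extents, and chosen distribution are illustrative assumptions, not taken from our CFD code): a principal-variable array is declared with padded, non-power-of-2 extents and handed to PGHPF with a standard DISTRIBUTE directive. The open question raised above is whether declarations of this kind reach the underlying Cray F90 compilation step with their padded local extents intact.

   program hpf_pad_sketch
   ! Illustrative sketch only: a padded, non-power-of-2 array that is
   ! block-distributed across processors via a standard HPF directive.
   ! (With a plain F90 compiler the directive is treated as a comment.)
   implicit none
   real, dimension(67,34,32) :: rho            ! padded, non-power-of-2 extents
!HPF$ DISTRIBUTE rho(*, *, BLOCK)              ! distribute the last dimension

   rho = 0.0
   rho(1:64,1:32,:) = 1.0                      ! work only on the physical grid
   print *, sum(rho)
   end program hpf_pad_sketch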

At this point, we believe that we would benefit from close interactions with the Portland Group compiler developers. We would like to determine how the full factor of 2.5 speedup outlined here can be achieved on a multi-processor HPF application.

