Early Experiences with the Cray T3E at the SDSC

Joel E. Tohline and John Cazes

Department of Physics & Astronomy
Louisiana State University


Part III: Performance of DEC Alpha processors

The Bad News:

In Part II of this report, we presented a table entitled "Timings on the T3E," which clearly illustrated that we have been able to achieve excellent scalability of our CFD code on the T3E by utilizing the Portland Group's HPF compiler (PGHPF).

As footnote "c" of that table points out, however, we had to enlist 32 T3E processors before we were able to achieve execution speeds close to those we have realized on a single node of the Cray C90. This is a bit troubling: the reported peak theoretical performance of a single T3E node at the SDSC is 600 Mflops, while that of a single C90 node is 1000 Mflops, so in theory we should have realized execution speeds that rivaled the C90 by enlisting only 2 nodes of the T3E. Apparently the efficiency with which our code executes on the T3E is a factor of 16 poorer than (that is, only about 6% of) the efficiency with which it executes on the C90!

This point is made even more explicit in Table 3, where one column of timings reported earlier in Table 2 (for grid size 128 × 64 × 64) has been converted to equivalent single-processor execution efficiencies measured relative to the Cray C90. All of our multi-processor runs on the Cray T3E achieve an efficiency (relative to the reported theoretical peak performance) that is only 6 - 7% of the efficiency that the same code achieves on the Cray C90.

Table 3: Timings on the T3E [a]
Using PGHPF -- Grid size: 128 × 64 × 64

  T3E      Seconds   µsec       Ratio    Efficiency
  nodes              per zone   to C90   (relative to C90)
  --------------------------------------------------------
    4       22.96      175      0.044        7.3%
    8       12.49      191      0.040        6.7%
   16        7.18      219      0.035        5.8%
   32        3.75      229      0.033        5.6%

Does this mean that although the PGHPF compiler is able to produce an executable code that scales well on the Cray T3E, the compiled code is overloaded with sluggish communications instructions? Or does it mean that, irrespective of network communication demands, the compiled code is performing computational instructions with extraordinarily low efficiency on each T3E processor?


Single-Processor Test Results:

In order to understand these timing results more completely and, in particular, to answer this question regarding the usefulness of the PGHPF compiler on the Cray T3E, we have tested the performance of our CFD code on a variety of relevant single-processor platforms utilizing several different F90 compilers. Specifically, we have measured the code's execution time on:

  A. a single processor of the Cray C90, using Cray's F90 compiler;
  B. a single processor of the Cray T3E, using Cray's F90 compiler;
  C. a single processor of the Cray T3E, using the Portland Group's HPF compiler (PGHPF);
  D. a single 275 MHz Alpha processor in our DEC server, using Digital's F90 compiler.

Table 5 (at the bottom of this document) details some of the properties of the computer hardware used in each of these test cases. Exactly the same CFD code was used for each test; the code was simply recompiled using the selected compiler and compiler options, as detailed in Table 5. The version of the code used in these tests was also virtually identical to the code used to obtain the various timings reported in Parts I and II of this report.

Table 4 presents our execution timing results from these various test runs. They are presented in a format similar to Table 3. Notice that for each hardware/compiler pair, a test run was performed using two slightly different computational lattices -- one with and one without power-of-2 array dimensions -- in order to test for cache misses.

Table 4: Single-Processor Test Results

                                      Execution speed
                              Grid size 64 × 32 × 32           Grid size 67 × 34 × 32
  Test  Hardware    Compiler    µsec      Ratio   Efficiency     µsec      Ratio   Efficiency
                                per zone  to C90  (rel. C90)     per zone  to C90  (rel. C90)
  -------------------------------------------------------------------------------------------
   A    Cray C90    Cray F90       9.6    1.000     100%            9.2    1.000     100%
   B    T3E         F90          405.4    0.024       4%          120.3    0.076      13%
   C    T3E         PGHPF        168.5    0.057       9%          165.2    0.056       9%
   D    DEC server  DEC F90      127.7    0.075      27%          108.1    0.085      30%


Analysis:

The numbers reported in Table 4 illustrate several things:

  1. The execution speed (9.2 - 9.6 µsec/zone) measured on the Cray C90 for this relatively small lattice problem is roughly the same as the execution speed (7.7 µsec/zone) measured earlier on a larger problem (see footnote c of Table 2), so the C90 measurements should provide a reasonably good control reference here.

  2. The efficiency of the PGHPF code when executing on a single-processor node of the T3E (9%) is not significantly better than the efficiency of the PGHPF code measured in multi-processor runs (6 - 7%; see Table 3). This suggests that the low measured efficiencies reported in Table 3 cannot be blamed primarily on sluggish inter-processor communication.

  3. Both of the Cray T3E tests (B & C) produced execution efficiencies that are an order of magnitude worse than the execution efficiencies achieved on the Cray C90. This suggests that either the reported peak performance capability of each T3E processor (600 Mflops) is an order of magnitude too optimistic, or both the F90 compiler and the PGHPF compiler (which itself utilizes the F90 compiler) produce executable code that performs computational instructions on each T3E processor with extraordinarily low efficiency.

  4. Digital's F90 compiler produces an executable code that exhibits substantially better performance (a 27 - 30% relative efficiency) on our DEC server's 275 MHz 21040 Alpha processor than does the executable code produced for the T3E processors by either the F90 or the PGHPF compiler.

  5. When our code is compiled with the T3E F90 compiler, it suffers terribly from cache misses when power-of-2 array dimensions are declared; the sketch following this list illustrates the distinction. (Interestingly, the PGHPF compiler, which utilizes the same F90 compiler, does not suffer from this problem.)
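
To make this concrete, here is a minimal F90 sketch of the declaration-level difference between the two test grids of Table 4; the array names are hypothetical, not taken from our CFD code:

    ! A minimal, hypothetical sketch (array names are ours, not from the
    ! CFD code) of the two kinds of declarations that distinguish the two
    ! test grids in Table 4.
    program padding_demo
      implicit none
      ! Power-of-2 dimensions: on a small direct-mapped cache, strided
      ! accesses through such an array can repeatedly collide on the same
      ! cache lines, producing the misses described in point 5.
      real :: q_pow2(64, 32, 32)
      ! Padded (non-power-of-2) dimensions, as in Table 4's second grid,
      ! break up those regular collisions at the cost of a little memory.
      real :: q_padded(67, 34, 32)
      q_pow2   = 0.0
      q_padded = 0.0
      print *, 'padding costs', size(q_padded) - size(q_pow2), 'extra zones'
    end program padding_demo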

All things considered, we suspect that the relatively low efficiency with which our code currently executes on the T3E (in single-processor as well as multi-processor mode) is not due to any major failings of the PGHPF compiler but, instead, can be ascribed primarily to the immaturity of the F90 compiler currently available on the T3E. The relatively good performance of Digital's F90 compiler on the 275 MHz Alpha processor gives us reasonable hope that the F90 compiler on the T3E can, with modest effort, be taught to produce executable code that achieves a much higher percentage of the 600 Mflops peak performance capability of the T3E's 300 MHz Alpha processor. In the near future, we expect to see improvements in the F90 compiler that lead to at least a factor of 3 improvement in execution speeds without our having to make any major Fortran code modifications.

Another item of concern that has surfaced during our various single-processor and multi-processor test executions is the following:

  1. As written, our code relies primarily upon nearest-neighbor communications that involve shifting various 3D arrays only one place, in one dimension. However, the reports that we have received from running the code with the "-stats -alls" options show that the PGHPF compiler has established a large number of communications with average sizes of only 8 - 10 bytes per processor. This appears to us to be a very inefficient way to implement the array shifts requested by our code; the sketch below illustrates the shift pattern in question.
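
For reference, here is a minimal F90/HPF sketch of that shift pattern; the variable name and the distribution directive are illustrative only, not taken from our code:

    ! A minimal sketch (assumed names and mapping) of the one-place,
    ! one-dimension array shift our code depends upon.
    program shift_demo
      implicit none
      integer, parameter :: nx = 128, ny = 64, nz = 64
      real :: rho(nx, ny, nz)
    !hpf$ distribute rho(*, *, block)   ! block-distribute the last dimension
      rho = 1.0
      ! A one-place circular shift along the distributed dimension. In
      ! principle this requires only one bulk exchange of a boundary plane
      ! (nx × ny words) between each pair of neighboring processors, not
      ! a large number of 8 - 10 byte messages.
      rho = cshift(rho, shift=1, dim=3)
    end program shift_demo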


FOOTNOTES:

[a] In Column 2 of Table 3, execution times (in seconds per integration timestep) have been copied directly from Column 3 of Table 2. In Column 3, these execution times have been converted to an equivalent measure in microseconds per zone as if each execution had been performed on a single processor-node of the T3E, that is,

    Column 3 = (Column 2) × (T3E nodes) × 10⁶ / (128 × 64 × 64).

The numbers reported in Column 4 of Table 3 have been derived by dividing the numbers in Column 3 by 7.65 µsec/zone which, as can be deduced from footnote c of Table 2, is the execution speed we obtained for the same size problem running on a single node of the Cray C90. Finally, the "efficiency measures" reported in the last column of Table 3 have been derived by multiplying the time ratios reported in Column 4 by the ratio of the peak theoretical performance capability of a single node of the C90 to that of a single node of the T3E, i.e., 1000 Mflops/600 Mflops.
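
As a quick check of this arithmetic, the following F90 snippet (variable names are ours) recomputes the 4-node row of Table 3:

    ! Recompute Columns 3 - 5 of Table 3 for the 4-node run.
    program table3_row
      implicit none
      integer, parameter :: nodes    = 4
      real,    parameter :: seconds  = 22.96               ! Column 2 of Table 3
      real,    parameter :: zones    = 128.0 * 64.0 * 64.0 ! grid zones per timestep
      real,    parameter :: c90_usec = 7.65                ! C90 speed, µsec/zone
      real,    parameter :: peak_rat = 1000.0 / 600.0      ! C90 peak / T3E-node peak
      real :: usec_per_zone, ratio, efficiency
      usec_per_zone = seconds * nodes * 1.0e6 / zones      ! Column 3: ~175
      ratio         = c90_usec / usec_per_zone             ! Column 4: ~0.044
      efficiency    = ratio * peak_rat                     ! Column 5: ~0.073, i.e. 7.3%
      print *, usec_per_zone, ratio, efficiency
    end program table3_row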


Table 5: Single-Processor Test Configurations

                                            Theoretical Peak            Compiler
  Test  Machine     Processor             Mflops   Ratio to C90    Name       Options Used
  -----------------------------------------------------------------------------------------
   A    Cray C90    C90                    1000        1.00        Cray F90   f90 -N 132 -O3 -c $*.f
   B    T3E         300 MHz Alpha 21164     600        0.60        F90        f90 -N 132 -O3 -c $*.f
   C    T3E         300 MHz Alpha 21164     600        0.60        PGHPF      pghpf -Mextend -Mreplicate=dims:3 -Moverlap=size:1 -c -O2
   D    DEC server  275 MHz Alpha 21040     275        0.28        DEC F90    f90 -extend_source -O3 -c $*.f

