Early Experiences with the Cray T3E at the SDSC
The Bad News:
In Part II of this report, we presented a table entitled "Timings on the T3E," which showed clearly that we have been able to achieve excellent scalability of our CFD code on the T3E using the Portland Group's HPF compiler (PGHPF).
As footnote "c" of that table points out, however, we had to enlist 32 T3E processors before we achieved execution speeds close to those we have realized on a single node of the Cray C90. This is a bit troubling: the reported theoretical peak performance of a single T3E node at the SDSC is 600 Mflops, and that of a single C90 node is 1000 Mflops, so in theory we should have rivaled the C90 by enlisting only 2 nodes of the T3E. Apparently our code executes on the T3E with an efficiency a factor of 16 poorer than (that is, only about 6% of) its efficiency on the C90!
This point is made even more explicitly in Table 3, where one column of timings reported earlier in Table 2 (for grid size 128 × 64 × 64) has been converted to equivalent single-processor execution efficiencies measured relative to the Cray C90. All of our multi-processor runs on the Cray T3E achieve an efficiency (relative to the reported theoretical peak performance) that is only 6-7% of the efficiency the same code achieves on the Cray C90.
Table 3: Timings on the T3E^a using PGHPF (grid size: 128 × 64 × 64)

| T3E nodes | Seconds | µsec per zone | Ratio to C90 | Efficiency (relative to C90) |
| ---: | ---: | ---: | ---: | ---: |
| 4 | 22.96 | 175 | 0.044 | 7.3% |
| 8 | 12.49 | 191 | 0.040 | 6.7% |
| 16 | 7.18 | 219 | 0.035 | 5.8% |
| 32 | 3.75 | 229 | 0.033 | 5.6% |
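The derived columns in the table above follow from simple arithmetic. Below is a minimal sketch, assuming the per-zone time is quoted per processor and inferring the C90 reference time (roughly 7.7 µsec per zone, obtained by inverting the tabulated ratios — an assumption on our part, not a figure stated in the text); it reproduces the derived columns to within rounding.

```python
# Reproduce the derived columns of Table 3 from the raw wall-clock times.
# Assumptions (not stated explicitly in the text): per-zone cost is quoted
# per processor, and the C90 reference for this grid is ~7.7 usec/zone
# (inferred by inverting the tabulated "Ratio to C90" column).

ZONES = 128 * 64 * 64            # grid size from Table 3
C90_USEC_PER_ZONE = 7.7          # assumed C90 reference for this grid
PEAK_RATIO_T3E = 600.0 / 1000.0  # T3E node peak / C90 node peak (Mflops)

def derived_columns(nodes, seconds):
    usec_per_zone = seconds * 1.0e6 * nodes / ZONES   # per-processor cost
    ratio_to_c90 = C90_USEC_PER_ZONE / usec_per_zone
    efficiency = ratio_to_c90 / PEAK_RATIO_T3E        # relative to C90
    return usec_per_zone, ratio_to_c90, efficiency

for nodes, seconds in [(4, 22.96), (8, 12.49), (16, 7.18), (32, 3.75)]:
    u, r, e = derived_columns(nodes, seconds)
    print(f"{nodes:3d}  {u:5.0f}  {r:.3f}  {e:.1%}")
```

Note that the per-processor µsec-per-zone figure grows with node count even as wall-clock time falls; that widening gap is the parallel overhead.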
Does this mean that, although the PGHPF compiler produces executable code that scales well on the Cray T3E, the compiled code is overloaded with sluggish communication instructions? Or does it mean that, irrespective of network communication demands, the compiled code performs its computational instructions with extraordinarily low efficiency on each T3E processor?
Single-Processor Test Results:
In order to understand these timing results more completely and, in particular, to answer this question regarding the usefulness of the PGHPF compiler on the Cray T3E, we have tested the performance of our CFD code on a variety of relevant single-processor platforms using several different F90 compilers. Specifically, we have measured the code's execution time on:

* a single processor of the Cray C90, using Cray's F90 compiler (test A);
* a single processor of the Cray T3E, using Cray's F90 compiler (test B);
* a single processor of the Cray T3E, using the Portland Group's HPF compiler, PGHPF (test C);
* a single processor of a DEC Alpha server, using DEC's F90 compiler (test D).
Table 5 (at the bottom of this document) details some of the properties of the computer hardware used in each of these test cases. Exactly the same CFD code was used for each test; the code was simply recompiled using the selected compiler and compiler options detailed in Table 5. The version of the code used in these tests was also virtually identical to the code used to obtain the various timings reported in Parts I and II of this report.
Table 4 presents our execution timing results from these various test runs, in a format similar to Table 3. Notice that for each hardware/compiler pair, a test run was performed using two slightly different computational lattices -- one with power-of-2 array dimensions and one without -- in order to test for cache misses.
Table 4: Single-processor execution speeds

| Test | Hardware | Compiler | µsec per zone (64 × 32 × 32) | Ratio to C90 | Efficiency (relative to C90) | µsec per zone (67 × 34 × 32) | Ratio to C90 | Efficiency (relative to C90) |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| A | Cray C90 | Cray F90 | 9.6 | 1.000 | 100% | 9.2 | 1.000 | 100% |
| B | T3E | F90 | 405.4 | 0.024 | 4% | 120.3 | 0.076 | 13% |
| C | T3E | PGHPF | 168.5 | 0.057 | 9% | 165.2 | 0.056 | 9% |
| D | DEC server | DEC F90 | 127.7 | 0.075 | 27% | 108.1 | 0.085 | 30% |
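The cache sensitivity that motivates the padded 67 × 34 × 32 lattice can be illustrated with a small sketch. Assuming, for illustration, a direct-mapped 8 KB data cache with 32-byte lines (the geometry of the Alpha 21164's on-chip data cache), striding through 8-byte words with a power-of-2 leading dimension maps the accesses onto the same few cache sets over and over, while a padded dimension spreads them across the cache:

```python
# Count the distinct cache sets touched when accessing one element per
# "row" of a 2-D array of 8-byte reals, comparing a power-of-2 leading
# dimension (64) against a padded one (67).
# Cache model (an illustrative assumption): direct-mapped, 8 KB total,
# 32-byte lines -> 256 sets, as on the Alpha 21164's primary data cache.

CACHE_BYTES = 8 * 1024
LINE_BYTES = 32
NUM_SETS = CACHE_BYTES // LINE_BYTES   # 256 sets
WORD = 8                               # bytes per double-precision real

def sets_touched(leading_dim, accesses=64):
    stride = leading_dim * WORD        # byte distance between rows
    return len({(k * stride // LINE_BYTES) % NUM_SETS
                for k in range(accesses)})

print(sets_touched(64))   # power-of-2 stride: only 16 distinct sets
print(sets_touched(67))   # padded stride: far more sets, fewer conflicts
```

With the power-of-2 stride, 64 consecutive row accesses keep evicting each other out of a handful of sets; the small amount of padding all but eliminates those conflict misses, which is consistent with the large grid-size sensitivity of test B in Table 4.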
Analysis:
The numbers reported in Table 4 illustrate several things.
All things considered, we suspect that the relatively low efficiency with which our code currently executes on the T3E (in single-processor as well as multi-processor mode) is not due to any major failing of the PGHPF compiler but, instead, can be ascribed primarily to the immaturity of the F90 compiler currently available on the T3E. The relatively good performance of Digital's F90 compiler on the 275 MHz Alpha processor gives us reasonable hope that the F90 compiler on the T3E can, with modest effort, be taught to produce executable code that achieves a much higher percentage of the 600 Mflops peak performance capability of the T3E's 300 MHz Alpha processor. In the near future, we expect to see improvements in the F90 compiler that lead to at least a factor of 3 improvement in execution speeds, without our having to make any major Fortran code modifications.
Another item of concern has surfaced during our various single-processor and multi-processor test executions.
Table 5: Hardware and compiler details for each test

| Test | Machine | Processor | Theoretical Peak (Mflops) | Ratio to C90 | Compiler | Options Used |
| --- | --- | --- | ---: | ---: | --- | --- |
| A | Cray C90 | C90 | 1000 | 1.00 | Cray F90 | f90 -N 132 -O3 -c $*.f |
| B | T3E | 300 MHz Alpha 21164 | 600 | 0.60 | F90 | f90 -N 132 -O3 -c $*.f |
| C | T3E | 300 MHz Alpha 21164 | 600 | 0.60 | PGHPF | pghpf -Mextend -Mreplicate=dims:3 -Moverlap=size:1 -c -O2 |
| D | DEC server | 275 MHz Alpha 21040 | 275 | 0.28 | DEC F90 | f90 -extend_source -O3 -c $*.f |
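The "Efficiency" columns of Table 4 are simply its speed ratios normalized by the peak-performance ratios of Table 5. A quick consistency check (a sketch with values transcribed from the two tables; agreement is to within the rounding of the transcribed inputs):

```python
# Cross-check: efficiency (Table 4) = (speed ratio to C90, Table 4)
#                                     / (peak ratio to C90, Table 5).
# All values below are transcribed from Tables 4 and 5.

peak_ratio = {"A": 1.00, "B": 0.60, "C": 0.60, "D": 0.28}
speed_ratio_small_grid = {"A": 1.000, "B": 0.024, "C": 0.057, "D": 0.075}

for test, ratio in speed_ratio_small_grid.items():
    eff = ratio / peak_ratio[test]
    print(f"test {test}: efficiency = {eff:.1%}")
```

This makes explicit what the tables imply: even on a single processor, the T3E tests recover well under a tenth of the node's nominal advantage over the older DEC server.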