Attachment C3

Recent Experiences with the Cray T3E_600
at the SDSC

Joel E. Tohline, John Cazes, and Patrick Motl

Department of Physics & Astronomy
Louisiana State University


In late 1998 and early 1999, we successfully rewrote our entire gravitational CFD algorithm, incorporating explicit message-passing instructions via mpi. Exhaustive tests have convinced us that this new version of our code produces physical results identical to the ones generated with our well-tested HPF algorithm. The performance of the new code is illustrated by the numbers shown in the various tables below.
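
Although the rewritten code itself is not reproduced in this attachment, the kind of explicit message passing that the conversion requires can be illustrated with a short, self-contained sketch. The f90 program below is a minimal illustration only: the array dimensions, the one-dimensional decomposition, and the periodic neighbor assignment are our assumptions for the example, not details taken from the production code. It exchanges one plane of ghost cells with each neighboring node via MPI_Sendrecv, the sort of boundary exchange a domain-decomposed CFD grid must perform every timestep.

      ! Minimal ghost-zone exchange sketch (illustrative only; not the production code)
      program ghost_exchange
        implicit none
        include 'mpif.h'
        integer, parameter :: nx = 66, ny = 66, nz = 64   ! assumed local block size, incl. ghost planes
        double precision   :: rho(nx, ny, nz)
        integer :: ierr, rank, nprocs, left, right
        integer :: status(MPI_STATUS_SIZE)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        ! 1-D decomposition in z, with periodic neighbors for simplicity
        left  = mod(rank - 1 + nprocs, nprocs)
        right = mod(rank + 1, nprocs)

        rho = dble(rank)                                  ! dummy data for the example

        ! send the top interior plane to the right neighbor;
        ! receive the left neighbor's top plane into our bottom ghost plane
        call MPI_Sendrecv(rho(:,:,nz-1), nx*ny, MPI_DOUBLE_PRECISION, right, 0, &
                          rho(:,:,1),    nx*ny, MPI_DOUBLE_PRECISION, left,  0, &
                          MPI_COMM_WORLD, status, ierr)

        ! send the bottom interior plane to the left neighbor;
        ! receive the right neighbor's bottom plane into our top ghost plane
        call MPI_Sendrecv(rho(:,:,2),    nx*ny, MPI_DOUBLE_PRECISION, left,  1, &
                          rho(:,:,nz),   nx*ny, MPI_DOUBLE_PRECISION, right, 1, &
                          MPI_COMM_WORLD, status, ierr)

        call MPI_Finalize(ierr)
      end program ghost_exchange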


Scalability of the CFD Code

Performance Measures using the Portland Group Compiler (PGHPF):

Table 1 details our execution times on various configurations of the T3E for a variety of problem sizes. As the table illustrates, we realize almost perfect linear speedup as we move from 2 nodes to 128 nodes on the T3E as long as the size of our problem doubles each time the number of nodes is doubled. This represents significantly better scaling than we have previously been able to achieve on the SP-2.

Table 1

CFD Code Timings on the SDSC T3E_600 [a]
Using PGHPF
(seconds per integration timestep)

Nodes     64³      128×64²    128²×64     128³     256×128²   256²×128     256³
  2      26.60       --          --         --         --         --         --
  4      12.38     22.96         --         --         --         --         --
  8       6.98     12.49 [b]    23.75       --         --         --         --
 16       3.84      7.18       13.21      23.89        --         --         --
 32       2.07      3.75 [c]    6.97      12.49      24.09        --         --
 64        --        --         3.86       7.07      12.69        --         --
128        --        --          --        3.95        --       13.54      24.64
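
Read along its diagonal, Table 1 is effectively a weak-scaling test: the work per node is held fixed while the node count grows, so ideally the time per timestep would remain constant. Framing it that way (our framing, using only entries already in the table), the end-to-end figure is

    \[ E_{\mathrm{weak}} \;=\; \frac{T(2\ \mathrm{nodes},\,64^3)}{T(128\ \mathrm{nodes},\,256^3)} \;=\; \frac{26.60\ \mathrm{s}}{24.64\ \mathrm{s}} \;\approx\; 1.08 , \]

i.e., the time per step changes by less than about 10% while both the problem size and the node count grow by a factor of 64.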



Performance Measures using mpi:

Table 2 details our execution times on various configurations of the T3E for a variety of problem sizes. As the table illustrates, we again realize almost perfect linear speedup as we move from 4 nodes to 128 nodes on the T3E as long as the size of our problem doubles each time the number of nodes is doubled.

Table 2

CFD Code Timings on the SDSC T3E_600 [a]
Using mpi
(seconds per integration timestep)

Nodes   66²×64    66²×128    130×66×128   130²×128   130²×256   258×130×256   258²×256   258²×512
  4      2.456     4.945        9.212         --         --           --          --         --
  8      1.468     2.978 [b]    5.078       10.07        --           --          --         --
 16      0.775     1.630        2.711        5.211     11.37          --          --         --
 32      0.471     0.968 [c]    1.584        3.027      6.573       11.32         --         --
 64       --         --         0.878        1.617      3.493        5.983      11.40        --
128       --         --          --          0.968      2.057        3.453       6.548     15.19



Speedup of mpi over PGHPF:

Table 3 provides a brief comparison between the numbers in Table 1 and the numbers in Table 2 in order to show at a glance how much the execution time of our CFD code has been improved by changing from HPF to mpi. To derive the numbers shown in Table 3, we have divided each entry along the diagonal of Table 1 by the corresponding diagonal entry of Table 2, and have adjusted the ratio to take into account the fact that the grid sizes are not identical in the two tables.
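
For example, using the 4-node diagonal entries, the first entry of Table 3 appears to follow from scaling the raw timing ratio by the ratio of the two grid sizes (our reconstruction of that adjustment):

    \[ \frac{T_{\mathrm{HPF}}}{T_{\mathrm{mpi}}} \times \frac{N_{\mathrm{mpi}}}{N_{\mathrm{HPF}}} \;=\; \frac{12.38}{2.456} \times \frac{66^2 \times 64}{64^3} \;\approx\; 5.04 \times 1.06 \;\approx\; 5.36 . \]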

Table 3

CFD Code Timings on the SDSC T3E_600 [a]
Ratio of PGHPF timings to mpi timings

Nodes     64³      128×64²    128²×64     128³     256×128²   256²×128     256³
  4      5.36        --          --         --         --         --         --
  8       --        4.46         --         --         --         --         --
 16       --         --         5.10        --         --         --         --
 32       --         --          --        4.26        --         --         --
 64       --         --          --         --        3.75        --         --
128       --         --          --         --         --        4.01        --



Single-Processor Test Results (mpi vs. HPF)

As Table 4 documents, most of the improvement that we gained by moving from an HPF-based code to an mpi-based code can be understood by looking at single-processor execution speeds. Using mpi (which, in turn, permits us to use f90 directly without passing through PGHPF), we gain a factor of approximately 2 simply by shifting to array sizes that do not have power-of-two dimensions, and another factor of approximately 1.5 by turning streams on. The final improvement, which gives us a factor of 4 speedup overall, comes from the parallel implementation. And, as the accompanying "pat" report indicates, this final speedup comes not so much from an overall improvement in communications efficiency as from the fact that the mpi-based code requires significantly fewer floating-point operations! (This last point came as a bit of a surprise to us.)

Table 4

SDSC: T3E_600

                                                       Execution Speed (MFlops)
Test   Compiler + Options                   Streams   Grid 64×32×32   Grid 67×32×32
  A    PGHPF -O3                              OFF         12.79           13.06
  B    f90 (mpi) -O3,aggress -lmfastv         OFF         15.61           24.83
  C    f90 (mpi) -O3,aggress -lmfastv         ON          15.61           36.09
  D    f90 (mpi) -O3,aggress -sdefault32      ON          21.73           43.06
The numbers in this table have been obtained using "pat," a performance monitoring tool that runs on the T3E. The report from which these numbers have been drawn accompanies this proposal as Attachment C1.
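
Reading down the 67×32×32 column of Table 4, the single-processor gains compound roughly as (our arithmetic from the tabulated rates)

    \[ \frac{43.06}{13.06} \;\approx\; 1.9 \times 1.45 \times 1.2 \;\approx\; 3.3 , \]

that is, roughly a factor of 1.9 from the f90/mpi path with non-power-of-two array sizes, 1.45 from turning streams on, and 1.2 from the 32-bit default data size; the remaining gain up to the roughly 4× ratios of Table 3 comes from the parallel implementation itself.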



Scalability of the Gravitational CFD Code

Performance Measures using mpi:

Table 5 details our execution times on the same configurations of the T3E and for the same problem sizes reported in Table 2, but here the timings include the solution of the global Poisson equation along with the CFD evolution.
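
For reference, the global Poisson equation referred to here is the standard relation between the Newtonian gravitational potential Φ and the mass density ρ,

    \[ \nabla^2 \Phi \;=\; 4\pi G \rho , \]

which is solved over the global grid at each timestep so that the resulting gravitational acceleration can be fed back into the fluid equations.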

Table 5

Gravitational CFD Code Timings on the SDSC T3E_600 [a]
Using mpi
(seconds per integration timestep)

Nodes   66²×64    66²×128    130×66×128   130²×128   130²×256   258×130×256   258²×256
  4      3.552     7.016         --           --         --           --          --
  8      2.050     4.122         --           --         --           --          --
 16      1.118     2.237        4.008         --         --           --          --
 32      0.6562    1.311        2.269        4.394       --           --          --
 64       --         --         1.280        2.381      4.850        8.638        --
128       --         --          --          1.398      2.832        4.848        --
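
Comparing Table 5 with Table 2 at matching node counts and grid sizes gives a rough measure of what the gravitational solve adds; for example (our arithmetic from the two tables), at 32 nodes on the 66²×64 grid,

    \[ \frac{0.6562\ \mathrm{s}}{0.471\ \mathrm{s}} \;\approx\; 1.4 , \]

i.e., solving the Poisson equation adds roughly 40% to the cost of each timestep at that problem size.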



FOOTNOTES:

[a] To obtain the execution times reported in these tables, the hydrocode was run for 200 integration timesteps utilizing the grid resolution specified at the top of each column.

[b] For comparison (see Table 1 for details), running the same size problem (128×64²) on a single node of a Cray Y/MP requires 13.30 cpu seconds per integration timestep.

[c] For comparison (see Table 1 for details), running the same size problem (128×64²) on a single node of a Cray C90 requires 4.01 cpu seconds per integration timestep, and on an 8,192-node MasPar MP-1 requires 4.74 cpu seconds per integration timestep.

