Attachment C3

Recent Experiences with the Cray T3E_600
at the SDSC

Joel E. Tohline, John Cazes, and Patrick Motl

Department of Physics & Astronomy
Louisiana State University


In late 1998 and early 1999, we successfully rewrote our entire gravitational CFD algorithm, incorporating explicit message-passing instructions via mpi. Exhaustive tests have convinced us that this new version of our code produces physical results identical to the ones generated with our well-tested HPF algorithm. The performance of the new code is illustrated by the numbers shown in the various tables below.
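
Although the rewritten code itself is not reproduced in this attachment, the kind of explicit message passing that the conversion requires can be illustrated with a short, self-contained sketch. The f90 program below is a minimal illustration only: the array dimensions, the one-dimensional decomposition, and the periodic neighbor assignment are our assumptions for the example, not details taken from the production code. It exchanges one plane of ghost cells with each neighboring node via MPI_Sendrecv, the sort of boundary exchange a domain-decomposed CFD grid must perform every timestep.

      ! Minimal ghost-zone exchange sketch (illustrative only; not the production code)
      program ghost_exchange
        implicit none
        include 'mpif.h'
        integer, parameter :: nx = 66, ny = 66, nz = 64   ! assumed local block size, incl. ghost planes
        double precision   :: rho(nx, ny, nz)
        integer :: ierr, rank, nprocs, left, right
        integer :: status(MPI_STATUS_SIZE)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        ! 1-D decomposition in z, with periodic neighbors for simplicity
        left  = mod(rank - 1 + nprocs, nprocs)
        right = mod(rank + 1, nprocs)

        rho = dble(rank)                                  ! dummy data for the example

        ! send the top interior plane to the right neighbor;
        ! receive the left neighbor's top plane into our bottom ghost plane
        call MPI_Sendrecv(rho(:,:,nz-1), nx*ny, MPI_DOUBLE_PRECISION, right, 0, &
                          rho(:,:,1),    nx*ny, MPI_DOUBLE_PRECISION, left,  0, &
                          MPI_COMM_WORLD, status, ierr)

        ! send the bottom interior plane to the left neighbor;
        ! receive the right neighbor's bottom plane into our top ghost plane
        call MPI_Sendrecv(rho(:,:,2),    nx*ny, MPI_DOUBLE_PRECISION, left,  1, &
                          rho(:,:,nz),   nx*ny, MPI_DOUBLE_PRECISION, right, 1, &
                          MPI_COMM_WORLD, status, ierr)

        call MPI_Finalize(ierr)
      end program ghost_exchange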


Scalability of the CFD Code

Performance Measures using the Portland Group Compiler (PGHPF):

Table 1 details our execution times on various configurations of the T3E for a variety of problem sizes. As the table illustrates, we realize almost perfect linear speedup as we move from 2 nodes to 128 nodes on the T3E as long as the size of our problem doubles each time the number of nodes is doubled. This represents significantly better scaling than we have previously been able to achieve on the SP-2.

Table 1

CFD Code Timings on the SDSC T3E_600 [a]
Using PGHPF
(seconds per integration timestep)

Nodes     64³      128×64²    128²×64     128³     256×128²   256²×128     256³
  2      26.60       --          --         --         --         --         --
  4      12.38     22.96         --         --         --         --         --
  8       6.98     12.49 [b]    23.75       --         --         --         --
 16       3.84      7.18       13.21      23.89        --         --         --
 32       2.07      3.75 [c]    6.97      12.49      24.09        --         --
 64        --        --         3.86       7.07      12.69        --         --
128        --        --          --        3.95        --       13.54      24.64
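
Read along its diagonal, Table 1 is effectively a weak-scaling test: the work per node is held fixed while the node count grows, so ideally the time per timestep would remain constant. Framing it that way (our framing, using only entries already in the table), the end-to-end figure is

    \[ E_{\mathrm{weak}} \;=\; \frac{T(2\ \mathrm{nodes},\,64^3)}{T(128\ \mathrm{nodes},\,256^3)} \;=\; \frac{26.60\ \mathrm{s}}{24.64\ \mathrm{s}} \;\approx\; 1.08 , \]

i.e., the time per step changes by less than about 10% while both the problem size and the node count grow by a factor of 64.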



Performance Measures using mpi:

Table 2 details our execution times on various configurations of the T3E for a variety of problem sizes. As the table illustrates, we again realize almost perfect linear speedup as we move from 4 nodes to 128 nodes on the T3E as long as the size of our problem doubles each time the number of nodes is doubled.

Table 2

CFD Code Timings on the SDSC T3E_600 [a]
Using mpi
(seconds per integration timestep)

Nodes   66²×64    66²×128    130×66×128   130²×128   130²×256   258×130×256   258²×256   258²×512
  4      2.456     4.945        9.212         --         --           --          --         --
  8      1.468     2.978 [b]    5.078       10.07        --           --          --         --
 16      0.775     1.630        2.711        5.211     11.37          --          --         --
 32      0.471     0.968 [c]    1.584        3.027      6.573       11.32         --         --
 64       --         --         0.878        1.617      3.493        5.983      11.40        --
128       --         --          --          0.968      2.057        3.453       6.548     15.19



Speedup of mpi over PGHPF:

Table 3 provides a brief comparison between the numbers in Table 1 and the numbers in Table 2 in order to show at a glance how much the execution time of our CFD code has been improved by changing from HPF to mpi. To derive the numbers shown in Table 3, we have divided each entry along the diagonal of Table 1 by the corresponding diagonal entry of Table 2, and have adjusted the ratio to take into account the fact that the grid sizes are not identical in the two tables.
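
For example, using the 4-node diagonal entries, the first entry of Table 3 appears to follow from scaling the raw timing ratio by the ratio of the two grid sizes (our reconstruction of that adjustment):

    \[ \frac{T_{\mathrm{HPF}}}{T_{\mathrm{mpi}}} \times \frac{N_{\mathrm{mpi}}}{N_{\mathrm{HPF}}} \;=\; \frac{12.38}{2.456} \times \frac{66^2 \times 64}{64^3} \;\approx\; 5.04 \times 1.06 \;\approx\; 5.36 . \]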

Table 3

CFD Code Timings on the SDSC T3E_600 [a]
Ratio of PGHPF timings to mpi timings

Nodes     64³      128×64²    128²×64     128³     256×128²   256²×128     256³
  4      5.36        --          --         --         --         --         --
  8       --        4.46         --         --         --         --         --
 16       --         --         5.10        --         --         --         --
 32       --         --          --        4.26        --         --         --
 64       --         --          --         --        3.75        --         --
128       --         --          --         --         --        4.01        --



Single-Processor Test Results (mpi vs. HPF)

As Table 4 documents, most of the improvement that we gained by moving from an HPF-based code to an mpi-based code can be understood by looking at single-processor execution speeds. Using mpi (which, in turn, permits us to use f90 directly without passing through PGHPF), we gain a factor of approximately 2 simply by shifting to array sizes that do not have power-of-two dimensions, and another factor of approximately 1.5 by turning streams on. The final improvement, which gives us a factor of 4 speedup overall, comes from the parallel implementation. And, as the accompanying "pat" report indicates, this final speedup comes not so much from an overall improvement in communications efficiency as from the fact that the mpi-based code requires significantly fewer floating-point operations! (This last point came as a bit of a surprise to us.)

Table 4

SDSC: T3E_600

                                                       Execution Speed (MFlops)
Test   Compiler + Options                   Streams   Grid 64×32×32   Grid 67×32×32
  A    PGHPF -O3                              OFF         12.79           13.06
  B    f90 (mpi) -O3,aggress -lmfastv         OFF         15.61           24.83
  C    f90 (mpi) -O3,aggress -lmfastv         ON          15.61           36.09
  D    f90 (mpi) -O3,aggress -sdefault32      ON          21.73           43.06
The numbers in this table have been obtained using "pat," a performance monitoring tool that runs on the T3E. The report from which these numbers have been drawn accompanies this proposal as Attachment C1.
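
Reading down the 67×32×32 column of Table 4, the single-processor gains compound roughly as (our arithmetic from the tabulated rates)

    \[ \frac{43.06}{13.06} \;\approx\; 1.9 \times 1.45 \times 1.2 \;\approx\; 3.3 , \]

that is, roughly a factor of 1.9 from the f90/mpi path with non-power-of-two array sizes, 1.45 from turning streams on, and 1.2 from the 32-bit default data size; the remaining gain up to the roughly 4× ratios of Table 3 comes from the parallel implementation itself.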



Scalability of the Gravitational CFD Code

Performance Measures using mpi:

Table 5 details our execution times on the same configurations of the T3E and for the same problem sizes reported in Table 2, but here the timings include the solution of the global Poisson equation along with the CFD evolution.
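
For reference, the global Poisson equation referred to here is the standard relation between the Newtonian gravitational potential Φ and the mass density ρ,

    \[ \nabla^2 \Phi \;=\; 4\pi G \rho , \]

which is solved over the global grid at each timestep so that the resulting gravitational acceleration can be fed back into the fluid equations.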

Table 5

Gravitational CFD Code Timings on the SDSC T3E_600 [a]
Using mpi
(seconds per integration timestep)

Nodes   66²×64    66²×128    130×66×128   130²×128   130²×256   258×130×256   258²×256
  4      3.552     7.016         --           --         --           --          --
  8      2.050     4.122         --           --         --           --          --
 16      1.118     2.237        4.008         --         --           --          --
 32      0.6562    1.311        2.269        4.394       --           --          --
 64       --         --         1.280        2.381      4.850        8.638        --
128       --         --          --          1.398      2.832        4.848        --
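
Comparing Table 5 with Table 2 at matching node counts and grid sizes gives a rough measure of what the gravitational solve adds; for example (our arithmetic from the two tables), at 32 nodes on the 66²×64 grid,

    \[ \frac{0.6562\ \mathrm{s}}{0.471\ \mathrm{s}} \;\approx\; 1.4 , \]

i.e., solving the Poisson equation adds roughly 40% to the cost of each timestep at that problem size.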



FOOTNOTES:

[a] To obtain the execution times reported in these tables, the hydrocode was run for 200 integration timesteps utilizing the grid resolution specified at the top of each column.

[b] For comparison (see Table 1 for details), running the same size problem (128×64²) on a single node of a Cray Y/MP requires 13.30 cpu seconds per integration timestep.

[c] For comparison (see Table 1 for details), running the same size problem (128×64²) on a single node of a Cray C90 requires 4.01 cpu seconds per integration timestep, and on an 8,192-node MasPar MP-1 requires 4.74 cpu seconds per integration timestep.

