Early Experiences with the Cray T3E at the SDSC

Joel E. Tohline and John Cazes

Department of Physics & Astronomy
Louisiana State University

Part II: Scalability Using the Portland Group HPF Compiler

Parallel Programming Method:

We have been utilizing HPF compilers (and variations thereof) almost exclusively to handle data distribution and message passing tasks. On the T3E specifically, we are using the Portland Group's High Performance Fortran (PGHPF) compiler.

Porting to the T3E:

Part 1 -- Necessary code modifications

Our code already included "ALIGN" and "DISTRIBUTE" statements and CMFortran's version of compiler directives. The following list covers only the changes we have made so far to get our code running on the T3E with the Portland Group compiler. Much more remains to be done to optimize the code further.
  1. All of the compiler directives were changed from "CMF$" to "!HPF$".
  2. A processor layout was set up with the "PROCESSORS" directive.
  3. All of the "LAYOUT" directives in CMFortran were changed to "DISTRIBUTE" directives. The keyword ":NEWS" was changed to "BLOCK" and ":SERIAL" was changed to "*". The syntax of the CMFortran "ALIGN" statements matched that of the HPF compiler, so they were not changed.
  4. FFT and some other vendor-supplied subroutines were cut out until suitable replacements can be made on the T3E.
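
The substitutions described in items 1 and 3 are purely textual, so they can be illustrated with a short script. The Python sketch below is illustrative only: it is not the procedure that was actually used for the port, the function name cmf_to_hpf is hypothetical, and the sample directive lines and array names (rho, phi) are made up for the demonstration. It also assumes that every directive starts with "CMF$" in column 1. The "PROCESSORS" directive of item 2 has no CMFortran counterpart in this sketch and would still have to be added by hand.

```python
import re


def cmf_to_hpf(line: str) -> str:
    """Apply the directive substitutions from items 1 and 3 to one source line."""
    if not line.upper().startswith("CMF$"):
        return line  # ordinary Fortran statement: leave it untouched

    body = line[4:]  # everything after the "CMF$" prefix
    # Item 3: LAYOUT becomes DISTRIBUTE; ALIGN statements are left as written.
    body = re.sub(r"\bLAYOUT\b", "DISTRIBUTE", body, flags=re.IGNORECASE)
    # Item 3: the axis keywords :NEWS and :SERIAL become BLOCK and *.
    body = re.sub(r":NEWS\b", "BLOCK", body, flags=re.IGNORECASE)
    body = re.sub(r":SERIAL\b", "*", body, flags=re.IGNORECASE)
    # Item 1: the directive prefix changes from CMF$ to !HPF$.
    return "!HPF$" + body


# Hypothetical input lines, invented for this example.
sample = [
    "CMF$ LAYOUT rho(:NEWS, :NEWS, :SERIAL)",
    "CMF$ ALIGN phi(i,j,k) WITH rho(i,j,k)",
    "      rho = 0.0",
]

for source_line in sample:
    print(cmf_to_hpf(source_line))
```

Only lines that begin with the directive prefix are rewritten, so ordinary Fortran statements pass through unchanged.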

Part 2 -- Problems we have faced

  1. The PGHPF compiler would not allow dummy arrays in subroutines to be aligned with arrays outside the subroutine, so these dummy arrays were distributed directly in the subroutines.
  2. The Cray profiler, Apprentice, does not work well with the Portland Group compiler because it apparently does not have access to the statistics from the PGHPF subroutines.
  3. The Portland Group's own profiler, PGProf, is not available on the T3E (whereas it is available on the SP-2). This tool would be useful to HPF programmers.
  4. As remote Internet users of the SDSC facilities, we sorely miss the ASCII interface to mppview (which was available on the T3D). The xmppview tool, while beautiful, is much too slow to be useful to remote users.
  5. It would be extremely useful to know exactly what physical processor layout one can expect for a given number of processors and how that relates to one's virtual processor layout. (The ASCII interface to mppview gave us this information quickly.)

Performance Measures:

Table 2 lists our execution times for various T3E configurations and problem sizes. As the table illustrates, we realize almost perfect linear speedup as we move from 2 nodes to 128 nodes on the T3E, as long as the size of our problem doubles each time the number of nodes is doubled: along the diagonal of the table, the time per integration timestep stays nearly constant, between about 23 and 27 seconds. This represents significantly better scaling than we previously have been able to achieve on the SP-2.
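
The scaling claim can be checked with a little arithmetic on the diagonal entries of Table 2 (the runs in which the grid doubles whenever the node count doubles). The Python sketch below is a reader's illustration, not part of the original analysis. The choice of the 2-node run as the baseline and the efficiency definition t_base / t_N are assumptions made for this example. The table has no 64-node entry on the diagonal, so that point is omitted.

```python
# (nodes, grid dimensions, seconds per timestep) along the diagonal of Table 2
diagonal = [
    (2,   (64, 64, 64),    26.60),
    (4,   (128, 64, 64),   22.96),
    (8,   (128, 128, 64),  23.75),
    (16,  (128, 128, 128), 23.89),
    (32,  (256, 128, 128), 24.09),
    (128, (256, 256, 256), 24.64),
]


def cells_per_node(nodes, dims):
    """Number of grid cells handled by each node for a given run."""
    nx, ny, nz = dims
    return nx * ny * nz // nodes


def weak_scaling_efficiency(rows):
    """Return (nodes, efficiency) pairs, with efficiency = t_base / t_N.

    The first row is used as the baseline (an assumption of this sketch).
    """
    t_base = rows[0][2]
    return [(nodes, t_base / t) for nodes, _, t in rows]


for (nodes, dims, t), (_, eff) in zip(diagonal, weak_scaling_efficiency(diagonal)):
    print(f"{nodes:4d} nodes  {cells_per_node(nodes, dims):7d} cells/node  "
          f"{t:6.2f} s/step  efficiency {eff:.2f}")
```

Every run on the diagonal carries the same number of cells per node, which is what makes the timings directly comparable. Because the 2-node run is the slowest entry on the diagonal, efficiencies measured against it come out slightly above 1; a different baseline would shift the numbers but not the nearly flat trend.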

Table 2

Timings on the T3E (a)
(seconds per integration timestep)

Nodes        64³    128×64²    128²×64       128³   256×128²   256²×128       256³
    2      26.60         --         --         --         --         --         --
    4      12.38      22.96         --         --         --         --         --
    8       6.98  (b) 12.49      23.75         --         --         --         --
   16       3.84       7.18      13.21      23.89         --         --         --
   32       2.07   (c) 3.75       6.97      12.49      24.09         --         --
   64         --         --       3.86       7.07      12.69         --         --
  128         --         --         --       3.95         --      13.54      24.64


(a) To obtain the execution times reported in this table, the hydrocode was run for 200 integration timesteps at the grid resolution specified at the top of each column of the table. It should be noted that the timing comparisons were obtained with a purely hydrodynamic version of the code; that is, the Poisson equation was not solved and, hence, the self-gravity of the fluid was not included.

(b) For comparison (see Table 1 for details), running the same size problem (128 × 64²) on a single node of a Cray Y/MP requires 13.30 cpu seconds per integration timestep.

(c) For comparison (see Table 1 for details), running the same size problem (128 × 64²) on a single node of a Cray C90 requires 4.01 cpu seconds per integration timestep, and on an 8,192-node MasPar MP-1 requires 4.74 cpu seconds per integration timestep.
