Let's outline a simulation scenario that would permit us to view and
analyze results in real time while sitting in a CAVE visualization
environment, for example. We'll base our estimates on the current
throughput of our hydrocode (FLOWER) running on LSU's 'SuperMike' or
NCSA's 'tungsten' Linux clusters.
Binary Mass-Transfer Simulation
Preface: We are accustomed to generating animation sequences that contain 120 image frames per binary orbit. Because each animation sequence plays at 30 frames/second, the binary system completes one orbit every 4 seconds of movie time. This seems to be a good pace when the simulation is viewed from a frame that rotates with the orbital frequency, because in that frame of reference not much happens on timescales shorter than a few orbits. At present, however, it does not appear to be feasible to produce a movie of this type (4 seconds per orbit) in real time.
Instead, let's consider viewing the simulation from an inertial frame of reference, in which case the system as a whole undergoes a great deal of change during each orbit (each star wanders all the way around its orbit) so the "movie" should be interesting even if we stretch it out so that a single orbit requires one minute (60 seconds) to complete. Therefore, the first question is, "Can the hydrodynamic simulation be carried out at a pace where a single orbit is completed in one minute of wall-clock time?"
According to Mario D'Souza, a simulation (of a q = 0.4 binary system) conducted with a grid resolution of 256³ on 256 SuperMike processors requires one second of wall-clock time to complete one integration time step, and 10⁵ time steps are required to push the system through one binary orbit. This means that each orbit requires 28 wall-clock hours (!), i.e., approximately 1700 minutes. This is way too slow for real-time analysis! In order to get down to approximately 1 minute of wall-clock time per orbit, we need to speed up the simulation by a factor of approximately 1700. Let's do this as follows:
Based on these estimates, a simulation that is run with a grid resolution of 64³ on 600 processors of the LONI machines should speed up by a factor of approximately (5 × 2.3 × 256) = 2944. This is more than what we need (that is, it provides a nice margin of error for our estimates). Such a simulation should require only about 35 wall-clock seconds to complete one binary orbit (and each orbit will require about 10⁵/4 = 25,000 time steps).
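The arithmetic behind these estimates can be checked with a short Python sketch. All numbers are taken from the text; the variable names are illustrative only, and the three speedup factors are multiplied exactly as quoted, without assuming what each one represents.

```python
import math

# Baseline quoted in the text: 256^3 grid on 256 SuperMike processors.
seconds_per_step = 1.0      # wall-clock seconds per integration time step
steps_per_orbit = 10**5     # time steps per binary orbit

baseline_seconds = seconds_per_step * steps_per_orbit
baseline_hours = baseline_seconds / 3600.0
baseline_minutes = baseline_seconds / 60.0

# Target pace: one orbit per minute of wall-clock time.
target_seconds = 60.0
required_speedup = baseline_seconds / target_seconds

# Speedup factors exactly as quoted in the text.
speedup_factors = (5, 2.3, 256)
estimated_speedup = math.prod(speedup_factors)

# Projected cost of one orbit at 64^3 on 600 processors.
projected_seconds = baseline_seconds / estimated_speedup
projected_steps = steps_per_orbit // 4

print(f"baseline: {baseline_hours:.1f} h = {baseline_minutes:.0f} min per orbit")
print(f"required speedup: {required_speedup:.0f}")
print(f"estimated speedup: {estimated_speedup:.0f}")
print(f"projected: {projected_seconds:.0f} s and {projected_steps} steps per orbit")
```

The projected time comes out slightly under the "about 35 seconds" quoted above, which is within the rounding used throughout these estimates.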
Things to consider: Each LONI-machine processor will contain approximately 64³/600 ≈ 440 fluid grid cells; will this fit entirely into cache? Should we reconsider how the data domain decomposition is done? Will we be killed by the "transpose" step inside the Poisson solver?
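As a rough aid for the cache question, the following Python sketch computes the cells per processor and a memory footprint for the local block. The number of fields stored per cell (`ASSUMED_FIELDS`) is a hypothetical placeholder, not a FLOWER value, and ghost zones and Poisson-solver work arrays are ignored; whether the result fits in cache depends on the hardware.

```python
GRID_CELLS = 64**3       # total fluid cells at 64^3 resolution
N_PROCESSORS = 600       # processor count quoted in the text

cells_per_processor = GRID_CELLS / N_PROCESSORS

def footprint_bytes(n_cells, n_fields, bytes_per_value=8):
    """Bytes needed to hold n_fields double-precision values on n_cells cells."""
    return n_cells * n_fields * bytes_per_value

# Hypothetical number of fields per cell; adjust to the actual code.
ASSUMED_FIELDS = 10

local_kib = footprint_bytes(cells_per_processor, ASSUMED_FIELDS) / 1024.0

print(f"cells per processor: {cells_per_processor:.0f}")
print(f"local footprint with {ASSUMED_FIELDS} fields: {local_kib:.0f} KiB")
```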
Passing Data to the CAVE
If the "image" that is produced for the CAVE is updated 30 times each second, each binary orbit will then require the production of approximately 35 × 30 ≈ 1000 images; that is, an image must be constructed by FLOWER approximately every 25 time steps. Since the "image" that will be sent to the CAVE is actually VRML data, this means that Wes Even's marching cubes algorithm will be called approximately once every 25 integration time steps.
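The frame-rate bookkeeping can be written out as a small Python sketch. The inputs are the figures quoted in the text; the variable names are illustrative.

```python
seconds_per_orbit = 35      # projected wall-clock seconds per orbit
frames_per_second = 30      # CAVE update rate
steps_per_orbit = 25_000    # time steps per orbit at 64^3 resolution

# Unrounded values.
frames_per_orbit = seconds_per_orbit * frames_per_second
steps_per_frame = steps_per_orbit / frames_per_orbit

# Rounded values used in the text.
rounded_frames = 1000
rounded_steps_per_frame = steps_per_orbit // rounded_frames

print(f"frames per orbit: {frames_per_orbit} (rounded to {rounded_frames})")
print(f"steps per frame: {steps_per_frame:.1f} (rounded to {rounded_steps_per_frame})")
```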
How many vertices will be generated by each processor? Well, typically each isodensity surface requires 10⁴ to 10⁵ vertices, which means that, on average, each of the (600) processors will need to generate 17 to 170 vertices. (Actually, the load balance will not likely be good because of surface-to-volume issues.)
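The per-processor average and the load-balance concern can be illustrated with a toy Python sketch. The averages use the numbers in the text. The second part is purely hypothetical: it replaces the real isodensity surface with a single sphere and the 600-processor decomposition with 8 × 8 × 8 = 512 cubic blocks, and counts how many blocks the surface passes through at all.

```python
import math

N_PROCESSORS = 600
VERTICES_LOW, VERTICES_HIGH = 10**4, 10**5

avg_low = VERTICES_LOW / N_PROCESSORS
avg_high = VERTICES_HIGH / N_PROCESSORS

def blocks_crossed_by_sphere(grid=64, block=8, radius=20.0):
    """Count cubic blocks that a sphere centered in the grid passes through.

    Toy model only: the grid size, block size, and radius are assumptions.
    Returns (blocks crossed, total blocks).
    """
    center = grid / 2.0
    n = grid // block
    crossed = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                near_sq = 0.0   # squared distance to nearest point of block
                far_sq = 0.0    # squared distance to farthest corner
                for idx in (i, j, k):
                    lo, hi = idx * block, (idx + 1) * block
                    near = max(lo - center, 0.0, center - hi)
                    far = max(abs(center - lo), abs(center - hi))
                    near_sq += near * near
                    far_sq += far * far
                # The surface crosses the block if the radius lies between
                # the nearest and farthest distances from the center.
                if math.sqrt(near_sq) <= radius <= math.sqrt(far_sq):
                    crossed += 1
    return crossed, n**3

crossed, total = blocks_crossed_by_sphere()
print(f"average vertices per processor: {avg_low:.0f} to {avg_high:.0f}")
print(f"blocks touched by the toy surface: {crossed} of {total}")
```

Blocks that the surface never touches generate no vertices at all, so the actual work is concentrated on a subset of the processors.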
After discussions with Richard Muffoletto, it seems like the most efficient way to get these vertices to the CAVE is to let each processor send its information directly across the LONI network asynchronously, rather than waiting for all of the processors to finish and gathering the data together into one location before transferring the data to the CAVE. An asynchronous transmission will be beneficial because (a) it will keep the network active a larger fraction of the time, and (b) it will take advantage of the fact that the processors will finish the vertex-generation step at different times, since they will have varying numbers (from 0 to several hundred) of vertices to create.
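The asynchronous pattern can be mimicked on a single machine with Python threads and a queue. This is only an analogy, assuming nothing about the actual FLOWER or LONI networking code: each thread stands in for a processor, the `sleep` stands in for vertex generation, and the queue stands in for the CAVE's receiving end.

```python
import queue
import threading
import time

def worker(rank, n_vertices, outbox):
    """Pretend to build n_vertices vertices, then send them immediately."""
    time.sleep(0.001 * n_vertices)   # stand-in for vertex-generation time
    outbox.put((rank, n_vertices))   # send as soon as this rank is done

def run_async(vertex_counts):
    """Return (rank, n_vertices) messages in the order they arrive."""
    outbox = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, n, outbox))
        for rank, n in enumerate(vertex_counts)
    ]
    for t in threads:
        t.start()
    # The receiver handles each message as it arrives instead of
    # waiting for every sender to finish.
    arrivals = [outbox.get() for _ in threads]
    for t in threads:
        t.join()
    return arrivals

arrivals = run_async([0, 30, 5, 12])
print("arrival order (rank, vertices):", arrivals)
```

Ranks with few or no vertices tend to report first, so the receiver is kept busy while the slower ranks are still working.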
We might also consider the following: The CAVE will need a new surface every 1/30 of a second. This means that it has 1/30 of a second over which to gather the new set of vertices together, during which time FLOWER will take approximately 25 time steps. So instead of calling Wes Even's program only once every 25 time steps, why not have 1/25 of the processors (on each LONI machine) call Wes Even's program every time step (of course, a different subset of processors will be activated each time)? This will cut down on network contention within the LONI machines (only 4 processors on each machine will be sending data down the network each time step!).
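One way to pick the active subset is a simple round-robin on the processor rank, sketched below in Python. The function name and the figure of 100 processors per machine are assumptions; the latter is inferred from "4 processors on each machine" (100/25 = 4) and is not stated in the text.

```python
STEPS_PER_FRAME = 25

def active_ranks(step, ranks_on_machine, steps_per_frame=STEPS_PER_FRAME):
    """Ranks that generate and send vertices on this time step.

    A rank is active when its index modulo steps_per_frame matches the
    current phase, so each rank is active once per frame interval.
    """
    phase = step % steps_per_frame
    return [r for r in ranks_on_machine if r % steps_per_frame == phase]

# Hypothetical machine with 100 processors.
machine = list(range(100))

schedule = [active_ranks(step, machine) for step in range(STEPS_PER_FRAME)]
print("step 0 senders:", schedule[0])
print("step 1 senders:", schedule[1])
```

With this choice every rank sends exactly once in each window of 25 time steps, and no more than 4 ranks per machine send on any single step.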
And, as Richard Muffoletto points out, it is not absolutely essential that a completely new iso-surface be available for the CAVE to display every 1/30th of a second. The surface will not change its shape very rapidly, so partial (incomplete) updates probably won't interfere with the viewer's perception that things are changing gradually. The CAVE client will simply need to keep track of which one of the 600 LONI processors has just sent a new set of vertices and it can replace that subset. (Overall synchronization of the image will occur naturally because the fluid flow must already be synchronized within FLOWER every integration time step!)
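The client-side bookkeeping described here can be sketched in Python as a cache of per-processor vertex sets. The class and method names are hypothetical and are not part of any existing CAVE software; vertices are represented as plain coordinate tuples.

```python
class SurfaceCache:
    """Keeps the most recent vertex set received from each processor."""

    def __init__(self, n_ranks):
        # Start with an empty patch for every rank.
        self._patches = {rank: [] for rank in range(n_ranks)}

    def update(self, rank, vertices):
        """Replace only the patch owned by the rank that just reported."""
        self._patches[rank] = list(vertices)

    def surface(self):
        """Current surface: all patches concatenated, new and old alike."""
        return [v for rank in sorted(self._patches) for v in self._patches[rank]]

cache = SurfaceCache(n_ranks=3)
cache.update(0, [(0.0, 0.0, 0.0)])
cache.update(2, [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)])

# A later partial update from rank 2 replaces its patch and leaves rank 0 alone.
cache.update(2, [(1.5, 1.5, 1.5)])
print(cache.surface())
```

Each displayed frame simply draws whatever `surface()` returns at that moment, so patches from slightly different time steps can coexist on screen.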
Forecast made in July 2005.