An HPF performance of a CFD code on the SX-5 SMP node


Mitsuo YOKOKAWA(*1), Yoshinori TSUDA(*1), and Kenji SUEHIRO(*2)

(*1) Japan Atomic Energy Research Institute
1-16-18, Hamamatsu-cho, Minato-ku, Tokyo 105-0013, Japan
{yokokawa,tsuda}@gaia.jaeri.go.jp

(*2) NEC Corporation
suehiro@ccm.cl.nec.co.jp

Hybrid parallel programming models are now essential for obtaining high parallel performance on clusters of shared-memory symmetric multiprocessor (SMP) nodes interconnected by a network.

The Earth Simulator is an SMP cluster being developed by the National Space Development Agency of Japan (NASDA), the Japan Atomic Energy Research Institute (JAERI), and the Japan Marine Science and Technology Center (JAMSTEC). It has 640 processor nodes (PNs) connected by an internode full crossbar switch. Each PN is an SMP system consisting of eight vector processors and 16 GB of shared memory. The total peak performance and main memory capacity are 40 TFLOPS and 10 TB, respectively.

The choice of programming model is a major concern when writing parallel programs for the Earth Simulator, because the machine has three levels of parallelism: vector processing within a processor, parallelism within an SMP node, and parallelism among SMP nodes. To obtain high parallel performance, we should use vector processing in each processor, microtasking (loop-based parallelism on a shared memory system) within a node, and HPF or MPI among nodes. The architecture of an Earth Simulator PN is very similar to that of an SX-5 node, and the SX-5 compilers will be used on it. The SX-5 can therefore be used to evaluate the programming models.

In this study, we evaluated the parallel efficiency of microtasking and of an HPF implementation within a single node. The test program is Trans6, a computational fluid dynamics code that simulates homogeneous isotropic turbulent flow by a pseudospectral method.

In the microtasking case, the original program is compiled with the FORTRAN90/SX compiler, which can automatically parallelize loops within a node. In the HPF case, HPF directives are inserted into the program, which is then compiled with the HPF/SX compiler. We compared the CPU times of the resulting executables on 1, 2, 4, and 8 processors of an NEC SX-5/16A. The number of modes in the pseudospectral method is 128 x 128 x 128.
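To give a feel for what distributing a 128 x 128 x 128 array over 1, 2, 4, and 8 processors involves, here is a minimal sketch of an HPF-style BLOCK distribution along one axis. It is an assumption that one array dimension is BLOCK-distributed; the directives actually inserted into Trans6 are not described here. The function name `block_ranges` and the constant `N_MODES` are hypothetical and used only for illustration.

```python
import math


def block_ranges(n, processors):
    """Index ranges [start, stop) under an HPF-style BLOCK distribution.

    Each processor owns ceil(n / processors) consecutive indices;
    trailing processors may own fewer (or none) when n is not divisible.
    """
    block = math.ceil(n / processors)
    return [
        (min(p * block, n), min((p + 1) * block, n))
        for p in range(processors)
    ]


N_MODES = 128  # modes per dimension, as stated in the text

# Assumption: only one of the three axes is distributed.
for p in (1, 2, 4, 8):
    first_start, first_stop = block_ranges(N_MODES, p)[0]
    print(p, "processor(s):", first_stop - first_start, "planes per processor")
```

Because 128 is divisible by each processor count tested, every processor would own the same number of planes under this assumed layout, so any loss of efficiency would come from communication and overhead rather than from load imbalance.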

On one processor, the CPU time of microtasking is almost the same as that of the HPF implementation. With 8 processors, however, the CPU time of the HPF implementation is 1.58 times that of microtasking. As the number of processors increases, the speedup of the HPF implementation degrades relative to that of microtasking. With 8 processors, parallel efficiencies of 69.87\% and 44.35\% are obtained for microtasking and the HPF implementation, respectively. Microtasking should therefore be used for parallelization within a node.
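The quoted efficiencies and the factor of 1.58 can be cross-checked with a short calculation. The sketch below assumes the usual definitions (speedup = single-processor time / p-processor time, efficiency = speedup / p) and that the single-processor CPU times of the two implementations are equal, which the text states only approximately. All names are illustrative.

```python
def speedup_from_efficiency(efficiency, processors):
    """Speedup S = E * p, from the usual definition E = S / p."""
    return efficiency * processors


PROCESSORS = 8
EFF_MICROTASKING = 0.6987  # parallel efficiency reported in the text
EFF_HPF = 0.4435           # parallel efficiency reported in the text

s_micro = speedup_from_efficiency(EFF_MICROTASKING, PROCESSORS)
s_hpf = speedup_from_efficiency(EFF_HPF, PROCESSORS)

# Assumption: equal single-processor CPU times, so the ratio of the
# 8-processor CPU times is the inverse ratio of the speedups.
time_ratio = s_micro / s_hpf

print(f"microtasking speedup:        {s_micro:.2f}")
print(f"HPF speedup:                 {s_hpf:.2f}")
print(f"HPF / microtasking CPU time: {time_ratio:.2f}")
```

Under these assumptions the speedups are about 5.59 and 3.55, and their ratio is about 1.58, consistent with the CPU time ratio quoted above.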

We also measured the CPU time of a third case, in which several microtasks run within an HPF process in a node. The result may be useful for predicting the performance of multinode implementations. A speedup of 5.03 is obtained with 8 processors. The CPU time of the third case is 13.4\% larger than that of the first case (microtasking), because the HPF implementation needs some preprocessing time. However, the third case is faster than the second case (HPF only). These results indicate that a homogeneous, HPF-only implementation on SMP clusters should be avoided.
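For comparison with the efficiencies given for the first two cases, the speedup of 5.03 can be converted to a parallel efficiency. This assumes the usual definition E = S / p with the speedup measured against a single-processor run. The resulting value is derived here and is not a reported measurement.

```latex
\[
E_{\mathrm{third}} = \frac{S}{p} = \frac{5.03}{8} \approx 0.629 \quad (62.9\%)
\]
```

Under that assumption, the third case falls between the HPF-only case (44.35\%) and the microtasking case (69.87\%), matching the ordering of CPU times described above.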

Results on the multinode system of SX-5 and a comparison with MPI implementations will be presented in the full paper.