Dept. of Computer Science & Engineering, Korea University,
Sungbuk-ku, Seoul 136-701, Korea
{saeri, hyun, kms, myongsp}@iLab.korea.ac.kr
Abstract
Many
data parallel programming languages, such as High Performance Fortran(HPF),
Fortran D, Vienna Fortran, and High Performance C(HPC), provide compiler
directives for programmers to specify array distribution. The ¡°generalized¡±
block distribution, GEN_BLOCK, which is supported in High Performance Fortran
version 2, allows contiguous segments of an array, of possibly unequal
sizes, to be mapped onto processors and is therefore useful for solving
load balancing problems.
The
array redistribution problem has recently received considerable attention.
This interest is motivated largely by the HPF programming style, in which
science applications are decomposed into phases. At each phase, there is
an optimal distribution of the arrays onto the processor grid. Because
the optimal distribution changes from phase to phase, the array redistribution
is a critical operation. Since current GEN_BLOCK distribution may not be
effective for dynamic load balancing at the next time, the redistribution
between different GEN_BLOCKs is necessary.
The
use of data redistribution represents a performance tradeoff between the
expected higher efficiency of a new distribution for subsequent computation
and the communication cost of redistributing the data among processor memories.
Consequently, minimizing the execution time of data redistribution has
obvious merit. Reducing the amount of data exchanged among processor memories
is one possible way of reducing overall redistribution execution time.
This
paper proposes logical processor reordering for minimizing the communication
cost of redistributing the data among processor memories in the GEN_BLOCK
redistribution. The aim of this study is to reduce overall redistribution
execution time by causing an increasing amount of data to remain on the
same processor after redistribution. The data remaining on the same processor
following redistribution is called a Data Hit. This eliminates the exchange
of data elements among processors. Another advantage is that Data Hits
reduce the number of destination processors. Each of these factors reduces
interprocessor communication overhead. This paper focuses on increasing
the number and size of Data Hits. We propose three algorithms for logical
processor reordering. The first one reorders the logical processors to
increase Data Hits regardless of the message sizes exchanged. The second
one enlarges the message size of Data Hits. The third one increases the
number and size of Data Hits by combining the two algorithms.
Various
experiments on CRAY T3E were performed to evaluate the proposed algorithms
and compare typical GEN_BLOCK redistribution, which does not reorder logical
processor numbers. MPI primitives were used for interprocessor communication.
According to the experiments, Number_Oriented algorithm shows better performance
than Size_Oriented algorithm, and Numsize_Oriented algorithm shows the
best performance of the three algorithms. The logical processor reordering
algorithms were found to perform better than typical GEN_BLOCK redistribution
in all cases.