An HPF Implementation of the Multigrid Method and Effectiveness of Compiler Optimizations

Hiroshi Ohta, Systems Development Laboratory, Hitachi, Ltd., ohta@sdl.hitachi.co.jp
Yasunori Nishitani, Kiyoshi Negishi, Eiji Nunohiro, Software Development Division, Hitachi, Ltd.
 
 
The multigrid method is a widely used Poisson solver and is included in the NAS Parallel Benchmarks (NPB) as the benchmark MG [1]. The method uses several grids of different coarseness and requires projection and interpolation between these grids. These characteristics introduce complex array subscripts. For instance, a code fragment taken from the NPB/MG interpolation routine looks like:
  u(2*i1-1,2*i2-1,2*i3) =  z(i1,i2,i3+1)+z(i1,i2,i3)+...
Since many HPF compilers cannot generate efficient code for such complex subscripts, previous HPF implementations of MG have shown poor performance [2]. In this study, we present an efficient HPF implementation of MG and the compiler optimizations needed to achieve good performance.
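
For concreteness, the following is a minimal, self-contained sketch of such an interpolation step, written in Fortran 90 with HPF directives. The array names, grid sizes, and BLOCK distribution are illustrative assumptions and are not taken from the actual NPB/MG source.

  ! Minimal sketch of an MG-style interpolation step (illustrative only).
  program interp_sketch
    implicit none
    integer, parameter :: n = 16              ! coarse-grid extent (assumed)
    real(8) :: z(n, n, n)                     ! coarse grid
    real(8) :: u(2*n, 2*n, 2*n)               ! fine grid
    integer :: i1, i2, i3
  !HPF$ distribute z(*, *, block)
  !HPF$ distribute u(*, *, block)

    z = 1.0d0
    u = 0.0d0

    ! Subscripts such as u(2*i1-1, 2*i2-1, 2*i3) relate fine-grid and
    ! coarse-grid points with a stride of 2, so the compiler must map a
    ! non-unit-stride reference pattern onto the chosen distribution
    ! when it generates communication.
    do i3 = 1, n - 1
      do i2 = 1, n - 1
        do i1 = 1, n - 1
          u(2*i1-1, 2*i2-1, 2*i3) = z(i1, i2, i3+1) + z(i1, i2, i3)
        end do
      end do
    end do

    print *, 'u(1,1,2) =', u(1, 1, 2)
  end program interp_sketch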

We used the following techniques for the HPF implementation.

Given such code, the compiler must generate efficient communication. Our compiler generates communication based on a remapping concept [3]. It uses remapping libraries for the general case, but switches to more efficient communication libraries for specific patterns that frequently appear in practical applications. For these specific cases it performs the optimizations labeled (1)-(3) in Table 1.

We compiled our HPF implementation of MG with our prototype compiler and ran it on the Hitachi SR2201, a distributed-memory parallel machine. Figure 1 shows the execution times obtained with the optimization options listed in Table 1. For comparison, we also show the timings of the hand-parallelized MPI code provided by NASA (NPB2.3-beta).
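
As background for the 'shadow' label in Table 1: HPF 2.0 defines a SHADOW directive that declares an overlap area on a distributed array, which lets a compiler satisfy nearest-neighbor references through a boundary exchange instead of a general remapping. The sketch below is illustrative only; the array names, shadow width, and mapping are assumptions, not necessarily the directives used in the implementation described here.

  ! Illustrative shadow (overlap) declaration for a nearest-neighbor stencil.
  program shadow_sketch
    implicit none
    integer, parameter :: n = 64
    real(8) :: u(n, n, n), r(n, n, n)
    integer :: i1, i2, i3
  !HPF$ distribute u(*, *, block)
  !HPF$ align r(i, j, k) with u(i, j, k)
    ! One-element overlap in the distributed dimension (assumed width).
  !HPF$ shadow u(0, 0, 1)

    u = 1.0d0
    r = 0.0d0

    ! With the shadow area, u(i1,i2,i3-1) and u(i1,i2,i3+1) near a
    ! partition boundary can be read from the local overlap copy after a
    ! single boundary exchange, rather than by remapping all of u.
    do i3 = 2, n - 1
      do i2 = 2, n - 1
        do i1 = 2, n - 1
          r(i1, i2, i3) = u(i1, i2, i3-1) - 2.0d0*u(i1, i2, i3) &
                        + u(i1, i2, i3+1)
        end do
      end do
    end do

    print *, 'r(2,2,2) =', r(2, 2, 2)
  end program shadow_sketch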

We could not run 'remap' on 16 processors because of its large memory requirement. 'remap' did run on 32 and 64 processors, but its execution time was significantly longer than that of the other versions. As more optimizations are applied, the execution time decreases and approaches the 'MPI' time. With all optimizations, the execution time is only 18% longer than that of MPI for class B on 16 processors. We conclude that our HPF implementation, together with these compiler optimizations, achieves sufficiently good performance.

  label    optimizations
  remap    no optimization
  shadow   (1)
  merge    (1)+(2)
  1to1     (1)+(2)+(3)
  MPI      hand-parallelized MPI (NPB2.3-beta)

Table 1. Optimization options.

 
Figure 1. Execution time of MG.
 

References

  1. NASA, The NAS Parallel Benchmarks, http://www.nas.nasa.gov/Software/NPB/
  2. M. Frumkin, H. Jin, and J. Yan, "Implementation of NAS Parallel Benchmarks in High Performance Fortran," IPPS '99.
  3. T. Kamachi, K. Kusano, K. Suehiro, Y. Seo, M. Tamura, and S. Sakon, "Generating Realignment-Based Communication for HPF Programs," IPPS '96, pp. 364-371.