A Data Layout Assistant Tool for Data-Parallel Programs
Atsushi Kubota and Takao Tsuda, Hiroshima City Univ., Japan
To achieve high performance for data-parallel programs on distributed
memory multiprocessors, it is crucial to find the optimal data
layout. We have been developing a data layout assistant tool to help
users to parallelize sequential Fortran programs.
Many researchers have focused on this problem. Kennedy and Kremer[1]
developed a data layout selection tool based on 0-1 integer
programming. This tool enables data remapping of arrays between
program parts. However, much time is required to find the optimal data
layout. Matsuura et al.[2] developed a tool that enables
inter-procedure remapping. This system does not require a long time
to find the optimal data layout.
Our system
- finds intra-procedural static data layout and then
- finds inter-procedural dynamic data layout (remapping).
The intra-procedural static data layout adopts the data layout
selection scheme proposed by Gupta et el.[3]. CAG (Component Affinity
Graph) which represents the relationship between the dimensions of
different arrays is constructed and then the candidate dimension to be
distributed is selected.
In the static data layout analysis, performance prediction is
required. Our system estimates the execution time of each loop based
on the trip counts of the loop, the number of operations and
assignments in the loop and data dependence in the loop.
Inter-procedural dynamic data layout is determined using a greedy search
algorithm.
- A procedure P1 which has the longest estimated execution time
(score) is selected.
- A procedure P2 which is a caller of P1 is selected.
- The layouts of P1 and P2 are matched such that data
remapping does not occur. If this matching is not possible, data
remapping is forced when the call of P2 from P1 and the return from P2
occur.
- If the layouts of P1 and P2 can be matched in step 3.,
P1 and P2 are merged and treated as one procedure.
- Step 3. and 4. are performed for the layout of P1 and P3,
which is a callee of P1.
- Return to step 1. and the analysis is continued until
all procedures are examined.
We have implemented a prototype assistant tool of data layout based on
the proposed scheme and obtained the data layouts for NAS Parallel
Benchmarks 2.3 serial (FT and SP) as shown in Table 1 and 2.
Table 1. Obtained Data Layout for NAS FT
Subroutine | Array(Distribution) | description |
FFT | X1(BLOCK, *, *) | |
FFT | X2(BLOCK, *, *) | |
CFFTS1 | X(*, BLOCK, *) | X1 in FFT |
CFFTS1 | XOUT(*, BLOCK, *) | X1 in FFT |
CFFTS2 | X(BLOCK, *, *) | X1 in FFT |
CFFTS2 | XOUT(BLOCK, *, *) | X1 in FFT |
CFFTS3 | X(BLOCK, *, *) | X1 in FFT |
CFFTS3 | XOUT(BLOCK, *, *) | X2 in FFT |
Table 2. Obtained Data Layout for NAS SP
Subroutine | Array(Distribution) | description |
adi | lhs(*,BLOCK,*,*) | |
adi | rhs(*,BLOCK,*,*) | |
x_solve | lhs(*,BLOCK,*,*) | |
lhsx | lhs(*,BLOCK,*,*) | |
x_solve | rhs(*,BLOCK,*,*) | |
ninvr | rhs(*,BLOCK,*,*) | |
y_solve | lhs(*,*,BLOCK,*) | |
lhsy | lhs(*,*,BLOCK,*) | |
y_solve | rhs(*,*,BLOCK,*) | |
pinvr | rhs(*,*,BLOCK,*) | |
z_solve | lhs(*,BLOCK,*,*) | |
lhsz | lhs(*,BLOCK,*,*) | |
z_solve | rhs(*,BLOCK,*,*) | |
tzetar | rhs(*,BLOCK,*,*) | |
References
- Ken Kennedy and Ulrich Kremer: Automatic
Data Layout for Distributed-Memory Machines, in ACM Transactions on
Programming Languages and Systems, Vol. 20, No.4, July 1998, pages
869-916.
- Ken-ichiro Matsuura, Hitoshi Murai, Kenji Suehiro and
Yoshiki Seo: Fast Automatic Data Layout for Data Parallel Programs, in
IPSJ Journal, Vol. 41, No. 5, May 2000, pages 1420-1429 (in Japanese).
- Manish Gupta and Prithviraj Banerjee: Demonstration of Automatic Data
Partitioning Techniques for Parallelizing Compilers on Multicomputers,
IEEE Transations on Parallel and Distributed Systems, Vol. 3, No. 2,
1992, pages 179-193.