A Data Layout Assistant Tool for Data-Parallel Programs

Atsushi Kubota and Takao Tsuda, Hiroshima City Univ., Japan

To achieve high performance for data-parallel programs on distributed memory multiprocessors, it is crucial to find the optimal data layout. We have been developing a data layout assistant tool to help users to parallelize sequential Fortran programs. Many researchers have focused on this problem. Kennedy and Kremer[1] developed a data layout selection tool based on 0-1 integer programming. This tool enables data remapping of arrays between program parts. However, much time is required to find the optimal data layout. Matsuura et al.[2] developed a tool that enables inter-procedure remapping. This system does not require a long time to find the optimal data layout. Our system

finds intra-procedural static data layout and then
finds inter-procedural dynamic data layout (remapping).

The intra-procedural static data layout adopts the data layout selection scheme proposed by Gupta et el.[3]. CAG (Component Affinity Graph) which represents the relationship between the dimensions of different arrays is constructed and then the candidate dimension to be distributed is selected. In the static data layout analysis, performance prediction is required. Our system estimates the execution time of each loop based on the trip counts of the loop, the number of operations and assignments in the loop and data dependence in the loop. Inter-procedural dynamic data layout is determined using a greedy search algorithm.

A procedure P1 which has the longest estimated execution time (score) is selected.
A procedure P2 which is a caller of P1 is selected.
The layouts of P1 and P2 are matched such that data remapping does not occur. If this matching is not possible, data remapping is forced when the call of P2 from P1 and the return from P2 occur.
If the layouts of P1 and P2 can be matched in step 3., P1 and P2 are merged and treated as one procedure.
Step 3. and 4. are performed for the layout of P1 and P3, which is a callee of P1.
Return to step 1. and the analysis is continued until all procedures are examined.

We have implemented a prototype assistant tool of data layout based on the proposed scheme and obtained the data layouts for NAS Parallel Benchmarks 2.3 serial (FT and SP) as shown in Table 1 and 2.

Table 1. Obtained Data Layout for NAS FT

Subroutine	Array(Distribution)	description
FFT	X1(BLOCK, , )
FFT	X2(BLOCK, , )
CFFTS1	X(, BLOCK, )	X1 in FFT
CFFTS1	XOUT(, BLOCK, )	X1 in FFT
CFFTS2	X(BLOCK, , )	X1 in FFT
CFFTS2	XOUT(BLOCK, , )	X1 in FFT
CFFTS3	X(BLOCK, , )	X1 in FFT
CFFTS3	XOUT(BLOCK, , )	X2 in FFT

Table 2. Obtained Data Layout for NAS SP

Subroutine	Array(Distribution)	description
adi	lhs(,BLOCK,,*)
adi	rhs(,BLOCK,,*)
x_solve	lhs(,BLOCK,,*)
lhsx	lhs(,BLOCK,,*)
x_solve	rhs(,BLOCK,,*)
ninvr	rhs(,BLOCK,,*)
y_solve	lhs(,,BLOCK,*)
lhsy	lhs(,,BLOCK,*)
y_solve	rhs(,,BLOCK,*)
pinvr	rhs(,,BLOCK,*)
z_solve	lhs(,BLOCK,,*)
lhsz	lhs(,BLOCK,,*)
z_solve	rhs(,BLOCK,,*)
tzetar	rhs(,BLOCK,,*)

References

Ken Kennedy and Ulrich Kremer: Automatic Data Layout for Distributed-Memory Machines, in ACM Transactions on Programming Languages and Systems, Vol. 20, No.4, July 1998, pages 869-916.
Ken-ichiro Matsuura, Hitoshi Murai, Kenji Suehiro and Yoshiki Seo: Fast Automatic Data Layout for Data Parallel Programs, in IPSJ Journal, Vol. 41, No. 5, May 2000, pages 1420-1429 (in Japanese).
Manish Gupta and Prithviraj Banerjee: Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers, IEEE Transations on Parallel and Distributed Systems, Vol. 3, No. 2, 1992, pages 179-193.