Overpartitioning with the Rice dHPF Compiler

Bradley Broom, Daniel Chavarria-Miranda, Rob Fowler,
Ken Kennedy, John Mellor-Crummey, and Qing Yi

Strategies for partitioning an application's data play a fundamental role in determining the range of possible parallelizations and, ultimately, their efficiency. This paper surveys HPF extensions in the Rice dHPF compiler that implement a general mechanism for overpartitioning data. In this approach, each physical node is allocated multiple data tiles, and the communication and computation for those tiles are carefully orchestrated to address multiple performance concerns while obeying scheduling and communication constraints. Overpartitioning is the key enabling technology for a diverse set of advanced optimizations.
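
Standard HPF already offers a restricted form of overpartitioning through block-cyclic distribution, which assigns each processor several interleaved tiles rather than one contiguous block. The sketch below illustrates the idea using only standard HPF directives; the array and processor names are illustrative, and the dHPF-specific directive extensions for controlling tile scheduling are not shown.

    !HPF$ PROCESSORS PROCS(4)
          REAL A(1024, 1024)
    ! Each processor owns eight interleaved 32-column tiles of A
    ! rather than a single 256-column block, so communication and
    ! computation on different tiles can be scheduled independently.
    !HPF$ DISTRIBUTE A(*, CYCLIC(32)) ONTO PROCS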

Hand-coded versions of the NAS parallel benchmarks attain their performance by using a technique known as multipartitioning, which guarantees that no processor is ever starved for useful work. We describe the design and implementation of overpartitioning in dHPF and how it is used to derive multipartitioning parallelizations of codes like the NAS benchmarks. We present preliminary results demonstrating how this can help close the substantial gap between the efficiency and scalability of compiler-parallelized codes for multi-directional line-sweep computations and those of their hand-coded counterparts.
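
To make the scheduling property concrete: in a diagonal multipartitioning over p processors, two of the three dimensions of an array are each cut into p slabs, yielding p x p tiles, and tiles on the same wrapped diagonal share an owner. A minimal sketch of the ownership function follows (the function name and 0-based tile coordinates are our illustrative conventions, not dHPF interfaces):

    ! Owner of tile (i, j) in a diagonal multipartitioning.
    ! Each processor owns exactly one tile in every row and every
    ! column of tiles, so a line sweep along either partitioned
    ! dimension keeps all p processors busy.
    integer function tile_owner(i, j, p)
       integer, intent(in) :: i, j, p   ! 0-based tile coordinates; processor count
       tile_owner = mod(i + j, p)
    end function tile_owner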

Overpartitioning is also applicable to compiling very large HPF programs for out-of-core (OOC) execution. We describe how this is done within the dHPF framework. In discussing our initial results, we describe the impact of this strategy on both computation and I/O costs.
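
The essential code shape for OOC execution is a loop that stages each of a processor's tiles through an in-core buffer. The following is a minimal sketch; read_tile, compute_on_tile, and write_tile are illustrative placeholders for the compiler-generated I/O and computation, not dHPF runtime entry points.

    ! Stage each local tile through one in-core buffer.
    subroutine ooc_sweep(ntiles, buf)
       integer, intent(in) :: ntiles
       real, intent(inout) :: buf(:)
       integer :: t
       do t = 1, ntiles
          call read_tile(t, buf)        ! fetch tile t from disk
          call compute_on_tile(buf)     ! in-core computation on the tile
          call write_tile(t, buf)       ! spill the updated tile back
       end do
    end subroutine ooc_sweep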

Recursive blocking strategies have proven effective for improving memory hierarchy performance. We briefly describe our work on automatically deriving and scheduling recursive tiling strategies, which provide machine-independent tilings for a variety of difficult problems from linear algebra, including LU decomposition with partial pivoting.
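
The flavor of such a tiling is captured by a cache-oblivious recursion that repeatedly halves the larger extent of the index space until a small base tile is reached, so no machine-specific tile size appears. A minimal sketch, in which kernel and the base-case threshold are illustrative and the dependence constraints of pivoted LU are not shown:

    ! Recursively tile the index space [i0,i1] x [j0,j1].
    recursive subroutine rtile(i0, i1, j0, j1)
       integer, intent(in) :: i0, i1, j0, j1
       integer, parameter :: base = 16     ! base-case extent (illustrative)
       integer :: m
       if (i1 - i0 < base .and. j1 - j0 < base) then
          call kernel(i0, i1, j0, j1)      ! compute on one small tile
       else if (i1 - i0 >= j1 - j0) then   ! split the longer extent
          m = (i0 + i1) / 2
          call rtile(i0, m, j0, j1)
          call rtile(m + 1, i1, j0, j1)
       else
          m = (j0 + j1) / 2
          call rtile(i0, i1, j0, m)
          call rtile(i0, i1, m + 1, j1)
       end if
    end subroutine rtile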

Each of these applications of overpartitioning addresses a distinct performance problem: scheduling to improve parallelism, memory hierarchy performance, and I/O performance. All of these challenges must be met to construct latency-tolerant programs that solve very large problems on computational grids. We conclude with a discussion of how the dHPF strategy might be applied to this problem.