Implementation of a Fast Integer Sorting Library on VPP Parallel Vector Supercomputers

Eiji YOKOYAMA (Graduate School of Informatics, Kyoto University),
Koichi YASUOKA (Institute for Research in Humanities, Kyoto University),
Yasuo OKABE (Graduate School of Informatics, Kyoto University),
Masanori KANAZAWA (Data Processing Center, Kyoto University)

We propose a fast integer sorting algorithm for distributed-memory parallel computers each of whose processors has a vector unit. We have implemented a library program for efficient integer sorting based on this algorithm on Fujitsu VPP parallel vector supercomputers in HPF and evaluated it on a VPP800 with 32 CPUs.

We use the same problem specification as the NAS Parallel Benchmark Integer Sort (IS): the input is N records whose key values are non-negative integers less than a constant MAXKEY, distributed evenly over the processors, and the output is the global rank of each record, produced on the processor that holds the record (its owner processor).
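
For concreteness, the data layout this specification implies can be sketched in HPF as follows. This is only an illustration of the assumed layout; the names key, rnk, P and the parameter values are ours, not part of the library interface.

    ! Sketch of the assumed data layout (illustrative names and sizes).
    INTEGER, PARAMETER :: N = 2**27       ! number of records (e.g. NPB Class C)
    INTEGER, PARAMETER :: MAXKEY = 2**23  ! keys lie in 0..MAXKEY-1 (value illustrative)
    INTEGER, PARAMETER :: NP = 32         ! number of processors
    !HPF$ PROCESSORS P(NP)
    INTEGER :: key(N)                     ! input keys
    INTEGER :: rnk(N)                     ! output: global rank of each record
    !HPF$ DISTRIBUTE key(BLOCK) ONTO P    ! records spread evenly over processors
    !HPF$ ALIGN rnk(i) WITH key(i)        ! ranks stay on each record's owner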

Our algorithm is based on bucket sort, which is known to be the fastest sorting method on vector supercomputers. We use the RETRY algorithm [Murai, Suehiro, Seo: Trans. IPSJ 39-6, 1595-1602] to vectorize the local bucket sort on each processor.
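
The obstacle to vectorizing the counting loop is the loop-carried dependence through the histogram whenever several elements of one vector share a key value. The following serial Fortran sketch shows a retry-style scheme in that spirit; it is our illustration of the idea, not a transcription of the cited algorithm, and all names are ours.

    ! Retry-style vectorizable key counting (illustrative sketch).
    subroutine count_keys(nloc, key, maxkey, hist)
      integer, intent(in)  :: nloc, maxkey, key(nloc)
      integer, intent(out) :: hist(0:maxkey-1)
      integer :: idx(nloc), nxt(nloc), mark(0:maxkey-1)
      integer :: i, k, nrem, nnew
      hist = 0
      nrem = nloc
      idx = (/ (i, i = 1, nloc) /)          ! records still to be counted
      do while (nrem > 0)
         ! Pass 1: last write wins, so afterwards exactly one pending record
         ! "owns" each key value that occurs in this batch.
         do k = 1, nrem
            mark(key(idx(k))) = k
         end do
         ! Pass 2: owners increment the histogram (their key values are all
         ! distinct, so a no-recurrence directive lets the loop vectorize);
         ! the remaining duplicates are collected and retried next round.
         nnew = 0
         do k = 1, nrem
            if (mark(key(idx(k))) == k) then
               hist(key(idx(k))) = hist(key(idx(k))) + 1
            else
               nnew = nnew + 1
               nxt(nnew) = idx(k)
            end if
         end do
         idx(1:nnew) = nxt(1:nnew)
         nrem = nnew
      end do
    end subroutine count_keys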

In our algorithm, bucket sort is parallelized as follows. First, each processor computes the histogram of its own records (the number of records for each key value on that processor; we call this the local histogram). This phase runs completely in parallel without any data transfer. Then we redistribute the local histograms into a new distribution in which the space of possible key values from zero to MAXKEY-1 (the key space, for short) is divided among the processors. This requires parallel data transfers between all pairs of processors.
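
In HPF this redistribution can be written as a single assignment between two differently distributed arrays, with the compiler generating the all-pairs transfers. The sketch below uses illustrative names (lhist for the local histograms, khist for the same data laid out over the key space) and the declarations of the earlier sketch.

    ! lhist(k,p): number of records with key k held by processor p.
    ! khist(k,p): the same values, redistributed so that each processor
    !             owns a contiguous slice of the key space.
    INTEGER :: lhist(0:MAXKEY-1, NP), khist(0:MAXKEY-1, NP)
    !HPF$ DISTRIBUTE lhist(*, BLOCK) ONTO P   ! column p resides on processor p
    !HPF$ DISTRIBUTE khist(BLOCK, *) ONTO P   ! processor p owns a slice of keys
    khist = lhist                  ! all-to-all transfer, generated by the compiler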

Next, within its own slice of the key space, each processor sums the received local histograms over the owner processors: not only the total sum over all processors, but also the partial sums for each processor. This is purely local computation. Then each processor sends its part of the total-sum histogram to all processors; this requires a simultaneous broadcast from every processor to every processor (an all-to-all broadcast, MPI_Allgather in MPI terms). Finally, we redistribute the parts of the partial-sum histograms back to their owner processors, which again requires parallel data transfers between all pairs of processors. Each processor can now determine the rank of each of its records in the total order by local computation alone.
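
The remaining local bookkeeping is, for every key value, a running offset over all smaller keys and over earlier owner processors. A serial Fortran sketch of this computation follows (names are ours; in the parallel version the outer loop covers only the processor's own key slice, and the counts of the preceding slices come from the all-to-all broadcast above).

    ! base(k,p): 0-based global rank of the first record with key k on
    ! processor p; khist is the redistributed histogram of the previous phase.
    subroutine rank_bases(maxkey, np, khist, base)
      integer, intent(in)  :: maxkey, np, khist(0:maxkey-1, np)
      integer, intent(out) :: base(0:maxkey-1, np)
      integer :: k, p, off, below
      below = 0                         ! records with smaller keys, globally
      do k = 0, maxkey - 1
         off = below
         do p = 1, np
            base(k, p) = off            ! partial sum over processors 1..p-1
            off = off + khist(k, p)
         end do
         below = off                    ! total sum for key k folded in
      end do
    end subroutine rank_bases
    ! A record's global rank is then base(its key, its owner) plus its position
    ! among equal keys on that processor, known from the local bucket sort.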

In each of the three data transfers in the algorithm, the amount of data sent is proportional to MAXKEY x log N. This is not necessarily small, but our target machines, the VPPs, are equipped with a high-speed crossbar interconnect among all processors (1.6 GB/sec per processor in each direction), so the transfers can be done quickly and efficiently. We have also made the following improvement to compress the amount of transferred data. For example, if the number of records N is 2^27, each entry of the histogram must be at least 4 bytes. In practice, however, almost all entries have values less than 256; more precisely, the number of entries whose values exceed 255 is at most 2^27/256 = 2^19. We have therefore modified the algorithm so that histogram entries are 1-byte integers and, if there are entries whose true values exceed 255, they are sent later as exceptions. This compresses the amount of transferred data to roughly 1/4.
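
One possible encoding of this packing is sketched below; it is not necessarily the library's actual format. Because a Fortran 1-byte integer is signed, the sketch stores a count c in 0..254 as c-128 and uses 255 (stored as 127) as a sentinel meaning that the full (key, count) pair follows in the exception list.

    ! 1-byte packing of a histogram (illustrative encoding only).
    subroutine pack_hist(maxkey, hist, packed, nexc, exckey, excval)
      integer, parameter       :: i1 = selected_int_kind(2)  ! usually a 1-byte kind
      integer, intent(in)      :: maxkey, hist(0:maxkey-1)
      integer(i1), intent(out) :: packed(0:maxkey-1)
      integer, intent(out)     :: nexc, exckey(*), excval(*)
      integer :: k
      nexc = 0
      do k = 0, maxkey - 1
         if (hist(k) <= 254) then
            packed(k) = int(hist(k) - 128, i1)  ! biased by -128 to fit a signed byte
         else
            packed(k) = 127_i1                  ! sentinel: entry sent as an exception
            nexc = nexc + 1
            exckey(nexc) = k
            excval(nexc) = hist(k)
         end if
      end do
    end subroutine pack_hist

The receiver undoes the bias and overwrites the few sentinel entries from the exception list.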

We have implemented an integer sorting library based on the algorithm above on the Fujitsu VPP800. The program is written in HPF and translated into VPP Fortran (Fujitsu's native, HPF-like data-parallel Fortran) by a translator we have developed ourselves.

We have measured the performance of our implementation using instances of the NAS Parallel Benchmarks. The time for the Class C instance on the VPP800 with 32 CPUs is 0.1066 sec. This is much faster than the previous benchmark record of 1.05 sec on an NEC SX-4 with 8 CPUs, by Murai et al. Each CPU of the VPP is about 4 times faster than that of the SX-4, but note that the SX-4 is a shared-memory machine while the VPP is a distributed-memory one. We have also measured the effect of the compression. In the best case, which the NPB instance happens to be, a speed-up of more than 1.6 times over an implementation without compression is achieved, while in the worst case the slowdown is at most 6%.

On a single CPU our code runs in 1.095 sec, so we achieve a speed-up of more than 10 times using 32 CPUs. For comparison, we have also measured the NAS Parallel Benchmarks 2.3 reference codes for integer sort, written in C and MPI, on the VPP800. The parallel version on 32 CPUs takes 6.33 sec, while the serial version on 1 CPU takes 119.22 sec. Although the speed-up ratio of the NPB reference code is higher than ours, its absolute speed is far slower. This appears to be because the code is hardly vectorized.