Modular Sorting on a Fine-Grained Many-Core Processor Array

Aaron Stillmaker
Bevan M. Baas
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis

Abstract:

The highest throughput design achieves up to 9x higher throughput per chip area, and the most energy efficient proposed sort yields a 200x reduction in energy dissipated per sorted block---both compared to a quicksort implementation on a general purpose CPU.

Data centers require significant and growing amounts of power to operate, and with increasing numbers of data centers worldwide, power consumption for enterprise workloads is a significant concern. Sorting is a key computational kernel in large database systems, and the development of energy efficient sorting capabilities would therefore significantly reduce data center power usage.

We propose highly parallel sorting algorithms and mappings using a modular design for a fine-grained many-core system that greatly decreases the amount of energy consumed to perform sorts of arbitrarily large data sets. The memory, computational, and nearest-neighbor inter-processor communication requirements of the many-core processor array are very modest and require relatively small die area. We present the design and implementations of several sorting variants that perform the first phase of an external sort. They are built using program kernels operating on independent processors in a many-core array with 256 bytes of data memory and fewer than 128 assembly instructions per processor. The vast majority of processors contain identical programs. Sorts of large records are accomplished by insertion sort kernels to create sorted lists inside each processor, a merge sort to combine these sorted lists, and a distribution kernel which evenly distributes the data stream throughout the processor array.

The highest throughput design achieves up to 9x higher throughput per chip area, and the most energy efficient proposed sort yields a 200x reduction in energy dissipated per sorted block---both compared to a quicksort implementation on a general purpose CPU.

Poster


VCL Lab | ECE Dept. | UC Davis

Last update: June 7, 2012