Energy-Efficient Computing with Fine-Grained Many-Core Systems

Bin Liu
Ph.D. Dissertation
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis
Technical Report ECE-VCL-2016-1, VLSI Computation Laboratory, University of California, Davis, 2016.

Abstract:

For the past half century, Moore's Law has been the fundamental driver of high-performance computing. The continued CMOS technology scaling doubles the transistor density of VLSI systems and has provided a predictable 40% performance improvement of single-core processors for every 18 to 24 months. However, as Dennard Scaling ends, the era of scaling frequency and performance without increasing power density is over. Since 2005, the semiconductor industry shifted to multi-core and many-core processors in order to sustain the proportional scaling of performance along with transistor count increases. One of the critical challenges for many-core system design is to reduce the power dissipation and improve the energy efficiency of the chip. Researchers are eager to seek innovative low power architectures and techniques to relieve the "dark silicon" problem and effectively convert transistors to performance.

To demonstrate that many-core processors with network-on-chip interconnects are a promising architecture for high-performance energy-efficient computing, 16 Advanced Encryption Standard (AES) engines are proposed on a fine-grained many-core system by exploring different granularities of data-level and task-level parallelism. The smallest design utilizes only six cores for offline key expansion and eight cores for online key expansion, while the largest requires 107 cores and 137 cores, respectively. In comparison with published AES cipher implementations on general purpose processors, the designs have 3.5–15.6 times higher throughput per unit of chip area and 8.2–18.1 times higher energy efficiency. Moreover, the design shows 2.0 times higher throughput than the TI DSP C6201, and 3.3 times higher throughput per unit of chip area and 2.9 times higher energy efficiency than the GeForce 8800 GTX.

Next, a scalable joint local and global dynamic voltage and frequency scaling (DVFS) scheme is proposed to further improve the energy efficiency for many-core systems by monitoring on-line workload variations. The local algorithms selects the voltage and frequency pair for each individual core based on its FIFO occupancy and stall information, while the global algorithm tunes the global voltage supplies based on the workload of all active processors. To demonstrate the effectiveness of the proposed solution, a suite of benchmarks are tested on a many-core globally asynchronous locally synchronous (GALS) platform. The experiment results show that the proposed approach can achieve near-optimal power saving under performance constraints. Different local algorithms are compared in terms of power saving, voltage switching frequency and response delay to workload variation. The impact of the number of voltage supplies and global voltage tuning resolution on the global algorithm is also investigated.

To further improve the energy efficiency beyond traditional DVFS, core scaling is proposed by introducing an extra dimension beyond supply voltage and clock frequency scaling. This dissertation addresses the problem of minimizing the power dissipation of many-core systems under performance constraints by choosing an appropriate number of active cores and per-core voltage/frequency levels. A genetic algorithm based solution is proposed to solve the problem. Experiments with real applications show that (1) dynamically scaling the number of active cores can improve the energy efficiency by 5% to 42% compared with per-core DVFS for different performance requirements; (2) core scaling favors systems with more global voltage supplies and high-performance leaky process when the performance requirement is loose, while it favors systems with fewer global voltage supplies and low-power less-leaky process when the performance requirement is tight; (3) increasing the number of global voltage supplies or leakage ratio can reduce the optimal core count by 22% and 50%, respectively.

Dissertation

Reference

Bin Liu, "Energy-Efficient Computing with Fine-Grained Many-Core Systems," Ph.D. Dissertation, Technical Report ECE-VCL-2016-1, VLSI Computation Laboratory, ECE Department, University of California, Davis, 2016.

BibTeX entry

@phdthesis{bliu:vcl:phdthesis,
   author      = {Bin Liu},
   title       = {Energy-Efficient Computing with Fine-Grained Many-Core 
                  Systems},
   school      = {University of California, Davis},
   year        = 2016,
   address     = {Davis, CA, USA},
   month       = sep,
   note        = {\url{http://vcl.ece.ucdavis.edu/pubs/theses/2016-1/}}
   }

Support Acknowledgment

This work is supported by UC Davis ECE Department, ST Microelectronics, C2S2 Grant 2047.002.014, NSF Grant 0430090 and CAREER Award 0546907, SRC GRC Grant 1598, 1971, and 2321 and CSR Grant 1659, and Intel.


VCL Lab | ECE Dept. | UC Davis