Design of Energy-Efficient Many-Core MIMD GALS Processor Arrays in the 1000-Processor Era

Aaron Stillmaker
Ph.D. Dissertation
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis
Technical Report ECE-VCL-2015-1, VLSI Computation Laboratory, University of California, Davis, 2015.

Abstract:

As transistor sizes continue to scale, more transistors can be placed on a die of fixed size. The recent trend for general-purpose processing units is to use the additional transistors provided by process technology scaling to add more processing cores to a single die. At a certain point, it becomes untenable to keep adding cores with traditional architectures and communication systems, which necessitates a fundamental change in architecture to accommodate these cores. This paradigm shift requires new energy-efficient, high-performance algorithms and hardware designs tailored for many-core processor arrays, as they present different challenges than a single- or multi-core chip. With such large arrays of processors, communication, both between processors and to memories, becomes a limiting factor; algorithms must work within these limitations, and on-chip interconnect networks must make the communication possible.

This dissertation offers three novel methods for performing high-throughput, energy-efficient sorting of database records using a fine-grained many-core processor array. When measured against sorts created to fairly compare results, the most energy-efficient first-phase many-core sort requires over 83× lower energy than a quick sort performed on an Intel laptop-class processor and over 105× lower energy than a radix sort running on an Nvidia GPU. In addition, the highest-throughput first-phase many-core sort is over 10× faster than the quick sort and over 14× faster than the radix sort. Both phases of an entire 10 GB external sort require 6.9× lower energy×time (energy delay product, EDP) than the quick sort and over 13× lower energy×time than the radix sort. The proposed sorts are easily programmed and scalable to a 2D mesh processor array of any size, while providing large energy savings without penalizing performance.

The dissertation presents the developed physical design flow and design methodology for creating a digital chip in the 1000-processor era. A number of design considerations are discussed, including module design, power grid design, power gate system design, physical DVFS requirements, communication, and chip level layout.

The designs of both KiloCore and KiloCore2 are covered, as well as preliminary measured results from KiloCore, the first fabricated chip containing 1000 MIMD, programmable, independent processing cores on a single die. Early results show that KiloCore can perform at 5.8 pJ/Op at 115 Billion Ops/sec at 0.56 V, and up to 1.78 Trillion Ops/sec at 1.1 V. KiloCore2 contains 697 programmable processors, two of which are optimized for high speed, one fast Fourier transform accelerator, and two Viterbi decoder accelerators. Both chips were fabricated in 32 nm partially depleted silicon-on-insulator (PD-SOI) technology. KiloCore2 contains multiple power rails, which allow each core to select a voltage based on its workload to save energy, with minimal voltage droop and minimal area.
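
As a rough consistency check on the low-voltage figures above, energy per operation multiplied by throughput gives average power. The short Python sketch below assumes that the 5.8 pJ/Op and 115 Billion Ops/sec figures describe the same 0.56 V operating point; the function name is illustrative and does not come from the dissertation.

```python
def average_power_watts(energy_per_op_joules: float, ops_per_second: float) -> float:
    """Average power (W) = energy per operation (J) * operations per second."""
    return energy_per_op_joules * ops_per_second


# Figures quoted in the abstract, assumed to share one operating point (0.56 V).
energy_per_op = 5.8e-12  # 5.8 pJ/Op
throughput = 115e9       # 115 billion Ops/sec

power = average_power_watts(energy_per_op, throughput)
print(f"{power:.3f} W")
```

Under that assumption the product comes to roughly 0.67 W for the whole array at 0.56 V. The abstract does not give an energy-per-operation figure for the 1.1 V point, so the same calculation cannot be repeated there.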


Reference

Aaron Stillmaker, "Design of Energy-Efficient Many-Core MIMD GALS Processor Arrays in the 1000-Processor Era," Ph.D. Dissertation, Technical Report ECE-VCL-2015-1, VLSI Computation Laboratory, ECE Department, University of California, Davis, 2015.

BibTeX entry

@phdthesis{astill:vcl:phdthesis,
   author      = {Aaron Stillmaker},
   title       = {Design of Energy-Efficient Many-Core {MIMD} {GALS} Processor 
                  Arrays in the 1000-Processor Era},
   school      = {University of California, Davis},
   year        = 2015,
   address     = {Davis, CA, USA},
   month       = dec,
   note        = {\url{http://vcl.ece.ucdavis.edu/pubs/theses/2015-1/}}
   }

Support Acknowledgment

This work is supported by the UC Davis ECE Department, UC Davis GSA, ST Microelectronics, C2S2 Grant 2047.002.014, NSF Grant 0430090 and CAREER Award 0546907, SRC GRC Grants 1598, 1971, and 2321 and CSR Grant 1659, Intel, UC Micro, Intellasys, SEM, and a UCD Faculty Research Grant.

