Design and Programming of the KiloCore Processor Arrays

Brent Bohnenstiehl
Ph.D. Dissertation
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis
Technical Report ECE-VCL-2020-1, VLSI Computation Laboratory, University of California, Davis, 2020.

Abstract:

Modern semiconductor fabrication technologies now enable the construction of integrated circuits that contain over 1000 processors on a single chip [1]. However, for such systems to compute workloads effectively, new architectures are needed for the processors, the inter-processor interconnect, circuits that interact with larger memories, and the applications they execute [2–4]. This work explores the characteristics of many-core arrays and uses the insights gained to advance the state of the art through a new architecture designed to scale efficiently to thousands of processors per chip, along with software development tools to aid in programming such arrays effectively.

A detailed exploration is made of prior many-core architecture work to identify areas that might be significantly improved. Possible benefits are found in the instruction set selection, pipeline design, network communication between processors, voltage control logic, and other areas. To name a few findings: inclusion of unsigned support speeds some operations up by as much as 15×, profile-guided static branch prediction raises the correct prediction rate from 27% to 96% in sampled applications, fast oscillator halting reduces active-clock stall cycles by 33%, and voltage dithering improves DVFS energy savings by 16%.
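
As a rough illustration of the profile-guided static branch prediction mentioned above, the following Python sketch picks one fixed prediction per branch from a profiling trace and compares it with a fixed "always not-taken" policy. The trace, the branch names, and the choice of baseline are hypothetical and are not taken from the dissertation; the sketch also scores the hints on the same trace they were derived from, which is optimistic compared with a real evaluation.

```python
from collections import Counter

def profile_hints(trace):
    """Return {branch_id: predict_taken}, using the majority outcome per branch."""
    taken = Counter()
    total = Counter()
    for branch_id, was_taken in trace:
        total[branch_id] += 1
        taken[branch_id] += int(was_taken)
    # Predict "taken" when at least half of the profiled outcomes were taken.
    return {b: taken[b] * 2 >= total[b] for b in total}

def accuracy(trace, predict):
    """Fraction of branch outcomes in the trace that predict(branch_id) gets right."""
    correct = sum(1 for branch_id, was_taken in trace if predict(branch_id) == was_taken)
    return correct / len(trace)

# Hypothetical profiling trace: (branch_id, was_taken) pairs.
trace = (
    [("loop", True)] * 9 + [("loop", False)]      # loop back-edge, mostly taken
    + [("err", False)] * 5                        # error check, never taken
    + [("check", True)] * 4 + [("check", False)]  # data-dependent, mostly taken
)

hints = profile_hints(trace)
baseline = accuracy(trace, lambda b: False)    # assumed baseline: always not-taken
guided = accuracy(trace, lambda b: hints[b])   # one static hint per branch

print(f"always not-taken: {baseline:.0%}, profile-guided: {guided:.0%}")
```

Because each hint is fixed at compile time, this approach needs no run-time predictor state, which suits small processors with tight area budgets.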

A 1,000-processor array named KiloCore is presented. Fabricated in 32 nm PD-SOI CMOS technology and occupying 64 mm², this newly designed architecture implements lessons learned from prior work along with other innovations. KiloCore processors may operate up to 1.78 GHz at 1.1 V, and down to 115 MHz at 560 mV where an operation dissipates only 5.8 pJ. Several applications are implemented on KiloCore and their characteristics and performance are discussed. Across these applications and when scaled to the same fabrication technology, KiloCore at 0.9 V has geometric mean improvements of 3.1× higher throughput per area and 16.7× higher energy efficiency compared to published results on standard CPU and GPU architectures. Compared with CPUs alone and GPUs alone, KiloCore achieves 68.9× and 72.0× higher throughput per area per Watt, respectively.
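
The low-voltage operating point quoted above can be turned into an approximate power figure with simple arithmetic. The Python sketch below assumes one operation per clock cycle and that all 1,000 processors are active at once; both are simplifying assumptions made for illustration, not statements from the abstract, and the function and constant names are hypothetical.

```python
# Figures quoted in the abstract for the 560 mV operating point.
ENERGY_PER_OP_J = 5.8e-12   # 5.8 pJ per operation
CLOCK_HZ = 115e6            # 115 MHz
NUM_PROCESSORS = 1000

# Assumption for illustration only: each processor completes one operation per cycle.
OPS_PER_CYCLE = 1

def processor_power_w(energy_per_op_j, clock_hz, ops_per_cycle=OPS_PER_CYCLE):
    """Power = energy per operation x operations per second."""
    return energy_per_op_j * clock_hz * ops_per_cycle

per_processor_w = processor_power_w(ENERGY_PER_OP_J, CLOCK_HZ)
array_w = per_processor_w * NUM_PROCESSORS                 # all processors active (assumed)
array_ops_per_s = CLOCK_HZ * OPS_PER_CYCLE * NUM_PROCESSORS

print(f"per processor: {per_processor_w * 1e3:.3f} mW")
print(f"full array:    {array_w:.3f} W at {array_ops_per_s / 1e9:.0f} billion ops/s")
```

Under these assumptions each processor draws roughly 0.67 mW, and the full array roughly 0.67 W while executing about 115 billion operations per second.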

A follow-up 697-processor array named KiloCore2 is presented. Fabricated with the same technology and chip area as KiloCore, KiloCore2 is designed to achieve a 63% higher throughput per processor than its predecessor, supports three voltage domains for optimizing per-processor energy efficiency, implements a new low-area packet routing network that is specifically designed for the needs of a many-core system, and adds FFT and Viterbi accelerators along with a selection of high-speed processors for speeding up serial code. Pending final measurements, KiloCore2 is designed to reach 2.0 tera-operations per second at 1.1 V, with standard processors reaching 2.9 GHz and fast processors reaching over 5.0 GHz operation.

Programming and analysis tools for many-core arrays are presented. High-speed simulators, written in C++, are over 50,000 times faster than Verilog RTL simulation, and contain a suite of features to aid in application development and debugging. The KiloCore compiler generates optimized assembly from user-supplied kernels written in C++, Python, or potentially other languages. Leveraging the LLVM infrastructure as a front end, this compiler focuses on the back-end operations needed to lower LLVM IR code into the format needed for stackless, 16-bit, direct-memory-access processors with strict memory limitations, and to optimize the code to be comparable to optimized hand-written assembly. Supplementing these tools is a Project Manager, which allows users to write simple Python scripts to define a collection of tasks, replicate and map them to a target many-core array, perform array-wide optimizations, and conveniently compile and run their applications.

Dissertation

Reference

Brent Bohnenstiehl, "Design and Programming of the KiloCore Processor Arrays," Ph.D. Dissertation, Technical Report ECE-VCL-2020-1, VLSI Computation Laboratory, ECE Department, University of California, Davis, March 2020.

BibTeX entry

@phdthesis{brent:vcl:phdthesis,
   author      = {Brent Bohnenstiehl},
   title       = {Design and Programming of the {KiloCore} Processor Arrays},
   school      = {University of California, Davis},
   year        = 2020,
   address     = {Davis, CA, USA},
   month       = mar,
   note        = {\url{http://vcl.ece.ucdavis.edu/pubs/theses/2020-1.bbohnenstiehl/}}
   }
