Although floating-point (FP) is the most commonly used method for real number representation, certain architectures are still limited to fixed-point arithmetic due to the large area and power requirements of FP hardware. A software library that emulates FP functions is typically used when FP calculations need to be performed on a platform with a fixed-point datapath. However, software implementations of FP operations, despite not requiring any additional area, suffer from low throughput. Conversely, hardware FP implementations provide high throughput, but require a large amount of additional area and consequently increase leakage. Therefore, it is desirable to increase the FP throughput provided by a software implementation without incurring the area overhead of a full hardware floating-point unit (FPU). Furthermore, the widths of data words in digital processors have a direct impact on area in application-specific ICs (ASICs) and field-programmable gate arrays (FPGAs), and circuit area in turn impacts energy dissipation per workload and chip cost. Graphics and image processing workloads are very FP intensive; however, little exploration has been done into modifying FP word width and observing its effect on image quality and chip area.
This dissertation first presents hybrid FP implementations, which improve software FP performance without incurring the area overhead of full hardware FPUs. The proposed implementations are synthesized in 65 nm complementary metal oxide semiconductor (CMOS) technology and integrated into small fixed-point processors which use a reduced instruction set computing (RISC)-like architecture. Unsigned, shift-carry, and leading zero detection (USL) support is added to the processors to augment the existing instruction set architecture (ISA) and increase FP throughput with little area overhead. Two variations of hybrid implementations are created, distinguished by the kind of hardware they add: USL support is additional general-purpose hardware that is not specific to FP workloads (e.g., unsigned operation support), whereas custom FP-specific (CFP) hardware exists solely to accelerate FP workloads (e.g., exponent calculation logic). The first type, hybrid implementations with USL support, increase software FP throughput per core by 2.18x for addition/subtraction, 1.29x for multiplication, 3.07-4.05x for division, and 3.11-3.81x for square root, and use 90.7-94.6% less area than dedicated fused multiply-add (FMA) hardware. The second type, hybrid implementations with CFP hardware, increase throughput per core over a fixed-point software kernel by 3.69-7.28x for addition/subtraction, 1.22-2.03x for multiplication, 14.4x for division, and 31.9x for square root, and use 77.3-97.0% less area than dedicated FMA hardware. The circuit area and throughput are found for 38 multiply-add, 8 addition/subtraction, 6 multiplication, 45 division, and 45 square root designs. Thirty-three multiply-add implementations are presented which improve throughput per core versus a fixed-point software implementation by 1.11-15.9x and use 38.2-95.3% less area than dedicated FMA hardware.
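One reason leading zero detection helps software FP is that the result of an FP addition or subtraction must be renormalized, which requires locating the most significant set bit of the mantissa. The following is a minimal sketch of that single step, not the dissertation's kernels: the function names, the 32-bit word width, and the convention that a value equals `mantissa * 2**exponent` are all assumptions made for illustration. It contrasts a one-step normalization, as a leading zero detection instruction would allow, with the shift-and-test loop that a plain fixed-point ISA requires.

```python
def count_leading_zeros(value: int, width: int = 32) -> int:
    """Count zero bits above the most significant set bit of a width-bit word.

    Stands in for a hypothetical single-cycle leading zero detection instruction.
    """
    if value < 0 or value >= (1 << width):
        raise ValueError("value does not fit in the given word width")
    return width - value.bit_length()


def normalize_with_lzd(mantissa: int, exponent: int, width: int = 32):
    """Normalize in one shift, using the leading zero count."""
    if mantissa == 0:
        return 0, 0  # zero has no leading one to align
    shift = count_leading_zeros(mantissa, width)
    # Shifting left by `shift` and lowering the exponent keeps the value unchanged.
    return mantissa << shift, exponent - shift


def normalize_bitwise(mantissa: int, exponent: int, width: int = 32):
    """Normalize one bit at a time, as software without such an instruction must."""
    if mantissa == 0:
        return 0, 0
    top_bit = 1 << (width - 1)
    while not (mantissa & top_bit):
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent
```

Both functions return the same normalized pair; the loop version simply takes a number of iterations proportional to the leading zero count, which is the software cost that added hardware support is intended to remove.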
In addition to proposing hybrid FP implementations, this dissertation investigates the effects of modifying FP word width. For the second portion of this dissertation, FP exponent and mantissa widths are independently varied for the seven major computational blocks of an airborne synthetic aperture radar (SAR) image formation engine. This image formation engine uses the backprojection algorithm. SAR imaging uses pulses of microwave energy to provide day, night, and all-weather imaging and can be used for reconnaissance, navigation, and environment monitoring. The backprojection algorithm is a frequently used tomographic reconstruction method similar to that used in computed tomography (CT) imaging. Additionally, trigonometric function evaluation, interpolation, and Fourier transforms are common to SAR backprojection and other biomedical image formation algorithms. The circuit area in 65 nm CMOS and the peak signal-to-noise ratio (PSNR) and structural similarity index metric (SSIM) are found for 572 design points. With word width reductions of 46.9-79.7%, images with a 0.99 SSIM are created with imperceptible image quality degradation and a 1.9-11.4x area reduction.
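The effect of a narrower mantissa on image quality can be illustrated in software. The sketch below is an assumption-laden illustration rather than the dissertation's evaluation flow: `reduce_mantissa` and `psnr` are hypothetical helper names, only the mantissa width is modelled (the exponent range is left at that of a Python float), rounding is to nearest, and samples are treated as a flat list of values with a caller-supplied peak.

```python
import math


def reduce_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to a significand with `mantissa_bits` fraction bits plus a hidden bit."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    # frexp places the leading one at 2**-1, so one extra bit covers the hidden bit.
    scale = 1 << (mantissa_bits + 1)
    return math.ldexp(round(m * scale) / scale, e)


def psnr(reference, test, peak: float) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical sequences
    return 10.0 * math.log10(peak ** 2 / mse)


# Usage: quantize a synthetic signal at two mantissa widths and compare quality.
signal = [math.sin(0.1 * i) for i in range(1, 200)]
coarse = [reduce_mantissa(v, 6) for v in signal]
fine = [reduce_mantissa(v, 12) for v in signal]
psnr_coarse = psnr(signal, coarse, peak=1.0)
psnr_fine = psnr(signal, fine, peak=1.0)
```

Sweeping `mantissa_bits` in this way gives a quality-versus-width curve for a single computation; a study like the one described above would additionally vary exponent width and pair each point with synthesized circuit area.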
The third portion of this dissertation covers the physical design of two many-core chips in 32 nm PD-SOI, KiloCore and KiloCore2. The first part of this section covers the design of KiloCore, and the second details the adjustments made to the flow for the tape-out of KiloCore2. KiloCore features 1000 cores capable of independent program execution. The maximum clock frequency for the cores on KiloCore ranges from 1.70 GHz to 1.87 GHz at 1.10 V. KiloCore compares favorably against many other many-core and multi-core chips, as well as low power processors. At a supply voltage of 0.56 V, processors require 5.8 pJ per operation at a clock frequency of 115 MHz. KiloCore2 has 700 cores, 697 of which are programmable processor tiles and three of which are hardware accelerators (a fast Fourier transform (FFT) accelerator and two Viterbi decoders). The assembled printed circuit boards (PCBs) with packaged KiloCore2 chips are expected to be ready in July.
The fourth portion of this dissertation explores implementing a scientific kernel on a many-core array, namely sparse matrix-vector multiplication. Twenty-three functionally equivalent sparse matrix times dense vector multiplication implementations are created for a fine-grained many-core platform with FP capabilities. These implementations are considered against two central processing unit (CPU) chips and two graphics processing unit (GPU) chips. The designs for the many-core array, CPUs, and GPUs are evaluated using the metrics of throughput per area and throughput per watt when operating on a set of five unstructured sparse matrices of varying dimensions, sourced from a wide range of domains including directed weighted graphs, computational fluid dynamics, circuit simulation, thermal problems (e.g., heat exchanger design), and eigenvalue/model reduction problems. Results using unscheduled and unoptimized code demonstrate that the implementations on the many-core platform increase power efficiency by up to 14.0x versus the CPU implementations, and by up to 27.9x versus the GPU implementations. Additionally, the implementations on the many-core platform increase area efficiency by as much as 17.8x versus the CPU implementations, and up to 36.6x versus the GPU implementations.
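For reference, the kernel itself is small. The following is a minimal sequential sketch of sparse matrix times dense vector multiplication; the compressed sparse row (CSR) storage format and the name `csr_spmv` are assumptions chosen for illustration, since the storage format and partitioning used by the many-core implementations are not stated here.

```python
def csr_spmv(values, col_indices, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values      -- nonzero entries of A, row by row
    col_indices -- column index of each entry in `values`
    row_ptr     -- row_ptr[i]:row_ptr[i + 1] spans the entries of row i
    x           -- dense input vector
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        # Only stored nonzeros are visited; each costs one multiply and one add.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_indices[k]]
        y[row] = acc
    return y


# Usage: the 3x3 matrix [[4, 0, 1], [0, 0, 2], [3, 5, 0]] in CSR form.
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_indices = [0, 2, 2, 0, 1]
row_ptr = [0, 2, 3, 5]
result = csr_spmv(values, col_indices, row_ptr, [1.0, 2.0, 3.0])
```

The indirect access `x[col_indices[k]]` is what makes the kernel irregular for unstructured matrices, and each row's dot product is independent of the others, which is what allows the work to be distributed across many cores.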
Jon Pimentel, "Methods for Reducing Floating-Point Computation Overhead," Ph.D. Dissertation, Technical Report ECE-VCL-2017-2, VLSI Computation Laboratory, ECE Department, University of California, Davis, 2017.
@phdthesis{jpimentel:vcl:phdthesis,
  author  = {Jon Pimentel},
  title   = {Methods for Reducing Floating-Point Computation Overhead},
  school  = {University of California, Davis},
  year    = 2017,
  address = {Davis, CA, USA},
  month   = aug,
  note    = {\url{http://vcl.ece.ucdavis.edu/pubs/theses/2017-2.pimentel/}}
}