A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

Dean Truong
Wayne Cheng
Tinoosh Mohsenin
Zhiyi Yu
Toney Jacobson
Gouri Landge
Michael Meeuwsen
Christine Watnik
Paul Mejia
Anh Tran
Jeremy Webb
Eric Work
Zhibin Xiao
Bevan Baas
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis

Abstract:

Applications that require the computation of complex DSP workloads are becoming increasingly commonplace. They are often composed of multiple DSP tasks and are found in areas such as wired and wireless communications, multimedia, sensor signal processing, and medical/biological processing. Many are embedded and strongly energy-constrained. In addition, many of these workloads require very high throughput and often dissipate a significant portion of the system power budget, and are therefore of considerable interest.

In contrast to general-purpose workloads, DSP workloads typically comprise a collection of DSP kernels that are numerically intensive, easily parallelizable, and do not require large data working sets or large programs.

One-time fabrication costs for state-of-the-art CMOS designs are several million dollars, and total design costs of modern chips can easily reach tens of millions of dollars. These costs are expected to continue rising in the future. In this context, programmable and/or reconfigurable processors that are not tailored to a single application or a small class of applications become increasingly attractive.

The presented processing array computes the aforementioned complex DSP application workloads with high performance and high energy efficiency, and is well suited for implementation in future fabrication technologies.

The 164 programmable processors are able to dynamically and independently switch their supply voltage between two power grids and are also able to dynamically and independently tune their clock frequency. Changes can be made by a local configurable hardware controller, local software, or configuration.
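
To illustrate why per-processor voltage and frequency selection matters, the sketch below applies the standard first-order CMOS dynamic power model, P = a * C * V^2 * f. It is not taken from the paper: the function names, the effective capacitance, and the two supply voltages are hypothetical values chosen only for illustration.

```python
# First-order CMOS dynamic power model: P = activity * C_eff * Vdd^2 * f.
# All numeric values below are hypothetical, not measurements of the chip.

def dynamic_power(c_eff_farads, vdd_volts, freq_hz, activity=1.0):
    """Return switching power in watts for one processor."""
    return activity * c_eff_farads * vdd_volts ** 2 * freq_hz

def energy_per_cycle(c_eff_farads, vdd_volts, activity=1.0):
    """Return switching energy in joules per clock cycle (power / frequency)."""
    return activity * c_eff_farads * vdd_volts ** 2

VDD_HIGH = 1.3   # hypothetical voltage of the high-supply grid
VDD_LOW = 0.8    # hypothetical voltage of the low-supply grid
C_EFF = 20e-12   # hypothetical effective switched capacitance

# Energy per cycle depends on the supply voltage only, so moving a lightly
# loaded processor to the low grid saves energy for every cycle it runs.
ratio = energy_per_cycle(C_EFF, VDD_LOW) / energy_per_cycle(C_EFF, VDD_HIGH)
print(f"energy per cycle on low grid: {ratio:.0%} of high grid")
```

Under these assumed voltages the low grid costs well under half the switching energy per cycle, which is the motivation for letting each processor choose its grid independently.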

All 167 processors and 3 shared memories contain independent individual local oscillators that are able to halt, restart, or change frequency arbitrarily in a Globally Asynchronous Locally Synchronous (GALS) fashion. There are no PLLs, DLLs, clock crystals, or global clock signals. Individual oscillators fully halt (leakage power only) when there is no work to do, and restart in less than one cycle after work is available.

All processors are interconnected by a "double-link" reconfigurable mesh network. The novel network design, which requires only a small circuit area, allows links to be configured to pass data across the chip in dedicated channels without disturbing intermediate processors and without regard to their clock or voltage domains. Configurable pipeline registers enable full-rate communication over long distances.

The 65 nm chip comprises a 2-D array of processors containing:

164 simple programmable Dynamic Voltage, Dynamic clock Frequency (DVFS) processors with 128x35-bit instruction memories; 128x16-bit data memories; two 64x16-bit FIFOs; and an in-order, single-issue, six-stage pipeline.
A configurable block floating point Fast Fourier Transform (FFT) processor capable of dynamically switching between 16- and 4096-point complex FFTs/IFFTs, with a continuous throughput of one complex radix-4 or radix-2 butterfly per cycle, which results in continuous complex 1024-point transforms every 1.28 µs at 990 MHz.
A configurable Viterbi processor containing 8 Add-Compare-Select (ACS) units that can decode codes up to constraint length 10 and can deliver 78 Mb/s at 790 MHz with k=7 codes.
A programmable video Motion Estimation processor that supports several fixed and programmable search patterns and all H.264-specified block sizes, and is able to compute over 14 billion SADs/sec at 880 MHz.
Three 16 KB multi-ported shared memories providing high speed storage to the array and supporting port priorities, port request arbitration, and multiple addressing modes including programmable address generators.
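
Two of the figures in the list above can be unpacked with a short sketch. The first function counts butterflies in a fixed-radix FFT, (N / r) * log_r(N), and converts that to a transform time assuming exactly one butterfly completes per cycle; the second computes the sum of absolute differences (SAD) that motion estimation uses to compare two pixel blocks. The function names and the sample blocks are illustrative and do not come from the chip's tools.

```python
def butterflies_per_fft(n_points, radix):
    """Count butterflies in an n-point FFT built only from radix-r stages."""
    stages, n = 0, n_points
    while n > 1:
        if n % radix:
            raise ValueError("n_points must be a power of radix")
        n //= radix
        stages += 1
    return (n_points // radix) * stages

def transform_time_us(n_points, radix, clock_mhz):
    """Transform time in microseconds at one butterfly per clock cycle."""
    return butterflies_per_fft(n_points, radix) / clock_mhz

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )

# 1024-point radix-4 FFT: 256 butterflies per stage, 5 stages.
print(butterflies_per_fft(1024, 4))
print(round(transform_time_us(1024, 4, 990), 2))

# Toy 2x2 blocks; real H.264 block sizes range from 4x4 to 16x16.
current = [[1, 2], [3, 4]]
candidate = [[2, 2], [0, 8]]
print(sad(current, candidate))
```

A 1024-point radix-4 transform needs 1280 butterflies, which at 990 MHz comes to roughly 1.29 µs under the one-butterfly-per-cycle assumption, close to the 1.28 µs quoted above.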

(c) Copyright, 2008

Reference

Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine Watnik, Paul Mejia, Anh Tran, Jeremy Webb, Eric Work, Zhibin Xiao, Bevan Baas. "A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing." In Proceedings of the IEEE HotChips Symposium on High-Performance Chips (HotChips 2008), August 2008.

BibTeX Entry

@inproceedings{Truong:HotChips08,
   author    = {Dean Truong and Wayne Cheng and Tinoosh Mohsenin and Zhiyi Yu
                and Toney Jacobson and Gouri Landge and Michael Meeuwsen
                and Christine Watnik and Paul Mejia and Anh Tran and Jeremy Webb
                and Eric Work and Zhibin Xiao and Bevan Baas},
   title     = {A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing},
   booktitle = {IEEE HotChips Symposium on High-Performance Chips
               (HotChips 2008)},
   month     = {Aug.},
   year      = {2008}
}

Last update: Sep. 07, 2010