Applications that require the computation of complex DSP workloads are becoming increasingly commonplace. They are often composed of multiple DSP tasks and are found in areas such as wired and wireless communications, multimedia, sensor signal processing, and medical/biological processing. Many are embedded and strongly energy-constrained. In addition, many of these workloads require very high throughput and often dissipate a significant portion of the system power budget, and are therefore of considerable interest.
In contrast to general-purpose workloads, DSP workloads typically comprise a collection of DSP kernels that are numerically intensive, easily parallelizable, and do not require large data working sets or large programs.
One-time fabrication costs for state-of-the-art CMOS designs are several million dollars, and total design costs of modern chips can easily reach tens of millions of dollars. These costs are expected to continue rising in the future. In this context, programmable and/or reconfigurable processors that are not tailored to a single application or a small class of applications become increasingly attractive.
The presented processing array computes the aforementioned complex DSP application workloads with high performance and high energy efficiency, and is well suited for implementation in future fabrication technologies.
Each of the 164 programmable processors can dynamically and independently switch its supply voltage between two power grids and tune its clock frequency. Changes can be made by a local configurable hardware controller, by local software, or by configuration.
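As a rough illustration of per-processor voltage and frequency selection, the following Python sketch models a controller that picks one of two supply rails based on how full an input FIFO is. The rail voltages, the FIFO-occupancy thresholds, the halved clock setting, and the names `ProcessorPowerState` and `update_state` are all assumptions made for this example; they are not taken from the chip. The power estimate uses the standard first-order relation that dynamic power scales with V^2 * f.

```python
from dataclasses import dataclass

# Hypothetical rail voltages, chosen only for illustration.
VDD_HIGH = 1.3  # volts (assumed)
VDD_LOW = 0.8   # volts (assumed)


@dataclass
class ProcessorPowerState:
    vdd: float = VDD_HIGH     # currently selected supply rail
    freq_scale: float = 1.0   # fraction of the maximum clock frequency

    def relative_dynamic_power(self) -> float:
        """Dynamic power relative to VDD_HIGH at full frequency (P ~ V^2 * f)."""
        return (self.vdd / VDD_HIGH) ** 2 * self.freq_scale


def update_state(state, fifo_occupancy, low_mark=0.25, high_mark=0.75):
    """Assumed hysteresis policy driven by input FIFO occupancy (0.0 to 1.0).

    A nearly full FIFO selects the high rail and full clock rate, a nearly
    empty FIFO selects the low rail and half clock rate, and anything in
    between keeps the current setting.
    """
    if fifo_occupancy >= high_mark:
        return ProcessorPowerState(VDD_HIGH, 1.0)
    if fifo_occupancy <= low_mark:
        return ProcessorPowerState(VDD_LOW, 0.5)
    return state
```

In this toy model, each processor would hold its own `ProcessorPowerState` and call `update_state` independently of its neighbors, mirroring the independent per-processor control described above.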
All 167 processors and 3 shared memories contain independent individual local oscillators that are able to halt, restart, or change frequency arbitrarily in a Globally Asynchronous Locally Synchronous (GALS) fashion. There are no PLLs, DLLs, clock crystals, or global clock signals. Individual oscillators fully halt (leakage power only) when there is no work to do, and restart in less than one cycle after work is available.
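The halt-and-restart behavior can be pictured with a small Python model. This is a simplified sketch rather than a description of the actual oscillator circuit: the function name `simulate_oscillator` and the per-cycle boolean trace are hypothetical, and the restart is modeled as costing no extra cycle, which is an idealization of the "less than one cycle" restart stated above.

```python
def simulate_oscillator(work_available):
    """Count active cycles, halted cycles, and restarts for one processor.

    work_available: iterable of bools, one per time step, True when the
    processor has work to do (for example, input data is waiting).
    """
    active = halted = restarts = 0
    running = False
    for has_work in work_available:
        if has_work:
            if not running:
                restarts += 1   # oscillator restarts as soon as work appears
                running = True
            active += 1         # clock toggles: dynamic plus leakage power
        else:
            running = False     # oscillator fully halted: leakage power only
            halted += 1
    return active, halted, restarts
```

Because every processor and shared memory has its own oscillator, a model of the whole array would run one such trace per block, with no shared clock between them.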
All processors are interconnected by a "double-link" reconfigurable mesh network. The novel network design, which requires little circuit area, allows links to be configured to pass data across the chip in dedicated channels without disturbing intermediate processors and without regard to their clock or voltage domains. Configurable pipeline registers enable full-rate communication over long distances.
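One way to picture a dedicated long-distance channel is as a set of link reservations along a path through the mesh. The Python sketch below rests on several assumptions: that "double-link" means two links per neighbor direction, that paths run along x first and then y, and that a channel takes one link on every hop. The class `Mesh` and its methods are hypothetical names for this example and do not describe the chip's actual configuration mechanism.

```python
from collections import defaultdict

# Assumption: "double-link" is modeled as two links per neighbor direction.
LINKS_PER_DIRECTION = 2


class Mesh:
    def __init__(self):
        # (tile, next_tile) -> number of links already reserved on that hop
        self.used = defaultdict(int)

    @staticmethod
    def xy_path(src, dst):
        """Tiles visited when moving along x first, then y (assumed order)."""
        x, y = src
        path = [(x, y)]
        while x != dst[0]:
            x += 1 if dst[0] > x else -1
            path.append((x, y))
        while y != dst[1]:
            y += 1 if dst[1] > y else -1
            path.append((x, y))
        return path

    def reserve_channel(self, src, dst):
        """Reserve one link per hop and return the bypassed intermediate tiles."""
        path = self.xy_path(src, dst)
        hops = list(zip(path, path[1:]))
        if any(self.used[hop] >= LINKS_PER_DIRECTION for hop in hops):
            raise RuntimeError("no free link on at least one hop")
        for hop in hops:
            self.used[hop] += 1
        # Intermediate tiles only forward the data; their cores are not involved.
        return path[1:-1]
```

For example, `Mesh().reserve_channel((0, 0), (3, 0))` returns `[(1, 0), (2, 0)]`, the two tiles whose switches forward the data while their processors keep running their own programs.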
The 65 nm chip comprises a 2-D array of processors containing:
Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine Watnik, Paul Mejia, Anh Tran, Jeremy Webb, Eric Work, Zhibin Xiao, Bevan Baas. "A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing." In Proceedings of the IEEE HotChips Symposium on High-Performance Chips (HotChips 2008), August 2008.
@inproceedings{Truong:HotChips08,
  author = {Dean Truong and Wayne Cheng and Tinoosh Mohsenin and Zhiyi Yu and Toney Jacobson and Gouri Landge and Michael Meeuwsen and Christine Watnik and Paul Mejia and Anh Tran and Jeremy Webb and Eric Work and Zhibin Xiao and Bevan Baas},
  title = {A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing},
  booktitle = {IEEE HotChips Symposium on High-Performance Chips (HotChips 2008)},
  month = {Aug.},
  year = {2008}
}