Energy-efficient Fine-grained Many-core Architecture for Video and DSP Applications

Zhibin Xiao
PhD Dissertation
VLSI Computation Laboratory
Department of Electrical and Computer Engineering
University of California, Davis
Technical Report ECE-VCL-2012-4, VLSI Computation Laboratory, University of California, Davis, 2012.

Abstract:

Many-core processor architecture has become the most promising computer architecture for next-generation computing. However, exploiting the extra system performance for real applications such as video encoding remains challenging. This dissertation investigates the architecture design, physical implementation and performance evaluation of a fine-grained many-core processor for advanced video coding, with a focus on interconnection, topology, memory system and the related parallel programming methodology.

A baseline residual encoder for H.264/AVC on a current generation fine-grained many-core system is proposed that utilizes no application-specific hardware. The 25-processor encoder encodes video sequences with variable frame sizes and can encode 1080p HDTV at 30 frames per second with 293 mW average power consumption by adjusting each processor to workload-based optimal clock frequencies and dual supply voltages, a 38.4% power reduction compared to operation with only one clock frequency and supply voltage. In comparison to published implementations on the TI C642 DSP platform, the design has approximately 2.9–3.7 times higher scaled throughput, 11.2–15.0 times higher throughput per chip area, and 4.5–5.8 times lower energy per pixel. Compared to a heterogeneous SIMD architecture customized for H.264, the presented design has 2.8–3.6 times greater throughput, 4.5–5.9 times higher area efficiency, and similar energy efficiency.
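As a rough cross-check of the figures above, the following Python sketch derives two quantities that the abstract implies but does not state: the energy per pixel at 1080p and 30 frames per second, and the single-frequency, single-voltage baseline power that a 38.4% reduction would correspond to. It assumes a 1920 x 1080 frame (the coded frame size may differ) and that the 293 mW figure covers the whole 25-processor encoder. The variable names are illustrative, and the results are back-of-the-envelope estimates, not values reported in the dissertation.

```python
# Back-of-the-envelope estimates derived from the abstract's stated figures.
# Assumption: "1080p" means 1920 x 1080 pixels per frame.
FRAME_WIDTH = 1920
FRAME_HEIGHT = 1080
FRAMES_PER_SECOND = 30

AVERAGE_POWER_MW = 293.0  # stated average power of the 25-processor encoder
POWER_REDUCTION = 0.384   # stated reduction vs. one clock frequency and voltage

# Pixels the encoder must process each second.
pixels_per_second = FRAME_WIDTH * FRAME_HEIGHT * FRAMES_PER_SECOND

# Energy per pixel in nanojoules: power (W) / pixel rate (pixels/s).
energy_per_pixel_nj = (AVERAGE_POWER_MW * 1e-3) / pixels_per_second * 1e9

# Implied baseline power: 293 mW is what remains after a 38.4% reduction.
baseline_power_mw = AVERAGE_POWER_MW / (1.0 - POWER_REDUCTION)

print(f"pixel rate:        {pixels_per_second / 1e6:.1f} Mpixel/s")
print(f"energy per pixel:  {energy_per_pixel_nj:.2f} nJ")
print(f"implied baseline:  {baseline_power_mw:.1f} mW")
```

These numbers are unscaled; the comparisons against other platforms quoted above use scaled throughput, so they should not be reproduced from this arithmetic alone.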

Next, this dissertation proposes novel processor shapes and interconnection topologies for many-core processor arrays which result in an overall application processor that requires fewer cores and has a lower total communication length. The proposed topologies are compared to the commonly used 2D mesh and include two 8-neighbor topologies, two 5-nearest-neighbor topologies and three 6-nearest-neighbor topologies, three of which utilize 5-sided or hexagonal processor tiles. A 1080p H.264/AVC residual video encoder and a complete 54 Mbps 802.11a/11g wireless LAN baseband receiver are mapped onto all topologies and compared. The methodology to implement an array of hexagonal-shaped processor tiles with industry-standard CAD tools and an automatic place and route flow is described. A 16-bit DSP processor tile is tailored for all proposed topologies and implemented in 65 nm CMOS technology without full-custom layout. Results show that the 6-neighbor hexagonal tile and the 6-neighbor rectangular tile incur a 2.9% area increase per tile compared to the 4-neighbor 2D mesh, but their much more effective inter-processor interconnect yields an average total application area reduction of 21% and a total application inter-processor communication distance reduction of 19%.
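To illustrate why a higher neighbor count can shorten inter-processor communication, the Python sketch below counts how many tiles lie within a given number of hops of one tile under 4-neighbor, 6-neighbor and 8-neighbor connectivity. It is an idealized model under stated assumptions, not the dissertation's mapping method or its communication-length metric: the 6-neighbor case uses hexagonal axial coordinates, the other two use ordinary grid coordinates, the 5-neighbor topologies are omitted, and all function names are hypothetical.

```python
# Idealized hop-count models for three nearest-neighbor topologies.
# Assumption: every link costs one hop and the array is unbounded.

def mesh_hops(a, b):
    """4-neighbor 2D mesh: Manhattan distance on grid coordinates."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

def hex_hops(a, b):
    """6-neighbor array: distance in hexagonal axial coordinates (q, r)."""
    dq, dr = b[0] - a[0], b[1] - a[1]
    return max(abs(dq), abs(dr), abs(dq + dr))

def eight_neighbor_hops(a, b):
    """8-neighbor array (diagonal links allowed): Chebyshev distance."""
    return max(abs(b[0] - a[0]), abs(b[1] - a[1]))

def tiles_within(hops_fn, k):
    """Count tiles reachable from (0, 0) in at most k hops, origin included."""
    origin = (0, 0)
    return sum(
        1
        for x in range(-k, k + 1)
        for y in range(-k, k + 1)
        if hops_fn(origin, (x, y)) <= k
    )

for name, fn in [("4-neighbor mesh", mesh_hops),
                 ("6-neighbor hex", hex_hops),
                 ("8-neighbor", eight_neighbor_hops)]:
    print(name, [tiles_within(fn, k) for k in range(1, 4)])
```

In this model, more tiles fall within the same hop budget as the neighbor count grows, which is one intuition for the reported reductions in core count and communication distance; the actual gains depend on the application mapping.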

Motivated by the fact that video encoding tasks normally read and write a block of data in a single transaction, the third part of this dissertation proposes a novel source-synchronous bufferless shared memory to enable safe memory sharing among multiple processors in different clock domains. Compared with the previous FIFO-buffered memory design, the bufferless memory module achieves lower latency, higher throughput, lower area overhead and lower power consumption. The bufferless memory module also supports direct communication with far-away processors through the existing processor-to-processor circuit-switched interconnection network. The implementation results show that a 16 KB bufferless memory module reduces single memory access latency by 58% and has 1% higher burst-mode throughput compared to the 16 KB buffered memory module. The bufferless memory module also reduces the area overhead from 63% to 17% compared with the buffered memory module, which yields a power reduction of 43%.

Dissertation Copy

Reference

Zhibin Xiao, "Energy-efficient Fine-grained Many-core Architecture for Video and DSP Applications," Ph.D. Dissertation, Technical Report ECE-VCL-2012-4, VLSI Computation Laboratory, ECE Department, University of California, Davis, 2012.

BibTeX entry

@phdthesis{zxiao:vcl:phdthesis,
   author      = {Zhibin Xiao},
   title       = {Energy-efficient Fine-grained Many-core Architecture for
                  Video and DSP Applications},
   school      = {University of California},
   month       = dec,
   year        = 2012,
   address     = {Davis, CA, USA},
   note        = {\url{http://www.ece.ucdavis.edu/vcl/pubs/theses/2012-4/}}
   }

Support Acknowledgment

This work is supported by ST Microelectronics, Intel Corporation, UC Micro, NSF Grant 0430090 and CAREER Award 0546907, SRC GRC Grant 1598, CSR Grant 1659, Intellasys Corporation, S Machines, and the C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity.

