## A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System

Zhibin Xiao and Bevan M. Baas

VLSI Computation Lab, ECE Department University of California, Davis

# Outline

- Introduction to H.264 CAVLC Encoder
- Features of Target Fine-Grained Many-Core System
- The Proposed Parallel CAVLC Encoder
- Results and Performance Analysis
- Summary

### Advanced Video Processing



Video applications are everywhere: high-definition video, real-time video conferencing, portable handsets



## **Introduction to H.264/AVC Standard**

- Drafted in May 2003 by the JVT, formed by the ITU and ISO MPEG organizations
- Targets range from high-definition TV to low-resolution mobile video
- High computational complexity, with more data dependencies and irregular processing



## Introduction to H.264 CAVLC Encoder

- Context-adaptive variable-length coding (CAVLC)
  - Adopted in H.264 baseline profile
  - Reverse zigzag-scanned run-length coding and adaptive coding table selection
  - Up to 27 4x4 or 2x2 blocks within a macroblock, processed in order
- Less processing regularity
  - Serial at the pixel level
  - SIMD approach is not feasible in this case
  - Task-level parallelism is available

16x16 Macroblock CAVLC Processing Order
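The reverse zigzag scan mentioned above can be sketched in a few lines of Python. This is only an illustration, not the encoder's assembly-level code: the scan table is the standard H.264 4x4 frame zigzag order expressed as raster indices, while the function names and the example block are made up for this sketch.

```python
# Standard H.264 zigzag order for a 4x4 block (frame coding),
# written as raster indices (row * 4 + column).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def zigzag_scan(block):
    """Flatten a 4x4 block (list of 4 rows) into zigzag order."""
    flat = [value for row in block for value in row]
    return [flat[i] for i in ZIGZAG_4X4]


def reverse_zigzag_scan(block):
    """Visit coefficients from the highest frequency back to DC,
    which is the order the CAVLC scanning phase works in."""
    return zigzag_scan(block)[::-1]


# Hypothetical quantized block, used only for illustration.
example_block = [
    [0, 3, -1, 0],
    [0, -1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

forward = zigzag_scan(example_block)
backward = reverse_zigzag_scan(example_block)
```

Each coefficient depends on its position relative to the others in this order, which is why the scan is serial at the pixel level and a poor fit for SIMD.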



## Introduction to H.264 CAVLC Encoder

- CAVLC Encoding
  - Five parameters of each 4x4 block are coded separately
    - coeff\_token, Sign\_trail, Levels, Total\_zeros, Run\_before
- CAVLC data-flow graph
  - Serial scanning phase
  - Parameter coding phase
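As a rough illustration of the serial scanning phase, the sketch below derives the values behind the five parameters from one zigzag-ordered 4x4 block. It stops before the parameter coding phase (no VLC tables, no bitstream), and the function name, dictionary keys, and example input are assumptions made for this sketch rather than anything taken from the encoder itself.

```python
def cavlc_symbols(scanned):
    """Derive the values behind the five CAVLC parameters from a
    zigzag-ordered list of 16 coefficients. No bits are produced."""
    nonzero_pos = [i for i, c in enumerate(scanned) if c != 0]
    total_coeff = len(nonzero_pos)
    if total_coeff == 0:
        return {"coeff_token": (0, 0), "sign_trail": [], "levels": [],
                "total_zeros": 0, "run_before": []}

    # Nonzero coefficients in reverse scan order (high frequency first).
    rev = [scanned[i] for i in reversed(nonzero_pos)]

    # Trailing ones: at most three consecutive +/-1 values at the
    # high-frequency end of the block.
    trailing_ones = 0
    for c in rev:
        if abs(c) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    # One sign bit per trailing one: 0 for +1, 1 for -1.
    sign_trail = [0 if c > 0 else 1 for c in rev[:trailing_ones]]
    # Remaining nonzero coefficients, still in reverse order.
    levels = rev[trailing_ones:]

    # Zeros that precede the last nonzero coefficient in scan order.
    total_zeros = nonzero_pos[-1] + 1 - total_coeff

    # Zeros immediately before each nonzero coefficient, reverse order.
    # A real encoder stops emitting runs once all zeros are accounted
    # for and never codes the run of the lowest-frequency coefficient.
    runs_forward = []
    prev = -1
    for p in nonzero_pos:
        runs_forward.append(p - prev - 1)
        prev = p
    run_before = runs_forward[::-1]

    return {"coeff_token": (total_coeff, trailing_ones),
            "sign_trail": sign_trail, "levels": levels,
            "total_zeros": total_zeros, "run_before": run_before}


# Hypothetical zigzag-ordered block, used only for illustration.
symbols = cavlc_symbols([0, 3, 0, 1, -1, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
```

Once these values are known, the five parameters can be coded independently of each other, which is where the task-level parallelism of the parameter coding phase comes from.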






## **Target Many-core System Architecture**

- Key features
  - 164 enhanced programmable processors
  - 3 dedicated-purpose processors
  - 3 Shared memories
  - Long-distance circuit-switched communication network
  - Dynamic Voltage and Frequency Scaling (DVFS)



### **Project motivation and mapping methodology**

- Fine-grained many-core system for DSP applications
  - energy efficient
  - scalable performance
  - highly flexible
- Mapping methodology
  - Sequential C code
  - Parallel C code
  - Fine-grained assembly-level code






## **Parallel CAVLC : Memory Optimization**

#### Coeff\_Token table selection

- Encode number of non-zero coefficients (nnz) in current 4x4 block
- The table index depends on top and left 4x4 blocks
- A row of nnz values of previous blocks has to be stored in the shared big memory
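The dependence on the top and left neighbors can be sketched as follows. The averaging rule and the table thresholds follow the H.264 standard for luma blocks; the function names and the use of `None` for an unavailable neighbor are assumptions of this sketch, and chroma DC blocks (which use a separate table) are left out.

```python
def predict_nc(n_left, n_top):
    """Predict nC from the nnz of the left and top 4x4 blocks.
    Pass None for a neighbor that is not available."""
    if n_left is not None and n_top is not None:
        return (n_left + n_top + 1) >> 1  # rounded average
    if n_left is not None:
        return n_left
    if n_top is not None:
        return n_top
    return 0


def coeff_token_table(nc):
    """Map nC to one of the four coeff_token coding tables.
    Index 3 stands for the fixed-length 6-bit code."""
    if nc < 2:
        return 0
    if nc < 4:
        return 1
    if nc < 8:
        return 2
    return 3


# Example: left block has 2 nonzero coefficients, top block has 3.
table = coeff_token_table(predict_nc(2, 3))
```

Because the top neighbor belongs to the previous row of blocks, its nnz value has to outlive the current macroblock, which is why a full row of nnz values is buffered.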



720p HDTV: 324-word memory for nnz values

- Table elimination and compression
  - Levels encoded at runtime
  - Reduces table memory by more than 75% for coeff\_token, total\_zeros and run\_before
    - Width compression
    - Zero-value reduction

## **CAVLC Partition and Dataflow mapping**

- A 20-processor mapping
  - No long-distance link
  - 8 routing processors
  - Every encoding block fits into a fine-grained core (128-word instruction and data memories)




## **Mapping and Throughput Optimization**

- 15-processor mapping
  - 4 long-distance links
  - Removes 5 routing processors
- Throughput optimization
  - Readjust workload
  - Code optimization








## **Parallel CAVLC Encoder Performance**

#### Throughput

- Five QCIF video test sequences with varying Quantization Parameter (QP 10-40)
- Scaled performance can achieve 30fps 720p HDTV (1280x720) processing
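To give a feel for what the 720p target implies, the sketch below works out the macroblock rate and the per-macroblock cycle budget. The 1.07 GHz clock is the frequency quoted for this design elsewhere in the slides; the helper names and the assumption that throughput scales with macroblock count between QCIF and 720p are this sketch's own, not a statement of how the reported numbers were scaled.

```python
MB_SIZE = 16  # a macroblock covers 16x16 luma pixels


def macroblocks_per_frame(width, height):
    """Count macroblocks, rounding up partially covered ones."""
    return -(-width // MB_SIZE) * -(-height // MB_SIZE)


def cycles_per_macroblock(width, height, fps, clock_hz):
    """Clock cycles available for each macroblock at a given frame rate."""
    return clock_hz / (macroblocks_per_frame(width, height) * fps)


def scale_fps(fps, from_mbs, to_mbs):
    """Assumed scaling rule: frame rate is inversely proportional
    to the number of macroblocks per frame."""
    return fps * from_mbs / to_mbs


mbs_qcif = macroblocks_per_frame(176, 144)    # 11 * 9
mbs_720p = macroblocks_per_frame(1280, 720)   # 80 * 45
budget = cycles_per_macroblock(1280, 720, 30, 1.07e9)
```

Under these assumptions, a 720p frame holds 3600 macroblocks, so 30 fps means 108,000 macroblocks per second and a budget of roughly 9,900 cycles per macroblock for the slowest stage of the pipeline.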



### **Performance Comparison with General CPU**

#### Performance comparison

- Intel Core 2 Duo, Intel Pentium 4 and Pentium 4 HT
- Throughput
  - 4.86-6.83 times higher
- Scaled area
  - 20.2 times smaller



### **Performance Comparison: traditional DSPs**

- Performance estimation on DSPs
  - CAVLC takes 18.2% of the computation time of an H.264 baseline encoder
- 1.0-6.15 times higher throughput and 6.2 times smaller area compared to TI C642 DSP
  - Scaled to 65nm
  - More demanding test for our design

| Platform | Target App. | Processor Type | Tech. | Area (mm²) | Freq. (MHz) | Scaled Area to 65nm (mm²) | Scaled Freq. to 65nm (MHz) | Test Sequence | CAVLC Performance (fps, 720p) |
|----------|-------------|----------------|-------|------------|-------------|---------------------------|----------------------------|---------------|-------------------------------|
| TI C642 | CIF, 24fps | 8-way VLIW | 130nm CMOS | 72 | 600 | 18 | 1200 | 50 frames, IPPPP, QP=25 | 28 |
| ADSP BF561 | CIF, 30fps | Dual-core DSP | 130nm CMOS | N/A | 600 | N/A | 1200 | N/A | 36 |
| TI C641 | QCIF, 24.5fps | 8-way VLIW | 130nm CMOS | 72 | 600 | 18 | 1200 | 100 frames, IPPPP, QP=28 | 7.4 |
| This work (AsAP) | 720p HDTV, 30fps | Array (15 cores) | 65nm CMOS | 2.89* | 1070 | 2.89* | 1070 | 2 frames, IP, QP=20 | 36.0-41.3 |
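The scaled columns in the table are consistent with a simple first-order technology scaling rule, sketched below. The rule itself (area proportional to the square of the feature size, frequency inversely proportional to it) is an assumption inferred from the table values, and the function names are made up for this sketch.

```python
def scale_area(area_mm2, from_nm, to_nm):
    """Assumed rule: area scales with the square of the feature size."""
    return area_mm2 * (to_nm / from_nm) ** 2


def scale_freq(freq_mhz, from_nm, to_nm):
    """Assumed rule: frequency scales inversely with the feature size."""
    return freq_mhz * (from_nm / to_nm)


# TI DSP figures from the table: 72 mm^2 and 600 MHz at 130 nm.
dsp_area_65 = scale_area(72, 130, 65)
dsp_freq_65 = scale_freq(600, 130, 65)

# Area advantage of the 15-core array (2.89 mm^2 at 65 nm).
area_ratio = dsp_area_65 / 2.89
```

With these assumptions the DSP scales to 18 mm² and 1200 MHz, and the ratio of 18 mm² to 2.89 mm² gives the quoted factor of about 6.2 in area.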

## **Processor Activity Analysis & Power**

- Processor activity type
  - Execution
  - Stalls on input or output
- Analysis
  - Data receiving stall on output
  - 7%-65% active time for most processors
  - Bottleneck: zigzag reorder and CAVLC scanning, over 94% active time
- Power estimation
  - One processor
    - 59 mW at 1.07 GHz, 1.3 V, 65 nm, 100% active
    - Nearly zero leakage when processor is idle
  - 323 mW at 1.07 GHz, 1.3 V, for 15 processors plus memory






## Summary

- Fine-grained many-core system
  - Energy efficient, scalable and flexible
  - Exploiting task-level parallelism
- The proposed parallel CAVLC encoder
  - 15 processors plus a 324-word memory; 720p HDTV at 30 fps
  - 4.86-6.83 times higher scaled throughput than the latest general-purpose processors
  - 1.0-6.15 times higher scaled throughput and 6.2 times smaller area compared with traditional DSPs
- Future work
  - Further power reduction using DVFS
  - A complete parallel H.264 baseline encoder

### **Acknowledgments**

- Intellasys Inc.
- SRC GRC Grant 1598 and CSR Grant 1659
- ST Microelectronics
- NSF Grant 0430090 and CAREER Award 0546907
- Intel and S Machines Corporation
- UC Micro
- UCD Faculty Research Grant



## **Thank You!**