# A Fine-Grained Parallel Implementation of an H.264/AVC Encoder on a 167-Processor Computational Platform

ACSSC 2011 – Pacific Grove, CA

Zhibin Xiao<sup>1</sup>, Stephen Le<sup>2</sup> and Bevan M. Baas<sup>1</sup>

<sup>1</sup>University of California, Davis

<sup>2</sup>Intel Corporation, Folsom, CA



# Outline

- Introduction to H.264/AVC Video Encoding
- Features of Target Many-core System
- The Proposed Parallel H.264 Encoder
- Performance Results
- Summary

# **Advanced Video Processing and Standards**

- Application-driven standard development
  - **Standards:** MPEG-1/2/4, H.261/H.262/H.263, H.264/AVC, HEVC
  - **Trend:** lower bit rate, higher resolution, scalable, multi-view
  - **Challenges:** higher computational complexity and power requirements
  - **Approaches:** DSP/CPU (single-core or many-core), FPGA, ASIC, and hybrid architectures









Figure: example applications (camera, video conference, mobile, online video streaming)

# Introduction to H.264/AVC Standard

- Drafted in May 2003 by ITU-T and ISO/IEC MPEG
- New extensions such as Scalable and Multi-View Coding (3D)
- Target applications range from HDTV to low-resolution mobile video
- High computational complexity with heavy data dependencies and irregular processing








# **Target Many-core System Architecture**

# Key features

- 164 enhanced programmable processors
- 3 dedicated-purpose processors
- 3 Shared memories
- Long-distance circuit-switched communication network
- Dynamic Voltage and Frequency Scaling (DVFS)





# **Parallel Programming Methodology**

# 3-step mapping

- Sequential C code
- Parallel C code
- Fine-grained assembly-level code



# Challenges of Mapping H.264/AVC on AsAP

- Limited data memory size (128 words)
  - Solution 1: on-chip 16 KB shared memories
  - Solution 2: small processors used as memory
  - Solution 3: off-chip memories for large frame buffers
- Limited instruction memory size (128 words)
  - Solution: partition programs across processors, which also exposes more parallelism at the cost of communication overhead
- Limited number of inputs (two 64-word input buffers per processor core)
  - Solution: routing processors that combine data from multiple source processors
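
A rough size estimate shows why the frame buffers have to live off-chip. The sketch below is illustrative only: it assumes 8-bit 4:2:0 sampling (1.5 bytes per pixel, as used by the H.264 baseline profile), and the function and constant names are hypothetical rather than taken from the slides.

```python
# Rough frame-buffer sizing, assuming 8-bit 4:2:0 video (1.5 bytes per pixel).
# Function and constant names are illustrative, not from the slides.

SHARED_MEM_BYTES = 16 * 1024  # one on-chip 16 KB shared memory

RESOLUTIONS = {
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "VGA": (640, 480),
    "1080p": (1920, 1080),
}


def frame_bytes_420(width: int, height: int) -> int:
    """Bytes for one frame: a full-size luma plane plus two quarter-size chroma planes."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma


for name, (w, h) in RESOLUTIONS.items():
    size = frame_bytes_420(w, h)
    ratio = size / SHARED_MEM_BYTES
    print(f"{name}: {size} bytes, about {ratio:.1f}x one 16 KB shared memory")
```

Under these assumptions even a single QCIF frame is larger than one shared memory, so only small working sets (such as motion vectors) are candidates for on-chip storage.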




# **Initial Partition of the Baseline Encoder**

# Key components

- Intra-predictor
- Inter-predictor
- Residual encoding (integer transform, quantization, CAVLC)
- Data-flow control





#### **General Problems of H.264 Encoder Parallelization**

#### Large memory requirement

- Current/reference frame: off-chip memory
- Motion vectors: on-chip shared memory
- Non-zero coefficient counts for the CAVLC encoder: on-chip shared memory

# Data-flow control

- Raster-scan encoding order in the format of 16x16 or 4x4 blocks
- Only minimal control information is broadcast; most of it is computed at run time.



Figure: raster-scan encoding order
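
Raster-scan order over 16x16 macroblocks can be sketched as a simple generator. This is a minimal illustration, not the control code used on the chip; the function name is hypothetical, and the 4x4 block ordering inside a macroblock is not modelled.

```python
from typing import Iterator, Tuple

MB_SIZE = 16  # macroblock edge length in pixels


def raster_scan_macroblocks(width: int, height: int) -> Iterator[Tuple[int, int]]:
    """Yield (mb_x, mb_y) macroblock indices left to right, then top to bottom."""
    mbs_wide = (width + MB_SIZE - 1) // MB_SIZE   # round up for partial macroblocks
    mbs_high = (height + MB_SIZE - 1) // MB_SIZE
    for mb_y in range(mbs_high):
        for mb_x in range(mbs_wide):
            yield mb_x, mb_y


# QCIF (176x144) is 11 macroblocks wide and 9 high.
order = list(raster_scan_macroblocks(176, 144))
print(len(order), order[:3], order[11])
```

Because the order is fully determined by the frame size, each processor can derive its current macroblock position from a local counter, which fits the point that little control information needs to be broadcast.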



# **Detailed Parallelization (1): Intra-prediction**

#### Supported modes

- 5 luma modes
- 3 chroma modes

## Level of parallelization

- Luma and chroma are processed in parallel
- All modes are processed in parallel
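
The idea of evaluating all modes independently and then picking the cheapest one can be sketched as follows. The slides do not list which modes are supported or how the winner is chosen, so this example assumes three standard H.264 modes (vertical, horizontal, DC) on a 4x4 block and a SAD cost; all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Block = List[List[int]]  # 4x4 block of pixel values, row-major


def predict_vertical(top: List[int], left: List[int]) -> Block:
    """Every row repeats the four reconstructed pixels above the block."""
    return [list(top) for _ in range(4)]


def predict_horizontal(top: List[int], left: List[int]) -> Block:
    """Every column repeats the four reconstructed pixels left of the block."""
    return [[left[r]] * 4 for r in range(4)]


def predict_dc(top: List[int], left: List[int]) -> Block:
    """All pixels take the rounded mean of the eight neighbouring pixels."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


def sad(a: Block, b: Block) -> int:
    """Sum of absolute differences between two blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


MODES: Dict[str, Callable[[List[int], List[int]], Block]] = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}


def best_mode(block: Block, top: List[int], left: List[int]) -> Tuple[str, Dict[str, int]]:
    """Cost every mode independently (parallelizable), then pick the minimum."""
    costs = {name: sad(block, fn(top, left)) for name, fn in MODES.items()}
    return min(costs, key=costs.get), costs


# A block with vertical stripes is matched exactly by the vertical mode.
block = [[10, 20, 30, 40] for _ in range(4)]
mode, costs = best_mode(block, top=[10, 20, 30, 40], left=[10, 10, 10, 10])
print(mode, costs)
```

Each entry of `costs` depends only on the block and its neighbours, so the per-mode work can be placed on separate processors, with one final comparison step.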



# **Detailed Parallelization (2): Inter-prediction**

- Dedicated motion estimator (ME\_ACC)
  - Asynchronous I/O interface (FIFO)
  - Fully pipelined SAD units
  - Supports 4 programmable search patterns and block sizes
  - 14 billion SADs/sec at 880 MHz and 1.3 V; supports 1080p HDTV at 30 fps
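
For readers unfamiliar with SAD-based motion estimation, the sketch below shows the underlying computation in software. It uses an exhaustive full search purely for illustration; the accelerator's four programmable search patterns are not described in the slides, and the function names here are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]  # frame[y][x] pixel values


def block_sad(cur: Frame, ref: Frame, bx: int, by: int,
              dx: int, dy: int, size: int) -> int:
    """SAD between the current block at (bx, by) and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(size)
        for i in range(size)
    )


def full_search(cur: Frame, ref: Frame, bx: int, by: int,
                size: int, radius: int) -> Tuple[int, Tuple[int, int]]:
    """Try every displacement within +/-radius; return (best SAD, (dx, dy))."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = block_sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Synthetic 8x8 test: the current frame is the reference shifted by (2, 1).
ref = [[3 * x + 17 * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0 for x in range(8)]
       for y in range(8)]
print(full_search(cur, ref, bx=2, by=2, size=4, radius=2))
```

A 4x4 block with a search radius of 2 already needs up to 25 SAD evaluations of 16 pixels each, which gives a sense of why a dedicated, fully pipelined unit is used for this stage.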



# **Detailed Parallelization (3): Residual Encoder**

- 25 processors + 1 shared memory (968 bytes for 1080p HDTV)
  - 8 processors for transform and quantization and 17 processors for CAVLC encoding
  - 8 long-distance links (distance = 1 processor)
  - Variable frame size up to 1080p HDTV at 30 fps, 424 mW average power
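
As background for the transform stage, the standard H.264 4x4 forward core transform computes Y = C·X·Cᵀ using only integer additions and shifts in practice. The sketch below uses plain matrix products for clarity; quantization and CAVLC are omitted, and nothing here reflects how the work is actually split across the eight transform and quantization processors.

```python
from typing import List

Matrix = List[List[int]]

# Forward core transform matrix of the H.264 4x4 integer transform.
CF: Matrix = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def forward_transform_4x4(residual: Matrix) -> Matrix:
    """Return CF * residual * CF^T (unscaled coefficients, before quantization)."""
    return matmul4(matmul4(CF, residual), transpose(CF))


# A flat residual block puts all of its energy into the DC coefficient.
flat = [[3] * 4 for _ in range(4)]
coeffs = forward_transform_4x4(flat)
print(coeffs[0][0], coeffs)
```

The row and column passes are independent 4-point transforms, which is one reason this stage lends itself to fine-grained partitioning.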



## Partitioning of the H.264 Encoder on AsAP

#### Five major modules plus control module

- Each module is implemented and verified separately at both the parallel C and assembly levels
- Bit-level verification of the full encoder at both the parallel C and assembly levels






# **Resource Utilization**

- Total: 115 processors
  - 68 computational processors
  - 28 memory processors
  - 19 routing processors
- Custom mapping vs. mapping tool
  - Custom mapping uses 22% fewer processors

|                          | Custom mapping | Mapping tool |
|--------------------------|----------------|--------------|
| Number of processors     | 115            | 147          |
| Memory processors        | 28             | 28           |
| Routing processors       | 19             | 51           |
| Computational processors | 68             | 68           |
| Long-distance links      | 48             | 52           |
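
The 22% figure follows directly from the table; a quick check, using only the numbers reported above (variable names are illustrative):

```python
# Processor counts copied from the resource utilization table.
custom = {"memory": 28, "routing": 19, "computational": 68}
tool = {"memory": 28, "routing": 51, "computational": 68}

custom_total = sum(custom.values())
tool_total = sum(tool.values())
saving = (tool_total - custom_total) / tool_total

print(custom_total, tool_total, f"{saving:.1%}")
print("routing processors saved:", tool["routing"] - custom["routing"])
```

The memory and computational counts are identical in both mappings, so the entire saving comes from routing processors.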



#### **Processor memory usage**

- Instruction memories
  - 36% usage on average
  - 79% usage for computational processors
- Data memories
  - 68 computational processors (32%)
  - 28 memory processors (100%)
  - 19 routing processors (3%)



## **Performance Results**

- Throughput (IPIP test sequences)
  - VGA (640x480) 21.0 fps
  - CIF (352x288) 63.6 fps
- Power consumption
  - 931 mW at 1.2 V and the maximum frequency of 651 MHz
- Video quality
  - Less than 1 dB loss
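
One way to compare the two throughput figures is to convert them to macroblocks per second. This is a derived quantity, not one reported in the slides, and the helper name is hypothetical.

```python
def macroblocks_per_frame(width: int, height: int, mb: int = 16) -> int:
    """Number of 16x16 macroblocks in a frame whose sides are multiples of 16."""
    return (width // mb) * (height // mb)


# (width, height, reported fps) taken from the throughput bullets.
REPORTED = {"VGA": (640, 480, 21.0), "CIF": (352, 288, 63.6)}

for name, (w, h, fps) in REPORTED.items():
    mbs = macroblocks_per_frame(w, h)
    print(f"{name}: {mbs} macroblocks/frame, about {mbs * fps:.0f} macroblocks/s")
```

Both resolutions work out to roughly 25,200 macroblocks per second, which suggests the frame rate scales inversely with frame size, as expected for a macroblock-pipelined encoder.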

| Voltage (V) | Max freq. (MHz) | Intra (fps) | Inter (fps) | Intra power (mW) | Inter power (mW) |
|-------------|-----------------|-------------|-------------|------------------|------------------|
| 0.8         | 172             | 19          | 95          | 108.8            | 365.1            |
| 0.9         | 295             | 33          | 160         | 213.6            | 452.6            |
| 1.0         | 410             | 49          | 233         | 419.0            | 662.3            |
| 1.1         | 539             | 66          | 324         | 696.3            | 908.4            |
| 1.2         | 651             | 82          | 427         | 802.7            | 1059             |
| 1.3         | 798             | 96          | 478         | 947.5            | 1189             |

Measured encoder performance (QCIF) on AsAP chip
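
Dividing power by frame rate gives energy per frame, which makes the voltage scaling trade-off easier to read. Energy per frame is derived here from the table values and is not a figure reported in the slides; the names are illustrative.

```python
# (voltage V, max freq MHz, intra fps, inter fps, intra mW, inter mW), copied from the table.
ROWS = [
    (0.8, 172, 19, 95, 108.8, 365.1),
    (0.9, 295, 33, 160, 213.6, 452.6),
    (1.0, 410, 49, 233, 419.0, 662.3),
    (1.1, 539, 66, 324, 696.3, 908.4),
    (1.2, 651, 82, 427, 802.7, 1059.0),
    (1.3, 798, 96, 478, 947.5, 1189.0),
]


def energy_per_frame_mj(power_mw: float, fps: float) -> float:
    """mW divided by frames/s gives mJ per frame."""
    return power_mw / fps


for volts, _freq, intra_fps, inter_fps, intra_mw, inter_mw in ROWS:
    intra_mj = energy_per_frame_mj(intra_mw, intra_fps)
    inter_mj = energy_per_frame_mj(inter_mw, inter_fps)
    print(f"{volts:.1f} V: intra {intra_mj:.2f} mJ/frame, inter {inter_mj:.2f} mJ/frame")
```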

# Power breakdown analysis

- Intra-prediction-only encoder
  - 58% for intra prediction
- Inter-prediction-only encoder
  - 63% for inter prediction, including the ME accelerator



# **Summary and future work**

- Fine-grained many-core platform
  - Scalable, flexible and energy-efficient
- Fine-grained parallel programming is not trivial
  - 3-step mapping is crucial for successful parallel programming
- The proposed parallel H.264 baseline encoder
  - 115 processors with two 16 KB shared memories and a hardware motion estimator
  - 1080p HDTV residual encoding at 30 fps with 424 mW average power
  - The full encoder supports VGA (640x480) at 21.0 fps with 925 mW average power consumption
- Future work
  - Parallel implementation of next-generation video standard (HEVC)
  - Distributed reconfigurable memory for next-generation architecture

## **Acknowledgements**

- Support
  - ST Microelectronics
  - SRC GRC Grant 1598 and CSR Grant 1659
  - NSF Grant 430090 and CAREER award 546907
  - Intel
  - Intellasys
  - UC Micro
  - SEM



# **THANK YOU!**