A GALS Many-Core Heterogeneous DSP Platform with Source-Synchronous On-Chip Interconnection Network

> Anh Tran, Dean Truong, and Bevan Baas, University of California, Davis

> NOCS '09, May 13, 2009

- Motivation
- Design of a GALS many-core DSP platform
- A GALS-compatible source-synchronous interconnect network
- Test chip implementation
- Mapping application case study: 802.11a/g baseband receiver
- Conclusion

#### Motivation

#### Emergence of DSP multi-core platforms

- Low design cost and short time-to-market favor programmable and reconfigurable DSP platforms
- Continually shrinking transistor sizes enable multi/many-core designs
- Pollack's Rule: many small cores outperform a few large cores for the same silicon area
- Amdahl's Law: performance speedup depends strongly on available parallelism



[S. Borkar, DAC, 2007]

# High parallelism and deterministic connections in DSP Applications



- A high degree of task-level parallelism is available directly from task graphs for many DSP, multimedia, and embedded applications
- It is often possible to map each task to one or a few small processors
- A statically-configured interconnection network may be sufficient

# Energy advantages of GALS, many-core and heterogeneous architectures

- Independent local clock oscillators
  - Eliminate difficult-to-design, power-hungry global clock trees
  - Allow different frequencies (and supply voltages) for processors depending on their workloads → reduce dynamic power
  - Allow unused processors to be turned off completely → reduce idle power
- Support compute-intensive tasks by specific accelerators



- Our approach to the interconnection network of many-core heterogeneous GALS DSP platforms:
  - Static reconfigurable circuit-switched interconnects
  - Source-synchronous communication across multiple clock domains

#### Design of Our GALS many-core DSP platform



- All programmable processors have identical design and physical layout
- The oscillator and inter-processor communication circuitry are identical for all processing elements (PEs)
  - They are designed as a generic wrapper that is reused for every PE



- 164 small fine-grained processors
- Three reconfigurable accelerators: FFT, Viterbi and Motion Estimation
- Three shared memory modules

## Voltage and Frequency Controller

- Multiple power grids → low design cost, fast voltage switching
- Programmable ring oscillator runs on its own supply voltage for increased stability
- Supply voltage and clock frequency are set depending on the workload
  - Statically
  - Dynamically by software
  - Dynamically by hardware
- Inter-processor communication circuits run at a fixed voltage to avoid using many level shifters



#### The GALS compatible source-synchronous interconnect network

#### 2-D mesh static circuit-switched network



- Each switch has five ports and uses only 4-input MUXes
  - The switch contains no input/output queue buffers, routing control, or arbitration circuitry → very small area and power
- Switches are configured before run time to connect any two processors; links are therefore fixed and not shared → high throughput, low latency
- Small switches make it practical to build multiple parallel networks to increase interconnection capacity; this platform contains two.
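The static configuration described above can be sketched as a small model. This is only an illustration, not the chip's actual configuration scheme: the port names, the `connect` helper, and the X-then-Y path choice are all assumptions introduced here. Each output port is treated as a 4-input mux that selects one of the other four ports, and a reserved output cannot be shared with a second connection.

```python
from typing import Dict, List, Tuple

# Hypothetical port names: four mesh directions plus the local processor.
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
STEP = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}

Coord = Tuple[int, int]
# config[(x, y)][output_port] = input port selected by that output's 4-input mux
Config = Dict[Coord, Dict[str, str]]


def _select(config: Config, node: Coord, out_port: str, in_port: str) -> None:
    """Set one output mux of one switch; fail if it is already in use."""
    if out_port == in_port:
        raise ValueError("an output mux only selects among the other four ports")
    muxes = config.setdefault(node, {})
    if out_port in muxes:
        raise ValueError(f"output {out_port} of switch {node} is already reserved")
    muxes[out_port] = in_port


def connect(config: Config, src: Coord, dst: Coord) -> List[Coord]:
    """Reserve a fixed path from src's processor to dst's processor.

    The path goes along X first, then Y (an assumption made for this sketch).
    Returns the list of switches the link passes through.
    """
    x, y = src
    in_port = "CORE"
    path = [src]
    while (x, y) != dst:
        if x != dst[0]:
            out_port = "E" if dst[0] > x else "W"
        else:
            out_port = "S" if dst[1] > y else "N"
        _select(config, (x, y), out_port, in_port)
        dx, dy = STEP[out_port]
        x, y = x + dx, y + dy
        in_port = OPPOSITE[out_port]  # data enters the next switch on the facing port
        path.append((x, y))
    _select(config, dst, "CORE", in_port)
    return path


config: Config = {}
path = connect(config, (0, 0), (2, 1))
print(path)    # switches traversed by the link
print(config)  # mux selections written before run time
```

Because nothing is arbitrated at run time, a second `connect` call that needs an already reserved output raises an error in this model; on a platform with two parallel networks, such a connection would be placed on the other network.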



- The clock is sent along with the data and valid signals from the source processor to the destination processor
- Links have a capacity of one data word per source-clock cycle
- No intermediate registering is needed, providing small area and low latency



[R. Apperson et al., TVLSI, 2007]



write clock in the stable data timing window

#### Low power communication strategy



- An always-active clock dissipates unnecessary power
- Solution: send clock only when valid data is available
  - 45% power reduction
  - Requires at least one additional cycle due to the reconfigured delay



[Z. Yu and B. Baas, ICCD, 2006]

#### Test chip implementation



- Fabricated in ST 65 nm low-leakage CMOS
- Each processor occupies 0.17 mm<sup>2</sup>, with only 7% of its area used for communication circuits
- Fully functional from 1.2 GHz at 1.3 V down to 5 MHz at 0.6 V

## Mapping of an 802.11a/g baseband receiver



- Manually partition tasks onto one or more processors
- Program processors in a simplified version of the C language, combined with assembly language for interconnect configuration and code optimization
- Simulate whole system at the cycle-accurate RTL level using NC Verilog
- Compare results with a Matlab model to verify functionality
- Use activity percentages reported by the simulator for power estimation

## Throughput evaluation

- OFDM data symbols are processed by an interconnected sequence of processors
- The Viterbi processor is the slowest one and thus determines throughput of the receiver
- Faster processors stall on either input or output while waiting to receive or send data
- Each processor processes one 4 µs OFDM data symbol in 2376 cycles
  - → 54 Mbps throughput at 594 MHz and 0.95 V
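The arithmetic behind these figures can be checked with a short sketch. The numbers come from the slide; the function and variable names are introduced here for illustration only.

```python
# Figures taken from the slide above.
CYCLES_PER_SYMBOL = 2376   # cycles needed per OFDM data symbol
SYMBOL_PERIOD_US = 4.0     # duration of one OFDM data symbol, in microseconds
DATA_RATE_MBPS = 54.0      # 802.11a/g data rate reached by the receiver


def required_clock_mhz(cycles_per_symbol: int, symbol_period_us: float) -> float:
    """Clock needed to finish one symbol's work in one symbol period.

    Cycles per microsecond is numerically the same as MHz.
    """
    return cycles_per_symbol / symbol_period_us


def bits_per_symbol(data_rate_mbps: float, symbol_period_us: float) -> float:
    """Decoded bits delivered per symbol (Mbit/s times microseconds gives bits)."""
    return data_rate_mbps * symbol_period_us


clock_mhz = required_clock_mhz(CYCLES_PER_SYMBOL, SYMBOL_PERIOD_US)
payload_bits = bits_per_symbol(DATA_RATE_MBPS, SYMBOL_PERIOD_US)
print(f"{clock_mhz:.0f} MHz, {payload_bits:.0f} bits per symbol")
```

The result, 594 MHz, is the clock at which a processor needing 2376 cycles per symbol keeps pace with one symbol every 4 µs.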



#### Power estimation at 594 MHz and 0.95 V

Power is estimated from the number of cycles each processor spends executing, stalling with an active clock, and in standby with a halted clock, together with the number of data items sent on each link and the length of each link.

| Processor                | Execution Time (cycles) | Stall with Active Clock (cycles) | Standby with Halted Clock (cycles) | Output Time (cycles) | Comm. Distance (# switches) |
|--------------------------|-------------------------|----------------------------------|------------------------------------|----------------------|-----------------------------|
| Data Distribution        | 320                     | 960                              | 1096                               | 80 x 2               | 6                           |
| Post-Timing Sync.        | 240                     | 960                              | 1176                               | 80 x 2               | 5                           |
| Acc. Offset Vector Comp. | 2320                    | 56                               | 0                                  | 80 x 2               | 2                           |
| CFO Compensation         | 2160                    | 216                              | 0                                  | 80 x 2               | 2                           |
| Guard Removal            | 176                     | 768                              | 1432                               | 64 x 2               | 6                           |
| 64-point FFT             | 205                     | 768                              | 1403                               | 64 x 2               | 3                           |
| Subcarrier Reorder       | 1018                    | 576                              | 782                                | 48 x 2               | 4                           |
| Channel Equalization     | 1488                    | 576                              | 312                                | 48 x 2               | 2                           |
| De-modulation            | 2352                    | 24                               | 0                                  | 288                  | 2                           |
| De-interleaving 1        | 864                     | 1512                             | 0                                  | 288                  | 2                           |
| De-interleaving 2        | 1130                    | 1246                             | 0                                  | 288                  | 2                           |
| De-puncturing            | 576                     | 1800                             | 0                                  | 432                  | 2                           |
| Viterbi Decoding         | 2376                    | 0                                | 0                                  | 216                  | 3                           |
| De-scrambling            | 2160                    | 216                              | 0                                  | 216                  | 2                           |
| Pad Removal              | 648                     | 1296                             | 432                                | 216                  | 2                           |

$P_{Total} = \sum P_{Exe.i} + \sum P_{Stall.i} + \sum P_{Standby.i} + \sum P_{Comm.i} = 174.76 \text{ mW}$

→ $\sum P_{Comm.i} = 12.18 \text{ mW}$ (or 7%)
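As a rough sketch of how the cycle counts translate into activity percentages, the snippet below divides each state's cycles by the 2376-cycle symbol budget. Only three rows of the table are copied in, and the helper name `activity_fractions` is an assumption made for this illustration.

```python
SYMBOL_CYCLES = 2376  # cycles in one 4 us OFDM symbol at 594 MHz

# (execution, stall with active clock, standby with halted clock) cycles,
# copied from three rows of the table above.
CYCLES = {
    "Data Distribution": (320, 960, 1096),
    "Channel Equalization": (1488, 576, 312),
    "Viterbi Decoding": (2376, 0, 0),
}


def activity_fractions(exe, stall, standby, total=SYMBOL_CYCLES):
    """Return the fraction of a symbol period spent in each state."""
    if exe + stall + standby != total:
        raise ValueError("cycle counts do not add up to one symbol period")
    return exe / total, stall / total, standby / total


fractions = {name: activity_fractions(*c) for name, c in CYCLES.items()}
for name, (exe, stall, standby) in fractions.items():
    print(f"{name:22s} exe={exe:6.1%} stall={stall:6.1%} standby={standby:6.1%}")

# Share of the estimated total power spent on communication.
comm_share = 12.18 / 174.76
print(f"communication share: {comm_share:.1%}")
```

Every row of the table sums to 2376 cycles, so the three fractions of a processor always add up to one.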

## Power reduction by frequency and voltage scaling

| Processor               | Optimal Frequency (MHz) | Power Consumed, frequency scaling (mW) | Optimal Voltage (V) | Power Consumed, frequency & voltage scaling (mW) |
|-------------------------|-------------------------|----------------------------------------|---------------------|--------------------------------------------------|
| Data Distribution       | 80                      | 3.52                                   | 0.75                | 0.63                                             |
| Post-Timing Sync.       | 60                      | 2.78                                   | 0.75                | 2.11                                             |
| Acc. Off. Vector Comp.  | 580                     | 17.72                                  | 0.95                | 17.72                                            |
| CFO Compensation        | 540                     | 16.53                                  | 0.95                | 16.53                                            |
| Guard Removal           | 44                      | 2.23                                   | 0.75                | 1.73                                             |
| 64-point FFT            | 51                      | 1.64                                   | 0.75                | 1.23                                             |
| Subcarrier Reorder      | 257                     | 8.12                                   | 0.75                | 5.22                                             |
| Channel Equalization    | 372                     | 11.34                                  | 0.95                | 11.34                                            |
| De-modulation           | 588                     | 18.38                                  | 0.95                | 18.38                                            |
| De-interleaving 1       | 216                     | 7.36                                   | 0.75                | 4.95                                             |
| De-interleaving 2       | 283                     | 9.34                                   | 0.95                | 9.34                                             |
| De-puncturing           | 144                     | 5.70                                   | 0.75                | 4.10                                             |
| Viterbi Decoding        | 594                     | 7.13                                   | 0.95                | 7.13                                             |
| De-scrambling           | 540                     | 16.72                                  | 0.95                | 16.72                                            |
| Pad Removal             | 162                     | 5.52                                   | 0.75                | 3.71                                             |
| Ten non-critical Procs. |                         |                                        | 0.95                | 0.31                                             |
| Total (mW)              |                         | 134.32                                 |                     | 123.18                                           |

$f_{Opt.i} = \frac{N_{Exe.i}}{4} \left(\frac{\text{cycles}}{\mu s} = \text{MHz}\right)$

$\sum P_{Comm.i} = 12.18 \text{ mW}$ (or 10%)

[Figure: total power (mW) versus $V_{ddLow}$ (V), for $V_{ddLow}$ from 0.6 V to 0.95 V; annotated values: 134.32 mW and 123.18 mW]
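The optimal-frequency relation on this slide, execution cycles divided by the 4 µs symbol period, can be sketched as follows. The cycle counts are copied from the power-estimation table for four processors whose quotient is a whole number; the function name is introduced here for illustration.

```python
SYMBOL_PERIOD_US = 4.0  # one OFDM data symbol

# Execution cycles per symbol for four processors of the receiver.
EXECUTION_CYCLES = {
    "Data Distribution": 320,
    "Guard Removal": 176,
    "De-modulation": 2352,
    "Viterbi Decoding": 2376,
}


def optimal_frequency_mhz(execution_cycles, symbol_period_us=SYMBOL_PERIOD_US):
    """Lowest clock that completes the execution cycles within one symbol.

    Cycles per microsecond is numerically the same as MHz.
    """
    return execution_cycles / symbol_period_us


optimal = {name: optimal_frequency_mhz(n) for name, n in EXECUTION_CYCLES.items()}
for name, mhz in optimal.items():
    print(f"{name:20s} {mhz:6.1f} MHz")
```

A processor running at this frequency spends the whole symbol period executing, so the stall and standby cycles seen at 594 MHz disappear.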

#### Estimation and measurement

| Configuration Mode            | Estimated Power<br>(mW) | Measured Power<br>(mW) | Difference |
|-------------------------------|-------------------------|------------------------|------------|
| At 594 MHz and 0.95 V         | 174.8                   | 177.9                  | 1.8%       |
| At optimal frequencies only   | 134.3                   | 139.6                  | 3.9%       |
| At both optimal freq. & volt. | 123.2                   | 129.8                  | 5.1%       |

- The receiver operates correctly on the test chip
- Total time for designing, simulating, and testing this receiver is about 3 months
- The difference between estimated and measured power is within 2-5%

## Conclusion

- Many-core designs are a promising solution for programmable DSP platforms
- When coupled with GALS and heterogeneous architectures, they achieve high performance at high energy efficiency
- A test chip was fabricated in 65 nm CMOS and is fully functional
  - Uses static circuit-switched interconnection networks with simple switches that are highly suitable for many DSP applications
  - The networks utilize a simple yet effective source-synchronous communication technique across multiple clock domains
- An 802.11a/g Wi-Fi baseband receiver mapped onto this platform obtains 54 Mbps throughput while consuming only 130 mW, with 10% dissipated in its interconnection links

# Acknowledgments

- NSF Grant 430090 and CAREER award 546907
- SRC GRC Grant 1598 and CSR Grant 1659
- Intellasys
- UC Micro
- Intel
- ST Microelectronics
- A VEF Fellowship
- SEM
- J.-P. Schoellkopf, P. Cogez, Y.-P. Cheng,
   A. Gatherer, R. Krishnamurthy, K. Bowman, and M. Anders

## **THANK YOU!**

## Backup/Extra Slides

Source-synchronous interconnects:

- Switch structure
- Dual-clock FIFO
- Programming the receiver to follow an FSM model:
  - Save power
  - Obtain high throughput
- Power estimation equations:
  - Based on activity percentages of execution, stall, standby, output times of each processor and its interconnection distance

#### Source-synchronous communication (1)



- On each interconnect link, the clock is sent together with the bundled valid and data signals from the source to the destination
  - One data item is sent per cycle
  - No intermediate register is needed, hence low latency



[R. Apperson et al., TVLSI, 2007]











#### Power estimation

$P_{Exe.i} = \alpha_i \cdot P_{ExeAvg}$

$P_{Stall.i} = \beta_i \cdot P_{StallAvg}$

$P_{Standby.i} = (1 - \alpha_i - \beta_i) \cdot P_{StandbyAvg}$

$P_{Comm.i} = \gamma_i \cdot \left[ (P_{SwitchActive} + 2P_{SwitchStall}) \cdot L_i - (P_{FIFOWriteActive} + 2P_{FIFOWriteStall}) \cdot L_i \right]$

$P_{Total} = \sum P_{Exe.i} + \sum P_{Stall.i} + \sum P_{Standby.i} + \sum P_{Comm.i}$

The per-processor execution, stall, standby, and output cycle counts and the communication distances are those listed in the power-estimation table of the case study.
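A minimal sketch of the per-processor part of this model follows. It assumes that $\alpha_i$ and $\beta_i$ are the fractions of a symbol period spent executing and stalling. The `AveragePower` values used in the example are placeholders chosen for illustration, not measurements from the chip, and the communication term $P_{Comm.i}$ is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class AveragePower:
    """Average power of one processor in each state, in mW (placeholder values)."""
    exe: float
    stall: float
    standby: float


def processor_power(alpha: float, beta: float, avg: AveragePower) -> float:
    """Sum of P_Exe.i, P_Stall.i and P_Standby.i for one processor.

    alpha: fraction of time executing; beta: fraction stalled with active clock.
    The remaining fraction (1 - alpha - beta) is standby with a halted clock.
    """
    if alpha < 0.0 or beta < 0.0 or alpha + beta > 1.0 + 1e-9:
        raise ValueError("alpha and beta must be non-negative and sum to at most 1")
    standby_fraction = max(0.0, 1.0 - alpha - beta)
    p_exe = alpha * avg.exe
    p_stall = beta * avg.stall
    p_standby = standby_fraction * avg.standby
    return p_exe + p_stall + p_standby


# Hypothetical averages, for illustration only.
avg = AveragePower(exe=20.0, stall=10.0, standby=0.0)

always_busy = processor_power(alpha=1.0, beta=0.0, avg=avg)
half_stalled = processor_power(alpha=0.5, beta=0.5, avg=avg)
mostly_idle = processor_power(alpha=0.25, beta=0.25, avg=avg)
print(always_busy, half_stalled, mostly_idle)
```

Summing `processor_power` over all processors and adding the communication term gives the total in the last equation above.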