# Trends and Challenges in LDPC Hardware Decoders

Tinoosh Mohsenin and Bevan Baas

ECE Department, University of California, Davis

Abstract: Over the last decade, low-density parity-check (LDPC) codes have received significant attention due to their superior error correction performance, and have been adopted by recent communication standards such as 10 Gigabit Ethernet (10GBASE-T), digital video broadcasting (DVB-S2), WiMAX (802.16e), Wi-Fi (802.11n) and 60 GHz WPAN (802.15.3c). While there has been much research on LDPC decoders and their VLSI implementations, many challenges remain in meeting requirements such as lower error floors, reduced interconnect complexity, smaller die area, lower power dissipation, and run-time reconfigurability to support multiple code lengths and code rates.

This paper provides an overview of current research in LDPC decoder algorithms and architectures that are well suited for hardware implementation. It identifies near- and long-term trends in next-generation LDPC requirements and analyzes how current architectures will fare with the increasing demands on throughput, BER performance, power dissipation, and chip area (among others) that must be met for the widespread adoption of LDPC codecs in near-future applications.

#### I. INTRODUCTION

Error correction plays a major role in communication and storage systems by increasing transmission reliability and achieving better error performance with less signal power. LDPC codes have recently received a lot of attention because of their superior error correction performance, and have been adopted by many communication standards (wired, wireless and broadcasting) and applications.

LDPC codes were first used in 2003 in the second generation of the digital video broadcasting satellite standard (DVB-S2) [1] and were recently adopted by other digital TV standards such as DVB-T2 (terrestrial) and DVB-C2 (cable). These standards require very long codes (for very good error performance), multiple code rates (for different modes of operation) and medium throughput. For example, DVB-S2 specifies LDPC codes of length 64800 bits and 16200 bits, with 11 different code rates, and a 135 Mbps decoding throughput.

In wireless applications, both WiMAX (802.16e) [2] and Wi-Fi (802.11n) [3] adopted LDPC codes as an optional coding scheme. Similar to broadcasting standards, they specify LDPC codes of different lengths and rates (802.16e has 19 different code lengths) and medium throughput in the range of 40 Mbps to 340 Mbps.

In contrast, the 10 Gbit Ethernet copper (10GBASE-T) standard [4] specifies a high code rate LDPC code with a fixed code length of 2048 bits, with a very high decoding throughput of 6.4 Gbps, and a very low error floor (BER =  $10^{-14}$ ).

Although there is no standard for magnetic recording hard disks, they demand high code rate, low error floor, and high decoding throughput (e.g., rate-8/9 LDPC decoder with 2.1 Gbps throughput [5]).

Six primary criteria must be considered in an LDPC decoder design, depending on the application requirements: silicon area, speed, energy dissipation per bit, latency, error floor, and error performance gap from the Shannon limit. This paper provides an overview of current research on LDPC decoders and looks at the near- and long-term throughput and power dissipation requirements of future applications and the challenges for LDPC decoders that must be addressed in order to meet them.



Fig. 1. Parity check matrix (upper) and Tanner graph (lower) representation of a 12 column (N), 6 row (M), column weight 2  $(W_c)$ , row weight 4  $(W_r)$ , 24 edge, LDPC code with information length 7 (K). The first check node and variable node processing steps are highlighted in the parity check matrix and Tanner graph.

#### II. LDPC CODES AND MESSAGE PASSING ALGORITHM

A $(W_c, W_r)(N, K)$ regular LDPC code is characterized by an $M \times N$ binary matrix called the *parity check matrix* or H matrix, a column weight $W_c$, and a row weight $W_r$. LDPC codes are also defined by a bipartite (Tanner) graph consisting of two sets of nodes: M check nodes and N variable nodes. The total number of edges (connections) between variable nodes and check nodes is $N \times W_c$, or equivalently $M \times W_r$.

The iterative message passing algorithm called Sum Product (SPA) [6] is the most widely used LDPC decoding method. The Min-Sum algorithm (MS) [7] is a reduced-complexity decoding method that simplifies the check node processing unit.
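The edge-count identity can be checked on a small example. The following is a minimal sketch in Python, assuming a hypothetical $6 \times 12$ matrix with $W_c = 2$ and $W_r = 4$; it has the same dimensions and weights as the code in Fig. 1, but the placement of the ones is invented for illustration and is not the matrix from the figure. The helper `build_regular_h` is likewise hypothetical and only works for N = 2M.

```python
def build_regular_h(m=6, n=12):
    """Build an illustrative regular H with column weight 2 and row weight 4.

    Assumes n == 2 * m. Column j connects to check node (j mod m) and to a
    second check node at offset 1 (first m columns) or 2 (last m columns).
    """
    h = [[0] * n for _ in range(m)]
    for j in range(n):
        offset = 1 if j < m else 2
        h[j % m][j] = 1
        h[(j % m + offset) % m][j] = 1
    return h


H = build_regular_h()
col_weights = [sum(row[j] for row in H) for j in range(len(H[0]))]  # W_c per variable node
row_weights = [sum(row) for row in H]                               # W_r per check node
num_edges = sum(row_weights)                                        # edges in the Tanner graph

print(col_weights, row_weights, num_edges)
```

Here `num_edges` equals both $N \times W_c = 12 \times 2$ and $M \times W_r = 6 \times 4$.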

The conventional message passing algorithm consists of two phases: check node processing which generates  $\alpha$  and variable node processing which generates  $\beta$ . The check node processing output in MinSum is computed as follows:

$$\alpha_{ij} = S_{MS} \times \prod_{j' \in V(i) \setminus j} \operatorname{sign}(\beta_{ij'}) \times \min_{j' \in V(i) \setminus j} (|\beta_{ij'}|) \tag{1}$$

where each  $\alpha_{ij}$  message is generated using the  $\beta$  messages from all variable nodes V(i) (excluding  $V_j$ ) connected to check node  $C_i$  as defined by the Tanner graph. The scaling factor  $S_{MS}$  is included to improve error performance in MinSum, and this decoding method is called MinSum Normalized [8], [9].
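Eq. 1 can be illustrated for a single check node with the short sketch below. It assumes the $\beta$ messages are plain floating-point values, that a zero message is treated as positive, and that the row weight is at least 2; the function name and the scale value 0.75 are illustrative choices, not values taken from this paper.

```python
def check_node_update(beta_row, scale=0.75):
    """Min-Sum Normalized update for one check node C_i (Eq. 1).

    beta_row: beta messages from the variable nodes in V(i), in a fixed order.
    scale:    stands in for S_MS; 0.75 is only an illustrative value.
    Returns the alpha message sent back to each variable node in V(i).
    """
    alphas = []
    for j in range(len(beta_row)):
        others = beta_row[:j] + beta_row[j + 1:]   # V(i) \ j
        sign = 1
        for b in others:
            if b < 0:
                sign = -sign                       # product of signs
        magnitude = min(abs(b) for b in others)    # minimum magnitude
        alphas.append(scale * sign * magnitude)
    return alphas


alphas = check_node_update([2.0, -0.5, 1.5, -3.0])
print(alphas)  # [0.375, -1.125, 0.375, -0.375]
```

A hardware check node would typically track only the two smallest magnitudes of the row rather than recomputing the minimum for every output; the loop above favors clarity over efficiency.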

In the variable node update, which is identical to the SPA version, each $\beta_{ij}$ message is generated using the noisy channel information (of a single bit), $\lambda_j$, and the $\alpha$ messages from all check nodes C(j) connected to variable node $V_j$ (excluding $C_i$), as defined by the Tanner graph, and is computed as follows:

$$\beta_{ij} = \lambda_j + \sum_{i' \in C(j) \setminus i} \alpha_{i'j} \tag{2}$$
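A minimal sketch of Eq. 2 for a single variable node is given below. The function name is hypothetical. Forming the full sum once and then subtracting each incoming message is one common way to produce all outputs; it is an implementation choice, not something prescribed by Eq. 2.

```python
def variable_node_update(channel_llr, alpha_col):
    """Variable node update for one variable node V_j (Eq. 2).

    channel_llr: lambda_j, the channel information for this bit.
    alpha_col:   alpha messages from the check nodes in C(j), in a fixed order.
    Returns the beta message sent to each check node in C(j).
    """
    total = channel_llr + sum(alpha_col)
    # Subtracting alpha_ij removes check node i's own contribution,
    # which leaves lambda_j plus the sum over C(j) \ i.
    return [total - a for a in alpha_col]


betas = variable_node_update(1.0, [0.375, -1.125])
print(betas)  # [-0.125, 1.375]
```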

Fig. 2. Throughput of reported full-parallel and partial-parallel LDPC decoder ASIC implementations versus year.

Decoding can repeat iteratively until a preset maximum number of iterations is reached. However, for large signal-to-noise ratio values the majority of blocks can be corrected after a small number of iterations. Thus the circuit power can be lowered by stopping the decoding process as soon as a valid codeword is detected; this is called early termination. A standard method is to make a hard decision (zero or one) for each variable node at the end of each iteration and then check whether all parity constraints are satisfied.
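The early termination check can be sketched as follows. The sketch assumes the sign convention that a non-negative posterior value maps to bit 0, which the text does not specify, and uses a hypothetical $2 \times 4$ parity check matrix; the function names are illustrative.

```python
def hard_decisions(posteriors):
    """Map each variable node's posterior value to a bit (assumed: >= 0 means 0)."""
    return [0 if p >= 0 else 1 for p in posteriors]


def all_checks_satisfied(h, bits):
    """Return True if every row of H has even parity over the decided bits."""
    return all(sum(hij & b for hij, b in zip(row, bits)) % 2 == 0 for row in h)


# Hypothetical 2 x 4 parity check matrix, for illustration only.
H_small = [[1, 1, 0, 1],
           [0, 1, 1, 1]]

valid = all_checks_satisfied(H_small, hard_decisions([-1.5, 0.8, -0.3, -2.1]))
invalid = all_checks_satisfied(H_small, hard_decisions([-1.5, 0.8, 0.3, -2.1]))
print(valid, invalid)  # True False
```

A decoder using this test would stop iterating as soon as `all_checks_satisfied` returns True, and otherwise continue until the preset maximum number of iterations.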

# III. LDPC DECODER ARCHITECTURES

The check node and variable node processing steps do not require very sophisticated operations. However, the major challenge is mapping these processing nodes into hardware in order to minimize the communication between these nodes under stringent cost constraints for silicon area, throughput and energy dissipation.

In full-parallel decoders [10], [11], [12], [13], each row and each column of the parity check matrix is directly mapped to a different processing unit and all these processing units operate in parallel. Although full-parallel decoders provide the highest throughput and require no large memory element such as SRAMs to store intermediate messages, they suffer from large circuit area and routing congestion, which are caused by the large number of processing units and very long global wires between them [10].

Partial-parallel designs [14], [15], [16], [17], [18], [19], [20] partition the parity check matrix into row-wise and column-wise groupings such that a set of check node and variable node updates can be done per cycle. This partitioning can potentially limit practical partial-parallel designs to regular structured LDPC codes. The irregularity of random codes makes partitioning difficult because of the memory addressing problems inherent in the irregular placement of "ones" in the parity check matrix [21].

Figure 2 shows the throughput of reported ASIC designs (measured or post-layout implementations) versus year for full-parallel and partial-parallel decoders. The tick marks along the right side of the plot indicate the maximum throughput requirement for five popular standards. Because of their area and routing costs, few full-parallel decoder implementations have been reported. All decoders for the DVB-S2, 802.16e and 802.11n standards, which require reconfigurable hardware to support different code lengths and code rates, are partial-parallel.

Figure 3 (a) and (b) show the decoding throughput and the energy dissipation per bit of the decoders versus CMOS technology, respectively. In order to fairly compare throughput and energy dissipation, implementations with an early termination scheme are excluded. A curve connects the data points that have the maximum throughput and the minimum energy per given technology in Fig. 3 (a) and (b), respectively. As shown in the figures, in general, most partial-parallel decoders have lower decoding throughput and higher energy dissipation than full-parallel decoders in each technology. However, as shown in Fig. 3 (c) (where the curve connects the smallest die area per given technology), full-parallel decoders have larger circuit area than partial-parallel decoders. Also note that, in general, the number of edges in LDPC codes, which is an indication of code complexity, has increased as technology advances (Fig. 3 (d)).

Fig. 4. Parity check matrix of a quasi-cyclic code consisting of $n \times b$ columns and $m \times b$ rows, with $b \times b$ permuted identity submatrices.

# IV. CURRENT RESEARCH ON LDPC DECODERS

Current research has focused on the decoding algorithm, code design and VLSI implementation to meet the demands for current applications which were discussed in Section I. These requirements are: very low error floor, hardware reconfigurability, very high throughput and high energy efficiency.

#### A. Error Floor Reduction

Although message passing decoding for LDPC codes has shown very good error performance, most LDPC codes have a major drawback known as an error floor, where the slope of the error performance curve suddenly becomes shallow [23]. Usually this happens when a small number of check sums are not satisfied because of a very small number of errors. A trapping set is defined as a set of variable nodes that are connected to a small number of odd-degree check nodes [24], [23]. If errors happen on variable nodes in the trapping set, the messages from such a small number of check nodes are most probably not sufficient to correct these errors, which can result in an error floor. Current studies to lower the error floor have focused on better code construction techniques, code concatenation with conventional codes such as Reed-Solomon or BCH, and decoding-based strategies. The latter consist of two-stage decoding. The first stage is usually the regular message passing scheme; the second stage is performed only if the iterative decoding fails to correct the errors after some iterations. The recently proposed post-processing methods perform a message biasing scheme [25] on check nodes or bit flipping on selected variable nodes [24]. Both of these schemes are followed by another iteration of a regular message passing scheme.

#### B. Efficient Reconfigurable Decoder Design

LDPC codes by nature have a very random structure that makes them very inefficient for hardware implementations. A new class of hardware-efficient codes, called Quasi-Cyclic (QC) [26] or block-structured LDPC codes [27], has shown error performance comparable to that of randomly structured codes. The parity check matrix of these codes consists of square sub-matrices, where each sub-matrix is either a zero matrix or a permuted identity matrix. An example is shown in Fig. 4, which defines a matrix with $n \times b$ columns (the code length) and $m \times b$ rows, built from $b \times b$ submatrices. This structure makes memory address generation for partial-parallel decoders very efficient, and many communication standards such as DVB-S2, 802.11n and 802.16e use it.
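The block structure can be made concrete with a small expansion routine. This is a minimal sketch under two assumptions that are not stated in the text: each nonzero sub-matrix is a cyclically shifted identity (one specific kind of permuted identity), and a table entry of -1 marks a zero sub-matrix. The function name and the example shift table are hypothetical.

```python
def expand_qc(base, b):
    """Expand an m x n table of shift values into an (m*b) x (n*b) binary matrix.

    base[r][c] == -1 marks a b x b zero sub-matrix (assumed convention);
    a value s >= 0 marks the b x b identity with columns cyclically shifted by s.
    """
    m, n = len(base), len(base[0])
    h = [[0] * (n * b) for _ in range(m * b)]
    for r in range(m):
        for c in range(n):
            s = base[r][c]
            if s < 0:
                continue                      # zero sub-matrix
            for k in range(b):
                h[r * b + k][c * b + (k + s) % b] = 1
    return h


# Hypothetical shift table with m = 2, n = 4 and sub-matrix size b = 3.
base = [[0, 1, -1, 2],
        [2, -1, 0, 1]]
H_qc = expand_qc(base, b=3)
print(len(H_qc), len(H_qc[0]))  # 6 12
```

Because every nonzero block is fully described by one shift value, a decoder only needs the small table `base` to generate memory addresses, which is the efficiency argument made above.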



Fig. 3. Throughput, energy dissipation per bit, silicon area and number of edges (check node and variable node connections in Tanner graph) of reported LDPC decoder ASIC implementations versus CMOS technology. For throughput and energy plots the implementations with early termination scheme are excluded. Also for the area plot, full-parallel implementations with reduced routing schemes such as Split-Row [12], Split-Row Threshold [11] and bit-serial [22] methods are excluded for a fair comparison. The idealized contour in the throughput plot is obtained through linear scaling with technology (S); in the energy plot it is obtained through linear scaling with technology and quadratic scaling with voltage (V); and in the area plot it is obtained through quadratic scaling with technology.

A generic architecture for a reconfigurable decoder maps each submatrix or multiple submatrices to a memory block or register file and connects them to variable and check node processors through a reconfigurable routing scheme. A controller generates addresses for memory access and defines the interconnections for different modes. Overlapped check node and variable node processing [28], also known as Turbo decoding message passing (TDMP) [29] or layered decoding [30], is used for Quasi-Cyclic codes to enhance the throughput [14], [31], [17], [18]. Depending on the code structure, it may require reordering the rows and columns of the parity check matrix for efficient address generation [18]. To reduce area and power consumption, block-serial scheduling is used [31] and register files are proposed [18]. For further power reduction, shared processors and memory blocks that are not in use are deactivated [32].
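The layered schedule can be sketched in a few lines of Python. This is a simplified illustration, not the architecture of any cited design: it assumes each single row of H is one layer (for QC codes a layer would typically be a block row of $b$ rows), uses the Min-Sum Normalized check update of Eq. 1 with an illustrative scale of 0.75, and assumes non-negative values map to bit 0. The function name and data layout are hypothetical.

```python
def layered_min_sum_iteration(h_rows, posterior, alpha, scale=0.75):
    """Run one layered decoding iteration, treating each row as a layer.

    h_rows:    h_rows[i] lists the column indices in V(i) (row weight >= 2).
    posterior: per-bit posterior values, initialized to the channel values;
               updated in place.
    alpha:     dict {(i, j): check-to-variable message}; updated in place.
    """
    for i, cols in enumerate(h_rows):
        # Remove this row's previous contribution to recover beta_ij.
        beta = [posterior[j] - alpha.get((i, j), 0.0) for j in cols]
        for idx, j in enumerate(cols):
            others = beta[:idx] + beta[idx + 1:]
            sign = -1 if sum(b < 0 for b in others) % 2 else 1
            new_alpha = scale * sign * min(abs(b) for b in others)
            # Later rows immediately see the refreshed posterior.
            posterior[j] = beta[idx] + new_alpha
            alpha[(i, j)] = new_alpha
    return posterior


# Hypothetical code: check 0 covers bits {0, 1, 3}, check 1 covers bits {1, 2, 3}.
h_rows = [[0, 1, 3], [1, 2, 3]]
posterior = [-1.5, 0.75, 0.25, -2.0]   # bit 2 starts with the wrong sign
alpha = {}
layered_min_sum_iteration(h_rows, posterior, alpha)
bits = [0 if p >= 0 else 1 for p in posterior]
print(bits)  # [1, 0, 1, 1]
```

In this example the initial hard decisions violate the second check, and a single layered pass flips bit 2 so that both checks are satisfied. The key difference from the two-phase schedule is that each row uses posterior values already updated by the rows before it within the same iteration.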

#### C. Routing Congestion Reduction

As shown in Section III, full-parallel decoders can potentially have the highest throughput and energy efficiency, but because of the high routing congestion caused by long global wires between processors they are not efficient to build. The reduced-complexity "Split-Row" decoding methods have shown significant reductions in routing congestion by reducing the message passing in the check node operation [33], [12]. These methods partition the parity check matrix column-wise into Spn partitions, where the check node operation in each partition is computed simultaneously and almost independently. The parity check matrix example in Fig. 5 highlights the first row processing using Split-Row with two partitions. As shown in the figure, each check node connects to only two variable nodes instead of four, which results in less processor and interconnect complexity.

Fig. 5. Parity check matrix example, highlighting the first row (check node) processing using Split-Row, with nearly independent and simultaneous check node processing in $C1_{Sp0}$ and $C1_{Sp1}$.

Split-Row modifies the check node update (Eq. 1) of MinSum Normalized as follows:

$$\alpha_{ij:Spn} = S_{SR} \times \underbrace{\prod_{j' \in V(i) \setminus j} \operatorname{sign}(\beta_{ij'})}_{\text{Sign Calculation}} \times \underbrace{\min_{j' \in V_{Spn}(i) \setminus j} (|\beta_{ij'}|)}_{\text{Magnitude Calculation}} \tag{3}$$

where $V_{Spn}(i)$ denotes the subset of the variable nodes V(i) on row *i* that lie in partition Spn. In the original Split-Row method, the only communication between the check node partitions of a row is a single sign-bit wire. Therefore, the sign computation in Eq. 3 remains the same as in MinSum (Eq. 1). In addition, a new scale factor $S_{SR}$ is required to normalize $\alpha$. Due to the loss of global information among the partitions, Split-Row suffers from a 0.3 to 0.7 dB reduction in performance, depending on the level of partitioning.
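The difference between Eq. 1 and Eq. 3 can be seen in a short sketch for one check node. It assumes the $\beta$ messages of the row are ordered by column and split into equal-width partitions with at least two entries each; the function name and the scale value 0.5 (standing in for $S_{SR}$) are illustrative, not values from this paper.

```python
def split_row_check_node(beta_row, num_partitions=2, scale=0.5):
    """Split-Row update for one check node (Eq. 3).

    The sign is computed over all of V(i); the minimum only over the
    variable nodes in the same partition, V_Spn(i).
    Assumes len(beta_row) is divisible by num_partitions and each partition
    holds at least two entries. `scale` is an illustrative stand-in for S_SR.
    """
    width = len(beta_row) // num_partitions
    negatives = sum(b < 0 for b in beta_row)   # global sign information
    alphas = []
    for j, b_j in enumerate(beta_row):
        # Sign over V(i) \ j: exclude b_j's own sign from the global parity.
        neg_others = negatives - (1 if b_j < 0 else 0)
        sign = -1 if neg_others % 2 else 1
        # Magnitude over V_Spn(i) \ j: only entries in j's own partition.
        start = (j // width) * width
        local = [abs(b) for k, b in enumerate(beta_row[start:start + width], start)
                 if k != j]
        alphas.append(scale * sign * min(local))
    return alphas


alphas_sr = split_row_check_node([2.0, -0.5, 1.5, -3.0])
print(alphas_sr)  # [0.25, -1.0, 1.5, -0.75]
```

For the same inputs, the row-wide minima of Eq. 1 are 0.5, 1.5, 0.5 and 0.5, while the partition-local minima here are 0.5, 2.0, 3.0 and 1.5. A local minimum can never be smaller than the row-wide one, so the magnitudes tend to be overestimated, which is consistent with the need for a separate scale factor $S_{SR}$.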

The recent "Split-Row Threshold" algorithm partially recovers the lost min() information by adding a 1-bit global threshold (T) signal with very few additional logic blocks. It has been shown that significant error-performance recovery is achieved, with only 0.07 dB loss from MinSum Normalized [34], [35]. In addition, greater levels of partitioning become practical with less error-performance loss, enabling full-parallel decoder architectures with increased throughput and energy efficiency and reduced area and power [11]. A 10GBASE-T full-parallel decoder with Spn = 16 partitioning, implemented in 65 nm CMOS, operates at 195 MHz at 1.3 V with an average throughput of 92.8 Gbps with early termination enabled. Low-power operation at 0.7 V gives a worst-case throughput of 6.5 Gbps (just above the 10GBASE-T requirement) and an estimated average power of 62 mW, resulting in 9.5 pJ/bit [36]. Decoder area is 4.84 mm<sup>2</sup> with a final post-layout area utilization of 97%. Compared to a MinSum Normalized full-parallel implementation in the same technology and with the same design flow, this decoder has 2.6 times higher logic utilization, is $4.1 \times$ smaller, $3.3 \times$ faster, $4.8 \times$ more energy efficient and has only 0.22 dB performance loss at BER = $10^{-9}$.
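As a check on the figures quoted above, the energy per bit at the 0.7 V operating point follows from dividing the average power by the throughput. The symbols $E_{bit}$, $P_{avg}$ and $T$ are introduced here only for illustration:

$$E_{bit} = \frac{P_{avg}}{T} = \frac{62 \times 10^{-3}\ \text{W}}{6.5 \times 10^{9}\ \text{bit/s}} \approx 9.5 \times 10^{-12}\ \text{J/bit} = 9.5\ \text{pJ/bit}$$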

# V. FUTURE APPLICATIONS AND LDPC DECODER CHALLENGES

Recently, there has been an increased demand for wireless-capable devices, and continued growth in the adoption of wireless technology is expected not only in cellular but also in broadcasting and connectivity applications [37]. Due to the increased demand for high throughput and reliability in future wireless standards, there is a need for more sophisticated signal processing and coding schemes. For example, a rate-compatible LDPC code has been proposed for IEEE 802.16m (under development), which is the next generation of WiMAX for 4G, targeting 100 Mbps mobile and 1 Gbps fixed throughputs [38]. LDPC codes have also been proposed for wireless high definition video transmission (WirelessHD) in the 60 GHz frequency band, which achieves a raw airlink data rate of 2.2 Gbps and a decoding throughput of 1.6 Gbps [39].

In addition, future devices will not only have to be reconfigurable for different environments but must also support different communication standards [40], [41]. This will lead to dramatic increases in computation and required silicon area, even with technology scaling. Thus, future decoder hardware for these standards must not only support different code sizes and rates, but also share hardware between standards to save silicon area.

Also, as shown in Fig. 3, there is not much improvement in throughput and energy dissipation as technology advances. One reason is that code size and the decoding algorithm complexity have, on average, increased for these decoders (Fig. 3 (d)). Thus, technology improvement can mostly negate the effects of increased algorithm/code complexity but cannot drive improvements in the throughput and power. Therefore, to meet the required throughput and power dissipation of future reconfigurable applications, new decoding algorithms and architectures are required.

The advent of digital television (DTV) broadcasting has enabled high definition television services for both stationary and mobile devices. Unfortunately, current DTV standards are not well suited for mobile devices. Although mobile devices do not require high-resolution images due to their small screens, they require complicated signal processing and error correction algorithms with low power dissipation [42]. Current digital TV standards use multi-level coding (also called code concatenation) for very low error floor operation. For the next generation of broadcasting standards, there will be an attempt to remove the code concatenation step through better LDPC code construction. For example, the LDPC code in the new generation satellite broadcasting in China, ABS-S, achieves a frame error rate lower than FER = $10^{-7}$ (the same as DVB-S2 with BCH code concatenation) without concatenation [43].

On the other hand, wireless medical technology has created opportunities for new methods of preventive care using implanted biomedical devices [44]. Long battery life is critical in these devices due to the high cost of replacement, so the device must be designed for minimum energy consumption (10 $\mu$W to 10 mW [45]). Using error correction improves the error performance and thus helps lower the transmit power needed to achieve a certain SNR. However, the power dissipation of the encoder circuits must be ultra low to meet the implant transmit power requirements. It has been shown that the most energy-efficient system choice depends on the distance between the implant device (in vivo) and the receiver (ex vivo) [46]. Usually uncoded systems work well within a short range (< 0.5 m), but for longer distances, especially between 4 m and 10 m, a sophisticated coding scheme is more energy efficient. For example, a proposed rate-1/2 LDPC encoder has 7.2 dB coding gain over an uncoded system and dissipates 1331 $\mu$W in 90 nm CMOS at 0.9 V [46].

LDPC codes have also received a lot of attention in hard disks with magnetic recording channels. Recent advances in magnetic recording are aimed at densities up to 2 Terabits per square inch [47]. To achieve such a high density with high system reliability, powerful coding schemes with efficient hardware implementations are required. The current iterative decoding system in most hard drives uses a multi-level coding scheme which consists of an inner code, such as LDPC, and an outer code, such as Reed-Solomon (RS), to correct the remaining errors. For the next generation of hard disks, it is desired to combine the multi-level coding in the iterative decoder to reduce the latency [48]. It is still up for debate whether LDPC codes can totally replace the RS codes or will be used as inner codes. In order to use LDPC codes, two requirements must be met. First, they should have superior error performance with a very low error floor, down to BER = $10^{-15}$. Second, the decoder hardware should have a high decoding throughput of over 5 Gbps with small circuit area [47], [49].

In addition to these challenges, designing circuits in deep-submicron technologies (e.g. below the 65 nm node) will require following stricter design rules to increase yield and decrease variations, and this will likely limit the current level of freedom of circuit designers and CAD tools [50]. While the effects of new design rules may not be tangible, the problem of global wires is very real. Unlike transistors and local wires, global wires have not reaped the benefits of scaling. Compared to device scaling of approximately 0.7, global wire capacitance (per length) scales at a factor of 0.9, its resistivity is scaling at just over 1.1, and its RC delay is increasing at a rate of close to 2.4 [51]. Placing repeaters (buffers) between wire partitions has only slowed, but not solved, the wire delay increase. Repeaters also have the drawbacks of added power consumption and added vias between metal layers to and from the buffer and wire segments, which makes the routing problem even harder [52].

Therefore, designs with very high interconnect complexity such as LDPC decoders will be more challenging in future implementations. Thus, it is critical to have a technique which reduces design dependencies on low-level optimizations in order to achieve the high throughput and high energy efficiency requirements of future applications.

# VI. CONCLUSION

LDPC codes are appearing in an increasing number of applications, which have even stricter power and throughput constraints than the current state of the art while requiring very good error performance. On the other hand, the benefits of straightforward CMOS scaling have slowed, as the supply voltage, capacitance and wire delay will hardly decrease in future deep-submicron technologies. Therefore, innovative algorithms and architectures are required to keep speed and power within tight future budgets: better code construction methods and efficient decoding algorithms for low error floor performance, reconfigurable and multi-standard decoder architectures, and new decoding algorithms and architectures for ultra-low-power and very-high-throughput applications.

# VII. ACKNOWLEDGMENTS

The authors gratefully acknowledge support from ST Microelectronics, NSF Grant 0430090, CAREER Award 0546907, SRC GRC Grant 1598, CSR Grant 1659, Intel, UC Micro, SEM, and a UCD Faculty Research Grant.

#### REFERENCES

- [1] "T.T.S.I. digital video broadcasting (DVB) second generation framing structure for broadband satellite apps.," http://www.dvb.org.
- [2] "IEEE 802.16e. Air interface for fixed and mobile broadband wireless access systems. IEEE P802.16e/D12 draft, Oct. 2005."
- [3] "IEEE 802.11n. Wireless LAN medium access control and physical layer specifications: P802.11n/D3.07, Mar. 2008."
- [4] "IEEE P802.3an, 10GBASE-T task force," http://www.ieee802.org/3/an.
- [5] H. Zhong et al., "Area-efficient min-sum decoder design for high-rate quasi-cyclic low-density parity-check codes in magnetic recording," *IEEE Transactions on Magnetics*, vol. 43, pp. 4117–4122, Dec. 2007.
- [6] D. J. MacKay, "Good error correcting codes based on very sparse matrices," *TIT*, vol. 45, pp. 399–431, Mar. 1999.
- [7] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," *IEEE Transactions on Communications*, vol. 47, pp. 673–680, May 1999.
- [8] J. Chen and M. Fossorier, "Near optimum universal belief propagation based decoding of low-density parity check codes," *IEEE Transactions on Communications*, vol. 50, pp. 406–414, Mar. 2002.
- [9] J. Chen, A. Dholakia, E. Eleftheriou, and M. Fossorier, "Reduced-complexity decoding of LDPC codes," *IEEE Transactions on Communications*, vol. 53, pp. 1288–1299, Aug. 2005.
- [10] A. Blanksby et al., "A 690-mW 1-Gb/s 1024-b, rate 1/2 low-density parity-check code decoder," JSSC, vol. 37, pp. 404–412, Mar. 2002.
- [11] T. Mohsenin et al., "Multi-Split-Row Threshold decoding implementations for LDPC codes," in *ISCAS*, May 2009, pp. 2449–2452.
- [12] T. Mohsenin and B. Baas, "High-throughput LDPC decoders using a multiple Split-Row method," in *ICASSP*, 2007, vol. 2, pp. 13–16.
- [13] A. Darabiha et al., "Power reduction techniques for LDPC decoders," JSSC, vol. 43, pp. 1835–1845, Aug. 2008.
- [14] M. Mansour and N.R. Shanbhag, "A 640-Mb/s 2048-bit programmable LDPC decoder chip," JSSC, vol. 41, pp. 684–698, Mar. 2006.
- [15] L. Liu et al., "Sliced message passing: High throughput overlapped decoding of high-rate low density parity-check codes," *TCASI*, vol. 55, pp. 3697 – 3710, Dec. 2008.
- [16] P. Urard, L. Paumier, et al., "A 360mW 105Mb/s DVB-S2 compliant codec based on 64800b LDPC and BCH codes enabling satellite-transmission portable devices," in *ISSCC*, 2008, pp. 310–311.
- [17] C. Liu et al., "An LDPC decoder chip based on self-routing network for IEEE 802.16e applications," *JSSC*, vol. 43, pp. 684–694, Mar. 2008.
- [18] X. Shih, C. Zhan, et al., "An 8.29 mm<sup>2</sup> 52 mW multi-mode LDPC decoder design for mobile WiMAX system in 0.13 CMOS process," *JSSC*, vol. 43, pp. 672–683, Mar. 2008.

- [19] Z. Zhang et al., "A 47 Gb/s LDPC decoder with improved low error rate performance," in VLSI Symposium, June 2009, pp. 22–23.
- [20] M. Chao, J. Wen, et al., "A triple-mode LDPC decoder design for IEEE 802.11n system," in ISCAS, May 2009, pp. 2445–2448.
- [21] R. Lynch, E.M. Kurtas, et al., "The search for a practical iterative detector for magnetic recording," vol. 40, no. 1, pp. 213–218, Jan. 2004.
- [22] A. Darabiha et al., "A 3.3-Gbps bit-serial block-interlaced Min-Sum LDPC decoder in 0.13-um CMOS," in CICC, 2007, pp. 459–462.
- [23] T. Richardson, "Error floors of LDPC codes," in Allerton, Oct. 2003.
- [24] J. Kang et al., "A two-stage iterative decoding of LDPC codes for lowering error floors," in *Globecom*, Dec. 2008, pp. 1–4.
- [25] Z. Zhang, L. Dolecek, et al., "Lowering LDPC error floors by postprocessing," in *Globecom*, Dec. 2008, pp. 1–6.
- [26] M. Fossorier, "Quasi-cyclic low-density parity-check codes from circulant permutation matrices," *TIT*, vol. 50, pp. 1788–1793, Aug. 2004.
- [27] R. M. Tanner et al., "LDPC block and convolutional codes based on circulant matrices," *TIT*, vol. 50, pp. 2966–2984, Dec. 2004.
- [28] Y. Chen and K. Parhi, "Overlapped message passing for quasi-cyclic low-density parity check codes," *IEEE Transactions on Circuits and Systems I*, vol. 51, pp. 1106–1113, June 2004.
- [29] M. Mansour and N. Shanbhag, "Turbo decoder architectures for low-density parity-check codes," in *Globecom*, Nov. 2002, pp. 1383–1388.
- [30] D. Hocevar, "A reduced complexity decoder architecture via layered decoding of LDPC codes," in SiPS, Oct. 2004, pp. 107–112.
- [31] K. K. Gunnam et al., "VLSI architectures for layered decoding for irregular LDPC codes of WiMAX," in ICC, June 2007, pp. 4542–4547.
- [32] Y. Sun and J. R. Cavallaro, "A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G wireless standards," in SOC Conference, Sept. 2008, pp. 367–370.
- [33] T. Mohsenin et al., "Split-Row: A reduced complexity, high throughput LDPC decoder architecture," in *ICCD*, Oct. 2006, pp. 13–16.
- [34] T. Mohsenin, P. Urard, and B. Baas, "A thresholding algorithm for improved Split-Row decoding of LDPC codes," in ACSSC, Oct. 2008, pp. 448–451.
- [35] T. Mohsenin et al., "An improved Split-Row Threshold decoding algorithm for LDPC codes," in *ICC*, June 2009, pp. 1–5.
- [36] T. Mohsenin et al., "A low-complexity message-passing algorithm for reduced routing congestion in LDPC decoders," *TCAS-I, in review.*
- [37] A. Tasic et al., Circuits and Systems for Future Generations of Wireless Communications, Springer, first edition, 2009.
- [38] "Amendment text proposal on rate compatible LDPC-convolutional codes," http://www.ieee802.org/16/tgm/IEEEC802.16m-09/0339.
- [39] F. Mlinarsky, "Wireless HD video: Raising the throughput bar," Wireless Net Designline, Feb 2008.
- [40] F. Naessens et al., "A unified instruction set programmable architecture for multi-standard advanced forward error correction," in *SiPS*, Oct. 2008, pp. 31–36.
- [41] Y. Sun and J.R Cavallaro, "Unified decoder architecture for LDPC/turbo codes," in SiPS, Oct. 2008, pp. 13–18.
- [42] F. Mlinarsky, "China's DTV standard revolutionises broadcast," Global Sources Embedded Design India, Feb 2009.
- [43] S. Yuhai, L. Chunjiang, and Y. Ming, "A fast encoding method of QC-LDPC code used in ABS-S system," in *Pacific-Asia Conference on Circuits, Communications and Systems*, May 2009, pp. 107–110.
- [44] B. Zhen et al., "IEEE body area networks and medical implant communications," in *ICST*, Mar. 2008.
- [45] A. Chandrakasan et al., "Ultralow-power electronics for biomedical applications," *Annual Review of Biomedical Engineering*, pp. 247–274, Apr. 2008.
- [46] Krzysztof Iniewski, VLSI Circuits for Biomedical Applications, Artech-House, 685 Canton Street, Norwood, MA, USA, first edition, 2008.
- [47] K. K. Gunnam et al., "Next generation iterative LDPC solutions for magnetic recording storage," in ACSSC, Oct. 2008, pp. 1148–1152.
- [48] R. Galbraith and T. Oenning, "Iterative detection read channel technology in hard disk drives," *Hitachi white paper*, Nov 2008.
- [49] W. Tan, "Design of inner LDPC codes for magnetic recording channels," *IEEE Transactions on Magnetics*, vol. 44, pp. 217–222, Jan. 2008.
- [50] R. Radojcic et al., "Design for manufacturability for fabless manufacturers," in *IEEE Solid-State Circuits Magazine*, Summer 2009, pp. 24–33.
- [51] ITRS, "International technology roadmap for semiconductors, 2007 update, interconnect section," Online, http://www.itrs.net/reports.html.
- [52] P. Saxena, R. S. Shelar, and S. S. Sapatnekar, *Routing Congestion in VLSI Circuits*, Springer Science, NYC, NY, USA, first edition, 2007.