Peformance of Modified Bit Parallel Multiplier Based Radix-4-Pipe-Lined Single-Path Delay Commutator (R4SDC) FFT

D. Jaya kumar* and E. Logashanmugam**

Abstract: This paper focuses on the development of Fast Fourier Transform (FFT), based on Decimation in frequency domain by using Radix-4 algorithm. Fast Fourier Transform becomes an integral part of any type of digital communication system. A VLSI based high speed Radix-4 FFT algorithm is developed by using pipelined architectures. Radix-4 Single-path Delay Commutator (R4SDC) has been developed. Radix-4 SDC is the significant pipelined FFT architectures, because of adders and multipliers are used efficiently. Radix-4 FFT algorithm has a less number of computational stages than the Radix-2 FFT algorithm. Moreover, Bit Parallel Multiplier (BPM) design has been redesigned for improving the performance of twiddle factor multiplication. Modified architecture of BPM structure has been used only little number of hardware’s to perform the twiddle factor multiplication. Finally, the modified BPM structure has been integrated into R4SDC FFT for improving the speed of the processors. Modified Bit Parallel Multiplier structure based Radix-4 pipelined Single-path Delay Commutator (R4SDC) FFT has been proposed in this paper. Twiddle factor multiplication, which can reduce the number of multiplication operation as well as the number of memory access. The high speed design of R4SDC FFT processor, which is efficient in terms of latency, speed, area utilization, LUTs and power consumption. Generally, the radix-4 FFT is to reduce the hardware utilization than the Radix-2 FFT. The proposed structure has been simulated by using Modelsim 6.3C and synthesized by using Xilinx 10.1 ISE design tool.

Keywords: Fast Fourier Transform (FFT), Single-Path Delay Commutator (SDC) FFT, Bit Parallel Multiplier (BPM), Look Up Tables (LUTs).

1. INTRODUCTION

The FFT processor is an important block in those fields based on orthogonal frequency division multiplexing (OFDM) technology, Wireless LAN and LTE-Advanced. The pipeline FFT which can estimate the FFT in a sequential manner; it achieves real-time application with non-stop processing when the data is continually fed through the processor. This algorithm needs less computation due to its recursive operator named butterfly. This operator performs the computation of complex terms, which includes multiplication of input data by appropriate coefficients. Some different architecture has been proposed, based on different decomposition methods, such as R2MDC, R2SDF, R4SDC and R4SDF. Multi-path architecture can process N number of data inputs concurrently, and they have some restrictions on the number of parallel data-paths, radix and FFT points.

FFT architectures can be divided into two types: Memory-based and pipelined architectures. Memory-based architectures consist of a butterfly unit and few number of memory blocks for providing low cost designs. However, it is very difficult for them to achieve real-time processing at low clock frequency.

* Research Scholar, St.Peters University, Department of ECE, Chennai, Tamilnadu, India, E-mail: jayakumarvlsi16@gmail.com.
** Professor & head, Department of ECE, Sathyabama University, Sholinganallur, Chennai, Tamilnadu, India.
Alternatively, a pipelined architecture consists of multiple stages to give higher throughput at the cost of more hardware. The pipelined FFT architectures can attain a high throughput and low delay which are suitable for real time applications.

A pipelined architecture for a Radix-4 FFT is to increase the throughput without increasing the area. Radix-4 FFT is well suited for various kinds of applications as the hardware utilization of radix-4 is lower than that of the radix-2 FFT architecture. In this paper, Modified BPM based R4SDC FFT has been designed. Radix-4 FFT has more efficiency than the Radix-2 FFT structure in terms of less computational path. By modifying the BPM structure to improve the multiplication of twiddle factor. Further, Modified architecture of BPM structures are incorporated into R4SDC FFT for improving the performance of processor.

2. RELATED WORK

The design of optimized radix-2 and radix-4 butterflies from FFT with decimation in time [1]. The committed and varied structures for 16 point radix-2 and radix-4 DIT butterflies are implemented, where the main aim is to minimize the number of arithmetic operators in order to produce power efficient structures. To improve the radix-2 butterfly elements by reducing one adder and one subtractor in the structure. The optimized radix-2 butterfly is used to reduce the number of real multipliers in the radix-4 butterfly. In [2], the radix-4 based architecture has four butterfly elements, and the pipelined structure is optimized to manage the processing speed and area. The pipeline performance was improved by the repartition of the multiplier and butterfly operation. A novel multiplier-less pipelined FFT processor [3] suitable for shorter FFT’s. The minimum number of shifter and adder operations is used in multiplier less architecture for analyzing the hardware complexity.

In [4] presented a twiddle factor based FFT algorithm with redundant memory access removed. To reduce the frequency of memory access, in FFT algorithm. All the butterfly elements sharing the same twiddle factor will be grouped and computed together. The complexity of twiddle factor memories for pipelined FFTs assuming different architectures in [5]. By using octave symmetry and an address generator, a look up table is advantageous for low resolution memories while for larger resolution memories.

For high speed and low power application, the architecture of FFT has been proposed [6]. R2SDF and R4SDC through folding transformation and register minimization technique is proposed in [7] for reduction in hardware complexity and to enhance the computational performance of FFT. In [8] used the multipath delay Commutator to increase the throughput rate of radix-4 and rdaix-2 FFT calculation by the factor of 2 to 4. By the factor of 2 to 3, delay can be reduced.

3. RADIX-4 FFT STRUCTURE

Implementation of butterfly block in DSP processor requires selection of radix first, several FFT algorithms have been proposed such as radix-2, radix-4, radix-8 and several other higher order radices FFT. A Radix-4 FFT algorithm has \( \log_4 N \) stages. Thus, the 16 point Radix-4 FFT consists of two stages. Each stage has several butterfly operations. Each butterfly consist four inputs and four outputs. The Radix-4 decimation in frequency (DIF) algorithm divides an N-point DFT into four \( N/4 \)-point DFTs. Each of those is divided into \( N/16 \)-point DFTs giving sixteen \( N/16 \) DFTs and so on. The final stage produces a 4-point DFT which is simply a butterfly calculation for a radix-4 FFT. The data flow structure of 16 point Radix-4 FFT is shown in figure 1. The equation of radix-4 FFT is represented as,

\[
Y[k] = \sum_{n=0}^{N-1} y[n] W_N^{nk}, \ 0 \leq k \leq N-1
\]

\[
Y[K] = \sum_{n=0}^{N/4} y(n)W_N^{nk} + \sum_{n=N/4}^{N/2} y(n)W_N^{nk} + \sum_{n=N/2}^{3N/4} y(n)W_N^{nk} + \sum_{n=3N/4}^{N-1} y(n)W_N^{nk}
\]
\[ Y[K] = Y[K] = \sum_{n=0}^{N-1} \left( y(n) + W_{N}^{nk} y(n + N/4) + W_{N}^{nk} y(n + 3N/4) \right) W_{N}^{nk} \]

Compared to Radix-2 FFT, computational stages can be reduced in Radix-4 FFT. Therefore, Radix-4 FFT has been used to boost the performance of OFDM system.

![Data Flow Structure of 16 point Radix-4 FFT](image1)

**4. RADIX-4 SINGLE PATH DELAY COMMUTATOR (R4SDC) FFT**

R4SDC FFT is a radix-4 model of R2SDC. Radix-4 single path delay commutator consists of butterfly units, commutator and complex multiplier with shift registers to give delays. Radix^2 algorithm has the multiplicative complexity as radix-4 and keeping the butterfly architecture of radix-2 FFT. The structure of Single-path delay commutator (SDC) FFT is illustrated in Figure 2.

![Structure of Single-path Delay Commutator](image2)
R4SDC architecture has higher computational efficiency in addition. Bi and Jones introduced the R4SDC FFT, uses an applied architecture to calculate the radix-4 FFT. By using different radices, the key to the algorithm is dividing the FFT into different stages. The R4SDC FFT, which creates a long critical path. In the R4SDC architecture, the multipliers are used upto 75% using the programmable butterfly elements. Using merged delay-Commutator, the memory requirement can be reduced to 2N-2.

5. PROPOSED MODIFIED BPM BASED RADIX-4 PIPELINED SINGLE-PATH DELAY COMMUTATOR (R4SDC) FFT

In this paper, the architecture of proposed modified BPM based 16-point R4SDC FFT is used to reduce the computational path, hardware utilization and power consumption. R4SDC enhances the utilization of the butterfly units by varying the processing elements. However, the memory requirement is increased. R4SDC architecture gives high speed, high throughput rate and low hardware complexity. In the proposed method, R4SDC FFT architecture is proposed for enhancing the architectural performances in terms of VLSI parameters.

Radix-4 Single-path Delay Commutator consists of processing element, commutator, delay elements, twiddle factor and multiplexer. The operation of adder and subtractor has been done in processing element. In this architecture, carry select adder circuit is used for addition operation to improve the speed of the processor. To convert the signal from one form to another form is called commutator. Multiplexer is used to control the signals. Twiddle factor, which is used to reduce the shifter and adder values. The twiddle factor values are stored in the form of LUTs.

The complex multiplier circuit, Read Only Memory has been increased. In order to rectify this problem, modified structure of bit parallel multiplier has been designed. MBPM does not need Read Only Memory to store the values of twiddle factor, instead of using accumulation and shifter based logical functions to perform the twiddle factor multiplication. The architecture of 16 point Radix-4 SDC FFT is shown in figure 2.

Figure 3: Architecture of 16 point Radix-4 Single-path Delay Commutator (R4SDC) FFT

In figure 3, illustrates the traditional method of Bit Parallel Multiplication (BPM) structure of 0.707. The 0.707 fractional values with some input value is denoted as follows,
output = \text{in} \times \frac{\sqrt{2}}{2} = \text{in} \times (2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-8} + 2^{-14})

Further it can be solved as

output = \text{in} \times \frac{\sqrt{2}}{2} = \text{in} \times \left[1 + (1 + 2^{-2})(2^{-6} - 2^{-2})\right]

The twiddle factor multiplication value of 0.707 requires four numbers of shifters and three numbers of adders to perform multiplication. For example, decimal value of 12 is represented in binary value as 1100. While shifting right position to one bit, 0110 (6), i.e., division of two will be obtained.

In Figure 4, the proposed modified Bit parallel multiplier consists of 3 shifters and 2 adders to perform the twiddle factor multiplication. In this research work, the modified BPM based (R4SDC) FFT is to reduce the hardware components. By reducing the twiddle factor values, the hardware complexity can be reduced. The aim of this research work is to reduce the area, delay and power consumption of the FFT processor.

6. RESULT AND DISCUSSION

By using the Verilog HDL, the design of proposed modified BPM based Radix-4 pipelined Single Path Delay Commutator (R4SDC) FFT has been developed. The results of simulation waveform have been done by using ModelSim 6.3C and performances of synthesis results are calculated by Xilinx ISE10.1i design tool. The simulated output of Existing R4SDC FFT and proposed modified BPM based Radix-4 SDC FFT is shown in Figure 5 and Figure 6. The performance of Existing R4SDC FFT and Proposed modified BPM based R4SDC FFT are analyzed and compared in Table 1. The performance Evaluation of existing R4SDC FFT and Proposed modified BPM based R4SDC FFT is graphically illustrated in Figure 7.
Figure 6: Simulation Result of Existing R4SDC FFT

Figure 7: Simulation Result of Proposed Modified BPM based R4SDC FFT
### Table 1

<table>
<thead>
<tr>
<th>Types of parameters</th>
<th>No of occupied Slices</th>
<th>No of LUTs</th>
<th>Delay (ns)</th>
<th>Power (W)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Existing R4SDC FFT</strong></td>
<td>1078</td>
<td>1820</td>
<td>26.816</td>
<td>0.807</td>
</tr>
<tr>
<td><strong>Proposed Modified BPM based 16-point R4SDC FFT</strong></td>
<td>1037</td>
<td>1756</td>
<td>26.184</td>
<td>0.775</td>
</tr>
<tr>
<td><strong>Percentage reduction %</strong></td>
<td>3.8%</td>
<td>3.51%</td>
<td>63.2%</td>
<td>3.96%</td>
</tr>
</tbody>
</table>

![Graph showing performance comparison](image)

**Figure 8: Performance Evaluation of Existing R4SDC and Proposed Modified BPM based R4SDC FFT**

From Table 1, it is clear that the proposed modified BPM based Radix-4 SDC FFT architecture gives 3.8% reduction in number of slice utilization, 3.51% reduction in number of LUTs, 63.2% reduction in delay and 3.96% reduction in power consumption than the existing R4SDC FFT.

### 7. CONCLUSION

In this paper, the architecture of modified BPM structure based 16-point Radix-4 SDC FFT has been proposed through the Very Large Scale Integration (VLSI) design environment. A maximum frequency is 38.191MHZ in Modified BPM based 16-point FFT of R4SDC and used 1037 slices on the Spartan-3. The Existing R4SDC ran at 37.291MHz and used 1078 slices on the Spartan-3. The importance of this research work is to reduce the area, delay and power consumption of proposed modified BPM based Radix-4 SDC FFT. Proposed modified bit parallel multiplier based Radix-4 SDC FFT provides 3.8% reduction in number of slice utilization, 3.51% reduction in number of LUTs, 63.2% reduction in delay and 3.96% reduction in power consumption than the existing Radix-4 SDC FFT. Thus, the proposed design is particularly useful for low power applications such as WLAN etc. For future work, plan to optimize the complex multipliers for more power reduction and FFT will be expanded to big number of FFT computations for OFDM and SDR applications.

### 8. REFERENCES


