# FAST RNS FPL-BASED COMMUNICATIONS RECEIVER DESIGN AND IMPLEMENTATION

J. Ramírez<sup>(1)</sup>, A. García<sup>(1)</sup>, U. Meyer-Baese<sup>(2)</sup> and A. Lloris<sup>(1)</sup>

 <sup>(1)</sup> Dept. of Electronics and Computer Technology, Campus Universitario Fuentenueva, 18071 Granada, Spain
 <sup>(2)</sup> Dept. of Electrical and Computer Engineering, Florida State University, Tallahassee, USA
 {jramirez, agarcia, lloris}@ditec.ugr.es

**Abstract.** Currently, several design barriers inhibit the implementation of high-precision digital signal processing (DSP) systems with field programmable logic (FPL) devices. A new demonstration of the synergy between the residue number system (RNS) and FPL technology is presented in this paper. The quantifiable benefits of this approach are studied in the context of a high-end communications digital receiver. A new RNS-based direct digital synthesizer (DDS) that does not need a scaler circuit is introduced. The programmable decimation FIR filter is based on the arithmetic benefits associated with Galois fields and supports tuning the IF frequency as well as its bandwidth. Results show that the proposed methodology requires fewer resources than classical designs, while the throughput advantage is about 65%.

# 1. Introduction

Digital receivers have revolutionized communication systems offering remarkable benefits when compared to their analog counterparts. During the last decade, Direct Digital Synthesizer (DDS) techniques have become increasingly popular methods in digital receiver designs and many ASIC vendors, such as Graychip [1], Intersil [2] or Pentek [3], are providing semiconductor solutions for digital communication systems. These systems yield significant benefits in performance, density and cost as well as provide high frequency resolution, fast and phase-continuous frequency switching, exceptional linearity and excellent temperature and aging stability.

With the advent of new Field-Programmable Logic (FPL) device families, such as the Altera APEX 20K [4] or the Xilinx Virtex [5], and their increasing speed and density, these devices are becoming an attractive option for the design of radio-frequency digital communication systems. Digital receiver chips perform down conversion, lowpass filtering and decimation of the sampled RF signal. The resulting bandwidth and sample rate reduction makes it possible to perform real-time processing of narrow and wide band radio signals.

Traditional numbering systems are commonly used to build DSP systems with commercially available FPL technology. Although the two's complement (2C) system has been adopted for a wide range of real-time applications including digital communications, image, video and speech processing, multimedia systems, networking, etc, a review of the FPL vendor supplied application notes [6, 7] shows that these devices suffer from weak arithmetic performance when compared to carefully designed standard-cell based ASICs. While FPL vendors champion their technology as a provider of *system-on-a-chip* (SOC) DSP solutions, engineers have historically viewed FPL as a prototyping technology. In order for FPL to begin to compete in areas currently controlled by low-end standard-cell ICs, a means must be found to more efficiently implement DSP objects.

An arithmetic system capable of surmounting these barriers is the residue number system, or RNS [8]. This paper develops a mechanism of achieving synergy within an FPL-defined environment to implement arithmetic intensive DSP solutions. FPL devices are organized in channels (typically 8-bits wide). Within these channels are found short delay propagation paths and dedicated memory blocks with programmable address and data spaces, which are commonly used to synthesize small RAM and ROM functions. Performance rapidly suffers when carry bits and/or data have to propagate across channel boundaries. We call this the channel barrier problem [9]. Existing 2C designs encounter the channel barrier problem whenever precision exceeds the channel width. An alternative design paradigm is advocated in this paper. The advantage is gained by reducing arithmetic to a set of concurrent operations that reside in small wordlength non-communicating channels. The quantifiable benefits of this approach are studied in the context of a design example, an RNS-based digital receiver design. This work will build upon previous works [10, 11, 12] and previous RNS-FPL design experience [13].

# 2. Background

There is emerging evidence that an arithmetic technology, called the RNS [8], can avoid the throughput degradation with the increase in precision and become a custom IC enabling technology. Computer arithmeticians have long held that the RNS offers the best MAC speed-area advantage [14].

In the RNS, numbers are represented in terms of a pairwise relatively prime basis set (*moduli* set) $P = \{m_1, m_2, \dots, m_L\}$. Any number $X \in \mathbb{Z}_M = \{0, 1, \dots, M-1\}$, where $M = m_1 \cdot m_2 \cdot \dots \cdot m_L$, has a unique RNS representation $X \leftrightarrow \{X_1, X_2, \dots, X_L\}$, where $X_l = X \mod m_l$ ($l = 1, 2, \dots, L$). Mapping from the RNS back to the integer domain is defined by the Chinese Remainder Theorem (CRT) [8].

RNS arithmetic is defined by pair-wise modular operations:

$$\begin{aligned}
Z = X \pm Y &\leftrightarrow \left[ \left| X_1 \pm Y_1 \right|_{m_1}, \left| X_2 \pm Y_2 \right|_{m_2}, \dots, \left| X_L \pm Y_L \right|_{m_L} \right] \\
Z = X \times Y &\leftrightarrow \left[ \left| X_1 \times Y_1 \right|_{m_1}, \left| X_2 \times Y_2 \right|_{m_2}, \dots, \left| X_L \times Y_L \right|_{m_L} \right]
\end{aligned} \tag{1}$$

where $|Q|_{m_l}$ denotes $Q \mod m_l$. The individual modular arithmetic operations are typically performed as LUT calls to small memories, the usual core block of today's FPL devices.
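The channel-wise arithmetic of Equation 1 and the CRT reconstruction can be illustrated with a minimal Python sketch. The moduli set `(29, 31, 32)` and the function names are assumptions chosen for illustration only; they are not the configuration used in the paper. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
from math import prod

MODULI = (29, 31, 32)   # illustrative pairwise coprime set (an assumption)
M = prod(MODULI)        # dynamic range M = m_1 * m_2 * ... * m_L

def to_rns(x):
    """Forward conversion: X -> (X mod m_1, ..., X mod m_L)."""
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    """Channel-wise modular addition; no carries cross channel boundaries."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_mul(a, b):
    """Channel-wise modular multiplication."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(residues):
    """CRT reconstruction: X = | sum_l m_hat_l * |X_l / m_hat_l|_{m_l} |_M."""
    total = 0
    for x, m in zip(residues, MODULI):
        m_hat = M // m
        total += m_hat * ((x * pow(m_hat, -1, m)) % m)
    return total % M
```

Results are only meaningful while they stay inside $\mathbb{Z}_M$; a product or sum that exceeds $M-1$ wraps around modulo $M$.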

Index arithmetic [15, 16] constitutes an efficient means for designing high performance, reduced complexity DSP systems. It is based on the mathematical properties associated with Galois fields, denoted GF(*p*), with *p* being a prime. All the non-zero elements in a Galois field can be generated by exponentiating a primitive element, denoted $g_l$. This property can be exploited for multiplication in GF($m_l$) through the use of the well-known isomorphism between the multiplicative group $Q = \{1, 2, ..., m_l-1\}$, with multiplication modulo $m_l$, and the additive group $I = \{0, 1, ..., m_l-2\}$, with addition modulo $m_l-1$. The mapping is given by:

$$q = \Phi_l^{-1}(i) = g_l^i \mod m_l \qquad l = 1, 2, ..., L \tag{2}$$

with $q \in Q$, $i \in I$, and multiplication is based on:

$$|q_1 q_2|_{m_l} = \left| g_l^{|i_1+i_2|_{m_l-1}} \right|_{m_l} \qquad l = 1, 2, ..., L \tag{3}$$

Thus, multiplication of two operands, say  $q_1$  and  $q_2$ , can be performed by adding exponents in a modular sense. The exponents, or indexes,  $i_1$  and  $i_2$ , can be precomputed and stored in a look-up table. Adding the indexes can be performed with a modulo  $m_l$ -1 adder, and the inverse index transformation can be performed again using a LUT.
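As an illustration of Equations 2 and 3, the following Python sketch builds the forward and inverse index tables for a prime modulus and multiplies two non-zero operands by adding their indexes modulo $m_l-1$. The helper names are hypothetical, and the brute-force primitive element search is only meant for the small moduli considered here.

```python
def primitive_root(p):
    """Smallest g whose powers generate every non-zero element of GF(p)."""
    for g in range(2, p):
        if len({pow(g, i, p) for i in range(p - 1)}) == p - 1:
            return g
    raise ValueError("p must be an odd prime")

def index_tables(p):
    """Return (forward, inverse): q -> i and i -> q = g^i mod p."""
    g = primitive_root(p)
    inverse = [pow(g, i, p) for i in range(p - 1)]     # inverse index LUT
    forward = {q: i for i, q in enumerate(inverse)}    # index LUT
    return forward, inverse

def index_mul(q1, q2, p, forward, inverse):
    """Multiply non-zero q1, q2 modulo p with a modulo (p - 1) index adder."""
    return inverse[(forward[q1] + forward[q2]) % (p - 1)]
```

Zero has no index, so an operand equal to zero has to be handled separately; this sketch simply assumes non-zero inputs.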

# 3. Digital receiver RNS Design

In its simplest form, a superheterodyne receiver filters the radio frequency (RF) signal and converts it to a lower intermediate frequency (IF) by mixing with an offset local oscillator, as shown in Figure 1. Many vendors offer digital receiver chips. On the other hand, FPL chips can take advantage of the built-in device resources as well as their low cost and short development time to meet continuously evolving market requirements. However, FPL technology suffers from weak arithmetic capabilities when compared to a well-designed ASIC. An RNS-based digital receiver FPL design able to surmount these inherent technology barriers is presented below. The design consists of an RNS-based DDS and a high-throughput programmable decimation FIR filter.

#### 3.1. Direct Digital Frequency Synthesizer

DDSs, or Numerically Controlled Oscillators (NCOs), are important components in many digital communication systems. Their applications are numerous in down and up converters, demodulators and various types of modulation schemes, including PSK (Phase Shift Keying), FSK (Frequency Shift Keying) and MSK (Minimum Shift Keying) [17]. A common method for building such a system makes use of an integrator and look-up tables (LUTs) that store uniformly spaced samples of cosine and sine waves. Several properties of the DDS design determine its performance. Traditionally, for most practical applications, a quantizer reduces the precision of the phase angle presented to the LUT address port, thus reducing the memory requirements of the system [17]. In addition, for an area-efficient design, quarter wave symmetry is exploited, so the two most significant bits of the quantized phase angle are used to perform quadrant mapping. The signal phase and amplitude resolution are affected by the length and width of the LUT, respectively. In this section, an RNS-based DDS for a digital receiver is presented.

Fig. 1. Digital receiver architecture.

A design of an RNS DDS consisting of an RNS phase accumulator and a number of LUTs storing the residue digits of sine and cosine waves was presented in [10]. The phase accumulator consists of a modulo $M$ adder that increments the count value by $\Delta\theta$ (phase resolution) units each clock cycle. This operation is performed in parallel over the $L$ residue digits, with $\Delta\theta_l = \Delta\theta \mod m_l$ ($l = 1, 2, ..., L$). A clear advantage of this design when compared to traditional structures [17] is that the LUT address space is reduced to the modulus width and the phase accumulation is performed in parallel over low-delay, modular RNS channels. However, most RNS-based DDSs require a complex RNS scaler circuit [10, 11]. Moreover, RNS scaler circuits often eliminate two or more RNS channels of the system, thus limiting the digital receiver dynamic range, since RNS base extension methods introduce excessive hardware complexity. In this paper, a new RNS-based DDS that does not require scaling is proposed.

Figure 2 shows the design of a low-complexity RNS DDS designed for a high performance FPL programmable digital receiver. Notice that an output RNS engine processes the first quadrant LUT outcomes:

$$c_{l}(n) = \left(K \cos\left(\frac{2\pi}{N}\theta(n)\right)\right) \mod m_{l} \qquad s_{l}(n) = \left(K \sin\left(\frac{2\pi}{N}\theta(n)\right)\right) \mod m_{l} \tag{4}$$

to produce the cosine and sine waves given by:

$$\begin{array}{llll}
uv = 00 & \cos_{l}(n) = c_{l}(n) & \sin_{l}(n) = \Phi(s_{l}(n)) & 0 \le \theta < \pi/2 \\
uv = 01 & \cos_{l}(n) = -s_{l}(n) & \sin_{l}(n) = \Phi(c_{l}(n)) & \pi/2 \le \theta < \pi \\
uv = 10 & \cos_{l}(n) = -c_{l}(n) & \sin_{l}(n) = \Phi(-s_{l}(n)) & \pi \le \theta < 3\pi/2 \\
uv = 11 & \cos_{l}(n) = s_{l}(n) & \sin_{l}(n) = \Phi(-c_{l}(n)) & 3\pi/2 \le \theta < 2\pi
\end{array} \tag{5}$$



Fig. 2. RNS-based DDS.

where *K* and *N* are the precision and depth of the LUTs and $\Phi$ is a function that provides the RNS index representation. The proposed design avoids the use of complex RNS scaling hardware and benefits from the sine and cosine symmetry to reduce the LUT address space depth *N* from 10 to 8 bits, thus fitting the built-in memory resources of the target technology [4]. For the phase accumulator, a conventional carry-chain accumulator was selected. The multiplexer, located at the DDS output, selects $c_l(n)$, $s_l(n)$, $-c_l(n)$ or $-s_l(n)$ depending on the value of the 2 MSBs of the phase $\theta$. To compute $-c_l(n)$ and $-s_l(n)$, a two-stage modular subtractor was used according to:

$$\begin{aligned}
-c_{l}(n) &= (-c_{l}(n)) \mod m_{l} = (m_{l} - c_{l}(n)) \mod m_{l} \\
-s_{l}(n) &= (-s_{l}(n)) \mod m_{l} = (m_{l} - s_{l}(n)) \mod m_{l}
\end{aligned} \qquad l = 1, 2, ..., L \tag{6}$$

The DDS generates sine and cosine waves with a frequency given by:

$$f_{\rm out} = \frac{\Delta \theta}{N} f_{\rm CLK} \tag{7}$$

where  $f_{\text{CLK}}$  is the phase accumulator frequency. The frequency resolution  $\Delta f$  of the synthesizer is a function of the clock frequency  $f_{\text{CLK}}$  and the phase accumulator precision  $L_{\text{ACC}}$ , and can be determined using the following equation:

$$\Delta f = \frac{f_{CLK}}{2^{L_{ACC}}} \tag{8}$$

Finally, the sine and cosine waves $\sin_l(n)$ and $\cos_l(n)$ are the index representations of the waveforms, as shown in Equation 5 and Figure 2.
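The quadrant mapping of Equation 5, the modular negation of Equation 6 and the output frequency of Equation 7 can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the amplitude scale `K = 127`, the rounding of the LUT contents and all function names are hypothetical, and the final index mapping $\Phi$ is omitted, so the functions return plain residues rather than indexes.

```python
import math

N_BITS = 10                # phase resolution in bits (as in the paper)
N = 1 << N_BITS            # phase points per period
QUARTER = N // 4           # first-quadrant LUT depth: 2^8 entries
K = 127                    # amplitude scale (illustrative assumption)

def quarter_luts(m):
    """First-quadrant cosine and sine residues modulo m (Equation 4)."""
    c = [round(K * math.cos(2 * math.pi * t / N)) % m for t in range(QUARTER)]
    s = [round(K * math.sin(2 * math.pi * t / N)) % m for t in range(QUARTER)]
    return c, s

def dds_sample(theta, m, luts):
    """Return (cos, sin) residues modulo m for an N_BITS-wide phase theta."""
    c_lut, s_lut = luts
    uv = (theta >> (N_BITS - 2)) & 0b11      # two MSBs select the quadrant
    addr = theta & (QUARTER - 1)             # remaining bits address the LUTs
    c, s = c_lut[addr], s_lut[addr]

    def neg(r):
        return (m - r) % m                   # Equation 6

    if uv == 0b00:
        return c, s
    if uv == 0b01:
        return neg(s), c
    if uv == 0b10:
        return neg(c), neg(s)
    return s, neg(c)

def accumulate(theta, dtheta):
    """Phase accumulator: add the phase increment and wrap modulo N."""
    return (theta + dtheta) & (N - 1)

def output_frequency(dtheta, f_clk):
    """Equation 7: f_out = (dtheta / N) * f_clk."""
    return dtheta / N * f_clk
```

Only the 256-entry first-quadrant tables are stored per channel; the other three quadrants are derived with two modular negations and a multiplexer, which mirrors the structure described above.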

#### 3.2. Programmable Decimation FIR filter

The digital receiver is intended to work externally in a 2C format. The programmable decimation FIR filter consists of *L* index-based parallel RNS channels and a final RNS-to-2C output converter. The main parameters (number of taps, input and output precisions) of the FIR filter are programmable. The filter has a COEFF/DATA input that is used to load the filter bank coefficients serially.

Fig. 3. RNS-based programmable decimation filter design.

To convert the received RF signal to the index domain, a special design of a 2C-to-index converter was used. A *B*-bit 2C input RF signal $x(n)$ is assumed. The input converter maps the *B*-bit input signal $x(n)$ (or the sequentially loaded coefficients) to an index RNS representation. Its design decomposes the input word into *p* *b*-bit blocks, $x_0, x_1, ..., x_{p-1}$, and uses *p*-1 look-up tables (LUTs) storing the functions $\Phi_k(x_k) = (2^{n_l-1+(k-1)b} x_k) \mod m_l$ ($k = 1, 2, ..., p-2$), where $n_l = \lceil \log_2(m_l) \rceil$, and $\Phi_{p-1}(x_{p-1}) = (2^{B-q} x_{p-1}) \mod m_l$, with *q* being the size of the signed $x_{p-1}$ block. Finally, a modular adder tree followed by a LUT provides the index representation modulo $m_l$ ($l = 1, 2, ..., L$), denoted by $x_l(n)$. Notice that a LUT is saved since $x_0$ can be selected as an $(n_l-1)$-bit block and routed directly to the modular adder tree.
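The block decomposition can be sketched in Python as follows. This is a hedged illustration of the residue computation only: the defaults `B = 16` and `b = 4`, the way the remaining bits are split into blocks and the name `build_converter` are assumptions, and the final residue-to-index LUT is left out.

```python
def build_converter(m, B=16, b=4):
    """Build a B-bit 2C -> residue (mod m) converter from per-block LUTs."""
    n = (m - 1).bit_length()          # n_l = ceil(log2(m_l))
    low_bits = n - 1                  # x_0 needs no LUT: x_0 < 2^(n_l-1) < m_l
    luts = []                         # (shift, width, table) for x_1 .. x_{p-1}
    shift = low_bits
    while B - shift > b:              # unsigned b-bit blocks x_1 .. x_{p-2}
        table = [(v << shift) % m for v in range(1 << b)]
        luts.append((shift, b, table))
        shift += b
    q = B - shift                     # width of the signed top block x_{p-1}
    top = []
    for v in range(1 << q):
        signed = v - (1 << q) if v >= 1 << (q - 1) else v
        top.append((signed << shift) % m)   # weight 2^(B-q), sign included
    luts.append((shift, q, top))

    def convert(x):
        word = x & ((1 << B) - 1)              # two's complement bit pattern
        acc = word & ((1 << low_bits) - 1)     # x_0 goes straight to the adders
        for s, width, table in luts:
            acc += table[(word >> s) & ((1 << width) - 1)]
        return acc % m                         # stands in for the modular adder tree
    return convert
```

For example, `build_converter(31)` splits a 16-bit word into a 4-bit block $x_0$ and three 4-bit blocks served by LUTs, the last of which absorbs the sign.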

Once the received RF signal is converted to the index domain, the receiver mixes it with the DDS synthesizer signal using an index domain multiplier and routes the mixed signal to the programmable decimation filter.

Figure 3 shows the internal modular processing engine of the proposed programmable decimation FIR filter. A multiplexer enables either loading the coefficients sequentially or distributing the input signal to the index-based multipliers across the register chain. Products are computed by means of a modulo $m_l-1$ adder modified to correctly compute multiplication by zero [18]. Thus, the modified adder produces a "11…11" value when one input is 0, and a zero value is stored in the $2^{n_l} - 1$ LUT address. An inverse index LUT stores the final residue digit given by $\Phi_l^{-1}(i) = g_l^i \mod m_l$. The filter product summation is efficiently implemented by means of an enhanced modulo $m_l$ adder tree consisting of conventional adders (with precision extension) followed by a modulo $m_l$ reduction stage implemented by block decomposition [18]. This method leads to an important reduction in resources (especially for longer filters) and an increase in system performance. A saving in resources is achieved with a binary multi-operand modular adder (MOMA) structure when the filter product summation is computed by means of a conventional adder stage (with precision extension) along with a final reduction stage. Notice that the binary adder tree is optimal for FPL devices, as opposed to carry save adders [19], which are better suited to an ASIC design.

Fig. 4. Complete RNS-based digital receiver design.
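A single RNS channel of the filter can be modeled with the Python sketch below. It is a behavioral illustration under assumptions, not the hardware design: the function names are hypothetical, the primitive element `g` must be supplied by the caller, and the final `% m` stands in for the block-decomposition reduction stage.

```python
def make_channel(m, g):
    """Index-domain FIR arithmetic for one prime modulus m, primitive element g."""
    n = (m - 1).bit_length()
    zero_code = (1 << n) - 1            # "11...11" index reserved for the value 0
    fwd = [zero_code] * m               # residue -> index (index LUT)
    inv = [0] * (1 << n)                # index -> residue; inv[zero_code] stays 0
    for i in range(m - 1):
        q = pow(g, i, m)
        fwd[q] = i
        inv[i] = q
    if len(set(inv[:m - 1])) != m - 1:
        raise ValueError("g is not a primitive element modulo m")

    def index_add(i1, i2):
        """Modulo (m - 1) adder modified to propagate the zero code."""
        if i1 == zero_code or i2 == zero_code:
            return zero_code
        return (i1 + i2) % (m - 1)

    def fir_output(samples, coeffs):
        """Sum of products: plain additions, then one modulo m reduction."""
        acc = 0
        for x, h in zip(samples, coeffs):
            acc += inv[index_add(fwd[x % m], fwd[h % m])]
        return acc % m

    return fir_output
```

Deferring the reduction keeps every adder in the tree a conventional binary adder, at the cost of a few extra bits of precision in the accumulated sum.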

The practical implementation of RNS-based systems encounters a difficulty in the conversion stage. Different solutions have been used to overcome the conversion barrier [18] in FPL-centric designs. In this paper, two solutions were explored for the final index-to-2C output converter. The CRT converter makes use of *L* LUTs storing $\Theta_l(i) = \hat{m}_l(i/\hat{m}_l) \mod m_l$, with $\hat{m}_l = M/m_l$, conventional adders and a modulo $M = m_1 \cdot m_2 \cdot \ldots \cdot m_L$ reduction stage. Notice that a subtractor and a multiplexer are needed to map the modulo *M* output to 2C. A more efficient design is based on the well-known $\varepsilon$-CRT algorithm [20], which maps the *L* residue digits directly to a scaled *W*-bit 2C representation. This converter uses smaller LUTs storing the functions $\hat{\Theta}_l(i) = \lfloor 2^{W+b}/m_l \cdot (i/\hat{m}_l) \mod m_l \rfloor$ and replaces modulo adders by conventional adders. Notice that the $b = \lceil \log_2(L) \rceil$ extra bits are introduced to correct the scaling error, $0 \le \varepsilon < L$, by truncating the final output to *W* bits.
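The scaled conversion can be illustrated with the following Python sketch of an $\varepsilon$-CRT style converter. It is written under assumptions: the tables are addressed by residue digit rather than by index, the names are hypothetical, and $(r/\hat{m}_l) \mod m_l$ is read as multiplication by the modular inverse of $\hat{m}_l$ (computed with the three-argument `pow`, Python 3.8 or later).

```python
from math import prod

def eps_crt_tables(moduli, W):
    """Per-channel LUTs holding floor(2^(W+b) * |r / m_hat|_m / m)."""
    M = prod(moduli)
    b = (len(moduli) - 1).bit_length()       # b = ceil(log2(L)) guard bits
    tables = []
    for m in moduli:
        inv = pow(M // m, -1, m)             # inverse of m_hat modulo m
        tables.append([((1 << (W + b)) * ((r * inv) % m)) // m
                       for r in range(m)])
    return tables, b

def eps_crt(residues, tables, b, W):
    """Approximate floor(2^W * X / M) using conventional additions only."""
    total = sum(table[r] for table, r in zip(tables, residues))
    total &= (1 << (W + b)) - 1              # wrap-around replaces the modulo M stage
    return total >> b                        # drop the guard bits
```

Because each table entry is truncated, the *W*-bit result equals $\lfloor 2^W X / M \rfloor$ or falls one least significant bit below it (modulo $2^W$).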

# 4. Complexity and performance comparisons

Implementations of the proposed communication digital receiver using Altera APEX20K [4] devices were carried out and are presented in the following. The entire RNS-based communication receiver design is shown in Figure 4. Hardware complexity of individual RNS channels was assessed before building the complete systems. The modulus set was selected to ensure the dynamic range of the system, reduce the complexity and maximize the throughput of the system. Prime moduli of up to 5 bits wide as well as a power-of-two modulus of up to 6 bits were found to be the best choice. Thus, the programmable decimation FIR filter was synthesized using only LEs, while ESBs were only required for the DDS sine and cosine LUTs.

The DDS consists of a conventional phase accumulator, 2 LUTs storing the sine and cosine functions and additional logic to exploit quarter wave symmetry. LUTs are mapped directly to 2K-bit ESBs. These blocks can be configured either as $2^7 \times 16$, $2^8 \times 8$, $2^9 \times 4$, $2^{10} \times 2$ or $2^{11} \times 1$. A compromise between hardware requirements and DDS performance is necessary. For good signal phase and amplitude resolutions, larger tables are necessary. However, the number of ESBs then increases, since a single ESB cannot hold the whole word of the output residue digits. In this paper we have selected a hardware-efficient implementation of the DDS with a 10-bit phase resolution and $2^8 \times n_l$ sine and cosine LUTs, with each LUT being mapped on a single ESB. The additional logic to deal with the quarter wave symmetry is built with two modular subtractors and two multiplexers.

The programmable decimation FIR filters are based on the design shown in Figure 3 and replace slow and complex 2C multipliers with highly efficient FPL-optimized index multipliers [18], thus yielding a sustained increase in system performance.

The proposed design was evaluated using 2C classical receiver benchmarks. The design parameters were: *i*) 8-bit RF signal, *ii*) 8-bit wide sine and cosine LUTs, *iii*) the programmable decimation FIR filters accept 16-bit mixed signals and the filter coefficients are assumed to be 16 bits wide. 2C- and RNS-based digital receivers were modeled using VHDL and synthesized using speed grade -1 Altera APEX 20K devices. The models are parametric and can be easily modified if the system requirements change. In addition, both systems allow the filter coefficients to be run-time programmable for maximum flexibility.

|       | 2C DDS | RNS-based DDS: phase acc. | RNS-based DDS: LUT (quarter wave symmetry), $n_l = 3$ | RNS-based DDS: LUT (quarter wave symmetry), $n_l = 4$ | RNS-based DDS: LUT (quarter wave symmetry), $n_l = 5$ |
|-------|--------|---------------------------|------|------|------|
| #LEs  | 94     | 30                        | 36   | 46   | 58   |
| #ESBs | 2      | 30                        | 1    | 1    | 2    |

**Table 1.** DDS LE and ESB requirement comparison.

| N  | W  | 2C FIR: LEs | 2C FIR: F (MHz) | RNS FIR (CRT / ε-CRT): LEs | RNS FIR (CRT / ε-CRT): resource reduction | RNS FIR (CRT / ε-CRT): F (MHz) | RNS FIR (CRT / ε-CRT): speed-up | Modulus set |
|----|----|-------------|-----------------|----------------------------|-------------------------------------------|--------------------------------|---------------------------------|-------------|
| 8  | 34 | 3892        | 91              | 5338 / 4089                | -37% / -5%                                | 130 / 149                      | 43% / 64%                       | 11, 13, 17, 19, 23, 29, 31, 32 |
| 16 | 35 | 7786        | 84              | 9087 / 7365                | -16% / 5%                                 | 128 / 137                      | 52% / 63%                       | 11, 13, 17, 19, 23, 29, 31, 64 |
| 32 | 36 | 15591       | 79              | 16828 / 14331              | -7% / 8%                                  | 119 / 131                      | 51% / 66%                       | 7, 11, 13, 17, 19, 23, 29, 31, 32 |
| 64 | 37 | 31235       | 75              | 32026 / 28173              | -2% / 10%                                 | 116 / 127                      | 55% / 69%                       | 7, 11, 13, 17, 19, 23, 29, 31, 64 |

**Table 2.** Resource reduction and speed-up provided by programmable FIR decimation filters.

Table 1 and Table 2 show the hardware requirements and the maximum operating frequency obtained for the DDS and the programmable decimation FIR filter, respectively. Table 1 quantifies the resources required for a 2C-based DDS and the proposed 2C-RNS merged DDS. The table includes implementation data for RNS moduli of different word widths, since an entire RNS-based digital receiver design needs RNS channels of different word widths. Table 2 compares 2C FIR filters ranging from 8 to 64 taps with the filters proposed in this paper. Input signal and coefficients are 16 bits wide and, for the 2C design, the $16 \times 16$-bit multipliers are designed using five pipeline stages. The table shows the number of taps (N), the system dynamic range (W), the number of LEs, the maximum frequency and the modulus set. Results reveal that the proposed filter has a complexity comparable to or even lower than that of a 2C design, while the performance increase is about 65%.
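As a quick sanity check on the modulus sets listed in Table 2, the Python sketch below verifies that each set is pairwise coprime and that its product $M$ covers the stated dynamic range of $W$ bits. The observation that the listed $W$ values match $31 + \log_2 N$ is an inference from the table rather than a statement made in the paper, and the names used here are hypothetical.

```python
from itertools import combinations
from math import gcd, prod

# Taps N -> (dynamic range W in bits, modulus set), copied from Table 2.
TABLE2 = {
    8:  (34, (11, 13, 17, 19, 23, 29, 31, 32)),
    16: (35, (11, 13, 17, 19, 23, 29, 31, 64)),
    32: (36, (7, 11, 13, 17, 19, 23, 29, 31, 32)),
    64: (37, (7, 11, 13, 17, 19, 23, 29, 31, 64)),
}

def covers(moduli, W):
    """True if the moduli are pairwise coprime and M >= 2^W."""
    coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
    return coprime and prod(moduli) >= 1 << W

def inferred_range(taps):
    """Assumed reading of W: 31 bits for a 16 x 16 product plus log2(taps)."""
    return 31 + taps.bit_length() - 1

for taps, (W, moduli) in TABLE2.items():
    print(taps, W, covers(moduli, W), inferred_range(taps) == W)
```

Adding the 3-bit modulus 7 for the 32- and 64-tap filters, or widening the power-of-two modulus from 32 to 64, extends $M$ just enough to keep pace with the growth of the accumulated sum.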

The hardware penalty introduced by the conversion stages was carefully assessed. A CRT-based converter required between 28% and 7% of the total system resources for filters ranging from 8 to 64 taps. However, if an $\varepsilon$-CRT converter is used, output conversion only requires from 12% to 2% of the LEs in the entire system.

# 5. Conclusions

RNS has been shown as an enabling tool for fast FPL implementation of communication digital receivers. The paper explored building a complete system including the DDS and the programmable bandwidth decimation filter. A new DDS that avoids the problems associated with previous proposals has been presented. Its design exploits the quarter wave symmetry, thus reducing the DDS memory requirements. On the other hand, the programmable decimation filter has shown a reduction in hardware complexity and a throughput improvement over a classical 2C design. Thus, the proposed system introduces clear advantages over previous proposals and meets the performance requirements of modern DSP technology.

#### Acknowledgements

The authors were supported by the Comisión Interministerial de Ciencia y Tecnología (CICYT, Spain) under project PB98-1354. CAD tools and supporting material were provided by Altera Corp., San Jose CA, under the Altera University Program.

### References

- [1] Graychip, Inc., "GC1012A Digital Tuner Data Sheet", http://wwws.ti.com/sc/psheets/slws128/slws128.pdf, Feb. 1998.
- [2] Intersil, Corp., "HSP50306 Digital QPSK Demodulator", http://www.intersil.com/data/FN/FN4/FN4162/FN4162.pdf, 1998.
- [3] Pentek, Inc., "Model 4272 Multiband Digital Receiver", http://www.pentek.com/products/GetDS.cfm/4272.PDF?Filename=ACF405.pdf.
- [4] Altera, Corp., "APEX 20K Programmable Logic Device Family Data Sheet", http://www.altera.com/literature/ds/apex.pdf, Dec. 2001, v. 4.2.
- [5] Xilinx, Inc., "Virtex 2.5V Field Programmable Gate Arrays Data Sheet" http://www.xilinx.com/partinfo/ds003-2.pdf, Jul. 2001, v. 2.6.
- [6] Altera, Corp., "Implementing FIR Filters in FLEX Devices", http://www.altera.com/literature/an/an073.pdf, Feb. 1998, v.1.01.
- [7] Xilinx Inc., "Transposed Form FIR Filters", http://www.xilinx.com/xapp/xapp219.pdf, Oct. 2001, v. 1.2.
- [8] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology, McGraw-Hill, NY, 1967.
- [9] J. Ramírez, A. García, P. G. Fernández, L. Parrilla and A. Lloris, "RNS-FPL Merged Architectures for Orthogonal DWT", *Electronics Letters*, vol. 36, no. 14, pp. 1198-1199, Jul. 2000.
- [10] W. A. Chren, "RNS-Based Enhancements for Direct Digital Frequency Synthesis", *IEEE Transactions on Circuits and Systems II*, vol. 42. no. 8, pp. 516-524, Aug. 1995.
- [11] P. V. A. Mohan, "On RNS-based enhancements for direct digital frequency synthesis", *IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing*, vol. 48, no. 10, pp. 988-990, Oct. 2001.
- [12] W. Namgoong, T. H. Meng, "Direct-Conversion RF Receiver Design", IEEE Transactions on Communications, vol. 49, no. 3, pp. 518-529, Mar. 2001.
- [13] J. Ramírez, A. García, P. G. Fernández, L. Parrilla, A. Lloris, "Analysis of RNS-FPL Synergy for High Throughput DSP Applications: Discrete Wavelet Transform", in *Lecture Notes in Computer Science. Field Programmable Logic: The Roadmap to Reconfigurable Computing*, Springer Verlag, pp. 342-351, 2000.
- [14] M. A. Soderstrand, W. K. Jenkins, G. A. Jullien and F. J. Taylor, *Residue Number System Arithmetic: Modern Applications in Digital Signal Processing*, IEEE Press, 1986.
- [15] G. A. Jullien, "Implementation of Multiplication, Modulo a Prime Number, with Applications to Number Theoretic Transforms", *IEEE Transactions on Computers*, vol. C-29, no. 10, pp. 899-905, Oct. 1980.
- [16] D. Radhakrishnan, Y. Yuan, "Fast and Highly Compact RNS Multipliers", International Journal of Electronics, 70, pp. 281-293, 1991.
- [17] Xilinx, Inc., "Direct Digital Synthesizer (DDS) V2.0", http://www.xilinx.com/ipcenter/catalog/logicore/docs/dds.pdf, Nov. 2000.
- [18] J. Ramírez, U. Meyer-Bäse, "Benchmarks for Programmable FIR Filters Built in RNS-FPL Technology", accepted in 2002 SPIE's 16th Annual International Symposium on Aerospace/Defense Sensing, Simulation, and Controls.
- [19] S. Piestrak, "Design of Residue Generators and Multi-Operand Modular Adders using Carry-Save Adders", Proc. of the 10th IEEE Symposium on Computer Arithmetic, 1991.
- [20] M. Griffin, M. Sousa, F. Taylor, "Efficient Scaling in the Residue Number System", Proc. of the International Conference on Acoustics, Speech and Signal Processing, pp. 1075-1078, 1989.