# A 2D-FFT ALGORITHM ON MESH CONNECTED MULTIPROCESSOR SYSTEMS

#### Hiroaki Kunieda and Kazuhito Itoh

Department of Electrical and Electronic Engineering Tokyo Institute of Technology Tokyo 152, Japan

Abstract: A direct computation algorithm for the two-dimensional fast Fourier transform (2D-FFT) is considered here for implementation on mesh connected multiprocessor arrays of both the 2D-toroidal type and the rectangular type. A hardware algorithm, including data allocation and interprocessor communications, is derived.

A performance comparison between the proposed direct 2D-FFT computation and the conventional one shows that the new algorithm gives a higher speedup under a reasonable assumption on the speeds of operations.

# 1. Introduction

The 2D-FFT is used in most image processing devices, in applications such as pattern recognition, image reconstruction and correction of image distortions. One method for computing the 2D-FFT is to carry out the one-dimensional FFT twice, with respect to the rows and the columns of the two-dimensional data (the indirect 2D-FFT method). The well-known Cooley-Tukey algorithm [1] for the 1D-FFT can be applied. Recently, several studies have been reported on computing the 2D-FFT on multiprocessor systems [2][3]. A shorter computation time can be achieved by parallel processing, using multiprocessors in each stage of the 2D-FFT computations. Although this type of computer is not yet commercially available, much research in the area indicates its potential advantage in various types of applications.

There is an alternative way to compute the 2D-FFT [4][5]: the direct method, which computes the 2D-FFT directly by using 2D-butterfly operations. With the direct method, the number of multiplications is reduced to 3/4 of that of the indirect 2D-FFT algorithm.

In this paper, we investigate new 2D-FFT hardware algorithms based on the direct 2D-FFT method. We specifically consider two kinds of multiprocessor systems. Both systems consist of a mesh connected array of identical Processor Elements (PE's) with minimum storage requirements. The results show that a shorter processing time can be achieved than with the indirect 2D-FFT on the same multiprocessor systems.

# 2. 2D-FFT Algorithm

We consider here an $N \times N = N^2$ point two-dimensional discrete Fourier transform, where $N=2^n$, i.e. $n=\log_2 N$. Let $A_0(i:k)$ and $A_n(u:v)$ be the original and the Fourier transformed two-dimensional data respectively, where $0 \le i,k,u,v \le N-1$. The two-dimensional discrete Fourier transform is defined for all u and v by the equation:

$$\begin{aligned} X(u:v) &= \sum_{k} \Bigl(\sum_{i} A_0(i:k)\, W_N^{iu}\Bigr) W_N^{kv} \quad \text{for } 0 \le u,v \le N-1, \\ W_N &= \exp(-j2\pi/N) \end{aligned} \tag{1}$$

In this equation, each sum is independent of the other. Therefore they can be computed one after the other using one-dimensional FFT techniques.
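The separability of the two sums can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and a naive 1D DFT stands in for a 1D FFT so that the example stays short.

```python
import cmath

def dft_1d(x):
    """Naive O(N^2) 1D DFT, used here as a stand-in for a 1D FFT."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[i] * w ** (i * u) for i in range(n)) for u in range(n)]

def dft_2d_double_sum(a):
    """Equation (1) evaluated literally as a double sum over i and k."""
    n = len(a)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[sum(a[i][k] * w ** (i * u) * w ** (k * v)
                 for i in range(n) for k in range(n))
             for v in range(n)] for u in range(n)]

def dft_2d_row_column(a):
    """Indirect method: inner sum over i first, then outer sum over k."""
    n = len(a)
    # cols[k][u] = sum_i a[i][k] * W^(i*u): one 1D transform per fixed k
    cols = [dft_1d([a[i][k] for i in range(n)]) for k in range(n)]
    # out[u][v] = sum_k cols[k][u] * W^(k*v): one 1D transform per fixed u
    return [dft_1d([cols[k][u] for k in range(n)]) for u in range(n)]
```

Both functions return the same array up to rounding error, which is the property the indirect method relies on.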

A direct 2D-FFT algorithm was reported by G.E. Rivard [4][5]. The algorithm is an extension of the 1D-FFT algorithm. It consists of n stages, each with $N^2$ two-dimensional data inputs and outputs, and in each stage $N^2/4$ two-dimensional butterfly operations are carried out.

We express all indices in the form

$$i = i_{n-1} 2^{n-1} + \cdots + i_1 2 + i_0$$

where the $i_k$ are equal to 0 or 1 and are the contents of the respective bit positions in the binary representation of i. All arrays will now be written as functions of the bits of their indices. The k-th stage 2D-butterfly operation works as follows, where $A_{k-1}(i\,;\,j)$ and $A_k(i\,;\,j)$ are the input and output data of the k-th stage respectively.

$$\begin{aligned} A_k(i_{n-1},\dots,i_{n-k},\dots,i_0;\ j_{n-1},\dots,j_{n-k},\dots,j_0) ={}& A_{k-1}(i_{n-1},\dots,0,\dots,i_0;\ j_{n-1},\dots,0,\dots,j_0) \\ &+ A_{k-1}(i_{n-1},\dots,1,\dots,i_0;\ j_{n-1},\dots,0,\dots,j_0)\, B_k\, (-1)^{i_{n-k}} \\ &+ A_{k-1}(i_{n-1},\dots,0,\dots,i_0;\ j_{n-1},\dots,1,\dots,j_0)\, C_k\, (-1)^{j_{n-k}} \\ &+ A_{k-1}(i_{n-1},\dots,1,\dots,i_0;\ j_{n-1},\dots,1,\dots,j_0)\, B_k C_k\, (-1)^{i_{n-k}} (-1)^{j_{n-k}} \end{aligned} \tag{2}$$

$$\text{where}\quad B_k = W_N^{\,2^{n-k}(i_{n-1} + 2 i_{n-2} + \cdots + 2^{k-2} i_{n-k+1})}, \qquad C_k = W_N^{\,2^{n-k}(j_{n-1} + 2 j_{n-2} + \cdots + 2^{k-2} j_{n-k+1})}, \qquad B_1 = C_1 = 1,$$

for $0 \le i, j \le N-1$, i.e. $i_k, j_k = 0$ or $1$.

The butterfly operations in the 1st stage are carried out among the four data $A_0(i:j)$ whose (n-1)th bits of the i and j indices are different. In the second stage, they are carried out among data $A_1(i:j)$ with different (n-2)th bits of the i and j indices. In the last stage, the operations are performed among data $A_{n-1}(i:j)$ with different 0th bits of the i and j indices.

Each butterfly operation in each stage consists of additions or subtractions of the four terms in equation (2). Since the four terms in the four equations (2) for the $A_k(i:j)$ that differ only in $i_{n-k}$ and $j_{n-k}$ are the same in value up to sign, it is efficient to perform these four equations as one set of operations. If the three products are calculated once for each set of four equations, the execution time for each stage is proportional to $3(N^2/4)$ complex multiplication times and $8(N^2/4)$ addition times.

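After the n-th stage, the transformed data are obtained in bit-reversed index order:

$$X(i_{n-1},\dots,i_0 : j_{n-1},\dots,j_0) = A_n(i_0,\dots,i_{n-1} : j_0,\dots,j_{n-1}) \tag{3}$$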
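The stage structure described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' program: the function names are hypothetical, and the twiddle exponents are an assumption obtained by expanding equation (1) bit by bit, namely $2^{n-k}(i_{n-1} + 2i_{n-2} + \cdots + 2^{k-2}i_{n-k+1})$ for the first index and likewise for the second. Each 2D butterfly uses three complex products and eight additions, and the final step applies the bit reversal of equation (3).

```python
import cmath

def bit_reverse(x, n):
    """Reverse the n-bit binary representation of x."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def twiddle_exponent(idx, n, k):
    """Assumed exponent: 2^(n-k) * sum_{q=1}^{k-1} idx_{n-q} * 2^(q-1)."""
    e = 0
    for q in range(1, k):
        e += ((idx >> (n - q)) & 1) << (q - 1)
    return e << (n - k)

def fft2_direct(a):
    """Direct 2D FFT of an N x N list of lists, N a power of two."""
    size = len(a)
    n = size.bit_length() - 1
    assert size == 1 << n
    w = cmath.exp(-2j * cmath.pi / size)
    cur = [row[:] for row in a]
    for k in range(1, n + 1):
        bit = 1 << (n - k)            # stage k works on bit n-k of i and j
        nxt = [[0j] * size for _ in range(size)]
        for i in range(size):
            if i & bit:
                continue
            ei = twiddle_exponent(i, n, k)
            for j in range(size):
                if j & bit:
                    continue
                ej = twiddle_exponent(j, n, k)
                a00 = cur[i][j]
                pb = w ** ei * cur[i | bit][j]                  # product 1
                pc = w ** ej * cur[i][j | bit]                  # product 2
                pbc = w ** (ei + ej) * cur[i | bit][j | bit]    # product 3
                s0, s1 = a00 + pb, a00 - pb                     # 8 additions
                t0, t1 = pc + pbc, pc - pbc
                nxt[i][j] = s0 + t0
                nxt[i][j | bit] = s0 - t0
                nxt[i | bit][j] = s1 + t1
                nxt[i | bit][j | bit] = s1 - t1
        cur = nxt
    # Bit-reversed read-out, as in equation (3)
    return [[cur[bit_reverse(u, n)][bit_reverse(v, n)] for v in range(size)]
            for u in range(size)]

def dft2_naive(a):
    """Reference: equation (1) as a literal double sum."""
    n = len(a)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[sum(a[i][k] * w ** (i * u + k * v)
                 for i in range(n) for k in range(n))
             for v in range(n)] for u in range(n)]
```

Comparing `fft2_direct` with `dft2_naive` on a small input is a quick way to check the assumed twiddle factors and the bit-reversed read-out.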

# 3. Mesh connected multiprocessor array

We consider here a mesh connected multiprocessor array, because we think a mesh connected array is suitable for 2D signal processing. Fig.1 and Fig.2 show examples with 16 PE's of the two types of arrays which we have chosen among the various structures of multiprocessor systems. Fig.1 is called a 2D toroidal array, in which every PE has an equivalent position. Fig.2 is called a simple rectangular array. A rectangular array has the advantage that it has data transfer paths only between adjacent PE's, which will be suitable for a future one-chip VLSI implementation.

The mesh connected arrays with P PE's, where $\sqrt{P}=2^m$, are arranged in a $\sqrt{P} \times \sqrt{P}$ square matrix. Each PE is connected to its four nearest neighbors in a two-dimensional grid. If each PE is numbered PE(i,j) according to its geometrical position in the two-dimensional grid, where $i,j=0,\dots,\sqrt{P}-1$, then PE(i,j) is connected to the four PE's PE(i,j-1), PE(i,j+1), PE(i-1,j) and PE(i+1,j), with indices taken mod $\sqrt{P}$, in a 2D toroidal array, and to the four PE's PE(i,j-1), PE(i,j+1), PE(i-1,j) and PE(i+1,j), where they exist, in a rectangular array.
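The two connection rules differ only in how the array border is treated. A minimal sketch follows; the function name and the parameter `q` (standing for $\sqrt{P}$) are hypothetical, and dropping the missing neighbors of border PE's in the rectangular array is our reading of the text.

```python
def neighbors(i, j, q, toroidal):
    """Neighbors of PE(i, j) in a q x q mesh, in the order
    (i, j-1), (i, j+1), (i-1, j), (i+1, j)."""
    candidates = [(i, j - 1), (i, j + 1), (i - 1, j), (i + 1, j)]
    if toroidal:
        # Wraparound links: indices are taken mod q
        return [(r % q, c % q) for r, c in candidates]
    # Rectangular array: border PE's simply have fewer neighbors
    return [(r, c) for r, c in candidates if 0 <= r < q and 0 <= c < q]
```

For P=16, the corner PE(0,0) has four neighbors in the toroidal array but only two in the rectangular array, which is why distances between PE's are longer in the latter.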

We assume for both types of arrays that each PE has four bidirectional I/O ports which can transfer data to and from the neighbor PE's. Furthermore, we assume that each PE can input one datum from one of the four neighbor PE's and at the same time output one datum to one of the four neighbor PE's.

If the number of two-dimensional data is $N^2$, the same amount of data storage capacity is required, so we assume that each PE has a local memory with a capacity of $N^2/P$. We additionally assume that there is a central controller that supervises the interprocessor communications.



Fig. 1 2D toroidal Array (P=16)



Fig. 2 Rectangular Array (P=16)

# 4. 2D-FFT Hardware Algorithm

The input data can be allocated in many ways. We consider here the data allocation in which the input $N \times N$ matrix data $A_0(i:j)$, $0 \le i, j \le N-1$, is partitioned into a $\sqrt{P} \times \sqrt{P}$ array of square submatrices $A_{kl}$, $0 \le k, l \le \sqrt{P}-1$, and each submatrix $A_{kl}$ is allocated to PE(k,l). The local memory of PE(k,l) contains $N^2/P$ input data. This type of data allocation is also suitable for other image processing applications. An example of this data storage is shown in Fig.3 for the case of N=8 and P=4.

Fig.4 shows another type of data allocation for N=8 and P=4. The input $N \times N$ matrix data $A_0(i:j)$ is partitioned into $(N/P) \times N$ submatrices, and each one is allocated to one PE. This type of data allocation has been used for the indirect 2D-FFT algorithm.
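The two allocations can be written as simple index maps. The sketch below is illustrative only: the function names are hypothetical, and numbering the PE's 0 to P-1 in the row-wise case is an assumption, since the text does not fix a numbering.

```python
from math import isqrt

def square_owner(i, j, N, P):
    """PE(k, l) holding A_0(i:j) under the square type allocation (Fig.3)."""
    block = N // isqrt(P)          # side length of one square submatrix
    return (i // block, j // block)

def rowwise_owner(i, j, N, P):
    """PE number holding A_0(i:j) under the row-wise allocation (Fig.4).
    Each PE holds N/P consecutive rows; the 0..P-1 numbering is assumed."""
    return i // (N // P)
```

For N=8 and P=4, `square_owner(5, 2, 8, 4)` gives PE(1,0) and `rowwise_owner(5, 2, 8, 4)` gives PE 2; in both allocations every PE holds $N^2/P = 16$ data.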

Starting from this initial data allocation, the first several stages of the FFT computations require data transfers. According to our investigation, it is much more efficient first to exchange all the data required to proceed through the maximum number of stages of FFT computations without interprocessor communications, and then to start the FFT computations. When the FFT computations can no longer continue without data transfers, the same procedure is repeated, that is, the data allocation is changed and the FFT computations continue. The following 2D-FFT hardware algorithm is derived for computing the direct 2D-FFT. This algorithm is valid for $n \geq 2m$, i.e. $N \geq P$.

[2D-FFT hardware algorithm]

(1) Bit reverse data transfers

(2) 1st to (n-m)th stage of FFT computations

(3) Bit reverse data transfers

(4) (n-m+1)th to nth stage of FFT computations

|     | j=0 | j=1 | j=2 | j=3 |   | j=4 | j=5 | j=6 | j=7 |
|-----|-----|-----|-----|-----|---|-----|-----|-----|-----|
| i=0 | 00 | 01 | 02 | 03 |   | 04 | 05 | 06 | 07 |
| i=1 | 10 | 11 | 12 | 13 |   | 14 | 15 | 16 | 17 |
| i=2 | 20 | 21 | 22 | 23 |   | 24 | 25 | 26 | 27 |
| i=3 | 30 | 31 | 32 | 33 |   | 34 | 35 | 36 | 37 |
|     |    |    |    |    |   |    |    |    |    |
| i=4 | 40 | 41 | 42 | 43 |   | 44 | 45 | 46 | 47 |
| i=5 | 50 | 51 | 52 | 53 |   | 54 | 55 | 56 | 57 |
| i=6 | 60 | 61 | 62 | 63 |   | 64 | 65 | 66 | 67 |
| i=7 | 70 | 71 | 72 | 73 |   | 74 | 75 | 76 | 77 |

Fig.3 Square type data allocation (N=8, P=4). Entry ij denotes $A_0(i:j)$; each 4 x 4 block is stored in one PE.

|     | j=0 | j=1 | j=2 | j=3 | j=4 | j=5 | j=6 | j=7 |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| i=0 | 00 | 01 | 02 | 03 | 04 | 05 | 06 | 07 |
| i=1 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|     |    |    |    |    |    |    |    |    |
| i=2 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| i=3 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
|     |    |    |    |    |    |    |    |    |
| i=4 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| i=5 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|     |    |    |    |    |    |    |    |    |
| i=6 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 |
| i=7 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 |

Fig. 4 Row-wise data allocation (N=8, P=4). Entry ij denotes $A_0(i:j)$; each group of N/P = 2 rows is stored in one PE.

(1) Bit reverse data transfers

The data $A_0(i_{n-1},\dots,i_0 : j_{n-1},\dots,j_0)$ are transferred from $PE(i_{n-1},\dots,i_{n-m} : j_{n-1},\dots,j_{n-m})$ to $PE(i_0,i_1,\dots,i_{m-1} : j_0,j_1,\dots,j_{m-1})$ for $0\le i, j\le N-1$. As the result of these operations, the data $A_0(i_{n-1},\dots,i_0 : j_{n-1},\dots,j_0)$ with the same $i_{m-1},\dots,i_0$ and $j_{m-1},\dots,j_0$, which differ only in the bits $i_{n-1},i_{n-2},\dots,i_m$ and $j_{n-1},j_{n-2},\dots,j_m$ processed in the first n-m stages, are stored in the same PE.

(2) 1st to (n-m)th stage of FFT computations

These stages are carried out without interprocessor communications. The output data $A_{n-m}(i_{n-1},\dots,i_0 : j_{n-1},\dots,j_0)$ of the (n-m)th stage are stored in $PE(i_0,i_1,\dots,i_{m-1} : j_0,j_1,\dots,j_{m-1})$.

(3) Bit reverse data transfers

As the result of the bit reverse data transfers, the data $A_{n-m}(i_{n-1},\dots,i_0 : j_{n-1},\dots,j_0)$ are stored in $PE(i_{n-1},i_{n-2},\dots,i_{n-m} : j_{n-1},j_{n-2},\dots,j_{n-m})$. The data $A_{n-m}(i_{n-1},\dots,i_0 : j_{n-1},\dots,j_0)$ with the same $i_{n-1},i_{n-2},\dots,i_{n-m}$ and $j_{n-1},j_{n-2},\dots,j_{n-m}$, which are calculated together in the further stages, are stored in the same PE.

(4) (n-m+1)th to nth stage of FFT computations

The FFT computation can be continued without interprocessor communications until all n stages are complete.

# 5. Data Transfer

The data transfer from PE(i,j) to PE(k,l) consists of several transfers between two neighbors, the number of which is proportional to the distance between the source and destination PE's. Since a 2D-toroidal array has additional connections which make these distances shorter, fewer data transfers are required for the bit reverse data transfers in a 2D-toroidal array than in a rectangular array.

Data X(i:j) of different $i_{n-m-1}, \dots, i_m$ and $j_{n-m-1}, \dots, j_m$ are stored in $PE(i_{n-1}, \dots, i_{n-m}; j_{n-1}, \dots, j_{n-m})$ and are transferred to $PE(i_0, \dots, i_{m-1}; j_0, \dots, j_{m-1})$. Therefore, $2^{n-2m} \times 2^{n-2m}=(N/P)^2$ data have the same source and the same destination PE. In addition, each PE distributes an equal number of data ($=N^2/P^2$) to all PE's including itself. In this sense, the bit reverse data transfers are equivalent to the transfers needed for the transpose of matrices stored row-wise on a mesh-connected multiprocessor array.
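The counting argument can be checked by enumerating all data indices. In this sketch the function names are hypothetical, and the destination PE number is formed by reading the low m bits of each index in reversed order, which is our reading of the notation $PE(i_0,\dots,i_{m-1} : j_0,\dots,j_{m-1})$.

```python
from collections import Counter

def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def transfer_counts(n, m):
    """Number of data per (source PE, destination PE) pair
    for N = 2**n and sqrt(P) = 2**m, with n >= 2m."""
    size = 1 << n
    low_mask = (1 << m) - 1
    counts = Counter()
    for i in range(size):
        for j in range(size):
            src = (i >> (n - m), j >> (n - m))          # top m bits
            dst = (bit_reverse(i & low_mask, m),        # low m bits, reversed
                   bit_reverse(j & low_mask, m))
            counts[(src, dst)] += 1
    return counts
```

For n=5 and m=2 (N=32, P=16), every one of the $P^2 = 256$ source and destination pairs carries exactly $(N/P)^2 = 4$ data, so the traffic pattern is a uniform all-to-all exchange.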

5.1 2D toroidal array

An optimum algorithm for this data transfer problem on a 2D toroidal array has already been derived [2]. If $\tau$ represents the unit transfer time between any two neighbors, the bit reverse data transfers on a 2D toroidal array require the transfer time

$$T_{\text{lowest1}} = \frac{N^2}{2\sqrt{P}} \tau \tag{4}$$

which is independent of algorithms.

The derived algorithm is optimum because its transfer time achieves the lowest bound $T_{\text{lowest1}}$.

5.2 Rectangular array

When one datum is transferred from PE(i,j) to PE(k,l) along a shortest path in a rectangular array, the required number of data transfers between two neighbors is determined by the distance between the PE's, that is, $|i-k|+|j-l|$. Therefore, the total number of data transfers between two neighbor PE's is obtained by

$$S_{a} = \sum_{i=0}^{\sqrt{P}-1} \sum_{j=0}^{\sqrt{P}-1} \sum_{k=0}^{\sqrt{P}-1} \sum_{l=0}^{\sqrt{P}-1} (N^2/P^2) (|i-k|+|j-l|) = \frac{2N^2}{3\sqrt{P}} (P-1) \tag{5}$$

The lowest bound of the transfer time $T_{\text{lowest2}}$ is given as

$$T_{\text{lowest2}} = \frac{2N^2}{3P\sqrt{P}} (P-1)\,\tau \tag{6}$$
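The closed form in equation (5) can be confirmed by brute force for small arrays. The sketch below uses exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction
from math import isqrt

def total_transfers_brute(N, P):
    """S_a of equation (5), summed over all source/destination PE pairs."""
    q = isqrt(P)
    per_pair = Fraction(N * N, P * P)      # (N/P)^2 data per pair
    hops = sum(abs(i - k) + abs(j - l)
               for i in range(q) for j in range(q)
               for k in range(q) for l in range(q))
    return per_pair * hops

def total_transfers_closed(N, P):
    """Closed form 2 N^2 (P - 1) / (3 sqrt(P))."""
    return Fraction(2 * N * N * (P - 1), 3 * isqrt(P))

def lowest_bound_2(N, P):
    """T_lowest2 of equation (6) in units of tau, i.e. S_a / P."""
    return total_transfers_closed(N, P) / P
```

For N=16 and P=16 both functions give $S_a = 640$ neighbor-to-neighbor transfers, so $T_{\text{lowest2}} = 40\tau$.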

[Theorem 1]

There exists no bit reverse data transfer algorithm on a rectangular array which achieves the lowest bound of the transfer time $T_{\text{lowest2}}$.

In this paper, we propose a transfer algorithm on a rectangular array which is optimum among the algorithms that perform row directional transfers and column directional transfers one after the other.

We denote by X(is,js)(id,jd) the $(N/P)^2$ data which are transferred from the source processor PE(is,js) to the destination processor PE(id,jd). X(is,js)(id,jd) is first moved by row directional transfers from PE(is,js) to PE(is,jd) and then moved to PE(id,jd) by column directional transfers. Therefore, in the row directional transfers, each PE must transfer $(N/P)^2\sqrt{P}$ data to each other PE in the same row. After that, each PE distributes the same amount of data to each other PE in the same column. The same algorithm can be applied to both directional transfers, so we only show here an algorithm for row directional transfers. For convenience, we define the clock CLK as the transfer time in which $(N/P)^2\sqrt{P}$ data are transferred from a PE to its neighbor PE, that is, 1 CLK = $(N/P)^2 \sqrt{P}\, \tau$.

We derive the lower bound of the transfer time for row directional transfers, which is independent of algorithms. PE(i,j) must send to PE(i,j+1) the data X(i,js)(id,jd) for $0 \le id \le \sqrt{P}-1$ and $0 \le js \le j < jd \le \sqrt{P}-1$, and must send to PE(i,j-1) the data X(i,js)(id,jd) for $0 \le id \le \sqrt{P}-1$ and $\sqrt{P}-1 \ge js \ge j > jd \ge 0$. Counted in units of $(N/P)^2\sqrt{P}$ data, PE(i,j) must therefore transfer $(j+1)(\sqrt{P}-1-j)$ units to PE(i,j+1) and $(\sqrt{P}-j)j$ units to PE(i,j-1). The total number of data transferred through PE(i,j) to either PE(i,j-1) or PE(i,j+1) is

$$\{(j+1)(\sqrt{P}-1-j)+(\sqrt{P}-j)j\} \cdot (N/P)^2\sqrt{P} \tag{8}$$

The maximum value is attained by PE(i,j) for $j=\sqrt{P}/2-1$ and $j=\sqrt{P}/2$, and is $(P/2-1)\{(N/P)^2\sqrt{P}\}$ data, i.e. (P/2-1) clock periods. Counting the row and the column directional transfers together, this gives the theoretical lower bound of the row and column directional bit reverse data transfers in a rectangular array as

$$T_{\text{lowest3}} = \frac{N^2 (P-2)}{P\sqrt{P}}\,\tau \tag{7}$$
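The per-PE load and the resulting bound are easy to tabulate. In the sketch below the function names and the parameter `q` (standing for $\sqrt{P}$) are hypothetical; loads are counted in clock periods, i.e. in units of $(N/P)^2\sqrt{P}$ data.

```python
def link_load(j, q):
    """Clock periods PE(i, j) needs: right-going plus left-going blocks."""
    right = (j + 1) * (q - 1 - j)   # sources 0..j, destinations j+1..q-1
    left = (q - j) * j              # sources j..q-1, destinations 0..j-1
    return right + left

def busiest_columns(q):
    """Columns j at which the load is maximal, and that maximal load."""
    loads = [link_load(j, q) for j in range(q)]
    peak = max(loads)
    return [j for j, v in enumerate(loads) if v == peak], peak

def lowest_bound_3(N, q):
    """Row plus column bound in units of tau: 2 * peak * (N/P)^2 * sqrt(P)."""
    P = q * q
    _, peak = busiest_columns(q)
    return 2 * peak * N * N * q // (P * P)
```

For $\sqrt{P}=4$ the loads are 3, 7, 7, 3, so the peak of P/2-1 = 7 clock periods occurs in the two middle columns; with N=16 this gives a bound of $56\tau$, which agrees with $N^2(P-2)/(P\sqrt{P})$.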

[Algorithm for row directional transfers]

(1) Set CLK=0.

(2) When CLK is even, PE(i,j) and PE(i,j+1) for $0\le i\le \sqrt{P}-1$, j=2k and $0\le k\le \sqrt{P}/2-1$ exchange $(N/P)^2\sqrt{P}$ data. Set HCLK=CLK/2, increment CLK by 1 and go to (3).

When CLK is odd, PE(i,j) and PE(i,j+1) for $0 \le i \le \sqrt{P}-1$, j=2k+1 and $0 \le k \le \sqrt{P}/2-2$ exchange $(N/P)^2\sqrt{P}$ data. Set HCLK=(CLK-1)/2, increment CLK by 1 and go to (3).

The data to be transferred in the right direction from PE(i,j) to PE(i,j+1) are

$$X(i,jsr)(id,jdr), \quad jsr = j-v, \quad jdr = \sqrt{P}-1-u, \quad 0 \le id \le \sqrt{P}-1$$

for $jsr \le j < jdr < \sqrt{P}$, where u and v are non-negative integers with $0 \le v \le j$ which satisfy HCLK=(j+1)u+v. If the indices do not satisfy $jsr \le j < jdr < \sqrt{P}$, the data transfer is not performed. The data to be transferred in the left direction from PE(i,j+1) to PE(i,j) are

$$X(i,jsl)(id,jdl), \quad jsl = j+1+s, \quad jdl = r, \quad 0 \le id \le \sqrt{P}-1 \tag{9}$$

where r and s are non-negative integers which satisfy HCLK=$(\sqrt{P}-j-1)r+s$ and $0\le s < \sqrt{P}-j-1$. If the indices do not satisfy $jdl \le j < j+1 \le jsl < \sqrt{P}$, the data transfer is not performed.

(3) If CLK=P/2, stop; otherwise go to (2).

PE(i,j) for even j performs row right-directional transfers during $(\sqrt{P}-j-1)(j+1)$ clock periods, only when CLK is even, and row left-directional transfers during $(\sqrt{P}-j)j$ clock periods, only when CLK is odd. PE(i,j) for odd j performs row left-directional transfers during $(\sqrt{P}-j)j$ clock periods, only when CLK is even, and row right-directional transfers during $(\sqrt{P}-j-1)(j+1)$ clock periods, only when CLK is odd. For either odd or even j, PE(i,j) needs $(\sqrt{P}-j-1)(j+1)+(\sqrt{P}-j)j$ clock periods. The row directional data transfers for $\sqrt{P}=4$ are illustrated in Fig.5.

| CLK | PE(i,0) | PE(i,1) | PE(i,2) | PE(i,3) |
|-----|---------|---------|---------|---------|
| 0 | (0 3)=> | <=(1 0) | (2 3)=> | <=(3 0) |
| 1 |         | (1 3)=> | <=(2 0) |         |
| 2 | (0 2)=> | <=(2 0) | (1 3)=> | <=(3 1) |
| 3 |         | (0 3)=> | <=(3 0) |         |
| 4 | (0 1)=> | <=(3 0) | (0 3)=> | <=(3 2) |
| 5 |         | (1 2)=> | <=(2 1) |         |
| 6 |         |         |         |         |
| 7 |         | (0 2)=> | <=(3 1) |         |

Note: (js jd) represents data X(i,js)(id,jd); the arrows show the direction of transfer.
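The schedule can be replayed in software to check that every block reaches its destination column and that no PE is asked to forward a block it has not yet received. This is a sketch under stated assumptions, not the authors' controller: the function names and the parameter `q` (standing for $\sqrt{P}$) are hypothetical, and the index rules jsr = j-v, jdr = q-1-u, jsl = j+1+s, jdl = r are assumptions inferred from the entries of Fig.5 and from the expression for Tc(j) in the Appendix. One block (js, jd) stands for the $(N/P)^2\sqrt{P}$ data X(i,js)(id,jd) over all id.

```python
def row_schedule(q):
    """Map CLK -> list of (sender column, receiver column, (js, jd))."""
    sched = {}
    for clk in range(q * q // 2):
        hclk = clk // 2
        moves = []
        # Even CLK uses links (0,1), (2,3), ...; odd CLK uses (1,2), (3,4), ...
        for j in range(clk % 2, q - 1, 2):
            u, v = divmod(hclk, j + 1)          # HCLK = (j+1)*u + v
            js, jd = j - v, q - 1 - u
            if js <= j < jd < q:                # right-going block
                moves.append((j, j + 1, (js, jd)))
            r, s = divmod(hclk, q - j - 1)      # HCLK = (q-j-1)*r + s
            js, jd = j + 1 + s, r
            if 0 <= jd <= j < js < q:           # left-going block
                moves.append((j + 1, j, (js, jd)))
        sched[clk] = moves
    return sched

def simulate(q):
    """Replay the schedule; True if all blocks arrive and every sender
    already holds the block at the clock in which it must send it."""
    held = [{(js, jd) for jd in range(q) if jd != js} for js in range(q)]
    for clk, moves in sorted(row_schedule(q).items()):
        if any(block not in held[sender] for sender, _, block in moves):
            return False
        for sender, receiver, block in moves:
            held[sender].remove(block)
            held[receiver].add(block)
    return all(jd == col for col in range(q) for (_, jd) in held[col])
```

For q=4 the replay reproduces the entries of Fig.5, including the idle clock at CLK=6, and `simulate` succeeds for q=4 and q=8.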

[Theorem 2] The above algorithm of row directional transfers for the bit reverse data transfer is optimum among the row-column type transfer algorithms.

# 6. Comparison

The performance comparison is carried out between the conventional indirect 2D-FFT hardware algorithm and the proposed direct 2D-FFT hardware algorithm on both a 2D-toroidal array and a rectangular array.

The times taken by the proposed direct 2D-FFT hardware algorithms are as follows.

(Direct method, 2D-toroidal array)

$$T_{d1} = \frac{3N^2}{4P} (\log_2 N - 2)\, Tm + \frac{2N^2}{P} (\log_2 N)\, Ta + \frac{N^2}{\sqrt{P}} \tau \tag{10}$$

(Direct method, rectangular array)

$$T_{d2} = \frac{3N^2}{4P} (\log_2 N - 2)\, Tm + \frac{2N^2}{P} (\log_2 N)\, Ta + \frac{2N^2}{P\sqrt{P}} (P-2)\,\tau \tag{11}$$

where Tm and Ta are the times taken for a complex multiplication and a complex addition on each PE respectively, and $\tau$ is the time taken for transferring a datum from a PE to a neighbor PE.

To compare the computation time with that of the indirect 2D-FFT, we assume that the same proposed algorithm is used for the data transfers in the case of the indirect 2D-FFT on a rectangular array.

(Indirect method, 2D-toroidal array)

$$T_{i1} = \frac{N^2}{P} (\log_2 N - 2)\, Tm + \frac{2N^2}{P} (\log_2 N)\, Ta + \frac{N^2}{2\sqrt{P}} \tau \tag{12}$$

(Indirect method, rectangular array)

$$T_{i2} = \frac{N^2}{P} (\log_2 N - 2)\, Tm + \frac{2N^2}{P} (\log_2 N)\, Ta + \frac{N^2}{P\sqrt{P}} (P-2)\,\tau \tag{13}$$

(1) Speedup by direct 2D-FFT

Without the loss due to the communication time, a system of P PE's can run P times as fast as a single PE. Therefore, the speedup efficiency $\alpha$, defined as

$$\alpha = \frac{\text{(time taken in a single PE)}}{\text{(time taken in P PE's)} \times P}$$

gives an efficiency for a multiprocessor system with respect to speed. Fig.6 shows the speedup efficiencies against the addition to transfer operation time ratio for N=256 data, a 2D-toroidal array of P=16 PE's and several multiplication to addition operation time ratios. For practical addition to transfer operation time ratios greater than 5, the multiplication and addition operations are shown to be dominant in the 2D-FFT hardware algorithm.
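The time formulas for the 2D-toroidal array and the speedup efficiency can be evaluated directly. The sketch below is illustrative: the function names are hypothetical, the coefficients follow equations (10) and (12) as read here, and the single-PE time is assumed to be the direct-method formula with P=1 and no transfer term, which the text does not state explicitly.

```python
from math import log2, sqrt

def t_direct_toroidal(N, P, Tm, Ta, tau):
    """Equation (10): direct method on a 2D-toroidal array."""
    return (3 * N * N / (4 * P) * (log2(N) - 2) * Tm
            + 2 * N * N / P * log2(N) * Ta
            + N * N / sqrt(P) * tau)

def t_indirect_toroidal(N, P, Tm, Ta, tau):
    """Equation (12): indirect method on a 2D-toroidal array."""
    return (N * N / P * (log2(N) - 2) * Tm
            + 2 * N * N / P * log2(N) * Ta
            + N * N / (2 * sqrt(P)) * tau)

def speedup_efficiency(N, P, Tm, Ta, tau):
    """(time in a single PE) / ((time in P PE's) * P).
    Assumption: the single PE runs the direct method with no transfers."""
    single = t_direct_toroidal(N, 1, Tm, Ta, 0.0)
    return single / (P * t_direct_toroidal(N, P, Tm, Ta, tau))
```

With no transfer cost the efficiency is exactly 1, and it drops below 1 as $\tau$ grows. Note also that the direct method needs twice the transfer time of the indirect one, so it is faster only when the saved multiplications outweigh the extra transfers.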



Fig. 5 Row directional transfers ( $\sqrt{P}=4$ ). Arrows show the direction of transfer.

Fig. 6 Speedup efficiency in 2D toroidal array for direct 2D-FFT (N=256, P=16)

(2) Comparison to the conventional 2D-FFT

Fig.7 shows the computation time ratios between the proposed direct method and the conventional indirect method on a 2D-toroidal array for N=256 and P=256. In the practical range of Ta/Tm and Tm/$\tau$, computation time ratios of less than 0.9 indicate that the proposed method is faster than the conventional one.



Fig.7 Computation time ratios between direct and indirect 2D-FFT methods on 2D-toroidal array (N=256, P=256)

(3) Comparison between 2D-toroidal and rectangular array

The bit reverse data transfer takes longer in a rectangular array than in a 2D-toroidal array. Since the required times for multiplications and additions are the same in both types, the computation time ratios between the two systems, shown in Fig.8, depend on the interprocessor communication overhead, which is determined by the addition to transfer operation time ratio and the number of PE's.



Fig.8 Computation time ratios between 2D toroidal and rectangular array for direct 2D-FFT (N=256, Tm/Ta=5)

# 7. Concluding Remarks

A direct 2D-FFT algorithm on multiprocessor systems has been considered for the fast discrete Fourier transformation of two-dimensional arrays. The results are summarized as follows.

(1) Hardware algorithms to implement the direct FFT method on both 2D-toroidal and rectangular arrays were derived. Although the interprocessor communication time increases, the number of multiplications is reduced to 3/4 of that in the conventional indirect 2D-FFT case.

(2) The bit reverse data transfer algorithm on a 2D-toroidal array is optimum with respect to the transfer time.

(3) The bit reverse data transfer algorithm on a rectangular array was investigated. The derived algorithm is shown to be optimum among the algorithms which consist of row and column directional transfers. However, further investigation of a general optimum transfer algorithm is needed.

# References

[1] Cooley, J.W. and Tukey, J.W.: "An Algorithm for the Machine Calculation of Complex Fourier Series", Math. Comput., pp.297-301, April, 1965.

[2] Bhuyan, L.N. and Agrawal, D.P.: "Performance Analysis of FFT Algorithms on Multiprocessor Systems", IEEE Trans. Softw. Eng., Vol.SE-9, No.4, pp.512-521, 1983.

[3] Nakano, H. and Tsuda, T.: "Optimizing Processor Data Transfers in Transpositions of Matrices Stored Row-Wise on Mesh-Connected Parallel Computers", Information Processing Society of Japan, Vol.27, No.3, March, 1986.

[4] Rivard, G.E.: "Direct Fast Fourier Transform of Bivariate Functions", IEEE Trans. ASSP, Vol.ASSP-25, pp.250-252, 1977.

[5] Blahut, R.: "Fast Algorithms for Digital Signal Processing", Addison-Wesley, 1985.

# Appendix

[Proof of Theorem 1]

First, we count, as generously as possible, the number of data transfers through PE(0,0), one of the corner PE's. PE(0,0) may transfer

(1) $(N/P)^2 (P-1)$ data for the data transfers from PE(0,0) to the other (P-1) PE's,

(2) $(N/P)^2 (\sqrt{P}-1)^2$ data for the data transfers from PE(0,j) to PE(i,0), $1 \le i,j \le \sqrt{P}-1$,

(3) $(N/P)^2 (\sqrt{P}-1)^2$ data for the data transfers from PE(i,0) to PE(0,j), $1 \le i,j \le \sqrt{P}-1$.

PE(0,0) need not transfer all the data of (2) and (3), because there are alternative shortest paths which do not pass through PE(0,0). Therefore, the total number S of the data (1)-(3),

$$S = \frac{N^2}{P^2} (3\sqrt{P} - 1) (\sqrt{P} - 1) \tag{A1}$$

will be the maximum possible number of data transfers through PE(0,0).

The total number of data transfers Sa is given in eq.(5). It is easy to show that Sa/P > S for P>4.

This means that even our over-estimation of the data transfers through PE(0,0) is less than Sa/P. However, the bit reverse data transfers need Sa transfers as a whole. Therefore there must be at least one PE through which more than Sa/P data are transferred. The maximum number of data transfers S' > Sa/P through one PE determines the transfer time T. The transfer time of any algorithm is thus proved to be longer than the lower bound:

$$T = S'\tau > (S_a/P)\,\tau = T_{\text{lowest2}}$$

(Q.E.D.)

[Proof of Theorem 2]

The algorithm stops when CLK=P/2, which gives the transfer time $(P/2)(N/P)^2\sqrt{P}\,\tau$. This value differs from the lower bound of (P/2-1) clock periods derived for the row directional transfers. During the clock period P/2-1, there is no data transfer in this algorithm. If we skip this clock period, the row directional transfers are performed in (P/2-1) clock periods. Therefore, the algorithm is proved to be the fastest row-column type transfer algorithm.

Next, we show that the derived row-directional transfer algorithm guarantees the arrival of data before the same data must be sent. Let Tc(j) be the CLK value at which X(is,js)(id,jd) for js<jd is held in PE(is,j) and sent on during the transfers, where $js \le j < jd$. Tc(j) is derived from the algorithm as

$$\begin{aligned} Tc(j) &= 2(\sqrt{P}-1-jd-js)+2(\sqrt{P}-jd)j && \text{for even } j, \\ Tc(j) &= 2(\sqrt{P}-1-jd-js)+2(\sqrt{P}-jd)j+1 && \text{for odd } j. \end{aligned}$$

Since $\sqrt{P}-jd>0$, Tc(j) is a monotone increasing function of j. Furthermore, it is easily proved that Tc(j)<Tc(j+1)<Tc(j+2) for any j. That is to say, the data arrive at PE(is,j) before PE(is,j) needs the same data to transfer to its neighbor PE.

(Q.E.D.)