# Cooley-Tukey FFT algorithm (C)

This program is under development.
is complete, remove the {{develop}} tag.

The Cooley-Tukey FFT algorithm is a popular fast Fourier transform algorithm for rapidly computing the discrete fourier transform of a sampled digital signal. It applies best to signal vectors whose lengths are highly composite, usually a power of 2.

Here we describe a C implementation of Cooley-Tukey. We begin with a straightforward implementation and then describe various performance improvements. Like most practical implementations, we will concentrate on the case where the vector length is a power of 2.

## Complex arithmetic

Although the newest version of C, C99, has library support for complex number datatypes and arithmetic, it is not widely supported. We write simple implementations of a few basic operations we need here, with little explanation; for a more detailed discussion, see Category:Complex numbers.

```<<complex number datatype>>=
typedef struct complex_t {
double re;
double im;
} complex;
<<complex number prototypes>>=
double  complex_magnitude(complex c);
complex complex_sub(complex left, complex right);
complex complex_mult(complex left, complex right);
<<complex number function definitions>>=
complex complex_from_polar(double r, double theta_radians) {
complex result;
return result;
}
double complex_magnitude(complex c) {
return sqrt(c.re*c.re + c.im*c.im);
}
complex complex_add(complex left, complex right) {
complex result;
result.re = left.re + right.re;
result.im = left.im + right.im;
return result;
}
complex complex_sub(complex left, complex right) {
complex result;
result.re = left.re - right.re;
result.im = left.im - right.im;
return result;
}
complex complex_mult(complex left, complex right) {
complex result;
result.re = left.re*right.re - left.im*right.im;
result.im = left.re*right.im + left.im*right.re;
return result;
}
```

The use of `sin()`, `cos()`, and `sqrt()` requires the header math.h:

```<<complex number headers>>=
#include <math.h>
```

## Naïve evaluation of definition

First we discuss the simplest possible DFT solution in order to address some of the issues in a simpler setting. Suppose that our input vector is x and our output vector is X. Then the discrete Fourier transform is defined by the formula:

For each ,

The quantity is most easily described in terms of its polar complex coordinates: it is on the unit circle (distance r=1 from the origin) and forms an angle θ of radians with the positive real axis. This enables us to evaluate the definition as follows:

```<<FFT prototypes>>=
complex* DFT_naive_1(complex* x, int N);
<<Naive DFT>>=
complex* DFT_naive_1(complex* x, int N) {
complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
int k, n;
for(k=0; k<N; k++) {
X[k].re = X[k].im = 0.0;
for(n=0; n<N; n++) {
X[k] = complex_add(X[k], complex_mult(x[n], complex_from_polar(1, -2*PI*n*k/N)));
}
}
return X;
}
```

We require headers for memory management and need to define `PI`:

```<<FFT headers>>=
#include <stdlib.h>
<<FFT constants>>=
#define PI  3.1415926535897932
```

This works well for very small inputs and does not place any requirements on N. At the expense of additional space, we can avoid repeatedly recomputing the factors . Instead, we compute only the N values for and use the fact that eik = 1 to reduce to one of these:

where and .

In code:

```<<FFT prototypes>>=
complex* DFT_naive_2(complex* x, int N);
<<Naive DFT>>=
complex* DFT_naive_2(complex* x, int N) {
complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
complex* Nth_root = (complex*) malloc(sizeof(struct complex_t) * N);
int k, n;
for(k=0; k<N; k++) {
Nth_root[k] = complex_from_polar(1, -2*PI*k/N);
}
for(k=0; k<N; k++) {
X[k].re = X[k].im = 0.0;
for(n=0; n<N; n++) {
X[k] = complex_add(X[k], complex_mult(x[n], Nth_root[(n*k) % N]));
}
}
free(Nth_root);
return X;
}
```

This considerably speeds the algorithm in practice. Unfortunately, even with this change, directly evaluating this sum for each k requires O(N2) time overall, which is prohibitive for large N.

The radix-2 decimation in time algorithm uses a divide-and-conquer approach to improve efficiency. It begins by separating the input vector x into two smaller vectors d and e, with d containing the odd-numbered elements of the original vector and e containing the even-numbered elements. For example:

x = (1, 2, 3, 4, 5, 6)
d = (2, 4, 6)
e = (1, 3, 5)

(Keep in mind that the indexing starts at zero). We then recursively compute the FFTs of d and e, call them D and E. We use them to construct X according to the following formula, which can be shown directly from the definition:

Thus, all we need to do to compute X is to compute D, E, and the N/2 complex roots of unity specified by for (called twiddle factors), then do some basic arithmetic. The additional work after computing D and E is linear, so the analysis is just like that of mergesort, requiring O(NlogN) worst-case time. Following is a straightforward implementation based on recursion using the above formulas:

```<<FFT prototypes>>=
complex* FFT_simple(complex* x, int N /* must be a power of 2 */);
<<Simple FFT>>=
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
complex * d, * e, * D, * E;
int k;
if (N == 1) {
X[0] = x[0];
return X;
}
e = (complex*) malloc(sizeof(struct complex_t) * N/2);
d = (complex*) malloc(sizeof(struct complex_t) * N/2);
for(k = 0; k < N/2; k++) {
e[k] = x[2*k];
d[k] = x[2*k + 1];
}
E = FFT_simple(e, N/2);
D = FFT_simple(d, N/2);
free(e);
free(d);
for(k = 0; k < N/2; k++) {
/* Multiply entries of D by the twiddle factors e^(-2*pi*i/N * k) */
D[k] = complex_mult(complex_from_polar(1, -2.0*PI*k/N), D[k]);
}
for(k = 0; k < N/2; k++) {
X[k + N/2] = complex_sub(E[k], D[k]);
}
free(D);
free(E);
return X;
}
```

Although this works, is simple, and has good asymptotic performance, in practice the large number of allocations above degrade speed and triple memory use, which is particularly harmful on large inputs. Additionally, many of the `complex_from_polar()` calls are redundant, since only Nth roots of unity are used anywhere in the recursion tree. Finally, this implementation has considerable function call overhead and would ideally be replaced by a simpler algorithm on small subvectors.

## Optimization: removing memory allocations 1

One obvious optimization is to remove the allocations for e and d, seeing as these are simple re-arrangements of some elements of x. The algorithms above assume that the input is a contiguous array, but if we relax this assumption so that it will accept inputs of the form `x[0], x[skip], x[skip*2], x[skip*3]...` then e and d are simply subarrays of x with different values of skip:

```complex* FFT_skip(complex* x, int N /* must be a power of 2 */, int skip) {
complex* X = (complex*) malloc(sizeof(struct complex_t) * N);
complex * D, * E;
int k;
if (N == 1) {
X[0] = x[0];
return X;
}
E = FFT_skip(x, N/2, skip*2);
D = FFT_skip(x + skip, N/2, skip*2);
for(k = 0; k < N/2; k++) {
/* Multiply entries of D by the twiddle factors e^(-2*pi*i/N * k) */
D[k] = complex_mult(complex_from_polar(1, -2.0*PI*k/N), D[k]);
}
for(k = 0; k < N/2; k++) {
X[k + N/2] = complex_sub(E[k], D[k]);
}
free(D);
free(E);
return X;
}
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
return FFT_skip(x, N, 1);
}
```

## Optimization: removing memory allocations 2

A second optimization is to remove the allocation of X at the beginning of each iteration. It can be seen that X is not required until e and d (now subarrays of x) are finished with. So we could use a scratch buffer to hold E and D before calculating X, using X as the scratch buffer when calculating E and D:

```void FFT_buffer(complex* x, int N /* must be a power of 2 */, int skip,
complex* X, complex* E) {
complex * D = E + N/2;
int k;
if (N == 1) {
X[0] = x[0];
return;
}
/* for now we can use X as a scratch buffer */
FFT_buffer(x, N/2, skip*2, E, X);
FFT_buffer(x + skip, N/2, skip*2, D, X);
for(k = 0; k < N/2; k++) {
/* Multiply entries of D by the twiddle factors e^(-2*pi*i/N * k) */
D[k] = complex_mult(complex_from_polar(1, -2.0*PI*k/N), D[k]);
}
for(k = 0; k < N/2; k++) {
X[k + N/2] = complex_sub(E[k], D[k]);
}
}
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
complex* out = (complex*) malloc(sizeof(struct complex_t) * N);
complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N);
FFT_buffer(x, N, 1, out, scratch);
free(scratch);
return out;
}
```

## Optimization: pre-calculating twiddle factors

It can be seen that the same twiddle factors are being re-calculated constantly. Firstly, the calculation of E and the calculation of D require exactly the same twiddle factors. Secondly, the calculation of E and D require the even-numbered twiddle factors that the calculation of X requires. Therefore we can use a constant array of twiddle factors with a skip, just like we did for x. Luckily, the skip for this array of twiddle factors is the same as the skip for x, so we do not need to pass around an extra value for it.

Note that although the code below allocates memory and calculates the twiddle factors once per call, in practice these would likely be re-usable over many calls.

```void FFT_calculate(complex* x, int N /* must be a power of 2 */, int skip,
complex* X, complex* D, complex* twiddles) {
complex * E = D + N/2;
int k;
if (N == 1) {
X[0] = x[0];
return;
}
/* for now we can use X as a scratch buffer */
FFT_calculate(x, N/2, skip*2, E, X, twiddles);
FFT_calculate(x + skip, N/2, skip*2, D, X, twiddles);
for(k = 0; k < N/2; k++) {
/* Multiply entries of D by the twiddle factors e^(-2*pi*i/N * k) */
D[k] = complex_mult(twiddles[k*skip], D[k]);
}
for(k = 0; k < N/2; k++) {
X[k + N/2] = complex_sub(E[k], D[k]);
}
}
complex* FFT_get_twiddle_factors(int N) {
complex* twiddles = (complex*) malloc(sizeof(struct complex_t) * N/2);
int k;
for (k = 0; k != N/2; ++k) {
twiddles[k] = complex_from_polar(1.0, -2.0*PI*k/N);
}
return twiddles;
}
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
complex* out = (complex*) malloc(sizeof(struct complex_t) * N);
complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N);
complex* twiddles = FFT_get_twiddle_factors(N);
FFT_calculate(x, N, 1, out, scratch, twiddles);
free(twiddles);
free(scratch);
return out;
}
```

## Optimization: removing recursion

This is a decidedly nontrivial optimization, for not a lot of speed increase. However, it does lend itself well to futher optimization.

To remove recursion we must be able to perform all the parts of each iteration in one pass. Here is a table of how we will arrange the results for each iteration for a DFT of length 8:

 initial X[0][0] X[1][0] X[2][0] X[3][0] X[4][0] X[5][0] X[6][0] X[7][0] re-interpretation E[0][0] E[1][0] E[2][0] E[3][0] D[0][0] D[1][0] D[2][0] D[3][0] calculation of next Xs X[0][0] X[1][0] X[2][0] X[3][0] X[0][1] X[1][1] X[2][1] X[3][1] re-interpretation E[0][0] E[1][0] D[0][0] D[1][0] E[0][1] E[1][1] D[0][1] D[1][1] calculation of next Xs X[0][0] X[0][1] X[1][0] X[1][1] X[0][2] X[0][3] X[1][2] X[1][3] re-interpretation E[0][0] D[0][0] E[0][1] D[0][1] E[0][2] D[0][2] E[0][3] D[0][3] final calculation of Xs X[0][0] X[0][1] X[0][2] X[0][3] X[0][4] X[0][5] X[0][6] X[0][7]

where `X[n]` represents what would have been the output from one call of `FFT_simple` and `E[n]` and `D[n]` represent the intermediate values used by that same call.

 TODO explain this better!
```void FFT_calculate(complex* x, long N /* must be a power of 2 */,
complex* X, complex* scratch, complex* twiddles) {
int k, m, n;
int skip;
bool evenIteration = N & 0x55555555;
complex* E, * D;
complex* Xp, * Xstart;
if (N == 1) {
/* N == 1 is now a special case, not the end of the recursion! */
X[0] = x[0];
return;
}
E = x;
for (n = 1; n < N; n *= 2) {
Xstart = evenIteration? scratch : X;
skip = N/(2 * n);
/* each of D and E is of length n, and each element of each D and E is
separated by 2*skip. The Es begin at E[0] to E[skip - 1] and the Ds
begin at E[skip] to E[2*skip - 1] */
Xp = Xstart;
for(k = 0; k != n; k++) {
for (m = 0; m != skip; ++m) {
D = E + skip;
*D = complex_mult(twiddles[k * skip], *D);
Xp[N/2] = complex_sub(*E, *D);
++Xp;
++E;
}
E += skip;
}
E = Xstart;
evenIteration = !evenIteration;
}
}
complex* FFT_get_twiddle_factors(int N) {
complex* twiddles = (complex*) malloc(sizeof(struct complex_t) * N);
int k;
for (k = 0; k != N; ++k) {
twiddles[k] = complex_from_polar(1.0, -2.0*PI*k/N);
}
return twiddles;
}
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
complex* out = (complex*) malloc(sizeof(struct complex_t) * N);
complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N);
complex* twiddles = FFT_get_twiddle_factors(N);
FFT_calculate(x, N, out, scratch, twiddles);
free(twiddles);
free(scratch);
return out;
}
```

Note that in the arrangement above, the same twiddle factor is being invoked repeatedly in the inner loop. This suggests further optimizations, which, when combined with inlining of the complex arithmetic, produces a worthwhile speed improvement:

```void FFT_calculate(complex* x, long N /* must be a power of 2 */,
complex* X, complex* scratch, complex* twiddles) {
int k, m, n;
int skip;
bool evenIteration = N & 0x55555555;
complex* E;
complex* Xp, * Xp2, * Xstart;
if (N == 1) {
X[0] = x[0];
return;
}
E = x;
for (n = 1; n < N; n *= 2) {
Xstart = evenIteration? scratch : X;
skip = N/(2 * n);
/* each of D and E is of length n, and each element of each D and E is
separated by 2*skip. The Es begin at E[0] to E[skip - 1] and the Ds
begin at E[skip] to E[2*skip - 1] */
Xp = Xstart;
Xp2 = Xstart + N/2;
for(k = 0; k != n; k++) {
double tim = twiddles[k * skip].im;
double tre = twiddles[k * skip].re;
for (m = 0; m != skip; ++m) {
complex* D = E + skip;
/* twiddle *D to get dre and dim */
double dre = D->re * tre - D->im * tim;
double dim = D->re * tim + D->im * tre;
Xp->re = E->re + dre;
Xp->im = E->im + dim;
Xp2->re = E->re - dre;
Xp2->im = E->im - dim;
++Xp;
++Xp2;
++E;
}
E += skip;
}
E = Xstart;
evenIteration = !evenIteration;
}
}
complex* FFT_get_twiddle_factors(int N) {
complex* twiddles = (complex*) malloc(sizeof(struct complex_t) * N);
int k;
for (k = 0; k != N; ++k) {
twiddles[k] = complex_from_polar(1.0, -2.0*PI*k/N);
}
return twiddles;
}
complex* FFT_simple(complex* x, int N /* must be a power of 2 */) {
complex* out = (complex*) malloc(sizeof(struct complex_t) * N);
complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N);
complex* twiddles = FFT_get_twiddle_factors(N);
FFT_calculate(x, N, out, scratch, twiddles);
free(twiddles);
free(scratch);
return out;
}
```

## Files

We create a few separate modules: one for our basic complex number support, one for the FFT implementations, and one for the test driver and `main()`, described in the next section. We include the complex number header in the FFT header for the `complex` type.

```<<complex_simple.h>>=
#ifndef _COMPLEX_SIMPLE_H_
#define _COMPLEX_SIMPLE_H_
complex number datatype
complex number prototypes
#endif /* #ifndef _COMPLEX_SIMPLE_H_ */
<<complex_simple.c>>=
#include "complex_simple.h"
complex number function definitions
```
```<<fft.h>>=
#ifndef _FFT_H_
#define _FFT_H_
#include "complex_simple.h"
FFT prototypes
#endif /* #ifndef _FFT_H_ */
<<fft.c>>=
#include "fft.h"