CooleyTukey FFT algorithm (C)
From LiteratePrograms
This program is under development.
Please help to debug it. When debugging
is complete, remove the {{develop}} tag.
The CooleyTukey FFT algorithm is a popular fast Fourier transform algorithm for rapidly computing the discrete fourier transform of a sampled digital signal. It applies best to signal vectors whose lengths are highly composite, usually a power of 2.
Here we describe a C implementation of CooleyTukey. We begin with a straightforward implementation and then describe various performance improvements. Like most practical implementations, we will concentrate on the case where the vector length is a power of 2.
Complex arithmetic
Although the newest version of C, C99, has library support for complex number datatypes and arithmetic, it is not widely supported. We write simple implementations of a few basic operations we need here, with little explanation; for a more detailed discussion, see Category:Complex numbers.
<<complex number datatype>>= typedef struct complex_t { double re; double im; } complex; <<complex number prototypes>>= complex complex_from_polar(double r, double theta_radians); double complex_magnitude(complex c); complex complex_add(complex left, complex right); complex complex_sub(complex left, complex right); complex complex_mult(complex left, complex right); <<complex number function definitions>>= complex complex_from_polar(double r, double theta_radians) { complex result; result.re = r * cos(theta_radians); result.im = r * sin(theta_radians); return result; } double complex_magnitude(complex c) { return sqrt(c.re*c.re + c.im*c.im); } complex complex_add(complex left, complex right) { complex result; result.re = left.re + right.re; result.im = left.im + right.im; return result; } complex complex_sub(complex left, complex right) { complex result; result.re = left.re  right.re; result.im = left.im  right.im; return result; } complex complex_mult(complex left, complex right) { complex result; result.re = left.re*right.re  left.im*right.im; result.im = left.re*right.im + left.im*right.re; return result; }
The use of sin()
, cos()
, and sqrt()
requires the header math.h:
<<complex number headers>>= #include <math.h>
Naïve evaluation of definition
First we discuss the simplest possible DFT solution in order to address some of the issues in a simpler setting. Suppose that our input vector is x and our output vector is X. Then the discrete Fourier transform is defined by the formula:
 For each ,
The quantity is most easily described in terms of its polar complex coordinates: it is on the unit circle (distance r=1 from the origin) and forms an angle θ of radians with the positive real axis. This enables us to evaluate the definition as follows:
<<FFT prototypes>>= complex* DFT_naive_1(complex* x, int N); <<Naive DFT>>= complex* DFT_naive_1(complex* x, int N) { complex* X = (complex*) malloc(sizeof(struct complex_t) * N); int k, n; for(k=0; k<N; k++) { X[k].re = X[k].im = 0.0; for(n=0; n<N; n++) { X[k] = complex_add(X[k], complex_mult(x[n], complex_from_polar(1, 2*PI*n*k/N))); } } return X; }
We require headers for memory management and need to define PI
:
<<FFT headers>>= #include <stdlib.h> <<FFT constants>>= #define PI 3.1415926535897932
This works well for very small inputs and does not place any requirements on N. At the expense of additional space, we can avoid repeatedly recomputing the factors . Instead, we compute only the N values for and use the fact that e^{2πik} = 1 to reduce to one of these:
 where and .
In code:
<<FFT prototypes>>= complex* DFT_naive_2(complex* x, int N); <<Naive DFT>>= complex* DFT_naive_2(complex* x, int N) { complex* X = (complex*) malloc(sizeof(struct complex_t) * N); complex* Nth_root = (complex*) malloc(sizeof(struct complex_t) * N); int k, n; for(k=0; k<N; k++) { Nth_root[k] = complex_from_polar(1, 2*PI*k/N); } for(k=0; k<N; k++) { X[k].re = X[k].im = 0.0; for(n=0; n<N; n++) { X[k] = complex_add(X[k], complex_mult(x[n], Nth_root[(n*k) % N])); } } free(Nth_root); return X; }
This considerably speeds the algorithm in practice. Unfortunately, even with this change, directly evaluating this sum for each k requires O(N^{2}) time overall, which is prohibitive for large N.
Radix2 decimation in time
The radix2 decimation in time algorithm uses a divideandconquer approach to improve efficiency. It begins by separating the input vector x into two smaller vectors d and e, with d containing the oddnumbered elements of the original vector and e containing the evennumbered elements. For example:
 x = (1, 2, 3, 4, 5, 6)
 d = (2, 4, 6)
 e = (1, 3, 5)
(Keep in mind that the indexing starts at zero). We then recursively compute the FFTs of d and e, call them D and E. We use them to construct X according to the following formula, which can be shown directly from the definition:
Thus, all we need to do to compute X is to compute D, E, and the N/2 complex roots of unity specified by for (called twiddle factors), then do some basic arithmetic. The additional work after computing D and E is linear, so the analysis is just like that of mergesort, requiring O(NlogN) worstcase time. Following is a straightforward implementation based on recursion using the above formulas:
<<FFT prototypes>>= complex* FFT_simple(complex* x, int N /* must be a power of 2 */); <<Simple FFT>>= complex* FFT_simple(complex* x, int N /* must be a power of 2 */) { complex* X = (complex*) malloc(sizeof(struct complex_t) * N); complex * d, * e, * D, * E; int k; if (N == 1) { X[0] = x[0]; return X; } e = (complex*) malloc(sizeof(struct complex_t) * N/2); d = (complex*) malloc(sizeof(struct complex_t) * N/2); for(k = 0; k < N/2; k++) { e[k] = x[2*k]; d[k] = x[2*k + 1]; } E = FFT_simple(e, N/2); D = FFT_simple(d, N/2); free(e); free(d); for(k = 0; k < N/2; k++) { /* Multiply entries of D by the twiddle factors e^(2*pi*i/N * k) */ D[k] = complex_mult(complex_from_polar(1, 2.0*PI*k/N), D[k]); } for(k = 0; k < N/2; k++) { X[k] = complex_add(E[k], D[k]); X[k + N/2] = complex_sub(E[k], D[k]); } free(D); free(E); return X; }
Although this works, is simple, and has good asymptotic performance, in practice the large number of allocations above degrade speed and triple memory use, which is particularly harmful on large inputs. Additionally, many of the complex_from_polar()
calls are redundant, since only Nth roots of unity are used anywhere in the recursion tree. Finally, this implementation has considerable function call overhead and would ideally be replaced by a simpler algorithm on small subvectors.
Optimization: removing memory allocations 1
One obvious optimization is to remove the allocations for e and d, seeing as these are simple rearrangements of some elements of x. The algorithms above assume that the input is a contiguous array, but if we relax this assumption so that it will accept inputs of the form x[0], x[skip], x[skip*2], x[skip*3]...
then e and d are simply subarrays of x with different values of skip:
complex* FFT_skip(complex* x, int N /* must be a power of 2 */, int skip) { complex* X = (complex*) malloc(sizeof(struct complex_t) * N); complex * D, * E; int k; if (N == 1) { X[0] = x[0]; return X; } E = FFT_skip(x, N/2, skip*2); D = FFT_skip(x + skip, N/2, skip*2); for(k = 0; k < N/2; k++) { /* Multiply entries of D by the twiddle factors e^(2*pi*i/N * k) */ D[k] = complex_mult(complex_from_polar(1, 2.0*PI*k/N), D[k]); } for(k = 0; k < N/2; k++) { X[k] = complex_add(E[k], D[k]); X[k + N/2] = complex_sub(E[k], D[k]); } free(D); free(E); return X; } complex* FFT_simple(complex* x, int N /* must be a power of 2 */) { return FFT_skip(x, N, 1); }
Optimization: removing memory allocations 2
A second optimization is to remove the allocation of X at the beginning of each iteration. It can be seen that X is not required until e and d (now subarrays of x) are finished with. So we could use a scratch buffer to hold E and D before calculating X, using X as the scratch buffer when calculating E and D:
void FFT_buffer(complex* x, int N /* must be a power of 2 */, int skip, complex* X, complex* E) { complex * D = E + N/2; int k; if (N == 1) { X[0] = x[0]; return; } /* for now we can use X as a scratch buffer */ FFT_buffer(x, N/2, skip*2, E, X); FFT_buffer(x + skip, N/2, skip*2, D, X); for(k = 0; k < N/2; k++) { /* Multiply entries of D by the twiddle factors e^(2*pi*i/N * k) */ D[k] = complex_mult(complex_from_polar(1, 2.0*PI*k/N), D[k]); } for(k = 0; k < N/2; k++) { X[k] = complex_add(E[k], D[k]); X[k + N/2] = complex_sub(E[k], D[k]); } } complex* FFT_simple(complex* x, int N /* must be a power of 2 */) { complex* out = (complex*) malloc(sizeof(struct complex_t) * N); complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N); FFT_buffer(x, N, 1, out, scratch); free(scratch); return out; }
Optimization: precalculating twiddle factors
It can be seen that the same twiddle factors are being recalculated constantly. Firstly, the calculation of E and the calculation of D require exactly the same twiddle factors. Secondly, the calculation of E and D require the evennumbered twiddle factors that the calculation of X requires. Therefore we can use a constant array of twiddle factors with a skip, just like we did for x. Luckily, the skip for this array of twiddle factors is the same as the skip for x, so we do not need to pass around an extra value for it.
Note that although the code below allocates memory and calculates the twiddle factors once per call, in practice these would likely be reusable over many calls.
void FFT_calculate(complex* x, int N /* must be a power of 2 */, int skip, complex* X, complex* D, complex* twiddles) { complex * E = D + N/2; int k; if (N == 1) { X[0] = x[0]; return; } /* for now we can use X as a scratch buffer */ FFT_calculate(x, N/2, skip*2, E, X, twiddles); FFT_calculate(x + skip, N/2, skip*2, D, X, twiddles); for(k = 0; k < N/2; k++) { /* Multiply entries of D by the twiddle factors e^(2*pi*i/N * k) */ D[k] = complex_mult(twiddles[k*skip], D[k]); } for(k = 0; k < N/2; k++) { X[k] = complex_add(E[k], D[k]); X[k + N/2] = complex_sub(E[k], D[k]); } } complex* FFT_get_twiddle_factors(int N) { complex* twiddles = (complex*) malloc(sizeof(struct complex_t) * N/2); int k; for (k = 0; k != N/2; ++k) { twiddles[k] = complex_from_polar(1.0, 2.0*PI*k/N); } return twiddles; } complex* FFT_simple(complex* x, int N /* must be a power of 2 */) { complex* out = (complex*) malloc(sizeof(struct complex_t) * N); complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N); complex* twiddles = FFT_get_twiddle_factors(N); FFT_calculate(x, N, 1, out, scratch, twiddles); free(twiddles); free(scratch); return out; }
Optimization: removing recursion
This is a decidedly nontrivial optimization, for not a lot of speed increase. However, it does lend itself well to futher optimization.
To remove recursion we must be able to perform all the parts of each iteration in one pass. Here is a table of how we will arrange the results for each iteration for a DFT of length 8:
initial  X[0][0]  X[1][0]  X[2][0]  X[3][0]  X[4][0]  X[5][0]  X[6][0]  X[7][0] 
reinterpretation  E[0][0]  E[1][0]  E[2][0]  E[3][0]  D[0][0]  D[1][0]  D[2][0]  D[3][0] 
calculation of next Xs  X[0][0]  X[1][0]  X[2][0]  X[3][0]  X[0][1]  X[1][1]  X[2][1]  X[3][1] 
reinterpretation  E[0][0]  E[1][0]  D[0][0]  D[1][0]  E[0][1]  E[1][1]  D[0][1]  D[1][1] 
calculation of next Xs  X[0][0]  X[0][1]  X[1][0]  X[1][1]  X[0][2]  X[0][3]  X[1][2]  X[1][3] 
reinterpretation  E[0][0]  D[0][0]  E[0][1]  D[0][1]  E[0][2]  D[0][2]  E[0][3]  D[0][3] 
final calculation of Xs  X[0][0]  X[0][1]  X[0][2]  X[0][3]  X[0][4]  X[0][5]  X[0][6]  X[0][7] 
where X[n]
represents what would have been the output from one call of FFT_simple
and E[n]
and D[n]
represent the intermediate values used by that same call.

void FFT_calculate(complex* x, long N /* must be a power of 2 */, complex* X, complex* scratch, complex* twiddles) { int k, m, n; int skip; bool evenIteration = N & 0x55555555; complex* E, * D; complex* Xp, * Xstart; if (N == 1) { /* N == 1 is now a special case, not the end of the recursion! */ X[0] = x[0]; return; } E = x; for (n = 1; n < N; n *= 2) { Xstart = evenIteration? scratch : X; skip = N/(2 * n); /* each of D and E is of length n, and each element of each D and E is separated by 2*skip. The Es begin at E[0] to E[skip  1] and the Ds begin at E[skip] to E[2*skip  1] */ Xp = Xstart; for(k = 0; k != n; k++) { for (m = 0; m != skip; ++m) { D = E + skip; *D = complex_mult(twiddles[k * skip], *D); *Xp = complex_add(*E, *D); Xp[N/2] = complex_sub(*E, *D); ++Xp; ++E; } E += skip; } E = Xstart; evenIteration = !evenIteration; } } complex* FFT_get_twiddle_factors(int N) { complex* twiddles = (complex*) malloc(sizeof(struct complex_t) * N); int k; for (k = 0; k != N; ++k) { twiddles[k] = complex_from_polar(1.0, 2.0*PI*k/N); } return twiddles; } complex* FFT_simple(complex* x, int N /* must be a power of 2 */) { complex* out = (complex*) malloc(sizeof(struct complex_t) * N); complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N); complex* twiddles = FFT_get_twiddle_factors(N); FFT_calculate(x, N, out, scratch, twiddles); free(twiddles); free(scratch); return out; }
Optimization: reducing functioncall overhead
Note that in the arrangement above, the same twiddle factor is being invoked repeatedly in the inner loop. This suggests further optimizations, which, when combined with inlining of the complex arithmetic, produces a worthwhile speed improvement:
void FFT_calculate(complex* x, long N /* must be a power of 2 */, complex* X, complex* scratch, complex* twiddles) { int k, m, n; int skip; bool evenIteration = N & 0x55555555; complex* E; complex* Xp, * Xp2, * Xstart; if (N == 1) { X[0] = x[0]; return; } E = x; for (n = 1; n < N; n *= 2) { Xstart = evenIteration? scratch : X; skip = N/(2 * n); /* each of D and E is of length n, and each element of each D and E is separated by 2*skip. The Es begin at E[0] to E[skip  1] and the Ds begin at E[skip] to E[2*skip  1] */ Xp = Xstart; Xp2 = Xstart + N/2; for(k = 0; k != n; k++) { double tim = twiddles[k * skip].im; double tre = twiddles[k * skip].re; for (m = 0; m != skip; ++m) { complex* D = E + skip; /* twiddle *D to get dre and dim */ double dre = D>re * tre  D>im * tim; double dim = D>re * tim + D>im * tre; Xp>re = E>re + dre; Xp>im = E>im + dim; Xp2>re = E>re  dre; Xp2>im = E>im  dim; ++Xp; ++Xp2; ++E; } E += skip; } E = Xstart; evenIteration = !evenIteration; } } complex* FFT_get_twiddle_factors(int N) { complex* twiddles = (complex*) malloc(sizeof(struct complex_t) * N); int k; for (k = 0; k != N; ++k) { twiddles[k] = complex_from_polar(1.0, 2.0*PI*k/N); } return twiddles; } complex* FFT_simple(complex* x, int N /* must be a power of 2 */) { complex* out = (complex*) malloc(sizeof(struct complex_t) * N); complex* scratch = (complex*) malloc(sizeof(struct complex_t) * N); complex* twiddles = FFT_get_twiddle_factors(N); FFT_calculate(x, N, out, scratch, twiddles); free(twiddles); free(scratch); return out; }
Files
We create a few separate modules: one for our basic complex number support, one for the FFT implementations, and one for the test driver and main()
, described in the next section. We include the complex number header in the FFT header for the complex
type.
<<complex_simple.h>>= #ifndef _COMPLEX_SIMPLE_H_ #define _COMPLEX_SIMPLE_H_ complex number datatype complex number prototypes #endif /* #ifndef _COMPLEX_SIMPLE_H_ */ <<complex_simple.c>>= #include "complex_simple.h" complex number headers complex number function definitions
<<fft.h>>= #ifndef _FFT_H_ #define _FFT_H_ #include "complex_simple.h" FFT prototypes #endif /* #ifndef _FFT_H_ */ <<fft.c>>= #include "fft.h" FFT headers FFT constants Naive DFT Simple FFT
Test driver

Download code 