# The Programmer’s Guide To FFT – Part 2: FFT

This will be a 2 part series on fast fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In my previous post, I wrote about DFT as the basis to understand FFT. In this final part I will write about the FFT algorithm as outlined in Cooley and Tukey’s seminal work published in 1964. I assume you understood the material in the prior post before coming here, since everything here builds on top of it.

## Key Ideas

Before we begin, let us assume that the length of the input (N) is always in powers of 2 (2, 4, 8, 16, 32…). I will explain why is this important later on.

There are 2 key ideas that enable the FFT algorithm to work. First is to understand that DFT can be separated as a sum of odd and even parts:

$\begin{array}{lcl}F_k & = & \sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (2m)}{N}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (2m+1)}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m+1/2)}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2} - \frac{i\pi k}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + e^{-\frac{i2\pi k}{N}} \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2}}\end{array}$

Let us define a function $\omega$ (read as omega):

$\omega(p, q) = e^{\frac{i2\pi q}{p}}$

Now we simplify the DFT formulation to:

$F_k = \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot \omega(km, \frac{N}{2}) + \omega(N, k) \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot \omega(km, \frac{N}{2})$

Let’s generalize further to:

$F_k = F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}}$

The second key idea is to take advantage of the periodic nature of DFT:

$\begin{array}{lcl}F_k & = & F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}} \\ F_{k+\frac{N}{2}} & = & F_k^{\text{even}} - \omega(N, k) \cdot F_k^{\text{odd}}\end{array}$

What this means is that in the process of calculating the resulting sequence $F$ you only need to compute $\omega(N, k) \cdot F_k^{\text{odd}}$ a total of $\frac{N}{2}$ times; we can essentially half the number of computations using this technique. But why stop there? We can also take either $F_k^{\text{even}}$ or $F_k^{\text{odd}}$ and split them to odd and even parts, and repeat the same procedure. If we compute this recursively, the base case for this is when $N = 1$. In this manner we compute $\omega(N, k) \cdot F_k^{\text{odd}}$ for as many times as we can divide it by 2, or $\log N$. Therefore, for sequence of size N, FFT computes the DFT in $N \log N$ time.

## Example

Ok. That was probably hard to grasp, so let us break it down. Take an example where N = 2: a sequence in the coefficient representation s is (1, 9), and we want to convert it to point-value representation. The even and odd sequence is simply 1 and 9 respectively. We can use an auxiliary variable h to store $\omega(N, k) \cdot F_k^{\text{odd}}$:

$h = \omega(2, 0) \cdot 9 = 9$

$F_0 = 1 + h = 1 + 9 = 10$

$F_1 = 1 - h = 1 - 9 = -8$

Notice how we only need to compute $\omega(N, k) \cdot F_k^{\text{odd}}$ once and reused it for $F_{0 + \frac{N}{2}}$.

Now we go to a more complex example, where N = 8: s = (1, 6, 3, 8, 9, 5, 4, 2). Here we can show that it is possible that by using the fact that DFT can be expressed as sum of even and odd parts, that we can recursively divide s to smaller subproblems, up until N = 1:

I arranged such that the subsequence to the left contains even parts, and the sequence to the right contains odd parts. Now that we separate it nicely we can systemically work on the smaller parts and work our way up until the final answer. I’ve made a nice diagram that illustrates the computational flow of the FFT algorithm:

As before, the sequence to the left are the even parts and the sequence to the right are the odd parts. The cells show the type: yellow for real numbers, and green for complex numbers. Blue arrows branch out from even sequences, and red arrows branch out from odd sequences. Red arrows also denote that the cell it came from will be multiplied by $\omega(N, k)$, though not visually depicted. The whole computation flow shows a sort of “butterfly pattern”, as how most engineers like to describe it.

IFFT works roughly the same way as FFT in that it uses the same technique to save computation, so if you understand FFT you should get IFFT as well.

## Implementation

The implementation here includes FFT and IFFT. As with the DFT and IDFT implementation in the previous post, it takes a sequence in the coefficient representation and spits out a sequence of the same size in the point-value representation using FFT, and takes that sequence puts it through IFFT to get back the original sequence. As with before, my emphasis is on readability not optimization.

#include <iostream>
#include <complex>
#include <cmath>
#include <iomanip>
#include <vector>
#include <algorithm>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

inline xd omega(const double &p, const double &q)
{
return exp((2. * PI * J * q) / p);
}

xvec _fft(xvec &f)
{
double N = f.size();

if (N == 1) return f;

xvec fe, fo;
fe.reserve(N / 2);
fo.reserve(N / 2);

for (int i = 0; i < N; i += 2) {
fe.push_back(f[i]);     // even
fo.push_back(f[i + 1]); // odd
}

fe = _fft(fe);
fo = _fft(fo);

for (int m = 0; m < N / 2; ++m) {
xd omfo = omega(N, -m) * fo[m];
f[m]         = fe[m] + omfo;
f[m + N / 2] = fe[m] - omfo;
}

return f;
}

xvec fft(const dvec &x)
{
xvec f(x.size());

for (size_t i = 0; i < x.size(); ++i) {
f[i] = xd(x[i], 0);
}

return _fft(f);
}

xvec _ifft(xvec &x)
{
double N = x.size();

if (N == 1) return x;

xvec xe, xo;
xe.reserve(N / 2);
xo.reserve(N / 2);

for (int i = 0; i < N; i += 2) {
xe.push_back(x[i]);     // even
xo.push_back(x[i + 1]); // odd
}

xe = _ifft(xe);
xo = _ifft(xo);

for (int m = 0; m < N / 2; ++m) {
xd iomxo = omega(N, m) * xo[m];
x[m]         = xe[m] + iomxo;
x[m + N / 2] = xe[m] - iomxo;
}

return x;
}

dvec ifft(xvec f)
{
double N = f.size();

xvec xcomplex = _ifft(f);
dvec x(N);

for (int i = 0; i < N; ++i) {
x[i] = xcomplex[i].real() / N;
}

return x;
}

int main()
{
cout << fixed << setprecision(2);

dvec input = { 1,6,3,8,9,5,4,2 };

// convert from time to frequency domain
xvec freqdom = fft(input);

for (const auto &f : freqdom) {
cout << f << endl;
}
cout << endl;

// convert from frequency to time domain
auto timedom = ifft(freqdom);

for (const auto &t : timedom) {
cout << t << ' ';
}
cout << endl;
}


It is a good idea to set breakpoints to see how the recursive implementation of FFT systematically solves DFT from the smaller subproblems.

Because of how similar FFT and IFFT is, it is not hard to merge them into a function and pass a boolean parameter to determine whether it will be FFT and IFFT (most implementations online will call this multi-purpose function “transform”), but for the sake of clarity I refrain from doing so.

Convolution is practically the same as before – it’s just that we replace DFT with FFT and IDFT with IFFT:

// vector convolution
dvec convolve(const dvec &a, const dvec &b)
{
// calculate degree of resulting polynomial
size_t N = 2 * a.size() - 1;

// extend size to match result
dvec acof(N), bcof(N);
copy(a.begin(), a.end(), acof.begin());
copy(b.begin(), b.end(), bcof.begin());

xvec apv, bpv, cpv(N);

// evaluation
apv = fft(acof);
bpv = fft(bcof);

// point-wise multiplcation
for (size_t i = 0; i < N; ++i) {
cpv[i] = apv[i] * bpv[i];
}

for (const auto &t : cpv)  cout << t << ' ';
cout << endl;

// interpolation
return ifft(cpv);
}


Now we estimate the time complexity of vector convolution using FFT: evaluation ($N \log N$), pointwise multiplication (N) and interpolation ($N \log N$) now costs a total of $2 \times N \log N + N \approx N \log N$.

## Must N Be In Powers of 2?

You could play around with different input sizes and compare the answer with DFT. What you will see is that FFT will return some funny values if N is not a power of 2. Why is this so? Well, you can see from the visual depictions of how FFT works: at the simplest subproblem (N = 2), you need to have one even value and one odd value. FFT must be able to divide in such a way that at some point splitting the input, all subsequences are of size 2, and the only way that is possible is if N is a power of 2.

If it doesn’t make sense, you could just play around with the code to convince yourself I guess.

Is this a bad thing? Well, if all you use FFT for is convolution then no. You could first calculate the resulting polynomial degree of the convolution, then pad the input with 0 until N is a power of 2, evaluate it, do pointwise multiplication, interpolate, and resize it to match the resulting degree.

If you use bit shifts to multiply by 2, you can compute N very quickly (see full implementation here):

// degree of resulting polynomial = size of resulting array
size_t deg = a.size() + b.size() - 1;

// transform array size must be in power of 2 for FFT
size_t N = 1;
while (N < deg) N <<= 1;

// set size for arrays in point-wise representation -
// extended space is padded with 0:
xvec a(N), b(N), c(N);

// ** do convolution... **

// Resize to actual size
c.resize(deg);


But, wouldn’t resizing the input array be slow? Well, we can prove with simple example. Say N = 129: In naive DFT, computing the DFT will take $129^2 = 16641$. In FFT we resize N to the closest power of 2, which is 256. $256 \log 256 \approx 617$. That is 2697% less computations! Of course, it shouldn’t be hard to convince yourself that this is still true as N gets larger.

There are variations of FFT where N does not need to be in powers of 2, but it won’t be the focus of this post.

## Optimizations

If you print the N and m used by the $\omega$ (omega) function, you will notice that there are some repetitions. Here is N and m where N = 8:

2 0
2 0
4 0
4 1
2 0
2 0
4 0
4 1
8 0
8 1
8 2
8 3

So this means it is possible to precompute results from $\omega$ and reuse them. In the case where N = 8, we only need to calculate $\omega$ 7 times as oppose to 12. I have an implementaion of FFT that does this simple optimization.

What else can you do? In-place calculation – here’s my implementation of in-place fft. Aside knowing how to meddle with indices, one key idea is to understand is this: the indices after dividing the sequence is the bit reverse, where the length of bits is $\log N$. For example if N = 16 ($\log N = 4$), the index 7 (0111) with be swapped with 14 (1110).

## Conclusion

We have went through DFT, and now FFT. I hope this helped you understand FFT a little better. It’s ok if you’re still a little foggy with the details; play around the source code long enough and it will be clearer to you. I intend to work on hands-on applications of FFT in future posts, so keep a look out for more!

# The Programmer’s Guide To FFT – Part 1: DFT

This will be a 2 part series on fast fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In this first part I will write about discrete fourier transform (DFT) as the basis to understand FFT.

## Sorting as a Metaphor

DFT and FFT are similar as insertion sort is to merge sort; they both take the same type of inputs and spits out the same output, it’s just that FFT runs much faster than DFT by utilizing a technique called divide and conquer (mergesort also uses this). The computational difference is also the same as merge sort is to insertion sort: $O(N \log N)$ vs $O(N^2)$. How much difference is this? Well, if N is 10 billion and each computation takes 1 nanosecond, it is the difference between finishing in 30 seconds as opposed to 30 years.

## What Does DFT Do?

DFT takes a discrete series of input in the time domain and converts it to the frequency domain. Computer scientists describe it as converting from the coefficient representation to the point-value representation (more on that later). At either case, the semantics of the input is retained; it is like translating from one language to another.

Also, the conversation from one domain to another needs to be reversible. To convert from the frequency domain back to the time domain, you use the inverse of the DFT, also called IDFT. In computer science, the process to convert from coefficient representation to the point-value representation is called evaluation, and the reverse process is interpolation.

## Applications and Motivations

The applications of DFT is diverse: how Shazam matches songs, mp3 compression, remove noise from audio, jpeg compression, edge detection, radar, sonar, optics, seismic activity, satellite transmissions, genome matching (I will put up a post on genome matching in the future as an application of DFT)… Put it this way: as long as it involves a sequence (could be in multiple dimensions) that changes – whether it be over time or space, it can benefit from DFT.

However, this post focuses on using DFT for vector convolution (I profess that as of this writing this is the only thing I know how to use DFT for).

## Vector Convolution

We can understand what convolution does as a polynomial multiplication:

Example taken from CLRS

Put formally, we are given a sequence of real numbers $a$ and $b$ (in this example is (6, 7, -10, 9) and -2, 0, 4, -5) respectively) of size n – each sequence maps to the coefficients of the input (thus the term coefficient representation), and we return a sequence of real numbers $c$ (-12, -14, 44, -20, -75, 86, -45) of size 2n – 1. The degree of each polynomial can be mapped from the indices (start from 0). We also describe convolution in mathemical notation as $a \otimes b = c$.

Each element $c_j$ ($0 \leq j < 2n-1$) can be calculated like so (formula also from CLRS):

$c_j = \sum\limits_{k=0}^{j}a_kb_{j-k}$

I have an implementation of this formulation below (click here to see a sample using this function):

typedef vector<int> ivec;

ivec polynomialMultiply(const ivec &a, const ivec &b)
{
int N = 2 * a.size() - 1;
ivec c(N, 0);

for (int j = 0; j < N; ++j) {
int p = j >= a.size() ? j - a.size() + 1 : 0;
for (int k = p; k <= j - p; ++k) {
c[j] += a[k] * b[j - k];
}
}

return c;
}


Perhaps you might be wondering why is there an auxiliary variable p which is not in the formulation: I don’t know; it doesn’t work without it. BUT, that’s not the point. The point is that this technique takes $O(N^2)$ to run (we can infer this from the nested for loop). The same result, however, can be computed much faster in point-value representation:

In this illustration, yellow indicates the coefficient representation (real numbers) and green indicates the point-value representation (complex numbers). In the point-value form, convolution is simply a pointwise multiplication which can be done in $O(N)$. However, the actual complexity of this method also needs to also factor the computation needed for evaluation and interpolation.

## The Math Behind DFT

We now describe the transformation process with more formal mathematical notation. Let $x$ be the sequence of N real values in the time domain, where the n-th element of $x$ is denoted as $x_n$. DFT then takes $x$ and transforms it to a sequence $F$ of N complex values in the frequency domain, where the k-th element of $F$ is denoted as $F_k$. We can also write DFT as a function $dft(x) = F$, and IDFT as $idft(F) = x$.

The DFT is then given by:

$F_k=\sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}}$

And IDFT given by:

$x_n=\frac{1}{N}\sum\limits_{k=0}^{N-1}F_k\cdot e^{\frac{i2\pi k n}{N}}$

Where $i$ is the imaginary number $\sqrt{-1}$.

Since each element takes N computations for all N elements, it is not hard to see that both DFT and IDFT takes $O(N^2)$.

## Calculate An Example by Hand

It always helps to compute the DFT by hand first to get a feel for it. To do this we will need Euler’s formula to break down the exponential to its sine cosine equivalents:

$e^{iy}=\cos y + i \sin y$

In this context, $y$ is $\frac{2\pi n k}{N}$.

Example: Let $x$ be a sequence (1, 2, 3, 4) and N = 4, and we wish to convert it to the frequency domain.

$\begin{array}{lcl}F_0 & = & 1\cdot e^{-\frac{i2\pi 0 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 0 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 0 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 0 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^0 + 3\cdot e^0 + 4\cdot e^0 \\ & = & 10\end{array}$

$\begin{array}{lcl}F_1 & = & 1\cdot e^{-\frac{i2\pi 1 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 1 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 1 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 1 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^{-\frac{i\pi}{2}} + 3\cdot e^{-i\pi} + 4\cdot e^{-\frac{i3\pi}{2}} \\ & = & 1 + 2\left(\cos \frac{-\pi}{2} + i\sin \frac{-\pi}{2} \right) + 3 \left( \cos(-\pi) + i \sin(-\pi) \right) + 4 \left( \cos\frac{-3\pi}{2} + i\sin\frac{-3\pi}{2}\right) \\ & = & 1 + 2\left(0 - i\right) + 3 \left( -1 + 0 \right) + 4 \left( 0 + i\right) \\ & = & -2+2i\end{array}$

$F_2 = -2$

$F_3 = -2-2i$

Now, if you take the values of $F$ and pass through the IDFT, it should return you back the sequence $x$, though I will not show it here.

## Implementation

The C++ program below simply runs the same sequence in the example through DFT, and pass the result through IDFT to get back the original sequence. I placed more emphasis on readability of the code, so it is not designed to be optimized but rather to prove the correctness of the mathematical formulations of DFT and IDFT.

#include <iostream>
#include <complex>
#include <cmath>
#include <iomanip>
#include <vector>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

xvec dft(const dvec &input)
{
double N = input.size();
xvec X(N);

for (double k = 0; k < N; ++k) {
for (double n = 0; n < N; ++n) {
X[k] += (double)input[n] * exp(-2. * J * PI * n * k / N);
}
}

return X;
}

dvec idft(const xvec &input)
{
double N = input.size();
xvec x(N);
dvec out(N);

for (double k = 0; k < N; ++k) {
for (double n = 0; n < N; ++n) {
x[k] += input[n] * exp(2. * J * PI * n * k / N);
}
out[k] = x[k].real() / N;
}

return out;
}

int main()
{
cout << fixed << setprecision(2);

dvec input = { 1, 2, 3, 4 };

// convert from time to frequency domain
xvec freqdom = dft(input);

for (const auto &f : freqdom) {
cout << f << endl;
}
cout << endl;

// convert from frequency to time domain
dvec timedom = idft(freqdom);

for (const auto &t : timedom) {
cout << t << ' ';
}
cout << endl;
}


The output of this program will then be:

(10.00,0.00)
(-2.00,2.00)
(-2.00,-0.00)
(-2.00,-2.00)

1.00 2.00 3.00 4.00

With DFT and IDFT working properly we now use it to implement convolution:

dvec convolve(const dvec &a, const dvec &b)
{
// calculate degree of resulting polynomial
size_t N = 2 * a.size() - 1;

// extend size and pad with 0
dvec acof(N, 0), bcof(N, 0);
copy(a.begin(), a.end(), acof.begin());
copy(b.begin(), b.end(), bcof.begin());

xvec apv, bpv, cpv(N);

// evaluation
apv = dft(acof);
bpv = dft(bcof);

// point-wise multiplcation
for (size_t i = 0; i < N; ++i) {
cpv[i] = apv[i] * bpv[i];
}

// interpolation
return idft(cpv);
}


See the full working DFT convolution program from this gist.

## Conclusion

So, we covered DFT and vector convolution, and applied it to solve polynomial multiplication. What’s the problem now? Pointwise multiplication, in addition with evaluation and interpolation using DFT and IDFT, uses a time complexity of $O(N) + 2 \times O(N^2) \approx O(N^2)$. That’s actually no different, if not worst than using the naive method. To circumvent this, I will introduce FFT in the next post, which will compute the evaluation and interpolation in $O(N \log N)$ time. Stay tuned!

Update: Part 2 is up!