# The Programmer’s Guide To FFT – Part 2: FFT

This will be a 2 part series on fast fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In my previous post, I wrote about DFT as the basis to understand FFT. In this final part I will write about the FFT algorithm as outlined in Cooley and Tukey’s seminal work published in 1964. I assume you understood the material in the prior post before coming here, since everything here builds on top of it.

## Key Ideas

Before we begin, let us assume that the length of the input (N) is always in powers of 2 (2, 4, 8, 16, 32…). I will explain why is this important later on.

There are 2 key ideas that enable the FFT algorithm to work. First is to understand that DFT can be separated as a sum of odd and even parts:

$\begin{array}{lcl}F_k & = & \sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (2m)}{N}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (2m+1)}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m+1/2)}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2} - \frac{i\pi k}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + e^{-\frac{i2\pi k}{N}} \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2}}\end{array}$

Let us define a function $\omega$ (read as omega):

$\omega(p, q) = e^{\frac{i2\pi q}{p}}$

Now we simplify the DFT formulation to:

$F_k = \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot \omega(km, \frac{N}{2}) + \omega(N, k) \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot \omega(km, \frac{N}{2})$

Let’s generalize further to:

$F_k = F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}}$

The second key idea is to take advantage of the periodic nature of DFT:

$\begin{array}{lcl}F_k & = & F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}} \\ F_{k+\frac{N}{2}} & = & F_k^{\text{even}} - \omega(N, k) \cdot F_k^{\text{odd}}\end{array}$

What this means is that in the process of calculating the resulting sequence $F$ you only need to compute $\omega(N, k) \cdot F_k^{\text{odd}}$ a total of $\frac{N}{2}$ times; we can essentially half the number of computations using this technique. But why stop there? We can also take either $F_k^{\text{even}}$ or $F_k^{\text{odd}}$ and split them to odd and even parts, and repeat the same procedure. If we compute this recursively, the base case for this is when $N = 1$. In this manner we compute $\omega(N, k) \cdot F_k^{\text{odd}}$ for as many times as we can divide it by 2, or $\log N$. Therefore, for sequence of size N, FFT computes the DFT in $N \log N$ time.

## Example

Ok. That was probably hard to grasp, so let us break it down. Take an example where N = 2: a sequence in the coefficient representation s is (1, 9), and we want to convert it to point-value representation. The even and odd sequence is simply 1 and 9 respectively. We can use an auxiliary variable h to store $\omega(N, k) \cdot F_k^{\text{odd}}$:

$h = \omega(2, 0) \cdot 9 = 9$

$F_0 = 1 + h = 1 + 9 = 10$

$F_1 = 1 - h = 1 - 9 = -8$

Notice how we only need to compute $\omega(N, k) \cdot F_k^{\text{odd}}$ once and reused it for $F_{0 + \frac{N}{2}}$.

Now we go to a more complex example, where N = 8: s = (1, 6, 3, 8, 9, 5, 4, 2). Here we can show that it is possible that by using the fact that DFT can be expressed as sum of even and odd parts, that we can recursively divide s to smaller subproblems, up until N = 1:

I arranged such that the subsequence to the left contains even parts, and the sequence to the right contains odd parts. Now that we separate it nicely we can systemically work on the smaller parts and work our way up until the final answer. I’ve made a nice diagram that illustrates the computational flow of the FFT algorithm:

As before, the sequence to the left are the even parts and the sequence to the right are the odd parts. The cells show the type: yellow for real numbers, and green for complex numbers. Blue arrows branch out from even sequences, and red arrows branch out from odd sequences. Red arrows also denote that the cell it came from will be multiplied by $\omega(N, k)$, though not visually depicted. The whole computation flow shows a sort of “butterfly pattern”, as how most engineers like to describe it.

IFFT works roughly the same way as FFT in that it uses the same technique to save computation, so if you understand FFT you should get IFFT as well.

## Implementation

The implementation here includes FFT and IFFT. As with the DFT and IDFT implementation in the previous post, it takes a sequence in the coefficient representation and spits out a sequence of the same size in the point-value representation using FFT, and takes that sequence puts it through IFFT to get back the original sequence. As with before, my emphasis is on readability not optimization.

#include <iostream>
#include <complex>
#include <cmath>
#include <iomanip>
#include <vector>
#include <algorithm>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

inline xd omega(const double &p, const double &q)
{
return exp((2. * PI * J * q) / p);
}

xvec _fft(xvec &f)
{
double N = f.size();

if (N == 1) return f;

xvec fe, fo;
fe.reserve(N / 2);
fo.reserve(N / 2);

for (int i = 0; i < N; i += 2) {
fe.push_back(f[i]);     // even
fo.push_back(f[i + 1]); // odd
}

fe = _fft(fe);
fo = _fft(fo);

for (int m = 0; m < N / 2; ++m) {
xd omfo = omega(N, -m) * fo[m];
f[m]         = fe[m] + omfo;
f[m + N / 2] = fe[m] - omfo;
}

return f;
}

xvec fft(const dvec &x)
{
xvec f(x.size());

for (size_t i = 0; i < x.size(); ++i) {
f[i] = xd(x[i], 0);
}

return _fft(f);
}

xvec _ifft(xvec &x)
{
double N = x.size();

if (N == 1) return x;

xvec xe, xo;
xe.reserve(N / 2);
xo.reserve(N / 2);

for (int i = 0; i < N; i += 2) {
xe.push_back(x[i]);     // even
xo.push_back(x[i + 1]); // odd
}

xe = _ifft(xe);
xo = _ifft(xo);

for (int m = 0; m < N / 2; ++m) {
xd iomxo = omega(N, m) * xo[m];
x[m]         = xe[m] + iomxo;
x[m + N / 2] = xe[m] - iomxo;
}

return x;
}

dvec ifft(xvec f)
{
double N = f.size();

xvec xcomplex = _ifft(f);
dvec x(N);

for (int i = 0; i < N; ++i) {
x[i] = xcomplex[i].real() / N;
}

return x;
}

int main()
{
cout << fixed << setprecision(2);

dvec input = { 1,6,3,8,9,5,4,2 };

// convert from time to frequency domain
xvec freqdom = fft(input);

for (const auto &f : freqdom) {
cout << f << endl;
}
cout << endl;

// convert from frequency to time domain
auto timedom = ifft(freqdom);

for (const auto &t : timedom) {
cout << t << ' ';
}
cout << endl;
}


It is a good idea to set breakpoints to see how the recursive implementation of FFT systematically solves DFT from the smaller subproblems.

Because of how similar FFT and IFFT is, it is not hard to merge them into a function and pass a boolean parameter to determine whether it will be FFT and IFFT (most implementations online will call this multi-purpose function “transform”), but for the sake of clarity I refrain from doing so.

Convolution is practically the same as before – it’s just that we replace DFT with FFT and IDFT with IFFT:

// vector convolution
dvec convolve(const dvec &a, const dvec &b)
{
// calculate degree of resulting polynomial
size_t N = 2 * a.size() - 1;

// extend size to match result
dvec acof(N), bcof(N);
copy(a.begin(), a.end(), acof.begin());
copy(b.begin(), b.end(), bcof.begin());

xvec apv, bpv, cpv(N);

// evaluation
apv = fft(acof);
bpv = fft(bcof);

// point-wise multiplcation
for (size_t i = 0; i < N; ++i) {
cpv[i] = apv[i] * bpv[i];
}

for (const auto &t : cpv)  cout << t << ' ';
cout << endl;

// interpolation
return ifft(cpv);
}


Now we estimate the time complexity of vector convolution using FFT: evaluation ($N \log N$), pointwise multiplication (N) and interpolation ($N \log N$) now costs a total of $2 \times N \log N + N \approx N \log N$.

## Must N Be In Powers of 2?

You could play around with different input sizes and compare the answer with DFT. What you will see is that FFT will return some funny values if N is not a power of 2. Why is this so? Well, you can see from the visual depictions of how FFT works: at the simplest subproblem (N = 2), you need to have one even value and one odd value. FFT must be able to divide in such a way that at some point splitting the input, all subsequences are of size 2, and the only way that is possible is if N is a power of 2.

If it doesn’t make sense, you could just play around with the code to convince yourself I guess.

Is this a bad thing? Well, if all you use FFT for is convolution then no. You could first calculate the resulting polynomial degree of the convolution, then pad the input with 0 until N is a power of 2, evaluate it, do pointwise multiplication, interpolate, and resize it to match the resulting degree.

If you use bit shifts to multiply by 2, you can compute N very quickly (see full implementation here):

// degree of resulting polynomial = size of resulting array
size_t deg = a.size() + b.size() - 1;

// transform array size must be in power of 2 for FFT
size_t N = 1;
while (N < deg) N <<= 1;

// set size for arrays in point-wise representation -
// extended space is padded with 0:
xvec a(N), b(N), c(N);

// ** do convolution... **

// Resize to actual size
c.resize(deg);


But, wouldn’t resizing the input array be slow? Well, we can prove with simple example. Say N = 129: In naive DFT, computing the DFT will take $129^2 = 16641$. In FFT we resize N to the closest power of 2, which is 256. $256 \log 256 \approx 617$. That is 2697% less computations! Of course, it shouldn’t be hard to convince yourself that this is still true as N gets larger.

There are variations of FFT where N does not need to be in powers of 2, but it won’t be the focus of this post.

## Optimizations

If you print the N and m used by the $\omega$ (omega) function, you will notice that there are some repetitions. Here is N and m where N = 8:

2 0
2 0
4 0
4 1
2 0
2 0
4 0
4 1
8 0
8 1
8 2
8 3

So this means it is possible to precompute results from $\omega$ and reuse them. In the case where N = 8, we only need to calculate $\omega$ 7 times as oppose to 12. I have an implementaion of FFT that does this simple optimization.

What else can you do? In-place calculation – here’s my implementation of in-place fft. Aside knowing how to meddle with indices, one key idea is to understand is this: the indices after dividing the sequence is the bit reverse, where the length of bits is $\log N$. For example if N = 16 ($\log N = 4$), the index 7 (0111) with be swapped with 14 (1110).

## Conclusion

We have went through DFT, and now FFT. I hope this helped you understand FFT a little better. It’s ok if you’re still a little foggy with the details; play around the source code long enough and it will be clearer to you. I intend to work on hands-on applications of FFT in future posts, so keep a look out for more!

# The Programmer’s Guide To FFT – Part 1: DFT

This will be a 2 part series on fast fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In this first part I will write about discrete fourier transform (DFT) as the basis to understand FFT.

## Sorting as a Metaphor

DFT and FFT are similar as insertion sort is to merge sort; they both take the same type of inputs and spits out the same output, it’s just that FFT runs much faster than DFT by utilizing a technique called divide and conquer (mergesort also uses this). The computational difference is also the same as merge sort is to insertion sort: $O(N \log N)$ vs $O(N^2)$. How much difference is this? Well, if N is 10 billion and each computation takes 1 nanosecond, it is the difference between finishing in 30 seconds as opposed to 30 years.

## What Does DFT Do?

DFT takes a discrete series of input in the time domain and converts it to the frequency domain. Computer scientists describe it as converting from the coefficient representation to the point-value representation (more on that later). At either case, the semantics of the input is retained; it is like translating from one language to another.

Also, the conversation from one domain to another needs to be reversible. To convert from the frequency domain back to the time domain, you use the inverse of the DFT, also called IDFT. In computer science, the process to convert from coefficient representation to the point-value representation is called evaluation, and the reverse process is interpolation.

## Applications and Motivations

The applications of DFT is diverse: how Shazam matches songs, mp3 compression, remove noise from audio, jpeg compression, edge detection, radar, sonar, optics, seismic activity, satellite transmissions, genome matching (I will put up a post on genome matching in the future as an application of DFT)… Put it this way: as long as it involves a sequence (could be in multiple dimensions) that changes – whether it be over time or space, it can benefit from DFT.

However, this post focuses on using DFT for vector convolution (I profess that as of this writing this is the only thing I know how to use DFT for).

## Vector Convolution

We can understand what convolution does as a polynomial multiplication:

Example taken from CLRS

Put formally, we are given a sequence of real numbers $a$ and $b$ (in this example is (6, 7, -10, 9) and -2, 0, 4, -5) respectively) of size n – each sequence maps to the coefficients of the input (thus the term coefficient representation), and we return a sequence of real numbers $c$ (-12, -14, 44, -20, -75, 86, -45) of size 2n – 1. The degree of each polynomial can be mapped from the indices (start from 0). We also describe convolution in mathemical notation as $a \otimes b = c$.

Each element $c_j$ ($0 \leq j < 2n-1$) can be calculated like so (formula also from CLRS):

$c_j = \sum\limits_{k=0}^{j}a_kb_{j-k}$

I have an implementation of this formulation below (click here to see a sample using this function):

typedef vector<int> ivec;

ivec polynomialMultiply(const ivec &a, const ivec &b)
{
int N = 2 * a.size() - 1;
ivec c(N, 0);

for (int j = 0; j < N; ++j) {
int p = j >= a.size() ? j - a.size() + 1 : 0;
for (int k = p; k <= j - p; ++k) {
c[j] += a[k] * b[j - k];
}
}

return c;
}


Perhaps you might be wondering why is there an auxiliary variable p which is not in the formulation: I don’t know; it doesn’t work without it. BUT, that’s not the point. The point is that this technique takes $O(N^2)$ to run (we can infer this from the nested for loop). The same result, however, can be computed much faster in point-value representation:

In this illustration, yellow indicates the coefficient representation (real numbers) and green indicates the point-value representation (complex numbers). In the point-value form, convolution is simply a pointwise multiplication which can be done in $O(N)$. However, the actual complexity of this method also needs to also factor the computation needed for evaluation and interpolation.

## The Math Behind DFT

We now describe the transformation process with more formal mathematical notation. Let $x$ be the sequence of N real values in the time domain, where the n-th element of $x$ is denoted as $x_n$. DFT then takes $x$ and transforms it to a sequence $F$ of N complex values in the frequency domain, where the k-th element of $F$ is denoted as $F_k$. We can also write DFT as a function $dft(x) = F$, and IDFT as $idft(F) = x$.

The DFT is then given by:

$F_k=\sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}}$

And IDFT given by:

$x_n=\frac{1}{N}\sum\limits_{k=0}^{N-1}F_k\cdot e^{\frac{i2\pi k n}{N}}$

Where $i$ is the imaginary number $\sqrt{-1}$.

Since each element takes N computations for all N elements, it is not hard to see that both DFT and IDFT takes $O(N^2)$.

## Calculate An Example by Hand

It always helps to compute the DFT by hand first to get a feel for it. To do this we will need Euler’s formula to break down the exponential to its sine cosine equivalents:

$e^{iy}=\cos y + i \sin y$

In this context, $y$ is $\frac{2\pi n k}{N}$.

Example: Let $x$ be a sequence (1, 2, 3, 4) and N = 4, and we wish to convert it to the frequency domain.

$\begin{array}{lcl}F_0 & = & 1\cdot e^{-\frac{i2\pi 0 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 0 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 0 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 0 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^0 + 3\cdot e^0 + 4\cdot e^0 \\ & = & 10\end{array}$

$\begin{array}{lcl}F_1 & = & 1\cdot e^{-\frac{i2\pi 1 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 1 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 1 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 1 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^{-\frac{i\pi}{2}} + 3\cdot e^{-i\pi} + 4\cdot e^{-\frac{i3\pi}{2}} \\ & = & 1 + 2\left(\cos \frac{-\pi}{2} + i\sin \frac{-\pi}{2} \right) + 3 \left( \cos(-\pi) + i \sin(-\pi) \right) + 4 \left( \cos\frac{-3\pi}{2} + i\sin\frac{-3\pi}{2}\right) \\ & = & 1 + 2\left(0 - i\right) + 3 \left( -1 + 0 \right) + 4 \left( 0 + i\right) \\ & = & -2+2i\end{array}$

$F_2 = -2$

$F_3 = -2-2i$

Now, if you take the values of $F$ and pass through the IDFT, it should return you back the sequence $x$, though I will not show it here.

## Implementation

The C++ program below simply runs the same sequence in the example through DFT, and pass the result through IDFT to get back the original sequence. I placed more emphasis on readability of the code, so it is not designed to be optimized but rather to prove the correctness of the mathematical formulations of DFT and IDFT.

#include <iostream>
#include <complex>
#include <cmath>
#include <iomanip>
#include <vector>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

xvec dft(const dvec &input)
{
double N = input.size();
xvec X(N);

for (double k = 0; k < N; ++k) {
for (double n = 0; n < N; ++n) {
X[k] += (double)input[n] * exp(-2. * J * PI * n * k / N);
}
}

return X;
}

dvec idft(const xvec &input)
{
double N = input.size();
xvec x(N);
dvec out(N);

for (double k = 0; k < N; ++k) {
for (double n = 0; n < N; ++n) {
x[k] += input[n] * exp(2. * J * PI * n * k / N);
}
out[k] = x[k].real() / N;
}

return out;
}

int main()
{
cout << fixed << setprecision(2);

dvec input = { 1, 2, 3, 4 };

// convert from time to frequency domain
xvec freqdom = dft(input);

for (const auto &f : freqdom) {
cout << f << endl;
}
cout << endl;

// convert from frequency to time domain
dvec timedom = idft(freqdom);

for (const auto &t : timedom) {
cout << t << ' ';
}
cout << endl;
}


The output of this program will then be:

(10.00,0.00)
(-2.00,2.00)
(-2.00,-0.00)
(-2.00,-2.00)

1.00 2.00 3.00 4.00

With DFT and IDFT working properly we now use it to implement convolution:

dvec convolve(const dvec &a, const dvec &b)
{
// calculate degree of resulting polynomial
size_t N = 2 * a.size() - 1;

// extend size and pad with 0
dvec acof(N, 0), bcof(N, 0);
copy(a.begin(), a.end(), acof.begin());
copy(b.begin(), b.end(), bcof.begin());

xvec apv, bpv, cpv(N);

// evaluation
apv = dft(acof);
bpv = dft(bcof);

// point-wise multiplcation
for (size_t i = 0; i < N; ++i) {
cpv[i] = apv[i] * bpv[i];
}

// interpolation
return idft(cpv);
}


See the full working DFT convolution program from this gist.

## Conclusion

So, we covered DFT and vector convolution, and applied it to solve polynomial multiplication. What’s the problem now? Pointwise multiplication, in addition with evaluation and interpolation using DFT and IDFT, uses a time complexity of $O(N) + 2 \times O(N^2) \approx O(N^2)$. That’s actually no different, if not worst than using the naive method. To circumvent this, I will introduce FFT in the next post, which will compute the evaluation and interpolation in $O(N \log N)$ time. Stay tuned!

Update: Part 2 is up!

# 1-0 Knapsack Problem – A Hands-on Guide (C++)

The 1-0 knapsack problem; an optimization puzzle famously solved with dynamic programming (dp). This post is merely my take on the problem, which I hope to provide a more hands-on approach.

To get started, try and attempt The Knapsack Problem (KNAPSACK) from SPOJ. Study the problem closely as I will referring to it throughout this guide. I will then explain how the general solution is derived and how dp is applied.

I assume you have known a bit of dp as a prerequisite, though if you haven’t you can check out my beginner friendly hands-on intro: 445A – Boredom – CodeForces Tutorial.

## The General Solution

A dp solution is usually derived from a recursive solution. So let’s start with that.

We define 2 containers: v and c, that contains the all the values of each item and the capacity they consume respectively, starting from index 1. So for example, v[2] and c[2] returns the value and size of the 2nd item.

What are we trying to optimize here? We are trying to maximize the value; the combination of items that yields the highest value. So let us define a function B(i, w) that returns the maximum value given a scope of items (i) and the capacity of the knapsack (w). The solution to KNAPSACK will therefore be B(N, S).

In solving it recursively, imagine we have all N items with us and our knapsack completely empty (current capacity is S), and we consider the items one by one from the last one (the N-th item).

What are the base cases of this problem? When there are no items to put in (i = 0), or when the knapsack cannot contain any items (w = 0).

Before we consider some i-th item, we first need to make sure that it can fit into the knapsack given its capacity, in other words, an i-th item should not be considered if c[i] > w. If this is so, you will consider the maximum value of the the scope of items excluding the i-th item, or B(i-1, w).

So what happens when you can put the item in the knapsack? You have 2 choices: To put it in (take), or not put it in (keep).

1. Keep: you exclude it from the scope of items in which you consider the maximum value, which is again B(i-1, w).
2. Take: you will get the value of the i-th item you select (v[i]), BUT, we should also consider the remaining space after adding the i-th item inside (w-c[i]). When considering items to add here, we need to exclude the item we already added, so the scope of items we consider is i-1. With this remaining space and this scope of items, we also want to get the maximum value. We can do this recursively via B(i-1, w-c[i]).

Choosing between keep and take is as simple as taking the maximum of the 2.

If you piece all this together, you will get the general solution:

$B(i,j) = \begin{cases} 0 \text{ if } i = 0\text{ or }w = 0\\ B(i-1, w) \text{ if } c[i] > w\\ max\Big( B(i-1, w), v[i] + B(i-1, w-c[i]) \Big) \end{cases}$

## Recursive Solution

Short and simple:

Plug it in the judge and you will get a TLE (Time Limit Exceeded).

## Dynamic Programming Solution

We use a 2D array, DP, containing N+1 rows and S+1 columns. The rows map to the scope of i items we consider (the top considers 0 items and the bottom considers all N items). The columns map to capacity (an individual column is denoted as the w-th column) of knapsack left to right from 0 to S. A cell DP[i][w] means “this is the maximum value given i items and capacity of w“. The maximum value for N items and capacity of S is therefore DP[N][S].

We first fill DP with base cases: where i = 0 and where w = 0. This means that the first row and first column are all 0. The order in which we solve this is simple: starting where i = 1 and w = 1, for each row from top to bottom, we fill the cells from left to right.

I will continue on with an example using inputs from Tushar Roy’s YouTube video (you should check it out), but I jumbled the order to prove that ordering of items is not important. Here is the DP array:

Cells shaded in pale blue is when item does not fit in capacity w, or where c[i] > w. In pale green cells we have a choice: keep or take. Notice that with every row, we are adding one more item to consideration, and with every column we increase the capacity by 1.

Here is the code:

## Getting the Selected Items

Some people use an auxiliary Boolean array (often called Keep) that keeps track of whether an item is selected or not. This seems to be an unnecessary occupation of space to me, since you can deduce the selected items from DP itself.

You start from DP[N][S], and from there:

• An item is selected, if the value of the cell directly above it is not equal to the current cell. When this happens, the capacity of the knapsack reduces by the weight of the selected item. With that new capacity you select the next item.
• An item is not selected, if the value of the cell directly above it is equal to the current cell. So we consider the next item by moving up one row; capacity remains unchanged.

This process continues until either there are no more items remaining, or the knapsack is full.

The table below shows the trail of the algorithm as it selects items (item 1 and 4) from the DP array we constructed before:

Below is the recursive function:

void pick(int i, int w)
{
if (i <= 0 || w <= 0) return;

int k = DP[i][w];
if (k != DP[i - 1][w]) {
cout << i << " "; // select!
pick(i - 1, w - c[i]); // capacity decreases
} else {
// move on to next item; capacity no change
pick(i - 1, w);
}
}


See the full implementation of this function in this gist.

## Application

Enough spoon feeding! It is time for you to try out some puzzles on your own. Conveniently UVa grouped a series of 3 questions that are slight variations of the 1-0 knapsack problem. I sort them here in order of difficulty:

1. 10130 – SuperSale (my solution)
2. 990 – Diving for Gold (my solution)
You will need to list down the items you select in this one.
– Beware of the tricky formatting! There shouldn’t be a blank line at the end of your output.
3. 562 – Dividing coins (my solution)
In my solution, there is a tip (short comment block before the main function) you can check out if you just want some pointers to get started.

# Matrix Chain Multiplication with C++ code – PART 3: Extracting the Sequence

In my previous post, I wrote about finding the minimum cost of multiplying a matrix chain using dynamic programming. In this post I build on top of what we have covered to extract the sequence of multiplying the chain itself.

This is part of a 3 part series:

1. Analysis and Design – analyze the problem and deduce a solution.
2. Implementation – implement the algorithm to find the minimum number of operations.
3. Extracting the Sequence – extract the sequence in which the chain is multiplied.

## Extract the Sequence

To do this we keep track of the point at which we split up the chain as prefix and suffix: the point (we define this from the previous post). We do this by storing it in another 2D array of the same size as DP, which we call splits:

int ops = DP[i][k] + DP[k + 1][j] + rc[i] * rc[k + 1] * rc[j + 1];
if (ops < DP[i][j]) {
DP[i][j] = ops;
splits[i][j] = k;
}


Now let us print out the elements of splits in dark blue. I will be reusing the example input from the previous post:

How do we interpret this? Let us start from where we got the main solution; where i = 1 and j = 6. This cell splits[1][6] signifies the last operation of multiplying the matrix chain – in the above table this maps to 3. This means in the last operation, a prefix from a chain $A_1 \ldots A_3$ multiplies with a suffix from a chain $A_{4} \ldots A_6$.

Let us now consider just the prefix. How then is prefix $A_1 \ldots A_3$ multiplied? We find that from splits[1][3], which in this case is 1. Therefore the it is formed by multiplying a prefix $A_1 \ldots A_1$ and a suffix $A_2 \ldots A_3$. In this prefix $A_1 \ldots A_1$ we hit a base case where i = j. Should this happen we return the matrix $A_1$. The suffix $A_2 \ldots A_3$ is split from from splits[2][3], which is 2. This in turns hits 2 base cases and returns the matrix $A_2$ and $A_3$.

If you repeat this recursive process with the top level suffix, you parenthesize the chain as such:

We can generalize a function mult that returns the resultant matrix that has been multiplied in the most optimal way from the matrix chain:

$mult(i, j)=\begin{cases}A_i \text{ if } i = j\\mult(i, k) \cdot mult(k+1, j)\text{ where } k = splits[i][j]\end{cases}$

The main solution is therefore mult(1, 6).

## Implementation

Below is an implementation of this:

Because we don’t have the contents of the matrices, I have the program output the parenthesis instead, along with the order in which it parenthesize them:

./mat-chain-mult-dp-trace < input.txt
multiply 2 and 3
multiply 1 and (2*3)
multiply 4 and 5
multiply (4*5) and 6
multiply (1*(2*3)) and ((4*5)*6)
((1*(2*3))*((4*5)*6))

Note that the numbers here are not integers but indices that map to its respective matrix in the matrix chain.

It is not that hard to convert the above code to multiply matrices. I will leave it as an exercise, should you be interested.

## Application

Why is this matrix chain multiplication problem so important that most computer science undergraduate programs must include it in their syllabus? The same reason applies for most algorithms you learn: some problems are just slight variations of another problem. Understanding one solution to a problem in the marrow of its bones may help you solve another similar problem.

Therefore, I encourage you to challenge yourself on these 2 problems to help solidify your understanding:

You can find my solutions to these problems in my github SPOJ repository.

# Matrix Chain Multiplication with C++ code – PART 2: Implementation

In the last post I explained how we got the general solution:

$B(i,j) = \begin{cases} 0 \text{ if } i = j \\ \displaystyle\min_{k=i}^{j-1} \begin{cases} B(i, k) + B(k+1, j) + r_i \cdot c_k \cdot c_j \end{cases} \end{cases}$

In this post we plug the formula in and watch it run! I will go through the recursive solution ($O(2^n)$), and then the more optimal dynamic programming solution ($O(n^3)$).

This is part of a 3 part series:

1. Analysis and Design – analyze the problem and deduce a solution.
2. Implementation – implement the algorithm to find the minimum number of operations.
3. Extracting the Sequence – extract the sequence in which the chain is multiplied.

## Input

Our input (input.txt), therefore, is the rows and columns of all n matrices:

7
30 35 15 5 10 20 25

The above input meant that there are 7 integers that correspond to the dimensions of 6 matrices:

The input only contains the rows from matrix $A_1$ until $A_6$, followed by the column of matrix $A_6$ . This is because a matrix A and only be multiplied with a matrix B if they are compatible (if $r_A = c_B$). So for brevity sake we do not need to put rows and columns of all matrices (otherwise the sequence will look like 30 35 35 15 15 5 5 10 10 20 20 25).

Notice, that if we place the sequence of integers in an array rcthe row of a matrix $A_i$ is rc[i] and the column is rc[i+1].

## Recursive Implementation

As surprising though it may be, the amount of code is surprisingly little:

I have written the code to follow the formula as close as possible, as such I start the index with 1 instead of 0. You can always change it to save that bit of space, but the code just serves to show that it works:

./mat-chain-mult-recursive < input.txt
15125

Now, if you print the inputs of the function B each time it is being called, you will see a lot of the same inputs are being called again and again. In geeksforgeeks.org’s article on matrix chain multiplication, they took the time to draw out the recursion tree.

## Dynamic Programming: Ordering

So now we have broken the main problem to small recurring subproblems (Overlapping Subproblems), which we can piece together to solve the main problem (Optimal Substructure). With this 2 properties, we can use dynamic programming to drastically reduce the running time of an exponential solution.

However, this problem is tricky to solve since it involves 2 variables i and j. We need a 2D array to keep track of this. In a single dimensional array ordering is trivial; we simply go from left to right, building the solution from the bottom up, but with 2 dimensions ordering becomes tricky. Which cell do we solve first?

Well, we know that:

• the base case is when i = j, then B(i, j) = 0.
• i cannot exceed j, so those areas will need to be grayed out.
• B(1, 6) is the solution.

So with that, let us fill up an array DP based on the inputs we have above:

So we need to find B(1, 6) yes? Well, what does B(1, 6) need? Well, if we unfold the general solution plugging in i as 1 and j as 6 we have 5 selections in which we choose the smallest:

$\begin{cases}k=1: B(1, 1) + B(2, 6) + r_1 \cdot c_1 \cdot c_6 \\k=2: B(1, 2) + B(3, 6) + r_1 \cdot c_2 \cdot c_6 \\k=3: B(1, 3) + B(4, 6) + r_1 \cdot c_3 \cdot c_6 \\k=4: B(1, 4) + B(5, 6) + r_1 \cdot c_4 \cdot c_6 \\k=5: B(1, 5) + B(6, 6) + r_1 \cdot c_5 \cdot c_6 \end{cases}$

To make it easier to visualize what B(1, 6) requires, I distinguish the selection by colour:

What you can see, is that B(1, 6) requires cells to the left and to the bottom of it. Well, we have the cells that represent the base cases solved. Perhaps the next logical step is to solve the next diagonal. This is possible, because all the values of all cells where j i = 1 depends on all cells where ji = 0. Consequently, all values of all cells where ji = 2 depends on all cells where ji = 1 and ji = 0; you can validate this by plugging in the numbers into the formula. Because all cells aside the base case requires cells below and to its left up until it hits the base case, we can continue filling up the cells of the array DP diagonally until we hit the top right cell.

From another perspective, we are finding all 5 pairs of the matrices in the chain and find their minimum cost (choose from 1 each), then we find all 4 triplets of matrices in the chain and find their minimum cost (choose from 2), then we find all 3 quadruplets of matrices in the chain and find their minimum cost (choose from 3), and this repeats until we hit the top right cell; each subsequent pass depending on the pass before it.

So we solve in diagonals in order ji = 0, 1, 2 … n – 1. Once you can sequentially iterate through the array DP in a diagonal fashion, you have everything you need to piece together the solution:

It helps that you calculate the cells by hand to grasp the essence of this algorithm.

## Dynamic Programming: Implementation

So what I did, was design a nested for loop to print the indices in the order I want. Once I do that I simply plug the formula inside, and I have my final solution:

Note that again, the index starts at 1.

Although code count is near indistinguishable, the performance is drastically improved at a negligible expense of space. We generalize that the improved implementation is $O(n^3)$ by counting the number of nested for loops.

## Conclusion

In the 3rd and final part, I will guide you through extracting the actual order in which the chain is multiplied. Check it out: Matrix Chain Multiplication with C++ code – PART 3: Extracting the Sequence.

# Matrix Chain Multiplication with C++ Code – PART 1: Analysis and Design

Ah. The matrix chain multiplication problem. A classic dynamic programming example in the undergraduate computer science world. I will be doing a series consisting of 3 parts:

1. Analysis and Design – analyze the problem and deduce a solution.
2. Implementation – implement the algorithm to find the minimum number of operations.
3. Extracting the Sequence – extract the sequence in which the chain is multiplied.

There’s practically a ton of articles out there that already writes about this problem. I intend to distinguish myself by expressing it in the simplest terms possible with examples, and season it with working C++11 source code. In other words, if you had problem understanding everywhere else, my hopes is that your search ends with this series.

In this first part, there will be no code. At all. Not even pseudocode. I promise some C++11 code in the coming 2 parts, but understanding this section will be important for implementation later.

## Prerequisites

I will assume you should have a bit of background in matrix math. It helps that you do a bit of refreshing if you have forgotten.

I will also assume you have known a bit of what dynamic programming is, as matrix chain multiplication is a slightly more complex example; you keep track of a 2D array instead of 1D array. I’ve written a beginner level hands on introduction to dynamic programming in a form of solving a problem which you can check out: 445A – Boredom – CodeForces Tutorial.

## Understanding the Problem

Given a chain of matrices, we want to multiply them. We cannot change the order of the matrices as that would change the result or make it incompatible, therefore we say matrix multiplication is not commutative. In addition, regardless of how you parenthesize (e.g. whether the sequence as multiplied as $A\cdot \left ( B\cdot C \right )$ or $\left ( A\cdot B \right )\cdot C$ ) the chain it will give you the same answer, thus we say it is associative.

How we parenthesize the matrices also determine how many operations will be performed. For a multiplication chain of integers, parenthesizing them does not affect the number of operations performed, but for matrices the change is significant. For example, given the matrices with the following dimensions:

If we define a row of a matrix as $r_A$ and the column as $c_A$, then the total number of multiplications required to multiply A and B is $r_A \cdot c_A \cdot c_{B}$.

• $A\cdot \left ( B\cdot C \right )$ -> 2*20*1 + 2*1*10 = 60 multiplications
• $\left ( A\cdot B \right )\cdot C$ -> 20*1*10 + 2*20*10 = 600 multiplications

The 1st case is 10 times faster!

Therefore, we want to figure out how to parenthesize the matrix chain such that we minimize the required number of operations.

However, to keep things simple, we will focus on the minimum number of operations. We will eventually find the sequence itself by building on top on this.

## Formulation

To put in formal terms, given a sequence of n matrices $A_1, A_2, A_3 \ldots A_n$ we wish to parenthesize the product of this sequence so as to minimize the number of operations required.

## Finding the Minimum Number of Operations

Suppose we have a function B(i, j) that computes the minimum number of required operations for multiplying a chain of matrices from matrix i to matrix j. The solution to the main problem would be B(1, n).

A standard technique to in solving a dynamic programming problem is usually to ask, “What is the last operation?”. Here, we ask what is the minimum total cost of the last operation? Regardless how the chain is parenthesized, there will always be 2 matrices remaining. The cost, is the minimum cost of some prefix + minimum cost of some suffix + the number of operations to multiply them. The cost of prefix and suffix in turn, is derived from the minimum number of operations that deduces them.

Because each matrix can only multiply with its adjacent matrix, a prefix can only start from $A_1$ to some matrix $A_k$, and a suffix can only start from $A_{k+1}$ to $A_n$, split at some index k. The cost in the last operation for the prefix is then B(1, k) and the cost of the suffix is B(k+1, n)k therefore ranges from 1 until n-1 in the last operation, and within a range i to j, k ranges from i to j-1. We choose k by selecting the minimum cost of all the possible costs within a range i to j.

The resultant dimensions from multiplying 2 matrices are important to finding the cost. In a sequence of matrices $A_i \ldots A_j$ If we split at a point k, the resultant dimensions of the prefix is $r_i \times c_k$ and the suffix is $r_{k+1} \times c_j$. The cost of multiplying these 2 matrices are therefore $r_i \cdot c_k \cdot c_j$.

## Running Through Possibilities

Let us consider the base case: When there is only 1 matrix. The prefix = suffix, and there are no operations performed, so the cost = 0.

Now what if there are only 2 matrices? There are no other possible combinations to compare with, so the minimum number of operations is the number of operations to multiply them. There is no cost to the prefix or suffix, because the cost for 1 matrix alone is 0.

3 matrices onwards is gets a little more challenging. When the chain has only 3 matrices, you have to choose between $\left ( A_1\cdot A_2 \right )\cdot A_3$ and $A_1\cdot \left ( A_2\cdot A_3\right )$. How we choose between the 2 is determined by cost of either multiplying $A_1$ and $A_2$ first, then take the resulting matrix (prefix) and multiply by $A_3$ (suffix), OR, multiplying $A_2$ and $A_3$ first, then multiplying that (suffix) by $A_1$ (prefix). Where we decide to make that split, either at $A_1$ or $A_2$, we determined by their cost. The matrix in which we choose we call it $A_k$, at a point k.

The minimum cost with 3 matrices, B(1, 3) is therefore $B(1, k) + B(k+1, 3) + r_1 \cdot c_k \cdot c_3$, and we choose between $B(1, 1) + B(2, 3) + r_1 \cdot c_1 \cdot c_3$ and $B(1, 2) + B(3, 3) + r_1 \cdot c_2 \cdot c_3$. You can continue doing this for 4 or 5 or 6 matrices until you start to see a pattern. Notice you can apply the same procedure to find B(1, 3) to find B(1, 2) – there is only 1 possible value for k for B(1, 2).

## The General Solution

So in a range i to j, we select from j – i possibilities, from i until j – 1. Those possibilities in turn call B recursively until it hits the base case where i = j. The general solution is therefore:

$B(i,j) = \begin{cases} 0 \text{ if } i = j \\ \displaystyle\min_{k=i}^{j-1} \begin{cases} B(i, k) + B(k+1, j) + r_i \cdot c_k \cdot c_j \end{cases} \end{cases}$

And the solution to the main problem would be B(1, n).

## Conclusion

This is all for the first part. Next post I will cover implementation: Matrix Chain Multiplication with C++ code – PART 2: Implementation.

# 445A – Boredom – CodeForces Tutorial

Inasmuch as this is a CodeForces tutorial, I’d also want to use Boredom as an introduction to dynamic programming (dp). I will also compare the dp solution to its alternate recursive implementation. This tutorial also caters to people who can’t seem to wrap their head around the solution or tutorial from CodeForces.

I won’t be explaining the problem itself. Should be pretty straightforward.

## Gotchas

The maximum points can exceed the limit for the 32-bit data type int. In some of the test cases that value can be as high as 9999899998 (the maximum capacity for 32-bit int is 2147483647). The maximum value for the n integers may be up to 10^5, but they can total up to a lot higher than that.

That is why you will see C/C++ solutions out there the data type long long int:

typedef long long int bigInt;


## Solution Overview

You can try to understand the succinct tutorial from CodeForces, if you haven’t already looked into it.

The sequence of n numbers needed to be grouped together and kept count, because when selecting a single number, it makes sense to repeatedly select it (thereby adding it to your points) until it doesn’t exist in the sequence. We will keep track of this count in array c, where c[i] contains the number of occurrences of i.

Let f be a function that calculates the maximum number of points to a value up to i. Now there will be 2 base cases:

1. If i = 0, then f(0) = 0
2. If i = 1, then f(0) = c[1]

I hope the 2nd point is clear. Up to 1 the maximum number of points you can get by selecting 1 is the number of occurrences of 1 in the array c.

When i >= 2, you have 2 choices in deciding which will be f(i):

1. Select f(– 2), in which you can also add with i * c[i].
2. Select f(i – 1), but in doing so you can’t choose to add i * c[i] because of the condition that all elements equal to ak + 1 and ak - 1 also must be deleted from the sequence.

Choosing between 1 and 2 is as simple as choosing which is largest.

Now let m be the largest number in the sequence. The solution is therefore f(m). The CodeForces tutorial writes f(n), but I doubt n here implies the number of elements in the sequence. To avoid confusion, I used m instead.

Most implementations I see online simply put f(100000).

It helps that you trace the algorithm step by step on paper, should it not be apparent to you how it works. For example for input:

10
10 5 8 9 5 6 8 7 2 8

You can track the values of i, c[i], and f(i):

i    c[i]   i*c[i]   f(i)
0    0      0        0
1    0      0        0
2    1      2        2
3    0      0        2
4    0      0        2
5    2      10       12
6    1      6        12
7    1      7        19
8    3      24       36
9    1      9        36
10   1      10       46

Do not undermine this method; learning how to trace algorithms in a systematic way can be very helpful to help you understand it.

## Recursive Implementation

Based on the solution as detailed above, here the recursive implementation.

#include <algorithm>
#include <iostream>
#include <vector>

typedef long long int bigInt;

using namespace std;

vector<bigInt>  c(100001, 0);

bigInt f(const int i)
{
if (i == 0) return 0;
if (i == 1) return c[1];

return max(f(i - 1), f(i - 2) + i*c[i]);
}

int main()
{
int n, x, m = 0;
cin >> n;

while(n--) {
cin >> x;
c[x]++;
m = max(m, x);
}

cout << f(m);
}


Don’t submit this solution; it will fail at test 9 (time limit exceeded).

That’s it! In 30 lines of code you can solve this problem! The issue? Well, try and increase the values in the sequence (not the number of elements in the sequence. This doesn’t affect the growth of the function). Even a value as small as 30 and I had to wait a few seconds for it to compute.

## Analysis of Recursive Implementation

We can define the performance of this recursive implementation by the number of recursive calls.

Let us trace the recursive calls of the function f in a recursion tree. I used paler colors to signify the depth of the recursion tree; the deeper it gets, the paler it gets:

That is a total of 9 recursive calls. How about f(5)?

5+9+1 (including itself) = 15 function calls.

If you observe for awhile, you will be able to deduce a pattern. Let q be a function that finds the number of function calls for f given i. Then we write q as: q(i)=q(i-1) + q(i-2) + 1, for all i >= 2; q(0)=1; q(1)=1. If you plot this function as a graph, you will get this:

So time complexity is O(2^n); Implementation scales exponentially.

By the time i reaches 18, q(18) = 8361. When i = 30, the number of function calls becomes 2692537! This would explain that bit of lag earlier. Just imagine i going as big as 10000. It is no surprise then, that when this happens, the program crashes due to stack overflow:

## Dynamic Programming Implementation

Dynamic programming, in essence, is an optimization technique that dramatically cuts down computational time but taking out a bit of space to store calculated answers to overlapping subproblems. The overlapping subproblems here are the calls to f, which for example in the recursion tree f(4) we can see that f(1) is called 3 times and f(2) and f(0) is called 2 times.

Furthermore, we can take the solutions the subproblems to compute the solution of the main problem. For example f(5) can be determined by f(4) and f(3), and f(4) can be determined by f(3) and f(2), and so on. This property is called Optimal Substructure.

To recap, the 2 properties that must exist for a problem to be made solvable by dp:

Let’s put theory to practice.

We can reduce the time complexity from O(2^n) to O(n) by keeping an array DP of the same size as cDP will store the calculated values of f for all i. The solution, therefore, is DP[m], since it contains f(m). Below is the dynamic programming solution:

#include <algorithm>
#include <iostream>
#include <vector>

typedef long long int bigInt;

using namespace std;

int main()
{
const int MAX_N = 100001;

int n, x;
vector<bigInt> dp(MAX_N, 0);
vector<bigInt>  c(MAX_N, 0);

cin >> n;

while(n--) {
cin >> x;
c[x]++;
}

dp[0] = 0;
dp[1] = c[1];

for (int i = 2; i < MAX_N; i++) {
dp[i] = max(dp[i - 1], dp[i - 2] + i*c[i]);
}

cout << dp[MAX_N - 1];
}


In the dp solution, f(i) is computed from 0 to 10000 (bottom to top), whereas in the recursive solution f(i) is computed from m to 0 (top to bottom).

Notice I didn’t bother keep track the largest value in the sequence (variable m). You can have that bit of optimization if you use it, but it isn’t necessary for this problem. In essence we have f (the block of code inside that for loop) called a constant 10000 times each time this program is called, and this works granted the largest value in the sequence doesn’t exceed 10000.

What is fascinating about this implementation is that it solves the problem with the same amount of code! Ladies and gentlemen, the power of algorithms!