The Programmer’s Guide To FFT – Part 2: FFT

This will be a 2 part series on fast fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In my previous post, I wrote about DFT as the basis to understand FFT. In this final part I will write about the FFT algorithm as outlined in Cooley and Tukey’s seminal work published in 1964. I assume you understood the material in the prior post before coming here, since everything here builds on top of it.

Key Ideas

Before we begin, let us assume that the length of the input (N) is always in powers of 2 (2, 4, 8, 16, 32…). I will explain why is this important later on.

There are 2 key ideas that enable the FFT algorithm to work. First is to understand that DFT can be separated as a sum of odd and even parts:

\begin{array}{lcl}F_k & = & \sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (2m)}{N}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (2m+1)}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m+1/2)}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2} - \frac{i\pi k}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + e^{-\frac{i2\pi k}{N}} \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2}}\end{array}

Let us define a function \omega (read as omega):

\omega(p, q) = e^{\frac{i2\pi q}{p}}

Now we simplify the DFT formulation to:

F_k = \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot \omega(km, \frac{N}{2}) + \omega(N, k) \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot \omega(km, \frac{N}{2})

Let’s generalize further to:

F_k = F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}}

The second key idea is to take advantage of the periodic nature of DFT:

\begin{array}{lcl}F_k & = & F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}} \\ F_{k+\frac{N}{2}} & = & F_k^{\text{even}} - \omega(N, k) \cdot F_k^{\text{odd}}\end{array}

What this means is that in the process of calculating the resulting sequence F you only need to compute \omega(N, k) \cdot F_k^{\text{odd}} a total of \frac{N}{2} times; we can essentially half the number of computations using this technique. But why stop there? We can also take either F_k^{\text{even}} or F_k^{\text{odd}} and split them to odd and even parts, and repeat the same procedure. If we compute this recursively, the base case for this is when N = 1. In this manner we compute \omega(N, k) \cdot F_k^{\text{odd}} for as many times as we can divide it by 2, or \log N. Therefore, for sequence of size N, FFT computes the DFT in N \log N time.


Ok. That was probably hard to grasp, so let us break it down. Take an example where N = 2: a sequence in the coefficient representation s is (1, 9), and we want to convert it to point-value representation. The even and odd sequence is simply 1 and 9 respectively. We can use an auxiliary variable h to store \omega(N, k) \cdot F_k^{\text{odd}}:

h = \omega(2, 0) \cdot 9 = 9

F_0 = 1 + h = 1 + 9 = 10

F_1 = 1 - h = 1 - 9 = -8

Notice how we only need to compute \omega(N, k) \cdot F_k^{\text{odd}} once and reused it for F_{0 + \frac{N}{2}}.

Now we go to a more complex example, where N = 8: s = (1, 6, 3, 8, 9, 5, 4, 2). Here we can show that it is possible that by using the fact that DFT can be expressed as sum of even and odd parts, that we can recursively divide s to smaller subproblems, up until N = 1:

I arranged such that the subsequence to the left contains even parts, and the sequence to the right contains odd parts. Now that we separate it nicely we can systemically work on the smaller parts and work our way up until the final answer. I’ve made a nice diagram that illustrates the computational flow of the FFT algorithm:

As before, the sequence to the left are the even parts and the sequence to the right are the odd parts. The cells show the type: yellow for real numbers, and green for complex numbers. Blue arrows branch out from even sequences, and red arrows branch out from odd sequences. Red arrows also denote that the cell it came from will be multiplied by \omega(N, k), though not visually depicted. The whole computation flow shows a sort of “butterfly pattern”, as how most engineers like to describe it.

IFFT works roughly the same way as FFT in that it uses the same technique to save computation, so if you understand FFT you should get IFFT as well.


The implementation here includes FFT and IFFT. As with the DFT and IDFT implementation in the previous post, it takes a sequence in the coefficient representation and spits out a sequence of the same size in the point-value representation using FFT, and takes that sequence puts it through IFFT to get back the original sequence. As with before, my emphasis is on readability not optimization.

#include <iostream>     
#include <complex>      
#include <cmath>
#include <iomanip>
#include <vector>
#include <algorithm>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

inline xd omega(const double &p, const double &q)
    return exp((2. * PI * J * q) / p);

xvec _fft(xvec &f)
    double N = f.size();
    if (N == 1) return f;
    xvec fe, fo;
    fe.reserve(N / 2);
    fo.reserve(N / 2);
    for (int i = 0; i < N; i += 2) {
        fe.push_back(f[i]);     // even
        fo.push_back(f[i + 1]); // odd
    fe = _fft(fe);
    fo = _fft(fo);
    for (int m = 0; m < N / 2; ++m) {
        xd omfo = omega(N, -m) * fo[m];
        f[m]         = fe[m] + omfo;
        f[m + N / 2] = fe[m] - omfo;

    return f;

xvec fft(const dvec &x)
    xvec f(x.size());
    for (size_t i = 0; i < x.size(); ++i) { 
        f[i] = xd(x[i], 0);
    return _fft(f);

xvec _ifft(xvec &x)
    double N = x.size();
    if (N == 1) return x;
    xvec xe, xo;
    xe.reserve(N / 2);
    xo.reserve(N / 2);
    for (int i = 0; i < N; i += 2) {
        xe.push_back(x[i]);     // even
        xo.push_back(x[i + 1]); // odd
    xe = _ifft(xe);
    xo = _ifft(xo);
    for (int m = 0; m < N / 2; ++m) {
        xd iomxo = omega(N, m) * xo[m];
        x[m]         = xe[m] + iomxo;
        x[m + N / 2] = xe[m] - iomxo;
    return x;

dvec ifft(xvec f)
    double N = f.size();
    xvec xcomplex = _ifft(f);
    dvec x(N);
    for (int i = 0; i < N; ++i) {
        x[i] = xcomplex[i].real() / N;
    return x;

int main()
    cout << fixed << setprecision(2);

    dvec input = { 1,6,3,8,9,5,4,2 };
    // convert from time to frequency domain
    xvec freqdom = fft(input);

    for (const auto &f : freqdom) {
        cout << f << endl;
    cout << endl;
    // convert from frequency to time domain
    auto timedom = ifft(freqdom);
    for (const auto &t : timedom) {
        cout << t << ' ';
    cout << endl;

It is a good idea to set breakpoints to see how the recursive implementation of FFT systematically solves DFT from the smaller subproblems.

Because of how similar FFT and IFFT is, it is not hard to merge them into a function and pass a boolean parameter to determine whether it will be FFT and IFFT (most implementations online will call this multi-purpose function “transform”), but for the sake of clarity I refrain from doing so.

Convolution is practically the same as before – it’s just that we replace DFT with FFT and IDFT with IFFT:

// vector convolution
dvec convolve(const dvec &a, const dvec &b) 
    // calculate degree of resulting polynomial
    size_t N = 2 * a.size() - 1;
    // extend size to match result
    dvec acof(N), bcof(N);
    copy(a.begin(), a.end(), acof.begin());
    copy(b.begin(), b.end(), bcof.begin());
    xvec apv, bpv, cpv(N);
    // evaluation
    apv = fft(acof);
    bpv = fft(bcof);
    // point-wise multiplcation
    for (size_t i = 0; i < N; ++i) {
        cpv[i] = apv[i] * bpv[i];
    for (const auto &t : cpv)  cout << t << ' ';
    cout << endl;
    // interpolation
    return ifft(cpv);

Now we estimate the time complexity of vector convolution using FFT: evaluation (N \log N), pointwise multiplication (N) and interpolation (N \log N) now costs a total of 2 \times N \log N + N \approx N \log N.

Must N Be In Powers of 2?

You could play around with different input sizes and compare the answer with DFT. What you will see is that FFT will return some funny values if N is not a power of 2. Why is this so? Well, you can see from the visual depictions of how FFT works: at the simplest subproblem (N = 2), you need to have one even value and one odd value. FFT must be able to divide in such a way that at some point splitting the input, all subsequences are of size 2, and the only way that is possible is if N is a power of 2.

If it doesn’t make sense, you could just play around with the code to convince yourself I guess.

Is this a bad thing? Well, if all you use FFT for is convolution then no. You could first calculate the resulting polynomial degree of the convolution, then pad the input with 0 until N is a power of 2, evaluate it, do pointwise multiplication, interpolate, and resize it to match the resulting degree.

If you use bit shifts to multiply by 2, you can compute N very quickly (see full implementation here):

// degree of resulting polynomial = size of resulting array
size_t deg = a.size() + b.size() - 1;

// transform array size must be in power of 2 for FFT
size_t N = 1;
while (N < deg) N <<= 1;

// set size for arrays in point-wise representation -
// extended space is padded with 0:
xvec a(N), b(N), c(N);

// ** do convolution... **

// Resize to actual size

But, wouldn’t resizing the input array be slow? Well, we can prove with simple example. Say N = 129: In naive DFT, computing the DFT will take 129^2 = 16641. In FFT we resize N to the closest power of 2, which is 256. 256 \log 256 \approx 617. That is 2697% less computations! Of course, it shouldn’t be hard to convince yourself that this is still true as N gets larger.

There are variations of FFT where N does not need to be in powers of 2, but it won’t be the focus of this post.


If you print the N and m used by the \omega (omega) function, you will notice that there are some repetitions. Here is N and m where N = 8:

2 0
2 0
4 0
4 1
2 0
2 0
4 0
4 1
8 0
8 1
8 2
8 3

So this means it is possible to precompute results from \omega and reuse them. In the case where N = 8, we only need to calculate \omega 7 times as oppose to 12. I have an implementaion of FFT that does this simple optimization.

What else can you do? In-place calculation – here’s my implementation of in-place fft. Aside knowing how to meddle with indices, one key idea is to understand is this: the indices after dividing the sequence is the bit reverse, where the length of bits is \log N. For example if N = 16 (\log N = 4), the index 7 (0111) with be swapped with 14 (1110).


We have went through DFT, and now FFT. I hope this helped you understand FFT a little better. It’s ok if you’re still a little foggy with the details; play around the source code long enough and it will be clearer to you. I intend to work on hands-on applications of FFT in future posts, so keep a look out for more!

The Programmer’s Guide To FFT – Part 1: DFT

This will be a 2 part series on fast fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In this first part I will write about discrete fourier transform (DFT) as the basis to understand FFT.

Sorting as a Metaphor

DFT and FFT are similar as insertion sort is to merge sort; they both take the same type of inputs and spits out the same output, it’s just that FFT runs much faster than DFT by utilizing a technique called divide and conquer (mergesort also uses this). The computational difference is also the same as merge sort is to insertion sort: O(N \log N) vs O(N^2). How much difference is this? Well, if N is 10 billion and each computation takes 1 nanosecond, it is the difference between finishing in 30 seconds as opposed to 30 years.

What Does DFT Do?

DFT takes a discrete series of input in the time domain and converts it to the frequency domain. Computer scientists describe it as converting from the coefficient representation to the point-value representation (more on that later). At either case, the semantics of the input is retained; it is like translating from one language to another.

Also, the conversation from one domain to another needs to be reversible. To convert from the frequency domain back to the time domain, you use the inverse of the DFT, also called IDFT. In computer science, the process to convert from coefficient representation to the point-value representation is called evaluation, and the reverse process is interpolation.

Applications and Motivations

The applications of DFT is diverse: how Shazam matches songs, mp3 compression, remove noise from audio, jpeg compression, edge detection, radar, sonar, optics, seismic activity, satellite transmissions, genome matching (I will put up a post on genome matching in the future as an application of DFT)… Put it this way: as long as it involves a sequence (could be in multiple dimensions) that changes – whether it be over time or space, it can benefit from DFT.

However, this post focuses on using DFT for vector convolution (I profess that as of this writing this is the only thing I know how to use DFT for).

Vector Convolution

We can understand what convolution does as a polynomial multiplication:

Example taken from CLRS

Put formally, we are given a sequence of real numbers a and b (in this example is (6, 7, -10, 9) and -2, 0, 4, -5) respectively) of size n – each sequence maps to the coefficients of the input (thus the term coefficient representation), and we return a sequence of real numbers c (-12, -14, 44, -20, -75, 86, -45) of size 2n – 1. The degree of each polynomial can be mapped from the indices (start from 0). We also describe convolution in mathemical notation as a \otimes b = c.

Each element c_j (0 \leq j < 2n-1) can be calculated like so (formula also from CLRS):

c_j = \sum\limits_{k=0}^{j}a_kb_{j-k}

I have an implementation of this formulation below (click here to see a sample using this function):

typedef vector<int> ivec;

ivec polynomialMultiply(const ivec &a, const ivec &b) 
    int N = 2 * a.size() - 1;
    ivec c(N, 0);
    for (int j = 0; j < N; ++j) { 
        int p = j >= a.size() ? j - a.size() + 1 : 0;
        for (int k = p; k <= j - p; ++k) {
            c[j] += a[k] * b[j - k];
    return c;

Perhaps you might be wondering why is there an auxiliary variable p which is not in the formulation: I don’t know; it doesn’t work without it. BUT, that’s not the point. The point is that this technique takes O(N^2) to run (we can infer this from the nested for loop). The same result, however, can be computed much faster in point-value representation:

In this illustration, yellow indicates the coefficient representation (real numbers) and green indicates the point-value representation (complex numbers). In the point-value form, convolution is simply a pointwise multiplication which can be done in O(N). However, the actual complexity of this method also needs to also factor the computation needed for evaluation and interpolation.

The Math Behind DFT

We now describe the transformation process with more formal mathematical notation. Let x be the sequence of N real values in the time domain, where the n-th element of x is denoted as x_n. DFT then takes x and transforms it to a sequence F of N complex values in the frequency domain, where the k-th element of F is denoted as F_k. We can also write DFT as a function dft(x) = F, and IDFT as idft(F) = x.

The DFT is then given by:

F_k=\sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}}

And IDFT given by:

x_n=\frac{1}{N}\sum\limits_{k=0}^{N-1}F_k\cdot e^{\frac{i2\pi k n}{N}}

Where i is the imaginary number \sqrt{-1}.

Since each element takes N computations for all N elements, it is not hard to see that both DFT and IDFT takes O(N^2).

Calculate An Example by Hand

It always helps to compute the DFT by hand first to get a feel for it. To do this we will need Euler’s formula to break down the exponential to its sine cosine equivalents:

e^{iy}=\cos y + i \sin y

In this context, y is \frac{2\pi n k}{N}.

Example: Let x be a sequence (1, 2, 3, 4) and N = 4, and we wish to convert it to the frequency domain.

\begin{array}{lcl}F_0 & = & 1\cdot e^{-\frac{i2\pi 0 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 0 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 0 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 0 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^0 + 3\cdot e^0 + 4\cdot e^0 \\ & = & 10\end{array}

\begin{array}{lcl}F_1 & = & 1\cdot e^{-\frac{i2\pi 1 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 1 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 1 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 1 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^{-\frac{i\pi}{2}} + 3\cdot e^{-i\pi} + 4\cdot e^{-\frac{i3\pi}{2}} \\ & = & 1 + 2\left(\cos \frac{-\pi}{2} + i\sin \frac{-\pi}{2} \right) + 3 \left( \cos(-\pi) + i \sin(-\pi) \right) + 4 \left( \cos\frac{-3\pi}{2} + i\sin\frac{-3\pi}{2}\right) \\ & = & 1 + 2\left(0 - i\right) + 3 \left( -1 + 0 \right) + 4 \left( 0 + i\right) \\ & = & -2+2i\end{array}

F_2 = -2

F_3 = -2-2i

Now, if you take the values of F and pass through the IDFT, it should return you back the sequence x, though I will not show it here.


The C++ program below simply runs the same sequence in the example through DFT, and pass the result through IDFT to get back the original sequence. I placed more emphasis on readability of the code, so it is not designed to be optimized but rather to prove the correctness of the mathematical formulations of DFT and IDFT.

#include <iostream>     
#include <complex>      
#include <cmath>
#include <iomanip>
#include <vector>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

xvec dft(const dvec &input)
    double N = input.size();
    xvec X(N);

    for (double k = 0; k < N; ++k) {
        for (double n = 0; n < N; ++n) {
            X[k] += (double)input[n] * exp(-2. * J * PI * n * k / N);

    return X;

dvec idft(const xvec &input)
    double N = input.size();
    xvec x(N);
    dvec out(N);

    for (double k = 0; k < N; ++k) {
        for (double n = 0; n < N; ++n) {
            x[k] += input[n] * exp(2. * J * PI * n * k / N);
        out[k] = x[k].real() / N;

    return out;

int main()
    cout << fixed << setprecision(2);

    dvec input = { 1, 2, 3, 4 };
    // convert from time to frequency domain
    xvec freqdom = dft(input);

    for (const auto &f : freqdom) {
        cout << f << endl;
    cout << endl;
    // convert from frequency to time domain
    dvec timedom = idft(freqdom);
    for (const auto &t : timedom) {
        cout << t << ' ';
    cout << endl;

The output of this program will then be:


1.00 2.00 3.00 4.00

With DFT and IDFT working properly we now use it to implement convolution:

dvec convolve(const dvec &a, const dvec &b) 
    // calculate degree of resulting polynomial
    size_t N = 2 * a.size() - 1;
    // extend size and pad with 0
    dvec acof(N, 0), bcof(N, 0);
    copy(a.begin(), a.end(), acof.begin());
    copy(b.begin(), b.end(), bcof.begin());
    xvec apv, bpv, cpv(N);
    // evaluation
    apv = dft(acof);
    bpv = dft(bcof);
    // point-wise multiplcation
    for (size_t i = 0; i < N; ++i) {
        cpv[i] = apv[i] * bpv[i];
    // interpolation
    return idft(cpv);

See the full working DFT convolution program from this gist.


So, we covered DFT and vector convolution, and applied it to solve polynomial multiplication. What’s the problem now? Pointwise multiplication, in addition with evaluation and interpolation using DFT and IDFT, uses a time complexity of O(N) + 2 \times O(N^2) \approx O(N^2). That’s actually no different, if not worst than using the naive method. To circumvent this, I will introduce FFT in the next post, which will compute the evaluation and interpolation in O(N \log N) time. Stay tuned!

Update: Part 2 is up!

Setting Up CUDA + cuDNN for Theano in Ubuntu

This is my personal notes on setting up Theano with my laptop GPU. It is basically an amalgam of various sources that I pieced together to make everything work, which I will link at the end of this post.

As of this writing, this is my setup:

  • Linux Mint 18 (Cinnamon) 64-bit (based on Ubuntu 16.06)
  • NVIDIA GT740M (Kepler architecture)
  • Theano 0.8.2

NVIDIA Graphic Drivers

Linux Mint gives you an option to install the drivers from the settings, but it may be dated. To get the latest drivers, you may install the drivers via PPA:

IMPORTANT: You need to install the drivers first, before installing CUDA. The order is very important, since CUDA checks what version of the graphic driver you are using and installs accordingly. On a related note – should you upgrade/downgrade your graphics driver, you will need to install CUDA again. I emphasize this, because should you fail to do so, the errors that proceed it gives you no indication whatsoever that you screwed this step up.


Choose CUDA for download in As of this writing the latest version is CUDA 7.5, so that is what I downloaded. After you download, there are instructions that they suggest you run. Not all of them will work. So follow these suggested steps instead:

Open a terminal in the download directory and enter the first command they suggested for you in the downloads site. It should look like this:

sudo dpkg -i cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb

Change your /var/cuda-repo-7-5-local/Release to the following:

Origin: NVIDIA
Architecture: repogenstagetemp
 51483bc34577facd49f0fbc8c396aea0 75379 Packages
 4ef963dfa4276be01db8e7bf7d8a4f12 21448 Packages.gz
 532b1bb3b392b9083de4445dab2639b36865d7df1f610aeef8961a3c6f304d8a 75379 Packages
 2e48cc13b6cc5856c9c6f628c6fe8088ef62ed664e9e0046fc72819269f7432c 21448 Packages.gz

Run (ignoring warnings about invalid signatures, and you’re done):

sudo apt-get update

Then run:

sudo apt-get install cuda

Keep an eye out the output. There should not be any errors. This will install CUDA in /usr/local/cuda/

Add CUDA To Environment Paths

Open ~/.bashrc  and append the following (you may need to do this in sudo mode):

export PATH=/usr/local/cuda/bin:$PATH 
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Once you save, enter in a terminal:

sudo source ~/.bashrc

This will add the CUDA executables to the environment paths. Note that currently opened terminals will not have CUDA added to the environment paths. You will need to restart them (open and close) for changes to take affect.

And then you can open a new terminal and type nvcc (Nvidia CUDA Compiler) to see whether the environment is set correctly. It should not output any errors.

Solving gcc/g++ Incompatibilities

CUDA requires a compatible C/C++ compiler to work. The one that comes bundled with Ubuntu isn’t. To fix this, enter the following:

sudo apt-get install gcc-4.9 g++-4.9

Then we may establish a soft link of the specific version for the CUDA binaries folder:

sudo ln -s /usr/bin/gcc-4.9 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-4.9 /usr/local/cuda/bin/g++

IMPORTANT! Now, if you run import theano for the first time with the THEANO_FLAGS environment variable containing device=gpu, theano complains that CUDA is not available. To run any python script that uses Theano, you need to prepend the command with THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES. All python scripts executed here will be using this workaround. Alternatively there is a fix here: (thanks Anonoz for the suggestion).


Now running the following line:

THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES python -c "import theano; print(theano.sandbox.cuda.device_properties(0))"

Should give you something like this:

Using gpu device 0: GeForce GT 740M (CNMeM is disabled, CuDNN not available)
{'major': 3, 'tccDriver': 0, 'kernelExecTimeoutEnabled': 1, 'deviceOverlap': 1, 'driverVersion': 8000, 'warpSize': 32, 'concurrentKernels': 1, 'maxThreadsPerBlock': 1024, 'computeMode': 0, 'canMapHostMemory': 1, 'maxGridSize2': 65535, 'maxGridSize1': 65535, 'maxGridSize0': 2147483647, 'integrated': 0, 'minor': 0, 'ECCEnabled': 0, 'runtimeVersion': 7050, 'textureAlignment': 512, 'multiProcessorCount': 2, 'clockRate': 895000, 'totalConstMem': 65536, 'name': 'GeForce GT 740M', 'memPitch': 2147483647, 'maxThreadsDim1': 1024, 'maxThreadsDim0': 1024, 'maxThreadsDim2': 64, 'coresCount': -2, 'sharedMemPerBlock': 49152, 'regsPerBlock': 65536}


NVIDIA provides a library for common neural network operations that especially speeds up Convolutional Neural Networks (CNNs). For Lasagne, it is necessary that you install this to get a convnet to work. It can be obtained from NVIDIA (after registering as a developer):

Don’t expect an instant email upon registration. For some reason it takes quite a while for them to send that email. I waited about 30 minutes.

Once you are in, choose version 4. That’s the one currently supported by Theano.

To install it, copy the *.h files to /usr/local/cuda/include and the lib* files to /usr/local/cuda/lib64

To check whether it is installed, run

THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES python -c "from theano.sandbox.cuda.dnn import dnn_available as d; print(d() or d.msg)"

It will print True if everything is fine, or an error message otherwise. There are no additional steps required for Theano to make use of cuDNN.

Again, if everything if successful, you run your python scripts as such (the following is deep_q_rl, a Theano-based implementation of Deep Q-learning using Lasagne):

 THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES python --rom breakout

References (in order):

Synchronizing BibTeX in Overleaf (BibLaTeX) and Texmaker (MiKTeX, Apacite)

In this post I detail how to get bibtex working on Overleaf (previously known as WriteLatex) and Texmaker (Windows 10 64-bit, MikTeX). Note that the citation format I’m using is APA, as specified by my university.

Overleaf may have the advantage of having collaborative editing with (almost) live previewing, but I hit a lot of problems getting the documents with bibtex I wrote there to compile in Texmaker. It just doesn’t compile. Conversely, copy-pasting working bibtex code from TexMaker into Overleaf pulls out compile errors.

So the best workflow I can come out with at the moment is this: Create latex document from my template (get from here: Overleaf to Texmaker Bibtex Template. There should be 2 files: main.tex and ref.bib. If the template link is not working, you can get from this Github gist instead), edit the latex document collaboratively in Overleaf, and then when you need it to compile in Texmaker, download the project as a zip and change some code.

Fortunately, it’s only 2 blocks of code, annotated as “SETUP DOCUMENT” and “END DOCUMENT”. You’ll find this in the start and end of the latex document respectively. These code blocks (provided in the template; the texmaker version is commented out) will need to be changed when moving your code back to TexMaker.











If you receive warning messages in TexMaker that goes something like

Citation ‘blablabla2007’ undefined

when you press F1 (quickbuild), you will need to enable bibtex in your build. To do this, go to “options > Configure TexMaker” and under “Quick Build” tab, select the quick-build command “PdfLatex + Bib(la)tex + PdfLaTeX (x2) + View Pdf”

2016-05-12 15_44_11-Settings

Tip: You can generate bibtex code from easily with

There is another variation of this: separate your content into another *.tex file, and then you have 2 master documents – one for Overleaf (main.tex) and another you use for TexMaker (main-texmaker.tex or whatever name you want) – which both includes the same content file. In TexMaker, you’ll need to set one document to be a master document to work with multiple files.

Have fun with Overleaf!

CS231n – Assignment 1 Tutorial – Q2: Training a Support Vector Machine

This is part of a series of tutorials I’m writing for CS231n: Convolutional Neural Networks for Visual Recognition. Go to this page to see the full listing.

To conserve space, I won’t be placing my full solutions in this post. Refer to my github repository for full source code.

Loss is fairly straightforward so I will be skipping that. After you successfully implemented that, it should give a value between 8 to 10. If you replace W with a zero matrix instead, the loss value will be exactly 9.

Though not in the lecture video, you can find the formula for the gradient in the lecture notes. We define margin as a matrix of margins where:

\text{margin}_{i, j} = w_j \cdot x_i - w_{y_i}\cdot x_i + \Delta

For each sample i,  we look through the n number of margins (n being the number of classes we want to classify).

If that class is the correct class (denote as y_i), it contributes the sample row i itself with a weight of the negation of the total number (where 1 is the indicator function that is one if the condition inside is true or zero otherwise) of incorrect classes where the margin is > 0, given n number of classes:

\nabla_{w_{y_i}} L_i = - \left( \sum\limits_{j=1, j\neq y_i}^n 1(\text{margin}_{i,j} > 0) \right) x_i

If that class is incorrect, it contributes the sample row i depending on whether the margin is >0:

\nabla_{w_j} L_i = 1 \left( \text{margin}_{i,j} > 0 \right) x_i

After you implement the naive version of SVM gradient, it should match up fairly closely with the numerically computed gradient:

numerical: 0.449971 analytic: 0.449971, relative error: 5.636661e-10
numerical: -16.007355 analytic: -16.007355, relative error: 6.481167e-12
numerical: 9.938192 analytic: 9.938192, relative error: 8.279972e-12
numerical: 6.561015 analytic: 6.540654, relative error: 1.554020e-03
numerical: 0.151951 analytic: 0.151951, relative error: 1.675199e-09
numerical: 3.936123 analytic: 3.936123, relative error: 7.290840e-11
numerical: -28.234382 analytic: -28.234382, relative error: 4.143686e-12
numerical: -0.452718 analytic: -0.452718, relative error: 4.690168e-10
numerical: -22.248917 analytic: -22.248917, relative error: 1.991237e-11
numerical: 20.414164 analytic: 20.381418, relative error: 8.027022e-04
numerical: 10.663350 analytic: 10.663350, relative error: 8.140355e-12
numerical: 2.612748 analytic: 2.612748, relative error: 6.536924e-11
numerical: -4.644796 analytic: -4.644796, relative error: 5.570529e-12
numerical: 15.987607 analytic: 15.987607, relative error: 2.419968e-11
numerical: -35.476517 analytic: -35.476517, relative error: 9.777531e-12
numerical: -25.275914 analytic: -25.275914, relative error: 5.744716e-12
numerical: -3.928484 analytic: -3.928484, relative error: 1.384318e-11
numerical: -16.666177 analytic: -16.666177, relative error: 1.360765e-12
numerical: 17.289968 analytic: 17.289968, relative error: 4.612259e-11
numerical: 4.519505 analytic: 4.519505, relative error: 1.483479e-11

On average if the relative error is e-8 instead, check that you have implemented regularization. This is not explicitly given in the notes, but it is hinted in the assignment:

you didn’t forget the regularization gradient did you?

You can find the regularization gradient by finding the derivative of the loss function regularization. Given that the loss function regularization is \frac{1}{2}\lambda \sum W^2 where \lambda is the regularization parameter, the derivative of that with respect to W is then \lambda \sum W.

To implement vectorization we need to fuse operations together. So now we try to find the gradient of a given class k for all n classes. The result is a single row with the number of columns being the number of input features (1 x 3073).  We define that the total number of samples is m. This would be the mean of all correct (\nabla_{w_{y}} L) and incorrect (\nabla_{w_j} L) samples given the class k:

\nabla_{w_k} L = \frac{1}{m}\left[ \nabla_{w_j} L + \nabla_{w_{y}} L \right]

\nabla_{w_{y}}L= \sum\limits_{i=1}^{m}\left[- \left(  \sum\limits_{j=1, j \neq y_i}^{n} 1 \left( \text{margin}_{i,j} > 0 \right) \right) x_{i}\right]

\nabla_{w_{j}}L= \sum\limits_{i=1}^{m} 1 \left( \text{margin}_{i,j} > 0 ) x_i \right)

This can be implemented in a few lines of python code:

# Semi-vectorized version... It's not any faster than the loop version.
num_classes = W.shape[1]
incorrect_counts = np.sum(margins > 0, axis=1)
for k in xrange(num_classes):
  # use the indices of the margin array as a mask.
  wj = np.sum(X[margins[:, k] > 0], axis=0)
  wy = np.sum(-incorrect_counts[y == k][:, np.newaxis] * X[y == k], axis=0)
  dW[:, k] = wj + wy
dW /= num_train # average out weights

incorrect_counts counts the margins for each sample where margin > 0. The result is a single column vector is m number of rows. With this, for each class k, we select the samples in which the correct class is k. Because incorrect_counts is a vector, we need to change it to a matrix of size x 1 by adding [:, np.newaxis] so that the each element of the vector will be mapped to X for scalar multiplication.

There is a faster way, by simply using a single dot product, though it may be tough to figure it out. The intuition is that aside seeing it as converting from one dimension to another, an interesting way to look at matrix multiplication is that it is a grouping of sums of specific combination of rows together; each row in B is a sum of different row combinations of the matrix X, and you control it in the matrix A.

Let’s put this idea to a simpler context. Suppose we have a 4×3 matrix X, which maps to 4 samples (row) and 3 features (column). If we set k to 2, then A is a 2×4 matrix; k number of rows and m number of samples. A^TX will then return a matrix B with dimensions 2×3.

The following code snippet:

import numpy as np
X = np.arange(12).reshape((4, 3))
A = np.zeros((2, 4))
B =
print A.shape, X.shape, B.shape

Will produce the output:

(2, 4) (4, 3) (2, 3)

Each row in B maps to a class k and is a sum of specific rows in X. In A, each row maps to a class k, and each column maps to a specific sample. If we simplify B to be a boolean array of 1 and 0, then we choose 1 to select, 0 otherwise. For example:

\begin{bmatrix}0 & 1 & 0 & 1 \\1 & 0 & 1 & 0\end{bmatrix} \times \begin{bmatrix}0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \\ 9 & 10 & 12 \end{bmatrix} = \begin{bmatrix}12 & 14 & 16 \\ 6 & 8 & 10 \end{bmatrix}

Row 1 in B is a sum of row 2 and 4 of X, and row 2 in B is sum of row 1 and 3 of X. How matrix A chooses its rows is by multiplication: if the value is 0, multiplying gives an empty row and multiplying by 1 returns the row itself. If instead you change 1 to -2 or 0.5, it will be multiplied by that weight instead. Do play around this in your python console to get the intuition before proceeding.

With this we can attain a succinct formulation:

Q_{i, k} = \begin{cases} - \left(  \sum\limits_{j=1,  j \neq y_i}^{n} 1  \left(\text{margin}_{i,j} > 0 \right) \right)  & \text{if } k = y_i \\ 1 \left( \text{margin}_{i,k} > 0 \right) & \text{otherwise}\end{cases}

\nabla_{w} L = \frac{1}{m} \left( Q^{T}X \right)^T =\frac{1}{m} X^TQ

Now we can take this idea and apply in our code:

X_mask = np.zeros(margins.shape)
X_mask[margins > 0] = 1

Notice that this will factor only the incorrect classes because we previously set all the cells in the margin array that map to the correct class to 0:

margins[np.arange(num_train), y] = 0 # done this before

Of course, our X_mask matrix is incomplete because we didn’t factor in the correct classes. This is not too complicated to do. We understood that for every class k, there are rows in X where class k is the correct class. For each of these rows, we multiply it by the negation of number of incorrect classes where the margin is >0. We don’t need to recalculate this; we can already find this value for each of these rows from X_mask itself, since now each of its rows is 1 where the margin is >0 and 0 otherwise. Finding the number of incorrect classes where the margin is >0 is as simple as finding the sum of the columns for each row:

incorrect_counts = np.sum(X_mask, axis=1)

With this we can find the gradient matrix. The full implementation is below:

# Fully vectorized version. Roughly 10x faster.
X_mask = np.zeros(margins.shape)
X_mask[margins > 0] = 1
# for each sample, find the total number of classes where margin > 0
incorrect_counts = np.sum(X_mask, axis=1)
X_mask[np.arange(num_train), y] = -incorrect_counts
dW =

Lastly you should remember to regularize the gradient. The gradient with regularization is therefore:

\nabla_{w} L = \frac{1}{m} X^TQ + \lambda \sum W

The loss function is visualized:

loss vs iteration number

If the graph is drastically more fuzzy or it doesn’t seem to be decreasing, chances are somewhere in your code you are doing it wrong. Here’s what happen when I forgot to regularize the gradient:

loss vs iteration number forgot regularization

The loss is still decreasing, but not as fast as it should be.

The training and validation accuracy is visualized:

accuracy plot

In the last section code has been given to help you visualize the weights. With proper hyperparameters, more iterations will give a smoother result:

visualize weights

CS231n – Assignment 1 Tutorial – Q3: Implement a Softmax classifier

This is part of a series of tutorials I’m writing for CS231n: Convolutional Neural Networks for Visual Recognition. Go to this page to see the full listing.

To conserve space, I won’t be placing my full solutions in this post. Refer to my github repository for full source code.

The loss function is covered in the lecture and course notes. However, I find myself a little lost when looking for the formula for the gradient. What I find on the internet usually looks something like this:

\text{if } i = j :\; \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)
\text{if } i \neq j :\; \frac{\partial y_i}{\partial z_j} = -y_i y_j

And I have a lot of trouble figuring out how to implement it. Fortunately, there are notes on this, though not explicitly linked in the course site itself. Here is the link. I wasn’t able to see how these 2 formulas are also the derivative of the Softmax loss function, so anyone who is able to explain that I’d be really grateful.

For each sample, we introduce a variable p which is a vector of the normalized probabilities (normalize to prevent numerical instability. Refer to course notes: “Practical issues: Numeric stability.“):

f_i = W x_i

p_k = \frac{e^{f_k-\max_j f_j}}{ \sum_j e^{f_j-\max_j f_j} }

In my naive solution, this is implemented as a lambda function:

f_i = X[i].dot(W)
f_i -= np.max(f_i)
sum_j = np.sum(np.exp(f_i))
p = lambda k: np.exp(f_i[k]) / sum_j

The cross-entropy loss function L for a sample i is therefore:

L_i = -\log\left(p_{y_i}\right)

Total loss for all m samples with regularization is then:

L = \frac{1}{m}\sum_i \left[-\log\left(p_{y_i}\right)\right] + \frac{1}{2}\lambda \sum W^2

The gradient for a class k for a given sample i is therefore:

\nabla_{w_k}L_i = (p_k -1\{y_i = k\})x_i

(Recall that 1 is the indicator function that is one if the condition inside is true or zero otherwise)

When converting to vectorized implementation, as with SVM, you need to somehow change to use a single dot product operation. Fortunately this is simpler than SVM. We define a matrix Q where the number of rows (each sample as i) is the sample size, and the columns (each class as k) being the number of classes:

Q_{i, k} = p_k -1\{y_i = k\}

\nabla_{w} L = \frac{1}{m}  X^TQ + \lambda \sum W

Here is a snippet of the implementation:

num_train = X.shape[0]
f =
f -= np.max(f, axis=1, keepdims=True) # max of every sample
sum_f = np.sum(np.exp(f), axis=1, keepdims=True)
p = np.exp(f) / sum_f
ind = np.zeros_like(p)
ind[np.arange(num_train), y] = 1
Q = p - ind
dW =

Though not in the python notebook, I added the loss plot function (with some slight alterations) from the SVM notebook to visualize the loss as the number of iterations increase. You can refer to my python notebook for the code; I will only show you the plot result, which looks remarkably similar to SVM:


You should be able to get a validation accuracy pretty close to 0.4 (I got about 0.39).

When visualizing the weights, you should get a somewhat similar result to the SVM classifier:


Converting a Java Project to Use JPA

In this post I walk through some of the gotchas when converting a java application that works with raw SQL strings to one using ORM (Object Relational Mapping) via JPA (Java Persistence API). I will be converting a simple application from “Generic Java GUI Netbeans Project with Embedded Database” that only has 2 entities: Post and Comment.

The finished product is in a git branch called JPA. If you don’t use git you can download this sample project as a zip from MediaFire.

You can view all the changes to convert to JPA in this Github diff.

Netbeans makes the initial setup very simple by generating persistence.xml (you find the persistence unit name here) for you, as well as the the entities for you from your database.

SQL needs to be rewritten to Java Persistence Query Language

This isn’t much an issue really; in the long run it does you a favour since it is database vendor independent.

Change from:

ResultSet rs = db.executeQuery("SELECT * FROM post ORDER BY date_created DESC");


List<Post> rs = db.createQuery("SELECT p FROM Post p ORDER BY p.dateCreated DESC").getResultList();

Default Values are Lost

I noticed something strange when adding a Post entity: the created_date attribute shows up as a null when I convert to use JPA. My DDL (Database Definition Language) looks like this (Derby DB SQL):

    name VARCHAR(250) NOT NULL,
    content VARCHAR(500),

So each time I create a Post, I’m expecting the date_created attribute to show the current date, but it doesn’t. So all the SQL code where I have DEFAULT is basically replaced with null when I use JPA.


The workaround is to code the default values into the attribute field of the entity classes. Here is the dateCreated attribute inside Post:

@Column(name = "DATE_CREATED")
private Date dateCreated = new Date(); // new Date() returns the current timestamp

Exceptions in JPA Only Happen in Runtime

So when converting the code, I realized that in the places where SQLException would appear, Netbeans puts up an error saying that SQLException doesn’t happen here:

Sqlexception is never thrown

That’s ok I think. But what’s weird was that it offered to remove the try-catch block as a solution! Wow wow wow, stop. Aren’t there exceptions? Well, turns out there’s PersistenceException.

The problem? It’s a subclass of RuntimeException. I don’t exactly know what was the reason the exception was happening in runtime, but without the try-catch block, the procedure is going to fail (Here the Post entity cannot have null value for Name attribute) silently.

Now for a before and after. Before:

String sql = "INSERT INTO post (name, title, content) VALUES (?, ?, ?)";
try {
    PreparedStatement ps = core.DB.getInstance().getPreparedStatement(sql);
    ps.setString(1, authorField.getText());
    ps.setString(2, titleField.getText());
    ps.setString(3, postTxtArea.getText());
    JOptionPane.showMessageDialog(this, "Post has successfully been added.", "Successfully added post!", JOptionPane.INFORMATION_MESSAGE);
    dispatchEvent(new WindowEvent(this, WindowEvent.WINDOW_CLOSING));
} catch (SQLException ex) {
    JOptionPane.showMessageDialog(this, ex.toString(), "Invalid content... or some shit like that", JOptionPane.ERROR_MESSAGE);


try {
    Post p = new Post();
    JOptionPane.showMessageDialog(this, "Post has successfully been added.", "Successfully added post!", JOptionPane.INFORMATION_MESSAGE);
    dispatchEvent(new WindowEvent(this, WindowEvent.WINDOW_CLOSING));
} catch (PersistenceException ex) {
    JOptionPane.showMessageDialog(this, ex.toString(), "Invalid content... or some shit like that", JOptionPane.ERROR_MESSAGE);

The following dialog should pop up when the Author field is empty:


Well, this output doesn’t just happen automatically. There’s still one more issue that I’ll get to next:

Empty Strings are not Null

In my DDL, I have a rule that Post cannot have null value for Name attribute. Yet for some reason a string “” is not a null value in JPA. It is actually stored in the database as “”.

How is “” not null?

stored as empty string

This kind of shit doesn’t happen when simply working with raw SQL strings.

There are workarounds for this online: one of them was using the annotations @Size(min=1) or @NotNull. Unfortunately I’m using Java SE 8 (@Size is currently supported up to Java EE 7 as of this writing) and I’m not using Spring (for @NotNull).

So what I ended up doing was place the validation in the setter method:

public void setName(String name) {
    if (name.trim().isEmpty()) return; // empty strings begone! = name;

As you can imagine, I have to do this for every attribute in every entity that doesn’t accept empty strings.

You Need to Manually Manage One To Many Relationships

I had this issue that when I add a one to many relationship: Post has many Comments, now I add a Comment to a Post. I don’t see the changes immediately though; had to restart the application to see the new Comments. I would have thought that JPA would update the Comments collection inside the Post, but it didn’t.

Here’s how my setPostId function in my Comment entity looks like:

public void setPostId(Post postId) {
    this.postId = postId;

This is because (as aggravating as it sounds) in JPA, it is the responsibility of the application, or the object model to maintain relationships. By that it meant that it is the responsibility of the programmer to manually code in the links when a setter is called:

public void setPostId(Post postId) {
    this.postId = postId;
    postId.getCommentCollection().add(this); // programmer needs to add this himself

I’m going to be a bit honest here: this is kinda inconvenient, and retarded. In Rails this sort of things is done automatically, and it should be. In JPA I need to remind myself that each time I have a one-to-many relationship I need to manually link them together or it doesn’t update automatically/only updates after application is restarted.


There may be other quirks I may miss out since I’m only dealing with a very simple project that doesn’t even have update and delete, so feel free to let me know about in the comments below.

Yea, I know I complain a bit, but working with raw SQL strings is going to be pretty hard to manage in the long run. You’re dealing with simply strings, most errors only pop up in runtime, and your IDE can’t assist you much. With JPA when you’re writing in Java Persistence Language, Netbeans can actually figure out data types and attribute names:

autocomplete in java persistence language string

So unless you really need that sort of performance boost, using an ORM would be the right call.