# Deep Q-Learning 101: Part 3 – Deep Q-Learning

This is a 3-part series on Deep Q-Learning, written so that undergrads with high school maths should be able to understand it and hit the ground running on their deep learning projects. This series is really just the literature review section of my final year report (which is on Deep Q-Learning) broken into 3 chunks.

## Introduction: The Atari 2600 Challenge

The Atari 2600 (or Atari VCS before 1982) is a home video game console released on September 11, 1977 by Atari, Inc.

Atari 2600 with standard joystick

The challenge is as follows: can an AI, given the same inputs as a human player, play a variety of games without supervision? In other words, if the AI can hold a joystick and see the same screen as a human player, can it teach itself how to play the game just by playing the game?

This means that instead of having the programmer hard code the rules of the game and the AI learn an optimal way to play (as is the case with most prior RL agents), the AI will need to figure out the rules of the game by seeing how the score changes with the moves it makes.

Using an actual Atari 2600 will prove too arduous because there is a great deal of unpredictability in the physical world; to circumvent this, researchers at DeepMind used an emulator called the Arcade Learning Environment or ALE (Bellemare, Naddaf, Veness, & Bowling, 2013), which simulates a virtual Atari 2600 inside our computer. From there we can program inputs into ALE, and receive game screens ($210 \times 160$ pixel images) and the score (a number).

There are 18 possible actions that the agent can take. I list them in the table below:

Possible actions an agent can take in ALE

The following sections dive into the individual procedures that compose Deep Q-Learning.

## Preprocessing

The raw Atari 2600 screens ($210 \times 160$ pixels with a 128-color palette) need to be preprocessed to reduce the input dimensionality and remove unwanted artifacts. Some frame encoding is used to remove flickering (not all sprites are rendered in every frame, due to the hardware limitations of the Atari 2600) that is usually not noticeable to the human eye. The images are then scaled to $84\times 84$ frames, from which we extract the luminance values. The luminance values can be calculated from RGB via the formula $0.2126\cdot R + 0.7152\cdot G + 0.0722\cdot B$. For each input, the above preprocessing step (defined as a function $\phi$) is applied to the 4 most recent frames at any given time step, giving input dimensions of $84 \times 84 \times 4$.
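To make the shapes concrete, here is a minimal sketch of $\phi$ in NumPy. The helper names (`luminance`, `resize_nearest`, `phi`) are my own, the nearest-neighbour scaling is only a stand-in for whatever resizing the real pipeline uses, and the flicker removal step is left out entirely.

```python
import numpy as np

def luminance(rgb_frame):
    """Convert an (H, W, 3) RGB frame to an (H, W) luminance image."""
    weights = np.array([0.2126, 0.7152, 0.0722])
    return rgb_frame.astype(np.float64) @ weights

def resize_nearest(image, out_h=84, out_w=84):
    """Nearest-neighbour resize (an assumption; any image scaling would do)."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows][:, cols]

def phi(frames):
    """Scale the 4 most recent frames, take luminance, and stack them."""
    processed = [luminance(resize_nearest(f)) for f in frames[-4:]]
    return np.stack(processed, axis=-1)

# Four random frames standing in for real ALE screens.
frames = [np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
          for _ in range(4)]
state = phi(frames)
print(state.shape)  # (84, 84, 4)
```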

## Q-Network Architecture

Q-Learning’s iterative update converges to an optimal solution, but in its basic form it is impractical here, as a Q-value is stored separately for every state and action, without any generalisation. With large state spaces (such as every possible combination of pixels in an image), we would easily run out of memory trying to store every single possible Q-value. Therefore, it is more common to use a function approximator to estimate the Q-values. In Deep Q-Learning a non-linear function approximator (an ANN) is used. Such ANNs are known as Q-networks. The current estimate then becomes:

$y = r + \gamma \max_{a'} Q(s', a', w)$

where $w$ is the weights of the ANN. The table below shows the CNN architecture used in Deep Q-Learning (conv = convolutional layer, fc = fully connected layer):

DeepMind’s Q-Network Architecture

The output of the CNN is the predicted score of each of the 18 possible actions in ALE; Deep Q-Learning then simply chooses the action (an integer between 0 and 17) for which the predicted score is the highest.

Notice that there are no pooling layers in the CNN. What pooling layers do is make the CNN insensitive to the location of an object in the image (we say the CNN becomes translation invariant). This comes in handy for image classification tasks, but for a video game we would not want to discard the location of the sprites, as these are crucial in determining the reward.

The agent plays the game one action at a time. It takes an action by running 4 frames through the weights of the CNN, and with every action it takes, a training step is executed where samples of 4 frames are drawn from the pool of stored data and used to train the CNN.

## The Loss Function

We now introduce a loss function $L$ (a squared error loss) that tells us how far we are from the Bellman optimum ($Q^*(s, a)$). At a time step $t$, we set a Bellman backup $y_t$ (also known as the target), followed by the loss function (de Freitas, 2014):

$y_t = \mathbb{E}_{s'}\left\lbrace r + \gamma \max_{a'} Q(s', a', w_{t-1})\right\rbrace$

$L_t(w_t) = \left[y_t - Q(s, a, w_t) \right]^2$

What we want to do is update the weights $w_{t}$ of the Q-function, $Q(s, a, w_{t})$, but for the target we use the previous weights $w_{t-1}$ (this is not mentioned in the research paper (Mnih et al., 2013), but it is evident in the implementation). What we are essentially doing here is approximating Bellman by minimizing the difference between the current estimated reward $y$ and the past estimated reward $Q(s, a, w)$. Notice that this difference is the TDError discussed in part 2. So we use a supervised learning technique (neural networks), but alter the loss function so that it uses an RL technique (Q-Learning). Another way to see this is to see $y$ as the critic and $Q(s, a, w)$ as the actor; the critic informs the actor how well it has done.

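Now to update the weights via backpropagation, we need to compute the gradient, which is the derivative of the loss function $L_t(w_t)$ with respect to the weights $w_t$ (the target $y_t$ is treated as a constant, since it depends only on $w_{t-1}$):

$\nabla_{w_t} L_t(w_t) = -2\left( r + \gamma \max_{a'} Q(s', a', w_{t-1}) - Q(s, a, w_t) \right) \nabla_{w_t} Q(s, a, w_t)$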

During implementation this is normally abstracted out in a neural network library like Neon; we simply define that we are using a mean squared loss, and specify the target $y$ and $Q(s, a, w)$ at each iteration, and Neon figures out the loss and gradients when doing forward and backward propagation.
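To see what the library is doing for us, here is a rough sketch of the loss and its gradient written by hand. It is not Neon code: I swap the CNN for a linear Q-function (one row of weights per action) so that the gradient of $[y - Q(s, a, w)]^2$ can be written out explicitly, and the function names are hypothetical.

```python
import numpy as np

def q_values(state, w):
    """Linear stand-in for the Q-network: one row of weights per action."""
    return w @ state

def td_loss_and_grad(s, a, r, s_next, w, w_prev, gamma=0.99):
    """Squared error [y - Q(s, a, w)]^2 and its gradient with respect to w."""
    # The target is built from the previous weights and held constant.
    y = r + gamma * np.max(q_values(s_next, w_prev))
    td_error = y - q_values(s, w)[a]
    loss = td_error ** 2
    grad = np.zeros_like(w)
    # For a linear Q, the gradient of Q(s, a, w) is s on the row of action a.
    grad[a] = -2.0 * td_error * s
    return loss, grad

rng = np.random.default_rng(0)
w_prev = rng.normal(size=(3, 4))   # 3 actions, 4 state features
w = w_prev.copy()
s, s_next = rng.normal(size=4), rng.normal(size=4)

loss, grad = td_loss_and_grad(s, 1, 1.0, s_next, w, w_prev)
w_new = w - 0.01 * grad            # one gradient descent step
new_loss, _ = td_loss_and_grad(s, 1, 1.0, s_next, w_new, w_prev)
print(loss, new_loss)              # the loss should not increase
```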

## Sampling From Experience

So now as the AI plays a game in ALE it receives a stream of inputs, along with the game score. Instead of training the Q-network directly on this stream of inputs, we store these experiences in a replay memory (Lin, 1993). In terms of MDP, for each discrete time step $t$, we store an experience $e_t$ as $(s_t, a_t, r_t, s_{t+1})$ in our replay memory $D=e_1, \ldots,e_N$, where $N$ is the maximum capacity of the replay memory (a fixed constant defined by us).

Now at each update step of the Q-network, we apply backpropagation to samples of experience drawn at random from our replay memory $D$. This breaks the correlation between consecutive training samples, and makes the training task a lot more similar to supervised learning.
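A replay memory can be sketched in a few lines of Python. The class name and its methods are my own, and I assume that the oldest experience is dropped once the capacity $N$ is reached.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next) experiences."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when it is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Draw experiences uniformly at random, without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=5)
for t in range(8):
    memory.store(t, 0, 1.0, t + 1)   # dummy integer states

batch = memory.sample(3)
print(len(memory), batch)
```

Only the 5 most recent experiences survive here, so every sampled state comes from time steps 3 to 7.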

## Exploration-Exploitation

Now comes another problem: The AI initially explores the game, but will always choose the first strategy it finds. This is because the weights of the CNN are initialized with random values, and therefore the actions of the AI in the start of the training will be random as well. In other words, the AI tries out actions it has never seen before at the start of the training (exploration). However, as weights are learned, the AI converges to a solution (a way of playing) and settles down with that solution (exploitation).

What if we do not want the AI to settle for the first solution it finds? Perhaps there could be better solutions if the AI explores a bit longer. A simple fix for this is $\epsilon$-greedy policy, where $\epsilon$ is a probability value between 0 and 1: with a probability of $\epsilon$ select a random action; and with $1-\epsilon$ probability exploits what it has learned.
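As a small sketch, $\epsilon$-greedy action selection over a list of predicted Q-values could look like this (the function name and the example values are made up for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q = [0.1, 0.9, 0.3]
print(epsilon_greedy(q, epsilon=0.0))  # always exploits: 1
print(epsilon_greedy(q, epsilon=1.0))  # always explores: 0, 1 or 2
```

A common choice is to start with a large $\epsilon$ and reduce it as training goes on, so the AI explores a lot at first and exploits more later.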

## The Deep Q-Learning Algorithm

We now piece everything we have discussed so far into the Deep Q-Learning Algorithm (Mnih et al., 2013), given in Algorithm 4:

Notice that when we select an action using $Q^*$, we use the previous weights $w_{t-1}$ instead of the current weights $w_t$.

I will now clarify the algorithm concerning the weights $w$. Remember that in the loss function, to calculate the current prediction $y_j$, we used the previous weights $w_{t-1}$. What happens if $t=1$? We will use random weights. In implementation, this simply means we keep track of 2 sets of weights: one from a past time step, and one from the current time step. Note that we do not necessarily need to use the weights immediately before $w_t$. As long as they belong to a past time step it will work ($w_{t-k}$, where $k$ is a fixed constant).
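Here is one way the bookkeeping for the 2 sets of weights might look. This is only a sketch: the class is hypothetical, and copying the current weights into the past weights every $k$ updates is my assumption about how the past weights are refreshed.

```python
import numpy as np

class TwoWeightTracker:
    """Keeps the current weights and a copy from a past time step."""

    def __init__(self, initial_weights, k):
        self.w = initial_weights.copy()       # current weights
        self.w_past = initial_weights.copy()  # used to compute the target
        self.k = k
        self.steps = 0

    def apply_update(self, delta):
        """Update the current weights; refresh the past copy every k steps."""
        self.w = self.w + delta
        self.steps += 1
        if self.steps % self.k == 0:
            self.w_past = self.w.copy()

tracker = TwoWeightTracker(np.zeros(2), k=3)
for _ in range(4):
    tracker.apply_update(np.ones(2))
print(tracker.w, tracker.w_past)  # [4. 4.] [3. 3.]
```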

## Afterword

I took some material from Tambet Matiisen’s great series on Deep Q-Learning (Demystifying Deep Reinforcement Learning, Deep Reinforcement Learning with Neon), and also referred to his Simple DQN implementation. There are also some parts that I referenced from de Freitas’s lectures on Deep Reinforcement Learning (part of his machine learning course).

## References

• Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013, 06). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
• de Freitas, N. (2014). Machine learning: 2014-2015. University of Oxford.
• Lin, L.-J. (1993). Reinforcement learning for robots using neural networks (Tech. Rep.). DTIC Document.
• Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

# Deep Q-Learning 101: Part 2 – Reinforcement Learning

This is a 3-part series on Deep Q-Learning, written so that undergrads with high school maths should be able to understand it and hit the ground running on their deep learning projects. This series is really just the literature review section of my final year report (which is on Deep Q-Learning) broken into 3 chunks.

## Prologue: A Brief History

As early as 1954, scientists began to speculate that psychological principles of learning would become important to building artificially intelligent systems (Minsky, 1954). The science and study of teaching computers to learn from experience eventually became known as reinforcement learning (RL).

Successful implementations of RL algorithms began in 1959 when Arthur Samuel published his work on an AI that could master checkers simply by playing against itself (Samuel, 1959). In this seminal work, he pioneered various ideas in RL, one of which spawned a class of learning techniques known as temporal difference (TD) learning. Later in 1992, Gerry Tesauro had similar success in the game of backgammon with his program, TD-Gammon (Tesauro, 1995), which surpassed all previous computer programs in its ability to play backgammon. What was new in Tesauro’s work is that TD-Gammon combined TD learning with another AI technique that is inspired by biological brains: a neural network.

However, in spite of these advances, in both Samuel’s checkers program and TD-Gammon, the input is distilled to what the computer can understand (board positions), and the rules are hard-coded into the machine. This was no longer the case when DeepMind published a novel technique known as Deep Q-Learning. It made its official debut by playing retro console games on an Atari 2600 emulator (Mnih et al., 2013), and in certain games the AI was even able to outmatch expert human players. What is significant about Deep Q-Learning is that the AI learns by seeing the same output on the game console as human players do, and figures out the rules of the game without supervision. As with TD-Gammon, Deep Q-Learning also uses an artificial neural network, albeit with a different architecture: a convolutional neural network (CNN).

## Introduction

In all the hype about machine learning in this era, some people call reinforcement learning (RL) a hybrid between supervised and unsupervised learning. Some people place reinforcement learning in a different field altogether, because knowing supervised and unsupervised learning does not mean one would understand reinforcement learning, and vice versa.

The core idea of RL comes naturally to us even as infants. We are born into a world which we know very little of, in a body which we did not choose, and with little supervision. We learn basic tasks, such as walking and talking, by continuously trying until we get it right. An infant does not have an understanding of failure and success; he makes attempts, sees that some things work and some don’t, and focuses more on what works. He does not get depressed or frustrated even if he keeps falling when he tries to walk. Really the only people getting emotional about his attempts are his parents. As we get older we derive more unnecessarily complex views of what should work and what success is, but that is beside the point here.

RL is the study of a computational approach to learning by trying. We teach a computer to learn as we do: by repeatedly interacting with the environment, and learning from prior experience. For example, a robot learns to find a path out of a maze by trying out different routes that it can see. To put it in more formal terms, we have an agent (robot) that starts in an initial state (robot’s position and what it sees) and takes an action (movement in some direction) in an environment (the maze), which then returns a reward (some score to let the robot know how close it is to finishing the maze) and a new state.

The agent-environment interaction in RL. (Sutton & Barto, 1998)

Should we describe supervised learning in terms of this agent-environment interface, we could say that an action (input) comes together with the reward (target); we train a model with both the input and the expected answer. RL introduces the concept of delayed rewards, where the agent is only aware of the reward after the action is taken. This means that during training, the agent attempts an action, but will not know if it is the best course of action (determined by a reward value) until after it has taken the action and landed in a new state.

## Markov Decision Process

Markov Decision Process (MDP) introduces a mathematical framework for describing decision making in a probabilistic environment. All RL algorithms in this series will be described using MDP.

Time in the real world is continuous; but in MDP, time is discrete, described in time steps in fixed intervals: $t = 0, 1, 2, 3...$, where $t$ increases with each action we take. In a given time step $t$, an MDP is defined by:

• A set of states $s \in S$, where $S$ is the set of all possible states.
• A set of all actions $a \in A(s)$, where $A(s)$ is the set of all possible actions that can be taken in the state $s$.
• A transition function $T(s_t, a_t, s_{t+1})$ which returns the probability that if at time $t$ we take an action $a_t$ in the state $s_t$ we would end up in new state $s_{t+1}$ in a next time step $t+1$.
• A reward function $R(s_t, a_t, s_{t+1})$, which is the consequence of the action and comes 1 time step later. This value is normally a number, and can also be a negative value (punishment).

Transition and reward functions are also written as $T(s, a, s')$ and $R(s, a, s')$ respectively; this is useful notation in cases where time is not a consideration. Reward also has a more succinct notation: $r_t$.
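To make these four ingredients concrete, here is a made-up two-state MDP written as plain Python data. The states, actions, probabilities and rewards are all invented for illustration; only the structure ($S$, $A(s)$, $T$, $R$) follows the definition above.

```python
import random

# S: the set of all states
states = ["cool", "warm"]

# A(s): the actions available in each state
actions = {"cool": ["slow", "fast"], "warm": ["slow"]}

# T(s, a, s'): maps each (s, a) to the probability of every next state s'
T = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "warm": 0.5},
    ("warm", "slow"): {"cool": 0.5, "warm": 0.5},
}

def R(s, a, s_next):
    """R(s, a, s'): going fast pays more in this toy example."""
    return 2 if a == "fast" else 1

def step(s, a, rng=random):
    """Sample a next state from T and return it with its reward."""
    next_states = list(T[(s, a)])
    probs = [T[(s, a)][n] for n in next_states]
    s_next = rng.choices(next_states, weights=probs)[0]
    return s_next, R(s, a, s_next)

print(step("cool", "slow"))  # ('cool', 1)
```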

In MDP, the possible actions in a current state depend only on the current state. What happened in the past and what will happen in the future do not affect what actions an agent can take now.

MDPs are used to model non-deterministic (or stochastic) search problems. This is different from deterministic search because unlike deterministic problems, non-deterministic problems do not have a fixed sequence of actions from start to finish as a final answer. The fundamental problem of an MDP is to find a policy $\pi$ (this is not the same as the constant pi 3.142), where $\pi(s) = a$; a policy returns an action given a state. The best solution $\pi^*$ is a policy that maximizes the sum of all rewards:

$\sum\limits_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})$

$\gamma$ (gamma) here is called the discount rate, and is a float between 0 and 1 set by us to determine the present value of future rewards. The reward of the current state remains unchanged regardless of the discount rate (because anything to the power of 0 is 1), whereas future rewards decay exponentially with each time step. If $\gamma$ approaches 1, the agent becomes very farsighted and takes future rewards seriously; if $\gamma$ approaches 0, the agent has tunnel vision and only values immediate rewards. In DeepMind’s Deep Q-Learning implementation, $\gamma$ is set to 0.99. The discount rate is important, for without it the infinite sum of rewards may grow without bound, and our learning algorithm may not be able to converge (or finish running).
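A quick way to get a feel for $\gamma$ is to compute the discounted sum for a short, made-up reward sequence:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite list of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1, 1, 1]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, 0.0))  # only the first reward counts: 1.0
print(discounted_return(rewards, 1.0))  # no discounting: 3.0
```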

## Utility

We now know that the goal of the agent is to decide on what actions to maximize the sum of rewards. But to decide on an action the agent needs to have an expectation of what reward it will get. This expected reward is known as the utility (or value). There are 2 ways we can compute a value (Marsland, 2015):

• State-value function, $V(s)$ – We consider the current state, and average across all of the actions that can be taken.
• Action-value function, $Q(s, a)$ – We consider the current state and each possible action that can be taken separately. This is also known as a Q-value. A Q-state $(s,a)$ is when you were in a state and took an action.

In either case we are thinking about what the expected reward would be if we started in state s (where $\mathbb{E}(\cdot)$ is the statistical expectation):

$V(s) = \mathbb{E}(r_t | s_t = s) = \mathbb{E} \left\lbrace \sum\limits_{i=0}^\infty \gamma^i r_{t+i+1} | s_t = s \right\rbrace$

$Q(s, a) = \mathbb{E}(r_t | s_t = s, a_t = a) = \mathbb{E} \left\lbrace \sum\limits_{i=0}^\infty \gamma^i r_{t+i+1} | s_t = s, a_t = a \right\rbrace$

The utility starting in a state $s$ and acting optimally is defined as $V^*(s)$ (optimal state-value function); the utility starting out having taken action $a$ from state $s$ and thereafter acting optimally is defined as $Q^*(s, a)$ (optimal action-value function) (Klein, 2014). If an agent knows all these $V^*$ or $Q^*$ values, it will be able to navigate through the state space such that it attains the maximum reward; in other words, our agent will always be acting optimally.

V∗ and Q∗ values in grid world example (Klein, 2014)

This is illustrated in the grid world example in figure above. Here the agent navigates from one box (each box represents a state) to another until it reaches a terminal state: a box with double outlines. In each state the agent chooses between 4 different directions, each of which is a possible action. Notice that the highest utility in each Q-state box is the utility of a V-state box.

## The Bellman Equations

How shall we characterize optimal utilities? We will use the Bellman equations. The intuition behind them is to take the correct first action, and from then onwards continue to be optimal. This is done in a recursive fashion:

$Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

$V^*(s) = \max_a Q^*(s, a)$

$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]$

In these equations, we find all possible consequent states that can result from taking action $a$ from state $s$. The utility of each of these consequent states ($s'$) is the reward of the transition plus the discounted future expected rewards of the consequent state; the latter is recursively denoted as $V^*(s')$. To prioritize the immediate reward we apply a discount rate $\gamma$ to future expected utilities. Because the consequent states are stochastic, we weight the accumulated utility of each consequent state by the probability that it is reached ($T(s, a, s')$) and sum over all of them.

## Value Iteration Algorithm

Now that we know what $V^*(s)$ and $Q^*(s,a)$ are, the next step is to compute these optimal utilities. To do this we use the value iteration algorithm.

We assume there will always exist an optimal Q-value for each Q-state. At first, all Q-states will be initialized to small random values. Then we will iterate through every possible state, evaluating its future states in the process. With a proper discount rate $\gamma$, the Q-value of each Q-state will eventually converge to its optimal value.

The value iteration algorithms for finding $V^*$ and $Q^*$ are given in Algorithm 1 and Algorithm 2 respectively.
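As a rough illustration, value iteration on a tiny made-up MDP can be sketched as below. This is my own minimal version rather than a transcription of the algorithms: it starts from zeros instead of small random values, and stops once the values change by less than a tolerance. The MDP (states `A` and `B`, reward 1 for landing in `B`) is invented.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    """Repeat the Bellman backup until V stops changing."""
    V = {s: 0.0 for s in states}
    while True:
        # Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]
        Q = {
            (s, a): sum(p * (R(s, a, s2) + gamma * V[s2])
                        for s2, p in T[(s, a)].items())
            for s in states for a in actions[s]
        }
        # V(s) = max over a of Q(s, a)
        new_V = {s: max(Q[(s, a)] for a in actions[s]) for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V, Q
        V = new_V

states = ["A", "B"]
actions = {"A": ["stay", "go"], "B": ["stay"]}
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0}, ("B", "stay"): {"B": 1.0}}

def R(s, a, s_next):
    return 1.0 if s_next == "B" else 0.0

V, Q = value_iteration(states, actions, T, R)
print(V)  # both values approach 1 / (1 - 0.9) = 10
```

Notice that `V["A"]` equals the larger of the two Q-values for state `A`, which is the relationship $V^*(s) = \max_a Q^*(s, a)$.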

Notice that in the value iteration algorithm, the probability of entering a state $T(s, a, s')$ is known. This is not usually the case in most real world environments. Also when the state space is large (or infinite), value iteration will never finish running. To circumvent these limitations, we introduce Q-Learning.

## Q-Learning

Q-Learning (Watkins, 1989) belongs to a class of RL methods called temporal difference learning. It offers an online algorithm to iteratively approximate Q-values without knowing the transitions between states.

This is done with an update function that finds the temporal difference between learning experiences. Temporal difference simply means the difference between the reward at this current step and the estimated reward we got at the previous step. The Q-values are then determined by a running average. Naturally, the more we iterate, the closer we converge to optimal $Q^*$ values.

To do this we break down how we find our Q-values: instead of recursively computing an infinite amount of time steps, we use a one-step value iteration update given by the Bellman equation, which is just depth-one expansion of the recursive definition of a Q-value:

$Q(s, a) = R(s, a, s') + \gamma \max_{a'} Q(s', a')$

This will be our estimated utility for the current state. Let us call this estimated utility $y$ and simplify $R(s, a, s')$ to $r$. The estimated utility then becomes:

$y = r + \gamma \max_{a'} Q(s', a')$

The Q-Learning algorithm is given in Algorithm 3:

In the update function, $y-Q(s,a)$ is also called the TDError or temporal difference error. $\alpha$ can be described as the learning rate, as it determines the weighting between current estimates and previous estimates. To show this more clearly, notice that these update functions are identical:

$Q(s, a) \gets Q(s, a) + \alpha \left[y - Q(s, a)\right]$

$Q(s, a) \gets (1-\alpha)Q(s, a) + \alpha \cdot y$
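
Expanding the bracket in the first form shows why the two are the same:

$Q(s, a) + \alpha \left[y - Q(s, a)\right] = Q(s, a) - \alpha Q(s, a) + \alpha \cdot y = (1-\alpha)Q(s, a) + \alpha \cdot y$

So with $\alpha = 1$ the old estimate is thrown away and replaced by $y$, and with $\alpha = 0$ the old estimate is never changed.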

As time passes, it is usually the case that we incrementally reduce $\alpha$, to give priority to past estimates.

Note that in Q-learning we do not use the policy to find the value of $a'$, but instead choose the one that gives the highest value. This is known as an off-policy decision.
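Below is a minimal tabular sketch of Q-Learning on a made-up corridor (positions 0 to 3, with a reward of 1 for reaching position 3). The environment, the function names and the hyperparameters are my own choices for illustration, and I add an $\epsilon$-greedy rule with random tie-breaking so the agent actually explores.

```python
import random
from collections import defaultdict

def q_learning(env_step, start_state, actions, is_terminal,
               episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-Learning; Q maps (state, action) to an estimated utility."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        while not is_terminal(s):
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, act)] for act in actions)
                a = rng.choice([act for act in actions if Q[(s, act)] == best])
            s_next, r = env_step(s, a)
            # Off-policy target: use the best next action, not the one we take.
            best_next = 0.0 if is_terminal(s_next) else max(
                Q[(s_next, a2)] for a2 in actions)
            y = r + gamma * best_next
            Q[(s, a)] += alpha * (y - Q[(s, a)])   # alpha * TDError
            s = s_next
    return Q

def corridor_step(s, a):
    """Move left or right along positions 0..3; reaching 3 pays 1."""
    s_next = max(0, s - 1) if a == "left" else s + 1
    return s_next, (1.0 if s_next == 3 else 0.0)

Q = q_learning(corridor_step, 0, ["left", "right"], lambda s: s == 3)
for s in range(3):
    print(s, round(Q[(s, "left")], 3), round(Q[(s, "right")], 3))
```

Note that the agent never looks at a transition function $T$; it only sees the states and rewards returned by `corridor_step`.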

## Afterword

If you want to learn more about RL, I would recommend looking up the AI course (CS188) from the University of California, Berkeley. I didn’t actually finish the course; I only watched the lectures that involve RL, and attempted the corresponding programming assignments. The grid world example is taken from there; it is actually a programming assignment (python3) where you have to write the code to compute the Q-values. Marsland’s Machine Learning textbook only devotes a small part to RL, but the material is relatively easier to understand, not to mention that it comes with python source code. The definitive guide to RL would definitely be Sutton’s RL book, but I had a lot of difficulty understanding it, since it just teaches concepts.

I understand if the material here is pretty esoteric; I took a long time to understand it myself. However, if you manage to make it thus far, you are ready to understand the Deep Q-Learning algorithm, which I will cover in the final part of this series.

## References

• Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1) (No. 1). MIT press Cambridge.
• Minsky, M. L. (1954). Theory of neural-analog reinforcement systems and its application to the brain model problem. Princeton University.
• Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of research and development, 3(3), 210–229.
• Tesauro, G. (1995). Temporal difference learning and td-gammon. Communications of the ACM, 38(3), 58–68.
• Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
• Marsland, S. (2015). Machine learning: an algorithmic perspective. CRC press.
• Klein, D. (2014). Cs 188: Artificial intelligence. University of California, Berkeley.
• Watkins, C. J. C. H. (1989). Learning from delayed rewards (Unpublished doctoral dissertation). University of Cambridge England.

# Deep Q-Learning 101: Part 1 – Convolutional Neural Networks

This is a 3-part series on Deep Q-Learning, written so that undergrads with high school maths should be able to understand it and hit the ground running on their deep learning projects. This series is really just the literature review section of my final year report (which is on Deep Q-Learning) broken into 3 chunks.

## Prologue

You can skip this. It is just me explaining how I ended up publishing this.

When I was writing my final year report I was advised by my supervisor (Dr Wong Ya Ping) to write in such a way that the average layman could understand. I think he put it as “write it such that even your mom can understand” (my mom is pretty highly educated by the way). So I went out of my way to take a course on it, and kept simplifying my writing under my co-supervisor’s invaluable and meticulous review (Dr Ng Boon Yian – he is famous for reviewing final year reports). After my final year project was complete I just shelved everything and gave myself a pat on the back.

Fast forward 7 months to today. I passed my report around as a reference for my juniors, and one of them commented that it was like a crash course in Convolutional Neural Networks (CNN). A week or so later, a friend from the engineering faculty was having difficulty understanding CNN, so I passed my report to him. He said it cleared a lot of things up, since he had been mostly referring to the Tensorflow documentation, which is focused on teaching you how to use the library, not on teaching you machine learning. So, with that I decided to break down the literature review of my report into 3 parts and publish it on my blog, in the hope of enlightening a wider audience. Hope it will clear things up for you as well!

I myself am not very focused on machine learning at the time being; I have decided to direct my attention on the study of algorithms via the Data Structures and Algorithms Specialization in Coursera. So chances are if you ask some machine learning question now I won’t be able to understand, but I’ll try. (:

## Abstract

In this post, I will introduce machine learning and its three main branches. Then, I will talk about neural networks, along with the biologically-inspired CNN. In part 2, I will introduce the reader to reinforcement learning (RL), followed by the RL technique Q-Learning. In the final part, I piece together everything when explaining Deep Q-Learning.

If you are here just to understand CNN, this first part is all you need.

## Types of Machine Learning

The art and science of having computer programs learn without explicitly programming them to do so is called machine learning. Machine learning, then, is about making computers modify or adapt their actions (whether these actions are making predictions, or controlling a robot) so that these actions get more accurate. Machine learning itself is divided into three broad categories (Marsland, 2015):

• Supervised Learning – We train a machine with a set of questions (inputs), paired with the correct responses (targets). The algorithm then generalizes over this training data to respond to all possible inputs. Included in this category of learning techniques are neural networks.
• Unsupervised Learning – Correct responses are not provided, but the algorithm looks for patterns in the data and attempts to cluster them together. Unsupervised learning will not be covered in this series as it is not used in Deep Q-Learning.
• Reinforcement Learning – A cross between supervised learning and unsupervised learning. The algorithm is told when the answer is wrong, but is not shown how to correct it. It has to explore different possibilities on its own until it figures out the right answer.

A system that uses these learning techniques to make predictions is called a model.

## The Artificial Neural Network (ANN)

### The Neuron

The simplest unit in an ANN is called a neuron. The neuron was introduced in 1943 by Warren S. McCulloch, a neuroscientist, and Walter Pitts, a logician (McCulloch & Pitts, 1943). Inspired by biological neurons in the brain, they proposed a mathematical model that extracts the bare essentials of what a neuron does: it takes a set of inputs and it either fires (1) or it does not (0). In other words, a neuron is a binary classifier; it classifies the inputs into 2 categories.

Mathematical Model of a neuron (Marsland, 2015)

In a neuron, a set of $m$ inputs $x_{1}\ldots x_{m}$ is multiplied by a set of weights $w_{1}\ldots w_{m}$ (the weights are learned over time) and summed together. Both $x$ and $w$ are typically represented as vectors.

$h=\sum\limits_{i=1}^m w_ix_i$

The result, $h$, is then passed to an activation function, which returns an output (1 or 0).
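In code, a single neuron is just a dot product followed by an activation. In this sketch I assume the simplest activation, a threshold step, and the inputs, weights and threshold are made-up numbers.

```python
import numpy as np

def neuron(x, w, threshold=0.0):
    """Weighted sum of the inputs, then fire (1) or not (0)."""
    h = np.dot(w, x)
    return 1 if h > threshold else 0

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.2, 0.3])
print(neuron(x, w, threshold=0.6))  # h = 0.8, so the neuron fires: 1
```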

### Building ANN from Neurons

A neuron by itself cannot do much; we need to put sets of neurons together into an ANN before they can be anything useful.

ANN – each circle (node) is a neuron (Karpathy et al., 2016)

What happens after we clump these neurons together into layers? How do they learn? The algorithm will learn by example (supervised learning); the dataset will have the correct output associated with each data point. It may not seem to make sense to provide the answers, but the main goal of an ANN is to generalise over the data: to find patterns and predict new examples correctly.

To teach an ANN, we use an algorithm called back-propagation.

### Back-propagation Algorithm

The back-propagation algorithm consists of two main phases, executed in order:

• Forward propagation – the inputs are passed through the ANN starting at the input layer, and predictions are made at the output layer.
• Weight update – from the predictions, we calculate how far we differ from the answer (also known as the loss). We then use this information to update the weights in the reverse direction; starting from the output layer, back to the input layer.

The weight update step is made possible by another algorithm: gradient descent.

To use gradient descent, we first need to define a loss function $L$, which calculates the loss. For each sample $i$, the loss measures the difference between the predicted value $h_w(x^i)$ and the actual value $y^i$; this is then aggregated over all $m$ samples. There are various ways of calculating the loss; one of the most popular is the mean squared error function:

$L(w) = \frac{1}{m} \sum\limits_{i=1}^m \left( h_w(x^i) - y^i\right)^2$

The goal of the ANN is then to minimize the loss. To do this we find the derivative of the loss function with respect to the weights, $\nabla_w L(w)$. This gives us the gradient of the error. Since the purpose of learning is to minimize the loss, nudging the values of the weights in the direction of the negative gradient will reduce the loss. We therefore define the back-propagation update rule of the weights as:

$w = w - \alpha \nabla_w L(w)$

$\alpha$ here is known as the learning rate, which is a parameter that we tweak to determine how strongly we nudge the weights with each update. An update step of the weights (including both the forward and backward pass) is known as an iteration; when we have iterated over all samples one time, we call this an epoch. More epochs usually mean better accuracy, but only up until the ANN converges to a possible solution. In gradient descent, 1 iteration is also 1 epoch, because the entire dataset is processed in each iteration.
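To see the update rule $w = w - \alpha \nabla_w L(w)$ at work, here is a small sketch that uses a linear model $h_w(x) = w \cdot x$ instead of a full ANN, so the gradient of the mean squared error can be written by hand. The data is synthetic and the learning rate is an arbitrary choice.

```python
import numpy as np

def mse_loss(w, X, y):
    """Mean squared error of the linear model over all m samples."""
    return np.mean((X @ w - y) ** 2)

def mse_gradient(w, X, y):
    """Derivative of the mean squared error with respect to w."""
    return 2.0 / len(y) * X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                      # targets generated from known weights

w = np.zeros(2)
alpha = 0.1
for epoch in range(200):            # 1 iteration = 1 epoch here
    w = w - alpha * mse_gradient(w, X, y)

print(w, mse_loss(w, X, y))         # w ends up close to true_w
```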

What happens if the dataset gets constantly updated with new data (transaction data, weather information, traffic updates, etc.)? There is a variation of gradient descent that allows us to stream the data piece by piece into the ANN: stochastic gradient descent (SGD). To do this a simple modification is made to the loss function (which consequently changes the derivative): instead of factoring the entire dataset in each update step we take in only a single input at a time:

$L(w) = \left( h_w(x^i) - y^i\right)^2$

For SGD, a dataset of 500 samples would take 500 iterations to complete 1 epoch. Deep Q-Learning uses SGD to perform updates to the weights (Mnih et al., 2013), using a rather unique loss function. I will elaborate on this in part 3.

## Convolutional Neural Networks

Convolutional Neural Networks (CNN) are biologically-inspired variants of multi-layered neural networks. From Hubel and Wiesel’s work on the visual cortex of cats (Hubel & Wiesel, 1963), we understand that the cells in the visual cortex are structured in a hierarchy: simple cells respond to specific edges, and their outputs are received by complex cells.

Hierarchical structure of the neurons (Lehar, n.d.)

In a CNN, neurons are arranged into layers, and in different layers the neurons specialize to be more sensitive to certain features. For example, in the base layer the neurons react to simple features like lines and edges, and in higher layers the neurons react to more specific features like an eye, a nose, a handle or a bottle.

A CNN is commonly composed of 3 main types of layers: Convolutional Layer, Pooling Layer, and Fully-Connected Layer. These layers are stacked together and inputs are passed forward and back according to that order.

Architecture of LeNet-5, a CNN used for digits recognition (LeCun et al., 1998)

### Convolutional Layer

Each convolutional layer consists of a set of learnable filters (also referred to as kernels), each of which is small spatially but extends through the full depth of the input volume. For example, for a coloured image as input (images passed into a CNN are typically resized to a square), a filter on the first layer of a CNN might have size $5\times 5\times 3$ (5 pixels in width and height, and 3 color channels, RGB). During the forward pass, we slide (or convolve) each filter across the width and height of the input volume (the distance we slide in each step is called the stride) and compute dot products between the entries of the filter and the input at each position. As we slide the filter over the width and height of the input volume we produce a 2-dimensional activation map (also referred to as a feature map). We stack these activation maps along the depth dimension to produce the output volume (Karpathy et al., 2016); therefore the number of filters = depth of output volume.

Interactive demo (built by Karpathy) of the convolutional layer at work: 2 filters, size $3\times 3\times 3$, with stride of 1.

In summary, each convolutional layer requires 3 hyperparameters to be defined: filter size (width and height only; the depth will match the input volume), number of filters, and stride.
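To make the sliding window concrete, here is a small sketch of one filter convolved over a single-channel square input with no padding. The names Matrix and convolve2d are made up for this example, and the output-size formula $(W - F) / S + 1$ (input width $W$, filter width $F$, stride $S$) is the usual one for unpadded inputs rather than something stated above:

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;

// Slides one square filter over a single-channel square input (no padding)
// and returns the activation map. Output size per side: (W - F) / S + 1.
Matrix convolve2d(const Matrix &input, const Matrix &filter, std::size_t stride)
{
    const std::size_t W = input.size();
    const std::size_t F = filter.size();
    const std::size_t outSize = (W - F) / stride + 1;

    Matrix out(outSize, std::vector<double>(outSize, 0.0));
    for (std::size_t r = 0; r < outSize; ++r) {
        for (std::size_t c = 0; c < outSize; ++c) {
            // dot product between the filter and the patch underneath it
            for (std::size_t i = 0; i < F; ++i) {
                for (std::size_t j = 0; j < F; ++j) {
                    out[r][c] += input[r * stride + i][c * stride + j] * filter[i][j];
                }
            }
        }
    }
    return out;
}
```

A real convolutional layer repeats this for every filter and sums over the depth (the color channels) as well; the sketch keeps a single channel to stay short.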

After an input is convolved in one layer, the output volume passes through a Rectified Linear Unit (RELU) layer, which applies an elementwise activation function:

$f(x) = \max(0, x)$

The output volume dimensions remain unchanged. RELU is simple, computationally efficient, and converges much faster than other activation functions (sigmoid, tanh) in practice.
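In code, RELU is about as small as an activation function gets. A sketch over a flat vector of activations (the function name relu is my own):

```cpp
#include <algorithm>
#include <vector>

// Applies f(x) = max(0, x) to every element; the size is unchanged.
std::vector<double> relu(std::vector<double> v)
{
    for (double &x : v) {
        x = std::max(0.0, x);
    }
    return v;
}
```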

### Pooling Layer

Pooling layers perform a downsampling operation (which is why pooling is also called subsampling) and reduce the input dimensions. They are used to control overfitting (the state where the ANN fits its training data too closely and cannot generalise to new inputs) by incrementally reducing the spatial size of the input, which reduces the number of parameters and the amount of computation in the network. Though there are many types of pooling layers, the simplest and most effective is max pooling, illustrated below:

Illustration of max pooling (Karpathy et al., 2016)
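A sketch of max pooling on a single channel, assuming a $2\times 2$ window with a stride of 2 (a common choice that I am assuming here) and an input with even width and height. The names Matrix and maxPool2x2 are made up for this example:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double>> Matrix;

// Max pooling with a 2x2 window and a stride of 2: each output cell keeps the
// largest of the four input cells it covers. Assumes even width and height.
Matrix maxPool2x2(const Matrix &input)
{
    const std::size_t rows = input.size() / 2;
    const std::size_t cols = input.empty() ? 0 : input[0].size() / 2;

    Matrix out(rows, std::vector<double>(cols));
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[r][c] = std::max(
                std::max(input[2 * r][2 * c],     input[2 * r][2 * c + 1]),
                std::max(input[2 * r + 1][2 * c], input[2 * r + 1][2 * c + 1]));
        }
    }
    return out;
}
```

A $4\times 4$ input becomes a $2\times 2$ output, so three quarters of the activations are discarded.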

There are proposed solutions to replace pooling layers altogether by simply increasing the stride (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2014), and it seems likely that future architectures will have very few to no pooling layers. Pooling layers are not used in DeepMind’s Deep Q-Learning implementation, and this will be explained later in part 3.

### Fully Connected Layer

Neurons in a fully connected (FC) layer have full connections to all activations in the previous layer, as seen in regular ANN as described previously. In certain implementations such as Neon, FC layers are referred to as affine layers.

## Afterword

There is a detailed writeup of CNN in the CS231n course by Karpathy: Convolutional Neural Networks (CNNs / ConvNets). I’d recommend taking a look at that for a more detailed (and more math intensive) explanation.

Of course, if you have time, the best way to get a proper foundation would be to take up Andrew Ng’s machine learning course. I have gotten a cert from it, and if you are serious about this subject I’d suggest you enroll as well. Andrew even has a 5 course specialization on deep learning now, though I won’t be taking it up anytime soon. What you will find is that deep learning is more than just GPUs and this magic black box called Tensorflow.

## References

• Marsland, S. (2015). Machine learning: an algorithmic perspective. CRC press.
• McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4), 115–133.
• Karpathy, A., Li, F., & Johnson, J. (2016). CS231n: Convolutional neural networks for visual recognition. Stanford University.
• Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
• Hubel, D., & Wiesel, T. (1963). Shape and arrangement of columns in cat’s striate cortex. The Journal of physiology, 165(3), 559.
• Lehar, S. (n.d.). Hubel & Wiesel. Retrieved 2016-08-19, from http://cns-alumni.bu.edu/~slehar/webstuff/pcave/hubel.html
• LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
• Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

# The Programmer’s Guide To FFT – Part 2: FFT

This will be a 2 part series on fast Fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, in contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In my previous post, I wrote about DFT as the basis to understand FFT. In this final part I will write about the FFT algorithm as outlined in Cooley and Tukey’s seminal work published in 1965. I assume you understood the material in the prior post before coming here, since everything here builds on top of it.

## Key Ideas

Before we begin, let us assume that the length of the input (N) is always a power of 2 (2, 4, 8, 16, 32…). I will explain why this is important later on.

There are 2 key ideas that enable the FFT algorithm to work. First is to understand that DFT can be separated as a sum of odd and even parts:

$\begin{array}{lcl}F_k & = & \sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (2m)}{N}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (2m+1)}{N}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m+1/2)}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2} - \frac{i\pi k}{N/2}} \\ & = & \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot e^{-\frac{i2\pi k (m)}{N/2}} + e^{-\frac{i2\pi k}{N}} \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot e^{-\frac{i2\pi k (m)}{N/2}}\end{array}$

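Let us define a function $\omega$ (read as omega), with the same negative exponent as the DFT:

$\omega(p, q) = e^{-\frac{i2\pi q}{p}}$

(The omega function in the implementation further down is written with a positive exponent instead, so the FFT passes it a negated second argument to get the same value.)

Now we simplify the DFT formulation to:

$F_k = \sum\limits_{m=0}^{N/2-1}x_{2m}\cdot \omega(\frac{N}{2}, km) + \omega(N, k) \sum\limits_{m=0}^{N/2-1}x_{2m+1}\cdot \omega(\frac{N}{2}, km)$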

Let’s generalize further to:

$F_k = F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}}$

The second key idea is to take advantage of the periodic nature of DFT:

$\begin{array}{lcl}F_k & = & F_k^{\text{even}} + \omega(N, k) \cdot F_k^{\text{odd}} \\ F_{k+\frac{N}{2}} & = & F_k^{\text{even}} - \omega(N, k) \cdot F_k^{\text{odd}}\end{array}$
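To see where the minus sign in the second line comes from, substitute $k + \frac{N}{2}$ for $k$ in the exponentials of the DFT written out earlier. This is a short derivation added as a reasoning step; it uses only $e^{-i\pi} = -1$ and $e^{-i2\pi m} = 1$ for any integer $m$:

$\begin{array}{lcl}e^{-\frac{i2\pi (k+\frac{N}{2})}{N}} & = & e^{-\frac{i2\pi k}{N}} \cdot e^{-i\pi} = -e^{-\frac{i2\pi k}{N}} \\ e^{-\frac{i2\pi (k+\frac{N}{2}) m}{N/2}} & = & e^{-\frac{i2\pi k m}{N/2}} \cdot e^{-i2\pi m} = e^{-\frac{i2\pi k m}{N/2}}\end{array}$

So the two half-size sums $F_k^{\text{even}}$ and $F_k^{\text{odd}}$ are unchanged when $k$ is shifted by $\frac{N}{2}$, while the factor in front of the odd sum flips its sign.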

What this means is that, in the process of calculating the resulting sequence $F$, you only need to compute $\omega(N, k) \cdot F_k^{\text{odd}}$ a total of $\frac{N}{2}$ times; we can essentially halve the number of computations using this technique. But why stop there? $F_k^{\text{even}}$ and $F_k^{\text{odd}}$ are themselves DFTs of size $\frac{N}{2}$, so we can split each of them into even and odd parts and repeat the same procedure. If we compute this recursively, the base case is $N = 1$. We can halve a sequence of size N only $\log_2 N$ times before reaching the base case, and each of those levels takes about N operations to combine. Therefore, for a sequence of size N, FFT computes the DFT in $O(N \log N)$ time.

## Example

Ok. That was probably hard to grasp, so let us break it down. Take an example where N = 2: a sequence in the coefficient representation s is (1, 9), and we want to convert it to point-value representation. The even and odd sequence is simply 1 and 9 respectively. We can use an auxiliary variable h to store $\omega(N, k) \cdot F_k^{\text{odd}}$:

$h = \omega(2, 0) \cdot 9 = 9$

$F_0 = 1 + h = 1 + 9 = 10$

$F_1 = 1 - h = 1 - 9 = -8$

Notice how we only need to compute $\omega(N, k) \cdot F_k^{\text{odd}}$ once and reuse it for $F_{0 + \frac{N}{2}}$.

Now we go to a more complex example, where N = 8: s = (1, 6, 3, 8, 9, 5, 4, 2). Because the DFT can be expressed as a sum of even and odd parts, we can recursively divide s into smaller subproblems, up until N = 1:

I arranged it such that the subsequence to the left contains the even parts, and the subsequence to the right contains the odd parts. Now that we have separated it nicely, we can systematically work on the smaller parts and work our way up to the final answer. I’ve made a nice diagram that illustrates the computational flow of the FFT algorithm:

As before, the sequences to the left are the even parts and the sequences to the right are the odd parts. The cells show the type: yellow for real numbers, and green for complex numbers. Blue arrows branch out from even sequences, and red arrows branch out from odd sequences. Red arrows also denote that the cell they came from will be multiplied by $\omega(N, k)$, though this is not visually depicted. The whole computation flow shows a sort of “butterfly pattern”, as most engineers like to describe it.
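The recursive even/odd split described above can be reproduced with a short sketch. It only does the splitting (no DFT is computed) and collects the size-1 pieces from left to right; the name splitEvenOdd is made up for this example:

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<int> ivec;

// Recursively splits s into even-indexed and odd-indexed halves until every
// piece has size 1, and appends the pieces to `leaves` from left to right.
void splitEvenOdd(const ivec &s, ivec &leaves)
{
    if (s.size() <= 1) {
        leaves.insert(leaves.end(), s.begin(), s.end());
        return;
    }

    ivec even, odd;
    for (std::size_t i = 0; i < s.size(); ++i) {
        (i % 2 == 0 ? even : odd).push_back(s[i]);
    }

    splitEvenOdd(even, leaves); // left branch
    splitEvenOdd(odd, leaves);  // right branch
}
```

For s = (1, 6, 3, 8, 9, 5, 4, 2) the first split gives (1, 3, 9, 4) and (6, 8, 5, 2), the second gives (1, 9), (3, 4), (6, 5) and (8, 2), and the pieces of size 1 come out in the order 1, 9, 3, 4, 6, 5, 8, 2.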

IFFT works roughly the same way as FFT in that it uses the same technique to save computation, so if you understand FFT you should get IFFT as well.

## Implementation

The implementation here includes FFT and IFFT. As with the DFT and IDFT implementation in the previous post, it takes a sequence in the coefficient representation and spits out a sequence of the same size in the point-value representation using FFT, then puts that sequence through IFFT to get back the original sequence. As before, my emphasis is on readability, not optimization.

#include <iostream>
#include <complex>
#include <cmath>
#include <iomanip>
#include <vector>
#include <algorithm>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

inline xd omega(const double &p, const double &q)
{
return exp((2. * PI * J * q) / p);
}

xvec _fft(xvec &f)
{
double N = f.size();

// base case; <= also stops the recursion for an empty input
if (N <= 1) return f;

xvec fe, fo;
fe.reserve(N / 2);
fo.reserve(N / 2);

for (int i = 0; i < N; i += 2) {
fe.push_back(f[i]);     // even
fo.push_back(f[i + 1]); // odd
}

fe = _fft(fe);
fo = _fft(fo);

for (int m = 0; m < N / 2; ++m) {
xd omfo = omega(N, -m) * fo[m];
f[m]         = fe[m] + omfo;
f[m + N / 2] = fe[m] - omfo;
}

return f;
}

xvec fft(const dvec &x)
{
xvec f(x.size());

for (size_t i = 0; i < x.size(); ++i) {
f[i] = xd(x[i], 0);
}

return _fft(f);
}

xvec _ifft(xvec &x)
{
double N = x.size();

// base case; <= also stops the recursion for an empty input
if (N <= 1) return x;

xvec xe, xo;
xe.reserve(N / 2);
xo.reserve(N / 2);

for (int i = 0; i < N; i += 2) {
xe.push_back(x[i]);     // even
xo.push_back(x[i + 1]); // odd
}

xe = _ifft(xe);
xo = _ifft(xo);

for (int m = 0; m < N / 2; ++m) {
xd iomxo = omega(N, m) * xo[m];
x[m]         = xe[m] + iomxo;
x[m + N / 2] = xe[m] - iomxo;
}

return x;
}

dvec ifft(xvec f)
{
double N = f.size();

xvec xcomplex = _ifft(f);
dvec x(N);

for (int i = 0; i < N; ++i) {
x[i] = xcomplex[i].real() / N;
}

return x;
}

int main()
{
cout << fixed << setprecision(2);

dvec input = { 1,6,3,8,9,5,4,2 };

// convert from time to frequency domain
xvec freqdom = fft(input);

for (const auto &f : freqdom) {
cout << f << endl;
}
cout << endl;

// convert from frequency to time domain
auto timedom = ifft(freqdom);

for (const auto &t : timedom) {
cout << t << ' ';
}
cout << endl;
}


It is a good idea to set breakpoints to see how the recursive implementation of FFT systematically solves DFT from the smaller subproblems.

Because of how similar FFT and IFFT are, it is not hard to merge them into one function and pass a boolean parameter to determine whether it will be FFT or IFFT (most implementations online will call this multi-purpose function “transform”), but for the sake of clarity I refrain from doing so.
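For the curious, here is a rough sketch of what such a merged function could look like. It is not taken from any particular implementation; the signature and the decision to leave the $\frac{1}{N}$ scaling of the inverse to the caller are my assumptions:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> xd;
typedef std::vector<xd> xvec;

// Recursive radix-2 transform. `inverse` only flips the sign of the exponent;
// the 1/N scaling of the inverse is left to the caller.
// The size of f must be a power of 2.
xvec transform(const xvec &f, bool inverse)
{
    const std::size_t N = f.size();
    if (N <= 1) return f;

    xvec fe, fo;
    for (std::size_t i = 0; i < N; i += 2) {
        fe.push_back(f[i]);     // even
        fo.push_back(f[i + 1]); // odd
    }
    fe = transform(fe, inverse);
    fo = transform(fo, inverse);

    const double PI = std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;

    xvec out(N);
    for (std::size_t m = 0; m < N / 2; ++m) {
        // e^{-i 2 pi m / N} for FFT, e^{+i 2 pi m / N} for IFFT
        const xd w = std::polar(1.0, sign * 2.0 * PI * m / N);
        out[m]         = fe[m] + w * fo[m];
        out[m + N / 2] = fe[m] - w * fo[m];
    }
    return out;
}
```

Calling transform(x, false) gives the FFT; calling transform(F, true) and dividing every element by N gives back the original sequence.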

Convolution is practically the same as before – it’s just that we replace DFT with FFT and IDFT with IFFT, and round the transform size up to a power of 2 (the next section explains why):

// vector convolution
dvec convolve(const dvec &a, const dvec &b)
{
    // size of the resulting coefficient vector
    size_t deg = a.size() + b.size() - 1;

    // fft() only handles power-of-2 lengths, so round up
    size_t N = 1;
    while (N < deg) N <<= 1;

    // extend size to N; the extra space is padded with 0
    dvec acof(N, 0), bcof(N, 0);
    copy(a.begin(), a.end(), acof.begin());
    copy(b.begin(), b.end(), bcof.begin());

    // evaluation
    xvec apv = fft(acof);
    xvec bpv = fft(bcof);

    // point-wise multiplication
    xvec cpv(N);
    for (size_t i = 0; i < N; ++i) {
        cpv[i] = apv[i] * bpv[i];
    }

    // interpolation, then drop the padding
    dvec c = ifft(cpv);
    c.resize(deg);
    return c;
}


Now we estimate the time complexity of vector convolution using FFT: evaluation (two FFTs, $2 \times N \log N$), pointwise multiplication (N) and interpolation (one IFFT, $N \log N$) now cost a total of $3 \times N \log N + N$, which is $O(N \log N)$.

## Must N Be In Powers of 2?

You could play around with different input sizes and compare the answers with DFT. What you will see is that FFT returns some funny values if N is not a power of 2. Why is this so? Well, you can see it from the visual depictions of how FFT works: at the simplest subproblem (N = 2), you need to have one even value and one odd value. FFT must be able to keep halving the input until all subsequences are of size 2, and the only way that is possible is if N is a power of 2.

If it doesn’t make sense, you could just play around with the code to convince yourself I guess.

Is this a bad thing? Well, if all you use FFT for is convolution then no. You could first calculate the resulting polynomial degree of the convolution, then pad the input with 0 until N is a power of 2, evaluate it, do pointwise multiplication, interpolate, and resize it to match the resulting degree.

If you use bit shifts to multiply by 2, you can compute N very quickly (see full implementation here):

// degree of resulting polynomial = size of resulting array
size_t deg = a.size() + b.size() - 1;

// transform array size must be in power of 2 for FFT
size_t N = 1;
while (N < deg) N <<= 1;

// set size for arrays in point-value representation -
// extended space is padded with 0:
xvec apv(N), bpv(N), cpv(N);

// ** do convolution, then interpolate cpv into the result c... **

// Resize to actual size
c.resize(deg);


But wouldn’t padding the input array make it slow? Well, we can check with a simple example. Say N = 129: with naive DFT, computing the DFT takes $129^2 = 16641$ computations. With FFT we round N up to the next power of 2, which is 256, giving $256 \log_2 256 = 256 \times 8 = 2048$ computations. That is still about 8 times fewer computations! Of course, it shouldn’t be hard to convince yourself that the gap only widens as N gets larger.

There are variations of FFT where N does not need to be in powers of 2, but it won’t be the focus of this post.

## Optimizations

If you print the N and m used by the $\omega$ (omega) function, you will notice that there are some repetitions. Here is N and m where N = 8:

2 0
2 0
4 0
4 1
2 0
2 0
4 0
4 1
8 0
8 1
8 2
8 3

So this means it is possible to precompute results from $\omega$ and reuse them. In the case where N = 8, we only need to calculate $\omega$ 7 times as opposed to 12. I have an implementation of FFT that does this simple optimization.
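One way to do the precomputation (a sketch of my own, not the linked implementation) is to build a single table for the full size N and let the smaller subproblems index into it with a step. This relies on the identity $e^{-\frac{i2\pi m}{n}} = e^{-\frac{i2\pi m (N/n)}{N}}$ when n divides N; the function names are made up for this example:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> xd;
typedef std::vector<xd> xvec;

// Table of the N/2 factors w[k] = e^{-i 2 pi k / N}, k = 0 .. N/2 - 1.
xvec twiddleTable(std::size_t N)
{
    const double PI = std::acos(-1.0);
    xvec w(N / 2);
    for (std::size_t k = 0; k < N / 2; ++k) {
        w[k] = std::polar(1.0, -2.0 * PI * k / N);
    }
    return w;
}

// Factor e^{-i 2 pi m / n} for a subproblem of size n (n divides N, m < n/2),
// looked up in the table instead of being recomputed.
xd twiddle(const xvec &w, std::size_t N, std::size_t n, std::size_t m)
{
    return w[m * (N / n)];
}
```

With this layout the exponential is evaluated only N/2 times in total, because pairs such as (2, 0), (4, 0) and (8, 0) all map to the same table entry.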

What else can you do? In-place calculation – here’s my implementation of in-place FFT. Aside from knowing how to juggle the indices, one key idea to understand is this: the position each element ends up at after the recursive splitting is the bit reversal of its original index, where the number of bits is $\log_2 N$. For example, if N = 16 ($\log_2 N = 4$), the element at index 7 (0111) will be swapped with the one at index 14 (1110).
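The bit reversal itself is only a few lines. A minimal sketch (the function name bitReverse is my own):

```cpp
#include <cstddef>

// Reverses the lowest `bits` bits of index i, e.g. 0111 -> 1110 for bits = 4.
std::size_t bitReverse(std::size_t i, unsigned bits)
{
    std::size_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1); // move the lowest bit of i onto the end of r
        i >>= 1;
    }
    return r;
}
```

An in-place FFT would typically swap element i with element bitReverse(i, log2 N) once for each pair before running the butterflies.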

## Conclusion

We have gone through DFT, and now FFT. I hope this helped you understand FFT a little better. It’s ok if you’re still a little foggy on the details; play around with the source code long enough and it will become clearer to you. I intend to work on hands-on applications of FFT in future posts, so keep a look out for more!

# The Programmer’s Guide To FFT – Part 1: DFT

This will be a 2 part series on fast Fourier transform (FFT). My aim for these posts is to provide a more hands-on and layman friendly approach to this algorithm, in contrast to a lot of the theoretically heavy material available on the internet. In short: less math, no proofs, examples provided, and working source code (C++11).

In this first part I will write about discrete fourier transform (DFT) as the basis to understand FFT.

## Sorting as a Metaphor

DFT is to FFT as insertion sort is to merge sort: they both take the same type of input and spit out the same output; it’s just that FFT runs much faster than DFT by utilizing a technique called divide and conquer (merge sort also uses this). The computational difference is also the same as between merge sort and insertion sort: $O(N \log N)$ vs $O(N^2)$. How much difference is this? Well, if N is 1 billion and each computation takes 1 nanosecond, it is the difference between finishing in about 30 seconds as opposed to about 30 years.

## What Does DFT Do?

DFT takes a discrete series of input in the time domain and converts it to the frequency domain. Computer scientists describe it as converting from the coefficient representation to the point-value representation (more on that later). In either case, the semantics of the input is retained; it is like translating from one language to another.

Also, the conversion from one domain to another needs to be reversible. To convert from the frequency domain back to the time domain, you use the inverse of the DFT, also called IDFT. In computer science, the process of converting from the coefficient representation to the point-value representation is called evaluation, and the reverse process is interpolation.

## Applications and Motivations

The applications of DFT are diverse: how Shazam matches songs, mp3 compression, noise removal from audio, jpeg compression, edge detection, radar, sonar, optics, seismic activity, satellite transmissions, genome matching (I will put up a post on genome matching in the future as an application of DFT)… Put it this way: as long as it involves a sequence (could be in multiple dimensions) that changes – whether it be over time or space – it can benefit from DFT.

However, this post focuses on using DFT for vector convolution (I confess that as of this writing this is the only thing I know how to use DFT for).

## Vector Convolution

We can understand what convolution does as a polynomial multiplication:

Example taken from CLRS

Put formally, we are given two sequences of real numbers $a$ and $b$ (in this example (6, 7, -10, 9) and (-2, 0, 4, -5) respectively), each of size n. Each sequence holds the coefficients of an input polynomial (thus the term coefficient representation), and we return a sequence of real numbers $c$ (-12, -14, 44, -20, -75, 86, -45) of size 2n – 1. The degree of each term can be mapped from the indices (starting from 0). We also describe convolution in mathematical notation as $a \otimes b = c$.

Each element $c_j$ ($0 \leq j < 2n-1$) can be calculated like so (formula also from CLRS):

$c_j = \sum\limits_{k=0}^{j}a_kb_{j-k}$

I have an implementation of this formulation below (click here to see a sample using this function):

typedef vector<int> ivec;

ivec polynomialMultiply(const ivec &a, const ivec &b)
{
int N = 2 * a.size() - 1;
ivec c(N, 0);

for (int j = 0; j < N; ++j) {
int p = j >= a.size() ? j - a.size() + 1 : 0;
for (int k = p; k <= j - p; ++k) {
c[j] += a[k] * b[j - k];
}
}

return c;
}
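The formula can also be coded directly if coefficients outside a sequence are treated as zero. The sketch below does that with a bounds check instead of an adjusted loop range; the name convolveDirect is made up for this example, and unlike the function above it also accepts inputs of different sizes:

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<int> ivec;

// Direct form of c_j = sum_{k=0}^{j} a_k * b_{j-k}, where any coefficient
// that falls outside a or b counts as 0.
ivec convolveDirect(const ivec &a, const ivec &b)
{
    if (a.empty() || b.empty()) return ivec();

    ivec c(a.size() + b.size() - 1, 0);
    for (std::size_t j = 0; j < c.size(); ++j) {
        for (std::size_t k = 0; k <= j; ++k) {
            // skip the terms where a_k or b_{j-k} would be padding
            if (k < a.size() && j - k < b.size()) {
                c[j] += a[k] * b[j - k];
            }
        }
    }
    return c;
}
```

Feeding it (6, 7, -10, 9) and (-2, 0, 4, -5) gives (-12, -14, 44, -20, -75, 86, -45), the same sequence as in the example.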


Perhaps you might be wondering why there is an auxiliary variable p which is not in the formulation. The formula quietly assumes that $a$ and $b$ are padded with zeros up to size 2n – 1; the code works on the unpadded vectors, so p trims the range of k such that both a[k] and b[j - k] stay within bounds. BUT, that’s not the point. The point is that this technique takes $O(N^2)$ time to run (we can infer this from the nested for loop). The same result, however, can be computed much faster in point-value representation:

In this illustration, yellow indicates the coefficient representation (real numbers) and green indicates the point-value representation (complex numbers). In the point-value form, convolution is simply a pointwise multiplication which can be done in $O(N)$. However, the actual complexity of this method also needs to factor in the computation needed for evaluation and interpolation.

## The Math Behind DFT

We now describe the transformation process with more formal mathematical notation. Let $x$ be the sequence of N real values in the time domain, where the n-th element of $x$ is denoted as $x_n$. DFT then takes $x$ and transforms it to a sequence $F$ of N complex values in the frequency domain, where the k-th element of $F$ is denoted as $F_k$. We can also write DFT as a function $dft(x) = F$, and IDFT as $idft(F) = x$.

The DFT is then given by:

$F_k=\sum\limits_{n=0}^{N-1}x_n\cdot e^{-\frac{i2\pi k n}{N}}$

And IDFT given by:

$x_n=\frac{1}{N}\sum\limits_{k=0}^{N-1}F_k\cdot e^{\frac{i2\pi k n}{N}}$

Where $i$ is the imaginary number $\sqrt{-1}$.

Since each element takes N computations and there are N elements, it is not hard to see that both DFT and IDFT take $O(N^2)$.

## Calculate An Example by Hand

It always helps to compute the DFT by hand first to get a feel for it. To do this we will need Euler’s formula to break down the exponential to its sine cosine equivalents:

$e^{iy}=\cos y + i \sin y$

In this context, $y$ is $-\frac{2\pi n k}{N}$ for the DFT (and $+\frac{2\pi n k}{N}$ for the IDFT).

Example: Let $x$ be a sequence (1, 2, 3, 4) and N = 4, and we wish to convert it to the frequency domain.

$\begin{array}{lcl}F_0 & = & 1\cdot e^{-\frac{i2\pi 0 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 0 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 0 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 0 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^0 + 3\cdot e^0 + 4\cdot e^0 \\ & = & 10\end{array}$

$\begin{array}{lcl}F_1 & = & 1\cdot e^{-\frac{i2\pi 1 \cdot 0}{4}} + 2\cdot e^{-\frac{i2\pi 1 \cdot 1}{4}} + 3\cdot e^{-\frac{i2\pi 1 \cdot 2}{4}} + 4\cdot e^{-\frac{i2\pi 1 \cdot 3}{4}} \\ & = & 1\cdot e^0 + 2\cdot e^{-\frac{i\pi}{2}} + 3\cdot e^{-i\pi} + 4\cdot e^{-\frac{i3\pi}{2}} \\ & = & 1 + 2\left(\cos \frac{-\pi}{2} + i\sin \frac{-\pi}{2} \right) + 3 \left( \cos(-\pi) + i \sin(-\pi) \right) + 4 \left( \cos\frac{-3\pi}{2} + i\sin\frac{-3\pi}{2}\right) \\ & = & 1 + 2\left(0 - i\right) + 3 \left( -1 + 0 \right) + 4 \left( 0 + i\right) \\ & = & -2+2i\end{array}$

$F_2 = -2$

$F_3 = -2-2i$

Now, if you take the values of $F$ and pass through the IDFT, it should return you back the sequence $x$, though I will not show it here.

## Implementation

The C++ program below simply runs the same sequence from the example through DFT, and passes the result through IDFT to get back the original sequence. I placed more emphasis on readability of the code, so it is not designed to be optimized but rather to prove the correctness of the mathematical formulations of DFT and IDFT.

#include <iostream>
#include <complex>
#include <cmath>
#include <iomanip>
#include <vector>

using namespace std;

double PI = acos(0) * 2;
typedef complex<double> xd;
typedef vector<double> dvec;
typedef vector<xd> xvec;
const xd J(0, 1); // sqrt(-1)

xvec dft(const dvec &input)
{
double N = input.size();
xvec X(N);

for (double k = 0; k < N; ++k) {
for (double n = 0; n < N; ++n) {
X[k] += (double)input[n] * exp(-2. * J * PI * n * k / N);
}
}

return X;
}

dvec idft(const xvec &input)
{
double N = input.size();
xvec x(N);
dvec out(N);

for (double k = 0; k < N; ++k) {
for (double n = 0; n < N; ++n) {
x[k] += input[n] * exp(2. * J * PI * n * k / N);
}
out[k] = x[k].real() / N;
}

return out;
}

int main()
{
cout << fixed << setprecision(2);

dvec input = { 1, 2, 3, 4 };

// convert from time to frequency domain
xvec freqdom = dft(input);

for (const auto &f : freqdom) {
cout << f << endl;
}
cout << endl;

// convert from frequency to time domain
dvec timedom = idft(freqdom);

for (const auto &t : timedom) {
cout << t << ' ';
}
cout << endl;
}


The output of this program will then be:

(10.00,0.00)
(-2.00,2.00)
(-2.00,-0.00)
(-2.00,-2.00)

1.00 2.00 3.00 4.00

With DFT and IDFT working properly we now use them to implement convolution:

dvec convolve(const dvec &a, const dvec &b)
{
// calculate degree of resulting polynomial
size_t N = 2 * a.size() - 1;

// extend size and pad with 0
dvec acof(N, 0), bcof(N, 0);
copy(a.begin(), a.end(), acof.begin());
copy(b.begin(), b.end(), bcof.begin());

xvec apv, bpv, cpv(N);

// evaluation
apv = dft(acof);
bpv = dft(bcof);

// point-wise multiplication
for (size_t i = 0; i < N; ++i) {
cpv[i] = apv[i] * bpv[i];
}

// interpolation
return idft(cpv);
}


See the full working DFT convolution program from this gist.

## Conclusion

So, we covered DFT and vector convolution, and applied them to solve polynomial multiplication. What’s the problem now? Pointwise multiplication, together with evaluation and interpolation using DFT and IDFT, has a time complexity of $O(N) + 2 \times O(N^2) \approx O(N^2)$. That’s actually no different from, if not worse than, using the naive method. To circumvent this, I will introduce FFT in the next post, which will compute the evaluation and interpolation in $O(N \log N)$ time. Stay tuned!

Update: Part 2 is up!

# Setting Up CUDA + cuDNN for Theano in Ubuntu

These are my personal notes on setting up Theano with my laptop GPU. They are basically an amalgam of various sources that I pieced together to make everything work, which I will link at the end of this post.

As of this writing, this is my setup:

• Linux Mint 18 (Cinnamon) 64-bit (based on Ubuntu 16.04)
• NVIDIA GT740M (Kepler architecture)
• Theano 0.8.2

## NVIDIA Graphic Drivers

Linux Mint gives you an option to install the drivers from the settings, but it may be dated. To get the latest drivers, you may install the drivers via PPA: https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

IMPORTANT: You need to install the drivers first, before installing CUDA. The order is very important, since CUDA checks what version of the graphics driver you are using and installs accordingly. On a related note – should you upgrade/downgrade your graphics driver, you will need to install CUDA again. I emphasize this because, should you fail to do so, the errors that follow give you no indication whatsoever that you screwed this step up.

## CUDA

Open a terminal in the download directory and enter the first command suggested on the downloads site. It should look like this:

sudo dpkg -i cuda-repo-ubuntu1504-7-5-local_7.5-18_amd64.deb

Change your /var/cuda-repo-7-5-local/Release to the following:

Origin: NVIDIA
Label: NVIDIA CUDA
Architecture: repogenstagetemp
MD5Sum:
51483bc34577facd49f0fbc8c396aea0 75379 Packages
4ef963dfa4276be01db8e7bf7d8a4f12 21448 Packages.gz
SHA256:
532b1bb3b392b9083de4445dab2639b36865d7df1f610aeef8961a3c6f304d8a 75379 Packages
2e48cc13b6cc5856c9c6f628c6fe8088ef62ed664e9e0046fc72819269f7432c 21448 Packages.gz

Run the following, ignoring any warnings about invalid signatures:

sudo apt-get update

Then run:

sudo apt-get install cuda

Keep an eye on the output; there should not be any errors. This will install CUDA in /usr/local/cuda/

## Add CUDA To Environment Paths

Open ~/.bashrc and append the following (it is your own file, so sudo should not be needed):

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Once you save, enter in a terminal:

source ~/.bashrc

This will add the CUDA executables to the environment paths of the current terminal. Note that other terminals that are already open will not pick up the change; you will need to restart them (close and reopen) for the changes to take effect.

And then you can open a new terminal and type nvcc --version (nvcc is the Nvidia CUDA Compiler) to see whether the environment is set correctly. It should print the compiler version and no errors.

## Solving gcc/g++ Incompatibilities

CUDA requires a compatible C/C++ compiler to work. The one that comes bundled with Ubuntu 16.04 (gcc 5) is too new for CUDA 7.5. To fix this, install gcc 4.9:

sudo apt-get install gcc-4.9 g++-4.9

Then we may establish a soft link of the specific version for the CUDA binaries folder:

sudo ln -s /usr/bin/gcc-4.9 /usr/local/cuda/bin/gcc
sudo ln -s /usr/bin/g++-4.9 /usr/local/cuda/bin/g++

IMPORTANT! Now, if you run import theano for the first time with the THEANO_FLAGS environment variable containing device=gpu, theano complains that CUDA is not available. To run any python script that uses Theano, you need to prepend the command with THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES. All python scripts executed here will be using this workaround. Alternatively there is a fix here: https://github.com/Theano/Theano/issues/4425 (thanks Anonoz for the suggestion).

Now running the following line:

THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES python -c "import theano; print(theano.sandbox.cuda.device_properties(0))"

Should give you something like this:

Using gpu device 0: GeForce GT 740M (CNMeM is disabled, CuDNN not available)
{'major': 3, 'tccDriver': 0, 'kernelExecTimeoutEnabled': 1, 'deviceOverlap': 1, 'driverVersion': 8000, 'warpSize': 32, 'concurrentKernels': 1, 'maxThreadsPerBlock': 1024, 'computeMode': 0, 'canMapHostMemory': 1, 'maxGridSize2': 65535, 'maxGridSize1': 65535, 'maxGridSize0': 2147483647, 'integrated': 0, 'minor': 0, 'ECCEnabled': 0, 'runtimeVersion': 7050, 'textureAlignment': 512, 'multiProcessorCount': 2, 'clockRate': 895000, 'totalConstMem': 65536, 'name': 'GeForce GT 740M', 'memPitch': 2147483647, 'maxThreadsDim1': 1024, 'maxThreadsDim0': 1024, 'maxThreadsDim2': 64, 'coresCount': -2, 'sharedMemPerBlock': 49152, 'regsPerBlock': 65536}

## cuDNN

NVIDIA provides a library for common neural network operations that especially speeds up Convolutional Neural Networks (CNNs). For Lasagne, it is necessary that you install this to get a convnet to work. It can be obtained from NVIDIA (after registering as a developer): https://developer.nvidia.com/cudnn

Don’t expect an instant email upon registration. For some reason it takes quite a while for them to send that email. I waited about 30 minutes.

Once you are in, choose version 4. That’s the one currently supported by Theano.

To install it, copy the *.h files to /usr/local/cuda/include and the lib* files to /usr/local/cuda/lib64
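
The copy step above can also be scripted. The following is a minimal sketch, not an official installer: the function name install_cudnn and the idea of walking the whole extracted archive are my own assumptions, since the exact folder layout of the cuDNN download is not described here.

```python
import fnmatch
import os
import shutil


def install_cudnn(src_dir, cuda_dir="/usr/local/cuda"):
    """Copy cuDNN headers (*.h) and libraries (lib*) from src_dir into cuda_dir.

    src_dir is assumed to be the folder where the cuDNN archive was extracted.
    Returns the list of destination paths that were written.
    """
    # (filename pattern, destination subfolder), as in the instructions above.
    rules = [("*.h", "include"), ("lib*", "lib64")]
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            for pattern, subdir in rules:
                if fnmatch.fnmatch(name, pattern):
                    dest_dir = os.path.join(cuda_dir, subdir)
                    if not os.path.isdir(dest_dir):
                        os.makedirs(dest_dir)
                    shutil.copy(os.path.join(root, name), dest_dir)
                    copied.append(os.path.join(dest_dir, name))
                    break
    return copied
```

Writing into /usr/local/cuda normally requires root, so a script calling install_cudnn("path/to/extracted/cudnn") would have to be run with sudo.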

To check whether it is installed, run:

THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES python -c "from theano.sandbox.cuda.dnn import dnn_available as d; print(d() or d.msg)"

It will print True if everything is fine, or an error message otherwise. There are no additional steps required for Theano to make use of cuDNN.

Again, if everything is successful, you can run your Python scripts like this (the following is deep_q_rl, a Theano-based implementation of Deep Q-learning using Lasagne):

THEANO_FLAGS=device=gpu,nvcc.flags=-D_FORCE_INLINES python run_nips.py --rom breakout
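
If retyping the flag prefix gets tedious, it can be set from a small launcher instead. This is only a sketch of one way to do it: the helper names theano_env and run_with_theano_flags are hypothetical, and the flag string is simply the workaround used throughout this post.

```python
import os
import subprocess
import sys

# The workaround flags used throughout this post.
THEANO_FLAGS = "device=gpu,nvcc.flags=-D_FORCE_INLINES"


def theano_env(flags=THEANO_FLAGS, base_env=None):
    """Return a copy of the environment with THEANO_FLAGS set to flags."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = flags
    return env


def run_with_theano_flags(args, flags=THEANO_FLAGS):
    """Run the current Python interpreter with args and THEANO_FLAGS set.

    Returns the exit code of the child process.
    """
    command = [sys.executable] + list(args)
    return subprocess.call(command, env=theano_env(flags))
```

For example, run_with_theano_flags(["run_nips.py", "--rom", "breakout"]) should behave like the command line above, provided it is called from the deep_q_rl folder.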

# Synchronizing BibTeX in Overleaf (BibLaTeX) and Texmaker (MiKTeX, Apacite)

In this post I detail how to get BibTeX working on Overleaf (previously known as WriteLaTeX) and Texmaker (Windows 10 64-bit, MiKTeX). Note that the citation format I’m using is APA, as specified by my university.

Overleaf has the advantage of collaborative editing with (almost) live previewing, but I hit a lot of problems getting documents with BibTeX that I wrote there to compile in Texmaker: they simply don’t compile. Conversely, copy-pasting working BibTeX code from Texmaker into Overleaf produces compile errors.

So the best workflow I can come up with at the moment is this: create a LaTeX document from my template (get it from here: Overleaf to Texmaker Bibtex Template. There should be 2 files: main.tex and ref.bib. If the template link is not working, you can get it from this Github gist instead), edit the document collaboratively in Overleaf, and then, when you need it to compile in Texmaker, download the project as a zip and change some code.

Fortunately, it’s only 2 blocks of code, annotated as “SETUP DOCUMENT” and “END DOCUMENT”. You’ll find them at the start and end of the LaTeX document respectively. These code blocks (provided in the template; the Texmaker version is commented out) need to be changed when moving your code back to Texmaker.

#### Overleaf

SETUP DOCUMENT:

% BEGIN -- SETUP DOCUMENT --
\documentclass[a4paper,12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[british]{babel}
\usepackage{csquotes}
\usepackage[backend=biber,style=apa]{biblatex}
\addbibresource{ref.bib} % biblatex needs the .bib file declared in the preamble
\DeclareLanguageMapping{british}{british-apa}
\usepackage[pdftex]{graphicx}
\let\cite\parencite
\begin{document}
% END -- SETUP DOCUMENT --


END DOCUMENT:

% BEGIN -- END DOCUMENT --
\printbibliography
\end{document}
% END -- END DOCUMENT --


#### Texmaker

SETUP DOCUMENT:

% BEGIN -- SETUP DOCUMENT --
\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{apacite}
\begin{document}
\bibliographystyle{apacite}
% END -- SETUP DOCUMENT --


END DOCUMENT:

% BEGIN -- END DOCUMENT --
\bibliography{ref}{}
\end{document}
% END -- END DOCUMENT --
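
Swapping the two blocks by hand after every download gets repetitive, so it can be scripted. The sketch below is one possible approach, assuming the downloaded main.tex still carries the BEGIN/END marker comments exactly as shown above. The function names replace_block and overleaf_to_texmaker are made up for illustration, and the Texmaker block contents are copied from this post rather than read from the template.

```python
import re

# Texmaker versions of the two blocks, copied from this post.
TEXMAKER_SETUP = r"""\documentclass[a4paper,12pt]{article}
\usepackage{hyperref}
\usepackage{apacite}
\begin{document}
\bibliographystyle{apacite}"""

TEXMAKER_END = r"""\bibliography{ref}{}
\end{document}"""


def replace_block(tex, name, body):
    """Replace everything between '% BEGIN -- name --' and '% END -- name --'."""
    begin = "% BEGIN -- {} --".format(name)
    end = "% END -- {} --".format(name)
    pattern = re.compile(re.escape(begin) + r".*?" + re.escape(end), re.DOTALL)
    replacement = begin + "\n" + body + "\n" + end
    # A function is used as the replacement so backslashes in the LaTeX
    # are inserted literally instead of being read as regex escapes.
    new_tex, count = pattern.subn(lambda _match: replacement, tex)
    if count != 1:
        raise ValueError("expected one {!r} block, found {}".format(name, count))
    return new_tex


def overleaf_to_texmaker(tex):
    """Return tex with both Overleaf blocks swapped for the Texmaker ones."""
    tex = replace_block(tex, "SETUP DOCUMENT", TEXMAKER_SETUP)
    return replace_block(tex, "END DOCUMENT", TEXMAKER_END)
```

To use it, read main.tex into a string, pass it through overleaf_to_texmaker, and write the result to a new file so the Overleaf version stays untouched.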


If you receive warning messages in Texmaker that go something like

Citation ‘blablabla2007’ undefined

when you press F1 (Quick Build), you will need to enable BibTeX in your build. To do this, go to “Options > Configure Texmaker” and, under the “Quick Build” tab, select the quick-build command “PdfLaTeX + Bib(la)tex + PdfLaTeX (x2) + View Pdf”.

Tip: You can easily generate BibTeX code with bibme.org.

There is another variation of this: move your content into a separate *.tex file and keep 2 master documents, one for Overleaf (main.tex) and one for Texmaker (main-texmaker.tex or whatever name you want), which both include the same content file. In Texmaker, you’ll need to set one document as the master document to work with multiple files.

Have fun with Overleaf!