Deep Q-Learning 101: Part 1 – Convolutional Neural Networks

This is a 3 part series of Deep Q-Learning, which is written such that undergrads with highschool maths should be able to understand and hit the ground running on their deep learning projects. This series is really just the literature review section of my final year report (which in on Deep Q-Learning) broken to 3 chunks:


You can skip this. It is just me explaining how I end up publishing this.

When I was writing my final year report I was advised by my supervisor (Dr Wong Ya Ping) to write in such a way that the average layman could understand. I think he put it as “write it such that even your mom can understand” (my mom is pretty highly educated by the way). So I went out of my way to take a course on it, and kept simplifying my writing under my co-supervisor’s invaluable and meticulous review (Dr Ng Boon Yian – he is famous for reviewing final year reports). After my final year project was complete I just shelved everything and gave myself a pat in the back.

Fast forward 7 months to today. I passed my report around as reference for my juniors, and one of them commented that it was like a crash course in Convolutional Neural Networks (CNN). In a week or so, a friend from the engineering faculty was having difficulty understanding CNN, so I passed my report to him. He said it clears a lot of things up, since he had been mostly referring to the Tensorflow documentation, which is focused on teaching you how to use the library not teach you machine learning. So, with that I decided to breakdown the literature review of my report to 3 parts and publish it in my blog, in hopes to enlighten a wider audience. Hope it will clear things up for you as well!

I myself am not very focused on machine learning at the time being; I have decided to direct my attention on the study of algorithms via the Data Structures and Algorithms Specialization in Coursera. So chances are if you ask some machine learning question now I won’t be able to understand, but I’ll try. (:


In this post, I will introduce machine learning and its three main branches. Then, I will talk about neural networks, along with the biologically-inspired CNN. In part 2, I will introduce the reader to reinforcement learning (RL), followed by the RL technique Q-Learning. In the final part, I piece together everything when explaining Deep Q-Learning.

If you are here just to understand CNN, this first part is all you need.

Types of Machine Learning

The art and science of having computer programs learn without explicitly programming it to do so is called machine learning. Machine learning, then, is about making computers modify or adapt their actions (whether these actions are making predictions, or controlling a robot) so that these actions get more accurate. Machine learning itself is divided to three broad categories (Marsland, 2015):

  • Supervised Learning – We train a machine with a set of questions (inputs), paired with the correct responses (targets). The algorithm then generalizes over this training data to respond to all possible inputs. Included in this category of learning techniques is neural networks.
  • Unsupervised Learning – Correct responses are not provided, but the algorithm looks for patterns in the data and attempts to cluster them together. Unsupervised learning will not be covered in this series as it is not used in Deep Q-Learning.
  • Reinforcement Learning – A cross between supervised learning and unsupervised learning. The algorithm is told when the answer is wrong, but is not shown how to correct it. It has to explore different possibilities on its own until it figures out the right answer.

A system that uses these learning techniques to make predictions is called a model.

The Artificial Neural Network (ANN)

The Neuron

The simplest unit in an ANN is called a neuron. The neuron was introduced in 1943 by Warren S. McCulloch, a neuroscientist, and Walter Pitts, a logician (McCulloch & Pitts, 1943). Inspired by biological neurons in the brain, they proposed a mathematical model that extracts the bare essentials of what a neuron does: it takes a set of inputs and it either fires (1) or it does not (0). In other words, a neuron is a binary classifier; it classifies the inputs into 2 categories.

Mathematical Model of a neuron (Marsland, 2015)

In a neuron, a set of m inputs x_{1}\ldots x_{m} is multiplied by a set a weights w_{1}\ldots w_{m} (the weights are learned over time) and summed together. Both  x and w are typically represented as vectors.

h=\sum\limits_{i=1}^m w_ix_i

The result, h, is then passed to an activation function, which returns an output (1 or 0).

Building ANN from Neurons

A neuron by itself cannot do much; we need to put sets of neurons together into an ANN before they can be anything useful.

ANN – each circle (node) is a neuron (Karpathy et al., 2016)

What happens after we clump these neurons together to layers? How do they learn? The algorithm will learn by example (supervised learning); the dataset will have the correct output associated with each data point. It may not make sense to provide the answers, but the main goal of an ANN is to generalise over the data; finding patterns and predict new examples correctly.

To teach an ANN, we use an algorithm called back-propagation.

Back-propagation Algorithm

Back-propagation algorithm consists of two main phases, executed in order:

  • Forward propagation – the inputs are passed through the ANN starting at the input layer, and predictions are made at the output layer.
  • Weight update – from the predictions, we calculate how far we differ from the answer (also known as the loss). We then use this information to update the weights in the reverse direction; starting from the output layer, back to the input layer.

The weight update step is made possible by another algorithm: gradient descent.

Loss and Gradient Descent

To use gradient descent, we first need to define a loss function L, which calculates the loss. For each sample i, loss is the difference between the predicted value h_w(x^i) and the actual value y^i for all m samples. There are various methods of calculating the loss; one of the most popular would be mean squared error function:

L(w) = \frac{1}{m} \sum\limits_{i=1}^m \left( h_w(x^i) - y^i\right)^2

The goal of the ANN is then to minimize the loss. To do this we find the derivative of the loss function with respect to the weights, \nabla_w L(w). This gives us the gradient of the error. Since the purpose of learning is to minimize the loss, nudging the values of the weights in the direction of the negative gradient will reduce the loss. We therefore define the back-propagation update rule of the weights as:

w = w - \alpha \nabla_w L(w)

\alpha here is known as the learning rate, which is a parameter that we tweak to determine how strong we will nudge the weights with each update. An update step of the weights (including both forward and backward pass) on one sample is known as an iteration; when we iterate over all samples one time, we call this an epoch. More epochs would usually mean better accuracy, but up until the ANN converges to a possible solution. In gradient descent, 1 iteration is also 1 epoch, because the entire dataset is processed in each iteration.

Stochastic Gradient Descent

What happens if the dataset gets constantly updated with new data (transaction data, weather information, traffic updates, etc.)? There is a variation of gradient descent that allows us to stream the data piece by piece into the ANN: stochastic gradient descent (SGD). To do this a simple modification is made to the loss function (which consequently changes the derivative): instead of factoring the entire dataset in each update step we take in only a single input at a time:

L(w) = \left( h_w(x^i) - y^i\right)^2

For SGD, a dataset of 500 samples would take 500 iterations to complete 1 epoch. Deep Q-Learning uses SGD to perform updates to the weights (Mnih et al., 2013), using a rather unique loss function. I will elaborate on this in part 3.

Convolutional Neural Networks

Convolutional Neural Networks (CNN) are biologically-inspired variants of multi-layered neural networks. From Hubel and Wiesel’s work on visual cortex of cats (Hubel & Wiesel, 1963), we understand that the cells in the visual cortex are structured in a hierarchy: simple cells respond to specific edges, and their outputs are received by complex cells.

Hierarchical structure of the neurons (Lehar, n.d.)

In a CNN, neurons are arranged into layers, and in different layers the neurons specialize to be more sensitive to certain features. For example in the base layer the neurons react to abstract features like lines and edges, and then in higher layers neurons react to more specific features like eye, nose, handle or bottle.

A CNN is commonly composed of 3 main types of layers: Convolutional Layer, Pooling Layer, and Fully-Connected Layer. These layers are stacked together and inputs are passed forward and back according to that order.

Architecture of LeNet-5, a CNN used for digits recognition (LeCun et al., 1998)

Convolutional Layer

Each convolutional layer consist of a set of learnable filters (also referred to as kernels), which is small spatially but extends through the full depth of the input volume. For example, for a coloured image (images passed into a CNN are typically resized to a square) as input, a filter on a first layer of a CNN might have size 5\times 5\times 3 (5 pixels width and height, and 3 color channels, RGB). During the forward pass, we slide (or convolve) each filter across the width and height of the input volume (the distance for each interval we slide is call a stride) and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map (also referred to as feature map). We will stack these activation maps along the depth dimension to produce the output volume (Karpathy et al., 2016), therefore the number of filters = depth of output volume.

Click image (you will need to scroll down a bit) to check out an interactive demo (built by Karpathy) of the convolutional layer at work. 2 filters, size 3*3*3, with stride of 1.

In summary, each convolutional layer requires a 3 hyperparameters to be defined: filter size (width and height only; the depth will match with the input volume), number of filters, and stride.

After an input is convolved in one layer, the output volume will pass through a Rectified Linear Units (RELU) layer. In RELU, an elementwise activation function is performed, such as:

f(x) = \max(0, x)

The output volume dimensions remains unchanged. RELU is simple, computationally efficient, and converges much faster than other activation functions (sigmoid, tanh) in practice.

Pooling Layer

Pooling layers performs a down sampling operation (that is why pooling operations are also called subsampling) and reduces the input dimensions. It is used to control overfitting (the state where the ANN becomes too scrupulous, and cannot generalise the input) by incrementally reducing the spatial size of the input to reduce the amount of parameters and computation in the network. Though there are many types of pooling layers, the most effective and simple is max pooling, illustrated below:

Illustration of max pooling (Karpathy et al., 2016)

There are proposed solutions to replace pooling layers altogether by simply increasing the stride (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2014), and it seems likely that future architectures will either have very few to no pooling layers. Pooling layers are not used in DeepMind’s Deep Q-Learning implementation, and this will be explained later in part 3.

Fully Connected Layer

Neurons in a fully connected (FC) layer have full connections to all activations in the previous layer, as seen in regular ANN as described previously. In certain implementations such as Neon, FC layers are referred to as affine layers.


There is a detailed writeup of CNN in the CS231n course by Karpathy: Convolutional Neural Networks (CNNs / ConvNets). I’d recommend taking a look at that for a more detailed (and more math intensive) explanation.

Of course, if you have time, the best way to get a proper foundation would be take up Andrew Ng’s machine learning course. I have gotten a cert from it, and if you are serious on this subject I’d suggest you enroll as well. Andrew even has a 5 course specialization on deep learning it now, though I won’t be taking it up anytime soon. What you will find, is that deep learning is more than just GPU’s and this magic black box called Tensorflow.


  • Marsland, S. (2015). Machine learning: an algorithmic perspective. CRC press.
  • McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4), 115–133.
  • Karpathy, A., Li, F., & Johnson, J. (2016). Cs231n convolutional neural network for visual recognition. Stanford University.
  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Hubel, D., & Wiesel, T. (1963). Shape and arrangement of columns in cat’s striate cortex. The Journal of physiology, 165(3), 559.
  • Lehar, S. (n.d.). Hubel & Wiesel. Retrieved 2016-08-19, from
  • LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  • Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

CS231n – Assignment 1 Tutorial – Q2: Training a Support Vector Machine

This is part of a series of tutorials I’m writing for CS231n: Convolutional Neural Networks for Visual Recognition. Go to this page to see the full listing.

To conserve space, I won’t be placing my full solutions in this post. Refer to my github repository for full source code.

Loss is fairly straightforward so I will be skipping that. After you successfully implemented that, it should give a value between 8 to 10. If you replace W with a zero matrix instead, the loss value will be exactly 9.

Though not in the lecture video, you can find the formula for the gradient in the lecture notes. We define margin as a matrix of margins where:

\text{margin}_{i, j} = w_j \cdot x_i - w_{y_i}\cdot x_i + \Delta

For each sample i,  we look through the n number of margins (n being the number of classes we want to classify).

If that class is the correct class (denote as y_i), it contributes the sample row i itself with a weight of the negation of the total number (where 1 is the indicator function that is one if the condition inside is true or zero otherwise) of incorrect classes where the margin is > 0, given n number of classes:

\nabla_{w_{y_i}} L_i = - \left( \sum\limits_{j=1, j\neq y_i}^n 1(\text{margin}_{i,j} > 0) \right) x_i

If that class is incorrect, it contributes the sample row i depending on whether the margin is >0:

\nabla_{w_j} L_i = 1 \left( \text{margin}_{i,j} > 0 \right) x_i

After you implement the naive version of SVM gradient, it should match up fairly closely with the numerically computed gradient:

numerical: 0.449971 analytic: 0.449971, relative error: 5.636661e-10
numerical: -16.007355 analytic: -16.007355, relative error: 6.481167e-12
numerical: 9.938192 analytic: 9.938192, relative error: 8.279972e-12
numerical: 6.561015 analytic: 6.540654, relative error: 1.554020e-03
numerical: 0.151951 analytic: 0.151951, relative error: 1.675199e-09
numerical: 3.936123 analytic: 3.936123, relative error: 7.290840e-11
numerical: -28.234382 analytic: -28.234382, relative error: 4.143686e-12
numerical: -0.452718 analytic: -0.452718, relative error: 4.690168e-10
numerical: -22.248917 analytic: -22.248917, relative error: 1.991237e-11
numerical: 20.414164 analytic: 20.381418, relative error: 8.027022e-04
numerical: 10.663350 analytic: 10.663350, relative error: 8.140355e-12
numerical: 2.612748 analytic: 2.612748, relative error: 6.536924e-11
numerical: -4.644796 analytic: -4.644796, relative error: 5.570529e-12
numerical: 15.987607 analytic: 15.987607, relative error: 2.419968e-11
numerical: -35.476517 analytic: -35.476517, relative error: 9.777531e-12
numerical: -25.275914 analytic: -25.275914, relative error: 5.744716e-12
numerical: -3.928484 analytic: -3.928484, relative error: 1.384318e-11
numerical: -16.666177 analytic: -16.666177, relative error: 1.360765e-12
numerical: 17.289968 analytic: 17.289968, relative error: 4.612259e-11
numerical: 4.519505 analytic: 4.519505, relative error: 1.483479e-11

On average if the relative error is e-8 instead, check that you have implemented regularization. This is not explicitly given in the notes, but it is hinted in the assignment:

you didn’t forget the regularization gradient did you?

You can find the regularization gradient by finding the derivative of the loss function regularization. Given that the loss function regularization is \frac{1}{2}\lambda \sum W^2 where \lambda is the regularization parameter, the derivative of that with respect to W is then \lambda \sum W.

To implement vectorization we need to fuse operations together. So now we try to find the gradient of a given class k for all n classes. The result is a single row with the number of columns being the number of input features (1 x 3073).  We define that the total number of samples is m. This would be the mean of all correct (\nabla_{w_{y}} L) and incorrect (\nabla_{w_j} L) samples given the class k:

\nabla_{w_k} L = \frac{1}{m}\left[ \nabla_{w_j} L + \nabla_{w_{y}} L \right]

\nabla_{w_{y}}L= \sum\limits_{i=1}^{m}\left[- \left(  \sum\limits_{j=1, j \neq y_i}^{n} 1 \left( \text{margin}_{i,j} > 0 \right) \right) x_{i}\right]

\nabla_{w_{j}}L= \sum\limits_{i=1}^{m} 1 \left( \text{margin}_{i,j} > 0 ) x_i \right)

This can be implemented in a few lines of python code:

# Semi-vectorized version... It's not any faster than the loop version.
num_classes = W.shape[1]
incorrect_counts = np.sum(margins > 0, axis=1)
for k in xrange(num_classes):
  # use the indices of the margin array as a mask.
  wj = np.sum(X[margins[:, k] > 0], axis=0)
  wy = np.sum(-incorrect_counts[y == k][:, np.newaxis] * X[y == k], axis=0)
  dW[:, k] = wj + wy
dW /= num_train # average out weights

incorrect_counts counts the margins for each sample where margin > 0. The result is a single column vector is m number of rows. With this, for each class k, we select the samples in which the correct class is k. Because incorrect_counts is a vector, we need to change it to a matrix of size x 1 by adding [:, np.newaxis] so that the each element of the vector will be mapped to X for scalar multiplication.

There is a faster way, by simply using a single dot product, though it may be tough to figure it out. The intuition is that aside seeing it as converting from one dimension to another, an interesting way to look at matrix multiplication is that it is a grouping of sums of specific combination of rows together; each row in B is a sum of different row combinations of the matrix X, and you control it in the matrix A.

Let’s put this idea to a simpler context. Suppose we have a 4×3 matrix X, which maps to 4 samples (row) and 3 features (column). If we set k to 2, then A is a 2×4 matrix; k number of rows and m number of samples. A^TX will then return a matrix B with dimensions 2×3.

The following code snippet:

import numpy as np
X = np.arange(12).reshape((4, 3))
A = np.zeros((2, 4))
B =
print A.shape, X.shape, B.shape

Will produce the output:

(2, 4) (4, 3) (2, 3)

Each row in B maps to a class k and is a sum of specific rows in X. In A, each row maps to a class k, and each column maps to a specific sample. If we simplify B to be a boolean array of 1 and 0, then we choose 1 to select, 0 otherwise. For example:

\begin{bmatrix}0 & 1 & 0 & 1 \\1 & 0 & 1 & 0\end{bmatrix} \times \begin{bmatrix}0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \\ 9 & 10 & 12 \end{bmatrix} = \begin{bmatrix}12 & 14 & 16 \\ 6 & 8 & 10 \end{bmatrix}

Row 1 in B is a sum of row 2 and 4 of X, and row 2 in B is sum of row 1 and 3 of X. How matrix A chooses its rows is by multiplication: if the value is 0, multiplying gives an empty row and multiplying by 1 returns the row itself. If instead you change 1 to -2 or 0.5, it will be multiplied by that weight instead. Do play around this in your python console to get the intuition before proceeding.

With this we can attain a succinct formulation:

Q_{i, k} = \begin{cases} - \left(  \sum\limits_{j=1,  j \neq y_i}^{n} 1  \left(\text{margin}_{i,j} > 0 \right) \right)  & \text{if } k = y_i \\ 1 \left( \text{margin}_{i,k} > 0 \right) & \text{otherwise}\end{cases}

\nabla_{w} L = \frac{1}{m} \left( Q^{T}X \right)^T =\frac{1}{m} X^TQ

Now we can take this idea and apply in our code:

X_mask = np.zeros(margins.shape)
X_mask[margins > 0] = 1

Notice that this will factor only the incorrect classes because we previously set all the cells in the margin array that map to the correct class to 0:

margins[np.arange(num_train), y] = 0 # done this before

Of course, our X_mask matrix is incomplete because we didn’t factor in the correct classes. This is not too complicated to do. We understood that for every class k, there are rows in X where class k is the correct class. For each of these rows, we multiply it by the negation of number of incorrect classes where the margin is >0. We don’t need to recalculate this; we can already find this value for each of these rows from X_mask itself, since now each of its rows is 1 where the margin is >0 and 0 otherwise. Finding the number of incorrect classes where the margin is >0 is as simple as finding the sum of the columns for each row:

incorrect_counts = np.sum(X_mask, axis=1)

With this we can find the gradient matrix. The full implementation is below:

# Fully vectorized version. Roughly 10x faster.
X_mask = np.zeros(margins.shape)
X_mask[margins > 0] = 1
# for each sample, find the total number of classes where margin > 0
incorrect_counts = np.sum(X_mask, axis=1)
X_mask[np.arange(num_train), y] = -incorrect_counts
dW =

Lastly you should remember to regularize the gradient. The gradient with regularization is therefore:

\nabla_{w} L = \frac{1}{m} X^TQ + \lambda \sum W

The loss function is visualized:

loss vs iteration number

If the graph is drastically more fuzzy or it doesn’t seem to be decreasing, chances are somewhere in your code you are doing it wrong. Here’s what happen when I forgot to regularize the gradient:

loss vs iteration number forgot regularization

The loss is still decreasing, but not as fast as it should be.

The training and validation accuracy is visualized:

accuracy plot

In the last section code has been given to help you visualize the weights. With proper hyperparameters, more iterations will give a smoother result:

visualize weights

CS231n – Assignment 1 Tutorial – Q3: Implement a Softmax classifier

This is part of a series of tutorials I’m writing for CS231n: Convolutional Neural Networks for Visual Recognition. Go to this page to see the full listing.

To conserve space, I won’t be placing my full solutions in this post. Refer to my github repository for full source code.

The loss function is covered in the lecture and course notes. However, I find myself a little lost when looking for the formula for the gradient. What I find on the internet usually looks something like this:

\text{if } i = j :\; \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)
\text{if } i \neq j :\; \frac{\partial y_i}{\partial z_j} = -y_i y_j

And I have a lot of trouble figuring out how to implement it. Fortunately, there are notes on this, though not explicitly linked in the course site itself. Here is the link. I wasn’t able to see how these 2 formulas are also the derivative of the Softmax loss function, so anyone who is able to explain that I’d be really grateful.

For each sample, we introduce a variable p which is a vector of the normalized probabilities (normalize to prevent numerical instability. Refer to course notes: “Practical issues: Numeric stability.“):

f_i = W x_i

p_k = \frac{e^{f_k-\max_j f_j}}{ \sum_j e^{f_j-\max_j f_j} }

In my naive solution, this is implemented as a lambda function:

f_i = X[i].dot(W)
f_i -= np.max(f_i)
sum_j = np.sum(np.exp(f_i))
p = lambda k: np.exp(f_i[k]) / sum_j

The cross-entropy loss function L for a sample i is therefore:

L_i = -\log\left(p_{y_i}\right)

Total loss for all m samples with regularization is then:

L = \frac{1}{m}\sum_i \left[-\log\left(p_{y_i}\right)\right] + \frac{1}{2}\lambda \sum W^2

The gradient for a class k for a given sample i is therefore:

\nabla_{w_k}L_i = (p_k -1\{y_i = k\})x_i

(Recall that 1 is the indicator function that is one if the condition inside is true or zero otherwise)

When converting to vectorized implementation, as with SVM, you need to somehow change to use a single dot product operation. Fortunately this is simpler than SVM. We define a matrix Q where the number of rows (each sample as i) is the sample size, and the columns (each class as k) being the number of classes:

Q_{i, k} = p_k -1\{y_i = k\}

\nabla_{w} L = \frac{1}{m}  X^TQ + \lambda \sum W

Here is a snippet of the implementation:

num_train = X.shape[0]
f =
f -= np.max(f, axis=1, keepdims=True) # max of every sample
sum_f = np.sum(np.exp(f), axis=1, keepdims=True)
p = np.exp(f) / sum_f
ind = np.zeros_like(p)
ind[np.arange(num_train), y] = 1
Q = p - ind
dW =

Though not in the python notebook, I added the loss plot function (with some slight alterations) from the SVM notebook to visualize the loss as the number of iterations increase. You can refer to my python notebook for the code; I will only show you the plot result, which looks remarkably similar to SVM:


You should be able to get a validation accuracy pretty close to 0.4 (I got about 0.39).

When visualizing the weights, you should get a somewhat similar result to the SVM classifier: