CS231n – Assignment 1 Tutorial – Q2: Training a Support Vector Machine

This is part of a series of tutorials I’m writing for CS231n: Convolutional Neural Networks for Visual Recognition. Go to this page to see the full listing.

To conserve space, I won’t be placing my full solutions in this post. Refer to my github repository for full source code.

Loss is fairly straightforward so I will be skipping that. After you successfully implemented that, it should give a value between 8 to 10. If you replace W with a zero matrix instead, the loss value will be exactly 9.

Though not in the lecture video, you can find the formula for the gradient in the lecture notes. We define margin as a matrix of margins where: $\text{margin}_{i, j} = w_j \cdot x_i - w_{y_i}\cdot x_i + \Delta$

For each sample i,  we look through the n number of margins (n being the number of classes we want to classify).

If that class is the correct class (denote as $y_i$), it contributes the sample row i itself with a weight of the negation of the total number (where 1 is the indicator function that is one if the condition inside is true or zero otherwise) of incorrect classes where the margin is > 0, given n number of classes: $\nabla_{w_{y_i}} L_i = - \left( \sum\limits_{j=1, j\neq y_i}^n 1(\text{margin}_{i,j} > 0) \right) x_i$

If that class is incorrect, it contributes the sample row i depending on whether the margin is >0: $\nabla_{w_j} L_i = 1 \left( \text{margin}_{i,j} > 0 \right) x_i$

After you implement the naive version of SVM gradient, it should match up fairly closely with the numerically computed gradient:

numerical: 0.449971 analytic: 0.449971, relative error: 5.636661e-10
numerical: -16.007355 analytic: -16.007355, relative error: 6.481167e-12
numerical: 9.938192 analytic: 9.938192, relative error: 8.279972e-12
numerical: 6.561015 analytic: 6.540654, relative error: 1.554020e-03
numerical: 0.151951 analytic: 0.151951, relative error: 1.675199e-09
numerical: 3.936123 analytic: 3.936123, relative error: 7.290840e-11
numerical: -28.234382 analytic: -28.234382, relative error: 4.143686e-12
numerical: -0.452718 analytic: -0.452718, relative error: 4.690168e-10
numerical: -22.248917 analytic: -22.248917, relative error: 1.991237e-11
numerical: 20.414164 analytic: 20.381418, relative error: 8.027022e-04
numerical: 10.663350 analytic: 10.663350, relative error: 8.140355e-12
numerical: 2.612748 analytic: 2.612748, relative error: 6.536924e-11
numerical: -4.644796 analytic: -4.644796, relative error: 5.570529e-12
numerical: 15.987607 analytic: 15.987607, relative error: 2.419968e-11
numerical: -35.476517 analytic: -35.476517, relative error: 9.777531e-12
numerical: -25.275914 analytic: -25.275914, relative error: 5.744716e-12
numerical: -3.928484 analytic: -3.928484, relative error: 1.384318e-11
numerical: -16.666177 analytic: -16.666177, relative error: 1.360765e-12
numerical: 17.289968 analytic: 17.289968, relative error: 4.612259e-11
numerical: 4.519505 analytic: 4.519505, relative error: 1.483479e-11

On average if the relative error is e-8 instead, check that you have implemented regularization. This is not explicitly given in the notes, but it is hinted in the assignment:

you didn’t forget the regularization gradient did you?

You can find the regularization gradient by finding the derivative of the loss function regularization. Given that the loss function regularization is $\frac{1}{2}\lambda \sum W^2$ where $\lambda$ is the regularization parameter, the derivative of that with respect to W is then $\lambda \sum W$.

To implement vectorization we need to fuse operations together. So now we try to find the gradient of a given class k for all n classes. The result is a single row with the number of columns being the number of input features (1 x 3073).  We define that the total number of samples is m. This would be the mean of all correct ( $\nabla_{w_{y}} L$) and incorrect ( $\nabla_{w_j} L$) samples given the class k: $\nabla_{w_k} L = \frac{1}{m}\left[ \nabla_{w_j} L + \nabla_{w_{y}} L \right]$ $\nabla_{w_{y}}L= \sum\limits_{i=1}^{m}\left[- \left( \sum\limits_{j=1, j \neq y_i}^{n} 1 \left( \text{margin}_{i,j} > 0 \right) \right) x_{i}\right]$ $\nabla_{w_{j}}L= \sum\limits_{i=1}^{m} 1 \left( \text{margin}_{i,j} > 0 ) x_i \right)$

This can be implemented in a few lines of python code:

# Semi-vectorized version... It's not any faster than the loop version.
num_classes = W.shape
incorrect_counts = np.sum(margins > 0, axis=1)
for k in xrange(num_classes):
# use the indices of the margin array as a mask.
wj = np.sum(X[margins[:, k] > 0], axis=0)
wy = np.sum(-incorrect_counts[y == k][:, np.newaxis] * X[y == k], axis=0)
dW[:, k] = wj + wy
dW /= num_train # average out weights


incorrect_counts counts the margins for each sample where margin > 0. The result is a single column vector is m number of rows. With this, for each class k, we select the samples in which the correct class is k. Because incorrect_counts is a vector, we need to change it to a matrix of size x 1 by adding [:, np.newaxis] so that the each element of the vector will be mapped to X for scalar multiplication.

There is a faster way, by simply using a single dot product, though it may be tough to figure it out. The intuition is that aside seeing it as converting from one dimension to another, an interesting way to look at matrix multiplication is that it is a grouping of sums of specific combination of rows together; each row in B is a sum of different row combinations of the matrix X, and you control it in the matrix A.

Let’s put this idea to a simpler context. Suppose we have a 4×3 matrix X, which maps to 4 samples (row) and 3 features (column). If we set k to 2, then A is a 2×4 matrix; k number of rows and m number of samples. $A^TX$ will then return a matrix B with dimensions 2×3.

The following code snippet:

import numpy as np
X = np.arange(12).reshape((4, 3))
A = np.zeros((2, 4))
B = A.dot(X)
print A.shape, X.shape, B.shape


Will produce the output:

(2, 4) (4, 3) (2, 3)

Each row in B maps to a class k and is a sum of specific rows in X. In A, each row maps to a class k, and each column maps to a specific sample. If we simplify B to be a boolean array of 1 and 0, then we choose 1 to select, 0 otherwise. For example: $\begin{bmatrix}0 & 1 & 0 & 1 \\1 & 0 & 1 & 0\end{bmatrix} \times \begin{bmatrix}0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \\ 9 & 10 & 12 \end{bmatrix} = \begin{bmatrix}12 & 14 & 16 \\ 6 & 8 & 10 \end{bmatrix}$

Row 1 in B is a sum of row 2 and 4 of X, and row 2 in B is sum of row 1 and 3 of X. How matrix A chooses its rows is by multiplication: if the value is 0, multiplying gives an empty row and multiplying by 1 returns the row itself. If instead you change 1 to -2 or 0.5, it will be multiplied by that weight instead. Do play around this in your python console to get the intuition before proceeding.

With this we can attain a succinct formulation: $Q_{i, k} = \begin{cases} - \left( \sum\limits_{j=1, j \neq y_i}^{n} 1 \left(\text{margin}_{i,j} > 0 \right) \right) & \text{if } k = y_i \\ 1 \left( \text{margin}_{i,k} > 0 \right) & \text{otherwise}\end{cases}$ $\nabla_{w} L = \frac{1}{m} \left( Q^{T}X \right)^T =\frac{1}{m} X^TQ$

Now we can take this idea and apply in our code:

X_mask = np.zeros(margins.shape)


Notice that this will factor only the incorrect classes because we previously set all the cells in the margin array that map to the correct class to 0:

margins[np.arange(num_train), y] = 0 # done this before


Of course, our X_mask matrix is incomplete because we didn’t factor in the correct classes. This is not too complicated to do. We understood that for every class k, there are rows in X where class k is the correct class. For each of these rows, we multiply it by the negation of number of incorrect classes where the margin is >0. We don’t need to recalculate this; we can already find this value for each of these rows from X_mask itself, since now each of its rows is 1 where the margin is >0 and 0 otherwise. Finding the number of incorrect classes where the margin is >0 is as simple as finding the sum of the columns for each row:

incorrect_counts = np.sum(X_mask, axis=1)


With this we can find the gradient matrix. The full implementation is below:

# Fully vectorized version. Roughly 10x faster.
# for each sample, find the total number of classes where margin > 0


Lastly you should remember to regularize the gradient. The gradient with regularization is therefore: $\nabla_{w} L = \frac{1}{m} X^TQ + \lambda \sum W$

The loss function is visualized: If the graph is drastically more fuzzy or it doesn’t seem to be decreasing, chances are somewhere in your code you are doing it wrong. Here’s what happen when I forgot to regularize the gradient: The loss is still decreasing, but not as fast as it should be.

The training and validation accuracy is visualized: In the last section code has been given to help you visualize the weights. With proper hyperparameters, more iterations will give a smoother result: CS231n – Assignment 1 Tutorial – Q3: Implement a Softmax classifier

This is part of a series of tutorials I’m writing for CS231n: Convolutional Neural Networks for Visual Recognition. Go to this page to see the full listing.

To conserve space, I won’t be placing my full solutions in this post. Refer to my github repository for full source code.

The loss function is covered in the lecture and course notes. However, I find myself a little lost when looking for the formula for the gradient. What I find on the internet usually looks something like this: $\text{if } i = j :\; \frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$ $\text{if } i \neq j :\; \frac{\partial y_i}{\partial z_j} = -y_i y_j$

And I have a lot of trouble figuring out how to implement it. Fortunately, there are notes on this, though not explicitly linked in the course site itself. Here is the link. I wasn’t able to see how these 2 formulas are also the derivative of the Softmax loss function, so anyone who is able to explain that I’d be really grateful.

For each sample, we introduce a variable p which is a vector of the normalized probabilities (normalize to prevent numerical instability. Refer to course notes: “Practical issues: Numeric stability.“): $f_i = W x_i$ $p_k = \frac{e^{f_k-\max_j f_j}}{ \sum_j e^{f_j-\max_j f_j} }$

In my naive solution, this is implemented as a lambda function:

f_i = X[i].dot(W)
f_i -= np.max(f_i)
sum_j = np.sum(np.exp(f_i))
p = lambda k: np.exp(f_i[k]) / sum_j


The cross-entropy loss function L for a sample i is therefore: $L_i = -\log\left(p_{y_i}\right)$

Total loss for all m samples with regularization is then: $L = \frac{1}{m}\sum_i \left[-\log\left(p_{y_i}\right)\right] + \frac{1}{2}\lambda \sum W^2$

The gradient for a class k for a given sample i is therefore: $\nabla_{w_k}L_i = (p_k -1\{y_i = k\})x_i$

(Recall that 1 is the indicator function that is one if the condition inside is true or zero otherwise)

When converting to vectorized implementation, as with SVM, you need to somehow change to use a single dot product operation. Fortunately this is simpler than SVM. We define a matrix Q where the number of rows (each sample as i) is the sample size, and the columns (each class as k) being the number of classes: $Q_{i, k} = p_k -1\{y_i = k\}$ $\nabla_{w} L = \frac{1}{m} X^TQ + \lambda \sum W$

Here is a snippet of the implementation:

num_train = X.shape
f = X.dot(W)
f -= np.max(f, axis=1, keepdims=True) # max of every sample
sum_f = np.sum(np.exp(f), axis=1, keepdims=True)
p = np.exp(f) / sum_f
ind = np.zeros_like(p)
ind[np.arange(num_train), y] = 1
Q = p - ind
dW = X.T.dot(Q)


Though not in the python notebook, I added the loss plot function (with some slight alterations) from the SVM notebook to visualize the loss as the number of iterations increase. You can refer to my python notebook for the code; I will only show you the plot result, which looks remarkably similar to SVM: You should be able to get a validation accuracy pretty close to 0.4 (I got about 0.39).

When visualizing the weights, you should get a somewhat similar result to the SVM classifier: 