CS231n – Assignment 1 Tutorial – Q2: Training a Support Vector Machine

This is part of a series of tutorials I’m writing for CS231n: Convolutional Neural Networks for Visual Recognition. Go to this page to see the full listing.

To conserve space, I won’t be placing my full solutions in this post. Refer to my github repository for full source code.

Loss is fairly straightforward so I will be skipping that. After you successfully implemented that, it should give a value between 8 to 10. If you replace W with a zero matrix instead, the loss value will be exactly 9.

Though not in the lecture video, you can find the formula for the gradient in the lecture notes. We define margin as a matrix of margins where:

\text{margin}_{i, j} = w_j \cdot x_i - w_{y_i}\cdot x_i + \Delta

For each sample i,  we look through the n number of margins (n being the number of classes we want to classify).

If that class is the correct class (denote as y_i), it contributes the sample row i itself with a weight of the negation of the total number (where 1 is the indicator function that is one if the condition inside is true or zero otherwise) of incorrect classes where the margin is > 0, given n number of classes:

\nabla_{w_{y_i}} L_i = - \left( \sum\limits_{j=1, j\neq y_i}^n 1(\text{margin}_{i,j} > 0) \right) x_i

If that class is incorrect, it contributes the sample row i depending on whether the margin is >0:

\nabla_{w_j} L_i = 1 \left( \text{margin}_{i,j} > 0 \right) x_i

After you implement the naive version of SVM gradient, it should match up fairly closely with the numerically computed gradient:

numerical: 0.449971 analytic: 0.449971, relative error: 5.636661e-10
numerical: -16.007355 analytic: -16.007355, relative error: 6.481167e-12
numerical: 9.938192 analytic: 9.938192, relative error: 8.279972e-12
numerical: 6.561015 analytic: 6.540654, relative error: 1.554020e-03
numerical: 0.151951 analytic: 0.151951, relative error: 1.675199e-09
numerical: 3.936123 analytic: 3.936123, relative error: 7.290840e-11
numerical: -28.234382 analytic: -28.234382, relative error: 4.143686e-12
numerical: -0.452718 analytic: -0.452718, relative error: 4.690168e-10
numerical: -22.248917 analytic: -22.248917, relative error: 1.991237e-11
numerical: 20.414164 analytic: 20.381418, relative error: 8.027022e-04
numerical: 10.663350 analytic: 10.663350, relative error: 8.140355e-12
numerical: 2.612748 analytic: 2.612748, relative error: 6.536924e-11
numerical: -4.644796 analytic: -4.644796, relative error: 5.570529e-12
numerical: 15.987607 analytic: 15.987607, relative error: 2.419968e-11
numerical: -35.476517 analytic: -35.476517, relative error: 9.777531e-12
numerical: -25.275914 analytic: -25.275914, relative error: 5.744716e-12
numerical: -3.928484 analytic: -3.928484, relative error: 1.384318e-11
numerical: -16.666177 analytic: -16.666177, relative error: 1.360765e-12
numerical: 17.289968 analytic: 17.289968, relative error: 4.612259e-11
numerical: 4.519505 analytic: 4.519505, relative error: 1.483479e-11

On average if the relative error is e-8 instead, check that you have implemented regularization. This is not explicitly given in the notes, but it is hinted in the assignment:

you didn’t forget the regularization gradient did you?

You can find the regularization gradient by finding the derivative of the loss function regularization. Given that the loss function regularization is \frac{1}{2}\lambda \sum W^2 where \lambda is the regularization parameter, the derivative of that with respect to W is then \lambda \sum W.

To implement vectorization we need to fuse operations together. So now we try to find the gradient of a given class k for all n classes. The result is a single row with the number of columns being the number of input features (1 x 3073).  We define that the total number of samples is m. This would be the mean of all correct (\nabla_{w_{y}} L) and incorrect (\nabla_{w_j} L) samples given the class k:

\nabla_{w_k} L = \frac{1}{m}\left[ \nabla_{w_j} L + \nabla_{w_{y}} L \right]

\nabla_{w_{y}}L= \sum\limits_{i=1}^{m}\left[- \left(  \sum\limits_{j=1, j \neq y_i}^{n} 1 \left( \text{margin}_{i,j} > 0 \right) \right) x_{i}\right]

\nabla_{w_{j}}L= \sum\limits_{i=1}^{m} 1 \left( \text{margin}_{i,j} > 0 ) x_i \right)

This can be implemented in a few lines of python code:

# Semi-vectorized version... It's not any faster than the loop version.
num_classes = W.shape[1]
incorrect_counts = np.sum(margins > 0, axis=1)
for k in xrange(num_classes):
  # use the indices of the margin array as a mask.
  wj = np.sum(X[margins[:, k] > 0], axis=0)
  wy = np.sum(-incorrect_counts[y == k][:, np.newaxis] * X[y == k], axis=0)
  dW[:, k] = wj + wy
dW /= num_train # average out weights

incorrect_counts counts the margins for each sample where margin > 0. The result is a single column vector is m number of rows. With this, for each class k, we select the samples in which the correct class is k. Because incorrect_counts is a vector, we need to change it to a matrix of size x 1 by adding [:, np.newaxis] so that the each element of the vector will be mapped to X for scalar multiplication.

There is a faster way, by simply using a single dot product, though it may be tough to figure it out. The intuition is that aside seeing it as converting from one dimension to another, an interesting way to look at matrix multiplication is that it is a grouping of sums of specific combination of rows together; each row in B is a sum of different row combinations of the matrix X, and you control it in the matrix A.

Let’s put this idea to a simpler context. Suppose we have a 4×3 matrix X, which maps to 4 samples (row) and 3 features (column). If we set k to 2, then A is a 2×4 matrix; k number of rows and m number of samples. A^TX will then return a matrix B with dimensions 2×3.

The following code snippet:

import numpy as np
X = np.arange(12).reshape((4, 3))
A = np.zeros((2, 4))
B = A.dot(X)
print A.shape, X.shape, B.shape

Will produce the output:

(2, 4) (4, 3) (2, 3)

Each row in B maps to a class k and is a sum of specific rows in X. In A, each row maps to a class k, and each column maps to a specific sample. If we simplify B to be a boolean array of 1 and 0, then we choose 1 to select, 0 otherwise. For example:

\begin{bmatrix}0 & 1 & 0 & 1 \\1 & 0 & 1 & 0\end{bmatrix} \times \begin{bmatrix}0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \\ 9 & 10 & 12 \end{bmatrix} = \begin{bmatrix}12 & 14 & 16 \\ 6 & 8 & 10 \end{bmatrix}

Row 1 in B is a sum of row 2 and 4 of X, and row 2 in B is sum of row 1 and 3 of X. How matrix A chooses its rows is by multiplication: if the value is 0, multiplying gives an empty row and multiplying by 1 returns the row itself. If instead you change 1 to -2 or 0.5, it will be multiplied by that weight instead. Do play around this in your python console to get the intuition before proceeding.

With this we can attain a succinct formulation:

Q_{i, k} = \begin{cases} - \left(  \sum\limits_{j=1,  j \neq y_i}^{n} 1  \left(\text{margin}_{i,j} > 0 \right) \right)  & \text{if } k = y_i \\ 1 \left( \text{margin}_{i,k} > 0 \right) & \text{otherwise}\end{cases}

\nabla_{w} L = \frac{1}{m} \left( Q^{T}X \right)^T =\frac{1}{m} X^TQ

Now we can take this idea and apply in our code:

X_mask = np.zeros(margins.shape)
X_mask[margins > 0] = 1

Notice that this will factor only the incorrect classes because we previously set all the cells in the margin array that map to the correct class to 0:

margins[np.arange(num_train), y] = 0 # done this before

Of course, our X_mask matrix is incomplete because we didn’t factor in the correct classes. This is not too complicated to do. We understood that for every class k, there are rows in X where class k is the correct class. For each of these rows, we multiply it by the negation of number of incorrect classes where the margin is >0. We don’t need to recalculate this; we can already find this value for each of these rows from X_mask itself, since now each of its rows is 1 where the margin is >0 and 0 otherwise. Finding the number of incorrect classes where the margin is >0 is as simple as finding the sum of the columns for each row:

incorrect_counts = np.sum(X_mask, axis=1)

With this we can find the gradient matrix. The full implementation is below:

# Fully vectorized version. Roughly 10x faster.
X_mask = np.zeros(margins.shape)
X_mask[margins > 0] = 1
# for each sample, find the total number of classes where margin > 0
incorrect_counts = np.sum(X_mask, axis=1)
X_mask[np.arange(num_train), y] = -incorrect_counts
dW = X.T.dot(X_mask)

Lastly you should remember to regularize the gradient. The gradient with regularization is therefore:

\nabla_{w} L = \frac{1}{m} X^TQ + \lambda \sum W

The loss function is visualized:

loss vs iteration number

If the graph is drastically more fuzzy or it doesn’t seem to be decreasing, chances are somewhere in your code you are doing it wrong. Here’s what happen when I forgot to regularize the gradient:

loss vs iteration number forgot regularization

The loss is still decreasing, but not as fast as it should be.

The training and validation accuracy is visualized:

accuracy plot

In the last section code has been given to help you visualize the weights. With proper hyperparameters, more iterations will give a smoother result:

visualize weights