The goals of this assignment are as follows:
- understand the basic Image Classification pipeline and the data-driven approach (train/predict stages)
- understand the train/val/test splits and the use of validation data for hyperparameter tuning.
- develop proficiency in writing efficient vectorized code with numpy
- implement and apply a k-Nearest Neighbor (kNN) classifier
- implement and apply a Multiclass Support Vector Machine (SVM) classifier
- implement and apply a Softmax classifier
- implement and apply a Two layer neural network classifier
- understand the differences and tradeoffs between these classifiers
- get a basic understanding of the performance improvements gained from using higher-level representations instead of raw pixels (e.g. color histograms, Histogram of Oriented Gradients (HOG) features)
k-Nearest Neighbor classifier
The kNN classifier consists of two stages:
- During training, the classifier takes the training data and simply remembers it
- During testing, kNN classifies every test image by comparing it to all training images and transferring the labels of the k most similar training examples
- The value of k is cross-validated
In this exercise you will implement these steps, understand the basic Image Classification pipeline and cross-validation, and gain proficiency in writing efficient, vectorized code.
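Since the value of k is chosen by cross-validation, a rough sketch of that loop might look like the following (hedged: it assumes `X_train`, `y_train`, and a kNN `classifier` object with the train/predict interface described above already exist; the fold count and candidate k values are illustrative):

```python
import numpy as np

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

k_to_accuracies = {}
for k in k_choices:
    accuracies = []
    for fold in range(num_folds):
        # hold out one fold for validation, train on the remaining folds
        X_val, y_val = X_train_folds[fold], y_train_folds[fold]
        X_tr = np.concatenate(X_train_folds[:fold] + X_train_folds[fold + 1:])
        y_tr = np.concatenate(y_train_folds[:fold] + y_train_folds[fold + 1:])
        classifier.train(X_tr, y_tr)
        accuracies.append(np.mean(classifier.predict(X_val, k=k) == y_val))
    k_to_accuracies[k] = accuracies

# pick the k with the best mean validation accuracy
best_k = max(k_to_accuracies, key=lambda k: np.mean(k_to_accuracies[k]))
```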
Key point
- When we compute the distances without any loops, there is an important observation.
For example, if we want to compute the distance between $\vec A$ and $\vec B$, the squared L2 distance is:

$$\|\vec A - \vec B\|_2^2 = \sum_i (A_i - B_i)^2$$

so we can expand the equation above into three terms:

$$\|\vec A - \vec B\|_2^2 = \|\vec A\|_2^2 + \|\vec B\|_2^2 - 2\,\vec A \cdot \vec B$$

Each term can then be computed for all test/train pairs at once with broadcasting and a single matrix multiplication:
```python
test_sum = np.sum(np.square(X), axis=1, keepdims=True)
```
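Putting all three terms together, a minimal sketch of the no-loop distance matrix (the toy shapes are illustrative; `X` stands for the flattened test images and `X_train` for the flattened training images, one example per row):

```python
import numpy as np

# toy data standing in for flattened images (rows are examples)
X_train = np.random.randn(5000, 3072)
X = np.random.randn(500, 3072)

# ||A - B||^2 = ||A||^2 + ||B||^2 - 2 A.B, broadcast over all test/train pairs
test_sum = np.sum(np.square(X), axis=1, keepdims=True)   # (num_test, 1)
train_sum = np.sum(np.square(X_train), axis=1)           # (num_train,)
cross = X @ X_train.T                                    # (num_test, num_train)
dists = np.sqrt(test_sum + train_sum - 2 * cross)        # (num_test, num_train)
```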
kNN code
```python
from builtins import range
```
Multiclass Support Vector Machine exercise
- implement a fully-vectorized loss function for the SVM
- implement the fully-vectorized expression for its analytic gradient
- check your implementation against a numerical gradient (see the sketch after this list)
- use a validation set to tune the learning rate and regularization strength
- optimize the loss function with SGD
- visualize the final learned weights
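As a hedged sketch of such a numerical gradient check (the centered-difference step `h`, the random sampling of coordinates, and the function name are illustrative; `loss_fn` is any function returning the scalar loss for a given `W`):

```python
import numpy as np

def grad_check_sparse(loss_fn, W, analytic_grad, num_checks=10, h=1e-5):
    # compare the analytic gradient to a centered numerical estimate
    # at a few randomly chosen coordinates of W
    for _ in range(num_checks):
        ix = tuple(np.random.randint(d) for d in W.shape)
        old = W[ix]
        W[ix] = old + h
        f_plus = loss_fn(W)
        W[ix] = old - h
        f_minus = loss_fn(W)
        W[ix] = old                      # restore the original value
        grad_numerical = (f_plus - f_minus) / (2 * h)
        grad_analytic = analytic_grad[ix]
        rel_error = abs(grad_numerical - grad_analytic) / (
            abs(grad_numerical) + abs(grad_analytic) + 1e-12)
        print(f"numerical: {grad_numerical:.6f} analytic: {grad_analytic:.6f} "
              f"relative error: {rel_error:.2e}")
```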
Key points
Recall that for the i-th example we are given the pixels of image $x_i$ and the label $y_i$ that specifies the index of the correct class. The score function takes the pixels and computes the vector $f(x_i, W)$ of class scores, which we will abbreviate to $s$ (short for scores). For example, the score for the j-th class is the j-th element: $s_j = f(x_i, W)_j$. The Multiclass SVM loss for the i-th example is then formalized as follows:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$
Note that in this particular module we are working with linear score functions ($f(x_i; W) = W x_i$), so we can also rewrite the loss function in this equivalent form:

$$L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)$$

where $w_j$ is the j-th row of $W$ reshaped as a column.
Adding the regularization term and expanding this out in its full form:

$$L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) + \lambda \sum_k \sum_l W_{k,l}^2$$
In addition to the motivation we provided above, there are many desirable reasons to include the regularization penalty, many of which we will come back to in later sections. For example, it turns out that including the L2 penalty leads to the appealing max margin property in SVMs.
- Calculate the derivative of the SVM loss. For the per-example, per-class term $L_{ij} = \max(0, s_j - s_{y_i} + \Delta)$, the gradient is nonzero only when the margin is violated, i.e. when $s_j - s_{y_i} + \Delta > 0$; in that case:
\begin{equation}
\left\{
\begin{array}{lr}
& \frac{\partial {L_{ij}}}{\partial w_{y_i}} = - x_i^{T} \\
& \frac{\partial {L_{ij}}}{\partial w_j} = x_i^{T} \quad j \neq y_i
\end{array}
\right.
\end{equation}
- Calculate the loss and gradient in fully vectorized form:
```python
scores = X @ W
```
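A minimal sketch of a fully vectorized SVM loss and gradient, assuming `X` has shape (N, D), `W` has shape (D, C), `y` holds integer class labels, and a margin $\Delta = 1$ (the function name and signature are illustrative, not necessarily the assignment's exact API):

```python
import numpy as np

def svm_loss_vectorized(W, X, y, reg):
    num_train = X.shape[0]

    scores = X @ W                                    # (N, C)
    correct = scores[np.arange(num_train), y][:, None]
    margins = np.maximum(0, scores - correct + 1.0)   # delta = 1
    margins[np.arange(num_train), y] = 0              # don't count j == y_i
    loss = margins.sum() / num_train + reg * np.sum(W * W)

    # gradient: +x_i for each violated margin, -(count of violations) * x_i
    # for the correct class column
    binary = (margins > 0).astype(X.dtype)            # (N, C)
    binary[np.arange(num_train), y] = -binary.sum(axis=1)
    dW = X.T @ binary / num_train + 2 * reg * W
    return loss, dW
```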
SVM code
```python
from builtins import range
```
Implement a Softmax classifier
- implement a fully-vectorized loss function for the Softmax classifier
- implement the fully-vectorized expression for its analytic gradient
- check your implementation with numerical gradient
- use a validation set to tune the learning rate and regularization strength
- optimize the loss function with SGD
- visualize the final learned weights
Unlike the SVM, which treats the outputs $f(x_i, W)$ as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping $f(x_i; W) = W x_i$ stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \quad \text{or equivalently} \quad L_i = -f_{y_i} + \log\sum_j e^{f_j}$$
The function $\frac{e^{z_j}}{\sum_k e^{z_k}}$, is called the softmax function: It takes a vector of arbitrary real-valued scores (in $z$) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you’re seeing it for the first time but it is relatively easy to motivate.
```python
num_examples = X.shape[0]
```
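Continuing from there, a hedged sketch of the numerically stabilized softmax probabilities and the average cross-entropy loss (the toy data, the variable names, and the 0.5 * reg regularization convention are illustrative):

```python
import numpy as np

# toy setup standing in for flattened image data
N, D, C = 50, 3072, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.001 * np.random.randn(D, C)
reg = 1e-3

num_examples = X.shape[0]
scores = X @ W                                    # unnormalized log probabilities, (N, C)

# shift scores so the max per row is 0: keeps exp() numerically stable
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)   # rows sum to 1

# average cross-entropy loss plus L2 regularization (0.5 * reg convention)
correct_logprobs = -np.log(probs[np.arange(num_examples), y])
loss = correct_logprobs.mean() + 0.5 * reg * np.sum(W * W)
```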
We have a way of evaluating the loss, and now we have to minimize it. We'll do so with gradient descent. That is, we start with random parameters (as shown above) and evaluate the gradient of the loss function with respect to the parameters, so that we know how we should change the parameters to decrease the loss. Let's introduce the intermediate variable $p$, which is a vector of the (normalized) probabilities. The loss for one example is:

$$p_k = \frac{e^{f_k}}{\sum_j e^{f_j}} \qquad L_i = -\log(p_{y_i})$$
We now wish to understand how the computed scores inside $f$ should change to decrease the loss $L_i$ that this example contributes to the full objective. In other words, we want to derive the gradient $\partial L_i / \partial f_k$. The loss $L_i$ is computed from $p$, which in turn depends on $f$. It's a fun exercise for the reader to use the chain rule to derive the gradient, but it turns out to be extremely simple and interpretable in the end, after a lot of things cancel out:

$$\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(y_i = k)$$
Notice how elegant and simple this expression is. Suppose the probabilities we computed were `p = [0.2, 0.3, 0.5]`, and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be `df = [0.2, -0.7, 0.5]`. Recalling the interpretation of the gradient, we see that this result is highly intuitive: increasing the first or last element of the score vector `f` (the scores of the incorrect classes) leads to an increased loss (due to the positive signs +0.2 and +0.5), and increasing the loss is bad, as expected. However, increasing the score of the correct class has a negative influence on the loss. The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss $L_i$, which makes sense.
To get the gradient on the scores, which we call `dscores`, we proceed as follows:

```python
dscores = probs
dscores[range(num_examples), y] -= 1
dscores /= num_examples
```
Lastly, we had that $scores = XW + b$ (in code, `scores = np.dot(X, W) + b`), so armed with the gradient on the scores (stored in `dscores`), we can now backpropagate into `W` and `b`:
```python
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
```
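With those gradients, a sketch of the vanilla gradient-descent parameter update described earlier (`step_size` and `reg` are hyperparameters assumed to be tuned on the validation set; names reuse the snippets above):

```python
# add the regularization gradient (assuming a 0.5 * reg * np.sum(W * W) loss term)
dW += reg * W

# vanilla gradient-descent parameter update
W += -step_size * dW
b += -step_size * db
```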
Softmax code
```python
from builtins import range
```
Two-Layer Neural Network
Clearly, a linear classifier is inadequate for this dataset and we would like to use a Neural Network. One additional hidden layer will suffice for this toy data. We will now need two sets of weights and biases (for the first and second layers):
```python
# initialize parameters randomly
```
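A minimal sketch of that initialization, with illustrative toy sizes (D input dimensions, K classes, hidden size h = 100; the 0.01 scale is one common choice, not the only one):

```python
import numpy as np

D, K, h = 2, 3, 100   # toy sizes: input dimension, number of classes, hidden layer size

# initialize parameters randomly for both layers
W = 0.01 * np.random.randn(D, h)
b = np.zeros((1, h))
W2 = 0.01 * np.random.randn(h, K)
b2 = np.zeros((1, K))
```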
The forward pass to compute scores now changes form:
```python
# evaluate class scores with a 2-layer Neural Network
```
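A sketch of that forward pass, reusing the parameters initialized above (`X` stands for an (N, D) data matrix; the ReLU is the `np.maximum` call):

```python
X = np.random.randn(300, D)   # toy data matrix, shape (N, D)

# forward pass: hidden layer with ReLU, then class scores
hidden_layer = np.maximum(0, np.dot(X, W) + b)   # shape (N, h)
scores = np.dot(hidden_layer, W2) + b2           # shape (N, K)
```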
Notice that the only change from before is one extra line of code, where we first compute the hidden layer representation and then the scores based on this hidden layer. Crucially, we've also added a non-linearity, which in this case is a simple ReLU that thresholds the activations of the hidden layer at zero.
Everything else remains the same. We compute the loss based on the scores exactly as before, and get the gradient for the scores `dscores` exactly as before. However, the way we backpropagate that gradient into the model parameters now changes form, of course. First, let's backpropagate the second layer of the Neural Network. This looks identical to the code we had for the Softmax classifier, except we're replacing `X` (the raw data) with the variable `hidden_layer`:
```python
# backpropagate the gradient to the parameters
```
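A sketch of that step, mirroring the Softmax code with `hidden_layer` in place of `X` (names reuse the earlier snippets; the regularization term assumes the 0.5 * reg convention):

```python
# gradients of the second-layer weights and biases
dW2 = np.dot(hidden_layer.T, dscores)
db2 = np.sum(dscores, axis=0, keepdims=True)
dW2 += reg * W2   # regularization gradient
```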
However, unlike before we are not yet done, because hidden_layer is itself a function of other parameters and the data! We need to continue backpropagation through this variable. Its gradient can be computed as:
```python
dhidden = np.dot(dscores, W2.T)
```
Now we have the gradient on the outputs of the hidden layer. Next, we have to backpropagate the ReLU non-linearity. This turns out to be easy because the ReLU during the backward pass is effectively a switch. Since $r = \max(0, x)$, we have that $\frac{dr}{dx} = \mathbb{1}(x > 0)$. Combined with the chain rule, we see that the ReLU unit lets the gradient pass through unchanged if its input was greater than 0, but kills it if its input was less than zero during the forward pass. Hence, we can backpropagate the ReLU in place simply with:
```python
# backprop the ReLU non-linearity
```
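A sketch of the ReLU backprop followed by the final step into the first-layer parameters (again reusing names from the snippets above):

```python
# ReLU backward pass: zero out gradients where the forward input was <= 0
dhidden[hidden_layer <= 0] = 0

# backpropagate into the first-layer parameters
dW = np.dot(X.T, dhidden)
db = np.sum(dhidden, axis=0, keepdims=True)
dW += reg * W   # regularization gradient
```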
Neural Net code
```python
from __future__ import print_function
```
Higher Level Representations: Image Features
An important way to gain intuition about how an algorithm works is to visualize the mistakes that it makes. In this visualization, we show examples of images that are misclassified by our current system. The first column shows images that our system labeled as “plane” but whose true label is something other than “plane”.
```python
examples_per_class = 8
```
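A hedged sketch of that visualization (it assumes `y_test`, `y_test_pred`, `X_test` with image-shaped rows, and a `classes` list exist from earlier in the assignment; the matplotlib layout is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

examples_per_class = 8
for cls_idx, cls_name in enumerate(classes):
    # images our system labeled as cls_name although their true label is different
    idxs = np.where((y_test != cls_idx) & (y_test_pred == cls_idx))[0]
    idxs = np.random.choice(idxs, min(examples_per_class, len(idxs)), replace=False)
    for i, idx in enumerate(idxs):
        plt.subplot(examples_per_class, len(classes), i * len(classes) + cls_idx + 1)
        plt.imshow(X_test[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls_name)
plt.show()
```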
```python
from cs231n.classifiers.neural_net import TwoLayerNet
```