```python
import numpy as np
```
Helper functions
```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # elementwise logistic: 1 / (1 + exp(-x))
```
Naive Softmax Loss And Its Gradient
In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:

$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)}$$

where:
- $u_o$ is the ‘outside’ vector representing outside word $o$
- $v_c$ is the ‘center’ vector representing center word $c$
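To make this concrete, here is a tiny NumPy sketch that evaluates this distribution for made-up vectors; the names `U` (outside vectors stacked as rows) and `v_c` are chosen for illustration.

```python
import numpy as np

# Toy example: vocabulary of 4 words, embedding dimension 3.
np.random.seed(0)
U = np.random.randn(4, 3)    # outside vectors u_w stacked as rows
v_c = np.random.randn(3)     # center vector

scores = U @ v_c                                   # u_w^T v_c for every word w
probs = np.exp(scores) / np.sum(np.exp(scores))    # softmax over the vocabulary

o = 2
print(probs[o])      # P(O = o | C = c)
print(probs.sum())   # the probabilities sum to 1
```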
The cross-entropy loss between the true (discrete) probability distribution $p$ and another distribution $q$ is:

$$-\sum_{i} p_i \log(q_i)$$
So the naive-softmax loss for word2vec, given in the following equation, is the same as the cross-entropy loss between $y$ and $\hat{y}$:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) = -\log(\hat{y}_o)$$

Because the true distribution $y$ is a one-hot vector with a 1 only at the true outside word $o$, the cross-entropy $-\sum_{w} y_w \log(\hat{y}_w)$ collapses to the single term $-\log(\hat{y}_o)$.
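A quick numerical check of this equivalence with made-up numbers: since $y$ is one-hot, only the term for the true outside word survives in the cross-entropy sum.

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])   # predicted (softmax) distribution
y = np.array([0.0, 1.0, 0.0])       # one-hot true distribution; the outside word has index 1

cross_entropy = -np.sum(y * np.log(y_hat))
naive_softmax = -np.log(y_hat[1])
print(cross_entropy, naive_softmax)  # both equal 0.3566749...
```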
For the backpropagation, let's introduce the intermediate variable $p$, which is a vector of the (normalized) probabilities. The loss for one example is:

$$L_i = -\log(p_{y_i}), \qquad \text{where } p_k = \frac{e^{f_k}}{\sum_j e^{f_j}}$$

Here $f$ is the vector of scores (in word2vec, the scores are $f_w = u_w^\top v_c$) and $y_i$ is the index of the correct class.
We now wish to understand how the computed scores $f$ should change to decrease the loss $L_i$ that this example contributes to the full objective. In other words, we want to derive the gradient $\frac{\partial L_i}{\partial f_k}$. The loss $L_i$ is computed from $p$, which in turn depends on $f$. Applying the chain rule and differentiating through the softmax gives:

$$\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(y_i = k)$$
Notice how elegant and simple this expression is. Suppose the probabilities we computed were $p = [0.2, 0.3, 0.5]$, and that the correct class was the middle one (with probability 0.3). According to this derivation, the gradient on the scores would be $df = [0.2, -0.7, 0.5]$.
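The same numbers fall out of the gradient expression directly; a small check:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])   # softmax probabilities from the example
correct = 1                     # the middle class is the true one

df = p.copy()
df[correct] -= 1.0              # dL_i/df = p - one-hot(y_i)
print(df)                       # [ 0.2 -0.7  0.5]
```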
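To make the derivation concrete, here is a minimal NumPy sketch of the naive-softmax loss together with its gradients with respect to $v_c$ and the outside-vector matrix $U$; the function and parameter names (naiveSoftmaxLossAndGradient, centerWordVec, outsideWordIdx, outsideVectors) are chosen for illustration rather than taken from any particular implementation.

```python
import numpy as np

def naiveSoftmaxLossAndGradient(centerWordVec, outsideWordIdx, outsideVectors):
    # Scores of the center word against every outside vector: f_w = u_w^T v_c
    scores = outsideVectors @ centerWordVec            # shape (V,)
    scores = scores - np.max(scores)                   # shift for numerical stability
    y_hat = np.exp(scores) / np.sum(np.exp(scores))    # softmax probabilities, shape (V,)

    loss = -np.log(y_hat[outsideWordIdx])              # -log y_hat_o

    # dJ/df = y_hat - y  (the expression derived above)
    dscores = y_hat.copy()
    dscores[outsideWordIdx] -= 1.0

    gradCenterVec = outsideVectors.T @ dscores         # dJ/dv_c = U^T (y_hat - y)
    gradOutsideVecs = np.outer(dscores, centerWordVec) # dJ/dU  = (y_hat - y) v_c^T
    return loss, gradCenterVec, gradOutsideVecs
```

Shifting the scores by their maximum changes nothing mathematically but avoids overflow in the exponentials for large scores.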
Negative Sampling Loss And Its Gradient
Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \ldots, w_K$ and their outside vectors as $u_1, \ldots, u_K$. Note that $o \notin \{w_1, \ldots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

$$J_{\text{neg-sample}}(v_c, o, U) = -\log\big(\sigma(u_o^\top v_c)\big) - \sum_{k=1}^{K} \log\big(\sigma(-u_k^\top v_c)\big)$$
The sigmoid function and its gradient are as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \sigma(x)\big(1 - \sigma(x)\big)$$
This loss and its gradients can be computed in a function `negSamplingLossAndGradient`, sketched below.
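The sketch below is one way such a function could look. It assumes the negatives arrive as a list of indices (negSampleWordIndices) and that the other parameter names match the naive-softmax sketch above; in practice the $K$ samples would be drawn at random from the vocabulary.

```python
import numpy as np

def negSamplingLossAndGradient(centerWordVec, outsideWordIdx, outsideVectors,
                               negSampleWordIndices):
    # sigmoid() is the helper defined at the top of the post
    u_o = outsideVectors[outsideWordIdx]            # (d,)
    u_neg = outsideVectors[negSampleWordIndices]    # (K, d)

    z_o = sigmoid(u_o @ centerWordVec)              # sigma(u_o^T v_c)
    z_neg = sigmoid(-(u_neg @ centerWordVec))       # sigma(-u_k^T v_c), shape (K,)

    loss = -np.log(z_o) - np.sum(np.log(z_neg))

    # Using d/dx log(sigma(x)) = 1 - sigma(x):
    gradCenterVec = -(1.0 - z_o) * u_o + (1.0 - z_neg) @ u_neg   # dJ/dv_c

    gradOutsideVecs = np.zeros_like(outsideVectors)              # dJ/dU
    gradOutsideVecs[outsideWordIdx] = -(1.0 - z_o) * centerWordVec
    # np.add.at accumulates correctly even if a negative index repeats
    np.add.at(gradOutsideVecs, negSampleWordIndices,
              (1.0 - z_neg)[:, None] * centerWordVec)
    return loss, gradCenterVec, gradOutsideVecs
```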
SkipGram
Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:

$$J_{\text{skip-gram}}(v_c, w_{t-m}, \ldots, w_{t+m}, U) = \sum_{-m \le j \le m,\; j \neq 0} J(v_c, w_{t+j}, U)$$
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$; it could be the naive-softmax loss or the negative sampling loss, depending on the implementation.
Finally, a `skipgram` function ties everything together: given the current center word, the window size, the list of outside words in the window, and a word-to-index mapping, it accumulates the loss and gradient contributions of every (center, outside) pair.
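Here is a minimal sketch of such a wrapper. The parameters centerWordVectors, outsideVectors, and word2vecLossAndGradient, as well as the body itself, are assumptions for illustration; the pluggable loss is expected to follow the three-argument interface of the naive-softmax sketch above (a negative-sampling loss would need its sampled indices bound in, e.g. with functools.partial).

```python
import numpy as np

def skipgram(currentCenterWord, windowSize, outsideWords, word2Ind,
             centerWordVectors, outsideVectors, word2vecLossAndGradient):
    # windowSize is kept for interface parity; the caller selects outsideWords from the window
    loss = 0.0
    gradCenterVecs = np.zeros_like(centerWordVectors)   # gradient w.r.t. every center vector
    gradOutsideVecs = np.zeros_like(outsideVectors)     # gradient w.r.t. every outside vector

    centerWordIdx = word2Ind[currentCenterWord]
    centerWordVec = centerWordVectors[centerWordIdx]    # v_c

    # One loss term J(v_c, w_{t+j}, U) per outside word in the window
    for outsideWord in outsideWords:
        outsideWordIdx = word2Ind[outsideWord]
        l, gradCenter, gradOutside = word2vecLossAndGradient(
            centerWordVec, outsideWordIdx, outsideVectors)
        loss += l
        gradCenterVecs[centerWordIdx] += gradCenter     # only v_c receives gradient
        gradOutsideVecs += gradOutside

    return loss, gradCenterVecs, gradOutsideVecs
```

Passing the loss function as an argument lets the same wrapper be reused with either loss, which mirrors the "arbitrary loss term" $J(v_c, w_{t+j}, U)$ in the equation above.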