## Naive Softmax Loss And Its Gradient

In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:

$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)}$$

where:

• $u_o$ is the 'outside' vector representing outside word $o$
• $v_c$ is the 'center' vector representing center word $c$
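A minimal sketch of this probability computation, assuming a toy vocabulary with random vectors (all names and sizes here are hypothetical, not from the original):

```python
import numpy as np

def softmax_prob(U, v_c, o):
    """P(O = o | C = c) = exp(u_o . v_c) / sum_w exp(u_w . v_c).
    U: (vocab_size, d) matrix of outside vectors; v_c: (d,) center vector."""
    scores = U @ v_c              # dot product of v_c with every outside vector
    scores -= scores.max()        # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))       # toy vocabulary of 5 words, dimension 3
v_c = rng.normal(size=3)
probs = np.array([softmax_prob(U, v_c, o) for o in range(5)])
print(probs.sum())                # the probabilities over the vocabulary sum to 1
```

Subtracting the maximum score before exponentiating leaves the softmax output unchanged but avoids overflow for large scores.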

The cross-entropy loss between the true (discrete) probability distribution $p$ and another distribution $q$ is:

$$H(p, q) = -\sum_{i} p_i \log(q_i)$$

Since the true distribution $y$ is a one-hot vector with a 1 at the position of the true outside word $o$ and 0 everywhere else, the naive-softmax loss for word2vec, given in the following equation, is the same as the cross-entropy loss between $y$ and $\hat{y}$:

$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) = -\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o)$$
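As a quick check, assuming a toy predicted distribution $\hat{y}$ (values hypothetical), the cross-entropy against a one-hot $y$ reduces to the negative log-probability of the true word:

```python
import numpy as np

y_hat = np.array([0.2, 0.3, 0.5])   # model's softmax output (toy values)
o = 1                                # index of the true outside word
y = np.zeros(3)
y[o] = 1.0                           # one-hot true distribution

cross_entropy = -np.sum(y * np.log(y_hat))   # -sum_w y_w log(y_hat_w)
naive_softmax = -np.log(y_hat[o])            # -log(y_hat_o)
print(cross_entropy, naive_softmax)          # the two values coincide
```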

For the backpropagation, let's introduce the intermediate variable $p$, which is a vector of the (normalized) probabilities computed from the scores $f$. The loss for one example is:

$$p_k = \frac{e^{f_k}}{\sum_j e^{f_j}} \qquad L_i = -\log(p_{y_i})$$

We now wish to understand how the computed scores inside $f$ should change to decrease the loss $L_i$ that this example contributes to the full objective. In other words, we want to derive the gradient $\frac{\partial L_i}{\partial f_k}$. The loss $L_i$ is computed from $p$, which in turn depends on $f$. Applying the chain rule to $L_i = -\log(p_{y_i})$ with $p_k = e^{f_k} / \sum_j e^{f_j}$ yields:

$$\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(y_i = k)$$

Notice how elegant and simple this expression is. Suppose the probabilities we computed were p = [0.2, 0.3, 0.5], and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be df = [0.2, -0.7, 0.5].
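This can be verified numerically. The sketch below (variable names hypothetical) reproduces the example's gradient and checks it against central finite differences:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])      # probabilities from the example above
y_i = 1                             # the correct class is the middle one

# Analytic gradient: dL_i/df_k = p_k - 1(y_i = k), i.e. subtract 1 at the true class.
df = p.copy()
df[y_i] -= 1.0
print(df)                           # [ 0.2 -0.7  0.5]

# Numerical check: pick scores f with softmax(f) = p, then use central differences.
f = np.log(p)                       # softmax(log p) = p because p already sums to 1

def loss(f):
    q = np.exp(f - f.max())
    q /= q.sum()
    return -np.log(q[y_i])

eps = 1e-6
numeric = np.array([
    (loss(f + eps * np.eye(3)[k]) - loss(f - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])
assert np.allclose(df, numeric, atol=1e-6)
```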

## Negative Sampling Loss And Its Gradient

Now we shall consider the negative sampling loss, which is an alternative to the naive softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \cdots, w_K$ and their outside vectors as $u_{w_1}, \cdots, u_{w_K}$. Note that $o \notin \{w_1, \cdots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

$$J_{\text{neg-sample}}(v_c, o, U) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_{w_k}^\top v_c)$$

The sigmoid function and its gradient are as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$$
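A minimal sketch of the negative sampling loss, assuming small random toy vectors (dimensions and names hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """J = -log sigma(u_o . v_c) - sum_k log sigma(-u_{w_k} . v_c),
    where U_neg holds the outside vectors of the K negative samples."""
    pos_term = -np.log(sigmoid(u_o @ v_c))          # push true pair's score up
    neg_term = -np.sum(np.log(sigmoid(-(U_neg @ v_c))))  # push sampled pairs' scores down
    return pos_term + neg_term

rng = np.random.default_rng(0)
d, K = 4, 3                                # toy dimensions: vector size 4, K = 3 samples
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
U_neg = rng.normal(size=(K, d))            # outside vectors of the negative samples
loss_val = neg_sampling_loss(v_c, u_o, U_neg)
print(loss_val)                            # evaluates K + 1 sigmoids, not |Vocab| exponentials
```

The practical appeal is in the last comment: each loss evaluation touches only $K + 1$ outside vectors instead of the whole vocabulary.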

## SkipGram

Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \cdots, w_{t-1}, w_t, w_{t+1}, \cdots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:

$$J_{\text{skip-gram}}(v_c,\, w_{t-m}, \cdots, w_{t+m},\, U) = \sum_{\substack{-m \le j \le m \\ j \ne 0}} J(v_c, w_{t+j}, U)$$

Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$; it can be either the naive softmax loss or the negative sampling loss.
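The window-level sum can be sketched as follows, here plugging in the naive softmax loss as the per-word term (all names and toy sizes are hypothetical):

```python
import numpy as np

def naive_softmax_loss(v_c, o, U):
    """-log of the softmax probability of outside word o given center vector v_c."""
    scores = U @ v_c
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return -np.log(p[o])

def window_loss(v_c, outside_ids, U, per_word_loss):
    """Sum of J(v_c, w_{t+j}, U) over all context words w_{t+j}, j != 0."""
    return sum(per_word_loss(v_c, o, U) for o in outside_ids)

rng = np.random.default_rng(1)
U = rng.normal(size=(6, 3))        # toy vocabulary of 6 words, dimension 3
v_c = rng.normal(size=3)
outside_ids = [0, 2, 4, 5]         # indices of w_{t+j} for m = 2, skipping j = 0
total = window_loss(v_c, outside_ids, U, naive_softmax_loss)
print(total)
```

Passing the per-word loss as a function mirrors the text: the window sum is agnostic to whether each term is naive softmax or negative sampling.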
