Backpropagation through time

The formula of a recurrent neural network (RNN)
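The equation itself does not appear here. A standard formulation, consistent with the expression for $s_3$ used later in these notes, would be the following. The softmax output layer and the symbol $\hat{y}_t$ for the prediction are assumptions.

$$
s_t = \tanh(U x_t + W s_{t-1}), \qquad \hat{y}_t = \mathrm{softmax}(V s_t)
$$

Here $x_t$ is the input at step $t$, $s_t$ is the hidden state, and $U$, $V$, $W$ are shared across all time steps.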

The loss is:
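The loss formula is not shown. Assuming a cross-entropy loss (the usual choice with a softmax output, but not stated in these notes), with $y_t$ the true label and $\hat{y}_t$ the prediction at step $t$, the per-step and total losses would be:

$$
E_t(y_t, \hat{y}_t) = -\, y_t \log \hat{y}_t, \qquad E = \sum_t E_t
$$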

Our goal is to compute the gradients of the loss with respect to $U$, $V$, and $W$ and use them to update these parameters. Just as we sum the loss over time steps, we sum the gradients over all time steps.

Take $E_3$ for example:
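As a sketch of the missing derivation, write $z_3 = V s_3$ and assume $\hat{y}_3 = \mathrm{softmax}(z_3)$ with a cross-entropy loss (both are assumptions, not stated in these notes). The chain rule then gives:

$$
\frac{\partial E_{3}}{\partial V}
= \frac{\partial E_{3}}{\partial \hat{y}_{3}} \frac{\partial \hat{y}_{3}}{\partial z_{3}} \frac{\partial z_{3}}{\partial V}
= (\hat{y}_{3} - y_{3}) \otimes s_{3}
$$

where $\otimes$ is the outer product of two vectors.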

What matters here is that $\frac{\partial E_{3}}{\partial V}$ depends only on the current time step.

But $\frac{\partial E_{3}}{\partial W}$ is different, and the same is true for $U$.

$s_{3} = \tanh(U x_{3} + W s_{2})$ depends on $s_{2}$, and $s_2$ in turn depends on $W$ and $s_1$. So when we compute the gradient with respect to $W$, we cannot treat $s_2$ as a constant.

We have to add up the gradient contributions from all time steps. In other words, the outputs of all time steps depend on $W$.
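To make the summation concrete, here is a minimal NumPy sketch. It is not the setup from these notes: it uses a squared error on the last hidden state as a stand-in for $E_3$ (so no $V$ or softmax is needed), and the sizes, the random data, and the function names `forward`, `bptt_grad_W`, and `numerical_grad_W` are all made up for illustration. It accumulates one contribution to the gradient of $W$ per time step and compares the result with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs, steps = 4, 3, 4          # toy sizes, chosen arbitrarily

U = rng.normal(scale=0.5, size=(hidden, inputs))
W = rng.normal(scale=0.5, size=(hidden, hidden))
xs = rng.normal(size=(steps, inputs))    # x_0 .. x_3
target = rng.normal(size=hidden)         # hypothetical target for s_3


def forward(W):
    """Return the hidden states s_0 .. s_3; the state before s_0 is zero."""
    s_prev = np.zeros(hidden)
    states = []
    for x in xs:
        s_prev = np.tanh(U @ x + W @ s_prev)
        states.append(s_prev)
    return states


def loss(W):
    """Stand-in for E_3 (assumption): squared error on the last hidden state."""
    return 0.5 * np.sum((forward(W)[-1] - target) ** 2)


def bptt_grad_W(W):
    """Sum the contribution of every time step k to dE_3/dW."""
    states = forward(W)
    grad = np.zeros_like(W)
    delta = states[-1] - target                       # dE_3/ds_3
    for k in reversed(range(steps)):
        delta_pre = delta * (1.0 - states[k] ** 2)    # back through tanh at step k
        s_before = states[k - 1] if k > 0 else np.zeros(hidden)
        grad += np.outer(delta_pre, s_before)         # direct effect of W at step k
        delta = W.T @ delta_pre                       # pass the gradient on to s_{k-1}
    return grad


def numerical_grad_W(W, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = np.zeros_like(W)
    for i in range(hidden):
        for j in range(hidden):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            grad[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
    return grad


analytic = bptt_grad_W(W)
numeric = numerical_grad_W(W)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
print("max abs difference:", max_abs_diff)
```

If the loop in `bptt_grad_W` stopped after the first iteration (only the last time step), the result would no longer match the numerical gradient, because the dependence of earlier states on $W$ would be ignored.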

Why gradients vanish

Consider the gradient formula:
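The formula is not shown here. Summing the contribution of every step $k$, a form consistent with the discussion would be the following, where $\hat{y}_3$ denotes the prediction at step 3 and time steps are indexed from 0 (both assumptions):

$$
\frac{\partial E_{3}}{\partial W}
= \sum_{k=0}^{3} \frac{\partial E_{3}}{\partial \hat{y}_{3}} \frac{\partial \hat{y}_{3}}{\partial s_{3}} \frac{\partial s_{3}}{\partial s_{k}} \frac{\partial s_{k}}{\partial W}
$$

In this expression $\frac{\partial s_{k}}{\partial W}$ means the direct derivative at step $k$, with $s_{k-1}$ held fixed.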

$\frac{\partial s_{3}}{\partial s_{k}}$ is itself a chain rule. For example: $\frac{\partial s_{3}}{\partial s_{1}} = \frac{\partial s_{3}}{\partial s_{2}} \frac{\partial s_{2}}{\partial s_{1}}$. If we rewrite it:
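Expanding the chain rule over every intermediate state, the gradient can be written with an explicit product of Jacobians (with $\hat{y}_3$ the prediction at step 3 and steps indexed from 0, both assumptions):

$$
\frac{\partial E_{3}}{\partial W}
= \sum_{k=0}^{3} \frac{\partial E_{3}}{\partial \hat{y}_{3}} \frac{\partial \hat{y}_{3}}{\partial s_{3}}
\left( \prod_{j=k+1}^{3} \frac{\partial s_{j}}{\partial s_{j-1}} \right) \frac{\partial s_{k}}{\partial W}
$$

For $s_j = \tanh(U x_j + W s_{j-1})$, each factor is $\mathrm{diag}(1 - s_j^2)\, W$, and the derivative of $\tanh$ is at most 1. The sketch below is only an illustration under made-up conditions: a toy RNN with random weights, with $W$ rescaled to spectral norm 0.5 so that shrinkage is guaranteed. It multiplies these Jacobians going back in time and prints the norm of the running product.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs, steps = 8, 3, 20         # toy sizes, chosen arbitrarily

U = rng.normal(size=(hidden, inputs))
W = rng.normal(size=(hidden, hidden))
W *= 0.5 / np.linalg.norm(W, 2)          # assumption: rescale W to spectral norm 0.5
xs = rng.normal(size=(steps, inputs))

# Forward pass, keeping every hidden state.
s = np.zeros(hidden)
states = []
for x in xs:
    s = np.tanh(U @ x + W @ s)
    states.append(s)

# Multiply the Jacobians ds_j/ds_{j-1} = diag(1 - s_j^2) @ W,
# starting at the last step and moving back in time.
product = np.eye(hidden)
norms = []
for j in reversed(range(1, steps)):
    jacobian = np.diag(1.0 - states[j] ** 2) @ W
    product = product @ jacobian
    norms.append(float(np.linalg.norm(product, 2)))

for distance, norm in enumerate(norms, start=1):
    print(f"{distance:2d} steps back: norm = {norm:.3e}")
```

Because every factor has norm at most 0.5 in this setup, the product shrinks at least geometrically with the number of steps, so distant time steps contribute almost nothing to the gradient. With a $W$ of larger norm the product is not guaranteed to shrink.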