The formulas of a recurrent neural network (RNN):

$$s_t = \tanh(U x_t + W s_{t-1})$$

$$\hat{y}_t = \mathrm{softmax}(V s_t)$$

The loss is the cross-entropy, summed over all time steps:

$$E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t, \qquad E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t)$$

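The forward pass and loss above can be sketched in NumPy. This is a minimal illustration, not the text's own code: the sizes, the one-hot inputs, and the random data are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H, C, T = 4, 5, 3                      # hidden size, vocab size, time steps (illustrative)

U = rng.standard_normal((H, C)) * 0.1  # input-to-hidden
W = rng.standard_normal((H, H)) * 0.1  # hidden-to-hidden (shared across time steps)
V = rng.standard_normal((C, H)) * 0.1  # hidden-to-output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = [np.eye(C)[rng.integers(C)] for _ in range(T)]   # one-hot inputs x_t
ys = [np.eye(C)[rng.integers(C)] for _ in range(T)]   # one-hot targets y_t

s = np.zeros(H)        # initial hidden state s_{-1}
E = 0.0                # total loss, summed over all time steps
for x, y in zip(xs, ys):
    s = np.tanh(U @ x + W @ s)         # s_t = tanh(U x_t + W s_{t-1})
    y_hat = softmax(V @ s)             # y_hat_t = softmax(V s_t)
    E += -y @ np.log(y_hat)            # E_t = -y_t log(y_hat_t)
```

Note that the same $W$ (and $U$, $V$) is reused at every step of the loop; this sharing is what forces the gradients to be summed over time as well.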
Our goal is to compute the gradients with respect to $U$, $V$, and $W$ and use them to update the parameters. Just as we sum the loss over time steps, we should also sum the gradients over all time steps.

Take $E_3$ as an example:

$$\frac{\partial E_3}{\partial V} = \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial V} = (\hat{y}_3 - y_3) \otimes s_3$$

where $\otimes$ is the outer product.
What is important here is that $\frac{\partial E_{3}}{\partial V}$ **depends only on the values at the current time step**: $\hat{y}_3$, $y_3$, and $s_3$.
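We can verify this claim numerically: the gradient of $E_3$ with respect to $V$ is just $(\hat{y}_3 - y_3) \otimes s_3$, involving no earlier time step. This is a minimal sketch with illustrative sizes and an arbitrary hidden state, not code from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
H, C = 4, 5                             # hidden size, vocab size (illustrative)
V = rng.standard_normal((C, H)) * 0.1
s3 = np.tanh(rng.standard_normal(H))    # hidden state at t = 3 (arbitrary here)
y3 = np.eye(C)[2]                       # one-hot target at t = 3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

y_hat3 = softmax(V @ s3)
dV = np.outer(y_hat3 - y3, s3)          # dE_3/dV: only step-3 quantities appear

# Numerical check of one entry against a central finite difference
def E3(V):
    return -y3 @ np.log(softmax(V @ s3))

eps = 1e-6
Vp = V.copy(); Vp[1, 2] += eps
Vm = V.copy(); Vm[1, 2] -= eps
num = (E3(Vp) - E3(Vm)) / (2 * eps)
```

The finite-difference estimate `num` matches `dV[1, 2]`, confirming that no recurrence through $W$ is involved in this gradient.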

But $\frac{\partial E_{3}}{\partial W}$ is different, and the same is true for $U$.

$s_{3} = \tanh(U x_{3} + W s_{2})$ depends on $s_{2}$, and $s_2$ in turn depends on $W$ and $s_1$. So when we compute the gradient with respect to $W$, we cannot treat $s_2$ as a constant.

We must add up the gradients from all time steps. In other words, because $W$ is shared, the output at every time step depends on $W$.
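The accumulation over time steps can be sketched as backpropagation through time in NumPy, with the final answer checked against a finite difference. This is an illustrative toy (tanh recurrence, softmax output, one-hot data, small random sizes), not the text's own implementation: the loop adds one `np.outer(...)` term to `dW` per time step, which is exactly the sum over $k$ in the $\partial E_3 / \partial W$ formula.

```python
import numpy as np

rng = np.random.default_rng(1)
H, C, T = 4, 5, 4                       # hidden size, vocab size, steps t = 0..3 (illustrative)

U = rng.standard_normal((H, C)) * 0.1
W = rng.standard_normal((H, H)) * 0.1
V = rng.standard_normal((C, H)) * 0.1
xs = [np.eye(C)[rng.integers(C)] for _ in range(T)]
y3 = np.eye(C)[rng.integers(C)]         # target at the last step only (E_3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_E3(W):
    """Forward pass; returns E_3 and all hidden states s_{-1}..s_3."""
    s = np.zeros(H)
    states = [s]
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    return -y3 @ np.log(softmax(V @ s)), states

E3, states = loss_E3(W)

# Backward pass: one dW contribution per time step k, summed up.
delta = softmax(V @ states[-1]) - y3    # dE_3/d(V s_3) for softmax + cross-entropy
ds = V.T @ delta                        # dE_3/ds_3
dW = np.zeros_like(W)
for t in range(T - 1, -1, -1):
    da = ds * (1.0 - states[t + 1] ** 2)   # back through tanh: s_t = tanh(a_t)
    dW += np.outer(da, states[t])          # contribution of time step t (the sum over k)
    ds = W.T @ da                          # dE_3/ds_{t-1}: the chain product grows

# Numerical check of one entry of dW
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
num = (loss_E3(Wp)[0] - loss_E3(Wm)[0]) / (2 * eps)
```

The repeated multiplication by `W.T` and by the tanh derivative in the loop is also where the chain product $\partial s_3 / \partial s_k$ lives, which is what the next section examines.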

## Why gradients vanish

We can see this in the gradient formula:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial s_3} \frac{\partial s_3}{\partial s_k} \frac{\partial s_k}{\partial W}$$
$\frac{\partial s_{3}}{\partial s_{k}}$ is itself a chain of derivatives. For example, $\frac{\partial s_{3}}{\partial s_{1}} = \frac{\partial s_{3}}{\partial s_{2}} \frac{\partial s_{2}}{\partial s_{1}}$. If we rewrite the formula: