## GRU

RNNs have been found to perform better with the use of more complex units for activation. Here, we discuss the use of a gated activation function thereby modifying the RNN architecture. What motivates this? Well, although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this. Gated recurrent units are designed in a manner to have more persistent memory thereby making it easier for RNNs to capture long-term dependencies. Let us see mathematically how a GRU uses $h_{t−1}$ and $x_t$ to generate the next hidden state ht. We will then dive into the intuition of this architecture.

The above equations can be thought of a GRU’s four fundamental operational stages and they have intuitive interpretations that make this model much more intellectually.

**Reset gate**: controls what parts of previous hidden state are used to compute new content**Update gate**: controls what parts of hidden state are updated vs preserved**New hidden state content**: reset gate selects useful parts of prev hidden state. Use this and current input to compute new hidden content.**Hidden state**: update gate simultaneously controls what is kept from previous hidden state, and what is updated to new hidden state content

## LSTM

Long-Short-Term-Memories are another type of complex activation unit that differ a little from GRUs. The motivation for using these is similar to those for GRUs however the architecture of such units does differ. Let us first take a look at the mathematical formulation of LSTM units before diving into the intuition behind this design:

- The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
- e.g. if the forget gate is set to remember everything on every timestep, then the info in the cell is preserved indefinitely
- By contrast, it’s harder for vanilla RNN to learn a recurrent weight matrix Wh that preserves info in hidden state

- LSTM doesn’t guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies.

## LSTM vs GRU

- Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely-used
- The biggest difference is that GRU is quicker to compute and has fewer parameters
- There is no conclusive evidence that one consistently performs better than the other
- LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data)
**Rule of thumb**: start with LSTM, but switch to GRU if you want something more efficient

## reference

- course slides and notes from cs224n (http://web.stanford.edu/class/cs224n/)