Gated RNN Units

GRU

RNNs have been found to perform better when they use more complex units for activation. Here, we discuss gated activation functions, which modify the basic RNN architecture. What motivates this? Although RNNs can theoretically capture long-term dependencies, in practice they are very hard to train to do so. Gated recurrent units (GRUs) are designed to have more persistent memory, making it easier for RNNs to capture long-term dependencies. Let us see mathematically how a GRU uses $h_{t-1}$ and $x_t$ to generate the next hidden state $h_t$. We will then dive into the intuition behind this architecture.
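
A standard formulation of the GRU update (following the CS224n notes; bias terms are omitted, and gate conventions vary slightly across sources) is:

$$
\begin{aligned}
r_t &= \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) && \text{(reset gate)} \\
z_t &= \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) && \text{(update gate)} \\
\tilde{h}_t &= \tanh\left(W x_t + r_t \circ U h_{t-1}\right) && \text{(new hidden state content)} \\
h_t &= z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t && \text{(hidden state)}
\end{aligned}
$$

where $\sigma$ is the sigmoid function and $\circ$ denotes element-wise multiplication.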

The above equations can be thought of as the GRU's four fundamental operational stages, and they have intuitive interpretations that make the model much easier to understand:

  1. Reset gate: controls which parts of the previous hidden state are used to compute the new content
  2. Update gate: controls which parts of the hidden state are updated versus preserved
  3. New hidden state content: the reset gate selects the useful parts of the previous hidden state, which are combined with the current input to compute the new hidden content
  4. Hidden state: the update gate simultaneously controls what is kept from the previous hidden state and what is replaced with the new hidden state content (a minimal code sketch of these stages follows this list)
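
As a concrete illustration of the four stages, here is a minimal NumPy sketch of a single GRU step. It mirrors the equations above (biases omitted); the weight names such as W_r and U_r are chosen here for illustration and do not come from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step following the equations above (biases omitted)."""
    r_t = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)              # 1. reset gate
    z_t = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)              # 2. update gate
    h_tilde = np.tanh(params["W_h"] @ x_t + r_t * (params["U_h"] @ h_prev))  # 3. new content
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde                               # 4. blend old and new
    return h_t

# Tiny usage example with random weights (hidden size d=4, input size e=3).
rng = np.random.default_rng(0)
d, e = 4, 3
params = {name: rng.normal(scale=0.1, size=(d, d if name.startswith("U") else e))
          for name in ["W_r", "U_r", "W_z", "U_z", "W_h", "U_h"]}
h = np.zeros(d)
for x in rng.normal(size=(5, e)):  # run over a short input sequence
    h = gru_step(x, h, params)
print(h.shape)  # (4,)
```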

LSTM

Long Short-Term Memory (LSTM) units are another type of complex activation unit that differ a little from GRUs. The motivation for using them is similar to that for GRUs; however, the architecture of the units differs. Let us first take a look at the mathematical formulation of LSTM units before diving into the intuition behind this design:
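
A standard formulation of the LSTM update (again with bias terms omitted) is:

$$
\begin{aligned}
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1}\right) && \text{(input gate)} \\
f_t &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1}\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1}\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W^{(c)} x_t + U^{(c)} h_{t-1}\right) && \text{(new memory cell content)} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{(final memory cell)} \\
h_t &= o_t \circ \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$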

[Figure: LSTM cell diagram (CS224n slides). Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
  1. The LSTM architecture makes it easier for the RNN to preserve information over many timesteps
    • e.g. if the forget gate is set to remember everything on every timestep, then the information in the cell is preserved indefinitely (see the sketch after this list)
    • By contrast, it is harder for a vanilla RNN to learn a recurrent weight matrix $W_h$ that preserves information in the hidden state
  2. The LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies.
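
The additive cell update $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$ is what makes point 1 work. Below is a sketch in the same toy NumPy style as before, showing a single LSTM step and how a saturated forget gate carries the cell state forward unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations above (biases omitted)."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)      # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)      # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)      # output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)  # new memory content
    c_t = f_t * c_prev + i_t * c_tilde                     # additive cell update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Run one step with random weights (hidden size d=4, input size e=3).
rng = np.random.default_rng(0)
d, e = 4, 3
p = {name: rng.normal(scale=0.1, size=(d, d if name.startswith("U") else e))
     for name in ["W_i", "U_i", "W_f", "U_f", "W_o", "U_o", "W_c", "U_c"]}
h, c = lstm_step(rng.normal(size=e), np.zeros(d), np.zeros(d), p)

# If the forget gate saturates at ~1 and the input gate at ~0, the cell state
# passes through each step unchanged, so information can persist indefinitely.
c_prev = np.array([0.5, -1.0, 2.0])
f_t, i_t = np.ones(3), np.zeros(3)
c_t = f_t * c_prev + i_t * np.tanh(np.zeros(3))
print(np.allclose(c_t, c_prev))  # True
```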

LSTM vs GRU

  • Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely-used
  • The biggest difference is that the GRU is quicker to compute and has fewer parameters (a rough parameter count follows this list)
  • There is no conclusive evidence that one consistently performs better than the other
  • LSTM is a good default choice (especially if your data has particularly long dependencies, or you have lots of training data)
  • Rule of thumb: start with LSTM, but switch to GRU if you want something more efficient
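
As a rough sanity check on the parameter-count claim, here is a back-of-the-envelope count per layer. It ignores implementation details such as the extra bias vectors some frameworks use, so treat the exact numbers as illustrative.

```python
def gated_rnn_params(d, e):
    """Approximate per-layer parameter counts for hidden size d, input size e.

    Each gate/candidate has an input matrix (d x e), a recurrent matrix
    (d x d) and a bias (d); the LSTM has 4 such blocks, the GRU has 3.
    """
    per_block = d * e + d * d + d
    return {"LSTM": 4 * per_block, "GRU": 3 * per_block}

print(gated_rnn_params(d=256, e=128))  # {'LSTM': 394240, 'GRU': 295680}
```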

References

  1. Course slides and notes from CS224n (http://web.stanford.edu/class/cs224n/)