Gated recurrent unit

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] The GRU resembles a long short-term memory (LSTM) unit in that it uses gates to decide which features to take in or forget,[2] but it has neither a context vector nor an output gate, and therefore has fewer parameters than an LSTM.[3] On certain tasks in polyphonic music modeling, speech signal modeling and natural language processing, the GRU's performance was found to be similar to that of the LSTM.[4][5] GRUs showed that gating is helpful in general, but Bengio's team reached no firm conclusion on which of the two gating units was better.[6]
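The parameter saving can be made concrete with a rough count. The sketch below is an illustration only: it assumes a standard LSTM without peephole connections, one bias vector per block, and hypothetical sizes of 128 input features and 256 hidden units. Library implementations may count differently, for example by keeping two bias vectors per block.

```python
def gated_rnn_param_count(input_size: int, hidden_size: int, num_blocks: int) -> int:
    """Count the weights in `num_blocks` affine maps of the form W x + U h + b."""
    per_block = (
        hidden_size * input_size     # W: hidden_size x input_size
        + hidden_size * hidden_size  # U: hidden_size x hidden_size
        + hidden_size                # b: hidden_size
    )
    return num_blocks * per_block


# Hypothetical layer sizes, chosen only for illustration.
d, e = 128, 256

# GRU: update gate, reset gate, candidate activation.
gru_params = gated_rnn_param_count(d, e, num_blocks=3)
# LSTM (assumed without peepholes): input, forget and output gates, cell candidate.
lstm_params = gated_rnn_param_count(d, e, num_blocks=4)

print(gru_params, lstm_params, gru_params / lstm_params)  # 295680 394240 0.75
```

Under these assumptions the GRU carries three quarters of the LSTM's parameters, independent of the chosen sizes.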

Architecture

There are several variations on the fully gated unit, in which the gates are computed from the previous hidden state and the bias in various combinations, as well as a simplified form called the minimal gated unit.[7]

The operator $\odot$ denotes the Hadamard product in the following.

Fully gated unit

Initially, for $t = 0$, the output vector is $h_0 = 0$.

\begin{align} z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\ r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\ \hat{h}_t &= \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t \end{align}

Variables ($d$ denotes the number of input features and $e$ the number of output features):

$x_t \in \mathbb{R}^{d}$: input vector

$h_t \in \mathbb{R}^{e}$: output vector

$\hat{h}_t \in \mathbb{R}^{e}$: candidate activation vector

$z_t \in (0,1)^{e}$: update gate vector

$r_t \in (0,1)^{e}$: reset gate vector

$W \in \mathbb{R}^{e \times d}$, $U \in \mathbb{R}^{e \times e}$ and $b \in \mathbb{R}^{e}$: parameter matrices and vector which need to be learned during training

Activation functions

$\sigma$: the original is a logistic function.

$\phi$: the original is a hyperbolic tangent.

Alternative activation functions are possible, provided that $\sigma(x) \in [0, 1]$.
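The fully gated unit can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the parameter dictionary, the helper names (`gru_step`, `init_params`), the small random initialisation and the sizes are all assumptions made for the example. It uses the original activations, the logistic function for the gates and the hyperbolic tangent for the candidate.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used for the update and reset gates."""
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, p):
    """One step of the fully gated unit for a single input vector x_t."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])            # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])            # reset gate
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])  # candidate
    return (1.0 - z) * h_prev + z * h_hat                                 # new output


def init_params(d, e, rng):
    """Hypothetical initialisation: small Gaussian weights, zero biases."""
    p = {}
    for g in ("z", "r", "h"):
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(e, d))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(e, e))
        p[f"b_{g}"] = np.zeros(e)
    return p


rng = np.random.default_rng(0)
d, e, T = 4, 3, 5              # input features, output features, sequence length
params = init_params(d, e, rng)
xs = rng.normal(size=(T, d))   # a random input sequence

h = np.zeros(e)                # h_0 = 0
for x_t in xs:
    h = gru_step(x_t, h, params)

print(h.shape)  # (3,)
```

Because each new output is an element-wise convex combination of the previous output and a tanh value, every component of `h` stays strictly between -1 and 1 when starting from zero.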

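Alternate forms can be created by changing how $z_t$ and $r_t$ are computed:[8]

Each gate depends only on the previous hidden state and the bias:

\begin{align} z_t &= \sigma(U_z h_{t-1} + b_z) \\ r_t &= \sigma(U_r h_{t-1} + b_r) \end{align}

Each gate depends only on the previous hidden state:

\begin{align} z_t &= \sigma(U_z h_{t-1}) \\ r_t &= \sigma(U_r h_{t-1}) \end{align}

Each gate is computed from the bias alone:

\begin{align} z_t &= \sigma(b_z) \\ r_t &= \sigma(b_r) \end{align}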
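The difference between these gate forms can be illustrated with a single gate function that drops terms from the pre-activation. This is a sketch under stated assumptions: the variant names (`"full"`, `"state_and_bias"`, `"state_only"`, `"bias_only"`) are hypothetical labels invented for the example, and the weights are random.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gate(x_t, h_prev, W, U, b, variant="full"):
    """Compute one gate vector (z_t or r_t); variant names are hypothetical labels."""
    if variant == "full":
        pre = W @ x_t + U @ h_prev + b   # fully gated unit
    elif variant == "state_and_bias":
        pre = U @ h_prev + b             # previous hidden state and bias
    elif variant == "state_only":
        pre = U @ h_prev                 # previous hidden state only
    elif variant == "bias_only":
        pre = b                          # bias only
    else:
        raise ValueError(f"unknown variant: {variant}")
    return sigmoid(pre)


rng = np.random.default_rng(0)
d, e = 4, 3
W = rng.normal(size=(e, d))
U = rng.normal(size=(e, e))
b = rng.normal(size=e)
x_a, x_b = rng.normal(size=d), rng.normal(size=d)   # two different inputs
h_prev = rng.normal(size=e)

for name in ("full", "state_and_bias", "state_only", "bias_only"):
    same = np.allclose(gate(x_a, h_prev, W, U, b, name),
                       gate(x_b, h_prev, W, U, b, name))
    print(name, "ignores x_t:", same)
```

Only the full form reacts to the current input; in the reduced forms the input still reaches the output through the candidate activation, but no longer influences the gates directly.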

Minimal gated unit

The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:[9]

\begin{align} f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\ \hat{h}_t &= \phi(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h) \\ h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t \end{align}

Variables

$x_t$: input vector

$h_t$: output vector

$\hat{h}_t$: candidate activation vector

$f_t$: forget vector

$W$, $U$ and $b$: parameter matrices and vector
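A minimal NumPy sketch of one MGU step follows. The function name `mgu_step`, the parameter dictionary, the initialisation and the sizes are assumptions made for illustration; $\phi$ is taken to be the hyperbolic tangent, as in the fully gated unit.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mgu_step(x_t, h_prev, p):
    """One step of the minimal gated unit: a single forget gate f replaces z and r."""
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])            # forget gate
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (f * h_prev) + p["b_h"])  # candidate
    return (1.0 - f) * h_prev + f * h_hat                                 # new output


rng = np.random.default_rng(0)
d, e, T = 4, 3, 5   # hypothetical sizes: input features, output features, steps

# Hypothetical initialisation: small Gaussian weights, zero biases.
params = {}
for g in ("f", "h"):
    params[f"W_{g}"] = rng.normal(scale=0.1, size=(e, d))
    params[f"U_{g}"] = rng.normal(scale=0.1, size=(e, e))
    params[f"b_{g}"] = np.zeros(e)

h = np.zeros(e)
for x_t in rng.normal(size=(T, d)):
    h = mgu_step(x_t, h, params)

print(h.shape)  # (3,)
```

With only two blocks of weights (`f` and `h`) instead of three, this sketch holds two thirds of the parameters of the fully gated unit at the same sizes.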

Light gated recurrent unit

The light gated recurrent unit (LiGRU)[4] removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):

\begin{align} z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\ \tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\ h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \end{align}
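The LiGRU update can be sketched in NumPy as follows. Several simplifications are assumed here and are not taken from the cited paper: batch normalization is applied in training mode over the batch axis without a learned scale or shift, inputs are stored as rows (so the products are written `x @ W.T`), and the names, sizes and initialisation are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def batch_norm(a, eps=1e-5):
    """Simplified BN: normalise each feature over the batch axis (axis 0).

    Assumption: no learned scale/shift and no running statistics.
    """
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)


def ligru_step(x_t, h_prev, p):
    """One LiGRU step for a batch: x_t has shape (batch, d), h_prev (batch, e)."""
    z = sigmoid(batch_norm(x_t @ p["W_z"].T) + h_prev @ p["U_z"].T)              # update gate
    h_tilde = np.maximum(0.0, batch_norm(x_t @ p["W_h"].T) + h_prev @ p["U_h"].T)  # ReLU candidate
    return z * h_prev + (1.0 - z) * h_tilde                                      # no reset gate


rng = np.random.default_rng(0)
batch, d, e, T = 8, 4, 3, 5   # hypothetical sizes

# Hypothetical initialisation; there are no bias vectors in the equations.
params = {
    "W_z": rng.normal(scale=0.1, size=(e, d)),
    "U_z": rng.normal(scale=0.1, size=(e, e)),
    "W_h": rng.normal(scale=0.1, size=(e, d)),
    "U_h": rng.normal(scale=0.1, size=(e, e)),
}

h = np.zeros((batch, e))
for x_t in rng.normal(size=(T, batch, d)):
    h = ligru_step(x_t, h, params)

print(h.shape)  # (8, 3)
```

Note that the roles of $z_t$ and $1 - z_t$ are swapped relative to the fully gated unit: here $z_t$ weights the previous output. Since the candidate passes through ReLU, the output stays non-negative when starting from zero.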

LiGRU has been studied from a Bayesian perspective.[10] This analysis yielded a variant called the light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.

Notes and References

  1. Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv:1406.1078 [cs.CL].
  2. Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
  3. "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Archived on 2021-11-10 at https://web.archive.org/web/20211110112626/http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ (original link dead). Retrieved May 18, 2016.
  4. Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv:1803.10225. doi:10.1109/TETCI.2017.2762739. S2CID 4402991.
  5. Su, Yuahang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv:1803.01686. doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.
  6. Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
  7. Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
  8. Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv:1701.05923 [cs.NE].
  9. Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv:1701.03452 [cs.NE].
  10. Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". ICASSP 2021: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. doi:10.1109/ICASSP39728.2021.9414259.