Gated recurrent unit
Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] Like a long short-term memory (LSTM) unit, the GRU uses gates to control which features are input or forgotten,[2] but it lacks a context vector and an output gate and therefore has fewer parameters than an LSTM.[3] On certain tasks in polyphonic music modeling, speech signal modeling and natural language processing, the GRU was found to perform similarly to the LSTM.[4][5] These comparisons showed that gating is helpful in general, but Bengio's team came to no concrete conclusion on which of the two gating units was better.[6]
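The parameter saving can be made concrete with a small count. The sketch below is illustrative only: it assumes a layer with d input features and e output features, one weight block (an e×d input matrix, an e×e recurrent matrix and a bias of length e) per gate or candidate, and a standard LSTM without peephole connections. The function name `gated_param_count` is hypothetical.

```python
def gated_param_count(d, e, n_blocks):
    """Parameters in n_blocks blocks, each holding W (e x d), U (e x e) and b (e)."""
    return n_blocks * (e * d + e * e + e)

d, e = 100, 128  # example sizes, chosen arbitrarily

# GRU: update gate, reset gate and candidate activation -> 3 blocks.
gru_params = gated_param_count(d, e, 3)
# LSTM (assumed standard form): input, forget and output gates plus cell candidate -> 4 blocks.
lstm_params = gated_param_count(d, e, 4)

print(gru_params)   # 87936
print(lstm_params)  # 117248
```

Under these assumptions a GRU layer carries three quarters of the parameters of an LSTM layer of the same size, whatever d and e are.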
Architecture
There are several variations on the fully gated unit, in which gating is done using the previous hidden state and the bias in various combinations, as well as a simplified form called the minimal gated unit.[7]
The operator $\odot$ denotes the Hadamard product in the following.
Fully gated unit
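Initially, for $t = 0$, the output vector is $h_0 = 0$.
\begin{align}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\hat{h}_t &= \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{align}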
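Variables ($d$ denotes the number of input features and $e$ the number of output features):
- $x_t \in \mathbb{R}^{d}$: input vector
- $h_t \in \mathbb{R}^{e}$: output vector
- $\hat{h}_t \in \mathbb{R}^{e}$: candidate activation vector
- $z_t \in (0,1)^{e}$: update gate vector
- $r_t \in (0,1)^{e}$: reset gate vector
- $W \in \mathbb{R}^{e \times d}$, $U \in \mathbb{R}^{e \times e}$ and $b \in \mathbb{R}^{e}$: parameter matrices and vector which need to be learned during training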
Activation functions
- $\sigma$: The original is a logistic function.
- $\phi$: The original is a hyperbolic tangent.
Alternative activation functions are possible, provided that $\sigma(x) \in [0, 1]$.
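As an illustration, one time step of the fully gated unit can be sketched in NumPy. This is a minimal sketch rather than a reference implementation: it assumes the standard formulation in which both gates see the current input and the previous hidden state, uses the logistic function and tanh as activations, and the helper names (`gru_step`, `init_params`) and the small random initialization are hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One step of a fully gated unit; p maps names such as 'W_z' to arrays."""
    # Update gate: z_t = sigma(W_z x_t + U_z h_{t-1} + b_z)
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    # Reset gate: r_t = sigma(W_r x_t + U_r h_{t-1} + b_r)
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    # Candidate activation: phi(W_h x_t + U_h (r_t * h_{t-1}) + b_h), with phi = tanh
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Output: elementwise blend of the previous state and the candidate
    return (1.0 - z) * h_prev + z * h_hat

def init_params(d, e, seed=0):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in ("z", "r", "h"):
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(e, d))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(e, e))
        p[f"b_{g}"] = np.zeros(e)
    return p

d, e = 4, 3
params = init_params(d, e)
h = np.zeros(e)  # h_0 = 0
xs = np.random.default_rng(1).normal(size=(5, d))  # a sequence of 5 inputs
for x_t in xs:
    h = gru_step(x_t, h, params)
print(h)
```

Because each new state is an elementwise blend of the previous state and a tanh output, every component of `h` stays strictly between -1 and 1 when the sequence starts from the zero vector.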
Alternate forms can be created by changing $z_t$ and $r_t$:[8]
- Type 1, each gate depends only on the previous hidden state and the bias.
\begin{align}
z_t &= \sigma(U_z h_{t-1} + b_z) \\
r_t &= \sigma(U_r h_{t-1} + b_r)
\end{align}
- Type 2, each gate depends only on the previous hidden state.
\begin{align}
z_t &= \sigma(U_z h_{t-1}) \\
r_t &= \sigma(U_r h_{t-1})
\end{align}
- Type 3, each gate is computed using only the bias.
\begin{align}
z_t &= \sigma(b_z) \\
r_t &= \sigma(b_r)
\end{align}
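The three types differ only in which terms enter the gate pre-activation. A hedged sketch of a single gate (the same rule would apply to both the update and the reset gate) might look as follows; the function name `gate` and the variant labels are hypothetical, and the `"full"` case assumes the standard fully gated form with an input term.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gate(variant, W, U, b, x_t, h_prev):
    """Compute one gate vector under the chosen variant."""
    if variant == "full":    # input, previous hidden state and bias
        return sigmoid(W @ x_t + U @ h_prev + b)
    if variant == "type1":   # previous hidden state and bias
        return sigmoid(U @ h_prev + b)
    if variant == "type2":   # previous hidden state only
        return sigmoid(U @ h_prev)
    if variant == "type3":   # bias only: the gate no longer depends on t
        return sigmoid(b)
    raise ValueError(f"unknown variant: {variant}")

rng = np.random.default_rng(0)
d, e = 4, 3
W = rng.normal(size=(e, d))
U = rng.normal(size=(e, e))
b = rng.normal(size=e)
x_t, h_prev = rng.normal(size=d), rng.normal(size=e)

for variant in ("full", "type1", "type2", "type3"):
    print(variant, gate(variant, W, U, b, x_t, h_prev))
```

In this sketch the Type 3 gate is a learned constant vector, so it saves both the `W` and `U` matrices of that gate, while Types 1 and 2 save only `W`.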
Minimal gated unit
The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:[9]
\begin{align}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
\hat{h}_t &= \phi(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{align}
Variables
- $x_t$: input vector
- $h_t$: output vector
- $\hat{h}_t$: candidate activation vector
- $f_t$: forget vector
- $W$, $U$ and $b$: parameter matrices and vector
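The MGU update can be sketched in the same way. The code below is a minimal illustration of the three MGU equations, assuming tanh for the candidate activation; the function name `mgu_step` and the all-zero example parameters are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mgu_step(x_t, h_prev, W_f, U_f, b_f, W_h, U_h, b_h):
    """One step of a minimal gated unit."""
    # Forget gate: plays the role of both the update and the reset gate
    f = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
    # Candidate activation: the forget gate masks the previous state
    h_hat = np.tanh(W_h @ x_t + U_h @ (f * h_prev) + b_h)
    # Output: the same gate blends the previous state and the candidate
    return (1.0 - f) * h_prev + f * h_hat

d, e = 3, 2
zeros_W, zeros_U, zeros_b = np.zeros((e, d)), np.zeros((e, e)), np.zeros(e)
x_t = np.ones(d)
h_prev = np.array([0.2, -0.4])

# With all-zero parameters f_t = 0.5 and the candidate is 0,
# so the new state is half the previous one.
h_t = mgu_step(x_t, h_prev, zeros_W, zeros_U, zeros_b, zeros_W, zeros_U, zeros_b)
print(h_t)  # [ 0.1 -0.2]
```

Only two parameter blocks (forget gate and candidate) are needed, compared with three for the fully gated unit.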
Light gated recurrent unit
The light gated recurrent unit (LiGRU)[4] removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):
\begin{align}
z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\
\tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{align}
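A batched sketch of one LiGRU step is given below. It makes several simplifying assumptions that are not taken from the source: batch normalization is reduced to standardizing each feature over the current mini-batch, with no learned scale or shift and no running statistics, and inputs are stored as rows so that the product $W x_t$ becomes `X_t @ W.T`. The function names `batch_norm` and `ligru_step` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def batch_norm(a, eps=1e-5):
    """Simplified BN: standardize each feature over the batch axis (assumption)."""
    mean = a.mean(axis=0, keepdims=True)
    var = a.var(axis=0, keepdims=True)
    return (a - mean) / np.sqrt(var + eps)

def ligru_step(X_t, H_prev, W_z, U_z, W_h, U_h):
    """One LiGRU step for a batch; X_t is (batch, d), H_prev is (batch, e)."""
    # Update gate: BN is applied to the feed-forward term only
    z = sigmoid(batch_norm(X_t @ W_z.T) + H_prev @ U_z.T)
    # Candidate: ReLU replaces tanh, and there is no reset gate
    h_tilde = np.maximum(0.0, batch_norm(X_t @ W_h.T) + H_prev @ U_h.T)
    # Note the convention: z_t weights the previous state here
    return z * H_prev + (1.0 - z) * h_tilde

rng = np.random.default_rng(0)
batch, d, e = 8, 4, 3
W_z = rng.normal(scale=0.1, size=(e, d))
W_h = rng.normal(scale=0.1, size=(e, d))
U_z = rng.normal(scale=0.1, size=(e, e))
U_h = rng.normal(scale=0.1, size=(e, e))

H = np.zeros((batch, e))  # zero initial state for every sequence in the batch
for X_t in rng.normal(size=(5, batch, d)):  # 5 time steps
    H = ligru_step(X_t, H, W_z, U_z, W_h, U_h)
print(H.shape)  # (8, 3)
```

Since the candidate is a ReLU output and the gate lies between 0 and 1, the hidden state stays non-negative when it starts at zero; unlike tanh, ReLU does not bound it from above.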
LiGRU has been studied from a Bayesian perspective.[10] This analysis yielded a variant called the light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.
Notes and References
- Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv:1406.1078 [cs.CL].
- Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
- "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Archived from the original on 2021-11-10: https://web.archive.org/web/20211110112626/http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/. Retrieved May 18, 2016.
- Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv:1803.10225. doi:10.1109/TETCI.2017.2762739. S2CID 4402991.
- Su, Yuahang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv:1803.01686. doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.
- Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
- Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
- Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv:1701.05923 [cs.NE].
- Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv:1701.03452 [cs.NE].
- Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". ICASSP 2021. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. doi:10.1109/ICASSP39728.2021.9414259.