Gating mechanism

In neural networks, a gating mechanism is an architectural motif for controlling the flow of activation and gradient signals. Gating mechanisms are most prominently used in recurrent neural networks (RNNs), but have also found applications in other architectures.

RNNs

Gating mechanisms are the centerpiece of long short-term memory (LSTM).[1] They were proposed to mitigate the vanishing gradient problem often encountered by regular RNNs.

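An LSTM unit contains three gates: an input gate, which controls how much of the new candidate value is written to the memory cell; a forget gate, which controls how much of the previous cell state is retained; and an output gate, which controls how much of the cell state is exposed as the hidden state.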

The equations for LSTM are:[2]

\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i) \\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f) \\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o) \\
\tilde{\mathbf{C}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c) \\
\mathbf{C}_t &= \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t \\
\mathbf{H}_t &= \mathbf{O}_t \odot \tanh(\mathbf{C}_t)
\end{aligned}

Here, $\odot$ represents elementwise multiplication.

The gated recurrent unit (GRU) simplifies the LSTM.[3] Compared to the LSTM, the GRU has just two gates: a reset gate and an update gate. The GRU also merges the cell state and hidden state. The reset gate roughly corresponds to the forget gate, and the update gate roughly corresponds to the input gate. The output gate is removed.
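To make the comparison concrete, the LSTM update given earlier can be traced with a small numerical sketch. It uses only the Python standard library and handles one example at a time. The helper names (`affine`, `lstm_step`, `zero_params`) and the `params` dictionary layout are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(v):
    """Elementwise logistic sigmoid of a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(x, h, weights):
    """Compute x @ W_x + h @ W_h + b for row vectors stored as lists."""
    W_x, W_h, b = weights
    out = []
    for j in range(len(b)):
        total = b[j]
        total += sum(x[i] * W_x[i][j] for i in range(len(x)))
        total += sum(h[k] * W_h[k][j] for k in range(len(h)))
        out.append(total)
    return out

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params[g] = (W_x, W_h, b) for g in "ifoc" (assumed layout)."""
    i = sigmoid(affine(x, h_prev, params["i"]))  # input gate
    f = sigmoid(affine(x, h_prev, params["f"]))  # forget gate
    o = sigmoid(affine(x, h_prev, params["o"]))  # output gate
    c_tilde = [math.tanh(a) for a in affine(x, h_prev, params["c"])]  # candidate
    # Cell state: keep part of the old memory, add part of the candidate.
    c = [f_k * c_k + i_k * ct_k
         for f_k, c_k, i_k, ct_k in zip(f, c_prev, i, c_tilde)]
    # Hidden state: the output gate scales the squashed cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def zero_params(n_in, n_hidden):
    """All-zero (W_x, W_h, b) triple, used only to build the toy example."""
    return ([[0.0] * n_hidden for _ in range(n_in)],
            [[0.0] * n_hidden for _ in range(n_hidden)],
            [0.0] * n_hidden)

# Toy setting: 2 inputs, 1 hidden unit, all weights zero.
params = {g: zero_params(2, 1) for g in "ifoc"}
# Bias the forget gate open and the input gate shut.
params["f"] = (params["f"][0], params["f"][1], [20.0])
params["i"] = (params["i"][0], params["i"][1], [-20.0])

h, c = lstm_step([1.0, -1.0], [0.3], [2.0], params)
print(c)  # stays very close to the previous cell state [2.0]
```

With the forget gate saturated near 1 and the input gate near 0, the cell state is carried across the step almost unchanged. This additive path is what lets the LSTM preserve information, and gradient, over many time steps.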

There are several variants of GRU. One particular variant has these equations:[4]

\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r) \\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z) \\
\tilde{\mathbf{H}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xh} + (\mathbf{R}_t \odot \mathbf{H}_{t-1}) \mathbf{W}_{hh} + \mathbf{b}_h) \\
\mathbf{H}_t &= \mathbf{Z}_t \odot \mathbf{H}_{t-1} + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t
\end{aligned}
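The GRU variant above can be sketched the same way, as a single time step on plain Python lists. The function names and the `params` layout (a `(W_x, W_h, b)` triple for each of the reset gate `"r"`, update gate `"z"`, and candidate `"h"`) are assumptions made for this illustration.

```python
import math

def sigmoid(v):
    """Elementwise logistic sigmoid of a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(x, h, weights):
    """Compute x @ W_x + h @ W_h + b for row vectors stored as lists."""
    W_x, W_h, b = weights
    return [b[j]
            + sum(x[i] * W_x[i][j] for i in range(len(x)))
            + sum(h[k] * W_h[k][j] for k in range(len(h)))
            for j in range(len(b))]

def gru_step(x, h_prev, params):
    """One GRU time step; params[g] = (W_x, W_h, b) for g in "rzh" (assumed layout)."""
    r = sigmoid(affine(x, h_prev, params["r"]))  # reset gate
    z = sigmoid(affine(x, h_prev, params["z"]))  # update gate
    # The reset gate masks the previous state before it enters the candidate.
    gated_h = [r_k * h_k for r_k, h_k in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in affine(x, gated_h, params["h"])]
    # Interpolate between the old state and the candidate state.
    return [z_k * h_k + (1.0 - z_k) * ht_k
            for z_k, h_k, ht_k in zip(z, h_prev, h_tilde)]

def make_params(z_bias):
    """Toy parameters: 1 input, 1 hidden unit; only the update-gate bias varies."""
    return {
        "r": ([[0.0]], [[0.0]], [0.0]),
        "z": ([[0.0]], [[0.0]], [z_bias]),
        "h": ([[1.0]], [[0.0]], [0.0]),  # candidate depends only on the input here
    }

x, h_prev = [2.0], [-0.5]
print(gru_step(x, h_prev, make_params(20.0)))   # close to h_prev
print(gru_step(x, h_prev, make_params(-20.0)))  # close to tanh(2.0)
```

In this formulation a saturated update gate (values near 1) copies the previous hidden state forward, while values near 0 replace it with the candidate.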

Gated Linear Unit

Gated Linear Units (GLUs)[5] adapt the gating mechanism for use in feedforward neural networks, often within transformer-based architectures. They are defined as:

\mathrm{GLU}(a, b) = a \odot \sigma(b)

where $a$ and $b$ are the first and second inputs, respectively, and $\sigma$ represents the sigmoid activation function.

Replacing $\sigma$ with other activation functions leads to variants of GLU:

\begin{aligned}
\mathrm{ReGLU}(a, b) &= a \odot \text{ReLU}(b) \\
\mathrm{GEGLU}(a, b) &= a \odot \text{GELU}(b) \\
\mathrm{SwiGLU}(a, b, \beta) &= a \odot \text{Swish}_\beta(b)
\end{aligned}

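where ReLU, GELU, and Swish are different activation functions: $\text{ReLU}(x) = \max(0, x)$, $\text{GELU}(x) = x\,\Phi(x)$ with $\Phi$ the standard normal cumulative distribution function, and $\text{Swish}_\beta(x) = x\,\sigma(\beta x)$.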
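The elementwise definitions above translate almost directly into code. The following is a minimal sketch on Python lists using only the standard library. The function names are chosen here for illustration, and GELU is written in its exact form via the error function.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def relu(t):
    return max(0.0, t)

def gelu(t):
    # Exact GELU: t * Phi(t), with Phi the standard normal CDF.
    return 0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0)))

def swish(t, beta=1.0):
    return t * sigmoid(beta * t)

def gated(a, b, activation):
    """Generic gated unit: a is the value path, activation(b) is the gate."""
    return [a_k * activation(b_k) for a_k, b_k in zip(a, b)]

def glu(a, b):
    return gated(a, b, sigmoid)

def reglu(a, b):
    return gated(a, b, relu)

def geglu(a, b):
    return gated(a, b, gelu)

def swiglu(a, b, beta=1.0):
    return gated(a, b, lambda t: swish(t, beta))

a = [2.0, 3.0]
b = [-1.0, 4.0]
print(glu(a, b))     # each entry of a scaled by a factor in (0, 1)
print(reglu(a, b))   # first entry is zeroed because b[0] is negative
print(geglu(a, b))
print(swiglu(a, b))
```

Only the sigmoid gate is confined to the interval (0, 1); the ReLU, GELU, and Swish gates are unbounded above, so those variants can amplify as well as attenuate the first input.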

In transformer models, such gating units are often used in the feedforward modules. For a single vector input, this results in:[6]

\begin{aligned}
\operatorname{GLU}(x, W, V, b, c) &= \sigma(x W + b) \odot (x V + c) \\
\operatorname{Bilinear}(x, W, V, b, c) &= (x W + b) \odot (x V + c) \\
\operatorname{ReGLU}(x, W, V, b, c) &= \max(0, x W + b) \odot (x V + c) \\
\operatorname{GEGLU}(x, W, V, b, c) &= \operatorname{GELU}(x W + b) \odot (x V + c) \\
\operatorname{SwiGLU}(x, W, V, b, c, \beta) &= \operatorname{Swish}_\beta(x W + b) \odot (x V + c)
\end{aligned}
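The five parameterized units above differ only in the activation applied to the first projection, so they can be sketched with one function. The name `gated_unit`, the `variant` argument, and the toy weights below are assumptions for illustration; real transformer layers operate on batches and usually follow the gated unit with a further output projection.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gelu(t):
    # Exact GELU via the error function.
    return 0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0)))

def linear(x, W, b):
    """Return x @ W + b for a row vector x and a matrix W given as a list of rows."""
    return [b[j] + sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(b))]

def gated_unit(x, W, V, b, c, variant="GLU", beta=1.0):
    """Compute activation(xW + b) * (xV + c) elementwise for the chosen variant."""
    activations = {
        "GLU": sigmoid,
        "Bilinear": lambda t: t,
        "ReGLU": lambda t: max(0.0, t),
        "GEGLU": gelu,
        "SwiGLU": lambda t: t * sigmoid(beta * t),
    }
    act = activations[variant]
    gate = [act(t) for t in linear(x, W, b)]   # activated projection
    value = linear(x, V, c)                    # plain linear projection
    return [g * v for g, v in zip(gate, value)]

# Toy weights: identity projections and zero biases, so xW + b = xV + c = x.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
x = [1.0, -2.0]
for name in ("GLU", "Bilinear", "ReGLU", "GEGLU", "SwiGLU"):
    print(name, gated_unit(x, identity, identity, zeros, zeros, variant=name))
```

With identity projections the bilinear variant reduces to squaring each input, which makes it easy to check the other variants against their activation functions by hand.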

Other architectures

Gating mechanisms are used in highway networks, which were designed by unrolling an LSTM.
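As a rough illustration of how such a gate looks in a feedforward setting, the sketch below implements one highway layer in its commonly stated form, y = T(x) * H(x) + (1 - T(x)) * x, where T is a sigmoid "transform gate". This formula is not given in the text above and is supplied here as an assumption, as are the choice of tanh for H and all function and parameter names.

```python
import math

def linear(x, W, b):
    """Return x @ W + b for a row vector x and a matrix W given as a list of rows."""
    return [b[j] + sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(b))]

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer; input and output must have the same width."""
    h = [math.tanh(t) for t in linear(x, W_H, b_H)]              # transform H(x)
    t_gate = [1.0 / (1.0 + math.exp(-t))
              for t in linear(x, W_T, b_T)]                      # transform gate T(x)
    # Blend the transformed signal with the untouched input, per coordinate.
    return [t * h_k + (1.0 - t) * x_k
            for t, h_k, x_k in zip(t_gate, h, x)]

identity = [[1.0, 0.0], [0.0, 1.0]]
zero_matrix = [[0.0, 0.0], [0.0, 0.0]]
x = [0.5, -1.0]

# Gate biased shut: the layer passes its input through almost unchanged.
print(highway_layer(x, identity, [0.0, 0.0], zero_matrix, [-20.0, -20.0]))
# Gate biased open: the layer behaves like an ordinary tanh layer.
print(highway_layer(x, identity, [0.0, 0.0], zero_matrix, [20.0, 20.0]))
```

The structure mirrors the LSTM cell update, with the carried input playing the role of the previous cell state.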

Channel gating[7] uses a gate to control the flow of information through different channels inside a convolutional neural network (CNN).


References

  1. Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
  2. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "10.1. Long Short-Term Memory (LSTM)". Dive into Deep Learning. Cambridge, New York, Port Melbourne, New Delhi, Singapore: Cambridge University Press. ISBN 978-1-009-38943-3. https://d2l.ai/chapter_recurrent-modern/lstm.html
  3. Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". Association for Computational Linguistics. arXiv:1406.1078.
  4. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "10.2. Gated Recurrent Units (GRU)". Dive into Deep Learning. Cambridge, New York, Port Melbourne, New Delhi, Singapore: Cambridge University Press. ISBN 978-1-009-38943-3. https://d2l.ai/chapter_recurrent-modern/gru.html
  5. Dauphin, Yann N.; Fan, Angela; Auli, Michael; Grangier, David (2017-07-17). "Language Modeling with Gated Convolutional Networks". Proceedings of the 34th International Conference on Machine Learning. PMLR. pp. 933–941. arXiv:1612.08083.
  6. Shazeer, Noam (February 14, 2020). "GLU Variants Improve Transformer". arXiv:2002.05202 [cs.LG].
  7. Hua, Weizhe; Zhou, Yuan; De Sa, Christopher M; Zhang, Zhiru; Suh, G. Edward (2019). "Channel Gating Neural Networks". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc. arXiv:1805.12549.
