Gated Linear Unit

In neural networks, the gating mechanism is an architectural motif for controlling the flow of activation and gradient signals. Gating mechanisms are most prominently used in recurrent neural networks (RNNs), but have also found applications in other architectures.


RNNs

Gating mechanisms are the centerpiece of long short-term memory (LSTM). They were proposed to mitigate the vanishing gradient problem often encountered by regular RNNs.

An LSTM unit contains three gates:

* An input gate, which controls the flow of new information into the memory cell
* A forget gate, which controls how much information is retained from the previous time step
* An output gate, which controls how much information is passed to the next layer

The equations for LSTM are:

\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{x}_t \mathbf{W}_i + \mathbf{h}_{t-1} \mathbf{U}_i + \mathbf{b}_i) \\
\mathbf{f}_t &= \sigma(\mathbf{x}_t \mathbf{W}_f + \mathbf{h}_{t-1} \mathbf{U}_f + \mathbf{b}_f) \\
\mathbf{o}_t &= \sigma(\mathbf{x}_t \mathbf{W}_o + \mathbf{h}_{t-1} \mathbf{U}_o + \mathbf{b}_o) \\
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{x}_t \mathbf{W}_c + \mathbf{h}_{t-1} \mathbf{U}_c + \mathbf{b}_c) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{aligned}

Here, \odot represents elementwise multiplication.
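The following is a minimal NumPy sketch of a single LSTM step implementing the equations above; the parameter names (W_i, U_i, b_i, and so on) are illustrative and not taken from any particular library.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # Each gate is a sigmoid of a learned affine function of the
    # current input x_t and the previous hidden state h_prev.
    i = sigmoid(x_t @ p["W_i"] + h_prev @ p["U_i"] + p["b_i"])        # input gate
    f = sigmoid(x_t @ p["W_f"] + h_prev @ p["U_f"] + p["b_f"])        # forget gate
    o = sigmoid(x_t @ p["W_o"] + h_prev @ p["U_o"] + p["b_o"])        # output gate
    c_tilde = np.tanh(x_t @ p["W_c"] + h_prev @ p["U_c"] + p["b_c"])  # candidate cell state
    c = f * c_prev + i * c_tilde  # forget part of the old memory, admit new memory
    h = o * np.tanh(c)            # expose a gated view of the cell state
    return h, c

Because the cell state is updated by elementwise multiplication and addition rather than by repeated squashing nonlinearities, gradients can flow through it across many time steps, which is how the gates mitigate the vanishing gradient problem.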
The gated recurrent unit (GRU) simplifies the LSTM. Compared to the LSTM, the GRU has just two gates, a reset gate and an update gate, and it merges the cell state and hidden state into a single hidden state. The reset gate roughly corresponds to the forget gate, and the update gate roughly corresponds to the input gate; the output gate is removed. There are several variants of GRU. One particular variant has these equations:

\begin{aligned}
\mathbf{r}_t &= \sigma(\mathbf{x}_t \mathbf{W}_r + \mathbf{h}_{t-1} \mathbf{U}_r + \mathbf{b}_r) \\
\mathbf{z}_t &= \sigma(\mathbf{x}_t \mathbf{W}_z + \mathbf{h}_{t-1} \mathbf{U}_z + \mathbf{b}_z) \\
\tilde{\mathbf{h}}_t &= \tanh(\mathbf{x}_t \mathbf{W}_h + (\mathbf{r}_t \odot \mathbf{h}_{t-1}) \mathbf{U}_h + \mathbf{b}_h) \\
\mathbf{h}_t &= \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t
\end{aligned}
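A matching NumPy sketch of one step of this GRU variant, again with illustrative parameter names:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    r = sigmoid(x_t @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])  # reset gate
    z = sigmoid(x_t @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])  # update gate
    # The reset gate decides how much of the previous state enters the candidate.
    h_tilde = np.tanh(x_t @ p["W_h"] + (r * h_prev) @ p["U_h"] + p["b_h"])
    # The update gate interpolates between the old state and the candidate.
    return z * h_prev + (1.0 - z) * h_tilde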


Gated Linear Unit

Gated Linear Units (GLUs) adapt the gating mechanism for use in feedforward neural networks, often within transformer-based architectures. They are defined as:

\mathrm{GLU}(a, b) = a \odot \sigma(b)

where a and b are the first and second inputs, respectively, and \sigma is the sigmoid activation function.
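In code the definition is a one-liner; this NumPy sketch assumes a and b are arrays of the same shape:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(a, b):
    # b produces gates in (0, 1) that scale each component of a.
    return a * sigmoid(b)

In practice, a and b are typically two learned projections of the same input, as in the feedforward forms below.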
Replacing \sigma with other activation functions leads to variants of GLU:

\begin{aligned}
\mathrm{ReGLU}(a, b) &= a \odot \operatorname{ReLU}(b) \\
\mathrm{GEGLU}(a, b) &= a \odot \operatorname{GELU}(b) \\
\mathrm{SwiGLU}(a, b, \beta) &= a \odot \operatorname{Swish}_\beta(b)
\end{aligned}

where ReLU, GELU, and Swish are different activation functions. In transformer models, such gating units are often used in the feedforward modules. For a single vector input x, this results in:

\begin{aligned}
\operatorname{GLU}(x, W, V, b, c) &= \sigma(x W + b) \odot (x V + c) \\
\operatorname{Bilinear}(x, W, V, b, c) &= (x W + b) \odot (x V + c) \\
\operatorname{ReGLU}(x, W, V, b, c) &= \max(0, x W + b) \odot (x V + c) \\
\operatorname{GEGLU}(x, W, V, b, c) &= \operatorname{GELU}(x W + b) \odot (x V + c) \\
\operatorname{SwiGLU}(x, W, V, b, c, \beta) &= \operatorname{Swish}_\beta(x W + b) \odot (x V + c)
\end{aligned}
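As an illustration of how the SwiGLU form might sit inside a transformer feedforward module, here is a NumPy sketch. The output projection W_out, b_out is an assumption added for completeness (the formulas above define only the gating step), and all parameter names are illustrative.

import numpy as np

def swish(x, beta=1.0):
    # Swish_beta(x) = x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def swiglu_ffn(x, W, b, V, c, W_out, b_out, beta=1.0):
    gated = swish(x @ W + b, beta) * (x @ V + c)  # SwiGLU(x, W, V, b, c, beta)
    return gated @ W_out + b_out                  # project back to the model width

Since the gated block uses three weight matrices where an ordinary two-layer feedforward block uses two, implementations often shrink the hidden width to keep the parameter count comparable.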


Other architectures

Gating mechanisms are used in highway networks, which were designed by unrolling an LSTM. Channel gating uses a gate to control the flow of information through different channels inside a convolutional neural network (CNN).
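As a sketch, one highway layer under the common coupling where the carry gate equals 1 - T can be written as follows in NumPy (parameter names illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    t = sigmoid(x @ W_t + b_t)   # transform gate T(x)
    h = np.tanh(x @ W_h + b_h)   # candidate transform H(x)
    # Where t is near 0, the layer passes x through unchanged (the "highway");
    # where t is near 1, it applies the full transform.
    return t * h + (1.0 - t) * x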


See also

* Recurrent neural network
* Long short-term memory
* Gated recurrent unit
* Transformer (deep learning architecture)
* Activation function

