Gated Recurrent Unit

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks (RNNs), introduced in 2014 by Kyunghyun Cho et al. The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features, but it lacks a context vector and an output gate, resulting in fewer parameters than the LSTM. The GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of the LSTM. GRUs showed that gating is indeed helpful in general, and Bengio's team came to no concrete conclusion on which of the two gating units was better.


Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, as well as a simplified form called the minimal gated unit. The operator \odot denotes the Hadamard product (element-wise product) in the following.


Fully gated unit

Initially, for t = 0, the output vector is h_0 = 0.

: \begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\hat{h}_t &= \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{aligned}

Variables (d denotes the number of input features and e the number of output features):
* x_t \in \mathbb{R}^d: input vector
* h_t \in \mathbb{R}^e: output vector
* \hat{h}_t \in \mathbb{R}^e: candidate activation vector
* z_t \in (0,1)^e: update gate vector
* r_t \in (0,1)^e: reset gate vector
* W \in \mathbb{R}^{e \times d}, U \in \mathbb{R}^{e \times e} and b \in \mathbb{R}^e: parameter matrices and vector which need to be learned during training
Activation functions:
* \sigma: the original is a logistic function.
* \phi: the original is a hyperbolic tangent.

Alternative activation functions are possible, provided that \sigma(x) \in [0, 1].

Alternate forms can be created by changing z_t and r_t:
* Type 1, each gate depends only on the previous hidden state and the bias.
*: \begin{aligned} z_t &= \sigma(U_z h_{t-1} + b_z) \\ r_t &= \sigma(U_r h_{t-1} + b_r) \end{aligned}
* Type 2, each gate depends only on the previous hidden state.
*: \begin{aligned} z_t &= \sigma(U_z h_{t-1}) \\ r_t &= \sigma(U_r h_{t-1}) \end{aligned}
* Type 3, each gate is computed using only the bias.
*: \begin{aligned} z_t &= \sigma(b_z) \\ r_t &= \sigma(b_r) \end{aligned}
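As a concrete illustration, the following is a minimal NumPy sketch of one step of the fully gated unit above. It is a toy example under stated assumptions, not a reference implementation: the parameter shapes follow the variable list, \sigma is the logistic function and \phi is tanh, and the dimensions and random weights are arbitrary placeholders.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One step of the fully gated unit: returns h_t from x_t and h_{t-1}."""
    z_t = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])            # update gate
    r_t = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])            # reset gate
    h_hat = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r_t * h_prev) + p["bh"])  # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_hat                            # new hidden state h_t

# Toy dimensions: d = 3 input features, e = 4 output features (arbitrary choices).
d, e = 3, 4
rng = np.random.default_rng(0)
p = {}
for g in "zrh":
    p["W" + g] = 0.1 * rng.standard_normal((e, d))
    p["U" + g] = 0.1 * rng.standard_normal((e, e))
    p["b" + g] = np.zeros(e)
h_t = np.zeros(e)                          # h_0 = 0
for x_t in rng.standard_normal((5, d)):    # a sequence of 5 input vectors
    h_t = gru_step(x_t, h_t, p)

In practice the parameters W, U and b would be learned by backpropagation through time rather than fixed at random values as in this sketch.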


Minimal gated unit

The minimal gated unit (MGU) is similar to the fully gated unit, except the update and reset gate vector is merged into a forget gate. This also implies that the equation for the output vector must be changed:

: \begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
\hat{h}_t &= \phi(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{aligned}

Variables:
* x_t: input vector
* h_t: output vector
* \hat{h}_t: candidate activation vector
* f_t: forget vector
* W, U and b: parameter matrices and vector
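Under the same toy NumPy setup as the sketch above, and assuming the parameter dictionary p also holds hypothetical forget-gate parameters Wf, Uf and bf with the same shapes as the gate parameters before, one MGU step can be written as:

def mgu_step(x_t, h_prev, p):
    """One step of the minimal gated unit: a single forget gate replaces z_t and r_t."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])            # forget gate
    h_hat = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (f_t * h_prev) + p["bh"])  # candidate activation
    return (1.0 - f_t) * h_prev + f_t * h_hat                            # new hidden state h_t

Merging the two gates removes one full set of W, U and b parameters relative to the fully gated unit.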


Light gated recurrent unit

The light gated recurrent unit (LiGRU) removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):

: \begin{aligned}
z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\
\tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}

LiGRU has been studied from a Bayesian perspective. This analysis yielded a variant called light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.
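For illustration, here is a rough NumPy sketch of one LiGRU step operating on a mini-batch, reusing the sigmoid helper and parameter shapes from the sketches above. The bn() function is a simplified stand-in that only standardizes each feature over the batch; a real batch-normalization layer would also carry learned scale and shift parameters and running statistics.

def bn(x, eps=1e-5):
    """Simplified batch normalization: standardize each feature over the batch axis."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def ligru_step(X_t, H_prev, p):
    """One LiGRU step for a mini-batch: no reset gate, ReLU candidate, BN on the input projections."""
    z_t = sigmoid(bn(X_t @ p["Wz"].T) + H_prev @ p["Uz"].T)              # update gate
    h_tilde = np.maximum(0.0, bn(X_t @ p["Wh"].T) + H_prev @ p["Uh"].T)  # ReLU candidate
    return z_t * H_prev + (1.0 - z_t) * h_tilde                          # new hidden states h_t

Here X_t has shape (batch, d) and H_prev has shape (batch, e), with the same (e, d) and (e, e) parameter shapes as above; note that, following the equations, no bias terms are used.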

