Highway Network

In machine learning, the Highway Network was the first working very deep feedforward neural network with hundreds of layers, much deeper than previous neural networks. It uses ''skip connections'' modulated by learned gating mechanisms to regulate information flow, inspired by long short-term memory (LSTM) recurrent neural networks. The advantage of the Highway Network over other deep learning architectures is its ability to overcome or partially prevent the vanishing gradient problem, thus improving its optimization. Gating mechanisms are used to facilitate information flow across the many layers ("information highways"). Highway Networks have found use in text sequence labeling and speech recognition tasks.

In 2014, the state of the art was training deep neural networks with 20 to 30 layers. Stacking too many layers led to a steep reduction in training accuracy, known as the "degradation" problem. In 2015, two techniques were developed to train such networks: the Highway Network (published in May) and the residual neural network, or ResNet (December). ResNet behaves like an open-gated Highway Net.


Model

The model has two gates in addition to the H(x, W_H) gate: the transform gate T(x, W_T) and the carry gate C(x, W_C). The latter two gates are non-linear transfer functions (specifically sigmoid by convention). The function H can be any desired transfer function. The carry gate is defined as C(x, W_C) = 1 - T(x, W_T), while the transform gate is just a gate with a sigmoid transfer function.


Structure

The structure of a hidden layer in the Highway Network follows the equation:
\begin{aligned}
y &= H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C) \\
  &= H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))
\end{aligned}
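A minimal NumPy sketch of a single highway layer is shown below. The choice of tanh for H, the toy dimensions, the bias initialization, and the function names are illustrative assumptions rather than details from the original paper (which allows H to be any transfer function):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = H(x, W_H) * T(x, W_T) + x * (1 - T(x, W_T)).

    tanh is assumed for H purely for illustration; the carry gate is
    C = 1 - T by definition, and T uses a sigmoid by convention.
    """
    H = np.tanh(W_H @ x + b_H)      # candidate transform H(x, W_H)
    T = sigmoid(W_T @ x + b_T)      # transform gate T(x, W_T)
    return H * T + x * (1.0 - T)    # T -> 1 gives y = H(x); T -> 0 carries x through unchanged

# Toy usage: input and output dimensions must match so that x can be carried through.
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
W_H = rng.standard_normal((d, d))
W_T = rng.standard_normal((d, d))
b_H = np.zeros(d)
b_T = -2.0 * np.ones(d)             # a negative gate bias keeps T small, so the layer starts near the identity
print(highway_layer(x, W_H, b_H, W_T, b_T))
```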


Related work

Sepp Hochreiter analyzed the vanishing gradient problem in 1991 and attributed to it the reason why deep learning did not work well. To overcome this problem, long short-term memory (LSTM) recurrent neural networks have residual connections with a weight of 1.0 in every LSTM cell (called the constant error carrousel) to compute y_{t+1} = F(x_t) + x_t. During backpropagation through time, this becomes the residual formula y = F(x) + x for feedforward neural networks. This enables training very deep recurrent neural networks with a very long time span t. A later LSTM version published in 2000 modulates the identity LSTM connections by so-called "forget gates" such that their weights are not fixed to 1.0 but can be learned. In experiments, the forget gates were initialized with positive bias weights, thus being opened, addressing the vanishing gradient problem. As long as the forget gates of the 2000 LSTM are open, it behaves like the 1997 LSTM.

The Highway Network of May 2015 applies these principles to feedforward neural networks. It was reported to be "the first very deep feedforward network with hundreds of layers". It is like a 2000 LSTM with forget gates unfolded in time, while the later residual networks have no equivalent of forget gates and are like the unfolded original 1997 LSTM. If the skip connections in Highway Networks are "without gates", or if their gates are kept open (activation 1.0), they become residual networks. The residual connection is a special case of the "short-cut connection" or "skip connection" by Rosenblatt (1961) and Lang & Witbrock (1988), which has the form y = F(x) + Ax. Here the randomly initialized weight matrix A does not have to be the identity mapping. Every residual connection is a skip connection, but almost all skip connections are not residual connections.

The original Highway Network paper not only introduced the basic principle for very deep feedforward networks, but also included experimental results with 20-, 50-, and 100-layer networks, and mentioned ongoing experiments with up to 900 layers. Networks with 50 or 100 layers had lower training error than their plain network counterparts, but no lower training error than their 20-layer counterpart (on the MNIST dataset; Figure 1 in the original paper). No improvement in test accuracy was reported for networks deeper than 19 layers (on the CIFAR-10 dataset; Table 1 in the original paper). The ResNet paper, however, provided strong experimental evidence for the benefits of going deeper than 20 layers. It argued that the identity mapping without modulation is crucial, and noted that modulation in the skip connection can still lead to vanishing signals in forward and backward propagation (Section 3 in the ResNet paper). This is also why the forget gates of the 2000 LSTM were initially opened through positive bias weights: as long as the gates are open, it behaves like the 1997 LSTM. Similarly, a Highway Net whose gates are opened through strongly positive bias weights behaves like a ResNet. The skip connections used in modern neural networks (e.g., Transformers) are dominantly identity mappings.
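To make the distinction between these connection types concrete, the sketch below (an illustration under assumed choices, not code from any of the cited papers) contrasts the residual connection y = F(x) + x with the more general short-cut connection y = F(x) + Ax; the residual connection is the special case where A is the identity. The use of tanh for F and the toy dimensions are assumptions:

```python
import numpy as np

def F(x, W, b):
    # arbitrary layer transform; tanh is assumed purely for illustration
    return np.tanh(W @ x + b)

def residual_connection(x, W, b):
    # y = F(x) + x: the skip path is the identity mapping
    return F(x, W, b) + x

def shortcut_connection(x, W, b, A):
    # y = F(x) + A x: the skip path is a weight matrix A that need not be
    # the identity, so this is a skip connection but not, in general,
    # a residual connection
    return F(x, W, b) + A @ x

rng = np.random.default_rng(1)
d = 4
x = rng.standard_normal(d)
W = rng.standard_normal((d, d))
b = np.zeros(d)
A = rng.standard_normal((d, d))           # randomly initialized, as in the 1961/1988 short-cut form
print(residual_connection(x, W, b))       # equals shortcut_connection with A set to the identity
print(shortcut_connection(x, W, b, A))
```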

