
A residual neural network (ResNet)
is an
artificial neural network (ANN). It is a gateless or open-gated variant of the
HighwayNet, the first working very deep
feedforward neural network
with hundreds of layers, much deeper than previous neural networks. ''Skip connections'' or ''shortcuts'' are used to jump over some layers (
HighwayNets may also learn the skip weights themselves through an additional weight matrix for their gates).
Typical ''ResNet'' models are implemented with double- or triple-layer skips that contain nonlinearities (
ReLU) and
batch normalization in between. Models with several parallel skips are referred to as ''DenseNets''. In the context of residual neural networks, a non-residual network may be described as a ''plain network''.
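As an illustration, a minimal sketch of such a double-layer residual block in PyTorch; the layer shapes and names here are illustrative rather than a prescribed implementation:

<syntaxhighlight lang="python">
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Double-layer skip with ReLU nonlinearities and batch normalization in between."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))    # first skipped layer
        out = self.bn2(self.conv2(out))             # second skipped layer
        return self.relu(out + x)                   # shortcut: add the block input back in
</syntaxhighlight>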
As in the case of
Long Short-Term Memory recurrent neural networks,
there are two main reasons to add skip connections: to avoid the problem of
vanishing gradients and thus make the network easier to optimize, with gating mechanisms that facilitate information flow across many layers ("information highways"); or to mitigate the degradation (accuracy saturation) problem, where adding more layers to a suitably deep model leads to higher training error.
During training, the weights adapt to mute the upstream layer and amplify the previously skipped layer. In the simplest case, only the weights for the adjacent layer's connection are adapted, with no explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection (i.e., a HighwayNet should be used).
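For contrast with the identity shortcut above, a minimal sketch of a gated (highway-style) skip, in which an additional learned weight matrix controls how much of the input is carried through unchanged; the names and shapes are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Highway-style layer: a learned gate t blends the transformed path with the skip path.
    W_H and W_T are square so that all terms share the shape of x."""
    h = np.tanh(W_H @ x + b_H)       # transformed path
    t = sigmoid(W_T @ x + b_T)       # transform gate, elementwise in (0, 1)
    return t * h + (1.0 - t) * x     # t -> 0 copies the input through; t -> 1 keeps only h
</syntaxhighlight>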
Skipping effectively simplifies the network, using fewer layers in the initial training stages. This speeds learning by reducing the impact of vanishing gradients,
as there are fewer layers to propagate through. The network then gradually restores the skipped layers as it learns the
feature space. Towards the end of training, when all layers are expanded, it stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space. This makes it more vulnerable to perturbations that cause it to leave the manifold, and necessitates extra training data to recover.
A residual neural network was used to win the
ImageNet 2015 competition,
and has become the most cited neural network of the 21st century.
Forward propagation
Given a weight matrix <math>W^{\ell-1,\ell}</math> for connection weights from layer <math>\ell-1</math> to <math>\ell</math>, and a weight matrix <math>W^{\ell-2,\ell}</math> for connection weights from layer <math>\ell-2</math> to <math>\ell</math>, then the forward propagation through the activation function would be (aka ''HighwayNets'')

: <math>a^\ell := g\left(W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}\right) = g\left(Z^\ell + W^{\ell-2,\ell} \cdot a^{\ell-2}\right)</math>

where

: <math>a^\ell</math> the activations (outputs) of neurons in layer <math>\ell</math>,
: <math>g</math> the activation function for layer <math>\ell</math>,
: <math>W^{\ell-1,\ell}</math> the weight matrix for neurons between layer <math>\ell-1</math> and <math>\ell</math>, and
: <math>Z^\ell := W^{\ell-1,\ell} \cdot a^{\ell-1} + b^\ell</math>.
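As a concrete illustration of this forward step, the following sketch evaluates the formula with small random matrices and <math>g = \mathrm{ReLU}</math>; the shapes and names are illustrative only:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 4                                    # illustrative layer width (all layers equal-sized here)
a_prev = rng.standard_normal(n)          # a^{l-1}
a_skip = rng.standard_normal(n)          # a^{l-2}
W_main = rng.standard_normal((n, n))     # W^{l-1,l}
W_skip = rng.standard_normal((n, n))     # W^{l-2,l}
b = np.zeros(n)                          # b^l

g = lambda z: np.maximum(z, 0.0)         # activation function g (ReLU)
Z = W_main @ a_prev + b                  # Z^l = W^{l-1,l} a^{l-1} + b^l
a = g(Z + W_skip @ a_skip)               # a^l = g(Z^l + W^{l-2,l} a^{l-2})
</syntaxhighlight>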
If the number of vertices on layer <math>\ell-2</math> equals the number of vertices on layer <math>\ell</math> and if <math>W^{\ell-2,\ell}</math> is the identity matrix, then forward propagation through the activation function simplifies to

: <math>a^\ell := g\left(Z^\ell + a^{\ell-2}\right).</math>

In this case, the connection between layers <math>\ell-2</math> and <math>\ell</math> is called an ''identity block''.
In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as (aka ''DenseNets'')

: <math>a^\ell := g\left(Z^\ell + \sum_{k=2}^{K} W^{\ell-k,\ell} \cdot a^{\ell-k}\right).</math>
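A corresponding sketch of this general case, summing the contributions of the <math>K</math> skip connections; the names are again illustrative, and identity matrices recover pure identity skips:

<syntaxhighlight lang="python">
import numpy as np

def dense_skip_forward(a_prev, skip_pairs, W_main, b, g=lambda z: np.maximum(z, 0.0)):
    """Forward step with several skips.
    a_prev     -- activations a^{l-1} of the previous layer
    skip_pairs -- list of (W^{l-k,l}, a^{l-k}) pairs for k = 2..K
    """
    Z = W_main @ a_prev + b                           # Z^l
    skip_sum = sum(W @ a for W, a in skip_pairs)      # sum_k W^{l-k,l} a^{l-k}
    return g(Z + skip_sum)                            # a^l
</syntaxhighlight>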
Backward propagation
During backpropagation learning for the normal path

: <math>W^{\ell-1,\ell} := W^{\ell-1,\ell} - \eta\, \frac{\partial E}{\partial W^{\ell-1,\ell}} = W^{\ell-1,\ell} - \eta\, \delta^\ell \left(a^{\ell-1}\right)^{\mathsf T}</math>

and for the skip paths (nearly identical)

: <math>W^{\ell-2,\ell} := W^{\ell-2,\ell} - \eta\, \frac{\partial E}{\partial W^{\ell-2,\ell}} = W^{\ell-2,\ell} - \eta\, \delta^\ell \left(a^{\ell-2}\right)^{\mathsf T}.</math>
In both cases

: <math>\eta</math> a learning rate (<math>\eta > 0</math>),
: <math>\delta^\ell</math> the error signal of neurons at layer <math>\ell</math>, and
: <math>a^\ell</math> the activation of neurons at layer <math>\ell</math>.
If the skip path has fixed weights (e.g. the identity matrix, as above), then they are not updated. If they can be updated, the rule is an ordinary backpropagation update rule.
In the general case there can be <math>K</math> skip path weight matrices, thus

: <math>W^{\ell-k,\ell} := W^{\ell-k,\ell} - \eta\, \frac{\partial E}{\partial W^{\ell-k,\ell}}, \qquad k = 2, \dots, K.</math>
As the learning rules are similar, the weight matrices can be merged and learned in the same step.
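A minimal sketch of one such update step, assuming the error signal <math>\delta^\ell</math> of layer <math>\ell</math> has already been computed by ordinary backpropagation; the function and variable names are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def sgd_step(W_main, W_skip, a_prev, a_skip, delta, eta=0.01, skip_trainable=True):
    """One gradient step for the normal path and, if trainable, for the skip path.
    delta  -- error signal of layer l
    a_prev -- activations a^{l-1};  a_skip -- activations a^{l-2}
    """
    W_main = W_main - eta * np.outer(delta, a_prev)        # normal-path update
    if skip_trainable:                                     # fixed (e.g. identity) skip weights stay untouched
        W_skip = W_skip - eta * np.outer(delta, a_skip)    # skip-path update, same rule
    return W_main, W_skip
</syntaxhighlight>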