
Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains.
An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron receives signals, processes them, and can signal the neurons connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called ''edges''. Neurons and edges typically have a ''weight'' that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold.
Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.
Training
Neural networks learn (or are trained) by processing examples, each of which contains a known "input" and "result," forming probability-weighted associations between the two, which are stored within the data structure of the net itself. The training of a neural network from a given example is usually conducted by determining the difference between the processed output of the network (often a prediction) and a target output. This difference is the error. The network then adjusts its weighted associations according to a learning rule, using this error value. Successive adjustments cause the neural network to produce output that is increasingly similar to the target output. After a sufficient number of these adjustments, training can be terminated based on certain criteria. This is known as supervised learning.
Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any prior knowledge of cats, for example, that they have fur, tails, whiskers, and cat-like faces. Instead, they automatically generate identifying characteristics from the examples that they process.
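The error-driven adjustment described above can be sketched for a single linear unit. This is a minimal illustration, not a method taken from the article: the function name `train_step`, the learning rate of 0.1, and the two toy examples are all assumptions chosen for the sketch, and the learning rule used is the simple delta rule.

```python
def train_step(weights, bias, inputs, target, rate=0.1):
    """One delta-rule update for a single linear unit (illustrative only)."""
    # Processed output of the unit for this example.
    output = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The error is the difference between the target and the output.
    error = target - output
    # Adjust each weight in proportion to the error and its input.
    new_weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + rate * error
    return new_weights, new_bias, error


# Toy training set (assumed for the sketch): two inputs with known targets.
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]

weights, bias = [0.0, 0.0], 0.0
errors = []
for _ in range(50):  # successive adjustments over the same examples
    for inputs, target in examples:
        weights, bias, error = train_step(weights, bias, inputs, target)
        errors.append(abs(error))
```

The list `errors` records the absolute error seen before each update; in this toy setting it shrinks as the adjustments accumulate, which is the behaviour the paragraph describes.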
History
Warren McCulloch and Walter Pitts (1943) opened the subject by creating a computational model for neural networks. In the late 1940s, D. O. Hebb created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Farley and Wesley A. Clark (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. In 1958, psychologist Frank Rosenblatt invented the perceptron, the first artificial neural network, funded by the United States Office of Naval Research.
The first functional networks with many layers were published by Ivakhnenko and Lapa in 1965, as the Group Method of Data Handling.
The basics of continuous backpropagation were derived in the context of control theory by Kelley in 1960 and by Bryson in 1961, using principles of dynamic programming. Thereafter research stagnated following Minsky and Papert (1969), who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks.
In 1970, Seppo Linnainmaa published the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions. In 1973, Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients. Werbos's (1975) backpropagation algorithm enabled practical training of multi-layer networks. In 1982, he applied Linnainmaa's AD method to neural networks in the way that became widely used.
The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), in the form of complementary MOS (CMOS) technology, enabled increasing MOS transistor counts in digital electronics. This provided more processing power for the development of practical artificial neural networks in the 1980s.
In 1986, Rumelhart, Hinton and Williams showed that backpropagation learned interesting internal representations of words as feature vectors when trained to predict the next word in a sequence. [David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams, "Learning representations by back-propagating errors," ''Nature'', 323, pages 533–536, 1986.]
From 1988 onward, [Qian, Ning, and Terrence J. Sejnowski. "Predicting the secondary structure of globular proteins using neural network models." ''Journal of molecular biology'' 202, no. 4 (1988): 865-884.][Bohr, Henrik, Jakob Bohr, Søren Brunak, Rodney MJ Cotterill, Benny Lautrup, Leif Nørskov, Ole H. Olsen, and Steffen B. Petersen. "Protein secondary structure and homology by neural networks The α-helices in rhodopsin." ''FEBS letters'' 241, (1988): 223-228] the use of neural networks transformed the field of protein structure prediction, in particular when the first cascading networks were trained on ''profiles'' (matrices) produced by multiple sequence alignments. [Rost, Burkhard, and Chris Sander. "Prediction of protein secondary structure at better than 70% accuracy." ''Journal of molecular biology'' 232, no. 2 (1993): 584-599.]
In 1992, max-pooling was introduced to help with least-shift invariance and tolerance to deformation to aid 3D object recognition. [J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively," ''Proc. International Joint Conference on Neural Networks'', Baltimore, Maryland, vol I, pp. 576–581, June 1992.][J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images," ''Proc. 4th International Conf. Computer Vision'', Berlin, Germany, pp. 121–128, May 1993.][J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron," ''International Journal of Computer Vision'', vol. 25, no. 2, pp. 105–139, Nov. 1997.] Schmidhuber adopted a multi-level hierarchy of networks (1992) pre-trained one level at a time by unsupervised learning and fine-tuned by backpropagation. [J. Schmidhuber, "Learning complex, extended sequences using the principle of history compression," ''Neural Computation'', 4, pp. 234–242, 1992.]
Neural networks' early successes included predicting the stock market and in 1995 a (mostly) self-driving car.
Geoffrey Hinton et al. (2006) proposed learning a high-level representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine to model each layer. In 2012, Ng and Dean created a network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images. Unsupervised pre-training and increased computing power from GPUs and distributed computing allowed the use of larger networks, particularly in image and visual recognition problems, which became known as "deep learning".
Ciresan and colleagues (2010) showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks. [Dominik Scherer, Andreas C. Müller, and Sven Behnke: "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition," ''In 20th International Conference Artificial Neural Networks (ICANN)'', pp. 92–101, 2010.] Between 2009 and 2012, ANNs began winning prizes in image recognition contests, approaching human level performance on various tasks, initially in pattern recognition and handwriting recognition. For example, the bi-directional and multi-dimensional long short-term memory (LSTM) of Graves et al. won three competitions in connected handwriting recognition in 2009 without any prior knowledge about the three languages to be learned. Ciresan and colleagues built the first pattern recognizers to achieve human-competitive/superhuman performance on benchmarks such as traffic sign recognition (IJCNN 2012).
Models

ANNs began as an attempt to exploit the architecture of the human brain to perform tasks that conventional algorithms had little success with. They soon reoriented towards improving empirical results, mostly abandoning attempts to remain true to their biological precursors. Neurons are connected to each other in various patterns, to allow the output of some neurons to become the input of others. The network forms a directed, weighted graph.
An artificial neural network consists of a collection of simulated neurons. Each neuron is a node which is connected to other nodes via links that correspond to biological axon-synapse-dendrite connections. Each link has a weight, which determines the strength of one node's influence on another.
Artificial neurons
ANNs are composed of artificial neurons which are conceptually derived from biological neurons. Each artificial neuron has inputs and produces a single output which can be sent to multiple other neurons.
The inputs can be the feature values of a sample of external data, such as images or documents, or they can be the outputs of other neurons. The outputs of the final ''output neurons'' of the neural net accomplish the task, such as recognizing an object in an image.
To find the output of the neuron, we first take the sum of all the inputs, each weighted by the ''weight'' of the ''connection'' from that input to the neuron. We add a ''bias'' term to this sum. This weighted sum is sometimes called the ''activation''. It is then passed through a (usually nonlinear) activation function to produce the output. The initial inputs are external data, such as images and documents. The ultimate outputs accomplish the task, such as recognizing an object in an image.
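The computation just described (weighted sum, plus bias, passed through an activation function) can be written out directly. The sketch below is only an illustration: the names `neuron_output` and `sigmoid`, and the choice of the logistic sigmoid as the activation function, are assumptions made for the example rather than anything prescribed by the text.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, one common nonlinear activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Output of one artificial neuron for a list of input values."""
    # Weighted sum of the inputs plus the bias term (the "activation").
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Pass the weighted sum through the activation function.
    return activation(z)


# Example values (assumed): the weighted sum is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0.
example = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)
```

Any other function of one real argument can be passed as `activation`, for instance `math.tanh`.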
Organization
The neurons are typically organized into multiple layers, especially in deep learning. Neurons of one layer connect only to neurons of the immediately preceding and immediately following layers. The layer that receives external data is the ''input layer''. The layer that produces the ultimate result is the ''output layer''. In between them are zero or more ''hidden layers''. Single layer and unlayered networks are also used. Between two layers, multiple connection patterns are possible. They can be ''fully connected'', with every neuron in one layer connecting to every neuron in the next layer. They can be ''pooling'', where a group of neurons in one layer connect to a single neuron in the next layer, thereby reducing the number of neurons in that layer. Neurons with only such connections form a directed acyclic graph and are known as ''feedforward networks''. Alternatively, networks that allow connections between neurons in the same or previous layers are known as ''recurrent networks''.
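To make the two connection patterns concrete, here is a small feedforward pass that chains a fully connected layer, a pooling step and a second fully connected layer. It is a sketch under stated assumptions: the helper names `dense`, `max_pool`, `relu` and `identity`, the use of max-pooling over non-overlapping groups, and all numeric values are illustrative choices, not taken from the article.

```python
def relu(z):
    """Rectified linear unit, used here as the hidden activation."""
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """Fully connected layer: every input feeds every neuron in the layer."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def max_pool(values, size):
    """Pooling: each non-overlapping group of `size` neurons feeds one neuron."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


# Input layer: two external values (assumed for the example).
inputs = [1.0, -2.0]

# Hidden layer: four neurons, fully connected to the two inputs.
hidden = dense(
    inputs,
    weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]],
    biases=[0.0, 0.0, 0.0, 0.0],
    activation=relu,
)

# Pooling reduces the four hidden neurons to two.
pooled = max_pool(hidden, size=2)

# Output layer: one neuron, fully connected to the pooled values.
output = dense(pooled, weights=[[0.5, -1.0]], biases=[0.25], activation=identity)
```

Signals flow strictly from `inputs` to `output` with no connection pointing backwards, so this is a feedforward network in the sense used above.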
Hyperparameter
A hyperparameter is a constant parameter whose value is set before the learning process begins. The values of ordinary parameters, by contrast, are derived via learning. Examples of hyperparameters include learning rate, the number of hidden layers and batch size. The values of some hyperparameters can be dependent on those of other hyperparameters. For example, the size of some layers can depend on the overall number of layers.
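One way to picture hyperparameters is as a fixed configuration object that is created before training and never updated by it. The sketch below is purely illustrative: the class name `Hyperparameters`, its default values, and in particular the rule that each hidden layer is half as wide as the previous one are hypothetical choices, included only to show one hyperparameter depending on another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: values are fixed before learning begins
class Hyperparameters:
    learning_rate: float = 0.01
    batch_size: int = 32
    num_hidden_layers: int = 3
    first_hidden_size: int = 64

    def hidden_sizes(self):
        """Layer widths derived from the number of layers.

        Hypothetical rule for illustration: halve the width at each layer,
        never going below one neuron.
        """
        return [
            max(1, self.first_hidden_size // 2 ** i)
            for i in range(self.num_hidden_layers)
        ]


config = Hyperparameters()
sizes = config.hidden_sizes()
```

The weights and biases of the network would live elsewhere and change during training, while `config` stays constant.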
Learning
Learning is the adaptation of the network to better handle a task by considering sample observations. Learning involves adjusting the weights (and optional thresholds) of the network to improve the accuracy of the result. This is done by minimizing the observed errors. Learning is complete when examining additional observations does not usefully reduce the error rate. Even after learning, the error rate typically does not reach 0. If after learning, the error rate is too high, the network typically must be redesigned. Practically this is done by defining a cost function that is evaluated periodically during learning. As long as its output continues to decline, learning continues. The cost is frequently defined as a statistic whose value can only be approximated. The outputs are actually numbers, so when the error is low, the difference between the output (almost certainly a cat) and the correct answer (cat) is small. Learning attempts to reduce the total of the differences across the observations. Most learning models can be viewed as a straightforward application of optimization theory and statistical estimation.
Learning rate
The learning rate defines the size of the corrective steps that the model takes to adjust for errors in each observation. A high learning rate shortens the training time, but with lower ultimate accuracy, while a lower learning rate takes longer, but with the potential for greater accuracy. Optimizations such as Quickprop are primarily aimed at speeding up error minimization, while other improvements mainly try to increase reliability. In order to avoid oscillation inside the network such as alternating connection weights, and to improve the rate of convergence, refinements use an adaptive learning rate that increases or decreases as appropriate. The concept of momentum allows the balance between the gradient and the previous change to be weighted such that the weight adjustment depends to some degree on the previous change. A momentum close to 0 emphasizes the gradient, while a value close to 1 emphasizes the last change.
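The interplay of learning rate and momentum can be shown with a single weight. The sketch assumes the classical momentum form, in which the new change is the negative gradient scaled by the learning rate plus the previous change scaled by the momentum; other formulations exist. The function name `momentum_step`, the default values, and the toy cost w² (with gradient 2w) are assumptions made for the illustration.

```python
def momentum_step(weight, gradient, previous_change,
                  learning_rate=0.1, momentum=0.9):
    """One weight update that blends the gradient with the previous change."""
    # momentum = 0 uses only the gradient; values near 1 mostly repeat
    # the previous change.
    change = -learning_rate * gradient + momentum * previous_change
    return weight + change, change


# Toy problem (assumed): minimise cost(w) = w**2, whose gradient is 2*w.
weight, change = 1.0, 0.0
for _ in range(200):
    weight, change = momentum_step(weight, 2.0 * weight, change)
```

With `momentum=0.0` the update reduces to plain gradient descent, which makes it easy to compare the two behaviours on the same problem.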
Cost function
While it is possible to define a cost function ad hoc, frequently the choice is determined by the function's desirable properties (such as convexity) or because it arises from the model (e.g. in a probabilistic model the model's posterior probability can be used as an inverse cost).
Backpropagation
Backpropagation is a method used to adjust the connection weights to compensate for each error found during learning. The error amount is effectively divided among the connections. Technically, backprop calculates the gradient (the derivative) of the cost function associated with a given state with respect to the weights. The weight updates can be done via stochastic gradient descent or other methods, such as Extreme Learning Machines, "No-prop" networks, training without backtracking, "weightless" networks, and non-connectionist neural networks.
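As an illustration of how the gradient is obtained and then used for a weight update, the sketch below applies the chain rule to a very small network and takes one stochastic gradient descent step. Everything specific here is an assumption for the example: the architecture (two inputs, two tanh hidden neurons, one linear output), the squared-error cost, the function names, and the numeric values.

```python
import math


def forward(x, W1, b1, W2, b2):
    """Forward pass: tanh hidden layer followed by one linear output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sum(w * hi for w, hi in zip(W2, h)) + b2
    return h, y


def cost(y, t):
    """Squared-error cost for a single example."""
    return 0.5 * (y - t) ** 2


def backprop(x, t, W1, b1, W2, b2):
    """Gradient of the cost with respect to every weight and bias."""
    h, y = forward(x, W1, b1, W2, b2)
    dy = y - t                                   # d cost / d output
    dW2 = [dy * hi for hi in h]                  # output-layer weights
    db2 = dy
    # Error passed back through each hidden neuron; tanh'(z) = 1 - tanh(z)**2.
    dz = [dy * w * (1.0 - hi * hi) for w, hi in zip(W2, h)]
    dW1 = [[dzj * xi for xi in x] for dzj in dz]  # hidden-layer weights
    db1 = list(dz)
    return dW1, db1, dW2, db2


def sgd_step(x, t, W1, b1, W2, b2, rate=0.1):
    """One gradient descent update on a single example; returns new values."""
    dW1, db1, dW2, db2 = backprop(x, t, W1, b1, W2, b2)
    W1 = [[w - rate * g for w, g in zip(row, grow)]
          for row, grow in zip(W1, dW1)]
    b1 = [b - rate * g for b, g in zip(b1, db1)]
    W2 = [w - rate * g for w, g in zip(W2, dW2)]
    b2 = b2 - rate * db2
    return W1, b1, W2, b2


# One training example and initial weights (all values assumed).
x, t = [0.5, -1.0], 1.0
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.0]
W2, b2 = [0.5, -0.5], 0.0

cost_before = cost(forward(x, W1, b1, W2, b2)[1], t)
W1n, b1n, W2n, b2n = sgd_step(x, t, W1, b1, W2, b2)
cost_after = cost(forward(x, W1n, b1n, W2n, b2n)[1], t)
```

A common sanity check for code of this kind is to compare one entry of the computed gradient against a finite-difference estimate of the same derivative.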
Learning paradigms
Machine learning is commonly separated into three main learning paradigms: supervised learning, unsupervised learning and reinforcement learning. Each corresponds to a particular learning task.
Supervised learning
Supervised learning uses a set of paired inputs and desired outputs. The learning task is to produce the desired output for each input. In this case the cost function is related to eliminating incorrect deductions. A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output and the desired output. Tasks suited for supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). Supervised learning is also applicable to sequential data (e.g., for handwriting, speech and gesture recognition). This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
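The mean-squared error mentioned above is simple to compute from paired outputs and desired outputs. The helper below is a minimal sketch; the function name `mean_squared_error` and the sample values are assumptions for the illustration.

```python
def mean_squared_error(outputs, targets):
    """Average squared difference between network outputs and desired outputs."""
    if not outputs or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and equal in length")
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


# Sample values (assumed): only the third output differs from its target, by 2.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A value of zero means every output matches its target exactly; larger values indicate larger average disagreement.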
Unsupervised learning
In unsupervised learning, input data is given along with the cost function, some function of the data $x$ and the network's output. The cost function is dependent on the task (the model domain) and any ''a priori'' assumptions (the implicit properties of the model, its parameters and the observed variables). As a trivial example, consider the model $f(x) = a$, where $a$ is a constant and the cost