
In machine learning, a neural network (also artificial neural network or neural net, abbreviated ANN or NN) is a model inspired by the structure and function of biological neural networks in animal brains.
An ANN consists of connected units or nodes called ''artificial neurons'', which loosely model the neurons in a brain. These are connected by ''edges'', which model the synapses in a brain. Each artificial neuron receives signals from connected neurons, then processes them and sends a signal to other connected neurons. The "signal" is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs, called the ''activation function''. The strength of the signal at each connection is determined by a ''weight'', which adjusts during the learning process.
Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the ''input layer'') to the last layer (the ''output layer''), possibly passing through multiple intermediate layers (''
hidden layers''). A network is typically called a deep neural network if it has at least 2 hidden layers.
Artificial neural networks are used for various tasks, including
predictive modeling,
adaptive control, and solving problems in
artificial intelligence. They can learn from experience, and can derive conclusions from a complex and seemingly unrelated set of information.
Training
Neural networks are typically trained through empirical risk minimization. This method is based on the idea of optimizing the network's parameters to minimize the difference, or empirical risk, between the predicted output and the actual target values in a given dataset.
Gradient-based methods such as backpropagation are usually used to estimate the parameters of the network.
During the training phase, ANNs learn from labeled training data by iteratively updating their parameters to minimize a defined loss function. This method allows the network to generalize to unseen data.
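As an illustrative sketch (not any specific historical system), the following Python/NumPy program trains a tiny one-hidden-layer network on the XOR task by gradient descent on a squared-error loss; the layer sizes, learning rate, and iteration count are arbitrary choices.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets (labeled data)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # hidden -> output
lr = 0.5                                         # learning rate

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)                     # hidden layer activations
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))       # sigmoid output layer
    err = out - y                                # derivative of squared error
    # Backpropagate the error through both layers (chain rule).
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
</syntaxhighlight>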
History
Historically, digital computers evolved from the von Neumann model, and operate via the execution of explicit instructions with access to memory by a number of processors. Neural networks, on the other hand, originated from efforts to model information processing in biological systems through the framework of connectionism. Unlike the von Neumann model, connectionist computing does not separate memory and processing.
The simplest kind of feedforward neural network (FNN) is a linear network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated at each node. The mean squared errors between these calculated outputs and the given target values are minimized by adjusting the weights. This technique has been known for over two centuries as the method of least squares or linear regression. It was used as a means of finding a good rough linear fit to a set of points by Legendre (1805) and Gauss (1795) for the prediction of planetary movement.
[Mansfield Merriman, "A List of Writings Relating to the Method of Least Squares."]
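A minimal sketch of such a linear fit, using NumPy's least-squares solver on made-up data points:
<syntaxhighlight lang="python">
import numpy as np

# A linear network fitted in closed form by the method of least squares,
# as in linear regression; np.linalg.lstsq minimizes the sum of squared errors.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

A = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)  # slope and intercept of the best linear fit
</syntaxhighlight>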
Warren McCulloch and Walter Pitts (1943) also considered a non-learning computational model for neural networks.
In the late 1940s, D. O. Hebb created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Hebbian learning is considered a 'typical' unsupervised learning rule, and its later variants were early models for long-term potentiation. These ideas started being applied to computational models in 1948 with Turing's "unorganized machines". Farley and Wesley A. Clark were the first to simulate a Hebbian network, in 1954 at MIT, using computational machines then called "calculators". Other neural network computational machines were created by Rochester, Holland, Habit, and Duda in 1956. In 1958, psychologist Frank Rosenblatt invented the perceptron, the first implemented artificial neural network, funded by the United States Office of Naval Research.
The invention of the perceptron raised public excitement for research in artificial neural networks, causing the US government to drastically increase funding for neural network research. This led to "the golden age of AI", fueled by optimistic claims made by computer scientists regarding the ability of perceptrons to emulate human intelligence; for example, in 1957 Herbert Simon famously predicted that machines would soon rival human intelligence. However, research stagnated in the United States following the work of Minsky and Papert (1969), who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to train useful neural networks. These findings, along with other factors such as the 1973 Lighthill report by James Lighthill stating that research in artificial intelligence had not "produced the major impact that was then promised", led to funding cuts that ended AI research at all but two universities in the UK and at many major institutions across the world. This ushered in an era called the AI winter, with reduced research into connectionism due to a decrease in government funding and an increased stress on symbolic artificial intelligence in the United States and other Western countries.
During the AI winter era, however, research outside the United States continued, especially in Eastern Europe. By the time Minsky and Papert's book on perceptrons came out, methods for training multilayer perceptrons (MLPs) were already known. The first deep learning MLP was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa in 1965, as the Group Method of Data Handling. The first deep learning MLP trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned useful internal representations to classify non-linearly separable pattern classes.
Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982. SOMs are neurophysiologically inspired neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning, in which nodes compete for the right to respond to a subset of the input data.
The convolutional neural network (CNN) architecture with convolutional layers and downsampling layers was introduced by Kunihiko Fukushima in 1980. He called it the neocognitron. In 1969, he also introduced the ReLU (rectified linear unit) activation function. The rectifier has become the most popular activation function for CNNs and deep neural networks in general. CNNs have become an essential tool for computer vision.
A key advance in later artificial neural network research was the backpropagation algorithm, an efficient application of the Leibniz chain rule (1673) to networks of differentiable nodes. It is also known as the reverse mode of automatic differentiation, or reverse accumulation, due to Seppo Linnainmaa (1970). The term "back-propagating errors" was introduced in 1962 by Frank Rosenblatt, but he did not have an implementation of this procedure, although Henry J. Kelley and Bryson had dynamic-programming-based continuous precursors of backpropagation already in 1960–61 in the context of control theory.
In 1973, Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients.
In 1982, Paul Werbos applied backpropagation to MLPs in the way that has become standard. In 1986, Rumelhart, Hinton and Williams showed that backpropagation learned interesting internal representations of words as feature vectors when trained to predict the next word in a sequence.[David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams, "Learning representations by back-propagating errors," ''Nature'', 323, pages 533–536, 1986.]
In the late 1970s to early 1980s, interest briefly emerged in theoretically investigating the Ising model created by Wilhelm Lenz (1920) and Ernst Ising (1925). The Ising model is essentially a non-learning artificial recurrent neural network (RNN) consisting of neuron-like threshold elements.
In 1972, Shun'ichi Amari described an adaptive version of this architecture. In 1981, the Ising model was solved exactly by Peter Barth for the general case of closed Cayley trees (with loops) with an arbitrary branching ratio, and found to exhibit unusual phase transition behavior in its local-apex and long-range site-site correlations. John Hopfield popularised this architecture in 1982, and it is now known as a Hopfield network.
The time delay neural network (TDNN) of Alex Waibel (1987) combined convolutions, weight sharing, and backpropagation.[Alexander Waibel et al., "Phoneme Recognition Using Time-Delay Neural Networks," ''IEEE Transactions on Acoustics, Speech, and Signal Processing'', vol. 37, no. 3, pp. 328–339, March 1989.] In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. In 1989, Yann LeCun et al. trained a CNN to recognize handwritten ZIP codes on mail.[LeCun ''et al.'', "Backpropagation Applied to Handwritten Zip Code Recognition," ''Neural Computation'', 1, pp. 541–551, 1989.]
In 1992, max-pooling for CNNs was introduced by Juyang Weng et al. to help with least-shift invariance and tolerance to deformation to aid 3D object recognition.[J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively," ''Proc. International Joint Conference on Neural Networks'', Baltimore, Maryland, vol. I, pp. 576–581, June 1992.][J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images," ''Proc. 4th International Conf. Computer Vision'', Berlin, Germany, pp. 121–128, May 1993.][J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron," ''International Journal of Computer Vision'', vol. 25, no. 2, pp. 105–139, Nov. 1997.]
LeNet-5 (1998), a 7-level CNN by Yann LeCun et al., that classifies digits, was applied by several banks to recognize hand-written numbers on checks digitized in 32x32 pixel images.
From 1988 onward,[Qian, Ning, and Terrence J. Sejnowski. "Predicting the secondary structure of globular proteins using neural network models." ''Journal of Molecular Biology'' 202, no. 4 (1988): 865–884.][Bohr, Henrik, Jakob Bohr, Søren Brunak, Rodney M. J. Cotterill, Benny Lautrup, Leif Nørskov, Ole H. Olsen, and Steffen B. Petersen. "Protein secondary structure and homology by neural networks: The α-helices in rhodopsin." ''FEBS Letters'' 241 (1988): 223–228.] the use of neural networks transformed the field of protein structure prediction, in particular when the first cascading networks were trained on ''profiles'' (matrices) produced by multiple sequence alignments.[Rost, Burkhard, and Chris Sander. "Prediction of protein secondary structure at better than 70% accuracy." ''Journal of Molecular Biology'' 232, no. 2 (1993): 584–599.]
In 1991, Sepp Hochreiter's diploma thesis[S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen," ''Diploma thesis'', Institut f. Informatik, Technische Univ. Munich, Advisor: J. Schmidhuber, 1991.] identified and analyzed the vanishing gradient problem and proposed recurrent residual connections to solve it. His thesis was called "one of the most important documents in the history of machine learning" by his supervisor Juergen Schmidhuber.
In 1991, Juergen Schmidhuber published adversarial neural networks that contest with each other in the form of a zero-sum game, where one network's gain is the other network's loss. The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity".
In 1992, Juergen Schmidhuber proposed a hierarchy of RNNs pre-trained one level at a time by self-supervised learning. It uses predictive coding to learn internal representations at multiple self-organizing time scales, which can substantially facilitate downstream deep learning. The RNN hierarchy can be ''collapsed'' into a single RNN by distilling a higher-level ''chunker'' network into a lower-level ''automatizer'' network. In the same year he also published an ''alternative to RNNs'' which is a precursor of a ''linear Transformer''. It introduced the concept of ''internal spotlights of attention'': a slow feedforward neural network learns by gradient descent to control the fast weights of another neural network through outer products of self-generated activation patterns.
The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), in the form of complementary MOS (CMOS) technology, enabled increasing MOS transistor counts in digital electronics. This provided more processing power for the development of practical artificial neural networks in the 1980s.
Early successes of neural networks included a (mostly) self-driving car in 1995.
In 1997, Sepp Hochreiter and Juergen Schmidhuber introduced the deep learning method called long short-term memory (LSTM), published in ''Neural Computation''. LSTM recurrent neural networks can learn "very deep learning" tasks with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The "vanilla LSTM" with forget gate was introduced in 1999 by Felix Gers, Schmidhuber, and Fred Cummins.
Geoffrey Hinton et al. (2006) proposed learning a high-level representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine to model each layer. In 2012, Ng and Dean created a network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images. Unsupervised pre-training and increased computing power from GPUs and distributed computing allowed the use of larger networks, particularly in image and visual recognition problems, which became known as "deep learning".
Variants of the backpropagation algorithm, as well as unsupervised methods by Geoff Hinton and colleagues at the University of Toronto, can be used to train deep, highly nonlinear neural architectures, similar to the 1980 Neocognitron by Kunihiko Fukushima and the "standard architecture of vision", inspired by the simple and complex cells identified by David H. Hubel and Torsten Wiesel in the primary visual cortex.
Computational devices have been created in CMOS for both biophysical simulation and neuromorphic computing. More recent efforts show promise for creating nanodevices for very large scale principal components analyses and convolution. If successful, these efforts could usher in a new era of neural computing that is a step beyond digital computing, because it depends on learning rather than programming and because it is fundamentally analog rather than digital, even though the first instantiations may in fact be with CMOS digital devices.
Ciresan and colleagues (2010) showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.[Dominik Scherer, Andreas C. Müller, and Sven Behnke: "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition," ''In 20th International Conference Artificial Neural Networks (ICANN)'', pp. 92–101, 2010.] Between 2009 and 2012, ANNs began winning prizes in image recognition contests, approaching human-level performance on various tasks, initially in pattern recognition and handwriting recognition. For example, the bi-directional and multi-dimensional long short-term memory (LSTM) of Graves et al. won three competitions in connected handwriting recognition in 2009 without any prior knowledge about the three languages to be learned.
Ciresan and colleagues built the first pattern recognizers to achieve human-competitive/superhuman performance on benchmarks such as traffic sign recognition (IJCNN 2012).
Radial basis function and wavelet networks were introduced in 2013. These can be shown to offer best approximation properties and have been applied in nonlinear system identification and classification applications.
In 2014, the adversarial network principle was used in a generative adversarial network (GAN) by Ian Goodfellow et al. Here the adversarial network (discriminator) outputs a value between 0 and 1 depending on the likelihood that the first network's (generator's) output is in a given set. This can be used to create realistic deepfakes.
Excellent image quality is achieved by Nvidia's StyleGAN (2018), based on the Progressive GAN by Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Here the GAN generator is grown from small to large scale in a pyramidal fashion.
In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used the LSTM principle to create the Highway network, a feedforward neural network with hundreds of layers, much deeper than previous networks. Seven months later, Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun won the ImageNet 2015 competition with an open-gated or gateless Highway network variant called the residual neural network (ResNet).
In 2017, Ashish Vaswani et al. introduced the modern Transformer architecture in their paper "Attention Is All You Need". It is based on the attention mechanism, combining it with a softmax operator and projection matrices.
Transformers have increasingly become the model of choice for natural language processing. Many modern large language models such as ChatGPT, GPT-4, and BERT use this architecture. Transformers are also increasingly being used in computer vision.
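A minimal sketch of the scaled dot-product attention at the heart of the Transformer, in NumPy; the token count, model width, and random projection matrices are illustrative assumptions:
<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Projection matrices produce queries, keys, and values; a softmax over
    # scaled query-key similarities weights the values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, model width 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(attention(X, Wq, Wk, Wv).shape)              # (5, 8)
</syntaxhighlight>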
Ramezanpour et al. showed in 2020 that analytical and computational techniques derived from the statistical physics of disordered systems can be extended to large-scale problems, including machine learning, e.g., to analyze the weight space of deep neural networks.
Models
ANNs began as an attempt to exploit the architecture of the human brain to perform tasks that conventional algorithms had little success with. They soon reoriented towards improving empirical results, abandoning attempts to remain true to their biological precursors. ANNs have the ability to learn and model non-linearities and complex relationships. This is achieved by neurons being connected in various patterns, allowing the output of some neurons to become the input of others. The network forms a directed, weighted graph.
An artificial neural network consists of simulated neurons. Each neuron is connected to other nodes via links, like a biological axon–synapse–dendrite connection. All the nodes connected by links take in some data and use it to perform specific operations and tasks on the data. Each link has a weight that determines the strength of one node's influence on another, modulating the signal passed between neurons.
Artificial neurons
ANNs are composed of artificial neurons which are conceptually derived from biological neurons. Each artificial neuron has inputs and produces a single output which can be sent to multiple other neurons. The inputs can be the feature values of a sample of external data, such as images or documents, or they can be the outputs of other neurons. The outputs of the final ''output neurons'' of the neural net accomplish the task, such as recognizing an object in an image.
To find the output of a neuron, we take the weighted sum of all its inputs, weighted by the ''weights'' of the ''connections'' from the inputs to the neuron, and add a ''bias'' term to this sum. This weighted sum is sometimes called the ''activation''. It is then passed through a (usually nonlinear) activation function to produce the output. The initial inputs are external data, such as images and documents. The ultimate outputs accomplish the task, such as recognizing an object in an image.
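A minimal sketch of this computation for a single neuron, assuming a sigmoid activation function and made-up input, weight, and bias values:
<syntaxhighlight lang="python">
import numpy as np

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus bias (the "activation"),
    # passed through a nonlinear activation function (here a sigmoid).
    activation = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-activation))

print(neuron_output(np.array([0.5, -1.0, 2.0]),
                    np.array([0.1, 0.4, -0.2]),
                    bias=0.3))
</syntaxhighlight>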
Organization
The neurons are typically organized into multiple layers, especially in deep learning. Neurons of one layer connect only to neurons of the immediately preceding and immediately following layers. The layer that receives external data is the ''input layer''. The layer that produces the ultimate result is the ''output layer''. In between them are zero or more ''hidden layers''. Single-layer and unlayered networks are also used. Between two layers, multiple connection patterns are possible. They can be ''fully connected'', with every neuron in one layer connecting to every neuron in the next layer. They can be ''pooling'', where a group of neurons in one layer connects to a single neuron in the next layer, thereby reducing the number of neurons in that layer. Neurons with only such connections form a directed acyclic graph and are known as ''feedforward networks''. Alternatively, networks that allow connections between neurons in the same or previous layers are known as ''recurrent networks''.
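A minimal sketch of a fully connected feedforward pass through such layers, with illustrative layer sizes and a tanh nonlinearity:
<syntaxhighlight lang="python">
import numpy as np

def forward(x, layers):
    # Feedforward pass: each layer applies its weights, bias, and a
    # nonlinearity to the previous layer's output.
    for W, b in layers:
        x = np.tanh(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # input layer (3) -> hidden (4)
    (rng.normal(size=(2, 4)), np.zeros(2)),  # hidden (4) -> output (2)
]
print(forward(np.array([0.1, 0.2, 0.3]), layers))
</syntaxhighlight>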
Hyperparameter
A hyperparameter is a constant parameter whose value is set before the learning process begins; the values of ordinary parameters are derived via learning. Examples of hyperparameters include the learning rate, the number of hidden layers, and the batch size. The values of some hyperparameters can depend on those of other hyperparameters. For example, the size of some layers can depend on the overall number of layers.
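A minimal sketch of this distinction, with illustrative names and values:
<syntaxhighlight lang="python">
# Hyperparameters are fixed before training; parameters (weights, biases)
# are learned. All names and values here are illustrative only.
hyperparams = {
    "learning_rate": 0.01,
    "num_hidden_layers": 2,
    "batch_size": 32,
}
# Some hyperparameters depend on others, e.g. one width per hidden layer:
hyperparams["layer_sizes"] = [64] * hyperparams["num_hidden_layers"]
</syntaxhighlight>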
Learning
Learning is the adaptation of the network to better handle a task by considering sample observations. Learning involves adjusting the weights (and optional thresholds) of the network to improve the accuracy of the result. This is done by minimizing the observed errors. Learning is complete when examining additional observations does not usefully reduce the error rate. Even after learning, the error rate typically does not reach 0. If after learning the error rate is too high, the network typically must be redesigned. Practically, this is done by defining a cost function that is evaluated periodically during learning; as long as its output continues to decline, learning continues. The cost is frequently defined as a statistic whose value can only be approximated. The outputs are actually numbers, so when the error is low, the difference between the output (almost certainly a cat) and the correct answer (cat) is small. Learning attempts to reduce the total of the differences across the observations. Most learning models can be viewed as a straightforward application of optimization theory and statistical estimation.
Learning rate
The learning rate defines the size of the corrective steps that the model takes to adjust for errors in each observation. A high learning rate shortens the training time but with lower ultimate accuracy, while a lower learning rate takes longer but with the potential for greater accuracy. Optimizations such as Quickprop are primarily aimed at speeding up error minimization, while other improvements mainly try to increase reliability. To avoid oscillation inside the network, such as alternating connection weights, and to improve the rate of convergence, refinements use an adaptive learning rate that increases or decreases as appropriate. The concept of momentum allows the balance between the gradient and the previous change to be weighted such that the weight adjustment depends to some degree on the previous change. A momentum close to 0 emphasizes the gradient, while a value close to 1 emphasizes the last change.
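A minimal sketch of a momentum update, with an illustrative quadratic objective:
<syntaxhighlight lang="python">
def sgd_momentum(w, grad_fn, lr=0.01, momentum=0.9, steps=100):
    # The velocity blends the previous change with the current gradient:
    # momentum near 0 emphasizes the gradient, near 1 the last change.
    v = 0.0
    for _ in range(steps):
        v = momentum * v - lr * grad_fn(w)
        w = w + v
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
print(sgd_momentum(0.0, lambda w: 2 * (w - 3)))  # approaches 3.0
</syntaxhighlight>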
Cost function
While it is possible to define a cost function ad hoc, frequently the choice is determined by the function's desirable properties (such as convexity) or because it arises from the model (e.g., in a probabilistic model the model's posterior probability can be used as an inverse cost).
Backpropagation
Backpropagation is a method used to adjust the connection weights to compensate for each error found during learning. The error amount is effectively divided among the connections. Technically, backprop calculates the gradient (the derivative) of the cost function associated with a given state with respect to the weights. The weight updates can be done via stochastic gradient descent or other methods, such as ''extreme learning machines'', "no-prop" networks, training without backtracking, "weightless" networks, and non-connectionist neural networks.
Learning paradigms
Machine learning is commonly separated into three main learning paradigms: supervised learning, unsupervised learning, and reinforcement learning. Each corresponds to a particular learning task.
Supervised learning
Supervised learning uses a set of paired inputs and desired outputs. The learning task is to produce the desired output for each input. In this case, the cost function is related to eliminating incorrect deductions. A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output and the desired output. Tasks suited for supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). Supervised learning is also applicable to sequential data (e.g., for handwriting, speech and gesture recognition). This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
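A minimal sketch of the mean-squared-error cost, with made-up outputs and targets:
<syntaxhighlight lang="python">
import numpy as np

def mean_squared_error(y_pred, y_true):
    # Average squared difference between network output and desired output.
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

print(mean_squared_error([0.9, 0.2], [1.0, 0.0]))  # 0.025
</syntaxhighlight>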
Unsupervised learning
In unsupervised learning, input data is given along with the cost function, some function of the data and the network's output. The cost function is dependent on the task (the model domain) and any ''a priori'' assumptions (the implicit properties of the model, its parameters and the observed variables). As a trivial example, consider the model f(x) = a, where a is a constant, and the cost