
Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains.
An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal to other neurons. An artificial neuron receives signals, processes them, and can signal neurons connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections are called ''edges''. Neurons and edges typically have a ''weight'' that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold.
Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.
Training
Neural networks learn (or are trained) by processing examples, each of which contains a known "input" and "result", forming probability-weighted associations between the two, which are stored within the data structure of the net itself. The training of a neural network from a given example is usually conducted by determining the difference between the processed output of the network (often a prediction) and a target output. This difference is the error. The network then adjusts its weighted associations according to a learning rule and using this error value. Successive adjustments will cause the neural network to produce output which is increasingly similar to the target output. After a sufficient number of these adjustments, the training can be terminated based upon certain criteria. This is known as supervised learning.
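As a minimal illustration of this loop, the sketch below trains a single linear unit with a simple delta-rule update; the data, update rule, and stopping criterion are stand-ins chosen for the example, not prescribed by the text.

    # Minimal sketch of error-driven training for a single linear unit.
    examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]  # (input, target) pairs
    weights = [0.0, 0.0]
    learning_rate = 0.1

    for epoch in range(1000):
        total_error = 0.0
        for inputs, target in examples:
            output = sum(w * x for w, x in zip(weights, inputs))
            error = target - output                     # difference from target
            total_error += error ** 2
            # Adjust the weighted associations using the error (delta rule).
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
        if total_error < 1e-6:                          # termination criterion
            break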
Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. They do this without any prior knowledge of cats, for example, that they have fur, tails, whiskers, and cat-like faces. Instead, they automatically generate identifying characteristics from the examples that they process.
History
Warren McCulloch and Walter Pitts (1943) opened the subject by creating a computational model for neural networks. In the late 1940s, D. O. Hebb created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Farley and Wesley A. Clark (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. In 1958, psychologist Frank Rosenblatt invented the perceptron, the first artificial neural network, funded by the United States Office of Naval Research.
The first functional networks with many layers were published by Ivakhnenko and Lapa in 1965, as the Group Method of Data Handling.
The basics of continuous backpropagation were derived in the context of control theory by Kelley in 1960 and by Bryson in 1961, using principles of dynamic programming. Thereafter research stagnated following Minsky and Papert (1969), who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks.
In 1970, Seppo Linnainmaa published the general method for automatic differentiation (AD) of discrete connected networks of nested differentiable functions.
In 1973, Dreyfus used backpropagation to adapt parameters of controllers in proportion to error gradients.
Werbos's (1975) backpropagation algorithm enabled practical training of multi-layer networks. In 1982, he applied Linnainmaa's AD method to neural networks in the way that became widely used.
The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), in the form of complementary MOS (CMOS) technology, enabled increasing MOS transistor counts in digital electronics. This provided more processing power for the development of practical artificial neural networks in the 1980s.
In 1986, Rumelhart, Hinton and Williams showed that backpropagation learned interesting internal representations of words as feature vectors when trained to predict the next word in a sequence. [David E. Rumelhart, Geoffrey E. Hinton & Ronald J. Williams, "Learning representations by back-propagating errors", ''Nature'', 323, pages 533–536, 1986.]
From 1988 onward, the use of neural networks transformed the field of protein structure prediction, in particular when the first cascading networks were trained on ''profiles'' (matrices) produced by multiple sequence alignments. [Qian, Ning, and Terrence J. Sejnowski. "Predicting the secondary structure of globular proteins using neural network models." ''Journal of Molecular Biology'' 202, no. 4 (1988): 865–884.] [Bohr, Henrik, Jakob Bohr, Søren Brunak, Rodney M. J. Cotterill, Benny Lautrup, Leif Nørskov, Ole H. Olsen, and Steffen B. Petersen. "Protein secondary structure and homology by neural networks: The α-helices in rhodopsin." ''FEBS Letters'' 241 (1988): 223–228.] [Rost, Burkhard, and Chris Sander. "Prediction of protein secondary structure at better than 70% accuracy." ''Journal of Molecular Biology'' 232, no. 2 (1993): 584–599.]
In 1992, max-pooling was introduced to help with least-shift invariance and tolerance to deformation to aid 3D object recognition. [J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively", ''Proc. International Joint Conference on Neural Networks'', Baltimore, Maryland, vol. I, pp. 576–581, June 1992.] [J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images", ''Proc. 4th International Conf. Computer Vision'', Berlin, Germany, pp. 121–128, May 1993.] [J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron", ''International Journal of Computer Vision'', vol. 25, no. 2, pp. 105–139, Nov. 1997.] Schmidhuber adopted a multi-level hierarchy of networks (1992), pre-trained one level at a time by unsupervised learning and fine-tuned by backpropagation. [J. Schmidhuber, "Learning complex, extended sequences using the principle of history compression", ''Neural Computation'', 4, pp. 234–242, 1992.]
Neural networks' early successes included predicting the stock market and, in 1995, a (mostly) self-driving car.
Geoffrey Hinton et al. (2006) proposed learning a high-level representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine to model each layer. In 2012, Ng and Dean created a network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images.
Unsupervised pre-training and increased computing power from GPUs and distributed computing allowed the use of larger networks, particularly in image and visual recognition problems, which became known as "deep learning".
Ciresan and colleagues (2010) showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks. [Dominik Scherer, Andreas C. Müller, and Sven Behnke, "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition", ''20th International Conference on Artificial Neural Networks (ICANN)'', pp. 92–101, 2010.] Between 2009 and 2012, ANNs began winning prizes in image recognition contests, approaching human-level performance on various tasks, initially in pattern recognition and handwriting recognition. For example, the bi-directional and multi-dimensional long short-term memory (LSTM) of Graves et al. won three competitions in connected handwriting recognition in 2009 without any prior knowledge about the three languages to be learned.
Ciresan and colleagues built the first pattern recognizers to achieve human-competitive or superhuman performance on benchmarks such as traffic sign recognition (IJCNN 2012).
Models

ANNs began as an attempt to exploit the architecture of the human brain to perform tasks that conventional algorithms had little success with. They soon reoriented towards improving empirical results, mostly abandoning attempts to remain true to their biological precursors. Neurons are connected to each other in various patterns, to allow the output of some neurons to become the input of others. The network forms a directed, weighted graph.
An artificial neural network consists of a collection of simulated neurons. Each neuron is a node which is connected to other nodes via links that correspond to biological axon-synapse-dendrite connections. Each link has a weight, which determines the strength of one node's influence on another.
Artificial neurons
ANNs are composed of artificial neurons which are conceptually derived from biological neurons. Each artificial neuron has inputs and produces a single output which can be sent to multiple other neurons.
The inputs can be the feature values of a sample of external data, such as images or documents, or they can be the outputs of other neurons. The outputs of the final ''output neurons'' of the neural net accomplish the task, such as recognizing an object in an image.
To find the output of a neuron, we take the weighted sum of all its inputs, weighted by the ''weights'' of the ''connections'' from the inputs to the neuron, and add a ''bias'' term to this sum. This weighted sum is sometimes called the ''activation''; it is then passed through a (usually nonlinear) activation function to produce the output.
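In code, this computation might look like the following minimal sketch (the sigmoid here is just one common choice of activation function):

    import math

    def neuron_output(inputs, weights, bias):
        # Weighted sum of the inputs plus the bias term: the "activation".
        activation = sum(w * x for w, x in zip(weights, inputs)) + bias
        # Pass it through a nonlinear activation function (a sigmoid here).
        return 1.0 / (1.0 + math.exp(-activation))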
Organization
The neurons are typically organized into multiple layers, especially in deep learning. Neurons of one layer connect only to neurons of the immediately preceding and immediately following layers. The layer that receives external data is the ''input layer''. The layer that produces the ultimate result is the ''output layer''. In between them are zero or more ''hidden layers''. Single layer and unlayered networks are also used. Between two layers, multiple connection patterns are possible. They can be 'fully connected', with every neuron in one layer connecting to every neuron in the next layer. They can be ''pooling'', where a group of neurons in one layer connect to a single neuron in the next layer, thereby reducing the number of neurons in that layer.
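For instance, a fully connected layer can be sketched as every neuron computing its output from all outputs of the preceding layer (illustrative code; each row of layer_weights holds one neuron's incoming weights):

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def fully_connected_layer(prev_outputs, layer_weights, layer_biases):
        # Every neuron in this layer receives every output of the previous layer.
        return [sigmoid(sum(w * x for w, x in zip(ws, prev_outputs)) + b)
                for ws, b in zip(layer_weights, layer_biases)]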
Neurons with only such connections form a directed acyclic graph and are known as ''feedforward networks''. Alternatively, networks that allow connections between neurons in the same or previous layers are known as ''recurrent networks''.
Hyperparameter
A hyperparameter is a constant parameter whose value is set before the learning process begins. The values of parameters are derived via learning. Examples of hyperparameters include learning rate, the number of hidden layers and batch size. The values of some hyperparameters can be dependent on those of other hyperparameters. For example, the size of some layers can depend on the overall number of layers.
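In practice such values are often gathered into a configuration that is fixed before training starts; the names and numbers below are purely illustrative:

    # Hypothetical hyperparameter configuration, set before learning begins.
    hyperparameters = {
        "learning_rate": 0.01,  # step size of each weight update
        "hidden_layers": 2,     # number of hidden layers
        "hidden_size": 64,      # neurons per hidden layer (may depend on depth)
        "batch_size": 32,       # examples processed per update
    }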
Learning
Learning is the adaptation of the network to better handle a task by considering sample observations. Learning involves adjusting the weights (and optional thresholds) of the network to improve the accuracy of the result. This is done by minimizing the observed errors. Learning is complete when examining additional observations does not usefully reduce the error rate. Even after learning, the error rate typically does not reach 0. If after learning the error rate is too high, the network typically must be redesigned. Practically, this is done by defining a cost function that is evaluated periodically during learning. As long as its output continues to decline, learning continues. The cost is frequently defined as a statistic whose value can only be approximated. The outputs are actually numbers, so when the error is low, the difference between the output (almost certainly a cat) and the correct answer (cat) is small. Learning attempts to reduce the total of the differences across the observations. Most learning models can be viewed as a straightforward application of optimization theory and statistical estimation.
Learning rate
The learning rate defines the size of the corrective steps that the model takes to adjust for errors in each observation. A high learning rate shortens the training time, but with lower ultimate accuracy, while a lower learning rate takes longer, but with the potential for greater accuracy. Optimizations such as Quickprop are primarily aimed at speeding up error minimization, while other improvements mainly try to increase reliability. In order to avoid oscillation inside the network, such as alternating connection weights, and to improve the rate of convergence, refinements use an adaptive learning rate that increases or decreases as appropriate. The concept of momentum allows the balance between the gradient and the previous change to be weighted such that the weight adjustment depends to some degree on the previous change. A momentum close to 0 emphasizes the gradient, while a value close to 1 emphasizes the last change.
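The momentum rule just described reduces to two lines; this is an illustrative sketch with generic names, not the API of any particular library:

    def momentum_step(weight, velocity, gradient, learning_rate=0.01, momentum=0.9):
        # Blend the previous change (velocity) with the current gradient step:
        # momentum near 0 follows the gradient, near 1 repeats the last change.
        velocity = momentum * velocity - learning_rate * gradient
        return weight + velocity, velocity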
Cost function
While it is possible to define a cost function ad hoc, frequently the choice is determined by the function's desirable properties (such as convexity) or because it arises from the model (e.g. in a probabilistic model the model's posterior probability can be used as an inverse cost).
Backpropagation
Backpropagation is a method used to adjust the connection weights to compensate for each error found during learning. The error amount is effectively divided among the connections. Technically, backprop calculates the gradient (the derivative) of the cost function associated with a given state with respect to the weights. The weight updates can be done via stochastic gradient descent or other methods, such as Extreme Learning Machines, "No-prop" networks, training without backtracking, "weightless" networks, and non-connectionist neural networks.
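For the simplest case, a single linear neuron with a squared-error cost, the gradient and a stochastic-gradient-descent update fit in a few lines (a deliberately minimal sketch; full backpropagation applies the chain rule through every layer):

    def sgd_step(weights, bias, inputs, target, learning_rate=0.1):
        output = sum(w * x for w, x in zip(weights, inputs)) + bias
        error = output - target
        # For the cost C = (output - target)**2 / 2, the gradient with respect
        # to weight w_i is error * x_i; step in the opposite direction.
        new_weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
        new_bias = bias - learning_rate * error
        return new_weights, new_bias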
Learning paradigms
Machine learning is commonly separated into three main learning paradigms: supervised learning, unsupervised learning and reinforcement learning. Each corresponds to a particular learning task.
Supervised learning
Supervised learning uses a set of paired inputs and desired outputs. The learning task is to produce the desired output for each input. In this case, the cost function is related to eliminating incorrect deductions. A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output and the desired output. Tasks suited for supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). Supervised learning is also applicable to sequential data (e.g., for handwriting, speech and gesture recognition). This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
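The mean-squared error itself is straightforward to state in code (a minimal sketch):

    def mean_squared_error(outputs, targets):
        # Average squared difference between the network's outputs and the
        # desired (target) outputs.
        return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)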
Unsupervised learning
In unsupervised learning, input data is given along with the cost function, some function of the data x and the network's output f(x). The cost function is dependent on the task (the model domain) and any ''a priori'' assumptions (the implicit properties of the model, its parameters and the observed variables). As a trivial example, consider the model f(x) = a where a is a constant and the cost C = E[(x − f(x))^2].
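Carrying the example to its conclusion (a standard one-line calculus step, added here for completeness): minimizing C = E[(x − a)^2] over the constant a gives

    \frac{dC}{da} = -2\,E[x - a] = 0 \quad\Rightarrow\quad a = E[x],

so the cost-minimizing constant model simply outputs the mean of the data.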