machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

, a neural network (also artificial neural network or neural net, abbreviated ANN or NN) is a computational model inspired by the structure and functions of biological neural networks. A neural network consists of connected units or nodes called '' artificial neurons'', which loosely model the

neuron A neuron (American English), neurone (British English), or nerve cell, is an membrane potential#Cell excitability, excitable cell (biology), cell that fires electric signals called action potentials across a neural network (biology), neural net ...

s in the brain. Artificial neuron models that mimic biological neurons more closely have also been recently investigated and shown to significantly improve performance. These are connected by ''edges'', which model the

synapse In the nervous system, a synapse is a structure that allows a neuron (or nerve cell) to pass an electrical or chemical signal to another neuron or a target effector cell. Synapses can be classified as either chemical or electrical, depending o ...

s in the brain. Each artificial neuron receives signals from connected neurons, then processes them and sends a signal to other connected neurons. The "signal" is a

real number In mathematics, a real number is a number that can be used to measure a continuous one- dimensional quantity such as a duration or temperature. Here, ''continuous'' means that pairs of values can have arbitrarily small differences. Every re ...

, and the output of each neuron is computed by some non-linear function of the sum of its inputs, called the '' activation function''. The strength of the signal at each connection is determined by a ''

weight In science and engineering, the weight of an object is a quantity associated with the gravitational force exerted on the object by other objects in its environment, although there is some variation and debate as to the exact definition. Some sta ...

'', which adjusts during the learning process. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (the ''input layer'') to the last layer (the ''output layer''), possibly passing through multiple intermediate layers ('' hidden layers''). A network is typically called a deep neural network if it has at least two hidden layers. Artificial neural networks are used for various tasks, including predictive modeling, adaptive control, and solving problems in

artificial intelligence Artificial intelligence (AI) is the capability of computer, computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of re ...

. They can learn from experience, and can derive conclusions from a complex and seemingly unrelated set of information.

Training

Neural networks are typically trained through

empirical risk minimization In statistical learning theory, the principle of empirical risk minimization defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large num ...

. This method is based on the idea of optimizing the network's parameters to minimize the difference, or empirical risk, between the predicted output and the actual target values in a given dataset. Gradient-based methods such as backpropagation are usually used to estimate the parameters of the network. During the training phase, ANNs learn from labeled training data by iteratively updating their parameters to minimize a defined loss function. This method allows the network to generalize to unseen data.

History

Early work

Today's deep neural networks are based on early work in

statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...

over 200 years ago. The simplest kind of

feedforward neural network Feedforward refers to recognition-inference architecture of neural networks. Artificial neural network architectures are based on inputs multiplied by weights to obtain outputs (inputs-to-output): feedforward. Recurrent neural networks, or neur ...

(FNN) is a linear network, which consists of a single layer of output nodes with linear activation functions; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated at each node. The

mean squared error In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference betwee ...

s between these calculated outputs and the given target values are minimized by creating an adjustment to the weights. This technique has been known for over two centuries as the method of least squares or

linear regression In statistics, linear regression is a statistical model, model that estimates the relationship between a Scalar (mathematics), scalar response (dependent variable) and one or more explanatory variables (regressor or independent variable). A mode ...

. It was used as a means of finding a good rough linear fit to a set of points by Legendre (1805) and

Gauss Johann Carl Friedrich Gauss (; ; ; 30 April 177723 February 1855) was a German mathematician, astronomer, Geodesy, geodesist, and physicist, who contributed to many fields in mathematics and science. He was director of the Göttingen Observat ...

(1795) for the prediction of planetary movement.Mansfield Merriman, "A List of Writings Relating to the Method of Least Squares" Historically, digital computers such as the von Neumann model operate via the execution of explicit instructions with access to memory by a number of processors. Some neural networks, on the other hand, originated from efforts to model information processing in biological systems through the framework of

connectionism Connectionism is an approach to the study of human mental processes and cognition that utilizes mathematical models known as connectionist networks or artificial neural networks. Connectionism has had many "waves" since its beginnings. The first ...

. Unlike the von Neumann model, connectionist computing does not separate memory and processing.

Warren McCulloch Warren Sturgis McCulloch (November 16, 1898 – September 24, 1969) was an American neurophysiologist and cybernetician known for his work on the foundation for certain brain theories and his contribution to the cybernetics movement.Ken Aizawa ...

and

Walter Pitts Walter Harry Pitts, Jr. (April 23, 1923 – May 14, 1969) was an American logician who worked in the field of computational neuroscience.Smalheiser, Neil R"Walter Pitts", ''Perspectives in Biology and Medicine'', Volume 43, Number 2, Wint ...

(1943) considered a non-learning computational model for neural networks. This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence. In the late 1940s, D. O. Hebb proposed a learning

hypothesis A hypothesis (: hypotheses) is a proposed explanation for a phenomenon. A scientific hypothesis must be based on observations and make a testable and reproducible prediction about reality, in a process beginning with an educated guess o ...

based on the mechanism of neural plasticity that became known as Hebbian learning. It was used in many early neural networks, such as Rosenblatt's

perceptron In machine learning, the perceptron is an algorithm for supervised classification, supervised learning of binary classification, binary classifiers. A binary classifier is a function that can decide whether or not an input, represented by a vect ...

and the Hopfield network. Farley and Clark (1954) used computational machines to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956). In 1958, psychologist

Frank Rosenblatt Frank Rosenblatt (July 11, 1928July 11, 1971) was an American psychologist notable in the field of artificial intelligence. He is sometimes called the father of deep learning for his pioneering work on artificial neural networks. Life and career ...

described the perceptron, one of the first implemented artificial neural networks, funded by the United States Office of Naval Research. R. D. Joseph (1960) mentions an even earlier perceptron-like device by Farley and Clark: "Farley and Clark of MIT Lincoln Laboratory actually preceded Rosenblatt in the development of a perceptron-like device." However, "they dropped the subject." The perceptron raised public excitement for research in Artificial Neural Networks, causing the US government to drastically increase funding. This contributed to "the Golden Age of AI" fueled by the optimistic claims made by computer scientists regarding the ability of perceptrons to emulate human intelligence. The first perceptrons did not have adaptive hidden units. However, Joseph (1960) also discussed multilayer perceptrons with an adaptive hidden layer. Rosenblatt (1962) cited and adopted these ideas, also crediting work by H. D. Block and B. W. Knight. Unfortunately, these early efforts did not lead to a working learning algorithm for hidden units, i.e.,

deep learning Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience a ...

Deep learning breakthroughs in the 1960s and 1970s

Fundamental research was conducted on ANNs in the 1960s and 1970s. The first working deep learning algorithm was the Group method of data handling, a method to train arbitrarily deep neural networks, published by Alexey Ivakhnenko and Lapa in the

Soviet Union The Union of Soviet Socialist Republics. (USSR), commonly known as the Soviet Union, was a List of former transcontinental countries#Since 1700, transcontinental country that spanned much of Eurasia from 1922 until Dissolution of the Soviet ...

(1965). They regarded it as a form of polynomial regression, or a generalization of Rosenblatt's perceptron. A 1971 paper described a deep network with eight layers trained by this method, which is based on layer by layer training through regression analysis. Superfluous hidden units are pruned using a separate validation set. Since the activation functions of the nodes are Kolmogorov-Gabor polynomials, these were also the first deep networks with multiplicative units or "gates." The first deep learning multilayer perceptron trained by

stochastic gradient descent Stochastic gradient descent (often abbreviated SGD) is an Iterative method, iterative method for optimizing an objective function with suitable smoothness properties (e.g. Differentiable function, differentiable or Subderivative, subdifferentiable ...

was published in 1967 by Shun'ichi Amari. In computer experiments conducted by Amari's student Saito, a five layer MLP with two modifiable layers learned internal representations to classify non-linearily separable pattern classes. Subsequent developments in hardware and hyperparameter tunings have made end-to-end stochastic gradient descent the currently dominant training technique. In 1969, Kunihiko Fukushima introduced the ReLU (rectified linear unit) activation function. The rectifier has become the most popular activation function for deep learning. Nevertheless, research stagnated in the United States following the work of Minsky and Papert (1969), who emphasized that basic perceptrons were incapable of processing the exclusive-or circuit. This insight was irrelevant for the deep networks of Ivakhnenko (1965) and Amari (1967). In 1976 transfer learning was introduced in neural networks learning. Deep learning architectures for

convolutional neural network A convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (or kernel) optimization. This type of deep learning network has been applied to process and make predictions from many different ty ...

s (CNNs) with convolutional layers and downsampling layers and weight replication began with the Neocognitron introduced by Kunihiko Fukushima in 1979, though not trained by backpropagation.

Backpropagation

Backpropagation is an efficient application of the

chain rule In calculus, the chain rule is a formula that expresses the derivative of the Function composition, composition of two differentiable functions and in terms of the derivatives of and . More precisely, if h=f\circ g is the function such that h ...

derived by

Gottfried Wilhelm Leibniz Gottfried Wilhelm Leibniz (or Leibnitz; – 14 November 1716) was a German polymath active as a mathematician, philosopher, scientist and diplomat who is credited, alongside Sir Isaac Newton, with the creation of calculus in addition to ...

in 1673 to networks of differentiable nodes. The terminology "back-propagating errors" was actually introduced in 1962 by Rosenblatt, but he did not know how to implement this, although Henry J. Kelley had a continuous precursor of backpropagation in 1960 in the context of

control theory Control theory is a field of control engineering and applied mathematics that deals with the control system, control of dynamical systems in engineered processes and machines. The objective is to develop a model or algorithm governing the applic ...

. In 1970, Seppo Linnainmaa published the modern form of backpropagation in his Master's

thesis A thesis (: theses), or dissertation (abbreviated diss.), is a document submitted in support of candidature for an academic degree or professional qualification presenting the author's research and findings.International Standard ISO 7144: D ...

(1970). G.M. Ostrovski et al. republished it in 1971.Ostrovski, G.M., Volin,Y.M., and Boris, W.W. (1971). On the computation of derivatives. Wiss. Z. Tech. Hochschule for Chemistry, 13:382–384. Paul Werbos applied backpropagation to neural networks in 1982 (his 1974 PhD thesis, reprinted in a 1994 book, did not yet describe the algorithm). In 1986, David E. Rumelhart et al. popularised backpropagation but did not cite the original work.

Convolutional neural networks

Kunihiko Fukushima's

(CNN) architecture of 1979 also introduced max pooling, a popular downsampling procedure for CNNs. CNNs have become an essential tool for

computer vision Computer vision tasks include methods for image sensor, acquiring, Image processing, processing, Image analysis, analyzing, and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical ...

. The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel to apply CNN to phoneme recognition. It used convolutions, weight sharing, and backpropagation. Alexander Waibel et al.,
Phoneme Recognition Using Time-Delay Neural Networks
'' IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328. – 339 March 1989. In 1988, Wei Zhang applied a backpropagation-trained CNN to alphabet recognition. In 1989, Yann LeCun et al. created a CNN called LeNet for recognizing handwritten ZIP codes on mail. Training required 3 days.LeCun ''et al.'', "Backpropagation Applied to Handwritten Zip Code Recognition", ''Neural Computation'', 1, pp. 541–551, 1989. In 1990, Wei Zhang implemented a CNN on optical computing hardware. In 1991, a CNN was applied to medical image object segmentation and breast cancer detection in mammograms. LeNet-5 (1998), a 7-level CNN by Yann LeCun et al., that classifies digits, was applied by several banks to recognize hand-written numbers on checks digitized in 32×32 pixel images. From 1988 onward,Qian, Ning, and Terrence J. Sejnowski. "Predicting the secondary structure of globular proteins using neural network models." ''Journal of molecular biology'' 202, no. 4 (1988): 865–884.Bohr, Henrik, Jakob Bohr, Søren Brunak, Rodney MJ Cotterill, Benny Lautrup, Leif Nørskov, Ole H. Olsen, and Steffen B. Petersen. "Protein secondary structure and homology by neural networks The α-helices in rhodopsin." ''FEBS letters'' 241, (1988): 223–228 the use of neural networks transformed the field of

protein structure prediction Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its Protein secondary structure, secondary and Protein tertiary structure, tertiary structure ...

, in particular when the first cascading networks were trained on ''profiles'' (matrices) produced by multiple

sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...

s.Rost, Burkhard, and Chris Sander. "Prediction of protein secondary structure at better than 70% accuracy." ''Journal of molecular biology'' 232, no. 2 (1993): 584–599.

Recurrent neural networks

One origin of RNN was

statistical mechanics In physics, statistical mechanics is a mathematical framework that applies statistical methods and probability theory to large assemblies of microscopic entities. Sometimes called statistical physics or statistical thermodynamics, its applicati ...

. In 1972, Shun'ichi Amari proposed to modify the weights of an

Ising model The Ising model (or Lenz–Ising model), named after the physicists Ernst Ising and Wilhelm Lenz, is a mathematical models in physics, mathematical model of ferromagnetism in statistical mechanics. The model consists of discrete variables that r ...

by Hebbian learning rule as a model of associative memory, adding in the component of learning. This was popularized as the Hopfield network by John Hopfield (1982). Another origin of RNN was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, Cajal observed "recurrent semicircles" in the

cerebellar cortex The cerebellum (: cerebella or cerebellums; Latin for 'little brain') is a major feature of the hindbrain of all vertebrates. Although usually smaller than the cerebrum, in some animals such as the mormyrid fishes it may be as large as it or e ...

. Hebb considered "reverberating circuit" as an explanation for short-term memory. The McCulloch and Pitts paper (1943) considered neural networks that contain cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past. In 1982 a recurrent neural network with an array architecture (rather than a multilayer perceptron architecture), namely a Crossbar Adaptive Array, Bozinovski, S. (1982). "A self-learning system using secondary reinforcement". In Trappl, Robert (ed.). Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North-Holland. pp. 397–402. ISBN 978-0-444-86488-8Bozinovski S. (1995) "Neuro genetic agents and structural theory of self-reinforcement learning systems". CMPSCI Technical Report 95-107, University of Massachusetts at Amhers

used direct recurrent connections from the output to the supervisor (teaching) inputs. In addition of computing actions (decisions), it computed internal state evaluations (emotions) of the consequence situations. Eliminating the external supervisor, it introduced the self-learning method in neural networks. In cognitive psychology, the journal American Psychologist in early 1980's carried out a debate on the relation between cognition and emotion. Zajonc in 1980 stated that emotion is computed first and is independent from cognition, while Lazarus in 1982 stated that cognition is computed first and is inseparable from emotion. In 1982 the Crossbar Adaptive Array gave a neural network model of cognition-emotion relation. It was an example of a debate where an AI system, a recurrent neural network, contributed to an issue in the same time addressed by cognitive psychology. Two early influential works were the Recurrent neural network#Jordan network, Jordan network (1986) and the Elman network (1990), which applied RNN to study

cognitive psychology Cognitive psychology is the scientific study of human mental processes such as attention, language use, memory, perception, problem solving, creativity, and reasoning. Cognitive psychology originated in the 1960s in a break from behaviorism, whi ...

. In the 1980s, backpropagation did not work well for deep RNNs. To overcome this problem, in 1991, Jürgen Schmidhuber proposed the "neural sequence chunker" or "neural history compressor" which introduced the important concepts of self-supervised pre-training (the "P" in

ChatGPT ChatGPT is a generative artificial intelligence chatbot developed by OpenAI and released on November 30, 2022. It uses large language models (LLMs) such as GPT-4o as well as other Multimodal learning, multimodal models to create human-like re ...

) and neural knowledge distillation. In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time. Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN. In 1991, Sepp Hochreiter's diploma thesisS. Hochreiter.,
Untersuchungen zu dynamischen neuronalen Netzen
, , ''Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber'', 1991. identified and analyzed the vanishing gradient problem and proposed recurrent residual connections to solve it. He and Schmidhuber introduced long short-term memory (LSTM), which set accuracy records in multiple applications domains. This was not yet the modern version of LSTM, which required the forget gate, which was introduced in 1999. It became the default choice for RNN architecture. During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by Terry Sejnowski, Peter Dayan, Geoffrey Hinton, etc., including the Boltzmann machine, restricted Boltzmann machine, Helmholtz machine, and the wake-sleep algorithm. These were designed for unsupervised learning of deep generative models.

Deep learning

Between 2009 and 2012, ANNs began winning prizes in image recognition contests, approaching human level performance on various tasks, initially in

pattern recognition Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess PR capabilities but their p ...

and

handwriting recognition Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwriting, handwritten input from sources such as paper documents, photographs, touch-screens ...

. In 2011, a CNN named ''DanNet'' by Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3. It then won more contests. They also showed how max-pooling CNNs on GPU improved performance significantly. In October 2012, AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman and Google's Inceptionv3. In 2012, Ng and Dean created a network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images. Unsupervised pre-training and increased computing power from GPUs and

distributed computing Distributed computing is a field of computer science that studies distributed systems, defined as computer systems whose inter-communicating components are located on different networked computers. The components of a distributed system commu ...

allowed the use of larger networks, particularly in image and visual recognition problems, which became known as "deep learning". Radial basis function and wavelet networks were introduced in 2013. These can be shown to offer best approximation properties and have been applied in

nonlinear system identification System identification is a method of identifying or measuring the mathematical model of a system from measurements of the system inputs and outputs. The applications of system identification include any system where the inputs and outputs can be mea ...

and classification applications. Generative adversarial network (GAN) ( Ian Goodfellow et al., 2014) became state of the art in generative modeling during 2014–2018 period. The GAN principle was originally published in 1991 by Jürgen Schmidhuber who called it "artificial curiosity": two neural networks contest with each other in the form of a

zero-sum game Zero-sum game is a Mathematical model, mathematical representation in game theory and economic theory of a situation that involves two competition, competing entities, where the result is an advantage for one side and an equivalent loss for the o ...

, where one network's gain is the other network's loss. The first network is a

generative model In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsiste ...

that models a

probability distribution In probability theory and statistics, a probability distribution is a Function (mathematics), function that gives the probabilities of occurrence of possible events for an Experiment (probability theory), experiment. It is a mathematical descri ...

over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. Excellent image quality is achieved by

Nvidia Nvidia Corporation ( ) is an American multinational corporation and technology company headquartered in Santa Clara, California, and incorporated in Delaware. Founded in 1993 by Jensen Huang (president and CEO), Chris Malachowsky, and Curti ...

's StyleGAN (2018) based on the Progressive GAN by Tero Karras et al. Here, the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GAN reached popular success, and provoked discussions concerning deepfakes. Diffusion models (2015) eclipsed GANs in generative modeling since then, with systems such as DALL·E 2 (2022) and

Stable Diffusion Stable Diffusion is a deep learning, text-to-image model released in 2022 based on Diffusion model, diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of ...

(2022). In 2014, the state of the art was training "very deep neural network" with 20 to 30 layers. Stacking too many layers led to a steep reduction in

training Training is teaching, or developing in oneself or others, any skills and knowledge or fitness that relate to specific useful competencies. Training has specific goals of improving one's capability, capacity, productivity and performance. I ...

accuracy, known as the "degradation" problem. In 2015, two techniques were developed to train very deep networks: the highway network was published in May 2015, and the residual neural network (ResNet) in December 2015. ResNet behaves like an open-gated Highway Net. During the 2010s, the seq2seq model was developed, and attention mechanisms were added. It led to the modern Transformer architecture in 2017 in '' Attention Is All You Need''. It requires computation time that is quadratic in the size of the context window. Jürgen Schmidhuber's fast weight controller (1992) scales linearly and was later shown to be equivalent to the unnormalized linear Transformer. Transformers have increasingly become the model of choice for

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

. Many modern large language models such as

GPT-4 Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model trained and created by OpenAI and the fourth in its series of GPT foundation models. It was launched on March 14, 2023, and made publicly available via the p ...

, and BERT use this architecture.

Models

ANNs began as an attempt to exploit the architecture of the human brain to perform tasks that conventional algorithms had little success with. They soon reoriented towards improving empirical results, abandoning attempts to remain true to their biological precursors. ANNs have the ability to learn and model non-linearities and complex relationships. This is achieved by neurons being connected in various patterns, allowing the output of some neurons to become the input of others. The network forms a

directed Direct may refer to: Mathematics * Directed set, in order theory * Direct limit of (pre), sheaves * Direct sum of modules, a construction in abstract algebra which combines several vector spaces Computing * Direct access (disambiguation), a ...

, weighted graph. An artificial neural network consists of simulated neurons. Each neuron is connected to other nodes via links like a biological axon-synapse-dendrite connection. All the nodes connected by links take in some data and use it to perform specific operations and tasks on the data. Each link has a weight, determining the strength of one node's influence on another, allowing weights to choose the signal between neurons.

Artificial neurons

ANNs are composed of artificial neurons which are conceptually derived from biological

s. Each artificial neuron has inputs and produces a single output which can be sent to multiple other neurons. The inputs can be the feature values of a sample of external data, such as images or documents, or they can be the outputs of other neurons. The outputs of the final ''output neurons'' of the neural net accomplish the task, such as recognizing an object in an image. To find the output of the neuron we take the weighted sum of all the inputs, weighted by the ''weights'' of the ''connections'' from the inputs to the neuron. We add a ''bias'' term to this sum. This weighted sum is sometimes called the ''activation''. This weighted sum is then passed through a (usually nonlinear) activation function to produce the output. The initial inputs are external data, such as images and documents. The ultimate outputs accomplish the task, such as recognizing an object in an image.

Organization

The neurons are typically organized into multiple layers, especially in deep learning. Neurons of one layer connect only to neurons of the immediately preceding and immediately following layers. The layer that receives external data is the ''input layer''. The layer that produces the ultimate result is the ''output layer''. In between them are zero or more ''hidden layers''. Single layer and unlayered networks are also used. Between two layers, multiple connection patterns are possible. They can be 'fully connected', with every neuron in one layer connecting to every neuron in the next layer. They can be ''pooling'', where a group of neurons in one layer connects to a single neuron in the next layer, thereby reducing the number of neurons in that layer. Neurons with only such connections form a

directed acyclic graph In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one ...

and are known as ''feedforward networks''. Alternatively, networks that allow connections between neurons in the same or previous layers are known as ''recurrent networks''.

Hyperparameter

A hyperparameter is a constant

parameter A parameter (), generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when ...

whose value is set before the learning process begins. The values of parameters are derived via learning. Examples of hyperparameters include learning rate, the number of hidden layers and batch size. The values of some hyperparameters can be dependent on those of other hyperparameters. For example, the size of some layers can depend on the overall number of layers.

Learning

Learning is the adaptation of the network to better handle a task by considering sample observations. Learning involves adjusting the weights (and optional thresholds) of the network to improve the accuracy of the result. This is done by minimizing the observed errors. Learning is complete when examining additional observations does not usefully reduce the error rate. Even after learning, the error rate typically does not reach 0. If after learning, the error rate is too high, the network typically must be redesigned. Practically this is done by defining a cost function that is evaluated periodically during learning. As long as its output continues to decline, learning continues. The cost is frequently defined as a

statistic A statistic (singular) or sample statistic is any quantity computed from values in a sample which is considered for a statistical purpose. Statistical purposes include estimating a population parameter, describing a sample, or evaluating a hypot ...

whose value can only be approximated. The outputs are actually numbers, so when the error is low, the difference between the output (almost certainly a cat) and the correct answer (cat) is small. Learning attempts to reduce the total of the differences across the observations. Most learning models can be viewed as a straightforward application of

optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criteria, from some set of available alternatives. It is generally divided into two subfiel ...

theory and statistical estimation.

Learning rate

The learning rate defines the size of the corrective steps that the model takes to adjust for errors in each observation. A high learning rate shortens the training time, but with lower ultimate accuracy, while a lower learning rate takes longer, but with the potential for greater accuracy. Optimizations such as Quickprop are primarily aimed at speeding up error minimization, while other improvements mainly try to increase reliability. In order to avoid

oscillation Oscillation is the repetitive or periodic variation, typically in time, of some measure about a central value (often a point of equilibrium) or between two or more different states. Familiar examples of oscillation include a swinging pendulum ...

inside the network such as alternating connection weights, and to improve the rate of convergence, refinements use an adaptive learning rate that increases or decreases as appropriate. The concept of momentum allows the balance between the gradient and the previous change to be weighted such that the weight adjustment depends to some degree on the previous change. A momentum close to 0 emphasizes the gradient, while a value close to 1 emphasizes the last change.

Cost function

While it is possible to define a cost function

ad hoc ''Ad hoc'' is a List of Latin phrases, Latin phrase meaning literally for this. In English language, English, it typically signifies a solution designed for a specific purpose, problem, or task rather than a Generalization, generalized solution ...

, frequently the choice is determined by the function's desirable properties (such as convexity) because it arises from the model (e.g. in a probabilistic model, the model's

posterior probability The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood via an application of Bayes' rule. From an epistemological perspective, the posteri ...

can be used as an inverse cost).

Backpropagation

Backpropagation is a method used to adjust the connection weights to compensate for each error found during learning. The error amount is effectively divided among the connections. Technically, backpropagation calculates the

gradient In vector calculus, the gradient of a scalar-valued differentiable function f of several variables is the vector field (or vector-valued function) \nabla f whose value at a point p gives the direction and the rate of fastest increase. The g ...

(the derivative) of the cost function associated with a given state with respect to the weights. The weight updates can be done via stochastic gradient descent or other methods, such as '' extreme learning machines'', "no-prop" networks, training without backtracking, "weightless" networks, and non-connectionist neural networks.

Learning paradigms

Machine learning is commonly separated into three main learning paradigms,

supervised learning In machine learning, supervised learning (SL) is a paradigm where a Statistical model, model is trained using input objects (e.g. a vector of predictor variables) and desired output values (also known as a ''supervisory signal''), which are often ...

, unsupervised learning and

reinforcement learning Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learnin ...

. Each corresponds to a particular learning task.

Supervised learning

Supervised learning In machine learning, supervised learning (SL) is a paradigm where a Statistical model, model is trained using input objects (e.g. a vector of predictor variables) and desired output values (also known as a ''supervisory signal''), which are often ...

uses a set of paired inputs and desired outputs. The learning task is to produce the desired output for each input. In this case, the cost function is related to eliminating incorrect deductions. A commonly used cost is the mean-squared error, which tries to minimize the average squared error between the network's output and the desired output. Tasks suited for supervised learning are

(also known as classification) and regression (also known as function approximation). Supervised learning is also applicable to sequential data (e.g., for handwriting, speech and

gesture recognition Gesture recognition is an area of research and development in computer science and language technology concerned with the recognition and interpretation of human gestures. A subdiscipline of computer vision, it employs mathematical algorithms to ...

). This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.

Unsupervised learning

In unsupervised learning, input data is given along with the cost function, some function of the data

\textstyle x

and the network's output. The cost function is dependent on the task (the model domain) and any ''

a priori ('from the earlier') and ('from the later') are Latin phrases used in philosophy to distinguish types of knowledge, Justification (epistemology), justification, or argument by their reliance on experience. knowledge is independent from any ...

'' assumptions (the implicit properties of the model, its parameters and the observed variables). As a trivial example, consider the model

\textstyle f(x) = a

where

\textstyle a

is a constant and the cost

\textstyle C=E x - f(x))^2 /math>. Minimizing this cost produces a value of \textstyle a that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in compression it could be related to the

mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual Statistical dependence, dependence between the two variables. More specifically, it quantifies the "Information conten ...

between

\textstyle x

and

\textstyle f(x)

, whereas in statistical modeling, it could be related to the

of the model given the data (note that in both of those examples, those quantities would be maximized rather than minimized). Tasks that fall within the paradigm of unsupervised learning are in general

estimation Estimation (or estimating) is the process of finding an estimate or approximation, which is a value that is usable for some purpose even if input data may be incomplete, uncertain, or unstable. The value is nonetheless usable because it is d ...

problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

Reinforcement learning

In applications such as playing video games, an actor takes a string of actions, receiving a generally unpredictable response from the environment after each one. The goal is to win the game, i.e., generate the most positive (lowest cost) responses. In

, the aim is to weight the network (devise a policy) to perform actions that minimize long-term (expected cumulative) cost. At each point in time the agent performs an action and the environment generates an observation and an instantaneous cost, according to some (usually unknown) rules. The rules and the long-term cost usually only can be estimated. At any juncture, the agent decides whether to explore new actions to uncover their costs or to exploit prior learning to proceed more quickly. Formally, the environment is modeled as a Markov decision process (MDP) with states

\textstyle \in S

and actions

\textstyle  \in A

. Because the state transitions are not known, probability distributions are used instead: the instantaneous cost distribution

\textstyle P(c_t, s_t)

, the observation distribution

\textstyle P(x_t, s_t)

and the transition distribution

\textstyle P(s_, s_t, a_t)

, while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two define a

Markov chain In probability theory and statistics, a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally ...

(MC). The aim is to discover the lowest-cost MC. ANNs serve as the learning component in such applications. Dynamic programming coupled with ANNs (giving neurodynamic programming) has been applied to problems such as those involved in vehicle routing, video games,

natural resource management Natural resource management (NRM) is the management of natural resources such as Land (economics), land, water, soil, plants and animals, with a particular focus on how management affects the quality of life for both present and future generati ...

and

medicine Medicine is the science and Praxis (process), practice of caring for patients, managing the Medical diagnosis, diagnosis, prognosis, Preventive medicine, prevention, therapy, treatment, Palliative care, palliation of their injury or disease, ...

because of ANNs ability to mitigate losses of accuracy even when reducing the

discretization In applied mathematics, discretization is the process of transferring continuous functions, models, variables, and equations into discrete counterparts. This process is usually carried out as a first step toward making them suitable for numeri ...

grid density for numerically approximating the solution of control problems. Tasks that fall within the paradigm of reinforcement learning are control problems,

game A game is a structured type of play usually undertaken for entertainment or fun, and sometimes used as an educational tool. Many games are also considered to be work (such as professional players of spectator sports or video games) or art ...

s and other sequential decision making tasks.

Self-learning

Self-learning in neural networks was introduced in 1982 along with a neural network capable of self-learning named ''crossbar adaptive array'' (CAA). It is a system with only one input, situation s, and only one output, action (or behavior) a. It has neither external advice input nor external reinforcement input from the environment. The CAA computes, in a crossbar fashion, both decisions about actions and emotions (feelings) about encountered situations. The system is driven by the interaction between cognition and emotion. Given the memory matrix, W =, , w(a,s), , , the crossbar self-learning algorithm in each iteration performs the following computation: In situation s perform action a; Receive consequence situation s'; Compute emotion of being in consequence situation v(s'); Update crossbar memory w'(a,s) = w(a,s) + v(s'). The backpropagated value (secondary reinforcement) is the emotion toward the consequence situation. The CAA exists in two environments, one is behavioral environment where it behaves, and the other is genetic environment, where from it receives initial emotions (only once) about to be encountered situations in the behavioral environment. Having received the genome vector (species vector) from the genetic environment, the CAA will learn a goal-seeking behavior, in the behavioral environment that contains both desirable and undesirable situations.

Neuroevolution

Neuroevolution can create neural network topologies and weights using

evolutionary computation Evolutionary computation from computer science is a family of algorithms for global optimization inspired by biological evolution, and the subfield of artificial intelligence and soft computing studying these algorithms. In technical terms ...

. It is competitive with sophisticated gradient descent approaches. One advantage of neuroevolution is that it may be less prone to get caught in "dead ends".

Stochastic neural network

Stochastic neural networks originating from Sherrington–Kirkpatrick models are a type of artificial neural network built by introducing random variations into the network, either by giving the network's artificial neurons

stochastic Stochastic (; ) is the property of being well-described by a random probability distribution. ''Stochasticity'' and ''randomness'' are technically distinct concepts: the former refers to a modeling approach, while the latter describes phenomena; i ...

transfer functions , or by giving them stochastic weights. This makes them useful tools for

problems, since the random fluctuations help the network escape from local minima. Stochastic neural networks trained using a Bayesian approach are known as Bayesian neural networks.

Topological deep learning

Topological deep learning, first introduced in 2017, is an emerging approach in

that integrates topology with deep neural networks to address highly intricate and high-order data. Initially rooted in

algebraic topology Algebraic topology is a branch of mathematics that uses tools from abstract algebra to study topological spaces. The basic goal is to find algebraic invariant (mathematics), invariants that classification theorem, classify topological spaces up t ...

, TDL has since evolved into a versatile framework incorporating tools from other mathematical disciplines, such as differential topology and

geometric topology In mathematics, geometric topology is the study of manifolds and Map (mathematics)#Maps as functions, maps between them, particularly embeddings of one manifold into another. History Geometric topology as an area distinct from algebraic topo ...

. As a successful example of mathematical deep learning, TDL continues to inspire advancements in mathematical

, fostering a mutually beneficial relationship between AI and

mathematics Mathematics is a field of study that discovers and organizes methods, Mathematical theory, theories and theorems that are developed and Mathematical proof, proved for the needs of empirical sciences and mathematics itself. There are many ar ...

Other

In a Bayesian framework, a distribution over the set of allowed models is chosen to minimize the cost. Evolutionary methods, gene expression programming, simulated annealing, expectation–maximization, non-parametric methods and particle swarm optimization are other learning algorithms. Convergent recursion is a learning algorithm for cerebellar model articulation controller (CMAC) neural networks.

Modes

Two modes of learning are available: stochastic and batch. In stochastic learning, each input creates a weight adjustment. In batch learning, weights are adjusted based on a batch of inputs, accumulating errors over the batch. Stochastic learning introduces "noise" into the process, using the local gradient calculated from one data point; this reduces the chance of the network getting stuck in local minima. However, batch learning typically yields a faster, more stable descent to a local minimum, since each update is performed in the direction of the batch's average error. A common compromise is to use "mini-batches", small batches with samples in each batch selected stochastically from the entire data set.

Types

ANNs have evolved into a broad family of techniques that have advanced the state of the art across multiple domains. The simplest types have one or more static components, including number of units, number of layers, unit weights and

topology Topology (from the Greek language, Greek words , and ) is the branch of mathematics concerned with the properties of a Mathematical object, geometric object that are preserved under Continuous function, continuous Deformation theory, deformat ...

. Dynamic types allow one or more of these to evolve via learning. The latter is much more complicated but can shorten learning periods and produce better results. Some types allow/require learning to be "supervised" by the operator, while others operate independently. Some types operate purely in hardware, while others are purely software and run on general purpose computers. Some of the main breakthroughs include: * Convolutional neural networks that have proven particularly successful in processing visual and other two-dimensional data; Yann LeCun (2016). Slides on Deep Learnin
Online
where long short-term memory avoids the vanishing gradient problem and can handle signals that have a mix of low and high frequency components aiding large-vocabulary speech recognition, text-to-speech synthesis, and photo-real talking heads; * Competitive networks such as generative adversarial networks in which multiple networks (of varying structure) compete with each other, on tasks such as winning a game or on deceiving the opponent about the authenticity of an input.

Network design

Using artificial neural networks requires an understanding of their characteristics. * Choice of model: This depends on the data representation and the application. Model parameters include the number, type, and connectedness of network layers, as well as the size of each and the connection type (full, pooling, etc.). Overly complex models learn slowly. * Learning algorithm: Numerous trade-offs exist between learning algorithms. Almost any algorithm will work well with the correct hyperparameters for training on a particular data set. However, selecting and tuning an algorithm for training on unseen data requires significant experimentation. *

Robustness Robustness is the property of being strong and healthy in constitution. When it is transposed into a system A system is a group of interacting or interrelated elements that act according to a set of rules to form a unified whole. A system, ...

: If the model, cost function and learning algorithm are selected appropriately, the resulting ANN can become robust. Neural architecture search (NAS) uses machine learning to automate ANN design. Various approaches to NAS have designed networks that compare well with hand-designed systems. The basic search algorithm is to propose a candidate model, evaluate it against a dataset, and use the results as feedback to teach the NAS network. Available systems include AutoML and AutoKeras. scikit-learn library provides functions to help with building a deep network from scratch. We can then implement a deep network with

TensorFlow TensorFlow is a Library (computing), software library for machine learning and artificial intelligence. It can be used across a range of tasks, but is used mainly for Types of artificial neural networks#Training, training and Statistical infer ...

or Keras. Hyperparameters must also be defined as part of the design (they are not learned), governing matters such as how many neurons are in each layer, learning rate, step, stride, depth, receptive field and padding (for CNNs), etc.

Applications

Because of their ability to reproduce and model nonlinear processes, artificial neural networks have found applications in many disciplines. These include: *

Function approximation In general, a function approximation problem asks us to select a function (mathematics), function among a that closely matches ("approximates") a in a task-specific way. The need for function approximations arises in many branches of applied ...

, or regression analysis, (including time series prediction, fitness approximation, and modeling) *

Data processing Data processing is the collection and manipulation of digital data to produce meaningful information. Data processing is a form of ''information processing'', which is the modification (processing) of information in any manner detectable by an o ...

(including filtering, clustering, blind source separation, and compression) *

Nonlinear system identification System identification is a method of identifying or measuring the mathematical model of a system from measurements of the system inputs and outputs. The applications of system identification include any system where the inputs and outputs can be mea ...

and control (including vehicle control, trajectory prediction, adaptive control,

process control Industrial process control (IPC) or simply process control is a system used in modern manufacturing which uses the principles of control theory and physical industrial control systems to monitor, control and optimize continuous Industrial processe ...

, and

) *

Pattern recognition Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess PR capabilities but their p ...

(including radar systems, face identification, signal classification,

novelty detection Novelty detection is the mechanism by which an intelligent organism is able to identify an incoming sensory pattern as being hitherto unknown. If the pattern is sufficiently salient or associated with a high positive or strong negative utility, ...

, 3D reconstruction, object recognition, and sequential decision making) * Sequence recognition (including

gesture A gesture is a form of nonverbal communication or non-vocal communication in which visible bodily actions communicate particular messages, either in place of, or in conjunction with, speech. Gestures include movement of the hands, face, or othe ...

speech Speech is the use of the human voice as a medium for language. Spoken language combines vowel and consonant sounds to form units of meaning like words, which belong to a language's lexicon. There are many different intentional speech acts, suc ...

, and handwritten and printed text recognition) * Sensor data analysis (including

image analysis Image analysis or imagery analysis is the extraction of meaningful information from images; mainly from digital images by means of digital image processing techniques. Image analysis tasks can be as simple as reading barcode, bar coded tags or a ...

) *

Robotics Robotics is the interdisciplinary study and practice of the design, construction, operation, and use of robots. Within mechanical engineering, robotics is the design and construction of the physical structures of robots, while in computer s ...

(including directing manipulators and prostheses) *

Data mining Data mining is the process of extracting and finding patterns in massive data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and ...

(including knowledge discovery in databases) * Finance (such as ex-ante models for specific financial long-run forecasts and artificial financial markets) *

Quantum chemistry Quantum chemistry, also called molecular quantum mechanics, is a branch of physical chemistry focused on the application of quantum mechanics to chemical systems, particularly towards the quantum-mechanical calculation of electronic contributions ...

* General game playing * Generative AI * Data visualization * Machine translation * Social network filtering *

E-mail spam Email spam, also referred to as junk email, spam mail, or simply spam, refers to unsolicited messages sent in bulk via email. The term originates from a Monty Python sketch, where the name of a canned meat product, "Spam," is used repetitively, m ...

filtering *

Medical diagnosis Medical diagnosis (abbreviated Dx, Dx, or Ds) is the process of determining which disease or condition explains a person's symptoms and signs. It is most often referred to as a diagnosis with the medical context being implicit. The information ...

ANNs have been used to diagnose several types of cancers and to distinguish highly invasive cancer cell lines from less invasive lines using only cell shape information. ANNs have been used to accelerate reliability analysis of infrastructures subject to natural disasters and to predict foundation settlements. It can also be useful to mitigate flood by the use of ANNs for modelling rainfall-runoff. ANNs have also been used for building black-box models in

geoscience Earth science or geoscience includes all fields of natural science related to the planet Earth. This is a branch of science dealing with the physical, chemical, and biological complex constitutions and synergistic linkages of Earth's four spheres ...

hydrology Hydrology () is the scientific study of the movement, distribution, and management of water on Earth and other planets, including the water cycle, water resources, and drainage basin sustainability. A practitioner of hydrology is called a hydro ...

, ocean modelling and coastal engineering, and

geomorphology Geomorphology () is the scientific study of the origin and evolution of topographic and bathymetric features generated by physical, chemical or biological processes operating at or near Earth's surface. Geomorphologists seek to understand wh ...

. ANNs have been employed in

cybersecurity Computer security (also cybersecurity, digital security, or information technology (IT) security) is a subdiscipline within the field of information security. It consists of the protection of computer software, systems and networks from thr ...

, with the objective to discriminate between legitimate activities and malicious ones. For example, machine learning has been used for classifying Android malware, for identifying domains belonging to threat actors and for detecting URLs posing a security risk. Research is underway on ANN systems designed for penetration testing, for detecting botnets, credit cards frauds and network intrusions. ANNs have been proposed as a tool to solve

partial differential equation In mathematics, a partial differential equation (PDE) is an equation which involves a multivariable function and one or more of its partial derivatives. The function is often thought of as an "unknown" that solves the equation, similar to ho ...

s in physics and simulate the properties of many-body open quantum systems. In brain research ANNs have studied short-term behavior of individual neurons, the dynamics of neural circuitry arise from interactions between individual neurons and how behavior can arise from abstract neural modules that represent complete subsystems. Studies considered long-and short-term plasticity of neural systems and their relation to learning and memory from the individual neuron to the system level. It is possible to create a profile of a user's interests from pictures, using artificial neural networks trained for object recognition. Beyond their traditional applications, artificial neural networks are increasingly being utilized in interdisciplinary research, such as materials science. For instance, graph neural networks (GNNs) have demonstrated their capability in scaling deep learning for the discovery of new stable materials by efficiently predicting the total energy of crystals. This application underscores the adaptability and potential of ANNs in tackling complex problems beyond the realms of predictive modeling and artificial intelligence, opening new pathways for scientific discovery and innovation.

Theoretical properties

Computational power

The multilayer perceptron is a universal function approximator, as proven by the universal approximation theorem. However, the proof is not constructive regarding the number of neurons required, the network topology, the weights and the learning parameters. A specific recurrent architecture with

rational Rationality is the quality of being guided by or based on reason. In this regard, a person acts rationally if they have a good reason for what they do, or a belief is rational if it is based on strong evidence. This quality can apply to an ...

-valued weights (as opposed to full precision real number-valued weights) has the power of a universal Turing machine, using a finite number of neurons and standard linear connections. Further, the use of

irrational Irrationality is cognition, thinking, talking, or acting without rationality. Irrationality often has a negative connotation, as thinking and actions that are less useful or more illogical than other more rational alternatives. The concept of ...

values for weights results in a machine with super-Turing power.

Capacity

A model's "capacity" property corresponds to its ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity. Two notions of capacity are known by the community. The information capacity and the VC Dimension. The information capacity of a perceptron is intensively discussed in Sir David MacKay's book which summarizes work by Thomas Cover. The capacity of a network of standard neurons (not convolutional) can be derived by four rules that derive from understanding a neuron as an electrical element. The information capacity captures the functions modelable by the network given any data as input. The second notion, is the

VC dimension VC may refer to: Military decorations * Victoria Cross, a military decoration awarded by the United Kingdom and other Commonwealth nations ** Victoria Cross for Australia ** Victoria Cross (Canada) ** Victoria Cross for New Zealand * Victorious ...

. VC Dimension uses the principles of

measure theory In mathematics, the concept of a measure is a generalization and formalization of geometrical measures (length, area, volume) and other common notions, such as magnitude (mathematics), magnitude, mass, and probability of events. These seemingl ...

and finds the maximum capacity under the best possible circumstances. This is, given input data in a specific form. As noted in, the VC Dimension for arbitrary inputs is half the information capacity of a perceptron. The VC Dimension for arbitrary points is sometimes referred to as Memory Capacity.

Convergence

Models may not consistently converge on a single solution, firstly because local minima may exist, depending on the cost function and the model. Secondly, the optimization method used might not guarantee to converge when it begins far from any local minimum. Thirdly, for sufficiently large data or parameters, some methods become impractical. Another issue worthy to mention is that training may cross some

saddle point In mathematics, a saddle point or minimax point is a Point (geometry), point on the surface (mathematics), surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a Critical point (mathematics), ...

which may lead the convergence to the wrong direction. The convergence behavior of certain types of ANN architectures are more understood than others. When the width of network approaches to infinity, the ANN is well described by its first order Taylor expansion throughout training, and so inherits the convergence behavior of affine models. Another example is when parameters are small, it is observed that ANNs often fit target functions from low to high frequencies. This behavior is referred to as the spectral bias, or frequency principle, of neural networks. This phenomenon is the opposite to the behavior of some well studied iterative numerical schemes such as Jacobi method. Deeper neural networks have been observed to be more biased towards low frequency functions.

Generalization and statistics

Applications whose goal is to create a system that generalizes well to unseen examples, face the possibility of over-training. This arises in convoluted or over-specified systems when the network capacity significantly exceeds the needed free parameters. Two approaches address over-training. The first is to use cross-validation and similar techniques to check for the presence of over-training and to select hyperparameters to minimize the generalization error. The second is to use some form of '' regularization''. This concept emerges in a probabilistic (Bayesian) framework, where regularization can be performed by selecting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimize over two quantities: the 'empirical risk' and the 'structural risk', which roughly corresponds to the error over the training set and the predicted error in unseen data due to overfitting. Synapse deployment

Supervised neural networks that use a

(MSE) cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the confidence interval of network output, assuming a

normal distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f(x) = \frac ...

. A confidence analysis made this way is statistically valid as long as the output

stays the same and the network is not modified. By assigning a

softmax activation function The softmax function, also known as softargmax or normalized exponential function, converts a tuple of real numbers into a probability distribution of possible outcomes. It is a generalization of the logistic function to multiple dimensions, a ...

, a generalization of the

logistic function A logistic function or logistic curve is a common S-shaped curve ( sigmoid curve) with the equation f(x) = \frac where The logistic function has domain the real numbers, the limit as x \to -\infty is 0, and the limit as x \to +\infty is L. ...

, on the output layer of the neural network (or a softmax component in a component-based network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is useful in classification as it gives a certainty measure on classifications. The softmax activation function is: :

y_i=\frac

Criticism

Training

A common criticism of neural networks, particularly in robotics, is that they require too many training samples for real-world operation. Any learning machine needs sufficient representative examples in order to capture the underlying structure that allows it to generalize to new cases. Potential solutions include randomly shuffling training examples, by using a numerical optimization algorithm that does not take too large steps when changing the network connections following an example, grouping examples in so-called mini-batches and/or introducing a recursive least squares algorithm for CMAC. Dean Pomerleau uses a neural network to train a robotic vehicle to drive on multiple types of roads (single lane, multi-lane, dirt, etc.), and a large amount of his research is devoted to extrapolating multiple training scenarios from a single training experience, and preserving past training diversity so that the system does not become overtrained (if, for example, it is presented with a series of right turns—it should not learn to always turn right).

Theory

A central claim of ANNs is that they embody new and powerful general principles for processing information. These principles are ill-defined. It is often claimed that they are emergent from the network itself. This allows simple statistical association (the basic function of artificial neural networks) to be described as learning or recognition. In 1997, Alexander Dewdney, a former ''

Scientific American ''Scientific American'', informally abbreviated ''SciAm'' or sometimes ''SA'', is an American popular science magazine. Many scientists, including Albert Einstein and Nikola Tesla, have contributed articles to it, with more than 150 Nobel Pri ...

'' columnist, commented that as a result, artificial neural networks have a One response to Dewdney is that neural networks have been successfully used to handle many complex and diverse tasks, ranging from autonomously flying aircraft to detecting credit card fraud to mastering the game of Go. Technology writer Roger Bridgman commented: Although it is true that analyzing what has been learned by an artificial neural network is difficult, it is much easier to do so than to analyze what has been learned by a biological neural network. Moreover, recent emphasis on the explainability of AI has contributed towards the development of methods, notably those based on

attention Attention or focus, is the concentration of awareness on some phenomenon to the exclusion of other stimuli. It is the selective concentration on discrete information, either subjectively or objectively. William James (1890) wrote that "Atte ...

mechanisms, for visualizing and explaining learned neural networks. Furthermore, researchers involved in exploring learning algorithms for neural networks are gradually uncovering generic principles that allow a learning machine to be successful. For example, Bengio and LeCun (2007) wrote an article regarding local vs non-local learning, as well as shallow vs deep architecture. Biological brains use both shallow and deep circuits as reported by brain anatomy,D. J. Felleman and D. C. Van Essen,
Distributed hierarchical processing in the primate cerebral cortex
" ''Cerebral Cortex'', 1, pp. 1–47, 1991. displaying a wide variety of invariance. WengJ. Weng,
Natural and Artificial Intelligence: Introduction to Computational Brain-Mind
," BMI Press, , 2012. argued that the brain self-wires largely according to signal statistics and therefore, a serial cascade cannot catch all major statistical dependencies.

Hardware

Large and effective neural networks require considerable computing resources. While the brain has hardware tailored to the task of processing signals through a graph of neurons, simulating even a simplified neuron on

von Neumann architecture The von Neumann architecture—also known as the von Neumann model or Princeton architecture—is a computer architecture based on the '' First Draft of a Report on the EDVAC'', written by John von Neumann in 1945, describing designs discus ...

may consume vast amounts of

memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembe ...

and storage. Furthermore, the designer often needs to transmit signals through many of these connections and their associated neurons which require enormous CPU power and time. Some argue that the resurgence of neural networks in the twenty-first century is largely attributable to advances in hardware: from 1991 to 2015, computing power, especially as delivered by GPGPUs (on GPUs), has increased around a million-fold, making the standard backpropagation algorithm feasible for training networks that are several layers deeper than before. The use of accelerators such as FPGAs and GPUs can reduce training times from months to days. Neuromorphic engineering or a physical neural network addresses the hardware difficulty directly, by constructing non-von-Neumann chips to directly implement neural networks in circuitry. Another type of chip optimized for neural network processing is called a Tensor Processing Unit, or TPU.

Practical counterexamples

Analyzing what has been learned by an ANN is much easier than analyzing what has been learned by a biological neural network. Furthermore, researchers involved in exploring learning algorithms for neural networks are gradually uncovering general principles that allow a learning machine to be successful. For example, local vs. non-local learning and shallow vs. deep architecture.

Hybrid approaches

Advocates of hybrid models (combining neural networks and symbolic approaches) say that such a mixture can better capture the mechanisms of the human mind.

Dataset bias

Neural networks are dependent on the quality of the data they are trained on, thus low quality data with imbalanced representativeness can lead to the model learning and perpetuating societal biases. These inherited biases become especially critical when the ANNs are integrated into real-world scenarios where the training data may be imbalanced due to the scarcity of data for a specific race, gender or other attribute. This imbalance can result in the model having inadequate representation and understanding of underrepresented groups, leading to discriminatory outcomes that exacerbate societal inequalities, especially in applications like facial recognition, hiring processes, and

law enforcement Law enforcement is the activity of some members of the government or other social institutions who act in an organized manner to enforce the law by investigating, deterring, rehabilitating, or punishing people who violate the rules and norms gove ...

. For example, in 2018,

Amazon Amazon most often refers to: * Amazon River, in South America * Amazon rainforest, a rainforest covering most of the Amazon basin * Amazon (company), an American multinational technology company * Amazons, a tribe of female warriors in Greek myth ...

had to scrap a recruiting tool because the model favored men over women for jobs in software engineering due to the higher number of male workers in the field. The program would penalize any resume with the word "woman" or the name of any women's college. However, the use of

synthetic data Synthetic data are artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. Data generated by a comp ...

can help reduce dataset bias and increase representation in datasets.

Gallery

File:Single layer ann.svg, A single-layer feedforward artificial neural network. Arrows originating from

\scriptstyle x_2

are omitted for clarity. There are p inputs to this network and q outputs. In this system, the value of the qth output,

y_q

, is calculated as

\scriptstyle y_q = K*(\sum_i(x_i*w_)-b_q).

File:Two layer ann.svg, A two-layer feedforward artificial neural network File:Artificial neural network.svg, An artificial neural network File:Ann dependency (graph).svg, An ANN dependency graph File:Single-layer feedforward artificial neural network.png, A single-layer feedforward artificial neural network with 4 inputs, 6 hidden nodes and 2 outputs. Given position state and direction, it outputs wheel based control values. File:Two-layer feedforward artificial neural network.png, A two-layer feedforward artificial neural network with 8 inputs, 2x8 hidden nodes and 2 outputs. Given position state, direction and other environment values, it outputs thruster based control values. File:Cmac.jpg, Parallel pipeline structure of CMAC neural network. This learning algorithm can converge in one step.

Recent advancements and future directions

Artificial neural networks (ANNs) have undergone significant advancements, particularly in their ability to model complex systems, handle large data sets, and adapt to various types of applications. Their evolution over the past few decades has been marked by a broad range of applications in fields such as image processing, speech recognition, natural language processing, finance, and medicine.

Image processing

In the realm of image processing, ANNs are employed in tasks such as image classification, object recognition, and image segmentation. For instance, deep convolutional neural networks (CNNs) have been important in handwritten digit recognition, achieving state-of-the-art performance. This demonstrates the ability of ANNs to effectively process and interpret complex visual information, leading to advancements in fields ranging from automated surveillance to medical imaging.

Speech recognition

By modeling speech signals, ANNs are used for tasks like speaker identification and speech-to-text conversion. Deep neural network architectures have introduced significant improvements in large vocabulary continuous speech recognition, outperforming traditional techniques. These advancements have enabled the development of more accurate and efficient voice-activated systems, enhancing user interfaces in technology products.

Natural language processing

In natural language processing, ANNs are used for tasks such as text classification, sentiment analysis, and machine translation. They have enabled the development of models that can accurately translate between languages, understand the context and sentiment in textual data, and categorize text based on content. This has implications for automated customer service, content moderation, and language understanding technologies.

Control systems

In the domain of control systems, ANNs are used to model dynamic systems for tasks such as system identification, control design, and optimization. For instance, deep feedforward neural networks are important in system identification and control applications.

Finance

ANNs are used for stock market prediction and credit scoring: *In investing, ANNs can process vast amounts of financial data, recognize complex patterns, and forecast stock market trends, aiding investors and risk managers in making informed decisions. *In credit scoring, ANNs offer data-driven, personalized assessments of creditworthiness, improving the accuracy of default predictions and automating the lending process. ANNs require high-quality data and careful tuning, and their "black-box" nature can pose challenges in interpretation. Nevertheless, ongoing advancements suggest that ANNs continue to play a role in finance, offering valuable insights and enhancing risk management strategies.

Medicine

ANNs are able to process and analyze vast medical datasets. They enhance diagnostic accuracy, especially by interpreting complex

medical imaging Medical imaging is the technique and process of imaging the interior of a body for clinical analysis and medical intervention, as well as visual representation of the function of some organs or tissues (physiology). Medical imaging seeks to revea ...

for early disease detection, and by predicting patient outcomes for personalized treatment planning. In drug discovery, ANNs speed up the identification of potential drug candidates and predict their efficacy and safety, significantly reducing development time and costs. Additionally, their application in personalized medicine and healthcare data analysis allows tailored therapies and efficient patient care management. Ongoing research is aimed at addressing remaining challenges such as data privacy and model interpretability, as well as expanding the scope of ANN applications in medicine.

Content creation

ANNs such as generative adversarial networks (GAN) and

transformers ''Transformers'' is a media franchise produced by American toy company Hasbro and Japanese toy company Tomy, Takara Tomy. It primarily follows the heroic Autobots and the villainous Decepticons, two Extraterrestrials in fiction, alien robot fac ...

are used for content creation across numerous industries. This is because deep learning models are able to learn the style of an artist or musician from huge datasets and generate completely new artworks and music compositions. For instance, DALL-E is a deep neural network trained on 650 million pairs of images and texts across the internet that can create artworks based on text entered by the user. In the field of music, transformers are used to create original music for commercials and documentaries through companies such as AIVA and Jukedeck. In the marketing industry generative models are used to create personalized advertisements for consumers. Additionally, major film companies are partnering with technology companies to analyze the financial success of a film, such as the partnership between Warner Bros and technology company Cinelytic established in 2020. Furthermore, neural networks have found uses in video game creation, where Non Player Characters (NPCs) can make decisions based on all the characters currently in the game.

References

Bibliography

* * * *
PDF
* * * * **created for

National Science Foundation The U.S. National Science Foundation (NSF) is an Independent agencies of the United States government#Examples of independent agencies, independent agency of the Federal government of the United States, United States federal government that su ...

, Contract Number EET-8716324, and

Defense Advanced Research Projects Agency The Defense Advanced Research Projects Agency (DARPA) is a research and development agency of the United States Department of Defense responsible for the development of emerging technologies for use by the military. Originally known as the Adva ...

(DOD), ARPA Order No. 4976 under Contract F33615-87-C-1499. * * * * * * * * * * * * *

External links

A Brief Introduction to Neural Networks (D. Kriesel)
– Illustrated, bilingual manuscript about artificial neural networks; Topics so far: Perceptrons, Backpropagation, Radial Basis Functions, Recurrent Neural Networks, Self Organizing Maps, Hopfield Networks.

* ttps://web.archive.org/web/20091216110504/http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html Another introduction to ANNbr>Next Generation of Neural Networks
– Google Tech Talks
Neural Networks and Information
* {{Authority control Computational statistics Classification algorithms Computational neuroscience Market research Mathematical psychology Mathematical and quantitative methods (economics) Bioinspiration

Training

History

Early work

Deep learning breakthroughs in the 1960s and 1970s

Backpropagation

Convolutional neural networks

Recurrent neural networks

Deep learning

Models

Artificial neurons

Organization

Hyperparameter

Learning

Learning rate

Cost function

Backpropagation

Learning paradigms

Supervised learning

Unsupervised learning

Reinforcement learning

Self-learning

Neuroevolution

Stochastic neural network

Topological deep learning

Other

Modes

Types

Network design

Applications

Theoretical properties

Computational power

Capacity

Convergence

Generalization and statistics

Criticism

Training

Theory

Hardware

Practical counterexamples

Hybrid approaches

Dataset bias

Gallery

Recent advancements and future directions

Image processing

Speech recognition

Natural language processing

Control systems

Finance

Medicine

Content creation

See also

References

Bibliography

External links