Deep learning is a subset of machine learning that focuses on utilizing multilayered neural networks to perform tasks such as classification, regression, and representation learning. The field takes inspiration from biological neuroscience and is centered around stacking artificial neurons into layers and "training" them to process data. The adjective "deep" refers to the use of multiple layers (ranging from three to several hundred or thousands) in the network. Methods used can be supervised, semi-supervised or unsupervised.

Some common deep learning network architectures include fully connected networks, deep belief networks, recurrent neural networks, convolutional neural networks, generative adversarial networks, transformers, and neural radiance fields. These architectures have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.

Early forms of neural networks were inspired by information processing and distributed communication nodes in biological systems, particularly the human brain. However, current neural networks do not intend to model the brain function of organisms, and are generally seen as low-quality models for that purpose.

Overview

Most modern deep learning models are based on multi-layered neural networks such as convolutional neural networks and transformers, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.

Fundamentally, deep learning refers to a class of algorithms in which a hierarchy of layers is used to transform input data into a progressively more abstract and composite representation. For example, in an image recognition model, the raw input may be an image (represented as a tensor of pixels). The first representational layer may attempt to identify basic shapes such as lines and circles, the second layer may compose and encode arrangements of edges, the third layer may encode a nose and eyes, and the fourth layer may recognize that the image contains a face.

Importantly, a deep learning process can learn which features to optimally place at which level on its own. Prior to deep learning, machine learning techniques often involved hand-crafted feature engineering to transform the data into a more suitable representation for a classification algorithm to operate on. In the deep learning approach, features are not hand-crafted and the model discovers useful feature representations from the data automatically. This does not eliminate the need for hand-tuning; for example, varying numbers of layers and layer sizes can provide different degrees of abstraction.

The word "deep" in "deep learning" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited. No universally agreed-upon threshold of depth divides shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth higher than two. A CAP of depth two has been shown to be a universal approximator in the sense that it can approximate any continuous function. Beyond that, more layers do not add to the function approximator ability of the network. Deep models (CAP > two) are able to extract better features than shallow models and hence, extra layers help in learning the features effectively.

Deep learning architectures can be constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features improve performance.

Deep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data is more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are deep belief networks.

The term "Deep Learning" was introduced to the machine learning community by Rina Dechter in 1986 (Dechter, R. (1986). Learning while searching in constraint-satisfaction problems. University of California, Computer Science Department, Cognitive Systems Laboratory), and to artificial neural networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons, although the history of its appearance is apparently more complicated.
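The layer hierarchy and the CAP depth of a feedforward network described above can be illustrated with a minimal sketch. The 2-3-2-1 network, its hand-picked weights, and the helper names `dense`, `forward` and `cap_depth` are assumptions made for illustration only; they do not come from any particular library or published model.

```python
def relu(z):
    """Rectified linear unit: the non-negative part of z."""
    return max(0.0, z)


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One parameterized layer: activation(weights * inputs + biases)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Return the input followed by the representation produced by each layer."""
    representations = [x]
    for weights, biases, activation in layers:
        representations.append(dense(representations[-1], weights, biases, activation))
    return representations


def cap_depth(layers):
    """Feedforward CAP depth: hidden layers plus one, i.e. all parameterized layers."""
    return len(layers)


# Hypothetical network: two hidden layers and one output layer, arbitrary weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, -0.5], relu),
    ([[2.0, -1.0]], [0.0], identity),
]

reps = forward([2.0, 1.0], layers)
print("CAP depth:", cap_depth(layers))
for level, rep in enumerate(reps):
    print("representation at level", level, "=", rep)
```

Each entry of `reps` is the representation of the input at one level of the hierarchy. In a trained model these intermediate values would be learned features rather than the result of fixed weights.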

Interpretations

Deep neural networks are generally interpreted in terms of the universal approximation theorem (Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). The Expressive Power of Neural Networks: A View from the Width. Neural Information Processing Systems, 6231-6239) or probabilistic inference.

The classic universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In 1989, the first proof was published by George Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik. Recent work showed that universal approximation also holds for non-bounded activation functions such as Kunihiko Fukushima's rectified linear unit.

The universal approximation theorem for deep neural networks concerns the capacity of networks with bounded width whose depth is allowed to grow. Lu et al. proved that if the width of a deep neural network with ReLU activation is strictly larger than the input dimension, then the network can approximate any Lebesgue integrable function; if the width is smaller than or equal to the input dimension, then a deep neural network is not a universal approximator.

The probabilistic interpretation derives from the field of machine learning. It features inference, as well as the optimization concepts of training and testing, related to fitting and generalization, respectively. More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function. The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks. The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop.
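The flavour of the single-hidden-layer approximation result can be seen in a small numerical sketch. It builds, by hand rather than by training, a one-hidden-layer ReLU network that linearly interpolates a continuous function on an interval, and then measures how the worst-case error shrinks as hidden units are added. The function names, the choice of sin on [0, π], and the equally spaced knots are illustrative assumptions; the sketch is not a proof and says nothing about the bounded-width result of Lu et al.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_network(f, a, b, n_hidden):
    """Construct (knots, out_weights, out_bias) for a one-hidden-layer ReLU
    network that interpolates f at n_hidden + 1 equally spaced points."""
    h = (b - a) / n_hidden
    knots = [a + i * h for i in range(n_hidden)]  # hidden unit i computes relu(x - knot_i)
    slopes = [(f(a + (i + 1) * h) - f(a + i * h)) / h for i in range(n_hidden)]
    # Each output weight is the change of slope introduced at the corresponding knot.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_hidden)]
    return knots, out_weights, f(a)


def network(x, knots, out_weights, out_bias):
    """Output of the network: bias plus a weighted sum of hidden ReLU activations."""
    return out_bias + sum(c * relu(x - k) for k, c in zip(knots, out_weights))


def max_error(f, a, b, n_hidden, samples=1000):
    """Largest absolute error on a uniform grid over [a, b]."""
    params = build_relu_network(f, a, b, n_hidden)
    xs = [a + (b - a) * j / samples for j in range(samples + 1)]
    return max(abs(f(x) - network(x, *params)) for x in xs)


err_small = max_error(math.sin, 0.0, math.pi, 4)
err_large = max_error(math.sin, 0.0, math.pi, 32)
print("max error with 4 hidden units: ", err_small)
print("max error with 32 hidden units:", err_large)
```

Widening the hidden layer refines the piecewise-linear fit, which is the sense in which a single hidden layer of sufficient size can approximate a continuous function on a bounded interval.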

History

Before 1980

There are two types of artificial neural network (ANN): the feedforward neural network (FNN), or multilayer perceptron (MLP), and the recurrent neural network (RNN). RNNs have cycles in their connectivity structure, whereas FNNs do not. In the 1920s, Wilhelm Lenz and Ernst Ising created the Ising model, which is essentially a non-learning RNN architecture consisting of neuron-like threshold elements. In 1972, Shun'ichi Amari made this architecture adaptive. His learning RNN was republished by John Hopfield in 1982. Other early recurrent neural networks were published by Kaoru Nakano in 1971. Already in 1948, Alan Turing produced work on "Intelligent Machinery" that was not published in his lifetime, containing "ideas related to artificial evolution and learning RNNs".

Frank Rosenblatt (1958) proposed the perceptron, an MLP with 3 layers: an input layer, a hidden layer with randomized weights that did not learn, and an output layer. He later published a 1962 book that also introduced variants and computer experiments, including a version with four-layer perceptrons "with adaptive preterminal networks" where the last two layers have learned weights (here he credits H. D. Block and B. W. Knight). The book cites an earlier network by R. D. Joseph (1960) "functionally equivalent to a variation of" this four-layer system (the book mentions Joseph over 30 times). Should Joseph therefore be considered the originator of proper adaptive multilayer perceptrons with learning hidden units? Unfortunately, the learning algorithm was not a functional one, and fell into oblivion.

The first working deep learning algorithm was the Group method of data handling, a method to train arbitrarily deep neural networks, published by Alexey Ivakhnenko and Lapa in 1965. They regarded it as a form of polynomial regression, or a generalization of Rosenblatt's perceptron. A 1971 paper described a deep network with eight layers trained by this method, which is based on layer by layer training through regression analysis. Superfluous hidden units are pruned using a separate validation set. Since the activation functions of the nodes are Kolmogorov-Gabor polynomials, these were also the first deep networks with multiplicative units or "gates".

The first deep learning multilayer perceptron trained by stochastic gradient descent was published in 1967 by Shun'ichi Amari. In computer experiments conducted by Amari's student Saito, a five layer MLP with two modifiable layers learned internal representations to classify non-linearly separable pattern classes. Subsequent developments in hardware and hyperparameter tuning have made end-to-end stochastic gradient descent the currently dominant training technique.

In 1969, Kunihiko Fukushima introduced the ReLU (rectified linear unit) activation function. The rectifier has become the most popular activation function for deep learning.

Deep learning architectures for convolutional neural networks (CNNs) with convolutional layers and downsampling layers began with the Neocognitron introduced by Kunihiko Fukushima in 1979, though not trained by backpropagation.
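The update rule behind stochastic gradient descent can be sketched on a model far simpler than the multilayer perceptrons discussed above: a single linear unit fitted to noise-free points on a line. This is not a reconstruction of Amari's 1967 experiments; the function name `sgd_linear_fit`, the learning rate, the epoch count and the synthetic data are all assumptions chosen for illustration.

```python
import random


def sgd_linear_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error.

    The parameters are updated after every single example, visiting the
    examples in a freshly shuffled order in each epoch.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            err = (w * x + b) - y  # derivative of 0.5 * err**2 with respect to the prediction
            w -= lr * err * x      # chain rule: d(prediction)/dw = x
            b -= lr * err          # chain rule: d(prediction)/db = 1
    return w, b


# Synthetic, noise-free data on the line y = 2x + 1 (hypothetical example).
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_linear_fit(data)
print("w =", w, "b =", b)
```

The same per-example update applies to deep networks once the gradient of the loss with respect to every weight is available.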

Backpropagation is an efficient application of the chain rule derived by Gottfried Wilhelm Leibniz in 1673 to networks of differentiable nodes. The terminology "back-propagating errors" was actually introduced in 1962 by Rosenblatt, but he did not know how to implement this, although Henry J. Kelley had a continuous precursor of backpropagation in 1960 in the context of control theory. The modern form of backpropagation was first published in Seppo Linnainmaa's master's thesis (1970). G.M. Ostrovski et al. republished it in 1971 (Ostrovski, G.M., Volin, Y.M., and Boris, W.W. (1971). On the computation of derivatives. Wiss. Z. Tech. Hochschule for Chemistry, 13:382–384).
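How the chain rule yields gradients for a network of differentiable nodes can be shown on a deliberately tiny example: one tanh hidden unit feeding one linear output unit, with a squared-error loss. The network shape, parameter values and function names below are assumptions for illustration, and the finite-difference comparison is included only as a sanity check on the hand-derived gradients.

```python
import math


def forward(params, x):
    """Compute the hidden pre-activation z, hidden activation h and output y."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y = w2 * h + b2
    return z, h, y


def loss(params, x, t):
    """Squared error between the network output and the target t."""
    return 0.5 * (forward(params, x)[2] - t) ** 2


def backward(params, x, t):
    """Apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    dy = y - t                 # dL/dy
    dw2 = dy * h               # dL/dw2 = dL/dy * dy/dw2
    db2 = dy                   # dL/db2 = dL/dy * 1
    dh = dy * w2               # error propagated back to the hidden unit
    dz = dh * (1.0 - h * h)    # tanh'(z) = 1 - tanh(z)**2
    dw1 = dz * x
    db1 = dz
    return [dw1, db1, dw2, db2]


def numerical_gradient(params, x, t, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x, t) - loss(down, x, t)) / (2 * eps))
    return grads


params = [0.5, -0.3, 1.2, 0.1]  # hypothetical values for w1, b1, w2, b2
analytic = backward(params, x=0.8, t=1.0)
numeric = numerical_gradient(params, x=0.8, t=1.0)
for name, a, n in zip(["w1", "b1", "w2", "b2"], analytic, numeric):
    print(name, "analytic =", a, "numeric =", n)
```

The backward pass reuses the intermediate values of the forward pass and visits each node once, which is what makes the method efficient compared with perturbing every parameter separately.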

Paul Werbos applied backpropagation to neural networks in 1982 (his 1974 PhD thesis, reprinted in a 1994 book, did not yet describe the algorithm). In 1986, David E. Rumelhart et al. popularised backpropagation but did not cite the original work (Rumelhart, David E., Geoffrey E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". David E. Rumelhart, James L. McClelland, and the PDP research group (editors), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundation. MIT Press, 1986).

1980s-2000s

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel to apply CNN to phoneme recognition. It used convolutions, weight sharing, and backpropagation (Alexander Waibel et al., "Phoneme Recognition Using Time-Delay Neural Networks", IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328–339, March 1989). In 1988, Wei Zhang applied a backpropagation-trained CNN to alphabet recognition. In 1989, Yann LeCun et al. created a CNN called LeNet for recognizing handwritten ZIP codes on mail. Training required 3 days (LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition", Neural Computation, 1, pp. 541–551, 1989). In 1990, Wei Zhang implemented a CNN on optical computing hardware. In 1991, a CNN was applied to medical image object segmentation and breast cancer detection in mammograms. LeNet-5 (1998), a 7-level CNN by Yann LeCun et al., that classifies digits, was applied by several banks to recognize hand-written numbers on checks digitized in 32x32 pixel images.
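The weight sharing used by convolutional layers, as in the TDNN and the CNNs mentioned above, can be sketched in one dimension: a short kernel is slid along a sequence and the same few weights are reused at every position. The function name `conv1d`, the input values and the difference kernel are illustrative assumptions rather than details of any of the cited systems.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution in the sense used by CNNs (cross-correlation).

    The kernel weights are shared: the same values are applied at every
    position of the input, so the number of parameters does not depend
    on the length of the signal.
    """
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


frames = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 1.0]  # hypothetical input sequence
edge_kernel = [-1.0, 0.0, 1.0]                 # three shared weights
features = conv1d(frames, edge_kernel)
print(features)
```

Here seven inputs are mapped to five outputs with only three weights, whereas a fully connected layer of the same shape would need thirty-five.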

Recurrent neural networks (RNN) were further developed in the 1980s. Recurrence is used for sequence processing, and when a recurrent network is unrolled in time, it mathematically resembles a deep feedforward network with one layer per time step. Consequently, the two have similar properties and issues, and their developments influenced each other. In RNN, two early influential works were the Jordan network (1986) and the Elman network (1990), which applied RNN to study problems in cognitive psychology.

In the 1980s, backpropagation did not work well for deep learning with long credit assignment paths. To overcome this problem, in 1991, Jürgen Schmidhuber proposed a hierarchy of RNNs pre-trained one level at a time by self-supervised learning, where each RNN tries to predict its own next input, which is the next unexpected input of the RNN below. This "neural history compressor" uses predictive coding to learn internal representations at multiple self-organizing time scales. This can substantially facilitate downstream deep learning. The RNN hierarchy can be ''collapsed'' into a single RNN by distilling a higher level ''chunker'' network into a lower level ''automatizer'' network. In 1993, a neural history compressor solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time (page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN). The "P" in ChatGPT refers to such pre-training.
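The remark above that an unrolled recurrent network resembles a deep feedforward network, and the difficulty with long credit assignment paths, can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions: a scalar hidden state, a tanh unit, and hand-picked weights `w_h` and `w_x`, all of which are hypothetical choices for illustration and not taken from any of the cited works.

```python
import math

def rnn_step(h, x, w_h, w_x):
    """One recurrent update for a scalar state: h' = tanh(w_h*h + w_x*x)."""
    return math.tanh(w_h * h + w_x * x)

def run_recurrent(xs, w_h, w_x, h0=0.0):
    """Process a sequence with a loop that reuses the same weights each step."""
    h = h0
    for x in xs:
        h = rnn_step(h, x, w_h, w_x)
    return h

def run_unrolled(xs, w_h, w_x, h0=0.0):
    """Build one feedforward 'layer' per time step, then apply them in order."""
    layers = [lambda h, x=x: rnn_step(h, x, w_h, w_x) for x in xs]
    h = h0
    for layer in layers:
        h = layer(h)
    return h

def gradient_through_time(xs, w_h, w_x, h0=0.0):
    """Derivative of the final state w.r.t. the initial state (chain rule)."""
    h, grad = h0, 1.0
    for x in xs:
        h = rnn_step(h, x, w_h, w_x)
        # d tanh(w_h*h_prev + w_x*x) / d h_prev = w_h * (1 - h_new**2)
        grad *= w_h * (1.0 - h * h)
    return grad

# Hypothetical inputs and weights, chosen only for illustration.
short_grad = gradient_through_time([0.5] * 1, w_h=0.5, w_x=1.0)
long_grad = gradient_through_time([0.5] * 50, w_h=0.5, w_x=1.0)
```

With these assumed values, each step multiplies the gradient by a factor of at most 0.5, so the gradient across 50 unrolled steps is many orders of magnitude smaller than across one step. This is the same effect that makes long credit assignment paths hard for plain backpropagation.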

Sepp Hochreiter's diploma thesis (1991) (S. Hochreiter, "Untersuchungen zu dynamischen neuronalen Netzen", ''Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber'', 1991) implemented the neural history compressor, and identified and analyzed the vanishing gradient problem. Hochreiter proposed recurrent residual connections to solve the vanishing gradient problem. This led to the long short-term memory (LSTM), published in 1995. LSTM can learn "very deep learning" tasks with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The 1995 LSTM was not yet the modern architecture: the modern form required a "forget gate", introduced in 1999, and became the standard RNN architecture. In 1991,
Jürgen Schmidhuber also published adversarial neural networks that contest with each other in the form of a zero-sum game, where one network's gain is the other network's loss. The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity". In 2014, this principle was used in generative adversarial networks (GANs).

During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by
Terry Sejnowski, Peter Dayan, Geoffrey Hinton, etc., including the restricted Boltzmann machine, the Helmholtz machine, and the wake-sleep algorithm. These were designed for unsupervised learning of deep generative models. However, they were more computationally expensive than backpropagation. The Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986. A 1988 network became state of the art in protein structure prediction, an early application of deep learning to bioinformatics.

Both shallow and deep learning (e.g., recurrent nets) of ANNs for
speech recognition have been explored for many years. These methods never outperformed non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively. Key difficulties have been analyzed, including gradient diminishing and weak temporal correlation structure in neural predictive models. Additional difficulties were the lack of training data and limited computing power.

Most speech recognition researchers moved away from neural nets to pursue generative modeling. An exception was at SRI International in the late 1990s. Funded by the US government's NSA and DARPA, SRI conducted research in speech and speaker recognition. The speaker recognition team led by Larry Heck reported significant success with deep neural networks in speech processing in the 1998 NIST Speaker Recognition benchmark. It was deployed in the Nuance Verifier, representing the first major industrial application of deep learning.

The principle of elevating "raw" features over hand-crafted optimization was first explored successfully in the architecture of a deep autoencoder on the "raw" spectrogram or linear filter-bank features in the late 1990s, showing its superiority over the Mel-Cepstral features that contain stages of fixed transformation from spectrograms. The raw features of speech, waveforms, later produced excellent larger-scale results.

2000s

Neural networks entered a lull, and simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) became the preferred choices in the 1990s and 2000s, because of artificial neural networks' computational cost and a lack of understanding of how the brain wires its biological networks. In 2003, LSTM became competitive with traditional speech recognizers on certain tasks. In 2006,
Alex Graves, Santiago Fernández, Faustino Gomez, and Schmidhuber combined it with connectionist temporal classification (CTC) in stacks of LSTMs (Santiago Fernandez, Alex Graves, and Jürgen Schmidhuber (2007), "An application of recurrent neural networks to discriminative keyword spotting", Proceedings of ICANN (2), pp. 220–229). In 2009, it became the first RNN to win a pattern recognition contest, in connected handwriting recognition. In 2006, publications by
Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh introduced deep belief networks for generative modeling. They are trained by training one restricted Boltzmann machine, then freezing it and training another one on top of the first one, and so on, then optionally fine-tuned using supervised backpropagation (G. E. Hinton, "Learning multiple layers of representation", ''Trends in Cognitive Sciences'', 11, pp. 428–434, 2007). They could model high-dimensional probability distributions, such as the distribution of MNIST images, but convergence was slow.

The impact of deep learning in industry began in the early 2000s, when CNNs already processed an estimated 10% to 20% of all the checks written in the US, according to Yann LeCun (Yann LeCun (2016), Slides on Deep Learning, online).
Industrial applications of deep learning to large-scale speech recognition started around 2010. The 2009 NIPS Workshop on Deep Learning for Speech Recognition was motivated by the limitations of deep generative models of speech, and by the possibility that, given more capable hardware and large-scale data sets, deep neural nets might become practical. It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets. However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation, when using DNNs with large, context-dependent output layers, produced error rates dramatically lower than the then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) systems and also lower than more advanced generative model-based systems. The nature of the recognition errors produced by the two types of systems was characteristically different, offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems. Analysis around 2009–2010, contrasting the GMM (and other generative speech models) with DNN models, stimulated early industrial investment in deep learning for speech recognition. That analysis was done with comparable performance (less than 1.5% in error rate) between discriminative DNNs and generative models. In 2010, researchers extended deep learning from TIMIT to large vocabulary speech recognition, by adopting large output layers of the DNN based on context-dependent HMM states constructed by decision trees.

Deep learning revolution

The deep learning revolution started around CNN- and GPU-based computer vision. Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs (including CNNs) for years, faster implementations of CNNs on GPUs were needed to progress on computer vision. Later, as deep learning became widespread, specialized hardware and algorithm optimizations were developed specifically for deep learning.

Hardware advances, especially GPUs, were key to the deep learning revolution. Some early work dated back to 2004. In 2009, Raina, Madhavan, and Andrew Ng reported a 100M deep belief network trained on 30 Nvidia GeForce GTX 280 GPUs, an early demonstration of GPU-based deep learning. They reported up to 70 times faster training. In 2011, a CNN named ''DanNet'' by Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and
Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3. It then won more contests. They also showed how max-pooling CNNs on GPU improved performance significantly. In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos. In October 2012, AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the large-scale ImageNet competition
by a significant margin over shallow machine learning methods. Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman and Google's Inceptionv3. The success in image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.

In 2014, the state of the art was training "very deep neural networks" with 20 to 30 layers. Stacking too many layers led to a steep reduction in training accuracy, known as the "degradation" problem. In 2015, two techniques were developed to train very deep networks: the Highway Network was published in May 2015, and the residual neural network (ResNet) in Dec 2015. ResNet behaves like an open-gated Highway Net.

Around the same time, deep learning started impacting the field of art. Early examples included Google DeepDream (2015), and neural style transfer (2015), both of which were based on pretrained image classification neural networks, such as VGG-19.
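The statement above that ResNet behaves like an open-gated Highway Net can be made concrete with a small sketch. It assumes the commonly cited formulations: a highway layer computes y = H(x)·T + x·C with a transform gate T and a carry gate C, and a residual layer computes y = H(x) + x. In an actual Highway Network the gates are learned functions of the input; here they are passed in as fixed scalars, and the layer body `transform` is a hypothetical elementwise tanh unit chosen only for illustration.

```python
import math

def transform(x, w, b):
    """Hypothetical layer body H(x): an elementwise tanh unit."""
    return [math.tanh(w * xi + b) for xi in x]

def highway_layer(x, w, b, t_gate, c_gate):
    """y = H(x) * T + x * C, with gate values supplied directly (assumption)."""
    h = transform(x, w, b)
    return [hi * t_gate + xi * c_gate for hi, xi in zip(h, x)]

def residual_layer(x, w, b):
    """y = H(x) + x: the skip connection always carries the input through."""
    h = transform(x, w, b)
    return [hi + xi for hi, xi in zip(h, x)]

x = [0.2, -1.0, 3.0]                                   # illustrative input
open_gated = highway_layer(x, w=0.7, b=0.1, t_gate=1.0, c_gate=1.0)
residual = residual_layer(x, w=0.7, b=0.1)
carry_only = highway_layer(x, w=0.7, b=0.1, t_gate=0.0, c_gate=1.0)
```

With both gates fixed at 1 the highway layer gives the same output as the residual layer, and with the transform gate closed it simply copies its input, which is what lets signals and gradients pass through many stacked layers.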

The generative adversarial network (GAN) of Ian Goodfellow et al. (2014), based on Jürgen Schmidhuber's principle of artificial curiosity, became state of the art in generative modeling during the 2014–2018 period. Excellent image quality is achieved by Nvidia's StyleGAN (2018), based on the Progressive GAN by Tero Karras et al., in which the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GAN reached popular success, and provoked discussions concerning deepfakes.

Diffusion models (2015) have since eclipsed GANs in generative modeling, with systems such as DALL·E 2 (2022) and Stable Diffusion (2022).

In 2015, Google's speech recognition improved by 49% with an LSTM-based model, which they made available through Google Voice Search on smartphones (Google Research Blog, "The neural networks behind Google Voice transcription", August 11, 2015, by Françoise Beaufays, http://googleresearch.blogspot.co.at/2015/08/the-neural-networks-behind-google-voice.html).

Deep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and automatic speech recognition (ASR). Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image classification), as well as a range of large-vocabulary speech recognition tasks, have steadily improved. Convolutional neural networks were superseded for ASR by LSTM, but are more successful in computer vision.

Yoshua Bengio, Geoffrey Hinton and Yann LeCun were awarded the 2018 Turing Award for "conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing".

Neural networks

Artificial neural networks (ANNs) or

connectionist Connectionism is an approach to the study of human mental processes and cognition that utilizes mathematical models known as connectionist networks or artificial neural networks. Connectionism has had many "waves" since its beginnings. The first ...

systems are computing systems inspired by the

biological neural network A neural network, also called a neuronal network, is an interconnected population of neurons (typically containing multiple neural circuits). Biological neural networks are studied to understand the organization and functioning of nervous syst ...

s that constitute animal brains. Such systems learn (progressively improve their ability) to do tasks by considering examples, generally without task-specific programming. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the analytic results to identify cats in other images. They have found most use in applications difficult to express with a traditional computer algorithm using

rule-based programming In computer science, a rule-based system is a computer system in which domain-specific knowledge is represented in the form of rules and general-purpose reasoning is used to solve problems in the domain. Two different kinds of rule-based systems ...

. An ANN is based on a collection of connected units called

artificial neuron An artificial neuron is a mathematical function conceived as a model of a biological neuron in a neural network. The artificial neuron is the elementary unit of an ''artificial neural network''. The design of the artificial neuron was inspired ...

s, (analogous to biological neurons in a biological brain). Each connection (synapse) between neurons can transmit a signal to another neuron. The receiving (postsynaptic) neuron can process the signal(s) and then signal downstream neurons connected to it. Neurons may have state, generally represented by real numbers, typically between 0 and 1. Neurons and synapses may also have a weight that varies as learning proceeds, which can increase or decrease the strength of the signal that is sent downstream.

Typically, neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input) layer to the last (output) layer, possibly after traversing the layers multiple times.

The original goal of the neural network approach was to solve problems in the same way that a human brain would. Over time, attention focused on matching specific mental abilities, leading to deviations from biology such as backpropagation, or passing information in the reverse direction and adjusting the network to reflect that information.

Neural networks have been used on a variety of tasks, including computer vision, social network filtering, playing board and video games and medical diagnosis. As of 2017, neural networks typically have a few thousand to a few million units and millions of connections. Despite this number being several orders of magnitude less than the number of neurons in a human brain, these networks can perform many tasks at a level beyond that of humans (e.g., recognizing faces, or playing "Go").
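The signal flow described above (weighted inputs, a state between 0 and 1, a signal passed downstream) can be illustrated with a minimal sketch of a single artificial neuron. The sigmoid activation, the function names and the numeric values are assumptions chosen for illustration; the text does not prescribe a particular activation.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of incoming signals, passed through a sigmoid.

    Each weight scales the strength of the signal on one connection.
    """
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical values: three incoming signals and their connection weights.
signal = neuron_output([0.5, 0.2, 0.9], [0.8, -0.4, 0.3], bias=0.1)

# The result can itself be the input of a downstream neuron.
downstream = neuron_output([signal], [1.5], bias=-0.5)
print(signal, downstream)
```

Stacking many such units side by side gives a layer, and feeding one layer's outputs into the next gives the layered organization described above.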

Deep neural networks

A deep neural network (DNN) is an artificial neural network with multiple layers between the input and output layers. There are different types of neural networks but they always consist of the same components: neurons, synapses, weights, biases, and functions. These components as a whole function in a way that mimics functions of the human brain, and can be trained like any other ML algorithm. For example, a DNN that is trained to recognize dog breeds will go over the given image and calculate the probability that the dog in the image is a certain breed. The user can review the results and select which probabilities the network should display (above a certain threshold, etc.) and return the proposed label. Each mathematical manipulation as such is considered a layer, and complex DNNs have many layers, hence the name "deep" networks.

DNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives. The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network. For instance, it was proved that sparse multivariate polynomials are exponentially easier to approximate with DNNs than with shallow networks.

Deep architectures include many variants of a few basic approaches. Each architecture has found success in specific domains. It is not always possible to compare the performance of multiple architectures, unless they have been evaluated on the same data sets.

DNNs are typically feedforward networks in which data flows from the input layer to the output layer without looping back. At first, the DNN creates a map of virtual neurons and assigns random numerical values, or "weights", to connections between them. The weights and inputs are multiplied and return an output between 0 and 1. If the network does not accurately recognize a particular pattern, an algorithm adjusts the weights. That way the algorithm can make certain parameters more influential, until it determines the correct mathematical manipulation to fully process the data.
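The procedure in the last paragraph (random initial weights, an output between 0 and 1, weight adjustment after each error) can be sketched as follows. For brevity the sketch trains a single sigmoid unit rather than a deep network, and it assumes plain stochastic gradient descent on a cross-entropy loss; the logical-OR data set, the hyperparameters and all names are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, inputs):
    """Multiply weights and inputs, then squash the sum into (0, 1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_or_gate(epochs=2000, learning_rate=0.5, seed=0):
    """Learn the logical OR function with one sigmoid unit."""
    rng = random.Random(seed)
    # Start from random weights, as described in the text.
    weights = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    bias = rng.uniform(-1, 1)
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
    for _ in range(epochs):
        for inputs, target in data:
            error = predict(weights, bias, inputs) - target
            # Adjust each weight in proportion to its share of the error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train_or_gate()
for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(inputs, round(predict(weights, bias, inputs), 3))
```

In a real DNN the same kind of update is applied to every layer, with backpropagation supplying the error term for the hidden layers.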

Recurrent neural networks (RNNs), in which data can flow in any direction, are used for applications such as language modeling. Long short-term memory is particularly effective for this use.

Convolutional neural networks (CNNs) are used in computer vision. CNNs have also been applied to acoustic modeling for automatic speech recognition (ASR).

Challenges

As with ANNs, many issues can arise with naively trained DNNs. Two common issues are overfitting and computation time.

DNNs are prone to overfitting because of the added layers of abstraction, which allow them to model rare dependencies in the training data. Regularization methods such as Ivakhnenko's unit pruning or weight decay ($\ell_2$-regularization) or sparsity ($\ell_1$-regularization) can be applied during training to combat overfitting. Alternatively, dropout regularization randomly omits units from the hidden layers during training. This helps to exclude rare dependencies. Another recent development is research into models of just enough complexity, obtained by estimating the intrinsic complexity of the task being modelled. This approach has been successfully applied to multivariate time series prediction tasks such as traffic prediction. Finally, data can be augmented via methods such as cropping and rotating, so that smaller training sets can be increased in size to reduce the chances of overfitting.

DNNs must consider many training parameters, such as the size (number of layers and number of units per layer), the learning rate, and initial weights. Sweeping through the parameter space for optimal parameters may not be feasible due to the cost in time and computational resources. Various tricks, such as batching (computing the gradient on several training examples at once rather than on individual examples), speed up computation. The large processing capabilities of many-core architectures (such as GPUs or the Intel Xeon Phi) have produced significant speedups in training, because such processing architectures are well suited to matrix and vector computations.

Alternatively, engineers may look for other types of neural networks with more straightforward and convergent training algorithms. CMAC (cerebellar model articulation controller) is one such kind of neural network. It doesn't require learning rates or randomized initial weights. The training process can be guaranteed to converge in one step with a new batch of data, and the computational complexity of the training algorithm is linear with respect to the number of neurons involved. (Ting Qin, et al. "A learning algorithm of CMAC based on RLS". Neural Processing Letters 19.1 (2004): 49-61; Ting Qin, et al. "Continuous CMAC-QRLS and its systolic array". Neural Processing Letters 22.1 (2005): 1-16.)
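Two of the regularization methods named in this section, weight decay and dropout, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the weight decay is folded into a plain gradient-descent step, and the dropout uses the common "inverted" variant that rescales the surviving units. The function names and default values are hypothetical.

```python
import random


def sgd_step_with_weight_decay(weights, grads, learning_rate=0.1,
                               weight_decay=0.01):
    """One gradient step with an L2 penalty.

    The penalty (weight_decay / 2) * sum(w ** 2) adds weight_decay * w
    to each gradient, which shrinks every weight toward zero.
    """
    return [w - learning_rate * (g + weight_decay * w)
            for w, g in zip(weights, grads)]


def dropout(activations, drop_prob, rng):
    """Randomly omit hidden units during training (inverted dropout).

    Kept units are scaled by 1 / (1 - drop_prob) so that the expected
    activation is unchanged (an assumption of this sketch).
    """
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
print(sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0]))
print(dropout([1.0] * 8, drop_prob=0.5, rng=rng))
```

With zero gradients the first call still moves both weights slightly toward zero, which is the effect that discourages the network from relying on large weights fitted to rare patterns.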

Hardware

Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks that contain many layers of non-linear hidden units and a very large output layer. By 2019, graphics processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method for training large-scale commercial cloud AI. OpenAI estimated the hardware computation used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017) and found a 300,000-fold increase in the amount of computation required, with a doubling-time trendline of 3.4 months.

Special electronic circuits called deep learning processors were designed to speed up deep learning algorithms. Deep learning processors include neural processing units (NPUs) in Huawei cellphones and cloud computing servers such as tensor processing units (TPUs) in the Google Cloud Platform. Cerebras Systems has also built a dedicated system to handle large deep learning models, the CS-2, based on the largest processor in the industry, the second-generation Wafer Scale Engine (WSE-2).

Atomically thin semiconductors are considered promising for energy-efficient deep learning hardware where the same basic device structure is used for both logic operations and data storage. In 2020, Marega et al. published experiments with a large-area active channel material for developing logic-in-memory devices and circuits based on floating-gate field-effect transistors (FGFETs).

In 2021, J. Feldmann et al. proposed an integrated photonic hardware accelerator for parallel convolutional processing. The authors identify two key advantages of integrated photonics over its electronic counterparts: (1) massively parallel data transfer through wavelength division multiplexing in conjunction with frequency combs, and (2) extremely high data modulation speeds. Their system can execute trillions of multiply-accumulate operations per second, indicating the potential of integrated photonics in data-heavy AI applications.
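The OpenAI figures quoted in this section can be checked with a short calculation: a 300,000-fold increase corresponds to a certain number of doublings, and multiplying by the 3.4-month doubling time gives the implied time span. The variable names are illustrative; the two input numbers are taken from the text.

```python
import math

growth_factor = 300_000        # increase in computation, from the text
doubling_time_months = 3.4     # doubling-time trendline, from the text

# Number of times the compute budget had to double to grow 300,000-fold.
doublings = math.log2(growth_factor)

# Time span implied by that many doublings.
elapsed_months = doublings * doubling_time_months

print(f"{doublings:.1f} doublings over {elapsed_months:.0f} months "
      f"({elapsed_months / 12:.1f} years)")
```

This works out to about 18.2 doublings over roughly 62 months, a little over five years, which is consistent with the 2012 to 2017 period covered by the estimate.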

Applications

Automatic speech recognition

Large-scale automatic speech recognition is the first and most convincing successful case of deep learning. LSTM RNNs can learn "Very Deep Learning" tasks that involve multi-second intervals containing speech events separated by thousands of discrete time steps, where one time step corresponds to about 10 ms. LSTM with forget gates is competitive with traditional speech recognizers on certain tasks.

The initial success in speech recognition was based on small-scale recognition tasks based on TIMIT. The data set contains 630 speakers from eight major dialects of American English, where each speaker reads 10 sentences. Its small size lets many configurations be tried. More importantly, the TIMIT task concerns phone-sequence recognition, which, unlike word-sequence recognition, allows weak phone bigram language models. This lets the strength of the acoustic modeling aspects of speech recognition be more easily analyzed. The error rates listed below, including these early results and measured as percent phone error rates (PER), have been summarized since 1991.

The debut of DNNs for speaker recognition in the late 1990s and for speech recognition around 2009–2011, and of LSTM around 2003–2007, accelerated progress in eight major areas:
* Scale-up/out and accelerated DNN training and decoding
* Sequence discriminative training
* Feature processing by deep models with solid understanding of the underlying mechanisms
* Adaptation of DNNs and related deep models
* Multi-task and transfer learning by DNNs and related deep models
* CNNs and how to design them to best exploit domain knowledge of speech
* RNN and its rich LSTM variants
* Other types of deep models including tensor-based models and integrated deep generative/discriminative models.

All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are based on deep learning.
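The percent phone error rate (PER) mentioned above is conventionally computed as the edit distance between the recognized and the reference phone sequences, divided by the reference length. The sketch below assumes that standard definition with unit costs for substitutions, insertions and deletions; the phone sequences are made-up examples, not TIMIT data.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of phone labels."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        current = [i]
        for j, hyp_phone in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_phone != hyp_phone)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phone_error_rate(reference, hypothesis):
    """Errors per reference phone, expressed as a percentage."""
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one deleted phone and one substituted phone.
reference = ["sh", "iy", "hh", "ae", "d"]
hypothesis = ["sh", "iy", "ae", "t"]
print(phone_error_rate(reference, hypothesis))  # 2 errors / 5 phones

# Size of the TIMIT set as described in the text.
total_utterances = 630 * 10
print(total_utterances)
```

Two errors against five reference phones give a PER of 40%, and the speaker and sentence counts from the text give 6,300 utterances in total.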

Image recognition

A common evaluation set for image classification is the MNIST database data set. MNIST is composed of handwritten digits and includes 60,000 training examples and 10,000 test examples. As with TIMIT, its small size lets users test multiple configurations. A comprehensive list of results on this set is available. Deep learning-based image recognition has become "superhuman", producing more accurate results than human contestants. This first occurred in 2011 in recognition of traffic signs, and in 2014, with recognition of human faces. Deep learning-trained vehicles now interpret 360° camera views. Another example is Facial Dysmorphology Novel Analysis (FDNA) used to analyze cases of human malformation connected to a large database of genetic syndromes.

Visual art processing

Closely related to the progress that has been made in image recognition is the increasing application of deep learning techniques to various visual art tasks. DNNs have proven themselves capable, for example, of
* identifying the style period of a given painting
* Neural Style Transfer: capturing the style of a given artwork and applying it in a visually pleasing manner to an arbitrary photograph or video
* generating striking imagery based on random visual input fields.

Natural language processing

Neural networks have been used for implementing language models since the early 2000s. LSTM helped to improve machine translation and language modeling. Other key techniques in this field are negative sampling and word embedding. Word embedding, such as ''word2vec'', can be thought of as a representational layer in a deep learning architecture that transforms an atomic word into a positional representation of the word relative to other words in the dataset; the position is represented as a point in a vector space. Using word embedding as an RNN input layer allows the network to parse sentences and phrases using an effective compositional vector grammar. A compositional vector grammar can be thought of as a probabilistic context-free grammar (PCFG) implemented by an RNN. Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing.

Deep neural architectures provide the best results for constituency parsing, sentiment analysis, information retrieval, spoken language understanding, machine translation, contextual entity linking, writing style recognition, named-entity recognition (token classification), text classification, and others. Recent developments generalize word embedding to sentence embedding.

Google Translate (GT) uses a large end-to-end long short-term memory (LSTM) network. Google Neural Machine Translation (GNMT) uses an example-based machine translation method in which the system "learns from millions of examples". It translates "whole sentences at a time, rather than pieces". Google Translate supports over one hundred languages. The network encodes the "semantics of the sentence rather than simply memorizing phrase-to-phrase translations". GT uses English as an intermediate between most language pairs.
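The idea that a word embedding places each word at a point in a vector space, positioned relative to other words, can be illustrated by comparing vectors with cosine similarity. The three-dimensional vectors below are invented for the example and are not taken from word2vec or any trained model; real embeddings typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, chosen by hand for illustration.
embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

related = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])
print(round(related, 3), round(unrelated, 3))
```

In this toy space the two related words point in nearly the same direction, while the unrelated word does not, which is the property downstream layers rely on when embeddings are used as input.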

Drug discovery and toxicology

A large percentage of candidate drugs fail to win regulatory approval. These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects. Research has explored use of deep learning to predict the biomolecular targets, off-targets, and toxic effects of environmental chemicals in nutrients, household products and drugs ("Toxicology in the 21st century Data Challenge").

AtomNet is a deep learning system for structure-based rational drug design. AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus and multiple sclerosis.

In 2017, graph neural networks were used for the first time to predict various properties of molecules in a large toxicology data set. In 2019, generative neural networks were used to produce molecules that were validated experimentally all the way into mice.

Customer relationship management

Deep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables. The estimated value function was shown to have a natural interpretation as customer lifetime value.

Recommendation systems

Recommendation systems have used deep learning to extract meaningful features for a latent factor model for content-based music and journal recommendations. Multi-view deep learning has been applied for learning user preferences from multiple domains. The model uses a hybrid collaborative and content-based approach and enhances recommendations in multiple tasks.

Bioinformatics

An autoencoder ANN was used in bioinformatics to predict gene ontology annotations and gene-function relationships. In medical informatics, deep learning was used to predict sleep quality based on data from wearables and to predict health complications from electronic health record data. Deep neural networks have shown unparalleled performance in predicting protein structure from the sequence of the amino acids that make it up. In 2020, AlphaFold, a deep-learning based system, achieved a level of accuracy significantly higher than all previous computational methods.

Deep Neural Network Estimations

Deep neural networks can be used to estimate the entropy of a stochastic process; the resulting method is called the Neural Joint Entropy Estimator (NJEE). Such an estimation provides insights on the effects of input random variables on an independent random variable. Practically, the DNN is trained as a classifier that maps an input vector or matrix X to an output probability distribution over the possible classes of random variable Y, given input X. For example, in image classification tasks, the NJEE maps a vector of pixels' color values to probabilities over possible image classes. In practice, the probability distribution of Y is obtained by a softmax layer with a number of nodes equal to the alphabet size of Y. NJEE uses continuously differentiable activation functions, such that the conditions for the universal approximation theorem hold. It is shown that this method provides a strongly consistent estimator and outperforms other methods in case of large alphabet sizes.
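The softmax output described above can be sketched in a few lines. The text does not state which loss NJEE trains with, so the sketch below assumes the usual cross-entropy loss of a classifier, whose expected value is never smaller than the entropy of Y given X. The function names `softmax` and `mean_cross_entropy`, and the toy logits and labels, are illustrative assumptions rather than part of the published method.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution over classes."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_cross_entropy(logit_rows, labels):
    """Average negative log-probability (in nats) given to the observed labels."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits)[label])
    return total / len(labels)


# Hypothetical outputs of an untrained classifier for a variable Y with
# alphabet size 4: equal scores for every class, so the softmax is uniform.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(8)]
observed_labels = [0, 1, 2, 3, 0, 1, 2, 3]
estimate = mean_cross_entropy(uniform_logits, observed_labels)
```

With uniform predictions over four classes the estimate equals log 4 nats, the largest entropy an alphabet of that size allows. A trained classifier would lower the value toward the true conditional entropy.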

Medical image analysis

Deep learning has been shown to produce competitive results in medical applications such as cancer cell classification, lesion detection, organ segmentation and image enhancement. Modern deep learning tools detect various diseases with high accuracy and can help specialists improve diagnostic efficiency.

Mobile advertising

Finding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and analyzed before a target segment can be created and used in ad serving by any ad server. Deep learning has been used to interpret large, many-dimensioned advertising datasets. Many data points are collected during the request/serve/click internet advertising cycle. This information can form the basis of machine learning to improve ad selection.

Image restoration

Deep learning has been successfully applied to inverse problems such as denoising, super-resolution, inpainting, and film colorization. These applications include learning methods such as "Shrinkage Fields for Effective Image Restoration", which trains on an image dataset, and Deep Image Prior, which trains on the image that needs restoration.

Financial fraud detection

Deep learning is being successfully applied to financial fraud detection, tax evasion detection, and anti-money laundering.

Materials science

In November 2023, researchers at Google DeepMind and Lawrence Berkeley National Laboratory announced that they had developed an AI system known as GNoME. This system has contributed to materials science by discovering over 2 million new materials within a relatively short timeframe. GNoME employs deep learning techniques to efficiently explore potential material structures, achieving a significant increase in the identification of stable inorganic crystal structures. The system's predictions were validated through autonomous robotic experiments, demonstrating a noteworthy success rate of 71%. The data of newly discovered materials is publicly available through the Materials Project database, offering researchers the opportunity to identify materials with desired properties for various applications. This development has implications for the future of scientific discovery and the integration of AI in materials science research, potentially expediting material innovation and reducing costs in product development. The use of AI and deep learning suggests the possibility of minimizing or eliminating manual lab experiments and allowing scientists to focus more on the design and analysis of unique compounds.

Military

The United States Department of Defense applied deep learning to train robots in new tasks through observation.

Partial differential equations

Physics informed neural networks have been used to solve partial differential equations in both forward and inverse problems in a data driven manner. One example is reconstructing fluid flow governed by the Navier-Stokes equations. Using physics informed neural networks does not require the often expensive mesh generation that conventional computational fluid dynamics (CFD) methods rely on.
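The mesh-free idea can be illustrated with a deliberately small sketch. It is not a real physics informed neural network: the one-parameter trial function `u`, the toy equation du/dt + u = 0 with u(0) = 1 (an ordinary rather than partial differential equation), the collocation points and the finite-difference gradient are all assumptions chosen to keep the example short. Practical implementations use a neural network and automatic differentiation. What the sketch does share with the approach described above is the loss: the squared residual of the governing equation, evaluated at sample points that need no mesh connectivity.

```python
import math

# Collocation points: plain sample locations in [0, 1], no mesh connectivity.
COLLOCATION = [i / 10 for i in range(11)]


def u(a, t):
    """Hypothetical one-parameter trial solution standing in for a network."""
    return math.exp(a * t)


def du_dt(a, t):
    """Analytic time derivative of the trial solution."""
    return a * math.exp(a * t)


def physics_loss(a):
    """Mean squared residual of du/dt + u = 0 plus the initial-condition error."""
    residual = sum((du_dt(a, t) + u(a, t)) ** 2 for t in COLLOCATION)
    residual /= len(COLLOCATION)
    initial = (u(a, 0.0) - 1.0) ** 2
    return residual + initial


def train(a=0.0, lr=0.1, steps=2000, eps=1e-6):
    """Gradient descent on the physics loss with a central-difference gradient."""
    for _ in range(steps):
        grad = (physics_loss(a + eps) - physics_loss(a - eps)) / (2 * eps)
        a -= lr * grad
    return a


a_fit = train()
```

Minimizing the residual drives the parameter toward -1, which recovers the exact solution exp(-t) without any labeled solution values.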

Deep backward stochastic differential equation method

Deep backward stochastic differential equation method is a numerical method that combines deep learning with backward stochastic differential equations (BSDEs). This method is particularly useful for solving high-dimensional problems in financial mathematics. By leveraging the powerful function approximation capabilities of deep neural networks, deep BSDE addresses the computational challenges faced by traditional numerical methods in high-dimensional settings. Specifically, traditional methods like finite difference methods or Monte Carlo simulations often struggle with the curse of dimensionality, where computational cost increases exponentially with the number of dimensions. Deep BSDE methods, however, employ deep neural networks to approximate solutions of high-dimensional partial differential equations (PDEs), effectively reducing the computational burden.

In addition, the integration of physics-informed neural networks (PINNs) into the deep BSDE framework enhances its capability by embedding the underlying physical laws directly into the neural network architecture. This ensures that the solutions not only fit the data but also adhere to the governing stochastic differential equations. PINNs leverage the power of deep learning while respecting the constraints imposed by the physical models, resulting in more accurate and reliable solutions for financial mathematics problems.

Image reconstruction

Image reconstruction is the recovery of the underlying images from image-related measurements. Several works have shown that deep learning methods outperform analytical methods in various applications, e.g., spectral imaging and ultrasound imaging.

Weather prediction

Traditional weather prediction systems solve a very complex system of partial differential equations. GraphCast is a deep learning based model, trained on a long history of weather data to predict how weather patterns change over time. It is able to predict weather conditions for up to 10 days globally, at a very detailed level, and in under a minute, with precision similar to state of the art systems.

Epigenetic clock

An epigenetic clock is a biochemical test that can be used to measure age. Galkin et al. used deep neural networks to train an epigenetic aging clock of unprecedented accuracy using >6,000 blood samples. The clock uses information from 1000 CpG sites and predicts people with certain conditions to be older than healthy controls: IBD, frontotemporal dementia, ovarian cancer, and obesity. The aging clock was planned to be released for public use in 2021 by an Insilico Medicine spinoff company, Deep Longevity.

Relation to human cognitive and brain development

Deep learning is closely related to a class of theories of brain development (specifically, neocortical development) proposed by cognitive neuroscientists in the early 1990s. These developmental theories were instantiated in computational models, making them predecessors of deep learning systems. These developmental models share the property that various proposed learning dynamics in the brain (e.g., a wave of nerve growth factor) support the self-organization somewhat analogous to the neural networks utilized in deep learning models. Like the neocortex, neural networks employ a hierarchy of layered filters in which each layer considers information from a prior layer (or the operating environment), and then passes its output (and possibly the original input) to other layers. This process yields a self-organizing stack of transducers, well-tuned to their operating environment. A 1995 description stated, "...the infant's brain seems to organize itself under the influence of waves of so-called trophic-factors ... different regions of the brain become connected sequentially, with one layer of tissue maturing before another and so on until the whole brain is mature" (S. Blakeslee, "In brain's early growth, timetable may be critical", The New York Times, Science Section, pp. B5–B6, 1995).

A variety of approaches have been used to investigate the plausibility of deep learning models from a neurobiological perspective. On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism. Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality. In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex.

Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported. For example, the computations performed by deep learning units could be similar to those of actual neurons and neural populations. Similarly, the representations developed by deep learning models are similar to those measured in the primate visual system both at the single-unit and at the population levels.

Commercial activity

Facebook's AI lab performs tasks such as automatically tagging uploaded pictures with the names of the people in them. Google's DeepMind Technologies developed a system capable of learning how to play Atari video games using only pixels as data input. In 2015 they demonstrated their AlphaGo system, which learned the game of Go well enough to beat a professional Go player. Google Translate uses a neural network to translate between more than 100 languages.

In 2017, Covariant.ai was launched, which focuses on integrating deep learning into factories.

As of 2008, researchers at The University of Texas at Austin (UT) developed a machine learning framework called Training an Agent Manually via Evaluative Reinforcement, or TAMER, which proposed new methods for robots or computer programs to learn how to perform tasks by interacting with a human instructor. Building on TAMER, a new algorithm called Deep TAMER was introduced in 2018 during a collaboration between U.S. Army Research Laboratory (ARL) and UT researchers. Deep TAMER used deep learning to provide a robot with the ability to learn new tasks through observation. Using Deep TAMER, a robot learned a task with a human trainer, watching video streams or observing a human perform a task in-person. The robot later practiced the task with the help of some coaching from the trainer, who provided feedback such as "good job" and "bad job".

Criticism and comment

Deep learning has attracted both criticism and comment, in some cases from outside the field of computer science.

Theory

A main criticism concerns the lack of theory surrounding some methods. Learning in the most common deep architectures is implemented using well-understood gradient descent. However, the theory surrounding other algorithms, such as contrastive divergence, is less clear (e.g., Does it converge? If so, how fast? What is it approximating?). Deep learning methods are often looked at as a black box, with most confirmations done empirically, rather than theoretically.

In further reference to the idea that artistic sensitivity might be inherent in relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained demonstrate a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian's website.

Errors

Some deep learning architectures display problematic behaviors, such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images (2014) and misclassifying minuscule perturbations of correctly classified images (2013). Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component artificial general intelligence (AGI) architectures. These issues may be addressed by deep learning architectures that internally form states homologous to image-grammar decompositions of observed entities and events. Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition and artificial intelligence (AI).
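The sensitivity to minuscule perturbations mentioned at the start of this section can be reproduced on a toy model. The sketch below is an illustration under stated assumptions, not the setup of the cited 2013 work: it uses a hand-written logistic classifier with made-up weights over 100 "pixels" and shifts each pixel by a small amount in the direction of the sign of its weight. This sign-of-gradient step is in the spirit of the fast gradient sign method, a technique the text does not name. All names and numbers are hypothetical.

```python
import math


def predict(weights, bias, pixels):
    """Probability of the positive class under a logistic (single-layer) model."""
    logit = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def perturb(weights, pixels, epsilon):
    """Shift every pixel by at most epsilon so that the logit increases.

    For a linear model the gradient of the logit with respect to the input
    is the weight vector, so the step direction is simply sign(weight).
    """
    return [p + epsilon * sign(w) for p, w in zip(pixels, weights)]


# Hypothetical classifier and input: 100 pixels with alternating-sign weights.
weights = [0.5 if i % 2 == 0 else -0.5 for i in range(100)]
bias = -1.0
image = [0.5] * 100

adversarial = perturb(weights, image, epsilon=0.05)
before = predict(weights, bias, image)
after = predict(weights, bias, adversarial)
largest_change = max(abs(a - b) for a, b in zip(adversarial, image))
```

No pixel moves by more than 0.05, yet the many small shifts add up across 100 inputs and push the prediction from the negative side of 0.5 to the positive side. High input dimensionality is what makes such small per-pixel changes effective in this toy case.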

Cyber threat

As deep learning moves from the lab into the world, research and experience show that artificial neural networks are vulnerable to hacks and deception. By identifying patterns that these systems use to function, attackers can modify inputs to ANNs in such a way that the ANN finds a match that human observers would not recognize. For example, an attacker can make subtle changes to an image such that the ANN finds a match even though the image looks to a human nothing like the search target. Such manipulation is termed an "adversarial attack".

In 2016 researchers used one ANN to doctor images in trial and error fashion, identify another's focal points, and thereby generate images that deceived it. The modified images looked no different to human eyes. Another group showed that printouts of doctored images, when photographed, successfully tricked an image classification system. One defense is reverse image search, in which a possible fake image is submitted to a site such as TinEye that can then find other instances of it. A refinement is to search using only parts of the image, to identify images from which that piece may have been taken.

Another group showed that certain psychedelic spectacles could fool a facial recognition system into thinking ordinary people were celebrities, potentially allowing one person to impersonate another. In 2017 researchers added stickers to stop signs and caused an ANN to misclassify them.

ANNs can however be further trained to detect attempts at deception, potentially leading attackers and defenders into an arms race similar to the kind that already defines the malware defense industry. ANNs have been trained to defeat ANN-based anti-malware software by repeatedly attacking a defense with malware that was continually altered by a genetic algorithm until it tricked the anti-malware while retaining its ability to damage the target.

In 2016, another group demonstrated that certain sounds could make the Google Now voice command system open a particular web address, and hypothesized that this could "serve as a stepping stone for further attacks (e.g., opening a web page hosting drive-by malware)".

In "data poisoning", false data is continually smuggled into a machine learning system's training set to prevent it from achieving mastery.

Data collection ethics

Deep learning systems that are trained using supervised learning often rely on data that is created or annotated by humans, or both. It has been argued that not only low-paid clickwork (such as on Amazon Mechanical Turk) is regularly deployed for this purpose, but also implicit forms of human microwork that are often not recognized as such. The philosopher Rainer Mühlhoff distinguishes five types of "machinic capture" of human microwork to generate training data: (1) gamification (the embedding of annotation or computation tasks in the flow of a game), (2) "trapping and tracking" (e.g. CAPTCHAs for image recognition or click-tracking on Google search results pages), (3) exploitation of social motivations (e.g. tagging faces on Facebook to obtain labeled facial images), (4) information mining (e.g. by leveraging quantified-self devices such as activity trackers) and (5) clickwork.
