A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously: sometimes loosely, to mean ''any'' feedforward ANN, and sometimes strictly, to refer to networks composed of multiple layers of perceptrons (with threshold activation); see § Terminology below. Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural networks, especially when they have a single hidden layer. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and nonlinear activation distinguish the MLP from a linear perceptron; it can distinguish data that is not linearly separable. (Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. ''Mathematics of Control, Signals, and Systems'', 2(4), 303–314.)


Theory


Activation function

If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a two-layer input-output model. In MLPs some neurons use a ''nonlinear'' activation function that was developed to model the frequency of action potentials, or firing, of biological neurons.

The two historically common activation functions are both sigmoids, and are described by

:y(v_i) = \tanh(v_i) \quad \text{and} \quad y(v_i) = (1+e^{-v_i})^{-1}.

The first is a hyperbolic tangent that ranges from -1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1. Here y_i is the output of the ith node (neuron) and v_i is the weighted sum of the input connections. In recent developments of deep learning the rectified linear unit (ReLU) is more frequently used as one of the possible ways to overcome the numerical problems related to the sigmoids. Alternative activation functions have been proposed, including the rectifier and softplus functions. More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models).
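As a concrete illustration, these three activation functions can be written in a few lines of NumPy (a minimal sketch; the function names are our own):

    import numpy as np

    def tanh(v):
        # Hyperbolic tangent: S-shaped, ranges from -1 to 1.
        return np.tanh(v)

    def logistic(v):
        # Logistic sigmoid: similar S-shape, but ranges from 0 to 1.
        return 1.0 / (1.0 + np.exp(-v))

    def relu(v):
        # Rectified linear unit: non-saturating for v > 0, which avoids
        # the vanishing gradients that affect the two sigmoids.
        return np.maximum(0.0, v)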


Layers

The MLP consists of three or more layers (an input and an output layer with one or more ''hidden layers'') of nonlinearly-activating nodes. Since MLPs are fully connected, each node in one layer connects with a certain weight w_{ij} to every node in the following layer.
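In code, this full connectivity makes each layer a matrix-vector product followed by the activation function. A minimal sketch of the forward pass (the layer sizes, tanh activation, and random initialization are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # 3 inputs -> 4 hidden units -> 2 outputs; entry W[j, i] holds the
    # weight connecting node i of one layer to node j of the next.
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

    def forward(x):
        h = np.tanh(W1 @ x + b1)     # hidden layer: every input feeds every node
        return np.tanh(W2 @ h + b2)  # output layer

    print(forward(np.array([1.0, -0.5, 0.25])))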


Learning

Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron.

We can represent the degree of error in an output node j in the nth data point (training example) by e_j(n)=d_j(n)-y_j(n), where d is the target value and y is the value produced by the perceptron. The node weights can then be adjusted based on corrections that minimize the error in the entire output, given by

:\mathcal{E}(n)=\frac{1}{2}\sum_j e_j^2(n).

Using gradient descent, the change in each weight is

:\Delta w_{ji}(n) = -\eta\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} y_i(n)

where y_i is the output of the previous neuron and \eta is the ''learning rate'', which is selected to ensure that the weights quickly converge to a response, without oscillations.

The derivative to be calculated depends on the induced local field v_j, which itself varies. It is easy to prove that for an output node this derivative can be simplified to

:-\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} = e_j(n)\phi^\prime(v_j(n))

where \phi^\prime is the derivative of the activation function described above, which itself does not vary. The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is

:-\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} = \phi^\prime(v_j(n))\sum_k -\frac{\partial\mathcal{E}(n)}{\partial v_k(n)} w_{kj}(n).

This depends on the derivatives at the kth nodes, which represent the output layer. So to change the hidden layer weights, the output layer weights are adjusted first, using the derivative of the activation function; in this sense the error is propagated backwards through the network, which is why the algorithm is called backpropagation.
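These update rules translate directly into code. A minimal sketch of one training step with tanh activation, for which \phi^\prime(v) = 1 - \tanh^2(v) (biases are omitted for brevity and the variable names are our own):

    import numpy as np

    def backprop_step(x, d, W1, W2, eta=0.1):
        # Forward pass.
        v1 = W1 @ x;  y1 = np.tanh(v1)          # hidden layer
        v2 = W2 @ y1; y2 = np.tanh(v2)          # output layer

        # Backward pass: delta_j denotes -dE(n)/dv_j(n).
        e = d - y2                              # e_j(n) = d_j(n) - y_j(n)
        delta2 = e * (1 - y2**2)                # output nodes: e_j phi'(v_j)
        delta1 = (1 - y1**2) * (W2.T @ delta2)  # hidden: phi'(v_j) sum_k delta_k w_kj

        # Gradient-descent update: Delta w_ji = eta * delta_j * y_i.
        W2 += eta * np.outer(delta2, y1)
        W1 += eta * np.outer(delta1, x)
        return W1, W2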


Terminology

The term "multilayer perceptron" does not refer to a single perceptron that has multiple layers. Rather, it contains many perceptrons that are organized into layers. An alternative is "multilayer perceptron network". Moreover, MLP "perceptrons" are not perceptrons in the strictest possible sense. True perceptrons are formally a special case of artificial neurons that use a threshold activation function such as the
Heaviside step function The Heaviside step function, or the unit step function, usually denoted by or (but sometimes , or ), is a step function, named after Oliver Heaviside (1850–1925), the value of which is zero for negative arguments and one for positive argum ...
. MLP perceptrons can employ arbitrary activation functions. A true perceptron performs ''binary'' classification, an MLP neuron is free to either perform classification or regression, depending upon its activation function. The term "multilayer perceptron" later was applied without respect to nature of the nodes/layers, which can be composed of arbitrarily defined artificial neurons, and not perceptrons specifically. This interpretation avoids the loosening of the definition of "perceptron" to mean an artificial neuron in general.


Applications

MLPs are useful in research for their ability to solve problems stochastically, which often allows approximate solutions for extremely complex problems like fitness approximation.

MLPs are universal function approximators as shown by Cybenko's theorem, so they can be used to create mathematical models by regression analysis. As classification is a particular case of regression when the response variable is categorical, MLPs make good classifier algorithms.

MLPs were a popular machine learning solution in the 1980s, finding applications in diverse fields such as speech recognition, image recognition, and machine translation software, but thereafter faced strong competition from much simpler (and related) support vector machines. (R. Collobert and S. Bengio. 2004. Links between Perceptrons, MLPs and SVMs. Proc. Int'l Conf. on Machine Learning (ICML).) Interest in backpropagation networks returned due to the successes of deep learning.
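As a usage illustration, a minimal sketch of an off-the-shelf MLP classifier using scikit-learn's MLPClassifier (the hidden layer size and iteration count are arbitrary choices for the sketch, not recommendations):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # A single hidden layer of 16 units, trained via backpropagation.
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))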


External links


Weka: Open source data mining software with multilayer perceptron implementation

Neuroph Studio documentation, implements this algorithm and a few others