History
One of the first versions of the ''arbitrary width'' case was proven by George Cybenko in 1989 for sigmoid activation functions. Kurt Hornik, Maxwell Stinchcombe, and Halbert White showed in 1989 that multilayer feed-forward networks with as few as one hidden layer are universal approximators. Hornik also showed in 1991 that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno ''et al'' in 1993 and later Allan Pinkus in 1999 showed that the universal approximation property is equivalent to having a nonpolynomial activation function. The ''arbitrary depth'' case was also studied by a number of authors, such as Gustaf Gripenberg in 2003, Dmitry Yarotsky, Zhou Lu ''et al'' in 2017, Boris Hanin and Mark Sellke in 2018, and Patrick Kidger and Terry Lyons in 2020. The minimal width per layer was refined in 2020 for residual networks. The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999. They showed that there exists an analytic sigmoidal activation function such that two hidden layer neural networks with a bounded number of units in the hidden layers are universal approximators. Using algorithmic and computer programming techniques, Guliyev and Ismailov constructed a smooth sigmoidal activation function providing the universal approximation property for two hidden layer feedforward neural networks with fewer units in the hidden layers. It was constructively proved in a 2018 paper that single hidden layer networks with bounded width are still universal approximators for univariate functions, but this property no longer holds for multivariable functions. Several extensions of the theorem exist, such as to discontinuous activation functions, noncompact domains, certifiable networks, random neural networks, and alternative network architectures and topologies.

Arbitrary-width case
The classical form of the universal approximation theorem for arbitrary width and bounded depth is as follows. It extends the classical results of George Cybenko and Kurt Hornik.

Universal approximation theorem: Let $C(X, Y)$ denote the set of continuous functions from $X$ to $Y$. Let $\sigma \in C(\mathbb{R}, \mathbb{R})$. Note that $(\sigma \circ x)_i = \sigma(x_i)$, so $\sigma \circ x$ denotes $\sigma$ applied to each component of $x$. Then $\sigma$ is not polynomial if and only if for every $n \in \mathbb{N}$, $m \in \mathbb{N}$, compact $K \subseteq \mathbb{R}^n$, $f \in C(K, \mathbb{R}^m)$, $\varepsilon > 0$, there exist $k \in \mathbb{N}$, $A \in \mathbb{R}^{k \times n}$, $b \in \mathbb{R}^k$, $C \in \mathbb{R}^{m \times k}$ such that
: $\sup_{x \in K} \| f(x) - g(x) \| < \varepsilon$
where $g(x) = C \cdot (\sigma \circ (A \cdot x + b))$.

Such an $f$ can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
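As a concrete illustration of the arbitrary-width form $g(x) = C \cdot (\sigma \circ (A \cdot x + b))$, the following sketch draws a random single hidden layer and fits only the outer weights $C$ by least squares. The target function, the width $k = 200$, and the weight scales are arbitrary choices made for this example, not taken from the cited literature.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(z):
    return np.tanh(z)  # any continuous nonpolynomial activation works here

# Target continuous function on the compact set K = [0, 1] (illustrative).
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x

# One hidden layer of width k: g(x) = C . sigma(A x + b).
k = 200
A = rng.normal(scale=10.0, size=(k, 1))
b = rng.normal(scale=5.0, size=k)

x = np.linspace(0.0, 1.0, 2000)[:, None]
H = sigma(x @ A.T + b)  # hidden-layer outputs, shape (2000, k)

# Fit only the outer weights C by linear least squares.
C, *_ = np.linalg.lstsq(H, f(x[:, 0]), rcond=None)

err = np.max(np.abs(H @ C - f(x[:, 0])))
print(f"sup-norm error on the grid with width {k}: {err:.5f}")
```

Increasing the width $k$ drives the sup-norm error on the grid toward zero, mirroring the density statement above.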
Arbitrary-depth case
The 'dual' versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017. They showed that networks of width ''n'' + 4 with ReLU activation functions can approximate any Lebesgue integrable function on ''n''-dimensional input space with respect to $L^1$ distance if network depth is allowed to grow. It was also shown that if the width was less than or equal to ''n'', this general expressive power to approximate any Lebesgue integrable function was lost. In the same paper it was shown that ReLU networks with width ''n'' + 1 were sufficient to approximate any continuous function of ''n''-dimensional input variables. The following refinement specifies the optimal minimum width for which such an approximation is possible.

Universal approximation theorem ''(L1 distance, ReLU activation, arbitrary depth, minimal width).'' For any Bochner–Lebesgue p-integrable function $f : \mathbb{R}^n \to \mathbb{R}^m$ and any $\varepsilon > 0$, there exists a fully-connected ReLU network $F$ of width exactly $d_m = \max\{n + 1, m\}$, satisfying
: $\int_{\mathbb{R}^n} \| f(x) - F(x) \|^p \, \mathrm{d}x < \varepsilon$.
Moreover, there exists a function $f \in L^p(\mathbb{R}^n, \mathbb{R}^m)$ and some $\varepsilon > 0$, for which there is no fully-connected ReLU network of width less than $d_m = \max\{n + 1, m\}$ satisfying the above approximation bound.

''Quantitative Refinement:'' In the case where the target function is continuous on $[0, 1]^d$ and $\sigma$ is the ReLU activation function, the exact depth and width for a ReLU network to achieve error $\varepsilon$ are also known. If, moreover, the target function is smooth, then the required number of layers and their width can be exponentially smaller. Even if $f$ is not smooth, the curse of dimensionality can be broken if $f$ admits additional "compositional structure".
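To illustrate how depth can substitute for width, the following sketch realizes a piecewise-linear interpolant of a univariate target as a deep ReLU network whose hidden layers all have width 3. This is a simplified "carry" construction using one unit more than the theoretical minimum width $n + 1 = 2$ for $n = 1$; the target function and the knot count are arbitrary choices for the example, not the construction from the cited papers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Target: a continuous function on [0, 1] (illustrative choice).
f = lambda x: np.sin(2 * np.pi * x)

# Piecewise-linear interpolant g(x) = f(0) + sum_i a_i * relu(x - b_i),
# where a_i is the change of slope at knot b_i.
k = 32
knots = np.linspace(0.0, 1.0, k + 1)
b = knots[:-1]
slopes = np.diff(f(knots)) / np.diff(knots)
a = np.diff(slopes, prepend=0.0)

M = 10.0  # shift that keeps the running sum nonnegative under ReLU

def deep_narrow_net(x):
    """Depth-k ReLU network of width 3: channels (x, bump, shifted sum)."""
    h = np.stack([x, np.zeros_like(x), np.full_like(x, M)])
    coef = 0.0
    for bi, ai in zip(b, a):
        # One hidden layer per knot: compute the bump relu(x - b_i) and
        # fold the previous layer's bump into the running sum.
        W = np.array([[1.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, coef, 1.0]])
        c = np.array([0.0, -bi, 0.0])
        h = relu(W @ h + c[:, None])
        coef = ai
    # Affine readout: add the last bump and undo the shift.
    return coef * h[1] + h[2] - M + f(0.0)

xs = np.linspace(0.0, 1.0, 1000)
err = np.max(np.abs(deep_narrow_net(xs) - f(xs)))
print(f"sup-norm error with {k} hidden layers of width 3: {err:.4f}")
```

Each additional hidden layer contributes one more kink to the piecewise-linear output, so the error shrinks as the depth grows while the width stays fixed.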
Together, these central results yield the following universal approximation theorem for networks with bounded width.

Universal approximation theorem ''(uniform non-affine activation, arbitrary depth, constrained width).'' Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^n$. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be any non-affine continuous function which is continuously differentiable at at least one point, with nonzero derivative at that point. Let $\mathcal{N}$ denote the space of feed-forward neural networks with $n$ input neurons, $m$ output neurons, and an arbitrary number of hidden layers each with $n + m + 2$ neurons, such that every hidden neuron has activation function $\sigma$ and every output neuron has the identity as its activation function, with input layer $\phi$ and output layer $\rho$. Then given any $\varepsilon > 0$ and any $f \in C(\mathcal{X}, \mathbb{R}^m)$, there exists $F \in \mathcal{N}$ such that
: $\sup_{x \in \mathcal{X}} \| F(x) - f(x) \| < \varepsilon$.
In other words, $\mathcal{N}$ is dense in $C(\mathcal{X}, \mathbb{R}^m)$ with respect to the topology of uniform convergence. Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.

''Quantitative Refinement:'' The number of layers and the width of each layer required to approximate $f$ to precision $\varepsilon$ are known; moreover, the result holds true when $\mathcal{X}$ and $\mathbb{R}^m$ are replaced with any non-positively curved Riemannian manifold.
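A key ingredient behind results of this type is that any activation satisfying the hypotheses can emulate the identity map on a compact set via a first-order difference quotient: near a point $a$ with $\sigma'(a) \neq 0$, the map $x \mapsto (\sigma(a + hx) - \sigma(a)) / (h \, \sigma'(a))$ converges to $x$ as $h \to 0$. A minimal numerical illustration, with $\sigma = \tanh$ and the base point $a = 0.5$ chosen arbitrarily:

```python
import numpy as np

# Difference-quotient trick (illustration, not the paper's construction):
# a single neuron around a point where sigma'(a) != 0 can emulate the
# identity map to arbitrary precision as h shrinks.
sigma = np.tanh
a = 0.5                       # sigma is differentiable here, sigma'(a) != 0
dsigma = 1.0 - np.tanh(a) ** 2

x = np.linspace(-1.0, 1.0, 5)
for h in (1e-1, 1e-2, 1e-3):
    approx_id = (sigma(a + h * x) - sigma(a)) / (h * dsigma)
    print(h, np.max(np.abs(approx_id - x)))  # error shrinks like O(h)
```

This is why narrow networks can pass inputs unchanged through many layers, which the constrained-width constructions rely on.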
Bounded depth and bounded width case
The first result on the approximation capabilities of neural networks with a bounded number of layers, each containing a limited number of artificial neurons, was obtained by Maiorov and Pinkus. Their remarkable result revealed that such networks can be universal approximators and that two hidden layers are enough to achieve this property.

Universal approximation theorem: There exists an activation function $\sigma$ which is analytic, strictly increasing and sigmoidal and has the following property: For any $f \in C[0, 1]^d$ and $\varepsilon > 0$ there exist constants $d_i$, $c_{ij}$, $\theta_{ij}$, $\gamma_i$ and vectors $\mathbf{w}^{ij} \in \mathbb{R}^d$ for which
: $\left| f(\mathbf{x}) - \sum_{i=1}^{6d+3} d_i \, \sigma\!\left( \sum_{j=1}^{3d} c_{ij} \, \sigma(\mathbf{w}^{ij} \cdot \mathbf{x} - \theta_{ij}) - \gamma_i \right) \right| < \varepsilon$
for all $\mathbf{x} = (x_1, \ldots, x_d) \in [0, 1]^d$.

This is an existence result. It says that activation functions providing the universal approximation property for bounded depth, bounded width networks exist. Using certain algorithmic and computer programming techniques, Guliyev and Ismailov efficiently constructed such activation functions depending on a numerical parameter. The developed algorithm allows one to compute the activation functions at any point of the real axis instantly. For the algorithm and the corresponding computer code, see the cited work. The theoretical result can be formulated as follows.
Universal approximation theorem: Let $[a, b]$ be a finite segment of the real line, $s = b - a$ and $\lambda$ be any positive number. Then one can algorithmically construct a computable sigmoidal activation function $\sigma : \mathbb{R} \to \mathbb{R}$, which is infinitely differentiable, strictly increasing on $(-\infty, s)$, $\lambda$-strictly increasing on $[s, +\infty)$, and satisfies the following properties:

1) For any $f \in C[a, b]$ and $\varepsilon > 0$ there exist numbers $c_1$, $c_2$, $\theta_1$ and $\theta_2$ such that for all $x \in [a, b]$
: $| f(x) - c_1 \sigma(x - \theta_1) - c_2 \sigma(x - \theta_2) | < \varepsilon$

2) For any continuous function $F$ on the $d$-dimensional box $[a, b]^d$ and $\varepsilon > 0$, there exist constants $e_p$, $c_{pq}$, $\theta_{pq}$ and $\zeta_p$ such that the inequality
: $\left| F(\mathbf{x}) - \sum_{p=1}^{2d+2} e_p \, \sigma\!\left( \sum_{q=1}^{d} c_{pq} \, \sigma(\mathbf{w}^{q} \cdot \mathbf{x} - \theta_{pq}) - \zeta_p \right) \right| < \varepsilon$
holds for all $\mathbf{x} = (x_1, \ldots, x_d) \in [a, b]^d$. Here the weights $\mathbf{w}^{q}$, $q = 1, \ldots, d$, are fixed as the coordinate vectors $\mathbf{w}^{1} = (1, 0, \ldots, 0)$, $\ldots$, $\mathbf{w}^{d} = (0, 0, \ldots, 1)$.

Here "$\sigma : \mathbb{R} \to \mathbb{R}$ is $\lambda$-strictly increasing on some set $X$" means that there exists a strictly increasing function $u : X \to \mathbb{R}$ such that $| \sigma(x) - u(x) | \le \lambda$ for all $x \in X$.
Graph input
Achieving useful universal function approximation on graphs (or rather on graph isomorphism classes) has been a longstanding problem. The popular graph convolutional neural networks (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test. In 2020, a universal approximation theorem result was established by Brüel-Gabrielsson, showing that graph representations with certain injective properties are sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying method that performed at state of the art on a collection of benchmarks.
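The injectivity requirement can be illustrated with multiset aggregation over a node's neighbors: summing neighbor features distinguishes multisets that averaging cannot, which is the observation underlying Weisfeiler–Leman matching results for GNNs. A toy sketch, not code from the cited work:

```python
import numpy as np

# Two different neighborhoods: two vs. three neighbors, all with feature 1.
n1 = np.ones((2, 1))
n2 = np.ones((3, 1))

print(n1.mean(axis=0), n2.mean(axis=0))  # mean aggregation: [1.] [1.] -- collides
print(n1.sum(axis=0), n2.sum(axis=0))    # sum aggregation:  [2.] [3.] -- injective
```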
See also

* Kolmogorov–Arnold representation theorem