Universal approximation theorem explained

In the mathematical theory of artificial neural networks, universal approximation theorems are theorems[1] [2] of the following form: given a family of neural networks, for each function $f$ from a certain function space, there exists a sequence of neural networks $\phi_1, \phi_2, \dots$ from the family such that $\phi_n \to f$ according to some criterion. That is, the family of neural networks is dense in the function space.

The most popular version states that feedforward networks with non-polynomial activation functions are dense in the space of continuous functions between two Euclidean spaces, with respect to the compact convergence topology.
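Written out, density with respect to compact convergence amounts to the statement below. The symbols $\mathcal{N}$ (the family of networks), $n$ and $m$ (input and output dimensions), and $K$ are notation chosen here for illustration and are not fixed by the text above.

```latex
\forall f \in C(\mathbb{R}^n, \mathbb{R}^m),\quad
\forall K \subset \mathbb{R}^n \text{ compact},\quad
\forall \epsilon > 0,\quad
\exists \phi \in \mathcal{N} :\qquad
\sup_{x \in K} \lVert f(x) - \phi(x) \rVert < \epsilon .
```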

Universal approximation theorems are existence theorems: they state only that such a sequence $\phi_1, \phi_2, \dots \to f$ exists, and do not provide any way to actually find it. Nor do they guarantee that any particular method, such as backpropagation, will find such a sequence. Any method for searching the space of neural networks, including backpropagation, might find a converging sequence, or it might not (for example, backpropagation might get stuck in a local optimum).

Universal approximation theorems are limit theorems: they state only that for any $f$ and any criterion of closeness $\epsilon > 0$, there exists a neural network with sufficiently many neurons that approximates $f$ to within $\epsilon$. There is no guarantee that any fixed finite size, say 10,000 neurons, is enough.
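The dependence of the required size on $\epsilon$ can be seen in a small constructive case. The sketch below is not taken from any of the cited proofs: it assumes a one-dimensional target on $[0, 1]$ and builds, by hand rather than by training, a one-hidden-layer ReLU network with `n` neurons that interpolates the target at `n + 1` equally spaced points. All function names are hypothetical.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, n):
    """Parameters of a one-hidden-layer ReLU network with n neurons that
    interpolates f at the points k/n, k = 0..n, on [0, 1]."""
    knots = [k / n for k in range(n)]            # breakpoint of each hidden unit
    values = [f(k / n) for k in range(n + 1)]
    slopes = [(values[k + 1] - values[k]) * n for k in range(n)]
    # The output weight of neuron k is the change of slope at its breakpoint.
    weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n)]
    return values[0], weights, knots

def network(x, params):
    bias, weights, knots = params
    return bias + sum(c * relu(x - t) for c, t in zip(weights, knots))

def sup_error(f, params, samples=2001):
    """Largest error on a uniform grid over [0, 1] (a proxy for the sup norm)."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - network(x, params)) for x in grid)

def target(x):
    return math.sin(2 * math.pi * x)

# Sup-norm error for three network widths.
errors = {n: sup_error(target, build_relu_interpolant(target, n)) for n in (4, 16, 64)}
for n, err in errors.items():
    print(f"width {n:3d}: sup error {err:.5f}")
```

For this target the error shrinks as the width grows, but no single width works for every $\epsilon$: a smaller tolerance calls for a wider network.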

Setup

Artificial neural networks are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued vectors to real-valued vectors. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces.

Most universal approximation theorems fall into one of two classes. The first quantifies the approximation capabilities of neural networks with an arbitrary number of artificial neurons (the "arbitrary width" case), and the second focuses on the case with an arbitrary number of hidden layers, each containing a limited number of artificial neurons (the "arbitrary depth" case). In addition to these two classes, there are also universal approximation theorems for neural networks with a bounded number of hidden layers and a limited number of neurons in each layer (the "bounded depth and bounded width" case).

History

Arbitrary width

The first examples were the arbitrary width case. George Cybenko in 1989 proved it for sigmoid activation functions.[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White showed in 1989 that multilayer feed-forward networks with as few as one hidden layer are universal approximators.[1] Hornik also showed in 1991[4] that it is not the specific choice of the activation function but rather the multilayer feed-forward architecture itself that gives neural networks the potential of being universal approximators. Moshe Leshno et al. in 1993[5] and later Allan Pinkus in 1999[6] showed that the universal approximation property is equivalent to having a nonpolynomial activation function.

Arbitrary depth

The arbitrary depth case was also studied by a number of authors, such as Gustaf Gripenberg in 2003,[7] Dmitry Yarotsky,[8] Zhou Lu et al. in 2017,[9] and Boris Hanin and Mark Sellke in 2018,[10] who focused on neural networks with the ReLU activation function. In 2020, Patrick Kidger and Terry Lyons[11] extended those results to neural networks with general activation functions such as tanh, GeLU, or Swish.

One special case of arbitrary depth is that each composition component comes from a finite set of mappings. In 2024, Cai[12] constructed a finite set of mappings, named a vocabulary, such that any continuous function can be approximated by composing a sequence from the vocabulary. This is similar to the concept of compositionality in linguistics, which is the idea that a finite vocabulary of basic elements can be combined via grammar to express an infinite range of meanings.

Bounded depth and bounded width

The bounded depth and bounded width case was first studied by Maiorov and Pinkus in 1999.[13] They showed that there exists an analytic sigmoidal activation function such that two hidden layer neural networks with a bounded number of units in the hidden layers are universal approximators.

Guliyev and Ismailov[14] constructed a smooth sigmoidal activation function providing the universal approximation property for two hidden layer feedforward neural networks with fewer units in the hidden layers.

Guliyev and Ismailov[15] also constructed single hidden layer networks with bounded width that are still universal approximators for univariate functions. However, this does not apply to multivariable functions.

Shen, Yang, and Zhang[16] obtained precise quantitative information on the depth and width required to approximate a target function by deep and wide ReLU neural networks.

Quantitative bounds

The question of the minimal possible width for universality was first studied in 2021, when Park et al. obtained the minimum width required for the universal approximation of $L^p$ functions using feed-forward neural networks with ReLU as activation functions.[17] Similar results that can be directly applied to residual neural networks were also obtained in the same year by Paulo Tabuada and Bahman Gharesifard using control-theoretic arguments.[18] [19] In 2023, Cai obtained the optimal minimum width bound for the universal approximation.[20]

For the arbitrary depth case, Léonie Papon and Anastasis Kratsios derived explicit depth estimates depending on the regularity of the target function and of the activation function.[21]

Kolmogorov network

The Kolmogorov–Arnold representation theorem is similar in spirit. Indeed, certain neural network families can directly apply the Kolmogorov–Arnold theorem to yield a universal approximation theorem. Robert Hecht-Nielsen showed that a three-layer neural network can approximate any continuous multivariate function.[22] This was extended to the discontinuous case by Vugar Ismailov.[23] In 2024, Ziming Liu and co-authors showed a practical application.[24]

Variants

Variants of the theorem cover discontinuous activation functions, noncompact domains,[25] certifiable networks,[26] random neural networks,[27] and alternative network architectures and topologies.[28]

The universal approximation property of width-bounded networks has been studied as a dual of classical universal approximation results on depth-bounded networks. For input dimension $d_x$ and output dimension $d_y$, the minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x + 1, d_y\}$ (for a ReLU network). More generally, this also holds if both ReLU and a threshold activation function are used.

Universal function approximation on graphs (or rather on graph isomorphism classes) by popular graph convolutional neural networks (GCNs or GNNs) can be made as discriminative as the Weisfeiler–Leman graph isomorphism test.[29] In 2020,[30] Brüel-Gabrielsson established a universal approximation theorem showing that graph representation with certain injective properties is sufficient for universal function approximation on bounded graphs and restricted universal function approximation on unbounded graphs, with an accompanying $\mathcal{O}(|V||E|)$-runtime method that performed at state of the art on a collection of benchmarks (where $V$ and $E$ are the sets of nodes and edges of the graph, respectively).

There are also a variety of results between non-Euclidean spaces[31] and other commonly used architectures and, more generally, algorithmically generated sets of functions, such as the convolutional neural network (CNN) architecture,[32] [33] radial basis functions,[34] or neural networks with specific properties.[35] [36]

Arbitrary-width case

A spate of papers in the 1980s and 1990s, by George Cybenko and others, established several universal approximation theorems for arbitrary width and bounded depth.[37] [38] See[39] [40] for reviews. The following is the most often quoted:

Also, certain non-continuous activation functions can be used to approximate a sigmoid function, which then allows the above theorem to apply to those functions. For example, the step function works. In particular, this shows that a perceptron network with a single infinitely wide hidden layer can approximate arbitrary functions.
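As an illustration of the step-function remark, the sketch below builds a single hidden layer of Heaviside step units whose output is a staircase following a target on $[0, 1]$. It assumes a one-dimensional continuous target and a hand-chosen set of thresholds; the names are hypothetical and the construction is not the one used in the cited papers.

```python
def step(z):
    """Heaviside step activation, as in a classical perceptron unit."""
    return 1.0 if z >= 0 else 0.0

def staircase_network(f, n):
    """One hidden layer of n step units with thresholds k/n; on [k/n, (k+1)/n)
    the network outputs f(k/n)."""
    thresholds = [k / n for k in range(1, n + 1)]
    jumps = [f(k / n) - f((k - 1) / n) for k in range(1, n + 1)]
    bias = f(0.0)

    def g(x):
        return bias + sum(c * step(x - t) for c, t in zip(jumps, thresholds))

    return g

def sup_error(f, g, samples=1001):
    """Largest error on a uniform grid over [0, 1]."""
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - g(x)) for x in grid)

def target(x):
    return x * x

err_10 = sup_error(target, staircase_network(target, 10))
err_100 = sup_error(target, staircase_network(target, 100))
print(err_10, err_100)
```

The error is controlled by how much the target can change across one stair, so it shrinks as hidden units are added; any finite network of this kind still leaves a nonzero gap for a non-constant target.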

Such an $f$ can also be approximated by a network of greater depth by using the same construction for the first layer and approximating the identity function with later layers.
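For the ReLU activation the identity does not even need to be approximated: $x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x)$ holds exactly, so extra layers can pass values through unchanged. The sketch below assumes ReLU specifically (for other activations the identity is only approximated), and the function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def identity_layer(values):
    """One ReLU hidden layer of width 2 * len(values) plus a linear readout,
    wired so that the output equals the input."""
    d = len(values)
    hidden = [relu(v) for v in values] + [relu(-v) for v in values]
    return [hidden[i] - hidden[d + i] for i in range(d)]

def deepen(values, extra_layers):
    """Pass the output of a shallow network through extra identity layers."""
    for _ in range(extra_layers):
        values = identity_layer(values)
    return values

shallow_outputs = [0.75, -2.5, 0.0]   # stand-in for the first layer's result
deepened = deepen(shallow_outputs, extra_layers=5)
print(deepened)
```

Each pass-through layer costs twice the width of the signal it carries, which is one reason the width requirements of deep narrow networks are studied separately.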

The above proof has not specified how one might use a ramp function to approximate arbitrary functions in $C_0(\mathbb{R}^n, \mathbb{R})$. A sketch of the proof is that one can first construct flat bump functions, intersect them to obtain spherical bump functions that approximate the Dirac delta function, then use those to approximate arbitrary functions in $C_0(\mathbb{R}^n, \mathbb{R})$.[41] The original proofs, such as the one by Cybenko, use methods from functional analysis, including the Hahn–Banach and Riesz–Markov–Kakutani representation theorems.
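The first step of that sketch can be made concrete in one dimension, where no intersection step is needed. The code below assumes a target on $[0, 1]$, forms each flat bump as the difference of two shifted ramp units, and weights it by the target value at the midpoint of its cell. The names, the steepness default, and the choice of midpoints are illustrative assumptions, not details from the cited source.

```python
import math

def ramp(z):
    """Ramp activation: 0 for z <= 0, z on [0, 1], 1 for z >= 1."""
    return min(1.0, max(0.0, z))

def bump(x, left, right, steepness):
    """Difference of two shifted ramps: about 1 on [left, right], 0 far outside."""
    return ramp(steepness * (x - left)) - ramp(steepness * (x - right))

def bump_network(f, n, steepness=None):
    """Sum of n weighted bumps (2n ramp units in one hidden layer) on [0, 1]."""
    s = 100.0 * n if steepness is None else steepness
    # The first left edge sits at -1/n so the first bump already equals 1 at x = 0.
    edges = [-1.0 / n] + [k / n for k in range(1, n + 1)]
    heights = [f((k + 0.5) / n) for k in range(n)]

    def g(x):
        return sum(h * bump(x, edges[k], edges[k + 1], s)
                   for k, h in enumerate(heights))

    return g

def sup_error(f, g, samples=2001):
    grid = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - g(x)) for x in grid)

def target(x):
    return math.sin(2 * math.pi * x)

coarse = sup_error(target, bump_network(target, 10))
fine = sup_error(target, bump_network(target, 50))
print(coarse, fine)
```

Narrower bumps track the target more closely, which mirrors the role of the delta-like bumps in the higher-dimensional argument.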

Notice also that the neural network is only required to approximate within a compact set $K$. The proof does not describe how the function would be extrapolated outside of the region.

The problem with polynomials may be removed by allowing the outputs of the hidden layers to be multiplied together (the "pi-sigma networks"), yielding the generalization:

Arbitrary-depth case

The "dual" versions of the theorem consider networks of bounded width and arbitrary depth. A variant of the universal approximation theorem was proved for the arbitrary depth case by Zhou Lu et al. in 2017. They showed that networks of width $n + 4$ with ReLU activation functions can approximate any Lebesgue-integrable function on $n$-dimensional input space with respect to $L^1$ distance if network depth is allowed to grow. It was also shown that if the width was less than or equal to $n$, this general expressive power to approximate any Lebesgue-integrable function was lost. In the same paper it was shown that ReLU networks with width $n + 1$ were sufficient to approximate any continuous function of $n$-dimensional input variables.[42] The following refinement, due to Park et al.,[43] specifies the optimal minimum width for which such an approximation is possible.

Together, the central result of yields the following universal approximation theorem for networks with bounded width (see also[7] for the first result of this kind).

Certain necessary conditions for the bounded width, arbitrary depth case have been established, but there is still a gap between the known sufficient and necessary conditions.[44]

Bounded depth and bounded width case

The first result on the approximation capabilities of neural networks with a bounded number of layers, each containing a limited number of artificial neurons, was obtained by Maiorov and Pinkus. Their result showed that such networks can be universal approximators and that two hidden layers are enough to achieve this property.

This is an existence result. It says that activation functions providing universal approximation property for bounded depth bounded width networks exist. Using certain algorithmic and computer programming techniques, Guliyev and Ismailov efficiently constructed such activation functions depending on a numerical parameter. The developed algorithm allows one to compute the activation functions at any point of the real axis instantly. For the algorithm and the corresponding computer code see. The theoretical result can be formulated as follows.

Here “$\sigma\colon\mathbb{R}\to\mathbb{R}$ is $\lambda$-strictly increasing on some set $X$” means that there exists a strictly increasing function $u\colon X\to\mathbb{R}$ such that $|\sigma(x)-u(x)|\le\lambda$ for all $x\in X$. Clearly, a $\lambda$-increasing function behaves like a usual increasing function as $\lambda$ gets small.

In the "depth-width" terminology, the above theorem says that for certain activation functions depth-$2$ width-$2$ networks are universal approximators for univariate functions and depth-$3$ width-$(2d+2)$ networks are universal approximators for $d$-variable functions ($d>1$).
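The definition of a $\lambda$-strictly increasing function can be checked numerically on a finite sample. The sketch below is only an illustration: it tests the two conditions on a grid rather than on all of $X$, and the example function `wiggly`, the witness `line`, and the grid are hypothetical choices, not the activation functions constructed by Guliyev and Ismailov.

```python
import math

def is_lambda_strictly_increasing(sigma, u, points, lam):
    """Check, on a finite sample only, that u is strictly increasing and that
    |sigma(x) - u(x)| <= lam at every sample point."""
    pts = sorted(set(points))
    u_is_increasing = all(u(a) < u(b) for a, b in zip(pts, pts[1:]))
    within_lambda = all(abs(sigma(x) - u(x)) <= lam for x in pts)
    return u_is_increasing and within_lambda

def wiggly(x):
    """Not monotone, but never further than 0.05 from the line u(x) = x."""
    return x + 0.05 * math.sin(40.0 * x)

def line(x):
    return x

grid = [i / 1000 for i in range(-1000, 1001)]
wiggly_is_monotone = all(wiggly(a) < wiggly(b) for a, b in zip(grid, grid[1:]))
ok_at_005 = is_lambda_strictly_increasing(wiggly, line, grid, lam=0.05)
ok_at_001 = is_lambda_strictly_increasing(wiggly, line, grid, lam=0.01)
print(wiggly_is_monotone, ok_at_005, ok_at_001)
```

Here `wiggly` passes with the witness `line` for $\lambda = 0.05$ even though it is not itself increasing. The failure for $\lambda = 0.01$ only rules out this particular witness, since the definition asks for the existence of some strictly increasing $u$.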

Notes and References

  1. Hornik . Kurt . Stinchcombe . Maxwell . White . Halbert . Multilayer feedforward networks are universal approximators . Neural Networks . January 1989 . 2 . 5 . 359–366 . 10.1016/0893-6080(89)90020-8 .
  2. Balázs Csanád Csáji (2001) Approximation with Artificial Neural Networks; Faculty of Sciences; Eötvös Loránd University, Hungary
  3. 10.1.1.441.7873 . 10.1007/BF02551274. Approximation by superpositions of a sigmoidal function. 1989. Cybenko. G.. Mathematics of Control, Signals, and Systems. 2. 4. 303–314. 1989MCSS....2..303C . 3958369.
  4. 10.1016/0893-6080(91)90009-T. Approximation capabilities of multilayer feedforward networks. 1991. Hornik. Kurt. Neural Networks. 4. 2. 251–257. 7343126 .
  5. Leshno. Moshe. Lin. Vladimir Ya.. Pinkus. Allan. Schocken. Shimon. January 1993. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks. 6. 6. 861–867. 10.1016/S0893-6080(05)80131-5. 206089312.
  6. Pinkus. Allan. January 1999. Approximation theory of the MLP model in neural networks. Acta Numerica. 8. 143–195. 10.1017/S0962492900002919. 1999AcNum...8..143P. 16800260 .
  7. Gripenberg. Gustaf. June 2003. Approximation by neural networks with a bounded number of nodes at each level. Journal of Approximation Theory . 122. 2. 260–266. 10.1016/S0021-9045(03)00078-9 .
  8. Yarotsky . Dmitry . Error bounds for approximations with deep ReLU networks . Neural Networks . October 2017 . 94 . 103–114 . 10.1016/j.neunet.2017.07.002 . 28756334 . 1610.01145 . 426133 .
  9. Lu . Zhou . Pu . Hongming . Wang . Feicheng . Hu . Zhiqiang . Wang . Liwei . The Expressive Power of Neural Networks: A View from the Width . Advances in Neural Information Processing Systems . 30 . 2017 . 6231–6239 . Curran Associates . 1709.02540 .
  10. Hanin. Boris. Sellke. Mark. Approximating Continuous Functions by ReLU Nets of Minimal Width. 1710.11278. stat.ML. 2018.
  11. Kidger. Patrick. Lyons. Terry. July 2020. Universal Approximation with Deep Narrow Networks. 1905.08539. Conference on Learning Theory.
  12. Yongqiang. Cai. 2024. Vocabulary for Universal Approximation: A Linguistic Perspective of Mapping Compositions. ICML. 5189–5208 . 2305.12205 .
  13. Maiorov. Vitaly. Pinkus. Allan. April 1999. Lower bounds for approximation by MLP neural networks. Neurocomputing. 25. 1–3. 81–91. 10.1016/S0925-2312(98)00111-8.
  14. Guliyev . Namig . Ismailov . Vugar . November 2018 . Approximation capability of two hidden layer feedforward neural networks with fixed weights . Neurocomputing . 316 . 262–269 . 2101.09181 . 10.1016/j.neucom.2018.07.075 . 52285996.
  15. Guliyev. Namig. Ismailov. Vugar. February 2018. On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Networks. 98. 296–304. 10.1016/j.neunet.2017.12.007. 29301110 . 1708.06219 . 4932839 .
  16. Shen . Zuowei . Yang . Haizhao . Zhang . Shijun . January 2022 . Optimal approximation rate of ReLU networks in terms of width and depth . Journal de Mathématiques Pures et Appliquées . 157 . 101–135 . 2103.00502 . 10.1016/j.matpur.2021.07.009 . 232075797.
  17. Park . Sejun . Yun . Chulhee . Lee . Jaeho . Shin . Jinwoo . 2021 . Minimum Width for Universal Approximation . International Conference on Learning Representations . 2006.08859.
  18. Tabuada . Paulo . Gharesifard . Bahman . 2021 . Universal approximation power of deep residual neural networks via nonlinear control theory . International Conference on Learning Representations . 2007.06007.
  19. Tabuada . Paulo . Gharesifard . Bahman . May 2023 . Universal Approximation Power of Deep Residual Neural Networks Through the Lens of Control . IEEE Transactions on Automatic Control . 68 . 5 . 2715–2728 . 10.1109/TAC.2022.3190051 . 250512115.
  20. Cai . Yongqiang . 2023-02-01 . Achieve the Minimum Width of Neural Networks for Universal Approximation . ICLR . en . 2209.11395.
  21. Kratsios . Anastasis . Papon . Léonie . 2022 . Universal Approximation Theorems for Differentiable Geometric Deep Learning . Journal of Machine Learning Research . 23 . 196 . 1–73 . 2101.05390.
  22. Hecht-Nielsen . Robert . 1987 . Kolmogorov's mapping neural network existence theorem . Proceedings of International Conference on Neural Networks, 1987 . 3 . 11–13.
  23. Ismailov . Vugar E. . July 2023 . A three layer neural network can represent any multivariate function . Journal of Mathematical Analysis and Applications . 523 . 1 . 127096 . 2012.03016 . 10.1016/j.jmaa.2023.127096 . 265100963.
  24. Liu . Ziming . Wang . Yixuan . Vaidya . Sachin . Ruehle . Fabian . Halverson . James . Soljačić . Marin . Hou . Thomas Y. . Tegmark . Max . KAN: Kolmogorov-Arnold Networks . 2024-05-24 . 2404.19756 . cs.LG .
  25. van Nuland . Teun . 2024 . Noncompact uniform universal approximation . Neural Networks . 173. 10.1016/j.neunet.2024.106181 . 38412737 . 2308.03812 .
  26. Baader . Maximilian . Mirman . Matthew . Vechev . Martin . 2020 . Universal Approximation with Certified Networks . ICLR.
  27. Gelenbe . Erol . Mao . Zhi Hong . Li . Yan D. . 1999 . Function approximation with spiked random networks . IEEE Transactions on Neural Networks . 10 . 1 . 3–9 . 10.1109/72.737488 . 18252498.
  28. Lin . Hongzhou . Jegelka . Stefanie . 2018 . ResNet with one-neuron hidden layers is a Universal Approximator . Curran Associates . 30 . 6169–6178 . Advances in Neural Information Processing Systems.
  29. Xu . Keyulu . Hu . Weihua . Leskovec . Jure . Jegelka . Stefanie . 2019 . How Powerful are Graph Neural Networks? . International Conference on Learning Representations.
  30. Brüel-Gabrielsson . Rickard . 2020 . Universal Function Approximation on Graphs . Curran Associates . 33 . Advances in Neural Information Processing Systems.
  31. Kratsios . Anastasis . Bilokopytov . Eugene . 2020 . Non-Euclidean Universal Approximation . Curran Associates . 33 . Advances in Neural Information Processing Systems.
  32. Zhou . Ding-Xuan . 2020 . Universality of deep convolutional neural networks . . 48 . 2 . 787–794 . 1805.10769 . 10.1016/j.acha.2019.06.004 . 44113176.
  33. Heinecke . Andreas . Ho . Jinn . Hwang . Wen-Liang . 2020 . Refinement and Universal Approximation via Sparsely Connected ReLU Convolution Nets . IEEE Signal Processing Letters . 27 . 1175–1179 . 2020ISPL...27.1175H . 10.1109/LSP.2020.3005051 . 220669183.
  34. Park . J. . Sandberg . I. W. . 1991 . Universal Approximation Using Radial-Basis-Function Networks . Neural Computation . 3 . 2 . 246–257 . 10.1162/neco.1991.3.2.246 . 31167308 . 34868087.
  35. Yarotsky . Dmitry . 2021 . Universal Approximations of Invariant Maps by Neural Networks . Constructive Approximation . 55 . 407–474 . 1804.10306 . 10.1007/s00365-021-09546-1 . 13745401.
  36. Zakwan . Muhammad . d’Angelo . Massimiliano . Ferrari-Trecate . Giancarlo . 2023 . Universal Approximation Property of Hamiltonian Deep Neural Networks . IEEE Control Systems Letters . 1 . 2303.12147 . 10.1109/LCSYS.2023.3288350 . 257663609.
  37. Funahashi . Ken-Ichi . On the approximate realization of continuous mappings by neural networks . Neural Networks . January 1989 . 2 . 3 . 183–192 . 10.1016/0893-6080(89)90003-8 .
  38. Hornik . Kurt . Stinchcombe . Maxwell . White . Halbert . Multilayer feedforward networks are universal approximators . Neural Networks . January 1989 . 2 . 5 . 359–366 . 10.1016/0893-6080(89)90020-8 .
  39. Haykin, Simon (1998). Neural Networks: A Comprehensive Foundation, Volume 2, Prentice Hall.
  40. Hassoun, M. (1995) Fundamentals of Artificial Neural Networks MIT Press, p. 48
  41. Nielsen . Michael A. . 2015 . Neural Networks and Deep Learning . en.
  42. Hanin, B. (2018). Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv preprint arXiv:1710.11278.
  43. Park . Sejun . Yun . Chulhee . Lee . Jaeho . Shin . Jinwoo . 2020-09-28 . Minimum Width for Universal Approximation . ICLR . 2006.08859 . en.
  44. Johnson . Jesse . International Conference on Learning Representations . 2019 . Deep, Skinny Neural Networks are not Universal Approximators.