In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. In particular, it is commonly used in support vector machine classification.[1]
The RBF kernel on two samples x \in \mathbb{R}^k and x', represented as feature vectors in some input space, is defined as

K(x,x') = \exp\left(-\frac{\|x-x'\|^2}{2\sigma^2}\right)

\|x-x'\|^2 may be recognized as the squared Euclidean distance between the two feature vectors, and \sigma is a free parameter. An equivalent definition involves a parameter \gamma = \tfrac{1}{2\sigma^2}:

K(x,x') = \exp(-\gamma\|x-x'\|^2)
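As a concrete illustration, the two parameterizations give identical values when \gamma = \tfrac{1}{2\sigma^2}. A minimal Python sketch (function names and test vectors are illustrative, not taken from any particular library):

```python
import numpy as np

def rbf_kernel_sigma(x, x_prime, sigma=1.0):
    """RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = np.sum((x - x_prime) ** 2)
    return np.exp(-sq_dist / (2 * sigma ** 2))

def rbf_kernel_gamma(x, x_prime, gamma=0.5):
    """Equivalent form K(x, x') = exp(-gamma ||x - x'||^2) with gamma = 1/(2 sigma^2)."""
    sq_dist = np.sum((x - x_prime) ** 2)
    return np.exp(-gamma * sq_dist)

x = np.array([1.0, 2.0, 3.0])
x_prime = np.array([1.5, 1.0, 2.5])
sigma = 2.0
gamma = 1.0 / (2 * sigma ** 2)

# The two parameterizations agree.
print(rbf_kernel_sigma(x, x_prime, sigma))   # ~0.829
print(rbf_kernel_gamma(x, x_prime, gamma))   # same value
```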
Since the value of the RBF kernel decreases with distance and ranges between zero (in the infinite-distance limit) and one (when x = x'), it has a ready interpretation as a similarity measure.[2] The feature space of the kernel has an infinite number of dimensions; for \sigma = 1, its expansion is:
\begin{alignat}{2}
\exp\left(-\tfrac{1}{2}\|x-x'\|^2\right) &= \exp\left(\tfrac{2}{2}x^\top x' - \tfrac{1}{2}\|x\|^2 - \tfrac{1}{2}\|x'\|^2\right)\\[5pt]
&= \exp(x^\top x')\exp\left(-\tfrac{1}{2}\|x\|^2\right)\exp\left(-\tfrac{1}{2}\|x'\|^2\right)\\[5pt]
&= \sum_{j=0}^{\infty}\frac{(x^\top x')^j}{j!}\exp\left(-\tfrac{1}{2}\|x\|^2\right)\exp\left(-\tfrac{1}{2}\|x'\|^2\right)\\[5pt]
&= \sum_{j=0}^{\infty}\sum_{n_1+n_2+\dots+n_k=j}\exp\left(-\tfrac{1}{2}\|x\|^2\right)\frac{x_1^{n_1}\cdots x_k^{n_k}}{\sqrt{n_1!\cdots n_k!}}\exp\left(-\tfrac{1}{2}\|x'\|^2\right)\frac{{x'}_1^{n_1}\cdots {x'}_k^{n_k}}{\sqrt{n_1!\cdots n_k!}}\\[5pt]
&= \langle\varphi(x),\varphi(x')\rangle
\end{alignat}

where

\varphi(x) = \exp\left(-\tfrac{1}{2}\|x\|^2\right)\left(a^{(0)}_{\ell_0}, a^{(1)}_{1}, \ldots, a^{(1)}_{\ell_1}, \ldots, a^{(j)}_{1}, \ldots, a^{(j)}_{\ell_j}, \ldots\right)

\ell_j = \tbinom{k+j-1}{j}

a^{(j)}_{\ell} = \frac{x_1^{n_1}\cdots x_k^{n_k}}{\sqrt{n_1!\cdots n_k!}}, \qquad n_1+n_2+\dots+n_k = j,\ 1 \le \ell \le \ell_j
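The series form can be checked numerically: truncating \exp(x^\top x') at a finite order and multiplying by the two Gaussian factors already recovers the kernel value for \sigma = 1. A rough sketch, with illustrative inputs and truncation order:

```python
import numpy as np
from math import factorial

def rbf_exact(x, x_prime):
    """Exact kernel exp(-||x - x'||^2 / 2) for sigma = 1."""
    return np.exp(-0.5 * np.sum((x - x_prime) ** 2))

def rbf_truncated(x, x_prime, order=15):
    """Truncate the series exp(x.x') = sum_j (x.x')^j / j!, then multiply
    by the Gaussian factors exp(-||x||^2/2) and exp(-||x'||^2/2)."""
    dot = np.dot(x, x_prime)
    series = sum(dot ** j / factorial(j) for j in range(order + 1))
    envelope = np.exp(-0.5 * np.dot(x, x)) * np.exp(-0.5 * np.dot(x_prime, x_prime))
    return series * envelope

x = np.array([0.3, -0.7, 1.1])
x_prime = np.array([0.5, 0.2, 0.9])
print(rbf_exact(x, x_prime))          # exact kernel value
print(rbf_truncated(x, x_prime, 15))  # converges to the exact value as the order grows
```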
Because support vector machines and other models employing the kernel trick do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced.[4] Typically, these take the form of a function z that maps a single vector to a vector of higher dimensionality, approximating the kernel:
\langle z(x), z(x')\rangle \approx \langle\varphi(x),\varphi(x')\rangle = K(x,x')

where \varphi is the implicit mapping embedded in the RBF kernel.
See main article: Random Fourier feature.
One way to construct such a z is to randomly sample from the Fourier transformation of the kernel:[5]

\varphi(x) = \frac{1}{\sqrt{D}}\left[\cos\langle w_1,x\rangle, \sin\langle w_1,x\rangle, \ldots, \cos\langle w_D,x\rangle, \sin\langle w_D,x\rangle\right]^\top

where w_1,\ldots,w_D are independent samples from the normal distribution N(0,\sigma^{-2}I).
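A minimal sketch of this construction in Python, assuming the sampling scheme above (the dimension D and the test points are arbitrary choices): the explicit features are built once from a shared set of frequencies w_1, \ldots, w_D, and their inner products approximate K(x, x').

```python
import numpy as np

rng = np.random.default_rng(0)

def random_fourier_features(X, W):
    """Map each row x of X to (1/sqrt(D)) [cos<w_1,x>, ..., cos<w_D,x>, sin<w_1,x>, ..., sin<w_D,x>].
    Reordering the components does not change inner products."""
    proj = X @ W.T                     # inner products <w_i, x>
    D = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

sigma = 1.0
D = 2000
k = 3
W = rng.normal(scale=1.0 / sigma, size=(D, k))   # w_1, ..., w_D drawn i.i.d. from N(0, sigma^{-2} I)

x = np.array([[0.2, -1.0, 0.5]])
y = np.array([[1.0, 0.3, -0.4]])

Z = random_fourier_features(np.vstack([x, y]), W)
approx = Z[0] @ Z[1]                                       # <z(x), z(y)>
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # K(x, y)
print(approx, exact)                                       # close for large D
```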
Theorem: \operatorname{E}[\langle\varphi(x),\varphi(y)\rangle] = e^{-\|x-y\|^2/(2\sigma^2)}.
Proof: It suffices to prove the case of D = 1. Use the trigonometric identity \cos(a-b) = \cos(a)\cos(b) + \sin(a)\sin(b), the spherical symmetry of the Gaussian distribution, then evaluate the integral

\int_{-\infty}^{\infty} \cos(kx)\,\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx = e^{-k^2/2}.
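The integral identity used in the proof can be checked numerically; a quick sketch with an arbitrary frequency k and grid:

```python
import numpy as np

k = 1.7                                   # any fixed frequency works
x = np.linspace(-10.0, 10.0, 200001)      # wide grid; the integrand decays rapidly
dx = x[1] - x[0]
integrand = np.cos(k * x) * np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
print(np.sum(integrand) * dx)             # numerically ~ exp(-k^2 / 2)
print(np.exp(-k ** 2 / 2))
```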
Theorem: \operatorname{Var}[\langle\varphi(x),\varphi(y)\rangle] = O(D^{-1}).
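Consistent with this bound, the spread of the Monte Carlo estimate shrinks roughly like 1/D as more frequencies are drawn; a rough empirical check with arbitrary inputs and sample counts:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([0.2, -1.0, 0.5])
y = np.array([1.0, 0.3, -0.4])
exact = np.exp(-np.sum((x - y) ** 2) / 2)        # K(x, y) for sigma = 1

def rff_estimate(D):
    """One draw of <phi(x), phi(y)>; by the cos(a-b) identity this equals mean_i cos<w_i, x-y>."""
    W = rng.normal(size=(D, 3))                  # w_i ~ N(0, I) for sigma = 1
    return np.mean(np.cos(W @ (x - y)))

for D in (10, 100, 1000):
    estimates = np.array([rff_estimate(D) for _ in range(500)])
    print(D, np.mean(estimates) - exact, np.var(estimates))   # variance shrinks roughly like 1/D
```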
Another approach uses the Nyström method to approximate the eigendecomposition of the Gram matrix K, using only a random sample of the training set.[7]
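A minimal sketch of the Nyström idea, assuming the RBF kernel defined above and an arbitrary landmark count m (not the specific construction of [7]): a random subset of training points gives a low-rank factorization K \approx C W^{+} C^\top, where C holds kernel values between all points and the landmarks and W is the landmark-landmark block.

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_gram(A, B, sigma=1.0):
    """Pairwise RBF kernel values between the rows of A and B."""
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

X = rng.normal(size=(500, 5))               # toy training set
m = 50                                      # number of landmark points
idx = rng.choice(len(X), size=m, replace=False)

C = rbf_gram(X, X[idx])                     # n x m block of the Gram matrix
W = C[idx]                                  # m x m landmark block
K_nystrom = C @ np.linalg.pinv(W) @ C.T     # rank-m approximation of the full Gram matrix

K_full = rbf_gram(X, X)
print(np.linalg.norm(K_full - K_nystrom) / np.linalg.norm(K_full))  # relative error
```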