Within bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs. For some such methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature. It is helpful to understand them from a Bayesian perspective. Because the kernels are not necessarily positive semidefinite, the underlying structure may not be inner product spaces, but instead more general reproducing kernel Hilbert spaces. In Bayesian probability kernel methods are a key component of Gaussian processes, where the kernel function is known as the covariance function. Kernel methods have traditionally been used in supervised learning problems where the input space is usually a space of vectors while the output space is a space of scalars. More recently these methods have been extended to problems that deal with multiple outputs such as in multi-task learning.[1]
A mathematical equivalence between the regularization and the Bayesian point of view is easily proved in cases where the reproducing kernel Hilbert space is finite-dimensional. The infinite-dimensional case raises subtle mathematical issues; we will consider here the finite-dimensional case. We start with a brief review of the main ideas underlying kernel methods for scalar learning, and briefly introduce the concepts of regularization and Gaussian processes. We then show how both points of view arrive at essentially equivalent estimators, and show the connection that ties them together.
The classical supervised learning problem requires estimating the output for some new input point
x'
\hat{f}(x')
S
n
S=(X,Y)=(x1,y1),\ldots,(xn,yn)
k( ⋅ , ⋅ )
where
K\equivk(X,X)
Kij=k(xi,xj)
k=[k(x1,x'),\ldots,k(x
\top | |
n,x')] |
Y=[y1,\ldots,y
\top | |
n] |
The main assumption in the regularization perspective is that the set of functions
l{F}
l{H}k
A reproducing kernel Hilbert space (RKHS)
l{H}k
k:l{X} x l{X} → R
k(x, ⋅ )
l{H}k
x\inl{X}
1. The reproducing property, which gives name to the space,
f(x)=\langlef,k(x, ⋅ )\ranglek, \forall f\inl{H}k,
where
\langle ⋅ , ⋅ \ranglek
l{H}k
2. Functions in an RKHS are in the closure of the linear combination of the kernel at given points,
f(x)=\sumik(xi,x)ci
This allows the construction in a unified framework of both linear and generalized linear models.
3. The squared norm in an RKHS can be written as
2 | |
\|f\| | |
k |
=\sumi,jk(xi,xj)cicj
and could be viewed as measuring the complexity of the function.
The estimator is derived as the minimizer of the regularized functional
where
f\inl{H}k
\| ⋅ \|k
l{H}k
f(xi)
yi
f(xi)
yi
λ
λ
λ
The explicit form of the estimator in equation is derived in two steps. First, the representer theorem[8] [9] [10] states that the minimizer of the functional can always be written as a linear combination of the kernels centered at the training-set points,
for some
c\inRn
c=[c1,\ldots,c
\top | |
n] |
f( ⋅ )
2 | |
\begin{align} \|f\| | |
k |
&=\langlef,f\ranglek,\\ &=\left\langle
N | |
\sum | |
i=1 |
cik(xi, ⋅ ),
N | |
\sum | |
j=1 |
cjk(xj, ⋅ )\right\ranglek,\\ &=
N | |
\sum | |
i=1 |
N | |
\sum | |
j=1 |
cicj\langlek(xi, ⋅ ),k(xj, ⋅ )\ranglek,\\ &=
N | |
\sum | |
i=1 |
N | |
\sum | |
j=1 |
cicjk(xi,xj),\\ &=c\topKc. \end{align}
We can rewrite the functional as
1 | |
n |
\|y-Kc\|2+λc\topKc.
This functional is convex in
c
c
\begin{align} - | 1 |
n |
K(Y-Kc)+λKc&=0,\\ (K+λnI)c&=Y,\\ c&=(K+λnI)-1Y. \end{align}
Substituting this expression for the coefficients in equation, we obtain the estimator stated previously in equation,
\hat{f}(x')=k\top(K+λnI)-1Y.
The notion of a kernel plays a crucial role in Bayesian probability as the covariance function of a stochastic process called the Gaussian process.
As part of the Bayesian framework, the Gaussian process specifies the prior distribution that describes the prior beliefs about the properties of the function being modeled. These beliefs are updated after taking into account observational data by means of a likelihood function that relates the prior beliefs to the observations. Taken together, the prior and likelihood lead to an updated distribution called the posterior distribution that is customarily used for predicting test cases.
A Gaussian process (GP) is a stochastic process in which any finite number of random variables that are sampled follow a joint Normal distribution. The mean vector and covariance matrix of the Gaussian distribution completely specify the GP. GPs are usually used as a priori distribution for functions, and as such the mean vector and covariance matrix can be viewed as functions, where the covariance function is also called the kernel of the GP. Let a function
f
m
k
f\siml{GP}(m,k).
In terms of the underlying Gaussian distribution, we have that for any finite set
X=\{xi\}
n | |
i=1 |
f(X)=[f(x1),\ldots,f(x
\top | |
n)] |
f(X)\siml{N}(m,K),
where
m=m(X)=[m(x1),\ldots,m(x
\top | |
N)] |
K=k(X,X)
In a regression context, the likelihood function is usually assumed to be a Gaussian distribution and the observations to be independent and identically distributed (iid),
p(y|f,x,\sigma2)=l{N}(f(x),\sigma2).
This assumption corresponds to the observations being corrupted with zero-mean Gaussian noise with variance
\sigma2
X
\sigma2
x'
S=\{X,Y\}
p(f(x')|S,x',\boldsymbol{\phi})=l{N}(m(x'),\sigma2(x')),
where
\boldsymbol{\phi}
\sigma2
k
\begin{align} m(x')&=k\top(K+\sigma2I)-1Y,\\ \sigma2(x')&=k(x',x')-k\top(K+\sigma2I)-1k. \end{align}
A connection between regularization theory and Bayesian theory can only be achieved in the case of finite dimensional RKHS. Under this assumption, regularization theory and Bayesian theory are connected through Gaussian process prediction.[11] [12] [13]
In the finite dimensional case, every RKHS can be described in terms of a feature map
\Phi:l{X} → Rp
k(x,x')=
p | |
\sum | |
i=1 |
\Phii(x)\Phii(x').
Functions in the RKHS with kernel
K
fw(x)=
p | |
\sum | |
i=1 |
wi\Phii(x)=\langlew,\Phi(x)\rangle,
and we also have that
\|fw\|k=\|w\|.
We can now build a Gaussian process by assuming
w=[w1,\ldots,wp]\top
w\siml{N}(0,I)\propto\exp(-\|w\|2).
If we assume a Gaussian likelihood we have
P(Y|X,f)=l{N}(f(X),\sigma2I)\propto\exp\left(-
1 | |
\sigma2 |
\|fw(X)-Y\|2\right),
where
fw(X)=(\langlew,\Phi(x1)\rangle,\ldots,\langlew,\Phi(xn\rangle)
P(f|X,Y)\propto\exp\left(-
1 | |
\sigma2 |
\|fw(X)-
2 | |
Y\| | |
n |
+\|w\|2\right)
We can see that a maximum posterior (MAP) estimate is equivalent to the minimization problem defining Tikhonov regularization, where in the Bayesian case the regularization parameter is related to the noise variance.
From a philosophical perspective, the loss function in a regularization setting plays a different role than the likelihood function in the Bayesian setting. Whereas the loss function measures the error that is incurred when predicting
f(x)
y
f
y