In statistical learning theory, a representer theorem is any of several related results stating that a minimizer $f^*$ of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.
The following Representer Theorem and its proof are due to Schölkopf, Herbrich, and Smola:[1]
Theorem: Consider a positive-definite real-valued kernel $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ on a non-empty set $\mathcal{X}$ with a corresponding reproducing kernel Hilbert space $\mathcal{H}_k$. Let there be given a training sample $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$, a strictly increasing real-valued function $g \colon [0, \infty) \to \mathbb{R}$, and an arbitrary error function $E \colon (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \{\infty\}$, which together define the following regularized empirical risk functional on $\mathcal{H}_k$:

$$f \mapsto E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) + g\left(\lVert f \rVert\right).$$
Then, any minimizer of the empirical risk

$$f^* = \underset{f \in \mathcal{H}_k}{\operatorname{argmin}} \left\{ E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) + g\left(\lVert f \rVert\right) \right\}, \qquad (*)$$

admits a representation of the form

$$f^*(\cdot) = \sum_{i=1}^n \alpha_i \, k(\cdot, x_i),$$

where $\alpha_i \in \mathbb{R}$ for all $1 \le i \le n$.
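As an illustration not part of the original statement, the soft-margin support-vector machine fits this template: taking

$$E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) = C \sum_{i=1}^n \max\left(0,\, 1 - y_i f(x_i)\right), \qquad g(t) = \tfrac{1}{2} t^2$$

for some $C > 0$ recovers the usual SVM objective, and the theorem then guarantees that the optimal classifier is a kernel expansion over the training points.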
Proof: Define a mapping

$$\begin{align} \varphi \colon \mathcal{X} &\to \mathcal{H}_k \\ \varphi(x) &= k(\cdot, x) \end{align}$$

(so that $\varphi(x) = k(\cdot, x)$ is itself a map $\mathcal{X} \to \mathbb{R}$). Since $k$ is a reproducing kernel, then

$$\varphi(x)(x') = k(x', x) = \langle \varphi(x'), \varphi(x) \rangle,$$

where $\langle \cdot, \cdot \rangle$ is the inner product on $\mathcal{H}_k$.
Given any $x_1, \ldots, x_n$, one can use orthogonal projection to decompose any $f \in \mathcal{H}_k$ into a sum of two functions, one lying in $\operatorname{span}\left\{\varphi(x_1), \ldots, \varphi(x_n)\right\}$ and the other lying in the orthogonal complement:

$$f = \sum_{i=1}^n \alpha_i \varphi(x_i) + v,$$

where $\langle v, \varphi(x_i) \rangle = 0$ for all $i$.
The above orthogonal decomposition and the reproducing property together show that applying $f$ to any training point $x_j$ produces

$$f(x_j) = \left\langle \sum_{i=1}^n \alpha_i \varphi(x_i) + v,\ \varphi(x_j) \right\rangle = \sum_{i=1}^n \alpha_i \langle \varphi(x_i), \varphi(x_j) \rangle,$$

which we observe is independent of $v$.
Consequently, the value of the error function $E$ in $(*)$ is likewise independent of $v$. For the second term (the regularizer), since $v$ is orthogonal to $\sum_{i=1}^n \alpha_i \varphi(x_i)$ and $g$ is strictly increasing, we have

$$\begin{align} g\left(\lVert f \rVert\right) &= g\left(\left\lVert \sum_{i=1}^n \alpha_i \varphi(x_i) + v \right\rVert\right) \\ &= g\left(\sqrt{\left\lVert \sum_{i=1}^n \alpha_i \varphi(x_i) \right\rVert^2 + \lVert v \rVert^2}\right) \\ &\ge g\left(\left\lVert \sum_{i=1}^n \alpha_i \varphi(x_i) \right\rVert\right), \end{align}$$

where the middle step is the Pythagorean theorem for orthogonal vectors, and equality holds if and only if $v = 0$.
Therefore setting $v = 0$ does not affect the first term of $(*)$, while it strictly decreases the second term. Consequently, any minimizer $f^*$ in $(*)$ must have $v = 0$, i.e., it must be of the form

$$f^*(\cdot) = \sum_{i=1}^n \alpha_i \varphi(x_i) = \sum_{i=1}^n \alpha_i \, k(\cdot, x_i),$$
which is the desired result.
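The pivotal observation of the proof, that the orthogonal component $v$ leaves every prediction $f(x_j)$ unchanged while only inflating the norm, can be checked numerically. The following sketch substitutes an explicit finite-dimensional feature map for $\varphi$; the dimensions, random data, and variable names are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite-dimensional stand-in for the feature map:
# row i of Phi plays the role of phi(x_i) in H_k.
d, n = 5, 3
Phi = rng.standard_normal((n, d))

alpha = rng.standard_normal(n)
w_span = Phi.T @ alpha                 # sum_i alpha_i phi(x_i), inside the span

# Build v orthogonal to every phi(x_i): project a random vector onto
# the orthogonal complement of span{phi(x_1), ..., phi(x_n)}.
z = rng.standard_normal(d)
P = Phi.T @ np.linalg.pinv(Phi.T)      # orthogonal projector onto the span
v = z - P @ z
assert np.allclose(Phi @ v, 0.0)       # <v, phi(x_i)> = 0 for all i

w = w_span + v                         # the decomposition f = sum_i ... + v

# Predictions f(x_j) = <w, phi(x_j)> are unchanged by v ...
assert np.allclose(Phi @ w, Phi @ w_span)
# ... while the norm, and hence g(||f||), can only grow (Pythagoras).
print(np.linalg.norm(w_span), "<=", np.linalg.norm(w))
```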
The Theorem stated above is a particular example of a family of results that are collectively referred to as "representer theorems"; here we describe several such.
The first statement of a representer theorem was due to Kimeldorf and Wahba for the special case in which

$$\begin{align} E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) &= \frac{1}{n} \sum_{i=1}^n (f(x_i) - y_i)^2, \\ g(\lVert f \rVert) &= \lambda \lVert f \rVert^2 \end{align}$$

for $\lambda > 0$. Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function $g(\cdot)$ of the Hilbert space norm.
It is possible to generalize further by augmenting the regularized empirical risk functional through the addition of unpenalized offset terms. For example, Schölkopf, Herbrich, and Smola also consider the minimization
$$\tilde{f}^* = \operatorname{argmin} \left\{ E\left((x_1, y_1, \tilde{f}(x_1)), \ldots, (x_n, y_n, \tilde{f}(x_n))\right) + g\left(\lVert f \rVert\right) \;\middle|\; \tilde{f} = f + h \in \mathcal{H}_k \oplus \operatorname{span}\{\psi_p \mid 1 \le p \le M\} \right\}, \qquad (\dagger)$$

i.e., we consider functions of the form $\tilde{f} = f + h$, where $f \in \mathcal{H}_k$ and $h$ is an unpenalized function lying in the span of a finite set of real-valued functions $\{\psi_p \colon \mathcal{X} \to \mathbb{R} \mid 1 \le p \le M\}$. Under the assumption that the $n \times M$ matrix $\left(\psi_p(x_i)\right)_{ip}$ has rank $M$, they show that the minimizer $\tilde{f}^*$ in $(\dagger)$ admits a representation of the form

$$\tilde{f}^*(\cdot) = \sum_{i=1}^n \alpha_i \, k(\cdot, x_i) + \sum_{p=1}^M \beta_p \psi_p(\cdot),$$

where $\alpha_i, \beta_p \in \mathbb{R}$ and the $\beta_p$ are all uniquely determined.
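Schölkopf, Herbrich, and Smola state this result abstractly; as a concrete sketch for the squared-loss case with $g(\lVert f \rVert) = \lambda \lVert f \rVert^2$, the coefficients can be obtained from one block-linear system derived from the first-order conditions. The helper name `semiparametric_krr`, the Gaussian kernel, and the synthetic data below are all assumptions made for this illustration, not taken from the cited paper.

```python
import numpy as np

def semiparametric_krr(K, Psi, y, lam):
    """Fit f~ = sum_i alpha_i k(., x_i) + sum_p beta_p psi_p(.), with the
    psi_p unpenalized, by solving the block-linear system

        [ K + lam*I   Psi ] [alpha]   [y]
        [ Psi^T        0  ] [beta ] = [0]

    which follows from the first-order conditions of the squared-loss
    objective when K is positive definite and Psi has full column rank."""
    n, M = Psi.shape
    top = np.hstack([K + lam * np.eye(n), Psi])
    bottom = np.hstack([Psi.T, np.zeros((M, M))])
    sol = np.linalg.solve(np.vstack([top, bottom]),
                          np.concatenate([y, np.zeros(M)]))
    return sol[:n], sol[n:]            # alpha in R^n, beta in R^M

# Example: Gaussian kernel plus an unpenalized affine part
# psi_1(x) = 1, psi_2(x) = x (the n x M matrix below has rank M = 2).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 20)
y = 2.0 + 3.0 * x + 0.3 * np.sin(6.0 * x) + 0.05 * rng.standard_normal(20)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
Psi = np.column_stack([np.ones_like(x), x])
alpha, beta = semiparametric_krr(K, Psi, y, lam=1e-2)
print(beta)   # roughly recovers the unpenalized affine trend (2, 3)
```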
The conditions under which a representer theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following:
Theorem: Let $\mathcal{X}$ be a nonempty set, $k$ a positive-definite real-valued kernel on $\mathcal{X} \times \mathcal{X}$ with corresponding reproducing kernel Hilbert space $\mathcal{H}_k$, and let $R \colon \mathcal{H}_k \to \mathbb{R}$ be a differentiable regularization function. Then given a training sample $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$ and an arbitrary error function $E \colon (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \{\infty\}$, a minimizer

$$f^* = \underset{f \in \mathcal{H}_k}{\operatorname{argmin}} \left\{ E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) + R(f) \right\} \qquad (\ddagger)$$

of the regularized empirical risk admits a representation of the form

$$f^*(\cdot) = \sum_{i=1}^n \alpha_i \, k(\cdot, x_i),$$

where $\alpha_i \in \mathbb{R}$ for all $1 \le i \le n$, if and only if there exists a nondecreasing function $h \colon [0, \infty) \to \mathbb{R}$ for which

$$R(f) = h(\lVert f \rVert).$$
Effectively, this result provides a necessary and sufficient condition on a differentiable regularizer $R(\cdot)$ under which the corresponding regularized empirical risk minimization $(\ddagger)$ will have a representer theorem.
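As a concrete check of the condition (an illustration added here, not part of the original theorem): the standard ridge regularizer $R(f) = \lambda \lVert f \rVert^2$ with $\lambda > 0$ qualifies, since it is differentiable and $R(f) = h(\lVert f \rVert)$ for the nondecreasing function $h(t) = \lambda t^2$ on $[0, \infty)$.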
Representer theorems are useful from a practical standpoint because they dramatically simplify the regularized empirical risk minimization problem $(\ddagger)$. In most interesting applications, the search domain $\mathcal{H}_k$ for the minimization will be an infinite-dimensional subspace of $L^2(\mathcal{X})$, and therefore the search (as written) does not admit implementation on finite-memory, finite-precision computers. In contrast, the representation of $f^*(\cdot)$ afforded by a representer theorem reduces the original (infinite-dimensional) minimization problem to a search over the optimal $n$-dimensional vector of coefficients $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$; $\alpha$ can then be obtained by applying any standard function minimization algorithm.
The following provides an example of how to solve for the minimizer whose existence is guaranteed by the representer theorem. This method works for any positive-definite kernel $k$, and allows us to transform a complicated (possibly infinite-dimensional) optimization problem into a simple linear system that can be solved numerically.
Assume that we are using a least squares error function

$$E[(x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))] := \sum_{i=1}^n (y_i - f(x_i))^2$$

and a regularization function $g(x) = \lambda x^2$ for some $\lambda > 0$.
By the representer theorem, the minimizer

$$f^* = \underset{f \in \mathcal{H}}{\operatorname{argmin}} \left\{ E[(x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))] + g\left(\lVert f \rVert_{\mathcal{H}}\right) \right\}$$

has the form

$$f^*(x) = \sum_{i=1}^n \alpha_i^* \, k(x, x_i)$$

for some $\alpha^* = (\alpha_1^*, \ldots, \alpha_n^*) \in \mathbb{R}^n$.
Noting that

$$\lVert f \rVert_{\mathcal{H}}^2 = \left\langle \sum_{i=1}^n \alpha_i k(\cdot, x_i),\ \sum_{j=1}^n \alpha_j k(\cdot, x_j) \right\rangle = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \alpha^\intercal A \alpha,$$

we see that $\alpha^*$ satisfies

$$\alpha^* = \underset{\alpha \in \mathbb{R}^n}{\operatorname{argmin}} \left\{ \sum_{i=1}^n \left(y_i - \sum_{j=1}^n \alpha_j k(x_i, x_j)\right)^2 + \lambda\, \alpha^\intercal A \alpha \right\} = \underset{\alpha \in \mathbb{R}^n}{\operatorname{argmin}} \left\{ \lVert y - A\alpha \rVert^2 + \lambda\, \alpha^\intercal A \alpha \right\},$$

where $A_{ij} = k(x_i, x_j)$ and $y = (y_1, \ldots, y_n)$. Expanding the first term, using the symmetry of $A$, and discarding the constant $y^\intercal y$ (which does not depend on $\alpha$) yields the equivalent problem

$$\alpha^* = \underset{\alpha \in \mathbb{R}^n}{\operatorname{argmin}} \left\{ \alpha^\intercal (A^\intercal A + \lambda A)\alpha - 2\alpha^\intercal A y \right\}.$$
Since $A^\intercal A + \lambda A$ is a positive-definite matrix (as $A$ is positive definite and $\lambda > 0$), the function $F(\alpha) = \alpha^\intercal (A^\intercal A + \lambda A)\alpha - 2\alpha^\intercal A y$ is strictly convex, so $F$ is minimized at the unique point $\alpha^*$ where $\nabla_\alpha F = 0$:

$$\nabla_\alpha F = 2(A^\intercal A + \lambda A)\alpha^* - 2Ay = 0 \;\Longrightarrow\; \alpha^* = (A^\intercal A + \lambda A)^{-1} A y,$$
so the minimizer may be found via a linear solve.
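A minimal NumPy sketch of this linear solve, under the assumptions above (squared loss, $g(x) = \lambda x^2$, and an invertible Gram matrix); the Gaussian kernel, the synthetic data, and the helper names `fit_krr` and `predict` are hypothetical choices for illustration. It uses the algebraically equivalent form $\alpha^* = (A + \lambda I)^{-1} y$, obtained by cancelling one factor of $A$ in $(A^\intercal A + \lambda A)^{-1} A y = (A(A + \lambda I))^{-1} A y$, which is better conditioned numerically.

```python
import numpy as np

def fit_krr(kernel, X, y, lam):
    """Solve for alpha* in f*(x) = sum_i alpha_i k(x, x_i).

    Uses alpha* = (A + lam*I)^{-1} y, equivalent to the formula
    (A^T A + lam*A)^{-1} A y above whenever the Gram matrix A is
    invertible, and computed as a linear solve rather than by an
    explicit matrix inverse."""
    A = kernel(X[:, None], X[None, :])     # Gram matrix A_ij = k(x_i, x_j)
    return np.linalg.solve(A + lam * np.eye(len(X)), y)

def predict(kernel, X_train, alpha, X_new):
    """Evaluate f*(x) = sum_i alpha_i k(x, x_i) at new points."""
    return kernel(X_new[:, None], X_train[None, :]) @ alpha

# Hypothetical usage with a Gaussian (RBF) kernel, which is positive definite:
rbf = lambda a, b: np.exp(-((a - b) ** 2) / 0.1)
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(30)
alpha = fit_krr(rbf, X, y, lam=1e-3)
print(predict(rbf, X, alpha, np.array([0.25])))   # approx sin(pi/2) = 1
```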