Laplace's approximation

Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.[1] [2] The approximation is justified by the Bernstein–von Mises theorem, which states that, under regularity conditions, the error of the approximation tends to 0 as the number of data points tends to infinity.[3] [4]

For example, consider a regression or classification model with data set $\{x_n, y_n\}_{n=1,\ldots,N}$ comprising inputs $x$ and outputs $y$ with (unknown) parameter vector $\theta$ of length $D$. The likelihood is denoted $p(\mathbf{y}|\mathbf{x},\theta)$ and the parameter prior $p(\theta)$. Suppose one wants to approximate the joint density of outputs and parameters $p(\mathbf{y},\theta|\mathbf{x})$. Bayes' formula reads:

$$p(\mathbf{y},\theta|\mathbf{x}) = p(\mathbf{y}|\mathbf{x},\theta)\,p(\theta|\mathbf{x}) = p(\mathbf{y}|\mathbf{x})\,p(\theta|\mathbf{y},\mathbf{x}) \simeq \tilde{q}(\theta) = Z q(\theta).$$

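The joint is the product of the likelihood and the prior and, by Bayes' rule, also the product of the marginal likelihood $p(\mathbf{y}|\mathbf{x})$ and the posterior $p(\theta|\mathbf{y},\mathbf{x})$. Seen as a function of $\theta$, the joint is an un-normalised density.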

In Laplace's approximation, we approximate the joint by an un-normalised Gaussian $\tilde{q}(\theta) = Z q(\theta)$, where we use $q$ to denote approximate density, $\tilde{q}$ for un-normalised density and $Z$ the normalisation constant of $\tilde{q}$ (independent of $\theta$). Since the marginal likelihood $p(\mathbf{y}|\mathbf{x})$ doesn't depend on the parameter $\theta$ and the posterior $p(\theta|\mathbf{y},\mathbf{x})$ normalises over $\theta$, we can immediately identify them with $Z$ and $q(\theta)$ of our approximation, respectively.

Laplace's approximation is

$$p(\mathbf{y},\theta|\mathbf{x}) \simeq p(\mathbf{y},\hat\theta|\mathbf{x}) \exp\left(-\tfrac{1}{2}(\theta-\hat\theta)^\top S^{-1}(\theta-\hat\theta)\right) = \tilde{q}(\theta),$$

where we have defined

\begin{align}
\hat\theta &= \operatorname{argmax}_\theta \log p(\mathbf{y},\theta|\mathbf{x}),\\
S^{-1} &= -\left.\nabla_\theta\nabla_\theta \log p(\mathbf{y},\theta|\mathbf{x})\right|_{\theta=\hat\theta},
\end{align}

where $\hat\theta$ is the location of a mode of the joint target density, also known as the maximum a posteriori or MAP point, and $S^{-1}$ is the $D \times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta=\hat\theta$. Thus, the Gaussian approximation matches the value and the log-curvature of the un-normalised target density at the mode. The value of $\hat\theta$ is usually found using a gradient-based method.
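The form of the approximation can be read off from a second-order Taylor expansion of the log joint around the mode. The following is a brief sketch, assuming the log joint is twice differentiable at $\hat\theta$ and that the mode lies in the interior of the parameter space, so that the gradient vanishes there:

\begin{align}
\log p(\mathbf{y},\theta|\mathbf{x}) &\simeq \log p(\mathbf{y},\hat\theta|\mathbf{x}) + (\theta-\hat\theta)^\top \left.\nabla_\theta \log p(\mathbf{y},\theta|\mathbf{x})\right|_{\theta=\hat\theta} + \tfrac{1}{2}(\theta-\hat\theta)^\top \left.\nabla_\theta\nabla_\theta \log p(\mathbf{y},\theta|\mathbf{x})\right|_{\theta=\hat\theta} (\theta-\hat\theta)\\
&= \log p(\mathbf{y},\hat\theta|\mathbf{x}) - \tfrac{1}{2}(\theta-\hat\theta)^\top S^{-1}(\theta-\hat\theta).
\end{align}

Exponentiating both sides gives the un-normalised Gaussian $\tilde{q}(\theta)$, and integrating it over $\theta$ with the standard Gaussian integral gives its normalisation constant:

\begin{align}
Z = \int \tilde{q}(\theta)\,d\theta = p(\mathbf{y},\hat\theta|\mathbf{x})\,(2\pi)^{D/2}\,|S|^{1/2}.
\end{align}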

In summary, we have

\begin{align}
q(\theta) &= \mathcal{N}(\theta \mid \mu=\hat\theta, \Sigma=S),\\
\log Z &= \log p(\mathbf{y},\hat\theta|\mathbf{x}) + \tfrac{1}{2}\log|S| + \tfrac{D}{2}\log(2\pi),
\end{align}

for the approximate posterior over $\theta$ and the approximate log marginal likelihood respectively.
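As an illustration, the recipe can be sketched for a one-parameter model ($D = 1$) where the exact answer is available for comparison. The model here is an assumption made for the example and does not come from the text above: a Bernoulli likelihood with `k` successes in `n` trials and a Beta(`a`, `b`) prior on the success probability $\theta$. All function names are hypothetical, and Newton's method stands in for the gradient-based optimiser.

```python
import math


def log_joint(theta, k, n, a, b):
    """log p(y, theta): Bernoulli likelihood times Beta(a, b) prior (assumed model)."""
    log_prior_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return (log_prior_norm
            + (k + a - 1) * math.log(theta)
            + (n - k + b - 1) * math.log(1.0 - theta))


def grad_log_joint(theta, k, n, a, b):
    """First derivative of the log joint with respect to theta."""
    return (k + a - 1) / theta - (n - k + b - 1) / (1.0 - theta)


def hess_log_joint(theta, k, n, a, b):
    """Second derivative of the log joint with respect to theta."""
    return -(k + a - 1) / theta**2 - (n - k + b - 1) / (1.0 - theta)**2


def laplace_approximation(k, n, a, b, theta0=0.5, max_iter=100, tol=1e-12):
    """Return (theta_hat, S, log_Z) for the assumed Beta-Bernoulli model.

    Assumes k + a > 1 and n - k + b > 1, so the mode lies inside (0, 1).
    """
    theta = theta0
    for _ in range(max_iter):
        # Newton step towards the mode of the log joint.
        step = grad_log_joint(theta, k, n, a, b) / hess_log_joint(theta, k, n, a, b)
        # Keep the iterate strictly inside (0, 1) so the logs stay finite.
        theta = min(max(theta - step, 1e-9), 1.0 - 1e-9)
        if abs(step) < tol:
            break
    # S is the inverse of the negative second derivative at the mode.
    S = -1.0 / hess_log_joint(theta, k, n, a, b)
    D = 1
    log_Z = (log_joint(theta, k, n, a, b)
             + 0.5 * math.log(S)
             + 0.5 * D * math.log(2.0 * math.pi))
    return theta, S, log_Z


def exact_log_marginal(k, n, a, b):
    """Exact log p(y) for the same model, from the Beta function."""
    def log_beta(p, q):
        return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return log_beta(a + k, b + n - k) - log_beta(a, b)


theta_hat, S, log_Z = laplace_approximation(k=7, n=10, a=2.0, b=2.0)
print("MAP estimate:", theta_hat)
print("Approximate posterior variance S:", S)
print("Laplace log Z:", log_Z)
print("Exact log Z:  ", exact_log_marginal(k=7, n=10, a=2.0, b=2.0))
```

For these inputs the mode is at $\hat\theta = 2/3$ and the Laplace estimate of the log marginal likelihood agrees with the exact value to within about 0.1 nats. In higher dimensions the scalar second derivative would be replaced by the Hessian matrix and `math.log(S)` by the log-determinant of $S$.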

The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used and was pioneered in the context of neural networks by David MacKay,[5] and for Gaussian processes by Williams and Barber.[6]

Notes and References

  1. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1991). "Laplace’s method in Bayesian analysis". Statistical Multiple Integration. Contemporary Mathematics. Vol. 115. pp. 89–100. doi:10.1090/conm/115/07. ISBN 0-8218-5122-5.
  2. MacKay, David J. C. (2003). "Information Theory, Inference and Learning Algorithms, chapter 27: Laplace's method".
  3. Hartigan, J. A. (1983). "Asymptotic Normality of Posterior Distributions". Bayes Theory. Springer Series in Statistics. New York: Springer. pp. 107–118. doi:10.1007/978-1-4613-8242-3_11. ISBN 978-1-4613-8244-7.
  4. Kass, Robert E.; Tierney, Luke; Kadane, Joseph B. (1990). "The Validity of Posterior Expansions Based on Laplace's Method". In Geisser, S.; Hodges, J. S.; Press, S. J.; Zellner, A. (eds.). Bayesian and Likelihood Methods in Statistics and Econometrics. Elsevier. pp. 473–488. ISBN 0-444-88376-2.
  5. MacKay, David J. C. (1992). "Bayesian Interpolation". Neural Computation. MIT Press. 4 (3): 415–447. doi:10.1162/neco.1992.4.3.415. S2CID 1762283.
  6. Williams, Christopher K. I.; Barber, David (1998). "Bayesian classification with Gaussian Processes". IEEE Transactions on Pattern Analysis and Machine Intelligence. IEEE. 20 (12): 1342–1351. doi:10.1109/34.735807.