Reparameterization trick
The reparameterization trick (also known as the "reparameterization gradient estimator") is a technique used in statistical machine learning, particularly in variational inference, variational autoencoders, and stochastic optimization. It allows for the efficient computation of gradients through random variables, enabling the optimization of parametric probability models by stochastic gradient descent and reducing the variance of gradient estimators.
It was developed in the 1980s in operations research, under the names "pathwise gradients" and "stochastic gradients".[1][2] Its use in variational inference was proposed in 2013.[3]
Mathematics
Let $z$ be a random variable with distribution $q_\phi(z)$, where $\phi$ is a vector containing the parameters of the distribution.
REINFORCE estimator
Consider an objective function of the form:

$$L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[f(z)]$$

Without the reparameterization trick, estimating the gradient $\nabla_\phi L(\phi)$ can be challenging, because the parameter appears in the distribution of the random variable itself. In more detail, we have to statistically estimate:

$$\nabla_\phi L(\phi) = \nabla_\phi \int q_\phi(z) f(z) \, dz$$

The REINFORCE estimator, widely used in reinforcement learning and especially policy gradient,[4] uses the following equality:

$$\nabla_\phi L(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[f(z) \, \nabla_\phi \ln q_\phi(z)]$$

This allows the gradient to be estimated:

$$\nabla_\phi L(\phi) \approx \frac{1}{N} \sum_{n=1}^N f(z_n) \, \nabla_\phi \ln q_\phi(z_n), \qquad z_n \sim q_\phi(z)$$

The REINFORCE estimator has high variance, and many methods were developed to reduce its variance.[5]
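As an illustration (not part of the original presentation), the score-function estimate can be sketched in a few lines of NumPy; the objective $f(z) = z^2$, the Gaussian family, and the sample size are arbitrary choices made here for concreteness:

```python
# Minimal sketch (NumPy assumed): REINFORCE / score-function estimate of
# d/d_mu E_{z ~ N(mu, sigma^2)}[f(z)] for an illustrative f(z) = z**2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 1.5, 1.0, 100_000

z = rng.normal(mu, sigma, size=N)      # z_n ~ q_phi(z), with phi = mu here
f = z ** 2                             # illustrative objective f(z)
score = (z - mu) / sigma ** 2          # grad_mu ln N(z; mu, sigma^2)
grad_estimate = np.mean(f * score)     # (1/N) * sum_n f(z_n) grad_mu ln q_phi(z_n)

print(grad_estimate)                   # the exact gradient is 2 * mu = 3.0
```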
Reparameterization estimator
The reparameterization trick expresses $z$ as:

$$z = g_\phi(\epsilon)$$

Here, $g_\phi$ is a deterministic function parameterized by $\phi$, and $\epsilon$ is a noise variable drawn from a fixed distribution $p(\epsilon)$. This gives:

$$L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}[f(g_\phi(\epsilon))]$$

Now, the gradient can be estimated as:

$$\nabla_\phi L(\phi) = \mathbb{E}_{\epsilon \sim p(\epsilon)}[\nabla_\phi f(g_\phi(\epsilon))] \approx \frac{1}{N} \sum_{n=1}^N \nabla_\phi f(g_\phi(\epsilon_n)), \qquad \epsilon_n \sim p(\epsilon)
$$
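The same toy gradient as in the REINFORCE sketch above can be estimated pathwise; again the Gaussian family and $f(z) = z^2$ are illustrative choices, not part of the original text. For this objective the pathwise estimate typically has far lower variance than the score-function estimate:

```python
# Minimal sketch (NumPy assumed): pathwise (reparameterization) estimate of
# d/d_mu E_{z ~ N(mu, sigma^2)}[z**2] using z = mu + sigma * eps.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 1.5, 1.0, 100_000

eps = rng.standard_normal(N)          # eps_n ~ N(0, 1), a fixed, parameter-free distribution
z = mu + sigma * eps                  # z = g_phi(eps)
grad_f = 2 * z                        # grad_mu f(g_phi(eps)) = f'(z) * dz/dmu = 2z * 1
grad_estimate = np.mean(grad_f)       # (1/N) * sum_n grad_phi f(g_phi(eps_n))

print(grad_estimate)                  # the exact gradient is again 2 * mu = 3.0
```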
Examples
For some common distributions, the reparameterization trick takes specific forms:
Normal distribution
For $z \sim \mathcal{N}(\mu, \sigma^2)$, we can use:

$$z = g_{\mu, \sigma}(\epsilon) = \mu + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$
Exponential distribution
For $z \sim \operatorname{Exp}(\lambda)$, we can use:

$$z = g_\lambda(\epsilon) = -\frac{\ln \epsilon}{\lambda}, \qquad \epsilon \sim \operatorname{Unif}(0, 1)$$
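Both transforms can be checked numerically; the following sketch (NumPy assumed, with arbitrary parameter values) draws reparameterized samples and compares their moments with the target distributions:

```python
# Minimal sketch (NumPy assumed): the two reparameterizations above, verified
# by comparing sample moments with the target distributions.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Normal: z = mu + sigma * eps, eps ~ N(0, 1)
mu, sigma = 2.0, 0.5
z_normal = mu + sigma * rng.standard_normal(N)

# Exponential: z = -ln(eps) / lam, eps ~ Unif(0, 1)  (inverse-CDF transform)
lam = 3.0
z_exp = -np.log(rng.uniform(size=N)) / lam

print(z_normal.mean(), z_normal.std())   # close to mu = 2.0 and sigma = 0.5
print(z_exp.mean())                      # close to 1 / lam = 0.333...
```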
Discrete distributions can be reparameterized via the Gumbel distribution (the Gumbel-softmax trick, also called the "concrete distribution").[6] In general, any distribution that is differentiable with respect to its parameters can be reparameterized by inverting the multivariable CDF, then applying the implicit method. See [1] for an exposition and an application to the Gamma, Beta, Dirichlet, and von Mises distributions.
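A sketch of the Gumbel-softmax relaxation is given below (NumPy assumed); the class probabilities and the temperature are illustrative values:

```python
# Minimal sketch (NumPy assumed): Gumbel-softmax ("concrete") relaxation of a
# categorical sample with class probabilities pi and temperature tau.
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.6, 0.3])     # illustrative class probabilities
tau = 0.5                          # temperature; smaller values give more nearly one-hot samples

u = rng.uniform(size=pi.shape)
gumbel = -np.log(-np.log(u))       # Gumbel(0, 1) noise from a fixed distribution
logits = (np.log(pi) + gumbel) / tau
y = np.exp(logits - logits.max())  # numerically stable softmax
y /= y.sum()                       # y: a relaxed, differentiable stand-in for a one-hot sample

print(y)
```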
Applications
Variational autoencoder
In Variational Autoencoders (VAEs), the VAE objective function, known as the Evidence Lower Bound (ELBO), is given by:

$$\mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}[\ln p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))$$

where $q_\phi(z|x)$ is the encoder (recognition model), $p_\theta(x|z)$ is the decoder (generative model), and $p(z)$ is the prior distribution over latent variables. The gradient of the ELBO with respect to $\theta$ is simply

$$\nabla_\theta \mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{z \sim q_\phi(z|x)}[\nabla_\theta \ln p_\theta(x|z)]$$

but the gradient with respect to $\phi$ requires the trick. Express the sampling operation $z \sim q_\phi(z|x)$ as:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the outputs of the encoder network, and $\odot$ denotes element-wise multiplication. Then we have

$$\nabla_\phi \mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\nabla_\phi \ln \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$$

where $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$. This allows us to estimate the gradient using Monte Carlo sampling:

$$\nabla_\phi \mathrm{ELBO}(\phi, \theta) \approx \frac{1}{L} \sum_{l=1}^L \nabla_\phi \ln \frac{p_\theta(x, z_l)}{q_\phi(z_l|x)}$$

where $z_l = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon_l$ and $\epsilon_l \sim \mathcal{N}(0, I)$ for $l = 1, \dots, L$.
This formulation enables backpropagation through the sampling process, allowing for end-to-end training of the VAE model using stochastic gradient descent or its variants.
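A minimal sketch of this pattern is shown below, assuming PyTorch; the `TinyVAE` class, the single-linear-layer encoder and decoder, the Bernoulli likelihood, and all dimensions are illustrative choices rather than anything prescribed by the original text:

```python
# Minimal sketch (PyTorch assumed): reparameterized sampling inside a VAE
# forward pass, so that ELBO gradients flow back into the encoder parameters.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # outputs mu_phi(x) and log sigma_phi(x)
        self.dec = nn.Linear(z_dim, x_dim)      # decoder logits for a Bernoulli p_theta(x|z)

    def elbo(self, x):
        mu, log_sigma = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)              # eps ~ N(0, I), fixed noise distribution
        z = mu + log_sigma.exp() * eps          # z = mu_phi(x) + sigma_phi(x) * eps
        x_logits = self.dec(z)
        # Monte Carlo ELBO with L = 1: reconstruction term minus KL(q_phi(z|x) || N(0, I))
        recon = -nn.functional.binary_cross_entropy_with_logits(
            x_logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu ** 2 + (2 * log_sigma).exp() - 2 * log_sigma - 1).sum(-1)
        return (recon - kl).mean()

vae = TinyVAE()
x = torch.rand(8, 784)                          # illustrative batch of inputs in [0, 1]
loss = -vae.elbo(x)                             # minimize the negative ELBO
loss.backward()                                 # gradients reach the encoder through z
```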
Variational inference
More generally, the trick allows using stochastic gradient descent for variational inference. Let the variational objective (ELBO) be of the form:

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{z \sim q_\phi(z)}[\ln p(x, z) - \ln q_\phi(z)]$$

Using the reparameterization trick $z = g_\phi(\epsilon)$ with $\epsilon \sim p(\epsilon)$, we can estimate the gradient of this objective with respect to $\phi$:

$$\nabla_\phi \mathrm{ELBO}(\phi) \approx \frac{1}{L} \sum_{l=1}^L \nabla_\phi \left[\ln p(x, g_\phi(\epsilon_l)) - \ln q_\phi(g_\phi(\epsilon_l))\right], \qquad \epsilon_l \sim p(\epsilon)$$
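For concreteness, a minimal sketch of this procedure is given below, assuming PyTorch; the diagonal-Gaussian variational family, the toy target density, and the optimizer settings are all illustrative assumptions:

```python
# Minimal sketch (PyTorch assumed): stochastic-gradient variational inference
# with a diagonal-Gaussian q_phi and an illustrative unnormalized target density.
import math
import torch

def log_joint(z):
    # Illustrative log p(x, z): a standard 2-D Gaussian in z (x is held fixed).
    return -0.5 * (z ** 2).sum(-1)

mu = torch.zeros(2, requires_grad=True)      # variational parameters phi = (mu, log_sigma)
log_sigma = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(2000):
    eps = torch.randn(64, 2)                 # eps_l ~ N(0, I)
    z = mu + log_sigma.exp() * eps           # z_l = g_phi(eps_l)
    log_q = (-0.5 * eps ** 2 - log_sigma
             - 0.5 * math.log(2 * math.pi)).sum(-1)   # ln q_phi(z_l), written via eps_l
    elbo = (log_joint(z) - log_q).mean()     # Monte Carlo estimate of the ELBO
    opt.zero_grad()
    (-elbo).backward()                       # reparameterization gradient of -ELBO
    opt.step()
```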
Dropout
The reparameterization trick has been applied to reduce the variance in dropout, a regularization technique in neural networks. The original dropout can be reparameterized with Bernoulli distributions:

$$y = (W \odot \epsilon) x, \qquad \epsilon_{ij} \sim \operatorname{Bernoulli}(p_{ij})$$

where $W$ is the weight matrix, $x$ is the input, and $p_{ij}$ are the (fixed) dropout rates.
More generally, distributions other than the Bernoulli can be used, such as Gaussian noise:

$$y_i = \mu_i + \sigma_i \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$

where $\mu_i$ and $\sigma_i^2$ are the mean and variance of the $i$-th output neuron. The reparameterization trick can be applied to all such cases, resulting in the variational dropout method.[7]
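A minimal sketch of these two noise forms is given below, assuming PyTorch; the layer shape, the keep probability, and the Gaussian noise scale are illustrative, and this shows only the multiplicative-noise view rather than the full variational dropout method of [7]:

```python
# Minimal sketch (PyTorch assumed): dropout written as multiplicative noise on
# the weights, reparameterized so the noise comes from a fixed distribution.
import torch

torch.manual_seed(0)
W = torch.randn(32, 64, requires_grad=True)      # weight matrix W
x = torch.randn(64)                              # input x
alpha = 0.1                                      # illustrative Gaussian-dropout variance

# Bernoulli dropout: y = (W * mask) x, mask_ij ~ Bernoulli(p_ij) with fixed rates.
mask = torch.bernoulli(torch.full_like(W, 0.9))  # keep probability 0.9
y_bernoulli = (W * mask) @ x

# Gaussian relaxation: multiplicative noise 1 + sqrt(alpha) * eps, eps ~ N(0, 1).
eps = torch.randn_like(W)                        # parameter-free noise
y_gaussian = (W * (1 + alpha ** 0.5 * eps)) @ x

y_gaussian.sum().backward()                      # gradients reach W through the noise
```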
Notes and References
- Figurnov, Mikhail; Mohamed, Shakir; Mnih, Andriy (2018). "Implicit Reparameterization Gradients". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
- Fu, Michael C. (2006). "Gradient estimation". Handbooks in Operations Research and Management Science. 13: 575–616.
- Kingma, Diederik P.; Welling, Max (2013). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
- Williams, Ronald J. (1992). "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning. 8 (3): 229–256. doi:10.1007/BF00992696.
- Greensmith, Evan; Bartlett, Peter L.; Baxter, Jonathan (2004). "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning". Journal of Machine Learning Research. 5: 1471–1530.
- Maddison, Chris J.; Mnih, Andriy; Teh, Yee Whye (2017). "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables". arXiv:1611.00712 [cs.LG].
- Kingma, Durk P.; Salimans, Tim; Welling, Max (2015). "Variational Dropout and the Local Reparameterization Trick". Advances in Neural Information Processing Systems. 28. arXiv:1506.02557.