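In mathematical statistics, the Kullback–Leibler (KL) divergence (also called relative entropy and I-divergence[1]), denoted $D_{\text{KL}}(P \parallel Q)$, is a measure of how a probability distribution $P$ differs from a second, reference distribution $Q$. For discrete distributions on a sample space $\mathcal{X}$ it is defined as
$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right).$$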
A simple interpretation of the KL divergence of $P$ from $Q$ is the expected excess surprise from using $Q$ as a model instead of $P$ when the actual distribution is $P$. While it is a measure of how different two distributions are, and in some sense is thus a "distance", it is not actually a metric, which is the most familiar and formal type of distance. In particular, it is not symmetric in the two distributions (in contrast to variation of information), and does not satisfy the triangle inequality. Instead, in terms of information geometry, it is a type of divergence, a generalization of squared distance, and for certain classes of distributions (notably an exponential family), it satisfies a generalized Pythagorean theorem (which applies to squared distances).
Relative entropy is always a non-negative real number, with value 0 if and only if the two distributions in question are identical. It has diverse applications, both theoretical, such as characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience, bioinformatics, and machine learning.
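Consider two probability distributions $P$ and $Q$. Usually, $P$ represents the data, the observations, or a measured probability distribution. Distribution $Q$ represents instead a theory, a model, a description or an approximation of $P$. The Kullback–Leibler divergence $D_{\text{KL}}(P \parallel Q)$ is then interpreted as the average number of extra bits required to encode samples of $P$ using a code optimized for $Q$ rather than one optimized for $P$.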
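The relative entropy was introduced by Solomon Kullback and Richard Leibler as "the mean information for discrimination between $H_1$ and $H_2$ per observation from $\mu_1$", where one is comparing two probability measures $\mu_1, \mu_2$, and $H_1, H_2$ are the hypotheses that one is selecting from measure $\mu_1, \mu_2$ (respectively). They denoted this by $I(1:2)$, and defined the "divergence" between $\mu_1$ and $\mu_2$ as the symmetrized quantity $J(1,2) = I(1:2) + I(2:1)$.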
For discrete probability distributions $P$ and $Q$ defined on the same sample space, $\mathcal{X}$, the relative entropy from $Q$ to $P$ is defined to be
$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right),$$
which is equivalent to
$$D_{\text{KL}}(P \parallel Q) = -\sum_{x \in \mathcal{X}} P(x) \log\left(\frac{Q(x)}{P(x)}\right).$$
In other words, it is the expectation of the logarithmic difference between the probabilities $P$ and $Q$, where the expectation is taken using the probabilities $P$.
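Relative entropy is only defined in this way if, for all $x$, $Q(x) = 0$ implies $P(x) = 0$ (absolute continuity). Otherwise, it is often defined as $+\infty$, but the value $+\infty$ is possible even if $Q(x) \ne 0$ everywhere, provided that $\mathcal{X}$ is infinite in extent.
Whenever $P(x)$ is zero, the contribution of the corresponding term is interpreted as zero because
$$\lim_{x \to 0^{+}} x \log(x) = 0.$$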
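The conventions above (zero contribution where $P(x) = 0$, infinite divergence where $Q(x) = 0$ but $P(x) > 0$) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `kl_divergence` and its `base` parameter are assumptions made for this example.

```python
import math

def kl_divergence(p, q, base=math.e):
    """Sketch of D_KL(P || Q) for discrete distributions given as
    equal-length sequences of probabilities over the same sample space."""
    if len(p) != len(q):
        raise ValueError("p and q must cover the same sample space")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qx == 0.0:
            return math.inf  # P is not absolutely continuous w.r.t. Q
        total += px * math.log(px / qx)
    return total / math.log(base)  # convert from nats to the requested base

# Q spreads mass over an outcome that P never produces: finite divergence.
print(kl_divergence([0.5, 0.5, 0.0], [0.25, 0.25, 0.5], base=2))
# P puts mass where Q has none: the divergence is infinite.
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))
```

The two calls also illustrate the asymmetry: support that is missing from $P$ is harmless, while support that is missing from $Q$ makes the divergence infinite.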
For distributions $P$ and $Q$ of a continuous random variable, relative entropy is defined to be the integral[7]
$$D_{\text{KL}}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx,$$
where $p$ and $q$ denote the probability densities of $P$ and $Q$.
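More generally, if $P$ and $Q$ are probability measures on a measurable space $\mathcal{X}$, and $P$ is absolutely continuous with respect to $Q$, then the relative entropy from $Q$ to $P$ is defined as
$$D_{\text{KL}}(P \parallel Q) = \int_{x \in \mathcal{X}} \log\left(\frac{P(dx)}{Q(dx)}\right) P(dx),$$
where $\frac{P(dx)}{Q(dx)}$ is the Radon–Nikodym derivative of $P$ with respect to $Q$, i.e. the unique $Q$-almost-everywhere defined function $r$ on $\mathcal{X}$ such that $P(dx) = r(x) Q(dx)$, which exists because $P$ is absolutely continuous with respect to $Q$. Equivalently (by the chain rule), this can be written as
$$D_{\text{KL}}(P \parallel Q) = \int_{x \in \mathcal{X}} \frac{P(dx)}{Q(dx)} \log\left(\frac{P(dx)}{Q(dx)}\right) Q(dx),$$
which is the entropy of $P$ relative to $Q$. Continuing in this case, if $\mu$ is any measure on $\mathcal{X}$ for which densities $p$ and $q$ with $P(dx) = p(x)\mu(dx)$ and $Q(dx) = q(x)\mu(dx)$ exist (meaning that $P$ and $Q$ are both absolutely continuous with respect to $\mu$), then the relative entropy from $Q$ to $P$ is given as
$$D_{\text{KL}}(P \parallel Q) = \int_{x \in \mathcal{X}} p(x) \log\left(\frac{p(x)}{q(x)}\right) \mu(dx).$$
Note that such a measure $\mu$ for which densities can be defined always exists, since one can take $\mu = \frac{1}{2}\left(P + Q\right)$.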
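Various conventions exist for referring to $D_{\text{KL}}(P \parallel Q)$ in words. In this article it is described as the divergence of $P$ from $Q$. Another common way to refer to $D_{\text{KL}}(P \parallel Q)$ is as the relative entropy of $P$ with respect to $Q$.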
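Kullback gives the following example (Table 2.1, Example 2.1). Let $P$ and $Q$ be the distributions shown in the table and figure. $P$ is the distribution on the left side of the figure, a binomial distribution with $N = 2$ and $p = 0.4$. $Q$ is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes $x = 0, 1, 2$ (i.e. $\mathcal{X} = \{0, 1, 2\}$), each with probability $p = 1/3$.
| $x$ | 0 | 1 | 2 |
| Distribution $P(x)$ | 9/25 | 12/25 | 4/25 |
| Distribution $Q(x)$ | 1/3 | 1/3 | 1/3 |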
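Relative entropies $D_{\text{KL}}(P \parallel Q)$ and $D_{\text{KL}}(Q \parallel P)$ are calculated as follows, using the natural logarithm so that the results are in nats.
$$\begin{aligned} D_{\text{KL}}(P \parallel Q) &= \sum_{x \in \mathcal{X}} P(x) \ln\left(\frac{P(x)}{Q(x)}\right) \\ &= \frac{9}{25}\ln\left(\frac{9/25}{1/3}\right) + \frac{12}{25}\ln\left(\frac{12/25}{1/3}\right) + \frac{4}{25}\ln\left(\frac{4/25}{1/3}\right) \\ &= \frac{1}{25}\left(32\ln 2 + 55\ln 3 - 50\ln 5\right) \approx 0.0853 \end{aligned}$$
$$\begin{aligned} D_{\text{KL}}(Q \parallel P) &= \sum_{x \in \mathcal{X}} Q(x) \ln\left(\frac{Q(x)}{P(x)}\right) \\ &= \frac{1}{3}\ln\left(\frac{1/3}{9/25}\right) + \frac{1}{3}\ln\left(\frac{1/3}{12/25}\right) + \frac{1}{3}\ln\left(\frac{1/3}{4/25}\right) \\ &= \frac{1}{3}\left(-4\ln 2 - 6\ln 3 + 6\ln 5\right) \approx 0.0975 \end{aligned}$$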
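Kullback's example can be checked numerically. The sketch below builds $P$ as the binomial distribution with $N = 2$, $p = 0.4$ and $Q$ as the uniform distribution on $\{0, 1, 2\}$, then evaluates both directions of the divergence in nats. The helper name `kl_nats` is an assumption made for this illustration.

```python
import math
from fractions import Fraction

# P: binomial with N = 2 trials and success probability 0.4 (kept exact).
N, p_success = 2, Fraction(2, 5)
P = [math.comb(N, x) * p_success**x * (1 - p_success)**(N - x)
     for x in range(N + 1)]
# Q: discrete uniform distribution on the same three outcomes.
Q = [Fraction(1, 3)] * 3

def kl_nats(a, b):
    """D_KL(a || b) in nats for strictly positive b."""
    return sum(float(ax) * math.log(float(ax / bx))
               for ax, bx in zip(a, b) if ax > 0)

print(P)                 # exact probabilities 9/25, 12/25, 4/25
print(kl_nats(P, Q))     # divergence of P from Q
print(kl_nats(Q, P))     # divergence of Q from P (a different value)
```

The two printed divergences differ, which is a concrete instance of the asymmetry discussed throughout the article.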
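In the field of statistics, the Neyman–Pearson lemma states that the most powerful way to distinguish between the two distributions $P$ and $Q$ based on an observation $Y$ (drawn from one of them) is through the log of the ratio of their likelihoods: $\log P(Y) - \log Q(Y)$. The KL divergence is the expected value of this statistic if $Y$ is actually drawn from $P$.
In the context of coding theory, $D_{\text{KL}}(P \parallel Q)$ can be constructed by measuring the expected number of extra bits required to code samples from $P$ using a code optimized for $Q$ rather than the code optimized for $P$.
In the context of machine learning, $D_{\text{KL}}(P \parallel Q)$ is often called the information gain achieved if $P$ would be used instead of $Q$, which is currently used.
Expressed in the language of Bayesian inference, $D_{\text{KL}}(P \parallel Q)$ is a measure of the information gained by revising one's beliefs from the prior probability distribution $Q$ to the posterior probability distribution $P$.
In applications, $P$ typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while $Q$ typically represents a theory, model, description, or approximation of $P$. In order to find a distribution $Q$ that is closest to $P$, we can minimize the KL divergence and compute an information projection.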
While it is a statistical distance, it is not a metric, the most familiar type of distance, but instead it is a divergence. While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_{\text{KL}}(P \parallel Q)$ does not equal $D_{\text{KL}}(Q \parallel P)$, and the asymmetry is an important part of the geometry.
The relative entropy is the Bregman divergence generated by the negative entropy, but it is also of the form of an $f$-divergence. For probabilities over a finite alphabet, it is unique in being a member of both of these classes of statistical divergences. An application of the Bregman divergence can be found in mirror descent.[11]
Consider a growth-optimizing investor in a fair game with mutually exclusive outcomes (e.g. a "horse race" in which the official odds add up to one). The rate of return expected by such an investor is equal to the relative entropy between the investor's believed probabilities and the official odds.[12] This is a special case of a much more general connection between financial returns and divergence measures.[13]
Financial risks are also connected to $D_{\text{KL}}$.
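In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $\mathcal{X}$ can be seen as representing an implicit probability distribution $q(x_i) = 2^{-\ell_i}$ over $\mathcal{X}$, where $\ell_i$ is the length of the code for $x_i$ in bits. Therefore, relative entropy can be interpreted as the expected extra message length per datum that must be communicated if a code that is optimal for a given (wrong) distribution $Q$ is used, compared to using a code based on the true distribution $P$: it is the excess entropy.
$$\begin{aligned} D_{\text{KL}}(P \parallel Q) &= \sum_{x \in \mathcal{X}} p(x) \log\frac{1}{q(x)} - \sum_{x \in \mathcal{X}} p(x) \log\frac{1}{p(x)} \\ &= \mathrm{H}(P, Q) - \mathrm{H}(P) \end{aligned}$$
where $\mathrm{H}(P, Q)$ is the cross entropy of $Q$ relative to $P$ and $\mathrm{H}(P)$ is the entropy of $P$ (which is the same as the cross entropy of $P$ with itself).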
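The relative entropy $D_{\text{KL}}(P \parallel Q)$ can be thought of geometrically as a statistical distance, a measure of how far the distribution $Q$ is from the distribution $P$. The cross-entropy $\mathrm{H}(P, Q)$ is itself such a measurement, but it cannot be thought of as a distance, since $\mathrm{H}(P, P) =: \mathrm{H}(P)$ is not zero. This can be fixed by subtracting $\mathrm{H}(P)$ to make $D_{\text{KL}}(P \parallel Q)$ agree more closely with our notion of distance, as the excess loss.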
Relative entropy is related to the "rate function" in the theory of large deviations.[16][17]
Arthur Hobson proved that relative entropy is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy.[18] Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.
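Relative entropy is always non-negative, $D_{\text{KL}}(P \parallel Q) \ge 0$, a result known as Gibbs' inequality, with $D_{\text{KL}}(P \parallel Q)$ equal to zero if and only if $P = Q$ as measures.
In particular, if $P(dx) = p(x)\mu(dx)$ and $Q(dx) = q(x)\mu(dx)$, then $p(x) = q(x)$ $\mu$-almost everywhere. The entropy $\mathrm{H}(P)$ thus sets a minimum value for the cross-entropy $\mathrm{H}(P, Q)$, the expected number of bits required when using a code based on $Q$ rather than $P$; and the Kullback–Leibler divergence $D_{\text{KL}}(P \parallel Q)$ therefore represents the expected number of extra bits that must be transmitted to identify a value drawn from $P$ if a code is used corresponding to the probability distribution $Q$ rather than the "true" distribution $P$.
Relative entropy remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable $x$ to variable $y(x)$, then, since $P(dx) = p(x)\,dx = \tilde{p}(y)\,dy = \tilde{p}(y(x)) \left|\tfrac{dy}{dx}(x)\right| dx$ and $Q(dx) = q(x)\,dx = \tilde{q}(y)\,dy = \tilde{q}(y(x)) \left|\tfrac{dy}{dx}(x)\right| dx$, where $\left|\tfrac{dy}{dx}(x)\right|$ is the absolute value of the derivative (or more generally of the Jacobian), the relative entropy may be rewritten:
$$D_{\text{KL}}(P \parallel Q) = \int_{x_a}^{x_b} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx = \int_{y_a}^{y_b} \tilde{p}(y) \log\left(\frac{\tilde{p}(y)}{\tilde{q}(y)}\right) dy,$$
where $y_a = y(x_a)$ and $y_b = y(x_b)$. This also shows that the relative entropy produces a dimensionally consistent quantity, since if $x$ is a dimensioned variable, $p(x)$ and $q(x)$ are also dimensioned, while e.g. $P(dx) = p(x)\,dx$ is dimensionless.
Relative entropy is additive for independent distributions in much the same way as Shannon entropy. If $P_1, P_2$ are independent distributions with $P(dx, dy) = P_1(dx) P_2(dy)$, and likewise $Q(dx, dy) = Q_1(dx) Q_2(dy)$ for independent distributions $Q_1, Q_2$, then
$$D_{\text{KL}}(P \parallel Q) = D_{\text{KL}}(P_1 \parallel Q_1) + D_{\text{KL}}(P_2 \parallel Q_2).$$
Relative entropy $D_{\text{KL}}(P \parallel Q)$ is convex in the pair of probability measures $(P, Q)$, i.e. if $(P_1, Q_1)$ and $(P_2, Q_2)$ are two pairs of probability measures then
$$D_{\text{KL}}(\lambda P_1 + (1 - \lambda) P_2 \parallel \lambda Q_1 + (1 - \lambda) Q_2) \le \lambda D_{\text{KL}}(P_1 \parallel Q_1) + (1 - \lambda) D_{\text{KL}}(P_2 \parallel Q_2) \quad \text{for } 0 \le \lambda \le 1.$$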
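$D_{\text{KL}}(P \parallel Q)$ may be Taylor expanded about its minimum (i.e. $P = Q$) as
$$D_{\text{KL}}(P \parallel Q) = \sum_{n=2}^{\infty} \frac{1}{n(n-1)} \sum_{x \in \mathcal{X}} \frac{(Q(x) - P(x))^n}{Q(x)^{n-1}},$$
which converges if and only if $P \leq 2Q$ almost surely with respect to $Q$.
Proof. Denote $f(\alpha) := D_{\text{KL}}((1-\alpha)Q + \alpha P \parallel Q)$ and note that $D_{\text{KL}}(P \parallel Q) = f(1)$. The first derivative of $f$ vanishes at $\alpha = 0$, and for $n \ge 2$ the higher derivatives there are $f^{(n)}(0) = (-1)^n (n-2)! \sum_{x} (P(x) - Q(x))^n / Q(x)^{n-1}$. Hence solving for $D_{\text{KL}}(P \parallel Q)$ via the Taylor expansion of $f$ about $0$ evaluated at $\alpha = 1$ yields the series above.
$P \leq 2Q$ almost surely is a sufficient condition for convergence of the series by an absolute convergence argument: in that case $|Q - P| \le Q$, so the $n$th term is bounded in absolute value by $\frac{1}{n(n-1)}$, and these bounds sum to 1.
$P \leq 2Q$ almost surely is also a necessary condition for convergence, by the following proof by contradiction. Assume that $P > 2Q$ with measure strictly greater than $0$. It then follows that there must exist some values $\epsilon > 0$, $\rho > 0$, and $U < \infty$ such that $P \geq 2Q + \epsilon$ and $Q \leq U$ with measure $\rho$. The proof of sufficiency shows that the measure $1 - \rho$ component of the series where $P \leq 2Q$ is bounded, so only the measure $\rho$ component of the series where $P \geq 2Q + \epsilon$ matters. The absolute value of the $n$th term of this component of the series is then lower bounded by $\frac{1}{n(n-1)} \rho \left(1 + \frac{\epsilon}{U}\right)^n$, which is unbounded as $n \to \infty$, so the series diverges.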
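The following result, due to Donsker and Varadhan,[21] is known as Donsker and Varadhan's variational formula. In its usual form it states that, for probability measures $P$ and $Q$ on the same space,
$$D_{\text{KL}}(P \parallel Q) = \sup_{f} \left\{ \operatorname{E}_P[f] - \ln \operatorname{E}_Q\left[e^{f}\right] \right\},$$
where the supremum is taken over bounded measurable functions $f$.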
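Suppose that we have two multivariate normal distributions, with means $\mu_0, \mu_1$ and with (non-singular) covariance matrices $\Sigma_0, \Sigma_1$. If the two distributions have the same dimension, $k$, then the relative entropy between the distributions is as follows:
$$D_{\text{KL}}\left(\mathcal{N}_0 \parallel \mathcal{N}_1\right) = \frac{1}{2}\left( \operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) - k + \left(\mu_1 - \mu_0\right)^{\mathsf{T}} \Sigma_1^{-1} \left(\mu_1 - \mu_0\right) + \ln\left(\frac{\det\Sigma_1}{\det\Sigma_0}\right) \right).$$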
The logarithm in the last term must be taken to base $e$ since all terms apart from the last are base-$e$ logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by $\ln(2)$ yields the divergence in bits.
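In a numerical implementation, it is helpful to express the result in terms of the Cholesky decompositions $L_0, L_1$ such that $\Sigma_0 = L_0 L_0^{\mathsf{T}}$ and $\Sigma_1 = L_1 L_1^{\mathsf{T}}$. Then with $M$ and $y$ the solutions to the triangular linear systems $L_1 M = L_0$ and $L_1 y = \mu_1 - \mu_0$,
$$D_{\text{KL}}\left(\mathcal{N}_0 \parallel \mathcal{N}_1\right) = \frac{1}{2}\left( \sum_{i,j=1}^{k} (M_{ij})^2 - k + |y|^2 + 2\sum_{i=1}^{k} \ln\frac{(L_1)_{ii}}{(L_0)_{ii}} \right).$$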
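The direct formula and the Cholesky-based formula for two multivariate normals can be compared numerically. The following is a minimal sketch assuming NumPy is available; the function names are illustrative, and for simplicity it uses the general solver `np.linalg.solve` rather than a dedicated triangular solver, which a production implementation would likely prefer.

```python
import numpy as np

def kl_mvn_direct(mu0, sigma0, mu1, sigma1):
    """D_KL(N0 || N1) in nats from the trace/determinant formula."""
    k = mu0.shape[0]
    sigma1_inv = np.linalg.inv(sigma1)
    diff = mu1 - mu0
    _, logdet0 = np.linalg.slogdet(sigma0)
    _, logdet1 = np.linalg.slogdet(sigma1)
    return 0.5 * (np.trace(sigma1_inv @ sigma0) - k
                  + diff @ sigma1_inv @ diff + logdet1 - logdet0)

def kl_mvn_cholesky(mu0, sigma0, mu1, sigma1):
    """Same quantity via Cholesky factors L0, L1 (lower triangular)."""
    k = mu0.shape[0]
    L0 = np.linalg.cholesky(sigma0)
    L1 = np.linalg.cholesky(sigma1)
    M = np.linalg.solve(L1, L0)          # solves L1 M = L0
    y = np.linalg.solve(L1, mu1 - mu0)   # solves L1 y = mu1 - mu0
    log_ratio = np.log(np.diag(L1)) - np.log(np.diag(L0))
    return 0.5 * (np.sum(M**2) - k + y @ y + 2.0 * np.sum(log_ratio))

mu0 = np.array([0.0, 0.0])
sigma0 = np.array([[1.0, 0.2], [0.2, 2.0]])
mu1 = np.array([1.0, -1.0])
sigma1 = np.array([[2.0, 0.3], [0.3, 1.0]])

print(kl_mvn_direct(mu0, sigma0, mu1, sigma1))
print(kl_mvn_cholesky(mu0, sigma0, mu1, sigma1))  # should agree closely
```

The Cholesky form avoids explicitly inverting $\Sigma_1$ and computing determinants, which is the usual motivation for it when covariance matrices are ill-conditioned.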
A special case, and a common quantity in variational inference, is the relative entropy between a diagonal multivariate normal and a standard normal distribution (with zero mean and unit variance):
$$D_{\text{KL}}\left( \mathcal{N}\left( \left(\mu_1, \ldots, \mu_k\right)^{\mathsf{T}}, \operatorname{diag}\left(\sigma_1^2, \ldots, \sigma_k^2\right) \right) \parallel \mathcal{N}\left(0, I\right) \right) = \frac{1}{2} \sum_{i=1}^{k} \left( \sigma_i^2 + \mu_i^2 - 1 - \ln\left(\sigma_i^2\right) \right).$$
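As a small illustration of the diagonal-Gaussian special case, the sum $\frac{1}{2}\sum_i (\sigma_i^2 + \mu_i^2 - 1 - \ln \sigma_i^2)$ can be evaluated directly. This is a sketch only; the function name `kl_diag_gaussian_to_standard` is an assumption made for this example.

```python
import math

def kl_diag_gaussian_to_standard(mu, sigma):
    """D_KL(N(mu, diag(sigma^2)) || N(0, I)) in nats.

    mu and sigma are equal-length sequences of means and (positive)
    standard deviations, one entry per dimension."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

# Identical to the standard normal: zero divergence.
print(kl_diag_gaussian_to_standard([0.0, 0.0], [1.0, 1.0]))
# Shifting one mean by 1 adds exactly 1/2 nat.
print(kl_diag_gaussian_to_standard([1.0, 0.0], [1.0, 1.0]))
```

Because the covariance is diagonal, the divergence is a sum of independent per-dimension terms, consistent with the additivity of relative entropy for independent distributions.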
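For two univariate normal distributions $p = \mathcal{N}(\mu_0, \sigma_0^2)$ and $q = \mathcal{N}(\mu_1, \sigma_1^2)$ the above simplifies to[23]
$$D_{\text{KL}}\left(p \parallel q\right) = \log\frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^2 + (\mu_0 - \mu_1)^2}{2\sigma_1^2} - \frac{1}{2}.$$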
In the case of co-centered normal distributions with $k = \sigma_1/\sigma_0$, this simplifies to:
$$D_{\text{KL}}\left(p \parallel q\right) = \log_2 k + \frac{k^{-2} - 1}{2\ln(2)} \ \text{bits}.$$
Consider two uniform distributions, with the support of $p = [A, B]$ enclosed within that of $q = [C, D]$ ($C \le A < B \le D$). Then the information gain is:
$$D_{\text{KL}}\left(p \parallel q\right) = \log\frac{D - C}{B - A}.$$
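Intuitively,[24] the information gain to a $k$ times narrower uniform distribution contains $\log_2 k$ bits. This connects with the use of bits in computing, where $\log_2 k$ bits would be needed to identify one element of a stream of length $k$.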
While relative entropy is a statistical distance, it is not a metric on the space of probability distributions, but instead it is a divergence. While metrics are symmetric and generalize linear distance, satisfying the triangle inequality, divergences are asymmetric in general and generalize squared distance, in some cases satisfying a generalized Pythagorean theorem. In general $D_{\text{KL}}(P \parallel Q)$ does not equal $D_{\text{KL}}(Q \parallel P)$.
It generates a topology on the space of probability distributions. More concretely, if $\{P_1, P_2, \ldots\}$ is a sequence of distributions such that
$$\lim_{n \to \infty} D_{\text{KL}}(P_n \parallel Q) = 0,$$
then it is said that
$$P_n \xrightarrow{D} Q.$$
Pinsker's inequality entails that
$$P_n \xrightarrow{D} P \Rightarrow P_n \xrightarrow{TV} P,$$
where the latter stands for the usual convergence in total variation.
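The implication from convergence in relative entropy to convergence in total variation can be illustrated numerically. The sketch below assumes the standard form of Pinsker's inequality in nats, $\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_{\text{KL}}(P \parallel Q)}$, and a hypothetical sequence $P_n = (\tfrac{1}{2} + \tfrac{1}{n}, \tfrac{1}{2} - \tfrac{1}{n})$ converging to the fair coin; the helper names are illustrative.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions with q > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def total_variation(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

Q = [0.5, 0.5]
for n in [3, 10, 100]:
    P_n = [0.5 + 1.0 / n, 0.5 - 1.0 / n]
    d = kl(P_n, Q)
    tv = total_variation(P_n, Q)
    # Pinsker: tv is bounded by sqrt(d / 2), so d -> 0 forces tv -> 0.
    print(n, d, tv, math.sqrt(d / 2))
```

In each row the total variation distance stays below the Pinsker bound, and both shrink together as $n$ grows.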
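Relative entropy is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions $P$ and $Q$ are both parameterized by some (possibly multi-dimensional) parameter $\theta$. Consider then two close-by values of $P = P(\theta)$ and $Q = P(\theta_0)$ so that the parameter $\theta$ differs by only a small amount from the parameter value $\theta_0$. Specifically, up to first order one has (using the Einstein summation convention)
$$P(\theta) = P(\theta_0) + \Delta\theta_j P_j(\theta_0) + \cdots$$
with $\Delta\theta_j = (\theta - \theta_0)_j$ a small change of $\theta$ in the $j$ direction, and $P_j(\theta_0) = \frac{\partial P}{\partial\theta_j}(\theta_0)$ the corresponding rate of change in the probability distribution. Since relative entropy has an absolute minimum 0 for $P = Q$, i.e. $\theta = \theta_0$, it changes only to second order in the small parameters $\Delta\theta_j$. More formally, as for any minimum, the first derivatives of the divergence vanish,
$$\left.\frac{\partial}{\partial\theta_j}\right|_{\theta=\theta_0} D_{\text{KL}}(P(\theta) \parallel P(\theta_0)) = 0,$$
and by the Taylor expansion one has up to second order
$$D_{\text{KL}}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2} \Delta\theta_j \Delta\theta_k g_{jk}(\theta_0) + \cdots$$
where the Hessian matrix of the divergence
$$g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial\theta_j \partial\theta_k}\right|_{\theta=\theta_0} D_{\text{KL}}(P(\theta) \parallel P(\theta_0))$$
must be positive semidefinite. Letting $\theta_0$ vary (and dropping the subindex 0), the Hessian $g_{jk}(\theta)$ defines a (possibly degenerate) Riemannian metric on the $\theta$ parameter space, called the Fisher information metric.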
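The second-order relation between relative entropy and Fisher information can be checked on a one-parameter family. The sketch below uses the Bernoulli family, whose Fisher information is the standard closed form $1/(\theta(1-\theta))$; the choice of family, the step size `h`, and the function names are assumptions made for this illustration.

```python
import math

def kl_bernoulli(theta, theta0):
    """D_KL(Bernoulli(theta) || Bernoulli(theta0)) in nats."""
    return (theta * math.log(theta / theta0)
            + (1 - theta) * math.log((1 - theta) / (1 - theta0)))

def fisher_information_bernoulli(theta0):
    """Closed-form Fisher information of the Bernoulli family."""
    return 1.0 / (theta0 * (1.0 - theta0))

theta0, h = 0.3, 1e-3
# Central finite difference for the second derivative in theta at theta0.
hessian = (kl_bernoulli(theta0 + h, theta0)
           - 2.0 * kl_bernoulli(theta0, theta0)
           + kl_bernoulli(theta0 - h, theta0)) / h**2

print(hessian)                               # numerical Hessian of the divergence
print(fisher_information_bernoulli(theta0))  # closed-form Fisher information

# Second-order approximation D ~ (1/2) * dtheta^2 * g(theta0).
dtheta = 0.01
print(kl_bernoulli(theta0 + dtheta, theta0),
      0.5 * dtheta**2 * fisher_information_bernoulli(theta0))
```

The numerical Hessian matches the Fisher information closely, and the quadratic approximation is accurate to within about a percent for this small step.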
When $p(x, \rho)$ satisfies the following regularity conditions:
$$\frac{\partial\log(p)}{\partial\rho}, \ \frac{\partial^2\log(p)}{\partial\rho^2}, \ \frac{\partial^3\log(p)}{\partial\rho^3} \ \text{exist},$$
$$\begin{aligned} \left|\frac{\partial p}{\partial\rho}\right| &< F(x): \int_{x=0}^{\infty} F(x)\,dx < \infty, \\ \left|\frac{\partial^2 p}{\partial\rho^2}\right| &< G(x): \int_{x=0}^{\infty} G(x)\,dx < \infty, \\ \left|\frac{\partial^3\log(p)}{\partial\rho^3}\right| &< H(x): \int_{x=0}^{\infty} p(x, 0) H(x)\,dx < \xi < \infty \end{aligned}$$
where $\xi$ is independent of $\rho$, and
$$\left.\int_{x=0}^{\infty} \frac{\partial p(x, \rho)}{\partial\rho}\right|_{\rho=0} dx = \left.\int_{x=0}^{\infty} \frac{\partial^2 p(x, \rho)}{\partial\rho^2}\right|_{\rho=0} dx = 0,$$
then:
$$\mathcal{D}(p(x, 0) \parallel p(x, \rho)) = \frac{c\rho^2}{2} + \mathcal{O}\left(\rho^3\right) \ \text{as} \ \rho \to 0.$$
Another information-theoretic metric is variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.
MAUVE is a measure of the statistical gap between two text distributions, such as the difference between text generated by a model and human-written text. This measure is computed using Kullback-Leibler divergences between the two distributions in a quantized embedding space of a foundation model.
Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.
The self-information, also known as the information content of a signal, random variable, or event, is defined as the negative logarithm of the probability of the given outcome occurring.
When applied to a discrete random variable, the self-information can be represented as
$$\operatorname{I}(m) = D_{\text{KL}}\left(\delta_{im} \parallel \{p_i\}\right),$$
the relative entropy of the probability distribution $P(i)$ from a Kronecker delta representing certainty that $i = m$, i.e. the number of extra bits that must be transmitted to identify $i$ if only the probability distribution $P(i)$ is available to the receiver, not the fact that $i = m$.
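The mutual information,
$$\begin{aligned} \operatorname{I}(X; Y) &= D_{\text{KL}}(P(X, Y) \parallel P(X)P(Y)) \\ &= \operatorname{E}_X\{D_{\text{KL}}(P(Y \mid X) \parallel P(Y))\} \\ &= \operatorname{E}_Y\{D_{\text{KL}}(P(X \mid Y) \parallel P(X))\} \end{aligned}$$
is the relative entropy of the joint probability distribution $P(X, Y)$ from the product $P(X)P(Y)$ of the two marginal probability distributions, i.e. the expected number of extra bits that must be transmitted to identify $X$ and $Y$ if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability $P(X, Y)$ is known, it is the expected number of extra bits that must on average be sent to identify $Y$ if the value of $X$ is not already known to the receiver.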
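The identity between mutual information and the relative entropy of the joint distribution from the product of its marginals can be sketched for a small joint table. The joint probabilities below are hypothetical, and the helper names are assumptions made for this example; the cross-check uses the standard identity $\operatorname{I}(X;Y) = \mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X,Y)$.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(a * math.log(a) for a in p if a > 0)

def mutual_information(joint):
    """I(X;Y) as D_KL(P(X,Y) || P(X)P(Y)) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    flat_joint = [pxy for row in joint for pxy in row]
    flat_product = [a * b for a in px for b in py]
    return kl(flat_joint, flat_product)

joint = [[0.4, 0.1],
         [0.1, 0.4]]   # hypothetical correlated pair of binary variables
print(mutual_information(joint))

# Cross-check against H(X) + H(Y) - H(X, Y).
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
print(entropy(px) + entropy(py) - entropy([v for row in joint for v in row]))
```

For an independent joint table the two arguments of the divergence coincide and the mutual information is zero.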
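The Shannon entropy,
$$\begin{aligned} \mathrm{H}(X) &= \operatorname{E}\left[\operatorname{I}_X(x)\right] \\ &= \log(N) - D_{\text{KL}}\left(p_X(x) \parallel P_U(X)\right) \end{aligned}$$
is the number of bits which would have to be transmitted to identify $X$ from $N$ equally likely possibilities, less the relative entropy of the uniform distribution on the random variates of $X$, $P_U(X)$, from the true distribution $P(X)$, i.e. less the expected number of bits saved, which would have had to be sent if the value of $X$ were coded according to the uniform distribution $P_U(X)$ rather than the true distribution $P(X)$. This definition of Shannon entropy forms the basis of E.T. Jaynes's alternative generalization to continuous distributions, the limiting density of discrete points, which defines the continuous entropy as
$$\lim_{N \to \infty} H_N(X) = \log(N) - \int p(x) \log\frac{p(x)}{m(x)}\,dx,$$
which is equivalent to:
$$\log(N) - D_{\text{KL}}(p(x) \parallel m(x)).$$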
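The conditional entropy,
$$\begin{aligned} \mathrm{H}(X \mid Y) &= \log(N) - D_{\text{KL}}(P(X, Y) \parallel P_U(X)P(Y)) \\ &= \log(N) - D_{\text{KL}}(P(X, Y) \parallel P(X)P(Y)) - D_{\text{KL}}(P(X) \parallel P_U(X)) \\ &= \mathrm{H}(X) - \operatorname{I}(X; Y) \\ &= \log(N) - \operatorname{E}_Y\left[D_{\text{KL}}\left(P\left(X \mid Y\right) \parallel P_U(X)\right)\right] \end{aligned}$$
is the number of bits which would have to be transmitted to identify $X$ from $N$ equally likely possibilities, less the relative entropy of the product distribution $P_U(X)P(Y)$ from the true joint distribution $P(X, Y)$, i.e. less the expected number of bits saved which would have had to be sent if the value of $X$ were coded according to the uniform distribution $P_U(X)$ rather than the conditional distribution $P(X \mid Y)$ of $X$ given $Y$.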
When we have a set of possible events, coming from the distribution $p$, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g. the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution $p$ in advance, we can devise an encoding that would be optimal (e.g. using Huffman coding), meaning the messages we encode will have the shortest length on average (assuming the encoded events are sampled from $p$), which will be equal to Shannon's entropy of $p$ (denoted as $\mathrm{H}(p)$). However, if we use a different probability distribution ($q$) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new (larger) number is measured by the cross entropy between $p$ and $q$.
The cross entropy between two probability distributions ($p$ and $q$) measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution $q$, rather than the "true" distribution $p$. The cross entropy for two distributions $p$ and $q$ over the same probability space is thus defined as follows.
$$\mathrm{H}(p, q) = \operatorname{E}_p[-\log(q)] = \mathrm{H}(p) + D_{\text{KL}}(p \parallel q).$$
For explicit derivation of this, see the Motivation section above.
Under this scenario, relative entropies (KL divergences) can be interpreted as the extra number of bits, on average, that are needed (beyond $\mathrm{H}(p)$) for encoding the events because of using $q$ for constructing the encoding scheme instead of $p$.
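The coding interpretation can be made concrete with the three-event example. The sketch below takes the true distribution $p = (1/2, 1/4, 1/4)$ from the text and a hypothetical mismatched model $q = (1/4, 1/4, 1/2)$ (an assumption for this illustration), assigns each event an ideal code length of $-\log_2 q(x)$ bits, and compares the resulting average message length with the entropy of $p$.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits."""
    return -sum(a * math.log2(a) for a in p if a > 0)

def cross_entropy_bits(p, q):
    """Cross entropy H(p, q) in bits."""
    return -sum(a * math.log2(b) for a, b in zip(p, q) if a > 0)

def kl_bits(p, q):
    """D_KL(p || q) in bits."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 0.25, 0.25]   # true distribution of events A, B, C
q = [0.25, 0.25, 0.5]   # hypothetical mismatched model used to build the code

# Ideal code lengths for a code optimized for q: 2, 2 and 1 bits.
code_lengths_q = [-math.log2(b) for b in q]
expected_length = sum(a * length for a, length in zip(p, code_lengths_q))

print(expected_length)            # average bits per event with the wrong code
print(entropy_bits(p))            # average bits per event with the right code
print(kl_bits(p, q))              # the excess: H(p, q) - H(p)
```

The average length under the mismatched code equals the cross entropy, and its excess over the entropy of $p$ is exactly the relative entropy.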
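In Bayesian statistics, relative entropy can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: $p(x) \to p(x \mid I)$. If some new fact $Y = y$ is discovered, it can be used to update the posterior distribution for $X$ from $p(x \mid I)$ to a new posterior distribution $p(x \mid y, I)$ using Bayes' theorem:
$$p(x \mid y, I) = \frac{p(y \mid x, I)\, p(x \mid I)}{p(y \mid I)}.$$
This distribution has a new entropy:
$$\mathrm{H}(p(x \mid y, I)) = -\sum_x p(x \mid y, I) \log p(x \mid y, I),$$
which may be less than or greater than the original entropy $\mathrm{H}(p(x \mid I))$. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on $p(x \mid I)$ instead of a new code based on $p(x \mid y, I)$ would have added an expected number of bits:
$$D_{\text{KL}}(p(x \mid y, I) \parallel p(x \mid I)) = \sum_x p(x \mid y, I) \log\left(\frac{p(x \mid y, I)}{p(x \mid I)}\right)$$
to the message length. This therefore represents the amount of useful information, or information gain, about $X$, that has been learned by discovering $Y = y$.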
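If a further piece of data, $Y_2 = y_2$, subsequently comes in, the probability distribution for $x$ can be updated further, to give a new best guess $p(x \mid y_1, y_2, I)$. If one reinvestigates the information gain for using $p(x \mid y_1, I)$ rather than $p(x \mid I)$, it turns out that it may be either greater or less than previously estimated:
$$\sum_x p(x \mid y_1, y_2, I) \log\left(\frac{p(x \mid y_1, y_2, I)}{p(x \mid I)}\right)$$
may be $\le$ or $>$ than
$$\sum_x p(x \mid y_1, I) \log\left(\frac{p(x \mid y_1, I)}{p(x \mid I)}\right)$$
and so the combined information gain does not obey the triangle inequality:
$$D_{\text{KL}}(p(x \mid y_1, y_2, I) \parallel p(x \mid I))$$
may be $<$, $=$ or $>$ than
$$D_{\text{KL}}(p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)) + D_{\text{KL}}(p(x \mid y_1, I) \parallel p(x \mid I)).$$
All one can say is that on average, averaging using $p(y_2 \mid y_1, x, I)$, the two sides will average out.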
A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior.[25] When posteriors are approximated to be Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.
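Relative entropy can also be interpreted as the expected discrimination information for $H_1$ over $H_0$: the mean information per sample for discriminating in favor of a hypothesis $H_1$ against a hypothesis $H_0$, when hypothesis $H_1$ is true. Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for $H_1$ over $H_0$ to be expected from each sample.
The expected weight of evidence for $H_1$ over $H_0$ is not the same as the information gain expected per sample about the probability distribution $p(H)$ of the hypotheses,
$$D_{\text{KL}}(p(x \mid H_1) \parallel p(x \mid H_0)) \ne IG = D_{\text{KL}}(p(H \mid x) \parallel p(H \mid I)).$$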
Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.
On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.
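The idea of relative entropy as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution $f$ should be chosen which is as hard to discriminate from the original distribution $f_0$ as possible, so that the new data produces as small an information gain $D_{\text{KL}}(f \parallel f_0)$ as possible.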
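For example, if one had a prior distribution $p(x, a)$ over $x$ and $a$, and subsequently learnt the true distribution of $a$ was $u(a)$, then the relative entropy between the new joint distribution for $x$ and $a$, $q(x \mid a) u(a)$, and the earlier prior distribution would be:
$$D_{\text{KL}}(q(x \mid a) u(a) \parallel p(x, a)) = \operatorname{E}_{u(a)}\left\{D_{\text{KL}}(q(x \mid a) \parallel p(x \mid a))\right\} + D_{\text{KL}}(u(a) \parallel p(a)),$$
i.e. the sum of the relative entropy of $p(a)$, the prior distribution for $a$, from the updated distribution $u(a)$, plus the expected value (using the probability distribution $u(a)$) of the relative entropy of the prior conditional distribution $p(x \mid a)$ from the new conditional distribution $q(x \mid a)$. (The latter expected value is often called the conditional relative entropy and denoted by $D_{\text{KL}}(q(x \mid a) \parallel p(x \mid a))$.) This is minimized if $q(x \mid a) = p(x \mid a)$ over the whole support of $u(a)$; and we note that this result incorporates Bayes' theorem, if the new distribution $u(a)$ is in fact a δ function representing certainty that $a$ has one particular value.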
MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the relative entropy continues to be just as relevant.
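In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short. Minimising relative entropy from $m$ to $p$ with respect to $m$ is equivalent to minimizing the cross-entropy of $p$ and $m$, since
$$\mathrm{H}(p, m) = \mathrm{H}(p) + D_{\text{KL}}(p \parallel m),$$
which is appropriate if one is trying to choose an adequate approximation to $p$. However, this is just as often not the task one is trying to achieve. Instead, just as often it is $m$ that is some fixed prior reference measure, and $p$ that one is attempting to optimise by minimising $D_{\text{KL}}(p \parallel m)$ subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be $D_{\text{KL}}(p \parallel m)$, rather than $\mathrm{H}(p, m)$.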
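Surprisals[27] add where probabilities multiply. The surprisal for an event of probability $p$ is defined as $s = k \ln(1/p)$. If $k$ is $\left\{1, 1/\ln 2, 1.38 \times 10^{-23}\right\}$ then surprisal is in $\{$nats, bits, or J/K$\}$ so that, for instance, there are $N$ bits of surprisal for landing all "heads" on a toss of $N$ coins.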
Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal (entropy) for a given set of control parameters (like pressure or volume). This constrained entropy maximization, both classically[28] and quantum mechanically,[29] minimizes Gibbs availability in entropy units[30] $A \equiv -k \ln(Z)$, where $Z$ is a constrained multiplicity or partition function.
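When temperature $T$ is fixed, free energy ($T \times A$) is also minimized. Thus if $T, V$ and the number of molecules $N$ are constant, the Helmholtz free energy $F \equiv U - TS$ (where $U$ is energy and $S$ is entropy) is minimized as a system "equilibrates." If $T$ and $P$ are held constant, the Gibbs free energy $G = U + PV - TS$ is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature $T_o$ and pressure $P_o$ is $W = \Delta G = NkT_o\Theta(V/V_o)$ where $V_o = NkT_o/P_o$ and $\Theta(x) = x - 1 - \ln x \ge 0$.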
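More generally[31] the work available relative to some ambient is obtained by multiplying ambient temperature $T_o$ by relative entropy or net surprisal $\Delta I \ge 0$, defined as the average value of $k \ln(p/p_o)$ where $p_o$ is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of $V_o$ and $T_o$ is thus $W = T_o \Delta I$, where relative entropy
$$\Delta I = Nk\left[\Theta\left(\frac{V}{V_o}\right) + \frac{3}{2}\Theta\left(\frac{T}{T_o}\right)\right].$$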
The resulting contours of constant relative entropy, shown at right for a mole of Argon at standard temperature and pressure, for example put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to convert boiling-water to ice-water discussed here.[32] Thus relative entropy measures thermodynamic availability in bits.
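For density matrices $P$ and $Q$ on a Hilbert space, the quantum relative entropy from $Q$ to $P$ is defined to be
$$D_{\text{KL}}(P \parallel Q) = \operatorname{Tr}(P(\log(P) - \log(Q))).$$
In quantum information science the minimum of $D_{\text{KL}}(P \parallel Q)$ over all separable states $Q$ can also be used as a measure of entanglement in the state $P$.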
Just as relative entropy of "actual from ambient" measures thermodynamic availability, relative entropy of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case relative entropy describes distance to equilibrium or (when multiplied by ambient temperature) the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.
Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via the Akaike information criterion is particularly well described in papers[33] and a book[34] by Burnham and Anderson. In a nutshell, the relative entropy of reality from a model may be estimated, to within a constant additive term, by a function of the deviations observed between data and the model's predictions (like the mean squared deviation). Estimates of such divergence for models that share the same additive term can in turn be used to select among models.
When trying to fit parametrized models to data there are various estimators which attempt to minimize relative entropy, such as maximum likelihood and maximum spacing estimators.
Kullback and Leibler also considered the symmetrized function:
$$D_{\text{KL}}(P \parallel Q) + D_{\text{KL}}(Q \parallel P)$$
which they referred to as the "divergence", though today the "KL divergence" refers to the asymmetric function. This function is symmetric and nonnegative, and had already been defined and used by Harold Jeffreys in 1948; it is accordingly called the Jeffreys divergence.
This quantity has sometimes been used for feature selection in classification problems, where $P$ and $Q$ are the conditional pdfs of a feature under two different classes. In the banking and finance industries, this quantity is referred to as the Population Stability Index (PSI), and is used to assess distributional shifts in model features through time.
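An alternative is given via the $\lambda$-divergence,
$$D_\lambda(P \parallel Q) = \lambda D_{\text{KL}}(P \parallel \lambda P + (1 - \lambda)Q) + (1 - \lambda) D_{\text{KL}}(Q \parallel \lambda P + (1 - \lambda)Q),$$
which can be interpreted as the expected information gain about $X$ from discovering which probability distribution $X$ is drawn from, $P$ or $Q$, if they currently have probabilities $\lambda$ and $1 - \lambda$ respectively.
The value $\lambda = 0.5$ gives the Jensen–Shannon divergence, defined by
$$D_{\text{JS}} = \frac{1}{2} D_{\text{KL}}(P \parallel M) + \frac{1}{2} D_{\text{KL}}(Q \parallel M)$$
where $M$ is the average of the two distributions,
$$M = \frac{1}{2}(P + Q).$$
We can also interpret $D_{\text{JS}}$ as the capacity of a noisy information channel with two inputs giving the output distributions $P$ and $Q$.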
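The symmetrized constructions above can be sketched directly from the definition of relative entropy. The code below is an illustrative sketch (function names are assumptions): it computes the Jeffreys divergence as the sum of the two directed divergences, and the Jensen–Shannon divergence via the mixture $M = \tfrac{1}{2}(P + Q)$, which stays finite even when the supports of $P$ and $Q$ do not overlap.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; infinite if p has mass where q has none."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

def jeffreys(p, q):
    """Symmetrized divergence D_KL(p || q) + D_KL(q || p)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence using the mixture M = (P + Q) / 2."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = [0.36, 0.48, 0.16]
Q = [1 / 3, 1 / 3, 1 / 3]
print(jeffreys(P, Q), jensen_shannon(P, Q))

# Disjoint supports: the directed divergences are infinite, but the
# Jensen-Shannon divergence reaches its maximum value of ln 2.
print(jensen_shannon([1.0, 0.0], [0.0, 1.0]), math.log(2))
```

Because every outcome with positive probability under $P$ or $Q$ also has positive probability under $M$, both terms of the Jensen–Shannon divergence are always finite.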
Furthermore, the Jensen–Shannon divergence can be generalized using abstract statistical M-mixtures relying on an abstract mean M.[35] [36]
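There are many other important measures of probability distance. Some of these are particularly connected with relative entropy. For example:
The total variation distance, $\delta(p, q)$. This is connected to the divergence through Pinsker's inequality, $\delta(P, Q) \le \sqrt{\tfrac{1}{2} D_{\text{KL}}(P \parallel Q)}$. Pinsker's inequality is vacuous for any distributions where $D_{\text{KL}}(P \parallel Q) > 2$, since the total variation distance is at most 1.
The family of Rényi divergences generalize relative entropy. Depending on the value of a certain parameter, $\alpha$, various inequalities may be deduced.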
Other notable measures of distance include the Hellinger distance, histogram intersection, Chi-squared statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance.[39]
Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical background for data differencing: the absolute entropy of a set of data in this sense is the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target given the source (minimum size of a patch).