The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon,[1] based on the binary logarithm. Although "bit" is used more frequently than "shannon", the name does not distinguish this unit from the bit as used in data processing, where it refers to a binary value or stream regardless of its entropy (information content). Other units include the nat, based on the natural logarithm, and the hartley, based on the base-10 or common logarithm.
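In what follows, an expression of the form $p \log p$ is considered by convention to be equal to zero whenever $p$ is zero. This is justified because $\lim_{p \to 0^+} p \log p = 0$ for any logarithmic base.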
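Shannon derived a measure of information content called the self-information or "surprisal" of a message $m$:

$$\operatorname{I}(m) = \log\left(\frac{1}{p(m)}\right) = -\log(p(m))$$

where $p(m) = \Pr(M = m)$ is the probability that message $m$ is chosen from all possible choices in the message space $M$.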
Information from a source is gained by a recipient only if the recipient did not already have that information. Messages that convey information about a certain ($P = 1$) event (or one that is known with certainty, for instance through a back-channel) provide no information, as the above equation indicates. Infrequently occurring messages contain more information than more frequently occurring messages.
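The behaviour described above can be checked numerically. The following is a minimal sketch, not a reference implementation; the function name `self_information` and its `base` parameter are illustrative choices. It returns the surprisal in shannons (bits) by default, and in nats or hartleys when the base is set to e or 10.

```python
import math

def self_information(p, base=2.0):
    """Surprisal log_base(1/p) of an outcome with probability p.

    base=2 gives shannons (bits), base=e gives nats, base=10 gives hartleys.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0.0:
        # An impossible event has unbounded surprisal.
        return math.inf
    # Compute in bits first, then rescale to the requested base.
    return math.log2(1.0 / p) / math.log2(base)

print(self_information(1.0))       # certain event: carries no information
print(self_information(0.5))       # fair coin flip
print(self_information(1 / 1024))  # rare event: carries more information
```

Changing the base only rescales the result, which is why the choice of logarithm fixes the unit rather than the underlying quantity.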
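It can also be shown that a compound message of two (or more) unrelated messages has a quantity of information equal to the sum of the measures of information of the individual messages. This can be derived from the definition by considering a compound message $m \& n$ that provides information about the values of two random variables $M$ and $N$ and is the concatenation of the elementary messages $m$ and $n$, whose information contents are $\operatorname{I}(m)$ and $\operatorname{I}(n)$ respectively. If the messages $m$ and $n$ depend only on $M$ and $N$ respectively, and $M$ and $N$ are independent, then $P(m \& n) = P(m)P(n)$ (the definition of statistical independence), and the definition above gives

$$\operatorname{I}(m \& n) = -\log\bigl(P(m)P(n)\bigr) = -\log P(m) - \log P(n) = \operatorname{I}(m) + \operatorname{I}(n).$$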
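As a quick numerical illustration of this additivity, the sketch below compares the surprisal of a compound message with the sum of the individual surprisals. The probabilities `p_m` and `p_n` are made-up values, and `surprisal_bits` is a hypothetical helper name; independence is assumed by multiplying the two probabilities.

```python
import math

def surprisal_bits(p):
    """Self-information of an outcome with probability p > 0, in bits."""
    return math.log2(1.0 / p)

# Made-up probabilities for two independent elementary messages.
p_m, p_n = 0.25, 0.1

# Independence: the compound message has probability P(m)P(n).
p_compound = p_m * p_n

info_compound = surprisal_bits(p_compound)
info_sum = surprisal_bits(p_m) + surprisal_bits(p_n)

print(info_compound, info_sum)  # the two values agree up to rounding
```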
An example: The weather forecast broadcast is: "Tonight's forecast: Dark. Continued darkness until widely scattered light in the morning." This message contains almost no information. However, a forecast of a snowstorm would certainly contain information, since snowstorms do not happen every evening. There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as Miami. The amount of information in a forecast of snow for a location where it never snows (an impossible event, $p = 0$) is the highest possible: it is infinite.
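The entropy of a discrete message space $M$ is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message $m$ from that message space:

$$\mathrm{H}(M) = \mathbb{E}\left[\operatorname{I}(M)\right] = \sum_{m \in M} p(m) \operatorname{I}(m) = -\sum_{m \in M} p(m) \log p(m)$$

where $\mathbb{E}[\cdot]$ denotes the expected value operation.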
An important property of entropy is that it is maximized when all the messages in the message space are equiprobable (i.e., $p(m) = 1/|M|$). In this case $\mathrm{H}(M) = \log |M|$.
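Sometimes the function $\mathrm{H}$ is expressed in terms of the probabilities of the distribution:

$$\mathrm{H}(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i,$$

where each $p_i \geq 0$ and $\sum_{i=1}^{k} p_i = 1$.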
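The entropy of a distribution given as a list of probabilities can be sketched in a few lines. This is an illustrative sketch only; the function name `entropy` and the two example distributions are assumptions made for the demonstration. Zero-probability terms are skipped, following the convention that $0 \log 0 = 0$.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0):
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 are skipped (convention: 0 log 0 = 0).
    bits = sum(p * math.log2(1.0 / p) for p in probs if p > 0)
    return bits / math.log2(base)

uniform = [0.25, 0.25, 0.25, 0.25]   # equiprobable messages
skewed = [0.7, 0.1, 0.1, 0.1]        # same message space, unequal probabilities

print(entropy(uniform))  # equals log2(4) for four equiprobable messages
print(entropy(skewed))   # strictly smaller than the uniform case
```

The comparison illustrates the maximization property: among distributions over the same four messages, the equiprobable one has the largest entropy.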
An important special case of this is the binary entropy function:
$$\mathrm{H}_b(p) = \mathrm{H}(p, 1-p) = -p \log p - (1-p) \log(1-p)$$
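A minimal sketch of the binary entropy function in bits follows; the name `binary_entropy` is an illustrative choice. The endpoints $p = 0$ and $p = 1$ are handled explicitly using the convention $0 \log 0 = 0$.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in bits for a two-outcome distribution (p, 1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0  # a certain outcome has no uncertainty
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
```

The function is symmetric about $p = 1/2$, where it reaches its maximum of one bit.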
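The joint entropy of two discrete random variables $X$ and $Y$ is defined as the entropy of the joint distribution of $X$ and $Y$:

$$\mathrm{H}(X, Y) = \mathbb{E}_{X,Y}\left[-\log p(x, y)\right] = -\sum_{x, y} p(x, y) \log p(x, y)$$

If $X$ and $Y$ are independent, then the joint entropy is simply the sum of their individual entropies.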
(Note: The joint entropy should not be confused with the cross entropy, despite similar notations.)
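The additivity of joint entropy for independent variables can be illustrated with a small sketch. The representation of a joint distribution as a dictionary keyed by `(x, y)` pairs, the helper names, and the example marginals are all assumptions made for this illustration.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities (0 log 0 = 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def joint_entropy(joint):
    """H(X, Y) in bits for a joint distribution {(x, y): p(x, y)}."""
    return entropy_bits(joint.values())

# Made-up marginals; the joint is their product, so X and Y are independent.
p_x = {"a": 0.5, "b": 0.5}
p_y = {0: 0.25, 1: 0.75}
independent = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

h_xy = joint_entropy(independent)
h_x = entropy_bits(p_x.values())
h_y = entropy_bits(p_y.values())

print(h_xy, h_x + h_y)  # equal up to rounding when X and Y are independent
```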
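Given a particular value of a random variable $Y$, the conditional entropy of $X$ given $Y = y$ is defined as:

$$\mathrm{H}(X \mid y) = \mathbb{E}_{X \mid Y}\left[-\log p(x \mid y)\right] = -\sum_{x} p(x \mid y) \log p(x \mid y)$$

where $p(x \mid y) = \frac{p(x, y)}{p(y)}$ is the conditional probability of $x$ given $y$.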
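The conditional entropy of $X$ given $Y$, also called the equivocation of $X$ about $Y$, is then given by:

$$\mathrm{H}(X \mid Y) = \mathbb{E}_{Y}\left[\mathrm{H}(X \mid y)\right] = -\sum_{y} p(y) \sum_{x} p(x \mid y) \log p(x \mid y) = \sum_{x, y} p(x, y) \log \frac{p(y)}{p(x, y)}.$$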
This uses the conditional expectation from probability theory.
A basic property of the conditional entropy is that:
$$\mathrm{H}(X \mid Y) = \mathrm{H}(X, Y) - \mathrm{H}(Y).$$
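This property can be verified numerically on a small example. The sketch below is illustrative only: the joint distribution is made up, and the helper names are hypothetical. It computes the conditional entropy directly from the joint distribution and compares it with the difference between the joint entropy and the entropy of $Y$.

```python
import math
from collections import defaultdict

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities (0 log 0 = 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def marginal_y(joint):
    """Marginal distribution of Y from a joint distribution {(x, y): p(x, y)}."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    return dict(p_y)

def conditional_entropy(joint):
    """H(X|Y) in bits, computed as sum over (x, y) of p(x,y) * log2(p(y) / p(x,y))."""
    p_y = marginal_y(joint)
    return sum(p * math.log2(p_y[y] / p) for (_, y), p in joint.items() if p > 0)

# Made-up joint distribution of X (weather) and Y (sky condition).
joint = {
    ("rain", "cloudy"): 0.30,
    ("dry", "cloudy"): 0.20,
    ("rain", "clear"): 0.05,
    ("dry", "clear"): 0.45,
}

h_x_given_y = conditional_entropy(joint)
h_xy = entropy_bits(joint.values())
h_y = entropy_bits(marginal_y(joint).values())

print(h_x_given_y, h_xy - h_y)  # the two values agree up to rounding
```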
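The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution $p$ and an arbitrary probability distribution $q$. If data are compressed in a manner that assumes $q$ is the underlying distribution when, in reality, $p$ is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression:

$$D_{\mathrm{KL}}\bigl(p(X) \,\|\, q(X)\bigr) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$$

It is in some sense the "distance" from $q$ to $p$, although it is not a true metric because it is not symmetric.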
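The lack of symmetry is easy to see numerically. The following is a minimal sketch under stated assumptions: both distributions are given as lists over the same ordered outcomes, the function name `kl_divergence` is illustrative, and the example distributions are made up.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same ordered outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # convention: 0 log 0 = 0
        if q_i == 0:
            return math.inf  # p assigns mass where q assigns none
        total += p_i * math.log2(p_i / q_i)
    return total

p = [0.5, 0.5]  # made-up "true" distribution
q = [0.9, 0.1]  # made-up assumed distribution

print(kl_divergence(p, q))  # divergence of q from p
print(kl_divergence(q, p))  # a different value: the divergence is not symmetric
print(kl_divergence(p, p))  # zero when the distributions coincide
```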
One of the most useful and important measures of information is the mutual information, or transinformation. This is a measure of how much information can be obtained about one random variable by observing another. The mutual information of $X$ relative to $Y$ (which represents conceptually the average amount of information about $X$ that can be gained by observing $Y$) is given by:

$$\operatorname{I}(X; Y) = \sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y) \log \frac{p(x \mid y)}{p(x)}$$
A basic property of the mutual information is that:
$$\operatorname{I}(X; Y) = \mathrm{H}(X) - \mathrm{H}(X \mid Y).$$
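That is, knowing $Y$, we can save an average of $\operatorname{I}(X; Y)$ bits in encoding $X$ compared to not knowing $Y$. Mutual information is symmetric:

$$\operatorname{I}(X; Y) = \operatorname{I}(Y; X) = \mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X, Y).$$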
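Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) of the posterior probability distribution of $X$ given the value of $Y$ from the prior distribution on $X$:

$$\operatorname{I}(X; Y) = \mathbb{E}_{p(y)}\left[D_{\mathrm{KL}}\bigl(p(X \mid Y = y) \,\|\, p(X)\bigr)\right].$$

In other words, this is a measure of how much, on average, the probability distribution on $X$ will change if we are given the value of $Y$. It is often recalculated as the divergence of the actual joint distribution from the product of the marginal distributions:

$$\operatorname{I}(X; Y) = D_{\mathrm{KL}}\bigl(p(X, Y) \,\|\, p(X)p(Y)\bigr).$$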
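The agreement between these expressions can be checked on a small example. The sketch below is illustrative only: the joint distribution is made up and the helper names are hypothetical. It computes the mutual information as the divergence of the joint distribution from the product of its marginals, and compares the result with $\mathrm{H}(X) + \mathrm{H}(Y) - \mathrm{H}(X, Y)$.

```python
import math
from collections import defaultdict

def entropy_bits(probs):
    """Shannon entropy in bits of an iterable of probabilities (0 log 0 = 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def marginals(joint):
    """Marginal distributions of X and Y from a joint {(x, y): p(x, y)}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return dict(p_x), dict(p_y)

def mutual_information(joint):
    """I(X;Y) in bits as the KL divergence of p(x,y) from p(x)p(y)."""
    p_x, p_y = marginals(joint)
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

# Made-up joint distribution of two dependent variables.
joint = {
    ("rain", "cloudy"): 0.30,
    ("dry", "cloudy"): 0.20,
    ("rain", "clear"): 0.05,
    ("dry", "clear"): 0.45,
}

p_x, p_y = marginals(joint)
mi = mutual_information(joint)
via_entropies = (
    entropy_bits(p_x.values())
    + entropy_bits(p_y.values())
    - entropy_bits(joint.values())
)

print(mi, via_entropies)  # the two values agree up to rounding
```

Because the sum runs over pairs `(x, y)` and the summand is unchanged when the roles of the two variables are swapped, the symmetry of mutual information is visible directly in the code.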
Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution, and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and it has a well-specified asymptotic distribution.
See main article: Differential entropy.
The basic measures of discrete entropy have been extended by analogy to continuous spaces by replacing sums with integrals and probability mass functions with probability density functions. Although, in both cases, mutual information expresses the number of bits of information common to the two sources in question, the analogy does not imply identical properties; for example, differential entropy may be negative.
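The differential analogues of entropy, joint entropy, conditional entropy, and mutual information are defined as follows:

$$h(X) = -\int_{X} f(x) \log f(x) \, dx$$

$$h(X, Y) = -\int_{Y} \int_{X} f(x, y) \log f(x, y) \, dx \, dy$$

$$h(X \mid y) = -\int_{X} f(x \mid y) \log f(x \mid y) \, dx$$

$$h(X \mid Y) = \int_{Y} \int_{X} f(x, y) \log \frac{f(y)}{f(x, y)} \, dx \, dy$$

$$\operatorname{I}(X; Y) = \int_{Y} \int_{X} f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \, dx \, dy$$

where $f(x, y)$ is the joint density function, $f(x)$ and $f(y)$ are the marginal densities, and $f(x \mid y)$ is the conditional density.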
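That differential entropy may be negative can be seen with a simple numerical estimate. The sketch below is illustrative and makes several assumptions: it uses a plain midpoint rule over a finite interval, the function and helper names are hypothetical, and the two example densities (a uniform density on $[0, 1/2]$ and a standard Gaussian truncated to $[-10, 10]$ for integration) are chosen only for demonstration. The Gaussian estimate is compared with the known closed form $\tfrac{1}{2}\log_2(2\pi e \sigma^2)$.

```python
import math

def differential_entropy(density, lo, hi, steps=20_000):
    """Midpoint-rule estimate of h(X) = -integral of f(x) log2 f(x) over [lo, hi], in bits."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = density(lo + (i + 0.5) * width)
        if fx > 0:  # regions of zero density contribute nothing
            total -= fx * math.log2(fx) * width
    return total

def uniform_density(a):
    """Density of the uniform distribution on [0, a]."""
    return lambda x: 1.0 / a if 0.0 <= x <= a else 0.0

def gaussian_density(sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return lambda x: math.exp(-x * x / (2.0 * sigma * sigma)) / norm

# Uniform on [0, 0.5]: the density exceeds 1 everywhere, so h(X) is negative.
h_uniform = differential_entropy(uniform_density(0.5), 0.0, 0.5)

# Standard Gaussian, integrated over a range wide enough to capture the tails.
h_gauss = differential_entropy(gaussian_density(1.0), -10.0, 10.0)
closed_form = 0.5 * math.log2(2.0 * math.pi * math.e)

print(h_uniform)             # negative
print(h_gauss, closed_form)  # numerical estimate vs. closed form
```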