In statistics, probability theory and information theory, pointwise mutual information (PMI),[1] or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be if the events were independent.[2]
PMI (especially in its positive pointwise mutual information variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in [a] corpus than we would have expected them to appear by chance."
The concept was introduced in 1961 by Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables: The mutual information (MI) of two discrete random variables refers to the average PMI of all possible events.
The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:
\operatorname{pmi}(x;y)\equiv
log | ||||
|
=
log | ||||
|
=
log | ||||
|
(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables X and Y is the expected value of the PMI (over all possible outcomes).
The measure is symmetric (
\operatorname{pmi}(x;y)=\operatorname{pmi}(y;x)
p(x|y)
p(y|x)=1
-infty\leq\operatorname{pmi}(x;y)\leqmin\left[-logp(x),-logp(y)\right].
Finally,
\operatorname{pmi}(x;y)
p(x|y)
p(x)
Here is an example to illustrate:
x | y | p(x, y) | |
---|---|---|---|
0 | 0 | 0.1 | |
0 | 1 | 0.7 | |
1 | 0 | 0.15 | |
1 | 1 | 0.05 |
p(x) | p(y) | ||
---|---|---|---|
0 | 0.8 | 0.25 | |
1 | 0.2 | 0.75 |
\operatorname{pmi}(x;y)
pmi(x=0;y=0) | = | -1 |
pmi(x=0;y=1) | = | 0.222392 |
pmi(x=1;y=0) | = | 1.584963 |
pmi(x=1;y=1) | = | -1.584963 |
\operatorname{I}(X;Y)
Pointwise Mutual Information has many of the same relationships as the mutual information. In particular,
\begin{align} \operatorname{pmi}(x;y)&=&h(x)+h(y)-h(x,y)\ &=&h(x)-h(x\midy)\ &=&h(y)-h(y\midx) \end{align}
Where
h(x)
-log2p(x)
Several variations of PMI have been proposed, in particular to address what has been described as its "two main limitations":[3]
The positive pointwise mutual information (PPMI) measure is defined by setting negative values of PMI to zero:
\operatorname{ppmi}(x;y)\equiv
max\left(log | ||||
|
,0\right)
This definition is motivated by the observation that "negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous" and also by a concern that "it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgment". It also avoid having to deal with
-infty
p(x,y)=0
Pointwise mutual information can be normalized between [-1,+1] resulting in -1 (in the limit) for never occurring together, 0 for independence, and +1 for complete co-occurrence.[4]
\operatorname{npmi}(x;y)=
\operatorname{pmi | |
(x;y)}{h(x, |
y)}
Where
h(x,y)
-log2p(x,y)
The PMIk measure (for k=2, 3 etc.), which was introduced by Béatrice Daille around 1994, and as of 2011 was described as being "among the most widely used variants", is defined as[5]
\operatorname{pmi}k(x;y)\equiv
log | ||||
|
=\operatorname{pmi}(x;y)-(-(k-1))log2p(x,y))
In particular,
pmi1(x;y)=pmi(x;y)
p(x,y)
pmi(x;y)
Total correlation is an extension of mutual information to multi-variables. Analogously to the definition of total correlation, the extension of PMI to multi-variables is "specific correlation."[6] The SI of the results of random variables
\boldsymbol{x}=(x1,x2,\ldots{},xn)
SI(x1,x2,\ldots,xn) \equivlog
p(x1,x2,\ldots,xn) | ||||||||
|
=logp(\boldsymbol{x})-log
n | |
\prod | |
i=1 |
p\left(xi\right)
Like mutual information,[7] point mutual information follows the chain rule, that is,
\operatorname{pmi}(x;yz)=\operatorname{pmi}(x;y)+\operatorname{pmi}(x;z|y)
This is proven through application of Bayes' theorem:
\begin{align} \operatorname{pmi}(x;y)+\operatorname{pmi}(x;z|y)&{}=log
p(x,y) | |
p(x)p(y) |
+log
p(x,z|y) | |
p(x|y)p(z|y) |
\ &{}=log\left[
p(x,y) | |
p(x)p(y) |
p(x,z|y) | |
p(x|y)p(z|y) |
\right]\ &{}=log
p(x|y)p(y)p(x,z|y) | |
p(x)p(y)p(x|y)p(z|y) |
\\ &{}=log
p(x,yz) | |
p(x)p(yz) |
\\ &{}=\operatorname{pmi}(x;yz) \end{align}
PMI could be used in various disciplines e.g. in information theory, linguistics or chemistry (in profiling and analysis of chemical compounds).[8] In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, countings of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities
p(x)
p(x,y)
word 1 | word 2 | count word 1 | count word 2 | count of co-occurrences | PMI | |
---|---|---|---|---|---|---|
puerto | rico | 1938 | 1311 | 1159 | 10.0349081703 | |
hong | kong | 2438 | 2694 | 2205 | 9.72831972408 | |
los | angeles | 3501 | 2808 | 2791 | 9.56067615065 | |
carbon | dioxide | 4265 | 1353 | 1032 | 9.09852946116 | |
prize | laureate | 5131 | 1676 | 1210 | 8.85870710982 | |
san | francisco | 5237 | 2477 | 1779 | 8.83305176711 | |
nobel | prize | 4098 | 5131 | 2498 | 8.68948811416 | |
ice | hockey | 5607 | 3002 | 1933 | 8.6555759741 | |
star | trek | 8264 | 1594 | 1489 | 8.63974676575 | |
car | driver | 5578 | 2749 | 1384 | 8.41470768304 | |
it | the | 283891 | 3293296 | 3347 | -1.72037278119 | |
are | of | 234458 | 1761436 | 1019 | -2.09254205335 | |
this | the | 199882 | 3293296 | 1211 | -2.38612756961 | |
is | of | 565679 | 1761436 | 1562 | -2.54614706831 | |
and | of | 1375396 | 1761436 | 2949 | -2.79911817902 | |
a | and | 984442 | 1375396 | 1457 | -2.92239510038 | |
in | and | 1187652 | 1375396 | 1537 | -3.05660070757 | |
to | and | 1025659 | 1375396 | 1286 | -3.08825363041 | |
to | in | 1025659 | 1187652 | 1066 | -3.12911348956 | |
of | and | 1761436 | 1375396 | 1190 | -3.70663100173 |
Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.
. Fano. R M. Robert Fano. 1961. Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA. chapter 2. 978-0262561693.