In statistics, Gower's distance between two mixed-type objects is a similarity measure that can handle different types of data within the same dataset and is particularly useful in cluster analysis or other multivariate statistical techniques. Data can be binary, ordinal, or continuous variables. It works by normalizing the differences between each pair of variables and then computing a weighted average of these differences. The distance was defined in 1971 by Gower[1] and it takes values between 0 and 1 with smaller values indicating higher similarity.
For two objects
i
j
p
S
where the
wijk
1
sijk
k
sijk
sijk=1-
|xi-xj| | |
Rk |
Rk
k
0\leqsijk\leq1
Sij
In its original exposition, the distance does not treat ordinal variables in a special manner. In the 1990s, first Kaufman and Rousseeuw[4] and later Podani[5] suggested extensions where the ordering of an ordinal feature is used. For example, Podani obtains relative rank differences as
sijk=1-
|ri-rj| | |
max{\{r\ |
r
k
Many programming languages and statistical packages, such as R, Python, etc., include implementations of Gower's distance. The implementations may follow Kaufmann and Rousseeuw's extensions, which change the similarity for continuous variables to
sijk=
|xi-xj| | |
Rk |
Language/program | Function | Ref. | |
---|---|---|---|
StatMatch::gower.dist(X) | https://search.r-project.org/CRAN/refmans/StatMatch/html/gower.dist.html | ||
gower.gower_matrix(X) | https://pypi.org/project/gower/ |