In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes.[1] That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.).
Multinomial logistic regression is known by a variety of other names, including polytomous LR,[2] [3] multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.[4]
Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently categorical, meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be:
These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logistic regression is a particular solution to classification problems that uses a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken).
The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case.[5]
If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of K alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K − 1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus.
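The IIA property can be checked numerically. The following is a minimal sketch, assuming hypothetical utility scores (the numbers and the function name are illustrative, not taken from any study). It suggests that under the multinomial logit the odds of car versus blue bus stay the same when a red bus is added, so the model cannot reproduce the 1 : 0.5 : 0.5 pattern described above.

```python
import math

def choice_probabilities(scores):
    """Multinomial logit probabilities for a dict of alternative -> score."""
    z = sum(math.exp(s) for s in scores.values())
    return {name: math.exp(s) / z for name, s in scores.items()}

# Hypothetical scores: car and blue bus are equally attractive.
two_way = choice_probabilities({"car": 0.0, "blue bus": 0.0})
# Add a red bus with the same score as the blue bus.
three_way = choice_probabilities({"car": 0.0, "blue bus": 0.0, "red bus": 0.0})

odds_before = two_way["car"] / two_way["blue bus"]
odds_after = three_way["car"] / three_way["blue bus"]
print(odds_before, odds_after)   # both are 1.0
print(three_way)                 # each alternative gets probability 1/3
```

In this sketch the car's probability falls to 1/3, whereas a commuter who is indifferent between bus colours would keep choosing the car half of the time.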
If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. This is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance, if one political candidate withdraws from a three-candidate race). Other models, such as the nested logit or the multinomial probit, may be used in such cases, as they allow for violation of the IIA.[6]
See also: Logistic regression.
There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model.
The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product:
\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol\beta_k \cdot \mathbf{X}_i,
where Xi is the vector of explanatory variables describing observation i, βk is a vector of weights (or regression coefficients) corresponding to outcome k, and score(Xi, k) is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score.
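As a minimal sketch of the scoring step, the following computes the dot-product score for each outcome and picks the outcome with the highest score. The weights, feature values, and outcome labels are hypothetical and chosen only for illustration.

```python
def score(beta_k, x_i):
    """Dot product of a weight vector with a feature vector."""
    return sum(b * x for b, x in zip(beta_k, x_i))

# Hypothetical weights for K = 3 outcomes and 2 features.
betas = {
    "A": [1.0, -0.5],
    "B": [0.2, 0.8],
    "C": [-1.0, 0.1],
}
x_i = [2.0, 1.0]  # features of one observation

scores = {k: score(beta_k, x_i) for k, beta_k in betas.items()}
predicted = max(scores, key=scores.get)  # outcome with the highest score
print(scores)
print(predicted)
```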
The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation i choosing outcome k given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.9^5 ≈ 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.8^5 ≈ 33%. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue.
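The accuracy figures in the error-propagation example follow from multiplying the per-stage accuracies, under the simplifying assumption that the stages err independently. A minimal sketch (the function name is illustrative):

```python
def chained_accuracy(per_stage_accuracy, n_stages):
    """Accuracy of n_stages models in series, assuming independent errors."""
    return per_stage_accuracy ** n_stages

print(round(chained_accuracy(0.9, 5), 2))  # 0.59
print(round(chained_accuracy(0.8, 5), 2))  # 0.33
```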
The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are K possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article.
Specifically, it is assumed that we have a series of N observed data points. Each data point i (ranging from 1 to N) consists of a set of M explanatory variables x1,i ... xM,i (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome Yi (also known as dependent variable, response variable), which can take on one of K possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of N "experiments" — although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so that the outcome of a new "experiment" can be correctly predicted for a new data point for which the explanatory variables, but not the outcome, are available. In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome.
Some examples:
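Multinomial logistic regression uses a linear predictor function f(k,i) to predict the probability that observation i has outcome k, of the following form:
f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},
where βm,k is a regression coefficient associated with the mth explanatory variable and the kth outcome. The regression coefficients and explanatory variables are normally grouped into vectors of size M + 1, so that the predictor function can be written more compactly:
f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,
where βk is the set of regression coefficients associated with outcome k, and xi is the set of explanatory variables associated with observation i, with a leading 1 that pairs with the intercept β0,k.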
To arrive at the multinomial logit model, one can imagine, for K possible outcomes, running K − 1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and the other K − 1 outcomes are separately regressed against the pivot outcome. If outcome K (the last outcome) is chosen as the pivot, the K − 1 regression equations are:
\ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} = \boldsymbol\beta_k \cdot \mathbf{X}_i, \qquad 1 \leq k < K
This formulation is also known as the Additive Log Ratio transform commonly used in compositional data analysis. In other applications it’s referred to as “relative risk”.[7]
If we exponentiate both sides and solve for the probabilities, we get:
\Pr(Y_i=k) = \Pr(Y_i=K)\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \qquad 1 \leq k < K
Using the fact that all K of the probabilities must sum to one, we find:
\begin{align}
\Pr(Y_i=K) &= 1 - \sum_{j=1}^{K-1} \Pr(Y_i=j) = 1 - \sum_{j=1}^{K-1} \Pr(Y_i=K)\, e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} \\
\Rightarrow \Pr(Y_i=K) &= \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}.
\end{align}
We can use this to find the other probabilities:
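\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \qquad 1 \leq k < K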
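The pivot formulation can be sketched in a few lines. The function below is an illustrative assumption (its name and the coefficient values are hypothetical): it takes the K − 1 coefficient vectors of the non-pivot outcomes and returns all K probabilities, with the pivot outcome last.

```python
import math

def pivot_probabilities(betas, x_i):
    """Probabilities for outcomes 1..K from K-1 coefficient vectors.

    betas holds the vectors for outcomes 1..K-1; outcome K is the pivot.
    """
    exp_scores = [math.exp(sum(b * x for b, x in zip(beta, x_i)))
                  for beta in betas]
    denom = 1.0 + sum(exp_scores)
    probs = [e / denom for e in exp_scores]   # outcomes 1..K-1
    probs.append(1.0 / denom)                 # pivot outcome K
    return probs

# Hypothetical coefficients for K = 3 (two non-pivot outcomes).
betas = [[0.5, -1.0], [-0.3, 0.7]]
x_i = [1.0, 2.0]
probs = pivot_probabilities(betas, x_i)
print(probs, sum(probs))
```

The log of the ratio between any non-pivot probability and the pivot probability recovers the corresponding linear predictor, which is a quick way to sanity-check an implementation.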
The fact that we run multiple regressions reveals why the model relies on the assumption of independence of irrelevant alternatives described above.
The unknown parameters in each vector βk are typically jointly estimated by maximum a posteriori (MAP) estimation, which is an extension of maximum likelihood using regularization of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as generalized iterative scaling,[8] iteratively reweighted least squares (IRLS),[9] by means of gradient-based optimization algorithms such as L-BFGS,[4] or by specialized coordinate descent algorithms.[10]
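As a rough illustration of penalized estimation, the sketch below minimizes the negative log-likelihood plus a squared penalty using plain gradient descent. Plain gradient descent is swapped in here for brevity and is not one of the procedures named above. The toy data, step size, penalty strength, and function names are all assumptions, and class labels are indexed from 0.

```python
import math

def softmax(scores):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def predict_proba(betas, x_i):
    """Class probabilities for one observation."""
    return softmax([sum(b * x for b, x in zip(beta, x_i)) for beta in betas])

def objective(betas, X, y, lam):
    """Negative log-likelihood plus a squared (L2) penalty on the weights."""
    nll = -sum(math.log(predict_proba(betas, x_i)[y_i]) for x_i, y_i in zip(X, y))
    return nll + 0.5 * lam * sum(b * b for beta in betas for b in beta)

def fit(X, y, n_classes, lam=0.1, lr=0.05, n_iter=500):
    """Plain gradient descent on the penalized objective (illustrative only)."""
    n_features = len(X[0])
    betas = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(n_iter):
        # Gradient of the penalty term.
        grads = [[lam * b for b in beta] for beta in betas]
        # Gradient of the negative log-likelihood.
        for x_i, y_i in zip(X, y):
            p = predict_proba(betas, x_i)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y_i else 0.0)
                for m in range(n_features):
                    grads[k][m] += err * x_i[m]
        betas = [[b - lr * g for b, g in zip(beta, grad)]
                 for beta, grad in zip(betas, grads)]
    return betas

# Toy data: the first feature is a constant 1 (intercept); labels are 0..2.
X = [[1.0, -2.0], [1.0, -1.5], [1.0, 0.0], [1.0, 0.2], [1.0, 1.8], [1.0, 2.1]]
y = [0, 0, 1, 1, 2, 2]
betas = fit(X, y, n_classes=3)
print(objective(betas, X, y, lam=0.1))
```

With the penalty in place, all K coefficient vectors can be estimated jointly without fixing a reference outcome, because the penalty selects one solution among the otherwise equivalent ones.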
The formulation of binary logistic regression as a log-linear model can be directly extended to multi-way regression. That is, we model the logarithm of the probability of seeing a given output using the linear predictor as well as an additional normalization factor, the logarithm of the partition function:
\ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \qquad 1 \leq k \leq K.
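As in the binary case, we need an extra term −ln Z to ensure that the whole set of probabilities forms a probability distribution, i.e. so that they all sum to one:
\sum_{k=1}^{K} \Pr(Y_i=k) = 1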
The reason why we need to add a term to ensure normalization, rather than multiply as is usual, is because we have taken the logarithm of the probabilities. Exponentiating both sides turns the additive term into a multiplicative factor, so that the probability is just the Gibbs measure:
\Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \qquad 1 \leq k \leq K.
The quantity Z is called the partition function for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:
1 = \sum_{k=1}^{K} \Pr(Y_i=k) = \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} = \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.
Therefore
Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.
Note that this factor is "constant" in the sense that it is not a function of Yi, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients βk, which we will need to determine through some sort of optimization procedure.
The resulting equations for the probabilities are
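\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \qquad 1 \leq k \leq K.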
The following function:
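\operatorname{softmax}(k, s_1, \ldots, s_K) = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}
is referred to as the softmax function. The reason is that the effect of exponentiating the values s1, …, sK is to exaggerate the differences between them. As a result, softmax(k, s1, …, sK) returns a value close to 0 whenever sk is significantly less than the maximum of all the values, and a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. The softmax function can thus be used as a smooth, differentiable approximation of the indicator function
f(k) = \begin{cases} 1 & \text{if } k = \operatorname{arg\,max}_j s_j, \\ 0 & \text{otherwise}. \end{cases}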
Thus, we can write the probability equations as
\Pr(Y_i=k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)
The softmax function thus serves as the equivalent of the logistic function in binary logistic regression.
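The behaviour of the softmax function can be illustrated with a short sketch. The score values are arbitrary, and subtracting the maximum before exponentiating is an implementation choice made here to avoid overflow; it does not change the result.

```python
import math

def softmax(scores):
    """Softmax of a list of scores, shifted by the maximum for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def argmax_indicator(scores):
    """The hard 0/1 indicator that softmax approximates."""
    best = max(range(len(scores)), key=lambda j: scores[j])
    return [1.0 if j == best else 0.0 for j in range(len(scores))]

scores = [1.0, 2.0, 5.0]
print(softmax(scores))                     # the largest score dominates
print(softmax([10 * s for s in scores]))   # scaled scores: nearly one-hot
print(argmax_indicator(scores))            # [0.0, 0.0, 1.0]
```

Scaling the scores up widens the gaps between them, which pushes the softmax output towards the indicator of the largest score.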
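Note that not all of the βk vectors of coefficients are uniquely identifiable. Because all probabilities must sum to 1, one of them is completely determined once the rest are known, so there are only K − 1 separately specifiable probabilities and hence K − 1 separately identifiable vectors of coefficients. One way to see this is to note that adding a constant vector C to all of the coefficient vectors leaves the equations unchanged:
\begin{align}
\frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}}
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}
\end{align}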
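As a result, it is conventional to set C = −βK (or, alternatively, the negative of one of the other coefficient vectors). One of the vectors then becomes 0, and each of the others is replaced by its difference from the chosen vector:
\begin{align} \boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \qquad 1 \leq k < K, \\ \boldsymbol\beta'_K &= \boldsymbol{0}. \end{align}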
This leads to the following equations:
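\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}, \qquad 1 \leq k \leq K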
Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of K − 1 independent two-way regressions.
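The lack of identifiability can be checked numerically. In this minimal sketch (the coefficient values and function name are hypothetical), every coefficient vector is shifted by the negative of the last one; the last vector becomes zero and the probabilities are unchanged up to rounding.

```python
import math

def softmax_probs(betas, x_i):
    """Probabilities from K coefficient vectors via the softmax form."""
    scores = [sum(b * x for b, x in zip(beta, x_i)) for beta in betas]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Hypothetical coefficients for K = 3 outcomes.
betas = [[0.5, -1.0], [-0.3, 0.7], [0.2, 0.1]]
x_i = [1.0, 2.0]

# Shift every vector by C = -beta_K, so the last vector becomes zero.
c = [-b for b in betas[-1]]
shifted = [[b + cj for b, cj in zip(beta, c)] for beta in betas]

print(shifted[-1])                  # [0.0, 0.0]
print(softmax_probs(betas, x_i))
print(softmax_probs(shifted, x_i))  # same probabilities up to rounding
```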
It is also possible to formulate multinomial logistic regression as a latent variable model, following the two-way latent variable model described for binary logistic regression. This formulation is common in the theory of discrete choice models, and makes it easier to compare multinomial logistic regression to the related multinomial probit model, as well as to extend it to more complex models.
Imagine that, for each data point i and possible outcome k = 1,2,...,K, there is a continuous latent variable Yi,k* (i.e. an unobserved random variable) that is distributed as follows:
Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k, \qquad k \leq K
where
\varepsilon_k \sim \operatorname{EV}_1(0,1),
i.e. a standard type-1 extreme value distribution.
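This latent variable can be thought of as the utility associated with data point i choosing outcome k, where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable Yi is then determined in a non-random fashion from these latent variables: outcome k is chosen if and only if the associated utility Yi,k* is greater than the utilities of all the other choices. Since the latent variables are continuous, the probability that two of them are exactly equal is 0, so ties can be ignored. That is: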
\begin{align}
\Pr(Y_i=1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast} \text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\
\Pr(Y_i=2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast} \text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\
&\vdots \\
\Pr(Y_i=K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast} \text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast})
\end{align}
Or equivalently:
\Pr(Y_i=k) = \Pr(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,k}^{\ast}), \qquad k \leq K
Let's look more closely at the first equation, which we can write as follows:
\begin{align}
\Pr(Y_i=1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,k}^{\ast} \ \ \forall\, k=2,\ldots,K) \\
&= \Pr(Y_{i,1}^{\ast} - Y_{i,k}^{\ast} > 0 \ \ \forall\, k=2,\ldots,K) \\
&= \Pr(\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k) > 0 \ \ \forall\, k=2,\ldots,K) \\
&= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_k) \cdot \mathbf{X}_i > \varepsilon_k - \varepsilon_1 \ \ \forall\, k=2,\ldots,K)
\end{align}
There are a few things to realize here:
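In general, if X ∼ EV1(a, b) and Y ∼ EV1(a, b) are independent, then X − Y ∼ Logistic(0, b). That is, the difference of two independent identically distributed extreme-value-distributed variables follows the logistic distribution, where the first parameter is unimportant.
The second parameter in an extreme-value or logistic distribution is a scale parameter, such that if X ∼ Logistic(0, 1) then bX ∼ Logistic(0, b).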
Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular order statistic (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in the formulations above, i.e. the two are equivalent.
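The equivalence can be checked approximately by simulation. The sketch below is an illustration under stated assumptions: the scores stand in for hypothetical values of βk ⋅ Xi, the noise is drawn from the standard type-1 extreme value (Gumbel) distribution by inverting its distribution function, and the seed and number of draws are arbitrary.

```python
import math
import random

def gumbel(rng):
    """Draw from the standard type-1 extreme value (Gumbel) distribution."""
    u = rng.random()
    while u == 0.0:          # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_choice_frequencies(scores, n_draws, seed=0):
    """Fraction of draws in which each outcome has the largest utility."""
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(n_draws):
        utilities = [s + gumbel(rng) for s in scores]
        counts[utilities.index(max(utilities))] += 1
    return [c / n_draws for c in counts]

def softmax(scores):
    """Closed-form multinomial logit probabilities."""
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

scores = [0.5, 0.0, -1.0]   # hypothetical values of beta_k . X_i
print(simulate_choice_frequencies(scores, 100_000))
print(softmax(scores))
```

The simulated frequencies should agree with the closed-form probabilities up to sampling noise, which shrinks as the number of draws grows.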
When using multinomial logistic regression, one category of the dependent variable is chosen as the reference category. Separate odds ratios are determined for all independent variables for each category of the dependent variable with the exception of the reference category, which is omitted from the analysis. The exponential beta coefficient represents the change in the odds of the dependent variable being in a particular category vis-a-vis the reference category, associated with a one unit change of the corresponding independent variable.
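The odds-ratio interpretation can be illustrated with a short sketch. The coefficient values below are hypothetical, and the reference category is assumed to have all of its coefficients fixed at zero.

```python
import math

def odds_vs_reference(beta_k, x_i):
    """Odds of outcome k against the reference category (whose beta is 0)."""
    return math.exp(sum(b * x for b, x in zip(beta_k, x_i)))

beta_k = [0.2, 0.7]          # hypothetical [intercept, slope] for outcome k
x = [1.0, 3.0]
x_plus_one = [1.0, 4.0]      # same case, predictor increased by one unit

ratio = odds_vs_reference(beta_k, x_plus_one) / odds_vs_reference(beta_k, x)
print(ratio, math.exp(0.7))  # both are about 2.01
```

A one-unit increase in the predictor multiplies the odds against the reference category by the exponentiated coefficient, regardless of the starting value of the predictor.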
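The observed values yi ∈ {1, …, K} for i = 1, …, n of the explained variables are considered as realizations of stochastically independent, categorically distributed random variables Y1, …, Yn.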
The likelihood function for this model is defined by
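L = \prod_{i=1}^{n} P(Y_i=y_i) = \prod_{i=1}^{n} \prod_{j=1}^{K} P(Y_i=j)^{\delta_{j,y_i}},
where the index i denotes the observations, the index j denotes the classes, and
\delta_{j,y_i} = \begin{cases} 1, & \text{for } j = y_i \\ 0, & \text{otherwise} \end{cases}
is the Kronecker delta.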
The negative log-likelihood function is therefore the well-known cross-entropy:
-\log L = -\sum_{i=1}^{n} \sum_{j=1}^{K} \delta_{j,y_i} \log(P(Y_i=j)) = -\sum_{j=1}^{K} \sum_{y_i=j} \log(P(Y_i=j)).
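The two forms of the negative log-likelihood can be compared directly. In this minimal sketch the predicted probabilities are hypothetical, the function names are illustrative, and classes are indexed from 0 rather than from 1.

```python
import math

def nll_with_delta(probs, y, n_classes):
    """Double sum over observations and classes with the Kronecker delta."""
    total = 0.0
    for p_i, y_i in zip(probs, y):
        for j in range(n_classes):
            delta = 1.0 if j == y_i else 0.0
            total -= delta * math.log(p_i[j])
    return total

def nll_direct(probs, y):
    """Equivalent form: only the observed class of each case contributes."""
    return -sum(math.log(p_i[y_i]) for p_i, y_i in zip(probs, y))

# Hypothetical predicted probabilities for 3 observations and K = 3 classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
y = [0, 1, 2]
print(nll_with_delta(probs, y, 3), nll_direct(probs, y))
```

The second form is usually preferred in practice because it skips the terms that the delta sets to zero.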
In natural language processing, multinomial LR classifiers are commonly used as an alternative to naive Bayes classifiers because they do not assume statistical independence of the random variables (commonly known as features) that serve as predictors. However, learning in such a model is slower than for a naive Bayes classifier, and thus may not be appropriate given a very large number of classes to learn. In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while in a maximum entropy classifier the weights, which are typically maximized using maximum a posteriori (MAP) estimation, must be learned using an iterative procedure; see