In statistics, the logistic model (or logit model) is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression[1] (or logit regression) estimates the parameters of a logistic model (the coefficients in the linear or non-linear combinations). In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, whose two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.
Binary variables are widely used in statistics to model the probability of a certain class or event taking place, such as the probability of a team winning or of a patient being healthy, and the logistic model has been the most commonly used model for binary regression since about 1970. Binary variables can be generalized to categorical variables when there are more than two possible values (e.g. whether an image is of a cat, dog, lion, etc.), and binary logistic regression can be generalized to multinomial logistic regression. If the multiple categories are ordered, one can use ordinal logistic regression (for example the proportional odds ordinal logistic model). The logistic regression model itself simply models the probability of the output in terms of the input and does not perform statistical classification (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class and those below the cutoff as the other; this is a common way to make a binary classifier.
Analogous linear models for binary variables with a different sigmoid function instead of the logistic function (to convert the linear combination to a probability) can also be used, most notably the probit model. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio. More abstractly, the log-odds is the natural parameter of the Bernoulli distribution, so the logistic function, its inverse, is in this sense the "simplest" way to convert a real number to a probability. In particular, it maximizes entropy (minimizes added information), and in this sense makes the fewest assumptions about the data being modeled.
The parameters of a logistic regression are most commonly estimated by maximum-likelihood estimation (MLE). Unlike linear least squares, this does not have a closed-form expression. Logistic regression by MLE plays a similarly basic role for binary or categorical responses as linear regression by ordinary least squares (OLS) plays for scalar responses: it is a simple, well-analyzed baseline model. Logistic regression as a general statistical model was originally developed and popularized primarily by Joseph Berkson, who coined the term "logit".
Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd using logistic regression.[2] Many other medical scales used to assess the severity of a patient's condition have been developed using logistic regression.[3] [4] [5] [6] Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).[7] [8] Another example might be to predict whether a Nepalese voter will vote for the Nepali Congress, the Communist Party of Nepal, or any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, system or product.[9] [10] It is also used in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription.[11] In economics, it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing. Disaster planners and engineers rely on these models to predict the decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes.[12] [13] [14] These models help in the development of reliable disaster management plans and safer design for the built environment.
Logistic regression is a supervised machine learning algorithm widely used for binary classification tasks, such as identifying whether an email is spam or not and diagnosing diseases by assessing the presence or absence of specific conditions based on patient test results. This approach utilizes the logistic (or sigmoid) function to transform a linear combination of input features into a probability value ranging between 0 and 1. This probability indicates the likelihood that a given input corresponds to one of two predefined categories. The essential mechanism of logistic regression is grounded in the logistic function's ability to model the probability of binary outcomes accurately. With its distinctive S-shaped curve, the logistic function effectively maps any real-valued number to a value within the 0 to 1 interval. This feature renders it particularly suitable for binary classification tasks, such as sorting emails into "spam" or "not spam". By calculating the probability that the dependent variable will be categorized into a specific group, logistic regression provides a probabilistic framework that supports informed decision-making.[15]
As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question:
A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?
The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not cardinal numbers. If the problem was changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple regression analysis could be used.
The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).
Hours ($x_k$) | 0.50 | 0.75 | 1.00 | 1.25 | 1.50 | 1.75 | 2.00 | 2.25 | 2.50 | 2.75 | 3.00 | 3.25 | 3.50 | 4.00 | 4.25 | 4.50 | 4.75 | 5.00 | 5.50
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Pass ($y_k$) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1
We wish to fit a logistic function to the data consisting of the hours studied ($x_k$) and the outcome of the test ($y_k = 1$ for pass, 0 for fail). The data points are indexed by the subscript $k$, which runs from $k=1$ to $k=K=20$.
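The logistic function is of the form:
$$p(x) = \frac{1}{1+e^{-(x-\mu)/s}}$$
where $\mu$ is a location parameter (the midpoint of the curve, where $p(\mu)=1/2$) and $s$ is a scale parameter. This expression may be rewritten as:
$$p(x) = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$$
where $\beta_0=-\mu/s$ is the intercept of the line $y=\beta_0+\beta_1 x$ and $\beta_1=1/s$ is its slope; these are the intercept and slope of the log-odds as a function of $x$. Conversely, $\mu=-\beta_0/\beta_1$ and $s=1/\beta_1$.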
Remark: This model is an oversimplification, since it assumes that every student will pass given enough study time (the limit of $p(x)$ is 1). A more realistic model would treat the limiting value as an additional parameter.
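The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given $x_k$ and $y_k$, write $p_k=p(x_k)$. The $p_k$ are the probabilities that the corresponding $y_k$ will equal one, and $1-p_k$ are the probabilities that they will equal zero.
The log loss for the $k$-th point is:
$$\ell_k = \begin{cases} -\ln p_k & \text{if } y_k = 1,\\ -\ln(1-p_k) & \text{if } y_k = 0. \end{cases}$$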
The log loss can be interpreted as the "surprisal" of the actual outcome relative to the prediction, and is a measure of information content. Log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when $p_k=1$ and $y_k=1$, or $p_k=0$ and $y_k=0$), and approaches infinity as the prediction gets worse (i.e., when $y_k=1$ and $p_k\to 0$, or $y_k=0$ and $p_k\to 1$).
These can be combined into a single expression:
$$\ell_k = -y_k\ln p_k-(1-y_k)\ln(1-p_k).$$
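This expression is more formally known as the cross-entropy of the predicted distribution $(p_k, 1-p_k)$ from the actual distribution $(y_k, 1-y_k)$, as probability distributions on the two-element space of (pass, fail).
The sum of these, the total loss, is the overall negative log-likelihood $-\ell$, and the best fit is obtained for those choices of $\beta_0$ and $\beta_1$ for which $-\ell$ is minimized.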
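As a minimal sketch of the definitions above, the per-point log loss, its single-expression form, and the total loss can be computed as follows. The function names and the example probabilities are illustrative assumptions, not part of the worked example.

```python
import math

def log_loss_point(y, p):
    """Log loss of a prediction p (probability that y = 1) for an outcome y in {0, 1}."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def log_loss_combined(y, p):
    """Single-expression (cross-entropy) form; agrees with log_loss_point for y in {0, 1}."""
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)

def total_log_loss(ys, ps):
    """Total loss: the negative log-likelihood of the observed outcomes."""
    return sum(log_loss_point(y, p) for y, p in zip(ys, ps))

# Hypothetical outcomes and predicted probabilities, for illustration only.
example_ys = [0, 1, 1]
example_ps = [0.2, 0.6, 0.9]
print(total_log_loss(example_ys, example_ps))
```

Both forms require $0 < p < 1$, which always holds for values produced by the logistic function.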
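Alternatively, instead of minimizing the loss, one can maximize its negative, the (positive) log-likelihood:
$$\ell = \sum_{k:y_k=1}\ln(p_k) + \sum_{k:y_k=0}\ln(1-p_k) = \sum_{k=1}^K \left(y_k\ln(p_k)+(1-y_k)\ln(1-p_k)\right)$$
or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:
$$L = \prod_{k:y_k=1} p_k \prod_{k:y_k=0}(1-p_k)$$
This method is known as maximum likelihood estimation.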
Since $\ell$ is nonlinear in $\beta_0$ and $\beta_1$, determining their optimum values will require numerical methods. One method of maximizing $\ell$ is to require the derivatives of $\ell$ with respect to $\beta_0$ and $\beta_1$ to be zero:
$$0 = \frac{\partial\ell}{\partial\beta_0} = \sum_{k=1}^K (y_k-p_k)$$
$$0 = \frac{\partial\ell}{\partial\beta_1} = \sum_{k=1}^K (y_k-p_k)\,x_k$$
and the maximization procedure can be accomplished by solving the above two equations for $\beta_0$ and $\beta_1$, which, again, will generally require the use of numerical methods.
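One possible numerical method for these two equations is Newton's method, sketched below for the single-variable case. The choice of Newton's method, the function names, and the small data set are assumptions made for illustration; the data are hypothetical and are not the exam data from the table.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic_1d(xs, ys, iterations=25):
    """Solve sum(y_k - p_k) = 0 and sum((y_k - p_k) x_k) = 0 by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood (the two derivative equations).
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative Hessian of the log-likelihood, a symmetric 2x2 matrix.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: add (negative Hessian)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical, non-separable data for illustration only.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
beta0, beta1 = fit_logistic_1d(xs, ys)
print(beta0, beta1)
```

The sketch assumes the data are not perfectly separable; for separable data the maximum-likelihood estimate does not exist and the iteration would not settle.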
The values of $\beta_0$ and $\beta_1$ which maximize $\ell$ and $L$ using the above data are found to be:
$$\beta_0 \approx -4.1$$
$$\beta_1 \approx 1.5$$
which yields a value for $\mu$ and $s$ of:
$$\mu = -\beta_0/\beta_1 \approx 2.7$$
$$s = 1/\beta_1 \approx 0.67$$
The $\beta_0$ and $\beta_1$ coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam.
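For example, for a student who studies 2 hours, entering the value $x=2$ into the equation gives the estimated probability of passing the exam of 0.25:
$$t = \beta_0+2\beta_1 \approx -4.1+2\cdot 1.5 = -1.1$$
$$p = \frac{1}{1+e^{-t}} \approx 0.25 = \text{Probability of passing exam}$$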
Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87:
$$t = \beta_0+4\beta_1 \approx -4.1+4\cdot 1.5 = 1.9$$
$$p = \frac{1}{1+e^{-t}} \approx 0.87 = \text{Probability of passing exam}$$
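The two calculations above can be reproduced with a short sketch that plugs the rounded coefficients $\beta_0 \approx -4.1$ and $\beta_1 \approx 1.5$ into the logistic function. The function name `pass_probability` is an illustrative assumption.

```python
import math

def pass_probability(hours, beta0=-4.1, beta1=1.5):
    """Estimated probability of passing, using the rounded coefficients from the text."""
    t = beta0 + beta1 * hours          # log-odds
    return 1.0 / (1.0 + math.exp(-t))  # logistic function

for hours in (2, 4):
    print(hours, round(pass_probability(hours), 2))
# prints:
# 2 0.25
# 4 0.87
```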
This table shows the estimated probability of passing the exam for several values of hours studying. The entries appear to be computed from the unrounded coefficients, so they differ slightly from the values obtained above with rounded coefficients (for example, 0.26 rather than 0.25 at 2 hours).
Hours of study ($x$) | Log-odds ($t$) | Odds ($e^t$) | Probability of passing ($p$)
---|---|---|---
1 | −2.57 | 0.076 ≈ 1:13.1 | 0.07
2 | −1.07 | 0.34 ≈ 1:2.91 | 0.26
$\mu \approx 2.7$ | 0 | 1 | $\tfrac{1}{2}$
3 | 0.44 | 1.55 | 0.61
4 | 1.94 | 6.96 | 0.87
5 | 3.45 | 31.4 | 0.97
The logistic regression analysis gives the following output.
Parameter | Coefficient | Std. Error | z-value | p-value (Wald)
---|---|---|---|---
Intercept ($\beta_0$) | −4.1 | 1.8 | −2.3 | 0.021
Hours ($\beta_1$) | 1.5 | 0.6 | 2.4 | 0.017
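By the Wald test, the output indicates that hours studying is significantly associated with the probability of passing the exam ($p=0.017$). Rather than the Wald method, the recommended method to calculate the p-value for logistic regression is the likelihood-ratio test (LRT), which for these data gives $p \approx 0.00064$.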
This simple model is an example of binary logistic regression, and has one explanatory variable and a binary categorical variable which can assume one of two categorical values. Multinomial logistic regression is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories.
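An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input $t$ and outputs a value between zero and one. For the logit, this is interpreted as taking input log-odds and having output probability. The standard logistic function $\sigma:\mathbb{R}\to(0,1)$ is defined as follows:
$$\sigma(t) = \frac{e^t}{e^t+1} = \frac{1}{1+e^{-t}}$$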
A graph of the logistic function on the t-interval (−6,6) is shown in Figure 1.
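Let us assume that $t$ is a linear function of a single explanatory variable $x$ (the case where $t$ is a linear combination of multiple explanatory variables is treated similarly). We can then express $t$ as follows:
$$t = \beta_0+\beta_1 x$$
And the general logistic function $p:\mathbb{R}\to(0,1)$ can now be written as:
$$p(x) = \sigma(t) = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$$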
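In the logistic model, $p(x)$ is interpreted as the probability of the dependent variable $Y$ equaling a success/case rather than a failure/non-case. The response variables $Y_i$ are not identically distributed: $P(Y_i=1\mid X)$ differs from one data point $X_i$ to another, though they are independent given the design matrix $X$ and shared parameters $\beta$.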
We can now define the logit (log odds) function as the inverse $g=\sigma^{-1}$ of the standard logistic function. It satisfies:
$$g(p(x)) = \sigma^{-1}(p(x)) = \operatorname{logit}p(x) = \ln\left(\frac{p(x)}{1-p(x)}\right) = \beta_0+\beta_1 x,$$
and equivalently, after exponentiating both sides we have the odds:
$$\frac{p(x)}{1-p(x)} = e^{\beta_0+\beta_1 x}.$$
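In the above equations, the terms are as follows:
- $g$ is the logit function. The equation for $g(p(x))$ illustrates that the logit (i.e., the log-odds or natural logarithm of the odds) is equivalent to the linear regression expression.
- $\ln$ denotes the natural logarithm.
- $p(x)$ is the probability that the dependent variable equals a case, given some linear combination of the predictors. The formula for $p(x)$ illustrates that this probability is equal to the value of the logistic function of the linear regression expression. The linear regression expression can vary from negative to positive infinity, and yet, after transformation, the resulting probability $p(x)$ ranges between 0 and 1.
- $\beta_0$ is the intercept from the linear regression equation (the value of the criterion when the predictor is equal to zero).
- $\beta_1 x$ is the regression coefficient multiplied by some value of the predictor.
- The base $e$ denotes the exponential function.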
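The odds of the dependent variable equaling a case (given some linear combination $x$ of the predictors) is equivalent to the exponential function of the linear regression expression. This illustrates how the logit serves as a link function between the probability and the linear regression expression. Given that the logit ranges between negative and positive infinity, it provides an adequate criterion upon which to conduct linear regression, and the logit is easily converted back into the odds.
So we define the odds of the dependent variable equaling a case (given some linear combination $x$ of the predictors) as follows:
$$\text{odds} = e^{\beta_0+\beta_1 x}.$$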
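For a continuous independent variable the odds ratio can be defined as:
$$\mathrm{OR} = \frac{\operatorname{odds}(x+1)}{\operatorname{odds}(x)} = \frac{\left(\frac{p(x+1)}{1-p(x+1)}\right)}{\left(\frac{p(x)}{1-p(x)}\right)} = \frac{e^{\beta_0+\beta_1(x+1)}}{e^{\beta_0+\beta_1 x}} = e^{\beta_1}$$
This exponential relationship provides an interpretation for $\beta_1$: the odds multiply by $e^{\beta_1}$ for every 1-unit increase in $x$.
For a binary independent variable the odds ratio is defined as $\frac{ad}{bc}$, where $a$, $b$, $c$ and $d$ are the cells of a $2\times 2$ contingency table.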
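A short numerical sketch of both definitions follows. It uses the rounded coefficients from the exam example; the function names and the contingency-table counts are hypothetical, and the cell layout (a, b in the first row; c, d in the second) is an assumption.

```python
import math

def odds(x, beta0, beta1):
    """Odds of a case at predictor value x: exp(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)

def odds_ratio_2x2(a, b, c, d):
    """Odds ratio ad/bc for a 2x2 contingency table with rows (a, b) and (c, d)."""
    return (a * d) / (b * c)

beta0, beta1 = -4.1, 1.5
ratio = odds(3.0, beta0, beta1) / odds(2.0, beta0, beta1)

# A one-unit increase in x multiplies the odds by exp(beta1), whatever the starting x.
print(ratio, math.exp(beta1))

# Hypothetical counts, for illustration only.
print(odds_ratio_2x2(10, 5, 4, 8))
```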
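If there are multiple explanatory variables, the above expression $\beta_0+\beta_1 x$ can be revised to
$$\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+\sum_{i=1}^m \beta_ix_i.$$
Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a multiple regression with $m$ explanators; the parameters $\beta_i$ for all $i=0,1,2,\dots,m$ are all estimated.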
Again, the more traditional equations are:
$$\log_b\frac{p}{1-p} = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m$$
and
$$p = \frac{1}{1+b^{-(\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m)}}$$
where usually $b=e$.
A dataset contains N points. Each point i consists of a set of m input variables x1,i ... xm,i (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable Yi (also known as a dependent variable, response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable.
As in linear regression, the outcome variables Yi are assumed to depend on the explanatory variables x1,i ... xm,i.
(Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".)
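\begin{align}
Y_i\mid x_{1,i},\ldots,x_{m,i} &\sim \operatorname{Bernoulli}(p_i)\\[5pt]
\operatorname{E}[Y_i\mid x_{1,i},\ldots,x_{m,i}] &= p_i\\[5pt]
\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= \begin{cases} p_i & \text{if } y=1\\ 1-p_i & \text{if } y=0 \end{cases}\\[5pt]
\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= p_i^{y}(1-p_i)^{(1-y)}
\end{align}
The meanings of these four lines are:
- The first line expresses the probability distribution of each $Y_i$: conditioned on the explanatory variables, it follows a Bernoulli distribution with parameter $p_i$, the probability of the outcome 1 for trial $i$.
- The second line expresses the fact that the expected value of each $Y_i$ is equal to the probability of success $p_i$, a general property of the Bernoulli distribution.
- The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of each of the two possible outcomes.
- The fourth line is another way of writing the probability mass function that avoids separate cases and is more convenient for certain calculations.
The basic idea of logistic regression is to model the probability $p_i$ using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear predictor function $f(i)$ for a particular data point $i$ is written as:
$$f(i) = \beta_0+\beta_1x_{1,i}+\cdots+\beta_mx_{m,i},$$
where $\beta_0,\ldots,\beta_m$ are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.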
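The model is usually put into a more compact form as follows:
- The regression coefficients $\beta_0,\beta_1,\ldots,\beta_m$ are grouped into a single vector $\boldsymbol\beta$ of size $m+1$.
- For each data point $i$, an additional explanatory pseudo-variable $x_{0,i}$ is added, with a fixed value of 1, corresponding to the intercept coefficient $\beta_0$.
- The resulting explanatory variables $x_{0,i},x_{1,i},\ldots,x_{m,i}$ are then grouped into a single vector $X_i$ of size $m+1$.
This makes it possible to write the linear predictor function as follows:
$$f(i) = \boldsymbol\beta\cdot X_i,$$
using the notation for a dot product between two vectors.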
The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables $x_1, x_2, \ldots$ and any number of categorical values $y=0,1,2,\ldots$.
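To begin with, we may consider a logistic model with $M$ explanatory variables, $x_1, x_2, \ldots, x_M$ and, as in the example above, two categorical values ($y = 0$ and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds (also called logit) of the event that $y=1$. This linear relationship may be extended to the case of $M$ explanatory variables:
$$t = \log_b\frac{p}{1-p} = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_Mx_M$$
where $t$ is the log-odds and the $\beta_i$ are parameters of the model. An additional generalization has been introduced in which the base $b$ of the model is not restricted to Euler's number $e$. In most applications the base is taken to be $e$, although results are sometimes easier to communicate in base 2 or base 10.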
For a more compact notation, we will specify the explanatory variables and the $\beta$ coefficients as $(M+1)$-dimensional vectors:
$$\boldsymbol{x} = \{x_0,x_1,x_2,\ldots,x_M\}$$
$$\boldsymbol{\beta} = \{\beta_0,\beta_1,\beta_2,\ldots,\beta_M\}$$
with an added explanatory variable $x_0=1$. The logit may now be written as:
$$t = \sum_{m=0}^{M}\beta_mx_m = \boldsymbol{\beta}\cdot\boldsymbol{x}$$
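Solving for the probability $p$ that $y=1$ yields:
$$p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta}\cdot\boldsymbol{x}}}{1+b^{\boldsymbol{\beta}\cdot\boldsymbol{x}}} = \frac{1}{1+b^{-\boldsymbol{\beta}\cdot\boldsymbol{x}}} = S_b(t)$$
where $S_b$ is the sigmoid function with base $b$. The above formula shows that once the $\beta_m$ are fixed, we can easily compute either the log-odds that $y=1$ for a given observation, or the probability that $y=1$ for a given observation. The main use-case of a logistic model is to be given an observation $\boldsymbol{x}$ and estimate the probability $p(\boldsymbol{x})$ that $y=1$. The optimum beta coefficients may again be found by maximizing the log-likelihood. For $K$ measurements, defining $\boldsymbol{x}_k$ as the explanatory vector of the $k$-th measurement and $y_k$ as the categorical outcome of that measurement, the log-likelihood may be written in a form very similar to the simple $M=1$ case above:
$$\ell = \sum_{k=1}^{K} y_k\log_b(p(\boldsymbol{x}_k)) + \sum_{k=1}^{K}(1-y_k)\log_b(1-p(\boldsymbol{x}_k))$$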
As in the simple example above, finding the optimum $\beta$ parameters will require numerical methods. One useful technique is to equate the derivatives of the log-likelihood with respect to each of the $\beta$ parameters to zero, yielding a set of equations which will hold at the maximum of the log-likelihood:
$$\frac{\partial\ell}{\partial\beta_m} = 0 = \sum_{k=1}^{K} y_kx_{mk} - \sum_{k=1}^{K} p(\boldsymbol{x}_k)x_{mk}$$
where $x_{mk}$ is the value of the $x_m$ explanatory variable from the $k$-th measurement.
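The derivative above is also the direction of steepest ascent of the log-likelihood, so one simple (if slow) numerical approach is gradient ascent. The sketch below assumes base $b=e$, a fixed step size, and a small hypothetical data set with $M=2$; each row starts with the constant $x_0=1$. All names and values are illustrative assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(beta, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def gradient(beta, xs, ys):
    """Component m is sum_k (y_k - p(x_k)) * x_mk, as in the equation above."""
    grad = [0.0] * len(beta)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(b * v for b, v in zip(beta, x)))
        for m, v in enumerate(x):
            grad[m] += (y - p) * v
    return grad

def fit(xs, ys, step=0.1, iterations=5000):
    beta = [0.0] * len(xs[0])
    for _ in range(iterations):
        grad = gradient(beta, xs, ys)
        beta = [b + step * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical, non-separable data: each row is (x0 = 1, x1, x2).
xs = [(1.0, 0.5, 1.0), (1.0, 1.5, 0.0), (1.0, 1.0, 2.0),
      (1.0, 1.8, 1.2), (1.0, 0.0, 0.5), (1.0, 2.5, 2.0)]
ys = [0, 1, 1, 0, 0, 1]
beta_hat = fit(xs, ys)
print(beta_hat, log_likelihood(beta_hat, xs, ys))
```

In practice, second-order methods usually converge in far fewer iterations; the fixed step here is chosen small enough for this particular data set and is not a general recommendation.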
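Consider an example with $M=2$ explanatory variables, $b=10$, and coefficients $\beta_0=-3$, $\beta_1=1$, and $\beta_2=2$ which have been determined by the above method. To be concrete, the model is:
$$t = \log_{10}\frac{p}{1-p} = -3+x_1+2x_2$$
$$p = \frac{b^{\boldsymbol{\beta}\cdot\boldsymbol{x}}}{1+b^{\boldsymbol{\beta}\cdot\boldsymbol{x}}} = \frac{b^{\beta_0+\beta_1x_1+\beta_2x_2}}{1+b^{\beta_0+\beta_1x_1+\beta_2x_2}} = \frac{1}{1+b^{-(\beta_0+\beta_1x_1+\beta_2x_2)}}$$
where $p$ is the probability of the event that $y=1$. This can be interpreted as follows:
- $\beta_0=-3$ is the intercept. It is the log-odds of the event that $y=1$ when the predictors $x_1=x_2=0$. By exponentiating, we can see that when $x_1=x_2=0$ the odds of the event that $y=1$ are 1-to-1000, or $10^{-3}$. Similarly, the probability of the event that $y=1$ when $x_1=x_2=0$ can be computed as $1/(1000+1)=1/1001$.
- $\beta_1=1$ means that increasing $x_1$ by 1 increases the log-odds by $1$. So if $x_1$ increases by 1, the odds that $y=1$ increase by a factor of $10^1$. The probability of $y=1$ also increases, but not by as much as the odds.
- $\beta_2=2$ means that increasing $x_2$ by 1 increases the log-odds by $2$. So if $x_2$ increases by 1, the odds that $y=1$ increase by a factor of $10^2$. The effect of $x_2$ on the log-odds is twice as great as the effect of $x_1$, but its effect on the odds is 10 times greater. The effect on the probability of $y=1$, however, is not 10 times greater; only the effect on the odds is.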
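The base-10 example can be checked numerically with a small sketch. The function names are illustrative assumptions; the coefficients and base are those given in the example.

```python
import math

def probability(x1, x2, b=10.0, beta=(-3.0, 1.0, 2.0)):
    """p = 1 / (1 + b^-(beta0 + beta1*x1 + beta2*x2))."""
    t = beta[0] + beta[1] * x1 + beta[2] * x2
    return 1.0 / (1.0 + b ** (-t))

def odds_of(p):
    return p / (1.0 - p)

p_origin = probability(0.0, 0.0)
print(p_origin)                                        # close to 1/1001
print(odds_of(probability(1.0, 0.0)) / odds_of(p_origin))  # close to 10
print(odds_of(probability(0.0, 1.0)) / odds_of(p_origin))  # close to 100
```

The ratio of probabilities, unlike the ratio of odds, is not constant: it depends on the starting values of $x_1$ and $x_2$.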
See main article: Multinomial logistic regression.
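In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: the probability that the outcome was in category 1 was given by $p(\boldsymbol{x})$ and the probability that the outcome was in category 0 was given by $1-p(\boldsymbol{x})$. The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup.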
In general, if we have $M+1$ explanatory variables (including $x_0$) and $N+1$ categories, we will need $N+1$ separate probabilities, one for each category, indexed by $n$, which describe the probability that the categorical outcome $y$ will be in category $y=n$, conditional on the vector of covariates $\boldsymbol{x}$. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base $e$, these probabilities are:
$$p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n\cdot\boldsymbol{x}}}{1+\sum_{u=1}^{N}e^{\boldsymbol{\beta}_u\cdot\boldsymbol{x}}} \quad\text{for } n=1,2,\dots,N$$
$$p_0(\boldsymbol{x}) = 1-\sum_{n=1}^{N}p_n(\boldsymbol{x}) = \frac{1}{1+\sum_{u=1}^{N}e^{\boldsymbol{\beta}_u\cdot\boldsymbol{x}}}$$
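Each of the probabilities except $p_0(\boldsymbol{x})$ will have its own set of regression coefficients $\boldsymbol{\beta}_n$. It can be seen that, as required, the sum of the $p_n(\boldsymbol{x})$ over all categories $n$ is 1. The selection of $p_0(\boldsymbol{x})$ to be defined in terms of the other probabilities is artificial; any of the probabilities could have been selected to be so defined. This special value of $n$ is termed the "pivot index", and the log-odds $t_n$ are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:
$$t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n\cdot\boldsymbol{x}$$
Note also that for the simple case of $N=1$, the two-category case is recovered, with $p(\boldsymbol{x})=p_1(\boldsymbol{x})$ and $p_0(\boldsymbol{x})=1-p_1(\boldsymbol{x})$.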
The log-likelihood that a particular set of $K$ measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by $k$, let the $k$-th set of measured explanatory variables be denoted by $\boldsymbol{x}_k$ and their categorical outcomes be denoted by $y_k$, which can be equal to any integer in $[0,N]$. The log-likelihood is then:
$$\ell = \sum_{k=1}^{K}\sum_{n=0}^{N}\Delta(n,y_k)\ln(p_n(\boldsymbol{x}_k))$$
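where $\Delta(n,y_k)$ is an indicator function which equals 1 if $y_k=n$ and zero otherwise. Again, the optimum beta coefficients may be found by maximizing the log-likelihood function, generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients:
$$\frac{\partial\ell}{\partial\beta_{nm}} = 0 = \sum_{k=1}^{K}\Delta(n,y_k)x_{mk} - \sum_{k=1}^{K}p_n(\boldsymbol{x}_k)x_{mk}$$
where $\beta_{nm}$ is the $m$-th coefficient of the $\boldsymbol{\beta}_n$ vector and $x_{mk}$ is the $m$-th explanatory variable of the $k$-th measurement.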
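The category probabilities with category 0 as the pivot can be sketched as follows. The function name, coefficient vectors, and covariate values are hypothetical and serve only to illustrate the formulas above.

```python
import math

def category_probabilities(betas, x):
    """Return [p_0, p_1, ..., p_N] given coefficient vectors for categories 1..N.

    Category 0 is the pivot: its score is fixed at exp(0) = 1.
    """
    scores = [math.exp(sum(b * v for b, v in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    return [1.0 / denom] + [s / denom for s in scores]

# Hypothetical coefficients for N = 2 non-pivot categories; x includes x0 = 1.
betas = [(0.5, -1.0), (-0.2, 0.3)]
x = (1.0, 2.0)
probs = category_probabilities(betas, x)
print(probs, sum(probs))
```

With a single coefficient vector ($N=1$) the function returns $[1-p(\boldsymbol{x}),\ p(\boldsymbol{x})]$, the binary logistic model.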
There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations.
The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:
$$\operatorname{logit}(\operatorname{E}[Y_i\mid x_{1,i},\ldots,x_{m,i}]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \beta_0+\beta_1x_{1,i}+\cdots+\beta_mx_{m,i}$$
Written using the more compact notation described above, this is:
$$\operatorname{logit}(\operatorname{E}[Y_i\mid X_i]) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta\cdot X_i$$
This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.
The intuition for transforming using the logit function (the natural log of the odds) was explained above. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over $(-\infty,+\infty)$, thereby matching the potential range of the linear prediction function on the right side of the equation.
Both the probabilities pi and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.[19]
The interpretation of the $\beta_j$ parameter estimates is as the additive effect on the log of the odds for a unit change in the $j$-th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, $e^{\beta}$ is the estimate of the odds of having the outcome for, say, males compared with females.
An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:
$$\operatorname{E}[Y_i\mid X_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta\cdot X_i) = \frac{1}{1+e^{-\boldsymbol\beta\cdot X_i}}$$
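The formula can also be written as a probability distribution (specifically, using a probability mass function):
$$\Pr(Y_i=y\mid X_i) = p_i^{y}(1-p_i)^{1-y} = \left(\frac{e^{\boldsymbol\beta\cdot X_i}}{1+e^{\boldsymbol\beta\cdot X_i}}\right)^{y}\left(1-\frac{e^{\boldsymbol\beta\cdot X_i}}{1+e^{\boldsymbol\beta\cdot X_i}}\right)^{1-y} = \frac{e^{\boldsymbol\beta\cdot X_i\cdot y}}{1+e^{\boldsymbol\beta\cdot X_i}}$$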
The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.
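Imagine that, for each trial $i$, there is a continuous latent variable $Y_i^{\ast}$ (i.e. an unobserved random variable) that is distributed as follows:
$$Y_i^{\ast} = \boldsymbol\beta\cdot X_i+\varepsilon_i$$
where
$$\varepsilon_i\sim\operatorname{Logistic}(0,1)$$
i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.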
Then $Y_i$ can be viewed as an indicator for whether this latent variable is positive:
$$Y_i = \begin{cases}1 & \text{if } Y_i^{\ast}>0 \text{, i.e. } -\varepsilon_i<\boldsymbol\beta\cdot X_i,\\ 0 & \text{otherwise.}\end{cases}$$
The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for Yi* regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of Yi* will be smaller by a factor of s than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same Yi choice.
(This predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)
It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables. This can be shown as follows, using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.
$$\Pr(\varepsilon_i<x) = \operatorname{logit}^{-1}(x)$$
Then:
\begin{align}
\Pr(Y_i=1\mid X_i) &= \Pr(Y_i^{\ast}>0\mid X_i)\\[5pt]
&= \Pr(\boldsymbol\beta\cdot X_i+\varepsilon_i>0)\\[5pt]
&= \Pr(\varepsilon_i>-\boldsymbol\beta\cdot X_i)\\[5pt]
&= \Pr(\varepsilon_i<\boldsymbol\beta\cdot X_i) && \text{(because the logistic distribution is symmetric)}\\[5pt]
&= \operatorname{logit}^{-1}(\boldsymbol\beta\cdot X_i)\\[5pt]
&= p_i && \text{(see above)}
\end{align}
This formulation—which is standard in discrete choice models—makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).
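The equivalence between the latent-variable formulation and the logistic function can be checked empirically with a small Monte Carlo sketch. The value of the linear predictor, the sample size, and the seed are arbitrary assumptions; the standard logistic error is drawn by inverting its CDF.

```python
import math
import random

def simulate_latent(linear_predictor, n=200_000, seed=0):
    """Fraction of trials with Y* = linear_predictor + eps > 0, eps ~ Logistic(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                  # avoid log(0); random() lies in [0, 1)
            u = rng.random()
        eps = math.log(u / (1.0 - u))    # inverse CDF of the standard logistic
        if linear_predictor + eps > 0:
            hits += 1
    return hits / n

value = 0.8                              # hypothetical value of beta . X_i
print(simulate_latent(value), 1.0 / (1.0 + math.exp(-value)))
```

The two printed numbers should agree to within sampling error, which is on the order of 0.001 for this sample size.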
Yet another formulation uses two separate latent variables:
\begin{align}
Y_i^{0\ast} &= \boldsymbol\beta_0\cdot X_i+\varepsilon_0\\
Y_i^{1\ast} &= \boldsymbol\beta_1\cdot X_i+\varepsilon_1
\end{align}
where
\begin{align}
\varepsilon_0 &\sim \operatorname{EV}_1(0,1)\\
\varepsilon_1 &\sim \operatorname{EV}_1(0,1)
\end{align}
where $\operatorname{EV}_1(0,1)$ is a standard type-1 extreme value distribution, i.e. with density
$$\Pr(\varepsilon_0=x) = \Pr(\varepsilon_1=x) = e^{-x}e^{-e^{-x}}$$
Then
$$Y_i = \begin{cases}1 & \text{if } Y_i^{1\ast}>Y_i^{0\ast},\\ 0 & \text{otherwise.}\end{cases}$$
This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)
The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.
It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:
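$$\boldsymbol\beta = \boldsymbol\beta_1-\boldsymbol\beta_0$$
$$\varepsilon = \varepsilon_1-\varepsilon_0$$
An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values, and this effectively removes one degree of freedom. Another critical fact is that the difference of two independent type-1 extreme-value-distributed variables follows a logistic distribution, i.e.
$$\varepsilon = \varepsilon_1-\varepsilon_0\sim\operatorname{Logistic}(0,1).$$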
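The claim that two type-1 extreme value errors lead back to the logistic model can be checked by simulation. The sketch below draws standard type-1 extreme value (Gumbel) variables by inverting their CDF $e^{-e^{-x}}$; the two linear-predictor values, the sample size, and the seed are arbitrary assumptions.

```python
import math
import random

def gumbel_sample(rng):
    """Draw from the standard type-1 extreme value distribution via its inverse CDF."""
    u = rng.random()
    while u == 0.0:                      # avoid log(0); random() lies in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_two_latent(score0, score1, n=200_000, seed=1):
    """Fraction of trials in which Y^{1*} = score1 + eps1 exceeds Y^{0*} = score0 + eps0."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        y0 = score0 + gumbel_sample(rng)
        y1 = score1 + gumbel_sample(rng)
        if y1 > y0:
            wins += 1
    return wins / n

score0, score1 = 0.3, 1.0                # hypothetical beta_0 . X_i and beta_1 . X_i
print(simulate_two_latent(score0, score1),
      1.0 / (1.0 + math.exp(-(score1 - score0))))
```

Only the difference `score1 - score0` enters the comparison value, matching the substitution $\boldsymbol\beta=\boldsymbol\beta_1-\boldsymbol\beta_0$.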
The equivalence can be demonstrated as follows:
\begin{align}
\Pr(Y_i=1\mid X_i) ={}& \Pr\left(Y_i^{1\ast}>Y_i^{0\ast}\mid X_i\right)\\[5pt]
={}& \Pr\left(Y_i^{1\ast}-Y_i^{0\ast}>0\mid X_i\right)\\[5pt]
={}& \Pr\left(\boldsymbol\beta_1\cdot X_i+\varepsilon_1-\left(\boldsymbol\beta_0\cdot X_i+\varepsilon_0\right)>0\right)\\[5pt]
={}& \Pr\left((\boldsymbol\beta_1\cdot X_i-\boldsymbol\beta_0\cdot X_i)+(\varepsilon_1-\varepsilon_0)>0\right)\\[5pt]
={}& \Pr((\boldsymbol\beta_1-\boldsymbol\beta_0)\cdot X_i+(\varepsilon_1-\varepsilon_0)>0)\\[5pt]
={}& \Pr((\boldsymbol\beta_1-\boldsymbol\beta_0)\cdot X_i+\varepsilon>0) && \text{(substitute } \varepsilon \text{ as above)}\\[5pt]
={}& \Pr(\boldsymbol\beta\cdot X_i+\varepsilon>0) && \text{(substitute } \boldsymbol\beta \text{ as above)}\\[5pt]
={}& \Pr(\varepsilon>-\boldsymbol\beta\cdot X_i) && \text{(now, same as above model)}\\[5pt]
={}& \Pr(\varepsilon<\boldsymbol\beta\cdot X_i)\\[5pt]
={}& \operatorname{logit}^{-1}(\boldsymbol\beta\cdot X_i)\\[5pt]
={}& p_i
\end{align}
As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada). We would then use three latent variables, one for each choice. In accordance with utility theory, we can interpret the latent variables as expressing the utility that results from making each of the choices. We can also interpret the regression coefficients as indicating the strength that the associated factor (i.e. explanatory variable) has in contributing to the utility, or more correctly, the amount by which a unit change in an explanatory variable changes the utility of a given choice. A voter might expect that the right-of-center party would lower taxes, especially on rich people. This would give low-income people no benefit, i.e. no change in utility (since they usually don't pay taxes); would cause moderate benefit (i.e. somewhat more money, or a moderate utility increase) for middle-income people; and would cause significant benefits for high-income people. On the other hand, the left-of-center party might be expected to raise taxes and offset it with increased welfare and other assistance for the lower and middle classes. This would cause significant positive benefit to low-income people, perhaps a weak benefit to middle-income people, and significant negative benefit to high-income people. Finally, the secessionist party would take no direct actions on the economy, but simply secede. A low-income or middle-income voter might expect basically no clear utility gain or loss from this, but a high-income voter might expect negative utility, since they are likely to own companies, which would have a harder time doing business in such an environment and would probably lose money.
These intuitions can be expressed as follows:
Income group | Center-right | Center-left | Secessionist
---|---|---|---
High-income | strong + | strong − | strong −
Middle-income | moderate + | weak + | none
Low-income | none | strong + | none
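This clearly shows that separate sets of regression coefficients need to exist for each choice: different choices have different effects on net utility, and those effects vary with the characteristics of each individual. It also shows that, even though income is a continuous variable, its effect on utility is too complex to be captured by a single coefficient; income either needs to be split into ranges or entered with higher powers, so that a polynomial regression on income is effectively performed.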
Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.
Here, instead of writing the logit of the probabilities pi as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:
\begin{align}
\ln\Pr(Y_i=0) &= \boldsymbol\beta_0\cdot X_i-\ln Z\\
\ln\Pr(Y_i=1) &= \boldsymbol\beta_1\cdot X_i-\ln Z
\end{align}
Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear in a form that writes the logarithm of the associated probability as a linear predictor, with an extra term $-\ln Z$ at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:
\begin{align}
\Pr(Y_i=0) &= \frac{1}{Z}e^{\boldsymbol\beta_0\cdot X_i}\\[5pt]
\Pr(Y_i=1) &= \frac{1}{Z}e^{\boldsymbol\beta_1\cdot X_i}
\end{align}
In this form it is clear that the purpose of Z is to ensure that the resulting distribution over Yi is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply the sum of all un-normalized probabilities, and by dividing each probability by Z, the probabilities become "normalized". That is:
Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}
and the resulting equations are
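\begin{align} \Pr(Y_i=0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\[5pt] \Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align}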
Or generally:
\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}
This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit. This general formulation is exactly the softmax function, as in
\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots).
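The general formula can be illustrated with a short Python sketch. The coefficient vectors and the observation below are hypothetical values chosen only for illustration, and the helper names `dot` and `softmax_probs` are not part of any library.

```python
import math

def dot(a, b):
    """Inner product beta . X of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax_probs(scores):
    """Turn linear scores beta_c . X_i into probabilities Pr(Y_i = c)."""
    m = max(scores)                      # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                        # normalizing factor Z (rescaled by e^-m)
    return [e / z for e in exps]

# Hypothetical coefficients for three outcomes and one observation X_i
betas = [[0.0, 0.0], [0.5, -1.0], [-0.2, 0.3]]
x_i = [1.0, 2.0]

probs = softmax_probs([dot(b, x_i) for b in betas])
print(probs)
```

Dividing by Z is what makes the three values sum to 1; subtracting the largest score first is a common numerical precaution and does not change the result.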
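To prove that this is equivalent to the previous model, note first that the above model is overspecified, in that Pr(Yi = 0) and Pr(Yi = 1) cannot be specified independently: since Pr(Yi = 0) + Pr(Yi = 1) = 1, knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of β0 and β1 produce the same probabilities for all possible explanatory variables. In fact, adding any constant vector C to both of them produces the same probabilities:
\begin{align} \Pr(Y_i=1) &= \frac{e^{(\boldsymbol\beta_1+\mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0+\mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1+\mathbf{C}) \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\[5pt] &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i}\left(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}\right)} \\[5pt] &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}. \end{align}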
As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set
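\boldsymbol\beta_0 = \mathbf{0}.
Then
e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1
and so
\Pr(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1 + e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i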
which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where β = β1 − β0 will produce equivalent results.)
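The equivalence can be checked numerically. The following Python sketch uses hypothetical coefficient vectors and a hypothetical shift vector `c`; it confirms that shifting both coefficient vectors leaves Pr(Yi = 1) unchanged, and that the result equals the logistic function applied to (β1 − β0) · Xi.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def prob_one(beta0, beta1, x):
    """Pr(Y_i = 1) in the two-coefficient-vector (log-linear) formulation."""
    e0 = math.exp(dot(beta0, x))
    e1 = math.exp(dot(beta1, x))
    return e1 / (e0 + e1)

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical values, chosen only for illustration
beta0 = [0.3, -0.7]
beta1 = [1.1, 0.4]
c = [5.0, -1.0]          # arbitrary constant vector added to both
x = [1.0, 2.5]

p_original = prob_one(beta0, beta1, x)
p_shifted = prob_one([b + ci for b, ci in zip(beta0, c)],
                     [b + ci for b, ci in zip(beta1, c)], x)
# Single-vector form with beta = beta1 - beta0
p_logistic = logistic(dot([b1 - b0 for b0, b1 in zip(beta0, beta1)], x))

print(p_original, p_shifted, p_logistic)
```

All three printed values agree up to floating-point rounding, which is the nonidentifiability described above: only the difference β1 − β0 is determined by the data.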
Most treatments of the multinomial logit model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in econometrics and political science, where discrete choice models and utility theory reign, while the "log-linear" formulation here is more common in computer science, e.g. machine learning and natural language processing.
The model has an equivalent formulation
p_i = \frac{1}{1+e^{-(\beta_0+\beta_1 x_{1,i}+\cdots+\beta_k x_{k,i})}}.
This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of pi with respect to X = (x1, ..., xk) is computed from the general form:
y = \frac{1}{1+e^{-f(X)}}
where f(X) is an analytic function in X. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:
\frac{dy}{dX} = y(1-y)\frac{df}{dX}.
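As a quick check of the derivative formula, the sketch below compares y(1 − y) df/dX with a central finite difference. It assumes a linear f(X) = b + w · X with hypothetical weights; the function names are illustrative only.

```python
import math

def f(x, w, b):
    """A linear choice of f(X), assumed here for illustration."""
    return b + sum(wj * xj for wj, xj in zip(w, x))

def y(x, w, b):
    return 1.0 / (1.0 + math.exp(-f(x, w, b)))

def grad_y(x, w, b):
    """Analytic gradient: dy/dx_j = y (1 - y) df/dx_j, with df/dx_j = w_j."""
    yy = y(x, w, b)
    return [yy * (1.0 - yy) * wj for wj in w]

def numeric_grad_y(x, w, b, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for j in range(len(x)):
        up = list(x); up[j] += h
        dn = list(x); dn[j] -= h
        grads.append((y(up, w, b) - y(dn, w, b)) / (2.0 * h))
    return grads

w, b = [0.8, -1.5], 0.2      # hypothetical weights
x = [1.0, 0.5]
analytic = grad_y(x, w, b)
numeric = numeric_grad_y(x, w, b)
print(analytic, numeric)
```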
A closely related model assumes that each i is associated not with a single Bernoulli trial but with ni independent identically distributed trials, where the observation Yi is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a binomial distribution:
Y_i \sim \operatorname{Bin}(n_i, p_i), \text{ for } i = 1, \dots, n
An example of this distribution is the fraction of seeds (pi) that germinate after ni are planted.
In terms of expected values, this model is expressed as follows:
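p_i = \operatorname{E}\left[\left.\frac{Y_i}{n_i}\,\right|\,\mathbf{X}_i\right],
so that
\operatorname{logit}\left(\operatorname{E}\left[\left.\frac{Y_i}{n_i}\,\right|\,\mathbf{X}_i\right]\right) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i,
Or equivalently:
\Pr(Y_i=y \mid \mathbf{X}_i) = {n_i \choose y} p_i^y (1-p_i)^{n_i-y} = {n_i \choose y} \left(\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^y \left(1-\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{n_i-y}.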
This model can be fit using the same sorts of methods as the above more basic model.
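A minimal Python sketch of the binomial form follows, using the seed-germination setting with hypothetical coefficients and a hypothetical covariate vector. It evaluates Pr(Yi = y | Xi) for every possible count and checks that the values form a distribution with mean ni pi.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def binomial_logit_pmf(y, n, beta, x):
    """Pr(Y_i = y | X_i) when Y_i ~ Bin(n, p) and logit(p) = beta . X_i."""
    p = logistic(sum(b * v for b, v in zip(beta, x)))
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)

# Hypothetical example: 10 seeds planted, intercept plus one covariate
beta = [-0.5, 0.8]
x = [1.0, 1.5]
n = 10

pmf = [binomial_logit_pmf(y, n, beta, x) for y in range(n + 1)]
p = logistic(sum(b * v for b, v in zip(beta, x)))
mean = sum(y * q for y, q in enumerate(pmf))
print(p, mean)
```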
The regression coefficients are usually estimated using maximum likelihood estimation.[20] Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function so an iterative process must be used instead; for example Newton's method. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.
In some instances, the model may not reach convergence. Non-convergence of a model indicates that the coefficients are not meaningful because the iterative process was unable to find appropriate solutions. A failure to converge may occur for a number of reasons: having a large ratio of predictors to cases, multicollinearity, sparseness, or complete separation.
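Binary logistic regression (y = 0 or y = 1) can, for example, be fitted using iteratively reweighted least squares (IRLS), which is equivalent to maximizing the log-likelihood of a Bernoulli-distributed process using Newton's method. If the problem is written in vector-matrix form, with parameters
\mathbf{w}^T = [\beta_0, \beta_1, \beta_2, \ldots],
explanatory variables
\mathbf{x}(i) = [1, x_1(i), x_2(i), \ldots]^T
and expected value of the Bernoulli distribution
\mu(i) = \frac{1}{1+e^{-\mathbf{w}^T \mathbf{x}(i)}},
the parameters w can be found using the following iterative algorithm:
\mathbf{w}_{k+1} = \left(\mathbf{X}^T \mathbf{S}_k \mathbf{X}\right)^{-1} \mathbf{X}^T \left(\mathbf{S}_k \mathbf{X} \mathbf{w}_k + \mathbf{y} - \boldsymbol\mu_k\right)
where
\mathbf{S} = \operatorname{diag}(\mu(i)(1-\mu(i)))
is a diagonal weighting matrix,
\boldsymbol\mu = [\mu(1), \mu(2), \ldots]
is the vector of expected values,
\mathbf{X} = \begin{bmatrix} 1 & x_1(1) & x_2(1) & \ldots \\ 1 & x_1(2) & x_2(2) & \ldots \\ \vdots & \vdots & \vdots \end{bmatrix}
is the regressor matrix, and
\mathbf{y} = [y(1), y(2), \ldots]^T
is the vector of response variables.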
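The update can be sketched in plain Python for the simplest case of an intercept and a single predictor, where the 2×2 system can be solved by hand. The data set below is hypothetical (and deliberately not separable), the starting point w = 0 and the fixed count of 25 iterations are arbitrary choices, and a practical implementation would add a convergence test and safeguards.

```python
import math

def mu(w, x):
    """Expected value of the Bernoulli outcome for predictor value x."""
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x)))

def score(w, xs, ys):
    """Gradient of the log-likelihood, X^T (y - mu)."""
    r = [y - mu(w, x) for x, y in zip(xs, ys)]
    return [sum(r), sum(ri * x for ri, x in zip(r, xs))]

def irls_step(w, xs, ys):
    """One update w_{k+1} = (X^T S X)^{-1} X^T (S X w_k + y - mu_k)."""
    m = [mu(w, x) for x in xs]
    s = [mi * (1.0 - mi) for mi in m]               # diagonal of S
    # Working vector z = S X w + y - mu
    z = [si * (w[0] + w[1] * x) + y - mi
         for si, x, y, mi in zip(s, xs, ys, m)]
    # Entries of the symmetric 2x2 matrix X^T S X
    a = sum(s)
    b = sum(si * x for si, x in zip(s, xs))
    d = sum(si * x * x for si, x in zip(s, xs))
    # Right-hand side X^T z
    r0 = sum(z)
    r1 = sum(zi * x for zi, x in zip(z, xs))
    det = a * d - b * b
    return [(d * r0 - b * r1) / det, (a * r1 - b * r0) / det]

# Hypothetical, non-separable data
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
for _ in range(25):
    w = irls_step(w, xs, ys)
print(w, score(w, xs, ys))
```

At convergence the score X^T(y − μ) is numerically zero, which is the first-order condition for the maximum of the log-likelihood.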
In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, for example in the form of Gaussian distributions. There is no conjugate prior of the likelihood function in logistic regression. When Bayesian inference was performed analytically, this made the posterior distribution difficult to calculate except in very low dimensions. Now, though, automatic software such as OpenBUGS, JAGS, PyMC, Stan or Turing.jl allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as variational Bayesian methods and expectation propagation.
See main article: One in ten rule.
The widely used "one in ten rule" states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV), where event denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use k explanatory variables for an event expected to occur in a proportion p of participants will require a total of 10k/p participants.
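The arithmetic is simple enough to show directly. In the sketch below, the values k = 5 and p = 0.12 are hypothetical, and the function name `min_participants` is illustrative only.

```python
import math

def min_participants(k, p, events_per_variable=10):
    """Participants needed so that about `events_per_variable` events are
    expected per explanatory variable: events_per_variable * k / p."""
    return math.ceil(events_per_variable * k / p)

# Hypothetical study: 5 explanatory variables, event in 12% of participants
n_rule_of_ten = min_participants(5, 0.12)
n_twenty_epv = min_participants(5, 0.12, events_per_variable=20)
print(n_rule_of_ten, n_twenty_epv)
```

The second call uses 20 events per variable, a stricter threshold than the rule itself.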
Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required. Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.
In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "overfitting" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is simply that which may be expected from overfitting.
In short, for logistic regression, a statistic known as the deviance is defined which is a measure of the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is chi-squared distributed, which allows a chi-squared test to be implemented in order to determine the significance of the explanatory variables.
Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of K data points (xk, yk) are fitted to a proposed model function of the form
y = b_0 + b_1 x
The fit is obtained by choosing the b parameters which minimize the sum of the squares of the residuals (the squared error term) over the data points:
\varepsilon^2 = \sum_{k=1}^K (b_0 + b_1 x_k - y_k)^2.
The minimum value which constitutes the fit will be denoted by
\hat{\varepsilon}^2
The idea of a null model may be introduced, in which it is assumed that the x variable is of no use in predicting the yk outcomes: The data points are fitted to a null model function of the form y = b0 with a squared error term:
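\varepsilon^2 = \sum_{k=1}^K (b_0 - y_k)^2.
The fitting process consists of choosing a value of b0 which minimizes \varepsilon^2 of the fit to the null model, denoted by \varepsilon_\varphi^2, where the \varphi subscript denotes the null model. The null model is optimized by b_0 = \overline{y}, where \overline{y} is the mean of the yk values, and the optimized \varepsilon_\varphi^2 is:
\hat{\varepsilon}_\varphi^2 = \sum_{k=1}^K (\overline{y} - y_k)^2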
which is proportional to the square of the (uncorrected) sample standard deviation of the yk data points.
We can imagine a case where the yk data points are randomly assigned to the various xk, and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the yk outcomes. Because the proposed model contains the null model as a special case (b1 = 0), the optimized error of any of these fits will never be greater than the optimum error of the null model. It can also be shown that the difference between these minimum errors will follow a chi-squared distribution, with degrees of freedom equal to those of the proposed model minus those of the null model which, in this case, will be 2 − 1 = 1.
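For logistic regression, the measure of goodness-of-fit is the likelihood function L, or its logarithm, the log-likelihood ℓ. The likelihood function L is analogous to the \varepsilon^2 in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by \hat{\ell}.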
In the case of simple binary logistic regression, the set of K data points are fitted in a probabilistic sense to a function of the form:
p(x) = \frac{1}{1+e^{-t}}
where p(x) is the probability that y = 1. The log-odds are given by:
t = \beta_0 + \beta_1 x
and the log-likelihood is:
\ell = \sum_{k=1}^K \left(y_k \ln(p(x_k)) + (1-y_k) \ln(1-p(x_k))\right)
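For the null model, the probability that y = 1 is given by:
p_\varphi(x) = \frac{1}{1+e^{-t_\varphi}}
The log-odds for the null model are given by:
t_\varphi = \beta_0
and the log-likelihood is:
\ell_\varphi = \sum_{k=1}^K \left(y_k \ln(p_\varphi) + (1-y_k) \ln(1-p_\varphi)\right)
Since we have p_\varphi = \overline{y} at the maximum of L, the maximum log-likelihood for the null model is
\hat{\ell}_\varphi = K(\overline{y} \ln(\overline{y}) + (1-\overline{y}) \ln(1-\overline{y}))
The optimum \beta_0 is:
\beta_0 = \ln\left(\frac{\overline{y}}{1-\overline{y}}\right)
where \overline{y} is again the mean of the yk values. Again, we can conceptually consider the fit of the proposed model to every permutation of the yk, and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model:
\hat{\ell} \ge \hat{\ell}_\varphi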
Also, as an analog to the error of the linear regression case, we may define the deviance of a logistic regression fit as:
D = \ln\left(\frac{\hat{L}^2}{\hat{L}_\varphi^2}\right) = 2(\hat{\ell} - \hat{\ell}_\varphi)
which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (K) increases, becoming exactly chi-squared distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so obtain an estimate of how significantly the model is improved by including the xk data points in the proposed model.
For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is
\hat{\ell}_\varphi = -13.8629\ldots
The maximum value of the log-likelihood for the simple model is
\hat{\ell} = -8.02988\ldots
so that the deviance is
D = 2(\hat{\ell} - \hat{\ell}_\varphi) = 11.6661\ldots
Using the chi-squared test of significance, the integral of the chi-squared distribution with one degree of freedom from 11.6661... to infinity is equal to 0.00063649...
This effectively means that about 6 out of 10,000 fits to random yk can be expected to fit at least as well (that is, to have a deviance at least this large) as the given yk, and so we can conclude that the inclusion of the x variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the null hypothesis with 1 − p ≈ 99.94% confidence, where p ≈ 0.00063649 is the tail probability computed above.
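The numbers in this example can be reproduced with a few lines of Python. The two log-likelihood values are the truncated figures quoted in the text; the only added ingredient is the standard identity that, for one degree of freedom, the chi-squared tail probability above D equals erfc(√(D/2)).

```python
import math

ll_null = -13.8629      # maximized log-likelihood of the null model (from the text)
ll_fit = -8.02988       # maximized log-likelihood of the proposed model (from the text)

# Deviance D = 2 (l_hat - l_hat_phi)
deviance = 2.0 * (ll_fit - ll_null)

# Tail probability of a chi-squared variable with 1 degree of freedom:
# P(chi2_1 > D) = P(|Z| > sqrt(D)) = erfc(sqrt(D / 2))
p_value = math.erfc(math.sqrt(deviance / 2.0))

confidence = 1.0 - p_value
print(deviance, p_value, confidence)
```

Because the inputs are truncated, the printed deviance agrees with 11.6661 only to about four decimal places.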
Goodness of fit in linear regression models is generally measured using R². Since this has no direct analog in logistic regression, various methods[26] including the following can be used instead.
In linear regression analysis, one is concerned with partitioning variance via the sum of squares calculations – variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, deviance is used in lieu of a sum of squares calculations. Deviance is analogous to the sum of squares calculations in linear regression and is a measure of the lack of fit to the data in a logistic regression model. When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model. This computation gives the likelihood-ratio test:
D = -2\ln\frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}.
In the above equation, D represents the deviance and ln represents the natural logarithm. The log of this likelihood ratio (the ratio of the fitted model to the saturated model) will produce a negative value, hence the need for a negative sign. D can be shown to follow an approximate chi-squared distribution. Smaller values indicate better fit, as the fitted model deviates less from the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square values indicate very little unexplained variance and thus good model fit. Conversely, a significant chi-square value indicates that a significant amount of the variance is unexplained.
When the saturated model is not available (a common case), deviance is calculated simply as −2·(log likelihood of the fitted model), and the reference to the saturated model's log likelihood can be removed from all that follows without harm.
Two measures of deviance are particularly important in logistic regression: null deviance and model deviance. The null deviance represents the difference between a model with only the intercept (which means "no predictors") and the saturated model. The model deviance represents the difference between a model with at least one predictor and the saturated model. In this respect, the null model provides a baseline upon which to compare predictor models. Given that deviance is a measure of the difference between a given model and the saturated model, smaller values indicate better fit. Thus, to assess the contribution of a predictor or set of predictors, one can subtract the model deviance from the null deviance and assess the difference on a \chi^2_{s-p} chi-square distribution, with degrees of freedom equal to the difference in the number of parameters estimated.
Let
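\begin{align} D_\text{null} &= -2\ln\frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} \\[6pt] D_\text{fitted} &= -2\ln\frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}. \end{align}
Then the difference of both is:
\begin{align} D_\text{null} - D_\text{fitted} &= -2\left(\ln\frac{\text{likelihood of null model}}{\text{likelihood of the saturated model}} - \ln\frac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}\right) \\[6pt] &= -2\ln\frac{\left(\dfrac{\text{likelihood of null model}}{\text{likelihood of the saturated model}}\right)}{\left(\dfrac{\text{likelihood of fitted model}}{\text{likelihood of the saturated model}}\right)} \\[6pt] &= -2\ln\frac{\text{likelihood of null model}}{\text{likelihood of fitted model}}. \end{align}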
If the model deviance is significantly smaller than the null deviance, then one can conclude that the predictor or set of predictors significantly improves the model's fit. This is analogous to the F-test used in linear regression analysis to assess the significance of prediction.
See main article: Pseudo-R-squared. In linear regression the squared multiple correlation, R², is used to assess goodness of fit, as it represents the proportion of variance in the criterion that is explained by the predictors. In logistic regression analysis, there is no agreed-upon analogous measure, but there are several competing measures, each with limitations.[27]
Four of the most commonly used indices and one less commonly used one are examined on this page:
The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a χ² distribution.
After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor. In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient – the odds ratio (see definition). In linear regression, the significance of a regression coefficient is assessed by computing a t test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic.
The likelihood-ratio test discussed above to assess model fit is also the recommended procedure to assess the contribution of individual "predictors" to a given model. In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-square distribution with a single degree of freedom. If the predictor model has significantly smaller deviance (c.f. chi-square using the difference in degrees of freedom of the two models), then one can conclude that there is a significant association between the "predictor" and the outcome. Although some common statistical packages (e.g. SPSS) do provide likelihood ratio test statistics, without this computationally intensive test it would be more difficult to assess the contribution of individual predictors in the multiple logistic regression case. To assess the contribution of individual predictors one can enter the predictors hierarchically, comparing each new model with the previous to determine the contribution of each predictor. There is some debate among statisticians about the appropriateness of so-called "stepwise" procedures. The fear is that they may not preserve nominal statistical properties and may become misleading.[29]
Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the t-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution.
W_j = \frac{\beta_j^2}{SE_{\beta_j}^2}
Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger increasing the probability of Type-II error. The Wald statistic also tends to be biased when data are sparse.
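For a single coefficient, the Wald statistic and its asymptotic p-value can be sketched as follows. The coefficient and standard error are hypothetical values, and the p-value uses the chi-squared distribution with one degree of freedom, whose tail probability above W equals erfc(√(W/2)).

```python
import math

def wald_statistic(beta_j, se_j):
    """W_j = beta_j^2 / SE_j^2."""
    return (beta_j / se_j) ** 2

def chi2_tail_1dof(w):
    """P(chi2_1 > w), computed as erfc(sqrt(w / 2))."""
    return math.erfc(math.sqrt(w / 2.0))

# Hypothetical fitted coefficient and its standard error
beta_j, se_j = 1.5, 0.6
w_j = wald_statistic(beta_j, se_j)
p_wald = chi2_tail_1dof(w_j)
print(w_j, p_wald)
```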
Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose there is a disease that affects 1 person in 10,000 and to collect our data we need to do a complete physical. It may be too expensive to do thousands of physicals of healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also called retrospective sampling, and the resulting data are called unbalanced data. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data.[30]
Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data and the model is correct in the general population, the \beta_j parameters are all correct except for \beta_0. We can correct \beta_0 if we know the true prevalence as follows:
\widehat{\beta}_0^* = \widehat{\beta}_0 + \log\frac{\pi}{1-\pi} - \log\frac{\tilde{\pi}}{1-\tilde{\pi}}
where \pi is the true prevalence and \tilde{\pi} is the prevalence in the sample.
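The correction is a one-line computation. In the sketch below the fitted intercept is a hypothetical value; the true prevalence of 1 in 10,000 and the sample prevalence of 1/6 (five controls per case) follow the example and rule of thumb given above.

```python
import math

def log_odds(p):
    return math.log(p / (1.0 - p))

def corrected_intercept(beta0_hat, true_prevalence, sample_prevalence):
    """beta0* = beta0_hat + log-odds(pi) - log-odds(pi_tilde)."""
    return beta0_hat + log_odds(true_prevalence) - log_odds(sample_prevalence)

beta0_hat = -1.2                 # hypothetical intercept fitted on the unbalanced sample
pi_true = 1.0 / 10000.0          # disease affects 1 person in 10,000
pi_sample = 1.0 / 6.0            # one case for every five controls

beta0_star = corrected_intercept(beta0_hat, pi_true, pi_sample)
print(beta0_star)
```

Because cases are heavily over-represented in the sample, the corrected intercept is much lower than the fitted one; the other coefficients are left unchanged.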
Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical. Unlike ordinary linear regression, however, logistic regression is used for predicting dependent variables that take membership in one of a limited number of categories (treating the dependent variable in the binomial case as the outcome of a Bernoulli trial) rather than a continuous outcome. Given this difference, the assumptions of linear regression are violated. In particular, the residuals cannot be normally distributed. In addition, linear regression may make nonsensical predictions for a binary dependent variable. What is needed is a way to convert a binary variable into a continuous one that can take on any real value (negative or positive). To do that, binomial logistic regression first calculates the odds of the event happening for different levels of each independent variable, and then takes the logarithm of the odds to create a continuous criterion as a transformed version of the dependent variable. The logarithm of the odds is the logit of the probability; the logit is defined as follows:
\operatorname{logit} p = \ln\frac{p}{1-p}
Using the more condensed vector notation:
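\sum_{m=0}^M \lambda_{nm} x_{mk} = \boldsymbol\lambda_n \cdot \boldsymbol{x}_k
and dropping the primes on the n and k indices, and then solving for p_{nk} yields:
p_{nk} = e^{\boldsymbol\lambda_n \cdot \boldsymbol{x}_k} / Z_k
where:
Z_k = e^{1+\alpha_k}
Imposing the normalization constraint, we can solve for the Zk and write the probabilities as:
p_{nk} = \frac{e^{\boldsymbol\lambda_n \cdot \boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol\lambda_u \cdot \boldsymbol{x}_k}}
The \boldsymbol\lambda_n are not all independent. We can add any constant (M+1)-dimensional vector to each of the \boldsymbol\lambda_n without changing the value of the p_{nk} probabilities, so that there are only N rather than N+1 independent \boldsymbol\lambda_n. In the multinomial logistic regression formulation, \boldsymbol\lambda_0 is subtracted from each \boldsymbol\lambda_n, which sets the exponential term involving \boldsymbol\lambda_0 to 1, and the beta coefficients are given by
\boldsymbol\beta_n = \boldsymbol\lambda_n - \boldsymbol\lambda_0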
In machine learning applications where logistic regression is used for binary classification, the MLE minimises the cross-entropy loss function.
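Logistic regression is an important machine learning algorithm. The goal is to model the probability of a random variable Y being 0 or 1 given experimental data.
Consider a generalized linear model function parameterized by \theta,
h_\theta(X) = \frac{1}{1+e^{-\theta^T X}} = \Pr(Y=1 \mid X; \theta)
Therefore,
\Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X)
and since Y \in \{0,1\}, we see that \Pr(y \mid X; \theta) is given by
\Pr(y \mid X; \theta) = h_\theta(X)^y (1 - h_\theta(X))^{(1-y)}.
Assuming that all the observations in the sample are independently Bernoulli distributed, the likelihood function is
\begin{align} L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\ &= \prod_i \Pr(y_i \mid x_i; \theta) \\ &= \prod_i h_\theta(x_i)^{y_i} (1 - h_\theta(x_i))^{(1-y_i)} \end{align}
Typically, the log likelihood is maximized,
N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta)
Assuming the (x, y) pairs are drawn independently from the underlying distribution, then in the limit of large N,
\begin{align} \lim_{N\to\infty} N^{-1} \sum_{i=1}^N \log\Pr(y_i \mid x_i; \theta) &= \sum_x \sum_y \Pr(X=x, Y=y) \log\Pr(Y=y \mid X=x; \theta) \\ &= -D_\text{KL}(Y \parallel Y_\theta) - H(Y \mid X) \end{align}
where H(Y \mid X) is the conditional entropy and D_\text{KL} is the Kullback–Leibler divergence between the true conditional distribution of Y and the distribution Y_\theta given by the model, averaged over X. Since H(Y \mid X) does not depend on \theta, maximizing the log-likelihood is equivalent to minimizing this divergence.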
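The maximization of the mean log-likelihood (equivalently, minimization of the cross-entropy loss) can be sketched with plain gradient ascent. The data set, the learning rate of 0.1, and the 2000 iterations below are hypothetical choices for illustration, with θ = (θ0, θ1) covering an intercept and one feature; this is not the only or the most efficient optimizer for the problem.

```python
import math

def h(theta, x):
    """Model probability Pr(Y = 1 | x; theta) for intercept plus one feature."""
    return 1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x)))

def cross_entropy(theta, xs, ys):
    """Negative mean log-likelihood, -N^{-1} sum_i log Pr(y_i | x_i; theta)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = h(theta, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(xs)

def gradient_step(theta, xs, ys, rate):
    """One gradient-ascent step on the mean log-likelihood."""
    n = len(xs)
    g0 = sum(y - h(theta, x) for x, y in zip(xs, ys)) / n
    g1 = sum((y - h(theta, x)) * x for x, y in zip(xs, ys)) / n
    return [theta[0] + rate * g0, theta[1] + rate * g1]

# Hypothetical, non-separable data
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
losses = [cross_entropy(theta, xs, ys)]
for _ in range(2000):
    theta = gradient_step(theta, xs, ys, rate=0.1)
    losses.append(cross_entropy(theta, xs, ys))
print(theta, losses[0], losses[-1])
```

At θ = 0 every predicted probability is 1/2, so the initial loss is ln 2; the loss then decreases as the log-likelihood increases.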
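Logistic regression can be seen as a special case of the generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution of y given x is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to the interval (0, 1) through the logistic function, because logistic regression predicts the probability of particular outcomes rather than the outcomes themselves.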
A common alternative to the logistic model (logit model) is the probit model, as the related names suggest. From the perspective of generalized linear models, these differ in the choice of link function: the logistic model uses the logit function (inverse logistic function), while the probit model uses the probit function (the inverse of the standard normal cumulative distribution function). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors.[33] Other sigmoid functions or error distributions can be used instead.
Logistic regression is an alternative to Fisher's 1936 method, linear discriminant analysis.[34] If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis.[35]
The assumption of linear predictor effects can easily be relaxed using techniques such as spline functions.
A detailed history of the logistic regression is given in . The logistic function was developed as a model of population growth and named "logistic" by Pierre François Verhulst in the 1830s and 1840s, under the guidance of Adolphe Quetelet; see for details. In his earliest paper (1838), Verhulst did not specify how he fit the curves to the data.[36] In his more detailed paper (1845), Verhulst determined the three parameters of the model by making the curve pass through three observed points, which yielded poor predictions.[37]
The logistic function was independently developed in chemistry as a model of autocatalysis (Wilhelm Ostwald, 1883). An autocatalytic reaction is one in which one of the products is itself a catalyst for the same reaction, while the supply of one of the reactants is fixed. This naturally gives rise to the logistic equation for the same reason as population growth: the reaction is self-reinforcing but constrained.
The logistic function was independently rediscovered as a model of population growth in 1920 by Raymond Pearl and Lowell Reed, published as, which led to its use in modern statistics. They were initially unaware of Verhulst's work and presumably learned about it from L. Gustave du Pasquier, but they gave him little credit and did not adopt his terminology. Verhulst's priority was acknowledged and the term "logistic" revived by Udny Yule in 1925 and has been followed since. Pearl and Reed first applied the model to the population of the United States, and also initially fitted the curve by making it pass through three points; as with Verhulst, this again yielded poor results.
In the 1930s, the probit model was developed and systematized by Chester Ittner Bliss, who coined the term "probit" in, and by John Gaddum in, and the model fit by maximum likelihood estimation by Ronald A. Fisher in, as an addendum to Bliss's work. The probit model was principally used in bioassay, and had been preceded by earlier work dating to 1860; see . The probit model influenced the subsequent development of the logit model and these models competed with each other.
The logistic model was likely first used as an alternative to the probit model in bioassay by Edwin Bidwell Wilson and his student Jane Worcester in . However, the development of the logistic model as a general alternative to the probit model was principally due to the work of Joseph Berkson over many decades, beginning in, where he coined "logit", by analogy with "probit", and continuing through and following years. The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the probit", particularly between 1960 and 1970. By 1970, the logit model achieved parity with the probit model in use in statistics journals and thereafter surpassed it. This relative popularity was due to the adoption of the logit outside of bioassay, rather than displacing the probit within bioassay, and its informal use in practice; the logit's popularity is credited to the logit model's computational simplicity, mathematical properties, and generality, allowing its use in varied fields.
Various refinements occurred during that time, notably by David Cox, as in .[38]
The multinomial logit model was introduced independently in and, which greatly increased the scope of application and the popularity of the logit model. In 1973 Daniel McFadden linked the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit followed from the assumption of independence of irrelevant alternatives and interpreting odds of alternatives as relative preferences;[39] this gave a theoretical foundation for the logistic regression.
There are large numbers of extensions: