Applications
Example
Problem
A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?
Model
Fit
Parameter estimation
Predictions
Model evaluation
Generalizations
Background
Definition of the logistic function
Definition of the inverse of the logistic function
Interpretation of these terms
Definition of the odds
The odds ratio
Multiple explanatory variables
Definition
Many explanatory variables, two categories
Multinomial logistic regression: Many explanatory variables and many categories
Interpretations
As a generalized linear model
As a latent-variable model
Two-way latent-variable model
As a "log-linear" model
As a single-layer perceptron
In terms of binomial data
Model fitting
Maximum likelihood estimation (MLE)
Iteratively reweighted least squares (IRLS)
Bayesian
"Rule of ten"
Error and significance of fit
Deviance and likelihood ratio test – a simple case
Goodness of fit summary
Deviance and likelihood ratio tests
Pseudo-R-squared
Hosmer–Lemeshow test
Coefficient significance
Likelihood ratio test
Wald statistic
Case-control sampling
Discussion
Although the dependent variable in logistic regression is Bernoulli, the logit is on an unrestricted scale. The logit function is the link function in this kind of generalized linear model, i.e.

:\operatorname{logit} \operatorname{E}(Y) = \beta_0 + \beta_1 x

where Y is the Bernoulli-distributed response variable and x is the predictor variable; the \beta values are the linear parameters. The logit of the probability of success is then fitted to the predictors. The predicted value of the logit is converted back into predicted odds via the inverse of the natural logarithm, the exponential function. Thus, although the observed dependent variable in binary logistic regression is a 0-or-1 variable, the logistic regression estimates the odds, as a continuous variable, that the dependent variable is a 'success'. In some applications, the odds are all that is needed. In others, a specific yes-or-no prediction is needed for whether the dependent variable is or is not a 'success'; this categorical prediction can be based on the computed odds of success, with predicted odds above some chosen cutoff value being translated into a prediction of success.

Maximum entropy

Of all the functional forms used for estimating the probabilities of a particular categorical outcome which optimize the fit by maximizing the likelihood function (e.g. probit regression, Poisson regression, etc.), the logistic regression solution is unique in that it is a maximum entropy solution.
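The logit-to-odds-to-prediction pipeline described above can be sketched as follows; the coefficient values and the 0.5 probability cutoff (i.e. odds of 1) are illustrative assumptions, not values from the text:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(t):
    """Logistic (inverse-logit) function: maps any real t to (0, 1)."""
    return 1 / (1 + math.exp(-t))

# Illustrative fitted parameters (assumed for this sketch)
beta0, beta1 = -4.0, 1.5

x = 3.0                      # predictor value
t = beta0 + beta1 * x        # linear predictor on the unrestricted logit scale
odds = math.exp(t)           # predicted odds of 'success'
p = inv_logit(t)             # predicted probability, equal to odds / (1 + odds)

prediction = 1 if p > 0.5 else 0   # categorical prediction via a chosen cutoff
```

The cutoff converts the continuous odds estimate into the yes-or-no prediction mentioned above; different applications may choose a different threshold.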
This is a case of a general property: an exponential family of distributions maximizes entropy, given an expected value. In the case of the logistic model, the logistic function is the natural parameter of the Bernoulli distribution (it is in "canonical form", and the logistic function is the canonical link function), while other sigmoid functions are non-canonical link functions; this underlies its mathematical elegance and ease of optimization. See for details.

Proof

In order to show this, we use the method of Lagrange multipliers. The Lagrangian is equal to the entropy plus the sum of the products of Lagrange multipliers times various constraint expressions. The general multinomial case will be considered, since the proof is not made that much simpler by considering simpler cases. Equating the derivative of the Lagrangian with respect to the various probabilities to zero yields a functional form for those probabilities which corresponds to those used in logistic regression. As in the above section on multinomial logistic regression, we will consider M+1 explanatory variables denoted x_m, which include x_0 = 1. There will be a total of ''K'' data points, indexed by k = \{1, 2, \dots, K\}, and the data points are given by x_{mk} and y_k.
The x_{mk} will also be represented as an (M+1)-dimensional vector \boldsymbol{x}_k = \{x_{0k}, x_{1k}, \dots, x_{Mk}\}. There will be N+1 possible values of the categorical variable ''y'', ranging from 0 to N. Let p_n(\boldsymbol{x}) be the probability, given explanatory variable vector \boldsymbol{x}, that the outcome will be y = n. Define p_{nk} = p_n(\boldsymbol{x}_k), which is the probability that for the ''k''-th measurement, the categorical outcome is ''n''.

The Lagrangian will be expressed as a function of the probabilities p_{nk} and will be maximized by equating the derivatives of the Lagrangian with respect to these probabilities to zero. An important point is that the probabilities are treated equally and the fact that they sum to unity is part of the Lagrangian formulation, rather than being assumed from the beginning.

The first contribution to the Lagrangian is the entropy:

:\mathcal{L}_{ent} = -\sum_{k=1}^K \sum_{n=0}^N p_{nk} \ln(p_{nk})

The log-likelihood is:

:\ell = \sum_{k=1}^K \sum_{n=0}^N \Delta(n, y_k) \ln(p_{nk})

Assuming the multinomial logistic function, the derivative of the log-likelihood with respect to the beta coefficients was found to be:

:\frac{\partial \ell}{\partial \beta_{nm}} = \sum_{k=1}^K (p_{nk} x_{mk} - \Delta(n, y_k) x_{mk})

A very important point here is that this expression is (remarkably) not an explicit function of the beta coefficients. It is only a function of the probabilities p_{nk} and the data. Rather than being specific to the assumed multinomial logistic case, it is taken to be a general statement of the condition at which the log-likelihood is maximized and makes no reference to the functional form of p_{nk}. There are then (''M''+1)(''N''+1) fitting constraints and the fitting constraint term in the Lagrangian is then:

:\mathcal{L}_{fit} = \sum_{n=0}^N \sum_{m=0}^M \lambda_{nm} \sum_{k=1}^K (p_{nk} x_{mk} - \Delta(n, y_k) x_{mk})

where the \lambda_{nm} are the appropriate Lagrange multipliers.
There are ''K'' normalization constraints which may be written:

:\sum_{n=0}^N p_{nk} = 1

so that the normalization term in the Lagrangian is:

:\mathcal{L}_{norm} = \sum_{k=1}^K \alpha_k \left(1 - \sum_{n=0}^N p_{nk}\right)

where the \alpha_k are the appropriate Lagrange multipliers. The Lagrangian is then the sum of the above three terms:

:\mathcal{L} = \mathcal{L}_{ent} + \mathcal{L}_{fit} + \mathcal{L}_{norm}

Setting the derivative of the Lagrangian with respect to one of the probabilities to zero yields:

:\frac{\partial \mathcal{L}}{\partial p_{n'k'}} = 0 = -\ln(p_{n'k'}) - 1 + \sum_{m=0}^M (\lambda_{n'm} x_{mk'}) - \alpha_{k'}

Using the more condensed vector notation:

:\sum_{m=0}^M \lambda_{nm} x_{mk} = \boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k

and dropping the primes on the ''n'' and ''k'' indices, and then solving for p_{nk}, yields:

:p_{nk} = e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k} / Z_k

where:

:Z_k = e^{1 + \alpha_k}

Imposing the normalization constraint, we can solve for the Z_k and write the probabilities as:

:p_{nk} = \frac{e^{\boldsymbol{\lambda}_n \cdot \boldsymbol{x}_k}}{\sum_{u=0}^N e^{\boldsymbol{\lambda}_u \cdot \boldsymbol{x}_k}}

The \boldsymbol{\lambda}_n are not all independent. We can add any constant (M+1)-dimensional vector to each of the \boldsymbol{\lambda}_n without changing the value of the p_{nk} probabilities, so that there are only ''N'' rather than N+1 independent \boldsymbol{\lambda}_n. In the multinomial logistic regression section above, \boldsymbol{\lambda}_0 was subtracted from each \boldsymbol{\lambda}_n, which set the exponential term involving \boldsymbol{\lambda}_0 to unity, and the beta coefficients were given by \boldsymbol{\beta}_n = \boldsymbol{\lambda}_n - \boldsymbol{\lambda}_0.

Other approaches

In machine learning applications where logistic regression is used for binary classification, the MLE minimises the cross-entropy loss function. Logistic regression is an important machine learning algorithm.
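The softmax form derived in the proof above, including its invariance to adding a constant vector to each \boldsymbol{\lambda}_n, can be sketched numerically; the score values below are illustrative assumptions:

```python
import math

def softmax(scores):
    """Probabilities p_n = exp(s_n) / sum_u exp(s_u), as in the derivation."""
    # Subtracting the max is standard practice for numerical stability; it is
    # exactly the shift-invariance noted in the text (adding a constant to
    # every score leaves the probabilities unchanged).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative linear scores lambda_n . x_k for N+1 = 3 categories
scores = [1.0, 2.0, 0.5]
probs = softmax(scores)

# Shifting every score by a constant leaves the probabilities unchanged
shifted = softmax([s + 10.0 for s in scores])
```

This is why only ''N'' of the N+1 parameter vectors are independent: one of them can always be normalized away, as done with \boldsymbol{\lambda}_0 in the text.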
The goal is to model the probability of a random variable Y being 0 or 1 given experimental data. Consider a generalized linear model function parameterized by \theta:

:h_\theta(X) = \frac{1}{1 + e^{-\theta^T X}} = \Pr(Y=1 \mid X; \theta)

Therefore,

:\Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X)

and since Y \in \{0, 1\}, we see that \Pr(y \mid X; \theta) is given by

:\Pr(y \mid X; \theta) = h_\theta(X)^y (1 - h_\theta(X))^{1-y}.

We now calculate the likelihood function, assuming that all the observations in the sample are independently Bernoulli distributed:

:\begin{align} L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\ &= \prod_i \Pr(y_i \mid x_i; \theta) \\ &= \prod_i h_\theta(x_i)^{y_i} (1 - h_\theta(x_i))^{1-y_i} \end{align}

Typically, the log likelihood is maximized,

:N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta)

which is maximized using optimization techniques such as gradient descent.
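A minimal sketch of this fitting procedure: gradient ascent on the average log-likelihood, whose gradient is N^{-1} \sum_i (y_i - h_\theta(x_i)) x_i. The synthetic data, learning rate, and step count are assumptions for illustration, not from the text:

```python
import math
import random

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Maximize the average log-likelihood by gradient ascent.

    xs: feature vectors (each with a leading 1 for the intercept),
    ys: 0/1 labels. Returns the parameter vector theta.
    """
    theta = [0.0] * len(xs[0])
    n = len(xs)
    for _ in range(steps):
        # Gradient of N^{-1} log L: average of (y_i - h_theta(x_i)) * x_i
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            err = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j, xi in enumerate(x):
                grad[j] += err * xi
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta

# Synthetic data (assumed): success becomes more likely as x grows
random.seed(0)
xs = [[1.0, random.uniform(0, 6)] for _ in range(200)]
ys = [1 if random.random() < sigmoid(-3 + 1.2 * x[1]) else 0 for x in xs]
theta = fit_logistic(xs, ys)   # theta[1] should recover a positive slope
```

In practice one would use a second-order method such as IRLS (mentioned in the model-fitting section) or a library routine rather than plain gradient ascent, but the gradient is the same.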
Assuming the (x, y) pairs are drawn uniformly from the underlying distribution, then in the limit of large ''N'',

:\begin{align} & \lim_{N \to +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\ = {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( -\log \frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\ = {} & -D_\text{KL}(Y \parallel Y_\theta) - H(Y \mid X) \end{align}

where H(Y \mid X) is the conditional entropy and D_\text{KL} is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution; intuitively, you are searching for the model that makes the fewest assumptions in its parameters.

Comparison with linear regression

Logistic regression can be seen as a special case of the generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression.
First, the conditional distribution y \mid x is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0, 1) through the logistic distribution function, because logistic regression predicts the probability of particular outcomes rather than the outcomes themselves.

Alternatives

A common alternative to the logistic model (logit model) is the probit model, as the related names suggest. From the perspective of generalized linear models, these differ in the choice of link function: the logistic model uses the logit function (inverse logistic function), while the probit model uses the probit function (inverse error function). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors.
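The two inverse link functions can be compared side by side. A minimal sketch using only the standard library (the sample points are arbitrary; the standard normal CDF is written via math.erf):

```python
import math

def logistic_cdf(t):
    """Inverse link of the logit model: the standard logistic CDF."""
    return 1 / (1 + math.exp(-t))

def probit_cdf(t):
    """Inverse link of the probit model: the standard normal CDF."""
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

# Both map the real line into (0, 1) with an S-shape; they differ mainly
# in scale and tail behaviour (the logistic has heavier tails).
for t in (-2.0, 0.0, 2.0):
    print(f"t={t:+.1f}  logistic={logistic_cdf(t):.3f}  normal={probit_cdf(t):.3f}")
```

Both curves pass through 1/2 at t = 0, which is why fitted probit and logit coefficients are roughly proportional rather than equal.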
Other sigmoid functions or error distributions can be used instead. Logistic regression is an alternative to Fisher's 1936 method, linear discriminant analysis. If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis. The assumption of linear predictor effects can easily be relaxed using techniques such as spline functions.

History

A detailed history of the logistic regression is given in . The logistic function was developed as a model of population growth and named "logistic" by Pierre François Verhulst in the 1830s and 1840s, under the guidance of Adolphe Quetelet; see for details. In his earliest paper (1838), Verhulst did not specify how he fit the curves to the data. In his more detailed paper (1845), Verhulst determined the three parameters of the model by making the curve pass through three observed points, which yielded poor predictions.
The logistic function was independently developed in chemistry as a model of autocatalysis (Wilhelm Ostwald, 1883). An autocatalytic reaction is one in which one of the products is itself a catalyst for the same reaction, while the supply of one of the reactants is fixed. This naturally gives rise to the logistic equation for the same reason as population growth: the reaction is self-reinforcing but constrained. The logistic function was independently rediscovered as a model of population growth in 1920 by Raymond Pearl and Lowell Reed, published as , which led to its use in modern statistics. They were initially unaware of Verhulst's work and presumably learned about it from L. Gustave du Pasquier, but they gave him little credit and did not adopt his terminology.
Verhulst's priority was acknowledged and the term "logistic" revived by Udny Yule in 1925 and has been followed since. Pearl and Reed first applied the model to the population of the United States, and also initially fitted the curve by making it pass through three points; as with Verhulst, this again yielded poor results. In the 1930s, the probit model was developed and systematized by Chester Ittner Bliss, who coined the term "probit" in , and by John Gaddum in , and the model fit by maximum likelihood estimation by Ronald A. Fisher in , as an addendum to Bliss's work.
The probit model was principally used in bioassay, and had been preceded by earlier work dating to 1860; see . The probit model influenced the subsequent development of the logit model and these models competed with each other. The logistic model was likely first used as an alternative to the probit model in bioassay by Edwin Bidwell Wilson and his student Jane Worcester in . However, the development of the logistic model as a general alternative to the probit model was principally due to the work of Joseph Berkson over many decades, beginning in , where he coined "logit", by analogy with "probit", and continuing through and following years. The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the probit", particularly between 1960 and 1970. By 1970, the logit model achieved parity with the probit model in use in statistics journals and thereafter surpassed it. This relative popularity was due to the adoption of the logit outside of bioassay, rather than displacing the probit within bioassay, and its informal use in practice; the logit's popularity is credited to the logit model's computational simplicity, mathematical properties, and generality, allowing its use in varied fields. Various refinements occurred during that time, notably by David Cox, as in . The multinomial logit model was introduced independently in and , which greatly increased the scope of application and the popularity of the logit model.
In 1973 Daniel McFadden linked the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit followed from the assumption of independence of irrelevant alternatives and interpreting odds of alternatives as relative preferences; this gave a theoretical foundation for the logistic regression.

Extensions

There are large numbers of extensions:
* Multinomial logistic regression (or multinomial logit) handles the case of a multi-way categorical dependent variable (with unordered values, also called "classification"). Note that the general case of having dependent variables with more than two values is termed ''polytomous regression''.
* Ordered logistic regression (or ordered logit) handles ordinal dependent variables (ordered values).
* Mixed logit is an extension of multinomial logit that allows for correlations among the choices of the dependent variable.
* An extension of the logistic model to sets of interdependent variables is the conditional random field.
* Conditional logistic regression handles matched or stratified data when the strata are small.
It is mostly used in the analysis of observational studies.

Software

Most statistical software can do binary logistic regression.
* SPSS, for basic logistic regression.
* Stata
* SAS
** PROC LOGISTIC, for basic logistic regression.
** a procedure for the case when all the variables are categorical.
** a procedure for multilevel model logistic regression.
* R
** glm in the stats package (using family = binomial)
** lrm in the rms package
** GLMNET package for an efficient implementation of regularized logistic regression
** lmer for mixed effects logistic regression
** Rfast package command gm_logistic for fast and heavy calculations involving large scale data.
** arm package for Bayesian logistic regression
* Python
** Logit in the Statsmodels module.
** LogisticRegression in the scikit-learn module.
** LogisticRegressor in the TensorFlow module.
** Full example of logistic regression in the Theano tutorial
** Bayesian Logistic Regression with ARD prior (code, tutorial)
** Variational Bayes Logistic Regression with ARD prior (code, tutorial)
** Bayesian Logistic Regression (code, tutorial)
* NCSS (statistical software)
** Logistic Regression in NCSS
* MATLAB
** mnrfit in the Statistics and Machine Learning Toolbox (with "incorrect" coded as 2 instead of 0)
** fminunc/fmincon, fitglm, mnrfit, fitclinear, mle can all do logistic regression.
* Java (JVM)
** LibLinear
** Apache Flink
** Apache Spark
*** SparkML supports Logistic Regression
* FPGA
** Logistic Regression IP core in HLS for FPGA.

Notably, Microsoft Excel's statistics extension package does not include it.

See also
* Logistic function
* Discrete choice
* Jarrow–Turnbull model
* Limited dependent variable
* Multinomial logit model
* Ordered logit
* Hosmer–Lemeshow test
* Brier score
* mlpack - contains a C++ implementation of logistic regression
* Local case-control sampling
* Logistic model tree

References

External links
* by Mark Thoma
* Logistic Regression tutorial software in C for teaching purposes