Akaike information criterion

The Akaike information criterion (AIC) is an estimator of prediction error and thereby of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

AIC is founded on information theory. When a statistical model is used to represent the process that generated the data, the representation will almost never be exact; so some information will be lost by using the model to represent the process. AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher the quality of that model.

In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.

The Akaike information criterion is named after the Japanese statistician Hirotugu Akaike, who formulated it. It now forms the basis of a paradigm for the foundations of statistics and is also widely used for statistical inference.


Definition

Suppose that we have a statistical model of some data. Let ''k'' be the number of estimated parameters in the model. Let \hat L be the maximized value of the likelihood function for the model. Then the AIC value of the model is the following.

:\mathrm{AIC} \, = \, 2k - 2\ln(\hat L)

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.

AIC is founded in information theory. Suppose that the data is generated by some unknown process ''f''. We consider two candidate models to represent ''f'': ''g''_1 and ''g''_2. If we knew ''f'', then we could find the information lost from using ''g''_1 to represent ''f'' by calculating the Kullback–Leibler divergence, ''D''_KL(''f'' ‖ ''g''_1); similarly, the information lost from using ''g''_2 to represent ''f'' could be found by calculating ''D''_KL(''f'' ‖ ''g''_2). We would then, generally, choose the candidate model that minimized the information loss.

We cannot choose with certainty, because we do not know ''f''. Akaike showed, however, that we can estimate, via AIC, how much more (or less) information is lost by ''g''_1 than by ''g''_2. The estimate, though, is only valid asymptotically; if the number of data points is small, then some correction is often necessary (see AICc, below).

Note that AIC tells nothing about the absolute quality of a model, only the quality relative to other models. Thus, if all the candidate models fit poorly, AIC will not give any warning of that. Hence, after selecting a model via AIC, it is usually good practice to validate the absolute quality of the model. Such validation commonly includes checks of the model's residuals (to determine whether the residuals appear random) and tests of the model's predictions. For more on this topic, see ''statistical model validation''.
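
As a minimal sketch of the definition, the following Python computes AIC = 2''k'' − 2 ln(\hat L) for an i.i.d. normal model fitted by maximum likelihood. The function names and the data values are illustrative assumptions, not part of the original text.

```python
import math

def gaussian_max_log_likelihood(data):
    """Maximized log-likelihood of an i.i.d. normal model (MLE mean and variance)."""
    n = len(data)
    mean = sum(data) / n
    # The maximum-likelihood variance divides by n, not n - 1.
    var = sum((x - mean) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def aic(max_log_likelihood, k):
    """AIC = 2k - 2 ln(L-hat), where k counts the estimated parameters."""
    return 2 * k - 2 * max_log_likelihood

data = [4.1, 5.3, 4.8, 6.0, 5.1, 4.7, 5.6, 5.0]  # made-up values for illustration
log_lik = gaussian_max_log_likelihood(data)
aic_value = aic(log_lik, k=2)  # two estimated parameters: mean and variance
print(aic_value)
```

The number on its own carries no meaning; it only becomes useful when compared with the AIC values of other candidate models fitted to the same data.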


How to use AIC in practice

To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using a candidate model to represent the "true model," i.e. the process that generated the data. We wish to select, from among the candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.

Suppose that there are ''R'' candidate models. Denote the AIC values of those models by AIC_1, AIC_2, AIC_3, ..., AIC_''R''. Let AIC_min be the minimum of those values. Then the quantity exp((AIC_min − AIC_''i'')/2) can be interpreted as being proportional to the probability that the ''i''th model minimizes the (estimated) information loss.

As an example, suppose that there are three candidate models, whose AIC values are 100, 102, and 110. Then the second model is exp((100 − 102)/2) = 0.368 times as probable as the first model to minimize the information loss. Similarly, the third model is exp((100 − 110)/2) = 0.007 times as probable as the first model to minimize the information loss.

In this example, we would omit the third model from further consideration. We then have three options: (1) gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) simply conclude that the data is insufficient to support selecting one model from among the first two; (3) take a weighted average of the first two models, with weights proportional to 1 and 0.368, respectively, and then do statistical inference based on the weighted multimodel.

The quantity exp((AIC_min − AIC_''i'')/2) is known as the ''relative likelihood'' of model ''i''. It is closely related to the likelihood ratio used in the likelihood-ratio test. Indeed, if all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the likelihood-ratio test. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models, whereas AIC (and AICc) has no such restriction.
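
The three-model example can be checked with a short Python sketch. The helper names are hypothetical; the normalization step that turns the two surviving relative likelihoods into averaging weights is one reasonable reading of option (3), not something the text prescribes.

```python
import math

def relative_likelihoods(aic_values):
    """exp((AIC_min - AIC_i) / 2) for each candidate model."""
    aic_min = min(aic_values)
    return [math.exp((aic_min - a) / 2) for a in aic_values]

def normalized_weights(rel):
    """Rescale relative likelihoods so that they sum to 1 (assumed convention)."""
    total = sum(rel)
    return [r / total for r in rel]

rel = relative_likelihoods([100, 102, 110])
print([round(r, 3) for r in rel])  # [1.0, 0.368, 0.007]

# Option (3): drop the third model and average the first two.
weights = normalized_weights(rel[:2])
print(weights)
```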


Hypothesis testing

Every statistical hypothesis test can be formulated as a comparison of statistical models. Hence, every statistical hypothesis test can be replicated via AIC. Two examples are briefly described in the subsections below. Details for those examples, and many more examples, can be found in the literature on information criteria.


Replicating Student's ''t''-test

As an example of a hypothesis test, consider the ''t''-test to compare the means of two normally-distributed populations. The input to the ''t''-test comprises a random sample from each of the two populations.

To formulate the test as a comparison of models, we construct two different models. The first model represents the two populations as having potentially different means and standard deviations. The likelihood function for the first model is thus the product of the likelihoods for two distinct normal distributions; so it has four parameters: ''μ''_1, ''σ''_1, ''μ''_2, ''σ''_2. To be explicit, the likelihood function is as follows (denoting the sample sizes by ''n''_1 and ''n''_2).

:\mathcal{L}(\mu_1,\sigma_1,\mu_2,\sigma_2) \, = \, \prod_{i=1}^{n_1} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right) \; \cdot \; \prod_{i=n_1+1}^{n_1+n_2} \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left( -\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right)

The second model represents the two populations as having the same means but potentially different standard deviations. The likelihood function for the second model thus sets ''μ''_1 = ''μ''_2 in the above equation; so it has three parameters.

We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if the second model was only 0.01 times as likely as the first model, then we would omit the second model from further consideration: so we would conclude that the two populations have different means.

The ''t''-test assumes that the two populations have identical standard deviations; the test tends to be unreliable if the assumption is false and the sizes of the two samples are very different (Welch's ''t''-test would be better). Comparing the means of the populations via AIC, as in the example above, has the advantage of not making such assumptions.
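
The two-model comparison can be sketched in Python as follows. The sample values are made up, the function names are hypothetical, and the grid search over the common mean is a deliberately crude stand-in for a proper numerical optimizer (the common-mean model has no closed-form maximum when the variances differ).

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def profile_loglik(xs, mu):
    """Normal log-likelihood at mean mu, with the variance set to its MLE given mu."""
    n = len(xs)
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def aic_different_means(a, b):
    """Model 1: separate means and standard deviations (k = 4)."""
    log_lik = profile_loglik(a, mean(a)) + profile_loglik(b, mean(b))
    return 2 * 4 - 2 * log_lik

def aic_common_mean(a, b, grid_points=2001):
    """Model 2: one shared mean, separate standard deviations (k = 3).

    The shared mean is found by a grid search over the pooled data range,
    so the maximized log-likelihood is only approximate.
    """
    lo, hi = min(a + b), max(a + b)
    grid = (lo + (hi - lo) * j / (grid_points - 1) for j in range(grid_points))
    best = max(profile_loglik(a, mu) + profile_loglik(b, mu) for mu in grid)
    return 2 * 3 - 2 * best

sample_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5]  # illustrative data
sample_b = [6.9, 7.4, 6.8, 7.7, 7.1, 7.3]  # illustrative data

aic_1 = aic_different_means(sample_a, sample_b)
aic_2 = aic_common_mean(sample_a, sample_b)
# Relative likelihood of the common-mean model.
relative_likelihood = math.exp((min(aic_1, aic_2) - aic_2) / 2)
print(aic_1, aic_2, relative_likelihood)
```

With these made-up samples the common-mean model has a relative likelihood far below 0.01, so it would be dropped from consideration.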


Comparing categorical data sets

For another example of a hypothesis test, suppose that we have two populations, and each member of each population is in one of two categories: category #1 or category #2. Each population is binomially distributed. We want to know whether the distributions of the two populations are the same. We are given a random sample from each of the two populations.

Let ''m'' be the size of the sample from the first population. Let ''m''_1 be the number of observations (in the sample) in category #1; so the number of observations in category #2 is ''m'' − ''m''_1. Similarly, let ''n'' be the size of the sample from the second population. Let ''n''_1 be the number of observations (in the sample) in category #1.

Let ''p'' be the probability that a randomly-chosen member of the first population is in category #1. Hence, the probability that a randomly-chosen member of the first population is in category #2 is 1 − ''p''. Note that the distribution of the first population has one parameter. Let ''q'' be the probability that a randomly-chosen member of the second population is in category #1. Note that the distribution of the second population also has one parameter.

To compare the distributions of the two populations, we construct two different models. The first model represents the two populations as having potentially different distributions. The likelihood function for the first model is thus the product of the likelihoods for two distinct binomial distributions; so it has two parameters: ''p'', ''q''. To be explicit, the likelihood function is as follows.

:\mathcal{L}(p,q) \, = \, \frac{m!}{m_1!\,(m-m_1)!}\, p^{m_1} (1-p)^{m-m_1} \; \cdot \; \frac{n!}{n_1!\,(n-n_1)!}\, q^{n_1} (1-q)^{n-n_1}

The second model represents the two populations as having the same distribution. The likelihood function for the second model thus sets ''p'' = ''q'' in the above equation; so the second model has one parameter.

We then maximize the likelihood functions for the two models (in practice, we maximize the log-likelihood functions); after that, it is easy to calculate the AIC values of the models. We next calculate the relative likelihood. For instance, if the second model was only 0.01 times as likely as the first model, then we would omit the second model from further consideration: so we would conclude that the two populations have different distributions.
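
A minimal Python sketch of the binomial comparison follows. The counts are invented for illustration, and the variable names (`m`, `m1`, `n`, `n1`) are assumptions chosen to denote the sample sizes and category #1 counts. Both models have closed-form maximum-likelihood estimates: the sample proportions for the first model and the pooled proportion for the second.

```python
import math

def binom_loglik(successes, trials, prob):
    """Log of the binomial probability mass; requires 0 < prob < 1."""
    log_coeff = math.log(math.comb(trials, successes))
    return (log_coeff
            + successes * math.log(prob)
            + (trials - successes) * math.log(1 - prob))

m, m1 = 50, 35  # first sample: size and count in category #1 (illustrative)
n, n1 = 60, 21  # second sample: size and count in category #1 (illustrative)

# Model 1: separate probabilities p and q (k = 2); the MLEs are the sample proportions.
p_hat, q_hat = m1 / m, n1 / n
aic_1 = 2 * 2 - 2 * (binom_loglik(m1, m, p_hat) + binom_loglik(n1, n, q_hat))

# Model 2: a shared probability p = q (k = 1); the MLE is the pooled proportion.
pooled = (m1 + n1) / (m + n)
aic_2 = 2 * 1 - 2 * (binom_loglik(m1, m, pooled) + binom_loglik(n1, n, pooled))

# Relative likelihood of the shared-probability model.
relative_likelihood = math.exp((min(aic_1, aic_2) - aic_2) / 2)
print(aic_1, aic_2, relative_likelihood)
```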


Foundations of statistics

Statistical inference is generally regarded as comprising hypothesis testing and estimation. Hypothesis testing can be done via AIC, as discussed above. Regarding estimation, there are two types: point estimation and interval estimation. Point estimation can be done within the AIC paradigm: it is provided by maximum likelihood estimation. Interval estimation can also be done within the AIC paradigm: it is provided by likelihood intervals. Hence, statistical inference generally can be done within the AIC paradigm.

The most commonly used paradigms for statistical inference are frequentist inference and Bayesian inference. AIC, though, can be used to do statistical inference without relying on either the frequentist paradigm or the Bayesian paradigm, because AIC can be interpreted without the aid of significance levels or Bayesian priors. In other words, AIC can be used to form a foundation of statistics that is distinct from both frequentism and Bayesianism.


Modification for small sample size

When the sample size is small, there is a substantial probability that AIC will select models that have too many parameters, i.e. that AIC will overfit. To address such potential overfitting, AICc was developed: AICc is AIC with a correction for small sample sizes.

The formula for AICc depends upon the statistical model. Assuming that the model is univariate, is linear in its parameters, and has normally-distributed residuals (conditional upon regressors), then the formula for AICc is as follows.

:\mathrm{AICc} \, = \, \mathrm{AIC} + \frac{2k^2 + 2k}{n - k - 1}

Here ''n'' denotes the sample size and ''k'' denotes the number of parameters. Thus, AICc is essentially AIC with an extra penalty term for the number of parameters. Note that as ''n'' → ∞, the extra penalty term converges to 0, and thus AICc converges to AIC.

If the assumption that the model is univariate and linear with normal residuals does not hold, then the formula for AICc will generally be different from the formula above. For some models, the formula can be difficult to determine. For every model that has AICc available, though, the formula for AICc is given by AIC plus terms that include both ''k'' and ''k''². In comparison, the formula for AIC includes ''k'' but not ''k''². In other words, AIC is a first-order estimate (of the information loss), whereas AICc is a second-order estimate.

Further discussion of the formula, with examples of other assumptions, can be found in the literature. In particular, with other assumptions, bootstrap estimation of the formula is often feasible.

To summarize, AICc has the advantage of tending to be more accurate than AIC (especially for small samples), but AICc also has the disadvantage of sometimes being much more difficult to compute than AIC. Note that if all the candidate models have the same ''k'' and the same formula for AICc, then AICc and AIC will give identical (relative) valuations; hence, there will be no disadvantage in using AIC instead of AICc. Furthermore, if ''n'' is many times larger than ''k''², then the extra penalty term will be negligible; hence, the disadvantage in using AIC instead of AICc will be negligible.
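
The small-sample correction for the univariate linear model with normal residuals can be sketched in Python as below. The function name `aicc` and the guard on the denominator are illustrative choices; the formula is only appropriate under the assumptions stated above.

```python
def aicc(aic_value, n, k):
    """AICc = AIC + (2k^2 + 2k) / (n - k - 1), for sample size n and k parameters."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic_value + (2 * k * k + 2 * k) / (n - k - 1)

# With a small sample the extra penalty is substantial ...
print(aicc(100.0, n=10, k=3))    # 104.0
# ... and it fades as n grows relative to k^2.
print(aicc(100.0, n=1000, k=3))
```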


History

The Akaike information criterion was formulated by the statistician Hirotugu Akaike. It was originally named "an information criterion". It was first announced in English by Akaike at a 1971 symposium; the proceedings of the symposium were published in 1973. The 1973 publication, though, was only an informal presentation of the concepts. The first formal publication was a 1974 paper by Akaike. The 1974 paper went on to receive more than 14,000 citations in the Web of Science, making it the 73rd most-cited research paper of all time as of that count.

Nowadays, AIC has become common enough that it is often used without citing Akaike's 1974 paper. Indeed, there are over 150,000 scholarly articles/books that use AIC (as assessed by Google Scholar).

The initial derivation of AIC relied upon some strong assumptions. Takeuchi showed that the assumptions could be made much weaker. Takeuchi's work, however, was in Japanese and was not widely known outside Japan for many years.

AICc was originally proposed for linear regression (only). That proposal instigated later work, including several further papers by the same authors, which extended the situations in which AICc could be applied.

The first general exposition of the information-theoretic approach was published as a book-length volume. It includes an English presentation of the work of Takeuchi. The volume led to far greater use of AIC, and it now has more than 48,000 citations on Google Scholar.

Akaike called his approach an "entropy maximization principle", because the approach is founded on the concept of entropy in information theory. Indeed, minimizing AIC in a statistical model is effectively equivalent to maximizing entropy in a thermodynamic system; in other words, the information-theoretic approach in statistics is essentially applying the Second Law of Thermodynamics. As such, AIC has roots in the work of Ludwig Boltzmann on entropy. These issues are discussed further in the literature.


Usage tips


Counting parameters

A statistical model must account for random errors. A straight line model might be formally described as ''y_i'' = ''b''_0 + ''b''_1 ''x_i'' + ''ε_i''. Here, the ''ε_i'' are the residuals from the straight line fit. If the ''ε_i'' are assumed to be i.i.d. Gaussian (with zero mean), then the model has three parameters: ''b''_0, ''b''_1, and the variance of the Gaussian distributions. Thus, when calculating the AIC value of this model, we should use ''k'' = 3. More generally, for any least squares model with i.i.d. Gaussian residuals, the variance of the residuals' distributions should be counted as one of the parameters.

As another example, consider a first-order autoregressive model, defined by ''x_i'' = ''c'' + ''φ'' ''x''_(''i''−1) + ''ε_i'', with the ''ε_i'' being i.i.d. Gaussian (with zero mean). For this model, there are three parameters: ''c'', ''φ'', and the variance of the ''ε_i''. More generally, a ''p''th-order autoregressive model has ''p'' + 2 parameters. (If, however, ''c'' is not estimated from the data, but instead given in advance, then there are only ''p'' + 1 parameters.)
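
To make the parameter count concrete, here is a hedged Python sketch that fits a straight line by ordinary least squares and computes its AIC with ''k'' = 3 (intercept, slope, and residual variance). The data and function names are illustrative assumptions.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def line_aic(xs, ys):
    """AIC of the straight-line model with i.i.d. Gaussian residuals."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / n  # maximum-likelihood estimate of the residual variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = 3  # b0, b1, and the residual variance
    return 2 * k - 2 * log_lik

xs = [0, 1, 2, 3, 4, 5]                  # illustrative data
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.2]     # illustrative data
line_value = line_aic(xs, ys)
print(line_value)
```

Using ''k'' = 2 here would understate the penalty by 2 for every least-squares candidate, which matters as soon as such models are compared with models from a different family.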


Transforming data

The AIC values of the candidate models must all be computed with the same data set. Sometimes, though, we might want to compare a model of the response variable, ''y'', with a model of the logarithm of the response variable, log(''y''). More generally, we might want to compare a model of the data with a model of transformed data. Following is an illustration of how to deal with data transforms (adapted from the literature, which cautions that "Investigators should be sure that all hypotheses are modeled using the same response variable").

Suppose that we want to compare two models: one with a normal distribution of ''y'' and one with a normal distribution of log(''y''). We should ''not'' directly compare the AIC values of the two models. Instead, we should transform the normal cumulative distribution function to first take the logarithm of ''y''. To do that, we need to perform the relevant integration by substitution: thus, we need to multiply by the derivative of the (natural) logarithm function, which is 1/''y''. Hence, the transformed distribution has the following probability density function:

:y \mapsto \, \frac{1}{y} \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp \left(-\frac{\left(\ln y-\mu\right)^2}{2\sigma^2}\right)

This is the probability density function for the log-normal distribution. We then compare the AIC value of the normal model against the AIC value of the log-normal model.
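
The following Python sketch illustrates the comparison on the original scale of ''y''. The data are made up and the function names are hypothetical. The key step is the extra term −Σ ln(''y_i''), which comes from the 1/''y'' factor in the log-normal density; omitting it would amount to comparing models of two different response variables.

```python
import math

def normal_max_loglik(values):
    """Maximized log-likelihood of an i.i.d. normal model for the given values."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def aic_normal(y):
    """AIC of a normal model of y (k = 2: mean and variance)."""
    return 2 * 2 - 2 * normal_max_loglik(y)

def aic_lognormal(y):
    """AIC of a log-normal model of y (k = 2), expressed as a density of y itself."""
    logs = [math.log(v) for v in y]
    # Normal log-likelihood of log(y) plus the Jacobian term sum(ln(1 / y_i)).
    log_lik = normal_max_loglik(logs) - sum(logs)
    return 2 * 2 - 2 * log_lik

y = [1.2, 0.8, 2.5, 1.9, 7.4, 0.6, 3.3, 12.1, 1.1, 4.0]  # illustrative, right-skewed
print(aic_normal(y), aic_lognormal(y))
```

For these right-skewed values the log-normal model has the lower AIC; with roughly symmetric data the ordering could easily reverse.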


Comparisons with other model selection methods

The critical difference between AIC and BIC (and their variants) is the asymptotic property under well-specified and misspecified model classes. Their fundamental differences have been well-studied in regression variable selection and autoregression order selection problems. In general, if the goal is prediction, AIC and leave-one-out cross-validation are preferred. If the goal is selection, inference, or interpretation, BIC or leave-many-out cross-validation are preferred. A comprehensive overview of AIC and other popular model selection methods is given by Ding et al.


Comparison with BIC

The formula for the
Bayesian information criterion In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, o ...
(BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters. With AIC the penalty is , whereas with BIC the penalty is . A comparison of AIC/AICc and BIC is given by , with follow-up remarks by . The authors show that AIC/AICc can be derived in the same Bayesian framework as BIC, just by using different prior probabilities. In the Bayesian derivation of BIC, though, each candidate model has a prior probability of 1/''R'' (where ''R'' is the number of candidate models). Additionally, the authors present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC. A point made by several researchers is that AIC and BIC are appropriate for different tasks. In particular, BIC is argued to be appropriate for selecting the "true model" (i.e. the process that generated the data) from the set of candidate models, whereas AIC is not appropriate. To be specific, if the "true model" is in the set of candidates, then BIC will select the "true model" with probability 1, as ; in contrast, when selection is done via AIC, the probability can be less than 1. Proponents of AIC argue that this issue is negligible, because the "true model" is virtually never in the candidate set. Indeed, it is a common aphorism in statistics that "
all models are wrong"; hence the "true model" (i.e. reality) cannot be in the candidate set.

Another comparison of AIC and BIC is given by Vrieze (2012). Vrieze presents a simulation study that allows the "true model" to be in the candidate set (unlike with virtually all real data). The simulation study demonstrates, in particular, that AIC sometimes selects a much better model than BIC even when the "true model" is in the candidate set. The reason is that, for finite ''n'', BIC can have a substantial risk of selecting a very bad model from the candidate set. This can happen even when ''n'' is much larger than ''k''². With AIC, the risk of selecting a very bad model is minimized.

If the "true model" is not in the candidate set, then the most that we can hope to do is select the model that best approximates the "true model". AIC is appropriate for finding the best approximating model, under certain assumptions. (Those assumptions include, in particular, that the approximating is done with regard to information loss.)

Comparison of AIC and BIC in the context of regression is given by Yang (2005). In regression, AIC is asymptotically optimal for selecting the model with the least
mean squared error
, under the assumption that the "true model" is not in the candidate set. BIC is not asymptotically optimal under that assumption. Yang additionally shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible.
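The difference in penalties can be made concrete with a small sketch. It assumes the standard definitions AIC = 2''k'' − 2 ln(L̂) and BIC = ln(''n'') ''k'' − 2 ln(L̂); the two candidate models, their log-likelihoods, and the sample size are made-up numbers chosen only for illustration.

```python
import math


def aic(log_likelihood, k):
    """AIC = 2k - 2 ln(L_hat), with k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """BIC = ln(n) k - 2 ln(L_hat), with n observations."""
    return math.log(n) * k - 2 * log_likelihood


# Hypothetical fits to the same n = 200 observations (illustrative values only).
n = 200
candidates = {
    "small": {"k": 2, "log_likelihood": -120.0},
    "large": {"k": 5, "log_likelihood": -115.5},
}

aic_scores = {name: aic(m["log_likelihood"], m["k"]) for name, m in candidates.items()}
bic_scores = {name: bic(m["log_likelihood"], m["k"], n) for name, m in candidates.items()}

# Lower is better for both criteria.
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)

print("AIC:", aic_scores, "-> prefers", best_by_aic)
print("BIC:", bic_scores, "-> prefers", best_by_bic)
```

With these assumed numbers the two criteria disagree: the extra parameters of the larger model are worth their AIC penalty but not their BIC penalty. Since ln(''n'') exceeds 2 once ''n'' is at least 8, BIC penalizes each additional parameter more heavily than AIC for all but very small samples.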


Comparison with cross-validation

Leave-one-out cross-validation is asymptotically equivalent to AIC, for ordinary linear regression models. Asymptotic equivalence to AIC also holds for mixed-effects models.
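As a rough illustration (not a demonstration of the asymptotic result), the sketch below computes brute-force leave-one-out cross-validation error and a Gaussian AIC for two simple regression models and checks that both criteria rank them the same way. The data set, the helper names, and the parameter counts (which include the residual variance) are assumptions made for this example; the AIC is computed only up to an additive constant shared by all models.

```python
import math


def fit_constant(xs, ys):
    """Fit y = mean; returns a prediction function."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Fit y = a + b x by ordinary least squares; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def loocv_mse(fit, xs, ys):
    """Mean squared prediction error, refitting with each point held out in turn."""
    total = 0.0
    for i in range(len(xs)):
        predict = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - predict(xs[i])) ** 2
    return total / len(xs)


def gaussian_aic(fit, k, xs, ys):
    """2k + n ln(RSS/n): AIC under Gaussian errors, up to a model-independent constant."""
    n = len(xs)
    predict = fit(xs, ys)
    rss = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    return 2 * k + n * math.log(rss / n)


# Hypothetical data: roughly y = 1 + 2x with small fixed perturbations.
xs = [float(i) for i in range(12)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.3]
ys = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]

models = {"constant": (fit_constant, 2), "line": (fit_line, 3)}
cv_scores = {name: loocv_mse(fit, xs, ys) for name, (fit, k) in models.items()}
aic_scores = {name: gaussian_aic(fit, k, xs, ys) for name, (fit, k) in models.items()}

best_by_cv = min(cv_scores, key=cv_scores.get)
best_by_aic = min(aic_scores, key=aic_scores.get)
print("LOOCV prefers", best_by_cv, "| AIC prefers", best_by_aic)
```

On a clear-cut data set like this one the agreement is unsurprising; the equivalence stated above concerns the behavior of the two criteria as the sample size grows.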


Comparison with least squares

Sometimes, each candidate model assumes that the residuals are distributed according to independent identical normal distributions (with zero mean). That gives rise to least squares model fitting.

With least squares fitting, the maximum likelihood estimate for the variance of a model's residuals distributions is the reduced chi-squared statistic, \hat\sigma^2 = \mathrm{RSS}/n, where \mathrm{RSS} is the residual sum of squares:

: \textstyle \mathrm{RSS} = \sum_{i=1}^n (y_i - f(x_i; \hat\theta))^2.

Then, the maximum value of a model's log-likelihood function is

: -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\hat\sigma^2) - \frac{1}{2\hat\sigma^2}\mathrm{RSS} \, = \, -\frac{n}{2}\ln(\mathrm{RSS}/n) + C,

where ''C'' is a constant independent of the model, and dependent only on the particular data points, i.e. it does not change if the data does not change.

That gives:

: \mathrm{AIC} = 2k + n\ln(\mathrm{RSS}/n) - 2C = 2k + n\ln(\mathrm{RSS}) - (n\ln(n) + 2C).

Because only differences in AIC are meaningful, the constant (n\ln(n) + 2C) can be ignored, which allows us to conveniently take the following for model comparisons:

: \Delta\mathrm{AIC} = 2k + n\ln(\mathrm{RSS}).

Note that if all the models have the same ''k'', then selecting the model with minimum AIC is equivalent to selecting the model with minimum RSS, which is the usual objective of model selection based on least squares.
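The comparison quantity 2''k'' + ''n'' ln(RSS) can be computed directly from least-squares fits. The following is a minimal sketch under stated assumptions: the data set is made up, the three candidate models and helper names are hypothetical, and ''k'' counts the regression coefficients plus the residual variance. It requires RSS > 0 for every model.

```python
import math


def rss(predict, xs, ys):
    """Residual sum of squares of a fitted prediction function."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


def fit_constant(xs, ys):
    """y = c (1 coefficient)."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_through_origin(xs, ys):
    """y = b x (1 coefficient)."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: slope * x


def fit_line(xs, ys):
    """y = a + b x (2 coefficients)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: (my - slope * mx) + slope * x


def delta_aic(k, n, rss_value):
    """2k + n ln(RSS); comparable only across models fitted to the same data."""
    return 2 * k + n * math.log(rss_value)


# Hypothetical data: roughly y = 2x with small fixed perturbations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2]
ys = [2.0 * x + e for x, e in zip(xs, noise)]
n = len(xs)

# k = number of coefficients + 1 for the residual variance.
models = {
    "constant": (fit_constant, 2),
    "through_origin": (fit_through_origin, 2),
    "line": (fit_line, 3),
}

rss_values = {name: rss(fit(xs, ys), xs, ys) for name, (fit, k) in models.items()}
scores = {name: delta_aic(k, n, rss_values[name]) for name, (fit, k) in models.items()}

for name in models:
    print(f"{name}: RSS = {rss_values[name]:.4f}, 2k + n ln(RSS) = {scores[name]:.3f}")
```

The constant and through-origin models have the same ''k'', so their ordering by this quantity is the same as their ordering by RSS, as noted above. The line with an intercept always has RSS no larger than the through-origin model (the models are nested), so the 2''k'' term decides whether the extra coefficient is worth keeping.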


Comparison with Mallows's ''Cp''

Mallows's ''Cp'' is equivalent to AIC in the case of (Gaussian) linear regression.


See also


* Bridge Criterion
* Deviance information criterion
* Focused information criterion
* Hannan–Quinn information criterion
* Maximum likelihood estimation
* Principle of maximum entropy


References

''Note:'' the AIC defined by Claeskens & Hjort is the negative of the standard definition, as originally given by Akaike and followed by other authors.


Further reading

* Hirotugu Akaike comments on how he arrived at AIC.