The Bayes factor is the ratio of the marginal likelihoods of two competing statistical models, and is used to quantify the support for one model over the other. The models in question can have a common set of parameters, such as a null hypothesis and an alternative, but this is not necessary; for instance, it could also be a non-linear model compared to its linear approximation. The Bayes factor can be thought of as a Bayesian analog to the likelihood-ratio test, but since it uses the integrated (marginal) likelihood instead of the maximized likelihood, the two tests coincide only under simple hypotheses (e.g., two specific parameter values). Also, in contrast with null hypothesis significance testing, Bayes factors support evaluation of evidence ''in favor'' of a null hypothesis, rather than only allowing the null to be rejected or not rejected.
Although conceptually simple, the computation of the Bayes factor can be challenging depending on the complexity of the model and the hypotheses. Since closed-form expressions of the marginal likelihood are generally not available, numerical approximations based on MCMC samples have been suggested. For certain special cases, simplified algebraic expressions can be derived; for instance, the Savage–Dickey density ratio applies in the case of a precise (equality constrained) hypothesis against an unrestricted alternative. Another approximation, derived by applying Laplace's method to the integrated likelihoods, is known as the Bayesian information criterion (BIC); in large data sets the Bayes factor will approach the BIC as the influence of the priors wanes. In small data sets, priors generally matter and must not be improper, since the Bayes factor will be undefined if either of the two integrals in its ratio is not finite.
Definition
The Bayes factor is the ratio of two marginal likelihoods; that is, the likelihoods of two statistical models integrated over the prior probabilities of their parameters.
The posterior probability <math>\Pr(M \mid D)</math> of a model ''M'' given data ''D'' is given by Bayes' theorem:
:<math>\Pr(M \mid D) = \frac{\Pr(D \mid M)\,\Pr(M)}{\Pr(D)}.</math>
The key data-dependent term <math>\Pr(D \mid M)</math> represents the probability that some data are produced under the assumption of the model ''M''; evaluating it correctly is the key to Bayesian model comparison.
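Given a model selection problem in which one wishes to choose between two models on the basis of observed data ''D'', the plausibility of the two different models ''M''<sub>1</sub> and ''M''<sub>2</sub>, parametrised by model parameter vectors <math>\theta_1</math> and <math>\theta_2</math>, is assessed by the Bayes factor ''K'' given by
:<math>K = \frac{\Pr(D \mid M_1)}{\Pr(D \mid M_2)} = \frac{\int \Pr(\theta_1 \mid M_1)\,\Pr(D \mid \theta_1, M_1)\,d\theta_1}{\int \Pr(\theta_2 \mid M_2)\,\Pr(D \mid \theta_2, M_2)\,d\theta_2} = \frac{\Pr(M_1 \mid D)}{\Pr(M_2 \mid D)} \cdot \frac{\Pr(M_2)}{\Pr(M_1)}.</math>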
When the two models have equal prior probability, so that <math>\Pr(M_1) = \Pr(M_2)</math>, the Bayes factor is equal to the ratio of the posterior probabilities of ''M''<sub>1</sub> and ''M''<sub>2</sub>. If, instead of the Bayes factor integral, the likelihood corresponding to the maximum likelihood estimate of the parameter for each statistical model is used, then the test becomes a classical likelihood-ratio test. Unlike a likelihood-ratio test, this Bayesian model comparison does not depend on any single set of parameters, as it integrates over all parameters in each model (with respect to the respective priors). An advantage of the use of Bayes factors is that it automatically, and quite naturally, includes a penalty for including too much model structure. It thus guards against overfitting. For models where an explicit version of the likelihood is not available or too costly to evaluate numerically, approximate Bayesian computation can be used for model selection in a Bayesian framework, with the caveat that approximate-Bayesian estimates of Bayes factors are often biased.
Other approaches are:
* to treat model comparison as a decision problem, computing the expected value or cost of each model choice;
* to use minimum message length (MML);
* to use minimum description length (MDL).
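The link between the Bayes factor and the posterior model probabilities described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name <code>posterior_prob_m1</code> is hypothetical, and it assumes that exactly two models are under consideration, so that their prior probabilities sum to one.

```python
def posterior_prob_m1(bayes_factor: float, prior_m1: float = 0.5) -> float:
    """Posterior probability of M1, given K = Pr(D|M1) / Pr(D|M2).

    Assumes only two candidate models, so Pr(M2) = 1 - Pr(M1).
    """
    if bayes_factor < 0.0:
        raise ValueError("a Bayes factor cannot be negative")
    if not 0.0 < prior_m1 < 1.0:
        raise ValueError("prior_m1 must lie strictly between 0 and 1")
    prior_odds = prior_m1 / (1.0 - prior_m1)
    # Posterior odds = Bayes factor * prior odds.
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# With equal priors the posterior odds equal K itself.
print(posterior_prob_m1(3.0))  # 0.75
```

With unequal priors the same Bayes factor leads to a different posterior probability, which is why the Bayes factor is reported separately from the prior odds.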
Interpretation
A value of ''K'' > 1 means that ''M''<sub>1</sub> is more strongly supported by the data under consideration than ''M''<sub>2</sub>. Note that classical hypothesis testing gives one hypothesis (or model) preferred status (the 'null hypothesis'), and only considers evidence ''against'' it.
Harold Jeffreys gave a scale for interpretation of ''K'':
{| style="text-align: center; margin-left: auto; margin-right: auto; border: none;"
! ''K'' !! dHart !! bits !! Strength of evidence
|-
| < 10<sup>0</sup> || < 0 || < 0 || Negative (supports ''M''<sub>2</sub>)
|-
| 10<sup>0</sup> to 10<sup>1/2</sup> || 0 to 5 || 0 to 1.6 || Barely worth mentioning
|-
| 10<sup>1/2</sup> to 10<sup>1</sup> || 5 to 10 || 1.6 to 3.3 || Substantial
|-
| 10<sup>1</sup> to 10<sup>3/2</sup> || 10 to 15 || 3.3 to 5.0 || Strong
|-
| 10<sup>3/2</sup> to 10<sup>2</sup> || 15 to 20 || 5.0 to 6.6 || Very strong
|-
| > 10<sup>2</sup> || > 20 || > 6.6 || Decisive
|}
The second column gives the corresponding weights of evidence in decihartleys (also known as decibans); bits are added in the third column for clarity. According to I. J. Good, a change in a weight of evidence of 1 deciban or 1/3 of a bit (i.e. a change in an odds ratio from evens to about 5:4) is about as fine a difference as humans can reasonably perceive in their degree of belief in a hypothesis in everyday use.
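The unit conversions behind Jeffreys' table can be sketched as follows. The function names are hypothetical, and the treatment of boundary values is an assumption: the table does not say which category a value such as ''K'' = 10 belongs to, so this sketch assigns each boundary to the stronger category.

```python
import math

# Upper bounds on log10(K) for each category, taken from Jeffreys' table.
JEFFREYS_SCALE = [
    (0.5, "Barely worth mentioning"),
    (1.0, "Substantial"),
    (1.5, "Strong"),
    (2.0, "Very strong"),
]


def weight_of_evidence(k: float) -> tuple[float, float]:
    """Return the weight of evidence for K as (decihartleys, bits)."""
    if k <= 0.0:
        raise ValueError("K must be positive")
    return 10.0 * math.log10(k), math.log2(k)


def jeffreys_label(k: float) -> str:
    """Map K to a category of Jeffreys' scale (boundary handling is assumed)."""
    log10_k = math.log10(k)
    if log10_k < 0.0:
        return "Negative (supports M2)"
    for upper_bound, label in JEFFREYS_SCALE:
        if log10_k < upper_bound:
            return label
    return "Decisive"


# Good's threshold: odds of about 5:4 are roughly 1 deciban, or 1/3 of a bit.
dhart, bits = weight_of_evidence(5 / 4)
print(round(dhart, 2), round(bits, 2), jeffreys_label(5 / 4))
```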
An alternative table, widely cited, is provided by Kass and Raftery (1995):
{| style="text-align: center; margin-left: auto; margin-right: auto; border: none;"
! log<sub>10</sub> ''K'' !! ''K'' !! Strength of evidence
|-
| 0 to 1/2 || 1 to 3.2 || Not worth more than a bare mention
|-
| 1/2 to 1 || 3.2 to 10 || Substantial
|-
| 1 to 2 || 10 to 100 || Strong
|-
| > 2 || > 100 || Decisive
|}
Example
Suppose we have a random variable that produces either a success or a failure. We want to compare a model ''M''<sub>1</sub> where the probability of success is ''q'' = 1/2, and another model ''M''<sub>2</sub> where ''q'' is unknown and we take a prior distribution for ''q'' that is uniform on [0, 1]. We take a sample of 200, and find 115 successes and 85 failures. The likelihood can be calculated according to the binomial distribution:
:<math>\binom{200}{115} q^{115} (1-q)^{85}.</math>
Thus we have for ''M''<sub>1</sub>
:<math>\Pr(X = 115 \mid M_1) = \binom{200}{115} \left(\tfrac{1}{2}\right)^{200} \approx 0.006,</math>
whereas for ''M''<sub>2</sub> we have
:<math>\Pr(X = 115 \mid M_2) = \int_0^1 \binom{200}{115} q^{115} (1-q)^{85} \, dq = \frac{1}{201} \approx 0.005.</math>
The ratio is then 1.2, which is "barely worth mentioning" even if it points very slightly towards ''M''<sub>1</sub>.
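The two marginal likelihoods and their ratio can be checked numerically. The following is a sketch using only the Python standard library; the variable and function names are illustrative. It uses the closed form 1/(''n'' + 1) for the uniform-prior integral and cross-checks it with a simple midpoint rule, which is one of several possible quadrature choices.

```python
from fractions import Fraction
from math import comb

n, successes = 200, 115

# M1: q is fixed at 1/2, so the likelihood is C(n, k) / 2^n (kept exact here).
like_m1 = Fraction(comb(n, successes), 2**n)

# M2: uniform prior on q. The integral of C(n, k) q^k (1 - q)^(n - k) over
# [0, 1] equals C(n, k) * B(k + 1, n - k + 1) = 1 / (n + 1).
like_m2 = Fraction(1, n + 1)


def marginal_uniform_numeric(n: int, k: int, steps: int = 20_000) -> float:
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    width = 1.0 / steps
    coefficient = comb(n, k)
    total = 0.0
    for i in range(steps):
        q = (i + 0.5) * width
        total += coefficient * q**k * (1.0 - q) ** (n - k)
    return total * width


bayes_factor = float(like_m1 / like_m2)
print(f"Pr(X=115 | M1) = {float(like_m1):.6f}")
print(f"Pr(X=115 | M2) = {float(like_m2):.6f}")
print(f"numeric check  = {marginal_uniform_numeric(n, successes):.6f}")
print(f"K              = {bayes_factor:.3f}")
```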
A frequentist hypothesis test of ''M''<sub>1</sub> (here considered as a null hypothesis) would have produced a very different result. Such a test says that ''M''<sub>1</sub> should be rejected at the 5% significance level, since the probability of getting 115 or more successes from a sample of 200 if ''q'' = 1/2 is 0.02, and the two-tailed probability of getting a figure as extreme as or more extreme than 115 is 0.04. Note that 115 is more than two standard deviations away from 100. Thus, whereas a frequentist hypothesis test would yield significant results at the 5% significance level, the Bayes factor hardly considers this to be an extreme result. Note, however, that a non-uniform prior (for example one that reflects the fact that you expect the numbers of successes and failures to be of the same order of magnitude) could result in a Bayes factor that is more in agreement with the frequentist hypothesis test.
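The tail probabilities quoted for the frequentist test can be reproduced with an exact binomial sum. This is a sketch under the assumption that "as extreme as or more extreme than 115" means a count at least 15 away from the expected value of 100 in either direction; the names are illustrative.

```python
from math import comb, sqrt

n, observed = 200, 115
expected = n // 2  # mean number of successes under q = 1/2


def pmf_half(k: int, n: int) -> float:
    """Binomial probability of k successes in n trials when q = 1/2."""
    return comb(n, k) / 2**n


# One-tailed: probability of 115 or more successes.
p_one_tailed = sum(pmf_half(k, n) for k in range(observed, n + 1))

# Two-tailed: counts at least as far from 100 as the observed 115.
deviation = observed - expected
p_two_tailed = sum(
    pmf_half(k, n) for k in range(n + 1) if abs(k - expected) >= deviation
)

# Distance from the mean in standard deviations, sqrt(n * q * (1 - q)).
z_score = deviation / sqrt(n * 0.25)

print(f"one-tailed p = {p_one_tailed:.3f}")
print(f"two-tailed p = {p_two_tailed:.3f}")
print(f"z            = {z_score:.2f}")
```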
A classical likelihood-ratio test would have found the maximum likelihood estimate for ''q'', namely <math>\hat{q} = \tfrac{115}{200} = 0.575</math>, whence
:<math>\Pr(X = 115 \mid M_2) = \binom{200}{115} \hat{q}^{115} (1-\hat{q})^{85} \approx 0.06</math>
(rather than averaging over all possible ''q''). That gives a likelihood ratio of 0.1 and points towards ''M''<sub>2</sub>.
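For comparison with the Bayes factor, the classical likelihood ratio can be computed by plugging the maximum likelihood estimate into the binomial likelihood. This is a minimal sketch with illustrative names.

```python
from math import comb

n, successes = 200, 115


def binomial_likelihood(q: float, n: int, k: int) -> float:
    """Probability of exactly k successes in n trials with success rate q."""
    return comb(n, k) * q**k * (1.0 - q) ** (n - k)


q_hat = successes / n  # maximum likelihood estimate of q under M2
like_m1 = binomial_likelihood(0.5, n, successes)
like_m2_max = binomial_likelihood(q_hat, n, successes)
likelihood_ratio = like_m1 / like_m2_max

print(f"q_hat            = {q_hat}")
print(f"max likelihood   = {like_m2_max:.4f}")
print(f"likelihood ratio = {likelihood_ratio:.3f}")
```

The maximized likelihood is always at least as large as the prior-averaged one, which is why this ratio favours ''M''<sub>2</sub> more strongly than the Bayes factor does.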
''M''<sub>2</sub> is a more complex model than ''M''<sub>1</sub> because it has a free parameter which allows it to model the data more closely. The ability of Bayes factors to take this into account is a reason why Bayesian inference has been put forward as a theoretical justification for and generalisation of Occam's razor, reducing Type I errors (see "Sharpening Ockham's Razor On a Bayesian Strop").
On the other hand, the modern method of relative likelihood takes into account the number of free parameters in the models, unlike the classical likelihood ratio. The relative likelihood method could be applied as follows. Model ''M''<sub>1</sub> has 0 parameters, and so its Akaike information criterion (AIC) value is <math>2 \cdot 0 - 2 \ln(0.005956) \approx 10.25</math>. Model ''M''<sub>2</sub> has 1 parameter, and so its AIC value is <math>2 \cdot 1 - 2 \ln(0.056991) \approx 7.73</math>. Hence ''M''<sub>1</sub> is about <math>\exp\left(\tfrac{7.73 - 10.25}{2}\right) \approx 0.284</math> times as probable as ''M''<sub>2</sub> to minimize the information loss. Thus ''M''<sub>2</sub> is slightly preferred, but ''M''<sub>1</sub> cannot be excluded.
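The AIC comparison can be sketched in the same way. The helper <code>aic</code> is a hypothetical name; it uses the standard definition AIC = 2''k'' − 2 ln ''L'', where ''k'' is the number of free parameters and ''L'' is the maximized likelihood.

```python
from math import comb, exp, log

n, successes = 200, 115


def binomial_likelihood(q: float, n: int, k: int) -> float:
    """Probability of exactly k successes in n trials with success rate q."""
    return comb(n, k) * q**k * (1.0 - q) ** (n - k)


def aic(num_params: int, max_likelihood: float) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * num_params - 2.0 * log(max_likelihood)


# M1 has no free parameter; M2 has one, fitted at q_hat = 115/200.
aic_m1 = aic(0, binomial_likelihood(0.5, n, successes))
aic_m2 = aic(1, binomial_likelihood(successes / n, n, successes))

# Relative likelihood of M1 with respect to the lower-AIC model M2.
relative_likelihood_m1 = exp((aic_m2 - aic_m1) / 2.0)

print(f"AIC(M1) = {aic_m1:.2f}")
print(f"AIC(M2) = {aic_m2:.2f}")
print(f"relative likelihood of M1 = {relative_likelihood_m1:.3f}")
```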
See also
* Akaike information criterion
* Approximate Bayesian computation
* Bayesian information criterion
* Deviance information criterion
* Lindley's paradox
* Minimum message length
* Model selection
; Statistical ratios
* Odds ratio
* Relative risk
References
Further reading
* Dienes, Z. (2019). "How do I know what my theory predicts?" ''Advances in Methods and Practices in Psychological Science''.
* Jaynes, E. T. (1994), ''Probability Theory: the logic of science'', chapter 24.
External links
* BayesFactor: an R package for computing Bayes factors in common research designs
* Online calculator for informed Bayes factors
* Bayes Factor Calculators: web-based version of much of the BayesFactor package