Bayesian inference is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian inference uses a prior distribution to estimate posterior probabilities. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Introduction to Bayes' rule

Formal explanation

Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:

P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)},

where

* H stands for any ''hypothesis'' whose probability may be affected by data (called ''evidence'' below). Often there are competing hypotheses, and the task is to determine which is the most probable.
* P(H), the ''prior probability'', is the estimate of the probability of the hypothesis H ''before'' the data E, the current evidence, is observed.
* E, the ''evidence'', corresponds to new data that were not used in computing the prior probability.
* P(H \mid E), the ''posterior probability'', is the probability of H ''given'' E, i.e., ''after'' E is observed. This is what we want to know: the probability of a hypothesis ''given'' the observed evidence.
* P(E \mid H) is the probability of observing E ''given'' H and is called the ''likelihood''. As a function of E with H fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, E, while the posterior probability is a function of the hypothesis, H.
* P(E) is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis H does not appear anywhere in the symbol, unlike for all the other factors) and hence does not factor into determining the relative probabilities of different hypotheses.
* P(E) > 0 is required; otherwise the right-hand side is 0/0.

For different values of H, only the factors P(H) and P(E \mid H), both in the numerator, affect the value of P(H \mid E): the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

In cases where \neg H ("not H"), the logical negation of H, has a valid likelihood, Bayes' rule can be rewritten as follows:

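\begin{align}
 P(H \mid E) &= \frac{P(E \mid H) P(H)}{P(E)} \\ \\
             &= \frac{P(E \mid H) P(H)}{P(E \mid H) P(H) + P(E \mid \neg H) P(\neg H)} \\ \\
             &= \frac{1}{1 + \left(\frac{1}{P(H)} - 1\right) \frac{P(E \mid \neg H)}{P(E \mid H)}} \\
\end{align}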

because

P(E) = P(E \mid H) P(H) + P(E \mid \neg H) P(\neg H)

and

P(H) + P(\neg H) = 1 .

This focuses attention on the term

\left(\tfrac{1}{P(H)} - 1\right) \tfrac{P(E \mid \neg H)}{P(E \mid H)}.

If that term is approximately 1, then the probability of the hypothesis given the evidence, P(H \mid E), is about \tfrac{1}{2}: the hypothesis is about as likely to be true as false. If that term is very small, close to zero, then P(H \mid E) is close to 1 and the hypothesis, given the evidence, is quite likely. If that term is very large, much larger than 1, then the hypothesis, given the evidence, is quite unlikely. If the hypothesis (without consideration of evidence) is unlikely, then P(H) is small (but not necessarily astronomically small), \tfrac{1}{P(H)} is much larger than 1, and the term can be approximated as

\tfrac{P(E \mid \neg H)}{P(E \mid H) \cdot P(H)},

so the relevant probabilities can be compared directly to each other.

One quick and easy way to remember the equation is to use the rule of multiplication:

P(E \cap H) = P(E \mid H) P(H) = P(H \mid E) P(E).
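The two forms of Bayes' rule given above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names and the numbers (a 1% prior, likelihoods of 0.95 and 0.05) are hypothetical choices made only for illustration.

```python
def posterior(prior_h, lik_h, lik_not_h):
    """P(H | E) for a hypothesis H and its negation, via the law of total probability."""
    evidence = lik_h * prior_h + lik_not_h * (1.0 - prior_h)  # P(E)
    if evidence <= 0.0:
        raise ValueError("P(E) must be positive")
    return lik_h * prior_h / evidence


def odds_term(prior_h, lik_h, lik_not_h):
    """The term (1/P(H) - 1) * P(E | not H) / P(E | H) from the rewritten rule."""
    return (1.0 / prior_h - 1.0) * lik_not_h / lik_h


# Hypothetical inputs: an unlikely hypothesis and fairly diagnostic evidence.
prior_h, lik_h, lik_not_h = 0.01, 0.95, 0.05

direct = posterior(prior_h, lik_h, lik_not_h)
via_term = 1.0 / (1.0 + odds_term(prior_h, lik_h, lik_not_h))

print(f"direct form:   {direct:.6f}")
print(f"via the term:  {via_term:.6f}")
```

Both routes give the same posterior. Because the term is well above 1 for these inputs, the hypothesis remains unlikely even after evidence that favours it.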

Alternatives to Bayesian updating

Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote: "And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour." Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics") following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability. The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.

Inference over exclusive and exhaustive possibilities

If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

General formulation

Suppose a process is generating independent and identically distributed events E_n,\ n = 1, 2, 3, \ldots, but the probability distribution is unknown. Let the event space \Omega represent the current state of belief for this process. Each model is represented by event M_m. The conditional probabilities P(E_n \mid M_m) are specified to define the models. P(M_m) is the degree of belief in M_m. Before the first inference step, \{P(M_m)\} is a set of ''initial prior probabilities''. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate E \in \{E_n\}. For each M \in \{M_m\}, the prior P(M) is updated to the posterior P(M \mid E). From Bayes' theorem:

P(M \mid E) = \frac{P(E \mid M)}{\sum_m P(E \mid M_m) P(M_m)} \cdot P(M).

Upon observation of further evidence, this procedure may be repeated.
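A single update step over a finite set of models can be sketched as follows. The three urn models `M1` to `M3`, their probabilities of producing a red ball, and the function name `bayes_update` are hypothetical; exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Return P(M | E) for each model M, given P(M) and P(E | M)."""
    evidence = sum(likelihood[m] * prior[m] for m in prior)  # sum_m P(E | M_m) P(M_m)
    if evidence == 0:
        raise ValueError("the observed event has probability zero under every model")
    return {m: likelihood[m] * prior[m] / evidence for m in prior}


# Hypothetical models, each defined by its probability of producing a red ball.
p_red = {"M1": Fraction(1, 4), "M2": Fraction(1, 2), "M3": Fraction(3, 4)}

# Initial priors: arbitrary, but they must sum to 1.
prior = {m: Fraction(1, 3) for m in p_red}

# Observe E = "red" and update every model's degree of belief.
posterior = bayes_update(prior, p_red)
for model, belief in posterior.items():
    print(model, belief)
```

The returned dictionary can be passed back in as the prior for the next observation, which is the repetition described above.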

Multiple observations

For a sequence of independent and identically distributed observations \mathbf{E} = (e_1, \dots, e_n), it can be shown by induction that repeated application of the above is equivalent to

P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m P(\mathbf{E} \mid M_m) P(M_m)} \cdot P(M),

where

P(\mathbf{E} \mid M) = \prod_k P(e_k \mid M).
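The equivalence between repeated updating and a single update with the product likelihood can be illustrated numerically. This sketch assumes two hypothetical coin models (`fair` and `biased`) and a short made-up sequence of tosses; it is a check on one example, not a proof.

```python
import math


def normalise(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def likelihood(p_heads, e):
    """P(e | M) for one toss, where e is 1 for heads and 0 for tails."""
    return p_heads if e == 1 else 1.0 - p_heads


def sequential_posterior(prior, p_heads, observations):
    """Apply the single-observation update once per observation."""
    belief = dict(prior)
    for e in observations:
        belief = normalise({m: belief[m] * likelihood(p_heads[m], e) for m in belief})
    return belief


def batch_posterior(prior, p_heads, observations):
    """Update once, using P(E | M) = prod_k P(e_k | M)."""
    joint = {
        m: prior[m] * math.prod(likelihood(p_heads[m], e) for e in observations)
        for m in prior
    }
    return normalise(joint)


p_heads = {"fair": 0.5, "biased": 0.8}  # hypothetical models
prior = {"fair": 0.5, "biased": 0.5}
data = [1, 1, 0, 1]                     # hypothetical tosses

seq = sequential_posterior(prior, p_heads, data)
bat = batch_posterior(prior, p_heads, data)
print(seq)
print(bat)
```

The two dictionaries agree up to floating-point rounding, so the order in which independent observations are processed does not matter.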

Parametric formulation: motivating the formal description

By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is, however, equally applicable to discrete distributions.

Let the vector \boldsymbol{\theta} span the parameter space. Let the initial prior distribution over \boldsymbol{\theta} be p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}), where \boldsymbol{\alpha} is a set of parameters to the prior itself, or ''hyperparameters''. Let \mathbf{E} = (e_1, \dots, e_n) be a sequence of independent and identically distributed event observations, where all e_i are distributed as p(e \mid \boldsymbol{\theta}) for some \boldsymbol{\theta}. Bayes' theorem is applied to find the posterior distribution over \boldsymbol{\theta}:

\begin{align}
 p(\boldsymbol{\theta} \mid \mathbf{E}, \boldsymbol{\alpha}) &= \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{p(\mathbf{E} \mid \boldsymbol{\alpha})} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \\
  &= \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{\int p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \, d\boldsymbol{\theta}} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}),
\end{align}

where

p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) = \prod_k p(e_k \mid \boldsymbol{\theta}).
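For a one-dimensional parameter, the posterior density can be approximated by evaluating likelihood times prior on a grid and normalising numerically. The sketch below assumes Bernoulli observations with success probability theta and a uniform prior on [0, 1]; the grid size, the data and the function name `grid_posterior` are illustrative choices.

```python
def grid_posterior(observations, prior_density, n_grid=1001):
    """Approximate p(theta | E) on a uniform grid over [0, 1] for Bernoulli data."""
    step = 1.0 / (n_grid - 1)
    thetas = [i * step for i in range(n_grid)]

    unnormalised = []
    for theta in thetas:
        lik = 1.0
        for e in observations:  # p(E | theta) = prod_k p(e_k | theta)
            lik *= theta if e == 1 else 1.0 - theta
        unnormalised.append(lik * prior_density(theta))

    # Trapezoidal rule for the normalising integral p(E | alpha).
    integral = step * (sum(unnormalised) - 0.5 * (unnormalised[0] + unnormalised[-1]))
    return thetas, [u / integral for u in unnormalised]


data = [1, 0, 1, 1, 0, 1, 1]  # hypothetical: 5 successes, 2 failures
thetas, density = grid_posterior(data, prior_density=lambda theta: 1.0)

step = thetas[1] - thetas[0]
post_mean = step * sum(t * d for t, d in zip(thetas, density))
print(f"posterior mean of theta: {post_mean:.4f}")
```

With a uniform prior and these data the exact posterior is a Beta(6, 3) distribution with mean 2/3, so the grid estimate can be compared against a known answer. Grid methods scale poorly with the dimension of the parameter vector, which is one reason they are only a sketch of the idea.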

Formal description of Bayesian inference

Definitions

* x, a data point in general. This may in fact be a vector of values.
* \theta, the parameter of the data point's distribution, i.e., x \sim p(x \mid \theta). This may be a vector of parameters.
* \alpha, the hyperparameter of the parameter distribution, i.e., \theta \sim p(\theta \mid \alpha). This may be a vector of hyperparameters.
* \mathbf{X} is the sample, a set of n observed data points, i.e., x_1, \ldots, x_n.
* \tilde{x}, a new data point whose distribution is to be predicted.

Bayesian inference

* The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. p(\theta \mid \alpha). The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
* The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. p(\mathbf{X} \mid \theta). This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written \operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta).
* The marginal likelihood (sometimes also termed the ''evidence'') is the distribution of the observed data marginalized over the parameter(s), i.e.

p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) p(\theta \mid \alpha) \, d\theta.

It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise. If the marginal likelihood is 0 then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
* The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference:

p(\theta \mid \mathbf{X},\alpha) = \frac{p(\theta,\mathbf{X},\alpha)}{p(\mathbf{X},\alpha)} = \frac{p(\mathbf{X} \mid \theta,\alpha) p(\theta,\alpha)}{p(\mathbf{X} \mid \alpha) p(\alpha)} = \frac{p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta,\alpha) p(\theta \mid \alpha).

This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
* In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution p(\theta \mid \mathbf{X},\alpha) is not obtained in closed form, mainly because the dimension of the parameter space for \theta can be very high, or because the Bayesian model retains a hierarchical structure formulated from the observations \mathbf{X} and parameter \theta. In such situations, we need to resort to approximation techniques.
* General case: Let P_Y^x be the conditional distribution of Y given X = x and let P_X be the distribution of X. The joint distribution is then P_{X,Y}(dx,dy) = P_Y^x(dy) P_X(dx). The conditional distribution P_X^y of X given Y = y is then determined by

P_X^y(A) = E(1_A(X) \mid Y = y).

Existence and uniqueness of the needed conditional expectation is a consequence of the Radon–Nikodym theorem. This was formulated by Kolmogorov in his famous book from 1933. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface. Bayes' theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions. Bayes' theorem can be generalized to include improper prior distributions such as the uniform distribution on the real line. Modern Markov chain Monte Carlo methods have boosted the importance of Bayes' theorem including cases with improper priors.
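As an illustration of the approximation techniques mentioned above, the following is a minimal random-walk Metropolis sampler, one simple member of the Markov chain Monte Carlo family. It assumes Bernoulli data with a uniform prior on the success probability, so that the exact posterior (a Beta distribution) is available for comparison. The step size, chain length, burn-in and seed are arbitrary illustrative settings, not recommendations.

```python
import math
import random


def log_unnormalised_posterior(theta, successes, failures):
    """log(likelihood * prior) for Bernoulli data with a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior density outside the unit interval
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(successes, failures, n_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis sampler; only posterior ratios are needed."""
    rng = random.Random(seed)
    theta = 0.5
    log_p = log_unnormalised_posterior(theta, successes, failures)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_p_new = log_unnormalised_posterior(proposal, successes, failures)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples


samples = metropolis(successes=7, failures=3)  # hypothetical data
kept = samples[2000:]                          # discard an initial burn-in
estimate = sum(kept) / len(kept)
print(f"estimated posterior mean: {estimate:.3f} (exact value is 2/3)")
```

The sampler never evaluates the marginal likelihood: the normalising constant cancels in the acceptance ratio, which is what makes such methods usable when the integral is intractable.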

Bayesian prediction

* The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior:

p(\tilde{x} \mid \mathbf{X},\alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) \, d\theta

* The prior predictive distribution is the distribution of a new data point, marginalized over the prior:

p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \alpha) \, d\theta

Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s) (e.g., by maximum likelihood or maximum a posteriori (MAP) estimation) and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.

In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics, when constructed from a normal distribution with unknown mean and variance, are constructed using a Student's t-distribution. This correctly estimates the variance, due to the facts that (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly, or at least to an arbitrary level of precision when numerical methods are used.

Both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
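The contrast between predictive distributions and a plug-in point estimate can be seen in a small conjugate example. The sketch assumes a Bernoulli observation model with a Beta(a, b) prior, for which the predictive probability of a success equals the mean a / (a + b) of the current Beta distribution. The hyperparameters and the data (three successes, no failures) are hypothetical.

```python
from fractions import Fraction


def predictive_success_probability(a, b):
    """P(next observation = 1) when theta ~ Beta(a, b), i.e. the mean a / (a + b)."""
    return Fraction(a, a + b)


a0, b0 = 1, 1               # hypothetical prior hyperparameters (uniform prior)
successes, failures = 3, 0  # hypothetical observed data

# Same formula, different hyperparameters: prior versus updated values.
prior_predictive = predictive_success_probability(a0, b0)
posterior_predictive = predictive_success_probability(a0 + successes, b0 + failures)

# Plug-in prediction from the maximum likelihood estimate of theta.
plug_in_mle = Fraction(successes, successes + failures)

print("prior predictive    :", prior_predictive)
print("posterior predictive:", posterior_predictive)
print("MLE plug-in         :", plug_in_mle)
```

Here the plug-in prediction treats a failure as impossible after only three observations, while the posterior predictive probability still leaves room for one because it averages over the remaining uncertainty in theta.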

Mathematical properties

Interpretation of factor

The posterior is the prior multiplied by the factor \frac{P(E \mid M)}{P(E)}. If belief in the model increases,

\frac{P(E \mid M)}{P(E)} > 1 \Rightarrow P(E \mid M) > P(E).

That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change,

\frac{P(E \mid M)}{P(E)} = 1 \Rightarrow P(E \mid M) = P(E).

That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.
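The role of the factor can be made concrete with a two-hypothesis calculation. In this sketch the function name `update_factor` and all probabilities are hypothetical; the model M is compared with its negation.

```python
def update_factor(prior_m, lik_m, lik_not_m):
    """Return (factor, posterior), where factor = P(E | M) / P(E)."""
    evidence = lik_m * prior_m + lik_not_m * (1.0 - prior_m)  # P(E)
    if evidence <= 0.0:
        raise ValueError("P(E) must be positive")
    factor = lik_m / evidence
    return factor, factor * prior_m


# Evidence more likely under M than under the current state of belief.
factor_up, post_up = update_factor(prior_m=0.3, lik_m=0.8, lik_not_m=0.2)

# Evidence equally likely either way: the factor is 1 and belief is unchanged.
factor_same, post_same = update_factor(prior_m=0.3, lik_m=0.5, lik_not_m=0.5)

# Evidence less likely under M: belief decreases.
factor_down, post_down = update_factor(prior_m=0.3, lik_m=0.1, lik_not_m=0.6)

# A prior of exactly zero is left at zero by any finite factor.
_, post_zero = update_factor(prior_m=0.0, lik_m=0.9, lik_not_m=0.1)

print(factor_up, post_up)
print(factor_same, post_same)
print(factor_down, post_down)
print(post_zero)
```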

Cromwell's rule

If P(M) = 0 then P(M \mid E) = 0. If P(M) = 1 and P(E) > 0, then P(M \mid E) = 1. This can be interpreted to mean that hard convictions are insensitive to counter-evidence.

The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not M" in place of "M", yielding "if 1 - P(M) = 0, then 1 - P(M \mid E) = 0", from which the result immediately follows.

Asymptotic behaviour of posterior

Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein-von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable in consideration has a finite probability space. More general results were obtained later by the statistician David A. Freedman, who established in two seminal research papers in 1963 and 1965 when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein-von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later, in the 1980s and 1990s, Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces. To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.
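The diminishing influence of the prior in a well-behaved finite case can be illustrated with a two-outcome example. The sketch assumes Bernoulli trials with Beta priors, for which the posterior mean has a closed form; the two priors and the 70% success rate are hypothetical. It shows only the priors being washed out, not the Gaussian limit itself, and it says nothing about the countably infinite case discussed above.

```python
def posterior_mean(a, b, successes, failures):
    """Mean of the Beta(a + successes, b + failures) posterior."""
    return (a + successes) / (a + b + successes + failures)


def gap_between_priors(n):
    """Difference in posterior means under two different priors after n trials."""
    successes = (7 * n) // 10  # hypothetical data: 70% successes
    failures = n - successes
    uniform = posterior_mean(1, 1, successes, failures)     # Beta(1, 1) prior
    sceptical = posterior_mean(2, 20, successes, failures)  # Beta(2, 20) prior
    return abs(uniform - sceptical)


trial_counts = (10, 100, 1000, 10000)
gaps = [gap_between_priors(n) for n in trial_counts]
for n, gap in zip(trial_counts, gaps):
    print(f"n = {n:>5}: gap between posterior means = {gap:.4f}")
```

With few trials the two posteriors disagree markedly; the disagreement shrinks as trials accumulate, in line with the remark that too few trials may fail to suppress the effect of the initial choice.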

Conjugate priors

In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
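A closed-form conjugate update can be sketched with the Gamma-Poisson pair, chosen here purely as an example: a Gamma(shape, rate) prior on the rate of a Poisson distribution yields a Gamma posterior whose shape is increased by the sum of the observed counts and whose rate is increased by the number of observations. The hyperparameters and counts below are hypothetical.

```python
def gamma_poisson_update(shape, rate, counts):
    """Posterior hyperparameters for a Gamma(shape, rate) prior and Poisson counts."""
    return shape + sum(counts), rate + len(counts)


prior_shape, prior_rate = 2.0, 1.0  # hypothetical prior hyperparameters
counts = [3, 5, 4]                  # hypothetical observed counts

post_shape, post_rate = gamma_poisson_update(prior_shape, prior_rate, counts)
post_mean = post_shape / post_rate  # mean of a Gamma(shape, rate) distribution

print(f"posterior: Gamma(shape={post_shape}, rate={post_rate}), mean={post_mean}")
```

No integration is needed: the update is simple arithmetic on the hyperparameters, which is the practical appeal of conjugate families.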

Estimates of parameters and predictions

It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution. For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator. If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.

\tilde \theta = \operatorname

theta Theta (, ) uppercase Θ or ; lowercase θ or ; ''thē̂ta'' ; Modern: ''thī́ta'' ) is the eighth letter of the Greek alphabet, derived from the Phoenician letter Teth 𐤈. In the system of Greek numerals, it has a value of 9. Gree ...

= \int \theta \, p(\theta \mid \mathbf,\alpha) \, d\theta Taking a value with the greatest probability defines maximum ''a posteriori'' (MAP) estimates:

\ \subset \arg \max_\theta p(\theta \mid \mathbf,\alpha) .

There are examples where no maximum is attained, in which case the set of MAP estimates is empty. There are other methods of estimation that minimize the posterior ''risk'' (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics"). The posterior predictive distribution of a new observation

\tilde{x}

(that is independent of previous observations) is determined by

p(\tilde{x} \mid \mathbf{X},\alpha) = \int p(\tilde{x},\theta \mid \mathbf{X},\alpha) \, d\theta = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathbf{X},\alpha) \, d\theta .
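The estimators and the predictive integral described in this section can be approximated numerically on a grid. The following is a minimal sketch, not a reference implementation: it assumes hypothetical data (7 successes in 10 Bernoulli trials) and a uniform prior on the success probability, neither of which comes from the article, and the grid size is an arbitrary choice.

```python
# Hypothetical data (an assumption for illustration): 7 successes in 10 trials.
successes, trials = 7, 10

# Midpoint grid over (0, 1); the resolution is an arbitrary illustrative choice.
n = 10_000
width = 1.0 / n
grid = [(i + 0.5) * width for i in range(n)]

# Unnormalised posterior = likelihood * prior; the uniform prior is constant.
unnorm = [t**successes * (1 - t) ** (trials - successes) for t in grid]
norm = sum(unnorm) * width
posterior = [u / norm for u in unnorm]  # density values p(theta | X)

# Posterior mean: integral of theta * p(theta | X).
post_mean = sum(t * p for t, p in zip(grid, posterior)) * width

# Posterior median: first grid point where the cumulative mass reaches 0.5.
cumulative = 0.0
for t, p in zip(grid, posterior):
    cumulative += p * width
    if cumulative >= 0.5:
        post_median = t
        break

# MAP estimate: grid point with the highest posterior density.
post_map = max(zip(posterior, grid))[1]

# Posterior predictive probability that the next trial succeeds:
# integral of p(x_new = 1 | theta) * p(theta | X), with p(x_new = 1 | theta) = theta.
pred_success = sum(t * p for t, p in zip(grid, posterior)) * width

print(post_mean, post_median, post_map, pred_success)
```

For a Bernoulli observation the predictive probability of success coincides with the posterior mean, which is why the last integral repeats the first one. In this conjugate case closed forms are available, so the grid serves only to make the integrals concrete.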

Examples

Probability of a hypothesis

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1? Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let

H_1

correspond to bowl #1, and

H_2

to bowl #2. It is given that the bowls are identical from Fred's point of view, thus

P(H_1)=P(H_2)

, and the two must add up to 1, so both are equal to 0.5. The event

E

is the observation of a plain cookie. From the contents of the bowls, we know that

P(E \mid H_1) = 30/40 = 0.75

and

P(E \mid H_2) = 20/40 = 0.5.

Bayes' formula then yields

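\begin{align}
P(H_1 \mid E) &= \frac{P(E \mid H_1)\,P(H_1)}{P(E \mid H_1)\,P(H_1) + P(E \mid H_2)\,P(H_2)} \\
 &= \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} \\
 &= 0.6
\end{align}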

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability,

P(H_1)

, which was 0.5. After observing the cookie, we must revise the probability to

P(H_1 \mid E)

, which is 0.6.
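The arithmetic of the cookie example can be checked with a few lines of code. This is a small sketch using the numbers given above; the dictionary keys `bowl1` and `bowl2` are illustrative names only.

```python
# Prior probability of each bowl (Fred picks a bowl at random).
prior = {"bowl1": 0.5, "bowl2": 0.5}

# Likelihood of drawing a plain cookie from each bowl.
likelihood_plain = {"bowl1": 30 / 40, "bowl2": 20 / 40}

# Evidence P(E): total probability of observing a plain cookie.
evidence = sum(prior[h] * likelihood_plain[h] for h in prior)

# Posterior P(H | E) for each bowl, by Bayes' theorem.
posterior = {h: prior[h] * likelihood_plain[h] / evidence for h in prior}

print(posterior)
```

The two posterior values sum to one because the hypotheses are exclusive and exhaustive.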

Making a prediction

An archaeologist is working at a site thought to be from the medieval period, between the 11th century and the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed? The degree of belief in the continuous variable

C

(century) is to be calculated, with the discrete set of events

\{GD, G \bar D, \bar G D, \bar G \bar D\}

as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,

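P(E=GD \mid C=c) = \left(0.01 + \frac{0.81-0.01}{16-11}(c-11)\right)\left(0.5 - \frac{0.5-0.05}{16-11}(c-11)\right)

P(E=G \bar D \mid C=c) = \left(0.01 + \frac{0.81-0.01}{16-11}(c-11)\right)\left(0.5 + \frac{0.5-0.05}{16-11}(c-11)\right)

P(E=\bar G D \mid C=c) = \left((1-0.01) - \frac{0.81-0.01}{16-11}(c-11)\right)\left(0.5 - \frac{0.5-0.05}{16-11}(c-11)\right)

P(E=\bar G \bar D \mid C=c) = \left((1-0.01) - \frac{0.81-0.01}{16-11}(c-11)\right)\left(0.5 + \frac{0.5-0.05}{16-11}(c-11)\right)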

Assume a uniform prior of

f_C(c) = 0.2

, and that trials are independent and identically distributed. When a new fragment of type

e

is discovered, Bayes' theorem is applied to update the degree of belief for each

c

f_C(c \mid E=e) = \frac{P(E=e \mid C=c)}{P(E=e)}f_C(c) = \frac{P(E=e \mid C=c)}{\int_{11}^{16} P(E=e \mid C=c) f_C(c) \, dc}f_C(c)

A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or

c=15.2

. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. The Bernstein–von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events

\{GD, G \bar D, \bar G D, \bar G \bar D\}

is finite (see above section on asymptotic behaviour of the posterior).
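A simulation of this kind can be sketched as follows. This is an illustrative reconstruction rather than the simulation behind the figures quoted above: the grid resolution and the random seed are arbitrary assumptions, so the resulting probabilities will generally differ from those numbers. The likelihoods follow the linear glaze and decoration model stated in the example.

```python
import random


def p_glazed(c):
    """P(glazed | century c): linear from 1% at c=11 to 81% at c=16."""
    return 0.01 + (0.81 - 0.01) / (16 - 11) * (c - 11)


def p_decorated(c):
    """P(decorated | century c): linear from 50% at c=11 to 5% at c=16."""
    return 0.5 - (0.5 - 0.05) / (16 - 11) * (c - 11)


def likelihood(glazed, decorated, c):
    """P(E=e | C=c), treating glaze and decoration as independent."""
    pg = p_glazed(c) if glazed else 1 - p_glazed(c)
    pd = p_decorated(c) if decorated else 1 - p_decorated(c)
    return pg * pd


def update(density, grid, width, glazed, decorated):
    """One Bayes update of the density f_C on the grid for a new fragment."""
    unnorm = [likelihood(glazed, decorated, c) * f for c, f in zip(grid, density)]
    evidence = sum(unnorm) * width  # approximates the integral giving P(E=e)
    return [u / evidence for u in unnorm]


# Midpoint grid on [11, 16]; the resolution is an illustrative choice.
n = 500
width = (16 - 11) / n
grid = [11 + (i + 0.5) * width for i in range(n)]
density = [0.2] * n  # uniform prior f_C(c) = 0.2

rng = random.Random(0)  # fixed seed, chosen arbitrarily
true_c = 15.2
for _ in range(50):
    glazed = rng.random() < p_glazed(true_c)
    decorated = rng.random() < p_decorated(true_c)
    density = update(density, grid, width, glazed, decorated)


def mass(lo, hi):
    """Posterior probability that lo <= c < hi."""
    return sum(f for c, f in zip(grid, density) if lo <= c < hi) * width


for start in range(11, 16):
    print(f"c in [{start}, {start + 1}): {mass(start, start + 1):.3f}")
```

Because the fragments are assumed independent, updating one fragment at a time gives the same posterior as multiplying all 50 likelihoods at once and normalizing.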

In frequentist statistics and decision theory

A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures (Bickel & Doksum 2001, p. 32).

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals. For example:

* "Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."
* "In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."
* "In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."
* "A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"
* "An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."

Model selection

Bayesian methodology also plays a role in model selection, where the aim is to select one model from a set of competing models that represents most closely the underlying process that generated the observed data. In Bayesian model comparison, the model with the highest posterior probability given the data is selected. The posterior probability of a model depends on the evidence, or marginal likelihood, which reflects the probability that the data is generated by the model, and on the prior belief of the model. When two competing models are a priori considered to be equiprobable, the ratio of their posterior probabilities corresponds to the Bayes factor. Since Bayesian model comparison is aimed at selecting the model with the highest posterior probability, this methodology is also referred to as the maximum a posteriori (MAP) selection rule or the MAP probability rule.
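To make the role of the evidence concrete, the sketch below compares two hypothetical models for coin-flip data: M1 fixes the probability of heads at 0.5, and M2 places a uniform prior on it. The models, the data (5 heads in 10 flips) and all function names are assumptions introduced for illustration only.

```python
from math import comb


def evidence_fair(k, n):
    """Marginal likelihood of k heads in n flips under M1 (theta fixed at 0.5)."""
    return comb(n, k) * 0.5**n


def evidence_uniform(k, n):
    """Marginal likelihood under M2 (theta ~ Uniform(0, 1)).

    The integral of C(n, k) * theta^k * (1 - theta)^(n - k) over [0, 1]
    equals 1 / (n + 1) for every k.
    """
    return 1.0 / (n + 1)


def posterior_model_probs(k, n, prior_m1=0.5):
    """Posterior probabilities of M1 and M2 given the data."""
    w1 = evidence_fair(k, n) * prior_m1
    w2 = evidence_uniform(k, n) * (1 - prior_m1)
    total = w1 + w2
    return w1 / total, w2 / total


k, n = 5, 10  # hypothetical data
bayes_factor = evidence_fair(k, n) / evidence_uniform(k, n)
p_m1, p_m2 = posterior_model_probs(k, n)
print(bayes_factor, p_m1, p_m2)
```

With equal prior model probabilities, the ratio `p_m1 / p_m2` equals the Bayes factor, and the MAP selection rule picks whichever model has the larger posterior probability.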

Probabilistic programming

While conceptually simple, Bayesian methods can be mathematically and numerically challenging. Probabilistic programming languages (PPLs) implement functions to easily build Bayesian models together with efficient automatic inference methods. This helps separate the model building from the inference, allowing practitioners to focus on their specific problems and leaving PPLs to handle the computational details for them.

Applications

Statistical data analysis

See the separate Wikipedia entry on Bayesian statistics, specifically the statistical modeling section in that page.

Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure ''may'' allow for efficient simulation algorithms such as Gibbs sampling and other Metropolis–Hastings algorithm schemes. Recently Bayesian inference has gained popularity among the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.

As applied to statistical classification, Bayesian inference has been used to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naïve Bayes classifier.

Solomonoff's inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam's razor. Solomonoff's universal prior probability of any prefix ''p'' of a computable sequence ''x'' is the sum of the probabilities of all programs (for a universal computer) that compute something starting with ''p''. Given some ''p'' and any computable but unknown probability distribution from which ''x'' is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of ''x'' in optimal fashion.
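The simulation-based Metropolis–Hastings schemes mentioned in this section can be illustrated with a random-walk Metropolis sampler. The sketch below is a toy under stated assumptions: the target is the posterior for hypothetical data (7 successes in 10 trials, uniform prior), and the step size, chain length, burn-in and seed are arbitrary choices rather than recommendations.

```python
import math
import random


def log_target(theta, successes=7, trials=10):
    """Unnormalised log posterior: binomial likelihood with a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + (trials - successes) * math.log(1 - theta)


def metropolis(n_samples, step=0.2, start=0.5, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta, samples = start, []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        # Accept with probability min(1, target(proposal) / target(theta)).
        log_ratio = log_target(proposal) - log_target(theta)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        samples.append(theta)
    return samples


samples = metropolis(20_000)
kept = samples[2_000:]  # discard an arbitrary burn-in period
estimate = sum(kept) / len(kept)  # Monte Carlo estimate of the posterior mean
print(estimate)
```

Only ratios of the target density enter the acceptance step, so the normalizing constant of the posterior never has to be computed. This is what makes such samplers useful when no closed form is available.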

Bioinformatics and healthcare applications

Bayesian inference has been applied in different bioinformatics applications, including differential gene expression analysis (Robinson, Mark D.; McCarthy, Davis J.; Smyth, Gordon K., "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data", ''Bioinformatics''). Bayesian inference is also used in a general cancer risk model, called CIRI (Continuous Individualized Risk Index), where serial measurements are incorporated to update a Bayesian model which is primarily built from prior knowledge.

In the courtroom

Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for "beyond a reasonable doubt". Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population. For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.

The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in ''R v Adams''. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."

Gardner-Medwin argues that the criterion on which a verdict in a criminal trial should be based is ''not'' the probability of guilt, but rather the ''probability of the evidence, given that the defendant is innocent'' (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

: ''A'' – the known facts and testimony could have arisen if the defendant is guilty.
: ''B'' – the known facts and testimony could have arisen if the defendant is innocent.
: ''C'' – the defendant is guilty.

Gardner-Medwin argues that the jury should believe both ''A'' and not-''B'' in order to convict. ''A'' and not-''B'' implies the truth of ''C'', but the reverse is not true. It is possible that ''B'' and ''C'' are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.
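The odds form and the logarithmic form of Bayes' theorem described in this section can be sketched as follows. The likelihood ratios are hypothetical values invented for illustration, and the sketch assumes that the items of evidence are conditionally independent given guilt or innocence, which a real case may not satisfy.

```python
import math


def update_odds(prior_prob, likelihood_ratios):
    """Odds form: posterior odds = prior odds * product of likelihood ratios."""
    odds = prior_prob / (1 - prior_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)  # convert odds back to a probability


def update_log_odds(prior_prob, likelihood_ratios):
    """Log form: each item of evidence adds log10 of its likelihood ratio."""
    log_odds = math.log10(prior_prob / (1 - prior_prob))
    log_odds += sum(math.log10(lr) for lr in likelihood_ratios)
    odds = 10**log_odds
    return odds / (1 + odds)


prior = 1 / 1000  # uniform prior over 1,000 possible culprits
ratios = [500.0, 20.0, 0.5]  # hypothetical likelihood ratios, one per item of evidence
posterior = update_odds(prior, ratios)
print(posterior, update_log_odds(prior, ratios))
```

Both functions return the same posterior probability. The logarithmic version replaces the running product with a running sum, which is the simplification suggested for jurors.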

Bayesian epistemology

Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

Karl Popper and David Miller have rejected the idea of Bayesian rationalism, i.e. using Bayes' rule to make epistemological inferences: it is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

Other

* The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments. Bayesian inference has also been applied to treat stochastic scheduling problems with incomplete information by Cai et al. (2009).
* Bayesian search theory is used to search for lost objects.
* Bayesian inference in phylogeny
* Bayesian tool for methylation analysis
* Bayesian approaches to brain function investigate the brain as a Bayesian mechanism.
* Bayesian inference in ecological studies
* Bayesian inference is used to estimate parameters in stochastic chemical kinetic models
* Bayesian inference in econophysics for currency or prediction of trend changes in financial quotations
* Bayesian inference in marketing
* Bayesian inference in motor learning
* Bayesian inference is used in probabilistic numerics to solve numerical problems

Bayes and Bayesian inference

The problem considered by Bayes in Proposition 9 of his essay, "An Essay Towards Solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter ''a'' (the success rate) of the binomial distribution.

History

The term ''Bayesian'' refers to Thomas Bayes (1701–1761), who proved that probabilistic limits could be placed on an unknown event. However, it was Pierre-Simon Laplace (1749–1827) who introduced (as Principle VI) what is now called Bayes' theorem and used it to address problems in celestial mechanics, medical statistics, reliability, and jurisprudence. Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to ''objective'' and ''subjective'' currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed, and the method assigning the prior, which differs from one objective Bayesian practitioner to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications. Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics. Nonetheless, Bayesian methods are widely accepted and used, for example in the field of machine learning.


External links

* Bayesian Statistics from Scholarpedia.
* from Queen Mary University of London
* Bayesian reading list (http://cocosci.berkeley.edu/tom/bayes.html), categorized and annotated by Tom Griffiths
* A. Hajek and S. Hartmann: Bayesian Epistemology, in: J. Dancy et al. (eds.), A Companion to Epistemology. Oxford: Blackwell 2010, 93–106.
* S. Hartmann and J. Sprenger: Bayesian Epistemology, in: S. Bernecker and D. Pritchard (eds.), Routledge Companion to Epistemology. London: Routledge 2010, 609–620.
* ''Stanford Encyclopedia of Philosophy'': "Inductive Logic"
* Bayesian Confirmation Theory (PDF)
* Informal introduction with many examples, ebook (PDF) freely available at causaScientia