Belief propagation, also known as sum–product message passing, is a message-passing

algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...

for performing

inference Inferences are steps in logical reasoning, moving from premises to logical consequences; etymologically, the word '' infer'' means to "carry forward". Inference is theoretically traditionally divided into deduction and induction, a distinct ...

graphical model A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. Graphical models are commonly used in ...

s, such as

Bayesian network A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Whi ...

s and

Markov random field In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph In discrete mathematics, particularly ...

s. It calculates the

marginal distribution In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variable ...

for each unobserved node (or variable), conditional on any observed nodes (or variables). Belief propagation is commonly used in

artificial intelligence Artificial intelligence (AI) is the capability of computer, computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of re ...

and

information theory Information theory is the mathematical study of the quantification (science), quantification, Data storage, storage, and telecommunications, communication of information. The field was established and formalized by Claude Shannon in the 1940s, ...

, and has demonstrated empirical success in numerous applications, including

low-density parity-check codes Low-density parity-check (LDPC) codes are a class of error correction codes which (together with the closely-related turbo codes) have gained prominence in coding theory and information theory since the late 1990s. The codes today are widely u ...

, turbo codes, free energy approximation, and

satisfiability In mathematical logic, a formula is ''satisfiable'' if it is true under some assignment of values to its variables. For example, the formula x+3=y is satisfiable because it is true when x=3 and y=6, while the formula x+1=x is not satisfiable over ...

. The algorithm was first proposed by

Judea Pearl Judea Pearl (; born September 4, 1936) is an Israeli-American computer scientist and philosopher, best known for championing the probabilistic approach to artificial intelligence and the development of Bayesian networks (see the article on belie ...

in 1982, who formulated it as an exact inference algorithm on

tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, e.g., including only woody plants with secondary growth, only ...

s, later extended to

polytree In mathematics, and more specifically in graph theory, a polytree (also called directed tree, oriented tree; . or singly connected network.) is a directed acyclic graph whose underlying undirected graph is a tree. In other words, a polytree is form ...

s. While the algorithm is not exact on general graphs, it has been shown to be a useful approximate algorithm.

Motivation

Given a finite set of

discrete Discrete may refer to: *Discrete particle or quantum in physics, for example in quantum theory * Discrete device, an electronic component with just one circuit element, either passive or active, other than an integrated circuit * Discrete group, ...

random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a Mathematics, mathematical formalization of a quantity or object which depends on randomness, random events. The term 'random variable' in its mathema ...

X_1, \ldots, X_n

with

joint A joint or articulation (or articular surface) is the connection made between bones, ossicles, or other hard structures in the body which link an animal's skeletal system into a functional whole.Saladin, Ken. Anatomy & Physiology. 7th ed. McGraw- ...

probability mass function In probability and statistics, a probability mass function (sometimes called ''probability function'' or ''frequency function'') is a function that gives the probability that a discrete random variable is exactly equal to some value. Sometimes i ...

p

, a common task is to compute the marginal distributions of the

X_i

. The marginal of a single

X_i

is defined to be :

p_(x_i) = \sum_ p(\mathbf')

where

\mathbf x' = (x'_1, \ldots, x'_n)

is a vector of possible values for the

X_i

, and the notation

\mathbf x' : x'_i = x_i

means that the sum is taken over those

\mathbf x'

whose

i

th coordinate is equal to

x_i

. Computing marginal distributions using this formula quickly becomes computationally prohibitive as the number of variables grows. For example, given 100 binary variables

X_1, \ldots, X_

, computing a single marginal

X_i

using

p

and the above formula would involve summing over

2^ \approx 6.34 \times 10^

possible values for

\mathbf x'

. If it is known that the probability mass function

p

factors in a convenient way, belief propagation allows the marginals to be computed much more efficiently.

Description of the sum-product algorithm

Variants of the belief propagation algorithm exist for several types of graphical models (

Bayesian networks A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their Conditional dependence, conditional dependencies via a directed a ...

and

Markov random fields In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to ...

in particular). We describe here the variant that operates on a

factor graph A factor graph is a bipartite graph representing the factorization of a function (mathematics), function. In probability theory and its applications, factor graphs are used to represent factorization of a Probability distribution function (disam ...

. A factor graph is a

bipartite graph In the mathematics, mathematical field of graph theory, a bipartite graph (or bigraph) is a Graph (discrete mathematics), graph whose vertex (graph theory), vertices can be divided into two disjoint sets, disjoint and Independent set (graph theo ...

containing nodes corresponding to variables

V

and factors

F

, with edges between variables and the factors in which they appear. We can write the joint mass function: :

p(\mathbf) = \prod_ f_a (\mathbf_a)

where

\mathbf x_a

is the vector of neighboring variable nodes to the factor node

a

. Any

can be represented as a factor graph by using a factor for each node with its parents or a factor for each node with its neighborhood respectively. The algorithm works by passing real valued functions called ''messages'' along the edges between the nodes. More precisely, if

v

is a variable node and

a

is a factor node connected to

v

in the factor graph, then the messages

\mu_

from

v

a

and the messages

\mu_

from

a

v

are real-valued functions

\mu_, \mu_ : \operatorname(v) \to \mathbb R

, whose domain is the set of values that can be taken by the random variable associated with

v

, denoted

\operatorname(v)

. These messages contain the "influence" that one variable exerts on another. The messages are computed differently depending on whether the node receiving the message is a variable node or a factor node. Keeping the same notation: * A message

\mu_: \operatorname(v) \to \mathbb R

from a variable node

v

to a factor node

a

is defined by

\mu_ (x_v) = \prod_ \mu_ (x_v)

for

x_v \in \operatorname(v)

, where

N(v)

is the set of neighboring factor nodes of

v

. If

N(v)\setminus\

is empty then

\mu_(x_v)

is set to the uniform distribution over

\operatorname(v)

. * A message

\mu_: \operatorname(v) \to \mathbb R

from a factor node

a

to a variable node

v

is defined to be the product of the factor with messages from all other nodes, marginalized over all variables except the one associated with

v

\mu_ (x_v) = \sum_ \left( f_a (\mathbf'_a) \prod_ \mu_ (x'_) \right)

for

x_v \in \operatorname(v)

, where

N(a)

is the set of neighboring (variable) nodes to

a

. If

N(a) \setminus \

is empty, then

\mu_ (x_v) = f_a(x_v)

, since in this case

x_v = x_a

. As shown by the previous formula: the complete marginalization is reduced to a sum of products of simpler terms than the ones appearing in the full joint distribution. This is the reason that belief propagation is sometimes called ''sum-product message passing'', or the ''sum-product algorithm''. In a typical run, each message will be updated iteratively from the previous value of the neighboring messages. Different scheduling can be used for updating the messages. In the case where the graphical model is a tree, an optimal scheduling converges after computing each message exactly once (see next sub-section). When the factor graph has cycles, such an optimal scheduling does not exist, and a typical choice is to update all messages simultaneously at each iteration. Upon convergence (if convergence happened), the estimated marginal distribution of each node is proportional to the product of all messages from adjoining factors (missing the normalization constant): :

p_ (x_v) \propto \prod_ \mu_ (x_v).

Likewise, the estimated joint marginal distribution of the set of variables belonging to one factor is proportional to the product of the factor and the messages from the variables: :

p_ (\mathbf_a) \propto f_a(\mathbf_a) \prod_ \mu_ (x_v).

In the case where the factor graph is acyclic (i.e. is a tree or a forest), these estimated marginal actually converge to the true marginals in a finite number of iterations. This can be shown by

mathematical induction Mathematical induction is a method for mathematical proof, proving that a statement P(n) is true for every natural number n, that is, that the infinitely many cases P(0), P(1), P(2), P(3), \dots all hold. This is done by first proving a ...

Exact algorithm for trees

In the case when the

is a

, the belief propagation algorithm will compute the exact marginals. Furthermore, with proper scheduling of the message updates, it will terminate after two full passes through the tree. This optimal scheduling can be described as follows: Before starting, the graph is oriented by designating one node as the ''root''; any non-root node which is connected to only one other node is called a ''leaf''. In the first step, messages are passed inwards: starting at the leaves, each node passes a message along the (unique) edge towards the root node. The tree structure guarantees that it is possible to obtain messages from all other adjoining nodes before passing the message on. This continues until the root has obtained messages from all of its adjoining nodes. The second step involves passing the messages back out: starting at the root, messages are passed in the reverse direction. The algorithm is completed when all leaves have received their messages.

Approximate algorithm for general graphs

Although it was originally designed for acyclic graphical models, the Belief Propagation algorithm can be used in general

graph Graph may refer to: Mathematics *Graph (discrete mathematics), a structure made of vertices and edges **Graph theory, the study of such graphs and their properties *Graph (topology), a topological space resembling a graph in the sense of discret ...

s. The algorithm is then sometimes called loopy belief propagation, because graphs typically contain

cycle Cycle, cycles, or cyclic may refer to: Anthropology and social sciences * Cyclic history, a theory of history * Cyclical theory, a theory of American political history associated with Arthur Schlesinger, Sr. * Social cycle, various cycles in ...

s, or loops. The initialization and scheduling of message updates must be adjusted slightly (compared with the previously described schedule for acyclic graphs) because graphs might not contain any leaves. Instead, one initializes all variable messages to 1 and uses the same message definitions above, updating all messages at every iteration (although messages coming from known leaves or tree-structured subgraphs may no longer need updating after sufficient iterations). It is easy to show that in a tree, the message definitions of this modified procedure will converge to the set of message definitions given above within a number of iterations equal to the

diameter In geometry, a diameter of a circle is any straight line segment that passes through the centre of the circle and whose endpoints lie on the circle. It can also be defined as the longest Chord (geometry), chord of the circle. Both definitions a ...

of the tree. The precise conditions under which loopy belief propagation will converge are still not well understood; it is known that on graphs containing a single loop it converges in most cases, but the probabilities obtained might be incorrect. Several sufficient (but not necessary) conditions for convergence of loopy belief propagation to a unique fixed point exist. There exist graphs which will fail to converge, or which will oscillate between multiple states over repeated iterations. Techniques like

EXIT chart An extrinsic information transfer chart, commonly called an EXIT chart, is a technique to aid the construction of good iteratively-decoded error-correcting codes (in particular low-density parity-check (LDPC) codes and Turbo codes). EXIT charts ...

s can provide an approximate visualization of the progress of belief propagation and an approximate test for convergence. There are other approximate methods for marginalization including

variational method The calculus of variations (or variational calculus) is a field of mathematical analysis that uses variations, which are small changes in functions and functionals, to find maxima and minima of functionals: mappings from a set of functions t ...

s and

Monte Carlo method Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be ...

s. One method of exact marginalization in general graphs is called the

junction tree algorithm The junction tree algorithm (also known as 'Clique Tree') is a method used in machine learning to extract marginalization in general graphs. In essence, it entails performing belief propagation on a modified graph called a junction tree. The g ...

, which is simply belief propagation on a modified graph guaranteed to be a tree. The basic premise is to eliminate cycles by clustering them into single nodes.

Related algorithm and complexity issues

A similar algorithm is commonly referred to as the

Viterbi algorithm The Viterbi algorithm is a dynamic programming algorithm for obtaining the maximum a posteriori probability estimate of the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events. This i ...

, but also known as a special case of the max-product or min-sum algorithm, which solves the related problem of maximization, or most probable explanation. Instead of attempting to solve the marginal, the goal here is to find the values

\mathbf

that maximizes the global function (i.e. most probable values in a probabilistic setting), and it can be defined using the

arg max In mathematics, the arguments of the maxima (abbreviated arg max or argmax) and arguments of the minima (abbreviated arg min or argmin) are the input points at which a Function (mathematics), function output value is Maxima and minima, maximized ...

: :

\operatorname*_ g(\mathbf).

An algorithm that solves this problem is nearly identical to belief propagation, with the sums replaced by maxima in the definitions. It is worth noting that

problems like marginalization and maximization are

NP-hard In computational complexity theory, a computational problem ''H'' is called NP-hard if, for every problem ''L'' which can be solved in non-deterministic polynomial-time, there is a polynomial-time reduction from ''L'' to ''H''. That is, assumi ...

to solve exactly and approximately (at least for

relative error The approximation error in a given data value represents the significant discrepancy that arises when an exact, true value is compared against some approximation derived for it. This inherent error in approximation can be quantified and express ...

) in a graphical model. More precisely, the marginalization problem defined above is #P-complete and maximization is

NP-complete In computational complexity theory, NP-complete problems are the hardest of the problems to which ''solutions'' can be verified ''quickly''. Somewhat more precisely, a problem is NP-complete when: # It is a decision problem, meaning that for any ...

. The memory usage of belief propagation can be reduced through the use of the Island algorithm (at a small cost in time complexity).

Relation to free energy

The sum-product algorithm is related to the calculation of free energy in

thermodynamics Thermodynamics is a branch of physics that deals with heat, Work (thermodynamics), work, and temperature, and their relation to energy, entropy, and the physical properties of matter and radiation. The behavior of these quantities is governed b ...

. Let ''Z'' be the partition function. A probability distribution :

P(\mathbf) = \frac \prod_ f_j(x_j)

(as per the factor graph representation) can be viewed as a measure of the

internal energy The internal energy of a thermodynamic system is the energy of the system as a state function, measured as the quantity of energy necessary to bring the system from its standard internal state to its present internal state of interest, accoun ...

present in a system, computed as :

E(\mathbf) = -\log \prod_ f_j(x_j).

The free energy of the system is then :

F = U - H = \sum_ P(\mathbf) E(\mathbf) + \sum_  P(\mathbf) \log P(\mathbf).

It can then be shown that the points of convergence of the sum-product algorithm represent the points where the free energy in such a system is minimized. Similarly, it can be shown that a fixed point of the iterative belief propagation algorithm in graphs with cycles is a stationary point of a free energy approximation.

Generalized belief propagation (GBP)

Belief propagation algorithms are normally presented as message update equations on a factor graph, involving messages between variable nodes and their neighboring factor nodes and vice versa. Considering messages between ''regions'' in a graph is one way of generalizing the belief propagation algorithm. There are several ways of defining the set of regions in a graph that can exchange messages. One method uses ideas introduced by Kikuchi in the physics literature, and is known as Kikuchi's

cluster variation method may refer to: Science and technology Astronomy * Cluster (spacecraft), constellation of four European Space Agency spacecraft * Cluster II (spacecraft), a European Space Agency mission to study the magnetosphere * Asteroid cluster, a small ...

. Improvements in the performance of belief propagation algorithms are also achievable by breaking the replicas symmetry in the distributions of the fields (messages). This generalization leads to a new kind of algorithm called

survey propagation Survey may refer to: * Survey (human research), including opinion polls * Surveying, the technique and science of measuring positions and distances on Earth * Statistical survey, a method for collecting quantitative information about items in a p ...

(SP), which have proved to be very efficient in

problems like

and

graph coloring In graph theory, graph coloring is a methodic assignment of labels traditionally called "colors" to elements of a Graph (discrete mathematics), graph. The assignment is subject to certain constraints, such as that no two adjacent elements have th ...

. The cluster variational method and the survey propagation algorithms are two different improvements to belief propagation. The name generalized survey propagation (GSP) is waiting to be assigned to the algorithm that merges both generalizations.

Gaussian belief propagation (GaBP)

Gaussian belief propagation is a variant of the belief propagation algorithm when the underlying distributions are Gaussian. The first work analyzing this special model was the seminal work of Weiss and Freeman. The GaBP algorithm solves the following marginalization problem: :

P(x_i) = \frac \int_ \exp(-\tfrac 1 2 x^TAx + b^Tx)\,dx_j

where Z is a normalization constant, ''A'' is a symmetric

positive definite matrix In mathematics, a symmetric matrix M with real entries is positive-definite if the real number \mathbf^\mathsf M \mathbf is positive for every nonzero real column vector \mathbf, where \mathbf^\mathsf is the row vector transpose of \mathbf. Mo ...

(inverse covariance matrix a.k.a.

precision matrix In statistics, the precision matrix or concentration matrix is the matrix inverse of the covariance matrix or dispersion matrix, P = \Sigma^. For univariate distributions, the precision matrix degenerates into a scalar precision, defined as the ...

) and ''b'' is the shift vector. Equivalently, it can be shown that using the Gaussian model, the solution of the marginalization problem is equivalent to the

MAP A map is a symbolic depiction of interrelationships, commonly spatial, between things within a space. A map may be annotated with text and graphics. Like any graphic, a map may be fixed to paper or other durable media, or may be displayed on ...

assignment problem: :

\underset\  P(x) = \frac \exp(-\tfrac 1 2 x^TAx + b^Tx).

This problem is also equivalent to the following minimization problem of the quadratic form: :

\underset\ 1/2x^TAx - b^Tx.

Which is also equivalent to the linear system of equations :

Ax = b.

Convergence of the GaBP algorithm is easier to analyze (relatively to the general BP case) and there are two known sufficient convergence conditions. The first one was formulated by Weiss et al. in the year 2000, when the information matrix ''A'' is

diagonally dominant In mathematics, a square matrix is said to be diagonally dominant if, for every row of the matrix, the magnitude of the diagonal entry in a row is greater than or equal to the sum of the magnitudes of all the other (off-diagonal) entries in that ro ...

. The second convergence condition was formulated by Johnson et al. in 2006, when the

spectral radius ''Spectral'' is a 2016 Hungarian-American military science fiction action film co-written and directed by Nic Mathieu. Written with Ian Fried (screenwriter), Ian Fried & George Nolfi, the film stars James Badge Dale as DARPA research scientist Ma ...

of the matrix :

\rho (I - , D^AD^, ) < 1 \,

where ''D'' = diag(''A''). Later, Su and Wu established the necessary and sufficient convergence conditions for synchronous GaBP and damped GaBP, as well as another sufficient convergence condition for asynchronous GaBP. For each case, the convergence condition involves verifying 1) a set (determined by A) being non-empty, 2) the spectral radius of a certain matrix being smaller than one, and 3) the singularity issue (when converting BP message into belief) does not occur. The GaBP algorithm was linked to the linear algebra domain,Gaussian belief propagation solver for systems of linear equations. By O. Shental, D. Bickson, P. H. Siegel, J. K. Wolf, and D. Dolev, IEEE Int. Symp. on Inform. Theory (ISIT), Toronto, Canada, July 2008. http://www.cs.huji.ac.il/labs/danss/p2p/gabp/ and it was shown that the GaBP algorithm can be viewed as an iterative algorithm for solving the linear system of equations ''Ax'' = ''b'' where ''A'' is the information matrix and ''b'' is the shift vector. Empirically, the GaBP algorithm is shown to converge faster than classical iterative methods like the Jacobi method, the Gauss–Seidel method,

successive over-relaxation In numerical linear algebra, the method of successive over-relaxation (SOR) is a variant of the Gauss–Seidel method for solving a linear system of equations, resulting in faster convergence. A similar method can be used for any slowly convergi ...

, and others.Linear Detection via Belief Propagation. Danny Bickson, Danny Dolev, Ori Shental, Paul H. Siegel and Jack K. Wolf. In the 45th Annual Allerton Conference on Communication, Control, and Computing, Allerton House, Illinois, 7 Sept.. http://www.cs.huji.ac.il/labs/danss/p2p/gabp/ Additionally, the GaBP algorithm is shown to be immune to numerical problems of the preconditioned

conjugate gradient method In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is positive-semidefinite. The conjugate gradient method is often implemented as an it ...

Distributed large scale network utility maximization. D. Bickson, Y. Tock, A. Zymnis, S. Boyd and D. Dolev. In the International symposium on information theory (ISIT), July 2009. http://www.cs.huji.ac.il/labs/danss/p2p/gabp/

Syndrome-based BP decoding

The previous description of BP algorithm is called the codeword-based decoding, which calculates the approximate marginal probability

P(x, X)

, given received codeword

X

. There is an equivalent form, which calculate

P(e, s)

, where

s

is the syndrome of the received codeword

X

and

e

is the decoded error. The decoded input vector is

x=X+e

. This variation only changes the interpretation of the mass function

f_a(X_a)

. Explicitly, the messages are :

\forall x_v\in Dom(v),\; \mu_ (x_v) = P(X_v)\prod_ \mu_ (x_v).

:where

P(X_v)

is the prior error probability on variable

v

\forall x_v\in Dom(v),\; \mu_ (x_v) = \sum_ \delta(\text('_v)=) \prod_ \mu_ (x'_).

This syndrome-based decoder doesn't require information on the received bits, thus can be adapted to quantum codes, where the only information is the measurement syndrome. In the binary case,

x_i \in \

, those messages can be simplified to cause an exponential reduction of

2^

in the complexity Define log-likelihood ratio

l_v=\log \frac

L_a=\log \frac

, then :

v \to a: l_v=l_v^+\sum_ (L_)

a \to v: L_a = (-1)^ 2 \tanh^ \prod_ \tanh (l_/2)

:where

l_v^=\log (P(x_v=0)/P(x_v=1)) = \text

The posterior log-likelihood ratio can be estimated as

l_v=l_v^+\sum_ (L_)

Motivation

Description of the sum-product algorithm

Exact algorithm for trees

Approximate algorithm for general graphs

Related algorithm and complexity issues

Relation to free energy

Generalized belief propagation (GBP)

Gaussian belief propagation (GaBP)

Syndrome-based BP decoding

References

Further reading