A maximal entropy random walk (MERW) is a popular type of biased random walk on a graph, in which transition probabilities are chosen according to the principle of maximum entropy, which says that the probability distribution which best represents the current state of knowledge is the one with largest entropy. While a standard random walk samples for every vertex a uniform probability distribution of outgoing edges, locally maximizing the entropy rate, MERW maximizes it globally (average entropy production) by sampling a uniform probability distribution among all paths in a given graph.

MERW is used in various fields of science. A direct application is choosing probabilities to maximize the transmission rate through a constrained channel, analogously to Fibonacci coding. Its properties have also made it useful, for example, in the analysis of complex networks: link prediction, community detection, robust transport over networks and centrality measures. It is also used in image analysis, for example for detecting visual saliency regions, object localization (L. Wang, J. Zhao, X. Hu, J. Lu, ''Weakly supervised object localization via maximal entropy random walk'', ICIP, 2014), tampering detection, or the tractography problem. Additionally, it recreates some properties of quantum mechanics, suggesting a way to repair the discrepancy between diffusion models and quantum predictions, such as Anderson localization.


Basic model

Consider a graph with n vertices, defined by an adjacency matrix A \in \{0,1\}^{n \times n}: A_{ij}=1 if there is an edge from vertex i to j, 0 otherwise. For simplicity, assume it is an undirected graph, which corresponds to a symmetric A; however, MERWs can also be generalized for directed and weighted graphs (obtaining, for example, a Boltzmann distribution among paths instead of a uniform one).

We would like to choose a random walk as a Markov process on this graph: for every vertex i and its outgoing edge to j, choose the probability S_{ij} of the walker randomly using this edge after visiting i. Formally, find a stochastic matrix S (containing the transition probabilities of a Markov chain) such that

* 0 \leq S_{ij} \leq A_{ij} for all i, j, and
* \sum_{j=1}^n S_{ij} = 1 for all i.

Assuming this graph is connected and not periodic, ergodic theory says that evolution of this stochastic process leads to some stationary probability distribution \rho such that \rho S = \rho.

Using Shannon entropy for every vertex and averaging over the probability of visiting this vertex (to be able to use its entropy), we get the following formula for the average entropy production (entropy rate) of the stochastic process:

:H(S) = \sum_{i=1}^n \rho_i \sum_{j=1}^n S_{ij} \log(1/S_{ij}).

This definition turns out to be equivalent to the asymptotic average entropy (per length) of the probability distribution in the space of paths for this stochastic process.

In the standard random walk, referred to here as the generic random walk (GRW), we naturally choose that each outgoing edge is equally probable:

:S_{ij} = \frac{A_{ij}}{\sum_{k=1}^n A_{ik}}.

For a symmetric A it leads to a stationary probability distribution \rho with

:\rho_i = \frac{\sum_{j=1}^n A_{ij}}{\sum_{j=1}^n \sum_{k=1}^n A_{jk}}.

It locally maximizes entropy production (uncertainty) for every vertex, but usually leads to a suboptimal averaged global entropy rate H(S).

MERW chooses the stochastic matrix which maximizes H(S), or equivalently assumes a uniform probability distribution among all paths in a given graph. Its formula is obtained by first calculating the dominant eigenvalue \lambda and corresponding eigenvector \psi of the adjacency matrix, i.e. the largest \lambda \in \mathbb{R} with corresponding \psi \in \mathbb{R}^n such that \psi A = \lambda \psi. Then the stochastic matrix and stationary probability distribution are given by

:S_{ij} = \frac{A_{ij}}{\lambda} \frac{\psi_j}{\psi_i},

for which every possible path of length l from the i-th to the j-th vertex has probability

:\frac{1}{\lambda^l} \frac{\psi_j}{\psi_i}.

Its entropy rate is \log(\lambda) and the stationary probability distribution \rho is

:\rho_i = \frac{\psi_i^2}{\|\psi\|^2}.

In contrast to GRW, the MERW transition probabilities generally depend on the structure of the entire graph, making it nonlocal. Hence, they should not be imagined as directly applied by the walker: if random-looking decisions are made based on the local situation, like a person would make them, the GRW approach is more appropriate. MERW is based on the principle of maximum entropy, making it the safest assumption when we do not have any additional knowledge about the system. For example, it would be appropriate for modelling our knowledge about an object performing some complex dynamics, not necessarily random, like a particle.


Sketch of derivation

Assume for simplicity that the considered graph is undirected, connected and aperiodic, allowing one to conclude from the Perron–Frobenius theorem that the dominant eigenvector is unique. Hence A^l can be asymptotically (l \rightarrow \infty) approximated by \lambda^l \psi \psi^T (or \lambda^l |\psi\rangle\langle\psi| in bra–ket notation), where \psi is taken normalized (\|\psi\| = 1).

MERW requires a uniform distribution along paths. The number m_{i,2l} of paths of length 2l with vertex i in the center is

:m_{i,2l} = \sum_{j=1}^n \sum_{k=1}^n \left(A^l\right)_{ji} \left(A^l\right)_{ik} \approx \sum_{j=1}^n \sum_{k=1}^n \left(\lambda^l \psi \psi^\top\right)_{ji} \left(\lambda^l \psi \psi^\top\right)_{ik} = \sum_{j=1}^n \sum_{k=1}^n \lambda^{2l} \psi_j \psi_i \psi_i \psi_k = \lambda^{2l} \psi_i^2 \underbrace{\left(\sum_{j=1}^n \psi_j\right)^2}_{\text{independent of } i},

hence for all i,

:\rho_i = \lim_{l\to\infty} \frac{m_{i,2l}}{\sum_{k=1}^n m_{k,2l}} = \lim_{l\to\infty} \frac{\lambda^{2l} \psi_i^2 \left(\sum_j \psi_j\right)^2}{\lambda^{2l} \left(\sum_k \psi_k^2\right) \left(\sum_j \psi_j\right)^2} = \frac{\psi_i^2}{\sum_{k=1}^n \psi_k^2} = \frac{\psi_i^2}{\|\psi\|^2}.

Analogously calculating the probability distribution for two succeeding vertices, one obtains that the probability of being at the i-th vertex and next at the j-th vertex is

:\frac{\psi_i A_{ij} \psi_j}{\lambda \sum_{k=1}^n \psi_k^2} = \frac{A_{ij}}{\lambda} \frac{\psi_i \psi_j}{\|\psi\|^2}.

Dividing by the probability of being at the i-th vertex, i.e. \rho_i, gives the conditional probability S_{ij} of the j-th vertex being next after the i-th vertex:

:S_{ij} = \frac{A_{ij}}{\lambda} \frac{\psi_j}{\psi_i}.
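
The asymptotic approximation and the resulting conditional probabilities can be verified numerically. A minimal sketch, reusing the small example graph from above (chosen non-bipartite so that it is aperiodic):

    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)
    lam, psi = eigvals[-1], np.abs(eigvecs[:, -1])   # eigh normalizes: ||psi|| = 1

    # A^l approaches lambda^l * psi psi^T as l grows:
    l = 30
    Al = np.linalg.matrix_power(A, l)
    print(np.max(np.abs(Al / lam**l - np.outer(psi, psi))))   # small, shrinks with l

    # P(at i, then at j) divided by rho_i recovers the MERW transition matrix:
    P2 = psi[:, None] * A * psi[None, :] / lam
    rho = psi**2
    print(np.max(np.abs(P2 / rho[:, None] - (A / lam) * np.outer(1.0 / psi, psi))))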


Weighted MERW: Boltzmann path ensemble

We have assumed that A_{ij} \in \{0,1\}, yielding a MERW corresponding to the uniform ensemble among paths. However, the above derivation works for any real nonnegative A for which the Perron–Frobenius theorem applies. Given A_{ij} = \exp(-E_{ij}), the probability of a particular path (\gamma_0, \ldots, \gamma_l) of length l is (for normalized \psi)

:\textrm{Pr}(\gamma_0, \ldots, \gamma_l) = \rho_{\gamma_0} S_{\gamma_0 \gamma_1} \cdots S_{\gamma_{l-1} \gamma_l} = \psi_{\gamma_0} \frac{A_{\gamma_0 \gamma_1} \cdots A_{\gamma_{l-1} \gamma_l}}{\lambda^l} \psi_{\gamma_l} = \psi_{\gamma_0} \frac{e^{-(E_{\gamma_0 \gamma_1} + \cdots + E_{\gamma_{l-1} \gamma_l})}}{\lambda^l} \psi_{\gamma_l},

which is the same as the Boltzmann distribution of paths with energy defined as the sum of E_{ij} over the edges of the path. For example, this can be used with the transfer matrix to calculate the probability distribution of patterns in the Ising model.
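
As a consistency check, the probability of a concrete path can be computed both ways: as a product of Markov transition probabilities, and from the Boltzmann form above. A small numpy sketch with made-up symmetric edge energies (values chosen only for illustration):

    import numpy as np

    # Hypothetical edge energies; A_ij = exp(-E_ij), with exp(-inf) = 0
    # encoding missing edges (here: no self-loops).
    E = np.array([[np.inf, 1.0, 0.5],
                  [1.0, np.inf, 2.0],
                  [0.5, 2.0, np.inf]])
    A = np.exp(-E)
    eigvals, eigvecs = np.linalg.eigh(A)
    lam, psi = eigvals[-1], np.abs(eigvecs[:, -1])   # ||psi|| = 1
    S = (A / lam) * np.outer(1.0 / psi, psi)
    rho = psi**2

    path = [0, 2, 1, 0]                  # an example path of length l = 3
    p_chain = rho[path[0]] * np.prod([S[a, b] for a, b in zip(path, path[1:])])
    energy = sum(E[a, b] for a, b in zip(path, path[1:]))
    p_boltzmann = psi[path[0]] * np.exp(-energy) / lam**(len(path) - 1) * psi[path[-1]]
    print(p_chain, p_boltzmann)          # identical: Boltzmann ensemble of paths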


Examples

Let us first look at a simple nontrivial situation: Fibonacci coding, where we want to transmit a message as a sequence of 0s and 1s, but not using two successive 1s: after a 1 there has to be a 0. To maximize the amount of information transmitted in such a sequence, we should assume a uniform probability distribution in the space of all possible sequences fulfilling this constraint. To practically use such long sequences, after a 1 we have to use a 0, but there remains the freedom of choosing the probability of a 0 after a 0. Let us denote this probability by q. Entropy coding allows encoding a message using this chosen probability distribution. The stationary probability distribution of symbols for a given q turns out to be \rho = (1/(2-q), 1 - 1/(2-q)). Hence, the entropy produced is

:H(S) = \rho_0 \left(q \log(1/q) + (1-q) \log(1/(1-q))\right),

which is maximized for q = (\sqrt{5}-1)/2 \approx 0.618, the reciprocal of the golden ratio. In contrast, a standard random walk would choose the suboptimal q = 0.5. While choosing a larger q reduces the amount of information produced after a 0, it also reduces the frequency of 1s, after which we cannot write any information.
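
The optimum can be confirmed numerically. A short sketch scanning over q (base-2 logarithms, so the entropy rate is in bits per symbol):

    import numpy as np

    def H(q):
        # Entropy rate of the constrained source: state 0 emits 0 with
        # probability q, state 1 always emits 0, and rho_0 = 1/(2-q).
        h0 = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
        return h0 / (2 - q)              # state 1 produces no entropy

    qs = np.linspace(0.01, 0.99, 9999)
    q_opt = qs[np.argmax(H(qs))]
    print(q_opt, (np.sqrt(5) - 1) / 2)              # both about 0.618
    print(H(q_opt), np.log2((1 + np.sqrt(5)) / 2))  # capacity: log2 of the golden ratio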

A more complex case is the defected one-dimensional cyclic lattice, for example a ring of 1000 connected nodes, in which every node except the defects has a self-loop (an edge to itself). In a standard random walk (GRW), the stationary probability of a defect vertex is 2/3 of that of the non-defect vertices: there is nearly no localization. The same holds for standard diffusion, which is the infinitesimal limit of a GRW. For a MERW, we first have to find the dominant eigenvector of the adjacency matrix, i.e. maximize \lambda in

:(\lambda \psi)_x = (A\psi)_x = \psi_{x-1} + (1 - V_x) \psi_x + \psi_{x+1}

for all positions x, where V_x = 1 for defects and 0 otherwise. Subtracting 3\psi_x from both sides and multiplying the equation by -1, we get

:E \psi_x = -(\psi_{x-1} - 2\psi_x + \psi_{x+1}) + V_x \psi_x,

where E = 3 - \lambda is minimized now, becoming the analog of energy. The formula inside the bracket is the discrete Laplace operator, making this equation a discrete analogue of the stationary Schrödinger equation. As in quantum mechanics, MERW predicts that the probability distribution is that of the quantum ground state, \rho_x \propto \psi_x^2, with its strongly localized density (in contrast to standard diffusion). Taking the infinitesimal limit, one gets the standard continuous stationary (time-independent) Schrödinger equation, E\psi = -C\psi_{xx} + V\psi with C = \hbar^2/2m (J. Duda, ''Extended Maximal Entropy Random Walk'', PhD Thesis, 2012).
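
The localization effect can be reproduced with a direct computation. A sketch for the defected ring described above (the defect positions are arbitrary choices for illustration):

    import numpy as np

    n, defects = 1000, {100, 900}            # ring size; hypothetical defect sites
    A = np.zeros((n, n))
    for x in range(n):
        A[x, (x + 1) % n] = A[(x + 1) % n, x] = 1.0   # ring edges
        if x not in defects:
            A[x, x] = 1.0                             # self-loop except at defects

    eigvals, eigvecs = np.linalg.eigh(A)
    psi = np.abs(eigvecs[:, -1])
    rho_merw = psi**2                        # eigh normalizes, so ||psi|| = 1

    deg = A.sum(axis=1)
    rho_grw = deg / deg.sum()                # GRW: proportional to degree

    print(rho_grw.max() / rho_grw.min())     # 1.5: almost no localization
    print(rho_merw.max() / rho_merw.min())   # orders of magnitude: strong localization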


See also

* Principle of maximum entropy
* Eigenvector centrality
* Markov chain
* Anderson localization




External links

* Gábor Simonyi
* Y. Lin, Z. Zhang, ''Mean first-passage time for maximal-entropy random walks in complex networks'', Scientific Reports, 2014.
* Electron Conductance Models Using Maximal Entropy Random Walks, Wolfram Demonstrations Project