Maximal Entropy Random Walk
Maximal entropy random walk (MERW) is a popular type of biased random walk on a graph, in which the transition probabilities are chosen according to the principle of maximum entropy, which says that the probability distribution which best represents the current state of knowledge is the one with largest entropy. While the standard random walk chooses for every vertex a uniform probability distribution among its outgoing edges, locally maximizing the entropy rate, MERW maximizes it globally (average entropy production) by assuming a uniform probability distribution among all paths in a given graph.

MERW is used in various fields of science. A direct application is choosing probabilities to maximize the transmission rate through a constrained channel, analogously to Fibonacci coding. Its properties have also made it useful, for example, in analysis of complex networks, like link prediction, community detection, robust transport over networks and centrality measures, and in image analysis, for example for detecting visual saliency regions, object localization (L. Wang, J. Zhao, X. Hu, J. Lu, ''Weakly supervised object localization via maximal entropy random walk'', ICIP, 2014), tampering detection or the tractography problem. Additionally, it recreates some properties of quantum mechanics, suggesting a way to repair the discrepancy between diffusion models and quantum predictions, like Anderson localization.


Basic model

Consider a graph with n vertices, defined by an adjacency matrix A \in \{0,1\}^{n \times n}: A_{ij}=1 if there is an edge from vertex i to j, 0 otherwise. For simplicity assume it is an undirected graph, which corresponds to a symmetric A; however, MERW can also be generalized for directed and weighted graphs (obtaining for example the Boltzmann distribution among paths instead of the uniform one).

We would like to choose a random walk as a Markov process on this graph: for every vertex i and its outgoing edge to j, choose the probability S_{ij} of the walker using this edge after visiting i. Formally, find a stochastic matrix S (containing the transition probabilities of a Markov chain) such that
* 0 \leq S_{ij} \leq A_{ij} for all i, j and
* \sum_{j=1}^n S_{ij} = 1 for all i.

Assuming this graph is connected and not periodic, ergodic theory says that evolution of this stochastic process leads to some stationary probability distribution \rho such that \rho S = \rho.

Using Shannon entropy for every vertex and averaging over the probability of visiting this vertex (to be able to use its entropy), we get the following formula for average entropy production (entropy rate) of the stochastic process:
:H(S) = \sum_{i=1}^n \rho_i \sum_{j=1}^n S_{ij} \log(1/S_{ij})
This definition turns out to be equivalent to the asymptotic average entropy (per length) of the probability distribution in the space of paths for this stochastic process.

In the standard random walk, referred to here as generic random walk (GRW), we naturally choose that each outgoing edge is equally probable:
:S_{ij} = \frac{A_{ij}}{\sum_{k=1}^n A_{ik}}.
For a symmetric A it leads to a stationary probability distribution \rho with
:\rho_i = \frac{\sum_{k=1}^n A_{ik}}{\sum_{j=1}^n \sum_{k=1}^n A_{jk}}.
It locally maximizes entropy production (uncertainty) for every vertex, but usually leads to a suboptimal averaged global entropy rate H(S).

MERW chooses the stochastic matrix which maximizes H(S), or equivalently assumes a uniform probability distribution among all paths in a given graph. Its formula is obtained by first calculating the dominant eigenvalue \lambda and corresponding eigenvector \psi of the adjacency matrix, i.e. the largest \lambda \in \mathbb{R} with corresponding \psi \in \mathbb{R}^n such that \psi A = \lambda \psi. Then the stochastic matrix and stationary probability distribution are given by
:S_{ij} = \frac{A_{ij}}{\lambda} \frac{\psi_j}{\psi_i},
for which every possible path of length l from the i-th to the j-th vertex has probability
:\frac{1}{\lambda^l} \frac{\psi_j}{\psi_i}.
Its entropy rate is \log(\lambda) and the stationary probability distribution \rho is
:\rho_i = \frac{\psi_i^2}{\sum_{j=1}^n \psi_j^2}.

In contrast to GRW, the MERW transition probabilities generally depend on the structure of the entire graph (they are nonlocal). Hence, they should not be imagined as directly applied by the walker – if random-looking decisions are made based on the local situation, like a person would make, the GRW approach is more appropriate. MERW is based on the principle of maximum entropy, making it the safest assumption when we don't have any additional knowledge about the system. For example, it would be appropriate for modelling our knowledge about an object performing some complex dynamics – not necessarily random, like a particle.
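A minimal numerical sketch of these formulas in Python, assuming NumPy and a small illustrative graph (the helper name merw is ours, not a library function):

import numpy as np

def merw(A):
    """For a symmetric nonnegative adjacency matrix A, return the MERW stochastic
    matrix S, its stationary distribution rho, and the entropy rate log(lambda)."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    lam = eigenvalues[-1]                        # dominant eigenvalue
    psi = np.abs(eigenvectors[:, -1])            # Perron vector, chosen positive
    S = (A / lam) * np.outer(1.0 / psi, psi)     # S_ij = (A_ij / lambda) (psi_j / psi_i)
    rho = psi**2 / np.sum(psi**2)                # rho_i = psi_i^2 / ||psi||^2
    return S, rho, np.log(lam)

# Illustrative example: triangle 0-1-2 with a pendant vertex 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S, rho, h = merw(A)
assert np.allclose(S.sum(axis=1), 1.0)           # S is stochastic
assert np.allclose(rho @ S, rho)                 # rho is stationary
# The entropy rate H(S) indeed equals log(lambda):
logS = np.log(S, out=np.zeros_like(S), where=S > 0)
assert np.isclose(-(rho[:, None] * S * logS).sum(), h)

Note that np.linalg.eigh returns eigenvalues in ascending order, so the last eigenpair is the dominant one; for a connected graph the Perron–Frobenius theorem guarantees the corresponding eigenvector can be chosen strictly positive.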


Sketch of derivation

Assume for simplicity that the considered graph is undirected, connected and aperiodic, allowing us to conclude from the Perron–Frobenius theorem that the dominant eigenvector is unique. Hence A^l can be asymptotically (l \rightarrow \infty) approximated by \lambda^l \psi \psi^T (or \lambda^l |\psi\rangle \langle\psi| in bra–ket notation), where \psi is taken to be normalized.

MERW requires a uniform distribution along paths. The number m_{il} of paths of length 2l with the i-th vertex in the center is
:m_{il} = \sum_{j=1}^n \sum_{k=1}^n \left(A^l\right)_{ji} \left(A^l\right)_{ik} \approx \sum_{j=1}^n \sum_{k=1}^n \left(\lambda^l \psi \psi^\top\right)_{ji} \left(\lambda^l \psi \psi^\top\right)_{ik} = \sum_{j=1}^n \sum_{k=1}^n \lambda^{2l} \psi_j \psi_i \psi_i \psi_k = \lambda^{2l} \psi_i^2 \underbrace{\left(\sum_{j=1}^n \psi_j\right)\left(\sum_{k=1}^n \psi_k\right)}_{\text{independent of }i},
hence for all i,
:\rho_i = \lim_{l\rightarrow\infty} \frac{m_{il}}{\sum_{k=1}^n m_{kl}} = \lim_{l\rightarrow\infty} \frac{\lambda^{2l} \psi_i^2 \left(\sum_j \psi_j\right)^2}{\sum_{k=1}^n \lambda^{2l} \psi_k^2 \left(\sum_j \psi_j\right)^2} = \frac{\psi_i^2}{\sum_{k=1}^n \psi_k^2} = \frac{\psi_i^2}{\|\psi\|^2},
which equals \psi_i^2 for the normalized \psi. Analogously calculating the probability distribution for two succeeding vertices, one obtains that the probability of being at the i-th vertex and next at the j-th vertex is
:\frac{\psi_i A_{ij} \psi_j / \lambda}{\sum_{k=1}^n \sum_{m=1}^n \psi_k A_{km} \psi_m / \lambda} = \frac{\psi_i A_{ij} \psi_j}{\lambda \|\psi\|^2} = \frac{\psi_i A_{ij} \psi_j}{\lambda}.
Dividing by the probability of being at the i-th vertex, i.e. \rho_i = \psi_i^2, gives for the conditional probability S_{ij} of the j-th vertex being next after the i-th vertex
:S_{ij} = \frac{A_{ij}}{\lambda} \frac{\psi_j}{\psi_i}.
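Both asymptotic steps of this derivation can be checked numerically; a small sketch, reusing the illustrative graph from the previous block (connected and aperiodic, as assumed):

import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
eigenvalues, eigenvectors = np.linalg.eigh(A)
lam = eigenvalues[-1]
psi = np.abs(eigenvectors[:, -1])                # unit-norm dominant eigenvector

l = 60
Al = np.linalg.matrix_power(A, l)
# A^l / lambda^l converges to the rank-one projection psi psi^T.
assert np.allclose(Al / lam**l, np.outer(psi, psi), atol=1e-8)

# m_il: number of paths of length 2l with vertex i in the center, i.e.
# (column sum of A^l at i) times (row sum of A^l at i).
m = Al.sum(axis=0) * Al.sum(axis=1)
# Normalizing recovers the stationary distribution rho_i = psi_i^2.
assert np.allclose(m / m.sum(), psi**2)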


Weighted MERW: Boltzmann path ensemble

We have assumed that A_{ij} \in \{0,1\} for MERW corresponding to the uniform ensemble among paths. However, the above derivation also works for a real nonnegative A. Parametrizing A_{ij} = \exp(-E_{ij}) and asking for the probability of a length-l path (\gamma_0, \ldots, \gamma_l), we get:
:\textrm{Pr}(\gamma_0, \ldots, \gamma_l) = \rho_{\gamma_0} S_{\gamma_0 \gamma_1} \cdots S_{\gamma_{l-1} \gamma_l} = \psi_{\gamma_0} \frac{A_{\gamma_0 \gamma_1} \cdots A_{\gamma_{l-1} \gamma_l}}{\lambda^l} \psi_{\gamma_l} = \psi_{\gamma_0} \frac{e^{-\left(E_{\gamma_0 \gamma_1} + \ldots + E_{\gamma_{l-1} \gamma_l}\right)}}{\lambda^l} \psi_{\gamma_l},
as in the Boltzmann distribution of paths, for energy defined as the sum of E_{ij} over the given path. For example, this allows one to calculate the probability distribution of patterns in the Ising model.
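A sketch of the weighted case; the random symmetric energy matrix below is an arbitrary illustrative choice, and the chosen path is likewise arbitrary:

import numpy as np

rng = np.random.default_rng(0)
n = 5
E = rng.uniform(0.0, 2.0, (n, n))
E = (E + E.T) / 2                                # symmetric energies E_ij
A = np.exp(-E)                                   # weighted adjacency A_ij = exp(-E_ij)

eigenvalues, eigenvectors = np.linalg.eigh(A)
lam = eigenvalues[-1]
psi = np.abs(eigenvectors[:, -1])                # unit-norm Perron vector
S = (A / lam) * np.outer(1.0 / psi, psi)
rho = psi**2

# Probability of a particular path gamma under the walk...
gamma = [0, 3, 1, 4]
steps = list(zip(gamma, gamma[1:]))
p_walk = rho[gamma[0]] * np.prod([S[i, j] for i, j in steps])
# ...equals the Boltzmann form psi_{g0} exp(-sum of E) / lambda^l psi_{gl}:
energy = sum(E[i, j] for i, j in steps)
p_boltzmann = psi[gamma[0]] * np.exp(-energy) / lam**len(steps) * psi[gamma[-1]]
assert np.isclose(p_walk, p_boltzmann)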


Examples

Let us first look at a simple nontrivial situation: Fibonacci coding, where we want to transmit a message as a sequence of 0s and 1s, but not using two successive 1s: after a 1 there has to be a 0. To maximize the amount of information transmitted in such a sequence, we should assume a uniform probability distribution in the space of all possible sequences fulfilling this constraint. To practically use such long sequences, after a 1 we have to use a 0, but there remains the freedom of choosing the probability of 0 after 0. Let us denote this probability by q; then entropy coding would allow encoding a message using this chosen probability distribution. The stationary probability distribution of symbols for a given q turns out to be \rho = (1/(2-q),\, 1 - 1/(2-q)). Hence, the entropy production is
:H(S) = \rho_0 \left(q \log(1/q) + (1-q) \log(1/(1-q))\right),
which is maximized for q = (\sqrt{5}-1)/2 \approx 0.618, the reciprocal of the golden ratio. In contrast, the standard random walk would choose the suboptimal q = 0.5. While choosing a larger q reduces the amount of information produced after a 0, it also reduces the frequency of 1s, after which we cannot write any information.
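This optimum is straightforward to verify numerically; a minimal sketch (entropy in bits, so the maximal rate is \log_2 of the golden ratio, the dominant eigenvalue of the two-state constraint graph with adjacency matrix rows (1,1) and (1,0)):

import numpy as np

def entropy_rate(q):
    """Entropy rate (bits/symbol) of the constrained source: after a 1
    always emit 0; after a 0 emit 0 with probability q, 1 with 1 - q."""
    rho0 = 1.0 / (2.0 - q)                       # stationary probability of symbol 0
    return rho0 * (-q * np.log2(q) - (1 - q) * np.log2(1 - q))

q = np.linspace(0.01, 0.99, 9999)
q_star = q[np.argmax(entropy_rate(q))]
assert abs(q_star - (np.sqrt(5) - 1) / 2) < 1e-3     # optimum at 1/golden ratio

phi = (1 + np.sqrt(5)) / 2                       # golden ratio, dominant eigenvalue
assert abs(entropy_rate(q_star) - np.log2(phi)) < 1e-6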
A more complex example is a defected one-dimensional cyclic lattice: say 1000 nodes connected in a ring, where every node but the defects has a self-loop (an edge to itself). In the standard random walk (GRW) the stationary probability distribution would give each defect 2/3 of the probability of a non-defect vertex – there is nearly no localization, and analogously for standard diffusion, which is the infinitesimal limit of GRW. For MERW we have to first find the dominant eigenvector of the adjacency matrix – maximizing \lambda in
:(\lambda \psi)_x = (A\psi)_x = \psi_{x-1} + (1 - V_x)\psi_x + \psi_{x+1}
for all positions x, where V_x = 1 for defects and 0 otherwise. Subtracting both sides from 3\psi_x and multiplying the equation by −1, we get
:E \psi_x = -(\psi_{x-1} - 2\psi_x + \psi_{x+1}) + V_x \psi_x,
where E = 3 - \lambda is minimized now, becoming the analog of energy. The formula inside the bracket is the discrete Laplace operator, making this equation a discrete analogue of the stationary Schrödinger equation. As in quantum mechanics, MERW predicts that the probability distribution should lead exactly to the quantum ground state: \rho_x \propto \psi_x^2, with its strongly localized density (in contrast to standard diffusion). Taking the infinitesimal limit, one gets the standard continuous stationary (time-independent) Schrödinger equation (E\psi = -C\psi_{xx} + V\psi for C = \hbar^2/2m).J. Duda, ''Extended Maximal Entropy Random Walk'', PhD Thesis, 2012.
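The contrast between the two stationary densities is easy to reproduce numerically; a sketch with a smaller ring than above (size and defect placement are illustrative):

import numpy as np

n = 200
rng = np.random.default_rng(1)
A = np.zeros((n, n))
x = np.arange(n)
A[x, (x + 1) % n] = A[(x + 1) % n, x] = 1        # ring edges
A[x, x] = 1.0                                    # self-loops everywhere...
defects = rng.choice(n, size=20, replace=False)
A[defects, defects] = 0.0                        # ...except at the defects

# GRW: stationary probability proportional to degree, so each defect gets
# 2/3 of a non-defect vertex's probability - nearly uniform, no localization.
degree = A.sum(axis=1)
rho_grw = degree / degree.sum()
print(rho_grw.max() / rho_grw.min())             # exactly 1.5

# MERW: stationary probability is the squared dominant eigenvector, which
# concentrates in the largest defect-free stretch, like a quantum ground state.
eigenvalues, eigenvectors = np.linalg.eigh(A)
rho_merw = eigenvectors[:, -1] ** 2
print(rho_merw.max() / rho_merw.min())           # many orders of magnitude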


See also

* Principle of maximum entropy
* Eigenvector centrality
* Markov chain
* Anderson localization


References

{{reflist}}


External links

* Gábor Simonyi
* Y. Lin, Z. Zhang, ''Mean first-passage time for maximal-entropy random walks in complex networks'', Scientific Reports, 2014.
* Electron Conductance Models Using Maximal Entropy Random Walks, Wolfram Demonstrations Project