Restricted Boltzmann machine

A restricted Boltzmann machine (RBM) (also called a restricted Sherrington–Kirkpatrick model with external field or restricted stochastic Ising–Lenz–Little model) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. RBMs were initially proposed under the name Harmonium by Paul Smolensky in 1986, and rose to prominence after Geoffrey Hinton and collaborators used fast learning algorithms for them in the mid-2000s. RBMs have found applications in dimensionality reduction, classification, collaborative filtering, feature learning, topic modelling (Ruslan Salakhutdinov and Geoffrey Hinton (2010). Replicated softmax: an undirected topic model. ''Neural Information Processing Systems'' 23.), immunology, and even many-body quantum mechanics. They can be trained in either supervised or unsupervised ways, depending on the task.

As their name implies, RBMs are a variant of Boltzmann machines, with the restriction that their neurons must form a bipartite graph:
* a pair of nodes from each of the two groups of units (commonly referred to as the "visible" and "hidden" units respectively) may have a symmetric connection between them; and
* there are no connections between nodes within a group.

By contrast, "unrestricted" Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm (Miguel Á. Carreira-Perpiñán and Geoffrey Hinton (2005). On contrastive divergence learning. ''Artificial Intelligence and Statistics''.).

Restricted Boltzmann machines can also be used in deep learning networks. In particular, deep belief networks can be formed by "stacking" RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation.


Structure

The standard type of RBM has binary-valued (Boolean) hidden and visible units, and consists of a matrix of weights W of size m\times n. Each weight element (w_{ij}) of the matrix is associated with the connection between the visible (input) unit v_i and the hidden unit h_j. In addition, there are bias weights (offsets) a_i for v_i and b_j for h_j. Given the weights and biases, the ''energy'' of a configuration (pair of Boolean vectors) (v,h) is defined as

:E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j v_i w_{ij} h_j

or, in matrix notation,

:E(v,h) = -a^{\mathsf{T}} v - b^{\mathsf{T}} h - v^{\mathsf{T}} W h.

This energy function is analogous to that of a Hopfield network.
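For concreteness, the energy of a given configuration can be evaluated directly from this definition. A minimal sketch (assuming NumPy; the toy sizes, seed, and variable names are illustrative, not from the original sources):

```python
import numpy as np

def energy(v, h, W, a, b):
    """Energy E(v, h) = -a^T v - b^T h - v^T W h of a binary RBM."""
    return -a @ v - b @ h - v @ W @ h

# Toy configuration: m = 3 visible units, n = 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))  # weights w_ij, shape (m, n)
a = np.zeros(3)                         # visible biases a_i
b = np.zeros(2)                         # hidden biases b_j
v = np.array([1.0, 0.0, 1.0])           # a visible configuration
h = np.array([0.0, 1.0])                # a hidden configuration
print(energy(v, h, W, a, b))            # summation and matrix forms agree
```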
As with general Boltzmann machines, the joint probability distribution for the visible and hidden vectors is defined in terms of the energy function as follows (Geoffrey Hinton (2010). ''A Practical Guide to Training Restricted Boltzmann Machines''. UTML TR 2010–003, University of Toronto.):

:P(v,h) = \frac{1}{Z} e^{-E(v,h)}

where Z is a partition function defined as the sum of e^{-E(v,h)} over all possible configurations, which can be interpreted as a normalizing constant to ensure that the probabilities sum to 1. The marginal probability of a visible vector is the sum of P(v,h) over all possible hidden layer configurations,

:P(v) = \frac{1}{Z} \sum_{\{h\}} e^{-E(v,h)},

and vice versa.
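Because Z sums e^{-E(v,h)} over all 2^{m+n} joint configurations, it is intractable for realistic model sizes, but for a toy model it can be computed by brute force. A sketch under that assumption (NumPy again; `energy` is the illustrative helper from the previous sketch):

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    return -a @ v - b @ h - v @ W @ h

def partition_function(W, a, b):
    """Brute-force Z: exponential in m + n, feasible only for toy models."""
    m, n = W.shape
    return sum(
        np.exp(-energy(np.array(v), np.array(h), W, a, b))
        for v in itertools.product([0.0, 1.0], repeat=m)
        for h in itertools.product([0.0, 1.0], repeat=n)
    )

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(3, 2))
a, b = np.zeros(3), np.zeros(2)
Z = partition_function(W, a, b)

# Sanity check: P(v, h) = exp(-E(v, h)) / Z sums to 1 over all configurations.
total = sum(
    np.exp(-energy(np.array(v), np.array(h), W, a, b)) / Z
    for v in itertools.product([0.0, 1.0], repeat=3)
    for h in itertools.product([0.0, 1.0], repeat=2)
)
print(total)  # ~1.0 up to floating-point error
```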
Since the underlying graph structure of the RBM is bipartite (meaning there are no intra-layer connections), the hidden unit activations are mutually independent given the visible unit activations, and conversely, the visible unit activations are mutually independent given the hidden unit activations. That is, for ''m'' visible units and ''n'' hidden units, the conditional probability of a configuration of the visible units ''v'', given a configuration of the hidden units ''h'', is

:P(v\mid h) = \prod_{i=1}^m P(v_i\mid h).

Conversely, the conditional probability of ''h'' given ''v'' is

:P(h\mid v) = \prod_{j=1}^n P(h_j\mid v).

The individual activation probabilities are given by

:P(h_j=1\mid v) = \sigma\left(b_j + \sum_{i=1}^m w_{ij} v_i\right) and P(v_i=1\mid h) = \sigma\left(a_i + \sum_{j=1}^n w_{ij} h_j\right),

where \sigma denotes the logistic sigmoid.

The visible units of a restricted Boltzmann machine can be multinomial, although the hidden units are Bernoulli. In this case, the logistic function for visible units is replaced by the softmax function

:P(v_i^k = 1\mid h) = \frac{\exp(a_i^k + \sum_j W_{ij}^k h_j)}{\sum_{k'=1}^K \exp(a_i^{k'} + \sum_j W_{ij}^{k'} h_j)},

where ''K'' is the number of discrete values that the visible units can take. Such multinomial units are applied in topic modeling and recommender systems.
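These factorized conditionals are what make block Gibbs sampling cheap in an RBM: all hidden units can be sampled in parallel given the visibles, and vice versa. A minimal sketch of one block Gibbs step (assuming NumPy; the function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b, rng):
    """Sample each h_j independently from P(h_j=1|v) = sigma(b_j + sum_i w_ij v_i)."""
    p = sigmoid(b + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, a, rng):
    """Sample each v_i independently from P(v_i=1|h) = sigma(a_i + sum_j w_ij h_j)."""
    p = sigmoid(a + W @ h)
    return (rng.random(p.shape) < p).astype(float), p

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(3, 2))
a, b = np.zeros(3), np.zeros(2)
v = np.array([1.0, 0.0, 1.0])
h, _ = sample_h_given_v(v, W, b, rng)      # first half of a block Gibbs step
v_new, _ = sample_v_given_h(h, W, a, rng)  # second half
```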


Relation to other models

Restricted Boltzmann machines are a special case of Boltzmann machines and Markov random fields (Asja Fischer and Christian Igel. Training Restricted Boltzmann Machines: An Introduction. ''Pattern Recognition'' 47, pp. 25–39, 2014.). The graphical model of RBMs corresponds to that of factor analysis.


Training algorithm

Restricted Boltzmann machines are trained to maximize the product of probabilities assigned to some training set V (a matrix, each row of which is treated as a visible vector v),

:\arg\max_W \prod_{v \in V} P(v),

or equivalently, to maximize the expected log probability of a training sample v selected randomly from V:

:\arg\max_W \mathbb{E}\left[\log P(v)\right].

The algorithm most often used to train RBMs, that is, to optimize the weight matrix W, is the contrastive divergence (CD) algorithm due to Hinton, originally developed to train PoE (product of experts) models. The algorithm performs Gibbs sampling and is used inside a gradient descent procedure (similar to the way backpropagation is used inside such a procedure when training feedforward neural nets) to compute the weight updates.

The basic, single-step contrastive divergence (CD-1) procedure for a single sample can be summarized as follows:
# Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.
# Compute the outer product of v and h and call this the ''positive gradient''.
# From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this (Gibbs sampling step).
# Compute the outer product of v' and h' and call this the ''negative gradient''.
# Let the update to the weight matrix W be the positive gradient minus the negative gradient, times some learning rate: \Delta W = \epsilon (v h^{\mathsf{T}} - v' h'^{\mathsf{T}}).
# Update the biases a and b analogously: \Delta a = \epsilon (v - v'), \Delta b = \epsilon (h - h').

A Practical Guide to Training RBMs written by Hinton can be found on his homepage.
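A minimal sketch of this CD-1 update for a single binary sample (assuming NumPy; Hinton's guide discusses practical variants, such as using probabilities rather than binary samples for some of these quantities):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v, W, a, b, lr, rng):
    """One CD-1 step for a single visible vector v; updates W, a, b in place."""
    # 1. Hidden probabilities for v, and a sampled hidden vector h.
    ph = sigmoid(b + v @ W)
    h = (rng.random(ph.shape) < ph).astype(float)
    # 2. Positive gradient v h^T.
    pos = np.outer(v, h)
    # 3. Gibbs step: reconstruct visibles v', then resample hiddens h'.
    pv = sigmoid(a + W @ h)
    v_prime = (rng.random(pv.shape) < pv).astype(float)
    ph_prime = sigmoid(b + v_prime @ W)
    h_prime = (rng.random(ph_prime.shape) < ph_prime).astype(float)
    # 4. Negative gradient v' h'^T.
    neg = np.outer(v_prime, h_prime)
    # 5.-6. Parameter updates scaled by the learning rate epsilon.
    W += lr * (pos - neg)
    a += lr * (v - v_prime)
    b += lr * (h - h_prime)
    return W, a, b
```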


Stacked Restricted Boltzmann Machine

* The difference between stacked restricted Boltzmann machines and a single RBM is structural: in an RBM, lateral connections within a layer are prohibited to make analysis tractable, whereas the stacked Boltzmann machine combines an unsupervised three-layer network with symmetric weights and a supervised fine-tuned top layer for recognizing three classes.
* Stacked Boltzmann machines are used to understand natural language, retrieve documents, generate images, and classify. These functions are trained with unsupervised pre-training and/or supervised fine-tuning. Unlike the undirected, symmetric top layer, the RBM's connecting layer is two-way and asymmetric; the stacked network's connections form three layers with asymmetric weights, and the two networks are combined into one.
* The stacked Boltzmann machine does share similarities with the RBM: its neuron is a stochastic binary Hopfield neuron, the same as in the restricted Boltzmann machine. The energy of both models is given by Gibbs' probability measure: E = -\frac{1}{2}\sum_{i,j} w_{ij} s_i s_j + \sum_i \theta_i s_i. The training process of the stacked machine is similar to that of the RBM: it trains one layer at a time, approximates the equilibrium state with a 3-segment pass, and does not perform backpropagation (a sketch of this greedy layer-wise scheme follows this list). It uses both supervised and unsupervised training on different RBMs for pre-training, classification and recognition. The training uses contrastive divergence with Gibbs sampling: \Delta w_{ij} = \epsilon (p_{ij} - p'_{ij}).
* The stacked machine's strength is that it performs a non-linear transformation, so it is easy to expand and can yield a hierarchical layer of features. Its weakness is the complicated calculation of integer- and real-valued neurons. It does not follow the gradient of any function, so contrastive divergence is only a heuristic approximation to maximum likelihood.
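A sketch of the greedy layer-wise scheme referenced in the list above (assuming NumPy and the illustrative `cd1_update` and `sigmoid` helpers from the previous section; this is a generic stacking recipe, not the exact three-layer architecture described above):

```python
import numpy as np

def pretrain_stack(data, layer_sizes, lr=0.1, epochs=10, seed=0):
    """Train one RBM per layer; each layer's hidden probabilities become
    the next layer's training data. No backpropagation is performed."""
    rng = np.random.default_rng(seed)
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        m = inputs.shape[1]
        W = rng.normal(scale=0.01, size=(m, n_hidden))
        a, b = np.zeros(m), np.zeros(n_hidden)
        for _ in range(epochs):
            for v in inputs:
                W, a, b = cd1_update(v, W, a, b, lr, rng)
        layers.append((W, a, b))
        # Propagate: hidden probabilities act as the "visible" data
        # for the next RBM in the stack.
        inputs = sigmoid(b + inputs @ W)
    return layers

# Usage: pretrain_stack(binary_data, layer_sizes=[64, 32]) would yield two
# stacked RBMs; a supervised layer could then be fine-tuned on top.
```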



See also

* Autoencoder
* Helmholtz machine



Bibliography

* Chris Nicholson and Adam Gibson. "Understanding RBMs". ''Deeplearning4j Documentation''. http://deeplearning4j.org/understandingRBMs.html (archived 2016-09-20 at https://web.archive.org/web/20160920122139/http://deeplearning4j.org/understandingRBMs.html).


External links

* Python implementation of a Bernoulli RBM and tutorial
* SimpleRBM is a very small RBM code (24 kB) useful for learning about how RBMs learn and work.
* Julia implementation of restricted Boltzmann machines: https://github.com/cossio/RestrictedBoltzmannMachines.jl