The Latent Diffusion Model (LDM) is a diffusion model architecture developed by the CompVis (Computer Vision & Learning) group at LMU Munich. Diffusion models (DMs), introduced in 2015, are trained with the objective of removing successive applications of noise (commonly Gaussian) from training images. The LDM improves on standard DMs by performing diffusion modeling in a latent space, and by allowing self-attention and cross-attention conditioning. LDMs are widely used in practical diffusion models. For instance, Stable Diffusion versions 1.1 to 2.1 were based on the LDM architecture.


Version history

Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. They used techniques from non-equilibrium thermodynamics, especially diffusion. The 2015 paper was accompanied by a software implementation in Theano. A 2019 paper proposed the noise conditional score network (NCSN), or score matching with Langevin dynamics (SMLD), and was accompanied by a software package written in PyTorch, released on GitHub. A 2020 paper proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by variational inference. The paper was accompanied by a software package written in TensorFlow, released on GitHub; it was reimplemented in PyTorch by lucidrains.

On December 20, 2021, the LDM paper was published on arXiv, and both the Stable Diffusion and LDM repositories were published on GitHub. However, they remained roughly the same, and substantial information concerning Stable Diffusion v1 was only added to GitHub on August 10, 2022.

All Stable Diffusion (SD) versions 1.1 to XL were particular instantiations of the LDM architecture. SD 1.1 to 1.4 were released by CompVis in August 2022; there is no "version 1.0". SD 1.1 was an LDM trained on the laion2B-en dataset. SD 1.1 was fine-tuned into 1.2 on more aesthetic images. SD 1.2 was fine-tuned into 1.3, 1.4, and 1.5, with 10% of the text-conditioning dropped to improve classifier-free guidance. SD 1.5 was released by RunwayML in October 2022.


Architecture

While the LDM can work for generating arbitrary data conditional on arbitrary data, for concreteness we describe its operation in conditional text-to-image generation.

The LDM consists of a variational autoencoder (VAE), a modified U-Net, and a text encoder. The VAE encoder compresses the image from pixel space to a smaller-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion to recover a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space.

The denoising step can be conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space.
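
The end-to-end flow can be summarized in pseudocode, in the style of the U-Net pseudocode below. This is an illustrative sketch, not the reference implementation: the function names (clip_text_encoder, denoising_update, and so on) are hypothetical, and the sampler is simplified to a generic iterative denoising loop.

def generate(prompt, num_steps, noise_schedule):
    cond = clip_text_encoder(tokenize(prompt))    # sequence of embedding vectors
    z = sample_gaussian_noise(shape=(4, 64, 64))  # start from pure latent noise
    for t in reversed(range(num_steps)):
        predicted_noise = unet(z, timestep_embedding(t), cond)
        # scale down the predicted noise and subtract it, per the noise schedule
        z = denoising_update(z, predicted_noise, t, noise_schedule)
    return vae_decoder(z)                         # latent (4, 64, 64) -> image (3, 512, 512)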


Variational autoencoder

To compress the image data, a variational autoencoder (VAE) is first trained on a dataset of images. The encoder part of the VAE takes an image as input and outputs a lower-dimensional latent representation of the image. This latent representation is then used as input to the U-Net. Once the model is trained, the encoder is used to encode images into latent representations, and the decoder is used to decode latent representations back into images.

Let the encoder and the decoder of the VAE be E and D. To encode an RGB image, its three channels are divided by the maximum value, resulting in a tensor x of shape (3, 512, 512) with all entries in the range [0, 1]. The encoded vector is 0.18215 \times E(2x - 1), with shape (4, 64, 64), where 0.18215 is a hyperparameter, which the original authors picked to roughly whiten the encoded vector to unit variance. Conversely, given a latent tensor y, the decoded image is (D(y / 0.18215) + 1)/2, then clipped to the range [0, 1].

In the implemented version, the encoder is a convolutional neural network (CNN) with a single self-attention mechanism near the end. It takes a tensor of shape (3, H, W) and outputs a tensor of shape (8, H/8, W/8), being the concatenation of the predicted mean and variance of the latent vector, each of shape (4, H/8, W/8). The variance is used in training, but after training, usually only the mean is taken, with the variance discarded. The decoder is also a CNN with a single self-attention mechanism near the end. It takes a tensor of shape (4, H/8, W/8) and outputs a tensor of shape (3, H, W).
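
In code, the pre- and post-processing around the VAE might look as follows. This is a minimal sketch in PyTorch, assuming E and D are the trained encoder and decoder networks operating on unbatched tensors shaped as above; the function names are hypothetical.

import torch

SCALE = 0.18215  # hyperparameter chosen to roughly whiten the latents

def encode_image(x, E):
    # x: tensor of shape (3, H, W) with entries in [0, 1]
    mean, variance = E(2 * x - 1).chunk(2, dim=0)  # split (8, H/8, W/8) into two (4, H/8, W/8)
    return SCALE * mean                            # variance discarded after training

def decode_latent(y, D):
    # y: latent tensor of shape (4, H/8, W/8)
    x = (D(y / SCALE) + 1) / 2
    return x.clamp(0, 1)                           # clip back to the valid pixel range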


U-Net

The U-Net backbone takes the following kinds of inputs:

* A latent image array, produced by the VAE encoder. It has dimensions (channels, height, width); typically, (channels, height, width) = (4, 64, 64).
* A timestep-embedding vector, which tells the backbone how much noise there is in the image. For example, an embedding of timestep t = 0 would indicate that the input image is already noiseless, while t = 100 would indicate that there is much noise.
* A modality-embedding vector sequence, which informs the backbone of additional conditions for denoising. For example, in text-to-image generation, the text is divided into a sequence of tokens, then encoded by a text encoder, such as a CLIP encoder, before being fed into the backbone. As another example, an input image can be processed by a Vision Transformer into a sequence of vectors, which can then be used to condition the backbone for tasks such as generating an image in the same style.

Each run through the U-Net backbone produces a predicted noise vector. This noise vector is scaled down and subtracted from the latent image array, resulting in a slightly less noisy latent image. The denoising is repeated according to a denoising schedule ("noise schedule"), and the output of the last step is processed by the VAE decoder into a finished image.

Similar to the standard U-Net, the U-Net backbone used in SD 1.5 is essentially composed of down-scaling layers followed by up-scaling layers. However, the backbone has additional modules that allow it to handle the embeddings. As an illustration, a single down-scaling layer in the backbone works as follows:

* The latent array and the time-embedding are processed by a ResBlock:
** The latent array is processed by a convolutional layer.
** The time-embedding vector is processed by a one-layered feedforward network, then added to the previous array (broadcast over all pixels).
** This is processed by another convolutional layer, then another time-embedding.
* The latent array and the embedding vector sequence are processed by a SpatialTransformer, which is essentially a standard pre-LN Transformer decoder without causal masking.
** In the cross-attention blocks, the latent array itself serves as the query sequence, one query vector per pixel. For example, if, at this layer in the U-Net, the latent array has dimensions (128, 32, 32), then the query sequence has 1024 vectors, each of which has 128 dimensions. The embedding vector sequence serves as both the key sequence and the value sequence; a concrete sketch is given after the pseudocode below.
** When no embedding vector sequence is input, a cross-attention block defaults to self-attention, with the latent array serving as the query, key, and value.

In pseudocode:

def ResBlock(x, time):
    x_in = x
    time_embedding = feedforward_network(time)
    x = conv_layer_1(activate(normalize_1(x))) + time_embedding
    x = conv_layer_2(dropout(activate(normalize_2(x))))
    return x_in + x

def SpatialTransformer(x, cond):
    x_in = x
    x = normalize(x)
    x = proj_in(x)
    x = cross_attention(x, cond)
    x = proj_out(x)
    return x_in + x

def unet(x, time, cond):
    residual_channels = []
    for resblock, spatialtransformer in downscaling_layers:
        x = resblock(x, time)
        residual_channels.append(x)
        x = spatialtransformer(x, cond)
    x = middle_layer.resblock_1(x, time)
    x = middle_layer.spatialtransformer(x, cond)
    x = middle_layer.resblock_2(x, time)
    for resblock, spatialtransformer in upscaling_layers:
        residual = residual_channels.pop()
        x = resblock(concatenate(x, residual), time)
        x = spatialtransformer(x, cond)
    return x

The detailed architecture may be found in the reference implementation.
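
As referenced above, the cross-attention inside the SpatialTransformer can be made concrete with a short sketch. This is an illustrative single-head version in PyTorch (the actual model uses multi-head attention); the projection matrices W_q, W_k, W_v are assumed to be learned parameters.

import torch

def cross_attention(x, cond, W_q, W_k, W_v):
    # x: latent array (C, H, W); cond: embedding sequence (L, D)
    C, H, W = x.shape
    q = x.reshape(C, H * W).T @ W_q   # (H*W, d): one query vector per pixel
    k = cond @ W_k                    # (L, d)
    v = cond @ W_v                    # (L, d)
    scores = q @ k.T / k.shape[-1] ** 0.5    # scaled dot-product, (H*W, L)
    out = torch.softmax(scores, dim=-1) @ v  # (H*W, d)
    return out.T.reshape(-1, H, W)    # back to spatial layout

When no conditioning sequence is given, passing the flattened latent array itself as cond recovers the self-attention case described above.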


Training and inference

The LDM is trained by using a Markov chain to gradually add noise to the training images. The model is then trained to reverse this process, starting with a noisy image and gradually removing the noise until it recovers the original image. More specifically, the training process can be described as follows:

* Forward diffusion process: Given a real image x_0, a sequence of latent variables x_1, ..., x_T are generated by gradually adding Gaussian noise to the image, according to a pre-determined "noise schedule".
* Reverse diffusion process: Starting from a Gaussian noise sample x_T, the model learns to predict the noise added at each step, in order to reverse the diffusion process and obtain a reconstruction of the original image x_0. The model is trained to minimize the difference between the predicted noise and the actual noise added at each step. This is typically done using a mean squared error (MSE) loss function; a training-step sketch is given below.

Once the model is trained, it can be used to generate new images by simply running the reverse diffusion process starting from a random noise sample. The model gradually removes the noise from the sample, guided by the learned noise distribution, until it generates a final image. See the diffusion model page for details.
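
A single training step for the latent denoiser can be sketched as follows, assuming the epsilon-prediction parameterization with a precomputed cumulative noise schedule alpha_bar; the variable names are illustrative, not taken from the original code.

import torch

def training_step(unet, z0, cond, alpha_bar, T):
    # z0: clean latent from the VAE encoder; cond: conditioning embeddings
    t = torch.randint(0, T, ())                 # random timestep
    eps = torch.randn_like(z0)                  # the actual noise added
    a = alpha_bar[t]
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * eps  # forward diffusion in closed form
    predicted = unet(z_t, t, cond)              # the model predicts the noise
    return torch.mean((predicted - eps) ** 2)   # MSE between predicted and actual noise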


See also

* Diffusion model
* Generative adversarial network
* Variational autoencoder
* Stable Diffusion

