The Latent Diffusion Model (LDM) is a
diffusion model
architecture developed by the CompVis (Computer Vision & Learning) group at
LMU Munich
.
Introduced in 2015, diffusion models (DMs) are trained with the objective of removing successive applications of noise (commonly
Gaussian
) on training images. The LDM improves on the standard DM by performing diffusion modeling in a
latent space
, and by allowing self-attention and cross-attention conditioning.
LDMs are widely used in practical diffusion models. For instance,
Stable Diffusion
versions 1.1 to 2.1 were based on the LDM architecture.
Version history
Diffusion models were introduced in 2015 as a method to learn a model that can sample from a highly complex probability distribution. They used techniques from
non-equilibrium thermodynamics
, especially
diffusion
. The original work was accompanied by a software implementation in
Theano
.
A 2019 paper proposed the noise conditional score network (NCSN) or score-matching with Langevin dynamics (SMLD). The paper was accompanied by a software package written in
PyTorch
and released on GitHub.
A 2020 paper
proposed the Denoising Diffusion Probabilistic Model (DDPM), which improves upon the previous method by
variational inference
. The paper was accompanied by a software package written in
TensorFlow
and released on GitHub. It was reimplemented in
PyTorch
by lucidrains.
On December 20, 2021, the LDM paper was published on arXiv, and both the Stable Diffusion and LDM repositories were published on GitHub. However, the two repositories remained roughly the same. Substantial information concerning Stable Diffusion v1 was only added to GitHub on August 10, 2022.
All Stable Diffusion (SD) versions from 1.1 to XL were particular instantiations of the LDM architecture.
SD 1.1 to 1.4 were released by CompVis in August 2022. There is no "version 1.0". SD 1.1 was an LDM trained on the laion2B-en dataset. SD 1.1 was finetuned to 1.2 on more aesthetic images. SD 1.2 was finetuned to 1.3, 1.4 and 1.5, with 10% of text-conditioning dropped, to improve classifier-free guidance.
SD 1.5 was released by
RunwayML in October 2022.
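The 10% text-conditioning dropout used when finetuning SD 1.3 to 1.5 can be illustrated with a short sketch. This is not the Stable Diffusion training code: the function names `maybe_drop_caption` and `guided_noise`, the use of an empty string as the "no conditioning" prompt, and the guidance scale are assumptions made for the example. The second function shows the usual classifier-free guidance combination, which is what the dropout makes possible at sampling time.

```python
import random

DROP_PROB = 0.10  # fraction of training captions replaced by the empty prompt


def maybe_drop_caption(caption, rng, drop_prob=DROP_PROB):
    """Return the caption, or an empty prompt with probability drop_prob."""
    return "" if rng.random() < drop_prob else caption


def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
captions = ["a photo of a cat"] * 10_000
dropped = sum(maybe_drop_caption(c, rng) == "" for c in captions)
drop_rate = dropped / len(captions)  # close to 0.10

# Toy noise predictions standing in for two U-Net forward passes
# (hypothetical values, one with and one without the text prompt).
guided = guided_noise([0.0, 1.0], [1.0, 3.0], scale=2.0)
```

Because the same network sees both captioned and uncaptioned examples during training, it can produce both predictions at sampling time, and a scale above 1 pushes the result further toward the prompt.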
Architecture
While the LDM can work for generating arbitrary data conditional on arbitrary data, for concreteness, we describe its operation in conditional text-to-image generation.
The LDM consists of a
variational autoencoder
(VAE), a modified
U-Net
, and a text encoder.
The VAE encoder compresses the image from pixel space to a lower-dimensional
latent space
, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a
ResNet backbone,
denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space.
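The forward diffusion step described above can be sketched in a few lines. The sketch assumes the closed-form noising rule from DDPM, z_t = sqrt(ᾱ_t) z_0 + sqrt(1 − ᾱ_t) ε, and a linear beta schedule with 1000 steps; the schedule values, the function names, and the four-element "latent" are illustrative assumptions, not the settings of any particular LDM release. Plain Python lists are used to keep the example dependency-free.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(latent, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [math.sqrt(alpha_bar) * z + math.sqrt(1.0 - alpha_bar) * e
             for z, e in zip(latent, noise)]
    return noisy, noise


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
z0 = [1.0, -0.5, 0.25, 2.0]  # stand-in for a flattened VAE latent
z_early, _ = add_noise(z0, alpha_bars[0], rng)          # still close to z0
z_late, eps_late = add_noise(z0, alpha_bars[-1], rng)   # almost pure noise
```

At the last step the signal coefficient is close to zero, so the noisy latent is nearly indistinguishable from the Gaussian sample; the U-Net is trained to predict that noise so the process can be run backwards.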
The denoising step can be conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to denoising U-Nets via a
cross-attention mechanism.
For conditioning on text, a fixed, pretrained
CLIP
ViT-L/14 text encoder is used to transform text prompts to an embedding space.
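A single-head cross-attention step of the kind described above might look like the following. It is a toy sketch under stated assumptions: queries are computed from latent positions, keys and values from text-token embeddings, the projection matrices are identity matrices, and all sizes and names (`cross_attention`, `latent_tokens`, `text_tokens`) are hypothetical. Real LDM blocks use learned projections, multiple heads, and much larger dimensions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Queries come from the latent; keys and values come from the text."""
    q = matmul(latent_tokens, w_q)
    k = matmul(text_tokens, w_k)
    v = matmul(text_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
latent_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 latent positions
text_tokens = [[4.0, 0.0], [0.0, 4.0]]                # 2 text-token embeddings
attended, weights = cross_attention(latent_tokens, text_tokens,
                                    identity, identity, identity)
```

Each latent position receives a weighted mix of the text-token values, so the output has one row per latent position regardless of how many text tokens there are. This is what lets the same U-Net accept prompts of varying content.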
Variational Autoencoder
To compress the image data, a variational autoencoder (VAE) is first trained on a dataset of images. The encoder part of the VAE takes an image as input and outputs a lower-dimensional latent representation of the image. This latent representation is then used as input to the U-Net. Once the model is trained, the encoder is used to encode images into latent representations, and the decoder is used to decode latent representations back into images.
Let the encoder and the decoder of the VAE be $E$ and $D$, respectively.
To encode an RGB image, its three channels are divided by the maximum value, resulting in a tensor $x$ of shape $(3, H, W)$, where $H$ and $W$ are the image height and width, with all entries within range $[0, 1]$.
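The normalization step can be sketched as below. The example assumes 8-bit pixels, so the maximum value is 255, and a channels-first layout (3, height, width); both are assumptions for illustration, as is the function name `normalize_rgb`. Nested lists stand in for a tensor.

```python
def normalize_rgb(image_u8, max_value=255):
    """Scale a channels-first RGB image (nested lists) into [0, 1]."""
    return [[[pixel / max_value for pixel in row] for row in channel]
            for channel in image_u8]


# A 2x2 image stored as (3, H, W): red, green, and blue planes.
image_u8 = [
    [[255, 0], [128, 64]],
    [[0, 255], [128, 64]],
    [[0, 0], [255, 32]],
]
x = normalize_rgb(image_u8)
shape = (len(x), len(x[0]), len(x[0][0]))
```

The resulting tensor is what would be passed to the VAE encoder; any further rescaling the encoder expects is outside the scope of this sketch.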