Text-to-image Model

A text-to-image model is a machine learning model which takes an input natural language prompt and produces an image matching that description. Text-to-image models began to be developed in the mid-2010s during the beginnings of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models—such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney—began to be considered to approach the quality of real photographs and human-drawn art. Text-to-image models are generally latent diffusion models, which combine a language model, which transforms the input text into a latent representation, with a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.


History

Before the rise of deep learning, attempts to build text-to-image models were limited to collages assembled from existing component images, such as those in a database of clip art. The inverse task, image captioning, was more tractable, and a number of deep learning image captioning models preceded the first text-to-image models.

The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. alignDRAW extended the previously-introduced DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) to be conditioned on text sequences. Images generated by alignDRAW were small (32×32 pixels, attained by resizing) and were considered low in diversity. However, the model was able to generalize to objects not represented in the training data (such as a red school bus) and appropriately handled novel prompts such as "a stop sign is flying in blue skies", showing that it was not merely "memorizing" data from the training set.
. In 2016, Reed, Akata, Yan et al. became the first to use
generative adversarial network A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative artificial intelligence. The concept was initially developed by Ian Goodfellow and his colleagues in June ...
s for the text-to-image task. With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions like ''"an all black bird with a distinct thick, rounded bill"''. A model trained on the more diverse
COCO Coco or variants may refer to: Arts and entertainment Film * ''Coco'' (2009 film), a French comedy film * ''Coco'' (2017 film), an American animated fantasy film * '' Pokémon the Movie: Secrets of the Jungle'' (), a 2020 Japanese anime film ...
(Common Objects in Context) dataset produced images which were "from a distance... encouraging", but which lacked coherence in their details. Later systems include VQGAN-CLIP, XMC-GAN, and GauGAN2. One of the first text-to-image models to capture widespread public attention was
OpenAI OpenAI, Inc. is an American artificial intelligence (AI) organization founded in December 2015 and headquartered in San Francisco, California. It aims to develop "safe and beneficial" artificial general intelligence (AGI), which it defines ...
's
DALL-E DALL-E, DALL-E 2, and DALL-E 3 (stylised DALL·E) are text-to-image models developed by OpenAI using deep learning methodologies to generate digital images from natural language descriptions known as Prompt engineering, ''prompts''. The first ...
, a
transformer In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple Electrical network, circuits. A varying current in any coil of the transformer produces ...
system announced in January 2021. A successor capable of generating more complex and realistic images, DALL-E 2, was unveiled in April 2022, followed by
Stable Diffusion Stable Diffusion is a deep learning, text-to-image model released in 2022 based on Diffusion model, diffusion techniques. The generative artificial intelligence technology is the premier product of Stability AI and is considered to be a part of ...
that was publicly released in August 2022. In August 2022,
text-to-image personalization A text-to-image model is a machine learning model which takes an input natural language prompt and produces an image matching that description. Text-to-image models began to be developed in the mid-2010s during the beginnings of the AI boom ...
allows to teach the model a new concept using a small set of images of a new object that was not included in the training set of the text-to-image foundation model. This is achieved by textual inversion, namely, finding a new text term that correspond to these images. Following other text-to-image models,
language model A language model is a model of the human brain's ability to produce natural language. Language models are useful for a variety of tasks, including speech recognition, machine translation,Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013)"S ...
-powered text-to-video platforms such as Runway, Make-A-Video, Imagen Video, Midjourney, and Phenaki can generate video from text and/or text/image prompts.
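The idea behind textual inversion can be sketched schematically: the generative model's weights stay frozen, and only a new token embedding is optimised so that the model reproduces the concept images. In the toy sketch below, a fixed linear map stands in for the frozen model and random vectors stand in for image features; this is purely illustrative and not a real diffusion training objective.

```python
import numpy as np

# Schematic of textual inversion: the generative model stays frozen; only a
# new token embedding is optimised to reconstruct the concept images.
rng = np.random.default_rng(0)
frozen_model = rng.standard_normal((32, 8))      # frozen weights (never updated)
concept_features = rng.standard_normal((5, 32))  # stand-in features of 5 concept images
target = concept_features.mean(axis=0)

v = np.zeros(8)                                  # new token embedding to learn
lr = 0.01
for _ in range(500):
    err = frozen_model @ v - target              # reconstruction error
    v -= lr * frozen_model.T @ err               # gradient step on the embedding only

loss = np.sum((frozen_model @ v - target) ** 2)
print(loss)  # much smaller than the loss at v = 0
```

In a real system the "frozen model" is the full text-to-image network and the loss is its usual denoising objective, but the shape of the procedure is the same: gradient descent on a single new embedding vector.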


Architecture and training

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though transformer models have since become a more popular option. For the image generation step, conditional generative adversarial networks (GANs) have been commonly used, with diffusion models also becoming a popular option in recent years. Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and use one or more auxiliary deep learning models to upscale them, filling in finer details.

Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With their 2022 Imagen model, Google Brain reported positive results from using a large language model trained separately on a text-only corpus (with its weights subsequently frozen), a departure from the theretofore standard approach.
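The three-stage pipeline described above (encode the text, generate a low-resolution image conditioned on the embedding, then upscale) can be sketched with toy stand-ins. Every function here is a deliberately trivial placeholder for a learned model; only the shapes and the data flow are meant to be representative.

```python
import hashlib

import numpy as np

def encode_text(prompt: str, dim: int = 16) -> np.ndarray:
    """Toy stand-in for the text encoder (an LSTM or transformer in real
    systems): map a prompt deterministically to a fixed-size embedding."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)

def generate_low_res(embedding: np.ndarray, size: int = 8) -> np.ndarray:
    """Toy stand-in for the conditional image generator (a conditional GAN or
    diffusion model in real systems): a low-resolution RGB array whose
    content depends only on the text embedding."""
    seed = int(np.abs(embedding * 1000).astype(np.int64).sum()) % (2 ** 32)
    return np.random.default_rng(seed).uniform(0.0, 1.0, size=(size, size, 3))

def upscale(image: np.ndarray, factor: int = 4) -> np.ndarray:
    """Toy stand-in for the auxiliary super-resolution model:
    nearest-neighbour upsampling (real models fill in finer details)."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)

# Pipeline: prompt -> embedding -> low-res image -> upscaled image
emb = encode_text("a stop sign is flying in blue skies")
low = generate_low_res(emb)   # shape (8, 8, 3)
high = upscale(low)           # shape (32, 32, 3)
```

The separation matters in practice: the low-resolution generator is far cheaper to train, and the upscalers can be reused across prompts.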


Datasets

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is the COCO dataset. Released by Microsoft in 2014, COCO consists of around 123,000 images depicting a diversity of objects, with five captions per image generated by human annotators. Originally, the main focus of COCO was the recognition of objects and scenes in images.

Oxford-102 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds respectively. It is considered less difficult to train a high-quality text-to-image model with these datasets because of their narrow range of subject matter.

One of the largest open datasets for training text-to-image models is LAION-5B, containing more than 5 billion image-text pairs. This dataset was created using web scraping and automatic filtering based on similarity to high-quality artwork and professional photographs. Because of this, however, it also contains controversial content, which has led to discussions about the ethics of its use.

Some modern AI platforms not only generate images from text but also create synthetic datasets to improve model training and fine-tuning. These datasets help avoid copyright issues and expand the diversity of training data.
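Building training pairs from a caption dataset is mechanically simple. The sketch below pairs file names with captions from a COCO-style annotations structure; the field names follow the public COCO captions JSON format, and the sample records are invented for illustration.

```python
# Invented COCO-style annotation records (real COCO has ~123k images,
# five human-written captions each).
annotations = {
    "images": [{"id": 1, "file_name": "000000000001.jpg"}],
    "annotations": [
        {"image_id": 1, "caption": "A red school bus parked on a street."},
        {"image_id": 1, "caption": "A bus stands near a curb."},
    ],
}

def caption_pairs(data: dict) -> list[tuple[str, str]]:
    """Build (file_name, caption) training pairs from COCO-style annotations."""
    id_to_file = {img["id"]: img["file_name"] for img in data["images"]}
    return [(id_to_file[a["image_id"]], a["caption"]) for a in data["annotations"]]

pairs = caption_pairs(annotations)
print(len(pairs))  # 2 — one pair per caption, not per image
```

Because each image carries several captions, the same image appears in multiple training pairs, which is how multi-caption datasets like COCO are typically consumed.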


Quality evaluation

Evaluating and comparing the quality of text-to-image models is a problem involving the assessment of multiple desirable properties. A desideratum specific to text-to-image models is that generated images semantically align with the text captions used to generate them. A number of schemes have been devised for assessing these qualities, some automated and others based on human judgement.

A common algorithmic metric for assessing image quality and diversity is the Inception Score (IS), which is based on the distribution of labels predicted by a pretrained Inceptionv3 image classification model when applied to a sample of images generated by the text-to-image model. The score increases when the image classification model predicts a single label with high probability, a scheme intended to favour "distinct" generated images. Another popular metric is the related Fréchet inception distance (FID), which compares the distribution of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model.
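Given the classifier's predicted label distributions, the Inception Score itself is a short computation: the exponential of the mean KL divergence between each image's conditional label distribution p(y|x) and the marginal p(y) over the sample. The sketch below takes the probability matrix as given (in practice it would come from Inceptionv3).

```python
import numpy as np

def inception_score(class_probs: np.ndarray) -> float:
    """Inception Score from predicted class probabilities.

    class_probs: shape (n_images, n_classes); each row is the label
    distribution p(y|x) a pretrained classifier assigns to one generated image.
    """
    eps = 1e-12
    p_yx = np.clip(class_probs, eps, 1.0)      # p(y|x) per image
    p_y = p_yx.mean(axis=0, keepdims=True)     # marginal p(y) over the sample
    # Mean KL divergence KL(p(y|x) || p(y)), then exponentiate
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread over varied labels -> high score
sharp = np.eye(4)                  # 4 images, each a different certain label
# Uniform, indistinct predictions -> score near 1
blurry = np.full((4, 4), 0.25)
print(inception_score(sharp))      # ≈ 4.0 (the number of classes)
print(inception_score(blurry))     # ≈ 1.0
```

The two extremes show why the metric rewards both sharpness (each row peaked) and diversity (the peaks spread across labels): either failure mode pulls the conditionals toward the marginal and the score toward 1.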


Impact and applications


List of notable text-to-image models


Explanatory notes


See also

* Artificial intelligence art
* Text-to-video model
* AI slop


References
