T5 (language model)

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI, introduced in 2019. Like the original Transformer model, T5 models are encoder-decoder Transformers, where the encoder processes the input text and the decoder generates the output text. T5 models are usually pretrained on a massive dataset of text and code, after which they can perform text-based tasks similar to those they were pretrained on. They can also be finetuned to perform other tasks. T5 models have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics.


Training

The original T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), containing text and code scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.

The T5 models were pretrained on many tasks, all in the format of <input text> -> <output text>. Some examples are:
* Restoring corrupted text: Thank you <X> me to your party <Y> week. -> <X> for inviting <Y> last <Z>, where <Z> means "end of output", and <X> and <Y> denote blanks to be filled, called "sentinels" in the original report.
* Translation: translate English to German: That is good. -> Das ist gut.
* Judging the grammatical acceptability of a sentence (CoLA sentence): The course is jumping well. -> not acceptable.
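This text-to-text interface carries over to common open-source re-implementations. The following is a minimal sketch using the Hugging Face transformers library and its t5-small checkpoint (an assumption; it is not the original T5 codebase), showing the task-prefix format and the sentinel tokens, which this tokenizer spells <extra_id_0>, <extra_id_1>, ... rather than <X>, <Y>:

    # Minimal sketch of T5's text-to-text interface (Hugging Face transformers assumed).
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Every task is phrased as plain text with a task prefix.
    inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "Das ist gut."

    # Span corruption replaces removed spans with sentinel tokens; the model is
    # expected to emit each sentinel followed by the missing text.
    corrupted = "Thank you <extra_id_0> me to your party <extra_id_1> week."
    inputs = tokenizer(corrupted, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(output_ids[0]))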


Architecture

The T5 series encompasses several models with varying sizes and capabilities, all encoder-decoder Transformers, where the encoder processes the input text and the decoder generates the output text. These models are often distinguished by their parameter count, which indicates the complexity and potential capacity of the model. The original paper reported the following 5 models:

Model   Parameters  n_layer  d_model  d_ff    d_kv  n_head
Small   60M         6        512      2048    64    8
Base    220M        12       768      3072    64    12
Large   770M        24       1024     4096    64    16
3B      3B          24       1024     16384   128   32
11B     11B         24       1024     65536   128   128

*The encoder and the decoder have the same shape. So, for example, T5-small has 6 layers in the encoder and 6 layers in the decoder.

In the above table,
* n_layer: number of layers in the encoder; also the number of layers in the decoder. They always have the same number of layers.
* n_head: number of attention heads in each attention block.
* d_model: dimension of the embedding vectors.
* d_ff: dimension of the feedforward network within each encoder and decoder layer.
* d_kv: dimension of the key and value vectors used in the self-attention mechanism.

Note that, unlike typical Transformers, the 3B and 11B models do not satisfy d_model = d_kv * n_head.

Compared to the original Transformer, T5 uses a few minor modifications: layer normalization with no additive bias, placing the layer normalization outside the residual path, and relative positional embedding.

For all experiments, they used a WordPiece tokenizer with a vocabulary size of 32,000. The tokenizer is shared across both the input and output of each model. It was trained on a mixture of English, German, French, and Romanian data from the C4 dataset, at a ratio of 10:1:1:1.
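These hyperparameters map directly onto fields of the model configuration in common re-implementations. A minimal sketch, assuming the Hugging Face transformers library and its t5-small checkpoint (field names follow that library's T5Config, not the original codebase):

    # Inspect the architecture hyperparameters of a T5 checkpoint.
    from transformers import T5Config

    config = T5Config.from_pretrained("t5-small")
    print(config.num_layers)  # n_layer: 6 (the decoder has the same number of layers)
    print(config.num_heads)   # n_head: 8
    print(config.d_model)     # embedding dimension: 512
    print(config.d_ff)        # feedforward dimension: 2048
    print(config.d_kv)        # key/value dimension per head: 64
    print(config.vocab_size)  # the 32,000-token vocabulary plus extra sentinel/padding ids in this implementation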


Variants

Several subsequent models used the T5 architecture, with non-standardized naming conventions used to differentiate them. This section attempts to collect the main ones. An exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X.

Some models are trained from scratch, while others are trained by starting from a previously trained model. Unless otherwise noted, each model below is trained from scratch.

* ''T5'' small, base, large, 3B, 11B (2019): The original models.
* ''T5 1.1'' small, base, large, XL, XXL: Improved versions of the original T5 series, with roughly the same parameter counts. The activation function is GEGLU instead of ReLU (see the sketch after this list). The 3B and 11B models were renamed to "XL" and "XXL", and their shapes were changed.
* ''LM-adapted T5'' (2021): a series of models (from small to XXL) that started from checkpoints of the ''T5'' series, but were trained further on 100B additional tokens from C4.
* ''Switch Transformer'' (2021): a mixture-of-experts variant of T5, obtained by replacing the feedforward layers in the encoder and decoder blocks with mixture-of-experts feedforward layers.
* ''T0'' 3B, 11B (2021): a series of models that started from checkpoints of ''LM-adapted T5'' and were further trained to perform tasks based only on task instructions (zero-shot). Different entries in the series use different finetuning data.
* ''ByT5'' (2021): a byte-level version of T5, trained on the mC4 (multilingual C4) dataset. It operates on text encoded as UTF-8 bytes, without a tokenizer.
* ''Flan-T5-XL'' (2022): a model that started from a checkpoint of ''T5 XL'' and was then instruction-tuned on the FLAN dataset.
* ''T5X'' (2022): a JAX-based re-implementation of the original ''T5'' codebase. It is not a model. The original T5 codebase was implemented in TensorFlow with MeshTF.
* ''UL2'' 20B (2022): a model with the same architecture as the ''T5'' series, but scaled up to 20B and trained with a "mixture of denoisers" objective on C4. It was trained on a TPU cluster by accident, when a training run was left running for a month.
* ''Flan-UL2'' 20B (2022): ''UL2'' 20B instruction-finetuned on the FLAN dataset.
* ''Pile-T5'' (2024): has the same architecture as ''T5'', except it used the Llama tokenizer. It was trained on The Pile. It came in sizes of base, large, XL, XXL.
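The GEGLU feedforward layer used in ''T5 1.1'' replaces the single ReLU-activated projection of the original Transformer with a gated pair of projections, following "GLU Variants Improve Transformer" (Shazeer, 2020). A minimal sketch in PyTorch; the class and parameter names are illustrative and not taken from any T5 release:

    # GEGLU feedforward block: FFN(x) = (GELU(x W_gate) * (x W_up)) W_out, with no bias terms.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GEGLUFeedForward(nn.Module):
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gating projection
            self.w_up = nn.Linear(d_model, d_ff, bias=False)    # linear projection
            self.w_out = nn.Linear(d_ff, d_model, bias=False)   # output projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.w_out(F.gelu(self.w_gate(x)) * self.w_up(x))

    # Example with illustrative dimensions: a batch of 2 sequences of 16 token embeddings.
    y = GEGLUFeedForward(d_model=512, d_ff=2048)(torch.randn(2, 16, 512))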


Applications

The T5 model itself is an encoder-decoder model, allowing it to be used for instruction following. The encoder encodes the instruction, and the decoder autoregressively generates the reply.

The T5 encoder can be used as a text encoder, much like BERT. It encodes a text into a sequence of real-number vectors, which can be used for downstream applications. For example, Google Imagen uses ''T5-XXL'' as its text encoder, and the encoded text vectors are used as conditioning for a diffusion model. As another example, the AuraFlow diffusion model uses ''Pile-T5-XL''.
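A minimal sketch of using the T5 encoder alone as a text encoder, assuming the Hugging Face transformers library and its t5-small checkpoint (the downstream diffusion model is omitted):

    # Use only the T5 encoder to turn text into a sequence of vectors,
    # e.g. as conditioning for a downstream generative model.
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    encoder = T5EncoderModel.from_pretrained("t5-small")

    inputs = tokenizer("A photo of a cat wearing a hat", return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)

    text_embeddings = outputs.last_hidden_state
    print(text_embeddings.shape)  # (batch, sequence_length, d_model); d_model is 512 for t5-small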

