A transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism, proposed in a 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via looking up from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM). Later variations have been widely adopted for training large language models (LLMs) on large (language) datasets, such as the Wikipedia corpus and Common Crawl.

Transformers were first developed as an improvement over previous architectures for machine translation, but have found many applications since then. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning, audio, multi-modal processing, robotics, and even playing chess. The architecture has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (Bidirectional Encoder Representations from Transformers).


History


Predecessors

For many years, sequence modelling and generation was done by using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called ''multiplicative units''. Neural networks using multiplicative units were called ''sigma-pi networks'' or ''higher-order networks'', but they faced high computational complexity. LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers. However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

An early attempt to overcome this was the fast weight controller (1992), which computed the weight matrix for further processing depending on the input. It used the fast weights architecture (1987), where one neural network outputs the weights of another neural network. It was later shown to be equivalent to the linear Transformer without normalization.


Attention with seq2seq

The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators of seq2seq are two concurrently published papers from 2014. (Sutskever et al., 2014) was a 380M-parameter model for machine translation using two long short-term memory networks (LSTMs). The architecture consists of two parts. The ''encoder'' is an LSTM that takes in a sequence of tokens and turns it into a vector. The ''decoder'' is another LSTM that converts the vector into a sequence of tokens. Similarly, (Cho et al., 2014) was a 130M-parameter model that used gated recurrent units (GRUs) instead of LSTM. Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.

These early seq2seq models had no attention mechanism, and the state vector is accessible only after the ''last'' word of the source text was processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved, since the input is processed sequentially by one recurrent network into a ''fixed''-size output vector, which was then processed by another recurrent network into an output. If the input is long, then the output vector is not able to contain all relevant information, and the output quality degrades. As evidence, reversing the input sentence improved seq2seq translation.

(Bahdanau et al., 2014) introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem, allowing the model to process long-distance dependencies more easily. They called their model ''RNNsearch'', as it "emulates searching through a source sentence during decoding a translation". (Luong et al., 2015) compared the relative performance of global (that of (Bahdanau et al., 2014)) and local (sliding window) attention model architectures for machine translation, and found that a mixed attention architecture had higher quality than global attention, while the use of a local attention architecture reduced translation time.

In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM. It took nine months to develop, and it achieved a higher level of performance than the statistical approach, which took ten years to develop. In the same year, self-attention ''avant la lettre'', originally called ''intra-attention'' or ''intra-sentence attention'', was proposed for LSTMs.


Parallelizing attention

Seq2seq models with attention (including self-attention) still suffered from the same issue with recurrent networks, which is that they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, ''decomposable attention'' applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved state-of-the-art (SOTA) results in textual entailment with an order of magnitude fewer parameters than LSTMs. One of its authors, Jakob Uszkoreit, suspected that attention ''without'' recurrence is sufficient for language translation, thus the title "attention is ''all'' you need". That hypothesis was against the conventional wisdom of the time, and even his father, a well-known computational linguist, was skeptical.

In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to process all tokens in parallel, but preserving its dot-product attention mechanism to keep its text processing performance. Its parallelizability was an important factor in its widespread use in large neural networks.


AI boom era

Already in spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles. The transformer architecture is now used in many generative models that contribute to the ongoing AI boom.

In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the line of research from bag of words and word2vec. It was followed by BERT (2018), an encoder-only Transformer model. In October 2019, Google started using BERT to process search queries. In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model by a Transformer-encoder–RNN-decoder model.

Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation. In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly popular, triggering a boom around large language models.

Since 2020, Transformers have been applied in modalities beyond text, including the vision transformer, speech recognition, robotics, and multimodal processing. The vision transformer, in turn, stimulated new developments in convolutional neural networks. Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024), and Sora (2024) are based on the Transformer architecture.


Training


Methods for stabilizing training

The plain transformer architecture had difficulty converging. In the original paper the authors recommended using learning rate warmup. That is, the learning rate should linearly scale up from 0 to its maximal value for the first part of the training (usually recommended to be 2% of the total number of training steps), before decaying again. A 2020 paper found that using layer normalization ''before'' (instead of after) multiheaded attention and feedforward layers stabilizes training, removing the need for learning rate warmup.
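As an illustration, the following is a minimal Python sketch of such a warmup-then-decay schedule; the linear decay and the concrete hyperparameter values are illustrative assumptions, not prescriptions from the original paper.

    def lr_schedule(step, total_steps, max_lr=1e-3, warmup_frac=0.02):
        # Linear warmup from 0 to max_lr over the first warmup_frac of training,
        # then a simple linear decay back toward 0 (one possible choice of decay).
        warmup_steps = max(1, int(warmup_frac * total_steps))
        if step < warmup_steps:
            return max_lr * step / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return max_lr * (1.0 - progress)

    # sanity check: the peak learning rate is reached at the end of warmup
    rates = [lr_schedule(s, total_steps=10_000) for s in range(10_000)]
    assert max(rates) <= 1e-3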


Pretrain-finetune

Transformers typically are first pretrained by self-supervised learning on a large generic dataset, followed by supervised fine-tuning on a small task-specific dataset. The pretrain dataset is typically an unlabeled large corpus, such as The Pile. Tasks for pretraining and fine-tuning commonly include:
* language modeling
* next-sentence prediction
* question answering
* reading comprehension
* sentiment analysis
* paraphrasing
The T5 transformer report documents a large number of natural language pretraining tasks. Some examples are:
* restoring or repairing incomplete or corrupted text. For example, the input ''"Thank you <X> me to your party <Y> week"'' (where <X> and <Y> mark corrupted spans) might generate the output ''"Thank you for inviting me to your party last week"''.
* translation between natural languages (machine translation)
* judging the pragmatic acceptability of natural language. For example, the following sentence might be judged "not acceptable", because even though it is syntactically well-formed, it is improbable in ordinary human usage: ''The course is jumping well.''
Note that while each of these tasks is trivial or obvious for human native speakers of the language (or languages), they have typically proved challenging for previous generations of machine learning architectures.


Tasks

In general, there are 3 classes of language modelling tasks: "masked", "autoregressive", and "prefixLM". These classes are independent of a specific modeling architecture such as the Transformer, but they are often discussed in the context of the Transformer.

In a masked task, one or more of the tokens is masked out, and the model would produce a probability distribution predicting what the masked-out tokens are based on the context. The loss function for the task is typically the sum of log-perplexities for the masked-out tokens:
\text{Loss} = -\sum_{t \in \text{masked tokens}} \ln(\text{probability of } t \text{ conditional on its context})
and the model is trained to minimize this loss function. The BERT series of models are trained for masked token prediction and another task.

In an autoregressive task, the entire sequence is masked at first, and the model produces a probability distribution for the first token. Then the first token is revealed and the model predicts the second token, and so on. The loss function for the task is still typically the same. The GPT series of models are trained by autoregressive tasks.

In a prefixLM task, the sequence is divided into two parts. The first part is presented as context, and the model predicts the first token of the second part. Then that would be revealed, and the model predicts the second token, and so on. The loss function for the task is still typically the same. The T5 series of models are trained by prefixLM tasks.

Note that "masked" as in "masked language modelling" is not "masked" as in "masked attention", and "prefixLM" (prefix language modelling, the task) is not "prefixLM" (prefix language model, the architecture described below).
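For concreteness, the following Python snippet (a sketch, not from any of the cited papers) computes this loss for a toy vocabulary, given per-position predicted distributions and the true tokens; the same computation applies whether the predicted positions come from a masked, autoregressive, or prefixLM task.

    import numpy as np

    def lm_loss(probs, targets):
        # probs: (num_predicted, vocab) predicted distributions, one per predicted token
        # targets: (num_predicted,) the true token ids
        # The loss is the sum of negative log-likelihoods of the true tokens,
        # i.e. the sum of log-perplexities described above.
        return -np.sum(np.log(probs[np.arange(len(targets)), targets]))

    # toy example: vocabulary of size 3, two predicted tokens
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
    targets = np.array([0, 1])
    print(lm_loss(probs, targets))   # -(ln 0.7 + ln 0.8), roughly 0.580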


Architecture

All transformers have the same primary components:
* Tokenizers, which convert text into tokens.
* Embedding layer, which converts tokens and positions of the tokens into vector representations.
* Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
* Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
The following description follows exactly the Transformer as described in the original paper. There are variants, described in the following section. By convention, we write all vectors as row vectors. This, for example, means that pushing a vector through a linear layer means multiplying it by a weight matrix on the right, as xW.


Tokenization

As the Transformer architecture natively processes numerical data, not text, there must be a translation between text and tokens. A token is an integer that represents a character, or a short segment of characters. On the input side, the input text is parsed into a token sequence. Similarly, on the output side, the output tokens are parsed back to text. The module doing the conversion between token sequences and texts is a tokenizer. The set of all tokens is the vocabulary of the tokenizer, and its size is the ''vocabulary size'' n_{\text{vocabulary}}. When faced with tokens outside the vocabulary, typically a special token is used, written as "[UNK]" for "unknown". Some commonly used tokenizers are byte pair encoding, WordPiece, and SentencePiece.
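As an illustration of the first of these, here is a minimal Python sketch of the byte pair encoding merge loop on a toy word-frequency table; the corpus and the number of merges are arbitrary, and production tokenizers add byte-level fallbacks, special tokens, and far larger corpora.

    from collections import Counter

    def pair_counts(vocab):
        # vocab maps a word (as a tuple of symbols) to its corpus frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        # replace every occurrence of the adjacent symbol pair with one merged symbol
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
    for _ in range(5):                       # learn 5 merges
        counts = pair_counts(vocab)
        best = max(counts, key=counts.get)   # most frequent adjacent symbol pair
        vocab = merge_pair(best, vocab)
        print("merged", best)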


Embedding

Each token is converted into an embedding vector via a lookup table. Equivalently stated, it multiplies a one-hot representation of the token by an embedding matrix M. For example, if the input token is 3, then the one-hot representation is [0, 0, 0, 1, 0, 0, \dots], and its embedding vector is
\mathrm{Embed}(3) = [0, 0, 0, 1, 0, 0, \dots] M
The token embedding vectors are added to their respective positional encoding vectors (see below), producing the sequence of input vectors. The number of dimensions in an embedding vector is called ''hidden size'' or ''embedding size'' and written as d_{\text{emb}}.
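A minimal Python/NumPy sketch of the equivalence just described (toy sizes and a random matrix, for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d_emb = 10, 4                  # toy sizes
    M = rng.normal(size=(vocab_size, d_emb))   # embedding matrix, one row per token

    token = 3
    one_hot = np.zeros(vocab_size)
    one_hot[token] = 1.0

    # looking up row 3 of M equals multiplying the one-hot vector by M
    assert np.allclose(one_hot @ M, M[token])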


Un-embedding

An un-embedding layer is almost the reverse of an embedding layer. Whereas an embedding layer converts a token into a vector, an un-embedding layer converts a vector into a probability distribution over tokens. The un-embedding layer is a linear-softmax layer:
\mathrm{UnEmbed}(x) = \mathrm{softmax}(xW + b)
The matrix W has shape (d_{\text{emb}}, n_{\text{vocabulary}}).


Positional encoding

A positional encoding is a fixed-size vector representation of the relative positions of tokens within a sequence: it provides the transformer model with information about ''where'' the words are in the input sequence. Without positional encoding, the model would be unable to process the input sequence as more than a bag of words, as for example, both "man bites dog" and "dog bites man" would be processed exactly the same way.

The positional encoding is defined as a function of type f: \R \to \R^d; d \in \mathbb{Z}, d > 0, where d is a positive even integer. The full positional encoding defined in the original paper is:
(f(t)_{2k}, f(t)_{2k+1}) = (\sin(\theta), \cos(\theta)) \quad \forall k \in \{0, 1, \ldots, d/2 - 1\}
where \theta = \frac{t}{r^k}, r = N^{2/d}. Here, N is a free parameter that should be significantly larger than the biggest k that would be input into the positional encoding function. The original paper uses N = 10000.

The function is in a simpler form when written as a complex function of type f: \R \to \mathbb{C}^{d/2}:
f(t) = \left(e^{i t / r^k}\right)_{k = 0, 1, \ldots, d/2 - 1}
where r = N^{2/d}. The main reason for using this positional encoding function is that, using it, shifts are linear transformations:
f(t + \Delta t) = \mathrm{diag}(f(\Delta t)) f(t)
where \Delta t \in \R is the distance one wishes to shift. This allows the transformer to take any encoded position and find the encoding of the position n-steps-ahead or n-steps-behind, by a matrix multiplication. By taking a linear sum, any convolution can also be implemented as a linear transformation:
\sum_j c_j f(t + \Delta t_j) = \left(\sum_j c_j \, \mathrm{diag}(f(\Delta t_j))\right) f(t)
for any constants c_j. This allows the transformer to take any encoded position and find a linear sum of the encoded locations of its neighbors. This sum of encoded positions, when fed into the attention mechanism, would create attention weights on its neighbors, much like what happens in a convolutional neural network language model. In the authors' words, "we hypothesized it would allow the model to easily learn to attend by relative position."

In typical implementations, all operations are done over the real numbers, not the complex numbers, but since complex multiplication can be implemented as real 2-by-2 matrix multiplication, this is a mere notational difference.
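A short Python/NumPy sketch of the sinusoidal encoding defined above, with N = 10000 as in the original paper (the function name is illustrative):

    import numpy as np

    def positional_encoding(t, d, N=10000):
        # f(t)_{2k} = sin(t / r^k), f(t)_{2k+1} = cos(t / r^k), with r = N^(2/d)
        k = np.arange(d // 2)
        theta = t / N ** (2 * k / d)
        enc = np.empty(d)
        enc[0::2] = np.sin(theta)
        enc[1::2] = np.cos(theta)
        return enc

    print(positional_encoding(t=5, d=8))   # an 8-dimensional encoding of position 5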


Encoder-decoder (overview)

Like earlier seq2seq models, the original transformer model used an encoder-decoder architecture. The encoder consists of encoding layers that process all the input tokens together one layer after another, while the decoder consists of decoding layers that iteratively process the encoder's output and the decoder's output tokens so far.

The purpose of each encoder layer is to create contextualized representations of the tokens, where each representation corresponds to a token that "mixes" information from other input tokens via a self-attention mechanism. Each decoder layer contains two attention sublayers: (1) cross-attention for incorporating the output of the encoder (contextualized input token representations), and (2) self-attention for "mixing" information among the input tokens to the decoder (i.e. the tokens generated so far during inference time).

Both the encoder and decoder layers have a feed-forward neural network for additional processing of their outputs and contain residual connections and layer normalization steps. These feed-forward layers contain most of the parameters in a Transformer model.


Feedforward network

The feedforward network (FFN) modules in a Transformer are 2-layered multilayer perceptrons:
\mathrm{FFN}(x) = \phi(xW^{(1)} + b^{(1)})W^{(2)} + b^{(2)}
where \phi is its activation function. The original Transformer used ReLU activation. The number of neurons in the middle layer is called ''intermediate size'' (GPT), ''filter size'' (BERT), or ''feedforward size'' (BERT). It is typically larger than the embedding size. For example, in both the GPT-2 series and the BERT series, the intermediate size of a model is 4 times its embedding size: d_{\text{ffn}} = 4 d_{\text{emb}}.
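A minimal Python/NumPy sketch of this block with ReLU and the usual 4x ratio (toy sizes and random weights, for illustration only):

    import numpy as np

    def ffn(x, W1, b1, W2, b2):
        # position-wise feedforward: ReLU(x W1 + b1) W2 + b2, applied to each row of x
        return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

    rng = np.random.default_rng(0)
    d_emb, d_ffn = 8, 32                       # toy sizes with d_ffn = 4 * d_emb
    W1, b1 = rng.normal(size=(d_emb, d_ffn)), np.zeros(d_ffn)
    W2, b2 = rng.normal(size=(d_ffn, d_emb)), np.zeros(d_emb)
    x = rng.normal(size=(5, d_emb))            # a sequence of 5 token vectors
    print(ffn(x, W1, b1, W2, b2).shape)        # (5, 8)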


Scaled dot-product attention


Attention head

The attention mechanism used in the Transformer architecture consists of scaled dot-product attention units. For each unit, the transformer model learns three weight matrices: the query weights W^Q, the key weights W^K, and the value weights W^V. The module takes three sequences: a query sequence, a key sequence, and a value sequence. The query sequence is a sequence of length \ell_{\text{seq, query}}, and each entry is a vector of dimension d_{\text{emb}}. Similarly for the key and value sequences.

For each vector x_{i, \text{query}} in the query sequence, it is multiplied by the matrix W^Q to produce a query vector q_i = x_{i, \text{query}} W^Q. The matrix of all query vectors is the query matrix:
Q = X_{\text{query}} W^Q
Similarly, we construct the key matrix K = X_{\text{key}} W^K and the value matrix V = X_{\text{value}} W^V. It is usually the case that all W^Q, W^K, W^V are square matrices, meaning d_{\text{query}} = d_{\text{emb}}, etc.

Attention weights are calculated using the query and key vectors: the attention weight a_{ij} from token i to token j is the dot product between q_i and k_j. The attention weights are divided by the square root of the dimension of the key vectors, \sqrt{d_k}, which stabilizes gradients during training, and passed through a softmax which normalizes the weights. The fact that W^Q and W^K are different matrices allows attention to be non-symmetric: if token i attends to token j (i.e. q_i \cdot k_j is large), this does not necessarily mean that token j will attend to token i (i.e. q_j \cdot k_i could be small). The output of the attention unit for token i is the weighted sum of the value vectors of all tokens, weighted by a_{ij}, the attention from token i to each token.

The attention calculation for all tokens can be expressed as one large matrix calculation using the softmax function, which is useful for training due to computational matrix operation optimizations that quickly compute matrix operations. The matrices Q, K and V are defined as the matrices where the ith rows are vectors q_i, k_i, and v_i respectively. Then we can represent the attention as
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V
where the softmax is applied over each of the rows of the matrix.

The number of dimensions in a query vector is the ''query size'' d_{\text{query}}, and similarly for the ''key size'' d_{\text{key}} and ''value size'' d_{\text{value}}. The output dimension of an attention head is its ''head dimension'' d_{\text{head}}. The attention mechanism requires the following three equalities to hold:
\ell_{\text{seq, key}} = \ell_{\text{seq, value}}, \quad d_{\text{query}} = d_{\text{key}}, \quad d_{\text{value}} = d_{\text{head}}
but is otherwise unconstrained. If the attention head is used in a self-attention fashion, then X_{\text{query}} = X_{\text{key}} = X_{\text{value}}. If the attention head is used in a cross-attention fashion, then usually X_{\text{query}} \neq X_{\text{key}} = X_{\text{value}}. It is theoretically possible for all three to be different, but that is rarely the case in practice.
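A minimal Python/NumPy sketch of scaled dot-product self-attention as defined above (toy sizes, random weights; the helper names are illustrative):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)       # subtract row max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        # softmax(Q K^T / sqrt(d_k)) V, with the softmax applied row-wise
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 8))                     # 6 tokens, embedding size 8
    WQ, WK, WV = (rng.normal(size=(8, 8)) for _ in range(3))
    out = attention(X @ WQ, X @ WK, X @ WV)         # self-attention: the same X everywhere
    print(out.shape)                                # (6, 8)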


Multiheaded attention

One set of \left( W^Q, W^K, W^V \right) matrices is called an ''attention head'', and each layer in a transformer model has multiple attention heads. While each attention head attends to the tokens that are relevant to each token, multiple attention heads allow the model to do this for different definitions of "relevance". In addition, the influence field representing relevance can become progressively dilated in successive layers. Many transformer attention heads encode relevance relations that are meaningful to humans. For example, some attention heads can attend mostly to the next word, while others mainly attend from verbs to their direct objects. The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.

Concretely, let the multiple attention heads be indexed by i; then we have
\text{MultiheadedAttention}(Q, K, V) = \text{Concat}_{i \in [n_{\text{heads}}]}(\text{Attention}(XW^Q_i, XW^K_i, XW^V_i)) W^O
where the matrix X is the concatenation of word embeddings, the matrices W^Q_i, W^K_i, W^V_i are "projection matrices" owned by individual attention head i, and W^O is a final projection matrix owned by the whole multi-headed attention head. It is theoretically possible for each attention head to have a different head dimension d_{\text{head}}, but that is rarely the case in practice.

As an example, in the smallest GPT-2 model, there are only self-attention mechanisms. It has the following dimensions:
d_{\text{emb}} = 768, \quad n_{\text{head}} = 12, \quad d_{\text{head}} = 64
Since 12 \times 64 = 768, its projection matrix W^O \in \R^{768 \times 768} is a square matrix.
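Continuing the previous sketch, a Python/NumPy illustration of multi-head attention with the smallest GPT-2 sizes quoted above (random weights; the 0.02 scale only keeps the toy numbers small):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    def multihead_attention(X, WQ, WK, WV, WO):
        # WQ, WK, WV: (n_head, d_emb, d_head) per-head projections; WO: (n_head*d_head, d_emb)
        heads = [attention(X @ WQ_i, X @ WK_i, X @ WV_i)
                 for WQ_i, WK_i, WV_i in zip(WQ, WK, WV)]
        return np.concatenate(heads, axis=-1) @ WO   # concatenate heads, then final projection

    rng = np.random.default_rng(0)
    d_emb, n_head, d_head = 768, 12, 64
    X = rng.normal(size=(4, d_emb))                  # 4 tokens
    WQ, WK, WV = (0.02 * rng.normal(size=(n_head, d_emb, d_head)) for _ in range(3))
    WO = 0.02 * rng.normal(size=(n_head * d_head, d_emb))
    print(multihead_attention(X, WQ, WK, WV, WO).shape)   # (4, 768)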


Masked attention

It may be necessary to cut out attention links between some word-pairs. For example, the decoder, when decoding for the token position t, should not have access to the token at position t+1. This may be accomplished before the softmax stage by adding a mask matrix M that is -\infty at entries where the attention link must be cut, and 0 at other places:
\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(M + \frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V
A non-masked attention module can be thought of as a masked attention module where the mask has all entries zero. For example, the following matrix is commonly used in decoder self-attention modules, called "causal masking":
M_{\text{causal}} = \begin{pmatrix} 0 & -\infty & -\infty & \dots & -\infty \\ 0 & 0 & -\infty & \dots & -\infty \\ 0 & 0 & 0 & \dots & -\infty \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 0 \end{pmatrix}
In words, it means that each token can pay attention to itself, and every token before it, but not any after it. As an example of an uncommon use of a mask matrix, XLNet considers all masks of the form P M_{\text{causal}} P^{-1}, where P is a random permutation matrix.
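A one-function Python/NumPy sketch of building this causal mask (the function name is illustrative):

    import numpy as np

    def causal_mask(n):
        # 0 on and below the diagonal, -inf strictly above it, as in M_causal above
        m = np.zeros((n, n))
        m[np.triu_indices(n, k=1)] = -np.inf
        return m

    print(causal_mask(4))
    # [[  0. -inf -inf -inf]
    #  [  0.   0. -inf -inf]
    #  [  0.   0.   0. -inf]
    #  [  0.   0.   0.   0.]]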


Encoder

An encoder consists of an embedding layer, followed by multiple encoder layers. Each encoder layer consists of two major components: a self-attention mechanism and a feed-forward layer. It takes an input as a sequence of input vectors, applies the self-attention mechanism to produce an intermediate sequence of vectors, then applies the feed-forward layer for each vector individually. Schematically, we have:
\begin{aligned}
\text{given input vectors } & h_0, h_1, \dots \\
\text{combine them into a matrix } H &= \begin{pmatrix} h_0 \\ h_1 \\ \vdots \end{pmatrix} \\
\text{EncoderLayer}(H) &= \begin{pmatrix} \text{FFN}(\text{MultiheadedAttention}(H, H, H)_0) \\ \text{FFN}(\text{MultiheadedAttention}(H, H, H)_1) \\ \vdots \end{pmatrix}
\end{aligned}
where \text{FFN} stands for "feed-forward network". We can more succinctly write it as
\text{EncoderLayer}(H) = \text{FFN}(\text{MultiheadedAttention}(H, H, H))
with the implicit convention that the \text{FFN} is applied to each row of the matrix individually.

The encoder layers are stacked. The first encoder layer takes the sequence of input vectors from the embedding layer, producing a sequence of vectors. This sequence of vectors is processed by the second encoder layer, and so on. The output from the final encoder layer is then used by the decoder.

As the encoder processes the entire input all at once, every token can attend to every other token (all-to-all attention), so there is no need for causal masking.


Decoder

A decoder consists of an embedding layer, followed by multiple decoder layers, followed by an un-embedding layer. Each decoder layer consists of three major components: a causally masked self-attention mechanism, a cross-attention mechanism, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders. This mechanism can also be called the ''encoder-decoder attention''.

Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow. This allows for autoregressive text generation. For decoding, all-to-all attention is inappropriate, because a token cannot attend to tokens not yet generated. Thus, the self-attention module in the decoder is causally masked.

In contrast, the cross-attention mechanism attends to the output vectors of the encoder, which are computed before the decoder starts decoding. Consequently, there is no need for masking in the cross-attention mechanism. Schematically, we have:
\begin{aligned}
H' &= \text{MaskedMultiheadedAttention}(H, H, H) \\
\text{DecoderLayer}(H) &= \text{FFN}(\text{MultiheadedAttention}(H', H^E, H^E))
\end{aligned}
where H^E is the matrix with rows being the output vectors from the encoder.

The last decoder layer is followed by a final un-embedding layer to produce the output probabilities over the vocabulary. Then, one of the tokens is sampled according to the probabilities, and the decoder can be run again to produce the next token, etc., autoregressively generating output text.


Full transformer architecture


Sublayers

Each encoder layer contains 2 sublayers: the self-attention and the feedforward network. Each decoder layer contains 3 sublayers: the causally masked self-attention, the cross-attention, and the feedforward network.

The final points of detail are the residual connections and layer normalization (LayerNorm, or LN), which, while conceptually unnecessary, are necessary for numerical stability and convergence. Similarly to how the feedforward network modules are applied individually to each vector, the LayerNorm is also applied individually to each vector.

There are two common conventions in use: the ''post-LN'' and the ''pre-LN'' convention. In the post-LN convention, the output of each sublayer is
\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))
where \mathrm{Sublayer}(x) is the function implemented by the sublayer itself. In the pre-LN convention, the output of each sublayer is
x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))
The original 2017 Transformer used the post-LN convention. It was difficult to train and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and gradually increases. The pre-LN convention, developed in 2020, was found to be easier to train, requiring no warm-up and leading to faster convergence.
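The two conventions differ only in where the normalization sits relative to the residual connection, as in this small Python/NumPy sketch (the learnable gain and bias of LayerNorm are omitted for brevity, and the stand-in sublayer is arbitrary):

    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        sigma = x.std(axis=-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def post_ln(x, sublayer):
        return layer_norm(x + sublayer(x))        # original 2017 convention

    def pre_ln(x, sublayer):
        return x + sublayer(layer_norm(x))        # 2020 convention, easier to train

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))                   # 4 token vectors of width 8
    sublayer = lambda h: np.maximum(h, 0.0)       # stand-in for an attention or FFN sublayer
    print(post_ln(x, sublayer).shape, pre_ln(x, sublayer).shape)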


Pseudocode

The following is the pseudocode for a standard pre-LN encoder-decoder Transformer.

    input: Encoder input t_e
           Decoder input t_d
    output: Array of probability distributions, with shape (decoder vocabulary size x length(decoder output sequence))

    /* encoder */
    z_e ← encoder.tokenizer(t_e)

    for each t in 1:length(z_e) do
        z_e[t] ← encoder.embedding(z_e[t]) + encoder.positional_embedding(t)

    for each l in 1:length(encoder.layers) do
        layer ← encoder.layers[l]

        /* first sublayer */
        z_e_copy ← copy(z_e)
        for each t in 1:length(z_e) do
            z_e[t] ← layer.layer_norm(z_e[t])
        z_e ← layer.multiheaded_attention(z_e, z_e, z_e)
        for each t in 1:length(z_e) do
            z_e[t] ← z_e[t] + z_e_copy[t]

        /* second sublayer */
        z_e_copy ← copy(z_e)
        for each t in 1:length(z_e) do
            z_e[t] ← layer.layer_norm(z_e[t])
        z_e ← layer.feedforward(z_e)
        for each t in 1:length(z_e) do
            z_e[t] ← z_e[t] + z_e_copy[t]

    for each t in 1:length(z_e) do
        z_e[t] ← encoder.final_layer_norm(z_e[t])

    /* decoder */
    z_d ← decoder.tokenizer(t_d)

    for each t in 1:length(z_d) do
        z_d[t] ← decoder.embedding(z_d[t]) + decoder.positional_embedding(t)

    for each l in 1:length(decoder.layers) do
        layer ← decoder.layers[l]

        /* first sublayer */
        z_d_copy ← copy(z_d)
        for each t in 1:length(z_d) do
            z_d[t] ← layer.layer_norm(z_d[t])
        z_d ← layer.masked_multiheaded_attention(z_d, z_d, z_d)
        for each t in 1:length(z_d) do
            z_d[t] ← z_d[t] + z_d_copy[t]

        /* second sublayer */
        z_d_copy ← copy(z_d)
        for each t in 1:length(z_d) do
            z_d[t] ← layer.layer_norm(z_d[t])
        z_d ← layer.multiheaded_attention(z_d, z_e, z_e)
        for each i in 1:length(z_d) do
            z_d[i] ← z_d[i] + z_d_copy[i]

        /* third sublayer */
        z_d_copy ← copy(z_d)
        for each t in 1:length(z_d) do
            z_d[t] ← layer.layer_norm(z_d[t])
        z_d ← layer.feedforward(z_d)
        for each t in 1:length(z_d) do
            z_d[t] ← z_d[t] + z_d_copy[t]

    z_d ← decoder.final_layer_norm(z_d)

    output_distributions ← []
    for each t in 1:length(z_d) do
        output_distributions.append(decoder.unembed(z_d[t]))

    return output_distributions


Terminology

The Transformer architecture, being modular, allows variations. Several common variations are described here.

An "encoder-only" Transformer applies the encoder to map an input text into a sequence of vectors that represent the input text. This is usually used for text embedding and representation learning for downstream applications. BERT is encoder-only. They are less often used currently, as they were found to be not significantly better than training an encoder-decoder Transformer and then taking just the encoder.

A "decoder-only" Transformer is not literally decoder-only, since without an encoder, the cross-attention mechanism has nothing to attend to. Thus, the decoder layers in a decoder-only Transformer are composed of just two sublayers: the causally masked self-attention and the feedforward network. This is usually used for text generation and instruction following. The models in the GPT series and Chinchilla series are decoder-only.

An "encoder-decoder" Transformer is generally the same as the original Transformer, with 2 sublayers per encoder layer and 3 sublayers per decoder layer, etc. They might have minor architectural improvements, such as alternative activation functions, changing the location of normalization, etc. This is also usually used for text generation and instruction following. The models in the T5 series are encoder-decoder.

A "prefixLM" (prefix language model) is a decoder-only architecture, but with prefix masking, which is different from causal masking. Specifically, it has a mask of the form
M_{\text{prefixLM}} = \begin{pmatrix} \mathbf{0} & -\infty \\ \mathbf{0} & M_{\text{causal}} \end{pmatrix}
where the first columns correspond to the "prefix", and the subsequent columns correspond to the autoregressively generated text based on the prefix. They resemble encoder-decoder models, but have less "sparsity". Such models are rarely used, though they are cited as theoretical possibilities and in benchmarked comparisons.

There are also mixed seq2seq models. For example, in 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model by a Transformer-encoder–RNN-decoder model, on the argument that an RNN-decoder runs much faster than a Transformer-decoder when run autoregressively.


Subsequent work


Alternative activation functions

The original transformer uses the ReLU activation function. Other activation functions were developed. The Llama series used SwiGLU; both GPT-1 and BERT used GELU.


Alternative normalizations

The normalization used in the Transformer can be different from LayerNorm. One example is RMSNorm which is used in the Llama series. Other examples include ScaleNorm, or FixNorm.


Alternative positional encodings

Transformers may use other positional encoding methods than sinusoidal. The original Transformer paper reported using a learned positional encoding, but found it not superior to the sinusoidal one. Later, it was found that causal masking itself provides enough signal to a Transformer decoder that it can learn to implicitly perform absolute positional encoding without the positional encoding module.


RoPE

RoPE (rotary positional embedding) is best explained by considering a list of 2-dimensional vectors [(x^{(1)}_1, x^{(2)}_1), (x^{(1)}_2, x^{(2)}_2), (x^{(1)}_3, x^{(2)}_3), \dots]. Now pick some angle \theta. Then the RoPE encoding is
\text{RoPE}\big(x^{(1)}_m, x^{(2)}_m, m\big) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x^{(1)}_m \\ x^{(2)}_m \end{pmatrix} = \begin{pmatrix} x^{(1)}_m \cos m\theta - x^{(2)}_m \sin m\theta \\ x^{(2)}_m \cos m\theta + x^{(1)}_m \sin m\theta \end{pmatrix}
Equivalently, if we write the 2-dimensional vectors as complex numbers z_m := x^{(1)}_m + i x^{(2)}_m, then RoPE encoding is just multiplication by an angle:
\text{RoPE}\big(z_m, m\big) = e^{i m \theta} z_m
For a list of 2n-dimensional vectors, a RoPE encoder is defined by a sequence of angles \theta^{(1)}, \dots, \theta^{(n)}. Then the RoPE encoding is applied to each pair of coordinates.

The benefit of RoPE is that the dot-product between two vectors depends on their relative location only:
\text{RoPE}\big(x, m\big)^\mathrm{T} \text{RoPE}\big(y, n\big) = \text{RoPE}\big(x, m+k\big)^\mathrm{T} \text{RoPE}\big(y, n+k\big)
for any integer k.
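A small Python/NumPy sketch of RoPE and a numerical check of the relative-position property; the particular choice of angles is an assumption (one common geometric schedule), not part of the definition above.

    import numpy as np

    def rope(x, m, thetas):
        # rotate each consecutive coordinate pair (x[2k], x[2k+1]) by the angle m * thetas[k]
        out = np.empty_like(x, dtype=float)
        for k, theta in enumerate(thetas):
            c, s = np.cos(m * theta), np.sin(m * theta)
            x1, x2 = x[2 * k], x[2 * k + 1]
            out[2 * k] = x1 * c - x2 * s
            out[2 * k + 1] = x1 * s + x2 * c
        return out

    thetas = 10000.0 ** (-np.arange(4) / 4)          # assumed angle schedule for 8 dimensions
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=8), rng.normal(size=8)
    a = rope(x, 3, thetas) @ rope(y, 7, thetas)      # positions 3 and 7 (offset 4)
    b = rope(x, 13, thetas) @ rope(y, 17, thetas)    # positions 13 and 17 (same offset)
    print(np.isclose(a, b))                          # True: only the offset matters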


ALiBi

ALiBi (Attention with Linear Biases) is not a ''replacement'' for the positional encoder of the original transformer. Instead, it is an ''additional'' positional encoder that is directly plugged into the attention mechanism. Specifically, the ALiBi attention mechanism is
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}} + sB\right)V
Here, s is a real number ("scalar"), and B is the ''linear bias'' matrix defined by
B = \begin{pmatrix} 0 & 1 & 2 & 3 & \cdots \\ -1 & 0 & 1 & 2 & \cdots \\ -2 & -1 & 0 & 1 & \cdots \\ -3 & -2 & -1 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}
in other words, B_{i,j} = j - i. The idea is that the linear bias matrix is a softened mask. Just as 0 represents full attention paid, and -\infty represents no attention paid, the linear bias matrix increases attention paid in one direction and decreases attention paid in the other direction.

ALiBi allows pretraining on short context windows, then finetuning on longer context windows. Since it is directly plugged into the attention mechanism, it can be combined with any positional encoder that is plugged into the "bottom" of the entire network (which is where the sinusoidal encoder of the original transformer, as well as RoPE and many others, are located).
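A short Python/NumPy sketch of the linear bias matrix B above (the function name is illustrative):

    import numpy as np

    def alibi_bias(n):
        # B[i, j] = j - i, added (scaled by s) to the attention scores before the softmax
        idx = np.arange(n)
        return idx[None, :] - idx[:, None]

    print(alibi_bias(4))
    # [[ 0  1  2  3]
    #  [-1  0  1  2]
    #  [-2 -1  0  1]
    #  [-3 -2 -1  0]]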


Relative Position Encodings

Relative Position Encodings are similar to ALiBi, but more generic:
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}} + B\right)V
where B is a Toeplitz matrix, that is, B_{i,j} = B_{i',j'} whenever i - j = i' - j'. This is contrasted with the original sinusoidal positional encoding, which is an "absolute positional encoding".


Efficient implementation

The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch. ''Transformers'' is a library produced by Hugging Face that supplies transformer-based architectures and pretrained models.


FlashAttention

FlashAttention is an algorithm that implements the transformer attention mechanism efficiently on a GPU. It performs matrix multiplications in blocks, such that each block fits within the cache of a GPU, and by careful management of the blocks it minimizes data copying between GPU caches (as data movement is slow).

An improved version, FlashAttention-2, was developed to cater to the rising demand for language models capable of handling longer context lengths. It offers enhancements in work partitioning and parallelism, enabling it to achieve up to 230 TFLOPs/s on A100 GPUs (FP16/BF16), a 2x speed increase over the original FlashAttention. Key advancements in FlashAttention-2 include the reduction of non-matmul FLOPs, improved parallelism over the sequence length dimension, better work partitioning between GPU warps, and added support for head dimensions up to 256 and multi-query attention (MQA) and grouped-query attention (GQA). Benchmarks revealed FlashAttention-2 to be up to 2x faster than FlashAttention and up to 9x faster than a standard attention implementation in PyTorch. Future developments include optimization for new hardware like H100 GPUs and new data types like FP8.


Multi-Query Attention

Multi-Query Attention changes the multiheaded attention mechanism. Whereas normally,
\text{MultiheadedAttention}(Q, K, V) = \text{Concat}_{i \in [n_{\text{heads}}]}\left(\text{Attention}(XW^Q_i, XW^K_i, XW^V_i)\right) W^O
with Multi-Query Attention, there is just one W^K and one W^V shared across all heads, thus:
\text{MultiQueryAttention}(Q, K, V) = \text{Concat}_{i \in [n_{\text{heads}}]}\left(\text{Attention}(XW^Q_i, XW^K, XW^V)\right) W^O
This has a neutral effect on model quality and training speed, but increases inference speed.


Caching

When an autoregressive transformer is used for inference, such as generating text, the query vector is different at each step, but the already-computed key and value vectors are always the same. The KV caching method saves the computed key and value vectors at each attention block, so that they are not recomputed at each new token. PagedAttention applies
memory paging In computer operating systems, memory paging is a memory management scheme by which a computer stores and retrieves data from secondary storage for use in main memory. In this scheme, the operating system retrieves data from secondary storage ...
to KV caching. If a transformer is used with a baked-in prompt, such as "You are a customer support agent...", then the key and value vectors can be computed for the prompt and saved on disk. The saving in compute is significant when the model is used for many short interactions, such as in online chatbots.
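A toy sketch of the mechanism for a single attention head; the module and cache layout are invented for illustration and do not correspond to any particular library's API:

```python
import torch

class CachedSelfAttention(torch.nn.Module):
    """Single-head attention with an explicit KV cache (illustrative toy)."""
    def __init__(self, d):
        super().__init__()
        self.W_q = torch.nn.Linear(d, d)
        self.W_k = torch.nn.Linear(d, d)
        self.W_v = torch.nn.Linear(d, d)

    def forward(self, x_new, cache):
        # x_new holds the embedding of the newly generated token only, shape (1, d).
        q = self.W_q(x_new)
        k, v = self.W_k(x_new), self.W_v(x_new)
        # Keys/values of earlier tokens come from the cache instead of being recomputed.
        K = k if cache["k"] is None else torch.cat([cache["k"], k])
        V = v if cache["v"] is None else torch.cat([cache["v"], v])
        cache["k"], cache["v"] = K, V
        attn = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V, cache

layer = CachedSelfAttention(16)
cache = {"k": None, "v": None}
for step in range(5):              # one forward pass per newly generated token
    x_new = torch.randn(1, 16)     # stand-in for the new token's embedding
    out, cache = layer(x_new, cache)
```

Without the cache, every step would have to re-run the key and value projections over the entire prefix rather than over the single new token.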


Speculative decoding

Transformers are used in large language models for autoregressive sequence generation: generating a stream of text, one token at a time. However, in most settings, decoding from language models is memory-bound, meaning that we have spare compute power available. Speculative decoding uses this spare compute power by computing several tokens in parallel. Similarly to
speculative execution Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing ...
in CPUs, future tokens are computed concurrently, by speculating on the value of previous tokens, and are later discarded if it turns out the speculation was incorrect. Specifically, consider a transformer model like GPT-3 with a context window size of 512. To generate an entire context window autoregressively with greedy decoding, it must be run 512 times, each time generating a token x_1, x_2, ..., x_{512}. However, if we had some educated guess for the values of these tokens, we could verify all of them in parallel, in one run of the model, by checking that each x_t is indeed the token with the largest log-likelihood in the t-th output. In speculative decoding, a smaller model or some other simple heuristic is used to generate a few speculative tokens that are subsequently verified by the larger model. For example, suppose a small model generated four speculative tokens: \tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4. These tokens are run through the larger model, and only \tilde{x}_1 and \tilde{x}_2 are accepted. The same run of the large model already generated a new token x_3 to replace \tilde{x}_3, and \tilde{x}_4 is completely discarded. The process then repeats (starting from the 4th token) until all tokens are generated. For non-greedy decoding, similar ideas apply, except the speculative tokens are accepted or rejected stochastically, in a way that guarantees the final output distribution is the same as if speculative decoding were not used.
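The greedy variant described above can be sketched as follows; `draft_next` and `target_next_all` are hypothetical stand-ins for the small draft model and the large target model, not part of any real API:

```python
def speculative_decode_step(prefix, draft_next, target_next_all, n_draft=4):
    """One round of greedy speculative decoding (illustrative sketch).

    draft_next(tokens)      -> the draft model's greedy next token for `tokens`
    target_next_all(tokens) -> the target model's greedy next-token prediction at
                               every position of `tokens`, from one forward pass
    """
    # 1. The small draft model proposes n_draft tokens autoregressively (cheap).
    proposed = []
    for _ in range(n_draft):
        proposed.append(draft_next(prefix + proposed))

    # 2. A single forward pass of the large model scores the prefix plus all proposals.
    target_preds = target_next_all(prefix + proposed)

    # 3. Accept proposals as long as they match what the large model would have produced.
    accepted = []
    for i, token in enumerate(proposed):
        expected = target_preds[len(prefix) + i - 1]   # large model's choice at this position
        if token == expected:
            accepted.append(token)
        else:
            accepted.append(expected)   # large model's own token replaces the first mismatch
            break
    else:
        accepted.append(target_preds[-1])   # all proposals accepted: one extra token for free
    return prefix + accepted
```

Each round therefore costs one forward pass of the large model but can emit several tokens.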


Sub-quadratic transformers

Training transformer-based architectures can be expensive, especially for long inputs. Many methods have been developed to attempt to address the issue. ''Long Range Arena'' (2020) is a standard benchmark for comparing the behavior of transformer architectures over long inputs.


Alternative attention graphs

The standard attention graph is either all-to-all or causal, both of which scale as O(N^2) where N is the number of tokens in a sequence. Reformer (2020) reduces the computational load from O(N^2) to O(N\ln N) by using
locality-sensitive hashing In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since ...
and reversible layers. Sparse attention uses attention graphs that grow more slowly than O(N^2). For example, BigBird (2020) uses random small-world networks, which grow as O(N). Ordinary transformers require a memory size that is quadratic in the size of the context window. Attention-free transformers reduce this to a linear dependence while still retaining the advantages of a transformer by linking the key to the value.
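As a concrete illustration of an attention graph that grows sub-quadratically, the sketch below builds a sliding-window (local) mask whose number of allowed query/key pairs is linear in the sequence length for a fixed window; BigBird combines such local attention with additional random and global connections:

```python
import numpy as np

def local_attention_mask(n_tokens, window=4):
    """Boolean mask for a sliding-window attention graph: token i may attend to
    token j only if |i - j| <= window, so the number of allowed pairs grows as
    O(n_tokens * window) instead of O(n_tokens ** 2)."""
    idx = np.arange(n_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_attention_mask(1024, window=4)
print(mask.sum(), "allowed pairs out of", mask.size)   # roughly 9 thousand instead of 1 million
```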


Random Feature Attention

Random Feature Attention (2021) uses Fourier random features:

\varphi(x) = \frac{1}{\sqrt{D}}\left[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \cdots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle\right]^T

where w_1, ..., w_D are independent samples from the normal distribution N(0, \sigma^2 I). This choice of parameters satisfies \mathbb{E}[\langle \varphi(x), \varphi(y)\rangle] = e^{-\frac{\|x-y\|^2}{2\sigma^2}}, or equivalently

e^{\frac{\langle x, y\rangle}{\sigma^2}} = \mathbb{E}\left[\left\langle e^{\frac{\|x\|^2}{2\sigma^2}} \varphi(x), e^{\frac{\|y\|^2}{2\sigma^2}}\varphi(y)\right\rangle\right] \approx \left\langle e^{\frac{\|x\|^2}{2\sigma^2}} \varphi(x), e^{\frac{\|y\|^2}{2\sigma^2}}\varphi(y)\right\rangle

Consequently, one-headed attention with a single query can be written as

\text{Attention}(q, K, V) = \text{softmax}\left(\frac{qK^T}{\sqrt{d_K}}\right)V \approx \frac{\varphi(q)^T \sum_i e^{\frac{\|k_i\|^2}{2\sigma^2}}\varphi(k_i) v_i^T}{\varphi(q)^T \sum_i e^{\frac{\|k_i\|^2}{2\sigma^2}}\varphi(k_i)}

where \sigma = d_K^{1/4}, and similarly for multiple queries and for multiheaded attention. This approximation can be computed in linear time, as we can compute the matrix \varphi(k_i) v_i^T first, then multiply it with the query. In essence, we have managed to obtain a more precise version of

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_K}}\right)V \approx Q(K^T V/\sqrt{d_K})

Performer (2022) uses the same Random Feature Attention, but w_1, ..., w_D are first independently sampled from the normal distribution N(0, \sigma^2 I) and then Gram-Schmidt processed (orthogonalized).
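A NumPy sketch of this construction for a single query. Note that the random vectors are drawn here with standard deviation 1/\sigma, the scale at which the Gaussian-kernel identity \mathbb{E}[\langle\varphi(x), \varphi(y)\rangle] = e^{-\|x-y\|^2/(2\sigma^2)} holds for this cos/sin feature map; the dimensions and the number of random features are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_K, D = 16, 2048
sigma = d_K ** 0.25
# Random projections with standard deviation 1/sigma, so that
# E[phi(x) . phi(y)] = exp(-||x - y||^2 / (2 sigma^2)) for the feature map below.
w = rng.normal(size=(D, d_K)) / sigma

def phi(x):
    proj = w @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

def rfa_attention(q, K, V):
    """Linear-time random-feature approximation of softmax attention (one query)."""
    weights = np.exp(np.sum(K ** 2, axis=1) / (2 * sigma ** 2))   # e^{||k_i||^2 / (2 sigma^2)}
    feats = np.stack([phi(k) for k in K]) * weights[:, None]      # scaled phi(k_i)
    S = feats.T @ V          # sum_i (scaled phi(k_i)) v_i^T -- computed once, reusable per query
    z = feats.sum(axis=0)    # sum_i  scaled phi(k_i)
    return (phi(q) @ S) / (phi(q) @ z)

# Rough comparison against exact softmax attention (agreement improves as D grows).
N = 64
q = 0.3 * rng.normal(size=d_K)
K, V = 0.3 * rng.normal(size=(N, d_K)), rng.normal(size=(N, 4))
s = K @ q / np.sqrt(d_K)
exact = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(exact)
print(rfa_attention(q, K, V))
```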


Multimodality

Transformers can also be used or adapted for modalities (input or output) beyond text, usually by finding a way to "tokenize" the modality. Multimodal models can either be trained from scratch or obtained by finetuning a pretrained model. A 2022 study found that Transformers pretrained only on natural language can be finetuned on only 0.03% of their parameters and become competitive with LSTMs on a variety of logical and visual tasks, demonstrating
transfer learning Transfer learning (TL) is a research problem in machine learning (ML) that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize ...
. LLaVA is a vision-language model composed of a language model (Vicuna-13B) and a vision model (
ViT
-L/14), connected by a linear layer. Only the linear layer is finetuned. Vision transformers adapt the transformer to computer vision by breaking down input images into a series of patches, turning them into vectors, and treating them like tokens in a standard transformer. Conformer and later
Whisper
follow the same pattern for
speech recognition Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the ma ...
, first turning the speech signal into a
spectrogram A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data are represen ...
, which is then treated like an image, i.e. broken down into a series of patches, turned into vectors, and treated like tokens in a standard transformer. Perceivers are a variant of Transformers designed for multimodality. For image generation, two notable architectures are DALL-E 1 (2021) and Parti (2022). Unlike later models, DALL-E is not a diffusion model. Instead, it uses a decoder-only Transformer that autoregressively generates text, followed by the token representation of an image, which is then converted by a
variational autoencoder In machine learning, a variational autoencoder (VAE), is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling, belonging to the families of probabilistic graphical models and variational Bayesian methods. ...
to an image. Parti is an encoder-decoder Transformer, where the encoder processes a text prompt, and the decoder generates a token representation of an image.
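A minimal PyTorch sketch of the patch "tokenization" step used by vision transformers; the patch size, image size, and model width below are illustrative:

```python
import torch

def patchify(images, patch=16, d_model=768):
    """Cut images into non-overlapping patches, flatten each patch, and project it
    to a d_model-dimensional token. images: (batch, channels, height, width),
    with height and width divisible by `patch`."""
    b, c, h, w = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)          # (b, c, h/p, w/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch * patch)   # (b, n_patches, c*p*p)
    proj = torch.nn.Linear(c * patch * patch, d_model)                  # learned patch embedding
    return proj(x)                                                      # (b, n_patches, d_model)

tokens = patchify(torch.randn(2, 3, 224, 224))
print(tokens.shape)    # torch.Size([2, 196, 768]) -- 196 patch tokens per image
```

These patch tokens, together with positional information, are then fed to a standard transformer encoder.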


Applications

The transformer has had great success in
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
(NLP). Many
large language model A large language model (LLM) is a language model consisting of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabelled text using self-supervised learning. LLMs emerged around 2018 an ...
s such as
GPT-2 Generative Pre-trained Transformer 2 (GPT-2) is an open-source artificial intelligence created by OpenAI in February 2019. GPT-2 translates text, answers questions, summarizes passages, and generates text output on a level that, while sometim ...
,
GPT-3 Generative Pre-trained Transformer 3 (GPT-3) is an autoregressive language model that uses deep learning to produce human-like text. Given an initial text as prompt, it will produce text that continues the prompt. The architecture is a standa ...
,
GPT-4 Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI and the fourth in its GPT series. It was released on March 14, 2023, and has been made publicly available in a limited form via ChatGPT Plus, ...
,
Claude
, BERT, XLNet,
RoBERTa
and
ChatGPT ChatGPT (Generative Pre-trained Transformer) is a chatbot launched by OpenAI in November 2022. It is built on top of OpenAI's GPT-3 family of large language models, and is fine-tuned (an approach to transfer learning) with both supervised and ...
demonstrate the ability of transformers to perform a wide variety of NLP-related subtasks and their real-world applications, including: *
machine translation Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates t ...
*
time series In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. E ...
prediction *
document summarization Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commo ...
* document generation *
named entity recognition Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre- ...
(NER) * writing computer code based on requirements expressed in natural language. *
speech-to-text Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the ma ...
Beyond traditional NLP, the transformer architecture has had success in other applications, such as: * biological sequence analysis * video understanding *
protein folding Protein folding is the physical process by which a protein chain is translated to its native three-dimensional structure, typically a "folded" conformation by which the protein becomes biologically functional. Via an expeditious and reprodu ...
(such as
AlphaFold AlphaFold is an artificial intelligence (AI) program developed by DeepMind, a subsidiary of Alphabet, which performs predictions of protein structure. The program is designed as a deep learning system. AlphaFold AI software has had two major ve ...
) * evaluating chess board positions. Using static evaluation alone (that is, with no
Minimax Minimax (sometimes MinMax, MM or saddle point) is a decision rule used in artificial intelligence, decision theory, game theory, statistics, and philosophy for ''mini''mizing the possible loss for a worst case (''max''imum loss) scenario. Whe ...
search), a transformer achieved an
Elo rating
of 2895, putting it at grandmaster level.




Further reading

* Alexander Rush, The Annotated Transformer, Harvard NLP group, 3 April 2018