A transformer is a
deep learning architecture developed by researchers at
Google
and based on the multi-head
attention
mechanism, proposed in a 2017 paper "
Attention Is All You Need".
Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup in a word embedding table.
At each layer, each token is then
contextualized within the scope of the context window with other (unmasked) tokens via a parallel
multi-head attention mechanism
allowing the signal for key tokens to be amplified and less important tokens to be diminished.
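This contextualization can be illustrated with a minimal single-head sketch of scaled dot-product attention in Python; the dimensions and random inputs are hypothetical, and real multi-head attention runs several such maps in parallel over learned projections of the token vectors and concatenates the results.

```python
# Minimal single-head scaled dot-product attention sketch (hypothetical sizes).
import numpy as np

def attention(Q, K, V):
    """Each output row is a weighted mix of the rows of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: emphasize relevant tokens
    return weights @ V

seq_len, d_k = 4, 8                                    # 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))  # stand-ins for learned projections
print(attention(Q, K, V).shape)                        # (4, 8): one contextualized vector per token
```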
Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier
recurrent neural architectures (RNNs) such as
long short-term memory
(LSTM).
Later variations have been widely adopted for training large language models (LLMs) on large (language) datasets, such as the Wikipedia corpus and Common Crawl.
Transformers were first developed as an improvement over previous architectures for machine translation, but have found many applications since then. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning, audio, multi-modal processing, robotics, and even playing chess. The architecture has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs) and BERT (Bidirectional Encoder Representations from Transformers).
History
Predecessors
For many years, sequence modelling and generation were done using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.
A key breakthrough was LSTM (1995), an RNN which used various innovations to overcome the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key innovation was the use of an attention mechanism which used neurons that multiply the outputs of other neurons, so-called ''multiplicative units''. Neural networks using multiplicative units were called ''sigma-pi networks'' or ''higher-order networks'', but they faced high computational complexity.
LSTM became the standard architecture for long sequence modelling until the 2017 publication of Transformers.
However, LSTM still used sequential processing, like most other RNNs. Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence. An early attempt to overcome this was the
fast weight controller (1992), which computed the weight matrix for further processing depending on the input.
It used the fast weights architecture (1987), where one neural network outputs the weights of another neural network. It was later shown to be equivalent to the linear Transformer without normalization.
Attention with seq2seq
The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators of seq2seq are two concurrently published papers from 2014.
(Sutskever et al, 2014) was a 380M-parameter model for machine translation using two long short-term memory (LSTM) networks. The architecture consists of two parts. The ''encoder'' is an LSTM that takes in a sequence of tokens and turns it into a vector. The ''decoder'' is another LSTM that converts the vector into a sequence of tokens. Similarly, (Cho et al, 2014) was a 130M-parameter model that used gated recurrent units (GRU) instead of LSTM. Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.
These early seq2seq models had no attention mechanism, and the state vector was accessible only after the ''last'' word of the source text had been processed. Although in theory such a vector retains information about the whole original sentence, in practice the information is poorly preserved, since the input is processed sequentially by one recurrent network into a ''fixed''-size output vector, which is then processed by another recurrent network into an output. If the input is long, the output vector cannot contain all the relevant information, and the output quality degrades. As evidence, reversing the input sentence improved seq2seq translation.
(Bahdanau et al, 2014) introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem, allowing the model to process long-distance dependencies more easily. They called their model ''RNNsearch'', as it "emulates searching through a source sentence during decoding a translation".
(Luong et al, 2015) compared the relative performance of global (that of (Bahdanau et al, 2014)) and local (sliding window) attention model architectures for machine translation, and found that a mixed attention architecture had higher quality than global attention, while the use of a local attention architecture reduced translation time.
In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM. It took nine months to develop, and it achieved a higher level of performance than the statistical approach, which had taken ten years to develop. In the same year, self-attention ''avant la lettre'', originally called ''intra-attention'' or ''intra-sentence attention'', was proposed for LSTMs.
Parallelizing attention
Seq2seq models with attention (including self-attention) still suffered from the same issue as recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, ''decomposable attention'' applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved a state-of-the-art (SOTA) result in textual entailment with an order of magnitude fewer parameters than LSTMs. One of its authors, Jakob Uszkoreit, suspected that attention ''without'' recurrence is sufficient for language translation, hence the title "attention is ''all'' you need". That hypothesis was against the conventional wisdom of the time, and even his father, a well-known computational linguist, was skeptical.
In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to process all tokens in parallel, while preserving its dot-product attention mechanism to keep its text processing performance. Its parallelizability was an important factor in its widespread use in large neural networks.
AI boom era
Already in spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles. The Transformer architecture is now used in many generative models that contribute to the ongoing AI boom.
In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the line of research from bag of words and word2vec. It was followed by BERT (2018), an encoder-only Transformer model. In October 2019, Google started using BERT to process search queries. In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model with a Transformer-encoder–RNN-decoder model.
Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation. In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly popular, triggering a boom around large language models.
Since 2020, Transformers have been applied in modalities beyond text, including the vision transformer, speech recognition, robotics, and multimodal processing. The vision transformer, in turn, stimulated new developments in convolutional neural networks. Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024), and Sora (2024) are based on the Transformer architecture.
Training
Methods for stabilizing training
The plain transformer architecture had difficulty converging. In the original paper the authors recommended using learning rate warmup. That is, the learning rate should linearly scale up from 0 to its maximal value over the first part of training (usually recommended to be 2% of the total number of training steps), before decaying again.
A 2020 paper found that using layer normalization ''before'' (instead of after) multi-headed attention and feedforward layers stabilizes training, removing the need for learning rate warmup.
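For illustration, a warmup-then-decay schedule along the lines of the one used in the original paper (linear warmup followed by inverse-square-root decay) can be sketched as follows; the model dimension and number of warmup steps are hypothetical example values.

```python
# Sketch of a learning-rate schedule with linear warmup, then
# inverse-square-root decay. d_model and warmup_steps are example values.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    warmup = step * warmup_steps ** -1.5   # linear ramp while step < warmup_steps
    decay = step ** -0.5                   # inverse-square-root decay afterwards
    return d_model ** -0.5 * min(warmup, decay)

# The rate peaks at step == warmup_steps, then decays.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(40000))
```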
Pretrain-finetune
Transformers are typically first pretrained by self-supervised learning on a large generic dataset, followed by supervised fine-tuning on a small task-specific dataset. The pretraining dataset is typically an unlabeled large corpus, such as The Pile. Tasks for pretraining and fine-tuning commonly include:
* language modeling
* next-sentence prediction
* question answering
* reading comprehension
* sentiment analysis
* paraphrasing
The T5 transformer report documents a large number of natural language pretraining tasks. Some examples are:
* restoring or repairing incomplete or corrupted text. For example, the input, ''"Thank you ___ me to your party ___ week"'', might generate the output, ''"Thank you for inviting me to your party last week"''.
* translation between natural languages (machine translation)
* judging the pragmatic acceptability of natural language. For example, the following sentence might be judged "not acceptable", because even though it is syntactically well-formed, it is improbable in ordinary human usage: ''The course is jumping well.''
Note that while each of these tasks is trivial or obvious for human native speakers of the language (or languages), they have typically proved challenging for previous generations of machine learning architectures.
Tasks
In general, there are three classes of language modelling tasks: "masked", "autoregressive", and "prefixLM". These classes are independent of a specific modeling architecture such as the Transformer, but they are often discussed in the context of the Transformer.
In a masked task, one or more of the tokens is masked out, and the model produces a probability distribution predicting what the masked-out tokens are, based on the context. The loss function for the task is typically the sum of log-perplexities for the masked-out tokens,
$\text{Loss} = -\sum_{t \in \text{masked tokens}} \ln(\text{probability of } t \text{ given its context}),$
and the model is trained to minimize this loss function. The BERT series of models are trained for masked token prediction and another task.
In an autoregressive task, the entire sequence is masked at first, and the model produces a probability distribution for the first token. Then the first token is revealed and the model predicts the second token, and so on. The loss function for the task is still typically the same. The GPT series of models are trained by autoregressive tasks.
In a prefixLM task, the sequence is divided into two parts. The first part is presented as context, and the model predicts the first token of the second part. Then that token is revealed, and the model predicts the second token, and so on. The loss function for the task is still typically the same. The T5 series of models are trained by prefixLM tasks.
Note that "masked" as in "masked language modelling" is not "masked" as in " masked attention", and "prefixLM" (prefix language modeling) is not "prefixLM" (prefix language model).
Architecture
All transformers have the same primary components:
* Tokenizers, which convert text into tokens.
* Embedding layer, which converts tokens and positions of the tokens into vector representations.
* Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
* Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
The following description follows the Transformer exactly as described in the original paper. There are variants, described in the following section.
By convention, we write all vectors as row vectors. This, for example, means that pushing a vector through a linear layer means multiplying it by a weight matrix on the right, as $xW$.
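A small illustration of this convention, with hypothetical dimensions:

```python
# Row-vector convention: a linear layer multiplies on the right, x -> xW.
import numpy as np

d_in, d_out = 4, 3
x = np.ones((1, d_in))                # one token as a row vector
W = np.zeros((d_in, d_out)) + 0.5     # weight matrix of the linear layer
y = x @ W                             # shape (1, d_out)
print(y)                              # [[2. 2. 2.]]
```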
Tokenization
As the Transformer architecture natively processes numerical data, not text, there must be a translation between text and tokens. A token is an integer that represents a character, or a short segment of characters. On the input side, the input text is parsed into a token sequence. Similarly, on the output side, the output tokens are parsed back to text. The module doing the conversion between token sequences and texts is a tokenizer.
The set of all tokens is the vocabulary of the tokenizer, and its size is the ''vocabulary size''. When faced with tokens outside the vocabulary, typically a special token is used, written as "[UNK]" for "unknown".
Some commonly used tokenizers are byte pair encoding, WordPiece, and SentencePiece.
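A minimal sketch of the tokenizer interface, using a hypothetical word-level vocabulary for readability; real tokenizers such as byte pair encoding, WordPiece, and SentencePiece operate on subword units, but the interface is the same: text in, integer tokens out, with "[UNK]" for out-of-vocabulary input.

```python
# Hypothetical word-level tokenizer illustrating the text <-> token interface.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    """Text to token IDs; unknown words map to the [UNK] token."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

def decode(tokens):
    """Token IDs back to text."""
    return " ".join(inv_vocab[t] for t in tokens)

print(encode("the cat sat"))   # [1, 2, 3]
print(decode([1, 2, 0]))       # "the cat [UNK]"
```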
Embedding
Each token is converted into an embedding vector via a lookup table. Equivalently stated, it multiplies a one-hot
representation of the token by an embedding matrix . For example, if the input token is , then the one-hot representation is