Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by researchers at Google. It learns to represent text as a sequence of vectors using self-supervised learning, and it uses the encoder-only transformer architecture. BERT dramatically improved the state of the art for large language models, and it is a ubiquitous baseline in natural language processing (NLP) experiments.

BERT is trained by masked token prediction and next sentence prediction. As a result of this training process, BERT learns contextual, latent representations of tokens in their context, similar to ELMo and GPT-2. It found applications in many natural language processing tasks, such as coreference resolution and polysemy resolution. It is an evolutionary step over ELMo, and spawned the study of "BERTology", which attempts to interpret what is learned by BERT.

BERT was originally implemented in the English language at two model sizes, BERTBASE (110 million parameters) and BERTLARGE (340 million parameters). Both were trained on the Toronto BookCorpus (800M words) and English Wikipedia (2,500M words). The weights were released on GitHub. On March 11, 2020, 24 smaller models were released, the smallest being BERTTINY with just 4 million parameters.


Architecture

BERT is an "encoder-only" transformer architecture. At a high level, BERT consists of four modules:

* Tokenizer: This module converts a piece of English text into a sequence of integers ("tokens").
* Embedding: This module converts the sequence of tokens into an array of real-valued vectors representing the tokens. It represents the conversion of discrete token types into a lower-dimensional Euclidean space.
* Encoder: a stack of Transformer blocks with self-attention, but without causal masking.
* Task head: This module converts the final representation vectors into one-hot encoded tokens again by producing a predicted probability distribution over the token types. It can be viewed as a simple decoder, decoding the latent representation into token types, or as an "un-embedding layer".

The task head is necessary for pre-training, but it is often unnecessary for so-called "downstream tasks", such as question answering or sentiment classification. Instead, one removes the task head and replaces it with a newly initialized module suited for the task, and finetunes the new module. The latent vector representation of the model is directly fed into this new module, allowing for sample-efficient transfer learning. A minimal sketch of how these modules fit together is given below.
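The following is a minimal sketch, in PyTorch, of the four modules described above (the tokenizer itself is external and omitted). The class name, hyperparameter defaults, and attribute names are illustrative assumptions, not the original implementation, which was released in TensorFlow.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class MiniBERT(nn.Module):
    """Illustrative encoder-only model; defaults roughly follow BERT-BASE."""

    def __init__(self, vocab_size=30000, hidden=768, layers=12, max_len=512):
        super().__init__()
        # Embedding: token, position, and segment embeddings, summed then LayerNorm-ed.
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        self.seg_emb = nn.Embedding(2, hidden)
        self.norm = nn.LayerNorm(hidden)
        # Encoder: Transformer blocks with bidirectional (unmasked) self-attention.
        block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=hidden // 64, dim_feedforward=4 * hidden,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        # Task head ("un-embedding"): maps each hidden vector back to vocabulary logits.
        # For downstream tasks this module is replaced by a task-specific head.
        self.mlm_head = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions) + self.seg_emb(segment_ids)
        x = self.norm(x)
        h = self.encoder(x)           # contextual representation, one vector per token
        return self.mlm_head(h)       # probability logits over the vocabulary
</syntaxhighlight>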


Embedding

This section describes the embedding used by BERTBASE. The other one, BERTLARGE, is similar, just larger.

The tokenizer of BERT is WordPiece, which is a sub-word strategy like byte pair encoding. Its vocabulary size is 30,000, and any token not appearing in its vocabulary is replaced by [UNK] ("unknown").

The first layer is the embedding layer, which contains three components: token type embeddings, position embeddings, and segment type embeddings.

* Token type: The token type embedding is a standard embedding layer, translating a one-hot vector into a dense vector based on its token type.
* Position: The position embeddings are based on a token's position in the sequence. BERT uses learned absolute position embeddings, where each position in the sequence is mapped to a real-valued vector learned during training.
* Segment type: Using a vocabulary of just 0 or 1, this embedding layer produces a dense vector based on whether the token belongs to the first or second text segment in that input. In other words, type-1 tokens are all tokens that appear after the [SEP] special token; all prior tokens are type-0.

The three embedding vectors are added together, representing the initial token representation as a function of these three pieces of information. After embedding, the vector representation is normalized using a LayerNorm operation, outputting a 768-dimensional vector for each input token. After this, the representation vectors are passed forward through 12 Transformer encoder blocks, and are decoded back to 30,000-dimensional vocabulary space using a basic affine transformation layer.
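The WordPiece tokenization, the special tokens, and the segment ("token type") ids can be inspected with the Hugging Face transformers library, shown in the sketch below. This library is a later re-implementation, not part of the original release, and the exact sub-word split shown in the comment is indicative rather than guaranteed.

<syntaxhighlight lang="python">
from transformers import BertTokenizer

# Load the 30,000-entry WordPiece vocabulary of bert-base-uncased.
tok = BertTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("unaffable weather"))   # sub-word pieces, e.g. ['una', '##ffa', '##ble', 'weather']

enc = tok("my dog is cute", "he likes playing")   # two text segments
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(enc["token_type_ids"])   # segment ids: 0 up to and including the first [SEP], then 1
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
</syntaxhighlight>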


Architectural family

The encoder stack of BERT has 2 free parameters: L, the number of layers, and H, the hidden size. There are always H/64 self-attention heads, and the feed-forward/filter size is always 4H. By varying these two numbers, one obtains an entire family of BERT models.

For BERT:
* the feed-forward size and filter size are synonymous. Both of them denote the number of dimensions in the middle layer of the feed-forward network.
* the hidden size and embedding size are synonymous. Both of them denote the number of real numbers used to represent a token.

The notation for an encoder stack is written as L/H. For example, BERTBASE is written as 12L/768H, BERTLARGE as 24L/1024H, and BERTTINY as 2L/128H.
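As a small sketch of how the derived sizes follow from L and H, the snippet below applies the two rules stated above to the three named configurations. The dictionary and its labels are illustrative only.

<syntaxhighlight lang="python">
# Derived sizes for some members of the BERT family, following the rules
# heads = H / 64 and feed-forward size = 4 * H described above.
CONFIGS = {"BERT-TINY": (2, 128), "BERT-BASE": (12, 768), "BERT-LARGE": (24, 1024)}

for name, (L, H) in CONFIGS.items():
    heads = H // 64
    ffn = 4 * H
    print(f"{name}: {L}L/{H}H, {heads} attention heads, feed-forward size {ffn}")
# e.g. BERT-BASE: 12L/768H, 12 attention heads, feed-forward size 3072
#      BERT-LARGE: 24L/1024H, 16 attention heads, feed-forward size 4096
</syntaxhighlight>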


Training


Pre-training

BERT was pre-trained simultaneously on two tasks:

* Masked Language Model (MLM): In this task, BERT ingests a sequence of words in which some words may be randomly changed ("masked"), and BERT tries to predict the original words that had been changed. For example, in the sentence "The cat sat on the [MASK]", BERT would need to predict "mat". This helps BERT learn bidirectional context, meaning it understands the relationships between words not just from left to right or right to left, but from both directions at the same time.
* Next Sentence Prediction (NSP): In this task, BERT is trained to predict whether one sentence logically follows another. For example, given the two sentences "The cat sat on the mat." and "It was a sunny day.", BERT has to decide whether the second sentence is a valid continuation of the first one. This helps BERT understand relationships between sentences, which is important for tasks like question answering or document classification.


Masked language modeling

In masked language modeling, 15% of tokens are randomly selected for the masked-prediction task, and the training objective is to predict the masked token given its context. In more detail, each selected token is:

* replaced with a [MASK] token with probability 80%,
* replaced with a random word token with probability 10%,
* not replaced with probability 10%.

The reason not all selected tokens are masked is to avoid the dataset-shift problem. The dataset-shift problem arises when the distribution of inputs seen during training differs significantly from the distribution encountered during inference. A trained BERT model might be applied to word representation (like Word2Vec), where it would be run over sentences not containing any [MASK] tokens. It was later found that more diverse training objectives are generally better.

As an illustrative example, consider the sentence "my dog is cute". It would first be divided into tokens, "my₁ dog₂ is₃ cute₄". Then a random token in the sentence would be picked; let it be the 4th one, "cute₄". Next, there would be three possibilities:

* with probability 80%, the chosen token is masked, resulting in "my₁ dog₂ is₃ [MASK]₄";
* with probability 10%, the chosen token is replaced by a uniformly sampled random token, such as "happy", resulting in "my₁ dog₂ is₃ happy₄";
* with probability 10%, nothing is done, resulting in "my₁ dog₂ is₃ cute₄".

After processing the input text, the model's 4th output vector is passed to its decoder layer, which outputs a probability distribution over its 30,000-dimensional vocabulary space.
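The 80/10/10 selection procedure can be sketched as follows. The function name is hypothetical, and whole words are used in place of WordPiece tokens for readability.

<syntaxhighlight lang="python">
import random

def mask_for_mlm(tokens, vocab, select_prob=0.15, mask_token="[MASK]"):
    """Return (corrupted tokens, labels); labels hold the original token at
    selected positions and None elsewhere, following the 80/10/10 rule above."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:        # 15% of tokens are selected
            labels.append(tok)                   # the model must predict the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                corrupted.append(mask_token)
            elif r < 0.9:                        # 10%: replace with a random token
                corrupted.append(random.choice(vocab))
            else:                                # 10%: keep the token unchanged
                corrupted.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)                  # not part of the training loss
    return corrupted, labels

print(mask_for_mlm("my dog is cute".split(), vocab=["happy", "mat", "runs"]))
</syntaxhighlight>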


Next sentence prediction

Given two spans of text, the model predicts if these two spans appeared sequentially in the training corpus, outputting either IsNext or NotNext. Specifically, the training algorithm would sometimes sample two spans from a single continuous span in the training corpus, and at other times sample two spans from two discontinuous spans in the training corpus.

The first span starts with a special token [CLS] (for "classify"). The two spans are separated by a special token [SEP] (for "separate"). After processing the two spans, the first output vector (the vector coding for [CLS]) is passed to a separate neural network for the binary classification into IsNext and NotNext.

* For example, given "[CLS] my dog is cute [SEP] he likes playing", the model should output the token IsNext.
* Given "[CLS] my dog is cute [SEP] how do magnets work", the model should output the token NotNext.
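A sketch of how NSP training pairs could be assembled from a toy list of consecutive sentences is shown below. The helper name and the exact sampling details are illustrative; in the actual corpus the negative span is drawn from elsewhere in the corpus rather than from the same short list.

<syntaxhighlight lang="python">
import random

def make_nsp_example(sentences, i):
    """Build one next-sentence-prediction example starting at sentence i."""
    first = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        second, label = sentences[i + 1], "IsNext"            # true continuation
    else:
        # pick any sentence that is NOT the true continuation
        second = random.choice([s for j, s in enumerate(sentences) if j != i + 1])
        label = "NotNext"
    text = "[CLS] " + first + " [SEP] " + second + " [SEP]"
    return text, label

sents = ["my dog is cute", "he likes playing", "how do magnets work"]
print(make_nsp_example(sents, 0))
</syntaxhighlight>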


Fine-tuning

BERT is meant as a general pretrained model for various applications in natural language processing. That is, after pre-training, BERT can be fine-tuned with fewer resources on smaller datasets to optimize its performance on specific tasks such as natural language inference and text classification, and sequence-to-sequence-based language generation tasks such as question answering and conversational response generation.

The original BERT paper published results demonstrating that a small amount of finetuning (for BERTLARGE, 1 hour on 1 Cloud TPU) allowed it to achieve state-of-the-art performance on a number of natural language understanding tasks:

* GLUE (General Language Understanding Evaluation) task set (consisting of 9 tasks);
* SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0;
* SWAG (Situations With Adversarial Generations).

In the original paper, all parameters of BERT were finetuned, and the authors recommended that, for downstream applications that are text classifications, the output token at the [CLS] input position be fed into a linear-softmax layer to produce the label outputs. The original code base defined the final linear layer as a "pooler layer", in analogy with global pooling in computer vision, even though it simply discards all output tokens except the one corresponding to [CLS].
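A sketch of the recommended text-classification setup, using the Hugging Face transformers re-implementation rather than the original code base, is shown below. The model name, label count, and example sentences are illustrative.

<syntaxhighlight lang="python">
from transformers import BertTokenizer, BertForSequenceClassification

tok = BertTokenizer.from_pretrained("bert-base-uncased")
# The pre-training task head is dropped and replaced by a freshly initialised
# linear classifier on top of the [CLS] representation (2 labels here).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tok(["the movie was great", "the movie was terrible"],
            padding=True, return_tensors="pt")
logits = model(**batch).logits            # shape (2, 2): one row per sentence
print(logits.softmax(dim=-1))             # class probabilities (head is untrained, so not meaningful yet)
# All parameters (encoder plus the new head) would then be finetuned on labelled data.
</syntaxhighlight>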


Cost

BERT was trained on the BookCorpus (800M words) and a filtered version of English Wikipedia (2,500M words) without lists, tables, and headers. Training BERTBASE on 4 cloud TPUs (16 TPU chips total) took 4 days, at an estimated cost of 500 USD. Training BERTLARGE on 16 cloud TPUs (64 TPU chips total) took 4 days.


Interpretation

Language models like ELMo, GPT-2, and BERT spawned the study of "BERTology", which attempts to interpret what is learned by these models. Their performance on these natural language understanding tasks is not yet well understood. Several research publications in 2018 and 2019 focused on investigating the relationship between BERT's output and carefully chosen input sequences, on analysis of internal vector representations through probing classifiers, and on the relationships represented by attention weights.

The high performance of the BERT model could also be attributed to the fact that it is bidirectionally trained. This means that BERT, based on the Transformer model architecture, applies its self-attention mechanism to learn information from a text from the left and right side during training, and consequently gains a deep understanding of the context. For example, the word ''fine'' can have two different meanings depending on the context (I feel fine today, She has fine blond hair). BERT considers the words surrounding the target word ''fine'' from the left and right side.

However, this comes at a cost: because the encoder-only architecture lacks a decoder, BERT can't be prompted and can't generate text, while bidirectional models in general do not work effectively without the right-side context, making them difficult to prompt. As an illustrative example, if one wishes to use BERT to continue a sentence fragment "Today, I went to", then naively one would mask out all the remaining tokens, as in "Today, I went to [MASK] [MASK] [MASK] ... [MASK].", where the number of [MASK] tokens is the length of the sentence one wishes to extend to. However, this constitutes a dataset shift, as during training BERT has never seen sentences with that many tokens masked out. Consequently, its performance degrades. More sophisticated techniques allow text generation, but at a high computational cost.
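The single-mask prediction that BERT was trained for can be tried directly; the sketch below uses the Hugging Face fill-mask pipeline, a later re-implementation, and the exact predictions depend on the released weights rather than being guaranteed here.

<syntaxhighlight lang="python">
from transformers import pipeline

# BERT can fill one masked position using context from both sides...
fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("The cat sat on the [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))

# ...but it cannot be prompted to continue text: masking a long run of
# trailing [MASK] tokens is a dataset shift relative to pre-training.
</syntaxhighlight>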


History

BERT was originally published by Google researchers Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. The design has its origins in pre-training contextual representations, including semi-supervised sequence learning, generative pre-training, ELMo, and ULMFit. Unlike previous models, BERT is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary, whereas BERT takes into account the context for each occurrence of a given word. For instance, whereas the vector for "running" will have the same word2vec vector representation for both of its occurrences in the sentences "He is running a company" and "He is running a marathon", BERT will provide a contextualized embedding that will be different according to the sentence.

On October 25, 2019, Google announced that they had started applying BERT models for English-language search queries within the US. On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages. In October 2020, almost every single English-based query was processed by a BERT model.
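The "running" example above can be checked empirically with the sketch below, again using the Hugging Face transformers re-implementation. The helper function is hypothetical; a context-free model would return identical vectors for both sentences, while BERT's two vectors differ.

<syntaxhighlight lang="python">
import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def running_vector(sentence):
    """Return BERT's contextual vector for the token 'running' in the sentence."""
    enc = tok(sentence, return_tensors="pt")
    idx = tok.convert_ids_to_tokens(enc["input_ids"][0]).index("running")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # one 768-dim vector per token
    return hidden[idx]

a = running_vector("He is running a company")
b = running_vector("He is running a marathon")
# word2vec would give cosine similarity 1.0 here; BERT's vectors differ
# because the surrounding words differ.
print(torch.cosine_similarity(a, b, dim=0).item())
</syntaxhighlight>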


Variants

The BERT models were influential and inspired many variants.

RoBERTa (2019) was an engineering improvement. It preserves BERT's architecture (slightly larger, at 355M parameters), but improves its training, changing key hyperparameters, removing the ''next-sentence prediction'' task, and using much larger mini-batch sizes.

DistilBERT (2019) distills BERTBASE to a model with just 60% of its parameters (66M), while preserving 95% of its benchmark scores. Similarly, TinyBERT (2019) is a distilled model with just 28% of its parameters.

ALBERT (2019) shared parameters across layers, and experimented with independently varying the hidden size and the word-embedding layer's output size as two hyperparameters. It also replaced the ''next sentence prediction'' task with the ''sentence-order prediction'' (SOP) task, where the model must distinguish the correct order of two consecutive text segments from their reversed order.

ELECTRA (2020) applied the idea of generative adversarial networks to the MLM task. Instead of masking out tokens, a small language model generates random plausible substitutions, and a larger network identifies these replaced tokens. The small model aims to fool the large model.

DeBERTa (2020) is a significant architectural variant, with ''disentangled attention''. Its key idea is to treat the positional and token encodings separately throughout the attention mechanism. Instead of combining the positional encoding (x_position) and token encoding (x_token) into a single input vector (x_input = x_position + x_token), DeBERTa keeps them separate as a tuple (x_position, x_token). Then, at each self-attention layer, DeBERTa computes three distinct attention matrices, rather than the single attention matrix used in BERT: content-to-content, content-to-position, and position-to-content attention. The three attention matrices are added together element-wise, then passed through a softmax layer and multiplied by a projection matrix. Absolute position encoding is included in the final self-attention layer as additional input.
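The summing of the three attention matrices can be sketched as below. This is a heavily simplified illustration under stated assumptions, not DeBERTa's actual implementation: the relative-position bucketing, the score scaling, the multi-head structure, and the final value projection are all omitted, and the function name is hypothetical.

<syntaxhighlight lang="python">
import torch

def disentangled_attention_scores(content, rel_pos, Wq_c, Wk_c, Wq_r, Wk_r):
    """Simplified sketch of DeBERTa-style disentangled attention weights.

    content : (n, d) token ("content") vectors
    rel_pos : (n, n, d) relative-position embedding for each (i, j) pair
    """
    Qc, Kc = content @ Wq_c, content @ Wk_c           # content projections
    Qr = rel_pos @ Wq_r                               # (n, n, d) position-as-query
    Kr = rel_pos @ Wk_r                               # (n, n, d) position-as-key
    c2c = Qc @ Kc.T                                   # content-to-content
    c2p = torch.einsum("id,ijd->ij", Qc, Kr)          # content-to-position
    p2c = torch.einsum("ijd,jd->ij", Qr, Kc)          # position-to-content
    return torch.softmax(c2c + c2p + p2c, dim=-1)     # summed element-wise, then softmax

n, d = 4, 8
scores = disentangled_attention_scores(
    torch.randn(n, d), torch.randn(n, n, d),
    *(torch.randn(d, d) for _ in range(4)))
print(scores.shape)   # (4, 4) attention weights
</syntaxhighlight>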



External links


Official GitHub repository