XLNet is an autoregressive Transformer language model designed as an improvement over BERT. Its largest version has 340M parameters and was trained on about 33 billion words. It was released on 19 June 2019 under the Apache 2.0 license.
It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference.
Architecture
The main idea of XLNet is to model language autoregressively, like the GPT models, but to allow for ''all possible permutations'' of the factorization order of a sentence.
Concretely, consider the following sentence:
My dog is cute.
In standard autoregressive language modeling, the model is tasked with predicting the probability of each word conditioned on the previous words as its context. The joint probability of a sequence of words <math>x_1, x_2, \dots, x_T</math> is factorized using the chain rule:
<math>\Pr(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} \Pr(x_t \mid x_1, \dots, x_{t-1})</math>
For example, the sentence "My dog is cute" is factorized as:
<math>\Pr(\text{My}) \, \Pr(\text{dog} \mid \text{My}) \, \Pr(\text{is} \mid \text{My}, \text{dog}) \, \Pr(\text{cute} \mid \text{My}, \text{dog}, \text{is})</math>
Schematically, we can write it as
<math>1 \to 2 \to 3 \to 4</math>
However, for XLNet, the model is required to predict the words in a randomly generated order. Suppose the sampled order is 3241; then, schematically, the model is required to perform the following prediction task:
<math>3 \to 2 \to 4 \to 1</math>
that is, it first predicts <math>x_3</math> with no context, then <math>x_2</math> given <math>x_3</math>, then <math>x_4</math> given <math>x_3, x_2</math>, and finally <math>x_1</math> given <math>x_3, x_2, x_4</math>.
By considering all permutations, XLNet is able to capture longer-range dependencies and better model the bidirectional context of words.
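The prediction tasks implied by a sampled order can be written out explicitly. The following is a minimal Python sketch for the example above (illustrative only, not the released XLNet code; the function name and the 0-indexed positions are choices made here for clarity):
<syntaxhighlight lang="python">
# Illustrative sketch: list the prediction tasks implied by a factorization order.
def permutation_prediction_tasks(tokens, order):
    """At step t, predict tokens[order[t]] given the tokens at the
    positions already visited in order[:t]."""
    tasks = []
    for t, pos in enumerate(order):
        context = {p: tokens[p] for p in sorted(order[:t])}
        tasks.append((pos, tokens[pos], context))
    return tasks

tokens = ["My", "dog", "is", "cute"]

# Standard left-to-right factorization (order 1 2 3 4 in the article's
# 1-indexed notation, written 0-indexed here).
for pos, target, ctx in permutation_prediction_tasks(tokens, [0, 1, 2, 3]):
    print(f"predict {target!r} at position {pos} given {ctx}")

# XLNet-style sampled order 3 2 4 1, i.e. [2, 1, 3, 0] when 0-indexed.
for pos, target, ctx in permutation_prediction_tasks(tokens, [2, 1, 3, 0]):
    print(f"predict {target!r} at position {pos} given {ctx}")
</syntaxhighlight>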
Two-Stream Self-Attention
To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are:
* Content stream: This stream encodes the content of each word, as in standard causally masked self-attention.
* Query stream: This stream encodes each word's position and its context (the words that precede it in the sampled order), but not the content of the word itself. In more detail, it is a masked cross-attention mechanism, where the queries come from the query stream and the key-value pairs come from the content stream.
The content stream uses the causal attention mask <math>M_{\text{causal}}</math> (lower-triangular, so each position may attend to itself and to earlier positions), permuted by a random permutation matrix <math>P</math> to <math>P M_{\text{causal}} P^\mathsf{T}</math>. The query stream uses the cross-attention mask <math>P (M_{\text{causal}} - I) P^\mathsf{T}</math>, where the diagonal is subtracted away specifically to avoid the model "cheating" by looking at the content stream for what the current masked token is.
Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass.
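The two masks can be made concrete with a small NumPy sketch. This is illustrative only; the 0/1 "may attend" convention and the particular definition of <math>P</math> (a 1 in column <math>t</math> at row <math>\text{order}[t]</math>) are choices made for this example rather than details of the released implementation:
<syntaxhighlight lang="python">
import numpy as np

n = 4
order = [2, 1, 3, 0]  # sampled factorization order (0-indexed form of 3 2 4 1)

# Permutation matrix P with P[order[t], t] = 1, so conjugating a mask
# defined over prediction steps expresses it over token positions.
P = np.zeros((n, n), dtype=int)
P[order, np.arange(n)] = 1

# Causal mask over steps: 1 = "may attend", lower triangular.
M_causal = np.tril(np.ones((n, n), dtype=int))

# Content stream: each position attends to itself and to the positions
# that come earlier in the sampled order.
content_mask = P @ M_causal @ P.T

# Query stream: the diagonal is removed, so a position sees its context
# under the permutation but never its own content.
query_mask = P @ (M_causal - np.eye(n, dtype=int)) @ P.T

print(content_mask)
print(query_mask)
</syntaxhighlight>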
Training
Two models were released:
* XLNet-Large, cased: 340M parameters, 24-layer, 1024-hidden, 16-heads
* XLNet-Base, cased: 110M parameters, 12-layer, 768-hidden, 12-heads.
It was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of
BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl.
It was trained on 512 TPU v3 chips for 5.5 days. At the end of training, it still under-fitted the data, meaning it could have achieved lower loss with more training. Training took 500,000 steps with the Adam optimizer, linear learning rate decay, and a batch size of 8192.
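A rough sketch of the optimizer set-up described above, using PyTorch for illustration (the model, learning rate, and batch handling are placeholders rather than values taken from the released training code):
<syntaxhighlight lang="python">
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the XLNet network
total_steps = 500_000                # 0.5 million training steps

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder learning rate
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps),  # linear decay to zero
)

def training_step(batch_loss):
    """One update on a batch (of size 8192 in the original setup)."""
    batch_loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
</syntaxhighlight>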
See also
* BERT (language model)
* Transformer (machine learning model)
* Generative pre-trained transformer