XLNet is an autoregressive Transformer language model designed as an improvement over BERT. Its largest version has 340M parameters and was trained on about 33 billion words. It was released on 19 June 2019 under the Apache 2.0 license.
It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference.
Architecture
The main idea of XLNet is to model language autoregressively, like the GPT models, but to allow for ''all possible permutations'' of the factorization order of a sentence.
Concretely, consider the following sentence:
My dog is cute.
In standard autoregressive language modeling, the model is tasked with predicting the probability of each word conditioned on the previous words as its context. The joint probability of a sequence of words <math>x_1, x_2, \dots, x_T</math> is factorized using the chain rule:
<math>\Pr(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} \Pr(x_t \mid x_1, \dots, x_{t-1})</math>
For example, the sentence "My dog is cute" is factorized as:
<math>\Pr(\text{My}) \, \Pr(\text{dog} \mid \text{My}) \, \Pr(\text{is} \mid \text{My}, \text{dog}) \, \Pr(\text{cute} \mid \text{My}, \text{dog}, \text{is})</math>
Schematically, we can write it as
<math>1 \to 2 \to 3 \to 4</math>
However, for XLNet, the model is required to predict the words in a randomly generated order. Suppose the sampled order is 3241; then, schematically, the model is required to perform the following prediction task:
<math>3 \to 2 \to 4 \to 1</math>
that is, it first predicts <math>x_3</math> with no context, then <math>x_2</math> given <math>x_3</math>, then <math>x_4</math> given <math>x_3, x_2</math>, and finally <math>x_1</math> given <math>x_3, x_2, x_4</math>.
By considering all permutations, XLNet is able to capture longer-range dependencies and better model the bidirectional context of words.
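The prediction tasks implied by a sampled order can be written out explicitly. The following is a minimal Python sketch for the example above (illustrative only, not the released XLNet code; the function name and the 0-indexed positions are choices made here for clarity):
<syntaxhighlight lang="python">
# Illustrative sketch: list the prediction tasks implied by a factorization order.
def permutation_prediction_tasks(tokens, order):
    """At step t, predict tokens[order[t]] given the tokens at the
    positions already visited in order[:t]."""
    tasks = []
    for t, pos in enumerate(order):
        context = {p: tokens[p] for p in sorted(order[:t])}
        tasks.append((pos, tokens[pos], context))
    return tasks

tokens = ["My", "dog", "is", "cute"]

# Standard left-to-right factorization (order 1 2 3 4 in the article's
# 1-indexed notation, written 0-indexed here).
for pos, target, ctx in permutation_prediction_tasks(tokens, [0, 1, 2, 3]):
    print(f"predict {target!r} at position {pos} given {ctx}")

# XLNet-style sampled order 3 2 4 1, i.e. [2, 1, 3, 0] when 0-indexed.
for pos, target, ctx in permutation_prediction_tasks(tokens, [2, 1, 3, 0]):
    print(f"predict {target!r} at position {pos} given {ctx}")
</syntaxhighlight>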
Two-Stream Self-Attention
To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are:
* Content stream: This stream encodes the content of each word, as in standard causally masked self-attention.
* Query stream: This stream encodes each word's position and its context (the words that precede it in the sampled order), but not the content of the word itself. In more detail, it is a masked cross-attention mechanism, where the queries come from the query stream and the key-value pairs come from the content stream.
The content stream uses the causal attention mask <math>M_{\text{causal}}</math> (lower-triangular, so each position may attend to itself and to earlier positions), permuted by a random permutation matrix <math>P</math> to <math>P M_{\text{causal}} P^\mathsf{T}</math>. The query stream uses the cross-attention mask <math>P (M_{\text{causal}} - I) P^\mathsf{T}</math>, where the diagonal is subtracted away specifically to avoid the model "cheating" by looking at the content stream for what the current masked token is.
Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass.
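The two masks can be made concrete with a small NumPy sketch. This is illustrative only; the 0/1 "may attend" convention and the particular definition of <math>P</math> (a 1 in column <math>t</math> at row <math>\text{order}[t]</math>) are choices made for this example rather than details of the released implementation:
<syntaxhighlight lang="python">
import numpy as np

n = 4
order = [2, 1, 3, 0]  # sampled factorization order (0-indexed form of 3 2 4 1)

# Permutation matrix P with P[order[t], t] = 1, so conjugating a mask
# defined over prediction steps expresses it over token positions.
P = np.zeros((n, n), dtype=int)
P[order, np.arange(n)] = 1

# Causal mask over steps: 1 = "may attend", lower triangular.
M_causal = np.tril(np.ones((n, n), dtype=int))

# Content stream: each position attends to itself and to the positions
# that come earlier in the sampled order.
content_mask = P @ M_causal @ P.T

# Query stream: the diagonal is removed, so a position sees its context
# under the permutation but never its own content.
query_mask = P @ (M_causal - np.eye(n, dtype=int)) @ P.T

print(content_mask)
print(query_mask)
</syntaxhighlight>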
Training
Two models were released:
* XLNet-Large, cased: 340M parameters, 24-layer, 1024-hidden, 16-heads
* XLNet-Base, cased: 110M parameters, 12-layer, 768-hidden, 12-heads.
It was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of
BooksCorpus, English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl.
It was trained on 512 TPU v3 chips for 5.5 days. At the end of training, it still under-fitted the data, meaning it could have achieved lower loss with more training. Training took 500,000 steps with the Adam optimizer, linear learning rate decay, and a batch size of 8192.
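A rough sketch of the optimizer set-up described above, using PyTorch for illustration (the model, learning rate, and batch handling are placeholders rather than values taken from the released training code):
<syntaxhighlight lang="python">
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the XLNet network
total_steps = 500_000                # 0.5 million training steps

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder learning rate
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps),  # linear decay to zero
)

def training_step(batch_loss):
    """One update on a batch (of size 8192 in the original setup)."""
    batch_loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
</syntaxhighlight>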
See also
* BERT (language model)
* Transformer (machine learning model)
* Generative pre-trained transformer