Generative Pre-trained Transformer 2 (GPT-2) is an open-source artificial intelligence created by OpenAI in February 2019. GPT-2 translates text, answers questions, summarizes passages, and generates text output on a level that, while sometimes indistinguishable from that of humans, can become repetitive or nonsensical when generating long passages. It is a general-purpose learner: it was not specifically trained to do any of these tasks, and its ability to perform them is an extension of its general ability to accurately synthesize the next item in an arbitrary sequence. GPT-2 was created as a "direct scale-up" of OpenAI's 2018 GPT model, with a ten-fold increase in both its parameter count and the size of its training dataset.

The GPT architecture implements a deep neural network, specifically a transformer model, which uses attention in place of earlier recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on the segments of input text it predicts to be the most relevant. This architecture allows for greatly increased parallelization, and outperforms previous benchmarks for RNN/CNN/LSTM-based models.

OpenAI released the complete version of the GPT-2 language model (with 1.5 billion parameters) in November 2019. GPT-2 was followed by the 175-billion-parameter GPT-3, revealed to the public in 2020, whose source code has never been made available; access to GPT-3 is provided exclusively through APIs offered by OpenAI and Microsoft.


Background

Since the origins of computing, artificial intelligence has been an object of study; the "imitation game", postulated by Alan Turing in 1950 (and often called the "Turing test"), proposed to establish an electronic or mechanical system's capacity for intelligent action by an evaluator's ability to distinguish its behavior from that of a human. The term "machine learning" was first used to describe a possible approach to artificial intelligence as early as 1959 by IBM researcher Arthur Samuel; current use of the term encompasses a broad variety of statistical learning, data science and neural network approaches to computational problems (often falling under the aegis of artificial intelligence).


Computational linguistics

Natural language processing using computers, a task originally conceived as a subfield of computational linguistics, was attempted as soon as computing hardware had the capacity; the first application of a dictionary look-up table was developed at Birkbeck College in London in 1948. The 1954 Georgetown Experiment was a demonstration of fully automated machine translation, in which sixty Russian sentences were translated into English (mostly by replacement of words with their English synonyms). The translations were often crude; the system had only 6 grammar rules and a 250-word vocabulary, and no attempt was made to analyze or translate syntactic structure. However, the experiment demonstrated to the public that computers could interpret and process natural language, and it secured CIA funding for further research. Direct substitution remains a standard against which machine translation programs are evaluated.

Systems for using natural language in human–computer interaction (HCI) also began to emerge in the mid-20th century. SHRDLU, a program developed at MIT in 1968–1970, consisted of a virtual environment of several objects which a user interacted with through commands in natural language (e.g. "Find a block which is taller than the one you are holding and put it into the box"). ELIZA, a chatterbot written in 1966, analyzed a human interlocutor's text for keywords and provided conversationally appropriate responses. While many subjects claimed an inability to distinguish ELIZA's conversation from that of a human, the question of whether this constituted intelligence proved contentious (the most famous script parodied a psychotherapist by, largely, repeating what the user had said back to them).

While initial attempts at machine translation had been purely computational, by the 1950s the dominant approach to computational linguistics had come to emphasize Noam Chomsky's concept of universal grammar; NLP research in that era, accordingly, consisted largely of attempts to reduce statements in arbitrary languages to putative underlying language-agnostic logical structures. In the 1970s, semantic NLP systems began to eschew ''syntactic'' encodings in favor of more general ''semantic'' encodings. However, until the advent of neural networks, most systems continued to rely on large (and increasingly unwieldy) sets of manually programmed rules, which failed to scale up as initially predicted. The field of artificial intelligence continued to develop in the late 20th century, but was punctuated by occasional periods of stagnation known as "AI winters", during which funding and interest in AI research declined.


Neural networks

An early concept in artificial intelligence, connectionism, sought to produce intelligent behavior through artificial neural networks designed to simulate the behavior of neurons in biological brains. The first example of an artificial neural network was the SNARC, built in 1951. The perceptron (a type of binary classifier) was introduced in 1957 by psychologist Frank Rosenblatt; his machine was designed for image recognition using 400 photocells connected to "neurons", with weightings determined by potentiometers (and adjusted with electric motors during its learning process). Perceptron systems became the subject of great interest; a ''New York Times'' article described the perceptron as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence". Perceptron systems, however, fell out of favor for decades following a 1969 book by Marvin Minsky and Seymour Papert (''Perceptrons: an introduction to computational geometry''), which pointed out several shortcomings of the then-current state of the art (single-layer perceptrons), including an inability to encode the exclusive or (XOR) function. The book was considered, at the time, to discredit the perceptron approach (as well as neural networks in general) as a promising area of research.

Neural networks become capable of classifying different inputs (i.e. sorting them into distinct categories) through a process known as "learning". This begins with the network's weights (the amount by which each neuron's "activation" influences the activation of each specific neuron in the subsequent layer) being initialized to random quantities; in this state, the output of the network is similarly random. An objective function, like a loss function, is defined, which is capable of quantitatively measuring how close the output of the network is to its desired performance (for example, how often an input consisting of a handwritten number results in the sole activation of the output neuron corresponding to that number). From this, and from the performance of the network, the weights can be adjusted in order to improve its performance.

Backpropagation, a supervised algorithm first applied to machine learning systems in Paul Werbos' 1974 dissertation, efficiently calculates "gradients", which are vector fields describing the optimal adjustment of all weights in the entire network for a given input/output example. The use of these gradients to train neural networks, a practice known as gradient descent, enabled the creation of much more complex systems, and wide-scale application of neural networks to natural language processing would occur in the 1980s. In 1985, D.B. Parker would rediscover Werbos' method; in 1986, Rumelhart, Hinton and Williams would apply it to generate internal representations of incoming data in neural networks with hidden layers, referred to as "deep learning" networks; this research would later form the basis for recurrent neural networks.

Traditional feed-forward neural networks (FFNNs) are so named because each layer takes in output from the previous layer and feeds it into the next; a FFNN's structure contains no "cycles" where information flows backwards. In contrast, a recurrent neural network (RNN) has at least one cycle of activation flow. RNNs are often used for processing sequences of data (and predicting future sequence items), since the network can process each item using both the item itself and its own output from processing the previous item. The neocognitron, proposed by Kunihiko Fukushima in 1979 based on models of neural architecture in the mammalian visual cortex, provided the basis for convolutional neural networks (CNNs), often used in image processing. By "sliding" a small layer over a larger input, a CNN can perform deeper processing with less computation. For example, a 100×100 image has 10,000 pixels, which would require 10,000 weights to process with a fully connected layer; a convolutional layer consisting of a 5×5 "window" sliding over the image can perform edge detection using only 25 learnable parameters. Convolutional layers are combined by "pooling layers" and processed by "fully connected" layers (which are typically multilayer perceptrons).
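As a hedged illustration of the parameter counts just described, the following sketch (assuming Python with the PyTorch library, which is not named in the text above) compares a 5×5 convolutional layer with a fully connected layer reading a 100×100 image:

    # Minimal sketch (PyTorch assumed) of the parameter-count contrast described above.
    import torch
    import torch.nn as nn

    image = torch.randn(1, 1, 100, 100)          # one 100x100 greyscale image

    # A 5x5 convolutional "window" slid over the image: 25 weights (plus 1 bias).
    conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5, bias=True)
    print(sum(p.numel() for p in conv.parameters()))   # 26

    # A fully connected layer reading every pixel needs 10,000 weights per output unit.
    fc = nn.Linear(in_features=100 * 100, out_features=1, bias=True)
    print(sum(p.numel() for p in fc.parameters()))     # 10001

    feature_map = conv(image)                    # output shape: (1, 1, 96, 96)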


Machine learning for natural language processing

Due to their ability to process sequential information, recurrent neural networks have seen use in many NLP applications; unlike FFNNs, they are capable of encoding different weights (and giving different output) for identical items based on their surroundings in a sequence—that is to say, a RNN system that parsed one word at a time could still associate a "black dog" with fuzzy paws, a "corn dog" with ketchup, and a "sun dog" with refraction. Moreover, since the retention of information from previous sequence items can be performed recursively, RNN systems can be designed that recall items arbitrarily far back in a sequence: for example, being able to continue the sequences "Tom looked at the black dog", "Tom looked at the corn dog", and "Tom looked at the sun dog" with "fondly", "hungrily", and "indirectly", respectively.

While capable of impressive solutions, many-layered FFNNs and RNNs both proved vulnerable to the vanishing gradient problem: since gradients (encoded as finite-precision numbers) are required to backpropagate across all layers of a model, they can "vanish" to zero (or "explode" to infinity) over a sufficiently large number of layers. The long short-term memory network (LSTM), first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1995–1997, sought to resolve this issue by introducing a novel architecture consisting of multiple distinct "cells" with "input", "output" and "forget" gates. In 2009, an LSTM-based model submitted by Alex Graves' team won the ICDAR competition for handwriting recognition; another was the most accurate model in the competition and a third was the fastest.

Another issue RNNs and LSTMs encounter is that they can only take into account the context of ''previous'' sequence items. This can create issues when parsing sentences like "Tom rode his bike to the store, put out the kickstand, and turned off the engine", in which the necessary context of the "bike" being a motorcycle is revealed only at the end. One method of solving problems like this is the bidirectional LSTM, which proceeds in both directions simultaneously, giving access to both "past" and "future" input features. Conditional random fields use tags to connect inputs directly to outputs. There exist combinations of the above approaches, like the LSTM-CRF network and the BI-LSTM-CRF network. Other improvements on the RNN model include neural Turing machines, adaptive computation time, neural programmers, and attention mechanisms, the latter of which form the basis for GPT-2 and related technologies.
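As a rough illustration of the sequence processing just described, the following sketch (assuming Python with PyTorch; the dimensions are arbitrary) runs a unidirectional and a bidirectional LSTM over a short token sequence:

    # Minimal sketch (PyTorch assumed): an LSTM consumes a token sequence one step at a
    # time, carrying a hidden state forward; a bidirectional LSTM also runs right-to-left.
    import torch
    import torch.nn as nn

    vocab_size, embed_dim, hidden_dim = 1000, 32, 64
    embed = nn.Embedding(vocab_size, embed_dim)

    lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
    bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    tokens = torch.randint(0, vocab_size, (1, 6))   # e.g. "Tom looked at the sun dog"
    x = embed(tokens)

    out, (h, c) = lstm(x)      # out: (1, 6, 64)  -- each step sees only earlier tokens
    bi_out, _ = bilstm(x)      # bi_out: (1, 6, 128) -- each step sees past and future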


Selective focusing

By the early 2010s, the best performance in neural machine translation was achieved with the encoder–decoder model, in which a RNN or LSTM "encoder network" encoded source sentences into vectors, and a "decoder network" of similar architecture processed these vectors into translated output. 2014 saw the introduction of significantly more complex "attention" mechanisms, which vastly augmented these models' performance. Attention mechanisms gave these models the ability to adaptively focus their decoder networks' "attention" on specific aspects of the source text, rather than forcing them to parse the entire text as one vector. 2017 then saw the introduction of "transformer" models, which went a step further by using attention mechanisms to replace the RNN/LSTM architecture entirely.


Attention mechanisms

One constraint of encoder–decoder models was the difficulty of compressing the encodings of longer sentences into fixed-length vectors; performance often deteriorated on longer inputs. In 2014, Bahdanau et al. introduced an extension to the encoder–decoder model that could "align and translate jointly". For each word of the source sentence that was translated, the Bahdanau model's encoder (a bidirectional RNN with 1,000 hidden units in each direction) searched the entire rest of that sentence for the positions of relevant information. Rather than giving the decoder a fixed-length vector encoding of the entire input sequence (like previous models), it produced "context vectors", associated with those positions as well as with previously generated target words. The decoder (which also had 1,000 hidden units) then used these context vectors to decide where to focus its "attention".

Research into attention mechanisms was continued by Luong et al. in a 2015 paper. A "global" approach based on the Bahdanau paper was attempted, as well as a "local" approach wherein only a subset of source words were "considered" at a time; the local approach, while more architecturally complicated, was less computationally expensive and easier to train. It took 7–10 days to fully train an English–German translation model, which was specifically designed to be capable of translating 1,000 target words per second; its accuracy was tested against the 2014 ACL Workshop on Machine Translation (WMT'14) task for English–German sentence pairs, where it achieved a result of 23.0 BLEU—a 2.1 BLEU improvement over the previous best result, achieved by a phrase-based language model from Buck et al. 2014.
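A minimal sketch of an additive, Bahdanau-style attention step, assuming Python with PyTorch and arbitrary dimensions (not the actual 1,000-unit model described above), might look like this:

    # Alignment scores between the decoder state and every encoder position are
    # softmaxed into weights; the weighted sum of encoder states is the "context vector".
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    enc_dim, dec_dim, attn_dim, src_len = 64, 64, 32, 10
    W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
    W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
    v = nn.Linear(attn_dim, 1, bias=False)

    encoder_states = torch.randn(src_len, enc_dim)    # one vector per source position
    decoder_state = torch.randn(dec_dim)              # current decoder hidden state

    scores = v(torch.tanh(W_enc(encoder_states) + W_dec(decoder_state))).squeeze(-1)
    weights = F.softmax(scores, dim=0)                # where to focus "attention"
    context = weights @ encoder_states                # (enc_dim,) context vector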


Transformers

While attention mechanisms were effective in improving performance when used to augment existing convolutional and recurrent neural network architectures, it was soon discovered that performant models could be built using attention mechanisms on their own, without anything else underlying them. In June 2017, the transformer architecture was first introduced, in a paper released by researchers from Google Brain, Google Research and the University of Toronto. Transformers are a type of model based solely on attention mechanisms, discarding convolution and recurrence altogether. Unlike previous RNN-based models, transformers can process sequential input without needing to perform computation on each item in sequence; this means they can be massively parallelized. On the WMT'14 French–English task, a specifically trained French–English translation model using the transformer architecture was able to establish a new single-model benchmark of 41.8 BLEU. Since their introduction, transformers have seen use in many NLP applications.
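A minimal sketch of the scaled dot-product self-attention at the core of the transformer, assuming Python with PyTorch and arbitrary dimensions, is shown below; every position attends to every other position in a single parallel matrix operation, with no recurrence over the sequence:

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    seq_len, d_model = 8, 64
    x = torch.randn(seq_len, d_model)                 # one embedded input sequence

    W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
    Q, K, V = W_q(x), W_k(x), W_v(x)

    scores = Q @ K.transpose(0, 1) / math.sqrt(d_model)   # (seq_len, seq_len) affinities
    attn = F.softmax(scores, dim=-1)                      # attention weights per position
    out = attn @ V                                        # (seq_len, d_model), in parallel
    # A decoder-only model such as GPT additionally masks future positions
    # so that each token can only attend to tokens that precede it.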


Generative Pre-trained Transformer

On June 11, 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training", in which they introduced the ''Generative Pre-trained Transformer'' (GPT). At this point, the best-performing neural NLP models primarily employed supervised learning from large amounts of manually labeled data. This reliance on supervised learning limited their use on datasets that were not well-annotated, in addition to making it prohibitively expensive and time-consuming to train extremely large models; many languages (such as Swahili or Haitian Creole) are difficult to translate and interpret using such models due to a lack of available text for corpus-building. In contrast, GPT's "semi-supervised" approach involved two stages: an unsupervised generative "pre-training" stage in which a language modeling objective was used to set initial parameters, and a supervised discriminative "fine-tuning" stage in which these parameters were adapted to a target task. The use of a transformer architecture, as opposed to previous techniques involving attention-augmented RNNs, provided GPT with a more structured memory than could be achieved through recurrent mechanisms; this resulted in "robust transfer performance across diverse tasks". As the paper describes: "During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens."
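The two-stage recipe can be sketched schematically as follows; this is an illustrative outline in PyTorch-style Python, not OpenAI's code, and the ''transformer'', ''lm_head'' and ''task_head'' objects are hypothetical placeholders:

    import torch
    import torch.nn.functional as F

    def pretrain_step(transformer, lm_head, tokens, optimizer):
        # Stage 1 (unsupervised): language modeling -- predict token t+1 from tokens <= t.
        hidden = transformer(tokens[:, :-1])                 # (batch, seq-1, d_model)
        logits = lm_head(hidden)                             # (batch, seq-1, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss

    def finetune_step(transformer, task_head, tokens, labels, optimizer):
        # Stage 2 (supervised): reuse the pre-trained transformer and adapt its
        # parameters to the target task through a small linear classification head.
        hidden = transformer(tokens)                         # (batch, seq, d_model)
        logits = task_head(hidden[:, -1, :])                 # classify from final position
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss

In the paper, the fine-tuning loss also includes an auxiliary language-modeling term weighted by ''λ''; that detail is omitted from the sketch.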


Corpus

The unsupervised pre-training was performed using BooksCorpus, a dataset of over 7,000 unpublished fiction books from various genres; this dataset was chosen in part because its long passages of continuous text conditioned the model to handle long-range information. Other available datasets, while larger, were rejected on the basis that they lacked this long-range structure (being "shuffled" at a sentence level). The ''ftfy'' library was used to clean the BooksCorpus text (standardize punctuation and whitespace); it was tokenized using ''spaCy''.
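A minimal sketch of that preprocessing step, using the ''ftfy'' and ''spaCy'' libraries named above (the blank English tokenizer-only pipeline is an assumption made for illustration), could look like this:

    # ftfy repairs mojibake and standardizes punctuation; spaCy tokenizes the result.
    import ftfy
    import spacy

    nlp = spacy.blank("en")                     # tokenizer-only English pipeline

    raw = "She said the plot was â€œgrippingâ€\x9d."
    clean = ftfy.fix_text(raw)                  # repairs encoding, normalizes quotes
    tokens = [tok.text for tok in nlp(clean)]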


Architecture

GPT's architecture itself was a twelve-layer decoder-only transformer, using twelve masked self-attention heads with 64-dimensional states each (for a total of 768). Rather than simple stochastic gradient descent, the Adam optimization algorithm was used; the learning rate was increased linearly from zero over the first 2,000 updates to a maximum of 2.5×10−4, and annealed to 0 using a cosine schedule. Quoting the paper:
We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm is used extensively throughout the model, a simple weight initialization of N(0, 0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in Loshchilov et al. 2017, with ''w'' = 0.01 on all non-bias or gain weights. [...] We used learned position embeddings instead of the sinusoidal version proposed in the original work. [...] Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. ''λ'' was set to 0.5.
While GPT's fine-tuning was adapted to specific tasks, its pre-training was not; to perform the various tasks, minimal changes were performed to its underlying task-agnostic model architecture. Despite this, GPT still improved on previous benchmarks in several language processing tasks, outperforming discriminatively-trained models with task-oriented architectures on a number of diverse tasks.
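The optimization schedule described above (Adam, linear warmup from zero over 2,000 updates to 2.5×10−4, then cosine annealing back to zero) can be sketched as follows; this is an illustrative PyTorch reconstruction, not OpenAI's code, and the total step count and stand-in model are arbitrary:

    import math
    import torch

    def lr_at(step, total_steps, warmup=2000, peak=2.5e-4):
        if step < warmup:
            return peak * step / warmup                           # linear warmup from 0
        progress = (step - warmup) / max(1, total_steps - warmup)
        return peak * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine anneal to 0

    model = torch.nn.Linear(768, 768)                # stand-in for the transformer
    optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: lr_at(step, total_steps=100_000) / 2.5e-4)
    # scheduler.step() would be called once per parameter update during training.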


Performance

On natural language inference (also known as ''textual entailment'') tasks, models are evaluated on their ability to interpret pairs of sentences from various datasets and classify the relationship between them as "entailment", "contradiction" or "neutral". Examples of such datasets include QNLI (Wikipedia articles) and MultiNLI (transcribed speech, popular fiction and government reports, among other sources); on these, GPT achieved, respectively, a 5.8% and 1.5% improvement over previous best results. It similarly outperformed previous models on two tasks related to question answering and commonsense reasoning—by 5.7% on RACE, a dataset of written question–answer pairs from middle and high school exams, and by 8.9% on the Story Cloze Test.

Another task, ''semantic similarity'' (or ''paraphrase detection''), assesses whether a model can predict whether two sentences are paraphrases of one another; on the Quora Question Pairs (QQP) dataset, GPT improved on previous best-performing models by 4.2%. In a text classification task using the Corpus of Linguistic Acceptability (CoLA), GPT achieved a score of 45.4, versus a previous best of 35.0. Finally, on GLUE, a multi-task benchmark, GPT achieved an overall score of 72.8 (compared to a previous record of 68.9).


Scale-up

GPT-2 was created as a direct scale-up of GPT, with both its parameter count and dataset size increased by a factor of 10. Both are unsupervised transformer models trained to generate text by predicting the next word in a sequence of tokens. The GPT-2 model has 1.5 billion parameters, and was trained on a dataset of 8 million web pages. While GPT-2 was reinforced on very simple criteria (interpreting a sequence of words in a text sample and predicting the most likely next word), it produces full sentences and paragraphs by continuing to predict additional words, generating fully comprehensible (and semantically meaningful) statements in natural language. Notably, GPT-2 was evaluated on its performance on tasks in a zero-shot setting.
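The released GPT-2 weights can be sampled through the third-party Hugging Face ''transformers'' library (not part of OpenAI's original release); the following sketch shows the model continuing a prompt by repeatedly predicting a likely next token:

    from transformers import pipeline

    # "gpt2" is the smallest public checkpoint; "gpt2-xl" is the full 1.5B-parameter release.
    generator = pipeline("text-generation", model="gpt2")
    result = generator("Tom looked at the sun dog", max_new_tokens=40, do_sample=True)
    print(result[0]["generated_text"])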


Training

Since the transformer architecture enabled massive parallelization, GPT-series models could be trained on larger corpora than previous NLP models. While the initial GPT model demonstrated that the approach was viable, GPT-2 would further explore the emergent properties of networks trained on extremely large corpora. ''CommonCrawl'', a large corpus produced by web crawling and previously used in training NLP systems, was considered due to its large size, but was rejected after further review revealed large amounts of unintelligible content. Instead, OpenAI developed a new corpus, known as ''WebText''; rather than scraping content indiscriminately from the World Wide Web, WebText was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned: HTML documents were parsed into plain text, duplicate pages were eliminated, and Wikipedia pages were removed (since their presence in many other datasets could have induced overfitting). While the cost of training GPT-2 is known to have been $256 per hour, the number of hours it took to complete training is unknown; therefore, the overall training cost cannot be estimated accurately. However, comparable large language models using transformer architectures have had their costs documented in more detail; the training processes for BERT and XLNet consumed, respectively, $6,912 and $245,000 of resources.
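The kind of filtering described can be sketched as follows; this is purely illustrative, with hypothetical field names and helper structures rather than OpenAI's actual pipeline:

    from datetime import datetime
    from urllib.parse import urlparse

    CUTOFF = datetime(2017, 12, 1)
    seen_text = set()

    def keep(post, page_text):
        # Only outbound links from Reddit posts with at least three upvotes, pre-Dec 2017.
        if post["karma"] < 3 or post["created"] >= CUTOFF:
            return False
        # Wikipedia pages removed to limit overlap with other evaluation datasets.
        if "wikipedia.org" in urlparse(post["url"]).netloc:
            return False
        # Duplicate pages eliminated after HTML has been parsed into plain text.
        if page_text in seen_text:
            return False
        seen_text.add(page_text)
        return True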


Performance

Due to the broadness of its dataset and of its approach, GPT-2 became capable of performing a diverse range of tasks beyond simple text generation: answering questions, summarizing, and even translating between languages in a variety of specific domains, without being instructed in anything beyond how to predict the next word in a sequence. One example of generalized learning is GPT-2's ability to perform machine translation between French and English, for which task its performance was assessed using WMT-14 translation tasks. GPT-2's training corpus included virtually no French text; non-English text was deliberately removed while cleaning the dataset prior to training, and as a consequence, only 10 MB of French text (mostly from foreign-language quotations in English posts and articles) remained of the 40,000 MB corpus for the model to learn from. Despite this, GPT-2 achieved 5 BLEU on the WMT-14 English-to-French test set (slightly below the score of a translation via word-for-word substitution). It was also able to outperform several contemporary (2017) unsupervised machine translation baselines on the French-to-English test set, where GPT-2 achieved 11.5 BLEU. This remained below the highest-performing contemporary unsupervised approach (2019), which had achieved 33.5 BLEU. However, other models used large amounts of French text to achieve these results; GPT-2 was estimated to have used a monolingual French corpus approximately 1/500th the size of those used by comparable approaches.
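BLEU scores like those quoted here compare model output against reference translations by n-gram overlap; a minimal sketch of such an evaluation, using the ''sacrebleu'' library as an assumed stand-in for the evaluation code actually used, is:

    import sacrebleu

    hypotheses = ["the cat sat on the mat"]               # model translations
    references = [["the cat is sitting on the mat"]]      # one list per reference set
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(round(bleu.score, 1))                           # corpus-level BLEU, 0-100 scale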


Release

GPT-2 was first announced on 14 February 2019. A February 2019 article in ''The Verge'' by James Vincent said that, while "[the] writing it produces is usually easily identifiable as non-human", it remained "one of the most exciting examples yet" of language generation programs:
Give it a fake headline, and it’ll write the rest of the article, complete with fake quotations and statistics. Feed it the first line of a short story, and it’ll tell you what happens to your character next. It can even write fan fiction, given the right prompt.
''The Guardian'' described this output as "plausible newspaper prose"; Kelsey Piper of ''Vox'' said "one of the coolest AI systems I’ve ever seen may also be the one that will kick me out of my job". GPT-2's flexibility was described as "impressive" by ''The Verge''; specifically, its ability to translate text between languages, summarize long articles, and answer trivia questions was noted. A study by the University of Amsterdam employing a modified Turing test found that, at least in some scenarios, participants were unable to distinguish poems generated by GPT-2 from those written by humans.


Restrictions and partial release

While previous OpenAI models had been made immediately available to the public, OpenAI initially refused to make a public release of GPT-2's source code when announcing it in February, citing the risk of malicious use; limited access to the model (i.e. an interface that allowed input and provided output, not the source code itself) was allowed for selected press outlets on announcement. One commonly-cited justification was that, since generated text was usually completely novel, it could be used by spammers to evade automated filters; OpenAI demonstrated a version of GPT-2 fine-tuned to "generate infinite positive – or negative – reviews of products". Another was that GPT-2 could be used to generate text that was obscene or racist. Researchers such as Jeremy Howard warned of "the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter". The Allen Institute for Artificial Intelligence, in response to GPT-2, announced a tool to detect "neural fake news".

However, opinion was divided. A February 2019 article in ''The Verge'' argued that the threat posed by GPT-2 had been exaggerated; Anima Anandkumar, a professor at Caltech and director of machine learning research at Nvidia, said that there was no evidence that GPT-2 had the capabilities to pose the threats described by OpenAI, and that what they did was the "opposite of open", characterizing their refusal to release the full model as "malicious BS". ''The Gradient'' published an open letter to OpenAI requesting that they release the model publicly, comparing the threat posed by text-generation AI to the threat posed by the printing press, and giving Photoshop as an example of "a technology that has (thankfully) not destroyed modern society despite its potential for chaos":
Thirty years later, society has emerged relatively unscathed despite Photoshop being simple enough for high school students to use and ubiquitous enough to commandeer its own verb. Why? Precisely because everyone knows about Photoshop.


774M release

While OpenAI did not release the fully-trained model or the corpora it was trained on, description of their methods in prior publications (and the free availability of underlying technology) made it possible for GPT-2 to be replicated by others as free software; one such replication, OpenGPT-2, was released in August 2019, in conjunction with a freely licensed version of WebText called OpenWebText. The cloud compute costs for OpenGPT-2 were given as approximately $50,000. On August 20, 2019, OpenAI released a partial version of GPT-2, with 774 million parameters (roughly half the size of the full 1.5 billion parameter model).


Full 1.5B release

Initial concerns that GPT-2 would lend itself to widespread misuse did not come to pass; ''The Verge'' said that "there are reasons to be skeptical about claims that AI technology will usher in some sort of ‘infopocalypse.’ For a start, we already have programs that can generate plausible text at high volume for little cost: humans." By November 2019, OpenAI said that they had "seen no strong evidence of misuse so far", and the full version, with 1.5 billion parameters, was released on November 5, 2019.


Limitations

While GPT-2's ability to generate plausible passages of natural language text was generally remarked on positively, its shortcomings were noted as well, especially when generating texts longer than a couple of paragraphs; ''Vox'' said "the prose is pretty rough, there’s the occasional non-sequitur, and the articles get less coherent the longer they get". ''The Verge'' similarly noted that longer samples of GPT-2 writing tended to "stray off topic" and lack overall coherence; ''The Register'' opined that "a human reading it should, after a short while, realize something's up", and noted that "GPT-2 doesn't answer questions as well as other systems that rely on algorithms to extract and retrieve information."

GPT-2 deployment is resource-intensive; the full version of the model is larger than five gigabytes, making it difficult to embed locally into applications, and consumes large amounts of RAM. In addition, performing a single prediction "can occupy a CPU at 100% utilization for several minutes", and even with GPU processing, "a single prediction can take seconds". To alleviate these issues, the company Hugging Face created DistilGPT2, using knowledge distillation to produce a smaller model that "scores a few points lower on some quality benchmarks", but is "33% smaller and twice as fast".
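A brief sketch of swapping in the distilled model via the Hugging Face ''transformers'' library (an assumed third-party interface, not part of the original release):

    from transformers import pipeline

    # The distilled checkpoint trades a few benchmark points for a smaller, faster model.
    small = pipeline("text-generation", model="distilgpt2")
    print(small("The weather today is", max_new_tokens=20)[0]["generated_text"])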


Implementations and subsequent research

Possible applications of GPT-2 described by journalists included aiding humans in writing text like news articles. Even before the release of the full version, GPT-2 was used for a variety of applications and services, as well as for entertainment. In June 2019, a subreddit named r/SubSimulatorGPT2 was created in which a variety of GPT-2 instances trained on different subreddits made posts and replied to each other's comments, creating a situation where one could observe "an AI personification of r/Bitcoin argue with the machine learning-derived spirit of r/ShittyFoodPorn"; by July of that year, a GPT-2-based software program released to autocomplete lines of code in a variety of programming languages was described by users as a "game-changer".

In 2019, ''AI Dungeon'' was launched, which used GPT-2 to generate dynamic text adventures based on user input. AI Dungeon now offers access to the largest release of GPT-3 through its API as an optional paid upgrade, while the free version of the site uses the second-largest release of GPT-3. Latitude, the company formed around AI Dungeon, raised $3.3 million in seed funding in 2021. Several websites host interactive demonstrations of different instances of GPT-2 and other transformer models.

In February 2021, a crisis center for troubled teens announced that it would begin using a GPT-2-derived chatbot to help train counselors by allowing them to have conversations with simulated teens (this use was purely for internal purposes, and did not involve having GPT-2 communicate with the teens themselves).

