Generative Pre-trained Transformer 1 (GPT-1) was the first of OpenAI's large language models, following Google's invention of the transformer architecture in 2017. In June 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training", in which they introduced that initial model along with the general concept of a generative pre-trained transformer. Up to that point, the best-performing neural NLP models primarily employed supervised learning from large amounts of manually labeled data. This reliance on supervised learning limited their use of datasets that were not well-annotated, in addition to making it prohibitively expensive and time-consuming to train extremely large models; many languages (such as Swahili or Haitian Creole) are difficult to translate and interpret using such models due to a lack of available text for corpus-building. In contrast, a GPT's "semi-supervised" approach involved two stages: an unsupervised generative "pre-training" stage in which a language modeling objective was used to set initial parameters, and a supervised discriminative "fine-tuning" stage in which these parameters were adapted to a target task. The use of a transformer architecture, as opposed to previous techniques involving attention-augmented RNNs, provided GPT models with a more structured memory than could be achieved through recurrent mechanisms; this resulted in "robust transfer performance across diverse tasks".
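
In outline, the two stages optimize two different objectives. The display below is a brief LaTeX sketch paraphrasing the objectives described in the GPT-1 paper, with illustrative notation: given an unlabeled corpus of tokens U = {u_1, ..., u_n} and a context window of size k, pre-training maximizes a standard language-modeling likelihood L_1; fine-tuning on a labeled dataset C of token sequences x^1, ..., x^m with labels y maximizes a classification likelihood L_2, optionally keeping the language-modeling term as an auxiliary objective weighted by λ.

    L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
    L_2(\mathcal{C}) = \sum_{(x, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
    L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})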


Reason for choosing BookCorpus

BookCorpus was chosen as a training dataset partly because its long passages of continuous text helped the model learn to handle long-range information. It contained over 7,000 unpublished fiction books from various genres. The other datasets available at the time, while larger, lacked this long-range structure (being "shuffled" at a sentence level). The BookCorpus text was cleaned by the ''ftfy'' library to standardize punctuation and whitespace, and then tokenized by ''spaCy''.
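
A minimal sketch of that preprocessing step is shown below, using the publicly available Python interfaces of ''ftfy'' and ''spaCy''; the sample text and the choice of spaCy's blank English pipeline are illustrative assumptions, not details taken from OpenAI's actual pipeline.

    # Illustrative sketch: repair mis-encoded characters and odd whitespace with
    # ftfy, then split the cleaned text into tokens with spaCy's tokenizer.
    import ftfy
    import spacy

    raw_text = "It wasnâ€™t raining yet, the sky was merely grey."  # mojibake apostrophe

    # fix_text() repairs the mis-decoded apostrophe and normalizes the text.
    clean_text = ftfy.fix_text(raw_text)

    # A blank English pipeline is enough for plain tokenization (no tagger or parser).
    nlp = spacy.blank("en")
    tokens = [token.text for token in nlp(clean_text)]

    print(clean_text)  # the sentence with a proper apostrophe
    print(tokens)      # e.g. ['It', 'was', "n't", 'raining', 'yet', ',', ...]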


Architecture

The GPT-1 architecture was a twelve-layer decoder-only transformer, using twelve masked self-attention heads with 64-dimensional states each (for a total of 768). Rather than simple stochastic gradient descent, the Adam optimization algorithm was used; the learning rate was increased linearly from zero over the first 2,000 updates to a maximum of 2.5×10⁻⁴, then annealed to 0 using a cosine schedule. GPT-1 has 117 million parameters. While the fine-tuning was adapted to specific tasks, its pre-training was not; to perform the various tasks, only minimal changes were made to its underlying task-agnostic model architecture. Despite this, GPT-1 still improved on previous benchmarks in several language processing tasks, outperforming discriminatively trained models with task-oriented architectures on a number of diverse tasks.
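
As a concrete illustration of that schedule, the short Python sketch below computes the learning rate at a given update step: a linear warmup from zero over the first 2,000 updates to the 2.5×10⁻⁴ peak, followed by cosine annealing back toward zero. Only the peak rate and warmup length come from the description above; the total number of training steps is a placeholder chosen for the example.

    import math

    MAX_LR = 2.5e-4         # peak learning rate reported for GPT-1
    WARMUP_STEPS = 2_000    # linear warmup from 0 over the first 2,000 updates
    TOTAL_STEPS = 100_000   # placeholder; not a value taken from the paper

    def gpt1_style_lr(step: int) -> float:
        """Linear warmup to MAX_LR, then cosine annealing back toward 0."""
        if step < WARMUP_STEPS:
            return MAX_LR * step / WARMUP_STEPS
        # Fraction of the post-warmup schedule that has elapsed, clipped to [0, 1].
        progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
        return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))

    print(gpt1_style_lr(1_000))   # 1.25e-04, halfway through warmup
    print(gpt1_style_lr(99_000))  # close to 0 as the cosine anneal finishes

In practice such a function would be applied per optimizer update (for example as a per-step multiplier on Adam's base rate), but a plain function keeps the shape of the schedule explicit.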


Performance and evaluation

GPT-1 achieved a 5.8% and 1.5% improvement over previous best results on natural language inference (also known as ''textual entailment'') tasks, evaluating the ability to interpret pairs of sentences from various datasets and classify the relationship between them as "entailment", "contradiction" or "neutral". Examples of such datasets include QNLI (Wikipedia articles) and MultiNLI (transcribed speech, popular fiction, and government reports, among other sources). It similarly outperformed previous models on two tasks related to question answering and commonsense reasoning: by 5.7% on RACE, a dataset of written question-answer pairs from middle and high school exams, and by 8.9% on the Story Cloze Test. GPT-1 improved on previous best-performing models by 4.2% on ''semantic similarity'' (or ''paraphrase detection''), evaluating the ability to predict whether two sentences are paraphrases of one another, using the Quora Question Pairs (QQP) dataset. GPT-1 achieved a score of 45.4, versus a previous best of 35.0, on a text classification task using the Corpus of Linguistic Acceptability (CoLA). Finally, GPT-1 achieved an overall score of 72.8 (compared to a previous record of 68.9) on GLUE, a multi-task benchmark.

