GPT-2

Generative Pre-trained Transformer 2 (GPT-2) is a large language model by OpenAI and the second in their foundational series of GPT models. GPT-2 was pre-trained on a dataset of 8 million web pages. It was partially released in February 2019, followed by the full release of the 1.5-billion-parameter model on November 5, 2019. GPT-2 was created as a "direct scale-up" of GPT-1, with a ten-fold increase in both its parameter count and the size of its training dataset. It is a general-purpose learner, and its ability to perform various tasks is a consequence of its general ability to accurately predict the next item in a sequence, which enabled it to translate texts, answer questions about a topic from a text, summarize passages from a larger text, and generate text output on a level sometimes indistinguishable from that of humans; however, it could become repetitive or nonsensical when generating long passages. It was superseded by the GPT-3 and GPT-4 models, which are not open source.

Like its predecessor GPT-1 and its successors GPT-3 and GPT-4, GPT-2 has a generative pre-trained transformer architecture, implementing a deep neural network, specifically a transformer model, which uses attention instead of older recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on the segments of input text it predicts to be the most relevant. This architecture allows for greatly increased parallelization, and outperforms previous benchmarks for RNN/CNN/LSTM-based models.
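As an illustration of the attention mechanism described above, the following is a minimal sketch of causally masked, scaled dot-product self-attention in Python with NumPy; the array shapes and variable names are chosen for readability and are not drawn from OpenAI's implementation.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: arrays of shape (sequence_length, head_dimension)
    d_k = K.shape[-1]
    # Similarity of every query position to every key position
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: a position may only attend to itself and earlier positions
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)
    # Softmax turns scores into attention weights that sum to 1 per row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mix of the value vectors
    return weights @ V

# Toy example: 4 token positions, an 8-dimensional attention head
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)

Because each output row depends only on matrix products over the whole sequence, all positions can be computed in parallel, which is what makes the architecture easier to scale than recurrent models.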


Training

Since the transformer architecture enabled massive parallelization, GPT models could be trained on larger corpora than previous NLP (natural language processing) models. While the GPT-1 model demonstrated that the approach was viable, GPT-2 would further explore the emergent properties of networks trained on extremely large corpora. ''Common Crawl'', a large corpus produced by web crawling and previously used in training NLP systems, was considered due to its large size, but was rejected after further review revealed large amounts of unintelligible content. Instead, OpenAI developed a new corpus, known as ''WebText''; rather than scraping content indiscriminately from the World Wide Web, WebText was generated by scraping only pages linked to by Reddit posts that had received at least 3 karma prior to December 2017. The corpus was subsequently cleaned; HTML documents were parsed into plain text, duplicate pages were eliminated, and Wikipedia pages were removed (since their presence in many other datasets could have induced overfitting).

While the cost of training GPT-2 is known to have been $256 per hour, the number of hours it took to complete training is unknown; therefore, the overall training cost cannot be estimated accurately. However, comparable large language models using transformer architectures have had their costs documented in more detail; the training processes for BERT and XLNet consumed, respectively, $6,912 and $245,000 of resources.
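The following is a rough, hypothetical sketch in Python of the kind of link selection and cleaning described above; the actual WebText pipeline has not been released, so the function names, the layout of the `submissions` and `pages` inputs, and the `extract_text` helper are illustrative assumptions, not OpenAI's code.

from datetime import datetime

KARMA_THRESHOLD = 3                 # posts needed at least 3 karma
CUTOFF = datetime(2017, 12, 1)      # posts prior to December 2017

def select_outbound_links(submissions):
    """Collect URLs linked from sufficiently upvoted Reddit posts.

    `submissions` is a hypothetical list of dicts with "karma",
    "created" (a datetime) and "url" keys.
    """
    return {
        post["url"]
        for post in submissions
        if post["karma"] >= KARMA_THRESHOLD and post["created"] < CUTOFF
    }

def clean_corpus(pages, extract_text):
    """Parse HTML to plain text, drop Wikipedia pages and duplicates."""
    seen, texts = set(), []
    for url, html in pages:                 # pages: iterable of (url, raw HTML)
        if "wikipedia.org" in url:          # Wikipedia removed to limit dataset overlap
            continue
        text = extract_text(html)           # e.g. any HTML-to-text converter
        if text in seen:                    # de-duplicate identical documents
            continue
        seen.add(text)
        texts.append(text)
    return texts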


Release

GPT-2 was first announced on 14 February 2019. A February 2019 article in ''The Verge'' by James Vincent said that, while "[t]he writing it produces is usually easily identifiable as non-human", it remained "one of the most exciting examples yet" of language generation programs:
Give it a fake headline, and it’ll write the rest of the article, complete with fake quotations and statistics. Feed it the first line of a short story, and it’ll tell you what happens to your character next. It can even write fan fiction, given the right prompt.
''The Guardian'' described this output as "plausible newspaper prose"; Kelsey Piper of ''Vox'' said "one of the coolest AI systems I’ve ever seen may also be the one that will kick me out of my job". GPT-2's flexibility was described as "impressive" by ''The Verge''; specifically, its ability to translate text between languages, summarize long articles, and answer trivia questions was noted. A study by the University of Amsterdam employing a modified Turing test found that, at least in some scenarios, participants were unable to distinguish poems generated by GPT-2 from those written by humans. The GPT-2 series comprised four models of different sizes, reported in the original paper; they were not released all at once, but in stages.


Restrictions and partial release

While previous OpenAI models had been made immediately available to the public, OpenAI initially refused to make a public release of GPT-2's source code when announcing it in February, citing the risk of malicious use; limited access to the model (i.e. an interface that allowed input and provided output, not the source code itself) was allowed for selected press outlets on announcement. One commonly cited justification was that, since generated text was usually completely novel, it could be used by spammers to evade automated filters; OpenAI demonstrated a version of GPT-2 fine-tuned to "generate infinite positive – or negative – reviews of products". Another justification was that GPT-2 could be used to generate text that was obscene or racist. Researchers such as Jeremy Howard warned of "the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter". The Allen Institute for Artificial Intelligence, in response to GPT-2, announced a tool to detect "neural fake news".

However, opinion was divided. A February 2019 article in ''The Verge'' argued that the threat posed by GPT-2 had been exaggerated; Anima Anandkumar, a professor at Caltech and director of machine learning research at Nvidia, said that there was no evidence that GPT-2 had the capabilities to pose the threats described by OpenAI, and that what they did was the "opposite of open", characterizing their refusal to release the full model as "malicious BS". ''The Gradient'' published an open letter to OpenAI requesting that they release the model publicly, comparing the threat posed by text-generation AI to the threat posed by the printing press, and giving Photoshop as an example of "a technology that has (thankfully) not destroyed modern society despite its potential for chaos":
Thirty years later, society has emerged relatively unscathed despite Photoshop being simple enough for high school students to use and ubiquitous enough to commandeer its own verb. Why? Precisely because everyone knows about Photoshop.


774M release

While OpenAI did not release the fully-trained model or the corpora it was trained on, the description of their methods in prior publications (and the free availability of the underlying technology) made it possible for GPT-2 to be replicated by others as free software; one such replication, OpenGPT-2, was released in August 2019, in conjunction with a freely licensed version of WebText called OpenWebText. The cloud compute costs for OpenGPT-2 were given as approximately $50,000. On August 20, 2019, OpenAI released a partial version of GPT-2, with 774 million parameters (roughly half the size of the full 1.5-billion-parameter model).


Full 1.5B release

Initial concerns that GPT-2 would lend itself to widespread misuse did not come to pass; ''The Verge'' said that "there are reasons to be skeptical about claims that AI technology will usher in some sort of ‘infopocalypse.’ For a start, we already have programs that can generate plausible text at high volume for little cost: humans." By November 2019, OpenAI said that they had "seen no strong evidence of misuse so far", and the full version, with 1.5 billion parameters trained with forty gigabytes of data, "about eight thousand times larger than the collected works of Shakespeare", was released on November 5, 2019.


Small and medium releases

Two smaller releases of GPT-2 are also available: a small version with 124 million parameters and a medium version with 355 million parameters. Both can be downloaded from Hugging Face.
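As a usage example, either checkpoint can be loaded with the Hugging Face ''transformers'' library; this is a minimal sketch assuming the library is installed and using the "gpt2" (124M) and "gpt2-medium" (355M) identifiers from the Hugging Face model hub.

from transformers import pipeline

# "gpt2" is the 124M-parameter checkpoint; use "gpt2-medium" for the 355M one.
generator = pipeline("text-generation", model="gpt2")

result = generator("GPT-2 was released in", max_new_tokens=20)
print(result[0]["generated_text"])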


Limitations

While GPT-2's ability to generate plausible passages of natural language text was generally remarked on positively, its shortcomings were noted as well, especially when generating texts longer than a couple of paragraphs; ''Vox'' said "the prose is pretty rough, there’s the occasional non-sequitur, and the articles get less coherent the longer they get". ''The Verge'' similarly noted that longer samples of GPT-2 writing tended to "stray off topic" and lack overall coherence; ''The Register'' opined that "a human reading it should, after a short while, realize something's up", and noted that "GPT-2 doesn't answer questions as well as other systems that rely on algorithms to extract and retrieve information."

GPT-2 deployment is resource-intensive; the full version of the model is larger than five gigabytes, making it difficult to embed locally into applications, and it consumes large amounts of RAM. In addition, performing a single prediction "can occupy a CPU at 100% utilization for several minutes", and even with GPU processing, "a single prediction can take seconds". To alleviate these issues, the company Hugging Face created DistilGPT2, using knowledge distillation to produce a smaller model that "scores a few points lower on some quality benchmarks", but is "33% smaller and twice as fast".
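Knowledge distillation, the technique behind DistilGPT2, trains a smaller "student" model to imitate the output distribution of a larger "teacher"; the following is a minimal sketch of the soft-target loss in Python with NumPy, illustrative only and not Hugging Face's actual training code.

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened
    next-token distributions; minimizing it pushes the small model
    to imitate the large one."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1).mean()

Minimizing a loss of this kind over the training data, usually combined with the ordinary next-word-prediction loss, is what yields a compressed model such as DistilGPT2.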


Application and subsequent research

Even before the release of the full version, GPT-2 was used for a variety of applications and services, as well as for entertainment. In June 2019, a subreddit named r/SubSimulatorGPT2 was created in which a variety of GPT-2 instances trained on different subreddits made posts and replied to each other's comments, creating a situation where one could observe "an AI personification of r/Bitcoin argue with the machine learning-derived spirit of r/ShittyFoodPorn"; by July of that year, a GPT-2-based software program released to autocomplete lines of code in a variety of programming languages was described by users as a "game-changer".

In 2019, AI Dungeon was launched, which used GPT-2 to generate dynamic text adventures based on user input. AI Dungeon now offers access to the largest release of the GPT-3 API as an optional paid upgrade, while the free version of the site uses the second-largest release of GPT-3. Latitude, the company formed around AI Dungeon, raised $3.3 million in seed funding in 2021. Several websites host interactive demonstrations of different instances of GPT-2 and other transformer models.

In February 2021, a crisis center for troubled teens announced that it would begin using a GPT-2-derived chatbot to help train counselors by allowing them to have conversations with simulated teens (this use was purely for internal purposes, and did not involve having GPT-2 communicate with the teens themselves). On May 9, 2023, OpenAI released a mapped version of GPT-2: OpenAI used its successor model, GPT-4, to analyze each neuron of GPT-2 and attempt to determine its function.


Performance and evaluation

GPT-2 became capable of performing a variety of tasks beyond simple text production due to the breadth of its dataset and technique: answering questions, summarizing, and even translating between languages in a variety of specific domains, without being instructed in anything beyond how to predict the next word in a sequence. One example of generalized learning is GPT-2's ability to perform machine translation between French and English, a task on which its performance was assessed using the WMT-14 translation tasks. GPT-2's training corpus included virtually no French text; non-English text was deliberately removed while cleaning the dataset prior to training, and as a consequence only 10 MB of French text out of the remaining 40,000 MB was available for the model to learn from (mostly from foreign-language quotations in English posts and articles). Despite this, GPT-2 achieved 5 BLEU on the WMT-14 English-to-French test set, slightly below the score of a translation via word-for-word substitution. It was also able to outperform several contemporary (2017) unsupervised machine translation baselines on the French-to-English test set, where GPT-2 achieved 11.5 BLEU. This remained below the highest-performing contemporary unsupervised approach (2019), which had achieved 33.5 BLEU. However, other models used large amounts of French text to achieve these results; GPT-2 was estimated to have used a monolingual French corpus approximately 1/500 the size of those used by comparable approaches.
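BLEU, the metric quoted above, scores a candidate translation by its n-gram overlap with human reference translations; the following is a minimal sketch using the ''sacrebleu'' Python package, with invented example sentences rather than WMT-14 data.

import sacrebleu

# Hypothetical model outputs and one stream of human reference translations
hypotheses = ["the cat sits on the mat", "he reads a book"]
references = [["the cat is sitting on the mat", "he is reading a book"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # higher is better; 0-100 scale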
GPT-2 was followed by the 175-billion-parameter GPT-3, revealed to the public in 2020, whose source code has never been made available; access to GPT-3 is provided exclusively through APIs offered by OpenAI and Microsoft. GPT-3 was in turn followed by GPT-4.

