BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.
It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models including Google's BERT.
The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.
The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT
titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors," but this description is inaccurate: the books had been published by self-published ("indie") authors who priced them at free. The books were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service. The dataset was initially hosted on a University of Toronto webpage. An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created. Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.