Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.
Common Crawl's web archive consists of petabytes of data collected since 2008.
It completes crawls approximately once a month.
Common Crawl was founded by
Gil Elbaz.
Advisors to the non-profit include Peter Norvig and Joi Ito.
The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
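As a rough illustration of the robots.txt side of those policies, the sketch below uses Python's standard urllib.robotparser to test whether a given user agent may fetch a URL. The user-agent string and URLs are placeholders, not Common Crawl's actual crawler configuration.

```python
from urllib.robotparser import RobotFileParser

# Placeholder user-agent string; Common Crawl's crawler identifies itself
# differently, and this value is only for illustration.
USER_AGENT = "ExampleBot"

def allowed_to_fetch(url: str, robots_url: str) -> bool:
    """Return True if the robots.txt at robots_url permits USER_AGENT to fetch url."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse robots.txt
    return parser.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    print(allowed_to_fetch("https://example.com/page",
                           "https://example.com/robots.txt"))
```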
The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions.
Contents archived by Common Crawl are mirrored and made available online in the Wayback Machine.
English is the primary language for 46% of documents in the March 2023 version of the Common Crawl dataset. The next most common primary languages are German, Russian, Japanese, French, Spanish and Chinese, each with less than 6% of documents.
History
Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012.
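As a minimal sketch of how that AWS-hosted archive can be browsed programmatically, the snippet below lists per-crawl prefixes with anonymous S3 access via the third-party boto3 library. The bucket name "commoncrawl" and the "crawl-data/" prefix reflect the publicly documented layout and are assumptions here rather than guarantees.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) requests are enough because the bucket is public.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the per-crawl prefixes, e.g. "crawl-data/CC-MAIN-2023-14/".
response = s3.list_objects_v2(Bucket="commoncrawl",
                              Prefix="crawl-data/",
                              Delimiter="/")
for entry in response.get("CommonPrefixes", []):
    print(entry["Prefix"])
```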
The organization began releasing metadata files and the text output of the crawlers alongside .arc files in July 2012. Common Crawl's archives had only included .arc files previously.
In December 2012, the search engine blekko donated to Common Crawl metadata it had gathered from crawls conducted between February and October 2012. The donated data helped Common Crawl "improve its crawl while avoiding spam, porn and the influence of excessive SEO."
In 2013, Common Crawl began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler.
Common Crawl switched from using .arc files to
.warc files with its November 2013 crawl.
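For a concrete sense of what the .warc format contains, the sketch below iterates over the HTTP response records in one downloaded WARC segment using the third-party warcio library; the file name is a placeholder for any segment fetched from the archive.

```python
from warcio.archiveiterator import ArchiveIterator

# Placeholder file name: any gzipped WARC segment downloaded from the archive.
WARC_PATH = "example-segment.warc.gz"

with open(WARC_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        # A WARC file mixes request, response and metadata records;
        # the response records carry the archived page bodies.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body), "bytes")
```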
A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020.
Timeline of Common Crawl data
The following data have been collected from the official Common Crawl Blog
and Common Crawl's API.
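The API mentioned above includes a CDX-style URL index served per crawl at index.commoncrawl.org. The sketch below, which assumes the third-party requests library and uses an example crawl identifier that should be replaced with a current one, asks the index where captures of a given URL are stored.

```python
import json
import requests

# Example crawl identifier; pick a current one from the list of crawls.
CRAWL = "CC-MAIN-2023-14"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"

response = requests.get(INDEX_URL,
                        params={"url": "commoncrawl.org", "output": "json"},
                        timeout=30)
response.raise_for_status()

# The index returns one JSON object per line, one per capture, including
# the WARC file name in which the capture is stored.
for line in response.text.splitlines():
    capture = json.loads(line)
    print(capture["timestamp"], capture["url"], capture["filename"])
```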
Norvig Web Data Science Award
In collaboration with SURFsara, Common Crawl sponsors the Norvig Web Data Science Award, a competition open to students and researchers in Benelux.
The award is named for Peter Norvig, who also chairs the judging committee for the award.
Colossal Clean Crawled Corpus
Google's version of the Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed for the training of the
T5 language model series in 2019.
There are some concerns over copyrighted content in the C4.
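To illustrate the kind of "cleaning" the name refers to, the sketch below applies a few line-level heuristics of the sort described for C4, such as keeping only lines that end in terminal punctuation and contain a minimum number of words. It is a simplified illustration with made-up thresholds, not the actual C4 pipeline.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 5  # illustrative threshold, not necessarily the exact C4 value

def clean_page(text: str) -> str:
    """Keep only lines that look like natural-language sentences."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # drop menus, headers and other boilerplate lines
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue  # drop very short fragments
        if "lorem ipsum" in line.lower() or "{" in line:
            continue  # drop placeholder text and code-like lines
        kept.append(line)
    return "\n".join(kept)

print(clean_page("Welcome!\nThis is a sentence that ends with a period.\nMenu"))
```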
External links
Common Crawl – in California, United States
Common Crawl GitHub Repository – with the crawler, libraries and example code
Common Crawl Discussion Group
Common Crawl Blog