Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.
Common Crawl's web archive consists of petabytes of data collected since 2008.
It completes crawls approximately once a month.
Common Crawl was founded by
Gil Elbaz.
Advisors to the non-profit include Peter Norvig and Joi Ito.
The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
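As a rough illustration of the robots.txt side of those policies, the sketch below uses Python's standard urllib.robotparser to test whether a given user agent may fetch a URL. The user-agent string and URLs are placeholders, not Common Crawl's actual crawler configuration.

```python
from urllib.robotparser import RobotFileParser

# Placeholder user-agent string; Common Crawl's crawler identifies itself
# differently, and this value is only for illustration.
USER_AGENT = "ExampleBot"

def allowed_to_fetch(url: str, robots_url: str) -> bool:
    """Return True if the robots.txt at robots_url permits USER_AGENT to fetch url."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # download and parse robots.txt
    return parser.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    print(allowed_to_fetch("https://example.com/page",
                           "https://example.com/robots.txt"))
```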
The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions.
Contents archived by Common Crawl are mirrored and made available online in the Wayback Machine.
English is the primary language for 46% of documents in the March 2023 version of the Common Crawl dataset. The next most common primary languages are German, Russian, Japanese, French, Spanish and Chinese, each with less than 6% of documents.
History
Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012.
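As a minimal sketch of how that AWS-hosted archive can be browsed programmatically, the snippet below lists per-crawl prefixes with anonymous S3 access via the third-party boto3 library. The bucket name "commoncrawl" and the "crawl-data/" prefix reflect the publicly documented layout and are assumptions here rather than guarantees.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) requests are enough because the bucket is public.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the per-crawl prefixes, e.g. "crawl-data/CC-MAIN-2023-14/".
response = s3.list_objects_v2(Bucket="commoncrawl",
                              Prefix="crawl-data/",
                              Delimiter="/")
for entry in response.get("CommonPrefixes", []):
    print(entry["Prefix"])
```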
The organization began releasing metadata files and the text output of the crawlers alongside .arc files in July 2012. Common Crawl's archives had only included .arc files previously.
In December 2012, the search engine blekko donated to Common Crawl metadata it had gathered from crawls conducted between February and October 2012. The donated data helped Common Crawl "improve its crawl while avoiding spam, porn and the influence of excessive SEO."
In 2013, Common Crawl began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler.
Common Crawl switched from using .arc files to
.warc files with its November 2013 crawl.
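For a concrete sense of what the .warc format contains, the sketch below iterates over the HTTP response records in one downloaded WARC segment using the third-party warcio library; the file name is a placeholder for any segment fetched from the archive.

```python
from warcio.archiveiterator import ArchiveIterator

# Placeholder file name: any gzipped WARC segment downloaded from the archive.
WARC_PATH = "example-segment.warc.gz"

with open(WARC_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        # A WARC file mixes request, response and metadata records;
        # the response records carry the archived page bodies.
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body), "bytes")
```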
A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020.
Timeline of Common Crawl data
The following data have been collected from the official Common Crawl Blog
and Common Crawl's API.
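The API mentioned above includes a CDX-style URL index served per crawl at index.commoncrawl.org. The sketch below, which assumes the third-party requests library and uses an example crawl identifier that should be replaced with a current one, asks the index where captures of a given URL are stored.

```python
import json
import requests

# Example crawl identifier; pick a current one from the list of crawls.
CRAWL = "CC-MAIN-2023-14"
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"

response = requests.get(INDEX_URL,
                        params={"url": "commoncrawl.org", "output": "json"},
                        timeout=30)
response.raise_for_status()

# The index returns one JSON object per line, one per capture, including
# the WARC file name in which the capture is stored.
for line in response.text.splitlines():
    capture = json.loads(line)
    print(capture["timestamp"], capture["url"], capture["filename"])
```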
Norvig Web Data Science Award
In collaboration with SURFsara, Common Crawl sponsors the Norvig Web Data Science Award, a competition open to students and researchers in Benelux.
The award is named for Peter Norvig, who also chairs the judging committee for the award.
Colossal Clean Crawled Corpus
Google's version of the Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed for the training of the
T5 language model series in 2019.
There are some concerns over copyrighted content in the C4.
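To illustrate the kind of "cleaning" the name refers to, the sketch below applies a few line-level heuristics of the sort described for C4, such as keeping only lines that end in terminal punctuation and contain a minimum number of words. It is a simplified illustration with made-up thresholds, not the actual C4 pipeline.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 5  # illustrative threshold, not necessarily the exact C4 value

def clean_page(text: str) -> str:
    """Keep only lines that look like natural-language sentences."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # drop menus, headers and other boilerplate lines
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue  # drop very short fragments
        if "lorem ipsum" in line.lower() or "{" in line:
            continue  # drop placeholder text and code-like lines
        kept.append(line)
    return "\n".join(kept)

print(clean_page("Welcome!\nThis is a sentence that ends with a period.\nMenu"))
```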
External links
Common Crawl – in California, United States
Common Crawl GitHub Repository – with the crawler, libraries and example code
Common Crawl Discussion Group
Common Crawl Blog