Common Crawl

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011, and it completes crawls generally every month. Common Crawl was founded by Gil Elbaz. Advisors to the non-profit include Peter Norvig and Joi Ito. The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available. The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have used techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in other legal jurisdictions.
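
As a rough sketch of how such processing can look in practice, the snippet below iterates over records in one gzipped WARC file using the third-party warcio library. The crawl path shown is a hypothetical placeholder, and this is one common pattern rather than an official Common Crawl tool.

import requests
from warcio.archiveiterator import ArchiveIterator

# Hypothetical path to a single WARC file from one monthly crawl; actual file
# paths are published in a manifest alongside each crawl.
WARC_URL = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-10/segments/example/warc/example.warc.gz"

with requests.get(WARC_URL, stream=True) as resp:
    resp.raise_for_status()
    # ArchiveIterator handles the gzip compression and yields WARC records
    # one at a time, so the file never has to fit in memory.
    for record in ArchiveIterator(resp.raw):
        if record.rec_type == "response":  # archived HTTP responses of crawled pages
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))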


History

Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012. The organization began releasing metadata files and the text output of the crawlers alongside .arc files in July of that year; Common Crawl's archives had previously included only .arc files. In December 2012, blekko donated to Common Crawl search engine metadata that blekko had gathered from crawls it conducted from February to October 2012. The donated data helped Common Crawl "improve its crawl while avoiding spam, porn and the influence of excessive SEO." In 2013, Common Crawl began using the Apache Software Foundation's Nutch webcrawler instead of a custom crawler. Common Crawl switched from using .arc files to .warc files with its November 2013 crawl. A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020. One challenge of using Common Crawl data is that, despite the vast amount of documented web data, individual pieces of crawled websites could be better documented. This can create challenges when trying to diagnose problems in projects that use the Common Crawl data. A solution proposed by Timnit Gebru et al. in 2020 to an industry-wide documentation shortfall is that every dataset should be accompanied by a datasheet that documents its motivation, composition, collection process, and recommended uses.
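
To illustrate how an individual capture can be traced back to its place in the archive, the sketch below queries Common Crawl's public CDX-style index server for one URL; the crawl label in the endpoint is illustrative, and the field names reflect the index's newline-delimited JSON output as commonly documented, not anything stated in the history above.

import json
import requests

# Assumed CDX-style index endpoint for one monthly crawl; the crawl label
# "CC-MAIN-2024-10" is only an example.
INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

resp = requests.get(INDEX_URL, params={"url": "commoncrawl.org", "output": "json"})
resp.raise_for_status()

for line in resp.text.splitlines():
    capture = json.loads(line)
    # 'filename', 'offset', and 'length' identify the WARC slice holding this
    # capture, which is the kind of provenance needed to diagnose problems
    # in downstream projects.
    print(capture.get("timestamp"), capture.get("filename"),
          capture.get("offset"), capture.get("length"))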


History of Common Crawl data

The following data have been collected from the official Common Crawl Blog.


Norvig Web Data Science Award

In collaboration with SURFsara, Common Crawl sponsors the Norvig Web Data Science Award, a competition open to students and researchers in the Benelux. The award is named for Peter Norvig, who also chairs the judging committee for the award.




External links


Common Crawl, in California, United States
Common Crawl GitHub Repository, with the crawler, libraries and example code
Common Crawl Discussion Group
Common Crawl Blog
Categories: Internet-related organizations, Web archiving, Web archiving initiatives