TenTen Corpus Family

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name. In the creation of the TenTen corpora, data crawled from the World Wide Web are processed with natural language processing tools developed by the Natural Language Processing Centre at the Faculty of Informatics at Masaryk University (Brno, Czech Republic) and by the Lexical Computing company (developer of the Sketch Engine).


Corpus linguistics

In corpus linguistics, a text corpus is a large and structured collection of texts that are electronically stored and processed. It is used to test hypotheses about languages, to validate linguistic rules, and to study the frequency distribution of words (n-grams) within languages. Electronically processed corpora provide fast search. Text processing procedures such as tokenization, part-of-speech tagging and word-sense disambiguation enrich corpus texts with detailed linguistic information. This makes it possible to narrow the search to a particular part of speech, to word sequences, or to a specific part of the corpus. The first text corpora were created in the 1960s, such as the 1-million-word Brown Corpus of American English. Over time, many further corpora were produced (such as the British National Corpus and the LOB Corpus), and work also began on larger corpora and on corpora covering languages other than English. This development was linked with the emergence of corpus creation tools that help achieve larger size, wider coverage, cleaner data, etc.
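
To make the idea of tokenization and n-gram frequency counts concrete, here is a minimal sketch in Python. The regular-expression tokenizer and the function names are illustrative assumptions, not the tools used for any particular corpus; real tokenizers handle punctuation, clitics and sentence boundaries far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return every sequence of n adjacent tokens, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(text, n):
    """Count how often each n-gram occurs in the text."""
    return Counter(ngrams(tokenize(text), n))


sample = "The cat sat on the mat. The cat slept."
unigrams = ngram_frequencies(sample, 1)
bigrams = ngram_frequencies(sample, 2)

print(unigrams.most_common(2))
print(bigrams.most_common(1))
```

Note that this sketch ignores sentence boundaries, so a bigram such as ("mat", "the") is counted even though it spans two sentences.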


Production of TenTen corpora

The procedure by which TenTen corpora are produced is based on the creators' earlier research in preparing and processing web corpora. First, a huge amount of text data is downloaded from the World Wide Web by the dedicated SpiderLing web crawler. Next, these texts undergo cleaning: the jusText tool removes non-textual material such as navigation links, headers and footers from the HTML source code of web pages, so that only full sentences are preserved. Finally, the ONION tool is applied to remove duplicate text portions from the corpus, which naturally occur on the World Wide Web due to practices such as quoting, citing, copying, etc.
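
A rough sketch of the cleaning and deduplication stages described above follows. It is not the jusText or ONION algorithm: the prose filter (`looks_like_prose`), the 3-word shingles and the 0.5 overlap threshold are illustrative assumptions, and only the Python standard library is used.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements and ignore everything outside them."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def looks_like_prose(paragraph, min_words=5):
    """Assumed heuristic: enough words and sentence-final punctuation."""
    long_enough = len(paragraph.split()) >= min_words
    return long_enough and paragraph.rstrip().endswith((".", "!", "?"))


def shingles(paragraph, n=3):
    """Return the set of n-word sequences in a paragraph."""
    tokens = re.findall(r"\w+", paragraph.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Drop paragraphs whose shingles were mostly seen in earlier paragraphs."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        overlap = len(grams & seen) / len(grams) if grams else 0.0
        if overlap <= threshold:
            kept.append(paragraph)
        seen |= grams
    return kept


html = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<p>Web corpora are built from pages that were downloaded by a crawler.</p>
<p>Contact us</p>
<p>Web corpora are built from pages that were downloaded by a crawler.</p>
<p>Duplicate passages arise naturally because people quote and copy each other.</p>
<div id="footer">Copyright notice</div>
</body></html>
"""

extractor = ParagraphExtractor()
extractor.feed(html)
prose = [p for p in extractor.paragraphs if looks_like_prose(p)]
clean = deduplicate(prose)
for paragraph in clean:
    print(paragraph)
```

In this toy input, the navigation and footer blocks never reach the paragraph list, the short "Contact us" paragraph fails the prose filter, and the repeated sentence is dropped as a duplicate.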


TenTen corpora data structure

TenTen corpora follow a specific metadata structure that is common to all of them. Metadata is contained in structural attributes that relate to individual documents and paragraphs in the corpus. Some TenTen corpora can feature additional specific attributes.


Document attributes

* top-level domain – domain at the highest level of the hierarchical Domain Name System (e.g. "com")
* website – identification string defining a realm of administrative autonomy within the Internet (e.g. "wikipedia.org")
* web domain – collection of related web pages (e.g. "la.wikipedia.org")
* crawl date – date when the document was downloaded from the Web
* url – the Uniform Resource Locator referring to the document's source
* wordcount – number of words in the document
* length – classification of the document into a range by its length measured in thousands of words
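
As an illustration of how the document attributes listed above relate to one another, the following sketch derives them from a URL and a text. Everything here is an assumption made for the example: the field names, the date format, the naive "last two labels" rule for the website (a production system would need a public suffix list), whitespace-based word counting, and above all the length buckets, whose real boundaries are not stated in this article.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class DocumentAttributes:
    tld: str          # top-level domain, e.g. "org"
    website: str      # e.g. "wikipedia.org"
    web_domain: str   # e.g. "la.wikipedia.org"
    crawl_date: str   # date of download; the ISO format is an assumption
    url: str
    wordcount: int
    length: str       # range label; the buckets below are hypothetical


def length_range(wordcount):
    """Map a word count to a hypothetical range in thousands of words."""
    thousands = wordcount / 1000
    if thousands < 1:
        return "under 1k"
    if thousands < 10:
        return "1k-10k"
    return "10k and over"


def describe_document(url, text, crawl_date):
    """Build the attribute record for one downloaded document."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    wordcount = len(text.split())  # naive whitespace tokenization
    return DocumentAttributes(
        tld=labels[-1],
        website=".".join(labels[-2:]),  # naive: ignores suffixes like "co.uk"
        web_domain=host,
        crawl_date=crawl_date,
        url=url,
        wordcount=wordcount,
        length=length_range(wordcount),
    )


attrs = describe_document(
    "https://la.wikipedia.org/wiki/Lingua_Latina",
    "Lingua Latina est lingua Indoeuropaea",
    "2018-10-01",
)
print(attrs)
```

For the sample URL, the three domain attributes come out as "org", "wikipedia.org" and "la.wikipedia.org", matching the examples given in the list.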


Paragraph attributes

* heading – a numeric attribute distinguishing headers and similar titles from ordinary body text (1 if the paragraph is a heading, 0 otherwise)
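
The heading attribute can be used to separate titles from running text. The sketch below uses a hypothetical in-memory `Paragraph` record rather than any actual Sketch Engine data format or API.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    heading: int  # 1 if the paragraph is a heading, 0 otherwise


def body_text(paragraphs):
    """Return only the paragraphs that are ordinary body text."""
    return [p.text for p in paragraphs if p.heading == 0]


def headings(paragraphs):
    """Return only the paragraphs marked as headings."""
    return [p.text for p in paragraphs if p.heading == 1]


document = [
    Paragraph("Production of web corpora", 1),
    Paragraph("Texts are downloaded, cleaned and deduplicated.", 0),
    Paragraph("Data structure", 1),
    Paragraph("Metadata is stored in structural attributes.", 0),
]

print(headings(document))
print(body_text(document))
```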


Available TenTen corpora

The following corpora can be accessed through the Sketch Engine as of October 2018:


See also

* Text corpus
* Sketch Engine
* Web crawler (spider)
* Data deduplication


External links


TenTen Corpus Family (at the Sketch Engine website)