Sketch Engine

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour (lexicographers, researchers in corpus linguistics, translators or language learners) to search large text collections according to complex and linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. It currently supports and provides corpora in more than 90 languages.


History of development

Sketch Engine is a product of Lexical Computing Limited, a company founded in 2003 by the lexicographer and research scientist Adam Kilgarriff. He started a collaboration with Pavel Rychlý, a computer scientist working at the Natural Language Processing Centre, Masaryk University, and the developer of Manatee and Bonito (two major parts of the software suite), and introduced the concept of word sketches. Since then, Sketch Engine has been commercial software; however, all the core features of Manatee and Bonito that were developed by 2003 (and extended since then) are freely available under the GPL license within the NoSketch Engine suite.


Features

A list of tools available in Sketch Engine:
* Word sketches – a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behaviour
* Word sketch difference – compares and contrasts two words by analysing their collocations
* Distributional thesaurus – an automated thesaurus that finds words with similar meaning or appearing in the same or similar contexts
* Concordance search – finds examples of a word form, lemma, phrase, tag or complex structure
* Collocation search – word co-occurrence analysis displaying the words that most frequently co-occur with a search word and can be regarded as collocation candidates
* Word lists – generates frequency lists which can be filtered with complex criteria
* n-grams – generates frequency lists of multi-word expressions
* Terminology / Keyword extraction (both monolingual and bilingual) – automatic extraction of keywords and multi-word terms from texts (based on frequency counts and linguistic criteria)
* Diachronic analysis (Trends) – detects words whose frequency of use changes over time (shows trending words)
* Corpus building and management – creates corpora from the Web or uploaded texts, including part-of-speech tagging and lemmatization, which can be used as data mining software
* Parallel corpus (bilingual) facilities – looking up translation examples (EUR-Lex corpus, Europarl corpus, OPUS corpus, etc.) or building a parallel corpus from one's own aligned texts
* Text type analysis – statistics of metadata in the corpus
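
As an illustration of the n-gram tool described in the list above, the following is a minimal Python sketch of how a frequency list of multi-word sequences can be computed from tokenized text. It is not Sketch Engine code: the function name `ngram_frequencies`, the toy sentence and the whitespace tokenization are assumptions made for this example only.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens in a token list (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the tokens; each window is one n-gram.
    windows = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(windows)


# Toy text, tokenized naively by lowercasing and splitting on whitespace.
# A real corpus tool would use a proper tokenizer and lemmatizer.
text = "the cat sat on the mat and the cat slept on the mat"
tokens = text.lower().split()

bigrams = ngram_frequencies(tokens, 2)

# Print the most frequent bigrams as a small frequency list.
for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count)
```

Filtering such a list (for example by minimum frequency or by a pattern on the words) would correspond loosely to the "complex criteria" mentioned for word lists.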


Keywords and terminology extraction

This tool performs automatic term extraction, identifying words typical of a particular corpus, document, or text. It supports extracting one-word and multi-word units from monolingual and bilingual texts. The terminology extraction feature provides a list of relevant terms based on a comparison with a large corpus of general language. The tool is also available as a separate service, OneClick terms, with a dedicated interface.
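
The comparison with a general-language corpus can be sketched in Python as follows. This example ranks words by a smoothed ratio of their normalized frequencies in a focus text and in a reference text. That scoring formula is one common choice and is an assumption here; the text above does not state which formula Sketch Engine applies. The function names, the `smoothing` parameter and the toy data are likewise hypothetical.

```python
from collections import Counter


def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def keyness_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Rank focus-corpus words by how typical they are compared with a reference.

    Assumed score: (focus_fpm + smoothing) / (reference_fpm + smoothing).
    The smoothing constant avoids division by zero for words that are
    absent from the reference corpus.
    """
    focus = per_million(Counter(focus_tokens))
    reference = per_million(Counter(reference_tokens))
    scores = {}
    for word, focus_fpm in focus.items():
        ref_fpm = reference.get(word, 0.0)
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)
    # Highest score first: words most typical of the focus corpus.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data standing in for a specialized text and a general-language corpus.
focus_tokens = "the corpus query returns the corpus lines".split()
reference_tokens = "the dog saw the cat and the bird".split()

ranked = keyness_scores(focus_tokens, reference_tokens)
for word, score in ranked[:3]:
    print(word, round(score, 2))
```

In this toy case a function word such as "the" scores below 1 because it is relatively more frequent in the reference text, while domain words that are missing from the reference rise to the top.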


List of text corpora

Sketch Engine provides access to more than 700 text corpora. There are monolingual as well as multilingual corpora of different sizes (from thousands of words up to 60 billion words) and from various sources (web, books, subtitles, legal documents, etc.). The list of corpora includes the British National Corpus, the Brown Corpus, the Cambridge Academic English Corpus and Cambridge Learner Corpus, CHILDES corpora of child language, OpenSubtitles (a set of 60 parallel corpora), 24 multilingual corpora of EUR-Lex documents, the TenTen Corpus Family (multi-billion-word web corpora), trends corpora (monitor corpora with daily updates), etc.


Architecture

Sketch Engine consists of three main components: an underlying database management system called Manatee, a web interface search front-end called Bonito and a web interface for corpus building and management called Corpus Architect.


Manatee

Manatee is a database management system specifically devised for effective indexing of large text corpora. It is based on the idea of inverted indexing (keeping an index of all positions of a given word in the text). It has been used to index text corpora comprising tens of billions of words. Corpora indexed by Manatee are searched by formulating queries in the Corpus Query Language (CQL). Manatee is written in C++ and offers an API for a number of other programming languages including Python, Java, Perl and Ruby. More recently, it was rewritten in Go for faster processing of corpus queries.
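
The idea of inverted indexing mentioned above can be sketched in a few lines of Python. This is an in-memory toy, not Manatee's actual data structure or API: the function names `build_inverted_index` and `find_phrase` and the sample sentence are assumptions for illustration, and a system handling billions of words would presumably rely on far more compact on-disk structures.

```python
from collections import defaultdict


def build_inverted_index(tokens):
    """Map each word to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token].append(position)
    return dict(index)


def find_phrase(index, phrase):
    """Return start positions where the words of `phrase` occur consecutively."""
    if not phrase:
        return []
    # Start from the positions of the first word, then keep only those
    # starts where each following word occurs at the expected offset.
    candidates = index.get(phrase[0], [])
    for offset, word in enumerate(phrase[1:], start=1):
        positions = set(index.get(word, []))
        candidates = [start for start in candidates if start + offset in positions]
    return candidates


# Toy corpus, tokenized naively on whitespace.
tokens = "the cat sat on the mat and the cat slept".split()
index = build_inverted_index(tokens)

print(index["cat"])                        # positions of a single word
print(find_phrase(index, ["the", "cat"]))  # start positions of a two-word phrase
```

Because each lookup touches only the position lists of the query words, the cost of a search depends on how often those words occur rather than on the total size of the corpus, which is what makes this kind of index attractive for very large text collections.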


Bonito

Bonito is a web interface for Manatee providing access to corpus search. In the client–server model, Manatee is the server and Bonito plays the client part. It is written in Python.


Corpus Architect

Corpus Architect is a web interface providing corpus building and management features. It is also written in Python.


Applications

Sketch Engine has been used by major British and other publishing houses to produce dictionaries, including Macmillan English Dictionary, Dictionnaires Le Robert, Oxford University Press and Shogakukan. Four of the UK's five biggest dictionary publishers use Sketch Engine.


See also

* SkELL – a free web service for language learning based on Sketch Engine


External links


* Sketch Engine website
* List of corpora available in Sketch Engine
* OneClick terms – an online term extractor with term extraction technology from Sketch Engine