HOME

TheInfoList



OR:

Collocation extraction is the task of using a computer to extract
collocation In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words ...
s automatically from a
corpus Corpus is Latin for "body". It may refer to: Linguistics * Text corpus, in linguistics, a large and structured set of texts * Speech corpus, in linguistics, a large set of speech audio files * Corpus linguistics, a branch of linguistics Music * ...
. The traditional method of performing collocation extraction is to find a formula based on the statistical quantities of those words to calculate a score associated to every word pairs. Proposed formulas are
mutual information In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such as ...
,
t-test A ''t''-test is any statistical hypothesis test in which the test statistic follows a Student's ''t''-distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a ...
,
z test A ''Z''-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-tests test the mean of a distribution. For each significance level in the confid ...
,
chi-squared test A chi-squared test (also chi-square or test) is a statistical hypothesis test used in the analysis of contingency tables In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format ...
and likelihood ratio. Within the area of
corpus linguistics Corpus linguistics is the study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora ...
, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance. 'Crystal clear', 'middle management', 'nuclear family', and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a compound noun, for example 'riding boots' or 'motor cyclist'.


See also

* Collocational restriction * Collostructional analysis * Compound noun, adjective and verb *
Phrasal verb In the traditional grammar of Modern English, a phrasal verb typically constitutes a single semantic unit composed of a verb followed by a particle (examples: ''turn down'', ''run into'' or ''sit up''), sometimes combined with a preposition (e ...
* Siamese twins (English language) *
Terminology extraction Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a give ...
* n-gram analysis


External links


What is collocation


References

Tasks of natural language processing Computational linguistics Corpus linguistics {{comp-ling-stub