Collocation extraction is the task of using a computer to extract
collocation
In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words ...
s automatically from a
corpus
Corpus is Latin for "body". It may refer to:
Linguistics
* Text corpus, in linguistics, a large and structured set of texts
* Speech corpus, in linguistics, a large set of speech audio files
* Corpus linguistics, a branch of linguistics
Music
* ...
.
The traditional method of performing collocation extraction is to find a formula based on the statistical quantities of those words to calculate a score associated to every word pairs. Proposed formulas are
mutual information
In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the " amount of information" (in units such as ...
,
t-test
A ''t''-test is any statistical hypothesis test in which the test statistic follows a Student's ''t''-distribution under the null hypothesis. It is most commonly applied when the test statistic would follow a normal distribution if the value of a ...
,
z test
A ''Z''-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Z-tests test the mean of a distribution. For each significance level in the confid ...
,
chi-squared test
A chi-squared test (also chi-square or test) is a statistical hypothesis test used in the analysis of contingency tables
In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format ...
and
likelihood ratio.
Within the area of
corpus linguistics
Corpus linguistics is the study of a language as that language is expressed in its text corpus (plural ''corpora''), its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora ...
, collocation is defined as a sequence of words or
terms which
co-occur more often than would be expected by chance. 'Crystal clear', 'middle management', 'nuclear family', and 'cosmetic surgery' are examples of collocated pairs of words. Some words are often found together because they make up a
compound noun, for example 'riding boots' or 'motor cyclist'.
See also
*
Collocational restriction
*
Collostructional analysis
*
Compound noun, adjective and verb
*
Phrasal verb
In the traditional grammar of Modern English, a phrasal verb typically constitutes a single semantic unit composed of a verb followed by a particle (examples: ''turn down'', ''run into'' or ''sit up''), sometimes combined with a preposition (e ...
*
Siamese twins (English language)
*
Terminology extraction
Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a give ...
*
n-gram analysis
External links
What is collocation
References
Tasks of natural language processing
Computational linguistics
Corpus linguistics
{{comp-ling-stub