In
linguistics
Linguistics is the scientific study of language. The areas of linguistic analysis are syntax (rules governing the structure of sentences), semantics (meaning), Morphology (linguistics), morphology (structure of words), phonetics (speech sounds ...
, statistical semantics applies the methods of
statistics
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
to the problem of determining the meaning of words or phrases, ideally through
unsupervised learning
Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, wh ...
, to a degree of precision at least sufficient for the purpose of
information retrieval
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
.
History
The term ''statistical semantics'' was first used by
Warren Weaver in his well-known paper on
machine translation
Machine translation is use of computational techniques to translate text or speech from one language to another, including the contextual, idiomatic and pragmatic nuances of both languages.
Early approaches were mostly rule-based or statisti ...
. He argued that
word-sense disambiguation
Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.
Given that natural language requires ref ...
for machine translation should be based on the
co-occurrence In linguistics, co-occurrence or cooccurrence is an above-chance frequency of ordered occurrence of two adjacent terms in a text corpus. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idio ...
frequency of the context words near a given target word. The underlying assumption that "a word is characterized by the company it keeps" was advocated by
J. R. Firth. This assumption is known in
linguistics
Linguistics is the scientific study of language. The areas of linguistic analysis are syntax (rules governing the structure of sentences), semantics (meaning), Morphology (linguistics), morphology (structure of words), phonetics (speech sounds ...
as the
distributional hypothesis. Emile Delavenay defined ''statistical semantics'' as the "statistical study of the meanings of words and their frequency and order of recurrence". "
Furnas et al. 1983" is frequently cited as a foundational contribution to statistical semantics. An early success in the field was
latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the d ...
.
Applications
Research in statistical semantics has resulted in a wide variety of algorithms that use the distributional hypothesis to discover many aspects of
semantics
Semantics is the study of linguistic Meaning (philosophy), meaning. It examines what meaning is, how words get their meaning, and how the meaning of a complex expression depends on its parts. Part of this process involves the distinction betwee ...
, by applying statistical techniques to
large corpora:
* Measuring the
similarity in word meanings
* Measuring the similarity in word relations
* Modeling
similarity-based generalization
* Discovering words with a given relation
* Classifying relations between words
*
Extracting keywords from documents
* Measuring the cohesiveness of text
*
Discovering the different senses of words
* Distinguishing the different senses of words
* Subcognitive aspects of words
* Distinguishing praise from criticism
Related fields
Statistical semantics focuses on the meanings of common words and the relations between common words, unlike
text mining
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from differe ...
, which tends to focus on whole documents, document collections, or named entities (names of people, places, and organizations). Statistical semantics is a subfield of
computational semantics
Computational semantics is the study of how to automate the process of constructing and reasoning with semantics, meaning representations of natural language expressions. It consequently plays an important role in natural language processing, nat ...
, which is in turn a subfield of
computational linguistics
Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics ...
and
natural language processing
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
.
Many of the applications of statistical semantics (listed above) can also be addressed by
lexicon
A lexicon (plural: lexicons, rarely lexica) is the vocabulary of a language or branch of knowledge (such as nautical or medical). In linguistics, a lexicon is a language's inventory of lexemes. The word ''lexicon'' derives from Greek word () ...
-based algorithms, instead of the
corpus
Corpus (plural ''corpora'') is Latin for "body". It may refer to:
Linguistics
* Text corpus, in linguistics, a large and structured set of texts
* Speech corpus, in linguistics, a large set of speech audio files
* Corpus linguistics, a branch of ...
-based algorithms of statistical semantics. One advantage of corpus-based algorithms is that they are typically not as labour-intensive as lexicon-based algorithms. Another advantage is that they are usually easier to adapt to new languages or noisier new text types from e.g. social media than lexicon-based algorithms are. However, the best performance on an application is often achieved by combining the two approaches.
See also
*
Co-occurrence In linguistics, co-occurrence or cooccurrence is an above-chance frequency of ordered occurrence of two adjacent terms in a text corpus. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idio ...
*
Computational linguistics
Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics ...
*
Information retrieval
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
*
Latent semantic analysis
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the d ...
*
Latent semantic indexing
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the d ...
*
Semantic analytics
*
Semantic similarity
*
Statistical natural language processing
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
*
Text corpus
In linguistics and natural language processing, a corpus (: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.
Annotated, they have been used in corp ...
*
Text mining
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from differe ...
*
Web mining
References
Sources
*
*
*: Reprinted in
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
{{DEFAULTSORT:Statistical Semantics
Applications of artificial intelligence
Computational linguistics
Information retrieval techniques
Semantics
Statistical natural language processing
Applied statistics
Computational fields of study