In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.
History
The term ''statistical semantics'' was first used by Warren Weaver in his well-known paper on machine translation. He argued that word sense disambiguation for machine translation should be based on the co-occurrence frequency of the context words near a given target word. The underlying assumption that "a word is characterized by the company it keeps" was advocated by J.R. Firth. This assumption is known in linguistics as the distributional hypothesis. Emile Delavenay defined ''statistical semantics'' as the "statistical study of meanings of words and their frequency and order of recurrence". "Furnas et al. 1983" is frequently cited as a foundational contribution to statistical semantics. An early success in the field was latent semantic analysis.
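The idea of basing disambiguation on the co-occurrence frequency of context words can be illustrated with a minimal sketch that counts the words found within a fixed window around each occurrence of a target word. The function name `context_counts`, the window size of 2, and the toy sentence are assumptions made for illustration only; they are not taken from Weaver's paper.

```python
from collections import Counter


def context_counts(tokens, target, window=2):
    """Count words occurring within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target occurrence itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical toy text containing two senses of "bank".
tokens = "she sat on the river bank while he went to the bank to deposit money".split()
bank_counts = context_counts(tokens, "bank", window=2)
```

In this toy text, "river" falls in the window of the first occurrence and "deposit" in the window of the second, which is the kind of evidence a disambiguation method could use. A realistic system would gather such counts from a large corpus.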
Applications
Research in statistical semantics has resulted in a wide variety of algorithms that use the distributional hypothesis to discover many aspects of semantics, by applying statistical techniques to large corpora:
* Measuring the similarity in word meanings
* Measuring the similarity in word relations
* Modeling similarity-based generalization
* Discovering words with a given relation
* Classifying relations between words
* Extracting keywords from documents
* Measuring the cohesiveness of text
* Discovering the different senses of words
* Distinguishing the different senses of words
* Subcognitive aspects of words
* Distinguishing praise from criticism
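As an illustration of the first application in the list, measuring the similarity in word meanings, the sketch below represents each word by the counts of its neighbouring words and compares two words by the cosine of their count vectors. This is one simple way to apply the distributional hypothesis, not a description of any particular published algorithm; the function names, the window size, and the toy sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; a real application would use a large corpus.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell sharply today",
]
vectors = cooccurrence_vectors(sentences)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_stocks = cosine(vectors["cat"], vectors["stocks"])
```

Here "cat" and "dog" share most of their contexts and so score higher than "cat" and "stocks", which share none. Raw counts are used to keep the sketch short; weighting schemes and dimensionality reduction are common refinements.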
Related fields
Statistical semantics focuses on the meanings of common words and the relations between common words, unlike text mining, which tends to focus on whole documents, document collections, or named entities (names of people, places, and organizations). Statistical semantics is a subfield of computational semantics, which is in turn a subfield of computational linguistics and natural language processing.
Many of the applications of statistical semantics (listed above) can also be addressed by lexicon-based algorithms, instead of the corpus-based algorithms of statistical semantics. One advantage of corpus-based algorithms is that they are typically not as labour-intensive as lexicon-based algorithms. Another advantage is that they are usually easier to adapt to new languages, or to noisier new text types such as social media, than lexicon-based algorithms are. However, the best performance on an application is often achieved by combining the two approaches.
See also
* Co-occurrence
* Computational linguistics
* Information retrieval
* Latent semantic analysis
* Latent semantic indexing
* Semantic analytics
* Semantic similarity
* Statistical natural language processing
* Text corpus
* Text mining
* Web mining