In linguistics, statistical semantics applies the methods of statistics to the problem of determining the meaning of words or phrases, ideally through unsupervised learning, to a degree of precision at least sufficient for the purpose of information retrieval.

History

The term ''statistical semantics'' was first used by Warren Weaver in his well-known paper on machine translation. He argued that word-sense disambiguation for machine translation should be based on the co-occurrence frequency of the context words near a given target word. The underlying assumption that "a word is characterized by the company it keeps" was advocated by J. R. Firth. This assumption is known as the distributional hypothesis. Emile Delavenay defined ''statistical semantics'' as the "statistical study of the meanings of words and their frequency and order of recurrence". Furnas et al. (1983) is frequently cited as a foundational contribution to statistical semantics. An early success in the field was latent semantic analysis.
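The co-occurrence counting described above can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of any cited method: the function name `cooccurrence_counts`, the two-sentence toy corpus, the whitespace tokenization, and the window size of 2 are all assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count context words within `window` tokens of each target word.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which real systems would replace.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[target][tokens[j]] += 1
    return counts


# Toy corpus (assumed): "bank" appears in a financial and a river context.
corpus = [
    "the bank approved the loan",
    "the river bank was muddy",
]
counts = cooccurrence_counts(corpus)
print(counts["bank"].most_common())
```

In this toy corpus the context words around "bank" (such as "approved" versus "river") are the kind of evidence a disambiguation method could draw on. A realistic system would need a far larger corpus and some weighting of very frequent words such as "the".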

Applications

Research in statistical semantics has resulted in a wide variety of algorithms that use the distributional hypothesis to discover many aspects of semantics, by applying statistical techniques to large corpora:

* Measuring the similarity in word meanings
* Measuring the similarity in word relations
* Modeling similarity-based generalization
* Discovering words with a given relation
* Classifying relations between words
* Extracting keywords from documents
* Measuring the cohesiveness of text
* Discovering the different senses of words
* Distinguishing the different senses of words
* Subcognitive aspects of words
* Distinguishing praise from criticism
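To make the first application in the list concrete, the sketch below estimates the similarity of word meanings by comparing context-count vectors with cosine similarity. Cosine over raw counts is only one of many possible measures and is used here as an assumption. The helper names `context_vector` and `cosine`, the toy corpus, and the window size are likewise illustrative.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Collect counts of words seen within `window` tokens of `target`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                vec.update(tokens[j] for j in range(lo, hi) if j != i)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (assumed): "cat" and "dog" share more context than "cat" and "car".
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
]
cat, dog, car = (context_vector(w, corpus) for w in ("cat", "dog", "car"))
print(cosine(cat, dog), cosine(cat, car))
```

Under the distributional hypothesis, words that share more contexts receive a higher score. Here "cat" and "dog" share both "the" and "drinks", while "cat" and "car" share only "the", so the first pair scores higher.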

Related fields

Statistical semantics focuses on the meanings of common words and the relations between common words, unlike text mining, which tends to focus on whole documents, document collections, or named entities (names of people, places, and organizations). Statistical semantics is a subfield of computational semantics, which is in turn a subfield of computational linguistics and natural language processing.

Many of the applications of statistical semantics (listed above) can also be addressed by lexicon-based algorithms, instead of the corpus-based algorithms of statistical semantics. One advantage of corpus-based algorithms is that they are typically not as labour-intensive as lexicon-based algorithms. Another advantage is that they are usually easier to adapt to new languages, or to noisier new text types such as social media, than lexicon-based algorithms are. However, the best performance on an application is often achieved by combining the two approaches.

