In computational linguistics, word-sense induction (WSI) or discrimination is an open problem of natural language processing, which concerns the automatic identification of the senses (i.e. meanings) of a word. Given that the output of word-sense induction is a set of senses for the target word (sense inventory), this task is closely related to that of word-sense disambiguation (WSD), which relies on a predefined sense inventory and aims to resolve the ambiguity of words in context.

Approaches and methods

The output of a word-sense induction algorithm is a clustering of contexts in which the target word occurs or a clustering of words related to the target word. Three main methods have been proposed in the literature:

* Context clustering
* Word clustering
* Co-occurrence graphs

Context clustering

The underlying hypothesis of this approach is that words are semantically similar if they appear in similar documents, within similar context windows, or in similar syntactic contexts. Each occurrence of a target word in a corpus is represented as a context vector. These context vectors can be either first-order vectors, which directly represent the context at hand, or second-order vectors, under which the contexts of the target word are similar if their words tend to co-occur together. The vectors are then clustered into groups, each identifying a sense of the target word. A well-known approach to context clustering is the Context-group Discrimination algorithm, which is based on large matrix computation methods.
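The pipeline described above (build a vector per occurrence, then cluster the vectors) can be illustrated with a minimal sketch. It is not the Context-group Discrimination algorithm: it uses plain first-order bag-of-words vectors and a greedy single-pass clustering. The stopword list, window size, similarity threshold, example sentences, and all function names are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "of", "on", "and", "for", "he", "she"}

def context_vector(tokens, target, window=4):
    """First-order vector: counts of content words near each target occurrence."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vec.update(t for t in neighbours if t not in STOPWORDS)
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster_contexts(vectors, threshold=0.2):
    """Greedy clustering: join the most similar centroid, or open a new cluster."""
    centroids, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best].update(vec)  # Counter.update adds the counts
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels

sentences = [
    "she sat on the bank of the river fishing".split(),
    "the bank approved the loan and the mortgage".split(),
    "fishing boats drifted past the river bank".split(),
    "he asked the bank for a loan".split(),
]
vectors = [context_vector(s, "bank") for s in sentences]
labels = cluster_contexts(vectors)
print(labels)  # [0, 1, 0, 1]
```

Each distinct label stands for one induced sense of the target word. A second-order variant would replace each context word with its own co-occurrence vector and average those before clustering.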

Word clustering

Word clustering is a different approach to the induction of word senses. It consists of clustering words that are semantically similar and can thus bear a specific meaning. Lin’s algorithm is a prototypical example of word clustering: it uses syntactic dependency statistics gathered from a corpus to produce sets of words for each discovered sense of a target word. Clustering By Committee (CBC) also uses syntactic contexts, but exploits a similarity matrix to encode the similarities between words and relies on the notion of committees to output the different senses of the word of interest. These approaches are hard to apply on a large scale across many domains and languages.
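The general idea of grouping a target word's similar words into senses can be sketched as follows. This is a deliberately simplified illustration, not Lin's algorithm or CBC: the dependency features are hand-written toy data, Jaccard overlap stands in for a corpus-derived similarity measure, and the thresholds and function names are assumptions.

```python
# Hypothetical (relation, head) dependency features per word.
features = {
    "bass":   {("obj", "play"), ("obj", "catch"), ("mod", "electric"), ("mod", "freshwater")},
    "guitar": {("obj", "play"), ("mod", "electric"), ("obj", "tune")},
    "piano":  {("obj", "play"), ("obj", "tune")},
    "trout":  {("obj", "catch"), ("mod", "freshwater"), ("obj", "grill")},
    "salmon": {("obj", "catch"), ("obj", "grill")},
}

def jaccard(a, b):
    """Overlap of two feature sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def induce_senses(target, features, neighbour_min=0.1, group_min=0.3):
    """Group the target's similar words so that each group suggests one sense."""
    sims = {w: jaccard(features[target], f) for w, f in features.items() if w != target}
    # Visit the most similar neighbours first; ties are broken alphabetically.
    neighbours = sorted((w for w, s in sims.items() if s >= neighbour_min),
                        key=lambda w: (-sims[w], w))
    groups = []
    for word in neighbours:
        for group in groups:
            if any(jaccard(features[word], features[m]) >= group_min for m in group):
                group.append(word)
                break
        else:
            groups.append([word])  # no existing group is similar enough
    return groups

senses = induce_senses("bass", features)
print(senses)  # [['guitar', 'piano'], ['trout', 'salmon']]
```

Here the two groups correspond to the musical and the fish sense of the target. The sketch also shows the practical limitation: it needs reliable syntactic features for every word.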

Co-occurrence graphs

The main hypothesis of co-occurrence graphs is that the semantics of a word can be represented by means of a co-occurrence graph, whose vertices are co-occurrences and whose edges are co-occurrence relations. These approaches are related to word clustering methods, where co-occurrences between words can be obtained on the basis of grammatical or collocational relations. HyperLex is a successful graph-based approach built on the identification of hubs in co-occurrence graphs, but it has to cope with the need to tune a large number of parameters. To deal with this issue, several graph-based algorithms have been proposed that rely on simple graph patterns, namely Curvature Clustering, Squares, Triangles and Diamonds (SquaT++), and Balanced Maximum Spanning Tree Clustering (B-MST). The patterns aim at identifying meanings using the local structural properties of the co-occurrence graph. Chinese Whispers is a randomized algorithm that partitions the graph vertices by iteratively transferring the mainstream message (i.e. word sense) to neighboring vertices. Co-occurrence graph approaches have been shown to achieve state-of-the-art performance in standard evaluation tasks.
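The label-propagation idea behind Chinese Whispers can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: the contexts are toy data with the target word and function words already removed, edge weights are raw co-occurrence counts, and the iteration count, seed, and function names are hypothetical choices rather than part of any published implementation.

```python
import random
from collections import defaultdict
from itertools import combinations

def build_graph(contexts):
    """Weighted co-occurrence graph: an edge counts the contexts shared by two words."""
    graph = defaultdict(lambda: defaultdict(int))
    for words in contexts:
        for u, v in combinations(sorted(set(words)), 2):
            graph[u][v] += 1
            graph[v][u] += 1
    return graph

def chinese_whispers(graph, iterations=10, seed=0):
    """Each vertex repeatedly adopts the label with the largest total edge weight
    among its neighbours; ties are broken at random."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}  # every vertex starts in its own class
    nodes = sorted(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for node in nodes:
            scores = defaultdict(int)
            for neighbour, weight in graph[node].items():
                scores[labels[neighbour]] += weight
            if scores:
                top = max(scores.values())
                labels[node] = rng.choice(sorted(l for l, s in scores.items() if s == top))
    return labels

# Hypothetical contexts of the target word "bank", with the target itself removed.
contexts = [
    ["river", "water", "fishing"],
    ["river", "water"],
    ["loan", "money", "mortgage"],
    ["loan", "money"],
]
labels = chinese_whispers(build_graph(contexts))

senses = defaultdict(set)
for word, label in labels.items():
    senses[label].add(word)
print(sorted(sorted(group) for group in senses.values()))
```

On this toy graph the two connected groups of words end up with two different labels, one per induced sense. The label names themselves depend on the random visiting order, so only the partition is meaningful.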

Applications

* Word-sense induction has been shown to benefit Web information retrieval when highly ambiguous queries are employed.
* Simple word-sense induction algorithms boost Web search result clustering considerably and improve the diversification of search results returned by search engines such as Yahoo!.
* Word-sense induction has been applied to enrich lexical resources such as WordNet.

Software

SenseClusters is a freely available open source software package that performs both context clustering and word clustering.
