In natural language processing and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora, including the Open Directory Project, have been used.
ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a means of improving text categorization and has been used by this pair of researchers to compute what they refer to as "semantic relatedness" by means of cosine similarity between the aforementioned vectors, collectively interpreted as a space of "concepts explicitly defined and described by humans", where Wikipedia articles (or ODP entries, or otherwise titles of documents in the knowledge base corpus) are equated with concepts. The name "explicit semantic analysis" contrasts with latent semantic analysis (LSA), because the use of a knowledge base makes it possible to assign human-readable labels to the concepts that make up the vector space.
Model
To perform the basic variant of ESA, one starts with a collection of texts, say, all Wikipedia articles; let the number of documents in the collection be N. These are all turned into "bags of words", i.e., term frequency histograms, stored in an inverted index. Using this inverted index, one can find for any word the set of Wikipedia articles containing this word; in the vocabulary of Egozi, Markovitch and Gabrilovitch, "each word appearing in the Wikipedia corpus can be seen as triggering each of the concepts it points to in the inverted index."
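To make the construction concrete, the following is a minimal Python sketch of such an inverted index over a hypothetical three-document toy corpus; the corpus, variable names, and exact weighting details are illustrative assumptions, not the precise scheme of Gabrilovich and Markovitch.

<syntaxhighlight lang="python">
import math
from collections import Counter, defaultdict

# Toy "knowledge base": each document plays the role of a Wikipedia article (concept).
corpus = {
    "Cat": "cat feline pet fur cat",
    "Dog": "dog canine pet fur dog bark",
    "Computer": "computer machine program software",
}

N = len(corpus)  # number of documents/concepts

# Term-frequency histograms ("bags of words"), one per document.
tf = {doc: Counter(text.split()) for doc, text in corpus.items()}

# Document frequency: in how many documents each word appears.
df = Counter(word for counts in tf.values() for word in counts)

# Inverted index: word -> {concept: tf-idf score}, with term frequency
# weighted by the total number of words in the document.
inverted_index = defaultdict(dict)
for doc, counts in tf.items():
    length = sum(counts.values())
    for word, count in counts.items():
        idf = math.log(N / df[word])
        inverted_index[word][doc] = (count / length) * idf

# "cat" triggers only the concept "Cat"; "pet" triggers both "Cat" and "Dog".
print(inverted_index["pet"])
</syntaxhighlight>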
The output of the inverted index for a single-word query is a list of indexed documents (Wikipedia articles), each given a score depending on how often the word in question occurred in them (weighted by the total number of words in the document). Mathematically, this list is an N-dimensional vector of word-document scores, where a document not containing the query word has score zero. To compute the relatedness of two words, one compares the vectors (say u and v) by computing the cosine similarity,
:<math>\operatorname{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} = \frac{\sum_{i=1}^{N} u_i v_i}{\sqrt{\sum_{i=1}^{N} u_i^2} \, \sqrt{\sum_{i=1}^{N} v_i^2}}</math>
and this gives a numeric estimate of the semantic relatedness of the words. The scheme is extended from single words to multi-word texts by simply summing the vectors of all words in the text.
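Continuing the toy sketch above (and reusing its hypothetical inverted_index), word and text relatedness could then be computed along these lines; this is an illustrative sketch, not the authors' reference implementation.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def concept_vector(word):
    """N-dimensional vector of tf-idf scores; concepts not containing the word score zero."""
    return inverted_index.get(word, {})

def text_vector(text):
    """Extend from single words to multi-word texts by summing the word vectors."""
    total = defaultdict(float)
    for word in text.split():
        for concept, score in concept_vector(word).items():
            total[concept] += score
    return total

def cosine_similarity(u, v):
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Semantic relatedness of two words, and of two short texts.
print(cosine_similarity(concept_vector("pet"), concept_vector("fur")))
print(cosine_similarity(text_vector("cat pet"), text_vector("dog bark")))
</syntaxhighlight>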
Analysis
ESA, as originally posited by Gabrilovich and Markovitch, operates under the assumption that the knowledge base contains topically orthogonal concepts. However, it was later shown by Anderka and Stein that ESA also improves the performance of information retrieval systems when it is based not on Wikipedia, but on the Reuters corpus of newswire articles, which does not satisfy the orthogonality property; in their experiments, Anderka and Stein used newswire stories as "concepts". To explain this observation, links have been shown between ESA and the generalized vector space model. Gabrilovich and Markovitch replied to Anderka and Stein by pointing out that their experimental result was achieved using "a single application of ESA (text similarity)" and "just a single, extremely small and homogenous test collection of 50 news documents".
Applications
Word relatedness
ESA is considered by its authors a measure of semantic relatedness (as opposed to semantic similarity). On datasets used to benchmark relatedness of words, ESA outperforms other algorithms, including WordNet semantic similarity measures and the skip-gram neural network language model (Word2vec).
Document relatedness
ESA is used in commercial software packages for computing relatedness of documents. Domain-specific restrictions on the ESA model are sometimes used to provide more robust document matching.
Extensions
Cross-language explicit semantic analysis (CL-ESA) is a multilingual generalization of ESA (Martin Potthast, Benno Stein, and Maik Anderka, "A Wikipedia-based multilingual retrieval model", Proceedings of the 30th European Conference on IR Research (ECIR), pp. 522–530, 2008). CL-ESA exploits a document-aligned multilingual reference collection (e.g., again, Wikipedia) to represent a document as a language-independent concept vector. The relatedness of two documents in different languages is assessed by the cosine similarity between the corresponding vector representations.
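As a rough sketch of the idea, the following Python fragment maps documents in different languages onto a shared concept space and compares them by cosine similarity; the two-language toy collection, identifiers, and weighting are hypothetical assumptions for illustration only.

<syntaxhighlight lang="python">
import math
from collections import Counter, defaultdict

# Hypothetical document-aligned reference collection: the same concept
# described in two languages (e.g., aligned Wikipedia articles).
aligned_corpus = {
    "Cat":      {"en": "cat feline pet fur",        "de": "katze haustier fell"},
    "Dog":      {"en": "dog canine pet fur bark",   "de": "hund haustier fell bellen"},
    "Computer": {"en": "computer machine software", "de": "computer maschine software"},
}

def build_index(lang):
    """Per-language inverted index: word -> {concept: tf-idf score}."""
    tf = {concept: Counter(texts[lang].split()) for concept, texts in aligned_corpus.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    n = len(tf)
    index = defaultdict(dict)
    for concept, counts in tf.items():
        length = sum(counts.values())
        for word, count in counts.items():
            index[word][concept] = (count / length) * math.log(n / df[word])
    return index

indexes = {lang: build_index(lang) for lang in ("en", "de")}

def concept_vector(text, lang):
    """Language-independent concept vector: concept labels are shared across languages."""
    vec = defaultdict(float)
    for word in text.split():
        for concept, score in indexes[lang].get(word, {}).items():
            vec[concept] += score
    return vec

def cosine(u, v):
    dot = sum(u[c] * v.get(c, 0.0) for c in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Cross-language relatedness of an English and a German document.
print(cosine(concept_vector("pet fur", "en"), concept_vector("haustier fell", "de")))
</syntaxhighlight>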
See also
* Topic model
References
External links
Explicit semantic analysis on Evgeniy Gabrilovich's homepage; has links to implementations
{{Natural language processing}}
Natural language processing
Vector space model