The generalized vector space model is a generalization of the vector space model used in information retrieval. Wong ''et al.'' presented an analysis of the problems that the pairwise orthogonality assumption of the vector space model (VSM) creates. From here they extended the VSM to the generalized vector space model (GVSM).
Definitions
GVSM introduces term-to-term correlations, which drop the pairwise orthogonality assumption. More specifically, the model considers a new space in which each term vector ''t<sub>i</sub>'' is expressed as a linear combination of ''2<sup>n</sup>'' vectors ''m<sub>r</sub>'', where ''r'' = 1, ..., ''2<sup>n</sup>''.
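For a document ''d<sub>k</sub>'' and a query ''q'', the similarity function now becomes:
:<math>sim(d_k, q) = \frac{\sum_{j=1}^n \sum_{i=1}^n w_{i,k} \, w_{j,q} \, t_i \cdot t_j}{\sqrt{\sum_{i=1}^n w_{i,k}^2} \, \sqrt{\sum_{i=1}^n w_{i,q}^2}}</math>
where ''w<sub>i,k</sub>'' and ''w<sub>j,q</sub>'' are the weights of the terms in the document and in the query, and ''t<sub>i</sub>'' and ''t<sub>j</sub>'' are now vectors of a ''2<sup>n</sup>''-dimensional space.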
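The similarity computation can be illustrated with a minimal Python sketch. It assumes the similarity is the sum of document weight times query weight times term correlation over all term pairs, divided by the Euclidean norms of the two weight vectors. The function name <code>gvsm_similarity</code>, the argument names, and the toy matrices are hypothetical and chosen only for illustration.

```python
import math


def gvsm_similarity(doc_weights, query_weights, term_corr):
    """Similarity of a document and a query under term correlations.

    term_corr[i][j] is assumed to hold the inner product t_i . t_j.
    """
    n = len(doc_weights)
    # Double sum over all term pairs, weighted by the term correlation.
    numerator = sum(
        doc_weights[i] * query_weights[j] * term_corr[i][j]
        for i in range(n)
        for j in range(n)
    )
    doc_norm = math.sqrt(sum(w * w for w in doc_weights))
    query_norm = math.sqrt(sum(w * w for w in query_weights))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # an empty document or query matches nothing
    return numerator / (doc_norm * query_norm)


# Toy data: the document contains only term 0, the query only term 1.
doc = [1.0, 0.0]
query = [0.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]    # orthogonal terms, as in the plain VSM
correlated = [[1.0, 0.5], [0.5, 1.0]]  # hypothetical correlation of 0.5

print(gvsm_similarity(doc, query, identity))
print(gvsm_similarity(doc, query, correlated))
```

With the identity matrix the sketch reduces to the cosine similarity of the plain VSM, so a document and a query that share no terms score zero. With a non-zero off-diagonal correlation the same pair receives a positive score.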
Term correlation <math>t_i \cdot t_j</math> can be implemented in several ways. For example, Wong et al. use the term occurrence frequency matrix obtained from automatic indexing as input to their algorithm; the output is the term correlation between any pair of index terms.
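One way to turn a term occurrence frequency matrix into pairwise term correlations is sketched below. It follows the minterm construction as it is commonly presented in information retrieval textbooks, and it should be read as an illustration under stated assumptions rather than as the exact algorithm of Wong et al. Assumptions: each distinct pattern of term occurrence found in the collection is treated as one active minterm, a term's coordinate on a minterm is the sum of its weights over the documents with that pattern, and term vectors are normalized to unit length. The function name <code>term_correlations</code> and the toy matrix are hypothetical.

```python
import math
from collections import defaultdict


def term_correlations(doc_term_weights):
    """Return the n x n matrix of inner products t_i . t_j.

    doc_term_weights is a list of rows, one per document, each holding
    the weight (for example the frequency) of every index term.
    """
    n = len(doc_term_weights[0])
    # Group documents by occurrence pattern; each pattern is one active minterm.
    minterm_sums = defaultdict(lambda: [0.0] * n)
    for row in doc_term_weights:
        pattern = tuple(w > 0 for w in row)
        sums = minterm_sums[pattern]
        for i, w in enumerate(row):
            sums[i] += w
    # Each term vector has one coordinate per active minterm.
    patterns = sorted(minterm_sums)
    term_vectors = []
    for i in range(n):
        coords = [minterm_sums[p][i] for p in patterns]
        norm = math.sqrt(sum(c * c for c in coords))
        term_vectors.append([c / norm if norm else 0.0 for c in coords])
    # Correlation of two terms is the inner product of their unit vectors.
    return [
        [sum(a * b for a, b in zip(term_vectors[i], term_vectors[j]))
         for j in range(n)]
        for i in range(n)
    ]


# Toy collection: 3 documents, 3 terms (values are made up for illustration).
toy_matrix = [
    [2, 1, 0],  # terms 0 and 1 co-occur
    [1, 0, 0],  # term 0 alone
    [0, 0, 3],  # term 2 alone
]
corr = term_correlations(toy_matrix)
for row in corr:
    print([round(value, 3) for value in row])
```

In this toy collection terms 0 and 1 co-occur in one document and therefore receive a positive correlation, while term 2 never co-occurs with the others and stays orthogonal to them.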
Semantic information on GVSM
There are at least two basic directions for embedding term to term relatedness, other than exact keyword matching, into a retrieval model:
# compute semantic correlations between terms
# compute frequency co-occurrence statistics from large corpora
Tsatsaronis focused on the first approach, measuring semantic relatedness (''SR'') using a thesaurus (''O'') such as WordNet. The measure considers the path length, captured by compactness (''SCM''), and the path depth, captured by semantic path elaboration (''SPE'').
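The inner product <math>t_i \cdot t_j</math> is estimated by:
:<math>t_i \cdot t_j = SR((t_i, t_j), (s_i, s_j), O)</math>
where ''s<sub>i</sub>'' and ''s<sub>j</sub>'' are senses of terms ''t<sub>i</sub>'' and ''t<sub>j</sub>'' respectively, maximizing <math>SCM \cdot SPE</math>.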
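The sense selection step can be illustrated with a small Python sketch. It assumes that compactness and semantic path elaboration are already available as one score per sense pair, which collapses the path-level detail of the original measure. The function name <code>estimate_inner_product</code>, the sense labels, and all score values are hypothetical and do not come from WordNet.

```python
from itertools import product


def estimate_inner_product(senses_i, senses_j, scm, spe):
    """Return (best sense pair, score), where score is the largest SCM * SPE.

    scm and spe map a (sense_i, sense_j) pair to a score; pairs missing
    from either table are treated as unrelated (score 0).
    """
    best_pair, best_score = None, 0.0
    for pair in product(senses_i, senses_j):
        score = scm.get(pair, 0.0) * spe.get(pair, 0.0)
        if score > best_score:
            best_pair, best_score = pair, score
    return best_pair, best_score


# Hypothetical senses and scores for two terms, purely for illustration.
senses_bank = ["bank#1", "bank#2"]
senses_money = ["money#1"]
scm_table = {("bank#1", "money#1"): 0.8, ("bank#2", "money#1"): 0.1}
spe_table = {("bank#1", "money#1"): 0.5, ("bank#2", "money#1"): 0.9}

pair, score = estimate_inner_product(senses_bank, senses_money, scm_table, spe_table)
print(pair, score)
```

The returned score would then take the place of the inner product of the two term vectors in the similarity function.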
Also building on the first approach, Waitelonis et al. have computed semantic relatedness from Linked Open Data resources, including DBpedia and the YAGO taxonomy. They thereby exploit taxonomic relationships among semantic entities in documents and queries after named entity linking.