Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input (what words were typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents. Query expansion involves techniques such as:

* Finding synonyms of words, and searching for the synonyms as well
* Finding semantically related words (e.g. antonyms, meronyms, hyponyms, hypernyms)
* Finding all the various morphological forms of words by stemming each word in the search query
* Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results
* Re-weighting the terms in the original query

Query expansion is a methodology studied in the field of computer science, particularly within the realm of natural language processing.
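Two of the techniques listed above, synonym lookup and stemming, can be illustrated with a minimal Python sketch. The synonym table, the suffix list, and the function names `naive_stem` and `expand_query` are all hypothetical and chosen only for illustration; a production system would more likely rely on a thesaurus or ontology and an established stemming algorithm.

```python
# Toy synonym table (an assumption for illustration, not a real thesaurus).
SYNONYMS = {
    "begin": ["start", "commence"],
    "car": ["automobile"],
}

# Suffixes removed by the naive stemmer below, checked in this order.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(query):
    """Return the original terms plus their stems and synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        stem = naive_stem(term)
        candidates = [term, stem] + SYNONYMS.get(stem, [])
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("cars begin"))
# ['cars', 'car', 'automobile', 'begin', 'start', 'commence']
```

The expanded term list would then be submitted to the search backend in place of the original query, typically with lower weights on the added terms than on the terms the user typed.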

Precision and recall trade-offs

Search engines invoke query expansion to increase the quality of user search results. It is assumed that users do not always formulate search queries using the best terms. A term may fail to be the best choice simply because the database does not contain it.

By stemming a user-entered term, more documents are matched, as the alternate word forms of the term are matched as well, increasing the total recall. This comes at the expense of reducing the precision. Expanding a search query to include the synonyms of a user-entered term likewise increases recall at the expense of precision. This follows from how precision is calculated: it is the fraction of retrieved documents that are relevant, so enlarging the result set lowers precision whenever the added documents are relevant at a lower rate than those already retrieved. It is also inferred that a larger result set negatively impacts overall search result quality, given that many users do not want more results to comb through, regardless of the precision.

The goal of query expansion in this regard is that, by increasing recall, precision can potentially increase rather than decrease, because the result set gains pages that are more relevant (of higher quality), or at least equally relevant. Pages that have the potential to be more relevant to the user's query, and that would otherwise have been left out regardless of their relevance, are included in the result set.

At the same time, many current commercial search engines use word frequency (tf-idf) to assist in ranking. By ranking the occurrences of both the user-entered words and their synonyms and alternate morphological forms, documents with a higher density (high frequency and close proximity) tend to migrate higher up in the search results, leading to higher-quality results near the top of the list despite the larger recall.
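The trade-off described in this section can be made concrete with a small worked example. The sketch below uses the standard definitions (precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that are retrieved). The document identifiers and the two result sets are invented purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for a set of retrieved document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: four documents are relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# Results for the original query: 2 of the 3 retrieved documents are relevant.
original = {"d1", "d2", "d5"}

# Results after expansion: one more relevant document, two more irrelevant ones.
expanded = {"d1", "d2", "d3", "d5", "d6", "d7"}

print(precision_recall(original, relevant))  # precision 2/3, recall 2/4
print(precision_recall(expanded, relevant))  # precision 3/6, recall 3/4
```

In this invented case recall rises from 0.5 to 0.75 while precision falls from about 0.67 to 0.5. Had the expansion added only relevant documents, both measures would have risen, which is the outcome that query expansion combined with good ranking aims for.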

Query expansion methods

Automatic methods for query expansion were proposed in 1960 by Maron and Kuhns. Modern query expansion methods either involve document collection analysis (global or local) or are dictionary- or ontology-based. Global analysis of the document collection is used to search for relations between terms. Local analysis refers to the relevance feedback introduced by Rocchio, who proposed manually judging some of the retrieved documents and using this feedback to expand the query. Since collecting users' judgments can be challenging, a common alternative simply assumes that the top retrieved documents are relevant. This is the so-called pseudo-relevance feedback (PRF). Pseudo-relevance feedback is effective on average but can damage results for some queries, especially difficult ones, since the top retrieved documents are probably non-relevant. Pseudo-relevant documents are used to find expansion candidate terms that co-occur with many query terms. This idea was further developed within the relevance language model formalism in positional relevance and proximity relevance models, which consider the distance to query terms in the pseudo-relevant documents.

Another direction in query expansion is the representation of index and query terms in a vector space which can be used to find related terms at query time, using semantic vectors or word embeddings. More generally, query expansion, with its counterpart document expansion, is today implemented in the form of vector databases, using various encoding schemes based on deep learning.
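The pseudo-relevance feedback loop described in this section can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Rocchio's formulation or a relevance language model: the scoring function is a plain term count, candidate terms are ranked by how many pseudo-relevant documents contain them, and the names `score`, `prf_expand`, `top_k`, `num_terms`, and the sample documents are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical miniature document collection.
DOCS = [
    "jaguar car engine speed",
    "jaguar car engine price",
    "jaguar cat habitat jungle",
    "stock market price",
]


def score(query_terms, doc_terms):
    """Count occurrences of query terms in a document (a stand-in for a real ranker)."""
    counts = Counter(doc_terms)
    return sum(counts[term] for term in query_terms)


def prf_expand(query, docs, top_k=2, num_terms=2):
    """Expand a query with frequent terms from the top_k retrieved documents."""
    query_terms = query.lower().split()
    tokenized = [doc.lower().split() for doc in docs]

    # Initial retrieval: rank all documents against the original query.
    ranked = sorted(tokenized, key=lambda d: score(query_terms, d), reverse=True)

    # Assume, without any user judgment, that the top_k documents are relevant.
    pseudo_relevant = ranked[:top_k]

    # Count in how many pseudo-relevant documents each non-query term appears.
    candidates = Counter()
    for doc in pseudo_relevant:
        candidates.update(t for t in set(doc) if t not in query_terms)

    # Most frequent candidates first; ties broken alphabetically for determinism.
    best = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:num_terms]
    return query_terms + [term for term, _ in best]


print(prf_expand("jaguar car", DOCS))
# ['jaguar', 'car', 'engine', 'price']
```

The sketch also shows why the approach can hurt difficult queries: if the initial ranking had placed the document about the animal at the top, the added terms would have pulled the query further away from what the user meant.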

Software libraries

* QueryTermAnalyzer: open-source, C#. Machine learning based query term weight and synonym analyzer for query expansion.
* LucQE: open-source, Java. Provides a framework along with several implementations that allow query expansion to be performed with Apache Lucene.
* Xapian is an open-source search library which includes support for query expansion.
* ReQue: open-source, Python. A configurable software framework and a collection of gold standard datasets for training and evaluating supervised query expansion methods. See Hossein Fani, Mahtab Tamannaee, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri; An Extensible Toolkit of Query Refinement Methods and Gold Standard Dataset Generation. In Advances in Information Retrieval: 43rd European Conference on IR Research (ECIR'21), 2021.
