Query understanding is the process of inferring the

intent Intentions are mental states in which the agent commits themselves to a course of action. Having the plan to visit the zoo tomorrow is an example of an intention. The action plan is the ''content'' of the intention while the commitment is the ''a ...

of a

search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...

user by extracting semantic meaning from the searcher’s keywords. Query understanding methods generally take place before the search engine retrieves and ranks results. It is related to

natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...

but specifically focused on the understanding of search queries. Query understanding is at the heart of technologies like

Amazon Alexa Amazon Alexa, also known simply as Alexa, is a virtual assistant technology largely based on a Polish speech synthesiser named Ivona, bought by Amazon in 2013. It was first used in the Amazon Echo smart speaker and the Echo Dot, Echo Studio a ...

Apple An apple is an edible fruit produced by an apple tree (''Malus domestica''). Apple trees are cultivated worldwide and are the most widely grown species in the genus ''Malus''. The tree originated in Central Asia, where its wild ancestor ...

Siri Siri ( ) is a virtual assistant that is part of Apple Inc.'s iOS, iPadOS, watchOS, macOS, tvOS, and audioOS operating systems. It uses voice queries, gesture based control, focus-tracking and a natural-language user interface to answer questi ...

Google Assistant Google Assistant is a virtual assistant software application developed by Google that is primarily available on mobile and home automation devices. Based on artificial intelligence, Google Assistant can engage in two-way conversations, unlike t ...

, IBM's Watson, and

Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washi ...

's Cortana.

Methods

Tokenization

Tokenization Tokenization may refer to: * Tokenization (lexical analysis) in language processing * Tokenization (data security) in the field of data security * Word segmentation * Tokenism Tokenism is the practice of making only a perfunctory or symbolic ef ...

is the process of breaking up a

text string In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...

into words or other meaningful elements called tokens. Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, such as splitting the string on punctuation and

whitespace characters In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area ...

. Tokenization is more challenging in languages without spaces between words, such as

Chinese Chinese can refer to: * Something related to China * Chinese people, people of Chinese nationality, citizenship, and/or ethnicity **''Zhonghua minzu'', the supra-ethnic concept of the Chinese nation ** List of ethnic groups in China, people of v ...

and Japanese. Tokenizing text in these languages requires the use of

word segmentation Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in compu ...

algorithms In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing c ...

Spelling correction

Spelling correction is the process of automatically detecting and correcting spelling errors in search queries. Most spelling correction algorithms are based on a

language model A language model is a probability distribution over sequences of words. Given any sequence of words of length , a language model assigns a probability P(w_1,\ldots,w_m) to the whole sequence. Language models generate probabilities by training on ...

, which determines the

a priori probability An ''a priori'' probability is a probability that is derived purely by deductive reasoning. One way of deriving ''a priori'' probabilities is the principle of indifference, which has the character of saying that, if there are ''N'' mutually exclus ...

of an intended query, and an error model (typically a noisy channel model), which determines the probability of a particular misspelling, given an intended query.

Stemming and lemmatization

Many, but not all, languages

inflect In linguistic morphology, inflection (or inflexion) is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and defi ...

words to reflect their role in the utterance they appear in: a word such as *care* may appear as, besides the base form. as *cares*, *cared*, *caring*, and others. The variation between various forms of a word is likely to be of little importance for the relatively coarse-grained model of meaning involved in a retrieval system, and for this reason the task of conflating the various forms of a word is a potentially useful technique to increase recall of a retrieval system. The languages of the world vary in how much morphological variation they exhibit, and for some languages there are simple methods to reduce a word in query to its

lemma Lemma may refer to: Language and linguistics * Lemma (morphology), the canonical, dictionary or citation form of a word * Lemma (psycholinguistics), a mental abstraction of a word about to be uttered Science and mathematics * Lemma (botany), a ...

root In vascular plants, the roots are the organs of a plant that are modified to provide anchorage for the plant and take in water and nutrients into the plant body, which allows plants to grow taller and faster. They are most often below the sur ...

form or its

stem Stem or STEM may refer to: Plant structures * Plant stem, a plant's aboveground axis, made of vascular tissue, off which leaves and flowers hang * Stipe (botany), a stalk to support some other structure * Stipe (mycology), the stem of a mushro ...

. For some other languages, this operation involves non-trivial string processing. A noun in English typically appears in four variants: *cat* *cat's* *cats* *cats'* or *child* *child´s* *children* *children's*. Other languages have more variation. Finnish, e.g., potentially exhibits about 5000 forms for a noun, and for many languages the inflectional forms are not limited to

affix In linguistics, an affix is a morpheme that is attached to a word stem to form a new word or word form. Affixes may be derivational, like English ''-ness'' and ''pre-'', or inflectional, like English plural ''-s'' and past tense ''-ed''. They ar ...

es but change the core of the word itself. Stemming algorithms, also known as stemmers, typically use a collection of simple rules to remove

suffix In linguistics, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the conjugation of verbs. Suffixes can carry ...

es intended to model the language’s inflection rules. More advanced methods, lemmatisation methods, group together the inflected forms of a word through more complex rule sets based on a word’s part of speech or its record in a

lexical database In digital lexicography, natural language processing, and digital humanities, a lexical resource is a language resource consisting of data regarding the lexemes of the lexicon of one or more languages e.g., in the form of a database. Characteris ...

, transforming an inflected word through lookup or a series of transformations to its lemma. For a long time, it was taken to be proven that morphological normalisation by and large did not help retrieval performance. Once the attention of the information retrieval field moved to languages other than English, it was found that for some languages there were obvious gains to be found.

Entity recognition

Entity recognition is the process of locating and classifying entities within a text string. Named-entity recognition specifically focuses on named entities, such as names of people, places, and organizations. In addition, entity recognition includes identifying concepts in queries that may be represented by multi-word phrases. Entity recognition systems typically use grammar-based linguistic techniques or statistical

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machin ...

models.

Query rewriting

Query rewriting is the process of automatically reformulating a search query to more accurately capture its intent. Query expansion adds additional query terms, such as synonyms, in order to retrieve more documents and thereby increase recall. Query relaxation removes query terms to reduce the requirements for a document to match the query, thereby also increasing recall. Other forms of query rewriting, such as automatically converting consecutive query terms into phrases and restricting query terms to specific

fields Fields may refer to: Music *Fields (band), an indie rock band formed in 2006 *Fields (progressive rock band), a progressive rock band formed in 1971 * ''Fields'' (album), an LP by Swedish-based indie rock band Junip (2010) * "Fields", a song by ...

, aim to increase

precision Precision, precise or precisely may refer to: Science, and technology, and mathematics Mathematics and computing (general) * Accuracy and precision, measurement deviation from true value and its scatter * Significant figures, the number of digit ...

. Apache Lucene search engine uses query rewrite to transform complex queries to more primitive queries, such as expressions with wildcards (e.g. quer*) into a boolean query of the matching terms from the index (such as query OR queries).

References

{{reflist Information retrieval techniques Natural language processing