HOME

TheInfoList



OR:

In text retrieval, full-text search refers to techniques for searching a single computer-stored
document A document is a written, drawn, presented, or memorialized representation of thought, often the manifestation of non-fictional, as well as fictional, content. The word originates from the Latin ''Documentum'', which denotes a "teaching" o ...
or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references). In a full-text search, a
search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques became common in online bibliographic databases in the 1990s. Many websites and application programs (such as
word processing A word is a basic element of language that carries an objective or practical meaning, can be used on its own, and is uninterruptible. Despite the fact that language speakers often have an intuitive grasp of what a word is, there is no consen ...
software) provide full-text-search capabilities. Some web search engines, such as AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems.


Indexing

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called " serial scanning". This is what some tools, such as
grep grep is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command ''g/re/p'' (''globally search for a regular expression and print matching lines''), which has the sa ...
, do when searching. However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an
index Index (or its plural form indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on a Halo megastru ...
, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents. The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific
stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morph ...
on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive".


The precision vs. recall tradeoff

Recall measures the quantity of relevant results returned by a search, while precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned to all relevant results. Precision is the ratio of the number of relevant results returned to the total number of results returned. The diagram at right represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only 1 relevant result of 3 possible relevant results was returned, so the recall is a very low ratio of 1/3, or 33%. The precision for the example is a very low 1/4, or 25%, since only 1 of the 4 results returned was relevant. Due to the ambiguities of
natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languag ...
, full-text-search systems typically includes options like stop words to increase precision and
stemming In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morph ...
to increase recall. Controlled-vocabulary searching also helps alleviate low-precision issues by tagging documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall lowers precision.


False-positive problem

Full-text searching is likely to retrieve many documents that are not
relevant Relevant is something directly related, connected or pertinent to a topic; it may also mean something that is current. Relevant may also refer to: * Relevant operator, a concept in physics, see renormalization group * Relevant, Ain, a commune ...
to the ''intended'' search question. Such documents are called ''false positives'' (see
Type I error In statistical hypothesis testing, a type I error is the mistaken rejection of an actually true null hypothesis (also known as a "false positive" finding or conclusion; example: "an innocent person is convicted"), while a type II error is the f ...
). The retrieval of irrelevant documents is often caused by the inherent ambiguity of
natural language In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languag ...
. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background). Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "bank", clustering can be used to categorize the document/data universe into "financial institution", "place to sit", "place to store" etc. Depending on the occurrences of words relevant to the categories, search terms or a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.


Performance improvements

The deficiencies of full text searching have been addressed in two ways: By providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.


Improved querying tools

* Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text. * Field-restricted search. Some search engines enable users to limit full text searches to a particular field within a stored data record, such as "Title" or "Author." * . Searches that use
Boolean Any kind of logic, function, expression, or theory based on the work of George Boole is considered Boolean. Related to this, "Boolean" may refer to: * Boolean data type, a form of data with only two possible values (usually "true" and "false" ...
operators (for example, ) can dramatically increase the precision of a full text search. The operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The operator says, in effect, "Do not retrieve any document that contains this word." If the retrieval list retrieves too few documents, the operator can be used to increase recall; consider, for example, . This search will retrieve documents about online encyclopedias that use the term "Internet" instead of "online." This increase in precision is very commonly counter-productive since it usually comes with a dramatic loss of recall. * Phrase search. A phrase search matches only those documents that contain a specified phrase, such as * Concept search. A search that is based on multi-word concepts, for example Compound term processing. This type of search is becoming popular in many e-discovery solutions. * Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context. * Proximity search. A phrase search matches only those documents that contain two or more words that are separated by a specified number of words; a search for would retrieve only those documents in which the words occur within two words of each other. *
Regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision. * Fuzzy search will search for document that match the given terms and some variation around them (using for instance edit distance to threshold the multiple variation) * Wildcard search. A search that substitutes one or more characters in a search query for a wildcard character such as an
asterisk The asterisk ( ), from Late Latin , from Ancient Greek , ''asteriskos'', "little star", is a typographical symbol. It is so called because it resembles a conventional image of a heraldic star. Computer scientists and mathematicians often vo ...
. For example, using the asterisk in a search query will find "sin", "son", "sun", etc. in a text.


Improved search algorithms

The
PageRank PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. Accordi ...
algorithm developed by
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
gives more prominence to documents to which other Web pages have linked. See
Search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
for additional examples.


Software

The following is a partial list of available software products whose predominant purpose is to perform full-text indexing and searching. Some of these are accompanied with detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full-text search may be accomplished.


Free and open source software

* Apache Lucene * Apache Solr * ArangoSearch * BaseX * KinoSearch * Lemur/Indri * mnoGoSearch * OpenSearch * PostgreSQL * Searchdaimon *
Sphinx A sphinx ( , grc, σφίγξ , Boeotian: , plural sphinxes or sphinges) is a mythical creature with the head of a human, the body of a lion, and the wings of a falcon. In Greek tradition, the sphinx has the head of a woman, the haunches o ...
* Swish-e * Terrier IR Platform * Xapian


Proprietary software

* Algolia * Autonomy Corporation * Azure Search * Bar Ilan Responsa Project * Basis database * Brainware * BRS/Search * Concept Searching Limited * Dieselpoint * dtSearch * Elasticsearch * Endeca * Exalead * Fast Search & Transfer * Inktomi * Lucid Imagination * MarkLogic * SAP HANA * Swiftype *
Thunderstone Software LLC. Thunderstone is a US-based software company specializing in enterprise search. The company has its headquarters in Cleveland Cleveland ( ), officially the City of Cleveland, is a city in the U.S. state of Ohio and the county seat of Cuya ...
*
Vivísimo Vivisimo was a privately held technology company in Pittsburgh, Pennsylvania, specialising in the development of computer search engines. The company was acquired by IBM in May 2012 and is now branded aIBM Watson Explorer a product of the IBM Wa ...


References


See also

*
Pattern matching In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be ...
and string matching * Compound term processing * Enterprise search * Information extraction * Information retrieval * Faceted search * WebCrawler, first FTS engine * Search engine indexing - how search engines generate indices to support full-text searching {{DEFAULTSORT:Full Text Search Text editor features Information retrieval genres