Document retrieval is defined as the matching of some stated user query against a set of
free-text
In Document retrieval, text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of ...
records. These records could be any type of mainly
unstructured text, such as
newspaper article
An article or piece is a written work published in a print or electronic medium, for the propagation of news, research results, academic analysis or debate.
News
A news article discusses current or recent news of either general interest (i.e ...
s, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.
Document retrieval is sometimes referred to as, or as a branch of, text retrieval. Text retrieval is a branch of
information retrieval
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
where the information is stored primarily in the form of
text
Text may refer to:
Written word
* Text (literary theory)
In literary theory, a text is any object that can be "read", whether this object is a work of literature, a street sign, an arrangement of buildings on a city block, or styles of clothi ...
. Text databases became decentralized thanks to the
personal computer
A personal computer, commonly referred to as PC or computer, is a computer designed for individual use. It is typically used for tasks such as Word processor, word processing, web browser, internet browsing, email, multimedia playback, and PC ...
. Text retrieval is a critical area of study today, since it is the fundamental basis of all
internet
The Internet (or internet) is the Global network, global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a internetworking, network of networks ...
search engine
A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
s.
Description
Document retrieval systems find information to given criteria by matching text records (''documents'') against user queries, as opposed to
expert system
In artificial intelligence (AI), an expert system is a computer system emulating the decision-making ability of a human expert.
Expert systems are designed to solve complex problems by reasoning through bodies of knowledge, represented mainly as ...
s that answer questions by
inferring over a logical
knowledge database. A document retrieval system consists of a database of documents, a
classification algorithm to build a full text index, and a user interface to access the database.
A document retrieval system has two main tasks:
# Find relevant documents to user queries
# Evaluate the matching results and sort them according to relevance, using algorithms such as
PageRank
PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. Accordin ...
.
Internet
search engines
Search engines, including web search engines, selection-based search engines, metasearch engines, desktop search tools, and web portals and vertical market websites have a search facility for online databases.
By content/topic
Gene ...
are classical applications of document retrieval. The vast majority of retrieval systems currently in use range from simple Boolean systems through to systems using
statistical
Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a s ...
or
natural language processing
Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...
techniques.
Variations
There are two main classes of indexing schemata for document retrieval systems: ''form based'' (or ''word based''), and ''content based'' indexing. The document classification scheme (or
indexing algorithm) in use determines the nature of the document retrieval system.
Form based
Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A
suffix tree
In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
algorithm is an example for form based indexing.
Content based
The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an
inverted index
In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of d ...
algorithm.
A ''signature file'' is a technique that creates a ''quick and dirty'' filter, for example a
Bloom filter
In computing, a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives ar ...
, that will keep all the documents that match to the query and ''hopefully'' a few ones that do not. The way this is done is by creating for each file a signature, typically a hash coded version. One method is superimposed coding. A post-processing step is done to discard the false alarms. Since in most cases this structure is inferior to
inverted files in terms of speed, size and functionality, it is not used widely. However, with proper parameters it can beat the inverted files in certain environments.
Example: PubMed
The
PubMed
PubMed is an openly accessible, free database which includes primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institute ...
form interface features the "related articles" search which works through a comparison of words from the documents' title, abstract, and
MeSH
Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences. It serves as a thesaurus of index terms that facilitates searching. Created and updated by th ...
terms using a word-weighted algorithm.
See also
*
Compound term processing
*
Document classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more Class (philosophy), classes or Categorization, categories. This may be do ...
*
Enterprise search
Enterprise search is software technology for searching data sources internal to a company, typically intranet and database content. The search is generally offered only to users internal to the company. Enterprise search can be contrasted with web ...
*
Evaluation measures (information retrieval)
*
Full text search
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts ...
*
Information retrieval
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
*
Latent semantic indexing
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the d ...
*
Search engine
A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
References
Further reading
*
*
*
External links
{{Commons category
Formal Foundation of Information Retrieval Buckinghamshire Chilterns University College
Information retrieval genres
Electronic documents
Substring indices
Search engine software