
Compound-term processing, in information retrieval, is the matching of search results on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a compound term. Compound-term processing is an approach to an old problem: how can one improve the relevance of search results while maintaining ease of use? Using this technique, a search for "survival rates following a triple heart bypass in elderly people" will locate documents about this topic even if this precise phrase is not contained in any document. This can be performed by a concept search, which itself uses compound-term processing. The concept search extracts the key concepts automatically (in this case "survival rates", "triple heart bypass" and "elderly people") and uses these concepts to select the most relevant documents.
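The extraction of key concepts from the example query can be illustrated with a toy sketch. It is not the statistical method discussed in this article: it simply splits the query at stop words and keeps the multi-word runs in between. The stop-word list and the function name `extract_compound_terms` are assumptions made for this example only.

```python
import re

# Hypothetical stop-word list, chosen for this illustration only.
STOP_WORDS = {"a", "an", "the", "in", "of", "on", "for", "following", "and", "to"}


def extract_compound_terms(query: str) -> list[str]:
    """Split a query at stop words and keep the multi-word runs between them."""
    words = re.findall(r"[a-z]+", query.lower())
    terms: list[str] = []
    current: list[str] = []
    for word in words:
        if word in STOP_WORDS:
            # A stop word closes the current run; keep it only if it is compound.
            if len(current) > 1:
                terms.append(" ".join(current))
            current = []
        else:
            current.append(word)
    # Flush the final run once the query ends.
    if len(current) > 1:
        terms.append(" ".join(current))
    return terms


query = "survival rates following a triple heart bypass in elderly people"
print(extract_compound_terms(query))
```

For the example query this yields the three concepts named above. A real system would need a more robust way of finding term boundaries than a fixed stop-word list.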


Techniques

In August 2003, Concept Searching Limited introduced the idea of using statistical compound-term processing. CLAMOUR is a European collaborative project that aims to find a better way to classify industrial information and statistics when collecting and disseminating them. CLAMOUR appears to use a linguistic approach rather than one based on statistical modelling.


History

Techniques for probabilistic weighting of single-word terms date back to at least 1976, in the landmark publication by Stephen E. Robertson and Karen Spärck Jones. Robertson stated that the assumption of word independence is not justified and exists as a matter of mathematical convenience. His objection to term independence was not a new idea; it dates back to at least 1964, when H. H. Williams stated that "[t]he assumption of independence of words in a document is usually made as a matter of mathematical convenience". In 2004, Anna Lynn Patterson filed patents on "phrase-based searching in an information retrieval system", to which Google subsequently acquired the rights ("Google Acquires Cuil Patent Applications").


Adaptability

Statistical compound-term processing is more adaptable than the process described by Patterson. Her process is targeted at searching the World Wide Web, where extensive statistical knowledge of common searches can be used to identify candidate phrases. Statistical compound-term processing is better suited to enterprise search applications, where such a priori knowledge is not available. Statistical compound-term processing is also more adaptable than the linguistic approach taken by the CLAMOUR project, which must consider the syntactic properties of the terms (such as part of speech, gender and number) and their combinations. CLAMOUR is highly language-dependent, whereas the statistical approach is language-independent.
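To make the contrast with a linguistic approach concrete, the sketch below scores adjacent word pairs using only counts taken from the document collection itself. It uses pointwise mutual information (PMI), a well-known association measure chosen here purely for illustration; the source does not say which statistic Concept Searching Limited or Patterson use. The function name `score_bigrams`, the `min_count` threshold and the tiny corpus are all assumptions.

```python
import math
from collections import Counter


def score_bigrams(documents: list[str], min_count: int = 2) -> dict[tuple[str, str], float]:
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for doc in documents:
        # Whitespace tokenization is an assumption; no grammar or
        # part-of-speech information is used anywhere.
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    scores: dict[tuple[str, str], float] = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # ignore pairs too rare to estimate reliably
        p_pair = count / total_pairs
        p_first = unigrams[first] / total_words
        p_second = unigrams[second] / total_words
        # PMI > 0 means the pair co-occurs more often than independence predicts.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Made-up corpus for illustration only.
corpus = [
    "heart bypass surgery outcomes",
    "recovery after heart bypass",
    "heart bypass in elderly people",
    "elderly people and heart disease",
]
for pair, score in sorted(score_bigrams(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Because the scores depend only on how often words occur together, the same code can be run on a collection in another language without a dictionary or parser, provided the text can be split into words.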


Applications

Compound-term processing allows information retrieval applications, such as search engines, to perform their matching on the basis of multi-word concepts rather than on single words in isolation, which can be highly ambiguous. Early search engines looked for documents containing the words entered by the user into the search box. These are known as keyword search engines. Boolean search engines add a degree of sophistication by allowing the user to specify additional requirements. For example, "Tiger NEAR Woods AND (golf OR golfing) NOT Volkswagen" uses the operators "NEAR", "AND", "OR" and "NOT" to constrain which words must appear, which may appear, which must not appear, and how close together they must be. A phrase search is simpler to use, but requires that the exact phrase specified appear in the results.
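The difference between keyword matching and phrase matching can be sketched in a few lines. This is a simplified illustration, not the behaviour of any particular search engine; the function names and the sample documents are assumptions.

```python
def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def keyword_match(query: str, document: str) -> bool:
    """True if every query word occurs somewhere in the document."""
    doc_words = set(tokenize(document))
    return all(word in doc_words for word in tokenize(query))


def phrase_match(query: str, document: str) -> bool:
    """True only if the query words occur contiguously and in order."""
    q, d = tokenize(query), tokenize(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


query = "triple heart bypass"
# Made-up documents for illustration only.
doc_a = "outcomes of triple heart bypass surgery"
doc_b = "a triple jump athlete with a heart condition ran the bypass road"
doc_c = "triple bypass of the heart"

for doc in (doc_a, doc_b, doc_c):
    print(keyword_match(query, doc), phrase_match(query, doc), "-", doc)
```

Keyword matching accepts all three documents, including the unrelated `doc_b`, which shows how single words in isolation can be ambiguous. Phrase matching accepts only `doc_a` and rejects `doc_c`, even though `doc_c` is on topic, because the exact phrase does not appear in it.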


See also

* Concept Searching Limited
* Enterprise search
* Information retrieval

