Cross-language information retrieval (CLIR) is a subfield of information retrieval that deals with retrieving information written in a language different from the language of the user's query.
The term "cross-language information retrieval" has many synonyms, of which the following are perhaps the most frequent: cross-lingual information retrieval, translingual information retrieval, and multilingual information retrieval. The term "multilingual information retrieval" refers more generally both to technology for retrieval of multilingual collections and to technology that has been adapted from handling material in one language to handling material in another. Multilingual information retrieval (MLIR) involves the study of systems that accept queries in various languages and return objects (text and other media) in various languages, translated into the user's language. Cross-language information retrieval refers more specifically to the use case where users formulate their information need in one language and the system retrieves relevant documents in another. To do so, most CLIR systems use translation techniques. CLIR techniques can be classified by the translation resource they rely on:
* Dictionary-based CLIR techniques
* Parallel-corpora-based CLIR techniques
* Comparable-corpora-based CLIR techniques
* Machine-translation-based CLIR techniques
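Of the categories above, the dictionary-based approach is the simplest to illustrate. The following is a minimal sketch, not a description of any particular system: the English-to-German lexicon `EN_DE`, the function names, and the overlap-count scoring are all assumptions made for illustration. Each query term is replaced by every translation listed for it, and target-language documents are ranked by how many translated terms they contain.

```python
from typing import Dict, List

# Hypothetical toy English-to-German lexicon (an assumption for this sketch);
# a real system would load a full bilingual dictionary.
EN_DE: Dict[str, List[str]] = {
    "car": ["auto", "wagen"],
    "insurance": ["versicherung"],
    "cheap": ["billig", "preiswert"],
}


def translate_query(query: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Replace each source term with all of its dictionary translations.

    Terms missing from the lexicon are kept unchanged, so proper nouns and
    shared vocabulary can still match target-language documents.
    """
    translated: List[str] = []
    for term in query.lower().split():
        translated.extend(lexicon.get(term, [term]))
    return translated


def score(document: str, query_terms: List[str]) -> int:
    """Count the distinct translated query terms that occur in the document."""
    doc_terms = set(document.lower().split())
    return len(doc_terms & set(query_terms))


def search(query: str, documents: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Return matching documents, best overlap first."""
    terms = translate_query(query, lexicon)
    ranked = sorted(documents, key=lambda d: score(d, terms), reverse=True)
    return [d for d in ranked if score(d, terms) > 0]


if __name__ == "__main__":
    docs = [
        "billig auto versicherung vergleich",
        "wetter in berlin",
        "auto kaufen",
    ]
    print(search("cheap car insurance", docs, EN_DE))
```

Keeping every listed translation trades precision for recall: ambiguous source terms add unrelated target terms to the query, which is one reason corpus-based and machine-translation-based techniques are also used.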
CLIR systems have improved so much that the most accurate multilingual and cross-lingual ad hoc information retrieval systems today are nearly as effective as monolingual systems. Other related information access tasks, such as media monitoring, information filtering and routing, sentiment analysis, and information extraction, require more sophisticated models and typically more processing and analysis of the information items of interest. Much of that processing needs to be aware of the specifics of the target languages it is deployed in.
For the most part, the various mechanisms of variation in human language pose coverage challenges for information retrieval systems: texts in a collection may treat a topic of interest but use terms or expressions that do not match the user's expression of the information need. This can be true even in the monolingual case, but it is especially true in cross-lingual information retrieval, where users may know the target language only to some extent. The benefits of CLIR technology have been found to be greater for users with poor to moderate competence in the target language than for those who are fluent. Specific technologies in place for CLIR services include morphological analysis to handle inflection, decompounding or compound splitting to handle compound terms, and translation mechanisms to translate a query from one language to another.
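Compound splitting can be sketched as a lexicon lookup with backtracking. The example below is an illustrative assumption rather than the method of any specific CLIR system: the tiny German `LEXICON`, the `split_compound` function, and the `min_len` parameter are hypothetical, and linking elements between compound parts (such as the German "s") are deliberately not handled.

```python
from typing import List, Optional, Set

# Hypothetical toy lexicon of known German word parts, in lowercase
# (an assumption for this sketch).
LEXICON: Set[str] = {"auto", "bahn", "versicherung", "kranken", "haus"}


def split_compound(word: str, lexicon: Set[str], min_len: int = 3) -> Optional[List[str]]:
    """Split a word into lexicon entries, or return None if no full split exists.

    The longest candidate head is tried first; if the remaining tail cannot
    be split, the search backtracks to a shorter head.
    """
    word = word.lower()
    if word in lexicon:
        return [word]
    # Both head and tail must be at least min_len characters long.
    for end in range(len(word) - min_len, min_len - 1, -1):
        head, tail = word[:end], word[end:]
        if head in lexicon:
            rest = split_compound(tail, lexicon, min_len)
            if rest is not None:
                return [head] + rest
    return None


if __name__ == "__main__":
    for w in ["Autobahn", "Krankenhaus", "Autoversicherung", "Fenster"]:
        print(w, split_compound(w, LEXICON))
```

The parts returned for a compound can then be translated or matched individually, so that a query term for one part can retrieve documents that contain only the full compound.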
The first workshop on CLIR was held in Zürich during the SIGIR-96 conference. Workshops have been held yearly since 2000 at the meetings of the Cross Language Evaluation Forum (CLEF). Researchers also convene at the annual Text Retrieval Conference (TREC) to discuss their findings regarding different systems and methods of information retrieval, and the conference has served as a point of reference for the CLIR subfield. Early CLIR experiments were conducted at TREC-6, held at the National Institute of Standards and Technology (NIST) on November 19–21, 1997.
Google Search had a cross-language search feature that was removed in 2013.
See also
* EXCLAIM (EXtensible Cross-Linguistic Automatic Information Machine)
* CLEF (Conference and Labs of the Evaluation Forum, formerly known as Cross-Language Evaluation Forum)
External links
* A resource page for CLIR
* A search engine for CLIR