Rocchio Algorithm
   HOME

TheInfoList



OR:

The Rocchio algorithm is based on a method of relevance feedback found in
information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio
algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
was developed using the
vector space model Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...
. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant.Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: ''An Introduction to Information Retrieval'', page 163-167. Cambridge University Press, 2009. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the
search engine A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a query is dictated by the so called weights, i.e. the variables a, b and c listed below in the Algorithm section.


Algorithm

The
formula In science, a formula is a concise way of expressing information symbolically, as in a mathematical formula or a ''chemical formula''. The informal use of the term ''formula'' in science refers to the general construct of a relationship betwe ...
and variable definitions for Rocchio relevance feedback are as follows: \vec_m = a \,\vec_o + b\, \frac \sum_ \vec_j - c\, \frac \sum_ \vec_k As demonstrated in the formula, the associated weights (a, b, c) are responsible for shaping the modified
vector Vector most often refers to: * Euclidean vector, a quantity with a magnitude and a direction * Disease vector, an agent that carries and transmits an infectious pathogen into another living organism Vector may also refer to: Mathematics a ...
in a direction closer, or farther away, from the original query, related documents, and non-related documents. In particular, the values for b and c should be incremented or decremented proportionally to the set of documents classified by the user. If the user decides that the modified query should not contain terms from either the original query, related documents, or non-related documents, then the corresponding weight (a, b, c) value for the category should be set to 0. In the later part of the algorithm, the variables D_, and D_ are presented to be sets of vectors containing the coordinates of related documents and non-related documents. In the formula, \vec_j and \vec_k are the vectors used to iterate through the two sets D_ and D_ and form vector
summation In mathematics, summation is the addition of a sequence of numbers, called ''addends'' or ''summands''; the result is their ''sum'' or ''total''. Beside numbers, other types of values can be summed as well: functions, vectors, matrices, pol ...
s. These sums are normalized, i.e. divided by the size of their respective document set. In order to visualize the changes taking place on the modified vector, please refer to the image below. As the weights are increased or decreased for a particular category of documents, the coordinates for the modified vector begin to move either closer, or farther away, from the
centroid In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the figure. The same definition extends to any object in n-d ...
of the document collection. Thus if the weight is increased for related documents, then the modified vectors
coordinate In geometry, a coordinate system is a system that uses one or more numbers, or coordinates, to uniquely determine and standardize the position of the points or other geometric elements on a manifold such as Euclidean space. The coordinates are ...
s will reflect being closer to the centroid of related documents.


Time complexity

The
time complexity In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations ...
for training and testing the algorithm are listed below and followed by the definition of each variable. Note that when in testing phase, the time complexity can be reduced to that of calculating the
euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and therefore is o ...
between a class
centroid In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the figure. The same definition extends to any object in n-d ...
and the respective document. As shown by: \Theta(\vert\mathbb\vert M_). Training = \Theta(\vert\mathbb\vert L_+\vert\mathbb\vert\vert V\vert)
Testing = \Theta( L_+\vert\mathbb\vert M_)= \Theta(\vert\mathbb\vert M_)


Usage

Though there are benefits to ranking documents as not-relevant, a relevant document ranking will result in more precise documents being made available to the user. Therefore, traditional values for the algorithm's weights (a, b, c) in Rocchio classification are typically around a = 1, b = 0.8, and c = 0.1. Modern
information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
systems have moved towards eliminating the non-related documents by setting c = 0 and thus only accounting for related documents. Although not all retrieval systems have eliminated the need for non-related documents, most have limited the effects on modified query by only accounting for strongest non-related documents in the D_ set.


Limitations

The Rocchio algorithm often fails to classify multimodal classes and relationships. For instance, the country of
Burma Myanmar, officially the Republic of the Union of Myanmar; and also referred to as Burma (the official English name until 1989), is a country in northwest Southeast Asia. It is the largest country by area in Mainland Southeast Asia and ha ...
was renamed to
Myanmar Myanmar, officially the Republic of the Union of Myanmar; and also referred to as Burma (the official English name until 1989), is a country in northwest Southeast Asia. It is the largest country by area in Mainland Southeast Asia and has ...
in 1989. Therefore, the two queries of "Burma" and "Myanmar" will appear much farther apart in the
vector space model Vector space model or term vector model is an algebraic model for representing text documents (or more generally, items) as vector space, vectors such that the distance between vectors represents the relevance between the documents. It is used in i ...
, though they both contain similar origins.


See also

* Nearest centroid classifier, aka Rocchio classifier


References

{{reflist
Relevance Feedback in Information Retrieval

Relevance Feedback and Query Expansion

Vector Space Classification


Search algorithms