Overcategorization

Overcategorization, overcategorisation or category clutter is the process of assigning too many categories, classes or index terms to a given document. It is related to the library and information science (LIS) concepts of document classification and subject indexing. In LIS, the ideal number of terms that should be assigned to classify an item is assessed by the variables precision and recall. Assigning only the few category labels that are most closely related to the content of the item being classified will result in searches that have high precision, i.e., where a high proportion of the results are closely related to the query. Assigning more category labels to each item will reduce the precision of each search, but increase the recall, retrieving more relevant results. Related LIS concepts include exhaustivity of indexing and information overload.
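For reference, the two measures can be written out using their standard information-retrieval definitions. The symbols are notation chosen here for illustration, not taken from the article: $R$ is the set of documents relevant to a query and $A$ is the set of documents the search returns.

```latex
\[
\text{precision} = \frac{|R \cap A|}{|A|},
\qquad
\text{recall} = \frac{|R \cap A|}{|R|}
\]
% Illustrative numbers only: a search returns 10 documents, 8 of them
% relevant, while the collection holds 20 relevant documents in total.
\[
\text{precision} = \frac{8}{10} = 0.8,
\qquad
\text{recall} = \frac{8}{20} = 0.4
\]
```

Adding category labels tends to enlarge $A$: recall can only stay the same or rise, while precision falls whenever the newly retrieved documents are mostly non-relevant.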


Basic principles

If too many categories are assigned to a given document, the implications for users depend on how informative the links are. If the user is able to distinguish between useful and not useful links, the damage is limited: the user only wastes time selecting links. In many cases, however, the user cannot judge whether or not a given link will turn out to be fruitful. In that case he or she has to follow the link and read or skim another document. The worst-case scenario is that, even after reading the new document, the user is unable to decide whether it might be useful without investigating its subject matter thoroughly. Overcategorization also has another unpleasant implication: it makes the system (for example, in Wikipedia) difficult to maintain in a consistent way. If the system is inconsistent, a user who considers the links in a given category will not find all documents relevant to that category. Basically, the problem of overcategorization should be understood from the perspective of relevance and the traditional measures of recall and precision. If too few ''relevant'' categories are assigned to a document, recall may decrease. If too many non-relevant categories are assigned, precision becomes lower. The hard job is to say which categories are fruitful or relevant for future use of the document.
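The trade-off described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the document names, the category labels, the ground-truth set `RELEVANT` and the three labelling policies are invented for illustration and do not describe any real indexing system.

```python
from typing import Dict, Set, Tuple


def search(index: Dict[str, Set[str]], category: str) -> Set[str]:
    """Return the documents that carry the given category label."""
    return {doc for doc, labels in index.items() if category in labels}


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one category search."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: the documents truly relevant to "indexing".
RELEVANT = {"doc1", "doc2", "doc3"}

# Too few relevant categories: only one relevant document is labelled.
SPARSE = {
    "doc1": {"indexing"},
    "doc2": {"libraries"},
    "doc3": {"retrieval"},
    "doc4": {"cooking"},
    "doc5": {"travel"},
    "doc6": {"music"},
}

# Every relevant document carries the label, and no other document does.
BALANCED = {
    "doc1": {"indexing"},
    "doc2": {"indexing", "libraries"},
    "doc3": {"indexing", "retrieval"},
    "doc4": {"cooking"},
    "doc5": {"travel"},
    "doc6": {"music"},
}

# Overcategorized: the label is also attached to non-relevant documents.
CLUTTERED = {
    "doc1": {"indexing", "libraries", "retrieval"},
    "doc2": {"indexing", "libraries", "retrieval"},
    "doc3": {"indexing", "libraries", "retrieval"},
    "doc4": {"cooking", "indexing"},
    "doc5": {"travel", "indexing"},
    "doc6": {"music", "indexing"},
}

if __name__ == "__main__":
    policies = [("sparse", SPARSE), ("balanced", BALANCED), ("cluttered", CLUTTERED)]
    for name, index in policies:
        p, r = precision_recall(search(index, "indexing"), RELEVANT)
        print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy collection the sparse policy keeps precision perfect but finds only one of the three relevant documents, while the cluttered policy finds all three but buries them among an equal number of non-relevant ones.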


See also

* Exhaustivity
* Information overload
* Information pollution
* Relevance
* Subject (documents)
* Subject indexing
* Overfitting

