decision tree learning Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of obser ...

, ID3 (Iterative Dichotomiser 3) is an

algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...

invented by

Ross Quinlan John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical C4.5 and ID3 algorithms. He also contributed to ...

Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81–106 used to generate a

decision tree A decision tree is a decision support system, decision support recursive partitioning structure that uses a Tree (graph theory), tree-like Causal model, model of decisions and their possible consequences, including probability, chance event ou ...

from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the

machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of Computational statistics, statistical algorithms that can learn from data and generalise to unseen data, and thus perform Task ( ...

and

natural language processing Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related ...

domains.

Algorithm

The ID3 algorithm begins with the original set

S

as the

root node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be co ...

. On each

iteration Iteration is the repetition of a process in order to generate a (possibly unbounded) sequence of outcomes. Each repetition of the process is a single iteration, and the outcome of each iteration is then the starting point of the next iteration. ...

of the algorithm, it iterates through every unused attribute of the set

S

and calculates the

entropy Entropy is a scientific concept, most commonly associated with states of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the micros ...

\Eta

or the

information gain Information is an abstract concept that refers to something which has the power to inform. At the most fundamental level, it pertains to the interpretation (perhaps formally) of that which may be sensed, or their abstractions. Any natur ...

IG(S)

of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set

S

is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into

child node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be con ...

s based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: * every element in the subset belongs to the same class; in which case the node is turned into a

leaf node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children (depending on the type of tree), but must be conn ...

and labelled with the class of the examples. * there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. * there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the

population Population is a set of humans or other organisms in a given region or area. Governments conduct a census to quantify the resident population size within a given jurisdiction. The term is also applied to non-human animals, microorganisms, and pl ...

with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (

internal node In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected Node (computer science), nodes. Each node in the tree can be connected to many children (depending on the type of ...

) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch.

Summary

# Calculate the

of every attribute

a

of the data set

S

. # Partition ("split") the set

S

into

subset In mathematics, a Set (mathematics), set ''A'' is a subset of a set ''B'' if all Element (mathematics), elements of ''A'' are also elements of ''B''; ''B'' is then a superset of ''A''. It is possible for ''A'' and ''B'' to be equal; if they a ...

s using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is

maximum In mathematical analysis, the maximum and minimum of a function (mathematics), function are, respectively, the greatest and least value taken by the function. Known generically as extremum, they may be defined either within a given Interval (ma ...

. # Make a decision tree

node In general, a node is a localized swelling (a "knot") or a point of intersection (a vertex). Node may refer to: In mathematics * Vertex (graph theory), a vertex in a mathematical graph *Vertex (geometry), a point where two or more curves, lines ...

containing that attribute. # Recurse on subsets using the remaining attributes.

Properties

ID3 does not guarantee an optimal solution. It can converge upon

local optima In mathematical analysis, the maximum and minimum of a function are, respectively, the greatest and least value taken by the function. Known generically as extremum, they may be defined either within a given range (the ''local'' or ''relative' ...

. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using

backtracking Backtracking is a class of algorithms for finding solutions to some computational problems, notably constraint satisfaction problems, that incrementally builds candidates to the solutions, and abandons a candidate ("backtracks") as soon as it de ...

during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can

overfit mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...

the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are

continuous Continuity or continuous may refer to: Mathematics * Continuity (mathematics), the opposing concept to discreteness; common examples include ** Continuous probability distribution or random variable in probability and statistics ** Continuous ...

, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming.

Usage

The ID3 algorithm is used by training on a

data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more table (database), database tables, where every column (database), column of a table represents a particular Variable (computer sci ...

S

to produce a

which is stored in

memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembe ...

. At runtime, this decision tree is used to

classify Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves (for example through cluster analysis). Examples include diagnostic tests, identif ...

new test cases (

feature vector In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a data set. Choosing informative, discriminating, and independent features is crucial to produce effective algorithms for pattern re ...

s) by traversing the decision tree using the features of the datum to arrive at a leaf node.

The ID3 metrics

Entropy

Entropy Entropy is a scientific concept, most commonly associated with states of disorder, randomness, or uncertainty. The term and the concept are used in diverse fields, from classical thermodynamics, where it was first recognized, to the micros ...

\Eta

is a measure of the amount of uncertainty in the (data) set

S

(i.e. entropy characterizes the (data) set

S

). :

\Eta = \sum_

Where, *

S

– The current dataset for which entropy is being calculated **This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. *

X

– The set of classes in

S

p(x)

– The

proportion Proportionality, proportion or proportional may refer to: Mathematics * Proportionality (mathematics), the property of two variables being in a multiplicative relation to a constant * Ratio, of one quantity to another, especially of a part compare ...

of the

number of elements In mathematics, the cardinality of a finite set is the number of its elements, and is therefore a measure of size of the set. Since the discovery by Georg Cantor, in the late 19th century, of different sizes of infinite sets, the term ''cardinal ...

in class

x

to the number of elements in set

S

When

\Eta = 0

, the set

S

is perfectly classified (i.e. all elements in

S

are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set

S

on this iteration. Entropy in

information theory Information theory is the mathematical study of the quantification (science), quantification, Data storage, storage, and telecommunications, communication of information. The field was established and formalized by Claude Shannon in the 1940s, ...

measures how much information is expected to be gained upon

measuring Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events. In other words, measurement is a process of determining how large or small a physical quantity is as compared to ...

random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a Mathematics, mathematical formalization of a quantity or object which depends on randomness, random events. The term 'random variable' in its mathema ...

; as such, it can also be used to quantify the amount to which the

distribution Distribution may refer to: Mathematics *Distribution (mathematics), generalized functions used to formulate solutions of partial differential equations *Probability distribution, the probability of a particular value or value range of a varia ...

of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable ( discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy

heuristic A heuristic or heuristic technique (''problem solving'', '' mental shortcut'', ''rule of thumb'') is any approach to problem solving that employs a pragmatic method that is not fully optimized, perfected, or rationalized, but is nevertheless ...

performing a

best-first search Best-first search is a class of search algorithms which explores a graph by expanding the most promising node chosen according to a specified rule. Judea Pearl described best-first search as estimating the promise of node ''n'' by a "heuristic eva ...

for locally optimal entropy values. Its accuracy can be improved by preprocessing the data.

Information gain

Information gain Information is an abstract concept that refers to something which has the power to inform. At the most fundamental level, it pertains to the interpretation (perhaps formally) of that which may be sensed, or their abstractions. Any natur ...

IG(A)

is the measure of the difference in entropy from before to after the set

S

is split on an attribute

A

. In other words, how much uncertainty in

S

was reduced after splitting set

S

on attribute

A

. :

IG(S, A) = \Eta - \sum_ p(t)\Eta
    = \Eta - \Eta.

Where, *

\Eta(S)

– Entropy of set

S

T

– The subsets created from splitting set

S

by attribute

A

such that

S = \bigcup_ t

p(t)

– The proportion of the number of elements in

t

to the number of elements in set

S

\Eta(t)

– Entropy of subset

t

In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set

S

on this iteration.

References

External links

* Seminars �
http://www2.cs.uregina.ca/
* Description and examples �

* Description and examples �
Decision Trees and Political Party Classification
Decision trees Classification algorithms Articles with example pseudocode