In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to its number of occurrences in other subtrees) is over a given threshold.
It is a more general form of the maximum agreement subtree problem.
Definition
Frequent subtree mining is the problem of trying to find all of the patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree
isomorphic to a given pattern.
Formal definition
The problem of frequent subtree mining has been formally defined as:
:Given a threshold ''minfreq'', a class of trees
, a transitive subtree relation
between trees
, a finite set of trees
, the frequent subtree mining problem is the problem of finding all trees
such that no two trees in
are isomorphic and
::
:where is an anti-monotone function such that if
then
::
TreeMiner
In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching.
Definitions
Induced sub-trees
A sub-tree
is an induced sub-tree of
if and only if
and
. In other words, any two nodes in S that are directly connected by an edge is also directly connected in T. For any node A and B in S, if node A is the parent of node B in S, then node A must also be the parent of node B in T.
Embedded sub-trees
A sub-tree
is an embedded sub-tree of
if and only if
and two endpoint nodes of any edge in S are on the same path from the root to a leaf node in T. In other words, for any node A and B in S, if node A is the parent of node B in S, then node A must be an ancestor of node B in T. Any induced sub-trees are also embedded sub-trees, and thus the concept of embedded sub-trees is a generalization of induced sub-trees. As such embedded sub-trees characterizes the hidden patterns in a tree that are missing in traditional induced sub-tree mining. A sub-tree of size k is often called a k-sub-tree.
Support
The support of a sub-tree is the number of trees in a database that contains the sub-tree. A sub-tree is frequent if its support is not less than a user-specified threshold (often denoted as ''minsup).'' The goal of TreeMiner is to find all embedded sub-trees that have support at least the minimum support.
String representation of trees
There are several different ways of encoding a tree structure. TreeMiner uses string representations of trees for efficient tree manipulation and support counting. Initially the string is set to
. Starting from the root of the tree, node labels are added to the string in depth-first search order. -1 is added to the string whenever the search process backtracks from a child to its parent. For example, a simple binary tree with root labelled A, a left child labelled B and right child labelled C can be represented by a string A B -1 C -1.
Prefix equivalence class
Two k-sub-trees are said to be in the same prefix equivalence class if the string representation of them are identical up to the (k-1)-th node. In other words, all elements in a prefix equivalence class only differ by the last node. For example, two trees with string representation A B -1 C -1 and A B -1 D -1 are in the prefix equivalence class A B with elements (C, 0) and (D,0). An element of a prefix class is specified by the node label paired with the 0-based depth first index of the node it is attached to. In this example, both elements of prefix class A B are attached to the root, which has an index of 0.
Scope
The scope of a node A is given by a pair of numbers