In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to the number of occurrences of the pattern in other subtrees) is over a given threshold. It is a more general form of the maximum agreement subtree problem.

Definition

Frequent subtree mining is the problem of finding all patterns whose "support" is over a user-specified level, where "support" is the number of trees in a database that have at least one subtree isomorphic to a given pattern.

Formal definition

The problem of frequent subtree mining has been formally defined as follows. Given a threshold ''minfreq'', a class of trees

\mathcal{C}

, a transitive subtree relation

P\preceq T

between trees

P, T \in\mathcal{C}

, and a finite set of trees

\mathcal{D}\subseteq\mathcal{C}

, the frequent subtree mining problem is the problem of finding all trees

\mathcal{P}\subset\mathcal{C}

such that no two trees in

\mathcal{P}

are isomorphic and

\forall P\in\mathcal{P} : \quad \mathrm{freq}(P, \mathcal{D}) = \sum\nolimits_{T\in\mathcal{D}} d(P,T)\geq \mathrm{minfreq},

where d is an anti-monotone function such that if

P' \preceq P

then

\forall T\in\mathcal{C} : \quad d(P',T)\geq d(P,T).

TreeMiner

In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem. TreeMiner uses a "scope list" to represent tree nodes and was contrasted with PatternMatcher, an algorithm based on pattern matching.

Definitions

Induced sub-trees

A sub-tree

S=(V_s,E_s)

is an induced sub-tree of

T=(V,E)

if and only if

V_s\subseteq V

and

E_s\subseteq E

. In other words, any two nodes that are directly connected by an edge in S are also directly connected in T. For any nodes A and B in S, if A is the parent of B in S, then A must also be the parent of B in T.

Embedded sub-trees

A sub-tree

S=(V_s,E_s)

is an embedded sub-tree of

T=(V,E)

if and only if

V_s\subseteq V

and the two endpoint nodes of any edge in S are on the same path from the root to a leaf node in T. In other words, for any nodes A and B in S, if A is the parent of B in S, then A must be an ancestor of B in T. Every induced sub-tree is also an embedded sub-tree, so the concept of embedded sub-trees generalizes that of induced sub-trees. As such, embedded sub-trees characterize hidden patterns in a tree that are missed by traditional induced sub-tree mining. A sub-tree of size k is often called a k-sub-tree.
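The two definitions can be compared with a minimal Python sketch. It assumes, for illustration only, that nodes have unique identifiers, that T is stored as a child-to-parent map, and that S is given as a list of (parent, child) edges over T's nodes; the function names are hypothetical, and sibling order is ignored.

```python
def ancestors(parent, node):
    """Yield every proper ancestor of `node` in a tree stored as a child -> parent map."""
    while node in parent:
        node = parent[node]
        yield node


def is_induced(sub_edges, parent):
    """Every (a, b) edge of S must be a parent-child edge of T."""
    return all(parent.get(b) == a for a, b in sub_edges)


def is_embedded(sub_edges, parent):
    """Every (a, b) edge of S must join an ancestor a to a descendant b in T."""
    return all(a in ancestors(parent, b) for a, b in sub_edges)


# T: A is the root, B and C are children of A, and D is a child of B.
T_PARENT = {"B": "A", "C": "A", "D": "B"}

induced_edges = [("A", "B"), ("B", "D")]        # uses only edges of T
embedded_only_edges = [("A", "D"), ("A", "C")]  # A-D skips the intermediate node B

print(is_induced(induced_edges, T_PARENT), is_embedded(induced_edges, T_PARENT))
print(is_induced(embedded_only_edges, T_PARENT), is_embedded(embedded_only_edges, T_PARENT))
```

The second edge list is embedded but not induced, because A is an ancestor of D in T without being its parent.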

Support

The support of a sub-tree is the number of trees in a database that contain the sub-tree. A sub-tree is frequent if its support is not less than a user-specified threshold (often denoted as ''minsup''). The goal of TreeMiner is to find all embedded sub-trees whose support is at least the minimum support.

String representation of trees

There are several different ways of encoding a tree structure. TreeMiner uses string representations of trees for efficient tree manipulation and support counting. Initially the string is set to

\varnothing

. Starting from the root of the tree, node labels are added to the string in depth-first search order. -1 is added to the string whenever the search process backtracks from a child to its parent. For example, a simple binary tree with root labelled A, a left child labelled B and right child labelled C can be represented by a string A B -1 C -1.
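The encoding described above can be sketched in a few lines of Python. The tree type used here, a `(label, children)` pair, and the function names are assumptions made for illustration; they are not taken from the TreeMiner paper.

```python
def encode_tokens(tree):
    """Return the depth-first token list of `tree`, a (label, children) pair."""
    label, children = tree
    tokens = [label]
    for child in children:
        tokens.extend(encode_tokens(child))
        tokens.append("-1")  # backtrack from the child to its parent
    return tokens


def encode(tree):
    """Return the string encoding, with tokens separated by spaces."""
    return " ".join(encode_tokens(tree))


binary_tree = ("A", [("B", []), ("C", [])])
chain = ("A", [("B", [("C", [])])])

print(encode(binary_tree))  # A B -1 C -1
print(encode(chain))        # A B C -1 -1
```

Note that no -1 is emitted after the root, since the search never backtracks above it.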

Prefix equivalence class

Two k-sub-trees are said to be in the same prefix equivalence class if their string representations are identical up to the (k-1)-th node. In other words, all elements in a prefix equivalence class differ only by the last node. For example, two trees with string representations A B -1 C -1 and A B -1 D -1 are in the prefix equivalence class A B with elements (C,0) and (D,0). An element of a prefix class is specified by the node label paired with the 0-based depth-first index of the node it is attached to. In this example, both elements of prefix class A B are attached to the root, which has an index of 0.
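As a rough illustration, the following Python sketch groups string encodings into prefix equivalence classes. It assumes space-separated tokens with -1 as the backtrack marker, and it names each class by the prefix with trailing -1 tokens dropped, as in the example above; the function names are hypothetical.

```python
from collections import defaultdict


def split_last_node(encoding):
    """Split an encoding into (prefix, (label, attach_index)) for its last node."""
    tokens = encoding.split()
    stack = []   # depth-first indices of the nodes on the current root-to-node path
    index = -1   # depth-first index of the most recently visited node
    last_pos = last_parent = None
    for pos, tok in enumerate(tokens):
        if tok == "-1":
            stack.pop()
        else:
            index += 1
            last_pos = pos
            last_parent = stack[-1] if stack else None
            stack.append(index)
    prefix = tokens[:last_pos]
    while prefix and prefix[-1] == "-1":
        prefix.pop()  # trailing backtracks are not part of the class name
    return " ".join(prefix), (tokens[last_pos], last_parent)


def group_by_prefix(encodings):
    """Map each prefix to the list of (label, attach_index) elements in its class."""
    classes = defaultdict(list)
    for enc in encodings:
        prefix, element = split_last_node(enc)
        classes[prefix].append(element)
    return dict(classes)


print(group_by_prefix(["A B -1 C -1", "A B -1 D -1", "A B C -1 -1"]))
```

The third encoding, in which C is a child of B, falls into the same class A B but yields the element (C,1), because its last node is attached to the node with index 1.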

Scope

The scope of a node A is given by a pair of numbers

[l,r]

where l and r are the minimum and maximum node indices in the sub-tree rooted at A. In other words, l is the index of A, and r is the index of the rightmost leaf among the descendants of A. As such, the index of any descendant of A must lie in the scope of A, which is a very useful property when counting the support of sub-trees.
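A minimal Python sketch of computing scopes follows, assuming trees are stored as `(label, children)` pairs and nodes are numbered from 0 in depth-first order; the function name and tree type are illustrative choices, not part of the original algorithm description.

```python
def scopes(tree):
    """Return the (l, r) scope of every node, listed in depth-first order."""
    result = []

    def visit(node):
        _label, children = node
        position = len(result)  # depth-first index l of this node
        result.append(None)     # placeholder until the whole sub-tree is visited
        for child in children:
            visit(child)
        # r is the index of the last node visited inside this sub-tree
        result[position] = (position, len(result) - 1)

    visit(tree)
    return result


# A is the root, B and C are children of A, and D is a child of B.
example = ("A", [("B", [("D", [])]), ("C", [])])
print(scopes(example))  # [(0, 3), (1, 2), (2, 2), (3, 3)]
```

Here D has index 2, which lies inside the scope (1, 2) of its ancestor B, while C has index 3 and lies outside it.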

Algorithm

Candidate generation

Frequent sub-tree patterns follow the anti-monotone property. In other words, the support of a k-sub-tree is less than or equal to the support of its (k-1)-sub-trees. Only super-patterns of known frequent patterns can possibly be frequent. By utilizing this property, k-sub-tree candidates can be generated from frequent (k-1)-sub-trees through prefix class extension. Let C be a prefix equivalence class with two elements (x,i) and (y,j). Let C' be the class representing the extension of element (x,i). The elements of C' are added by performing a ''join'' operation on the two (k-1)-sub-trees in C. The ''join'' operation on (x,i) and (y,j) is defined as follows.

* If i>j, then add (y,j) to C'.
* If i=j, then add (y,j) and (y,n_i) to C', where n_i is the depth-first index of x in C.
* If i<j, no possible element can be added to C'.

This operation is repeated for any two ordered, but not necessarily distinct elements in C to construct the extended prefix classes of k-sub-trees.
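The three join cases can be written down directly. The Python sketch below is illustrative only: the function names are hypothetical, elements are `(label, attach_index)` pairs, and n_i is passed in by the caller. The usage example assumes that x is the last node of its sub-tree in depth-first order, so that n_i equals the number of nodes in the prefix.

```python
def join(x_elem, y_elem, n_i):
    """Return the elements that joining (x, i) with (y, j) adds to the class C'."""
    (_x, i), (y, j) = x_elem, y_elem
    if i > j:
        return [(y, j)]
    if i == j:
        return [(y, j), (y, n_i)]
    return []  # i < j: no candidate is possible


def extend_class(elements, x_elem, n_i):
    """Build C', the extension of x_elem, by joining it with every element of C."""
    extended = []
    for y_elem in elements:  # includes x_elem itself
        extended.extend(join(x_elem, y_elem, n_i))
    return extended


# Prefix class A B with elements (C, 0) and (D, 0); the prefix has 2 nodes,
# so C would receive depth-first index 2 in the extended prefix (an assumption).
elements = [("C", 0), ("D", 0)]
print(extend_class(elements, ("C", 0), n_i=2))
```

The result contains candidates only; each candidate's support still has to be counted before it is kept as a frequent sub-tree.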

Scope-list representation

TreeMiner performs depth-first candidate generation using a scope-list representation of sub-trees to facilitate faster support counting. A k-sub-tree S can be represented by a triplet (t,m,s), where t is the id of the tree the sub-tree comes from, m is the prefix match label, and s is the scope of the last node in S. Depending on how S occurs in different trees across the database, S can have different scope-list representations. TreeMiner defines a ''scope-list join'' that performs class extension on the scope-list representation of sub-trees. Two elements (x,i) and (y,j) can be joined if there exist two sub-trees

(t_x,m_x,s_x)

and

(t_y,m_y,s_y)

that satisfy either of the following conditions.

* In-scope test: t_x=t_y, m_x=m_y, s_y\subset s_x, which corresponds to the case when i=j.
* Out-scope test: t_x=t_y, m_x=m_y, s_y>s_x, which corresponds to the case when i>j.

By keeping track of the distinct tree ids used in the scope-list tests, the support of sub-trees can be calculated efficiently.
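The two tests and the support count can be sketched in Python as follows. The sketch rests on assumptions that are not spelled out above: scopes are (l, r) pairs, s_y \subset s_x is read as strict interval containment, s_y > s_x is read as "s_y starts after s_x ends", and the match label is a tuple of node indices. The names `Occurrence`, `in_scope`, `out_scope`, and `support` are hypothetical.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    tree_id: int
    match_label: tuple  # indices of the tree nodes matched by the prefix (assumed form)
    scope: tuple        # (l, r) scope of the last node of the sub-tree


def same_context(x, y):
    """t_x = t_y and m_x = m_y: same tree and same occurrence of the prefix."""
    return x.tree_id == y.tree_id and x.match_label == y.match_label


def in_scope(x, y):
    """s_y is strictly contained in s_x, so y lies below x."""
    (lx, rx), (ly, ry) = x.scope, y.scope
    return same_context(x, y) and lx <= ly and ry <= rx and (lx, rx) != (ly, ry)


def out_scope(x, y):
    """s_y starts after s_x ends, so y lies to the right of x."""
    return same_context(x, y) and x.scope[1] < y.scope[0]


def support(scope_list_x, scope_list_y, test):
    """Count the distinct tree ids in which some pair of occurrences passes `test`."""
    return len({x.tree_id for x in scope_list_x for y in scope_list_y if test(x, y)})


# Tree 0: A(B(D), C); tree 1: A(B, C). The prefix is the root A (index 0) in both.
b_list = [Occurrence(0, (0,), (1, 2)), Occurrence(1, (0,), (1, 1))]
c_list = [Occurrence(0, (0,), (3, 3)), Occurrence(1, (0,), (2, 2))]
d_list = [Occurrence(0, (0,), (2, 2))]

print(support(b_list, c_list, out_scope))  # trees where C lies to the right of B
print(support(b_list, d_list, in_scope))   # trees where D lies below B
```

Collecting tree ids in a set is what makes each tree count at most once, however many occurrences of the pattern it contains.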

Applications

Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining. Another such domain is web usage mining: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining. Other domains in which frequent subtree mining is useful include computational biology (Deepak, Akshay, David Fernández-Baca, Srikanta Tirthapura, Michael J. Sanderson, and Michelle M. McMahon. "EvoMiner: frequent subtree mining in phylogenetic databases." ''Knowledge and Information Systems'' (2011): 1-32; Chi, Yun, Yirong Yang, and Richard R. Muntz. "Canonical forms for labelled trees and their applications in frequent subtree mining." ''Knowledge and Information Systems'' 8, no. 2 (2005): 203–234), RNA structure analysis, pattern recognition, bioinformatics, and analysis of the KEGG GLYCAN database.

Challenges

Checking whether a pattern (or a transaction) supports a given subgraph is an NP-complete problem, since it is an NP-complete instance of the subgraph isomorphism problem. Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database".
