Sequential pattern mining is a topic of
data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus
time series mining is closely related, but usually considered a different activity. Sequential pattern mining is a special case of
structured data mining.
There are several key traditional computational problems addressed within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members. In general, sequence mining problems can be classified as ''string mining'' which is typically based on
string processing algorithms and ''itemset mining'' which is typically based on
association rule learning. ''Local process models'' extend sequential pattern mining to more complex patterns that can include (exclusive) choices, loops, and concurrency constructs in addition to the sequential ordering construct.
String mining
String mining typically deals with a limited
alphabet for items that appear in a
sequence, but the sequence itself may be typically very long. Examples of an alphabet can be those in the
ASCII character set used in natural language text,
nucleotide bases 'A', 'G', 'C' and 'T' in
DNA sequences, or
amino acids
Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although hundreds of amino acids exist in nature, by far the most important are the alpha-amino acids, which comprise proteins. Only 22 alpha am ...
for
protein sequences. In
biology applications analysis of the arrangement of the alphabet in strings can be used to examine
gene and
protein sequences to determine their properties. Knowing the sequence of letters of a
DNA or a
protein is not an ultimate goal in itself. Rather, the major task is to understand the sequence, in terms of its structure and
biological function. This is typically achieved first by identifying individual regions or structural units within each sequence and then assigning a function to each structural unit. In many cases this requires comparing a given sequence with previously studied ones. The comparison between the strings becomes complicated when
insertions,
deletions and
mutations
In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA or viral replication, mi ...
occur in a string.
A survey and taxonomy of the key algorithms for sequence comparison for bioinformatics is presented by Abouelhoda & Ghanem (2010), which include:
* Repeat-related problems: that deal with operations on single sequences and can be based on
exact string matching or
approximate string matching
In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching i ...
methods for finding dispersed fixed length and maximal length repeats, finding tandem repeats, and finding unique subsequences and missing (un-spelled) subsequences.
* Alignment problems: that deal with comparison between strings by first aligning one or more sequences; examples of popular methods include
BLAST for comparing a single sequence with multiple sequences in a database, and
ClustalW
Clustal is a series of widely used computer programs used in bioinformatics for multiple sequence alignment. There have been many versions of Clustal over the development of the algorithm that are listed below. The analysis of each tool and its ...
for multiple alignments. Alignment algorithms can be based on either exact or approximate methods, and can also be classified as global alignments, semi-global alignments and local alignment. See
sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Alig ...
.
Itemset mining
Some problems in sequence mining lend themselves to discovering frequent itemsets and the order they appear, for example, one is seeking rules of the form "if a , he or she is likely to within 1 week", or in the context of stock prices, "if , it is likely that within 2 days". Traditionally, itemset mining is used in marketing applications for discovering regularities between frequently co-occurring items in large transactions. For example, by analysing transactions of customer shopping baskets in a supermarket, one can produce a rule which reads "if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat in the same transaction".
A survey and taxonomy of the key algorithms for item set mining is presented by Han et al. (2007).
The two common techniques that are applied to sequence databases for
frequent itemset mining are the influential
apriori algorithm
AprioriRakesh Agrawal and Ramakrishnan SrikanFast algorithms for mining association rules Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487-499, Santiago, Chile, September 1994. is an algorithm for frequent ...
and the more-recent
FP-growth
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.Pi ...
technique.
Applications
With a great variation of products and user buying behaviors, shelf on which products are being displayed is one of the most important resources in retail environment. Retailers can not only increase their profit but, also decrease cost by proper management of shelf space allocation and products display. To solve this problem, George and Binu (2013) have proposed an approach to mine user
buying patterns using PrefixSpan algorithm and place the products on shelves based on the order of mined purchasing patterns.
Algorithms
Commonly used algorithms include:
*
GSP algorithm
* Sequential Pattern Discovery using Equivalence classes (SPADE)
* FreeSpan
* PrefixSpan
* MAPres
* Seq2Pat (for constraint-based sequential pattern mining)
See also
*
*
*
*
*
*
References
External links
SPMFincludes open-source implementations of GSP, PrefixSpan, SPADE, SPAM and many others.
{{DEFAULTSORT:Sequential Pattern Mining
Data mining
Bioinformatics
Bioinformatics algorithms