The Ruzzo–Tompa algorithm or the RT algorithm
is a
linear-time
In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations p ...
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
for finding all non-overlapping, contiguous, maximal scoring subsequences in a sequence of real numbers.
The Ruzzo–Tompa algorithm was proposed by Walter L. Ruzzo and Martin Tompa. This algorithm is an improvement over previously known quadratic time algorithms.
The maximum scoring subsequence from the set produced by the algorithm is also a solution to the
maximum subarray problem
In computer science, the maximum sum subarray problem, also known as the maximum segment sum problem, is the task of finding a contiguous subarray with the largest sum, within a given one-dimensional array A ...nof numbers. It can be solved in O ...
.
The Ruzzo–Tompa algorithm has applications in
bioinformatics
Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...
,
web scraping
Web scraping, web harvesting, or web data extraction is data scraping used for data extraction, extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. W ...
,
and
information retrieval
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
.
Applications
Bioinformatics
The Ruzzo–Tompa algorithm has been used in
Bioinformatics
Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...
tools to study biological data. The problem of finding disjoint maximal subsequences is of practical importance in the analysis of
DNA
Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
. Maximal subsequences algorithms have been used in the identification of transmembrane segments and the evaluation of
sequence homology
Sequence homology is the homology (biology), biological homology between DNA sequence, DNA, RNA sequence, RNA, or Protein primary structure, protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments ...
.
The algorithm is used in
sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...
which is used as a method of identifying similar
DNA
Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
,
RNA
Ribonucleic acid (RNA) is a polymeric molecule that is essential for most biological functions, either by performing the function itself (non-coding RNA) or by forming a template for the production of proteins (messenger RNA). RNA and deoxyrib ...
, or
protein
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...
sequences. Accounting for the ordering of pairs of high-scoring subsequences in two sequences creates better sequence alignments. This is because the biological model suggests that separate high-scoring subsequence pairs arise from insertions or deletions within a matching region. Requiring consistent ordering of high-scoring subsequence pairs increases their statistical significance.
Web scraping
The Ruzzo–Tompa algorithm is used in
Web scraping
Web scraping, web harvesting, or web data extraction is data scraping used for data extraction, extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. W ...
to extract information from web pages. Pasternack and Roth proposed a method for extracting important blocks of text from HTML documents. The web pages are first
tokenized
Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful ''lexical tokens'' belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives ...
and the score for each token is found using local, token-level classifiers. A modified version of the Ruzzo–Tompa algorithm is then used to find the k highest-valued subsequences of tokens. These subsequences are then used as predictions of important blocks of text in the article.
Information retrieval
The Ruzzo–Tompa algorithm has been used in
Information retrieval
Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
search algorithms. Liang et al. proposed a
data fusion method to combine the search results of several microblog search algorithms. In their method, the Ruzzo–Tompa algorithm is used to detect
bursts of information.
Problem definition
The problem of finding all maximal subsequences is defined as follows: Given a list of real numbered scores
, find the list of contiguous subsequences that gives the greatest total score, where the score of each subsequence
. The subsequences must be disjoint (non-overlapping) and have a positive score.
Other algorithms
There are several approaches to solving the all maximal scoring subsequences problem. A natural approach is to use existing, linear time algorithms to find the maximum subsequence (see
maximum subarray problem
In computer science, the maximum sum subarray problem, also known as the maximum segment sum problem, is the task of finding a contiguous subarray with the largest sum, within a given one-dimensional array A ...nof numbers. It can be solved in O ...
) and then recursively find the maximal subsequences to the left and right of the maximum subsequence. The analysis of this algorithm is similar to that of
Quicksort
Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than ...
: The maximum subsequence could be small in comparison to the rest of sequence, leading to a running time of
in the worst case.
Algorithm
The standard implementation of the Ruzzo–Tompa algorithm runs in
time and uses ''O''(''n'') space, where ''n'' is the length of the list of scores. The algorithm uses
dynamic programming to progressively build the final solution by incrementally solving progressively larger subsets of the problem. The description of the algorithm provided by Ruzzo and Tompa is as follows:
: Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list
of disjoint subsequences. For each subsequence
, record the cumulative total
of all scores up to but not including the leftmost score of
, and the total
up to and including the rightmost score of
.
: The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores require no special processing, so the next score is read. A positive score is incorporated into a new sub-sequence
of length one that is then integrated into the list by the following process:
# The list
is searched from right to left for the maximum value of
satisfying