The Ruzzo–Tompa algorithm or the RT algorithm is a

linear-time In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations p ...

algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...

for finding all non-overlapping, contiguous, maximal scoring subsequences in a sequence of real numbers. The Ruzzo–Tompa algorithm was proposed by Walter L. Ruzzo and Martin Tompa. This algorithm is an improvement over previously known quadratic time algorithms. The maximum scoring subsequence from the set produced by the algorithm is also a solution to the

maximum subarray problem In computer science, the maximum sum subarray problem, also known as the maximum segment sum problem, is the task of finding a contiguous subarray with the largest sum, within a given one-dimensional array A ...nof numbers. It can be solved in O ...

. The Ruzzo–Tompa algorithm has applications in

bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

web scraping Web scraping, web harvesting, or web data extraction is data scraping used for data extraction, extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. W ...

, and

information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

Applications

Bioinformatics

The Ruzzo–Tompa algorithm has been used in

Bioinformatics Bioinformatics () is an interdisciplinary field of science that develops methods and Bioinformatics software, software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, ...

tools to study biological data. The problem of finding disjoint maximal subsequences is of practical importance in the analysis of

DNA Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...

. Maximal subsequences algorithms have been used in the identification of transmembrane segments and the evaluation of

sequence homology Sequence homology is the homology (biology), biological homology between DNA sequence, DNA, RNA sequence, RNA, or Protein primary structure, protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments ...

. The algorithm is used in

sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural biology, structural, or evolutionary relationships between ...

which is used as a method of identifying similar

RNA Ribonucleic acid (RNA) is a polymeric molecule that is essential for most biological functions, either by performing the function itself (non-coding RNA) or by forming a template for the production of proteins (messenger RNA). RNA and deoxyrib ...

, or

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...

sequences. Accounting for the ordering of pairs of high-scoring subsequences in two sequences creates better sequence alignments. This is because the biological model suggests that separate high-scoring subsequence pairs arise from insertions or deletions within a matching region. Requiring consistent ordering of high-scoring subsequence pairs increases their statistical significance.

Web scraping

The Ruzzo–Tompa algorithm is used in

Web scraping Web scraping, web harvesting, or web data extraction is data scraping used for data extraction, extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. W ...

to extract information from web pages. Pasternack and Roth proposed a method for extracting important blocks of text from HTML documents. The web pages are first

tokenized Lexical tokenization is conversion of a text into (semantically or syntactically) meaningful ''lexical tokens'' belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives ...

and the score for each token is found using local, token-level classifiers. A modified version of the Ruzzo–Tompa algorithm is then used to find the k highest-valued subsequences of tokens. These subsequences are then used as predictions of important blocks of text in the article.

Information retrieval

The Ruzzo–Tompa algorithm has been used in

Information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...

search algorithms. Liang et al. proposed a data fusion method to combine the search results of several microblog search algorithms. In their method, the Ruzzo–Tompa algorithm is used to detect bursts of information.

Problem definition

The problem of finding all maximal subsequences is defined as follows: Given a list of real numbered scores

x_1,x_2,\ldots,x_n

, find the list of contiguous subsequences that gives the greatest total score, where the score of each subsequence

S_ = \sum_ x_k

. The subsequences must be disjoint (non-overlapping) and have a positive score.

Other algorithms

There are several approaches to solving the all maximal scoring subsequences problem. A natural approach is to use existing, linear time algorithms to find the maximum subsequence (see

) and then recursively find the maximal subsequences to the left and right of the maximum subsequence. The analysis of this algorithm is similar to that of

Quicksort Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961. It is still a commonly used algorithm for sorting. Overall, it is slightly faster than ...

: The maximum subsequence could be small in comparison to the rest of sequence, leading to a running time of

O(n^2)

in the worst case.

Algorithm

The standard implementation of the Ruzzo–Tompa algorithm runs in

O(n)

time and uses ''O''(''n'') space, where ''n'' is the length of the list of scores. The algorithm uses dynamic programming to progressively build the final solution by incrementally solving progressively larger subsets of the problem. The description of the algorithm provided by Ruzzo and Tompa is as follows: : Read the scores left to right and maintain the cumulative sum of the scores read. Maintain an ordered list

I_1,I_2,\ldots,I_j

of disjoint subsequences. For each subsequence

I_j

, record the cumulative total

L_j

of all scores up to but not including the leftmost score of

I_j

, and the total

R_j

up to and including the rightmost score of

I_j

. : The lists are initially empty. Scores are read from left to right and are processed as follows. Nonpositive scores require no special processing, so the next score is read. A positive score is incorporated into a new sub-sequence

I_k

of length one that is then integrated into the list by the following process: # The list

I

is searched from right to left for the maximum value of

j

satisfying

L_j # If there is no such j, then add I_k to the end of the list.
# If there is such a j, and R_j \geq R_k, then add I_k to the end of the list.
# Otherwise (i.e., there is such a j, but R_j < R_k), extend the subsequence I_k to the left to encompass everything up to and including the leftmost score in I_j . Delete subsequences I_j,I_j+1,\ldots,I_k-1 from the list, and append I_k to the end of the list. Reconsider the newly extended subsequence I_k (now renumbered I_j) as in step 1.

:Once the end of the input is reached, all subsequences remaining on the list I are maximal. The following

Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...

code implements the Ruzzo–Tompa algorithm: def ruzzo_tompa(scores): """Ruzzo–Tompa algorithm.""" k = 0 total = 0 # Allocating arrays of size n I, L, R, Lidx = 0* len(scores) for _ in range(4)] for i, s in enumerate(scores): total += s if s > 0: # store I by (start,end) indices of scores I = (i, i + 1) Lidx = i L = total - s R = total while True: maxj = None for j in range(k - 1, -1, -1): if L < L maxj = j break if maxj is not None and R axj< R I axj= (Lidx axj i + 1) R axj= total k = maxj else: k += 1 break # Getting maximal subsequences using stored indices return cores[I[l0">[l.html" ;"title="cores[I[l">cores[I[l0: I[l">[l">cores[I[l<_a>0.html" ;"title="[l.html" ;"title="cores[I[l">cores[I[l0">[l.html" ;"title="cores[I[l">cores[I[l0: I[l1 for l in range(k)]