In
computer science, a generalized suffix tree is a
suffix tree
In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
for a set of
strings
String or strings may refer to:
*String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects
Arts, entertainment, and media Films
* ''Strings'' (1991 film), a Canadian anim ...
. Given the set of strings
of total length
, it is a
Patricia tree
In computer science, a radix tree (also radix trie or compact prefix tree or compressed trie) is a data structure that represents a space-optimized trie (prefix tree) in which each node that is the only child is merged with its parent. The res ...
containing all
suffixes of the strings. It is mostly used in
bioinformatics
Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
.
Functionality
It can be built in
time and space, and can be used to find all
occurrences of a string
of length
in
time, which is
asymptotically optimal (assuming the size of the alphabet is constant
).
When constructing such a tree, each string should be padded with a unique out-of-alphabet marker symbol (or string) to ensure no suffix is a substring of another, guaranteeing each suffix is represented by a unique leaf node.
Algorithms for constructing a GST include
Ukkonen's algorithm (1995) and
McCreight's algorithm (1976).
Example
A suffix tree for the strings
ABAB
and
BABA
is shown in a figure above. They are padded with the unique terminator strings
$0
and
$1
. The numbers in the leaf nodes are string number and starting position. Notice how a left to right traversal of the leaf nodes corresponds to the sorted order of the suffixes. The terminators might be strings or unique single symbols. Edges on
$
from the root are left out in this example.
Alternatives
An alternative to building a generalized suffix tree is to concatenate the strings, and build a regular
suffix tree
In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
or
suffix array
In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.
Suffix arrays were introduced by as ...
for the resulting string. When hits are evaluated after a search, global positions are mapped into documents and local positions with some algorithm and/or data structure, such as a binary search in the starting/ending positions of the documents.
References
External links
*
A C implementation of Generalized Suffix Tree for two strings
{{Strings
Trees (data structures)
Substring indices
String data structures
Computer science suffixes