HOME

TheInfoList



OR:

In computer science, a generalized suffix tree is a
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
for a set of
strings String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...
. Given the set of strings D=S_1,S_2,\dots,S_d of total length n, it is a
Patricia tree In computer science, a radix tree (also radix trie or compact prefix tree or compressed trie) is a data structure that represents a space-optimized trie (prefix tree) in which each node that is the only child is merged with its parent. The res ...
containing all n suffixes of the strings. It is mostly used in
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
.


Functionality

It can be built in \Theta(n) time and space, and can be used to find all z occurrences of a string P of length m in O(m + z) time, which is asymptotically optimal (assuming the size of the alphabet is constant). When constructing such a tree, each string should be padded with a unique out-of-alphabet marker symbol (or string) to ensure no suffix is a substring of another, guaranteeing each suffix is represented by a unique leaf node. Algorithms for constructing a GST include Ukkonen's algorithm (1995) and McCreight's algorithm (1976).


Example

A suffix tree for the strings ABAB and BABA is shown in a figure above. They are padded with the unique terminator strings $0 and $1. The numbers in the leaf nodes are string number and starting position. Notice how a left to right traversal of the leaf nodes corresponds to the sorted order of the suffixes. The terminators might be strings or unique single symbols. Edges on $ from the root are left out in this example.


Alternatives

An alternative to building a generalized suffix tree is to concatenate the strings, and build a regular
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
or
suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by as ...
for the resulting string. When hits are evaluated after a search, global positions are mapped into documents and local positions with some algorithm and/or data structure, such as a binary search in the starting/ending positions of the documents.


References


External links

*
A C implementation of Generalized Suffix Tree for two strings
{{Strings Trees (data structures) Substring indices String data structures Computer science suffixes