Suffix array
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a suffix array is a sorted
array An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...
of all
suffixes In linguistics, a suffix is an affix which is placed after the Stem (linguistics), stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the Grammatical conjugation ...
of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of
bibliometrics Bibliometrics is the use of statistical methods to analyse books, articles and other publications, especially in regard with scientific contents. Bibliometric methods are frequently used in the field of library and information science. Biblio ...
. Suffix arrays were introduced by as a simple, space efficient alternative to suffix trees. They had independently been discovered by
Gaston Gonnet Gaston H. Gonnet is a Uruguayan Canadian computer scientist and entrepreneur. He is best known for his contributions to the Maple computer algebra system and the creation of a digital version of the Oxford English Dictionary. Education and ear ...
in 1987 under the name ''PAT array'' . gave the first in-place \mathcal(n) time suffix array construction algorithm that is optimal both in time and space, where ''in-place'' means that the algorithm only needs \mathcal(1) additional space beyond the input string and the output suffix array. Enhanced suffix arrays (ESAs) are suffix arrays with additional tables that reproduce the full functionality of suffix trees preserving the same time and memory complexity. The suffix array for a subset of all suffixes of a string is called
sparse suffix array Sparse is a computer software tool designed to find possible coding faults in the Linux kernel. Unlike other such tools, this static analysis tool was initially designed to only flag constructs that were likely to be of interest to kernel deve ...
. Multiple probabilistic algorithms have been developed to minimize the additional memory usage including an optimal time and memory algorithm.


Definition

Let S=S ..S /math> be an n-string and let S ,j/math> denote the substring of S ranging from i to j inclusive. The suffix array A of S is now defined to be an array of integers providing the starting positions of
suffixes In linguistics, a suffix is an affix which is placed after the Stem (linguistics), stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the Grammatical conjugation ...
of S in
lexicographical order In mathematics, the lexicographic or lexicographical order (also known as lexical order, or dictionary order) is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a ...
. This means, an entry A /math> contains the starting position of the i-th smallest suffix in S and thus for all 1 \leq i \leq n: S [i-1n.html"_;"title="-1.html"_;"title="[i-1">[i-1n">-1.html"_;"title="[i-1">[i-1n<_S[A[i.html" ;"title="-1">[i-1n.html" ;"title="-1.html" ;"title="[i-1">[i-1n">-1.html" ;"title="[i-1">[i-1n< S[A[i">-1">[i-1n.html" ;"title="-1.html" ;"title="[i-1">[i-1n">-1.html" ;"title="[i-1">[i-1n< S[A[in]. Each Suffix (computer science), suffix of S shows up in A exactly once. Suffixes are simple strings. These strings are sorted (as in a paper dictionary), before their starting positions (integer indices) are saved in A.


Example

Consider the text S=banana$ to be indexed: The text ends with the special sentinel letter $ that is unique and lexicographically smaller than any other character. The text has the following suffixes: These suffixes can be sorted in ascending order: The suffix array A contains the starting positions of these sorted suffixes: The suffix array with the suffixes written out vertically underneath for clarity: So for example, A /math> contains the value 4, and therefore refers to the suffix starting at position 4 within S, which is the suffix ana$.


Correspondence to suffix trees

Suffix arrays are closely related to suffix trees: * Suffix arrays can be constructed by performing a
depth-first traversal Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node (selecting some arbitrary node as the root node in the case of a graph) and explores as far as possible alon ...
of a suffix tree. The suffix array corresponds to the leaf-labels given in the order in which these are visited during the traversal, if edges are visited in the lexicographical order of their first character. * A suffix tree can be constructed in linear time by using a combination of suffix array and
LCP array In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. ...
. For a description of the algorithm, see the corresponding section in the
LCP array In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. ...
article. It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information (such as the
LCP array In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. ...
) and solves the same problem in the same time complexity. Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (e.g., compared to
Ukkonen's algorithm In computer science, Ukkonen's algorithm is a linear-time, online algorithm for constructing suffix trees, proposed by Esko Ukkonen in 1995. The algorithm begins with an implicit suffix tree containing the first character of the string. Then it ste ...
) and improved cache locality.


Space efficiency

Suffix arrays were introduced by in order to improve over the space requirements of suffix trees: Suffix arrays store n integers. Assuming an integer requires 4 bytes, a suffix array requires 4n bytes in total. This is significantly less than the 20n bytes which are required by a careful suffix tree implementation. However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a suffix array requires \mathcal(n \log n) space, whereas the original text over an alphabet of size \sigma only requires \mathcal(n \log \sigma) bits. For a human genome with \sigma = 4 and n = 3.4 \times 10^9 the suffix array would therefore occupy about 16 times more memory than the genome itself. Such discrepancies motivated a trend towards
compressed suffix array In computer science, a compressed suffix arrayR. Grossi and J. S. Vitter, Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching, ''SIAM Journal on Computing,'' 35(2), 2005, 378-407. An earlier version ap ...
s and BWT-based compressed full-text indices such as the
FM-index In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manz ...
. These data structures require only space within the size of the text or even less.


Construction algorithms

A suffix tree can be built in \mathcal(n) and can be converted into a suffix array by traversing the tree depth-first also in \mathcal(n), so there exist algorithms that can build a suffix array in \mathcal(n). A naive approach to construct a suffix array is to use a comparison-based sorting algorithm. These algorithms require \mathcal(n \log n) suffix comparisons, but a suffix comparison runs in \mathcal(n) time, so the overall runtime of this approach is \mathcal(n^2 \log n). More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other. These algorithms strive to achieve the following goals: * minimal asymptotic complexity \Theta(n) * lightweight in space, meaning little or no working memory beside the text and the suffix array itself is needed * fast in practice One of the first algorithms to achieve all goals is the SA-IS algorithm of . The algorithm is also rather simple (< 100
LOC LOC, L.O.C., Loc, LoC, or locs may refer to: Places * Lóc, a village in Sângeorgiu de Pădure, Mureș County, Romania * Lócs, a village in Vas county, Hungary * Line of Contact, meeting place of Western and Eastern Allied forces at the e ...
) and can be enhanced to simultaneously construct the
LCP array In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. ...
. The SA-IS algorithm is one of the fastest known suffix array construction algorithms. A carefu
implementation by Yuta Mori
outperforms most other linear or super-linear construction approaches. Beside time and space requirements, suffix array construction algorithms are also differentiated by their supported
alphabet An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syll ...
: ''constant alphabets'' where the alphabet size is bound by a constant, ''integer alphabets'' where characters are integers in a range depending on n and ''general alphabets'' where only character comparisons are allowed. Most suffix array construction algorithms are based on one of the following approaches: * ''Prefix doubling'' algorithms are based on a strategy of . The idea is to find prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of the algorithm until a prefix is unique and provides the rank of the associated suffix. * ''Recursive'' algorithms follow the approach of the suffix tree construction algorithm by to recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both of these suffix arrays are then merged to compute the final suffix array. * ''Induced copying'' algorithms are similar to recursive algorithms in the sense that they use an already sorted subset to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together by . A well-known recursive algorithm for integer alphabets is the ''DC3 / skew'' algorithm of . It runs in linear time and has successfully been used as the basis for parallel and external memory suffix array construction algorithms. Recent work by proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is \mathcal(n \log n), it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text. In practical
open source Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use the source code, design documents, or content of the product. The open-source model is a decentralized sof ...
work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm. This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017. It too can be modified to compute an LCP array. It uses a induced copying combined with Itoh-Tanaka. In 2021 a faster implementation of the algorithm was presented by Ilya Grebnov which in average showed 65% performance improvement over DivSufSort implementation on Silesia Corpus.


Generalized Suffix Array

The concept of a suffix array can be extended to more than one string. This is called a generalized suffix array (or GSA), a suffix array that contains all suffixes for a set of strings (for example, S=S_1,S_2,S_3,...,S_k and is lexicographically sorted with all suffixes of each string.


Applications

The suffix array of a string can be used as an
index Index (or its plural form indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on a Halo megastru ...
to quickly locate every occurrence of a substring pattern P within the string S. Finding every occurrence of the pattern is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and can be found efficiently with two
binary search In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the m ...
es. The first search locates the starting position of the interval, and the second one determines the end position: n = len(S) def search(P: str) -> Tuple nt, int """ Return indices (s, r) such that the interval A :r(including the end index) represents all suffixes of S that start with the pattern P. """ # Find starting position of interval l = 0 # in Python, arrays are indexed starting at 0 r = n while l < r: mid = (l + r) // 2 # division rounding down to nearest integer # suffixAt(A is the ith smallest suffix if P > suffixAt(A id: l = mid + 1 else: r = mid s = l # Find ending position of interval r = n while l < r: mid = (l + r) // 2 if suffixAt(A id.startswith(P): l = mid + 1 else: r = mid return (s, r) Finding the substring pattern P of length m in the string S of length n takes \mathcal(m \log n) time, given that a single suffix comparison needs to compare m characters. describe how this bound can be improved to \mathcal(m + \log n) time using LCP information. The idea is that a pattern comparison does not need to re-compare certain characters, when it is already known that these are part of the longest common prefix of the pattern and the current search interval. improve the bound even further and achieve a search time of \mathcal(m) as known from suffix trees. Suffix sorting algorithms can be used to compute the Burrows–Wheeler transform (BWT). The BWT requires sorting of all cyclic permutations of a string. If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated BWT matrix corresponds to the order of suffixes in a suffix array. The BWT can therefore be computed in linear time by first constructing a suffix array of the text and then deducing the BWT string: BWT = S [i1.html"_;"title=".html"_;"title="[i">[i1">.html"_;"title="[i">[i1/math>. Suffix_arrays_can_also_be_used_to_look_up_substrings_in_example-based_machine_translation.html" ;"title="">[i1.html" ;"title=".html" ;"title="[i">[i1">.html" ;"title="[i">[i1/math>. Suffix arrays can also be used to look up substrings in example-based machine translation">">[i1.html" ;"title=".html" ;"title="[i">[i1">.html" ;"title="[i">[i1/math>. Suffix arrays can also be used to look up substrings in example-based machine translation, demanding much less storage than a full phrase table as used in Statistical machine translation. Many additional applications of the suffix array require the
LCP array In computer science, the longest common prefix array (LCP array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. ...
. Some of these are detailed in the application section of the latter.


Enhanced Suffix Arrays

Suffix trees are powerful data structures that have wide application in areas of pattern and string matching, indexing and textual statistics. However, it occupies a significant amount of space and thus has a drawback in many real-time applications that requires processing a considerably large amount of data like genome analysis. To overcome this drawback, Enhanced Suffix Arrays were developed that are data structures consisting of suffix arrays and an additional table called the child table that contains the information about the parent-child relationship between the nodes in the suffix tree. The node branching data structure for this tree is a linked list. Enhanced suffix trees are superior in terms of both space efficiency and time complexity and are easy to implement. Moreover, they can be applied to any algorithm that uses a suffix tree by using an abstract concept lcp-interval trees. The time complexity for searching a pattern in an enhanced suffix array is O(m, Σ, ). The suffix array of the string is an array of n integers in the range of 0 to n that represents the n+1 suffixes of the string including the special character #. The suffix array is composed of two arrays: # pos array pos ,...n It represents a sorted list of all S suffixes. Only the initial positions of the suffixes are stored in the array to reduce the space complexity since the suffixes are too large. # lcp array lcp ,...n It is an array of n integers that maintains the lengths of the longest common prefix of two consecutive suffixes stored in the pos array.


Constructing the lcp-interval

For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as: ''Interval ,..j 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if'' ''1. lcptab < l,'' ''2. lcptab ≥ l for all i + 1 ≤ k ≤ j,'' ''3. lcptab = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j,'' ''4. lcptab + 1< l.'' The length of the longest common prefix of pos − 1and pos is stored in lcp where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of ..jis a child of the corresponding node of ..l a lcp-interval ..jis a child interval of another lcp-interval ..l If ..lis a child interval of ..j a lcp-interval ..jis the parent interval of a lcp-interval ..l


Constructing a Child Table

The child table ''cldtab'' is composed of three n arrays, ''up'', ''down'' and ''nextlIndex''.The information about the edges of the corresponding suffix tree is stored in maintained by the ''up'' and ''down'' array. The ''nextlIndexarray'' stores the links in the linked list used for node branching the suffix tree. The ''up'', ''down'' and ''nextlIndex'' array are defined as follows: # The element ''up 'records the starting index of the longest lcp-second interval’s child interval, which ends at index ''i-1''. # The initial index of the second child interval of the longest lcp-interval, starting at index ''i'' is stored in the element ''down '. # If and only if the interval is neither the first child nor the final child of its parent, the element ''nextlIndex ' contains the first index of the next sibling interval of the longest lcp-interval, starting at index ''i''. By performing a bottom-up traversal of the lcp-interval of the tree, the child table can be constructed in linear time. The ''up/down'' values and the ''nextlIndex'' values can be computed separately by using two distinct algorithms.


Constructing a Suffix Link Table

The suffix links for an enhanced suffix array can be computed by generating the suffix link interval '1,..,r''for each ,..jinterval during the preprocessing. The left and right elements l and r of the interval are maintained in the first index of ,..,j The table for this interval ranges from 0 to n. The suffix link table is constructed by the left-to-right breadth-first traversal of the lcp-interval tree. Every time an ''l''-interval is computed, it is added to the list of l-intervals, which is referred to as the l-list. When the lcp-value > 0, for every ''l''-interval ,..,jin the list, link is calculated. The interval 'l'',..,''r''is computed by a binary search in(''l''-1)-list, where ''l'' is the largest left boundary amongst all the ''l''-1 intervals. The suffix link interval of ,..jis represented by this interval 'l,..,r'' The values ''l'' and ''r'' are ultimately stored in the first index of ,..,j


Notes


References

* * * * * * * * * * * * * * * * * * * * *Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. "Replacing suffix trees with enhanced suffix arrays." ''Journal of Discrete Algorithms'', 2(1):53–86, 2004. *Dong Kyue Kim, Jeong Eun Jeon, and Heejin Park. "An efficient index data structure with the capabilities of suffix trees and suffix arrays for alphabets of non-negligible size." ''String Processing and Information Retrieval Lecture Notes in Computer Science'', page138–149, 2004.


External links


Suffix Array in JavaSuffix sorting module for BWT in C codeSuffix Array Implementation in RubyProject containing various Suffix Array c/c++ Implementations with a unified interfaceA fast, lightweight, and robust C API library to construct the suffix arraySuffix Array implementation in PythonLinear Time Suffix Array implementation in C using suffix tree
{{Strings Arrays Substring indices String data structures Computer science suffixes