HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, the longest common prefix array (LCP
array An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...
) is an auxiliary
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...
to the
suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by a ...
. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. For example, if ''A'' := samp>aab, ab, abaab, b, baabis a suffix array, the longest common prefix between ''A'' = aab and ''A'' = ab is ''a'' which has length 1, so ''H'' = 1 in the LCP array ''H''. Likewise, the LCP of ''A'' = ab and ''A'' = abaab is ab, so ''H'' = 2. Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up traversals of the
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow parti ...
, speeds up pattern matching on the suffix array and is a prerequisite for compressed suffix trees.


History

The LCP array was introduced in 1993, by
Udi Manber Udi Manber ( he, אודי מנבר) is an Israeli computer scientist. He is one of the authors of agrep and GLIMPSE. After a career in engineering and management, he worked on medical research. Education He earned both his bachelor's degree in 1 ...
and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm.


Definition

Let A be the
suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by a ...
of the string S=s_1,s_2,\ldots s_\$ of length n, where \$ is a sentinel letter that is unique and
lexicographically In mathematics, the lexicographic or lexicographical order (also known as lexical order, or dictionary order) is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of ...
smaller than any other character. Let S ,j/math> denote the substring of S ranging from i to j. Thus, S [i n">[i.html" ;"title="[i">[i n/math> is the ith smallest suffix of S. Let \operatorname(v,w) denote the length of the longest common prefix between two strings v and w. Then the LCP array H[1,n] is an integer array of size n such that H /math> is undefined and H[i]=\operatorname(S[A[i-1],n],S n for every 1. Thus H /math> stores the length of longest common prefix of the lexicographically ith smallest suffix and its predecessor in the suffix array. Difference between LCP array and suffix array: * Suffix array: Represents the lexicographic rank of each suffix of an array. * LCP array: Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.


Example

Consider the string S = \textrm: and its corresponding sorted suffix array A : Suffix array with suffixes written out underneath vertically: Then the LCP array H is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix: So, for example, H 3 is the length of the longest common prefix \text shared by the suffixes A = S ,7= \textrm and A = S ,7= \textrm. Note that H /math> is undefined, since there is no lexicographically smaller suffix.


Efficient construction algorithms

LCP array construction algorithms can be divided into two different categories: algorithms that compute the LCP array as a byproduct to the suffix array and algorithms that use an already constructed suffix array in order to compute the LCP values. provide an algorithm to compute the LCP array alongside the suffix array in O(n \log n) time. show that it is also possible to modify their O(n) time algorithm such that it computes the LCP array as well. present the first O(n) time algorithm (FLAAP) that computes the LCP array given the text and the suffix array. Assuming that each text symbol takes one byte and each entry of the suffix or LCP array takes 4 bytes, the major drawback of their algorithm is a large space occupancy of 13n bytes, while the original output (text, suffix array, LCP array) only occupies 9n bytes. Therefore, created a refined version of the algorithm of (lcp9) and reduced the space occupancy to 9n bytes. provide another refinement of Kasai's algorithm (\Phi-algorithm) that improves the running time. Rather than the actual LCP array, this algorithm builds the ''permuted'' LCP (PLCP) array, in which the values appear in text order rather than lexicographical order. provide two algorithms that although being theoretically slow (O(n^2)) were faster than the above-mentioned algorithms in practice. , the currently fastest linear-time LCP array construction algorithm is due to , which in turn is based on one of the fastest suffix array construction algorithms (SA-IS) by . based on Yuta Mori's DivSufSort is even faster.


Applications

As noted by several string processing problems can be solved by the following kinds of
tree traversal In computer science, tree traversal (also known as tree search and walking the tree) is a form of graph traversal and refers to the process of visiting (e.g. retrieving, updating, or deleting) each node in a tree data structure, exactly once. ...
s: * bottom-up traversal of the complete suffix tree * top-down traversal of a subtree of the suffix tree * suffix tree traversal using the suffix links. show how to simulate a bottom-up traversal of the
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow parti ...
using only the
suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by a ...
and LCP array. enhance the suffix array with the LCP array and additional data structures and describe how this ''enhanced suffix array'' can be used to simulate ''all three kinds'' of suffix tree traversals. reduce the space requirements of the enhanced suffix array by preprocessing the LCP array for range minimum queries. Thus, ''every'' problem that can be solved by suffix tree algorithms can also be solved using the ''enhanced suffix array''. Deciding if a pattern P of length m is a substring of a string S of length n takes O(m \log n) time if only the suffix array is used. By additionally using the LCP information, this bound can be improved to O(m + \log n) time. show how to improve this running time even further to achieve optimal O(m) time. Thus, using suffix array and LCP array information, the decision query can be answered as fast as using the
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow parti ...
. The LCP array is also an essential part of compressed suffix trees which provide full suffix tree functionality like suffix links and
lowest common ancestor In graph theory and computer science, the lowest common ancestor (LCA) (also called least common ancestor) of two nodes and in a tree or directed acyclic graph (DAG) is the lowest (i.e. deepest) node that has both and as descendants, where ...
queries. Furthermore, it can be used together with the suffix array to compute the Lempel-Ziv
LZ77 LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for many variations including ...
factorization in O(n) time. The
longest repeated substring problem In computer science, the longest repeated substring problem is the problem of finding the longest substring of a string that occurs at least twice. This problem can be solved in linear time and space \Theta(n) by building a suffix tree In compu ...
for a string S of length n can be solved in \Theta(n) time using both the suffix array A and the LCP array. It is sufficient to perform a linear scan through the LCP array in order to find its maximum value v_ and the corresponding index i where v_ is stored. The longest substring that occurs at least twice is then given by S [iA[i">[i.html" ;"title="[i">[iA[iv_-1">[i.html"_;"title="[i">[i<_a>A[i.html" ;"title="[i.html" ;"title="[i">[iA[i">[i.html" ;"title="[i">[iA[iv_-1/math>. The remainder of this section explains two applications of the LCP array in more detail: How the suffix array and the LCP array of a string can be used to construct the corresponding suffix tree and how it is possible to answer LCP queries for arbitrary suffixes using range minimum queries on the LCP array.


Find the number of occurrences of a pattern

In order to find the number of occurrences of a given string P (length m) in a text T (length N), * We use binary search against the suffix array of T to find the starting and end position of all occurrences of P. * Now to speed up the search, we use LCP array, specifically a special version of the LCP array (LCP-LR below). The issue with using standard binary search (without the LCP information) is that in each of the O(\log N) comparisons needed to be made, we compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m\log N). The LCP-LR array helps improve this to O(m+\log N), in the following way: At any point during the binary search algorithm, we consider, as usual, a range (L,\dots,R) of the suffix array and its central point M, and decide whether we continue our search in the left sub-range (L,\dots,M) or in the right sub-range (M,\dots,R). In order to make the decision, we compare P to the string at M. If P is identical to M, our search is complete. But if not, we have already compared the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M. So, in the next step, we consider (M,\dots,R) and a new central point M' in the middle: M ...... M' ...... R , we know: lcp(P,M)

k The trick now is that LCP-LR is precomputed such that an O(1)-lookup tells us the longest common prefix of M and M', \mathrm(M,M'). We already know (from the previous step) that M itself has a prefix of k characters in common with P: \mathrm(P,M)=k. Now there are three possibilities: * Case 1: k < \mathrm(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R). * Case 2: k > \mathrm(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,\dots,M'). * Case 3: k = \mathrm(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)th character. * We continue recursively. The overall effect is that no character of P is compared to any character of the text more than once. The total number of character comparisons is bounded by m, so the total complexity is indeed O(m+\log N). We still need to precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array. We know the standard LCP array gives us the lcp of consecutive entries only, i.e. \mathrm(i-1,i) for any i. However, M and M' in the description above are not necessarily consecutive entries. The key to this is to realize that only certain ranges (L,\dots,R) will ever occur during the binary search: It always starts with (0, \dots,N) and divides that at the center, and then continues either left or right and divide that half again and so forth. Another way of looking at it is : every entry of the suffix array occurs as central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L\dots M\dots R) that can possibly play a role during binary search, and it suffices to precompute \mathrm(L,M) and \mathrm(M,R) for those N possible ranges. So that is 2N distinct precomputed values, hence LCP-LR is O(N) in size. Moreover, there is a straightforward recursive algorithm to compute the 2N values of LCP-LR in O(N) time from the standard LCP array. To sum up: * It is possible to compute LCP-LR in O(N) time and O(2N)=O(N) space from LCP. * Using LCP-LR during binary search helps accelerate the search procedure from O(M\log N) to O(M+\log N). * We can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds with the number of occurrences for P.


Suffix tree construction

Given the suffix array A and the LCP array H of a string S=s_1,s_2,\ldots s_n\$ of length n+1, its suffix tree ST can be constructed in O(n) time based on the following idea: Start with the partial suffix tree for the lexicographically smallest suffix and repeatedly insert the other suffixes in the order given by the suffix array. Let ST_ be the partial suffix tree for 0\leq i \leq n. Further let d(v) be the length of the concatenation of all path labels from the root of ST_i to node v. Start with ST_0, the tree consisting only of the root. To insert A +1/math> into ST_i, walk up the ''rightmost'' path beginning at the recently inserted leaf A /math> to the root, until the deepest node v with d(v) \leq H +1/math> is reached. We need to distinguish two cases: * d(v)=H +1/math>: This means that the concatenation of the labels on the root-to-v path equals the longest common prefix of suffixes A /math> and A +1/math>.
In this case, insert A +1/math> as a new leaf x of node v and label the edge (v,x) with S [i+1H[i+1">[i+1.html" ;"title="[i+1">[i+1H[i+1n">[i+1.html"_;"title="[i+1">[i+1<_a>H[i+1.html" ;"title="[i+1.html" ;"title="[i+1">[i+1H[i+1">[i+1.html" ;"title="[i+1">[i+1H[i+1n/math>. Thus the edge label consists of the remaining characters of suffix A +1/math> that are not already represented by the concatenation of the labels of the root-to-v path.
This creates the partial suffix tree ST_. * d(v) < H +1/math>: This means that the concatenation of the labels on the root-to-v path displays less characters than the longest common prefix of suffixes A /math> and A +1/math> and the ''missing'' characters are contained in the edge label of v's ''rightmost'' edge. Therefore, we have to ''split up'' that edge as follows:
Let w be the child of v on ST_i's rightmost path. # Delete the edge (v,w). # Add a new internal node y and a new edge (v,y) with label S [id(v),A[i]+H[i+1]-1]. The new label consists of the ''missing'' characters of the longest common prefix of A /math> and A +1/math>. Thus, the concatenation of the labels of the root-to-y path now displays the longest common prefix of A /math> and A +1/math>. # Connect w to the newly created internal node y by an edge (y,w) that is labeled S [iH[i+1">[i.html" ;"title="[i">[iH[i+1A[i">[i.html"_;"title="[i">[i<_a>H[i+1.html" ;"title="[i.html" ;"title="[i">[iH[i+1">[i.html" ;"title="[i">[iH[i+1A[id(w)-1]. The new label consists of the ''remaining'' characters of the deleted edge (v,w) that were not used as the label of edge (v,y). # Add A +1/math> as a new leaf x and connect it to the new internal node y by an edge (y,x) that is labeled S [i+1H[i+1">[i+1.html" ;"title="[i+1">[i+1H[i+1n">[i+1.html"_;"title="[i+1">[i+1<_a>H[i+1.html" ;"title="[i+1.html" ;"title="[i+1">[i+1H[i+1">[i+1.html" ;"title="[i+1">[i+1H[i+1n/math>. Thus the edge label consists of the remaining characters of suffix A +1/math> that are not already represented by the concatenation of the labels of the root-to-v path. # This creates the partial suffix tree ST_. A simple amortization argument shows that the running time of this algorithm is bounded by O(n): The nodes that are traversed in step i by walking up the ''rightmost'' path of ST_i (apart from the last node v) are removed from the ''rightmost'' path, when A +1/math> is added to the tree as a new leaf. These nodes will never be traversed again for all subsequent steps j>i. Therefore, at most 2n nodes will be traversed in total.


LCP queries for arbitrary suffixes

The LCP array H only contains the length of the longest common prefix of every pair of consecutive suffixes in the suffix array A. However, with the help of the inverse suffix array A^ ( A[i]= j \Leftrightarrow A^[j]= i , i.e. the suffix S ,n/math> that starts at position j in S is stored in position A^[j] in A) and constant-time range minimum queries on H, it is possible to determine the length of the longest common prefix of arbitrary suffixes in O(1) time. Because of the lexicographic order of the suffix array, every common prefix of the suffixes S ,n/math> and S ,n/math> has to be a common prefix of all suffixes between i's position in the suffix array A^ /math> and j's position in the suffix array A^ . Therefore, the length of the longest prefix that is shared by ''all'' of these suffixes is the minimum value in the interval H ^[i1,A^[j">.html" ;"title="^[i">^[i1,A^[j. This value can be found in constant time if H is preprocessed for range minimum queries. Thus given a string S of length n and two arbitrary positions i,j in the string S with A^[i] < A^ , the length of the longest common prefix of the suffixes S ,n/math> and S ,n/math> can be computed as follows: \operatorname(i,j)=H operatorname_H(A^[i1,A^[j">.html" ;"title="operatorname_H(A^[i">operatorname_H(A^[i1,A^[j">">operatorname_H(A^[i<_a>1,A^[j.html" ;"title=".html" ;"title="operatorname_H(A^[i">operatorname_H(A^[i1,A^[j">.html" ;"title="operatorname_H(A^[i">operatorname_H(A^[i1,A^[j/math>.


Notes


References

* * * * * * * * * * * * * * * * * *


External links


Mirror of the ad-hoc-implementation of the code
described in {{harvtxt, Fischer, 2011
SDSL: Succinct Data Structure Library - Provides various LCP array implementations, Range Minimum Query (RMQ) support structures and many more succinct data structures Bottom-up suffix tree traversal emulated using suffix array and LCP array (Java)Text-Indexing project
(linear-time construction of suffix trees, suffix arrays, LCP array and Burrows–Wheeler Transform) Arrays Substring indices String data structures