HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, the two-way string-matching algorithm is a
string-searching algorithm A string-searching algorithm, sometimes called string-matching algorithm, is an algorithm that searches a body of text for portions that match by pattern. A basic example of string searching is when the pattern and the searched text are arrays of ...
, discovered by
Maxime Crochemore Maxime Crochemore (born 1947) is a French computer scientist known for his numerous contributions to string algorithms, algorithms on strings. He is currently a professor at King's College London. Biography Crochemore earned his doctorate (PhD) i ...
and
Dominique Perrin Dominique Pierre Perrin (b. 1946) is a French mathematician and theoretical computer scientist known for his contributions to coding theory and to combinatorics on words. He is a professor of the University of Marne-la-Vallée and currently serv ...
in 1991. It takes a pattern of size ''m'', called a “needle”, preprocesses it in linear time O(''m''), producing information that can then be used to search for the needle in any “haystack” string, taking only linear time O(''n'') with ''n'' being the haystack's length. The two-way algorithm can be viewed as a combination of the forward-going
Knuth–Morris–Pratt algorithm In computer science, the Knuth–Morris–Pratt algorithm (or KMP algorithm) is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the wo ...
(KMP) and the backward-running
Boyer–Moore string-search algorithm In computer science, the Boyer–Moore string-search algorithm is an efficient string-searching algorithm that is the standard benchmark for practical string-search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. ...
(BM). Like those two, the 2-way algorithm preprocesses the pattern to find partially repeating periods and computes “shifts” based on them, indicating what offset to “jump” to in the haystack when a given character is encountered. Unlike BM and KMP, it uses only O(log ''m'') additional space to store information about those partial repeats: the search pattern is split into two parts (its critical factorization), represented only by the position of that split. Being a number less than ''m'', it can be represented in ⌈log₂ ''m''⌉ bits. This is sometimes treated as "close enough to O(1) in practice", as the needle's size is limited by the size of addressable memory; the overhead is a number that can be stored in a single register, and treating it as O(1) is like treating the size of a loop counter as O(1) rather than log of the number of iterations. The actual matching operation performs at most 2''n'' − ''m'' comparisons. Breslauer later published two improved variants performing fewer comparisons, at the cost of storing additional data about the preprocessed needle: * The first one performs at most ''n'' + ⌊(''n'' − ''m'')/2⌋ comparisons, ⌈(''n'' − ''m'')/2⌉ fewer than the original. It must however store ⌈log ''m''⌉ additional offsets in the needle, using O(log2 ''m'') space. * The second adapts it to only store a constant number of such offsets, denoted ''c'', but must perform ''n'' + ⌊( + ε) * (''n'' − ''m'')⌋ comparisons, with ε = ( ''F''''c''+2 − 1)−1 = O(−''c'') going to zero exponentially quickly as ''c'' increases. The algorithm is considered fairly efficient in practice, being cache-friendly and using several operations that can be implemented in well-optimized subroutines. It is used by the C standard libraries
glibc The GNU C Library, commonly known as glibc, is the GNU Project implementation of the C standard library. It provides a wrapper around the system calls of the Linux kernel and other kernels for application use. Despite its name, it now also dir ...
,
newlib Newlib is a C standard library implementation intended for use on embedded systems. It is a conglomeration of several library parts, all under free software licenses that make them easily usable on embedded products. It was created by Cygnus ...
, and
musl musl is a C standard library intended for operating systems based on the Linux kernel, released under the MIT License. It was developed by Rich Felker to write a clean, efficient, and standards-conformant libc implementation. Overview musl wa ...
, to implement the ''memmem'' and ''strstr'' family of substring functions. As with most advanced string-search algorithms, the naïve implementation may be more efficient on small-enough instances; this is especially so if the needle isn't searched in multiple haystacks, which would amortize the preprocessing cost.


Critical factorization

Before we define critical factorization, we should define: * A factorization is a partition of a string . For example, ("Wiki","pedia") is a factorization of "Wikipedia". * A period of a string is an integer such that all characters -distance apart are equal. More precisely, holds for any integer . This definition is allowed to be vacuously true, so that any word of length has a period of . To illustrate, the 8-letter word "educated" has period 6 in addition to the trivial periods of 8 and above. The minimum period of is denoted as . * A repetition in is a non-empty string such that: ** is a suffix of or is a suffix of ; ** is a prefix of or is a prefix of ; *: In other words, occurs on both sides of the cut with a possible overflow on either side. Examples include "an" for ("ban","ana") and "voca" for ("a","vocado"). Each factorization trivially has at least one repetition: the string . * A local period is the length of a repetition in . The smallest local period in is denoted as . Because the trivial repetition is guaranteed to exist and has the same length as , we see that . * A critical factorization is a factorization of such that . The existence of a critical factorization is provably guaranteed. For a needle of length in an ordered alphabet, it can be computed in comparisons, by computing the lexicographically larger of two ordered maximal suffixes, defined for order ≤ and ≥.


The algorithm

The algorithm starts by critical factorization of the needle as the preprocessing step. This step produces the index (starting point) of the periodic right-half, and the period of this stretch. The suffix computation here follows the authors' formulation. It can alternatively be computed using the Duval's algorithm, which is simpler and still linear time but slower in practice. ''Shorthand for inversion.'' function cmp(a, b) if a > b return 1 if a = b return 0     if a < b return -1 function maxsuf(n, rev) l ← len(n) p ← 1 ''currently known period.'' k ← 1 ''index for period testing, 0 < k <= p.'' j ← 0 ''index for maxsuf testing. greater than maxs.'' i ← -1 ''the proposed starting index of maxsuf'' while j + k < l cmpv ← cmp(n + k n + k if rev cmpv ← -cmpv ''invert the comparison'' if cmpv < 0 ''Suffix (j+k) is smaller. Period is the entire prefix so far.'' j ← j + k k ← 1 p ← j - i else if cmpv = 0 ''They are the same - we should go on.'' if k = p ''We are done checking this stretch of p. reset k.'' j ← j + p k ← 1 else k ← k + 1 else ''Suffix is larger. Start over from here.'' i ← j j ← j + 1 p ← 1 k ← 1 return
, p The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
function crit_fact(n) dx1, per1← maxsuf(n, false) dx2, per2← maxsuf(n, true) if idx1 > idx2 return dx1, per1 else return dx2, per2 The comparison proceeds by first matching for the right-hand-side, and then for the left-hand-side if it matches. Linear-time skipping is done using the period. function match(n, h) nl ← len(n) hl ← len(h)
, p The comma is a punctuation mark that appears in several variants in different languages. Some typefaces render it as a small line, slightly curved or straight, but inclined from the vertical; others give it the appearance of a miniature fille ...
← crit_fact(n)     P ← ''set of matches.''     ''Match the suffix.'' ''Use a library function like memcmp, or write your own loop.'' if n ... n = n +1... n +p P ← pos ← 0 s ← 0 ''TODO. At least put the skip in.''


References

{{Strings String matching algorithms