computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...

, a suffix automaton is an efficient

data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...

for representing the

substring index In computer science, a substring index is a data structure which gives substring search in a text or text collection in sublinear time. If you have a document S of length n, or a set of documents D=\ of total length n, you can locate all occurrenc ...

of a given string which allows the storage, processing, and retrieval of compressed information about all its

substring In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "''the best of''" is a substring of "''It was the best of times''". In contrast, "''Itwastimes''" is a subsequence ...

s. The suffix automaton of a string

S

is the smallest

directed acyclic graph In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one v ...

with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string. In terms of

automata theory Automata theory is the study of abstract machines and automata, as well as the computational problems that can be solved using them. It is a theory in theoretical computer science. The word ''automata'' comes from the Greek word αὐτόματο� ...

, a suffix automaton is the minimal partial

deterministic finite automaton In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automa ...

that recognizes the set of

suffixes In linguistics, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the conjugation of verbs. Suffixes can carry gr ...

of a given

string String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...

S=s_1 s_2 \dots s_n

. The state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any

deterministic acyclic finite state automaton In computer science, a deterministic acyclic finite state automaton (DAFSA),Jan Daciuk, Stoyan Mihov, Bruce Watson and Richard Watson (2000). Incremental construction of minimal acyclic finite state automata. Computational Linguistics 26(1):3-16. ...

. Suffix automata were introduced in 1983 by a group of scientists from the

University of Denver The University of Denver (DU) is a private research university in Denver, Colorado. Founded in 1864, it is the oldest independent private university in the Rocky Mountain Region of the United States. It is classified among "R1: Doctoral Univ ...

and the

University of Colorado Boulder The University of Colorado Boulder (CU Boulder, CU, or Colorado) is a public research university in Boulder, Colorado. Founded in 1876, five months before Colorado became a state, it is the flagship university of the University of Colorado s ...

. They suggested a

linear time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by t ...

online algorithm In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start. In contrast, an o ...

for its construction and showed that the suffix automaton of a string

S

having length at least two characters has at most

2, S,  - 1

states and at most

3, S,  - 4

transitions. Further works have shown a close connection between suffix automata and

suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow parti ...

s, and have outlined several generalizations of suffix automata, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc. Suffix automata provide efficient solutions to problems such as substring search and computation of the largest common substring of two and more strings.

History

The concept of suffix automaton was introduced in 1983 by a group of scientists from

and

consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht,

David Haussler David Haussler (born 1953) is an American bioinformatician known for his work leading the team that assembled the first human genome sequence in the race to complete the Human Genome Project and subsequently for comparative genome analysis that d ...

and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner, Vaughan Pratt and Anatol Slissenko. In their initial work, Blumer ''et al''. showed a suffix automaton built for the string

S

of length greater than

1

has at most

2, S, -1

states and at most

3, S, -4

transitions, and suggested a linear

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...

for automaton construction. In 1983, Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix-tree construction algorithm while building a suffix tree of the string

S

constructs a suffix automaton of the reversed string

S^R

as an auxiliary structure. In 1987, Blumer ''et al''. applied the compressing technique used in suffix trees to a suffix automaton and invented the compacted suffix automaton, which is also called the compacted directed acyclic word graph (CDAWG). In 1997,

Maxime Crochemore Maxime Crochemore (born 1947) is a French computer scientist known for his numerous contributions to algorithms on strings. He is currently a professor at King's College London. Biography Crochemore earned his doctorate (PhD) in 1978 and his Doc ...

and Renaud Vérin developed a linear algorithm for direct CDAWG construction. In 2001, Shunsuke Inenaga ''et al''. developed an algorithm for construction of CDAWG for a set of words given by a

trie In computer science, a trie, also called digital tree or prefix tree, is a type of ''k''-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes d ...

Definitions

Usually when speaking about suffix automata and related concepts, some notions from

formal language theory In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of sy ...

and

are used, in particular: * "Alphabet" is a finite set

\Sigma

that is used to construct words. Its elements are called "characters"; * "Word" is a finite sequence of characters

\omega = \omega_1 \omega_2 \dots \omega_n

. "Length" of the word

\omega

is denoted as

, \omega, =n

; * "

Formal language In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules. The alphabet of a formal language consists of s ...

" is a set of words over given alphabet; * "Language of all words" is denoted as

\Sigma^*

(where the "*" character stands for

Kleene star In mathematical logic and computer science, the Kleene star (or Kleene operator or Kleene closure) is a unary operation, either on sets of strings or on sets of symbols or characters. In mathematics, it is more commonly known as the free monoid ...

), "empty word" (the word of zero length) is denoted by the character

\varepsilon

; * "

Concatenation In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball". In certain formalisations of concatenat ...

of words"

\alpha = \alpha_1 \alpha_2 \dots \alpha_n

and

\beta = \beta_1 \beta_2 \dots \beta_m

is denoted as

\alpha \cdot \beta

\alpha\beta

and corresponds to the word obtained by writing

\beta

to the right of

\alpha

, that is,

\alpha \beta = \alpha_1 \alpha_2 \dots \alpha_n \beta_1 \beta_2 \dots \beta_m

; * "Concatenation of languages"

A

and

B

is denoted as

A \cdot B

AB

and corresponds to the set of pairwise concatenations

AB = \

; * If the word

\omega\in\Sigma^*

may be represented as

\omega = \alpha\gamma\beta

, where

\alpha,\beta,\gamma \in \Sigma^*

, then words

\alpha

\beta

and

\gamma

are called "prefix", "suffix" and " subword" (substring) of the word

\omega

correspondingly; * If

T = T_1 \dots T_n

and

T_l T_ \dots T_r = S

(with

1 \leq l \leq r \leq n

) then

S

is said to "occur" in

T

as a subword. Here

l

and

r

are called left and right positions of occurrence of

S

T

correspondingly.

Automaton structure

Formally,

is determined by 5-tuple

\mathcal A = (\Sigma, Q, q_0, F, \delta)

, where: *

\Sigma

is an "alphabet" that is used to construct words, *

Q

is a set of automaton " states", *

q_0 \in Q

is an "initial" state of automaton, *

F \subset Q

is a set of "final" states of automaton, *

\delta : Q \times \Sigma \mapsto Q

is a partial "transition" function of automaton, such that

\delta(q, \sigma)

for

q \in Q

and

\sigma \in \Sigma

is either undefined or defines a transition from

q

over character

\sigma

. Most commonly, deterministic finite automaton is represented as a

directed graph In mathematics, and more specifically in graph theory, a directed graph (or digraph) is a graph that is made up of a set of vertices connected by directed edges, often called arcs. Definition In formal terms, a directed graph is an ordered pai ...

("diagram") such that: * Set of graph vertices corresponds to the state of states

Q

, * Graph has a specific marked vertex corresponding to initial state

q_0

, * Graph has several marked vertices corresponding to the set of final states

F

, * Set of graph arcs corresponds to the set of transitions

\delta

, * Specifically, every transition

\delta(q_1, \sigma) = q_2

is represented by an arc from

q_1

q_2

marked with the character

\sigma

. This transition also may be denoted as

q_1 \begin\\ 5pt end q_2

. In terms of its diagram, the automaton recognizes the word

\omega=\omega_1 \omega_2 \dots \omega_m

only if there is a path from the initial vertex

q_0

to some final vertex

q \in F

such that concatenation of characters on this path forms

\omega

. The set of words recognized by an automaton forms a language that is set to be recognized by the automaton. In these terms, the language recognized by a suffix automaton of

S

is the language of its (possibly empty) suffixes.

Automaton states

"Right context" of the word

\omega

with respect to language

L

is a set

omega Omega (; capital: Ω, lowercase: ω; Ancient Greek ὦ, later ὦ μέγα, Modern Greek ωμέγα) is the twenty-fourth and final letter in the Greek alphabet. In the Greek numeric system/ isopsephy ( gematria), it has a value of 800. Th ...

R=\ that is a set of words

\alpha

such that their concatenation with

\omega

forms a word from

L

. Right contexts induce a natural

equivalence relation In mathematics, an equivalence relation is a binary relation that is reflexive, symmetric and transitive. The equipollence relation between line segments in geometry is a common example of an equivalence relation. Each equivalence relatio ...

alpha R =

beta Beta (, ; uppercase , lowercase , or cursive ; grc, βῆτα, bē̂ta or ell, βήτα, víta) is the second letter of the Greek alphabet. In the system of Greek numerals, it has a value of 2. In Modern Greek, it represents the voiced labi ...

R on the set of all words. If language

L

is recognized by some deterministic finite automaton, there exists unique up to

isomorphism In mathematics, an isomorphism is a structure-preserving mapping between two structures of the same type that can be reversed by an inverse mapping. Two mathematical structures are isomorphic if an isomorphism exists between them. The word i ...

automaton that recognizes the same language and has the minimum possible number of states. Such an automaton is called a

minimal automaton In automata theory (a branch of theoretical computer science), DFA minimization is the task of transforming a given deterministic finite automaton (DFA) into an equivalent DFA that has a minimum number of states. Here, two DFAs are called equi ...

for the given language

L

Myhill–Nerode theorem In the theory of formal languages, the Myhill–Nerode theorem provides a necessary and sufficient condition for a language to be regular. The theorem is named for John Myhill and Anil Nerode, who proved it at the University of Chicago in 1958 . ...

allows it to define it explicitly in terms of right contexts: In these terms, a "suffix automaton" is the minimal deterministic finite automaton recognizing the language of suffixes of the word

S=s_1s_2\dots s_n

. The right context of the word

\omega

with respect to this language consists of words

\alpha

, such that

\omega \alpha

is a suffix of

S

. It allows to formulate the following lemma defining a bijection between the right context of the word and the set of right positions of its occurrences in

S

: For example, for the word

S=abacaba

and its subword

\omega = ab

, it holds

endpos(ab)=\

and

b R = \

. Informally,

b R

is formed by words that follow occurrences of

ab

to the end of

S

and

endpos(ab)

is formed by right positions of those occurrences. In this example, the element

x=2 \in endpos(ab)

corresponds with the word

s_3 s_4 s_5 s_6 s_7 = acaba \in b R

while the word

a \in b R

corresponds with the element

7-, a, =6 \in endpos(ab)

. It implies several structure properties of suffix automaton states. Let

, \alpha,  \leq , \beta,

, then: * If

alpha R

and

R have at least one common element

x

, then

endpos(\alpha)

and

endpos(\beta)

have a common element as well. It implies

\alpha

is a suffix of

\beta

and therefore

endpos(\beta) \subset endpos(\alpha)

and

R \subset alphaR. In aforementioned example,

a \in b R \cap ab R

, so

ab

is a suffix of

cab

and thus

ab R=\ \subset \ = b R

and

endpos(cab) = \ \subset \ = endpos(ab)

; * If

alpha R=

R, then

endpos(\alpha)=endpos(\beta)

, thus

\alpha

occurs in

S

only as a suffix of

\beta

. For example, for

\alpha=b

and

\beta = ab

it holds that

R = b R = \

and

endpos(b)=endpos(ab)=\

; * If

alpha R=

R and

\gamma

is a suffix of

\beta

such that

, \alpha,  \leq , \gamma,  \leq , \beta,

, then

alpha R = gamma R =

R. In the example above

R = ac R = \

and it holds for "intermediate" suffix

\gamma = ac

that

c R = \

. Any state

q = alpha R

of the suffix automaton recognizes some continuous

chain A chain is a wikt:series#Noun, serial assembly of connected pieces, called links, typically made of metal, with an overall character similar to that of a rope in that it is flexible and curved in compression (physics), compression but line (g ...

of nested suffixes of the longest word recognized by this state. "Left extension"

\overset

of the string

\gamma

is the longest string

\omega

that has the same right context as

\gamma

. Length

, \overset,

of the longest string recognized by

q= gamma R

is denoted by

len(q)

. It holds: "Suffix link"

link(q)

of the state

q= alpha R

is the pointer to the state

p

that contains the largest suffix of

\alpha

that is not recognized by

q

. In this terms it can be said

q= alpha R

recognizes exactly all suffixes of

\overset

that is longer than

len(link(q))

and not longer than

len(q)

. It also holds:

Connection with suffix trees

A "

prefix tree In computer science, a trie, also called digital tree or prefix tree, is a type of ''k''-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes de ...

" (or "trie") is a rooted directed tree in which arcs are marked by characters in such a way no vertex

v

of such tree has two out-going arcs marked with the same character. Some vertices in trie are marked as final. Trie is said to recognize a set of words defined by paths from its root to final vertices. In this way prefix trees are a special kind of deterministic finite automata if you perceive its root as an initial vertex. The "suffix trie" of the word

S

is a prefix tree recognizing a set of its suffixes. "A

" is a tree obtained from a suffix trie via the compaction procedure, during which consequent edges are merged if the

degree Degree may refer to: As a unit of measurement * Degree (angle), a unit of angle measurement ** Degree of geographical latitude ** Degree of geographical longitude * Degree symbol (°), a notation used in science, engineering, and mathemati ...

of the vertex between them is equal to two. By its definition, a suffix automaton can be obtained via minimization of the suffix trie. It may be shown that a compacted suffix automaton is obtained by both minimization of the suffix tree (if one assumes each string on the edge of the suffix tree is a solid character from the alphabet) and compaction of the suffix automaton. Besides this connection between the suffix tree and the suffix automaton of the same string there is as well a connection between the suffix automaton of the string

S=s_1 s_2 \dots s_n

and the suffix tree of the reversed string

S^R = s_n s_ \dots s_1

. Similarly to right contexts one may introduce "left contexts"

L = \, "right extensions"

\overset

corresponding to the longest string having same left context as

\omega

and the equivalence relation

alpha L =

L. If one considers right extensions with respect to the language

L

of "prefixes" of the string

S

it may be obtained: , which implies the suffix link tree of the string

S

and the suffix tree of the string

S^R

are isomorphic: Similarly to the case of left extensions, the following lemma holds for right extensions:

Size

A suffix automaton of the string

S

of length

n>1

has at most

2n-1

states and at most

3n-4

transitions. These bounds are reached on strings

abb\dots bb=ab^

and

abb\dots bc=ab^c

correspondingly. This may be formulated in a stricter way as

, \delta,  \leq , Q,  + n - 2

where

, \delta,

and

, Q,

are the numbers of transitions and states in automaton correspondingly.

Construction

Initially the automaton only consists of a single state corresponding to the empty word, then characters of the string are added one by one and the automaton is rebuilt on each step incrementally.

State updates

After a new character is appended to the string, some equivalence classes are altered. Let

alpha

be the right context of

\alpha

with respect to the language of

\omega

suffixes. Then the transition from

alpha

alpha

after

x

is appended to

\omega

is defined by lemma: After adding

x

to the current word

\omega

the right context of

\alpha

may change significantly only if

\alpha

is a suffix of

\omega x

. It implies equivalence relation

\equiv_

is a refinement of

\equiv_

. In other words, if

alpha =

, then

alpha =

. After the addition of a new character at most two equivalence classes of

\equiv_

will be split and each of them may split in at most two new classes. First, equivalence class corresponding to empty right context is always split into two equivalence classes, one of them corresponding to

\omega x

itself and having

\

as a right context. This new equivalence class contains exactly

\omega x

and all its suffixes that did not occur in

\omega

, as the right context of such words was empty before and contains only empty word now. Given the correspondence between states of the suffix automaton and vertices of the suffix tree, it is possible to find out the second state that may possibly split after a new character is appended. The transition from

\omega

\omega x

corresponds to the transition from

\omega^R

x \omega^R

in the reversed string. In terms of suffix trees it corresponds to the insertion of the new longest suffix

x\omega^R

into the suffix tree of

\omega^R

. At most two new vertices may be formed after this insertion: one of them corresponding to

x \omega^R

, while the other one corresponds to its direct ancestor if there was a branching. Returning to suffix automata, it means the first new state recognizes

\omega x

and the second one (if there is a second new state) is its suffix link. It may be stated as a lemma: It implies that if

\alpha=\beta

(for example, when

x

didn't occur in

\omega

at all and

\alpha=\beta=\varepsilon

), then only the equivalence class corresponding to the empty right context is split. Besides suffix links it is also needed to define final states of the automaton. It follows from structure properties that all suffixes of a word

\alpha

recognized by

q= alpha R

are recognized by some vertex on ''suffix path''

(q, link(q), link^2(q), \dots)

q

. Namely, suffixes with length greater than

len(link(q))

lie in

q

, suffixes with length greater than

len(link(link(q))

but not greater than

len(link(q))

lie in

link(q)

and so on. Thus if the state recognizing

\omega

is denoted by

last

, then all final states (that is, recognizing suffixes of

\omega

) form up the sequence

(last, link(last), link^2(last), \dots)

Transitions and suffix links updates

After the character

x

is appended to

\omega

possible new states of suffix automaton are

omega x Omega X () is a South Korean boy band formed in 2021. The group consists of 11 members: Jaehan, Hwichan, Sebin, Hangyeom, Taedong, Xen, Jehyun, Kevin, Junghoon, Hyuk and Yechan. History Pre-debut All of Omega X's members had appeared in surviva ...

and

alpha

. Suffix link from

goes to

alpha

and from

alpha

it goes to

link(alpha)

. Words from

occur in

\omega x

only as its suffixes therefore there should be no transitions at all from

while transitions to it should go from suffixes of

\omega

having length at least

\alpha

and be marked with the character

x

. State

alpha

is formed by subset of

alpha

, thus transitions from

alpha

should be same as from

alpha

. Meanwhile, transitions leading to

alpha

should go from suffixes of

\omega

having length less than

, \alpha,

and at least

len(link(alpha))

, as such transitions have led to

alpha

before and corresponded to seceded part of this state. States corresponding to these suffixes may be determined via traversal of suffix link path for

Construction algorithm

Theoretical results above lead to the following algorithm that takes character

x

and rebuilds the suffix automaton of

\omega

into the suffix automaton of

\omega x

: # The state corresponding to the word

\omega

is kept as

last

; # After

x

is appended, previous value of

last

is stored in the variable

p

and

last

itself is reassigned to the new state corresponding to

\omega x

; # States corresponding to suffixes of

\omega

are updated with transitions to

last

. To do this one should go through

p, link(p), link^2(p),\dots

, until there is a state that already has a transition by

x

; # Once the aforementioned loop is over, there are 3 cases: ## If none of states on the suffix path had a transition by

x

, then

x

never occurred in

\omega

before and the suffix link from

last

should lead to

q_0

; ## If the transition by

x

is found and leads from the state

p

to the state

q

, such that

len(p)+1=len(q)

, then

q

does not have to be split and it is a suffix link of

last

; ## If the transition is found but

len(q) > len(p)+1

, then words from

q

having length at most

len(p)+1

should be segregated into new "clone" state

cl

; # If the previous step was concluded with the creation of

cl

, transitions from it and its suffix link should copy those of

q

, at the same time

cl

is assigned to be common suffix link of both

q

and

last

; # Transitions that have led to

q

before but corresponded to words of the length at most

len(p)+1

are redirected to

cl

. To do this, one continues going through the suffix path of

p

until the state is found such that transition by

x

from it doesn't lead to

q

. The whole procedure is described by the following pseudo-code: function : define assign assign while is undefined: assign define if : assign else if : assign else: define assign assign assign while : assign Here

q_0

is the initial state of the automaton and

new\_state()

is a function creating new state for it. It is assumed

last

len

link

and

\delta

are stored as global variables.

Complexity

Complexity of the algorithm may vary depending on the underlying structure used to store transitions of the automaton. It may be implemented in

O(n \log , \Sigma, )

with

O(n)

memory overhead or in

O(n)

with

O(n , \Sigma, )

memory overhead if one assumes that memory allocation is done in

O(1)

. To obtain such complexity, one has to use the methods of

amortized analysis In computer science, amortized analysis is a method for analyzing a given algorithm's complexity, or how much of a resource, especially time or memory, it takes to execute. The motivation for amortized analysis is that looking at the worst-case ...

. The value of

len(p)

strictly reduces with each iteration of the cycle while it may only increase by as much as one after the first iteration of the cycle on the next ''add_letter'' call. Overall value of

len(p)

never exceeds

n

and it is only increased by one between iterations of appending new letters that suggest total complexity is at most linear as well. The linearity of the second cycle is shown in a similar way.

Generalizations

The suffix automaton is closely related to other suffix structures and

substring indices In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "''the best of''" is a substring of "''It was the best of times''". In contrast, "''Itwastimes''" is a subsequence ...

. Given a suffix automaton of a specific string one may construct its suffix tree via compacting and recursive traversal in linear time. Similar transforms are possible in both directions to switch between the suffix automaton of

S

and the suffix tree of reversed string

S^R

. Other than this several generalizations were developed to construct an automaton for the set of strings given by trie, compacted suffix automation (CDAWG), to maintain the structure of the automaton on the sliding window, and to construct it in a bidirectional way, supporting the insertion of a characters to both the beginning and the end of the string.

Compacted suffix automaton

As was already mentioned above, a compacted suffix automaton is obtained via both compaction of a regular suffix automaton (by removing states which are non-final and have exactly one out-going arc) and the minimization of a suffix tree. Similarly to the regular suffix automaton, states of compacted suffix automaton may be defined in explicit manner. ''A two-way extension''

\overset

of a word

\gamma

is the longest word

\omega = \beta \gamma \alpha

, such that every occurrence of

\gamma

S

is preceded by

\beta

and succeeded by

\alpha

. In terms of left and right extensions it means that two-way extension is the left extension of the right extension or, which is equivalent, the right extension of the left extension, that is

\overset = \overset = \overset

. In terms of two-way extensions compacted automaton is defined as follows: Two-way extensions induce an equivalence relation

\overset = \overset

which defines the set of words recognized by the same state of compacted automaton. This equivalence relation is a

transitive closure In mathematics, the transitive closure of a binary relation on a set is the smallest relation on that contains and is transitive. For finite sets, "smallest" can be taken in its usual sense, of having the fewest related pairs; for infinit ...

of the relation defined by

(\overset=\overset) \vee (\overset = \overset)

, which highlights the fact that a compacted automaton may be obtained by both gluing suffix tree vertices equivalent via

\overset = \overset

relation (minimization of the suffix tree) and gluing suffix automaton states equivalent via

\overset=\overset

relation (compaction of suffix automaton). If words

\alpha

and

\beta

have same right extensions, and words

\beta

and

\gamma

have same left extensions, then cumulatively all strings

\alpha

\beta

and

\gamma

have same two-way extensions. At the same time it may happen that neither left nor right extensions of

\alpha

and

\gamma

coincide. As an example one may take

S=\beta=ab

\alpha=a

and

\gamma=b

, for which left and right extensions are as follows:

\overset=\overset=ab=\overset = \overset

, but

\overset=b

and

\overset=a

. That being said, while equivalence relations of one-way extensions were formed by some continuous chain of nested prefixes or suffixes, bidirectional extensions equivalence relations are more complex and the only thing one may conclude for sure is that strings with the same two-way extension are ''substrings'' of the longest string having the same two-way extension, but it may even happen that they don't have any non-empty substring in common. The total number of equivalence classes for this relation does not exceed

n+1

which implies that compacted suffix automaton of the string having length

n

has at most

n+1

states. The amount of transitions in such automaton is at most

2n-2

Suffix automaton of several strings

Consider a set of words

T=\

. It is possible to construct a generalization of suffix automaton that would recognize the language formed up by suffixes of all words from the set. Constraints for the number of states and transitions in such automaton would stay the same as for a single-word automaton if you put

n = , S_1,  + , S_2,  + \dots + , S_k,

. The algorithm is similar to the construction of single-word automaton except instead of

last

state, function ''add_letter'' would work with the state corresponding to the word

\omega_i

assuming the transition from the set of words

\

to the set

\

. This idea is further generalized to the case when

T

is not given explicitly but instead is given by a

with

Q

vertices. Mohri ''et al''. showed such an automaton would have at most

2Q-2

and may be constructed in linear time from its size. At the same time, the number of transitions in such automaton may reach

O(Q, \Sigma, )

, for example for the set of words

T=\

over the alphabet

\Sigma=\

the total length of words is equal to

O(n^2+nk)

, the number of vertices in corresponding suffix trie is equal to

O(n+k)

and corresponding suffix automaton is formed of

O(n+k)

states and

O(nk)

transitions. Algorithm suggested by Mohri mainly repeats the generic algorithm for building automaton of several strings but instead of growing words one by one, it traverses the trie in a

breadth-first search Breadth-first search (BFS) is an algorithm for searching a tree data structure for a node that satisfies a given property. It starts at the tree root and explores all nodes at the present depth prior to moving on to the nodes at the next d ...

order and append new characters as it meet them in the traversal, which guarantees amortized linear complexity.

Sliding window

Some compression algorithms, such as

LZ77 LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for many variations including ...

and RLE may benefit from storing suffix automaton or similar structure not for the whole string but for only last

k

its characters while the string is updated. This is because compressing data is usually expressively large and using

O(n)

memory is undesirable. In 1985, Janet Blumer developed an algorithm to maintain a suffix automaton on a sliding window of size

k

O(nk)

worst-case and

O(n \log k)

on average, assuming characters are distributed independently and uniformly. She also showed

O(nk)

complexity cannot be improved: if one considers words construed as a concatenation of several

(ab)^m c (ab)^m d

words, where

k=6m+2

, then the number of states for the window of size

k

would frequently change with jumps of order

m

, which renders even theoretical improvement of

O(nk)

for regular suffix automata impossible. The same should be true for the suffix tree because its vertices correspond to states of the suffix automaton of the reversed string but this problem may be resolved by not explicitly storing every vertex corresponding to the suffix of the whole string, thus only storing vertices with at least two out-going edges. A variation of McCreight's suffix tree construction algorithm for this task was suggested in 1989 by Edward Fiala and Daniel Greene; several years later a similar result was obtained with the variation of Ukkonen's algorithm by Jesper Larsson. The existence of such an algorithm, for compacted suffix automaton that absorbs some properties of both suffix trees and suffix automata, was an open question for a long time until it was discovered by Martin Senft and Tomasz Dvorak in 2008, that it is impossible if the alphabet's size is at least two. One way to overcome this obstacle is to allow window width to vary a bit while staying

O(k)

. It may be achieved by an approximate algorithm suggested by Inenaga et al. in 2004. The window for which suffix automaton is built in this algorithm is not guaranteed to be of length

k

but it is guaranteed to be at least

k

and at most

2k+1

while providing linear overall complexity of the algorithm.

Applications

Suffix automaton of the string

S

may be used to solve such problems as: * Counting the number of distinct substrings of

S

O(, S, )

on-line, * Finding the longest substring of

S

occurring at least twice in

O(, S, )

, * Finding the longest common substring of

S

and

T

O(, T, )

, * Counting the number of occurrences of

T

S

O(, T, )

, * Finding all occurrences of

T

S

O(, T, +k)

, where

k

is the number of occurrences. It is assumed here that

T

is given on the input after suffix automaton of

S

is constructed. Suffix automata are also used in data compression, music retrieval and matching on genome sequences.

References

Bibliography

* * * * * * * * * * * * * * * * * * * * * * * * * * *

External links

*
Suffix automaton
article on E-Maxx Algorithms in English {{Strings , state=collapsed String data structures Finite automata Algorithms on strings