
In
computer science
Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, a deterministic acyclic finite state automaton (DAFSA),
[Jan Daciuk, Stoyan Mihov, Bruce Watson and Richard Watson (2000). Incremental construction of minimal acyclic finite state automata. Computational Linguistics 26(1):3-16.]
is a
data structure
In computer science, a data structure is a data organization and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the relationships amo ...
that represents a set of
strings, and allows for a query operation that tests whether a given string belongs to the set in time proportional to its length. Algorithms exist to construct and maintain such
automata
An automaton (; : automata or automatons) is a relatively self-operating machine, or control mechanism designed to automatically follow a sequence of operations, or respond to predetermined instructions. Some automata, such as bellstrikers i ...
,
while keeping them
minimal.
DAFSA is the rediscovery of a data structure called Directed Acyclic Word Graph (DAWG),
[Appel, Andrew; Jacobsen, Guy (1988). The World's Fastest Scrabble Program. Communications of the ACM, 31 (5): 572–578] although the same name had already been given to a different data structure which is related to
suffix automaton.
[Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler, Ross M. McConnell (1983). Linear size finite automata for the set of all subwords of a word - an outline of results. Bull Europ. Assoc. Theoret. Comput. Sci, 21: 12-20]
A DAFSA is a special case of a
finite state recognizer that takes the form of a
directed acyclic graph
In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one ...
with a single source vertex (a vertex with no incoming edges), in which each edge of the graph is labeled by a letter or symbol, and in which each vertex has at most one outgoing edge for each possible letter or symbol. The strings represented by the DAFSA are formed by the symbols on paths in the graph from the source vertex to any sink vertex (a vertex with no outgoing edges). In fact, a
deterministic finite state automaton is acyclic
if and only if
In logic and related fields such as mathematics and philosophy, "if and only if" (often shortened as "iff") is paraphrased by the biconditional, a logical connective between statements. The biconditional is true in two cases, where either bo ...
it recognizes a
finite set of strings.
History
Blumer et al
first defined terminology Directed Acyclic Word Graph (DAWG) in 1983. Appel and Jacobsen
used the same naming for a different data structure in 1988. Independent of earlier work, Daciuk et al
rediscovered the latter data structure in 2000 but called it DAFSA.
Comparison to tries
By allowing the same vertices to be reached by multiple paths, a DAFSA may use significantly fewer vertices than the strongly related
trie
In computer science, a trie (, ), also known as a digital tree or prefix tree, is a specialized search tree data structure used to store and retrieve strings from a dictionary or set. Unlike a binary search tree, nodes in a trie do not store t ...
data structure. Consider, for example, the four English words "tap", "taps", "top", and "tops". A trie for those four words would have 12 vertices, one for each of the strings formed as a prefix of one of these words, or for one of the words followed by the end-of-string marker. However, a DAFSA can represent these same four words using only six vertices ''v
i'' for 0 ≤ ''i'' ≤ 5, and the following edges: an edge from ''v''
0 to ''v''
1 labeled "t", two edges from ''v''
1 to ''v''
2 labeled "a" and "o", an edge from ''v''
2 to ''v''
3 labeled "p", an edge ''v''
3 to ''v''
4 labeled "s", and edges from ''v''
3 and ''v''
4 to ''v''
5 labeled with the end-of-string marker. There is a tradeoff between memory and functionality, because a standard DAFSA can tell you if a word exists within it, but it cannot point you to auxiliary information about that word, whereas a trie can.
The primary difference between DAFSA and trie is the elimination of suffix and infix redundancy in storing strings. The trie eliminates prefix redundancy since all common prefixes are shared between strings, such as between ''doctors'' and ''doctorate'' the ''doctor'' prefix is shared. In a DAFSA common suffixes are also shared, for words that have the same set of possible suffixes as each other. For dictionary sets of common English words, this translates into major memory usage reduction.
Because the terminal nodes of a DAFSA can be reached by multiple paths, a DAFSA cannot directly store auxiliary information relating to each path, e.g. a word's frequency in the English language. However, if for each node we store the number of unique paths through that point in the structure, we can use it to retrieve the index of a word, or a word given its index.
The auxiliary information can then be stored in an array.
References
*
* . One of the early mentions of the data structure.
* .
*
* An
open source
Open source is source code that is made freely available for possible modification and redistribution. Products include permission to use and view the source code, design documents, or content of the product. The open source model is a decentrali ...
Python implementation.
External links
*
Directed Acyclic Word Graph or DAWG – JohnPaul Adamovsky teaches how to construct a DAFSA using an array of integers ()
*
– JohnPaul Adamovsky teaches how to construct a DAFSA hash function using a novel encoding with multiple integer arrays ()
{{Strings
Graph data structures
String data structures
Finite-state machines