Glushkov's Construction Algorithm
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
theory – particularly
formal language theory In logic, mathematics, computer science, and linguistics, a formal language is a set of string (computer science), strings whose symbols are taken from a set called "#Definition, alphabet". The alphabet of a formal language consists of symbol ...
– Glushkov's construction algorithm, invented by Victor Mikhailovich Glushkov, transforms a given
regular expression A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
into an equivalent
nondeterministic finite automaton In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if * each of its transitions is ''uniquely'' determined by its source state and input symbol, and * reading an input symbol is required for each state tr ...
(NFA). Thus, it forms a bridge between regular expressions and nondeterministic finite automata: two abstract representations of the same class of
formal language In logic, mathematics, computer science, and linguistics, a formal language is a set of strings whose symbols are taken from a set called "alphabet". The alphabet of a formal language consists of symbols that concatenate into strings (also c ...
s. A regular expression may be used to conveniently describe an advanced search pattern in a "find and replace"–like operation of a
text processing In computing, the term text processing refers to the theory and practice of automating the creation or manipulation of electronic text. ''Text'' usually refers to all the alphanumeric characters specified on the keyboard of the person engaging th ...
utility. Glushkov's algorithm can be used to transform it into an NFA, which furthermore is small by nature, as the number of its states equals the number of symbols of the regular expression, plus one. Subsequently, the NFA can be made deterministic by the
powerset construction In the theory of computation and automata theory, the powerset construction or subset construction is a standard method for converting a nondeterministic finite automaton (NFA) into a deterministic finite automaton (DFA) which recognizes the same f ...
and then be minimized to get an optimal automaton corresponding to the given regular expression. The latter format is best suited for execution on a computer. From another, more theoretical point of view, Glushkov's algorithm is a part of the proof that NFA and regular expressions both accept exactly the same languages; that is, the
regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...
s. The converse of Glushkov's algorithm is
Kleene's algorithm In theoretical computer science, in particular in formal language theory, Kleene's algorithm transforms a given nondeterministic finite automaton (NFA) into a regular expression. Together with other conversion algorithms, it establishes the equival ...
, which transforms a finite automaton into a regular expression. The automaton obtained by Glushkov's construction is the same as the one obtained by
Thompson's construction algorithm In computer science, Thompson's construction algorithm, also called the McNaughton–Yamada–Thompson algorithm, is a method of transforming a regular expression into an equivalent nondeterministic finite automaton (NFA). This NFA can be used to ...
, once its ε-transitions are removed.


Construction

Given a regular expression , the Glushkov Construction Algorithm creates a non-deterministic automaton that accepts the language L(e) accepted by . The construction uses four steps:


Step 1

Linearisation of the expression. Each letter of the alphabet appearing in the expression is renamed, so that each letter occurs at most once in the new expression e'. Glushkov's construction essentially relies on the fact that e' represents a
local language Local may refer to: Geography and transportation * Local (train), a train serving local traffic demand * Local, Missouri, a community in the United States Arts, entertainment, and media * ''Local'' (comics), a limited series comic book by Bria ...
L(e'). Let be the old alphabet and let be the new one.


Step 2a

Computation of the sets P(e'), D(e'), and F(e'). The first, P(e'), is the set of letters which occurs as first letter of a word of L(e'). The second, D(e'), is the set of letters that can end a word of L(e'). The last one, F(e'), is the set of letter pairs that can occur in words of L(e'), i.e. it is the set of factors of length two of the words of L(e'). Those sets are mathematically defined by :P(e')=\, :D(e')=\, :F(e')=\. They are computed by induction over the structure of the expression, as explained below.


Step 2b

Computation of the set \Lambda(e') which contains the empty word if this word belongs to L(e'), and is the empty set otherwise. Formally, this is \Lambda(e')=\\cap L(e'), where \varepsilon denotes the empty word.


Step 3

Computation of automaton recognizing the ''local language'', as defined by P(e'), D(e'), F(e'), and \Lambda(e'). By definition, the local language defined by the sets , , and is the set of words which begin with a letter of , end by a letter of , and whose factors of length 2 belong to , optionally also including the empty word; that is, it is the language: :L'=(PB^*\cap B^*D) \setminus B^*(B^2\setminus F)B^* \cup \Lambda(e'). Strictly speaking, it is the computation of the automaton for the local language denoted by this linearised expression that is Glushkov's construction.


Step 4

Remove the linearisation, replacing each indexed letter by the original letter of .


Example

Consider the regular expression e = (a(ab)^*)^* + (ba)^*.


Computation of the set of letters

The computation of the sets , , , and is done inductively over the
regular expression A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
e'. One must give the values for (the symbols for the empty language and the singleton language containing the empty word), the letters, and the results of the operations +,\cdot,^*. The most costly operations are the products of sets for the computation of .


Properties

The obtained automaton is non-deterministic, and it has as many states as the number of letters of the regular expression, plus one. Furthermore, it has been shown that Glushkov's automaton is the same as Thompson's automaton when the ε-transitions are removed.


Applications and deterministic expressions

The computation of the automaton by the expression occurs often; it has been systematically used in search functions, in particular by the
Unix Unix (, ; trademarked as UNIX) is a family of multitasking, multi-user computer operating systems that derive from the original AT&T Unix, whose development started in 1969 at the Bell Labs research center by Ken Thompson, Dennis Ritchie, a ...
grep grep is a command-line utility for searching plaintext datasets for lines that match a regular expression. Its name comes from the ed command g/re/p (global regular expression search and print), which has the same effect. grep was originally de ...
command. Similarly,
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding electronic document, documents in a format that is both human-readable and Machine-r ...
's specification also uses such constructions; for more efficiency, regular expressions of a certain kind, called ''deterministic expressions'', have been studied.


See also

*
Аналитик Analitik () is a programming language, developed in 1968 at the Institute of Cybernetics of the Academy of Sciences of the Ukrainian SSR in the USSR. It is a development on the ALMIR-65 language, keeping compatibility with it. Distinctive featu ...


Notes and references

{{Reflist, 30em


External links


A Unified Construction of the Glushkov, Follow, and Antimirov Automata

Algorithms and Computation: 14th International Symposium, ISAAC
Finite-state machines