In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to many modern regular expression engines, which are augmented with features that allow recognition of non-regular languages). Alternatively, a regular language can be defined as a language recognized by a finite automaton. The equivalence of regular expressions and finite automata is known as Kleene's theorem (after American mathematician Stephen Cole Kleene). In the Chomsky hierarchy, regular languages are the languages generated by Type-3 grammars.

Formal definition

The collection of regular languages over an alphabet Σ is defined recursively as follows:
* The empty language Ø is a regular language.
* For each ''a'' ∈ Σ (''a'' belongs to Σ), the singleton language {''a''} is a regular language.
* If ''A'' is a regular language, ''A''* (Kleene star) is a regular language. Due to this, the empty string language {ε} is also regular.
* If ''A'' and ''B'' are regular languages, then ''A'' ∪ ''B'' (union) and ''A'' • ''B'' (concatenation) are regular languages.
* No other languages over Σ are regular.

See regular expression for syntax and semantics of regular expressions.

Examples

All finite languages are regular; in particular the empty string language {ε} = Ø* is regular. Other typical examples include the language consisting of all strings over the alphabet {''a'', ''b''} which contain an even number of ''a''s, or the language consisting of all strings of the form: several ''a''s followed by several ''b''s.

A simple example of a language that is not regular is the set of strings { ''a''^''n'' ''b''^''n'' | ''n'' ≥ 0 }. Intuitively, it cannot be recognized with a finite automaton, since a finite automaton has finite memory and cannot remember the exact number of ''a''s. Techniques to prove this fact rigorously are given below.
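The first example above, strings over {''a'', ''b''} containing an even number of ''a''s, can be recognized with just two states. The following Python sketch simulates such an automaton; the function name `accepts_even_as` and the state labels `"even"` and `"odd"` are illustrative choices made here, not standard notation.

```python
def accepts_even_as(word: str) -> bool:
    """Simulate a two-state DFA over {a, b} accepting words with an even number of a's."""
    # Transition table: reading 'a' flips the parity, reading 'b' keeps it.
    transitions = {
        ("even", "a"): "odd",
        ("even", "b"): "even",
        ("odd", "a"): "even",
        ("odd", "b"): "odd",
    }
    state = "even"  # start state; zero a's seen so far
    for symbol in word:
        if (state, symbol) not in transitions:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet {{a, b}}")
        state = transitions[(state, symbol)]
    return state == "even"  # "even" is the only accepting state


print(accepts_even_as("abab"))  # two a's
print(accepts_even_as("ab"))    # one a
```

The automaton only ever remembers one bit (the parity), which is why finite memory suffices here, whereas matching the number of ''a''s against the number of ''b''s would require an unbounded counter.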

Equivalent formalisms

A regular language satisfies the following equivalent properties:
# it is the language of a regular expression (by the above definition)
# it is the language accepted by a nondeterministic finite automaton (NFA) (1. ⇒ 2. by Thompson's construction algorithm; 2. ⇒ 1. by Kleene's algorithm or using Arden's lemma)
# it is the language accepted by a deterministic finite automaton (DFA) (2. ⇒ 3. by the powerset construction; 3. ⇒ 2. since the former definition is stronger than the latter)
# it can be generated by a regular grammar (2. ⇒ 4. see Hopcroft, Ullman (1979), Theorem 9.2, p.219; 4. ⇒ 2. see Hopcroft, Ullman (1979), Theorem 9.1, p.218)
# it is the language accepted by an alternating finite automaton
# it is the language accepted by a two-way finite automaton
# it can be generated by a prefix grammar
# it can be accepted by a read-only Turing machine
# it can be defined in monadic second-order logic (Büchi–Elgot–Trakhtenbrot theorem)
# it is recognized by some finite syntactic monoid ''M'', meaning it is the preimage of a subset ''S'' of a finite monoid ''M'' under a monoid homomorphism ''f'': Σ^* → ''M'' from the free monoid on its alphabet (3. ⇔ 10. by the Myhill–Nerode theorem)
# the number of equivalence classes of its syntactic congruence is finite, where ''u''~''v'' is defined as: ''uw''∈''L'' if and only if ''vw''∈''L'' for all ''w''∈Σ^* (3. ⇔ 11. see the proof in the ''Syntactic monoid'' article). This number equals the number of states of the minimal deterministic finite automaton accepting ''L''.

Properties 10. and 11. are purely algebraic approaches to define regular languages; a similar set of statements can be formulated for a monoid ''M'' ⊆ Σ^*. In this case, equivalence over ''M'' leads to the concept of a recognizable language.

Some authors use one of the above properties other than "1." as an alternative definition of regular languages.

Some of the equivalences above, particularly those among the first four formalisms, are called ''Kleene's theorem'' in textbooks. Precisely which one (or which subset) is called such varies between authors. One textbook calls the equivalence of regular expressions and NFAs ("1." and "2." above) "Kleene's theorem". Another textbook calls the equivalence of regular expressions and DFAs ("1." and "3." above) "Kleene's theorem". Two other textbooks first prove the expressive equivalence of NFAs and DFAs ("2." and "3.") and then state "Kleene's theorem" as the equivalence between regular expressions and finite automata (the latter said to describe "recognizable languages"). A linguistically oriented text first equates regular grammars ("4." above) with DFAs and NFAs, calls the languages generated by (any of) these "regular", after which it introduces regular expressions, which it says describe "rational languages", and finally states "Kleene's theorem" as the coincidence of regular and rational languages. Other authors simply ''define'' "rational expression" and "regular expression" as synonymous and do the same with "rational languages" and "regular languages".

Apparently, the term ''"regular"'' originates from a 1951 technical report where Kleene introduced ''"regular events"'' and explicitly welcomed ''"any suggestions as to a more descriptive term"''.
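Among the conversions listed above, the powerset construction (2. ⇒ 3.) is short enough to sketch directly. The Python code below is a minimal illustration under stated assumptions: the NFA has no ε-transitions, its transition relation is given as a dictionary from (state, symbol) pairs to sets of states, and the names `powerset_construction` and `dfa_accepts` are hypothetical helpers introduced only for this example.

```python
def powerset_construction(alphabet, delta, start, accepting):
    """Convert an NFA without ε-transitions into an equivalent DFA.

    delta maps (state, symbol) to a set of successor states; missing keys
    mean "no transition". DFA states are frozensets of NFA states.
    """
    start_set = frozenset([start])
    dfa_delta = {}
    dfa_accepting = set()
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        # A subset is accepting if it contains at least one accepting NFA state.
        if current & accepting:
            dfa_accepting.add(current)
        for symbol in alphabet:
            target = frozenset(
                q for s in current for q in delta.get((s, symbol), ())
            )
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return seen, dfa_delta, start_set, dfa_accepting


def dfa_accepts(dfa_delta, start, accepting, word):
    """Run the constructed DFA on a word."""
    state = start
    for symbol in word:
        state = dfa_delta[(state, symbol)]
    return state in accepting


# Example NFA: words over {a, b} whose second-to-last symbol is 'a'.
nfa_delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
}
states, d, s0, acc = powerset_construction("ab", nfa_delta, 0, {2})
print(len(states))                    # number of reachable DFA states
print(dfa_accepts(d, s0, acc, "ab"))  # second-to-last symbol is 'a'
```

Only subsets reachable from the start set are generated, so the result is often far smaller than the full powerset, although in the worst case it can still be exponential in the number of NFA states.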

Noam Chomsky, in his 1959 seminal article, at first used the term ''"regular"'' with a different meaning (referring to what is called ''"Chomsky normal form"'' today; see Definition 8, p.149 there), but noticed that his ''"finite state languages"'' were equivalent to Kleene's ''"regular events"''.

Closure properties

The regular languages are closed under various operations, that is, if the languages ''K'' and ''L'' are regular, so is the result of the following operations:
* the set-theoretic Boolean operations: union ''K'' ∪ ''L'', intersection ''K'' ∩ ''L'', and complement Σ^* ∖ ''L'', hence also relative complement ''K'' ∖ ''L'' (Salomaa (1981) p.28).
* the regular operations: ''K'' ∪ ''L'', concatenation ''K'' • ''L'', and Kleene star ''L''* (Salomaa (1981) p.27).
* the trio operations: string homomorphism, inverse string homomorphism, and intersection with regular languages. As a consequence they are closed under arbitrary finite state transductions, like quotient ''K'' / ''L'' with a regular language. Even more, regular languages are closed under quotients with ''arbitrary'' languages: if ''L'' is regular then ''L'' / ''K'' is regular for any ''K''.
* the reverse (or mirror image) ''L''^R. Given a nondeterministic finite automaton to recognize ''L'', an automaton for ''L''^R can be obtained by reversing all transitions and interchanging starting and finishing states. This may result in multiple starting states; ε-transitions can be used to join them.
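Closure under the Boolean operations is commonly shown with a product automaton that runs two DFAs in parallel. The following Python sketch illustrates that idea; it assumes both DFAs are complete (every state has a transition for every symbol) and represents a DFA as a hypothetical triple `(delta, start, accepting)`. The names `product_dfa` and `run` are introduced here for illustration only.

```python
def product_dfa(dfa_k, dfa_l, alphabet, combine):
    """Build a DFA whose states are pairs of states of the two input DFAs.

    combine(in_k, in_l) decides acceptance from the two component verdicts,
    e.g. logical "and" for intersection or logical "or" for union.
    """
    delta_k, start_k, acc_k = dfa_k
    delta_l, start_l, acc_l = dfa_l
    start = (start_k, start_l)
    delta, accepting = {}, set()
    seen, worklist = {start}, [start]
    while worklist:
        state = worklist.pop()
        p, q = state
        if combine(p in acc_k, q in acc_l):
            accepting.add(state)
        for symbol in alphabet:
            target = (delta_k[(p, symbol)], delta_l[(q, symbol)])
            delta[(state, symbol)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return delta, start, accepting


def run(dfa, word):
    """Return True if the DFA accepts the word."""
    delta, start, accepting = dfa
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


ALPHABET = "ab"
# K: words with an even number of a's.
EVEN_AS = (
    {("even", "a"): "odd", ("even", "b"): "even",
     ("odd", "a"): "even", ("odd", "b"): "odd"},
    "even",
    {"even"},
)
# L: words ending in b.
ENDS_IN_B = (
    {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1},
    0,
    {1},
)

both = product_dfa(EVEN_AS, ENDS_IN_B, ALPHABET, lambda in_k, in_l: in_k and in_l)
either = product_dfa(EVEN_AS, ENDS_IN_B, ALPHABET, lambda in_k, in_l: in_k or in_l)
print(run(both, "aab"), run(either, "ab"))
```

Passing `lambda in_k, in_l: in_k and not in_l` as `combine` would give the relative complement ''K'' ∖ ''L'' in the same way.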

Decidability properties

Given two deterministic finite automata ''A'' and ''B'', it is decidable whether they accept the same language. As a consequence, using the above closure properties, the following problems are also decidable for arbitrarily given deterministic finite automata ''A'' and ''B'', with accepted languages ''L''_''A'' and ''L''_''B'', respectively:
* Containment: is ''L''_''A'' ⊆ ''L''_''B''? (Check if ''L''_''A'' ∩ ''L''_''B'' = ''L''_''A''. Deciding this property is NP-hard in general; see File:RegSubsetNP.pdf for an illustration of the proof idea.)
* Disjointness: is ''L''_''A'' ∩ ''L''_''B'' = Ø?
* Emptiness: is ''L''_''A'' = Ø?
* Universality: is ''L''_''A'' = Σ^*?
* Membership: given ''a'' ∈ Σ^*, is ''a'' ∈ ''L''_''B''?

For regular expressions, the universality problem is NP-complete already for a singleton alphabet. For larger alphabets, that problem is PSPACE-complete. If regular expressions are extended to allow also a ''squaring operator'', with "''A''²" denoting the same as "''AA''", still just regular languages can be described, but the universality problem has an exponential space lower bound, and is in fact complete for exponential space with respect to polynomial-time reduction.

For a fixed finite alphabet, the theory of the set of all languages (together with strings, membership of a string in a language, and, for each character, a function to append the character to a string, and no other operations) is decidable, and its minimal elementary substructure consists precisely of regular languages. For a binary alphabet, the theory is called S2S.
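Two of the decision problems above, emptiness and language equivalence of DFAs, reduce to a reachability search. The Python sketch below shows one way this can be done; it assumes complete DFAs given as hypothetical triples `(delta, start, accepting)`, and the function names `is_empty` and `equivalent` are chosen here for illustration.

```python
from collections import deque


def is_empty(dfa, alphabet):
    """A DFA accepts no word iff no accepting state is reachable from the start."""
    delta, start, accepting = dfa
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state in accepting:
            return False
        for symbol in alphabet:
            target = delta[(state, symbol)]
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return True


def equivalent(dfa_a, dfa_b, alphabet):
    """Explore reachable state pairs; the languages differ iff some reachable
    pair has exactly one accepting component."""
    delta_a, start_a, acc_a = dfa_a
    delta_b, start_b, acc_b = dfa_b
    start = (start_a, start_b)
    seen, queue = {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if (p in acc_a) != (q in acc_b):
            return False
        for symbol in alphabet:
            target = (delta_a[(p, symbol)], delta_b[(q, symbol)])
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return True


# Even number of a's, with two states.
EVEN_AS = (
    {("even", "a"): "odd", ("even", "b"): "even",
     ("odd", "a"): "even", ("odd", "b"): "odd"},
    "even",
    {"even"},
)
# Number of a's counted modulo 4.
mod4_delta = {}
for i in range(4):
    mod4_delta[(i, "a")] = (i + 1) % 4
    mod4_delta[(i, "b")] = i
MOD4_EVEN = (mod4_delta, 0, {0, 2})  # same language as EVEN_AS
MOD4_ZERO = (mod4_delta, 0, {0})     # number of a's divisible by 4

print(equivalent(EVEN_AS, MOD4_EVEN, "ab"))
print(equivalent(EVEN_AS, MOD4_ZERO, "ab"))
```

Both searches visit each state (or state pair) at most once, so for DFAs they run in time polynomial in the size of the automata.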

Complexity results

In computational complexity theory, the complexity class of all regular languages is sometimes referred to as REGULAR or REG and equals DSPACE(O(1)), the decision problems that can be solved in constant space (the space used is independent of the input size). REGULAR ≠ AC⁰, since it (trivially) contains the parity problem of determining whether the number of 1 bits in the input is even or odd and this problem is not in AC⁰. On the other hand, REGULAR does not contain AC⁰, because the nonregular language of palindromes, or the nonregular language <math>\{0^n 1^n : n \in \mathbb N\}</math> can both be recognized in AC⁰.

If a language is ''not'' regular, it requires a machine with at least Ω(log log ''n'') space to recognize (where ''n'' is the input size). In other words, DSPACE(o(log log ''n'')) equals the class of regular languages. In practice, most nonregular problems are solved by machines taking at least logarithmic space.

Location in the Chomsky hierarchy

To locate the regular languages in the Chomsky hierarchy, one notices that every regular language is context-free. The converse is not true: for example, the language consisting of all strings having the same number of ''a''s as ''b''s is context-free but not regular. To prove that a language is not regular, one often uses the Myhill–Nerode theorem and the pumping lemma. Other approaches include using the closure properties of regular languages or quantifying Kolmogorov complexity.

Important subclasses of regular languages include
* Finite languages, those containing only a finite number of words. These are regular languages, as one can create a regular expression that is the union of every word in the language.
* Star-free languages, those that can be described by a regular expression constructed from the empty symbol, letters, concatenation and all Boolean operators (see algebra of sets) including complementation but not the Kleene star: this class includes all finite languages.

The number of words in a regular language

Let <math>s_L(n)</math> denote the number of words of length <math>n</math> in <math>L</math>. The ordinary generating function for ''L'' is the formal power series

: <math>S_L(z) = \sum_{n \ge 0} s_L(n) z^n \ .</math>

The generating function of a language ''L'' is a rational function if ''L'' is regular. Hence for every regular language <math>L</math> the sequence <math>s_L(n)_{n \geq 0}</math> is constant-recursive; that is, there exist an integer constant <math>n_0</math>, complex constants <math>\lambda_1,\,\ldots,\,\lambda_k</math> and complex polynomials <math>p_1(x),\,\ldots,\,p_k(x)</math> such that for every <math>n \geq n_0</math> the number <math>s_L(n)</math> of words of length <math>n</math> in <math>L</math> is

: <math>s_L(n)=p_1(n)\lambda_1^n+\dotsb+p_k(n)\lambda_k^n \ .</math>

Thus, non-regularity of certain languages <math>L'</math> can be proved by counting the words of a given length in <math>L'</math>. Consider, for example, the Dyck language of strings of balanced parentheses. The number of words of length <math>2n</math> in the Dyck language is equal to the Catalan number <math>C_n\sim\frac{4^n}{n^{3/2}\sqrt{\pi}}</math>, which is not of the form <math>p(n)\lambda^n</math>, witnessing the non-regularity of the Dyck language.

Care must be taken since some of the eigenvalues <math>\lambda_i</math> could have the same magnitude. For example, the number of words of length <math>n</math> in the language of all binary words of even length is not of the form <math>p(n)\lambda^n</math>, but the numbers of words of even length and of odd length are each of this form; the corresponding eigenvalues are <math>2,-2</math>. In general, for every regular language there exists a constant <math>d</math> such that for all <math>a</math>, the number of words of length <math>dm+a</math> is asymptotically <math>C_a m^{p_a} \lambda_a^m</math>.

The ''zeta function'' of a language ''L'' is

: <math>\zeta_L(z) = \exp \left( \sum_{n \ge 1} s_L(n) \frac{z^n}{n} \right) \ .</math>

The zeta function of a regular language is not in general rational, but that of an arbitrary cyclic language is.
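The counting sequence of a regular language can be computed directly from a DFA, because each word labels exactly one path from the start state. The Python sketch below is one possible way to do this by dynamic programming over states; the triple `(delta, start, accepting)` and the name `count_words` are assumptions made for this illustration. It is applied to the language of binary words of even length discussed above, whose counts match (2^''n'' + (−2)^''n'')/2, the combination of the two eigenvalues 2 and −2.

```python
def count_words(dfa, alphabet, n):
    """Count the words of length n accepted by a complete DFA."""
    delta, start, accepting = dfa
    # counts[state] = number of words of the current length leading to state
    counts = {start: 1}
    for _ in range(n):
        next_counts = {}
        for state, ways in counts.items():
            for symbol in alphabet:
                target = delta[(state, symbol)]
                next_counts[target] = next_counts.get(target, 0) + ways
        counts = next_counts
    return sum(ways for state, ways in counts.items() if state in accepting)


# DFA over {0, 1} accepting exactly the words of even length.
EVEN_LENGTH = (
    {("even", "0"): "odd", ("even", "1"): "odd",
     ("odd", "0"): "even", ("odd", "1"): "even"},
    "even",
    {"even"},
)

observed = [count_words(EVEN_LENGTH, "01", n) for n in range(6)]
closed_form = [(2**n + (-2)**n) // 2 for n in range(6)]
print(observed)
print(observed == closed_form)
```

The update step is a multiplication by the transition matrix of the automaton, which is where the eigenvalues in the formula for <math>s_L(n)</math> come from.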

Generalizations

The notion of a regular language has been generalized to infinite words (see ω-automata) and to trees (see tree automaton).

Rational set generalizes the notion (of regular/rational language) to monoids that are not necessarily free. Likewise, the notion of a recognizable language (by a finite automaton) has a namesake, the recognizable set, over a monoid that is not necessarily free. Howard Straubing notes in relation to these facts that “The term "regular language" is a bit unfortunate. Papers influenced by Eilenberg's monograph often use either the term "recognizable language", which refers to the behavior of automata, or "rational language", which refers to important analogies between regular expressions and rational power series. (In fact, Eilenberg defines rational and recognizable subsets of arbitrary monoids; the two notions do not, in general, coincide.) This terminology, while better motivated, never really caught on, and "regular language" is used almost universally.” (Eilenberg's monograph appeared in two volumes, "A" (1974) and "B" (1976), the latter with two chapters by Bret Tilson.)

Rational series is another generalization, this time in the context of a formal power series over a semiring. This approach gives rise to weighted rational expressions and weighted automata. In this algebraic context, the regular languages (corresponding to Boolean-weighted rational expressions) are usually called ''rational languages''. Also in this context, Kleene's theorem finds a generalization called the Kleene–Schützenberger theorem.

Learning from examples

Notes

References

* Chapter 1: Regular Languages, pp. 31–90. Subsection "Decidable Problems Concerning Regular Languages" of section 4.1: Decidable Languages, pp. 152–155.
* Philippe Flajolet and Robert Sedgewick, ''Analytic Combinatorics'': Symbolic Combinatorics. Online book, 2002.
