In
formal language theory
In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.
The alphabet of a formal language consists of sy ...
, the Chomsky–Schützenberger enumeration theorem is a theorem derived by
Noam Chomsky
Avram Noam Chomsky (born December 7, 1928) is an American public intellectual: a linguist, philosopher, cognitive scientist, historian, social critic, and political activist. Sometimes called "the father of modern linguistics", Chomsky is ...
and
Marcel-Paul Schützenberger about the number of words of a given length generated by an
unambiguous context-free grammar. The theorem provides an unexpected link between the theory of
formal languages
In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.
The alphabet of a formal language consists of sy ...
and
abstract algebra
In mathematics, more specifically algebra, abstract algebra or modern algebra is the study of algebraic structures. Algebraic structures include groups, rings, fields, modules, vector spaces, lattices, and algebras over a field. The te ...
.
Statement
In order to state the theorem, a few notions from algebra and formal language theory are needed.
Let
denote the set of nonnegative integers. A ''power series'' over
is an
infinite series
In mathematics, a series is, roughly speaking, a description of the operation of adding infinitely many quantities, one after the other, to a given starting quantity. The study of series is a major part of calculus and its generalization, ma ...
of the form
:
with coefficients
in
. The ''multiplication'' of two formal
power series
In mathematics, a power series (in one variable) is an infinite series of the form
\sum_^\infty a_n \left(x - c\right)^n = a_0 + a_1 (x - c) + a_2 (x - c)^2 + \dots
where ''an'' represents the coefficient of the ''n''th term and ''c'' is a con ...
and
is defined in the expected way as the
convolution
In mathematics (in particular, functional analysis), convolution is a mathematical operation on two functions ( and ) that produces a third function (f*g) that expresses how the shape of one is modified by the other. The term ''convolution' ...
of the sequences
and
:
:
In particular, we write
,
, and so on. In analogy to
algebraic numbers, a power series
is called ''algebraic'' over
, if there exists a finite set of polynomials
each with
rational
Rationality is the quality of being guided by or based on reasons. In this regard, a person acts rationally if they have a good reason for what they do or a belief is rational if it is based on strong evidence. This quality can apply to an abil ...
coefficients such that
:
A context-free grammar is said to be ''unambiguous'' if every
string
String or strings may refer to:
*String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects
Arts, entertainment, and media Films
* ''Strings'' (1991 film), a Canadian anim ...
generated by the grammar admits a unique parse tree
or, equivalently, only one
leftmost derivation
In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules are of the form
:A\ \to\ \alpha
with A a ''single'' nonterminal symbol, and \alpha a string of terminals and/or nonterminals (\alpha can be empt ...
.
Having established the necessary notions, the theorem is stated as follows.
:Chomsky–Schützenberger theorem. If
is a
context-free language
In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).
Context-free languages have many applications in programming languages, in particular, most arithmetic expressions are generated by ...
admitting an unambiguous context-free grammar, and
is the number of words of length
in
, then
is a power series over
that is algebraic over
.
Proofs of this theorem are given by , and by .
Usage
Asymptotic estimates
The theorem can be used in
analytic combinatorics
In combinatorics, the symbolic method is a technique for counting combinatorial objects. It uses the internal structure of the objects to derive formulas for their generating functions. The method is mostly associated with Philippe Flajolet an ...
to estimate the number of words of length ''n'' generated by a given unambiguous context-free grammar, as ''n'' grows large. The following example is given by : the unambiguous context-free grammar ''G'' over the alphabet has start symbol ''S'' and the following rules
:''S'' → ''M'' , ''U''
:''M'' → 0''M''1''M'' , ''ε''
:''U'' → 0''S'' , 0''M''1''U''.
To obtain an algebraic representation of the power series associated with a given context-free grammar ''G'', one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by ''x'', each occurrence of ''ε'' by the integer '1', each occurrence of '→' by '=', and each occurrence of ', ' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:
:''S'' = ''M'' + ''U''
:''M'' = ''M''²''x''² + 1
:''U'' = ''Sx'' + ''MUx''²
In this system of equations, ''S'', ''M'', and ''U'' are functions of ''x'', so one could also write , , and . The equation system can be resolved after ''S'', resulting in a single algebraic equation:
:.
This quadratic equation has two solutions for ''S'', one of which is the algebraic power series . By applying methods from complex analysis to this equation, the number
of words of length ''n'' generated by ''G'' can be estimated, as ''n'' grows large. In this case, one obtains
but
for each
.
The following example is from :
which simplifies to
Inherent ambiguity
In classical formal language theory, the theorem can be used to prove that certain context-free languages are
inherently ambiguous.
For example, the ''Goldstine language''
over the alphabet
consists of the words
with
,
for
, and
for some
.
It is comparably easy to show that the language
is context-free. The harder part is to show that there does not exist an unambiguous grammar that generates
. This can be proved as follows:
If
denotes the number of words of length
in
, then for the associated power series holds
.
Using methods from
complex analysis
Complex analysis, traditionally known as the theory of functions of a complex variable, is the branch of mathematical analysis that investigates functions of complex numbers. It is helpful in many branches of mathematics, including algebra ...
, one can prove that this function is not algebraic over
. By the Chomsky-Schützenberger theorem, one can conclude that
does not admit an unambiguous context-free grammar.
[See for detailed account.]
Notes
References
*
*
*
*
*
*
*
{{DEFAULTSORT:Chomsky-Schutzenberger enumeration theorem
Noam Chomsky
Formal languages
Theorems in discrete mathematics