
In formal language theory, a context-free grammar, ''G'', is said to be in Chomsky normal form (first described by Noam Chomsky) if all of its production rules are of the form:
: ''A'' → ''BC'', or
: ''A'' → ''a'', or
: ''S'' → ε,
where ''A'', ''B'', and ''C'' are nonterminal symbols, the letter ''a'' is a terminal symbol (a symbol that represents a constant value), ''S'' is the start symbol, and ε denotes the empty string. Also, neither ''B'' nor ''C'' may be the start symbol, and the third production rule can only appear if ε is in ''L''(''G''), the language produced by the context-free grammar ''G''. Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one (that is, one that produces the same language) which is in Chomsky normal form and has a size no larger than the square of the original grammar's size.
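The three rule shapes can be checked mechanically. The following is a minimal sketch, assuming a grammar is given as a list of `(lhs, rhs)` pairs with `rhs` a tuple of symbols, plus an explicit set of nonterminals; this representation and the name `is_cnf` are illustrative choices, not part of any standard library.

```python
def is_cnf(rules, nonterminals, start):
    """Return True if every rule is A -> B C, A -> a, or start -> epsilon."""
    for lhs, rhs in rules:
        if len(rhs) == 2:
            # A -> B C: both symbols are nonterminals, neither is the start symbol
            if not all(x in nonterminals and x != start for x in rhs):
                return False
        elif len(rhs) == 1:
            # A -> a: the single symbol must be a terminal
            if rhs[0] in nonterminals:
                return False
        elif len(rhs) == 0:
            # the empty right-hand side is only allowed for the start symbol
            if lhs != start:
                return False
        else:
            return False
    return True
```

A rule such as `("S", ("a", "S", "b"))` is rejected because its right-hand side has three symbols.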


Converting a grammar to Chomsky normal form

To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory. The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009). (For example, Hopcroft, Ullman (1979) merged TERM and BIN into a single transformation.) Each of the following transformations establishes one of the properties required for Chomsky normal form.


START: Eliminate the start symbol from right-hand sides

Introduce a new start symbol ''S''₀, and a new rule
: ''S''₀ → ''S'',
where ''S'' is the previous start symbol. This does not change the grammar's produced language, and ''S''₀ will not occur on any rule's right-hand side.
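As a rough sketch of this step, assuming rules are `(lhs, rhs)` pairs with tuple right-hand sides (the function name `start_step` and the default name `"S0"` are hypothetical):

```python
def start_step(rules, start, new_start="S0"):
    """START: add new_start -> start so the start symbol never occurs on a right-hand side."""
    used = {lhs for lhs, _ in rules} | {x for _, rhs in rules for x in rhs}
    if new_start in used:
        # the new symbol must be fresh, otherwise the language could change
        raise ValueError(f"symbol {new_start!r} is already in use")
    return [(new_start, (start,))] + list(rules), new_start
```

The new rule is a unit rule, which is why the later UNIT step has to run after START.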


TERM: Eliminate rules with nonsolitary terminals

To eliminate each rule
: ''A'' → ''X''₁ ... ''a'' ... ''X''ₙ
with a terminal symbol ''a'' being not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol ''N''ₐ, and a new rule
: ''N''ₐ → ''a''.
Change every rule
: ''A'' → ''X''₁ ... ''a'' ... ''X''ₙ
to
: ''A'' → ''X''₁ ... ''N''ₐ ... ''X''ₙ.
If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol. This does not change the grammar's produced language.
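A minimal sketch of TERM, under the assumption that rules are `(lhs, rhs)` pairs and that names of the form `N_a` are not already used in the grammar (both the representation and the naming scheme are illustrative):

```python
def term_step(rules, nonterminals):
    """TERM: replace terminals in right-hand sides of length >= 2 by proxy nonterminals."""
    proxy = {}  # terminal a -> new nonterminal N_a
    new_rules = []
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            new_rhs = []
            for x in rhs:
                if x in nonterminals:
                    new_rhs.append(x)
                else:
                    # nonsolitary terminal: substitute its proxy nonterminal
                    new_rhs.append(proxy.setdefault(x, "N_" + x))
            new_rules.append((lhs, tuple(new_rhs)))
        else:
            # solitary terminals and empty right-hand sides are left alone
            new_rules.append((lhs, rhs))
    for a, n in proxy.items():
        new_rules.append((n, (a,)))  # the new rule N_a -> a
    return new_rules, set(nonterminals) | set(proxy.values())
```

All terminals of one rule are replaced in a single pass, matching the "simultaneously replace" wording above.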


BIN: Eliminate right-hand sides with more than 2 nonterminals

Replace each rule
: ''A'' → ''X''₁ ''X''₂ ... ''X''ₙ
with more than 2 nonterminals ''X''₁,...,''X''ₙ by rules
: ''A'' → ''X''₁ ''A''₁,
: ''A''₁ → ''X''₂ ''A''₂,
: ... ,
: ''A''ₙ₋₂ → ''X''ₙ₋₁ ''X''ₙ,
where ''A''ᵢ are new nonterminal symbols. Again, this does not change the grammar's produced language.
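The chain of new rules can be produced with a simple loop. This is a sketch only; the fresh names `A_1`, `A_2`, ... are generated from the left-hand side and a counter, and are assumed not to clash with existing symbols.

```python
def bin_step(rules):
    """BIN: split every right-hand side longer than 2 into a chain of binary rules."""
    new_rules = []
    counter = 0
    for lhs, rhs in rules:
        head = lhs
        while len(rhs) > 2:
            counter += 1
            fresh = f"{lhs}_{counter}"  # hypothetical naming scheme for the new A_i
            new_rules.append((head, (rhs[0], fresh)))
            head, rhs = fresh, rhs[1:]
        new_rules.append((head, rhs))  # the last rule keeps the final two symbols
    return new_rules
```

A right-hand side of length ''n'' yields ''n'' − 1 binary rules, so this step increases the grammar size only linearly.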


DEL: Eliminate ε-rules

An ε-rule is a rule of the form
: ''A'' → ε,
where ''A'' is not ''S''₀, the grammar's start symbol.

To eliminate all rules of this form, first determine the set of all nonterminals that derive ε. Hopcroft and Ullman (1979) call such nonterminals ''nullable'', and compute them as follows:
* If a rule ''A'' → ε exists, then ''A'' is nullable.
* If a rule ''A'' → ''X''₁ ... ''X''ₙ exists, and every single ''X''ᵢ is nullable, then ''A'' is nullable, too.

Obtain an intermediate grammar by replacing each rule
: ''A'' → ''X''₁ ... ''X''ₙ
by all versions with some nullable ''X''ᵢ omitted. By deleting in this grammar each ε-rule, unless its left-hand side is the start symbol, the transformed grammar is obtained.

For example, in the following grammar, with start symbol ''S''₀,
: ''S''₀ → ''AbB'' | ''C''
: ''B'' → ''AA'' | ''AC''
: ''C'' → ''b'' | ''c''
: ''A'' → ''a'' | ε
the nonterminal ''A'', and hence also ''B'', is nullable, while neither ''C'' nor ''S''₀ is. Hence the following intermediate grammar is obtained, with one alternative for each choice of kept and omitted nullable nonterminals (so ''B'' → ''A'' appears twice, once for each omitted ''A'', and ''B'' → ε arises from omitting both):
: ''S''₀ → ''AbB'' | ''Ab'' | ''bB'' | ''b'' | ''C''
: ''B'' → ''AA'' | ''A'' | ''A'' | ε | ''AC'' | ''C''
: ''C'' → ''b'' | ''c''
: ''A'' → ''a'' | ε
In this grammar, all ε-rules have been "inlined at the call site". (If the grammar had a rule ''S''₀ → ε, it could not be "inlined", since it had no "call sites"; therefore it could not be deleted in the next step.) In the next step, they can hence be deleted, yielding the grammar:
: ''S''₀ → ''AbB'' | ''Ab'' | ''bB'' | ''b'' | ''C''
: ''B'' → ''AA'' | ''A'' | ''AC'' | ''C''
: ''C'' → ''b'' | ''c''
: ''A'' → ''a''
This grammar produces the same language as the original example grammar, viz. {''ab'', ''aba'', ''abaa'', ''abab'', ''abac'', ''abb'', ''abc'', ''b'', ''ba'', ''baa'', ''bab'', ''bac'', ''bb'', ''bc'', ''c''}, but has no ε-rules.
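The nullable computation and the omission step can be sketched as follows. The `(lhs, rhs)` rule representation and the names `nullable_set`, `del_step` and `EXAMPLE` are assumptions made for illustration; the grammar in `EXAMPLE` is the one from the text above.

```python
from itertools import product

def nullable_set(rules):
    """Fixed-point computation of all nonterminals that derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            # an empty rhs satisfies all() trivially, which covers A -> epsilon
            if lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

def del_step(rules, start):
    """DEL: add all versions with nullable symbols omitted, then drop epsilon-rules."""
    nullable = nullable_set(rules)
    result = set()
    for lhs, rhs in rules:
        # each nullable occurrence may independently be kept or omitted
        options = [((x,), ()) if x in nullable else ((x,),) for x in rhs]
        for choice in product(*options):
            new_rhs = tuple(s for part in choice for s in part)
            if new_rhs or lhs == start:  # keep an epsilon-rule only for the start symbol
                result.add((lhs, new_rhs))
    return result

EXAMPLE = [
    ("S0", ("A", "b", "B")), ("S0", ("C",)),
    ("B", ("A", "A")), ("B", ("A", "C")),
    ("C", ("b",)), ("C", ("c",)),
    ("A", ("a",)), ("A", ()),
]
```

Because the result is a set, the duplicate alternative ''B'' → ''A'' of the intermediate grammar collapses automatically.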


UNIT: Eliminate unit rules

A unit rule is a rule of the form
: ''A'' → ''B'',
where ''A'', ''B'' are nonterminal symbols. To remove it, for each rule
: ''B'' → ''X''₁ ... ''X''ₙ,
where ''X''₁ ... ''X''ₙ is a string of nonterminals and terminals, add rule
: ''A'' → ''X''₁ ... ''X''ₙ
unless this is a unit rule which has already been (or is being) removed. Skipping the nonterminal symbol ''B'' in the resulting grammar is possible because ''B'' is a member of the unit closure of the nonterminal symbol ''A''.
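One way to implement this is to compute the unit closure of every nonterminal first and then copy the non-unit rules, instead of removing unit rules one at a time. The sketch below takes that closure-based route; the rule representation and the name `unit_step` are illustrative assumptions.

```python
def unit_step(rules, nonterminals):
    """UNIT: remove unit rules A -> B using the unit closure of each nonterminal."""
    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in nonterminals

    # closure[a] = every nonterminal reachable from a through unit rules only
    closure = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if is_unit(rhs):
                for a in nonterminals:
                    if lhs in closure[a] and rhs[0] not in closure[a]:
                        closure[a].add(rhs[0])
                        changed = True

    result = set()
    for a in nonterminals:
        for lhs, rhs in rules:
            # give a every non-unit rule of the nonterminals in its closure
            if lhs in closure[a] and not is_unit(rhs):
                result.add((a, rhs))
    return result
```

Each nonterminal may inherit the rules of every other nonterminal, which is consistent with the quadratic blow-up attributed to UNIT.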


Order of transformations

When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.

Moreover, the worst-case bloat in grammar size (i.e. written length, measured in symbols) depends on the transformation order. Using |''G''| to denote the size of the original grammar ''G'', the size blow-up in the worst case may range from |''G''|² to 2^(2|''G''|), depending on the transformation algorithm used. The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar. The orderings START,TERM,BIN,DEL,UNIT and START,BIN,DEL,UNIT,TERM lead to the least (i.e. quadratic) blow-up.
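The effect of the DEL/BIN order can be illustrated by counting rules for a single long rule ''A'' → ''X''₁ ... ''X''ₙ in which every ''X''ᵢ is nullable. The sketch below is only a counting experiment under that assumption: it counts the rules derived from ''A'' (and its BIN helpers), not the total grammar size, and all function names are hypothetical.

```python
from itertools import product

def nonempty_versions(rhs, nullable):
    """Distinct non-empty right-hand sides obtained by omitting nullable symbols."""
    options = [((x,), ()) if x in nullable else ((x,),) for x in rhs]
    out = set()
    for choice in product(*options):
        version = tuple(s for part in choice for s in part)
        if version:
            out.add(version)
    return out

def rules_del_first(n):
    """DEL applied directly to A -> X1 ... Xn with all Xi nullable."""
    rhs = tuple(f"X{i}" for i in range(1, n + 1))
    return len(nonempty_versions(rhs, set(rhs)))

def rules_bin_first(n):
    """BIN first (A -> X1 A1, ..., A(n-2) -> X(n-1) Xn), then DEL on each binary rule."""
    xs = [f"X{i}" for i in range(1, n + 1)]
    helpers = [f"A{i}" for i in range(1, n - 1)]
    nullable = set(xs) | set(helpers)  # the helpers derive only nullable symbols
    total = 0
    for i in range(n - 1):
        second = helpers[i] if i < n - 2 else xs[n - 1]
        total += len(nonempty_versions((xs[i], second), nullable))
    return total
```

For ''n'' = 10 the first count is 2¹⁰ − 1 = 1023 rules, while the second is 3 · 9 = 27.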


Example

The following grammar, with start symbol ''Expr'', describes a simplified version of the set of all syntactically valid arithmetic expressions in programming languages like C or Algol60. Both ''number'' and ''variable'' are considered terminal symbols here for simplicity, since in a compiler front end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol60.
: (grammar not preserved)
In step "START" of the above conversion algorithm, just a rule ''S''₀ → ''Expr'' is added to the grammar. After step "TERM", the grammar looks like this:
: (grammar not preserved)
After step "BIN", the following grammar is obtained:
: (grammar not preserved)
Since there are no ε-rules, step "DEL" does not change the grammar. After step "UNIT", the following grammar is obtained, which is in Chomsky normal form:
: (grammar not preserved)
The ''N''ₐ introduced in step "TERM" are ''PowOp'', ''Open'', and ''Close''. The ''A''ᵢ introduced in step "BIN" are ''AddOp_Term'', ''MulOp_Factor'', ''PowOp_Primary'', and ''Expr_Close''.


Alternative definition


Chomsky reduced form

Another way to define the Chomsky normal form is: A formal grammar is in Chomsky reduced form if all of its production rules are of the form:
: ''A'' → ''BC'', or
: ''A'' → ''a'',
where ''A'', ''B'' and ''C'' are nonterminal symbols, and ''a'' is a terminal symbol. When using this definition, ''B'' or ''C'' may be the start symbol. Only those context-free grammars which do not generate the empty string can be transformed into Chomsky reduced form.


Floyd normal form

In a letter where he proposed the term Backus–Naur form (BNF), Donald E. Knuth implied that a BNF "syntax in which all definitions have such a form may be said to be in 'Floyd Normal Form'",
: ⟨''A''⟩ ::= ⟨''B''⟩ | ⟨''C''⟩, or
: ⟨''A''⟩ ::= ⟨''B''⟩ ⟨''C''⟩, or
: ⟨''A''⟩ ::= ''a'',
where ⟨''A''⟩, ⟨''B''⟩ and ⟨''C''⟩ are nonterminal symbols, and ''a'' is a terminal symbol, because Robert W. Floyd found in 1961 that any BNF syntax can be converted to the above one. But Knuth withdrew this term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note." While Floyd's note cites Chomsky's original 1959 article, Knuth's letter does not.


Application

Besides its theoretical significance, CNF conversion is used in some algorithms as a preprocessing step, e.g., the CYK algorithm, a bottom-up parsing algorithm for context-free grammars, and its variant probabilistic CKY.
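To show why the binary rule shape is convenient, here is a minimal CYK recognizer sketch. It assumes the grammar is already in Chomsky normal form and is given as `(lhs, rhs)` pairs; the function name `cyk` and the sample grammar `ANBN` (for the language of strings ''a''ⁿ''b''ⁿ) are illustrative, not taken from the text above.

```python
def cyk(word, rules, start):
    """Return True if the CNF grammar derives word (a sequence of terminal symbols)."""
    n = len(word)
    if n == 0:
        return (start, ()) in rules  # only start -> epsilon can derive the empty string
    # table[i][l] holds the nonterminals that derive word[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        for lhs, rhs in rules:
            if rhs == (ch,):
                table[i][1].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, rhs in rules:
                    # A -> B C applies if B derives the left part and C the right part
                    if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                        table[i][length].add(lhs)
    return start in table[0][n]

# Hypothetical CNF grammar for { a^n b^n : n >= 0 }
ANBN = [
    ("S0", ()), ("S0", ("A", "B")), ("S0", ("A", "T")),
    ("S", ("A", "B")), ("S", ("A", "T")),
    ("T", ("S", "B")),
    ("A", ("a",)), ("B", ("b",)),
]
```

Since every right-hand side has at most two symbols, each table cell only needs to consider one split point at a time, which is what keeps the running time cubic in the word length.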


See also

* Backus–Naur form
* CYK algorithm
* Greibach normal form
* Kuroda normal form
* Pumping lemma for context-free languages (its proof relies on the Chomsky normal form)


Further reading

* Cole, Richard. ''Converting CFGs to CNF (Chomsky Normal Form)'', October 17, 2007 (pdf). Uses the order TERM, BIN, START, DEL, UNIT.
* ''(Pages 237–240 of section 6.6: simplified forms and normal forms.)''
* ''(Pages 98–101 of section 2.1: context-free grammars. Page 156.)''
* ''(Pages 171–183 of section 7.1: Chomsky Normal Form.)''
* Sipser, Michael. ''Introduction to the Theory of Computation'', 2nd edition.