formal language In logic, mathematics, computer science, and linguistics, a formal language is a set of strings whose symbols are taken from a set called "alphabet". The alphabet of a formal language consists of symbols that concatenate into strings (also c ...

theory, a context-free grammar (CFG) is a

formal grammar A formal grammar is a set of Terminal and nonterminal symbols, symbols and the Production (computer science), production rules for rewriting some of them into every possible string of a formal language over an Alphabet (formal languages), alphabe ...

whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form :

A\ \to\ \alpha

with

A

a ''single'' nonterminal symbol, and

\alpha

a string of terminals and/or nonterminals (

\alpha

can be empty). Regardless of which symbols surround it, the single nonterminal

A

on the left hand side can always be replaced by

\alpha

on the right hand side. This distinguishes it from a

context-sensitive grammar A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any Production (computer science), production rules may be surrounded by a context of terminal symbol, terminal and nonterminal symbols. Cont ...

, which can have production rules in the form

\alpha A \beta \rightarrow \alpha \gamma \beta

with

A

a nonterminal symbol and

\alpha

\beta

, and

\gamma

strings of terminal and/or nonterminal symbols. A formal grammar is essentially a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements. For example, the first rule in the picture, :

\langle\text\rangle \to \langle\text\rangle = \langle\text\rangle ;

replaces

\langle\text\rangle

with

\langle\text\rangle = \langle\text\rangle ;

. There can be multiple replacement rules for a given nonterminal symbol. The language generated by a grammar is the set of all strings of terminal symbols that can be derived, by repeated rule applications, from some particular nonterminal symbol ("start symbol"). Nonterminal symbols are used during the derivation process, but do not appear in its final result string.

Language Language is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which humans convey meaning, both in spoken and signed language, signed forms, and may also be conveyed through writing syste ...

s generated by context-free grammars are known as

context-free language In formal language theory, a context-free language (CFL), also called a Chomsky type-2 language, is a language generated by a context-free grammar (CFG). Context-free languages have many applications in programming languages, in particular, mos ...

s (CFL). Different context-free grammars can generate the same context-free language. It is important to distinguish the properties of the language (intrinsic properties) from the properties of a particular grammar (extrinsic properties). The language equality question (do two given context-free grammars generate the same language?) is undecidable. Context-free grammars arise in

linguistics Linguistics is the scientific study of language. The areas of linguistic analysis are syntax (rules governing the structure of sentences), semantics (meaning), Morphology (linguistics), morphology (structure of words), phonetics (speech sounds ...

where they are used to describe the structure of sentences and words in a

natural language A natural language or ordinary language is a language that occurs naturally in a human community by a process of use, repetition, and change. It can take different forms, typically either a spoken language or a sign language. Natural languages ...

, and they were invented by the linguist

Noam Chomsky Avram Noam Chomsky (born December 7, 1928) is an American professor and public intellectual known for his work in linguistics, political activism, and social criticism. Sometimes called "the father of modern linguistics", Chomsky is also a ...

for this purpose. By contrast, in

computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...

, as the use of recursively-defined concepts increased, they were used more and more. In an early application, grammars are used to describe the structure of

programming language A programming language is a system of notation for writing computer programs. Programming languages are described in terms of their Syntax (programming languages), syntax (form) and semantics (computer science), semantics (meaning), usually def ...

s. In a newer application, they are used in an essential part of the

Extensible Markup Language Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The Wor ...

(XML) called the document type definition. In linguistics, some authors use the term

phrase structure grammar The term phrase structure grammar was originally introduced by Noam Chomsky as the term for grammar studied previously by Emil Post and Axel Thue ( Post canonical systems). Some authors, however, reserve the term for more restricted grammars in t ...

to refer to context-free grammars, whereby phrase-structure grammars are distinct from

dependency grammar Dependency grammar (DG) is a class of modern Grammar, grammatical theories that are all based on the dependency relation (as opposed to the ''constituency relation'' of Phrase structure grammar, phrase structure) and that can be traced back prima ...

s. In computer science, a popular notation for context-free grammars is

Backus–Naur form In computer science, Backus–Naur form (BNF, pronounced ), also known as Backus normal form, is a notation system for defining the Syntax (programming languages), syntax of Programming language, programming languages and other Formal language, for ...

, or BNF.

Background

Since at least the time of the ancient Indian scholar

Pāṇini (; , ) was a Sanskrit grammarian, logician, philologist, and revered scholar in ancient India during the mid-1st millennium BCE, dated variously by most scholars between the 6th–5th and 4th century BCE. The historical facts of his life ar ...

, linguists have described the

grammar In linguistics, grammar is the set of rules for how a natural language is structured, as demonstrated by its speakers or writers. Grammar rules may concern the use of clauses, phrases, and words. The term may also refer to the study of such rul ...

s of languages in terms of their block structure, and described how sentences are recursively built up from smaller phrases, and eventually individual words or word elements. An essential property of these block structures is that logical units never overlap. For example, the sentence: : John, whose blue car was in the garage, walked to the grocery store. can be logically parenthesized (with the logical metasymbols '') as follows: : ''John[, [whose [blue car [was [in [the garage">'',_[whose_[blue_car.html" ;"title="''John[, [whose [blue car">''John[, [whose [blue car [was [in [the garage], [walked [to [the [grocery store . A context-free grammar provides a simple and mathematically precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Its simplicity makes the formalism amenable to rigorous mathematical study. Important features of natural language syntax such as agreement and

reference A reference is a relationship between objects in which one object designates, or acts as a means by which to connect to or link to, another object. The first object in this relation is said to ''refer to'' the second object. It is called a ''nam ...

are not part of the context-free grammar, but the basic recursive structure of sentences, the way in which clauses nest inside other clauses, and the way in which lists of adjectives and adverbs are swallowed by nouns and verbs, is described exactly. Context-free grammars are a special form of

semi-Thue system In theoretical computer science and mathematical logic a string rewriting system (SRS), historically called a semi-Thue system, is a rewriting system over strings from a (usually finite) alphabet. Given a binary relation R between fixed strings ...

s that in their general form date back to the work of

Axel Thue Axel Thue (; 19 February 1863 – 7 March 1922) was a Norwegian mathematician, known for his original work in diophantine approximation and combinatorics. Work Thue published his first important paper in 1909. He stated in 1914 the so-called w ...

. The formalism of context-free grammars was developed in the mid-1950s by

, and also their classification as a special type of

(which he called

phrase-structure grammar The term phrase structure grammar was originally introduced by Noam Chomsky as the term for grammar studied previously by Emil Post and Axel Thue ( Post canonical systems). Some authors, however, reserve the term for more restricted grammars in th ...

s). Some authors, however, reserve the term for more restricted grammars in the Chomsky hierarchy: context-sensitive grammars or context-free grammars. In a broader sense,

s are also known as constituency grammars. The defining trait of phrase structure grammars is thus their adherence to the constituency relation, as opposed to the dependency relation of

s. In Chomsky's

generative grammar Generative grammar is a research tradition in linguistics that aims to explain the cognitive basis of language by formulating and testing explicit models of humans' subconscious grammatical knowledge. Generative linguists, or generativists (), ...

framework, the syntax of natural language was described by context-free rules combined with transformation rules. Block structure was introduced into computer

s by the

Algol ALGOL (; short for "Algorithmic Language") is a family of imperative computer programming languages originally developed in 1958. ALGOL heavily influenced many other languages and was the standard method for algorithm description used by the ...

project (1957–1960), which, as a consequence, also featured a context-free grammar to describe the resulting Algol syntax. This became a standard feature of computer languages, and the notation for grammars used in concrete descriptions of computer languages came to be known as

, after two members of the Algol language design committee. The "block structure" aspect that context-free grammars capture is so fundamental to grammar that the terms syntax and grammar are often identified with context-free grammar rules, especially in computer science. Formal constraints not captured by the grammar are then considered to be part of the "semantics" of the language. Context-free grammars are simple enough to allow the construction of efficient parsing algorithms that, for a given string, determine whether and how it can be generated from the grammar. An Earley parser is an example of such an algorithm, while the widely used LR and LL parsers are simpler algorithms that deal only with more restrictive subsets of context-free grammars.

Formal definitions

A context-free grammar is defined by the 4-

tuple In mathematics, a tuple is a finite sequence or ''ordered list'' of numbers or, more generally, mathematical objects, which are called the ''elements'' of the tuple. An -tuple is a tuple of elements, where is a non-negative integer. There is o ...

G = (V, \Sigma, R, S)

, where # is a finite set; each element

v\in V

is called ''a nonterminal character'' or a ''variable''. Each variable represents a different type of phrase or clause in the sentence. Variables are also sometimes called syntactic categories. Each variable defines a sub-language of the language defined by . # is a finite set of ''terminal''s, disjoint from , which make up the actual content of the sentence. The set of terminals is the alphabet of the language defined by the grammar . # is a finite relation in

V\times(V\cup\Sigma)^

, where the asterisk represents the

Kleene star In mathematical logic and theoretical computer science, the Kleene star (or Kleene operator or Kleene closure) is a unary operation on a Set (mathematics), set to generate a set of all finite-length strings that are composed of zero or more repe ...

operation. The members of are called the ''(rewrite) rule''s or ''production''s of the grammar. (also commonly symbolized by a ) # is the start variable (or start symbol), used to represent the whole sentence (or program). It must be an element of .

Production rule notation

A production rule in is formalized mathematically as a pair

(\alpha, \beta)\in R

, where

\alpha \in V

is a nonterminal and

\beta \in (V\cup\Sigma)^

is a

string String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...

of variables and/or terminals; rather than using

ordered pair In mathematics, an ordered pair, denoted (''a'', ''b''), is a pair of objects in which their order is significant. The ordered pair (''a'', ''b'') is different from the ordered pair (''b'', ''a''), unless ''a'' = ''b''. In contrast, the '' unord ...

notation, production rules are usually written using an arrow operator with

\alpha

as its left hand side and as its right hand side:

\alpha\rightarrow\beta

. It is allowed for to be the

empty string In formal language theory, the empty string, or empty word, is the unique String (computer science), string of length zero. Formal theory Formally, a string is a finite, ordered sequence of character (symbol), characters such as letters, digits ...

, and in this case it is customary to denote it by . The form

\alpha\rightarrow\varepsilon

is called an ε-production. It is common to list all right-hand sides for the same left-hand side on the same line, using , (the

vertical bar The vertical bar, , is a glyph with various uses in mathematics, computing, and typography. It has many names, often related to particular meanings: Sheffer stroke (in logic), pipe, bar, or (literally, the word "or"), vbar, and others. Usage ...

) to separate them. Rules

\alpha\rightarrow \beta_1

and

\alpha\rightarrow\beta_2

can hence be written as

\alpha\rightarrow\beta_1\mid\beta_2

. In this case,

\beta_1

and

\beta_2

are called the first and second alternative, respectively.

Rule application

For any strings

u, v\in (V\cup\Sigma)^

, we say directly yields , written as

u\Rightarrow v\,

, if

\exists (\alpha, \beta)\in R

with

\alpha \in V

and

u_, u_\in (V\cup\Sigma)^

such that

u\,=u_\alpha u_

and

v\,=u_\beta u_

. Thus, is a result of applying the rule

(\alpha, \beta)

to .

Repetitive rule application

For any strings

u, v\in (V\cup\Sigma)^,

we say ''yields'' or is ''derived'' from if there is a positive integer and strings

u_, \ldots, u_\in (V\cup\Sigma)^

such that

u = u_ \Rightarrow u_ \Rightarrow \cdots \Rightarrow u_ = v

. This relation is denoted

u ~\stackrel~ v

, or

u\Rightarrow\Rightarrow v

in some textbooks. If

k\geq 2

, the relation

u ~\stackrel~ v

holds. In other words,

(\stackrel)

and

(\stackrel)

are the

reflexive transitive closure In mathematics, a subset of a given set is closed under an operation on the larger set if performing that operation on members of the subset always produces a member of that subset. For example, the natural numbers are closed under addition, but ...

(allowing a string to yield itself) and the

transitive closure In mathematics, the transitive closure of a homogeneous binary relation on a set (mathematics), set is the smallest Relation (mathematics), relation on that contains and is Transitive relation, transitive. For finite sets, "smallest" can be ...

(requiring at least one step) of

(\Rightarrow)

, respectively.

Context-free language

The language of a grammar

G = (V, \Sigma, R, S)

is the set :

L(G) = \

of all terminal-symbol strings derivable from the start symbol. A language is said to be a context-free language (CFL), if there exists a CFG , such that

L=L(G)

. Non-deterministic pushdown automata recognize exactly the context-free languages.

Examples

Words concatenated with their reverse

The grammar

G = (\, \, P, S)

, with productions : , : , : , is context-free. It is not proper since it includes an -production. A typical derivation in this grammar is : . This makes it clear that

L(G) = \

. The language is context-free; however, it can be proved that it is not regular. If the productions : , : , are added, a context-free grammar for the set of all

palindrome A palindrome (Help:IPA/English, /ˈpæl.ɪn.droʊm/) is a word, palindromic number, number, phrase, or other sequence of symbols that reads the same backwards as forwards, such as ''madam'' or ''racecar'', the date "Twosday, 02/02/2020" and th ...

s over the alphabet is obtained.

Well-formed parentheses

The canonical example of a context-free grammar is parenthesis matching, which is representative of the general case. There are two terminal symbols and and one nonterminal symbol . The production rules are : , : , : The first rule allows the symbol to multiply; the second rule allows the symbol to become enclosed by matching parentheses; and the third rule terminates the recursion.

Well-formed nested parentheses and square brackets

A second canonical example is two different kinds of matching nested parentheses, described by the productions: : : : : : with terminal symbols , , , and nonterminal . The following sequence can be derived in that grammar: :

Matching pairs

In a context-free grammar, we can pair up characters the way we do with

bracket A bracket is either of two tall fore- or back-facing punctuation marks commonly used to isolate a segment of text or data from its surroundings. They come in four main pairs of shapes, as given in the box to the right, which also gives their n ...

s. The simplest example: : : This grammar generates the language , which is not regular (according to the pumping lemma for regular languages). The special character stands for the empty string. By changing the above grammar to : : we obtain a grammar generating the language instead. This differs only in that it contains the empty string while the original grammar did not.

Distinct number of as and bs

A context-free grammar for the language consisting of all strings over containing an unequal number of s and s: : : : : Here, the nonterminal can generate all strings with more as than s, the nonterminal generates all strings with more s than s and the nonterminal generates all strings with an equal number of s and s. Omitting the third alternative in the rules for and does not restrict the grammar's language.

Second block of bs of double size

Another example of a non-regular language is

\

. It is context-free as it can be generated by the following context-free grammar: : :

First-order logic formulas

The formation rules for the terms and formulas of formal logic fit the definition of context-free grammar, except that the set of symbols may be infinite and there may be more than one start symbol.

Examples of languages that are not context free

In contrast to well-formed nested parentheses and square brackets in the previous section, there is no context-free grammar for generating all sequences of two different types of parentheses, each separately balanced ''disregarding the other'', where the two types need not nest inside one another, for example: : or : The fact that this language is not context free can be proven using pumping lemma for context-free languages and a proof by contradiction, observing that all words of the form

^n ^n ^n ^n

should belong to the language. This language belongs instead to a more general class and can be described by a conjunctive grammar, which in turn also includes other non-context-free languages, such as the language of all words of the form .

Regular grammars

Every regular grammar is context-free, but not all context-free grammars are regular. The following context-free grammar, for example, is also regular. : : : The terminals here are and , while the only nonterminal is . The language described is all nonempty strings of s and s that end in . This grammar is regular: no rule has more than one nonterminal in its right-hand side, and each of these nonterminals is at the same end of the right-hand side. Every regular grammar corresponds directly to a

nondeterministic finite automaton In automata theory, a finite-state machine is called a deterministic finite automaton (DFA), if * each of its transitions is ''uniquely'' determined by its source state and input symbol, and * reading an input symbol is required for each state tr ...

, so we know that this is a

regular language In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science (as opposed to ...

. Using vertical bars, the grammar above can be described more tersely as follows: :

Derivations and syntax trees

A ''derivation'' of a string for a grammar is a sequence of grammar rule applications that transform the start symbol into the string. A derivation proves that the string belongs to the grammar's language. A derivation is fully determined by giving, for each step: * the rule applied in that step * the occurrence of its left-hand side to which it is applied For clarity, the intermediate string is usually given as well. For instance, with the grammar: # # # the string : can be derived from the start symbol with the following derivation: : : (by rule 1. on ) : (by rule 1. on the second ) : (by rule 2. on the first ) : (by rule 2. on the second ) : (by rule 3. on the third ) Often, a strategy is followed that deterministically chooses the next nonterminal to rewrite: * in a ''leftmost derivation'', it is always the leftmost nonterminal; * in a ''rightmost derivation'', it is always the rightmost nonterminal. Given such a strategy, a derivation is completely determined by the sequence of rules applied. For instance, one leftmost derivation of the same string is : : (by rule 1 on the leftmost ) : (by rule 2 on the leftmost ) : (by rule 1 on the leftmost ) : (by rule 2 on the leftmost ) : (by rule 3 on the leftmost ), which can be summarized as : rule 1 : rule 2 : rule 1 : rule 2 : rule 3. One rightmost derivation is: : : (by rule 1 on the rightmost ) : (by rule 1 on the rightmost ) : (by rule 3 on the rightmost ) : (by rule 2 on the rightmost ) : (by rule 2 on the rightmost ), which can be summarized as : rule 1 : rule 1 : rule 3 : rule 2 : rule 2. The distinction between leftmost derivation and rightmost derivation is important because in most

parser Parsing, syntax analysis, or syntactic analysis is a process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar by breaking it into parts. The term '' ...

s the transformation of the input is defined by giving a piece of code for every grammar rule that is executed whenever the rule is applied. Therefore, it is important to know whether the parser determines a leftmost or a rightmost derivation because this determines the order in which the pieces of code will be executed. See for an example LL parsers and

LR parser In computer science, LR parsers are a type of bottom-up parsing, bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR parser, canonica ...

s. A derivation also imposes in some sense a hierarchical structure on the string that is derived. For example, if the string "" is derived according to the leftmost derivation outlined above, the structure of the string would be: : where denotes a substring recognized as belonging to . This hierarchy can also be seen as a tree: Simple Parse Tree 1

This tree is called a ''

parse tree A parse tree or parsing tree (also known as a derivation tree or concrete syntax tree) is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is use ...

'' or "concrete syntax tree" of the string, by contrast with the

abstract syntax tree An abstract syntax tree (AST) is a data structure used in computer science to represent the structure of a program or code snippet. It is a tree representation of the abstract syntactic structure of text (often source code) written in a formal ...

. In this case the presented leftmost and the rightmost derivations define the same parse tree; however, there is another rightmost derivation of the same string : : (by rule 1 on the rightmost ) : (by rule 3 on the rightmost ) : (by rule 1 on the rightmost ) : (by rule 2 on the rightmost ) : (by rule 2 on the rightmost ), which defines a string with a different structure : and a different parse tree: Simple Parse Tree 2

Note however that both parse trees can be obtained by both leftmost and rightmost derivations. For example, the last tree can be obtained with the leftmost derivation as follows: : : (by rule 1 on the leftmost ) : (by rule 1 on the leftmost ) : (by rule 2 on the leftmost ) : (by rule 2 on the leftmost ) : (by rule 3 on the leftmost ), If a string in the language of the grammar has more than one parsing tree, then the grammar is said to be an '' ambiguous grammar''. Such grammars are usually hard to parse because the parser cannot always decide which grammar rule it has to apply. Usually, ambiguity is a feature of the grammar, not the language, and an unambiguous grammar can be found that generates the same context-free language. However, there are certain languages that can only be generated by ambiguous grammars; such languages are called '' inherently ambiguous languages''.

Normal forms

Every context-free grammar with no -production has an equivalent grammar in Chomsky normal form, and a grammar in Greibach normal form. "Equivalent" here means that the two grammars generate the same language. The especially simple form of production rules in Chomsky normal form grammars has both theoretical and practical implications. For instance, given a context-free grammar, one can use the Chomsky normal form to construct a

polynomial-time In theoretical computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations p ...

algorithm that decides whether a given string is in the language represented by that grammar or not (the CYK algorithm).

Closure properties

Context-free languages are closed under the various operations, that is, if the languages and are context-free, so is the result of the following operations: * union ;

concatenation In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball". In certain formalizations of concatenati ...

;

''L''^* * substitution (in particular

homomorphism In algebra, a homomorphism is a morphism, structure-preserving map (mathematics), map between two algebraic structures of the same type (such as two group (mathematics), groups, two ring (mathematics), rings, or two vector spaces). The word ''homo ...

) * inverse homomorphism *

intersection In mathematics, the intersection of two or more objects is another object consisting of everything that is contained in all of the objects simultaneously. For example, in Euclidean geometry, when two lines in a plane are not parallel, their ...

with a

They are not closed under general intersection (hence neither under complementation) and set difference.

Decidable problems

The following are some decidable problems about context-free grammars.

Parsing

The parsing problem, checking whether a given word belongs to the language given by a context-free grammar, is decidable, using one of the general-purpose parsing algorithms: * CYK algorithm (for grammars in Chomsky normal form) * Earley parser * GLR parser * LL parser (only for the proper subclass of LL(''k'') grammars) Context-free parsing for Chomsky normal form grammars was shown by Leslie G. Valiant to be reducible to Boolean

matrix multiplication In mathematics, specifically in linear algebra, matrix multiplication is a binary operation that produces a matrix (mathematics), matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the n ...

, thus inheriting its complexity upper bound of ''O''(''n''^2.3728639). Conversely, Lillian Lee has shown ''O''(''n''^3−''ε'') Boolean matrix multiplication to be reducible to ''O''(''n''^3−3''ε'') CFG parsing, thus establishing some kind of lower bound for the latter.

Reachability, productiveness, nullability

A nonterminal symbol

X

is called ''productive'', or ''generating'', if there is a derivation

X ~\stackrel~ w

for some string

w

of terminal symbols.

X

is called ''reachable'' if there is a derivation

S ~\stackrel~ \alpha X \beta

for some strings

\alpha,\beta

of nonterminal and terminal symbols from the start symbol.

X

is called ''useless'' if it is unreachable or unproductive.

X

is called ''nullable'' if there is a derivation

X ~\stackrel~ \varepsilon

. A rule

X \rightarrow \varepsilon

is called an ''ε-production''. A derivation

X ~\stackrel~ X

is called a ''cycle''. Algorithms are known to eliminate from a given grammar, without changing its generated language, * unproductive symbols, * unreachable symbols, * -productions, with one possible exception, and * cycles.. In particular, an alternative containing a useless nonterminal symbol can be deleted from the right-hand side of a rule. Such rules and alternatives are called ''useless''. In the depicted example grammar, the nonterminal is unreachable, and is unproductive, while causes a cycle. Hence, omitting the last three rules does not change the language generated by the grammar, nor does omitting the alternatives "" from the right-hand side of the rule for . A context-free grammar is said to be ''proper'' if it has neither useless symbols nor -productions nor cycles. Combining the above algorithms, every context-free grammar not generating can be transformed into a weakly equivalent proper one.

Regularity and LL(''k'') checks

It is decidable whether a given ''grammar'' is a regular grammar, as well as whether it is an LL(''k'') grammar for a given . If ''k'' is not given, the latter problem is undecidable. Given a context-free grammar, it is not decidable whether its language is regular, nor whether it is an LL(''k'') language for a given ''k''.

Emptiness and finiteness

There are algorithms to decide whether the language of a given context-free grammar is empty, as well as whether it is finite.

Undecidable problems

Some questions that are undecidable for wider classes of grammars become decidable for context-free grammars; e.g. the emptiness problem (whether the grammar generates any terminal strings at all), is undecidable for

s, but decidable for context-free grammars. However, many problems are undecidable even for context-free grammars; the most prominent ones are handled in the following.

Universality

Given a CFG, does it generate the language of all strings over the alphabet of terminal symbols used in its rules? A reduction can be demonstrated to this problem from the well-known undecidable problem of determining whether a

Turing machine A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer algori ...

accepts a particular input (the

halting problem In computability theory (computer science), computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running, or continue to run for ...

). The reduction uses the concept of a ''

computation history In computer science, a computation history is a sequence of steps taken by an abstract machine in the process of computing its result. Computation histories are frequently used in Mathematical proof, proofs about the capabilities of certain machine ...

'', a string describing an entire computation of a

. A CFG can be constructed that generates all strings that are not accepting computation histories for a particular Turing machine on a particular input, and thus it will accept all strings only if the machine does not accept that input.

Language equality

Given two CFGs, do they generate the same language? The undecidability of this problem is a direct consequence of the previous: it is impossible to even decide whether a CFG is equivalent to the trivial CFG defining the language of all strings.

Language inclusion

Given two CFGs, can the first one generate all strings that the second one can generate? If this problem was decidable, then language equality could be decided too: two CFGs

G_1

and

G_2

generate the same language if

L(G_1)

is a subset of

L(G_2)

and

L(G_2)

is a subset of

L(G_1)

Being in a lower or higher level of the Chomsky hierarchy

Using Greibach's theorem, it can be shown that the two following problems are undecidable: * Given a

, does it describe a context-free language? * Given a context-free grammar, does it describe a

Grammar ambiguity

Given a CFG, is it ambiguous? The undecidability of this problem follows from the fact that if an algorithm to determine ambiguity existed, the Post correspondence problem could be decided, which is known to be undecidable. This may be proved by Ogden's lemma.

Language disjointness

Given two CFGs, is there any string derivable from both grammars? If this problem was decidable, the undecidable Post correspondence problem (PCP) could be decided, too: given strings

\alpha_1, \ldots, \alpha_N, \beta_1, \ldots, \beta_N

over some alphabet

\

, let the grammar consist of the rule :

S \to \alpha_1 S \beta_1^ ,  \cdots ,  \alpha_N S \beta_N^ ,  b

; where

\beta_i^

denotes the reversed string

\beta_i

and

b

does not occur among the

a_i

; and let grammar consist of the rule :

T \to a_1 T a_1^ ,  \cdots ,  a_k T a_k^ ,  b

; Then the PCP instance given by

\alpha_1, \ldots, \alpha_N, \beta_1, \ldots, \beta_N

has a solution if and only if and share a derivable string. The left of the string (before the

b

) will represent the top of the solution for the PCP instance while the right side will be the bottom in reverse.

Extensions

An obvious way to extend the context-free grammar formalism is to allow nonterminals to have arguments, the values of which are passed along within the rules. This allows natural language features such as agreement and

, and programming language analogs such as the correct use and definition of identifiers, to be expressed in a natural way. E.g. we can now easily express that in English sentences, the subject and verb must agree in number. In computer science, examples of this approach include affix grammars,

attribute grammar An attribute grammar is a formal way to supplement a formal grammar with semantic information processing. Semantic information is stored in attributes associated with terminal and nonterminal symbols of the grammar. The values of attributes are t ...

s, indexed grammars, and Van Wijngaarden two-level grammars. Similar extensions exist in linguistics. An extended context-free grammar (or regular right part grammar) is one in which the right-hand side of the production rules is allowed to be a

regular expression A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...

over the grammar's terminals and nonterminals. Extended context-free grammars describe exactly the context-free languages. Another extension is to allow additional terminal symbols to appear at the left-hand side of rules, constraining their application. This produces the formalism of

Subclasses

There are a number of important subclasses of the context-free grammars: * LR(''k'') grammars (also known as

deterministic context-free grammar In formal grammar theory, the deterministic context-free grammars (DCFGs) are a proper subset of the context-free grammars. They are the subset of context-free grammars that can be derived from deterministic pushdown automata, and they generate the ...

s) allow

parsing Parsing, syntax analysis, or syntactic analysis is a process of analyzing a String (computer science), string of Symbol (formal), symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal gramm ...

(string recognition) with deterministic pushdown automata (PDA), but they can only describe

deterministic context-free language In formal language theory, deterministic context-free languages (DCFL) are a proper subset of context-free languages. They are context-free languages that can be accepted by a deterministic pushdown automaton. DCFLs are always unambiguous, meanin ...

s. * Simple LR, look-ahead LR grammars are subclasses that allow further simplification of parsing. SLR and LALR are recognized using the same PDA as LR, but with simpler tables, in most cases. * LL(''k'') and LL(''*'') grammars allow parsing by direct construction of a leftmost derivation as described above, and describe even fewer languages. * Simple grammars are a subclass of the LL(1) grammars mostly interesting for its theoretical property that language equality of simple grammars is decidable, while language inclusion is not. * Bracketed grammars have the property that the terminal symbols are divided into left and right bracket pairs that always match up in rules. * Linear grammars have no rules with more than one nonterminal on the right-hand side. * Regular grammars are a subclass of the linear grammars and describe the regular languages, i.e. they correspond to

finite automata A finite-state machine (FSM) or finite-state automaton (FSA, plural: ''automata''), finite automaton, or simply a state machine, is a mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number ...

and

s. LR parsing extends LL parsing to support a larger range of grammars; in turn, generalized LR parsing extends LR parsing to support arbitrary context-free grammars. On LL grammars and LR grammars, it essentially performs LL parsing and LR parsing, respectively, while on nondeterministic grammars, it is as efficient as can be expected. Although GLR parsing was developed in the 1980s, many new language definitions and parser generators continue to be based on LL, LALR or LR parsing up to the present day.

Linguistic applications

Chomsky Avram Noam Chomsky (born December 7, 1928) is an American professor and public intellectual known for his work in linguistics, political activism, and social criticism. Sometimes called "the father of modern linguistics", Chomsky is also a ...

initially hoped to overcome the limitations of context-free grammars by adding transformation rules. Such rules are another standard device in traditional linguistics; e.g.

passivization A passive voice construction is a grammatical voice construction that is found in many languages. In a clause with passive voice, the grammatical subject expresses the ''theme'' or ''patient'' of the main verb – that is, the person or thing t ...

in English. Much of

has been devoted to finding ways of refining the descriptive mechanisms of phrase-structure grammar and transformation rules such that exactly the kinds of things can be expressed that natural language actually allows. Allowing arbitrary transformations does not meet that goal: they are much too powerful, being

Turing complete Alan Mathison Turing (; 23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. He was highly influential in the development of theoretical comput ...

unless significant restrictions are added (e.g. no transformations that introduce and then rewrite symbols in a context-free fashion). Chomsky's general position regarding the non-context-freeness of natural language has held up since then,. although his specific examples regarding the inadequacy of context-free grammars in terms of their weak generative capacity were later disproved.. Gerald Gazdar and

Geoffrey Pullum Geoffrey Keith Pullum (; born 8 March 1945) is a British and American linguist specialising in the study of English. Pullum has published over 300 articles and books on various topics in linguistics, including phonology, morphology, semantics ...

have argued that despite a few non-context-free constructions in natural language (such as cross-serial dependencies in

Swiss German Swiss German (Standard German: , ,Because of the many different dialects, and because there is no #Conventions, defined orthography for any of them, many different spellings can be found. and others; ) is any of the Alemannic German, Alemannic ...

and

reduplication In linguistics, reduplication is a Morphology (linguistics), morphological process in which the Root (linguistics), root or Stem (linguistics), stem of a word, part of that, or the whole word is repeated exactly or with a slight change. The cla ...

in Bambara.), the vast majority of forms in natural language are indeed context-free.

References

Notes

External links

* Computer programmers may find th
stack exchange answer
to be useful.
CFG Developer
created by Christopher Wong at Stanford University in 2014; modified by Kevin Gibbons in 2015. {{DEFAULTSORT:Context-Free Grammar 1956 in computing Compiler construction Formal languages Programming language topics

Background

Formal definitions

Production rule notation

Rule application

Repetitive rule application

Context-free language

Examples

Words concatenated with their reverse

Well-formed parentheses

Well-formed nested parentheses and square brackets

Matching pairs

Distinct number of as and bs

Second block of bs of double size

First-order logic formulas

Examples of languages that are not context free

Regular grammars

Derivations and syntax trees

Normal forms

Closure properties

Decidable problems

Parsing

Reachability, productiveness, nullability

Regularity and LL(''k'') checks

Emptiness and finiteness

Undecidable problems

Universality

Language equality

Language inclusion

Being in a lower or higher level of the Chomsky hierarchy

Grammar ambiguity

Language disjointness

Extensions

Subclasses

Linguistic applications

See also

References

Notes

Further reading

External links