In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context.In particular, in a context-free grammar, each production rule is of the form
A \to \alpha
with
A
\alpha
\alpha
A
\alpha
\alphaA\beta → \alpha\gamma\beta
A
\alpha
\beta
\gamma
A formal grammar is essentially a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements. For example, the first rule in the picture,
\langleStmt\rangle\to\langleId\rangle=\langleExpr\rangle;
replaces
\langleStmt\rangle
\langleId\rangle=\langleExpr\rangle;
Languages generated by context-free grammars are known as context-free languages (CFL). Different context-free grammars can generate the same context-free language. It is important to distinguish the properties of the language (intrinsic properties) from the properties of a particular grammar (extrinsic properties). The language equality question (do two given context-free grammars generate the same language?) is undecidable.
Context-free grammars arise in linguistics where they are used to describe the structure of sentences and words in a natural language, and they were invented by the linguist Noam Chomsky for this purpose. By contrast, in computer science, as the use of recursively-defined concepts increased, they were used more and more. In an early application, grammars are used to describe the structure of programming languages. In a newer application, they are used in an essential part of the Extensible Markup Language (XML) called the document type definition.[1]
In linguistics, some authors use the term phrase structure grammar to refer to context-free grammars, whereby phrase-structure grammars are distinct from dependency grammars. In computer science, a popular notation for context-free grammars is Backus–Naur form, or BNF.
Since at least the time of the ancient Indian scholar Pāṇini, linguists have described the grammars of languages in terms of their block structure, and described how sentences are recursively built up from smaller phrases, and eventually individual words or word elements. An essential property of these block structures is that logical units never overlap. For example, the sentence:
John, whose blue car was in the garage, walked to the grocery store.can be logically parenthesized (with the logical metasymbols []) as follows:
['''''John'''''[''', '''['''''whose '''''['''''blue car''''']] ['''''was '''''['''''in '''''['''''the garage''''']]],]] ['''''walked '''''['''''to '''''['''''the '''''['''''grocery store''''']]]].A context-free grammar provides a simple and mathematically precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Its simplicity makes the formalism amenable to rigorous mathematical study. Important features of natural language syntax such as agreement and reference are not part of the context-free grammar, but the basic recursive structure of sentences, the way in which clauses nest inside other clauses, and the way in which lists of adjectives and adverbs are swallowed by nouns and verbs, is described exactly.
Context-free grammars are a special form of Semi-Thue systems that in their general form date back to the work of Axel Thue.
The formalism of context-free grammars was developed in the mid-1950s by Noam Chomsky,[2] and also their classification as a special type of formal grammar (which he called phrase-structure grammars). Some authors, however, reserve the term for more restricted grammars in the Chomsky hierarchy: context-sensitive grammars or context-free grammars. In a broader sense, phrase structure grammars are also known as constituency grammars. The defining trait of phrase structure grammars is thus their adherence to the constituency relation, as opposed to the dependency relation of dependency grammars. In Chomsky's generative grammar framework, the syntax of natural language was described by context-free rules combined with transformation rules.[3]
Block structure was introduced into computer programming languages by the Algol project (1957–1960), which, as a consequence, also featured a context-free grammar[4] to describe the resulting Algol syntax. This became a standard feature of computer languages, and the notation for grammars used in concrete descriptions of computer languages came to be known as Backus–Naur form, after two members of the Algol language design committee.[2] The "block structure" aspect that context-free grammars capture is so fundamental to grammar that the terms syntax and grammar are often identified with context-free grammar rules, especially in computer science. Formal constraints not captured by the grammar are then considered to be part of the "semantics" of the language.
Context-free grammars are simple enough to allow the construction of efficient parsing algorithms that, for a given string, determine whether and how it can be generated from the grammar. An Earley parser is an example of such an algorithm, while the widely used LR and LL parsers are simpler algorithms that deal only with more restrictive subsets of context-free grammars.
G=(V,\Sigma,R,S)
v\inV
V x (V\cup\Sigma)*
A production rule in is formalized mathematically as a pair
(\alpha,\beta)\inR
\alpha\inV
\beta\in(V\cup\Sigma)*
\alpha
\alpha → \beta
It is allowed for to be the empty string, and in this case it is customary to denote it by ε. The form
\alpha → \varepsilon
It is common to list all right-hand sides for the same left-hand side on the same line, using | (the vertical bar) to separate them. Rules
\alpha → \beta1
\alpha → \beta2
\alpha → \beta1\mid\beta2
\beta1
\beta2
For any strings
u,v\in(V\cup\Sigma)*
u ⇒ v
\exists(\alpha,\beta)\inR
\alpha\inV
u1,u2\in(V\cup\Sigma)*
u=u1\alphau2
v=u1\betau2
(\alpha,\beta)
For any strings
u,v\in(V\cup\Sigma)*,
u1,\ldots,uk\in(V\cup\Sigma)*
u=u1 ⇒ u2 ⇒ … ⇒ uk=v
u\stackrel{*}{ ⇒ }v
u ⇒ ⇒ v
k\geq2
u\stackrel{+}{ ⇒ }v
(\stackrel{*}{ ⇒ })
(\stackrel{+}{ ⇒ })
( ⇒ )
The language of a grammar
G=(V,\Sigma,R,S)
L(G)=\{w\in\Sigma*:S\stackrel{*}{ ⇒ }w\}
A language is said to be a context-free language (CFL), if there exists a CFG, such that
L=L(G)
Non-deterministic pushdown automata recognize exactly the context-free languages.
The grammar
G=(\{S\},\{a,b\},P,S)
,
,
,
is context-free. It is not proper since it includes an ε-production. A typical derivation in this grammar is
.This makes it clear that
L(G)=\{wwR:w\in\{a,b\}*\}
If the productions
,
,
are added, a context-free grammar for the set of all palindromes over the alphabet is obtained.[7]
The canonical example of a context-free grammar is parenthesis matching, which is representative of the general case. There are two terminal symbols "(" and ")" and one nonterminal symbol S. The production rules are
,
,
The first rule allows the S symbol to multiply; the second rule allows the S symbol to become enclosed by matching parentheses; and the third rule terminates the recursion.[8]
A second canonical example is two different kinds of matching nested parentheses, described by the productions:
with terminal symbols [] and nonterminal S.
The following sequence can be derived in that grammar:
In a context-free grammar, we can pair up characters the way we do with brackets. The simplest example:
This grammar generates the language
\{anbn:n\ge1\}
The special character ε stands for the empty string. By changing the above grammar to
we obtain a grammar generating the language
\{anbn:n\ge0\}
A context-free grammar for the language consisting of all strings over containing an unequal number of a's and b's:
Here, the nonterminal T can generate all strings with more a's than b's, the nonterminal U generates all strings with more b's than a's and the nonterminal V generates all strings with an equal number of a's and b's. Omitting the third alternative in the rules for T and U does not restrict the grammar's language.
Another example of a non-regular language is
\{bnamb2n:n\ge0,m\ge0\}
The formation rules for the terms and formulas of formal logic fit the definition of context-free grammar, except that the set of symbols may be infinite and there may be more than one start symbol.
In contrast to well-formed nested parentheses and square brackets in the previous section, there is no context-free grammar for generating all sequences of two different types of parentheses, each separately balanced disregarding the other, where the two types need not nest inside one another, for example:
or
The fact that this language is not context free can be proven using pumping lemma for context-free languages and a proof by contradiction, observing that all words of the form
{(}n{[}n{)}n{]}n
anbncn
See main article: Regular grammar. Every regular grammar is context-free, but not all context-free grammars are regular.[9] The following context-free grammar, for example, is also regular.
The terminals here are and, while the only nonterminal is .The language described is all nonempty strings of
a
b
a
This grammar is regular: no rule has more than one nonterminal in its right-hand side, and each of these nonterminals is at the same end of the right-hand side.
Every regular grammar corresponds directly to a nondeterministic finite automaton, so we know that this is a regular language.
Using vertical bars, the grammar above can be described more tersely as follows:
A derivation of a string for a grammar is a sequence of grammar rule applications that transform the start symbol into the string.A derivation proves that the string belongs to the grammar's language.
A derivation is fully determined by giving, for each step:
For clarity, the intermediate string is usually given as well.
For instance, with the grammar:
the string
can be derived from the start symbol with the following derivation:
(by rule 1. on)
(by rule 1. on the second)
(by rule 2. on the first)
(by rule 2. on the second)
(by rule 3. on the third)
Often, a strategy is followed that deterministically chooses the next nonterminal to rewrite:
Given such a strategy, a derivation is completely determined by the sequence of rules applied. For instance, one leftmost derivation of the same string is
(by rule 1 on the leftmost)
(by rule 2 on the leftmost)
(by rule 1 on the leftmost)
(by rule 2 on the leftmost)
(by rule 3 on the leftmost),
which can be summarized as
rule 1
rule 2
rule 1
rule 2
rule 3.
One rightmost derivation is:
(by rule 1 on the rightmost)
(by rule 1 on the rightmost)
(by rule 3 on the rightmost)
(by rule 2 on the rightmost)
(by rule 2 on the rightmost),
which can be summarized as
rule 1
rule 1
rule 3
rule 2
rule 2.
The distinction between leftmost derivation and rightmost derivation is important because in most parsers the transformation of the input is defined by giving a piece of code for every grammar rule that is executed whenever the rule is applied. Therefore, it is important to know whether the parser determines a leftmost or a rightmost derivation because this determines the order in which the pieces of code will be executed. See for an example LL parsers and LR parsers.
A derivation also imposes in some sense a hierarchical structure on the string that is derived. For example, if the string "1 + 1 + a" is derived according to the leftmost derivation outlined above, the structure of the string would be:
S}}
where indicates a substring recognized as belonging to . This hierarchy can also be seen as a tree:
This tree is called a parse tree or "concrete syntax tree" of the string, by contrast with the abstract syntax tree. In this case the presented leftmost and the rightmost derivations define the same parse tree; however, there is another rightmost derivation of the same string
(by rule 1 on the rightmost)
(by rule 3 on the rightmost)
(by rule 1 on the rightmost)
(by rule 2 on the rightmost)
(by rule 2 on the rightmost),
which defines a string with a different structure