
Introduction to context-free grammars by example: defining the language of regular expressions

Trying to define the language of regular expressions as a regular language

Alphabet of the languages L described by the regular expressions: AL = {a, b, c}

Now we define the language of all regular expressions that define a language over AL. This language of regular expressions is a meta-language; let us call it RE. Alphabet of RE: {a, b, c, epsi, con, or, star, open, close}, written with symbols as {a, b, c, є, ., |, *, (, )} or, with quotes around the symbols that are not in AL, as {a, b, c, "є", ".", "|", "*", "(", ")"}

Recursive language equations - definition of the language RE:

RE = C | RE "|" C -- union (note: distinguish the meta-symbol | from "|", the latter being a lexical unit of the RE language)
C = B | C "." B -- concatenation
B = Sim | B "*" -- basic expression - possibly with star
Sim = a | b | c | "є" -- simple expression

Finding a solution: note that an equation of the form LG = L1 | LG . L2 has the solution LG = L1 . L2*. Applying this to each equation in turn, we get B = Sim ( "*" )*, C = B ( "." B )* and RE = C ( "|" C )*; therefore RE = ( B ( "." B )* ) ( "|" ( B ( "." B )* ) )*, where B = Sim ( "*" )*
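The solved equations can be tried out directly with a pattern-matching library. The following is a minimal sketch in Python, assuming that each symbol of the RE alphabet is written as a single character; the names SIM, B, C, RE_LANG and is_regular_expression are chosen here for illustration only. It covers the version of RE without parentheses.

```python
import re

# Each pattern mirrors one solved equation (version without parentheses).
SIM = r"[abcє]"                  # Sim = a | b | c | "є"
B = rf"{SIM}\**"                 # B   = Sim ( "*" )*
C = rf"{B}(?:\.{B})*"            # C   = B ( "." B )*
RE_LANG = rf"{C}(?:\|{C})*"      # RE  = C ( "|" C )*


def is_regular_expression(text: str) -> bool:
    """Return True if text belongs to RE (no parentheses allowed)."""
    return re.fullmatch(RE_LANG, text) is not None


if __name__ == "__main__":
    for candidate in ["a.b*|c", "a**", "a|", "ab", "(a)"]:
        print(candidate, is_regular_expression(candidate))
```

Note that "ab" is rejected, because concatenation must be written explicitly with ".", and "(a)" is rejected, because this version of RE has no parentheses.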

But we also have parentheses: B = Sim | B "*" | "(" RE ")"

The diagram below is a finite-state automaton (assuming that the "(" RE ")" branch at the bottom is not present). With the "(" RE ")" branch, it is a recursive finite-state automaton. It can be used to check whether a given sequence belongs to the RE language. However, no automaton without recursion can recognize that language. Why not?

[Figure: automaton for the RE language, with the "(" RE ")" branch at the bottom]
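The recursive automaton can be imitated by a small recursive program. The following is a sketch only, not the automaton of the figure: it assumes single-character symbols and uses the iterative form of the equations, RE = C ( "|" C )*, C = B ( "." B )* and B = ( Sim | "(" RE ")" ) ( "*" )*. The function names accepts, parse_re, parse_c and parse_b are chosen here for illustration.

```python
SIMPLE = {"a", "b", "c", "є"}


def accepts(text: str) -> bool:
    """Return True if text belongs to RE, parentheses included."""
    pos = 0  # index of the next unread symbol

    def peek():
        return text[pos] if pos < len(text) else None

    def parse_re() -> bool:          # RE = C ( "|" C )*
        nonlocal pos
        if not parse_c():
            return False
        while peek() == "|":
            pos += 1
            if not parse_c():
                return False
        return True

    def parse_c() -> bool:           # C = B ( "." B )*
        nonlocal pos
        if not parse_b():
            return False
        while peek() == ".":
            pos += 1
            if not parse_b():
                return False
        return True

    def parse_b() -> bool:           # B = ( Sim | "(" RE ")" ) ( "*" )*
        nonlocal pos
        if peek() in SIMPLE:
            pos += 1
        elif peek() == "(":
            pos += 1
            # Recursive call: this corresponds to the "(" RE ")" branch.
            if not parse_re() or peek() != ")":
                return False
            pos += 1
        else:
            return False
        while peek() == "*":
            pos += 1
        return True

    return parse_re() and pos == len(text)


if __name__ == "__main__":
    for candidate in ["(a|b)*.c", "((a))", "(a", "a)", "()"]:
        print(candidate, accepts(candidate))
```

Each call of parse_re that is started from inside parse_b corresponds to one level of nested parentheses; the program keeps track of the open levels through its chain of pending calls.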

Exercise: Solve the above system of recursive equations by iteration

level 0	level 1	level 2	level 3	level 4	level 5
RE = Ø	RE = Ø	RE = Ø	RE = Ø	RE = a | b | c | "є"	RE = a | b | c | "є" | ( a | b | c | "є" ) "." ( a | b | c | "є" | a* | b* | c* ) | ( a | b | c | "є" ) "|" ( a | b | c | "є" | ( a | b | c | "є" ) "." ( a | b | c | "є" | a* | b* | c* ) )
C = Ø	C = Ø	C = Ø	C = a | b | c | "є"	C = a | b | c | "є" | ( a | b | c | "є" ) "." ( a | b | c | "є" | a* | b* | c* )	C = a | b | c | "є" | a* | b* | c* | a** | b** | c** | ( a | b | c | "є" | ( a | b | c | "є" ) "." ( a | b | c | "є" | a* | b* | c* ) ) "." ( a | b | c | "є" | a* | b* | c* | a** | b** | c** )
B = Ø	B = Ø	B = a | b | c | "є"	B = a | b | c | "є" | a* | b* | c*	B = a | b | c | "є" | a* | b* | c* | a** | b** | c**	B = a | b | c | "є" | a* | b* | c* | a** | b** | c** | a*** | b*** | c*** | "(" ( a | b | c | "є" ) ")"
Sim = Ø	Sim = a | b | c | "є"	Sim = a | b | c | "є"	Sim = a | b | c | "є"	Sim = a | b | c | "є"	Sim = a | b | c | "є"
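The iteration can also be carried out mechanically. The following sketch assumes that each language is represented as a finite Python set of strings and that every level is computed from the values of the previous level only; the names step and iterate are chosen here for illustration. Because the sets are computed exhaustively, they contain every string that the equations produce, including strings such as "є*".

```python
SIM = {"a", "b", "c", "є"}


def step(re_, c, b, sim):
    """Apply each right-hand side once to the previous approximations."""
    new_re = c | {x + "|" + y for x in re_ for y in c}    # RE = C | RE "|" C
    new_c = b | {x + "." + y for x in c for y in b}       # C = B | C "." B
    new_b = (sim | {x + "*" for x in b}
             | {"(" + x + ")" for x in re_})              # B = Sim | B "*" | "(" RE ")"
    new_sim = set(SIM)                                    # Sim = a | b | c | "є"
    return new_re, new_c, new_b, new_sim


def iterate(levels):
    """Return the list of approximations (RE, C, B, Sim) for levels 0..levels."""
    approx = (set(), set(), set(), set())   # level 0: everything is Ø
    history = [approx]
    for _ in range(levels):
        approx = step(*approx)
        history.append(approx)
    return history


if __name__ == "__main__":
    for level, (re_, c, b, sim) in enumerate(iterate(5)):
        print(f"level {level}: |RE|={len(re_)} |C|={len(c)} "
              f"|B|={len(b)} |Sim|={len(sim)}")
```

The sets grow quickly from level to level, which is why the iteration is normally stopped after a few steps.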

Note: Each sequence belonging to RE, C, B or Sim can also be represented by a tree where the alphabet symbols form the leaf nodes and the variables representing sets of strings (that is, RE, C, B and Sim) form the internal nodes of the tree. Some examples are given in class.

Define a context-free grammar for regular expressions

We can write the recursive equations above in the form of a context-free grammar as follows:

Context-free grammar:
RE --> C | RE "|" C
C --> B | C "." B
B --> Sim | B "*" | "(" RE ")"
Sim --> a | b | c | "є"

The same grammar written with separate production rules, each having an identifying number:
RE --> C
RE --> RE "|" C
C --> B
C --> C "." B
B --> Sim
B --> B "*"
B --> "(" RE ")"
Sim --> a | b | c | "є"

Notes:

"a", "b", "c" are so-called terminals of the grammar. They represent symbols of the alphabet used by the regular expressions. The quoted symbols "є", ".", "|", "*", "(" and ")" are terminals as well.
RE, C, B and Sim are called non-terminals (they represent sets of sequences of terminals).
Each line of a grammar is called a (production) rule. It contains on the left side of the --> a non-terminal, and on the right side a sequence of terminals or non-terminals.
--> and | are meta-symbols, that is, they are part of the meta-language that is used here to define the grammar for regular expressions. A well-known meta-language for context-free grammars is BNF (Backus-Naur Form; Backus and Naur introduced this notation to define the grammar of Algol around 1960). BNF writes "::=" for "-->". Instead of using "|", one may also write several separate rules for the same non-terminal symbol on the left side. In some variants of BNF, each rule is terminated with a period ".".
The terminals and non-terminals are also called the symbols of the grammar. A symbol is non-terminal if it occurs on the left side of at least one production rule.
To distinguish the terminal symbols, the non-terminals and the meta-symbols, one may use different conventions. The following are the most frequently used ones:
- non-terminals often are written in the form <name of the non-terminal>.
- terminals are often written in quotes, for instance "end" or ")".
- The meta-symbols are normally fixed and known; if one wants to define a terminal that is written in the same form as a meta-symbol, one has to find some way to avoid ambiguity, for instance, one may write <op> --> "+" | "|".
One of the non-terminals is the starting symbol of the grammar (normally the left-side symbol of the first rule).
The meaning of a grammar is the language that it defines, that is, the set of sequences of terminal symbols that can be generated from the starting symbol of the grammar using the different rules of the grammar. Two grammars are said to be equivalent if they generate the same language.
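The notions introduced in these notes can be made concrete with a small program. The following sketch assumes that the grammar is stored as a Python dictionary mapping each non-terminal to its list of right-hand sides; the names GRAMMAR, START, NONTERMINALS, TERMINALS and generate are chosen here for illustration. It computes the non-terminals as the symbols that occur on a left side, and it enumerates the part of the language that consists of sequences up to a given length.

```python
from collections import deque

GRAMMAR = {
    "RE": [["C"], ["RE", "|", "C"]],
    "C": [["B"], ["C", ".", "B"]],
    "B": [["Sim"], ["B", "*"], ["(", "RE", ")"]],
    "Sim": [["a"], ["b"], ["c"], ["є"]],
}
START = "RE"  # left-side symbol of the first rule

# A symbol is non-terminal if it occurs on the left side of a rule.
NONTERMINALS = set(GRAMMAR)
TERMINALS = {s for alts in GRAMMAR.values() for rhs in alts for s in rhs} - NONTERMINALS


def generate(max_len):
    """Return all terminal sequences of at most max_len symbols derivable from START."""
    results = set()
    seen = {(START,)}
    queue = deque(seen)
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, or None if there is none left.
        idx = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if idx is None:
            results.add("".join(form))
            continue
        for rhs in GRAMMAR[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            # No rule shortens a sequence, so longer forms can be discarded.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results


if __name__ == "__main__":
    print(sorted(TERMINALS))
    print(sorted(generate(2)))
```

Comparing the output of generate for two grammars up to some length gives a simple (but not conclusive) check of whether the two grammars might be equivalent.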

Syntax trees

The syntax tree of a sequence of terminal symbols explains why the sequence is part of the language according to the syntax rules of the context-free grammar. Consider the following example (a):

[Figure: syntax trees]
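A syntax tree can also be built by a program. The following sketch assumes single-character symbols and represents a tree as a nested Python tuple whose first element is the non-terminal and whose remaining elements are the children; the names parse and leaves, and the input "a.b*", are chosen here for illustration and are not the example from the figure. The loops build the tree that corresponds to the left-recursive rules RE --> RE "|" C, C --> C "." B and B --> B "*".

```python
SIMPLE = {"a", "b", "c", "є"}


def parse(text):
    """Return the syntax tree of text, or raise ValueError if text is not in RE."""
    pos = 0  # index of the next unread symbol

    def peek():
        return text[pos] if pos < len(text) else None

    def expect(symbol):
        nonlocal pos
        if peek() != symbol:
            raise ValueError(f"expected {symbol!r} at position {pos}")
        pos += 1

    def parse_re():
        tree = ("RE", parse_c())                    # RE --> C
        while peek() == "|":
            expect("|")
            tree = ("RE", tree, "|", parse_c())     # RE --> RE "|" C
        return tree

    def parse_c():
        tree = ("C", parse_b())                     # C --> B
        while peek() == ".":
            expect(".")
            tree = ("C", tree, ".", parse_b())      # C --> C "." B
        return tree

    def parse_b():
        symbol = peek()
        if symbol in SIMPLE:
            expect(symbol)
            tree = ("B", ("Sim", symbol))           # B --> Sim
        elif symbol == "(":
            expect("(")
            inner = parse_re()
            expect(")")
            tree = ("B", "(", inner, ")")           # B --> "(" RE ")"
        else:
            raise ValueError(f"unexpected symbol at position {pos}")
        while peek() == "*":
            expect("*")
            tree = ("B", tree, "*")                 # B --> B "*"
        return tree

    tree = parse_re()
    if pos != len(text):
        raise ValueError(f"unexpected symbol at position {pos}")
    return tree


def leaves(tree):
    """Concatenate the leaf nodes of a tree from left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(leaves(child) for child in tree[1:])


if __name__ == "__main__":
    print(parse("a.b*"))
```

Reading the leaves of the tree from left to right gives back the original sequence, which is what leaves computes.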