Friday, January 30, 2009

Formal grammars, parsing etc. for dummies

I thought it may be useful to post summaries of what I have been reading recently. In this post, I will try to explain some basic grammar and parsing stuff in as simple way as I can (omitting all the academic complacency) .

A grammar defines a language with a set of rules of type x -> y, where x is left hand side of the rule and y is right hand side of the rule. x and y can be either expandable or literals depending on grammar type (math zealot? read terminals and non-terminals).

Chomsky hierarchy defines four levels of grammar (although these are not the only grammar types in literature but Chomsky is cool dude in linguistic literature, hence the list):

0. Unrestricted grammar:

Languages based on such grammar can have any rule for derivation (which usually means any text can belong to a possible unrestricted grammar like 'yo dawg', where 'yo dawg -> u'). However, it is possible to define a language with unrestricted grammar which is also Turing complete. A language based on unrestricted grammar is extremely difficult to parse, but not completely impossible. Such language doesn't necessarily make sense, e.g. It is possible to create a math expression such as this 1++1=1--*1=1 in an unrestricted language, as well as an unrestricted English statement such as this: "all your bases are belong to us".

1. Context sensitive grammar:

A language defined with CSG may represent all the natural languages. Such grammar is defined with a set of rules where both sides may have expandable symbols (e.g. "a" B "c" -> "a" "x" "c"). It is very difficult to build a parser which can parse CSG.

2. Context free grammar:

All programming language grammars are context free grammars. It can have only one expandable symbol on left hand side of a grammar rule and is very easy to parse. E.g. X -> "if" (y) z | "if" (y) z "else" (X), where y and z can be represented as other statements of the same grammar. There are several well known parsing techniques to parse CFGs such as:

- LL(1)-LL(k) with a recursive descent parser (top down parsers with look-aheads 1 to k) and,
- Bottom up parsers such as LR(1) and LALR which can parse more CFGs than LL parsers (generally shift reduce parsers).

LL parser starts with Left to right and always expands Left expandable symbol first (think of queue) where as LR starts with Left to right but expands Right symbol first (think of stack).

3. Regular grammar:

A regular expression is a language defined by a regular grammar. It is represented as X -> "t" | "t" Y, where Y represents expandable symbol and always follows literal (if it exists). E.g. ab*, where * may be an expandable symbol defined to mean one or more occurrence of the literal it followed as a grammar rule.

Each grammar type above is a subset of previous grammar type, meaning every regular grammar is context free grammar, every context free grammar is also context sensitive grammar and so on.

1 comment:

Anonymous said...

OK. I never worried about the differences between a CFG, CSG, Regular Grammar, but your article is a real short eye-opener. Thanks dude for the very nice post. After having used majorly LL parsers like ANTLR, JavaCC for building Eclipse based custom language editors, now I know what is meant by an LL (I did always know that it is a top-down, but never worried to check wiki entries for the exact difference between LL and LR as both start left to right).

Thanks
Venkat (Venkataramana Madugundu)