Good Math/Bad Math

Friday, May 05, 2006

Context Free Languages

Moving on to the next tier of the Chomsky hierarchy, we hit the context free languages. CFLs are amazingly useful buggers: virtually every compiler, and in fact virtually every large piece of application software, uses CFLs somewhere. The way we use them in practice is actually very close to the formalism I'm going to talk about, and the programs we use for manipulating CFLs in real software are pretty much the formal machine that I'll talk about here.

Representing Context Free Languages

A context free language is represented by a context free grammar (CFG). In a CFG, each production rule has a single non-terminal symbol (NTS) on the left, and any sequence of terminals and non-terminals on the right (including the empty sequence).

To show you a CFG for something real: this is the grammar for lambda calculus, which is something really cool that I'll write about later. To make things easier, if there are multiple productions with the same symbol on the left hand side, I'll sometimes join the right hand sides with "|" symbols; for example, if I have productions "a ::= x" and "a ::= y", I'll sometimes combine those as "a ::= x | y".

The terminals are "LAMBDA", "DOT", "(", ")", "A", "B", "C", ..., "Z". And the production rules are:

ident ::= "A" | "B" | "C" | ... | "Z"
expr ::= LAMBDA ident DOT expr
expr ::= ident
expr ::= expr expr
expr ::= ( expr )

Here are a few typical strings from this grammar: "LAMBDA X DOT X Y", "LAMBDA X DOT ( LAMBDA Y DOT X Y) X", "X Y Z", "(X (Y (Z)))", "((X) ((Y X) Z) ((X) (X Y)))".
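This isn't from the formal definition, but just to make the grammar concrete, here's a quick Python sketch of my own that randomly generates strings like the ones above (with the identifier set trimmed down to just X, Y, and Z for brevity):

import random

# Right-hand sides are lists of symbols; a symbol is a non-terminal if it
# appears as a key in the grammar, and a terminal otherwise.
GRAMMAR = {
    "ident": [[c] for c in "XYZ"],
    "expr": [
        ["LAMBDA", "ident", "DOT", "expr"],
        ["ident"],
        ["expr", "expr"],
        ["(", "expr", ")"],
    ],
}

def generate(symbol, depth=0):
    """Expand a symbol into a list of terminal strings by a random derivation."""
    if symbol not in GRAMMAR:
        return [symbol]                # terminal: emit it as-is
    productions = GRAMMAR[symbol]
    if depth > 4:
        # past a cutoff depth, prefer length-1 productions so the derivation terminates
        productions = [p for p in productions if len(p) == 1] or productions
    result = []
    for sym in random.choice(productions):
        result.extend(generate(sym, depth + 1))
    return result

for _ in range(3):
    print(" ".join(generate("expr")))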

Capabilities and Limits of CFLs

From this example, you can see one of the things that you can do in a context free language that you can't do in a regular language: count. A string is only a member of the language if its parens are matched. If you process the string character by character, the number of open parens seen so far is always greater than or equal to the number of close parens; and at the end of the string, the two counts are exactly equal.
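To make the counting point concrete, here's a tiny sketch (my illustration, not part of the formal argument) of the paren check; the counter can grow arbitrarily large, which is exactly what a machine with a fixed, finite set of states can't keep track of:

def parens_matched(s: str) -> bool:
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ")" with no open paren to match it
                return False
    return depth == 0          # every "(" was eventually closed

assert parens_matched("((X)(Y X))")
assert not parens_matched("(X))")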

CFLs are still pretty limited in many ways. You can't write a grammar for a spoken language using CFGs - natural languages aren't context free. We use CFLs and CFGs all the time for compilers and such in real programs, but there's always an extra step involved, because there are aspects of real computer languages that can't be captured in context-free form.

So what can't you do in a CFL? I'm not going to try to formally characterize the limits of CFLs, but here are two examples:
  • Non-trivial counting: if you want a language of strings with the same number of Xs and Ys, that language is a CFL. But the language of strings with the same number of Xs, Ys, and Zs isn't.
  • Constrained references: you can't specify that a particular symbol can only occur in one place in the string if it already occurred in a prior part of the string. For example, in the lambda calculus example, it really should say that you can only use the "expr ::= ident" rule if that ident occurred in an enclosing "LAMBDA ident". CFLs can't do that.

Computing CFLs: the Pushdown Automaton

CFLs are computable using a kind of machine called a pushdown automaton (PDA). Pushdown automata are relatively simple: take a finite state machine, and add a stack.

For the non-CS folks out there, a stack is a last in first out (LIFO) storage system. What that means is that you can store something on the top of the stack whenever you want to (called pushing), look at what's on top of the stack (peeking), and remove the element on top (popping).
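If you want to play with one, a Python list works perfectly well as a stack:

stack = []
stack.append("a")   # push "a"
stack.append("b")   # push "b"
top = stack[-1]     # peek: "b"
stack.pop()         # pop: removes and returns "b"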

For a PDA, the transitions look like:

((state,top-of-stack,input) -> (state,stack-action))
  • The top-of-stack in the transition can be either a symbol from the machine's alphabet, or it can be "*".
    • If it's a symbol, then the transition can only be taken if the symbol on top of the stack matches it.
    • If it's "*", then the transition can be taken regardless of the value on top of the stack.
  • The stack-action can be "push(symbol)", "pop", or "none".
The machine accepts the input if it reaches a final state with the stack empty. (There are alternate formulations that accept by final state alone, without requiring an empty stack, or by empty stack alone, without using final states. They're all exactly equivalent to empty stack plus final state.)
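Just as a rough sketch (my own code, not anything from a standard library, and simplified to the deterministic case where at most one rule applies at a time), here's what a runner for transitions in that form could look like in Python. Keys are (state, stack-top, input), where the stack-top can be a symbol, "*" for anything, or "empty" for an empty stack; values are (next-state, action), with the action being ("push", symbol), ("pop",), or ("none",).

def run_pda(transitions, start_state, final_states, tape):
    state, stack = start_state, []
    for ch in tape:
        top = stack[-1] if stack else "empty"
        # prefer an exact stack-top match, then fall back to the "*" wildcard
        rule = transitions.get((state, top, ch)) or transitions.get((state, "*", ch))
        if rule is None:
            return False               # no transition applies: reject
        state, action = rule
        if action[0] == "push":
            stack.append(action[1])
        elif action[0] == "pop":
            stack.pop()
    # accept only if we end in a final state with an empty stack
    return state in final_states and not stack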

As usual, it's easiest to understand this with an example. So let's look at a simple language consisting of parens and identifiers, where the parens have to match. So, for example, "((I)(((I)I)(II)))" would be accepted, but "(I)(((I)I)(II)))" (the same string with the first open paren removed) would be rejected.

Our machine has an alphabet of "(", ")", and "I". It has two states: 0 and 1. 0 is both the initial state and a final state.

Pre-state | Stack-top | Input | Post-state | Stack-action
0         | *         | "("   | 0          | push("(")
0         | "("       | ")"   | 0          | pop
0         | *         | "I"   | 0          | none
0         | empty     | ")"   | 1          | none


Or graphically: [state diagram of the two-state PDA]
So: if it sees a "(", it pushes a "(" on the stack. If it sees an identifier, it just keeps going past it. If it sees a ")" and there's a "(" on top of the stack, it pops the stack; if it sees a ")" and the stack is empty, then it switches into state 1. Since state 1 is not a final state, and there are no transitions that can be taken from state 1, the machine rejects the string if it sees an extra ")". If there's a "(" without a matching close, then when the machine finishes, it will have a non-empty stack, and so it will reject the string.
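To tie this back to the run_pda sketch above, here's the same table written out as a transition dictionary, treating the single character "I" as the only identifier:

transitions = {
    (0, "*", "("):     (0, ("push", "(")),
    (0, "(", ")"):     (0, ("pop",)),
    (0, "*", "I"):     (0, ("none",)),
    (0, "empty", ")"): (1, ("none",)),
}
print(run_pda(transitions, 0, {0}, "((I)(((I)I)(II)))"))  # True: balanced, accepted
print(run_pda(transitions, 0, {0}, "(I)(((I)I)(II)))"))   # False: extra ")", rejected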

Finally, one nifty little note. What happens if we add a second stack to the PDA? Well, that's equivalent to a Turing machine: think of one stack as the tape up to the current head position, and the second stack as the tape after the current head position.
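Here's a little illustrative sketch of that trick (the class and method names are just made up for the example): the top of the "left" stack is the cell under the head, and the "right" stack holds everything to the right of it.

class TwoStackTape:
    """A Turing machine tape built from two stacks (illustration only)."""
    def __init__(self, contents, blank="_"):
        self.blank = blank
        self.left = list(contents[:1]) or [blank]   # top of left = cell under the head
        self.right = list(reversed(contents[1:]))   # top of right = next cell to the right

    def read(self):
        return self.left[-1]

    def write(self, symbol):
        self.left[-1] = symbol

    def move_right(self):
        self.left.append(self.right.pop() if self.right else self.blank)

    def move_left(self):
        self.right.append(self.left.pop())
        if not self.left:                           # ran off the left end: extend with a blank
            self.left.append(self.blank)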

1 Comment:

  • I found this a while ago, and it was a really cool approach to context free grammars. There's one catch: it's graphical! Check it out:
    http://chriscoyne.com/cfdg/

    By Anonymous, at 4:59 PM
