Good Math/Bad Math

Tuesday, May 02, 2006

Regular Languages

As I said yesterday, the Chomsky heirarchy specifies four basic types of languages. The simplest one, level 3, is the family of regular languages.

Regular languages are very simple little beasts. No counting, no deep patterns, and very little computational power. A machine that processes members of a regular language can always accept or reject a string in a number of steps equal to the number of symbols in the string.

Specifying Regular Languages

There are two commonly used ways of specifying regular languages: regular grammars, and regular expressions.

Regular grammars come in two flavors: left regular and right regular. A right-regular grammar is a phrase structure grammar where the left-hand side of every replacement rule (also called a production) is one non-terminal; and the right hand side is a sequence of terminals, optionally followed by a single non-terminal.

For example, the following are all valid right regular grammar productions (I'm writing non-terminals as lower-case italic latters, and terminals as upper-case):

a ::= XYb
b ::= WX
c ::= WXZYc
d ::= Xa
d ::= X

And the following are not right-regular productions:

a ::= XbY (NTS not at end)
b ::= aWX (NTS not at end)
c ::= WXZYcd (more than one NTS)

Alternatively, we can specify regular languages using regular expressions. A regular expression is:
  • (simple matching): a regular expression consisting of a single symbol matches a string consisting of that character.
  • (concatenation): two regular expressions placed side by side: RS matches a string if the string can be partitioned into two substrings so that R matches the first substring, S matches the second substring.
  • (alternation): two regular expressions separated by "|": R|S matches a string if R matches the string or S matches the string.
  • a regular expression inside of parenthesis (grouping): (R) matches a string if R matches the string.
  • (Kleene closure/repetition): a regular expression followed by a "*". R* is the same as (empty|R|RR|RRR|RRRR|RRRRR|...) - that is, zero or one repetitions of substrings matching R.
In addition, we also often use "R+" to mean the same thing as RR* (that is, a sequence of strings matching at least one repetition of R.

Regular expressions are amazingly common - we use them every day in normal programs. Pretty much all scripting languages include some form of regular expressions as primitives, and nearly all programming languages at least include libraries for using regular expressions.

A few examples of regular expressions:
  • (a|b|c|d)*: any string of any length made from the characters "a", "b", "c", and "d".
  • (a|b)*(c|d)*: a string of any length made of "a"s and "b"s, followed by a string of any length of "c"s and "d"s.
  • (a|b)+(c|d)+: a string of one or more "a"s and "b"s; followed by a string of any length of "c"s and "d"s.
  • aa*b*(cd)*(e|f): strings consisting of at least one A; followed any number of "b"s (including zero); followed by any number of repetitions of "cd"; followed by either a single "e" or a single "f".

Machines for Regular Languages

Regular languages can be accepted by a very simple kind of machine called a finite state automaton (FSA) or a finite state machine (FSM). A FSM consists of:
  • A set of states, S.
  • A single special distinguished state in S called the initial state, s_0.
  • A subset of the states in S called the final states, S_f.
  • A set of transition rules, each of which is a triple: (state,symbol,state).
The way that a FSM works is that it starts in its initial state, s_0. Then it looks at the first character c in the string, and tries to find a transition (s_0,c,s_x). If it finds a transition like that, then it switches to state s_x, and goes on to the next character. If at any point there is no rule for the current character in the string, then the machine rejects the input string as not in the language. If, after processing the last character, the machine is in a state in S_f, then the string is accepted as a member of the language; if it is in any state not in S_f, then the string is rejected.

So, for example: if we wanted to process the language of strings consisting of a string of at least one A, and any number of Bs, we would have three states: init, S_a, and S_b. init would, obviously, be the initial state; S_b would be the only final state.

The transition rules would be: { (init,a,S_a), (S_a, a, S_a), (S_a, b, S_b), (S_b, b, S_b) }.

FSMs can be drawn very easily; each state is drawn as a circle labelled with the state name. Each transition rule is an arrow from state to state labelled by a symbol. The start state is indicated by an arrow; and the final states use a double-line for their outline. So the machine I described up above would look like this: (Note: the diagram below is corrected from the original posting; I forgot to finish the diagram before posting it! The original version did not include the double-ring on state S_a, or the "b" arc on S_b. Thanks to commenter "Big C" for pointing this out.)

The description so far leaves one question open: can we have more than one transition rule from a given state for the same character? The answer to that is not trivial; it actually introduces something very interesting.

If you can have one rule per symbol from a state, you have a deterministic machine. If you have multiple rules per symbol from a state, you have a non-deterministic machine. In the non-deterministic machine (an NFSM), the machine accepts a string as a member of its language if any possible sequence of transitions leads to a final state.

I'll have more to say about determinism versus non-determinism in computations in another post.


  • Wow. This stuff really takes me back to my undergrad CS days. Formal languages and automata theory. One of my favorite classes ever. Lots of heads slept on this one, thinking it had no obvious use, but it seemed immediately obvious to me that you couldn't write a compiler without understanding a push down automaton, for example. You couldn't have a regular expression matching library without understanding regular languages.

    Keep em coming. Good reading.

    By Anonymous D Kruz, at 9:52 PM  

  • Great post, but I have a minor nitpick. In your diagram describing the FSM, I think you're missing a transition. Shouldn't there be a transition from S_b to S_b labeled with b?

    Also, you describe the language as:
    "strings consisting of a string of at least one A, and any number of Bs"

    But your FSM seems to only accept strings consisting of a string of at least one A and a string of *at least* one B because you've identified S_b as the only final state.

    Hope this helps.

    By Blogger Big C, at 11:06 AM  

  • Say, did you use Omnigraffle to make that diagram?

    By Anonymous John Johnson, at 12:44 PM  

  • john:

    I use Omnigraffle for *all* of my diagrams.

    By Blogger MarkCC, at 1:41 PM  

  • you still have to correct the english description of example. Both S_a and S_b are final states.

    By Anonymous Anonymous, at 5:30 AM  

Post a Comment

<< Home