Regular Languages
As I said yesterday, the Chomsky hierarchy specifies four basic types of languages. The simplest one, level 3, is the family of regular languages.
Regular languages are very simple little beasts. No counting, no deep patterns, and very little computational power. A machine that processes members of a regular language can always accept or reject a string in a number of steps equal to the number of symbols in the string.
Specifying Regular Languages
There are two commonly used ways of specifying regular languages: regular grammars, and regular expressions.
Regular grammars come in two flavors: left regular and right regular. A right-regular grammar is a phrase structure grammar where the left-hand side of every replacement rule (also called a production) is one nonterminal, and the right-hand side is a sequence of terminals, optionally followed by a single nonterminal.
For example, the following are all valid right-regular grammar productions (I'm writing nonterminals as lowercase italic letters, and terminals as uppercase):
a ::= XYb
b ::= WX
c ::= WXZYc
d ::= Xa
d ::= X
And the following are not right-regular productions:
a ::= XbY (nonterminal not at end)
b ::= aWX (nonterminal not at end)
c ::= WXZYcd (more than one nonterminal)
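The rule for right-regular productions is mechanical enough to check in a few lines. Here is a minimal Python sketch, assuming the same convention as the examples above (single lowercase letters are nonterminals, uppercase letters are terminals). The function name `is_right_regular` is made up for illustration, and the sketch assumes that an empty right-hand side counts as a (zero-length) sequence of terminals.

```python
def is_right_regular(lhs: str, rhs: str) -> bool:
    """Check one production lhs ::= rhs against the right-regular rule.

    Assumed convention: lowercase letters are nonterminals,
    uppercase letters are terminals.
    """
    # The left-hand side must be exactly one nonterminal.
    if len(lhs) != 1 or not lhs.islower():
        return False
    # Strip the optional single nonterminal at the very end.
    body = rhs[:-1] if rhs and rhs[-1].islower() else rhs
    # Whatever remains must consist only of terminals.
    return all(ch.isupper() for ch in body)


# The valid and invalid productions from the lists above.
valid = [("a", "XYb"), ("b", "WX"), ("c", "WXZYc"), ("d", "Xa"), ("d", "X")]
invalid = [("a", "XbY"), ("b", "aWX"), ("c", "WXZYcd")]
print(all(is_right_regular(l, r) for l, r in valid))
print(any(is_right_regular(l, r) for l, r in invalid))
```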
Alternatively, we can specify regular languages using regular expressions. A regular expression is:
 (simple matching): a single symbol, which matches a string consisting of just that symbol.
 (concatenation): two regular expressions placed side by side: RS matches a string if the string can be partitioned into two substrings so that R matches the first substring and S matches the second substring.
 (alternation): two regular expressions separated by "|": R|S matches a string if R matches the string or S matches the string.
 (grouping): a regular expression inside parentheses: (R) matches a string if R matches the string.
 (Kleene closure/repetition): a regular expression followed by a "*": R* is the same as (empty|R|RR|RRR|...); that is, it matches zero or more repetitions of substrings matching R.
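To make the definition concrete, here is a small Python sketch of a matcher that follows the rules one case at a time. It is not how real regular expression libraries work. The tuple encoding (`"sym"`, `"cat"`, `"alt"`, `"star"`) and the function names are assumptions invented for this illustration, and grouping is implicit in the nesting of the tuples.

```python
def match_ends(r, s, i=0):
    """Return every index j such that regex r matches the substring s[i:j]."""
    kind = r[0]
    if kind == "sym":   # simple matching: one symbol
        return {i + 1} if s[i:i + 1] == r[1] else set()
    if kind == "cat":   # concatenation: R, then S from wherever R stopped
        return {k for j in match_ends(r[1], s, i)
                for k in match_ends(r[2], s, j)}
    if kind == "alt":   # alternation: either R or S
        return match_ends(r[1], s, i) | match_ends(r[2], s, i)
    if kind == "star":  # Kleene closure: zero or more repetitions of R
        ends, frontier = {i}, {i}      # zero repetitions always matches
        while frontier:
            step = set()
            for j in frontier:
                step |= match_ends(r[1], s, j)
            frontier = step - ends     # only follow positions not seen yet
            ends |= frontier
        return ends
    raise ValueError(f"unknown node: {kind!r}")


def matches(r, s):
    """True if r matches the whole string s."""
    return len(s) in match_ends(r, s)


# (a|b)*c, written as nested tuples
regex = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "c"))
print(matches(regex, "abbac"), matches(regex, "c"), matches(regex, "abca"))
```

Tracking sets of end positions, rather than backtracking, keeps the star case from looping forever when the inner expression can match the empty string.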
Regular expressions are amazingly common; we use them every day in normal programs. Pretty much all scripting languages include some form of regular expressions as primitives, and nearly all programming languages at least include libraries for using regular expressions.
A few examples of regular expressions:

(a|b|c|d)*: any string of any length made from the characters "a", "b", "c", and "d".
(a|b)*(c|d)*: a string of any length made of "a"s and "b"s, followed by a string of any length of "c"s and "d"s.
(a|b)+(c|d)+: a string of one or more "a"s and "b"s, followed by a string of one or more "c"s and "d"s. (Here R+ is shorthand for RR*, that is, one or more repetitions.)
aa*b*(cd)*(e|f): strings consisting of at least one "a"; followed by any number of "b"s (including zero); followed by any number of repetitions of "cd"; followed by either a single "e" or a single "f".
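These examples can be tried directly with Python's standard `re` module, whose syntax for alternation, grouping, `*` and `+` matches the notation used here. The sketch below assumes the alternation bars in the patterns (`(a|b|c|d)*` and so on), and the helper name `in_language` is made up for illustration.

```python
import re


def in_language(pattern: str, text: str) -> bool:
    # fullmatch requires the whole string to match, not just a prefix.
    return re.fullmatch(pattern, text) is not None


print(in_language(r"(a|b|c|d)*", "dcba"))            # only a, b, c, d
print(in_language(r"(a|b)*(c|d)*", "abcd"))          # a/b block, then c/d block
print(in_language(r"(a|b)+(c|d)+", "ab"))            # fails: needs a c or d
print(in_language(r"aa*b*(cd)*(e|f)", "aabbcdcdf"))  # a's, b's, cd's, then f
```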
Machines for Regular Languages
Regular languages can be accepted by a very simple kind of machine called a finite state automaton (FSA) or a finite state machine (FSM). An FSM consists of:
 A set of states, S.
 A single special distinguished state in S called the initial state, s_0.
 A subset of the states in S called the final states, S_f.
 A set of transition rules, each of which is a triple: (state,symbol,state).
So, for example: if we wanted to process the language of strings consisting of a string of at least one A, and any number of Bs, we would have three states: init, S_a, and S_b. init would, obviously, be the initial state; S_a and S_b would both be final states, since a string of As followed by no Bs at all is still in the language. The transition rules would be:
{ (init,a,S_a), (S_a, a, S_a), (S_a, b, S_b), (S_b, b, S_b) }
FSMs can be drawn very easily; each state is drawn as a circle labelled with the state name. Each transition rule is an arrow from state to state labelled by a symbol. The start state is indicated by an arrow, and the final states use a double line for their outline. So the machine I described up above would look like this: (Note: the diagram below is corrected from the original posting; I forgot to finish the diagram before posting it! The original version did not include the double ring on state S_a, or the "b" arc on S_b. Thanks to commenter "Big C" for pointing this out.)
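The machine described above is small enough to run by hand, but it can also be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the name `accepts` are assumptions, and both S_a and S_b are treated as final states, as in the corrected diagram.

```python
# Transition rules (state, symbol) -> next state, copied from the text.
TRANSITIONS = {
    ("init", "a"): "S_a",
    ("S_a", "a"): "S_a",
    ("S_a", "b"): "S_b",
    ("S_b", "b"): "S_b",
}
START = "init"
FINAL = {"S_a", "S_b"}  # both final, per the corrected diagram


def accepts(text: str) -> bool:
    state = START
    for symbol in text:  # exactly one step per input symbol
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no rule applies: reject immediately
            return False
    return state in FINAL


for s in ["a", "aab", "aabbb", "", "b", "aba"]:
    print(repr(s), accepts(s))
```

The loop makes the linear-time claim visible: the machine takes one transition per symbol and never looks back.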
The description so far leaves one question open: can we have more than one transition rule from a given state for the same character? The answer to that is not trivial; it actually introduces something very interesting.
If you can have at most one rule per symbol from a state, you have a deterministic machine. If you can have multiple rules for the same symbol from a state, you have a nondeterministic machine. In the nondeterministic machine (an NFSM), the machine accepts a string as a member of its language if any possible sequence of transitions leads to a final state.
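One common way to simulate "any possible sequence of transitions" is to track the set of all states the machine could be in after each symbol. The Python sketch below does that. The example machine (strings of "a"s and "b"s that end in "ab") and all of the names in it are hypothetical, chosen only because state q0 has two rules for the symbol "a".

```python
def nfa_accepts(transitions, start, final, text):
    """Simulate a nondeterministic machine by tracking every reachable state."""
    current = {start}
    for symbol in text:
        # Follow every rule that applies from every state we might be in.
        current = {nxt
                   for state in current
                   for nxt in transitions.get((state, symbol), ())}
    # Accept if at least one possible path ended in a final state.
    return bool(current & final)


# Hypothetical NFSM for strings ending in "ab": q0 has two rules for "a".
ENDS_IN_AB = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}

for s in ["ab", "aab", "abab", "aba", "b", ""]:
    print(repr(s), nfa_accepts(ENDS_IN_AB, "q0", {"q2"}, s))
```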
I'll have more to say about determinism versus nondeterminism in computations in another post.
5 Comments:
Wow. This stuff really takes me back to my undergrad CS days. Formal languages and automata theory. One of my favorite classes ever. Lots of heads slept on this one, thinking it had no obvious use, but it seemed immediately obvious to me that you couldn't write a compiler without understanding a push down automaton, for example. You couldn't have a regular expression matching library without understanding regular languages.
Keep em coming. Good reading.
By D Kruz, at 9:52 PM
Great post, but I have a minor nitpick. In your diagram describing the FSM, I think you're missing a transition. Shouldn't there be a transition from S_b to S_b labeled with b?
Also, you describe the language as:
"strings consisting of a string of at least one A, and any number of Bs"
But your FSM seems to only accept strings consisting of a string of at least one A and a string of *at least* one B because you've identified S_b as the only final state.
Hope this helps.
By Big C, at 11:06 AM
Say, did you use Omnigraffle to make that diagram?
By John Johnson, at 12:44 PM
john:
I use Omnigraffle for *all* of my diagrams.
By MarkCC, at 1:41 PM
you still have to correct the english description of example. Both S_a and S_b are final states.
By Anonymous, at 5:30 AM