Good Math/Bad Math

Monday, May 01, 2006

Computer Science, Math, and Languages

Computer science is, I think, more of a branch of math than of science. The math behind computer science ultimately comes down to two things:
  • (1) What can we do?/What can we not do?
  • (2) If we can do it, how long will it take?
Those two questions define the two fundamental areas of computer science: computability and complexity. Personally, I tend to find computability more fun. That's probably partly because I like the theoretical machines; and partly because I first learned theoretical computer science from a spectacular teacher named Eric Allender, who taught from the perspective of automata theory. Learning something from a teacher like that can dramatically change your opinion of an area: they can show you the fun of something that might otherwise seem hopelessly abstract. Without Eric Allender, I might never have gone to grad school.

Anyway: One of the ways that the two kinds of fundamental CS come together is in the idea of computable language from automata theory. Computable languages are a way of specifying what a given type of computing device can do; and from the type of computing device, we can generate rough complexity bounds on how long a given task needs to take. What's particularly cool/fun about talking about computable languages is that it's both incredibly deep and fascinating; and yet, it's also very concrete. The concreteness comes about because we can "design" different kinds of machines - very concrete specific machines - which are capable of doing the computations expressed by the languages.

The idea behind computable languages is that any problem we wish to solve can ultimately be described as a language. The kind of language that I'm talking about here is purely syntactic: it is a set of strings over some alphabet, selected by a predicate. It's important to understand that the predicate is external: the string itself has no meaning. Some program or machine examining the string determines whether it is a member of the language; the meaning of the string is imposed by the machine, not by the string.

For example: there's a language which consists of all strings of any number of "a"s followed by any number of "b"s. We'd usually write that "a*b*". There's a language consisting of strings of nested parentheses, like "(((())()())(()))". There's a language made from strings of digits consisting of all the prime numbers. And so on.
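
To make that "external predicate" idea concrete, here's a minimal Python sketch (my own illustration; the function names are invented for this post) of those three languages, each expressed as a membership predicate over strings:

import re

def in_a_star_b_star(s):
    # Any number of "a"s followed by any number of "b"s.
    return re.fullmatch(r"a*b*", s) is not None

def in_balanced_parens(s):
    # Strings of properly nested parentheses, like "(((())()())(()))".
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # closed a paren that was never opened
                return False
        else:
            return False         # the alphabet contains only parentheses
    return depth == 0

def in_primes(s):
    # Strings of digits whose value is a prime number.
    if not s.isdigit():
        return False
    n = int(s)
    return n >= 2 and all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

print(in_a_star_b_star("aaabb"))     # True
print(in_balanced_parens("(()())"))  # True
print(in_primes("17"))               # True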

We can ultimately describe any computational problem in terms of a language; this reformulation of a problem as a language can change its complexity, but not the fundamental nature of what we can and can't do. To give an example of expressing a problem as a language, we can describe addition as a language of strings of the form "x+y=z", where a string is part of the language if the addition statement is true.
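
Here's a hedged sketch of that addition language, using decimal numerals as my own choice of encoding: a string "x+y=z" belongs to the language exactly when the arithmetic is true.

import re

def in_addition_language(s):
    m = re.fullmatch(r"(\d+)\+(\d+)=(\d+)", s)
    if m is None:
        return False                  # not even shaped like "x+y=z"
    x, y, z = (int(g) for g in m.groups())
    return x + y == z

print(in_addition_language("2+3=5"))  # True: in the language
print(in_addition_language("2+3=6"))  # False: right shape, wrong arithmetic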

All formal languages can be described using grammars. A grammar is a set of rules describing how a string in the language can be produced. The kind of grammar that I'm using here is variously called a "formal grammar", a "phrase structure grammar", or a Chomsky grammar.

A formal grammar consists of:
  1. An alphabet of characters, also called terminals or terminal symbols (TSs).
  2. A set of symbols distinguishable from the terminals, called non-terminal symbols (NTSs).
  3. One symbol from the set of non-terminal symbols, which is called the start symbol.
  4. A set of replacement rules which specify how to generate strings in the language by replacing one substring containing a non-terminal symbol with another substring.
The replacement rules are usually written as a sort of odd-looking equation: the left hand side specifies the substring to be replaced; the right hand side specifies what you can replace it with. In most grammars that you typically see, the left-hand side is always just a single non-terminal symbol. So, for example, a grammar with non-terminals "a" and "b", and terminals "X", "Y", and "Z" could contain a rule like:

a ::= X Y b Z

What this statement means is that when you have a string containing the NTS "a", you can replace it with the string "XYbZ" - that is, the terminals "X" and "Y", followed by the non-terminal "b", followed by the terminal "Z".

Here's an example of a complete grammar specifying the set of strings with equal numbers of Xs and Ys, with the Ys after the Xs. The terminals are "X" and "Y"; the only non-terminal is "a", so "a" is obviously the start symbol.

a ::= X a Y (rule 1)
a ::= X Y (rule 2)

This says that if you see an "a", you can replace it with either "XaY" or just "XY". Let's look at how we can use this to create a string of four Xs followed by four Ys (a short code sketch after the list replays the same steps):
  1. Start with the start symbol, "a": a
  2. Apply rule 1, replacing "a" with "XaY": XaY
  3. Apply rule 1 again: XXaYY
  4. Apply rule 1 again: XXXaYYY
  5. Apply rule 2, replacing "a" with "XY": XXXXYYYY
  6. Since there are no non-terminals left, we're done.
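
Here's a small Python sketch (my own encoding of the grammar; nothing standard about it) that replays exactly that derivation by mechanically applying the replacement rules:

RULE_1 = ("a", "XaY")    # a ::= X a Y
RULE_2 = ("a", "XY")     # a ::= X Y

def apply_rule(rule, string):
    lhs, rhs = rule
    return string.replace(lhs, rhs, 1)   # rewrite the first occurrence of the NTS

s = "a"                    # 1. start with the start symbol
s = apply_rule(RULE_1, s)  # 2. "XaY"
s = apply_rule(RULE_1, s)  # 3. "XXaYY"
s = apply_rule(RULE_1, s)  # 4. "XXXaYYY"
s = apply_rule(RULE_2, s)  # 5. "XXXXYYYY"
print(s)                   # 6. no non-terminals left, so we're done
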
There's a very natural way of dividing languages into four basic categories - and those four categories correspond to four different kinds of grammars for describing them, and four different kinds of machines that can determine whether a given string is a valid member of a particular language.

The four groups are arranged in levels, and the organization is called the Chomsky hierarchy. (And yes, it's the same Chomsky who currently spends his time on wacky politics. Back in the '50s, he did great work as a linguist. It's a damn shame he didn't stay in linguistics.)

To briefly describe them (I'll describe them in more detail later, one post each), the four levels are:
  • Level 3: the regular languages. These are very simple languages, which can't do anything particularly interesting: no counting, nothing beyond sequencing and repetition. Regular languages can be recognized by a simple kind of machine called a finite state automaton (there's a sketch of one right after this list).
  • Level 2: the context free languages (CFLs). These are widely used in practical computer science. They're reasonably expressive; you can do a limited kind of counting with them, but nothing too deep. CFLs can be recognized by a machine called a push-down automaton.
  • Level 1: the context sensitive languages (CSLs). Context sensitive languages are the ones that can be written with a more advanced kind of grammar. You can express a lot of quite deep computation using CSLs. CSLs can be recognized by a very interesting machine: a non-deterministic Turing machine with a linear-bounded tape. Level 1 includes most things that we can really compute: the linear-bounded non-deterministic machine is a pretty good match for the capability of a real computer; level 0 conceptually requires infinite memory.
  • Level 0: the recursively enumerable languages. Anything computable. Level 0 languages are languages which can be accepted by a regular, unbounded Turing machine.
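
As promised above, here's a minimal sketch of a finite state automaton (the state names are my own) for the regular language a*b*: two states, and once we've seen a "b", no further "a" is allowed.

def accepts_a_star_b_star(s):
    state = "A"              # state A: still reading leading "a"s
    for ch in s:
        if state == "A":
            if ch == "a":
                pass         # stay in A: more "a"s are fine
            elif ch == "b":
                state = "B"  # from here on, only "b"s may follow
            else:
                return False
        else:                # state B
            if ch != "b":
                return False # an "a" after a "b" (or a stray character) rejects
    return True              # both states accept; the empty string is in a*b*

print(accepts_a_star_b_star("aabbb"))  # True
print(accepts_a_star_b_star("aba"))    # False
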
There's actually something between level 0 and level 1 in the Chomsky hierarchy. A machine can do one of two things on a given input: it can accept the input, or it can decide about the input. Note that in level 0 above, I said accept.

Accepting the input means that if the input is a member of the language, the machine will eventually stop and say "Yes, this is in my language". Deciding about the input means that the machine will eventually stop and say either "Yes, this is in my language", or "No, this is not in my language". The difference is that an accepting machine does not necessarily stop if the string is not in the language. It might be able to stop and say no sometimes, but there is no guarantee.
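
To see the difference in code, here's a hedged sketch (my own toy encoding, not any standard algorithm) of a grammar-based acceptor: it searches breadth-first through derivations, halting with "yes" if it generates the target string. For a general level 0 grammar there's no bound on how long a derivation might get, so on a string outside the language the search can run forever - it accepts, but it doesn't decide.

from collections import deque

def accept(rules, start, target):
    # Breadth-first search over all derivable strings.
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s == target:
            return True              # halt: "yes, this is in my language"
        for lhs, rhs in rules:
            i = s.find(lhs)
            while i != -1:
                t = s[:i] + rhs + s[i + len(lhs):]
                # Pruning by length is safe here only because these rules
                # never shrink the string; general level 0 rules can shrink,
                # and then the search may never terminate on a "no" input.
                if t not in seen and len(t) <= len(target):
                    seen.add(t)
                    queue.append(t)
                i = s.find(lhs, i + 1)
    return False                     # search space exhausted: "no"

rules = [("a", "XaY"), ("a", "XY")]  # the X^n Y^n grammar from earlier
print(accept(rules, "a", "XXYY"))    # True
print(accept(rules, "a", "XYY"))     # False - halts only thanks to the pruning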

There is a set of languages, more complex than the context sensitive languages, which are still decidable by a Turing machine. These are called the recursive languages.

5 Comments:

  • Cool post.

    "Theory of Computability" was one of my favorite classes as an undergrad and formal languages are a bit of a hobby.

    Are you going to delve into "abacus computable" any time soon? That was the one of the three types of computability that gave me the most trouble.

    By Anonymous Anonymous, at 7:32 PM  

  • (And yes, it's the same Chomsky who currently spends his time on wacky politics. Back in the '50s, he did great work as a linguist. It's a damn shame he didn't stay in linguistics.)

    Don't mean to be persnickety, but this statement would be true if Chomsky weren't still doing linguistics. But he is - and he's been doing wacky politics for just as long as he's been doing linguistics. It's a damn shame more people don't know about his more recent work (The Minimalist Program, for instance).

    By Blogger jayinbmore, at 8:16 PM  

  • Oh no. Mr Good Math, Bad Math is yet another one of those people who think Math is not a science. See the shtetl-optimized blog for a discussion on this subject.

    By Anonymous Anonymous, at 12:04 PM  

  • Math isn't a science. Computer science is, however, because computer science is experimental math.

    Anyway, anyone who finds this stuff interesting might like "Gödel, Escher, Bach" or "The Lady or the Tiger?", by Hofstadter and Smullyan, respectively. GEB is about computability, and TLoT is about Gödel's theorem through logical puzzles. They are both quite beautiful.

    By Anonymous Anonymous, at 12:42 AM  

  • Could you kindly give some examples of
    Allender's thoughts from the ... perspective of automata theory (Allender ... who taught from the perspective of automata theory.)

    By Anonymous Anonymous, at 9:41 AM  
