Good Math/Bad Math: An Introduction to Information Theory, Part 2: Entropy

Thursday, March 09, 2006

An Introduction to Information Theory, Part 2: Entropy

Now that I've got a bit of background and origins done, it's time to get to some of the fun stuff.

As I said yesterday, both of the big branches of information theory have a concept of entropy. And while the exact presentations differ, because they're built on somewhat different theoretical foundations, the basic meaning of entropy in both branches is pretty much the same. I'm going to stick with the terminology of the Kolmogorov-Chaitin branch, but the basic idea is the same in either.

In K-C theory, we talk about the information content of strings. A string is an ordered sequence of characters from a fixed alphabet. It doesn't really matter what alphabet you choose; the point of this kind of theory isn't to come up with a specific number describing the complexity of a string, but to be able to talk about what information means in a formal algorithmic sense.

The entropy of a string is the total quantity of information encoded in that string.

Here's where it gets fun.

Another way of saying exactly the same thing is that the entropy of a string is a measure of the randomness of the string.

Structured strings are structured precisely because they contain redundancy - and redundancy does not add information.

Which string has more information: "XXXXYYYYY" or "4X5Y"? They have the same amount. One is a lot smaller than the other. But they contain the same amount of information.

Here's a third way of saying exactly the same thing: the entropy of a string is the length of the smallest compressed string that can be decompressed into the original.

Here's a better example. Go back and look at the first two paragraphs of yesterday's post. It took 514 character.

Here's the same information, compressed (using gzip) and then made
readable using a unix utility called uuencode:


M'XL(")E8$$0``VIU;FL`19$Q=JPP#$7[6≤7KIN',`M+]≤HHLPH``)\8BMOPY
M[#Y/GCDG#MLY2EI8$9H5GLX=*R(_+ZP/,-5-1#\HRJNT`77@LL,MZD,"H7LSUKDW3@$#V2
MH(KTO$Z^$%CN1Z>3L*J^?6ZW?^Y2+10;\+SOO'OC"/;7T5QA%987SY02I3I$
MLKW"W,VZ-(J$E"[$;'2KYI^\-_L./3BW.+WF3XE8)≤@D8X^U59DQ7IA*F+X/
MM?I!RJ*%FE%])Z+EXE+LSN*,P$YNX5/P,OCVG;IK=5_K&CK6J7%'+5M,R&J]
M95*W6O5EI%G^K)8B/XV#L=:5_`=5ELP#Y#\UJ??>[[DI=J''*2D];K_F230"
$`@(`````
`

That's only 465 characters; and if I didn't have to encode it to stop
it from crashing your web-browser, it would only have been 319
characters. Those characters are a lot more random than the original -
so they have a higher information density. The closer we get
to the shortest possible compressed length, the higher the information
density gets.

Actually, I'm lying slightly; there's an additional factor that needs to be considered in K-C theory.

Remember that K-C comes from the foundational mathematics of computer science, something called recursive function theory. We're talking about strings being processed algorithmically by a program. So really, we get to talk about programs that generate strings. So it's not just strings: you have a program and a data string. Really, you measure information content of a string as the total size of the smallest pairing of program and data that generates that string. But for most purposes, that doesn't make that much difference: the combination of the program and the data can be encoded into a string.

In the two examples above, the program to follow is implied by the contents of the string; if we wanted to really model it precisely, we'd need to include the programs:

For the string "XXXXYYYYY", the program is roughly:


       while there are more characters in the data:
          read a character
          output the character that you read
       end

For the string "4X5Y", the program is roughly:


      while there are more characters in the data:
         read a number n from the data
         read a character c from the data
         output c n times
      end

Now, specifics aside: the definitions I gave earlier are pretty much correct in both K-C and Shannon information theory: information content is proportional to randomness is proportional to minimum compressed length.

So - what happens if I take a string in very non-compressed format (like, say, an uncompressed digitized voice) and send it over a noisy phone line? Am I gaining information, or losing information?

The answer is gaining information. Introducing randomness into the string is adding information.

"AAAABBBB" is less information than "AAAABxBB".

The way this is used to refute bozos who claim things like "You can't create information" should be obvious.

13 Comments:

Hi.ark! I remember a POTM from talk origins that is still one of my favorites. And your description of banjo music still reverberates: What did you say? something like "it sounds like a cat caught in a clothes drier" I'll be checking in from time to time.
Bob from Yorktown Heights.

By Anonymous, at 8:39 PM
Our Fearless Blogmath explained,

"So - what happens if I take a string in very non-compressed format (like, say, an uncompressed digitized voice) and send it over a noisy phone line? Am I gaining information, or losing information?

The answer is gaining information. Introducing randomness into the string is adding information."

Aaaah... so "information" just means that, in the most objective sense; it has no subjective valuation of usefulness added to it. More information from noise added simply means more new stuff added. Got it.

Now, does an increase in information always mean an increase in complexity? I would think that adding chaos would always add complexity, but that complexity is not necessarily chaotic. From what little I understand, chaos is something like a non-repeating pattern (the word "chaos" produces a swirling snowfall in my head, but my head is a strange & synæsthetic place, so don't worry if that makes no sense to you).

BTW, enjoying this immensely; was only somewhat familiar with Shannon from the Shannon-Weaver “Transmission Model of Communication” from the linguistic side of things.

Mathematics is SO much more interesting when it goes beyond the mechanics of number crunching! I enjoyed (and did much better with) the concepts of calculus than I ever did arithmetic. It took me four years to learn my multiplication tables, and I still get "decimated" doing simple calculations. Go figure!

andrea

By Anonymous, at 9:24 PM
andrea:

To answer the question about whether adding information adds complexity, you need to define complexity. I *think* that in the K-C sense, the answer is pretty much "yes" - that is, when you add information to a string, you're increasing the complexity of the smallest program that can generate the string. But I'm not sure if that's what you mean.

WRT chaos - well, again, it comes down to how you define chaos. Some people would define chaos as randomness - in which case, it means the same thing as entropy - adding chaos would be adding information.

But on the other hand, my (limited) understanding of chaos theory in the mathematical sense is something quite different: a chaotic system is one that has highly orderly behavior, but whose behavior can be radically different given tiny pertubations in initial conditions. In *that* sense, it gets really interesting. You can think of the behavioural rules of the chaotic system as a kind of program in a computing machine, and the initial condition as the input to that machine. What you've done then is just moved the balance: information content is a combination of a program that generates an output, and a set of input data to that program: the chaotic system is a simple program which requires very elaborate data.

And finally, I (obviously) agree with you that math is a lot more fun when you get past number crunching. It's a shame that we teach people to think of math as a very plodding mechanical, memorization based thing, when in fact, it's so much more - anything that deals with structure is ultimately at least part math.

By MarkCC, at 9:38 PM
bob:

Good to hear that someone remembers my t.o days.

I don't *think* that i'm the one who said that about the banjo. I'm actually a huge banjo fan - I used to be an avid bluegrass banjo player before I developed RSI, and couldn't really feel the strings under my fingers anymore. I think it's a wonderful instrument with a great sound.

On the other hand, banjo players inevitably learn to collect banjo jokes - you either learn to love the banjo jokes, or you pretty much spend a lot of time being pissed off :-) So it's *possible* that I said that.

Incidentally, did you hear about the banjo player who had to stop on the way to a gig, and had to park his car in a bad neighborhood? He carefully parked his car, covered the banjo, and rushed to do his errands. But when he got back to the car, it was too late - someone had broken the window, and thrown another banjo in.

By MarkCC, at 9:44 PM
Some comments on Information Theory...

First, congratulations, looks like a good start to a decent blog.

Second, on Goedel, have a look at Rebecca Goldstein's recent _Incompleteness_, a good readable bio/history/explanation. Although she does finesse the proof...

Third, it was the ubiquitous Johnny von Neumann who pointed out to Shannon that his Information had the same form as thermodynamic Entropy, thus leading to man-centuries of future confusion. Strangely, they have a lot of similarities and can be expressed in terms of one-another, thus leading to even more confusion of the "information is physical" variety...

But anyway. I want to try to clear some murkiness where people get off on the wrong foot in Information Theory.

The words Information Theory and Algorithmic Complexity are unfortunate choices. For Information one should use Surprise (ala J.P. Crutchfield). It is really a measure of randomness, or how surprised you are when you get a new symbol from a stream. A completely random (if there are such) stream of symbols has the highest informational entropy, and a fully predictable stream the lowest entropy; and, in the same manner their algorithmic complexities are high and low respectively.

Unfortunately Complexity is now used in two different ways, the first being the same as entropy, and the second being similar to Memory. This second sense is generally the one that is bandied around when speaking of the Complexity of a process. Using this second sense, both predictable and random processes have low Complexity because they require no memory to predict -- presupposing that random things are not predictable to start with -- and processes that have structure, which require some memory of previous states, have higher Effective Complexity.

Oddly I just finished up a couple pages of a talk that address these issues: Not Random
and (reiteration of Surprise at end): Surprise

Using the Memory sense of Complexity one can make a better distinction between Information (entropy) and Meaning (complexity). For followup see "information" on: Computational Mechanics

MS

By Anonymous, at 11:13 PM
Mark,
We seem to be using "complexity" in much the same manner, so good there.

I'll probably drive you Nutz when I say that I think of chaos in both senses: as being entropic and as being organised in the highest order of magnitude, but not at the granular level.*

Some people may perceive those as being contradictory, but I see them as being obverse and reverse of the same coin. I suspect that at a certain near-infinite perspective the entropy ends up organising into a different manner, but I'm only reaching that intuitively, and not from any level of study.

andrea

* Lacking the official ones, I'm just pulling words out of the air, so we will have to dialog this until we reach some concensus on what we mean. *sigh*

“Philosophy - the discovery that English is a foreign language.”
~Warren Shibles

By Anonymous, at 11:49 PM
Hi, Mark - another t.o. reader here. I saw a ref to your blog at PZ's place but he didn't identify you by name. But this morning I saw Jim Lippard's blog post, so here I am. Between you three, plus John Wilkins and John Pieret (catshark), I'm getting an awful lot of t.o. links in my blog reading!

By Anonymous, at 11:33 AM
Michael: I was gonna say, the most evocative word I've come across for Shannon information is "surprisal".

Makes it a lot easier to explain what's going on, too :)

By Lifewish, at 3:31 PM
Michael:

Suprise is a great term for entropy, especially when it's being used in the Shannon sense.

Because my background is more recursive function theory oriented, I don't tend to think of the string being processed character by character. I tend to think of it as a sort of atomic whole; so using the randomness of the string is a good description.

Surprise is great when you think of the string as something which you get to examine, character by character, in sequence: then the notion of surprise - how hard it would have been for you to predict the next character based on what you'd seen before - is pretty close to a perfect intuition for what entropy means.

Coming from an RFT perspective, "complexity" is the closest thing I've been able to find to an intuitive description.

By MarkCC, at 5:04 PM
jackd:

PZ linked to me? I didn't notice that at all. I didn't even know that he'd noticed that I'd started blogging.

And I didn't even know that Lippard had a blog! Can you send me a link?

By MarkCC, at 5:08 PM
Hi,

Not sure if I grok the difference between "whole string" vs "symbol-by-symbol" analysis. Of course the Shannon Entropy is a property of the whole string, but it needs to be calculated over the probability distributions of each symbol. Maybe it's a tomato/tomahtoe thing...

But my other point, the different definitions of Complexity, is I think, important to your project of taking on the Intelligent Designers. I have seen a number of attempted refutations of the Irreducible Complexity "argument" that confuse Random processes with Complex processes. This confusion arises from the Information Theory plus Algorithmic Complexity overlap, and it leads to muddled ideas of what a Complex process really is and thus how it could develop...or evolve.

I still haven't found a good description of Irreducible Complexity and how "they" define it, so if you know of one...

I can also, I hope, reduce the complexity (sorry) of your I.D. tilting by separating out and destroying the Dembski analysis of the failure of search algorithms...I lost the index to the link to his latest paper (of last year anyway) that proves -- he seems to be a pretty good mathemetician, what little of the 35 pages that I could follow -- that not only can search algorithms not work on Every Possible Fitness Landscape, but that, for the same reasons, you can't even FIND an algorithm that works. The argument is based on the Wolpert and MacReady _No Free Lunch_ papers circa 1995, in which they also prove that no algorithm (unfortunately they focused on genetic algorithms) behaves any better than a random search ***when averaged over all possible landscapes***.

Now W&M may beg to differ on the following analysis because it reduces the importance of their NFL theorem, but basically it's a tautology. "Every Possible Fitness Landscape" includes a huge number of distributions that have no structure to find, and only a few where there _is_ something to search for, so the average performance of every algorithm is going to be dang-near-the-same.

If you do a (my definition) Complexity analysis of the landscapes, you should find a sub-set that have structure and are amenable to search by, say, a genetic algorithm. It is this Complex-but-not-Random subset that is of interest to evolution because biology takes place in a highly structured environment of air, water, rocks, and other evolving species.

Keep up the Good Work.

MS

ps...I stumbled on this Information Theory and Creationism while looking for stuffing for my last message. Might be of use...

By Anonymous, at 9:13 PM
Michael:

The symbol by symbol versus whole-string is really a difference between approaches to the theory of computation used by K-C theory.

Broadly speaking, there's two approaches to the theory of computation: automata theory, and recursive function theory.

In automata theory, you think of things in terms of a specific machine processing its input - generally something like a Turing machine processing an input tape. So looking at things that way, the "surprise" notion is extremely compelling.

The RFT approach is more reductionist. You start off by saying that the things that are potential inputs to a program - things like strings, programs, etc.) are all mappable to integers. So then there's no need to talk about anything but functions that map from integers to integers. So you've reduced things down to a very simple functional framework: the only thing you need to consider, at all, is single-parameter functions from integers to integers.

If your input is just an integer, then the "surprise" notion isn't particularly helpful, because you see the whole input at once - it's just a number.

In general, I actually prefer the automata approach; but most of my background is from the RFT approach. (My first theory of computation class was taught by an utterly amazing prof at Rutgers named Eric Allender. Eric taught using the automata approach. But I only took the one class as an undergrad. In grad school, I did a lot of theory classes, taught mainly by a seriously goofy guy named John Case; John is very much an RFT guy. In fact, John is the inventor of something called inductive learning machine theory, which is a branch of RFT that tries to look at what it means to learn something.)

By MarkCC, at 9:23 AM
Here you go, Mark:

The Lippard Blog.

Ah, it wasn't PZ that mentioned you, it was Orac. I think the similarity of formats (not much customization of the SciBlog standard) is what confused me.

By Anonymous, at 11:35 AM

Good Math/Bad Math

Thursday, March 09, 2006

An Introduction to Information Theory, Part 2: Entropy

13 Comments:

About

About Me

Previous