After each assignment I'll send a summary of any observations or comments I might have. In general our assignments are dealing with issues that can still in some respects be considered open research questions, so I'll take a few moments to raise some of what I think are the interesting issues. You are certainly welcome to respond, ask further questions, disagree, etc. The entropy of English is a surprisingly complex area - how to formulate it, how to conduct experiments that measure it, and how to use whatever information you acquire from such formulations and experiments are all interesting questions, I think. Here are a few thoughts on issues that might merit further consideration.

1) Could entropy be used as a measure of fluency in a language? I think it probably could, because clearly if you don't know a language your entropy as determined by the Shannon game would be high (you are making lots of random guesses), whereas the more you know about a language the better you will be at guessing. (A small sketch of the game itself appears after point 4.)

2) As you play the Shannon game in an unknown language, it seems likely that you would start to learn that language in a very limited way, and begin to recognize certain sequences of characters and possibly even function words. If this happens, is that learning measurable or reflected in a decline in entropy? I don't know about this one - it seems to depend on how fine grained a measure entropy turns out to be. Is it "precise" enough to detect the limited amount of learning that someone would experience as the result of playing the Shannon game in a language they didn't know?

3) Would the entropy of an unknown language be the same as the entropy of random text? This relates a bit to 2): if indeed we start to learn a language in a limited way as we play the Shannon game, then the answer to 3) would be no. Also, if two languages are related then we would expect the answer to be no. For example, French and Spanish are both Romance languages that descend from Latin. A native speaker of Spanish may not know French, but the languages do share certain characteristics, and one might recognize common traits and be able to guess letters better as a result. We would need to presume that the unknown language and the native language use the same character set (or nearly so) as an alphabet; otherwise we'd need to somehow adjust the experiment to control for that. For example, if the unknown language is German and the known language is English, the character sets are nearly the same, so the confusion associated with German (for an English-only speaker) would be relative to the language itself. However, if Mandarin were the unknown language, then the character set/alphabet becomes a source of difficulty as well. Since we don't know the character set of the language, we would spend a certain amount of time/number of guesses learning the character set, and that adds another level of uncertainty to the process.

4) Could entropy be viewed as a "language universal"? In other words, does the per-letter entropy of a language as spoken by a native speaker of that language remain about the same, regardless of the language? Put more simply, is the entropy of English for a native speaker about the same as the entropy of any other language? Are some languages inherently more uncertain than others? Do listeners of some languages have less ability to predict what comes next? In general I would expect that human languages are "about the same" when it comes to predictability.
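Since the Shannon game comes up in every one of these points, here is a minimal sketch of one common version of it in Python. The function names and the 27-symbol alphabet (lowercase letters plus space, which is what Shannon used) are my own choices for illustration; this is only a sketch of the idea, not the program you were asked to write.

    import string

    ALPHABET = string.ascii_lowercase + " "   # 26 letters plus space

    def normalize(text):
        """Reduce text to lowercase letters and single spaces."""
        cleaned = "".join(c if c in ALPHABET else " " for c in text.lower())
        return " ".join(cleaned.split())

    def shannon_game(text):
        """Have a person guess each character in turn; return the guesses needed per character."""
        text = normalize(text)
        counts = []
        revealed = ""
        for target in text:
            guesses = 0
            while True:
                raw = input("So far: '%s' -- guess the next character: " % revealed)
                guess = raw[:1].lower() if raw else " "   # pressing Enter counts as a space guess
                guesses += 1
                if guess == target:
                    break
            revealed += target
            counts.append(guesses)
        return counts

    if __name__ == "__main__":
        counts = shannon_game("the cat sat on the mat")
        print("guess counts:", counts)
        print("average guesses per character:", sum(counts) / len(counts))

A fluent speaker should produce a list of counts dominated by 1s, while someone facing an unknown language (or random text) should produce counts that look much more uniform, which is why an entropy computed from those counts could plausibly serve as a rough measure of fluency.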
Now the Shannon game may not be a suitable way to measure the entropy of some languages - it may be dependent on, or biased in some ways towards, languages that have relatively small alphabets, as is the case with English. But possibly the experiment could be reconstructed to deal with sounds (phonemes). Generally speaking, languages don't have radically different numbers of phonemes (I think this is true, although I may be wrong). For example, English has somewhere between 30-40 basic sounds (phonemes). We can predict that certain sounds are more likely to occur together than others. For example, 'b' as in "bat" and 'c' as in "cat" clearly don't ever occur one right after the other in spoken English, so if one has heard a 'b' one would know that most likely a vowel sound will follow (as in bag, bill, box, bug...).

So those were some issues that came to mind in thinking generally about the Shannon game and the per-letter entropy of English. Now a few thoughts specific to your assignments.

Point 3) relates somewhat to the idea of transliteration. In other words, if you are playing the Shannon game with a transliterated version of your native language, then you are at some level dealing with an unfamiliar character set, and might be measuring a level of uncertainty in the character set as well as in the "predictability" of the language. In the case of this assignment the issue arose with Hindi text that was transliterated/romanized into the English alphabet. Several of you mentioned the difficulty of dealing with transliterated text, and I think it's a reasonable point. However, I think the arguments were mostly intuitive, and sometimes went so far as to say that dealing with transliterated text was like dealing with random text or an unknown language. I don't think it would be quite that bad, really. The only way to firmly establish that the transliteration was causing significant decreases in predictability would be to run experiments with Hindi script and then with transliterated text and measure the difference in entropy. In the case of this assignment I think the range of transliterated text was so limited (mostly song lyrics) and the size of the experiments so small (a few lines) that I'm not sure any conclusions drawn had much validity. I think one could make a strong case that transliteration really caused problems by doing a full range of experiments (20 sentences randomly selected from a relatively large corpus of text) and then comparing the entropy found with that of a language that you didn't know but that also shares the same character set (Spanish, French, German, etc.). If you found that the entropy of transliterated Hindi was about the same as that of German (assuming you don't know German), then I think you would have a reasonable claim that the transliterated text was rendering the language "unintelligible" to you.

The more general point is that while intuitive arguments often make a great deal of sense and may actually turn out to be correct, they are not sufficient for our purposes. We are empiricists now, and want to gather experimental evidence rather than relying on introspection. So certainly allow your intuitions to guide the hypotheses you might wish to study, but then make sure you draw conclusions based on your experimental evidence and not simply your intuitions. In some cases this is exactly what you did, and it was very effective.
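To make that kind of comparison concrete, here is a sketch in Python of how two sets of Shannon game results might be turned into numbers that can be compared. The guess counts below are invented purely for illustration (they are not data from anyone's experiment), and the upper and lower bound formulas are my transcription of the ones given in Shannon's 1951 paper "Prediction and Entropy of Printed English", so check them against the paper itself rather than taking my word for them.

    import math
    from collections import Counter

    def guess_frequencies(guess_counts, alphabet_size=27):
        """Relative frequency q_i of the correct letter being found on the i-th guess."""
        totals = Counter(guess_counts)
        n = len(guess_counts)
        return [totals.get(i, 0) / n for i in range(1, alphabet_size + 1)]

    def shannon_bounds(guess_counts, alphabet_size=27):
        """Lower and upper bounds on per-letter entropy, as I read Shannon (1951)."""
        q = guess_frequencies(guess_counts, alphabet_size)
        q.append(0.0)                                        # q_{N+1} = 0
        upper = -sum(qi * math.log2(qi) for qi in q if qi > 0)
        lower = sum(i * (q[i - 1] - q[i]) * math.log2(i)
                    for i in range(1, alphabet_size + 1))
        return lower, upper

    # Hypothetical guess counts from two small experiments (made up for illustration).
    transliterated_hindi = [1, 4, 2, 9, 1, 3, 7, 1, 5, 2, 6, 1, 8, 3, 1, 2]
    unknown_german       = [2, 5, 1, 8, 3, 1, 6, 2, 9, 1, 4, 2, 7, 1, 3, 5]

    for name, counts in [("transliterated Hindi", transliterated_hindi),
                         ("German (unknown to the guesser)", unknown_german)]:
        lo, hi = shannon_bounds(counts)
        print("%-32s lower ~ %.2f bits, upper ~ %.2f bits" % (name, lo, hi))

The point is simply that each condition yields numbers that can be compared. If the estimates for transliterated Hindi came out close to those for a genuinely unknown language such as German, that would be reasonably strong evidence that the transliteration was doing real damage to predictability; if they came out much lower, the intuitive argument would be weakened.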
In other cases I think you may have conducted the experiments but did not really look at the results, and instead found a way to have the results support your initial intuitions or common sense. It's sometimes hard to avoid this, but generally speaking we should try to. Let your conclusions come from the experiments, and let your experiments be carefully done so that the conclusions you draw rest on a strong foundation.

Finally, a few comments on what seemed to be a rather popular formulation of the Entropy of English. As I understood and observed it, some of you interpreted the Entropy of English to be measured by the consistency in the number of guesses that a user makes to figure out a word or sentence. So if a user always guesses the same number of times for each letter, then the entropy will be low. In particular, the scenario we talked about was my test case of a four letter alphabet (A, B, C, D) where the string to be guessed was DDDDD. If I take 4 guesses to get each letter correct (guessing first A, then B, then C, then D), I would presume that this means my uncertainty about the language is high, and should lead to a higher entropy than if I take only one guess to get the letters right. I made the point that if I take four guesses to get each letter of a word expressed in a 4 letter alphabet correct, then entropy is at its maximum value log2(4) = 2.00; more generally, for an N character alphabet, the maximum per-letter entropy is log2(N). Thus, for English the maximum entropy (as argued by Shannon at least) is log2(27), which is about 4.7 I think. The minimum entropy would be 0, and that would be achieved when there is no uncertainty in the guessing of the letters (when they can always be guessed in just 1 try), since log2(1) = 0. This was my argument at least: entropy should be at its highest when the number of guesses is highest, and lowest when the number of guesses is minimal.

The alternative argument seemed to be that we want to measure the uncertainty surrounding the number of guesses. That is, if we always knew that the number of guesses would be N, then there is no variation in the number of guesses and therefore minimal uncertainty. I understand the argument; however, I don't think it is measuring the Entropy of English (on a per letter basis). Rather, it is measuring the entropy in the number of guesses, which doesn't have a clear relation (at least to me) to the Entropy of English. Let's follow the consequences of this argument to their logical conclusions. If the sequence of guesses is 10 10 10 10 10 10, then this view holds that the entropy should be 0, since there is no variation in the number of guesses. This would be the same as if the sequence of guesses was 20 20 20 20 20 20. If a sequence of guesses is 10 10 10 10 10 9, then this should have the same entropy (I think) as the case of 20 20 20 20 20 19, since both involve 2 different numbers of guesses (10 and 9, and 20 and 19). To me, it seems like this formulation of entropy reduces down to figuring out how many different numbers of guesses it takes to guess the words. It seems to me that over a reasonably sized experiment we are inevitably going to observe nearly all of the possible numbers of guesses, and so this formulation is always going to produce an entropy close to log2(N), where N is the number of characters in the alphabet. I may have missed some detail here, but in the end I think this formulation doesn't really end up telling us too much. Of course if I'm wrong you are welcome to correct me!
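To pin down the alternative formulation as I understood it, here is a small Python sketch that treats the observed numbers of guesses as the outcomes and computes the entropy of their distribution. The function name is my own; this is just my attempt to make that interpretation explicit enough that its consequences can be checked.

    import math
    from collections import Counter

    def guess_count_entropy(guess_counts):
        """Entropy (in bits) of the distribution of the guess counts themselves."""
        n = len(guess_counts)
        return -sum((c / n) * math.log2(c / n) for c in Counter(guess_counts).values())

    # The examples discussed above: constant guess counts always give 0 bits, whether
    # the constant is 10 or 20, and 10 10 10 10 10 9 scores the same as 20 20 20 20 20 19.
    for seq in ([10] * 6, [20] * 6, [10] * 5 + [9], [20] * 5 + [19]):
        print(seq, "->", round(guess_count_entropy(seq), 3), "bits")

Running this gives 0 bits for both constant sequences and identical values for the 10/9 and 20/19 cases, which is exactly the behavior described above. The same function can be applied to the sequences in the consistency check that follows.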
For this (or any formulation) to be valid it has to behave consistently. Under this alternative formulation, the entropy of the following sequences of guess counts should gradually go from very low to high:

10 10 10 10 10 10   (lowest)
10 10 10 10 10  9
10 10 10 10  8  9
10 10 10  7  8  9
10 10  6  7  8  9
10  5  6  7  8  9
 4  5  6  7  8  9   (highest)

I actually checked a few programs that seemed to be taking this alternative viewpoint, and found that they were not consistent in this respect. In any case, since there seemed to be a number of people who took this view, I thought it was worth discussing a bit. I recommend taking another look at the Shannon paper just to see if you can figure out his formulation of entropy. Start by thinking about it from the upper and lower bounds, and then work your way into the various other cases.

Ted Pedersen
October 17, 2002