After each assignment I'll send a summary of any observations or comments I might have. In general our assignments are dealing with issues that can still in some respects be considered open research questions, so I'll take a few moments to raise some of what I think are the interesting issues. You are certainly welcome to respond, ask further questions, disagree, etc. The entropy of English is a surprisingly complex area - how to formulate it, how to conduct experiments that measure it, and how to use whatever information you acquire from such formulations and experiments are all interesting questions, I think. Here are a few thoughts on issues that might merit further consideration.

1) Could entropy be used as a measure of fluency in a language? I think it probably could, because clearly if you don't know a language your entropy as determined by the Shannon game would be high (you are making lots of random guesses), whereas the more you know about a language the better you will be at guessing. (A small sketch of the game itself appears after point 4.)

2) As you play the Shannon game in an unknown language, it seems likely that you would start to learn that language in a very limited way, and begin to recognize certain sequences of characters and possibly even function words. If this happens, is that learning measurable or reflected in a decline in entropy? I don't know about this one - it seems to depend on how fine grained a measure entropy turns out to be. Is it "precise" enough to detect the limited amount of learning that someone would experience as the result of playing the Shannon game in a language they didn't know?

3) Would the entropy of an unknown language be the same as the entropy of random text? This relates a bit to 2): if indeed we start to learn a language in a limited way as we play the Shannon game, then the answer to 3) would be no. Also, if two languages are related then we would expect the answer to be no. For example, French and Spanish are both Romance languages that descend from Latin. A native speaker of Spanish may not know French, but the languages do share certain characteristics, and one might recognize common traits and be able to guess letters better as a result. We would need to presume that the unknown language and the native language use the same character set (or nearly so) as an alphabet; otherwise we'd need to somehow adjust the experiment to control for that. For example, if the unknown language is German and the known language is English, the character sets are nearly the same, so the confusion associated with German (for an English-only speaker) would be relative to the language itself. However, if Mandarin were the unknown language, then the character set/alphabet becomes a source of difficulty as well. Since we don't know the character set of the language, we would spend a certain amount of time/number of guesses learning the character set, and that adds another level of uncertainty to the process.

4) Could entropy be viewed as a "language universal"? In other words, does the per-letter entropy of a language as spoken by a native speaker of that language remain about the same, regardless of the language? Put more simply, is the entropy of English for a native speaker about the same as the entropy of any other language? Are some languages inherently more uncertain than others? Do listeners of some languages have less ability to predict what comes next? In general I would expect that human languages are "about the same" when it comes to predictability.
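Since the Shannon game comes up in every one of these points, here is a minimal sketch of one common version of it in Python. The function names and the 27-symbol alphabet (lowercase letters plus space, which is what Shannon used) are my own choices for illustration; this is only a sketch of the idea, not the program you were asked to write.

    import string

    ALPHABET = string.ascii_lowercase + " "   # 26 letters plus space

    def normalize(text):
        """Reduce text to lowercase letters and single spaces."""
        cleaned = "".join(c if c in ALPHABET else " " for c in text.lower())
        return " ".join(cleaned.split())

    def shannon_game(text):
        """Have a person guess each character in turn; return the guesses needed per character."""
        text = normalize(text)
        counts = []
        revealed = ""
        for target in text:
            guesses = 0
            while True:
                raw = input("So far: '%s' -- guess the next character: " % revealed)
                guess = raw[:1].lower() if raw else " "   # pressing Enter counts as a space guess
                guesses += 1
                if guess == target:
                    break
            revealed += target
            counts.append(guesses)
        return counts

    if __name__ == "__main__":
        counts = shannon_game("the cat sat on the mat")
        print("guess counts:", counts)
        print("average guesses per character:", sum(counts) / len(counts))

A fluent speaker should produce a list of counts dominated by 1s, while someone facing an unknown language (or random text) should produce counts that look much more uniform, which is why an entropy computed from those counts could plausibly serve as a rough measure of fluency.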
Now the Shannon game may not be a suitable way to measure the entropy of some languages - it may be dependent on, or biased in some ways towards, languages that have relatively small alphabets, as is the case with English. But possibly the experiment could be reconstructed to deal with sounds (phonemes). Generally speaking, languages don't have radically different numbers of phonemes (I think this is true, although I may be wrong). For example, English has somewhere between 30-40 basic sounds (phonemes). We can predict that certain sounds are more likely to occur together than others. For example, 'b' as in "bat" and 'c' as in "cat" clearly don't ever occur one right after the other in spoken English, so if one has heard a 'b' one would know that most likely a vowel sound will follow (as in bag, bill, box, bug...).

So those were some issues that came to mind in thinking generally about the Shannon game and the per-letter entropy of English. Now a few thoughts specific to your assignments.

Point 3) relates somewhat to the idea of transliteration. In other words, if you are playing the Shannon game with a transliterated version of your native language, then you are at some level dealing with an unfamiliar character set, and might be measuring a level of uncertainty in the character set as well as in the "predictability" of the language. In the case of this assignment the issue arose with Hindi text that was transliterated/romanized into the English alphabet. Several of you mentioned the difficulty of dealing with transliterated text, and I think it's a reasonable point. However, I think the arguments were mostly intuitive, and sometimes went so far as to say that dealing with transliterated text was like dealing with random text or an unknown language. I don't think it would be quite that bad, really. The only way to firmly establish that the transliteration was causing significant decreases in predictability would be to run experiments with Hindi script and then with transliterated text and measure the difference in entropy. In the case of this assignment I think the range of transliterated text was so limited (mostly song lyrics) and the size of the experiments so small (a few lines) that I'm not sure any conclusions drawn had much validity. I think one could make a strong case that transliteration really caused problems by doing a full range of experiments (20 sentences randomly selected from a relatively large corpus of text) and then comparing the entropy found with that of a language that you didn't know but that also shares the same character set (Spanish, French, German, etc.). If you found that the entropy of transliterated Hindi was about the same as that of German (assuming you don't know German), then I think you would have a reasonable claim that the transliterated text was rendering the language "unintelligible" to you.

The more general point is that while intuitive arguments often make a great deal of sense and may actually turn out to be correct, they are not sufficient for our purposes. We are empiricists now, and want to gather experimental evidence rather than relying on introspection. So certainly allow your intuitions to guide the hypotheses you might wish to study, but then make sure you draw conclusions based on your experimental evidence and not simply your intuitions. In some cases this is exactly what you did, and it was very effective.
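To make that kind of comparison concrete, here is a sketch in Python of how two sets of Shannon game results might be turned into numbers that can be compared. The guess counts below are invented purely for illustration (they are not data from anyone's experiment), and the upper and lower bound formulas are my transcription of the ones given in Shannon's 1951 paper "Prediction and Entropy of Printed English", so check them against the paper itself rather than taking my word for them.

    import math
    from collections import Counter

    def guess_frequencies(guess_counts, alphabet_size=27):
        """Relative frequency q_i of the correct letter being found on the i-th guess."""
        totals = Counter(guess_counts)
        n = len(guess_counts)
        return [totals.get(i, 0) / n for i in range(1, alphabet_size + 1)]

    def shannon_bounds(guess_counts, alphabet_size=27):
        """Lower and upper bounds on per-letter entropy, as I read Shannon (1951)."""
        q = guess_frequencies(guess_counts, alphabet_size)
        q.append(0.0)                                        # q_{N+1} = 0
        upper = -sum(qi * math.log2(qi) for qi in q if qi > 0)
        lower = sum(i * (q[i - 1] - q[i]) * math.log2(i)
                    for i in range(1, alphabet_size + 1))
        return lower, upper

    # Hypothetical guess counts from two small experiments (made up for illustration).
    transliterated_hindi = [1, 4, 2, 9, 1, 3, 7, 1, 5, 2, 6, 1, 8, 3, 1, 2]
    unknown_german       = [2, 5, 1, 8, 3, 1, 6, 2, 9, 1, 4, 2, 7, 1, 3, 5]

    for name, counts in [("transliterated Hindi", transliterated_hindi),
                         ("German (unknown to the guesser)", unknown_german)]:
        lo, hi = shannon_bounds(counts)
        print("%-32s lower ~ %.2f bits, upper ~ %.2f bits" % (name, lo, hi))

The point is simply that each condition yields numbers that can be compared. If the estimates for transliterated Hindi came out close to those for a genuinely unknown language such as German, that would be reasonably strong evidence that the transliteration was doing real damage to predictability; if they came out much lower, the intuitive argument would be weakened.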
In other cases I think you may have conducted the experiments but did not really look at the results, and instead found a way to have the results support your initial intuitions or common sense. It's sometimes hard to avoid this, but generally speaking we should try to. Let your conclusions come from the experiments, and let your experiments be carefully done so that the conclusions you draw rest on a strong foundation.

Finally, a few comments on what seemed to be a rather popular formulation of the Entropy of English. As I understood and observed it, some of you interpreted the Entropy of English to be measured by the consistency in the number of guesses that a user makes to figure out a word or sentence. So if a user always guesses the same number of times for each letter, then the entropy will be low. In particular, the scenario we talked about was my test case of a four letter alphabet (A, B, C, D) where the string to be guessed was DDDDD. If I take 4 guesses to get each letter correct (guessing first A, then B, then C, then D), I would presume that this means my uncertainty about the language is high, and should lead to a higher entropy than if I take only one guess to get the letters right. I made the point that if I take four guesses to get each letter of a word expressed in a 4 letter alphabet correct, then entropy is at its maximum value log2(4) = 2.00; more generally, for an N character alphabet, the maximum per-letter entropy is log2(N). Thus, for English the maximum entropy (as argued by Shannon at least) is log2(27), which is about 4.7 I think. The minimum entropy would be 0, and that would be achieved when there is no uncertainty in the guessing of the letters (when they can always be guessed in just 1 try), since log2(1) = 0. This was my argument at least: entropy should be at its highest when the number of guesses is highest, and lowest when the number of guesses is minimal.

The alternative argument seemed to be that we want to measure the uncertainty surrounding the number of guesses. That is, if we always knew that the number of guesses would be N, then there is no variation in the number of guesses and therefore minimal uncertainty. I understand the argument; however, I don't think it is measuring the Entropy of English (on a per letter basis). Rather, it is measuring the entropy in the number of guesses, which doesn't have a clear relation (at least to me) to the Entropy of English. Let's follow the consequences of this argument to their logical conclusions. If the sequence of guesses is 10 10 10 10 10 10, then this view holds that the entropy should be 0, since there is no variation in the number of guesses. This would be the same as if the sequence of guesses was 20 20 20 20 20 20. If a sequence of guesses is 10 10 10 10 10 9, then this should have the same entropy (I think) as the case of 20 20 20 20 20 19, since both involve 2 different numbers of guesses (10 and 9, and 20 and 19). To me, it seems like this formulation of entropy reduces down to figuring out how many different numbers of guesses it takes to guess the words. It seems to me that over a reasonably sized experiment we are inevitably going to observe nearly all of the possible numbers of guesses, and so this formulation is always going to produce an entropy close to log2(N), where N is the number of characters in the alphabet. I may have missed some detail here, but in the end I think this formulation doesn't really end up telling us too much. Of course if I'm wrong you are welcome to correct me!
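To pin down the alternative formulation as I understood it, here is a small Python sketch that treats the observed numbers of guesses as the outcomes and computes the entropy of their distribution. The function name is my own; this is just my attempt to make that interpretation explicit enough that its consequences can be checked.

    import math
    from collections import Counter

    def guess_count_entropy(guess_counts):
        """Entropy (in bits) of the distribution of the guess counts themselves."""
        n = len(guess_counts)
        return -sum((c / n) * math.log2(c / n) for c in Counter(guess_counts).values())

    # The examples discussed above: constant guess counts always give 0 bits, whether
    # the constant is 10 or 20, and 10 10 10 10 10 9 scores the same as 20 20 20 20 20 19.
    for seq in ([10] * 6, [20] * 6, [10] * 5 + [9], [20] * 5 + [19]):
        print(seq, "->", round(guess_count_entropy(seq), 3), "bits")

Running this gives 0 bits for both constant sequences and identical values for the 10/9 and 20/19 cases, which is exactly the behavior described above. The same function can be applied to the sequences in the consistency check that follows.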
For this (or any formulation) to be valid it has to behave consistently. Under this alternative formulation, the entropy of the following sequences of guess counts should gradually go from very low to high:

10 10 10 10 10 10   (lowest)
10 10 10 10 10  9
10 10 10 10  8  9
10 10 10  7  8  9
10 10  6  7  8  9
10  5  6  7  8  9
 4  5  6  7  8  9   (highest)

I actually checked a few programs that seemed to be taking this alternative viewpoint, and found that they were not consistent in this respect. In any case, since there seemed to be a number of people who took this view, I thought it was worth discussing a bit. I recommend taking another look at the Shannon paper just to see if you can figure out his formulation of entropy. Start by thinking about it from the upper and lower bounds, and then work your way into the various other cases.

Ted Pedersen
October 17, 2002