CS 8761 Natural Language Processing - Fall 2002

Assignment 1 - Due Mon, September 30, noon

This may be revised in response to your questions. Last update Sun September 29 6:00 pm

Objectives

To compute the entropy of English and of a second language, and to interpret those results based on several experiments you will conduct with your code. You may also gain an appreciation for the difficulties of sentence boundary detection.

Specification

Write a Perl program that will play the "Shannon game". Your program should accept as input some number of randomly selected sentences from a text file. The user must then guess each sentence, letter by letter, so that the entropy can be computed.

Your Shannon Game program should display one sentence at a time, with every position in the sentence covered by a "*" character. When the user guesses the correct letter, it should be displayed in place of the "*". When the user has guessed the complete sentence, the total entropy for that sentence should be displayed, as should a running total that reflects the entropy across all the sentences processed thus far.
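The covered display described above can be sketched as follows. This is only an illustration written in Python (the assignment itself requires Perl). It assumes positions are guessed left to right, and the function name `masked_view` is hypothetical.

```python
def masked_view(sentence, revealed_count):
    """Show the first revealed_count characters and cover the rest with '*'.

    Assumption: positions are guessed from left to right, so the revealed
    part is always a prefix of the sentence. Spaces are covered as well.
    """
    hidden = len(sentence) - revealed_count
    return sentence[:revealed_count] + "*" * hidden


if __name__ == "__main__":
    sentence = "CALL ME ISHMAEL"
    # Print the display before any guess, part way through, and at the end.
    for revealed in (0, 4, len(sentence)):
        print(masked_view(sentence, revealed))
```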

You should not assume that sentence boundaries have been determined in your input text. Thus, you should provide a sentence boundary detector program to create the input to your Shannon game program. Your sentence boundary detection program should also convert the text to upper case and remove all punctuation and non-alpha characters except for spaces. Please note that sentence boundary detection is a difficult problem in its own right, so you should not be surprised if you have an imperfect solution. That is ok, as long as you are able to get the simple cases correct.
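As a rough illustration of the simple cases, the sketch below splits text after ".", "!" or "?" and then normalizes each sentence. It is written in Python for illustration only (the assignment requires Perl). The function names are hypothetical, and collapsing runs of spaces into one is an assumption rather than a stated requirement. Note that accented letters are dropped by this filter, and abbreviations such as "Mr." will be split incorrectly.

```python
import re


def split_sentences(text):
    """Naively split after '.', '!' or '?' when whitespace follows.

    This handles simple cases only; 'Mr. Smith' would be split in two.
    """
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


def normalize(sentence):
    """Upper-case the sentence and keep only the letters A-Z and spaces."""
    # Turn newlines and tabs into spaces first so words are not glued together.
    spaced = re.sub(r"\s+", " ", sentence)
    letters_only = re.sub(r"[^A-Z ]", "", spaced.upper())
    # Assumption: collapse any runs of spaces left behind by removed characters.
    return re.sub(r" +", " ", letters_only).strip()


if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago, never mind how long!"
    for sentence in split_sentences(sample):
        print(normalize(sentence))
```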

These two programs should be run as follows from the command line:

sentence.pl 10 mobydick.txt | shannon.pl outfile1 outfile2 outfile3
sentence.pl will select 10 random sentences from mobydick.txt and pass them to shannon.pl which should display them one by one to the user (where the actual letters and spaces have been covered with "*" characters). outfile1 outfile2 and outfile3 (whose names are specified by the user) are output files that are described below.
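The random selection step might look like the sketch below. It is in Python purely for illustration (the assignment requires Perl). The name `pick_sentences` is hypothetical, and sampling without replacement is an assumption, since the text above only says "random sentences".

```python
import random


def pick_sentences(sentences, count, rng=random):
    """Return up to count sentences chosen at random, without replacement.

    If fewer than count sentences are available, all of them are returned
    in random order.
    """
    count = min(count, len(sentences))
    return rng.sample(sentences, count)


if __name__ == "__main__":
    pool = ["CALL ME ISHMAEL", "IT WAS A DARK NIGHT", "THE WHALE ROSE"]
    for sentence in pick_sentences(pool, 2):
        print(sentence)
```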

If the user happens to guess the same letter twice for the same position, don't count this more than once. Please keep track of the letters guessed for each position to avoid double counting.
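One way to avoid double counting is to keep a set of the letters already tried at the current position. The sketch below is in Python for illustration only (the assignment requires Perl). The function name is hypothetical, and it assumes `correct_letter` is already upper case.

```python
def count_distinct_guesses(guesses, correct_letter):
    """Count distinct guesses up to and including the correct one.

    A letter guessed more than once at the same position is counted once.
    """
    seen = set()
    for guess in guesses:
        letter = guess.upper()
        seen.add(letter)
        if letter == correct_letter:
            return len(seen)
    raise ValueError("the correct letter was never guessed")


if __name__ == "__main__":
    # 'E' is tried twice but contributes only one guess to the count.
    print(count_distinct_guesses(["E", "T", "E", "A"], "A"))
```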

You may assume that the command line arguments are valid. There is no need to include command line error checking. Please note that I will test your program automatically on a csdev machine, so please test it on one as well and follow exactly the same command line format.

Output

Your program should reveal the sentence as letters are guessed. This does not need to be a fancy display - simple ASCII output is fine.

After each sentence is guessed, your program should output the entropy associated with that sentence, and a cumulative total of entropy for all sentences processed thus far. You only need to display two digits of precision to the right of the decimal point.
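The assignment does not prescribe a formula for the entropy. One possible choice, in the spirit of Shannon's guessing-game estimate, is the entropy of the distribution of guess counts: if q_i is the fraction of positions that needed i guesses, compute the sum of -q_i log2 q_i, in bits per character. The sketch below is in Python for illustration only (the assignment requires Perl). The function name and the sample guess counts are made up, and pooling all guess counts for the cumulative figure is an assumption.

```python
import math
from collections import Counter


def guess_entropy(guess_counts):
    """Entropy, in bits per character, of the distribution of guess counts."""
    total = len(guess_counts)
    if total == 0:
        return 0.0
    frequencies = Counter(guess_counts)
    return sum(-(n / total) * math.log2(n / total) for n in frequencies.values())


if __name__ == "__main__":
    all_counts = []
    # Hypothetical guess counts for two short sentences.
    for sentence_counts in ([1, 1, 2, 1], [3, 1, 1, 2, 5]):
        all_counts.extend(sentence_counts)
        print(f"sentence: {guess_entropy(sentence_counts):.2f}  "
              f"cumulative: {guess_entropy(all_counts):.2f}")
```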

Your program should also create three output files that can be named from the command line. Above they are called outfile1, outfile2, and outfile3 and will be referred to as such here.

outfile1: A table of the average number of guesses before the given letter was guessed correctly.
A 4.44
B 3.21
etc.
This shows that when A was the correct letter, it took an average of 4.44 guesses to get it right. Your table should have 27 entries, one for each of the 26 letters and one for space.
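A table like this can be built by accumulating, for each correct character, the total number of guesses and the number of occurrences. The sketch below is in Python for illustration only (the assignment requires Perl). The function names are hypothetical. Printing 0.00 for characters that never occurred and labelling the space row "SPACE" are both assumptions.

```python
from collections import defaultdict


def average_guesses_by_letter(observations):
    """Map each correct character to the mean number of guesses it needed.

    observations is an iterable of (correct_char, guess_count) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # char -> [sum of guesses, occurrences]
    for char, count in observations:
        totals[char][0] += count
        totals[char][1] += 1
    return {char: total / seen for char, (total, seen) in totals.items()}


def format_table(averages):
    """Render 27 rows (A-Z and space) with two digits after the decimal point."""
    rows = []
    for char in "ABCDEFGHIJKLMNOPQRSTUVWXYZ ":
        label = "SPACE" if char == " " else char  # assumed label for the space row
        rows.append(f"{label} {averages.get(char, 0.0):.2f}")
    return "\n".join(rows)


if __name__ == "__main__":
    print(format_table(average_guesses_by_letter([("A", 4), ("A", 5), ("B", 3)])))
```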

outfile2: A table showing the average number of guesses to get a letter correct given that the previous letter is as shown:
A 3.99
B 2.99
etc.
This shows that when A was the letter preceding the letter to be guessed, it took an average of 3.99 guesses to get that next letter correct.

outfile3: A table showing the average number of guesses to get a letter correct given that the following letter is as shown:
A 2.99
B 4.55
etc.
This shows that when A follows the letter to be guessed, it takes 2.99 guesses on average to get that preceding letter correct.

The values in all three files should be computed to 2 digits of precision to the right of the decimal point.
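The preceding-letter and following-letter tables differ only in which neighbour is used as the key, so a single routine with an offset can serve both. The sketch below is in Python for illustration only (the assignment requires Perl). The function name is hypothetical, and it assumes one guess count per sentence position. Skipping positions that have no neighbour (the first or last character) is also an assumption.

```python
from collections import defaultdict


def context_averages(sentence, guess_counts, offset):
    """Average guesses grouped by a neighbouring character.

    offset=-1 groups by the preceding character (as in outfile2);
    offset=+1 groups by the following character (as in outfile3).
    """
    sums = defaultdict(float)
    hits = defaultdict(int)
    for i, count in enumerate(guess_counts):
        j = i + offset
        if 0 <= j < len(sentence):  # skip positions with no such neighbour
            sums[sentence[j]] += count
            hits[sentence[j]] += 1
    return {char: sums[char] / hits[char] for char in sums}


if __name__ == "__main__":
    sentence = "ABA"
    counts = [3, 1, 2]  # hypothetical guesses needed at each position
    print(context_averages(sentence, counts, -1))
    print(context_averages(sentence, counts, +1))
```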

Experiments and Report

You will produce a short written report describing the outcome of the following experiments.

Download some text from Project Gutenberg. Run the Shannon Game for at least 20 randomly selected sentences (more if you think it is necessary). Compare the results in outfile1, outfile2, and outfile3. What conclusions can you draw from those results? Please compare these tables with each other, and also discuss each of them on its own. Make sure you reproduce the tables you are discussing in your report file.

Download some text in another language that you know. If you know Hindi, then it must be Hindi, and it should be transliterated (i.e., Hindi written in the Roman/English alphabet). If you don't know any other language, you can use some Hindi text or, lacking that, the Spanish version of Don Quixote available from Project Gutenberg. Repeat the same experiment that you carried out for English. Discuss your Hindi/second language results on their own merits, and then compare them to English. What conclusions can you draw?

I am not looking for a particular answer. There are many conclusions one can draw from this kind of data. I'm curious as to what you find interesting, and want to see how you analyze this type of data.

If your analysis suggests that other experiments are appropriate, feel free to carry them out and comment on them. Make sure you do my original set of experiments above however, even if you introduce your own.

Submission Guidelines

Please name your programs 'sentence.pl' and 'shannon.pl'. Please name your report 'experiments.txt'. Each of these should be plain text files. Make sure that your name, date, and class information are contained in each file, and that your programs are commented.

Place these three files into a directory that is named with your UMD user id. In my case the directory would be called tpederse, for example. Then create a tar file that includes this directory and your three files. Compress that tar file and submit it via the web drop from the class home page. Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be graded. The web drop has a limit of 10 MB, so your files should be plain text. If you have large data files you wish to share, please contact me via email and we'll figure out a way to do that.

For both your first and second language text, I would like your report to tell me where it came from (presumably this will be a URL, so you can list those). Also, create a directory in your /home/cs/ partition and put the data you use in your experiments there. Make it world readable and name the directories as follows:

/home/cs/(your id)/CS8761/shannon/english
/home/cs/(your id)/CS8761/shannon/hindi (or whatever your 2nd language was)
chmod -R gou+r /home/cs/(your id)/CS8761
will accomplish this. Please note that everything in CS8761 and the subdirectories will be world readable as a result.

This is an individual assignment. You must write *all* of your code on your own. Do not get code from your colleagues, the internet, etc. You are welcome to discuss the Shannon game in general with your classmates, friends, family, etc., but do not discuss implementation details. You must also write your report on your own. Please do not discuss your interpretations of these results amongst yourselves. This is meant to make you think for yourself and arrive at your own conclusions.

by: Ted Pedersen - tpederse@umn.edu