Ted Pedersen - CS 8761 Natural Language Processing

CS 8761 Natural Language Processing - Fall 2004

Assignment 2 - Due Friday Oct 15, 2004, noon

This may be revised in response to your questions. Last update Sunday Oct 10, 4pm

Objectives

To estimate the entropy of English and interpret those results based on experiments using literary and newswire text. You may also gain an appreciation for the difficulties of sentence boundary detection.

Specification

Write a Perl program called shannon.pl that will play the "Shannon game". Your program should accept as input an arbitrary number of sentences that are read from STDIN. The user must then guess each sentence, letter by letter. You should estimate the entropy of English based on these guesses.

Your Shannon Game program should display one sentence at a time, where all the positions in the sentence have been covered by "*" characters. When the user guesses the correct letter, it should be displayed in place of the "*". When the user guesses the complete sentence, the total entropy for that sentence should be displayed, as should a running total that reflects the entropy across all the sentences processed thus far.
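As a rough illustration of the masked display (written in Python, not the Perl you must submit), the sketch below assumes the game proceeds left to right, one position at a time. The function name masked_view and the parameter revealed are hypothetical, chosen only for this example.

```python
def masked_view(sentence, revealed):
    """Show the first `revealed` positions; cover the rest with '*'.

    Assumption: guessing proceeds left to right, so the revealed part
    is always a prefix of the sentence.
    """
    return sentence[:revealed] + "*" * (len(sentence) - revealed)


sentence = "MY SENTENCE IS HERE TODAY"
print(masked_view(sentence, 0))   # nothing guessed yet
print(masked_view(sentence, 4))   # first four positions guessed
```

Note that spaces are covered too, so the user has to guess word boundaries as well as letters.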

You should not assume that sentence boundaries have been determined in your input text. Thus, you should provide a sentence boundary detector program called boundary.pl to create the input to your Shannon game program. You should require that each sentence selected by boundary.pl has a minimum of 7 words. Make sure that once a sentence is selected, it is not selected again during the same run of the program.

Your sentence boundary detection program should also convert the text to upper case and remove all punctuation and non-alpha characters except for spaces. Please note that sentence boundary detection is a difficult problem in its own right, so you should not be surprised if you have an imperfect solution. That is OK, as long as you are able to get the simple cases correct. At a minimum, your boundary detector should identify sentences that end with . ? or ! and have no internal uses of these characters (as might occur in an abbreviation or quote). You should be able to identify a sentence that spans multiple lines, as well as multiple sentences on one line. For example, the following are cases you should be able to handle.

This sentence is
going to go on and on for a
long while, and then end.

What a great sentence. This one is even better!

My friends, I wish that I would be with you today, but alas, I can't!

Examples of cases that I wouldn't necessarily expect you to handle (but would be happy if you did) are as follows:

Dr. Johnson, the noted Ph.D. in anthropology, is a great person.

My friend said, "Hey! Ted! You're lost!" and I had to agree.
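To make the "simple cases" concrete, here is a minimal Python sketch (not the required Perl) of a naive detector that treats every . ? or ! as a sentence end, followed by the upper-casing and punctuation removal described above. The names split_sentences and normalize are hypothetical. This approach handles the first group of examples but will wrongly split the "Dr. Johnson" and "Hey! Ted!" cases.

```python
import re


def split_sentences(text):
    """Naive splitter: every '.', '?' or '!' ends a sentence."""
    flat = " ".join(text.split())            # join lines, collapse whitespace
    parts = re.findall(r"[^.?!]+[.?!]", flat)
    return [p.strip() for p in parts]


def normalize(sentence):
    """Upper-case the sentence and keep only the letters A-Z and spaces."""
    kept = re.sub(r"[^A-Z ]", "", sentence.upper())
    return " ".join(kept.split())            # collapse any doubled spaces


text = "What a great sentence. This one is even better!"
for s in split_sentences(text):
    print(normalize(s))
```

Text that trails after the last terminator (with no closing . ? or !) is silently dropped by this sketch, which is one of several design decisions you would need to make for yourself.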

These two programs should be run as follows from the command line:

boundary.pl 10 chapter1.txt chapter2.txt | shannon.pl outfile1 outfile2 outfile3 outfile4

boundary.pl will select 10 random sentences from chapter1.txt and chapter2.txt and output them to STDOUT. shannon.pl will read from STDIN and display the sentences one by one to the user (where the actual letters and spaces have been covered with "*" characters). outfile1, outfile2, outfile3, and outfile4 (whose names are specified by the user) are output files that are described below.
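The selection step can be pictured with the following Python sketch (again, an illustration rather than the Perl you must write). It assumes the sentences have already been detected and normalized, filters out those with fewer than 7 words, and samples without replacement so that no sentence is chosen twice. The name pick_sentences and the seed parameter are hypothetical; treating identical strings as one sentence is also an assumption.

```python
import random


def pick_sentences(sentences, count, min_words=7, seed=None):
    """Pick up to `count` distinct sentences with at least `min_words` words."""
    # dict.fromkeys drops repeated strings while keeping first-seen order
    eligible = [s for s in dict.fromkeys(sentences)
                if len(s.split()) >= min_words]
    rng = random.Random(seed)
    # sample() draws without replacement, so nothing is selected twice
    return rng.sample(eligible, min(count, len(eligible)))


pool = [
    "THIS SENTENCE HAS EXACTLY SEVEN WORDS IN IT",
    "TOO SHORT",
    "MY FRIENDS I WISH THAT I WOULD BE WITH YOU TODAY",
    "THIS SENTENCE HAS EXACTLY SEVEN WORDS IN IT",
]
for s in pick_sentences(pool, 10, seed=1):
    print(s)
```

If fewer eligible sentences exist than were requested, this sketch simply returns all of them.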

If the user happens to guess the same letter twice for the same position, don't count this more than once. Please keep track of the letters guessed for each position to avoid double counting.
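One simple way to avoid double counting is to keep a set of the letters already tried at the current position. The Python sketch below (not the required Perl) shows the idea; the function name count_guesses is hypothetical, and converting guesses to upper case is an assumption.

```python
def count_guesses(target, guesses):
    """Count distinct guesses for one position, up to the correct one."""
    seen = set()
    for g in guesses:
        g = g.upper()
        seen.add(g)          # adding a repeated letter changes nothing
        if g == target:
            break
    return len(seen)


# The user tries A, T, A again, then E: the repeated A is not counted.
print(count_guesses("E", ["A", "T", "A", "E"]))  # prints 3
```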

You may assume that the command line arguments are valid. There is no need to include command line error checking. Please note that I will test your program automatically on a csdev machine, so please test it there as well and follow exactly the same command line format.

Output

Your program should reveal the sentence as letters are guessed. This does not need to be a fancy display - simple ascii output is fine.

After each sentence is guessed, your program should output the entropy associated with that sentence, and a cumulative total of entropy for all sentences processed thus far. You only need to display two digits of precision to the right of the decimal point.
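The method for computing entropy is left to you (your report must describe it). Purely as one possible approach, the Python sketch below treats the number of guesses needed at each position as a symbol and computes the entropy of that distribution, in the spirit of Shannon's guessing experiment. The name guess_entropy is hypothetical, and pooling the counts from all sentences to get the cumulative figure is an assumption, not a requirement.

```python
import math
from collections import Counter


def guess_entropy(guess_counts):
    """Entropy in bits of the empirical distribution of guess counts."""
    total = len(guess_counts)
    if total == 0:
        return 0.0
    freq = Counter(guess_counts)
    return -sum((n / total) * math.log2(n / total) for n in freq.values())


all_counts = []                       # counts pooled over every sentence so far
for counts in ([1, 1, 2, 3], [1, 2]):
    all_counts.extend(counts)
    print(f"Sentence entropy: {guess_entropy(counts):.2f}")
    print(f"Cumulative entropy: {guess_entropy(all_counts):.2f}")
```

In the first toy sentence, half of the positions took one guess and a quarter each took two and three guesses, which gives 1.50 bits. Other estimators are equally defensible, as long as you explain your choice.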

Your program should also create four output files that can be named from the command line. Above they are called outfile1, outfile2, outfile3, and outfile4, and will be referred to as such here; however, you should assume that the user can name them anything they like.

outfile1: A table of the average number of guesses before the given letter was guessed correctly.

A 4.44
B 3.21
etc.

This shows that when A was the correct letter, it took 4.44 tries to guess it correctly. Your table should have 27 entries, one for each of the 26 letters and also for space.
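The bookkeeping behind this table can be sketched as follows (in Python, for illustration only; your submission must be in Perl). The names average_by_letter and format_table are hypothetical. Printing 0.00 for a letter that never occurred and labeling the space row as SPACE are both assumptions; choose and document your own conventions.

```python
import string
from collections import defaultdict


def average_by_letter(sentence, counts):
    """Average number of guesses, grouped by the correct character."""
    totals = defaultdict(lambda: [0, 0])       # char -> [sum, occurrences]
    for ch, n in zip(sentence, counts):
        totals[ch][0] += n
        totals[ch][1] += 1
    return {ch: s / k for ch, (s, k) in totals.items()}


def format_table(averages):
    """Return 27 lines: A-Z followed by space (shown as SPACE)."""
    lines = []
    for ch in string.ascii_uppercase + " ":
        label = "SPACE" if ch == " " else ch
        lines.append(f"{label} {averages.get(ch, 0.0):.2f}")
    return lines


averages = average_by_letter("AB A", [4, 3, 1, 2])
print("\n".join(format_table(averages)[:3]))
```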

outfile2: A table showing the average number of guesses to get a letter correct given that the previous letter is as shown:

A 3.99
B 2.99
etc.

This shows that when A was the letter preceding the letter to be guessed, it took 3.99 turns to guess that next letter correctly.

outfile3: A table showing the average number of guesses to get a letter correct given that the following letter is as shown:

A 2.99
B 4.55
etc.

This shows that when A follows the letter to be guessed, it takes 2.99 guesses to get that preceding letter correct (on average).

The values in these files should be computed to 2 digits of precision to the right of the decimal point.
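The tables for outfile2 and outfile3 differ only in which neighbor is used as the grouping key, so a single routine with an offset can cover both. The Python sketch below is illustrative only (not the required Perl); the name average_by_neighbor is hypothetical, and skipping the first position (which has no preceding character) and the last position (which has no following character) is an assumption.

```python
def average_by_neighbor(sentence, counts, offset):
    """Average guesses grouped by a neighboring character.

    offset=-1 groups by the preceding character (outfile2),
    offset=+1 groups by the following character (outfile3).
    """
    totals = {}                                  # char -> (sum, occurrences)
    for i, n in enumerate(counts):
        j = i + offset
        if 0 <= j < len(sentence):               # skip positions with no neighbor
            s, k = totals.get(sentence[j], (0, 0))
            totals[sentence[j]] = (s + n, k + 1)
    return {ch: s / k for ch, (s, k) in totals.items()}


sentence, counts = "ABA", [5, 2, 4]
for ch, avg in sorted(average_by_neighbor(sentence, counts, -1).items()):
    print(f"{ch} {avg:.2f}")
```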

outfile4: A table that displays each sentence that was guessed, the number of guesses per letter, and the entropy for that sentence and overall. Each entry in this table should be formatted as follows:

  M  Y     S  E  N  T  E  N  C  E     I  S     H  E  R  E     T  O  D  A  Y
 10  2  1  9  3  2  1  1  1  1  1  1  8  3  1 10  2  1  1  1  9  1  1  1  1
 Sentence Entropy: X.XXXX
 Cumulative Entropy: X.XXXX

Note that the sentence above has only 5 words; it is just meant to show you an example of the output format for outfile4. Remember that each of your sentences should contain at least 7 words.
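One way to keep the letters and guess counts lined up is to print every character and every count in a fixed-width column. The Python sketch below (illustrative only, not the required Perl) uses a width of 3, which is an assumption that holds as long as no position takes 100 or more guesses. The name format_entry is hypothetical.

```python
def format_entry(sentence, counts, sentence_entropy, cumulative_entropy):
    """Build one outfile4 entry with letters aligned above their counts."""
    letters = "".join(f"{ch:>3}" for ch in sentence)   # spaces stay blank
    numbers = "".join(f"{n:>3}" for n in counts)
    return "\n".join([
        letters,
        numbers,
        f" Sentence Entropy: {sentence_entropy:.4f}",
        f" Cumulative Entropy: {cumulative_entropy:.4f}",
    ])


print(format_entry("MY DOG", [10, 2, 1, 9, 3, 2], 1.0, 1.5))
```

The entropy values passed in here are placeholders; in your program they would come from whatever estimator you describe in your report.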

Experiments and Report

You should produce a written report describing the outcome of the following experiments. This should be a plain text file called experiments.txt. This report should begin with a description of your method for computing entropy of English.

Download a novel from Project Gutenberg (www.gutenberg.net). In your report, please name the novel and also provide a specific URL where it can be directly downloaded from.

Run the Shannon Game for 20 randomly selected sentences from this novel. Examine the results in outfile1, outfile2, outfile3, and outfile4. What conclusions can you draw from these results? Please make an effort to incorporate all the different pieces of information you have available into your analysis. Make sure you include your input and output files as well as the entropy values in your report.

Now use the apw.100000 file to select 20 random sentences and repeat the experiment and analysis that you conducted for the novel.

After running experiments on both kinds of texts, what general conclusions can you draw about the entropy of English? Make specific references to the results in your output files, as well as the overall entropy values. Make sure to include your view on whether or not the entropy of English is different depending on the type of text. Make sure to support your position with evidence from your experiments.

I am not looking for a particular answer. There are many conclusions one can draw from this kind of data. I'm curious as to what you find interesting, and want to see how you analyze this type of data.

If your analysis suggests that other experiments are appropriate, feel free to carry them out and comment on them. However, make sure you do my original set of experiments above, even if you introduce your own.

Submission Guidelines

Please name your programs 'boundary.pl' and 'shannon.pl'. Please name your report 'experiments.txt'. Each of these should be a plain text file. Make sure that your name, date, and class information are contained in each file, and that your programs are commented using perldoc.

Place these three files into a directory that is named with your umd user id. In my case the directory would be called tpederse, for example. Then create a tar file that includes this directory and your three files. Compress that tar file and submit it via the web drop from the class home page. Please note that any submissions after the deadline will not be graded.

This is an individual assignment. You must write *all* of your code on your own. Do not get code from your colleagues, the internet, etc. You are welcome to discuss the Shannon game in general with your classmates, friends, family, etc., but do not discuss implementation details. You must also write your report on your own. Please do not discuss your interpretations of these results amongst yourselves. This is meant to make you think for yourself and arrive at your own conclusions.

by: Ted Pedersen - tpederse@umn.edu