CS 8995 Corpus Based Natural Language Processing

Assignment 3 - Due Mon, Feb 26, 4 pm

This write-up may be revised in response to questions. Date of last update: Thu Feb 15, 1pm


The objective of this assignment is to compare N-gram models of texts using cross entropy. This will allow us to determine how closely related two texts are, and possibly to resolve questions of authorship attribution.


Write a Perl program that will estimate N-gram models for two input texts, and then compare those models using cross entropy. Probability estimates should not be made via relative frequency (maximum likelihood estimates) but rather via the Witten-Bell smoothing algorithm. If you are interested in extra credit, you can implement the Good-Turing smoothing algorithm in Perl. This is worth 5 extra credit points, so you could earn up to 15 out of 10 on this assignment with the extra credit. There are links to Good-Turing materials on the Sample Code page.
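For the extra credit option, the core Good-Turing idea can be sketched as follows. This is a Python illustration only (the submission itself must be in Perl), and it is a simplified version of the method rather than a complete implementation: the adjusted count for events seen r times is (r + 1) * n_{r+1} / n_r, where n_r is the number of events seen exactly r times, and the mass reserved for unseen events is n_1 / N. The function name `good_turing` and the fallback used when n_{r+1} is zero are assumptions made for this sketch.

```python
from collections import Counter

def good_turing(event_counts):
    """Return (adjusted_counts, unseen_mass) from a dict of raw event counts."""
    n_tokens = sum(event_counts.values())
    # n_r: how many distinct events occurred exactly r times
    freq_of_freq = Counter(event_counts.values())
    adjusted = {}
    for r, n_r in freq_of_freq.items():
        n_next = freq_of_freq.get(r + 1, 0)
        # Simplifying assumption: keep the raw count when n_{r+1} is zero.
        adjusted[r] = (r + 1) * n_next / n_r if n_next else float(r)
    # Probability mass reserved for events never seen in the data
    unseen_mass = freq_of_freq.get(1, 0) / n_tokens
    return adjusted, unseen_mass

counts = Counter({"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3})
adjusted, unseen_mass = good_turing(counts)
```

Because of the fallback for missing n_{r+1} values, the adjusted counts in this sketch are not renormalized; a full implementation would need to smooth the n_r values or renormalize.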

Regardless of whether you implement Witten-Bell or Good-Turing, you may want to develop your N-gram model first simply using relative frequency counts and then change the estimation method after that is working. (If you are unable to implement a smoothing algorithm, you could earn 5 of 10 possible points by submitting a program that uses relative frequency/maximum likelihood estimates.)
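The two-stage development path described above might look like the following Python sketch (illustrative only; the assignment requires Perl). It assumes one common formulation of Witten-Bell: for a context h followed by N(h) tokens of T(h) distinct types, a seen word gets c / (N(h) + T(h)), and the reserved mass T(h) / (N(h) + T(h)) is split evenly among the Z(h) = V - T(h) unseen words. The names `context_stats`, `relative_frequency`, `witten_bell`, and `vocab_size`, and the uniform fallback for unseen contexts, are assumptions of this sketch and not part of the assignment.

```python
from collections import Counter, defaultdict

def context_stats(ngram_counts):
    """For each context (all but the last word), total count and set of following words."""
    totals = Counter()
    followers = defaultdict(set)
    for gram, c in ngram_counts.items():
        totals[gram[:-1]] += c
        followers[gram[:-1]].add(gram[-1])
    return totals, followers

def relative_frequency(ngram_counts):
    """Maximum likelihood estimate: unseen events get probability zero."""
    totals, _ = context_stats(ngram_counts)
    def prob(gram):
        n_h = totals[gram[:-1]]
        return ngram_counts.get(gram, 0) / n_h if n_h else 0.0
    return prob

def witten_bell(ngram_counts, vocab_size):
    """Witten-Bell estimate of P(last word | context), under the assumptions above."""
    totals, followers = context_stats(ngram_counts)
    def prob(gram):
        h = gram[:-1]
        n_h = totals[h]
        t_h = len(followers.get(h, ()))
        if n_h == 0:
            return 1.0 / vocab_size  # assumption: uniform for unseen contexts
        c = ngram_counts.get(gram, 0)
        if c > 0:
            return c / (n_h + t_h)
        z_h = vocab_size - t_h       # number of words never seen after h
        return t_h / (z_h * (n_h + t_h)) if z_h > 0 else 0.0
    return prob

bigrams = Counter({("a", "b"): 2, ("b", "a"): 1, ("a", "c"): 1, ("c", "a"): 1})
mle = relative_frequency(bigrams)
wb = witten_bell(bigrams, vocab_size=4)
```

Both estimators return a function with the same signature, so the rest of a program can be written against the relative frequency version and switched to the smoothed version later.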

Your program should accept different values of N. Typical values will be 1 (unigram), 2 (bigram), and 3 (trigram), but your program should not impose any arbitrary limit on N. If you implement Good-Turing you may restrict the possible values of N to 2 and 3. However, if you implement Witten-Bell you should be able to handle any value of N.
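One way to avoid any built-in limit on N is to treat each N-gram as a sequence of tokens and use it as a key in a hash. The following Python sketch shows the idea (illustrative only, since the assignment requires Perl, where a joined string could serve as the hash key); the function name `ngram_counts` and the whitespace tokenization are assumptions of this sketch.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens; n may be any positive integer."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text shorter than n tokens simply yields no n-grams.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat the cat ran".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```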

Your program should output the cross entropy of the two texts to as many digits of precision as you feel necessary.
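The assignment does not fix a formula here, so the following Python sketch (illustrative only; the submission must be in Perl) assumes one common per-event approximation: the average negative log (base 2) probability that the model built from one text assigns to the N-grams of the other text. The names `cross_entropy` and `model_prob` are hypothetical.

```python
import math

def cross_entropy(test_ngrams, model_prob):
    """Average negative log2 probability the model assigns to each n-gram."""
    total = 0.0
    for gram in test_ngrams:
        total -= math.log2(model_prob(gram))
    return total / len(test_ngrams)

# Toy model that gives every bigram probability 1/4, i.e. 2 bits per event.
grams = [("a", "b"), ("b", "a"), ("a", "c")]
bits = cross_entropy(grams, lambda gram: 0.25)
print(f"{bits:.4f}")
```

Note that `math.log2` raises an error on a probability of zero, which is one reason unsmoothed relative frequency estimates are unsuitable for this comparison.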

Your program should also output the frequency and total probability mass associated with events that occur a given number of times. In other words, show how many times events with a particular frequency occurred, and how much probability mass all events with that frequency count receive based on your smoothing algorithm. For example:

0.0075
0 1000 .50
1 70   .25
2 30   .10
3 30   .10
4 30   .10
5 30   .10
0 1000 .50
1 75   .30
2 50   .11
3 25   .11
4 20   .09
5 20   .09
This output "says" that the cross entropy between the two texts was 0.0075. In text 1, there were 1000 unobserved events (events that occurred zero times in the data), and they had a combined probability mass of .50. There were 70 events that occurred 1 time, and they had a combined probability mass of .25, and so on. The same information is provided for text 2. Again, you may display your results to as many digits of precision as you feel is necessary. The example data above is intended only to show you the formatting requirements and should not be interpreted any other way.
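A table of this shape could be produced along the following lines. This Python sketch is illustrative only (the assignment requires Perl) and assumes a simple Witten-Bell treatment in which each N-gram is one event: with N observed tokens and T observed types, an event seen r times gets r / (N + T), and the unseen events share T / (N + T). The name `frequency_table` and the parameter `num_possible_events` (the size of the event space, which the program must decide how to define) are assumptions of this sketch.

```python
from collections import Counter

def frequency_table(event_counts, num_possible_events, max_freq=5):
    """Rows of (frequency r, number of events seen r times, their combined mass)."""
    n_tokens = sum(event_counts.values())
    n_types = len(event_counts)
    n_unseen = num_possible_events - n_types
    freq_of_freq = Counter(event_counts.values())
    denom = n_tokens + n_types
    # Row for events that never occurred in the data
    rows = [(0, n_unseen, n_types / denom if n_unseen > 0 else 0.0)]
    for r in range(1, max_freq + 1):
        rows.append((r, freq_of_freq[r], freq_of_freq[r] * r / denom))
    return rows

counts = Counter({"a": 3, "b": 2, "c": 1, "d": 1})
rows = frequency_table(counts, num_possible_events=10, max_freq=3)
for r, n_r, mass in rows:
    print(f"{r} {n_r:<4d} {mass:.2f}")
```

When `max_freq` covers every observed frequency, the masses in the table sum to 1, which is a useful sanity check on a smoothing implementation.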

You should assume that the input texts are relatively large, containing at a minimum 100,000 word tokens. Make certain that your program will complete in a reasonable amount of time. (To some extent this depends on the value of N so no strict guideline is given here.)

You should treat your text according to the following guidelines:

In addition to submitting a program, you must also submit a write up as described below. Programs submitted without this write up will not be graded.

Write up

Describe the results of the following experiment in a comment block at the beginning of your program.

Select 3 texts from the Project Gutenberg archive. Two of the texts should be by the same author, and the other should be by an author that is relatively distinct in style, genre, and/or era. You may choose any texts you wish, as long as the language is English and the number of word tokens in each text is more than 100,000.

Suppose that the three texts are:
(In your write up please clearly identify the author, the work, and the Project Gutenberg file name.)

Run your program as follows and report the results for each:
What conclusions do you draw from your results regarding the effectiveness of cross entropy and N-grams as a tool for performing authorship identification? Use your results to support these conclusions. If you wish to perform additional experiments to support your view, that is strongly encouraged. (The above represents the minimum required.)

Other information

Please use turnin to submit this program. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin command is:
turnin -c cs8995 -p a3 userid.pl
This is an individual assignment. Please work on your own. You are free to discuss the problem and coding issues with your colleagues, but in the end the program you submit should reflect your own understanding of N-gram models, cross-entropy and Perl programming.

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be received.

by: Ted Pedersen - tpederse@d.umn.edu