Regardless of whether you implement Witten-Bell or Good-Turing, you may want to develop your N-gram model first using simple relative frequency counts and then switch to the smoothed estimates once that is working. (If you are unable to implement a smoothing algorithm, you can earn 5 of the 10 possible points by submitting a program that uses relative frequency/maximum likelihood estimates.)
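To make the starting point concrete, here is a minimal sketch of relative frequency (maximum likelihood) estimation for bigrams. The sketch is in Python purely for illustration (your submission must be in Perl), and the function name is my own; the idea is simply P(w2 | w1) = count(w1, w2) / count(w1).

```python
from collections import defaultdict

def mle_bigram_probs(tokens):
    """Relative-frequency (maximum likelihood) bigram estimates:
    P(w2 | w1) = count(w1, w2) / count(w1)."""
    bigram_counts = defaultdict(int)
    unigram_counts = defaultdict(int)
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[(w1, w2)] += 1
        unigram_counts[w1] += 1
    return {pair: c / unigram_counts[pair[0]]
            for pair, c in bigram_counts.items()}

probs = mle_bigram_probs("the cat sat on the mat".split())
# "the" begins two bigrams, so P(cat | the) = P(mat | the) = 0.5
```

Once this is working, only the probability table needs to change when you swap in Witten-Bell or Good-Turing estimates.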

Your program should accept different values of N. Typical values will be 1 (unigram), 2 (bigram), and 3 (trigram), but your program should not impose any arbitrary limit on N. If you implement Good-Turing, you may restrict the possible values of N to 2 and 3. However, if you implement Witten-Bell, your program should handle any value of N.
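Supporting arbitrary N is easier than it may sound if n-gram extraction is written once as a sliding window rather than as special cases for unigrams, bigrams, and trigrams. A small illustrative sketch (in Python; the assignment itself is in Perl):

```python
def ngrams(tokens, n):
    """Slide a window of length n across the token list. The same code
    handles any n >= 1, so no special-casing is needed for unigrams,
    bigrams, or trigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = "a b c d".split()
# ngrams(toks, 2) -> [('a','b'), ('b','c'), ('c','d')]
```

With extraction parameterized on n, the rest of the program (counting, smoothing, cross entropy) never needs to know which value of N was requested.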

Your program should output the cross entropy of the two texts to as many digits of precision as you feel necessary.

Your program should also output the frequency and total probability mass associated with events that occur a given number of times. In other words, show how many events occurred with each particular frequency, and how much total probability mass your smoothing algorithm assigns to all events with that frequency count. For example:

    0.0075
    0 1000 .50
    1 70 .25
    2 30 .10
    3 30 .10
    4 30 .10
    5 30 .10
    0 1000 .50
    1 75 .30
    2 50 .11
    3 25 .11
    4 20 .09
    5 20 .09

This output "says" that the cross entropy between the two texts was 0.0075. In text 1, there were 1000 unobserved events (events that occurred zero times in the data), and they had a combined probability mass of .50. There were 70 events that occurred 1 time, and they had a combined probability mass of .25, and so on. The same information is provided for text 2. Again, you may display your results to as many digits of precision as you feel is necessary. The example data above is intended only to show you the formatting requirements and should not be interpreted any other way.
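The first column of that table is the classic "frequency of frequencies" (often written N_r): for each frequency r, how many distinct events occurred exactly r times, with N_0 being the possible events that never appeared. A sketch of how those counts might be gathered (Python for illustration; the probability-mass column would come from your smoothing algorithm and is not shown here):

```python
from collections import Counter

def freq_of_freqs(ngram_counts, num_possible):
    """For each observed frequency r, N_r = number of distinct n-grams that
    occurred exactly r times. N_0 (unobserved events) is the number of
    possible n-grams that never appeared in the data."""
    nr = Counter(ngram_counts.values())
    nr[0] = num_possible - len(ngram_counts)
    return dict(nr)

counts = {("a", "b"): 1, ("b", "a"): 1, ("a", "a"): 3}
table = freq_of_freqs(counts, num_possible=9)  # e.g. 3x3 possible bigrams
# table -> {1: 2, 3: 1, 0: 6}
```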

You should assume that the input texts are relatively large, containing at a minimum 100,000 word tokens. Make certain that your program will complete in a reasonable amount of time. (To some extent this depends on the value of N so no strict guideline is given here.)

You should treat your text according to the following guidelines:

- Ignore case; treat all text as either upper or lower case.
- DO NOT IGNORE PUNCTUATION. Treat each punctuation mark as a word token.
- Treat all numeric values as tokens. Do not worry about dealing with punctuation embedded in numeric values. In a case such as 9,000.00 it will be fine if you treat that as five tokens ('9' ',' '000' '.' '00').
- Only allow a single space between words in the text.
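The guidelines above can be satisfied with a single regular-expression pass: lowercase the text, then pull out each run of letters or digits and each individual punctuation character as its own token. A sketch in Python (your Perl program would use an analogous regex):

```python
import re

def tokenize(text):
    """Lowercase the text, then make every run of letters or digits and
    every single punctuation mark its own token, per the assignment
    guidelines. '9,000.00' thus becomes '9' ',' '000' '.' '00'."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

toks = tokenize("Pay 9,000.00 now!")
# -> ['pay', '9', ',', '000', '.', '00', 'now', '!']
```

Because `re.findall` skips over whitespace entirely, any number of spaces between words collapses away for free.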

Select 3 texts from the Project Gutenberg archive. Two of the texts should be by the same author, and the third should be by an author who is relatively distinct in style, genre, and/or era. You may choose any texts you wish, as long as the language is English and the number of word tokens in each text is more than 100,000.

Suppose that the three texts are:

- text1, by Author#1
- text2, by Author#1
- text3, by Author#2

Run your program as follows and report the results for each:

- userid.pl 3 text1 text1
- userid.pl 3 text2 text2
- userid.pl 3 text3 text3
- userid.pl 3 text1 text2
- userid.pl 3 text2 text3
- userid.pl 3 text1 text3

    turnin -c cs8995 -p a3 userid.pl

This is an individual assignment. Please work on your own. You are free to discuss the problem and coding issues with your colleagues, but in the end the program you submit should reflect your own understanding of N-gram models, cross-entropy, and Perl programming.

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be accepted.

by: Ted Pedersen - tpederse@d.umn.edu