Ted Pedersen - CS 5761 - Introduction to Natural Language Processing

CS 5761 - Introduction to Natural Language Processing

Programming Assignment 5 - Demo in Lab on Monday, Mar 11 at 4pm
(submit code via email to patw0006@d.umn.edu before lab)

Objectives

To see how different smoothing algorithms affect probability estimates.

Specification

Modify your assignment 1 program to output n-gram probabilities using maximum likelihood estimates (MLE), add-1 smoothing, and Witten-Bell smoothing. Recall that assignment 1 finds the top N most frequent sequences consisting of M words. Your modified program should still output the top N most frequent M word sequences, but now it should also output a probability estimate from each of these three methods, along with the frequency count of each M word sequence. Make sure your output is nicely formatted (columns line up, etc.).
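The counting and MLE step described above can be sketched as follows. This is only an illustration in Python, not the assignment's own program: the whitespace tokenization, the sample sentence, and the names `count_sequences` and `top_n_with_mle` are assumptions made for the example.

```python
from collections import Counter

def count_sequences(tokens, m):
    """Count every run of m consecutive tokens (each M word sequence)."""
    return Counter(tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1))

def top_n_with_mle(tokens, m, n):
    """Return (sequence, frequency, MLE) for the n most frequent sequences."""
    counts = count_sequences(tokens, m)
    total = sum(counts.values())  # number of M word sequence tokens in the text
    # MLE is the relative frequency: count / total number of sequences
    return [(" ".join(seq), c, c / total) for seq, c in counts.most_common(n)]

# Hypothetical input; real input would come from the text file
tokens = "the cat saw the dog and the cat ran".split()
for seq, freq, mle in top_n_with_mle(tokens, 2, 2):
    print(f"{seq:<12}{freq:>5}{mle:>10.4f}")
```

Here "the cat" occurs twice among the 8 two-word sequences, so its MLE is 2/8 = 0.25.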

For example, your output might look something like this if you ran 'assign5.pl 2 2 textinput':

TOP 2 MOST FREQUENT 2 WORD SEQUENCES

            FREQ          MLE        ADD-1  WITTEN-BELL
OF THE       100     0.000001     0.000002     0.000003
FOR THE       99     0.000001     0.000002     0.000003

Remember that if you display all of the M word sequences of the text, each probability column should total 1.0000 (more or less; there may be some round-off error). For the smoothed columns, this total includes the probability reserved for unobserved sequences. You should use as much precision as you need to differentiate among the estimates. This may vary depending on the amount of text you are using and the length of your sequences, so your program should handle it automatically.
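One possible way to pick the precision automatically is to look at the smallest gap between distinct estimates and print one more decimal place than that gap requires. The sketch below is an assumption about how this could be done, not a required method; `decimals_needed`, `min_places`, and `extra` are hypothetical names.

```python
import math

def decimals_needed(values, min_places=4, extra=1):
    """Decimal places needed so that distinct positive values print differently."""
    distinct = sorted({v for v in values if v > 0})
    if not distinct:
        return min_places
    # Gaps between neighbouring estimates, plus the smallest estimate itself
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    smallest = min(gaps + [distinct[0]])
    # One extra place keeps the print resolution finer than the smallest gap
    places = math.ceil(-math.log10(smallest)) + extra
    return max(min_places, places)

# Hypothetical estimates, chosen only to show the effect
estimates = [0.25, 0.125, 0.0000021, 0.0000023]
places = decimals_needed(estimates)
for p in estimates:
    print(f"{p:.{places}f}")
```

With only coarse estimates such as 0.25 and 0.125 the function falls back to `min_places`; the two tiny values above force more digits so they remain distinguishable.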

Assume that the number of possible types in the unigram model is equal to the number of types observed (call this value V). Then, in the bigram model assume that the number of possible types is V*V, and in the trigram model V*V*V, and so on.
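Under the V^M assumption, the three estimates can be sketched as below. This reads the estimates as joint probabilities of whole M word sequences (the reading suggested by each column totalling 1) and uses the common textbook form of Witten-Bell: with N sequence tokens, T observed types, and Z = V^M - T unobserved types, an observed sequence gets c/(N+T) and each unobserved one gets T/(Z(N+T)). The function name and the fallback to MLE when Z is 0 are assumptions of this example.

```python
from collections import Counter

def smoothed_estimates(counts, vocab_size, m):
    """Return ({sequence: (mle, add1, witten_bell)}, per-unobserved-sequence estimates)."""
    n_tokens = sum(counts.values())   # N: total M word sequence tokens
    n_types = len(counts)             # T: distinct observed sequences
    possible = vocab_size ** m        # V^M possible sequences
    unseen = possible - n_types       # Z: possible but never observed

    rows = {}
    for seq, c in counts.items():
        mle = c / n_tokens
        add1 = (c + 1) / (n_tokens + possible)
        # Assumption: if nothing is unseen (Z == 0), no mass is reserved
        wb = c / (n_tokens + n_types) if unseen > 0 else mle
        rows[seq] = (mle, add1, wb)

    unobserved = None
    if unseen > 0:
        # Every unobserved sequence is treated as equally likely
        unobserved = (0.0,
                      1 / (n_tokens + possible),
                      n_types / (unseen * (n_tokens + n_types)))
    return rows, unobserved

# Hypothetical input text
tokens = "the cat saw the dog and the cat ran".split()
counts = Counter(zip(tokens, tokens[1:]))          # bigram counts
rows, unobserved = smoothed_estimates(counts, len(set(tokens)), 2)
unseen = len(set(tokens)) ** 2 - len(counts)

# Each column should total 1 once the unobserved mass is included
for col, name in enumerate(["MLE", "ADD-1", "WITTEN-BELL"]):
    total = sum(r[col] for r in rows.values()) + unseen * unobserved[col]
    print(f"{name:<12}{total:.4f}")
```

In this example V = 6, so there are 36 possible bigrams, of which 7 are observed and 29 are not; the column totals confirm that each scheme distributes exactly one unit of probability.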

If you request the top N ranked sequences and there are fewer than N observed events, then once you have displayed all the observed events, show a generic entry for unobserved events with their smoothed estimates. You do not need to generate specific unobserved N-grams to display; a label such as "UNOBSERVED" is fine, as long as you show estimates for it as well. Our assumption here is that any unobserved event is as likely as any other unobserved event.
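A display routine along these lines might pad the table with a single generic row when fewer than N sequences were observed. The sketch assumes one "UNOBSERVED" row is enough and that the estimates have already been computed; `format_table`, the column widths, and the numbers in the usage example are illustrative placeholders only.

```python
def format_table(rows, n, unobserved=None, places=6):
    """rows: (label, freq, mle, add1, wb) tuples sorted by descending frequency.
    unobserved: (mle, add1, wb) for any single unobserved sequence, or None."""
    shown = list(rows[:n])
    if len(shown) < n and unobserved is not None:
        # Fewer observed events than requested: add one generic row
        shown.append(("UNOBSERVED", 0) + tuple(unobserved))
    width = max(len(r[0]) for r in shown + [("SEQUENCE",)])
    w = max(places + 3, len("WITTEN-BELL"))  # wide enough for header and values
    lines = [f"{'SEQUENCE':<{width}} {'FREQ':>6} {'MLE':>{w}} "
             f"{'ADD-1':>{w}} {'WITTEN-BELL':>{w}}"]
    for label, freq, mle, add1, wb in shown:
        lines.append(f"{label:<{width}} {freq:>6} {mle:>{w}.{places}f} "
                     f"{add1:>{w}.{places}f} {wb:>{w}.{places}f}")
    return "\n".join(lines)

# Placeholder values, not real estimates
rows = [("OF THE", 3, 0.6, 0.4, 0.5), ("FOR THE", 2, 0.4, 0.3, 0.3)]
table = format_table(rows, 3, unobserved=(0.0, 0.1, 0.05))
print(table)
```

Because three rows were requested and only two sequences are available, the last row printed is the generic UNOBSERVED entry with its smoothed estimates.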

Policies (from syllabus)

All programming assignments and your project will be demonstrated during designated lab sessions. You should also submit an electronic copy of your source code to the TA prior to the designated demo session. (His email address is patw0006@d.umn.edu.) There is no other way to submit your programming assignments or project. Failure to submit AND demo on time will result in a zero.

Any code you submit should be commented. I must be able to understand what your code does simply by reading the comments. This understanding should extend down to the details of your code, so do not simply describe the input and output; also include comments that describe your particular algorithm and coding techniques. Failure to comment to this degree will result in a zero.

All assignments and the project are to be done individually. You are required to write your own code. Unless otherwise specified, you must only turn in code that you personally wrote. The only possible exception to this is if I tell you to use a module that is available in a book or online archive. However, I will clearly indicate when this is permissible. Violations of this policy will result in severe grading penalties and/or failure in the class.
