CS 5761 - Introduction to Natural Language Processing

Programming Assignment 4 - Demo in Lab on Monday, Mar 04 at 4pm
(submit code via email to patw0006@d.umn.edu before lab)

Objectives

To gain experience with n-gram models.

Specification

Implement a program that will learn an n-gram model from a given body of text. Your program should then generate a given number of sentences based on that n-gram model. See the discussion on pages 202-206 of your text for further details.

Your program should work for any value of n, and should output m sentences. Convert all text to lower case, and make sure to include punctuation in the n-gram models. Your program should learn a single n-gram model from any number of input files.

Your program should run as follows:

assign4.pl n m input-file/s

so running your program like this:

assign4.pl 3 10 book1.text book2.text

should result in 10 randomly generated sentences based on a trigram model learned from book1.text and book2.text.
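As a rough illustration of the calling convention above, the following Python sketch separates the arguments into n, m, and the list of input files, then pools the text of all the files so that a single model can be learned from them. The assignment itself expects a Perl script (assign4.pl), so this is only an illustration; the function names, the error messages, the UTF-8 encoding, and the requirement that n and m be positive are all assumptions made for this sketch.

```python
def parse_args(argv):
    """Split the argument list into n, m, and the input file names."""
    if len(argv) < 3:
        raise SystemExit("usage: assign4 n m input-file/s")
    n, m = int(argv[0]), int(argv[1])
    if n < 1 or m < 1:
        # Assumption of this sketch: both values must be positive.
        raise SystemExit("n and m must be positive integers")
    return n, m, argv[2:]


def read_corpus(paths):
    """Concatenate all input files so that one model is learned from them."""
    parts = []
    for path in paths:
        # Assumption of this sketch: the input files are UTF-8 text.
        with open(path, encoding="utf-8") as handle:
            parts.append(handle.read())
    return "\n".join(parts)


# The example invocation from the text, passed as an explicit list.
print(parse_args(["3", "10", "book1.text", "book2.text"]))
```

In a real script the list would come from the command line rather than being written out by hand.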

Make sure that you separate punctuation marks from text and treat them as tokens. Also treat numeric data as tokens. So, in a sentence like

my, oh my, i wish i had 100 dollars

you should have 11 tokens

my , oh my , i wish i had 100 dollars
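One possible way to get this tokenization is a single regular expression that matches either a run of word characters or one punctuation character. The Python sketch below is an illustration only (the assignment expects Perl). The name tokenize and the choice to keep an internal apostrophe inside a word (so that "don't" stays one token) are assumptions, since the specification does not cover that case.

```python
import re

# A token is either a run of letters/digits (optionally joined by internal
# apostrophes, an assumption of this sketch) or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Lower-case the text and split it into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("My, oh my, I wish I had 100 dollars")
print(len(tokens), " ".join(tokens))
```

For the example sentence this yields the 11 tokens listed above, with each comma and the number 100 as separate tokens.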

You may assume that any period, question mark, or exclamation point marks the end of a sentence. (In general this assumption is wrong, but it is perfectly adequate for our purposes here.) When generating a sentence, keep going until you generate a terminating punctuation mark; at that point the sentence is complete.
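The stopping rule can be sketched as a loop that samples one token at a time, in proportion to how often it followed the current history, and returns as soon as a terminator is drawn. This Python sketch is an illustration only, not a prescribed method. The model layout (a dictionary from an (n-1)-token history to follower counts), the "<s>" start marker used to begin a sentence, and the tiny hand-built bigram model are all assumptions.

```python
import random

TERMINATORS = {".", "?", "!"}
START = "<s>"  # hypothetical start-of-sentence marker, not from the assignment


def generate_sentence(model, n, rng=random):
    """Sample tokens one at a time until a terminating punctuation mark appears."""
    history = [START] * (n - 1)
    sentence = []
    while True:
        followers = model[tuple(history)]
        candidates = list(followers)
        weights = [followers[token] for token in candidates]
        # Draw the next token in proportion to its observed count.
        token = rng.choices(candidates, weights=weights, k=1)[0]
        sentence.append(token)
        if token in TERMINATORS:
            return sentence
        # Slide the history window forward by one token (empty when n == 1).
        history = (history + [token])[1:]


# Hand-built bigram counts, as if learned from "i am ." and "i am sam ."
toy_model = {
    ("<s>",): {"i": 2},
    ("i",): {"am": 2},
    ("am",): {".": 1, "sam": 1},
    ("sam",): {".": 1},
}
print(" ".join(generate_sentence(toy_model, 2)))
```

If the counts come only from complete training sentences, every history the loop can reach has at least one follower, and a terminator always remains reachable.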

If a sentence in the input text contains fewer than n tokens, you may simply discard it and not use it when computing n-gram probabilities.
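A minimal sketch of the counting step, under the same caveat that it is an illustration in Python and not the required Perl solution: the token stream is cut into sentences at each terminator, sentences with fewer than n tokens are skipped, and the remaining ones contribute counts from which relative-frequency probabilities can be read off. The function names, the "<s>" padding marker, and the decision to drop trailing tokens that lack a terminator are assumptions of this sketch.

```python
from collections import Counter, defaultdict

TERMINATORS = {".", "?", "!"}
START = "<s>"  # hypothetical start-of-sentence marker, not from the assignment


def split_sentences(tokens):
    """Group a flat token list into sentences ending at '.', '?' or '!'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in TERMINATORS:
            sentences.append(current)
            current = []
    # Assumption: trailing tokens with no terminator are dropped.
    return sentences


def count_ngrams(sentences, n):
    """Map each (n-1)-token history to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        if len(sentence) < n:
            continue  # fewer than n tokens, so the sentence is discarded
        padded = [START] * (n - 1) + sentence
        for i in range(len(sentence)):
            history = tuple(padded[i:i + n - 1])
            model[history][padded[i + n - 1]] += 1
    return model


tokens = "i am . i am sam . no .".split()
model = count_ngrams(split_sentences(tokens), 3)
followers = model[("i", "am")]
# Relative-frequency estimate of P(sam | i am)
print(followers["sam"] / sum(followers.values()))
```

Here the two-token sentence "no ." is discarded because n is 3, and "sam" follows the history "i am" in one of its two occurrences.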

Policies (from syllabus)

All programming assignments and your project will be demonstrated during designated lab sessions. You should also submit an electronic copy of your source code to the TA prior to the designated demo session. (His email address is patw0006@d.umn.edu.) There is no other way to submit your programming assignments or project. Failure to submit AND demo on time will result in a zero.

Any code you submit should be commented. I must be able to understand what your code does simply by reading the comments. This understanding should extend down to the details of your code. Do not simply describe the input and output; also include comments that describe your particular algorithm and coding techniques. Failure to comment to this degree will result in a zero.

All assignments and the project are to be done individually. You are required to write your own code. Unless otherwise specified, you must turn in only code that you personally wrote. The only possible exception is if I tell you to use a module that is available in a book or online archive, and I will clearly indicate when this is permissible. Violations of this policy will result in severe grading penalties and/or failure in the class.