CS 5761 - Introduction to Natural Language Processing 
 Programming Assignment 4 - Submit via web drop by 5pm Friday March 26.
 Objectives 
To gain experience learning n-gram language models from text.
 Specification 
Design and implement a Perl program called ngram.pl that will learn an  
n-gram language model from a given body of text. Your program should then  
generate a given number of sentences based on that n-gram model. See the  
discussion on pages 202-206 of your text for further details. 
 
Your program should work for any value of n, and should output m  
sentences. Convert all text to lower case, and make sure to include 
punctuation in the n-gram models. Your program should learn a single 
n-gram model from any number of input files. 
 
Your program should run as follows: 
 
ngram.pl n m input-file/s
 
so running your program like this: 
 
ngram.pl 3 10 book1.text book2.text
 
should result in 10 randomly generated sentences based on a tri-gram
model learned from book1.text and book2.text. 
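
As an illustration of the calling convention above, here is a minimal sketch of
the argument handling, written in Python rather than Perl so that it shows the
idea without standing in for ngram.pl. The function name parse_args and the
error messages are assumptions made for this example, not part of the
assignment.

```python
def parse_args(argv):
    """Split the arguments (program name excluded) into n, m, and file names."""
    if len(argv) < 3:
        raise ValueError("usage: ngram n m input-file/s")
    n, m = int(argv[0]), int(argv[1])
    # Assumption: n must be at least 1 and m must not be negative.
    if n < 1 or m < 0:
        raise ValueError("n must be >= 1 and m must be >= 0")
    return n, m, argv[2:]

# In a real script the list would come from sys.argv[1:].
n, m, files = parse_args(["3", "10", "book1.text", "book2.text"])
print(n, m, files)
```

All of the file names after the first two arguments feed one shared model, so
the list is returned whole rather than processed file by file.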
 
Make sure that you separate punctuation marks from text and treat them as 
tokens. Also treat numeric data as tokens. So, in a sentence like 
 
my, oh my, i wish i had 100 dollars . 
 
you should have 12 tokens:
my , oh my , i wish i had 100 dollars .
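
One possible way to get this tokenization is a single regular expression,
sketched here in Python rather than Perl. The pattern and the name tokenize are
assumptions made for illustration. In particular, keeping a word with an
internal apostrophe (such as "don't") as one token is a choice this sketch
makes; the assignment does not say how to handle that case.

```python
import re

# A token is one of:
#   - a run of letters, optionally with internal apostrophes (assumption)
#   - a run of digits
#   - any single remaining non-space character (a punctuation mark)
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*|\d+|[^\w\s]|_")

def tokenize(text):
    """Lower-case the text and split it into word, number, and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("My, oh my, I wish I had 100 dollars.")
print(len(tokens))       # 12
print(" ".join(tokens))  # my , oh my , i wish i had 100 dollars .
```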
 
Your program will need to identify sentence boundaries, and your n-grams 
should *not* cross these boundaries. For example, you could have input 
like this:

He went down the stairs
and then out the side door. 
My mother and brother 
followed him.

You should treat this as two sentences, as in:

He went down the stairs and then out the side door . 
My mother and brother followed him .

To identify sentence boundaries, you may assume that any period,   
question mark, or exclamation point represents the end of a sentence.  
(In general this assumption is wrong, but is perfectly adequate for our  
purposes here.) When generating a sentence, keep going until you produce a  
terminating punctuation mark. Once you produce one, the sentence is  
complete. 
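
The boundary rule above can be sketched as a single pass over the token stream,
shown here in Python rather than Perl. Because the input is treated as one
stream of tokens, line breaks inside a sentence do not matter. The name
split_sentences is an assumption, and so is the decision to drop trailing
tokens that never reach a terminating mark. The demo uses a deliberately crude
tokenizer that only separates periods.

```python
SENTENCE_END = {".", "?", "!"}

def split_sentences(tokens):
    """Group a flat token list into sentences, each ending in . ? or !"""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    # Assumption: leftover tokens with no terminating mark are discarded.
    return sentences

text = """He went down the stairs
and then out the side door.
My mother and brother
followed him."""

# Crude demo tokenization: pad periods with spaces, then split on whitespace.
tokens = text.lower().replace(".", " . ").split()
for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```

This prints the two sentences, in lower case, each on its own line and each
ending with a separate period token.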
 
If the length of a sentence in the input text file is less than n, then  
you may simply discard that sentence and not use it when computing 
n-gram probabilities. 
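
To make the counting and generation steps concrete, here is a rough sketch in
Python rather than Perl. It is one possible design, not the required one. The
names train and generate, and the "<s>" start-of-sentence padding that gives
generation a starting context, are assumptions made for this example.
Sentences shorter than n are discarded as described above, and n-grams never
cross a sentence boundary because each sentence is counted on its own.

```python
import random
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sentence padding marker
SENTENCE_END = {".", "?", "!"}

def train(sentences, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for sent in sentences:
        if len(sent) < n:  # shorter than n: discard the sentence
            continue
        padded = [START] * (n - 1) + sent
        for i in range(len(sent)):
            context = tuple(padded[i:i + n - 1])
            model[context][padded[i + n - 1]] += 1
    return model

def generate(model, n, rng):
    """Sample tokens in proportion to their counts until a terminating mark."""
    if not model:
        raise ValueError("the model is empty; no sentence had at least n tokens")
    context = (START,) * (n - 1)
    sentence = []
    while True:
        counts = model[context]
        candidates = list(counts)
        weights = [counts[tok] for tok in candidates]
        tok = rng.choices(candidates, weights=weights)[0]
        sentence.append(tok)
        if tok in SENTENCE_END:
            return sentence
        # Slide the context window forward by one token.
        context = (context + (tok,))[1:]

sentences = [
    ["he", "went", "down", "the", "stairs", "."],
    ["my", "mother", "went", "down", "."],
    ["hi", "."],  # only 2 tokens, so it is discarded when n = 3
]
model = train(sentences, 3)
rng = random.Random(0)  # seeded only so that runs are repeatable
for _ in range(2):
    print(" ".join(generate(model, 3, rng)))
```

Sampling in proportion to raw counts is equivalent to sampling from the
relative-frequency estimate of the n-gram probabilities, so the sketch never
needs to store the probabilities themselves.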
 Policies (see syllabus for more details) 
 
Please comment your code. You must provide a detailed description of your 
spelling correction algorithm in your source code comments. This should 
focus on how you score the candidate corrections for a word. Also make  
sure your name, class, etc. are clearly included in the comments. 
  
 
It is fine to use a Perl reference book to provide examples of loops,  
variables, etc., but your ngram.pl-specific code must be your own, and  
not taken from any other source (human, published, on the web, etc.).