CS 5761 - Introduction to Natural Language Processing

Programming Assignment 1 - Demo in Lab on Monday, Feb 4 at 4pm
(also submit via email to patw0006@d.umn.edu before lab)

Objectives

To gain experience with the Perl programming language, in particular focusing on its text processing capabilities.

Specification

Write a Perl program that prints to standard output the n most frequent m word sequences (sequences of m consecutive words) found in an arbitrary number of input files. For this assignment, a word is defined as a string of alphabetic characters. Word sequences are determined as follows. Suppose your input file consists of the following:

this is the test
input
file

If m = 2, then the following word sequences will be found:

this is
is the
the test
test input
input file
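
The example above can be reproduced with a sliding window over a list of words. This is only a minimal sketch, not a required design: the helper name ngrams is hypothetical, and the words are assumed to be already extracted and lower-cased.

```perl
use strict;
use warnings;

# Return every run of $m consecutive words, joined by single spaces.
# ngrams() is a hypothetical helper name, not part of the assignment.
sub ngrams {
    my ($m, @words) = @_;
    my @sequences;
    # The last valid start index is @words - $m. If there are fewer than
    # $m words, the range is empty and no sequences are produced.
    for my $start (0 .. @words - $m) {
        push @sequences, join ' ', @words[$start .. $start + $m - 1];
    }
    return @sequences;
}

# The six words of the example input, with m = 2.
my @words = qw(this is the test input file);
print "$_\n" for ngrams(2, @words);
```

Run as written, this prints the five sequences listed above, one per line.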

n, m, and the input file names should be specified by the user on the command line. Your program should display each word sequence along with the number of times it occurs in the input files. The format of the command line arguments should be as follows:

yourProgram n m [list of input files]

For example,

yourProgram 5 3 in.txt file2

should display the 5 most frequent 3 word sequences that are found in files 'in.txt' and 'file2'. Your program should display the 3 word sequences and their associated frequency counts. Your output might look like this:

new york city 456
for the people 345
president george bush 342
in the time 211
he said this 199

This tells us that 'new york city' occurred 456 times in these 2 files, that 'for the people' occurred 345 times, etc.
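
One possible way to handle the command line format and the output format described above is sketched here. The helper names parse_args and format_counts are hypothetical, and the counts are sample values taken from the example output. In a real program the argument list would come from @ARGV. Ordering equal counts alphabetically is an assumption made only to keep the output deterministic.

```perl
use strict;
use warnings;

# Split an argument list into n, m, and a reference to the file names.
# parse_args() is a hypothetical helper. The assignment lets you assume
# the arguments are correct, so no validation is attempted here.
sub parse_args {
    my ($n, $m, @files) = @_;
    return ($n, $m, \@files);
}

# Format counts as "sequence count" lines, highest count first.
# Equal counts are ordered alphabetically (an assumption of this sketch).
sub format_counts {
    my (%count) = @_;
    my @order = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    return map { "$_ $count{$_}" } @order;
}

# Stand-in for @ARGV when run as: yourProgram 5 3 in.txt file2
my ($n, $m, $files) = parse_args(qw(5 3 in.txt file2));
print "n=$n m=$m files=@$files\n";

# Sample counts copied from the example output above.
my %count = (
    'he said this'   => 199,
    'new york city'  => 456,
    'for the people' => 345,
);
print "$_\n" for format_counts(%count);
```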

In the event of ties among the top n most frequent word sequences, display all of the sequences involved in the tie. In such cases your output will contain more than n of these m word sequences. For example, suppose that 'in the time' and 'under the bridge' tie for the 4th ranked 3 word sequence in the files above, both occurring 211 times. Both should then be displayed, and given the command line arguments above there will be 6 three word sequences in the output:

new york city 456
for the people 345
president george bush 342
in the time 211
under the bridge 211
he said this 199
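
The tie example above implies that tied sequences share a rank: with n = 5, the two sequences at rank 4 are both shown and 'he said this' still appears at rank 5. A minimal sketch of that reading follows. The helper name top_with_ties is hypothetical, the alphabetical ordering of tied sequences is an assumption, and the entry 'on the other' is invented here only to show where the output stops.

```perl
use strict;
use warnings;

# Return the sequences whose counts fall within the n highest distinct
# count values, so that tied sequences share a rank.
# top_with_ties() is a hypothetical helper name.
sub top_with_ties {
    my ($n, %count) = @_;
    my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    my @shown;
    my $ranks_used = 0;
    my $previous;
    for my $seq (@ranked) {
        # A count that differs from the previous one starts a new rank.
        if (!defined $previous or $count{$seq} != $previous) {
            last if ++$ranks_used > $n;
            $previous = $count{$seq};
        }
        push @shown, $seq;
    }
    return @shown;
}

my %count = (
    'new york city'         => 456,
    'for the people'        => 345,
    'president george bush' => 342,
    'in the time'           => 211,
    'under the bridge'      => 211,
    'he said this'          => 199,
    'on the other'          => 150,   # invented entry, falls outside the top 5 ranks
);
print "$_ $count{$_}\n" for top_with_ties(5, %count);
```

With these sample counts the sketch prints the same six lines as the example above. If there are fewer distinct counts than n, every sequence is returned.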

Prior to finding any word sequences, your program should eliminate all non-alphabetic characters (other than the whitespace that separates words) from the input files, and convert all text to lower case.
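
A minimal sketch of this clean-up step is given here. It assumes one particular reading of the requirement: characters that are neither ASCII letters nor whitespace are deleted outright (so "N.Y." becomes "ny"), and whitespace is kept so that words remain separated. The helper name normalize is hypothetical.

```perl
use strict;
use warnings;

# Lower-case a line, delete every character that is neither an ASCII
# letter nor whitespace, and return the remaining words as a list.
# normalize() is a hypothetical helper name.
sub normalize {
    my ($line) = @_;
    $line = lc $line;
    $line =~ s/[^a-z\s]//g;      # assumption: delete, do not replace with a space
    return split ' ', $line;     # split ' ' also skips leading whitespace
}

print join(' ', normalize("New York City, N.Y. (pop. 8,000,000)")), "\n";
# prints: new york city ny pop
```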

Make sure to check the boundary cases. For example, if your input files contain very few words, the total number of m word sequences may be fewer than n. In this case your program should display all of the m word sequences. Likewise, if n is larger than the number of words in the input files, your program should list all of the m word sequences.

Your program should treat the input files as one long line of text; in other words, m word sequences should not be interrupted by end-of-line or end-of-file markers.
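
One way to honor this requirement without loading all of the text at once is to keep only the last m words in a rolling window that persists from one line to the next. The sketch below assumes the lines have already been cleaned and lower-cased, and the helper name count_sequences is hypothetical.

```perl
use strict;
use warnings;

# Count m-word sequences over a list of lines. Only the last m words are
# kept, and the window is never reset, so sequences continue across
# line boundaries. count_sequences() is a hypothetical helper name.
sub count_sequences {
    my ($m, @lines) = @_;
    my (%count, @window);
    for my $line (@lines) {
        for my $word (split ' ', $line) {
            push @window, $word;
            shift @window if @window > $m;          # keep at most m words
            $count{"@window"}++ if @window == $m;   # "@window" joins with spaces
        }
    }
    return %count;
}

# The three lines of the example input file, with m = 2.
my %count = count_sequences(2, "this is the test\n", "input\n", "file\n");
print "$_ $count{$_}\n" for sort keys %count;
```

In a complete program the lines could instead be read with the <> operator after n and m have been shifted off @ARGV. Because <> moves from one named file to the next, the same window would then also carry across end-of-file boundaries.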

You may assume that the command line arguments are correct, and that at least one non-empty input file will always be provided.

Test your program using a variety of input files. Make sure that your program can handle 1,000,000 words of input in a reasonable amount of time (no more than a few minutes). You can find large amounts of text by downloading a few books from the Project Gutenberg web site shown on the class web page. During your lab demo the TA will provide test input files, and your program will be graded based on its output for this data. If your program does not run or produces no correct output, you will receive no credit.

Policies (from syllabus)

All programming assignments and your project will be demonstrated during designated lab sessions. You should also submit an electronic copy of your source code to the TA prior to the designated demo session. (His email address is patw0006@d.umn.edu.) There is no other way to submit your programming assignments or project. Failure to submit AND demo on time will result in a zero.

Any code you submit should be commented. I must be able to understand what your code does simply by reading the comments. This understanding should extend down to the details of your code. So do not simply describe the input and output; also include comments that describe your particular algorithm and coding techniques. Failure to comment to this degree will result in a zero.

All assignments and the project are to be done individually. You are required to write your own code. Unless otherwise specified, you must only turn in code that you personally wrote. The only possible exception to this is if I tell you to use a module that is available in a book or online archive. However, I will clearly indicate when this is permissible. Violations of this policy will result in severe grading penalties and/or failure in the class.