CS 5761 - Introduction to Natural Language Processing

Programming Assignment 2 - Submit via web drop by 5pm Monday Feb 16.

Objectives

To develop techniques that will be useful in creating profiles of written documents. To gain experience with Perl file handling and hashes.

Specification

Write a Perl program called profiler.pl that takes as input a rank cutoff (described below), a split value (described below), and one or more text files. If multiple files are given, treat them as one large document. Convert all text to a single case (upper or lower), and remove or ignore punctuation. Your program should output the following information in a nicely designed report. For example, the following command runs the profiler, displays the top 30 ranked words, and displays the word types that are unique to the last 5 percent of the text. The text to be processed is found in two files (olivertwist.txt and davidcopperfield.txt) but will be treated as one big file.
 
profiler.pl 30 5 olivertwist.txt davidcopperfield.txt

There will be some experiments for you to run and report on as part of this assignment. These will be discussed in more detail in lecture, and your results will be due a few days after the code is to be turned in.
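
The input handling described in the specification (several files treated as one document, case folding, punctuation removal) can be sketched roughly as follows. This is only an illustrative starting point, not the required design: the subroutine names tokenize and count_types are made up for this example, the text is assumed to be plain ASCII, and the rank cutoff and split value are read but deliberately left unused.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): turn one line of text
# into lowercase tokens with punctuation removed. ASCII text is assumed.
sub tokenize {
    my ($line) = @_;
    $line = lc $line;               # fold everything to lower case
    $line =~ s/[^a-z0-9\s]//g;      # delete anything that is not a letter, digit, or space
    return grep { length } split /\s+/, $line;   # drop empty leading fields
}

# Hypothetical helper: count how often each word type occurs across all
# files, treating the files as one large document.
sub count_types {
    my (@files) = @_;
    my %freq;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            $freq{$_}++ for tokenize($line);
        }
        close $fh;
    }
    return \%freq;
}

# Argument order follows the example command: rank cutoff, split value, files.
my ($cutoff, $split, @files) = @ARGV;
if (@files) {
    my $freq   = count_types(@files);
    my $tokens = 0;
    $tokens += $_ for values %$freq;
    printf "%d word types, %d tokens\n", scalar(keys %$freq), $tokens;
}
```

Deleting punctuation outright turns "don't" into "dont" and joins hyphenated words; replacing punctuation with a space instead is an equally reasonable choice, and either is consistent with "remove or ignore punctuation."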

Experiments

You should conduct the following experiments, and prepare a short presentation to be given in the lab on Thu Feb 19. This will count for 30% of the grade on this assignment.

You should select two corpora that are from different genres. Each should consist of at least 100,000 tokens. For example, you might choose a novel as your first corpus, and news wire text as your second corpus. Note that your two corpora should be fairly distinct, so do not select two different novels or text from two different newspapers. You should be able to find suitable text by searching around the Internet. Don't forget the links that are found on the class web page.

Run your profiler.pl program on each corpus with two different settings for the split value S: 50 and 5. Interpret what these results tell you about the nature of the corpora, and about the nature of Zipf's Law.
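
As a reminder of what to look for, Zipf's Law says that a word's frequency is roughly inversely proportional to its rank, so rank times frequency should stay roughly constant down the ranked list. A minimal sketch of that check is shown below. The subroutine name zipf_rows and the toy counts are invented for illustration, and giving tied words consecutive ranks is a simplifying assumption rather than the assignment's definition of rank.

```perl
use strict;
use warnings;

# Hypothetical helper: given a hash ref of word => count and a rank cutoff,
# return one row per word: [rank, word, frequency, rank * frequency].
sub zipf_rows {
    my ($freq, $cutoff) = @_;
    # Sort by descending frequency; break ties alphabetically so the
    # order is stable. Tied words still get consecutive ranks (assumption).
    my @words = sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq;
    $cutoff = @words if $cutoff > @words;   # never ask for more rows than types
    my @rows;
    for my $i (0 .. $cutoff - 1) {
        my $w = $words[$i];
        push @rows, [ $i + 1, $w, $freq->{$w}, ($i + 1) * $freq->{$w} ];
    }
    return @rows;
}

# Toy counts, made up purely for illustration (not from any real corpus).
my %toy = (the => 60, of => 30, and => 20, to => 15, a => 12);
printf "%4s  %-6s %5s %7s\n", 'rank', 'word', 'freq', 'r*f';
printf "%4d  %-6s %5d %7d\n", @$_ for zipf_rows(\%toy, 5);
```

The toy counts were chosen so that the product is the same in every row; real corpora will only approximate this, and where the product drifts is part of what your interpretation should address.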

You should prepare a presentation to give to the class during the lab. This should consist of PowerPoint slides or handouts. In either case, submit your slides or handouts to the web drop by 4pm on Thursday, and I will either put the slides on the computer in HH 302 for projection or print enough copies of your handouts so that each member of the class receives one.

Your presentation should summarize the characteristics of your corpora, the results of each of the four runs of the profiler (two corpora, two settings of S), and then your conclusions about what this tells us about the nature of the text and Zipf's Law. Note that you must draw conclusions; you cannot simply summarize your results.

Policies (see syllabus for more details)

Please comment your code. In particular, note where each piece of data requested above is being collected in your code. Also make sure your name, class, etc. are clearly included in the comments.

It is fine to use a Perl reference book for examples of loops, variables, etc., but your profiler-specific code must be your own, and not taken from any other source (human, published, on the web, etc.).