CS 8761 Natural Language Processing - Fall 2004

Assignment 3 - Due Mon, Nov 1, noon

This may be revised in response to your questions. Last update: Monday, Oct 25, 9 pm.

Objectives

To develop a method of identifying two-word collocations using the mean and variance of the distance between words, and then to compare that method with the log-likelihood ratio. The method is based on Retrieving Collocations from Text: Xtract, by Frank Smadja.

Specification

Implement a Perl program called mean_variance.pl that will identify collocations in text using the method described in Section 5.2 of your text. You should also develop a score that ranks the bigrams according to how good a collocation each one is. This score should be based on the mean, standard deviation, frequency count, and window size, and should be a single value that you can use to sort your results.
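To make the textbook method concrete, here is a minimal sketch of the bookkeeping involved. It assumes the input is a stream of whitespace-separated tokens (as produced by boundary.pl), records offsets in one direction only (the second word following the first), and keeps running sums rather than lists of offsets. It is an illustration only, not a complete solution, and it deliberately omits the ranking score, which you must design yourself.

#!/usr/bin/perl -w
# Sketch only: for every ordered pair of words co-occurring within the
# window, accumulate the count n, the sum of offsets, and the sum of
# squared offsets, then report
#   mean = sum/n   and   s = sqrt((sumsq - n*mean^2)/(n - 1))
use strict;

my $window = shift @ARGV;                 # maximum words allowed between a pair
die "usage: mean_variance.pl window [file]\n"
    unless defined $window && $window =~ /^\d+$/;

my @tokens;
while (<>) {                              # reads the named file(s), else STDIN
    push @tokens, split;
}

my (%n, %sum, %sumsq);
for my $i (0 .. $#tokens) {
    my $last = $i + $window + 1;          # offset window+1 = window words between
    $last = $#tokens if $last > $#tokens;
    for my $j ($i + 1 .. $last) {
        my $pair = "$tokens[$i] $tokens[$j]";
        my $d    = $j - $i;               # offset of the second word from the first
        $n{$pair}++;
        $sum{$pair}   += $d;
        $sumsq{$pair} += $d * $d;
    }
}

for my $pair (keys %n) {
    next if $n{$pair} < 2;                # a sample deviation needs n >= 2
    my $mean = $sum{$pair} / $n{$pair};
    my $var  = ($sumsq{$pair} - $n{$pair} * $mean * $mean) / ($n{$pair} - 1);
    $var = 0 if $var < 0;                 # guard against rounding error
    printf "%.2f  %.2f  %d  %s\n", $mean, sqrt($var), $n{$pair}, $pair;
}

Note that on the full 8,000,000-token corpus the three hashes will still grow very large, so you may need to prune rare pairs or otherwise economize on memory.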

Your program should accept the following input parameters from the command line: a window size, followed optionally by the name of a file of text to process (when no file is named, the program should read from standard input; see the sketch following the examples). Thus, you should be able to run your program as follows:
 
boundary.pl input1.txt input2.txt > data.txt
mean_variance.pl 10 data.txt
                         OR 
boundary.pl input1.txt input2.txt | mean_variance.pl 10 


boundary.pl myfile.txt myotherfile.txt mythirdfile.txt > data.txt
mean_variance.pl 5 data.txt
                         OR
boundary.pl myfile.txt myotherfile.txt mythirdfile.txt | mean_variance.pl 5
In each case the window size is the first command-line argument: the first group of commands will find bigrams that have up to 10 words between them, while the second will find bigrams with up to 5 words between them.
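Both invocation styles come for free from Perl's null filehandle: once the window size has been shifted off @ARGV, the diamond operator <> reads from whatever file names remain in @ARGV, or from standard input when there are none. A tiny self-contained demonstration (not part of the solution):

#!/usr/bin/perl -w
# demo.pl -- show the file-or-pipe input contract
use strict;

my $window = shift @ARGV;           # first argument: the window size
die "usage: demo.pl window [file]\n" unless defined $window;

my $count = 0;
while (<>) {                        # remaining arguments are files; else STDIN
    my @fields = split;             # whitespace-separated tokens on this line
    $count += @fields;
}
print "window size $window, $count tokens read\n";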

You should also familiarize yourself with the Ngram Statistics Package, which is available on the Sun systems on campus. You can also find it at http://www.d.umn.edu/~tpederse/nsp.html. You will run experiments using the log-likelihood measure as provided in this package.
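Typical usage is to count bigrams with count.pl and then score them with statistic.pl, along these lines (the file names are placeholders, and option names can differ between NSP versions, so consult the package documentation):

count.pl bigrams.cnt data.txt
statistic.pl ll bigrams.ll bigrams.cnt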

Output

Your program should output a table similar to Table 5.5 (page 161) of your text. This table shows the mean position, the standard deviation, the count of the bigram, and then the two words that make up the bigram. In addition to this, you should show your "ranking value" and sort your output with respect to this value. This is the score that you will develop so as to produce the most reliable and interesting ranking of collocations. This score can be based on some combination of the mean, standard deviation, frequency, and window size.
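One common idiom for the sorting step is to keep the ranking values in a hash and sort the keys by value, as in the fragment below. The score used here is a deliberately naive placeholder (raw frequency), since the real score is yours to design, and %mean, %sdev, and %count are assumed to have been filled in as sketched under Specification.

#!/usr/bin/perl -w
# Sketch of the output stage only.
use strict;

my (%mean, %sdev, %count);          # assumed filled in as sketched earlier

my %score;
for my $pair (keys %count) {
    $score{$pair} = $count{$pair};  # PLACEHOLDER: rank by raw frequency only
}

for my $pair (sort { $score{$b} <=> $score{$a} } keys %score) {
    printf "%8.2f %8.2f %6d %10.2f  %s\n",
        $mean{$pair}, $sdev{$pair}, $count{$pair}, $score{$pair}, $pair;
}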

Experiments and Report

All of your experiments should be performed on a corpus of New York Times text from 2002 that is now available on the class web page. It consists of approximately 8,000,000 tokens.

You will produce a written report describing the outcome of a number of experiments. However, you should begin your report by describing how you are interpreting the scores from the mean_variance.pl method. In other words, describe what you believe to be the best way to rank these scores in order to find or retrieve significant collocations. Please title this portion of your report "MY INTERPRETATION OF MEAN VARIANCE". Please provide specific examples showing why you believe your interpretation is correct.

TOP 50 COMPARISON:

Run NSP/ll.pm and mean_variance.pl on the New York Times corpus. Look over the top 50 or so ranks produced by each program. (Note that you should rank your mean_variance.pl results according to the interpretation you described above.) Which seems better at identifying significant or interesting collocations? How would you characterize the top 50 bigrams found by each? Is one of these measures significantly "better" or "worse" than the other? Why do you think that is?

CUTOFF POINT:

Look up and down the list of bigrams as ranked by NSP/ll.pm and mean_variance.pl. Do you notice any "natural" cutoff point for scores, where bigrams above this value appear to be interesting or significant, while those below do not? If you do, indicate what the cutoff point is and discuss what this value "means" relative to the test that creates it. If you do not see any such cutoff, discuss why you can't find one. What does that tell you about these tests?

OVERALL RECOMMENDATION:

Based on your experiences above and any other variations you care to pursue, is NSP/ll.pm or mean_variance.pl "the best" for identifying significant collocations in large corpora of text? If one is better, please explain why it is better. If neither is better, please explain that as well. Your explanation should be specific to your investigations and should not simply repeat conventional wisdom.

Please divide your report into sections according to the subheadings above (TOP 50 COMPARISON, CUTOFF POINT, OVERALL RECOMMENDATION). You should include written analysis of your results under each subheading, as well as actual program output to support your analysis.

Submission Guidelines

Submit your programs boundary.pl and mean_variance.pl, as well as your report file (experiments.txt), as a single compressed tar file named with your user id. This should be submitted to the web drop on the class web page prior to the deadline.
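For example, if your user id were smith123 (a stand-in; use your own):

tar -cvf smith123.tar boundary.pl mean_variance.pl experiments.txt
gzip smith123.tar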

This is an individual assignment. You must write *all* of your code on your own. Do not get code from your colleagues, the Internet, etc. In addition, you are not to discuss which measure of association you are using with your classmates. It is essentially impossible that you will all independently arrive at the same measure or a small set of measures, so please work independently. You must also write your report on your own. Please do not discuss your interpretations of these results amongst yourselves. This is meant to make you think for yourself and arrive at your own conclusions.

by: Ted Pedersen - tpederse@umn.edu