CS 8761 Natural Language Processing - Fall 2002

Assignment 2 - Due Fri, October 11, noon

This may be revised in response to your questions. Last update Mon Oct 7 11am

Objectives

To investigate various measures of association that can be used to identify collocations in large corpora of text. In particular, you will identify and implement a measure that can be used with 2 and 3 word sequences, and compare it with some other standard measures.

Specification

Download and install version 0.51 of the N-gram Statistics Package.

NSP comes with a number of modules that can be used to perform various tests and measures on 2x2 tables in order to identify 2 word collocations in text. These modules are a good starting point, but they by no means represent the full range of possibilities.

For example, the module mi.pm provided with NSP implements pointwise mutual information, yet there is no module for "true" mutual information. In addition, all of the modules provided are suitable only for 2 word sequences (bigrams), even though there is no reason that tests of association cannot be implemented for 3 word sequences (trigrams).
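
To make the distinction concrete, pointwise mutual information scores only the cell in which the bigram itself was observed, while "true" mutual information averages the same kind of log ratio over all four cells of the 2x2 table. The fragment below is a standalone sketch of the pointwise calculation, using the common n11/n1p/np1/npp naming for the joint count, the two marginal counts, and the total; the counts themselves are invented for illustration, and nothing here is taken from mi.pm.

  # Sketch of pointwise mutual information for a single bigram "word1 word2".
  # All counts are invented for illustration.
  use strict;
  use warnings;

  my $n11 = 30;       # count of the bigram "word1 word2"
  my $n1p = 120;      # count of word1 in the first position of any bigram
  my $np1 = 75;       # count of word2 in the second position of any bigram
  my $npp = 100_000;  # total number of bigrams in the corpus

  # PMI = log2( p(word1,word2) / ( p(word1) * p(word2) ) )
  my $pmi = log( ($n11 * $npp) / ($n1p * $np1) ) / log(2);
  printf "pointwise MI = %.4f bits\n", $pmi;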

This assignment will require that you identify a measure that is suitable for both 2 and 3 word sequences that is not already a part of NSP. You will implement this measure and carry out some experiments to see how well it performs relative to some of the existing measures supported by NSP.

You will find the documentation of NSP to be relatively complete (we hope), and you should be able to determine how to implement modules from the information provided. Please review the various READMEs that come with NSP closely before attempting any part of this assignment.

NSP was implemented entirely by Satanjeev "Bano" Banerjee, a nearly former UMD MS student much like yourself. However, if you have questions about NSP please consult the documentation first, and then me second. Please do not contact Bano under any circumstances.

Output

Your implementation will consist of four .pm files that will be used by the statistic.pl program that comes with NSP. These .pm files will not run on their own and will produce output only when used with NSP. You should make no modifications to NSP. Your modules must work with version 0.51 of NSP.

Experiments and Report

All of your experiments should be performed on the same corpus of text, hereafter known as CORPUS. You should create a corpus of at least 1,000,000 tokens from Internet resources such as Project Gutenberg. Please make CORPUS available to me at /home/cs/(your id)/CS8761/nsp/. You may use English or transliterated Hindi. You will produce a short written report describing the outcome of the following experiments.
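
If you want a quick sanity check that CORPUS is large enough, a sketch along the following lines counts whitespace-separated tokens. The whitespace tokenization (and the file name in the usage comment) are assumptions on my part; use whatever tokenization you actually report in your write-up.

  #!/usr/bin/perl
  # Count whitespace-separated tokens in one or more text files.
  # Hypothetical usage:  perl count_tokens.pl corpus.txt
  use strict;
  use warnings;

  my $tokens = 0;
  while (my $line = <>) {
      my @words = split ' ', $line;   # split on runs of whitespace
      $tokens += scalar @words;
  }
  print "total tokens: $tokens\n";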

Experiment 1:
Implement "true" mutual information for 2 word sequences. Call this module tmi.pm. True (or "real") mutual information is designated as I(X;Y) in our text (see page 67).
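
To make the target of this experiment concrete, the standalone sketch below computes I(X;Y) from a 2x2 table of observed bigram counts. It is only an illustration of the formula, with invented counts; it is not an NSP module, and your tmi.pm must of course follow the module conventions described in the NSP documentation.

  #!/usr/bin/perl
  # Sketch: true mutual information I(X;Y) over a 2x2 contingency table.
  # The cell counts are invented for illustration.
  use strict;
  use warnings;

  my ($n11, $n12, $n21, $n22) = (30, 90, 45, 99_835);  # observed cell counts
  my $npp = $n11 + $n12 + $n21 + $n22;                 # total number of bigrams

  my @obs = ([$n11, $n12], [$n21, $n22]);
  my @row = ($n11 + $n12, $n21 + $n22);                # marginal counts for the first word
  my @col = ($n11 + $n21, $n12 + $n22);                # marginal counts for the second word

  my $tmi = 0;
  for my $i (0, 1) {
      for my $j (0, 1) {
          my $o = $obs[$i][$j];
          next if $o == 0;                             # treat 0 * log 0 as 0
          my $e = $row[$i] * $col[$j] / $npp;          # expected count under independence
          $tmi += ($o / $npp) * log($o / $e) / log(2); # cell contribution, in bits
      }
  }
  printf "I(X;Y) = %.6f bits\n", $tmi;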

Identify a measure that is not currently supported in NSP and that is suitable for discovering 2 and 3 word collocations. For example, the log-likelihood ratio can be formulated in both a 2 word and a 3 word version; however, you may not use it, since it is already included in NSP. Please note that not all measures will extend from 2 to 3 words. You should consult some statistics books (look for tests or measures of association) to find candidates. Please discover your own measure! Do not work with your classmates on this. I expect that there will be a reasonable variety of measures used, and if the class somehow miraculously settles on 1 or 2 measures, I will assume that you have not worked individually.

The textbook mentions the z-score and relative frequency ratios; I do not believe that either is appropriate for this problem. There are some minor variations of Pearson's chi-squared test (which essentially add a constant value to some of the cell counts); these are not sufficiently distinct from what already exists in NSP and are not suitable. Finally, please do not invent your own measures. There are plenty of existing measures of association available.

Once you have discovered a measure suitable for 2 and 3 word sequences, implement the 2 word version. Call the module 'user2.pm'. Please comment your module carefully, including a reference to the source where you found the measure. You should describe the measure completely enough that I could compute values for it by hand, based on your description and given some 2x2 table of values.
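
One way to meet that requirement in your comments is to spell out how the full 2x2 table follows from the joint count, the two marginal counts, and the sample size, and then what your measure does with those cells. A hedged sketch with invented counts (the n11/n1p/np1/npp names are the usual contingency-table convention; check the NSP READMEs for the exact notation NSP itself uses):

  # Recovering the full 2x2 table for the bigram "word1 word2" (invented counts).
  my $n11 = 30;       # count of "word1 word2"
  my $n1p = 120;      # word1 in the first position of any bigram
  my $np1 = 75;       # word2 in the second position of any bigram
  my $npp = 100_000;  # total number of bigrams in CORPUS

  my $n12 = $n1p - $n11;                # word1 followed by something other than word2
  my $n21 = $np1 - $n11;                # something other than word1 followed by word2
  my $n22 = $npp - $n11 - $n12 - $n21;  # neither word appears in its position

  # Expected count for the (1,1) cell under independence, used by many measures:
  my $m11 = $n1p * $np1 / $npp;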

Once you have your implementation working (and do test it by computing some values by hand and making sure that your program reaches the same result), please carry out the following and summarize your findings in a written report. In each section, make sure that you show some of the bigrams or trigrams identified, along with their statistic values/scores and frequency counts. Also make sure to clearly indicate any command line options that you have used, so that I can recreate your results if I wish.
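
For the hand checks, a one-line calculation is usually enough. For example, with the invented counts from the sketches above, the pointwise mutual information value can be verified as follows (purely an illustration of the checking process, not a required command):

  perl -e 'printf "%.4f\n", log(30 * 100000 / (120 * 75)) / log(2)'   # prints 8.3808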

TOP 50 COMPARISON:

Run NSP on CORPUS using user2.pm and tmi.pm. Look over the top 50 or so ranks produced by each module. Which seems better at identifying significant or interesting collocations? How would you characterize the top 50 bigrams found by each module? Is one of these measures significantly "better" or "worse" than the other? Why do you think that is?

CUTOFF POINT:

Look up and down the list of bigrams as ranked by NSP for tmi.pm and user2.pm. Do you notice any "natural" cutoff point for scores, where bigrams above this value appear to be interesting or significant, while those below do not? If you do, indicate what the cutoff point is and discuss what this value "means" relative to the test that creates it. If you do not see any such cutoff, discuss why you can't find one. What does that tell you about these tests?

RANK COMPARISON:

Use rank.pl to compare each of these measures with ll.pm and mi.pm. Which is more like ll.pm, and which is more like mi.pm? To be clear, you should compare ll.pm with tmi.pm and user2.pm, and then compare mi.pm with tmi.pm and user2.pm. Comment on and interpret what you observe. (When running rank.pl, make sure that ll.pm and mi.pm are the first modules indicated each time you run it.)
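
rank.pl compares the rankings that two measures assign to the same n-grams; see its README for exactly what it reports. If you want to sanity-check such a comparison by hand, a rank correlation such as Spearman's coefficient is one standard choice. The standalone sketch below computes it for two invented, tie-free rankings; it is not a substitute for running rank.pl.

  #!/usr/bin/perl
  # Sketch: Spearman's rank correlation between two rankings of the same bigrams.
  # Assumes no ties; the ranks below are invented for illustration.
  use strict;
  use warnings;

  my %rank_a = ( 'new<>york' => 1, 'united<>states' => 2, 'of<>the' => 3 );
  my %rank_b = ( 'new<>york' => 2, 'united<>states' => 1, 'of<>the' => 3 );

  my $n      = scalar keys %rank_a;
  my $sum_d2 = 0;
  for my $bigram (keys %rank_a) {
      my $d = $rank_a{$bigram} - $rank_b{$bigram};
      $sum_d2 += $d * $d;
  }
  my $rho = 1 - (6 * $sum_d2) / ($n * ($n * $n - 1));
  printf "Spearman's rho = %.4f\n", $rho;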

OVERALL RECOMMENDATION:

Based on your experiences above and any other variations you care to pursue, which of the following measures is "the best" for identifying significant collocations in large corpora of text: mi.pm, ll.pm, user2.pm, or tmi.pm? If there is one, please explain why it is better. If none is better please explain. Your explanation should be specific to your investigations and not simply repeat conventional wisdom.

Please divide your report into sections according to the subheadings above (TOP 50 COMPARISON, CUTOFF POINT, RANK COMPARISON, OVERALL RECOMMENDATION). You should dedicate 1-2 paragraphs to each subheading. Please feel free to include small portions of your output to illustrate your points.

Experiment 2:
Implement a module named ll3.pm that performs the log-likelihood ratio test for 3 word sequences. Assume that the null hypothesis of the test is as follows:
p(W1)p(W2)p(W3) = p(W1,W2,W3)
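
Under that null hypothesis, the expected count for each cell of the 2x2x2 trigram table is the product of the three one-way marginal counts divided by the square of the total, and the statistic is the familiar 2 * sum of observed * ln(observed / expected). The standalone sketch below illustrates the arithmetic with invented counts; it is not an NSP module, and your ll3.pm must follow the module conventions in the NSP documentation.

  #!/usr/bin/perl
  # Sketch: log-likelihood ratio for a trigram under the null p(W1)p(W2)p(W3) = p(W1,W2,W3).
  # The 2x2x2 cell counts are invented for illustration.
  use strict;
  use warnings;

  # $obs[$i][$j][$k]: 1 means the word for that position matches, 0 means it does not.
  my @obs;
  $obs[1][1][1] = 10;   $obs[1][1][0] = 15;
  $obs[1][0][1] = 5;    $obs[1][0][0] = 170;
  $obs[0][1][1] = 8;    $obs[0][1][0] = 292;
  $obs[0][0][1] = 127;  $obs[0][0][0] = 99_373;

  my $nppp = 0;        # total number of trigrams
  my @w1 = (0, 0);     # marginal counts for the word in position 1
  my @w2 = (0, 0);     # marginal counts for the word in position 2
  my @w3 = (0, 0);     # marginal counts for the word in position 3
  for my $i (0, 1) { for my $j (0, 1) { for my $k (0, 1) {
      my $o = $obs[$i][$j][$k];
      $nppp   += $o;
      $w1[$i] += $o;
      $w2[$j] += $o;
      $w3[$k] += $o;
  } } }

  my $ll = 0;
  for my $i (0, 1) { for my $j (0, 1) { for my $k (0, 1) {
      my $o = $obs[$i][$j][$k];
      next if $o == 0;                                      # treat 0 * ln 0 as 0
      my $m = $w1[$i] * $w2[$j] * $w3[$k] / ($nppp ** 2);   # expected count under the null
      $ll  += $o * log($o / $m);
  } } }
  $ll *= 2;
  printf "log-likelihood ratio = %.4f\n", $ll;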
Create your 3 word version of user2.pm and call it user3.pm. Please comment user3.pm just as carefully as you did user2.pm. Some of the information may be duplicated (i.e., references, general description), but that is fine. Make it clear how this measure extends to 3 word sequences, and provide enough detail so that I can compute values for it by hand based on your description.

TOP 50 COMPARISON:

As in Experiment 1, except now for user3.pm and ll3.pm.

CUTOFF POINT:

As in Experiment 1, except now for user3.pm and ll3.pm.

OVERALL RECOMMENDATION:

As in Experiment 1, except now for user3.pm and ll3.pm.

Submission Guidelines

Submit your modules 'tmi.pm', 'user2.pm', 'll3.pm', and 'user3.pm'. Each should be commented and should explain which test it implements and what that test does. Your comments for 'user2.pm' and 'user3.pm' must be sufficient for me to manually compute values for these tests. You must also provide a complete worked out example (showing, in detail, how the measure is computed for a specific case). You must also indicate where you found out about this measure: please provide a complete reference for printed materials, and complete working URLs for web pages. Please name your report 'experiments.txt'. Clearly identify the discussion associated with Experiments 1 and 2, and use the SUBHEADINGS provided above to organize your discussion.

All of these should be plain text files. Make sure that your name, date, and class information are contained in each file, and that your .pm files are carefully commented.

Place all of these files into a directory that is named with your UMD user id. In my case the directory would be called tpederse, for example. Then create a tar file that includes this directory and the files you will submit. Compress that tar file and submit it via the web drop from the class home page. Please note that the deadline will be enforced by automatic means; any submissions after the deadline will not be graded. The web drop has a limit of 10 MB, so your files should be plain text. If you have large data files you wish to share, please include them in your /home/cs/(your id)/CS8761/nsp directory.

This is an individual assignment. You must write *all* of your code on your own. Do not get code from your colleagues, the Internet, etc. In addition, you are not to discuss which measure of association you are using with your classmates. It is essentially impossible that you will all arrive at the same measure, or even a small set of measures, if you work independently, so please do. You must also write your report on your own. Please do not discuss your interpretations of these results amongst yourselves. This is meant to make you think for yourself and arrive at your own conclusions.

by: Ted Pedersen - tpederse@umn.edu