CS 8761 Natural Language Processing - Fall 2004

Assignment 1 (Poor Man's LSA) - Due Wed, Sept 29, noon

This may be revised in response to your questions, so please check this page from time to time. (Last Revision Thu Sept 23)

Objectives

This assignment will help you learn how to process larger quantities of text and store important information about that text in an economical way. It will also introduce you to the method of Latent Semantic Analysis and similarity measurements in text in a straightforward way.

Specification

In this assignment, you will implement something akin to LSA, except it won't include SVD. This is a "Poor Man's" version of LSA, and perhaps when we are richer in knowledge we'll try our hand at the real thing. But until then...

You will write three Perl programs for this assignment. These must run on the Solaris/Sun platform we have available here at UMD, since this is the platform I will use to test your code.

Part A:
Write a program called matrix.pl that will accept as input any number of text files, and produce a co-occurrence matrix (sort of). Conceptually your program is producing a co-occurrence matrix that shows how often each word occurs in each context, but in reality you should not implement it using a "dense" 2-d matrix (where all cells are explicitly represented) for reasons that will be described shortly.

You may assume that each line in each input file represents a separate context. The cells in your co-occurrence matrix should show how many times each word occurs in each context. Treat all the files as one single body of text. However, rather than storing the raw count of a word, store 1 plus the log of that count; in other words, each cell that has a non-zero value should contain the following:
 1 + log(count) 
As mentioned above, you should *not* use a simple 2-d matrix to store these values. A matrix stored in that form will get too large for our system to handle. Since it will be a very sparse matrix, we can use a more economical means of storing these counts, sometimes known as a "sparse" format. A sparse format typically does not explicitly store 0 values in cells. Thus, you should implement your co-occurrence matrix such that you only need to store the counts of non-zero values. The 0 values can be thought of as implicit and unmentioned.
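As a rough illustration (a sketch, not a required design), a Perl hash of hashes gives you sparse storage for free, since only the cells you actually increment ever exist. The output format shown here (one "word context weight" triple per line) is only an assumption; the sparse format is yours to design.

```perl
#!/usr/bin/perl -w
# Sketch only, not a required design.  A hash of hashes is naturally
# sparse: only the cells you actually increment ever exist.  The output
# format below (one "word context weight" triple per line) is an
# assumption -- any sparse format of your own design is fine.

use strict;

my %count;        # $count{$word}{$context} = raw count; zero cells never stored
my $context = 0;  # every line of every input file is one context

# tally one line (= one context) of text into the sparse matrix
sub add_context {
    my ($line) = @_;
    $context++;
    $count{$_}{$context}++ for split ' ', lc $line;
}

if (@ARGV) {
    while (<>) { add_context($_) }    # <> walks all files on the command line

    # emit 1 + log(count) for each non-zero cell; zeros stay implicit
    for my $word ( sort keys %count ) {
        for my $c ( sort { $a <=> $b } keys %{ $count{$word} } ) {
            printf "%s %d %.4f\n", $word, $c, 1 + log( $count{$word}{$c} );
        }
    }
}
```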

matrix.pl should output the co-occurrence information to STDOUT in your sparse format, and this will become the input to your other two programs.

Thus, your matrix.pl program should run as follows:
 
matrix.pl file1.txt > occur.txt
matrix.pl file1.txt file2.txt > occur.txt
matrix.pl file2.txt news.txt > occur.txt
etc...
Make no assumptions about the names of input files, and allow an arbitrary number of files to be given on the command line. You may assume that the data you will process is plain text, and that each line contains a separate context. I will provide you with some data you can test with, and you should also create your own for your own testing.

Part B:
You will write a program called similar.pl that will take an arbitrary number of words as input, and produce complete pairwise similarity scores among this set of words. These scores should come from the cosine measure. Make sure you use the real-valued cosine (see page 300 of text).
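For illustration only, here is a sketch of the real-valued cosine, dot(a,b) / (|a| |b|), over sparse vectors, under the assumption that each word's vector is a hash of context => weight (a hypothetical format; yours may differ). Note that only the contexts the two words share contribute to the dot product.

```perl
# Sketch only: real-valued cosine between two sparse word vectors.
# Each vector is assumed to be a hash of context => weight, matching
# the hypothetical sparse format sketched for matrix.pl.

use strict;

sub cosine {
    my ( $u, $v ) = @_;                  # hash refs: context => weight
    my ( $dot, $nu, $nv ) = ( 0, 0, 0 );
    $nu += $_**2 for values %$u;         # squared length of each vector
    $nv += $_**2 for values %$v;
    for my $c ( keys %$u ) {             # only shared contexts contribute
        $dot += $u->{$c} * $v->{$c} if exists $v->{$c};
    }
    return ( $nu && $nv ) ? $dot / ( sqrt($nu) * sqrt($nv) ) : 0;
}

my %v = ( 1 => 2.0, 3 => 1.5 );
printf "%.4f\n", cosine( \%v, \%v );     # a vector with itself: 1.0000
```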

The input to your similar.pl program will be the output of matrix.pl, which should come to similar.pl from STDIN. The output from similar.pl should be a table showing the matrix of scores for the set of words you have submitted. Please show these scores to only 4 digits of precision. This output should go to STDOUT.

Thus, you should be able to run your program as follows:
similar.pl cat dog house mouse < occur.txt > similar.txt
matrix.pl file3.txt news.txt | similar.pl cat dog house mouse > similar.txt
etc.
The output should be formatted like this:
			   cat     dog   house   mouse
 
	cat             1.0000   .3243   .1234   .9900
	dog              .3243  1.0000   .8776   .2211
	house            .1234   .8776  1.0000   .5321
	mouse            .9900   .2211   .5321  1.0000
Note that these values are made up, and are not meant to be indicative of actual values you might find for these words.

Make no assumptions about the number of words that will be input to your program. If a word is input that is not known to your system, handle that gracefully by putting a ? in the table entries associated with that word.

Part C:
You will write a program called knn.pl that will take a given word as input, and will output the N nearest neighbors to that word according to the information in your co-occurrence matrix. This program will take two command line arguments: the word you are interested in, and the number of neighbors you would like to find. You should find the N words that are closest to your given word with respect to the cosine measure. Make sure that you use the real-valued cosine (see page 300 of text).
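The ranking step might be sketched as follows; this is illustrative only, and the %vectors hash (word => { context => weight }) with its made-up words and values is a hypothetical stand-in for whatever your matrix.pl format provides.

```perl
# Sketch only: rank every other word by real-valued cosine against a
# target word and keep the top N.  The vectors below are made up.

use strict;

sub cosine {
    my ( $u, $v ) = @_;                  # hash refs: context => weight
    my ( $dot, $nu, $nv ) = ( 0, 0, 0 );
    $nu += $_**2 for values %$u;
    $nv += $_**2 for values %$v;
    for my $c ( keys %$u ) {
        $dot += $u->{$c} * $v->{$c} if exists $v->{$c};
    }
    return ( $nu && $nv ) ? $dot / ( sqrt($nu) * sqrt($nv) ) : 0;
}

# return the $n words nearest to $target, best first
sub nearest {
    my ( $target, $vectors, $n ) = @_;
    return () unless exists $vectors->{$target};   # unknown word: no neighbors
    my %score;
    for my $w ( keys %$vectors ) {
        $score{$w} = cosine( $vectors->{$target}, $vectors->{$w} )
            unless $w eq $target;
    }
    my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
    $#ranked = $n - 1 if @ranked > $n;             # keep only the top N
    return @ranked;
}

my %vectors = (
    dog  => { 1 => 1.7, 2 => 1.0 },
    bone => { 1 => 1.0 },
    fish => { 3 => 1.0 },
);
print join( " ", nearest( "dog", \%vectors, 2 ) ), "\n";   # bone fish
```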

The input to your knn.pl program will be the output of matrix.pl, which should come to knn.pl from STDIN. The output from knn.pl should be a table which includes the word you are interested in, as well as the N neighbors and their scores (please only show these scores to 4 digits of precision). This output should go to STDOUT. Thus, you should be able to run your program as follows:
 
knn.pl dog 5 < occur.txt > knn.txt
matrix.pl file1.txt file2.txt | knn.pl dog 5 > knn.txt
etc. 
The output should be formatted like this:
		      dog

	bone        .9998
	meat        .9977
	horse       .8890
	cow         .8740
	fish        .5432
Note that these values are made up, and are not meant to be indicative of actual values you might find for "dog".

Experiments and Report

You will produce a short written report describing the outcome of the following experiments.

Compare the results of your similar.pl program to the Matrix Comparison Demo of LSA using term-term comparison. You should carry out at least 4 different comparisons, each using a different set of 5 or more words. Try to choose sets of words where you have some idea about how related they should be. Also, try to make the sets of words as different from each other as possible. For example, one set of words might be verbs, another might be nouns, another might be place names, etc. In your report, make sure you describe why you selected the words you did, and what you hope they show.

In your report, include both the results generated by your program, and those from the LSA demo. Make sure you describe the settings that you used with the LSA demo (which topic space, how many factors, etc.). Comment on how well each method seemed to perform, which was better and closer to your expectations, and why they might lead to similar or different results. Your objective here should be to show that you can interpret the results of your system and the LSA system in a reasonably insightful manner that reflects some understanding of the underlying mechanisms.

Then, compare the results of your knn.pl program to the Near Neighbor Demo using term-term comparison.

You should carry out at least 5 different comparisons where you generate 20 neighbors, each using a different word to generate neighbors. Have 2 of your words be among those used in your similar.pl experiments, and comment on the differences between what knn.pl and similar.pl report in those cases.

In your report, you should list the words you used, and explain why you chose them, and what you would expect their neighbors to be. Show the results from your program and the LSA demo, and comment on why they are similar and different, and what this might tell you about how both methods function.

Finally, in your report, please conclude by discussing how well your Poor Man's version of LSA approximates the real thing. Discuss improvements you could make to your Poor Man's version (short of adding SVD!), and how that might improve performance.

Submission Guidelines

Please name your programs 'matrix.pl', 'similar.pl', and 'knn.pl'. Please name your report 'experiments.txt'. Each of these should be a plain text file. Make sure that your name, date, and class information are contained in each file, and that your programs are commented.

Place these four files into a directory that is named with your UMD user id. In my case the directory would be called tpederse, for example. Then create a tar file that includes this directory and the four files it contains. Compress that tar file and submit it via the web drop from the class home page.
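For example, assuming the hypothetical user id tpederse, the packaging steps might look like this (the touch line merely stands in for your real files so the commands can be run as shown):

```shell
# Hypothetical user id "tpederse" -- substitute your own.
# touch stands in for your real files here; you would already have them.
touch matrix.pl similar.pl knn.pl experiments.txt
mkdir tpederse
mv matrix.pl similar.pl knn.pl experiments.txt tpederse
tar -cf tpederse.tar tpederse   # archive the whole directory
gzip tpederse.tar               # yields tpederse.tar.gz for the web drop
```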

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be graded. The web drop has a limit of 10 MB, so your files should be plain text. If you have large data files you wish to share for some reason, please contact me via email and we'll figure out a way to do that.

This is an individual assignment. You must write *all* of your code on your own. Do not get code from your colleagues, the internet, etc. You are welcome to discuss LSA and similarity measurements in general with your classmates, friends, family, etc., but do not discuss implementation details. You must also write your report on your own. Please do not discuss your interpretations of these results amongst yourselves. This is meant to make you think for yourself and arrive at your own conclusions.

by: Ted Pedersen - tpederse@umn.edu