CS 8761 Natural Language Processing - Fall 2002 - MRD + Web => Corpus

Final Project - Stage III - Sentiment Classification

This may be revised in response to your questions. Last update Wed Dec 11 10:00 am

PLEASE NOTE: Do not collaborate outside your team.




Invent an algorithm to perform sentiment classification. Implement and evaluate it. Your algorithm must utilize a combination of corpus-based information from the World Wide Web and a machine-readable dictionary.


You must develop a solution to the sentiment classification problem with your teammates. In particular we want to automatically classify reviews as being positive or negative simply by referring to their content and without referring to the "summary judgment" of the reviewer.

Your team must develop one approach that employs the World Wide Web, LDOCE, AND Big Mac. All of these resources must be present in your solution. You may only access these resources through your Stage I and Stage II interfaces, which you are free to improve during the remainder of this project. All the data used by your algorithm must come through these interfaces. You should not encode any of your intuitions in your algorithm. Your program should be completely "data driven", where that data comes from the WWW and your MRDs.

As sources of inspiration, I have provided you with three papers: one by Niwa and Nitta (COLING-1994), one by Pang et al. (EMNLP-2002), and one by Turney (ACL-2002). These are provided simply to give you ideas of what other people have done with this and similar problems, and to provide some additional background on the problem. You should not simply attempt to re-implement one of these techniques. I would like your team to develop a new approach that combines a machine-readable dictionary with corpus information from the World Wide Web! You are welcome to read up on the literature of "sentiment classification". This is an increasingly popular area of research, so you can find related work other than the above.

The data employed by Pang et al. (EMNLP-2002) is available here. I have also put that data in our CS8761 directory as movie-data. We will use that as part of our evaluation.

In addition, each team must collect some review data from the Internet. Each team should locate at least 100 reviews, approximately 50 negative and 50 positive, for something other than movies! Your reviews should be from a single domain (restaurants, cars, etc.). Each review should have at least 200 word tokens, and should be downloaded and stored in its own file within a directory structure identical to that of movie-data. This just means that positive and negative examples should be in different directories (pos and neg).
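The requirements above (pos and neg directories, one review per file, at least 200 word tokens each) can be checked mechanically before you submit. The following is only a minimal sketch, written in Python for brevity rather than in the Perl your project uses. The function names are hypothetical, and it assumes that a "word token" is any whitespace-delimited string, which may differ from the tokenization you prefer.

```python
import os

MIN_TOKENS = 200  # minimum review length stated in the assignment


def count_tokens(text):
    # Assumption: a word token is any whitespace-delimited string.
    return len(text.split())


def check_collection(root):
    """Return (file counts per label, list of (path, token count) for short reviews)."""
    counts = {}
    too_short = []
    for label in ("pos", "neg"):
        folder = os.path.join(root, label)
        counts[label] = 0
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue  # ignore nested directories
            counts[label] += 1
            with open(path, encoding="utf-8", errors="replace") as handle:
                n_tokens = count_tokens(handle.read())
            if n_tokens < MIN_TOKENS:
                too_short.append((path, n_tokens))
    return counts, too_short
```

Calling check_collection on your review directory returns how many files sit under pos and neg, plus any files that fall below the 200 token minimum. It only reads file lengths for validation; it does not feed review content into a classifier.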

Please note that you should collect this data manually, and not attempt to automatically locate this data. Your Stage II tool is not appropriate for this task. Also, make sure you find reviews where there is a clearly identifiable positive or negative "summary" from the reviewer. Your algorithm must *not* use this summary judgment as a part of its decision process. We just want it for evaluation purposes.

The team collected review data is now available here.

You must use your modules from Stage I and II as your only interfaces to the World Wide Web, LDOCE, and Big Mac. These interfaces should be your only sources of data. Do not draw upon data from other sources. In particular, do not take the reviews you are going to classify as a source of data. In other words, we want to avoid a supervised learning technique such as we employed with word sense disambiguation.

Your team will also need to submit final versions of your Stage I and Stage II submissions. I expect that they will be "CPAN ready". Please remember that any data your solution requires must come through these interfaces!

Your sentiment classification program should use your Stage I and Stage II modules, and should run as follows:

sentiment.pl DIRECTORY

where DIRECTORY is a directory of reviews in which positive and negative reviews are divided into positive (pos) and negative (neg) subdirectories, just as the movie-data directory is structured. Each review is in a separate file. Your program should write to standard output. In general it should show the file name of each review, its known classification, and then the judgment of your system. Your system should have THREE judgments: positive, negative, or can't decide (shown as "undecided" in the output). You should then compute precision and recall for your method based on the formulation of these scores in our textbook.

The following is an example of what your output should look like:
../test-data/neg/review1.txt                   negative   negative
../test-data/neg/review2.txt                   negative   positive
../test-data/pos/review1.txt                   positive   positive
../test-data/pos/review2.txt                   positive   undecided
2 1 1
Precision => 0.66
Recall    => 0.50
F-measure => 0.57
The values 2 1 1 indicate that 2 reviews were classified correctly, 1 was classified incorrectly, and that 1 was not classified (undecided).
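The scores in the example can be reproduced from the three counts. The sketch below is in Python for brevity and uses hypothetical function names. It assumes that precision is correct / (correct + incorrect), that recall is correct / (all reviews, including undecided), and that F-measure is their harmonic mean. Those definitions are consistent with the sample values above, but you should confirm them against the textbook formulation.

```python
def tally(judgments):
    """Count (correct, incorrect, undecided) from (known, judged) label pairs."""
    correct = incorrect = undecided = 0
    for known, judged in judgments:
        if judged == "undecided":
            undecided += 1
        elif judged == known:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, undecided


def score(correct, incorrect, undecided):
    """Return (precision, recall, f_measure) under the assumptions stated above."""
    attempted = correct + incorrect      # reviews the system actually judged
    total = attempted + undecided        # every review in the test set
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For the counts 2 1 1, this gives precision 2/3, recall 2/4, and F-measure 4/7, which agree with the sample values 0.66, 0.50, and 0.57 when truncated to two decimal places.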

You should also produce a trace file (trace.txt) that "explains" why your system reached the judgment it did for a particular review. The explanation can just be whatever values or facts that your algorithm used for reaching a particular conclusion. Make sure that your trace clearly indicates which file is being processed, and attempt to show the evidence that is being gathered to make a decision about that review.
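One possible shape for a trace entry is sketched below, in Python for brevity. The layout, the function name, and the (description, value) evidence pairs are all hypothetical. The only requirements stated above are that the trace identifies the file being processed and shows the evidence behind the decision.

```python
def write_trace_entry(handle, review_path, evidence, judgment):
    """Append one review's explanation to an open trace file.

    evidence is a list of (description, value) pairs; what goes in it
    depends entirely on your own algorithm (hypothetical structure).
    """
    handle.write(f"FILE: {review_path}\n")
    for description, value in evidence:
        handle.write(f"  {description}: {value}\n")
    handle.write(f"  JUDGMENT: {judgment}\n\n")
```

Opening trace.txt once at the start of the run and calling this function after each review keeps the entries in the same order as the lines written to standard output.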



There are three pieces of documentation that are required for your sentiment program.

Submission Guidelines

When you submit your final project on Tue Dec 17, submit all stages of the project. Thus, you may still make small changes to your Stage I and Stage II modules, but in general my expectation is that these will be "fine tuning" issues and not major changes in functionality over the version you submitted on Fri Dec 13.

Please put your stage I and stage II modules into directories called LDOCE, BigMac, and WebReader. Please use exactly those names. Please put those within a main directory named after your team (please use Alianza, Melgar, SportBoys, Cristal, and Minas as those names). This main directory should also include your sentiment.pl program plus your various reports. It should also include a directory called sanity-data that includes the sanity check data. Thus, in your main directory you should have the following:

LDOCE (directory)
BigMac (directory)
WebReader (directory)
sanity-data (directory)
Of course you may have other supporting files, documentation, and so forth. The above is the minimum expectation, and the required organization and naming convention. Please carefully test the unpacking and installation of your systems. If I can't unpack and install your modules, and then run sentiment.pl immediately, your team will lose a significant number of points.

You must also submit all of the data that you have cached from your web searches such that I can run your programs using that rather than having to actually redo the web searches. You must submit this in such a way that your program can use it automatically without my having to do or configure anything.
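One way to meet this requirement is to route every web query through a cache that is stored on disk and shipped with your submission. The following is a minimal sketch in Python, not a prescribed design. The class name, the JSON file format, and the fetch callback (standing in for your Stage II web interface) are all assumptions made for illustration.

```python
import json
import os


class SearchCache:
    """Disk-backed cache of web search results, keyed by query string."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        # Load previously cached results if the cache file was shipped.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.entries = json.load(handle)

    def lookup(self, query, fetch):
        """Return the cached result, calling fetch(query) only on a miss.

        fetch is a hypothetical stand-in for your own web module; its
        result must be JSON-serializable in this sketch.
        """
        if query not in self.entries:
            self.entries[query] = fetch(query)
            self.save()
        return self.entries[query]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(self.entries, handle)
```

If the cache file is stored at a path relative to your main directory, the program picks it up automatically after unpacking and only goes to the web for queries it has never seen.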

Make sure your team name, individual names, date, etc. are included in your source code and on all of your submissions. You are encouraged to submit your modules to the CPAN archive, and I will post your systems on the class web page after the semester, so please provide appropriate info about authors, copyrights, distribution, etc.

This is a team assignment. You are strongly advised to divide up the work of the project into tasks that can be carried out in parallel by various team members. All team members should be acknowledged in the comments, etc., and all teammates will receive the same grade. Do not work with other teams. Each team should operate independently of all the other teams. Make your own decisions as a team and do not be influenced by the decisions of other teams if you happen to hear of them accidentally. You are free to work with your teammates as closely as is necessary.

by: Ted Pedersen - tpederse@umn.edu