Analysis
Final Project: Sentiment Classification
CS8761
Sporting Cristal: Archana Bellamkonda, Rashmi Kankaria, Ashutosh Nagle, Paul Gordon
12/17/2002

Introduction

This report describes experiments conducted with a program written to determine the sentiment of text files, specifically reviews. The program classifies each review as positive, negative, or neutral (undecided). The data files containing the complete results of our experiments can be found at www.d.umn.edu/~gord0156/results.

Experiments

Four sets of experiments were conducted using our sentiment classifier. Each experiment was run on a set of approximately 100 reviews, equally divided between those with positive and those with negative sentiment. The data came from four different sources. The first set, movie reviews, was the one used by Pang et al. [1] in their work on sentiment classification. The second set was hand collected by our team from online digital camera reviews. The third set was taken from the hand-collected reviews of another team and consisted of music reviews. The final set was our "sanity data". These files were not reviews; they were meant to convey no sentiment, or at least no overall positive or negative sentiment. For our sanity data, we took sections of approximately 200 words from the North American Free Trade Agreement (available at www.gutenberg.org).

Below are the results of those experiments:

                       Movie Data   Camera Data   Music Data   NAFTA Data
    Precision          0.7303       0.9125        0.5842       0.7463
    Recall             0.6500       0.6952        0.5200       0.5000
    Number unassigned  11           25            12           33

Several things can be noted about the above information. One is that the precision and recall both cover a wide range of values. The second is that the number of unassigned reviews is a significant portion of the total number of reviews, especially for the Camera and NAFTA data. Because of the large number of unassigned reviews, we recalculated the precision after changing each unassigned review to the class it would have been assigned if there were only positive and negative classifications. The results are:

                       Movie Data   Camera Data   Music Data   NAFTA Data
    Precision          0.73         0.86          0.58         0.60

This gives results that are not so widely spread, although the music data is still low, and the NAFTA data is still higher than would be expected (50%).

One problem with our algorithm is that 0 does not seem to be the appropriate cutoff between positive and negative. By inspection, the ranked results are predominantly positive at the top and predominantly negative at the bottom, so the algorithm does appear to be separating them well. However, for the movie and music data the transition from one to the other occurs around 2, not 0. An interesting experiment would be to vary the cutoff and see how this affects the results. Unfortunately, a cutoff of 2 would significantly reduce the precision of the camera data, since its highest value is 1.7. That suggests another problem with our algorithm: the transition value is not consistent. Perhaps there is a way to normalize our data. We are not, at present, certain how to do this, but it could be an avenue of future investigation.
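To make the cutoff experiment concrete, the following is a minimal Python sketch, not our actual program: the scores, gold labels, neutral band, and function names are invented for illustration, and it assumes precision is computed as correct / assigned and recall as correct / all reviews. The force_assign flag mirrors the recalculation above, in which each unassigned review is forced to positive or negative by the sign of its score relative to the cutoff.

    # Minimal sketch (not our classifier): recompute precision and recall
    # for (score, gold label) pairs at different cutoffs. All data invented.

    def classify(score, cutoff, neutral_band=0.0):
        """Label a review by its score; scores within +/- neutral_band of
        the cutoff are left unassigned (how our program decides 'neutral'
        may differ)."""
        if score > cutoff + neutral_band:
            return "pos"
        if score < cutoff - neutral_band:
            return "neg"
        return "neutral"

    def precision_recall(reviews, cutoff, neutral_band=0.0, force_assign=False):
        """reviews is a list of (score, gold) pairs, gold in {'pos', 'neg'}."""
        correct = assigned = 0
        for score, gold in reviews:
            label = classify(score, cutoff, neutral_band)
            if label == "neutral":
                if not force_assign:
                    continue  # leave the review unassigned
                # Recalculation described above: force a pos/neg choice.
                label = "pos" if score >= cutoff else "neg"
            assigned += 1
            correct += (label == gold)
        precision = correct / assigned if assigned else 0.0
        recall = correct / len(reviews)  # correct over all reviews
        return precision, recall

    if __name__ == "__main__":
        # Hypothetical scores and gold labels, not real review data.
        sample = [(3.1, "pos"), (2.4, "pos"), (1.8, "neg"), (0.2, "neg"),
                  (-0.5, "neg"), (4.0, "pos"), (1.1, "neg"), (2.9, "pos")]
        for cutoff in (0.0, 1.0, 2.0):
            p, r = precision_recall(sample, cutoff, neutral_band=0.5)
            print("cutoff %.1f: precision %.2f, recall %.2f" % (cutoff, p, r))

Running the sketch over real score files would show how precision and recall trade off as the cutoff moves, which is the experiment proposed above.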
Sanity Data

Although our sanity data performed better than two of the other groups of data (which is not what we expected), there are a few things to note:

1) The sanity data had the highest number of neutral classifications.

2) The sanity data has the smallest range of values:

       Movie   8.09
       Camera  4.53
       Music   5.01
       NAFTA   2.64

3) When the neutral classifications were removed, the sanity data outperformed only one of the groups. However, 60% is larger than would be expected for text with no overall sentiment. It may be that trade agreements have more "character" than we thought.

Execution Time

Our program is usually able to classify 100 reviews in 60-90 minutes. If the web sentence cache files are not present, it also incurs a one-time collection cost, which can be significant depending on how many words the program is searching for and how many pages it processes for each word. For 40 words and 50 pages per word, it can take several hours. The web time aside, most of the time required by our program goes to the lookup of definitions and POS tags. If we could refine our method of defining keywords so that fewer were needed to attain the same results, we could process the reviews faster.

Conclusions

Leaving aside the music data for the moment, we think it is reasonable that the camera data would do better than the movie data. Non-professional reviewers likely tend to get to the point more quickly, and more bluntly, than professional reviewers, and they are less likely to use techniques such as thwarted expectations. It also seems to have been the case for Alianza Lima that their hand-collected data outperformed the movie data by 10% or more. However, this does not explain the poor results of the music data. It would be interesting to see how other groups' algorithms performed with this data. Perhaps music reviews have their own vocabulary, or perhaps music reviewers are more likely to mention positive aspects in bad reviews. But these are just guesses.

Note: Yesterday, when presenting the results of the camera experiment, we mistakenly reported a precision of 100%. This is incorrect. We apologize for the mistake. We should not have used information collected so recently in our talk.

[1] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).