SportBoys Sentiment Classification
Aniruddha Mahajan - Anushri Parsekar - Prashant Rathi - Victoria Harp
analysis.txt
12/17/2002

Introduction
------------
Our task was to build a sentiment classifier using the World Wide Web,
LDOCE and BigMac. We implemented an algorithm (described in
algorithm.txt) which integrates all three resources at our disposal.
This file contains an analysis of the algorithm, the experiments
performed, the results obtained, and the conclusions we attempt to
draw from them. Throughout, we tried to keep any kind of human
intervention or manual classification of data to a minimum, since an
intuition-based, hand-tuned algorithm defeats the whole purpose of
classifying sentiments automatically.

Implementation Details
----------------------
Initially our understanding was that if the good-count is greater than
the bad-count (goodcnt/badcnt > 1), then the review expresses a
positive sentiment. Hence our program had a decider ratio of 1.
However, initial experiments showed that most negative reviews were
being classified incorrectly. More experimentation finally led us to
the ratios given below:

    Good-Count / Bad-Count  > 1.75  ::  Positive
                            < 1.50  ::  Negative
                            else    ::  Undecided

Threshold on web-hits
---------------------
As explained in the algorithm, the number of hits returned by the
query "word NEAR excellent" is used as a weighting factor to scale the
good-count. This factor was observed to be too high in some cases and
led to skewed results. As we did not want a single word to change our
sentiment decision drastically, we implemented a threshold which keeps
the number of hits returned by any query in check. We have set this
threshold at one million hits.

Time taken
----------
Some parts of our algorithm require a significant amount of time to
execute. These parts are:

* preparing the initial data structures, e.g. hashes for LDOCE & BigMac
* getting the number of hits from the web (reduced by using a cache)
* matching good / bad words, and retrieval of word-info

No - Not factor
---------------
Words preceded by 'no' or 'not' express the opposite of the sentiment
they normally convey. We judged that handling this would contribute
measurably to correct sentiment classification, so we took care of it:
we implemented a data structure, no_hash, which stores all such words,
and the sentiment normally expressed by these words is reversed. A
sketch of the overall decision pipeline, covering the decider ratios,
the hit threshold and this reversal, follows.

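To make the pieces above concrete, here is a minimal sketch of the
decision pipeline in Python. All names here (classify_review,
lookup_sentiment, weight_for) are illustrative stand-ins rather than
the identifiers from our actual code; the real lookup and weighting
are done through LDOCE/BigMac and web queries as described in
algorithm.txt.

    RATIO_POSITIVE = 1.75      # good/bad ratio above this => Positive
    RATIO_NEGATIVE = 1.50      # good/bad ratio below this => Negative
    HIT_CAP = 1000000          # threshold on web-hits (one million)
    NEGATORS = {"no", "not"}   # reverse the sentiment of the next word

    def classify_review(tokens, lookup_sentiment, weight_for):
        """Classify a tokenized review as Positive/Negative/Undecided.

        lookup_sentiment(word) -> "good", "bad" or None (LDOCE/BigMac)
        weight_for(word, seed) -> web-hit count for "word NEAR seed"
        """
        good_count = 0.0
        bad_count = 0.0
        prev = None
        for word in tokens:
            sentiment = lookup_sentiment(word)
            if sentiment is not None:
                # No - Not factor: reverse sentiment after a negator.
                if prev in NEGATORS:
                    sentiment = "bad" if sentiment == "good" else "good"
                # Weight by web hits, capped so that no single word
                # can skew the decision.
                seed = "excellent" if sentiment == "good" else "poor"
                weight = min(weight_for(word, seed), HIT_CAP)
                if sentiment == "good":
                    good_count += weight
                else:
                    bad_count += weight
            prev = word
        # Assumption: a review with no bad words at all is treated as
        # Positive rather than dividing by zero.
        if bad_count == 0:
            return "Positive" if good_count > 0 else "Undecided"
        ratio = good_count / bad_count
        if ratio > RATIO_POSITIVE:
            return "Positive"
        if ratio < RATIO_NEGATIVE:
            return "Negative"
        return "Undecided"

With lookup_sentiment backed by the LDOCE/BigMac hashes and weight_for
backed by the cached web queries, this reproduces the decision rule in
the ratios given above.
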
Experiments
-----------
We ran our program on all the data available to us, namely the test
data and the data collected by all the teams. The results obtained
were as below:

---------------------------------------------------
|                     Scores                      |
|------------------|-----------|--------|---------|
| Data             | Precision | Recall | F-score |
|------------------|-----------|--------|---------|
| Movie-Data       |   0.65    |  0.60  |  0.62   |
| Playstation-Data |   0.54    |  0.52  |  0.53   |
| Sanity-Data      |   0.46    |  0.46  |  0.46   |
| Car1-Data        |   0.64    |  0.58  |  0.61   |
| Music-Data       |   0.62    |  0.61  |  0.62   |
---------------------------------------------------

Observations on Data & their results
------------------------------------
Movie-Data
These were movie reviews. The reviews themselves were fairly well
written and to the point. As the table shows, this set gave us our
best results: Movie-Data used commonly occurring positive / negative
words and had little complexity as such.

Playstation-Data, Car1-Data, Music-Data, Camera-Data, PC-Data
These data were collected by the teams and were reviews of various
subjects or objects. The algorithm gave mixed results for them. The
data varied in topic as well as in quality of writing, and the varied
results show that much depends on the data itself. Camera-Data and
PC-Data contained many technical terms, and the positive and negative
connotations were also expressed in technical jargon. This, we
believe, led to suppressed results, as many of these words were simply
not present in LDOCE (which, as it is, contains just 2000 words) or
BigMac.

Sanity-Data results
-------------------
The sanity data we had first considered was the NAFTA treaty; however,
the word 'good' appeared too often in it, so we switched to The
Adventures of Sherlock Holmes from the Gutenberg library. We expected
the sanity-data results to show a considerable difference between the
precision and recall values; for that to happen, a large number of
reviews would have had to be classified as undecided.

Extended Observations
---------------------
* Precision generally ranged from 0.54 to 0.65.
* About 80% of positive reviews were classified correctly.
* Only about 40% of negative reviews were classified correctly.
* At most 10% of reviews were undecided.

Why do we get more positive reviews than negative?
--------------------------------------------------
(i)  The algorithm detects positive words more easily than negative
     ones. Our thinking (now that we have observed the results) is
     that any review, whether positive or negative, contains a minimum
     fixed amount of positive words; it is the negative words that
     vary with the sentiment of the review.

(ii) Sometimes people use figures of speech to convey negative
     sentiment, e.g. "the feature is as useful as a screen-door on a
     submarine", or "the motion of the camera is so jerky it seems as
     if it was being operated on the back of a pick-up truck running
     low on kangaroo juice".

If we are to improve our results then, based on point (i) above,
modifications to the algorithm are necessary. An obvious solution
would be to derive a minimum threshold on the good-count from the
sanity data and enforce that threshold on the review data; a sketch of
this idea is given at the end of this file.

Why sanity-data?
----------------
Sanity-data is supposed to be neutral data, yet we observe a
considerable good-count in it while the bad-count is very small. The
good-count it yields is therefore a reasonable estimate of the
positive words that any text accumulates regardless of its sentiment.

Conclusion
----------
Our algorithm is based on the premise that a review with positive
sentiment should contain words which can be related to the word
'excellent' by observing the frequency with which they occur together
in the web corpus (and likewise for negative sentiment, except that
words are related to 'poor'). The algorithm tries to minimize human
intervention at all levels. Increasing the number of baseline (seed)
words might give better results, but that increases human
intervention. The amount of manual priming could also be increased to
improve the data obtained from the dictionaries. The implemented
algorithm performs decently, though we are sure its performance can be
improved using the techniques mentioned above.

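For concreteness, here is a sketch of the web-corpus scoring the
conclusion refers to, in the same illustrative Python as the earlier
sketch. fetch_hit_count is a hypothetical stand-in for the actual
search-engine call that issues the NEAR query, and the cache mirrors
the one mentioned under "Time taken".

    _hit_cache = {}

    def hits_near(word, seed, fetch_hit_count):
        """Web hits for the query "word NEAR seed", with caching.

        seed is 'excellent' when weighting the good-count and 'poor'
        when weighting the bad-count; fetch_hit_count(query) stands in
        for the real search-engine call.
        """
        key = (word, seed)
        if key not in _hit_cache:
            _hit_cache[key] = fetch_hit_count(f"{word} NEAR {seed}")
        return _hit_cache[key]
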
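Finally, the improvement proposed under "Why do we get more positive
reviews than negative?" could look roughly as follows. This is a
hypothetical sketch we did not implement: estimate from the sanity
data the baseline good-count that even neutral text accumulates, and
discount it before applying the decider ratio.

    def baseline_good_count(sanity_good_counts):
        """Average good-count per document over the neutral sanity
        data (assumes at least one sanity document)."""
        return sum(sanity_good_counts) / len(sanity_good_counts)

    def adjusted_good_count(good_count, baseline):
        """Discount the fixed amount of positive words every review
        carries before computing the good/bad decider ratio."""
        return max(good_count - baseline, 0.0)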