#                                    \\\|///
#                                  \\  ~ ~  //
#                                   (  @ @  )
#**********************************-oOOo-(_)-oOOo-************************************
#
#  Natural Language Processing
#
#  CS8761 - Dr. Ted Pedersen
#
#  FINAL PROJECT
#
#  An approach to sentiment classification using machine readable dictionaries
#  and the World Wide Web
#
#
#  Melgar FBC - Prashant Jain, Anand Takale, Nan Zhang, Sumalatha Kothadi
#
#
#***************************************-OoooO-***************************************
#                                    .oooO   (   )
#                                    (   )    ) /
#                                     \ (    (_/
#                                      \_)

------------------------------------------------------------------------------------
Objectives:
------------------------------------------------------------------------------------
Invent an algorithm to perform sentiment classification. Implement and evaluate it.
The algorithm must use a combination of corpus-based information from the World Wide
Web and a machine readable dictionary.

------------------------------------------------------------------------------------
Specification:
------------------------------------------------------------------------------------
We had to develop a solution to the sentiment classification problem: automatically
classify reviews as positive or negative simply by referring to their content, and
without referring to the reviewer's "summary judgement".

------------------------------------------------------------------------------------
Interpretation of the Objective
------------------------------------------------------------------------------------
Stage 3 of the project uses the interfaces to the machine readable dictionaries
LDOCE and BigMac and an interface to the World Wide Web. Our task is to classify
reviews based on the information obtained from the Web and from the machine
readable dictionaries. In short, we may use only the information obtained from
these two sources and not from any other source, and no human intuition may be
involved while classifying a review. One more important point to note is that the
review summary and any rating (e.g. stars) are not to be considered while
classifying the review.

------------------------------------------------------------------------------------
Resources Available
------------------------------------------------------------------------------------
1. Machine Readable Dictionaries
   a. LDOCE
   b. BigMac
2. World Wide Web

------------------------------------------------------------------------------------
Information that we get from the resources available
------------------------------------------------------------------------------------
A. Machine Readable Dictionaries
--------------------------------
1. The BigMac and LDOCE modules both return words of a particular part of speech.
2. BigMac and LDOCE both return synonyms of a given word.
3. BigMac returns all words sharing the same thesaurus code as a given word.
4. LDOCE returns all words with the same active codes.

B. World Wide Web
-----------------
1. Number of hits for a word NEAR a set of good/bad words.
2. Sentences containing a particular word.

(A small sketch of how the dictionary information can be turned into good/bad word
lists follows these lists.)
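To make the dictionary side of this concrete, here is a rough Perl sketch of how
synonym information could be expanded from a few seed adjectives into the good word
hash and bad word hash used later in this report. It is purely illustrative and is
not the actual sentiment.pl code: get_synonyms() is a placeholder for whatever the
LDOCE/BigMac interface modules really provide, and the seed adjectives are our own
assumptions.

    #!/usr/bin/perl -w
    use strict;

    # ILLUSTRATIVE SKETCH ONLY - not the actual sentiment.pl code.
    # get_synonyms() stands in for the LDOCE/BigMac interface modules;
    # the seed adjectives are also assumptions.

    my @good_seeds = qw(good excellent wonderful superb);
    my @bad_seeds  = qw(bad poor terrible awful);

    my (%good_words, %bad_words);

    # Expand each seed list through dictionary synonyms into a hash,
    # i.e. the "good word hash" / "bad word hash" described in the
    # Analysis section below.
    foreach my $seed (@good_seeds) {
        $good_words{$_} = 1 foreach ($seed, get_synonyms($seed));
    }
    foreach my $seed (@bad_seeds) {
        $bad_words{$_} = 1 foreach ($seed, get_synonyms($seed));
    }

    # Score a review by counting good and bad word occurrences.
    sub mrd_score {
        my ($review_text) = @_;
        my $score = 0;
        foreach my $word (split /\W+/, lc $review_text) {
            $score++ if exists $good_words{$word};
            $score-- if exists $bad_words{$word};
        }
        return $score;   # > 0 positive, < 0 negative, 0 can't decide
    }

    # Placeholder: a real version would call the LDOCE/BigMac modules here.
    sub get_synonyms { return (); }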
------------------------------------------------------------------------------------
Experiments:
------------------------------------------------------------------------------------
The sentiment.pl program was run on 4 data sets:

(1) music-data  => data which our team collected
(2) car1-data   => data which another team collected
(3) test-data   => movie data
(4) sanity-data => collected from Project Gutenberg

sentiment.pl was run on all four data sets, and precision, recall and f-measure
were calculated for each data set.

------------------------------------------------------------------------------------
Results:
------------------------------------------------------------------------------------
Precision
---------
               |    mrd.pl    |    wr.pl     |    sentiment.pl    |
---------------|--------------|--------------|--------------------|
  music-data   |     0.57     |     0.68     |        0.66        |
  test-data    |     0.56     |     0.67     |        0.65        |
  car1-data    |     0.64     |     0.73     |        0.71        |
  sanity-data  |     0.39     |     0.42     |        0.44        |
---------------------------------------------------------------------

Recall
------
               |    mrd.pl    |    wr.pl     |    sentiment.pl    |
---------------|--------------|--------------|--------------------|
  music-data   |     0.56     |     0.66     |        0.64        |
  test-data    |     0.54     |     0.66     |        0.63        |
  car1-data    |     0.62     |     0.71     |        0.70        |
  sanity-data  |     0.37     |     0.41     |        0.41        |
---------------------------------------------------------------------

Analysis
--------

How fast did our program run?
-----------------------------
Initially the program took a long time to classify the reviews, handling around 5
reviews an hour. After some enhancements it got up to around 12-20 reviews an hour.
We also incorporated a cache, which is loaded into a hash at the start of the
program; this improved performance further, since the program did not have to query
the Web for words that were already present in the cache.

How good were the results?
--------------------------
The results we got from running sentiment.pl were average. There were a few
instances in which the results were actually very good, but in most cases precision
hovered around the 60-70% mark. Since our approach can be decomposed into a machine
readable dictionary approach and a Web approach, we also ran tests for these
approaches individually. We found that the Web approach gave much better results
than the machine readable dictionary approach. In most cases, as the results above
show, the machine readable dictionary approach was pulling down the precision of
the program.

Our Web approach, though similar to Turney's [1], was different because:
1. We used each adjective instead of phrases.
2. We performed 3 tests of association and used all three to classify an adjective
   as good/bad.

We got very few "can't decide" results from our program, and we attribute this to
using more than one approach to classify the reviews. There was hardly ever a time
when a review received a score of exactly 0, and hence there were not many "can't
decide" results. We believe that using an ensemble method kept the number of "can't
decide" results low, which is also why the precision and recall of our methods were
almost the same or very close to each other.
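The following is a minimal sketch of this ensemble idea: the machine readable
dictionary score and the Web score are combined, and a review is left as "can't
decide" only when the combined score is exactly zero. The sub names and the equal
weighting of the two scores are assumptions made for illustration; the actual
combination used in sentiment.pl may differ.

    #!/usr/bin/perl -w
    use strict;

    # ILLUSTRATIVE SKETCH of the ensemble combination - not the actual
    # sentiment.pl code.  mrd_score() and web_score() stand in for the
    # two sub-approaches; equal weighting is an assumption.

    sub classify_review {
        my ($review_text) = @_;

        my $score = mrd_score($review_text) + web_score($review_text);

        return "positive"     if $score > 0;
        return "negative"     if $score < 0;
        return "can't decide";   # only when the two scores cancel exactly
    }

    # Placeholders for the two component scorers described in this report.
    sub mrd_score { my ($text) = @_; return 0; }
    sub web_score { my ($text) = @_; return 0; }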
Why were the results as they were?
----------------------------------
The results for the Web Reader interface were really good. We had thought that
these results would be the ones pulling down the precision of our algorithm;
however, we were surprised to see that the Web Reader performed quite well. We
think this is because we used 3 tests of association and used all three to classify
the adjectives. We noticed, however, that even though we got good results by using
three tests of association, the t-score alone already gave pretty good precision.
How do we interpret this? We feel that although we could have managed with the
t-score alone, using the other two tests did not harm the results of our program;
in most cases they actually improved or reconfirmed the results we got from the
t-score. Using just a single test of association would have meant relying on the
accuracy of that test alone, whereas using three lets us reconfirm and actually
improve the correctness of the program.

The machine readable dictionaries, which we thought would be the backbone of our
algorithm, actually turned out to be what pulled down the accuracy. We searched for
good words and bad words in the review using the good word hash and the bad word
hash that we created using BigMac and LDOCE. This approach would have given better
results had we had an even distribution of good words and bad words in the hashes.
However, the distribution we got was fairly uneven: the good words were far fewer
in number than the bad words, and therefore a lot of negative reviews were
classified as positive. We did not use a hand-tagged list of good and bad
adjectives, which would certainly have improved the performance of our machine
readable dictionary code, but we wanted to avoid all kinds of intuition, so we did
not do that.
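As a concrete illustration of one of the three tests of association mentioned in
this section, the sketch below computes a t-score from Web hit counts and compares
an adjective's association with a good seed word against its association with a bad
seed word. The hits() sub, the seed words "excellent"/"poor", and the assumed total
page count are placeholders; this shows only the general t-score formula, not the
exact test as applied in wr.pl.

    #!/usr/bin/perl -w
    use strict;

    # ILLUSTRATIVE SKETCH of a t-score test of association over Web hit
    # counts - not the actual wr.pl code.  hits() is a placeholder for a
    # search-engine query, and $N (total number of indexed pages) is an
    # assumed constant.

    my $N = 1_000_000_000;   # assumed corpus (Web) size

    # t-score for the association between a word and a seed, computed
    # from the count of pages containing both terms NEAR each other and
    # the counts of pages containing each term alone.
    sub t_score {
        my ($word, $seed) = @_;
        my $both = hits("$word NEAR $seed");
        return 0 if $both == 0;
        my $expected = hits($word) * hits($seed) / $N;
        return ($both - $expected) / sqrt($both);
    }

    # An adjective leans good or bad depending on which seed it is more
    # strongly associated with.
    sub orientation {
        my ($adjective) = @_;
        return t_score($adjective, "excellent") - t_score($adjective, "poor");
    }

    # Placeholder for the Web hit-count query.
    sub hits { my ($query) = @_; return 0; }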
How could we have improved it?
------------------------------
We could have used sentences taken from the Web for each adjective and then used
the context in which the adjective appears to decide whether the adjective is good
or bad. This method was taking a long time to run, however, and proved to be a
performance bottleneck, so it was not used. If the concept of a cache could be
incorporated into it, we believe this method could provide pretty good results.
Also, we could have used an even distribution of good and bad words from the
machine readable dictionaries so as to get better results from that part of our
program.

General Observations
--------------------
1. There were cases in the reviews (especially in the movie data) in which a
   negative review made references to a good performance by the actor/director in
   some previous movie, which could contribute to a positive rather than a negative
   sentiment score for the review.
2. Active codes in LDOCE would have provided a very good way of getting good words
   and bad words, but deciding which active codes are good and which are bad
   involves a reasonable level of human intuition.
3. The results on the sanity data could be anything, depending on which kind of
   data is used.

References
----------
[1] Turney, P.D., Thumbs Up or Thumbs Down? Semantic Orientation Applied to
    Unsupervised Classification of Reviews, Proceedings of the 40th Annual Meeting
    of the Association for Computational Linguistics (ACL), Philadelphia, July
    2002, pp. 417-424.
[2] Sanity data collected from miscellaneous text from Project Gutenberg.
[3] www.d.umn.edu/~tpederse