CS8761 - Natural Language Processing
Analysis of Results of the Sentiment Classifier
================================================

Date: 12/16/2002

Union Minas: Bridget Thomson McInnes, bthomson@d.umn.edu
             Deodatta Y Bhoite, bhoi0001@d.umn.edu
             Yanhua Li, lixx0333@d.umn.edu
             Nitin Agarwal, agar0054@d.umn.edu
             Kailash Aurangabadkar, aura0011@d.umn.edu

Results
----------------------------------------------------------------------------

The sentiment classification was performed on the following data sets:

1. pc-data
2. playstation-data
3. test-data (a subset of the movie data of Pang et al.)
4. sanity-data (assignment descriptions gathered from various courses
   offered at UMD)

We first tested the algorithms individually and then used several
combinations of weights to obtain total results. The results for the first
three data sets are summarized in the following tables.

PRECISION
+-----------------------------------------------+
|                     pc     play   test   avg  |
|-----------------------------------------------|
| adjective           .82    .76    .71    .76  |
| active              .52    .44    .53    .50  |
| distance            .64    .53    .65    .61  |
| web + measures      .57    .68    .56    .60  |
| total (.5 .3 .1 .1) .75    .75    .66    .72  |
| total (.7 .1 .1 .1) .75    .75    .67    .72  |
| total (1 1 1 1)     .74    .76    .71    .74  |
+-----------------------------------------------+

RECALL
+-----------------------------------------------+
|                     pc     play   test   avg  |
|-----------------------------------------------|
| adjective           .61    .72    .63    .65  |
| active              .49    .44    .50    .48  |
| distance            .54    .50    .58    .54  |
| web + measures      .49    .44    .39    .44  |
| total (.5 .3 .1 .1) .75    .75    .67    .72  |
| total (.7 .1 .1 .1) .72    .75    .67    .71  |
| total (1 1 1 1)     .60    .56    .56    .57  |
+-----------------------------------------------+

F-MEASURE
+-----------------------------------------------+
|                     pc     play   test   avg  |
|-----------------------------------------------|
| adjective           .70    .74    .67    .70  |
| active              .50    .44    .52    .49  |
| distance            .59    .52    .61    .57  |
| web + measures      .53    .53    .46    .51  |
| total (.5 .3 .1 .1) .75    .75    .68    .73  |
| total (.7 .1 .1 .1) .74    .74    .67    .72  |
| total (1 1 1 1)     .66    .64    .62    .64  |
+-----------------------------------------------+

SANITY DATA
+-----------------------------------------------------+
|                     precision   recall   f-measure  |
|-----------------------------------------------------|
| adjective           .48         .24      .32        |
| active              .49         .48      .49        |
| distance            .35         .32      .33        |
| web + measures      .43         .39      .41        |
| total (.5 .3 .1 .1) .47         .46      .46        |
| total (.7 .1 .1 .1) .46         .45      .46        |
| total (1 1 1 1)     .46         .40      .43        |
+-----------------------------------------------------+
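Throughout this report each method's raw output is given as counts of
correct, wrong and unknown classifications. The precision, recall and
F-measure figures in the tables above appear to follow from those counts as
precision = correct / (correct + wrong), recall = correct / (correct + wrong
+ unknown), and the usual harmonic mean for F-measure. A minimal sketch of
that computation (our reading of the counts, not code taken from the
classifier):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # precision, recall and f-measure from correct/wrong/unknown counts
  sub prf {
      my ($correct, $wrong, $unknown) = @_;
      my $attempted = $correct + $wrong;              # reviews given a definite tag
      my $total     = $correct + $wrong + $unknown;   # all reviews in the data set
      my $precision = $attempted ? $correct / $attempted : 0;
      my $recall    = $total     ? $correct / $total     : 0;
      my $fmeasure  = ($precision + $recall)
                    ? 2 * $precision * $recall / ($precision + $recall) : 0;
      return ($precision, $recall, $fmeasure);
  }

  # adjective method on the pc-data: 64 correct, 14 wrong, 27 unknown
  printf "precision = %.2f  recall = %.2f  fmeasure = %.2f\n", prf(64, 14, 27);
  # prints: precision = 0.82  recall = 0.61  fmeasure = 0.70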
Analysis
----------------------------------------------------------------------------

The analysis of the sentiment classifier is split into three sections,
LDOCE, BigMac and WebReader, because these sections utilize the Longman
(LDOCE) Machine Readable Dictionary (MRD), the Macquarie (BigMac) Machine
Readable Dictionary (MRD) and the WebReader module respectively. Each
section describes the method(s) used to determine the sentiment
classification of a review. All results of the analysis are compared
against the baseline. The baseline for sentiment classification is 50%,
because randomly assigning a sentiment to each review should result in
approximately 50% accuracy given that there are equal numbers of negative
and positive reviews.

LDOCE:
--------------------------------------------------------------------------

The methods that utilize the LDOCE MRD are the adjective method and the
active codes method.

The adjective method determines the sentiment classification of a review by
evaluating each adjective in the review and assigning that adjective a
sentiment. The total numbers of good and bad sentiments are then tallied to
determine the overall sentiment. An extension to the adjective method also
retains the word immediately before each adjective; if this word is 'not',
the sentiment of the adjective is negated.

The adjective method performed the best out of all the methods. This was
pleasing but surprising, given the simplicity of the method.

pc-data:              playstation-data:     test-data:
64 14 27              72 23 5               62 25 12
precision = .82       precision = .76       precision = .71
recall = .61          recall = .72          recall = .63
fmeasure = .70        fmeasure = .74        fmeasure = .67

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

The majority of the incorrect classifications were due either to a lack of
classified adjectives in the review, or to adjectives that should have been
negated but were not, because the negation handling is too naive to detect
them. We believe the method could be extended to incorporate more negation
schemes.

An example where the lack of classified adjectives led to a wrong
classification is review pc_neg_03, in which the adjective method picked up
only one adjective, 'beautiful':

    "...the screen is beautiful..."

The review is a negative review of the computer itself, but because the
only classified adjective is 'beautiful', and it describes the screen, the
method tagged the review as positive rather than negative or unknown.

The results on the sanity data for the adjective method were good: as
hoped, they did not go above the baseline. The number of unknown
classifications was approximately equal to the combined number of correct
and incorrect classifications, showing that the classifier could not
determine a sentiment for roughly half of the documents. The sanity data
results are below.

sanity-data:
24 26 50
precision = .48
recall = .24
fmeasure = .32

The numbers of negative and positive classifications were checked for the
sanity-data, pc-data, playstation-data and test-data, and the results show
that the classifier does not consistently favor either the negative or the
positive class.

sanity-data:        pc-data:            playstation-data:   test-data:
positive = 18       positive = 39       positive = 66       positive = 50
negative = 32       negative = 39       negative = 29       negative = 37
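A minimal sketch of the adjective tally described above, using a small
hypothetical %sentiment hash in place of the LDOCE-derived adjective
classifications (the real classifier reads these from the MRD; treating a
tie as unknown is an assumption of the sketch):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # hypothetical stand-in for the LDOCE-derived adjective sentiment lookup
  my %sentiment = (
      beautiful => 'positive', good   => 'positive',
      horrible  => 'negative', boring => 'negative',
  );

  # tally adjective sentiments over a review, negating any classified word
  # that is immediately preceded by 'not'; a tie is treated as unknown here
  sub classify_review {
      my @words = map { lc } @_;
      my ($pos, $neg) = (0, 0);
      for my $i (0 .. $#words) {
          my $tag = $sentiment{ $words[$i] } or next;   # skip unclassified words
          $tag = $tag eq 'positive' ? 'negative' : 'positive'
              if $i > 0 and $words[$i - 1] eq 'not';    # negate after 'not'
          $tag eq 'positive' ? $pos++ : $neg++;
      }
      return 'unknown' if $pos == $neg;
      return $pos > $neg ? 'positive' : 'negative';
  }

  print classify_review(qw(the screen is not beautiful and the game is boring)), "\n";
  # prints: negative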
The active code method evaluates a review by inspecting each word in the
review and determining its active code, which in turn yields the
appropriate sentiment for that word. After each word's sentiment has been
determined, the numbers of positive and negative sentiments are tallied to
determine the possible sentiment of the review. The active code method
returned the worst results.

pc-data:              playstation-data:     test-data:
52 49 4               44 55 1               50 45 4
precision = .51       precision = .44       precision = .52
recall = .49          recall = .44          recall = .51
fmeasure = .50        fmeasure = .44        fmeasure = .52

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

In theory this seemed like an exceptionally promising evaluation technique,
and the initial results were favorable, but in the end the results did not
turn out as expected. We believe that this method could be improved by
reviewing how the active codes were classified and making changes where
necessary.

BigMac:
----------------------------------------------------------------------------

The methods that utilize the BigMac MRD are the synonym method and the
distance method.

In the synonym method each class in the BigMac thesaurus is tagged with a
positive, negative or neutral sentiment. A review is evaluated by looking
up each adjective in the review and taking the sentiment of the thesaurus
class in which it appears. The numbers of positive and negative sentiments
are tallied to determine the possible sentiment of the review. This method
did not do as well as expected; we had assumed it would do well because the
adjective method using LDOCE was performing favorably. The classifications
returned by the thesaurus method were almost always positive. We believe
this is due to the over-generalization of keywords in the thesaurus file.
If the sub-paragraph heads in the thesaurus file were classified manually
and used instead, the results would probably improve.

The distance method is a partial implementation of the distance vector
concept described in [NN1994]. A review word is one link away from an
origin word if the origin word occurs in the review word's definition. A
small set of positive and negative words was chosen as origin words. For
each adjective in the review, the method determines whether a positive or
negative origin word occurs in its definition. The numbers of negative and
positive occurrences are tallied to determine the possible sentiment
classification of the review.

The good words used in this test were: excellent, good and awesome. The bad
words were: bad, poor, boring and horrible. The results using these words
are below.

pc-data:              playstation-data:     test-data:
57 32 16              50 44 6               57 31 11
precision = .64       precision = .53       precision = .65
recall = .54          recall = .50          recall = .58
fmeasure = .59        fmeasure = .52        fmeasure = .61

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

The distance method performed above baseline but did not do as well as the
LDOCE adjective method. We were curious whether the results would improve
if we ran the tests using a different set of good and bad words. The good
words were revised to: wonderful, superior and merit. The bad words were
revised to: badly, wrong and amiss. The results on the pc-data did not
improve but instead went down. Based on this test, we determined that the
words we had originally picked were sufficient.

pc-data:
7 5 93
Precision => 0.58
Recall    => 0.06
F-measure => 0.11

*the numbers underneath the data set name correspond to correct, wrong
 and unknown classifications

We also realized that we had no emotion words among the origin words, so we
added 'happy' to the good origin words and 'hate' to the bad origin words.
This allowed one more review, previously classified as unknown, to be
classified correctly. Thus, we conclude that the accuracy of this method
depends on the choice of origin words and can be improved by making a wise
choice after some amount of experimentation.
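A minimal sketch of the one-link distance check described above, with a
hypothetical %definition hash standing in for the BigMac definition text
(the module's actual dictionary interface is not shown, and the example
entries are invented):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # origin words from the first experiment described above
  my @good_origins = qw(excellent good awesome);
  my @bad_origins  = qw(bad poor boring horrible);

  # hypothetical stand-in for the BigMac definition text of each adjective
  my %definition = (
      superb => 'of excellent or outstanding quality',
      dull   => 'boring; not interesting or lively',
  );

  # a review adjective is one link from an origin word if that origin word
  # occurs in the adjective's definition; the tallies decide the sentiment
  sub classify_review {
      my ($pos, $neg) = (0, 0);
      for my $adj (@_) {
          my $def = lc($definition{$adj} || '');
          $pos++ if grep { $def =~ /\b\Q$_\E\b/ } @good_origins;
          $neg++ if grep { $def =~ /\b\Q$_\E\b/ } @bad_origins;
      }
      return 'unknown' if $pos == $neg;      # tie treated as unknown here
      return $pos > $neg ? 'positive' : 'negative';
  }

  print classify_review(qw(superb dull dull)), "\n";   # prints: negative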
We were also curious whether the results would improve if more links were
incorporated into the algorithm, so that it could be determined whether a
review word was two or three links away from an origin word. Such matches
could then be added to the positive or negative tallies at a lower weight,
possibly improving the sentiment classification results. We did not
implement this extension because of time constraints and the fact that we
thought of it rather late.

Testing the sanity data using the distance method gave the following
results:

sanity-data:
32 60 8
Precision => 0.35
Recall    => 0.32
F-measure => 0.33

*the numbers underneath the data set name correspond to correct, wrong
 and unknown classifications

These numbers were encouraging: the precision was below baseline, which was
to be expected. The number of unknown classifications, however, was low.
When the numbers of positive and negative classifications made by the
method were counted, 70 reviews were classified as positive, 22 as negative
and 8 as unknown. This shows that the method has a tendency to classify
reviews positively. To verify this, we checked how many reviews were
classified as positive or negative in the pc-data, playstation-data and
test-data. The results follow.

pc-data:              playstation-data:     test-data:
positive = 71         positive = 92         positive = 67
negative = 18         negative = 2          negative = 21

The distance method tends to classify reviews more positively than
negatively, which could possibly be corrected by using different origin
words.

WebReader:
----------------------------------------------------------------------------

The third stage of the algorithm utilizes the World Wide Web through the
Minas::WebReader Perl module. A review is evaluated by inspecting each
adjective in the review, which will be called the 'review word', and
determining its association with the words 'excellent' and 'horrible',
which will be called the 'set words'. This is implemented by querying the
web for:

   query 1: the set word
   query 2: the review word
   query 3: the set word and the review word

The number of documents returned for each of these queries acts as a count,
allowing the following contingency table to be created:

                 |   set wrd   |  !set wrd   |
   -------------------------------------------
   review wrd    |     n11     |     n12     |  n1p
   -------------------------------------------
   !review wrd   |     n21     |     n22     |  n2p
   -------------------------------------------
                       np1           np2      npp

where n11 = query 3, n1p = query 2 and np1 = query 1. With this contingency
table, different measures of association can be calculated. This is similar
to the PMI-IR measure described by [Tur2002] for sentiment classification;
however, we use the t-score, the Poisson measure and the Dice coefficient
to measure the association of the word with "excellent" or "horrible".

A variation of the ensemble technique described in [Var2002] is used to
combine the results generated by the various tests/measures of association.
Each measure/test of association returns an association class (excellent or
horrible) for the word, and the class with the highest agreement among the
different measures is assigned to the word.
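A minimal sketch of the contingency table and two of the association scores,
using made-up hit counts in place of live Minas::WebReader queries; the Dice
and t-score formulas below are the standard ones and may differ in detail
from those used by the module:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # made-up hit counts standing in for live web queries: documents containing
  # the review word, the set word, both together, and an assumed total number
  # of documents searched
  my ($review_hits, $set_hits, $both_hits, $total_docs)
      = (5_000, 200_000, 2_000, 1_000_000);

  # contingency table cells, as in the figure above
  my $n11 = $both_hits;             # review word and set word together (query 3)
  my $n1p = $review_hits;           # review word anywhere              (query 2)
  my $np1 = $set_hits;              # set word anywhere                 (query 1)
  my $npp = $total_docs;            # all documents
  my $m11 = $n1p * $np1 / $npp;     # expected n11 if the words were independent

  # two of the association scores computed from the table
  my $dice   = 2 * $n11 / ($n1p + $np1);
  my $tscore = ($n11 - $m11) / sqrt($n11);

  printf "dice = %.4f  tscore = %.2f\n", $dice, $tscore;

Each review word would be scored in this way against both 'excellent' and
'horrible', each measure would vote for the set word with the stronger
association, and the class with the highest agreement among the measures
would be kept, as described above.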
The results for the ensemble technique are:

pc-data:              playstation-data:     test-data:
51 38 16              44 21 35              39 31 29
precision = .57       precision = .68       precision = .56
recall = .49          recall = .44          recall = .39
fmeasure = .52        fmeasure = .53        fmeasure = .46

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

The ensembled results were not impressive; on the pc-data the individual
measures performed better than the ensemble technique.

dice coefficient:
  Precision => 0.60952380952381
  Recall    => 0.60952380952381
  F-measure => 0.60952380952381

poisson test:
  Precision => 0.571428571428571
  Recall    => 0.571428571428571
  F-measure => 0.571428571428571

tscore measure:
  Precision => 0.533333333333333
  Recall    => 0.533333333333333
  F-measure => 0.533333333333333

The Dice coefficient gave the highest accuracy, and the Poisson test was a
close second. It is also interesting to note that each individual
test/measure of association assigned a positive or negative classification
to every review; only when the measures were ensembled did unknown
classifications arise. Another interesting fact we discovered about these
measures is that, in this context, they are all negatively biased: they
associate more words with "horrible" than with "excellent". This may be a
result of the words we chose as set words. This conclusion is based on the
following numbers:

            positive   negative   unknown
poisson        31         74         0
dice           43         62         0
tscore         15         90         0

The sanity data results for the ensemble technique are:

sanity-data:
precision = .43
recall = .39
fmeasure = .41

These results show that the classification of the sanity-data by this
method is below baseline. Due to time constraints we could not experiment
with different set words; however, the set words are configurable in
sentiment.config, so the user can easily experiment with different choices.

Ensembled Results:
---------------------------------------------------------------------------

The total results were calculated using two different approaches: in the
first approach the results of the individual methods were weighted, and in
the second they were not. The unweighted results were as follows:

pc-data:              playstation-data:     test-data:
precision = .74       precision = .76       precision = .71
recall = .60          recall = .56          recall = .56
fmeasure = .66        fmeasure = .64        fmeasure = .62

These results were not as high as those of the adjective method by itself,
so we decided to weight the results according to their projected accuracy.
The adjective method was weighted at .5 because it returned the best
results. The active method was weighted at .3 because the initial results
of its tests looked favorable. The distance method and the tests/measures
of association method were weighted at .1 each because the results returned
by both methods were not as high. This approach returned the following
results:

pc-data:              playstation-data:     test-data:
precision = .75       precision = .75       precision = .66
recall = .75          recall = .75          recall = .67
fmeasure = .75        fmeasure = .75        fmeasure = .68

This was still not as favorable as the adjective method by itself.
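A minimal sketch of the weighted vote used for the total results, assuming
each method votes 'positive', 'negative' or 'unknown' and that unknown votes
contribute to neither side (an assumption of the sketch; the weights shown
are the .5/.3/.1/.1 configuration above):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # weights from the .5/.3/.1/.1 configuration discussed above
  my %weight = (adjective => 0.5, active => 0.3, distance => 0.1, measure => 0.1);

  # combine the per-method classifications into one weighted vote; 'unknown'
  # votes are assumed to contribute to neither side
  sub combine {
      my %vote = @_;                        # method name => classification
      my ($pos, $neg) = (0, 0);
      for my $method (keys %vote) {
          $pos += $weight{$method} if $vote{$method} eq 'positive';
          $neg += $weight{$method} if $vote{$method} eq 'negative';
      }
      return 'unknown' if $pos == $neg;
      return $pos > $neg ? 'positive' : 'negative';
  }

  print combine(adjective => 'positive', active => 'negative',
                distance  => 'unknown',  measure => 'negative'), "\n";
  # the adjective vote (.5) outweighs active + measure (.3 + .1): prints positive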
Given the active code method's results, it was decided to lower the weight
of the active code method and increase the weight of the adjective method.
This had essentially no effect on the results, as shown below:

Modified weights:
  adjective weight = .7
  active weight    = .1
  distance weight  = .1
  measure weight   = .1

pc-data:              playstation-data:     test-data:
precision = .75       precision = .75       precision = .67
recall = .72          recall = .75          recall = .67
fmeasure = .74        fmeasure = .75        fmeasure = .67

Surprisingly, the results of this test showed no improvement. Consequently,
we believe that the ensemble approach actually hurts the results of the
methods that perform well and does not significantly improve the methods
that perform only at an average level. This is seen in two different areas
of the algorithm: first, when the different tests and measures of
association are ensembled to determine the measure method's sentiment
classification for a review; and second, when all the different methods are
combined to determine the final sentiment classification of the review.

References
------------------------------------------------------------------------------

[NN1994]  Y. Niwa and Y. Nitta. Co-occurrence Vectors from Corpora vs.
          Distance Vectors from Dictionaries. Proceedings of COLING '94,
          1994.

[Tur2002] P. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied
          to Unsupervised Classification of Reviews. Proceedings of ACL '02,
          2002.

[Var2002] N. Varma. Identifying Word Translations in Parallel Corpora using
          Measures of Association. Master's thesis, University of Minnesota,
          December 2002.

COPYRIGHT AND LICENSE
------------------------------------------------------------------------------

Copyright (C) 2002 Bridget Thomson, Deodatta Y Bhoite, Kailash
Aurangabadkar, Nitin Agarwal, Yanhua Li.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.