CS8761 - Natural Language Processing
Analysis of Results of the Sentiment Classifier
================================================

Date: 12/16/2002

Union Minas: Bridget Thomson McInnes, bthomson@d.umn.edu
             Deodatta Y Bhoite, bhoi0001@d.umn.edu
             Yanhua Li, lixx0333@d.umn.edu
             Nitin Agarwal, agar0054@d.umn.edu
             Kailash Aurangabadkar, aura0011@d.umn.edu

Results
----------------------------------------------------------------------------

The sentiment classification was performed on the following data sets:

1. pc-data
2. playstation-data
3. test-data (a subset of the movie data of Pang et al.)
4. sanity-data (assignment descriptions gathered from various courses
   offered at UMD)

We first tested the algorithms individually and then used several
combinations of weights to obtain total results. The results for the first
three data sets are summarized in the following tables.

PRECISION
+-----------------------------------------------+
|                     pc     play   test   avg  |
|-----------------------------------------------|
| adjective           .82    .76    .71    .76  |
| active              .52    .44    .53    .50  |
| distance            .64    .53    .65    .61  |
| web + measures      .57    .68    .56    .60  |
| total (.5 .3 .1 .1) .75    .75    .66    .72  |
| total (.7 .1 .1 .1) .75    .75    .67    .72  |
| total (1 1 1 1)     .74    .76    .71    .74  |
+-----------------------------------------------+

RECALL
+-----------------------------------------------+
|                     pc     play   test   avg  |
|-----------------------------------------------|
| adjective           .61    .72    .63    .65  |
| active              .49    .44    .50    .48  |
| distance            .54    .50    .58    .54  |
| web + measures      .49    .44    .39    .44  |
| total (.5 .3 .1 .1) .75    .75    .67    .72  |
| total (.7 .1 .1 .1) .72    .75    .67    .71  |
| total (1 1 1 1)     .60    .56    .56    .57  |
+-----------------------------------------------+

F-MEASURE
+-----------------------------------------------+
|                     pc     play   test   avg  |
|-----------------------------------------------|
| adjective           .70    .74    .67    .70  |
| active              .50    .44    .52    .49  |
| distance            .59    .52    .61    .57  |
| web + measures      .53    .53    .46    .51  |
| total (.5 .3 .1 .1) .75    .75    .68    .73  |
| total (.7 .1 .1 .1) .74    .74    .67    .72  |
| total (1 1 1 1)     .66    .64    .62    .64  |
+-----------------------------------------------+

SANITY DATA
+-----------------------------------------------------+
|                     precision   recall   f-measure  |
|-----------------------------------------------------|
| adjective           .48         .24      .32        |
| active              .49         .48      .49        |
| distance            .35         .32      .33        |
| web + measures      .43         .39      .41        |
| total (.5 .3 .1 .1) .47         .46      .46        |
| total (.7 .1 .1 .1) .46         .45      .46        |
| total (1 1 1 1)     .46         .40      .43        |
+-----------------------------------------------------+
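Throughout this report each method's raw output is given as counts of
correct, wrong and unknown classifications. The precision, recall and
F-measure figures in the tables above appear to follow from those counts as
precision = correct / (correct + wrong), recall = correct / (correct + wrong
+ unknown), and the usual harmonic mean for F-measure. A minimal sketch of
that computation (our reading of the counts, not code taken from the
classifier):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # precision, recall and f-measure from correct/wrong/unknown counts
  sub prf {
      my ($correct, $wrong, $unknown) = @_;
      my $attempted = $correct + $wrong;              # reviews given a definite tag
      my $total     = $correct + $wrong + $unknown;   # all reviews in the data set
      my $precision = $attempted ? $correct / $attempted : 0;
      my $recall    = $total     ? $correct / $total     : 0;
      my $fmeasure  = ($precision + $recall)
                    ? 2 * $precision * $recall / ($precision + $recall) : 0;
      return ($precision, $recall, $fmeasure);
  }

  # adjective method on the pc-data: 64 correct, 14 wrong, 27 unknown
  printf "precision = %.2f  recall = %.2f  fmeasure = %.2f\n", prf(64, 14, 27);
  # prints: precision = 0.82  recall = 0.61  fmeasure = 0.70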
Analysis
----------------------------------------------------------------------------

The analysis of the sentiment classifier is split into three sections,
LDOCE, BigMac and WebReader, because these sections utilize the Longman
(LDOCE) Machine Readable Dictionary (MRD), the Macquarie (BigMac) Machine
Readable Dictionary (MRD) and the WebReader module respectively. Each
section describes the method(s) used to determine the sentiment
classification of a review. All results of the analysis are compared
against the baseline. The baseline for sentiment classification is 50%,
because randomly assigning a sentiment to each review should result in
approximately 50% accuracy given that there are equal numbers of negative
and positive reviews.

LDOCE:
--------------------------------------------------------------------------

The methods that utilize the LDOCE MRD are the adjective method and the
active codes method.

The adjective method determines the sentiment classification of a review by
evaluating each adjective in the review and assigning that adjective a
sentiment. The total numbers of good and bad sentiments are then tallied to
determine the overall sentiment. An extension to the adjective method also
retains the word immediately before each adjective; if this word is 'not',
the sentiment of the adjective is negated.

The adjective method performed the best out of all the methods. This was
pleasing but surprising, given the simplicity of the method.

pc-data:              playstation-data:     test-data:
64 14 27              72 23 5               62 25 12
precision = .82       precision = .76       precision = .71
recall = .61          recall = .72          recall = .63
fmeasure = .70        fmeasure = .74        fmeasure = .67

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

The majority of the incorrect classifications were due either to a lack of
classified adjectives in the review, or to adjectives that should have been
negated but were not, because the negation handling is too naive to detect
them. We believe the method could be extended to incorporate more negation
schemes.

An example where the lack of classified adjectives led to a wrong
classification is review pc_neg_03, in which the adjective method picked up
only one adjective, 'beautiful':

    "...the screen is beautiful..."

The review is a negative review of the computer itself, but because the
only classified adjective is 'beautiful', and it describes the screen, the
method tagged the review as positive rather than negative or unknown.

The results on the sanity data for the adjective method were good: as
hoped, they did not go above the baseline. The number of unknown
classifications was approximately equal to the combined number of correct
and incorrect classifications, showing that the classifier could not
determine a sentiment for roughly half of the documents. The sanity data
results are below.

sanity-data:
24 26 50
precision = .48
recall = .24
fmeasure = .32

The numbers of negative and positive classifications were checked for the
sanity-data, pc-data, playstation-data and test-data, and the results show
that the classifier does not consistently favor either the negative or the
positive class.

sanity-data:        pc-data:            playstation-data:   test-data:
positive = 18       positive = 39       positive = 66       positive = 50
negative = 32       negative = 39       negative = 29       negative = 37
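A minimal sketch of the adjective tally described above, using a small
hypothetical %sentiment hash in place of the LDOCE-derived adjective
classifications (the real classifier reads these from the MRD; treating a
tie as unknown is an assumption of the sketch):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # hypothetical stand-in for the LDOCE-derived adjective sentiment lookup
  my %sentiment = (
      beautiful => 'positive', good   => 'positive',
      horrible  => 'negative', boring => 'negative',
  );

  # tally adjective sentiments over a review, negating any classified word
  # that is immediately preceded by 'not'; a tie is treated as unknown here
  sub classify_review {
      my @words = map { lc } @_;
      my ($pos, $neg) = (0, 0);
      for my $i (0 .. $#words) {
          my $tag = $sentiment{ $words[$i] } or next;   # skip unclassified words
          $tag = $tag eq 'positive' ? 'negative' : 'positive'
              if $i > 0 and $words[$i - 1] eq 'not';    # negate after 'not'
          $tag eq 'positive' ? $pos++ : $neg++;
      }
      return 'unknown' if $pos == $neg;
      return $pos > $neg ? 'positive' : 'negative';
  }

  print classify_review(qw(the screen is not beautiful and the game is boring)), "\n";
  # prints: negative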
The active code method evaluates a review by inspecting each word in the
review and determining its active code, which in turn yields the
appropriate sentiment for that word. After each word's sentiment has been
determined, the numbers of positive and negative sentiments are tallied to
determine the possible sentiment of the review. The active code method
returned the worst results.

pc-data:              playstation-data:     test-data:
52 49 4               44 55 1               50 45 4
precision = .51       precision = .44       precision = .52
recall = .49          recall = .44          recall = .51
fmeasure = .50        fmeasure = .44        fmeasure = .52

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

In theory this seemed like an exceptionally promising evaluation technique,
and the initial results were favorable, but in the end the results did not
turn out as expected. We believe that this method could be improved by
reviewing how the active codes were classified and making changes where
necessary.

BigMac:
----------------------------------------------------------------------------

The methods that utilize the BigMac MRD are the synonym method and the
distance method.

In the synonym method each class in the BigMac thesaurus is tagged with a
positive, negative or neutral sentiment. A review is evaluated by looking
up each adjective in the review and taking the sentiment of the thesaurus
class in which it appears. The numbers of positive and negative sentiments
are tallied to determine the possible sentiment of the review. This method
did not do as well as expected; we had assumed it would do well because the
adjective method using LDOCE was performing favorably. The classifications
returned by the thesaurus method were almost always positive. We believe
this is due to the over-generalization of keywords in the thesaurus file.
If the sub-paragraph heads in the thesaurus file were classified manually
and used instead, the results would probably improve.

The distance method is a partial implementation of the distance vector
concept described in [NN1994]. A review word is one link away from an
origin word if the origin word occurs in the review word's definition. A
small set of positive and negative words was chosen as origin words. For
each adjective in the review, the method determines whether a positive or
negative origin word occurs in its definition. The numbers of negative and
positive occurrences are tallied to determine the possible sentiment
classification of the review.

The good words used in this test were: excellent, good and awesome. The bad
words were: bad, poor, boring and horrible. The results using these words
are below.

pc-data:              playstation-data:     test-data:
57 32 16              50 44 6               57 31 11
precision = .64       precision = .53       precision = .65
recall = .54          recall = .50          recall = .58
fmeasure = .59        fmeasure = .52        fmeasure = .61

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

The distance method performed above baseline but did not do as well as the
LDOCE adjective method. We were curious whether the results would improve
if we ran the tests using a different set of good and bad words. The good
words were revised to: wonderful, superior and merit. The bad words were
revised to: badly, wrong and amiss. The results on the pc-data did not
improve but instead went down. Based on this test, we determined that the
words we had originally picked were sufficient.

pc-data:
7 5 93
Precision => 0.58
Recall    => 0.06
F-measure => 0.11

*the numbers underneath the data set name correspond to correct, wrong
 and unknown classifications

We also realized that we had no emotion words among the origin words, so we
added 'happy' to the good origin words and 'hate' to the bad origin words.
This allowed one more review, previously classified as unknown, to be
classified correctly. Thus, we conclude that the accuracy of this method
depends on the choice of origin words and can be improved by making a wise
choice after some amount of experimentation.
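A minimal sketch of the one-link distance check described above, with a
hypothetical %definition hash standing in for the BigMac definition text
(the module's actual dictionary interface is not shown, and the example
entries are invented):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # origin words from the first experiment described above
  my @good_origins = qw(excellent good awesome);
  my @bad_origins  = qw(bad poor boring horrible);

  # hypothetical stand-in for the BigMac definition text of each adjective
  my %definition = (
      superb => 'of excellent or outstanding quality',
      dull   => 'boring; not interesting or lively',
  );

  # a review adjective is one link from an origin word if that origin word
  # occurs in the adjective's definition; the tallies decide the sentiment
  sub classify_review {
      my ($pos, $neg) = (0, 0);
      for my $adj (@_) {
          my $def = lc($definition{$adj} || '');
          $pos++ if grep { $def =~ /\b\Q$_\E\b/ } @good_origins;
          $neg++ if grep { $def =~ /\b\Q$_\E\b/ } @bad_origins;
      }
      return 'unknown' if $pos == $neg;      # tie treated as unknown here
      return $pos > $neg ? 'positive' : 'negative';
  }

  print classify_review(qw(superb dull dull)), "\n";   # prints: negative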
We were also curious whether the results would improve if more links were
incorporated into the algorithm, so that it could be determined whether a
review word was two or three links away from an origin word. Such matches
could then be added to the positive or negative tallies at a lower weight,
possibly improving the sentiment classification results. We did not
implement this extension because of time constraints and the fact that we
thought of it rather late.

Testing the sanity data using the distance method gave the following
results:

sanity-data:
32 60 8
Precision => 0.35
Recall    => 0.32
F-measure => 0.33

*the numbers underneath the data set name correspond to correct, wrong
 and unknown classifications

These numbers were encouraging: the precision was below baseline, which was
to be expected. The number of unknown classifications, however, was low.
When the numbers of positive and negative classifications made by the
method were counted, 70 reviews were classified as positive, 22 as negative
and 8 as unknown. This shows that the method has a tendency to classify
reviews positively. To verify this, we checked how many reviews were
classified as positive or negative in the pc-data, playstation-data and
test-data. The results follow.

pc-data:              playstation-data:     test-data:
positive = 71         positive = 92         positive = 67
negative = 18         negative = 2          negative = 21

The distance method tends to classify reviews more positively than
negatively, which could possibly be corrected by using different origin
words.

WebReader:
----------------------------------------------------------------------------

The third stage of the algorithm utilizes the World Wide Web through the
Minas::WebReader Perl module. A review is evaluated by inspecting each
adjective in the review, which will be called the 'review word', and
determining its association with the words 'excellent' and 'horrible',
which will be called the 'set words'. This is implemented by querying the
web for:

   query 1: the set word
   query 2: the review word
   query 3: the set word and the review word

The number of documents returned for each of these queries acts as a count,
allowing the following contingency table to be created:

                 |   set wrd   |  !set wrd   |
   -------------------------------------------
   review wrd    |     n11     |     n12     |  n1p
   -------------------------------------------
   !review wrd   |     n21     |     n22     |  n2p
   -------------------------------------------
                       np1           np2      npp

where n11 = query 3, n1p = query 2 and np1 = query 1. With this contingency
table, different measures of association can be calculated. This is similar
to the PMI-IR measure described by [Tur2002] for sentiment classification;
however, we use the t-score, the Poisson measure and the Dice coefficient
to measure the association of the word with "excellent" or "horrible".

A variation of the ensemble technique described in [Var2002] is used to
combine the results generated by the various tests/measures of association.
Each measure/test of association returns an association class (excellent or
horrible) for the word, and the class with the highest agreement among the
different measures is assigned to the word.
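A minimal sketch of the contingency table and two of the association scores,
using made-up hit counts in place of live Minas::WebReader queries; the Dice
and t-score formulas below are the standard ones and may differ in detail
from those used by the module:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # made-up hit counts standing in for live web queries: documents containing
  # the review word, the set word, both together, and an assumed total number
  # of documents searched
  my ($review_hits, $set_hits, $both_hits, $total_docs)
      = (5_000, 200_000, 2_000, 1_000_000);

  # contingency table cells, as in the figure above
  my $n11 = $both_hits;             # review word and set word together (query 3)
  my $n1p = $review_hits;           # review word anywhere              (query 2)
  my $np1 = $set_hits;              # set word anywhere                 (query 1)
  my $npp = $total_docs;            # all documents
  my $m11 = $n1p * $np1 / $npp;     # expected n11 if the words were independent

  # two of the association scores computed from the table
  my $dice   = 2 * $n11 / ($n1p + $np1);
  my $tscore = ($n11 - $m11) / sqrt($n11);

  printf "dice = %.4f  tscore = %.2f\n", $dice, $tscore;

Each review word would be scored in this way against both 'excellent' and
'horrible', each measure would vote for the set word with the stronger
association, and the class with the highest agreement among the measures
would be kept, as described above.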
The results for the ensemble technique are:

pc-data:              playstation-data:     test-data:
51 38 16              44 21 35              39 31 29
precision = .57       precision = .68       precision = .56
recall = .49          recall = .44          recall = .39
fmeasure = .52        fmeasure = .53        fmeasure = .46

*the numbers underneath the data set names correspond to correct, wrong
 and unknown classifications

The ensembled results were not impressive; on the pc-data the individual
measures performed better than the ensemble technique.

dice coefficient:
  Precision => 0.60952380952381
  Recall    => 0.60952380952381
  F-measure => 0.60952380952381

poisson test:
  Precision => 0.571428571428571
  Recall    => 0.571428571428571
  F-measure => 0.571428571428571

tscore measure:
  Precision => 0.533333333333333
  Recall    => 0.533333333333333
  F-measure => 0.533333333333333

The Dice coefficient gave the highest accuracy, and the Poisson test was a
close second. It is also interesting to note that each individual
test/measure of association assigned a positive or negative classification
to every review; only when the measures were ensembled did unknown
classifications arise. Another interesting fact we discovered about these
measures is that, in this context, they are all negatively biased: they
associate more words with "horrible" than with "excellent". This may be a
result of the words we chose as set words. This conclusion is based on the
following numbers:

            positive   negative   unknown
poisson        31         74         0
dice           43         62         0
tscore         15         90         0

The sanity data results for the ensemble technique are:

sanity-data:
precision = .43
recall = .39
fmeasure = .41

These results show that the classification of the sanity-data by this
method is below baseline. Due to time constraints we could not experiment
with different set words; however, the set words are configurable in
sentiment.config, so the user can easily experiment with different choices.

Ensembled Results:
---------------------------------------------------------------------------

The total results were calculated using two different approaches: in the
first approach the results of the individual methods were weighted, and in
the second they were not. The unweighted results were as follows:

pc-data:              playstation-data:     test-data:
precision = .74       precision = .76       precision = .71
recall = .60          recall = .56          recall = .56
fmeasure = .66        fmeasure = .64        fmeasure = .62

These results were not as high as those of the adjective method by itself,
so we decided to weight the results according to their projected accuracy.
The adjective method was weighted at .5 because it returned the best
results. The active method was weighted at .3 because the initial results
of its tests looked favorable. The distance method and the tests/measures
of association method were weighted at .1 each because the results returned
by both methods were not as high. This approach returned the following
results:

pc-data:              playstation-data:     test-data:
precision = .75       precision = .75       precision = .66
recall = .75          recall = .75          recall = .67
fmeasure = .75        fmeasure = .75        fmeasure = .68

This was still not as favorable as the adjective method by itself.
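A minimal sketch of the weighted vote used for the total results, assuming
each method votes 'positive', 'negative' or 'unknown' and that unknown votes
contribute to neither side (an assumption of the sketch; the weights shown
are the .5/.3/.1/.1 configuration above):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # weights from the .5/.3/.1/.1 configuration discussed above
  my %weight = (adjective => 0.5, active => 0.3, distance => 0.1, measure => 0.1);

  # combine the per-method classifications into one weighted vote; 'unknown'
  # votes are assumed to contribute to neither side
  sub combine {
      my %vote = @_;                        # method name => classification
      my ($pos, $neg) = (0, 0);
      for my $method (keys %vote) {
          $pos += $weight{$method} if $vote{$method} eq 'positive';
          $neg += $weight{$method} if $vote{$method} eq 'negative';
      }
      return 'unknown' if $pos == $neg;
      return $pos > $neg ? 'positive' : 'negative';
  }

  print combine(adjective => 'positive', active => 'negative',
                distance  => 'unknown',  measure => 'negative'), "\n";
  # the adjective vote (.5) outweighs active + measure (.3 + .1): prints positive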
Given the active code method's results, it was decided to lower the weight
of the active code method and increase the weight of the adjective method.
This had essentially no effect on the results, as shown below:

Modified weights:
  adjective weight = .7
  active weight    = .1
  distance weight  = .1
  measure weight   = .1

pc-data:              playstation-data:     test-data:
precision = .75       precision = .75       precision = .67
recall = .72          recall = .75          recall = .67
fmeasure = .74        fmeasure = .75        fmeasure = .67

Surprisingly, the results of this test showed no improvement. Consequently,
we believe that the ensemble approach actually hurts the results of the
methods that perform well and does not significantly improve the methods
that perform only at an average level. This is seen in two different areas
of the algorithm: first, when the different tests and measures of
association are ensembled to determine the measure method's sentiment
classification for a review; and second, when all the different methods are
combined to determine the final sentiment classification of the review.

References
------------------------------------------------------------------------------

[NN1994]  Y. Niwa and Y. Nitta. Co-occurrence Vectors from Corpora vs.
          Distance Vectors from Dictionaries. Proceedings of COLING '94,
          1994.

[Tur2002] P. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied
          to Unsupervised Classification of Reviews. Proceedings of ACL '02,
          2002.

[Var2002] N. Varma. Identifying Word Translations in Parallel Corpora using
          Measures of Association. Master's thesis, University of Minnesota,
          December 2002.

COPYRIGHT AND LICENSE
------------------------------------------------------------------------------

Copyright (C) 2002 Bridget Thomson, Deodatta Y Bhoite, Kailash
Aurangabadkar, Nitin Agarwal, Yanhua Li.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.