CS8761 - Natural Language Processing

Date : 12/16/2002

Algorithm for sentiment classification
======================================

UNION MINAS

Bridget Thomson McInnes, bthomson@d.umn.edu
Deodatta Y Bhoite,       bhoi0001@d.umn.edu
Yanhua Li,               lixx0333@d.umn.edu
Nitin Agarwal,           agar0054@d.umn.edu
Kailash Aurangabadkar,   aura0011@d.umn.edu


The problem is to design an algorithm that performs sentiment classification
of reviews. The algorithm to be implemented by Union Minas employs three
main resources: Longman's Machine Readable Dictionary (LDOCE), Macquarie's
Machine Readable Dictionary (Big Mac), and the World Wide Web (WWW). LDOCE
is accessed using the Perl module Minas::LDOCE, Big Mac using the module
Minas::BigMac, and the WWW using the module Minas::WebReader.
  
The algorithm for sentiment classification consists of three stages, one
stage per resource, and each stage may be broken down into parts. Each part
returns one of three results: good, bad or unknown. The results are then
combined using an ensemble approach to obtain the final classification.

Stage 1: (Big Mac)

      The first stage of the algorithm consists of two parts, Part I and
      Part II. Part I uses the thesaurus functionality provided by the
      BigMac interface, whereas Part II uses the words-in-definition
      functionality provided by the same interface.

      Part I  : Each class in the Big Mac thesaurus has been tagged with a
      positive, negative or neutral sentiment. A review is evaluated by
      looking up the thesaurus class of each adjective in the review; the
      class supplies the sentiment for that word. Once every adjective
      has been assigned a sentiment, the positive and negative sentiments
      are tallied to determine the likely sentiment of the review.
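
      The tally in Part I can be sketched as follows. This is a minimal
      illustration in Python rather than the project's Perl; the two lookup
      tables, their entries and the function name are hypothetical stand-ins
      for what Minas::BigMac would supply, and returning unknown on a tie is
      an assumption.

```python
from collections import Counter

# Hypothetical data: thesaurus class -> sentiment tag, and adjective -> class.
# In the real system both lookups would come from the Big Mac thesaurus.
CLASS_SENTIMENT = {"pleasure": "positive", "pain": "negative", "size": "neutral"}
WORD_CLASS = {
    "delightful": "pleasure",
    "charming": "pleasure",
    "dreadful": "pain",
    "long": "size",
}


def classify_by_class_tags(adjectives):
    """Tally the class sentiment of each adjective; return good, bad or unknown."""
    tally = Counter()
    for word in adjectives:
        word_class = WORD_CLASS.get(word.lower())
        if word_class is None:
            continue  # adjective not found in the (toy) thesaurus
        tally[CLASS_SENTIMENT.get(word_class, "neutral")] += 1
    if tally["positive"] > tally["negative"]:
        return "good"
    if tally["negative"] > tally["positive"]:
        return "bad"
    return "unknown"  # assumption: a tie or no evidence yields unknown
```

      With the toy tables above, the adjectives "delightful", "charming"
      and "dreadful" give two positive votes against one negative, so the
      review is labeled good.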

      Part II : Part II partially implements the distance vector concept
      described in [NN1994]. A word is at distance one from an origin word
      if the origin word occurs in that word's definition. We choose a
      small set of positive and negative origin words and find all the
      words at unit distance from them. This yields two lists, one of
      positive words and one of negative words. We scan the review for
      these words and classify it as positive or negative according to
      which list accounts for more of the words found.
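
      The list building and the scan can be sketched as follows, in Python
      for illustration only. The toy dictionary, the seed words and the
      function names are hypothetical; keeping the seed words in their own
      lists and returning unknown on a tie are assumptions.

```python
# Hypothetical toy dictionary: headword -> words occurring in its definition.
DEFINITIONS = {
    "superb": ["of", "excellent", "quality"],
    "fine": ["good", "or", "excellent"],
    "awful": ["very", "bad"],
    "table": ["a", "piece", "of", "furniture"],
}
POSITIVE_SEEDS = {"good", "excellent"}
NEGATIVE_SEEDS = {"bad", "poor"}


def words_at_unit_distance(seeds, definitions):
    """Return the seeds plus every headword whose definition contains a seed."""
    found = set(seeds)  # assumption: the seeds themselves stay in the list
    for headword, definition in definitions.items():
        if seeds.intersection(definition):
            found.add(headword)
    return found


def classify_by_word_lists(review_words, positive, negative):
    """Count review words in each list and pick the list with more matches."""
    pos_hits = sum(1 for word in review_words if word in positive)
    neg_hits = sum(1 for word in review_words if word in negative)
    if pos_hits > neg_hits:
        return "good"
    if neg_hits > pos_hits:
        return "bad"
    return "unknown"


positive_list = words_at_unit_distance(POSITIVE_SEEDS, DEFINITIONS)
negative_list = words_at_unit_distance(NEGATIVE_SEEDS, DEFINITIONS)
```

      Here "superb" and "fine" join the positive list because their
      definitions mention a positive seed, and "awful" joins the negative
      list in the same way.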


Stage 2: (LDOCE)

      The second stage of the algorithm consists of two parts, Part I and
      Part II. Each part utilizes a different aspect of LDOCE.

      Part I   : Part I of Stage 2 utilizes the active codes located in
      LDOCE. Each active code is tagged with a positive, negative or
      neutral sentiment. A review is evaluated by looking up the active
      code of each word in the review; the code supplies the sentiment
      for that word. Once every word has been assigned a sentiment, the
      positive and negative sentiments are tallied to determine the
      likely sentiment of the review.


      Part II  : Part II of Stage 2 uses LDOCE to determine whether or not
      a word in the review is an adjective. If the word is an adjective,
      the file containing the classification of adjectives is consulted to
      determine the word's sentiment. After each word in the review has
      been examined, the positive and negative sentiments are tallied to
      determine the likely sentiment of the review. A basic negation
      algorithm is also implemented: if an adjective is preceded by "not",
      the class of the adjective is negated.
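
      The negation rule can be sketched as follows, in Python for
      illustration only. The adjective table stands in for the LDOCE
      part-of-speech check and the adjective classification file, and its
      entries are hypothetical. The sketch also assumes that "not" must
      come immediately before the adjective, which the design does not
      state.

```python
# Hypothetical stand-in for the LDOCE adjective check plus the adjective
# classification file: only words listed here count as adjectives.
ADJECTIVE_SENTIMENT = {
    "good": "positive",
    "funny": "positive",
    "boring": "negative",
    "bad": "negative",
}
FLIP = {"positive": "negative", "negative": "positive"}


def classify_with_negation(tokens):
    """Tally adjective sentiments, flipping any adjective that follows 'not'."""
    positive = negative = 0
    for i, token in enumerate(tokens):
        sentiment = ADJECTIVE_SENTIMENT.get(token.lower())
        if sentiment is None:
            continue  # not a known adjective
        # Assumption: only a "not" directly before the adjective negates it.
        if i > 0 and tokens[i - 1].lower() == "not":
            sentiment = FLIP.get(sentiment, sentiment)
        if sentiment == "positive":
            positive += 1
        elif sentiment == "negative":
            negative += 1
    if positive > negative:
        return "good"
    if negative > positive:
        return "bad"
    return "unknown"
```

      For the tokens of "the film was not good", the adjective "good" is
      flipped to negative and the review is labeled bad.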


Stage 3: (WebReader)

      The third stage of the algorithm utilizes the World Wide Web through
      the Minas::WebReader Perl module. A review is evaluated by inspecting
      each adjective in the review, called the 'review word', and
      determining its association with the words 'excellent' and 'horrible',
      called the 'set words'. This is done by querying the web for:
           query1. the set word (np1)
           query2. the review word (n1p)
           query3. the set word and the review word (n11)
      The value of npp is taken to be the number of hits for 'excellent'
      plus the number of hits for 'horrible'. It could instead be taken to
      be the number of documents available to the search engine, but since
      that number is difficult to find out we use the first estimate.

      The number of documents returned for each of these queries serves
      as a count, from which the following contingency table is built:

                           set wrd | !set wrd
                        ----------------------
                        |          |         |
             review wrd |    n11   |   n12   | n1p
                        |          |         |
                        ----------------------
                        |          |         |
            !review wrd |    n21   |   n22   | n2p
                        |          |         |
                        ---------------------- ----
                             np1       np2   | npp

                        where n11 = query3,
                              n1p = query2,
                              np1 = query1.
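
      The cells that are not queried directly follow from the marginal
      totals by subtraction. A minimal sketch, in Python for illustration
      only; the function name and the hit counts are made up. Because npp
      is only an estimate, a very frequent review word could have n1p
      larger than npp, which would make some cells negative; the sketch
      does not guard against that.

```python
def contingency_table(n11, n1p, np1, npp):
    """Fill in the 2x2 table from the joint count, two marginals and the total."""
    n12 = n1p - n11  # review word without the set word
    n21 = np1 - n11  # set word without the review word
    n2p = npp - n1p  # documents without the review word
    np2 = npp - np1  # documents without the set word
    n22 = n2p - n21  # documents with neither word
    return {
        "n11": n11, "n12": n12, "n1p": n1p,
        "n21": n21, "n22": n22, "n2p": n2p,
        "np1": np1, "np2": np2, "npp": npp,
    }


# Made-up hit counts, for illustration only.
table = contingency_table(n11=30, n1p=100, np1=400, npp=1000)
```

      With these counts the derived cells are n12 = 70, n21 = 370 and
      n22 = 530, and each row and column sums to its marginal total.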


      With this contingency table, different measures of association
      can be calculated. This approach is similar to the PMI-IR measure
      described by [Tur2002] for sentiment classification. However, we use
      the T-score, Poisson's measure and the Dice coefficient to measure
      the association of the word with "excellent" or "horrible".
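
      Two of these measures can be sketched from the table counts, in
      Python for illustration only. The formulas are the usual textbook
      forms of the t-score and the Dice coefficient, which the design does
      not spell out, so treat them as assumptions; Poisson's measure is
      left out because the design does not say which form is meant. The
      function names and the zero-count conventions are also assumptions.

```python
import math


def expected_n11(n1p, np1, npp):
    """Joint count expected if the two words occurred independently."""
    return n1p * np1 / npp


def t_score(n11, n1p, np1, npp):
    """Assumed form: (observed - expected) / sqrt(observed)."""
    if n11 == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return (n11 - expected_n11(n1p, np1, npp)) / math.sqrt(n11)


def dice(n11, n1p, np1):
    """Assumed form: 2 * n11 / (n1p + np1)."""
    if n1p + np1 == 0:
        return 0.0  # convention chosen here for words with no hits
    return 2 * n11 / (n1p + np1)
```

      A review word would be scored once against "excellent" and once
      against "horrible"; the higher score suggests which set word it is
      more closely associated with.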
      
      We also use a variation of the ensemble technique described in [Var2002]
      to combine the results generated by the various tests/measures of 
      association. Each measure/test of association returns an association 
      class (excellent/horrible) for the word. The class with the highest 
      agreement between the different measures is assigned to the word.
      
      The classification of the review is based on the number of words which 
      belong to a particular class in that review.
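
      The voting and the review-level count can be sketched as follows,
      in Python for illustration only. The function names are
      hypothetical, and treating a tied vote as undecided is an
      assumption.

```python
from collections import Counter


def vote(classes):
    """Return the class most measures agree on, or None when the vote is tied."""
    ranked = Counter(classes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: a tie leaves the word unclassified
    return ranked[0][0]


def classify_review(votes_per_word):
    """votes_per_word holds, for each review word, one class per measure."""
    word_classes = [vote(votes) for votes in votes_per_word]
    excellent = word_classes.count("excellent")
    horrible = word_classes.count("horrible")
    if excellent > horrible:
        return "good"
    if horrible > excellent:
        return "bad"
    return "unknown"
```

      For three review words whose measures vote (excellent, excellent,
      horrible), (horrible, horrible, horrible) and (excellent, horrible,
      excellent), two words end up in the excellent class and one in the
      horrible class, so the review is labeled good.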

   
The results from each of the stages will be weighted, and a final result
will be tallied and sent to standard output.

The weight of each stage's result will be determined by the accuracy of
that stage's individual results. The movie data will be used as a
baseline for determining the accuracy of each stage's results.
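
One way the weighted combination might look is sketched below, in Python
for illustration only. The stage names, the function name and the weights
are hypothetical; the weights are placeholder values, not measured
accuracies. Using a signed sum and returning unknown at zero is an
assumption.

```python
def combine_stages(results, weights):
    """Add each stage's weight for good, subtract it for bad, ignore unknown."""
    score = 0.0
    for stage, result in results.items():
        if result == "good":
            score += weights[stage]
        elif result == "bad":
            score -= weights[stage]
    if score > 0:
        return "good"
    if score < 0:
        return "bad"
    return "unknown"


# Placeholder weights, for illustration only (not measured accuracies).
WEIGHTS = {"bigmac": 0.6, "ldoce": 0.5, "web": 0.7}
```

With these placeholder weights, two stages voting bad (0.6 + 0.5) outweigh
one stage voting good (0.7), so the final result is bad.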

More information on this version of the design can be found in the README 
file.


References
------------------------------------------------------------------------------
[NN1994]  Y. Niwa and Y. Nitta. Co-occurrence vectors from corpora vs. distance
          vectors from dictionaries. Proceedings of COLING'94. 1994.

[Tur2002] P. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied to
          Unsupervised Classification of Reviews. Proceedings of ACL'02. 2002.

[Var2002] N. Varma. Identifying Word Translations in Parallel Corpora using
          Measures of Association. Master's thesis, University of Minnesota,
          December 2002.


COPYRIGHT AND LICENSE
------------------------------------------------------------------------------

Copyright (C) 2002 Bridget Thomson, Deodatta Y Bhoite, Kailash Aurangabadkar, 
Nitin Agarwal, Yanhua Li.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.