CS8761 - Natural Language Processing                     Date : 12/16/2002

Algorithm for sentiment classification
======================================

CS8761 UNION MINAS

Bridget Thomson McInnes, bthomson@d.umn.edu
Deodatta Y Bhoite, bhoi0001@d.umn.edu
Yanhua Li, lixx0333@d.umn.edu
Nitin Agarwal, agar0054@d.umn.edu
Kailash Aurangabadkar, aura0011@d.umn.edu

The problem is to design an algorithm to perform sentiment classification
of reviews. The algorithm to be implemented by Union Minas employs three
main resources: Longman's Machine Readable Dictionary (LDOCE), Macquarie's
Machine Readable Dictionary (Big Mac), and the World Wide Web (WWW). LDOCE
is accessed using the perl module Minas::LDOCE, Big Mac is accessed using
the module Minas::BigMac, and the WWW is accessed using the module
Minas::WebReader.

The algorithm for sentiment classification consists of three stages, one
stage per resource. Each stage may be broken down into sections, and each
section returns one of three results: good, bad or unknown. The results
will then be combined using an ensemble approach to obtain the final
classification.

Stage 1: (Big Mac)

The first stage of the algorithm consists of two parts, Part I and Part II.
Part I utilizes the thesaurus functionality provided by the BigMac
interface, whereas Part II makes use of the words-in-definition
functionality provided by the BigMac interface.

Part I :

Each class in the Big Mac thesaurus has been tagged with a positive,
negative or neutral sentiment. A review is evaluated by inspecting each
adjective in the review and determining its class in the thesaurus. The
class gives the sentiment for that word. After each word's sentiment in the
review has been determined, the positive and negative sentiments are
tallied to determine the possible sentiment of the review.

Part II :

Part II partially implements the distance vector concept described in
[NN1994].
The distance of a word from an origin word is one if the origin word occurs
in the definition of that word. We choose a small set of positive and
negative words and find all the words at unit distance from them. Thus we
build two lists, one for positive words and the other for negative words.
We scan the review for these words and classify it as positive or negative
according to which list contributes the larger number of words.

Stage 2: (LDOCE)

The second stage of the algorithm consists of two parts, Part I and
Part II. Each part utilizes a different aspect of LDOCE.

Part I :

Part I of Stage 2 utilizes the active codes located in LDOCE. Each active
code is tagged with a positive, negative or neutral sentiment. A review is
evaluated by inspecting each word in the review and determining its active
code, which in turn gives the sentiment for that word. After each word's
sentiment in the review has been determined, the positive and negative
sentiments are tallied to determine the possible sentiment of the review.

Part II :

Part II of Stage 2 uses LDOCE to determine whether or not a word in the
review is an adjective. If the word is an adjective, the file containing
the classification of adjectives is consulted to determine the word's
sentiment. After each word in the review has been examined, the positive
and negative sentiments are tallied to determine the possible sentiment of
the review. A basic negation algorithm is also implemented to support
negation of adjectives: if an adjective is preceded by "not", its class is
negated.

Stage 3: (WebReader)

The third stage of the algorithm utilizes the World Wide Web through the
Minas::WebReader perl module. A review is evaluated by inspecting each
adjective in the review, which will be called the 'review word', and
determining its association with the words 'excellent' and 'horrible',
which will be called 'set words'. This will be implemented by querying the
web for:

    query1. the set word (np1)
    query2. the review word (n1p)
    query3. the set word and the review word (n11)

The value of npp is taken to be the number of hits for 'excellent' plus the
number of hits for 'horrible'. It could instead be taken to be the number
of documents available to the search engine, but since that number is
difficult to find out, we use the former assumption.

The number of documents returned for each of these queries will act as
counts, allowing the following contingency table to be created:

                     set word  | !set word
                  ______________________
                  |          |         |
    review word   |   n11    |   n12   |  n1p
                  |          |         |
                  ----------------------
                  |          |         |
   !review word   |   n21    |   n22   |  n2p
                  |          |         |
                  ----------------------  ----
                      np1        np2   |  npp

where n11 = query3, n1p = query2, np1 = query1. The remaining cells follow
by subtraction (for example, n12 = n1p - n11).

With this contingency table different measures of association can be
calculated. This is similar to the PMI-IR measure described by [Tur2002]
for sentiment classification. However, we use T-score, Poisson's measure
and the Dice coefficient to measure the association of the word with
"excellent" or "horrible". We also use a variation of the ensemble
technique described in [Var2002] to combine the results generated by the
various tests/measures of association. Each measure/test of association
returns an association class (excellent/horrible) for the word. The class
with the highest agreement between the different measures is assigned to
the word. The classification of the review is based on the number of words
which belong to a particular class in that review.

The results from each of the stages will be weighted, and a final result
will be tallied and sent to standard output. The weighting of each stage's
results will be determined by the accuracy of its individual results. The
movie data will be used as a baseline for determining the accuracy of each
stage's results.

More information on this version of the design can be found in the README
file.

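As a rough illustration of Stage 3, the following Python sketch (it is not
the Minas::WebReader perl code) shows how the hit counts could be turned
into association scores and how the per-measure votes could be combined.
The function names, the handling of zero counts, the treatment of ties as
unknown, and the mapping of excellent/horrible to good/bad are all
assumptions made for this sketch. Only the T-score and the Dice coefficient
are shown, using their standard definitions; Poisson's measure is left out.

```python
import math
from collections import Counter


def t_score(n11, n1p, np1, npp):
    """Standard t-score: (observed - expected) / sqrt(observed)."""
    if n11 == 0:
        # Assumption: no co-occurrence is the weakest possible association.
        return -math.inf
    m11 = n1p * np1 / npp  # expected joint count under independence
    return (n11 - m11) / math.sqrt(n11)


def dice(n11, n1p, np1, npp):
    """Standard Dice coefficient: 2 * n11 / (n1p + np1); npp is unused."""
    denom = n1p + np1
    return 2.0 * n11 / denom if denom else 0.0


MEASURES = (t_score, dice)  # hypothetical registry of measures


def classify_word(hits_word, hits_excellent, hits_horrible,
                  both_excellent, both_horrible):
    """Vote across measures on which set word the review word favors.

    hits_word      -> n1p (query2)
    hits_<set>     -> np1 (query1, once per set word)
    both_<set>     -> n11 (query3, once per set word)
    """
    # npp as assumed in the text: hits(excellent) + hits(horrible).
    npp = hits_excellent + hits_horrible
    if npp == 0:
        return "unknown"
    votes = Counter()
    for measure in MEASURES:
        exc = measure(both_excellent, hits_word, hits_excellent, npp)
        hor = measure(both_horrible, hits_word, hits_horrible, npp)
        if exc > hor:
            votes["excellent"] += 1
        elif hor > exc:
            votes["horrible"] += 1
    if votes["excellent"] > votes["horrible"]:
        return "excellent"
    if votes["horrible"] > votes["excellent"]:
        return "horrible"
    return "unknown"  # assumption: measures disagree or give no evidence


def classify_review(word_classes):
    """Classify a review by counting the classes of its review words."""
    counts = Counter(word_classes)
    if counts["excellent"] > counts["horrible"]:
        return "good"
    if counts["horrible"] > counts["excellent"]:
        return "bad"
    return "unknown"
```

With made-up counts, classify_word(1000, 5000, 4000, 300, 50) is voted
'excellent' by both measures. Note that with npp defined as above, the
expected count m11 can exceed n11, so T-scores may be negative; the sketch
only compares the two scores against each other.
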
References
------------------------------------------------------------------------------

[NN1994]  Y. Niwa and Y. Nitta. Co-occurrence vectors from corpora vs.
          distance vectors from dictionaries. Proceedings of COLING'94.
          1994.

[Tur2002] P. Turney. Thumbs Up or Thumbs Down? Semantic Orientation Applied
          to Unsupervised Classification of Reviews. Proceedings of ACL'02.
          2002.

[Var2002] N. Varma. Identifying Word Translations in Parallel Corpora using
          Measures of Association. Master's thesis, University of Minnesota,
          December 2002.

COPYRIGHT AND LICENSE
------------------------------------------------------------------------------

Copyright (C) 2002 Bridget Thomson, Deodatta Y Bhoite, Kailash Aurangabadkar,
Nitin Agarwal, Yanhua Li.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.