SportBoys Sentiment Classification
Aniruddha Mahajan - Anushri Parsekar - Prashant Rathi - Victoria Harp
analysis.txt
12/17/2002

Introduction
------------
Our task was to build a sentiment classifier using the World Wide Web,
LDOCE and BigMac. We implemented an algorithm (described in
algorithm.txt) which integrates all three resources at our disposal.
This file contains an analysis of the algorithm, the experiments
performed, the results obtained, and the conclusions we attempt to
draw from them. Throughout, we tried to keep any kind of human
intervention or manual classification of data to a minimum, since an
intuition-based, hand-tuned algorithm defeats the whole purpose of
classifying sentiments automatically.

Implementation Details
----------------------
Initially our understanding was that if the good-count is greater than
the bad-count (goodcnt/badcnt > 1), then the review expresses a
positive sentiment. Hence our program had a decider ratio of 1.
However, initial experiments showed that most negative reviews were
being classified incorrectly. More experimentation finally led us to
the ratios given below:

    Good-Count / Bad-Count  > 1.75  ::  Positive
                            < 1.50  ::  Negative
                            else    ::  Undecided

Threshold on web-hits
---------------------
As explained in the algorithm, the number of hits returned by the
query "word NEAR excellent" is used as a weighting factor to scale the
good-count. This factor was observed to be too high in some cases and
led to skewed results. As we did not want a single word to change our
sentiment decision drastically, we implemented a threshold which keeps
the number of hits returned by any query in check. We have set this
threshold at one million hits.

Time taken
----------
Some parts of our algorithm require a significant amount of time to
execute. These parts are:

* preparing the initial data structures, e.g. hashes for LDOCE & BigMac
* getting the number of hits from the web (reduced by using a cache)
* matching good / bad words, and retrieval of word-info

No - Not factor
---------------
Words preceded by 'no' or 'not' express the opposite of the sentiment
they normally convey. We judged that handling this would contribute
measurably to correct sentiment classification, so we took care of it:
we implemented a data structure, no_hash, which stores all such words,
and the sentiment normally expressed by these words is reversed. A
sketch of the overall decision pipeline, covering the decider ratios,
the hit threshold and this reversal, follows.

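To make the pieces above concrete, here is a minimal sketch of the
decision pipeline in Python. All names here (classify_review,
lookup_sentiment, weight_for) are illustrative stand-ins rather than
the identifiers from our actual code; the real lookup and weighting
are done through LDOCE/BigMac and web queries as described in
algorithm.txt.

    RATIO_POSITIVE = 1.75      # good/bad ratio above this => Positive
    RATIO_NEGATIVE = 1.50      # good/bad ratio below this => Negative
    HIT_CAP = 1000000          # threshold on web-hits (one million)
    NEGATORS = {"no", "not"}   # reverse the sentiment of the next word

    def classify_review(tokens, lookup_sentiment, weight_for):
        """Classify a tokenized review as Positive/Negative/Undecided.

        lookup_sentiment(word) -> "good", "bad" or None (LDOCE/BigMac)
        weight_for(word, seed) -> web-hit count for "word NEAR seed"
        """
        good_count = 0.0
        bad_count = 0.0
        prev = None
        for word in tokens:
            sentiment = lookup_sentiment(word)
            if sentiment is not None:
                # No - Not factor: reverse sentiment after a negator.
                if prev in NEGATORS:
                    sentiment = "bad" if sentiment == "good" else "good"
                # Weight by web hits, capped so that no single word
                # can skew the decision.
                seed = "excellent" if sentiment == "good" else "poor"
                weight = min(weight_for(word, seed), HIT_CAP)
                if sentiment == "good":
                    good_count += weight
                else:
                    bad_count += weight
            prev = word
        # Assumption: a review with no bad words at all is treated as
        # Positive rather than dividing by zero.
        if bad_count == 0:
            return "Positive" if good_count > 0 else "Undecided"
        ratio = good_count / bad_count
        if ratio > RATIO_POSITIVE:
            return "Positive"
        if ratio < RATIO_NEGATIVE:
            return "Negative"
        return "Undecided"

With lookup_sentiment backed by the LDOCE/BigMac hashes and weight_for
backed by the cached web queries, this reproduces the decision rule in
the ratios given above.
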
Experiments
-----------
We ran our program on all the data available to us, namely the test
data and the data collected by all the teams. The results obtained
were as below:

---------------------------------------------------
|                     Scores                      |
|------------------|-----------|--------|---------|
| Data             | Precision | Recall | F-score |
|------------------|-----------|--------|---------|
| Movie-Data       |   0.65    |  0.60  |  0.62   |
| Playstation-Data |   0.54    |  0.52  |  0.53   |
| Sanity-Data      |   0.46    |  0.46  |  0.46   |
| Car1-Data        |   0.64    |  0.58  |  0.61   |
| Music-Data       |   0.62    |  0.61  |  0.62   |
---------------------------------------------------

Observations on Data & their results
------------------------------------
Movie-Data
These were movie reviews. The reviews themselves were fairly well
written and to the point. As the table shows, this set gave us our
best results: Movie-Data used commonly occurring positive / negative
words and had little complexity as such.

Playstation-Data, Car1-Data, Music-Data, Camera-Data, PC-Data
These data were collected by the teams and were reviews of various
subjects or objects. The algorithm gave mixed results for them. The
data varied in topic as well as in quality of writing, and the varied
results show that much depends on the data itself. Camera-Data and
PC-Data contained many technical terms, and the positive and negative
connotations were also expressed in technical jargon. This, we
believe, led to suppressed results, as many of these words were simply
not present in LDOCE (which, as it is, contains just 2000 words) or
BigMac.

Sanity-Data results
-------------------
The sanity data we had first considered was the NAFTA treaty; however,
the word 'good' appeared too often in it, so we switched to The
Adventures of Sherlock Holmes from the Gutenberg library. We expected
the sanity-data results to show a considerable difference between the
precision and recall values; for that to happen, a large number of
reviews would have had to be classified as undecided.

Extended Observations
---------------------
* Precision generally ranged from 0.54 to 0.65.
* About 80% of positive reviews were classified correctly.
* Only about 40% of negative reviews were classified correctly.
* At most 10% of reviews were undecided.

Why do we get more positive reviews than negative?
--------------------------------------------------
(i)  The algorithm detects positive words more easily than negative
     ones. Our thinking (now that we have observed the results) is
     that any review, whether positive or negative, contains a minimum
     fixed amount of positive words; it is the negative words that
     vary with the sentiment of the review.

(ii) Sometimes people use figures of speech to convey negative
     sentiment, e.g. "the feature is as useful as a screen-door on a
     submarine", or "the motion of the camera is so jerky it seems as
     if it was being operated on the back of a pick-up truck running
     low on kangaroo juice".

If we are to improve our results then, based on point (i) above,
modifications to the algorithm are necessary. An obvious solution
would be to derive a minimum threshold on the good-count from the
sanity data and enforce that threshold on the review data; a sketch of
this idea is given at the end of this file.

Why sanity-data?
----------------
Sanity-data is supposed to be neutral data, yet we observe a
considerable good-count in it while the bad-count is very small. The
good-count it yields is therefore a reasonable estimate of the
positive words that any text accumulates regardless of its sentiment.

Conclusion
----------
Our algorithm is based on the premise that a review with positive
sentiment should contain words which can be related to the word
'excellent' by observing the frequency with which they occur together
in the web corpus (and likewise for negative sentiment, except that
words are related to 'poor'). The algorithm tries to minimize human
intervention at all levels. Increasing the number of baseline (seed)
words might give better results, but that increases human
intervention. The amount of manual priming could also be increased to
improve the data obtained from the dictionaries. The implemented
algorithm performs decently, though we are sure its performance can be
improved using the techniques mentioned above.

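For concreteness, here is a sketch of the web-corpus scoring the
conclusion refers to, in the same illustrative Python as the earlier
sketch. fetch_hit_count is a hypothetical stand-in for the actual
search-engine call that issues the NEAR query, and the cache mirrors
the one mentioned under "Time taken".

    _hit_cache = {}

    def hits_near(word, seed, fetch_hit_count):
        """Web hits for the query "word NEAR seed", with caching.

        seed is 'excellent' when weighting the good-count and 'poor'
        when weighting the bad-count; fetch_hit_count(query) stands in
        for the real search-engine call.
        """
        key = (word, seed)
        if key not in _hit_cache:
            _hit_cache[key] = fetch_hit_count(f"{word} NEAR {seed}")
        return _hit_cache[key]
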
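Finally, the improvement proposed under "Why do we get more positive
reviews than negative?" could look roughly as follows. This is a
hypothetical sketch we did not implement: estimate from the sanity
data the baseline good-count that even neutral text accumulates, and
discount it before applying the decider ratio.

    def baseline_good_count(sanity_good_counts):
        """Average good-count per document over the neutral sanity
        data (assumes at least one sanity document)."""
        return sum(sanity_good_counts) / len(sanity_good_counts)

    def adjusted_good_count(good_count, baseline):
        """Discount the fixed amount of positive words every review
        carries before computing the good/bad decider ratio."""
        return max(good_count - baseline, 0.0)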