#                                    \\\|///
#                                  \\  ~ ~  //
#                                   (  @ @  )
#**********************************-oOOo-(_)-oOOo-************************************
#
#  Natural Language Processing
#
#  CS8761 - Dr. Ted Pedersen
#
#  FINAL PROJECT
#
#  An approach to sentiment classification using machine readable dictionaries
#  and the World Wide Web
#
#
#  Melgar FBC - Prashant Jain, Anand Takale, Nan Zhang, Sumalatha Kothadi
#
#
#***************************************-OoooO-***************************************
#                                    .oooO   (   )
#                                    (   )    ) /
#                                     \ (    (_/
#                                      \_)

------------------------------------------------------------------------------------
Objectives:
------------------------------------------------------------------------------------
Invent an algorithm to perform sentiment classification. Implement and evaluate it.
The algorithm must use a combination of corpus-based information from the World Wide
Web and a machine readable dictionary.

------------------------------------------------------------------------------------
Specification:
------------------------------------------------------------------------------------
We had to develop a solution to the sentiment classification problem: automatically
classify reviews as positive or negative simply by referring to their content, and
without referring to the reviewer's "summary judgement".

------------------------------------------------------------------------------------
Interpretation of the Objective
------------------------------------------------------------------------------------
Stage 3 of the project uses the interfaces to the machine readable dictionaries
LDOCE and BigMac and an interface to the World Wide Web. Our task is to classify
reviews based on the information obtained from the Web and from the machine
readable dictionaries. In short, we may use only the information obtained from
these two sources and not from any other source, and no human intuition may be
involved while classifying a review. One more important point to note is that the
review summary and any rating (e.g. stars) are not to be considered while
classifying the review.

------------------------------------------------------------------------------------
Resources Available
------------------------------------------------------------------------------------
1. Machine Readable Dictionaries
   a. LDOCE
   b. BigMac
2. World Wide Web

------------------------------------------------------------------------------------
Information that we get from the resources available
------------------------------------------------------------------------------------
A. Machine Readable Dictionaries
--------------------------------
1. The BigMac and LDOCE modules both return words of a particular part of speech.
2. BigMac and LDOCE both return synonyms of a given word.
3. BigMac returns all words sharing the same thesaurus code as a given word.
4. LDOCE returns all words with the same active codes.

B. World Wide Web
-----------------
1. Number of hits for a word NEAR a set of good/bad words.
2. Sentences containing a particular word.

(A small sketch of how the dictionary information can be turned into good/bad word
lists follows these lists.)
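To make the dictionary side of this concrete, here is a rough Perl sketch of how
synonym information could be expanded from a few seed adjectives into the good word
hash and bad word hash used later in this report. It is purely illustrative and is
not the actual sentiment.pl code: get_synonyms() is a placeholder for whatever the
LDOCE/BigMac interface modules really provide, and the seed adjectives are our own
assumptions.

    #!/usr/bin/perl -w
    use strict;

    # ILLUSTRATIVE SKETCH ONLY - not the actual sentiment.pl code.
    # get_synonyms() stands in for the LDOCE/BigMac interface modules;
    # the seed adjectives are also assumptions.

    my @good_seeds = qw(good excellent wonderful superb);
    my @bad_seeds  = qw(bad poor terrible awful);

    my (%good_words, %bad_words);

    # Expand each seed list through dictionary synonyms into a hash,
    # i.e. the "good word hash" / "bad word hash" described in the
    # Analysis section below.
    foreach my $seed (@good_seeds) {
        $good_words{$_} = 1 foreach ($seed, get_synonyms($seed));
    }
    foreach my $seed (@bad_seeds) {
        $bad_words{$_} = 1 foreach ($seed, get_synonyms($seed));
    }

    # Score a review by counting good and bad word occurrences.
    sub mrd_score {
        my ($review_text) = @_;
        my $score = 0;
        foreach my $word (split /\W+/, lc $review_text) {
            $score++ if exists $good_words{$word};
            $score-- if exists $bad_words{$word};
        }
        return $score;   # > 0 positive, < 0 negative, 0 can't decide
    }

    # Placeholder: a real version would call the LDOCE/BigMac modules here.
    sub get_synonyms { return (); }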
------------------------------------------------------------------------------------
Experiments:
------------------------------------------------------------------------------------
The sentiment.pl program was run on 4 data sets:

(1) music-data  => data which our team collected
(2) car1-data   => data which another team collected
(3) test-data   => movie data
(4) sanity-data => collected from Project Gutenberg

sentiment.pl was run on all four data sets, and precision, recall and f-measure
were calculated for each data set.

------------------------------------------------------------------------------------
Results:
------------------------------------------------------------------------------------
Precision
---------
               |    mrd.pl    |    wr.pl     |    sentiment.pl    |
---------------|--------------|--------------|--------------------|
  music-data   |     0.57     |     0.68     |        0.66        |
  test-data    |     0.56     |     0.67     |        0.65        |
  car1-data    |     0.64     |     0.73     |        0.71        |
  sanity-data  |     0.39     |     0.42     |        0.44        |
---------------------------------------------------------------------

Recall
------
               |    mrd.pl    |    wr.pl     |    sentiment.pl    |
---------------|--------------|--------------|--------------------|
  music-data   |     0.56     |     0.66     |        0.64        |
  test-data    |     0.54     |     0.66     |        0.63        |
  car1-data    |     0.62     |     0.71     |        0.70        |
  sanity-data  |     0.37     |     0.41     |        0.41        |
---------------------------------------------------------------------

Analysis
--------

How fast did our program run?
-----------------------------
Initially the program took a long time to classify the reviews, handling around 5
reviews an hour. After some enhancements it got up to around 12-20 reviews an hour.
We also incorporated a cache, which is loaded into a hash at the start of the
program; this improved performance further, since the program did not have to query
the Web for words that were already present in the cache.

How good were the results?
--------------------------
The results we got from running sentiment.pl were average. There were a few
instances in which the results were actually very good, but in most cases precision
hovered around the 60-70% mark. Since our approach can be decomposed into a machine
readable dictionary approach and a Web approach, we also ran tests for these
approaches individually. We found that the Web approach gave much better results
than the machine readable dictionary approach. In most cases, as the results above
show, the machine readable dictionary approach was pulling down the precision of
the program.

Our Web approach, though similar to Turney's [1], was different because:
1. We used each adjective instead of phrases.
2. We performed 3 tests of association and used all three to classify an adjective
   as good/bad.

We got very few "can't decide" results from our program, and we attribute this to
using more than one approach to classify the reviews. There was hardly ever a time
when a review received a score of exactly 0, and hence there were not many "can't
decide" results. We believe that using an ensemble method kept the number of "can't
decide" results low, which is also why the precision and recall of our methods were
almost the same or very close to each other.
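The following is a minimal sketch of this ensemble idea: the machine readable
dictionary score and the Web score are combined, and a review is left as "can't
decide" only when the combined score is exactly zero. The sub names and the equal
weighting of the two scores are assumptions made for illustration; the actual
combination used in sentiment.pl may differ.

    #!/usr/bin/perl -w
    use strict;

    # ILLUSTRATIVE SKETCH of the ensemble combination - not the actual
    # sentiment.pl code.  mrd_score() and web_score() stand in for the
    # two sub-approaches; equal weighting is an assumption.

    sub classify_review {
        my ($review_text) = @_;

        my $score = mrd_score($review_text) + web_score($review_text);

        return "positive"     if $score > 0;
        return "negative"     if $score < 0;
        return "can't decide";   # only when the two scores cancel exactly
    }

    # Placeholders for the two component scorers described in this report.
    sub mrd_score { my ($text) = @_; return 0; }
    sub web_score { my ($text) = @_; return 0; }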
Why were the results as they were?
----------------------------------
The results for the Web Reader interface were really good. We had thought that
these results would be the ones pulling down the precision of our algorithm;
however, we were surprised to see that the Web Reader performed quite well. We
think this is because we used 3 tests of association and used all three to classify
the adjectives. We noticed, however, that even though we got good results by using
three tests of association, the t-score alone already gave pretty good precision.
How do we interpret this? We feel that although we could have managed with the
t-score alone, using the other two tests did not harm the results of our program;
in most cases they actually improved or reconfirmed the results we got from the
t-score. Using just a single test of association would have meant relying on the
accuracy of that test alone, whereas using three lets us reconfirm and actually
improve the correctness of the program.

The machine readable dictionaries, which we thought would be the backbone of our
algorithm, actually turned out to be what pulled down the accuracy. We searched for
good words and bad words in the review using the good word hash and the bad word
hash that we created using BigMac and LDOCE. This approach would have given better
results had we had an even distribution of good words and bad words in the hashes.
However, the distribution we got was fairly uneven: the good words were far fewer
in number than the bad words, and therefore a lot of negative reviews were
classified as positive. We did not use a hand-tagged list of good and bad
adjectives, which would certainly have improved the performance of our machine
readable dictionary code, but we wanted to avoid all kinds of intuition, so we did
not do that.
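As a concrete illustration of one of the three tests of association mentioned in
this section, the sketch below computes a t-score from Web hit counts and compares
an adjective's association with a good seed word against its association with a bad
seed word. The hits() sub, the seed words "excellent"/"poor", and the assumed total
page count are placeholders; this shows only the general t-score formula, not the
exact test as applied in wr.pl.

    #!/usr/bin/perl -w
    use strict;

    # ILLUSTRATIVE SKETCH of a t-score test of association over Web hit
    # counts - not the actual wr.pl code.  hits() is a placeholder for a
    # search-engine query, and $N (total number of indexed pages) is an
    # assumed constant.

    my $N = 1_000_000_000;   # assumed corpus (Web) size

    # t-score for the association between a word and a seed, computed
    # from the count of pages containing both terms NEAR each other and
    # the counts of pages containing each term alone.
    sub t_score {
        my ($word, $seed) = @_;
        my $both = hits("$word NEAR $seed");
        return 0 if $both == 0;
        my $expected = hits($word) * hits($seed) / $N;
        return ($both - $expected) / sqrt($both);
    }

    # An adjective leans good or bad depending on which seed it is more
    # strongly associated with.
    sub orientation {
        my ($adjective) = @_;
        return t_score($adjective, "excellent") - t_score($adjective, "poor");
    }

    # Placeholder for the Web hit-count query.
    sub hits { my ($query) = @_; return 0; }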
How could we have improved it?
------------------------------
We could have used sentences taken from the Web for each adjective and then used
the context in which the adjective appears to decide whether the adjective is good
or bad. This method was taking a long time to run, however, and proved to be a
performance bottleneck, so it was not used. If the concept of a cache could be
incorporated into it, we believe this method could provide pretty good results.
Also, we could have used an even distribution of good and bad words from the
machine readable dictionaries so as to get better results from that part of our
program.

General Observations
--------------------
1. There were cases in the reviews (especially in the movie data) in which a
   negative review made references to a good performance by the actor/director in
   some previous movie, which could contribute to a positive rather than a negative
   sentiment score for the review.
2. Active codes in LDOCE would have provided a very good way of getting good words
   and bad words, but deciding which active codes are good and which are bad
   involves a reasonable level of human intuition.
3. The results on the sanity data could be anything, depending on which kind of
   data is used.

References
----------
[1] Turney, P.D., Thumbs Up or Thumbs Down? Semantic Orientation Applied to
    Unsupervised Classification of Reviews, Proceedings of the 40th Annual Meeting
    of the Association for Computational Linguistics (ACL), Philadelphia, July
    2002, pp. 417-424.
[2] Sanity data collected from miscellaneous text from Project Gutenberg.
[3] www.d.umn.edu/~tpederse