Group Project: Sentiment Classification. 
   Updated 15th Dec 2002 
CS8761, Fall 2002 
Group: Sport Boys
   Aniruddha Mahajan
   Anushri Parsekar
   Prashant Rathi
   Victoria Harp

+------------------------------------+
| Sentiment Classification Algorithm |
+------------------------------------+

1.  Start.

2. Call the functions to create hashes for the LDOCE and BigMac
dictionaries. These functions are called only once; the hashes
they create serve all later lookups in LDOCE and BigMac.

3.  Prepare two  lists of words, one  to store the  good words and
another  for the  bad  words.  These  lists  will help  us in  the
classification process.   They are  prepared by starting  with two
basic words  from the file  seed_words.txt.  In our seed  file, we
used "excellent" to seed the  positive list and "poor" to seed the
negative list.

4.  Call the get_by_value function of BigMac. This returns all
the synonyms from the thesaurus for the two seed words. We filter
the synonyms by rejecting all words that are not adjectives. This
is done by calling the get_pos function in BigMac, which takes a
word and a part of speech and returns true if the word exists as
that part of speech. We also use LDOCE to create secondary
good/bad lists by looking for the baseline words in the
definitions of all words.

5. Next, we read the data in the cache file and use it as a
starting point.

6. With the lists of good and bad words in hand, we then process
the files in the positive and negative directories. These are
sub-directories of the directory given as the command line
parameter.

7.  The words in each review are split into two hashes depending
on whether 'no' or 'not' appears before the word, since these
words reverse the intended meaning. So we have two hashes, 'hash'
and 'no_hash'.

8. For each review, we find all the adjectives and adverbs in the
review. We do not process words on the stop list, which is
specified in the file stop_list.txt. Each remaining word is placed
in one of the two hashes.

9. Using the get_word_info function in the BigMac dictionary
interface, we query for these words and get all the information
about them. This information includes the 'thes' (thesaurus) code
entries for each word, which reflect the words related to it.

10. For each review word stored in hash or no_hash, we search its
list of related words for any of our 'good' or 'bad' words.

11. If we find a match for a 'good' word or a 'bad' word, then the
word from the review and the 'good' or 'bad' word that matched are
sent to the AltaVista search engine to find the number of hits we
get when the two words are 'NEAR' each other. We use this number
of hits to scale the respective counter.

12. We double the scaling factor if the word also matches a
good/bad word in the secondary lists. We also maintain a threshold
value for the number of hits. Without the threshold, a word that
returns too many hits could weight its counter too heavily and
skew the good/bad ratio.

13. We compute a factor for each review by dividing the good_count
value by the bad_count value.

14. If this value is greater than or equal to 1.75, we think we
have enough evidence that the review is positive; if the value is
less than or equal to 1.5, we label the review as negative. For
values between 1.5 and 1.75, we are unable to distinguish between
positivity and negativity, and so we assign "undecided". The
cutoffs are well above 1 because the reviews generally contain
plenty of good words. Thus, for a review to be considered
positive, its [weighted] count of good words must be at least 1.75
times its count of bad words. We determined the cutoff values
through trial-and-error experimentation.

15. After going through the full set of positive and negative
reviews, we determine the accuracy of our results by comparing
each assigned label with the true classification of the review.

16.  We  print the  accuracy values  and store  the computational
reasoning output in trace.txt for the entire set of reviews.

17. Finally, we update the cache with the set of new good and bad
words.
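
The decision rule in steps 13 and 14 can be sketched as follows.
This is an illustration in Python, not the project code: the
function name, the constant names, and the handling of a zero
bad count are assumptions. Only the cutoffs 1.5 and 1.75 come
from the steps above.

```python
POSITIVE_CUTOFF = 1.75  # ratio at or above this: positive (step 14)
NEGATIVE_CUTOFF = 1.5   # ratio at or below this: negative (step 14)


def classify_review(good_count, bad_count):
    """Return 'positive', 'negative', or 'undecided' for one review."""
    if bad_count == 0:
        # Assumption: the write-up does not cover this case. Here any
        # good evidence counts as positive, and no evidence at all is
        # left undecided.
        return "positive" if good_count > 0 else "undecided"
    ratio = good_count / bad_count  # the factor from step 13
    if ratio >= POSITIVE_CUTOFF:
        return "positive"
    if ratio <= NEGATIVE_CUTOFF:
        return "negative"
    return "undecided"  # ratio falls strictly between the two cutoffs
```

With weighted counts of 8 good and 5 bad, the ratio is 1.6, which
falls between the cutoffs, so the review would be left undecided.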


Example: If there is a word 'wonderful' in the review, we first
check if 'wonderful' appears in the good list or the bad list. If
it appears in the good list, we send a query to AltaVista for
"wonderful NEAR excellent". If it appears in the bad list, we
send a query for "wonderful NEAR poor". The number of hits
returned by this query is used as a weight for the respective
good-counter or bad-counter, and we increment that counter
accordingly.

If the program does not find 'wonderful' in the good list or the
bad list, we find the related words for 'wonderful' from
BigMac. Then we check if any of these related words appear in the
good list or the bad list. If none of them do, we assume the word
is neutral.
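
The lookup order described so far can be sketched as below. This
is a Python illustration under stated assumptions: resolve_polarity
is a hypothetical name, and the related_words dictionary stands in
for the BigMac lookup. Pairing a direct match with the seed word
follows the example above; pairing a related-word match with the
matched list word follows step 11.

```python
def resolve_polarity(word, good_list, bad_list, related_words):
    """Return (polarity, query_partner), or None if the word is neutral.

    related_words maps a word to its related words (a stand-in for
    the BigMac thesaurus lookup; an assumption for this sketch).
    """
    # Direct match: the query partner is the seed word, as in
    # "wonderful NEAR excellent".
    if word in good_list:
        return ("good", "excellent")
    if word in bad_list:
        return ("bad", "poor")
    # Fallback: look for a good/bad word among the related words.
    for rel in related_words.get(word, []):
        if rel in good_list:
            return ("good", rel)
        if rel in bad_list:
            return ("bad", rel)
    return None  # nothing matched, so the word is treated as neutral
```

The returned partner is the word that would be sent with the review
word in the NEAR query. The query itself is left out here.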

If the word has 'no' or 'not' before it, as in "not wonderful",
then the hits factor is used to increment the opposite counter.
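
The negation handling (step 7 and the paragraph above) can be
sketched as follows. This is a Python illustration, not the project
code: the function names are hypothetical, storing word counts in
the two hashes is an assumption, and the stop list and the
part-of-speech filter from step 8 are left out for brevity.

```python
NEGATORS = {"no", "not"}


def split_by_negation(tokens):
    """Split review tokens into (plain, negated) word-count dicts.

    These play the roles of 'hash' and 'no_hash' from step 7.
    """
    plain, negated = {}, {}
    prev = None
    for tok in tokens:
        word = tok.lower()
        if word not in NEGATORS:
            # A word goes to the negated hash only when the token
            # directly before it is 'no' or 'not'.
            target = negated if prev in NEGATORS else plain
            target[word] = target.get(word, 0) + 1
        prev = word
    return plain, negated


def add_hits(counters, polarity, hits, is_negated):
    """Add hits to the counter for polarity, or the opposite one if negated."""
    if is_negated:
        polarity = "bad" if polarity == "good" else "good"
    counters[polarity] += hits
```

For "not wonderful", the word 'wonderful' lands in the negated
hash, and a call such as add_hits(counters, "good", hits, True)
would raise the bad counter instead of the good one.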