Group Project: Sentiment Classification.
Updated 15th Dec 2002

CS8761, Fall 2002
Group: Sport Boys

Aniruddha Mahajan
Anushri Parsekar
Prashant Rathi
Victoria Harp

+------------------------------------+
| Sentiment Classification Algorithm |
+------------------------------------+

1.  Start.

2.  Call the functions that create hashes for the LDOCE and BigMac
    dictionaries. These functions are called just once; the resulting
    hashes serve all later lookups in LDOCE and BigMac.

3.  Prepare two lists of words, one to store the good words and another
    for the bad words. These lists will help us in the classification
    process. They are prepared by starting with two basic words from the
    file seed_words.txt. In our seed file, we used "excellent" to seed the
    positive list and "poor" to seed the negative list.

4.  Call the get_by_value function of BigMac. This returns all the
    synonyms from the thesaurus for the two seed words. We filter the
    synonyms by rejecting all the words that are not adjectives. This is
    done by calling the get_pos function in BigMac, which takes a word and
    a part of speech and returns true if the word exists as that
    particular part of speech. We also use LDOCE to create secondary
    good/bad lists by looking for the baseline words in the definitions of
    all words.

5.  Next, we read the data in the cache file to use as a starting point.

6.  With the arrays of good and bad words in hand, we then process the
    review files in the positive and negative directories. These are
    sub-directories of the directory provided as the command-line
    parameter.

7.  The words in each review are split into two hashes depending on
    whether 'no' or 'not' appears before the word, since these words
    reverse the intended meaning. So we have two hashes, 'hash' and
    'no_hash'.

8.  For each review, we find all the adjectives and adverbs in the review.
    We do not process the words on the stop list, which is specified in
    the file stop_list.txt. We put all the remaining words in one of the
    two hashes.

9.  Using the get_word_info function in the BigMac dictionary interface,
    we query for these words and get all the information about them. This
    information also contains the 'thes' (thesaurus) code entries for each
    word, which give the words related to it.

10. We search for our 'good' or 'bad' words in the related-words lists of
    the review words stored in either hash or no_hash.

11. If we find a match for a 'good' word or a 'bad' word, then the word
    from the review and the 'good' or 'bad' word that matched are sent to
    the AltaVista search engine to find the number of hits we get when
    both words are 'NEAR' each other. We use this number of hits to scale
    the respective counter.

12. We double the scaling factor if the word also matches a good/bad word
    in the secondary lists. We also maintain a threshold value for the
    number of hits. Without the threshold, a word that returns too many
    hits would weight its counter too heavily and skew the good/bad ratio.

13. We compute a factor for each review by dividing the good_count value
    by the bad_count value.

14. If this value is greater than or equal to 1.75, we think we have
    enough evidence that the particular review is positive; if the value
    is less than or equal to 1.5, we assign the review as negative. For
    values strictly between 1.5 and 1.75, we are unable to distinguish
    between positivity and negativity, and so we assign "undecided". The
    cutoffs are set above 1 (at 1.5 and 1.75) because the reviews
    generally contain plenty of good words. Thus, for a review to be
    considered positive, it must contain at least 1.75 times the
    [weighted] amount of good words as compared to bad words. We
    determined the cutoff values through trial-and-error experimentation.

15. After going through the series of positive and negative reviews, we
    determine the accuracy of our results by comparing them with the true
    values.

16. We print the accuracy values and store the computational reasoning
    output in trace.txt for the entire set of reviews.

17. Finally, we update the cache with the set of new good and bad words.

Example:

If there is a word 'wonderful' in the review, we first check if
'wonderful' appears in the good list or the bad list. If it appears in the
good list, we send a query to AltaVista for "wonderful NEAR excellent". If
it appears in the bad list, we send a query for "wonderful NEAR poor". The
number of hits returned by this query is used as a weight for the
respective good-counter or bad-counter. Then we increment the counters
accordingly.

If the program does not find 'wonderful' in the good list or the bad list,
we find the related words for 'wonderful' from BigMac. Then we check if
any of these related words appear in the good list or the bad list. If no
such words are found, we assume the word is neutral.

If the word has 'no' or 'not' before it, say "not wonderful", then the
hits factor is used to increment the opposite counter.
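
Sketch of the counting and cutoff logic:

The counting described in steps 7 and 11 to 14 can be illustrated with a short Python sketch. This is not the project's code; it only shows one way the pieces could fit together. All function names are hypothetical, the value of HIT_THRESHOLD is an assumption (the write-up gives no number), capping the hits before doubling is an assumption, and so is the handling of a zero bad count. AltaVista hit counts are passed in as a dictionary rather than fetched, plain lists stand in for 'hash' and 'no_hash', and the BigMac related-word lookup is omitted, so only words already on the good or bad list are counted.

```python
POSITIVE_CUTOFF = 1.75   # from step 14
NEGATIVE_CUTOFF = 1.5    # from step 14
HIT_THRESHOLD = 1000     # assumed value; the write-up does not state one


def split_by_negation(tokens):
    """Step 7: separate words preceded by 'no' or 'not' from the rest."""
    plain, negated = [], []
    previous = ""
    for token in tokens:
        word = token.lower()
        if word in ("no", "not"):
            previous = word
            continue  # the negation word itself is not stored
        if previous in ("no", "not"):
            negated.append(word)
        else:
            plain.append(word)
        previous = word
    return plain, negated


def weight(hits, in_secondary):
    """Steps 11-12: cap the hit count, then double it for secondary matches.

    The order (cap first, then double) is an assumption.
    """
    capped = min(hits, HIT_THRESHOLD)
    return capped * 2 if in_secondary else capped


def score_review(tokens, good_words, bad_words, hits, secondary=frozenset()):
    """Accumulate weighted good and bad counts for one review."""
    plain, negated = split_by_negation(tokens)
    good_count = bad_count = 0
    for words, flipped in ((plain, False), (negated, True)):
        for word in words:
            if word in good_words:
                is_good = True
            elif word in bad_words:
                is_good = False
            else:
                continue  # treated as neutral in this sketch
            w = weight(hits.get(word, 0), word in secondary)
            # A negated word increments the opposite counter.
            if is_good != flipped:
                good_count += w
            else:
                bad_count += w
    return good_count, bad_count


def classify(good_count, bad_count):
    """Steps 13-14: compare the good/bad ratio against the two cutoffs."""
    if bad_count == 0:
        # Not covered by the write-up; assumed behaviour.
        return "positive" if good_count > 0 else "undecided"
    ratio = good_count / bad_count
    if ratio >= POSITIVE_CUTOFF:
        return "positive"
    if ratio <= NEGATIVE_CUTOFF:
        return "negative"
    return "undecided"


if __name__ == "__main__":
    review = "the plot was wonderful but not awful and the acting was dull"
    good, bad = score_review(
        review.split(),
        good_words={"wonderful"},
        bad_words={"awful", "dull"},
        hits={"wonderful": 400, "awful": 300, "dull": 200},
        secondary={"wonderful"},
    )
    print(good, bad, classify(good, bad))
```

In this sketch "not awful" adds its weight to the good counter, mirroring the "not wonderful" case in the example above, which adds to the bad counter.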