Experimental Results and Analysis
By Alianza Lima: Amine, Amruta, Sam, Suchitra

=============================================================================
Experiments Done
=============================================================================

Aim
To observe the precision and recall on various review and sanity data sets
under different windowing and scaling conditions.

Data Used -
    test data
    car1 data
    play station data
    sanity data (around 100 instances from Line data)

-----------------------------------------------------------------------------
Experiment Set 1 : Run the program on various review and sanity data sets
without scaling and windowing.

Experimental Conditions :
No Scaling   - All terms in the review are equally weighted.
No Windowing - The algorithm assigns the review a
    positive tag   if the calculated sentiment > 0
    negative tag   if the calculated sentiment < 0
    undecided tag  if the calculated sentiment = 0
Since all reviews were attempted, precision and recall were almost equal.

Results Obtained -
TABLE1
----------------------------------------
DATA            PRECISION    RECALL    |
test            55%          55%       |
car1            75%          75%       |
play station    55%          55%       |
sanity          49%          49%       |
----------------------------------------
Hardly any review was classified as undecided, as the window size = 1.

Observations -
The algorithm does well, classifying a good number of reviews correctly.
The low accuracy for the sanity data is expected, as sanity data is supposed
to be neutral and should result in only about half of the reviews being
classified right.

Conclusions -
We expect to see around 75% accuracy for any kind of review data that
supports our assumption that a review's sentiment is really the sum of its
word sentiments. This assumption is clearly not correct for all types of
data.

------------------------------------------------------------------------------
Experiment Set 2 : Run the program on various review and sanity data sets and
observe results under various scaling techniques.

Aim
To draw conclusions about the sentiment structure of the review, i.e. to see
if the review data has strongly sentiment-determining terms in a specific
section of the review.

Experimental Conditions :
Four types of scaling methods -
    No Scale - All terms are equally weighted
    Scale1   - Terms at the end of the review are weighted more
    Scale2   - Terms at the beginning of the review are weighted more
    Scale3   - Weights are assigned to the terms in increasing order of
               their distance from the central term
No Windowing - The algorithm assigns the review a
    positive tag   if the calculated sentiment > 0
    negative tag   if the calculated sentiment < 0
    undecided tag  if the calculated sentiment = 0
Since all reviews were attempted, precision and recall were equal.

Results Obtained -
The table shows the precision values.
TABLE2
--------------------------------------------
DATA            SCALE1    SCALE2    SCALE3 |
test            56%       57%       59%    |
car1            74%       75%       66%    |
play-station    59%       58%       61%    |
sanity          49%       49%       48%    |
--------------------------------------------
[note - precision and recall are approximately the same here as window = 1]

Observations -
Different scaling methods have different impacts on the precision values.

Conclusions -
If a particular scaling method gives higher accuracy with a particular data
set, we can draw some fair conclusions about the location of the key
sentiment-determining terms in that review data. For example, if scale1 does
very well for data set X, we can say that when the end terms were weighted
more, the algorithm could classify most of the reviews correctly, and hence
data set X is expected to have some sentiment-determining terms at the end.
(A sketch of these weighting schemes is given below.)
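The report does not reproduce the program's code, so the following is only a
minimal Python sketch of the aggregate-sentiment classifier and the three
scaling schemes described above. The linear weight formulas, the lexicon
lookup, and the names (term_weights, classify, lexicon) are assumptions for
illustration, not the project's actual implementation.

    # Minimal sketch of positional scaling + sign-based classification.
    # Weight formulas are assumed to be linear; the actual program may differ.

    def term_weights(n, scheme="noscale"):
        """Return one weight per term position for a review of n terms."""
        if scheme == "noscale":              # all terms equally weighted
            return [1.0] * n
        if scheme == "scale1":               # terms near the end weighted more
            return [(i + 1) / n for i in range(n)]
        if scheme == "scale2":               # terms near the beginning weighted more
            return [(n - i) / n for i in range(n)]
        if scheme == "scale3":               # weight grows with distance from the central term
            centre = (n - 1) / 2.0
            return [(abs(i - centre) + 1) / (centre + 1) for i in range(n)]
        raise ValueError("unknown scaling scheme: " + scheme)

    def classify(review_terms, lexicon, scheme="noscale"):
        """Sum the (scaled) word sentiments and tag the review by the sign."""
        weights = term_weights(len(review_terms), scheme)
        sentiment = sum(w * lexicon.get(term, 0.0)
                        for w, term in zip(weights, review_terms))
        if sentiment > 0:
            return "positive"
        if sentiment < 0:
            return "negative"
        return "undecided"

Under this kind of scheme, scale1 makes the classification of a review depend
mostly on its closing terms, which is exactly the effect the conclusion above
relies on when interpreting the per-scale precision values.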
Future Experiments with scaling -
(1) Do scaling according to the topic in a particular section to weigh the
    relevant terms. (This was a suggestion from the audience when the
    project was presented.)
(2) Use some non-linear scaling techniques to see the scaling effects more
    effectively.

-----------------------------------------------------------------------------
Experiment Set 3 : Run the program on various review and sanity data sets and
observe results under various scaling techniques, using a wide window for the
undecided class.

Aim
To see the windowing effect when more reviews are classified into the
undecided class by putting a threshold on the minimum sentiment value a
review must have before its classification is decided.

Experimental Conditions :
Four types of scaling methods -
    No Scale - All terms are equally weighted
    Scale1   - Terms at the end of the review are weighted more
    Scale2   - Terms at the beginning of the review are weighted more
    Scale3   - Weights are assigned to the terms in increasing order of
               their distance from the central term
Windowing -
Calculate the window using the expected value of the negative and positive
regions. This is done by finding where the sentiment values cross the zero
boundary and then taking the mean of all negative and all positive sentiment
values obtained. The program calculates two thresholds:
    mean1 - the maximum sentiment value a review may have and still be
            classified as 'negative'
    mean2 - the minimum positive sentiment a review must have to be
            classified as 'positive'
Values between mean1 and mean2 are taken as 'neutral' sentiments and the
review is classified into the 'undecided' class. (A sketch of this threshold
computation is given at the end of this section.)

Results Obtained -
TABLE3
-----------------------------------------------------------------------
DATA            PRECISION    RECALL    ATTEMPTED    CORRECT
test            62%          37%       66           39
car1            87%          54%       64           51
play station    59%          42%       97           58
sanity          46%          25%       45           25
-----------------------------------------------------------------------

Observations -
(1) Widening the window assigns 'undecided' to more reviews and hence
    reduces the attempted review count.
(2) The program attempts fewer reviews and hence is expected to find fewer
    correct reviews compared to the no-window case.
(3) Hence there is an expected drop in the recall values.
(4) But the drop in the number of attempted reviews is larger than the drop
    in correctly classified reviews, which is why the precision shows a
    sharp improvement.

The following shows the impact of the window on precision.
TABLE4
--------------------------------------------
DATA            W/o Window    With Window  |
car1 data       75%           87%          |
test data       55%           60%          |
--------------------------------------------
(combining TABLE4 and TABLE2)

(5) Table of reviews classified as undecided under the two window
    conditions -
TABLE5
---------------------------------------------------
DATA      Narrow Window (10%)    Wide Window (33%) |
test      7                      33                |
car1      9                      36                |
sanity    16                     56                |
---------------------------------------------------
The table shows that for both window sizes, the algorithm found the most
undecided reviews for the sanity data, and these counts exceed the expected
window sizes.

(The graphical representation of all these tables is in the presentation
handout copy.)
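For reference, here is a minimal sketch of the mean-based window described in
the Experimental Conditions above, assuming the per-review sentiment scores
have already been computed. The function names are illustrative, and the
sketch does not show how the narrow (10%) and wide (33%) window sizes of
TABLE5 were derived.

    # Minimal sketch of the undecided window: mean1 = mean of the negative
    # scores, mean2 = mean of the positive scores; anything in between is
    # tagged 'undecided'.

    def window_thresholds(sentiment_scores):
        """Compute (mean1, mean2) from the scores of all reviews in a data set."""
        negatives = [s for s in sentiment_scores if s < 0]
        positives = [s for s in sentiment_scores if s > 0]
        mean1 = sum(negatives) / len(negatives) if negatives else 0.0
        mean2 = sum(positives) / len(positives) if positives else 0.0
        return mean1, mean2

    def classify_with_window(score, mean1, mean2):
        """Tag a review using the widened undecided window [mean1, mean2]."""
        if score <= mean1:          # at least as negative as the negative mean
            return "negative"
        if score >= mean2:          # at least as positive as the positive mean
            return "positive"
        return "undecided"          # falls inside the window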
=============================================================================
Analysis and Interpretation of various results -
=============================================================================

Some curious questions raised in our minds, and our analysis

------------------------------------------------------------------
Why does the algorithm give different accuracies for different data?
------------------------------------------------------------------
Our Reasoning -
The algorithm is based on the assumption that the sentiment of a review is
the aggregate of the sentiment values of its content words. This assumption
may not hold true for all data: a review may have an aggregate sentiment
value different from its given sentiment label. Since the algorithm gave
high accuracies for the car1 data, our assumption seems to hold true for
that data. Also, the low accuracy for the sanity data shows that the sanity
data really does have balanced sentiment, as we get a high undecided count.

------------------------------------
Which is the best scaling technique?
------------------------------------
We conclude that the best scaling method depends on the review structure,
and it would be interesting to see a particular scaling really do well for a
particular data set. Our experiments were not enough to show the effect of
scaling on precision, but the slight improvement in the results using the
scale3 method for the test data without a window supports this idea. (Refer
to TABLE2.)

---------------------------------------------------------------------
Why do we get higher precision and lower recall as we widen the window?
---------------------------------------------------------------------
The wider the window, the more reviews the program assigns to 'undecided',
and hence the attempted count goes down. But the drop in correctly assigned
reviews is not as sharp, and hence we see a net rise in the precision
values. For all windowing experiments, the program assigned more 'undecided'
tags to the sanity data than to the review data, which confirms its correct
behavior. The window puts a threshold on the minimum sentiment value needed
to call a review positive or negative.

===========================================================================
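As a footnote to the precision/recall discussion above, the following is a
purely illustrative sketch. It assumes precision is computed as correct /
attempted and recall as correct / total, which matches the qualitative
description in the observations, though the program's exact computation may
differ (e.g. averaging per class); the counts shown are hypothetical, not our
measured data.

    # Illustrative only: how shrinking 'attempted' faster than 'correct'
    # raises precision while lowering recall.

    def precision_recall(correct, attempted, total):
        precision = correct / attempted if attempted else 0.0
        recall = correct / total if total else 0.0
        return precision, recall

    # Hypothetical counts for a 100-review data set:
    print(precision_recall(55, 100, 100))  # no window:   (0.55, 0.55)
    print(precision_recall(40, 60, 100))   # wide window: (~0.67, 0.40)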