Experimental Results and Analysis
By
Alianza Lima
Amine, Amruta, Sam, Suchitra
=============================================================================
Experiments Done
=============================================================================
Aim
To observe precision and recall on various review and sanity data
sets under different windowing and scaling conditions.
Data Used -
test data
car1 data
play station data
sanity data (around 100 instances from Line data)
-----------------------------------------------------------------------------
Experiment Set 1 :
Run the program on various review and sanity data sets without scaling
and windowing.
Experimental Conditions :
No scaling - All terms in the review are equally weighted
No Windowing - The algorithm will assign the review
positive tag if the calculated sentiment > 0
negative tag if the calculated sentiment < 0
undecided tag if the calculated sentiment = 0
Since all reviews were attempted, precision and recall were almost equal.
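The no-scaling, no-windowing condition above can be sketched as follows (the lexicon and its per-word sentiment values are toy placeholders, not the project's actual resources):

```python
def classify(review_words, sentiment_lexicon):
    """Sum the (equally weighted) word sentiments and tag the review by sign."""
    total = sum(sentiment_lexicon.get(w, 0.0) for w in review_words)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "undecided"        # sentiment exactly 0

# Toy lexicon for illustration only.
lexicon = {"great": 1.0, "poor": -1.0, "fine": 0.5}
print(classify(["a", "great", "car"], lexicon))   # positive
```

With no window, "undecided" requires the sum to be exactly zero, which is why it is rarely assigned.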
Results Obtained -
TABLE1
----------------------------------------
DATA PRECISION RECALL |
test 55% 55% |
car1 75% 75% |
play station 55% 55% |
sanity 49% 49% |
----------------------------------------
Hardly any review was classified as undecided, since with window size = 1
only a sentiment of exactly 0 receives that tag.
Observations -
The algorithm does well, classifying a good number of reviews correctly.
Low accuracy on the sanity data is expected, as sanity data is supposed
to be neutral and should result in about half the reviews being classified right.
Conclusions -
We expect to see around 75% accuracy for any kind of review data that
supports our assumption that a review's sentiment is really the sum of
its word sentiments. This assumption is clearly not correct for all
types of data.
------------------------------------------------------------------------------
Experiment Set 2 :
Run the program on various review and sanity data sets and observe
results under various scaling techniques.
Aim
Draw conclusions about the sentiment structure of the review, i.e.
to see whether the review data has strongly sentiment-determining terms
in a specific section of the review.
Experimental Conditions :
Four Types of Scaling methods -
No Scale - All Terms are equally weighted
Scale1 - Terms in the end of the review are more weighted
Scale2 - Terms in the beginning of the review are more
weighted
Scale3 - Weights are assigned to the terms in the increasing
order of their distance from Central term
No Windowing - The algorithm will assign the review
positive tag if the calculated sentiment > 0
negative tag if the calculated sentiment < 0
undecided tag if the calculated sentiment = 0
Since all reviews were attempted, precision and recall were equal.
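The three positional scalings might be sketched as linear weight schemes like the following (the exact weight formulas are an assumption; the report does not specify them):

```python
def weights(n, scale):
    """Per-position weights for a review of n terms (linear forms assumed)."""
    if scale == "scale1":                 # terms at the end weigh more
        return [(i + 1) / n for i in range(n)]
    if scale == "scale2":                 # terms at the beginning weigh more
        return [(n - i) / n for i in range(n)]
    if scale == "scale3":                 # weight grows with distance from the central term
        c = (n - 1) / 2.0
        d = max(c, 1.0)                   # avoid division by zero for short reviews
        return [abs(i - c) / d for i in range(n)]
    return [1.0] * n                      # no scale: equal weighting

def scaled_sentiment(word_scores, scale):
    """Weighted sum of per-word sentiment scores."""
    w = weights(len(word_scores), scale)
    return sum(wi * s for wi, s in zip(w, word_scores))
```

For example, `weights(4, "scale1")` gives `[0.25, 0.5, 0.75, 1.0]`, so the last term counts four times as much as the first.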
Results Obtained -
The table shows the precision values
TABLE2
-----------------------------------------
DATA SCALE1 SCALE2 SCALE3 |
test 56% 57% 59% |
car1 74% 75% 66% |
play station 59% 58% 61% |
sanity 49% 49% 48% |
-----------------------------------------
[note - precision and recall are approximately the same here, as window = 1]
Observations -
Different scaling methods have different impacts on the precision
values.
Conclusions -
If a particular scaling method gives higher accuracy on a particular
data set, we can draw some fair conclusions about the location
of the key sentiment-determining terms in that review data.
e.g. if scale1 does very well for data set X, we can say that when
the end terms were weighted more, the algorithm could classify
most of the reviews correctly, and hence data set X is expected to have
some sentiment-determining terms near the end.
Future Experiments with scaling -
(1) Do scaling according to the topic in a particular section, to weight
the relevant terms. (This was suggested by a member of the
audience when the project was presented.)
(2) Use some non-linear scaling techniques to bring out the scaling
effects more clearly.
-----------------------------------------------------------------------------
Experiment Set 3 :
Run the program on various review and sanity data sets and observe
results under various scaling techniques, using a wide window for the
undecided class.
Aim
To see the windowing effect when more reviews are classified as
undecided, by putting a threshold on the minimum sentiment magnitude
a review must have before it is classified as positive or negative.
Experimental Conditions :
Four Types of Scaling methods -
No Scale - All Terms are equally weighted
Scale1 - Terms in the end of the review are more weighted
Scale2 - Terms in the beginning of the review are more
weighted
Scale3 - Weights are assigned to the terms in the increasing
order of their distance from central term
Windowing - Calculate the window using the expected values of
the negative and positive regions.
This is done by finding where the sentiment values cross the zero
boundary and then taking the means of all negative and all positive
sentiment values obtained. The program calculates two thresholds:
mean1 - decides the maximum (most negative) sentiment a review must
have to be classified as 'negative'
mean2 - decides the minimum positive sentiment a review must
have to be classified as 'positive'
The values between mean1 and mean2 are taken as 'neutral' sentiments
and the review is classified into the 'undecided' class.
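Our reading of this threshold computation can be sketched as follows (the function names and the exact comparison rule at the boundaries are assumptions; the project code may differ):

```python
def window_thresholds(sentiments):
    """mean1 = mean of the negative scores, mean2 = mean of the positive scores."""
    neg = [s for s in sentiments if s < 0]
    pos = [s for s in sentiments if s > 0]
    mean1 = sum(neg) / len(neg) if neg else 0.0
    mean2 = sum(pos) / len(pos) if pos else 0.0
    return mean1, mean2

def classify_windowed(sentiment, mean1, mean2):
    """Only scores outside the (mean1, mean2) window get a polarity tag."""
    if sentiment <= mean1:
        return "negative"
    if sentiment >= mean2:
        return "positive"
    return "undecided"

m1, m2 = window_thresholds([-2.0, -1.0, 1.0, 3.0])   # m1 = -1.5, m2 = 2.0
print(classify_windowed(0.5, m1, m2))                 # undecided
```

Because the thresholds sit at the means of each region, roughly the weaker half of each polarity falls inside the window, which is what widens the undecided class.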
Results Obtained -
TABLE3
-----------------------------------------------------------------------
DATA PRECISION RECALL ATTEMPTED CORRECT
test 62% 37% 66 39
car1 87% 54% 64 51
play station 59% 42% 97 58
sanity 46% 25% 45 25
-----------------------------------------------------------------------
Observations -
(1) Widening the window assigns 'undecided' to more reviews and hence
reduces the attempted-review count.
(2) The program attempts fewer reviews and hence is expected to find
fewer correct ones than in the no-window case.
(3) Hence there is an expected drop in the recall values.
(4) But the drop in the number attempted is larger than the drop in the
number correctly classified, which is why precision shows a sharp
improvement.
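Observations (1)-(4) can be checked with a toy computation (numbers hypothetical, not taken from the tables): precision = correct/attempted, recall = correct/total.

```python
total = 100

# No window: attempt everything.
attempted, correct = 100, 55
no_window_precision = correct / attempted    # 0.55
no_window_recall = correct / total           # 0.55

# Wide window: abstain on 40 hard cases but keep most of the easy (correct) ones.
attempted, correct = 60, 40
window_precision = correct / attempted       # ~0.67, up
window_recall = correct / total              # 0.40, down
```

Abstaining mostly removes borderline (often wrong) attempts, so precision rises even though recall must fall.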
The following shows the impact of the window on precision
TABLE4
--------------------------------------------
DATA W/o Window With Window |
car1 data 75% 87% |
test data 55% 60% |
--------------------------------------------
(TABLE4 combines the no-window results with the with-window results)
(5) Table of reviews classified as undecided for window conditions -
TABLE5
---------------------------------------------------
DATA Narrow Window(10%) Wide Window(33%) |
test 7 33 |
car1 9 36 |
sanity 16 56 |
---------------------------------------------------
The table shows that, for both window sizes, the algorithm found the
most undecided reviews on the sanity data, and the counts exceed the
nominal window sizes.
(The graphical representation of all these tables is in the
presentation handout copy)
=============================================================================
Analysis and Interpretation of various results -
=============================================================================
Some Curious Questions raised in our minds and our analysis
------------------------------------------------------------------
Why does the algorithm give different accuracies for different data?
------------------------------------------------------------------
Our Reasoning -
The algorithm is based on the assumption that the sentiment of a review is
the aggregate of the sentiment values of its content words.
This assumption may not hold for all data: a review's aggregate sentiment
value may differ from its given sentiment label.
Since the algorithm gave high accuracy on the car1 data, our assumption
appears to hold for car1. Also, the low accuracy on the sanity data shows
that the sanity data really does carry balanced sentiment, as we get a
high undecided count.
------------------------------------
Which is the best scaling technique?
------------------------------------
We conclude that the best scaling method depends on the review structure,
and it would be interesting to see whether a particular scaling really does
well for a particular data set. Our experiments were not extensive enough to
show the effect of scaling on precision, but the slight improvement from the
scale3 method on the test data without a window supports this view.
(Refer to TABLE2.)
---------------------------------------------------------------------
Why do we get higher precision and lower recall as we widen the window?
---------------------------------------------------------------------
The wider the window, the more reviews the program assigns to 'undecided',
and hence the attempted count goes down. But the drop in correctly classified
reviews is not as sharp, and hence we see a net rise in the precision values.
For all windowing experiments, the program assigned more 'undecided' tags to
the sanity data than to any review data, which confirms its correct behavior.
The window puts a threshold on the minimum sentiment magnitude needed to
call a review positive or negative.
===========================================================================