Experimental Results and Analysis
By Alianza Lima: Amine, Amruta, Sam, Suchitra

=============================================================================
Experiments Done
=============================================================================

Aim
To observe the precision and recall on various review and sanity data sets
under different windowing and scaling conditions.

Data Used -
    test data
    car1 data
    play station data
    sanity data (around 100 instances from Line data)

-----------------------------------------------------------------------------
Experiment Set 1 : Run the program on various review and sanity data sets
without scaling and windowing.

Experimental Conditions :
No Scaling   - All terms in the review are equally weighted.
No Windowing - The algorithm assigns the review a
    positive tag   if the calculated sentiment > 0
    negative tag   if the calculated sentiment < 0
    undecided tag  if the calculated sentiment = 0
Since all reviews were attempted, precision and recall were almost equal.

Results Obtained -
TABLE1
----------------------------------------
DATA            PRECISION    RECALL    |
test            55%          55%       |
car1            75%          75%       |
play station    55%          55%       |
sanity          49%          49%       |
----------------------------------------
Hardly any review was classified as undecided, as the window size = 1.

Observations -
The algorithm does well, classifying a good number of reviews correctly.
The low accuracy for the sanity data is expected, as sanity data is supposed
to be neutral and should result in only about half of the reviews being
classified right.

Conclusions -
We expect to see around 75% accuracy for any kind of review data that
supports our assumption that a review's sentiment is really the sum of its
word sentiments. This assumption is clearly not correct for all types of
data.

------------------------------------------------------------------------------
Experiment Set 2 : Run the program on various review and sanity data sets and
observe results under various scaling techniques.

Aim
To draw conclusions about the sentiment structure of the review, i.e. to see
if the review data has strongly sentiment-determining terms in a specific
section of the review.

Experimental Conditions :
Four types of scaling methods -
    No Scale - All terms are equally weighted
    Scale1   - Terms at the end of the review are weighted more
    Scale2   - Terms at the beginning of the review are weighted more
    Scale3   - Weights are assigned to the terms in increasing order of
               their distance from the central term
No Windowing - The algorithm assigns the review a
    positive tag   if the calculated sentiment > 0
    negative tag   if the calculated sentiment < 0
    undecided tag  if the calculated sentiment = 0
Since all reviews were attempted, precision and recall were equal.

Results Obtained -
The table shows the precision values.
TABLE2
--------------------------------------------
DATA            SCALE1    SCALE2    SCALE3 |
test            56%       57%       59%    |
car1            74%       75%       66%    |
play-station    59%       58%       61%    |
sanity          49%       49%       48%    |
--------------------------------------------
[note - precision and recall are approximately the same here as window = 1]

Observations -
Different scaling methods have different impacts on the precision values.

Conclusions -
If a particular scaling method gives higher accuracy with a particular data
set, we can draw some fair conclusions about the location of the key
sentiment-determining terms in that review data. For example, if scale1 does
very well for data set X, we can say that when the end terms were weighted
more, the algorithm could classify most of the reviews correctly, and hence
data set X is expected to have some sentiment-determining terms at the end.
(A sketch of these weighting schemes is given below.)
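The report does not reproduce the program's code, so the following is only a
minimal Python sketch of the aggregate-sentiment classifier and the three
scaling schemes described above. The linear weight formulas, the lexicon
lookup, and the names (term_weights, classify, lexicon) are assumptions for
illustration, not the project's actual implementation.

    # Minimal sketch of positional scaling + sign-based classification.
    # Weight formulas are assumed to be linear; the actual program may differ.

    def term_weights(n, scheme="noscale"):
        """Return one weight per term position for a review of n terms."""
        if scheme == "noscale":              # all terms equally weighted
            return [1.0] * n
        if scheme == "scale1":               # terms near the end weighted more
            return [(i + 1) / n for i in range(n)]
        if scheme == "scale2":               # terms near the beginning weighted more
            return [(n - i) / n for i in range(n)]
        if scheme == "scale3":               # weight grows with distance from the central term
            centre = (n - 1) / 2.0
            return [(abs(i - centre) + 1) / (centre + 1) for i in range(n)]
        raise ValueError("unknown scaling scheme: " + scheme)

    def classify(review_terms, lexicon, scheme="noscale"):
        """Sum the (scaled) word sentiments and tag the review by the sign."""
        weights = term_weights(len(review_terms), scheme)
        sentiment = sum(w * lexicon.get(term, 0.0)
                        for w, term in zip(weights, review_terms))
        if sentiment > 0:
            return "positive"
        if sentiment < 0:
            return "negative"
        return "undecided"

Under this kind of scheme, scale1 makes the classification of a review depend
mostly on its closing terms, which is exactly the effect the conclusion above
relies on when interpreting the per-scale precision values.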
Future Experiments with scaling -
(1) Do scaling according to the topic in a particular section to weigh the
    relevant terms. (This was a suggestion from the audience when the
    project was presented.)
(2) Use some non-linear scaling techniques to see the scaling effects more
    effectively.

-----------------------------------------------------------------------------
Experiment Set 3 : Run the program on various review and sanity data sets and
observe results under various scaling techniques, using a wide window for the
undecided class.

Aim
To see the windowing effect when more reviews are classified into the
undecided class by putting a threshold on the minimum sentiment value a
review must have before its classification is decided.

Experimental Conditions :
Four types of scaling methods -
    No Scale - All terms are equally weighted
    Scale1   - Terms at the end of the review are weighted more
    Scale2   - Terms at the beginning of the review are weighted more
    Scale3   - Weights are assigned to the terms in increasing order of
               their distance from the central term
Windowing -
Calculate the window using the expected value of the negative and positive
regions. This is done by finding where the sentiment values cross the zero
boundary and then taking the mean of all negative and all positive sentiment
values obtained. The program calculates two thresholds:
    mean1 - the maximum sentiment value a review may have and still be
            classified as 'negative'
    mean2 - the minimum positive sentiment a review must have to be
            classified as 'positive'
Values between mean1 and mean2 are taken as 'neutral' sentiments and the
review is classified into the 'undecided' class. (A sketch of this threshold
computation is given at the end of this section.)

Results Obtained -
TABLE3
-----------------------------------------------------------------------
DATA            PRECISION    RECALL    ATTEMPTED    CORRECT
test            62%          37%       66           39
car1            87%          54%       64           51
play station    59%          42%       97           58
sanity          46%          25%       45           25
-----------------------------------------------------------------------

Observations -
(1) Widening the window assigns 'undecided' to more reviews and hence
    reduces the attempted review count.
(2) The program attempts fewer reviews and hence is expected to find fewer
    correct reviews compared to the no-window case.
(3) Hence there is an expected drop in the recall values.
(4) But the drop in the number of attempted reviews is larger than the drop
    in correctly classified reviews, which is why the precision shows a
    sharp improvement.

The following shows the impact of the window on precision.
TABLE4
--------------------------------------------
DATA            W/o Window    With Window  |
car1 data       75%           87%          |
test data       55%           60%          |
--------------------------------------------
(combining TABLE4 and TABLE2)

(5) Table of reviews classified as undecided under the two window
    conditions -
TABLE5
---------------------------------------------------
DATA      Narrow Window (10%)    Wide Window (33%) |
test      7                      33                |
car1      9                      36                |
sanity    16                     56                |
---------------------------------------------------
The table shows that for both window sizes, the algorithm found the most
undecided reviews for the sanity data, and these counts exceed the expected
window sizes.

(The graphical representation of all these tables is in the presentation
handout copy.)
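For reference, here is a minimal sketch of the mean-based window described in
the Experimental Conditions above, assuming the per-review sentiment scores
have already been computed. The function names are illustrative, and the
sketch does not show how the narrow (10%) and wide (33%) window sizes of
TABLE5 were derived.

    # Minimal sketch of the undecided window: mean1 = mean of the negative
    # scores, mean2 = mean of the positive scores; anything in between is
    # tagged 'undecided'.

    def window_thresholds(sentiment_scores):
        """Compute (mean1, mean2) from the scores of all reviews in a data set."""
        negatives = [s for s in sentiment_scores if s < 0]
        positives = [s for s in sentiment_scores if s > 0]
        mean1 = sum(negatives) / len(negatives) if negatives else 0.0
        mean2 = sum(positives) / len(positives) if positives else 0.0
        return mean1, mean2

    def classify_with_window(score, mean1, mean2):
        """Tag a review using the widened undecided window [mean1, mean2]."""
        if score <= mean1:          # at least as negative as the negative mean
            return "negative"
        if score >= mean2:          # at least as positive as the positive mean
            return "positive"
        return "undecided"          # falls inside the window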
=============================================================================
Analysis and Interpretation of various results -
=============================================================================

Some curious questions raised in our minds, and our analysis

------------------------------------------------------------------
Why does the algorithm give different accuracies for different data?
------------------------------------------------------------------
Our Reasoning -
The algorithm is based on the assumption that the sentiment of a review is
the aggregate of the sentiment values of its content words. This assumption
may not hold true for all data: a review may have an aggregate sentiment
value different from its given sentiment label. Since the algorithm gave
high accuracies for the car1 data, our assumption seems to hold true for
that data. Also, the low accuracy for the sanity data shows that the sanity
data really does have balanced sentiment, as we get a high undecided count.

------------------------------------
Which is the best scaling technique?
------------------------------------
We conclude that the best scaling method depends on the review structure,
and it would be interesting to see a particular scaling really do well for a
particular data set. Our experiments were not enough to show the effect of
scaling on precision, but the slight improvement in the results using the
scale3 method for the test data without a window supports this idea. (Refer
to TABLE2.)

---------------------------------------------------------------------
Why do we get higher precision and lower recall as we widen the window?
---------------------------------------------------------------------
The wider the window, the more reviews the program assigns to 'undecided',
and hence the attempted count goes down. But the drop in correctly assigned
reviews is not as sharp, and hence we see a net rise in the precision
values. For all windowing experiments, the program assigned more 'undecided'
tags to the sanity data than to the review data, which confirms its correct
behavior. The window puts a threshold on the minimum sentiment value needed
to call a review positive or negative.

===========================================================================
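As a footnote to the precision/recall discussion above, the following is a
purely illustrative sketch. It assumes precision is computed as correct /
attempted and recall as correct / total, which matches the qualitative
description in the observations, though the program's exact computation may
differ (e.g. averaging per class); the counts shown are hypothetical, not our
measured data.

    # Illustrative only: how shrinking 'attempted' faster than 'correct'
    # raises precision while lowering recall.

    def precision_recall(correct, attempted, total):
        precision = correct / attempted if attempted else 0.0
        recall = correct / total if total else 0.0
        return precision, recall

    # Hypothetical counts for a 100-review data set:
    print(precision_recall(55, 100, 100))  # no window:   (0.55, 0.55)
    print(precision_recall(40, 60, 100))   # wide window: (~0.67, 0.40)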