++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
==================================================
Author: Amine Abou-Rjeili
Date: 10/30/2002
Filename: experiments.txt
Class: CS8761
Assignment: 3 - How to Find Meanings in Li(f|n)e
==================================================

Included files:
---------------
	select.pl
		Randomly divides the data into TRAIN and TEST partitions
	feat.pl
		Extracts features to be used for training and testing. Features must be
		extracted from the TRAIN partition and NEVER from the TEST partition
	convert.pl
		Converts instances into feature vector representation. Used to convert the
		TRAIN and TEST instances into a set of feature vectors
	nb.pl
		Runs a Naive Bayes Classifier to learn from TRAIN and then classify the instances
		from TEST according to the learned model
	target.config
		Contains system configurations
	experiments.txt
		Results and analysis of experiments (this file)

Analysis:
---------
  This file contains the experiments and analysis that were carried out to test the Naive Bayes Classifier
  on the line data. The experiments were run using a 70-30 percent ratio to split the data into
  TRAIN and TEST partitions respectively. The following combinations of window sizes and frequency cutoffs
  were used: 
	Window Size:       0 2 10 25
	Frequency cutoffs: 1 2 5
  I ran two sets of experiments as follows:
    1) Split the data into TRAIN and TEST partitions once and then run
       the experiments using these partitions. This experiment was carried out
       to compare the performance of the different combinations using the same
       TRAIN and TEST data (constant environment). The results of these experiments
       are summarized in TABLE 1.

    2) For each combination of window size and frequency cutoff, a different
       set of TRAIN and TEST partitions was used. These partitions differ in
       every run because they are drawn randomly from the entire line data.
       This experiment was carried out to see how different partitions affect
       the performance of each run. As can be observed from the data below,
       the performances of the two experiments are very similar, so the
       changing environment did not have a great impact. However, performance
       was degraded slightly with the different partitions (in experiment 2).
       For example, consider the following:

                 Window Size | Frequency cutoff | Accuracy
   Experiment 1)     10      |        5         | 0.8232 [1024 correct out of 1244]
   Experiment 2)     10      |        5         | 0.8063 [1003 correct out of 1244]

       It can be seen that experiment 1 has a better accuracy by about 2
       percentage points. One explanation is that the TRAIN data happened to
       be more representative of the TEST data in case 1 than in case 2.
       Here, I must note that these two sets of experiments do not necessarily
       imply that a constant environment is always better. Rather, it is a
       matter of finding TRAIN data that is diverse enough to reflect any set
       of test data with accuracy.
       The results of experiment 2 are summarized in TABLE 2 below.


TABLE 1
-------
window size|frequency cutoff|accuracy
0	1	Accuracy: 0.5273 [Total number of correct 656 out of 1244]
0	2	Accuracy: 0.5273 [Total number of correct 656 out of 1244]
0	5	Accuracy: 0.5273 [Total number of correct 656 out of 1244]

2	1	Accuracy: 0.7508 [Total number of correct 934 out of 1244]
2	2	Accuracy: 0.7564 [Total number of correct 941 out of 1244]
2	5	Accuracy: 0.7412 [Total number of correct 922 out of 1244]

10	1	Accuracy: 0.7460 [Total number of correct 928 out of 1244]
10	2	Accuracy: 0.8135 [Total number of correct 1012 out of 1244]
10	5	Accuracy: 0.8232 [Total number of correct 1024 out of 1244]

25	1	Accuracy: 0.7572 [Total number of correct 942 out of 1244]
25	2	Accuracy: 0.8400 [Total number of correct 1045 out of 1244]
25	5	Accuracy: 0.8344 [Total number of correct 1038 out of 1244]

---------------------------------------------------------

TABLE 2
-------

Running experiments with different TRAIN and TEST each time

window size|frequency cutoff|accuracy
0	1	Accuracy: 0.5338 [Total number of correct 664 out of 1244]
0	2	Accuracy: 0.5579 [Total number of correct 694 out of 1244]
0	5	Accuracy: 0.5346 [Total number of correct 665 out of 1244]

2	1	Accuracy: 0.7291 [Total number of correct 907 out of 1244]
2	2	Accuracy: 0.7347 [Total number of correct 914 out of 1244]
2	5	Accuracy: 0.7235 [Total number of correct 900 out of 1244]

10	1	Accuracy: 0.7195 [Total number of correct 895 out of 1244]
10	2	Accuracy: 0.7910 [Total number of correct 984 out of 1244]
10	5	Accuracy: 0.8063 [Total number of correct 1003 out of 1244]

25	1	Accuracy: 0.7170 [Total number of correct 892 out of 1244]
25	2	Accuracy: 0.8103 [Total number of correct 1008 out of 1244]
25	5	Accuracy: 0.8119 [Total number of correct 1010 out of 1244]


From the above data we can observe that window size alone does not generally improve
accuracy; the effect also depends on the frequency cutoff. For example, taking
the following instances from TABLE 1:

    a) 10	5	Accuracy: 0.8232 [Total number of correct 1024 out of 1244]
    b) 25	1	Accuracy: 0.7572 [Total number of correct 942 out of 1244]

we can see that the accuracy actually decreased when we increased the window size
but lowered the frequency cutoff. (The exception is a window size of 0, where the
frequency cutoff plays no role since there are no features; increasing the window
size from 0 to 2 yields an increase in performance irrespective of the cutoff.)
This general rule makes sense: with a small frequency cutoff we admit many features
that appear by chance and contribute little to the sense of the target word. With a
higher frequency cutoff we keep the words that occur more frequently around the
target word, since the ones that occur infrequently are not considered. On the other
hand, a very high frequency cutoff may leave fewer features in the vector, which is
similar to falling back to a smaller window size. This can be seen from the
following example taken from TABLE 1:

     25	2	Accuracy: 0.8400 [Total number of correct 1045 out of 1244]
     25	5	Accuracy: 0.8344 [Total number of correct 1038 out of 1244]

We can see here that a frequency cutoff of 2 performed slightly better than a cutoff
of 5 (with window size 25). This is not the case in TABLE 2, due to the different
TRAIN and TEST partitions. So a model with a smaller frequency cutoff can generalize
better than one with a cutoff of 5, depending on the TRAIN data.

From the above experiments, we can see that the optimal combination of window size and frequency cutoff is

     Window Size: 25   Frequency Cutoff: 2   Accuracy: 84%

This is the highest accuracy achieved, although the 25/5 combination performed only
slightly worse, at 83.4%. This fits the discussion above: we do not want to go too
large in window size or too high in frequency cutoff, because then we get into the
case of overfitting the model, so we need a middle ground for best performance. In
the experiments carried out, the combination 25/2 turned out to be optimal, as
reported in TABLE 1. The results from TABLE 2 disagree with this and place the
combination 25/5 above 25/2, although only very slightly; this is attributed to the
different TRAIN and TEST partitions, as mentioned above. In conclusion, it can be
observed from the data that for each window size the accuracy starts low and then
increases as the frequency cutoff increases, although in some cases the accuracy
drops as the cutoff increases from 2 to 5.


REFERENCES:
-----------

Manning, C. and Schütze, H. 2000. Foundations of Statistical Natural Language Processing, pp. 235-239.
Mitchell, T. 1997. Machine Learning, ch. 6.

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
Name		:	Nitin Agarwal			Date	1-Nov-02
Class		:	Natural Language Processing CS8761
===============================================================

Objective
To create sense tagged text and explore supervised approaches to word sense disambiguation.

Procedure
Let us consider a huge corpus in which one word occurs in every line and can have various meanings. We split this corpus into two sets. We analyze one set using Statistical Natural Language Processing techniques to get the training data. Using this data we try to figure out the meaning of the target word in the test data. After this we compare the meaning of the word obtained using this method with the actual meaning and find the accuracy of the method. The value of the accuracy can be anywhere between 0 and 1; the closer the value is to 1, the better the method.

Observations
I ran the test for all 12 possible combinations of window size and frequency cutoff. Unfortunately, I got the same value for accuracy in all 12 cases: every time I ran the program, the accuracy was 0.2370. This suggests that the program written by me is flawed somewhere. In my opinion it is most likely nb.pl, though I could not figure out the mistake.

However, from the output that I got, I noticed that the probability for many instances was 0. The likely reason is that I simply multiplied all the terms, so the product underflowed to zero. Had I added the logarithms of all the terms and then calculated the antilog of the sum, I would have gotten a different value. However, I could not find an antilog function in Perl, because of which I could not calculate the probability using this method.
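
In fact, Perl's built-in exp() is the antilog of its natural log(), so the
log-space computation is possible. A minimal sketch of the idea (the
probability values are made up for illustration):

    #!/usr/bin/perl
    # Multiply many small probabilities without underflow by summing
    # their logs; log(a*b*c) = log(a) + log(b) + log(c).
    my @probs = (1e-120, 2e-130, 5e-110);  # hypothetical P(feature|sense) terms

    my $logsum = 0;
    $logsum += log($_) for @probs;

    printf "log-probability: %.2f\n", $logsum;       # about -826.63
    printf "direct product:  %g\n",   exp($logsum);  # underflows to 0
    # For classification there is no need to take the antilog at all:
    # comparing $logsum values across senses preserves the argmax,
    # since log is monotonically increasing.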

window size | frequency cutoff | accuracy 
	0		1			0.2370
	0		2			0.2370
	0		5			0.2370
	.		.			.
	.		.			.
	.		.			.
	25		5			0.2370

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
Kailash Aurangabadkar
Assignment # 3
How to find meanings in li(n|f)e

--------------------------------------------------------------------------------

	       The objective of the assignment is to explore supervised
approaches to word sense disambiguation. In this assignment a sense tagged text
is created and a Naïve Bayesian classifier is implemented to perform word sense
disambiguation.
--------------------------------------------------------------------------------

             Word Sense Disambiguation: The task of disambiguation is to
determine which of the senses of an ambiguous word is invoked in a particular
use of the word.
--------------------------------------------------------------------------------

	      Process: In this assignment the Naïve Bayesian classification
algorithm is used to assign a score to every instance in the Test data for
every possible sense of the word. The central idea of this classifier is to
look around the ambiguous word in a large context window; each content word
adds information for the disambiguation of the target word. We first find the
probability vector from the Train data for each sense and feature combination.
If we did not see a word in that context, we apply Witten-Bell smoothing to
find the probability of that event. Then we combine the probabilities of the
content words in the Test instance's window for every sense, using the Naïve
Bayes independence assumption. The sense with the maximum probability for that
instance from the Test data is assigned as the sense of that instance. The
accuracy of the algorithm is then computed by comparing the actual sense of
the word in that line with the sense assigned by us.
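
A minimal sketch of this scoring step follows. The data structure names are
hypothetical, the counts are assumed to have been collected from the Train
feature vectors, and scores are compared in log space to avoid underflow:

    #!/usr/bin/perl
    # Score a test instance under each sense with naive Bayes, using
    # Witten-Bell smoothing for feature/sense pairs unseen in training.
    #   $count->{$sense}{$feat} = times $feat occurred with $sense in Train
    #   $prior->{$sense}        = relative frequency of $sense in Train
    #   $vocab                  = number of distinct feature types overall
    sub best_sense {
        my ($features, $count, $prior, $vocab) = @_;
        my ($best, $best_score);
        for my $sense (keys %$prior) {
            my $n = 0;
            $n += $_ for values %{ $count->{$sense} };  # tokens seen with sense
            my $t = scalar keys %{ $count->{$sense} };  # types seen with sense
            my $z = $vocab - $t;                        # types never seen with it
            my $score = log $prior->{$sense};
            for my $f (@$features) {
                my $c = $count->{$sense}{$f} || 0;
                my $p = $c > 0
                      ? $c / ($n + $t)                  # Witten-Bell, seen event
                      : $t / ($z * ($n + $t));          # unseen: share T/(N+T)
                $score += log $p;
            }
            ($best, $best_score) = ($sense, $score)
                if !defined $best_score || $score > $best_score;
        }
        return $best;
    }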
--------------------------------------------------------------------------------

                 The assignment consists of implementing four programs, which
are:-

1. select.pl:- This program divides a sense tagged corpus of text into Training
data and Test data.

2. feat.pl:- This program finds the features in the specified window around the
target word. It also checks that the frequency of each feature is more than
the cutoff specified.

3. convert.pl:- This program gives us the feature vector table, which shows
whether the features obtained using feat.pl are present or not in the input
file.

4. nb.pl:- This program assigns sense tags to untagged data from Test. It does
this by using the Naïve Bayesian algorithm, smoothing its probability estimates
with Witten-Bell smoothing.
--------------------------------------------------------------------------------

                 Experiment Results:-  The accuracy values for each combination
of window size and frequency cutoff are shown below:
--------------------------------------------------------------------------------

 Window Size        Frequency Cutoff       Accuracy value
--------------------------------------------------------------------------------

     0                   1                      0.4674

     0                   2                      0.4674

     0                   5                      0.4674
--------------------------------------------------------------------------------

     2                   1                      0.7383

     2                   2                      0.7262

     2                   5                      0.7021
---------------------------------------------------------------------------------

    10                   1                      0.7412

    10                   2                      0.7353

    10                   5                      0.7124
--------------------------------------------------------------------------------

    25                   1                      0.793

    25                   2                      0.7414

    25                   5                      0.7565
--------------------------------------------------------------------------------

                We see in general that as the window size increases, the
accuracy increases. This is expected: a larger window takes in more content
words around the ambiguous word under consideration, which gives us more
information about the particular sense occurring in that instance. We also
see that the accuracy decreases as the frequency cutoff is increased. This
follows from the fact that a word is surrounded by fewer key words than by
words like verbs, articles and prepositions. For example, when we say
"telephone line" we know that line here means "cord", but "telephone" is less
frequent around "line" (meaning "cord") than are words like "the" and "is".
Thus as we increase the frequency cutoff we risk losing the keywords and
retaining the less important ones.
                This trend is generally followed in the observations made
from the experiments performed and summarized in the table above.
                Thus we see that by making the window size as large as
possible and the frequency cutoff as low as possible (possibly 1) we can get
higher accuracy in assigning senses to ambiguous words.

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
cs8761 Natural Language Processing
Assignment 3

Archana Bellamkonda
November 1, 2002.


Problem Description : ( for part II - Naive Bayesian Classifier Implementation )
-------------------
The objective is to assign a sense to a given target word in a sentence
using a Naive Bayesian Classifier.
We start off with previously collected sets of data, where each set holds
instances of the word used in a particular sense. In this assignment, we do
the sense tagging in four steps -

1) select.pl :- This program collects all instances from given data and
   randomizes the instances after adding information about the sense from
   which they are extracted, and divides these instances into two sets of
   data, TEST and TRAIN depending on the percentage entered by the user.

2) feat.pl :- This program identifies all word types that occur within "w"
   positions to the left or right of the target word, and that occur more than
   "f" times in the TRAIN data set. It doesn't include the target word as a feature.

3) convert.pl :- This program converts the input file to a feature vector
   representation, where the features output by feat.pl are read from
   standard input. Each instance in the file is converted into a series of
   binary values that indicate whether or not each type listed in the list
   of features output by feat.pl has occurred within the specified window
   around the target word in the given instance. (A sketch of this step is
   given after this list.)
   NOTE:- It also includes the number of unobserved features in the specified
   window around the target word for every instance. This is the last
   number in the feature vector.

4) nb.pl :- This program will learn a Naive Bayesian classifier and use that
   classifier to assign sense tags to the test data; the senses are printed
   along with the instance ids, the actual sense and the probability of the
   assigned sense.
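
A minimal sketch of the conversion in step 3, with hypothetical data (the
comma-separated output format is an assumption):

    #!/usr/bin/perl
    # Convert one instance into binary feature values plus a trailing
    # count of window words not covered by any feature (unobserved).
    my @features = qw(telephone cord wall product);  # from feat.pl (hypothetical)
    my @window   = qw(the telephone cord hung);      # window words of one instance

    my %in_window  = map { $_ => 1 } @window;
    my @vector     = map { $in_window{$_} ? 1 : 0 } @features;

    my %is_feature = map { $_ => 1 } @features;
    my $unobserved = grep { !$is_feature{$_} } @window;  # scalar grep counts

    print join(",", @vector, $unobserved), "\n";         # prints 1,1,0,0,2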


Experiment : 
----------

                        Frequency cutoff
   window size |      1           2           5
   ------------|---------------------------------
        0      |   0.4844      0.4711      0.4711
        2      |   0.8667      0.5378      0.4711
       10      |   0.9333      0.4933      0.4711
       25      |   0.9111      0.8978      0.5822


Experiments are done with the above shown window sizes and frequency
cutoffs. The results are as shown for all twelve combinations.

Observations : 
------------

-->Expected :
   --------
As the window size increases, the number of features that we observe
increases, and hence we learn more about the context of a given word; so
we would expect to observe higher accuracy as the window size increases
(taking the frequency cutoff to be 1).
But the meaning of a word depends only on the surrounding context. For
example, the meaning of a word in a sentence will generally not depend on
the meaning of a word in other sentences. So, if we go on increasing the
window size, starting from zero, we will observe an increase in accuracy up
to a certain level, and then, when the window size increases beyond the
required context, the accuracy will drop again.

Observed :
--------
As shown in the table above, we observed what we expected. Consider the column
for frequency cutoff 1: as the window size increases, the accuracy increases,
reaches a maximum of 0.9333, and then drops again to 0.9111.


-->Expected :
   --------
According to Zipf's law, most of the word types in a text occur only
rarely. So, as the frequency cutoff increases for a particular window size,
fewer features are observed and hence we can't capture the context
properly. Thus, accuracy decreases as the frequency cutoff increases
for a given window size.

Observed :
--------
As shown in the table above, we again observed what we expected. Consider
any row for a particular window size. The accuracy values are decreasing as
frequency cutoff increases.


--> Expected :
    --------
When we consider frequency cutoffs greater than 1, we still estimate that
more features will be observed as the window size increases, and thus we
can expect the accuracy to be higher.

Observed :
--------
Consider the columns for frequency cutoffs 2 and 5. We observed what we
expected. 


CONCLUSIONS
----------

Noise : 
-----
Thus, we should select the window size to be optimal. We should observe
where we start getting noise, i.e., the point at which unwanted features
enter our context, and limit our window size to be lower than that point.

Optimal Combination :
-------------------
The optimal combination will be the greatest window size below the point
where we start observing noise, together with a frequency cutoff of "1", as
we then observe more features than with higher frequency cutoffs and hence
learn the context of a sense in a better way.

Optimal Combination In Our Experiment:
-------------------------------------

As seen from the table above, the optimal combination is predicted to be a
window size of "10" and a frequency cutoff of "1". We got the highest
accuracy, 0.9333, at that point.

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
#------------------------------------------------------------
# Assignment #3 : CS-8761 Corpus based NLP
# Title         : Word Sense Disambiguation [experiments.txt]
# Author        : Deodatta Bhoite
# Version       : 1.0
# Date          : 10-31-2002
#------------------------------------------------------------


Introduction
------------

In this assignment we have implemented the Naive Bayesian classifier for word
sense disambiguation. The data used is the 'line' data (all six senses).
The classifier learns from a part of the data and is then tested on the
remaining data, assigning a sense to the ambiguous word based on the
probabilities it has learned from the training data. We also perform
Witten-Bell smoothing to assign a non-zero probability to unobserved events.


Results of the Experiment#1 (all six senses)
--------------------------------------------

The results of the experiments are summarized in the following table.

Window size	frequency cutoff	accuracy
     0		       1		 0.5340
     0		       2		 0.5340
     0		       5		 0.5340
     2		       1		 0.1692
     2		       2		 0.0946
     2		       5		 0.0858
     10		       1		 0.1358
     10		       2		 0.1363
     10		       5		 0.1226
     25		       1		 0.1431
     25		       2		 0.1897
     25		       5		 0.1796


Observations
------------

We observe that the accuracy is far better when the window size is zero, i.e.
when we assign the most frequent sense to all the instances. Though this has
the upper hand in these results, we acknowledge that it depends on the
probability of the most frequent sense in the corpus, which is not necessarily
very high. (E.g., with 99 senses where s1 = 0.02 and s2 through s99 = 0.01,
the accuracy would be 0.02.) Thus, though it gives a higher accuracy than the
other cases in this instance, it is not a good approach.

Among the other window sizes and frequency cutoffs, a window size of 25 with a
frequency cutoff of 2 gives us the highest accuracy. We can justify the high
accuracy by the increase in the window size. The decrease in accuracy at
frequency cutoffs 5 and 1 probably suggests that the words providing strong
evidence for the sense have a frequency greater than or equal to 2 but less
than 5. The decrease at frequency cutoff 1 suggests that words occurring only
once 'mislead' the classifier by pointing towards a wrong sense.

We also see that, in general, the accuracy improves with increase in window
size (with the exceptions of W=2 and F=1 and W=0). There doesn't seem to be
a direct relation between the frequency cutoff and the accuracy.


Experiment#2 (cord2 & division2)
--------------------------------

I also used the classifier for classifying two senses. 

The results for cord2 and division2 as two senses of 'line' are:

Window size	frequency cutoff	accuracy
     0		       1		 0.5022
     0		       2		 0.5022
     0		       5		 0.5022
     2		       1		 0.5200
     2		       2		 0.5067
     2		       5		 0.5200
     10		       1		 0.7511
     10		       2		 0.6400
     10		       5		 0.6667
     25		       1		 0.8133
     25		       2		 0.8222
     25		       5		 0.8133

We observe that the accuracy is now increasing with the window size. The
frequency cutoff doesn't seem to matter much.

Interestingly, the maximum accuracy is again at window size 25 and frequency 
cutoff 2.

The overall accuracy of the classifier has also increased. And we will try to 
see how the classifier does as the number of senses increase.

Experiment#3 (cord2,division2 and formation2)
---------------------------------------------

The results of the classifier for cord2, division2 and formation2 as three
different senses of 'line' are as follows:

Window size	frequency cutoff	accuracy
     0		       1		 0.3424
     0		       2		 0.3424
     0		       5		 0.3424
     2		       1		 0.3393
     2		       2		 0.3333
     2		       5		 0.3212
     10		       1		 0.5485
     10		       2		 0.3909
     10		       5		 0.4121
     25		       1		 0.6667
     25		       2		 0.5636
     25		       5		 0.4394

We observe that the pattern of the output is not the same as for 2 senses.
We see that the accuracy at window size 0 is higher than at window size 2;
this will happen more often as we increase the number of senses. However the
average accuracy does increase as the window size increases, and the highest
value is observed at window size 25.

Also note that the overall accuracy of the classifier has decreased as the
number of senses increases.


Conclusion
----------
The classifier performs better when the number of senses is less. The accuracy
of the classifier increases as the window size increases.
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
Naive Bayesian Classifier
Bridget Thomson McInnes
31 October, 2002
CS8761
----------------------------------------------------------------------------


EXPERIMENTS:
----------------------------------------------------------------------------

   |  Window Size |   Frequency Cutoffs  |  Accuracy    |  
   |--------------|----------------------|--------------|
   |      0       |	     1		 |   0.5386	|
   |--------------|----------------------|--------------|
   |      0       |	     2		 |   0.5346	|
   |--------------|----------------------|--------------|
   |      0       |	     5		 |   0.5471	|
   |--------------|----------------------|--------------|
   |      2       |	     1		 |   0.7299     |
   |--------------|----------------------|--------------|
   |      2       |	     2		 |   0.7122	|
   |--------------|----------------------|--------------|
   |      2       |	     5		 |   0.6479	|
   |--------------|----------------------|--------------|
   |     10       |	     1		 |   0.7627	|
   |--------------|----------------------|--------------|
   |     10       |	     2		 |   0.8142	|
   |--------------|----------------------|--------------|
   |     10       |	     5		 |   0.7805	|
   |--------------|----------------------|--------------|
   |     25       |	     1		 |   0.7620	|
   |--------------|----------------------|--------------|
   |     25       |	     2		 |   0.7625	|
   |--------------|----------------------|--------------|
   |     25       |	     5		 |   0.7203	|
   |----------------------------------------------------|


ANALYSIS:
----------------------------------------------------------------------------
The accuracies for the window size of zero are approximately the same for
each of the runs, around 50%. This is due to the fact that there are no
features for the classifier to train from: the classifier picks the most
frequent sense of the instances in the training data and applies this sense
to each instance in the test data. Given this, it might be thought that the
accuracy for a window size of zero should be identical no matter what the
frequency cutoff is. This is not the case, because the training and test
instances are randomly chosen each time the program is run; therefore, the
number of instances for each tag in the test and training files varies from
run to run.
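
A sketch of this window-0 behaviour, assuming (hypothetically) one instance
per line with the sense tag in the first whitespace-separated column:

    #!/usr/bin/perl
    # Majority-sense baseline: pick the most frequent sense in TRAIN,
    # then measure how often that single guess matches the TEST tags.
    my %freq;
    open my $train, '<', 'TRAIN' or die $!;
    while (<$train>) { my ($sense) = split; $freq{$sense}++; }
    my ($majority) = sort { $freq{$b} <=> $freq{$a} } keys %freq;

    my ($hits, $total) = (0, 0);
    open my $test, '<', 'TEST' or die $!;
    while (<$test>) {
        my ($sense) = split;
        $hits++ if $sense eq $majority;
        $total++;
    }
    printf "Accuracy: %.4f\n", $hits / $total;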

The accuracy for the window size of two is definitely higher than the
accuracy for a window size of zero. This is as expected, because with a
window size of two the classifier is not picking the most frequent sense
for every instance; it is using the features from the training data to
determine the sense of the instances in the test data. The run made with
the frequency cutoff of 5 is lower than the runs with frequency cutoffs
of one and two. I believe this is the case because, with a frequency
cutoff of five, relevant features are not being included: the chance of
a relevant feature occurring 5 times is smaller than of it occurring once
or twice.

The accuracy for the window size of ten is higher than the accuracy for
a window size of zero or two. I believe this is the case because more
relevant features can be captured with a larger window size. There is a
point, though, at which a window size becomes too large, making it
difficult to assign the appropriate sense to the target word; this is
discussed in the next paragraph. With a window size of ten and a frequency
cutoff of five, the accuracy is lower than with a cutoff of one or two. As
stated above, I believe this to be the case because, with a frequency
cutoff of five, relevant features are not being included.

The accuracy for the window size of 25 is lower than the accuracy of a
window size of ten. I believe this is the case because the words close to
the left and right of the target word are the ones indicative of which tag
corresponds to the instance. When a larger window size is employed, the
words far from the target word become vague and less indicative. This
creates noise, which makes it more difficult to determine the sense of
the instance. The accuracy for the window size of 25 is still higher than
the accuracy of a window size of zero or two. I believe that this is the
case because, as said above, with too large a window size the features
become too vague and create noise, while too small a window size does not
create enough features to uniquely determine the tag of an instance.

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
#Suchitra Goopy
#Natural Language Processing
#11/01/2002

The following data was used for this experiment:
cord2
phone2
---------------------------------------------------------------------------
Report
------

The main aim of this experiment was to perform "Word Sense Disambiguation".
Many words have the same form but different meanings. So, given a word
in isolation, one cannot accurately tell its meaning just by looking at it.
On the other hand, if the whole sentence is given, then we can guess
the meaning of the word.

For example, consider the word "bank".

This word has many meanings and cannot be disambiguated by just looking
at it.

I went to the bank to get money.
I sat by the river bank.

Now it is easy to see which sense of the word bank is being used, given
the whole sentence. The word is disambiguated mainly by its surrounding
words in the sentence.
The idea of this experiment is the same: we look at the surrounding words
and begin to build an idea of the sense of the word. We then take a new
set of data and assign "sense-tags" to its instances depending on their
surrounding words.

There were four phases that had to be implemented during the course
of this experiment.

Phase 1: Take a few files and randomize the instances in them. An
         instance can consist of two or more sentences.
         We then put A percent of them in a file called TRAIN
         and the remaining instances in a file called TEST.

Phase 2: We then take the TRAIN file and pick out the words
         that occur within a given window around the target word.
         For example, if a window of 2 is specified, then for the
         following sentence

         I went to the bank to get money

         the window words are to, the, to, get (the target word here
         being "bank"). We then check whether the window words occur
         above a certain specified frequency; those that do are
         selected as the features. (A sketch of this step is given
         after this list.)

Phase 3: We then convert the features obtained in Phase 2 into a feature
         vector representation, using both the TRAIN and TEST files.
         We are careful not to let features be derived from the TEST file.

Phase 4: Implement the Naive Bayesian Classifier
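
A minimal sketch of the Phase 2 window-word extraction, using the example
sentence above (the variable names are hypothetical):

    #!/usr/bin/perl
    # Collect the words within $w positions to the left and right of the
    # target word, counting how often each appears in the window.
    my $w      = 2;
    my $target = "bank";
    my @words  = split ' ', "I went to the bank to get money";

    my %window_count;
    for my $i (0 .. $#words) {
        next unless $words[$i] eq $target;
        for my $j ($i - $w .. $i + $w) {
            next if $j < 0 || $j > $#words || $j == $i;
            $window_count{ $words[$j] }++;
        }
    }
    print map { "$_=$window_count{$_}\n" } sort keys %window_count;
    # prints get=1, the=1, to=2; words above the frequency cutoff
    # then become the features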

The experiments were performed with different combinations of window
sizes and frequency cutoffs. The window sizes, frequency values and
the obtained accuracies are given below.

Window Size      Frequency    Accuracy
---------------------------------------
 0                 1           0.04012
 0                 2           0.04012
 0                 5           0.04012
 2                 1           0.04012
 2                 2           0.04012
 2		   5           0.04012
 10                1           5.37654
 10                2           5.37654
 10                5           5.37654
 25                1           5.37654
 25                2           5.37654
 25                5           5.37654

Analysis:
----------
From what I observed, the accuracy values are not correct; the program is
not performing as I had expected. It is very unusual that I would get the
above accuracy values if it were working as I had expected. I assumed
that as the window size increased I would get a larger number of "window
words", and that as the frequency cutoff increased I would get fewer window
words. So for a high window size and a low frequency cutoff, I thought that
the disambiguation would work better, because I have a larger number of
features to help me disambiguate the words.

For small window sizes I will not have many features which would
really help me to disambiguate a word, because the important features that
help in the "disambiguating process" would not necessarily occur within
the small window.

Thus a big window size and small frequency cutoff should work the best.

Conclusion:
-----------
I have found that there were many ways in which my experiment did not
perform as I had expected. The experiment helped me understand how
the process of word sense disambiguation works. Tagging sentences at
the Open Mind site also helped me encounter new sentences, meanings
and ways in which words are used, and it helped me understand what had
to be done before the experiment was started.
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
Bayesian Classifier Experiments Report

by Paul Gordon (1913768)
for CS 8761, Statistical Natural Language Processing
November 1, 2002

Objective:

To draw conclusions about how window size and cutoff frequency affect the accuracy of sense tagging.


Introduction:

This report summarizes the results of a Bayesian Classifier for determining word sense.  The experiment varies two parameters: window size and frequency cutoff.  The window size is the number of tokens on either side of the target word (the word to be sense tagged) that will be included as part of the context used to determine the sense tag. A word must appear more than the frequency cutoff number of times for it to be used as a context word.

Hypothesis:

A reasonable first approximation might be a Markovian argument. That is, that as the window size increases, it will increase the accuracy, but that at some point, it will begin to stabilize or fall again as the classifier incorporates tokens further and further from the target word.

An argument for the frequency parameter might be that as the frequency cutoff increases, it reduces the amount of context, and so will reduce the accuracy.

Results:


window size | frequency cutoff | accuracy

       0                1        0.5510
       0                2        0.5510
       0                5        0.5510

       2                1        0.7309
       2                2        0.7213
       2                5        0.7357

      10                1        0.7663
      10                2        0.8161
      10                5        0.8032

      25                1        0.8000
      25                2        0.8474
      25                5        0.8305


Conclusions:

It appears that the hypothesis was wrong on both counts.  As the window size increases, the accuracy increases without letup.  Also, as the frequency cutoff increases, the accuracy also increases.  It appears that the increase in relevance brought about by raising the frequency cutoff more than makes up for the reduction in context.  The results also seem to call into question the Markov model of local dependence.
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
Victoria Harp 1632107
CS8761 November 1, 2002
Programming Assignment #3 

    window size   |  frequency cutoff  |  accuracy
   --------------------------------------------------
	0		-		0.538955823293173
	2		1		0.601606425702811
	2		2		0.576706827309237
	2		5		0.525301204819277
	10		1		0.799196787148594
	10		2		0.769477911646586
	10		5		0.722088353413655
	25		1		0.809638554216867
	25		2		0.789558232931727
	25		5		0.719678714859438

As the frequency cutoff goes higher, the accuracy decreases.  This suggests
we are concentrating on stop words and ignoring contextual content.  Thus,
the optimal frequency cutoff is 1, where all words in the sentence-window
are given weight.

The optimal window was 25 (with a frequency cutoff of 1).  I was
surprised by this result because I figured our Markov Assumption would hold
true and only words very close to the target would influence the meaning.
However, after more analysis, I realized that a large window has room to
hold stop words and garbage while also picking up the important clue words.
The smaller windows were hampered by stop-word garbage.

Our experiment is slightly skewed because we are considering each
sentence individually.  If we were to expand our view to the paragraph
level, our data and analysis would change.  I suppose that could be
considered the Anti-Markov Assumption...  But that line of reasoning goes
beyond the scope of the current project.

The zeros in the probability column in the TAGGING outfile mean the
calculated probability value is very small (a product of many small
probabilities), so the significant digits are too far to the right to be
displayed.

I checked and double-checked my formula, but I couldn't see any way to
crank up the accuracy.  I welcome your suggestions.

Good night!
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
Prashant Jain

CS8761 Assignment #3: Finding meaning in Li(f|n)e


Introduction
------------

The objective of this assignment was to "explore supervised approaches to
word sense disambiguation". We have been given the line data, which contains
six files. There are a number of instances per file, but only one instance
per line. What we have to do is implement a Naive Bayesian Classifier to
perform word sense disambiguation: given an instance of 'line' in a file,
we have to find the sense that best fits it.



Procedure
---------

We had to create four files that together implemented the Naive
Bayesian Classifier. These were:


select.pl
---------

This file divides the given data into TEST and TRAIN data after sense
tagging it. It is provided with a Percentage argument as well as the
target.config file, which contains the regular expressions used to
extract lines and instance ids.
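
A minimal sketch of such a split, assuming one sense-tagged instance per
line on standard input (List::Util ships with modern Perls):

    #!/usr/bin/perl
    # Randomly divide tagged instances into TRAIN ($percent%) and TEST.
    use List::Util qw(shuffle);

    my $percent = 70;                      # e.g. a 70-30 split
    chomp(my @instances = <STDIN>);        # one sense-tagged instance per line
    @instances = shuffle @instances;       # random order

    my $cut = int(@instances * $percent / 100);
    open my $train, '>', 'TRAIN' or die $!;
    open my $test,  '>', 'TEST'  or die $!;
    print $train "$_\n" for @instances[0 .. $cut - 1];
    print $test  "$_\n" for @instances[$cut .. $#instances];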


feat.pl
-------

This file uses the TRAIN data together with the window size and frequency
cutoff (provided by the user at the command line) to get the feature
words (words of interest), which are put in the FEAT file.


convert.pl
----------

This file converts both the TRAIN and TEST data into their feature
vector representations using the features provided by feat.pl.
This is basically a binary representation of instances/senses
and features.



nb.pl
-----

This is the file in which we implement the Naive Bayesian classifier. We
use the files that we get after converting the TRAIN data to learn the
model, use it to assign sense values to the TEST data, and check how
accurately we were able to assign the correct senses.



Observations of Experiments
---------------------------

The following table describes the results we got from running our
experiments over the various combinations that had been given to us.



--------------------------------------------
|window size | frequency cutoff  | accuracy|
--------------------------------------------
|	0    |        1		 | 0.3728  |
|	0    |        2		 | 0.3728  |
|	0    |	      5		 | 0.3728  |
--------------------------------------------
|	2    |	      1		 | 0.7485  |
|	2    |	      2		 | 0.7485  |
|	2    |	      5		 | 0.7189  |
--------------------------------------------
|	10   |	      1		 | 0.7692  |				  
|	10   |	      2		 | 0.7663  |			  
|	10   |	      5		 | 0.7604  |
--------------------------------------------
|	25   |	      1		 | 0.7308  |
|	25   |	      2		 | 0.7337  |
|	25   |	      5		 | 0.7071  |
--------------------------------------------			  


We notice that generally, as we increase the window size, the accuracy
of our Naive Bayesian classifier increases. Intuitively this should be
expected: the more feature words we incorporate, the greater the chance
that they also occur in the test data, and the more evidence we have,
the higher the probability of assigning the correct sense.


We also notice that as we increase the frequency cutoff, there is a
definite decrease in the accuracy. Again, intuitively, this should be
expected: as we keep increasing the frequency cutoff, when (as in our
case) the stop words have not been eliminated, the surviving features
tend to be words like 'a', 'and' and 'the', which appear with far
greater frequency than, say, 'instrument' (which would be a helpful hint
towards the sense of line as 'phone'), so we lose the more interesting
but somewhat less frequent words.


We notice from the data that the maximum accuracy is reached when the
window size is 10 and the frequency cutoff is 1; no other combination,
above or below it, reaches the same accuracy. Hence this is the one that
seems optimal to me.

Just to check, I ran nb.pl for two more window sizes, 15 and 5. The
results were largely as I had expected. As we increased the window size
from 10 to 15, the accuracy decreased; this was expected, since we had
already seen a fall in accuracy from a window size of 10 to a window
size of 25. When decreasing the window size to 5, however, the accuracy
increased, which was not entirely expected.

|-----------|---------|---------|
|window-size|frequency|accuracy |
|-----------|---------|---------|
|   15	    |	5     |	0.7396  |
|-----------|---------|---------|
|    5	    |	5     |	0.7870  |
|-----------|---------|---------|


Hence I would like to conclude by saying that the Naive Bayesian
classifier implemented by me gives pretty decent results and shows the
property of increasing accuracy up to a certain window size and
frequency, after which it gradually goes down.


References:
-----------

Manning, C.D. and Schütze, Hinrich. 2000. Foundations of Statistical
Natural Language Processing. MIT Press, Cambridge, Massachusetts.

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

Rashmi Kankaria
Assignment no : 3
Due Date : 2nd nov 2002.


Objective : To explore supervised approaches to word sense disambiguation.   

Introduction : 
      Supervised disambiguation uses a set of training data which has
      already been generated for an ambiguous word by sense tagging its
      occurrences; this labeled data is then used to disambiguate the
      word in new instances.

      This experiment is an attempt to implement a Naive Bayesian
      Classifier.
 
         



Window size Freq_cutoff  accuracy   most common sense
----------------------------------------------------------------------
0           1                       product ( with probability 0.5320)               

0           2			    product ( with probability 0.5320)     

0           5			    product ( with probability 0.5320)     
----------------------------------------------------------------------
2           1            0.6474

2           2            0.6450

2           5            0.6434

----------------------------------------------------------------------  
10          1            0.7687

10          2            0.7631
 
10          5            0.7562

---------------------------------------------------------------------- 
25          1            0.7897  

25          2            0.7731

25          5            0.7655
----------------------------------------------------------------------



In the case where the window size is 0, the sense with the maximum
probability in the training data, here product, is assigned as the default
sense. The table above bears this out empirically.

Q> What effect do you observe in overall accuracy as the window size and
   frequency cutoffs change ?

   For a given 70-30 split of the training and test data, we can
   observe that the accuracy increases as we increase the window size.
   This is reasonable: as we increase the window size, and so the
   context, more features are available to disambiguate the word sense,
   so there is a greater chance that a feature is seen with a given
   sense, and hence a higher probability of guessing the sense of the
   word correctly.
 
   Looking at the table, we can draw certain conclusions about the
   pattern of accuracy for a given window_size and frequency cutoff.

   For a constant frequency cutoff, the accuracy increases considerably
   with window_size.
   E.g.: the accuracy increases from 0.645 to 0.7731 as we increase the
   window_size from 2 to 25.
   It is tempting to conclude that the accuracy will always increase
   with window_size.
 
   For a constant window_size, however, the accuracy does not increase
   with the frequency cutoff for all window sizes, as we might have
   expected, and I think that can be explained.

   With a low frequency cutoff, for a given window size, feature words
   that occur only rarely are also taken into consideration, along with
   stop words; neither gives significant information that is peculiar to
   the word in context and that would help to disambiguate it.

   On the other hand, if the window size is too large, extraneous
   information (noise) can get added as features, which on the contrary
   may not be so helpful.

   The most significant values here are window_size = 10 with freq cutoff = 1
   and window_size = 25 with freq cutoff = 5.
   If you observe these, the accuracy of the first is higher than the second,
   and this can help us decide the optimal freq cutoff and window size.


   The major flaw of the Naive Bayesian Classifier is that it considers all
   the features to be independent.
   This also affects the calculated probabilities of the features.


  Q> Are there any combinations of window size and frequency that appear to
   be optimal with respect to the others? Why?

   As argued above, there are a few combinations of window_size and frequency cutoff that appear optimal with respect to the others.

   As we can observe, for a given window size, the accuracy decreases as the cutoff increases.
   Also, there is no significant change in accuracy as we change the window size from 10 to 25, so the optimal window_size
   for this case can be 10, since just increasing the window size does not change the accuracy significantly, for 2 reasons:
   1. The context might still be smaller than the window_size; in this case, increasing the window size does not make any difference.
   2. It is the proximate context that matters most for disambiguating the sense, so having a larger window might not help much.


   As far as the frequency cutoff is concerned, the optimal number will be within 2-5: higher-frequency features are most
   of the time stop words that are of no help for disambiguation, while a frequency as low as 1 will output all the feature words,
   most of which are not relevant or occur too infrequently with the word we are going to disambiguate. A word within the
   given range will show strong association with the word we need to disambiguate over a large training set.


   This gives us the optimal values of window size and frequency cutoff.
   



References :

1. Manning, Christopher D. and Hinrich Schütze. Foundations of Statistical
   Natural Language Processing, pp. 235-239.

2. Wall, Larry, Tom Christiansen and Jon Orwant. Programming Perl
   (3rd edition).






++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

  NAME:  SUMALATHA KUTHADI
  CLASS: NATURAL LANGUAGE PROCESSING
  DATE:  11 / 1 / 02

		    CS8761 : ASSIGNMENT 3

  -> OBJECTIVE: TO EXPLORE SUPERVISED APPROACHES TO WORD SENSE DISAMBIGUATION.

  -> INTRODUCTION:

     -> WORD SENSE DISAMBIGUATION: ASSIGN A MEANING TO A WORD IN CONTEXT FROM SOME SET OF
        PREDEFINED MEANINGS(OFTEN TAKEN FROM A DICTIONARY).

     -> SENSE TAGGING: ASSIGNING MEANINGS TO WORDS.

     -> FROM A SENSE TAGGED TEXT WE CAN GET THE CONTEXT IN WHICH PARTICULAR MEANING OF A
        WORD IS FOUND.

     -> CONTEXT FOR HUMAN: TEXT + BRAIN

     -> CONTEXT FOR MACHINE: TEXT + DICTIONARY/DATABASE

  -> MAIN PARTS OF ASSIGNMENT:

     -> TO SELECT RANDOMLY A% OF THE INPUT TEXT AND PLACE IT IN THE TRAIN FILE.
        THE REMAINING TEXT IS PLACED IN THE TEST FILE.

     -> TO SELECT FEATURES FROM THE INPUT FILE (TRAIN FILE) WHICH SATISFY A FREQUENCY CUTOFF.

     -> TO CREATE A FEATURE VECTOR FOR EACH INSTANCE PRESENT IN THE TRAIN AND TEST FILES.

     -> TO LEARN A NAIVE BAYESIAN CLASSIFIER FROM THE OUTPUT OF THE THIRD PART OF THE ASSIGNMENT
        AND TO USE THAT CLASSIFIER TO ASSIGN SENSE TAGS TO THE TEST FILE.

  -> WHEN CREATING SENSE TAGGED TEXT YOU ARE BUILDING UP A COLLECTION OF CONTEXTS IN WHICH
     MEANINGS OF A WORD OCCUR. THESE CAN BE USED AS TRAINING EXAMPLES.

  -> THE BASIC PRINCIPLE INVOLVED IN WORD SENSE DISAMBIGUATION IS TO SELECT THE VALUE OF THE
     SENSE THAT MAXIMIZES THE PROBABILITY OF THAT SENSE OCCURRING IN THE GIVEN CONTEXT (MOST LIKELY SENSE).

  -> WHILE USING THE "NAIVE BAYESIAN CLASSIFIER" WE ASSUME THAT THE FEATURES ARE CONDITIONALLY
     INDEPENDENT: THEY DEPEND ONLY ON THE SENSE.

  -> NAIVE BAYESIAN CLASSIFIER : S = ARGMAX OVER SENSES OF P(SENSE) * PRODUCT(i = 1 TO N) P(C_i | SENSE),
                                 WHERE S IS THE CHOSEN SENSE AND C_1 ... C_N ARE THE CONTEXT FEATURES.
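
  -> A SMALL WORKED EXAMPLE OF THIS RULE (THE NUMBERS ARE HYPOTHETICAL):
     SUPPOSE P(CORD) = 0.4 AND P(PHONE) = 0.6, AND THE CONTEXT HAS TWO
     FEATURES C1, C2 WITH P(C1|CORD) = 0.2, P(C2|CORD) = 0.5 AND
     P(C1|PHONE) = 0.1, P(C2|PHONE) = 0.3. THEN

         SCORE(CORD)  = 0.4 * 0.2 * 0.5 = 0.040
         SCORE(PHONE) = 0.6 * 0.1 * 0.3 = 0.018

     SO THE CLASSIFIER ASSIGNS THE SENSE "CORD".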

 -> REPORT:
    -> WE RAN THE PROGRAMS WITH 12 COMBINATIONS OF WINDOW SIZE AND FREQUENCY CUTOFF,
       USING A 70-30 TRAINING-TEST DATA RATIO.

       WINDOW SIZE               FREQUENCY CUTOFF            ACCURACY

           0                        1                         0.5311

	   2			    1                         0.5344

	   10			    1                         0.5331

	   25			    1                         0.5040

	   0			    2                         0.5311

	   2			    2                         0.5000

	   10			    2  			      0.4991

	   25			    2                         0.4941

	   0			    5                         0.5311

	   2			    5                         0.4481

	   10			    5                         0.4341

	   25			    5                         0.4121

 -> OBSERVATIONS:

    -> WHEN THE WINDOW SIZE IS ZERO, NO MATTER WHAT THE VALUE OF THE FREQUENCY CUTOFF IS, THE ACCURACY IS THE SAME.

    -> WHEN THE FREQUENCY CUTOFF IS KEPT CONSTANT AND THE WINDOW SIZE IS INCREASED, THE ACCURACY GENERALLY DECREASES.

    -> WHEN THE WINDOW SIZE IS KEPT CONSTANT AND THE FREQUENCY CUTOFF IS INCREASED, THE ACCURACY DECREASES.

    -> THERE IS SOME RELATION BETWEEN FREQUENCY CUTOFF, WINDOW SIZE AND OVERALL ACCURACY,
       BECAUSE THE MEANING OF A WORD CAN BE GUESSED FROM ITS SURROUNDING WORDS.















++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

# ********************************************************************************

# experiments.txt            Report for Assignment #3      Word Sense Disambiguation
# Name: Yanhua Li
# Class: CS 8761
# Assignment #3: Nov. 1, 2002

# *********************************************************************************


This assignment is to apply a Naive Bayesian Classifier to perform word sense disambiguation.

I carried out the first 6 experiments with all 6 "line" files.
Since the CPU limit was exceeded when I ran experiments with window size 10,
I switched to working with 2 files. The CPU limit was also exceeded with
window size 25 for all three experiments, so I could not get results for the
last three experiments.
Resulting Table
******************************************************************

window size     |     frequency cutoff    |    accuracy               6 files

0                     1                        0.530120481927711  
0                     2                        0.530120481927711
0                     5                        0.530120481927711
2                     1                        1     
2                     2                        1  
2                     5                        1 
__________________________________________________________________
                                                                      2 files
10                    1                        1
10                    2                        1 
10                    5                        1

25                    1                        -                  
25                    2                        -
25                    5                        -


I also tried several other combinations of window size and frequency cutoff for 2 files,
for example:
window size     |     frequency cutoff    |    accuracy   
4                     5                         1
3                     4                         1
The accuracy results were the same.

From the results, we can see my Naive Bayesian Classifier works very well!
Except for window size 0, the window sizes and frequency cutoffs do not affect
accuracy at all, because my classifier always assigns the correct sense to every instance.
It is amazing?! I had expected the window size, the frequency cutoff and the combination
of the two to have something to do with accuracy: I thought that when the window size
and frequency cutoff were both not too big and not too small, the accuracy would be
the best. But if the method is too good and always does the right thing, then those
factors don't matter anymore.

For window size 0 we don't have any features, so we don't actually use the classifier
to assign senses; instead we give every instance the "most common sense", and the accuracy
is low. This is expected. It has something to do with the window size (because it is 0)
but nothing to do with the frequency cutoff (because there are no features whose frequency
could be counted).
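
A small Perl sketch of this most-common-sense baseline (illustrative only; @train and
@test are hypothetical arrays holding the true senses of the TRAIN and TEST instances):

	# with no features, the best we can do is guess the most
	# frequent TRAIN sense for every TEST instance
	my %count;
	$count{$_}++ for @train;
	my ($majority) = sort { $count{$b} <=> $count{$a} } keys %count;
	my $correct = grep { $_ eq $majority } @test;
	printf "baseline accuracy: %.4f\n", $correct / @test;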




++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
# Aniruddha Mahajan (Andy)
# CS8761 - Fall 2002
# Assignment 3 -- Word Sense Disambiguation


    Ordinary, everyday language is ambiguous in the sense that it lacks
    specific instructions which would enable the user to perfectly understand the
    information in the form it was expressed. In everyday life we as humans
    work around this ambiguity in various ways, be it through the 'extra'
    sense obtained from speech, through gestures, or simply by anticipating
    and deducing what a particular piece of language might mean. Resolving this
    ambiguity is word sense disambiguation.

          Computers have to be taught to disambiguate the sense of a word as they
    cannot learn by themselves... in some ways similar to a small child, yet in
    some ways so different. The disambiguating ability is achieved through a
    set of training data. Supervised word sense disambiguation uses training data
    that is tagged, i.e. each datum is connected to its sense.
    So basically the program has the training data to 'learn' word sense anticipation,
    or to compute the sense to be expected whenever a certain few words are around.

          In the given program a corpus of 'line' data was used. It had a total of 6
    senses according to which the data is grouped. The experiments performed on the
    data used 4 programs -- select.pl, feat.pl, convert.pl & nb.pl.
    In the last program we implement a Naive Bayesian Classifier along with
    Witten-Bell smoothing of the probability mass distribution. Due to the
    smoothing, unobserved events are not taken with a probability of zero;
    instead the probability mass distribution is shifted a bit to make way for them.
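
    A minimal Perl sketch of Witten-Bell smoothing as described above (illustrative,
    not the submitted code; $counts is assumed to be a hashref of observed event
    counts and $z the number of unseen events):

	# seen event: count/(tokens+types); unseen: types/(z*(tokens+types))
	sub wb_prob {
	    my ($counts, $z, $event) = @_;
	    my $types  = scalar keys %$counts;
	    my $tokens = 0;
	    $tokens += $_ for values %$counts;
	    return exists $counts->{$event}
	        ? $counts->{$event} / ($tokens + $types)
	        : $types / ($z * ($tokens + $types));
	}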

    The table shows the results for various testing combinations as conducted...

        Window Size  Frequency cutoff   Accuracy
       ------------------------------------------------

            0              1            0.5261
            0              2            0.5261
            0              5            0.5261

            2              1            0.7760
            2              2            0.7622
            2              5            0.7301

           10              1            0.8418
           10              2            0.8263
           10              5            0.8161

           25              1            0.8490
           25              2            0.8386
           25              5            0.8345
       -------------------------------------------------

       From the table shown we can observe that, clearly, as window size increases,
       the accuracy of prediction also increases. This might become stagnant
       after some point, and possibly decrease with further increases in the window
       size. On the contrary, as the frequency cutoff increases within a window size,
       the accuracy is found to decrease gently.
           Also, the data with a window size of zero (0) will always have the same
     accuracy - no matter what the frequency cut-off (for the same piece of training data).
     Another notable point is that even at a window size of 2 the accuracy seems reasonable
     enough. During the initial testing of these programs I did the experiments on only 2
     senses, and those too of only half the size. Still, a reasonable accuracy was achieved
     with W=4 and F=2.
           The percentage divider 'A' plays a major role in WSD.
     Here an A of 70 was taken ... thus the train data was more than twice the test data
     and our little program was introduced to a fair amount of feature words. Of course, just
     having a larger data corpus would also help, and would avoid the restrictiveness of a
     small corpus with a large A value.

	w8_087:11423:  <<phone2>> 0.0004 <<product2>>
	w7_047:14219:  <<product2>> 0.0002 <<text2>>
	w7_107:826:  <<product2>> 0.0170 <<product2>>
	w7_125:15262:  <<product2>> 0.0000 <<product2>>
	w9_7:12942:  <<division2>> 0.0002 <<division2>>
	w7_124:728:  <<product2>> 0.0000 <<product2>>
	w7_117:6660:  <<division2>> 0.0010 <<division2>>
	w7_114:6618:  <<product2>> 0.0031 <<product2>>
	w7_082:3841:  <<product2>> 0.0000 <<product2>>
	w7_034:16236:  <<product2>> 0.5386 <<product2>>
	w9_19:6617:  <<formation2>> 0.0002 <<product2>>
	w7_084:9614:  <<phone2>> 0.0000 <<division2>>
	ACCURACY = 0.7759

	Above pasted is a piece of the output obtained with W=2, F=1 on all 6 sense data files.
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
Typed and Submitted by: Ashutosh Nagle
Assignment 3: How to Find Meanings in Li(f|n)e.
Course:CS8761
Course Name: NLP
Date: Nov 1, 2002.

+-----------------------------------------------------------+
|                     Introduction                          |
+-----------------------------------------------------------+

	This assignment has following two parts:

1) Data Creation
~~~~~~~~~~~~~~~~

   Here I tagged 564 words provided by the Open Mind Word Expert project.


2) Naive Bayesian Classifier Implementation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	I have implemented this in four parts, as required
by the assignment. The point that I feel deserves mention is
Witten-Bell smoothing. Here,

Probability of an observed event = its frequency/(Types + Tokens)

Probability of an unseen event = Types/{z*(Types + Tokens)}
			       = (Types/z)/(Types + Tokens)

where z is the number of unseen events.

	Thus, in Witten-Bell smoothing, the frequency of an unseen
event is considered to be (Types/z) and the size of the event
space is considered to be (Types + Tokens), which is, obviously,
the same for both the probabilities of seen and unseen events.
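
	A quick worked example with made-up numbers (purely illustrative): with
Types = 100, Tokens = 900 and z = 50 unseen events, an event observed 3 times
gets probability 3/1000 = 0.003, while each unseen event gets
(100/50)/1000 = 0.002.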

    Also, here the probabilities do not (and should not) add to 1,
because we are looking at a particular combination of sense and
feature. A given instance can have multiple features for the same
sense that it has. So if we sum the number of features that occurred
with a particular sense, the sum will not be equal to the number
of occurrences of the sense. Every sense can potentially occur with
any number of features. So the values of the various features for a
given sense are not the various values of the same random variable but
are in fact various random variables which overlap with each other.



One Sense Dominates:-
    One of the senses always dominates. When I performed the
experiments with all 6 files in the line data, "production2" was
the dominating sense. When I performed it with "formation2", "cord2"
and "test2", "formation2" was the dominating sense.
    Accuracy goes fairly high (0.9211) when the experiment is performed
with fewer files. And I think this is fairly explainable -- when we have
more senses to choose from, the probability of error is higher (5/6). But
when we work with fewer senses the domain shrinks and the probability of
error comes down (2/3). For the same reason, the number of possible senses
has a much greater effect on the accuracy than the window size or frequency
cutoffs have.

     Also, I am seeing a strange phenomenon, which I initially felt to be
a bug in my code. I have spent 2 long days trying to figure this out, but
do not see any bug in the code. For all combinations of window size and
frequency cutoff, I get almost the same accuracy. When I print about 8-10
digits after the decimal point, I see a little variation in the values, but
not otherwise. So, to check whether the values can change at all, I
performed the experiments with different numbers of possible senses
(i.e. different numbers of files from the line data), and I found that the
values do change, but for a particular number of files (possible senses) they
remain the same.

       Tabulated below are my observations:

W=Window Size;
F=Frequency Cut Off;
A(i)=Accuracy with i files taken from the line data.


+---------------+---------------+---------------+---------------+--------+
|	W	|	F	|	A(3)	|	A(4)	|   A(6) |
+---------------+---------------+---------------+---------------+--------+
|	0	|	1	|	0.9211	|	0.7378	|  0.7089|
|	0	|	2	|	0.9211	|	0.7378	|  0.7089|
|	0	|	5	|	0.9211	|	0.7378	|  0.7089|
+---------------+---------------+---------------+---------------+--------+
|	2	|	1	|	0.9211	|	0.7378	|  0.7089|
|	2	|	2	|	0.9211	|	0.7378	|  0.7089|
|	2	|	5	|	0.9211	|	0.7378	|  0.7089|
+---------------+---------------+---------------+---------------+--------+
|	10	|	1	|	0.9211	|	0.7378	|  0.7089|
|	10	|	2	|	0.9211	|	0.7378	|  0.7089|
|	10	|	5	|	0.9211	|	0.7378	|  0.7089|
+---------------+---------------+---------------+---------------+--------+
|	25	|	1	|	0.9211	|	0.7378	|  0.7089|
|	25	|	2	|	0.9211	|	0.7378	|  0.7089|
|	25	|	5	|	0.9211	|	0.7378	|  0.7089|
+---------------+---------------+---------------+---------------+--------+


	I have kept all the output files in the directory

	/home/cs/nagl0033/NLP/assignment3/

	There are a few subdirectories here too: a directory "ij" contains the
files of experiments performed with window size 'i' and frequency cutoff 'j'.

      Intuitive Reasoning: I feel that as we increase the window size, we consider
more features, so we should be able to predict the sense more accurately. But
as the window size becomes too large, we start considering words that do
not even affect the ambiguous word. The presence of a word distant from
the target word in one instance may adversely affect the calculations, if that
word by chance occurs in the vicinity of the target word in some other
instance.
       
       Secondly, as the frequency cutoff is increased, only the words occurring
more frequently in the vicinity of the target word are considered, and those
which might have occurred by fluke are ignored. This, therefore, should aid
better prediction. On the other hand, if the frequency cutoff is too high,
then it is equivalent to looking for a few features that have a very huge effect
while ignoring a number of features, each of which has a fairly reasonable
effect, but not a huge one. As words follow a Zipfian distribution, we are
likely to find very few features if the cutoff is too high.

Expectation:
    Based on this reasoning, I had expected the (10,5) case (i.e. where the window
size is 10 and the cutoff is 5) to give me the optimal result. But I am not able to
see this in my table, probably because of some arithmetic error!



++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
			
			 CS 8761 NATURAL LANGUAGE PROCESSING
			 FALL 2002
			 ANUSHRI PARSEKAR
			 pars0110@d.umn.edu
			  
   
                   ASSIGNMENT 3 : HOW TO FIND MEANINGS IN LIFE. 
Introduction
------------
          Many times we come across words which have several meanings (or senses) and so there can be an
ambiguity about their meaning. Word sense disambiguation deals with assigning meanings or senses to
ambiguous words by observing the context or the neighboring words. In this assignment we attempt to
disambiguate various meanings of the word line by using sense-tagged line data, in which six different senses
of the word line have been identified and the data is divided into separate files for each sense.
	 
Programs implemented
---------------------
select.pl : This program tags every instance of the target word (here line) according to the file in
            which it is found and then randomly divides the data into TRAIN and TEST parts.
feat.pl   : This module identifies the feature words associated with the word to be disambiguated
	    (here 'line'). The range of feature words can be controlled by changing the window size
	    as well as the frequency of occurrence of the feature word. The feature words are obtained
	    from the TRAIN file only.
convert.pl: The input to this program is the TRAIN or TEST file and the list of feature words. Each instance
	    in the file is converted into a series of binary values that indicate whether or not each feature
	    word listed has occurred within the specified window around the target word in the given instance.
nb.pl     : This program learns a Naive Bayesian classifier from the feature vectors of TRAIN and uses that
	    classifier to assign sense tags to the occurrences of the target word in the TEST file.
	    The tag assigned is the sense having the maximum P(sense|context). The context words are
	    assumed to be conditionally independent.

Expected results
----------------
1. When the window size is zero we do not have any feature words or context to guess the sense of the target
   word. The only guideline we have is the probability of each sense, and we assign the sense which has the
   maximum probability to all the instances of the test data. The accuracy thus obtained will not depend
   on the frequency cutoff.
2. As we increase the window size, the number of feature words will increase. We expect an increase in
   accuracy with increasing window size, because a larger window means that we are considering a
   wider context. Hence feature words which were ignored for a small window size will also be used and might
   give a better accuracy.
3. Increasing the frequency cutoff will again reduce the number of feature words, which may reduce the accuracy.
4. If we run the program on just 50% of the training data then the accuracy should be reduced, as the classifier
   will have less data to learn from.
5. If the target word has fewer senses then a better accuracy can be expected, because the sense
   has to be chosen from a smaller number of candidates.


Results
--------
------------------------------------------------------------------------------------------
| window size | frequency cutoff | accuracy for   | accuracy for   | accuracy for    |
|             |                  | 70% TRAIN data | 50% TRAIN data | only two senses |
|-------------|------------------|----------------|----------------|-----------------|
|      0      |        1         |     0.5374     |     0.5590     |     0.4979      |
|      0      |        2         |     0.5374     |                |                 |
|      0      |        5         |     0.5374     |                |                 |
|-------------|------------------|----------------|----------------|-----------------|
|      2      |        1         |     0.6737     |     0.7413     |     0.7657      |
|      2      |        2         |     0.6358     |                |                 |
|      2      |        5         |     0.6132     |     0.6963     |     0.7448      |
|-------------|------------------|----------------|----------------|-----------------|
|     10      |        1         |     0.7429     |     0.8148     |     0.8870      |
|     10      |        2         |     0.7172     |                |                 |
|     10      |        5         |     0.6817     |     0.7689     |     0.8452      |
|-------------|------------------|----------------|----------------|-----------------|
|     25      |        1         |     0.7663     |                |     0.8912      |
|     25      |        2         |     0.7566     |                |                 |
|     25      |        5         |     0.7228     |                |     0.8703      |
------------------------------------------------------------------------------------------


Conclusions
-----------
1. The accuracy increases as we take a larger number of feature words. Thus, when we increase the window
   size or take a low frequency cutoff we are considering a more detailed and wider context.
   Hence the accuracy increases.
2. However, a very large window size does not help much, because the target word is most of the time
   unrelated to the distant words which appear as feature words when the window size is large.
3. The memory and time required for a high window size and low frequency cutoff are quite high compared to
   the increase in accuracy that they give. This should be considered while choosing an optimal combination
   of window size and frequency cutoff.
4. A window size of 10 and a frequency cutoff of 1 or 2 seem to give good accuracy for the line data.
   However, we cannot make general statements on this basis, since the senses of line are distinct and
   have a separate set of feature words for each sense.
5. Even when the experiment was performed by taking only 50% of the line data as the TRAIN data instead of 70%,
   the accuracy seems to have increased. This is the opposite of what was expected.
6. As expected, if the experiment is done using just two senses of the target word, the accuracy mostly
   increases.
7. Overall, the performance of the classifier is quite good for the line data, with average accuracy above
   0.7.
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
 REPORT
--------
 -----------------------------------------------------------------------------
  Prashant Rathi
  Date - 1st Nov. 2002
  CS8761 - Natural Language Processing
 -----------------------------------------------------------------------------

  Assignment no.3
 ------------------ 
  Objective :: To explore supervised approaches to word sense disambiguation.
  You will create sense-tagged text and see what can be learned from the
  same. Sense-tag some text and implement a Naive Bayesian Classifier to
  perform word sense disambiguation.

 -----------------------------------------------------------------------------

  IMPLEMENTATION:
  --------------
  The implementation is divided into four programs each of which does a 
  specific function.
  - select.pl creates Train Data and Test Data from the 'line' data by 
    selecting instances randomly.
  - feat.pl works on this Train Data and outputs the feature words selected
    given the window-size and frequency cutoff conditions.
  - convert.pl creates feature vector tables for the Test and Train data.
  - nb.pl implements the Naive Bayesian classifier which helps in word sense
    disambiguation and produces the accuracy results.

  * target.config is used to specify the regular expression for the target word
    and for matching instance_id's. This makes the programs more general,
    so they can easily be used for other data with slight modifications.
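
    A hypothetical illustration of the idea (the actual format of target.config is
    whatever the programs expect; these key names and patterns are invented for
    the example):

	# target.config (illustrative contents)
	TARGET=line[s]?
	INSTANCE_ID=instance id="([^"]+)"

    and a Perl sketch of reading such a file:

	open CONFIG, "target.config" or die "cannot open target.config: $!";
	my %config;
	while (<CONFIG>) {
	    chomp;
	    next if /^\s*(#|$)/;            # skip comments and blank lines
	    my ($key, $value) = split /=/, $_, 2;
	    $config{$key} = qr/$value/;     # compile each entry as a regex
	}
	close CONFIG;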

-------------------------------------------------------------------------------
 OBSERVATIONS:
 ------------ 
  Experiments were carried out with window sizes of 0, 2, 10 and 25 and
  frequency cut-offs of 1, 2 and 5. For these 12 combinations of frequency and
  window size, with a 70-30 training-test data ratio, the following observations
  were made -


 ---------------------------------------------
  window size | frequency cut-off | accuracy |
 ---------------------------------------------
       0      |          1        |  0.5418  |
       0      |          2        |  0.5418  |
       0      |          5        |  0.5418  |

       2      |          1        |  0.7355  |
       2      |          2        |  0.7251  |
       2      |          5        |  0.7002  |

       10     |          1        |  0.7685  |
       10     |          2        |  0.7902  |
       10     |          5        |  0.7982  |

       25     |          1        |  0.8053  |
       25     |          2        |  0.8264  |
       25     |          5        |  0.8384  |
  --------------------------------------------

-The accuracy results are as shown in the table. These are the accuracy results
observed after applying Witten-Bell smoothing. I also observed the accuracy
results without applying smoothing; those values were lower in comparison. Thus,
by distributing some probability among the unobserved types (smoothing), we
improve the accuracy results.

-The accuracy values observed when the window size is 0 are all the same for
different frequency values. This is because it is just the most frequently occurring
sense in the TRAIN data which is selected each time, and with a window size of 0,
frequency does not come into the picture. For a window size of 0 the feature file does
not contain any features, and the Test.fv and train.fv files just have the instance_id's
and the senses. This is the lower bound of the performance, the upper bound being
human performance. It was also observed that product2 was the most frequent
sense in this case.

-It can also be observed from the table that as the window size increases
we get better accuracy. I think this is due to the fact that the more we know of the
context around the target word, the higher the probability of getting the sense
right.

-For small windows, in my view, frequency variations do not produce changes in
accuracy as radical as they do for larger windows.

-The smoothing helps to improve the accuracy if the feature vectors are large
and contain a lot of 0 values (i.e. unobserved types).

-For window sizes of 10 and 25, it is seen that as frequency increases, the
accuracy increases. This may be because the more a feature word occurs with
the target word, the more it contributes to getting the correct sense.

-But in the case of window size 2, I observed that as frequency increases, accuracy
decreases slightly. The only thing I could deduce from this is that with a
lower frequency cutoff we get more words, and hence maybe better context to get
the correct sense.

-The probability values assigned by the classifier are very low, as can be seen from the output of nb.pl.

-It is also important to maintain the distinction between the Test and the Train
data. If the Test data itself is used for training purposes, then the results
would be flawed, as they don't allow us to explore all possibilities. It might also
be a good idea to create a few variant sets of test data for exhaustive testing.

-I think the way the instances are picked randomly plays a factor in the
accuracy values obtained. This is because the results depend on the number of
instances picked with a particular sense.

-------------------------------------------------------------------------------
CONCLUSION:
-----------

In my view, word sense disambiguation implemented using the Naive
Bayesian Classifier does give a reasonable amount of accuracy (at least the
results for the 'line' data show that). It is therefore one of the important
supervised learning methods.

-------------------------------------------------------------------------------
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
CS 8761: Assignment 3 - Naive Bayesian Classifier
Due:	 11/1/2002 12pm

Sam Storie
10/29/2002


+------------------------------------------------------------------------+
| Methods                                                                |
+------------------------------------------------------------------------+

Introduction:
	A Bayesian classifier (BC) is a technique used to classify a set of events
into "bins" and then use certain information to classify new events into one
of those bins. In the context of word sense disambiguation we use the set of
words surrounding a "target word" to determine the sense of that word.
These surrounding words are considered to be "features" of the
word and are treated as potential identifying items concerning the sense of
the target word. This assignment deals with determining the sense of the word
"line". To do this we examine a set of data where the sense of line has been
predetermined and use this data to "train" our classifier. Then we use the
information from the training data to try to determine the sense of some
unseen testing data. The details of this process are described in subsequent
sections.

Process:
	This assignment is broken into four seperate programs. The first program
examines a set of pre-tagged instances for line. To ensure our classifier
isn't developing a bias this program seperates this data into training and
testing data. The classifier is developed only using the training data, and
isn't shown the testing data until the actual performance testing is done.
This program randomly selects a certain percentage of the instances to be
placed into a file called TRAIN and the remainder are placed into a file
called TEST. This ensures that our classifier won't develop a bias.
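
	One way such a split might look in Perl (a minimal sketch, not the actual
select.pl; it assumes the tagged instances sit one per line in a hypothetical
@instances array and uses a 70% cut):

	# Fisher-Yates shuffle, then split at the 70% mark
	for (my $i = @instances - 1; $i > 0; $i--) {
	    my $j = int rand($i + 1);
	    @instances[$i, $j] = @instances[$j, $i];
	}
	my $cut = int(0.70 * @instances);
	open TRAIN, ">TRAIN" or die "cannot write TRAIN: $!";
	open TEST,  ">TEST"  or die "cannot write TEST: $!";
	print TRAIN @instances[0 .. $cut - 1];
	print TEST  @instances[$cut .. $#instances];
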
	The second stage of the classifier is a program that determines which
words occur within a certain window of the target word in the training
instances. Recall from the introduction that we are considering these contextual
words to be some sort of indicator of the sense for that instance. Of course
there are some overlaps for common words, but this is taken into account
during this process. To help identify words that appear often within this
window we can set a frequency cutoff to eliminate uncommon words. The result
of this stage is a list of words that occur within a certain window size, and
that also occur a certain number of times. This stage is only performed on the
*training* data, since using the testing data would introduce a bias towards
the features present in the testing data. This would obviously nullify any
results we ultimately obtained from that data.
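
	A sketch of this feature-extraction step (illustrative only; the target
pattern, the parameter values, and names like @train are assumptions, the window
is taken as $window words on each side, and real instances would need
tokenization beyond a whitespace split):

	my ($window, $cutoff) = (2, 1);      # illustrative parameter values
	my %freq;
	for my $instance (@train) {
	    my @words = split ' ', $instance;
	    for my $i (0 .. $#words) {
	        next unless $words[$i] =~ /^lines?$/;     # assumed target pattern
	        my $lo = $i - $window < 0       ? 0       : $i - $window;
	        my $hi = $i + $window > $#words ? $#words : $i + $window;
	        # count every word in the window except the target itself
	        $freq{$words[$_]}++ for grep { $_ != $i } $lo .. $hi;
	    }
	}
	my @features = grep { $freq{$_} >= $cutoff } keys %freq;
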
	The third program uses the list of words from the previous stage to create
feature vectors for all the instances of training and testing data. A feature
vector is simply a list of 1's and 0's that indicates whether a given word
occurred within the window used to generate the features. If a given instance
had a 1 for the word "IBM", then this would mean that "IBM" appeared within
the window of "line" for this instance. These feature vectors are generated
for both the training and testing data.
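
	For instance, such a conversion could look like this (a sketch under the
same assumptions as above; per-occurrence window handling is elided and the
whole instance is treated as the window):

	# one binary value per feature word: 1 if it occurs, else 0
	sub to_vector {
	    my ($instance, $features) = @_;   # $features: arrayref of feature words
	    my %seen = map { $_ => 1 } split ' ', $instance;
	    return join ',', map { $seen{$_} ? 1 : 0 } @$features;
	}
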
	The final stage is to actually create the classifier and attempt to assign
senses to the testing data. To do this we examine the feature vectors for the
training data and determine how often each feature occurred with each sense.
From this data we can compute the probability associated with a sense based on
what words occur within the window size specified in the earlier stages. Then,
when we examine the feature vector for a test instance, we can determine which
sense is most likely (based on the training instances used) given the features
that are in the window. The probability of each sense is computed by
multiplying the probabilities of each feature occurring, given the sense. Then
we simply select the sense with the largest resulting probability. The final
step is to assign a sense with the classifier and check it against the actual
sense. The final accuracy (number correct / total number of instances) is
reported to the user.
	Technically, using probabilities of the features occurring with the target
word is a very involved process, but we are implementing a *naive* BC. This
means that we treat the probability of each feature occurring with the target
word as conditionally independent. If we did not make this assumption we
would need to consider the probability of the features not just separately,
but occurring in the sequence they actually appear in. Zipf's law shows
us that once you start considering sequences of more than 3-4 words the
probability is very hard to estimate. This is simply because those sequences
probably do not occur very often and it's hard to assign how likely they
should be. This is a non-trivial distinction, but the method impressively still
generates some useful results. As a final note, in the case of a feature not
occurring within a training instance, which would give a probability of zero,
we use Witten-Bell smoothing to give those cases a probability to use in the
calculations.


+------------------------------------------------------------------------+
| Results                                                                |
+------------------------------------------------------------------------+
	I ran this set of four programs with a combination of window sizes and
frequency cut-offs. Per the assignment, the initial set of instances was split
into 70% training data, and the remaining 30% were set aside for testing. The
results are summarized as follows:


      Window size | Frequency cutoff | Accuracy
     +------------------------------------------+ 
           0               1             .5490 
           0               2             .5361
           0               5             .5418
           0              10             .5080
           2               1             .4679
           2               2             .4735
           2               5             .4461
           2              10             .3915 <-- Worst result
          10               1             .5812
          10               2             .5820
          10               5             .6326
          10              10             .6391
          25               1             .5924
          25               2             .6761
          25               5             .6833
          25              10             .7219 <-- Best result
     +------------------------------------------+		  



	As you can see in the table, the best accuracy obtained was about 72%, and
this was consistent across several trials. I found it rather interesting to
see how the window size affected the accuracy across the trials. For a window
size of 0 I would expect the accuracy to match the frequency of the
most common sense. This is because, with no features to help in the
classification, the only data that can be used is how often each sense
occurred. The result is that every testing instance is tagged with the same sense.
In the case of line, the "product" sense occurs about 55% of the time (2218 out
of 4149 instances), and this is close to the results obtained for
the window size of 0. There is a little fluctuation due to the randomization of
the data, but it's easy to see this is close to what would be expected based
on the frequency of the "product" sense.
	Perhaps more interesting is the case when a window size of 2 is used.
Intuitively I imagined that including more data would help determine the
sense, but it's clear that this is not only failing to increase the accuracy,
it's actually causing it to go down. After seeing this result I thought about
what might be causing it, and I believe it's due to the types of words that occur
within that small window. Semantically important words probably do not
immediately precede or follow a target word, but are separated from it by common
words like "the", "for", "at", etc. The classifier is basing its decisions on
meaningless words that have skewed its choices towards an uncommon sense
(frequency-wise), and when it is used to guess instances where another sense
dominates, this produces poor results. I think this is also evident when you
examine what happens when the frequency cut-off is raised. This reduces
the feature set to only those "bad words" and further skews the
decisions.
	More of what I expected is shown in the results for window sizes of 10
and 25. I imagined that when you allow the sense-indicating features to
appear farther away from the target you should get better results. I think
this is exactly what is happening. The corollary here is that by doing so we
open the chance for more "bad words" to start appearing as features, but it's
easy to see that as we restrict the frequency requirement (to eliminate more
of these extra features) our results improve. While those same meaningless
words will still appear as features, now we have other features to base the
final decision on. With a window size of 25 we're essentially examining the
entire instance, so there's a high chance that if some identifying feature
occurs, the classifier will include it and it will influence the final
decisions.
	On a final note, the meaning of a 72% success rate is difficult to gauge,
because for a human this task is often trivial. However, given that the actual
tests are based solely on the instance itself, I think it's rather impressive
that such a simple technique can get a correct result 72% of the time. I
imagine that some tweaking of the various parameters of the experiment, or even
of the specific data being used, could probably produce better results. Of course,
these results are bounded by the amount of training data available (about 3600
instances in this case) and the breadth of examples that data covers. Still,
with these limitations in mind, the Naive Bayesian Classifier is a simple and
clearly useful technique for performing word sense disambiguation.
	


+-------------------------------------------------------------------------+
| References and More Information                                         |
+-------------------------------------------------------------------------+
(1) Original assignment on Dr. Pedersen's web page
	- http://www.d.umn.edu/~tpederse/Courses/CS8761/Assign/a3.html

++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++

                                      \\\|/// 
                                    \\  ~ ~  // 
                                     (  @ @  )  
**********************************-oOOo-(_)-oOOo-************************************
                                                                                        
     CS 8761 - Natural Language Processing - Dr. Ted Pedersen                     
                                                                                   
     Assignment 3 : HOW TO FIND MEANINGS IN LIFE              		   		                
                                                                      
     Due : Friday , November 1,  Noon
                                                                                   
     Author : Anand Takale (  )                                         
                                                                                  
***************************************-OoooO-***************************************
                              .oooO     (   ) 
                              (   )      ) / 
                               \ (      (_/ 
                                \_) 


Objectives : To explore supervised approaches to word sense disambiguation. To create
----------   sense-tagged text and see what can be learned from the same. 


Specification : Sense tag some text and implement a Naive Bayesian classifier to perform 
-------------   word sense disambiguation.


----------------------
Part I : Data Creation
----------------------

Tag 500 instances/sentences on Open Mind Word Expert website.

login id : Anand
Project : CS8761-UMD
Instances Tagged : 515

There was no output to be turned in from Open Mind sense tagging.


--------------------------------------------------
Part II : Naive Bayesian Classifier Implementation
--------------------------------------------------

This assignment mainly deals with word sense disambiguation. The problem to be solved
is that many words have several meanings or senses. For such words, given out of context,
there is ambiguity about how they are to be interpreted. Thus our task is to do the
disambiguation, i.e. to determine which of its senses an ambiguous word invokes in
a particular use of the word.

This assignment required us to implement a Naive Bayesian Classifier. In short, it
learns a Naive Bayesian classifier from the TRAIN data and uses the classifier to assign
sense tags to the TEST data.

The idea of the Bayes classifier is that it looks at the words around an ambiguous word
in a large context window. Each content word contributes potentially useful information
about which sense of the ambiguous word is likely to be used with it. The classifier
does not do feature selection. Instead it collects evidence from all features.

Implement a Naive Bayesian Classifier to perform word sense disambiguation. Perform your
experiments with the "line" data.

The assignment required the implementation of four modules:

(1) select.pl -- program to construct TRAIN and TEST data
(2) feat.pl -- program to extract features.
(3) convert.pl -- program to build the feature vector representation
(4) nb.pl -- program to learn a Naive Bayesian Classifier from TRAIN.FV and use that
             classifier to assign sense tags to TEST.FV

Our main task was to create a Naive Bayesian Classifier from the TRAIN data and then
disambiguate the words from the TEST data.

We had to perform disambiguation of the "line" data. For this, the six files containing
different senses of line were combined; lines were read randomly from any of the
files and put into TRAIN and TEST files. The TRAIN file consisted of 70% of the data
while the TEST file consisted of 30% of the data.

After creating the TRAIN and TEST files, the features which had a frequency greater
than the cutoff frequency were selected. After observing the extracted features it
was noted that conjunctions, prepositions and determiners like 'a', 'an',
'and', 'the' occurred in almost all the cases. To be more specific about the task of
disambiguation, we need to find more dependent collocations and try to eliminate
all the 'stop' words like the conjunctions, determiners etc. so that the accuracy
of the classifier improves.
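
One simple way to do this in Perl (a sketch; the stop list shown is a made-up sample,
and @features stands for the extracted feature words) is to filter the candidates
before the cutoff test:

	# drop stop words so determiners and prepositions never become features
	my %stop = map { $_ => 1 } qw(a an and the of to in for on at);
	@features = grep { !$stop{lc $_} } @features;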

------------
Experiments
------------

Experiment with window sizes of 0,2,10 and 25. Use frequency cutoffs of 1,2 and 5.
Run your classifiers with all 12 possible combinations of window size and frequency
cutoff using a 70-30 training-test data ratio.

Report the accuracy values that you obtain for each combination in a table.


------------------------------------------------------------
window size     |     frequency cutoff    |    accuracy
------------------------------------------------------------
	0       |	   1              |	0.5399
	0       |	   2		  |     0.5399
	0       |	   5		  |     0.5399
------------------------------------------------------------
	2       |	   1		  |     0.6401
	2       |	   2		  |	0.5911
	2       |	   5		  |	0.5413
-----------------------------------------------------------
	10      |	   1		  |	0.7044
	10      |	   2		  |	0.6996
	10      |	   5		  |	0.5662
-----------------------------------------------------------
	25	|	   1		  |	0.6843
	25	|	   2		  |	0.6707
	25	|	   5		  |	0.5390
------------------------------------------------------------


-----------------------------------------------------------------------------------
What effect do you observe in overall accuracy as the window size and the frequency
cutoffs change ?
----------------------------------------------------------------------------------

After observing the data recorded in the tables above we come to the following 
conclusions : 

(1) When the window size is zero, i.e. there are no features available to train
    the Naive Bayesian Classifier, the sense that occurs the most in the TRAIN data
    is assigned as the calculated sense for all the instances of the TEST data. In the
    case of the line data, the accuracy observed for a window size of 0 was 0.5399. This
    was the probability of the sense that occurred the most. From this we can conclude
    that this accuracy can be reached even without any prior knowledge, i.e. without
    training the Naive Bayesian Classifier. So this is the lowest accuracy that can be
    obtained from this classifier: for the line data the minimum accuracy obtained
    will be 0.5399.

(2) For a window size of 2 the value of accuracy increases considerably. We see that
    the accuracy goes up to 0.6401 for a cutoff frequency of 1. What we can
    conclude from this is that with a window size of 2 we can predict the sense of
    almost 64% of the test instances. So we see that disambiguation becomes easier
    when we have some context rather than no context at all. One more observation
    that can be made at this point is that, according to Zipf's law, most of the
    bigrams occur rarely while a few occur frequently, and we observe the same effect
    here: for the window of size 2, as the cutoff frequency goes on increasing, the
    accuracy goes on decreasing. Since most bigrams occur very few times, as the
    frequency cutoff increases the chances of the words surviving the cutoff become
    less. This is what we observe from the table tabulated above.

(3) Again, observing the values of the accuracy for different window sizes, we observe
    that the accuracy goes on increasing till a certain point before it
    falls off again; the phenomenon of decreasing accuracy with increasing
    window size is observed. In both these cases, i.e. with an increase in window
    size and with an increase in cutoff frequency, we see that there is a considerable
    decrease in accuracy. This observed fact can be explained as follows: when we
    increase the window size from 0 to 25 the number of features present goes
    on increasing. As the number of features goes on increasing, the Naive Bayesian
    Classifier undergoes a lot of learning. But not all the features are important for
    training the classifier. The stop words, i.e. all the conjunctions and determiners,
    are actually of no use for training the classifier. What we are looking for is a
    more solid feature which is closely connected to the tagged word. By finding such
    solid features which are closely connected to the tagged words it becomes
    much easier to train the classifier. In short, the decrease in accuracy
    can be considered a consequence of the increase in noise words which get added
    to the feature vector. These noise words are introduced as a result of an
    increase in window size or an increase in cut-off frequency.
    


------------------------------------------------------------------------------------
Are there any combinations of window size and frequency that appears to be optimal 
with respect to the others ? Why ?
-----------------------------------------------------------------------------------

From the table we observe that the optimum value of accuracy
is 0.7044. This value of accuracy is observed when the window size is 10 and the cutoff
frequency is 1.

These optimum values of the window size and the cutoff frequency are specific to this data.

(1)  The range of the optimum value of window size should remain similar. The value of accuracy
     draws a curve with respect to the window size. Initially, as the window size increases,
     the accuracy also increases, till it reaches an optimum value from where it again
     starts to fall off. This is due to the fact that as the window size increases, the
     number of noise words also increases, which reduces the overall accuracy of the classifier.
     This is because the more connected tokens occur fewer times than the stop words, i.e.
     conjunctions, determiners etc.

(2)  As far as the relation between the cutoff frequency and the accuracy goes, it is
     observed that as the cut-off frequency increases, the accuracy goes on decreasing.
     This observed fact is due to the reason that as the cutoff frequency increases, the
     more dependent words (or, we can say, words more specific to that sense or context),
     which occur very few times, are eliminated from being considered as features. This
     leaves only the stop words, i.e. the conjunctions, prepositions, determiners etc.,
     as the features. Now these stop words can't train the classifier as the specific
     dependent tokens do. These stop words occur in almost all the instances equi-probably,
     and hence it becomes more difficult to disambiguate the sense, causing the accuracy
     to go lower.

So in short we can say that a midrange window size and a lower cutoff frequency would most
probably give the optimum value for accuracy. We also conclude that a window size of
0 gives the least accuracy.

-----------------
References :
-----------------

(1) Programming Perl - O'Reilly Publications
(2) Foundations of Statistical Natural Language Processing - Christopher D. Manning and Hinrich Schutze
(3) Problem Definition and some related concepts and issues - Pedersen T. ( www.d.umn.edu/~tpederse )