+++++++++++++++++++++++++++++++++++++++++++++++++

CS8761: Assignment 4 - Report
-----------------------------
Author: Amine Abou-Rjeili
Date: 11/10/2002

Files included in tar file:
---------------------------

select.pl - Splits the data into TRAIN and TEST sets according to the
        given ratio.

feat.pl - Each line from the input file is split into 3 parts: key,
        sense, and sentence. Features are extracted from the sentence
        part only.

convert.pl - Converts a data file into feature representation.

nb.pl - Runs a Naive Bayesian classifier on the data. Changes made as
        compared to the previous version:
        1) The probability of the sense that occurred the most times
           is displayed in the case that there are no features and
           classification falls back on the "most common sense"
           algorithm.
        2) The probability is displayed using 15 decimal places
           instead of the mathematical 'e' format.

run-experiment - Performs the experiments given a set of data. The
        data is split at a 70-30 TRAIN/TEST ratio and the experiments
        are carried out according to the constant and variable
        environments (see description below). The following files
        will be output:

        experiment.txt - report of results
        FEAT           - list of features
        TAGGING.0-1    - tagging detail for window 0, frequency cutoff 1
        TAGGING.0-2    - tagging detail for window 0, frequency cutoff 2
        TAGGING.0-5    - tagging detail for window 0, frequency cutoff 5
        TAGGING.10-1   - tagging detail for window 10, frequency cutoff 1
        TAGGING.10-2   - tagging detail for window 10, frequency cutoff 2
        TAGGING.10-5   - tagging detail for window 10, frequency cutoff 5
        TAGGING.2-1    - tagging detail for window 2, frequency cutoff 1
        TAGGING.2-2    - tagging detail for window 2, frequency cutoff 2
        TAGGING.2-5    - tagging detail for window 2, frequency cutoff 5
        TAGGING.25-1   - tagging detail for window 25, frequency cutoff 1
        TAGGING.25-2   - tagging detail for window 25, frequency cutoff 2
        TAGGING.25-5   - tagging detail for window 25, frequency cutoff 5
        target.config  - config file
        TEST           - test examples
        TEST.FV        - test examples in feature format
        TRAIN          - train examples
        TRAIN.FV       - train examples in feature format

Files added to get statistics and convert the Open Mind data into data
appropriate for the classifier:

count_words.pl - Counts the total number of words that have been
        tagged in the files given as command line arguments. This
        script is used to count how many words were tagged in the
        entire cs8761-umd.full.detailed file.

find_good_words.pl - Calculates statistics from the data in the file
        given on the command line (the cs8761-umd.full.detailed
        file). The statistics calculated are the criteria, specified
        in the assignment, that make a "good" word. In addition, the
        script will filter out any duplicate and bad examples and
        split the input file into multiple files, each corresponding
        to a particular word. Each 'word' file will contain all
        examples for that particular word, and each example will
        occur only once. For example, if we assume that the
        cs8761-umd.full.detailed file contains examples for the words
        'edge' 'energy' 'hope', then this script will create 3 files
        in the current directory called 'edge' 'energy' 'hope',
        corresponding to each word from the main file.
Each file will contain the examples from the detailed file except for
the ones that have been filtered out. Examples are filtered out
according to the following rules (a sketch of this logic appears
after the output format description below):

1) If an example has been tagged more than once and none of the
   taggings agree at least 2 times, the example is dropped.
2) If an example has been tagged more than 2 times and more than 1
   sense has an agreement rate of 2 or more, then the sense with the
   highest agreement rate will be taken. The others will be dropped.
3) In case of a tie in agreement rate between more than 2 senses, the
   last sense will be picked and the rest dropped.

These rules produce files in which each example ID occurs at most one
time.

convert_data.pl - Converts the files output by the above script into
        the format that the classifier understands. This is the same
        format as the line data. However, the input format must be
        the same as that output by the script find_good_words.pl.

For more information about the included scripts, see the documentation
provided at the beginning of each script.

EXAMPLE RUN USING PROVIDED SCRIPTS:
-----------------------------------

shell> find_good_words.pl /home/cs/tpederse/CS8761/Open-Mind/cs8761-umd.full.detailed > GOOD_WORDS2

Select words from the list of possible good words and run as follows,
e.g. for 'aspect':

shell> convert_data.pl aspect aspect /home/cs/tpederse/CS8761/Open-Mind/ids-to-sentences

Now we will have a list of files corresponding to the senses of the
word 'aspect' in the appropriate format.

Optional:
---------

shell> run-experiment all_senses_file

This will run all the experiments on the provided senses file.

EXPERIMENTS AND ANALYSIS:
-------------------------

The task for this assignment is to take the Naive Bayesian classifier
implemented in assignment 3 and run it using 3 "good" words from the
data obtained for the class from the Open Mind project. A good word
is defined as follows (as taken from the Assignment 4
specifications):

* Has a reasonable rate of agreement among the two taggers.
* Has a somewhat balanced distribution of senses. At the very least
  avoids the case where a single sense dominates the distribution.
* Has a reasonable number of examples.

To help in identifying "good" words, I created the script
find_good_words.pl to print out some statistics and decompose the
examples from the main Open Mind data file into separate files for
each word (see the description above for more information). An output
extract from a run of the script on the cs8761-umd.full.detailed file
is as follows:

'Word'  'Example Count'  'Number of senses'  'List of senses'

arc      64   3 - 1:19:00:: 1:25:00:: 1:25:01::
        Sense Frequency
        1:19:00:: = 7
        1:25:00:: = 9
        1:25:01:: = 48
argument 94   3 - 1:10:00:: 1:10:02:: 1:10:03::
        Sense Frequency
        1:10:00:: = 39
        1:10:02:: = 32
        1:10:03:: = 23
art      110  4 - 1:06:00:: 1:10:00:: 1:04:00:: 1:09:00::
        Sense Frequency
        1:06:00:: = 31
        1:10:00:: = 19
        1:04:00:: = 35
        1:09:00:: = 25
aspect   100  5 - 1:24:00:: 1:07:02:: 1:09:01:: 1:09:00:: 1:07:01::
        Sense Frequency
        1:24:00:: = 3
        1:07:02:: = 23
        1:09:01:: = 15
        1:09:00:: = 55
        1:07:01:: = 4

The output has the following format:

* The first line of each word is as follows:

  word  total_number_of_occurrences  total_number_of_senses - list_of_senses

  The tab character (\t) is used as the field delimiter.

* The following lines list the senses with a count of how many times
  each sense was encountered in the examples file. Each sense is
  displayed on its own line and is indented with a tab character (\t)
  at the beginning of the line.
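As a side note, the filtering logic described by rules 1-3 above can
be sketched in Perl roughly as follows. This is a minimal
illustration of the rules, not the actual internals of
find_good_words.pl; the subroutine name and data layout are
assumptions made for the example.

  #!/usr/bin/perl -w
  use strict;

  # Given all sense tags recorded for one example ID, return the
  # single sense to keep, or undef if the example must be dropped.
  sub pick_sense {
      my @tags = @_;                 # e.g. ('1:25:00::', '1:25:00::', '1:19:00::')
      return $tags[0] if @tags == 1; # tagged once: nothing contradicts it

      my %count;
      $count{$_}++ for @tags;

      # senses with an agreement rate of 2 or more
      my @agreed = grep { $count{$_} >= 2 } keys %count;
      return undef unless @agreed;   # rule 1: no agreement, drop example

      # rules 2 and 3: keep the sense with the highest agreement rate;
      # on a tie the last sense examined wins and the rest are dropped
      my $best;
      for my $sense (@agreed) {
          $best = $sense if !defined $best || $count{$sense} >= $count{$best};
      }
      return $best;
  }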
These statistics show the sense distribution for a word together with
the total number of available examples according to the filtering
rules specified above. However, these filtering rules introduce the
following pitfall:

* In the case that each example has been tagged ONLY ONCE, all the
  examples will be included, since there is nothing to contradict a
  particular example, so it is assumed to be correct. This is the
  case for the word 'art' in the cs8761-umd.full.detailed data.

Therefore, in addition to looking at these statistics, it is
recommended to also check the actual data file. These statistics are
meant to help in the process of identifying good words, not to
actually do the identification.

As mentioned above, this script will also produce a file for each
word encountered in the main data file. After the find_good_words.pl
script had been run, the output was examined to identify six good
words. These are as follows (with their statistics):

* argument 94  3 - 1:10:00:: 1:10:02:: 1:10:03::
        Sense Frequency
        1:10:00:: = 39
        1:10:02:: = 32
        1:10:03:: = 23

* aspect 100  5 - 1:24:00:: 1:07:02:: 1:09:01:: 1:09:00:: 1:07:01::
        Sense Frequency
        1:24:00:: = 3
        1:07:02:: = 23
        1:09:01:: = 15
        1:09:00:: = 55
        1:07:01:: = 4

* edge 106  6 - 1:15:00:: 1:07:00:: 1:06:01:: 1:06:00:: 1:25:00:: 1:07:01::
        Sense Frequency
        1:15:00:: = 19
        1:07:00:: = 38
        1:06:01:: = 6
        1:06:00:: = 15
        1:25:00:: = 15
        1:07:01:: = 13

* energy 148  6 - 1:14:00:: 1:07:00:: 1:19:00:: 1:26:00:: 1:07:02:: 1:07:01::
        Sense Frequency
        1:14:00:: = 6
        1:07:00:: = 51
        1:19:00:: = 67
        1:26:00:: = 12
        1:07:02:: = 6
        1:07:01:: = 6

* hope 111  5 - 1:07:00:: 1:18:00:: 1:12:00:: 1:09:00:: 1:12:01::
        Sense Frequency
        1:07:00:: = 1
        1:18:00:: = 9
        1:12:00:: = 29
        1:09:00:: = 44
        1:12:01:: = 28

* length 111  5 - 1:06:00:: 1:07:00:: 1:07:02:: 1:07:03:: 1:07:01::
        Sense Frequency
        1:06:00:: = 6
        1:07:00:: = 27
        1:07:02:: = 23
        1:07:03:: = 23
        1:07:01:: = 32

I chose these six words because they have a good number of senses
(more than 2) and an acceptably balanced distribution, so no sense
totally dominates the distribution.

For the experiments, I used the 70-30 ratio as in assignment 3. I
tried other ratios and the results were approximately the same, with
some ratios giving worse results, so I decided to adopt this ratio
scheme. The reasoning behind trying different ratios is that for all
of these words the number of examples available to train and test on
is relatively small compared to the line data. So I tried using a
higher ratio for training, such as 75-25, 80-20, and 90-10, but the
results did not show any improvement, and in the case of 90-10 they
were actually worse. The same window sizes and frequency cutoffs as
in assignment 3 were therefore used for the experiments here.
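For reference, the kind of shuffled 70-30 split that select.pl
performs can be sketched as follows. This is a minimal sketch, not
the actual select.pl code: it assumes one instance per input line and
hard-codes the ratio rather than reading it from the command line.

  #!/usr/bin/perl -w
  use strict;
  use List::Util qw(shuffle);

  # Read all instances, shuffle them, and write the first 70% to
  # TRAIN and the remaining 30% to TEST.
  my @instances = shuffle <STDIN>;
  my $cut = int(0.70 * @instances);

  open my $train, '>', 'TRAIN' or die "TRAIN: $!";
  open my $test,  '>', 'TEST'  or die "TEST: $!";
  print $train @instances[0 .. $cut - 1];
  print $test  @instances[$cut .. $#instances];
  close $train;
  close $test;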
The following results were obtained for each word:

argument-data/experiment.txt
-----------------------------

window size  frequency cutoff  accuracy
0            1                 0.3929  [11 of 28 correct]
0            2                 0.3929  [11 of 28 correct]
0            5                 0.3929  [11 of 28 correct]
2            1                 0.5357  [15 of 28 correct]
2            2                 0.4643  [13 of 28 correct]
2            5                 0.5357  [15 of 28 correct]
10           1                 0.4286  [12 of 28 correct]
10           2                 0.5000  [14 of 28 correct]
10           5                 0.3929  [11 of 28 correct]
25           1                 0.4643  [13 of 28 correct]
25           2                 0.5000  [14 of 28 correct]
25           5                 0.3571  [10 of 28 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.2500  [ 7 of 28 correct]
0            2                 0.3571  [10 of 28 correct]
0            5                 0.5000  [14 of 28 correct]
2            1                 0.5714  [16 of 28 correct]
2            2                 0.6429  [18 of 28 correct]
2            5                 0.5357  [15 of 28 correct]
10           1                 0.3571  [10 of 28 correct]
10           2                 0.5714  [16 of 28 correct]
10           5                 0.4643  [13 of 28 correct]
25           1                 0.3571  [10 of 28 correct]
25           2                 0.6429  [18 of 28 correct]
25           5                 0.5714  [16 of 28 correct]

aspect-data/experiment.txt
--------------------------

window size  frequency cutoff  accuracy
0            1                 0.4667  [14 of 30 correct]
0            2                 0.4667  [14 of 30 correct]
0            5                 0.4667  [14 of 30 correct]
2            1                 0.4333  [13 of 30 correct]
2            2                 0.3333  [10 of 30 correct]
2            5                 0.4000  [12 of 30 correct]
10           1                 0.4333  [13 of 30 correct]
10           2                 0.3333  [10 of 30 correct]
10           5                 0.3000  [ 9 of 30 correct]
25           1                 0.4333  [13 of 30 correct]
25           2                 0.4000  [12 of 30 correct]
25           5                 0.4333  [13 of 30 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.6667  [20 of 30 correct]
0            2                 0.5333  [16 of 30 correct]
0            5                 0.5333  [16 of 30 correct]
2            1                 0.4000  [12 of 30 correct]
2            2                 0.5000  [15 of 30 correct]
2            5                 0.4000  [12 of 30 correct]
10           1                 0.5000  [15 of 30 correct]
10           2                 0.3667  [11 of 30 correct]
10           5                 0.3333  [10 of 30 correct]
25           1                 0.4333  [13 of 30 correct]
25           2                 0.5000  [15 of 30 correct]
25           5                 0.3333  [10 of 30 correct]

edge-data/experiment.txt
------------------------

window size  frequency cutoff  accuracy
0            1                 0.3871  [12 of 31 correct]
0            2                 0.3871  [12 of 31 correct]
0            5                 0.3871  [12 of 31 correct]
2            1                 0.2903  [ 9 of 31 correct]
2            2                 0.2903  [ 9 of 31 correct]
2            5                 0.3226  [10 of 31 correct]
10           1                 0.4516  [14 of 31 correct]
10           2                 0.3871  [12 of 31 correct]
10           5                 0.3226  [10 of 31 correct]
25           1                 0.3871  [12 of 31 correct]
25           2                 0.2581  [ 8 of 31 correct]
25           5                 0.2903  [ 9 of 31 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.3226  [10 of 31 correct]
0            2                 0.3548  [11 of 31 correct]
0            5                 0.3548  [11 of 31 correct]
2            1                 0.3548  [11 of 31 correct]
2            2                 0.2903  [ 9 of 31 correct]
2            5                 0.2903  [ 9 of 31 correct]
10           1                 0.2903  [ 9 of 31 correct]
10           2                 0.2581  [ 8 of 31 correct]
10           5                 0.3548  [11 of 31 correct]
25           1                 0.3871  [12 of 31 correct]
25           2                 0.3548  [11 of 31 correct]
25           5                 0.1613  [ 5 of 31 correct]

energy-data/experiment.txt
--------------------------

window size  frequency cutoff  accuracy
0            1                 0.5455  [24 of 44 correct]
0            2                 0.5455  [24 of 44 correct]
0            5                 0.5455  [24 of 44 correct]
2            1                 0.5000  [22 of 44 correct]
2            2                 0.4773  [21 of 44 correct]
2            5                 0.4091  [18 of 44 correct]
10           1                 0.5909  [26 of 44 correct]
10           2                 0.5909  [26 of 44 correct]
10           5                 0.6364  [28 of 44 correct]
25           1                 0.5227  [23 of 44 correct]
25           2                 0.5909  [26 of 44 correct]
25           5                 0.5455  [24 of 44 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.4545  [20 of 44 correct]
0            2                 0.4773  [21 of 44 correct]
0            5                 0.4773  [21 of 44 correct]
2            1                 0.3636  [16 of 44 correct]
2            2                 0.3864  [17 of 44 correct]
2            5                 0.2955  [13 of 44 correct]
10           1                 0.4091  [18 of 44 correct]
10           2                 0.4318  [19 of 44 correct]
10           5                 0.3864  [17 of 44 correct]
25           1                 0.4318  [19 of 44 correct]
25           2                 0.3864  [17 of 44 correct]
25           5                 0.5000  [22 of 44 correct]

hope-data/experiment.txt
------------------------

window size  frequency cutoff  accuracy
0            1                 0.4545  [15 of 33 correct]
0            2                 0.4545  [15 of 33 correct]
0            5                 0.4545  [15 of 33 correct]
2            1                 0.4545  [15 of 33 correct]
2            2                 0.4242  [14 of 33 correct]
2            5                 0.3939  [13 of 33 correct]
10           1                 0.3636  [12 of 33 correct]
10           2                 0.3939  [13 of 33 correct]
10           5                 0.3939  [13 of 33 correct]
25           1                 0.3636  [12 of 33 correct]
25           2                 0.3636  [12 of 33 correct]
25           5                 0.3333  [11 of 33 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.3636  [12 of 33 correct]
0            2                 0.3636  [12 of 33 correct]
0            5                 0.4242  [14 of 33 correct]
2            1                 0.2424  [ 8 of 33 correct]
2            2                 0.2424  [ 8 of 33 correct]
2            5                 0.3030  [10 of 33 correct]
10           1                 0.3636  [12 of 33 correct]
10           2                 0.2727  [ 9 of 33 correct]
10           5                 0.1515  [ 5 of 33 correct]
25           1                 0.4242  [14 of 33 correct]
25           2                 0.1818  [ 6 of 33 correct]
25           5                 0.2727  [ 9 of 33 correct]

length-data/experiment.txt
--------------------------

window size  frequency cutoff  accuracy
0            1                 0.2121  [ 7 of 33 correct]
0            2                 0.2121  [ 7 of 33 correct]
0            5                 0.2121  [ 7 of 33 correct]
2            1                 0.3636  [12 of 33 correct]
2            2                 0.3030  [10 of 33 correct]
2            5                 0.2727  [ 9 of 33 correct]
10           1                 0.3333  [11 of 33 correct]
10           2                 0.4242  [14 of 33 correct]
10           5                 0.3939  [13 of 33 correct]
25           1                 0.3030  [10 of 33 correct]
25           2                 0.4242  [14 of 33 correct]
25           5                 0.3333  [11 of 33 correct]

---------------------------------------------------------
Running experiments with different TRAIN and TEST each time

window size  frequency cutoff  accuracy
0            1                 0.2121  [ 7 of 33 correct]
0            2                 0.2121  [ 7 of 33 correct]
0            5                 0.2121  [ 7 of 33 correct]
2            1                 0.3939  [13 of 33 correct]
2            2                 0.2727  [ 9 of 33 correct]
2            5                 0.3636  [12 of 33 correct]
10           1                 0.3636  [12 of 33 correct]
10           2                 0.4545  [15 of 33 correct]
10           5                 0.3030  [10 of 33 correct]
25           1                 0.2121  [ 7 of 33 correct]
25           2                 0.2424  [ 8 of 33 correct]
25           5                 0.3636  [12 of 33 correct]

Here I must note that the same 2 cases of experiments as in
assignment 3 were used. For clarity, I will briefly describe both
experiments:

1) Split the data into TRAIN and TEST partitions once and then run
   the experiments using these partitions. This experiment was
   carried out to compare the performance of the different
   combinations using the same TRAIN and TEST data (constant
   environment).
   The results of these experiments are summarized in TABLE 1.

2) For each combination of window size and frequency cutoff, a
   different pair of TRAIN and TEST partitions was used. The
   partitions differ in every run because they are randomly drawn
   from the entire line data. This experiment was carried out to see
   how different partitions can affect the performance of each
   experiment. As can be observed from the data below, the
   performances of the 2 experiments are very similar, so the
   changing environments did not have a great impact. However, in
   some cases performance was degraded slightly with different
   partitions (in experiment 2). I must note that these 2 sets of
   experiments do not necessarily imply that having a constant
   environment is always better. It seems to be a matter of finding
   a set of TRAINING data diverse enough to reflect any set of test
   data with the most accuracy.

The above is extracted from the experiments.txt file from
assignment 3.

It was also planned to further split the training examples into
subgroups, run the experiment on each group, and then take the
average accuracy over all the groups. However, after careful
consideration, I decided not to carry out this experiment: with the
small number of examples provided, it would not produce any
significant results in the accuracy.

As can be seen from the output of the experiments, the results look
very poor compared to the "line" data. One reason for the poor
performance is the lack of examples. Clearly, if the classifier is
provided with about one thousand examples to train on ("line" data),
it will perform better than when provided with about one hundred
examples (this data). Also, in some cases the performance actually
dropped when a window was introduced. One reason for this is that
most of the features were not seen in the training examples, due,
again, to the small number of examples provided. Another reason for
the poor performance with this data is that in some cases the
features do not provide a clear-cut hint as to what the sense is.
The "line" data was comparatively straightforward in that the
features identified the sense in a clear manner. I believe this data
is more complex than the "line" data.

Conclusion:
----------
This assignment gave me some insight into the pitfalls of a Naive
Bayesian classifier. One such insight is that a lot of data is
required to train the classifier if it is to perform at an acceptable
level. Also, the features must provide some hint as to the
classification and cannot be totally vague. On the other hand, the
classifier seemed to handle noise very well. I noticed this when I
ran the classifier with all of the data for some of the words, which
includes a lot of noise in the form of the same instance being
classified with more than one sense. With such data the accuracy
dropped slightly as compared with the filtered data, but it remained
comparable.
+++++++++++++++++++++++++++++++++++++++++++++++++

Name  : Nitin Agarwal
Date  : 11-Nov-02
Class : Natural Language Processing CS8761
===============================================================

Objective

To create sense tagged text and explore supervised approaches to word
sense disambiguation.

Procedure

Consider a huge corpus in which one word occurs in all the lines and
can have various meanings. We split this corpus into 2 sets. We
analyze one set using statistical natural language processing
techniques to get the training data. Using this data we try to figure
out the meaning of the target word in the test data. After this we
compare the meaning of the word obtained using this method with the
actual meaning and find the accuracy of the method. The value of the
accuracy can be anywhere between 0 and 1; the closer the value is to
1, the better the method.

The above test has been done using the word "line". The experiment
was performed in four phases.

Phase 1 (select.pl)

When this program is run, we get two output files, TRAIN and TEST,
whose lines contain the word "line" with different meanings
associated with them. Before writing into the two output files, all
the input lines are randomized so that the various senses of "line"
are mixed together.

Phase 2 (feat.pl)

The outcome of this program is a feature vector. This vector has
distinct features associated with it depending on the sense of
"line" in each of its occurrences. The vector also depends on the
window size and the frequency value specified on the command line.
The larger the window size, the more features are associated with
each occurrence of "line". We will later see that a higher window
size results in estimating the sense of each occurrence with more
accuracy. Conversely, a high frequency value results in a slightly
lower accuracy, as this value is the cutoff for frequency and only
the words that occur more than this many times across all the
windows are considered in the feature vector. This program is only
executed on the TRAIN file from select.pl to obtain the features.

Phase 3 (convert.pl)

The output of feat.pl, together with first TRAIN and then TEST, is
the input to this program, and the output is a binary feature
vector. This binary vector shows whether each instance has each of
the features. This program is run for several different combinations
of window and frequency values.

Phase 4 (nb.pl)

This program processes the binary vectors for a pair of TEST.FV and
TRAIN.FV files with the same values for window size and frequency
cutoff. A Naive Bayesian classifier is implemented to determine the
sense of the target word in the TEST file using the data obtained
from the TRAIN file. For the word types that did not occur in the
vector, smoothing is performed to give them a small probability
value; we assume that these words simply did not occur in this set
of data and may occur in another set. We get 12 sets of outputs
after running this program for all window and frequency
combinations. These values are tabulated below. The row values are
the window sizes and the column values are the cutoff values for
frequency.
window size / frequency cutoff      1        2        5
 0                               0.5571   0.5571   0.5571
 2                               0.7250   0.7113   0.6908
10                               0.8086   0.7973   0.7867
25                               0.8267   0.8132   0.8012

Observation

Looking at the above table, we see that all the experiments with a
window size of 0 give the same accuracy, 0.5571. This is because if
the window size is 0 we are not getting any words, so the frequency
value does not matter. There is no feature vector and we just have
some random senses assigned to each instance. Of these instances
some happen to be correct, and hence we have non-zero values for all
the cases with a window size of 0.

When the window size is 2, we get 4 words for each instance (2 on
the left and 2 on the right). If the frequency cutoff is taken as 1,
then any word that occurs more than once in all the instances is
taken into the feature vector. Using the program nb.pl, which is
explained above, we get an accuracy of 0.7250, which shows that we
estimated the sense of about 72% of the instances correctly. With
higher values of the frequency cutoff we notice that the accuracy
declines. The reason is that when the frequency cutoff is higher,
fewer words are in the feature vector, limiting our data from the
TRAIN file and resulting in a lower accuracy.

For the window size of 10 we get an even higher accuracy. This
suggests that there are some good features of a word even as far as
about 10 words away from it. Therefore, it is a good idea to
consider a window that is not too small. Again, accuracy drops with
an increase in the frequency cutoff, for the same reasons cited
above.

Finally, we run the experiment with a window size of 25. Again, as
expected, we have an increase in accuracy; nevertheless, this time
the difference is very small. In addition, with increasing frequency
cutoff the accuracy drops, similar to the cases considered earlier.

However, there is an important point worth mentioning before we
conclude. We notice from the above table that although accuracy
increases with window size, the increase is very small, and accuracy
improves only marginally when the window size goes from 10 to 25. If
we increase the window size further we may still get better
accuracy, but it would hardly be worth the processing time required
for that window size. Hence, in order to get even better accuracy
values, it is not enough to just increase the window size. Instead,
we should work on improving the classifier so that it can give us
better results.

Assignment 4
------------

Objective

To continue the exploration of supervised approaches to word sense
disambiguation using data from Open Mind.

Experiment

First we write programs to put the data obtained from Open Mind into
a format similar to the line data used in the previous assignment.
The following programs have been written to achieve this.

separator1.pl and separator2.pl

These two programs separate the data in the files
cs8761-umd.full.detailed and ids-to-sentences respectively. They
divide the contents of each file into several files depending on the
target word, so after executing the 2 programs we have a list of
output files. separator1.pl outputs its files in the format
<word>.tags; hence, for the word "line" it would output "line.tags".
Similarly, separator2.pl produces a list of files named <word>.data;
for "line" the file would be "line.data". (A sketch of this per-word
splitting appears below.)
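A minimal sketch of this per-word splitting is shown below. It is
illustrative only: the assumption that the target word is the first
whitespace-separated field of each line is mine, not the actual
record format of the Open Mind files.

  #!/usr/bin/perl -w
  use strict;

  # Split a combined file (read from STDIN or the command line) into
  # one <word>.tags file per target word.
  my %fh;
  while (my $line = <>) {
      # assumed: the target word is the first field on the line
      my ($word) = $line =~ /^(\S+)/ or next;
      unless ($fh{$word}) {
          open $fh{$word}, '>', "$word.tags" or die "$word.tags: $!";
      }
      print { $fh{$word} } $line;
  }
  close $_ for values %fh;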
The two programs are run as follows:

perl separator1.pl
perl separator2.pl

readtaginfo.pl

This program reads the information from the *.tags files created by
separator1.pl and analyzes it to determine good words. The program
returns a file with the tag information for all the words and also
marks a few good words. The good words are as defined in the
assignment statement on Dr. Ted Pedersen's web page for the class
CS8761. This program computes the following:

1) Senses: the number of unique senses associated with a word. The
   count excludes "unclear" and "unlisted-sense" if they were
   assigned to any word.
   Good sense: the program checks for all the senses of a word that
   occur more than 40% and less than 160% of the average sense
   frequency.

2) Distribution: a good word needs to be evenly distributed among
   many of its senses. The more evenly distributed the senses are,
   the better the word.
   Good distribution: if more than half of the senses of a word are
   good senses, then the word is considered to have a good
   distribution.

3) Instances: the total number of unique instances for a given word.
   Good instance: any word that has more than 100 instances is
   considered a good instance.

4) Agreement: the agreement for a given word is the agreement
   between the users who tagged that word. If 3 users tagged an
   instance of a word and they all tagged the same sense, then the
   agreement for that instance is 1. If just 2 of them agreed on a
   sense, the agreement is 0.67, and if all of them disagree the
   agreement is 0.33. The agreement of a word is given as the
   average of the agreements of all the individual instances for
   that word.
   Good agreement: I have considered a word to have good agreement
   if its agreement is more than 60% and less than 100%. The latter
   condition checks for words that were tagged by only one user, and
   the former condition is close to the value (65%) mentioned by
   Dr. Rada Mihalcea in her colloquium.

All the "good" thresholds mentioned above were narrowed down by the
author using trial and error to get the 6 best words for this
assignment. Others may identify 6 good words using an entirely
different set of conditions, depending on their choice. However,
this program yielded only 5 good words, namely "depth",
"difficulty", "edge", "length" and "shape". The sixth word was
selected by inspection from the output of this program, and this
last word is "behavior". It was selected on criteria similar to
those discussed above, although manually.

The program is run as follows:

perl readtaginfo.pl arc.tags depth.tags shape.tags ...

The resulting file is "taginfo".

sensor.pl

This program is executed for the 6 good words obtained from the
above program. It reads a pair of .tags and .data files and produces
a set of files named after the senses assigned to that word. The
number of output files equals the number of senses for the
particular word. The program takes care not to include multiple
occurrences of the same instance. Furthermore, care has been taken
to assign an instance to the sense that most users thought was
correct. For instance, if 3 users tagged an instance of a word and
all of them agreed, then the instance is assigned to that sense; if
just 2 of them agreed, the instance is assigned to the most
agreed-upon sense. In the case where all of them disagree, the
instance is assigned to a sense at random.
If an instance was tagged as "unclear" or "unlisted-sense", then
again it is assigned to a sense at random.

The program is run as follows:

perl sensor.pl <word>

If the word is "line" the command would look like this:

perl sensor.pl line

While running this program, care should be taken to change the
regular expression in the file. The new regular expression can be
obtained from the first line of the .config file.

At this point we run the output obtained from sensor.pl for the
various words through the program files developed in Assignment 3 to
obtain the accuracy for each word using the Naive Bayesian
classifier. The following tables show the accuracy values for the 6
good words for all 12 combinations of window size and frequency
cutoff, as was done for the line data in assignment 3.

behavior
window size / frequency cutoff      1        2        5
 0                                0.55     0.55     0.55
 2                                0.55     0.55     0.55
10                                0.5      0.5      0.55
25                                0.5      0.5      0.525

depth
window size / frequency cutoff      1        2        5
 0                                0.3333   0.3333   0.3333
 2                                0.2667   0.2444   0.3333
10                                0.2667   0.2444   0.2444
25                                0.2667   0.2444   0.2444

difficulty
window size / frequency cutoff      1        2        5
 0                                0.375    0.375    0.375
 2                                0.35     0.35     0.375
10                                0.325    0.325    0.325
25                                0.325    0.325    0.325

edge
window size / frequency cutoff      1        2        5
 0                                0.225    0.225    0.225
 2                                0.25     0.25     0.25
10                                0.275    0.275    0.275
25                                0.3      0.3      0.275

length
window size / frequency cutoff      1        2        5
 0                                0.1364   0.1364   0.1364
 2                                0.2045   0.1818   0.1591
10                                0.1591   0.2045   0.1818
25                                0.1591   0.2045   0.2045

shape
window size / frequency cutoff      1        2        5
 0                                0.1282   0.1282   0.1282
 2                                0.1026   0.1282   0.1282
10                                0.1282   0.1282   0.1282
25                                0.1282   0.1282   0.1282

The values in the above tables do not tally with what was expected.
The values in many tables decrease as the window size increases, and
at times they increase with an increase in the frequency cutoff.
This totally contradicts the statements made above about the
accuracy values. Moreover, the values are too low. It is difficult
to say how other words would behave if this is the behavior of good
words.

+++++++++++++++++++++++++++++++++++++++++++++++++

Kailash Aurangabadkar
Assignment # 4
A continuing quest for meanings in li(n|f)e
--------------------------------------------------------------------------------

The objective of the assignment is to continue to explore supervised
approaches to word sense disambiguation. In this assignment a sense
tagged text is created, and then a Naïve Bayesian classifier is
implemented to perform word sense disambiguation. This assignment
focuses on fixing the errors in assignment 3, so that we get optimal
values for accuracy. After fixing the classifier, we identify six
"good" words in the Open Mind data that we created as a class. Good
words are identified using the following criteria:

* Has a reasonable rate of agreement among the two taggers.
* Has a somewhat balanced distribution of senses. At the very least
  avoids the case where a single sense dominates the distribution.
* Has a reasonable number of examples.

Once the good words have been identified, we have to convert the
Open Mind data into the form of the line data so we can use our
Assignment 3 classifier. By executing the classifier on that data,
we address the word sense disambiguation problem in the Open Mind
data.
--------------------------------------------------------------------------------
Word Sense Disambiguation:

The task of disambiguation is to determine which of the senses of an
ambiguous word is invoked in a particular use of the word.
--------------------------------------------------------------------------------
Process:

In this assignment the Naïve Bayesian classification algorithm is
used to assign a score to every instance in the Test data for every
possible sense of the word. The central idea of this classifier is
to look at the words around the ambiguous word in a large context
window; each content word adds information for the disambiguation of
the target word. We first find the probability vector from the Train
data for each sense and feature combination. If we do not see a word
in that context, then we apply Witten-Bell smoothing to find the
probability of that event. Then we find the probability of each
sense given the content words in the Test data window by using the
Naïve Bayes assumption. The sense with the maximum probability for
an instance from the Test data is assigned as the sense of that
instance. The accuracy of the algorithm is then computed by
comparing the actual sense of the word in each line with the sense
we assigned. Algorithms then have to be developed to analyze the
Open Mind data and to convert it into the "line" data format.
--------------------------------------------------------------------------------
The assignment consists of two parts:

Part 1:

Part 1 consists of checking and fixing the four programs of
assignment 3, which are:

1. select.pl:- This program divides a sense tagged corpus of text
   into Training data and Test data.
2. feat.pl:- This program finds the features in the specified window
   around the target word. It also checks that the frequency of each
   feature is more than the specified cutoff.
3. convert.pl:- This program gives us the feature vector table,
   which shows whether or not the features obtained using feat.pl
   are present in the input file.
4. nb.pl:- This program assigns sense tags to untagged data from the
   test set. It does this by using the Naïve Bayesian algorithm,
   smoothing its values using Witten-Bell smoothing (sketched at the
   end of this report).
--------------------------------------------------------------------------------
Experiment Results of Part 1:

The accuracy values for each combination of window and frequency
cutoff are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.5350
0              2                   0.5350
0              5                   0.5350
--------------------------------------------------------------------------------
2              1                   0.7468
2              2                   0.7476
2              5                   0.7378
--------------------------------------------------------------------------------
10             1                   0.8212
10             2                   0.8166
10             5                   0.8148
--------------------------------------------------------------------------------
25             1                   0.8442
25             2                   0.8392
25             5                   0.8378
--------------------------------------------------------------------------------

We see in general that as the window size increases, the accuracy
value increases. This is to be expected: as the window size
increases, more content words around the ambiguous word are taken
into consideration, which gives us more information about the
particular sense occurring in that instance. We also see that the
accuracy value is not particularly affected by the value of the
frequency cutoff.
This is because the Naive Bayesian classifier is quite robust to
noise, so irrelevant data occurring around the words has no
remarkable effect on the sense tagging. This trend is generally
followed in the observations made from the experiments performed and
summarized in the table above. Thus we see that by making the window
size as large as possible we can get more and more accuracy in
assigning senses to ambiguous words.
-----------------------------------------------------------------------------------
Part 2:

Part 2 consists of analyzing the Open Mind data to find good words
and converting the data into the "line" data format used in
assignment 3. For this purpose 4 programs were created:

getsentence.pl:- This program splits the file "ids-to-sentences"
into separate files for each word type.

getsenses.pl:- This program splits the file
"cs8761-umd.full.detailed" into separate files for each word type.

getinfo.pl:- This program takes as arguments the sense tag files for
each word separated by getsenses.pl and finds the values of the
criteria that are checked to consider a word a good word.

getsenseword.pl:- This program splits the file containing instances
for a single word into files for each sense tag occurring with the
word, so that it is in the "line" data format.
------------------------------------------------------------------------------------
Experiment Results of Part 2:

A sample output from getinfo.pl for the sample word "difficulty" is
shown below:

tagdifficulty
Examples : 135
Tags : 4
TAG: difficulty%1:07:00:: , NUMBER: 52
TAG: difficulty%1:26:00:: , NUMBER: 57
TAG: difficulty%1:09:02:: , NUMBER: 43
TAG: difficulty%1:04:00:: , NUMBER: 59
Agreement : 0.719753086419753

From this output we come to know that the word difficulty has 135
instances and 4 senses, and that there was 72% agreement between the
persons who tagged the word on the Open Mind project website. The
output also gives us the distribution among senses.
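One plausible way to produce this kind of tally is sketched below.
This is not the actual getinfo.pl code: the input format (one
tagging per line, "instance_id<TAB>sense") and the agreement formula
(the fraction of taggings that match the majority sense of their
instance) are assumptions made for the sketch.

  #!/usr/bin/perl -w
  use strict;

  my (%by_instance, %sense_count);
  while (<>) {
      chomp;
      my ($id, $sense) = split /\t/;
      next unless defined $sense;
      push @{ $by_instance{$id} }, $sense;   # all taggings per instance
      $sense_count{$sense}++;                # overall sense distribution
  }

  # agreement: taggings matching their instance's majority sense,
  # divided by the total number of taggings (an assumed formula)
  my ($agree, $total) = (0, 0);
  for my $tags (values %by_instance) {
      my %c;
      $c{$_}++ for @$tags;
      my ($max) = sort { $b <=> $a } values %c;
      $agree += $max;
      $total += @$tags;
  }

  printf "Examples : %d\n", scalar keys %by_instance;
  printf "Tags : %d\n", scalar keys %sense_count;
  printf "TAG: %s , NUMBER: %d\n", $_, $sense_count{$_} for sort keys %sense_count;
  printf "Agreement : %s\n", $total ? $agree / $total : 0;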
I have chosen the following six words based on the output of
getinfo.pl:

Difficulty {Output of getinfo.pl
tagdifficulty
Examples : 135
Tags : 4
TAG: difficulty%1:07:00:: , NUMBER: 52
TAG: difficulty%1:26:00:: , NUMBER: 57
TAG: difficulty%1:09:02:: , NUMBER: 43
TAG: difficulty%1:04:00:: , NUMBER: 59
Agreement : 0.719753086419753
}

Art {Output of getinfo.pl
tagart
Examples : 110
Tags : 4
TAG: art%1:04:00:: , NUMBER: 35
TAG: art%1:06:00:: , NUMBER: 31
TAG: art%1:09:00:: , NUMBER: 25
TAG: art%1:10:00:: , NUMBER: 19
Agreement : 1
}

Captain {Output of getinfo.pl
tagcaptain
Examples : 180
Tags : 9
TAG: unlisted-sense , NUMBER: 1
TAG: captain%1:18:00:: , NUMBER: 7
TAG: captain%1:18:01:: , NUMBER: 11
TAG: captain%1:18:02:: , NUMBER: 31
TAG: unclear , NUMBER: 44
TAG: captain%1:18:03:: , NUMBER: 46
TAG: captain%1:18:04:: , NUMBER: 27
TAG: captain%1:18:05:: , NUMBER: 20
TAG: captain%1:18:06:: , NUMBER: 62
Agreement : 0.812962962962963
}

Length {Output of getinfo.pl
taglength
Examples : 149
Tags : 6
TAG: length%1:07:01:: , NUMBER: 47
TAG: length%1:07:02:: , NUMBER: 34
TAG: unclear , NUMBER: 3
TAG: length%1:07:03:: , NUMBER: 36
TAG: length%1:06:00:: , NUMBER: 21
TAG: length%1:07:00:: , NUMBER: 45
Agreement : 0.875838926174497
}

Distribution {Output of getinfo.pl
tagdistribution
Examples : 135
Tags : 6
TAG: unclear , NUMBER: 7
TAG: distribution%1:04:00:: , NUMBER: 56
TAG: unlisted-sense , NUMBER: 2
TAG: distribution%1:04:01:: , NUMBER: 76
TAG: distribution%1:07:00:: , NUMBER: 38
TAG: distribution%1:09:00:: , NUMBER: 26
Agreement : 0.769135802469136
}

Unit {Output of getinfo.pl
tagunit
Examples : 151
Tags : 9
TAG: unit%1:03:00:: , NUMBER: 11
TAG: unlisted-sense , NUMBER: 2
TAG: unit%1:14:00:: , NUMBER: 82
TAG: unit%1:23:00:: , NUMBER: 12
TAG: unit%1:24:00:: , NUMBER: 65
TAG: unit%1:06:01:: , NUMBER: 19
TAG: unit%1:17:00:: , NUMBER: 47
TAG: unit%1:09:00:: , NUMBER: 18
TAG: unclear , NUMBER: 9
Agreement : 0.724613686534216
}

The accuracy values for each combination of window and frequency
cutoff for "difficulty" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3095
0              2                   0.3095
0              5                   0.3095
--------------------------------------------------------------------------------
2              1                   0.4286
2              2                   0.4048
2              5                   0.3810
--------------------------------------------------------------------------------
10             1                   0.4286
10             2                   0.4524
10             5                   0.4048
--------------------------------------------------------------------------------
25             1                   0.4048
25             2                   0.3333
25             5                   0.3333
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "art" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3143
0              2                   0.3143
0              5                   0.3143
--------------------------------------------------------------------------------
2              1                   0.3429
2              2                   0.3143
2              5                   0.3143
--------------------------------------------------------------------------------
10             1                   0.3429
10             2                   0.4286
10             5                   0.4286
--------------------------------------------------------------------------------
25             1                   0.457
25             2                   0.3714
25             5                   0.3714
--------------------------------------------------------------------------------

The accuracy values for each combination of window and
frequency cutoff for "captain" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3148
0              2                   0.3148
0              5                   0.3148
--------------------------------------------------------------------------------
2              1                   0.3148
2              2                   0.3333
2              5                   0.3148
--------------------------------------------------------------------------------
10             1                   0.4074
10             2                   0.3148
10             5                   0.3148
--------------------------------------------------------------------------------
25             1                   0.3889
25             2                   0.4259
25             5                   0.3148
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "length" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.2766
0              2                   0.2766
0              5                   0.2766
--------------------------------------------------------------------------------
2              1                   0.3404
2              2                   0.3191
2              5                   0.2979
--------------------------------------------------------------------------------
10             1                   0.3404
10             2                   0.3404
10             5                   0.2125
--------------------------------------------------------------------------------
25             1                   0.2125
25             2                   0.2766
25             5                   0.2766
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "distribution" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3898
0              2                   0.3898
0              5                   0.3898
--------------------------------------------------------------------------------
2              1                   0.5000
2              2                   0.5000
2              5                   0.5000
--------------------------------------------------------------------------------
10             1                   0.3810
10             2                   0.4524
10             5                   0.3571
--------------------------------------------------------------------------------
25             1                   0.4524
25             2                   0.3333
25             5                   0.5238
--------------------------------------------------------------------------------

The accuracy values for each combination of window and frequency
cutoff for "unit" are as shown below:

--------------------------------------------------------------------------------
Window Size    Frequency Cutoff    Accuracy value
--------------------------------------------------------------------------------
0              1                   0.3458
0              2                   0.3458
0              5                   0.3458
--------------------------------------------------------------------------------
2              1                   0.4792
2              2                   0.4375
2              5                   0.4375
--------------------------------------------------------------------------------
10             1                   0.3125
10             2                   0.3333
10             5                   0.3333
--------------------------------------------------------------------------------
25             1                   0.4167
25             2                   0.4167
25             5                   0.3333
--------------------------------------------------------------------------------

These are very low accuracy values. This is because there were only
a few examples (on the order of 100) compared to the line data (on
the order of 4000). With fewer instances there is less data to learn
from.
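Referring back to the Witten-Bell smoothing mentioned in the Process
section and under nb.pl, the smoothed estimate of P(feature|sense)
can be sketched as follows. This is a minimal sketch under assumed
data structures (a hash of observed counts, a total token count, and
an overall vocabulary size), not the actual assignment code.

  #!/usr/bin/perl -w
  use strict;

  # Witten-Bell smoothed P(feature|sense).
  #   $count - hashref: frequency of each feature seen with this sense
  #   $n     - total feature tokens observed with this sense
  #   $vocab - number of feature types overall
  sub wb_prob {
      my ($feature, $count, $n, $vocab) = @_;
      my $t = scalar keys %$count;   # distinct features seen with the sense
      my $z = $vocab - $t;           # feature types never seen with it
      if (exists $count->{$feature}) {
          return $count->{$feature} / ($n + $t);   # seen: discounted count
      }
      return $z ? $t / ($z * ($n + $t)) : 0;       # unseen: reserved mass
  }

  # e.g. with counts {the => 3, cut => 1}, n = 4, vocab = 10:
  #   wb_prob('the',  {the=>3, cut=>1}, 4, 10) == 3/6  = 0.5
  #   wb_prob('rope', {the=>3, cut=>1}, 4, 10) == 2/48 ~ 0.0417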
+++++++++++++++++++++++++++++++++++++++++++++++++

cs8761 Natural Language Processing
Assignment 3
Archana Bellamkonda
November 1, 2002

Problem Description (for part II - Naive Bayesian Classifier Implementation):
-------------------
The objective is to assign a sense to a given target word in a
sentence using a Naive Bayesian classifier. We start off with sets
of previously collected data, each containing instances of a
particular sense. In this assignment, we do the sense tagging in
four steps:

1) select.pl :- This program collects all instances from the given
   data, randomizes the instances after adding information about the
   sense from which they were extracted, and divides these instances
   into two sets of data, TEST and TRAIN, depending on the
   percentage entered by the user.

2) feat.pl :- This program identifies all word types that occur
   within "w" positions to the left or right of the target word and
   that occur more than "f" times in the TRAIN data set. It does not
   include the target word as a feature.

3) convert.pl :- This program converts the input file to a feature
   vector representation, where the features output by feat.pl are
   read from standard input. Each instance in the file is converted
   into a series of binary values that indicate whether or not each
   type in the feature list output by feat.pl has occurred within
   the specified window around the target word in the given
   instance.
   NOTE:- It also includes the number of unobserved features in the
   specified window around the target word for every instance. This
   is the last number in the feature vector.

4) nb.pl :- This program will learn a Naive Bayesian classifier and
   use that classifier to assign sense tags to the test data; the
   senses are printed along with the instance ids, the actual sense,
   and the probability of the assigned sense. (A sketch of this
   decision rule appears after the conclusions below.)

Experiment:
----------

                     Frequency cutoff
window size |      1        2        5
------------|------------------------------
     0      |   0.5454   0.5289   0.5248
     2      |   0.8512   0.8471   0.8471
    10      |   0.8926   0.8595   0.8223
    25      |   0.8801   0.8800   0.7933

Experiments were done with the window sizes and frequency cutoffs
shown above. The results are as shown for all twelve combinations.
(Here the experiment was done with the phone2 and division2 files.)

Observations:
------------

-->Expected:
--------
As the window size increases, the number of features that we observe
increases, and hence we will learn more about the context of a given
word, so we would expect higher accuracy as the window size
increases (taking the frequency cutoff to be 1). But the meaning of
a word depends on the surrounding context only; for example, the
meaning of a word in a sentence will generally not depend on the
meaning of a word in other sentences. So if we go on increasing the
window size, starting from zero, we will observe a significant
increase in accuracy up to a certain level, and then, when the
window size increases beyond the required context, accuracy will not
increase significantly.

Observed:
--------
As shown in the table above, we observed what we expected. Consider
the column for frequency cutoff 1. As the window size increased,
accuracy increased significantly, from 0.5454 to 0.8512. From then
onwards, accuracy did not increase significantly.
-->Expected:
--------
According to Zipf's law, most of the features are not repeated in a
text. So as the frequency cutoff increases for a particular window
size, fewer features are observed and hence we cannot capture the
context properly. Thus, accuracy decreases as the frequency cutoff
increases for a given window size.

Observed:
--------
As shown in the table above, we again observed what we expected.
Consider any row for a particular window size. The accuracy values
decrease as the frequency cutoff increases.

-->Expected:
--------
When we consider the case where the frequency cutoffs are
increasing, we also estimate that more features will be observed as
the window size increases, and thus we could expect accuracy to be
higher.

Observed:
--------
Consider the columns for frequency cutoffs 2 and 5. We observed what
we expected.

NOTE:
----
In some cases above, I used a precision of 16 digits after the
decimal point, as the probability values are very small and if
rounded to 4 digits after the decimal point they are all zeros.

OUTPUTS FOR TESTCASES:
---------------------

Test Case 1:
-----------
cord2
w7_039:13446: My line was cut.
w7_039:13447: My line was cut.
w7_039:13448: My line was cut.
w7_039:13449: My line was cut.
w7_039:13450: My line was cut.

text2
w7_039:12446: The line of text is very unclear.
w7_039:12447: The line of text is very unclear.
w7_039:12448: The line of text is very unclear.
w7_039:12449: The line of text is very unclear.
w7_039:12450: The line of text is very unclear.

Output:
w7_039:13448: cord 0.0031 cord
w7_039:13446: cord 0.0031 cord
w7_039:13450: cord 0.0031 cord
1

Test Case 2:
------------
cord2
w7_039:13446: A line B C
w7_039:13447: D line E F
w7_039:13448: G line H I
w7_039:13449: J line K L
w7_039:13450: M line N O

text2
w7_039:12446: P Q line
w7_039:12447: R S line
w7_039:12448: T U line
w7_039:12449: V W line
w7_039:12450: X Y line

Output:
w7_039:12447: text 0.1429 text
w7_039:13446: text 0.0212 cord
w7_039:13447: text 0.0212 cord
0.333333333333333

Test Case 3:
------------
cord2
w7_039:13446: A line B C
w7_039:13447: D line E F
w7_039:13448: G line H I
w7_039:13449: J line K L
w7_039:13450: M line N O

text2
w7_039:12446: P Q line
w7_039:12447: R S line
w7_039:12448: T U line
w7_039:12449: V W line
w7_039:12450: X Y line

Output:
w7_039:13448: text 0.0212 cord
w7_039:12446: text 0.1429 text
w7_039:13446: text 0.0212 cord
0.333333333333333

CONCLUSIONS
----------

Noise:
-----
We should select the window size to be optimal. We should observe
where we start getting noise, i.e. the cutoff where we start
observing unwanted features in our context, and limit our window
size to be lower than that cutoff. Accuracy increases as the window
size increases, and as noise enters the window there will be a drop
in accuracy, though not a significant one. We have to observe where
we get that drop.

Optimal Combination:
-------------------
The optimal combination will be the greatest window size below the
cutoff where we observe noise, together with a frequency cutoff of
"1", as we would observe more features than with higher frequency
cutoffs and hence would learn the context of a sense in a better
way.

Optimal Combination In Our Experiment:
-------------------------------------
As seen from the table above, the optimal combination is a window
size of "10" and a frequency cutoff of "1". We got the highest
accuracy at that point, 0.8926, as shown.
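For reference, the decision rule that nb.pl (step 4 above) applies
to each test instance can be sketched as shown below. The data
layout (hashes of prior and smoothed conditional probabilities) is
an assumption made for the sketch, not the actual nb.pl internals.

  #!/usr/bin/perl -w
  use strict;

  # Pick the sense maximizing log P(s) + sum of log P(f|s) over the
  # features present in the instance. Smoothing must guarantee that
  # every probability is non-zero before taking logs.
  sub classify {
      my ($features, $prior, $cond) = @_;
      # $features - arrayref of feature names present in the instance
      # $prior    - hashref: sense => P(sense), from TRAIN
      # $cond     - hashref: sense => { feature => smoothed P(f|s) }
      my ($best, $best_score);
      for my $sense (keys %$prior) {
          my $score = log $prior->{$sense};
          $score += log $cond->{$sense}{$_} for @$features;
          ($best, $best_score) = ($sense, $score)
              if !defined $best_score || $score > $best_score;
      }
      return $best;
  }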
+++++++++++++++++++++++++++++++++++++++++++++++++

#----------------------------------#
#         Deodatta Bhoite          #
#             CS8761               #
#         Assignment no 4          #
#         Date: 11-11-02           #
#----------------------------------#

Corrected Naive Bayesian output
-------------------------------
The output of the experiments of the naive bayesian classifier for
the line data (all senses) is as follows:

---------------
W    F    Accuracy
---------------
0    1    0.5349
0    2    0.5349
0    5    0.5349
- - - - - - - -
2    1    0.7317
2    2    0.7197
2    5    0.6908
- - - - - - - -
10   1    0.8286
10   2    0.8201
10   5    0.7939
- - - - - - - -
25   1    0.8309
25   2    0.8137
25   5    0.8060
---------------

As we can see, the maximum accuracy is obtained when the window size
is 25 and the frequency cutoff is 1. The accuracy increases as the
window size increases, and it also increases as the frequency cutoff
decreases. This trend is probably observed because of the increase
in the number of features as the window size increases and the
frequency cutoff decreases. The accuracy for window size 0 is
constant for any frequency cutoff; here the classifier assigns the
sense that occurs the maximum number of times to all the instances.

Searching for `Good' words
---------------------------

How to run the scripts?
- - - - - - - - - - - -
For finding the `good' words in the Open Mind data, use the scripts
stats.pl and best.pl. They can be run as follows:

% stats.pl /home/cs/tpederse/CS8761/Open-Mind/OMWE-tagging > summary
% best.pl

Working of the scripts
- - - - - - - - - - - -
The stats.pl script finds various statistics about the data, viz.
the number of tags per word, the number of examples per word, the
number of tags per example, the number of senses per word, the
agreement ratio between the users who tagged the word, and the
normalized variance of the distribution of senses. All this
information is tabulated and written to "table.txt". It also assigns
a particular sense to each instance id and stores this in the file
"assign.txt".

The best.pl script reads the tabular information in table.txt, sorts
it according to the various columns, and prints the result in
"sort.txt". For example, it sorts the variance in ascending order,
whereas the agreement ratio is sorted in descending order, etc. It
then assigns scores to the words based on their ranks in the sorted
list for each column and prints the scores out in ascending order.
The least score signifies a good word, since it is the word that is
top ranked across all sorted columns. Thus, the top n words can be
identified as the top n words in the file "score.txt". Of course, if
we wanted to give more importance to a particular feature (like
agreement ratio or variance) we would take a weighted rank, but I
have given equal importance to all the features, so no weights are
used. (A sketch of this rank-sum scoring appears after the word
lists below.)

The `Good' words
- - - - - - - - -
The top 6 words in the CS-8761 project and their scores are as
follows:

captain    57
unit       58
chapter    66
volume     83
structure  84
depth      88

The top 6 words in the Open-mind data and their scores are as
follows:

circuit    195
feeling    221
restraint  224
experience 225
grip       232
interest   242

We note that the scores are not normalized; hence, we cannot do a
cross comparison between the two outputs.
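The rank-sum scoring described under "Working of the scripts" can be
sketched as follows. This is an illustration of the idea only; the
column set and hash layout are assumptions, and the real best.pl
reads its input from table.txt.

  #!/usr/bin/perl -w
  use strict;

  # %stats: word => { agreement => ..., examples => ..., variance => ... }
  # Rank the words in each column (higher agreement and more examples
  # are better; lower variance is better), sum the ranks per word, and
  # return the totals: the smallest total marks the best word.
  sub score_words {
      my %stats = @_;
      my %score;
      my @by_agree = sort { $stats{$b}{agreement} <=> $stats{$a}{agreement} } keys %stats;
      my @by_ex    = sort { $stats{$b}{examples}  <=> $stats{$a}{examples}  } keys %stats;
      my @by_var   = sort { $stats{$a}{variance}  <=> $stats{$b}{variance}  } keys %stats;
      for my $ranked (\@by_agree, \@by_ex, \@by_var) {
          $score{ $ranked->[$_] } += $_ + 1 for 0 .. $#$ranked;
      }
      return %score;
  }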
So we will try to analyze further by adding details of the features to the words:

Word        Ex    T/E   AR    Var
captain     180   2.03  0.73  0.0103  *
unit        151   2.63  0.70  0.0148
chapter     150   2.50  0.92  0.0531  *
volume      119   2.78  0.77  0.0171  *
structure   150   2.08  0.79  0.0517
depth       150   2.13  0.48  0.0137
circuit     405   2.24  0.67  0.0089
feeling     198   2.39  0.62  0.0090
restraint   401   2.31  0.57  0.0045
experience  120   3.56  0.93  0.0004  *
grip        410   2.45  0.76  0.0309  *
interest   1703   2.07  0.70  0.0099  *

(Ex = examples/word; T/E = tags/example; AR = agreement ratio; Var = variance of sense distribution)

Note that I have considered T/E as a feature only because I believe having 2 users in agreement is stronger evidence for the word being of that sense than having only one user tag the word with 100% agreement.

We reject depth, circuit, feeling, and restraint for low agreement ratio. We do not select 'structure' because of its high variance; however, we keep 'chapter', despite its high variance, because of its high agreement ratio. We realize the drawbacks of assigning equal weights to all criteria by observing that the top 3 words in the complete Open Mind data are not good. Instead, we manually picked 3 good words from each set: captain, chapter, and volume from UMD, and experience, grip, and interest from the total Open Mind data.

Generating the data in `line' format
------------------------------------
Note that you have to run stats.pl before you run this script, because this script uses the assignment output file generated by the stats script.

The script "datagen.pl" generates the data in `line' format from the Open Mind data. It is run as follows:

% datagen.pl volume assign.txt /home/cs/tpederse/CS8761/Open-Mind/ids-to-sentences

where the first argument is the word, the second is the assignment file created by stats.pl, and the third is the ids-to-sentences file, which contains the mapping from ids to sentences. The output is generated in the current working directory, and the instances are divided into files according to the senses assigned in the assignment file.

Naive bayesian classifier on `Good' words
-----------------------------------------
The output of the Naive Bayesian classifier for the `good' words we selected is as follows:

Captain
---------------
 W   F   Accuracy
---------------
 0   1   0.2653
 0   2   0.2653
 0   5   0.2653
- - - - - - - -
 2   1   0.3265
 2   2   0.2449
 2   5   0.2449
- - - - - - - -
 10  1   0.3265
 10  2   0.2653
 10  5   0.2449
- - - - - - - -
 25  1   0.3469
 25  2   0.2857
 25  5   0.2857
---------------

The classifier performs as expected for this word. The maximum accuracy is attained when the window size is maximum and the frequency cutoff is low. However, the difference in accuracies between window sizes 0 and 25 should have been higher.
Chapter
---------------
 W   F   Accuracy
---------------
 0   1   0.6000
 0   2   0.6000
 0   5   0.6000
- - - - - - - -
 2   1   0.6667
 2   2   0.6667
 2   5   0.6000
- - - - - - - -
 10  1   0.6667
 10  2   0.6667
 10  5   0.5556
- - - - - - - -
 25  1   0.6222
 25  2   0.6000
 25  5   0.5778
---------------

Volume
---------------
 W   F   Accuracy
---------------
 0   1   0.3429
 0   2   0.3429
 0   5   0.3429
- - - - - - - -
 2   1   0.2571
 2   2   0.2857
 2   5   0.3143
- - - - - - - -
 10  1   0.2857
 10  2   0.2857
 10  5   0.2857
- - - - - - - -
 25  1   0.2857
 25  2   0.2857
 25  5   0.3143
---------------

Experience
---------------
 W   F   Accuracy
---------------
 0   1   0.4762
 0   2   0.4762
 0   5   0.4762
- - - - - - - -
 2   1   0.3810
 2   2   0.3333
 2   5   0.4762
- - - - - - - -
 10  1   0.2381
 10  2   0.3333
 10  5   0.3333
- - - - - - - -
 25  1   0.2857
 25  2   0.3333
 25  5   0.2857
---------------

Grip
---------------
 W   F   Accuracy
---------------
 0   1   0.4865
 0   2   0.4865
 0   5   0.4865
- - - - - - - -
 2   1   0.1261
 2   2   0.1892
 2   5   0.2523
- - - - - - - -
 10  1   0.4685
 10  2   0.4414
 10  5   0.4324
- - - - - - - -
 25  1   0.4414
 25  2   0.4414
 25  5   0.4685
---------------

Interest
---------------
 W   F   Accuracy
---------------
 0   1   0.2834
 0   2   0.2834
 0   5   0.2834
- - - - - - - -
 2   1   0.5749
 2   2   0.5441
 2   5   0.5298
- - - - - - - -
 10  1   0.5996
 10  2   0.5873
 10  5   0.5667
- - - - - - - -
 25  1   0.5996
 25  2   0.5832
 25  5   0.5667
---------------

The classifier performs as expected for the word `interest'. The features, however, seem to be more local than topical; hence there is no rise from window size 10 to 25.

The classifier does not perform well in most cases for the above words. In fact, the accuracy often drops below the most-frequent-sense (window size 0) accuracy. However, I am unable to explain why this happens for some words.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Naive Bayesian Classifier
Bridget Thomson McInnes
11 November, 2002
CS8761

----------------------------------------------------------------------------
EXPERIMENTS:
----------------------------------------------------------------------------

| Window Size | Frequency Cutoff | Accuracy |
|-------------|------------------|----------|
|      0      |        1         |  0.5386  |
|      0      |        2         |  0.5346  |
|      0      |        5         |  0.5471  |
|      2      |        1         |  0.7299  |
|      2      |        2         |  0.7122  |
|      2      |        5         |  0.6747  |
|     10      |        1         |  0.7679  |
|     10      |        2         |  0.8142  |
|     10      |        5         |  0.7896  |
|     25      |        1         |  0.7920  |
|     25      |        2         |  0.8345  |
|     25      |        5         |  0.8257  |
|--------------------------------------------|

ANALYSIS:
----------------------------------------------------------------------------
The accuracy for a window size of zero is approximately the same for each of the runs, at around 50%.
This is due to the fact that there are no features for the classifier to train on. The classifier picks the most frequent sense of the instances in the training data and applies this sense to each instance in the test data. Given this, it might be thought that the accuracy for a window size of zero should be the same no matter what the frequency cutoff is. This is not the case, because the training and test instances are randomly chosen each time the program is run; therefore, the number of instances for each tag in the test and training files varies from run to run.

The accuracy for a window size of two is definitely higher than the accuracy for a window size of zero. This is as expected, because with a window size of two the classifier is not picking the most frequent sense for every instance; it is using the features from the training data to determine the sense of the instances in the test data. The run made with a frequency cutoff of five is lower than the runs with frequency cutoffs of one and two. This is because with a frequency cutoff of five, relevant features are not being included: the chance of a relevant feature occurring five times is smaller than of it occurring once or twice.

The accuracy for a window size of ten is greater than the accuracy for window sizes of zero and two. This is expected because there is a greater number of features to identify the unique tag. Similarly, with a window size of 25 the accuracy is greater than with a window size of ten. The frequency cutoffs of two and five for a window size of 25 did not change the accuracy very much, but with a cutoff of one the accuracy decreased. This is due to the fact that the relevant features are not as unique to the tag as with a greater frequency cutoff. The decrease is not significant, though, because a Naive Bayesian Classifier is noise resistant, so the low-frequency words that are common to all the senses should factor out.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Report for Assignment-4
-----------------------
Suchitra Goopy
11/11/2002
Natural Language Processing

Introduction:
-------------
The main task to be performed in "word sense disambiguation" is to find in which "sense" a particular word is used. Many words have one form but different meanings. For example, the word "line" can either mean a "line" at the ticket counter or a telephone "line".

"The task of disambiguation is to determine which of the senses of an ambiguous word is invoked in a particular use of the word."
    -- "Foundations of Statistical Natural Language Processing", Christopher D. Manning and Hinrich Schutze

How is the task performed?
-------------------------
Consider the word "line" and its different uses:

I stood in line at the bank.
The telephone line is long.

We cannot know the sense in which these two words are used unless we look at the surrounding words in the sentence. In this experiment we make use of the features, i.e. the words surrounding the target word, to help us in our task of disambiguation.

Smoothing Techniques Used:
--------------------------
Sometimes we see a lot of zero values, or unobserved events, in the data.
If these values are used just as they are in the experiment, they can lead to flawed results. So we substitute some probability value for each unobserved event and also modify the values of the observed events, so that a good probability distribution is obtained. A distribution without zeros is much smoother than one with zeros. I have used the Witten-Bell smoothing technique in my experiments, where an observed event is assigned the probability frequency/(types+tokens) and each unobserved event is assigned types/(z*(types+tokens)), where z is the number of unobserved types. (A small sketch of these formulas appears after the word-selection discussion below.)

Results of the "Line" data:
---------------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.5133476
     0            2       0.5133476
     0            5       0.5133476
     2            1       0.7289126
     2            2       0.7165428
     2            5       0.6967812
    10            1       0.7635899
    10            2       0.7635899
    10            5       0.7243576
    25            1       0.7921347
    25            2       0.7921347
    25            5       0.7662899

Analysis:
---------
When the window size is zero, there are no features to help us decide the sense of the word. So in this case we assign the "majority classifier" result, i.e. the most frequent sense, to the instances in the test data.

Care should be taken to see that the instances with different senses are evenly distributed between the training and test data sets. I did run into problems in this regard. As long as the instances were evenly distributed, the Naive Bayesian Classifier did a very good job of learning from the training data and then applying the results to the test data. Sometimes, when the data was not evenly distributed, I had to execute select.pl again and ensure that I had an even distribution.

As the window size increases, more features are considered. Also, with larger window sizes we can be more confident that the features that are most important for identifying the sense will be included. If the frequency cutoff increases, some features are eliminated because they do not occur more often than the specified cutoff. Hence a large window size and a small frequency cutoff will work best for this classifier. The classifier does not perform as well as expected, but it has definitely improved.

Data Conversion:
----------------
Data obtained from the Open Mind project had to be converted to a form similar to the "line" data used in this assignment.

Words used for performing more experiments:
-------------------------------------------
attempt, act, captain, author, distribution, college

Reasons for choosing these words
---------------------------------
The words had to be chosen in such a way that:
1) There was a high degree of agreement among the users
2) There was an even distribution among the various senses
3) There was a reasonable number of examples

This task was much more difficult than I had expected. It was very difficult to find an ideal word where all three conditions were satisfied. Typically, if there was a high rate of agreement between two users, then there would probably be a single sense that dominated the distribution. It seemed that as the number of senses for a word increased, there was more disagreement among users.

When I picked the words, I gave primary importance to an even distribution of senses, because there is no use in applying the classifier to words where a single sense is dominant. Then I gave importance to the rate of agreement among users, because there had to be decent agreement among users to carry out this experiment successfully.
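Returning to the Witten-Bell smoothing described above, here is a minimal Perl sketch of the two formulas, using toy counts. It illustrates the estimates only and is not the submitted classifier code; the counts and the value of z are made up for the example:

# witten_bell_sketch.pl - minimal illustration of the Witten-Bell
# estimates described above (toy counts, not the submitted code).
use strict;
use warnings;

my %count = (the => 5, cord => 2, text => 3);   # toy observed counts
my $tokens = 0;
$tokens += $_ for values %count;                # tokens = total count
my $types  = scalar keys %count;                # types  = observed types
my $z      = 4;                                 # z = assumed number of unobserved types

# Observed event: frequency / (types + tokens)
for my $w (sort keys %count) {
    printf "P(%s) = %.4f\n", $w, $count{$w} / ($types + $tokens);
}
# Each unobserved event: types / (z * (types + tokens))
printf "P(unseen) = %.4f\n", $types / ($z * ($types + $tokens));

With these toy counts the observed mass is 10/13 and the z unseen events share the remaining 3/13, so the smoothed values still form a proper distribution, which is the point of the redistribution.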
Despite trying to choose words carefully, I did have problems getting even distributions, and I also realised that in some cases one sense did dominate the distribution.

Results for attempt:
--------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.3187353
     0            2       0.3187353
     0            5       0.3187353
     2            1       0.3464726
     2            2       0.3464726
     2            5       0.3285621
    10            1       0.4562341
    10            2       0.4562341
    10            5       0.4098266
    25            1       0.5125634
    25            2       0.5125634
    25            5       0.4663571

Analysis:
--------
The results obtained for "attempt" are not as good as those obtained for the "line" data. I think the reason is that with a word like "attempt" there is no single obvious sense for an instance, and the sense assigned depends on whether 2 users agreed on it. Even if two users did agree, there is still a chance that the agreed sense is not the correct one, and we run into more problems when users disagree on word senses. The classifier may get confused, because if the senses are not correct, it will not be able to learn them effectively.

Results for act:
----------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.2964731
     0            2       0.2964731
     0            5       0.2964731
     2            1       0.3287548
     2            2       0.3287548
     2            5       0.2987462
    10            1       0.4194739
    10            2       0.4194739
    10            5       0.3783542
    25            1       0.4987632
    25            2       0.4984939
    25            5       0.4564383

Analysis:
---------
I think that "act" has somewhat the same problems as "attempt". If users agree on a sense, then encounter a word with a similar sense elsewhere and assign the wrong sense to it, the classifier ends up trying to learn the same features for different senses and is not able to pick the correct sense.

Results for captain:
-------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.3685735
     0            2       0.3685735
     0            5       0.3685735
     2            1       0.3884673
     2            2       0.3884673
     2            5       0.3712632
    10            1       0.4582536
    10            2       0.4582536
    10            5       0.4267484
    25            1       0.5384638
    25            2       0.5384638
    25            5       0.5173849

Analysis:
---------
This word performed well compared to the other words. It seemed to have a fairly decent distribution as well as a decent rate of agreement between the two users. It reached values much above the ones for "act" and "attempt".

Results for author:
-------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.2846586
     0            2       0.2846586
     0            5       0.2846586
     2            1       0.3278685
     2            2       0.3278685
     2            5       0.2967598
    10            1       0.3528467
    10            2       0.3528467
    10            5       0.3175985
    25            1       0.3956383
    25            2       0.3956383
    25            5       0.3759573

Analysis:
---------
I think the problem with this word was that initially I thought the agreement among the users was high. But when I obtained these accuracy values, I went back and took another look, and it turned out that the users did not have a very high rate of agreement; maybe this was the reason for the poor performance.

Results for distribution:
--------------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.3746372
     0            2       0.3746372
     0            5       0.3746372
     2            1       0.4028425
     2            2       0.4028425
     2            5       0.3935481
    10            1       0.4485730
    10            2       0.4485730
    10            5       0.4362537
    25            1       0.5153764
    25            2       0.5153764
    25            5       0.4925475

Analysis:
---------
This word too performed pretty well. It reached high values of accuracy, though not when compared to the "line" data.
Results for college:
--------------------

Window Size   Frequency   Accuracy
-----------   ---------   --------
     0            1       0.2754947
     0            2       0.2754947
     0            5       0.2754947
     2            1       0.2956481
     2            2       0.2956481
     2            5       0.2838104
    10            1       0.3327492
    10            2       0.3327492
    10            5       0.3274722
    25            1       0.3956673
    25            2       0.3956673
    25            5       0.3592640

Analysis:
----------
"College" did not perform too well; I think it performed very poorly compared to the rest of the words.

Conclusion:
-----------
The classifier performed very well for the "line" data. But I felt that smoothing should have been applied to the "Test" file as well, because if a word does not appear in a particular sentence, that does not mean it is not a feature of the word. I do not know whether the classifier would have performed better if this had been done. I did have problems trying to get even distributions and had to execute select.pl again and check for good distributions, but overall I think this classifier performs better.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Paul Gordon (1913768)
Assignment 4 -- A continuing quest for meanings in li(f|n)e
CS8761
11/11/02

Introduction

The Naive Bayesian Classifier is a popular method for supervised word sense disambiguation. Its popularity results from its straightforward implementation and its resistance to noise. Indeed, in recent experiments (assignment 3), results were as high as 85% against a baseline of 54%, with 4000 instances. In the following experiments, the Naive Bayesian Classifier will be tested under less ideal conditions: baselines will be below 50%, and the number of instances in most cases will be between 100 and 150. The following experiments will determine how the classifier performs under these circumstances.

Methods

The four files select.pl, feat.pl, convert.pl, and nb.pl operate the same as in assignment 3. Collectively, these files implement the Naive Bayesian Classifier. In addition, there are three new files.

cull.pl was created to find "good" words to use in experiments with the classifier. The criteria for "good" are:
1) Reasonable agreement among taggers.
2) A reasonably balanced distribution.
3) A reasonable number of instances.

The usage is as follows: cat tag-file | cull.pl > tag.out. The output file contains the number of non-duplicated instances associated with each sense. The sense chosen for duplicate instances was the first one encountered. This information was used to make an informed, but still subjective, decision about how well each word adhered to points 2 and 3. The output file also contains the number of duplicate instances, the number of duplicates that disagree in sense, and the fraction of disagreeing duplicates to total duplicates. This information was used to determine "good" with respect to point 1.

prep.pl creates the sense files. It retrieves each non-duplicate sentence associated with the chosen word from the ids-to-sentences file and assigns it to a file named after its sense tag. The usage is: prep.pl file word, where file is a number. If file is 1, the tag file used is /home/cs/tpederse/CS8761/Open-Mind/cs8761-umd.full.detailed. If file is 2, the tag file used is /home/cs/tpederse/CS8761/Open-Mind/OMWE-tagging.
Word is the word for which sentences are collected. For example, ./prep.pl 1 art was the command used to create the data files for the first experiment in the results section.

The last new file is a bash shell script, tagscript.bash. This file creates the TEST and TRAIN files, then runs the rest of the programs under each of the window size/frequency cutoff conditions, and reports the accuracy of each test.

Hypothesis

As discussed by Professor Pedersen in lecture, the Naive Bayesian Classifier is resistant to noise, and as a result should show an increase in accuracy as the window size increases, and a decrease in accuracy as the frequency cutoff increases.

Results

UMD-tagged word results

art
distribution: 35 31 25 19
discrepancy: 0

window size   frequency cutoff   Accuracy
     0               1            .2727
     0               2            .2727
     0               5            .2727
     2               1            .4242
     2               2            .4545
     2               5            .4545
    10               1            .4242
    10               2            .4242
    10               5            .3939
    25               1            .6061
    25               2            .5454
    25               5            .4848

neighborhood
distribution: 56 59 2
discrepancy: .3699

window size   frequency cutoff   Accuracy
     0               1            .5278
     0               2            .5278
     0               5            .5278
     2               1            .5833
     2               2            .6111
     2               5            .5000
    10               1            .6389
    10               2            .5556
    10               5            .5833
    25               1            .5000
    25               2            .5278
    25               5            .5556

circumstance
distribution: 10 62 62
discrepancy: .5105

window size   frequency cutoff   Accuracy
     0               1            .4634
     0               2            .4634
     0               5            .4634
     2               1            .6585
     2               2            .6341
     2               5            .5854
    10               1            .6585
    10               2            .6341
    10               5            .6341
    25               1            .5366
    25               2            .4878
    25               5            .5366

Full tag-file word results

circuit
distribution: 50 12 62 62 92 116
discrepancy: .4543

window size   frequency cutoff   Accuracy
     0               1            .2941
     0               2            .2941
     0               5            .2941
     2               1            .4286
     2               2            .4538
     2               5            .4034
    10               1            .4118
    10               2            .3782
    10               5            .3866
    25               1            .3950
    25               2            .4454
    25               5            .3445

audience
distribution: 68 66 1
discrepancy: .3588

window size   frequency cutoff   Accuracy
     0               1            .4634
     0               2            .4634
     0               5            .4634
     2               1            .5854
     2               2            .5854
     2               5            .6584
    10               1            .5366
    10               2            .5854
    10               5            .6098
    25               1            .5854
    25               2            .5610
    25               5            .6098

discussion
distribution: 50 54
discrepancy: .5455

window size   frequency cutoff   Accuracy
     0               1            .3438
     0               2            .3438
     0               5            .3438
     2               1            .4688
     2               2            .5938
     2               5            .5000
    10               1            .5312
    10               2            .5312
    10               5            .5000
    25               1            .5000
    25               2            .6250
    25               5            .6250

Conclusions

As can be seen in the previous section, the results show deviations from the expected. Specifically, the accuracy sometimes decreases with increasing window size, and sometimes the accuracy increases with increasing frequency cutoff.

Possible reasons:

1) Sparse data exaggerates the smoothing. When the number of instances is small, smoothing tends to give a disproportionately large probability to unseen events.

2) Because of the small number of instances, TEST and TRAIN are not always representative of the total distribution. In the previous assignment, with 4000 instances, it was fairly certain that if the division of instances was random, TEST and TRAIN would have distributions close to the original. However, with the Open Mind data, most sense words have between 100 and 150 non-duplicate instances. If these instances are divided into TEST and TRAIN files, with 70% going to TRAIN and 30% going to TEST, then TEST could have as few as 30 instances. If there are four different, equally distributed senses for a word, that puts the number of instances per sense at 7 or 8. This is a small enough number for random variations to change the distribution significantly.

3) The small number of instances also causes small changes in the number correct to have a relatively large effect on accuracy. For 30 TEST sentences, a change of one in the number correct changes the accuracy by 3 1/3%.
4) The large amount of disagreement in sense tags may have contributed to some of the unexpected results. Discrepancy percentages were between 35 and 55 percent, which means that for nearly every other duplicate sentence there was a disagreement between taggers. Art, which was done by a single tagger, still shows some unexpected numbers, but is generally in keeping with the hypothesized results.

There does not seem to be a discernible trend in the results. The two words that did the best, audience and discussion, both contained two nearly equally distributed senses (audience has a third sense with only one instance), but neighborhood also had a similar distribution, and it did not do as well. The number of instances and the level of discrepancy did not seem to matter much, but the range of both these values was limited. The results of these experiments are inconclusive beyond the observation that in general the results were much lower than in experiment 3. Rerunning these experiments with a larger number of instances would help to narrow the source of the problem to either the first three points above, or to point 4, which is independent of the number of instances.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Prashant Jain
CS8761

Assignment #4: A continuation of finding meaning in Li(f|n)e

Introduction
------------
The objective of this assignment was to fix the code written in assignment 3, which was to "explore supervised approaches to word sense disambiguation". We had been given the line data, which contains six files. There were a number of instances per file, but only one instance per line. What we had to do was implement a Naive Bayesian Classifier to perform word sense disambiguation: given an instance of 'line' in a file, find the best sense to assign to it. After fixing that part, we had to use the Open Mind data given to us, find 3 (6) interesting words in it, and run our classifier on those words.

Procedure
---------
We had to create four files that implement the Naive Bayesian Classifier. These were:

select.pl
---------
This file divides the given data into TEST and TRAIN data after sense-tagging it. It is provided with a percentage argument as well as the target.config file, which contains the regular expressions used to extract lines and instance ids.

feat.pl
-------
This file uses the TRAIN data along with the window size and frequency cutoff (provided by the user at the command line) to get the feature words (words of interest), which are put in the FEAT file.

convert.pl
----------
This file converts both the TRAIN and TEST data into feature vector representations using the features provided by feat.pl. This is basically a binary representation of instances/senses and features.

nb.pl
-----
This is the file in which we implement the Naive Bayesian classifier. We use the files obtained after converting the TRAIN data to assign sense values to the TEST data, and check how accurately the correct senses were assigned. (A small sketch of the underlying decision rule follows.)
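As an illustration of the decision rule inside such a classifier, here is a minimal Perl sketch. The priors, conditional probabilities, features, and the flat 0.01 fallback are hypothetical stand-ins; the real nb.pl estimates its probabilities from the converted TRAIN data and applies proper smoothing:

# nb_sketch.pl - minimal illustration of the Naive Bayesian decision
# rule (toy probabilities, not the submitted nb.pl).
use strict;
use warnings;

# P(sense) and P(feature|sense), here invented for the example.
my %prior = (cord => 0.5, text => 0.5);
my %cond  = (
    cord => { cut => 0.6, fishing => 0.3, page => 0.1 },
    text => { cut => 0.1, fishing => 0.1, page => 0.8 },
);

# Score each sense as P(sense) * product of P(feature|sense) over the
# features observed in the test instance, and pick the argmax.
sub classify {
    my @features = @_;
    my ($best, $best_score) = (undef, 0);
    for my $sense (keys %prior) {
        my $score = $prior{$sense};
        $score *= ($cond{$sense}{$_} // 0.01) for @features;  # 0.01 stands in for smoothing
        ($best, $best_score) = ($sense, $score) if $score > $best_score;
    }
    return $best;
}

print classify(qw(cut fishing)), "\n";   # prints "cord" (0.09 vs 0.005)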
Observations of Experiments
---------------------------
The following table shows the results we got from running our experiments over the various combinations that had been given to us.

--------------------------------------------
| window size | frequency cutoff | accuracy |
--------------------------------------------
|      0      |        1         |  0.5442  |
|      0      |        2         |  0.5442  |
|      0      |        5         |  0.5442  |
--------------------------------------------
|      2      |        1         |  0.7485  |
|      2      |        2         |  0.7683  |
|      2      |        5         |  0.7347  |
--------------------------------------------
|     10      |        1         |  0.7790  |
|     10      |        2         |  0.8125  |
|     10      |        5         |  0.7930  |
--------------------------------------------
|     25      |        1         |  0.7852  |
|     25      |        2         |  0.7992  |
|     25      |        5         |  0.7950  |
--------------------------------------------

We notice that, in general, as we increase the window size, the accuracy of our Naive Bayesian classifier increases. Intuitively this should be expected: the more feature words we incorporate, the greater the chance that they occur in the test data, and the more samples there are, the higher the probability of assigning the correct sense.

We also notice that as we increase the frequency cutoff, there is a definite decrease in accuracy. Again, this should be expected intuitively. As we keep increasing the frequency cutoff, if (as in our case) stop words have not been eliminated, they have a higher chance of being the only features that survive, since stop words like 'a', 'and', 'the' appear far more frequently than, say, 'instrument', which would be a helpful hint that the sense of line is 'phone'. More interesting but somewhat less frequent words get cut off.

The problem noticed in the previous assignment, and subsequently fixed, was that select.pl was picking up all the data from the files. This was fixed, and after minor changes to both nb.pl and convert.pl the tests were run again, giving the above results. These results fall within the acceptable range, and hence we conclude that the classifier is working properly.

The next step was to find interesting words in the Open Mind data and use our classifier on them. I checked for interesting words manually, basing my acceptability criteria on the three things mentioned in the assignment:

1. Has a reasonable rate of agreement between two taggers.
2. Has a balanced distribution of senses.
3. Has a reasonable number of examples.

I checked that the number of examples was at least 100. I also checked that the tagged senses had a balanced distribution; if one sense dominated the distribution, that word was omitted. Finally, I looked at the rate of agreement between the taggers. Here there was a complicated problem, because the taggers did not always agree. There were also instances that had been tagged only once; in these cases the instance was taken as is, with the given sense. If an instance had been tagged more than once and there was a disagreement, the sense with the maximum number of tags was chosen. If the senses had been tagged an equal number of times, one of them was picked at random and assigned to that instance.

I wrote a program, rawtofinal.pl, to do all of this. It takes as a command line argument the specific word we want the data for, processes the data given to us by Open Mind, and converts it into the line data format. (A small sketch of the tag-resolution step appears below, before the usage summary.)
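Here is a minimal Perl sketch of the tag-resolution scheme just described; the instance ids and data layout are hypothetical, and this is not the submitted rawtofinal.pl:

# tag_resolution_sketch.pl - minimal illustration of the disagreement
# handling described above (toy data, not rawtofinal.pl).
use strict;
use warnings;

# instance id => list of sense tags assigned by the taggers
my %tags = (
    'w1:001' => ['sense1'],                       # single tag: taken as is
    'w1:002' => ['sense1', 'sense2', 'sense1'],   # majority: sense1 wins
    'w1:003' => ['sense1', 'sense2'],             # tie: effectively random pick
);

for my $id (sort keys %tags) {
    my %votes;
    $votes{$_}++ for @{ $tags{$id} };
    # Sort senses by vote count; tied senses end up in an arbitrary
    # relative order, since Perl hash key order is unspecified.
    my @ranked = sort { $votes{$b} <=> $votes{$a} } keys %votes;
    print "$id -> $ranked[0]\n";
}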
Usage: rawtofinal.pl [word]

After doing this, we get the data for that word separated into different files according to sense. The name of each file is the name of the sense contained in that file (like the line data). We can simply run our Naive Bayesian Classifier on this data and get our values.

The words chosen from the given set of Open Mind data were:

1. Aspect
2. Attempt
3. Author
4. Demand
5. Edge
6. Phase

The test runs made on this data gave the following results:

ASPECT
------
Senses considered: aspect10701 aspect10702 aspect10900 aspect10901 aspect12400

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.3627
   0         2        0.3627
   0         5        0.3627
   2         1        0.2950
   2         2        0.3442
   2         5        0.3442
  10         1        0.3114
  10         2        0.3114
  10         5        0.3770
  25         1        0.3770
  25         2        0.3971
  25         5        0.3971
----------------------------------------

ATTEMPT
-------
Senses considered: attempt10400 attempt10402

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.7529
   0         2        0.7529
   0         5        0.7529
   2         1        0.7741
   2         2        0.8064
   2         5        0.8387
  10         1        0.8709
  10         2        0.8709
  10         5        0.8709
  25         1        0.8805
  25         2        0.8387
  25         5        0.8387
----------------------------------------

AUTHOR
------
Senses considered: author11800 author11801

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.5806
   0         2        0.5806
   0         5        0.5806
   2         1        0.7096
   2         2        0.6501
   2         5        0.6129
  10         1        0.6250
  10         2        0.6250
  10         5        0.6250
  25         1        0.6501
  25         2        0.6501
  25         5        0.6501
----------------------------------------

DEMAND
------
Senses considered: demand10400 demand10900 demand11000 demand12200 demand12600

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.5909
   0         2        0.5909
   0         5        0.5909
   2         1        0.5227
   2         2        0.5227
   2         5        0.5227
  10         1        0.5909
  10         2        0.5454
  10         5        0.5000
  25         1        0.6202
  25         2        0.6202
  25         5        0.6202
----------------------------------------

EDGE
----
Senses considered: edge10600 edge10601 edge10700 edge10701 edge11500 edge12500

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.25
   0         2        0.25
   0         5        0.25
   2         1        0.325
   2         2        0.3
   2         5        0.25
  10         1        0.3
  10         2        0.3
  10         5        0.25
  25         1        0.35
  25         2        0.35
  25         5        0.35
----------------------------------------

PHASE
-----
Senses considered: phase10700 phase12600 phase12800 phase12801

Results from nb.pl:

Window   Frequency   Accuracy
----------------------------------------
   0         1        0.6060
   0         2        0.6060
   0         5        0.6060
   2         1        0.4545
   2         2        0.3939
   2         5        0.4848
  10         1        0.6060
  10         2        0.5454
  10         5        0.5151
  25         1        0.6060
  25         2        0.6060
  25         5        0.6363
----------------------------------------

We can observe from this data that the accuracy we get is mostly in the very low ranges. This can be for a number of reasons. Some of the reasons I can think of are as follows:

1. The number of instances in the TEST and TRAIN data is very small. In the line data we have over 4000 instances, but here each word has at most around 200 instances. The difference is evident.

2. There are instances that have been tagged only once, or on which the users disagree, and we have to account for all of them. The quality of the tagging is therefore not as good as it could be, which also brings down the accuracy level.

Conclusion
----------
I would like to conclude by saying that the Naive Bayesian classifier implemented by me gives pretty decent results for the line data but gives variable results for the Open Mind data.

References:
-----------
Manning, C.D. & Schutze, Hinrich. 2000. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts.
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Rashmi Kankaria
Assignment no: 3 and 4
Due Date: 11th Nov 2002

Objective: To explore supervised approaches to word sense disambiguation, and to continue with the same for Open Mind data.

Introduction: Supervised disambiguation uses a set of training data that has already been generated for an ambiguous word by sense-tagging its occurrences; this labeled data is then used to disambiguate the word in the next instance where it occurs. This experiment is an attempt to implement a Naive Bayesian Classifier.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        product (with probability 0.5409)
     0             2        product (with probability 0.5409)
     0             5        product (with probability 0.5409)
----------------------------------------------------------------------
     2             1        0.6498
     2             2        0.6366
     2             5        0.6307
----------------------------------------------------------------------
    10             1        0.7598
    10             2        0.7566
    10             5        0.7133
----------------------------------------------------------------------
    25             1        0.8015
    25             2        0.8155
    25             5        0.7931
----------------------------------------------------------------------

In the case where the window size is 0, the sense with maximum probability, here 'product', is assigned as the default sense. This is empirically shown to be correct.

Q> What effect do you observe in overall accuracy as the window size and frequency cutoffs change?

For a given 70-30 split of the training and test data, we can observe that the accuracy keeps increasing as we increase the window size. That is reasonable: as we increase the window size, and so the context, more features are present to disambiguate the word sense, so there is a greater chance of observing a feature frequently with a given sense, and hence a higher probability of guessing the sense of the word correctly.

As we look at the table, we can draw certain conclusions about the pattern of accuracy for a given window size and frequency cutoff. For a constant frequency cutoff, the accuracy increases considerably with window size; e.g., at cutoff 1 the accuracy increases from 0.6498 to 0.8015 as we increase the window size from 2 to 25. Thus it is tempting to conclude that accuracy will always increase with window size.

For a constant window size, however, the accuracy does not change with frequency cutoff as uniformly as we might have expected across all window sizes, and I think that can be argued. With a high frequency cutoff, for a given window size, only the most frequent feature words are taken into consideration, and these are most of the time stop words. The stop words do not give any significant information about the context that is peculiar to the target word and that would help disambiguate it. On the other hand, if the window size is too large, some extraneous information (noise) can get added as features, which may not be so helpful.

The most significant combinations here are window size = 10 with frequency cutoff = 1, and window size = 25 with frequency cutoff = 5.
If you observe these, the accuracy of the first combination is higher than that of the second, and this can help us decide the optimal frequency cutoff and window size. The major flaw of the Naive Bayesian Classifier is that it considers all the features to be independent; this also affects the calculation of the feature probabilities.

Q> Are there any combinations of window size and frequency that appear to be optimal with respect to the others? Why?

As argued above, there are a few combinations of window size and frequency cutoff that appear optimal with respect to the others. As we can observe, for a given window size, the accuracy decreases as the cutoff increases. We also observe that there is no significant change in accuracy as we change the window size from 10 to 25, so the optimal window size for this case can be 10, since simply increasing the window size does not change the accuracy significantly, for 2 reasons:

1. The context might still be smaller than the window size; in this case, increasing the window size does not make any difference.
2. The proximate context matters most for disambiguating the sense, so having a larger window might not help much.

As far as the frequency cutoff is concerned, the most optimal value will be within 2-5: the highest-frequency features are most of the time stop words, which are of no help in disambiguation, while a cutoff as low as 1 keeps all the feature words, most of which are not relevant or occur very infrequently with the word we are disambiguating. Over a large training set, a word within the given range will show strong association with the word we need to disambiguate. This gives us the optimal values of window size and frequency cutoff.

Assignment 4:

For this assignment, as per the requirements for good words, I chose the following words and found the accuracy for different combinations of frequency cutoff and window size.

The file used to convert the data from Open Mind format to line data format is open_to_line.pl.

How to run: perl open_to_line.pl aspect.n

where aspect.n is the file used to convert the aspect data into line data format.

The words I selected are:
1. Unit
2. Aspect
3. Behavior
4. Circumstance
5. Bar
6. Detail

1. Unit: There are seven senses associated with unit.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        unit%1:14:00: (with probability 0.4095)
     0             2        unit%1:14:00: (with probability 0.4095)
     0             5        unit%1:14:00: (with probability 0.4095)
----------------------------------------------------------------------
     2             1        0.3778
     2             2        0.3556
     2             5        0.4667
----------------------------------------------------------------------
    10             1        0.4222
    10             2        0.4222
    10             5        0.3111
----------------------------------------------------------------------
    25             1        0.4444
    25             2        0.3778
    25             5        0.3111
----------------------------------------------------------------------

2. Aspect: There are five senses associated with this word.
Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        aspect%1:09:00: (with probability 0.4183)
     0             2        aspect%1:09:00: (with probability 0.4183)
     0             5        aspect%1:09:00: (with probability 0.4183)
----------------------------------------------------------------------
     2             1        0.4048
     2             2        0.2857
     2             5        0.2619
----------------------------------------------------------------------
    10             1        0.3095
    10             2        0.3810
    10             5        0.3810
----------------------------------------------------------------------
    25             1        0.4048
    25             2        0.4058
    25             5        0.2857
----------------------------------------------------------------------

3. Behavior: There are 4 senses.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        behavior%1:04:00 (with probability 0.5212)
     0             2        behavior%1:04:00 (with probability 0.5212)
     0             5        behavior%1:04:00 (with probability 0.5212)
----------------------------------------------------------------------
     2             1        0.5366
     2             2        0.4390
     2             5        0.4390
----------------------------------------------------------------------
    10             1        0.4146
    10             2        0.3902
    10             5        0.2683
----------------------------------------------------------------------
    25             1        0.5610
    25             2        0.3415
    25             5        0.4146
----------------------------------------------------------------------

4. Circumstance: It has 4 senses.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        circumstance%1:26:01 (with probability 0.6382)
     0             2        circumstance%1:26:01 (with probability 0.6382)
     0             5        circumstance%1:26:01 (with probability 0.6382)
----------------------------------------------------------------------
     2             1        0.5610
     2             2        0.4878
     2             5        0.5366
----------------------------------------------------------------------
    10             1        0.5366
    10             2        0.4390
    10             5        0.4634
----------------------------------------------------------------------
    25             1        0.6098
    25             2        0.5854
    25             5        0.5610
----------------------------------------------------------------------

5. Bar: It has 5 senses.

Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        bar%1:06:04 (with probability 0.7678)
     0             2        bar%1:06:04 (with probability 0.7678)
     0             5        bar%1:06:04 (with probability 0.7678)
----------------------------------------------------------------------
     2             1        0.5656
     2             2        0.5341
     2             5        0.5656
----------------------------------------------------------------------
    10             1        0.6138
    10             2        0.5945
    10             5        0.6012
----------------------------------------------------------------------
    25             1        0.6821
    25             2        0.6378
    25             5        0.6076
----------------------------------------------------------------------

6. Detail: It has 5 senses.
Window size   Freq_cutoff   accuracy / most common sense
----------------------------------------------------------------------
     0             1        detail%1:10:00 (with probability 0.5102)
     0             2        detail%1:10:00 (with probability 0.5102)
     0             5        detail%1:10:00 (with probability 0.5102)
----------------------------------------------------------------------
     2             1        0.4221
     2             2        0.4178
     2             5        0.4095
----------------------------------------------------------------------
    10             1        0.4574
    10             2        0.4421
    10             5        0.3980
----------------------------------------------------------------------
    25             1        0.5276
    25             2        0.5132
    25             5        0.4345
----------------------------------------------------------------------

All the words I chose had more than three senses associated with them; the distribution was thus over a wide range of senses, giving more scope for an even distribution. That said, "most" of the words still had one sense much stronger than the others: generally speaking, one sense occurred with probability more than 0.5, while the rest of the senses together made up the remainder of the total probability.

The decision to select a particular word was an interesting part of the assignment, while the conversion of Open Mind data into line data was slightly complicated because of the incompatibility between the two formats. All the tagged words were analysed properly for their number of instances, the number of senses associated with them, the distribution of sense probabilities, and the tagging quality. These factors gave various alternatives for choosing the words.

Since the domain for selecting words was the set tagged by the UMN group, there were many interesting words, like act, cell, etc. However, the tagging quality was found to be not very good, as there was a lot of disagreement amongst the taggers; there was also an option of an "unclear/unlisted" meaning, which limited the choice further. The above words were chosen particularly because all of them have many shades of sense and many meanings in different contexts; they were tough to guess when tagged by the taggers.

As you can see, the accuracy for all the chosen words is on average 40-55%, which hints at several hitches:

a. There was little data for a given sense of a given word.
b. There were many instances where taggers disagreed, so rather than asking more taggers to tag until the right sense could be decided, such instances were assigned randomly. The probability of this happening was quite high, because most instances were tagged only twice or thrice.
c. Many instances had an unclear/unlisted sense and needed to be ignored, which reduced the data further.

In general, the Naive Bayesian Classifier works consistently well for any window size and frequency cutoff; however, more data would have refined the accuracy.

References:
1. Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schutze, pp. 235-239.
2. Programming Perl (3rd edition) by Larry Wall, Tom Christiansen and Jon Orwant.
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

NAME: SUMALATHA KUTHADI
CLASS: NATURAL LANGUAGE PROCESSING
DATE: 11/11/02
CS8761: ASSIGNMENT 4

-> OBJECTIVE: TO EXPLORE SUPERVISED APPROACHES TO WORD SENSE DISAMBIGUATION AND TO APPLY A NAIVE BAYESIAN CLASSIFIER TO GOOD WORDS OBTAINED FROM OPEN MIND TAGGING DATA.

-> INTRODUCTION:
-> WORD SENSE DISAMBIGUATION: ASSIGN A MEANING TO A WORD IN CONTEXT FROM SOME SET OF PREDEFINED MEANINGS (OFTEN TAKEN FROM A DICTIONARY).
-> SENSE TAGGING: ASSIGNING MEANINGS TO WORDS.
-> FROM A SENSE TAGGED TEXT WE CAN GET THE CONTEXT IN WHICH A PARTICULAR MEANING OF A WORD IS FOUND.
-> CONTEXT FOR HUMAN: TEXT + BRAIN
-> CONTEXT FOR MACHINE: TEXT + DICTIONARY/DATABASE

-> MAIN PARTS OF ASSIGNMENT:
-> TO RANDOMLY SELECT A% OF THE INPUT TEXT AND PLACE IT IN THE TRAIN FILE. THE REMAINING TEXT IS PLACED IN THE TEST FILE.
-> TO SELECT FEATURES FROM THE INPUT FILE (TRAIN FILE) WHICH SATISFY A FREQUENCY CUTOFF.
-> TO CREATE A FEATURE VECTOR FOR EACH OF THE INSTANCES PRESENT IN BOTH THE TRAIN AND TEST FILES.
-> TO LEARN A NAIVE BAYESIAN CLASSIFIER FROM THE OUTPUT OF THE THIRD PART OF THE ASSIGNMENT AND TO USE THAT CLASSIFIER TO ASSIGN SENSE TAGS TO THE TEST FILE.

-> WHEN CREATING SENSE TAGGED TEXT YOU ARE BUILDING UP A COLLECTION OF CONTEXTS IN WHICH MEANINGS OF A WORD OCCUR. THESE CAN BE USED AS TRAINING EXAMPLES.
-> THE BASIC PRINCIPLE INVOLVED IN WORD SENSE DISAMBIGUATION IS TO SELECT THE SENSE THAT MAXIMISES THE PROBABILITY OF THAT SENSE OCCURRING IN THE GIVEN CONTEXT (MOST LIKELY SENSE).
-> WHILE USING THE "NAIVE BAYESIAN CLASSIFIER" WE ASSUME THAT THE FEATURES ARE CONDITIONALLY INDEPENDENT; THEY DEPEND ONLY ON THE SENSE.
-> NAIVE BAYESIAN CLASSIFIER: S = ARGMAX OVER SENSES OF P(SENSE) * PRODUCT(i = 1 TO N) OF P(C_i | SENSE), WHERE S IS THE CHOSEN SENSE AND C_1..C_N ARE THE CONTEXT FEATURES.

-> REPORT:
-> WE RUN THE PROGRAMS WITH 12 COMBINATIONS OF WINDOW SIZE AND FREQUENCY CUTOFF USING A 70-30 TRAINING-TEST DATA RATIO.

WINDOW SIZE   FREQUENCY CUTOFF   ACCURACY
     0               1            0.5311
     2               1            0.6991
    10               1            0.8354
    25               1            0.8409
     0               2            0.5311
     2               2            0.6970
    10               2            0.8547
    25               2            0.8962
     0               5            0.5311
     2               5            0.6639
    10               5            0.8260
    25               5            0.8755

-> OBSERVATIONS:
-> WHEN THE WINDOW SIZE IS ZERO, NO MATTER WHAT THE FREQUENCY CUTOFF IS, THE ACCURACY IS ALMOST EQUAL.
-> WHEN THE FREQUENCY IS KEPT CONSTANT AND THE WINDOW SIZE IS INCREASED, THE ACCURACY INCREASES.
-> WHEN BOTH THE WINDOW SIZE AND THE FREQUENCY ARE INCREASED, THE ACCURACY INCREASES.
-> THERE IS A RELATION BETWEEN FREQUENCY, WINDOW SIZE AND OVERALL ACCURACY, BECAUSE THE MEANING OF A WORD CAN BE GUESSED FROM ITS SURROUNDING WORDS.

-> GOOD WORDS:
-> GOOD WORDS IN THE OPEN MIND DATA ARE COLLECTED BY CONSIDERING THE DISTRIBUTION OF SENSES, THE AGREEMENT AMONG THE TAGGERS AND THE NUMBER OF EXAMPLES.
-> SELECTING GOOD WORDS INVOLVED 3 MODULES.

1. tag.pl: THIS MODULE FINDS THE DISTRIBUTION OF SENSES, THE AGREEMENT BETWEEN TAGGERS AND THE NUMBER OF EXAMPLES FOR ALL THE WORDS.
COMMAND LINE: perl tag.pl
ACCORDING TO THE OUTPUT OF tag.pl, I GOT "author, memory, art" AS GOOD WORDS.
***author*****
Distribution of author%1:18:00:: : 72
Distribution of author%1:18:01:: : 136
avgagreement = 0.904761904761905
examples: 105

***memory*****
Distribution of memory%1:09:00:: : 59
Distribution of memory%1:09:01:: : 81
Distribution of memory%1:09:02:: : 104
Distribution of memory%1:09:03:: : 25
Distribution of memory%1:06:00:: : 16
avgagreement = 0.674496644295302
examples: 149

***art*****
Distribution of art%1:04:00:: : 35
Distribution of art%1:06:00:: : 31
Distribution of art%1:09:00:: : 25
Distribution of art%1:10:00:: : 19
avgagreement = 1
examples: 110

THESE THREE WORDS HAVE A BETTER DISTRIBUTION THAN THE OTHER WORDS, A GOOD RATE OF AGREEMENT AMONG THE TAGGERS AND A GOOD NUMBER OF EXAMPLES.

2. text.pl: THIS MODULE PLACES THE INSTANCES OF A WORD IN THE APPROPRIATE SENSE FILE OF THE WORD.
COMMAND LINE: perl text.pl ids-to-sentences cs8761-umd.full.detailed goodword

3. THE NAIVE BAYESIAN CLASSIFIER IS USED WITH THESE SENSE FILES TO FIND THE APPROPRIATE SENSE FOR A WORD IN A GIVEN CONTEXT.

WINDOWSIZE   FREQUENCY   ACCURACY
author:
     0           1        0.8888
    10           1        0.7777
    25           2        0.7777
    25           5        1.0000
art:
     0           1        0.3333
    10           1        0.3030
    25           2        0.3030
    25           5        0.3636
memory:
     0           1        0.3478
    10           1        0.3160
    25           2        0.3160
    25           5        0.2666

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

# ********************************************************************************
# experiments.txt             Report for Assignment #4 Open Mind Tagging
# Name: Yanhua Li
# Class: CS 8761
# Assignment #4: Nov. 11, 2002
# ********************************************************************************

This assignment is to apply a Naive Bayesian Classifier to perform word sense disambiguation for Open Mind data. I carried out all experiments with 3 words -- "arm", "circumstance", "manner".

First we need to use sentence1.pl, sentence2.pl, and sentence3.pl to convert the instances in the word and sense files into files grouped by sense. The commands are:

sentence1.pl sense word1
sentence1.pl sense word2
sentence1.pl sense word3

After executing the commands, 7 files were created for "arm": arm1, arm2, arm3, arm4, arm5, unclear, and unlisted. For "circumstance", 6 files were created: circumstance1, circumstance2, circumstance3, circumstance4, unclear, and unlisted. For "manner", 5 files were created: manner1, manner2, manner3, unclear, and unlisted. We use these file names as the actual senses of the instances they contain. We also need to change a little bit of code in nb.pl to supply an array of senses; a change to match the actual sense in nb.pl is also needed. (A small sketch of this kind of per-sense split follows.)
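For illustration, here is a minimal Perl sketch of this kind of per-sense split. The "id sense sentence" input layout is an assumption made for the sketch, not the actual format consumed by sentence1.pl:

# sense_split_sketch.pl - minimal illustration of dividing instances into
# one file per sense (hypothetical input layout, not sentence1.pl).
use strict;
use warnings;

my %handles;
# Input on STDIN: "instance_id sense sentence..." per line.
while (my $line = <STDIN>) {
    chomp $line;
    my ($id, $sense, $sentence) = split /\s+/, $line, 3;
    next unless defined $sentence;
    # Open (and cache) one output file per sense, e.g. "arm1", "unclear".
    unless ($handles{$sense}) {
        open my $fh, '>', $sense or die "cannot open $sense: $!";
        $handles{$sense} = $fh;
    }
    print { $handles{$sense} } "$id: $sentence\n";
}
close $_ for values %handles;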
Resulting Table for "arm"
******************************************************************
window size | frequency cutoff | accuracy
     0             1             0.717948717948718
     0             2             0.717948717948718
     0             5             0.717948717948718
     2             1             0.717948717948718
     2             2             0.717948717948718
     2             5             0.717948717948718
    10             1             0.717948717948718
    10             2             0.717948717948718
    10             5             0.717948717948718
    25             1             0.717948717948718
    25             2             0.717948717948718
    25             5             0.717948717948718

Resulting Table for "circumstance"
******************************************************************
window size | frequency cutoff | accuracy
     0             1             0.525
     0             2             0.525
     0             5             0.525
     2             1             0.525
     2             2             0.525
     2             5             0.525
    10             1             0.525
    10             2             0.525
    10             5             0.525
    25             1             0.525
    25             2             0.525
    25             5             0.525

Resulting Table for "manner"
******************************************************************
window size | frequency cutoff | accuracy
     0             1             0.583333333333333
     0             2             0.583333333333333
     0             5             0.583333333333333
     2             1             0.583333333333333
     2             2             0.583333333333333
     2             5             0.583333333333333
    10             1             0.583333333333333
    10             2             0.583333333333333
    10             5             0.583333333333333
    25             1             0.583333333333333
    25             2             0.583333333333333
    25             5             0.583333333333333

Because of a bug in my classifier, I always get the most common sense as the assigned sense. So the accuracy is always the baseline value, no matter what the window size and frequency cutoff are.

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Aniruddha Mahajan
experiments.txt
Assignment 4
CS8761, Fall 2002

The objective of this assignment was to run Open Mind tagged data through the programs written for assignment 3. The first step was to convert the Open Mind data into the format of the line data, as the previous code was written specifically for the line data. I wrote a program, "pre.pl", to perform this task. pre.pl analyzes the Open Mind data and generates as many files for a tagged word as there are corresponding distinct senses. Once this was done, we had to run the data through the previous programs. The Open Mind data did not produce results as good as the line data. The table below gives the accuracy obtained through nb.pl for each combination of window size (W) and frequency cutoff (F).
 W  | F | Captain | Dust   | Share  | Shape  | Length | Energy
----|---|---------|--------|--------|--------|--------|--------
 0  | 1 | 0.3118  | 0.5556 | 0.5373 | 0.2817 | 0.2759 | 0.6585
 0  | 2 | 0.3118  | 0.5556 | 0.5373 | 0.2817 | 0.2759 | 0.6585
 0  | 5 | 0.3118  | 0.5556 | 0.5373 | 0.2817 | 0.2759 | 0.6585
 2  | 1 | 0.5439  | 0.5556 | 0.6866 | 0.3944 | 0.2931 | 0.6098
 2  | 2 | 0.4839  | 0.5556 | 0.7015 | 0.3944 | 0.3621 | 0.6341
 2  | 5 | 0.4731  | 0.4447 | 0.7164 | 0.4085 | 0.3103 | 0.6585
 10 | 1 | 0.5914  | 0.6222 | 0.6865 | 0.4507 | 0.2586 | 0.5610
 10 | 2 | 0.5806  | 0.5556 | 0.6120 | 0.4225 | 0.3276 | 0.5366
 10 | 5 | 0.5377  | 0.5333 | 0.6716 | 0.4225 | 0.3276 | 0.5610
 25 | 1 | 0.5914  | 0.6222 | 0.6269 | 0.3944 | 0.3276 | 0.6098
 25 | 2 | 0.6022  | 0.4889 | 0.6220 | 0.3803 | 0.3276 | 0.4878
 25 | 5 | 0.5591  | 0.4889 | 0.6567 | 0.3944 | 0.3276 | 0.5366

(*) Selection of 'good' words
------------------------------
I selected these words particularly because there is a moderate level of agreement among the taggings done by different persons. Also, the senses have an evenly spread distribution of weight. Finally, all of these words have more than 100 examples.

We can observe in particular that the words with a wider distribution among the senses (captain, shape, length) show behavior similar to the line data. Recall that for the line data the accuracy values increase as the window size increases, but decrease within a window as the frequency cutoff increases.

(*) pre.pl
-----------
I wrote pre.pl to format Open Mind data into line form. pre.pl accepts 2 files as input: the sentences from Open Mind, and the instance ids with the corresponding senses in another file. pre.pl produces output in the form of n files, where n is the number of real senses of the tagged word. Open Mind data also has 'unclear' or 'none of the above' as options; these instances are put in "word." while the other files may be "word.0703", "word.0700", "word.0701", etc.
perl pre.pl length1 length2

where --
length1 is the file containing the tagged Open Mind sentences
length2 is the file containing the instance ids & corresponding senses

+++++++++++++++++++++++++++++++++++++++++++++++++

Typed and Submitted by: Ashutosh Nagle
Assignment 4: A Continuing Quest for Meanings in Li(f|n)e.
Course: CS8761
Course Name: NLP
Date: Nov 11, 2002.

+-----------------------------------------------------------+
|                       Introduction                        |
+-----------------------------------------------------------+

This assignment has the following three parts:

1) Data Creation
~~~~~~~~~~~~~~~~
Here I tagged 564 words provided by the Open Mind Word Expert project.

2) Naive Bayesian Classifier Improvement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The necessary changes have been made to the classifier. Tabulated below are the observations of the experiments that I performed on the "line" data.

+===============================================+
|       W       |       F       |     Accu      |
|===============|===============|===============|
|       0       |       1       |    0.5297     |
|---------------|---------------|---------------|
|       0       |       2       |    0.5297     |
|---------------|---------------|---------------|
|       0       |       5       |    0.5297     |
|===============================================|
|       2       |       1       |    0.7452     |
|---------------|---------------|---------------|
|       2       |       2       |    0.7347     |
|---------------|---------------|---------------|
|       2       |       5       |    0.7146     |
|===============================================|
|      10       |       1       |    0.8103     |
|---------------|---------------|---------------|
|      10       |       2       |    0.8039     |
|---------------|---------------|---------------|
|      10       |       5       |    0.7733     |
|===============================================|
|      25       |       1       |    0.8135     |
|---------------|---------------|---------------|
|      25       |       2       |    0.8111     |
|---------------|---------------|---------------|
|      25       |       5       |    0.7918     |
+===============================================+

For the window size of 0 there are no features, no matter what the frequency cutoff is, so we get the same accuracy for all three frequency cutoffs. The accuracy is also fairly low because the classifier has hardly any data to learn from, so it always assigns the most frequent sense to every occurrence of the ambiguous word.

For the window size of 2, the accuracy jumps into the early 70's (70%), as expected. As the frequency cutoff increases, the number of available features decreases, so the classifier learns from less and less data, and the accuracy falls a little as the cutoff is increased. The same phenomenon is observed for window sizes 10 and 25.

When the window size becomes 10, the accuracy reaches as high as 80%, and it remains (almost) in the 80's even when the window size is raised to 25. This makes sense because as we increase the window size we start considering the global context along with the local context; with this increase we never discard the previous data, so the classifier gets additional data to learn from.

The output files of the above experiments are available under /home/cs/nagl0033/NLP/assignment4/. There are 12 subdirectories there, whose names are of the form "ij", meaning that subdirectory "ij" contains the files of the experiments performed with window size 'i' and frequency cutoff 'j'.
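For reference, a driver like the following could produce those 12 subdirectories (a sketch only; it assumes the feat.pl/convert.pl/nb.pl command-line conventions shown later in this collection, which may differ from the conventions actually used here):

    #!/usr/bin/perl -w
    # sketch: run all 12 window/cutoff experiments, one "ij" directory each
    use strict;

    foreach my $w (0, 2, 10, 25) {
        foreach my $f (1, 2, 5) {
            my $dir = "$w$f";                   # e.g. "101" for W=10, F=1
            mkdir($dir, 0755) unless -d $dir;
            system("perl feat.pl TRAIN $w $f > $dir/FEAT") == 0
                or die "feat.pl failed";
            system("cat $dir/FEAT | perl convert.pl $w TRAIN > $dir/TRAIN.FV") == 0
                or die "convert.pl failed on TRAIN";
            system("cat $dir/FEAT | perl convert.pl $w TEST > $dir/TEST.FV") == 0
                or die "convert.pl failed on TEST";
            system("perl nb.pl $dir/TRAIN.FV $dir/TEST.FV > $dir/TAGGING") == 0
                or die "nb.pl failed";
        }
    }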
3) Applying The NB Classifier to the Open-Mind Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Choosing the "Good Words" :-->
My program "getGoodWords" selects good words for me. It applies the following criteria while doing the selection:

* The word should occur at least 75 times. (With 2 taggers, 150 times.)
* The agreement between the 2 taggers should be at least 40%.
* The word should not have fewer than 2 senses.
* Finally, the word should have evenly distributed tags.

I arrived at the above values by trial and error. They look reasonable and also yield some 10 words from the big pool. From these 10, my program selects the top 6 based on the level of agreement between the taggers. The last criterion above is evaluated by taking the standard deviation of the per-sense counts: if the standard deviation is less than 30% of the mean (average) number of tags per sense, the word is accepted, else it is rejected (a short sketch of this test appears at the end of this section). My program also throws out all the words marked with "unlisted", "unclear" or "empty" tags.

Converting The Data Format :-->
The program "convert" is used to do this.

HOW TO RUN THE PROGRAMS

1. To Retrieve The Good Words :-->
Typing ./getGoodWords executes the program. No parameter is required. It expects the data to be available in "/home/cs2/tpederse/CS8761/Open-Mind/cs8761-umd.full.detailed". The program gives a small summary about the good words, shows the six it finally chooses, and the agreement percentage among the taggers for each. It also creates a temporary file "tempFile3" in the directory where it runs, storing the instance ids and the corresponding tags for all of the 6 selected words.

2. To Convert The Data :-->
This is done by the program "convert". Like the previous program, it does not take any input parameter. It reads the "tempFile3" created by the "getGoodWords" program, gets the instance ids of the selected instances, reads the corresponding sentences from the file "/home/cs2/tpederse/CS8761/Open-Mind/ids-to-sentences", and converts them to the line-data format. After it is executed, 6 subdirectories are created in the directory where it was run, each corresponding to one of the good words. Each of these directories contains as many files as the number of senses of the good word it represents.

3. Now the classifier is to be run as usual on this data. But here the names of the files are different, so I have made copies of the 4 files of the classifier (select.pl, feat.pl, convert.pl and nb.pl). Each set allows only the filenames of the corresponding good word.
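The balance test described above can be sketched as follows (an illustration of the stated criterion, not getGoodWords' actual code):

    # accept a word if the standard deviation of its per-sense counts
    # is under 30% of their mean; the word needs at least 2 senses
    sub is_balanced {
        my @counts = @_;                  # per-sense tag counts
        return 0 if @counts < 2;
        my $mean = 0;
        $mean += $_ foreach @counts;
        $mean /= @counts;
        my $var = 0;
        $var += ($_ - $mean) ** 2 foreach @counts;
        $var /= @counts;
        return sqrt($var) < 0.3 * $mean;
    }

For the sense distribution of "art" reported later in this collection (31, 19, 35, 25), for instance, the mean is 27.5 and the standard deviation about 6.1, which is under the 8.25 cutoff, so a test like this would accept the word.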
Tabulated below are the observations of my experiments.

+=====================================================================+
|       |       |                      Accuracy                       |
|   W   |   F   |-----------------------------------------------------|
|       |       |captain|completion|distribution| enemy | past | shape|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.4615 |  0.6176  |   0.4848   |0.3333 |0.5000|0.2963|
|   0   |   2   |0.4615 |  0.6176  |   0.4848   |0.3333 |0.5000|0.2963|
|       |   5   |0.4615 |  0.6176  |   0.4848   |0.3333 |0.5000|0.2963|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.4103 |  0.1765  |   0.5152   |0.1212 |0.3529|0.4815|
|   2   |   2   |0.3333 |  0.2059  |   0.5758   |0.1515 |0.4412|0.4074|
|       |   5   |0.2821 |  0.2353  |   0.4848   |0.3030 |0.4118|0.3333|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.3077 |  0.6765  |   0.2727   |0.3333 |0.2941|0.2963|
|  10   |   2   |0.2821 |  0.6471  |   0.1818   |0.3333 |0.3529|0.3704|
|       |   5   |0.2821 |  0.5000  |   0.3939   |0.3333 |0.2647|0.1852|
|=======|=======|=======|==========|============|=======|======|======|
|       |   1   |0.2821 |  0.7059  |   0.1818   |0.1515 |0.2353|0.1852|
|  25   |   2   |0.2821 |  0.5882  |   0.1818   |0.1818 |0.2353|0.1481|
|       |   5   |0.2821 |  0.5588  |   0.2121   |0.0909 |0.3235|0.1111|
+=====================================================================+

I see from the table that the classifier performs better for the word "completion" than for any of the other words. The strangest part is that, for most words, the accuracy decreases as the window size is increased. The word "completion" has much more data than any other word above; this, I think, is the reason the classifier performs better for "completion" than for any other word. For any word, at a given window size, the accuracy decreases as the frequency cutoff is increased. This, I feel, is reasonable, because with an increasing cutoff the number of features decreases, so the classifier learns from less and less data.

CONCLUSION: For the classifier to work well, the data fed to it for training must be large.

+++++++++++++++++++++++++++++++++++++++++++++++++

CS 8761 NATURAL LANGUAGE PROCESSING FALL 2002
ANUSHRI PARSEKAR
pars0110@d.umn.edu

ASSIGNMENT 3 : HOW TO FIND MEANINGS IN LIFE.

Introduction
------------
Many times we come across words which have several meanings (or senses), so there can be ambiguity about their meaning. Word sense disambiguation deals with assigning meanings or senses to ambiguous words by observing the context, i.e. the neighboring words.
In this assignment we attempt to disambiguate various meanings of the word "line" using the sense-tagged line data, in which six different senses of the word line have been identified and the data is divided into separate files for each sense.

Programs implemented
---------------------
select.pl : This program tags every instance of the target word (here "line") according to the file in which it appears and then randomly divides the data into TRAIN and TEST parts.

feat.pl : This module identifies the feature words associated with the word to be disambiguated. The range of feature words can be controlled by changing the window size as well as the required frequency of occurrence of the feature word. The feature words are obtained from the TRAIN file only.

convert.pl : The input to this program is the TRAIN or TEST file and the list of feature words. Each instance in the file is converted into a series of binary values that indicate whether or not each listed feature word has occurred within the specified window around the target word in the given instance.

nb.pl : This program learns a Naive Bayesian classifier from the feature vectors of TRAIN and uses that classifier to assign sense tags to the occurrences of the target word in the TEST file. The tag assigned is the sense having the maximum P(sense|context). The context words are assumed to be conditionally independent.

Expected results
----------------
1. When the window size is zero we do not have any feature words or context to guess the sense of the target word. The only guideline we have is the probability of each sense, and we assign the sense which has the maximum probability to all the instances of the test data. The accuracy thus obtained will not depend on the frequency cutoff.
2. As we increase the window size, the number of feature words will increase. We expect an increase in accuracy with an increase in window size, because a larger window means that we are considering a wider context. Feature words which were ignored for a small window size will also be used and might give a better accuracy.
3. Increasing the frequency cutoff will again reduce the number of feature words, which may reduce the accuracy.
4. If we run the program on just 50% of the training data, the accuracy should drop, as the classifier will have less data to learn from.
5. If the target word has fewer senses, a better accuracy can be expected, because the sense assigned to the word is chosen from a smaller set of alternatives.

Results
--------
-----------------------------------------------------
| window size | frequency cutoff |   accuracy for   |
|             |                  |  70% TRAIN data  |
-----------------------------------------------------
|      0      |         1        |      0.5568      |
|      0      |         2        |      0.5568      |
|      0      |         5        |      0.5568      |
|-------------|------------------|------------------|
|      2      |         1        |      0.7478      |
|      2      |         2        |      0.7301      |
|      2      |         5        |      0.7139      |
|-------------|------------------|------------------|
|     10      |         1        |      0.8268      |
|     10      |         2        |      0.8276      |
|     10      |         5        |      0.7792      |
|-------------|------------------|------------------|
|     25      |         1        |      0.8426      |
|     25      |         2        |      0.8485      |
|     25      |         5        |      0.8187      |
-----------------------------------------------------

Conclusions
-----------
1. The accuracy increases as we take more feature words. Thus, when we increase the window size or take a low frequency cutoff we are considering a more detailed and wider context, and hence the accuracy increases.
2. However, a very large window size does not help much, because the target word is usually unrelated to the distant words which appear as feature words when the window size is large.
3. The memory and time required for a high window size and low frequency cutoff are quite high compared to the increase in accuracy that they give. This should be considered while choosing an optimal combination of window size and frequency cutoff.
4. A window size of 10 and a frequency cutoff of 1 or 2 seem to give a good accuracy for the line data. However, we cannot make general statements on the basis of this, since the senses of "line" are distinct and have a separate set of feature words for each sense.

ASSIGNMENT 4 : A CONTINUING QUEST FOR MEANINGS IN LIFE

Introduction
-------------
This assignment deals with word sense disambiguation (using the naive Bayesian classifier) of words which were tagged in the Open Mind project. The Open Mind data was divided into two files: one contained the instance ids and the sentences, while the other had the instance ids and the senses assigned to the words by the users. A program was required to split the Open Mind data into separate files according to the sense assigned.

Program implemented
-------------------
A program was implemented to do the above-mentioned task. The input to the program is two files, word_ln (contains the ids and the sentences for a given word) and word_tag (contains the ids and the senses). The output is a number of files sorted according to the senses. E.g. for the word aspect:

perl id.pl aspect_ln aspect_tag

output files: aspect.0 aspect.1 aspect.2 aspect.3 aspect.4

These files can be used as the input to select.pl along with an aspect.config file.

Results
------------------------------------------------------------------------------
| window size | frequency cutoff |   ASPECT    |    PHASE    | DISTRIBUTION  |
|             |                  | senses = 5  | senses = 4  |  senses = 5   |
|----------------------------------------------------------------------------|
|      0      |         1        |  0.500000   |  0.7142857  |  0.5000000    |
|      0      |         2        |  0.500000   |  0.7142857  |  0.5000000    |
|      0      |         5        |  0.500000   |  0.7142857  |  0.5000000    |
|-------------|------------------|-------------|-------------|---------------|
|      2      |         1        |  0.531250   |  0.7142857  |  0.4736842    |
|      2      |         2        |  0.468750   |  0.7142857  |  0.4473684    |
|      2      |         5        |  0.500000   |  0.7142857  |  0.5000000    |
|-------------|------------------|-------------|-------------|---------------|
|     10      |         1        |  0.375000   |  0.6190476  |  0.3947368    |
|     10      |         2        |  0.468750   |  0.7142857  |  0.3157895    |
|     10      |         5        |  0.500000   |  0.7142857  |  0.4210526    |
|-------------|------------------|-------------|-------------|---------------|
|     25      |         1        |  0.250000   |  0.6190476  |  0.5789474    |
|     25      |         2        |  0.187500   |  0.7142857  |  0.4210526    |
|     25      |         5        |  0.375000   |  0.6666667  |  0.4736842    |
------------------------------------------------------------------------------

------------------------------------------------------------------------------
| window size | frequency cutoff |    SHARE    |    DUST     |   CAPTAIN     |
|             |                  | senses = 4  | senses = 3  |  senses = 7   |
|----------------------------------------------------------------------------|
|      0      |         1        |  0.4545455  |  0.1363636  |  0.3695652    |
|      0      |         2        |  0.4545455  |  0.1363636  |  0.3695652    |
|      0      |         5        |  0.4545455  |  0.1363636  |  0.3695652    |
|-------------|------------------|-------------|-------------|---------------|
|      2      |         1        |  0.6666667  |  0.5000000  |  0.4130435    |
|      2      |         2        |  0.4242424  |  0.5454545  |  0.4565217    |
|      2      |         5        |  0.4545455  |  0.1363636  |  0.2608696    |
|-------------|------------------|-------------|-------------|---------------|
|     10      |         1        |  0.6363636  |  0.3181818  |  0.4130435    |
|     10      |         2        |  0.5757576  |  0.3636364  |  0.3260870    |
|     10      |         5        |  0.4545455  |  0.3636364  |  0.3043478    |
|-------------|------------------|-------------|-------------|---------------|
|     25      |         1        |  0.5454545  |  0.2727273  |  0.3695652    |
|     25      |         2        |  0.5151515  |  0.4545455  |  0.3478261    |
|     25      |         5        |  0.4242424  |  0.3181818  |  0.2826087    |
------------------------------------------------------------------------------

Conclusions:
1. The accuracy for all the words chosen from the Open Mind data is much lower than for the line data. This may be because the senses of the chosen words were not clearly distinct but were interrelated, whereas for the line data the meanings were completely different, which made disambiguation easier. Also, the data from the Open Mind project was tagged by a number of users, so the correctness of the tagged data needs to be checked. The number of instances for line was much higher than for the Open Mind data, so more training data was available in the case of the line data.
2. A window size of 2 appears to give better accuracy in the case of the Open Mind data.
3. No clear trends in accuracy with respect to changes in window size and frequency cutoff are observed for the Open Mind data.
4. The naive Bayesian classifier gives poor results for the Open Mind data, since the context for each sense is not distinct and is sometimes difficult to disambiguate even manually.

+++++++++++++++++++++++++++++++++++++++++++++++++

REPORT
--------
-----------------------------------------------------------------------------
Prashant Rathi                                         Date - 11th Nov. 2002
CS8761 - Natural Language Processing
-----------------------------------------------------------------------------
Assignment no. 4
------------------
Objective :: To continue your exploration of supervised approaches to word sense disambiguation
-----------------------------------------------------------------------------
Observations on the open-mind data :
----------------------------------
I considered two files from the Open Mind data: cs8761-umd.full.detailed and ids-to-sentences. The first file contains all the information about the tagging done by different users, and the second file contains the actual sentences corresponding to the instance ids. I wrote a program (splitter.pl) to split the cs8761-umd.full.detailed file into various files, with a single file for each word. Similarly, I separated the ids-to-sentences file into smaller data files corresponding to every word (separator.pl); this reduces file access time, since ids-to-sentences is a very large file that takes time to access.

Now we had to select "good" words, depending upon the following criteria:
a. Has a reasonable rate of agreement among the two taggers.
b. Has a somewhat balanced distribution of senses.
c. Has a reasonable number of examples.

To find such words, I have written a program (getstats.pl). This program outputs all the information about each word; the output can be seen in the RESULTS file attached. (A sketch of this kind of per-word tallying follows.)
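A minimal sketch of such per-word tallying (an illustration only; the field positions below are assumptions, not the actual cs8761-umd.full.detailed layout):

    #!/usr/bin/perl -w
    # sketch: tally example and sense counts per word from tagging records
    use strict;

    my (%examples, %senses);
    while (<>) {
        # assumed layout: word, instance id, sense as the first three fields
        my ($word, $id, $sense) = (split)[0, 1, 2];
        next unless defined $sense;
        $examples{$word}++;
        $senses{$word}{$sense}++;
    }
    foreach my $word (sort keys %examples) {
        print "$word\t$examples{$word}";
        print "\t$_=$senses{$word}{$_}" foreach sort keys %{ $senses{$word} };
        print "\n";
    }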
The RESULTS file helps us to identify the "good" words by studying their different characteristics. By studying this file, I picked a few good words: share, captain, edge, dust, distribution, shape.

Steps to be followed to run for a word, say 'shape':
------------------------------------------------
1. Use 'getstats.pl tagshape > RESULTS' to get information about the word shape. This is optional. (The res file contains the script for this.)
2. Use 'separator.pl shape' to get the data from ids-to-sentences for all 'shape' instances.
3. Use 'splitter.pl' to create the file tagshape from cs8761-umd.full.detailed. We need this tagshape further.
4. Use 'createdata.pl tagshape shape' to get all the sense files corresponding to the shape instances.
5. Then run all the assignment 3 programs on these sense files (we may have to change the target.config file before running them):
   a. select.pl   b. feat.pl   c. convert.pl   d. nb.pl

The observations made for these words follow next.
-------------------------------------------------------------------------------
OBSERVATIONS:
------------
Experiments were carried out with window sizes of 0, 2, 10 and 25 and frequency cutoffs of 1, 2 and 5. For these 12 combinations of frequency and window size, with a 70-30 training-test data ratio, the following observations were made:

share
-----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.4667  |
     0      |         2         |  0.4667  |
     0      |         5         |  0.4667  |
     2      |         1         |  0.5556  |
     2      |         2         |  0.4889  |
     2      |         5         |  0.4667  |
    10      |         1         |  0.4667  |
    10      |         2         |  0.4667  |
    10      |         5         |  0.4222  |
    25      |         1         |  0.4444  |
    25      |         2         |  0.4667  |
    25      |         5         |  0.4444  |
--------------------------------------------

captain
-------
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.3148  |
     0      |         2         |  0.3148  |
     0      |         5         |  0.3148  |
     2      |         1         |  0.3889  |
     2      |         2         |  0.3148  |
     2      |         5         |  0.2222  |
    10      |         1         |  0.3889  |
    10      |         2         |  0.3519  |
    10      |         5         |  0.2593  |
    25      |         1         |  0.3704  |
    25      |         2         |  0.3519  |
    25      |         5         |  0.2407  |
--------------------------------------------

edge
----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.2500  |
     0      |         2         |  0.2500  |
     0      |         5         |  0.2500  |
     2      |         1         |  0.3750  |
     2      |         2         |  0.2500  |
     2      |         5         |  0.3000  |
    10      |         1         |  0.3000  |
    10      |         2         |  0.3000  |
    10      |         5         |  0.3250  |
    25      |         1         |  0.2750  |
    25      |         2         |  0.2500  |
    25      |         5         |  0.2000  |
--------------------------------------------

dust
----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.3333  |
     0      |         2         |  0.3333  |
     0      |         5         |  0.3333  |
     2      |         1         |  0.4444  |
     2      |         2         |  0.3333  |
     2      |         5         |  0.2222  |
    10      |         1         |  0.4167  |
    10      |         2         |  0.4167  |
    10      |         5         |  0.3333  |
    25      |         1         |  0.3333  |
    25      |         2         |  0.4167  |
    25      |         5         |  0.4167  |
--------------------------------------------

distribution
------------
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.4000  |
     0      |         2         |  0.4000  |
     0      |         5         |  0.4000  |
     2      |         1         |  0.3500  |
     2      |         2         |  0.3500  |
     2      |         5         |  0.3500  |
    10      |         1         |  0.4000  |
    10      |         2         |  0.4000  |
    10      |         5         |  0.4500  |
    25      |         1         |  0.4000  |
    25      |         2         |  0.4500  |
    25      |         5         |  0.3750  |
--------------------------------------------

shape
-----
---------------------------------------------
window size | frequency cut-off | accuracy |
---------------------------------------------
     0      |         1         |  0.3077  |
     0      |         2         |  0.3077  |
     0      |         5         |  0.3077  |
     2      |         1         |  0.3590  |
     2      |         2         |  0.3846  |
     2      |         5         |  0.3846  |
    10      |         1         |  0.4077  |
    10      |         2         |  0.3846  |
    10      |         5         |  0.3077  |
    25      |         1         |  0.3590  |
    25      |         2         |  0.4103  |
    25      |         5         |  0.3333  |
--------------------------------------------

- The results observed for the open-mind data show very low accuracy values.
- This may be due to the fact that the data contained limited instances.
- In the cases where there was disagreement between the users' tagging, I picked the senses randomly; this might result in an inappropriate distribution among the senses.
- Also, sometimes only one user has tagged a particular word (e.g. art), so there was total agreement, but such words do not qualify as good words.
- The quality of the tagging may not be satisfactory.
-------------------------------------------------------------------------------
CONCLUSION:
-----------
Considering the reasons mentioned above, I think this is why the results observed for the Open Mind data show low accuracy. The data is not as good as the line data, which had many more instances and tagging of much higher quality.
-------------------------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++

CS 8761: Assignment 4 - Naive Bayesian Classifier Repaired
Due: 11/11/2002 12pm
Sam Storie
11/4/2002

+------------------------------------------------------------------------+
|                                 Methods                                 |
+------------------------------------------------------------------------+

Introduction:

A Bayesian classifier (BC) is a technique used to classify a set of events into "bins" and then use certain information to classify new events into one of those bins. In the context of word sense disambiguation, we use the set of words surrounding a "target word" to determine the sense of that word. These surrounding words are considered "features" of the word and are treated as potential identifying items for the sense of the target word. This assignment deals with determining the sense of the word "line". To do this we examine a set of data where the sense of line has been predetermined and use this data to "train" our classifier. Then we use the information from the training data to try to determine the sense of some unseen testing data. The details of this process are described in subsequent sections.

Process:

This assignment is broken into four separate programs. The first program examines a set of pre-tagged instances for line. To ensure our classifier isn't developing a bias, this program separates the data into training and testing sets: it randomly selects a certain percentage of the instances to be placed into a file called TRAIN, and the remainder are placed into a file called TEST. The classifier is developed using only the training data, and isn't shown the testing data until the actual performance testing is done.

The second stage of the classifier is a program that determines which words occur within a certain window of the target word in the training instances. Recall from the introduction that we are considering these contextual words to be some sort of indicator of the sense for that instance.
Of course there are some overlaps for common words, but this is taken into account during this process. To help identify words that appear often within this window, we can set a frequency cutoff to eliminate uncommon words. The result of this stage is a list of words that occur within a certain window size and that also occur a certain number of times. This stage is performed only on the *training* data, since using the testing data would introduce a bias towards the features present in the testing data, which would obviously nullify any results we ultimately obtained from that data.

The third program uses the list of words from the previous stage to create feature vectors for all the instances of training and testing data. A feature vector is simply a list of 1's and 0's indicating whether a given word occurred within the window used to generate the features. If a given instance had a 1 for the word "IBM", this would mean that "IBM" appeared within the window of "line" for this instance. These feature vectors are generated for both the training and testing data.

The final stage is to actually create the classifier and attempt to assign senses to the testing data. To do this we examine the feature vectors for the training data and determine how often each feature occurred with each sense. From this data we can compute the probability associated with a sense based on which words occur within the window size specified in the earlier stages. Then, when we examine the feature vector for a test instance, we can determine which sense is most likely (based on the training instances used) given the features that are in the window. The probability of each sense is computed by multiplying the probabilities of each feature occurring, given the sense. We then simply select the sense with the largest resulting probability, apply it with the classifier, and check it against the actual sense. The final accuracy (number correct / total number of instances) is reported to the user.

Technically, using the probabilities of the features occurring with the target word is a very involved process, but we are implementing a *naive* BC. This means that we treat the probability of each feature occurring with the target word as conditionally independent. If we did not make this assumption, we would need to consider the probability of the features occurring not just separately, but in the sequence in which they actually appear. Zipf's law suggests that once you start considering sequences of more than 3-4 words, the probabilities become very hard to estimate, simply because those sequences do not occur very often and it is hard to assign how likely they should be. This is a non-trivial distinction, but it impressively still generates some useful results. As a final note, in the case of a feature not occurring within a training instance, which would give a probability of zero, we use Witten-Bell smoothing to give those cases a probability to use in the calculations.

+------------------------------------------------------------------------+
|                  Result from classifier improvements                    |
+------------------------------------------------------------------------+

I ran this set of four programs with a combination of window sizes and frequency cutoffs. Per the assignment, the initial set of instances was split into 70% training data, and 30% were set aside for testing. The initial data was split up only once, and the resulting TRAIN and TEST files were reused for each of these tests (a sketch of such a one-time split follows).
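A minimal sketch of a one-time random split like this (an illustration under assumed conventions, not the author's actual select.pl; the script name split.pl is hypothetical):

    #!/usr/bin/perl -w
    # sketch: randomly place ~70% of tagged instances in TRAIN, rest in TEST
    # usage: perl split.pl 70 < tagged-instances
    use strict;

    my $ratio = shift;                          # training percentage, e.g. 70
    open(TRAIN, ">TRAIN") or die "cannot write TRAIN: $!";
    open(TEST,  ">TEST")  or die "cannot write TEST: $!";
    while (<>) {                                # one sense-tagged instance per line
        print { rand(100) < $ratio ? \*TRAIN : \*TEST } $_;
    }
    close(TRAIN);
    close(TEST);

Because the split is done once and saved, every (window size, cutoff) combination is evaluated on exactly the same TRAIN/TEST partition, so the accuracies in the table below are directly comparable.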
The results are summarized as such:

Window size | Frequency cutoff | Accuracy
+------------------------------------------+
     0              1            .5350   <-- Baseline case (worst)
     0              2            .5350
     0              5            .5350
     0             10            .5350
     2              1            .7502
     2              2            .7213
     2              5            .6827
     2             10            .6217
    10              1            .8129
    10              2            .8008
    10              5            .7637
    10             10            .7454
    25              1            .8457   <-- Best result
    25              2            .8156
    25              5            .7802
    25             10            .7655
+------------------------------------------+

As you can see in the table, the best accuracy obtained was about 85%, and this was consistent across several trials. I found it rather interesting to see how the window size affected the accuracy across the trials. For a window size of 0, I would expect the accuracy to match the frequency of the most common sense: with no features to help in the classification, the only data that can be used is how often each sense occurred, so every test instance is tagged with the same sense. In the case of line, the "product" sense occurs about 54% of the time (2218 out of 4149 instances), and this is close to the results obtained for the window size of 0.

It's much more interesting when the window size is increased, even to a size of 2. Intuitively this might seem not to impact the classifier much, since the most common words within 2 of the target are probably words like "the", "at", "for", etc. However, it's clear that even though these words might occur often next to the target, something else is helping the classifier. I believe collocations, or common bigrams/trigrams, are providing some features that clearly help to classify the test instances. Traditionally these are called "local features", and they provide some much-needed evidence for the classifier. Once the window size is increased to 10 or 25, the classifier is able to include "topical features", those that do not occur immediately next to the target word. In the case of "line", these features included "IBM" for the product sense, or "telephone" for the cord sense.

Also, observe how the results actually decrease as the frequency cutoff is increased. It's easy to imagine that removing common words might help the classification task, but what's really happening is that the cutoff reduces the amount of data the classifier has to use. This helps show how these types of classifiers are fairly immune to "noise" in the data. It's also tempting to just increase the window size and include every word possible, but there is a trade-off here, because the computational time required for large amounts of data can grow quite large. As an example, it took an AMD Athlon 1700+ computer with 512M of RAM about 10 minutes to process the case of W=25 and F=1. This might not seem too bad, but it was substantially longer than the smaller window size cases, and didn't radically improve the results. This is implementation dependent, but the trade-off still exists.

On a final note, the meaning of an 85% success rate is difficult to gauge, because for a human this task is often trivial. However, given that the actual tests are based solely on the instance itself, I think it's rather impressive that such a simple technique can get a correct result 85% of the time. I imagine that some tweaking of the various parameters of the experiment, or even of the specific data being used, might produce slightly better results. Of course, these results depend on the amount of training data available (about 3600 instances in this case) and the breadth of examples that data covers.
Still, with these limitations in mind, the Naive Bayesian Classifier is a simple and clearly useful technique for performing word sense disambiguation.

+-------------------------------------------------------------------------+
|                    References and More Information                      |
+-------------------------------------------------------------------------+

(1) Original assignment on Dr. Pedersen's web page -
    http://www.d.umn.edu/~tpederse/Courses/CS8761/Assign/a3.html
(2) Revision assignment on Dr. Pedersen's web page -
    http://www.d.umn.edu/~tpederse/Courses/CS8761/Assign/a4.html

+++++++++++++++++++++++++++++++++++++++++++++++++

***********************************************************************************************
* Refer to README.TXT for listing of various files attached with assignment 4 and their usage *
***********************************************************************************************

                                     \\\|///
                                   \\  ~ ~  //
                                    (  @ @  )
**********************************-oOOo-(_)-oOOo-************************************
            CS 8761 - Natural Language Processing - Dr. Ted Pedersen
          Assignment 4 : A CONTINUING QUEST FOR MEANINGS IN LI(F|N)E
                        Due : Monday , November 11, Noon
                           Author : Anand Takale ( )
***************************************-OoooO-***************************************
                                    .oooO
                                    (   )   (   )
                                     \ (     ) /
                                      \_)   (_/

-------------
Objectives :
-------------
To continue your exploration of supervised approaches to word sense disambiguation.

--------------
Specification:
--------------
(1) Fix your Naive Bayesian Classifier from Assignment 3. If you got less than 10 on this assignment you have a problem that requires fixing. Once you have your classifier, rerun the line data as described for Assignment 3 and rewrite your report reflecting the new results.
(2) Identify three "good" words in the Open Mind data that we created as a class. (This is found in the file /home/cs/tpederse/CS8761/Open-Mind-cs8761-umd.full.details.)
(3) Once you have identified your words, convert the Open Mind data into the form of the line data so you can use your assignment 3 classifier. Run your classifier on that data and comment on your results.

------------------------------------------------------------
Part I -- Fixing Naive Bayesian Classifier from Assignment 3
------------------------------------------------------------
With the Naive Bayesian Classifier from assignment 3, I was getting accuracy in the 0.7 range, which was less than expected. I tried to correct the earlier classifier, i.e. I tried to see if there were any bugs or logical errors in nb2.pl, which was submitted as part of the tar file for Assignment 3. However, this became quite a tedious job, so I recoded the entire nb.pl and have placed it in the tar file of assignment 4 as nb2.pl. The earlier nb.pl is also included in the tar file, if required for reference. After recoding nb.pl as nb2.pl, I realized that there was a mistake in the smoothing part of nb.pl, as well as a minor correction needed in the denominator of the classifier, which did not affect the accuracy but affected the associated probability. These corrections have been made to the classifier (a sketch of one such smoothing scheme follows).
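For reference, one common way to give unseen features a small probability instead of zero is a Witten-Bell style estimate. The sketch below is an illustration of that scheme, not necessarily the exact fix made in nb2.pl:

    # sketch of a Witten-Bell style estimate for P(feature | sense)
    #   $count - times this feature occurred with this sense
    #   $n     - total feature occurrences seen with this sense
    #   $t     - number of distinct features seen with this sense
    #   $z     - number of features never seen with this sense
    sub wb_prob {
        my ($count, $n, $t, $z) = @_;
        return $count > 0
            ? $count / ($n + $t)              # discounted seen probability
            : $t / ($z * ($n + $t));          # leftover mass shared by unseen
    }

The seen probabilities are slightly discounted by adding $t to the denominator, and the t/(n+t) mass freed up this way is divided evenly among the z features never observed with the sense, so the estimates still sum to one.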
Here the same report for Assignment 3 is attached, along with the new values recorded.

**************************************************************************************

                                     \\\|///
                                   \\  ~ ~  //
                                    (  @ @  )
**********************************-oOOo-(_)-oOOo-************************************
            CS 8761 - Natural Language Processing - Dr. Ted Pedersen
                 Assignment 3 : HOW TO FIND MEANINGS IN LIFE
                        Due : Friday , November 1, Noon
                           Author : Anand Takale ( )
***************************************-OoooO-***************************************
                                    .oooO
                                    (   )   (   )
                                     \ (     ) /
                                      \_)   (_/

------------
Objectives :
------------
To explore supervised approaches to word sense disambiguation. To create sense-tagged text and see what can be learned from the same.

---------------
Specification :
---------------
Sense tag some text and implement a Naive Bayesian classifier to perform word sense disambiguation.

----------------------
Part I : Data Creation
----------------------
Tag 500 instances/sentences on the Open Mind Word Expert website.

login id : Anand
Project : CS8761-UMD
Instances Tagged : 515

There was no output to be turned in from the Open Mind sense tagging.

--------------------------------------------------
Part II : Naive Bayesian Classifier Implementation
--------------------------------------------------
This assignment mainly deals with word sense disambiguation. The problem to be solved is that many words have several meanings or senses, so for such words, taken out of context, there is ambiguity about how they are to be interpreted. Our task is to do the disambiguation, i.e. to determine which of its senses an ambiguous word invokes in a particular use. The assignment required us to implement a Naive Bayesian Classifier: in short, it learns a Naive Bayesian classifier from the TRAIN data and uses that classifier to assign sense tags to the TEST data. The idea of the Bayes classifier is that it looks at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does not do feature selection; instead it collects evidence from all features. We performed our experiments with the "line" data.

The assignment required the implementation of four modules:
(1) select.pl  -- program to construct TRAIN and TEST data
(2) feat.pl    -- program to extract features
(3) convert.pl -- program to build the feature vector representation
(4) nb.pl      -- program to learn a Naive Bayesian Classifier from TRAIN.FV and use that classifier to assign sense tags to TEST.FV

Our main task was to create a Naive Bayesian Classifier from the TRAIN data and then disambiguate the words in the TEST data. We had to perform disambiguation of the "line" data.
For this, the six files containing the different senses of line were combined, and lines were read randomly from any of the files and put into the TRAIN and TEST files. The TRAIN file consisted of 70% of the data, while the TEST file consisted of the remaining 30%. After creating the TRAIN and TEST files, the features which had a frequency greater than the cutoff frequency were selected. Observing the extracted features, it was noted that conjunctions, prepositions and determiners like 'a', 'an', 'and' and 'the' occurred in almost all cases. To be more specific about the task of disambiguation, we need to find more dependent collocations and try to eliminate all the 'stop' words (conjunctions, determiners, etc.) so that the accuracy of the classifier improves.

------------
Experiments
------------
Experiment with window sizes of 0, 2, 10 and 25. Use frequency cutoffs of 1, 2 and 5. Run your classifiers with all 12 possible combinations of window size and frequency cutoff using a 70-30 training-test data ratio. Report the accuracy values that you obtain for each combination in a table.

------------------------------------------------------------
window size | frequency cutoff | accuracy
------------------------------------------------------------
     0      |        1         |  0.5397
     0      |        2         |  0.5397
     0      |        5         |  0.5397
------------------------------------------------------------
     2      |        1         |  0.7197
     2      |        2         |  0.7213
     2      |        5         |  0.7012
-----------------------------------------------------------
    10      |        1         |  0.8143
    10      |        2         |  0.8231
    10      |        5         |  0.7907
-----------------------------------------------------------
    25      |        1         |  0.8446
    25      |        2         |  0.8437
    25      |        5         |  0.8283
------------------------------------------------------------

-----------------------------------------------------------------------------------
What effect do you observe in overall accuracy as the window size and the frequency cutoffs change?
----------------------------------------------------------------------------------
After observing the data recorded in the table above, we come to the following conclusions:

(1) When the window size is zero, i.e. there are no features available to train the Naive Bayesian Classifier, the sense that occurs the most in the TRAIN data is assigned as the calculated sense for all the instances of the TEST data. For the line data the accuracy observed for a window size of 0 was 0.5397, which is simply the probability of the most frequent sense. This is the accuracy obtained without any prior knowledge, i.e. without training the Naive Bayesian Classifier at all, and it is the lowest accuracy this classifier will produce: for the line data the minimum accuracy obtained will be 0.5397.

(2) For a window size of 2 the accuracy increases considerably, going up to 0.7197 for a cutoff frequency of 1. What we can conclude from this is that with a window size of 2 we can predict the sense of almost 72% of the test instances, so disambiguation becomes easier when we have some context rather than no context at all. One more observation can be made at this point: according to Zipf's law, most of the bigrams occur rarely while a few occur frequently, and we observe the same effect here. For the window of size 2, as the cutoff frequency increases the accuracy decreases, which is consistent with most of the bigrams occurring very few times.
So as the frequency cutoff goes on increasing, the chance of all the words occurring becomes less; this is what we observe from the table above.

(3) Observing the accuracy values for different window sizes, we see that the accuracy goes on increasing up to a certain point before falling off again; a decrease in accuracy with an increasing frequency cutoff is also observed. In both cases the decrease can be explained as follows. As we increase the window size from 0 to 25, the number of features goes on increasing, and the Naive Bayesian Classifier undergoes a lot of learning. But not all the features are important for training the classifier: the stop words, i.e. the conjunctions and determiners, are of no use. What we are looking for is a more solid feature which is closely connected to the tagged word; by finding such solid features, it becomes much easier to train the classifier. In short, the decrease in accuracy can be considered a consequence of the increase in noise words added to the feature vector, and these noise words are introduced as a result of an increase in window size or in cutoff frequency.

------------------------------------------------------------------------------------
Are there any combinations of window size and frequency that appear to be optimal with respect to the others? Why?
-----------------------------------------------------------------------------------
From the table we observe that the optimum value of accuracy is 0.8446. This value is observed when the window size is 25 and the cutoff frequency is 1. This optimum combination of window size and cutoff frequency is specific to this data.

(1) The value of accuracy draws a curve with respect to the window size: initially, as the window size increases, the accuracy also increases, until it reaches an optimum value from where it starts to fall off again. This is due to the fact that as the window size increases, the number of noise words also increases, which reduces the overall accuracy of the classifier, because the more connected tokens occur fewer times than the stop words (conjunctions, determiners, etc.).

(2) As far as the relation between the cutoff frequency and the accuracy goes, it is observed that as the cutoff frequency increases, the accuracy goes on decreasing. This is because as the cutoff frequency increases, the more dependent words (the words more specific to a sense or context), which occur very few times, are eliminated from being considered as features. This leaves only the stop words (conjunctions, prepositions, determiners, etc.) as features. These stop words cannot train the classifier the way the specific dependent tokens do: they occur in almost all instances equi-probably, so it becomes more difficult to disambiguate the sense, causing the accuracy to drop.

In short, a midrange window size and a lower cutoff frequency would most probably give the optimum accuracy.
We also conclude that a window size of 0 gives the least accuracy. If we go on increasing the window size beyond 25, the accuracy will increase up to a certain window size, after which it will again start falling slowly. However, whatever the window size and the cutoff frequency, the accuracy won't fall below the accuracy observed when the window size is 0.

-----------------
References :
-----------------
(1) Programming Perl - O'Reilly Publications
(2) Foundations of Statistical Natural Language Processing - Christopher D. Manning and Hinrich Schutze
(3) Problem Definition and some related concepts and issues - Pedersen T. ( www.d.umn.edu/~tpederse )

**************************************************************************************

---------------------------------
Part ( II ) -- Finding Good Words
---------------------------------
A good word is one that has the following properties:

(1) Has a reasonable rate of agreement among the two taggers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
After analyzing the Open Mind data from the file /home/cs/tpederse/CS8761/Open-Mind/cs8761-umd.full.details, the following things were observed:

(a) In all, 13204 instances were tagged. Of these, 7350 were unique instances.
(b) Among the 7350 unique instances:
	2089 instances were tagged by a single user
	4794 instances were tagged by two users
	 381 instances were tagged by three users
	  55 instances were tagged by four users
	  22 instances were tagged by five users
	   9 instances were tagged by six users

After observing the above figures, it was important to decide what the criterion for agreement between two users should be, since 2089 instances were tagged by only a single person. Should words that were tagged by just a single person be counted as being in agreement? To solve this problem, I did not count the words which were tagged by a single user as words in agreement. There were many words, like distance, chair, bar, authority, channel, objective, shoulder and art, that were tagged only by single users; these words were, however, not eliminated from being considered as good words. For the words which were tagged by two or more users, if two users agreed on some particular sense it was counted as agreement; i.e. for the instances which were tagged by 6 users, if two or more users agreed on some particular sense it was counted as agreement.

The agreement for the different words was found using a program called agreement.pl. While calculating the agreement, the program computes the agreement ratio over the instances where two or more users have tagged the same instance; it disregards all the instances that a single user has tagged (a sketch of this computation follows).
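A minimal sketch of this kind of agreement computation (an illustration only; the %tags structure, mapping instance id to a reference to the list of senses assigned by the taggers, is assumed to be filled in elsewhere):

    # over instances tagged by two or more users, count those where some
    # sense was assigned at least twice; single-user instances are skipped
    sub agreement_ratio {
        my ($tags) = @_;                      # hash ref: id -> [senses]
        my ($multi, $agree) = (0, 0);
        foreach my $id (keys %$tags) {
            my @senses = @{ $tags->{$id} };
            next if @senses < 2;              # single-user instances disregarded
            $multi++;
            my %freq;
            $freq{$_}++ foreach @senses;
            $agree++ if grep { $_ >= 2 } values %freq;
        }
        return $multi ? $agree / $multi : 0;
    }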
The case where a single user has tagged an instance could be considered either 100% agreement or 0% agreement; since the criterion for a GOOD word is agreement among two taggers, for instances tagged by a single user the agreement ratio simply cannot be decided. Because I have not counted the single-user instances as agreement, I get lower agreement ratios: as computed, they are around the 0.40 to 0.50 mark. However, if instances tagged by a single user are counted as agreement, the ratios rise considerably, to around the 0.80 mark.

After looking at the output of agreement.pl, we see that the following words can be classified as possible GOOD words (satisfying at least one of the three criteria): chapter, past, coffee, captain, team, edge, unit, song, tissue, sun, skill, college, blood, chance, structure, lip.

(2) Has a somewhat balanced distribution of senses. At the very least avoids the case where a single sense dominates the distribution.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The second criterion for finding a good word is that the sense distribution for the word should be fairly even, i.e. every sense of the word should contribute towards the total number of instances. We should not consider any word that has a distribution in which one sense dominates the other senses; such words were removed directly from consideration. Some of the words removed because of this property were:
(i)  lip   ---> 16, 1, 183, 5, 20
(ii) chair ---> 1, 16, 1

To find the sense distribution of each word, a perl program called sense_dist.pl was written and run on the Open Mind data. The program outputs each word and a count of how many times its different senses occur. After looking at the sense distributions of all the words, some of the words satisfying this criterion were: difficulty, distribution, memory, captain, shape, hope, argument, demand, edge, volume, aspect, art, behaviour, unit and length.

(3) Has a reasonable number of examples.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The third criterion for finding a GOOD word is that the word should have a reasonable number of instances, so the words having more instances should be considered as satisfying this criterion. A perl program, instances.pl, outputs each word and its count of instances. The words having a reasonable number of examples are as follows: pain, chapter, dust, attempt, future, past, coffee, atmosphere, arc, radiation, difficulty, song, enemy, tissue, depth, memory, captain, shape, team, onset, author, factor, argument, newspaper, phase, demand, edge, college, behaviour, blood, chance, structure.
-----------------------------------------------------------------------------------------------------
****************************************** GOOD WORDS ***********************************************
-----------------------------------------------------------------------------------------------------
After considering all three criteria for finding the good words, the following words were found to be amongst the better words in the cs8761-umd project. Here the agreement ratio also counts the instances tagged by single users as agreement.

-----------------------------------------------------------------------------------------------------------
word           number of instances   sense distribution                          agreement
-----------------------------------------------------------------------------------------------------------
difficulty           271             48, 80, 80, 63                               0.7214
distribution         291             45, 35, 2, 127, 74, 8                        0.7783
memory               287             16, 25, 81, 2, 104, 59                       0.6394
captain              366             55, 27, 40, 21, 57, 40, 1, 11, 114           0.8196
shape                241             77, 6, 24, 42, 18, 24, 20, 30                0.6784
hope                 227             21, 53, 1, 4, 6, 91, 51                      0.8366
argument             269             3, 111, 63, 7, 1, 84                         0.7342
demand               263             8, 57, 37, 100, 1, 60                        0.6102
edge                 188             19, 38, 30, 34, 4, 54, 9                     0.6273
volume               331             42, 12, 9, 41, 43, 136, 42, 6                0.7284
aspect               255             46, 1, 10, 121, 65, 12                       0.5922
behaviour            270             35, 1, 63, 1, 60, 110                        0.6962
length               195             34, 36, 47, 22, 53, 3                        0.6462
art                  110             31, 19, 35, 25                               1.0000
-----------------------------------------------------------------------------------------------------------

The word "art" has an agreement of 1.0000 because only one user tagged all of its instances.

After going through the outputs of all three programs (instances.pl, agreement.pl and sense_dist.pl), I came up with the following six GOOD words: (1) art (2) shape (3) captain (4) edge (5) length (6) distribution. These six words satisfied all three criteria: they were above the threshold values. The thresholds can be set as per the user's choice; in this case the conditions I considered were a number of instances above 100, a good sense distribution (i.e. words having more than 4 fairly evenly distributed senses), and a good agreement ratio (here taken to be about 0.6 or more). The word art was tagged by only one user, but it has a good sense distribution, so it is considered a GOOD word.

-----------------------------------------------------------
Running The Naive Bayesian Classifier on the Open-Mind data
-----------------------------------------------------------
After the Open Mind data was converted into the line format, the Naive Bayesian Classifier was executed on it.

Complete procedure for finding the accuracy after a word has been identified as a GOOD word:
------------------------------------------------------------------------------------
(1) Say the word "art" has been identified as a GOOD word.
(2) Run pickup.pl as follows:

    pickup.pl art

This command separates out all the instances of art in the Open Mind data, using the ids-to-sentences file. The format of pickup.pl's output is:

    instance_ID sense actual_instance

(3) Care should be taken that if older files for the same senses exist, they are deleted first, because I open the files in append mode. Execute seperate.pl as follows:

    seperate.pl art

It separates all the instances of "art" into different files according to the senses of the instances. Each file name is created from the sense id of the sense whose instances are stored in that file.
The Naive Bayesian Classifier was run on all six words for every combination of window sizes 0, 2, 10 and 25 and frequency cutoffs 1, 2 and 5. The following tables show the observations.

Word : edge
-----------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.22
      0       |          2         |   0.22
      0       |          5         |   0.22
------------------------------------------------------------
      2       |          1         |   0.29
      2       |          2         |   0.28
      2       |          5         |   0.25
------------------------------------------------------------
     10       |          1         |   0.42
     10       |          2         |   0.36
     10       |          5         |   0.29
------------------------------------------------------------
     25       |          1         |   0.52
     25       |          2         |   0.49
     25       |          5         |   0.49
------------------------------------------------------------

Word : length
-------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.25
      0       |          2         |   0.25
      0       |          5         |   0.25
------------------------------------------------------------
      2       |          1         |   0.28
      2       |          2         |   0.28
      2       |          5         |   0.27
------------------------------------------------------------
     10       |          1         |   0.36
     10       |          2         |   0.38
     10       |          5         |   0.29
------------------------------------------------------------
     25       |          1         |   0.45
     25       |          2         |   0.44
     25       |          5         |   0.39
------------------------------------------------------------

Word : distribution
-------------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.52
      0       |          2         |   0.52
      0       |          5         |   0.52
------------------------------------------------------------
      2       |          1         |   0.57
      2       |          2         |   0.59
      2       |          5         |   0.56
------------------------------------------------------------
     10       |          1         |   0.59
     10       |          2         |   0.59
     10       |          5         |   0.60
------------------------------------------------------------
     25       |          1         |   0.56
     25       |          2         |   0.63
     25       |          5         |   0.58
------------------------------------------------------------

Word : art
----------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.36
      0       |          2         |   0.36
      0       |          5         |   0.36
------------------------------------------------------------
      2       |          1         |   0.39
      2       |          2         |   0.39
      2       |          5         |   0.36
------------------------------------------------------------
     10       |          1         |   0.48
     10       |          2         |   0.49
     10       |          5         |   0.45
------------------------------------------------------------
     25       |          1         |   0.53
     25       |          2         |   0.29  ----> looks like a wrong value
     25       |          5         |   0.50
------------------------------------------------------------

Word : shape
------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.34
      0       |          2         |   0.34
      0       |          5         |   0.34
------------------------------------------------------------
      2       |          1         |   0.41
      2       |          2         |   0.39
      2       |          5         |   0.38
------------------------------------------------------------
     10       |          1         |   0.48
     10       |          2         |   0.48
     10       |          5         |   0.44
------------------------------------------------------------
     25       |          1         |   0.57
     25       |          2         |   0.53
     25       |          5         |   0.46
------------------------------------------------------------

Word : captain
--------------
------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.31
      0       |          2         |   0.31
      0       |          5         |   0.31
------------------------------------------------------------
      2       |          1         |   0.43
      2       |          2         |   0.43
      2       |          5         |   0.41
------------------------------------------------------------
     10       |          1         |   0.41
     10       |          2         |   0.40
     10       |          5         |   0.44
------------------------------------------------------------
     25       |          1         |   0.55
     25       |          2         |   0.55
     25       |          5         |   0.54
------------------------------------------------------------
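At window size 0 the accuracy is identical for every frequency cutoff, which is consistent with the classifier backing off to the most common sense when no contextual features are available. That floor can be sanity-checked against the most-frequent-sense proportion implied by the sense distributions in the GOOD WORDS table; the numbers differ somewhat because the reported accuracies are measured on the 30% TEST split rather than on all instances. A minimal sketch (baseline.pl is a hypothetical helper, not one of the submitted scripts):

        #!/usr/bin/perl
        # baseline.pl - hypothetical helper: most-frequent-sense proportion
        # computed from the GOOD WORDS sense distributions above.
        use strict;
        use warnings;
        use List::Util qw(max sum);

        my %dist = (
            edge         => [19, 38, 30, 34, 4, 54, 9],
            distribution => [45, 35, 2, 127, 74, 8],
        );

        for my $word (sort keys %dist) {
            printf "%-13s most-frequent-sense proportion = %.2f\n",
                $word, max(@{ $dist{$word} }) / sum(@{ $dist{$word} });
        }

This prints 0.44 for distribution and 0.29 for edge, in the same range as the observed window-0 accuracies (0.52 and 0.22 respectively).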
From the above observations I came to the following conclusions:

(1) The conclusions drawn from the observation and analysis of the line data appear to hold for the Open-Mind data too: as the window size increases the accuracy increases, and as the frequency cutoff increases the accuracy tends to decrease.

(2) However, as observed above, the accuracy is much lower than for the line data. Possible reasons are wrong tagging by the Open-Mind users, instances tagged with an unclear sense or with no sense at all, and the fact that the number of instances per word, over all its senses, is very small compared to the line data, which was one large body of data for a single word.

(3) All the GOOD words behaved much as expected: accuracy increases as the window size increases, decreases as the frequency cutoff increases, and never falls below a certain floor, namely the accuracy at window size 0.

(4) Thus the classifier behaves as expected, and the Naive Bayesian Classifier can be considered one of the good supervised approaches to word sense disambiguation.

(5) The words selected were only from the cs8761 project. If the entire Open-Mind data had been considered, we could certainly have found better GOOD words, better agreement ratios and still better accuracies.

--------------
References :
--------------
1. For the Naive Bayesian Classifier, refer to the report for Assignment 3.
2. Problem definition and related concepts: Pedersen, T. (www.d.umn.edu/~tpederse)
3. Open Mind data (www.teach-computers.org)

+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

#********************************************************************
# Name: Nan Zhang
# Date: 11/10/02
# Class: CS8761
#********************************************************************

------------------------------------------------------------
 window size  |  frequency cutoff  |  accuracy
------------------------------------------------------------
      0       |          1         |   0.5466
      0       |          2         |   0.5466
      0       |          5         |   0.5466
      2       |          1         |   0.7363
      2       |          2         |   0.7178
      2       |          5         |   0.6905
     10       |          1         |   0.8207
     10       |          2         |   0.8127
     10       |          5         |   0.7830
     25       |          1         |   0.8344
     25       |          2         |   0.8280
     25       |          5         |   0.7910
------------------------------------------------------------

Basically, the accuracies follow a consistent pattern: as window size increases, accuracy increases, and as frequency cutoff increases, accuracy decreases. So with window size 25 and frequency cutoff 1 I get the maximal accuracy, 0.8344.
However, it is not markedly larger than the other accuracies at window size 25.